# :robot: `robots.txt` as a service :robot:

>:construction: Project in development

Distributed *robots.txt* parser and rule checker with API access.

If you are working on a distributed web crawler and you want to be **polite** in your actions, then you will find this project very useful. This project can also be integrated into any *SEO* tool to check whether content is being indexed correctly by robots.

>For this first version, we are trying to comply with the specification Google uses to analyze websites. Expect support for other robot specifications soon!

## Why this project?

If you are building a distributed web crawler, you know that managing *robots.txt* rules from websites is a hard task, and it can be complicated to maintain in a scalable way. You need to focus on your business requirements. `robots.txt` can help by acting as a service that checks whether a given URL resource can be crawled using a specified user agent (or robot name). It can easily be integrated into existing software through a web API, and it starts working in less than a second!

## Requirements

To build this project on your machine, you will need the following installed on your system:

* `Java 11` and Kotlin
* Docker
* docker-compose
* `make`

## Getting started

If you want to test this project locally, you will need Docker, docker-compose and `make` installed on your system. When done, execute the following command to compile all projects, build the Docker images and run them:

>:point_right: Be patient!

```bash
$ make start-all
```

>You can execute `make logs` to see how things have gone.

Now you can send some URLs to the crawler system to download the rules found in the *robots.txt* file and persist them in the database.
For example, you can invoke the crawl API using this command:

```bash
$ curl -X POST http://localhost:9081/v1/send \
  -d 'url=' \
  -H 'Content-Type: application/x-www-form-urlencoded'
```

>There is also another method in the API to make a crawl request using `GET`. If you want to check all the methods this application exposes, import this [Postman collection](postman/robots.txt.postman_collection.json).

This command sends the URL to the streaming service; when it is received, the `robots.txt` file is downloaded, parsed and saved into the database.

The next step is to check whether you can access a resource of a known host using a `user-agent` directive. For this purpose, you will need to use the checker API. Imagine that you need to check whether your crawler can access the `newest` resource from Hacker News. You would execute:

```bash
$ curl -X POST http://localhost:9080/v1/allowed \
  -d '{"url": "", "agent": "AwesomeBot"}' \
  -H 'Content-Type: application/json'
```

The response will be:

```json
{
  "url": "",
  "agent": "AwesomeBot",
  "allowed": true
}
```

This is like saying: *Hey! You can crawl this content.*

When you finish your test, execute the next command to stop and remove all the Docker containers:

```bash
$ make stop-all
```

>:fire: Happy Hacking! :fire:
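To build an intuition for what the checker API decides internally, here is a minimal sketch of rule matching in the style of Google's robots.txt specification: among all `Allow`/`Disallow` rules whose path prefix matches the URL path, the longest match wins, and `Allow` wins ties. This is an illustrative simplification written for this README, not the project's actual code, and it omits the `*` and `$` wildcard forms the full specification supports; the rules used below are hypothetical.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Decide if `path` may be crawled.

    rules: list of (directive, path_prefix) pairs, where directive is
    "allow" or "disallow". Longest matching prefix wins; on a tie of
    equal length, "allow" wins. With no matching rule, crawling is allowed.
    """
    best_len = -1
    best_allowed = True  # no matching rule => allowed
    for directive, prefix in rules:
        if not path.startswith(prefix):
            continue
        if len(prefix) > best_len:
            best_len = len(prefix)
            best_allowed = (directive == "allow")
        elif len(prefix) == best_len and directive == "allow":
            best_allowed = True  # Allow wins ties of equal length
    return best_allowed

# Hypothetical rule set for illustration only.
rules = [("disallow", "/private"), ("allow", "/private/docs")]

print(is_allowed("/private/docs/readme.html", rules))  # True: longest match is Allow
print(is_allowed("/private/data", rules))              # False: only Disallow matches
print(is_allowed("/public/index.html", rules))         # True: no rule matches
```

The service applies this kind of decision per `user-agent` group after selecting the group that best matches the requesting robot's name, which is why the `allowed` API above takes both a `url` and an `agent`.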