# :robot: `robots.txt` as a service :robot: [![License](https://img.shields.io/badge/License-GPLv3%202.0-brightgreen.svg?style=for-the-badge)](https://www.gnu.org/licenses/gpl-3.0) ![Twitter Follow](https://img.shields.io/twitter/follow/robotstxt_?color=blue&label=twitter&logo=twitter&style=for-the-badge)

>:construction: Project in development

Distributed *robots.txt* parser and rule checker through API access. If you are working on a distributed web crawler and you want to be **polite** in your actions, then you will find this project very useful. This project can also be integrated into any *SEO* tool to check whether content is being indexed correctly by robots.

>For this first version, we are trying to comply with the specification used by Google to analyze websites. You can see it [here](https://developers.google.com/search/reference/robots_txt?hl=en). Expect support for other robot specifications soon!

## Why this project?

If you are building a distributed web crawler, you know that managing *robots.txt* rules from websites is a hard task, and maintaining them in a scalable way can be complicated. You need to focus on your business requirements. `robots.txt` can help by acting as a service that checks whether a given URL resource can be crawled with a specified user agent (or robot name). It can easily be integrated into existing software through a web API, and it starts working in less than a second!

## Requirements

In order to build this project on your machine, you will need the following installed on your system:

* `Java 11` and [Kotlin](https://kotlinlang.org/docs/tutorials/command-line.html)
* [Docker](https://docs.docker.com/install/)
* [docker-compose](https://docs.docker.com/compose/install/)
* `make`

## Getting started

If you want to test this project locally, you will need [Docker](https://www.docker.com/), [docker-compose](https://docs.docker.com/compose/install/) and `make` installed on your system. When that is done, execute the following command to compile all projects, build the Docker images and run them:

>:point_right: Be patient!

```bash
$ make start-all
```

>You can execute `make logs` to see how things have gone

Now you can send some URLs to the crawler system to download the rules found in the *robots.txt* file and persist them in the database. For example, you can invoke the crawl API using this command:

```bash
$ curl -X POST http://localhost:9081/v1/send \
  -d 'url=https://news.ycombinator.com/newcomments' \
  -H 'Content-Type: application/x-www-form-urlencoded'
```

>Also, there is another method in the API to make a crawl request using a `GET` method. If you want to check
>all the methods this application exposes, import this [Postman collection](postman/robots.txt.postman_collection.json).

This command will send the URL to the streaming service, and when it is received, the `robots.txt` file will be downloaded, parsed and saved into the database.
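The curl call above can of course be replaced by any HTTP client. As a minimal sketch of what that might look like from JVM code (the project itself targets Java 11 and Kotlin), the snippet below posts a URL to the crawl endpoint shown above. Only the endpoint and the `url` form parameter come from this README; the function and variable names are illustrative, not part of the repository.

```kotlin
import java.net.URI
import java.net.URLEncoder
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse
import java.nio.charset.StandardCharsets

// Ask the service to fetch and store the robots.txt rules for the host of `targetUrl`.
// The endpoint and the `url` form parameter are the ones documented above.
fun requestRobotsCrawl(targetUrl: String): Int {
    val client = HttpClient.newHttpClient()
    val form = "url=" + URLEncoder.encode(targetUrl, StandardCharsets.UTF_8)
    val request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:9081/v1/send"))
        .header("Content-Type", "application/x-www-form-urlencoded")
        .POST(HttpRequest.BodyPublishers.ofString(form))
        .build()
    // Return the HTTP status code so the caller can decide whether to retry.
    return client.send(request, HttpResponse.BodyHandlers.ofString()).statusCode()
}

fun main() {
    // Same example URL as the curl command above.
    println(requestRobotsCrawl("https://news.ycombinator.com/newcomments"))
}
```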
The next step is to check whether you can access a resource of a known host using a `user-agent` directive. For this purpose, you will need to use the checker API. Imagine that you need to check if your crawler can access the `newest` resource from [hacker news](https://news.ycombinator.com/newest). You will execute:

```bash
$ curl -X POST http://localhost:9080/v1/allowed \
  -d '{"url": "https://news.ycombinator.com/newest","agent": "AwesomeBot"}' \
  -H 'Content-Type: application/json'
```

The response will be:

```json
{
  "url": "https://news.ycombinator.com/newest",
  "agent": "AwesomeBot",
  "allowed": true
}
```

This is like saying: *Hey! You can crawl content from `https://news.ycombinator.com/newest`.*

When you finish your test, execute the next command to stop and remove all Docker containers:

```bash
$ make stop-all
```

>:fire: Happy Hacking! :fire:
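As a closing sketch, this is roughly how the checker API could be wired into JVM crawler code so that each fetch is gated on the stored *robots.txt* rules. Only the endpoint, the request body and the `allowed` response field are taken from this README; the naive string check on the response and all names are illustrative, and a real integration would parse the JSON with a proper library.

```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Ask the checker API whether `agent` may fetch `url`, based on the rules
// already stored by the crawl step shown earlier.
fun isCrawlAllowed(url: String, agent: String): Boolean {
    val client = HttpClient.newHttpClient()
    val json = """{"url": "$url", "agent": "$agent"}"""
    val request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:9080/v1/allowed"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(json))
        .build()
    val body = client.send(request, HttpResponse.BodyHandlers.ofString()).body()
    // Naive check against the response format shown above; use a JSON parser in real code.
    return body.contains("\"allowed\":true")
}

fun main() {
    val target = "https://news.ycombinator.com/newest"
    if (isCrawlAllowed(target, "AwesomeBot")) {
        println("Polite to crawl $target")
    } else {
        println("robots.txt rules disallow $target")
    }
}
```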