crawling-framework

Category: Blog
Development tool: Java
File size: 837KB
Downloads: 0
Upload date: 2022-11-15 23:31:06
Uploader: sh-1993
Description: Easily crawl news portals or blog sites using Storm Crawler.

File list:
Dockerfile.base (193, 2021-05-24)
Dockerfile.crawler (628, 2021-05-24)
Dockerfile.es (627, 2021-05-24)
Dockerfile.ui (592, 2021-05-24)
LICENSE (555, 2021-05-24)
Makefile (728, 2021-05-24)
administration-ui (0, 2021-05-24)
administration-ui\conf (0, 2021-05-24)
administration-ui\conf\development.properties (386, 2021-05-24)
administration-ui\conf\docker-compose.properties (390, 2021-05-24)
administration-ui\pom.xml (7550, 2021-05-24)
administration-ui\src (0, 2021-05-24)
administration-ui\src\main (0, 2021-05-24)
administration-ui\src\main\java (0, 2021-05-24)
administration-ui\src\main\java\lt (0, 2021-05-24)
administration-ui\src\main\java\lt\tokenmill (0, 2021-05-24)
administration-ui\src\main\java\lt\tokenmill\crawling (0, 2021-05-24)
administration-ui\src\main\java\lt\tokenmill\crawling\adminui (0, 2021-05-24)
administration-ui\src\main\java\lt\tokenmill\crawling\adminui\Application.java (2660, 2021-05-24)
administration-ui\src\main\java\lt\tokenmill\crawling\adminui\CrawlerAdminUI.java (786, 2021-05-24)
administration-ui\src\main\java\lt\tokenmill\crawling\adminui\HttpSourceTestsCache.java (1394, 2021-05-24)
administration-ui\src\main\java\lt\tokenmill\crawling\adminui\utils (0, 2021-05-24)
administration-ui\src\main\java\lt\tokenmill\crawling\adminui\utils\CSVUtils.java (1303, 2021-05-24)
administration-ui\src\main\java\lt\tokenmill\crawling\adminui\utils\GridUtils.java (2226, 2021-05-24)
administration-ui\src\main\java\lt\tokenmill\crawling\adminui\utils\HttpSourceCSVUtils.java (4095, 2021-05-24)
administration-ui\src\main\java\lt\tokenmill\crawling\adminui\utils\HttpSourceTestCSVUtils.java (1776, 2021-05-24)
administration-ui\src\main\java\lt\tokenmill\crawling\adminui\view (0, 2021-05-24)
administration-ui\src\main\java\lt\tokenmill\crawling\adminui\view\BaseView.java (1567, 2021-05-24)
administration-ui\src\main\java\lt\tokenmill\crawling\adminui\view\HttpSourceForm.java (11957, 2021-05-24)
administration-ui\src\main\java\lt\tokenmill\crawling\adminui\view\HttpSourceStatsWindow.java (2606, 2021-05-24)
administration-ui\src\main\java\lt\tokenmill\crawling\adminui\view\HttpSourceTestWindow.java (7674, 2021-05-24)
administration-ui\src\main\java\lt\tokenmill\crawling\adminui\view\HttpSourcesView.java (10065, 2021-05-24)
administration-ui\src\main\java\lt\tokenmill\crawling\adminui\view\ImportExportView.java (845, 2021-05-24)
... ...

# Crawling Framework

[![Maven Central](https://img.shields.io/maven-central/v/lt.tokenmill.crawling/crawling-framework.svg?label=Maven%20Central)](https://search.maven.org/search?q=g:%22lt.tokenmill.crawling%22%20AND%20a:%22crawling-framework%22)
[![pipeline status](https://gitlab.com/tokenmill/crawling-framework/badges/master/pipeline.svg)](https://gitlab.com/tokenmill/crawling-framework/commits/master)

Crawling Framework provides tools to configure and run a [Storm Crawler](http://stormcrawler.net/) based crawler. It is aimed mainly at sites that publish article content, such as news portals and blogs.

With the GUI tool that Crawling Framework provides you can:

1. Specify which sites to crawl.
1. Configure URL inclusion and exclusion filters, thus controlling which sections of the site will be fetched.
1. Specify which elements of the page provide information about article publication name, its title and main body.
1. Define tests which validate that extraction rules are working.

Once configuration is done, Crawling Framework runs a [Storm Crawler](http://stormcrawler.net/) based crawl that follows the rules specified in the configuration.

## Introduction

We have recorded a video on how to set up and use Crawling Framework. Click on the image below to watch it on YouTube.

[![Crawling Framework Intro](https://img.youtube.com/vi/AvO4lmmIuis/0.jpg)](https://www.youtube.com/watch?v=AvO4lmmIuis)

## Requirements

The framework writes its configuration and stores crawled data to ElasticSearch. Before starting a crawl project, [install ElasticSearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html) (Crawling Framework is tested to work with Elastic v7.x).

Crawling Framework is a Java library that has to be extended to run a Storm Crawler topology, so a Java toolchain (JDK 8, Maven) is needed.

### Using password protected ElasticSearch

Some providers put ElasticSearch behind an authentication step (which makes sense). Just set the environment variables `ES_USERNAME` and `ES_PASSWORD` accordingly; everything else can remain the same. Authentication is performed implicitly when proper credentials are present.

## Configuring and Running a crawl

See the [Crawling Framework Example](https://github.com/tokenmill/crawling-framework-example) project's documentation.

## License

Copyright © 2017-2019 [TokenMill UAB](http://www.tokenmill.ai).

Distributed under the Apache License, Version 2.0.
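
As an illustration of the `ES_USERNAME` / `ES_PASSWORD` convention described under "Using password protected ElasticSearch", here is a minimal sketch of how a client could turn those two variables into an HTTP Basic `Authorization` header, and fall back to an anonymous connection when they are absent. The class `EsCredentials` and the method `basicAuthHeader` are hypothetical names invented for this example; this is not the framework's actual implementation, and the assumption that Basic authentication is used is ours.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Crawling Framework.
public class EsCredentials {

    // Builds an HTTP Basic Authorization header value from the given
    // environment map, or returns empty if either variable is missing.
    static Optional<String> basicAuthHeader(Map<String, String> env) {
        String user = env.get("ES_USERNAME");
        String pass = env.get("ES_PASSWORD");
        if (user == null || user.isEmpty() || pass == null || pass.isEmpty()) {
            return Optional.empty();
        }
        String token = Base64.getEncoder()
                .encodeToString((user + ":" + pass).getBytes(StandardCharsets.UTF_8));
        return Optional.of("Basic " + token);
    }

    public static void main(String[] args) {
        // System.getenv() returns the process environment as a map.
        Optional<String> header = basicAuthHeader(System.getenv());
        System.out.println(header.isPresent()
                ? "Credentials found: an Authorization header would be sent"
                : "No credentials set: connecting without authentication");
    }
}
```

Taking the environment as a `Map` parameter, instead of calling `System.getenv()` inside the method, keeps the logic easy to check without changing the real process environment.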
