WebArchiver

Category: Data collection / crawlers
Development tool: Python
File size: 52KB
Downloads: 0
Upload date: 2018-08-07 15:28:31
Uploader: sh-1993
Description: Decentralized web archiving

File list (size in bytes, date):
add_job.py (1335, 2018-08-07)
main.py (2193, 2018-08-07)
test.py (337, 2018-08-07)
webarchiver (0, 2018-08-07)
webarchiver\__init__.py (2231, 2018-08-07)
webarchiver\config.py (1041, 2018-08-07)
webarchiver\dashboard (0, 2018-08-07)
webarchiver\dashboard\__init__.py (1192, 2018-08-07)
webarchiver\dashboard\templates (0, 2018-08-07)
webarchiver\dashboard\templates\base.html (489, 2018-08-07)
webarchiver\dashboard\templates\configuration.html (123, 2018-08-07)
webarchiver\dashboard\templates\job.html (961, 2018-08-07)
webarchiver\dashboard\templates\jobs.html (274, 2018-08-07)
webarchiver\dashboard\templates\main.html (452, 2018-08-07)
webarchiver\database.py (4566, 2018-08-07)
webarchiver\database_test.py (1361, 2018-08-07)
webarchiver\dicts.py (234, 2018-08-07)
webarchiver\extractor (0, 2018-08-07)
webarchiver\extractor\__init__.py (0, 2018-08-07)
webarchiver\extractor\simple.py (3631, 2018-08-07)
webarchiver\job (0, 2018-08-07)
webarchiver\job\__init__.py (6257, 2018-08-07)
webarchiver\job\archive.py (4370, 2018-08-07)
webarchiver\job\settings.py (4526, 2018-08-07)
webarchiver\job\settings_example (0, 2018-08-07)
webarchiver\job\settings_example\list (27, 2018-08-07)
webarchiver\job\settings_example\settings_example.cfg (668, 2018-08-07)
webarchiver\log.py (2374, 2018-08-07)
webarchiver\request.py (3181, 2018-08-07)
webarchiver\server (0, 2018-08-07)
webarchiver\server\__init__.py (124, 2018-08-07)
webarchiver\server\base.py (8438, 2018-08-07)
webarchiver\server\crawler.py (28138, 2018-08-07)
webarchiver\server\job (0, 2018-08-07)
webarchiver\server\job\__init__.py (213, 2018-08-07)
webarchiver\server\job\crawler.py (7537, 2018-08-07)
webarchiver\server\job\stager.py (17890, 2018-08-07)
... ...

WebArchiver
====

**WebArchiver** is a decentralized web archiving system. It allows servers to be added and removed and minimizes data loss when a server is offline. This project is still under development.

Usage
----

WebArchiver has the following dependencies:

* ``flask``
* ``requests``
* ``warcio``

Install these by running ``pip install flask requests warcio``, or use ``pip3`` if your default Python version is Python 2. ``wget`` is also required; it can be installed using::

    sudo apt-get install wget

To run WebArchiver:

#. ``git clone`` this repository,
#. ``cd`` into it,
#. run ``python main.py`` with options, or use ``python3`` if your default Python version is Python 2.

Options
~~~~

The following options are available for setting up a server in a network or creating a network.

* ``-h`` ``--help``
* ``-v`` ``--version``: Get the version of WebArchiver.
* ``-S SORT`` ``--sort=SORT``: The sort of server to be created. ``SORT`` can be ``stager`` for a stager or ``crawler`` for a crawler. This argument is required.
* ``-SH HOST`` ``--stager-host=HOST``: The host of the stager to connect to. This should not be set if this is the first stager.
* ``-SP PORT`` ``--stager-port=PORT``: The port of the stager to connect to. This should not be set if this is the first stager.
* ``-H HOST`` ``--host=HOST``: The host to use for communication. If not set, the script will try to determine the host.
* ``-P PORT`` ``--port=PORT``: The port to use for communication. If not set, a random port between 3000 and 6000 will be chosen.
* ``--no-dashboard``: Do not create a dashboard.
* ``--dashboard-port=PORT``: The port to use for the dashboard. The default port is 5000.

Add a job
~~~~

A crawl of a website or a list of URLs is called a job. To add a job, a configuration file needs to be processed and added to WebArchiver. The configuration file has an identifier and the following possible options:

* ``url``: URL to crawl.
* ``urls file``: Filename of a file containing a list of URLs.
* ``urls url``: URL of a webpage containing a raw list of URLs.
* ``rate``: Crawl rate in URLs per second.
* ``allow regex``: Regular expression a discovered URL should match.
* ``ignore regex``: Regular expression a discovered URL should not match.
* ``depth``: Maximum depth to crawl.

For all settings except ``rate`` and ``depth``, multiple entries are possible. An example of a configuration file is

.. code:: ini

    [identifier]
    url = https://example.com/
    url = https://example.com/page2
    urls file = list
    urls url = https://pastebin.com/raw/tMpQQk7B
    rate = 4
    allow regex = https?://(?:www)?example\.com/
    allow regex = https?://[^/]+\.london
    ignore regex = https?://[^/]+\.nl
    depth = 3

To process the configuration file and add it to WebArchiver, run ``python add_job.py FILENAME``, where ``FILENAME`` is the name of the configuration file.

Servers
----

WebArchiver consists of stagers and crawlers. Stagers divide the work among crawlers and other stagers.

Stager
~~~~

The stager distributes new jobs and URLs and receives WARCs from crawlers.

Crawling
~~~~

The crawler receives URLs from the stager it is connected to, crawls these URLs and sends back the WARC and newly found URLs.
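As a rough sketch of how the pieces above fit together (the host, ports and configuration filename below are placeholders chosen for illustration, not values prescribed by the project), a small local network with one stager and one crawler could be brought up and given a job along these lines::

    # start the first stager; -SH/-SP are omitted because there is no existing stager to connect to
    python3 main.py --sort=stager --host=127.0.0.1 --port=3030 --dashboard-port=5000

    # start a crawler and point it at the stager started above
    python3 main.py --sort=crawler --stager-host=127.0.0.1 --stager-port=3030

    # process a job configuration file and add the job to the network
    python3 add_job.py settings_example.cfg

How ``add_job.py`` locates the running stager is not spelled out above, so treat the last step as an assumption to verify against ``add_job.py`` itself.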
