WebArchiver
Category: Data collection / crawler
Development tool: Python
File size: 52KB
Upload date: 2018-08-07 15:28:31
Uploader: sh-1993
Description: Decentralized web archiving
File list:
add_job.py (1335, 2018-08-07)
main.py (2193, 2018-08-07)
test.py (337, 2018-08-07)
webarchiver (0, 2018-08-07)
webarchiver\__init__.py (2231, 2018-08-07)
webarchiver\config.py (1041, 2018-08-07)
webarchiver\dashboard (0, 2018-08-07)
webarchiver\dashboard\__init__.py (1192, 2018-08-07)
webarchiver\dashboard\templates (0, 2018-08-07)
webarchiver\dashboard\templates\base.html (489, 2018-08-07)
webarchiver\dashboard\templates\configuration.html (123, 2018-08-07)
webarchiver\dashboard\templates\job.html (961, 2018-08-07)
webarchiver\dashboard\templates\jobs.html (274, 2018-08-07)
webarchiver\dashboard\templates\main.html (452, 2018-08-07)
webarchiver\database.py (4566, 2018-08-07)
webarchiver\database_test.py (1361, 2018-08-07)
webarchiver\dicts.py (234, 2018-08-07)
webarchiver\extractor (0, 2018-08-07)
webarchiver\extractor\__init__.py (0, 2018-08-07)
webarchiver\extractor\simple.py (3631, 2018-08-07)
webarchiver\job (0, 2018-08-07)
webarchiver\job\__init__.py (6257, 2018-08-07)
webarchiver\job\archive.py (4370, 2018-08-07)
webarchiver\job\settings.py (4526, 2018-08-07)
webarchiver\job\settings_example (0, 2018-08-07)
webarchiver\job\settings_example\list (27, 2018-08-07)
webarchiver\job\settings_example\settings_example.cfg (668, 2018-08-07)
webarchiver\log.py (2374, 2018-08-07)
webarchiver\request.py (3181, 2018-08-07)
webarchiver\server (0, 2018-08-07)
webarchiver\server\__init__.py (124, 2018-08-07)
webarchiver\server\base.py (8438, 2018-08-07)
webarchiver\server\crawler.py (28138, 2018-08-07)
webarchiver\server\job (0, 2018-08-07)
webarchiver\server\job\__init__.py (213, 2018-08-07)
webarchiver\server\job\crawler.py (7537, 2018-08-07)
webarchiver\server\job\stager.py (17890, 2018-08-07)
... ...
WebArchiver
===========
**WebArchiver** is a decentralized web archiving system. It allows servers to be added and removed, and minimizes data loss when a server goes offline.
This project is still under development.
Usage
-----
WebArchiver has the following dependencies:
* ``flask``
* ``requests``
* ``warcio``
Install these by running ``pip install flask requests warcio``, or use ``pip3`` if your default Python version is Python 2.
``wget`` is also required; it can be installed using::

    sudo apt-get install wget
To run WebArchiver:
#. ``git clone`` this repository,
#. ``cd`` into it,
#. Run ``python main.py`` with options or use ``python3`` if your default Python version is Python 2.
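Put together, a first run could look like this (``<repository-url>`` is a placeholder for this repository's clone URL; the ``--sort`` option is explained under Options below)::

    git clone <repository-url>
    cd WebArchiver
    python3 main.py --sort=stager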
Options
~~~~~~~
The following options are available for setting up a server in an existing network or for creating a new network.
* ``-h``
  ``--help``: Show help text.
* ``-v``
  ``--version``: Get the version of WebArchiver.
* ``-S SORT``
  ``--sort=SORT``: The sort of server to be created. ``SORT`` can be ``stager`` for a stager or ``crawler`` for a crawler. This argument is required.
* ``-SH HOST``
  ``--stager-host=HOST``: The host of the stager to connect to. This should not be set if this is the first stager.
* ``-SP PORT``
  ``--stager-port=PORT``: The port of the stager to connect to. This should not be set if this is the first stager.
* ``-H HOST``
  ``--host=HOST``: The host to use for communication. If not set, the script will try to determine the host.
* ``-P PORT``
  ``--port=PORT``: The port to use for communication. If not set, a random port between 3000 and 6000 will be chosen.
* ``--no-dashboard``: Do not create a dashboard.
* ``--dashboard-port=PORT``: The port to use for the dashboard. The default port is 5000.
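For example, a minimal two-server network could be started as follows; the host and port values here are illustrative::

    # Start the first stager of a new network:
    python3 main.py --sort=stager --port=3001

    # On another server, start a crawler and connect it to that stager:
    python3 main.py --sort=crawler --stager-host=192.0.2.10 --stager-port=3001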
Add a job
~~~~~~~~~
A crawl of a website or a list of URLs is called a job. To add a job, a configuration file needs to be created and submitted to WebArchiver. The configuration file contains the job's identifier as its section header and the following possible options.
* ``url``: URL to crawl.
* ``urls file``: Filename of a file containing a list of URLs.
* ``urls url``: URL to a webpage containing a raw list of URLs.
* ``rate``: URL crawl rate in URLs per second.
* ``allow regex``: Regular expression a discovered URL should match.
* ``ignore regex``: Regular expression a discovered URL should not match.
* ``depth``: Maximum depth to crawl.
For all settings except ``rate`` and ``depth``, multiple entries are possible.
An example of a configuration file:

.. code:: ini

    [identifier]
    url = https://example.com/
    url = https://example.com/page2
    urls file = list
    urls url = https://pastebin.com/raw/tMpQQk7B
    rate = 4
    allow regex = https?://(?:www\.)?example\.com/
    allow regex = https?://[^/]+\.london
    ignore regex = https?://[^/]+\.nl
    depth = 3
To process the configuration file and add it to WebArchiver, run ``python add_job.py FILENAME``, where ``FILENAME`` is the name of the configuration file.
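For example, using the example settings file included in the repository::

    python add_job.py webarchiver/job/settings_example/settings_example.cfg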
Servers
-------
WebArchiver consists of stagers and crawlers. Stagers divide the work among crawlers and other stagers.
Stager
~~~~~~
The stager distributes new jobs and URLs, and receives WARC files from crawlers.
Crawling
~~~~~~~~
The crawler receives URLs from the stager it is connected to, crawls these URLs, and sends back the WARC file and newly discovered URLs.
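The actual coordination logic lives in ``webarchiver/server/``. As a purely illustrative sketch, not the project's real API (the class and method names below are assumptions), dividing URLs among connected crawlers could look like this:

.. code:: python

    import itertools

    class Crawler:
        """Stand-in for a connected crawler (illustrative only)."""
        def __init__(self, name):
            self.name = name
            self.queue = []  # URLs assigned to this crawler

        def queue_url(self, url):
            self.queue.append(url)

    class Stager:
        """Hands out URLs to crawlers round-robin (illustrative only)."""
        def __init__(self, crawlers):
            self._cycle = itertools.cycle(crawlers)

        def distribute(self, urls):
            for url in urls:
                next(self._cycle).queue_url(url)

    # Example: two crawlers share the URLs of a small job.
    crawlers = [Crawler('crawler-1'), Crawler('crawler-2')]
    Stager(crawlers).distribute(['https://example.com/', 'https://example.com/page2'])
    print({c.name: c.queue for c in crawlers})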