heritrix-2.0.0-src
Category: Search engine
Development tool: Java
File size: 3024 KB
Downloads: 26
Upload date: 2008-04-14 13:45:28
Uploader: justinquan
Description: Heritrix: Internet Archive Web Crawler
The archive-crawler project is building a flexible, extensible, robust, and scalable web crawler capable of fetching, archiving, and analyzing the full diversity and breadth of internet-accessible content.
File list:
heritrix-2.0.0 (0, 2008-02-20)
heritrix-2.0.0\project (0, 2008-02-20)
heritrix-2.0.0\project\.classpath (5349, 2007-11-27)
heritrix-2.0.0\project\.project (366, 2007-11-27)
heritrix-2.0.0\project\commons (0, 2008-02-20)
heritrix-2.0.0\project\commons\.classpath (2524, 2007-11-27)
heritrix-2.0.0\project\commons\.project (414, 2007-11-27)
heritrix-2.0.0\project\commons\.settings (0, 2007-11-27)
heritrix-2.0.0\project\commons\.settings\org.eclipse.jdt.core.prefs (203, 2007-11-27)
heritrix-2.0.0\project\commons\pom.xml (10357, 2008-02-20)
heritrix-2.0.0\project\commons\src (0, 2007-11-27)
heritrix-2.0.0\project\commons\src\main (0, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java (0, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\com (0, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\com\sleepycat (0, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\com\sleepycat\collections (0, 2008-02-02)
heritrix-2.0.0\project\commons\src\main\java\org (0, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\org\apache (0, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons (0, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons\httpclient (0, 2007-11-28)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons\httpclient\cookie (0, 2007-11-28)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons\httpclient\Cookie.java (18552, 2007-11-28)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons\httpclient\cookie\CookieSpec.java (10645, 2007-11-28)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons\httpclient\cookie\CookieSpecBase.java (26731, 2007-11-28)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons\httpclient\cookie\IgnoreCookiesSpec.java (4608, 2007-11-28)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons\httpclient\HttpConnection.java (48440, 2007-11-28)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons\httpclient\HttpMethodBase.java (87048, 2007-11-28)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons\httpclient\HttpParser.java (8596, 2007-11-28)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons\httpclient\HttpState.java (25768, 2007-11-28)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons\pool (0, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons\pool\impl (0, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons\pool\impl\FairGenericObjectPool.java (22266, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons\pool\impl\FairGenericObjectPoolTest.java (3649, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons\pool\impl\GenericObjectPool.java (57447, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\org\archive (0, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\org\archive\httpclient (0, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\org\archive\httpclient\ConfigurableX509TrustManager.java (6759, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\org\archive\httpclient\HttpRecorderGetMethod.java (4648, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\org\archive\httpclient\HttpRecorderMethod.java (3619, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\org\archive\httpclient\HttpRecorderPostMethod.java (3285, 2007-11-27)
... ...
-------------------------------------------------------------------------------
$Id: README.txt 5782 2008-02-16 21:59:05Z Gojomo $
-------------------------------------------------------------------------------
0.0 Contents
1.0 Introduction
2.0 Online Reference
3.0 Getting Started Tips
4.0 License
1.0 Introduction
Heritrix is the Internet Archive's open-source, extensible, web-scale,
archival-quality web crawler project. Heritrix (sometimes spelled
heretrix, or misspelled or mispronounced as heratrix/heritix/heretix/heratix)
is an archaic word for heiress (woman who inherits). Our crawler seeks
to collect and preserve the digital artifacts of our culture for the
benefit of future researchers and generations.
2.0 Online Reference
The most up-to-date information about Heritrix is on the project wiki:
http://webteam.archive.org/confluence/display/Heritrix/2.0.0
3.0 Getting Started Tips
The shell script 'heritrix' in the 'bin' directory is usually
sufficient to launch Heritrix. You must use the '-a' launch flag to set
an authentication password on the web user interface. You must use the
'-b' launch flag if you want the web user interface to accept non-local
connections.
The bundled job profiles are good starting points for designing your
own crawl configurations. However, they each require several changes
before they will work for crawling:
- You must configure an 'operator-contact-url' on the job's global
settings sheet. This URL will be added to the 'User-Agent' included
on your crawl's outbound traffic, and should be an HTTP URL supplying
information about the purpose of your crawl and containing contact
information if visited sites need to report problems.
- You must supply one or more 'seed' URLs to serve as crawl starting
points.
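
The two required changes above can be checked before a crawl is started. The
following is a minimal sketch in plain Java, not Heritrix's own API: the class
name CrawlPreflight, its methods, and the User-Agent layout are all
hypothetical, and accepting https as well as http is an assumption. It only
illustrates that the contact URL should be an HTTP URL, that it ends up in the
User-Agent value, and that at least one seed is needed.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the Heritrix distribution.
class CrawlPreflight {

    // True if the value parses as an absolute http or https URL with a host.
    // (Accepting https is an assumption; the README only says "HTTP URL".)
    public static boolean isHttpUrl(String value) {
        if (value == null) {
            return false;
        }
        try {
            URI uri = new URI(value.trim());
            String scheme = uri.getScheme();
            return scheme != null
                    && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))
                    && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Builds an illustrative User-Agent value that embeds the contact URL.
    // The exact layout is assumed here, not taken from Heritrix.
    public static String userAgent(String crawlerName, String operatorContactUrl) {
        if (!isHttpUrl(operatorContactUrl)) {
            throw new IllegalArgumentException(
                    "operator-contact-url must be an HTTP URL: " + operatorContactUrl);
        }
        return "Mozilla/5.0 (compatible; " + crawlerName + " +" + operatorContactUrl.trim() + ")";
    }

    // Rejects a configuration that lacks a contact URL or has no seed URLs.
    public static void check(String operatorContactUrl, List<String> seeds) {
        if (!isHttpUrl(operatorContactUrl)) {
            throw new IllegalArgumentException("missing or invalid operator-contact-url");
        }
        if (seeds == null || seeds.isEmpty()) {
            throw new IllegalArgumentException("at least one seed URL is required");
        }
    }

    public static void main(String[] args) {
        String contact = "http://example.org/crawl-info.html"; // placeholder URL
        List<String> seeds = Arrays.asList("http://example.org/");
        check(contact, seeds);
        System.out.println(userAgent("heritrix/2.0.0", contact));
    }
}
```

In a real job these values are entered on the settings sheets of the web user
interface; the sketch only makes the two requirements explicit.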
4.0 License
Heritrix is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License as published by the
Free Software Foundation; either version 2.1 of the License, or any
later version.
Heritrix is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public License
along with Heritrix (See LICENSE.txt); if not, write to the Free
Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
02111-1307 USA
Heritrix includes a variety of other open source libraries under the
terms of their respective licenses. Please consult those individual
licenses to learn whether the libraries are usable and redistributable
in contexts other than the Heritrix distribution.