heritrix-2.0.0-src

Category: Search engines
Development tool: Java
File size: 3024 KB
Downloads: 26
Upload date: 2008-04-14 13:45:28
Uploader: justinquan
Description: Heritrix: Internet Archive Web Crawler. The archive-crawler project is building a flexible, extensible, robust, and scalable web crawler capable of fetching, archiving, and analyzing the full diversity and breadth of internet-accessible content.

File list:
heritrix-2.0.0 (0, 2008-02-20)
heritrix-2.0.0\project (0, 2008-02-20)
heritrix-2.0.0\project\.classpath (5349, 2007-11-27)
heritrix-2.0.0\project\.project (366, 2007-11-27)
heritrix-2.0.0\project\commons (0, 2008-02-20)
heritrix-2.0.0\project\commons\.classpath (2524, 2007-11-27)
heritrix-2.0.0\project\commons\.project (414, 2007-11-27)
heritrix-2.0.0\project\commons\.settings (0, 2007-11-27)
heritrix-2.0.0\project\commons\.settings\org.eclipse.jdt.core.prefs (203, 2007-11-27)
heritrix-2.0.0\project\commons\pom.xml (10357, 2008-02-20)
heritrix-2.0.0\project\commons\src (0, 2007-11-27)
heritrix-2.0.0\project\commons\src\main (0, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java (0, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\com (0, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\com\sleepycat (0, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\com\sleepycat\collections (0, 2008-02-02)
heritrix-2.0.0\project\commons\src\main\java\org (0, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\org\apache (0, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons (0, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons\httpclient (0, 2007-11-28)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons\httpclient\cookie (0, 2007-11-28)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons\httpclient\Cookie.java (18552, 2007-11-28)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons\httpclient\cookie\CookieSpec.java (10645, 2007-11-28)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons\httpclient\cookie\CookieSpecBase.java (26731, 2007-11-28)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons\httpclient\cookie\IgnoreCookiesSpec.java (4608, 2007-11-28)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons\httpclient\HttpConnection.java (48440, 2007-11-28)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons\httpclient\HttpMethodBase.java (87048, 2007-11-28)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons\httpclient\HttpParser.java (8596, 2007-11-28)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons\httpclient\HttpState.java (25768, 2007-11-28)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons\pool (0, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons\pool\impl (0, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons\pool\impl\FairGenericObjectPool.java (22266, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons\pool\impl\FairGenericObjectPoolTest.java (3649, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\org\apache\commons\pool\impl\GenericObjectPool.java (57447, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\org\archive (0, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\org\archive\httpclient (0, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\org\archive\httpclient\ConfigurableX509TrustManager.java (6759, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\org\archive\httpclient\HttpRecorderGetMethod.java (4648, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\org\archive\httpclient\HttpRecorderMethod.java (3619, 2007-11-27)
heritrix-2.0.0\project\commons\src\main\java\org\archive\httpclient\HttpRecorderPostMethod.java (3285, 2007-11-27)
... ...

-------------------------------------------------------------------------------
$Id: README.txt 5782 2008-02-16 21:59:05Z Gojomo $
-------------------------------------------------------------------------------
0.0 Contents

1.0 Introduction
2.0 Online Reference
3.0 Getting Started Tips
4.0 License


1.0 Introduction

Heritrix is the Internet Archive's open-source, extensible, web-scale,
archival-quality web crawler project.

Heritrix (sometimes spelled heretrix, or misspelled or missaid as
heratrix/heritix/heretix/heratix) is an archaic word for heiress (woman
who inherits). Our crawler seeks to collect and preserve the digital
artifacts of our culture for the benefit of future researchers and
generations.


2.0 Online Reference

The most up-to-date information about Heritrix is on the project wiki:

  http://webteam.archive.org/confluence/display/Heritrix/2.0.0


3.0 Getting Started Tips

The shell script 'heritrix' in the 'bin' directory is usually sufficient
to launch Heritrix.

You must use the '-a' launch flag to set an authentication password on
the web user interface.

You must use the '-b' launch flag if you want the web user interface to
accept non-local connections.

The bundled job profiles are good starting points for designing your own
crawl configurations. However, they each require several changes before
they will work for crawling:

 - You must configure an 'operator-contact-url' on the job's global
   settings sheet. This URL will be added to the 'User-Agent' included
   on your crawl's outbound traffic, and should be an HTTP URL supplying
   information about the purpose of your crawl and containing contact
   information if visited sites need to report problems.

 - You must supply one or more 'seed' URLs to serve as crawl starting
   points.


4.0 License

Heritrix is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser Public License as published by the
Free Software Foundation; either version 2.1 of the License, or any
later version.

Heritrix is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser Public License for
more details.

You should have received a copy of the GNU Lesser Public License along
with Heritrix (See LICENSE.txt); if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

Heritrix includes a variety of other open source libraries under the
terms of their respective licenses. Please consult those individual
licenses to learn whether the libraries are usable and redistributable
in contexts other than the Heritrix distribution.
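The two profile changes listed under 3.0 Getting Started Tips (an 'operator-contact-url' that is an HTTP URL, and at least one seed) can be sanity-checked before a crawl is launched. The following is a minimal sketch only: the class CrawlPreflight and its methods are hypothetical helpers and not part of Heritrix, and treating https as acceptable and requiring seeds to be HTTP URLs are assumptions made here for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-flight check for a crawl profile; not a Heritrix class. */
public class CrawlPreflight {

    /** True if the value parses as an absolute http or https URL with a host. */
    static boolean isHttpUrl(String value) {
        if (value == null) {
            return false;
        }
        try {
            URI uri = new URI(value.trim());
            String scheme = uri.getScheme();
            // Assumption: https is accepted alongside http.
            return uri.getHost() != null
                    && scheme != null
                    && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"));
        } catch (URISyntaxException e) {
            return false;
        }
    }

    /** Returns a list of human-readable problems; an empty list means both checks passed. */
    static List<String> problems(String operatorContactUrl, List<String> seeds) {
        List<String> problems = new ArrayList<>();
        if (!isHttpUrl(operatorContactUrl)) {
            problems.add("operator-contact-url must be an HTTP URL");
        }
        if (seeds == null || seeds.isEmpty()) {
            problems.add("at least one seed URL is required");
        } else {
            for (String seed : seeds) {
                // Assumption: seeds are restricted to HTTP URLs in this sketch.
                if (!isHttpUrl(seed)) {
                    problems.add("seed is not an HTTP URL: " + seed);
                }
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Placeholder values; replace with your own contact page and seeds.
        List<String> found = problems(
                "http://example.org/crawl-info.html",
                List.of("http://example.com/"));
        System.out.println(found.isEmpty() ? "profile looks ready" : String.join("\n", found));
    }
}
```

The check returns a list of problems rather than throwing on the first failure, so every missing setting can be reported in a single pass.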
