heritrix-1.6.0-src

所属分类:搜索引擎
开发工具:Unix_Linux
文件大小:9203KB
下载次数:25
上传日期:2006-05-03 03:44:16
上 传 者konmythos
说明:  非常优秀的搜索引擎 LInux下 java版本的 robot
(excellent search engine LInux under java version of the robot)

文件列表:
heritrix-1.6.0 (0, 2005-12-01)
heritrix-1.6.0\build.xml (278, 2005-12-01)
heritrix-1.6.0\lib (0, 2005-12-01)
heritrix-1.6.0\lib\ant-1.6.2.jar (999966, 2005-12-01)
heritrix-1.6.0\lib\colt-subset-1.2.0.kb.jar (2718, 2005-12-01)
heritrix-1.6.0\lib\commons-cli-1.0.jar (30117, 2005-12-01)
heritrix-1.6.0\lib\commons-codec-1.3.jar (46725, 2005-12-01)
heritrix-1.6.0\lib\commons-collections-3.1.jar (559366, 2005-12-01)
heritrix-1.6.0\lib\commons-httpclient-3.0-rc3.jar (279317, 2005-12-01)
heritrix-1.6.0\lib\commons-logging-1.0.4.jar (38015, 2005-12-01)
heritrix-1.6.0\lib\commons-pool-1.2.jar (42492, 2005-12-01)
heritrix-1.6.0\lib\concurrent-1.3.2.jar (852814, 2005-12-01)
heritrix-1.6.0\lib\dnsjava-1.6.2.jar (244766, 2005-12-01)
heritrix-1.6.0\lib\dsi-unimi-it-1.0.0.kb.jar (268008, 2005-12-01)
heritrix-1.6.0\lib\itext-1.2.0.jar (1334258, 2005-12-01)
heritrix-1.6.0\lib\j5compat-0.1.0.jar (7240, 2005-12-01)
heritrix-1.6.0\lib\jasper-compiler-tomcat-4.1.30.jar (181664, 2005-12-01)
heritrix-1.6.0\lib\jasper-runtime-tomcat-4.1.30.jar (72406, 2005-12-01)
heritrix-1.6.0\lib\javaswf-CVS-SNAPSHOT-1.jar (175777, 2005-12-01)
heritrix-1.6.0\lib\je-2.0.90.jar (705552, 2005-12-01)
heritrix-1.6.0\lib\jetty-4.2.23.jar (580195, 2005-12-01)
heritrix-1.6.0\lib\jmxri-1.2.1.jar (365858, 2005-12-01)
heritrix-1.6.0\lib\jmxtools-1.2.1.jar (102394, 2005-12-01)
heritrix-1.6.0\lib\junit-3.8.1.jar (121070, 2005-12-01)
heritrix-1.6.0\lib\libidn-0.5.9.jar (109153, 2005-12-01)
heritrix-1.6.0\lib\MirrorJNDI-1.0.jar (11029, 2005-12-01)
heritrix-1.6.0\lib\poi-2.0-RC1-20031102.jar (619594, 2005-12-01)
heritrix-1.6.0\lib\poi-scratchpad-2.0-RC1-20031102.jar (188928, 2005-12-01)
heritrix-1.6.0\lib\servlet-tomcat-4.1.30.jar (79265, 2005-12-01)
heritrix-1.6.0\LICENSE.txt (26980, 2005-12-01)
heritrix-1.6.0\maven.xml (20611, 2005-12-01)
heritrix-1.6.0\project.properties (4230, 2005-12-01)
heritrix-1.6.0\project.xml (31555, 2005-12-01)
heritrix-1.6.0\src (0, 2005-12-01)
heritrix-1.6.0\src\articles (0, 2005-12-01)
heritrix-1.6.0\src\articles\crawler_overview1.dia (2059, 2005-12-01)
heritrix-1.6.0\src\articles\crawler_overview1.png (23699, 2005-12-01)
heritrix-1.6.0\src\articles\credentials.gif (18691, 2005-12-01)
heritrix-1.6.0\src\articles\credentials.zargo (11239, 2005-12-01)
... ...

------------------------------------------------------------------------------- $Id: README.txt,v 1.22 2005/04/29 01:36:11 stack-sf Exp $ ------------------------------------------------------------------------------- 0.0 Contents 1.0 Introduction 2.0 Webmasters! 3.0 System Runtime Requirements 4.0 Getting Started 5.0 Developer Documentation 6.0 Release History 7.0 License 8.0 Dependencies 1.0 Introduction Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix/heritix/heretix/heratix) is an archaic word for heiress (woman who inherits). Since our crawler seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt. 2.0 Webmasters! Heritrix is designed to respect the robots.txt exclusion directives and META robots tags . If you notice our crawler behaving poorly, please send us email at archive-crawler-agent *at* lists *dot* sourceforge *dot* net. 3.0 System Runtime Requirements 3.1. Java Runtime Environment The Heritrix crawler is implemented purely in java. This means that the only true requirement for running it is that you have a JRE installed. The Heritrix crawler makes use of Java 1.4 features so your JRE must be at least of a 1.4.0 pedigree. We currently include all of the free/open source third-party libraries necessary to run Heritrix in the distribution package. They are listed along with pointers to their licenses in Section 8. Dependencies below. 3.2. Hardware Default heap size is 256MB RAM. This should be suitable for crawls that range over hundreds of hosts. 3.3. Linux The Heritrix crawler has been built and tested primarily on Linux. It has seen some informal use on Macintosh, Windows 2000 and Windows XP, but is not tested, packaged, nor supported on platforms other than Linux at this time. 4.0 Getting Started See the User Manual at ./docs/articles/user_manual.html or at . 5.0 Developer Documentation See ./docs/articles/developer_manual.html or . 6.0 Release History See the Heritrix Release Notes in the local directory docs/articles/releasenotes.html if this is a binary release or at http://crawler.archive.org/articles/releasenotes.html. 7.0 License Heritrix is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser Public License as published by the Free Software Foundation; either version 2.1 of the License, or any later version. Heritrix is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser Public License for more details. You should have received a copy of the GNU Lesser Public License along with Heritrix (See LICENSE.txt); if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA For the licenses for libraries used by Heritrix and included in its distribution, see below in section '8.0 Dependencies'. 8.0 Dependencies 8.1. bdb Version: 1.7.1 Url: http://www.sleepycat.com/products/je.shtml Description: Berkeley DB Java Edition. Copyright (c) 1990-2004 Sleepycat Software. All rights reserved. See above link for how to obtain source. License: http://www.sleepycat.com/download/jeoslicense.html 8.2. commons-httpclient Version: 3.0-beta1 Url: http://jakarta.apache.org/commons/httpclient/ Description: This package is used for fetching URIs via http. License: Apache 2.0 http://www.apache.org/licenses/LICENSE-2.0 8.3. commons-logging Version: 1.0.4 Url: http://jakarta.apache.org/commons/logging.html Description: Provides logging adapters. License: Apache 2.0 http://www.apache.org/licenses/LICENSE-2.0 8.4. commons-codec Version: 1.3 Url: http://jakarta.apache.org/commons/codec/ Description: Commons Codec provides implementations of common encoders and decoders such as Base***, Hex, various phonetic encodings, and URLs. License: Apache 2.0 http://www.apache.org/licenses/LICENSE-2.0 8.5. dnsjava Version: 1.6.2 Url: http://www.dnsjava.org/ Description: DNS Lookups. License: BSD 8.6. jetty Version: 4.2.23 Url: http://jetty.mortbay.com/jetty/ Description: The Jetty servlet container. License: Jetty license, http://jetty.mortbay.org/jetty/LICENSE.html 8.7. servlet Version: 2.3 Url: http://jakarta.apache.org/tomcat/ Description: Taken from tomcat. License: http://jakarta.apache.org/site/legal.html 8.8. jasper-runtime Version: 4.1.30 Url: http://jakarta.apache.org/tomcat/ Description: Taken from tomcat. License: http://jakarta.apache.org/site/legal.html 8.9. jasper-compiler Version: 4.1.30 Url: http://jakarta.apache.org/tomcat/ Description: Taken from tomcat. License: http://jakarta.apache.org/site/legal.html 8.10. jmxri Version: Url: http://java.sun.com/products/JavaManagement/index.jsp Description: JMX Reference Implementation. License: SUN Binary Code License http://java.com/en/download/license.jsp 8.11. jmxtools Version: Url: http://java.sun.com/products/JavaManagement/index.jsp Description: JMX tools. License: SUN Binary Code License http://java.com/en/download/license.jsp 8.12. poi Version: 2.0-RC1-20031102 Url: http://jakarta.apache.org/poi/ Description: For parsing PDFs. License: Apache 1.1 http://www.apache.org/LICENSE.txt 8.13. poi-scratchpad Version: 2.0-RC1-20031102 Url: http://jakarta.apache.org/poi/ Description: For parsing PDFs. Has the org.apache.poi.hdf.extractor.WordDocument. License: Apache 1.1 http://www.apache.org/LICENSE.txt 8.14. javaswf Version: Url: http://www.anotherbigidea.com/javaswf Description: JavaSWF2 is a set of Java packages that enable the parsing, manipulation and generation of the Macromedia Flash(TM) file format known as SWF ("swiff"). Added jar was made by unzipping javaswf-CVS-SNAPSHOT-1.zip download, compiling the java classes therein, and then making a jar of the product. License: The JavaSWF BSD License, http://anotherbigidea.com/javaswf/JavaSWF2-BSD.LICENSE.html 8.15. itext version: 1.2 url: http://www.lowagie.com/iText/ truedescription: A library for parsing PDF files. license: MPL (http://www.lowagie.com/iText/MPL-1.1.txt) 8.16. ant Version: 1.6.2 Url: http://ant.apache.org Description: Build tool. An ant task is used to compile the jspc pages at build time and then for the selftest at runtime. License: Apache 1.1. http://ant.apache.org/license.html 8.17. junit Version: 3.8.1 Url: http://www.junit.org/ Description: A framework for implimenting the unit testing methology. License: IBM's Common Public License Version 0.5. 8.18. commons-pool Version: 1.2 Url: http://jakarta.apache.org/site/binindex.cgi#commons-pool Description: For object pooling. License: Apache 1.1 http://www.apache.org/LICENSE.txt 8.19. commons-collections Version: 3.1 Url: http://jakarta.apache.org/site/binindex.cgi#commons-collections Description: Needed by commons-pool. License: Apache 1.1 http://www.apache.org/LICENSE.txt 8.20. commons-cli Version: 1.0 Url: http://jakarta.apache.org/site/binindex.cgi Description: Needed doing Heritrix command-line processing. License: Apache 1.1 http://www.apache.org/LICENSE.txt 8.21. concurrent Version: 1.3.2 Url: http://gee.cs.oswego.edu/dl/classes/EDU/oswego/cs/dl/util/concurrent/intro.html Description: Concurrency utilities. License: Public Domain 8.22. commons-net Version: 1.1.0 Url: http://jakarta.apache.org/commons/net/ Description: This is an Internet protocol suite Java library originally developed by ORO, Inc. This version supports Finger, Whois, TFTP, Telnet, POP3, FTP, NNTP, SMTP, and some miscellaneous protocols like Time and Echo as well as BSD R command support. Heritrix uses its FTP implementation. License: Apache 1.1 http://www.apache.org/LICENSE.txt 8.23. dsi-unimi-it Version: 0.9.1 Url: http://mg4j.dsi.unimi.it/ Description: This JAR supplies alternatives to String, StringBuffer, and unsynchronized I/0. This JAR was made from subsets of mg4j-0.9.1 and from fastutil-4.4.0 -- two jars that came out of the ubicrawler project, http://ubi0.iit.cnr.it/projects/ubi/ -- using autojar. Here is how I made this jar: % java -jar autojar-1.2.2/autojar-1.2.2.jar -v -o \ dsi.unimi.it-mg4j-0.9.1_fastutil-4.4.0.jar -c \ mg4j-0.9.1/mg4j-0.9.1.jar:fastutil-4.4.0/fastutil-4.4.0.jar it.unimi.dsi.mg4j.util.MutableString.class it.unimi.dsi.mg4j.io.FastBufferedInputStream.class it.unimi.dsi.mg4j.io.FastBufferedOutputStream.class it.unimi.dsi.mg4j.io.FastBufferedReader.class it.unimi.dsi.mg4j.io.FastByteArrayInputStream.class it.unimi.dsi.mg4j.io.FastByteArrayOutputStream.class it.unimi.dsi.mg4j.io.FastMultiByteArrayInputStream.class License: Both MG4J and fastutils are LGPL

近期下载者

相关文件


收藏者