crawler

Category: Artificial Intelligence / Neural Networks / Deep Learning
Development tool: Java
File size: 9871KB
Downloads: 37
Upload date: 2013-07-02 23:10:47
Uploader: zqq1016
Description: A distributed version of the crawler, implemented on top of Map-Reduce; very useful.

File list:
crawler\build.xml (4891, 2010-02-01)
crawler\LISENCE (17775, 2010-01-13)
crawler\seeds-csb.txt (25, 2010-02-01)
crawler\seeds-hadoop.txt (25, 2010-01-13)
crawler\seeds-hadoopcn.txt (48, 2010-01-15)
crawler\seeds-hi.txt (21, 2010-02-01)
crawler\seeds-localhost.txt (17, 2010-02-01)
crawler\seeds-nyt.txt (23, 2010-01-15)
crawler\seeds-scst.txt (24, 2010-01-13)
crawler\seeds-wiki.txt (42, 2010-01-13)
crawler\bin\crawler.sh (276, 2010-02-01)
crawler\conf\configuration.xsl (1311, 2010-01-13)
crawler\conf\joycrawler-csb.xml (2132, 2010-02-01)
crawler\conf\joycrawler-default.xml (2248, 2010-02-01)
crawler\conf\joycrawler-hadoop.xml (1971, 2010-02-01)
crawler\conf\joycrawler-hadoopcn.xml (1998, 2010-02-01)
crawler\conf\joycrawler-hi.xml (391, 2010-02-01)
crawler\conf\joycrawler-localhost.xml (1969, 2010-02-01)
crawler\conf\joycrawler-nyt.xml (2179, 2010-02-01)
crawler\conf\joycrawler-scst.xml (1965, 2010-02-01)
crawler\conf\joycrawler-wiki.xml (2153, 2010-02-01)
crawler\conf\log4j.properties (2955, 2010-01-13)
crawler\lib\commons-cli-2.0-SNAPSHOT.jar (258337, 2010-01-13)
crawler\lib\commons-httpclient-3.1.jar (305001, 2010-01-13)
crawler\lib\commons-logging-1.0.4.jar (38015, 2010-01-13)
crawler\lib\db.jar (632527, 2010-01-13)
crawler\lib\hadoop-0.20.1-core.jar (2682112, 2010-01-13)
crawler\lib\log4j-1.2.15.jar (391834, 2010-01-13)
crawler\lib\lucene-core-3.0.0.jar (1021623, 2010-01-13)
crawler\lib\lucene-smartcn-3.0.0.jar (3590447, 2010-02-01)
crawler\lib\lucene-snowball-3.0.0.jar (115093, 2010-02-01)
crawler\lib\nekohtml.jar (121981, 2010-01-13)
crawler\lib\xercesImpl.jar (1229289, 2010-01-13)
crawler\lib\xercesMinimal.jar (41531, 2010-01-13)
crawler\lib\xml-apis.jar (194354, 2010-01-13)
crawler\lib\native\libdb_java48.dll (131072, 2010-01-13)
crawler\src\contrib\java\org\joy\analyzer\Analyzer.java (1657, 2010-01-13)
crawler\src\contrib\java\org\joy\analyzer\Document.java (2458, 2010-01-13)
crawler\src\contrib\java\org\joy\analyzer\DocumentCreationException.java (248, 2010-01-13)
... ...

Installation and Deployment Guide

1. Installation (Stand Alone)

Check that the required software is installed:
    JDK 1.6
    Apache Ant
    Oracle Berkeley DB (a little tricky to install on Linux; download it from http://www.oracle.com/technology/software/products/berkeley-db/db/index.html and follow the instructions)
    Cygwin (on Windows only)

Download the latest Joycrawler from http://code.google.com/p/joycrawler/, unzip it to any directory, and "cd" to that directory.

2. Run Examples

Each example is configured by two files: a seeds file seeds-*.txt in the top-level directory and a configuration file joycrawler-*.xml in conf/. Replace * with the example name, which is also the name you pass on the command line.

To download the web pages and index them:

    $ ant index -Dconf=EXAMPLE -Dnative.path=YOUR_BERKELEY_DB_INSTALLATION_DIR

To start the search server on the local machine:

    $ ant server -Dnative.path=YOUR_BERKELEY_DB_INSTALLATION_DIR -Ddbfolder=DB_FOLDER

DB_FOLDER is the value of the org.joy.index.DBFolder property in the configuration file. Initialization has finished when the server prints "Listening to 1***7".

To start searching, open another terminal, cd to the installation directory, and run:

    $ ant searcher -Dhosts=localhost

Then type your search string.

3. Customize Your Search

Modify a configuration file and a seeds file to suit your needs, name them following the same pattern as the examples, then repeat the steps in section 2.
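The file naming convention from section 2 can be sketched in Java as follows. This is only an illustration under stated assumptions: the class ExampleLayout and its methods are hypothetical helpers and are not part of Joycrawler; only the file name patterns (seeds-*.txt, conf/joycrawler-*.xml) and the ant arguments (-Dconf, -Dnative.path) come from the guide.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Joycrawler): maps an example name to the
// two files the guide says configure it, and to the "ant index" arguments.
public class ExampleLayout {

    // seeds-<example>.txt lives in the top-level installation directory.
    static File seedsFile(File installDir, String example) {
        return new File(installDir, "seeds-" + example + ".txt");
    }

    // joycrawler-<example>.xml lives in the conf/ subdirectory.
    static File configFile(File installDir, String example) {
        return new File(new File(installDir, "conf"), "joycrawler-" + example + ".xml");
    }

    // Argument list equivalent to:
    //   ant index -Dconf=<example> -Dnative.path=<Berkeley DB dir>
    static List<String> indexCommand(String example, String nativePath) {
        return Arrays.asList("ant", "index",
                "-Dconf=" + example,
                "-Dnative.path=" + nativePath);
    }

    public static void main(String[] args) {
        // Assumption: run from the installation directory; "wiki" is one of
        // the bundled example names seen in the file list.
        File installDir = new File(".");
        String example = args.length > 0 ? args[0] : "wiki";

        File seeds = seedsFile(installDir, example);
        File conf = configFile(installDir, example);

        // Report whether both files of the pair are present before indexing.
        System.out.println(seeds.getPath() + " exists: " + seeds.isFile());
        System.out.println(conf.getPath() + " exists: " + conf.isFile());
        System.out.println(indexCommand(example, "YOUR_BERKELEY_DB_INSTALLATION_DIR"));
    }
}
```

A check like this can be handy when adding a custom example, since both files must share the same example name for the command line argument to pick them up.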
