crawler - Pudn.com (联合开发网)

crawler

crawlerDriv reduce crawler map reduce distributed crawler Java distributed crawler

Category: Artificial Intelligence / Neural Networks / Deep Learning
Development tool: Java
File size: 9871 KB
Downloads: 37
Upload date: 2013-07-02 23:10:47
Uploader: zqq1016

Description: A distributed version of a web crawler, implemented on top of Map-Reduce; very useful.

File list:

crawler\build.xml (4891, 2010-02-01)
crawler\LISENCE (17775, 2010-01-13)
crawler\seeds-csb.txt (25, 2010-02-01)
crawler\seeds-hadoop.txt (25, 2010-01-13)
crawler\seeds-hadoopcn.txt (48, 2010-01-15)
crawler\seeds-hi.txt (21, 2010-02-01)
crawler\seeds-localhost.txt (17, 2010-02-01)
crawler\seeds-nyt.txt (23, 2010-01-15)
crawler\seeds-scst.txt (24, 2010-01-13)
crawler\seeds-wiki.txt (42, 2010-01-13)
crawler\bin\crawler.sh (276, 2010-02-01)
crawler\conf\configuration.xsl (1311, 2010-01-13)
crawler\conf\joycrawler-csb.xml (2132, 2010-02-01)
crawler\conf\joycrawler-default.xml (2248, 2010-02-01)
crawler\conf\joycrawler-hadoop.xml (1971, 2010-02-01)
crawler\conf\joycrawler-hadoopcn.xml (1998, 2010-02-01)
crawler\conf\joycrawler-hi.xml (391, 2010-02-01)
crawler\conf\joycrawler-localhost.xml (1969, 2010-02-01)
crawler\conf\joycrawler-nyt.xml (2179, 2010-02-01)
crawler\conf\joycrawler-scst.xml (1965, 2010-02-01)
crawler\conf\joycrawler-wiki.xml (2153, 2010-02-01)
crawler\conf\log4j.properties (2955, 2010-01-13)
crawler\lib\commons-cli-2.0-SNAPSHOT.jar (258337, 2010-01-13)
crawler\lib\commons-httpclient-3.1.jar (305001, 2010-01-13)
crawler\lib\commons-logging-1.0.4.jar (38015, 2010-01-13)
crawler\lib\db.jar (632527, 2010-01-13)
crawler\lib\hadoop-0.20.1-core.jar (2682112, 2010-01-13)
crawler\lib\log4j-1.2.15.jar (391834, 2010-01-13)
crawler\lib\lucene-core-3.0.0.jar (1021623, 2010-01-13)
crawler\lib\lucene-smartcn-3.0.0.jar (3590447, 2010-02-01)
crawler\lib\lucene-snowball-3.0.0.jar (115093, 2010-02-01)
crawler\lib\nekohtml.jar (121981, 2010-01-13)
crawler\lib\xercesImpl.jar (1229289, 2010-01-13)
crawler\lib\xercesMinimal.jar (41531, 2010-01-13)
crawler\lib\xml-apis.jar (194354, 2010-01-13)
crawler\lib\native\libdb_java48.dll (131072, 2010-01-13)
crawler\src\contrib\java\org\joy\analyzer\Analyzer.java (1657, 2010-01-13)
crawler\src\contrib\java\org\joy\analyzer\Document.java (2458, 2010-01-13)
crawler\src\contrib\java\org\joy\analyzer\DocumentCreationException.java (248, 2010-01-13)
... ...

Installation and Deploy Guide

1. Installation (Stand Alone)

Check the required software:

    JDK 1.6
    Apache Ant
    Oracle Berkeley DB (a little tricky to install on Linux; download it from http://www.oracle.com/technology/software/products/berkeley-db/db/index.html and follow the instructions)
    Cygwin (if on Windows)

Download the latest Joycrawler from http://code.google.com/p/joycrawler/, unzip it to any directory, and "cd" to that directory.

2. Run Examples

Each example is configured in two files: the seeds file seeds-*.txt in the top directory and the configuration file joycrawler-*.xml in conf/. The * stands for the example name, which you will use on the command line.

To download the web pages and index them:

    $ ant index -Dconf=EXAMPLE -Dnative.path=YOUR_BERKELEY_DB_INSTALLATION_DIR

To start your search server on the local machine:

    $ ant server -Dnative.path=YOUR_BERKELEY_DB_INSTALLATION_DIR -Ddbfolder=DB_FOLDER

DB_FOLDER is the value of the org.joy.index.DBFolder property in the configuration file. Initialization has finished when the server prints "Listening to 1***7".

To start searching, open another terminal, cd to the installation directory, and run:

    $ ant searcher -Dhosts=localhost

Then you can type the search string.

3. Customize your search

Modify the configuration file and the seeds file as needed, name them following the same pattern as the bundled examples, then repeat the steps in section 2.
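The naming convention from section 2 (one example name selecting both a seeds file and a configuration file) can be sketched in a few lines of Java. This is only an illustration: the class ExampleFiles and its methods are hypothetical and are not part of Joycrawler; the sketch assumes the file patterns and the "ant index" command line exactly as the guide states them.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/**
 * Hypothetical helper (not part of Joycrawler): maps an example name
 * to the two files the guide says configure that example.
 */
public final class ExampleFiles {
    private ExampleFiles() {}

    /** Seeds file in the top directory, e.g. seeds-wiki.txt. */
    public static Path seedsFile(String example) {
        return Paths.get("seeds-" + requireName(example) + ".txt");
    }

    /** Configuration file under conf/, e.g. conf/joycrawler-wiki.xml. */
    public static Path configFile(String example) {
        return Paths.get("conf", "joycrawler-" + requireName(example) + ".xml");
    }

    /** Builds the indexing command from section 2; nativePath is the Berkeley DB installation directory. */
    public static String indexCommand(String example, String nativePath) {
        return "ant index -Dconf=" + requireName(example) + " -Dnative.path=" + nativePath;
    }

    // Rejects null or empty names so that malformed paths are never produced.
    private static String requireName(String example) {
        if (example == null || example.isEmpty()) {
            throw new IllegalArgumentException("example name must not be empty");
        }
        return example;
    }

    public static void main(String[] args) {
        // Defaults to "wiki", one of the examples in the file list.
        String name = args.length > 0 ? args[0] : "wiki";
        System.out.println(seedsFile(name));
        System.out.println(configFile(name));
        System.out.println(indexCommand(name, "YOUR_BERKELEY_DB_INSTALLATION_DIR"));
    }
}
```

For a new example called "mysite", the helper points at seeds-mysite.txt and conf/joycrawler-mysite.xml, which are the two files to create before running the indexing command with -Dconf=mysite.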
