
Category: Java Programming
Development tool: Java
File size: 5631KB
Downloads: 119
Upload date: 2009-03-12 10:07:31
Uploader: xyf_84
Description: A vertical-search web crawler for collecting news articles, written in Java, with source code included.

File list:
WebNewsCrawler-1.0 (0, 2007-02-15)
WebNewsCrawler-1.0\bin (0, 2007-02-15)
WebNewsCrawler-1.0\bin\conf (0, 2007-02-15)
WebNewsCrawler-1.0\bin\conf\commons-logging.properties (74, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\crawler.properties (7581, 2007-02-13)
WebNewsCrawler-1.0\bin\conf\file.types (130, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\html2xml.properties (2524, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\jtidy.properties (327, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\keycontent.properties (791, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\log4j.properties (897, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\mime.types (8625, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\preffered.encodings (306, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\scripts (0, 2007-02-08)
WebNewsCrawler-1.0\bin\conf\scripts\battellemedia.com.script (883, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\scripts\blog.ask.com.script (711, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\scripts\blog.outer-court.com.script (773, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\scripts\blog.searchenginewatch.com.script (786, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\scripts\blogs.forrester.com.script (499, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\scripts\blogs.msdn.com.script (882, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\scripts\blogs.zdnet.com.script (716, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\scripts\clickz.com.script (1010, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\scripts\google.blognewschannel.script (890, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\scripts\google.weblogsinc.com.script (815, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\scripts\googleblog.blogspot.com.script (775, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\scripts\high-search-engine-ranking.com.script (927, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\scripts\imediaconnection.com.script (773, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\scripts\internetnews.com.script (871, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\scripts\isedb.com.script (639, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\scripts\jeremy.zawodny.com.script (866, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\scripts\keepmedia.com.script (987, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\scripts\news.com.com.script (962, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\scripts\news.zdnet.com.script (1022, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\scripts\pandia.com.script (538, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\scripts\pandia.com.script~2 (638, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\scripts\pandia.com.script~3 (517, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\scripts\pandia.com.script~4 (636, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\scripts\pandia.com.script~5 (546, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\scripts\pcworld.com.script (928, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\scripts\promotiondata.com.script (612, 2007-02-07)
WebNewsCrawler-1.0\bin\conf\scripts\promotiondata.com.script~2 (844, 2007-02-07)
... ...

WebNews Crawler version 1.0, Feb 8 2007
http://senews.sourceforge.net/
mailto: porva@users.sourceforge.net (Vladimir Poroshin)
======================================================

WebNews Crawler is a Java application for crawling (downloading, fetching) resources via HTTP. You can use it as a generic crawler to download web pages from the Internet. It has a set of filters to limit and focus the crawling process. In addition, WebNews Crawler comes with a powerful HTML2XML library that can extract the desired data from HTML pages and represent it in XML format. Together with its ability to parse RSS feeds, this crawler is useful for acquiring and cleaning web news articles.

This software is used as part of the ALVIS search engines:
http://wikipedia.hiit.fi/searchenginenews/front
http://wikipedia.hiit.fi/searchenginenews/background.html
See more about the ALVIS project here: http://www.alvis.info/alvis/

INSTALLATION

WebNews Crawler is a pure Java application. You need a Java Runtime Environment (JRE) version 1.5.x or higher installed and configured to run it. The software is platform independent and has been tested on Windows and Linux boxes. No special installation of the package itself is required: just unpack the archive into some location on your hard drive.

DISTRIBUTION STRUCTURE

bin/                   top directory of the program to run
bin/conf               configuration files
bin/conf/scripts       HTML2XML parser site-specific scripts
bin/conf/scripts/sesq  SESQ parser site-specific scripts
doc/                   documentation
licenses/              licenses of the libraries used
LICENSE.txt            license of WebNews Crawler
README.txt             this file
src.tar.bz2            source code of WebNews Crawler

GETTING STARTED

1. Prepare the list of URLs to crawl. Fill the 'bin/news-rss.crl' file with your tasks.

2. Modify the configuration files according to your needs. Open and edit the bin/conf/crawler.properties file; it contains all the important parameters related to the crawling process.
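The syntax of 'news-rss.crl' is not shown above; a minimal sketch of step 1, assuming one crawl task (an RSS feed or page URL) per line, might look like this:

```shell
# Hypothetical task file: the one-URL-per-line layout is an assumption --
# check doc/ in the package for the actual .crl syntax.
cat > news-rss.crl <<'EOF'
http://news.example.com/rss.xml
http://blog.example.org/feed
EOF
wc -l < news-rss.crl
```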
3a. Start the crawler in a console (shell# here is your command prompt). cd to the bin/ directory of the package:

    shell# cd unpacked_package_dir/bin

Start the crawler process and wait until it has finished:

    shell# java -jar webnews-crawler.jar -cmd start

3b. If you plan to crawl a lot of URLs, consider using the crawler's server mode instead of step 3a. To do this, start the crawler server (change 12345 to some port number):

    shell# java -jar webnews-crawler.jar -cmd server -p 12345

Use TanaSend.jar to send a 'start' command to the server:

    shell# java -jar TanaSend.jar localhost 12345 cmd:start

Now you can control and monitor the crawling process by sending the commands listed below:

    'start'    - start a crawling process
    'pause'    - pause a crawling process
    'resume'   - resume a crawling process after it has been paused
    'stop'     - stop a crawling process; cannot be resumed
    'shutdown' - shut down the crawler
    'exit'     - interrupt a crawling process
    'stat'     - get statistics from the crawler
    'log'      - change the log4j properties file and reload it; requires a parameter 'file'

4. In step 3 the crawler downloaded some content and stored it in its internal database. To export this database:

    shell# java -jar webnews-crawler.jar -cmd export

After the export you will see a directory 'export-<time>', where <time> is a Unix timestamp. In this directory each exported resource has one 'meta' file and one 'original' file with the same number as a name. If HTML2XML processing succeeded for a resource, there is also an additional 'xml' file.

The 'meta' file contains the meta information of a resource, such as its URL, the HTML title (if any), the detected encoding, the timestamp of crawling, and the HTTP headers prefixed with 'Meta-'. The 'original' file is the content of the fetched resource as it came from the socket, i.e. without any conversion or modification. The 'xml' file (if any) is the result of HTML2XML processing.

There is also a GUI version of the crawler.
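The exact layout of an exported 'meta' file is not specified above, only its fields (URL, title, encoding, timestamp, 'Meta-'-prefixed HTTP headers). A small Java sketch for reading such a file, assuming a hypothetical "Key: value" line format, could look like this:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of parsing an exported 'meta' file. The "Key: value" line
// layout is an assumption -- inspect a real export directory to
// confirm the actual format.
public class MetaReader {
    static Map<String, String> parseMeta(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : text.split("\n")) {
            int sep = line.indexOf(": ");
            if (sep > 0) {
                fields.put(line.substring(0, sep), line.substring(sep + 2));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String sample = "URL: http://news.example.com/article-1.html\n"
                      + "Encoding: UTF-8\n"
                      + "Meta-Content-Type: text/html";
        Map<String, String> m = parseMeta(sample);
        // HTTP headers carry the 'Meta-' prefix, per the README.
        System.out.println(m.get("URL"));
        System.out.println(m.get("Meta-Content-Type"));
    }
}
```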
It does not support some features of the console version, but it can still be useful for simple crawling. Start the crawler with the GUI like this:

    shell# java -jar webnews-crawler.jar -cmd gui

INTEGRATION WITH ALVIS TOOLS

All exported resources can be converted into the ALVIS XML format. To do so, download and install the ALVIS support tools:

    shell# sudo perl -MCPAN -e "install Alvis::Convert"

and run the news_xml2alvis script.

COPYRIGHT AND LICENCE

Copyright (c) 2005 by Vladimir Poroshin. All Rights Reserved.
See the LICENCE.txt file.
