Crawler

Category: Data collection / Crawlers
Development tool: Others
File size: 0KB
Downloads: 0
Upload date: 2014-04-04 11:52:42
Uploader: sh-1993
Description: This repository contains a framework for the web crawling process and MySQL database connection pools, and an implementation of Sina news and Weibo crawling.

File list:
lib/ (0, 2014-04-04)
lib/commons-beanutils-1.7.0.jar (188671, 2014-04-04)
lib/commons-collections.jar (571259, 2014-04-04)
lib/commons-lang-2.6.jar (284220, 2014-04-04)
lib/commons-logging-1.1.1.jar (60841, 2014-04-04)
lib/ezmorph-1.0.6.jar (86487, 2014-04-04)
lib/htmlparser.jar (288098, 2014-04-04)
lib/httpcomponents-client-4.3.1-bin/ (0, 2014-04-04)
lib/httpcomponents-client-4.3.1-bin/LICENSE.txt (10349, 2014-04-04)
lib/httpcomponents-client-4.3.1-bin/NOTICE.txt (189, 2014-04-04)
lib/httpcomponents-client-4.3.1-bin/RELEASE_NOTES.txt (72466, 2014-04-04)
lib/httpcomponents-client-4.3.1-bin/lib/ (0, 2014-04-04)
lib/httpcomponents-client-4.3.1-bin/lib/commons-codec-1.6.jar (232771, 2014-04-04)
lib/httpcomponents-client-4.3.1-bin/lib/commons-logging-1.1.3.jar (62050, 2014-04-04)
lib/httpcomponents-client-4.3.1-bin/lib/fluent-hc-4.3.1.jar (22764, 2014-04-04)
lib/httpcomponents-client-4.3.1-bin/lib/httpclient-4.3.1.jar (585603, 2014-04-04)
lib/httpcomponents-client-4.3.1-bin/lib/httpclient-cache-4.3.1.jar (148300, 2014-04-04)
lib/httpcomponents-client-4.3.1-bin/lib/httpcore-4.3.jar (282160, 2014-04-04)
lib/httpcomponents-client-4.3.1-bin/lib/httpmime-4.3.1.jar (37078, 2014-04-04)
lib/javabase64-1.3.1.jar (4364, 2014-04-04)
lib/json-lib-2.2.3-jdk13.jar (148271, 2014-04-04)
lib/jsoup-1.6.1.jar (281579, 2014-04-04)
lib/mysql-connector-java-3.1.10-bin.jar (418698, 2014-04-04)
lib/servlet-api.jar (242641, 2014-04-04)
src/ (0, 2014-04-04)
src/com/ (0, 2014-04-04)
src/com/zoe/ (0, 2014-04-04)
src/com/zoe/DBProcess/ (0, 2014-04-04)
src/com/zoe/DBProcess/DBBasic.java (4691, 2014-04-04)
src/com/zoe/DBProcess/DBConfig.java (789, 2014-04-04)
src/com/zoe/DBProcess/DBConnectionManager.java (3127, 2014-04-04)
src/com/zoe/DBProcess/DBConnectionPool.java (4947, 2014-04-04)
src/com/zoe/crawler/ (0, 2014-04-04)
src/com/zoe/crawler/implement/ (0, 2014-04-04)
src/com/zoe/crawler/implement/CrawlConfig.java (840, 2014-04-04)
src/com/zoe/crawler/implement/CrawlThread.java (3473, 2014-04-04)
src/com/zoe/crawler/implement/CrawlerImp.java (2538, 2014-04-04)
... ...

## INTRODUCTION ##

This Java project is a web crawler toolkit. It has four packages: a crawling framework, a database connection pool manager, an implementation of the framework, and a Sina Weibo crawler. We mainly use it to crawl www.news.sina.com.cn pages and their sub-sections. The fourth package implements Sina Weibo crawling, including a simulated Weibo sign-in, crawling of the reposts and comments of a specific weibo, and crawling of the contents of s.weibo.com pages.

_________________________________________________________________

## MODULE INSTRUCTIONS ##

+ package [com.zoe.crawler.model]

```java
public abstract class SchedulerModel;
```

This package contains the framework of the crawling process, so it consists mostly of abstract classes. To use the framework, extend these classes and override their methods. SchedulerModel is mainly used to initialize the URL seeds, call crawlProcess, and optionally define timed tasks such as crawling pages at a regular interval.

```java
public abstract class CrawlModel;
public abstract class TimerModel;
public abstract class CrawlThreadModel;
public abstract class LogModel;
public abstract class DownloadModel;
```

The abstract class CrawlModel defines the number of threads used to crawl pages, calls the log operations to initialize the crawl queue, and downloads the files that match the crawl rules. TimerModel should usually contain two inner classes, one implementing ServletContextListener and one extending TimerTask, which are provided by the Servlet API and the Java standard library. CrawlThreadModel should instantiate DownloadModel, call the log operations, and define the URL recheck rules. DownloadModel should process the source of a specific URL and decide how to save the extracted contents. LogModel should record the status of each URL during the crawl and handle recovery when the program breaks down. It can also be used to define the URL recheck process.
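
To make the "extend the classes and override their methods" pattern concrete, here is a minimal sketch of what a DownloadModel subclass could look like. The abstract class's real method names and signatures are not shown in this README, so `DownloadModelSketch`, `extract`, `save`, `process`, and `TitleDownload` are all hypothetical names chosen for illustration. The sketch also keeps results in memory rather than in MySQL.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Small demo entry point for the sketch. */
class DownloadSketchDemo {
    public static void main(String[] args) {
        TitleDownload download = new TitleDownload();
        boolean stored = download.process(
                "http://news.sina.com.cn/",
                "<html><head><title> Sina News </title></head></html>");
        System.out.println(stored + " " + download.saved);
    }
}

/**
 * Hypothetical stand-in for the repository's DownloadModel.
 * The real class may expose different methods.
 */
abstract class DownloadModelSketch {
    /** Extracts the content of interest from one page's source, or null if none. */
    protected abstract String extract(String url, String html);

    /** Decides how the extracted content is persisted. */
    protected abstract void save(String url, String content);

    /** Template method: extract first, then save only if something was found. */
    public final boolean process(String url, String html) {
        String content = extract(url, html);
        if (content == null || content.isEmpty()) {
            return false;
        }
        save(url, content);
        return true;
    }
}

/** Example subclass (assumption): stores page titles in memory. */
class TitleDownload extends DownloadModelSketch {
    private static final Pattern TITLE = Pattern.compile(
            "<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    final Map<String, String> saved = new LinkedHashMap<>();

    @Override
    protected String extract(String url, String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    @Override
    protected void save(String url, String content) {
        saved.put(url, content);
    }
}
```

Splitting extraction from saving mirrors the README's description of DownloadModel: one step handles the page source, the other decides where the result goes, so a MySQL-backed subclass would only need a different `save`.
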
+ package [com.zoe.DBProcess]

This package defines database connection pools and the basic database operations.

```java
public class DBConnectionManager;
public class DBConnectionPool;
public class DBBasic;
public class DBConfig;
```

DBConnectionManager initializes multiple named DBConnectionPool instances. Each pool manages its own connections and may use a different connection port. DBConnectionPool defines a database connection pool and operations such as getting a connection from the pool. DBBasic contains common database operations such as insert, query, and delete. DBConfig initializes the parameters related to the database pools and constructs the database connections.

+ package [com.zoe.crawler.implementation]

This package extends [com.zoe.crawler.model] and implements a concrete crawler for news.sina.com.cn pages and their subpages. It saves the extracted contents and related information in MySQL, uses the log to initialize the URL seeds, and uses a Bloom filter to recheck the URL queue. The package contains the main function and can be run once all the parameters are configured.

+ package [com.zoe.crawler.weibo]

This package contains Sina Weibo crawling operations, such as crawling the comments and reposts of a specific weibo id and crawling s.weibo.com by keyword.

________________________________________________________________

## HOW TO USE ##

Take the package [com.zoe.crawler.implementation] as an example, and replace the class DownloadImp, the URL seeds, and the filter rules to customize your own mini crawler.

__________________________________________________________________

## DEPENDENCY ##

The .jar files required by the packages are included in the lib directory.

______________________________________________________________________

## BUG REPORT && SUGGESTIONS ##

+ zchgeek@gmail.com
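
As an illustration of the Bloom filter URL recheck mentioned for [com.zoe.crawler.implementation], here is a minimal sketch. The repository's own filter is not shown in this README, so `UrlBloomFilter`, `addIfAbsent`, and `mightContain` are hypothetical names. The sketch also assumes that "recheck" means skipping URLs that have already been queued. It uses a `java.util.BitSet` with double hashing. A Bloom filter can report a false positive (a new URL treated as seen) but never a false negative.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Hypothetical Bloom filter for URL de-duplication; not the repository's class. */
class UrlBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    UrlBloomFilter(int size, int hashCount) {
        if (size <= 0 || hashCount <= 0) {
            throw new IllegalArgumentException("size and hashCount must be positive");
        }
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** Records the URL; returns true only if it was definitely not seen before. */
    boolean addIfAbsent(String url) {
        int h1 = fnv1a(url);
        int h2 = url.hashCode() | 1; // odd step for double hashing
        boolean isNew = false;
        for (int i = 0; i < hashCount; i++) {
            int idx = index(h1, h2, i);
            if (!bits.get(idx)) {
                isNew = true;
                bits.set(idx);
            }
        }
        return isNew;
    }

    /** Returns false if the URL was never added; true means "probably added". */
    boolean mightContain(String url) {
        int h1 = fnv1a(url);
        int h2 = url.hashCode() | 1;
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(h1, h2, i))) {
                return false;
            }
        }
        return true;
    }

    /** i-th bit position derived from two base hashes (integer overflow is harmless here). */
    private int index(int h1, int h2, int i) {
        return Math.floorMod(h1 + i * h2, size);
    }

    /** 32-bit FNV-1a hash over the UTF-8 bytes of the string. */
    private static int fnv1a(String s) {
        int hash = 0x811C9DC5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return hash;
    }

    public static void main(String[] args) {
        UrlBloomFilter seen = new UrlBloomFilter(1 << 16, 3);
        String url = "http://news.sina.com.cn/";
        System.out.println(seen.addIfAbsent(url)); // first visit: true
        System.out.println(seen.addIfAbsent(url)); // duplicate: false
    }
}
```

In a crawl loop, a URL would be queued only when `addIfAbsent` returns true. The bit array size and number of hashes trade memory against the false positive rate.
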
