spiderq-master
Category: Search engine
Development tool: C/C++
File size: 31KB
Downloads: 7
Upload date: 2013-12-28 18:32:19
Uploader: jackstephie
Description: Web crawler, a rough implementation in C
File list:
LICENSE.md (1571, 2013-01-11)
Makefile (159, 2013-01-11)
modules (0, 2013-01-11)
modules\Makefile (1163, 2013-01-11)
modules\domainlimit.cpp (1900, 2013-01-11)
modules\headerfilter.cpp (742, 2013-01-11)
modules\maxdepth.cpp (335, 2013-01-11)
modules\savehtml.cpp (856, 2013-01-11)
modules\saveimage.cpp (2406, 2013-01-11)
spiderq.conf (1899, 2013-01-11)
src (0, 2013-01-11)
src\Makefile (823, 2013-01-11)
src\bloomfilter.cpp (2530, 2013-01-11)
src\bloomfilter.h (166, 2013-01-11)
src\confparser.cpp (2799, 2013-01-11)
src\confparser.h (853, 2013-01-11)
src\crc32.cpp (991, 2013-01-11)
src\crc32.h (97, 2013-01-11)
src\dso.cpp (736, 2013-01-11)
src\dso.h (1289, 2013-01-11)
src\hashs.cpp (985, 2013-01-11)
src\hashs.h (257, 2013-01-11)
src\md5.cpp (6329, 2013-01-11)
src\md5.h (1179, 2013-01-11)
src\qstring.cpp (1756, 2013-01-11)
src\qstring.h (571, 2013-01-11)
src\sha1.cpp (9759, 2013-01-11)
src\sha1.h (348, 2013-01-11)
src\socket.cpp (6067, 2013-01-11)
src\socket.h (641, 2013-01-11)
src\spider.cpp (7221, 2013-01-11)
src\spider.h (1010, 2013-01-11)
src\threads.cpp (1392, 2013-01-11)
src\threads.h (237, 2013-01-11)
src\url.cpp (9064, 2013-01-11)
src\url.h (1127, 2013-01-11)
What is spiderq?
================
Spiderq is a web spider by Qteqpid that crawls web pages (HTML). Its performance depends on your server configuration and network. I will continue to maintain it and have listed some TODOs at the end of this file. More people are welcome to join!
Building spiderq
================
Spiderq can be compiled and used on CentOS 5.8.
Building it is as simple as:
% make
% make install
You will then get an executable file named spider. After configuring spiderq.conf, run the program:
% ./spider
For more information, see the Makefile.
Contact
================
If you have any questions, feel free to contact me at any time. Enjoy!
mailto: qteqpid
blog: http://hi.baidu.com/qteqpid_pku
TODO
===============
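@Thread pool
@Signal handling
@De-duplication of page content
@Crawl interval between requests to the same IP
@Store pages in a hierarchical structure
@Whether to obey robots.txt
@Support update crawling without re-fetching pages
@Define an external API and an HTML class so users can customize HTML processing, loaded dynamically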
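
The file list includes src\bloomfilter.cpp, and one TODO item concerns not fetching the same page twice. As an illustration only, here is a minimal sketch of how a Bloom filter could track visited URLs. It is not taken from spiderq's sources: the class name `UrlBloomFilter`, the method `test_and_add`, the default sizes, and the FNV-1a based double hashing are all assumptions made for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical URL de-duplication filter; NOT the code in src/bloomfilter.cpp.
class UrlBloomFilter {
public:
    // bits:   size of the bit array (assumed > 0)
    // hashes: number of probe positions per URL (assumed >= 1)
    explicit UrlBloomFilter(std::size_t bits = 1u << 20, unsigned hashes = 4)
        : bits_(bits, false), hashes_(hashes) {}

    // Marks the URL as seen.
    // Returns false if the URL is definitely new, true if it was possibly
    // seen before (a Bloom filter can give false positives, never false
    // negatives).
    bool test_and_add(const std::string& url) {
        // Two FNV-1a passes with different seeds give h1 and h2; the i-th
        // probe position is derived as h1 + i * h2 (double hashing).
        const std::uint64_t h1 = fnv1a(url, 14695981039346656037ULL);
        const std::uint64_t h2 = fnv1a(url, 0x9E3779B97F4A7C15ULL) | 1ULL;
        bool seen = true;
        for (unsigned i = 0; i < hashes_; ++i) {
            const std::size_t pos =
                static_cast<std::size_t>((h1 + i * h2) % bits_.size());
            if (!bits_[pos]) {
                seen = false;      // at least one bit was unset: URL is new
                bits_[pos] = true;
            }
        }
        return seen;
    }

private:
    // FNV-1a over the bytes of s, starting from the given seed value.
    static std::uint64_t fnv1a(const std::string& s, std::uint64_t h) {
        for (unsigned char c : s) {
            h ^= c;
            h *= 1099511628211ULL;
        }
        return h;
    }

    std::vector<bool> bits_;
    unsigned hashes_;
};
```

A crawler loop could call `test_and_add(url)` before queuing a link and skip the link when the call returns true. The trade-off is that a small fraction of unseen URLs may be skipped because of false positives, in exchange for constant memory use.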