spiderq-master

Category: Search engine
Development tools: C/C++
File size: 31KB
Downloads: 7
Upload date: 2013-12-28 18:32:19
Uploader: jackstephie
Description: Web crawler, a rough implementation in C

File list:
LICENSE.md (1571, 2013-01-11)
Makefile (159, 2013-01-11)
modules (0, 2013-01-11)
modules\Makefile (1163, 2013-01-11)
modules\domainlimit.cpp (1900, 2013-01-11)
modules\headerfilter.cpp (742, 2013-01-11)
modules\maxdepth.cpp (335, 2013-01-11)
modules\savehtml.cpp (856, 2013-01-11)
modules\saveimage.cpp (2406, 2013-01-11)
spiderq.conf (1899, 2013-01-11)
src (0, 2013-01-11)
src\Makefile (823, 2013-01-11)
src\bloomfilter.cpp (2530, 2013-01-11)
src\bloomfilter.h (166, 2013-01-11)
src\confparser.cpp (2799, 2013-01-11)
src\confparser.h (853, 2013-01-11)
src\crc32.cpp (991, 2013-01-11)
src\crc32.h (97, 2013-01-11)
src\dso.cpp (736, 2013-01-11)
src\dso.h (1289, 2013-01-11)
src\hashs.cpp (985, 2013-01-11)
src\hashs.h (257, 2013-01-11)
src\md5.cpp (6329, 2013-01-11)
src\md5.h (1179, 2013-01-11)
src\qstring.cpp (1756, 2013-01-11)
src\qstring.h (571, 2013-01-11)
src\sha1.cpp (9759, 2013-01-11)
src\sha1.h (348, 2013-01-11)
src\socket.cpp (6067, 2013-01-11)
src\socket.h (641, 2013-01-11)
src\spider.cpp (7221, 2013-01-11)
src\spider.h (1010, 2013-01-11)
src\threads.cpp (1392, 2013-01-11)
src\threads.h (237, 2013-01-11)
src\url.cpp (9064, 2013-01-11)
src\url.h (1127, 2013-01-11)

What is spiderq?
================

Spiderq is a web spider by Qteqpid that crawls web pages (HTML). Its performance depends on your server configuration and network. I will continue to maintain it, and I have listed some TODOs at the end of this file. More people are welcome to join!

Building spiderq
================

Spiderq can be compiled and used on CentOS 5.8. It is as simple as:

    % make
    % make install

You will then have an executable file named spider. After configuring spiderq.conf, run the program:

    % ./spider

For more information, see the Makefile.

Contact
================

If you have any questions, contact me at any time. Enjoy!

mailto: qteqpid
blog: http://hi.baidu.com/qteqpid_pku

TODO
===============

@ Thread pool
@ Signal handling
@ De-duplication of page content
@ Spaced-out fetches to the same IP
@ Hierarchical storage of downloaded pages
@ Option for whether to obey robots.txt
@ Support for update crawls that do not refetch pages already crawled
@ A public API and an HTML class, dynamically loaded, so users can customize how HTML is processed
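
The file list includes src\bloomfilter.cpp, and the TODO mentions avoiding repeated fetches, which suggests that already-seen URLs are tracked with a Bloom filter. The project's actual code is not shown here, so the following is only a minimal sketch of that general technique. The class name `UrlBloomFilter`, its method `seen_or_add`, and the choice of FNV-1a with double hashing are all assumptions made for illustration, not the interface declared in src\bloomfilter.h.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch of a Bloom filter for URL de-duplication.
// This is NOT the interface from src/bloomfilter.h.
class UrlBloomFilter {
public:
    // bits: size of the bit array (forced to at least 1);
    // hashes: number of bit positions probed per URL.
    UrlBloomFilter(std::size_t bits, unsigned hashes)
        : bits_(bits == 0 ? 1 : bits, false), hashes_(hashes) {}

    // Returns true if the URL was possibly seen before (false positives
    // are possible). Otherwise records the URL and returns false.
    bool seen_or_add(const std::string& url) {
        // Two FNV-1a passes with different seeds, combined by double
        // hashing: index_i = (h1 + i * h2) mod size.
        const std::uint64_t h1 = fnv1a(url, 14695981039346656037ULL);
        const std::uint64_t h2 = fnv1a(url, 0x9e3779b97f4a7c15ULL) | 1ULL;
        bool all_set = true;
        for (unsigned i = 0; i < hashes_; ++i) {
            const std::size_t idx =
                static_cast<std::size_t>((h1 + i * h2) % bits_.size());
            if (!bits_[idx]) {
                all_set = false;
                bits_[idx] = true;
            }
        }
        return all_set;
    }

private:
    // FNV-1a over the bytes of s, starting from the given seed.
    static std::uint64_t fnv1a(const std::string& s, std::uint64_t seed) {
        std::uint64_t h = seed;
        for (unsigned char c : s) {
            h ^= c;
            h *= 1099511628211ULL;
        }
        return h;
    }

    std::vector<bool> bits_;
    unsigned hashes_;
};
```

A crawler loop could call `seen_or_add(url)` before queuing each extracted link and skip the link when it returns true. The trade-off is that a Bloom filter never misses a URL it has recorded but may occasionally report an unseen URL as seen, so a small fraction of pages could be skipped unless the bit array is sized generously.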
