spider

Category: Java programming
Development tool: Java
File size: 2228KB
Downloads: 31
Upload date: 2010-07-04 23:51:08
Uploader: lengyuzhong00
Description: A web crawler that, starting from seed pages, fetches the pages they link to.
(spider)

File list:
spider\.classpath (542, 2008-10-31)
spider\.project (388, 2008-11-05)
spider\bin\com\mingzi\participle\LikeMap.class (3286, 2008-11-05)
spider\bin\com\mingzi\participle\Participle.class (2582, 2008-11-05)
spider\bin\com\mingzi\participle\StringParticiple.class (1142, 2008-11-05)
spider\bin\com\mingzi\spider\Check.class (4042, 2008-11-05)
spider\bin\com\mingzi\spider\DBOperator.class (2753, 2008-11-05)
spider\bin\com\mingzi\spider\DBOperatorToMySql.class (2768, 2008-11-05)
spider\bin\com\mingzi\spider\Dictionary.class (1640, 2008-11-05)
spider\bin\com\mingzi\spider\Spider.class (8419, 2008-11-05)
spider\bin\com\mingzi\spider\UrlNode.class (798, 2008-11-05)
spider\sql driver\sqljdbc_1.0\chs\help\default.htm (562, 2006-04-12)
spider\sql driver\sqljdbc_1.0\chs\help\html\0041f9e1-09b6-4487-b052-afd636c8e89a.htm (7550, 2006-04-12)
spider\sql driver\sqljdbc_1.0\chs\help\html\00f9e25a-088e-4ac6-aa75-43eacace8f03.htm (11356, 2006-04-12)
spider\sql driver\sqljdbc_1.0\chs\help\html\027edab7-9b5c-4f5f-9469-fe00cf7798b6.htm (4948, 2006-04-12)
spider\sql driver\sqljdbc_1.0\chs\help\html\028b8d61-9557-4c9f-b732-29e87a962de8.htm (5644, 2006-04-12)
spider\sql driver\sqljdbc_1.0\chs\help\html\030ad599-0431-4242-9428-e9ead7b75b1d.htm (5255, 2006-04-12)
spider\sql driver\sqljdbc_1.0\chs\help\html\0330ca1d-5e24-4ce3-9d2a-b931f20a0fcf.htm (5566, 2006-04-12)
spider\sql driver\sqljdbc_1.0\chs\help\html\03649d56-3319-4867-bef1-559dfd221b8b.htm (4345, 2006-04-12)
spider\sql driver\sqljdbc_1.0\chs\help\html\04178645-915f-4569-8907-d45e299bbe7d.htm (8476, 2006-04-12)
spider\sql driver\sqljdbc_1.0\chs\help\html\048fe245-157f-4fd8-be75-ce54b83e02b3.htm (5816, 2006-04-12)
spider\sql driver\sqljdbc_1.0\chs\help\html\04d36a25-7f95-4675-9690-4462671b3d67.htm (6225, 2006-04-12)
spider\sql driver\sqljdbc_1.0\chs\help\html\04ea83b2-db5e-4b46-b016-9e496363827e.htm (5132, 2006-04-12)
spider\sql driver\sqljdbc_1.0\chs\help\html\04eebd6a-016f-4462-82f5-ab34b945eec4.htm (5000, 2006-04-12)
spider\sql driver\sqljdbc_1.0\chs\help\html\050548ca-c708-4224-8014-8b7830a860dd.htm (8163, 2006-04-12)
spider\sql driver\sqljdbc_1.0\chs\help\html\053549ee-2018-47ab-9538-789dac2b150a.htm (5446, 2006-04-12)
spider\sql driver\sqljdbc_1.0\chs\help\html\05bdb61f-26e8-480f-a1c1-1e46a8ed4b70.htm (5302, 2006-04-12)
spider\sql driver\sqljdbc_1.0\chs\help\html\0610d667-a16d-4201-a14b-0a40048911e1.htm (6064, 2006-04-12)
spider\sql driver\sqljdbc_1.0\chs\help\html\06225a1a-a58d-4bff-b2ef-be303f051644.htm (5321, 2006-04-12)
spider\sql driver\sqljdbc_1.0\chs\help\html\069c68ff-442d-4104-917f-3445a3ad264a.htm (5959, 2006-04-12)
spider\sql driver\sqljdbc_1.0\chs\help\html\079c7eb7-71e4-4109-83de-f6d785433c95.htm (7860, 2006-04-12)
spider\sql driver\sqljdbc_1.0\chs\help\html\08223a62-6489-44e4-85e8-b45bfbb11cfc.htm (5384, 2006-04-12)
spider\sql driver\sqljdbc_1.0\chs\help\html\085461de-367b-4832-88aa-010813d2bc41.htm (6322, 2006-04-12)
spider\sql driver\sqljdbc_1.0\chs\help\html\08cfc4e0-83f0-4f2f-ac55-b381f34fe67f.htm (5811, 2006-04-12)
spider\sql driver\sqljdbc_1.0\chs\help\html\093f6c3b-49a6-4043-9993-bd0482de04dd.htm (5459, 2006-04-12)
spider\sql driver\sqljdbc_1.0\chs\help\html\09491a8a-1931-411e-9b35-2b269c1b7f12.htm (5861, 2006-04-12)
spider\sql driver\sqljdbc_1.0\chs\help\html\097434fd-2b74-411c-a5ed-eba04481dde5.htm (4489, 2006-04-12)
spider\sql driver\sqljdbc_1.0\chs\help\html\099dd0bf-b017-479d-9696-f5b06f4c6bf9.htm (6876, 2006-04-12)
spider\sql driver\sqljdbc_1.0\chs\help\html\09dca1f9-225a-4acb-9857-9a947e0829be.htm (5764, 2006-04-12)
... ...

1. Prerequisite: because the crawler uses word segmentation, it must load a dictionary file, which the program expects at c:\dictionary\dictionary.txt. The dictionary is assumed to list words in order from longest to shortest.
2. Crawled page content is stored in a SQL Server 2005 database named javadatabase, which contains a table named webinfo with the columns: id (int), url (varchar(50)), title (varchar(max)), body (text), bodyppl (text).
3. The program uses the forward maximum matching method, a segmentation approach based on string matching.
4. The crawler traverses the web breadth-first and only crawls resources related to "河北大学" (Hebei University). The URLs of pages already visited are kept in a queue so that the same page is never fetched twice; pages waiting to be crawled are kept in a linked list, each node of which holds the URL of a pending page.
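The forward maximum matching mentioned in point 3 can be sketched as follows. This is a minimal illustration, not the project's actual Participle/StringParticiple code; the small in-memory dictionary and the maxLen parameter stand in for the words the program would load from c:\dictionary\dictionary.txt.

```java
import java.util.*;

public class ForwardMaxMatch {
    // Segment `text` greedily from left to right: at each position take the
    // longest dictionary word that matches, falling back to one character.
    static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(i + maxLen, text.length());
            String match = null;
            // Try the longest candidate first, shrinking one char at a time.
            for (int j = end; j > i; j--) {
                String cand = text.substring(i, j);
                if (dict.contains(cand)) { match = cand; break; }
            }
            if (match == null) match = text.substring(i, i + 1);
            result.add(match);
            i += match.length();
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(
                Arrays.asList("河北大学", "河北", "大学", "网络", "爬虫"));
        // Longest match wins: "河北大学" is taken whole, not split into "河北"+"大学".
        System.out.println(segment("河北大学网络爬虫", dict, 4));
    }
}
```

This greedy longest-first scan is why the description notes the dictionary is ordered from longer words to shorter: the real program can stop at the first hit when probing candidates in that order.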
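The breadth-first traversal and duplicate check from point 4 can be sketched like this. It is an illustration under assumptions, not the project's Spider class: the link map stands in for real page downloading and link extraction, and a single visited set plays the role of the visited-URL store.

```java
import java.util.*;

public class BfsCrawl {
    // Returns URLs in the order a breadth-first crawl would visit them,
    // starting from `seed`, fetching at most `limit` pages.
    static List<String> crawlOrder(String seed,
                                   Map<String, List<String>> links, int limit) {
        List<String> order = new ArrayList<>();
        Set<String> visited = new HashSet<>();   // already-seen URLs
        Deque<String> pending = new ArrayDeque<>(); // FIFO of pages to crawl
        pending.add(seed);
        visited.add(seed);
        while (!pending.isEmpty() && order.size() < limit) {
            String url = pending.poll();
            order.add(url); // "fetch" the page
            for (String next : links.getOrDefault(url, Collections.emptyList())) {
                if (visited.add(next)) { // add() returns false for duplicates
                    pending.add(next);   // enqueue only unseen URLs
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("a", Arrays.asList("b", "c"));
        links.put("b", Arrays.asList("c", "d")); // "c" seen twice, crawled once
        System.out.println(crawlOrder("a", links, 10)); // [a, b, c, d]
    }
}
```

The FIFO queue of pending URLs is what makes the search breadth-first (all pages at one link depth are fetched before any deeper page), and the visited set is what prevents the same page from being crawled twice.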
