ContentAnalyzer

Category: Search engine
Development tool: C#
File size: 71KB
Downloads: 176
Upload date: 2009-03-11 01:18:27
Uploader: bloodxia
Description: A main-content extraction program for search engines. It uses HTML analysis and regular expressions to strip the HTML markup and keep the main text of a web page. As written it works only on Chinese pages; English pages can be supported with minor modifications.
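
The description mentions stripping HTML markup with regular expressions so that only the page text remains. The following is a minimal sketch of that general idea, not the code shipped in this package: the class name HtmlTextSketch, the method StripTags, and every pattern in it are assumptions made for illustration. It uses only the standard System.Text.RegularExpressions and System.Net.WebUtility APIs.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Net.LikeShow.ContentAnalyze.
public static class HtmlTextSketch
{
    // <script> and <style> elements, including everything between the tags.
    private static readonly Regex ScriptOrStyle = new Regex(
        @"<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // HTML comments, which may span several lines.
    private static readonly Regex Comment = new Regex(
        @"<!--.*?-->", RegexOptions.Singleline);

    // Any remaining opening or closing tag.
    private static readonly Regex Tag = new Regex(@"<[^>]+>");

    // Runs of whitespace left behind by the removals.
    private static readonly Regex Whitespace = new Regex(@"\s+");

    // Returns the visible text of an HTML string with all markup removed.
    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        string text = ScriptOrStyle.Replace(html, " "); // drop code and CSS first
        text = Comment.Replace(text, " ");              // then comments
        text = Tag.Replace(text, " ");                  // then every other tag
        text = WebUtility.HtmlDecode(text);             // &amp; becomes &, and so on
        return Whitespace.Replace(text, " ").Trim();    // normalize spacing
    }
}
```

For example, HtmlTextSketch.StripTags("<p>你好 &amp; <b>世界</b></p>") returns "你好 & 世界". A plain tag stripper like this keeps navigation and footer text as well, so a real extractor needs further rules to isolate the main content.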

File list:
_UpgradeReport_Files\UpgradeReport.css (3348, 2009-03-02)
_UpgradeReport_Files\UpgradeReport.xslt (12579, 2007-06-27)
_UpgradeReport_Files\UpgradeReport_Minus.gif (69, 2009-03-02)
_UpgradeReport_Files\UpgradeReport_Plus.gif (71, 2009-03-02)
obj\TestAnalyzer.csproj.FileListAbsolute.txt (2012, 2008-05-24)
obj\Release\ResolveAssemblyReference.cache (6169, 2008-05-23)
obj\Release\TestAnalyzer.exe (16384, 2008-05-23)
obj\Debug\ResolveAssemblyReference.cache (6182, 2009-03-02)
obj\Debug\TestAnalyzer.csproj.FileListAbsolute.txt (713, 2009-03-03)
obj\Debug\TestAnalyzer.exe (6144, 2009-03-02)
obj\Debug\TestAnalyzer.pdb (13824, 2009-03-02)
Properties\AssemblyInfo.cs (1171, 2008-05-23)
release\HtmlAgilityPack.dll (90112, 2008-05-24)
release\Net.LikeShow.ContentAnalyze.dll (36864, 2008-05-24)
release\TestAnalyzer.exe (6144, 2009-03-02)
release\TestAnalyzer.pdb (13824, 2009-03-02)
release\TestAnalyzer.vshost.exe (14328, 2009-03-03)
release\TestAnalyzer.vshost.exe.manifest (490, 2007-07-21)
Program.cs (3836, 2009-03-02)
TestAnalyzer.csproj (2241, 2009-03-02)
TestAnalyzer.sln (913, 2009-03-02)
TestAnalyzer.suo (16896, 2009-03-03)
UpgradeLog.XML (1095, 2009-03-02)
obj\Release\TempPE (0, 2009-03-03)
obj\Release\Refactor (0, 2009-03-03)
obj\Debug\TempPE (0, 2009-03-03)
bin\Debug (0, 2009-03-03)
obj\Release (0, 2009-03-03)
obj\Debug (0, 2009-03-03)
_UpgradeReport_Files (0, 2009-03-03)
bin (0, 2009-03-03)
obj (0, 2009-03-03)
Properties (0, 2009-03-03)
release (0, 2009-03-03)

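*****************************************************************
* Net.LikeShow.ContentAnalyze
* code by King
* http://www.likeshow.net
* qq:5088300 MSN:yy_8354@hotmail.com
* Main-content extraction class. Provides basic analysis and extraction of a web page's main content and returns the title, publication time, body text, and content type.
* Content types: news, bbs, blogs
* The component's internal algorithm mainly applies rule-model extraction, and nearly all rules are implemented with regular expressions. For the specific expressions, see the posts "Regular Expressions for Main-Content Extraction" (《正*文抽取正则》) and "A Chat About Web Page Main-Content Extraction" (《聊聊网页正文抽取》) on my blog.
  http://www.likeshow.net/article.asp?id=60
  http://www.likeshow.net/article.asp?id=55
*****************************************************************

Test code:

using System;
using System.Diagnostics;
using Net.LikeShow.ContentAnalyze;
using Net.LikeShow.ContentAnalyze.DataClass;

// str is the page's HTML source and url is its address; both are supplied by the caller.
// The original snippet stops this timer without showing where it starts, so starting it here is an assumption.
Stopwatch watch = Stopwatch.StartNew();

Html myhtml = new Html();
myhtml.Web = str;
myhtml.Url = url;
CommonAnalyze ca = new CommonAnalyze();
ca.LoadHtml(myhtml);
Document doc = ca.GetResult();
watch.Stop();

Console.WriteLine(url);
Console.WriteLine("\r\n");
Console.WriteLine(doc.Title); // title
Console.WriteLine("\r\n");
Console.WriteLine(doc.UpTime); // publication time
Console.WriteLine("\r\n");
Console.WriteLine(doc.Doc); // body text
Console.WriteLine("\r\n");
Console.WriteLine(doc.SiteType); // content type
Console.WriteLine("\r\n");
Console.WriteLine(watch.Elapsed); // elapsed analysis time
Console.WriteLine("\r\n");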
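
The header comment says the component extracts the title and publication time with rules written as regular expressions. The actual rules live in the author's blog posts and the compiled DLL, so the sketch below only illustrates what such rules could look like. The class RuleSketch, its methods, and both patterns are hypothetical and should not be read as the component's own expressions.

```csharp
using System;
using System.Globalization;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical rule set, not taken from Net.LikeShow.ContentAnalyze.
public static class RuleSketch
{
    // Rule 1 (assumed): the page title is the content of the <title> element.
    private static readonly Regex TitleRule = new Regex(
        @"<title\b[^>]*>(?<title>.*?)</title\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Rule 2 (assumed): the first timestamp shaped like
    // "2009-03-11 01:18:27" or "2009年3月2日 9:05" is the publication time.
    private static readonly Regex TimeRule = new Regex(
        @"(?<y>[0-9]{4})[-/年](?<mo>[0-9]{1,2})[-/月](?<d>[0-9]{1,2})日?\s*" +
        @"(?<h>[0-9]{1,2}):(?<mi>[0-9]{2})(:(?<s>[0-9]{2}))?");

    // Returns the decoded, trimmed title, or an empty string if none is found.
    public static string ExtractTitle(string html)
    {
        Match m = TitleRule.Match(html ?? string.Empty);
        return m.Success
            ? WebUtility.HtmlDecode(m.Groups["title"].Value).Trim()
            : string.Empty;
    }

    // Returns the first timestamp that forms a valid date, or null.
    public static DateTime? ExtractTime(string html)
    {
        Match m = TimeRule.Match(html ?? string.Empty);
        if (!m.Success) return null;

        int seconds = m.Groups["s"].Success ? Num(m, "s") : 0;
        try
        {
            return new DateTime(Num(m, "y"), Num(m, "mo"), Num(m, "d"),
                                Num(m, "h"), Num(m, "mi"), seconds);
        }
        catch (ArgumentOutOfRangeException)
        {
            return null; // digits matched but do not form a real date
        }
    }

    // Parses a named group as an integer (the groups hold ASCII digits only).
    private static int Num(Match m, string group)
    {
        return int.Parse(m.Groups[group].Value, CultureInfo.InvariantCulture);
    }
}
```

Calling RuleSketch.ExtractTitle and RuleSketch.ExtractTime on the same HTML string gives the two fields independently. Keeping each rule as a separate named pattern makes it easy to add site-specific variants, which is presumably how a rule model distinguishes news, bbs, and blog pages.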
