html-extractor

Category: Web development
Development tool: PHP-PERL
File size: 5KB
Downloads: 11
Upload date: 2010-12-23 17:00:37
Uploader: xjtdy888
Description: This is a release of HTMLExtractor, a program that extracts the main text from HTML pages. It relies mainly on content statistics and has no self-learning capability for now; it is purely an analysis program. Other people online have implemented main-text extractors, but most of them treat the code as a treasure and will not publish it in full. A few have released simple implementations, but their analysis and recognition are not very good, so I wrote a simple one myself. I originally wanted to use the PHP DOM parser, but most web pages are not well-formed and a missing tag here and there is normal, so I built my own simple wheel for parsing HTML tags. Its functionality is fairly basic: every element produces an object, so memory usage is relatively high, but my goal here was only a working implementation and I did no optimization. I am not building an application, so please do not ask me to adapt it to your business (people used to add me on QQ to ask how to modify my examples, which left me speechless). If you like it, you are welcome to develop and improve it with me. One more note: because it was written in a hurry, the classes are still fairly tightly coupled; I will clean that up later.
Project code: http://code.google.com/p/html-extractor/
Online demo: http://dev.psm01.cn/c/html-extractor.php
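
To make the content-statistics idea concrete, here is a minimal sketch in Python. It is not the author's PHP code, and html-extractor.php may work quite differently. Everything in it is an assumption made for illustration: the names Node, DensityExtractor and extract_main_text, the set of container tags, and the scoring rule (plain-text characters minus link-text characters). It also keeps one object per container element rather than one per element. Missing close tags are tolerated by popping the open-tag stack back to the nearest match.

```python
from html.parser import HTMLParser

# Assumed tag sets for this sketch; a real extractor would tune these.
CONTAINERS = {"div", "td", "article", "section", "body"}
SKIPPED = {"script", "style"}
VOID = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One object per container element, holding its text statistics."""

    def __init__(self, tag):
        self.tag = tag
        self.text = []       # non-link text chunks owned by this container
        self.link_chars = 0  # characters of text found inside <a> elements

    def score(self):
        # Plain text minus link text: navigation-heavy blocks score low.
        return sum(len(t) for t in self.text) - self.link_chars


class DensityExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("root")
        self.nodes = [self.root]
        # Each entry is (tag name, nearest enclosing container node).
        self.stack = [("root", self.root)]

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return  # void elements never get a close tag
        owner = self.stack[-1][1]
        if tag in CONTAINERS:
            owner = Node(tag)
            self.nodes.append(owner)
        self.stack.append((tag, owner))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; a stray close tag
        # that matches nothing is ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if not text:
            return
        open_tags = [t for t, _ in self.stack]
        if any(t in SKIPPED for t in open_tags):
            return
        owner = self.stack[-1][1]
        if "a" in open_tags:
            owner.link_chars += len(text)
        else:
            owner.text.append(text)


def extract_main_text(html):
    """Return the text of the container with the highest score."""
    parser = DensityExtractor()
    parser.feed(html)
    parser.close()
    best = max(parser.nodes, key=lambda n: n.score())
    return " ".join(best.text)
```

Calling extract_main_text(page_source) on a page with a link-heavy navigation block and one text-heavy block should return the text of the latter. Text is credited only to the nearest enclosing container, so an article split across several sibling containers would be returned only in part.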

File list:
html-extractor.php (15168, 2010-12-23)
