htmlparser1_6_20060319

Category: Java programming
Development tool: Java
File size: 6823 KB
Downloads: 8
Upload date: 2009-04-27 22:28:27
Uploader: chrishow
Description: This program extracts and analyzes information from web pages, with functionality similar to a web crawler.

File list:
htmlparser1_6_20060319\htmlparser1_6\bin\beanybaby (1653, 2004-01-03)
htmlparser1_6_20060319\htmlparser1_6\bin\beanybaby.cmd (1784, 2005-04-10)
htmlparser1_6_20060319\htmlparser1_6\bin\filterbuilder (1316, 2005-02-13)
htmlparser1_6_20060319\htmlparser1_6\bin\filterbuilder.cmd (2027, 2005-04-10)
htmlparser1_6_20060319\htmlparser1_6\bin\lexer (1646, 2003-09-22)
htmlparser1_6_20060319\htmlparser1_6\bin\lexer.cmd (1773, 2005-04-10)
htmlparser1_6_20060319\htmlparser1_6\bin\linkextractor (1670, 2003-12-30)
htmlparser1_6_20060319\htmlparser1_6\bin\linkextractor.cmd (1805, 2005-04-10)
htmlparser1_6_20060319\htmlparser1_6\bin\parser (1644, 2003-10-25)
htmlparser1_6_20060319\htmlparser1_6\bin\parser.cmd (1772, 2005-04-10)
htmlparser1_6_20060319\htmlparser1_6\bin\sitecapturer (1669, 2005-04-10)
htmlparser1_6_20060319\htmlparser1_6\bin\sitecapturer.cmd (1806, 2005-04-10)
htmlparser1_6_20060319\htmlparser1_6\bin\stringextractor (1672, 2004-01-03)
htmlparser1_6_20060319\htmlparser1_6\bin\stringextractor.cmd (1809, 2005-04-10)
htmlparser1_6_20060319\htmlparser1_6\bin\thumbelina (1305, 2005-02-13)
htmlparser1_6_20060319\htmlparser1_6\bin\thumbelina.cmd (2000, 2005-04-10)
htmlparser1_6_20060319\htmlparser1_6\bin\translate (1652, 2004-02-08)
htmlparser1_6_20060319\htmlparser1_6\bin\translate.cmd (1783, 2005-04-10)
htmlparser1_6_20060319\htmlparser1_6\docs\articles\index.html (464, 2004-01-03)
htmlparser1_6_20060319\htmlparser1_6\docs\articles\quest.html (3632, 2004-01-03)
htmlparser1_6_20060319\htmlparser1_6\docs\bug.html (1400, 2006-03-19)
htmlparser1_6_20060319\htmlparser1_6\docs\changes.txt (50152, 2006-03-19)
htmlparser1_6_20060319\htmlparser1_6\docs\contributors.html (20917, 2006-03-19)
htmlparser1_6_20060319\htmlparser1_6\docs\htmlparser.jpg (10072, 2004-05-31)
htmlparser1_6_20060319\htmlparser1_6\docs\htmlparserlogo.jpg (4405, 2004-05-31)
htmlparser1_6_20060319\htmlparser1_6\docs\index.html (806, 2004-06-02)
htmlparser1_6_20060319\htmlparser1_6\docs\javadoc\allclasses-frame.html (22393, 2006-03-19)
htmlparser1_6_20060319\htmlparser1_6\docs\javadoc\allclasses-noframe.html (19293, 2006-03-19)
htmlparser1_6_20060319\htmlparser1_6\docs\javadoc\constant-values.html (33900, 2006-03-19)
htmlparser1_6_20060319\htmlparser1_6\docs\javadoc\deprecated-list.html (16941, 2006-03-19)
htmlparser1_6_20060319\htmlparser1_6\docs\javadoc\doc-files\building.html (5401, 2006-03-19)
htmlparser1_6_20060319\htmlparser1_6\docs\javadoc\doc-files\overview.html (4998, 2006-03-19)
htmlparser1_6_20060319\htmlparser1_6\docs\javadoc\doc-files\using.html (4536, 2006-03-19)
htmlparser1_6_20060319\htmlparser1_6\docs\javadoc\help-doc.html (10188, 2006-03-19)
htmlparser1_6_20060319\htmlparser1_6\docs\javadoc\index-all.html (723400, 2006-03-19)
htmlparser1_6_20060319\htmlparser1_6\docs\javadoc\index.html (1322, 2006-03-19)
htmlparser1_6_20060319\htmlparser1_6\docs\javadoc\org\htmlparser\Attribute.html (56377, 2006-03-19)
htmlparser1_6_20060319\htmlparser1_6\docs\javadoc\org\htmlparser\beans\BeanyBaby.html (97531, 2006-03-19)
htmlparser1_6_20060319\htmlparser1_6\docs\javadoc\org\htmlparser\beans\class-use\BeanyBaby.html (6257, 2006-03-19)
htmlparser1_6_20060319\htmlparser1_6\docs\javadoc\org\htmlparser\beans\class-use\FilterBean.html (8696, 2006-03-19)
... ...

HTMLParser Version 1.6 (Integration Build Mar 19, 2006)
*********************************************

Contents of the distribution
----------------------------
(i) jar files - lib directory
    HTML Parser jars: htmlparser.jar, lexer.jar, thumbelina.jar and filterbuilder.jar.
    Also thirdparty jar files checkstyle-all-3.1.jar, fit.jar and junit.jar.
(ii) source code - src.zip
    Also contains necessary resources, and build file. Unzip this and you should be all set to build the parser from its source. You would need Jakarta Ant installed.
(iii) documentation - docs directory (includes javadocs)
    Point your browser at index.html in the docs directory.
(iv) executing scripts - bin directory
    Batch/script files assume that Java is visible in your path. Most require Java 1.2 (or upwards), except for lexer.
(v) license.txt (GNU Lesser General Public License)
(vi) this file, readme.txt

Changes since Version 1.5
-------------------------

New Functionality
-----------------
Support has been added for commonly requested composite tags, P and H1-H6.
Definition list tags (dl, dt, dd), are also now included in the standard set of tags recognized by the parser.
The node interface has been augmented with get first/last child and get previous/next sibling methods to ease traversing the HTML document.
The TextNode class has an added isWhiteSpace method that returns true when it contains no printable characters.
NodeTreeWalker, a utility class to traverse a tree of Node objects using either depth-first or breadth-first tree order has been added.

Refactoring
-----------
The FilterBean now has a 'recursive' property to control descent through children when applying filters.
The NodeList class is a little more standard now with a remove(node) method.
Some refactoring to allow the htmllexer jar file to be compiled by gcj.
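
The two tree orders named in the NodeTreeWalker note above can be illustrated with a small, self-contained sketch. This is not HTMLParser code: SimpleNode, TreeOrderSketch, sample(), depthFirst() and breadthFirst() are hypothetical names invented for the illustration, and the depth-first variant shown here is pre-order, which is an assumption rather than something the readme states about NodeTreeWalker.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for a parsed HTML tree; NOT the HTMLParser Node API.
class TreeOrderSketch {

    /** Minimal tree node: a tag name plus its ordered children. */
    static final class SimpleNode {
        final String name;
        final List<SimpleNode> children = new ArrayList<>();

        SimpleNode(String name, SimpleNode... kids) {
            this.name = name;
            for (SimpleNode kid : kids) {
                children.add(kid);
            }
        }
    }

    /** Depth-first (pre-order): visit a node, then each subtree left to right. */
    static List<String> depthFirst(SimpleNode root) {
        List<String> order = new ArrayList<>();
        Deque<SimpleNode> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            SimpleNode node = stack.pop();
            order.add(node.name);
            // Push children in reverse so the first child is popped first.
            for (int i = node.children.size() - 1; i >= 0; i--) {
                stack.push(node.children.get(i));
            }
        }
        return order;
    }

    /** Breadth-first: visit every node at one depth before going deeper. */
    static List<String> breadthFirst(SimpleNode root) {
        List<String> order = new ArrayList<>();
        Deque<SimpleNode> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            SimpleNode node = queue.remove();
            order.add(node.name);
            queue.addAll(node.children);
        }
        return order;
    }

    /** Sample tree: html has children head (with title) and body (with h1, p). */
    static SimpleNode sample() {
        return new SimpleNode("html",
                new SimpleNode("head", new SimpleNode("title")),
                new SimpleNode("body", new SimpleNode("h1"), new SimpleNode("p")));
    }

    public static void main(String[] args) {
        SimpleNode root = sample();
        System.out.println(depthFirst(root));   // [html, head, title, body, h1, p]
        System.out.println(breadthFirst(root)); // [html, head, body, title, h1, p]
    }
}
```

Depth-first order keeps each element next to its descendants, which suits extracting the text of one subtree; breadth-first order reaches shallow nodes such as head and body before any nested content.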

Bug Fixes
---------
#1445795 return as TextNode when processing jsp
#1445309 XML processing instructions are returned as text
#1376851 Null-valued cookies cause exception
#1375230 some javascript breaks stringbean
#1344687 A bug when set cookies
#1334408 Exception occurs based on string length
#1322686 when illegal charset specified
#1227213 Particular SCRIPT tags close too late

Patches
-------
#1338534 Support get first/last child, previous/next sibling

Changes since Version 1.4
-------------------------

New APIs
Implement rudimentary sax parser. Currently exposes DOM parser via sax project
A new http package is added, the primary class being ConnectionManager which handles proxies, passwords and cookies. Some testing still needed. Also removed some line separator cruft.
Added parseCDATA to the Lexer, used in script and style scanners. Note that this is significantly new behaviour that now adheres to appendix B.3.2 Specifying non-HTML data of the HTML reference: http://www.w3.org/TR/html4/appendix/notes.html#notes-specifying-data

Configuration Management
Removed the need for the Translate class to be packaged with htmllexer.jar. This results in a lighter weight component.
Updated the logo and included the LGPL license.
Fixed the Windows batch files.
Added optional "classes" property to build.xml. This directory is where class files are put. It defaults to src. To use: ant -Dclasses=classdir where classdir is/will-be a peer directory to src.
Fixed various end user experience issues.

Refactoring
Added static STRICT flag to ScriptScanner to revert to legacy handling of broken ETAGO (