RoadRunner-0.02.11

Category: Other
Development tool: Java
File size: 2259KB
Downloads: 88
Upload date: 2008-05-09 10:46:09
Uploader: LittleJohny
Description: RoadRunner, a classic tool for extracting data from web pages. Its key idea is to compare pages and use the mismatches found to repeatedly revise the current template, eventually inferring a template that covers all the example pages; that template is then used to extract information from similar pages.

File list:
RoadRunner (0, 2004-03-13)
RoadRunner\docs (0, 2004-03-13)
RoadRunner\docs\.nbattrs (334, 2004-02-22)
RoadRunner\docs\PROPERTIES.TXT (9108, 2004-02-17)
RoadRunner\etc (0, 2004-03-13)
RoadRunner\etc\defaults.xml (1646, 2004-03-13)
RoadRunner\etc\logging.properties (1729, 2004-02-25)
RoadRunner\etc\prefix.xml (683, 2004-01-28)
RoadRunner\etc\tidy.cfg (500, 2003-12-21)
RoadRunner\examples (0, 2004-03-13)
RoadRunner\examples\fifaworldcup.yahoo.com (0, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro.html (44336, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files (0, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\155957.jpg (9112, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\336232509.jpg (1977, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\back2.gif (37, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\black.gif (35, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\cn0.gif (84, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\da20.gif (1082, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\de0a.gif (109, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\en1a.gif (102, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\eng12090.gif (2809, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\en_gamezone120x30.gif (1525, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\es0a.gif (108, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\et20.gif (1025, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\fcb.gif (173, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\fr0a.gif (107, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\fra98_logo.gif (678, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\fwccom120x30.gif (758, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\gray.gif (35, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\hm30.gif (768, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\ITA.gif (369, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\ita.jpg (12673, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\jp0.gif (112, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\kr0.gif (85, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\o.gif (43, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\pf30.gif (1048, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\search4.gif (213, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\teams4.gif (204, 2004-02-16)
... ...

INSTALLATION

- Unzip the distribution and consider the main directory "RoadRunner"
- Add to your CLASSPATH these libraries:
    lib/roadrunner.jar
    lib/nekohtml.jar
    lib/xercesImpl.jar
    lib/xmlParserAPIs.jar

In order to run RoadRunner you can either cd to the main RoadRunner directory or set the system property "rr.home" to point to the main directory.
(Note: the system property "rr.home" can be set by adding the option -Drr.home=path/to/directory/RoadRunner to the JVM invocation)

USAGE

The main class is roadrunner.Shell. For help on its usage type "java roadrunner.Shell" without args:

Usage: java roadrunner.Shell [-trace[level]] [-O] [-N] ( -Finputfiles.txt | file0 [file1 ...] )
    -O             a xml configuration file
    -trace[level]  enable tracing at log level
    -F files.txt   lists the input filenames
    -N             set a for output (default "out"):
                     wrapper file(s): "Wrapper.xml"
                     data file(s):    "_DataSet.xml"

RR places inferred wrappers and extracted datasets under subdirectories of the "output" directory. Each subdirectory is named after the argument following the option -N.
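The installation steps above could be scripted roughly as follows. This is a minimal sketch, assuming a Unix-like shell where classpath entries are separated by ":". The helper names rr_classpath and rr_run are hypothetical and not part of the distribution, and the sketch passes the jars through the JVM's -cp option instead of exporting CLASSPATH.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with RoadRunner): build the classpath from
# the four jars named in the installation notes and launch roadrunner.Shell
# with the "rr.home" system property set.

# Print a ":"-separated classpath for the RoadRunner directory given as $1.
rr_classpath() {
    local home="$1"
    local jar cp=""
    for jar in roadrunner.jar nekohtml.jar xercesImpl.jar xmlParserAPIs.jar; do
        # Prepend the separator only when cp already holds an entry.
        cp="${cp:+$cp:}$home/lib/$jar"
    done
    printf '%s\n' "$cp"
}

# Run roadrunner.Shell from any working directory.
# $1 is the RoadRunner directory; remaining arguments are passed through.
rr_run() {
    local home="$1"
    shift
    java -Drr.home="$home" -cp "$(rr_classpath "$home")" roadrunner.Shell "$@"
}
```

With these definitions loaded, calling rr_run path/to/directory/RoadRunner with no further arguments should print the usage help, since roadrunner.Shell is then invoked without args.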
The option -O specifies an XML configuration file whose syntax is briefly described in docs/PROPERTIES.TXT

EXAMPLES

Three examples (bash syntax):

1) Soccer players

java roadrunner.Shell -Nplayers \
    -Oexamples/fifaworldcup.yahoo.com/players.xml \
    examples/fifaworldcup.yahoo.com/Cannavaro_xy_.xhtml \
    examples/fifaworldcup.yahoo.com/Nesta_xy_.xhtml \
    examples/fifaworldcup.yahoo.com/Zidane_xy_.xhtml

2) Jobs

java roadrunner.Shell -Nhotjobs \
    -Oexamples/www.hotjobs.com/hotjobs.xml \
    examples/www.hotjobs.com/ByLocation01_xy_.xhtml \
    examples/www.hotjobs.com/ByLocation02_xy_.xhtml \
    examples/www.hotjobs.com/ByLocation03_xy_.xhtml

3) Overstock Jewelry

java roadrunner.Shell -Noverstock \
    -Oexamples/www.overstock.com/jewelry.xml \
    examples/www.overstock.com/jewelry01_xy_.xhtml \
    examples/www.overstock.com/jewelry02_xy_.xhtml \
    examples/www.overstock.com/jewelry03_xy_.xhtml

After executing one of these examples, you will find a directory named "output", whose subdirectories are named after the -N option of the command line: "players" for the first example and "overstock" for the last. The last example, about jewelry, therefore produces the following files:

output/overstock/overstock00.xml (the wrapper inferred)
output/overstock/overstock0_DataSet.xml (the dataset extracted from input samples)
output/overstock/overstockWrappersIndex.xml (an index of all wrappers produced)
output/overstock/results.html (html file to display all results)

Open the file "results.html" with a CSS- and XSLT-capable browser (Mozilla 1.3- and recent versions of IE seem to work) to view the data extracted by the automatically generated wrapper.

USING ROADRUNNER ON NEW SAMPLES

RoadRunner works on well-formed documents. The input HTML pages are pre-processed using the nekoHTML parser (http://www.apache.org/~andyc/neko/doc/html) to produce DOM representations of input documents.
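When a new sample set contains many pages, naming each one on the command line becomes unwieldy, and the -F option from the usage text accepts a list file instead. The sketch below collects every "_xy_.xhtml" page in a directory into such a list. The helper name rr_make_filelist is hypothetical, and the one-path-per-line layout of the list file is an assumption, since the usage text only says that the file "lists the input filenames".

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with RoadRunner): write the paths of all
# "_xy_.xhtml" samples found directly in directory $1 to the list file $2.
# Assumption: the -F list file holds one filename per line.
rr_make_filelist() {
    local dir="$1" list="$2" f
    : > "$list"                        # create or truncate the list file
    for f in "$dir"/*_xy_.xhtml; do
        [ -e "$f" ] || continue        # skip the literal pattern when nothing matches
        printf '%s\n' "$f" >> "$list"
    done
}

# Example (commented out so that loading this file runs nothing):
# rr_make_filelist examples/fifaworldcup.yahoo.com players_files.txt
# java roadrunner.Shell -Nplayers \
#     -Oexamples/fifaworldcup.yahoo.com/players.xml \
#     -Fplayers_files.txt
```

The commented invocation writes the option and its argument without a space, following the "-Finputfiles.txt" form shown in the usage line.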
Note that in order to use the RR visual labelling feature, the input HTML document must be annotated with HTML comments reporting the coordinates of the bounding box of every text string in the visual rendering of the page. Just to give an example of what RR Labeller expects, the following fragment of HTML code:

... a text X a text Y...

states that the coordinates of the bounding box of the string "a text X" in the visual rendering of the document are (minX=10, minY=10, maxX=60, maxY=20); similarly for "a text Y". Currently we use the suffix "_xy_.xhtml" to mark HTML files which have been processed in this way.

REFERENCES

The general ideas underlying the RoadRunner Project can be found in the papers:

Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo: RoadRunner: Towards Automatic Data Extraction from Large Web Sites. Proc. of 27th International Conference on Very Large Databases (VLDB 2001): 109-118
http://www.vldb.org/conf/2001/P109.pdf

Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo: Automatic Annotation of Data Extracted from Large Web Sites. SIGMOD Conference 2003, WebDB Workshop.
http://doi.acm.org/10.1145/***3477.***3480

Valter Crescenzi, Giansalvatore Mecca: On Automatic Information Extraction from Large Web Sites. Dip. Informatica ed Automazione Technical Report dia-76-2003.
http://www.dia.uniroma3.it/db/roadRunner/publications/dia-76-2003.ps.gz

THIRD-PARTY SOFTWARE INCLUDED

* Xerces Java Parser (check file license: license/Apache_Software_LICENSE.TXT)
* NekoHTML HTML scanner (check file license: license/NekoHTML_LICENSE.TXT)

This product includes software developed by:
- Andy Clark (http://www.apache.org/~andyc/neko/doc/html)
- The Apache Software Foundation (http://www.apache.org/).

These libraries are used to produce DOM representations of input HTML pages.
