RoadRunner-0.02.11
Category: Other
Development tool: Java
File size: 2259KB
Downloads: 88
Upload date: 2008-05-09 10:46:09
Uploader: LittleJohny
Description: RoadRunner, a classic tool for collecting data from web pages. Its key idea is to compare pages and use the mismatches found to repeatedly modify the current template, eventually deriving a template that covers the example pages. That template is then used to extract information from similar pages.
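The mismatch-driven refinement described above can be illustrated with a deliberately small sketch. It is not RoadRunner's code: the class name, the "#PCDATA" placeholder, and the assumption that pages arrive as equal-length token lists are all made up for illustration. It handles only text mismatches, which it turns into a data field; a mismatch involving a tag is rejected, since resolving it would need the structural generalization this sketch omits.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only (hypothetical names, not part of RoadRunner).
public class MismatchSketch {
    // Placeholder marking a position whose text varies between pages.
    public static final String FIELD = "#PCDATA";

    // A token is treated as a tag if it starts with '<' (an assumption of this sketch).
    public static boolean isTag(String token) {
        return token.startsWith("<");
    }

    // Compare the current template with one more sample page and
    // return a template that covers both.
    public static List<String> generalize(List<String> template, List<String> sample) {
        if (template.size() != sample.size()) {
            throw new IllegalArgumentException("this sketch needs token lists of equal length");
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i < template.size(); i++) {
            String t = template.get(i);
            String s = sample.get(i);
            if (t.equals(s)) {
                result.add(t);            // tokens agree: keep as is
            } else if (isTag(t) || isTag(s)) {
                // Structural mismatch: out of scope for this sketch.
                throw new UnsupportedOperationException("tag mismatch at position " + i);
            } else {
                result.add(FIELD);        // text mismatch: generalize to a field
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> page1 = List.of("<b>", "Name:", "</b>", "Cannavaro");
        List<String> page2 = List.of("<b>", "Name:", "</b>", "Nesta");
        System.out.println(generalize(page1, page2));
    }
}
```

Feeding further pages into generalize, each time with the previous result as the template, mirrors the idea of refining one template until it covers all examples.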
File list:
RoadRunner (0, 2004-03-13)
RoadRunner\docs (0, 2004-03-13)
RoadRunner\docs\.nbattrs (334, 2004-02-22)
RoadRunner\docs\PROPERTIES.TXT (9108, 2004-02-17)
RoadRunner\etc (0, 2004-03-13)
RoadRunner\etc\defaults.xml (1646, 2004-03-13)
RoadRunner\etc\logging.properties (1729, 2004-02-25)
RoadRunner\etc\prefix.xml (683, 2004-01-28)
RoadRunner\etc\tidy.cfg (500, 2003-12-21)
RoadRunner\examples (0, 2004-03-13)
RoadRunner\examples\fifaworldcup.yahoo.com (0, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro.html (44336, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files (0, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\155957.jpg (9112, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\336232509.jpg (1977, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\back2.gif (37, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\black.gif (35, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\cn0.gif (84, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\da20.gif (1082, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\de0a.gif (109, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\en1a.gif (102, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\eng12090.gif (2809, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\en_gamezone120x30.gif (1525, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\es0a.gif (108, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\et20.gif (1025, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\fcb.gif (173, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\fr0a.gif (107, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\fra98_logo.gif (678, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\fwccom120x30.gif (758, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\gray.gif (35, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\hm30.gif (768, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\ITA.gif (369, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\ita.jpg (12673, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\jp0.gif (112, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\kr0.gif (85, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\o.gif (43, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\pf30.gif (1048, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\search4.gif (213, 2004-02-16)
RoadRunner\examples\fifaworldcup.yahoo.com\Cannavaro_files\teams4.gif (204, 2004-02-16)
... ...
INSTALLATION
- Unzip the distribution; the steps below refer to its main directory, "RoadRunner"
- Add to your CLASSPATH these libraries:
lib/roadrunner.jar
lib/nekohtml.jar
lib/xercesImpl.jar
lib/xmlParserAPIs.jar
In order to run RoadRunner you can either cd to the main RoadRunner directory
or set the system property "rr.home" to point to the main directory.
(Note: the system property "rr.home" can be set by adding the option
-Drr.home=path/to/directory/RoadRunner
to the JVM invocation)
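As a rough illustration of the two alternatives above, the following sketch shows how a Java program can read a system property such as "rr.home" and fall back to the current working directory when it is not set. This is not RoadRunner's own lookup code; the class and method names are hypothetical, and only "rr.home" and the path etc/defaults.xml come from the distribution.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, shown only to illustrate -Drr.home versus "cd".
public class HomeDirSketch {
    // Prefer the value of rr.home; otherwise use the working directory.
    public static Path resolveHome(String rrHomeProperty, String workingDir) {
        String base = (rrHomeProperty == null || rrHomeProperty.isEmpty())
                ? workingDir
                : rrHomeProperty;
        return Paths.get(base).toAbsolutePath().normalize();
    }

    public static void main(String[] args) {
        // System.getProperty returns null unless -Drr.home=... was passed to the JVM.
        Path home = resolveHome(System.getProperty("rr.home"),
                                System.getProperty("user.dir"));
        System.out.println("RoadRunner home: " + home);
        System.out.println("Default config:  " + home.resolve("etc").resolve("defaults.xml"));
    }
}
```

Running it as "java -Drr.home=path/to/directory/RoadRunner HomeDirSketch" prints the directory given on the command line; running it without the option prints the directory it was started from.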
USAGE
The main class is roadrunner.Shell. For help on its usage
type "java roadrunner.Shell" without args:
Usage: java roadrunner.Shell [-trace[level]] [-O]
       [-N]
       ( -Finputfiles.txt | file0 [file1 ...] )
  -O             an XML configuration file
  -trace[level]  enable tracing at the given log level
  -F files.txt   lists the input filenames
  -N             set a name for output (default "out"):
                 wrapper file(s): "Wrapper.xml"
                 data file(s): "_DataSet.xml"
RR places the inferred wrappers and the extracted datasets in
subdirectories of the "output" directory. Each subdirectory
is named after the argument following the option -N.
The option -O specifies an XML configuration file whose
syntax is briefly described in docs/PROPERTIES.TXT
EXAMPLES
Three examples (bash syntax):
1) Soccer players
java roadrunner.Shell -Nplayers \
-Oexamples/fifaworldcup.yahoo.com/players.xml \
examples/fifaworldcup.yahoo.com/Cannavaro_xy_.xhtml \
examples/fifaworldcup.yahoo.com/Nesta_xy_.xhtml \
examples/fifaworldcup.yahoo.com/Zidane_xy_.xhtml
2) jobs
java roadrunner.Shell -Nhotjobs \
-Oexamples/www.hotjobs.com/hotjobs.xml \
examples/www.hotjobs.com/ByLocation01_xy_.xhtml \
examples/www.hotjobs.com/ByLocation02_xy_.xhtml \
examples/www.hotjobs.com/ByLocation03_xy_.xhtml
3) Overstock Jewelry
java roadrunner.Shell -Noverstock \
-Oexamples/www.overstock.com/jewelry.xml \
examples/www.overstock.com/jewelry01_xy_.xhtml \
examples/www.overstock.com/jewelry02_xy_.xhtml \
examples/www.overstock.com/jewelry03_xy_.xhtml
After executing one of these examples, you will find a directory named "output",
whose subdirectories are named after the -N option of the command line:
"players" for the first example and "overstock" for the last.
Therefore, the last example about jewelry produces the following files:
output/overstock/overstock00.xml (the wrapper inferred)
output/overstock/overstock0_DataSet.xml (the dataset extracted from input samples)
output/overstock/overstockWrappersIndex.xml (an index of all wrappers produced)
output/overstock/results.html (html file to display all results)
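A small check like the one below can confirm that a run produced the files listed above. It is only a sketch: the class name is hypothetical, and the file-name pattern (including the "00" and "0" suffixes) is copied from the single jewelry example rather than from a documented naming rule, so it may not hold for other inputs.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper; the naming pattern is inferred from one example only.
public class OutputCheck {
    // Build the paths expected under output/<name>/ for a run started with -N<name>.
    public static List<Path> expectedFiles(String name) {
        Path dir = Paths.get("output", name);
        return List.of(
                dir.resolve(name + "00.xml"),            // the inferred wrapper
                dir.resolve(name + "0_DataSet.xml"),     // the extracted dataset
                dir.resolve(name + "WrappersIndex.xml"), // index of all wrappers
                dir.resolve("results.html"));            // page displaying the results
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "overstock";
        for (Path p : expectedFiles(name)) {
            System.out.println((Files.exists(p) ? "found   " : "missing ") + p);
        }
    }
}
```

Run it from the main RoadRunner directory, for instance as "java OutputCheck players" after the first example.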
Open the file "results.html" with a CSS- and XSLT-capable browser (Mozilla 1.3- and
recent versions of IE seem to work) to view directly the data extracted by the
automatically generated wrapper.
USING ROADRUNNER ON NEW SAMPLES
RoadRunner works on well-formed documents. The input HTML pages are
pre-processed using the nekoHTML parser (http://www.apache.org/~andyc/neko/doc/html)
to produce DOM representations of input documents.
Note that in order to use the RR visual labelling feature, the input HTML document
must be annotated with HTML comments reporting the coordinates of the bounding box
of every text string in the visual rendering of the page. To give an
example of what the RR Labeller expects, the following fragment of HTML code:
... a text X | a text Y |
...
states that the coordinates of the bounding box of the string "a text X" in the
visual rendering of the document are (minX=10, minY=10, maxX=60, maxY=20);
similarly for "a text Y". Currently we use the suffix "_xy_.xhtml" to mark
HTML files which have been processed in this way.
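To make the four coordinates concrete, here is a minimal sketch of a bounding box as a Java value type. It is not taken from RoadRunner: the class and its methods are hypothetical, it does not parse the HTML comments (their exact syntax is not reproduced here), and the assumption that y grows downward, as in typical screen coordinates, is mine.

```java
// Hypothetical value type; models only the four numbers described above.
public final class BoundingBox {
    public final int minX, minY, maxX, maxY;

    public BoundingBox(int minX, int minY, int maxX, int maxY) {
        if (maxX < minX || maxY < minY) {
            throw new IllegalArgumentException("max must not be smaller than min");
        }
        this.minX = minX;
        this.minY = minY;
        this.maxX = maxX;
        this.maxY = maxY;
    }

    public int width()  { return maxX - minX; }
    public int height() { return maxY - minY; }

    // True if this box lies entirely to the left of the other one.
    public boolean isLeftOf(BoundingBox other) { return maxX <= other.minX; }

    // True if this box lies entirely above the other one (assumes y grows downward).
    public boolean isAbove(BoundingBox other) { return maxY <= other.minY; }

    public static void main(String[] args) {
        // Coordinates of "a text X" as stated in the text.
        BoundingBox x = new BoundingBox(10, 10, 60, 20);
        // Made-up coordinates for a neighbouring string, for illustration only.
        BoundingBox y = new BoundingBox(70, 10, 120, 20);
        System.out.println(x.width() + " x " + x.height());
        System.out.println("X left of Y: " + x.isLeftOf(y));
    }
}
```

Relations such as "left of" or "above" are the kind of visual cue that becomes available once every text string carries its box.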
REFERENCES
The general ideas underlying the RoadRunner Project can be found in the papers:
Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo: RoadRunner: Towards
Automatic Data Extraction from Large Web Sites. Proc. of 27th International
Conference on Very Large Databases (VLDB 2001): 109-118
http://www.vldb.org/conf/2001/P109.pdf
Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo: Automatic Annotation of
Data Extracted from Large Web Sites. SIGMOD Conference 2003, WebDB Workshop.
http://doi.acm.org/10.1145/***3477.***3480
Valter Crescenzi, Giansalvatore Mecca: On Automatic Information Extraction from
Large Web Sites. Dip. Informatica ed Automazione Technical Report dia-76-2003.
http://www.dia.uniroma3.it/db/roadRunner/publications/dia-76-2003.ps.gz
THIRD-PARTY SOFTWARE INCLUDED
* Xerces Java Parser (check file license: license/Apache_Software_LICENSE.TXT)
* NekoHTML HTML scanner (check file license: license/NekoHTML_LICENSE.TXT )
This product includes software developed by:
- Andy Clark (http://www.apache.org/~andyc/neko/doc/html)
- The Apache Software Foundation (http://www.apache.org/).
These libraries are used to produce DOM representations of the input HTML pages.