ContextExtraction
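Category: Pattern recognition (vision/speech, etc.)
Development tool: Java
File size: 83KB
Downloads: 0
Upload date: 2017-05-25 16:53:06
Uploader: sh-1993
Description: Context extraction for online news articles (HTML pages) using the maximum subsequence segmentation algorithm, as proposed by...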
(Online news article (HTML pages) context extraction using Maximum Subsequence Segmentation Algorithm as presented by Pasternack and Roth)
File list:
pom.xml (9342, 2014-07-14)
src (0, 2014-07-14)
src\main (0, 2014-07-14)
src\main\java (0, 2014-07-14)
src\main\java\at (0, 2014-07-14)
src\main\java\at\rovo (0, 2014-07-14)
src\main\java\at\rovo\Main.java (17049, 2014-07-14)
src\main\java\at\rovo\textextraction (0, 2014-07-14)
src\main\java\at\rovo\textextraction\AbstractTrainer.java (6655, 2014-07-14)
src\main\java\at\rovo\textextraction\ExtractionException.java (1473, 2014-07-14)
src\main\java\at\rovo\textextraction\FileTrainer.java (3633, 2014-07-14)
src\main\java\at\rovo\textextraction\SQLiteDBTrainer.java (6863, 2014-07-14)
src\main\java\at\rovo\textextraction\TextExtractor.java (11965, 2014-07-14)
src\main\java\at\rovo\textextraction\TrainData.java (77, 2014-07-14)
src\main\java\at\rovo\textextraction\TrainerFactory.java (2271, 2014-07-14)
src\main\java\at\rovo\textextraction\TrainingDataStrategy.java (782, 2014-07-14)
src\main\java\at\rovo\textextraction\mss (0, 2014-07-14)
src\main\java\at\rovo\textextraction\mss\MaximumSubsequenceSegmentation.java (20287, 2014-07-14)
src\main\java\at\rovo\textextraction\mss\NoSubsequenceFoundException.java (1643, 2014-07-14)
src\main\java\at\rovo\textextraction\mss\NotTrainedException.java (1512, 2014-07-14)
src\main\java\at\rovo\textextraction\mss\SemiSupervisedMSS.java (25005, 2014-07-14)
src\main\java\at\rovo\textextraction\mss\SimpleMSS.java (4909, 2014-07-14)
src\main\java\at\rovo\textextraction\mss\SupervisedMSS.java (12562, 2014-07-14)
src\main\java\at\rovo\textextraction\mss\TrainFeatureStrategy.java (1094, 2014-07-14)
src\main\java\at\rovo\textextraction\mss\TrainingEntry.java (32119, 2014-07-14)
src\main\java\at\rovo\textextraction\template (0, 2014-07-14)
src\main\java\at\rovo\textextraction\template\MatchedMatrixValue.java (705, 2014-07-14)
src\main\java\at\rovo\textextraction\template\TemplateExtraction.java (16742, 2014-07-14)
src\main\java\at\rovo\textextraction\templateIndependent (0, 2014-07-14)
src\main\java\at\rovo\textextraction\templateIndependent\TemplateIndependentNewsExtractor.java (23987, 2014-07-14)
src\main\resources (0, 2014-07-14)
src\main\resources\ContentExtraction.conf (3303, 2014-07-14)
src\main\resources\log (0, 2014-07-14)
src\main\resources\log\log4j2.xml (4006, 2014-07-14)
src\main\resources\stopwords.txt (2292, 2014-07-14)
src\test (0, 2014-07-14)
src\test\java (0, 2014-07-14)
... ...
GENERAL INFO:
=============
Modern websites present content in various ways. Most websites nowadays use some
form of content management system to publish new content and to link to other
articles and websites, and not only in a separate link section.
The content is therefore embedded in a specific design, often in the middle
of the screen and surrounded by link, related-article and comment sections,
though plenty of other sections are possible too.
When focusing on online news articles, extracting the main article of a page
is usually easy for humans. Achieving the same result with a computer, however,
is not. Several research papers have recently addressed this issue, for example
'Learning Block Importance Models for Web Pages' by Song, Liu, Wen and Ma, and
'Extracting Article Text from the Web with Maximum Subsequence Segmentation' by
Pasternack and Roth.
A DOM-based, template-producing content extraction algorithm (Zhang, Lin 2010)
and a template-independent content extraction method (Wu, Yang 2012) have been
added experimentally. Neither algorithm extracts content yet, because parts of
the descriptions in the corresponding papers are unclear.
INSTALLATION & EXECUTION:
=========================
The project is set up as an Apache Maven project. The pom.xml file currently
configures it to execute automatically on installation. Before installing,
make sure you have downloaded the training database provided by the MSS creators
(http://cogcomp.cs.illinois.edu/Data/MSS/) and placed it in the 'trainingData'
subdirectory of the project root.
On the first run, Maven will try to install all dependencies and then start
training the naive Bayes classifier. The classifier later provides the
probabilities of tokens (tags and words), from which a score is built that the
MSS algorithm uses to identify the main content of a page. Training can take
more than an hour, depending on the sample size and the computer used.
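The scoring step described above can be illustrated with a minimal sketch. It
is not the project's MaximumSubsequenceSegmentation class: the class name, the
method names and the 0.5 offset that turns a probability into a signed score
are assumptions made for illustration. Given one score per token, the main
content is taken to be the contiguous token run with the largest score sum.

```java
import java.util.Arrays;

/** Hypothetical sketch; names and the 0.5 offset are assumptions. */
final class MssSketch {

    /**
     * Returns {start, endExclusive} of the contiguous run with the largest
     * score sum, or null if no run has a positive sum.
     */
    static int[] maxSubsequence(double[] scores) {
        double best = 0.0;
        double running = 0.0;
        int bestStart = -1;
        int bestEnd = -1;
        int start = 0;
        for (int i = 0; i < scores.length; i++) {
            running += scores[i];
            if (running <= 0.0) {
                // A non-positive prefix cannot improve any later run: restart.
                running = 0.0;
                start = i + 1;
            } else if (running > best) {
                best = running;
                bestStart = start;
                bestEnd = i + 1;
            }
        }
        return bestStart < 0 ? null : new int[] {bestStart, bestEnd};
    }

    /** Maps P(token belongs to the article) to a signed score (assumed offset). */
    static double[] toScores(double[] probabilities) {
        double[] scores = new double[probabilities.length];
        for (int i = 0; i < probabilities.length; i++) {
            scores[i] = probabilities[i] - 0.5;
        }
        return scores;
    }

    public static void main(String[] args) {
        // Made-up probabilities, as a trained classifier might produce them.
        double[] p = {0.1, 0.2, 0.9, 0.8, 0.3, 0.9, 0.1};
        System.out.println(Arrays.toString(maxSubsequence(toScores(p))));
    }
}
```

A single low-probability token inside the article (index 4 in the example) does
not split the result, because the surrounding positive scores outweigh it.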
Note that training on 12 times 15000 examples with bigram features may require
up to 8 gigabytes of RAM. Training on trigram features requires even more,
because far more trigrams than bigrams will be found: trigram training on 12
times 10000 samples will already require more than 8 gigabytes of RAM.
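One plausible reading of the memory note above is that every distinct n-gram
needs its own entry in the trained model, and that the same token stream yields
more distinct trigrams than bigrams. The following sketch only illustrates that
effect on a toy token list; the class and method names are hypothetical and the
project's actual feature storage may differ.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of n-gram feature counting. */
final class NGramCountSketch {

    /** Counts every run of n consecutive tokens, joined by a single space. */
    static Map<String, Integer> countNGrams(List<String> tokens, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            String key = String.join(" ", tokens.subList(i, i + n));
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens =
                Arrays.asList("the", "cat", "the", "cat", "the", "dog", "the", "cat");
        // The number of map entries is what drives memory use.
        System.out.println("distinct bigrams:  " + countNGrams(tokens, 2).size());
        System.out.println("distinct trigrams: " + countNGrams(tokens, 3).size());
    }
}
```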
Training is done on 12 predefined online newspaper providers, which are
already available in the 'ate.db' SQLite database.
If training completes successfully, the Java object that contains the
training data is persisted to disk in the 'trainingData' subdirectory to
avoid re-training on subsequent executions. In addition, a 'commonTags.ser' file
is created that contains the common tags used by the various pages.
Note, however, that the persisted training object is specific to the selected
feature type (bigram, trigram, ...) and the number of samples used for training.
Changing either of these parameters in the pom.xml file triggers the training
process again.
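The caching behavior described above could be sketched as follows, assuming
plain Java serialization. The class name, the file naming scheme and the
loadOrTrain helper are hypothetical and only show one way to make the cache
depend on the feature type and the sample count; the project's trainer classes
may work differently.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.function.Supplier;

/** Hypothetical sketch of a training cache keyed by feature type and sample count. */
final class TrainingCacheSketch {

    /** Different parameters give a different file, so a change forces re-training. */
    static Path cacheFile(Path dir, String featureType, int samples) {
        return dir.resolve("trained-" + featureType + "-" + samples + ".ser");
    }

    /** Loads the persisted object if present; otherwise trains and persists it. */
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T loadOrTrain(Path file, Supplier<T> trainer) {
        try {
            if (Files.exists(file)) {
                try (ObjectInputStream in =
                        new ObjectInputStream(Files.newInputStream(file))) {
                    return (T) in.readObject();
                }
            }
            T model = trainer.get();
            Path parent = file.getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            try (ObjectOutputStream out =
                    new ObjectOutputStream(Files.newOutputStream(file))) {
                out.writeObject(model);
            }
            return model;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("cached class is no longer available", e);
        }
    }

    public static void main(String[] args) {
        // Uses the system temp directory so the demo does not touch the project.
        Path dir = Paths.get(System.getProperty("java.io.tmpdir"), "training-cache-sketch");
        HashMap<String, Integer> model = loadOrTrain(cacheFile(dir, "BIGRAM", 15000), () -> {
            HashMap<String, Integer> trained = new HashMap<>();
            trained.put("the cat", 3); // stands in for the expensive training step
            return trained;
        });
        System.out.println(model);
    }
}
```

Encoding the parameters in the file name keeps results for several
configurations side by side instead of overwriting a single file.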
ToDo:
=====
*) Currently bigrams achieve the best results, though the paper states that
trigrams should achieve more accurate results with less training
*) Fix the template-based algorithm. ISTM seems to work properly and MMTB should
be fine too. MCB and TE are the algorithms that need to be corrected
*) TemplateIndependent algorithm: It is unclear what exactly should happen at
what point in time. Moreover, it is not clear whether an external stopword list
should be used or whether the maximum likelihood estimator (MLE) should take the
words of the segments directly.
*) Use a map & reduce pattern for learning. Multi-threading would be nice too
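For the last item, a rough idea of what map & reduce style learning might look
like is sketched below with Java parallel streams. This is not part of the
project: the class name, the method name and the restriction to bigram counting
are assumptions. Each document is mapped to its bigrams, and the counts are
reduced into one shared concurrent map.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

/** Hypothetical sketch of multi-threaded bigram counting. */
final class ParallelCountSketch {

    static Map<String, Long> countBigrams(List<List<String>> documents) {
        return documents.parallelStream()
                // map: turn each document into a stream of its bigrams
                .flatMap(tokens -> IntStream.range(0, Math.max(0, tokens.size() - 1))
                        .mapToObj(i -> tokens.get(i) + " " + tokens.get(i + 1)))
                // reduce: count equal bigrams in a thread-safe map
                .collect(Collectors.groupingByConcurrent(
                        Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        List<List<String>> documents = Arrays.asList(
                Arrays.asList("the", "cat", "the"),
                Arrays.asList("the", "cat"));
        System.out.println(countBigrams(documents));
    }
}
```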
ToDo for external projects:
===========================
*) Improve runtime performance of Parser framework
*) Improve storage behavior of Naive Bayes data