COLLINS-PARSER

Category: Multilingual language processing
Platform: Unix/Linux
File size: 3879 KB
Downloads: 34
Upload date: 2009-04-07 17:23:22
Uploader: zhougy640
Description: A head-driven phrase-structure syntactic parser. The model takes into account the role of the head word of the noun phrase that follows a prepositional phrase. Developed by Michael Collins (MIT), it was widely regarded internationally as the best English syntactic parser at the time.

File list:
COLLINS-PARSER\code\chart.c (42832, 2002-12-15)
COLLINS-PARSER\code\chart.h (2144, 2002-12-15)
COLLINS-PARSER\code\edges.c (3928, 2002-12-15)
COLLINS-PARSER\code\edges.h (4805, 2002-12-15)
COLLINS-PARSER\code\effhash.c (2751, 2002-12-15)
COLLINS-PARSER\code\effhash.h (1765, 2002-12-15)
COLLINS-PARSER\code\genprob.c (5138, 2002-12-15)
COLLINS-PARSER\code\genprob.h (2223, 2002-12-15)
COLLINS-PARSER\code\GNU_GENERAL_PUBLIC_LICENSE (18009, 2002-12-15)
COLLINS-PARSER\code\grammar.c (6378, 2002-12-15)
COLLINS-PARSER\code\grammar.h (3968, 2002-12-15)
COLLINS-PARSER\code\hash.c (2472, 2002-12-15)
COLLINS-PARSER\code\hash.h (2029, 2002-12-15)
COLLINS-PARSER\code\key.c (1705, 2002-12-15)
COLLINS-PARSER\code\key.h (1482, 2002-12-15)
COLLINS-PARSER\code\lexicon.c (3809, 2002-12-15)
COLLINS-PARSER\code\lexicon.h (1774, 2002-12-15)
COLLINS-PARSER\code\main.c (2624, 2002-12-15)
COLLINS-PARSER\code\Makefile (1000, 2002-12-14)
COLLINS-PARSER\code\mymalloc.c (1364, 2002-12-15)
COLLINS-PARSER\code\mymalloc.h (1221, 2002-12-15)
COLLINS-PARSER\code\mymalloc_char.c (1432, 2002-12-15)
COLLINS-PARSER\code\mymalloc_char.h (1267, 2002-12-15)
COLLINS-PARSER\code\parser (54570, 2002-12-15)
COLLINS-PARSER\code\prob.c (15439, 2002-12-15)
COLLINS-PARSER\code\prob.h (3981, 2002-12-15)
COLLINS-PARSER\code\prob_witheffhash.c (5256, 2002-12-15)
COLLINS-PARSER\code\prob_witheffhash.h (2201, 2002-12-15)
COLLINS-PARSER\code\readevents.c (5587, 2002-12-15)
COLLINS-PARSER\code\readevents.h (2244, 2002-12-15)
COLLINS-PARSER\code\sentence.c (3796, 2002-12-15)
COLLINS-PARSER\code\sentence.h (3382, 2002-12-15)
COLLINS-PARSER\collins99.thesis.ps (2281547, 2002-12-16)
COLLINS-PARSER\examples\sec00.tagged (405838, 2002-12-15)
COLLINS-PARSER\examples\sec23.tagged (498255, 2002-12-15)
COLLINS-PARSER\GNU_GENERAL_PUBLIC_LICENSE (18009, 2002-12-15)
COLLINS-PARSER\sec23\cleanSec23.pl (1990, 2002-12-14)
COLLINS-PARSER\sec23\merge.prl (317, 2002-12-15)
COLLINS-PARSER\sec23\proc_pout.prl (549, 2002-12-14)
... ...

This code is the statistical natural language parser described in

  M. Collins. 1999. Head-Driven Statistical Models for Natural Language
  Parsing. PhD Dissertation, University of Pennsylvania.

Version 1.0, released Dec 15th 2002.

Copyright (C) 1999 Michael Collins

This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2 of the License, or (at your
option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.

You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

To download the latest version of this software, follow the appropriate
link at http://www.ai.mit.edu/people/mcollins

You can send mail to mcollins@ai.mit.edu. I'll try to reply to queries,
although I apologise in advance if I'm not able to respond -- it is
difficult to keep up with the volume of mail I receive concerned with
the parser.

CONTENTS

[0] Compiling the code
[1] Running the code
  [1.1] More details about the command line format
[2] Input format
[3] Output format
[4] Evaluating the accuracy on section 23
[5] A list of the files included in this package

==============================================================================
[0] Compiling the code

To compile the parsing code:

  cd code
  make

If need be, use

  rm *.o

to remove obsolete files before typing "make".

==============================================================================
[1] Running the code

To parse tagged_file, with output piped to parsed_file:

With Model 1:

  gunzip -c models/model1/events.gz | code/parser tagged_file models/model1/grammar 10000 1 1 1 1 > parsed_file &

With Model 2:

  gunzip -c models/model2/events.gz | code/parser tagged_file models/model2/grammar 10000 1 1 1 1 > parsed_file &

With Model 3:

  gunzip -c models/model3/events.gz | code/parser tagged_file models/model3/grammar 10000 1 1 1 1 > parsed_file &

You should see something similar to the following at stderr when you run
the parser:

  Initialised lexicons
  Initialised grammar
  Loaded non-terminals
  Loaded lexicon
  Loaded grammar
  NUMSENTENCES 1917
  Hash table: 100000 lines read
  Hash table: 200000 lines read
  Hash table: 300000 lines read
  Hash table: 400000 lines read
  ....

There are some sample inputs in the "examples" sub-directory.

If you run over examples/sec23.tagged with models 1, 2 and 3 you should
get output identical to sec23/sec23.model1, sec23/sec23.model2 and
sec23/sec23.model3 (with the exception of the timing lines, which start
with "TIME"). It's probably good to do this to check that everything is
running correctly.

For example, to parse section 23 with the three models:

  gunzip -c models/model1/events.gz | code/parser examples/sec23.tagged models/model1/grammar 10000 1 1 1 1 > examples/sec23.model1 &

  gunzip -c models/model2/events.gz | code/parser examples/sec23.tagged models/model2/grammar 10000 1 1 1 1 > examples/sec23.model2 &

  gunzip -c models/model3/events.gz | code/parser examples/sec23.tagged models/model3/grammar 10000 1 1 1 1 > examples/sec23.model3 &

[1.1] More details about the command line format

The general format is

  gunzip -c events_file.gz | code/parser tagged_file grammar_stem beamsize punctuation-flag distaflag distvflag npbflag

where:

  events_file.gz   = file of training events
  tagged_file      = the file to be parsed (see [2] for its format)
  grammar_stem     = the stem for various grammar files
  beamsize         = the size of the beam.
                     10000 is usual, 1000 will be faster at a slight
                     cost in accuracy
  punctuation-flag = 1 if the punctuation constraint is to be used; you
                     will usually want this to be the case (see Collins 99
                     section 7.5.5 for a description of the constraint)
  distaflag        = 1 for the adjacency condition in the distance measure
                     to be used. This flag should almost certainly be set
                     to 1.
  distvflag        = 1 for the verb condition in the distance measure to
                     be used. This flag should almost certainly be set
                     to 1.
  npbflag          = 1 for an output format that can be scored against the
                     treebank. If it's set to 0 the output will include an
                     extra level in some NPs, for example:

    npbflag = 1   (TOP (S (NPB the man) (VP saw (NPB the dog))))

  vs.

    npbflag = 0   (TOP (S (NP (NPB the man)) (VP saw (NP (NPB the dog)))))

  Notice the extra NP level when npbflag = 0. This extra level is more
  consistent (there are always NP and NPB levels, even when there are no
  modifiers to the noun phrase), so it may be better for some
  applications.

==============================================================================
[2] Input format

Input format of the tagged file:

  N word_1 tag_1 ... word_n tag_n

where N is the number of words in the sentence.

e.g.

  18 Pierre NNP Vinken NNP , , 61 CD years NNS old JJ , , will MD join VB the DT board NN as IN a DT nonexecutive JJ director NN Nov. NNP 29 CD . .

==============================================================================
[3] Output format

In general, to see a straightforward version of the output:

  cat parsed_file | sec23/proc_pout.prl

The "raw" output format is as follows:

  First line is "PROB num_edges_in_chart log_prob 0"
    e.g. PROB 3890 -72.7453 0

  Next few lines are the parse tree printed, one word per line, with log
  probs on each constituent

  Next line is the full parse output (see below for details)

  Final line is "TIME time"
    e.g.
"TIME 10" meaning the parse took 10 seconds

The full parse output is in the following format. First an example:

  (TOP~will~1~1 (S~will~2~2 (NP-A~Vinken~2~1 (NPB~Vinken~2~2 Pierre/NNP Vinken/NNP ,/PUNC, ) (ADJP~old~2~2 (NPB~years~2~2 61/CD years/NNS ) old/JJ ,/PUNC, ) ) (VP~will~2~1 will/MD (VP-A~join~4~1 join/VB (NPB~board~2~2 the/DT board/NN ) (PP~as~2~1 as/IN (NPB~director~3~3 a/DT nonexecutive/JJ director/NN ) ) (NPB~Nov.~2~1 Nov./NNP 29/CD ./PUNC. ) ) ) ) )

Now some details:

1) Word/tag pairs are separated by '/'. Punctuation marks have their POS
   tag preceded by "PUNC". For example

     ,/PUNC,

2) '(' and ')' show the tree bracketing. The opening parenthesis is
   immediately followed by a non-terminal, whose format is described
   in (3).

3) Non-terminals are in the following format:

     non-term-label~headword~total#_of_children~constituent_#

   where constituent_# is the number of the child from which the headword
   is taken. This uses 1-based indexing, with punctuation marks not being
   counted as children. For example:

     (NP~flowers~4~4 the/DT green/JJ ,/PUNC, hungry/JJ flowers/NN )

4) In models 2 and 3, "-A" is appended to non-terminals which are
   arguments (complements) as opposed to adjuncts. In model 3, "-g" is
   appended to non-terminals which contain a slash category.

==============================================================================
[4] Evaluating the accuracy on section 23

See README.sec23 in the sec23/ directory for full details of how the
parser output was scored.

==============================================================================
[5] A list of the files included in this package

  code/                  this directory includes the source code for the
                         parser

  examples/sec23.tagged  part-of-speech tagged version of section 23 of
                         the treebank (tagged by Adwait Ratnaparkhi's
                         tagger)

  examples/sec00.tagged  part-of-speech tagged version of section 0 of
                         the treebank (tagged by Adwait Ratnaparkhi's
                         tagger)

  sec23/                 this directory has output on section 23 for the
                         three models (sec23.model1, sec23.model2, and
                         sec23.model3). See README.sec23 for how to
                         evaluate these three files against the treebank.

  sec23/proc_pout.prl    A useful script for converting the parser's
                         output to treebank-style parses

  models/model1          These three directories hold the grammar and
  models/model2          lexicon files for the three parsers.
  models/model3
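
As an illustration of the input format in section [2], the following Python sketch builds and reads lines of the form "N word_1 tag_1 ... word_n tag_n". It is not part of the package: the function names format_sentence and parse_sentence are hypothetical, and the sketch assumes one sentence per line with whitespace-free words and tags.

```python
def format_sentence(tagged_words):
    """Render [(word, tag), ...] as 'N word_1 tag_1 ... word_n tag_n'."""
    fields = [str(len(tagged_words))]
    for word, tag in tagged_words:
        # Assumption: fields are whitespace-separated, so neither a word
        # nor a tag may be empty or contain whitespace.
        if not word or not tag or any(ch.isspace() for ch in word + tag):
            raise ValueError(f"bad word/tag pair: {word!r} {tag!r}")
        fields.extend([word, tag])
    return " ".join(fields)


def parse_sentence(line):
    """Read one input line back into a list of (word, tag) pairs."""
    fields = line.split()
    count = int(fields[0])
    rest = fields[1:]
    # The leading count must agree with the number of word/tag pairs.
    if len(rest) != 2 * count:
        raise ValueError(f"expected {count} word/tag pairs, got {len(rest) / 2}")
    return list(zip(rest[0::2], rest[1::2]))


if __name__ == "__main__":
    sentence = [("the", "DT"), ("board", "NN"), (".", ".")]
    line = format_sentence(sentence)
    print(line)
    assert parse_sentence(line) == sentence
```

Checking the leading count on the way in catches sentences that were truncated or accidentally merged before they reach the parser.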

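The full parse line described in section [3] can be read with a small stack-based reader. The Python sketch below is only an illustration under stated assumptions, not the package's own code and not a replacement for sec23/proc_pout.prl: the names Node, parse_tree and head_child are hypothetical, tokens are assumed to be separated by whitespace as in the examples above, and no word is assumed to begin with '('.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str          # non-terminal label, e.g. "NP-A"
    headword: str       # lexical head of the constituent
    num_children: int   # children count, punctuation excluded
    head_index: int     # 1-based index of the head child
    children: list = field(default_factory=list)


def parse_leaf(token):
    # Split on the last '/' so a word that itself contains '/' survives.
    word, tag = token.rsplit("/", 1)
    return (word, tag)


def is_punct(child):
    # Punctuation leaves carry a tag that starts with "PUNC".
    return isinstance(child, tuple) and child[1].startswith("PUNC")


def parse_nonterminal(token):
    # Format: label~headword~total#_of_children~constituent_#
    rest, total, head = token.rsplit("~", 2)
    label, headword = rest.split("~", 1)
    return Node(label, headword, int(total), int(head))


def parse_tree(text):
    stack, root = [], None
    for tok in text.split():
        if tok == ")":
            node = stack.pop()
            if stack:
                stack[-1].children.append(node)
            else:
                root = node
        elif tok.startswith("("):
            stack.append(parse_nonterminal(tok[1:]))
        else:
            stack[-1].children.append(parse_leaf(tok))
    if stack or root is None:
        raise ValueError("unbalanced brackets")
    return root


def head_child(node):
    # Punctuation marks are not counted when indexing the head child.
    counted = [c for c in node.children if not is_punct(c)]
    return counted[node.head_index - 1]


if __name__ == "__main__":
    tree = parse_tree(
        "(NP~flowers~4~4 the/DT green/JJ ,/PUNC, hungry/JJ flowers/NN )"
    )
    print(tree.label, tree.headword, head_child(tree))
    # prints: NP flowers ('flowers', 'NN')
```

Splitting the non-terminal from the right keeps the two numeric fields intact even if a headword happens to contain '~'.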