masc_wordsense

Category: Other
Development tool: LINUX
File size: 6636 KB
Downloads: 0
Upload date: 2018-11-21 13:25:26
Uploader: vishsfsfsf
Description: In a Boolean retrieval system, stemming never lowers recall.
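The description's claim holds because stemming, applied to both the index and the query, makes each query term match a superset of the documents it matched before, so for queries without negation the retrieved set can only grow. A toy Python sketch of this point (unrelated to the MASC data below; the stemmer is deliberately crude):

    def naive_stem(token: str) -> str:
        # Crude suffix stripping, for illustration only.
        for suffix in ("ing", "ed", "es", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[: -len(suffix)]
        return token

    def index(docs, stem=False):
        # Build an inverted index: term -> set of doc ids.
        inverted = {}
        for doc_id, text in docs.items():
            for tok in text.lower().split():
                term = naive_stem(tok) if stem else tok
                inverted.setdefault(term, set()).add(doc_id)
        return inverted

    def boolean_and(inverted, terms, stem=False):
        # Intersect the postings lists of all query terms.
        result = None
        for t in terms:
            term = naive_stem(t) if stem else t
            hits = inverted.get(term, set())
            result = hits if result is None else result & hits
        return result or set()

    docs = {
        1: "demolish the building",
        2: "demolished buildings downtown",
        3: "historic buildings survive",
    }
    plain = boolean_and(index(docs), ["demolish"])
    stemmed = boolean_and(index(docs, stem=True), ["demolish"], stem=True)
    assert plain <= stemmed   # the stemmed result set is a superset
    print(plain, stemmed)     # {1} {1, 2}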

File list:
masc_wordsense (0, 2010-07-27)
masc_wordsense\.DS_Store (6148, 2010-05-03)
__MACOSX (0, 2010-07-27)
__MACOSX\masc_wordsense (0, 2010-07-27)
__MACOSX\masc_wordsense\._.DS_Store (82, 2010-05-03)
masc_wordsense\data (0, 2010-05-09)
masc_wordsense\data\.DS_Store (6148, 2010-05-09)
__MACOSX\masc_wordsense\data (0, 2010-07-27)
__MACOSX\masc_wordsense\data\._.DS_Store (82, 2010-05-09)
masc_wordsense\data\Full_set (0, 2010-05-09)
masc_wordsense\data\Full_set\round1 (0, 2010-07-26)
masc_wordsense\data\Full_set\round1\demolish-v (0, 2010-05-10)
masc_wordsense\data\Full_set\round1\demolish-v\demolish-v-s.xml (25928, 2010-05-07)
__MACOSX\masc_wordsense\data\Full_set (0, 2010-07-27)
__MACOSX\masc_wordsense\data\Full_set\round1 (0, 2010-07-27)
__MACOSX\masc_wordsense\data\Full_set\round1\demolish-v (0, 2010-07-27)
__MACOSX\masc_wordsense\data\Full_set\round1\demolish-v\._demolish-v-s.xml (225, 2010-05-07)
masc_wordsense\data\Full_set\round1\demolish-v\demolish-v-wn.xml (69262, 2010-05-07)
__MACOSX\masc_wordsense\data\Full_set\round1\demolish-v\._demolish-v-wn.xml (225, 2010-05-07)
masc_wordsense\data\Full_set\round1\demolish-v\demolish-v.txt (10517, 2010-05-07)
__MACOSX\masc_wordsense\data\Full_set\round1\demolish-v\._demolish-v.txt (225, 2010-05-07)
masc_wordsense\data\Full_set\round1\development-n (0, 2010-05-10)
masc_wordsense\data\Full_set\round1\development-n\development-n-s.xml (1509190, 2010-05-07)
__MACOSX\masc_wordsense\data\Full_set\round1\development-n (0, 2010-07-27)
__MACOSX\masc_wordsense\data\Full_set\round1\development-n\._development-n-s.xml (225, 2010-05-07)
masc_wordsense\data\Full_set\round1\development-n\development-n-wn.xml (3422186, 2010-05-07)
__MACOSX\masc_wordsense\data\Full_set\round1\development-n\._development-n-wn.xml (225, 2010-05-07)
masc_wordsense\data\Full_set\round1\development-n\development-n.txt (790572, 2010-05-07)
__MACOSX\masc_wordsense\data\Full_set\round1\development-n\._development-n.txt (225, 2010-05-07)
masc_wordsense\data\Full_set\round1\history-n (0, 2010-05-10)
masc_wordsense\data\Full_set\round1\history-n\history-n-s.xml (1284262, 2010-05-07)
__MACOSX\masc_wordsense\data\Full_set\round1\history-n (0, 2010-07-27)
__MACOSX\masc_wordsense\data\Full_set\round1\history-n\._history-n-s.xml (225, 2010-05-07)
masc_wordsense\data\Full_set\round1\history-n\history-n-wn.xml (2877016, 2010-05-07)
__MACOSX\masc_wordsense\data\Full_set\round1\history-n\._history-n-wn.xml (225, 2010-05-07)
masc_wordsense\data\Full_set\round1\history-n\history-n.txt (558883, 2010-05-07)
__MACOSX\masc_wordsense\data\Full_set\round1\history-n\._history-n.txt (225, 2010-05-07)
masc_wordsense\data\Full_set\round1\influence-n (0, 2010-05-10)
masc_wordsense\data\Full_set\round1\influence-n\influence-n-s.xml (339508, 2010-05-07)
__MACOSX\masc_wordsense\data\Full_set\round1\influence-n (0, 2010-07-27)
... ...
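The listing above shows the naming convention: each word directory <word>-<pos> under data\Full_set\roundN contains <word>-<pos>-s.xml, <word>-<pos>-wn.xml, and <word>-<pos>.txt; the __MACOSX and ._* entries are macOS archive artifacts. A minimal Python sketch to inventory one round, assuming exactly that layout (adjust if the actual release differs):

    from pathlib import Path

    def inventory_round(root: str, round_name: str):
        """List each word directory in a Full_set round and check that
        the three expected files are present."""
        round_dir = Path(root) / "data" / "Full_set" / round_name
        for word_dir in sorted(p for p in round_dir.iterdir() if p.is_dir()):
            word = word_dir.name  # e.g. "demolish-v"
            expected = [f"{word}-s.xml", f"{word}-wn.xml", f"{word}.txt"]
            missing = [f for f in expected if not (word_dir / f).exists()]
            status = "ok" if not missing else "missing: " + ", ".join(missing)
            print(f"{word}: {status}")

    inventory_round("masc_wordsense", "round1")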

The Manually Annotated Sub-Corpus (MASC) project has a subtask to create samples of sentences from the MASC corpus that have word sense annotations, using WordNet sense numbers as the annotation values, and to create FrameNet frame element annotations for 10% of this data. By the end of the project in mid-2011, 1000 occurrences of each of approximately 100 words will have been tagged with WordNet senses, and 100 sentences from each word set will be annotated for FrameNet frame elements. To date, the project has conducted 5 rounds of tagging of roughly 10 words per round. Round 1 was a small pilot study involving only 50 occurrences of each of 10 words, tagged by two annotators. The other rounds annotated the full sample of approximately 1000 occurrences per word (in some cases fewer, depending on the word). The total number of sentences annotated with WordNet 3.1 senses so far is 40058. Additional rounds of data will be added as they are completed.

This README documents the MASC word sense annotation data for rounds 1 through 5. Very briefly, our procedure was: identify words of interest to the WordNet and FrameNet projects, so that the current effort can support the effort to harmonize the two lexical resources; review the WordNet sense inventory for these words by having multiple annotators annotate a small sample of fifty to one hundred sentences, a step that includes the complementary task of assessing inter-annotator agreement; then, after the sense inventory review, collect one thousand sense annotations of each word in its sentential context, with sentences taken from the MASC and the Open American National Corpus (OANC).

Here we document:
1. Participants in the project
2. Relevant publications
3. Rounds
   a. How words are selected for each round
   b. The procedure for selecting sentences, and for dividing them into a pre-annotation sample for computing inter-annotator agreement among multiple annotators and the full set of sentences for each word
4. Annotation guidelines
5. Annotation tool
6. Directory structure

1. Participants in the MASC word sense annotation project

Nancy Ide, PI, Vassar College
Christiane Fellbaum (WordNet), Princeton University
Collin Baker (FrameNet), UC Berkeley and ICSI
Rebecca J. Passonneau, Columbia University

2. Relevant Publications

Baker, Collin F.; Fellbaum, Christiane. 2009. WordNet and FrameNet as complementary resources for annotation. Proceedings of the Third Linguistic Annotation Workshop, pages 125-129, Suntec, Singapore, August. Association for Computational Linguistics.

Ide, Nancy; Baker, Collin; Fellbaum, Christiane; Fillmore, Charles; Passonneau, Rebecca J. 2008. MASC: The Manually Annotated Sub-Corpus of American English. Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC). May-June, 2008. Marrakesh, Morocco.

Ide, Nancy; Baker, Collin; Fellbaum, Christiane; Passonneau, Rebecca J. 2010. MASC: A Community Resource For and By the People. Proceedings of ACL 2010, Uppsala, Sweden, July 11-16. Association for Computational Linguistics.

Passonneau, Rebecca; Salleb-Aouissi, Ansaf; Ide, Nancy. 2009. Making sense of word sense variation. Proceedings of the NAACL-HLT 2009 Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009), pages 2-9. Boulder, Colorado, June 4, 2009.

Passonneau, Rebecca J.; Salleb-Aouissi, Ansaf; Bhardwaj, Vikas; Ide, Nancy. 2010. Word Sense Annotation of Polysemous Words by Multiple Annotators. Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC). May 19-21, 2010. Valletta, Malta.

3. Rounds

The steps in completing one round of word sense annotation are:
a. Selecting the words
b. Selecting sentences to be annotated
c. Assembling annotators
d. Annotating an initial subsample of 50 to 100 sentences per word
e. Reviewing annotators' comments on the subsample for potential revision of the WordNet sense inventory
f. Annotating up to 1000 sentences per word using the revised sense inventory

Words to be annotated are selected by Christiane Fellbaum and Collin Baker to address the goal of harmonizing WordNet and FrameNet, bringing the sense distinctions in each resource into better alignment. For rounds 1-3 and 5 there were 10 words; for round 4 there were 13. The words were chosen to balance part of speech, and to represent relatively polysemous words for round 2 (avg. senses per word = 9.5), less polysemous words for rounds 3 (avg. = 4.8) and 4 (avg. = 5.0), and a mix of very polysemous words and words with few senses for round 5 (avg. senses per word = ***).

Annotators were undergraduates supported by National Science Foundation CRI grant 0708952, including supplemental funds for Research Experience for Undergraduates (REUs). Nine distinct annotators worked on rounds 1-5: four at Columbia (101-104) and five at Vassar (105-109).

Sentences containing instances of each word (restricted to a given part of speech) were selected from the MASC subset of the Open American National Corpus, drawn equally from each of the genre-specific portions of the corpus. Where necessary, additional sentences were taken from the remainder of the 15 million word OANC.

Fifty occurrences of each word in context were sense-tagged with WordNet 3.0 senses by four to six annotators (depending on the round), both to review the WordNet sense inventory and determine whether it needed revision, and for an in-depth study of inter-annotator agreement (Passonneau et al., 2009). For details, see the README files for the IAA data (data/IAA/roundXX/README.data.roundXX, XX = 2.1-5). Note that an extra "round", round 2.1, is included in this data.

Round 1 was a pilot study to test the WordNet sense inventory revision process and gather information about the task before commencing the full study. Fifty occurrences of each of 10 words were tagged by two annotators, and one tagger tagged between 63 and 3628 occurrences of each word. No inter-annotator agreement was computed for this data.

For rounds 2.2-5, one thousand occurrences of each word in context were sense-tagged with senses from the revised WordNet inventory (WordNet 3.1) by at least one annotator. Where 1000 occurrences did not exist in the MASC and OANC, fewer occurrences were tagged. For details see the README files in the Full_set data (data/Full_set/roundXX/README.roundXX, XX = 1-5).
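The released agreement numbers should be regenerated with the calculate_alpha.pl script and raw tables described in section 6. As a rough illustration of what that kind of script computes, here is a minimal Python sketch of Krippendorff's alpha for nominal categories; the list-of-labels-per-item input format is an assumption for illustration, not the format of the released tables:

    from collections import Counter
    from itertools import combinations

    def krippendorff_alpha_nominal(items):
        """items: one list of labels per annotated unit (missing labels
        already removed). Units with fewer than two labels are skipped."""
        units = [labels for labels in items if len(labels) >= 2]
        n = sum(len(labels) for labels in units)  # total pairable values
        if n <= 1:
            return 1.0
        # Observed disagreement: mismatched ordered pairs within each
        # unit, weighted by 1 / (m_u - 1).
        d_o = 0.0
        for labels in units:
            mismatches = sum(1 for a, b in combinations(labels, 2) if a != b)
            d_o += 2.0 * mismatches / (len(labels) - 1)
        d_o /= n
        # Expected disagreement from the pooled label frequencies.
        freq = Counter(label for labels in units for label in labels)
        d_e = sum(freq[c] * freq[k] for c in freq for k in freq if c != k)
        d_e /= n * (n - 1)
        return 1.0 - d_o / d_e if d_e else 1.0

    # Two annotators assigning WordNet sense numbers to five sentences:
    print(krippendorff_alpha_nominal(
        [["1", "1"], ["2", "2"], ["1", "2"], ["3", "3"], ["1", "1"]]))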
4. Annotation Guidelines

Christiane Fellbaum prepared the annotation guidelines based on her previous word sense annotation projects. The most recent version is in the file "tagging.guidelines.v3.doc" in the "doc" directory.

5. Annotation tool

The Sense Annotation Tool for the American National Corpus (SATANiC) was developed during the course of the MASC word sense annotation project by Keith Suderman, and has been updated several times. A screenshot of the tool appears in (Passonneau et al., 2009). The current version displays the WordNet sense glosses for each word, plus four additional options:
* Glob
* No sense is appropriate
* Wrong part of speech
* Not enough context is available

Glob is used to identify collocations, as defined in the annotation guidelines (see section 4 above). SATANiC will be made publicly available in the near future.
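The label set an annotator can assign is thus either a WordNet sense number or one of the four special options. A minimal sketch of how one might model a single annotation decision in Python; the class and field names here are hypothetical, and the released standoff XML defines the actual encoding:

    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional

    class SpecialLabel(Enum):
        GLOB = "glob"              # collocation, per the guidelines
        NO_SENSE = "no-sense"      # no sense is appropriate
        WRONG_POS = "wrong-pos"    # wrong part of speech
        NO_CONTEXT = "no-context"  # not enough context is available

    @dataclass
    class SenseAnnotation:
        word: str                  # e.g. "demolish-v"
        sentence_id: str           # pointer into the sentence corpus
        annotator: str             # annotator code, e.g. "101"
        sense: Optional[int] = None              # WordNet sense number
        special: Optional[SpecialLabel] = None   # or a special label

    # A regular sense assignment and a special one:
    a1 = SenseAnnotation("demolish-v", "s42", "101", sense=1)
    a2 = SenseAnnotation("demolish-v", "s43", "101",
                         special=SpecialLabel.GLOB)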
6. Directory structure

masc_wordsense
  --> data
      --> IAA
      --> Full_set
  --> doc
  --> pub
  --> README.txt (this document)
  --> tools

The data directory contains two sub-directories:

IAA: the raw data, WordNet 3.0 sense annotations, and IAA statistics for 50 instances of the words in each round (100 instances for round 2.2).

Full_set: the raw data and sense annotations by at least one annotator, using WordNet 3.1 senses, for 1000 occurrences (fewer when 1000 instances were not available) of each word in each round. The data include a "sentence corpus" for each word, a standoff annotation file with the sense annotations, linked to the relevant occurrences in the sentence corpus, and a sentence file that provides pointers to the original texts in which the sentences occur. When they become available, FrameNet annotations for 100 of the sentences will be included.

The doc directory contains the tagging guidelines (tagging.guidelines.v3.doc) and an overview of the words and the number of senses and sentences used in each round (wordsense-overview.xlsx).

The tools directory contains Ron Artstein's calculate_alpha.pl Perl script. The IAA data includes raw tables for each word in the form expected by this program, so that the agreement numbers can be regenerated.

Note that the ANC2Go web application (http://www.anc.org/ANC2Go) can be used to merge the sense annotations with the sentences and output the results in several formats. Probably the most useful for many people is inline XML, which wraps each annotated word in XML tags whose attributes provide all of the sense and annotator information.

-------------------------------------------
Last revised 2010-07-27
American National Corpus Project
Department of Computer Science, Vassar College, Poughkeepsie, NY 12604-0732
anc@cs.vassar.edu
