pos-tagger-en-es: POS tagger for English, Spanish, Dutch and other languages

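POS tagger for English, Spanish, Dutch, Italian and French. This repository contains the source code of the English and Spanish POS taggers of the OpeNER project. The English Perceptron model was trained and evaluated using the WSJ treebank, as described in K. Toutanova, D. Klein, and C. D. Manning, Feature-rich part-of-speech tagging with a cyclic dependency network, in Proceedings of HLT-NAACL'03, 2003. We currently obtain a performance of 96.87%, whereas Toutanova et al. (2003) obtain 97.24%. The Spanish Maximum Entropy model was trained and evaluated using the Ancora corpus; it was randomly divided into 90% for training (450K words) and 10% for testing (50K words), obtaining a performance of 98.88%. The French Maximum Entropy model was trained with the ESTER corpus. The Italian Perceptron model was trained with the TUT Treebank. On the Apache OpenNLP website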
pos-tagger-en-es
================

This module provides a Part of Speech tagger for English, Spanish, Dutch, French and Italian. To install and use the core of this repository, please scroll down to the end of the document for the [installation instructions](#installation).

## OVERVIEW

This module provides POS tagging and lemmatization for 5 languages. We provide 5 fast POS tagging models with features based on Collins' (2002) paper on the Perceptron.

+ **Perceptron model for English** trained and evaluated using the WSJ treebank as explained in K. Toutanova, D. Klein, and C. D. Manning. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of HLT-NAACL’03, 2003.
+ **Maximum Entropy model for Spanish** trained and evaluated using the Ancora corpus; it was randomly divided into 90% for training (450K words) and 10% for testing (50K words).
+ **POS tagging model for Dutch** publicly available from the Apache OpenNLP website: http://opennlp.sourceforge.net/models-1.5/
+ **Maximum Entropy model for French** trained with the French Treebank using 80% for training, and 10% each for development and testing.
+ **Dictionary-based lemmatization** for all 5 languages.

To avoid duplication of effort, we use the machine learning API provided by the [Apache OpenNLP project](http://opennlp.apache.org). Additionally, we have added dictionary-based lemmatization for each language.

Therefore, the following resources are provided within the module:

+ **English Perceptron POS Model**:
  + Penn Treebank 96.66 word accuracy.
+ **Spanish POS Models**: we obtained better results overall with Maximum Entropy models (Ratnaparkhi 1999). The best results are obtained when a c0 (cutoff 0) is used, but those models are slower in production than the Perceptron models. Therefore, we provide both types, based on maxent and perceptron.
  + **Ancora Maxent**: 98.88 word accuracy.
  + **Ancora Perceptron**: 98.24 word accuracy.
+ **Dutch POS Perceptron**: downloaded from the Apache OpenNLP website: http://opennlp.sourceforge.net/models-1.5/
+ **French POS Maximum Entropy**: trained with the [French Treebank](http://www.llf.cnrs.fr/Gens/Abeille/French-Treebank-fr.php) divided into 80% for training (500K words), 10% for development (42K words) and 10% for testing (42K words). Test word accuracy: 94.9.
+ **Italian Perceptron model**: trained with the TUT Treebank.
+ **Lemmatizer Dictionaries for all 5 languages**:
  + **Plain text dictionary**: "Word POStag lemma" dictionary in plain text to perform lemmatization.
  + **Morfologik-stemming**: plain text dictionaries binarized as finite state automata using the morfologik-stemming project (see NOTICE file for details). This method uses much less RAM than the plain text dictionary (**this is the default**).

## USAGE

If you are in a hurry, just execute:

````shell
cat file.txt | tokenizer | java -jar $PATH/target/ehu-pos-$version.jar -l $lang
````

If you want to know more, please keep reading.

This POS tagger reads KAF documents (with *wf* and *term* elements) via standard input and outputs KAF through standard output. You can get the necessary input for ehu-pos by piping it through the [OpeNER tokenizer](https://github.com/opener-project/tokenizer).

There are several options to tag with ehu-pos:

+ **lang**: choose between en, es, fr, it and nl.
+ **lemmatize**: choose the dictionary method to perform lemmatization:
  + **bin**: Morfologik binary dictionary (**default**).
  + **plain**: plain text dictionary.
  + **wn**: WordNet 3.0-based lemmatization, **only for English**.
To get WordNet, run:

````shell
wget http://wordnetcode.princeton.edu/3.0/WordNet-3.0.tar.gz
````

**Tagging Example**:

````shell
cat file.txt | tokenizer | java -jar $PATH/target/ehu-pos-$version.jar -l $lang
````

## JAVADOC

It is possible to generate the javadoc of the module by executing:

````shell
cd $repo/core/
mvn javadoc:jar
````

This will create a jar file, core/target/ehu-pos-$version-javadoc.jar

## Module contents

The contents of the core are the following:

+ formatter.xml: Apache OpenNLP code formatter for Eclipse SDK
+ pom.xml: maven pom file which deals with everything related to compilation and execution of the module
+ src/: java source code of the module and required resources

Furthermore, the installation process, as described in the README.md, will generate another directory:

+ target/: it contains the binary executable and other directories

## INSTALLATION

Installing ehu-pos requires the following steps.

If you already have Java 1.7+ and MAVEN 3 installed on your machine, please go directly to step 3. Otherwise, follow these steps:

### 1. Install JDK 1.7

If you do not install JDK 1.7 in a default location, you will probably need to configure the PATH in .bashrc or .bash_profile:

````shell
export JAVA_HOME=/yourpath/local/java7
export PATH=${JAVA_HOME}/bin:${PATH}
````

If you use tcsh you will need to specify it in your .login as follows:

````shell
setenv JAVA_HOME /usr/java/java17
setenv PATH ${JAVA_HOME}/bin:${PATH}
````

If you log in to your shell again and run the command

````shell
java -version
````

you should now see that your JDK is 1.7.

### 2. Install MAVEN 3

Download MAVEN 3 with:

````shell
wget http://apache.rediris.es/maven/maven-3/3.0.5/binaries/apache-maven-3.0.5-bin.tar.gz
````

Now you need to configure the PATH.
For Bash Shell:

````shell
export MAVEN_HOME=/home/ragerri/local/apache-maven-3.0.5
export PATH=${MAVEN_HOME}/bin:${PATH}
````

For tcsh shell:

````shell
setenv MAVEN3_HOME ~/local/apache-maven-3.0.5
setenv PATH ${MAVEN3_HOME}/bin:${PATH}
````

If you log in to your shell again and run the command

````shell
mvn -version
````

you should see a reference to the MAVEN version you have just installed, plus the JDK 7 that it is using.

### 3. Get module source code

If you need to get the module source code from here, run:

````shell
git clone https://github.com/opener-project/pos-tagger-en-es
````

### 4. Compile

````shell
cd $repo/core
mvn clean package
````

This step will create a directory called target/ which contains various directories and files. Most importantly, there you will find the module executable:

ehu-pos-$version.jar

This executable contains every dependency the module needs, so it is completely portable as long as you have a JVM 1.7 installed.

To install the module in the local maven repository, usually located in ~/.m2/, execute:

````shell
mvn clean install
````
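
### Checking the prerequisites from a script

Steps 1 and 2 above ask you to read the output of `java -version` and `mvn -version` by eye. If you prefer to automate that check before compiling, the following is a minimal sketch, not part of the module. The function name `java_version_ok` is hypothetical, and the sketch assumes that the first line of `java -version` contains a quoted version string such as `"1.7.0_80"`.

````shell
#!/bin/sh
# Hypothetical helper (not shipped with ehu-pos): succeeds when the given
# version string satisfies the Java 1.7 minimum required by the module.
# Accepts both the legacy scheme ("1.7.0_80") and the newer one ("17.0.2").
java_version_ok() {
    major=${1%%.*}                  # text before the first dot
    if [ "$major" = "1" ]; then     # legacy scheme: 1.<major>.<minor>
        rest=${1#*.}
        major=${rest%%.*}
    fi
    case $major in
        ''|*[!0-9]*) return 1 ;;    # empty or non-numeric: reject
    esac
    [ "$major" -ge 7 ]
}

# Report on the Java found on PATH, if any.
if command -v java >/dev/null 2>&1; then
    # `java -version` writes to stderr, so redirect it before parsing.
    found=$(java -version 2>&1 | sed -n '1s/.*version "\([^"]*\)".*/\1/p')
    if java_version_ok "$found"; then
        echo "Java $found satisfies the 1.7+ requirement"
    else
        echo "Java '$found' is too old or could not be parsed"
    fi
else
    echo "java was not found on PATH; see step 1"
fi

# Maven prints its own version and the JDK it uses; show it if available.
if command -v mvn >/dev/null 2>&1; then
    mvn -version
else
    echo "mvn was not found on PATH; see step 2"
fi
````

The script only reports what it finds and never changes your environment, so it can be run safely before `mvn clean package`.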