LLClass

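Category: Artificial Intelligence / Neural Networks / Deep Learning
Development tool: Scala
File size: 8105KB
Downloads: 0
Upload date: 2020-04-23 22:24:33
Uploader: sh-1993
Description: MIT LL text classifier, including MIRA online classifier, SVM, and perceptron (LID, sentiment analysis, text difficulty...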
(MIT LL Text Classifier including MIRA Online classifier, SVM, and perceptron (LID, sentiment analysis, text difficulty assessment))

File list:
build.sbt (1850, 2017-07-14)
docs (0, 2017-07-14)
docs\Shen_Williams_Marius_Salesky_ACL2013.pdf (250067, 2017-07-14)
find_max.py (637, 2017-07-14)
lib (0, 2017-07-14)
lib\jtr.jar (6783, 2017-07-14)
lib\mallet-deps.jar (2033489, 2017-07-14)
lib\mallet.jar (2134041, 2017-07-14)
lib\structlearn.jar (66614, 2017-07-14)
models (0, 2017-07-14)
models\fourLang.mod.gz (941490, 2017-07-14)
models\news4L.mod (3150321, 2017-07-14)
project (0, 2017-07-14)
project\assembly.sbt (190, 2017-07-14)
src (0, 2017-07-14)
src\main (0, 2017-07-14)
src\main\resources (0, 2017-07-14)
src\main\resources\logback.xml (1200, 2017-07-14)
src\main\scala (0, 2017-07-14)
src\main\scala\mitll (0, 2017-07-14)
src\main\scala\mitll\lid (0, 2017-07-14)
src\main\scala\mitll\lid\Args.scala (19366, 2017-07-14)
src\main\scala\mitll\lid\Log.scala (4249, 2017-07-14)
src\main\scala\mitll\lid\Pipes.scala (33782, 2017-07-14)
src\main\scala\mitll\lid\RESTService.scala (2921, 2017-07-14)
src\main\scala\mitll\lid\classifier.scala (6319, 2017-07-14)
src\main\scala\mitll\lid\jq.scala (13614, 2017-07-14)
src\main\scala\mitll\lid\lid.scala (22492, 2017-07-14)
src\main\scala\mitll\lid\lm.scala (12143, 2017-07-14)
src\main\scala\mitll\lid\mira.scala (23111, 2017-07-14)
src\main\scala\mitll\lid\org.scala (1597, 2017-07-14)
src\main\scala\mitll\lid\text.scala (24854, 2017-07-14)
src\main\scala\mitll\lid\utilities.scala (26985, 2017-07-14)
src\main\scala\mitll\lid\vec.scala (6814, 2017-07-14)
src\test (0, 2017-07-14)
src\test\scala (0, 2017-07-14)
src\test\scala\mitll (0, 2017-07-14)
src\test\scala\mitll\lid (0, 2017-07-14)
... ...

### Introduction

LLClass is a Java tool that can be used for a number of text classification problems, including:

* Language Identification (LID) - especially Twitter data
* Unreliable Article Style Detector: [Download new model](https://github.com/mitll/LLClass/releases/download/v1.1.4/unreliableArticleEnglish.mod) - trained on in-house hackathon training data
* Automatic text difficulty assessment
* Sentiment analysis

It includes a number of different classifiers, including MIRA, SVM, and a perceptron. It also includes a simple REST service for doing classification and some pretrained models.

More documentation can be found under docs:

* [Low-Resource Twitter LID](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0ahUKEwi9nIPhrITVAhWCVj4KHeWwB4IQFggmMAA&url=http%3A%2F%2Fweb.science.mq.edu.au%2F~smalmasi%2Fvardial4%2Fpdf%2FVarDial09.pdf&usg=AFQjCNEB2r8xLI4LfbZrO6iFbeLjGKKraw)
* [Auto ILR Paper](docs/Shen_Williams_Marius_Salesky_ACL2013.pdf)

See below for some performance benchmarks.

### Build Dependencies

* scala 2.11.8
* sbt
* Java 1.8

### To Compile Source Code and Build

At the top-level directory, type:

```
sbt assembly
```

Running this command causes SBT to download some dependencies, which may take some time depending on your internet connection. If you use a proxy, you may need to adjust your local proxy settings to allow SBT to fetch dependencies. If you are behind a firewall, you may need to add a configuration file in your ```~/.sbt``` directory. See the [SBT Behind a Firewall](#sbt-behind-a-firewall) section for more details.

This creates a jar under target at

```
[info] Packaging ... target/scala-2.11/LLClass-assembly-1.1.jar
```

For the examples below, you can add a link to the jar (from the top-level directory):

```
ln -s target/scala-2.11/LLClass-assembly-1.1.jar LLClass.jar
```

### Data Format Description

* The data should have one line per example, with a tab or other whitespace separating the label from the document.
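
As an illustration of this format, the sketch below splits one line into a label and a document at the first run of whitespace. It is not LLClass's own reader: the object name `DataFormat` and the function `parseLine` are hypothetical, and the behavior for malformed lines (returning `None`) is an assumption made for the example.

```scala
// Hypothetical helper, not part of LLClass: shows how a line in the
// "label<whitespace>document" format can be split.
object DataFormat {
  // Splits at the first run of whitespace only, so the document text
  // keeps its internal spaces. Returns None when there is no document.
  def parseLine(line: String): Option[(String, String)] = {
    val parts = line.trim.split("\\s+", 2)
    if (parts.length == 2 && parts(0).nonEmpty) Some((parts(0), parts(1)))
    else None
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq("en\tthis is english.", "fr\tquelle langue est-elle?")
    lines.flatMap(parseLine).foreach { case (label, doc) =>
      println(s"$label -> $doc")
    }
  }
}
```

Limiting the split to two parts is what lets either a tab or plain spaces serve as the separator without breaking up the document itself.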
##### Example Data Format:

```
en	this is english.
fr	quelle langue est-elle?
```

### Default MIRA Parameters - when unspecified, these parameters are automatically set:

* split: 0.10 (90/10 train-test split)
* word-ngram-order: 1
* char-ngram-order: 3
* bkg-min-count: 2
* slack: 0.01
* iterations: 20

### Quickstart:

```
java -jar LLClass.jar LID -all test/news4L-500each.tsv.gz
```

### Quickstart Expected Results:

```
(truncated from above)
2015-10-05 15:56:25.912 [INFO] Completed training
2015-10-05 15:56:25.912 [INFO] Training complete.
2015-10-05 15:56:27.325 [INFO] # of trials: 200
2015-10-05 15:56:27.325 [INFO]       ru   fa   es  dar    N  class %
2015-10-05 15:56:27.325 [INFO] ru    50    0    0    0   50  1.000000
2015-10-05 15:56:27.325 [INFO] fa     0   46    0    4   50  0.920000
2015-10-05 15:56:27.326 [INFO] es     0    0   50    0   50  1.000000
2015-10-05 15:56:27.326 [INFO] dar    0    0    0   50   50  1.000000
2015-10-05 15:56:27.326 [INFO] accuracy = 0.***5
```

### MITLL-LID Options

* Model - save a LID model for later application onto new data
* Log - log the parameters, accuracy per language, overall accuracy, debugging
* Score - generate a file with LID scores on each sentence
* Data - use (-all) to generate models or do a train/test split. Data should be in TSV format and gzipped (gzip myfile.tsv)
* Train/Test - if not using (-all with optional -split), then specify separate train and test sets (-train mytrain.tsv.gz -test mytest.tsv.gz). This is useful for training out-of-domain followed by testing in-domain.
* Output Files - Scored files, model files, and log files are only saved when the user specifies them on the command line at runtime

#### Use 85/15 train/test split and run for 10 iterations:

```
java -jar LLClass.jar LID -all test/news4L-500each.tsv.gz -split 0.15 -iterations 10
```

#### Save score files, model files, and log files, use 85/15 train/test split (optional - specify and save the resulting model, log and score files):

```
java -jar LLClass.jar LID -all test/news4L-500each.tsv.gz -split 0.15 -iterations 30 -model news4L.mod -log news4L.log -score news4L.score
```

#### Apply an existing model to new test data

```
java -jar LLClass.jar LID -test test/no_nl_da_en_5K.tsv.gz -model models/fourLang.mod.gz
```

#### Train and test on different data sets

```
java -jar LLClass.jar LID -train test/no_nl_da_en_5k.tsv.gz -test test/no_nl_da_en_500.tsv.gz
```

### Calling from Java/Scala

There are two main functions to score text.

* textLID() returns the language code and a confidence value for that code.
* textLIDFull() returns a set of language labels ranked from most likely to least likely, with a confidence value for each one.

The confidence values lie in the range [0,1], where larger numbers imply higher confidence.

#### Steps to perform language identification via LLClass

1) Import the LLClass language id package

```
import mitll.lid
```

2) Create an instance of the ```Scorer``` class and specify the LID model

```
var newsRunner = new lid.Scorer("path/to/lid/model")
```

3) Call the function mitll.Scorer.textLID()

```
var (language, confidence) = newsRunner.textLID("what language is this text string?")
```

4) Or call the function mitll.Scorer.textLIDFull()

```
var langConfArray : Array[(Symbol,Double)] = newsRunner.textLIDFull("what language is this text string?")
```

### REST service

* Start the service

```
java -jar LLClass.jar REST
```

* Simple text call in a browser (http://localhost:8080/classify?q=...)

```
http://localhost:8080/classify?q=No%20quiero%20pagar%20la%20cuenta%20del%20hospital.
```

```
es
```

* Curl

```
curl --noproxy localhost http://localhost:8080/classify?q=Necesito%20pagar%20los%20servicios%20de%20electricidad%20y%20cable.
es
```

* Return JSON (http://localhost:8080/classify/json?q=...)

```
http://localhost:8080/classify/json?q=%22Necesito%20pagar%20los%20servicios%20de%20electricidad%20y%20cable.%22
```

```javascript
{
  class: "es",
  confidence: "47.82"
}
```

* Model labels

```
http://localhost:8080/labels
```

```javascript
{
  labels: [ "dar", "es", "fa", "ru" ]
}
```

* JSON scores for all labels in model (http://localhost:8080/classify/all/json?q=...)

```
http://localhost:8080/classify/all/json?q=%22Necesito%20pagar%20los%20servicios%20de%20electricidad%20y%20cable.%22
```

```javascript
{
  results: [
    { class: "es", confidence: 0.7623826612993418 },
    { class: "dar", confidence: -0.128890820239746 },
    { class: "ru", confidence: -0.26141672506577*** },
    { class: "fa", confidence: -0.37207511599381426 }
  ]
}
```

* See RESTServiceSpec for details

### Tests

* LIDSpec has more usage examples.
* EvalSpec has tests that show swapping out an LLClass classifier for a langid.py service classifier.
* TwitterEvalSpec has tests for running against Twitter data from [Evaluating language identification performance](https://blog.twitter.com/2015/evaluating-language-identification-performance)
* RESTServiceSpec shows variations on running the RESTService

### sbt behind a firewall

* You may need to add a repositories file like this under your ~/.sbt directory:

```
515918-mitll:.sbt $ cat repositories
[repositories]
  local
  my-ivy-proxy-releases: http://repo.typesafe.com/typesafe/ivy-releases/, [organization]/[module]/(scala_[scalaVersion]/)(sbt_[sbtVersion]/)[revision]/[type]s/[artifact](-[classifier]).[ext]
  my-maven-proxy-releases: http://repo1.maven.org/maven2/
```

#### Twitter Results

### Tweet normalization

Tweet normalization was done by the [LLString tweet normalizer](https://github.com/mitll/LLString.git). See that repo for instructions on how to install it and run the Python normalization script.

### Twitter : Evaluating language identification performance

From [Evaluating language identification performance](https://blog.twitter.com/2015/evaluating-language-identification-performance)

#### Precision dataset

Note that the precision_oriented dataset had 69000 tweets, but we could only download 51567 of them.

##### Train/Test 85/15 split all labels, no text normalization, minimum 500 examples per label

Labels with fewer than 500 examples were excluded.

|Info|Value|
|----|-----|
|Train|44352|
|Test|6654|
|Labels(44)|ar,bn,ckb,de,el,en,es,fa,fr,gu,he,hi,hi-Latn,hy,id,it,ja,ka,km,kn,lo,ml,mr,my,ne,nl,pa,pl,ps,pt,ru,sd,si,sr,sv,ta,te,th,und,ur,vi,zh-CN,zh-TW|
|Accuracy|0.86654***5|

##### Train/Test 85/15 split all labels, no text normalization but skip the und label

The und label marks undefined tweets, which could match several languages.

|Info|Value|
|----|-----|
|Train|34546|
|Test|5183|
|Labels(43)|ar,bn,ckb,de,el,en,es,fa,fr,gu,he,hi,hi-Latn,hy,id,it,ja,ka,km,kn,lo,ml,mr,my,ne,nl,pa,pl,ps,pt,ru,sd,si,sr,sv,ta,te,th,ur,vi,zh-CN,zh-TW|
|Accuracy|0.94***36|

This model can be found in the release directory if you want to try it yourself.

##### Train/Test 85/15 split all labels, with text normalization, minimum 500 examples per label

A Python script was run to normalize the tweet text by removing markup, hashtags, etc.

|Info|Value|
|----|-----|
|Train|44080|
|Test|6614|
|Labels(44)|ar,bn,ckb,de,el,en,es,fa,fr,gu,he,hi,hi-Latn,hy,id,it,ja,ka,km,kn,lo,ml,mr,my,ne,nl,pa,pl,ps,pt,ru,sd,si,sr,sv,ta,te,th,und,ur,vi,zh-CN,zh-TW|
|Accuracy|0.86815846|

##### Train/Test 85/15 split all labels, with text normalization but skip the und label

|Info|Value|
|----|-----|
|Train|34540|
|Test|5183|
|Labels(43)|ar,bn,ckb,de,el,en,es,fa,fr,gu,he,hi,hi-Latn,hy,id,it,ja,ka,km,kn,lo,ml,mr,my,ne,nl,pa,pl,ps,pt,ru,sd,si,sr,sv,ta,te,th,ur,vi,zh-CN,zh-TW|
|Accuracy|0.95504534|

#### Recall

##### Train/Test 85/15 split all labels, no text normalization, minimum 500 examples per label

Initial data had 72000 of 87585 tweets from recall_oriented.

|Info|Value|
|----|-----|
|Train|71196|
|Test|10682|
|Labels (67)|am, ar, bg, bn, bo, bs, ca, ckb, cs, cy, da, de, dv, el, en, es, et, eu, fa, fi, fr, gu, he, hi, hi-Latn, hr, ht, hu, hy, id, is, it, ja, ka, km, kn, ko, lo, lv, ml, mr, my, ne, nl, no, pa, pl, ps, pt, ro, ru, sd, si, sk, sl, sr, sv, ta, te, th, tl, tr, uk, ur, vi, zh-CN, zh-TW|
|Accuracy|0.9237971|

See model in releases.

##### Train/Test 85/15 split all labels, with text normalization, minimum 500 examples per label

Initial data had 72000 of 87585 tweets from recall_oriented.

|Info|Value|
|----|-----|
|Train|71187|
|Test|10680|
|Labels (67)|am, ar, bg, bn, bo, bs, ca, ckb, cs, cy, da, de, dv, el, en, es, et, eu, fa, fi, fr, gu, he, hi, hi-Latn, hr, ht, hu, hy, id, is, it, ja, ka, km, kn, ko, lo, lv, ml, mr, my, ne, nl, no, pa, pl, ps, pt, ro, ru, sd, si, sk, sl, sr, sv, ta, te, th, tl, tr, uk, ur, vi, zh-CN, zh-TW|
|Accuracy|0.92387***|

#### Uniform

##### Train/Test 85/15 split all labels, no text normalization, minimum 500 examples per label

|Info|Value|
|----|-----|
|Train|7***42|
|Test|11467|
|Labels (13)|ar, en, es, fr, id, ja, ko, pt, ru, th, tl, tr, und|
|Accuracy|0.90***282|

##### Train/Test 85/15 split all labels, no text normalization, minimum 500 examples per label, no und label

|Info|Value|
|----|-----|
|Train|7***42|
|Test|10402|
|Labels (12)|ar, en, es, fr, id, ja, ko, pt, ru, th, tl, tr|
|Accuracy|0.9699096|

##### Train/Test 85/15 split all labels, with text normalization, minimum 500 examples per label

|Info|Value|
|----|-----|
|Train|69344|
|Test|11228|
|Labels (13)|ar, en, es, fr, id, ja, ko, pt, ru, th, tl, tr, und|
|Accuracy|0.9291058|

##### Train/Test 85/15 split all labels, with text normalization, minimum 500 examples per label, no und label

|Info|Value|
|----|-----|
|Train|69338|
|Test|10401|
|Labels (12)|ar, en, es, fr, id, ja, ko, pt, ru, th, tl, tr|
|Accuracy|0.***08672|

### Freetext

#### [Europarl](https://code.google.com/archive/p/language-detection/downloads)

|Info|Value|
|----|-----|
|Train|74854|
|Test|3150|
|Labels (21)|bg cs da de el en es et fi fr hu it lt lv nl pl pt ro sk sl sv|
|Accuracy|0.9***41267|

See releases for model.

### Twitter 11 Languages small dataset

```
2016-04-15 16:11:37.257 [INFO] # of trials: 825
2016-04-15 16:11:37.258 [INFO]      zh  uk  ru  no  nl  ko  id  fa  en  da  ar   N  class %
2016-04-15 16:11:37.258 [INFO] zh   74   1   0   0   0   0   0   0   0   0   0  75  0.986667
2016-04-15 16:11:37.258 [INFO] uk    1  40  28   2   0   0   0   2   0   2   0  75  0.533333
2016-04-15 16:11:37.258 [INFO] ru    0   4  71   0   0   0   0   0   0   0   0  75  0.946667
2016-04-15 16:11:37.258 [INFO] no    1   1   0  28  11   1  10   0   8  15   0  75  0.373333
2016-04-15 16:11:37.258 [INFO] nl    1   0   0   3  57   0   6   0   4   4   0  75  0.760000
2016-04-15 16:11:37.258 [INFO] ko    0   0   0   0   0  74   0   0   1   0   0  75  0.986667
2016-04-15 16:11:37.258 [INFO] id    1   0   0   1   2   0  68   0   2   1   0  75  0.906667
2016-04-15 16:11:37.258 [INFO] fa    1   1   0   0   0   0   1  69   0   0   3  75  0.920000
2016-04-15 16:11:37.258 [INFO] en    1   1   0   3   0   0   3   0  57  10   0  75  0.760000
2016-04-15 16:11:37.258 [INFO] da    0   0   0   9   1   0   4   1   6  54   0  75  0.720000
2016-04-15 16:11:37.258 [INFO] ar    0   0   0   0   0   0   0   1   0   0  74  75  0.986667
2016-04-15 16:11:37.258 [INFO] accuracy = 0.807273
```
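
To show how the `class %` column and the final `accuracy` line relate to the counts in a confusion matrix like the one above, here is a minimal sketch. It assumes rows are true labels and columns are predicted labels, which is consistent with each row summing to `N`. The object `ConfusionStats` and its members are hypothetical names for this example and are not taken from the LLClass source.

```scala
import java.util.Locale

// Hypothetical helper, not part of LLClass: recomputes the summary
// figures of the Twitter 11-language confusion matrix.
object ConfusionStats {
  // Label order: zh uk ru no nl ko id fa en da ar
  val twitter11: Array[Array[Int]] = Array(
    Array(74, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
    Array(1, 40, 28, 2, 0, 0, 0, 2, 0, 2, 0),
    Array(0, 4, 71, 0, 0, 0, 0, 0, 0, 0, 0),
    Array(1, 1, 0, 28, 11, 1, 10, 0, 8, 15, 0),
    Array(1, 0, 0, 3, 57, 0, 6, 0, 4, 4, 0),
    Array(0, 0, 0, 0, 0, 74, 0, 0, 1, 0, 0),
    Array(1, 0, 0, 1, 2, 0, 68, 0, 2, 1, 0),
    Array(1, 1, 0, 0, 0, 0, 1, 69, 0, 0, 3),
    Array(1, 1, 0, 3, 0, 0, 3, 0, 57, 10, 0),
    Array(0, 0, 0, 9, 1, 0, 4, 1, 6, 54, 0),
    Array(0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 74))

  // Per-class value: diagonal entry divided by the row total.
  def perClass(m: Array[Array[Int]]): Array[Double] =
    m.indices.map(i => m(i)(i).toDouble / m(i).sum).toArray

  // Overall accuracy: sum of the diagonal divided by the number of trials.
  def accuracy(m: Array[Array[Int]]): Double =
    m.indices.map(i => m(i)(i)).sum.toDouble / m.map(_.sum).sum

  def main(args: Array[String]): Unit = {
    val acc = accuracy(twitter11)
    // Prints: accuracy = 0.807273
    println("accuracy = " + "%.6f".formatLocal(Locale.ROOT, acc))
  }
}
```

The diagonal sums to 666 out of 825 trials, which reproduces the logged accuracy of 0.807273. The largest off-diagonal counts (uk predicted as ru, no predicted as da or nl) show where most of the remaining errors fall.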
