# LLClass
### Introduction
LLClass is a Scala tool that can be used for a number of text classification problems, including:
* Language Identification (LID) - especially Twitter data
* Unreliable Article Style Detector: [Download new model](https://github.com/mitll/LLClass/releases/download/v1.1.4/unreliableArticleEnglish.mod) - trained on in-house hackathon training data
* Automatic text difficulty assessment
* Sentiment analysis
It includes a number of different classifiers, including MIRA, SVM, and a perceptron.
It also includes a simple REST service for classification and some pre-trained models.
More documentation can be found under docs:
* [Low-Resource Twitter LID](http://web.science.mq.edu.au/~smalmasi/vardial4/pdf/VarDial09.pdf)
* [Auto ILR Paper](docs/Shen_Williams_Marius_Salesky_ACL2013.pdf)
See below for some performance benchmarks.
### Build Dependencies
* scala 2.11.8
* sbt
* Java 1.8
### To Compile Source Code and Build
At top-level directory type:
```
sbt assembly
```
Running this command will cause SBT to download some dependencies; this may take some time depending on your internet connection.
If you use a proxy, you may need to adjust your local proxy settings to allow SBT to fetch dependencies.
If you are behind a firewall, you may need to add a configuration file in your `~/.sbt` directory. See the [SBT Behind a Firewall](#sbt-behind-a-firewall) section for more details.
This creates a jar under `target`:
```
[info] Packaging ... target/scala-2.11/LLClass-assembly-1.1.jar
```
For the examples below, you can create a symlink to the jar (from the top-level directory):
```
ln -s target/scala-2.11/LLClass-assembly-1.1.jar LLClass.jar
```
### Data Format Description
* The data should have one line per example, with a tab or whitespace separating the label from the document.
##### Example Data Format:
```
en this is english.
fr quelle langue est-elle?
```
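As an illustration, parsing a line of this format can be sketched as follows (the `DataFormat` helper is hypothetical, not part of LLClass):

```scala
// Sketch: parse an LLClass-style line into (label, document).
// The label is everything before the first tab or space; the rest is the text.
object DataFormat {
  def parseLine(line: String): (String, String) = {
    val idx = line.indexWhere(_.isWhitespace)
    require(idx > 0, s"no label/document separator in: $line")
    (line.take(idx), line.drop(idx).trim)
  }
}
```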
### Default MIRA Parameters
When unspecified, these parameters are set automatically:
* split: 0.10 (90/10 train-test split)
* word-ngram-order: 1
* char-ngram-order: 3
* bkg-min-count: 2
* slack: 0.01
* iterations: 20
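For illustration, here is the quickstart command with the two defaults that have documented flags (`-split`, `-iterations`) spelled out explicitly; flags for the remaining parameters are not shown in this README, so treat this invocation as a sketch:

```shell
java -jar LLClass.jar LID -all test/news4L-500each.tsv.gz -split 0.10 -iterations 20
```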
### Quickstart:
```
java -jar LLClass.jar LID -all test/news4L-500each.tsv.gz
```
### Quickstart Expected Results:
```
(truncated from above)
2015-10-05 15:56:25.912 [INFO] Completed training
2015-10-05 15:56:25.912 [INFO] Training complete.
2015-10-05 15:56:27.325 [INFO] # of trials: 200
2015-10-05 15:56:27.325 [INFO] ru fa es dar N class %
2015-10-05 15:56:27.325 [INFO] ru 50 0 0 0 50 1.000000
2015-10-05 15:56:27.325 [INFO] fa 0 46 0 4 50 0.920000
2015-10-05 15:56:27.326 [INFO] es 0 0 50 0 50 1.000000
2015-10-05 15:56:27.326 [INFO] dar 0 0 0 50 50 1.000000
2015-10-05 15:56:27.326 [INFO] accuracy = 0.***5
```
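The accuracy line is simply the diagonal of the confusion matrix divided by the number of trials; a standalone sketch of that computation (not LLClass code):

```scala
// Sketch: accuracy = correct predictions (matrix diagonal) / total trials.
object ConfusionAccuracy {
  def accuracy(matrix: Seq[Seq[Int]]): Double = {
    val correct = matrix.zipWithIndex.map { case (row, i) => row(i) }.sum
    val total   = matrix.map(_.sum).sum
    correct.toDouble / total
  }
}
```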
### MITLL-LID Options
* Model - save a LID model for later application onto new data
* Log - log the parameters, accuracy per language, overall accuracy, debugging
* Score - generate a file with LID scores on each sentence
* Data - use (-all) to generate models or do a train/test split. Data should be in TSV format and gzipped (gzip myfile.tsv)
* Train/Test - if not using (-all with optional -split), then specify separate train and test sets (-train mytrain.tsv.gz -test mytest.tsv.gz). This is useful for training out-of-domain followed by testing in-domain.
* Output Files - Scored files, model files, and log files are only saved when the user specifies them on command line at runtime
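Producing a gzipped TSV file from labeled lines, as the `-all`/`-train`/`-test` options expect, can be sketched as follows (file names are examples):

```shell
# Sketch: build a gzipped TSV training file in the label<TAB>document format
printf 'en\tthis is english.\nfr\tquelle langue est-elle?\n' > mydata.tsv
gzip -f mydata.tsv    # produces mydata.tsv.gz
```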
#### Use 85/15 train/test split and run for 10 iterations:
```
java -jar LLClass.jar LID -all test/news4L-500each.tsv.gz -split 0.15 -iterations 10
```
#### Save score files, model files, and log files, use 85/15 train/test split (optional - specify and save the resulting model, log and score files):
```
java -jar LLClass.jar LID -all test/news4L-500each.tsv.gz -split 0.15 -iterations 30 -model news4L.mod -log news4L.log -score news4L.score
```
#### Apply an existing model to new test data
```
java -jar LLClass.jar LID -test test/no_nl_da_en_5K.tsv.gz -model models/fourLang.mod.gz
```
#### Train and test on different data sets
```
java -jar LLClass.jar LID -train test/no_nl_da_en_5k.tsv.gz -test test/no_nl_da_en_500.tsv.gz
```
### Calling from Java/Scala
There are two main functions to score text.
* textLID() returns the language code and a confidence value for that code.
* textLIDFull() returns a set of language labels ranked by most likely to least likely and a confidence value for each one. The confidence values range [0,1] where larger numbers imply higher confidence.
#### Steps to perform language identification via LLClass
1) Import the LLClass language ID package
```
import mitll.lid
```
2) Create an instance of the `Scorer` class and specify the LID model
```
var newsRunner = new lid.Scorer("path/to/lid/model")
```
3) Call the function `mitll.lid.Scorer.textLID()`
```
var (language, confidence) = newsRunner.textLID("what language is this text string?")
```
4) Or call the function `mitll.lid.Scorer.textLIDFull()`
```
var langConfArray : Array[(Symbol,Double)] = newsRunner.textLIDFull("what language is this text string?")
```
### REST service
* Start the service
```
java -jar LLClass.jar REST
```
* Simple text call in a browser (http://localhost:8080/classify?q=...)
```
http://localhost:8080/classify?q=No%20quiero%20pagar%20la%20cuenta%20del%20hospital.
```
```
es
```
* Curl
```
curl --noproxy localhost http://localhost:8080/classify?q=Necesito%20pagar%20los%20servicios%20de%20electricidad%20y%20cable.
es
```
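The `%20` escapes in these URLs are standard percent-encoding; building such a query URL can be sketched as follows (the `ClassifyUrl` helper is hypothetical; note that `java.net.URLEncoder` encodes spaces as `+`, so they are swapped for `%20` to match the examples above):

```scala
import java.net.URLEncoder

// Sketch: build a /classify query URL with a percent-encoded text parameter.
object ClassifyUrl {
  def build(host: String, text: String): String = {
    val q = URLEncoder.encode(text, "UTF-8").replace("+", "%20")
    s"$host/classify?q=$q"
  }
}
```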
* Return JSON (http://localhost:8080/classify/json?q=...)
```
http://localhost:8080/classify/json?q=%22Necesito%20pagar%20los%20servicios%20de%20electricidad%20y%20cable.%22
```
```javascript
{
class: "es",
confidence: "47.82"
}
```
* Model labels
```
http://localhost:8080/labels
```
```javascript
{
labels: [
"dar",
"es",
"fa",
"ru"
]
}
```
* JSON scores for all labels in model (http://localhost:8080/classify/all/json?q=...)
```
http://localhost:8080/classify/all/json?q=%22Necesito%20pagar%20los%20servicios%20de%20electricidad%20y%20cable.%22
```
```javascript
{
results: [
{
class: "es",
confidence: 0.7623826612993418
},
{
class: "dar",
confidence: -0.128890820239746
},
{
class: "ru",
confidence: -0.26141672506577***
},
{
class: "fa",
confidence: -0.37207511599381426
}
]
}
```
* See RESTServiceSpec for details
### Tests
* LIDSpec has more usage examples.
* EvalSpec has tests that show swapping out an LLClass classifier for a langid.py service classifier.
* TwitterEvalSpec has tests for running against Twitter data from [Evaluating language identification performance](https://blog.twitter.com/2015/evaluating-language-identification-performance)
* RESTServiceSpec shows variations on running the RESTService
### SBT Behind a Firewall
* You may need to add a `repositories` file like this under your `~/.sbt` directory:
```
[repositories]
local
my-ivy-proxy-releases: https://repo.typesafe.com/typesafe/ivy-releases/, [organization]/[module]/(scala_[scalaVersion]/)(sbt_[sbtVersion]/)[revision]/[type]s/[artifact](-[classifier]).[ext]
my-maven-proxy-releases: https://repo1.maven.org/maven2/
```
### Twitter Results
#### Tweet normalization
Tweet normalization was done by the [LLString tweet normalizer](https://github.com/mitll/LLString.git). See that repo for instructions on how to install it and run the python normalization script.
### Twitter: Evaluating language identification performance
From [Evaluating language identification performance](https://blog.twitter.com/2015/evaluating-language-identification-performance)
#### Precision dataset
Note that the precision_oriented dataset had 69000 tweets, but we could only actually download 51567 of them.
##### Train/Test 85/15 split all labels, no text normalization, minimum 500 examples per label
Labels with fewer than 500 examples were excluded.
|Info|Value|
|----|-----|
|Train|44352|
|Test|6654|
|Labels(44)|ar,bn,ckb,de,el,en,es,fa,fr,gu,he,hi,hi-Latn,hy,id,it,ja,ka,km,kn,lo,ml,mr,my,ne,nl,pa,pl,ps,pt,ru,sd,si,sr,sv,ta,te,th,und,ur,vi,zh-CN,zh-TW|
|Accuracy|0.86654***5|
##### Train/Test 85/15 split all labels, no text normalization but skip the und label
The und label marked undefined tweets which could match several languages.
|Info|Value|
|----|-----|
|Train|34546|
|Test|5183|
|Labels(43)|ar,bn,ckb,de,el,en,es,fa,fr,gu,he,hi,hi-Latn,hy,id,it,ja,ka,km,kn,lo,ml,mr,my,ne,nl,pa,pl,ps,pt,ru,sd,si,sr,sv,ta,te,th,ur,vi,zh-CN,zh-TW|
|Accuracy|0.94***36|
This model can be found in the release directory if you want to try it yourself.
##### Train/Test 85/15 split all labels, with text normalization, minimum 500 examples per label
We ran a Python script to normalize tweet text, removing markup, hashtags, etc.
|Info|Value|
|----|-----|
|Train|44080|
|Test|6614|
|Labels(44)|ar,bn,ckb,de,el,en,es,fa,fr,gu,he,hi,hi-Latn,hy,id,it,ja,ka,km,kn,lo,ml,mr,my,ne,nl,pa,pl,ps,pt,ru,sd,si,sr,sv,ta,te,th,und,ur,vi,zh-CN,zh-TW|
|Accuracy|0.86815846|
##### Train/Test 85/15 split all labels, with text normalization but skip the und label
|Info|Value|
|----|-----|
|Train|34540|
|Test|5183|
|Labels(43)|ar,bn,ckb,de,el,en,es,fa,fr,gu,he,hi,hi-Latn,hy,id,it,ja,ka,km,kn,lo,ml,mr,my,ne,nl,pa,pl,ps,pt,ru,sd,si,sr,sv,ta,te,th,ur,vi,zh-CN,zh-TW|
|Accuracy|0.95504534|
#### Recall
##### Train/Test 85/15 split all labels, no text normalization, minimum 500 examples per label
Initial data had 72000 of 87585 tweets from recall_oriented.
|Info|Value|
|----|-----|
|Train|71196|
|Test|10682|
|Labels (67)|am, ar, bg, bn, bo, bs, ca, ckb, cs, cy, da, de, dv, el, en, es, et, eu, fa, fi, fr, gu, he, hi, hi-Latn, hr, ht, hu, hy, id, is, it, ja, ka, km, kn, ko, lo, lv, ml, mr, my, ne, nl, no, pa, pl, ps, pt, ro, ru, sd, si, sk, sl, sr, sv, ta, te, th, tl, tr, uk, ur, vi, zh-CN, zh-TW|
|Accuracy|0.9237971|
See model in releases.
##### Train/Test 85/15 split all labels, with text normalization, minimum 500 examples per label
Initial data had 72000 of 87585 tweets from recall_oriented.
|Info|Value|
|----|-----|
|Train|71187|
|Test|10680|
|Labels (67)|am, ar, bg, bn, bo, bs, ca, ckb, cs, cy, da, de, dv, el, en, es, et, eu, fa, fi, fr, gu, he, hi, hi-Latn, hr, ht, hu, hy, id, is, it, ja, ka, km, kn, ko, lo, lv, ml, mr, my, ne, nl, no, pa, pl, ps, pt, ro, ru, sd, si, sk, sl, sr, sv, ta, te, th, tl, tr, uk, ur, vi, zh-CN, zh-TW|
|Accuracy|0.92387***|
#### Uniform
##### Train/Test 85/15 split all labels, no text normalization, minimum 500 examples per label
|Info|Value|
|----|-----|
|Train|7***42|
|Test|11467|
|Labels (13)|ar, en, es, fr, id, ja, ko, pt, ru, th, tl, tr, und|
|Accuracy|0.90***282|
##### Train/Test 85/15 split all labels, no text normalization, minimum 500 examples per label, no und label
|Info|Value|
|----|-----|
|Train|7***42|
|Test|10402|
|Labels (12)|ar, en, es, fr, id, ja, ko, pt, ru, th, tl, tr|
|Accuracy|0.9699096|
##### Train/Test 85/15 split all labels, with text normalization, minimum 500 examples per label
|Info|Value|
|----|-----|
|Train|69344|
|Test|11228|
|Labels (13)|ar, en, es, fr, id, ja, ko, pt, ru, th, tl, tr, und|
|Accuracy|0.9291058|
##### Train/Test 85/15 split all labels, with text normalization, minimum 500 examples per label, no und label
|Info|Value|
|----|-----|
|Train|69338|
|Test|10401|
|Labels (12)|ar, en, es, fr, id, ja, ko, pt, ru, th, tl, tr|
|Accuracy|0.***08672|
### Freetext
#### [Europarl](https://code.google.com/archive/p/language-detection/downloads)
|Info|Value|
|----|-----|
|Train|74854|
|Test|3150|
|Labels (21)|bg cs da de el en es et fi fr hu it lt lv nl pl pt ro sk sl sv|
|Accuracy|0.9***41267|
See releases for model.
### Twitter 11 Languages small dataset
```
2016-04-15 16:11:37.257 [INFO] # of trials: 825
2016-04-15 16:11:37.258 [INFO] zh uk ru no nl ko id fa en da ar N class %
2016-04-15 16:11:37.258 [INFO] zh 74 1 0 0 0 0 0 0 0 0 0 75 0.***6667
2016-04-15 16:11:37.258 [INFO] uk 1 40 28 2 0 0 0 2 0 2 0 75 0.533333
2016-04-15 16:11:37.258 [INFO] ru 0 4 71 0 0 0 0 0 0 0 0 75 0.946667
2016-04-15 16:11:37.258 [INFO] no 1 1 0 28 11 1 10 0 8 15 0 75 0.373333
2016-04-15 16:11:37.258 [INFO] nl 1 0 0 3 57 0 6 0 4 4 0 75 0.760000
2016-04-15 16:11:37.258 [INFO] ko 0 0 0 0 0 74 0 0 1 0 0 75 0.***6667
2016-04-15 16:11:37.258 [INFO] id 1 0 0 1 2 0 68 0 2 1 0 75 0.906667
2016-04-15 16:11:37.258 [INFO] fa 1 1 0 0 0 0 1 69 0 0 3 75 0.920000
2016-04-15 16:11:37.258 [INFO] en 1 1 0 3 0 0 3 0 57 10 0 75 0.760000
2016-04-15 16:11:37.258 [INFO] da 0 0 0 9 1 0 4 1 6 54 0 75 0.720000
2016-04-15 16:11:37.258 [INFO] ar 0 0 0 0 0 0 0 1 0 0 74 75 0.***6667
2016-04-15 16:11:37.258 [INFO] accuracy = 0.807273
```