DPnorm

Category: collect
Development tool: Julia
File size: 0KB
Downloads: 0
Upload date: 2019-10-02 13:50:05
Uploader: sh-1993
Description: DPnorm Calculator

File list:
DPnorm.jl (2550, 2019-10-02)
Manifest.toml (20705, 2019-10-02)
Project.toml (611, 2019-10-02)
includes/ (0, 2019-10-02)
includes/data.jl (2332, 2019-10-02)
includes/english-contractions-list.txt (498, 2019-10-02)
includes/stats.jl (939, 2019-10-02)

# DPnorm

Calculate the DPnorm values ([http://www.stgries.info/research/2010_STG_DispersionAdjFreq_CorpLingAppl.pdf](https://github.com/jaypmorgan/DPnorm/blob/master/)) for a corpus.

## Usage

This script assumes that the corpus has already been split into one file per corpus part before execution. When running the script, specify the directory where the xml/txt files for the corpus parts are located, along with the path of the output CSV file that will contain the resulting DPnorm score for each token.

Example:

```bash
julia DPnorm.jl --input data/ --output scores.csv --punctuation '!?.*-'
```

The punctuation characters are quoted so that the shell does not interpret `!`, `?`, or `*` before they reach the script.

### Input

The `input` flag should specify the directory where all of the xml/txt files are located. Each file in the directory is assumed to be a single corpus part.

### Output

The tool writes a CSV file with three columns (tokens, frequency, dpnorm) to the location specified by the `output` flag.

CSV column types:

| Column Name | Type    | Example | Description                                                     |
|-------------|---------|---------|-----------------------------------------------------------------|
| token       | String  | ties    | The word/token from the corpus                                  |
| frequency   | Integer | 20      | The total number of occurrences of the token in the corpus      |
| dpnorm      | Float   | 0.192   | The resulting DPnorm score for the token, in the range [0, 1]   |

### Removing Punctuation

The tool includes an additional command line argument, `punctuation`, where you can supply a list of tokens (each assumed to be a single character) to remove from the text before computing the scores.
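
To make the steps described above more concrete (one token list per corpus part, optional punctuation removal, then a frequency and a DPnorm score per token), here is a minimal Python sketch. It is not the tool's Julia code. It assumes the commonly used definition DP = 0.5 · Σ|v_i − s_i|, where s_i is the share of the corpus held by part i and v_i is the share of a token's occurrences found in part i, normalized by dividing by 1 − min(s_i). The function names `strip_punctuation` and `dpnorm_scores` are hypothetical, and tokenization is left to the caller; the actual script may tokenize and normalize differently.

```python
from collections import Counter


def strip_punctuation(text, punctuation):
    """Remove every single character listed in `punctuation` from `text`."""
    return text.translate(str.maketrans("", "", punctuation))


def dpnorm_scores(parts):
    """Return {token: (frequency, dpnorm)} given one list of tokens per corpus part."""
    part_sizes = [len(part) for part in parts]
    corpus_size = sum(part_sizes)
    if corpus_size == 0:
        return {}

    # s_i: expected share of each part (its size relative to the whole corpus).
    expected = [size / corpus_size for size in part_sizes]
    # Assumed normalization: the largest DP a token can reach is 1 - min(s_i).
    max_dp = 1 - min(expected)

    # Per-part token counts and corpus-wide totals.
    part_counts = [Counter(part) for part in parts]
    totals = Counter()
    for counts in part_counts:
        totals.update(counts)

    scores = {}
    for token, freq in totals.items():
        # v_i: observed share of this token's occurrences that fall in part i.
        dp = 0.5 * sum(
            abs(counts[token] / freq - share)
            for counts, share in zip(part_counts, expected)
        )
        # With a single part max_dp is 0; every token is then trivially "even".
        scores[token] = (freq, dp / max_dp if max_dp > 0 else 0.0)
    return scores
```

For two equal-sized parts `["the", "cat", "sat"]` and `["the", "dog", "sat"]`, this sketch gives `the` a score of 0.0 (spread evenly across parts) and `cat` a score of 1.0 (confined to a single part), matching the [0, 1] range given in the column table.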
