DPnorm
Category: collect
Development tool: Julia
Upload date: 2019-10-02 13:50:05
Uploader: sh-1993
Description: DPnorm calculator
File list:
DPnorm.jl (2550, 2019-10-02)
Manifest.toml (20705, 2019-10-02)
Project.toml (611, 2019-10-02)
includes/ (0, 2019-10-02)
includes/data.jl (2332, 2019-10-02)
includes/english-contractions-list.txt (498, 2019-10-02)
includes/stats.jl (939, 2019-10-02)
# DPnorm
Calculate the DPnorm values ([http://www.stgries.info/research/2010_STG_DispersionAdjFreq_CorpLingAppl.pdf](http://www.stgries.info/research/2010_STG_DispersionAdjFreq_CorpLingAppl.pdf)) for each token in a corpus.
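To make the score concrete, here is a minimal sketch of the calculation as DPnorm is usually defined in the dispersion literature: DP is half the sum of absolute differences between the share of a token's occurrences found in each corpus part and that part's share of the whole corpus, and DPnorm divides DP by one minus the smallest part share. The sketch is written in Python for illustration only. The function name `dpnorm_scores` and the list-of-token-lists input are assumptions, and the tool's own Julia code may differ in detail.

```python
from collections import Counter


def dpnorm_scores(parts):
    """Return {token: (frequency, dpnorm)} for a list of tokenised corpus parts.

    `parts` is a list of token lists, one per corpus part (an assumed input
    shape, not necessarily what DPnorm.jl uses internally).
    """
    part_sizes = [len(part) for part in parts]
    corpus_size = sum(part_sizes)
    if corpus_size == 0:
        return {}

    # s_i: share of the whole corpus that part i accounts for.
    expected = [size / corpus_size for size in part_sizes]
    # Largest value DP can reach for this particular split into parts.
    norm = 1.0 - min(expected)

    part_counts = [Counter(part) for part in parts]
    totals = Counter()
    for counts in part_counts:
        totals.update(counts)

    scores = {}
    for token, freq in totals.items():
        # v_i: share of this token's occurrences that fall in part i.
        dp = 0.5 * sum(
            abs(counts[token] / freq - share)
            for counts, share in zip(part_counts, expected)
        )
        # With a single part, norm is 0 and DP is 0 as well.
        scores[token] = (freq, dp / norm if norm > 0 else 0.0)
    return scores
```

For two equally sized parts `[["a", "b"], ["a", "c"]]`, the token `a` is spread evenly and scores 0, while `b` occurs in one part only and scores 1.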
## Usage
This script assumes that the corpus has been split into one file per corpus part before execution. When running the script, specify the directory where the xml/txt files for the corpus parts are located, along with the path of the output CSV that will contain the resulting DPnorm score for each token.
Example:
```bash
julia DPnorm.jl --input data/ --output scores.csv --punctuation '!?.*-'
```
### Input
The `input` flag should specify the directory where all of the xml/txt files are located. Each file in this directory is assumed to be a single corpus part.
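The one-file-per-part convention can be sketched as follows. This is an illustration in Python, not the tool's Julia code. The helper name `find_corpus_parts` is hypothetical, and the sketch assumes that only files directly inside the directory count and that extensions are matched case-insensitively.

```python
from pathlib import Path


def find_corpus_parts(input_dir, extensions=(".xml", ".txt")):
    """Return the xml/txt files directly inside input_dir, sorted by path.

    Each returned file is treated as one corpus part. Subdirectories and
    files with other extensions are ignored (an assumption of this sketch).
    """
    return sorted(
        path
        for path in Path(input_dir).iterdir()
        if path.is_file() and path.suffix.lower() in extensions
    )
```

Calling `find_corpus_parts("data/")` would then give the list of corpus parts for the example command above.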
### Output
The tool writes a CSV file with three columns (token, frequency, dpnorm) to the location specified by the `output` flag.
CSV column types:
| Column Name | Type | Example | Description |
|-------------|---------|---------|--------------------------------------------------------------|
| token | String | ties | The word/token from the corpus |
| frequency | Integer | 20 | The total number of occurrences of token in the corpus |
| dpnorm | Float | 0.192 | The resulting DPnorm score for token within the range [0,1]. |
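
A short sketch of loading the output for further analysis, using the column names and types from the table above. It is written in Python for illustration and assumes a comma-separated, UTF-8 file whose header row uses exactly those column names. The function name `read_scores` is hypothetical.

```python
import csv


def read_scores(csv_path):
    """Load the output CSV into a list of (token, frequency, dpnorm) tuples."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return [
            # Convert each column to the type listed in the table.
            (row["token"], int(row["frequency"]), float(row["dpnorm"]))
            for row in csv.DictReader(handle)
        ]
```

For example, `sorted(read_scores("scores.csv"), key=lambda row: row[2])` lists tokens from the most evenly dispersed (dpnorm near 0) to the least (dpnorm near 1).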
### Removing Punctuation
The tool includes an additional command line argument, `punctuation`, where you can supply a list of tokens (each assumed to be a single character) to remove from the text before computing the scores.
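
The effect of this option can be sketched as below. This is a Python illustration with a hypothetical helper name, `strip_punctuation`. It assumes the listed characters are deleted outright rather than replaced by spaces, which the tool's Julia code may handle differently.

```python
def strip_punctuation(text, punctuation):
    """Delete every character listed in `punctuation` from `text`.

    `punctuation` is a string such as "!?.*-", where each character is one
    token to remove.
    """
    # A translation table with only a deletion set removes those characters.
    return text.translate(str.maketrans("", "", punctuation))
```

With the characters from the example command, `strip_punctuation("Wait... what?! No-way*", "!?.*-")` returns `"Wait what Noway"`.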