TinySegmenter

Category: collect
Language: Julia
Uploaded: 2020-11-24 21:03:25
Uploader: sh-1993
Description: Julia version of TinySegmenter, a compact Japanese tokenizer

File list:
.travis.yml (402, 2020-11-24)
LICENSE.md (1557, 2020-11-24)
Project.toml (201, 2020-11-24)
appveyor.yml (1325, 2020-11-24)
benchmark/ (0, 2020-11-24)
benchmark/Gemfile (63, 2020-11-24)
benchmark/Gemfile.lock (156, 2020-11-24)
benchmark/benchmark.jl (177, 2020-11-24)
benchmark/benchmark.js (360, 2020-11-24)
benchmark/benchmark.py (546, 2020-11-24)
benchmark/benchmark.rb (257, 2020-11-24)
benchmark/benchmark.sh (335, 2020-11-24)
benchmark/download.sh (70, 2020-11-24)
benchmark/requirements.txt (55, 2020-11-24)
benchmark/test_tinysegmenter.py (1314, 2020-11-24)
benchmark/tiny_segmenter-0.2.js (20640, 2020-11-24)
src/ (0, 2020-11-24)
src/TinySegmenter.jl (28444, 2020-11-24)
test/ (0, 2020-11-24)
test/runtests.jl (732, 2020-11-24)
test/timemachineu8j.tokenized.txt (409139, 2020-11-24)
test/timemachineu8j.txt (248537, 2020-11-24)

# TinySegmenter

[![Build Status](https://travis-ci.org/JuliaStrings/TinySegmenter.jl.svg?branch=master)](https://travis-ci.org/JuliaStrings/TinySegmenter.jl)

TinySegmenter.jl is a Julia version of [TinySegmenter](http://chasen.org/~taku/software/TinySegmenter/), which is an extremely compact Japanese tokenizer originally written in JavaScript by Mr. Taku Kudo.

## Usage

```jl
using TinySegmenter
join(tokenize("私の名前は中野です"), " | ") # "私 | の | 名前 | は | 中野 | です"
```

The return value of `tokenize` is an array of substrings of the string input, giving the locations of the tokens in the text. (Substrings are represented by the `SubString` Julia type.)

## Benchmarks

The following are times in seconds for a benchmark (see [benchmark/README.md](benchmark/README.md)) of TinySegmenter implementations in different languages tokenizing a large (243kB) Japanese text:

| Ruby | C++ | Perl | JavaScript (Node.js) | Go | Python | Julia |
|---|---|---|---|---|---|---|
| 132.98 | 48 | 134 | 105.31 | 10.50 | 111.85 | 11.70 |

The benchmark was performed on the following machine:

- Intel Core i5-3210M CPU at 2.50GHz
- 8GB RAM (1600MHz DDR3)
- MacBook Pro (Retina, 13-inch, Late 2012), MacOS 10.11 ("El Capitan")

The [benchmark text](http://www.genpaku.org/timemachine/timemachineu8j.txt) was [The Time Machine](https://en.wikipedia.org/wiki/The_Time_Machine) by H.G. Wells, translated to Japanese by Hiroo Yamagata under the CC BY-SA 2.0 License. We also use the same text for validation (in the `test` directory).
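
As a follow-up to the Usage note that `tokenize` returns `SubString` values, here is a minimal sketch of how token locations can be read back from such values. It uses only Base Julia and does not call TinySegmenter: the `tokens` array is built by hand as a hypothetical stand-in for `tokenize(text)`, and the `spans` list is an assumption chosen to match the example output shown in Usage. The `offset` field is an internal detail of `SubString` rather than documented API, so treat its use here as an assumption.

```julia
text = "私の名前は中野です"

# Byte index at which each character starts (1, 4, 7, ... for this UTF-8 text).
idx = collect(eachindex(text))

# Hypothetical stand-in for `tokenize(text)`: character spans chosen by hand
# to reproduce the tokens shown in the Usage example.
spans = [(1, 1), (2, 2), (3, 4), (5, 5), (6, 7), (8, 9)]
tokens = [SubString(text, idx[a], idx[b]) for (a, b) in spans]

# Each token is a view into `text`, so its byte range can be recovered.
# `offset` is an internal SubString field (0-based byte offset into the parent).
locations = [(String(t), t.offset + 1, t.offset + ncodeunits(t)) for t in tokens]

for (tok, first_byte, last_byte) in locations
    println(tok, " => bytes ", first_byte, ":", last_byte)
end
```

Because the tokens are views rather than copies, `parent(t)` gives back the original string for any token `t`, which is what makes it possible to map tokens to positions in the input.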
