kaiseki

Category: feature extraction
Language: Rust
File size: 0KB
Downloads: 0
Upload date: 2023-09-12 03:12:08
Uploader: sh-1993
Description: A Japanese tokenizer and morphological analyzer.

File list:
Cargo.lock (2862, 2023-09-17)
Cargo.toml (535, 2023-09-17)
LICENSE (1081, 2023-09-17)
bin/ (0, 2023-09-17)
bin/char.bin (369865, 2023-09-17)
bin/dict.bin (16218410, 2023-09-17)
bin/matrix.bin (4986593, 2023-09-17)
bin/term.fst (2158351, 2023-09-17)
bin/unk.bin (683, 2023-09-17)
ipadic-install.sh (317, 2023-09-17)
mecab/ (0, 2023-09-17)
mecab/.keep (0, 2023-09-17)
src/ (0, 2023-09-17)
src/bin/ (0, 2023-09-17)
src/bin/build.rs (7424, 2023-09-17)
src/bincode.rs (702, 2023-09-17)
src/char.rs (1625, 2023-09-17)
src/conjugation.rs (11711, 2023-09-17)
src/dict.rs (703, 2023-09-17)
src/error.rs (1281, 2023-09-17)
src/feature.rs (1225, 2023-09-17)
src/fst.rs (1626, 2023-09-17)
src/inflection.rs (5773, 2023-09-17)
src/lattice.rs (4381, 2023-09-17)
src/lib.rs (539, 2023-09-17)
src/matrix.rs (740, 2023-09-17)
src/morpheme.rs (3830, 2023-09-17)
src/pos.rs (7662, 2023-09-17)
src/row.rs (1336, 2023-09-17)
src/term.rs (860, 2023-09-17)
src/tokenizer.rs (6854, 2023-09-17)
src/unk.rs (1175, 2023-09-17)
src/word.rs (4345, 2023-09-17)

# kaiseki

kaiseki (解析) is a Japanese tokenizer and morphological analyzer using [mecab-ipadic](https://taku910.github.io/mecab/), inspired by [this article](https://towardsdatascience.com/how-japanese-tokenizers-work-87ab6b256984).

## Usage

kaiseki supports both morpheme tokenization and word tokenization (inflections included). It also provides additional information from the MeCab dictionary, such as part of speech and conjugation form.

```rust
use kaiseki::{Tokenizer, error::Error};

fn main() -> Result<(), Error> {
    let tokenizer = Tokenizer::new()?;

    let morphemes = tokenizer.tokenize("東京都に住んでいる");
    let morphemes: Vec<_> = morphemes.iter().map(|m| &m.text).collect();
    println!("{:?}", morphemes); // ["東京", "都", "に", "住ん", "で", "いる"]

    let words = tokenizer.tokenize_word("東京都に住んでいる");
    let words: Vec<_> = words.iter().map(|w| &w.text).collect();
    println!("{:?}", words); // ["東京", "都", "に", "住んでいる"]

    Ok(())
}
```

## Test

```sh
cargo test
```

## Credits

- The [MeCab project](https://taku910.github.io/mecab/) for providing the dictionary and data used for tokenizing.
- [kotori](https://github.com/wanasit/kotori) and [kuromoji-rs](https://github.com/fulmicoton/kuromoji-rs) for reference.

## Articles

- [How Japanese Tokenizers Work](https://towardsdatascience.com/how-japanese-tokenizers-work-87ab6b256984)
- [日本語形態素解析の裏側を覗く!MeCab はどのように形態素解析しているか](https://techlife.cookpad.com/entry/2016/05/11/170000)

## License

MIT License.
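The MeCab-style approach referenced above (and visible in the repo's `lattice.rs` and `matrix.rs`) looks up dictionary entries at every input position and picks the minimum-cost path through the resulting lattice. A self-contained sketch of that idea, using a toy dictionary with unigram costs only (real MeCab also adds connection costs between adjacent morphemes from `matrix.bin`; the words and costs below are illustrative, not from ipadic):

```rust
use std::collections::HashMap;

/// Minimum-cost segmentation over a word lattice (conceptual sketch).
/// `dict` maps a word to its occurrence cost; lower is more likely.
fn segment(input: &str, dict: &HashMap<&str, i32>) -> Vec<String> {
    let chars: Vec<char> = input.chars().collect();
    let n = chars.len();
    // best[i] = Some((min cost to reach char position i, start of last word))
    let mut best: Vec<Option<(i32, usize)>> = vec![None; n + 1];
    best[0] = Some((0, 0));
    for i in 0..n {
        let Some((cost, _)) = best[i] else { continue };
        for j in (i + 1)..=n {
            let cand: String = chars[i..j].iter().collect();
            if let Some(&w) = dict.get(cand.as_str()) {
                let total = cost + w;
                if best[j].map_or(true, |(c, _)| total < c) {
                    best[j] = Some((total, i));
                }
            }
        }
    }
    // Walk back from the end of the input to recover the best path.
    let mut out = Vec::new();
    let mut pos = n;
    while pos > 0 {
        let (_, start) = best[pos].expect("input not segmentable with this dictionary");
        out.push(chars[start..pos].iter().collect());
        pos = start;
    }
    out.reverse();
    out
}

fn main() {
    let dict = HashMap::from([
        ("東京", 1), ("東", 3), ("京", 3), ("都", 1),
        ("に", 1), ("住ん", 2), ("で", 1), ("いる", 1),
    ]);
    let tokens = segment("東京都に住んでいる", &dict);
    println!("{:?}", tokens); // ["東京", "都", "に", "住ん", "で", "いる"]
}
```

The cheaper single entry for 東京 beats splitting it into 東 + 京, which is how the lattice search prefers the longer dictionary word without any hand-written longest-match rule.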
