File list (size in bytes, date):
Cargo.lock (2862, 2023-09-17)
Cargo.toml (535, 2023-09-17)
LICENSE (1081, 2023-09-17)
bin/ (0, 2023-09-17)
bin/char.bin (369865, 2023-09-17)
bin/dict.bin (16218410, 2023-09-17)
bin/matrix.bin (4986593, 2023-09-17)
bin/term.fst (2158351, 2023-09-17)
bin/unk.bin (683, 2023-09-17)
ipadic-install.sh (317, 2023-09-17)
mecab/ (0, 2023-09-17)
mecab/.keep (0, 2023-09-17)
src/ (0, 2023-09-17)
src/bin/ (0, 2023-09-17)
src/bin/build.rs (7424, 2023-09-17)
src/bincode.rs (702, 2023-09-17)
src/char.rs (1625, 2023-09-17)
src/conjugation.rs (11711, 2023-09-17)
src/dict.rs (703, 2023-09-17)
src/error.rs (1281, 2023-09-17)
src/feature.rs (1225, 2023-09-17)
src/fst.rs (1626, 2023-09-17)
src/inflection.rs (5773, 2023-09-17)
src/lattice.rs (4381, 2023-09-17)
src/lib.rs (539, 2023-09-17)
src/matrix.rs (740, 2023-09-17)
src/morpheme.rs (3830, 2023-09-17)
src/pos.rs (7662, 2023-09-17)
src/row.rs (1336, 2023-09-17)
src/term.rs (860, 2023-09-17)
src/tokenizer.rs (6854, 2023-09-17)
src/unk.rs (1175, 2023-09-17)
src/word.rs (4345, 2023-09-17)
# kaiseki
kaiseki (解析) is a Japanese tokenizer and morphological analyzer using [mecab-ipadic](https://taku910.github.io/mecab/), inspired by [this article](https://towardsdatascience.com/how-japanese-tokenizers-work-87ab6b256984).
## Usage
kaiseki supports both morpheme tokenization and word tokenization (inflections included). It also provides additional information from the mecab dictionary, such as part of speech and conjugation form.
```rust
use kaiseki::{Tokenizer, error::Error};

fn main() -> Result<(), Error> {
    let tokenizer = Tokenizer::new()?;

    let morphemes = tokenizer.tokenize("東京都に住んでいる");
    let morphemes: Vec<_> = morphemes.iter().map(|m| &m.text).collect();
    println!("{:?}", morphemes); // ["東京", "都", "に", "住ん", "で", "いる"]

    let words = tokenizer.tokenize_word("東京都に住んでいる");
    let words: Vec<_> = words.iter().map(|w| &w.text).collect();
    println!("{:?}", words); // ["東京", "都", "に", "住んでいる"]

    Ok(())
}
```
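Under the hood, MeCab-style analyzers (see the lattice.rs and matrix.rs sources in this repo) build a lattice of every dictionary word that can start at each position in the input, then pick the segmentation with the lowest total cost via dynamic programming. The following is a toy, self-contained sketch of that idea, not kaiseki's actual API; the dictionary and per-word costs are invented for illustration, and real MeCab additionally adds connection costs between adjacent words (the `matrix.bin` file above):

```rust
// Toy lattice-based segmenter: among all ways to cover the input with
// dictionary words, return the one with the minimum total word cost.
fn segment(input: &str, dict: &[(&str, i64)]) -> Vec<String> {
    let chars: Vec<char> = input.chars().collect();
    let n = chars.len();
    // best[i] = (cost of the cheapest segmentation of chars[..i],
    //            start index of the last word in that segmentation)
    let mut best: Vec<Option<(i64, usize)>> = vec![None; n + 1];
    best[0] = Some((0, 0));
    for i in 0..n {
        let Some((cost_so_far, _)) = best[i] else { continue };
        for (word, cost) in dict {
            let w: Vec<char> = word.chars().collect();
            if chars[i..].starts_with(&w) {
                let j = i + w.len();
                let cand = cost_so_far + cost;
                if best[j].map_or(true, |(c, _)| cand < c) {
                    best[j] = Some((cand, i));
                }
            }
        }
    }
    // Walk back from the end to recover the chosen words.
    let mut out: Vec<String> = Vec::new();
    let mut i = n;
    while i > 0 {
        let (_, start) = best[i].expect("input not fully covered by dictionary");
        out.push(chars[start..i].iter().collect());
        i = start;
    }
    out.reverse();
    out
}

fn main() {
    // Hypothetical costs: "東京" + "都" (100 + 80) beats "東" + "京都" (150 + 120).
    let dict = [
        ("東京", 100), ("京都", 120), ("東", 150),
        ("京", 150), ("都", 80), ("に", 10),
    ];
    println!("{:?}", segment("東京都に", &dict)); // ["東京", "都", "に"]
}
```

The real analyzer works the same way in spirit, but looks words up in an FST (`term.fst`), scores unknown words via character classes (`char.bin`, `unk.bin`), and adds a part-of-speech connection cost from `matrix.bin` to each edge of the lattice.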
## Test
```sh
cargo test
```
## Credits
- The [Mecab Project](https://taku910.github.io/mecab/) for providing the dictionary and data used for tokenizing.
- [kotori](https://github.com/wanasit/kotori) and [kuromoji-rs](https://github.com/fulmicoton/kuromoji-rs), which served as references.
## Articles
- [How Japanese Tokenizers Work](https://towardsdatascience.com/how-japanese-tokenizers-work-87ab6b256984).
- [日本語形態素解析の裏側を覗く!MeCab はどのように形態素解析しているか](https://techlife.cookpad.com/entry/2016/05/11/170000).
## License
MIT License.