cang-jie

Category: Feature extraction
Development tool: Rust
File size: 7KB
Downloads: 0
Upload date: 2023-02-12 02:43:42
Uploader: sh-1993
Description: Chinese tokenizer for tantivy, based on jieba-rs

File list:
Cargo.toml (452, 2023-06-11)
LICENSE (1086, 2023-06-11)
pre-commit (159, 2023-06-11)
rustfmt.toml (122, 2023-06-11)
src (0, 2023-06-11)
src\lib.rs (200, 2023-06-11)
src\options.rs (425, 2023-06-11)
src\stream.rs (1286, 2023-06-11)
src\tokenizer.rs (1526, 2023-06-11)
tests (0, 2023-06-11)
tests\unicode_split.rs (2735, 2023-06-11)

# cang-jie([仓颉](https://en.wikipedia.org/wiki/Cangjie))

[![Crates.io](https://img.shields.io/crates/v/cang-jie.svg)](https://crates.io/crates/cang-jie)
[![latest document](https://img.shields.io/badge/latest-document-ff69b4.svg)](https://docs.rs/cang-jie/)
[![dependency status](https://deps.rs/repo/github/dcjanus/cang-jie/status.svg)](https://deps.rs/repo/github/dcjanus/cang-jie)

A Chinese tokenizer for [tantivy](https://github.com/tantivy-search/tantivy), based on [jieba-rs](https://github.com/messense/jieba-rs).

Currently, only UTF-8 input is supported.

## Example

```rust
let mut schema_builder = SchemaBuilder::default();
let text_indexing = TextFieldIndexing::default()
    .set_tokenizer(CANG_JIE) // Set custom tokenizer
    .set_index_option(IndexRecordOption::WithFreqsAndPositions);
let text_options = TextOptions::default()
    .set_indexing_options(text_indexing)
    .set_stored();
// ... Some code

let index = Index::create(RAMDirectory::create(), schema.clone())?;
let tokenizer = CangJieTokenizer {
    worker: Arc::new(Jieba::empty()), // empty dictionary
    option: TokenizerOption::Unicode,
};
index.tokenizers().register(CANG_JIE, tokenizer);
// ... Some code
```

[Full example](./tests/unicode_split.rs)
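The example selects `TokenizerOption::Unicode` with an empty dictionary, but this page does not describe what that option does. The sketch below assumes it means one token per Unicode character, which the test file name `unicode_split.rs` hints at; this has not been checked against the crate's source. `CharToken` and `split_unicode` are hypothetical names used only for illustration, not part of cang-jie, tantivy, or jieba-rs. The sketch uses only the standard library and records byte offsets, since Rust's `&str` is always UTF-8.

```rust
/// Hypothetical token type (not from cang-jie): one character of the input
/// together with its byte range in the original UTF-8 string.
#[derive(Debug, PartialEq)]
pub struct CharToken<'a> {
    pub text: &'a str,
    pub offset_from: usize,
    pub offset_to: usize,
}

/// Hypothetical helper (not from cang-jie): split `text` into one token per
/// Unicode scalar value. This is an assumed reading of "Unicode" splitting.
pub fn split_unicode(text: &str) -> Vec<CharToken<'_>> {
    text.char_indices()
        .map(|(start, ch)| {
            // `len_utf8` gives the number of bytes this character occupies.
            let end = start + ch.len_utf8();
            CharToken {
                text: &text[start..end],
                offset_from: start,
                offset_to: end,
            }
        })
        .collect()
}
```

Calling `split_unicode("仓颉a")` yields three tokens. Each Chinese character occupies three bytes in UTF-8 and the ASCII letter occupies one, so the offsets are byte positions rather than character counts.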
