cang-jie
Category: Feature extraction
Development tool: Rust
File size: 7KB
Downloads: 0
Upload date: 2023-02-12 02:43:42
Uploader: sh-1993
Description: Chinese tokenizer for tantivy, based on jieba-rs
File list:
Cargo.toml (452, 2023-06-11)
LICENSE (1086, 2023-06-11)
pre-commit (159, 2023-06-11)
rustfmt.toml (122, 2023-06-11)
src (0, 2023-06-11)
src\lib.rs (200, 2023-06-11)
src\options.rs (425, 2023-06-11)
src\stream.rs (1286, 2023-06-11)
src\tokenizer.rs (1526, 2023-06-11)
tests (0, 2023-06-11)
tests\unicode_split.rs (2735, 2023-06-11)
# cang-jie([仓颉](https://en.wikipedia.org/wiki/Cangjie))
[![Crates.io](https://img.shields.io/crates/v/cang-jie.svg)](https://crates.io/crates/cang-jie)
[![latest document](https://img.shields.io/badge/latest-document-ff69b4.svg)](https://docs.rs/cang-jie/)
[![dependency status](https://deps.rs/repo/github/dcjanus/cang-jie/status.svg)](https://deps.rs/repo/github/dcjanus/cang-jie)
A Chinese tokenizer for [tantivy](https://github.com/tantivy-search/tantivy), based on [jieba-rs](https://github.com/messense/jieba-rs).
Currently, only UTF-8 input is supported.
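Rust `&str` values are always valid UTF-8, so this restriction mostly matters when the text arrives as raw bytes in another encoding (GBK is common for Chinese text). The following is a minimal std-only sketch of checking bytes before passing them on for indexing. The function name `decode_for_indexing` is hypothetical and is not part of cang-jie.

```rust
/// Hypothetical helper, not part of cang-jie: turn raw bytes into a `String`
/// that is safe to pass to a UTF-8-only tokenizer.
///
/// Returns the decoded text and a flag telling whether invalid byte
/// sequences had to be replaced with U+FFFD.
fn decode_for_indexing(raw: &[u8]) -> (String, bool) {
    match std::str::from_utf8(raw) {
        // Already valid UTF-8: copy it unchanged.
        Ok(valid) => (valid.to_owned(), false),
        // Not valid UTF-8 (for example GBK bytes): replace bad sequences.
        Err(_) => (String::from_utf8_lossy(raw).into_owned(), true),
    }
}
```

When the flag is `true`, the input was probably in a different encoding, and it may be better to transcode it properly than to index the lossy text.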
## Example
```rust
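// Imports assumed for the tantivy version this example targets.
use std::sync::Arc;

use cang_jie::{CangJieTokenizer, TokenizerOption, CANG_JIE};
use jieba_rs::Jieba;
use tantivy::directory::RAMDirectory;
use tantivy::schema::{IndexRecordOption, SchemaBuilder, TextFieldIndexing, TextOptions};
use tantivy::Index;

let mut schema_builder = SchemaBuilder::default();
// Index the text field with the cang-jie tokenizer.
let text_indexing = TextFieldIndexing::default()
    .set_tokenizer(CANG_JIE) // Set custom tokenizer
    .set_index_option(IndexRecordOption::WithFreqsAndPositions);
let text_options = TextOptions::default()
    .set_indexing_options(text_indexing)
    .set_stored();
// ... Some code
let index = Index::create(RAMDirectory::create(), schema.clone())?;
// Register the tokenizer under the same name the schema refers to.
let tokenizer = CangJieTokenizer {
    worker: Arc::new(Jieba::empty()), // empty dictionary
    option: TokenizerOption::Unicode,
};
index.tokenizers().register(CANG_JIE, tokenizer);
// ... Some code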
```
[Full example](./tests/unicode_split.rs)
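
The example above selects `TokenizerOption::Unicode`. As a rough illustration of per-character splitting over UTF-8 text, here is a std-only sketch that assumes this option yields one token per Unicode scalar value. It is not cang-jie's actual implementation. The names `CharToken` and `unicode_split` are hypothetical, and the fields are only loosely modelled on the offsets and position a tantivy token carries.

```rust
/// Hypothetical token type: one Unicode scalar value plus its location.
#[derive(Debug, PartialEq)]
struct CharToken<'a> {
    /// The character as a slice of the original input.
    text: &'a str,
    /// Byte offset where the character starts.
    offset_from: usize,
    /// Byte offset just past the end of the character.
    offset_to: usize,
    /// Zero-based index of the token in the stream.
    position: usize,
}

/// Split `input` into one token per Unicode scalar value.
fn unicode_split(input: &str) -> Vec<CharToken<'_>> {
    input
        .char_indices()
        .enumerate()
        .map(|(position, (start, ch))| {
            // `len_utf8` gives the encoded width, so multi-byte
            // characters such as CJK ideographs keep correct byte offsets.
            let end = start + ch.len_utf8();
            CharToken {
                text: &input[start..end],
                offset_from: start,
                offset_to: end,
                position,
            }
        })
        .collect()
}
```

Offsets are counted in bytes rather than characters, so a slice such as `&input[offset_from..offset_to]` always falls on a character boundary.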