bytepiece-rs

Category: Feature extraction
Development language: Rust
File size: 0KB
Downloads: 0
Upload date: 2023-10-16 14:19:03
Uploader: sh-1993
Description: The Bytepiece Tokenizer Implemented in Rust.

File list:
bindings/ (0, 2023-11-28)
bindings/python/ (0, 2023-11-28)
bindings/python/.cargo/ (0, 2023-11-28)
bindings/python/.cargo/config.toml (317, 2023-11-28)
bindings/python/Cargo.toml (469, 2023-11-28)
bindings/python/LICENSE (11357, 2023-11-28)
bindings/python/benches/ (0, 2023-11-28)
bindings/python/benches/benchmark.py (1784, 2023-11-28)
bindings/python/benches/tokenizer_aho.py (2829, 2023-11-28)
bindings/python/benches/tokenizer_jieba.py (4673, 2023-11-28)
bindings/python/pyproject.toml (905, 2023-11-28)
bindings/python/rs_bytepiece/ (0, 2023-11-28)
bindings/python/rs_bytepiece/__init__.py (59, 2023-11-28)
bindings/python/src/ (0, 2023-11-28)
bindings/python/src/lib.rs (1229, 2023-11-28)
bindings/python/tests/ (0, 2023-11-28)
bindings/python/tests/test_tokenizer.py (160, 2023-11-28)
bytepiece_rs/ (0, 2023-11-28)
bytepiece_rs/Cargo.toml (844, 2023-11-28)
bytepiece_rs/LICENSE (11357, 2023-11-28)
bytepiece_rs/bench_aho/ (0, 2023-11-28)
bytepiece_rs/bench_aho/Cargo.toml (325, 2023-11-28)
bytepiece_rs/bench_aho/data/ (0, 2023-11-28)
bytepiece_rs/bench_aho/data/鲁迅全集.txt (1843308, 2023-11-28)
bytepiece_rs/bench_aho/main.py (1113, 2023-11-28)
bytepiece_rs/bench_aho/src/ (0, 2023-11-28)
bytepiece_rs/bench_aho/src/main.rs (1859, 2023-11-28)
bytepiece_rs/benches/ (0, 2023-11-28)
bytepiece_rs/benches/bytepiece_benchmark.rs (1148, 2023-11-28)
bytepiece_rs/src/ (0, 2023-11-28)
bytepiece_rs/src/lib.rs (88, 2023-11-28)
... ...

# bytepiece

Implementation of Su's [bytepiece](https://github.com/bojone/bytepiece). Bytepiece is a new tokenization method that uses UTF-8 bytes as unigrams to process text. It needs little preprocessing and is purer and language independent.

## Bindings

- [Rust](https://github.com/hscspring/bytepiece-rs/tree/main/bytepiece_rs)
- [Python](https://github.com/hscspring/bytepiece-rs/tree/main/bindings/python)

## Quick Example using Python

```python
from rs_bytepiece import Tokenizer

tokenizer = Tokenizer()
output = tokenizer.encode("今天天气不错")
print(output)  # [40496, 45268, 39432]
```
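Byte-level tokenization treats the raw UTF-8 bytes of the input as the atomic units, so any text can be encoded without language-specific pre-tokenization. The sketch below is a simplified illustration of that idea in Rust, not the `bytepiece_rs` API: it uses a hypothetical three-piece vocabulary and greedy leftmost-longest matching via the `aho-corasick` crate, whereas the actual tokenizer is trained and scores candidate pieces with a unigram model.

```rust
// Conceptual sketch of byte-level piece matching; not the bytepiece_rs API.
// Cargo.toml: aho-corasick = "1"
use aho_corasick::{AhoCorasick, MatchKind};
use std::collections::HashMap;

fn main() {
    // Hypothetical toy vocabulary: each piece is a sequence of UTF-8 bytes.
    let pieces: Vec<&[u8]> = vec![
        "今天".as_bytes(),
        "天气".as_bytes(),
        "不错".as_bytes(),
    ];
    let ids: HashMap<&[u8], u32> = pieces
        .iter()
        .enumerate()
        .map(|(i, p)| (*p, i as u32))
        .collect();

    // Leftmost-longest matching over raw bytes.
    let ac = AhoCorasick::builder()
        .match_kind(MatchKind::LeftmostLongest)
        .build(&pieces)
        .unwrap();

    let text = "今天天气不错".as_bytes();
    let mut tokens: Vec<u32> = Vec::new();
    let mut pos = 0;
    for m in ac.find_iter(text) {
        // Any bytes not covered by a piece fall back to single-byte tokens,
        // here simply using the byte value as the id.
        for &b in &text[pos..m.start()] {
            tokens.push(b as u32);
        }
        tokens.push(ids[&text[m.start()..m.end()]]);
        pos = m.end();
    }
    for &b in &text[pos..] {
        tokens.push(b as u32);
    }
    println!("{:?}", tokens); // [0, 1, 2] with this toy vocabulary
}
```

Because everything happens at the byte level, unmatched input never fails to encode; in a real byte-level vocabulary the 256 single-byte tokens guarantee full coverage.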
