instant-clip-tokenizer
Category: Feature extraction
Development tool: Rust
File size: 0KB
Downloads: 0
Upload date: 2023-11-29 08:40:59
Uploader: sh-1993
Description: Instant CLIP Tokenizer
File list:
.cargo/ (0, 2023-12-11)
.cargo/config.toml (224, 2023-12-11)
Cargo.toml (392, 2023-12-11)
LICENSE (1075, 2023-12-11)
Makefile (476, 2023-12-11)
cover.svg (123449, 2023-12-11)
deny.toml (167, 2023-12-11)
instant-clip-tokenizer-py/ (0, 2023-12-11)
instant-clip-tokenizer-py/Cargo.toml (512, 2023-12-11)
instant-clip-tokenizer-py/pyproject.toml (156, 2023-12-11)
instant-clip-tokenizer-py/src/ (0, 2023-12-11)
instant-clip-tokenizer-py/src/lib.rs (5317, 2023-12-11)
instant-clip-tokenizer-py/test/ (0, 2023-12-11)
instant-clip-tokenizer-py/test/test.py (931, 2023-12-11)
instant-clip-tokenizer/ (0, 2023-12-11)
instant-clip-tokenizer/Cargo.toml (797, 2023-12-11)
instant-clip-tokenizer/benches/ (0, 2023-12-11)
instant-clip-tokenizer/benches/encode.rs (1827, 2023-12-11)
instant-clip-tokenizer/benches/tokenize_batch.rs (4169, 2023-12-11)
instant-clip-tokenizer/bpe_simple_vocab_16e6.txt (3194984, 2023-12-11)
instant-clip-tokenizer/examples/ (0, 2023-12-11)
instant-clip-tokenizer/examples/tokenize.rs (673, 2023-12-11)
instant-clip-tokenizer/src/ (0, 2023-12-11)
instant-clip-tokenizer/src/lib.rs (20880, 2023-12-11)
scripts/ (0, 2023-12-11)
scripts/original.py (6871, 2023-12-11)
scripts/validate.py (1447, 2023-12-11)
![Cover logo](https://github.com/instant-labs/instant-clip-tokenizer/blob/master/cover.svg)
# Instant CLIP Tokenizer: a fast tokenizer for the CLIP neural network
[![Documentation](https://docs.rs/instant-clip-tokenizer/badge.svg)](https://docs.rs/instant-clip-tokenizer/)
[![Crates.io](https://img.shields.io/crates/v/instant-clip-tokenizer.svg)](https://crates.io/crates/instant-clip-tokenizer)
[![PyPI](https://img.shields.io/pypi/v/instant-clip-tokenizer)](https://pypi.org/project/instant-clip-tokenizer/)
[![Build status](https://github.com/instant-labs/instant-clip-tokenizer/workflows/CI/badge.svg)](https://github.com/instant-labs/instant-clip-tokenizer/actions?query=workflow%3ACI)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://github.com/instant-labs/instant-clip-tokenizer/blob/master/LICENSE-MIT)
Instant CLIP Tokenizer is a fast pure-Rust text tokenizer for [OpenAI's CLIP model](https://github.com/openai/CLIP). It is intended to be a replacement for the original Python-based tokenizer included in the CLIP repository, aiming for 100% compatibility with the original implementation. It can also be used with [OpenCLIP](https://github.com/mlfoundations/open_clip) and other implementations using the same tokenizer.
In addition to being usable as a Rust crate it also includes Python bindings built with [PyO3](https://pyo3.rs/) so that it can be used as a native Python module.
In the microbenchmarks included in this repository, Instant CLIP Tokenizer is ~70x faster than the original Python implementation (with preprocessing and caching disabled to ensure a fair comparison).
## Using the library
### Rust
```toml
[dependencies]
instant-clip-tokenizer = "0.1.0"
# To enable additional functionality that depends on the `ndarray` crate:
# instant-clip-tokenizer = { version = "0.1.0", features = ["ndarray"] }
```
### Python **(>= 3.9)**
```sh
pip install instant-clip-tokenizer
```
The Python module requires `numpy >= 1.16.0` to be installed in your Python environment (e.g., via `pip install numpy`).
### Examples
```rust
use instant_clip_tokenizer::{Token, Tokenizer};

fn main() {
    let tokenizer = Tokenizer::new();
    // `encode` appends the tokens for the input text to the given vector.
    let mut tokens = Vec::new();
    tokenizer.encode("A person riding a motorcycle", &mut tokens);
    // Convert each `Token` to its numeric id for printing.
    let tokens = tokens.into_iter().map(Token::to_u16).collect::<Vec<_>>();
    println!("{:?}", tokens);
    // -> [320, 2533, 6765, 320, 10297]
}
```
```python
import instant_clip_tokenizer
tokenizer = instant_clip_tokenizer.Tokenizer()
tokens = tokenizer.encode("A person riding a motorcycle")
print(tokens)
# -> [320, 2533, 6765, 320, 10297]
batch = tokenizer.tokenize_batch(["A person riding a motorcycle", "Hi there"], context_length=5)
print(batch)
# -> [[49406 320 2533 6765 49407]
# [49406 1883 997 49407 0]]
```
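The batch output above suggests a fixed row layout: each row starts with `49406`, the encoded text is closed with `49407`, rows shorter than `context_length` are filled with `0`, and longer inputs are cut so that the closing marker still fits. The following is a minimal pure-Python sketch of that layout, inferred only from the example output. `frame_tokens` and the three constant names are hypothetical and are not part of the `instant_clip_tokenizer` API, which may handle edge cases differently.

```python
# Hypothetical helper, not part of instant_clip_tokenizer: it only mirrors the
# row layout visible in the example output above.
START_TOKEN = 49406  # first value in each row of the example output
END_TOKEN = 49407    # value that closes each encoded text in the example output
PAD_TOKEN = 0        # filler after the end marker in the example output


def frame_tokens(tokens, context_length):
    """Wrap token ids in start/end markers and fit them to a fixed length."""
    if context_length < 2:
        raise ValueError("context_length must leave room for both markers")
    # Keep at most context_length - 2 ids so that both markers always fit.
    body = list(tokens)[: context_length - 2]
    row = [START_TOKEN, *body, END_TOKEN]
    # Pad short rows with zeros up to the requested length.
    return row + [PAD_TOKEN] * (context_length - len(row))


print(frame_tokens([320, 2533, 6765, 320, 10297], 5))
# -> [49406, 320, 2533, 6765, 49407]
print(frame_tokens([1883, 997], 5))
# -> [49406, 1883, 997, 49407, 0]
```

The two calls reuse the token ids from the examples above and reproduce the two rows of the batch output, with the five-token sentence truncated to three ids and the two-token sentence padded with one zero.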
## Testing
Run the Rust tests with:
```sh
cargo test --all-features
```
You can also test the Python bindings with:
```sh
make test-python
```
## Acknowledgements
The vocabulary file and original Python tokenizer code included in this repository are copyright (c) 2021 OpenAI ([MIT-License](https://github.com/openai/CLIP/blob/main/LICENSE)).