tokenizer-benchmark

Category: Feature extraction
Development tool: Python
File size: 0KB
Downloads: 0
Upload date: 2024-04-05 12:07:41
Uploader: sh-1993
Description: Benchmark for Scandinavian Tokenizers

File list:
allwiki/
images/
wikipedia_100k/
wikipedia_1k/
LICENSE
run_test.py
sample_wikipedia.py
tokenizer_list.jsonl

# Benchmark for Scandinavian Language Tokenizers

This repo provides tools for evaluating the efficiency of various tokenizers for Swedish, Danish, Norwegian Bokmål and Norwegian Nynorsk. English is also included for comparison. We measure tokenizer efficiency by tokenizing a total of 100k words drawn from the top 500 Wikipedia pages for each language. Tokenizer efficiency, $E$, can be defined as the ratio of the total number of words, $W$, to the total number of tokens, $T$, multiplied by 100 to express it as a percentage:

$$E = \frac{W}{T} \times 100$$
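As an illustration only (not the repo's code), the metric and the tokens-per-word figure reported in the tables below can be computed like this:

```python
# Illustration of the efficiency metric defined above; the two reported
# numbers are reciprocal up to the factor of 100.
def efficiency(num_words: int, num_tokens: int) -> float:
    """E = (total words / total tokens) * 100, expressed as a percentage."""
    return num_words / num_tokens * 100

def tokens_per_word(num_words: int, num_tokens: int) -> float:
    return num_tokens / num_words

# Example: 100,000 words tokenized into 138,000 tokens.
print(efficiency(100_000, 138_000))       # ~72.5 (%)
print(tokens_per_word(100_000, 138_000))  # 1.38
```

A tokenizer that kept every word intact would score 100%; the more a tokenizer splits words into subwords, the lower the efficiency and the higher the tokens-per-word figure.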
### Scandinavian Tokenizers

| Tokenizer | Type | Vocab Size | en | sv | da | no | nn | Average | Tokens/Word |
|:----------|:-----|-----------:|---:|---:|---:|---:|---:|--------:|------------:|
| [AISwedenRoberta](https://hf.co/AI-Sweden-Models/roberta-large-1160k) | BPE | 50,265 | 68 | 75 | 77 | 75 | 67 | 72.8% | 1.38 |
| [Viking](https://hf.co/LumiOpen/Viking-7B) | BPE | 131,072 | 76 | 68 | 70 | 69 | 69 | 70.9% | 1.41 |
| [MBart](https://hf.co/facebook/mbart-large-en-ro) | SentencePiece | 250,027 | 74 | 65 | 67 | 67 | 63 | 68.0% | 1.48 |
| [Gemma](https://hf.co/google/gemma-7b) | SentencePiece | 256,000 | 81 | 60 | 61 | 61 | 60 | 65.0% | 1.56 |
| [norMistral](https://hf.co/norallm/normistral-7b-scratch) | BPE | 32,768 | 62 | 52 | 62 | 70 | 66 | 62.9% | 1.61 |
| [mT5](https://hf.co/google/mt5-small) | SentencePiece | 250,100 | 69 | 58 | 60 | 60 | 58 | 61.7% | 1.63 |
| [GPT-J](https://hf.co/EleutherAI/gpt-j-6b) | BPE | 50,257 | 89 | 46 | 49 | 50 | 48 | 56.8% | 1.87 |
| [NB-GPT-J](https://hf.co/NbAiLab/nb-gpt-j-6B-v2) | BPE | 50,257 | 89 | 46 | 49 | 50 | 48 | 56.8% | 1.87 |
| [Roberta](https://hf.co/roberta-base) | BPE | 50,265 | 89 | 46 | 49 | 50 | 48 | 56.8% | 1.87 |
| [GPT2](https://hf.co/gpt2) | BPE | 50,257 | 89 | 46 | 49 | 50 | 48 | 56.8% | 1.87 |
| [Llama](https://hf.co/meta-llama/Llama-2-7b-hf) | BPE | 32,000 | 71 | 50 | 49 | 49 | 49 | 54.1% | 1.89 |
| [Mistral](https://hf.co/mistralai/Mistral-7B-Instruct-v0.2) | BPE | 32,000 | 72 | 48 | 48 | 48 | 48 | 53.3% | 1.93 |
| [BinericGPT](https://hf.co/bineric/NorskGPT-Mistral-7b) | WordPiece | 32,000 | 72 | 48 | 48 | 48 | 48 | 53.3% | 1.93 |
| [MabeckMistral](https://hf.co/Mabeck/Heidrun-Mistral-7B-chat) | WordPiece | 32,000 | 72 | 48 | 48 | 48 | 48 | 53.3% | 1.93 |
| [KBLab-Megatron](https://hf.co/KBLab/megatron.bert-large.unigram-64k-pretok.500k-steps) | WordPiece | 64,005 | 52 | 61 | 45 | 45 | 45 | 50.1% | 2.02 |

### Not Fully Supported Tokenizers

| Tokenizer | Type | Vocab Size | Scand Test | Nordic Test | Eng Test | Average | Tokens/Word |
|:----------|:-----|-----------:|:-----------|:------------|:---------|--------:|------------:|
| [NB-BERT](https://hf.co/NbAiLab/nb-bert-large) | WordPiece | 50,000 | OK (lower) | Failed | OK (lower) | 86.0% | 1.3 |
| [NorBert](https://hf.co/ltg/norbert3-large) | WordPiece | 50,000 | Failed | Failed | Failed | 82.5% | 1.4 |
| [norT5](https://hf.co/ltg/nort5-base) | SentencePiece | 50,000 | Failed | Failed | Failed | 82.5% | 1.4 |
| [mBERT](https://hf.co/bert-base-multilingual-uncased) | WordPiece | 105,879 | Failed | Failed | OK (lower) | 72.8% | 1.34 |
| [KBLab-BERT](https://hf.co/KBLab/bert-base-swedish-cased) | WordPiece | 50,325 | Failed | Failed | Success | 63.2% | 1.51 |
| [Saattrupdan-no](https://hf.co/saattrupdan/tokenizer-no) | WordPiece | 30,000 | OK (lower) | OK (lower) | OK (lower) | 59.2% | 1.94 |
| [Saattrupdan-scand](https://hf.co/saattrupdan/tokenizer-scandi) | WordPiece | 100,000 | OK (lower) | OK (lower) | OK (lower) | 56.3% | 1.84 |
| [Bert](https://hf.co/bert-base-uncased) | WordPiece | 30,522 | Failed | Failed | OK (lower) | 52.3% | 1.74 |
| [DistilBert](https://hf.co/distilbert-base-uncased) | WordPiece | 30,522 | Failed | Failed | OK (lower) | 52.3% | 1.74 |
| [LayoutLM](https://hf.co/microsoft/layoutlm-base-uncased) | WordPiece | 30,522 | Failed | Failed | OK (lower) | 52.3% | 1.74 |
| [Saattrupdan-da](https://hf.co/saattrupdan/tokenizer-da) | WordPiece | 30,000 | OK (lower) | OK (lower) | OK (lower) | 51.0% | 1.97 |
| [Saattrupdan-sv](https://hf.co/saattrupdan/tokenizer-sv) | WordPiece | 30,000 | OK (lower) | OK (lower) | OK (lower) | 43.9% | 2.1 |
| [XLNet](https://hf.co/xlnet-base-cased) | SentencePiece | 32,000 | Failed | Failed | Success | 41.0% | 2.21 |
| [T5](https://hf.co/t5-base) | SentencePiece | 32,100 | Failed | Failed | Success | 36.9% | 2.46 |

# sample_wikipedia.py

This script creates a Wikipedia corpus for the defined set of languages and is the tool used to build the tokenization benchmark. It extracts the first 200 words from each article on a specified date; articles shorter than 200 words are dropped. In the default mode it samples until it has reached 100k words (a sketch of this sampling logic is given at the end of this README). To create the corpus files, run the commands below:

```bash
for lang in en no nn da sv; do python sample_wikipedia.py --language $lang --output_file wikipedia_100k/wiki_$lang.txt --num_articles 500 --num_words 200; done
for lang in en no nn da sv; do python sample_wikipedia.py --language $lang --output_file wikipedia_1k/wiki_$lang.txt --num_articles 50 --num_words 20; done
```

# run_test.py

This script runs the benchmark and creates the tables in this document (see the sketch of the loop at the end of this README).

```bash
python run_test.py
```

# Faster test run

For a quicker run, point the script at the smaller corpus:

```bash
python run_test.py --directory wikipedia_1k/
```
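The sampling step described under `sample_wikipedia.py` above can be sketched roughly as follows. This is an illustration only, not the actual script: the helper name `sample_corpus` is hypothetical, and fetching the articles themselves is left abstract.

```python
# Hypothetical sketch of the sampling logic described under sample_wikipedia.py:
# keep the first `num_words` words of each article, drop articles that are too
# short, and stop once the word budget is reached. Fetching the top Wikipedia
# pages for a given date is omitted here.
from typing import Iterable, List

def sample_corpus(articles: Iterable[str],
                  num_words: int = 200,
                  target_words: int = 100_000) -> List[str]:
    corpus, total = [], 0
    for article in articles:
        words = article.split()
        if len(words) < num_words:        # drop articles shorter than num_words
            continue
        corpus.append(" ".join(words[:num_words]))
        total += num_words
        if total >= target_words:         # stop at the word budget (100k by default)
            break
    return corpus
```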
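Similarly, a minimal sketch of the benchmark loop that `run_test.py` performs, assuming the Hugging Face `transformers` API and assuming that `tokenizer_list.jsonl` holds one JSON object per line with a `name` field pointing at a model id; the real script may structure this differently:

```python
# Minimal benchmark-loop sketch (not the actual run_test.py). Assumes
# tokenizer_list.jsonl contains lines like {"name": "gpt2"}; that field
# name is an assumption. Efficiency is words / tokens * 100 per language.
import json
from pathlib import Path
from transformers import AutoTokenizer

LANGS = ["en", "sv", "da", "no", "nn"]

def benchmark(tokenizer_list: str = "tokenizer_list.jsonl",
              corpus_dir: str = "wikipedia_100k") -> None:
    for line in Path(tokenizer_list).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        name = json.loads(line)["name"]          # assumed field name
        tokenizer = AutoTokenizer.from_pretrained(name)
        scores = []
        for lang in LANGS:
            text = Path(corpus_dir, f"wiki_{lang}.txt").read_text(encoding="utf-8")
            n_words = len(text.split())
            n_tokens = len(tokenizer.tokenize(text))
            scores.append(n_words / n_tokens * 100)
        per_lang = ", ".join(f"{l}={s:.0f}" for l, s in zip(LANGS, scores))
        print(f"{name}: {per_lang}, average={sum(scores) / len(scores):.1f}%")

if __name__ == "__main__":
    benchmark()
```

Running it against `wikipedia_1k/` instead of `wikipedia_100k/` mirrors the faster test run shown above.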
