chamkho
所属分类:collect
开发工具:Rust
文件大小:0KB
下载次数:0
上传日期:2023-07-11 03:26:49
上 传 者:
sh-1993
说明: no intro
(Khmer, Lao, Myanmar, and Thai word segmentation breaking library and command line,)
文件列表:
Cargo.lock (10938, 2023-10-22)
Cargo.toml (674, 2023-10-22)
LICENSE (11357, 2023-10-22)
LICENSE-PyThaiNLP (11357, 2023-10-22)
LICENSE-khmerdict (1076, 2023-10-22)
LICENSE-myanmar-dict (1070, 2023-10-22)
LICENSE-thai.txt (11357, 2023-10-22)
Lao-Dictionary-LICENSE.txt (1323, 2023-10-22)
build-standalone-release.bash (676, 2023-10-22)
data/ (0, 2023-10-22)
data/khmer_cluster_rules.txt (260, 2023-10-22)
data/khmerdict.txt (1501303, 2023-10-22)
data/laowords.txt (203262, 2023-10-22)
data/myanmar-dict.txt (925535, 2023-10-22)
data/thai-replace-rules.json (46, 2023-10-22)
data/thai.txt (278702, 2023-10-22)
data/thai_cluster_rules.txt (790, 2023-10-22)
data/words_th.txt (1517594, 2023-10-22)
src/ (0, 2023-10-22)
src/cli.rs (3259, 2023-10-22)
src/init.rs (5589, 2023-10-22)
# Moved to https://codeberg.org/mekong-lang/chamkho
# Chamkho
Khmer, Lao, Myanmar, and Thai word segmentation/breaking library and command line
## Algorithm
https://github.com/veer66/wordcut-engine
## C-ABI Library
https://github.com/veer66/wordcutw
## Usage
### Binary tarball
#### On Google Colab
```
!wget -q https://github.com/veer66/chamkho/releases/download/1.4.0/chamkho-1.4.0-linux-x64-musl.tar.gz -O - | tar -xzf -
with open('input.txt', 'w') as f:
f.write(u'')
!cd chamkho-1.4.0-linux-x64-musl; ./wordcut < ../input.txt; cd ..
```
#### On GNU/Linux
```
wget -q https://github.com/veer66/chamkho/releases/download/1.4.0/chamkho-1.4.0-linux-x64-musl.tar.gz -O - | tar -xzf -
cd chamkho-1.4.0-linux-x64-musl
echo | ./wordcut
cd ..
```
#### On Windows Powershell
```
PS C:\ex1> $OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding
PS C:\ex1> Invoke-WebRequest -uri https://github.com/veer66/chamkho/releases/download/1.1.0/chamkho-1.1.0-windows-amd64.zip -OutFile chamkho.zip
PS C:\ex1> Expand-Archive -Path .\chamkho.zip -DestinationPath .
PS C:\ex1> cd .\chamkho-1.1.0-windows-amd64\
PS C:\ex1\chamkho-1.1.0-windows-amd64> echo | .\wordcut.
exe
||
```
### As command line
echo "" | wordcut
#### External dictionary
echo "" | wordcut -d
#### Specific language
```Bash
echo "" | wordcut -l lao
```
```Bash
echo | wordcut -l khmer
```
```Bash
echo | wordcut -l myanmar
```
#### WebAssembly
```Bash
# Add wasm32-wasi target
rustup target add wasm32-wasi
# Install wasmtime (macOS) https://docs.wasmtime.dev/cli-install.html
curl https://wasmtime.dev/install.sh -sSf | bash
# Source (or restart terminal)
source ~/.zshrc
# Build
cargo build --target=wasm32-wasi
# Run with input
echo "" | wasmtime --dir=$(pwd)/data target/wasm32-wasi/debug/wordcut.was
# Run and wait for input
wasmtime --dir=$(pwd)/data target/wasm32-wasi/debug/wordcut.wasm
```
## Benchmark
### Setup
* Computer: Hetzner's CX11
* Rustc: rustc 1.53.0 (53cb7b09b 2021-06-17)
* OS: Linux exper1 5.12.15-300.fc34.x86_64 #1 SMP Wed Jul 7 19:46:50 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
* Script:
```Bash
#!/bin/bash
set -x
INPUT=thwik-head1m.txt
#INPUT=x.txt
for i in {1..10}
do
{ time nlpo3 segment < $INPUT > o3 ; } 2>> bench_o3.txt
{ time wordcut < $INPUT > wc.txt ; } 2>> bench_wc.txt
done
```
* nlpo3 version: 1.1.2
* nlpo3-cli version: 0.0.1
* chamkho version: 0.5.0
* dataset: https://file.veer66.rocks/langbench/thwik-head1m.txt
### Result
#### nlpo3
```
[root@exper1 ~]# grep real bench_o3.txt
real 3m26.884s
real 3m15.001s
real 3m12.829s
real 3m11.998s
real 3m12.399s
real 3m13.829s
real 3m14.506s
real 3m9.198s
real 3m6.749s
real 3m8.729s
```
#### chamko
```
[root@exper1 ~]# grep real bench_wc.txt
real 1m41.611s
real 1m40.262s
real 1m40.488s
real 1m40.765s
real 1m39.385s
real 1m41.002s
real 1m38.292s
real 1m35.906s
real 1m40.263s
real 1m36.523s
```
#### Average
* nlpo3: 193.21s
* chamkho: 99.44s
## Benchmark 2
Chamkho and Newmm on Mac mini M1
### Setup
* Computer: Scaleway's Mac mini M1
* Rustc: rustc 1.54.0 (a178d0322 2021-07-26)
* Python: Python 3.8.2
* OS: Darwin 506124d8-4acf-4595-9d46-8ca4b44b8110 20.6.0 Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:27 PDT 2021; root:xnu-7195.141.2~5/RELEASE_ARM64_T8101 arm64
* Script:
```Bash
#!/bin/bash
set -x
INPUT=thwik-head1m.txt
for i in {1..10}
do
{ time python3 newmm.py < $INPUT > newmm.out ; } 2>> bench_newmm.txt
{ time wordcut < $INPUT > cham.out ; } 2>> bench_chamkho.txt
done
```
* PyThaiNLP: 2.3.1
* chamkho version: 1.0.2
* dataset: https://file.veer66.rocks/langbench/thwik-head1m.txt
### Result
#### newmm
```
real 7m40.693s
real 7m40.623s
real 7m41.623s
real 7m40.438s
real 7m41.363s
real 7m39.108s
real 7m39.486s
real 7m39.946s
real 7m39.960s
real 7m40.279s
```
#### chamko
```
real 1m2.110s
real 1m2.200s
real 1m1.954s
real 1m1.823s
real 1m1.836s
real 1m1.864s
real 1m1.638s
real 1m1.641s
real 1m1.688s
real 1m1.923s
```
#### Average
* newmm
```
$ grep real bench_newmm.txt | ruby -lane 'BEGIN { all = 0.0; cnt = 0 }; cols = $F[1].split(/[ms]/).map {|x| x.to_f }; v = cols[0]*60 + cols[1]; all += v; cnt += 1; END { p all/cnt}'
460.3519
```
* chamkho:
```
$ grep real bench_chamkho.txt | ruby -lane 'BEGIN { all = 0.0; cnt = 0 }; cols = $F[1].split(/[ms]/).map {|x| x.to_f }; v = cols[0]*60 + cols[1]; all += v; cnt += 1; END { p all/cnt}'
61.8677
```
#### Performance ratio
7.4
## Benchmark 3
Chamkho and Newmm on Xeon
### Setup
* Computer: Hetzner's CX11
* Rustc: rustc 1.54.0 (a178d0322 2021-07-26)
* Python: Python 3.9.6
* OS: Linux fedora-2gb-hel1-1 5.12.15-300.fc34.x86_64 #1 SMP Wed Jul 7 19:46:50 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
* Script:
```Bash
#!/bin/bash
set -x
INPUT=thwik-head1m.txt
for i in {1..10}
do
{ time python newmm.py < $INPUT > newmm.out ; } 2>> bench_newmm.txt
{ time wordcut < $INPUT > cham.out ; } 2>> bench_chamkho.txt
done
```
* PyThaiNLP: 2.3.1
* chamkho version: 1.0.2
* dataset: https://file.veer66.rocks/langbench/thwik-head1m.txt
### Result
#### newmm
```
# grep real bench_newmm.txt
real 17m15.608s
real 17m14.038s
real 17m7.864s
real 17m17.329s
real 17m5.501s
real 17m10.841s
real 17m16.348s
real 17m19.813s
real 17m28.796s
real 17m26.056s
```
#### chamko
```
# grep real bench_chamkho.txt
real 1m46.157s
real 1m47.785s
real 1m47.173s
real 1m45.656s
real 1m45.554s
real 1m46.612s
real 1m48.991s
real 1m49.656s
real 1m47.677s
real 1m47.876s
```
#### Average
* newmm
```
# grep real bench_newmm.txt | ruby -lane 'BEGIN { all = 0.0; cnt = 0 }; cols = $F[1].split(/[ms]/).map {|x| x.to_f }; v = cols[0]*60 + cols[1]; all += v; cnt += 1; END { p all/cnt}'
1036.2194000000002
```
* chamkho:
```
$ # grep real bench_chamkho.txt | ruby -lane 'BEGIN { all = 0.0; cnt = 0 }; cols = $F[1].split(/[ms]/).map {|x| x.to_f }; v = cols[0]*60 + cols[1]; all += v; cnt += 1; END { p all/cnt}'
107.31370000000001
```
#### Performance ratio
9.65
## Benchmark 4
Chamkho and Nlpo3 on Mac mini M1
### Setup
* Computer: Scaleway's Mac mini M1
* Rustc: rustc 1.54.0 (a178d0322 2021-07-26)
* OS: Darwin 506124d8-4acf-4595-9d46-8ca4b44b8110 20.6.0 Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:27 PDT 2021; root:xnu-7195.141.2~5/RELEASE_ARM64_T8101 arm64
* Script:
```Bash
#!/bin/bash
set -x
INPUT=thwik-head1m.txt
for i in {1..10}
do
{ time wordcut < $INPUT > newmm.out ; } 2>> bench_chamkho.txt
{ time nlpo3 segment < $INPUT > cham.out ; } 2>> bench_o3.txt
done
```
* nlpo3-cli version: 0.0.1
* chamkho version: 1.0.2
* dataset: https://file.veer66.rocks/langbench/thwik-head1m.txt
### Result
#### nlpo3
```
% grep real bench_o3.txt
real 2m7.639s
real 2m7.024s
real 2m6.296s
real 2m7.731s
real 2m7.873s
real 2m7.028s
real 2m6.411s
real 2m6.974s
real 2m7.746s
real 2m6.955s
```
#### chamko
```
% grep real bench_chamkho.txt
real 1m0.237s
real 1m1.799s
real 1m1.752s
real 1m1.373s
real 1m1.128s
real 1m1.870s
real 1m1.878s
real 1m1.709s
real 1m1.690s
real 1m1.030s
```
#### Average
* nlpo3
```
% grep real bench_o3.txt | ruby -lane 'BEGIN { all = 0.0; cnt = 0 }; cols = $F[1].split(/[ms]/).map {|x| x.to_f }; v = cols[0]*60 + cols[1]; all += v; cnt += 1; END { p all/cnt}'
127.1677
```
* chamkho:
```
% grep real bench_chamkho.txt | ruby -lane 'BEGIN { all = 0.0; cnt = 0 }; cols = $F[1].split(/[ms]/).map {|x| x.to_f }; v = cols[0]*60 + cols[1]; all += v; cnt += 1; END { p all/cnt}'
61.44659999999999
```
#### Performance ratio
2.07
近期下载者:
相关文件:
收藏者: