vibrrt
Category: Feature extraction
Development tool: R
File size: 0KB
Downloads: 0
Upload date: 2023-08-27 00:38:16
Uploader: sh-1993
Description: An R wrapper of Vibrato: Viterbi-based accelerated tokenizer
File list:
.Rbuildignore (117, 2023-12-23)
.devcontainer/ (0, 2023-12-23)
.devcontainer/devcontainer.json (654, 2023-12-23)
DESCRIPTION (668, 2023-12-23)
LICENSE (44, 2023-12-23)
LICENSE.md (1073, 2023-12-23)
NAMESPACE (634, 2023-12-23)
R/ (0, 2023-12-23)
R/bind_lr.R (2069, 2023-12-23)
R/bind_tf_idf2.R (5225, 2023-12-23)
R/collapse_tokens.R (1251, 2023-12-23)
R/dictionary.R (2596, 2023-12-23)
R/extendr-wrappers.R (425, 2023-12-23)
R/imports.R (109, 2023-12-23)
R/lex_density.R (1362, 2023-12-23)
R/mute_tokens.R (657, 2023-12-23)
R/pack.R (2533, 2023-12-23)
R/prettify.R (4514, 2023-12-23)
R/tokenize.R (4579, 2023-12-23)
R/utils.R (1894, 2023-12-23)
_pkgdown.yml (156, 2023-12-23)
inst/ (0, 2023-12-23)
inst/user.csv (231, 2023-12-23)
man/ (0, 2023-12-23)
man/as_tokens.Rd (739, 2023-12-23)
man/bind_lr.Rd (991, 2023-12-23)
man/bind_tf_idf2.Rd (2168, 2023-12-23)
man/collapse_tokens.Rd (786, 2023-12-23)
man/dict_path.Rd (584, 2023-12-23)
man/download_dict.Rd (721, 2023-12-23)
man/get_dict_features.Rd (939, 2023-12-23)
man/is_blank.Rd (459, 2023-12-23)
... ...
# vibrrt
> An R wrapper of ‘[Vibrato](https://github.com/daac-tools/vibrato)’:
> Viterbi-based accelerated tokenizer
[![vibrrt status
badge](https://paithiov909.r-universe.dev/badges/vibrrt)](https://paithiov909.r-universe.dev)
[![Lifecycle:
experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[![R-CMD-check](https://github.com/paithiov909/vibrrt/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/paithiov909/vibrrt/actions/workflows/R-CMD-check.yaml)
## Installation
``` r
install.packages("vibrrt", repos = "https://paithiov909.r-universe.dev")
```
## Usage
``` r
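# Resolve the local path of the IPA dictionary; download it if it is missing
ipadic <- vibrrt::dict_path("ipadic-mecab-2_7_0")
if (!file.exists(ipadic)) {
  vibrrt::download_dict("ipadic-mecab-2_7_0")
}
# Tokenize elements 5 to 10 of the sample text, keeping the POS1 and POS2 features
gibasa::ginga[5:10] |>
  vibrrt::tokenize(sys_dic = ipadic) |>
  vibrrt::prettify(col_select = c("POS1", "POS2"))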
```
## Benchmark
``` r
microbenchmark::microbenchmark(
  gibasa = gibasa::tokenize(gibasa::ginga, mode = "wakati"),
  vibrrt_ipadic = vibrrt::tokenize(
    gibasa::ginga,
    sys_dic = vibrrt::dict_path("ipadic-mecab-2_7_0"),
    mode = "wakati"
  ),
  times = 10L,
  check = "equal"
)
#> Unit: milliseconds
#>           expr      min       lq     mean   median       uq      max neval
#>         gibasa 104.3151 107.1517 113.6609 108.2799 116.5667 151.5655    10
#>  vibrrt_ipadic 400.1805 406.0459 437.5194 423.5166 456.0265 521.1660    10
```
``` r
microbenchmark::microbenchmark(
  gibasa = gibasa::tokenize(
    gibasa::ginga,
    sys_dic = "/usr/local/lib/python3.10/dist-packages/unidic_lite/dicdir",
    mode = "wakati"
  ),
  vibrrt_unidic = vibrrt::tokenize(
    gibasa::ginga,
    sys_dic = vibrrt::dict_path("unidic-mecab-2_1_2"),
    mode = "wakati"
  ),
  times = 5L
)
#> Unit: milliseconds
#>           expr       min       lq     mean   median        uq      max neval
#>         gibasa  386.1541  390.373 1158.620  544.906  630.9402 3840.727     5
#>  vibrrt_unidic 2334.7467 2352.088 2628.865 2404.555 2474.3871 3578.548     5
```
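For a rough reading of the two tables above, the reported medians can be compared directly. The snippet below is only a sketch: it uses the median values printed above and nothing else, and the names `median_ms` and `ratio` are illustrative rather than part of either package.

``` r
# Medians (milliseconds) copied from the two benchmark tables above.
# `median_ms` is an illustrative name, not an object from vibrrt or gibasa.
median_ms <- data.frame(
  dictionary = c("ipadic", "unidic"),
  gibasa = c(108.2799, 544.906),
  vibrrt = c(423.5166, 2404.555)
)

# A ratio above 1 means vibrrt took longer than gibasa in that run.
median_ms$ratio <- median_ms$vibrrt / median_ms$gibasa
print(median_ms)
```

On these numbers the ratio is roughly 3.9 for the IPA dictionary and roughly 4.4 for UniDic. Both benchmarks use few repetitions (10 and 5), and the UniDic comparison points each tokenizer at a different dictionary build (`unidic_lite` versus `unidic-mecab-2_1_2`), so the ratios are probably best treated as indicative only.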