coccoc-tokenizer

Category: Feature extraction
Development tool: C++
File size: 57014KB
Downloads: 0
Upload date: 2021-04-15 16:47:12
Uploader: sh-1993
Description: High-performance tokenizer for the Vietnamese language

File list (size in bytes, last modified):
.clang-format (663, 2021-04-16)
CMakeLists.txt (3568, 2021-04-16)
CMakeMacro.cmake (9372, 2021-04-16)
LICENSE (7652, 2021-04-16)
RELEASE.md (1342, 2021-04-16)
debian (0, 2021-04-16)
debian\changelog (1313, 2021-04-16)
debian\coccoc-tokenizer-java.install (23, 2021-04-16)
debian\coccoc-tokenizer.install (40, 2021-04-16)
debian\compat (2, 2021-04-16)
debian\control (1024, 2021-04-16)
debian\rules (763, 2021-04-16)
dicts (0, 2021-04-16)
dicts\tokenizer (0, 2021-04-16)
dicts\tokenizer\Freq2NontoneUniFile (1687343, 2021-04-16)
dicts\tokenizer\acronyms (1416717, 2021-04-16)
dicts\tokenizer\chemical_comp (8477, 2021-04-16)
dicts\tokenizer\keyword.freq (3484493, 2021-04-16)
dicts\tokenizer\nontone_pair_freq (67124001, 2021-04-16)
dicts\tokenizer\special_token.strong (364, 2021-04-16)
dicts\tokenizer\special_token.weak (7716, 2021-04-16)
dicts\tokenizer\vndic_multiterm (8491299, 2021-04-16)
dicts\vn_lang_tool (0, 2021-04-16)
dicts\vn_lang_tool\alphabetic (20920, 2021-04-16)
dicts\vn_lang_tool\d_and_gi.txt (861, 2021-04-16)
dicts\vn_lang_tool\i_and_y.txt (366, 2021-04-16)
dicts\vn_lang_tool\numeric (6304, 2021-04-16)
java (0, 2021-04-16)
java\build_java.sh (890, 2021-04-16)
java\src (0, 2021-04-16)
java\src\java (0, 2021-04-16)
java\src\java\Token.java (4656, 2021-04-16)
java\src\java\Tokenizer.java (4955, 2021-04-16)
java\src\java\Unsafe.java (3623, 2021-04-16)
java\src\jni (0, 2021-04-16)
java\src\jni\Tokenizer.cpp (2633, 2021-04-16)
python (0, 2021-04-16)
... ...

# C++ tokenizer for Vietnamese

This project provides a tokenizer library for the Vietnamese language and two command-line tools for tokenization and some simple Vietnamese-specific text operations (e.g. removing diacritics). It is used in the Coc Coc Search and Ads systems, and the main goal of its development was to reach high performance while keeping the quality reasonable for search ranking needs.

### Installing

Building from source and installing into a sandbox (or into the system):

```
$ mkdir build && cd build
$ cmake ..
# make install
```

To include the Java bindings:

```
$ mkdir build && cd build
$ cmake -DBUILD_JAVA=1 ..
# make install
```

To include the Python bindings, install the [cython](https://pypi.org/project/Cython/) package and compile the wrapper code (only Python 3 is supported):

```
$ mkdir build && cd build
$ cmake -DBUILD_PYTHON=1 ..
# make install
```

Building a Debian package can be done with the debhelper tools:

```
$ dpkg-buildpackage # from source tree root
```

If you want to build and install everything into your sandbox, you can use something like this (it will build everything and install into ~/.local, which is treated as a standard sandbox PREFIX by many applications and frameworks):

```
$ mkdir build && cd build
$ cmake -DBUILD_JAVA=1 -DBUILD_PYTHON=1 -DCMAKE_INSTALL_PREFIX=~/.local ..
$ make install
```

## Using the tools

Both tools show their usage with the `--help` option. Both tools accept either command-line arguments or stdin as input (if both are provided, command-line arguments are preferred). If stdin is used, each line is treated as one separate argument. The output format is TAB-separated tokens of the original phrase (note that Vietnamese tokens can contain whitespace). A few usage examples follow.

Tokenize a command-line argument:

```
$ tokenizer "Tung buoc e tro thanh mot lap trinh vien gioi"
tung buoc e tro thanh mot lap trinh vien gioi
```

Note that it may take a second or two for the tokenizer to load, due to one comparably big dictionary used to tokenize "sticky phrases" (when people write words without spacing). You can disable it with the `-n` option and the tokenizer will be up in no time. The default behaviour for "sticky phrases" is to only try to split them within URLs or domains. With `-n` you can disable this completely, and with `-u` you can force it for the whole text. Compare:

```
$ tokenizer "toisongohanoi, toi ang ky tren thegioididong.vn"
toisongohanoi toi ang ky tren the gioi di dong vn
$ tokenizer -n "toisongohanoi, toi ang ky tren thegioididong.vn"
toisongohanoi toi ang ky tren thegioididong vn
$ tokenizer -u "toisongohanoi, toi ang ky tren thegioididong.vn"
toi song o ha noi toi ang ky tren the gioi di dong vn
```

To avoid reloading the dictionaries for every phrase, you can pass phrases via stdin. Here's an example (note that the first line of the output is empty - it is the empty result for the "/" input line):

```
$ echo -ne "/\nanh yeu em\nbun cha o nha hang Quan An Ngon ko ngon\n" | tokenizer

anh yeu em
bun cha o nha hang quan an ngon ko ngon
```

Whitespace and punctuation are ignored during normal tokenization but kept during tokenization for transformation, which is used internally by the Coc Coc search engine. To keep punctuation during normal tokenization, except inside segmented URLs, use `-k`. To run tokenization for transformation, use `-t` - note that this formats the result by replacing spaces inside multi-syllable tokens with `_`, and `_` with `~`:

```
$ tokenizer "toisongohanoi, toi ang ky tren thegioididong.vn" -k
toisongohanoi , toi ang ky tren the gioi di dong vn
$ tokenizer "toisongohanoi, toi ang ky tren thegioididong.vn" -t
toisongohanoi , toi ang_ky tren the_gioi di_dong vn
```

The usage of `vn_lang_tool` is quite similar; you can see the full list of options for both tools with:

```
$ tokenizer --help
$ vn_lang_tool --help
```

## Using the library

Use the code of both tools as examples of library usage; they are straightforward and easy to understand:

```
utils/tokenizer.cpp     # for the tokenizer tool
utils/vn_lang_tool.cpp  # for vn_lang_tool
```

Here's a short code snippet from there:

```cpp
// initialize tokenizer, exit in case of failure
if (0 > Tokenizer::instance().initialize(opts.dict_path, !opts.no_sticky))
{
	exit(EXIT_FAILURE);
}

// tokenize given text, two additional options are:
// - bool for_transforming - this option is Coc Coc specific, kept for backwards compatibility
// - int tokenize_options - TOKENIZE_NORMAL, TOKENIZE_HOST or TOKENIZE_URL,
//   just use Tokenizer::TOKENIZE_NORMAL if unsure
std::vector< FullToken > res = Tokenizer::instance().segment(text, false, opts.tokenize_option);

for (FullToken t : res)
{
	// do something with tokens
}
```

Note that you can call the `segment()` function of the same Tokenizer instance multiple times, including in parallel from multiple threads.

Here's a short explanation of the fields of the FullToken structure:

```cpp
struct Token
{
	// position of the start of the normalized token (in chars)
	int32_t normalized_start;
	// position of the end of the normalized token (in chars)
	int32_t normalized_end;
	// position of the start of the token in the original text (in bytes)
	int32_t original_start;
	// position of the end of the token in the original text (in bytes)
	int32_t original_end;
	// token type (WORD, NUMBER, SPACE or PUNCT)
	int32_t type;
	// token segmentation type (Coc Coc specific, kept for backwards compatibility)
	int32_t seg_type;
};

struct FullToken : Token
{
	// normalized token text
	std::string text;
};
```
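Putting the pieces above together, here is a minimal standalone sketch of a program using the library. The include path and the dictionary directory (`/usr/share/tokenizer/dicts`) are assumptions for illustration; check `utils/tokenizer.cpp` for the exact header and default paths used in this repository.

```cpp
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

#include <tokenizer/tokenizer.hpp> // assumed header name; see utils/tokenizer.cpp

int main()
{
	// Initialize once per process; the second argument enables "sticky phrase"
	// splitting. The dictionary path below is an assumption, adjust it to
	// wherever the dicts were installed on your system.
	if (0 > Tokenizer::instance().initialize("/usr/share/tokenizer/dicts", true))
	{
		std::cerr << "failed to load tokenizer dictionaries" << std::endl;
		return EXIT_FAILURE;
	}

	std::string text = "bun cha o nha hang Quan An Ngon ko ngon";
	std::vector< FullToken > tokens =
		Tokenizer::instance().segment(text, false, Tokenizer::TOKENIZE_NORMAL);

	// Print each normalized token together with its byte range in the original text.
	for (const FullToken & t : tokens)
	{
		std::cout << t.text << " [" << t.original_start << ", " << t.original_end << ")\n";
	}
	return 0;
}
```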
## Using Java bindings

A Java interface is provided for use in Java projects. Internally it uses JNI and the Unsafe API to connect Java and C++. You can find an example of its usage in the `Tokenizer` class's main function:

```
java/src/java/Tokenizer.java
```

To run this test class from the source tree, use the following command:

```
$ LD_LIBRARY_PATH=build java -cp build/coccoc-tokenizer.jar com.coccoc.Tokenizer "mot cau van tieng Viet"
```

Normally, `LD_LIBRARY_PATH` should point to a directory containing the `libcoccoc_tokenizer_jni.so` binary. If you have already installed the deb package or `make install`-ed everything into your system, `LD_LIBRARY_PATH` is not needed, as the binary will be picked up from your system (`/usr/lib` or similar).
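For orientation, here is a rough sketch of what calling the binding from your own code could look like. The constructor argument, the `segment(...)` signature and the `getText()` getter are assumptions inferred from the C++ API and the command above; consult `java/src/java/Tokenizer.java` and `Token.java` for the real signatures.

```java
import java.util.List;

import com.coccoc.Token;
import com.coccoc.Tokenizer;

public class TokenizeExample {
	public static void main(String[] args) {
		// The dictionary path argument is an assumption; see the binding sources
		// for how the native tokenizer is actually initialized.
		Tokenizer tokenizer = new Tokenizer("/usr/share/tokenizer/dicts");

		// Assumed to mirror the C++ segment(text, for_transforming, tokenize_option).
		List<Token> tokens = tokenizer.segment("mot cau van tieng Viet", false, 0);
		for (Token token : tokens) {
			System.out.println(token.getText()); // getter name is an assumption
		}
	}
}
```

As with the test class, run it with `LD_LIBRARY_PATH` pointing to the directory containing `libcoccoc_tokenizer_jni.so` unless the library is installed system-wide.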
## Using Python bindings

```python
from CocCocTokenizer import PyTokenizer

# load_nontone_data is True by default
T = PyTokenizer(load_nontone_data=True)

# tokenize_option:
#   0: TOKENIZE_NORMAL (default)
#   1: TOKENIZE_HOST
#   2: TOKENIZE_URL
print(T.word_tokenize("xin chao, toi la nguoi Viet Nam", tokenize_option=0))
# output: ['xin', 'chao', ',', 'toi', 'la', 'nguoi', 'Viet_Nam']
```

## Other languages

Bindings for other languages are not yet implemented, but it would be nice if someone could help write them.

## Benchmark

The library provides high-speed tokenization, which is a requirement for performance-critical applications.

The benchmark was done on a typical laptop with an Intel Core i5-5200U processor:

- Dataset: **1,203,165** Vietnamese Wikipedia articles ([link](https://drive.google.com/file/d/1Amh8Tp3rM0kdThJ0Idd88FlGRmuwaK6o/view?usp=sharing))
- Output: **106,042,935** tokens out of **630,252,179** characters
- Processing time: **41** seconds
- Speed: **15M** characters / second, or **2.5M** tokens / second
- RAM consumption: around **300MB**

## Quality Comparison

The `tokenizer` tool has a special output format, similar to other existing tools for tokenization of Vietnamese texts: it preserves all of the original text and just marks multi-syllable tokens with underscores instead of spaces. Compare:

```
$ tokenizer 'Lan hoi: "ieu kien gi?".'
lan hoi ieu kien gi
$ tokenizer -f original 'Lan hoi: "ieu kien gi?".'
Lan hoi: "ieu_kien gi?".
```

Using the [following testset](https://github.com/UniversalDependencies/UD_Vietnamese-VTB) for comparison with [underthesea](https://github.com/undertheseanlp/underthesea) and [RDRsegmenter](https://github.com/datquocnguyen/RDRsegmenter), we get a significantly lower score, but in most cases the observed differences are not important for search ranking quality. Below are a few examples of such differences; please be aware of them when using this library.

```
original         : Em ut theo anh ca vao mien Nam.
coccoc-tokenizer : Em_ut theo anh_ca vao mien_Nam.
underthesea      : Em_ut theo anh ca vao mien Nam.
RDRsegmenter     : Em_ut theo anh_ca vao mien Nam.

original         : ket qua cuoc thi phong su - ky su 2004 cua bao Tuoi Tre.
coccoc-tokenizer : ket_qua cuoc_thi phong_su - ky_su 2004 cua bao Tuoi_Tre.
underthesea      : ket_qua cuoc thi phong_su - ky_su 2004 cua bao Tuoi_Tre.
RDRsegmenter     : ket_qua cuoc thi phong_su - ky_su 2004 cua bao Tuoi_Tre.

original         : co be lon len duoi mai leu tranh rach nat, trong mot gia inh co bon the he phai xach bi gay i an xin.
coccoc-tokenizer : co_be lon len duoi mai leu tranh rach_nat, trong mot gia_inh co bon the_he phai xach bi gay i an_xin.
underthesea      : co be lon len duoi mai leu tranh rach_nat, trong mot gia_inh co bon the_he phai xach bi_gay i an_xin.
RDRsegmenter     : co be lon len duoi mai leu tranh rach_nat, trong mot gia_inh co bon the_he phai xach bi_gay i an_xin.
```

We also don't apply any named entity recognition within the tokenizer, and there are a few rare cases where we fail to resolve ambiguity correctly. We therefore don't provide exact quality comparison results: the goals and potential use cases of this library and of the similar tools mentioned above are probably different, so a precise comparison wouldn't make much sense.

## Future Plans

We'd love to introduce bindings for more languages later, and we'd be happy if somebody could help us with that. We are also thinking about adding a POS tagger and more complex linguistic features. If you find any issues or have suggestions for further improvements, please report them here or write to us through GitHub.
