node-icu-tokenizer

Category: Feature extraction
Development tool: C
File size: 13126 KB
Downloads: 0
Upload date: 2022-08-11 02:49:23
Uploader: sh-1993
Description: Node ICU tokenizer
(node-icu-tokenizer)

File list:
.npmignore (525, 2020-12-24)
CHANGELOG.md (380, 2020-12-24)
LICENSE (1071, 2020-12-24)
binding.gyp (3771, 2020-12-24)
includes (0, 2020-12-24)
includes\unicode (0, 2020-12-24)
includes\unicode\appendable.h (8632, 2020-12-24)
includes\unicode\brkiter.h (28200, 2020-12-24)
includes\unicode\bytestream.h (9819, 2020-12-24)
includes\unicode\bytestrie.h (19718, 2020-12-24)
includes\unicode\bytestriebuilder.h (7268, 2020-12-24)
includes\unicode\caniter.h (7530, 2020-12-24)
includes\unicode\casemap.h (25913, 2020-12-24)
includes\unicode\char16ptr.h (7470, 2020-12-24)
includes\unicode\chariter.h (24574, 2020-12-24)
includes\unicode\dbbi.h (1133, 2020-12-24)
includes\unicode\docmain.h (6721, 2020-12-24)
includes\unicode\dtintrv.h (3847, 2020-12-24)
includes\unicode\edits.h (15916, 2020-12-24)
includes\unicode\enumset.h (2104, 2020-12-24)
includes\unicode\errorcode.h (4894, 2020-12-24)
includes\unicode\filteredbrk.h (5595, 2020-12-24)
includes\unicode\icudataver.h (1051, 2020-12-24)
includes\unicode\icuplug.h (12141, 2020-12-24)
includes\unicode\idna.h (12938, 2020-12-24)
includes\unicode\listformatter.h (5097, 2020-12-24)
includes\unicode\localpointer.h (17284, 2020-12-24)
includes\unicode\locdspnm.h (7224, 2020-12-24)
includes\unicode\locid.h (32150, 2020-12-24)
includes\unicode\messagepattern.h (34449, 2020-12-24)
includes\unicode\normalizer2.h (34785, 2020-12-24)
includes\unicode\normlzr.h (31476, 2020-12-24)
includes\unicode\parseerr.h (3157, 2020-12-24)
includes\unicode\parsepos.h (5581, 2020-12-24)
includes\unicode\platform.h (27844, 2020-12-24)
includes\unicode\ptypes.h (3554, 2020-12-24)
includes\unicode\putil.h (6488, 2020-12-24)
... ...

# node-icu-tokenizer

Node.js String Tokenizer using ICU's BreakIterator

See [http://userguide.icu-project.org/boundaryanalysis](http://userguide.icu-project.org/boundaryanalysis) for a rundown on how the BreakIterator works.

Install the NPM module:

```
npm install @flowaccount/node-icu-tokenizer
```

Call the tokenizer:

```
new Tokenizer().tokenize('pretty quiet out there eh?');
```

Receive an array of tokens with boundaries:

```
[
  { token: 'pretty', bounds: { start: 0, end: 6 } },
  { token: 'quiet', bounds: { start: 7, end: 12 } },
  { token: 'out', bounds: { start: 13, end: 16 } },
  { token: 'there', bounds: { start: 17, end: 22 } },
  { token: 'eh', bounds: { start: 23, end: 25 } },
  { token: '?', bounds: { start: 25, end: 26 } }
]
```

### Tokenizer Options

**locale**
* An [ICU locale](http://userguide.icu-project.org/locale). Defaults to **en_US**.

**ignoreWhitespaceTokens**
* If true (default), whitespace is omitted from the returned tokens. Otherwise, whitespace is treated like normal words.

## Acknowledgments

This module is based on node-icu-wordsplit, which also uses the BreakIterator for tokenizing: [https://github.com/chakrit/node-icu-wordsplit](https://github.com/chakrit/node-icu-wordsplit)
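
To make the token shape and the two options above concrete without building the native binding, here is a minimal sketch that produces the same `{ token, bounds }` structure using the built-in `Intl.Segmenter` (available in Node.js 16+). This is a swapped-in technique, not this module's implementation: `tokenizeSketch` is a hypothetical helper, and it takes a BCP 47 locale tag such as `en-US` rather than the ICU-style `en_US` shown in the README.

```javascript
// Sketch only: uses Intl.Segmenter, NOT the node-icu-tokenizer native binding.
// tokenizeSketch is a hypothetical name; its options mirror the README's
// `locale` and `ignoreWhitespaceTokens` for illustration.
function tokenizeSketch(text, { locale = 'en-US', ignoreWhitespaceTokens = true } = {}) {
  // Word-granularity segmentation follows Unicode word-boundary rules.
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  const tokens = [];
  for (const { segment, index } of segmenter.segment(text)) {
    // Drop whitespace-only segments when the option is enabled.
    if (ignoreWhitespaceTokens && /^\s+$/u.test(segment)) continue;
    // Bounds are UTF-16 code unit offsets: start inclusive, end exclusive.
    tokens.push({ token: segment, bounds: { start: index, end: index + segment.length } });
  }
  return tokens;
}

const tokens = tokenizeSketch('pretty quiet out there eh?');
const withSpaces = tokenizeSketch('pretty quiet out there eh?', { ignoreWhitespaceTokens: false });
console.log(tokens);
console.log(withSpaces.length);
```

With `ignoreWhitespaceTokens: false`, the spaces between words come back as tokens of their own, which is how the option is described above.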
