latok

Category: Feature Extraction
Development tool: C
File size: 129 KB
Downloads: 0
Upload date: 2020-02-01 22:36:39
Uploader: sh-1993
Description: Linear Algebraic Tokenizer

File list:
.dockerignore (230, 2018-11-06)
.pylintrc (14957, 2018-11-06)
MANIFEST.in (62, 2018-11-06)
_testing_output (0, 2018-11-06)
bin (0, 2018-11-06)
bin\clean (313, 2018-11-06)
bin\dev (146, 2018-11-06)
bin\dock-notebook (147, 2018-11-06)
bin\install_local (147, 2018-11-06)
bin\notebook (151, 2018-11-06)
bin\setup-dev (2321, 2018-11-06)
bin\test (4056, 2018-11-06)
coverage.cfg (35, 2018-11-06)
docker (0, 2018-11-06)
docker\base (0, 2018-11-06)
docker\base\Dockerfile (3312, 2018-11-06)
docker\dockerutils.cfg (1308, 2018-11-06)
docker\jenkins (0, 2018-11-06)
docker\jenkins\Dockerfile (1207, 2018-11-06)
docker\notebook (0, 2018-11-06)
docker\notebook\Dockerfile (614, 2018-11-06)
docker\notebook\entrypoint (497, 2018-11-06)
docs (0, 2018-11-06)
latok (0, 2018-11-06)
latok\__init__.py (93, 2018-11-06)
latok\_version.py (18509, 2018-11-06)
latok\core (0, 2018-11-06)
latok\core\__init__.py (0, 2018-11-06)
latok\core\default_tokenizer.py (6861, 2018-11-06)
latok\core\latok_utils.py (3067, 2018-11-06)
latok\core\offsets.py (1012, 2018-11-06)
latok\core\src (0, 2018-11-06)
... ...

LaTok
=====

Linear Algebraic Tokenizer

## Description

An NLP tokenizer based on linear algebraic operations.

### Key Points

#### Algorithm

* Construct a matrix representing each letter in a string as a vector of features, where the features are, e.g.:
  * Unicode character characteristics such as alpha, numeric, uppercase, lowercase, etc.
  * Context information such as the characteristics of preceding or following characters.
* Apply the relevant linear operations to the feature matrix to generate a tokenization mask, where non-zero entries in the final mask identify the character positions at which to split the string into tokens.

#### Classification

* Provide token-level classification based on the character-level features.

#### Performance

As a primary design and implementation goal, ensure that tokenization is:

* Performant in terms of strings tokenized per unit time
* Memory-efficient in terms of memory consumed for tokenization
* Implemented where necessary as C extensions to NumPy and Python

## Project Setup

* If you have ops/bin in your path, remove it; it has been deprecated.
* Ensure that Python is installed. Version 3.5 or 3.6 is required at this point; 3.7 should be supported shortly.
* Ensure that Docker is installed and /data is configured as a file share.
* Ensure that your Python bin directory is in your path (likely /Library/Frameworks/Python.framework/Versions/3.6/bin).
* Ensure that your pip.conf (~/.pip/pip.conf) includes our internal PyPI servers (see pip.conf.template in this repo).
* Run bin/setup-dev to install the environment.
* Activate the virtual environment (source activate).
* Run the unit tests (bin/test -ud).
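The matrix-plus-mask algorithm described above can be sketched in plain NumPy. This is a minimal illustration, not LaTok's actual API: the `tokenize` helper, the three example feature columns, and the change-detection rule are all assumptions chosen to show how shifted feature rows and elementwise operations can yield a split mask.

```python
import numpy as np

def tokenize(text):
    """Hypothetical sketch of linear-algebraic tokenization: build a
    per-character feature matrix, derive a split mask via a row shift
    and elementwise arithmetic, then cut the string at mask positions."""
    n = len(text)
    # Feature matrix: one row per character, one column per feature.
    # Example columns (illustrative only): [is_alpha, is_digit, is_space]
    feats = np.zeros((n, 3), dtype=np.int8)
    for i, ch in enumerate(text):
        feats[i] = (ch.isalpha(), ch.isdigit(), ch.isspace())

    # Context feature via a shift: each row paired with the previous
    # character's feature row (zeros at the start of the string).
    prev = np.vstack([np.zeros((1, 3), dtype=np.int8), feats[:-1]])

    # Elementwise ops produce the mask: non-zero wherever the character
    # class differs from the preceding character's class.
    changed = np.abs(feats - prev).sum(axis=1)
    mask = (changed > 0).astype(np.int8)
    mask[0] = 1  # always start a token at position 0

    # Non-zero mask entries mark token start positions; split there and
    # discard pure-whitespace tokens.
    starts = np.flatnonzero(mask)
    bounds = np.append(starts, n)
    tokens = [text[a:b] for a, b in zip(bounds[:-1], bounds[1:])]
    return [t for t in tokens if not t.isspace()]

print(tokenize("abc 123 x9"))  # → ['abc', '123', 'x', '9']
```

A real implementation would vectorize the feature extraction as well and, as the Performance section notes, push the hot loops into C extensions; the shift-and-compare structure of the mask computation stays the same.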
