latok
Category: Feature extraction
Development tool: C
File size: 129KB
Downloads: 0
Upload date: 2020-02-01 22:36:39
Uploader: sh-1993
Description: Linear Algebraic Tokenizer
(Linear Algebraic Tokenizer)
File list:
.dockerignore (230, 2018-11-06)
.pylintrc (14957, 2018-11-06)
MANIFEST.in (62, 2018-11-06)
_testing_output (0, 2018-11-06)
bin (0, 2018-11-06)
bin\clean (313, 2018-11-06)
bin\dev (146, 2018-11-06)
bin\dock-notebook (147, 2018-11-06)
bin\install_local (147, 2018-11-06)
bin\notebook (151, 2018-11-06)
bin\setup-dev (2321, 2018-11-06)
bin\test (4056, 2018-11-06)
coverage.cfg (35, 2018-11-06)
docker (0, 2018-11-06)
docker\base (0, 2018-11-06)
docker\base\Dockerfile (3312, 2018-11-06)
docker\dockerutils.cfg (1308, 2018-11-06)
docker\jenkins (0, 2018-11-06)
docker\jenkins\Dockerfile (1207, 2018-11-06)
docker\notebook (0, 2018-11-06)
docker\notebook\Dockerfile (614, 2018-11-06)
docker\notebook\entrypoint (497, 2018-11-06)
docs (0, 2018-11-06)
latok (0, 2018-11-06)
latok\__init__.py (93, 2018-11-06)
latok\_version.py (18509, 2018-11-06)
latok\core (0, 2018-11-06)
latok\core\__init__.py (0, 2018-11-06)
latok\core\default_tokenizer.py (6861, 2018-11-06)
latok\core\latok_utils.py (3067, 2018-11-06)
latok\core\offsets.py (1012, 2018-11-06)
latok\core\src (0, 2018-11-06)
... ...
LaTok
=============================
Linear Algebraic Tokenizer
## Description
An NLP tokenizer based on linear algebraic operations.
### Key Points:
#### Algorithm:
* Construct a matrix representing each character in a string as a vector of features
  * Where features include, e.g.,
    * Unicode character properties such as alpha, numeric, uppercase, lowercase, etc.
    * context information, such as the properties of preceding or following characters
* Apply linear operations to the feature matrix to generate a tokenization mask
  * Where non-zero entries in the final mask identify the character positions at which to split the string into tokens.
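A minimal sketch of this idea, not LaTok's actual implementation: each character becomes a row of binary features, and a simple linear combination of feature columns (here, a one-position shift of the "is space" column) yields a mask whose non-zero entries mark token starts. The feature set and boundary rule are illustrative assumptions.

```python
import numpy as np

def feature_matrix(text):
    # One row per character; columns: [is_alpha, is_digit, is_space, is_upper].
    return np.array(
        [[c.isalpha(), c.isdigit(), c.isspace(), c.isupper()] for c in text],
        dtype=np.int8)

def boundary_mask(m):
    # Shift the "is space" column down one row: a non-space character
    # preceded by a space starts a new token; position 0 always does.
    prev_space = np.roll(m[:, 2], 1)
    prev_space[0] = 1
    return prev_space * (1 - m[:, 2])

def tokenize(text):
    m = feature_matrix(text)
    starts = np.flatnonzero(boundary_mask(m))
    ends = list(starts[1:]) + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]

print(tokenize("LaTok splits 2 tokens"))  # → ['LaTok', 'splits', '2', 'tokens']
```

Because the mask is computed with whole-array operations rather than a per-character Python loop, this style of tokenizer vectorizes naturally, which is what makes the C/NumPy performance goals below plausible.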
#### Classification:
* Provide token-level classification based on the character-level features.
#### Performance:
* As a primary design and implementation goal, ensure that tokenization is
  * fast, in terms of strings tokenized per unit of time
  * memory efficient, in terms of memory consumed during tokenization
  * implemented as C extensions to NumPy and Python where necessary
## Project Setup
* If you have ops/bin in your path, please remove it; it has been deprecated.
* Ensure that Python is installed. 3.5 or 3.6 is required at this point; 3.7 should be supported shortly.
* Ensure that Docker is installed and /data is configured as a file share.
* Ensure that your Python bin directory is in your path (likely /Library/Frameworks/Python.framework/Versions/3.6/bin).
* Ensure that your pip.conf (~/.pip/pip.conf) includes our internal PyPI servers (see pip.conf.template in this repo).
* Run bin/setup-dev to install the environment.
* Activate the virtual environment (source activate).
* Run the unit tests (bin/test -ud).
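The setup steps above can be run roughly as follows; the script names (`bin/setup-dev`, `bin/test`) come from this repo's file list, and the version check is an assumption based on the Python requirement stated above.

```shell
python3 --version   # expect 3.5.x or 3.6.x
bin/setup-dev       # create the development environment
source activate     # activate the virtual environment
bin/test -ud        # run the unit tests
```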