icwb2-data

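Category: Articles/Documents
Development tool: Python
File size: 46906 KB
Downloads: 24
Upload date: 2018-12-08 17:18:43
Uploader: 万俟洛
Description: Chinese word segmentation data containing the as, cityu, msr, and pku corpora, including test and training sets.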

File list:
icwb2-data\doc\instructions.txt (7293, 2005-11-18)
icwb2-data\doc\result_instructions.txt (3599, 2005-11-18)
icwb2-data\gold\as_testing_gold.txt (638655, 2005-08-14)
icwb2-data\gold\as_testing_gold.utf8 (942571, 2005-08-14)
icwb2-data\gold\as_training_words.txt (973900, 2005-08-14)
icwb2-data\gold\as_training_words.utf8 (1390030, 2005-08-14)
icwb2-data\gold\cityu_test_gold.txt (175453, 2005-08-14)
icwb2-data\gold\cityu_test_gold.utf8 (240767, 2005-08-14)
icwb2-data\gold\cityu_training_words.txt (421758, 2005-08-14)
icwb2-data\gold\cityu_training_words.utf8 (585205, 2005-08-14)
icwb2-data\gold\msr_test_gold.txt (582840, 2005-08-14)
icwb2-data\gold\msr_test_gold.utf8 (766786, 2005-08-14)
icwb2-data\gold\msr_training_words.txt (740517, 2005-08-14)
icwb2-data\gold\msr_training_words.utf8 (1065391, 2005-08-14)
icwb2-data\gold\pku_test_gold.txt (551863, 2005-08-14)
icwb2-data\gold\pku_test_gold.utf8 (718331, 2005-08-14)
icwb2-data\gold\pku_training_words.txt (347101, 2005-08-14)
icwb2-data\gold\pku_training_words.utf8 (490217, 2005-08-14)
icwb2-data\scripts\mwseg.pl (3543, 2005-08-09)
icwb2-data\scripts\score (7228, 2005-08-14)
icwb2-data\testing\as_test.txt (422268, 2005-11-18)
icwb2-data\testing\as_test.utf8 (617992, 2005-11-18)
icwb2-data\testing\cityu_test.txt (136040, 2005-11-18)
icwb2-data\testing\cityu_test.utf8 (201354, 2005-11-18)
icwb2-data\testing\msr_test.txt (376280, 2005-11-18)
icwb2-data\testing\msr_test.utf8 (560226, 2005-11-18)
icwb2-data\testing\pku_test.txt (343120, 2005-11-18)
icwb2-data\testing\pku_test.utf8 (509588, 2005-11-18)
icwb2-data\training\as_training.b5 (27635392, 2005-07-01)
icwb2-data\training\as_training.utf8 (40743877, 2005-07-01)
icwb2-data\training\cityu_training.txt (6230851, 2005-06-30)
icwb2-data\training\cityu_training.utf8 (8549669, 2005-06-30)
icwb2-data\training\crf_data_2_word.py (1238, 2018-01-18)
icwb2-data\training\crf_learn.exe (40448, 2010-05-16)
icwb2-data\training\crf_model (278932, 2018-01-18)
icwb2-data\training\crf_test.exe (40448, 2010-05-16)
icwb2-data\training\libcrfpp.dll (339456, 2010-05-16)
icwb2-data\training\make_crf_test_data.py (899, 2018-01-18)
icwb2-data\training\make_crf_train_data.py (1156, 2018-01-17)
... ...
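
The listing above pairs most `.utf8` files with a legacy-encoded counterpart. As a rough sketch, assuming the encodings stated in the release notes further down this page (as_ in Big Five/CP950, hk_ in Big Five/HKSCS, msr_ and pku_ in EUC-CN/CP936), a small Python helper can pick a codec from a file name. The names `codec_for` and `read_lines` are hypothetical, and treating the `cityu_` files as the `hk_` corpus is an assumption based only on the file names shown here.

```python
from pathlib import PureWindowsPath

# File-name prefix -> Python codec, following the "Encoding Issues" notes.
LEGACY_CODECS = {
    "as_": "cp950",         # Big Five (CP950)
    "hk_": "big5hkscs",     # Big Five/HKSCS
    "cityu_": "big5hkscs",  # assumption: cityu_ files are the hk_ corpus
    "msr_": "gbk",          # Python's "gbk" codec is also known as cp936
    "pku_": "gbk",
}


def codec_for(path):
    """Guess the Python codec for a corpus file from its name."""
    # PureWindowsPath accepts both "\\" and "/" as separators.
    name = PureWindowsPath(path).name
    if name.endswith(".utf8"):
        return "utf-8"
    for prefix, codec in LEGACY_CODECS.items():
        if name.startswith(prefix):
            return codec
    raise ValueError(f"no known encoding for {name!r}")


def read_lines(path):
    """Read a corpus file as a list of decoded lines without line endings."""
    with open(path, encoding=codec_for(path)) as fh:
        return [line.rstrip("\r\n") for line in fh]
```

For example, `codec_for(r"icwb2-data\gold\pku_test_gold.txt")` selects the `gbk` codec, while any `.utf8` file is read as UTF-8 regardless of its prefix.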

2nd International Chinese Word Segmentation Bakeoff - Data Release
Release 1, 2005-11-18

* Introduction

This directory contains the training, test, and gold-standard data
used in the 2nd International Chinese Word Segmentation Bakeoff. Also
included is the script used to score the results submitted by the
bakeoff participants and the simple segmenter used to generate the
baseline and topline data.

* File List

gold/       Contains the gold standard segmentation of the test data
            along with the training data word lists.
scripts/    Contains the scoring script and simple segmenter.
testing/    Contains the unsegmented test data.
training/   Contains the segmented training data.
doc/        Contains the instructions used in the bakeoff.

* Encoding Issues

Files with the extension ".utf8" are encoded in UTF-8 Unicode.

Files with the extension ".txt" are encoded as follows:

as_    Big Five (CP950)
hk_    Big Five/HKSCS
msr_   EUC-CN (CP936)
pku_   EUC-CN (CP936)

EUC-CN is often called "GB" or "GB2312" encoding, though technically
GB2312 is a character set, not a character encoding.

* Scoring

The script 'score' is used to compare two segmentations. The script
takes three arguments:

1. The training set word list
2. The gold standard segmentation
3. The segmented test file

You must not mix character encodings when invoking the scoring
script. For example:

% perl scripts/score gold/cityu_training_words.utf8 \
    gold/cityu_test_gold.utf8 test_segmentation.utf8 > score.utf8

* Licensing

The corpora have been made available by the providers for the purposes
of this competition only. By downloading the training and testing
corpora, you agree that you will not use these corpora for any other
purpose than as material for this competition. Petitions to use the
data for any other purpose MUST be directed to the original providers
of the data. Neither SIGHAN nor the ACL will assume any liability for
a participant's misuse of the data.

* Questions?

Questions or comments about these data can be sent to Tom Emerson,
tree@sighan.org.
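
The kind of comparison described under Scoring can be illustrated with a short Python sketch. This is a simplified illustration, not a port of the `score` Perl script: it assumes both inputs are whitespace-separated segmentations of the same underlying text, already decoded with the same encoding, and it ignores the training word list that the real script takes as its first argument. The names `word_spans` and `prf` are hypothetical.

```python
def word_spans(segmented_line):
    """Return the set of (start, end) character offsets of each word."""
    spans = set()
    pos = 0
    # str.split() with no argument splits on any Unicode whitespace.
    for word in segmented_line.split():
        spans.add((pos, pos + len(word)))
        pos += len(word)
    return spans


def prf(gold_lines, test_lines):
    """Word-level precision, recall, and F1 of test_lines against gold_lines."""
    if len(gold_lines) != len(test_lines):
        raise ValueError("gold and test must have the same number of lines")
    correct = gold_total = test_total = 0
    for gold, test in zip(gold_lines, test_lines):
        gold_spans = word_spans(gold)
        test_spans = word_spans(test)
        # A word counts as correct only if both of its boundaries match.
        correct += len(gold_spans & test_spans)
        gold_total += len(gold_spans)
        test_total += len(test_spans)
    precision = correct / test_total if test_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example sentence, not taken from the corpora.
gold_example = ["共同 创造 美好 的 新 世纪"]
test_example = ["共同 创造 美 好 的 新世纪"]
precision, recall, f1 = prf(gold_example, test_example)
```

Comparing character-offset spans rather than the word strings themselves means a word only counts as correct when it occupies the same position in both segmentations.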
