n-gram

Category: Natural language processing
Development tool: Python
File size: 304KB
Downloads: 0
Upload date: 2017-12-20 04:04:26
Uploader: sh-1993
Description: Sina News Crawler and Word Segmentation

File list:
.idea (0, 2017-12-20)
.idea\inspectionProfiles (0, 2017-12-20)
.idea\inspectionProfiles\Project_Default.xml (444, 2017-12-20)
.idea\inspectionProfiles\profiles_settings.xml (235, 2017-12-20)
.idea\misc.xml (1273, 2017-12-20)
.idea\modules.xml (264, 2017-12-20)
.idea\n-gram.iml (398, 2017-12-20)
.idea\workspace.xml (43181, 2017-12-20)
20160426作业.pdf (278687, 2017-12-20)
FileOperator.py (864, 2017-12-20)
nGram.py (8658, 2017-12-20)
reptile.py (4541, 2017-12-20)
result (0, 2017-12-20)
result\1Gram.txt (34651, 2017-12-20)
result\2Gram.txt (28813, 2017-12-20)
result\3Gram.txt (5535, 2017-12-20)
result\4Gram.txt (1586, 2017-12-20)
result\5Gram.txt (0, 2017-12-20)

# Sina news crawler + word segmentation

## Project Structure

- [Crawler script](https://github.com/sunshineclt/n-gram/blob/master/./reptile.py)
- [File operations (with the date as filename)](https://github.com/sunshineclt/n-gram/blob/master/./FileOperator.py)
- [n-gram word segmentation](https://github.com/sunshineclt/n-gram/blob/master/./nGram.py)
- [Final n-gram results](https://github.com/sunshineclt/n-gram/blob/master/./result/)
- [Project requirements](https://github.com/sunshineclt/n-gram/blob/master/./20160426作业.pdf)
- [GitHub address](https://github.com/sunshineclt/n-gram)

## Usage

- Method 1 (without news material)
  - Change the start date and end date for the news crawler
  - Run the crawler
  - Change the start date and end date in nGram.py
  - Change the parameters (Frequency, Freedom, Condensation) in nGram.py
  - Run nGram.py
  - Wait for nGram.py to finish
  - 1Gram.txt through 5Gram.txt are generated when nGram.py ends
- Method 2 (with news material)
  - Change the parameters (Frequency, Freedom, Condensation) in nGram.py
  - Run nGram.py
  - Wait for nGram.py to finish
  - 1Gram.txt through 5Gram.txt are generated when nGram.py ends

## Advantages

- Several sources of crawler interference are handled, such as
  - gzip compression
  - Other HTML attributes in

    (Some webpages even have nested more than 1k times, which causes the regular expression to die)
- I don't use HTTPParser as required, but use regular expressions instead
- n-gram word segmentation
  - [References](http://www.matrix67.com/blog/archives/5044)
  - Three measures are used to decide word segmentation
    - Word Frequency
    - Condensation (that is, “电影院” (cinema) is a word of its own, not “电” + “影院” or “电影” + “院”)
    - Freedom (that is, the word is “伊拉克” (Iraq), not “伊拉”, and not “拉客” either)
- Good comments
  - Almost every line of code has a comment
- Results for 2-character and 3-character words are very good

## Disadvantages

- The crawler may run into encoding problems. Some of them are on Sina's side, but some are due to my decoding method (some of the webpages are not encoded in gb2312)
- n-gram word segmentation requires a large amount of memory, although I've used some memory control methods
- The time complexity of n-gram word segmentation could be improved, although that may require even more space
- n-gram word segmentation does not take function words into account, for example the 3-character string “激烈的”
- Results for 4-character and 5-character words are relatively bad. No 5-character words were found in 200M of news material, even though I lowered the threshold for 5-character words.

## Authored By

Chen Letian at 2016.05.14
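
## Example: scoring candidate words

The three measures listed under Advantages (word frequency, condensation, freedom) can be illustrated with a small sketch. This is not the code in nGram.py. It is a minimal reading of the idea, assuming that condensation is the smallest ratio P(word) / (P(left) * P(right)) over all two-way splits of the candidate, and that freedom is the smaller of the entropies of the characters seen immediately to the left and to the right of the candidate. The function names `count_ngrams`, `condensation`, `entropy` and `freedom`, and the choice of `len(text)` as the normalizer, are assumptions made for illustration.

```python
import math
from collections import Counter


def count_ngrams(text, max_len):
    """Count every substring of length 1..max_len in text (word frequency)."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def condensation(word, counts, total):
    """Smallest P(word) / (P(left) * P(right)) over all two-way splits.

    Assumes len(word) >= 2 and that word occurs in the counted text,
    so every part of every split has a non-zero count.
    """
    p_word = counts[word] / total
    return min(
        p_word / ((counts[word[:k]] / total) * (counts[word[k:]] / total))
        for k in range(1, len(word))
    )


def entropy(neighbor_counts):
    """Shannon entropy (natural log) of a neighbor-character distribution."""
    total = sum(neighbor_counts.values())
    return -sum(c / total * math.log(c / total) for c in neighbor_counts.values())


def freedom(word, text):
    """Smaller of the left-neighbor and right-neighbor entropies of word."""
    left, right = Counter(), Counter()
    start = text.find(word)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1
        end = start + len(word)
        if end < len(text):
            right[text[end]] += 1
        start = text.find(word, start + 1)
    if not left or not right:
        # No neighbor observed on one side: treat the candidate as not free.
        return 0.0
    return min(entropy(left), entropy(right))


if __name__ == "__main__":
    # Tiny made-up sample; real input would be the crawled news text.
    sample = "电影院里放电影,电影院外卖电影票"
    counts = count_ngrams(sample, 3)
    for candidate in ("电影院", "电影"):
        print(
            candidate,
            counts[candidate],
            condensation(candidate, counts, len(sample)),
            freedom(candidate, sample),
        )
```

A candidate would then be kept as a word only if all three values exceed thresholds chosen by the user, which corresponds to the Frequency, Freedom and Condensation parameters mentioned under Usage. Counting every substring this way keeps all n-grams in memory at once, which is consistent with the memory cost noted under Disadvantages.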
