n-gram
Category: Natural language processing
Development tool: Python
File size: 304KB
Downloads: 0
Upload date: 2017-12-20 04:04:26
Uploader: sh-1993
Description: Sina News Crawler and Word Segmentation
File list:
.idea (0, 2017-12-20)
.idea\inspectionProfiles (0, 2017-12-20)
.idea\inspectionProfiles\Project_Default.xml (444, 2017-12-20)
.idea\inspectionProfiles\profiles_settings.xml (235, 2017-12-20)
.idea\misc.xml (1273, 2017-12-20)
.idea\modules.xml (264, 2017-12-20)
.idea\n-gram.iml (398, 2017-12-20)
.idea\workspace.xml (43181, 2017-12-20)
20160426作业.pdf (278687, 2017-12-20)
FileOperator.py (864, 2017-12-20)
nGram.py (8658, 2017-12-20)
reptile.py (4541, 2017-12-20)
result (0, 2017-12-20)
result\1Gram.txt (34651, 2017-12-20)
result\2Gram.txt (28813, 2017-12-20)
result\3Gram.txt (5535, 2017-12-20)
result\4Gram.txt (1586, 2017-12-20)
result\5Gram.txt (0, 2017-12-20)
# Sina news crawler + word segmentation
## Project Structure
- [Crawler Script](https://github.com/sunshineclt/n-gram/blob/master/./reptile.py)
- [File operations (with the date as filename)](https://github.com/sunshineclt/n-gram/blob/master/./FileOperator.py)
- [n-gram word segmentation](https://github.com/sunshineclt/n-gram/blob/master/./nGram.py)
- [Final result n-gram](https://github.com/sunshineclt/n-gram/blob/master/./result/)
- [Requirement of the project](https://github.com/sunshineclt/n-gram/blob/master/./20160426作业.pdf)
- [GitHub address](https://github.com/sunshineclt/n-gram)
## Usage
- Method 1 (without news material)
  - Change the start date and end date for the news crawler
  - Run the crawler
  - Change the start date and end date in nGram.py
  - Change the parameters (Frequency, Freedom, Condensation) in nGram.py
  - Run nGram.py
  - Wait for nGram.py to finish
  - 1Gram.txt through 5Gram.txt are generated when nGram.py ends
- Method 2 (with news material)
  - Change the parameters (Frequency, Freedom, Condensation) in nGram.py
  - Run nGram.py
  - Wait for nGram.py to finish
  - 1Gram.txt through 5Gram.txt are generated when nGram.py ends
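
The start and end dates in the steps above bound the news material, and FileOperator.py stores that material with the date as the filename. As a rough illustration of walking such a range, here is a minimal sketch. The `news` directory, the `YYYYMMDD.txt` naming, and the helper names `iter_dates` and `material_path` are assumptions made for the example and are not taken from the repository.

```python
from datetime import date, timedelta
from pathlib import Path


def iter_dates(start, end):
    """Yield every date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def material_path(day, root="news"):
    """Hypothetical layout: one text file per day, named by its date."""
    return Path(root) / f"{day:%Y%m%d}.txt"


if __name__ == "__main__":
    # Example range; replace with the dates you want to crawl or segment.
    for day in iter_dates(date(2016, 4, 1), date(2016, 4, 3)):
        print(material_path(day))
```

Keeping the range in one place like this would let the crawler and the segmentation step agree on which files exist, though the actual scripts may organize this differently.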
## Advantages
- Several obstacles to crawling are handled, such as
  - gzip compression
  - Other HTML attributes inside tags (some webpages even nest a tag more than 1k times, which causes the regular expression to hang)
  - I use regular expressions rather than HTTPParser, which the assignment required
- n-gram word segmentation
  - [Reference](http://www.matrix67.com/blog/archives/5044)
  - Three measures are used to decide whether a candidate is a word
    - Word frequency
    - Condensation (i.e. “电影院” is not “电” + “影院” or “电影” + “院”)
    - Freedom (i.e. “伊拉克” is not “伊拉”, nor is it “拉克”)
- Good comments
  - Almost every line of code has a comment
- Results for 2-character and 3-character words are very good
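
The three measures above can be sketched in a few lines. This is one common formulation and not necessarily what nGram.py does: condensation is taken as the smallest ratio p(word) / (p(left part) * p(right part)) over all two-way splits, and freedom as the smaller of the entropies of the left and right neighbouring characters. The function names and the choice of text length as the probability normalizer are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def candidate_stats(text, max_len=3):
    """Count every substring up to max_len and record its neighbours."""
    counts = Counter()
    left = defaultdict(Counter)   # candidate -> characters seen just before it
    right = defaultdict(Counter)  # candidate -> characters seen just after it
    n = len(text)
    for i in range(n):
        for k in range(1, max_len + 1):
            if i + k > n:
                break
            word = text[i:i + k]
            counts[word] += 1
            if i > 0:
                left[word][text[i - 1]] += 1
            if i + k < n:
                right[word][text[i + k]] += 1
    return counts, left, right


def entropy(neighbours):
    """Shannon entropy (natural log) of a Counter of neighbouring characters."""
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in neighbours.values())


def condensation(word, counts, total):
    """Smallest p(word) / (p(head) * p(tail)) over all splits.

    The word must have been counted by candidate_stats, so its parts exist.
    """
    if len(word) < 2:
        return float("inf")  # a single character cannot be split
    p_word = counts[word] / total
    return min(
        p_word / ((counts[word[:i]] / total) * (counts[word[i:]] / total))
        for i in range(1, len(word))
    )


def freedom(word, left, right):
    """Smaller of the left and right neighbour entropies."""
    return min(entropy(left[word]), entropy(right[word]))


if __name__ == "__main__":
    sample = "去电影院看电影，电影院里人很多"
    counts, left, right = candidate_stats(sample)
    for w in ("电影", "电影院"):
        print(w, counts[w], condensation(w, counts, len(sample)),
              freedom(w, left, right))
```

A candidate would then be accepted as a word only when all three values pass their thresholds, which correspond to the Frequency, Freedom, and Condensation parameters mentioned under Usage. Suitable threshold values depend on the corpus size.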
## Disadvantages
- The crawler may run into encoding problems. Some are Sina's fault, but some come from my decoding method (some of the webpages are not encoded in gb2312)
- n-gram word segmentation needs a large amount of memory, even though I use some memory-control methods
- The time complexity of n-gram word segmentation could be improved, although that may require even more space
- n-gram word segmentation does not account for function words, e.g. the 3-character candidate “激烈的”
- Results for 4-character and 5-character words are relatively poor. No 5-character words were found in 200M of news material, even though I lowered the threshold for 5-character words.
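
For the encoding problem in the first point, one possible mitigation is to try several encodings instead of assuming gb2312. The sketch below is an illustration and not the code in reptile.py: the function name `decode_page` and the order of encodings are assumptions. It also unpacks gzip-compressed responses, which the crawler has to deal with as well.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_page(raw, encodings=("utf-8", "gb18030")):
    """Gunzip the body if needed, then try each encoding in turn.

    gb18030 is a superset of gb2312 and gbk, so it covers pages that
    declare gb2312 but contain characters outside it.
    """
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep the page, replacing undecodable bytes.
    return raw.decode(encodings[-1], errors="replace")


if __name__ == "__main__":
    body = gzip.compress("新浪新闻".encode("gb2312"))
    print(decode_page(body))
```

Trying strict UTF-8 first is a deliberate choice here, since GB-encoded Chinese text is rarely valid UTF-8, while the reverse order would silently turn UTF-8 pages into mojibake.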
## Authored by Chen Letian on 2016.05.14