arabicstopwords0.3
Category: Windows programming
Development tool: Java
File size: 250KB
Downloads: 3
Upload date: 2013-01-08 19:36:14
Uploader: mhidy
Description: arabic stop word list
File list:
AUTHORS (83, 2009-09-03)
ChangeLog (429, 2010-12-02)
COPYING (18153, 2009-08-06)
data (0, 2010-12-05)
data\allforms (0, 2010-12-05)
data\allforms\stopwordsallforms.py (625535, 2010-12-04)
data\allforms\stopwordsallforms.sql (939913, 2010-12-04)
data\allforms\stopwordsallforms.txt (391248, 2010-12-04)
data\classified (0, 2010-12-05)
data\classified\stopword0.6.csv (26055, 2010-12-04)
data\classified\stopword0.6.xls (169984, 2010-12-03)
output (0, 2010-12-05)
stopwords.py (978, 2010-12-04)
TODO.txt (254, 2010-12-02)
tools (0, 2010-12-05)
tools\arabic_const.py (2833, 2007-09-09)
tools\ar_stowords.py (2195, 2010-12-03)
tools\generatefomrs.bat (319, 2010-12-05)
tools\generate_stopwords_forms.py (3990, 2010-12-03)
__init__.py (0, 2010-03-01)
#INSTALL
------------------
Arabic Stop words
--------------------
- This list can be reused.
It is not easy to determine the stop words; moreover, stop words differ according to the use case.
For this purpose, we propose a classified list
which can be parameterized by the developer.
The word list contains only words in their common forms,
and we have generated all forms with a script.
Files
------
data/ : contains the stop word data
data/classified/stopwords.csv: the data file as CSV
data/classified/stopwords.xls: data in Excel format with more detailed information and classified stop words
data/allforms/stopwordsallforms.sql: all-forms database in SQL format
data/allforms/stopwordsallforms.txt: data generated from the minimal data file
data/allforms/stopwordsallforms.py: all-forms data as a Python dictionary
tools/: scripts used to generate all forms from the minimal data
usage :
generate_stopwords_forms.py -f data/stopwords.csv > output_file.txt
Note: to keep the program from processing some entries, comment out their lines with # in the data file.
Note: the script can be customized.
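The comment convention in the note above can be sketched in Python as follows. This is a minimal sketch, not the code of generate_stopwords_forms.py: the function name iter_data_lines is hypothetical, and it assumes that a line counts as commented out when its first non-blank character is #.

```python
import io

def iter_data_lines(stream):
    """Yield data lines, skipping blank lines and lines commented with '#'."""
    for raw in stream:
        line = raw.strip()
        # Assumption: a line is commented out when it starts with '#'.
        if not line or line.startswith("#"):
            continue
        yield line

# Made-up sample standing in for a data file.
sample = io.StringIO("word1\n#word2\n\nword3\n")
kept = list(iter_data_lines(sample))
print(kept)
```

Only word1 and word3 survive; the commented entry and the blank line are dropped.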
Data Structure
--------------
All forms data .CSV file
1st field : unvocalised word ( í)
2nd field : unvocalised stemmed word with '-' between affixes: e.g. --í-í
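A typical use of the all-forms data is to filter tokens against it. The sketch below loads the two fields into a dictionary and removes matching tokens. The function names load_all_forms and remove_stopwords are hypothetical, the tab delimiter is an assumption to check against the real file, and the sample rows are made up for illustration rather than copied from the data.

```python
import csv
import io

def load_all_forms(stream, delimiter="\t"):
    """Map each unvocalised form (1st field) to its segmented form (2nd field)."""
    forms = {}
    for row in csv.reader(stream, delimiter=delimiter):
        # Skip blank rows, short rows, and rows commented with '#'.
        if len(row) < 2 or row[0].startswith("#"):
            continue
        forms[row[0]] = row[1]
    return forms

def remove_stopwords(tokens, forms):
    """Keep only the tokens that are not listed as stop word forms."""
    return [t for t in tokens if t not in forms]

# Made-up rows: word, then the same word with '-' between affixes.
sample = io.StringIO("من\tمن\nومن\tو-من\n")
forms = load_all_forms(sample)
filtered = remove_stopwords(["ومن", "كتاب", "من"], forms)
```

A dictionary is used rather than a set so that the segmented form stays available for each matched word.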
Minimal classified data .CSV file
1st field : unvocalised word ( í)
2nd field : type of the word: e.g.
3rd field : class of the word: e.g. preposition
Affixation information in the other fields:
4th field : AIN in Arabic if the word accepts a conjunction 'á', '*' otherwise
5th field : TEH in Arabic if the word accepts the definite article 'á áí', '*' otherwise
6th field : JEEM in Arabic if the word accepts preposition articles ' á áá', '*' otherwise
7th field : DAD in Arabic if the word accepts IDAFA articles 'á áá', '*' otherwise
8th field : SAD in Arabic if the word accepts verb conjugation articles 'áí', '*' otherwise
9th field : LAM in Arabic if the word accepts LAM QASAM articles 'á á', '*' otherwise
10th field : MEEM in Arabic if the word has ALEF LAM as definite article '', '*' otherwise
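One row of the minimal classified file could be decoded as sketched below. This is an illustration under assumptions, not the project's own parser: it assumes the seven affixation fields follow the first three in the order listed above, treats any value other than '*' as "accepted", and uses hypothetical flag names and a made-up sample row.

```python
# Hypothetical labels for the seven affixation fields, in the order listed.
AFFIX_FLAGS = (
    "conjunction",
    "definite_article",
    "preposition",
    "idafa",
    "verb_conjugation",
    "lam_qasam",
    "alef_lam",
)

def parse_classified_row(fields):
    """Turn one row (a list of fields) into a dictionary of named values."""
    word, word_type, word_class = fields[0], fields[1], fields[2]
    # Assumption: '*' means "not accepted"; any other marker means "accepted".
    accepts = {
        name: value.strip() != "*"
        for name, value in zip(AFFIX_FLAGS, fields[3:])
    }
    return {"word": word, "type": word_type, "class": word_class, "accepts": accepts}

# Made-up row: accepts a conjunction (AIN) and IDAFA (DAD), nothing else.
row = ["من", "حرف", "preposition", "ع", "*", "*", "ض", "*", "*", "*"]
entry = parse_classified_row(row)
```

Booleans are easier to filter on than the letter markers, for example to select only the words that accept a conjunction.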
How to customize the stop word list
---------------
1- open the minimal form data file (stopwords.csv)
2- comment out with "#" all the words you don't need
3- run the generate_stopwords_forms.py script
4- capture the output of the script.
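The commenting step can be automated for a known set of unwanted words. The sketch below is one possible way to do it, not part of the distributed tools: the function name comment_out is hypothetical, and the comma delimiter is an assumption to check against the real stopwords.csv.

```python
def comment_out(lines, unwanted, delimiter=","):
    """Prefix with '#' every line whose first field is in `unwanted`."""
    result = []
    for line in lines:
        word = line.split(delimiter, 1)[0].strip()
        already_commented = line.lstrip().startswith("#")
        if word in unwanted and not already_commented:
            result.append("#" + line)
        else:
            result.append(line)
    return result

# Made-up rows standing in for the minimal data file.
lines = ["من,x", "في,y", "#على,z"]
edited = comment_out(lines, {"في"})
```

The edited lines can then be written back to a copy of the data file and passed to the generation script.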
Generation script usage:
------------------------
Usage: generate_stopwords_forms -f filename [OPTIONS]
[-h | --help] outputs this usage message
[-V | --version] program version
[-f | --file= filename] input file to generate_stopwords_forms
[-o | --out= output format] output format(csv,python,sql)
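When the script is driven from another Python program, the command line can be assembled from the options listed above. This is a sketch under assumptions: build_command is a hypothetical helper, the script path is taken from the file list, and the interpreter name "python" is a guess, since the scripts date from 2010 and may need an older Python version.

```python
OUTPUT_FORMATS = ("csv", "python", "sql")

def build_command(input_file, out_format=None,
                  script="tools/generate_stopwords_forms.py",
                  interpreter="python"):
    """Build the argument list for one run of the generation script."""
    if out_format is not None and out_format not in OUTPUT_FORMATS:
        raise ValueError("unsupported output format: %s" % out_format)
    cmd = [interpreter, script, "-f", input_file]
    if out_format is not None:
        cmd += ["-o", out_format]
    return cmd

cmd = build_command("data/stopwords.csv", "sql")
```

The resulting list can be passed to subprocess.run(cmd, stdout=fh, check=True), with fh being an open output file, which mirrors the shell redirection to output_file.txt.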
How to add a word to the word list
---------------
1- check that the word does not already exist in the minimal form data file (stopwords.csv)
2- add the word with its affixation information
3- run the generate_stopwords_forms.py script
4- capture the output of the script.
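The first two steps can be sketched as a small helper that refuses duplicates. The name add_word is hypothetical, the comma delimiter is an assumption to check against the real stopwords.csv, and the rows are made up. Commented-out entries are treated as existing, so that a disabled word is re-enabled by removing its # rather than added twice.

```python
def add_word(lines, new_row, delimiter=","):
    """Append new_row unless its word (1st field) is already in lines.

    Returns the resulting list of lines and a flag telling whether
    the row was added.
    """
    existing = {
        line.lstrip("#").split(delimiter, 1)[0].strip()
        for line in lines
        if line.strip()
    }
    if new_row[0] in existing:
        return lines, False
    return lines + [delimiter.join(new_row)], True

# Made-up content of the minimal data file.
lines = ["من,حرف,preposition,ع,*,*,*,*,*,*"]
same, added_same = add_word(lines, ["من", "حرف", "preposition"] + ["*"] * 7)
more, added_new = add_word(lines, ["في", "حرف", "preposition"] + ["*"] * 7)
```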
Thanks