arabicstopwords0.3

Category: Windows programming
Development tool: Java
File size: 250KB
Downloads: 3
Upload date: 2013-01-08 19:36:14
Uploader: mhidy
Description: Arabic stop word list

File list:
AUTHORS (83, 2009-09-03)
ChangeLog (429, 2010-12-02)
COPYING (18153, 2009-08-06)
data (0, 2010-12-05)
data\allforms (0, 2010-12-05)
data\allforms\stopwordsallforms.py (625535, 2010-12-04)
data\allforms\stopwordsallforms.sql (939913, 2010-12-04)
data\allforms\stopwordsallforms.txt (391248, 2010-12-04)
data\classified (0, 2010-12-05)
data\classified\stopword0.6.csv (26055, 2010-12-04)
data\classified\stopword0.6.xls (169984, 2010-12-03)
output (0, 2010-12-05)
stopwords.py (978, 2010-12-04)
TODO.txt (254, 2010-12-02)
tools (0, 2010-12-05)
tools\arabic_const.py (2833, 2007-09-09)
tools\ar_stowords.py (2195, 2010-12-03)
tools\generatefomrs.bat (319, 2010-12-05)
tools\generate_stopwords_forms.py (3990, 2010-12-03)
__init__.py (0, 2010-03-01)

#INSTALL

Arabic Stop words
-----------------
This list can be reused. It is not easy to determine the stop words, and stop words also differ from one use case to another. For this purpose, we propose a classified list which can be parameterized by the developer.

The word list contains only words in their common forms; all other forms are generated by a script.

Files
-----
data/ : contains the stop word data
data/classified/stopwords.csv : the data file as CSV
data/classified/stopwords.xls : data in Excel format with more valuable information, and classified stop words
data/allforms/stopwordsallforms.sql : all-forms database in SQL format
data/allforms/stopwords_allforms.txt : data generated from the minimal data file
data/allforms/stopwordsallforms.py : all-forms data as a Python dictionary
tools/ : scripts used to generate all forms from the minimal data

usage : generate_stopwords_forms.py -f data/stopwords.csv > output_file.txt

Note: to keep the program from processing some entries, comment out their lines with # in the data file.
Note: the script can be customized.

Data Structure
--------------
All forms data .CSV file
1st field : unvocalised word
2nd field : unvocalised stemmed word with '-' between affixes

Minimal classified data .CSV file
1st field : unvocalised word
2nd field : type of the word
3rd field : class of the word, e.g. preposition

Affixation information in other fields:
4th field : AIN in Arabic (ع) if the word accepts a conjunction, '*' otherwise
5th field : TEH in Arabic (ت) if the word accepts the definite article, '*' otherwise
6th field : JEEM in Arabic (ج) if the word accepts a preposition, '*' otherwise
7th field : DAD in Arabic (ض) if the word accepts IDAFA affixes, '*' otherwise
8th field : SAD in Arabic (ص) if the word accepts verb conjugation affixes, '*' otherwise
9th field : LAM in Arabic (ل) if the word accepts LAM QASAM, '*' otherwise
10th field : MEEM in Arabic (م) if the word has ALEF LAM as its definite article, '*' otherwise

How to customize the stop word list
-----------------------------------
1- check the minimal form data file (stopwords.csv)
2- comment out with "#" all words which you don't need
3- run the generate_stopwords_forms.py script
4- capture the output of the script.

Generation script usage
-----------------------
Usage: generate_stopwords_forms -f filename [OPTIONS]
    [-h | --help]                 outputs this usage message
    [-V | --version]              program version
    [-f | --file= filename]       input file to generate_stopwords_forms
    [-o | --out= output format]   output format (csv, python, sql)

How to add a word into the word list
------------------------------------
1- check that the word does not already exist in the minimal form data file (stopwords.csv)
2- add the word with its affixation information
3- run the generate_stopwords_forms.py script
4- capture the output of the script.

Thanks
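
The minimal classified data layout described above can be read with a few lines of Python. This is only a sketch and not part of the package. It assumes ten columns in the order the README lists them (word, type, class, then seven affixation flags). The ";" delimiter, the field names in FIELDS, the function name read_classified, and the sample row are all made up for illustration. Lines starting with "#" are skipped, mirroring the commenting convention of the data file.

```python
import csv

# Hypothetical names for the columns described in the README;
# the package itself may name them differently.
FIELDS = ["word", "type", "class", "conjunction", "definite",
          "preposition", "idafa", "conjugation", "qasam", "is_defined"]


def read_classified(lines, delimiter=";"):
    """Yield one dict per stop word, skipping blank and '#'-commented lines.

    The delimiter is an assumption; adjust it to match the real file.
    """
    kept = (ln for ln in lines
            if ln.strip() and not ln.lstrip().startswith("#"))
    for row in csv.reader(kept, delimiter=delimiter):
        cells = [c.strip() for c in row]
        # Pad short rows so that missing flags count as "not accepted".
        cells += ["*"] * (len(FIELDS) - len(cells))
        record = dict(zip(FIELDS[:3], cells[:3]))
        # A flag column holds an Arabic letter when the affix is accepted
        # and '*' otherwise, so anything other than '*' or empty is True.
        for name, cell in zip(FIELDS[3:], cells[3:]):
            record[name] = cell not in ("", "*")
        yield record


# Invented sample data, for illustration only.
sample = [
    "# a commented-out entry is ignored\n",
    "في;حرف;preposition;ع;*;*;ض;*;*;*\n",
]
for rec in read_classified(sample):
    print(rec["word"], rec["class"], rec["conjunction"], rec["definite"])
```

To read a real file, pass an open file object instead of the sample list, for example open(path, encoding="utf-8"), after checking the file's actual encoding and delimiter.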
