Arabic-News

Category: Virtual/Augmented Reality (VR/AR)
Development tool: Jupyter Notebook
File size: 1143214KB
Downloads: 0
Upload date: 2021-12-16 10:42:59
Uploader: sh-1993
Description: Arabic News

File list:
compile_files.py (889, 2021-05-14)
compile_gz_files.py (898, 2021-05-14)
config (0, 2021-05-14)
config\config.cfg (14288, 2021-05-14)
config\config_lib.cfg (14338, 2021-05-14)
config\sitelist (copy).hjson (2127, 2021-05-14)
config\sitelist.hjson (831, 2021-05-14)
config\sitelist_20190409.hjson (831, 2021-05-14)
config\sitelist_20190419.hjson (2509, 2021-05-14)
config\sitelist_dw.hjson (1889, 2021-05-14)
corpora (0, 2021-05-14)
corpora\aljazeera.net_20190419_000.json (94319623, 2021-05-14)
corpora\aljazeera.net_20190419_000.txt (68194680, 2021-05-14)
corpora\aljazeera.net_20190419_001.json (94242870, 2021-05-14)
corpora\aljazeera.net_20190419_001.txt (68115658, 2021-05-14)
corpora\aljazeera.net_20190419_002.json (94030693, 2021-05-14)
corpora\aljazeera.net_20190419_002.txt (67926230, 2021-05-14)
corpora\aljazeera.net_20190419_003.json (94362913, 2021-05-14)
corpora\aljazeera.net_20190419_003.txt (68218417, 2021-05-14)
corpora\aljazeera.net_20190419_004.json (94864666, 2021-05-14)
corpora\aljazeera.net_20190419_004.txt (68759820, 2021-05-14)
corpora\aljazeera.net_20190419_005.json (94579410, 2021-05-14)
corpora\aljazeera.net_20190419_005.txt (68447646, 2021-05-14)
corpora\aljazeera.net_20190419_006.json (5988939, 2021-05-14)
corpora\aljazeera.net_20190419_006.txt (4335557, 2021-05-14)
corpora\arabic.cnn.com_20190419_000.json (79025829, 2021-05-14)
corpora\arabic.cnn.com_20190419_000.txt (58376058, 2021-05-14)
corpora\arabic.cnn.com_20190419_001.json (54015549, 2021-05-14)
corpora\arabic.cnn.com_20190419_001.txt (40080602, 2021-05-14)
corpora\arabic.euronews.com_20190409_000.json (66538384, 2021-05-14)
corpora\arabic.euronews.com_20190409_000.txt (47609044, 2021-05-14)
corpora\arabic.euronews.com_20190409_001.json (66574061, 2021-05-14)
corpora\arabic.euronews.com_20190409_001.txt (47645335, 2021-05-14)
corpora\arabic.euronews.com_20190409_002.json (37388928, 2021-05-14)
corpora\arabic.euronews.com_20190409_002.txt (26789341, 2021-05-14)
corpora\arabic.rt.com_20190419_000.json (61302663, 2021-05-14)
corpora\arabic.rt.com_20190419_000.txt (40251704, 2021-05-14)
... ...

# Arabic-News

Arabic news for language modelling, collected from:

* BBC Arabic
* EuroNews
* Aljazeera
* CNN Arabic
* RT Arabic

The articles were collected with the [news-please](https://github.com/fhamborg/news-please) Python library.

To extract the news text and titles, run `python json2corpus.py`.

Crawl Date: 19-04-2019

---

# Corpus information

| Corpus | Size | Number of words |
| ------- |:----:| ---------------:|
| Headlines | 54M | 487674 |
| JSC | 395M | 1525372 |
| RT | 713M | 3411451 |
| CNN | 94M | 317260 |
| BBC | 854M | 17***796 |
| Euronews | 279M | 517227 |

---

Log:

```
corpus name: 09/arabic.euronews.com/
processing /home/motaz/newsgitrepo/data/2019/04/09/arabic.euronews.com/
# of files 4***68
46079 is done out of 4***68
number of files 4***68
{'ar': 4***68}
short count 389
.json
number of num_parts 18000
len of sub_lists 3
.txt
number of num_parts 18000
len of sub_lists 3
-------------------------------
corpus name: 09/bbc.com/
processing /home/motaz/newsgitrepo/data/2019/04/09/bbc.com/
# of files 212271
94734 is done out of 212271
number of files 212271
{'pt': 1, 'ar': 97468, 'fa': 114***8, 'en': 154}
short count 2734
.json
number of num_parts 18000
len of sub_lists 6
.txt
number of num_parts 18000
len of sub_lists 6
-------------------------------
corpus name: 19/aljazeera.net/
processing /home/motaz/newsgitrepo/data/2019/04/19/aljazeera.net/
# of files 249106
109141 is done out of 249106
number of files 249106
{'ar': 170003, 'en': 3}
short count 60862
.json
number of num_parts 18000
len of sub_lists 7
.txt
number of num_parts 18000
len of sub_lists 7
-------------------------------
corpus name: 19/arabic.rt.com/
processing /home/motaz/newsgitrepo/data/2019/04/19/arabic.rt.com/
# of files 368920
334268 is done out of 368920
number of files 368920
{'ar': 368857}
short count 34589
.json
number of num_parts 18000
len of sub_lists 19
.txt
number of num_parts 18000
len of sub_lists 19
-------------------------------
corpus name: 19/arabic.cnn.com/
processing /home/motaz/newsgitrepo/data/2019/04/19/arabic.cnn.com/
# of files 30338
30140 is done out of 30338
number of files 30338
{'ar': 30338}
short count 1***
.json
number of num_parts 18000
len of sub_lists 2
.txt
number of num_parts 18000
len of sub_lists 2
-------------------------------
all done
```

---

## compress files in a directory

Before:

```
$ tree dir1/
dir1/
|-- dir11
|   |-- file11
|   |-- file12
|   `-- file13
|-- file1
|-- file2
`-- file3
```

Now run the gzip command `$ gzip -r dir1`.

After:

```
$ tree dir1/
dir1/
|-- dir11
|   |-- file11.gz
|   |-- file12.gz
|   `-- file13.gz
|-- file1.gz
|-- file2.gz
`-- file3.gz
```

# delete a lot of files

`find . -name '*.html.gz' -print0 | xargs -0 rm`

# compress files using 7z
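
The source gives no command for this step, so the following is only a minimal sketch, not the author's method. It assumes the `7z` command-line tool (for example from p7zip) is on the `PATH` and wraps its `a` (add to archive) sub-command from Python. The function names, `dir1`, and `dir1.7z` are hypothetical and chosen for illustration.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_7z_command(source_dir: str, archive_path: str) -> List[str]:
    """Build the argument list for `7z a <archive> <source_dir>`."""
    # "a" is the 7z sub-command that adds files to an archive.
    return ["7z", "a", archive_path, source_dir]


def compress_with_7z(source_dir: str, archive_path: str) -> bool:
    """Run 7z if it is installed; return True when 7z exits successfully."""
    if shutil.which("7z") is None:
        # Assumption: the binary is named "7z"; nothing is run if it is missing.
        return False
    if not Path(source_dir).is_dir():
        raise NotADirectoryError(source_dir)
    result = subprocess.run(build_7z_command(source_dir, archive_path), check=False)
    return result.returncode == 0


if __name__ == "__main__":
    # Hypothetical names: archive the dir1 tree into dir1.7z.
    print(build_7z_command("dir1", "dir1.7z"))
```

Unlike `gzip -r`, which replaces each file with its own `.gz`, `7z a` packs the whole directory into a single archive and leaves the original files in place.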
