Arabic-News
Category: Virtual/Augmented Reality (VR/AR)
Development tool: Jupyter Notebook
File size: 1143214 KB
Downloads: 0
Upload date: 2021-12-16 10:42:59
Uploader: sh-1993
Description: Arabic News
File list:
compile_files.py (889, 2021-05-14)
compile_gz_files.py (898, 2021-05-14)
config (0, 2021-05-14)
config\config.cfg (14288, 2021-05-14)
config\config_lib.cfg (14338, 2021-05-14)
config\sitelist (copy).hjson (2127, 2021-05-14)
config\sitelist.hjson (831, 2021-05-14)
config\sitelist_20190409.hjson (831, 2021-05-14)
config\sitelist_20190419.hjson (2509, 2021-05-14)
config\sitelist_dw.hjson (1889, 2021-05-14)
corpora (0, 2021-05-14)
corpora\aljazeera.net_20190419_000.json (94319623, 2021-05-14)
corpora\aljazeera.net_20190419_000.txt (68194680, 2021-05-14)
corpora\aljazeera.net_20190419_001.json (94242870, 2021-05-14)
corpora\aljazeera.net_20190419_001.txt (68115658, 2021-05-14)
corpora\aljazeera.net_20190419_002.json (94030693, 2021-05-14)
corpora\aljazeera.net_20190419_002.txt (67926230, 2021-05-14)
corpora\aljazeera.net_20190419_003.json (94362913, 2021-05-14)
corpora\aljazeera.net_20190419_003.txt (68218417, 2021-05-14)
corpora\aljazeera.net_20190419_004.json (94864666, 2021-05-14)
corpora\aljazeera.net_20190419_004.txt (68759820, 2021-05-14)
corpora\aljazeera.net_20190419_005.json (94579410, 2021-05-14)
corpora\aljazeera.net_20190419_005.txt (68447646, 2021-05-14)
corpora\aljazeera.net_20190419_006.json (5988939, 2021-05-14)
corpora\aljazeera.net_20190419_006.txt (4335557, 2021-05-14)
corpora\arabic.cnn.com_20190419_000.json (79025829, 2021-05-14)
corpora\arabic.cnn.com_20190419_000.txt (58376058, 2021-05-14)
corpora\arabic.cnn.com_20190419_001.json (54015549, 2021-05-14)
corpora\arabic.cnn.com_20190419_001.txt (40080602, 2021-05-14)
corpora\arabic.euronews.com_20190409_000.json (66538384, 2021-05-14)
corpora\arabic.euronews.com_20190409_000.txt (47609044, 2021-05-14)
corpora\arabic.euronews.com_20190409_001.json (66574061, 2021-05-14)
corpora\arabic.euronews.com_20190409_001.txt (47645335, 2021-05-14)
corpora\arabic.euronews.com_20190409_002.json (37388928, 2021-05-14)
corpora\arabic.euronews.com_20190409_002.txt (26789341, 2021-05-14)
corpora\arabic.rt.com_20190419_000.json (61302663, 2021-05-14)
corpora\arabic.rt.com_20190419_000.txt (40251704, 2021-05-14)
... ...
# Arabic-News
Arabic news for language modelling, collected from:
* BBC Arabic
* EuroNews
* Aljazeera
* CNN Arabic
* RT Arabic
These articles were collected with the [news-please](https://github.com/fhamborg/news-please) Python library.
To extract the news text and titles, run:
`python json2corpus.py`
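
The contents of `json2corpus.py` are not shown here, so the following is only a minimal sketch of what such an extraction step could look like, not the script itself. It assumes each crawled article is a JSON object with a `title` key and a body under `maintext` (falling back to `text`); the function names `extract_title_and_text` and `json_dir_to_corpus` are hypothetical.

```python
import json
from pathlib import Path
from typing import Tuple


def extract_title_and_text(article: dict) -> Tuple[str, str]:
    """Return (title, body) from one article record."""
    title = (article.get("title") or "").strip()
    # Assumption: the body lives under "maintext"; fall back to "text" if absent.
    body = (article.get("maintext") or article.get("text") or "").strip()
    return title, body


def json_dir_to_corpus(json_dir: Path, titles_path: Path, text_path: Path) -> int:
    """Write one title per line and one article body per line.

    Returns the number of articles written. Unreadable files and
    articles without a body are skipped.
    """
    count = 0
    with titles_path.open("w", encoding="utf-8") as titles, \
            text_path.open("w", encoding="utf-8") as texts:
        for path in sorted(json_dir.rglob("*.json")):
            try:
                article = json.loads(path.read_text(encoding="utf-8"))
            except (json.JSONDecodeError, UnicodeDecodeError):
                continue  # skip files that are not valid UTF-8 JSON
            if not isinstance(article, dict):
                continue
            title, body = extract_title_and_text(article)
            if not body:
                continue
            titles.write(title + "\n")
            # Collapse internal whitespace so each article stays on one line.
            texts.write(" ".join(body.split()) + "\n")
            count += 1
    return count
```

Keeping one article per line makes the output easy to shuffle, split, and feed to language-model tooling.
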
Crawl dates: 09-04-2019 (BBC Arabic, EuroNews) and 19-04-2019 (Aljazeera, CNN Arabic, RT Arabic)
---
# Corpus information
| Corpus | Size | Number of words |
| ------- |:----:| ---------------:|
| Headlines | 54M | 487674 |
| JSC | 395M | 1525372 |
| RT | 713M | 3411451 |
| CNN | 94M | 317260 |
| BBC | 854M | 17***796 |
| Euronews | 279M | 517227 |
---
Log:
```
corpus name: 09/arabic.euronews.com/
processing /home/motaz/newsgitrepo/data/2019/04/09/arabic.euronews.com/
# of files 4***68
46079 is done out of 4***68
number of files 4***68
{'ar': 4***68}
short count 389
.json
number of num_parts 18000
len of sub_lists 3
.txt
number of num_parts 18000
len of sub_lists 3
-------------------------------
corpus name: 09/bbc.com/
processing /home/motaz/newsgitrepo/data/2019/04/09/bbc.com/
# of files 212271
94734 is done out of 212271
number of files 212271
{'pt': 1, 'ar': 97468, 'fa': 114***8, 'en': 154}
short count 2734
.json
number of num_parts 18000
len of sub_lists 6
.txt
number of num_parts 18000
len of sub_lists 6
-------------------------------
corpus name: 19/aljazeera.net/
processing /home/motaz/newsgitrepo/data/2019/04/19/aljazeera.net/
# of files 249106
109141 is done out of 249106
number of files 249106
{'ar': 170003, 'en': 3}
short count 60862
.json
number of num_parts 18000
len of sub_lists 7
.txt
number of num_parts 18000
len of sub_lists 7
-------------------------------
corpus name: 19/arabic.rt.com/
processing /home/motaz/newsgitrepo/data/2019/04/19/arabic.rt.com/
# of files 368920
334268 is done out of 368920
number of files 368920
{'ar': 368857}
short count 34589
.json
number of num_parts 18000
len of sub_lists 19
.txt
number of num_parts 18000
len of sub_lists 19
-------------------------------
corpus name: 19/arabic.cnn.com/
processing /home/motaz/newsgitrepo/data/2019/04/19/arabic.cnn.com/
# of files 30338
30140 is done out of 30338
number of files 30338
{'ar': 30338}
short count 1***
.json
number of num_parts 18000
len of sub_lists 2
.txt
number of num_parts 18000
len of sub_lists 2
-------------------------------
all done
```
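
The log reports a part size of 18000 and a `len of sub_lists` for each site. Those numbers are consistent with splitting the processed articles into consecutive chunks of at most 18000 items, and they match the numbered `_000`, `_001`, ... files in the listing. The sketch below illustrates that reading; the actual splitting code is not shown in this document, and `split_into_parts` and `expected_parts` are hypothetical names.

```python
import math
from typing import List, Sequence, TypeVar

T = TypeVar("T")

PART_SIZE = 18000  # value reported as "number of num_parts" in the log


def split_into_parts(items: Sequence[T], part_size: int = PART_SIZE) -> List[Sequence[T]]:
    """Split items into consecutive chunks of at most part_size elements."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    return [items[i:i + part_size] for i in range(0, len(items), part_size)]


def expected_parts(n_items: int, part_size: int = PART_SIZE) -> int:
    """Number of chunks split_into_parts would produce for n_items elements."""
    return math.ceil(n_items / part_size)


# Processed-article counts taken from the "is done out of" lines of the log.
for site, done in [("arabic.euronews.com", 46079), ("bbc.com", 94734),
                   ("aljazeera.net", 109141), ("arabic.rt.com", 334268),
                   ("arabic.cnn.com", 30140)]:
    print(site, expected_parts(done))
```

Under this assumption the five sites yield 3, 6, 7, 19, and 2 parts, the same values the log prints as `len of sub_lists`.
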
---
## Compress files in a directory
```
$ tree dir1/
dir1/
|-- dir11
| |-- file11
| |-- file12
| `-- file13
|-- file1
|-- file2
`-- file3
```
Now run `gzip` recursively on the directory:
`$ gzip -r dir1`
Afterwards, every file has been replaced by its `.gz` counterpart:
```
$ tree dir1/
dir1/
|-- dir11
| |-- file11.gz
| |-- file12.gz
| `-- file13.gz
|-- file1.gz
|-- file2.gz
`-- file3.gz
```
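
Where the same step has to run from a Python script rather than the shell, the standard library can approximate `gzip -r`. This is a sketch, not an exact equivalent: unlike the `gzip` command it does not preserve file timestamps or permissions, and `gzip_tree` is a hypothetical helper name.

```python
import gzip
import shutil
from pathlib import Path


def gzip_tree(root: Path) -> int:
    """Compress every regular file under root to <name>.gz and delete the original.

    Returns the number of files compressed.
    """
    compressed = 0
    # Build the file list up front so freshly written .gz files are not revisited.
    for path in [p for p in root.rglob("*") if p.is_file()]:
        if path.suffix == ".gz":
            continue  # already compressed, left unchanged
        target = path.with_name(path.name + ".gz")
        with path.open("rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)  # stream, so large files are not loaded into memory
        path.unlink()
        compressed += 1
    return compressed
```

Calling `gzip_tree(Path("dir1"))` on the example tree above would leave `file1.gz`, `file2.gz`, `file3.gz`, and the three `.gz` files inside `dir11`.
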
## Delete a lot of files
`find . -name '*.html.gz' -print0 | xargs -0 rm`
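
Piping `find` into `xargs` avoids the "argument list too long" error that a plain `rm *.html.gz` can hit when a directory holds very many files, and `-print0` with `-0` keeps file names containing spaces intact. A rough Python counterpart is sketched below; `delete_matching` is a hypothetical helper, and like the shell command it deletes without asking for confirmation.

```python
from pathlib import Path


def delete_matching(root: Path, pattern: str = "*.html.gz") -> int:
    """Delete every regular file under root whose name matches pattern.

    Returns the number of files removed.
    """
    deleted = 0
    # Collect matches first so the tree is not modified while it is being walked.
    for path in list(root.rglob(pattern)):
        if path.is_file():
            path.unlink()
            deleted += 1
    return deleted
```

Running `delete_matching(Path("."))` mirrors the command above for the current directory.
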
## Compress files using 7z