ria_news_dataset

Category: Feature extraction
Development tool: Others
File size: 1039KB
Downloads: 0
Upload date: 2019-09-25 09:05:43
Uploader: sh-1993
Description: "Rossiya Segodnya" news dataset

File list:
LICENSE (14228, 2019-09-25)
LICENSE.ru (32280, 2019-09-25)
ria.json.gz (135, 2019-09-25)
ria_1k.json (4388286, 2019-09-25)
ria_20.json (57198, 2019-09-25)

# "Rossiya Segodnya" news dataset

This repository contains a news dataset presented in the paper:

Daniil Gavrilov, Pavel Kalaidin, and Valentin Malykh. Self-Attentive Model for Headline Generation. 41st European Conference on Information Retrieval, 2019. __[arXiv:1901.07786 [cs.CL]](https://arxiv.org/abs/1901.07786)__

To download the dataset, please use a direct [link](https://github.com/RossiyaSegodnya/ria_news_dataset/raw/master/ria.json.gz) or clone the repository using `git lfs`.

## Description

The full dataset contains _1003869_ Russian-language news documents from _January 2010_ to _December 2014_.

* [`ria_20.json`](./ria_20.json) contains the first 20 news documents from the dataset.
* [`ria_1k.json`](./ria_1k.json) contains the first 1000 news documents from the dataset.
* [`ria.json.gz`](./ria.json.gz) is the full gzip-compressed dataset.

Dataset format: each row contains a JSON document that consists of two fields: `text` is the document body, and `title` is the news headline.

## License

This data is licensed by the Rossiya Segodnya news agency ([ria.ru](http://ria.ru)) under the CC-BY-ND-NC license. The license text can be accessed [here](./LICENSE). The Russian version of the same license can be accessed [here](./LICENSE.ru).

## Misc

If you use the data in research, please consider citing the paper mentioned above:

    @inproceedings{gavrilov2018self,
      title={Self-Attentive Model for Headline Generation},
      author={Gavrilov, Daniil and Kalaidin, Pavel and Malykh, Valentin},
      booktitle={Proceedings of the 41st European Conference on Information Retrieval},
      year={2019}
    }
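
As a minimal sketch of the dataset format described in the Description section (one JSON document per row, with `text` and `title` fields), the files could be read in Python roughly as follows. The helper names `iter_documents` and `first_titles` are hypothetical and not part of the repository. The sketch assumes UTF-8 encoding, and it assumes that a file name ending in `.gz` means the file is gzip-compressed.

```python
import gzip
import json
from typing import Dict, Iterator, List


def iter_documents(path: str) -> Iterator[Dict[str, str]]:
    """Yield one parsed JSON document per non-empty row of the file.

    Hypothetical helper: assumes UTF-8 text and one JSON object per line.
    """
    # Assumption: a ".gz" suffix means gzip compression; anything else is plain text.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank rows rather than failing on them
            yield json.loads(line)


def first_titles(path: str, limit: int = 5) -> List[str]:
    """Return the `title` field of the first `limit` documents."""
    titles = []
    for index, document in enumerate(iter_documents(path)):
        if index >= limit:
            break
        titles.append(document["title"])
    return titles


if __name__ == "__main__":
    # Assumes ria_20.json from this repository is in the working directory.
    for title in first_titles("ria_20.json", limit=3):
        print(title)
```

Reading row by row through a generator keeps memory use flat, which matters for the full `ria.json.gz` file far more than for the 20- and 1000-document samples.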
