MM-COVID

Category: Biomedical Technology
Development tool: Python
File size: 1703KB
Downloads: 0
Upload date: 2023-05-14 08:20:22
Uploader: sh-1993
Description: MM-COVID, a cross-lingual COVID-19 fake news dataset
(MM-COVID, Cross-Lingual COVID-19 Fake News Dataset)

File list (size in bytes, date):
.idea (0, 2021-03-21)
.idea\pythonProject4.iml (284, 2021-03-21)
FakeNewsCrawler.py (29870, 2021-03-21)
NewsCrawler (0, 2021-03-21)
NewsCrawler\ArchiveCrawler.py (5145, 2021-03-21)
NewsCrawler\FakeNewsRules.py (7042, 2021-03-21)
NewsCrawler\FakeRulesFactCheck.py (43046, 2021-03-21)
NewsCrawler\NewsCrawler.py (285, 2021-03-21)
NewsCrawler\RealRules.py (46516, 2021-03-21)
NewsCrawler\__init__.py (0, 2021-03-21)
TwitterCrawlerHelper.py (16960, 2021-03-21)
figure (0, 2021-03-21)
figure\pipeline.png (1732928, 2021-03-21)
project.config (457, 2021-03-21)
requirements.txt (821, 2021-03-21)
util (0, 2021-03-21)
util\Constants.py (641, 2021-03-21)
util\CrawlerUtil.py (3342, 2021-03-21)
util\TwarcConnector.py (4821, 2021-03-21)
util\Util.py (17861, 2021-03-21)
util\__init__.py (0, 2021-03-21)

# MM-COVID

[Multilingual and Multimodal COVID-19 Fake News Dataset](https://arxiv.org/abs/2011.04088)

## Data Structure

The data is stored on [Google Drive](https://drive.google.com/drive/folders/1gd4AvT6BxPRtymmNd9Z7ukyaVhae5s7U?usp=sharing):

- news_collection.json: stores the fact-checking information, news content, and news labels
- news_tweet_relation.json: stores the discussion of the news content on Twitter
- tweet_tweet_relation.json: stores the retweets and, recursively, the replies to the tweets

Due to Twitter privacy concerns, we only provide the tweet IDs; you can use [Twarc](https://github.com/DocNow/twarc) to **hydrate** these tweet IDs.

## Crawling Pipeline

This code stores the data in MongoDB, so you should install MongoDB before running the code. The main file is __FakeNewsCrawler.py__, and its pipeline is as follows:

![pipeline](./figure/pipeline.png)

#### Workflow

1. Use the crawler to get all the fake news from the fact-checking server.
2. Fetch the HTML page of the source provided in each article, parse it, and extract the article's "title".
3. Using the title fetched in the previous step and Twitter's advanced search, collect matching tweets via web scraping.
4. For every tweet related to fake news, get the favorites, replies, and retweets associated with it.
5. For all the users who posted those fake-news tweets, gather social network information such as followers and followees.

## Installation

### Requirements

Credits to [FakeNewsNet](https://github.com/KaiDMML/FakeNewsNet).

- MongoDB setup: https://docs.mongodb.com/tutorials/install-mongodb-on-ubuntu/
- Firefox driver (Geckodriver) installation: https://askubuntu.com/questions/870530/how-to-install-geckodriver-in-ubuntu

The data download scripts are written in Python and require Python 3.6+ to run. Twitter API keys are used for collecting data from Twitter.
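The keys are read from tweet_keys_file.json (see the setup notes). As an illustration only, a keys file with the attribute names this README lists might look like the following; all values are placeholders:

```json
[
  {
    "app_key": "YOUR_CONSUMER_KEY",
    "app_secret": "YOUR_CONSUMER_SECRET",
    "oauth_token": "YOUR_ACCESS_TOKEN",
    "oauth_token_secret": "YOUR_ACCESS_TOKEN_SECRET"
  }
]
```

Since the file holds an array, additional key objects can presumably be appended to spread requests across several Twitter apps.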
To get Twitter API keys, follow https://developer.twitter.com/en/docs/basics/authentication/guides/access-tokens.html. The scripts read keys from the tweet_keys_file.json file located in the code/resources folder, so the API keys need to be updated in tweet_keys_file.json. Provide the keys as an array of JSON objects with the attributes app_key, app_secret, oauth_token, and oauth_token_secret, as shown in the sample file.

Install all the libraries in requirements.txt using the following command:

```shell script
pip install -r requirements.txt
```

### Running Code

To collect the dataset quickly, the code uses process parallelism and synchronizes the Twitter key rate limits across multiple Python processes.

```shell script
nohup python FakeNewsCrawler.py
```

## References

If you use this dataset, please cite the following paper:

```
@misc{li2020mmcovid,
  title={MM-COVID: A Multilingual and Multimodal Data Repository for Combating COVID-19 Disinformation},
  author={Yichuan Li and Bohan Jiang and Kai Shu and Huan Liu},
  year={2020},
  eprint={2011.04088},
  archivePrefix={arXiv},
  primaryClass={cs.SI}
}
```

If you have any questions about this dataset, please contact Yichuan Li (yli29@wpi.edu).
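Finally, the tweet-ID **hydration** mentioned under Data Structure can be sketched with Twarc. This is a minimal illustration, not code from this repository: the `chunked` helper reflects Twitter's 100-IDs-per-lookup request limit, and the key names follow the tweet_keys_file.json attributes described above.

```python
def chunked(ids, size=100):
    """Split tweet IDs into batches; Twitter's statuses/lookup
    endpoint accepts at most 100 IDs per request."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def hydrate(tweet_ids, keys):
    """Yield full tweet objects for bare tweet IDs using Twarc's v1 client."""
    from twarc import Twarc  # pip install twarc

    client = Twarc(keys["app_key"], keys["app_secret"],
                   keys["oauth_token"], keys["oauth_token_secret"])
    # Twarc.hydrate takes an iterable of IDs and handles batching and
    # rate limiting internally, so chunked() is only needed if you call
    # the REST endpoint yourself.
    yield from client.hydrate(tweet_ids)
```

With a list of IDs taken from tweet_tweet_relation.json and one key set loaded from tweet_keys_file.json, iterating over `hydrate(ids, keys)` yields the full tweet JSON objects.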
