File list (size in bytes, date):
.DS_Store (6148, 2020-09-26)
LICENSE (11357, 2020-09-26)
annotation_guideline (0, 2020-09-26)
annotation_guideline\.DS_Store (6148, 2020-09-26)
annotation_guideline\Annotation_GuidelineAMultiPlatformArabicNewsC.html (16967, 2020-09-26)
annotation_guideline\annotation_guideline.pdf (108764, 2020-09-26)
data (0, 2020-09-26)
data\.DS_Store (6148, 2020-09-26)
data\Arabic_offensive_comment_detection_annotation_4000_selected.xlsx (551827, 2020-09-26)

# Arabic Offensive Comments Dataset from Multiple Social Media Platforms

This release includes an annotated social media comment dataset with offensive / not-offensive (OFF vs NOT_OFF) language tags for Arabic social media comments collected from three online platforms: Twitter, Facebook and YouTube. The dataset is referred to as the Multi-Platform Offensive Language Dataset **(MPOLD)**. In addition to the binary offensive labels, the offensive comments are manually annotated to analyse the distribution of hate speech (HS) and vulgar but not hateful (V) content.

## Annotation Guidelines and Procedure

The annotation of the collected dataset was obtained using Amazon Mechanical Turk (AMT). To ensure annotation quality and language proficiency, we used two different criteria to evaluate the annotators. For more details, check the paper below.

The guideline is available in the `annotation_guideline` folder and can also be cited using the paper below.

```
@inproceedings{chowdhury2020offensive,
  title={A Multi-Platform Arabic News Comment Dataset for Offensive Language Detection},
  author={Chowdhury, Shammur Absar and Mubarak, Hamdy and Abdelali, Ahmed and Jung, Soon-gyo and Jansen, Bernard J and Salminen, Joni},
  booktitle={Proceedings of the International Conference on Language Resources and Evaluation (LREC'20)},
  year={2020}
}
```

## Sample Data Reader

[Data Reader](https://colab.research.google.com/drive/1vFfgdVlqsCvQnTT1LYj1K-EfIdBObtVG?usp=sharing)

## Data Format

The input file contains the following fields, separated by tabs:

* `Id`: the comment id (series)
* `Platform`: the origin of the social media comment. Values include: Twitter, Facebook, YouTube
* `Comment`: the raw comment (anonymised for user IDs and some URLs)
* `Majority_Label`: binary label, Non-Offensive or Offensive. This is the final label agreed on by at least 2 (out of 3) annotators
* `Agreement`: indicates whether there was 100% agreement between the annotators or whether the label was assigned by majority voting (2 out of 3 annotators agreed)
* `NumOfJudgementUsed`: how many annotators' judgements were used for the majority consensus
* `Total_Judgement`: total number of judgements obtained from MTurk (this number also includes rejected assignments)
* `Vulgar:V/HateSpeech:HS/None:-`: the expert's further classification of the offensive comments, indicating whether a comment is hate speech (HS), vulgar (V), or just offensive (-). Note that hate speech can also be vulgar; V marks comments that contain vulgar language but are not hate speech

The first 500 comments were also manually checked to evaluate the performance of the annotation. Using the expert annotation as reference, we observed that the accuracy of the crowdsourced annotation is ~94%. The orange-coloured rows are those marked by the expert as annotation errors.
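For a quick look at the data outside the Colab notebook, below is a minimal reader sketch in Python with pandas. The file path is the spreadsheet shipped in the `data` folder; the column names and label strings (e.g. `Offensive`) follow the field description above and may need adjusting if the actual spreadsheet headers differ.

```python
# Minimal sketch of loading MPOLD and summarising its labels.
# Assumes the column names from the "Data Format" section; the official
# reader is the Colab notebook linked under "Sample Data Reader".
import pandas as pd

DATA_PATH = "data/Arabic_offensive_comment_detection_annotation_4000_selected.xlsx"

# Reading .xlsx files with pandas requires the openpyxl engine to be installed.
df = pd.read_excel(DATA_PATH)

# Distribution of the binary majority label per platform (Twitter/Facebook/YouTube).
print(df.groupby(["Platform", "Majority_Label"]).size())

# Offensive comments further classified by the expert as
# hate speech (HS), vulgar (V), or just offensive (-).
offensive = df[df["Majority_Label"] == "Offensive"]
print(offensive["Vulgar:V/HateSpeech:HS/None:-"].value_counts())
```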
## Other Resources

In addition to the dataset described above, we also studied the use of this dataset individually and in combination with other freely available datasets for the study of offensive language in Arabic. The resources used in the study are:

* For out-of-domain evaluation:

  [Egyptian Arabic dialect data](http://alt.qcri.org/~hmubarak/offensive/TweetClassification-Summary.xlsx)

  ```
  Mubarak, H., Darwish, K., and Magdy, W. (2017). Abusive language detection on Arabic social media.
  In Proceedings of the First Workshop on Abusive Language Online, pages 52–56.
  ```

  [Levantine Hate Speech and Abusive (L-HSAB) dataset](https://github.com/Hala-Mulki/L-HSAB-First-Arabic-Levantine-HateSpeech-Dataset)

  ```
  Mulki, H., Haddad, H., Ali, C. B., and Alshabani, H. (2019). L-HSAB: A Levantine Twitter dataset for hate speech and abusive language.
  In Proceedings of the Third Workshop on Abusive Language Online, pages 111–118.
  ```

* For additional cross-platform evaluation:

  [Deleted Comments Dataset](http://alt.qcri.org/~hmubarak/offensive/AJCommentsClassification-CF.xlsx)

  ```
  Mubarak, H., Darwish, K., and Magdy, W. (2017). Abusive language detection on Arabic social media.
  In Proceedings of the First Workshop on Abusive Language Online, pages 52–56.
  ```

While using the dataset, cite:

```
@inproceedings{chowdhury2020offensive,
  title={A Multi-Platform Arabic News Comment Dataset for Offensive Language Detection},
  author={Chowdhury, Shammur Absar and Mubarak, Hamdy and Abdelali, Ahmed and Jung, Soon-gyo and Jansen, Bernard J and Salminen, Joni},
  booktitle={Proceedings of the International Conference on Language Resources and Evaluation (LREC'20)},
  year={2020}
}
```
