Prop-HiT Dataset Version 1.0
Version 1.0: November 18, 2023
About
Prop-HiT is a Propaganda Dataset for Hindi Text. It contains 790 articles from 32 Hindi news websites. The articles were manually annotated with the LightTag annotation tool for the following 18 propaganda techniques:
1. Appeal to authority
2. Appeal to fear/prejudice
3. Bandwagon
4. Black-and-white fallacy
5. Causal oversimplification
6. Doubt
7. Exaggeration/minimization
8. Flag-waving
9. Loaded Language
10. Name Calling or Labelling
11. Obfuscation, intentional vagueness, confusion
12. Red herring
13. Reductio ad Hitlerum
14. Repetition
15. Slogans
16. Straw man
17. Thought-terminating cliche
18. Whataboutism
Data format
The dataset consists of one plain-text file and one tab-separated (TSV) file per article. The text file contains the contents of the article. Each line of the TSV file describes one annotated propaganda technique with the following fields: article_id, technique, begin_offset, and end_offset.
The naming convention for the files is as follows:
- article[unique_id].txt for the plain-text file
- article[unique_id].labels.tsv for the annotation file
The articles are split into two subfolders: train, with 550 articles, and test, with 240 articles.
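As an illustration of the format described above, the following Python sketch loads one article together with its annotations. It is not part of the dataset distribution. The names Annotation, read_labels, and load_article are hypothetical, and the sketch assumes UTF-8 encoded files, the four fields in the order listed above, and integer offsets; lines that do not fit this shape (for example a header row, if one is present) are skipped.

```python
import csv
from pathlib import Path
from typing import List, NamedTuple, Tuple


class Annotation(NamedTuple):
    """One line of an article[unique_id].labels.tsv file (hypothetical helper type)."""
    article_id: str
    technique: str
    begin_offset: int
    end_offset: int


def read_labels(tsv_path: Path) -> List[Annotation]:
    """Parse a labels file into Annotation records.

    Assumption: fields appear in the order
    article_id, technique, begin_offset, end_offset.
    """
    annotations = []
    with open(tsv_path, encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 4:
                continue  # skip blank or incomplete lines
            article_id, technique, begin, end = row[:4]
            try:
                record = Annotation(article_id, technique, int(begin), int(end))
            except ValueError:
                continue  # skip lines with non-integer offsets, e.g. a header row
            annotations.append(record)
    return annotations


def load_article(folder: Path, unique_id: str) -> Tuple[str, List[Annotation]]:
    """Load the text and annotations of one article from a train or test folder."""
    text = (folder / f"article{unique_id}.txt").read_text(encoding="utf-8")
    labels = read_labels(folder / f"article{unique_id}.labels.tsv")
    return text, labels
```

If begin_offset and end_offset are character offsets into the text file with an exclusive end (an assumption that should be checked against a few annotated articles), the annotated fragment can be recovered with text[a.begin_offset:a.end_offset] for each annotation a.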
Credit
Please cite the dataset as:
[Prop-HiT] Deptii Chaudhari, Dr. Ambika Pawar. 2023. Prop-HiT: Propaganda Dataset for Hindi Text. https://doi.org/10.5281/zenodo.10155424
Authors
Deptii Chaudhari; Dr. Ambika Pawar