## Data description

Identity-based hate speech corpora are available here, as described in the paper, ["How Hate Speech Varies by Target Identity: A Computational Analysis"][10], presented at CoNLL 2022.

### Sampled data

Hate speech corpora based on identities, demographic category, and power (marginalized/dominant) are available in the `sampled` folder. These corpora are sampled from 4 publicly available hate speech datasets with annotations for target identity:

1. Civil Comments ([source][1], [paper][2])
2. Social Bias Inference Corpus ([source][3], [paper][4])
3. Kennedy et al. 2020 ([source][5], [paper][6])
4. HateXplain ([source][7], [paper][8])

Each corpus is uniformly sampled across these datasets, allowing duplication of entries (see the paper for more info). These corpora are used to see how well hate speech classifiers generalize across identities, demographic categories, and relative social power.

**Note that there are duplicate entries in the data**. This is to have enough instances from each dataset and to reach a 30/70 hate/non-hate ratio.

#### Sampled data format

Files are in JSON lines (newline-delimited JSON) format. Fields:

- `grouping`: the specific identity grouping whose corpus this entry belongs to (a specific identity, or demographic category, or marginalized/dominant groups)
- `fold`: the fold of the corpus this entry belonged to (train or test) in generalization experiments in the CoNLL paper
- `text`: the preprocessed text of the comment, post, etc
- `target_groups`: a list of the identity groups targeted, from original source datasets
- `dataset`: original source dataset
- `hate`: binary value of hate speech or not, from original dataset

### Unsampled data

Unsampled, preprocessed data with annotations for identity groups is available in `unsampled_identity_hate_corpus.jsonl`. This data is from the following 7 publicly available hate speech datasets with annotations for target identity:

1. Civil Comments ([source][1], [paper][2])
2. Social Bias Inference Corpus ([source][3], [paper][4])
3. Kennedy et al. 2020 ([source][5], [paper][6])
4. HateXplain ([source][7], [paper][8])
5. Contextual Abuse Dataset ([source][11], [paper][12])
6. ElSherief et al. 2021 ([source][13], [paper][14])
7. Salminen et al. 2018 ([source][15], [paper][16])

Code used to preprocess and group and normalize identity annotations in this dataset is available [here][9].

#### Unsampled data format

This file is in JSON lines (newline-delimited JSON) format. Fields:

- `dataset`: original source dataset
- `text`: the preprocessed text of the comment, post, etc
- `target_groups`: a list of the identity groups targeted, normalized names from original source datasets
- `identity_groups`: coarser-grained identity groups, grouped from the list in `target_groups`
- `target_categories`: larger demographic category that corresponds to groups listed in `target_groups`
- `hate`: binary value of hate speech or not, from original dataset

[1]: https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data
[2]: https://dl.acm.org/doi/10.1145/3308560.3317593
[3]: http://tinyurl.com/social-bias-frames
[4]: https://aclanthology.org/2020.acl-main.486/
[5]: https://huggingface.co/datasets/ucberkeley-dlab/measuring-hate-speech
[6]: https://arxiv.org/abs/2009.10277
[7]: https://github.com/punyajoy/HateXplain
[8]: https://ojs.aaai.org/index.php/AAAI/article/view/17745
[9]: https://github.com/michaelmilleryoder/hate_speech_identities
[10]: https://aclanthology.org/2022.conll-1.3/
[11]: https://zenodo.org/record/4881008
[12]: https://aclanthology.org/2021.naacl-main.182/
[13]: https://github.com/SALT-NLP/implicit-hate
[14]: https://aclanthology.org/2021.emnlp-main.29/
[15]: https://www.dropbox.com/s/21wtzy9arc5skr8/ICWSM18%20-%20SALMINEN%20ET%20AL.xlsx?dl=0
[16]: https://ojs.aaai.org/index.php/ICWSM/article/view/15028
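
#### Example: reading the JSON lines files

The following is a minimal sketch of how the JSON lines files described above might be read, using only the Python standard library. It is not taken from the project's own code. The helper names `read_jsonl`, `hate_ratio`, and `count_by` are hypothetical, and the sketch assumes that `hate` is stored as `0`/`1` or a boolean and that `dataset` and `fold` hold plain string values; check a few lines of the actual files before relying on this.

```python
import json
from collections import Counter
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON lines file."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def hate_ratio(records):
    """Return the fraction of records labeled as hate.

    Assumption: `hate` is encoded as 0/1 or a boolean.
    """
    records = list(records)
    if not records:
        return 0.0
    return sum(1 for r in records if r["hate"]) / len(records)


def count_by(records, field):
    """Count records per value of a scalar field such as `dataset` or `fold`."""
    return Counter(r[field] for r in records)
```

For instance, `count_by(read_jsonl("unsampled_identity_hate_corpus.jsonl"), "dataset")` would give the number of entries per source dataset, and `hate_ratio` applied to one of the sampled corpora could be used to check the 30/70 hate/non-hate ratio mentioned above. Because the sampled corpora contain duplicate entries, these counts include duplicates.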