Cultural Bias in Hate Speech Datasets

Preprocessed corpora and code for the paper "From Languages to Geographies: Towards Evaluating Cultural Bias in Hate Speech Datasets".

Preprocessed Corpora

The preprocessed corpora for the eight languages we focus on in the paper are available on the 🤗 Hugging Face Hub.

Geocoding Code

The Python code to interact with the Google Geocoding API is in google_geocoding_api.ipynb.
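
For readers who want a feel for what such an interaction looks like before opening the notebook, here is a minimal sketch. It is not taken from the notebook: the function names (`build_geocode_url`, `extract_country`, `geocode_country`) are hypothetical, the API key is a placeholder you must supply, and the choice to keep only the country component is an assumption based on the paper's use of country-level information. It relies only on the public Geocoding API JSON endpoint and the Python standard library.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public JSON endpoint of the Google Geocoding API.
GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(location, api_key):
    """Return the request URL for a free-text location string."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": location, "key": api_key})


def extract_country(payload):
    """Return the short name of the first country component found, or None.

    For countries, the short name is typically a two-letter country code.
    """
    if payload.get("status") != "OK":
        return None
    for result in payload.get("results", []):
        for component in result.get("address_components", []):
            if "country" in component.get("types", []):
                return component.get("short_name")
    return None


def geocode_country(location, api_key):
    """Query the API (network call) and return the country for a location."""
    with urlopen(build_geocode_url(location, api_key)) as response:
        return extract_country(json.load(response))
```

Calling `geocode_country("Mexico City", "YOUR_API_KEY")` sends one request and returns the country of the best match, or `None` when the API reports no result. Splitting URL construction and response parsing from the network call keeps both easy to check offline.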

Citation

If you find our work useful, please cite:

@inproceedings{tonneau-etal-2024-languages,
    title = "From Languages to Geographies: Towards Evaluating Cultural Bias in Hate Speech Datasets",
    author = {Tonneau, Manuel  and
      Liu, Diyi  and
      Fraiberger, Samuel  and
      Schroeder, Ralph  and
      Hale, Scott  and
      R{\"o}ttger, Paul},
    editor = {Chung, Yi-Ling  and
      Talat, Zeerak  and
      Nozza, Debora  and
      Plaza-del-Arco, Flor Miriam  and
      R{\"o}ttger, Paul  and
      Mostafazadeh Davani, Aida  and
      Calabrese, Agostina},
    booktitle = "Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.woah-1.23",
    pages = "283--311",
    abstract = "Perceptions of hate can vary greatly across cultural contexts. Hate speech (HS) datasets, however, have traditionally been developed by language. This hides potential cultural biases, as one language may be spoken in different countries home to different cultures. In this work, we evaluate cultural bias in HS datasets by leveraging two interrelated cultural proxies: language and geography. We conduct a systematic survey of HS datasets in eight languages and confirm past findings on their English-language bias, but also show that this bias has been steadily decreasing in the past few years. For three geographically-widespread languages{---}English, Arabic and Spanish{---}we then leverage geographical metadata from tweets to approximate geo-cultural contexts by pairing language and country information. We find that HS datasets for these languages exhibit a strong geo-cultural bias, largely overrepresenting a handful of countries (e.g., US and UK for English) relative to their prominence in both the broader social media population and the general population speaking these languages. Based on these findings, we formulate recommendations for the creation of future HS datasets.",
}
