Understanding Stereotypes in Language Models: Towards Robust Measurement and Zero-Shot Debiasing
Open access
Date
2022-12-22
Type
Working Paper
ETH Bibliography
yes
Abstract
Generated texts from large pretrained language models have been shown to exhibit a variety of harmful, human-like biases about various demographics. These findings prompted large-scale efforts to understand and measure such effects, with the goal of providing benchmarks that can guide the development of techniques for mitigating these stereotypical associations. However, as recent research has pointed out, the current benchmarks lack a robust experimental setup, hindering the inference of meaningful conclusions from their evaluation metrics. In this paper, we extend these arguments and demonstrate that existing techniques and benchmarks aiming to measure stereotypes tend to be inaccurate and contain a high degree of experimental noise that severely limits the knowledge we can gain from benchmarking language models based on them. Accordingly, we propose a new framework for robustly measuring and quantifying biases exhibited by generative language models. Finally, we use this framework to investigate GPT-3's occupational gender bias and propose prompting techniques for mitigating these biases without the need for fine-tuning.
Permanent link
https://doi.org/10.3929/ethz-b-000653515
Publication status
published
Journal / series
arXiv
Pages / Article No.
Publisher
Cornell University
Edition / version
v1
Subject
Computation and Language (cs.CL); Machine Learning (cs.LG); FOS: Computer and information sciences
Organisational unit
09684 - Sachan, Mrinmaya / Sachan, Mrinmaya