
Open access
Date
2022-07
Type
Conference Paper
Abstract
Notwithstanding the widely held view that data generation and data curation processes are prominent sources of bias in machine learning algorithms, there is little empirical research seeking to document and understand the specific data dimensions affecting algorithmic unfairness. In contrast to previous work, which has focused on modeling with simple, small-scale benchmark datasets, we hold the model constant and methodically intervene on relevant dimensions of a much larger, more diverse dataset. For this purpose, we introduce a new dataset on recidivism in 1.5 million criminal cases from courts in the U.S. state of Wisconsin, 2000-2018. From this main dataset, we generate multiple auxiliary datasets to simulate different kinds of biases in the data. Focusing on algorithmic bias toward different race/ethnicity groups, we assess the relevance of training data size, base rate differences between groups, representation of groups in the training data, temporal aspects of data curation, the inclusion of race/ethnicity or neighborhood characteristics as features, and the training of separate classifiers by race/ethnicity or crime type. We find that these factors often do influence fairness metrics, holding the classifier specification constant, without having a corresponding effect on accuracy metrics. The methodology and results in the paper provide a useful reference point for a data-centric approach to studying algorithmic fairness in recidivism prediction and beyond.
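
As a rough illustration of the data-centric protocol the abstract describes, the sketch below (Python; not the authors' code, and all data, feature names, and group labels are synthetic assumptions) holds a fixed classifier specification and intervenes on a single training data dimension, the representation of one group, while tracking overall accuracy and a simple fairness metric (the false positive rate gap between groups).

# Minimal sketch of a data-centric fairness experiment: fix the classifier,
# vary one training data dimension, and compare accuracy vs. a fairness metric.
# Everything here (data, base rates, group coding) is a synthetic assumption.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_synthetic(n, base_rate_a=0.3, base_rate_b=0.5):
    """Toy data: two groups with different recidivism base rates."""
    group = rng.integers(0, 2, n)                       # 0 = group A, 1 = group B
    x = rng.normal(size=(n, 5)) + group[:, None] * 0.3  # weakly group-correlated features
    p = np.where(group == 0, base_rate_a, base_rate_b)
    y = rng.binomial(1, p)
    return x, y, group

def fpr_gap(y_true, y_pred, group):
    """Absolute difference in false positive rates between the two groups."""
    fprs = []
    for g in (0, 1):
        negatives = (group == g) & (y_true == 0)
        fprs.append(y_pred[negatives].mean())
    return abs(fprs[0] - fprs[1])

x_test, y_test, g_test = make_synthetic(20_000)

# Intervention: vary the share of group A retained in the training data,
# holding the classifier specification constant.
for keep_share_a in (1.0, 0.5, 0.1):
    x_tr, y_tr, g_tr = make_synthetic(100_000)
    drop = (g_tr == 0) & (rng.random(len(g_tr)) > keep_share_a)
    x_tr, y_tr = x_tr[~drop], y_tr[~drop]

    clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    y_hat = clf.predict(x_test)
    print(f"share of group A kept={keep_share_a:.1f}  "
          f"accuracy={accuracy_score(y_test, y_hat):.3f}  "
          f"FPR gap={fpr_gap(y_test, y_hat, g_test):.3f}")

The other interventions listed in the abstract (training data size, base rate differences, feature inclusion, separate classifiers by group or crime type) fit the same loop by swapping out the subsampling step.
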
Permanent link
https://doi.org/10.3929/ethz-b-000570271
Publication status
published
Book title
AIES '22: Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society
Publisher
Association for Computing Machinery
Subject
Algorithmic Fairness; Datasets; Recidivism Prediction; Machine Learning
Organisational unit
09627 - Ash, Elliott / Ash, Elliott