CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks

Open access
Date
2020
Type
- Working Paper
ETH Bibliography
yes
Abstract
Data quality affects machine learning (ML) model performance, and data scientists spend a considerable amount of time on data cleaning before model training. However, to date, there is no rigorous study of how exactly cleaning affects ML: the ML community usually focuses on developing ML algorithms that are robust to particular noise types of certain distributions, while the database (DB) community has mostly studied the problem of data cleaning alone, without considering how data is consumed by downstream ML analytics. We propose the CleanML study, which systematically investigates the impact of data cleaning on ML classification tasks. The open-source and extensible CleanML study currently includes 14 real-world datasets with real errors, five common error types, seven different ML models, and multiple cleaning algorithms for each error type (including both algorithms commonly used in practice and state-of-the-art solutions from the academic literature). We control the randomness in ML experiments using statistical hypothesis testing, and we control the false discovery rate in our experiments using the Benjamini-Yekutieli (BY) procedure. We analyze the results systematically to derive many interesting and nontrivial observations. We also put forward multiple research directions for researchers.
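The abstract mentions controlling the false discovery rate with the Benjamini-Yekutieli (BY) procedure. As a rough illustration of what that step involves, the sketch below implements the standard BY step-up rule in plain Python. It is not taken from the CleanML code base: the function name `benjamini_yekutieli`, the default `alpha`, and the example p-values are all assumptions made for illustration.

```python
def benjamini_yekutieli(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True if it is rejected under BY.

    Hypothetical helper for illustration; not the CleanML implementation.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Harmonic-sum correction c(m) = 1 + 1/2 + ... + 1/m, which keeps the
    # procedure valid under arbitrary dependence between the tests.
    c_m = sum(1.0 / i for i in range(1, m + 1))

    # Indices of the hypotheses, ordered by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Step-up rule: find the largest rank k with p_(k) <= k * alpha / (m * c(m)).
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            k_max = rank

    # Reject the k_max hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Made-up p-values, e.g. one per "cleaning method vs. no cleaning" comparison.
example_p_values = [0.001, 0.2, 0.03, 0.0005]
print(benjamini_yekutieli(example_p_values))  # [True, False, False, True]
```

With four tests and alpha = 0.05, the per-rank thresholds are 0.006, 0.012, 0.018, and 0.024, so only the two smallest p-values are declared significant. An uncorrected 0.05 cutoff would also have accepted 0.03.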
Permanent link
https://doi.org/10.3929/ethz-b-000444041
Publication status
published
Publisher
ETH Zurich, Institute for Computing Platforms
Subject
data cleaning; machine learning; classification; robust ML
Organisational unit
09588 - Zhang, Ce / Zhang, Ce
02120 - Dep. Management, Technologie und Ökon. / Dep. of Management, Technology, and Ec.
Funding
184628 - EASEML: Toward a More Accessible and Usable Machine Learning Platform for Non-expert Users (SNF)
187132 - Machine‐based Scoring of a Neuropsychological Test: The Rey‐Osterrieth Complex Figure (SNF)
957407 - Integrated Data Analysis Pipelines for Large-Scale Data Management, HPC, and Machine Learning (EC)
Related publications and datasets
Is original form of: http://hdl.handle.net/20.500.11850/501911