
Open access
Author
Date
2022
Type
- Doctoral Thesis
ETH Bibliography
yes
Abstract
As machine learning continues to become more ubiquitous in various areas of our lives, it will become impossible to imagine software development projects that do not involve some learned component. Consequently, an ever-increasing number of people are developing ML applications, which drives the need for better development tools and processes. Unfortunately, even though tremendous effort has been spent on building various systems for machine learning, the development experience is still far from what regular software engineers enjoy. This is mainly because current ML tooling is focused on solving specific problems and covers only a part of the development workflow. Furthermore, end-to-end integration of these various tools is still quite limited. This often leaves developers stuck without guidance as they try to make their way through a labyrinth of possible choices at each step.
This thesis aims to tackle the usability problem of modern machine learning systems. This involves taking a broader view that goes beyond the model training part of the ML workflow and developing a system for managing this workflow. This broader workflow includes the data preparation process, which comes before model training, as well as model management, which comes after. We seek to identify various pitfalls and pain points that developers encounter in these ML workflows. We then zoom in on one particular kind of usability pain point: labor-efficient data debugging. We pinpoint two categories of data errors (missing data and wrong data), and develop two methods for guiding the attention of the developer by helping them choose the instances of data errors that are the most important. We then empirically evaluate those methods in realistic data repair scenarios and demonstrate that they indeed improve the efficiency of the data debugging process, which in turn translates to greater usability. We conclude with some insights that could be applied to design more usable machine learning systems in the future.
Permanent link
https://doi.org/10.3929/ethz-b-000554603
Publication status
published
External links
Search print copy at ETH Library
Contributors
Examiner: Zhang, Ce
Examiner: Alonso, Gustavo
Examiner: Interlandi, Matteo
Examiner: Schelter, Sebastian
Publisher
ETH Zurich
Subject
data management; machine learning; Data-centric engineering
Organisational unit
09588 - Zhang, Ce / Zhang, Ce