Towards A Platform and Benchmark Suite for Model Training on Dynamic Datasets
OPEN ACCESS
Date
2023-05
Publication Type
Conference Paper
ETH Bibliography
yes
Abstract
Machine learning (ML) is often applied in use cases where training data evolves and/or grows over time. To maintain high model quality, training must incorporate these data changes; however, doing so is often challenging and expensive because datasets and models are large. In contrast, ML researchers often train and evaluate ML models on static datasets or with artificial assumptions about data dynamics. This gap between research and practice is largely due to (i) the absence of an open-source platform that manages dynamic datasets at scale and supports pluggable policies for when and what data to train on, and (ii) the lack of representative open-source benchmarks for ML training on dynamic datasets. To address this gap, we propose to design a platform that enables ML researchers and practitioners to explore training and data selection policies, while alleviating the burdens of managing large dynamic datasets and orchestrating recurring training jobs. We also propose to build an accompanying benchmark suite that integrates public dynamic datasets and ML models from a variety of representative use cases.
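As a rough illustration of what "pluggable policies for when and what data to train on" could look like, the following minimal Python sketch separates a trigger policy (when to retrain) from a data selection policy (what to train on). It is not taken from the paper or its platform: the names `Sample`, `Orchestrator`, `every_n_samples`, and `most_recent` are hypothetical, and the training step is a stand-in callback.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Sample:
    """A single training example arriving over time (hypothetical schema)."""
    key: int
    timestamp: int
    label: int


# Hypothetical policy signatures, assumed for this sketch only.
TriggerPolicy = Callable[[List[Sample]], bool]             # when to train
SelectionPolicy = Callable[[List[Sample]], List[Sample]]   # what to train on


def every_n_samples(n: int) -> TriggerPolicy:
    """Trigger a training run once n new samples have accumulated."""
    return lambda pending: len(pending) >= n


def most_recent(k: int) -> SelectionPolicy:
    """Select the k samples with the largest timestamps."""
    return lambda seen: sorted(seen, key=lambda s: s.timestamp)[-k:]


@dataclass
class Orchestrator:
    """Feeds incoming samples to the policies and starts training runs."""
    trigger: TriggerPolicy
    select: SelectionPolicy
    train: Callable[[List[Sample]], None]  # stand-in for a real training job
    seen: List[Sample] = field(default_factory=list)
    pending: List[Sample] = field(default_factory=list)

    def ingest(self, sample: Sample) -> None:
        # Record the sample, then ask the trigger policy whether to retrain.
        self.seen.append(sample)
        self.pending.append(sample)
        if self.trigger(self.pending):
            self.train(self.select(self.seen))
            self.pending.clear()
```

Under these assumptions, swapping `every_n_samples` for a time-based or drift-based trigger, or `most_recent` for another selection strategy, would not require changes to the orchestration loop.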
Publication status
published
Book title
EuroMLSys '23: Proceedings of the 3rd Workshop on Machine Learning and Systems
Pages / Article No.
8 - 17
Publisher
Association for Computing Machinery
Event
3rd Workshop on Machine Learning and Systems (EuroMLSys 2023)
Organisational unit
09683 - Klimovic, Ana / Klimovic, Ana
Funding
204620 - MLin: Machine Learning Input Data Processing as a Service (SNF)