Towards A Platform and Benchmark Suite for Model Training on Dynamic Datasets


Date

2023-05

Publication Type

Conference Paper

ETH Bibliography

yes

Abstract

Machine learning (ML) is often applied in use cases where training data evolves and/or grows over time. Training must incorporate data changes to maintain high model quality; however, this is often challenging and expensive because of large datasets and models. In contrast, ML researchers often train and evaluate ML models on static datasets or with artificial assumptions about data dynamics. This gap between research and practice is largely due to (i) the absence of an open-source platform that manages dynamic datasets at scale and supports pluggable policies for when and what data to train on, and (ii) the lack of representative open-source benchmarks for ML training on dynamic datasets. To address this gap, we propose to design a platform that enables ML researchers and practitioners to explore training and data selection policies, while alleviating the burdens of managing large dynamic datasets and orchestrating recurring training jobs. We also propose to build an accompanying benchmark suite that integrates public dynamic datasets and ML models from a variety of representative use cases.

Publication status

published

Book title

EuroMLSys '23: Proceedings of the 3rd Workshop on Machine Learning and Systems

Pages / Article No.

8 - 17

Publisher

Association for Computing Machinery

Event

3rd Workshop on Machine Learning and Systems (EuroMLSys 2023)

Organisational unit

09683 - Klimovic, Ana / Klimovic, Ana

Funding

204620 - MLin: Machine Learning Input Data Processing as a Service (SNF)
