Abstract
Data and code working together is fundamental to machine learning (ML), but the context around datasets and interactions between datasets and code are in general captured only rudimentarily. Context such as how the dataset was prepared and created, what source data were used, what code was used in processing, how the dataset evolved, and where it has been used and reused can provide much insight, but this information is often poorly documented. That is unfortunate since it makes datasets into black boxes with potentially hidden characteristics that have downstream consequences. We argue that making dataset preparation more accessible and dataset usage easier to record and document would have significant benefits for the ML community: it would allow for greater diversity in datasets by inviting modification to published sources, simplify use of alternative datasets and, in doing so, make results more transparent and robust, while allowing for all contributions to be adequately credited. We present a platform, Renku, designed to support and encourage such sustainable development and use of data, datasets, and code, and we demonstrate its benefits through a few illustrative projects which span the spectrum from dataset creation to dataset consumption and showcasing.
Publication status
published
Book title
Advances in Neural Information Processing Systems 36
Publisher
Curran
Organisational unit
02286 - Swiss Data Science Center (SDSC) / Swiss Data Science Center (SDSC)
Notes
Datasets and Benchmarks Track.
ETH Bibliography
yes