Pecan: Cost-Efficient ML Data Preprocessing with Automatic Transformation Ordering and Hybrid Placement
METADATA ONLY
Author / Producer
Date
2024
Publication Type
Conference Paper
ETH Bibliography
yes
Data
Rights / License
Abstract
Input data preprocessing is a common bottleneck in machine learning (ML) jobs that can significantly increase training time and cost as expensive GPUs or TPUs idle waiting for input data. Previous work has shown that offloading data preprocessing to remote CPU servers successfully alleviates data stalls and improves training time. However, remote CPU workers in disaggregated data processing systems comprise a significant fraction of total training costs. Meanwhile, current disaggregated solutions often underutilize CPU and DRAM resources available on ML accelerator nodes. We propose two approaches to alleviate ML input data stalls while minimizing costs. First, we dynamically schedule data preprocessing workers on ML accelerator host resources to minimize the number of remote CPU workers needed to achieve peak data ingestion bandwidth. Second, we analyze the characteristics of input pipelines and automatically reorder transformations to increase data preprocessing worker throughput. We observe that relaxing commutativity increases throughput while maintaining high model accuracy for a variety of ML data pipelines. We build Pecan, an ML data preprocessing service that automates data preprocessing worker placement and transformation reordering decisions. Pecan reduces preprocessing costs by 87% on average and total training costs by up to 60% compared to training with state-of-the-art disaggregated data preprocessing, and reduces total training costs by 55% on average compared to collocated data preprocessing.
Permanent link
Publication status
published
Book title
ATC'24: Proceedings of the 2024 USENIX Annual Technical Conference
Journal / series
Volume
Pages / Article No.
649 - 665
Publisher
USENIX Association
Event
USENIX Annual Technical Conference (ATC 2024)
Edition / version
Methods
Software
Geographic location
Date collected
Date created
Subject
Organisational unit
09683 - Klimovic, Ana / Klimovic, Ana
Notes
Funding
204620 - MLin: Machine Learning Input Data Processing as a Service (SNF)