Pecan: Cost-Efficient ML Data Preprocessing with Automatic Transformation Ordering and Hybrid Placement


METADATA ONLY

Author / Producer

Date

2024

Publication Type

Conference Paper

ETH Bibliography

yes

Abstract

Input data preprocessing is a common bottleneck in machine learning (ML) jobs that can significantly increase training time and cost as expensive GPUs or TPUs idle waiting for input data. Previous work has shown that offloading data preprocessing to remote CPU servers successfully alleviates data stalls and improves training time. However, remote CPU workers in disaggregated data processing systems comprise a significant fraction of total training costs. Meanwhile, current disaggregated solutions often underutilize CPU and DRAM resources available on ML accelerator nodes. We propose two approaches to alleviate ML input data stalls while minimizing costs. First, we dynamically schedule data preprocessing workers on ML accelerator host resources to minimize the number of remote CPU workers needed to achieve peak data ingestion bandwidth. Second, we analyze the characteristics of input pipelines and automatically reorder transformations to increase data preprocessing worker throughput. We observe that relaxing commutativity increases throughput while maintaining high model accuracy for a variety of ML data pipelines. We build Pecan, an ML data preprocessing service that automates data preprocessing worker placement and transformation reordering decisions. Pecan reduces preprocessing costs by 87% on average and total training costs by up to 60% compared to training with state-of-the-art disaggregated data preprocessing, and total training costs by 55% on average compared to collocated data preprocessing.

Publication status

published

Book title

ATC'24: Proceedings of the 2024 USENIX Annual Technical Conference

Pages / Article No.

649 - 665

Publisher

USENIX Association

Event

USENIX Annual Technical Conference (ATC 2024)

Organisational unit

09683 - Klimovic, Ana / Klimovic, Ana

Funding

204620 - MLin: Machine Learning Input Data Processing as a Service (SNF)
