Metadata only
Date
2024-04
Type
- Conference Paper
ETH Bibliography
yes
Abstract
The widespread adoption of ML has led to high demand for GPU hardware and, consequently, severe GPU shortages in the public cloud. Allocating enough GPUs to train or fine-tune today's large ML models within a single cloud region is often difficult. Users can access more GPUs if they are willing to run an ML training job on devices spread across different geographical regions. However, such GPU nodes are connected by lower network bandwidth, and cloud providers charge extra for data transfers across geographical regions. In this work, we explore when and how it makes sense to leverage GPUs across zones and regions for distributed ML training. We analyze the throughput and cost impact of cross-region training based on the computation and communication patterns of different model parallelism strategies, develop a profile-based analytical model for estimating training throughput and cost, and provide guidelines for allocating geo-distributed resources efficiently. We find that although ML training throughput and cost with pure data parallelism degrade significantly when nodes span geographic regions, cross-region training with pipeline parallelism is practical.
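The abstract's core intuition can be illustrated with a back-of-the-envelope model. The following sketch is not the paper's profile-based analytical model; it is a minimal, hedged comparison of per-step cross-region traffic under data parallelism versus pipeline parallelism, with all parameter sizes, node counts, and egress prices chosen as illustrative assumptions.

```python
# Illustrative sketch (not the paper's model): compare per-step cross-region
# traffic for data parallelism (DP) vs pipeline parallelism (PP).
# All constants below are assumptions for illustration only.

def dp_bytes_per_step(param_bytes: float, nodes: int) -> float:
    """Ring all-reduce of gradients moves ~2*(n-1)/n * P bytes per node per step."""
    return 2 * (nodes - 1) / nodes * param_bytes

def pp_bytes_per_step(act_bytes: float, microbatches: int) -> float:
    """A pipeline-stage boundary moves activations forward and activation
    gradients backward once per micro-batch."""
    return 2 * microbatches * act_bytes

def egress_cost_usd(bytes_moved: float, usd_per_gb: float) -> float:
    """Cross-region egress cost at a flat per-GB rate (assumed pricing)."""
    return bytes_moved / 1e9 * usd_per_gb

# Assumed workload: 7B parameters in fp16 (~14 GB of gradients),
# 32 MB of boundary activations per micro-batch, 16 micro-batches, 8 nodes.
dp = dp_bytes_per_step(14e9, nodes=8)      # ~24.5 GB crosses regions per step
pp = pp_bytes_per_step(32e6, microbatches=16)  # ~1.0 GB per step

print(dp / pp)  # DP moves roughly 24x more cross-region data here
print(egress_cost_usd(dp, usd_per_gb=0.09))  # per-step egress cost under DP
```

Under these assumptions, data parallelism sends the full gradient volume across the slow, billed inter-region links every step, while pipeline parallelism only exchanges stage-boundary activations, which is consistent with the abstract's finding that cross-region pipeline parallelism remains practical.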
Publication status
published
Book title
EuroMLSys '24: Proceedings of the 4th Workshop on Machine Learning and Systems
Publisher
Association for Computing Machinery
Subject
Machine Learning; Cloud computing
Funding
204620 - MLin: Machine Learning Input Data Processing as a Service (SNF)