Date
2024-04
Type
Conference Paper
Abstract
GPUs are critical for maximizing the throughput-per-Watt of deep neural network (DNN) applications. However, DNN applications often underutilize GPUs, even when using large batch sizes and eliminating input data processing or communication stalls. DNN workloads consist of data-dependent operators, with different compute and memory requirements. While an operator may saturate GPU compute units or memory bandwidth, it often leaves other GPU resources idle. Despite the prevalence of GPU sharing techniques, current approaches are not sufficiently fine-grained or interference-aware to maximize GPU utilization while minimizing interference at the granularity of 10s of μs. We propose Orion, a system that transparently intercepts GPU kernel launches from multiple clients sharing a GPU. Orion schedules work on the GPU at the granularity of individual operators and minimizes interference by taking into account each operator's compute and memory requirements. We integrate Orion in PyTorch and demonstrate its benefits in various DNN workload collocation use cases. Orion significantly improves tail latency compared to state-of-the-art baselines for a high-priority inference job while collocating best-effort inference jobs to increase per-GPU request throughput by up to 7.3×, or while collocating DNN training, saving up to 1.49× in training costs compared to dedicated GPU allocation.
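To make the abstract's scheduling idea concrete, the following is a minimal, hypothetical Python sketch of interference-aware co-scheduling: a best-effort kernel is submitted alongside a high-priority kernel only when the two stress different resources (compute vs. memory bandwidth). This is not Orion's actual implementation (which transparently intercepts GPU kernel launches); all names, fields, and thresholds here are illustrative assumptions.

# Illustrative sketch only, not Orion's code: a toy scheduler that
# co-schedules a best-effort kernel with a high-priority kernel only
# when their dominant resource demands are complementary.
from dataclasses import dataclass
from collections import deque

@dataclass
class Kernel:
    name: str
    compute_intensity: float  # hypothetical offline-profiled SM utilization (0..1)
    memory_intensity: float   # hypothetical offline-profiled DRAM bandwidth use (0..1)

    @property
    def dominant_resource(self) -> str:
        return "compute" if self.compute_intensity >= self.memory_intensity else "memory"

def schedule_step(hp_queue: deque, be_queue: deque) -> list:
    """Pick kernels to submit in one scheduling quantum.

    The high-priority kernel always runs. A best-effort kernel is
    co-scheduled only if it is bound by the *other* resource, so the
    two kernels contend less for the same GPU units.
    """
    submitted = []
    if hp_queue:
        hp = hp_queue.popleft()
        submitted.append(hp)
        if be_queue and be_queue[0].dominant_resource != hp.dominant_resource:
            submitted.append(be_queue.popleft())
    elif be_queue:
        # No high-priority work pending: best-effort kernels run freely.
        submitted.append(be_queue.popleft())
    return submitted

if __name__ == "__main__":
    hp = deque([Kernel("hp_conv2d", 0.9, 0.3)])            # compute-bound
    be = deque([Kernel("be_embedding_lookup", 0.2, 0.8)])  # memory-bound
    print([k.name for k in schedule_step(hp, be)])
    # -> ['hp_conv2d', 'be_embedding_lookup'] (complementary profiles co-scheduled)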
Publication status
published
External links
Book title
EuroSys '24: Proceedings of the Nineteenth European Conference on Computer Systems
Pages / Article No.
Publisher
Association for Computing Machinery
Event
Subject
Machine Learning; GPUs
Organisational unit
09683 - Klimovic, Ana
Funding
204620 - MLin: Machine Learning Input Data Processing as a Service (SNF)
Notes
Conference lecture held on April 25, 2024.