LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services


Date

2024

Publication Type

Conference Paper

ETH Bibliography

yes

Abstract

As Large Language Models (LLMs) rapidly grow in popularity, LLM inference services must be able to serve requests from thousands of users while satisfying performance requirements. The performance of an LLM inference service is largely determined by the hardware onto which it is deployed, yet understanding which hardware will deliver on those requirements remains challenging. In this work we present LLM-Pilot, a first-of-its-kind system for characterizing and predicting the performance of LLM inference services. LLM-Pilot benchmarks LLM inference services under a realistic workload across a variety of GPUs, and optimizes the service configuration for each considered GPU to maximize performance. Finally, using this characterization data, LLM-Pilot learns a predictive model which can be used to recommend the most cost-effective hardware for a previously unseen LLM. Compared to existing methods, LLM-Pilot can deliver on performance requirements 33% more frequently, whilst reducing costs by 60% on average.
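
The recommendation step of the abstract can be pictured with a minimal sketch: train a regressor on benchmark measurements from known LLM/GPU pairs, predict the performance of an unseen LLM on each candidate GPU, and pick the cheapest GPU whose predicted performance meets the requirement. Everything below (the feature set, the GPU prices, the throughput numbers, and the choice of a random-forest regressor) is an illustrative assumption, not LLM-Pilot's actual implementation.

    # A minimal sketch of hardware recommendation from characterization data.
    # All features, prices, and measurements below are hypothetical.
    from sklearn.ensemble import RandomForestRegressor

    # Assumed benchmark records: (params in billions, hidden size, GPU id) -> tokens/s.
    X_train = [
        [7, 4096, 0], [7, 4096, 1],    # 7B model measured on GPU 0 and GPU 1
        [13, 5120, 0], [13, 5120, 1],  # 13B model
        [70, 8192, 0], [70, 8192, 1],  # 70B model
    ]
    y_train = [950.0, 1800.0, 520.0, 1100.0, 90.0, 310.0]

    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_train, y_train)

    # Recommend hardware for a previously unseen 34B model.
    gpu_cost_per_hour = {0: 2.0, 1: 6.5}   # assumed prices per GPU
    required_throughput = 400.0            # assumed performance requirement (tokens/s)
    unseen = {"params_B": 34, "hidden_size": 7168}

    candidates = []
    for gpu_id, cost in gpu_cost_per_hour.items():
        pred = model.predict([[unseen["params_B"], unseen["hidden_size"], gpu_id]])[0]
        if pred >= required_throughput:
            candidates.append((cost, gpu_id, pred))

    if candidates:
        # Cheapest GPU predicted to satisfy the requirement.
        cost, gpu_id, pred = min(candidates)
        print(f"Recommend GPU {gpu_id}: predicted {pred:.0f} tok/s at ${cost}/h")
    else:
        print("No considered GPU is predicted to meet the requirement")

The sketch deliberately separates prediction from selection: the regressor only estimates performance, and cost-effectiveness is enforced afterwards by filtering on the requirement and minimizing price, mirroring the paper's stated goal of recommending the most cost-effective hardware.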

Publication status

published

Book title

SC24: International Conference for High Performance Computing, Networking, Storage and Analysis

Pages / Article No.

10793215

Publisher

IEEE

Event

2024 International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2024)

Subject

large language models; inference services; performance; benchmarking; prediction
