Chameleon: a heterogeneous and disaggregated accelerator system for retrieval-augmented language models
Date
2023-11-17
Publication Type
Other Conference Item
ETH Bibliography
yes
Abstract
A Retrieval-Augmented Language Model (RALM) augments a large language model (LLM) by retrieving context-specific knowledge from an external database via vector search. This strategy facilitates impressive text generation quality even with smaller models, thus reducing computational demands by orders of magnitude. However, RALMs introduce unique system design challenges due to (a) the diverse workload characteristics between LLM inference and retrieval and (b) the various system requirements and bottlenecks for different RALM configurations, including model sizes, database sizes, and retrieval frequencies. We propose Chameleon, a heterogeneous accelerator system integrating both LLM and retrieval accelerators in a disaggregated architecture. The heterogeneity ensures efficient serving for both LLM inference and retrieval, while the disaggregation allows independent scaling of LLM and retrieval accelerators to fulfill diverse RALM requirements. Our Chameleon prototype implements retrieval accelerators on FPGAs and assigns LLM inference to GPUs, with a CPU server orchestrating these accelerators over the network. Compared to CPU-based and CPU-GPU vector search systems, Chameleon’s retrieval accelerators achieve up to 23.72× speedup and 26.2× higher energy efficiency. Evaluated on various RALMs, Chameleon exhibits up to 2.16× reduction in latency and 3.18× speedup in throughput compared to the hybrid CPU-GPU architecture. These promising results pave the way for adopting accelerator heterogeneity and disaggregation into future RALM systems.
Publication status
unpublished
Event
9th International Workshop on Heterogeneous High-Performance Reconfigurable Computing (H²RC 2023)
Organisational unit
03506 - Alonso, Gustavo / Alonso, Gustavo
Notes
Conference lecture held on November 17, 2023.