 

Chameleon: a heterogeneous and disaggregated accelerator system for retrieval-augmented language models



Date

2023-11-17

Publication Type

Other Conference Item

ETH Bibliography

yes

Abstract

A Retrieval-Augmented Language Model (RALM) augments a large language model (LLM) by retrieving context-specific knowledge from an external database via vector search. This strategy facilitates impressive text generation quality even with smaller models, thus reducing computational demands by orders of magnitude. However, RALMs introduce unique system design challenges due to (a) the diverse workload characteristics between LLM inference and retrieval and (b) the various system requirements and bottlenecks for different RALM configurations including model sizes, database sizes, and retrieval frequencies. We propose Chameleon, a heterogeneous accelerator system integrating both LLM and retrieval accelerators in a disaggregated architecture. The heterogeneity ensures efficient serving for both LLM inference and retrieval, while the disaggregation allows independent scaling of LLM and retrieval accelerators to fulfill diverse RALM requirements. Our Chameleon prototype implements retrieval accelerators on FPGAs and assigns LLM inference to GPUs, with a CPU server orchestrating these accelerators over the network. Compared to CPU-based and CPU-GPU vector search systems, Chameleon’s retrieval accelerators achieve up to 23.72× speedup and 26.2× energy efficiency. Evaluated on various RALMs, Chameleon exhibits up to 2.16× reduction in latency and 3.18× speedup in throughput compared to the hybrid CPU-GPU architecture. These promising results pave the way for adopting accelerator heterogeneity and disaggregation into future RALM systems.
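The retrieval step the abstract describes — a nearest-neighbour vector search over an external database whose results augment the LLM's prompt — can be sketched minimally. The following toy Python sketch uses a random-projection bag-of-words embedding and exact cosine-similarity search; the documents, the embedding scheme, and the helper names (`build_index`, `retrieve`) are illustrative assumptions, not Chameleon's actual pipeline or hardware implementation.

```python
import numpy as np

def build_index(docs, dim=8, seed=0):
    """Embed each document with a toy random-projection bag of words
    (a stand-in for a real encoder). Returns the vectors and the embed fn."""
    rng = np.random.default_rng(seed)
    vocab = {}

    def embed(text):
        vec = np.zeros(dim)
        for tok in text.lower().split():
            if tok not in vocab:
                vocab[tok] = rng.normal(size=dim)
            vec += vocab[tok]
        norm = np.linalg.norm(vec)
        return vec / norm if norm else vec

    return [embed(d) for d in docs], embed

def retrieve(query, docs, doc_vecs, embed, k=2):
    """Exact top-k search by cosine similarity; large databases would use an
    approximate index instead of this brute-force scan."""
    q = embed(query)
    sims = [float(q @ v) for v in doc_vecs]
    top = sorted(range(len(docs)), key=lambda i: -sims[i])[:k]
    return [docs[i] for i in top]

docs = [
    "FPGAs accelerate vector search",
    "GPUs excel at LLM inference",
    "retrieval augments language models with external knowledge",
]
doc_vecs, embed = build_index(docs)
context = retrieve("which hardware accelerates vector search", docs, doc_vecs, embed, k=1)
# Retrieved passages are prepended to the prompt before LLM inference.
prompt = "Context: " + " | ".join(context) + "\nQuestion: ..."
```

In a disaggregated deployment as described above, `retrieve` would run on dedicated retrieval accelerators while the LLM consuming `prompt` runs on separate inference accelerators, each pool scaled independently.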

Publication status

unpublished

Event

9th International Workshop on Heterogeneous High-Performance Reconfigurable Computing (H²RC 2023)

Organisational unit

03506 - Alonso, Gustavo

Notes

Conference lecture held on November 17, 2023.
