RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving



Date

2025-06

Publication Type

Conference Paper

ETH Bibliography

yes

Abstract

Retrieval-augmented generation (RAG) is emerging as a popular approach for reliable LLM serving. However, efficient RAG serving remains an open challenge due to the rapid emergence of many RAG variants and the substantial differences in workload characteristics across them. This paper makes three fundamental contributions to advancing RAG serving. First, we introduce RAGSchema, a structured abstraction that captures the wide range of RAG algorithms, serving as a foundation for performance optimization. Second, we analyze several representative RAG workloads with distinct RAGSchema, revealing significant performance variability across these workloads. Third, to address this variability and meet diverse performance requirements, we propose RAGO (Retrieval-Augmented Generation Optimizer), a system optimization framework for efficient RAG serving. RAGO achieves up to a 2× increase in QPS per chip and a 55% reduction in time-to-first-token latency compared to RAG systems built on LLM-system extensions.

Publication status

published

Book title

Proceedings of the 52nd Annual International Symposium on Computer Architecture

Pages / Article No.

974 - 989

Publisher

Association for Computing Machinery

Event

52nd International Symposium on Computer Architecture (ISCA 2025)

Subject

Retrieval-Augmented Generation; Computer System; Computer Architecture; Large Language Model; Performance Optimization
