RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving
OPEN ACCESS
Author / Producer
Date
2025-06
Publication Type
Conference Paper
ETH Bibliography
yes
Data
Rights / License
Abstract
Retrieval-augmented generation (RAG) is emerging as a popular approach for reliable LLM serving. However, efficient RAG serving remains an open challenge due to the rapid emergence of many RAG variants and the substantial differences in workload characteristics across them. This paper makes three fundamental contributions to advancing RAG serving. First, we introduce RAGSchema, a structured abstraction that captures the wide range of RAG algorithms, serving as a foundation for performance optimization. Second, we analyze several representative RAG workloads with distinct RAGSchema, revealing significant performance variability across these workloads. Third, to address this variability and meet diverse performance requirements, we propose RAGO (Retrieval-Augmented Generation Optimizer), a system optimization framework for efficient RAG serving. RAGO achieves up to a 2× increase in QPS per chip and a 55% reduction in time-to-first-token latency compared to RAG systems built on LLM-system extensions.
Permanent link
Publication status
published
External links
Editor
Book title
Proceedings of the 52nd Annual International Symposium on Computer Architecture
Journal / series
Volume
Pages / Article No.
974 - 989
Publisher
Association for Computing Machinery
Event
52nd International Symposium on Computer Architecture (ISCA 2025)
Subject
Retrieval-Augmented Generation; Computer System; Computer Architecture; Large Language Model; Performance Optimization