QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models
Date
2023
Publication Type
Conference Paper
ETH Bibliography
yes
Abstract
We present ongoing work on a new automatic code generation approach for supporting quantized generative inference on LLMs such as LLaMA or OPT on off-the-shelf CPUs. Our approach is informed by the target architecture and a performance model, including both hardware characteristics and method-specific accuracy constraints. Results on CPU-based inference for LLaMA models show that our approach can lead to high performance and high accuracy, comparing favorably to the best existing open-source solution.
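To make the setting concrete, below is a minimal, hypothetical sketch of the kind of kernel such a generator targets: a matrix-vector product over 4-bit group-quantized weights (GPTQ-style), dequantized on the fly on the CPU. The packing layout, group size, and dequantization formula (`scale * q - zero`) are illustrative assumptions, not QIGen's actual generated code.

```c
#include <stdint.h>
#include <stdio.h>

#define GROUP 4  /* assumed: number of weights sharing one scale/zero pair */

/* y[r] = sum_c dequant(Wq[r][c]) * x[c]
 * wq packs two 4-bit weight codes per byte, row-major:
 * even index -> low nibble, odd index -> high nibble. */
void qmatvec(int rows, int cols,
             const uint8_t *wq,     /* rows*cols/2 packed bytes */
             const float *scales,   /* rows*cols/GROUP entries */
             const float *zeros,    /* rows*cols/GROUP entries */
             const float *x, float *y)
{
    for (int r = 0; r < rows; r++) {
        float acc = 0.0f;
        for (int c = 0; c < cols; c++) {
            int idx = r * cols + c;
            uint8_t byte = wq[idx / 2];
            int q = (idx % 2) ? (byte >> 4) : (byte & 0x0F);
            int g = idx / GROUP;
            /* assumed dequantization: scale * code - (pre-scaled) zero point */
            float w = scales[g] * (float)q - zeros[g];
            acc += w * x[c];
        }
        y[r] = acc;
    }
}
```

A real generated kernel would specialize this loop nest to the target architecture (vector width, cache blocking, register tiling) and to the quantization parameters, which is precisely where an architecture- and accuracy-aware code generator can outperform a generic hand-written implementation.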
Publication status
published
External links
OpenReview
Event
Workshop on Efficient Systems for Foundation Models @ ICML2023 (ES-FoMO 2023)
Subject
Code Generation; Large Language Models; LLM; Quantization; Model Compression; GPTQ; LLaMA
Organisational unit
03893 - Püschel, Markus / Püschel, Markus
Notes
Poster presentation