QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models



Date

2023

Publication Type

Conference Paper

ETH Bibliography

yes

Abstract

We present ongoing work on a new automatic code generation approach for supporting quantized generative inference on LLMs such as LLaMA or OPT on off-the-shelf CPUs. Our approach is informed by the target architecture and a performance model, including both hardware characteristics and method-specific accuracy constraints. Results on CPU-based inference for LLaMA models show that our approach can lead to high performance and high accuracy, comparing favorably to the best existing open-source solution.
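The abstract describes generating CPU kernels for quantized inference. As a rough, hypothetical illustration (not the paper's actual generated code, and with made-up helper names `quantize_group` and `dequantize_dot`), the core operation such kernels perform is an on-the-fly dequantize-and-accumulate over group-quantized weights, e.g. 4-bit groups with a per-group scale and zero point in the style of GPTQ-like weight quantization:

```python
def quantize_group(weights, bits=4):
    """Quantize one group of float weights to unsigned ints in [0, 2^bits - 1],
    returning the quantized values plus a per-group scale and zero point."""
    qmax = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0  # avoid zero scale for constant groups
    zero = lo
    q = [round((w - zero) / scale) for w in weights]
    return q, scale, zero

def dequantize_dot(q, scale, zero, x):
    """Inner loop of a quantized matrix-vector product: dequantize each
    weight on the fly and accumulate its product with the activation."""
    return sum((qi * scale + zero) * xi for qi, xi in zip(q, x))

# Example: quantize a small weight group and compare the quantized
# dot product against the exact float result.
w = [0.1, -0.2, 0.3, 0.05]
x = [1.0, 1.0, 1.0, 1.0]
q, scale, zero = quantize_group(w)
approx = dequantize_dot(q, scale, zero, x)
exact = sum(wi * xi for wi, xi in zip(w, x))
```

A real generated kernel would pack two 4-bit values per byte, vectorize the dequantize-multiply-accumulate with SIMD intrinsics, and specialize loop bounds and group sizes to the target CPU — that architecture-aware specialization is what the code-generation approach automates.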

Publication status

published

Event

Workshop on Efficient Systems for Foundation Models @ ICML2023 (ES-FoMO 2023)

Subject

Code Generation; Large Language Models; LLM; Quantization; Model Compression; GPTQ; LLaMA

Organisational unit

03893 - Püschel, Markus

Notes

Poster presentation
