Exponentially Faster Language Modelling
OPEN ACCESS
Author / Producer
Belcak, Peter; Wattenhofer, Roger
Date
2023-11-21
Publication Type
Working Paper
ETH Bibliography
yes
Abstract
Language models only really need to use an exponential fraction of their neurons for individual inferences. As proof, we present UltraFastBERT, a BERT variant that uses 0.3% of its neurons during inference while performing on par with similar BERT models. UltraFastBERT selectively engages just 12 out of 4095 neurons for each layer inference. This is achieved by replacing feedforward networks with fast feedforward networks (FFFs). While no truly efficient implementation currently exists to unlock the full acceleration potential of conditional neural execution, we provide high-level CPU code achieving 78x speedup over the optimized baseline feedforward implementation, and a PyTorch implementation delivering 40x speedup over the equivalent batched feedforward inference. We publish our training code, benchmarking setup, and model weights. (https://github.com/pbelcak/UltraFastBERT)
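The 12-of-4095 figure in the abstract corresponds to a complete binary tree of 2^12 - 1 = 4095 single-neuron nodes: each inference evaluates only the 12 nodes on one root-to-leaf path, with the sign of each node's pre-activation deciding which child to descend into. Below is a minimal, unbatched sketch of that conditional execution in PyTorch; the class name FFFSketch, the parameter names w_in and w_out, the GELU activation, and the initialisation are illustrative assumptions rather than the authors' released implementation (available at the repository linked in the abstract).

import torch


class FFFSketch(torch.nn.Module):
    # Illustrative fast feedforward (FFF) layer: 2**depth - 1 single-neuron
    # nodes arranged as a complete binary tree; each inference evaluates only
    # the `depth` nodes on one root-to-leaf path (12 of 4095 for depth 12).
    # Names and initialisation are assumptions made for this sketch.
    def __init__(self, width: int, depth: int = 12):
        super().__init__()
        n_nodes = 2 ** depth - 1  # 4095 nodes for depth 12
        self.depth = depth
        self.w_in = torch.nn.Parameter(torch.randn(n_nodes, width) / width ** 0.5)
        self.w_out = torch.nn.Parameter(torch.randn(n_nodes, width) / width ** 0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (width,) -- a single token vector, kept unbatched for clarity
        y = torch.zeros_like(x)
        node = 0  # start at the root of the tree
        for level in range(self.depth):
            logit = self.w_in[node] @ x  # this node's pre-activation
            y = y + torch.nn.functional.gelu(logit) * self.w_out[node]
            if level + 1 < self.depth:
                # the sign of the pre-activation selects the child to visit next
                node = 2 * node + (1 if logit.item() > 0 else 2)
        return y


layer = FFFSketch(width=768)       # hidden width of a BERT-base-style model
output = layer(torch.randn(768))   # touches 12 of the 4095 neurons

With 12 of 4095 neurons active per layer inference, the active fraction is 12/4095, roughly 0.3%, which is the figure quoted in the abstract.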
Publication status
published
Pages / Article No.
2311.10770
Publisher
Cornell University
Edition / version
v2
Subject
Language models; Feedforward neural network; Fast feedforward network; Model acceleration
Organisational unit
03604 - Wattenhofer, Roger / Wattenhofer, Roger
Related publications and datasets
Is supplemented by: https://github.com/pbelcak/UltraFastBERT