Differentiable Subset Pruning of Transformer Heads
OPEN ACCESS
Date
2021-12-17
Publication Type
Journal Article
ETH Bibliography
yes
Abstract
Multi-head attention, a collection of several attention mechanisms that independently attend to different parts of the input, is the key ingredient in the Transformer. Recent work has shown, however, that a large proportion of the heads in a Transformer's multi-head attention mechanism can be safely pruned away without significantly harming the performance of the model; such pruning leads to models that are noticeably smaller and faster in practice. Our work introduces a new head pruning technique that we term differentiable subset pruning. Intuitively, our method learns per-head importance variables and then enforces a user-specified hard constraint on the number of unpruned heads. The importance variables are learned via stochastic gradient descent. We conduct experiments on natural language inference and machine translation; we show that differentiable subset pruning performs comparably to or better than prior work while offering precise control over the sparsity level.
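The abstract describes learning per-head importance variables and enforcing a hard constraint on the number of unpruned heads, with the variables trained by stochastic gradient descent. The sketch below illustrates one way such a relaxed top-k head gate could be written in PyTorch, using a successive Gumbel-softmax relaxation of subset sampling; the function name, temperature value, and the particular relaxation are assumptions for illustration, not the authors' released implementation.

```python
import torch

def soft_topk_gates(logits: torch.Tensor, k: int,
                    temperature: float = 0.5, eps: float = 1e-10) -> torch.Tensor:
    """Relaxed top-k selection over per-head importance logits (illustrative sketch).

    Returns a gate vector in [0, 1]^H whose entries sum to approximately k and
    which is differentiable w.r.t. `logits`, so the importance variables can be
    trained with SGD alongside the task loss.
    """
    # Gumbel perturbation turns deterministic top-k into sampling without replacement.
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + eps) + eps)
    scores = (logits + gumbel) / temperature

    khot = torch.zeros_like(scores)
    onehot_approx = torch.zeros_like(scores)
    for _ in range(k):
        # Softly mask heads that earlier rounds have already (mostly) selected.
        scores = scores + torch.log(torch.clamp(1.0 - onehot_approx, min=eps))
        onehot_approx = torch.softmax(scores, dim=-1)
        khot = khot + onehot_approx
    return khot


# Hypothetical usage: gate the H head outputs of one attention layer.
H, k, d_head = 12, 6, 64
head_logits = torch.zeros(H, requires_grad=True)   # learned importance variables
head_outputs = torch.randn(2, 5, H, d_head)        # (batch, seq, heads, dim), dummy data

gates = soft_topk_gates(head_logits, k)            # soft gates during training
gated = head_outputs * gates.view(1, 1, H, 1)

# At test time the hard constraint can be enforced exactly: keep the k heads
# with the highest learned logits and prune the rest.
hard_gates = torch.zeros(H).scatter_(0, head_logits.topk(k).indices, 1.0)
```

As the temperature is annealed toward zero, the soft gates approach a hard k-subset, which is one common way to reconcile differentiable training with an exact head budget at inference time.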
Publication status
published
Volume
9
Pages / Article No.
1442 - 1459
Publisher
MIT Press
Organisational unit
09684 - Sachan, Mrinmaya / Sachan, Mrinmaya
09682 - Cotterell, Ryan / Cotterell, Ryan
02219 - ETH AI Center / ETH AI Center
Funding
201009 - Representation Learning for Arbitrarily Long Richly Formatted Multimedia Documents (SNF)