Notice

This record has been edited as far as possible; missing data will be added when the version of record is issued.


dc.contributor.author
Noci, Lorenzo
dc.contributor.author
Li, Chuning
dc.contributor.author
Li, Muffan
dc.contributor.author
He, Bobby
dc.contributor.author
Hofmann, Thomas
dc.contributor.author
Maddison, Chris
dc.contributor.author
Roy, Daniel M.
dc.date.accessioned
2024-02-05T12:41:11Z
dc.date.available
2024-01-16T12:58:12Z
dc.date.available
2024-02-05T12:41:11Z
dc.date.issued
2023
dc.identifier.uri
http://hdl.handle.net/20.500.11850/653217
dc.description.abstract
In deep learning theory, the covariance matrix of the representations serves as a proxy to examine the network’s trainability. Motivated by the success of Transformers, we study the covariance matrix of a modified Softmax-based attention model with skip connections in the proportional limit of infinite-depth-and-width. We show that at initialization the limiting distribution can be described by a stochastic differential equation (SDE) indexed by the depth-to-width ratio. To achieve a well-defined stochastic limit, the Transformer’s attention mechanism is modified by centering the Softmax output at identity, and scaling the Softmax logits by a width-dependent temperature parameter. We examine the stability of the network through the corresponding SDE, showing how the scale of both the drift and diffusion can be elegantly controlled with the aid of residual connections. The existence of a stable SDE implies that the covariance structure is well-behaved, even for very large depth and width, thus preventing the notorious issues of rank degeneracy in deep attention models. Finally, we show, through simulations, that the SDE provides a surprisingly good description of the corresponding finite-size model. We coin the name shaped Transformer for these architectural modifications.
en_US
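ethz.notes.editorial
The following is a minimal NumPy sketch of the architectural modification described in the abstract (Softmax output centered at the identity, logits scaled by a width-dependent temperature). It is an illustration only, not the authors' implementation: the exact centering term and the choice tau = sqrt(n * d) below are assumptions made for the sketch.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shaped_attention(X, Wq, Wk, Wv, tau):
    # Temperature-scaled logits, then a Softmax that is re-centered so the
    # attention matrix sits close to the identity at initialization.
    n, _ = X.shape
    q, k, v = X @ Wq, X @ Wk, X @ Wv
    logits = (q @ k.T) / tau
    A = softmax(logits, axis=-1)
    # Assumed centering: identity plus the deviation of the Softmax from the
    # uniform attention matrix (1/n) * ones.
    A_shaped = np.eye(n) + A - np.ones((n, n)) / n
    return A_shaped @ v

# Toy usage: n tokens of width d, temperature growing with width.
rng = np.random.default_rng(0)
n, d = 8, 64
X = rng.standard_normal((n, d)) / np.sqrt(d)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
tau = np.sqrt(n * d)  # assumed width-dependent temperature
out = shaped_attention(X, Wq, Wk, Wv, tau)
print(out.shape)  # (8, 64)

Keeping the attention matrix near the identity at initialization is what, per the abstract, yields a well-behaved covariance structure in the proportional depth-and-width limit and avoids rank degeneracy.
en_US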
dc.language.iso
en
en_US
dc.title
The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit
en_US
dc.type
Conference Paper
ethz.book.title
Advances in Neural Information Processing Systems 36
en_US
ethz.size
32 p.
en_US
ethz.event
37th Annual Conference on Neural Information Processing Systems (NeurIPS 2023)
en_US
ethz.event.location
New Orleans, LA, USA
en_US
ethz.event.date
December 10-16, 2023
en_US
ethz.notes
Poster presented on December 13, 2023.
en_US
ethz.publication.status
published
en_US
ethz.leitzahl
ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02150 - Dep. Informatik / Dep. of Computer Science::02661 - Institut für Maschinelles Lernen / Institute for Machine Learning::09462 - Hofmann, Thomas / Hofmann, Thomas
en_US
ethz.leitzahl.certified
ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02150 - Dep. Informatik / Dep. of Computer Science::02661 - Institut für Maschinelles Lernen / Institute for Machine Learning::09462 - Hofmann, Thomas / Hofmann, Thomas
en_US
ethz.identifier.url
https://neurips.cc/virtual/2023/poster/71734
ethz.relation.isNewVersionOf
https://openreview.net/forum?id=PqfPjS9JRX
ethz.date.deposited
2024-01-16T12:58:12Z
ethz.source
FORM
ethz.eth
yes
en_US
ethz.availability
Metadata only
en_US
ethz.rosetta.exportRequired
true
ethz.COinS
ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.atitle=The%20Shaped%20Transformer:%20Attention%20Models%20in%20the%20Infinite%20Depth-and-Width%20Limit&rft.date=2023&rft.au=Noci,%20Lorenzo&Li,%20Chuning&Li,%20Muffan&He,%20Bobby&Hofmann,%20Thomas&rft.genre=proceeding&rft.btitle=Advances%20in%20Neural%20Information%20Processing%20Systems%2036

Files in this item


There are no files associated with this item.
