Show simple item record

dc.contributor.author: Noci, Lorenzo
dc.contributor.author: Anagnostidis, Sotiris
dc.contributor.author: Biggio, Luca
dc.contributor.author: Orvieto, Antonio
dc.contributor.author: Singh, Sidak Pal
dc.contributor.author: Lucchi, Aurelien
dc.contributor.editor: Koyejo, Sanmi
dc.contributor.editor: Mohamed, Shakir
dc.contributor.editor: Agarwal, Alekh
dc.contributor.editor: Belgrave, Danielle
dc.contributor.editor: Cho, Kyunghyun
dc.contributor.editor: Oh, Alice
dc.date.accessioned: 2023-04-05T06:37:37Z
dc.date.available: 2023-01-11T10:54:29Z
dc.date.available: 2023-02-15T15:16:00Z
dc.date.available: 2023-04-05T06:37:37Z
dc.date.issued: 2022
dc.identifier.isbn: 978-1-7138-7108-8 (en_US)
dc.identifier.uri: http://hdl.handle.net/20.500.11850/591602
dc.description.abstract: Transformers have achieved remarkable success in several domains, ranging from natural language processing to computer vision. Nevertheless, it has recently been shown that stacking self-attention layers - the distinctive architectural component of Transformers - can result in rank collapse of the tokens' representations at initialization. The question of whether and how rank collapse affects training is still largely unanswered, and its investigation is necessary for a more comprehensive understanding of this architecture. In this work, we shed new light on the causes and effects of this phenomenon. First, we show that rank collapse of the tokens' representations hinders training by causing the gradients of the queries and keys to vanish at initialization. Furthermore, we provide a thorough description of the origin of rank collapse and discuss how to prevent it via an appropriate depth-dependent scaling of the residual branches. Finally, our analysis unveils that specific architectural hyperparameters affect the gradients of queries and values differently, leading to disproportionate gradient norms. This suggests an explanation for the widespread use of adaptive methods for Transformers' optimization. (en_US)
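
The abstract's central mechanism - stacked self-attention driving the tokens' representations toward rank one at initialization, and a depth-dependent scaling of the residual branches counteracting this - can be illustrated with a small numerical experiment. The Python/NumPy sketch below is not the authors' code: the single-head attention stack, the Gaussian initialization, the rank-collapse proxy (relative distance of the token matrix from its rank-1, mean-token approximation), and the 1/sqrt(depth) residual scaling are illustrative assumptions made here.

import numpy as np

def softmax(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def rank_collapse_gap(X):
    # Relative distance of X (tokens x features) from the rank-1 matrix
    # whose rows all equal the mean token; values near 0 indicate collapse.
    mean_token = X.mean(axis=0, keepdims=True)
    return np.linalg.norm(X - mean_token) / np.linalg.norm(X)

def attention_block(X, rng, residual_scale):
    # One single-head self-attention block with Gaussian weights at initialization.
    n, d = X.shape
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    logits = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
    out = softmax(logits) @ (X @ Wv)
    if residual_scale is None:        # attention only, no residual branch
        return out
    return X + residual_scale * out   # residual branch scaled by a constant

rng = np.random.default_rng(0)
n_tokens, width, depth = 32, 64, 48
X0 = rng.standard_normal((n_tokens, width))

for name, scale in [("no residual", None),
                    ("residual scaled by 1/sqrt(depth)", 1.0 / np.sqrt(depth))]:
    X = X0.copy()
    for _ in range(depth):
        X = attention_block(X, rng, residual_scale=scale)
    print(f"{name}: rank-collapse gap after {depth} layers = {rank_collapse_gap(X):.3e}")

In such a toy run, the attention-only stack typically drives the gap toward zero while the depth-scaled residual stack keeps it of order one; this is only a qualitative illustration of the phenomenon the paper analyzes rigorously.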
dc.language.iso: en (en_US)
dc.publisher: Curran (en_US)
dc.title: Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse (en_US)
dc.type: Conference Paper
ethz.book.title: Advances in Neural Information Processing Systems 35 (en_US)
ethz.pages.start: 27198 (en_US)
ethz.pages.end: 27211 (en_US)
ethz.event: 36th Annual Conference on Neural Information Processing Systems (NeurIPS 2022) (en_US)
ethz.event.location: New Orleans, LA, USA (en_US)
ethz.event.date: November 28 - December 9, 2022 (en_US)
ethz.notes: Poster presentation on November 29, 2022. (en_US)
ethz.grant: European Learning and Intelligent Systems Excellence (en_US)
ethz.publication.place: Red Hook, NY (en_US)
ethz.publication.status: published (en_US)
ethz.leitzahl: ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02150 - Dep. Informatik / Dep. of Computer Science::02661 - Institut für Maschinelles Lernen / Institute for Machine Learning::09462 - Hofmann, Thomas / Hofmann, Thomas (en_US)
ethz.leitzahl: ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02150 - Dep. Informatik / Dep. of Computer Science::02661 - Institut für Maschinelles Lernen / Institute for Machine Learning::09664 - Schölkopf, Bernhard / Schölkopf, Bernhard (en_US)
ethz.leitzahl.certified: ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02150 - Dep. Informatik / Dep. of Computer Science::02661 - Institut für Maschinelles Lernen / Institute for Machine Learning::09462 - Hofmann, Thomas / Hofmann, Thomas (en_US)
ethz.leitzahl.certified: ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02150 - Dep. Informatik / Dep. of Computer Science::02661 - Institut für Maschinelles Lernen / Institute for Machine Learning::09664 - Schölkopf, Bernhard / Schölkopf, Bernhard
ethz.identifier.url: https://proceedings.neurips.cc/paper_files/paper/2022/hash/ae0cba715b60c4052359b3d52a2cff7f-Abstract-Conference.html
ethz.identifier.url: https://nips.cc/virtual/2022/poster/53861
ethz.grant.agreementno: 951847
ethz.grant.fundername: EC
ethz.grant.funderDoi: 10.13039/501100000780
ethz.grant.program: H2020
ethz.date.deposited: 2023-01-11T10:54:29Z
ethz.source: FORM
ethz.eth: yes (en_US)
ethz.availability: Metadata only (en_US)
ethz.rosetta.installDate: 2023-04-05T06:37:39Z
ethz.rosetta.lastUpdated: 2023-04-05T06:37:39Z
ethz.rosetta.versionExported: true

Files in this item


There are no files associated with this item.

Publication type: Conference Paper
