Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers
Date
2024-03-25
Publication Type
Conference Paper
ETH Bibliography
yes
Abstract
This work presents an analysis of the effectiveness of standard shallow feed-forward networks at mimicking the behavior of the attention mechanism in the original Transformer model, a state-of-the-art architecture for sequence-to-sequence tasks. We substitute key elements of the attention mechanism in the Transformer with simple feed-forward networks, trained to reproduce the original components via knowledge distillation. Our experiments, conducted on the IWSLT2017 dataset, reveal the capacity of these "attentionless Transformers" to rival the performance of the original architecture. Through rigorous ablation studies and experiments with various replacement network types and sizes, we offer insights that support the viability of our approach. This not only sheds light on the adaptability of shallow feed-forward networks in emulating attention mechanisms but also underscores their potential to streamline complex architectures for sequence-to-sequence tasks.
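To make the idea concrete, the following is a minimal, self-contained PyTorch sketch of the approach described in the abstract: a shallow feed-forward network stands in for a self-attention layer and is trained by knowledge distillation to regress onto the frozen attention layer's outputs. All names and hyperparameters here (FFNAttentionReplacement, d_model, max_len, hidden, the MSE distillation loss) are illustrative assumptions, not details taken from the paper.

# Sketch only, not the authors' code: a shallow feed-forward network
# is distilled to reproduce the outputs of a Transformer attention layer.
import torch
import torch.nn as nn

class FFNAttentionReplacement(nn.Module):
    """Shallow feed-forward stand-in for a self-attention layer.

    The FFN has a fixed input size, so sequences are padded to a fixed
    max_len; the output has the same shape as the input, allowing the
    module to be dropped in where attention used to be.
    """
    def __init__(self, d_model: int, max_len: int, hidden: int = 1024):
        super().__init__()
        self.max_len = max_len
        self.net = nn.Sequential(
            nn.Linear(max_len * d_model, hidden),
            nn.ReLU(),
            nn.Linear(hidden, max_len * d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape                       # (batch, seq_len, d_model)
        # Pad the sequence dimension to max_len, then flatten for the FFN.
        flat = nn.functional.pad(x, (0, 0, 0, self.max_len - t)).reshape(b, -1)
        out = self.net(flat).reshape(b, self.max_len, d)
        return out[:, :t]                       # trim the padding back off

# Distillation step: regress the replacement onto the frozen attention
# layer's outputs with an MSE loss (one assumed choice of distillation loss).
d_model, max_len = 64, 16
teacher = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
student = FFNAttentionReplacement(d_model, max_len)
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

x = torch.randn(8, max_len, d_model)            # a dummy batch
with torch.no_grad():
    target, _ = teacher(x, x, x)                # teacher attention output
opt.zero_grad()
loss = nn.functional.mse_loss(student(x), target)
loss.backward()
opt.step()

In this sketch the student never computes pairwise token interactions; it must learn a fixed mapping over the whole (padded) sequence, which is why the fixed maximum length is an inherent constraint of replacing attention with a plain feed-forward network.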
Publication status
published
Book title
IAAI-24, EAAI-24, AAAI-24 Student Abstracts, Undergraduate Consortium and Demonstrations
Volume
38 (21)
Pages / Article No.
23477 - 23479
Publisher
AAAI
Event
38th AAAI Conference on Artificial Intelligence (AAAI 2024)
Subject
Knowledge Distillation; Transformer; Attention Mechanism; Feed-Forward Networks; Natural Language Processing; Optimization
Organisational unit
09462 - Hofmann, Thomas