Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers


Date

2024-03-25

Publication Type

Conference Paper

ETH Bibliography

yes


Data

Rights / License

Abstract

This work presents an analysis of the effectiveness of using standard shallow feed-forward networks to mimic the behavior of the attention mechanism in the original Transformer model, a state-of-the-art architecture for sequence-to-sequence tasks. We substitute key elements of the attention mechanism in the Transformer with simple feed-forward networks, trained via knowledge distillation using the original components as teachers. Our experiments, conducted on the IWSLT2017 dataset, reveal the capacity of these "attentionless Transformers" to rival the performance of the original architecture. Through rigorous ablation studies and experiments with various replacement network types and sizes, we offer insights that support the viability of our approach. This not only sheds light on the adaptability of shallow feed-forward networks in emulating attention mechanisms but also underscores their potential to streamline complex architectures for sequence-to-sequence tasks.
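The substitution described in the abstract can be illustrated with a short, self-contained sketch. The code below is not the authors' implementation; it is a minimal PyTorch illustration, under assumed toy dimensions, of distilling a standard multi-head self-attention layer (the teacher) into a shallow feed-forward network (the student) that consumes the flattened input sequence. All names and hyperparameters here (d_model, seq_len, the 1024-unit hidden layer, the MSE distillation loss) are illustrative assumptions, not values from the paper.

import torch
import torch.nn as nn

# Toy dimensions (illustrative assumptions, not values from the paper).
d_model, seq_len, batch = 64, 16, 8

# Teacher: a standard multi-head self-attention layer, kept frozen.
teacher_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4,
                                     batch_first=True)
teacher_attn.eval()

# Student: a shallow feed-forward network mapping the flattened sequence
# to an output of the same shape as the attention layer's output.
student_ffn = nn.Sequential(
    nn.Linear(seq_len * d_model, 1024),
    nn.ReLU(),
    nn.Linear(1024, seq_len * d_model),
)

optimizer = torch.optim.Adam(student_ffn.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(200):
    # Stand-in for intermediate Transformer activations; in the actual
    # setup these would come from data flowing through a trained model.
    x = torch.randn(batch, seq_len, d_model)
    with torch.no_grad():
        target, _ = teacher_attn(x, x, x)  # teacher's attention output
    pred = student_ffn(x.flatten(1)).view(batch, seq_len, d_model)
    loss = loss_fn(pred, target)           # distill on activations
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Note that flattening the sequence fixes the input length at training time, a trade-off of replacing attention with a fixed-size feed-forward mapping, whereas the attention layer itself handles variable-length inputs.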

Publication status

published

Editor

Book title

IAAI-24, EAAI-24, AAAI-24 Student Abstracts, Undergraduate Consortium and Demonstrations

Volume

38 (21)

Pages / Article No.

23477–23479

Publisher

AAAI

Event

38th AAAI Conference on Artificial Intelligence (AAAI 2024)

Edition / version

Methods

Software

Geographic location

Date collected

Date created

Subject

Knowledge Distillation; Transformer; Attention Mechanism; Feedforward Networks; Natural Language Processing; Optimization

Organisational unit

09462 - Hofmann, Thomas / Hofmann, Thomas

Notes

Funding

Related publications and datasets