On the Interplay between Scale and Inductive Bias in Deep Learning
OPEN ACCESS
Author / Producer
Date
2025-05-26
Publication Type
Doctoral Thesis
ETH Bibliography
yes
Abstract
Deep learning has seen tremendous progress in recent years. Neural networks now outperform human players in strategy games such as chess and Go, win medals at the math olympiad, and have started to write better code than some of the best human programmers. While the empirical progress of this paradigm is astonishing, its theoretical underpinnings and our understanding of its inner workings have lagged significantly behind. The key driver of the success of neural networks, and most likely also the root cause of the emerging theoretical complexity, is scale: the bigger the model and the more data it consumes, the better the performance on many tasks of interest. As a consequence, model training has become an ever more costly endeavour, with modern language models now comprising trillions of parameters while learning from almost the entirety of written human knowledge.
In order to scale these models further, the design of many of their components has been simplified. While initially every problem or data type required its own carefully crafted architecture, the Transformer has recently emerged as a universal choice, removing most of the inductive bias previously built into models. The reason is simple: its ability to process sequences in parallel enables researchers to train larger networks on more data. In a similar spirit, the objective functions used to train models have been simplified, and pre-training on losses as straightforward as next-token prediction or image classification suddenly proved useful for significantly more involved tasks such as instruction following, coding, or image segmentation. This surprising observation has been coined the Bitter Lesson of deep learning.
In this thesis we will closely study the role of scale in model training and how properties of neural networks change, or become less important, when subjected to it. We will examine the aforementioned simplifications in model design and probe to what degree limitations can arise when working with so little inductive bias. First, we will focus on the initialization of neural networks, a process crucial for successful training, and precisely characterize how scaling network size affects the resulting distribution of the activations. We will see that scaling both width and depth simultaneously must be performed with care, as the resulting distribution becomes significantly more heavy-tailed. Next, we will examine the generalization error, the metric of interest that we wish to understand. We will focus on an estimator based on the leave-one-out error, which is easier to manipulate mathematically. Through its lens, we will analyze several intriguing phenomena typical of deep learning at scale, including double descent and the blessing of overparametrization. We will then examine the role of the architecture more closely and probe the limits of the Bitter Lesson by training the most scalable but highly unstructured architecture, the multi-layer perceptron. Even in this adversarial case, we show that surprising performance can still be achieved through the sheer force of scale. Finally, we examine the role of the objective, focusing on the next-token prediction task typically employed during the pre-training of language models. Although it appears to be an innocent objective without much structure, the mechanism of teacher forcing employed during training can strongly shape the resulting model in unintended ways. In particular, we will analyze how the planning abilities of next-token predictors can be limited, no matter the scale, by demonstrating their failure on a very simple task.
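The abstract does not spell out the estimator it uses, but a minimal sketch of why leave-one-out quantities are mathematically convenient is the classical closed-form identity for ridge regression (a standard textbook result, not the thesis' own construction): the leave-one-out residuals can be read off from a single fit via the hat matrix, without retraining n times. All variable names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 40, 5, 1e-2
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Closed-form leave-one-out residuals for ridge regression:
#   e_i = (y_i - yhat_i) / (1 - H_ii),
# where H = X (X^T X + lam I)^{-1} X^T is the ridge hat matrix.
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)
yhat = H @ y
loo_closed = (y - yhat) / (1.0 - np.diag(H))

# Explicit leave-one-out loop (n separate fits) for comparison.
loo_explicit = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    Xi, yi = X[mask], y[mask]
    w = np.linalg.solve(Xi.T @ Xi + lam * np.eye(d), Xi.T @ yi)
    loo_explicit[i] = y[i] - X[i] @ w

# The two agree exactly (up to floating point), which is what makes
# leave-one-out-based estimators easy to manipulate analytically.
assert np.allclose(loo_closed, loo_explicit)
```

The identity follows from the Sherman–Morrison formula applied to the rank-one update of removing a single sample, which is also what makes such estimators tractable in theoretical analyses of overparametrized models.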
Publication status
published
Contributors
Examiner : Hofmann, Thomas
Examiner : Lucchi, Aurélien
Examiner : Flammarion, Nicolas
Examiner : Gu, Quanquan
Publisher
ETH Zurich
Organisational unit
09462 - Hofmann, Thomas / Hofmann, Thomas