On the Interplay between Scale and Inductive Bias in Deep Learning


Date

2025-05-26

Publication Type

Doctoral Thesis

ETH Bibliography

yes

Abstract

Deep learning has seen tremendous progress in recent years. Neural networks now outperform human players in strategy games such as chess and Go, win medals at the math olympiad, and have started to write better code than some of the best human programmers. While the empirical progress of this paradigm is astonishing, its theoretical underpinnings and our understanding of its inner workings have lagged significantly behind. The key driver of the success of neural networks, and most likely also the root cause of the emerging theoretical complexity, is scale: the bigger the model and the more data it consumes, the better its performance on many tasks of interest. As a consequence of this observation, model training has become an ever more costly endeavour, with modern language models now comprising trillions of parameters while learning from almost the entirety of written human knowledge.

In order to scale these models further, the design of many of their components has been simplified. While initially every problem or data type required its own carefully crafted architecture, the Transformer has recently emerged as a universal choice, removing most of the inductive bias previously built into models. The reason for this is very simple: its ability to process sequences in parallel enables researchers to train larger networks on more data. In a similar spirit, the objective functions used to train models have been simplified, and pre-training on losses as straightforward as next-token prediction or image classification suddenly proved useful for significantly more involved tasks such as instruction-following, coding, or image segmentation. This surprising observation has been coined the Bitter Lesson of deep learning (Sutton, 2019).

In this thesis we will closely study the role of scale in model training and how properties of neural networks change, or become less important, when subjected to it. We will examine the aforementioned simplifications in model design and probe to what degree limitations can arise when working with so little inductive bias.

First, we will focus on the initialization of neural networks, a process crucial for successful training, and we will precisely characterize how scaling the network size affects the resulting distribution of the activations. We will see that scaling both width and depth simultaneously needs to be performed with care, as the resulting distribution becomes significantly more heavy-tailed.

Next, we will examine more closely the generalization error, the metric of interest that we wish to understand. We will focus on an estimator based on the leave-one-out error that is easier to manipulate mathematically. Through its lens, we will analyze several intriguing phenomena typical of deep learning at scale, including double descent and the blessing of overparametrization.

We will then examine the role of the architecture more closely and probe the limits of the Bitter Lesson by training the most scalable but least structured architecture, the multi-layer perceptron. Even in this adversarial case, we show that surprisingly strong performance can still be achieved through the sheer force of scale.

Finally, we examine the role of the objective, focusing on the next-token prediction task typically employed during the pre-training of language models. Although it may seem an innocent objective without much structure, the teacher-forcing mechanism employed during training can strongly shape the resulting model in unintended ways.
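To make the mechanism concrete, the following is a minimal sketch, in PyTorch with a toy model that does not appear in the thesis, of the gap between teacher-forced training and autoregressive generation: during training, every prediction is conditioned on the ground-truth prefix, whereas at inference time the model must consume its own, possibly erroneous, previous outputs.

```python
import torch
import torch.nn as nn

# Toy next-token predictor; the specific architecture is illustrative only.
class TinyLM(nn.Module):
    def __init__(self, vocab_size=32, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):                    # tokens: (batch, seq)
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)                  # logits: (batch, seq, vocab)

model = TinyLM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Teacher forcing: the input at every position is the *true* prefix,
# shifted by one token, so errors never compound during training.
batch = torch.randint(0, 32, (8, 16))             # stand-in training sequences
logits = model(batch[:, :-1])
loss = loss_fn(logits.reshape(-1, 32), batch[:, 1:].reshape(-1))
loss.backward()
optimizer.step()

# Autoregressive generation: no ground truth to fall back on, so each
# step conditions on the model's own previous (possibly wrong) output.
tokens = batch[:1, :1]                            # a single prompt token
for _ in range(15):
    next_token = model(tokens)[:, -1].argmax(-1, keepdim=True)
    tokens = torch.cat([tokens, next_token], dim=1)
```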
In particular, we will show that the planning abilities of next-token predictors can be limited, no matter the scale, by demonstrating their failure on a very simple task.

Publication status

published

Contributors

Examiner: Hofmann, Thomas
Examiner: Lucchi, Aurélien
Examiner: Flammarion, Nicolas
Examiner: Gu, Quanquan

Publisher

ETH Zurich

Organisational unit

09462 - Hofmann, Thomas / Hofmann, Thomas
