Towards an Empirically Guided Understanding of the Loss Landscape of Neural Networks
OPEN ACCESS
Date
2022
Publication Type
Doctoral Thesis
ETH Bibliography
yes
Abstract
One of the most important and ubiquitous building blocks of machine learning is gradient-based optimization. While it has contributed, and continues to contribute, to the vast majority of recent successes of deep neural networks, it comes both with some limitations and with potential for further improvement.
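The gradient-based optimization referred to above can be illustrated with a minimal sketch; the toy least-squares problem and all variable names below are illustrative and not taken from the thesis.

```python
import numpy as np

# Plain gradient descent on a toy quadratic loss L(w) = ||Xw - y||^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true

w = np.zeros(5)
lr = 0.001                          # small enough for the largest curvature here
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y)    # gradient of the squared error
    w -= lr * grad                  # first-order update

print(np.allclose(w, w_true, atol=1e-3))  # → True
```

First-order methods like this see only the local slope of the loss; the limitations the abstract mentions (forgetting under sequential tasks, ignoring curvature) are the subject of the three parts that follow.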
Catastrophic forgetting, the subject of the first two parts of this thesis, is one such limitation. It refers to the observation that when gradient-based learning algorithms are asked to learn different tasks sequentially, they overwrite knowledge from earlier tasks. In the machine learning community, several different ideas and formalisations of this problem are being investigated. One of the most difficult versions is a setting in which the use of data from earlier distributions is strictly forbidden. In this domain, an important line of work is the family of so-called regularisation-based algorithms. Our first contribution is to unify a large family of these algorithms by showing that they all rely on the same theoretical idea to limit catastrophic forgetting. Not only was this connection previously unknown; we also show that, for at least some of the algorithms, it is an accidental feature. To demonstrate the practical impact of these insights, we show how they can be used to make some algorithms more robust and performant across a variety of settings.
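The regularisation-based family mentioned above can be sketched with one well-known member, a quadratic penalty in the style of EWC (Kirkpatrick et al., 2017): after training on task A, a per-parameter importance weight anchors important parameters near their old values while task B is learned. The quadratic toy tasks and all names below are illustrative assumptions, not the thesis's method.

```python
import numpy as np

rng = np.random.default_rng(1)

def task_grad(w, w_opt):
    """Gradient of a quadratic toy task loss ||w - w_opt||^2."""
    return 2 * (w - w_opt)

w_a = rng.normal(size=3)       # optimum of task A
w_b = rng.normal(size=3)       # optimum of task B
importance = np.ones(3)        # per-parameter importance (e.g. a Fisher diagonal)
lam = 10.0                     # regularisation strength

# Train on task A with plain gradient descent.
w = np.zeros(3)
for _ in range(200):
    w -= 0.05 * task_grad(w, w_a)
w_star = w.copy()              # anchor the parameters found for task A

# Train on task B, penalising movement of important parameters away from w_star.
for _ in range(200):
    grad = task_grad(w, w_b) + lam * importance * (w - w_star)
    w -= 0.05 * grad

# The solution stays closer to task A's optimum than task B's optimum is,
# instead of forgetting task A entirely.
print(np.linalg.norm(w - w_a) < np.linalg.norm(w_b - w_a))  # → True
```

The unifying observation the thesis develops concerns what such penalties have in common theoretically; the sketch above only shows the shared surface form of the update.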
The second part of the thesis uses tools from the first part and tackles a similar problem, but from a different angle: it focuses on the phenomenon of catastrophic forgetting, also known as the stability-plasticity dilemma, from the viewpoint of neuroscience. It proposes and analyses a simple synaptic learning rule based on the stochasticity of synaptic signal transmission, and shows how this learning rule can alleviate catastrophic forgetting in model neural networks. Moreover, the learning rule's effects on energy-efficient information processing are investigated, extending prior work that explores computational roles of this somewhat mysterious stochastic nature of synaptic signal transmission.
Finally, the third part of the thesis focuses on potential improvements of standard first-order gradient-based optimizers. One of the most successful lines of work in this area are Kronecker-factored optimizers, whose influence has reached beyond optimization to areas like Bayesian machine learning, catastrophic forgetting, and meta-learning. Kronecker-factored optimizers are motivated and thought of as approximations of natural gradient descent, a well-known second-order optimization method. We will show that a host of empirical results contradicts this view of KFAC as a second-order optimizer and propose an alternative, fundamentally different theoretical explanation for its effectiveness. This not only gives important new insights into one of the most powerful optimizers for neural networks, but can also be used to derive a more efficient optimizer.
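The Kronecker factorisation at the heart of KFAC (Martens & Grosse, 2015) can be sketched numerically: a layer's curvature matrix is approximated as a Kronecker product of two small factors, so applying its inverse to a gradient reduces to two small matrix inverses instead of one huge one. The factor sizes and random matrices below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 4, 3                                    # weight matrix W has shape m x n
A = rng.normal(size=(n, n)); A = A @ A.T + np.eye(n)  # input-covariance factor
G = rng.normal(size=(m, m)); G = G @ G.T + np.eye(m)  # output-gradient factor
DW = rng.normal(size=(m, n))                   # raw gradient of the layer

# Kronecker identity (column-major vec, symmetric factors):
#   (A kron G)^{-1} vec(DW) == vec(G^{-1} DW A^{-1})
# so the preconditioned step needs only n x n and m x m inverses.
precond_small = np.linalg.inv(G) @ DW @ np.linalg.inv(A)
precond_big = np.linalg.solve(np.kron(A, G), DW.flatten(order="F"))

print(np.allclose(precond_big, precond_small.flatten(order="F")))  # → True
```

For a layer with m·n parameters, the explicit curvature matrix has (m·n)² entries; the factored form needs only m² + n², which is what makes this family of optimizers practical at scale. The thesis's contribution is an alternative explanation of *why* such updates work well, which this sketch does not attempt to reproduce.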
Publication status
published
Contributors
Examiner : Steger, Angelika
Examiner : Aitchison, Laurence
Examiner : Pascanu, Razvan
Publisher
ETH Zurich
Subject
Optimization; Gradient Descent; Machine Learning; Continual learning
Organisational unit
03672 - Steger, Angelika / Steger, Angelika