Towards an Empirically Guided Understanding of the Loss Landscape of Neural Networks


Date

2022

Publication Type

Doctoral Thesis

ETH Bibliography

yes

Abstract

One of the most important and ubiquitous building blocks of machine learning is gradient-based optimization. While it has contributed, and continues to contribute, to the vast majority of recent successes of deep neural networks, it comes both with limitations and with potential for further improvement. Catastrophic forgetting, the subject of the first two parts of this thesis, is one such limitation. It refers to the observation that when gradient-based learning algorithms are asked to learn different tasks sequentially, they overwrite knowledge from earlier tasks. In the machine learning community, several different ideas and formalisations of this problem are being investigated. One of the most difficult versions is a setting in which the use of data from earlier distributions is strictly forbidden. In this domain, an important line of work is the family of so-called regularisation-based algorithms. Our first contribution is to unify a large family of these algorithms by showing that they all rely on the same theoretical idea to limit catastrophic forgetting. Not only had this been unknown; we also show that it is an accidental feature of at least some of the algorithms. To demonstrate the practical impact of these insights, we show how they can be used to make some algorithms more robust and performant across a variety of settings.

The second part of the thesis uses tools from the first part and tackles a similar problem, but does so from a different angle. Namely, it focusses on the phenomenon of catastrophic forgetting – also known as the stability–plasticity dilemma – from the viewpoint of neuroscience. It proposes and analyses a simple synaptic learning rule, based on the stochasticity of synaptic signal transmission, and shows how this learning rule can alleviate catastrophic forgetting in model neural networks.
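As a concrete illustration of the regularisation-based family mentioned in the first part: these methods typically add a quadratic penalty that anchors parameters to their values after an earlier task, weighted by a per-parameter importance estimate (as in, e.g., Elastic Weight Consolidation). The numpy sketch below is illustrative only; the function name, the toy values, and the diagonal importance weights are assumptions for exposition, not code from the thesis.

```python
import numpy as np

def quadratic_penalty(theta, theta_star, importance, lam=1.0):
    """Regularisation term anchoring the current parameters `theta`
    to the parameters `theta_star` learned on an earlier task,
    weighted per parameter by `importance` (e.g. a diagonal
    Fisher-information estimate), scaled by strength `lam`."""
    return 0.5 * lam * np.sum(importance * (theta - theta_star) ** 2)

# Toy usage: the first parameter is deemed important (weight 10.0)
# and is therefore penalised much more strongly for drifting than
# the second (weight 0.1).
theta_star = np.array([1.0, -2.0])   # parameters after the old task
importance = np.array([10.0, 0.1])   # per-parameter importance
theta = np.array([1.5, -1.0])        # parameters during the new task
penalty = quadratic_penalty(theta, theta_star, importance)
print(penalty)
```

Important parameters are pinned near their old values while unimportant ones stay free to adapt to the new task – the stability–plasticity trade-off in miniature.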
Moreover, the learning rule's effects on energy-efficient information processing are investigated, extending prior work that explores computational roles of the aforementioned and somewhat mysterious stochastic nature of synaptic signal transmission.

Finally, the third part of the thesis focuses on potential improvements of standard first-order gradient-based optimizers. One of the most successful lines of work in this area is that of Kronecker-factored optimizers (KFAC), whose influence has reached beyond optimization to areas such as Bayesian machine learning, catastrophic forgetting, and meta-learning. Kronecker-factored optimizers are motivated by, and commonly thought of as, approximations of natural gradient descent, a well-known second-order optimization method. We show that a host of empirical results contradict this view of KFAC as a second-order optimizer, and we propose an alternative, fundamentally different theoretical explanation for its effectiveness. This not only gives important new insights into one of the most powerful optimizers for neural networks, but can also be used to derive a more efficient optimizer.

Publication status

published

Contributors

Examiner: Steger, Angelika
Examiner: Aitchison, Laurence
Examiner: Pascanu, Razvan

Publisher

ETH Zurich

Subject

Optimization; Gradient Descent; Machine Learning; Continual learning

Organisational unit

03672 - Steger, Angelika / Steger, Angelika
