Insights on the interplay of network architectures and optimization algorithms in deep learning

Open access

Author:
Date: 2022-08-19
Type: Doctoral Thesis
ETH Bibliography: yes

Abstract
Over the past decade, deep neural networks have solved ever more complex tasks across many fronts in science, industry, and society. Compared to classical machine learning methods, the tremendous success of deep neural networks builds upon their capacity to extract meaningful embeddings of input data that come in representations with low semantic content. The crucial step before deployment of such models is the "training" procedure, in which the connection strength between (usually millions of) artificial neurons is iteratively adjusted until the network converges to a parameter setting where useful patterns are extracted from the input. Despite the indisputable empirical success of "deep learning", the mathematical toolkit suitable for analyzing and describing this learning process is still rather limited. In the thesis at hand, we take a step towards providing more theoretical foundations by studying the curious interplay of architectural features of neural networks, initialization strategies, and the learning dynamics of numerical optimization algorithms.
From an optimization perspective, learning patterns with neural networks constitutes a high-dimensional, non-convex optimization problem. In general, such problems can be arbitrarily hard to optimize due to the presence of saddle points and spurious local minima. Nevertheless, simple gradient-based optimization algorithms, like (stochastic) gradient descent, perform extremely well in practice, despite the fact that they have largely been designed and studied for the much simpler case of optimizing convex functions. The empirical success of training deep networks therefore suggests that the compositional structure of neural networks, alongside the specific strategies for weight initialization, imposes structure on the loss landscapes of such optimization problems that allows gradient-based methods to perform well despite the apparent obstacles induced by non-convexity.
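As a purely illustrative formulation (the notation below is ours and not taken from the abstract), the training problem and the mini-batch gradient step typically used to attack it can be written as

    \min_{\theta \in \mathbb{R}^d} \; L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(f_\theta(x_i), y_i\bigr),
    \qquad
    \theta_{t+1} = \theta_t - \eta \, \nabla L_{B_t}(\theta_t),

where f_\theta denotes the network with parameters \theta, \ell a loss function, \eta the step size, and \nabla L_{B_t} the gradient evaluated on a random mini-batch B_t; the composition of layers inside f_\theta is what renders L non-convex.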
In the following, we first provide an argument for the particular effectiveness of stochastic gradient descent (SGD) by showing that the inherent noise in sub-sampled gradients is sufficient to escape saddle points, as long as they are sufficiently curved. This is an important insight, since prior works on gradient methods around saddle points required adding exogenous noise, which is not done in practice. Yet, as we show subsequently, not all saddles in neural networks are curved. In fact, standard i.i.d. initialization lets optimizers start from a plateau that becomes increasingly flat as networks grow in depth. While this is particularly unfortunate for SGD, we show that "adaptive gradient methods" such as the Adam optimizer successfully adapt to the flattened curvature and hence escape such plateaus quickly. On the architectural side, we find that the combination of normalization layers with residual connections circumvents this problem in the first place. Thereafter, we highlight another attractive feature of these novel architectural components: they guarantee stable information flow at random initialization, irrespective of the network depth. For one such normalization technique, called "batch normalization", we extend this result to the entire optimization process, proving an accelerated convergence rate for batch-normalized gradient descent in certain settings. To conclude, we propose a Newton-type algorithm for deep learning that generalizes the idea of adaptive gradient methods to second-order methods.
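To make the contrast between SGD and adaptive methods on flat regions concrete, the following minimal NumPy sketch (our own illustration, not code from the thesis; the toy objective f(x, y) = x**4 - y**4 and all names are hypothetical) compares plain gradient steps with an Adam-style update near a degenerate, nearly flat saddle at the origin. Because Adam rescales each coordinate by the square root of a running average of squared gradients, its effective step size stays close to the learning rate even where gradients nearly vanish, whereas SGD barely moves.

    # Toy illustration: escaping a flat saddle of f(x, y) = x**4 - y**4.
    # Near (0, 0) both the gradient and the curvature vanish, so plain
    # gradient steps make little progress, while the Adam-style update
    # keeps moving along the escape direction (growing y).
    import numpy as np

    def grad(p):
        x, y = p
        return np.array([4 * x**3, -4 * y**3])

    def run_sgd(p0, lr=1e-3, steps=2000):
        p = p0.copy()
        for _ in range(steps):
            p -= lr * grad(p)
        return p

    def run_adam(p0, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
        p = p0.copy()
        m = np.zeros_like(p)  # running mean of gradients (first moment)
        v = np.zeros_like(p)  # running mean of squared gradients (second moment)
        for t in range(1, steps + 1):
            g = grad(p)
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g**2
            m_hat = m / (1 - beta1**t)  # bias-corrected estimates
            v_hat = v / (1 - beta2**t)
            p -= lr * m_hat / (np.sqrt(v_hat) + eps)
        return p

    p0 = np.array([0.05, 0.05])  # start close to the flat saddle at the origin
    print("SGD :", run_sgd(p0))   # y has barely grown after 2000 steps
    print("Adam:", run_adam(p0))  # y has escaped far along the -y**4 direction

This caricature only mirrors the qualitative claim above, namely that per-coordinate adaptive preconditioning compensates for flattened curvature around such plateaus.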
Permanent link: https://doi.org/10.3929/ethz-b-000564937
Publication status: published
External links: Search print copy at ETH Library
Publisher: ETH Zurich
Subject: Machine learning; Deep learning; Optimization algorithms
Organisational unit: 09462 - Hofmann, Thomas / Hofmann, Thomas