dc.contributor.author
Kohler, Jonas
dc.contributor.supervisor
Hofmann, Thomas
dc.contributor.supervisor
Cartis, Coralia
dc.contributor.supervisor
Jaggi, Martin
dc.date.accessioned
2022-08-19T08:33:20Z
dc.date.available
2022-08-19T08:17:35Z
dc.date.available
2022-08-19T08:33:20Z
dc.date.issued
2022-08-19
dc.identifier.uri
http://hdl.handle.net/20.500.11850/564937
dc.identifier.doi
10.3929/ethz-b-000564937
dc.description.abstract
In the recent decade, deep neural networks have solved ever more complex tasks across many fronts in science, industry, and society. Compared to classical machine learning methods, the tremendous success of deep neural networks builds upon their capacity to extract meaningful embeddings of input data that come in representations with low semantic content. The crucial step before deployment of such models is the "training" procedure, in which the connection strength between (usually millions of) artificial neurons is iteratively adjusted until the network converges to a parameter setting where useful patterns are extracted from the input. Despite the indisputable empirical success of "deep learning", the mathematical toolkit suitable for analyzing and describing this learning process is still rather limited. In the thesis at hand, we take a step towards providing more theoretical foundations by studying the curious interplay of architectural features of neural networks, initialization strategies, and the learning dynamics of numerical optimization algorithms.

From an optimization perspective, learning patterns with neural networks constitutes a high-dimensional, non-convex optimization problem. In general, such problems can be arbitrarily hard to optimize due to the presence of saddle points and spurious local minima. Nevertheless, simple gradient-based optimization algorithms, like (stochastic) gradient descent, perform extremely well in practice, despite the fact that they have largely been designed and studied for the much simpler case of optimizing convex functions. As a result, the empirical success of training deep networks suggests that the compositional structure of neural networks, alongside the specific strategies for weight initialization, imposes certain structures on the loss landscapes of such optimization problems that allow gradient-based methods to perform well despite the apparent obstacles induced by non-convexity.

In the following, we first provide an argument for the particular effectiveness of stochastic gradient descent (SGD) by showing that the inherent noise in sub-sampled gradients is sufficient to escape saddle points, as long as they are sufficiently curved. This is an important insight, since prior work on gradient methods around saddle points required adding exogenous noise, which is not done in practice. Yet, as we show subsequently, not all saddles in neural networks are curved. In fact, standard i.i.d. initialization lets optimizers start from a plateau that becomes increasingly flat as networks grow in depth. While this is particularly unfortunate for SGD, we show that "adaptive gradient methods" such as the Adam optimizer successfully adapt to the flattened curvature and hence escape such plateaus quickly. On the architectural side, we find that the combination of normalization layers with residual connections circumvents this problem in the first place. Thereafter, we highlight another attractive feature of these novel architectural components: namely, they guarantee stable information flow at random initialization, irrespective of the network depth. For one such normalization technique, called "batch normalization", we extend this result to the entire optimization process, proving an accelerated convergence rate for batch-normalized gradient descent in certain settings. To conclude, we propose a Newton-type algorithm for deep learning that generalizes the idea of adaptive gradient methods to second-order methods.
en_US
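The abstract contrasts plain SGD with adaptive gradient methods such as Adam on the nearly flat plateaus induced by deep i.i.d. initialization. The following minimal sketch is illustrative only and not code from the thesis; the toy loss, starting point, and hyperparameters are arbitrary assumptions. It shows the mechanism in question: Adam rescales its step by a running second-moment estimate of the gradient, so the update stays close to the nominal step size even where the raw gradient is tiny, whereas the SGD step shrinks with the gradient itself.

```python
# Illustrative sketch (not from the thesis): on a nearly flat plateau the raw
# gradient is tiny, so a plain SGD step barely moves, while Adam's update is
# normalized by its second-moment estimate and stays close to the step size.
import math

def grad(w):
    # Toy loss f(w) = (w - 3)**6, which is extremely flat around its minimum at w = 3.
    return 6.0 * (w - 3.0) ** 5

def run_sgd(w, lr=1e-2, steps=500):
    for _ in range(steps):
        w -= lr * grad(w)           # step length proportional to the (tiny) gradient
    return w

def run_adam(w, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # first-moment (momentum) estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias corrections
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)  # curvature-adaptive step
    return w

if __name__ == "__main__":
    w0 = 2.9  # on the flat plateau: grad(w0) is about -6e-5
    print("SGD  ends at", run_sgd(w0))   # barely moves away from w0
    print("Adam ends at", run_adam(w0))  # reaches the neighbourhood of the minimum at w = 3
```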
dc.format
application/pdf
en_US
dc.language.iso
en
en_US
dc.publisher
ETH Zurich
en_US
dc.rights.uri
http://rightsstatements.org/page/InC-NC/1.0/
dc.subject
Machine learning
en_US
dc.subject
Deep learning
en_US
dc.subject
Optimization algorithms
en_US
dc.title
Insights on the interplay of network architectures and optimization algorithms in deep learning
en_US
dc.type
Doctoral Thesis
dc.rights.license
In Copyright - Non-Commercial Use Permitted
dc.date.published
2022-08-19
ethz.size
226 p.
en_US
ethz.code.ddc
DDC - DDC::0 - Computer science, information & general works::000 - Generalities, science
en_US
ethz.identifier.diss
28363
en_US
ethz.publication.place
Zurich
en_US
ethz.publication.status
published
en_US
ethz.leitzahl
ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02150 - Dep. Informatik / Dep. of Computer Science::02661 - Institut für Maschinelles Lernen / Institute for Machine Learning::09462 - Hofmann, Thomas / Hofmann, Thomas
en_US
ethz.tag
PhD Thesis
en_US
ethz.date.deposited
2022-08-19T08:17:41Z
ethz.source
FORM
ethz.eth
yes
en_US
ethz.availability
Open access
en_US
ethz.rosetta.installDate
2022-08-19T08:33:27Z
ethz.rosetta.lastUpdated
2023-02-07T05:26:06Z
ethz.rosetta.versionExported
true
ethz.COinS
ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.atitle=Insights%20on%20the%20interplay%20of%20network%20architectures%20and%20optimization%20algorithms%20in%20deep%20learning&rft.date=2022-08-19&rft.au=Kohler,%20Jonas&rft.genre=unknown&rft.btitle=Insights%20on%20the%20interplay%20of%20network%20architectures%20and%20optimization%20algorithms%20in%20deep%20learning