dc.contributor.author
Kohler, Jonas
dc.contributor.supervisor
Hofmann, Thomas
dc.contributor.supervisor
Cartis, Coralia
dc.contributor.supervisor
Jaggi, Martin
dc.date.accessioned
2022-08-19T08:33:20Z
dc.date.available
2022-08-19T08:17:35Z
dc.date.available
2022-08-19T08:33:20Z
dc.date.issued
2022-08-19
dc.identifier.uri
http://hdl.handle.net/20.500.11850/564937
dc.identifier.doi
10.3929/ethz-b-000564937
dc.description.abstract
In the recent decade, deep neural networks have solved ever more complex tasks across many fronts in science, industry, and society. Compared to classical machine learning methods, the tremendous success of deep neural networks builds upon their capacity to extract meaningful embeddings of input data that come in representations with low semantic content. The crucial step before deployment of such models is the "training" procedure, in which the connection strength between (usually millions of) artificial neurons is iteratively adjusted until the network converges to a parameter setting where useful patterns are extracted from the input. Despite the indisputable empirical success of "deep learning", the mathematical toolkit suitable for analyzing and describing this learning process is still rather limited. In the thesis at hand, we take a step towards providing more theoretical foundations by studying the curious interplay of architectural features of neural networks, initialization strategies, and the learning dynamics of numerical optimization algorithms.

From an optimization perspective, learning patterns with neural networks constitutes a high-dimensional, non-convex optimization problem. In general, such problems can be arbitrarily hard to optimize due to the presence of saddle points and spurious local minima. Nevertheless, simple gradient-based optimization algorithms, like (stochastic) gradient descent, perform extremely well in practice, despite the fact that they have largely been designed and studied for the much simpler case of optimizing convex functions. As a result, the empirical success of training deep networks suggests that the compositional structure of neural networks, alongside the specific strategies for weight initialization, imposes certain structures on the loss landscapes of such optimization problems that allow gradient-based methods to perform well despite the apparent obstacles induced by non-convexity.

In the following, we first provide an argument for the particular effectiveness of stochastic gradient descent (SGD) by showing that the inherent noise in sub-sampled gradients is sufficient to escape saddle points, as long as they are sufficiently curved. This is an important insight, since prior work on gradient methods around saddle points required adding exogenous noise, which is not done in practice. Yet, as we show subsequently, not all saddles in neural networks are curved. In fact, standard i.i.d. initialization lets optimizers start from a plateau that becomes increasingly flat as networks grow in depth. While this is particularly unfortunate for SGD, we show that "adaptive gradient methods" such as the Adam optimizer successfully adapt to the flattened curvature and hence escape such plateaus quickly. On the architectural side, we find that the combination of normalization layers with residual connections circumvents this problem in the first place. Thereafter, we highlight another attractive feature of these novel architectural components: namely, they guarantee stable information flow at random initialization, irrespective of the network depth. For one such normalization technique, called "batch normalization", we extend this result to the entire optimization process, proving an accelerated convergence rate for batch-normalized gradient descent in certain settings. To conclude, we propose a Newton-type algorithm for deep learning that generalizes the idea of adaptive gradient methods to second-order methods.
en_US
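The abstract contrasts plain SGD with adaptive gradient methods such as Adam on the nearly flat plateaus induced by deep i.i.d. initialization. The following minimal sketch is illustrative only and not code from the thesis; the toy loss, starting point, and hyperparameters are arbitrary assumptions. It shows the mechanism in question: Adam rescales its step by a running second-moment estimate of the gradient, so the update stays close to the nominal step size even where the raw gradient is tiny, whereas the SGD step shrinks with the gradient itself.

```python
# Illustrative sketch (not from the thesis): on a nearly flat plateau the raw
# gradient is tiny, so a plain SGD step barely moves, while Adam's update is
# normalized by its second-moment estimate and stays close to the step size.
import math

def grad(w):
    # Toy loss f(w) = (w - 3)**6, which is extremely flat around its minimum at w = 3.
    return 6.0 * (w - 3.0) ** 5

def run_sgd(w, lr=1e-2, steps=500):
    for _ in range(steps):
        w -= lr * grad(w)           # step length proportional to the (tiny) gradient
    return w

def run_adam(w, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # first-moment (momentum) estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias corrections
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)  # curvature-adaptive step
    return w

if __name__ == "__main__":
    w0 = 2.9  # on the flat plateau: grad(w0) is about -6e-5
    print("SGD  ends at", run_sgd(w0))   # barely moves away from w0
    print("Adam ends at", run_adam(w0))  # reaches the neighbourhood of the minimum at w = 3
```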
dc.format
application/pdf
en_US
dc.language.iso
en
en_US
dc.publisher
ETH Zurich
en_US
dc.rights.uri
http://rightsstatements.org/page/InC-NC/1.0/
dc.subject
Machine learning
en_US
dc.subject
Deep learning
en_US
dc.subject
Optimization algorithms
en_US
dc.title
Insights on the interplay of network architectures and optimization algorithms in deep learning
en_US
dc.type
Doctoral Thesis
dc.rights.license
In Copyright - Non-Commercial Use Permitted
dc.date.published
2022-08-19
ethz.size
226 p.
en_US
ethz.code.ddc
DDC - DDC::0 - Computer science, information & general works::000 - Generalities, science
en_US
ethz.identifier.diss
28363
en_US
ethz.publication.place
Zurich
en_US
ethz.publication.status
published
en_US
ethz.leitzahl
ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02150 - Dep. Informatik / Dep. of Computer Science::02661 - Institut für Maschinelles Lernen / Institute for Machine Learning::09462 - Hofmann, Thomas / Hofmann, Thomas
en_US
ethz.tag
PhD Thesis
en_US
ethz.date.deposited
2022-08-19T08:17:41Z
ethz.source
FORM
ethz.eth
yes
en_US
ethz.availability
Open access
en_US
ethz.rosetta.installDate
2022-08-19T08:33:27Z
ethz.rosetta.lastUpdated
2023-02-07T05:26:06Z
ethz.rosetta.versionExported
true
ethz.COinS
ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.atitle=Insights%20on%20the%20interplay%20of%20network%20architectures%20and%20optimization%20algorithms%20in%20deep%20learning&rft.date=2022-08-19&rft.au=Kohler,%20Jonas&rft.genre=unknown&rft.btitle=Insights%20on%20the%20interplay%20of%20network%20architectures%20and%20optimization%20algorithms%20in%20deep%20learning