The curse of dimensionality and gradient-based training of neural networks: shrinking the gap between theory and applications


Loading...

Date

2023

Publication Type

Doctoral Thesis

ETH Bibliography

yes

Citations

Altmetric

Data

Abstract

Neural networks have gained widespread attention due to their remarkable performance in various applications. Two aspects are particular striking: on the one hand, neural networks seem to enjoy superior approximation capacities than classical methods. On the other hand, neural networks are trained successfully with gradient-based algorithms despite the training task being a highly nonconvex optimization problem. This thesis advances the theory behind these two phenomena. On the aspect of approximation, we develop a framework for showing that neural networks can break the so-called curse of dimensionality in different high-dimensional approximation problems, meaning that the complexity of the neural networks involved scales at most polynomially in the dimension. Our approach is based on the notion of a catalog network, which is a generalization of a feed-forward neural network in which the nonlinear activation functions can vary from layer to layer as long as they are chosen from a predefined catalog of functions. As such, catalog networks constitute a rich family of continuous functions. We show that, under appropriate conditions on the catalog, these catalog networks can efficiently be approximated with rectified linear unit (ReLU)-type networks and provide precise estimates of the number of parameters needed for a given approximation accuracy. As special cases of the general results, we obtain different classes of functions that can be approximated with ReLU networks without the curse of dimensionality. On the aspect of optimization, we investigate the interplay between neural networks and gradient-based training algorithms by studying the loss surface. On the one hand, we discover an obstruction to successful learning due to an unfortunate interplay between the architecture of the network and the initialization of the algorithm. More precisely, we demonstrate that stochastic gradient descent fails to converge for ReLU networks if their depth is much larger than their width and the number of random initializations does not increase to infinity fast enough. On the other hand, we establish positive results by conducting a landscape analysis and applying dynamical systems theory. These positive results deal with the landscape of the true loss of neural networks with one hidden layer and ReLU, leaky ReLU, or quadratic activation. In all three cases, we provide a complete classification of the critical points in the case where the target function is affine and one-dimensional. Next, we prove a new variant of a dynamical systems result, a center-stable manifold theorem, in which we relax some of the regularity requirements usually imposed. We verify that ReLU networks with one hidden layer fit into the new framework. Building on our classification of critical points, we deduce that gradient descent avoids most saddle points. We proceed to prove convergence to global minima if the initialization is sufficiently good, which is expressed by an explicit threshold on the limiting loss.

Publication status

published

Editor

Contributors

Examiner : Cheridito, Patrick
Examiner : Jentzen, Arnulf

Book title

Journal / series

Volume

Pages / Article No.

Publisher

ETH Zurich

Event

Edition / version

Methods

Software

Geographic location

Date collected

Date created

Subject

Organisational unit

09557 - Cheridito, Patrick / Cheridito, Patrick check_circle
02204 - RiskLab / RiskLab check_circle

Notes

Funding

175699 - Higher order numerical approximation methods for stochastic partial differential equations (SNF)

Related publications and datasets