Two Sides of One Coin: the Limits of Untuned SGD and the Power of Adaptive Methods
Date
2024-07
Publication Type
Conference Paper
ETH Bibliography
yes
Abstract
The classical analysis of Stochastic Gradient Descent (SGD) with the polynomially decaying stepsize $\eta_t = \eta/\sqrt{t}$ relies on a well-tuned $\eta$ that depends on problem parameters such as the Lipschitz smoothness constant, which is often unknown in practice. In this work, we prove that SGD with an arbitrary $\eta > 0$, referred to as untuned SGD, still attains an order-optimal convergence rate $O(T^{-1/4})$ in terms of gradient norm for minimizing smooth objectives.
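To make the scheme concrete, the following is a minimal sketch (not taken from the paper) of untuned SGD with the $\eta/\sqrt{t}$ stepsize; the stochastic gradient oracle `grad_oracle`, the defaults, and the function name are illustrative assumptions.

```python
import numpy as np

def untuned_sgd(grad_oracle, x0, eta=1.0, T=1000):
    """SGD with polynomially decaying stepsize eta_t = eta / sqrt(t).

    eta is an arbitrary positive constant: no knowledge of the
    Lipschitz smoothness constant is assumed (the "untuned" regime
    analyzed in the abstract above).
    """
    x = np.asarray(x0, dtype=float)
    for t in range(1, T + 1):
        g = grad_oracle(x)  # unbiased stochastic gradient at x
        x = x - (eta / np.sqrt(t)) * g
    return x
```

Per the result above, this loop converges at the $O(T^{-1/4})$ rate for any $\eta > 0$, but the constant hidden in the rate can depend exponentially on the smoothness parameter.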
Unfortunately, this guarantee comes at the expense of a catastrophic exponential dependence on the smoothness constant, which we show is unavoidable for this scheme even in the noiseless setting. We then examine three families of adaptive methods -- Normalized SGD (NSGD), AMSGrad, and AdaGrad -- unveiling their power in preventing such an exponential dependence without knowledge of the smoothness parameter or boundedness assumptions on the stochastic gradients. Our results provide theoretical justification for the advantage of adaptive methods over untuned SGD in alleviating the issue with large gradients.
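For contrast, here is a minimal sketch of a Normalized SGD update, one of the three adaptive families named above; the $\gamma/\sqrt{t}$ schedule and the small `eps` guard are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def normalized_sgd(grad_oracle, x0, gamma=1.0, T=1000, eps=1e-12):
    """Normalized SGD (NSGD): step along g / ||g|| instead of g.

    Normalizing makes the step length gamma / sqrt(t) independent of
    the gradient magnitude, which is the mechanism that avoids the
    exponential blow-up untuned SGD can suffer in steep regions.
    The gamma / sqrt(t) schedule and the eps guard against division
    by zero are illustrative choices.
    """
    x = np.asarray(x0, dtype=float)
    for t in range(1, T + 1):
        g = grad_oracle(x)  # unbiased stochastic gradient at x
        x = x - (gamma / np.sqrt(t)) * g / (np.linalg.norm(g) + eps)
    return x
```

Both sketches can be exercised with a toy oracle such as `grad_oracle = lambda x: 2 * x` (the gradient of $\|x\|^2$) to compare trajectories under large initial gradients.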
Publication status
published
Book title
Advances in Neural Information Processing Systems 36
Pages / Article No.
74257 - 74288
Publisher
Curran
Event
37th Annual Conference on Neural Information Processing Systems (NeurIPS 2023)
Subject
stochastic optimization; nonconvex optimization; adaptive methods; parameter free algorithms
Organisational unit
09729 - He, Niao
02219 - ETH AI Center
Notes
Poster presentation on December 14, 2023.
Funding
207343 - RING: Robust Intelligence with Nonconvex Games (SNF)