Two Sides of One Coin: the Limits of Untuned SGD and the Power of Adaptive Methods



Date

2024-07

Publication Type

Conference Paper

ETH Bibliography

yes


Data

Rights / License

Abstract

The classical analysis of Stochastic Gradient Descent (SGD) with polynomially decaying stepsize $\eta_t = \eta/\sqrt{t}$ relies on a well-tuned $\eta$ that depends on problem parameters such as the Lipschitz smoothness constant, which is often unknown in practice. In this work, we prove that SGD with arbitrary $\eta > 0$, referred to as untuned SGD, still attains an order-optimal convergence rate $O(T^{-1/4})$ in terms of gradient norm for minimizing smooth objectives. Unfortunately, this comes at the expense of a catastrophic exponential dependence on the smoothness constant, which we show is unavoidable for this scheme even in the noiseless setting. We then examine three families of adaptive methods -- Normalized SGD (NSGD), AMSGrad, and AdaGrad -- unveiling their power in preventing such an exponential dependence without requiring knowledge of the smoothness parameter or boundedness of the stochastic gradients. Our results provide theoretical justification for the advantage of adaptive methods over untuned SGD in alleviating the issue of large gradients.
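
To make the contrast concrete, the following is a minimal one-dimensional sketch, not the paper's algorithms or constants: the quadratic objective, the smoothness constant L_SMOOTH, the choice eta = 1, the horizon T, and the Gaussian noise level are all illustrative assumptions, and the normalized update shown is the plain NSGD step rather than the exact variant analyzed in the paper. It compares untuned SGD with the decaying stepsize $\eta_t = \eta/\sqrt{t}$ against a normalized step of the same size.

import numpy as np

# Illustrative setup (an assumption, not from the paper): a 1-D quadratic
# f(x) = 0.5 * L * x^2, so grad f(x) = L * x and L is the smoothness constant.
L_SMOOTH = 10.0
grad_f = lambda x: L_SMOOTH * x

def untuned_sgd(x0, eta, T, rng, noise=0.1):
    # SGD with polynomially decaying stepsize eta_t = eta / sqrt(t),
    # where eta is NOT tuned to the (unknown) smoothness constant.
    x, grad_norms = x0, []
    for t in range(1, T + 1):
        g = grad_f(x) + noise * rng.standard_normal()   # stochastic gradient
        x = x - (eta / np.sqrt(t)) * g
        grad_norms.append(abs(grad_f(x)))
    return grad_norms

def normalized_sgd(x0, eta, T, rng, noise=0.1):
    # Plain Normalized SGD: step along g / |g|, so the update size is never
    # amplified by a large gradient magnitude.
    x, grad_norms = x0, []
    for t in range(1, T + 1):
        g = grad_f(x) + noise * rng.standard_normal()
        x = x - (eta / np.sqrt(t)) * g / (abs(g) + 1e-12)
        grad_norms.append(abs(grad_f(x)))
    return grad_norms

rng = np.random.default_rng(0)
eta, T = 1.0, 1000   # eta deliberately larger than the "tuned" choice 1/L
sgd_norms = untuned_sgd(1.0, eta, T, rng)
nsgd_norms = normalized_sgd(1.0, eta, T, rng)
print(f"untuned SGD: max |grad| = {max(sgd_norms):.2e}, final |grad| = {sgd_norms[-1]:.2e}")
print(f"NSGD       : max |grad| = {max(nsgd_norms):.2e}, final |grad| = {nsgd_norms[-1]:.2e}")

With these hypothetical constants, the untuned run typically passes through gradient norms many orders of magnitude larger than anything the normalized run produces before both eventually settle near the noise floor, mirroring the qualitative message of the abstract: untuned SGD still converges, but at the cost of a large transient governed by the smoothness constant, which the normalized update avoids.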

Publication status

published

Book title

Advances in Neural Information Processing Systems 36

Journal / series

Volume

Pages / Article No.

74257 - 74288

Publisher

Curran

Event

37th Annual Conference on Neural Information Processing Systems (NeurIPS 2023)

Edition / version

Methods

Software

Geographic location

Date collected

Date created

Subject

stochastic optimization; nonconvex optimization; adaptive methods; parameter free algorithms

Organisational unit

09729 - He, Niao / He, Niao
02219 - ETH AI Center / ETH AI Center

Notes

Poster presentation on December 14, 2023.

Funding

207343 - RING: Robust Intelligence with Nonconvex Games (SNF)

Related publications and datasets