Metadata only
Date
2024-02-05
Type
- Conference Paper
ETH Bibliography
yes
Altmetrics
Abstract
Adaptive gradient methods, e.g., ADAM, have achieved tremendous success in data-driven machine learning, especially deep learning. By employing adaptive learning rates according to the gradients, such methods are able to attain rapid training of modern deep neural networks. Nevertheless, they are observed to suffer from compromised generalization capacity compared with stochastic gradient descent (SGD) and tend to be trapped in local minima at an early stage of the training process. Intriguingly, we discover that this issue can be resolved by substituting the gradient in the second raw moment estimate term of ADAM with its exponential moving average version. The intuition is that the gradient with momentum contains more accurate directional information, so its second-moment estimation is a preferable choice for learning rate scaling compared with that of the raw gradient. We thereby propose ADAM³, a new optimizer that trains quickly while generalizing much better. Extensive experiments on a variety of tasks and models demonstrate that ADAM³ consistently exhibits state-of-the-art performance and superior training stability. Given the simplicity and effectiveness of ADAM³, we believe it has the potential to become a new standard method in deep learning. Code is provided at https://github.com/wyzjack/AdaM3.
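To make the described modification concrete, below is a minimal, hypothetical sketch of a single AdaM3-style update step in Python (NumPy): the second moment estimate is computed from the exponential moving average of the gradient (the momentum) rather than from the raw gradient, while the rest follows the standard ADAM recipe. The function name, hyperparameter defaults, and bias-correction details are assumptions for illustration; the exact algorithm is specified in the paper and the linked repository.

```python
import numpy as np

def adam3_update(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One hypothetical AdaM3-style step (illustrative sketch, not the reference code).

    Difference from ADAM: the second moment v tracks the momentum m_t
    instead of the raw gradient g_t.
    """
    m = beta1 * m + (1 - beta1) * grad     # first moment (momentum), as in ADAM
    v = beta2 * v + (1 - beta2) * m * m    # second moment of the momentum, not of grad
    m_hat = m / (1 - beta1 ** t)           # bias correction, as in ADAM (assumed here)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```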
Publication status
published
External links
Book title
2023 IEEE International Conference on Data Mining (ICDM)
Pages / Article No.
Publisher
IEEE
Event
Subject
adaptive gradient method; data-driven deep learning
Organisational unit
02652 - Institut für Bildverarbeitung / Computer Vision Laboratory
Notes
Conference lecture held on December 2, 2023.