Open access
Author
Date
2023-05-02
Type
Master Thesis
ETH Bibliography
yes
Abstract
Neural networks (NNs) are growing deeper and more complex, to the point where training on a single accelerator is no longer an option. Training today's state-of-the-art NNs is done in parallel over thousands of GPUs. Preconditioning-based optimizers are also receiving increasing attention in distributed training. We conduct a literature review of existing distributed second-order methods for training NNs. We examine two well-known preconditioning methods, K-FAC and Shampoo, in detail, and describe approaches for distributing their additional computations across multiple GPUs. We implement distributed K-FAC (distr. K-FAC) and distributed Shampoo (distr. Shampoo) in PyTorch. Based on our analysis of the performance of both algorithms, we introduce 3D-Shampoo, an extension of Shampoo to training in 3D parallelism settings (i.e., a combination of data, operator, and pipeline parallelism). 3D-Shampoo combines 3D parallelism from the DeepSpeed library (Rasley et al., 2020) with a modified version of the Shampoo optimizer (Gupta et al., 2018), and is designed for very large language models that support operator parallelism, such as Megatron-LM's GPT-2 (Narayanan et al., 2021). The final part of this thesis describes the 3D-Shampoo algorithm, how it works, and its performance on Megatron-LM's GPT-2 at different levels of parallelism. 3D-Shampoo achieves throughput (tokens processed per second) competitive with the SGD optimizer under all forms of parallelism (data, operator, pipeline, and their combination) when training GPT-2-like Transformer models. The code used for our experiments is publicly available.
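For readers unfamiliar with Shampoo, the single-GPU version of the update it distributes can be sketched as follows. This is a minimal NumPy illustration of Shampoo's two-sided preconditioning for one matrix-shaped parameter (Gupta et al., 2018), not the thesis's distributed 3D-Shampoo implementation; the function names, the epsilon value, and the eigendecomposition-based inverse-root routine are assumptions made here for illustration.

```python
import numpy as np

def inv_fourth_root(M, eps=1e-6):
    # M is symmetric positive semi-definite; compute M^{-1/4}
    # via an eigendecomposition (one simple way to do it).
    w, Q = np.linalg.eigh(M)
    w = np.maximum(w, 0.0) + eps  # guard against tiny/negative eigenvalues
    return (Q * w ** -0.25) @ Q.T

def shampoo_step(W, G, L, R, lr=0.01):
    # Accumulate the two Kronecker-factored gradient statistics:
    # L (rows) and R (columns) of the m x n gradient G.
    L += G @ G.T
    R += G.T @ G
    # Precondition the gradient from both sides, then take an SGD-like step.
    precond_G = inv_fourth_root(L) @ G @ inv_fourth_root(R)
    return W - lr * precond_G, L, R
```

In a distributed setting, the expensive parts are accumulating L and R and computing their inverse roots; the thesis's distr. Shampoo and 3D-Shampoo spread exactly this work across GPUs alongside data, operator, and pipeline parallelism.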
Permanent link
https://doi.org/10.3929/ethz-b-000615331
Publication status
published
Publisher
ETH Zurich
Subject
Artificial intelligence (AI); Deep Learning; High Performance Computing; Mathematical Optimization; Distributed algorithms; GPU
Organisational unit
03950 - Hoefler, Torsten / Hoefler, Torsten
Related publications and datasets
Is supplemented by: https://github.com/noabauma/3d-shampoo