BAGUA: Scaling up Distributed Learning with System Relaxations
OPEN ACCESS
Date
2021
Publication Type
Journal Article
ETH Bibliography
yes
Abstract
Recent years have witnessed a growing list of systems for distributed data-parallel training. Existing systems largely fit into two paradigms: parameter server and MPI-style collective operations. On the algorithmic side, researchers have proposed a wide range of techniques to lower communication via "system relaxations": quantization, decentralization, and communication delay. However, most, if not all, existing systems rely only on standard synchronous and asynchronous stochastic gradient (SG) based optimization and therefore cannot take advantage of all the optimizations that the machine learning community has been developing recently. Given this emerging gap between the current landscapes of systems and theory, we build BAGUA, an MPI-style communication library that provides a collection of primitives and is both flexible and modular enough to support state-of-the-art system relaxation techniques for distributed training. Powered by this design, BAGUA can implement and extend various state-of-the-art distributed learning algorithms. In a production cluster with up to 16 machines (128 GPUs), BAGUA can outperform PyTorch-DDP, Horovod and BytePS in end-to-end training time by a significant margin (up to 2x) across a diverse range of tasks. Moreover, we conduct a rigorous tradeoff exploration showing that different algorithms and system relaxations achieve the best performance under different network conditions.
Publication status
published
Journal / series
Volume
15 (4)
Pages / Article No.
804 - 813
Publisher
Association for Computing Machinery
Organisational unit
09588 - Zhang, Ce (ehemalig) / Zhang, Ce (former)