Synchronous Multi-GPU Deep Learning with Low-Precision Communication: An Experimental Study
Tam, Leo K.
Conference Paper
Training deep learning models has received tremendous research interest recently. In particular, there has been intensive research on reducing the communication cost of training when using multiple computational devices, through reducing the precision of the underlying data representation. Naturally, such methods induce system trade-offs: lowering communication precision could decrease communication overheads and improve scalability; but, on the other hand, it can also reduce the accuracy of training. In this paper, we study this trade-off space, and ask: Can low-precision communication consistently improve the end-to-end performance of training modern neural networks, with no accuracy loss? From the performance point of view, the answer to this question may appear deceptively easy: compressing communication through low precision should help when the ratio between communication and computation is high. However, this answer is less straightforward when we try to generalize this principle across various neural network architectures (e.g., AlexNet vs. ResNet), number of GPUs (e.g., 2 vs. 8 GPUs), machine configurations (e.g., EC2 instances vs. NVIDIA DGX-1), communication primitives (e.g., MPI vs. NCCL), and even different GPU architectures (e.g., Kepler vs. Pascal). Currently, it is not clear how a realistic realization of all these factors maps to the speed-up provided by low-precision communication. In this paper, we conduct an empirical study to answer this question and report the insights.
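To make the trade-off concrete, the sketch below shows one common form of low-precision communication: uniformly quantizing float32 gradients to 8-bit integers plus a per-tensor scale before they are exchanged, then dequantizing on the receiver. This is a minimal illustration, not the scheme evaluated in the paper; the function names and the choice of int8 are assumptions for the example.

```python
import numpy as np

def quantize_int8(grad):
    # Illustrative uniform quantizer (not the paper's method): map a
    # float32 gradient tensor to int8 plus a per-tensor scale, so each
    # element costs 1 byte on the wire instead of 4.
    scale = float(np.max(np.abs(grad))) / 127.0
    if scale == 0.0:
        return np.zeros(grad.shape, dtype=np.int8), 0.0
    q = np.clip(np.round(grad / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    # Receiver reconstructs an approximate float32 gradient.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
g = rng.standard_normal(1024).astype(np.float32)

q, s = quantize_int8(g)
g_hat = dequantize_int8(q, s)

# ~4x less traffic per tensor, at the cost of bounded rounding error
# (at most half a quantization step per element).
print(g.nbytes, q.nbytes)  # 4096 1024
```

Whether this 4x reduction in bytes translates into end-to-end speed-up is exactly the question the paper studies: it depends on how large the communication phase is relative to computation on a given network, GPU count, interconnect, and communication primitive.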
Book title: Proceedings of the 21st International Conference on Extending Database Technology
Organisational unit: 09588 - Zhang, Ce
Funding: 167266 - Dapprox: Dependency-aware Approximate Analytics and Processing Platforms (SNF)