Synchronous Multi-GPU Deep Learning with Low-Precision Communication: An Experimental Study
Tam, Leo K.
- Conference Paper
Training deep learning models has received tremendous research interest recently. In particular, there has been intensive research on reducing the communication cost of training when using multiple computational devices, through reducing the precision of the underlying data representation. Naturally, such methods induce system trade-offs: lowering communication precision could decrease communication overheads and improve scalability, but it can also reduce the accuracy of training. In this paper, we study this trade-off space, and ask: Can low-precision communication consistently improve the end-to-end performance of training modern neural networks, with no accuracy loss? From the performance point of view, the answer to this question may appear deceptively easy: compressing communication through low precision should help when the ratio between communication and computation is high. However, this answer is less straightforward when we try to generalize this principle across various neural network architectures (e.g., AlexNet vs. ResNet), numbers of GPUs (e.g., 2 vs. 8 GPUs), machine configurations (e.g., EC2 instances vs. NVIDIA DGX-1), communication primitives (e.g., MPI vs. NCCL), and even different GPU architectures (e.g., Kepler vs. Pascal). Currently, it is not clear how a realistic realization of all these factors maps to the speedup provided by low-precision communication. In this paper, we conduct an empirical study to answer this question and report the insights.
Book title: Proceedings of the 21st International Conference on Extending Database Technology
Organisational unit: 09588 - Zhang, Ce / Zhang, Ce
167266 - Dapprox: Dependency-ware Approximate Analytics and Processing Platforms (SNF)