Benchmarking and Optimizations of Data Shuffling on High-performance Networks

Open access
Author
Date
2021
Type
Master Thesis
ETH Bibliography
yes
Abstract
Remote Direct Memory Access (RDMA) technology is important for large-scale data communication at rates of gigabytes per second. InfiniBand (IB) is a widely used RDMA point-to-point interconnect whose full potential can be realized only with an efficient networking library. In this thesis we address this challenge by benchmarking the bandwidth of two recent and popular open-source RDMA libraries: NetIO from CERN and HPNL from Intel. Our main objective is to design a shuffle operator with these RDMA libraries and to benchmark it comprehensively along two dimensions: scalability and platform consistency. We study network usage patterns of increasing complexity: from simple unidirectional and bidirectional streaming between two nodes, to pairwise all-to-all bidirectional streaming among multiple nodes, and finally to a data-dependent shuffling operation across the full cluster. We create multiple designs and run them on two different IB clusters, euler (QDR) and r630 (FDR), to check consistency across cluster infrastructures. With tuned performance parameters such as message size and number of in-flight messages, our results show that neither library scales reliably well for shuffling: bandwidth trends downward as the cluster size increases, although it remains comparable to that of the corresponding all-to-all streaming benchmarks. Compared to the peak reference bidirectional bandwidth, the best-performing NetIO shuffling design with optimal parameters achieves about 80-90% on euler and about 80% on r630, up to node counts of 6 and 5 respectively, for a message size of 128 KB. HPNL's best design, by contrast, reaches more than 70% on euler and around 50% on r630 for 1 MB messages up to a cluster size of 7, showing that NetIO marginally outperforms HPNL.
Permanent link
https://doi.org/10.3929/ethz-b-000501090
Publication status
published
Publisher
ETH Zurich
Organisational unit
03506 - Alonso, Gustavo / Alonso, Gustavo