Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI


METADATA ONLY

Date

2024

Publication Type

Conference Paper

ETH Bibliography

yes

Abstract

In the Fully Sharded Data Parallel (FSDP) training pipeline, collective operations can be interleaved to maximize the communication/computation overlap. In this scenario, outstanding operations such as Allgather and Reduce-Scatter can compete for the injection bandwidth and create pipeline bubbles. To address this problem, we propose a novel bandwidth-optimal Allgather collective algorithm that leverages hardware multicast. We use multicast to build a constant-time reliable Broadcast protocol, a building block for constructing an optimal Allgather schedule. Our Allgather algorithm achieves 2× traffic reduction on a 188-node testbed. To free the host side from running the protocol, we employ SmartNIC offloading. We extract the parallelism in our Allgather algorithm and map it to a SmartNIC specialized for hiding the cost of data movement. We show that our SmartNIC-offloaded collective progress engine can scale to the next generation of 1.6 Tbit/s links.
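
To make the injection-bandwidth argument in the abstract concrete, the following is a minimal back-of-envelope sketch, not the paper's algorithm or measurement methodology. It assumes a textbook unicast ring Allgather as the baseline and an idealized multicast Allgather in which the fabric replicates packets, so each node injects its own chunk exactly once. All function names are hypothetical and chosen only for illustration.

```python
from typing import Tuple


def ring_allgather_nic_chunks(num_nodes: int) -> Tuple[int, int]:
    """Per-node (sent, received) chunk counts for a unicast ring Allgather."""
    # A ring Allgather runs for num_nodes - 1 steps; in every step each node
    # sends one chunk to its successor and receives one from its predecessor.
    steps = num_nodes - 1
    return steps, steps


def multicast_allgather_nic_chunks(num_nodes: int) -> Tuple[int, int]:
    """Per-node (sent, received) chunk counts with idealized multicast."""
    # Assumption: the network replicates the multicast packet, so a node
    # injects its chunk once; it still receives every other node's chunk.
    return 1, num_nodes - 1


def nic_traffic_ratio(num_nodes: int) -> float:
    """Ratio of per-node NIC traffic (sent + received), ring over multicast."""
    ring_sent, ring_recv = ring_allgather_nic_chunks(num_nodes)
    mc_sent, mc_recv = multicast_allgather_nic_chunks(num_nodes)
    return (ring_sent + ring_recv) / (mc_sent + mc_recv)


if __name__ == "__main__":
    for n in (4, 16, 188):
        print(n, round(nic_traffic_ratio(n), 3))
```

Under these assumptions the ratio is 2(P-1)/P for P nodes, which approaches 2 as P grows. This toy count only illustrates why removing repeated injection can roughly halve per-node traffic; the reduction reported in the abstract was measured by the authors on their testbed and may be defined differently.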

Publication status

published

Book title

SC24: International Conference for High Performance Computing, Networking, Storage and Analysis

Pages / Article No.

10793060

Publisher

IEEE

Event

International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2024)

Subject

Networking; AI accelerators; Clusters

Organisational unit

03950 - Hoefler, Torsten / Hoefler, Torsten
