Efficient flow scheduling in distributed deep learning training with echelon formation
Open access
Date
2022-11
Type
Conference Paper
ETH Bibliography
yes
Abstract
This paper discusses why flow scheduling does not apply to distributed deep learning training and presents EchelonFlow, the first network abstraction to bridge the gap. EchelonFlow deviates from the common belief that semantically related flows should finish at the same time. After extensive workflow analysis of diverse training paradigms, we reached the key observation that distributed training jobs follow strict computation patterns and may consume data at different times. We devise a generic method to model the drastically different computation patterns across training paradigms, and formulate EchelonFlow to regulate flow finish times accordingly. Case studies of mainstream training paradigms under EchelonFlow demonstrate the expressiveness of the abstraction, and our system sketch suggests the feasibility of an EchelonFlow scheduling system.
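The abstract does not give EchelonFlow's actual formulation, but the core observation, that flows feeding a strict computation pattern need not all finish together, can be illustrated with a minimal sketch. The example below assumes a pipeline-parallel job where stage i+1 cannot start until stages 0..i have computed; the helper names `echelon_deadlines` and `paced_rates` are hypothetical illustrations, not the paper's API.

```python
# Illustrative sketch only (not the paper's formulation): in a
# pipeline-parallel job, the flow carrying stage i's output is not
# needed until stage i+1 starts computing, so each flow gets its own
# staggered deadline instead of all flows finishing at time zero.

def echelon_deadlines(stage_compute_times):
    """Deadline for the flow feeding stage i+1 is the cumulative
    compute time of stages 0..i (hypothetical simplification)."""
    deadlines, t = [], 0.0
    for c in stage_compute_times:
        t += c
        deadlines.append(t)
    return deadlines[:-1]  # the last stage has no outbound flow

def paced_rates(flow_sizes, deadlines):
    """Slowest rate that still meets each flow's own deadline; pacing
    later flows down smooths network demand across the pipeline."""
    return [size / d for size, d in zip(flow_sizes, deadlines)]

# Four stages, 10 ms of compute each; each inter-stage flow is 5 MB.
deadlines = echelon_deadlines([10.0, 10.0, 10.0, 10.0])  # ms
rates = paced_rates([5.0, 5.0, 5.0], deadlines)          # MB/ms
print(deadlines)  # [10.0, 20.0, 30.0]
print(rates)      # first flow fastest, later flows progressively slower
```

The contrast with coflow-style "finish together" scheduling is that only the earliest-consumed flow needs the peak rate; the rest can be deliberately slowed without delaying computation.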
Permanent link
https://doi.org/10.3929/ethz-b-000591849
Publication status
published
Book title
HotNets '22: Proceedings of the 21st ACM Workshop on Hot Topics in Networks
Publisher
Association for Computing Machinery
Subject
flow scheduling; data center networks; deep learning