Straggler-Aware In-Network Aggregation for Accelerating Distributed Deep Learning

Hochan Lee, Jaewook Lee, Heewon Kim, Sangheon Pack

Research output: Contribution to journal › Article › peer-review

Abstract

In-network aggregation accelerates distributed deep learning by using a programmable switch to aggregate gradient packets. However, the straggler problem must be addressed to avoid longer training times. In this paper, we propose a straggler-aware in-network aggregation (SAINA) scheme that mitigates the straggler problem while preventing accuracy degradation. In SAINA, the programmable switch aggregates the local gradients of the fastest $k$ workers to exclude stragglers, and changes $k$ adaptively to balance the tradeoff between training speed and accuracy. To this end, we design a switch-friendly convergence detection (SFCD) algorithm, which detects a convergence point and determines $k$ at that point. SAINA is implemented on a software programmable switch, and experimental results show that SAINA reaches a target accuracy up to 2.84x faster than the existing in-network aggregation scheme.
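
The fastest-$k$ idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' switch implementation: the function name `aggregate_fastest_k`, the `(arrival_time, gradient)` input format, and the choice to average rather than sum are all assumptions made for illustration, and the SFCD logic that picks $k$ is not modeled here.

```python
from typing import List, Sequence, Tuple


def aggregate_fastest_k(
    arrivals: Sequence[Tuple[float, List[float]]], k: int
) -> List[float]:
    """Average the gradients of the k earliest-arriving workers.

    arrivals: one (arrival_time, gradient_vector) pair per worker.
    This input format is hypothetical, chosen only for the sketch.
    """
    if not 1 <= k <= len(arrivals):
        raise ValueError("k must be between 1 and the number of workers")

    # Keep only the k workers whose gradients arrived first;
    # later arrivals (the stragglers) are ignored for this round.
    fastest = sorted(arrivals, key=lambda a: a[0])[:k]

    dim = len(fastest[0][1])
    total = [0.0] * dim
    for _, grad in fastest:
        for i, g in enumerate(grad):
            total[i] += g

    # Averaging is an assumption; an aggregator could equally return the sum.
    return [t / k for t in total]


# Three workers; the third is a straggler with an outlying arrival time.
arrivals = [
    (0.3, [1.0, 2.0]),
    (0.1, [3.0, 4.0]),
    (0.9, [100.0, 100.0]),
]
result = aggregate_fastest_k(arrivals, k=2)
```

With `k=2` the worker arriving at 0.9 is excluded, so only the first two gradients contribute. Setting `k` to the number of workers falls back to waiting for everyone, which is the speed-versus-accuracy tradeoff the abstract describes.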

Original language: English
Pages (from-to): 4198-4204
Number of pages: 7
Journal: IEEE Transactions on Services Computing
Volume: 16
Issue number: 6
Publication status: Published - 2023 Nov 1

Bibliographical note

Publisher Copyright:
© 2008-2012 IEEE.

Keywords

  • Distributed deep learning
  • in-network aggregation
  • programmable switch
  • straggler problem

ASJC Scopus subject areas

  • Information Systems and Management
  • Hardware and Architecture
  • Computer Networks and Communications
  • Computer Science Applications
