Distributed deep learning has become essential for training large-scale deep learning models. Beyond a certain scale, training can take days or even months, which can have serious consequences in applications that require rapid adaptation to new trends or timely decision-making. Distributed deep learning is broadly divided into synchronous and asynchronous methods according to how parameter updates are synchronized. The former updates the parameters with the average of the gradients computed by all workers, so its throughput is limited by the slowest worker. The latter is faster because it updates parameters without waiting for the slowest worker, but it can converge more slowly to the optimum due to the stale-gradient problem. In this paper, we propose Dynamic Partial All-Reduce, a synchronous distributed training algorithm that dynamically manages each worker's participation in global synchronization to autonomously control the impact of the straggler problem. When a slow worker is detected, the algorithm limits the straggler's influence by excluding that worker from global communication and letting the remaining workers update the parameters; once the slow worker recovers its normal speed, it rejoins the synchronization group. The exclusion decision weighs the cost of removing one GPU from training (the lost computational power and the omission of the training data assigned to that worker) against the slowdown in speed and convergence caused by the straggler. We implemented this algorithm on top of PyTorch and Horovod, and all experiments were conducted on Tencent Cloud.
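The detect-exclude-readmit cycle described above can be sketched as follows. This is a minimal illustration of the membership logic only, not the paper's implementation: the class name, the median-based detection rule, and the threshold factors are all hypothetical assumptions, and the actual all-reduce over the active group would be performed with Horovod or torch.distributed.

```python
import statistics

class StragglerManager:
    """Hypothetical sketch: track per-worker iteration times and decide
    which workers participate in the next partial all-reduce."""

    def __init__(self, num_workers, slow_factor=1.5, recover_factor=1.1):
        self.slow_factor = slow_factor        # exclude if time > slow_factor * median
        self.recover_factor = recover_factor  # readmit once time <= recover_factor * median
        self.active = set(range(num_workers))

    def update(self, iter_times):
        """iter_times: latest iteration time per worker (seconds).
        Returns the set of workers included in the next synchronization."""
        median = statistics.median(iter_times)
        for w, t in enumerate(iter_times):
            if w in self.active and t > self.slow_factor * median:
                self.active.discard(w)   # straggler detected: drop from sync group
            elif w not in self.active and t <= self.recover_factor * median:
                self.active.add(w)       # speed recovered: rejoin sync group
        return set(self.active)
```

In use, each iteration would gather the workers' timings, call `update`, and run the all-reduce only over the returned set, so the group speed is no longer bound to the slowest worker.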