Korel: Mitigating Stragglers via Real-Time Automatic Mixed Precision in Distributed Deep Learning Environments

  • Hyunseung Jung*
  • Hyung Jun Kim
  • Heonchang Yu

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

In distributed deep learning systems, straggler nodes are a primary factor in delaying gradient synchronization during synchronous training, thereby diminishing overall efficiency. This issue is particularly pronounced in multi-tenant environments, where resource contention from concurrent tasks exacerbates node slowdowns. Existing approaches typically mitigate this problem by excluding slower nodes or reducing their influence during gradient aggregation, which often leads to resource underutilization. In this study, we introduce a novel solution that leverages Automatic Mixed Precision (AMP) to tackle the straggler problem. Our method employs a MAD-based threshold to detect workers impaired by resource contention and dynamically applies AMP to accelerate their computations. Once contention subsides and node performance recovers, AMP is deactivated to maximize efficiency. Experimental evaluations across diverse deep learning tasks reveal that our approach significantly reduces time-to-accuracy in synchronous Distributed Data Parallel (DDP) environments, enabling rapid convergence to high accuracy. In addition, our approach effectively counteracts the delay imposed by stragglers: our mitigation metric indicates that up to 79.4% of the straggler-induced delay is offset, and convergence is up to 14% faster than with traditional Bulk Synchronous Parallel (BSP) methods under adverse conditions.
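The detection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that MAD denotes the median absolute deviation, that each worker reports its most recent step time, and that a worker is flagged when its time exceeds the median by more than k scaled MADs. The function names, the multiplier k = 3.0, and the 1.4826 scaling constant are illustrative assumptions, not values taken from the paper.

```python
from statistics import median

# Scaling constant that makes the MAD comparable to a standard deviation
# for normally distributed data (an assumption; the paper may differ).
MAD_SCALE = 1.4826


def mad_threshold(step_times, k=3.0):
    """Return median + k * scaled MAD of the per-worker step times."""
    med = median(step_times)
    mad = median(abs(t - med) for t in step_times)
    return med + k * MAD_SCALE * mad


def amp_flags(step_times, k=3.0):
    """Return one boolean per worker: True means 'enable AMP for this worker'.

    A worker is treated as a straggler when its step time exceeds the
    MAD-based threshold. When its time falls back under the threshold on a
    later call, the flag returns to False, which models deactivating AMP
    once contention subsides.
    """
    threshold = mad_threshold(step_times, k)
    return [t > threshold for t in step_times]


# Hypothetical step times in seconds; worker 4 is slowed by contention.
contended = [1.00, 1.02, 0.98, 1.01, 1.90]
recovered = [1.00, 1.02, 0.98, 1.01, 1.03]

flags_contended = amp_flags(contended)
flags_recovered = amp_flags(recovered)
```

In a PyTorch training loop, each worker's flag could be passed as the enabled argument of torch.autocast so that only flagged workers run their forward pass in mixed precision. How the step times are collected and shared between workers is left open here.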

Original language: English
Title of host publication: Proceedings - 2025 IEEE 18th International Conference on Cloud Computing, CLOUD 2025
Editors: Rong N. Chang, Carl K. Chang, Jingwei Yang, Nimanthi Atukorala, Dan Chen, Sumi Helal, Sasu Tarkoma, Qiang He, Tevfik Kosar, Claudio Ardagna, Yehia Elkhatib, Petteri Nurmi, Santonu Sarkar
Publisher: IEEE Computer Society
Pages: 23-31
Number of pages: 9
ISBN (Electronic): 9798331555573
Publication status: Published - 2025
Event: 18th IEEE International Conference on Cloud Computing, CLOUD 2025 - Helsinki, Finland
Duration: 2025 Jul 7 to 2025 Jul 12

Publication series

Name: IEEE International Conference on Cloud Computing, CLOUD
ISSN (Print): 2159-6182
ISSN (Electronic): 2159-6190

Conference

Conference: 18th IEEE International Conference on Cloud Computing, CLOUD 2025
Country/Territory: Finland
City: Helsinki
Period: 25/7/7 to 25/7/12

Bibliographical note

Publisher Copyright:
© 2025 IEEE.

Keywords

  • automatic mixed precision
  • distributed deep learning
  • resource contention
  • straggler
  • synchronous training

ASJC Scopus subject areas

  • Software
  • Information Systems
  • Artificial Intelligence
