Abstract
In distributed deep learning systems, straggler nodes are a primary factor in delaying gradient synchronization during synchronous training, thereby diminishing overall efficiency. This issue is particularly pronounced in multi-tenant environments, where resource contention from concurrent tasks exacerbates node slowdowns. Existing approaches typically mitigate this problem by excluding slower nodes or reducing their influence during gradient aggregation, which often leads to resource underutilization. In this study, we introduce a novel solution that leverages Automatic Mixed Precision (AMP) to tackle the straggler problem. Our method employs a threshold based on the median absolute deviation (MAD) to detect workers impaired by resource contention and dynamically applies AMP to accelerate their computations. Once contention subsides and node performance recovers, AMP is deactivated to maximize efficiency. Experimental evaluations across diverse deep learning tasks show that our approach significantly reduces time-to-accuracy in synchronous Distributed Data Parallel (DDP) environments, enabling rapid convergence to high accuracy. In addition, our mitigation metric indicates that up to 79.4% of the straggler-induced delay is offset, and our approach achieves convergence up to 14% faster than traditional Bulk Synchronous Parallel (BSP) methods under adverse conditions.
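The detection step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact rule, so everything here is an assumption: the per-worker step times as input, the cutoff form `median + k * 1.4826 * MAD`, the default `k = 3.0`, and the function names `mad_threshold`, `detect_stragglers`, and `amp_flags` are all hypothetical and are not taken from the paper.

```python
from statistics import median

# Scale factor that makes the MAD comparable to a standard deviation
# for normally distributed data (a common convention, assumed here).
MAD_SCALE = 1.4826


def mad_threshold(times, k=3.0):
    """Return an upper cutoff: median + k * scaled MAD (assumed form)."""
    med = median(times)
    mad = median(abs(t - med) for t in times)
    return med + k * MAD_SCALE * mad


def detect_stragglers(step_times, k=3.0):
    """step_times maps worker id -> most recent step time in seconds.

    Returns the set of workers whose step time exceeds the cutoff.
    """
    cutoff = mad_threshold(list(step_times.values()), k)
    return {w for w, t in step_times.items() if t > cutoff}


def amp_flags(step_times, k=3.0):
    """Hypothetical policy: enable AMP only on workers currently flagged.

    A worker that recovers drops below the cutoff on a later call and
    has its flag reset to False, which mirrors the on/off behaviour the
    abstract describes.
    """
    slow = detect_stragglers(step_times, k)
    return {w: (w in slow) for w in step_times}


if __name__ == "__main__":
    contended = {0: 1.00, 1: 1.02, 2: 0.98, 3: 1.01, 4: 1.75}
    recovered = {0: 1.00, 1: 1.02, 2: 0.98, 3: 1.01, 4: 1.03}
    print(amp_flags(contended))
    print(amp_flags(recovered))
```

In a PyTorch training loop, each worker's flag could be passed as the `enabled` argument of `torch.autocast`, so that the forward pass runs in mixed precision only while the worker is flagged. How the paper measures step time, smooths it across iterations, or shares it between workers is not stated in the abstract and is left out of this sketch.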
| Original language | English |
|---|---|
| Title of host publication | Proceedings - 2025 IEEE 18th International Conference on Cloud Computing, CLOUD 2025 |
| Editors | Rong N. Chang, Carl K. Chang, Jingwei Yang, Nimanthi Atukorala, Dan Chen, Sumi Helal, Sasu Tarkoma, Qiang He, Tevfik Kosar, Claudio Ardagna, Yehia Elkhatib, Petteri Nurmi, Santonu Sarkar |
| Publisher | IEEE Computer Society |
| Pages | 23-31 |
| Number of pages | 9 |
| ISBN (Electronic) | 9798331555573 |
| DOIs | |
| Publication status | Published - 2025 |
| Event | 18th IEEE International Conference on Cloud Computing, CLOUD 2025 - Helsinki, Finland. Duration: 2025 Jul 7 → 2025 Jul 12 |
Publication series
| Name | IEEE International Conference on Cloud Computing, CLOUD |
|---|---|
| ISSN (Print) | 2159-6182 |
| ISSN (Electronic) | 2159-6190 |
Conference
| Conference | 18th IEEE International Conference on Cloud Computing, CLOUD 2025 |
|---|---|
| Country/Territory | Finland |
| City | Helsinki |
| Period | 2025 Jul 7 → 2025 Jul 12 |
Bibliographical note
Publisher Copyright: © 2025 IEEE.
Keywords
- automatic mixed precision
- distributed deep learning
- resource contention
- straggler
- synchronous training
ASJC Scopus subject areas
- Software
- Information Systems
- Artificial Intelligence