Abstract
In-network aggregation accelerates distributed deep learning by using a programmable switch to aggregate gradient packets. However, the straggler problem must be addressed to avoid longer training times. In this paper, we propose a straggler-aware in-network aggregation (SAINA) scheme that mitigates the straggler problem while preventing accuracy degradation. In SAINA, the programmable switch aggregates the local gradients of the fastest $k$ workers to exclude stragglers, and changes $k$ adaptively to balance the tradeoff between training speed and accuracy. To this end, we design a switch-friendly convergence detection (SFCD) algorithm that detects a convergence point and determines $k$ at that point. SAINA is implemented on a software programmable switch, and experimental results show that SAINA reaches a target accuracy up to 2.84x faster than the existing in-network aggregation scheme.
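
The fastest-$k$ aggregation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' switch implementation: the function name `aggregate_fastest_k`, the `(arrival_time, worker_id, gradient)` tuple layout, and the use of a floating-point mean are all assumptions made for illustration. The rule SFCD uses to choose $k$ is not given in the abstract, so $k$ is simply passed in as an argument here.

```python
from typing import List, Sequence, Tuple

# Hypothetical record layout: (arrival_time, worker_id, gradient vector).
Arrival = Tuple[float, int, Sequence[float]]


def aggregate_fastest_k(arrivals: List[Arrival], k: int) -> Tuple[List[float], List[int]]:
    """Average the gradients of the k earliest-arriving workers.

    Returns the element-wise mean and the ids of the workers that were used.
    Workers arriving after the k-th one (the stragglers) are ignored.
    """
    if not 1 <= k <= len(arrivals):
        raise ValueError("k must be between 1 and the number of workers")

    # Keep only the k workers whose gradients arrived first.
    fastest = sorted(arrivals, key=lambda a: a[0])[:k]

    dim = len(fastest[0][2])
    total = [0.0] * dim
    for _, _, grad in fastest:
        if len(grad) != dim:
            raise ValueError("all gradients must have the same length")
        for i, g in enumerate(grad):
            total[i] += g

    mean = [t / k for t in total]
    used = [worker_id for _, worker_id, _ in fastest]
    return mean, used


# Four workers; worker 2 is a straggler and is excluded when k = 3.
arrivals = [
    (0.9, 0, [1.0, 2.0]),
    (0.2, 1, [3.0, 4.0]),
    (5.0, 2, [100.0, 100.0]),
    (0.4, 3, [5.0, 6.0]),
]
mean, used = aggregate_fastest_k(arrivals, k=3)
```

A smaller $k$ finishes each round sooner but discards more gradients, which is the speed and accuracy tradeoff that the adaptive choice of $k$ is meant to balance.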
Original language | English |
---|---|
Pages (from-to) | 4198-4204 |
Number of pages | 7 |
Journal | IEEE Transactions on Services Computing |
Volume | 16 |
Issue number | 6 |
DOIs | |
Publication status | Published - 2023 Nov 1 |
Bibliographical note
Publisher Copyright: © 2008-2012 IEEE.
Keywords
- Distributed deep learning
- in-network aggregation
- programmable switch
- straggler problem
ASJC Scopus subject areas
- Information Systems and Management
- Hardware and Architecture
- Computer Networks and Communications
- Computer Science Applications