TY - GEN
T1 - Addressing Straggler Problem Through Dynamic Partial All-Reduce for Distributed Deep Learning in Heterogeneous GPU Clusters
AU - Kim, Hyung Jun
AU - Song, Chunggeon
AU - Lee, Hwa Min
AU - Yu, Heonchang
N1 - Funding Information:
This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ICAN (ICT Challenge and Advanced Network of HRD) program (IITP-2022-RS-2022-00156439) supervised by the IITP (Institute of Information & Communications Technology Planning & Evaluation).
Funding Information:
This study was carried out with the support of the 'R&D Program for Forest Science Technology (Project No. 2022427C10-2224-0801)' provided by the Korea Forest Service (Korea Forestry Promotion Institute).
Funding Information:
This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2018-0-01405) supervised by the IITP (Institute of Information & Communications Technology Planning & Evaluation).
Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Distributed deep learning is now an unavoidable choice for training large-scale deep learning models. Beyond a certain model size, training can take days or months, which can have serious consequences in applications that require rapid reflection of new trends or timely decision-making. Distributed deep learning methods are broadly divided into synchronous and asynchronous approaches according to how parameter updates are synchronized. The former updates the parameters with the average of the gradients computed by all workers, so the overall processing speed is bound to that of the slowest worker. The latter is faster because it updates the parameters without waiting for the slowest worker, but it can converge more slowly to the optimum due to the stale-gradient problem. In this paper, we propose Dynamic Partial All-Reduce, a distributed learning algorithm that remains synchronous but dynamically manages which workers participate in global synchronization, thereby autonomously controlling the effect of the straggler problem. When a slow worker is detected, the algorithm limits the straggler's influence by excluding that worker from global communication and letting the remaining workers update the parameters; once the slow worker recovers its normal speed, it rejoins the synchronization group. The decision is made by comparing which causes the greater loss in speed and convergence: excluding one GPU from training, together with the omission of the training data assigned to that worker, or the slowdown caused by the straggler. We implemented this algorithm on top of PyTorch and Horovod, and all experiments were conducted on Tencent Cloud.
AB - Distributed deep learning is now an unavoidable choice for training large-scale deep learning models. Beyond a certain model size, training can take days or months, which can have serious consequences in applications that require rapid reflection of new trends or timely decision-making. Distributed deep learning methods are broadly divided into synchronous and asynchronous approaches according to how parameter updates are synchronized. The former updates the parameters with the average of the gradients computed by all workers, so the overall processing speed is bound to that of the slowest worker. The latter is faster because it updates the parameters without waiting for the slowest worker, but it can converge more slowly to the optimum due to the stale-gradient problem. In this paper, we propose Dynamic Partial All-Reduce, a distributed learning algorithm that remains synchronous but dynamically manages which workers participate in global synchronization, thereby autonomously controlling the effect of the straggler problem. When a slow worker is detected, the algorithm limits the straggler's influence by excluding that worker from global communication and letting the remaining workers update the parameters; once the slow worker recovers its normal speed, it rejoins the synchronization group. The decision is made by comparing which causes the greater loss in speed and convergence: excluding one GPU from training, together with the omission of the training data assigned to that worker, or the slowdown caused by the straggler. We implemented this algorithm on top of PyTorch and Horovod, and all experiments were conducted on Tencent Cloud.
KW - Distributed Deep Learning
KW - Straggler
UR - http://www.scopus.com/inward/record.url?scp=85149148122&partnerID=8YFLogxK
U2 - 10.1109/ICCE56470.2023.10043527
DO - 10.1109/ICCE56470.2023.10043527
M3 - Conference contribution
AN - SCOPUS:85149148122
T3 - Digest of Technical Papers - IEEE International Conference on Consumer Electronics
BT - 2023 IEEE International Conference on Consumer Electronics, ICCE 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2023 IEEE International Conference on Consumer Electronics, ICCE 2023
Y2 - 6 January 2023 through 8 January 2023
ER -