Straggler-Aware In-Network Aggregation for Accelerating Distributed Deep Learning

Hochan Lee, Jaewook Lee, Heewon Kim, Sangheon Pack

    Research output: Contribution to journal › Article › peer-review

    3 Citations (Scopus)

    Abstract

    In-network aggregation accelerates distributed deep learning by using a programmable switch to aggregate gradient packets. However, the straggler problem must be addressed to avoid longer training times. In this paper, we propose a straggler-aware in-network aggregation (SAINA) scheme that mitigates the straggler problem while preventing accuracy degradation. In SAINA, the programmable switch aggregates the local gradients of the fastest $k$ workers to exclude stragglers, and changes $k$ adaptively to balance the tradeoff between training speed and accuracy. To this end, we design a switch-friendly convergence detection (SFCD) algorithm that detects a convergence point and determines $k$ at that point. SAINA is implemented on a software programmable switch, and experimental results show that SAINA reaches a target accuracy up to 2.84x faster than the existing in-network aggregation scheme.
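
The fastest-$k$ aggregation step described in the abstract can be illustrated with a minimal Python sketch. This is only an illustration under stated assumptions, not the authors' switch implementation: the function name `aggregate_fastest_k`, the `(arrival_time, worker_id, gradient)` tuple layout, and the choice to return an element-wise sum are all hypothetical. The abstract does not describe how SFCD selects $k$, so $k$ is simply passed in here.

```python
from typing import List, Tuple

# Hypothetical record layout: (arrival_time_seconds, worker_id, gradient_vector)
Arrival = Tuple[float, int, List[float]]


def aggregate_fastest_k(arrivals: List[Arrival], k: int) -> Tuple[List[float], List[int]]:
    """Sum the gradients of the k earliest-arriving workers.

    Returns the element-wise sum and the ids of the contributing workers.
    Gradients from the remaining (straggling) workers are ignored.
    """
    if not 1 <= k <= len(arrivals):
        raise ValueError("k must be between 1 and the number of workers")

    # Keep only the k workers whose gradients arrived first.
    fastest = sorted(arrivals, key=lambda a: a[0])[:k]

    dim = len(fastest[0][2])
    total = [0.0] * dim
    for _, _, grad in fastest:
        for i, value in enumerate(grad):
            total[i] += value

    workers = [worker_id for _, worker_id, _ in fastest]
    return total, workers


# Illustrative data: worker 2 is a straggler and is excluded when k = 3.
arrivals = [
    (0.10, 0, [1.0, 2.0]),
    (0.12, 1, [3.0, 4.0]),
    (0.95, 2, [100.0, 100.0]),
    (0.11, 3, [5.0, 6.0]),
]
summed, workers = aggregate_fastest_k(arrivals, k=3)
print(summed, workers)
```

With $k = 3$, the slow worker's gradient never enters the sum, which is the speed side of the tradeoff the abstract mentions. The accuracy side comes from discarding that worker's contribution, which is why the scheme adjusts $k$ over time.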

    Original language: English
    Pages (from-to): 4198-4204
    Number of pages: 7
    Journal: IEEE Transactions on Services Computing
    Volume: 16
    Issue number: 6
    Publication status: Published - 2023 Nov 1

    Bibliographical note

    Publisher Copyright:
    © 2008-2012 IEEE.

    Keywords

    • Distributed deep learning
    • in-network aggregation
    • programmable switch
    • straggler problem

    ASJC Scopus subject areas

    • Information Systems and Management
    • Hardware and Architecture
    • Computer Networks and Communications
    • Computer Science Applications
