TensorExpress: In-network communication scheduling for distributed deep learning

Minkoo Kang, Gyeongsik Yang, Yeonho Yoo, Chuck Yoo

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    15 Citations (Scopus)

    Abstract

    TensorExpress provides in-network communication scheduling for distributed deep learning (DDL). In cloud-based DDL, parameter communication over a network is a key bottleneck. Previous studies proposed tensor packet reordering approaches to reduce network blocking time. However, network contention still exists in DDL. TensorExpress mitigates network contention and reduces overall training time. It schedules tensor packets in-network using P4, a switch programming language. TensorExpress improves latency and network blocking time by up to 2.5 and 2.44 times, respectively.
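
    The abstract describes the mechanism only at a high level. As a rough illustration of priority-based tensor packet scheduling, the Python sketch below reorders gradient packets so that front-layer packets leave a simulated switch queue first. The class names, the layer-index priority rule, and the packet format are assumptions made for this sketch only; the paper's actual scheduler is implemented in P4 on a programmable switch, not in Python.

    import heapq
    from dataclasses import dataclass, field
    from itertools import count

    @dataclass(order=True)
    class TensorPacket:
        # Lower priority value leaves the queue first; the layer index serves as the
        # priority, assuming front layers are needed earliest in the next iteration.
        priority: int
        seq: int
        payload: bytes = field(compare=False)

    class PriorityTensorQueue:
        # Hypothetical model of an in-network priority queue for tensor packets.
        def __init__(self):
            self._heap = []
            self._counter = count()  # keeps FIFO order among packets of equal priority

        def __len__(self):
            return len(self._heap)

        def enqueue(self, layer_index, payload):
            heapq.heappush(self._heap, TensorPacket(layer_index, next(self._counter), payload))

        def dequeue(self):
            return heapq.heappop(self._heap)

    if __name__ == "__main__":
        q = PriorityTensorQueue()
        # Gradient packets arrive out of order (later layers finish backpropagation first),
        # but are forwarded front-layer-first to reduce blocking in the next iteration.
        for layer, data in [(3, b"grad3"), (0, b"grad0"), (2, b"grad2"), (1, b"grad1")]:
            q.enqueue(layer, data)
        while len(q):
            print("forward packet of layer", q.dequeue().priority)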

    Original language: English
    Title of host publication: Proceedings - 2020 IEEE 13th International Conference on Cloud Computing, CLOUD 2020
    Publisher: IEEE Computer Society
    Pages: 25-27
    Number of pages: 3
    ISBN (Electronic): 9781728187808
    DOIs
    Publication status: Published - 2020 Oct
    Event: 13th IEEE International Conference on Cloud Computing, CLOUD 2020 - Virtual, Beijing, China
    Duration: 2020 Oct 18 - 2020 Oct 24

    Publication series

    Name: IEEE International Conference on Cloud Computing, CLOUD
    Volume: 2020-October
    ISSN (Print): 2159-6182
    ISSN (Electronic): 2159-6190

    Conference

    Conference: 13th IEEE International Conference on Cloud Computing, CLOUD 2020
    Country/Territory: China
    City: Virtual, Beijing
    Period: 20/10/18 - 20/10/24

    Bibliographical note

    Funding Information:
    This work was partly supported by an Institute of Information & Communications Technology Planning & Evaluation grant funded by the Korea government (No. 2015-0-00280, (SW Starlab) Next generation cloud infra-software toward the guarantee of performance and security SLA). This research was also supported by the National Research Foundation of Korea funded by the Ministry of Science and ICT (No. NRF-2019H1D8A2105513).

    Publisher Copyright:
    © 2020 IEEE.

    Keywords

    • Communication scheduling
    • Distributed deep learning
    • In-network delay
    • P4
    • Parameter server architecture

    ASJC Scopus subject areas

    • Artificial Intelligence
    • Information Systems
    • Software
