TensorExpress: In-network communication scheduling for distributed deep learning

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

TensorExpress provides in-network communication scheduling for distributed deep learning (DDL). In cloud-based DDL, parameter communication over the network is a key bottleneck. Previous studies proposed tensor packet reordering approaches to reduce network blocking time, but network contention still exists in DDL. TensorExpress mitigates this contention and reduces overall training time by scheduling tensor packets in-network using P4, a switch programming language. TensorExpress improves latency and network blocking time by up to 2.5 and 2.44 times, respectively.

Original language: English
Title of host publication: Proceedings - 2020 IEEE 13th International Conference on Cloud Computing, CLOUD 2020
Publisher: IEEE Computer Society
Pages: 25-27
Number of pages: 3
ISBN (Electronic): 9781728187808
DOIs
Publication status: Published - 2020 Oct
Event: 13th IEEE International Conference on Cloud Computing, CLOUD 2020 - Virtual, Beijing, China
Duration: 2020 Oct 18 to 2020 Oct 24

Publication series

Name: IEEE International Conference on Cloud Computing, CLOUD
Volume: 2020-October
ISSN (Print): 2159-6182
ISSN (Electronic): 2159-6190

Conference

Conference: 13th IEEE International Conference on Cloud Computing, CLOUD 2020
Country/Territory: China
City: Virtual, Beijing
Period: 20/10/18 to 20/10/24

Bibliographical note

Funding Information:
This work was partly supported by Institute of Information & Communications Technology Planning & Evaluation grant funded by the Korea government (No. 2015-0-00280, (SW Starlab) Next generation cloud infra-software toward the guarantee of performance and security SLA). This research was also supported by National Research Foundation of Korea funded by the Ministry of Science, ICT (No. NRF-2019H1D8A2105513).

Publisher Copyright:
© 2020 IEEE.

Keywords

  • Communication scheduling
  • Distributed deep learning
  • In-network delay
  • P4
  • Parameter server architecture

ASJC Scopus subject areas

  • Artificial Intelligence
  • Information Systems
  • Software
