Traffic-Aware In-Network Aggregation Placement for Multi-Tenant Distributed Machine Learning

Heewon Kim, Hochan Lee, Sangheon Pack

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Distributed machine learning is an effective way to alleviate the intensive computation cost of training; however, it suffers from network bottlenecks when gathering local results. The recent advent of programmable data planes has opened a new avenue, in-network aggregation, which performs gradient aggregation inside the network, resolving network bottlenecks and further accelerating distributed machine learning. However, because current programmable data planes are resource-constrained, installing in-network aggregation functionality throughout the network would impose an unacceptable burden, which calls for careful deployment. In this paper, we consider the problem of deploying in-network aggregation functionality so as to minimize the total network traffic in multi-tenant distributed machine learning. Since the formulated problem is an integer linear programming problem, which is known to be NP-hard, we propose a traffic-aware placement of in-network aggregation (TAPINA) algorithm with lower complexity and near-optimal performance. TAPINA decides the aggregation points of multiple tenants sequentially in order of their expected traffic and reuses aggregation points already selected for other tenants to reduce the overall deployment cost. Simulation results demonstrate that TAPINA achieves near-optimal performance, with up to 20% traffic reduction compared to the state-of-the-art algorithm in most cases.
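
The sequential, reuse-oriented placement described in the abstract can be illustrated with a toy sketch. This is not the authors' TAPINA implementation: the tree topology, the traffic model (one gradient flow per worker, merged into a single flow at an aggregation point), the global switch budget, and all names (`PARENT`, `SWITCHES`, `flows_up`, `tenant_traffic`, `place`) are assumptions made purely for illustration.

```python
# Hypothetical toy tree: child -> parent. Every tenant's parameter server
# is assumed to sit at the root "core" (an assumption, not from the paper).
PARENT = {"s1": "core", "s2": "core",
          "h1": "s1", "h2": "s1", "h3": "s2", "h4": "s2"}
SWITCHES = ["s1", "s2"]  # nodes that could host an aggregation function


def flows_up(node, workers, aggs):
    """Number of gradient flows `node` sends on its uplink for one tenant."""
    children = [c for c, p in PARENT.items() if p == node]
    n = (1 if node in workers else 0) + sum(
        flows_up(c, workers, aggs) for c in children)
    # An aggregation point merges all of a tenant's incoming flows into one.
    return min(n, 1) if node in aggs else n


def tenant_traffic(workers, size, aggs):
    """Traffic of one tenant: flows on every uplink times gradient size."""
    return size * sum(flows_up(node, workers, aggs) for node in PARENT)


def place(tenants, budget):
    """Greedy placement: returns (installed switches, total traffic)."""
    installed = set()
    # Visit tenants in decreasing order of expected (unaggregated) traffic.
    order = sorted(tenants,
                   key=lambda t: tenant_traffic(*tenants[t], set()),
                   reverse=True)
    for name in order:
        workers, size = tenants[name]
        # Already installed switches are reused at no extra cost; new ones
        # are added only while the budget allows and traffic still drops.
        while len(installed) < budget:
            current = tenant_traffic(workers, size, installed)
            gains = {s: current - tenant_traffic(workers, size, installed | {s})
                     for s in SWITCHES if s not in installed}
            if not gains:
                break
            best = max(gains, key=gains.get)
            if gains[best] <= 0:
                break
            installed.add(best)
    total = sum(tenant_traffic(w, s, installed) for w, s in tenants.values())
    return installed, total


# Hypothetical tenants: name -> (worker hosts, gradient size).
TENANTS = {"A": ({"h1", "h2", "h3"}, 10), "B": ({"h1", "h2", "h4"}, 4)}
print(place(TENANTS, budget=1))
```

In this toy instance, tenant A is handled first because its unaggregated traffic is larger; it installs aggregation on `s1`, and tenant B then reuses `s1` without consuming further budget, lowering the total from 84 to 70 traffic units under the assumed model.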

Original language: English
Title of host publication: ICCCN 2023 - 2023 32nd International Conference on Computer Communications and Networks
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350336184
Publication status: Published - 2023
Event: 32nd International Conference on Computer Communications and Networks, ICCCN 2023 - Honolulu, United States
Duration: 2023 Jul 24 → 2023 Jul 27

Publication series

Name: Proceedings - International Conference on Computer Communications and Networks, ICCCN
Volume: 2023-July
ISSN (Print): 1095-2055

Conference

Conference: 32nd International Conference on Computer Communications and Networks, ICCCN 2023
Country/Territory: United States
City: Honolulu
Period: 23/7/24 → 23/7/27

Bibliographical note

Publisher Copyright:
© 2023 IEEE.

Keywords

  • Distributed Machine Learning
  • In-Network Aggregation
  • Programmable data plane

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
  • Software
