Xonar: Profiling-based Job Orderer for Distributed Deep Learning

Changyong Shin, Gyeongsik Yang, Yeonho Yoo, Jeunghwan Lee, Chuck Yoo

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

5 Citations (Scopus)

Abstract

Deep learning models vary widely in GPU execution time and memory size. When distributed training jobs are run, however, these properties have not been taken into account, which leads to high variance in job completion time (JCT). Moreover, jobs often run into the GPU out-of-memory (OoM) problem, forcing the affected job to restart from scratch. To address these problems, we propose Xonar, which profiles deep learning jobs and orders them in the queue. Experiments show that Xonar with TensorFlow v1.6 reduces the tail JCT by 44% and eliminates the OoM problem.

Original language: English
Title of host publication: Proceedings - 2022 IEEE 15th International Conference on Cloud Computing, CLOUD 2022
Editors: Claudio Agostino Ardagna, Nimanthi Atukorala, Rajkumar Buyya, Carl K. Chang, Rong N. Chang, Ernesto Damiani, Gargi Banerjee Dasgupta, Fabrizio Gagliardi, Christoph Hagleitner, Dejan Milojicic, Tuan M Hoang Trong, Robert Ward, Fatos Xhafa, Jia Zhang
Publisher: IEEE Computer Society
Pages: 112-114
Number of pages: 3
ISBN (Electronic): 9781665481373
Publication status: Published - 2022
Event: 15th IEEE International Conference on Cloud Computing, CLOUD 2022 - Barcelona, Spain
Duration: 2021 Jul 10 → 2021 Jul 16

Publication series

Name: IEEE International Conference on Cloud Computing, CLOUD
Volume: 2022-July
ISSN (Print): 2159-6182
ISSN (Electronic): 2159-6190

Conference

Conference: 15th IEEE International Conference on Cloud Computing, CLOUD 2022
Country/Territory: Spain
City: Barcelona
Period: 21/7/10 → 21/7/16

Bibliographical note

Funding Information:
This work was partly supported by the Institute of Information & communications Technology Planning & Evaluation, funded by the Korea government (Ministry of Science and ICT) (2015-0-00280, (SW Starlab) Next generation cloud infra-software toward the guarantee of performance and security SLA), and by the Basic Science Research Program through the National Research Foundation of Korea, funded by the Ministry of Education (NRF-2021R1A6A1A13044830). Co-corresponding authors: Chuck Yoo and Gyeongsik Yang.

Publisher Copyright:
© 2022 IEEE.

Keywords

  • distributed deep learning
  • GPU cloud
  • GPU utilization
  • job completion time
  • parallel training

ASJC Scopus subject areas

  • Artificial Intelligence
  • Information Systems
  • Software
