Xonar: Profiling-based Job Orderer for Distributed Deep Learning

Changyong Shin, Gyeongsik Yang, Yeonho Yoo, Jeunghwan Lee, Chuck Yoo

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

    7 Citations (Scopus)

    Abstract

    Deep learning models vary widely in GPU execution time and memory size. When distributed training jobs are run, however, their GPU execution time and memory size are not taken into account, which leads to high variance in job completion time (JCT). Moreover, jobs often run into the GPU out-of-memory (OoM) problem, forcing the affected job to restart from scratch. To address these problems, we propose Xonar, which profiles deep learning jobs and orders them in the queue. Experiments show that Xonar with TensorFlow v1.6 reduces the tail JCT by 44% and eliminates the OoM problem.
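
    The abstract does not say how Xonar orders the queue, so the following is only a minimal sketch of the general idea of profiling-based ordering, not the authors' method. It assumes each job already has a profiled GPU execution time and peak GPU memory, holds back jobs whose profiled memory exceeds a given GPU capacity (so they cannot hit OoM at run time), and orders the rest shortest-profiled-time first. The names `JobProfile`, `order_jobs`, and `gpu_capacity_mb`, the ordering policy, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobProfile:
    """Hypothetical profiling result for one training job."""
    name: str
    gpu_time_s: float  # profiled GPU execution time, in seconds (assumed unit)
    gpu_mem_mb: int    # profiled peak GPU memory, in MB (assumed unit)


def order_jobs(profiles, gpu_capacity_mb):
    """Split jobs into an ordered queue and a held-back list.

    Assumed policy (not taken from the paper): jobs whose profiled memory
    exceeds the GPU capacity are held back instead of being launched, and
    the remaining jobs are sorted by profiled GPU time, shortest first.
    """
    admitted = [p for p in profiles if p.gpu_mem_mb <= gpu_capacity_mb]
    held_back = [p for p in profiles if p.gpu_mem_mb > gpu_capacity_mb]
    # Tie-break on name so the order is deterministic.
    ordered = sorted(admitted, key=lambda p: (p.gpu_time_s, p.name))
    return ordered, held_back


# Illustrative, made-up profiles.
jobs = [
    JobProfile("resnet", 120.0, 9000),
    JobProfile("lstm", 30.0, 4000),
    JobProfile("bert", 300.0, 20000),
]
ordered, held_back = order_jobs(jobs, gpu_capacity_mb=16000)
print([p.name for p in ordered])
print([p.name for p in held_back])
```

    Shortest-first is used here only because it is a common way to reduce queueing delay; the actual ordering criterion in Xonar may differ.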

    Original language: English
    Title of host publication: Proceedings - 2022 IEEE 15th International Conference on Cloud Computing, CLOUD 2022
    Editors: Claudio Agostino Ardagna, Nimanthi Atukorala, Rajkumar Buyya, Carl K. Chang, Rong N. Chang, Ernesto Damiani, Gargi Banerjee Dasgupta, Fabrizio Gagliardi, Christoph Hagleitner, Dejan Milojicic, Tuan M Hoang Trong, Robert Ward, Fatos Xhafa, Jia Zhang
    Publisher: IEEE Computer Society
    Pages: 112-114
    Number of pages: 3
    ISBN (Electronic): 9781665481373
    DOIs
    Publication status: Published - 2022
    Event: 15th IEEE International Conference on Cloud Computing, CLOUD 2022 - Barcelona, Spain
    Duration: 2021 Jul 10 to 2021 Jul 16

    Publication series

    Name: IEEE International Conference on Cloud Computing, CLOUD
    Volume: 2022-July
    ISSN (Print): 2159-6182
    ISSN (Electronic): 2159-6190

    Conference

    Conference: 15th IEEE International Conference on Cloud Computing, CLOUD 2022
    Country/Territory: Spain
    City: Barcelona
    Period: 21/7/10 to 21/7/16

    Bibliographical note

    Funding Information:
    This work was partly supported by Institute of Information & communications Technology Planning & Evaluation funded by the Korea government (Ministry of Science and ICT) (2015-0-00280, (SW Starlab) Next generation cloud infra-software toward the guarantee of performance and security SLA) and by Basic Science Research Program through National Research Foundation of Korea funded by the Ministry of Education (NRF-2021R1A6A1A13044830). Co-corresponding authors: Chuck Yoo and Gyeongsik Yang.

    Publisher Copyright:
    © 2022 IEEE.

    Keywords

    • distributed deep learning
    • GPU cloud
    • GPU utilization
    • job completion time
    • parallel training

    ASJC Scopus subject areas

    • Artificial Intelligence
    • Information Systems
    • Software
