Abstract
Deep learning models vary widely in GPU execution time and memory footprint. Distributed training jobs, however, are scheduled without taking these two factors into account, which leads to high variance in job completion time (JCT). Moreover, jobs often hit the GPU out-of-memory (OoM) problem, forcing the affected job to restart from scratch. To address these problems, we propose Xonar, which profiles deep learning jobs and orders them in the queue. Experiments show that Xonar with TensorFlow v1.6 reduces the tail JCT by 44% and eliminates the OoM problem.
Original language | English |
---|---|
Title of host publication | Proceedings - 2022 IEEE 15th International Conference on Cloud Computing, CLOUD 2022 |
Editors | Claudio Agostino Ardagna, Nimanthi Atukorala, Rajkumar Buyya, Carl K. Chang, Rong N. Chang, Ernesto Damiani, Gargi Banerjee Dasgupta, Fabrizio Gagliardi, Christoph Hagleitner, Dejan Milojicic, Tuan M Hoang Trong, Robert Ward, Fatos Xhafa, Jia Zhang |
Publisher | IEEE Computer Society |
Pages | 112-114 |
Number of pages | 3 |
ISBN (Electronic) | 9781665481373 |
Publication status | Published - 2022 |
Event | 15th IEEE International Conference on Cloud Computing, CLOUD 2022 - Barcelona, Spain. Duration: 2022 Jul 10 → 2022 Jul 16 |
Publication series
Name | IEEE International Conference on Cloud Computing, CLOUD |
---|---|
Volume | 2022-July |
ISSN (Print) | 2159-6182 |
ISSN (Electronic) | 2159-6190 |
Conference
Conference | 15th IEEE International Conference on Cloud Computing, CLOUD 2022 |
---|---|
Country/Territory | Spain |
City | Barcelona |
Period | 22/7/10 → 22/7/16 |
Bibliographical note
Funding Information: This work was partly supported by the Institute of Information & communications Technology Planning & Evaluation funded by the Korea government (Ministry of Science and ICT) (2015-0-00280, (SW Starlab) Next generation cloud infra-software toward the guarantee of performance and security SLA) and by the Basic Science Research Program through the National Research Foundation of Korea funded by the Ministry of Education (NRF-2021R1A6A1A13044830). Co-corresponding authors: Chuck Yoo and Gyeongsik Yang.
Publisher Copyright:
© 2022 IEEE.
Keywords
- distributed deep learning
- GPU cloud
- GPU utilization
- job completion time
- parallel training
ASJC Scopus subject areas
- Artificial Intelligence
- Information Systems
- Software