ShmCaffe: A distributed deep learning platform with shared memory buffer for HPC Architecture

  • Shinyoung Ahn
  • Joongheon Kim
  • Eunji Lim
  • Wan Choi
  • Aziz Mohaisen
  • Sungwon Kang

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

23 Citations (Scopus)

Abstract

One reason for the recent success of deep learning theory and applications is advances in distributed and parallel high-performance computing (HPC). This paper proposes a new distributed deep learning platform, named ShmCaffe, which uses remote shared memory to reduce the communication overhead of sharing parameters during large-scale deep neural network training. ShmCaffe is built on Soft Memory Box (SMB), a virtual shared memory framework. In the SMB framework, remote shared memory serves as a shared buffer for asynchronous sharing of large parameter sets among many distributed deep learning processes. The paper also discusses a hybrid method that combines asynchronous and synchronous parameter sharing to improve scalability. When Inception-v1 is trained with 16 GPUs, ShmCaffe is 10.1 times faster than Caffe and 2.8 times faster than Caffe-MPI. We verify the convergence of Inception-v1 model training using ShmCaffe-A and ShmCaffe-H by varying the number of workers. Furthermore, we evaluate the scalability of ShmCaffe by analyzing the computation and communication times per iteration of deep learning training for four convolutional neural network (CNN) models.

Original language: English
Title of host publication: Proceedings - 2018 IEEE 38th International Conference on Distributed Computing Systems, ICDCS 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1118-1128
Number of pages: 11
ISBN (Electronic): 9781538668719
DOIs
Publication status: Published - 2018 Jul 19
Externally published: Yes
Event: 38th IEEE International Conference on Distributed Computing Systems, ICDCS 2018 - Vienna, Austria
Duration: 2018 Jul 2 to 2018 Jul 5

Publication series

Name: Proceedings - International Conference on Distributed Computing Systems
Volume: 2018-July

Conference

Conference: 38th IEEE International Conference on Distributed Computing Systems, ICDCS 2018
Country/Territory: Austria
City: Vienna
Period: 18/7/2 to 18/7/5

Bibliographical note

Publisher Copyright:
© 2018 IEEE.

Keywords

  • Deep learning
  • Distributed deep learning
  • Shared memory
  • ShmCaffe
  • Soft memory box

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications
