TensorLightning: A Traffic-Efficient Distributed Deep Learning on Commodity Spark Clusters

Seil Lee, Hanjoo Kim, Jaehong Park, Jaehee Jang, Chang Sung Jeong, Sungroh Yoon

Research output: Contribution to journal › Article › peer-review

8 Citations (Scopus)

Abstract

With the recent success of deep learning, the amount of data and computation continues to grow daily. Distributed deep learning systems that share the training workload have therefore been researched extensively. Scale-out distributed environments built on commodity servers are widely used, but they are limited by synchronous operation and communication traffic. In addition, combining deep neural network (DNN) training with existing clusters often demands additional hardware and migration between different cluster frameworks or libraries, which is highly inefficient. We therefore propose TensorLightning, which integrates the widely used data pipeline of Apache Spark with the powerful deep learning libraries Caffe and TensorFlow. TensorLightning employs a new parameter aggregation algorithm and parallel asynchronous parameter management schemes to relieve communication discrepancies and overhead. We redesign the elastic averaging stochastic gradient descent algorithm to use pruned, sparse-form parameters. Our approach provides fast and flexible DNN training with high accessibility. We evaluated the proposed framework with convolutional neural network and recurrent neural network models; the framework reduces network traffic by 67% with faster convergence.
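The abstract does not spell out the redesigned update rule, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It takes the commonly published elastic averaging SGD step (the worker moves along its gradient and is pulled toward a shared center variable, while the center is pulled toward the worker) and, as an assumption made purely for illustration, transmits only the k largest-magnitude entries of the worker-center difference. The names `top_k_sparse` and `easgd_sparse_step`, and the top-k pruning rule itself, are hypothetical.

```python
from typing import Dict, List, Tuple


def top_k_sparse(diff: List[float], k: int) -> Dict[int, float]:
    """Keep the k largest-magnitude entries of diff as {index: value}.

    Hypothetical pruning rule, used here only for illustration.
    """
    order = sorted(range(len(diff)), key=lambda i: abs(diff[i]), reverse=True)
    return {i: diff[i] for i in order[:k]}


def easgd_sparse_step(
    worker: List[float],
    center: List[float],
    grad: List[float],
    lr: float,
    alpha: float,
    k: int,
) -> Tuple[List[float], List[float], Dict[int, float]]:
    """One elastic-averaging step in which only k entries are exchanged.

    Standard EASGD applies the elastic term alpha * (worker - center) to
    every parameter; in this sketch (an assumption) it is applied only to
    the entries kept by top_k_sparse, so only those entries would need to
    cross the network.
    """
    # Elastic difference between the local and the shared parameters.
    diff = [w - c for w, c in zip(worker, center)]
    sparse = top_k_sparse(diff, k)

    # Local gradient step on every parameter.
    new_worker = [w - lr * g for w, g in zip(worker, grad)]
    new_center = list(center)

    # Elastic pull, restricted to the transmitted entries.
    for i, d in sparse.items():
        new_worker[i] -= alpha * d  # worker moves toward the center
        new_center[i] += alpha * d  # center moves toward the worker
    return new_worker, new_center, sparse


if __name__ == "__main__":
    w, c, sent = easgd_sparse_step(
        worker=[1.0, 0.0, -2.0],
        center=[0.0, 0.0, 0.0],
        grad=[0.0, 0.0, 0.0],
        lr=0.1,
        alpha=0.5,
        k=1,
    )
    print(w, c, sent)
```

In this toy call only one of three parameter entries is exchanged per step. The traffic reduction reported in the abstract comes from the authors' own scheme and evaluation, which this sketch does not reproduce.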

Original language: English
Pages (from-to): 27671-27680
Number of pages: 10
Journal: IEEE Access
Volume: 6
Publication status: Published - 2018 May 29

Bibliographical note

Funding Information:
This work was supported in part by the National Research Foundation of Korea through the Korean Government, Ministry of Science and ICT (MSIT), under Grant NRF-2018R1A2B3001628, in part by the Projects for Research and Development of Police Science and Technology under the Center for Research and Development of Police Science and Technology and the Korean National Police Agency through the Korean Government, MSIT, under Grant PA-C000001, in part by the Institute for Information & Communications Technology Promotion through the Korea Government, MSIT, under Grant 2016-0-00087, in part by the Brain Korea 21 Plus Project (Electrical and Computer Engineering, Seoul National University) in 2018, and in part by the Creative Industrial Technology Development Program through the Korea Government, Ministry of Trade, Industry & Energy, under Grant 10053249.

Publisher Copyright:
© 2013 IEEE.

Keywords

  • Apache Spark
  • TensorLightning
  • commodity servers
  • deep learning
  • distributed system

ASJC Scopus subject areas

  • General Computer Science
  • General Materials Science
  • General Engineering
