Abstract
We design a parallel algorithm for huge matrix multiplication on a cluster of GPU nodes. Since the input matrices are too large to fit in memory, the algorithm repeatedly loads, computes, and stores partial matrix data between disk and GPU buffers. The key to achieving the best speedup is not only to drive the GPUs at full performance, but also to reduce the overhead of data movement between disk and the GPU buffers. We devise an efficient way to lower the latency of supplying matching pairs of partial matrices to the GPU buffer, and we optimize data partitioning, distribution, and disk access in a pipelined manner. Experimental results show that our algorithm outperforms a generic algorithm, reducing computation time by 45%. Moreover, the algorithm's scalability improves as more GPU nodes are added.
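The abstract describes two ideas: tiling the multiplication so only partial matrices are resident at once, and pipelining the loading of the next tile pair while the current pair is being multiplied. The sketch below is a minimal single-node illustration of both, not the paper's implementation: `.npy` files in a temporary directory stand in for the disk/GPU-buffer path, NumPy stands in for the GPU kernel, and a one-worker thread pool prefetches the next matching tile pair during compute. All names (`store_tiles`, `fetch_pair`, `TILE`) are hypothetical.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

import numpy as np

TILE = 64  # tile edge length; in practice bounded by GPU buffer capacity


def store_tiles(M, name, d):
    """Split square matrix M into TILE x TILE tiles and write each to disk."""
    n = M.shape[0] // TILE
    for i in range(n):
        for j in range(n):
            np.save(os.path.join(d, f"{name}_{i}_{j}.npy"),
                    M[i*TILE:(i+1)*TILE, j*TILE:(j+1)*TILE])
    return n


def load_tile(name, i, j, d):
    return np.load(os.path.join(d, f"{name}_{i}_{j}.npy"))


def out_of_core_matmul(n, d):
    """Compute C = A @ B tile by tile, never holding the full inputs.

    A one-worker executor overlaps loading of the next matching pair
    (A[i,k+1], B[k+1,j]) with the multiply of the current pair, mimicking
    the pipelined disk-to-buffer staging described in the abstract.
    """
    def fetch_pair(i, k, j):
        return load_tile("A", i, k, d), load_tile("B", k, j, d)

    C = np.zeros((n * TILE, n * TILE))
    with ThreadPoolExecutor(max_workers=1) as io:
        for i in range(n):
            for j in range(n):
                acc = np.zeros((TILE, TILE))
                fut = io.submit(fetch_pair, i, 0, j)  # prime the pipeline
                for k in range(n):
                    a, b = fut.result()               # wait for staged pair
                    if k + 1 < n:                     # prefetch next pair
                        fut = io.submit(fetch_pair, i, k + 1, j)
                    acc += a @ b                      # "GPU" compute step
                C[i*TILE:(i+1)*TILE, j*TILE:(j+1)*TILE] = acc
    return C


rng = np.random.default_rng(0)
A = rng.standard_normal((4 * TILE, 4 * TILE))
B = rng.standard_normal((4 * TILE, 4 * TILE))
with tempfile.TemporaryDirectory() as d:
    n = store_tiles(A, "A", d)
    store_tiles(B, "B", d)
    C = out_of_core_matmul(n, d)
```

The paper additionally distributes tiles across MPI ranks and GPU nodes; this sketch only shows the per-node load/compute/store loop and the overlap that hides staging latency.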
Original language | English |
---|---|
Title of host publication | Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 877-882 |
Number of pages | 6 |
ISBN (Print) | 9781538655559 |
DOIs | |
Publication status | Published - 2018 Aug 3 |
Event | 32nd IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018 - Vancouver, Canada (Duration: 2018 May 21 → 2018 May 25) |
Other

Other | 32nd IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018 |
---|---|
Country/Territory | Canada |
City | Vancouver |
Period | 18/5/21 → 18/5/25 |
Keywords
- GPU computing
- Matrix multiplication
- MPI
- Parallel computing
ASJC Scopus subject areas
- Artificial Intelligence
- Computer Networks and Communications
- Hardware and Architecture
- Information Systems and Management