Abstract
Training on time-series data generated from mobile networks is resource-intensive and time-consuming, and it is prone to various training failures. To address this issue, we propose CheckBullet, a lightweight checkpoint system that minimizes storage requirements and enables fast recovery in mobile networks. First, CheckBullet determines a checkpointing interval based on the characteristics of the model and the timing of failure occurrences. This approach ensures fast recovery while preserving the existing training runtime. Second, CheckBullet quantizes the weight tensor and eliminates duplicate weights, which significantly reduces the overall checkpoint size and thus the storage requirements. Third, CheckBullet selects the checkpoints with the minimum training loss among the deduplicated checkpoints and merges them. This approach reduces recovery time while preserving the existing training loss. The experimental results show that CheckBullet reduces the recovery time by 6× to 11× while barely increasing the training runtime. Furthermore, CheckBullet reduces storage requirements by up to 70% while maintaining the minimum training loss.
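The quantize, deduplicate, and select-by-loss steps described in the abstract can be illustrated with a minimal sketch. This is not CheckBullet's implementation: the class `CheckpointStore`, the functions `quantize` and `dequantize`, the 8-bit uniform quantization scheme, and the choice to deduplicate per layer across checkpoints by content hash are all assumptions made for illustration. The interval selection and checkpoint merging steps are not modeled here.

```python
import hashlib
import struct


def quantize(weights):
    """Uniformly quantize a list of floats to 8-bit codes (assumed scheme)."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 if hi > lo else 1.0
    codes = bytes(min(255, round((w - lo) / scale)) for w in weights)
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Map 8-bit codes back to approximate float weights."""
    return [lo + c * scale for c in codes]


class CheckpointStore:
    """Hypothetical store that keeps each distinct quantized layer only once."""

    def __init__(self):
        self.blobs = {}        # digest -> quantized bytes
        self.checkpoints = []  # (step, loss, {layer: (digest, lo, scale)})

    def save(self, step, loss, layers):
        manifest = {}
        for name, weights in layers.items():
            codes, lo, scale = quantize(weights)
            # Hash the codes together with their range so that equal codes
            # from differently scaled layers are not treated as duplicates.
            key = hashlib.sha256(codes + struct.pack("<dd", lo, scale)).hexdigest()
            self.blobs.setdefault(key, codes)  # duplicate layers are not stored again
            manifest[name] = (key, lo, scale)
        self.checkpoints.append((step, loss, manifest))

    def best(self):
        """Restore the checkpoint with the lowest recorded training loss."""
        step, loss, manifest = min(self.checkpoints, key=lambda c: c[1])
        layers = {
            name: dequantize(self.blobs[key], lo, scale)
            for name, (key, lo, scale) in manifest.items()
        }
        return step, loss, layers


store = CheckpointStore()
store.save(1, 0.9, {"a": [0.0, 0.5, 1.0], "b": [1.0, 2.0, 3.0]})
store.save(2, 0.4, {"a": [0.0, 0.5, 1.0], "b": [1.0, 2.5, 3.0]})
step, loss, layers = store.best()
```

In this example, layer `a` is unchanged between the two checkpoints, so the store holds three quantized blobs rather than four, and `best()` restores the checkpoint saved at step 2.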
Field | Value |
---|---|
Original language | English |
Pages (from-to) | 14946-14958 |
Number of pages | 13 |
Journal | IEEE Transactions on Mobile Computing |
Volume | 23 |
Issue number | 12 |
DOIs | |
Publication status | Published - 2024 |
Bibliographical note
Publisher Copyright: © 2002-2012 IEEE.
Keywords
- Checkpointing
- failure resiliency
- robust model training
- time-series data
ASJC Scopus subject areas
- Software
- Computer Networks and Communications
- Electrical and Electronic Engineering