Abstract
The memory capacity constraint of GPUs is a major challenge in running large deep learning workloads with their ever-increasing memory requirements. To run a large Transformer model with limited GPU memory, programmers must manually allocate and copy data between the CPU and GPUs. Unified Virtual Memory (UVM) eases this programming burden by automatically managing data transfers through its demand paging scheme. However, using UVM can cause performance degradation, especially under memory oversubscription. In this paper, we analyze the memory behavior of inference in large Transformer models using real hardware and the open-source NVIDIA UVM driver. The default Tree-Based Neighborhood (TBN) prefetcher in the UVM driver prefetches pages within a 2MB virtual address block (VABlock), but it detects locality only within that block, limiting its effectiveness for large models. Our analysis reveals that access locality in these workloads extends beyond a single VABlock, which the default prefetcher cannot exploit. To address this, we propose a block-aware prefetcher that prefetches multiple contiguous VABlocks more aggressively. Our evaluation shows that this approach delivers an average 2.7x performance improvement over the default TBN prefetcher when GPU memory is oversubscribed.
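The block-aware idea can be pictured with a minimal sketch. This is not the actual NVIDIA UVM driver code: all identifiers (va_block_t, queue_block, NUM_NEIGHBOR_BLOCKS, the 0.5 density threshold) are illustrative assumptions introduced here. The sketch contrasts the proposed behavior, prefetching ahead into contiguous 2MB VABlocks when intra-block locality is high, with a per-block prefetcher that stops at the VABlock boundary.

```c
/* Illustrative sketch only; not the NVIDIA UVM driver's implementation.
 * All names and the density threshold are hypothetical. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VABLOCK_SIZE        (2u * 1024 * 1024)        /* 2MB virtual address block */
#define PAGE_SIZE           (64u * 1024)              /* example page granularity */
#define PAGES_PER_BLOCK     (VABLOCK_SIZE / PAGE_SIZE)
#define NUM_NEIGHBOR_BLOCKS 4                         /* hypothetical aggressiveness knob */

typedef struct {
    uint64_t base;                        /* VABlock-aligned base address */
    bool     resident[PAGES_PER_BLOCK];   /* which pages are already on the GPU */
} va_block_t;

/* Round a faulting address down to the base of its 2MB VABlock. */
uint64_t va_block_base(uint64_t addr)
{
    return addr & ~((uint64_t)VABLOCK_SIZE - 1);
}

/* Fraction of the faulting VABlock that is already resident on the GPU. */
double block_density(const va_block_t *blk)
{
    size_t resident = 0;
    for (size_t i = 0; i < PAGES_PER_BLOCK; i++)
        if (blk->resident[i])
            resident++;
    return (double)resident / PAGES_PER_BLOCK;
}

/* Block-aware prefetch decision (sketch): when the faulting VABlock is
 * sufficiently dense, also queue the next few contiguous VABlocks instead
 * of stopping at the 2MB boundary as a per-block prefetcher would. */
void block_aware_prefetch(const va_block_t *faulting_blk,
                          uint64_t fault_addr,
                          void (*queue_block)(uint64_t block_base))
{
    uint64_t base = va_block_base(fault_addr);

    /* Always service the faulting block itself. */
    queue_block(base);

    /* Dense intra-block access suggests a streaming pattern that will
     * continue into neighboring blocks, so prefetch ahead. */
    if (block_density(faulting_blk) > 0.5) {
        for (int i = 1; i <= NUM_NEIGHBOR_BLOCKS; i++)
            queue_block(base + (uint64_t)i * VABLOCK_SIZE);
    }
}
```

The key design point the paper argues for is visible in the last loop: once locality is detected, prefetching is allowed to cross VABlock boundaries, which the default TBN prefetcher does not do.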
| Original language | English |
| --- | --- |
| Article number | 103389 |
| Journal | Journal of Systems Architecture |
| Volume | 162 |
| DOIs | |
| Publication status | Published - May 2025 |
Bibliographical note
Publisher Copyright: © 2025
Keywords
- Demand paging
- Graphics processing units
- Large language models
- Memory oversubscription
- Prefetching
- Real-time analysis
- Unified virtual memory
ASJC Scopus subject areas
- Software
- Hardware and Architecture