Beyond VABlock: Improving Transformer workloads through aggressive prefetching

Jane Rhee, Ikyoung Choi, Gunjae Koo, Yunho Oh, Myung Kuk Yoon

Research output: Contribution to journal, Article, peer-review

Abstract

The limited memory capacity of GPUs is a major challenge when running large deep learning workloads, whose memory requirements keep growing. To run a large Transformer model with limited GPU memory, programmers must manually allocate data and copy it between the CPU and GPUs. Unified Virtual Memory (UVM) eases this programming burden by managing data transfers automatically through demand paging. However, UVM can degrade performance, especially under memory oversubscription. In this paper, we analyze the memory behavior of inference in large Transformer models using real hardware and the open-source NVIDIA UVM driver. The driver's default Tree-Based Neighborhood (TBN) prefetcher prefetches pages within a 2 MB virtual address block (VABlock), but it detects locality only within a single VABlock, which limits its effectiveness for large models. Our analysis reveals that locality extends beyond the VABlock, which the default prefetcher cannot exploit. To address this, we propose a block-aware prefetcher that prefetches multiple contiguous VABlocks more aggressively. Our evaluation shows that this approach delivers an average 2.7x performance improvement over the default TBN prefetcher when GPU memory is oversubscribed.
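
To make the idea of prefetching across VABlock boundaries concrete, the following is a minimal sketch of the address arithmetic only. It is not the authors' prefetcher and not code from the NVIDIA UVM driver. The 2 MB block size comes from the abstract; the function names and the `degree` and `alloc_end` parameters are hypothetical and chosen purely for illustration.

```python
from typing import List

VABLOCK_SIZE = 2 * 1024 * 1024  # 2 MB VABlock size, as stated in the abstract


def vablock_base(addr: int) -> int:
    """Round a virtual address down to the start of its 2 MB VABlock."""
    return addr & ~(VABLOCK_SIZE - 1)


def blocks_to_prefetch(fault_addr: int, degree: int, alloc_end: int) -> List[int]:
    """Return the base of the faulting VABlock plus up to `degree` following
    contiguous VABlocks, dropping any block that starts at or past `alloc_end`.

    `degree` (how many extra blocks to fetch) and `alloc_end` (exclusive end
    of the allocation) are assumptions of this sketch, not driver parameters.
    """
    base = vablock_base(fault_addr)
    candidates = (base + i * VABLOCK_SIZE for i in range(degree + 1))
    return [b for b in candidates if b < alloc_end]
```

With `degree=0` the sketch stays inside the faulting VABlock, which roughly mirrors the scope the abstract attributes to the default prefetcher. A larger `degree` reaches into the contiguous blocks that follow.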

Original language: English
Article number: 103389
Journal: Journal of Systems Architecture
Volume: 162
Publication status: Published - May 2025

Bibliographical note

Publisher Copyright:
© 2025

Keywords

  • Demand paging
  • Graphics processing units
  • Large language models
  • Memory oversubscription
  • Prefetching
  • Real-time analysis
  • Unified virtual memory

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
