TLP Balancer: Predictive Thread Allocation for Multitenant Inference in Embedded GPUs

Abstract
This letter introduces a novel software technique to optimize thread allocation for merged and fused kernels in multitenant inference systems on embedded graphics processing units (GPUs). Embedded systems equipped with GPUs face challenges in managing diverse deep learning workloads while adhering to quality-of-service (QoS) standards, primarily due to limited hardware resources and the varied nature of deep learning models. Prior work has relied on static thread allocation strategies, often leading to suboptimal hardware utilization. To address these challenges, we propose a new software technique called thread-level parallelism (TLP) Balancer. TLP Balancer automatically identifies the best-performing number of threads based on performance modeling. This approach significantly enhances hardware utilization and ensures QoS compliance, outperforming traditional fixed-thread allocation methods. Our evaluation shows that TLP Balancer improves throughput by 40% compared to the state-of-the-art automated kernel merge and fusion techniques.
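The abstract only summarizes the approach; the letter itself details the performance model and allocation policy. As a rough, hypothetical sketch of predictive thread allocation (not the authors' implementation: the kernel names, latency coefficients, deadlines, and thread budget below are invented for illustration), one can fit a per-kernel latency model offline and then search for the thread split that maximizes aggregate throughput while every tenant still meets its QoS deadline:

```python
from itertools import product

# Hypothetical per-kernel latency models: latency (ms) as a function of the
# number of threads assigned, fitted offline from profiling data.
# Coefficients are illustrative placeholders, not measured values.
LATENCY_MODEL = {
    "resnet_block": lambda t: 12.0 / t + 0.4 * t,  # speeds up, then contention dominates
    "lstm_cell":    lambda t: 8.0 / t + 0.9 * t,   # saturates with few threads
}

QOS_DEADLINE_MS = {"resnet_block": 15.0, "lstm_cell": 10.0}
TOTAL_THREADS = 8  # assumed thread budget on the embedded platform
CANDIDATES = range(1, TOTAL_THREADS + 1)

def pick_allocation():
    """Search thread splits; keep the one that maximizes aggregate
    throughput while every tenant meets its QoS deadline."""
    best, best_tput = None, 0.0
    for t_a, t_b in product(CANDIDATES, repeat=2):
        if t_a + t_b > TOTAL_THREADS:
            continue  # exceeds the thread budget
        lat_a = LATENCY_MODEL["resnet_block"](t_a)
        lat_b = LATENCY_MODEL["lstm_cell"](t_b)
        if lat_a > QOS_DEADLINE_MS["resnet_block"] or lat_b > QOS_DEADLINE_MS["lstm_cell"]:
            continue  # violates a QoS deadline, discard
        tput = 1000.0 / lat_a + 1000.0 / lat_b  # combined inferences per second
        if tput > best_tput:
            best, best_tput = (t_a, t_b), tput
    return best, best_tput

if __name__ == "__main__":
    alloc, tput = pick_allocation()
    print(f"thread split {alloc}, estimated throughput {tput:.1f} inf/s")
```

Exhaustive search is shown only because two tenants keep the space tiny; a practical balancer along these lines would evaluate the fitted model per workload mix rather than relying on a fixed, static split.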
| Original language | English |
|---|---|
| Pages (from-to) | 180-183 |
| Number of pages | 4 |
| Journal | IEEE Embedded Systems Letters |
| Volume | 17 |
| Issue number | 3 |
| DOIs | |
| Publication status | Published - 2025 |
Bibliographical note
Publisher Copyright: © 2009-2012 IEEE.
Keywords
- Embedded graphics processing unit (GPU)
- inference
- multitenancy
ASJC Scopus subject areas
- Control and Systems Engineering
- General Computer Science