Task Based Execution of GPU Applications with Dynamic Data Dependencies
Mehmet E. Belviranli, Chih H. Chou, Laxmi N. Bhuyan, Rajiv Gupta
GP-GPU Computing
- GPUs enable high-throughput, data- and compute-intensive computations
- Data is partitioned into a grid of "Thread Blocks" (TBs)
- Thousands of TBs in a grid can be executed in any order
- No HW support for efficient inter-TB communication
- High scalability & throughput for independent data
- Challenging & inefficient for inter-TB-dependent data
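As a minimal sketch of this execution model (the kernel, array size, and names are illustrative, not from the talk), here is a grid of independent TBs covering a flat array:

```cuda
// Each thread block (TB) independently processes a disjoint slice of the
// input; blocks may run in any order, so the kernel assumes no ordering.
__global__ void scale_kernel(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n)
        data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    int threads = 256;                         // threads per TB
    int blocks = (n + threads - 1) / threads;  // grid of TBs covering the data
    scale_kernel<<<blocks, threads>>>(d_data, n, 2.0f);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```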
The Problem
- Data-dependent & irregular applications
  - Simulations (n-body, heat)
  - Graph algorithms (BFS, SSSP)
- Inter-TB synchronization: sync only through global memory
- Irregular task graphs: static partitioning fails
- Heterogeneous execution: unbalanced distribution
[Figure: data dependency graph]
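To make the global-memory synchronization point concrete, here is a minimal sketch of one TB signaling another through a flag in global memory (all names are illustrative). Note the pattern only works while both TBs are resident on the GPU at once, which is one reason the framework below relies on a bounded set of persistent worker TBs:

```cuda
#include <cstdio>

// Block 0 produces a value and raises a flag; block 1 spins on the flag.
__global__ void handoff_kernel(volatile int *flag, volatile int *value) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        *value = 42;            // produce the result
        __threadfence();        // make the value visible before the flag
        *flag = 1;              // signal the consumer block
    }
    if (blockIdx.x == 1 && threadIdx.x == 0) {
        while (*flag == 0) { }  // busy-wait on global memory
        printf("consumer read %d\n", *value);
    }
}

int main() {
    int *d_flag, *d_value;
    cudaMalloc(&d_flag, sizeof(int));
    cudaMalloc(&d_value, sizeof(int));
    cudaMemset(d_flag, 0, sizeof(int));

    handoff_kernel<<<2, 32>>>(d_flag, d_value);
    cudaDeviceSynchronize();

    cudaFree(d_flag);
    cudaFree(d_value);
    return 0;
}
```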
The Solution
- "Task based execution"
- Transition from SIMD -> MIMD
Challenges
- Breaking applications into tasks
- Task-to-SM assignment
- Dependency tracking
- Inter-SM communication
- Load balancing
Proposed Task Based Execution Framework
- Persistent worker TBs (one per SM)
- Distributed task queues (one per SM)
- In-GPU dependency tracking & scheduling
- Load balancing via different queue insertion policies
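A possible shape for these pieces is sketched below; the struct names, fields, and sizes are my assumptions, not the framework's actual definitions:

```cuda
#define QUEUE_CAPACITY 1024
#define MAX_CHILDREN 4

// A task carries a dependency counter and links to its children.
// It becomes ready when the counter reaches zero.
struct Task {
    int func_id;                 // which user function this task runs
    int num_deps;                // unresolved parent count (decremented atomically)
    int children[MAX_CHILDREN];  // indices of dependent tasks
    int num_children;
    void *payload;               // application-specific data
};

// One ring buffer of task pointers per SM; each SM's persistent worker TB
// polls its own queue, but any worker may insert into any queue.
struct TaskQueue {
    Task *slots[QUEUE_CAPACITY];
    unsigned int head;  // next slot to dequeue (advanced by the single consumer)
    unsigned int tail;  // next slot to enqueue (advanced atomically by producers)
};
```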
Overview
1. Grab a ready task
2. Queue it
3. Retrieve & execute
4. Output
5. Resolve dependencies
6. Grab a new task
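A simplified, serialized view of how a persistent worker TB could walk these six steps is sketched below. The helper and context names are assumptions (possible queue-helper implementations appear under Queue Access further on), and the real framework overlaps scheduling with execution rather than running everything on one thread:

```cuda
__device__ Task *try_dequeue(TaskQueue *q);     // see queue sketch below
__device__ void enqueue(TaskQueue *q, Task *t);
struct WorkerContext;                           // see API sketch below
__device__ void user_task(WorkerContext *ctx, Task *t);

// Persistent worker loop, shown as executed by thread 0 of a worker TB.
__device__ void worker_loop(WorkerContext *ctx, TaskQueue *my_queue,
                            Task *tasks, int *tasks_left) {
    while (atomicAdd(tasks_left, 0) > 0) {           // until all tasks retire
        Task *t = try_dequeue(my_queue);             // (1)-(3) grab a ready task
        if (t == nullptr) continue;                  // nothing ready: keep polling

        user_task(ctx, t);                           // (3)-(4) execute & output

        for (int c = 0; c < t->num_children; ++c) {  // (5) resolve dependencies
            Task *child = &tasks[t->children[c]];
            // The last finishing parent sees the counter drop to zero
            // and enqueues the now-ready child.        (6)
            if (atomicSub(&child->num_deps, 1) == 1)
                enqueue(my_queue, child);
        }
        atomicSub(tasks_left, 1);
    }
}
```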
Concurrent Worker & Scheduler
[Figure: worker and scheduler running concurrently]
Queue Access & Dependency Tracking
- IQS and OQS
- Efficient signaling mechanism via global memory
- Queues store pointers to tasks
- Parallel task pointer retrieval
- Parallel dependency check
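One way the pointer-based queues and their global-memory signaling could look is sketched below. The slide's "parallel task pointer retrieval" and "parallel dependency check" mean the whole worker TB cooperates on these operations; this single-thread version serializes them for clarity:

```cuda
// Possible implementations of the queue helpers used by worker_loop.
// Multiple producers advance tail atomically; each queue has one consumer.

__device__ void enqueue(TaskQueue *q, Task *t) {
    unsigned int slot = atomicAdd(&q->tail, 1u) % QUEUE_CAPACITY;
    __threadfence();     // make the task's fields visible before the pointer
    q->slots[slot] = t;  // publishing a non-null pointer is the signal
}

__device__ Task *try_dequeue(TaskQueue *q) {
    if (q->head == q->tail) return nullptr;         // nothing enqueued
    unsigned int slot = q->head % QUEUE_CAPACITY;
    Task *t = ((Task * volatile *)q->slots)[slot];  // re-read from global memory
    if (t == nullptr) return nullptr;               // producer still publishing
    q->slots[slot] = nullptr;                       // consume the slot
    q->head++;                                      // single consumer: plain write
    return t;
}
```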
Queue Insertion Policy
- Round robin: better load balancing, but poor cache locality
- Tail submit [J. Hoogerbrugge et al.]: the first child task is always processed by the same SM as its parent; increased locality
[Figure: queue states at times t, t+1, t+2]
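Both policies boil down to choosing a target queue for each newly ready child. A sketch (function name and arguments are assumptions):

```cuda
// Pick the queue a newly ready child task is inserted into.
// sm_id: the producing worker's SM/queue; num_sms: number of queues.
__device__ int pick_queue(int sm_id, int num_sms, int child_index,
                          unsigned int *rr_counter, bool tail_submit) {
    if (tail_submit && child_index == 0)
        return sm_id;  // tail submit: first child stays on the parent's SM
    return atomicAdd(rr_counter, 1u) % num_sms;  // round robin across SMs
}
```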
API
- Application-specific data is added under WorkerContext and Task
- user_task is called by worker_kernel
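The slide only names the API's pieces; the sketch below guesses at how an application might plug into them, reusing the Task/TaskQueue/worker_loop sketches above. Field names and signatures are assumptions, not the framework's actual interface:

```cuda
// Application-specific worker state goes under WorkerContext...
struct WorkerContext {
    float *grid;   // e.g., the Heat 2D surface (assumption)
    int tiles_x;   // tiles per row (assumption)
};
// ...and per-task data rides in Task::payload.

// user_task: the application body, called by worker_kernel for every task
// the worker retrieves. All threads of the worker TB execute it, so one
// task (e.g., one Heat 2D tile) can be processed with intra-task parallelism.
__device__ void user_task(WorkerContext *ctx, Task *t) {
    // ... update the tile identified by t->payload using ctx->grid ...
}

// worker_kernel: launched with one persistent TB per SM; it runs the
// scheduling loop, which invokes user_task on each ready task.
// blockIdx.x stands in for the SM id in this sketch; real hardware does
// not guarantee TB-to-SM placement (see Future Work).
__global__ void worker_kernel(WorkerContext *ctx, TaskQueue *queues,
                              Task *tasks, int *tasks_left) {
    worker_loop(ctx, &queues[blockIdx.x], tasks, tasks_left);
}
```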
Experimental Results
- NVIDIA Tesla C2050: 14 SMs, 3 GB memory
- Applications:
  - Heat 2D: simulation of heat dissipation over a 2D surface
  - BFS: breadth-first search
- Comparison: central queue vs. distributed queues
Applications: Heat 2D
- Regular dependencies, wavefront parallelism
- Each tile is a task; both intra-tile and inter-tile parallelism
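The wavefront structure can be encoded directly as dependency counts and child links. A hypothetical host-side setup using the Task sketch above, with row-major tiles:

```cuda
// Build the Heat 2D task graph (illustrative). Tile (i,j) depends on its
// north (i-1,j) and west (i,j-1) neighbors, which yields diagonal
// wavefront parallelism; only the corner tile starts ready.
void build_heat2d_tasks(Task *tasks, int tiles_x, int tiles_y) {
    for (int i = 0; i < tiles_y; ++i) {
        for (int j = 0; j < tiles_x; ++j) {
            Task *t = &tasks[i * tiles_x + j];
            t->num_deps = (i > 0) + (j > 0);
            t->num_children = 0;
            if (j + 1 < tiles_x)  // east neighbor depends on this tile
                t->children[t->num_children++] = i * tiles_x + (j + 1);
            if (i + 1 < tiles_y)  // south neighbor depends on this tile
                t->children[t->num_children++] = (i + 1) * tiles_x + j;
        }
    }
}
```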
Applications: BFS
- Irregular dependencies
- The unreached neighbors of a node form a task
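A sketch of how that might look as a user task: atomicCAS claims each unreached neighbor exactly once, and every claimed neighbor becomes a new, immediately ready task. The CSR arrays, level array, and pool allocator are all assumptions:

```cuda
// BFS mapped onto dynamic tasks (illustrative). A task visits one node;
// levels[] doubles as the visited flag, with -1 marking "unreached".
__device__ void bfs_user_task(const int *row_ptr, const int *col_idx,
                              int *levels, int node, TaskQueue *q,
                              Task *task_pool, int *pool_top) {
    int next_level = levels[node] + 1;
    for (int e = row_ptr[node]; e < row_ptr[node + 1]; ++e) {
        int nbr = col_idx[e];
        // Claim the neighbor exactly once across all concurrent tasks.
        if (atomicCAS(&levels[nbr], -1, next_level) == -1) {
            Task *t = &task_pool[atomicAdd(pool_top, 1)];  // dynamic creation
            t->num_deps = 0;                    // no parents: ready immediately
            t->num_children = 0;
            t->payload = (void *)(size_t)nbr;   // encode the node id (sketch)
            enqueue(q, t);
        }
    }
}
```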
Runtime
[Figure: runtime results]
Scalability
[Figure: scalability results]
Future Work
- S/W support for:
  - Better task representation
  - More task insertion policies
  - Automated task-graph partitioning for higher SM utilization
Future Work
- H/W support for:
  - Fast inter-TB synchronization
  - TB-to-SM affinity
  - "Sleep" support for TBs
Conclusion
- Transition from SIMD -> MIMD
- Task-based execution model
- Per-SM task assignment
- In-GPU dependency tracking
- Locality-aware queue management
- Room for improvement with added HW and SW support