Published by Ginger Collins. Modified over 9 years ago.
1
GPUDet: A Deterministic GPU Architecture
Hadi Jooybar (1), Wilson Fung (1), Mike O’Connor (2), Joseph Devietti (3), Tor M. Aamodt (1)
(1) The University of British Columbia, (2) AMD Research, (3) University of Washington
2
GPUs are …
  Fast
  Energy efficient
  Commodity hardware
But…
  × Mostly used for a certain range of applications
Why?
  Communication among concurrent threads
  1000s of threads
3
Motivation
BFS algorithm, published in HiPC 2007

    __global__ void BFS_step_kernel(...) {
        if (active[tid]) {
            active[tid] = false;
            visited[tid] = true;
            foreach (int id = neighbour_nodes) {
                if (visited[id] == false) {
                    cost[id] = cost[tid] + 1;
                    active[id] = true;
                    *over = true;
                }
            }
        }
    }

[Figure: graph with nodes V0, V1, V2 in three states: (a) two nodes with Cost = -, Active = -; (b) both with Cost = 1, Active = 1; (c) one with Cost = 1 and the other with Cost = 2, both with Active = 1.]
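The kernel on this slide writes cost[id] without synchronization, so the result can depend on which thread runs first. The following is a minimal host-side sketch in C++ (not CUDA) that replays the second BFS step under two thread orders. Several things here are assumptions that the slide does not state: the graph is a triangle (edges V0-V1, V0-V2, V1-V2), V0 is the source with cost 0, each thread body runs to completion before the next starts (coarser than real GPU interleaving), and the names BfsState, bfs_thread, and cost_of_v2_after_step2 are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of the per-thread state used by the slide's kernel.
struct BfsState {
    std::vector<bool> active, visited;
    std::vector<int> cost;
    bool over = false;
};

// Body of one thread `tid`, mirroring the kernel on the slide.
void bfs_thread(BfsState& s, const std::vector<std::vector<int>>& adj, int tid) {
    assert(tid >= 0 && static_cast<std::size_t>(tid) < adj.size());
    if (!s.active[tid]) return;
    s.active[tid] = false;
    s.visited[tid] = true;
    for (int id : adj[tid]) {
        if (!s.visited[id]) {
            s.cost[id] = s.cost[tid] + 1;  // unsynchronized write: the race
            s.active[id] = true;
            s.over = true;
        }
    }
}

// Replays the second BFS step with threads run in `order` and
// returns the resulting cost of V2.
int cost_of_v2_after_step2(const std::vector<int>& order) {
    // Assumed topology: a triangle over V0, V1, V2.
    const std::vector<std::vector<int>> adj = {{1, 2}, {0, 2}, {0, 1}};
    BfsState s;
    // Assumed state after step 1: V0 visited, V1 and V2 active with cost 1.
    s.active = {false, true, true};
    s.visited = {true, false, false};
    s.cost = {0, 1, 1};
    for (int tid : order) bfs_thread(s, adj, tid);
    return s.cost[2];
}
```

In this model, running V1's thread before V2's raises the cost of V2 to 2, while the opposite order leaves it at 1. That is the kind of run-to-run variation the talk is motivated by.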
4
Motivation
What about debuggers?!
  "I will debug it this time"
  The bug may appear only occasionally, or in a different place in each run.
  "OMG! Where was that bug?!"
5
GPUDet
Strong determinism (hardware proposal)
  Same outputs
  Same execution path
Makes the program easier to
  Debug
  Test
6
Motivation
BFS algorithm, published in HiPC 2007

    __global__ void BFS_step_kernel(...) {
        if (active[tid]) {
            active[tid] = false;
            visited[tid] = true;
            foreach (int id = neighbour_nodes) {
                if (visited[id] == false) {
                    cost[id] = cost[tid] + 1;
                    active[id] = true;
                    *over = true;
                }
            }
        }
    }

[Figure: graph with nodes V0, V1, V2; one node with Cost = 1 and the other with Cost = 2, both with Active = 1.]
7
GPUDet
Strong determinism
  Same outputs
  Same execution path
Makes the program easier to
  Debug
  Test
× There is no free lunch
× Performance overhead
Our goal is to provide deterministic execution on GPU architectures with acceptable performance overhead.
8
GPU Architecture
[Figure: GPU block diagram with labels: CPU, kernel launch, workgroups (workgroup 0, workgroup 1, workgroup 2), compute unit, ALU, memory unit, L1 cache, L2 cache, DRAM.]
Example per-thread code:
    x = input[threadID];
    y = func(x);
    output[threadID] = y;
9
Outline
  Introduction
  GPU Architecture
  Challenges
  Deterministic Execution with GPUDet
  GPUDet Optimizations
    Workgroup-Aware Quantum Formation
    Deterministic parallel commit using Z-Buffer Unit
    Compute Unit level serialization
  Results and Conclusion
10
Deterministic GPU Execution Challenges
[Figure: threads T0 to T3 under normal execution, and under deterministic execution divided into quanta (Quantum 0 … Quantum n), each with an isolation phase followed by a communication phase.]
  Isolation mechanism
  Provide a method to pause execution of a thread
11
Deterministic GPU Execution Challenges
  Isolation mechanism
    Lack of private caches
    Lack of cache coherency
  Provide a method to pause execution of a thread
    Single Instruction Multiple Threads (SIMT)
    Potential deadlock condition
    Major changes in control flow hardware
    Performance overhead
[Figure: workgroup n and its wavefronts.]
12
Deterministic GPU Execution Challenges
  Very large number of threads
    Expensive global synchronization
    Expensive serialization
  Different program properties
    Large number of short-running threads
    Frequent workgroup synchronization
    Less locality in intra-thread memory accesses
14
Deterministic Execution of a Wavefront
    if (tid < 16) x[tid % 2] = tid;
[Figure: a data race within one wavefront. Threads T0, T1, T2, …, T15 issue x[0] = 0, x[1] = 1, x[0] = 2, …, x[1] = 15 to the coalescing unit, which emits one write to address x with mask (v, v, -, …, -) and data (14, 15, -, …, -). Memory receives x[0] = 14 and x[1] = 15; all other elements are not modified.]
Execution of one wavefront is deterministic
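The coalescing behaviour on this slide can be modelled in a few lines. The sketch below is a host-side C++ model, not hardware or CUDA code. It assumes that conflicting stores from one wavefront instruction are resolved by letting the highest lane win, which is consistent with the slide's result (x[0] = 14, x[1] = 15) but is not spelled out there. The wavefront width of 32 and the names LaneStore, coalesce_wavefront_stores, and slide_example_stores are also assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kWavefrontSize = 32;  // assumed width; not given on the slide

// One store issued by one lane of a wavefront instruction.
struct LaneStore {
    int lane;
    std::size_t index;
    int value;
};

// Merges all stores of one wavefront instruction into `memory`.
// Stores are applied in ascending lane order, so the highest lane wins on
// a conflict regardless of the order in which `stores` was supplied.
// Returns the write mask (true where an element was written).
std::vector<bool> coalesce_wavefront_stores(std::vector<LaneStore> stores,
                                            std::vector<int>& memory) {
    std::sort(stores.begin(), stores.end(),
              [](const LaneStore& a, const LaneStore& b) { return a.lane < b.lane; });
    std::vector<bool> mask(memory.size(), false);
    for (const LaneStore& s : stores) {
        assert(s.index < memory.size());
        memory[s.index] = s.value;
        mask[s.index] = true;
    }
    return mask;
}

// Stores produced by `if (tid < 16) x[tid % 2] = tid;` for one wavefront.
std::vector<LaneStore> slide_example_stores() {
    std::vector<LaneStore> stores;
    for (int tid = 0; tid < kWavefrontSize; ++tid) {
        if (tid < 16) {
            stores.push_back({tid, static_cast<std::size_t>(tid % 2), tid});
        }
    }
    return stores;
}
```

Because the winner depends only on lane numbers, which are fixed by the program, the merged write is the same on every run even though the source code contains a data race.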
15
Deterministic GPU Execution Challenges
[Figure: threads T0 to T3 alternating between isolation and communication phases.]
  Isolation mechanism
  Provide a method to pause execution of a thread
    At wavefront granularity, not a challenge anymore
16
GPUDet-Basic
[Figure: execution of wavefronts in GPUDet-Basic, with labels: wavefronts, load op, global memory (read only), store buffers, local memory, commit, atomic op.]
Reaching Quantum Boundary
  1. Instruction count
  2. Atomic operations
  3. Memory fences
  4. Workgroup barriers
  5. Execution complete
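The five quantum-boundary conditions listed on this slide can be read as a per-wavefront check. The following C++ sketch is one hypothetical way to express that check. The instruction limit is a free parameter (the slide gives no value), the names InstrKind and QuantumTracker are invented for illustration, and the sketch does not say whether the instruction that ends a quantum executes in that quantum or the next, since the slide does not specify it.

```cpp
#include <cassert>
#include <cstdint>

// Instruction classes that matter for quantum formation (assumed encoding).
enum class InstrKind { Ordinary, AtomicOp, MemoryFence, WorkgroupBarrier, Exit };

// Tracks one wavefront's progress through its current quantum.
class QuantumTracker {
public:
    explicit QuantumTracker(std::uint32_t limit) : limit_(limit) {
        assert(limit > 0);
    }

    // Returns true if the wavefront reaches its quantum boundary at this
    // instruction: an atomic, fence, barrier, or exit ends the quantum
    // immediately; an ordinary instruction ends it once the count hits the limit.
    bool ends_quantum(InstrKind kind) {
        if (kind != InstrKind::Ordinary) return true;
        ++count_;
        return count_ >= limit_;
    }

    // Resets the instruction count when the next quantum begins.
    void start_new_quantum() { count_ = 0; }

private:
    std::uint32_t limit_;
    std::uint32_t count_ = 0;
};
```

All five conditions depend only on the instruction stream of the wavefront itself, not on timing, which is what makes the boundary position repeatable across runs.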
18
Workgroup-Aware Quantum Formation
  Extra global synchronizations
  Load imbalance
  Reducing number of synchronizations
  Avoid unnecessary quantum termination
19
Workgroup-Aware Quantum Formation
Workgroup-aware decision making
  Quanta are finished by workgroup barriers
  All reach a workgroup barrier
  Continue execution in the parallel mode
20
Workgroup-Aware Quantum Formation
Workgroup-aware decision making
  Finish execution of the kernel function
  Deterministic workgroup partitioning
21
Deterministic Parallel Commit using the Z-Buffer Unit
[Figure: Z-Buffer Unit depth buffer animation. All entries start at ∞ and are then overwritten by depth values 7, 8, and 5 in successive frames; wherever two values compete, the smaller one is kept.]
  Store buffer contents ≈ color values
  Wavefront ID ≈ depth values
22
Compute Unit Level Serialization
  GPUs preserve point-to-point ordering
  Serialization is only among compute units
24
Results
  Simulator: GPGPU-Sim 3.0.2
  2x slowdown
  Applications with atomic operations
25
Quantum Formation
  20% performance improvement for applications with barriers
  19% performance improvement for applications with small kernel functions
26
Deterministic Parallel Commit using the Z-Buffer Unit
  60% performance improvement on average
27
Compute Unit Level Serialization
  6.1x performance improvement in serial mode
28
Conclusion
  Encourages programmers to use GPUs in a broader range of applications
  Exploits GPU characteristics to reduce performance overhead
    Deterministic execution within a wavefront
    Workgroup-aware quantum formation
    Deterministic parallel commit using Z-Buffer Unit
    Compute Unit level serialization
Questions?
29
    if (tid == 0) x = 0;
    else if (tid == 1) x = 1;
  Racy code in the CPU multi-threaded programming model: a data race
  SIMT execution within a wavefront: different instructions, handled by the SIMT stack
The execution order of instructions within a wavefront is deterministic
30
Deterministic Parallel Commit using the Z-Buffer Unit
  The Z-Buffer Unit manages the Z-buffer to ensure that each pixel on the screen displays the color of the foremost triangle covering that pixel
  The Z-Buffer Unit allows out-of-order writes to produce a deterministic result
  GPUDet uses the wavefront ID as the depth value for Z-Buffer operations
31
Deterministic Parallel Commit Using Z-Buffer Unit
[Figure: the store buffers of wavefronts W0, W1, W2 send stores through the interconnect to the Z-Buffer Unit in the memory partition (L2 cache, DRAM interface). Each store carries a depth D: A := 6 with D(A) := 0, A := 2 with D(A) := 1, B := 7 with D(B) := 1, B := 2 with D(B) := 2. The unit keeps a table of (location, value, depth), initially (A, -, ∞) and (B, -, ∞). A depth comparison is applied to each arriving store; intermediate states include (A, 2, 1) and (A, 6, 0), and the final table is (A, 6, 0), (B, 7, 1).]
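The depth comparison on this slide can be checked with a small host-side model. The C++ sketch below is an illustration under stated assumptions rather than the hardware design. It keeps a store only if its depth is strictly smaller than the depth already recorded for that location, and it assumes that no two stores to the same location share a depth (each wavefront ID is unique and a store buffer holds at most one value per location). The names Store, Entry, DepthTable, commit, commit_all, and all_arrival_orders_agree are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <map>
#include <numeric>
#include <string>
#include <vector>

// One store leaving a wavefront's store buffer; depth = wavefront ID.
struct Store {
    std::string loc;
    int value;
    int depth;
};

// Table entry; the maximum int stands in for the initial depth of infinity.
struct Entry {
    int value = 0;
    int depth = std::numeric_limits<int>::max();
};

bool operator==(const Entry& a, const Entry& b) {
    return a.value == b.value && a.depth == b.depth;
}

using DepthTable = std::map<std::string, Entry>;

// Depth test: the store with the smallest depth wins, whenever it arrives.
void commit(DepthTable& table, const Store& s) {
    assert(s.depth >= 0);
    Entry& e = table[s.loc];
    if (s.depth < e.depth) {
        e.value = s.value;
        e.depth = s.depth;
    }
}

// Applies the stores in the given arrival order and returns the final table.
DepthTable commit_all(const std::vector<Store>& arrival_order) {
    DepthTable table;
    for (const Store& s : arrival_order) commit(table, s);
    return table;
}

// Tries every permutation of arrival order and checks the result is the same.
bool all_arrival_orders_agree(const std::vector<Store>& stores) {
    std::vector<std::size_t> order(stores.size());
    std::iota(order.begin(), order.end(), 0);
    const DepthTable reference = commit_all(stores);
    do {
        std::vector<Store> permuted;
        for (std::size_t i : order) permuted.push_back(stores[i]);
        if (!(commit_all(permuted) == reference)) return false;
    } while (std::next_permutation(order.begin(), order.end()));
    return true;
}
```

With the four stores from the slide, every arrival order ends with A = 6 at depth 0 and B = 7 at depth 1, which is why the commit can proceed in parallel without fixing the order in which stores cross the interconnect.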