(1) Introduction to Control Divergence Lecture slides and figures contributed from sources as noted

(2) Objective Understand the occurrence of control divergence and the concept of thread reconvergence (also described as branch divergence and thread divergence). Cover a basic thread reconvergence mechanism, PDOM. Set up the discussion of further optimizations and advanced techniques. Explore one approach for mitigating the performance degradation due to control divergence: dynamic warp formation.

(3) Reading W. Fung, I. Sham, G. Yuan, and T. Aamodt, "Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware," ACM TACO, June 2009. W. Fung and T. Aamodt, "Thread Block Compaction for Efficient SIMT Control Flow," International Symposium on High Performance Computer Architecture, 2011. M. Rhu and M. Erez, "CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures," ISCA 2012. Figure from M. Rhu and M. Erez, "Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation," ISCA 2013.

(4) Handling Branches CUDA code: if(…) … (true for some threads) else … (true for others). What if threads take different branches? Branch divergence: some threads follow the taken path while others follow the not-taken path.
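A minimal CUDA sketch of such a divergent branch (the kernel name and array names are illustrative, not taken from the slides):

    __global__ void branchExample(const int *in, int *out, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n) {
            // Threads of the same warp may evaluate this condition differently,
            // so the hardware serializes the taken and not-taken paths.
            if (in[tid] > 0)
                out[tid] = 2 * in[tid];   // "taken" path
            else
                out[tid] = 0;             // "not taken" path
        }
    }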

(5) Branch Divergence Occurs within a warp. Branches lead to serialization of the branch-dependent code. Performance issue: low warp utilization. In the if(…) {…} else {…} example, threads on the path not currently executing sit idle until reconvergence.

Branch Divergence (courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt). [Figure: a warp of threads 1-4 sharing a common PC diverges across basic blocks A-G.] Different threads follow different control-flow paths through the kernel code. Thread execution is (partially) serialized: subsets of threads that follow the same path execute in parallel.

(7) Basic Idea Split: partition a warp into two mutually exclusive thread subsets, each branching to a different target; identify the subsets with two activity masks, effectively creating two warps. Join: merge two subsets of a previously split warp, reconverging the mutually exclusive sets of threads. Orchestrate correct execution for nested branches. Note the long history of such techniques in SIMD processors (see the background in Fung et al.).

(8) Thread Reconvergence Fundamental problem: merge threads with the same PC. How do we sequence the execution of threads, since this can affect the ability to reconverge? Question: when can threads productively reconverge? Question: when is the best time to reconverge?

(9) Dominator Node d dominates node n if every path from the entry node to n must go through d.

(10) Immediate Dominator Node d immediately dominates node n if d dominates n and no other dominator of n sits between d and n.

(11) Post Dominator Node d post-dominates node n if every path from node n to the exit node must go through d.

(12) Immediate Post Dominator Node d immediately post-dominates node n if every path from node n to the exit node must go through d and no other post-dominator of n sits between d and n. A small worked example follows.
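To make these definitions concrete, here is a small host-side C++ sketch (compilable with g++ or nvcc) that computes post-dominator sets by fixpoint iteration for a CFG shaped like the A-G example in the figures; the exact edge list is an assumption for illustration only.

    #include <bitset>
    #include <cstdio>
    #include <vector>

    // Nodes A..G of an if/else-style CFG (edges assumed for illustration):
    // A->B, A->F, B->C, B->D, C->E, D->E, E->G, F->G; G is the exit.
    enum { A, B, C, D, E, F, G, N };

    int main() {
        std::vector<std::vector<int>> succ(N);
        succ[A] = {B, F}; succ[B] = {C, D}; succ[C] = {E};
        succ[D] = {E};    succ[E] = {G};    succ[F] = {G}; succ[G] = {};

        // pdom(exit) = {exit}; pdom(n) = {n} U intersection of pdom(s) over successors s.
        std::vector<std::bitset<N>> pdom(N);
        for (int n = 0; n < N; ++n) pdom[n].set();   // start from "all nodes"
        pdom[G].reset(); pdom[G].set(G);

        bool changed = true;
        while (changed) {
            changed = false;
            for (int n = 0; n < N; ++n) {
                if (n == G) continue;
                std::bitset<N> next; next.set();
                for (int s : succ[n]) next &= pdom[s];   // meet over successors
                next.set(n);
                if (next != pdom[n]) { pdom[n] = next; changed = true; }
            }
        }
        // E post-dominates B, C, and D: it is the reconvergence point of the branch at B.
        printf("E post-dominates B? %d\n", (int)pdom[B][E]);
        return 0;
    }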

Baseline: PDOM (courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt). [Figure: a warp of four threads executes the CFG A-G with per-block active masks A/1111, B/1111, C/1001, D/0110, E/1111, G/1111. The per-warp stack holds entries of (reconvergence PC, next PC, active mask); after the branch at B the top-of-stack entries are (E, D, 0110) and (E, C, 1001) above the base entry (-, E, 1111), and entries are popped as each subset reaches E, so the warp executes A, B, the C and D subsets in turn, then E and G at full width.]

(14) Stack Entry [Figure: the same reconvergence-stack example, highlighting a single entry.] A stack entry is a specification of a group of active threads that will execute a given basic block. The natural nested structure of control flow exposes the use of stack-based serialization.

(15) More Complex Example From Fung et al., "Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware," ACM TACO, June 2009. Stack-based implementation for nested control flow: the stack entry's RPC is set to the IPDOM, and reconvergence occurs at the immediate post-dominator of the branch.

(16) Implementation [Figure: SIMT core pipeline (I-fetch, decode, I-buffer with pending warps, issue, register file, scalar pipelines, D-cache, writeback) together with the per-warp stack; from the GPGPU-Sim documentation, GPGPU-Sim_3.x_Manual#SIMT_Cores.] GPGPU-Sim model: implement the per-warp stack at the issue stage; acquire the active mask and PC from the TOS; scoreboard check prior to issue; register writeback updates the scoreboard and the ready bit in the instruction buffer; when RPC = next PC, pop the stack. Implications for instruction fetch?

(17) Implementation (2) warpPC (the next instruction) is compared to the reconvergence PC. On a branch: the reconvergence PC can be stored as part of the branch instruction, so the branch unit has the NextPC, TargetPC, and reconvergence PC needed to update the stack. On reaching a reconvergence point: pop the stack and continue fetching from the NextPC of the next entry on the stack. A sketch of this stack discipline follows.
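A minimal host-side sketch of the per-warp reconvergence stack described above (the data layout, function names, and PC values are illustrative assumptions, not GPGPU-Sim's actual code):

    #include <cstdint>
    #include <vector>

    struct StackEntry {
        uint32_t rpc;      // reconvergence PC (immediate post-dominator of the branch)
        uint32_t next_pc;  // PC this entry's threads execute next
        uint32_t mask;     // active mask, one bit per lane
    };

    struct SimtStack {
        std::vector<StackEntry> s;

        // On a divergent branch: rewrite the current entry to resume at the
        // reconvergence point, then push one entry per target taken by some thread.
        void diverge(uint32_t rpc, uint32_t taken_pc, uint32_t taken_mask,
                     uint32_t fallthru_pc, uint32_t fallthru_mask) {
            s.back().next_pc = rpc;
            if (fallthru_mask) s.push_back({rpc, fallthru_pc, fallthru_mask});
            if (taken_mask)    s.push_back({rpc, taken_pc, taken_mask});
        }

        // At issue: fetch PC and mask from the TOS; when the warp's next PC equals
        // the TOS reconvergence PC, pop and fall through to the entry below.
        void advance(uint32_t warp_next_pc) {
            if (!s.empty() && warp_next_pc == s.back().rpc) s.pop_back();
            else s.back().next_pc = warp_next_pc;
        }
    };

    int main() {
        SimtStack st;
        st.s.push_back({0xFFFFFFFF, 0xA0, 0xF});        // block A, 4 active lanes
        // Branch at B: lanes {0,3} go to C (0xC0), lanes {1,2} to D (0xD0), RPC = E (0xE0).
        st.diverge(0xE0, 0xC0, 0b1001, 0xD0, 0b0110);
        st.advance(0xE0);   // the C subset reaches E: pop, D subset is now at TOS
        st.advance(0xE0);   // the D subset reaches E: pop, base entry executes E with mask 0xF
        return 0;
    }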

(18) Can We Do Better? Warps are formed statically. Key idea of dynamic warp formation: maintain a pool of warps and show how they can be merged. At a high level, what are the requirements?

(19) Compaction Techniques Can we re-form warps so as to increase utilization? Basic idea: compaction, i.e., re-form warps with threads that follow the same control-flow path, increasing warp utilization. Two basic types of compaction techniques: inter-warp compaction, which groups threads from different warps, and intra-warp compaction, which groups threads within a warp, changing the effective warp size.

(20) Inter-Warp Thread Compaction Techniques Lecture slides and figures contributed from sources as noted

(21) Goal [Figure: Warps 0-5 split between the taken and not-taken paths of an if(…) {…} else {…} branch.] Merge threads?

(22) Reading W. Fung, I. Sham, G. Yuan, and T. Aamodt, "Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware," ACM TACO, June 2009. W. Fung and T. Aamodt, "Thread Block Compaction for Efficient SIMT Control Flow," International Symposium on High Performance Computer Architecture, 2011.

DWF: Example (courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt). [Figure: execution timelines for the baseline versus dynamic warp formation for two warps x and y, with per-block active masks A x/1111 y/1111, B x/1110 y/0011, C x/1000 y/0010, D x/0110 y/0001, E x/1110 y/0011, F x/0001 y/1100, G x/1111 y/1111. In the DWF timeline, a new warp is created from scalar threads of both warp x and warp y executing at basic block D.]

(24) How Does This Work? Criteria for merging: same PC, and complementary sets of active threads in the two warps (no lane is occupied in both). Recall that many warps of a thread block all execute the same code. What information do we need to merge two warps? Thread IDs and PCs. Ideally, how would you find and merge warps? A sketch of the merge test follows.
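A hedged sketch of the merge test implied above (illustrative only, not the paper's hardware): two warp fragments can be fused when they are at the same PC and their active masks do not conflict in any lane.

    #include <cstdint>

    struct WarpFragment {
        uint32_t pc;    // next PC of this group of threads
        uint32_t mask;  // one bit per SIMD lane holding an active thread
    };

    // Lane-aware merge test: same PC and no lane occupied in both fragments.
    bool canMerge(const WarpFragment &a, const WarpFragment &b) {
        return a.pc == b.pc && (a.mask & b.mask) == 0;
    }

    WarpFragment merge(const WarpFragment &a, const WarpFragment &b) {
        return { a.pc, a.mask | b.mask };
    }

Dropping the lane-conflict check would allow more merges, but threads would then have to move between lanes, with the register file consequences discussed below.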

DWF: Microarchitecture Implementation (courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt). [Figure: thread scheduler with a PC-Warp LUT, warp pool, warp allocator, warp update registers for the taken and not-taken targets, and issue logic. The structures assist in aggregating threads, identify available and occupied lanes, and point to the warp being formed.] Warps are formed dynamically in the warp pool; after commit, the PC-Warp LUT is checked and threads either merge into, or allocate, a newly forming warp.

DWF: Microarchitecture Implementation, continued (courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt). [Figure: a worked example of the PC-Warp LUT and warp pool merging thread IDs after the branch A: BEQ R2, B, illustrating the no-lane-conflict case.]

(27) Resource Usage Ideally we would like a small number of unique PCs in progress at a time, to minimize overhead. Warp divergence increases the number of unique PCs; this is mitigated via warp scheduling. Scheduling policies: FIFO; program counter (address variation as a measure of divergence); majority/minority (serve the most common PC versus help stragglers); post-dominator (catch up). A sketch of the majority policy follows.
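As one concrete illustration (a sketch under assumed data structures, not the paper's implementation), a majority policy simply issues the PC that currently has the most pending threads in the warp pool:

    #include <cstdint>
    #include <map>

    // Warp pool summarized as: PC -> number of pending threads at that PC.
    uint32_t pickMajorityPC(const std::map<uint32_t, int> &pool) {
        uint32_t best_pc = 0;
        int best_count = -1;
        for (const auto &entry : pool) {
            if (entry.second > best_count) {   // the most common PC wins
                best_count = entry.second;
                best_pc = entry.first;
            }
        }
        return best_pc;
    }

A minority variant would instead return the PC with the smallest nonzero count, helping stragglers catch up.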

(28) Hardware Consequences Expose the implications that warps have in the base design: the implications for register file access motivate lane-aware DWF, which avoids register bank conflicts. From Fung et al., "Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware," ACM TACO, June 2009.

(29) Relaxing Implications of Warps Thread swizzling: essentially remap work to threads so as to create more opportunities for DWF; this requires a deep understanding of algorithm behavior and data sets. Lane swizzling in hardware: provide limited connectivity between register banks and lanes, avoiding full crossbars.

(30) Summary Control-flow divergence is a fundamental performance limiter for SIMT execution. Dynamic warp formation is one way to mitigate these effects; we will look at several others. Any solution must balance a complex set of effects: memory behaviors, synchronization behaviors, and scheduler behaviors.

Thread Block Compaction W. Fung and T. Aamodt HPCA 2011

(32) Goal Overcome some of the disadvantages of dynamic warp formation: its sensitivity to scheduling, its breaking of implicit synchronization, and its reduction of memory-coalescing opportunities.

DWF Pathologies: Starvation (Wilson Fung, Tor Aamodt). Majority scheduling is the best-performing policy: prioritize the largest group of threads with the same PC. But it can starve the remaining threads, lowering SIMD efficiency. Another warp scheduler? Tricky, because of variable memory latency (1000s of cycles). Example: B: if (K > 10) C: K = 10; else D: K = 0; E: B = C[tid.x] + K;

(34) DWF Pathologies: Extra Uncoalesced Accesses (Wilson Fung, Tor Aamodt). Coalesced memory access is "memory SIMD," a first-order CUDA programmer optimization, and it is not preserved by DWF. For E: B = C[tid.x] + K;, the static warps touch memory blocks 0x100, 0x140, 0x180 with 3 accesses; after DWF shuffles threads across warps, the same loads generate 9 accesses. The L1 cache absorbs the redundant memory traffic, but at the cost of L1$ port conflicts.

(35) DWF Pathologies: Implicit Warp Sync. (Wilson Fung, Tor Aamodt). Some CUDA applications depend on the lockstep execution of "static warps," i.e., the threads of Warp 0, Warp 1, Warp 2, ... staying together. From W. Fung, I. Sham, G. Yuan, and T. Aamodt, "Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware," ACM TACO, June 2009.

(36) Performance Impact From W. Fung and T. Aamodt, "Thread Block Compaction for Efficient SIMT Control Flow," International Symposium on High Performance Computer Architecture, 2011.

(37) Thread Block Compaction (Wilson Fung, Tor Aamodt). Block-wide reconvergence stack: regroup threads within a block. Better reconvergence stack with likely convergence: converge before the immediate post-dominator. Robust: an average 22% speedup on divergent CUDA apps with no penalty on the others. [Figure: the per-warp stacks of Warps 0-2, each with (PC, RPC, active mask) entries, are replaced by one block-wide stack; its C entry yields compacted warps X and Y, its D entry yields compacted warps U and T, and its E entry restores the original warps 0-2.]

GPU Microarchitecture (Wilson Fung, Tor Aamodt). [Figure: SIMT cores, each with a SIMT front end (fetch, decode, schedule, branch), a SIMD datapath, and a memory subsystem (shared memory, L1 D$, texture $, constant $), connected through an interconnection network to memory partitions, each containing a last-level cache bank and an off-chip DRAM channel.] More details in the paper.

(39) Observation (Wilson Fung, Tor Aamodt). Compute kernels usually contain divergent and non-divergent (coherent) code segments. Coalesced memory accesses usually occur in the coherent code segments, so DWF offers no benefit there. [Figure: static warps in the coherent segments, dynamic warps between a divergence point and its reconvergence point, and warps reset to the static arrangement at the reconvergence point before coalesced loads/stores.]

(40) Thread Block Compaction (Wilson Fung, Tor Aamodt). At a branch or reconvergence point, all available threads arrive at the branch, which makes the scheme insensitive to warp scheduling. Run a thread block like a warp: the whole block moves between coherent and divergent code, with a block-wide stack to track execution paths and reconvergence. Warp compaction: regroup using all available threads; if there is no divergence, this gives the static warp arrangement. This addresses the DWF pathologies of starvation, implicit warp synchronization, and extra uncoalesced memory accesses.

(41) Thread Block Compaction (Wilson Fung, Tor Aamodt). Example: A: K = A[tid.x]; B: if (K > 10) C: K = 10; else D: K = 0; E: B = C[tid.x] + K; [Figure: the block-wide stack holds (PC, RPC, active threads) entries; over time all warps execute A, the compacted warps execute C and then D (each with RPC = E), and the re-expanded warps execute E.] A runnable version of this kernel is sketched below.
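The slide's running example as a complete, hypothetical CUDA kernel; the parameter names and launch shape are illustrative, and the output array is renamed Bout since B is used as a label in the slide:

    __global__ void tbcExample(const int *A, const int *C, int *Bout, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;
        int K = A[tid];          // A: K = A[tid.x];
        if (K > 10)              // B: divergent branch
            K = 10;              // C: one side of the branch
        else
            K = 0;               // D: other side of the branch
        Bout[tid] = C[tid] + K;  // E: reconvergence point (immediate post-dominator of B)
    }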

(42) Thread Block Compaction (Wilson Fung, Tor Aamodt). A barrier at every basic block?! (Idle pipeline.) Switch to warps from other thread blocks: multiple thread blocks run on a core, which is already the case in most CUDA applications. [Figure: execution timelines of Blocks 0-2 interleaving branch, warp compaction, and execution.]

(43) High Level View DWF: warp broken down every cycle and threads in a warp shepherded into a new warp (LUT and warp pool) TBC: warps broken down at potentially divergent points and threads compacted across the thread block

(44) Microarchitecture Modification (Wilson Fung, Tor Aamodt). The per-warp stack becomes a block-wide stack; the I-buffer plus TIDs becomes a warp buffer that stores the dynamic warps; and a new unit, the thread compactor, translates active masks into compacted dynamic warps. More detail in the paper. [Figure: fetch, I-cache, decode, warp buffer with scoreboard, issue, register file, ALU/MEM datapath, and the block-wide stack feeding the thread compactor with the branch target PC, valid bits, and active mask.]

(45) Microarchitecture Modification (2) [Figure: the thread compactor produces the number of compacted warps and the TIDs of those compacted warps.] From W. Fung and T. Aamodt, "Thread Block Compaction for Efficient SIMT Control Flow," International Symposium on High Performance Computer Architecture, 2011.

(46) Operation [Figure callouts: all threads mapped to the same lane form one column; when the pending-warp count reaches zero, compact via the priority encoders; pick one thread mapped to each lane; Warp 2 arrives first, creating two target entries; the next warps update the active mask.] From W. Fung and T. Aamodt, "Thread Block Compaction for Efficient SIMT Control Flow," International Symposium on High Performance Computer Architecture, 2011.

(47) Thread Compactor (Wilson Fung, Tor Aamodt). Convert the active mask from the block-wide stack into thread IDs in the warp buffer using an array of priority encoders, one per lane. A software sketch of this step follows.
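A software sketch of that compaction step (illustrative only; the hardware uses one priority encoder per lane): each output warp takes at most one thread per lane, drawn from that lane's column of candidate thread IDs.

    #include <cstdio>
    #include <vector>

    // candidates[lane] = thread IDs (within the block) of active threads mapped to that lane.
    // Returns compacted warps; each warp holds one TID per lane, or -1 if the lane is idle.
    std::vector<std::vector<int>> compact(std::vector<std::vector<int>> candidates) {
        const int lanes = (int)candidates.size();
        std::vector<std::vector<int>> warps;
        bool any = true;
        while (any) {
            any = false;
            std::vector<int> warp(lanes, -1);
            for (int l = 0; l < lanes; ++l) {
                if (!candidates[l].empty()) {        // "priority encoder": pick one candidate
                    warp[l] = candidates[l].back();
                    candidates[l].pop_back();
                    any = true;
                }
            }
            if (any) warps.push_back(warp);
        }
        return warps;
    }

    int main() {
        // 4 lanes; lane 0 has threads {0, 4}, lane 1 has {5}, lane 2 none, lane 3 has {3, 7}.
        auto warps = compact({{0, 4}, {5}, {}, {3, 7}});
        printf("compacted into %zu warps\n", warps.size());   // prints 2
        return 0;
    }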

(48) Likely-Convergence (Wilson Fung, Tor Aamodt). The immediate post-dominator is conservative: all paths from a divergent branch must merge there. Convergence can happen earlier, when any two of the paths merge. The reconvergence stack is extended to exploit this; for TBC it gives a 30% speedup on ray tracing. Example: while (i < K) { X = data[i]; A: if (X == 0) B: result[i] = Y; C: else if (X == 1) D: break; E: i++; } F: return result[i]; The rarely taken break at D makes F the iPDom of A, even though the other paths already merge at E. Details in the paper. A compilable version of this loop is sketched below.
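A hypothetical CUDA rendering of the slide's loop (data, result, Y, and K are illustrative; the data indexing is thread-dependent only so that the branches can actually diverge, and the final write stands in for the pseudo-code's return):

    __global__ void likelyConvergence(const int *data, int *result, int Y, int K)
    {
        int i = 0;
        while (i < K) {
            // assumes data has at least K * blockDim.x elements
            int X = data[i * blockDim.x + threadIdx.x];
            if (X == 0)              // A: divergent branch
                result[i] = Y;       // B
            else if (X == 1)         // C
                break;               // D: rarely taken early exit, so iPDom of A is F
            i++;                     // E: likely-convergence point (B and the C fall-through merge here)
        }
        result[0] = result[i];       // F: stands in for "return result[i]"; assumes K+1 result slots
    }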

(49) Likely-Convergence (2) (Wilson Fung, Tor Aamodt). NVIDIA uses a break instruction for loop exits, which handles the last example. Our solution: likely-convergence points; in this paper they are used only to capture loop breaks. [Figure: stack entries extended with a likely-convergence PC (LPC) and position (LPos), showing the B, C, and D entries converging at E before the final merge at F.]

(50) Likely-Convergence (3) [Figure callouts: check against the likely-convergence entry, then merge inside the stack.] From W. Fung and T. Aamodt, "Thread Block Compaction for Efficient SIMT Control Flow," International Symposium on High Performance Computer Architecture, 2011.

(51) Likely-Convergence (4) (Wilson Fung, Tor Aamodt). Applies to both the per-warp stack (PDOM) and thread block compaction (TBC). It enables more thread grouping for TBC; a side effect is reduced stack usage in some cases.

(52) Evaluation (Wilson Fung, Tor Aamodt). Simulation: GPGPU-Sim (2.2.1b), modeling a Quadro FX-class GPU with L1 & L2 caches. 21 benchmarks: all of the original GPGPU-Sim benchmarks, the Rodinia benchmarks, and other important applications: face detection from Visbench (UIUC), DNA sequencing (MUMMER-GPU++), molecular dynamics simulation (NAMD), and ray tracing from NVIDIA Research.

(53) Experimental Results (Wilson Fung, Tor Aamodt). Two benchmark groups: COHE = non-divergent CUDA applications, DIVG = divergent CUDA applications. [Results figure: DWF suffers serious slowdowns from its pathologies; relative to the per-warp stack baseline, TBC shows no penalty on COHE and a 22% speedup on DIVG.]

(54) Effect on Memory Traffic (Wilson Fung, Tor Aamodt). TBC still generates some extra uncoalesced memory accesses, but the second access hits in the L1 cache, so there is no change to overall memory traffic into and out of a core. [Figure: the memory-block example (0x100, 0x140, 0x180) and normalized memory stalls (0%-300%) for TBC, DWF, and the baseline.]

(55) Conclusion (Wilson Fung, Tor Aamodt). Thread block compaction addresses some key challenges of DWF and is one significant step closer to reality. It benefits from advancements to the reconvergence stack (likely-convergence points) and is extensible: other stack-based proposals can be integrated.

CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures M. Rhu and M. Erez ISCA 2012

(57) Goals Improve the performance of inter-warp compaction techniques Predict when branches diverge  Borrow philosophy from branch prediction Use prediction to apply compaction only when it is beneficial

(58) Issues with Thread Block Compaction TBC: warps are broken down at potentially divergent points and threads compacted across the thread block, using an implicit barrier across warps to collect compaction candidates. The barrier synchronization overhead cannot always be hidden. When it works, it works well.

(59) Divergence Behavior: A Closer Look The loop executes a fixed number of times; the highlighted branch is compaction-ineffective. Figure from M. Rhu and M. Erez, "CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures," ISCA 2012.

(60) Compaction-Resistant Branches Control flow graph. Figure from M. Rhu and M. Erez, "CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures," ISCA 2012.

(61) Compaction-Resistant Branches (2) Control flow graph. Figure from M. Rhu and M. Erez, "CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures," ISCA 2012.

(62) Impact of Ineffective Compaction Threads are shuffled around with no performance improvement; worse, this can lead to increased memory divergence. Figure from M. Rhu and M. Erez, "CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures," ISCA 2012.

(63) Basic Idea Only stall and compact when there is a high probability of compaction success; otherwise allow warps to bypass the (implicit) barrier. A compaction-adequacy predictor: think branch prediction!

(64) Example: TBC vs. CAPRI Bypassing enables increased overlap of memory references. Figure from M. Rhu and M. Erez, "CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures," ISCA 2012.

(65) CAPRI: Example [Figure callouts, in order: no divergence, no stalling; a warp diverges, stalls, and initializes the history, so all other warps now stall; a warp diverges with history available, so it predicts and updates the prediction; all other warps follow (one prediction per branch), and the CAPT is updated.] Figure from M. Rhu and M. Erez, "CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures," ISCA 2012.

(66) The Predictor Prediction uses the active masks of all warps, since we need to understand what could have happened; the actual compaction uses only the warps that actually stalled. The minimum number of warps achievable, given by the maximum number of available threads in any one lane, defines the compaction ability, i.e., the number of compacted warps, and the history predictor is updated accordingly. Figure from M. Rhu and M. Erez, "CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures," ISCA 2012. A rough sketch of such a predictor follows.
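A loose sketch of a compaction-adequacy predictor in this spirit (the table organization and 2-bit update rule here are assumptions, not CAPRI's exact design): the number of warps after compaction equals the maximum per-lane occupancy, and the branch's history entry records whether compacting beat issuing the diverged warps unchanged.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct CompactionPredictor {
        // Per-branch-PC 2-bit saturating counter; >= 2 means "predict compaction adequate".
        std::unordered_map<uint32_t, int> table;

        bool predict(uint32_t branch_pc) {
            auto it = table.find(branch_pc);
            return it == table.end() || it->second >= 2;   // unknown branch: stall and compact
        }

        // masks: active mask of each warp fragment that reached the branch target.
        void update(uint32_t branch_pc, const std::vector<uint32_t> &masks, int lanes) {
            int max_per_lane = 0;
            for (int l = 0; l < lanes; ++l) {
                int cnt = 0;
                for (uint32_t m : masks) cnt += (m >> l) & 1u;   // occupancy of lane l
                if (cnt > max_per_lane) max_per_lane = cnt;
            }
            // Compaction was adequate if it needed fewer warps than arrived diverged.
            bool adequate = max_per_lane < (int)masks.size();
            int &ctr = table[branch_pc];
            ctr = adequate ? (ctr < 3 ? ctr + 1 : 3) : (ctr > 0 ? ctr - 1 : 0);
        }
    };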

(67) Behavior Figure from M. Rhu and M. Erez, "CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures," ISCA 2012.

(68) Impact of Implicit Barriers The idle-cycle count helps us understand the negative effects of implicit barriers in TBC. Figure from M. Rhu and M. Erez, "CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures," ISCA 2012.

(69) Summary The synchronization overhead of thread block compaction can introduce performance degradation, and some branches diverge more than others. Apply TBC judiciously: predict when it is beneficial, i.e., effectively predict when inter-warp compaction will be effective.

M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013

(71) Goals Understand the limitations of compaction techniques and proximity to ideal compaction Provide mechanisms to overcome these limitations and approach ideal compaction rates

(72) Limitations of Compaction Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013

(73) Mapping Threads to Lanes: Today [Figure: a grid of thread blocks whose threads (0,0,0), (0,0,1), ... are linearized and packed into Warp 0, Warp 1, ...; each warp feeds a fixed set of scalar pipelines, each with its own register file slice (RF0 ... RFn-1).] Thread IDs are linearized and threads are assigned to lanes by modulo assignment (linear thread ID mod warp size). A short sketch follows.
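A short sketch of that default mapping (a 32-wide warp is assumed; the block dimensions and indices are purely illustrative):

    #include <cstdio>

    const int WARP_SIZE = 32;

    // Linearize a 3-D thread index within a block of dimensions (dx, dy, dz),
    // then derive the warp ID and lane by the usual modulo assignment.
    int linearTid(int tx, int ty, int tz, int dx, int dy) {
        return tz * dy * dx + ty * dx + tx;
    }

    int main() {
        int dx = 16, dy = 4;                      // blockDim = (16, 4, 1)
        int tid  = linearTid(5, 2, 0, dx, dy);    // thread (5, 2, 0) -> 37
        int warp = tid / WARP_SIZE;               // warp 1
        int lane = tid % WARP_SIZE;               // lane 5
        printf("tid=%d warp=%d lane=%d\n", tid, warp, lane);
        return 0;
    }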

(74) Data Dependent Branches Data-dependent control flow is less likely to produce lane conflicts. Figure from M. Rhu and M. Erez, "Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation," ISCA 2013.

(75) Programmatic Branches Programmatic branches can be correlated to lane assignments (with modulo assignment): program variables that act like constants across threads can produce correlated branching behavior. Figure from M. Rhu and M. Erez, "Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation," ISCA 2013.

(76) P-Branches vs. D-Branches P-branches are the problem! Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013

(77) Compaction Opportunities Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013 Lane reassignment can improve compaction opportunities

(78) Aligned Divergence Threads mapped to the same lane tend to evaluate (programmatic) predicates the same way; empirically, this alignment is rarely exhibited for input- or data-dependent control flow. Compaction cannot help in the presence of lane conflicts, so the performance of compaction mechanisms depends on both divergence patterns and lane conflicts. We need to understand the impact of lane assignment.

(79) Impact of Lane Reassignment Goal: Improve “compactability” Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013

(80) Random Permutations Random permutation does not always work well, but it works well on average; a better understanding of programs can lead to better permutation choices. Figure from M. Rhu and M. Erez, "Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation," ISCA 2013.

(81) Mapping Threads to Lanes: New [Figure: Warp 0 through Warp N-1 mapped onto a SIMD datapath of width 8, with per-lane scalar pipelines and register file slices.] What criteria do we use for lane assignment?

(82) Balanced Permutation Each lane has a single instance of a logical thread from each warp. Even warps: permutation within a half warp. Odd warps: swap the upper and lower halves. Figure from M. Rhu and M. Erez, "Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation," ISCA 2013.

(83) Balanced Permutation (2) Logical TID of 0 in each warp is now assigned a different lane Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013

(84) Characteristics Vertical balance: each lane only holds logical TIDs of distinct threads within a warp. Horizontal balance: logical TID x is bound to a different lane in each warp. This works when a CTA has fewer than SIMD_Width warps: why? Note that random permutations achieve this only on average. A minimal remapping with these two properties is sketched below.
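A minimal lane-remapping sketch that satisfies the two balance properties above (a simple per-warp rotation used only for illustration; the paper's balanced permutation is constructed differently, via half-warp permutation and half swapping):

    #include <cstdio>

    const int WARP_SIZE = 8;   // matches the SIMD width used in the figures

    // Assign logical thread t of warp w to a lane such that:
    //  - within each warp the mapping is a bijection (vertical balance), and
    //  - the same logical TID lands in different lanes across warps, as long as
    //    the CTA has at most WARP_SIZE warps (horizontal balance).
    int laneOf(int warp, int tid) {
        return (tid + warp) % WARP_SIZE;
    }

    int main() {
        for (int w = 0; w < 4; ++w)
            printf("warp %d: logical TID 0 -> lane %d\n", w, laneOf(w, 0));
        return 0;
    }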

(85) Impact on Memory Coalescing Modern GPUs do not require ordered requests; coalescing can occur across a set of requests, so specific lane assignments do not affect coalescing behavior. The increase in L1 miss rate is offset by the benefits of compaction. Figure from M. Rhu and M. Erez, "Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation," ISCA 2013.

(86) Speedup of Compaction Lane permutation can improve the compaction rate for divergence due to the majority of programmatic branches. Figure from M. Rhu and M. Erez, "Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation," ISCA 2013.

(87) Compaction Rate vs. Utilization Distinguish between compaction rate and utilization! Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013

(88) Application of Balanced Permutation [Figure: the SIMT core pipeline (I-fetch, decode, I-buffer with pending warps, issue, register file, scalar pipelines, D-cache, writeback).] The permutation is applied when the warp is launched and is maintained for the life of the warp; it does not affect the baseline compaction mechanism. SLP can be enabled or disabled to preserve target-specific, programmer-implemented optimizations.

(89) Summary Structural hazards limit the performance improvements from inter-warp compaction Program behaviors produce correlated lane assignments today Remapping threads to lanes enables extension of compaction opportunities

(90) Summary: Inter-Warp Compaction [Figure: the control-flow graph A-G and thread blocks.] Inter-warp compaction ties program properties to thread-block organization and calls for co-design of applications, resource management, software, and microarchitecture.