Intra-Warp Compaction Techniques
© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)
(2) Goal
[Figure: idle and active threads in a warp, before and after compaction]
- Compact threads in a warp to coalesce (and eliminate) idle cycles and improve utilization
(3) References
- V. Narasiman et al., "Improving GPU Performance via Large Warps and Two-Level Scheduling," MICRO 2011
- A. S. Vaidya et al., "SIMD Divergence Optimization Through Intra-Warp Compaction," ISCA 2013
Improving GPU Performance via Large Warps and Two-Level Scheduling (V. Narasiman et al., MICRO 2011)
(5) Goals
- Improve performance of divergent code via compaction of threads within a warp
- Integrate warp scheduling optimization with intra-warp compaction
(6) Resource Underutilization
- Due to control divergence
- Due to memory divergence
- Configuration shown: 32 warps, 32 threads per warp, single SM
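A minimal C++ sketch, not from either paper, of how control divergence leaves SIMD lanes idle: a divergent branch is serialized into two full-width issues, each carrying a per-lane active mask, so half the lane slots go unused in this example.

```cpp
#include <bitset>
#include <cstdio>

constexpr int WARP_SIZE = 32;
using ActiveMask = std::bitset<WARP_SIZE>;

int main() {
    // Hypothetical divergent branch: even lanes take the "then" path,
    // odd lanes take the "else" path.
    ActiveMask then_mask, else_mask;
    for (int lane = 0; lane < WARP_SIZE; ++lane)
        (lane % 2 == 0 ? then_mask : else_mask).set(lane);

    // Both paths are serialized; each issues a full-width SIMD instruction
    // with only half of its lanes active, so utilization drops to 50%.
    int active_slots = static_cast<int>(then_mask.count() + else_mask.count());
    int issued_slots = 2 * WARP_SIZE;  // two full-width issues
    std::printf("SIMD utilization: %.0f%%\n", 100.0 * active_slots / issued_slots);
    return 0;
}
```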
(7) Warp Scheduling and Locality
[Figure: two warp-issue timelines (Warp0, Warp1, ..., WarpN) around a long-latency load, contrasting scheduling orders]
- One order degrades memory reference locality and row-buffer locality but overlaps memory accesses with computation
- The other creates opportunities for memory coalescing but has the potential for exposing memory stalls
(8) Key Ideas
- Conventional design today: warp size = #SIMD lanes, with a round-robin (RR) fetch policy
- Use large warps: multi-cycle issue of sub-warps
- Compact threads within a warp to form fully utilized sub-warps
- Two-level scheduler to spread memory accesses in time and reduce memory-related stall cycles
(9) Large Warps
- Large warps are converted to a sequence of sub-warps issued over multiple cycles
[Figure: typical operation (warp size = SIMD width = 4) vs. proposed approach (warp size = 16, SIMD width = 4); each shows a register file feeding four scalar pipelines]
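A sketch of the large-warp organization under the slide's example parameters (warp size 16, SIMD width 4); the names are illustrative. Threads form a 2-D array whose rows are issued as sub-warps over successive cycles, so without compaction an instruction always costs one cycle per row regardless of how many threads are active.

```cpp
#include <array>

constexpr int SIMD_WIDTH = 4;   // execution lanes
constexpr int WARP_SIZE  = 16;  // large warp: 4 rows x 4 columns
constexpr int NUM_ROWS   = WARP_SIZE / SIMD_WIDTH;

struct LargeWarp {
    // active[row][col]: is thread (row, col) active for the current instruction?
    std::array<std::array<bool, SIMD_WIDTH>, NUM_ROWS> active{};
};

// Without compaction, a large warp simply issues one row (sub-warp) per cycle,
// so every instruction takes NUM_ROWS cycles even if most threads are idle.
int cyclesWithoutCompaction(const LargeWarp&) { return NUM_ROWS; }
```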
(10) Sub-Warp Compaction
- Iteratively select one active thread per column to create a packed sub-warp (see the sketch below)
- Sub-warps are generated dynamically
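A sketch of this compaction step, reusing the 4x4 layout from the previous sketch; names are illustrative. Each iteration picks at most one active thread from every column (so each register bank is accessed at most once per sub-warp) and packs the selections into one sub-warp, repeating until no active threads remain.

```cpp
#include <array>
#include <vector>

constexpr int SIMD_WIDTH = 4;
constexpr int NUM_ROWS   = 4;

struct SubWarp {
    // For each lane/column, which row the packed thread came from (-1 = lane idle).
    std::array<int, SIMD_WIDTH> source_row;
};

// Iteratively select one active thread per column to form packed sub-warps.
std::vector<SubWarp> compact(std::array<std::array<bool, SIMD_WIDTH>, NUM_ROWS> active) {
    std::vector<SubWarp> sub_warps;
    bool packed_any = true;
    while (packed_any) {
        packed_any = false;
        SubWarp sw;
        sw.source_row.fill(-1);
        for (int col = 0; col < SIMD_WIDTH; ++col) {
            for (int row = 0; row < NUM_ROWS; ++row) {
                if (active[row][col]) {          // first remaining active thread in
                    sw.source_row[col] = row;    // this column joins the sub-warp
                    active[row][col] = false;
                    packed_any = true;
                    break;
                }
            }
        }
        if (packed_any) sub_warps.push_back(sw);
    }
    // e.g., 7 active threads spread over the columns (at most 2 per column)
    // pack into 2 sub-warps instead of the 4 uncompacted rows.
    return sub_warps;
}
```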
(11) Impact on the Register File
[Figure: large-warp active mask organization; baseline register file vs. large-warp register file]
- Need separate decoders per bank
(12) Scheduling Constraints
- The next large warp cannot be scheduled until the first sub-warp completes execution
- Scoreboard checks for issue dependencies: a thread is not available for packing into a sub-warp unless its previous issue (sub-warp) has completed; a single-bit status makes this a simple check
- However, on a branch, all sub-warps must complete before the large warp is eligible for instruction fetch scheduling
- Re-fetch policy for conditional branches: must wait until the last sub-warp finishes
- Optimization for unconditional branch instructions: don't create multiple sub-warps, so sub-warping always completes in a single cycle
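A minimal sketch of the availability checks described above, with hypothetical names: a single busy bit per thread gates packing, and a pending conditional branch blocks re-fetch until every sub-warp has drained.

```cpp
#include <array>

constexpr int WARP_SIZE = 16;

struct LargeWarpStatus {
    std::array<bool, WARP_SIZE> busy{};  // one bit per thread: prior sub-warp still in flight?
    bool pending_branch = false;         // last issued instruction was a conditional branch

    // A thread is eligible for packing only if its previous issue has completed.
    bool threadAvailable(int tid) const { return !busy[tid]; }

    // After a conditional branch, re-fetch for this large warp must wait until
    // the last sub-warp finishes, because divergence is unknown until then.
    bool eligibleForFetch() const {
        if (!pending_branch) return true;
        for (bool b : busy)
            if (b) return false;
        return true;
    }
};
```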
(13) Effect of Control Divergence
- Divergence is unknown until all sub-warps execute; divergence management happens only on large-warp boundaries
- Need to buffer sub-warp state, e.g., active masks
- The last-warp effect: cannot fetch the next instruction of a warp until all of its sub-warps issue, so a trailing sub-warp (warp divergence effect) can lead to many idle cycles
- The last-thread effect: e.g., with data-dependent loop iteration counts across threads, the last thread can hold up reconvergence
(14) A Round-Robin Warp Scheduler
[Figure: round-robin issue timeline (Warp0, Warp1, ..., WarpN) around a load]
- Exploit inter-warp reference locality in the cache
- Exploit inter-warp reference locality in the DRAM row buffers
- However, need to maintain latency hiding
(15) Two-Level Round-Robin Scheduler
[Figure: 16 large warps (LW0 .. LW15) partitioned into four fetch groups of four; round-robin within a fetch group and across fetch groups]
- Fetch group size: enough to keep the pipeline busy (see the sketch below)
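A sketch of the two-level policy under the slide's configuration (16 large warps, fetch groups of 4); the readiness test and field names are assumptions. The scheduler round-robins within the active fetch group and moves to the next group only when no warp in the current group is ready, which staggers the points at which different groups reach their memory accesses.

```cpp
#include <array>
#include <optional>

constexpr int NUM_WARPS  = 16;
constexpr int GROUP_SIZE = 4;                     // fetch group: enough warps to fill the pipeline
constexpr int NUM_GROUPS = NUM_WARPS / GROUP_SIZE;

struct TwoLevelScheduler {
    std::array<bool, NUM_WARPS> ready{};  // warp has a fetched, non-stalled instruction
    int active_group = 0;                 // outer level: round-robin over fetch groups
    int rr_pointer   = 0;                 // inner level: round-robin within the group

    std::optional<int> pick() {
        for (int g = 0; g < NUM_GROUPS; ++g) {
            int group = (active_group + g) % NUM_GROUPS;
            for (int i = 0; i < GROUP_SIZE; ++i) {
                int w = group * GROUP_SIZE + (rr_pointer + i) % GROUP_SIZE;
                if (ready[w]) {
                    active_group = group;                       // switch groups only on demand
                    rr_pointer   = (w % GROUP_SIZE + 1) % GROUP_SIZE;
                    return w;
                }
            }
        }
        return std::nullopt;  // every warp is stalled
    }
};
```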
(16) Scheduler Behavior
- Need to set the fetch group size carefully: tune it to fill the pipeline
- Timeout on switching fetch groups to mitigate the last-warp effect
(17) Summary
- Intra-warp compaction is made feasible by multi-cycle warp execution
- The mismatch between warp size and SIMD width enables flexible intra-warp compaction
- Do not make warps too big: the last-thread effect begins to dominate
SIMD Divergence Optimization Through Intra-Warp Compaction (A. S. Vaidya et al., ISCA 2013)
(19) Goals
- Improve utilization in divergent code via intra-warp compaction
- Become familiar with Intel's Gen integrated general-purpose GPU architecture
(20) Integrated GPUs: Intel HD Graphics
[Figure from The Computer Architecture of the Intel Processor Graphics Gen9]
(21) Integrated GPUs: Intel HD Graphics
- Gen graphics processor
- 32-byte bidirectional ring with dedicated coherence signals
- Coherent distributed cache, shared with the GPU
- Operates as a memory-side cache
- Shared physical memory
[Figure from The Computer Architecture of the Intel Processor Graphics Gen9]
(22) Inside the Gen9 EU
- Architecture Register File (ARF): per-thread register state
- Up to 7 threads per EU; 128 registers of 256 bits each per thread (8-way SIMD)
- Each thread executes a kernel; threads may execute different kernels
- Multi-instruction dispatch
[Figure from The Computer Architecture of the Intel Processor Graphics Gen9]
(23) Operation (2)
- Dispatch 4 instructions from 4 threads, with constraints between issue slots
- Support both FP and integer operations: 4 32-bit FP operations, 8 16-bit integer operations, or 8 16-bit FP operations; MAD operations each cycle
- 96 bytes/cycle read BW, 32 bytes/cycle write BW
- Divergence/reconvergence management
[Figure from The Computer Architecture of the Intel Processor Graphics Gen9]
(24) Operation
- SIMD-8, SIMD-16, and SIMD-32 instructions are transformed into sequences of SIMD-4 operations
- A thread can intermix SIMD instructions of various lengths
[Figure from The Computer Architecture of the Intel Processor Graphics Gen9: SIMD-4, SIMD-8, and SIMD-16 instructions within a thread]
(25) Mapping the BSP Model
[Figure: grid of thread blocks, each block a 3-D array of threads, e.g., Block (1,1) with threads (0,0,0) .. (1,0,3)]
- Map multiple threads to a SIMD instance executed by an EU thread
- All threads in a thread block (TB) or work-group are mapped to the same EU thread (shared memory access)
[Figure from The Computer Architecture of the Intel Processor Graphics Gen9]
(26) Subslice Organization
- #EUs x #threads/EU determines the width of the slice
- Flexible data interface: scatter/gather support, memory request coalescing across 64-byte cache lines, shared memory access
[Figure from The Computer Architecture of the Intel Processor Graphics Gen9]
(27) Slice Organization
- Flexible partitioning: data cache, buffers for accelerators, shared memory
- Shared memory: 64 KB/slice, not coherent with other structures
[Figure from The Computer Architecture of the Intel Processor Graphics Gen9]
(28) Product Organization
- Load balancing: honor barrier and shared memory constraints
- Shared atomics implemented with the CPU
- Shared virtual memory: share pointer-rich data structures between CPU and GPU
- Coherent shared memory between CPU and GPU
[Figure from The Computer Architecture of the Intel Processor Graphics Gen9]
(29) Coherent Memory Hierarchy
[Figure from The Computer Architecture of the Intel Processor Graphics Gen9; some structures in the hierarchy are labeled as not coherent]
(30) Microarchitecture Operation
[Figure: generic pipeline with I-fetch, I-buffer, decode, issue, register/operand fetch, scalar pipelines, D-cache, and writeback stages]
- Per-thread operation with a per-thread scoreboard check
- Thread arbitration; dual issue every 2 cycles
- Operand fetch/swizzle: the swizzle is encoded in the RF access
- Instruction execution happens in waves of 4-wide operations
- Note: variable-width SIMD instructions
(31) Divergence Assessment
[Chart: SIMD efficiency for coherent vs. divergent applications]
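SIMD efficiency here can be read as the fraction of issued lane slots that are active over the dynamic instruction stream; a minimal C++ sketch of that metric, assuming per-instruction active masks are available (names are illustrative):

```cpp
#include <bit>      // std::popcount (C++20)
#include <cstdint>
#include <vector>

// SIMD efficiency: active lane slots divided by issued lane slots,
// accumulated over the dynamic instruction stream.
double simdEfficiency(const std::vector<uint32_t>& active_masks, int simd_width) {
    if (active_masks.empty()) return 1.0;
    uint64_t active = 0;
    for (uint32_t m : active_masks) active += std::popcount(m);
    return static_cast<double>(active) /
           (static_cast<double>(active_masks.size()) * simd_width);
}
```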
(32) Basic Cycle Compression
- Example: actual operation depends on data types and execution cycles per op
- Note the power/energy savings
[Figure: RF for a single operand]
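A sketch of basic cycle compression as described in the Vaidya et al. paper, assuming an instruction executes as successive 4-wide quads (as in the slide on waves of 4-wide operations): a quad whose four lanes are all inactive is skipped, saving its execution cycle.

```cpp
#include <cstdint>

constexpr int QUAD = 4;  // lanes executed per cycle

// Cycles needed for one SIMD instruction with basic cycle compression:
// each 4-lane quad costs a cycle unless all of its lanes are inactive.
int cyclesBCC(uint32_t active_mask, int simd_width) {
    int cycles = 0;
    for (int base = 0; base < simd_width; base += QUAD) {
        uint32_t quad_mask = (active_mask >> base) & 0xFu;
        if (quad_mask != 0) ++cycles;   // at least one active lane: quad must issue
    }
    return cycles;                      // fully idle quads are skipped
}
```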
(33) Swizzle Cycle Compression
(34) SCC Operation
- Can compact across quads: pack active lanes into a quad using the operand swizzle network
- Swizzle settings are overlapped with RF access
- Note the increased area and power/energy cost
[Figure: RF for a single operand, 128 bits / 4 lanes, issued over cycles i .. i+3]
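A sketch of swizzle cycle compression, assuming the operand swizzle network can route any active lane into any slot of the 4-wide quad: active lanes are packed densely, so the cycle count depends only on the number of active lanes rather than their positions.

```cpp
#include <bit>      // std::popcount (C++20)
#include <cstdint>
#include <vector>

constexpr int QUAD = 4;  // lanes executed per cycle

// Cycles for one SIMD instruction with swizzle cycle compression:
// active lanes are packed densely into quads, ceil(active / 4) cycles.
int cyclesSCC(uint32_t active_mask) {
    int active = std::popcount(active_mask);
    return (active + QUAD - 1) / QUAD;
}

// The swizzle itself: map each packed slot back to the source lane it reads from.
std::vector<int> packedLaneOrder(uint32_t active_mask, int simd_width) {
    std::vector<int> order;
    for (int lane = 0; lane < simd_width; ++lane)
        if (active_mask & (1u << lane)) order.push_back(lane);
    return order;  // slot k of quad k/4 reads from lane order[k]
}
```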
(35) Compaction Opportunities
[Figure: SIMD-16 and SIMD-8 examples with idle lanes; once active lanes fill every quad, no further compaction is possible]
- For K active threads, what is the maximum cycle savings for a SIMD-N instruction? (see the worked bound below)
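A worked bound for the question above, assuming a 4-wide execution array, N divisible by 4, and perfect packing of the K active lanes:

```latex
\text{max cycles saved} \;=\; \underbrace{\frac{N}{4}}_{\text{uncompacted}}
\;-\; \underbrace{\left\lceil \frac{K}{4} \right\rceil}_{\text{fully compacted}}
\qquad \text{e.g., } N = 16,\ K = 5:\ 4 - 2 = 2 \text{ cycles saved.}
```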
(36) Performance Savings
- There is a difference between saving cycles and saving time
- When is #cycles not equal to time? For example, when execution is memory bound, compressed execution cycles are hidden behind memory latency
(37) Summary
- Multi-cycle warp/SIMD/work-group execution
- Optimize #cycles/warp by compressing idle cycles
- Rearrange idle cycles via swizzling to create more compaction opportunity
- Sensitive to memory interface speeds: memory-bound applications may experience limited benefit
(38) Intra-Warp Compaction
[Figure: control flow graph with basic blocks A through G over a thread block]
- Scope is limited to within a warp
- Increasing the scope means increasing the warp size, either explicitly or implicitly (treating multiple warps as a single warp)