1
© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Intra-Warp Compaction Techniques
2
(2) Goal Figure: a warp's lanes before and after compaction, with idle and active threads marked. Compact threads in a warp to coalesce (and eliminate) idle cycles and improve utilization.
3
(3) References V. Narasiman et al., “Improving GPU Performance via Large Warps and Two-Level Scheduling,” MICRO 2011. A. S. Vaidya et al., “SIMD Divergence Optimization Through Intra-Warp Compaction,” ISCA 2013.
4
© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Improving GPU Performance via Large Warps and Two-Level Scheduling V. Narasiman et al., MICRO 2011
5
(5) Goals Improve performance of divergent code via compaction of threads within a warp Integrate warp scheduling optimization with intra-warp compaction
6
(6) Resource Underutilization Figure: resource underutilization due to control divergence and due to memory divergence (32 warps, 32 threads per warp, single SM).
7
(7) Warp Scheduling and Locality Figure: two scheduling timelines for warps Warp0..WarpN around a load. One schedule degrades memory reference locality and row buffer locality; the other overlaps memory accesses and creates opportunities for memory coalescing, but has the potential for exposing memory stalls.
8
(8) Key Ideas Conventional design today: warp size = #SIMD lanes, with a round-robin fetch policy. Use large warps with multi-cycle issue of sub-warps, and compact threads in a warp to form fully utilized sub-warps. Use a 2-level scheduler to spread memory accesses in time and reduce memory-related stall cycles.
9
(9) Large Warps Large warps are converted into a sequence of sub-warps. Figure: typical operation vs. proposed approach. Typical operation: warp size = SIMD width (warp size 4, SIMD width 4), issued in one cycle to the scalar pipelines behind the register file. Proposed approach: warp size larger than SIMD width (warp size 16, SIMD width 4), with multi-cycle issue of sub-warps.
10
(10) Sub-Warp Compaction Iteratively select one thread per column to create a packed sub-warp Dynamic generation of sub-warps
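To make the column-wise packing concrete, here is a minimal Python sketch (not the paper's hardware logic; function and variable names are illustrative): each step pulls at most one active thread from every SIMD lane (column) of the large warp's active-mask matrix to form one packed sub-warp.

    # Pack a large warp's active threads into dense sub-warps,
    # picking at most one thread per SIMD lane (column) per sub-warp.
    def compact_into_subwarps(active_mask, simd_width):
        # active_mask: list of rows; each row is a list of simd_width 0/1 flags.
        columns = [[(r, c) for r in range(len(active_mask)) if active_mask[r][c]]
                   for c in range(simd_width)]
        subwarps = []
        while any(columns):
            # One pass over the columns yields one packed sub-warp (None = idle lane).
            subwarps.append([col.pop() if col else None for col in columns])
        return subwarps

    # Example: a 4x4 large warp (16 threads) with 5 active threads needs only 2 sub-warps.
    mask = [[1, 0, 0, 1],
            [0, 0, 1, 0],
            [1, 0, 0, 0],
            [0, 0, 1, 0]]
    print(compact_into_subwarps(mask, 4))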
11
(11) Impact on the Register File Figure: large warp active mask organization; baseline register file vs. large warp register file. The large warp register file needs separate decoders per bank.
12
(12) Scheduling Constraints The next large warp cannot be scheduled until the first sub-warp completes execution; the scoreboard checks for issue dependencies. A thread is not available for packing into a sub-warp unless its previous issue (sub-warp) has completed: a single status bit per thread makes this a simple check. On a branch, however, all sub-warps must complete before the warp is eligible for instruction fetch scheduling, so the re-fetch policy for conditional branches must wait until the last sub-warp finishes. Optimization for unconditional branch instructions: do not create multiple sub-warps, so sub-warping always completes in a single cycle.
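A rough Python sketch of the availability bookkeeping described above (a single in-flight bit per thread; class and method names are my own, not from the paper):

    class LargeWarpStatus:
        def __init__(self, num_threads):
            self.in_flight = [False] * num_threads  # one status bit per thread

        def can_pack(self, tid):
            # A thread may join a new sub-warp only if its previous issue completed.
            return not self.in_flight[tid]

        def issue_subwarp(self, tids):
            for t in tids:
                self.in_flight[t] = True

        def complete_subwarp(self, tids):
            for t in tids:
                self.in_flight[t] = False

        def ready_to_refetch_after_branch(self):
            # On a conditional branch, wait until every sub-warp has drained.
            return not any(self.in_flight)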
13
(13) Effect of Control Divergence Note that divergence is unknown until all sub-warps execute, so divergence management happens only on large warp boundaries and sub-warp state (e.g., active masks) must be buffered. The last warp effect: the next instruction of a warp cannot be fetched until all sub-warps issue, so a trailing sub-warp (warp divergence effect) can lead to many idle cycles. The effect of the last thread: e.g., with data-dependent loop iteration counts across threads, the last thread can hold up reconvergence.
14
(14) A Round Robin Warp Scheduler Figure: round-robin timeline of Warp0..WarpN issuing around a load. Exploits inter-warp reference locality in the cache and in the DRAM row buffers; however, latency hiding must still be maintained.
15
(15) Two Level Round Robin Scheduler Figure: 16 large warps LW0..LW15 partitioned into four fetch groups (Fetch Group 0..3), with round-robin selection within the active fetch group and round-robin switching across fetch groups. Fetch group size: enough to keep the pipeline busy.
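A minimal Python sketch of the two-level round-robin fetch policy, under the assumption that the scheduler switches fetch groups when the active group has no ready warp or a timeout expires (names and the timeout handling are illustrative, not the paper's exact mechanism):

    def two_level_rr(fetch_groups, is_ready, timeout=1000):
        # fetch_groups: list of lists of warp ids; is_ready(w): can warp w fetch this cycle?
        group, cycles_in_group = 0, 0
        rr_ptr = [0] * len(fetch_groups)
        while True:
            g = fetch_groups[group]
            picked = None
            for i in range(len(g)):                           # level 1: RR within the active group
                w = g[(rr_ptr[group] + i) % len(g)]
                if is_ready(w):
                    picked = w
                    rr_ptr[group] = (rr_ptr[group] + i + 1) % len(g)
                    break
            cycles_in_group += 1
            if picked is None or cycles_in_group >= timeout:  # level 2: RR across fetch groups
                group = (group + 1) % len(fetch_groups)
                cycles_in_group = 0
            yield picked                                      # warp to fetch this cycle (may be None)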
16
(16) Scheduler Behavior The fetch group size must be set carefully and tuned to fill the pipeline. A timeout on switching fetch groups mitigates the last warp effect.
17
(17) Summary Intra-warp compaction is made feasible by multi-cycle warp execution. The mismatch between warp size and SIMD width enables flexible intra-warp compaction. Do not make warps too big: the last-thread effect begins to dominate.
18
© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) SIMD Divergence Optimization Through Intra-Warp Compaction A. S. Vaidya et al., ISCA 2013
19
(19) Goals Improve utilization in divergent code via intra-warp compaction. Become familiar with Intel’s Gen integrated general-purpose GPU architecture.
20
(20) Integrated GPUs: Intel HD Graphics Figure from The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides
21
(21) Integrated GPUs: Intel HD Graphics Figure from The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides Gen graphics processor attached to a 32-byte bidirectional ring with dedicated coherence signals; coherent distributed cache shared with the GPU, operating as a memory-side cache; shared physical memory.
22
(22) Inside the Gen9 EU Figure: EU internals, showing per-thread state (Thread 0, ...) and the Architecture Register File (ARF). Per-thread register state; up to 7 threads per EU; 128 256-bit registers per thread (8-way SIMD). Each thread executes a kernel, and threads may execute different kernels. Multi-instruction dispatch. Figure from The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides
23
(23) Operation (2) Dispatch of 4 instructions from 4 threads, with constraints between issue slots. Supports both FP and integer operations: 4 32-bit FP, 8 16-bit integer, or 8 16-bit FP MAD operations each cycle. 96 bytes/cycle read bandwidth, 32 bytes/cycle write bandwidth. Divergence/reconvergence management. SIMD-16, SIMD-8, and SIMD-32 instructions are executed as sequences of SIMD-4 operations. Figure from The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides
24
(24) Operation Figure: a thread's instruction stream intermixing SIMD instructions of various lengths (SIMD-8, SIMD-16, SIMD-4). SIMD-16, SIMD-8, and SIMD-32 instructions are executed as sequences of SIMD-4 operations. Figure from The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides
25
(25) Mapping the BSP Model Figure: a grid of thread blocks and the threads within Block (1,1). Map multiple threads to a SIMD instance executed by an EU thread; all threads in a TB (work-group) are mapped to the same thread (shared memory access). SIMD-16, SIMD-8, and SIMD-32 instructions are executed as sequences of SIMD-4 operations. Figure from The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides
26
(26) Subslice Organization From The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides #EUs * #threads/EU determines the width of the slice. Flexible data interface: scatter/gather support, memory request coalescing across 64-byte cache lines, shared memory access.
27
(27) Slice Organization From The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides Flexible partitioning: SM data cache, buffers for accelerators, shared memory (64 KB/slice). Not coherent with other structures.
28
(28) Product Organization From The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides Load balancing: honor barrier and shared memory constraints. Shared atomics implemented with the CPU. Shared virtual memory: share pointer-rich data structures between CPU and GPU; coherent shared memory between CPU and GPU.
29
(29) Coherent Memory Hierarchy From The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides Figure: the shared memory hierarchy, annotating which structures are not coherent.
30
(30) Microarchitecture Operation Figure: execution pipeline (I-Fetch, I-Buffer, Decode, Issue, RF/operand fetch, scalar pipelines, D-Cache, Writeback). Per-thread operation with a per-thread scoreboard check. Thread arbitration and dual issue every 2 cycles. Operand fetch/swizzle, with the swizzle encoded in the RF access. Instruction execution happens in waves of 4-wide operations. Note: variable-width SIMD instructions.
31
(31) Divergence Assessment Figure: SIMD efficiency of coherent applications vs. divergent applications.
32
(32) Basic Cycle Compression Figure: RF for a single operand; an example in which execution cycles are skipped when all lanes of a 4-wide group are idle. Actual operation depends on data types and execution cycles/op. Note the power/energy savings.
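A small Python sketch of the idea, assuming the 4-wide execution quads described on the microarchitecture slide (function and example are mine, not the paper's): a quad whose lanes are all inactive simply does not get an execution cycle.

    # Basic cycle compression (BCC): skip an execution cycle whenever all
    # lanes of a 4-wide quad are inactive.
    def bcc_cycles(active_mask, exec_width=4):
        quads = [active_mask[i:i + exec_width] for i in range(0, len(active_mask), exec_width)]
        baseline = len(quads)                          # cycles without compression
        compressed = sum(1 for q in quads if any(q))   # only issue quads with an active lane
        return baseline, compressed

    # SIMD-16 instruction with only lanes 0-3 active: 4 cycles -> 1 cycle.
    print(bcc_cycles([1]*4 + [0]*12))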
33
(33) Swizzle Cycle Compression
34
(34) SCC Operation Figure: RF for a single operand, 128 bits / 4 lanes wide, read over cycles i, i+1, i+2, i+3. SCC can compact across quads: active lanes are packed into a quad, with the swizzle settings overlapped with the RF access. Note the increased area and power/energy.
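A companion Python sketch of SCC under the same 4-wide-quad assumption (again illustrative, not the paper's hardware): active lanes are swizzled across quads so they occupy as few quads as possible, at the cost of the extra routing the slide notes.

    import math

    # Swizzle cycle compression (SCC): route active lanes from any quad
    # into as few quads as possible.
    def scc_cycles(active_mask, exec_width=4):
        active = [lane for lane, on in enumerate(active_mask) if on]
        baseline = math.ceil(len(active_mask) / exec_width)
        compressed = math.ceil(len(active) / exec_width) if active else 0
        # swizzle[i] lists which source lanes feed the i-th issued quad.
        swizzle = [active[i:i + exec_width] for i in range(0, len(active), exec_width)]
        return baseline, compressed, swizzle

    # SIMD-16 with 5 scattered active lanes: every quad has an active lane, so
    # BCC saves nothing, but SCC packs the work into 2 quads.
    print(scc_cycles([1,0,0,0, 0,1,0,0, 0,0,1,0, 1,0,0,1]))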
35
(35) Compaction Opportunities Figure: SIMD-16 and SIMD-8 examples with idle lanes; in some cases no further compaction is possible. For K active threads, what is the maximum cycle savings for SIMD-N instructions?
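One way to answer, assuming the same 4-wide execution quads as above: a SIMD-N instruction nominally takes N/4 quad cycles, and with K active threads perfect compaction needs ceil(K/4) quad cycles, so the maximum saving is N/4 - ceil(K/4) cycles (e.g., a SIMD-16 instruction with K = 5 active threads saves 4 - 2 = 2 cycles).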
36
(36) Performance Savings Difference between saving cycles and saving time When is #cycles ≠ time?
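A rough model for when cycles saved do not translate into time saved (my own framing, not from the paper): with compute and memory overlapped, execution time is roughly max(compute cycles, memory cycles) / clock rate, so compressing compute cycles that are already hidden behind memory latency reduces cycles but not wall-clock time; this is why memory-bound applications see limited benefit, as the summary below notes.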
37
(37) Summary Multi-cycle warp/SIMD/work-group execution. Optimize #cycles/warp by compressing idle cycles; swizzling rearranges lanes to create additional compression opportunities. Sensitive to memory interface speeds: memory-bound applications may experience limited benefit.
38
(38) Intra-Warp Compaction Figure: a control-flow graph (blocks A through G) and a thread block. Scope is limited to within a warp; increasing the scope means increasing the warp size, either explicitly or implicitly (treating multiple warps as a single warp).