© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Intra-Warp Compaction Techniques

(2) Goal [Figure: idle vs. active threads in a warp, before and after compaction] Compact threads in a warp to coalesce (and eliminate) idle cycles → improve utilization

(3) References V. Narasiman et al., “Improving GPU Performance via Large Warps and Two-Level Scheduling,” MICRO 2011. A. S. Vaidya et al., “SIMD Divergence Optimization Through Intra-Warp Compaction,” ISCA 2013.

© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Improving GPU Performance via Large Warps and Two-Level Scheduling, V. Narasiman et al., MICRO 2011

(5) Goals Improve performance of divergent code via compaction of threads within a warp; integrate warp scheduling optimization with intra-warp compaction.

(6) Resource Underutilization Resources are underutilized due to control divergence and due to memory divergence. [Figure: configuration of 32 warps, 32 threads per warp, single SM]

(7) Warp Scheduling and Locality [Figure: two warp scheduling orders over time, each warp eventually issuing a load] One ordering degrades memory reference locality and row-buffer locality; the other overlaps memory accesses and creates opportunities for memory coalescing, but also has the potential for exposing memory stalls.

(8) Key Ideas Conventional design today → warp size = #SIMD lanes. Use large warps → multi-cycle issue of sub-warps → compact threads in a warp to form fully utilized sub-warps. Two-level scheduler to spread memory accesses in time → reduce memory-related stall cycles (compared with the conventional GPU round-robin fetch policy).

(9) Large Warps Large warps are converted to a sequence of sub-warps. [Figure: typical operation with warp size = SIMD width = 4 vs. the proposed approach with warp size = 16 and SIMD width = 4, where a large warp is issued over multiple cycles as sub-warps onto four scalar pipelines fed by the register file]

(10) Sub-Warp Compaction Iteratively select one active thread per column (SIMD lane) to create a packed sub-warp; sub-warps are generated dynamically.
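
A minimal sketch of the compaction step described above, assuming the large warp's active mask is organized as rows by SIMD-lane columns (function and variable names are illustrative, not from the paper):

```python
# Sketch of dynamic sub-warp compaction for a large warp.
# The active mask is a 2D grid: large-warp rows x SIMD lanes (columns).
# Each pass picks at most one active thread per column, forming a packed sub-warp.

def compact_sub_warps(active_mask):
    """active_mask[row][lane] is True if that thread is active.
    Returns a list of sub-warps; each sub-warp maps lane -> source row (or None)."""
    num_rows = len(active_mask)
    num_lanes = len(active_mask[0])
    remaining = [row[:] for row in active_mask]   # copy so threads can be consumed
    sub_warps = []
    while any(any(row) for row in remaining):
        sub_warp = [None] * num_lanes
        for lane in range(num_lanes):             # one thread per column per sub-warp
            for row in range(num_rows):
                if remaining[row][lane]:
                    sub_warp[lane] = row          # pack this thread into the sub-warp
                    remaining[row][lane] = False
                    break
        sub_warps.append(sub_warp)
    return sub_warps

# Example: warp size 16, SIMD width 4 -> 4 rows x 4 lanes.
mask = [[True, False, True, True],
        [True, False, False, True],
        [False, False, True, False],
        [True, False, False, False]]
for i, sw in enumerate(compact_sub_warps(mask)):
    print("sub-warp", i, "lane->row:", sw)
# Three sub-warps issue instead of four, because the divergent threads
# pack into fewer fully (or mostly) utilized rows.
```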

(11) Impact on the Register File [Figure: large-warp active mask organization; baseline register file vs. large-warp register file] Because a sub-warp can pack threads from different rows of the large warp, each register file bank needs its own decoder.

(12) Scheduling Constraints The next large warp cannot be scheduled until the first sub-warp completes execution. The scoreboard checks for issue dependencies → a thread is not available for packing into a sub-warp unless its previous issue (sub-warp) has completed → single-bit status → simple check. However, on a branch, all sub-warps must complete before the large warp is eligible for instruction fetch scheduling. Re-fetch policy for conditional branches → must wait until the last sub-warp finishes. Optimization for unconditional branch instructions → don’t create multiple sub-warps → sub-warping always completes in a single cycle.
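
A minimal sketch of the per-thread single-bit availability status described above (class and method names are illustrative; real hardware would keep this bit alongside the scoreboard):

```python
# Per-thread single-bit status: a thread can be packed into a new sub-warp
# only after its previously issued sub-warp has completed.

class LargeWarpStatus:
    def __init__(self, num_threads):
        self.busy = [False] * num_threads   # one bit per thread

    def can_pack(self, thread_id):
        # Simple check: available only if the thread's last sub-warp completed.
        return not self.busy[thread_id]

    def on_issue(self, sub_warp_threads):
        for t in sub_warp_threads:
            self.busy[t] = True             # mark in flight

    def on_complete(self, sub_warp_threads):
        for t in sub_warp_threads:
            self.busy[t] = False            # eligible for packing again
```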

(13) Effect of Control Divergence Note that divergence is unknown until all sub-warps execute → divergence management happens only on large-warp boundaries → need to buffer sub-warp state, e.g., active masks. The last-warp effect → cannot fetch the next instruction in a warp until all sub-warps issue → a trailing warp (warp divergence effect) can lead to many idle cycles. Effect of the last thread → e.g., with data-dependent loop iteration counts across threads, the last thread can hold up reconvergence.

(14) A Round-Robin Warp Scheduler [Figure: warps 0 through N issued round-robin over time, each reaching a load] Exploit inter-warp reference locality in the cache and in the DRAM row buffers; however, latency hiding still needs to be maintained.

(15) Two-Level Round-Robin Scheduler [Figure: 16 large warps LW0–LW15 partitioned into four fetch groups of four; threads are scheduled round-robin within a fetch group and round-robin across fetch groups] Fetch group size: enough to keep the pipeline busy.
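
A behavioral sketch of the two-level round-robin policy, assuming warps simply report whether they are ready (stalled warps, e.g. waiting on a load, are skipped); the fetch-group switch timeout from the next slide is omitted for brevity, and all names are illustrative:

```python
# Two-level round-robin: round-robin among warps inside the active fetch group;
# when no warp in the group is ready, move round-robin to the next fetch group.

class TwoLevelRRScheduler:
    def __init__(self, num_warps, fetch_group_size):
        self.groups = [list(range(i, min(i + fetch_group_size, num_warps)))
                       for i in range(0, num_warps, fetch_group_size)]
        self.group_idx = 0
        self.rr_ptr = 0   # round-robin pointer within the current group

    def pick(self, is_ready):
        """is_ready(warp_id) -> bool. Returns a warp to fetch from, or None."""
        for _ in range(len(self.groups)):
            group = self.groups[self.group_idx]
            for i in range(len(group)):
                w = group[(self.rr_ptr + i) % len(group)]
                if is_ready(w):
                    self.rr_ptr = (self.rr_ptr + i + 1) % len(group)
                    return w
            # whole group stalled: move on to the next fetch group
            self.group_idx = (self.group_idx + 1) % len(self.groups)
            self.rr_ptr = 0
        return None   # every warp is stalled this cycle

# Example: 16 large warps, fetch groups of 4 (sized to keep the pipeline busy).
sched = TwoLevelRRScheduler(num_warps=16, fetch_group_size=4)
print(sched.pick(lambda w: w not in (0, 1)))   # warps 0 and 1 stalled -> picks 2
```

Warps within a fetch group stay at similar points in the program, preserving inter-warp locality, while the groups drift apart in time so that some group usually has ready warps to hide memory latency.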

(16) Scheduler Behavior The fetch group size needs to be set carefully → tune it to fill the pipeline. A timeout on switching fetch groups mitigates the last-warp effect.

(17) Summary Intra-warp compaction is made feasible by multi-cycle warp execution → the mismatch between warp size and SIMD width enables flexible intra-warp compaction. Do not make warps too big → the last-thread effect begins to dominate.

© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) SIMD Divergence Optimization Through Intra-Warp Compaction, A. S. Vaidya et al., ISCA 2013

(19) Goals Improve utilization in divergent code via intra-warp compaction. Become familiar with Intel’s Gen integrated general-purpose GPU architecture.

(20) Integrated GPUs: Intel HD Graphics [Figure from The Computer Architecture of the Intel Processor Graphics Gen9]

(21) Integrated GPUs: Intel HD Graphics [Figure from The Computer Architecture of the Intel Processor Graphics Gen9: Gen graphics processor; 32-byte bidirectional ring; dedicated coherence signals; coherent distributed cache shared with the GPU; operates as a memory-side cache; shared physical memory]

(22) Inside the Gen9 EU [Figure from The Computer Architecture of the Intel Processor Graphics Gen9] Per-thread register state, including an Architecture Register File (ARF); up to 7 threads per EU; 128 256-bit registers per thread (8-way SIMD). Each thread executes a kernel → threads may execute different kernels → multi-instruction dispatch.
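
Back-of-the-envelope register file sizing implied by the numbers on this slide (the byte figures are derived here, not quoted from the source):

```python
# Gen9 EU general register file capacity implied by the slide's numbers.
regs_per_thread = 128          # 256-bit registers per thread
bytes_per_reg = 256 // 8       # 32 bytes, i.e., 8 x 32-bit SIMD lanes
threads_per_eu = 7

grf_per_thread = regs_per_thread * bytes_per_reg   # 4096 bytes = 4 KB per thread
grf_per_eu = grf_per_thread * threads_per_eu       # 28 KB of registers per EU
print(grf_per_thread, grf_per_eu)                  # 4096 28672
```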

(23) Operation (2) [Figure from The Computer Architecture of the Intel Processor Graphics Gen9] Dispatch 4 instructions from 4 threads, with constraints between issue slots. Both FP and integer operations are supported: 4 32-bit FP operations, 8 16-bit integer operations, or 8 16-bit FP operations, with MAD operations each cycle; 96 bytes/cycle read bandwidth and 32 bytes/cycle write bandwidth. Divergence/reconvergence management. SIMD-8, SIMD-16, and SIMD-32 instructions are transformed into SIMD-4 operations.

(24) Operation [Figure from The Computer Architecture of the Intel Processor Graphics Gen9: a thread intermixing SIMD-4, SIMD-8, and SIMD-16 instructions] SIMD instructions of various lengths can be intermixed within a thread; SIMD-8, SIMD-16, and SIMD-32 instructions are executed as sequences of SIMD-4 operations.

(25) Mapping the BSP Model [Figure: a grid of thread blocks, each a multi-dimensional array of threads] Multiple threads are mapped to a SIMD instance executed by an EU thread; all threads in a thread block (work-group) are mapped to the same EU thread (shared memory access). [Figure from The Computer Architecture of the Intel Processor Graphics Gen9]

(26) Subslice Organization [Figure from The Computer Architecture of the Intel Processor Graphics Gen9] #EUs × #threads/EU determines the width of the slice. Flexible data interface → scatter/gather support, memory request coalescing across 64-byte cache lines, and shared memory access.
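
A quick illustration of the width calculation, assuming for illustration 8 EUs per subslice and 7 threads per EU (typical Gen9 values; actual counts vary by product):

```python
# Hardware threads and maximum in-flight work-items per subslice.
eus_per_subslice = 8        # assumption: one Gen9 subslice
threads_per_eu = 7
simd_width = 32             # widest compile-time SIMD mode mentioned in this deck

hw_threads = eus_per_subslice * threads_per_eu   # 56 hardware threads
max_work_items = hw_threads * simd_width         # 1792 work-items in flight
print(hw_threads, max_work_items)
```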

(27) Slice Organization [Figure from The Computer Architecture of the Intel Processor Graphics Gen9] Flexible partitioning among data cache, buffers for accelerators, and shared memory; shared memory is 64 KB per slice and is not coherent with the other structures.

(28) Product Organization [Figure from The Computer Architecture of the Intel Processor Graphics Gen9] Load balancing honors barrier and shared memory constraints. Shared atomics are implemented with the CPU. Shared virtual memory → pointer-rich data structures can be shared between CPU and GPU; coherent shared memory between CPU and GPU.

(29) Coherent Memory Hierarchy [Figure from The Computer Architecture of the Intel Processor Graphics Gen9; part of the hierarchy is labeled not coherent]

(30) Microarchitecture Operation [Figure: generic SM pipeline with I-fetch, decode, I-buffer, issue, register file, scalar pipelines, D-cache access, and writeback] Per-thread operation with a per-thread scoreboard check; thread arbitration and dual issue every 2 cycles; operand fetch/swizzle, with the swizzle encoded in the register file access. Instruction execution happens in waves of 4-wide operations. Note: variable-width SIMD instructions.

(31) Divergence Assessment [Figure: SIMD efficiency for coherent applications vs. divergent applications]

(32) Basic Cycle Compression [Figure: register file for a single operand, with a wide SIMD instruction issued as a sequence of 4-wide cycles] Skip the execution cycles in which all lanes are idle. Example: the actual operation depends on the data types and execution cycles per op. Note the power/energy savings.
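
A minimal sketch of basic cycle compression under the model used in this deck, where a wide SIMD instruction executes as a sequence of 4-wide quad cycles (the function name and mask encoding are illustrative):

```python
# Basic cycle compression (BCC): execute only the quads (groups of 4 lanes)
# that contain at least one active lane; fully idle quads are skipped.

def bcc_cycles(active_mask, quad=4):
    """active_mask: list of booleans, one per SIMD lane of the instruction."""
    quads = [active_mask[i:i + quad] for i in range(0, len(active_mask), quad)]
    executed = [q for q in quads if any(q)]      # skip all-idle quads
    return len(executed), len(quads)             # (compressed, baseline) cycles

# SIMD-16 instruction with only lanes 0-3 and 12 active:
mask = [True]*4 + [False]*8 + [True] + [False]*3
print(bcc_cycles(mask))   # (2, 4): two quad cycles execute instead of four
```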

(33) Swizzle Cycle Compression

(34) SCC Operation Swizzle cycle compression can compact active lanes across quads: lanes are packed into a quad (128 bits, 4 lanes), and the swizzle settings are overlapped with the register file access. [Figure: register file for a single operand issued as 4-lane quads over cycles i, i+1, i+2, i+3] Note the increased area and power/energy.
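
A companion sketch of swizzle cycle compression under the same assumptions: active lanes are packed across quad boundaries before issue, so the instruction needs roughly ceil(#active / 4) quad cycles (illustrative; it ignores register file port and swizzle-network constraints):

```python
# Swizzle cycle compression (SCC): pack active lanes across quad boundaries
# so the instruction issues in ceil(#active / 4) quad cycles.

import math

def scc_cycles(active_mask, quad=4):
    active_lanes = [i for i, a in enumerate(active_mask) if a]
    # Swizzle: group active lanes into packed quads, recording the source lane
    # of each slot so the RF access can fetch the right operand registers.
    packed = [active_lanes[i:i + quad] for i in range(0, len(active_lanes), quad)]
    baseline = math.ceil(len(active_mask) / quad)
    return len(packed), baseline, packed

# Case where BCC cannot help (every quad has one active lane) but SCC can:
mask = [i % 4 == 0 for i in range(16)]    # lanes 0, 4, 8, 12 active
cycles, baseline, packed = scc_cycles(mask)
print(cycles, baseline, packed)   # 1 4 [[0, 4, 8, 12]]
```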

(35) Compaction Opportunities [Figure: SIMD-16 and SIMD-8 instructions with idle lanes; in some cases no further compaction is possible] For K active threads, what is the maximum cycle savings for a SIMD-N instruction?
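
One way to answer the slide's question, assuming 4-wide quad execution and unconstrained packing (an upper bound, not a measured result):

```python
import math

def max_cycle_savings(K, N, quad=4):
    """Upper bound on cycles saved for a SIMD-N instruction with K active threads."""
    baseline = math.ceil(N / quad)            # cycles with no compaction
    best = math.ceil(K / quad) if K else 0    # active lanes fully packed into quads
    return baseline - best

print(max_cycle_savings(K=5, N=16))   # 4 cycles shrink to 2: saves 2
print(max_cycle_savings(K=16, N=16))  # fully active: nothing to save
```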

(36) Performance Savings Difference between saving cycles and saving time → when is #cycles ≠ time?

(37) Summary Multi-cycle warp/SIMD/work-group execution → optimize #cycles per warp by compressing idle cycles → rearrange idle cycles via swizzling to create opportunities. Sensitivity to memory interface speeds → memory-bound applications may experience limited benefit.

(38) Intra-Warp Compaction [Figure: control-flow graph with nodes A through G, executed by a thread block] The scope is limited to within a warp; increasing the scope means increasing the warp size, either explicitly or implicitly (treating multiple warps as a single warp).