Operation of the SM Pipeline

Operation of the SM Pipeline ©Sudhakar Yalamanchili unless otherwise noted

Objectives
- Cycle-level examination of the operation of the major pipeline stages in a streaming multiprocessor
- Understand the type of information needed at each stage of operation
- Identify performance bottlenecks
- Detailed implementations are addressed in subsequent modules

Reading
- Documentation for the GPGPU-Sim simulator: a good source of information about the general organization and operation of a streaming multiprocessor
  http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual
- Operation of a scoreboard: https://en.wikipedia.org/wiki/Scoreboarding
- P. Xiang, Y. Yang, H. Zhou, "Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation," International Symposium on High Performance Computer Architecture (HPCA), 2014.
- D. Tarjan and K. Skadron, "On Demand Register Allocation and Deallocation for a Multithreaded Processor," US Patent 2011/0161616 A1, June 2011.

NVIDIA GK110 (Kepler): Thread Block Scheduler
Image from http://mandetech.com/2012/05/20/nvidia-new-gpu-and-visualization/

SMX Organization: GK110
- Multiple warp schedulers
- 64K 32-bit registers
- 192 cores, organized as 6 clusters of 32 cores each
- What are the main stages of a generic SMX pipeline?
Image from http://mandetech.com/2012/05/20/nvidia-new-gpu-and-visualization/

A Generic SM Pipeline
(Figure: pending warps, e.g., Warp 1, Warp 2, Warp 6, flow through I-Fetch, I-Buffer, Decode, Issue, the register files, the scalar pipelines, the D-cache "all hit?/miss?" check, and Writeback)
- Front-end: scalar fetch and decode; instruction issue and warp scheduler
- Predicate and general-purpose register files
- Scalar cores organized as scalar pipelines
- Back-end: data memory access and writeback/commit

Single Warp Execution
(Figure: warp state entry holding PC, active mask (AM), warp ID (WID), and state; a thread block from the grid maps to warps)
PTX (assembly) executed by one warp:
    setp.lt.s32 %p, %r5, %rd4;        // r5 = index, rd4 = N
    @p  bra L1;
        bra L2;
    L1: ld.global.f32 %f1, [%r6];     // r6 = &a[index]
        ld.global.f32 %f2, [%r7];     // r7 = &b[index]
        add.f32 %f3, %f1, %f2;
        st.global.f32 [%r8], %f3;     // r8 = &c[index]
    L2: ret;
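The PTX above is a guarded vector add. For reference, a CUDA kernel of roughly this shape compiles to similar PTX; the kernel and array names (a, b, c, N) below are illustrative, not taken from the slides.

    // Illustrative CUDA source corresponding to the PTX listing above.
    __global__ void vecAdd(const float *a, const float *b, float *c, int N) {
        int index = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
        if (index < N) {                                      // setp.lt.s32 / @p bra
            c[index] = a[index] + b[index];                   // ld, ld, add, st
        }
    }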

Instruction Fetch & Decode
(Figure: per-warp PCs, Warp 0 through Warp n-1, feed the I-cache; a fetched instruction carries PC, active mask (AM), warp ID (WID), and state. Examples from the Harmonica2 GPU.)
- The fetch stage selects the next warp each cycle and may realize multiple fetch policies
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores

Instruction Buffer
(Figure: I-buffer holding decoded instructions per warp, each with a valid (V) bit and a ready (R) bit, backed by a scoreboard. Example: buffer 2 instructions per warp.)
- Buffer a fixed number of decoded instructions per warp, coordinated with instruction fetch: fetching requires an empty I-buffer slot for the warp
- V: valid instruction in the buffer
- R: instruction ready to be issued, set using the scoreboard logic
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores

Instruction Buffer (2)
- The scoreboard enforces WAW and RAW hazards
- Indexed by warp ID; each entry hosts the required registers
- Destination registers are reserved at issue and released at writeback
- Enables multiple instructions to be issued from a single warp
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores

Instruction Buffer (3): Generic Scoreboard Entry
    Name | Busy | Op   | Fi (dest reg) | Fj, Fk (src1, src2) | Qj, Qk (FU producing each source) | Rj, Rk (source value ready?)
    Int  | Yes  | Load | F2            | R3, --              | --, --                            | No, --
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores
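As a minimal illustration (not the GPGPU-Sim implementation), the per-warp scoreboard described on these slides can be sketched as a set of reserved destination registers: an instruction may issue only if none of its sources or destinations are still reserved; destinations are reserved at issue and released at writeback. All names below are hypothetical.

    #include <set>
    #include <vector>

    // Hypothetical sketch of a per-warp scoreboard enforcing RAW and WAW hazards.
    struct Scoreboard {
        std::vector<std::set<int>> reserved;   // per-warp set of in-flight destination registers

        explicit Scoreboard(int numWarps) : reserved(numWarps) {}

        // RAW: a source must not be a pending destination; WAW: a destination must not be pending.
        bool canIssue(int warp, const std::vector<int>& srcs, const std::vector<int>& dsts) const {
            for (int r : srcs) if (reserved[warp].count(r)) return false;
            for (int r : dsts) if (reserved[warp].count(r)) return false;
            return true;
        }
        void reserve(int warp, const std::vector<int>& dsts) {   // at issue
            reserved[warp].insert(dsts.begin(), dsts.end());
        }
        void release(int warp, const std::vector<int>& dsts) {   // at writeback
            for (int r : dsts) reserved[warp].erase(r);
        }
    };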

Instruction Issue
(Figure: warp scheduler selecting an instruction from a pool of ready warps, e.g., Warp 3, Warp 7, Warp 8)
- The warp scheduler manages the implementation of barriers, register dependencies, and control divergence
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores

Instruction Issue (2)
- Barriers: warps wait at the issue stage for barrier synchronization
- All threads in the CTA must reach the barrier before any warp can proceed
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores
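In CUDA source, this barrier corresponds to __syncthreads(): every thread, and therefore every warp, in the thread block must reach it before any warp issues past it. The kernel below is an illustrative sketch (assuming a block size of at most 256 threads), not taken from the slides.

    // Illustrative kernel: all warps in the CTA synchronize at __syncthreads().
    __global__ void blockedSum(const float *in, float *out, int n) {
        __shared__ float tile[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                       // warps wait here until the whole CTA arrives
        if (threadIdx.x == 0) {
            float s = 0.0f;
            for (int j = 0; j < blockDim.x; ++j) s += tile[j];
            out[blockIdx.x] = s;
        }
    }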

Instruction Issue (3)
- Register dependencies: tracked through the scoreboard
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores

Instruction Issue (4)
- Control divergence: handled with a per-warp SIMT stack
- The stack keeps track of divergent threads at a branch
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores
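For intuition, a data-dependent branch such as the one in the illustrative kernel below splits a warp into two active-mask groups; the SIMT stack serializes the two paths and reconverges the warp afterward. The kernel is a sketch, not from the slides.

    // Illustrative divergent kernel: threads in the same warp take different paths.
    __global__ void divergentScale(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            if (x[i] < 0.0f)          // lanes with negative values take one path...
                x[i] = -x[i];
            else                      // ...the rest take the other; the SIMT stack serializes them
                x[i] = x[i] * 2.0f;
        }                             // the warp reconverges here
    }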

Instruction Issue (5)
- The scheduler can issue multiple instructions from a warp
- Issue conditions: the warp has valid instructions, is not waiting at a barrier, passes the scoreboard check, and the downstream pipeline (operand access stage, covered later) is not stalled (see the issue-check sketch below)
- Destination registers are reserved at issue
- Instructions may issue to the memory, SP, or SFU pipelines
- Warp scheduling disciplines: more later in the course
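A minimal, self-contained sketch of that issue check, with hypothetical structure and field names mirroring the conditions listed above:

    // Hypothetical sketch of the per-warp issue check described on this slide.
    struct WarpIssueState {
        bool hasValidInstruction;    // I-buffer V bit set
        bool waitingAtBarrier;       // held at a barrier
        bool scoreboardClear;        // no RAW/WAW hazard on this instruction
    };

    bool canIssue(const WarpIssueState &w, bool operandStageStalled) {
        return w.hasValidInstruction &&
               !w.waitingAtBarrier &&
               w.scoreboardClear &&
               !operandStageStalled;    // operand access stage must be able to accept
    }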

Register File Access
(Figure: single-ported register file banks 0-15 connected through an arbiter and a 1024-bit crossbar to operand collectors (OC) and dispatch units (DU), which feed the ALUs, load/store units, and SFUs)

Scalar Pipeline
(Figure: a single core with dispatch feeding pipelined ALU, FPU, and LD/ST units and a result queue)
- Functional units are pipelined
- Some designs support multiple issue

Shared Memory Access
(Figure: a conflict-free access pattern vs. a 2-way conflicting access pattern across shared memory banks)
- Multiple-bank organization: data is interleaved across banks
- Bank conflicts extend access times (see the access-pattern sketch below)
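As an illustrative sketch (not from the slides), the kernel below contrasts a unit-stride, conflict-free shared-memory access with a stride-2 access that, assuming 32 banks of 4-byte words and a single-warp (32-thread) block, maps two lanes of the warp to each bank:

    // Illustrative shared-memory access patterns; assumes a 32-thread block
    // and 32 shared-memory banks of 4-byte words.
    __global__ void bankAccessDemo(float *out) {
        __shared__ float buf[64];
        int t = threadIdx.x;              // lane 0..31

        buf[t] = (float)t;                // conflict-free: lane i maps to bank i
        float a = buf[t];

        buf[2 * t] = (float)t;            // stride-2: lanes t and t+16 hit the same bank (2-way conflict)
        float b = buf[2 * t];

        out[t] = a + b;
    }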

Memory Request Coalescing
(Figure: pending request table (PRT) with per-request thread ID, request size, base address, offset, pending request count, address masks, and thread masks)
- The PRT is filled whenever a memory request is issued
- Generate a set of address masks, one for each memory transaction
- Issue the transactions
From J. Leng et al., "GPUWattch: Enabling Energy Optimizations in GPGPUs," ISCA 2013
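For intuition, the illustrative kernels below (names are hypothetical, not from the slides) contrast a coalesced access, where consecutive lanes of a warp touch consecutive words and collapse into few memory transactions, with a strided access that scatters the lanes across many memory segments:

    // Coalesced: consecutive threads -> consecutive addresses -> few transactions per warp.
    __global__ void coalescedCopy(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // Strided: with a large stride, each lane falls in a different segment,
    // so the warp's request expands into many transactions.
    __global__ void stridedCopy(const float *in, float *out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n)
            out[i] = in[i * stride];
    }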

Case Study: Kepler GK110
From GK110: NVIDIA white paper

Kepler SMX
(Figure: a slice of the SMX, from the GK110 NVIDIA white paper)
- Up to two instructions can be issued per warp per cycle, e.g., an LD and an SFU instruction
- More flexible instruction pairing rules
- More efficient support for atomic operations in global memory, in both latency and throughput, e.g., atomicAdd, atomicExch

Shuffle Instruction
- Permits threads in a warp to share data directly: data is exchanged in registers without using shared memory
- Avoids a load-store sequence through shared memory
- Reduces the shared memory requirement per thread block, which can increase occupancy
- Some operations become more efficient (see the warp-reduction sketch below)
From GK110: NVIDIA white paper
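A common illustration is a warp-level reduction built from shuffles. The sketch below uses the __shfl_down_sync intrinsic, the synchronized form used by current CUDA toolkits (Kepler-era code used __shfl_down); the kernel names are illustrative.

    // Illustrative warp reduction: the warp's sum accumulates into lane 0
    // without touching shared memory.
    __inline__ __device__ float warpReduceSum(float val) {
        for (int offset = 16; offset > 0; offset >>= 1)
            val += __shfl_down_sync(0xffffffff, val, offset);   // pull value from lane (laneId + offset)
        return val;                                             // lane 0 holds the warp's sum
    }

    __global__ void sumKernel(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float v = (i < n) ? in[i] : 0.0f;
        v = warpReduceSum(v);
        if ((threadIdx.x & 31) == 0)                            // one atomic per warp
            atomicAdd(out, v);
    }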

Memory Hierarchy
(Figure: per-SMX L1 cache / shared memory and read-only cache, a shared L2 cache, and DRAM)
- Configurable split between L1 cache and shared memory
- Read-only data cache, usable by the compiler or by the developer through intrinsics
- L2 shared across all SMXs
- ECC coverage across the hierarchy, with a performance impact
From GK110: NVIDIA white paper
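Both knobs are exposed through the CUDA runtime. The snippet below is an illustrative sketch (kernel and helper names are hypothetical): cudaFuncSetCacheConfig requests a larger shared-memory split for a kernel, and __ldg (available on compute capability 3.5+, i.e., GK110 onward) routes loads through the read-only data cache.

    __global__ void scale(const float * __restrict__ in, float *out, int n, float k) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = k * __ldg(&in[i]);     // load through the read-only data cache
    }

    void launchScale(const float *in, float *out, int n, float k) {
        // Prefer the larger shared-memory / smaller L1 split for this kernel.
        cudaFuncSetCacheConfig(scale, cudaFuncCachePreferShared);
        scale<<<(n + 255) / 256, 256>>>(in, out, n, k);
    }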

Dynamic Parallelism
- The ability to launch nested kernels from the device side
- Eliminates host-GPU interactions, although current launch overheads are high
- Matches a wider range of parallelism patterns; will be covered in more depth later
- Examples: recursive, data-dependent parallelism such as adaptive mesh refinement (AMR)
- Can we get by with a weaker CPU?
From GK110: NVIDIA white paper
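A minimal sketch of a device-side launch follows; it assumes compute capability 3.5+ and compilation with relocatable device code (-rdc=true), and the kernel names are hypothetical. Calling cudaDeviceSynchronize() from device code is the Kepler-era idiom for waiting on a child grid; it is deprecated in recent CUDA toolkits.

    // Parent kernel launches a child grid directly from the device.
    __global__ void childKernel(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    __global__ void parentKernel(float *data, int n) {
        if (threadIdx.x == 0 && blockIdx.x == 0) {
            // Device-side nested launch: no round trip to the host.
            childKernel<<<(n + 255) / 256, 256>>>(data, n);
            cudaDeviceSynchronize();    // Kepler-era device-side wait for the child grid
        }
    }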

Concurrent Kernel Launch
- Kernels from multiple streams are now mapped to distinct hardware queues
- Thread blocks from multiple kernels can share an SMX
From GK110: NVIDIA white paper
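A minimal host-side sketch of launching independent kernels into separate streams, so they are eligible to run concurrently on the same GPU; kernel and buffer names are illustrative.

    __global__ void kernelA(float *x, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) x[i] += 1.0f; }
    __global__ void kernelB(float *y, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) y[i] *= 2.0f; }

    void launchConcurrent(float *x, float *y, int n) {
        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);
        kernelA<<<(n + 255) / 256, 256, 0, s0>>>(x, n);   // issued to stream s0
        kernelB<<<(n + 255) / 256, 256, 0, s1>>>(y, n);   // issued to stream s1, independent of s0
        cudaStreamSynchronize(s0);
        cudaStreamSynchronize(s1);
        cudaStreamDestroy(s0);
        cudaStreamDestroy(s1);
    }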

Warp and Instruction Dispatch
(Figure from GK110: NVIDIA white paper)

Grid Management
- Multiple grids launched from both the CPU and the GPU can be handled in Kepler
- Requires the ability to re-prioritize and schedule new grids

Summary
- A warp progresses synchronously through the SM pipeline
- Warp progress within a thread block can diverge for many reasons: barriers, control divergence, memory divergence
- How is the execution optimized? Next.