CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming

Presentation transcript:

CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming
Fall 2010
Jih-Kwon Peir
Computer & Information Science & Engineering, University of Florida

Chapter 6, Supplement 2: Threading Hardware

Single-Program Multiple-Data (SPMD)
A CUDA application is an integrated CPU + GPU C program:
- Serial C code executes on the CPU
- Parallel kernel C code executes on the GPU in thread blocks

CPU Serial Code
GPU Parallel Kernel (Grid 0): KernelA<<< nBlk, nTid >>>(args);
CPU Serial Code
GPU Parallel Kernel (Grid 1): KernelB<<< nBlk, nTid >>>(args);
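
A minimal sketch of this alternation (the kernel body and the names scale and N are illustrative, not from the slides):

    // Sketch of the SPMD pattern: serial host code alternates with
    // parallel kernel launches; kernel body and sizes are illustrative.
    #include <cuda_runtime.h>

    __global__ void scale(float *d, float f, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
        if (i < n) d[i] *= f;
    }

    int main() {
        const int N = 1024;
        float *d;
        cudaMalloc(&d, N * sizeof(float));
        // ... serial CPU code ...
        scale<<<N / 256, 256>>>(d, 2.0f, N);  // Grid 0: kernel runs on the GPU
        cudaDeviceSynchronize();              // CPU then resumes serial code
        // ... a second launch here would form Grid 1 ...
        cudaFree(d);
        return 0;
    }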

Grids and Blocks (Review)
- A kernel is executed as a grid of thread blocks; all threads share the global memory space
- A thread block is a batch of threads that can cooperate with each other by:
  - Synchronizing their execution using a barrier
  - Efficiently sharing data through a low-latency shared memory
- Two threads from two different blocks cannot cooperate
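
A sketch of intra-block cooperation under these rules (the block-reversal kernel and the name BLOCK are illustrative):

    // Illustrative: threads within one block cooperate through shared
    // memory and a barrier; threads in different blocks could not.
    #define BLOCK 256   // assumed to equal blockDim.x at launch

    __global__ void reverseEachBlock(float *d) {
        __shared__ float s[BLOCK];        // low-latency, per-block shared memory
        int t = threadIdx.x;
        s[t] = d[blockIdx.x * BLOCK + t];
        __syncthreads();                  // barrier: all stores visible block-wide
        d[blockIdx.x * BLOCK + t] = s[BLOCK - 1 - t];
    }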

GeForce-8 Series HW Overview
[Block diagram: the Streaming Processor Array is built from Texture Processor Clusters (TPC); each TPC holds two Streaming Multiprocessors (SM) and a TEX unit; each SM contains Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, 8 SPs, and 2 SFUs]

CUDA Processor Terminology
- SPA: Streaming Processor Array (size varies across the GeForce 8 series; 8 TPCs in the GeForce 8800)
- TPC: Texture Processor Cluster (2 SMs + TEX)
- SM: Streaming Multiprocessor (8 SPs); a multithreaded processor core and the fundamental processing unit for a CUDA thread block
- SP: Streaming Processor; a scalar ALU for a single CUDA thread

Streaming Multiprocessor (SM)
- 8 Streaming Processors (SP)
- 2 Super Function Units (SFU)
- Multithreaded instruction dispatch
  - 1 to 512 threads active
  - Shared instruction fetch per 32 threads
  - Covers the latency of texture/memory loads
- 20+ GFLOPS
- 16 KB shared memory
- Texture and global memory access
[Diagram: SM with Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, 8 SPs, and 2 SFUs]

G80 Thread Computing Pipeline
- Processors execute computing threads
- An alternative operating mode, specifically for computing
- The future of GPUs is programmable processing, so the architecture is built around the processor
- The Host and Input Assembler generate thread grids based on kernel calls
[Diagram: the graphics pipeline (Vtx/Geom/Pixel thread issue, Setup/Rstr/ZCull) recast for computing: Host → Input Assembler → Thread Execution Manager → thread processors (SP) with parallel data caches; texture fetch (TF) with L1/L2 caches and frame buffer (FB); load/store paths to Global Memory]

Thread Life Cycle in HW
- A grid is launched on the SPA
- Thread blocks are serially distributed to all the SMs; potentially more than one thread block per SM
- Each SM launches warps of threads: two levels of parallelism
- The SM schedules and executes warps that are ready to run
- As warps and thread blocks complete, resources are freed and the SPA can distribute more thread blocks
[Diagram: the Host launches Kernel 1 as Grid 1, a 3×2 array of blocks (0,0)–(2,1), then Kernel 2 as Grid 2; Block (1,1) expands into its 5×3 array of threads (0,0)–(4,2)]
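
The grid and block shapes in the diagram map directly onto a launch configuration; a sketch (the kernel name kern is a placeholder):

    // Launch matching the diagram: a 3x2 grid of 5x3-thread blocks.
    __global__ void kern() { /* per-thread work */ }

    int main() {
        dim3 grid(3, 2);   // blocks (0,0) .. (2,1)
        dim3 block(5, 3);  // threads (0,0) .. (4,2) within each block
        kern<<<grid, block>>>();
        cudaDeviceSynchronize();
        return 0;
    }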

SM Executes Blocks
- Threads are assigned to SMs at block granularity
  - Up to 8 blocks per SM, as resources allow
  - An SM in G80 can take up to 768 threads: 256 threads/block × 3 blocks, or 128 threads/block × 6 blocks, etc. (see the helper sketched below)
- Threads run concurrently
  - The SM assigns and maintains thread IDs
  - The SM manages and schedules thread execution
[Diagram: two SMs, each with an MT issue unit, SPs, and Shared Memory, each holding several blocks of threads t0…tm; texture traffic flows through TF and the L1/L2 caches to memory]
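
The 768-thread and 8-block limits combine into a simple per-SM block count; an illustrative helper (the function name is ours; the limits are from the slide):

    // Illustrative: how many blocks of a given size fit on one G80 SM.
    int blocksPerSM(int threadsPerBlock) {
        const int maxThreads = 768, maxBlocks = 8;   // G80 per-SM limits
        int byThreads = maxThreads / threadsPerBlock;
        return byThreads < maxBlocks ? byThreads : maxBlocks;
    }
    // blocksPerSM(256) == 3, blocksPerSM(128) == 6, blocksPerSM(64) == 8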

Thread Scheduling/Execution
- Each thread block is divided into 32-thread warps
  - This is an implementation decision, not part of the CUDA programming model
- Warps are the scheduling units in an SM
- If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
  - Each block is divided into 256/32 = 8 warps
  - There are 8 × 3 = 24 warps
  - At any point in time, only one of the 24 warps is selected for instruction fetch and execution (a warp-index sketch follows below)
[Diagram: Block 1 and Block 2 warps (t0…t31 each) feeding an SM's Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, 8 SPs, and 2 SFUs]
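
A thread can recover its warp and lane from its thread index; an illustrative fragment (the variable names are ours):

    __global__ void whoAmI() {
        int tid  = threadIdx.x;   // 0 .. blockDim.x-1 in a 1-D block
        int warp = tid / 32;      // 256 threads/block -> warps 0..7
        int lane = tid % 32;      // position within the 32-thread warp
        // warp and lane could now drive warp-granular work partitioning
    }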

Reworked Streaming Multiprocessors
- Active threads per SM rise from 768 to 1024 (24 to 32 32-thread warps)
- Registers per SM double from 8,192 to 16,384; even with the concomitant increase in thread count, the number of registers usable simultaneously by a thread grows from about 10 to 16 (see the calculation below)
- Dual-issue mode: the SM executes two instructions every two cycles, one MAD and one floating-point MUL
  - One instruction takes two GPU cycles (four ALU cycles) to execute on a warp (32 threads on the 8-way SIMD units), but the SM front end can launch one instruction per cycle, provided the instructions are of different types: a MAD in one case, an SFU (MUL) in the other
- Double precision is provided
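
The per-thread register figures are just the register file divided by the active thread count; an illustrative calculation:

    // Illustrative: average registers per thread at full occupancy.
    //   8192 / 768   ~= 10 registers/thread (G80)
    //   16384 / 1024  = 16 registers/thread (reworked SM)
    int regsPerThread(int regFileSize, int activeThreads) {
        return regFileSize / activeThreads;
    }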

Thread Scheduling (cont.)
- Two clock domains: the GPU cycle, which paces the MUL/SFU, and the ALU cycle, which is twice as fast
- Each thread block is assigned to one SM; each SM can take up to 8 blocks
- Each block holds up to 512 threads, divided into 32-thread warps; each warp is scheduled onto the 8 SPs, 4 threads per SP, and executes in SIMT mode
- An SP is pipelined ~30 stages deep; fetch, decode, gather, and write-back act on whole warps, and one thread is initiated per fast clock (i.e., ALU cycle)
- Execute acts on groups of 8 threads, or quarter-warps (there are only 8 SPs per SM), so the throughput is 1 warp per 4 fast clocks, or 1 warp per 2 slow clocks (i.e., GPU cycles)
- The fetch/decode/... stages have a higher throughput, to feed both the SP/MAD and the SFU/MUL units alternately; hence the peak rate of 8 MAD + 8 MUL per (fast) clock cycle
- 6 warps (192 threads) per SM are needed to hide the read-after-write latencies (more later!)
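
The 6-warp figure is the register read-after-write latency divided by the 4-fast-clock warp issue time; an illustrative calculation (the ~24-cycle latency is an assumption on our part, consistent with NVIDIA's published figure for this generation, not stated on the slide):

    // Illustrative: warps needed to hide register read-after-write latency.
    int rawLatency    = 24;                          // fast clocks (assumed)
    int cyclesPerWarp = 4;                           // fast clocks per warp issue
    int warpsNeeded   = rawLatency / cyclesPerWarp;  // = 6 warps = 192 threads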

SM Warp Scheduling
- SM hardware implements zero-overhead warp scheduling
  - Warps whose next instruction has its operands ready for consumption are eligible for execution
  - Eligible warps are selected for execution by a prioritized scheduling policy
  - All threads in a warp execute the same instruction when it is selected
- 4 clock cycles are needed to dispatch the same instruction for all threads in a warp in G80
  - If one global memory access is needed for every 4 instructions, a minimum of 13 warps is needed to fully tolerate a 200-cycle memory latency (see the arithmetic below)
[Diagram: the SM's multithreaded warp scheduler interleaving, over time, warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, …, warp 8 instruction 12, warp 3 instruction 96]
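
The 13 comes from rounding 200 cycles up against the 16 cycles of useful work each warp contributes between memory accesses; an illustrative calculation:

    // Illustrative: warps needed to cover global memory latency when each
    // warp issues 4 instructions (4 dispatch cycles each) per memory access.
    int latency     = 200;                                        // cycles
    int workPerWarp = 4 * 4;                                      // 16 cycles
    int warpsNeeded = (latency + workPerWarp - 1) / workPerWarp;  // ceil = 13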

Overlap Global Memory Access

SM Instruction Buffer – Warp Scheduling
- Fetch one warp instruction per cycle from the instruction L1 cache into any instruction buffer slot
- Issue one "ready-to-go" warp instruction per cycle from any warp / instruction buffer slot
  - Operand scoreboarding is used to prevent hazards
  - Issue selection is based on round-robin / warp age
- The SM broadcasts the same instruction to the 32 threads of a warp
[Diagram: I$ L1 → multithreaded instruction buffer → operand select, fed by the register file (RF), constant cache (C$ L1), and shared memory, issuing to the MAD and SFU units]

Scoreboarding
- All register operands of all instructions in the instruction buffer are scoreboarded
  - An instruction becomes ready once the needed values are deposited
  - This prevents hazards
  - Cleared instructions are eligible for issue
- Decoupled memory/processor pipelines
  - Any thread can continue to issue instructions until scoreboarding prevents issue
  - This allows memory/processor ops to proceed in the shadow of other waiting memory/processor ops