ME964 High Performance Computing for Engineering Applications: Execution Model and Its Hardware Support. Sept. 25, 2008.



Before we get started…
Last Time
- The CUDA execution model
- Wrapped up the overview of the CUDA API
- Read CUDA Programming Guide 1.1 (for next Tu)
Today
- Review of concepts discussed over the previous two lectures
- More on the CUDA execution model and its hardware support
  - Focus on thread scheduling
- HW4 assigned
  - Due on Thursday, Oct. 2 at 11:59 PM
  - Timing Kernel Call Overhead, Matrix-Matrix multiplication (tiled, arbitrary size matrices), Vector Reduction operation
Please Note:
- On Nov 11 and 13 we'll have a Guest Lecturer, Dr. Darius Buntinas, of Argonne National Lab
- Lectures will cover MPI, a different parallel computational model
- The two lectures will run *two* hours long
  - You'll get a free Tu or Th afterwards…

Why Use the GPU for Computing?
- The GPU has evolved into a flexible and powerful processor:
  - It's programmable using high-level languages (soon in FORTRAN)
  - It supports 32-bit floating point precision and double precision (2.0)
  - It is capable of GFLOP-level number-crunching speed
- There is a GPU in each of today's PCs and workstations

What is Driving this Evolution?
- The GPU is specialized for compute-intensive, highly data parallel computation (owing to its graphics rendering origin)
  - More transistors can be devoted to data processing rather than data caching and flow control
- The fast-growing video game industry exerts strong economic pressure that forces constant innovation
[Figure: CPU vs. GPU chip layout, showing the area given to Control, Cache, ALUs, and DRAM]
Credit: HK-UIUC

ALU: Arithmetic Logic Unit
- Digital circuit that performs arithmetic and logical operations
- Fundamental building block of a processing unit (CPU and GPU)
- A and B are the operands (the data, coming from input registers)
- F is an operator ("+", "-", etc.), specified by the control unit
- R is the result, stored in the output register
- D is an output flag passed back to the control unit
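The A, B, F, R, D roles listed on the ALU slide can be illustrated with a toy model. This is only a sketch: the type and function names (`Op`, `AluResult`, `alu`) are hypothetical, the four operators are an arbitrary selection, and the meaning of the flag D (set when the result is zero) is an assumption, since the slide does not say which condition D reports.

```cpp
#include <cassert>
#include <cstdint>

// The operator F, chosen by the control unit (an illustrative subset).
enum class Op { Add, Sub, And, Or };

struct AluResult {
    std::uint32_t r;  // result R, destined for the output register
    bool d;           // output flag D (assumption: set when R is zero)
};

// Toy ALU: applies operator f to operands a and b.
// Unsigned arithmetic is used so that overflow wraps around.
inline AluResult alu(Op f, std::uint32_t a, std::uint32_t b) {
    std::uint32_t r = 0;
    switch (f) {
        case Op::Add: r = a + b; break;
        case Op::Sub: r = a - b; break;
        case Op::And: r = a & b; break;
        case Op::Or:  r = a | b; break;
    }
    return AluResult{r, r == 0};
}
```

For example, `alu(Op::Sub, 7, 7)` yields R = 0 with the flag raised, which a control unit could use to decide a branch.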

Some Useful Information on Tools (short detour)

Compilation
- Any source file containing CUDA language extensions must be compiled with nvcc
  - You spot such a file by its .cu suffix
- nvcc is a compiler driver
  - Works by invoking all the necessary tools and compilers like cudacc, g++, cl, ...
  - Assignment: Read the nvcc document available on the class website
- nvcc can output:
  - C code (must then be compiled with the rest of the application using another tool)
  - Assembly code (ptx)
  - Or directly object code
Credit: HK-UIUC

Linking
- Any executable with CUDA code requires two dynamic libraries:
  - The CUDA runtime library (cudart)
  - The CUDA core library (cuda)
Credit: HK-UIUC

Debugging Using the Device Emulation Mode
- An executable compiled in device emulation mode (using nvcc -deviceemu) runs entirely on the host using the CUDA runtime
  - No need for any device or CUDA driver
  - Each device thread is emulated with a host thread
  - For your assignments: in the Developer Studio project, select the "EmuDebug" or "EmuRelease" build configurations
- When running in device emulation mode, one can:
  - Use host native debug support (breakpoints, variable QuickWatch and edit, etc.)
  - Access any device-specific data from host code and vice-versa
  - Call any host function from device code (e.g. printf) and vice-versa
  - Detect deadlock situations caused by improper usage of __syncthreads

Device Emulation Mode Pitfalls
- Emulated device threads execute sequentially, so simultaneous accesses of the same memory location by multiple threads could produce different results
- Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode, but will generate an error in device execution mode
- Results of floating-point computations will slightly differ because of:
  - Different compiler outputs, instruction sets
  - Use of extended precision for intermediate results
    - There are various options to force strict single precision on the host
Credit: HK-UIUC

End Information on Tools
Begin Discussion on Block/Thread Scheduling

Review: The CUDA Programming Model
- GPU Architecture Paradigm: Single Instruction Multiple Data (SIMD)
- CUDA perspective: Single Program Multiple Threads
- What's the overall software (application) development model?
  - CUDA integrated CPU + GPU application C program
  - Serial C code executes on CPU
  - Parallel Kernel C code executes on GPU thread blocks
[Figure: execution alternates between CPU Serial Code and GPU Parallel Kernels. KernelA<<< nBlk, nTid >>>(args); runs as Grid 0, then KernelB<<< nBlk, nTid >>>(args); runs as Grid 1]

Execution Configuration: Grids and Blocks (Review)
- A kernel is executed as a grid of blocks of threads
  - All threads in a kernel can access several device data memory spaces
- A block [of threads] is a batch of threads that can cooperate with each other by:
  - Synchronizing their execution, for hazard-free shared memory accesses
  - Efficiently sharing data through a low latency shared memory
- Threads from two different blocks cannot cooperate!!!
  - This has important software design implications
[Figure: the Host launches Kernel 1 and Kernel 2 on the Device as Grid 1 and Grid 2. Grid 1 holds Blocks (0, 0) through (2, 1); Block (1, 1) is expanded into Threads (0, 0) through (4, 2). Courtesy: NVIDIA]
Credit: HK-UIUC

CUDA Thread Block: Review
- In relation to a Block, the programmer decides:
  - Block size: from 1 to 512 concurrent threads
  - Block dimension (shape): 1D, 2D, or 3D
  - # of threads in each dimension
- All threads in a Block execute the same thread code
- Threads have thread id numbers within the Block
- Threads share data and synchronize while doing their share of the work
- Thread program uses thread id to select work and address shared data
[Figure: a CUDA Thread Block whose threads, with thread ids up to m, all run the same thread code. Courtesy: John Nickolls, NVIDIA]
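The statement that a thread program uses its thread id to select work can be sketched on the host. The code below is plain C++ that emulates, sequentially, what each thread of each block would do in a vector addition; it is not device code, and the function names (`global_index`, `emulate_vector_add`) are hypothetical. In a real CUDA kernel the same index is commonly computed as `blockIdx.x * blockDim.x + threadIdx.x`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Global element index owned by one thread, for a 1D grid of 1D blocks.
// Mirrors the usual CUDA expression blockIdx.x * blockDim.x + threadIdx.x.
inline std::size_t global_index(std::size_t block_idx, std::size_t block_dim,
                                std::size_t thread_idx) {
    return block_idx * block_dim + thread_idx;
}

// Sequential host emulation of a vector-add "kernel": the two loops stand in
// for the blocks of the grid and the threads of each block.
inline std::vector<float> emulate_vector_add(const std::vector<float>& a,
                                             const std::vector<float>& b,
                                             std::size_t block_dim) {
    // Block size limit of 512 threads is the one quoted on the slide.
    assert(a.size() == b.size() && block_dim >= 1 && block_dim <= 512);
    const std::size_t n = a.size();
    const std::size_t num_blocks = (n + block_dim - 1) / block_dim;  // round up
    std::vector<float> c(n, 0.0f);
    for (std::size_t blk = 0; blk < num_blocks; ++blk) {
        for (std::size_t tid = 0; tid < block_dim; ++tid) {
            const std::size_t i = global_index(blk, block_dim, tid);
            if (i < n) {  // threads past the end of the data do no work
                c[i] = a[i] + b[i];
            }
        }
    }
    return c;
}
```

The bounds check matters whenever the data size is not a multiple of the block size: with 1000 elements and 256 threads per block, four blocks are launched and the last 24 threads of the final block stay idle.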

GeForce-8 Series HW Overview
[Figure: the Streaming Processor Array is made of Texture Processor Clusters (TPC). Each TPC contains a TEX unit and SMs. Each Streaming Multiprocessor (SM) contains Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, SPs, and SFUs]
Credit: HK-UIUC

CUDA Processor Terminology
- SPA: Streaming Processor Array (size variable across the GeForce 8-series; 8 TPCs in the GeForce 8800 GTX)
- TPC: Texture Processor Cluster (2 SM + TEX)
- SM: Streaming Multiprocessor (8 SP)
  - Multi-threaded processor core
  - Fundamental processing unit for a CUDA thread block
- SP: Scalar [Streaming] Processor
  - Scalar ALU for a single CUDA thread
Credit: HK-UIUC

Streaming Multiprocessor (SM)
- 8 Scalar Processors (SP)
- 2 Special Function Units (SFU)
- It's where a block lands for execution
- Multi-threaded instruction dispatch
  - 1 to 768 (!) threads active
  - Shared instruction fetch per 32 threads
- 20+ GFLOPS on G80
- 16 KB shared memory
- DRAM texture and memory access
[Figure: Streaming Multiprocessor with Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, SPs, and SFUs]
Credit: HK-UIUC

Scheduling on the HW
- Grid is launched on the SPA
- Thread Blocks are serially distributed to all the SMs
  - Potentially >1 Thread Block per SM
- Each SM launches Warps of Threads
- SM schedules and executes Warps that are ready to run
- As Warps and Thread Blocks complete, resources are freed
  - SPA can launch next Block[s] in line
- NOTE: Two levels of scheduling:
  - For running [desirably] a large number of blocks on a small number of SMs (16/14/etc.)
  - For running up to 24 warps of threads on the 8 SPs available on each SM
[Figure: the Host launches Kernel 1 and Kernel 2 on the Device as Grid 1 and Grid 2. Grid 1 holds Blocks (0, 0) through (2, 1); Block (1, 1) is expanded into Threads (0, 0) through (4, 2)]

SM Executes Blocks
- Threads are assigned to SMs in Block granularity
  - Up to 8 Blocks to each SM (doesn't mean you'll have eight though…)
  - SM in G80 can take up to 768 threads
    - This is 24 warps (occupancy calculator!!)
    - Could be 256 (threads/block) * 3 blocks
    - Or 128 (threads/block) * 6 blocks, etc.
- Threads run concurrently but time slicing is involved
  - SM assigns/maintains thread id #s
  - SM manages/schedules thread execution
[Figure: SM 0 and SM 1, each with an MT IU, SPs, and Shared Memory, each hosting Blocks of threads t0 t1 t2 … tm; below them Texture L1, TF, L2, and Memory]
Credit: HK-UIUC

Thread Scheduling/Execution
- Each Thread Block is divided into 32-thread Warps
  - This is an implementation decision, not part of the CUDA programming model
- Warps are the basic scheduling units in an SM
- If 3 blocks are assigned to an SM and each Block has 256 threads, how many Warps are there in an SM?
  - Each Block is divided into 256/32 = 8 Warps
  - There are 8 * 3 = 24 Warps
  - At any point in time, only *one* of the 24 Warps will be selected for instruction fetch and execution
[Figure: Block 1 Warps and Block 2 Warps, each Warp holding threads t0 t1 t2 … t31, feeding a Streaming Multiprocessor with Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, SPs, and SFUs]
Credit: HK-UIUC
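The warp count worked out on this slide can be checked with a few lines of host-side C++. This is a minimal sketch with hypothetical names (`kWarpSize`, `warps_per_block`, `warps_on_sm`); the warp size of 32 is the G80 value quoted on the slide, and the round-up reflects that a block whose size is not a multiple of 32 still occupies a partially filled last warp.

```cpp
#include <cassert>

constexpr int kWarpSize = 32;  // G80 warp size, per the slide

// Number of warps a block is split into (a partial last warp still counts).
constexpr int warps_per_block(int threads_per_block) {
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Warps resident on one SM when `blocks_on_sm` blocks are assigned to it.
constexpr int warps_on_sm(int blocks_on_sm, int threads_per_block) {
    return blocks_on_sm * warps_per_block(threads_per_block);
}

// The slide's example: 3 blocks of 256 threads each.
static_assert(warps_per_block(256) == 8, "256 / 32 = 8 warps per block");
static_assert(warps_on_sm(3, 256) == 24, "8 * 3 = 24 warps on the SM");
```

Because the functions are `constexpr`, the two `static_assert` lines verify the slide's numbers at compile time.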

SM Warp Scheduling
- SM hardware implements zero-overhead Warp scheduling
  - Warps whose next instruction has its operands ready for consumption are eligible for execution
  - Eligible Warps are selected for execution on a prioritized scheduling policy
  - All threads in a Warp execute the same instruction when selected
- 4 clock cycles are needed to dispatch the same instruction for all threads in a Warp in G80
  - Side-comment: Suppose your code has one global memory access every four instructions
  - Then, a minimum of 13 Warps are needed to fully tolerate 200-cycle memory latency
[Figure: SM multithreaded Warp scheduler timeline, interleaving instructions from warps 8, 1, and 3 over time]
Credit: HK-UIUC
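The figure of 13 Warps in the side-comment follows from a short back-of-the-envelope calculation: a warp issues 4 instructions of 4 cycles each (16 cycles of work) before it stalls on its next memory access, and 200 / 16 = 12.5, which rounds up to 13. The sketch below reproduces that estimate; the function name and parameter names are hypothetical, and the model deliberately ignores everything except the three quantities the slide mentions.

```cpp
#include <cassert>

// Rough estimate of how many warps must be resident so that the SM always has
// an instruction to issue while memory requests are in flight.
// One warp supplies (instructions_per_memory_access * cycles_per_warp_instruction)
// cycles of work before stalling; the latency is divided by that, rounded up.
constexpr int warps_to_hide_latency(int memory_latency_cycles,
                                    int instructions_per_memory_access,
                                    int cycles_per_warp_instruction) {
    return (memory_latency_cycles +
            instructions_per_memory_access * cycles_per_warp_instruction - 1) /
           (instructions_per_memory_access * cycles_per_warp_instruction);
}

// The slide's numbers: 200-cycle latency, one access every 4 instructions,
// 4 cycles per warp instruction on G80.
static_assert(warps_to_hide_latency(200, 4, 4) == 13, "200 / 16 = 12.5, rounded up");
```

The same function shows why arithmetic intensity helps: with one memory access every 8 instructions, the estimate drops to 7 warps.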

SM Instruction Buffer: Warp Scheduling
- Fetch one warp instruction/cycle
  - From the instruction L1 cache
  - Into any instruction buffer slot
- Issue one "ready-to-go" warp instruction every 4 cycles
  - From any warp / instruction buffer slot
  - Operand scoreboarding is used to prevent hazards
- Issue selection is based on round-robin/age of warp
- SM broadcasts the same instruction to the 32 Threads of a Warp
[Figure: SM pipeline with I$ L1, Multithreaded Instruction Buffer, RF, C$ L1, Shared Mem, Operand Select, MAD, and SFU]
Credit: HK-UIUC

Scoreboarding
- All register operands of all instructions in the Instruction Buffer are scoreboarded
  - Status becomes "ready" after the needed values are deposited
  - Prevents hazards
  - Cleared instructions are eligible for issue
- Decoupled Memory/Processor pipelines
  - Any thread can continue to issue instructions until scoreboarding prevents issue
Credit: HK-UIUC

Granularity Considerations
- For Matrix Multiplication, should I use 8x8, 16x16 or 32x32 tiles?
  - For 8x8, we have 64 threads per Block. Since each SM can take up to 768 threads, it could take up to 12 Blocks. However, because each SM can only take up to 8 Blocks, only 512 threads will go into each SM!
  - For 16x16, we have 256 threads per Block. Since each SM can take up to 768 threads, it can take up to 3 Blocks and achieve full capacity unless other resource considerations overrule.
  - For 32x32, we have 1024 threads per Block. This is not an option anyway (we need at most 512 threads per block, and at most 768 per SM).
Credit: HK-UIUC
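The three tile sizes can be compared with a small host-side calculation. This is a sketch under the limits quoted on the slides for G80 (512 threads per block, 768 threads and 8 blocks per SM); the constant and function names are hypothetical, and register and shared memory usage, which can lower the result further, are deliberately ignored.

```cpp
#include <algorithm>
#include <cassert>

// G80 limits quoted on the slides.
constexpr int kMaxThreadsPerBlock = 512;
constexpr int kMaxThreadsPerSM = 768;
constexpr int kMaxBlocksPerSM = 8;

// Blocks of tile x tile threads that fit on one SM, considering only the
// thread and block limits (registers and shared memory are ignored here).
inline int resident_blocks(int tile) {
    const int threads_per_block = tile * tile;
    if (threads_per_block > kMaxThreadsPerBlock) {
        return 0;  // such a block cannot be launched at all
    }
    return std::min(kMaxThreadsPerSM / threads_per_block, kMaxBlocksPerSM);
}

// Threads resident on one SM for a given tile size.
inline int resident_threads(int tile) {
    return resident_blocks(tile) * tile * tile;
}
```

Under these assumptions `resident_threads(8)` gives 512 (capped by the 8-block limit), `resident_threads(16)` gives 768 (full capacity), and `resident_threads(32)` gives 0 because a 1024-thread block exceeds the per-block limit.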

How would you scale up the GPU?
- Scaling up here means beefing it up
- Two issues:
  - As a company, you don't want to rock the boat a lot when scaling up
  - You don't want to have legacy code re-written to take advantage of new HW
- You can beef up the memory (not discussed here)
- Increase the number of TPCs
  - Easy to do, basically more HW
  - Implications on our side: If you have enough blocks, you rise with the tide too
- Increase the number of SMs on each TPC
  - Easy to do, basically more HW
  - Implications on our side: If you have enough blocks, you rise with the tide too
- Increase the number of SPs
  - This is tricky, you'd have to fiddle with the control unit of the SM
  - The Warp size would change; most likely this would require more threads in a block to be efficient, but that requires more memory on the chip (shared & registers)
  - It snowballs, so this is probably going to stay like this for a while…

New GT200 GPU Architecture
[Figure: Streaming Processor Array made of 10 Texture Processing Clusters (TPC), each containing 3 SMs]
- G80: up to 8 TPCs in the SPA
- GT200: 10 TPCs in the SPA
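The remark that code with enough blocks "rises with the tide" when SMs are added can be illustrated with an idealized model. The sketch below is an assumption-laden simplification: every block is taken to run for the same time, each SM holds a fixed number of blocks at once, and the function name `rounds_to_finish` is hypothetical. The SM counts come from the slides (G80: 8 TPCs with 2 SMs each; GT200: 10 TPCs with 3 SMs each).

```cpp
#include <cassert>

// Idealized model: all blocks take equal time and each SM holds
// `blocks_per_sm` blocks at once. Returns how many rounds of blocks are
// needed to drain a grid of `grid_blocks` blocks (rounded up).
constexpr int rounds_to_finish(int grid_blocks, int num_sms, int blocks_per_sm) {
    return (grid_blocks + num_sms * blocks_per_sm - 1) / (num_sms * blocks_per_sm);
}

constexpr int kSmsG80 = 8 * 2;     // 8 TPCs x 2 SMs, per the slides
constexpr int kSmsGT200 = 10 * 3;  // 10 TPCs x 3 SMs, per the slides

// A large grid benefits from the extra SMs without any code change...
static_assert(rounds_to_finish(960, kSmsG80, 3) == 20, "960 / 48 = 20 rounds");
static_assert(rounds_to_finish(960, kSmsGT200, 3) == 11, "960 / 90, rounded up");
// ...while a grid with too few blocks does not.
static_assert(rounds_to_finish(16, kSmsG80, 3) == rounds_to_finish(16, kSmsGT200, 3),
              "16 blocks fit in one round on either chip");
```

In this model a 960-block grid needs 20 rounds on 16 SMs but only 11 on 30 SMs, while a 16-block grid gains nothing, which is the practical reason to launch many more blocks than there are SMs.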

End Discussion on Block/Thread Scheduling
Begin Discussion on Memory Access