CS 152 Computer Architecture and Engineering
Lecture 16: Graphics Processing Units (GPUs)
Krste Asanovic
Electrical Engineering and Computer Sciences, University of California, Berkeley
April 4, 2011 (CS152, Spring 2011)

Last Time: Vector Computers
- Vectors provide efficient execution of data-parallel loop code
- Vector ISA provides a compact encoding of machine parallelism
- Vector ISA scales to more lanes without changing the binary code
- Vector registers provide fast temporary storage, reducing memory bandwidth demands and simplifying dependence checking between vector instructions
- Scatter/gather, masking, and compress/expand operations increase the set of vectorizable loops (see the sketch below)
- Requires extensive compiler analysis (or programmer annotation) to be certain that loops can be vectorized
- Full "long" vector support (vector length control, scatter/gather) is still found only in supercomputers (NEC SX9, Cray X1E); microprocessors have limited packed or subword-SIMD operations
  – Intel x86 MMX/SSE/AVX
  – IBM/Motorola PowerPC VMX/Altivec
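A sketch (not from the original slides) of the kind of loops that vectorize only with gather and masking support; the function names are made up for illustration:

    // Needs gather: x is accessed through an index vector.
    void sparse_axpy(int n, double a, const double *x,
                     const int *idx, double *y) {
      for (int i = 0; i < n; i++)
        y[i] += a * x[idx[i]];   // x[idx[i]] becomes a vector gather
    }

    // Needs masking: the body must execute only where the condition holds.
    void masked_scale(int n, double a, double *y) {
      for (int i = 0; i < n; i++)
        if (y[i] > 0.0)          // condition becomes a vector mask
          y[i] *= a;
    }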

Types of Parallelism
- Instruction-Level Parallelism (ILP)
  – Execute independent instructions from one instruction stream in parallel (pipelining, superscalar, VLIW)
- Thread-Level Parallelism (TLP)
  – Execute independent instruction streams in parallel (multithreading, multiple cores)
- Data-Level Parallelism (DLP)
  – Execute multiple operations of the same type in parallel (vector/SIMD execution)
- Which is easiest to program?
- Which is the most flexible form of parallelism?
  – i.e., can be used in more situations
- Which is most efficient?
  – i.e., greatest tasks/second/area, lowest energy/task

Resurgence of DLP
- Convergence of application demands and technology constraints drives architecture choice
- New applications, such as graphics, machine vision, speech recognition, and machine learning, all require large numerical computations that are often trivially data parallel
- SIMD-based architectures (vector-SIMD, subword-SIMD, SIMT/GPUs) are the most efficient way to execute these algorithms

DLP important for conventional CPUs too
- Prediction for x86 processors, from Hennessy & Patterson, upcoming 5th edition
  – Note: educated guess, not Intel product plans!
  – TLP: 2+ cores / 2 years
  – DLP: 2x width / 4 years
- DLP will account for more mainstream parallelism growth than TLP in the next decade
  – SIMD: single-instruction, multiple-data (DLP)
  – MIMD: multiple-instruction, multiple-data (TLP)

Graphics Processing Units (GPUs)
- Original GPUs were dedicated fixed-function devices for generating 3D graphics (mid-to-late 1990s), including high-performance floating-point units
  – Provided workstation-like graphics for PCs
  – User could configure the graphics pipeline, but not really program it
- Over time, more programmability was added
  – E.g., the new language Cg for writing small programs run on each vertex or each pixel; also Windows DirectX variants
  – Massively parallel (millions of vertices or pixels per frame) but a very constrained programming model
- Some users noticed they could do general-purpose computation by mapping input and output data to images, and computation to vertex- and pixel-shading computations
  – Incredibly difficult programming model, as general computation had to be expressed through the graphics pipeline

General-Purpose GPUs (GP-GPUs)
- In 2006, Nvidia introduced the GeForce 8800 GPU supporting a new programming language: CUDA
  – "Compute Unified Device Architecture"
  – Subsequently, the broader industry has been pushing for OpenCL, a vendor-neutral version of the same ideas
- Idea: take advantage of GPU computational performance and memory bandwidth to accelerate some kernels for general-purpose computing
- Attached processor model: the host CPU issues data-parallel kernels to the GP-GPU for execution
- This lecture uses a simplified version of the Nvidia CUDA-style model and only considers GPU execution of computational kernels, not graphics
  – Would probably need another course to describe graphics processing

Simplified CUDA Programming Model
- Computation is performed by a very large number of independent small scalar threads (CUDA threads or microthreads) grouped into thread blocks.

    // C version of DAXPY loop.
    void daxpy(int n, double a, double *x, double *y) {
      for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
    }

    // CUDA version.
    // Piece run on host processor:
    int nblocks = (n+255)/256;              // 256 CUDA threads/block
    daxpy<<<nblocks,256>>>(n, 2.0, x, y);   // <<<blocks, threads/block>>> launch

    // Piece run on GP-GPU (__global__ marks a kernel launchable from the host):
    __global__
    void daxpy(int n, double a, double *x, double *y) {
      int i = blockIdx.x*blockDim.x + threadIdx.x;
      if (i < n) y[i] = a*x[i] + y[i];
    }
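The slides omit the host-side memory management needed to make this launch runnable. A minimal sketch, assuming the __global__ daxpy kernel above is in scope (the d_x/d_y names and this wrapper function are illustrative, not part of the slides):

    #include <cuda_runtime.h>

    void daxpy_on_gpu(int n, double a, const double *x, double *y) {
      double *d_x, *d_y;                       // device copies of x and y
      cudaMalloc(&d_x, n*sizeof(double));
      cudaMalloc(&d_y, n*sizeof(double));
      cudaMemcpy(d_x, x, n*sizeof(double), cudaMemcpyHostToDevice);
      cudaMemcpy(d_y, y, n*sizeof(double), cudaMemcpyHostToDevice);

      int nblocks = (n+255)/256;               // enough blocks to cover n
      daxpy<<<nblocks,256>>>(n, a, d_x, d_y);  // asynchronous kernel launch

      // Copy-back on the same stream implicitly waits for the kernel to finish.
      cudaMemcpy(y, d_y, n*sizeof(double), cudaMemcpyDeviceToHost);
      cudaFree(d_x);
      cudaFree(d_y);
    }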

Programmer's View of Execution
[Diagram: a grid of thread blocks, blockIdx 0, blockIdx 1, …, blockIdx (n+255)/256, each containing threadId 0 through threadId 255]
- blockDim = 256 (programmer can choose)
- Create enough blocks to cover the input vector (Nvidia calls this ensemble of blocks a grid; it can be 2-dimensional)
- The conditional (i < n) turns off the unused threads in the last block

GPU Hardware Execution Model
- The GPU is built from multiple parallel cores; each core contains a multithreaded SIMD processor with multiple lanes but no scalar processor
- The CPU sends a whole "grid" over to the GPU, which distributes thread blocks among the cores (each thread block executes on one core)
  – Programmer is unaware of the number of cores
[Diagram: a CPU with CPU memory attached to a GPU with GPU memory; the GPU contains cores 0 through 15, each with lanes 0 through 15]

"Single Instruction, Multiple Thread"
- GPUs use a SIMT model, where the individual scalar instruction streams for each CUDA thread are grouped together for SIMD execution on hardware (Nvidia groups 32 CUDA threads into a warp); the sketch below shows the mapping
[Diagram: one scalar instruction stream (ld x; mul a; ld y; add; st y) executing in SIMD fashion across a warp of microthreads µT0–µT7]
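A small sketch (not from the slides) of how consecutive CUDA thread indices map onto warps and lanes; the variable names are made up, and 32 is Nvidia's warp size:

    __global__ void warp_mapping(int *warp_of, int *lane_of) {
      int i = blockIdx.x*blockDim.x + threadIdx.x;
      warp_of[i] = threadIdx.x / 32;  // which warp of its block this µthread is in
      lane_of[i] = threadIdx.x % 32;  // which SIMD lane within that warp
      // All 32 µthreads sharing a warp execute each instruction together.
    }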

Implications of SIMT Model
- All "vector" loads and stores are scatter-gathers, since individual µthreads perform scalar loads and stores
  – The GPU adds hardware to dynamically coalesce individual µthread loads and stores to mimic vector loads and stores (example below)
- Every µthread has to perform stripmining calculations redundantly ("am I active?"), as there is no scalar processor equivalent
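A sketch (not from the slides) of what coalescing means for kernel code; both kernels are legal, but only the first presents addresses the coalescing hardware can merge into wide memory accesses:

    __global__ void unit_stride(const float *x, float *y, int n) {
      int i = blockIdx.x*blockDim.x + threadIdx.x;
      if (i < n)                // redundant per-µthread "am I active?" test
        y[i] = x[i];            // adjacent µthreads touch adjacent words: coalesces
    }

    __global__ void strided(const float *x, float *y, int n, int stride) {
      int i = blockIdx.x*blockDim.x + threadIdx.x;
      if (i < n)
        y[i] = x[i*stride];     // adjacent µthreads touch scattered words:
    }                           // degenerates into many narrow accesses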

CS152 Administrivia
- Quiz 3 results

Question 1: Register Renaming
- Avg: 12.1 points
- Stdev: 3.2 points

Question 2: OoO Scheduling
- Avg: 12.6 points
- Stdev: 5.0 points

Question 3: Branch Prediction
- Avg: 9.4 points
- Stdev: 2.4 points

Question 4: Iron Law for OoO
- Avg: 26.9 points
- Stdev: 3.8 points

Quiz 3 Totals
- Avg: 62.0 points
- Stdev: 9.8 points

CS152 Administrivia
- Quiz 4, Monday April 11: "VLIW, Multithreading, Vector, and GPUs"
  – Covers lectures L13-L16 and associated readings
  – PS 4 + Lab 4

Conditionals in SIMT Model
- Simple if-then-else is compiled into predicated execution, equivalent to vector masking
- More complex control flow is compiled into branches
- How to execute a vector of branches? (see the sketch after the diagram)
[Diagram: a scalar instruction stream with a branch (tid = threadid; if (tid >= n) goto skip; call func1; add; skip: st y) executing in SIMD fashion across a warp of µthreads µT0–µT7]
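A source-level sketch (not from the slides) of which case each construct typically falls into:

    __global__ void conditionals(const float *x, float *y, int n) {
      int i = blockIdx.x*blockDim.x + threadIdx.x;
      if (i >= n) return;
      // Simple if-then-else: typically becomes predicated/select execution,
      // so both sides occupy instruction slots but no branch is needed.
      y[i] = (x[i] > 0.0f) ? x[i] : -x[i];
      // Data-dependent loop: compiled as real branches, so µthreads within
      // one warp can diverge and the hardware must track who is active.
      while (y[i] > 1.0f)
        y[i] *= 0.5f;
    }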

Branch Divergence
- Hardware tracks which µthreads take or don't take the branch
- If all go the same way, keep going in SIMD fashion
- If not, create a mask vector indicating taken/not-taken
- Keep executing the not-taken path under the mask; push the taken-branch PC and mask onto a hardware stack and execute it later (conceptual sketch below)
- When can execution of µthreads in a warp reconverge?
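A conceptual C sketch of that divergence stack, for intuition only; this simplification is not any vendor's documented microarchitecture, and jump, push_entry, and set_active_mask are hypothetical hardware primitives:

    #include <stdint.h>

    struct StackEntry { uint32_t mask; uint32_t pc; };

    // Hypothetical hardware primitives, named only for this sketch:
    void jump(uint32_t pc);
    void push_entry(struct StackEntry e);
    void set_active_mask(uint32_t mask);

    void warp_branch(uint32_t active, uint32_t taken,
                     uint32_t taken_pc, uint32_t fallthrough_pc) {
      if (taken == active) { jump(taken_pc); return; }        // all µthreads taken
      if (taken == 0)      { jump(fallthrough_pc); return; }  // none taken
      // Diverged: run not-taken µthreads now, defer the taken ones.
      push_entry((struct StackEntry){ .mask = taken, .pc = taken_pc });
      set_active_mask(active & ~taken);
      jump(fallthrough_pc);
      // When the not-taken path reaches the reconvergence point, hardware pops
      // the stack, runs the taken path under its mask, then merges the masks.
    }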

Warps Are Multithreaded on a Core
- One warp of 32 µthreads is a single thread in the hardware
- Multiple warp threads are interleaved in execution on a single core to hide latencies (memory and functional unit)
- A single thread block can contain multiple warps (up to 512 µT max in CUDA), all mapped to a single core (worked example below)
- There can be multiple blocks executing on one core
[Nvidia, 2010]
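A worked example consistent with the numbers above, reusing the daxpy kernel from the earlier slide (n, x, and y are assumed in scope):

    // A block of 512 µthreads is 512/32 = 16 warps, i.e. 16 hardware threads
    // that the core's scheduler interleaves to hide memory latency.
    int threads_per_block = 512;
    int nblocks = (n + threads_per_block - 1) / threads_per_block;
    daxpy<<<nblocks, threads_per_block>>>(n, 2.0, x, y);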

GPU Memory Hierarchy
[Figure: Nvidia, 2010]

SIMT
- Illusion of many independent threads
- But for efficiency, the programmer must try to keep µthreads aligned in a SIMD fashion
  – Try to do unit-stride loads and stores so that memory coalescing kicks in
  – Avoid branch divergence so that most instruction slots execute useful work and are not masked off

Nvidia Fermi GF100 GPU
[Figure: Nvidia, 2010]

Fermi "Streaming Multiprocessor" Core

Fermi Dual-Issue Warp Scheduler

GPU Future
- High-end desktops have a separate GPU chip, but the trend is towards integrating the GPU on the same die as the CPU (already done in laptops, tablets, and smartphones)
  – Advantage: shared memory with the CPU, so no need to transfer data
  – Disadvantage: reduced memory bandwidth compared to a dedicated, smaller-capacity, specialized memory system
    » Graphics DRAM (GDDR) versus regular DRAM (DDR3)
- Will GP-GPU survive? Or will improvements in CPU DLP make GP-GPU redundant?
  – On the same die, the CPU and GPU should have the same memory bandwidth
  – The GPU might have more FLOPS, as needed for graphics anyway

Acknowledgements
- These slides contain material developed and copyright by:
  – Krste Asanovic (UCB)