4/4/2016 CS152, Spring 2016 CS 152 Computer Architecture and Engineering Lecture 16: Graphics Processing Units (GPUs) Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory

4/4/2016 CS152, Spring 2016 Administrivia  PS5 is out  PS4 due on Wednesday  Lab 4  Quiz 4 on Monday April 11th

4/4/2016 CS152, Spring 2016 Vector Programming Model [Diagram] Scalar registers r0–r15. Vector registers v0–v15, each holding elements [0][1][2]…[VLRMAX-1]; the vector length register (VLR) selects how many elements [0]…[VLR-1] participate. Vector arithmetic instructions (e.g., ADDV v3, v1, v2) operate elementwise across vector registers. Vector load and store instructions (e.g., LV v1, r1, r2, with base r1 and stride r2) move data between memory and a vector register.

4/4/2016 CS152, Spring 2016 Vector Stripmining Problem: Vector registers have finite length. Solution: Break loops into pieces that fit in vector registers, “stripmining”. Example: for (i=0; i<N; i++) C[i] = A[i]+B[i]; [Diagram: A, B, C processed in full 64-element chunks plus one remainder chunk.]
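A minimal C sketch of the stripmined loop structure, assuming a maximum vector length (MVL) of 64 elements; each inner loop stands in for one LV/LV/ADDV/SV sequence executed with the vector length register set to vl:

    #define MVL 64   /* assumed maximum vector length */

    void vadd_stripmined(int n, const double *a, const double *b, double *c) {
        int low = 0;
        int vl = n % MVL;                         /* odd-sized remainder chunk first */
        for (int j = 0; j <= n / MVL; j++) {
            for (int i = low; i < low + vl; i++)  /* one "vector instruction" of length vl */
                c[i] = a[i] + b[i];
            low += vl;
            vl = MVL;                             /* all later chunks are full length */
        }
    }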

4/4/2016 CS152, Spring 2016 Vector Conditional Execution Problem: Want to vectorize loops with conditional code: for (i=0; i<N; i++) if (A[i]>0) then A[i] = B[i]; Solution: Add vector mask (or flag) registers –vector version of predicate registers, 1 bit per element …and maskable vector instructions –vector operation becomes bubble (“NOP”) at elements where mask bit is clear Code example:
    CVM              # Turn on all elements
    LV vA, rA        # Load entire A vector
    SGTVS.D vA, F0   # Set bits in mask register where A>0
    LV vA, rB        # Load B vector into A under mask
    SV vA, rA        # Store A back to memory under mask

4/4/2016 CS152, Spring 2016 Masked Vector Instructions [Diagram] Simple implementation: execute all N operations, and turn off result writeback (write enable) according to the mask bits M[0]…M[7]. Density-time implementation: scan the mask vector and only execute elements with non-zero mask bits.

4/4/2016 CS152, Spring 2016 Vector Reductions Problem: Loop-carried dependence on reduction variables:
    sum = 0;
    for (i=0; i<N; i++)
      sum += A[i];                  # Loop-carried dependence on sum
Solution: Re-associate operations if possible, use binary tree to perform reduction:
    # Rearrange as:
    sum[0:VL-1] = 0                 # Vector of VL partial sums
    for (i=0; i<N; i+=VL)           # Stripmine VL-sized chunks
      sum[0:VL-1] += A[i:i+VL-1];   # Vector sum
    # Now have VL partial sums in one vector register
    do {
      VL = VL/2;                    # Halve vector length
      sum[0:VL-1] += sum[VL:2*VL-1] # Halve no. of partials
    } while (VL>1)
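A minimal runnable C sketch of the same idea, assuming VL = 8 partial sums and a scalar cleanup loop for the remainder (note that re-association can slightly change floating-point rounding):

    #include <stdio.h>
    #define VL 8

    double vsum(int n, const double *a) {
        double partial[VL] = {0.0};               /* vector of VL partial sums */
        int i = 0;
        for (; i + VL <= n; i += VL)              /* stripmine VL-sized chunks */
            for (int j = 0; j < VL; j++)
                partial[j] += a[i + j];           /* one vector add */
        for (; i < n; i++)                        /* scalar cleanup for remainder */
            partial[0] += a[i];
        for (int vl = VL / 2; vl >= 1; vl /= 2)   /* binary-tree combine of partials */
            for (int j = 0; j < vl; j++)
                partial[j] += partial[j + vl];
        return partial[0];
    }

    int main(void) {
        double a[13] = {1,2,3,4,5,6,7,8,9,10,11,12,13};
        printf("%f\n", vsum(13, a));              /* prints 91.000000 */
        return 0;
    }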

4/4/2016 CS152, Spring 2016 Vector Scatter/Gather Want to vectorize loops with indirect accesses: for (i=0; i<N; i++) A[i] = B[i] + C[D[i]]; Indexed load instruction (gather):
    LV vD, rD          # Load indices in D vector
    LVI vC, rC, vD     # Load indirect from rC base
    LV vB, rB          # Load B vector
    ADDV.D vA, vB, vC  # Do add
    SV vA, rA          # Store result

4/4/2016 CS152, Spring 2016 Vector Scatter/Gather Histogram example: for (i=0; i<N; i++) A[B[i]]++; Is the following a correct translation?
    LV vB, rB        # Load indices in B vector
    LVI vA, rA, vB   # Gather initial A values
    ADDV vA, vA, 1   # Increment
    SVI vA, rA, vB   # Scatter incremented values
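As a hedged aside (the slide leaves the question open): the translation goes wrong whenever B contains repeated indices, because every element gathers the same old value before any incremented value is scattered back. A tiny C sketch of the failure:

    #include <stdio.h>

    /* With a repeated index, both scattered stores write (old value + 1),
       so one of the two increments is lost. */
    int main(void) {
        int A[4] = {0, 0, 0, 0};
        int B[2] = {3, 3};                              /* repeated index */
        int g[2], inc[2];

        for (int i = 0; i < 2; i++) g[i]   = A[B[i]];   /* LVI: gather old A values */
        for (int i = 0; i < 2; i++) inc[i] = g[i] + 1;  /* ADDV: increment */
        for (int i = 0; i < 2; i++) A[B[i]] = inc[i];   /* SVI: scatter */

        printf("A[3] = %d (the scalar loop gives 2)\n", A[3]);  /* prints 1 */
        return 0;
    }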

4/4/2016 CS152, Spring 2016 A Modern Vector Super: NEC SX-9 (2008)  65nm CMOS technology  Vector unit (3.2 GHz) –8 foreground VRegs + 64 background VRegs (256x64-bit elements/VReg) –64-bit functional units: 2 multiply, 2 add, 1 divide/sqrt, 1 logical, 1 mask unit –8 lanes (32+ FLOPS/cycle, 100+ GFLOPS peak per CPU) –1 load or store unit (8 x 8-byte accesses/cycle)  Scalar unit (1.6 GHz) –4-way superscalar with out-of-order and speculative execution –64KB I-cache and 64KB data cache  Memory system provides 256GB/s DRAM bandwidth per CPU  Up to 16 CPUs and up to 1TB DRAM form shared-memory node –total of 4TB/s bandwidth to shared DRAM memory  Up to 512 nodes connected via 128GB/s network links (message passing between nodes)

4/4/2016 CS152, Spring 2016 Multimedia Extensions (aka SIMD extensions)  Very short vectors added to existing ISAs for microprocessors  Use existing 64-bit registers split into 2x32b or 4x16b or 8x8b –Lincoln Labs TX-2 from 1957 had 36b datapath split into 2x18b or 4x9b –Newer designs have wider registers: 128b for PowerPC Altivec and Intel SSE2/3/4, 256b for Intel AVX  Single instruction operates on all elements within register (e.g., one instruction performs 4x16b adds inside a 64b register)
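A hedged illustration using Intel SSE2 intrinsics (128b registers, so one instruction performs eight 16b adds; the intrinsic names are standard SSE2, not taken from the slides):

    #include <emmintrin.h>   /* SSE2 */
    #include <stdio.h>

    int main(void) {
        /* Two vectors of eight 16-bit integers packed into 128-bit registers. */
        __m128i a = _mm_set_epi16(8, 7, 6, 5, 4, 3, 2, 1);
        __m128i b = _mm_set_epi16(80, 70, 60, 50, 40, 30, 20, 10);

        /* A single instruction adds all eight 16-bit lanes in parallel. */
        __m128i c = _mm_add_epi16(a, b);

        short out[8];
        _mm_storeu_si128((__m128i *)out, c);
        for (int i = 0; i < 8; i++)
            printf("%d ", out[i]);           /* prints 11 22 33 44 55 66 77 88 */
        printf("\n");
        return 0;
    }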

4/4/2016 CS152, Spring 2016 Multimedia Extensions versus Vectors  Limited instruction set: –no vector length control –no strided load/store or scatter/gather –unit-stride loads must be aligned to 64/128-bit boundary  Limited vector register length: –requires superscalar dispatch to keep multiply/add/load units busy –loop unrolling to hide latencies increases register pressure  Trend towards fuller vector support in microprocessors –Better support for misaligned memory accesses –Support of double-precision (64-bit floating-point) –New Intel AVX spec (announced April 2008), 256b vector registers (expandable up to 1024b) 12

4/4/2016 CS152, Spring 2016 Degree of Vectorization  Compilers are good at finding data-level parallelism [Chart: fraction of operations vectorized per benchmark on a MIPS processor with a vector coprocessor]

4/4/2016 CS152, Spring 2016 Average Vector Length  Maximum depends on whether benchmarks use 16-bit or 32-bit operations

4/4/2016 CS152, Spring 2016 Types of Parallelism  Instruction-Level Parallelism (ILP) –Execute independent instructions from one instruction stream in parallel (pipelining, superscalar, VLIW)  Thread-Level Parallelism (TLP) –Execute independent instruction streams in parallel (multithreading, multiple cores)  Data-Level Parallelism (DLP) –Execute multiple operations of the same type in parallel (vector/SIMD execution)  Which is easiest to program?  Which is most flexible form of parallelism? –i.e., can be used in more situations  Which is most efficient? –i.e., greatest tasks/second/area, lowest energy/task 15

4/4/2016 CS152, Spring 2016 Resurgence of DLP  Convergence of application demands and technology constraints drives architecture choice  New applications, such as graphics, machine vision, speech recognition, machine learning, etc. all require large numerical computations that are often trivially data parallel  SIMD-based architectures (vector-SIMD, subword-SIMD, SIMT/GPUs) are most efficient way to execute these algorithms 16

4/4/2016 CS152, Spring 2016 DLP important for conventional CPUs too  Prediction for x86 processors, from Hennessy & Patterson, 5th edition –Note: Educated guess, not Intel product plans!  TLP: 2+ cores / 2 years  DLP: 2x width / 4 years  DLP will account for more mainstream parallelism growth than TLP in the next decade –SIMD: single-instruction multiple-data (DLP) –MIMD: multiple-instruction multiple-data (TLP)

4/4/2016 CS152, Spring 2016 Graphics Processing Units (GPUs)  Original GPUs were dedicated fixed-function devices for generating 3D graphics (mid-late 1990s), including high-performance floating-point units –Provide workstation-like graphics for PCs –User could configure graphics pipeline, but not really program it  Over time, more programmability was added –E.g., new language Cg for writing small programs run on each vertex or each pixel, also Windows DirectX variants –Massively parallel (millions of vertices or pixels per frame) but very constrained programming model  Some users noticed they could do general-purpose computation by mapping input and output data to images, and computation to vertex and pixel shading computations –Incredibly difficult programming model, as the graphics pipeline model had to be used for general computation

4/4/2016 CS152, Spring 2016 General-Purpose GPUs (GP-GPUs)  In 2006, Nvidia introduced the GeForce 8800 GPU supporting a new programming language: CUDA –“Compute Unified Device Architecture” –Subsequently, the broader industry has pushed for OpenCL, a vendor-neutral version of the same ideas  Idea: Take advantage of GPU computational performance and memory bandwidth to accelerate some kernels for general-purpose computing  Attached processor model: Host CPU issues data-parallel kernels to GP-GPU for execution  This lecture uses a simplified version of the Nvidia CUDA-style model and only considers GPU execution for computational kernels, not graphics –Would probably need another course to describe graphics processing

4/4/2016 CS152, Spring 2016 Simplified CUDA Programming Model  Computation performed by a very large number of independent small scalar threads (CUDA threads or microthreads) grouped into thread blocks.
    // C version of DAXPY loop.
    void daxpy(int n, double a, double *x, double *y)
    {  for (int i=0; i<n; i++)
         y[i] = a*x[i] + y[i];  }

    // CUDA version.
    __host__   // Piece run on host processor.
    int nblocks = (n+255)/256;  // 256 CUDA threads/block
    daxpy<<<nblocks,256>>>(n, 2.0, x, y);
    __device__ // Piece run on GP-GPU.
    void daxpy(int n, double a, double *x, double *y)
    {  int i = blockIdx.x*blockDim.x + threadIdx.x;
       if (i<n) y[i] = a*x[i] + y[i];  }
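For concreteness, a minimal self-contained CUDA sketch of the same DAXPY. It fills in standard CUDA boilerplate that the simplified model above omits (the __global__ qualifier, unified-memory allocation, and synchronization); it is illustrative, not part of the original slides:

    #include <cstdio>
    #include <cuda_runtime.h>

    // GPU kernel: each CUDA thread handles one element.
    __global__ void daxpy(int n, double a, const double *x, double *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                                   // turn off unused threads in last block
            y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        double *x, *y;
        cudaMallocManaged(&x, n * sizeof(double));   // unified memory, for brevity
        cudaMallocManaged(&y, n * sizeof(double));
        for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

        int nblocks = (n + 255) / 256;               // 256 CUDA threads per block
        daxpy<<<nblocks, 256>>>(n, 2.0, x, y);       // host launches the grid
        cudaDeviceSynchronize();                     // wait for the GPU to finish

        printf("y[0] = %f\n", y[0]);                 // expect 4.000000
        cudaFree(x); cudaFree(y);
        return 0;
    }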

4/4/2016 CS152, Spring 2016 Programmer’s View of Execution [Diagram: (n+255)/256 thread blocks, blockIdx 0, 1, …, each containing threadId 0 through threadId 255] blockDim = 256 (programmer can choose). Create enough blocks to cover the input vector (Nvidia calls this ensemble of blocks a Grid; it can be 2-dimensional). The conditional (i<n) turns off unused threads in the last block.

4/4/2016 CS152, Spring 2016 GPU Hardware Execution Model  GPU is built from multiple parallel cores; each core contains a multithreaded SIMD processor with multiple lanes but with no scalar processor  CPU sends whole “grid” over to GPU, which distributes thread blocks among cores (each thread block executes on one core) –Programmer unaware of number of cores [Diagram: CPU with its own CPU memory attached to a GPU with GPU memory; the GPU contains Cores 0–15, each with Lanes 0–15.]

4/4/2016 CS152, Spring 2016 “Single Instruction, Multiple Thread”  GPUs use a SIMT model (SIMD with multithreading)  Individual scalar instruction streams for each CUDA thread are grouped together for SIMD execution (each thread executes the same instruction each cycle) on hardware (Nvidia groups 32 CUDA threads into a warp). Threads are independent from each other [Diagram: a scalar instruction stream (ld x; mul a; ld y; add; st y) executed in SIMD fashion across µT0–µT7 of a warp.]

4/4/2016 CS152, Spring 2016 Implications of SIMT Model  All “vector” loads and stores are scatter-gather, as individual µthreads perform scalar loads and stores –GPU adds hardware to dynamically coalesce individual µthread loads and stores to mimic vector loads and stores  Every µthread has to perform stripmining calculations redundantly (“am I active?”) as there is no scalar processor equivalent 24

4/4/2016 CS152, Spring 2016 Conditionals in SIMT model  Simple if-then-else are compiled into predicated execution, equivalent to vector masking  More complex control flow compiled into branches  How to execute a vector of branches? [Diagram: a scalar instruction stream (tid = threadid; if (tid >= n) goto skip; call func1; add; st y; skip:) executed in SIMD fashion across µT0–µT7 of a warp.]

4/4/2016 CS152, Spring 2016 Branch divergence  Hardware tracks which µthreads take or don’t take branch  If all go the same way, then keep going in SIMD fashion  If not, create mask vector indicating taken/not-taken  Keep executing not-taken path under mask, push taken branch PC+mask onto a hardware stack and execute later  When can execution of µthreads in warp reconverge? 26
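A hedged CUDA sketch of divergence within a warp (hypothetical kernel; with this condition the warp executes both paths, each under a mask, so roughly half the instruction slots do useful work):

    // Hypothetical kernel illustrating branch divergence inside a warp.
    __global__ void divergent(float *out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x % 2 == 0) {
            out[i] = 2.0f * i;    // even lanes run this while odd lanes are masked off...
        } else {
            out[i] = 0.5f * i;    // ...then odd lanes run this while even lanes are masked off
        }
        // The two paths reconverge here; all 32 lanes execute together again.
    }

A warp-uniform condition, e.g. one testing blockIdx.x instead of threadIdx.x, would not diverge, because every lane in a warp sees the same outcome.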

4/4/2016 CS152, Spring 2016 Warps are multithreaded on core  One warp of 32 µthreads is a single thread in the hardware  Multiple warp threads are interleaved in execution on a single core to hide latencies (memory and functional unit)  A single thread block can contain multiple warps (up to 512 µT max in CUDA), all mapped to single core  Can have multiple blocks executing on one core 27 [Nvidia, 2010]

4/4/2016 CS152, Spring 2016 GPU Memory Hierarchy [Nvidia, 2010]

4/4/2016 CS152, Spring 2016 SIMT  Illusion of many independent threads –Threads inside a warp execute in a SIMD fashion  But for efficiency, programmer must try to keep µthreads aligned in a SIMD fashion –Try to do unit-stride loads and stores so memory coalescing kicks in –Avoid branch divergence so most instruction slots execute useful work and are not masked off
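A hedged CUDA sketch of the memory-coalescing point (hypothetical kernels; exact coalescing rules vary by GPU generation, but the contrast in access pattern is the key idea):

    // Coalesced: consecutive lanes of a warp touch consecutive addresses,
    // so the hardware can merge them into a few wide memory transactions.
    __global__ void copy_coalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: consecutive lanes touch addresses 32 floats apart, so each
    // lane's load tends to land in a separate memory transaction.
    __global__ void copy_strided(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = (i * 32) % n;              // hypothetical scrambled index
        if (i < n) out[i] = in[j];
    }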

4/4/2016 CS152, Spring 2016 Nvidia Fermi GF100 GPU 30 [Nvidia, 2010]

4/4/2016 CS152, Spring 2016 Fermi “Streaming Multiprocessor” Core 31

4/4/2016 CS152, Spring 2016 Fermi Dual-Issue Warp Scheduler 32

4/4/2016 CS152, Spring 2016 Apple A5X Processor for iPad v3 (2012) 12.90mm x 12.79mm 45nm technology 33 [Source: Chipworks, 2012]

4/4/2016 CS152, Spring 2016 Historical Retrospective, Cray-2 (1985)  243MHz ECL logic  2GB DRAM main memory (128 banks of 16MB each) –Bank busy time 57 clocks!  Local memory of 128KB/core  1 foreground + 4 background vector processors [Diagram: one foreground CPU plus four background vector cores, each core with a lane and its own local memory, all sharing main memory.]

4/4/2016 CS152, Spring 2016 GPU Versus CPU 35

4/4/2016 CS152, Spring 2016 Why?  Need to understand the difference –Latency intolerance versus latency tolerance –Task parallelism versus data parallelism –Multithreaded cores versus SIMT cores –10s of threads versus thousands of threads  CPUs: low latency, low throughput  GPUs: high latency, high throughput –GPUs are designed for tasks that tolerate latency 36

4/4/2016 CS152, Spring 2016 What About Caches?  GPUs can have more ALUs in the same area and therefore run more threads of computation 37

4/4/2016 CS152, Spring 2016 GPU Future  High-end desktops have separate GPU chip, but trend towards integrating GPU on same die as CPU (already in laptops, tablets and smartphones) –Advantage is shared memory with CPU, no need to transfer data –Disadvantage is reduced memory bandwidth compared to a dedicated, smaller-capacity, specialized memory system (graphics DRAM, GDDR, versus regular DRAM, DDR3)  Will GP-GPU survive? Or will improvements in CPU DLP make GP-GPU redundant? –On the same die, CPU and GPU should have the same memory bandwidth –GPU might have more FLOPS, as needed for graphics anyway

4/4/2016 CS152, Spring 2016 Acknowledgements  These slides contain material developed and copyright by: –Krste Asanovic (UCB) –Mohamed Zahran (NYU)  “An introduction to modern GPU architecture”. Ashu Rege. NVIDIA. 39