Lecture 20 Computing with Graphical Processing Units

What makes a processor run faster?
- Registers and cache
- Vectorization (SSE)
- Instruction-level parallelism
- Hiding data transfer delays
- Adding more cores

Today’s Lecture
Computing with GPUs

Technology trends
No longer possible to use a growing population of transistors to boost single-processor performance
- We cannot dissipate the power, which grows linearly with clock frequency f
- So we can no longer keep increasing the clock speed
Instead, we replicate the cores
- Reduces power consumption and packs more performance onto the chip
In addition to multicore processors we have “many-core” processors
- Not a precise definition, and there are different kinds of many-cores

Many cores
We’ll look at one member of the family, Graphical Processing Units, made by one manufacturer: NVIDIA
Simplified core, replicated on a grand scale: 1000s of cores
Removes certain power-hungry features of modern processors
- Branches are more expensive
- Memory accesses must be aligned
- Explicit data motion involving on-chip memory
Increases the performance:power ratio

Heterogeneous processing with Graphical Processing Units
Specialized many-core processor (the device) controlled by a conventional processor (the host)
Explicit data motion
- Between host and device
- Inside the device
(Figure: a host with memory and cores C0-C2 connected to a device with processors P0-P2)

What’s special about GPUs?
- Process long vectors on 1000s of specialized cores
- Execute 1000s of threads to hide data motion
- Some regularity involving memory accesses and control flow

Stampede’s NVIDIA Tesla Kepler K20m (GK110)
Hierarchically organized clusters of streaming multiprocessors
- 13 streaming multiprocessors (SMX) @ 705 MHz (down from 1.296 GHz on the GeForce 280)
- Peak performance: 1.17 Tflops/s double precision, fused multiply/add
- SIMT parallelism
- 5 GB “device” memory (frame buffer) @ 208 GB/s
- 7.1B transistors
See international.download.nvidia.com/pdf/kepler/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf and www.techpowerup.com/gpudb/2029/tesla-k20m.html
[Source: Nvidia]

Overview of Kepler GK110
(Figure: GK110 chip block diagram)

SMX streaming multiprocessor
Stampede’s K20s (GK110 GPU) have 13 SMXs (2496 cores)
Each SMX:
- 192 SP cores, 64 DP cores, 32 SFUs, 32 load/store units
- Each scalar core: fused multiply-adder, with no truncation of the intermediate result
- 64 KB on-chip memory, configurable as scratchpad memory + L1 cache
- 64K x 32-bit registers (256 KB; 512 KB on the GK210), up to 255 per thread
- 1 FMA/cycle = 2 flops/cycle per DP core x 64 DP cores/SMX x 13 SMX = 1664 flops/cycle @ 0.7006 GHz = 1.165 TFLOPS per processor (2.33 for the K80)
[Source: Nvidia]

Kepler’s Memory Hierarchy
DRAM takes hundreds of cycles to access
Can partition the on-chip memory between shared memory and L1 cache: {¾ + ¼} or {½ + ½}
L2 cache (1.5 MB)
[Source: B. Wilkinson]
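
To make the split concrete, here is a minimal sketch (not from the lecture) of how a program can request a preferred shared-memory/L1 partition through the CUDA runtime; cudaDeviceSetCacheConfig and cudaFuncSetCacheConfig are standard runtime calls, but the kernel name myKernel is a hypothetical placeholder.

#include <cuda_runtime.h>

__global__ void myKernel(float *a, int N) {          // hypothetical kernel
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] *= 2.0f;
}

int main() {
    // Prefer the larger shared-memory partition (the {3/4 + 1/4} split of the 64 KB)
    cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);
    // ...or state the preference per kernel instead (here, prefer L1):
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
    // allocate, copy, and launch as in the later slides
    return 0;
}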

Which of these memories are on chip and hence fast to access?
A. Host memory
B. Registers
C. Shared memory
D. A & B
E. B & C

Which of these memories are on chip and hence fast to access?
Answer: E. B & C (registers and shared memory are on chip; host memory is not)

CUDA
Programming environment with extensions to C
Under control of the host, invoke sequences of multithreaded kernels on the device (GPU)
Many lightweight virtualized threads
KernelA<<<4,8>>>()
KernelB<<<4,8>>>()
KernelC<<<4,8>>>()

Thread execution model
Kernel call spawns virtualized, hierarchically organized threads: Grid ⊃ Block ⊃ Thread
Hardware dispatches blocks to cores, 0 overhead
Compiler re-arranges loads to hide latencies
KernelA<<<2,3>,<3,5>>>()
(Figure: the resulting thread blocks laid out over global memory)

Thread block execution
Thread blocks
- Unit of workload assignment
- Each thread has its own set of registers
- All have access to a fast on-chip shared memory
- Synchronization only among all threads in a block
- Threads in different blocks communicate via slow global memory
- Global synchronization also via kernel invocation
SIMT parallelism: all threads in a warp execute the same instruction
- All branches are followed, with instructions disabled on the threads that did not take them
- Divergence causes serialization
(Figures: thread blocks t0, t1, ..., tm dispatched onto an SMX with its shared memory; a grid of 6 blocks, each with 15 threads, for KernelA<<<2,3>,<3,5>>>())
[Source: David Kirk/NVIDIA & Wen-mei Hwu/UIUC]
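
As a side note, here is a minimal sketch (not from the slides) of the kind of branch that diverges within a warp; the kernel name and the operations are hypothetical, but the effect is the one described above: both sides of the branch are executed by the warp, with the inactive threads masked off.

__global__ void divergentKernel(float *a, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;
    // Adjacent threads sit in the same 32-thread warp but take different
    // paths here, so the two paths are serialized.
    if (idx % 2 == 0)
        a[idx] = a[idx] * 2.0f;     // even threads
    else
        a[idx] = a[idx] + 1.0f;     // odd threads
    // Branching on blockIdx.x instead would not diverge within a warp,
    // because every thread in a block agrees on blockIdx.
}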

Which kernel call spawns 1000 threads?
A. KernelA<<<10,100>,<10,10>>>()
B. KernelA<<<100,10>,<10,10>>>()
C. KernelA<<<2,5>,<10,10>>>()
D. KernelA<<<10,10>,<10,100>>>()

Execution Configurations
Grid ⊃ Block ⊃ Thread, expressed with configuration variables
Programmer sets the thread block size and maps threads to memory locations
Each thread is uniquely specified by its block & thread ID
(Figure: a grid of 6 thread blocks, each a 5x3 arrangement of 15 threads)

__global__ void Kernel(...);
dim3 DimGrid(2,3);    // 6 thread blocks
dim3 DimBlock(3,5);   // 15 threads/block
Kernel<<< DimGrid, DimBlock >>>(...);

[Source: David Kirk/NVIDIA & Wen-mei Hwu/UIUC]
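
For a 2D configuration like this one, each thread can turn its block and thread IDs into a unique (row, col) location; the following sketch is illustrative rather than taken from the lecture, and the kernel name touch2D and its launch parameters are hypothetical.

__global__ void touch2D(float *a, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // x spans columns
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // y spans rows
    if (row < height && col < width)
        a[row * width + col] += 1.0f;                  // row-major layout
}

// One possible launch: 16x16 threads per block, enough blocks to cover the array
// dim3 block(16, 16);
// dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
// touch2D<<<grid, block>>>(a_d, width, height);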

Coding example – Increment Array

Serial code:
void incrementArrayOnHost(float *a, int N) {
    int i;
    for (i = 0; i < N; i++)
        a[i] = a[i] + 1.f;
}

CUDA:
// Programmer determines the mapping of virtual thread IDs
// to global memory locations
#include <cuda.h>
__global__ void incrementOnDevice(float *a, int N) {
    // Each thread uniquely specified by block & thread ID
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] + 1.f;
}

incrementOnDevice <<< nBlocks, blockSize >>> (a_d, N);

[Source: Rob Farber, Dr Dobb’s Journal]

Managing memory
Data must be allocated on the device
Data must be moved between host and the device explicitly

float *a_h, *b_h;   // pointers to host memory
float *a_d;         // pointer to device memory

cudaMalloc((void **) &a_d, size);
for (i = 0; i < N; i++)
    a_h[i] = (float)i;   // init host data
cudaMemcpy(a_d, a_h, sizeof(float)*N, cudaMemcpyHostToDevice);

Computing and returning the result

int bSize = 4;
int nBlocks = N/bSize + (N%bSize == 0 ? 0 : 1);
incrementOnDevice <<< nBlocks, bSize >>> (a_d, N);

// Retrieve result from device and store in b_h
cudaMemcpy(b_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

// check results against the serial version from the previous slide
incrementArrayOnHost(a_h, N);
for (i = 0; i < N; i++)
    assert(a_h[i] == b_h[i]);

// cleanup
free(a_h); free(b_h);
cudaFree(a_d);
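
Assembled into one place, a self-contained version of the increment example might look like the sketch below. This is a reconstruction, not the lecture's verbatim code: the host-side mallocs, the choice of N, and the rounded-up block count are filled in here.

#include <cuda.h>
#include <assert.h>
#include <stdlib.h>

__global__ void incrementOnDevice(float *a, int N) {
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] + 1.f;
}

void incrementArrayOnHost(float *a, int N) {
    for (int i = 0; i < N; i++) a[i] = a[i] + 1.f;
}

int main() {
    int N = 1 << 20;                              // assumed problem size
    size_t size = N * sizeof(float);

    float *a_h = (float *) malloc(size);          // host input / reference
    float *b_h = (float *) malloc(size);          // host copy of device result
    float *a_d;                                   // device array
    cudaMalloc((void **) &a_d, size);

    for (int i = 0; i < N; i++) a_h[i] = (float) i;   // init host data
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

    int bSize = 128;
    int nBlocks = (N + bSize - 1) / bSize;        // round up
    incrementOnDevice<<<nBlocks, bSize>>>(a_d, N);

    cudaMemcpy(b_h, a_d, size, cudaMemcpyDeviceToHost);

    incrementArrayOnHost(a_h, N);                 // reference result on the host
    for (int i = 0; i < N; i++) assert(a_h[i] == b_h[i]);

    free(a_h); free(b_h);
    cudaFree(a_d);
    return 0;
}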

Experiments - increment benchmark
Total time: timing taken from the host, includes copying data to the device
Device only: time taken on the device only
The loop repeats the computation inside the kernel: 1 kernel launch and 1 set of data transfers into and out of the device
N = 8388480 (8M ints), block size = 128, times in milliseconds unless noted

Repetitions                                10      100     1000    10^4
Device only                                1.88    14.7    144     1.44 s
Total (incl. kernel launch + data xfer)    19.4    32.3    162     1.46 s
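
The slide does not say how these times were measured, but one standard way to take the device-only measurement is with CUDA events, as in the sketch below; the kernel incrementRepsOnDevice (a variant that repeats the work inside the kernel, as the benchmark does) is hypothetical, while the cudaEvent* calls are ordinary runtime API. Timing the same region from the host, around the cudaMemcpy calls as well, gives the total-time row.

// Hypothetical kernel variant that repeats the computation internally
__global__ void incrementRepsOnDevice(float *a, int N, int reps) {
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < N)
        for (int r = 0; r < reps; r++)
            a[idx] = a[idx] + 1.f;
}

// Device-only timing with CUDA events
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
incrementRepsOnDevice<<<nBlocks, bSize>>>(a_d, N, reps);
cudaEventRecord(stop);
cudaEventSynchronize(stop);                        // wait for the kernel to finish
float ms = 0.f;
cudaEventElapsedTime(&ms, start, stop);            // "device only" time, in ms
cudaEventDestroy(start);
cudaEventDestroy(stop);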

What is the cost of moving the data and launching the kernel?
A. About 1.75 ms ((19.4-1.88)/10)
B. About 0.176 ms ((32.3-14.7)/100)
C. About 0.018 ms ((162-144)/1000)
D. About 17.5 ms (19.4-1.88)
N = 8M, block size = 128, times in milliseconds unless noted

Repetitions                                10      100     1000    10^4
Device only                                1.88    14.7    144     1.44 s
Total (incl. kernel launch + data xfer)    19.4    32.3    162     1.46 s

What is the cost of moving the data and launching the kernel?
Answer: D. About 17.5 ms (19.4-1.88)
The transfers and the launch happen once per run, so their cost shows up as the roughly constant gap between the total time and the device-only time at every repetition count, not as that gap divided by the number of repetitions.