Programming Massively Parallel Processors Using CUDA & C++AMP Lecture 1 - Introduction Wen-mei Hwu, Izzat El Hajj CEA-EDF-Inria Summer School 2013

Course Goals
Learn how to program massively parallel processors and achieve:
– high performance
– functionality and maintainability
– scalability across future generations
Technical subjects:
– principles and patterns of parallel algorithms
– processor architecture features and constraints
– programming API, tools and techniques
© David Kirk/NVIDIA and Wen-mei W. Hwu

Text/Notes
D. Kirk and W. Hwu, “Programming Massively Parallel Processors – A Hands-on Approach,” 2nd Edition, Morgan Kaufmann Publishers, 2012
NVIDIA, CUDA C Programming Guide, version 5.0, NVIDIA, 2013 (reference)
© David Kirk/NVIDIA and Wen-mei W. Hwu

Course Sessions Overview
Monday: Lecture 1, Lecture 2, Lab Session 1
Tuesday: Lecture 3, Lecture 4
Wednesday: Lecture 5, Lecture 6, Lab Session 2
Thursday: Lab Session 3
Friday: Lab Session 4, Lab Session 5
© David Kirk/NVIDIA and Wen-mei W. Hwu

Blue Waters Hardware
Cray System & Storage cabinets: >300
Compute nodes: >25,000
Usable Storage Bandwidth: >1 TB/s
System Memory: >1.5 Petabytes
Memory per core module: 4 GB
Gemini Interconnect Topology: 3D Torus
Usable Storage: >25 Petabytes
Peak performance: >11.5 Petaflops
Number of AMD Interlagos processors: >49,000
Number of AMD x86 core modules: >380,000
Number of NVIDIA Kepler GPUs: >4,000
© David Kirk/NVIDIA and Wen-mei W. Hwu

Cray XK7 Compute Node
[Figure: node diagram with Gemini interconnect links in X, Y, Z; HT3 link between the host processor and Gemini; PCIe Gen2 link between the host and the GPU]
XK7 Compute Node Characteristics:
Host processor: AMD Series 6200 (Interlagos)
GPU: NVIDIA Kepler (Keplers in final installation)
Host memory: 32GB 1600 MT/s DDR3
NVIDIA Tesla X2090 memory: 6GB GDDR5 capacity
Gemini high speed interconnect
© David Kirk/NVIDIA and Wen-mei W. Hwu

CPU and GPU have very different design philosophies
[Figure: GPU chip built from many throughput-oriented cores (compute units with cache/local memory, registers, SIMD units, heavy threading) vs. CPU chip built from a few latency-oriented cores (large local cache, registers, SIMD unit, sophisticated control)]
© David Kirk/NVIDIA and Wen-mei W. Hwu

CPUs: Latency Oriented Design
Large caches
– Convert long latency memory accesses to short latency cache accesses
Sophisticated control
– Branch prediction for reduced branch latency
– Data forwarding for reduced data latency
Powerful ALUs
– Reduced operation latency
[Figure: CPU chip with a large cache, control logic, a few powerful ALUs, and DRAM]
© David Kirk/NVIDIA and Wen-mei W. Hwu

GPUs: Throughput Oriented Design
Small caches
– To boost memory throughput
Simple control
– No branch prediction
– No data forwarding
Energy efficient ALUs
– Many, long latency but heavily pipelined for high throughput
Require massive number of threads to tolerate latencies
[Figure: GPU chip with many small cores and DRAM]
© David Kirk/NVIDIA and Wen-mei W. Hwu

Winning Applications Use Both CPU and GPU
CPUs for sequential parts where latency matters
– CPUs can be 10+X faster than GPUs for sequential code
GPUs for parallel parts where throughput wins
– GPUs can be 10+X faster than CPUs for parallel code
© David Kirk/NVIDIA and Wen-mei W. Hwu

Heterogeneous parallel computing is catching on
280 submissions to GPU Computing Gems
– 110 articles included in two volumes
Application areas: Financial Analysis, Scientific Simulation, Engineering Simulation, Data Intensive Analytics, Medical Imaging, Digital Audio Processing, Computer Vision, Digital Video Processing, Biomedical Informatics, Electronic Design Automation, Statistical Modeling, Ray Tracing, Rendering, Interactive Physics, Numerical Methods
© David Kirk/NVIDIA and Wen-mei W. Hwu

What is at stake?
Scalable and portable software lasts through many hardware generations.
Scalable algorithms and libraries can be the best legacy we can leave behind from this era.
© David Kirk/NVIDIA and Wen-mei W. Hwu

Introduction to CUDA C

Objective
To learn about data parallelism and the basic features of CUDA C that enable exploitation of data parallelism
– Hierarchical thread organization
– Main interfaces for launching parallel execution
– Thread index(es) to data index mapping
© David Kirk/NVIDIA and Wen-mei W. Hwu

Parallel Programming Work Flow
Identify compute intensive parts of an application
Adopt scalable algorithms
Optimize data arrangements to maximize locality
Performance tuning
Pay attention to code portability and maintainability
© David Kirk/NVIDIA and Wen-mei W. Hwu

Massive Parallelism - Regularity © David Kirk/NVIDIA and Wen-mei W. Hwu,

CUDA/OpenCL – Execution Model
Integrated host+device app C program
– Serial or modestly parallel parts in host C code
– Highly parallel parts in device SPMD kernel C code
Execution alternates between the two:
Serial Code (host) … KernelA<<<nBlk, nTid>>>(args); … Serial Code (host) … KernelB<<<nBlk, nTid>>>(args); …
© David Kirk/NVIDIA and Wen-mei W. Hwu
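A minimal, self-contained sketch of this alternation (the kernel names, the vector size of 1024, and the 256-thread blocks are made up for this example, not taken from the slides):

#include <cuda.h>
#include <stdio.h>

// Two trivial kernels standing in for the "highly parallel parts" (illustrative only).
__global__ void KernelA(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f;
}

__global__ void KernelB(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] + 1.0f;
}

int main() {
    const int n = 1024;
    float h[1024];
    for (int i = 0; i < n; i++) h[i] = (float)i;          // serial host code
    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    KernelA<<<(n + 255) / 256, 256>>>(d, n);              // parallel kernel on the device
    KernelB<<<(n + 255) / 256, 256>>>(d, n);              // another parallel kernel
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    printf("h[10] = %f\n", h[10]);                        // expect 10*2 + 1 = 21
    return 0;
}

The host part runs serially; each <<<...>>> launch hands a data-parallel phase to the device.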

The Von-Neumann Model
[Figure: Memory and I/O connected to a Processor, which contains a Processing Unit (ALU, Register File) and a Control Unit (PC, IR)]
© David Kirk/NVIDIA and Wen-mei W. Hwu

Arrays of Parallel Threads
A CUDA kernel is executed by a grid (array) of threads
– Each thread is a virtualized Von-Neumann processor
– All threads in a grid run the same kernel code (SPMD)
– Each thread has an index that it uses to compute memory addresses and make control decisions
Conceptually, every thread runs:
tid = compute thread index
work(tid)
© David Kirk/NVIDIA and Wen-mei W. Hwu
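A minimal kernel sketch of this SPMD idea, assuming a hypothetical per-element work() function:

// Every thread in the grid runs this same kernel body (SPMD);
// only the computed index differs from thread to thread.
__device__ float work(float x) { return 2.0f * x + 1.0f; }   // hypothetical per-element work

__global__ void gridKernel(float *data, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;          // tid = compute thread index
    if (tid < n) data[tid] = work(data[tid]);                 // work(tid)
}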

Thread Blocks: Scalable Cooperation
Divide the thread array into multiple blocks
– Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
– Threads in different blocks cannot cooperate
© David Kirk/NVIDIA and Wen-mei W. Hwu
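A hedged sketch of within-block cooperation through shared memory and a barrier (the kernel, its name, and the fixed 256-thread block size are assumptions for illustration):

#define BLOCK_SIZE 256   // assumed launch block size

// Threads of one block stage data in shared memory, meet at a barrier,
// then read values written by *other* threads of the same block.
__global__ void reverseWithinBlock(float *data, int n) {
    __shared__ float buf[BLOCK_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[threadIdx.x] = data[i];
    __syncthreads();                              // barrier: wait for the whole block
    int j = blockDim.x - 1 - threadIdx.x;         // partner position within this block
    int src = blockIdx.x * blockDim.x + j;
    if (i < n && src < n) data[i] = buf[j];       // reverse this block's segment
}

Because __syncthreads() only synchronizes the threads of one block, the reversal here is limited to each block's own segment; cooperation across blocks would need a different mechanism.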

Computing the Thread Index
gridDim.x – # of blocks in grid
blockIdx.x – position of block in grid
blockDim.x – # of threads in block
threadIdx.x – position of thread in block
tid = blockIdx.x*blockDim.x + threadIdx.x
© David Kirk/NVIDIA and Wen-mei W. Hwu

blockIdx and threadIdx
Each thread uses indices to decide what data to work on
– blockIdx: 1D, 2D, or 3D (CUDA 4.0)
– threadIdx: 1D, 2D, or 3D
Simplifies memory addressing when processing multidimensional data
– Image processing
– Solving PDEs on volumes
– …
© David Kirk/NVIDIA and Wen-mei W. Hwu
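For instance, a hypothetical 2D sketch for image processing (the kernel name, the brightening operation, and the 16x16 block shape are illustrative only):

// Each thread uses 2D block and thread indices to locate one pixel.
__global__ void brightenImage(unsigned char *img, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // x position in the image
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // y position in the image
    if (col < width && row < height) {
        int idx = row * width + col;                   // 2D position -> 1D address
        int v = img[idx] + 10;
        img[idx] = (unsigned char)(v > 255 ? 255 : v); // saturating add
    }
}
// Possible launch: dim3 block(16, 16); dim3 grid((width+15)/16, (height+15)/16);
// brightenImage<<<grid, block>>>(img_d, width, height);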

Vector Addition – Conceptual View
[Figure: vector A (A[0] … A[N-1]), vector B (B[0] … B[N-1]), and vector C (C[0] … C[N-1]); each C[i] is produced from A[i] + B[i]]
© David Kirk/NVIDIA and Wen-mei W. Hwu

Vector Addition – Traditional C Code
// Compute vector sum C = A+B
void vecAdd(float* A, float* B, float* C, int n)
{
    for (int i = 0; i < n; i++)
        C[i] = A[i] + B[i];
}

int main()
{
    // Memory allocation for A_h, B_h, and C_h
    // I/O to read A_h and B_h, N elements
    …
    vecAdd(A_h, B_h, C_h, N);
}
© David Kirk/NVIDIA and Wen-mei W. Hwu

Heterogeneous Computing – vecAdd Host Code (outline)
#include <cuda.h>
void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;

    // 1. Allocate device memory for A, B, and C
    //    Copy A and B to device memory

    // 2. Kernel launch code – have the device
    //    perform the actual vector addition

    // 3. Copy C from the device memory
    //    Free device vectors
}
[Figure: Parts 1 and 3 move data between CPU host memory and GPU device memory; Part 2 runs on the GPU]
© David Kirk/NVIDIA and Wen-mei W. Hwu

Control Flow
[Figure: timeline of host (CPU) and device (GPU) activity]
cudaMalloc(...)
cudaMemcpy(...)
kernel<<<...>>>(...)
cudaFree(...)
© David Kirk/NVIDIA and Wen-mei W. Hwu

Partial Overview of CUDA Memories
Device code can:
– R/W per-thread registers
– R/W per-grid global memory
Host code can:
– Transfer data to/from per-grid global memory
[Figure: device grid with global memory; each block contains threads, each with its own registers; the host connects to global memory]
We will cover more later.
© David Kirk/NVIDIA and Wen-mei W. Hwu

CUDA Device Memory Management API Functions
cudaMalloc()
– Allocates an object in the device global memory
– Two parameters: address of a pointer to the allocated object, and size of the allocated object in bytes
cudaFree()
– Frees an object from device global memory
– One parameter: pointer to the freed object
© David Kirk/NVIDIA and Wen-mei W. Hwu
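A short usage sketch with the error checking that the slide leaves out (the vector length of 1024 and the helper function are made up for illustration):

#include <cuda.h>
#include <stdio.h>

// Allocate, use, and free one device vector; the length of 1024 floats is arbitrary.
int allocExample(void) {
    int n = 1024;
    float *A_d = NULL;
    cudaError_t err = cudaMalloc((void **)&A_d, n * sizeof(float));  // pointer address + size in bytes
    if (err != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    /* ... use A_d in kernel launches and cudaMemcpy calls ... */
    cudaFree(A_d);   // release the object in device global memory
    return 0;
}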

Host-Device Data Transfer API Functions
cudaMemcpy()
– Memory data transfer
– Requires four parameters: pointer to destination, pointer to source, number of bytes copied, and type/direction of transfer
– Transfer to device is asynchronous
© David Kirk/NVIDIA and Wen-mei W. Hwu

void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;

    // 1. Transfer A and B to device memory
    cudaMalloc((void **) &A_d, size);
    cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &B_d, size);
    cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);

    // Allocate device memory for C
    cudaMalloc((void **) &C_d, size);

    // 2. Kernel invocation code – to be shown later

    // 3. Transfer C from device to host
    cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);

    // Free device memory for A, B, C
    cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
}
© David Kirk/NVIDIA and Wen-mei W. Hwu

Example: Vector Addition Kernel
[Figures: two input vectors and one output vector, illustrating: assigning 1 thread per element; quantization of the global thread count due to block organization; boundary checks; data padding]
© David Kirk/NVIDIA and Wen-mei W. Hwu

Example: Vector Addition Kernel – Device Code
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAddKernel(float* A_d, float* B_d, float* C_d, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

int vectAdd(float* A, float* B, float* C, int n)
{
    // A_d, B_d, C_d allocations and copies omitted
    // Run ceil(n/256) blocks of 256 threads each
    vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);
}
© David Kirk/NVIDIA and Wen-mei W. Hwu

Example: Vector Addition Kernel – Host Code
(Same listing as above; the host-side vectAdd function sets up the launch of ceil(n/256) blocks of 256 threads each.)
© David Kirk/NVIDIA and Wen-mei W. Hwu

More on Kernel Launch – Host Code
int vecAdd(float* A, float* B, float* C, int n)
{
    // A_d, B_d, C_d allocations and copies omitted
    // Run ceil(n/256) blocks of 256 threads each
    dim3 DimGrid(n/256, 1, 1);
    if (n%256) DimGrid.x++;
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
}
Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking.
© David Kirk/NVIDIA and Wen-mei W. Hwu
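A hedged sketch of making the host block until the kernel has finished (continuing the vecAdd body above; the error check is an addition for illustration, not part of the slide):

    // The launch returns immediately; the host thread keeps running.
    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);

    // Block the host until all previously launched device work has finished.
    cudaDeviceSynchronize();

    // Optional: pick up any launch error (an assumption of this sketch, not on the slide).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) printf("kernel launch failed: %s\n", cudaGetErrorString(err));

    // A blocking cudaMemcpy of the results would also wait for the kernel to complete.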

Kernel Execution in a Nutshell
__global__ void vecAddKernel(float *A_d, float *B_d, float *C_d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

__host__ void vecAdd()
{
    dim3 DimGrid(ceil(n/256.0), 1, 1);
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
}
[Figure: the kernel's blocks Blk 0 … Blk N-1 are scheduled onto the GPU's multiprocessors M0 … Mk, which share device RAM]
© David Kirk/NVIDIA and Wen-mei W. Hwu

More on CUDA Function Declarations
__device__ float DeviceFunc() – executed on the device, only callable from the device
__global__ void KernelFunc() – executed on the device, only callable from the host
__host__ float HostFunc() – executed on the host, only callable from the host
__global__ defines a kernel function
Each “__” consists of two underscore characters
A kernel function must return void
__device__ and __host__ can be used together
© David Kirk/NVIDIA and Wen-mei W. Hwu
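A small sketch of that last point, a function compiled for both sides (clampf and clampKernel are made-up names for illustration):

// Compiled twice: once for the host, once for the device.
__host__ __device__ float clampf(float x, float lo, float hi) {
    return x < lo ? lo : (x > hi ? hi : x);
}

__global__ void clampKernel(float *data, int n) {     // kernel: returns void, launched from the host
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = clampf(data[i], 0.0f, 1.0f);  // device-side call
}
// Ordinary host code can call clampf(x, 0.0f, 1.0f) directly as well.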

Compiling A CUDA Program
[Figure: integrated C programs with CUDA extensions go into the NVCC compiler, which separates host code (handed to the host C compiler/linker) from device code (PTX, handed to the device just-in-time compiler); the result runs on a heterogeneous computing platform with CPUs and GPUs]
© David Kirk/NVIDIA and Wen-mei W. Hwu
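For a single-file program, this whole flow is typically driven by one command, for example nvcc vecAdd.cu -o vecAdd (the file and output names here are illustrative); nvcc separates out the host code for the host C compiler/linker and compiles the device code to PTX for just-in-time compilation on the target GPU.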

ANY MORE QUESTIONS? © David Kirk/NVIDIA and Wen-mei W. Hwu,