1 100M CUDA GPUs. Application domains: Oil & Gas, Finance, Medical, Biophysics, Numerics, Audio, Video, Imaging. Heterogeneous Computing: CPU + GPU. Joy Lee, Senior SW Engineer, NVIDIA.

GPU & CUDA Features

3 Comparing GPU & CPU systems
Memory bandwidth: host memory (CPU) ~10 GB/s; device memory (GPU) ~100 GB/s; the bottleneck is PCIe at 3~5 GB/s.
More computing cores: CPU has 4~32 cores (peak perf 10~100 GFLOPS); GPU has 32~512 cores (peak perf >1 TFLOPS).
Large-scale parallelism: traditional approaches (MPI, OpenMP, pthreads, ...) use about 10~1K threads; CUDA uses about 1K~10M threads (the algorithmic concerns may be different).

4 Graphics card (figure): the GPU with its DRAM (GDDR3/5 device memory), connected to the host over PCIe.

5 Memory Bandwidth (figure): the CPU (Core 1, Core 2, L2 cache) reaches main memory over a 64-bit FSB at 12.8 GB/s; the GPU's 240 cores reach GDDR3 main memory over a 512-bit interface at 102 GB/s, roughly an 8x faster interface.

6 Highway: device memory, ~100 GB/s. Street: host memory, ~10 GB/s. The PCIe gen 2 x16 link between them: 3~5 GB/s.

7 NVIDIA's GPUs: Ever Increasing Performance (figure). Generations plotted: NV30, NV35, NV40, G70, G71, G80, G9x, T10. Legend: T10 = Tesla 10-series, G9x = GeForce 9800 GTX, G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800.

8 Hierarchy of concurrent threads
Grid: a CUDA kernel is a function executing on the GPU; one grid contains many blocks (practically 64~1024 blocks for good performance).
Block: one block contains many threads (practically 32~512 threads for good performance). blockIdx is the block index within the grid (0, 1, 2, 3, ...). Threads in the same block can cooperate: they can synchronize and share data through fast on-die memory called shared memory.
Thread: threadIdx is the thread index within the block (0, 1, 2, 3, ...).

9 IDs and Dimensions
Built-in variables: threadIdx, blockIdx, blockDim, gridDim.
Dimensions are set at launch time and can differ for each kernel launch.
Blocks: 2D IDs, unique within a grid. Threads: 3D IDs, unique within a block.
(Figure: a grid of 3x2 blocks; Block (1, 1) is expanded into a 5x3 array of threads, Thread (0, 0) through Thread (4, 2).)
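For concreteness, here is a minimal sketch (not from the slides; the kernel name and sizes are invented for illustration) of how the built-in variables combine into a unique per-thread coordinate for a 2D launch:

__global__ void fill2D(float *out, int width, int height)
{
    // Global 2D coordinates from the built-in block and thread IDs.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = 1.0f;   // one element per thread
}

// Host side: the dimensions are chosen at launch time.
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// fill2D<<<grid, block>>>(d_out, width, height);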

CUDA Environment

11 CUDA Environments
Driver: runs CUDA on specific cards and CUDA versions.
Toolkit: compiler, libraries, debugger, profiler, ...
SDK: CUDA examples and some useful tools.
All are free downloads from the NVIDIA developer zone.

12 CUDA Toolkit
Compiler: CUDA, based on C/C++ (available now for all supported OSs).
Libraries: BLAS, FFT, RAND, SPARSE.
Profiler: samples signals on the GPU as an optimization reference (memory access parameters; execution behavior such as serialization and divergence).
Debugger: runs on the GPU (GDB, Nexus, etc.).
Documents: CUDA Programming Guide, CUDA Best Practices Guide, API & library reference, ...

13 CUDA SDK
Useful tools: deviceQuery (queries the specs of the devices available on this machine), bandwidthTest (tests the bandwidth of PCIe and device memory).
Valuable materials for developers: matrix multiplication, transpose, tree reduction, simple BLAS, multi-GPU sample, ...
Valuable middleware for developers: video codecs (MPEG, H.264, ...), radix sort, ...

CUDA Programming Model

15 Simple 5 steps to program the GPU with CUDA
Step 1: Allocate device memory.
Step 2: Upload input data from host to device memory.
Step 3: Call CUDA kernel(s) (you can call many kernels that manipulate data in device memory).
Step 4: Download output data from device to host memory.
Step 5: Free device memory.

16 Heterogeneous Programming
CUDA = a serial program with parallel kernels, all in C.
Serial C code executes in a CPU thread.
Parallel kernel C code executes in thread blocks across multiple processing elements.
(Figure: execution alternates between serial code on the CPU and parallel kernels on the GPU.)

17 GPU Memory Allocation / Release
The host (CPU) manages GPU memory:
cudaMalloc(void **pointer, size_t nbytes)
cudaFree(void *pointer)

int m = 1024;
int nbytes = m * sizeof(int);
int *devPtr = 0;
cudaMalloc((void**)&devPtr, nbytes);
cudaFree(devPtr);

18 Data Copies
cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
Returns after the copy is complete (blocks the CPU thread); the copy doesn't start until all previous CUDA calls have completed.
enum cudaMemcpyKind: cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice.
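As a quick illustration (a sketch, not from the slides; the buffer names and size are arbitrary), a host-to-device and device-to-host round trip looks like this:

int n = 256;
size_t nbytes = n * sizeof(float);
float *h_a = (float*)malloc(nbytes);   // host buffer
float *d_a = 0;                        // device buffer
for (int i = 0; i < n; ++i) h_a[i] = (float)i;

cudaMalloc((void**)&d_a, nbytes);
cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);  // upload
// ... launch kernels that read/write d_a ...
cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);  // download
cudaFree(d_a);
free(h_a);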

19 Launching kernels on the GPU
Launch parameters: grid dimensions (up to 2D), thread-block dimensions (up to 3D).
Launch a 1D kernel:
kernel<<<n_blocks, block_size>>>(...);
Launch a 2D kernel:
dim3 grid(16, 16);
dim3 block(16, 16);
kernel<<<grid, block>>>(...);

20 Sample code
Step 1: Allocate device memory
float *devPtr = 0;
cudaMalloc((void**)&devPtr, sizeInBytes);
Step 2: Copy data from host to device
cudaMemcpy(devPtr, hostPtr, numOfBytes, cudaMemcpyHostToDevice);
Step 3: Call CUDA kernel(s)
kernel_function<<<grid, block>>>(devPtr);
Step 4: Copy data from device back to host
cudaMemcpy(hostPtr, devPtr, numOfBytes, cudaMemcpyDeviceToHost);
Step 5: Free device memory
cudaFree(devPtr);

21 Code executed on the GPU
A C function with some restrictions:
Scalar input variables can be passed directly as parameters; array variables must be in device memory.
Return type is void.
Must be declared with a qualifier:
__global__ : launched by the CPU, cannot be called from the GPU, must return void.
__device__ : called from other GPU functions, cannot be launched by the CPU.
__host__ : executed by the CPU.
Built-in IDs: gridDim, blockDim, blockIdx, threadIdx.
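A minimal sketch (function names invented for illustration, not from the slides) showing how the three qualifiers fit together:

// Callable from GPU code only.
__device__ float square(float x) { return x * x; }

// Callable from CPU code only (also the default for unqualified functions).
__host__ float square_on_cpu(float x) { return x * x; }

// Entry point: launched by the CPU, runs on the GPU, must return void.
__global__ void square_all(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = square(data[i]);   // __global__ code may call __device__ functions
}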

22 Sample kernel code
Standard C code (serial):
void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)      // must loop n times (serial)
        y[i] = a*x[i] + y[i];
}
// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Parallel CUDA code:
__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;   // each thread runs the body only once (parallel)
    if (i < n)
        y[i] = a*x[i] + y[i];
}
// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
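For completeness, a minimal host-side driver for saxpy_parallel (a sketch, not from the slides; the device buffer names d_x and d_y are invented) would follow the five steps from slide 15:

int n = 1 << 20;
size_t nbytes = n * sizeof(float);
float *d_x = 0, *d_y = 0;

cudaMalloc((void**)&d_x, nbytes);                          // Step 1: allocate device memory
cudaMalloc((void**)&d_y, nbytes);
cudaMemcpy(d_x, x, nbytes, cudaMemcpyHostToDevice);        // Step 2: upload input data
cudaMemcpy(d_y, y, nbytes, cudaMemcpyHostToDevice);

int nblocks = (n + 255) / 256;                             // Step 3: launch the kernel
saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, d_x, d_y);

cudaMemcpy(y, d_y, nbytes, cudaMemcpyDeviceToHost);        // Step 4: download the result
cudaFree(d_x);                                             // Step 5: free device memory
cudaFree(d_y);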

23 Memory scope
Thread scope: each thread has its own local storage.
Block scope: each thread block has its own shared memory, accessible only by threads within that block.
Grid scope (global scope): accessible by all threads as well as the host (CPU).

24 Memory system
Thread scope: registers (on die, fastest, the default); local memory (DRAM, not cached, slow: 400~600 clocks).
Block scope: shared memory (on die, fast: 4~6 clocks), qualifier __shared__.
Grid scope: global memory (DRAM, not cached, slow: 400~600 clocks); constant memory (on die, small: 64 KB total), qualifier __constant__; texture memory (read only, DRAM + cache, fast on a cache hit); cache (Fermi only): R/W cache.
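A short sketch (not from the slides; the names and sizes are illustrative) of how the __constant__ and __shared__ qualifiers appear in code; the host would typically fill the constant table with cudaMemcpyToSymbol before launching:

__constant__ float coeff[16];          // grid scope: small read-only table in constant memory

__global__ void smooth(const float *in, float *out, int n)
{
    __shared__ float tile[256];        // block scope: one copy per thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;  // 'v' and 'i' live in per-thread registers
    tile[threadIdx.x] = v;
    __syncthreads();                   // make the tile visible to the whole block
    if (i < n)
        out[i] = coeff[0] * tile[threadIdx.x];
}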

25 Memory: Location & Scope (figure): each thread has its own registers, each block has its own shared memory, and all blocks in the kernel share global memory.

26 Thread Synchronization Function
void __syncthreads();
Synchronizes all threads in a block: once all threads have reached this point, execution resumes normally.
Used to avoid RAW / WAR / WAW hazards when accessing shared memory.
Should be used in conditional code only if the condition is uniform across the entire thread block.
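As an illustration (a sketch, not from the slides), __syncthreads() separates the "write shared memory" phase from the "read shared memory" phase; here each block reverses its own 256-element chunk:

__global__ void reverse_in_block(float *data)
{
    __shared__ float buf[256];
    int i = blockIdx.x * 256 + threadIdx.x;
    buf[threadIdx.x] = data[i];          // every thread writes one element
    __syncthreads();                     // wait until the whole block has written
    data[i] = buf[255 - threadIdx.x];    // now safe to read elements written by other threads
}
// Launch with exactly 256 threads per block, e.g. reverse_in_block<<<nblocks, 256>>>(d_data);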

27 Host Synchronization
All kernel launches are asynchronous: control returns to the CPU immediately; the kernel starts executing once all previous CUDA calls have completed.
cudaThreadSynchronize() syncs CPU & GPU: it blocks until all previous CUDA calls complete.
Memcopies are synchronous: control returns to the CPU once the copy is complete; the copy starts once all previous CUDA calls have completed.
Asynchronous CUDA calls provide non-blocking memcopies and the ability to overlap memcopies with kernel execution.
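For example (a sketch, not from the slides; it reuses the kernel_function launch from slide 20 and assumes <time.h> is included), because a launch returns immediately the host must synchronize before trusting a CPU-side timer. In later CUDA releases the same call is named cudaDeviceSynchronize.

// Time a kernel from the host. Without the synchronize, the timer would stop
// while the kernel is still running, because the launch is asynchronous.
clock_t start = clock();
kernel_function<<<grid, block>>>(devPtr);
cudaThreadSynchronize();                 // block until the kernel has finished
clock_t stop = clock();
double seconds = (double)(stop - start) / CLOCKS_PER_SEC;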

28 Synchronization Levels
Block sync: sync all threads in a block. Method: call __syncthreads() in the kernel.
Grid sync: sync all threads in the grid (kernel). Method: launch another kernel.
CPU-GPU sync: sync CPU & GPU. Method: call cudaThreadSynchronize() in host C code.

29 Device Management
The CPU can query and select GPU devices:
cudaGetDeviceCount(int *count)
cudaSetDevice(int device)
cudaGetDevice(int *current_device)
cudaGetDeviceProperties(cudaDeviceProp *prop, int device)
cudaChooseDevice(int *device, cudaDeviceProp *prop)
Multi-GPU setup: device 0 is used by default; one CPU thread controls one GPU.
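A small sketch (not from the slides; assumes <stdio.h> and the CUDA runtime headers are included) of how these calls combine to enumerate the devices and pick one:

int count = 0;
cudaGetDeviceCount(&count);
for (int dev = 0; dev < count; ++dev) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);
    printf("Device %d: %s, %d multiprocessors\n",
           dev, prop.name, prop.multiProcessorCount);
}
cudaSetDevice(0);   // device 0 is the default anyway; shown here for illustration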

30 CUDA Error Reporting to the CPU
All CUDA calls return an error code of type cudaError_t, except for kernel launches.
cudaError_t cudaGetLastError(void) returns the code for the last error ("no error" also has a code).
const char* cudaGetErrorString(cudaError_t code) returns a null-terminated character string describing the error.
printf("%s\n", cudaGetErrorString(cudaGetLastError()));
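A common pattern (a sketch, not from the slides; assumes <stdio.h> is included) wraps each call in a check so failures are reported where they happen; kernel launches, which return no code, are checked afterwards with cudaGetLastError():

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            printf("CUDA error at %s:%d: %s\n",                       \
                   __FILE__, __LINE__, cudaGetErrorString(err));      \
        }                                                             \
    } while (0)

CUDA_CHECK(cudaMalloc((void**)&devPtr, nbytes));
kernel_function<<<grid, block>>>(devPtr);
CUDA_CHECK(cudaGetLastError());   // catches launch errors, since launches return no code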

31 Compiling CUDA (figure): NVCC takes the C/C++ CUDA application and splits it into CPU code and virtual PTX code; a PTX-to-target compiler (a JIT compiler in the driver) translates the PTX into physical target code for the actual hardware (G80, G200, ...).

32 Hands-on practice
SDK install & make
deviceQuery, bandwidthTest
Hello CUDA
Vector Add
Tree Reduction

Thinking in parallel

34 Example: Vector Addition
Each thread computes one element: k = blockIdx.x * blockDim.x + threadIdx.x.
With gridDim.x = 3 and blockDim.x = 4, blockIdx.x runs over 0, 1, 2 and threadIdx.x over 0..3, so k (the element index) covers 0..11 and each thread performs a[k] + b[k] -> c[k].

35 Example: Vector Addition
CPU (standard C):
void add(float *a, float *b, float *c, int n)
{
    for (int k = 0; k < n; k++)
        c[k] = a[k] + b[k];
}
void main()
{
    .....
    add(a, b, c, n);
}

GPGPU (CUDA):
__global__ void Gadd(float *a, float *b, float *c, int n)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < n)
        c[k] = a[k] + b[k];
}
void main()
{
    .....
    int Block = 512;
    int Grid = n / Block + 1;
    Gadd<<<Grid, Block>>>(a, b, c, n);
}
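The GPGPU main above elides device allocation and the copies (the "....."). A complete, self-contained version (a sketch under the assumption that the slide's Gadd kernel is used as-is; the buffer names and sizes are invented) would be:

#include <stdio.h>
#include <stdlib.h>

__global__ void Gadd(float *a, float *b, float *c, int n)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < n) c[k] = a[k] + b[k];
}

int main(void)
{
    int n = 100000;
    size_t nbytes = n * sizeof(float);

    // Host buffers
    float *a = (float*)malloc(nbytes), *b = (float*)malloc(nbytes), *c = (float*)malloc(nbytes);
    for (int k = 0; k < n; k++) { a[k] = 1.0f; b[k] = 2.0f; }

    // Device buffers
    float *da, *db, *dc;
    cudaMalloc((void**)&da, nbytes);
    cudaMalloc((void**)&db, nbytes);
    cudaMalloc((void**)&dc, nbytes);

    cudaMemcpy(da, a, nbytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, nbytes, cudaMemcpyHostToDevice);

    int Block = 512;
    int Grid = n / Block + 1;
    Gadd<<<Grid, Block>>>(da, db, dc, n);

    cudaMemcpy(c, dc, nbytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f, c[n-1] = %f\n", c[0], c[n-1]);   // expect 3.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(a); free(b); free(c);
    return 0;
}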

Backup slides

37 PTX Example (SAXPY code)
cvt.u32.u16   $blockid, %ctaid.x;     // Calculate i from thread/block IDs
cvt.u32.u16   $blocksize, %ntid.x;
cvt.u32.u16   $tid, %tid.x;
mad24.lo.u32  $i, $blockid, $blocksize, $tid;
ld.param.u32  $n, [N];                // Nothing to do if n <= i
setp.le.u32   $p1, $n, $i;
@$p1 bra      $L_finish;
mul.lo.u32    $offset, $i, 4;         // Load y[i]
ld.param.u32  $yaddr, [Y];
add.u32       $yaddr, $yaddr, $offset;
ld.global.f32 $y_i, [$yaddr+0];
ld.param.u32  $xaddr, [X];            // Load x[i]
add.u32       $xaddr, $xaddr, $offset;
ld.global.f32 $x_i, [$xaddr+0];
ld.param.f32  $alpha, [ALPHA];        // Compute and store alpha*x[i] + y[i]
mad.f32       $y_i, $alpha, $x_i, $y_i;
st.global.f32 [$yaddr+0], $y_i;
$L_finish:    exit;

38 Scalability
Thread blocks can run in any order, concurrently or sequentially.
This facilitates scaling the same code across many devices.

39 Hardware multithreading (figure): a kernel launched by the host is distributed across the device processor array; each multiprocessor has its own shared memory, MT issue unit, and SPs, and all multiprocessors share device memory.

40 Example: Matrix Transpose
Coalesced reads/writes to device memory give the best memory-bandwidth utilization.
Use shared memory to share data between threads in a block.
Use 2D blockDim & gridDim.
Over 20x faster than the host for large matrices (e.g. 3000x4000).
Host invocation:
dim3 grid(n/16+1, m/16+1, 1);
dim3 block(16, 16, 1);
Transpose<<<grid, block>>>(b, a, m, n);

41 Example: Matrix Transpose
__global__ void Transpose(float *b, float *a, int m, int n)
{
    __shared__ float s[256];                           // declare shared memory (visible to all threads in the block)
    int x = blockIdx.x*blockDim.x + threadIdx.x;       // compute index (x,y)
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    if (y < m && x < n) {
        int i = y*n + x;                               // compute input address (the lowest index follows threadIdx.x)
        int t = threadIdx.y*blockDim.x + threadIdx.x;  // compute shared-memory address
        s[t] = a[i];                                   // coalesced input (read a 16x16 matrix block)
    }
    __syncthreads();                                   // synchronize threads in the block
    x = blockIdx.x*blockDim.x + threadIdx.y;           // exchange threadIdx (x,y) usage
    y = blockIdx.y*blockDim.y + threadIdx.x;           // compute index (x,y)
    if (y < m && x < n) {
        int o = x*m + y;                               // compute output address (the lowest index follows threadIdx.x)
        int t = threadIdx.x*blockDim.y + threadIdx.y;  // compute shared-memory address
        b[o] = s[t];                                   // coalesced output (write a 16x16 matrix block)
    }
}

42 Hardware (SPA, Streaming Processors Array) (figure: the SPA is built from multiple TPCs).

43 Hardware (TPC, Texture/Processors Cluster)

44 Hardware (SM, Streaming Multiprocessor)
A warp = 32 threads.
SP: MUL & MAD. SFU: sin(x), cos(x), ...
One block is divided into multiple warps; an SM can execute one warp in 4 clocks.

45 SPA, Streaming Processors Array (figure: TP array with shared memory, a double-precision unit, and a Special Function Unit (SFU)).

46 How to use so many cores?
240 SP thread processors and 30 DP thread processors.
Each is a full scalar processor; double precision follows IEEE 754 floating point.
(Figure: a Thread Processor Array (TPA) contains Thread Processors (TPs), each with FP/Int ALUs, special-op units, and a multi-banked register file, plus shared memory, a double-precision unit, and a Special Function Unit (SFU).)