CUDA Basics

© NVIDIA Corporation 2009 CUDA: A Parallel Computing Architecture for NVIDIA GPUs Supports standard languages and APIs: C/C++, OpenCL, DirectCompute Supported on common operating systems: Windows, Mac OS, Linux

© NVIDIA Corporation 2009 Outline of CUDA Basics Basics Memory Management Basic Kernels and Execution on GPU Coordinating CPU and GPU Execution Development Resources See the Programming Guide for the full API

Basic Memory Management

© NVIDIA Corporation 2009 Memory Spaces CPU and GPU have separate memory spaces Data is moved across the PCIe bus Use functions to allocate/set/copy memory on the GPU Very similar to the corresponding C functions Pointers are just addresses Can’t tell from the pointer value whether the address is on the CPU or the GPU Must exercise care when dereferencing: dereferencing a CPU pointer on the GPU will likely crash, and vice versa

© NVIDIA Corporation 2009 Tesla 10 Architecture [Block diagram: TPCs connected through an interconnection network to ROP/L2 partitions and DRAM; work distribution unit; CPU, bridge, and host memory]

© NVIDIA Corporation 2009 Tesla 10 Architecture [Diagram: texture processing cluster (TEX plus SMs); each streaming multiprocessor contains an instruction cache, constant cache, instruction fetch unit, shared memory, SPs, SFUs, a double-precision unit, and a register file]

© NVIDIA Corporation 2009 Fermi Architecture [Diagram: instruction cache, two warp schedulers each with a dispatch unit, register file (32K 32-bit words), cores, LD/ST units, SFUs, interconnection network, 64 KB shared memory / L1 cache, uniform cache]

© NVIDIA Corporation 2009 GPU Memory Allocation / Release Host (CPU) manages device (GPU) memory: cudaMalloc (void ** pointer, size_t nbytes) cudaMemset (void * pointer, int value, size_t count) cudaFree (void* pointer) int n = 1024; int nbytes = 1024*sizeof(int); int * d_a = 0; cudaMalloc( (void**)&d_a, nbytes ); cudaMemset( d_a, 0, nbytes); cudaFree(d_a);

© NVIDIA Corporation 2009 Data Copies cudaMemcpy( void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction); returns after the copy is complete blocks CPU thread until all bytes have been copied doesn’t start copying until previous CUDA calls complete enum cudaMemcpyKind cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice Non-blocking memcopies are provided
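For illustration, a minimal sketch (not from the original slides) of a blocking round-trip copy using cudaMemcpy; the array name and size are arbitrary:

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    const int n = 256;
    size_t nbytes = n * sizeof(float);
    float h_a[256];
    for (int i = 0; i < n; i++) h_a[i] = (float)i;

    float *d_a = 0;
    cudaMalloc((void**)&d_a, nbytes);

    /* Host -> device, then device -> host; each call blocks the CPU thread
       until the copy has completed. */
    cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);

    printf("h_a[10] = %f\n", h_a[10]);
    cudaFree(d_a);
    return 0;
}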

© NVIDIA Corporation 2009 Code Walkthrough 1 Allocate CPU memory for n integers Allocate GPU memory for n integers Initialize GPU memory to 0s Copy from GPU to CPU Print the values

© NVIDIA Corporation 2009 Code Walkthrough 1 #include <stdio.h> int main() { int dimx = 16; int num_bytes = dimx*sizeof(int); int *d_a=0, *h_a=0; // device and host pointers

© NVIDIA Corporation 2009 Code Walkthrough 1 #include <stdio.h> int main() { int dimx = 16; int num_bytes = dimx*sizeof(int); int *d_a=0, *h_a=0; // device and host pointers h_a = (int*)malloc(num_bytes); cudaMalloc( (void**)&d_a, num_bytes ); if( 0==h_a || 0==d_a ) { printf("couldn't allocate memory\n"); return 1; }

© NVIDIA Corporation 2009 Code Walkthrough 1 #include <stdio.h> int main() { int dimx = 16; int num_bytes = dimx*sizeof(int); int *d_a=0, *h_a=0; // device and host pointers h_a = (int*)malloc(num_bytes); cudaMalloc( (void**)&d_a, num_bytes ); if( 0==h_a || 0==d_a ) { printf("couldn't allocate memory\n"); return 1; } cudaMemset( d_a, 0, num_bytes ); cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

© NVIDIA Corporation 2009 Code Walkthrough 1 #include <stdio.h> int main() { int dimx = 16; int num_bytes = dimx*sizeof(int); int *d_a=0, *h_a=0; // device and host pointers h_a = (int*)malloc(num_bytes); cudaMalloc( (void**)&d_a, num_bytes ); if( 0==h_a || 0==d_a ) { printf("couldn't allocate memory\n"); return 1; } cudaMemset( d_a, 0, num_bytes ); cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost ); for(int i=0; i<dimx; i++) printf("%d ", h_a[i] ); printf("\n"); free( h_a ); cudaFree( d_a ); return 0; }

Basic Kernels and Execution on GPU

© NVIDIA Corporation 2009 CUDA Programming Model Parallel code (kernel) is launched and executed on a device by many threads Threads are grouped into thread blocks Parallel code is written for a thread Each thread is free to execute a unique code path Built-in thread and block ID variables

Thread Hierarchy Threads launched for a parallel section are partitioned into thread blocks Grid = all blocks for a given launch Thread block is a group of threads that can: Synchronize their execution Communicate via shared memory

© NVIDIA Corporation 2009 IDs and Dimensions Threads: 3D IDs, unique within a block Blocks: 2D IDs, unique within a grid Dimensions set at launch time Can be unique for each grid Built-in variables: threadIdx, blockIdx, blockDim, gridDim [Figure: a device running a grid of 2D blocks, Block (0,0) through Block (2,1), with one block expanded to show its 2D arrangement of threads, Thread (0,0) through Thread (4,2)]
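As an illustration (not part of the original deck), a minimal sketch of how these built-ins are typically combined into a unique global index per thread; the kernel names are made up:

/* 1D case: blockIdx, blockDim, and threadIdx combine into a unique global index. */
__global__ void writeGlobalIndex1D(int *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = idx;
}

/* 2D case: the same idea per dimension, flattened with the row width. */
__global__ void writeGlobalIndex2D(int *out, int width)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    out[iy * width + ix] = iy * width + ix;
}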

© NVIDIA Corporation 2009 Code executed on GPU C function with some restrictions: Can only access GPU memory No variable number of arguments No static variables Must be declared with a qualifier: __global__ : launched by CPU, cannot be called from GPU must return void __device__ : called from other GPU functions, cannot be launched by the CPU __host__ : can be executed by CPU __host__ and __device__ qualifiers can be combined
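A small sketch of the qualifiers in combination (the function names are illustrative, not from the slides):

__host__ __device__ int square(int x)   /* compiled for both CPU and GPU */
{
    return x * x;
}

__device__ int scaled(int x)            /* callable only from GPU code */
{
    return 2 * square(x);
}

__global__ void fill(int *a)            /* launched by the CPU, must return void */
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] = scaled(idx);
}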

© NVIDIA Corporation 2009 Code Walkthrough 2 Build on Walkthrough 1 Write a kernel to initialize integers Copy the result back to CPU Print the values

© NVIDIA Corporation 2009 __global__ void kernel(int *a) { int idx = blockIdx.x * blockDim.x + threadIdx.x; a[idx] = 7; } Kernel Code (executed on GPU)

© NVIDIA Corporation 2009 Launching kernels on GPU Launch parameters: grid dimensions (up to 2D), dim3 type thread-block dimensions (up to 3D), dim3 type shared memory: number of bytes per block for extern smem variables declared without size Optional, 0 by default stream ID Optional, 0 by default dim3 grid(16, 16); dim3 block(16, 16); kernel<<<grid, block>>>(...);
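A minimal runnable sketch (the kernel name and sizes are illustrative) showing all four launch parameters, including the per-block byte count that sizes an extern __shared__ array:

#include <cuda_runtime.h>

__global__ void scaleKernel(float *a, float s)
{
    extern __shared__ float tile[];     /* sized by the launch's third parameter */
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = a[idx] * s;
    a[idx] = tile[threadIdx.x];
}

int main()
{
    const int n = 1024;
    float *d_a = 0;
    cudaMalloc((void**)&d_a, n * sizeof(float));
    cudaMemset(d_a, 0, n * sizeof(float));

    dim3 block(256);                              /* 1D thread block */
    dim3 grid(n / block.x);                       /* enough blocks to cover n */
    size_t smemBytes = block.x * sizeof(float);   /* dynamic shared memory per block */
    cudaStream_t stream = 0;                      /* default stream */

    scaleKernel<<<grid, block, smemBytes, stream>>>(d_a, 2.0f);
    cudaThreadSynchronize();   /* wait for completion (API of the 2009-era toolkit) */
    cudaFree(d_a);
    return 0;
}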

© NVIDIA Corporation 2009 #include <stdio.h> __global__ void kernel( int *a ) { int idx = blockIdx.x*blockDim.x + threadIdx.x; a[idx] = 7; } int main() { int dimx = 16; int num_bytes = dimx*sizeof(int); int *d_a=0, *h_a=0; // device and host pointers h_a = (int*)malloc(num_bytes); cudaMalloc( (void**)&d_a, num_bytes ); if( 0==h_a || 0==d_a ) { printf("couldn't allocate memory\n"); return 1; } cudaMemset( d_a, 0, num_bytes ); dim3 grid, block; block.x = 4; grid.x = dimx / block.x; kernel<<<grid, block>>>( d_a ); cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost ); for(int i=0; i<dimx; i++) printf("%d ", h_a[i] ); printf("\n"); free( h_a ); cudaFree( d_a ); return 0; }

© NVIDIA Corporation 2009 Kernel Variations and Output (for dimx = 16, block.x = 4) __global__ void kernel( int *a ) { int idx = blockIdx.x*blockDim.x + threadIdx.x; a[idx] = 7; } Output: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 __global__ void kernel( int *a ) { int idx = blockIdx.x*blockDim.x + threadIdx.x; a[idx] = blockIdx.x; } Output: 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3 __global__ void kernel( int *a ) { int idx = blockIdx.x*blockDim.x + threadIdx.x; a[idx] = threadIdx.x; } Output: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3

© NVIDIA Corporation 2009 Code Walkthrough 3 Build on Walkthrough 2 Write a kernel to increment n×m integers Copy the result back to CPU Print the values

© NVIDIA Corporation 2009 __global__ void kernel(int *a, int dimx, int dimy) { int ix = blockIdx.x * blockDim.x + threadIdx.x; int iy = blockIdx.y * blockDim.y + threadIdx.y; int idx = iy * dimx + ix; a[idx] = a[idx] + 1; } Kernel with 2D Indexing

© NVIDIA Corporation 2009 int main() { int dimx = 16; int dimy = 16; int num_bytes = dimx*dimy*sizeof(int); int *d_a=0, *h_a=0; // device and host pointers h_a = (int*)malloc(num_bytes); cudaMalloc( (void**)&d_a, num_bytes ); if( 0==h_a || 0==d_a ) { printf("couldn't allocate memory\n"); return 1; } cudaMemset( d_a, 0, num_bytes ); dim3 grid, block; block.x = 4; block.y = 4; grid.x = dimx / block.x; grid.y = dimy / block.y; kernel<<<grid, block>>>( d_a, dimx, dimy ); cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost ); for(int row=0; row<dimy; row++) { for(int col=0; col<dimx; col++) printf("%d ", h_a[row*dimx+col] ); printf("\n"); } free( h_a ); cudaFree( d_a ); return 0; } __global__ void kernel( int *a, int dimx, int dimy ) { int ix = blockIdx.x*blockDim.x + threadIdx.x; int iy = blockIdx.y*blockDim.y + threadIdx.y; int idx = iy*dimx + ix; a[idx] = a[idx]+1; }

© NVIDIA Corporation 2009 Blocks must be independent Any possible interleaving of blocks should be valid presumed to run to completion without pre-emption can run in any order can run concurrently OR sequentially Blocks may coordinate but not synchronize shared queue pointer: OK shared lock: BAD … can easily deadlock Independence requirement gives scalability
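A sketch of the "shared queue pointer: OK" pattern (the kernel and names are ours, not from the slides): blocks coordinate through an atomic counter but never wait on one another, so any execution order remains valid:

__global__ void drainQueue(const int *workItems, int numItems, int *queueHead, int *results)
{
    __shared__ int item;
    while (true) {
        /* Thread 0 of each block claims the next item; no block waits on another. */
        if (threadIdx.x == 0)
            item = atomicAdd(queueHead, 1);
        __syncthreads();                 /* make the claimed item visible to the whole block */
        if (item >= numItems)            /* uniform condition: the whole block exits together */
            return;
        if (threadIdx.x == 0)
            results[item] = workItems[item] * 2;   /* ... process workItems[item] ... */
        __syncthreads();                 /* finish this item before thread 0 claims the next */
    }
}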

© NVIDIA Corporation 2009 Blocks must be independent Thread blocks can run in any order Concurrently or sequentially Facilitates scaling of the same code across many devices Scalability

Coordinating CPU and GPU Execution

© NVIDIA Corporation 2009 Synchronizing GPU and CPU All kernel launches are asynchronous control returns to CPU immediately kernel starts executing once all previous CUDA calls have completed Memcopies are synchronous control returns to CPU once the copy is complete copy starts once all previous CUDA calls have completed cudaThreadSynchronize() blocks until all previous CUDA calls complete Asynchronous CUDA calls provide: non-blocking memcopies ability to overlap memcopies and kernel execution
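A short sketch (illustrative kernel and helper, not from the slides) of what this means in practice: the launch returns immediately, and the blocking memcpy implicitly waits for the kernel to finish:

__global__ void increment(int *a)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] += 1;
}

/* Assumes n is a multiple of 256 and d_a holds n ints on the device. */
void runAndReadBack(int *h_a, int *d_a, int n)
{
    increment<<<n / 256, 256>>>(d_a);        /* asynchronous: returns to the CPU at once */
    /* cudaThreadSynchronize();                 only needed if the CPU must wait here    */
    cudaMemcpy(h_a, d_a, n * sizeof(int),
               cudaMemcpyDeviceToHost);      /* synchronous: waits for the kernel, then copies */
}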

© NVIDIA Corporation 2009 CUDA Error Reporting to CPU All CUDA calls return an error code: except kernel launches cudaError_t type cudaError_t cudaGetLastError(void) returns the code for the last error ("no error" has a code) const char* cudaGetErrorString(cudaError_t code) returns a null-terminated character string describing the error printf("%s\n", cudaGetErrorString( cudaGetLastError() ) );
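A common checking pattern built from these calls (the macro name is ours): wrap runtime calls, and query cudaGetLastError() after kernel launches, since launches themselves return no error code:

#include <stdio.h>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err = (call);                                       \
        if (err != cudaSuccess)                                         \
            printf("CUDA error: %s\n", cudaGetErrorString(err));        \
    } while (0)

/* Usage sketch:
   CUDA_CHECK( cudaMalloc((void**)&d_a, nbytes) );
   kernel<<<grid, block>>>(d_a);
   CUDA_CHECK( cudaGetLastError() );   // catches launch configuration errors
*/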

© NVIDIA Corporation 2009 CUDA Event API Events are inserted (recorded) into CUDA call streams Usage scenarios: measure elapsed time for CUDA calls (clock cycle precision) query the status of an asynchronous CUDA call block CPU until CUDA calls prior to the event are completed asyncAPI sample in CUDA SDK cudaEvent_t start, stop; cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord(start, 0); kernel<<<grid, block>>>(...); cudaEventRecord(stop, 0); cudaEventSynchronize(stop); float et; cudaEventElapsedTime(&et, start, stop); cudaEventDestroy(start); cudaEventDestroy(stop);

© NVIDIA Corporation 2009 Device Management CPU can query and select GPU devices cudaGetDeviceCount(int* count) cudaSetDevice(int device) cudaGetDevice(int *current_device) cudaGetDeviceProperties(cudaDeviceProp* prop, int device) cudaChooseDevice(int *device, cudaDeviceProp* prop) Multi-GPU setup: device 0 is used by default one CPU thread can control many GPUs multiple CPU threads can control the same GPU
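A minimal sketch enumerating devices and selecting one, built from the calls listed above (the printed fields are a small subset of cudaDeviceProp):

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("device %d: %s, %d multiprocessors\n",
               dev, prop.name, prop.multiProcessorCount);
    }
    if (count > 1)
        cudaSetDevice(1);   /* device 0 is used by default unless another is selected */
    return 0;
}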

Shared Memory

© NVIDIA Corporation 2009 Shared Memory On-chip memory 2 orders of magnitude lower latency than global memory Order of magnitude higher bandwidth than gmem 16KB per multiprocessor NVIDIA GPUs contain up to 30 multiprocessors Allocated per threadblock Accessible by any thread in the threadblock Not accessible to other threadblocks Several uses: Sharing data among threads in a threadblock User-managed cache (reducing gmem accesses)

© NVIDIA Corporation 2009 Example of Using Shared Memory Applying a 1D stencil: 1D data For each output element, sum all elements within a radius For example, radius = 3: add 7 input elements [Figure: a 1D array with the radius marked around one output element]

© NVIDIA Corporation 2009 Implementation with Shared Memory 1D threadblocks (partition the output) Each threadblock outputs BLOCK_DIMX elements Read input from gmem to smem Needs BLOCK_DIMX + 2*RADIUS input elements Compute Write output to gmem [Figure: input elements corresponding to the output (as many as there are threads in a threadblock) plus a “halo” of RADIUS elements on each side]

© NVIDIA Corporation 2009 Kernel code __global__ void stencil( int *output, int *input, int dimx, int dimy ) { __shared__ int s_a[BLOCK_DIMX+2*RADIUS]; int global_ix = blockIdx.x*blockDim.x + threadIdx.x; int local_ix = threadIdx.x + RADIUS; s_a[local_ix] = input[global_ix]; if ( threadIdx.x < RADIUS ) { s_a[local_ix - RADIUS] = input[global_ix - RADIUS]; s_a[local_ix + BLOCK_DIMX] = input[global_ix + BLOCK_DIMX]; } __syncthreads(); int value = 0; for( int offset = -RADIUS; offset<=RADIUS; offset++ ) value += s_a[ local_ix + offset ]; output[global_ix] = value; }

© NVIDIA Corporation 2009 Thread Synchronization Function void __syncthreads(); Synchronizes all threads in a thread-block Since threads are scheduled at run-time Once all threads have reached this point, execution resumes normally Used to avoid RAW / WAR / WAW hazards when accessing shared memory Should be used in conditional code only if the conditional is uniform across the entire thread block
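For illustration (not from the slides), a block-local array reversal in which __syncthreads() separates the shared-memory write phase from the read phase, avoiding the RAW/WAR hazards mentioned above; it assumes the kernel is launched with BLOCK_SIZE threads per block:

#define BLOCK_SIZE 256

__global__ void reverseWithinBlock(int *a)
{
    __shared__ int s[BLOCK_SIZE];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    s[threadIdx.x] = a[idx];      /* each thread writes one element */
    __syncthreads();              /* all writes complete before any thread reads */

    a[idx] = s[blockDim.x - 1 - threadIdx.x];   /* read an element written by another thread */
}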

© NVIDIA Corporation 2009 Memory Model Review Local storage Each thread has own local storage Mostly registers (managed by the compiler) Data lifetime = thread lifetime Shared memory Each thread block has own shared memory Accessible only by threads within that block Data lifetime = block lifetime Global (device) memory Accessible by all threads as well as host (CPU) Data lifetime = from allocation to deallocation

© NVIDIA Corporation 2009 Memory Model Review [Diagram: a thread with its per-thread local storage; a block with its per-block shared memory]

© NVIDIA Corporation 2009 Memory Model Review [Diagram: sequential kernels (Kernel 0, then Kernel 1) sharing the same per-device global memory]

© NVIDIA Corporation 2009 Memory Model Review [Diagram: host memory and device 0 / device 1 memories connected by cudaMemcpy()]

© NVIDIA Corporation 2009 NVIDIA GPUDirect 2.0 Direct memory access to other GPU memory on the same system Looks like NUMA on CPU
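A sketch of what this enables in code. Note that these peer-to-peer runtime calls belong to the CUDA 4.0-era toolkit, later than the 2009 toolkit most of these slides target; the pointers and device numbers are illustrative:

#include <cuda_runtime.h>

/* Copy nbytes directly from device 0 memory to device 1 memory. */
void copyBetweenGpus(void *d_dst_on_dev1, const void *d_src_on_dev0, size_t nbytes)
{
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);   /* device 0 may now access device 1 memory */
    }
    /* With P2P enabled the copy goes straight over PCIe, without staging in host memory. */
    cudaMemcpyPeer(d_dst_on_dev1, 1, d_src_on_dev0, 0, nbytes);
}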

© NVIDIA Corporation 2009 Unified Virtual Addressing [Diagram: separate CPU and GPU address spaces before UVA vs. a single unified address space with UVA]

CUDA Development Resources

© NVIDIA Corporation 2009 CUDA Programming Resources CUDA toolkit Compiler, libraries, and documentation free download for Windows, Linux, and MacOS CUDA SDK code samples whitepapers Instructional materials on Developer Zone slides and audio, webinars parallel programming courses tutorials forums

© NVIDIA Corporation 2009 CUDA Webinars * CUDA C and CUDA C++ Streams and Concurrent Kernels, Dr Steve Rennich, August 2nd, 2011 * CUDA Texture Memory & Graphics Interop, Dr Gernot Ziegler, August 9th, 2011 * CUDA Optimization: Identifying Performance Limiters, Dr Paulius Micikevicius, August 23rd, 2011 * Multi-GPU and Host Multi-Threading Considerations, Dr Paulius Micikevicius, August 30th, 2011 * GPU Direct and Unified Virtual Addressing, Timothy Schroeder, Sept 6th, 2011 * CUDA Optimization: Memory Bandwidth Limited Kernels, Tim Schroeder, Sept 20th, 2011 * CUDA Optimization: Instruction Limited Kernels, Dr Gernot Ziegler, Sept 27th, 2011 * CUDA Optimization: Register Spilling and Local Memory Usage, Dr Paulius Micikevicius, Oct 4th, 2011

© NVIDIA Corporation 2009 Books

© NVIDIA Corporation 2009 GPU Tools Profiler Available now for all supported OSs Command-line or GUI Sampling signals on GPU for: Memory access parameters Execution (serialization, divergence) Debugger cuda-gdb (Linux), Nsight (Windows), cuda-memcheck Runs on the GPU 3rd-party debuggers with CUDA support: TotalView, etc. Simple printfs in kernel code (on Fermi)

The “7 Dwarfs” of High Performance Computing Phil Colella (LBL) identified 7 kernels out of which most large-scale simulation and data-analysis programs are composed: 1. Dense Linear Algebra Ex: Solve Ax=b or Ax = λx where A is a dense matrix 2. Sparse Linear Algebra Ex: Solve Ax=b or Ax = λx where A is a sparse matrix (mostly zero) 3. Operations on Structured Grids Ex: A_new(i,j) = 4*A(i,j) - A(i-1,j) - A(i+1,j) - A(i,j-1) - A(i,j+1) 4. Operations on Unstructured Grids Ex: Similar, but the list of neighbors varies from entry to entry 5. Spectral Methods Ex: Fast Fourier Transform (FFT) 6. Particle Methods Ex: Compute electrostatic forces using the Fast Multipole Method 7. Monte Carlo Ex: Many independent simulations using different inputs

© NVIDIA Corporation 2009 The “7 Dwarfs” of High Performance Computing 1. Dense Linear Algebra cuBLAS, CULA 2. Sparse Linear Algebra cuSPARSE, CUSP 3. Operations on Structured Grids Successful implementations: Turbostream, GasDynamicsTool (not released yet)… 4. Operations on Unstructured Grids Successful implementations: SD++, FEFLO 5. Spectral Methods cuFFT, DCT 6. Particle Methods N-Body and similar examples in the SDK 7. Monte Carlo cuRAND