GPU Parallel Computing Zehuan Wang HPC Developer Technology Engineer

Access the Power of the GPU. Applications can be accelerated in three ways: Libraries, OpenACC Directives, and Programming Languages.

GPU Accelerated Libraries: "Drop-in" Acceleration for your Applications. NVIDIA cuBLAS (matrix algebra on GPU and multicore), NVIDIA cuSPARSE (GPU-accelerated sparse linear algebra), NVIDIA cuFFT, NVIDIA NPP (vector signal and image processing), NVIDIA cuRAND, Thrust (building-block, C++ templated parallel algorithms), IMSL Library, CenterSpace NMath. See also S3461A - CUDA Accelerated Compute Libraries (Maxim Naumov: new features, how to use), Monday, 10:30am in 210H and Monday, 2:30pm in 212A.

Speaker notes: There is a wide variety of computational libraries available. You are probably familiar with the FFTW Fast Fourier Transform library, the various BLAS linear algebra libraries, and Intel's Math Kernel Library and Integrated Performance Primitives libraries, to name a few. What you might not know is that compatible GPU-accelerated libraries are available as drop-in replacements for all of these libraries, and many more, including both open-source and commercially supported libraries (e.g. IMSL, the International Mathematics & Statistics Library). GPU-accelerated libraries are the most straightforward way to add GPU acceleration to applications, and in many cases their performance cannot be beat. Background: cuBLAS is NVIDIA's GPU-accelerated BLAS linear algebra library, which includes very fast dense matrix and vector computations. cuSPARSE is NVIDIA's sparse matrix library, which includes sparse matrix-vector multiplication, fast sparse linear equation solvers, and tridiagonal solvers. NVIDIA Performance Primitives (NPP) includes literally thousands of signal and image processing functions. cuFFT is NVIDIA's GPU-accelerated Fast Fourier Transform library. cuRAND is NVIDIA's GPU-accelerated random number generation library, which is useful in Monte Carlo algorithms and financial applications, among others. Thrust is an open-source C++ template library of parallel algorithms such as sorting, reductions, searching, and set operations. All are free and included in the CUDA Toolkit, available at www.nvidia.com/getcuda.

GPU Programming Languages. Numerical analytics: MATLAB, Mathematica, LabVIEW. Fortran: OpenACC, CUDA Fortran. C: OpenACC, CUDA C. C++: CUDA C++, Thrust, Hemi, ArrayFire. Python: Anaconda Accelerate, PyCUDA, Copperhead. .NET: CUDAfy.NET, Alea.cuBase. See developer.nvidia.com/language-solutions.

Speaker notes: This list doesn't show Java, or the growing list of DSLs being developed to address domain-specific challenges in many fields. At the GTC Dinner with Strangers last night I met Phil Pratt-Szeliga, who created the Rootbeer compiler for running Java code on GPUs: S3058 - Rootbeer: Seamlessly Using GPUs from Java (Phil Pratt-Szeliga, Syracuse), Wednesday, 9:30am in 231. That covers the Libraries, OpenACC Directives, and Programming Language solutions: the three approaches to accelerating your applications.

GPU Architecture

GPU: Massively Parallel Coprocessor. A GPU is a coprocessor to the CPU (host), has its own DRAM, and runs 1000s of threads in parallel. Single precision: 4.58 TFlop/s; double precision: 1.31 TFlop/s.

Heterogeneous Parallel Computing. The application's Logic() runs on the CPU, its Compute() runs on the GPU. The CPU is latency-optimized for fast serial processing: a processor optimized to minimize latency for serial programs. The GPU is throughput-optimized for fast parallel processing.

GPU in Computer System. The GPU, with its own DRAM, is connected to the CPU chipset by PCIe: 16 GB/s one way, 32 GB/s both ways. The CPU has its own DDR3 DRAM.
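To make that bandwidth figure concrete, here is a small timing sketch (my addition, not from the slides) that measures host-to-device copy throughput with CUDA events; pinned host memory is used so the transfer can approach the full PCIe rate:

#include <stdio.h>

int main()
{
    const size_t nbytes = 256 * 1024 * 1024;        // 256 MB test buffer

    float *h_buf = 0, *d_buf = 0;
    cudaMallocHost(&h_buf, nbytes);                 // pinned (page-locked) host memory
    cudaMalloc(&d_buf, nbytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, h_buf, nbytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);         // elapsed time in milliseconds
    printf("Host->Device: %.1f GB/s\n", (nbytes / 1.0e9) / (ms / 1000.0));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}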

GPU High-Level View. A GPU consists of a set of Streaming Multiprocessors (SMs), each containing a set of CUDA cores, plus global memory.

GK110 SM. Control unit: 4 warp schedulers, 8 instruction dispatchers. Execution units: 192 single-precision CUDA cores, 64 double-precision units, 32 SFUs, 32 LD/ST units (192 cores in total per SM, 64 of which handle double precision). Memory: 64K 32-bit registers; caches: L1 + shared memory (64 KB), texture cache, constant cache.

Kepler/Fermi Memory Hierarchy. Three levels, very similar to a CPU: registers (which spill to local memory); caches (shared memory, L1 cache, L2 cache, constant cache, texture cache); global memory. Which level to use is guided by the application's access pattern.

Kepler/Fermi Memory Hierarchy (diagram): each SM (SM-0, SM-1, ..., SM-N) has its own registers (a 32K x 32-bit register file per SM), constant cache, L1/shared memory, and texture cache; all SMs share the L2 cache, which sits in front of global memory.

Basic Concepts. GPU computing is all about two things: (1) transferring data between CPU memory and GPU memory across the PCIe bus, and (2) offloading computation, i.e. doing the parallel computing on the GPU. This hardware design means a GPU program naturally divides into these two parts.

GPU Programming Basics

How To Get Started. CUDA C/C++: download the CUDA drivers, compilers & samples (all-in-one package) free from http://developer.nvidia.com/cuda/cuda-downloads. CUDA Fortran: PGI. OpenACC: PGI, CAPS, Cray.

CUDA Programming Basics. Hello World: basic syntax, compile & run. GPU memory management: malloc/free, memcpy. Writing parallel kernels: threads & blocks, memory hierarchy.

Heterogeneous Computing. A CUDA C program executes on both the CPU and the GPU, similar to OpenMP's fork-join pattern: serial code runs sequentially on the host, while accelerated parallel kernels (Kernel0<<<>>>(), Kernel1<<<>>>()) are launched on the device as grids of thread blocks (the slide shows Grid 0 and Grid 1, each made up of blocks such as Block (0,0) through Block (2,1)). CUDA consists of simple extensions to C/C++.

Hello World on CPU. hello_world.c:

#include <stdio.h>

void hello_world_kernel()
{
    printf("Hello World\n");
}

int main()
{
    hello_world_kernel();
}

Compile & run:
gcc hello_world.c
./a.out

Hello World on GPU. hello_world.cu:

#include <stdio.h>

__global__ void hello_world_kernel()
{
    printf("Hello World\n");
}

int main()
{
    hello_world_kernel<<<1,1>>>();
}

Compile & run:
nvcc hello_world.cu
./a.out

Hello World on GPU, what changed: the CUDA kernel lives within a .cu file; .cu files are compiled by nvcc; CUDA kernels are preceded by "__global__"; CUDA kernels are launched with "<<<...,...>>>".
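One practical aside, added here and not from the original slides: when a kernel's only output is printf, the host can exit before the device's print buffer is flushed, so it is common to synchronize before returning. A minimal sketch, assuming the same hello_world_kernel as above:

#include <stdio.h>

__global__ void hello_world_kernel()
{
    printf("Hello World\n");
}

int main()
{
    hello_world_kernel<<<1,1>>>();
    // Wait for the kernel to finish so its printf output is flushed
    // before the program exits.
    cudaDeviceSynchronize();
    return 0;
}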

Memory Spaces. The CPU and GPU have separate memory spaces, and data is moved across the PCIe bus. Use functions to allocate/set/copy memory on the GPU; they are very similar to the corresponding C functions.

CUDA C/C++ Memory Allocation / Release. The host (CPU) manages device (GPU) memory:
cudaMalloc(void** pointer, size_t nbytes)
cudaMemset(void* pointer, int value, size_t count)
cudaFree(void* pointer)

Example:
int nbytes = 1024*sizeof(int);
int *d_a = 0;
cudaMalloc( (void**)&d_a, nbytes );
cudaMemset( d_a, 0, nbytes );
cudaFree( d_a );

Data Copies. cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction): returns after the copy is complete; blocks the CPU thread until all bytes have been copied; doesn't start copying until previous CUDA calls complete. The direction is one of cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice. Non-blocking memcopies (cudaMemcpyAsync) are also provided.
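The slide notes that non-blocking copies exist; here is a minimal sketch of that variant (my addition, not from the slides), using cudaMemcpyAsync on a stream together with pinned host memory, which asynchronous copies need in order to overlap with CPU work:

int main()
{
    const int n = 1024;
    const size_t nbytes = n * sizeof(int);

    int *h_a = 0, *d_a = 0;
    cudaMallocHost(&h_a, nbytes);      // pinned (page-locked) host memory
    cudaMalloc(&d_a, nbytes);

    for (int i = 0; i < n; ++i) h_a[i] = i;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Returns to the CPU immediately; the copy proceeds in the background.
    cudaMemcpyAsync(d_a, h_a, nbytes, cudaMemcpyHostToDevice, stream);

    // ... the CPU could do independent work here ...

    cudaStreamSynchronize(stream);     // wait until the copy has finished

    cudaStreamDestroy(stream);
    cudaFree(d_a);
    cudaFreeHost(h_a);
    return 0;
}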

Code Walkthrough 1: allocate CPU memory for n integers; allocate GPU memory for n integers; initialize the GPU memory to 0s; copy from GPU to CPU; print the values.

Code Walkthrough 1 (full program):

#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a = 0, *h_a = 0;              // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    cudaMemset( d_a, 0, num_bytes );
    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int i=0; i<dimx; i++)
        printf("%d ", h_a[i]);
    printf("\n");

    free( h_a );
    cudaFree( d_a );

    return 0;
}

Compile & Run:
nvcc main.cu
./a.out
Output: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Thread Hierarchy. A 2-level hierarchy: blocks and grid. Block = a group of up to 1024 threads; grid = all blocks for a given kernel launch. E.g. 72 threads total with blockDim=12, gridDim=6. Threads within a block can synchronize their execution and communicate via shared memory. The sizes of the grid and blocks are specified at kernel launch: dim3 grid(6,1,1), block(12,1,1); kernel<<<grid, block>>>(...);
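A minimal sketch of such a launch, added for illustration (the kernel name and its work are hypothetical, not from the slides):

#include <stdio.h>

__global__ void fill_with_index(int *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    out[idx] = idx;
}

int main()
{
    const int num_blocks = 6, threads_per_block = 12;  // 72 threads total
    const int n = num_blocks * threads_per_block;

    int *d_out = 0;
    cudaMalloc(&d_out, n * sizeof(int));

    dim3 grid(num_blocks, 1, 1), block(threads_per_block, 1, 1);
    fill_with_index<<<grid, block>>>(d_out);

    int h_out[72];
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d %d ... %d\n", h_out[0], h_out[1], h_out[n-1]);   // prints 0 1 ... 71

    cudaFree(d_out);
    return 0;
}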

IDs and Dimensions. Threads have 3D IDs, unique within a block; blocks have 3D IDs, unique within a grid. Built-in variables: threadIdx (thread index within its block), blockIdx (block index within the grid), blockDim (block dimensions), gridDim (grid dimensions). (The slide diagram shows a grid of blocks and, inside Block (1,1), a 5x3 arrangement of threads.)
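Since the IDs are multi-dimensional, a 2D index computation is common; a short illustrative sketch (my own kernel, not from the slides), with the launch shown in a comment:

__global__ void index2d(int *out, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    out[y * width + x] = y * width + x;              // linearized 2D index
}

// Possible launch (assuming width and height are multiples of 16):
// dim3 block(16, 16);
// dim3 grid(width / 16, height / 16);
// index2d<<<grid, block>>>(d_out, width);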

GPU and Programming Model (software-to-hardware mapping). A thread is executed by a scalar processor (CUDA core); a thread block is executed on a multiprocessor (SM); a kernel is launched as a grid of thread blocks, which runs on the whole device.
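To see this mapping on a concrete machine, one can query the device. A small sketch, added for illustration:

#include <stdio.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("Device: %s\n", prop.name);
    printf("Multiprocessors (SMs): %d\n", prop.multiProcessorCount);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Warp size: %d\n", prop.warpSize);
    return 0;
}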

Which thread do I belong to? With blockDim.x = 4 and gridDim.x = 4:
threadIdx.x: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
blockIdx.x: 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3
idx = blockIdx.x*blockDim.x + threadIdx.x: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Code Walkthrough 2: Simple Kernel. Allocate memory on the GPU; copy the data from CPU to GPU; write a kernel to perform a vector addition; copy the result back to the CPU; free the memory.

Vector Addition using C:

void vec_add(float *x, float *y, int n)
{
    for (int i=0; i<n; ++i)
        y[i] = x[i] + y[i];
}

float *x = (float*)malloc(n*sizeof(float));
float *y = (float*)malloc(n*sizeof(float));
vec_add(x, y, n);
free(x);
free(y);

(Speaker note: change to a 1M-element vector.)

Vector Addition using CUDA C:

// __global__: keyword for a CUDA kernel
__global__ void vec_add(float *x, float *y, int n)
{
    // Thread index computation replaces the loop
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    y[i] = x[i] + y[i];
}

float *d_x, *d_y;
cudaMalloc(&d_x, n*sizeof(float));
cudaMalloc(&d_y, n*sizeof(float));
cudaMemcpy(d_x, x, n*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, n*sizeof(float), cudaMemcpyHostToDevice);
vec_add<<<n/128,128>>>(d_x, d_y, n);
cudaMemcpy(y, d_y, n*sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_x);
cudaFree(d_y);
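The launch above assumes n is an exact multiple of 128. A hedged variant (my addition, not from the slides) rounds the grid size up and guards against out-of-range threads:

__global__ void vec_add(float *x, float *y, int n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)                       // guard: the last block may have extra threads
        y[i] = x[i] + y[i];
}

// Launch with the grid rounded up so every element is covered:
int threads = 128;
int blocks  = (n + threads - 1) / threads;
vec_add<<<blocks, threads>>>(d_x, d_y, n);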

GPU Memory Model Review: which memory spaces are visible to which threads? Each thread has per-thread local memory; each block has per-block shared memory; sequential kernels (Kernel 0, Kernel 1, ...) share per-device global memory.

Global Memory. Per-device, shared by sequential kernels (Kernel 0, Kernel 1, ...). Data lifetime = from allocation to deallocation. Accessible by all threads as well as the host (CPU).

Shared Memory. C/C++: __shared__ int a[SIZE]; Allocated per thread block; data lifetime = block lifetime. Accessible by any thread in the corresponding thread block; not accessible to other thread blocks.
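A small sketch of using shared memory (my own illustrative kernel, not from the slides): each block reverses its own chunk of an array by staging it in shared memory.

#define BLOCK_SIZE 64

__global__ void reverse_within_block(int *data)
{
    __shared__ int tile[BLOCK_SIZE];            // one copy per thread block

    int globIdx = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = data[globIdx];          // stage this block's chunk
    __syncthreads();                            // wait until every thread has written

    // Read the staged values back in reverse order within the block.
    data[globIdx] = tile[blockDim.x - 1 - threadIdx.x];
}

// Launch (assuming the array length is a multiple of BLOCK_SIZE):
// reverse_within_block<<<n / BLOCK_SIZE, BLOCK_SIZE>>>(d_data);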

Per-thread Local Storage (registers). Automatic variables (scalar/array) inside kernels. Data lifetime = thread lifetime. Accessible only by the thread that declares it.

Example of Using Shared Memory: applying a 1D stencil to a 1D array of elements. Each output element is the sum of all input elements within a radius, i.e. it computes the sum of all input data within a given radius of each position. For example, for radius = 3, each output element is the sum of 7 input elements (radius elements on the left, the element itself, and radius elements on the right).

Example of Using Shared Memory (diagram): for the input window 1 2 3 4 5 6 7 the corresponding output element is 28 (the sum of those seven values); the window then slides to 2 3 4 5 6 7 8, then 3 4 5 6 7 8 ..., and so on.

Kernel Code Using Global Memory (one element per thread):

__global__ void stencil(int* in, int* out)
{
    int globIdx = blockIdx.x * blockDim.x + threadIdx.x;
    int value = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        value += in[globIdx + offset];
    out[globIdx] = value;
}

There are a lot of redundant reads in neighboring threads: not an optimized approach.

Implementation with Shared Memory (one element per thread): read (BLOCK_SIZE + 2 * RADIUS) elements from global memory into shared memory; compute BLOCK_SIZE output elements in shared memory; write BLOCK_SIZE output elements back to global memory. The tile consists of the BLOCK_SIZE input elements corresponding to the output elements plus a "halo" of RADIUS elements on the left and RADIUS elements on the right.

Kernel Code (RADIUS = 3, BLOCK_SIZE = 16):

__global__ void stencil(int* in, int* out)
{
    __shared__ int shared[BLOCK_SIZE + 2 * RADIUS];

    int globIdx = blockIdx.x * blockDim.x + threadIdx.x;
    int locIdx = threadIdx.x + RADIUS;

    shared[locIdx] = in[globIdx];
    if (threadIdx.x < RADIUS)
    {
        shared[locIdx - RADIUS] = in[globIdx - RADIUS];          // left halo
        shared[locIdx + BLOCK_SIZE] = in[globIdx + BLOCK_SIZE];  // right halo
    }
    __syncthreads();

    int value = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        value += shared[locIdx + offset];
    out[globIdx] = value;
}
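A possible host-side driver for this kernel (my sketch, with assumed names, not from the slides): it pads the input with RADIUS zeros on each side so that the halo loads of the first and last blocks stay in bounds, and assumes the stencil kernel above is defined in the same file.

#include <stdlib.h>
#define RADIUS 3
#define BLOCK_SIZE 16

int main()
{
    const int n = 1024;                                       // number of output elements
    const size_t in_bytes  = (n + 2 * RADIUS) * sizeof(int);
    const size_t out_bytes = n * sizeof(int);

    int *h_in  = (int*)calloc(n + 2 * RADIUS, sizeof(int));   // zero padding included
    int *h_out = (int*)malloc(out_bytes);
    for (int i = 0; i < n; ++i) h_in[RADIUS + i] = 1;          // interior data

    int *d_in = 0, *d_out = 0;
    cudaMalloc(&d_in, in_bytes);
    cudaMalloc(&d_out, out_bytes);
    cudaMemcpy(d_in, h_in, in_bytes, cudaMemcpyHostToDevice);

    // Offset the input pointer so that in[globIdx - RADIUS] is valid for globIdx = 0.
    stencil<<<n / BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out);

    cudaMemcpy(h_out, d_out, out_bytes, cudaMemcpyDeviceToHost);
    // Each interior output element is now 2*RADIUS + 1 = 7.

    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out);
    return 0;
}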

Thread Synchronization Function: void __syncthreads(); synchronizes all threads in a thread block. Since threads are scheduled at run time, a thread that reaches this point waits; once all threads in the block have reached it, execution resumes normally. It is used to avoid RAW / WAR / WAW hazards when accessing shared memory. It should be used in conditional code only if the conditional is uniform across the entire thread block; otherwise it may lead to deadlock.
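A brief sketch (my addition) contrasting a safe, block-uniform conditional with an unsafe, thread-divergent one:

__global__ void sync_examples(int *data)
{
    __shared__ int tile[256];
    int tid = threadIdx.x;
    tile[tid] = data[blockIdx.x * blockDim.x + tid];

    // Safe: the condition depends only on blockIdx, so it is uniform across
    // the entire thread block; either every thread in the block executes the
    // __syncthreads() or none of them does.
    if (blockIdx.x % 2 == 0)
    {
        __syncthreads();
        // ... use tile[] ...
    }

    // Unsafe (do NOT do this): the condition differs between threads of the
    // same block, so only some threads reach the barrier, which can deadlock.
    // if (tid < 128)
    // {
    //     __syncthreads();
    // }
}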

Kepler/Fermi Memory Hierarchy (revisited before discussing the constant and texture caches): each SM has its own registers (a 32K x 32-bit register file per SM), constant cache, L1/shared memory, and texture cache; all SMs share the L2 cache and global memory.

Constant Cache. Global variables marked __constant__ are constant and can't be changed from device code. They are located in global memory but are cached by the constant cache. Good when many threads access the same address.

__constant__ int a = 10;

__global__ void kernel()
{
    a++;   // error: device code cannot write a __constant__ variable
}
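Although device code cannot write a __constant__ variable, the host can set it before a kernel launch; a minimal sketch (my addition, not from the slides) using cudaMemcpyToSymbol:

__constant__ float coeff[4];           // constant-cached coefficients

__global__ void apply_coeff(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Every thread reads the same addresses: a good fit for the constant cache.
    x[i] = coeff[0] + coeff[1] * x[i];
}

// Host side:
// float h_coeff[4] = {1.0f, 2.0f, 0.0f, 0.0f};
// cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));
// apply_coeff<<<blocks, threads>>>(d_x);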

Texture Cache (SMX). Saving data as a texture provides hardware-accelerated filtered sampling of data (1D, 2D, 3D); a read-only data cache holds the fetched samples and is backed by the L2 cache. Why use it? It is a separate pipeline from shared/L1, has the highest miss bandwidth, and is flexible, e.g. it allows unaligned accesses. (Speaker notes: explain what a texture is, and the texture performance improvement of SMX over the earlier SM.)

Texture Cache Unlocked in GK110 SMX. A new path for compute was added that avoids the texture unit, allows a global address to be fetched and cached in the read-only data cache (backed by L2), and eliminates texture setup. It is managed automatically by the compiler; "const __restrict" indicates eligibility.

const __restrict. Annotate eligible kernel parameters with const __restrict; the compiler will automatically map loads to use the read-only data cache path.

__global__ void saxpy(float x, float y, const float * __restrict input, float * output)
{
    size_t offset = threadIdx.x + (blockIdx.x * blockDim.x);

    // Compiler will automatically use texture (the read-only data cache) for "input"
    output[offset] = (input[offset] * x) + y;
}

Speaker notes: what __restrict means, and why it must be used together with const. In C99, __restrict promises no aliasing between pointers. This means the compiler knows it is safe to read via the texture cache, because previous writes will not have hit a readable location; the texture cache is not coherent with LSU writes, hence the need for __restrict. "const __restrict" together guarantees read-only, non-aliased memory access, while a non-const "__restrict" marks a write path. Note also that this helps the compiler optimize in general, so it is good practice with or without texture use.
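For completeness, a possible launch of this kernel (my sketch; the names and sizes are assumptions, not from the slides):

// Assuming d_in and d_out are device arrays of n floats:
int threads = 256;
int blocks  = n / threads;   // assumes n is a multiple of threads (the kernel has no bounds check)
saxpy<<<blocks, threads>>>(2.0f, 1.0f, d_in, d_out);   // out = 2*in + 1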

References. Manuals: Programming Guide, Best Practice Guide. Books: CUDA by Example (Tsinghua University Press). Training videos: GTC talks online (optimization, advanced optimization, plus hundreds of other GPU computing talks) at http://www.gputechconf.com/gtcnew/on-demand-gtc.php, and NVIDIA GPU Computing webinars at http://developer.nvidia.com/gpu-computing-webinars. Forum: http://cudazone.nvidia.cn/forum/forum.php