
1 100M CUDA GPUs
  Oil & Gas, Finance, Medical, Biophysics, Numerics, Audio, Video, Imaging
  Heterogeneous Computing: CPU + GPU
  Joy Lee @ NVIDIA

2 GPU & CUDA Features

3 Comparing GPU & CPU Systems
  Memory bandwidth
    Host memory (CPU): ~10 GB/s
    Device memory (GPU): ~100 GB/s
    Bottleneck: PCIe at 3~5 GB/s
  More computing cores
    CPU: 4~32 cores (peak perf: 10~100 GFLOPS)
    GPU: 32~512 cores (peak perf: >1 TFLOPS)
  Large-scale parallelism
    Traditional (MPI, OpenMP, pthreads, …): about 10~1K threads
    CUDA: about 1K~10M threads (the algorithmic concerns may differ)

4 Graphics Card (diagram): the GPU chip plus GDDR3/5 DRAM (the device memory) sit on the card, connected to the host over PCIe.

5 Memory Bandwidth (diagram): the CPU (Core 1, Core 2, shared L2 cache) reaches main memory over a 64-bit FSB at 12.8 GB/s; the GPU's 240 cores reach GDDR3 device memory over a 512-bit, 8x faster interface at 102 GB/s.

6 Bandwidth analogy
  Highway: device memory, ~100 GB/s
  PCIe gen 2 x16: 3~5 GB/s
  Street: host memory, ~10 GB/s

7 NVIDIA's GPUs: Ever-Increasing Performance (chart legend)
  NV30 = GeForce FX 5800, NV35 = GeForce FX 5950 Ultra, NV40 = GeForce 6800 Ultra,
  G70 = GeForce 7800 GTX, G71 = GeForce 7900 GTX, G80 = GeForce 8800 GTX,
  G9x = GeForce 9800 GTX, T10 = Tesla 10-series

8 Hierarchy of Concurrent Threads
  Grid: one CUDA kernel launch (a kernel is a function executing on the GPU)
    One grid contains many blocks; practically 64~1024 blocks for good performance
  Block
    One block contains many threads; practically 32~512 threads for good performance
    blockIdx: block index within the grid (0, 1, 2, 3, …)
    Threads in the same block can cooperate: synchronize, and share data through fast on-die memory called shared memory
  Thread
    threadIdx: thread index within the block (0, 1, 2, 3, …)

9 IDs and Dimensions
  Built-in variables: threadIdx, blockIdx, blockDim, gridDim
  Dimensions are set at launch time and can be unique for each launch
  Blocks: 2D IDs, unique within a grid
  Threads: 3D IDs, unique within a block
  (Diagram: a grid of 3x2 blocks; block (1, 1) expanded into a 5x3 array of threads.)
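
As a quick illustration (not from the slides; the kernel name fill2D and its arguments are made up), the built-in variables combine into a global 2D index like this:

    __global__ void fill2D(float *out, int width, int height)
    {
        // Global coordinates of this thread within the whole grid.
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)                        // guard the edge blocks
            out[y * width + x] = (float)(y * width + x);    // row-major element index
    }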

10 CUDA Environment

11 CUDA Environment
  Driver: runs CUDA on specific cards and CUDA versions
  Toolkit: compiler, libraries, debugger, profiler, …
  SDK: CUDA examples and some useful tools
  All are free downloads from the NVIDIA developer zone

12 CUDA Toolkit
  Compiler: CUDA is based on C/C++ (available for all supported OSs)
  Libraries: BLAS, FFT, RAND, SPARSE
  Profiler: samples signals on the GPU as an optimization reference
    Memory access parameters
    Execution (serialization, divergence)
  Debugger: runs on the GPU (GDB, Nexus, … etc.)
  Documents: CUDA Programming Guide, CUDA Best Practices Guide, API & library reference, …

13 CUDA SDK
  Useful tools
    deviceQuery: queries the specs of the devices available on this machine
    bandwidthTest: tests the bandwidth of PCIe and device memory
  Valuable materials for developers: matrix multiplication, transpose, tree reduction, simple BLAS, multi-GPU sample, …
  Valuable middleware for developers: video codecs (MPEG, H.264, …), radix sort, …

14 CUDA Programming Model

15 Five Simple Steps to Program the GPU with CUDA
  Step 1: Allocate device memory
  Step 2: Upload input data from host to device memory
  Step 3: Call CUDA kernel(s) (many kernels can be called to manipulate the data while it stays in device memory)
  Step 4: Download output data from device to host memory
  Step 5: Free device memory

16 Heterogeneous Programming
  CUDA = a serial program with parallel kernels, all in C
  Serial C code executes in a CPU thread
  Parallel kernel C code executes in thread blocks across multiple processing elements
  (Diagram: serial code on the CPU alternates with parallel kernels on the GPU.)

17 GPU Memory Allocation / Release
  The host (CPU) manages GPU memory:
    cudaMalloc(void **pointer, size_t nbytes)
    cudaFree(void *pointer)

    int m = 1024;
    int nbytes = m * sizeof(int);
    int *devPtr = 0;
    cudaMalloc((void**)&devPtr, nbytes);
    cudaFree(devPtr);
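
A slightly fuller sketch than the snippet above (names and sizes are illustrative), checking the return code, a habit slide 30 comes back to:

    #include <cstdio>

    void alloc_demo(void)
    {
        size_t nbytes = 1024 * sizeof(float);
        float *devPtr = 0;

        cudaError_t err = cudaMalloc((void**)&devPtr, nbytes);   // host call, device allocation
        if (err != cudaSuccess) {
            printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
            return;
        }
        /* ... launch kernels that read and write devPtr ... */
        cudaFree(devPtr);                                        // release device memory
    }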

18 Data Copies
  cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
    Returns after the copy is complete (blocks the CPU thread)
    Doesn't start copying until all previous CUDA calls have completed
  enum cudaMemcpyKind: cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice
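
A minimal round trip under these rules (buffer names are illustrative): upload the input, run kernels, download the result.

    void copy_demo(void)
    {
        float h_in[256], h_out[256];
        float *d_buf = 0;
        size_t nbytes = 256 * sizeof(float);

        for (int i = 0; i < 256; ++i) h_in[i] = (float)i;

        cudaMalloc((void**)&d_buf, nbytes);
        cudaMemcpy(d_buf, h_in, nbytes, cudaMemcpyHostToDevice);   // blocks until the copy is done
        /* ... kernels operating on d_buf ... */
        cudaMemcpy(h_out, d_buf, nbytes, cudaMemcpyDeviceToHost);  // also blocking
        cudaFree(d_buf);
    }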

19 Launching Kernels on the GPU
  Launch parameters:
    grid dimensions (up to 2D)
    thread-block dimensions (up to 3D)
  Launch a 1D kernel:
    kernel<<<n_blocks, block_size>>>(...);
  Launch a 2D kernel:
    dim3 grid(16, 16);
    dim3 block(16, 16);
    kernel<<<grid, block>>>(...);
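
For a 2D problem whose size is not a multiple of the block size, the usual pattern is ceiling division, sketched here with a made-up kernel name (blur_kernel):

    __global__ void blur_kernel(float *img, int width, int height);   // defined elsewhere

    void launch_blur(float *d_img, int width, int height)
    {
        dim3 block(16, 16);                             // threads per block
        dim3 grid((width  + block.x - 1) / block.x,     // round up so every
                  (height + block.y - 1) / block.y);    // pixel gets a thread
        blur_kernel<<<grid, block>>>(d_img, width, height);
    }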

20 Sample Code
  Step 1: Allocate device memory
    float *devPtr = 0;
    cudaMalloc((void**)&devPtr, sizeInBytes);
  Step 2: Copy data from host to device
    cudaMemcpy(devPtr, hostPtr, numOfBytes, cudaMemcpyHostToDevice);
  Step 3: Call CUDA kernel(s)
    kernel_function<<<grid, block>>>(devPtr);
  Step 4: Copy data from device back to host
    cudaMemcpy(hostPtr, devPtr, numOfBytes, cudaMemcpyDeviceToHost);
  Step 5: Free device memory
    cudaFree(devPtr);

21 Code Executed on the GPU
  A C function with some restrictions:
    Scalar inputs can be passed directly as parameters
    Array variables must reside in device memory
    Return type is void
  Must be declared with a qualifier:
    __global__: launched by the CPU, cannot be called from the GPU, must return void
    __device__: called from other GPU functions, cannot be launched by the CPU
    __host__: executed by the CPU
  Built-in IDs: gridDim, blockDim, blockIdx, threadIdx
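
A small sketch of the three qualifiers working together (all names here are illustrative):

    __device__ float squared(float v)                // callable only from GPU code
    {
        return v * v;
    }

    __global__ void square_all(float *data, int n)   // launched by the CPU, returns void
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = squared(data[i]);              // n passed by value; data must point to device memory
    }

    __host__ float squared_on_cpu(float v)           // ordinary CPU function (__host__ is the default)
    {
        return v * v;
    }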

22 Sample Kernel Code

  Standard C code (serial; must loop n times):

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

  Parallel CUDA code (each thread handles one element):

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a*x[i] + y[i];
    }

    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
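
The slide shows only the kernel and the launch; a sketch of the surrounding host code (the buffer names d_x, d_y are illustrative) would follow the five steps from slide 15:

    void saxpy_on_gpu(int n, float a, float *h_x, float *h_y)
    {
        float *d_x = 0, *d_y = 0;
        size_t nbytes = n * sizeof(float);

        cudaMalloc((void**)&d_x, nbytes);
        cudaMalloc((void**)&d_y, nbytes);
        cudaMemcpy(d_x, h_x, nbytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, nbytes, cudaMemcpyHostToDevice);

        int nblocks = (n + 255) / 256;                     // 256 threads per block
        saxpy_parallel<<<nblocks, 256>>>(n, a, d_x, d_y);

        cudaMemcpy(h_y, d_y, nbytes, cudaMemcpyDeviceToHost);
        cudaFree(d_x);
        cudaFree(d_y);
    }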

23 Memory Scope
  Thread scope: each thread has its own local storage
  Block scope: each thread block has its own shared memory, accessible only by threads within that block
  Grid scope (global scope): accessible by all threads as well as by the host (CPU)

24 Memory System
  Thread scope
    Registers: on-die, fastest, the default
    Local memory: DRAM, not cached, slow (400~600 clocks)
  Block scope
    Shared memory: on-die, fast (4~6 clocks), qualifier __shared__
  Grid scope
    Global memory: DRAM, not cached, slow (400~600 clocks)
    Constant memory: on-die, small (64 KB total), qualifier __constant__
    Texture memory: read-only, DRAM + cache, fast on a cache hit
    Cache (Fermi only): R/W cache
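
A sketch showing where variables end up (the kernel and the constant c_scale are made up for illustration; the tile size assumes at most 256 threads per block):

    __constant__ float c_scale;                 // constant memory: written by the host before launch

    __global__ void scale_all(float *data, int n)
    {
        __shared__ float tile[256];             // shared memory: one copy per block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float v = 0.0f;                         // plain locals normally live in registers

        if (i < n) {
            tile[threadIdx.x] = data[i];        // data[] lives in global memory (DRAM)
            v = tile[threadIdx.x] * c_scale;
            data[i] = v;
        }
    }

The host would set c_scale with cudaMemcpyToSymbol before launching the kernel.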

25 Memory: Location & Scope (diagram): each thread has registers, each block has shared memory, and all blocks of a kernel share global memory.

26 Thread Synchronization Function
  void __syncthreads();
  Synchronizes all threads in a block
  Once all threads have reached this point, execution resumes normally
  Used to avoid RAW / WAR / WAW hazards when accessing shared memory
  Should be used in conditional code only if the conditional is uniform across the entire thread block
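
For example, the tree reduction mentioned among the SDK samples relies on __syncthreads() to separate the write and read phases of shared memory. The sketch below is not the SDK code; it assumes the block size is a power of two no larger than 256:

    __global__ void block_sum(const float *in, float *blockSums, int n)
    {
        __shared__ float s[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        s[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                              // every load done before anyone reads

        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                s[threadIdx.x] += s[threadIdx.x + stride];
            __syncthreads();                          // outside the if: uniform across the block
        }
        if (threadIdx.x == 0)
            blockSums[blockIdx.x] = s[0];             // one partial sum per block
    }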

27 Host Synchronization
  All kernel launches are asynchronous
    Control returns to the CPU immediately
    The kernel starts executing once all previous CUDA calls have completed
  cudaThreadSynchronize() syncs CPU & GPU
    Blocks until all previous CUDA calls complete
  Memcopies are synchronous
    Control returns to the CPU once the copy is complete
    The copy starts once all previous CUDA calls have completed
  Asynchronous CUDA calls provide:
    Non-blocking memcopies
    The ability to overlap memcopies and kernel execution

28 Synchronization Levels
  Block sync: sync all threads in a block
    Method: call __syncthreads() in the kernel
  Grid sync: sync all threads in the grid (kernel)
    Method: launch another kernel
  CPU-GPU sync: sync CPU & GPU
    Method: call cudaThreadSynchronize() in host C code
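
Putting the levels together in one host-side sketch (kernelA and kernelB are placeholders for any __global__ functions; block-level sync would live inside them as __syncthreads() calls):

    __global__ void kernelA(float *d);        // placeholder kernels, defined elsewhere
    __global__ void kernelB(float *d);

    void run_two_kernels(float *d_data, float *h_data, size_t nbytes, dim3 grid, dim3 block)
    {
        kernelA<<<grid, block>>>(d_data);     // launch returns to the CPU immediately
        kernelB<<<grid, block>>>(d_data);     // starts only after kernelA finishes:
                                              // back-to-back launches act as the grid-level sync
        cudaThreadSynchronize();              // CPU-GPU sync: host blocks until both are done
        cudaMemcpy(h_data, d_data, nbytes, cudaMemcpyDeviceToHost);
    }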

29 Device Management
  The CPU can query and select GPU devices:
    cudaGetDeviceCount(int *count)
    cudaSetDevice(int device)
    cudaGetDevice(int *current_device)
    cudaGetDeviceProperties(cudaDeviceProp *prop, int device)
    cudaChooseDevice(int *device, cudaDeviceProp *prop)
  Multi-GPU setup:
    Device 0 is used by default
    One CPU thread can control one GPU
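
A sketch that enumerates the devices and selects one (name and multiProcessorCount are fields of cudaDeviceProp):

    #include <cstdio>

    void pick_device(void)
    {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int d = 0; d < count; ++d) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);
            printf("Device %d: %s, %d multiprocessors\n",
                   d, prop.name, prop.multiProcessorCount);
        }
        cudaSetDevice(0);    // device 0 would be used by default anyway
    }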

30 CUDA Error Reporting to the CPU
  All CUDA calls return an error code (except kernel launches): cudaError_t
  cudaError_t cudaGetLastError(void)
    Returns the code for the last error ("no error" also has a code: cudaSuccess)
  char* cudaGetErrorString(cudaError_t code)
    Returns a null-terminated string describing the error

    printf("%s\n", cudaGetErrorString(cudaGetLastError()));
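
A common pattern built on these two calls is a checking macro; the CHECK name below is our own convention, not part of the CUDA API:

    #include <cstdio>
    #include <cstdlib>

    #define CHECK(call)                                                   \
        do {                                                              \
            cudaError_t err = (call);                                     \
            if (err != cudaSuccess) {                                     \
                printf("CUDA error: %s at %s:%d\n",                       \
                       cudaGetErrorString(err), __FILE__, __LINE__);      \
                exit(1);                                                  \
            }                                                             \
        } while (0)

    // Usage: CHECK(cudaMalloc((void**)&devPtr, nbytes));
    // Kernel launches return no status, so check right after the launch:
    //     kernel<<<grid, block>>>(...);
    //     CHECK(cudaGetLastError());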

31 Compiling CUDA
  NVCC splits a C/C++ CUDA application into CPU code and PTX code (a virtual ISA).
  A PTX-to-target compiler, the JIT compiler in the driver, then generates code for the physical GPU (G80, G200, …).

32 Hands-on Practice
  SDK install & make
  deviceQuery, bandwidthTest
  Hello CUDA
  Vector Add
  Tree Reduction

33 Thinking in parallel

34 Example: Vector Addition
  k = blockIdx.x * blockDim.x + threadIdx.x   (gridDim.x = 3, blockDim.x = 4)

    blockIdx.x        | 0  0  0  0 | 1  1  1  1 | 2  2  2  2
    threadIdx.x       | 0  1  2  3 | 0  1  2  3 | 0  1  2  3
    k (element index) | 0  1  2  3 | 4  5  6  7 | 8  9  10 11

  a[k] + b[k] → c[k]

35 Example: Vector Addition

  CPU:

    void add(float *a, float *b, float *c, int n)
    {
        for (int k = 0; k < n; k++)
            c[k] = a[k] + b[k];
    }

    void main()
    {
        .....
        add(a, b, c, n);
    }

  GPGPU:

    __global__ void Gadd(float *a, float *b, float *c, int n)
    {
        int k = blockIdx.x * blockDim.x + threadIdx.x;
        if (k < n)
            c[k] = a[k] + b[k];
    }

    void main()
    {
        .....
        int Block = 512;
        int Grid = n / Block + 1;
        Gadd<<<Grid, Block>>>(a, b, c, n);
    }
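
Slide 35's main() omits the device-side buffers; a fuller host sketch following the five steps from slide 15 (d_a, d_b, d_c are illustrative names) would look like this:

    void vector_add_on_gpu(float *a, float *b, float *c, int n)
    {
        float *d_a = 0, *d_b = 0, *d_c = 0;
        size_t nbytes = n * sizeof(float);

        cudaMalloc((void**)&d_a, nbytes);
        cudaMalloc((void**)&d_b, nbytes);
        cudaMalloc((void**)&d_c, nbytes);
        cudaMemcpy(d_a, a, nbytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, nbytes, cudaMemcpyHostToDevice);

        int Block = 512;
        int Grid = n / Block + 1;                  // same rounding as the slide
        Gadd<<<Grid, Block>>>(d_a, d_b, d_c, n);

        cudaMemcpy(c, d_c, nbytes, cudaMemcpyDeviceToHost);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    }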

36 Backup slides

37 PTX Example (SAXPY code)

    cvt.u32.u16   $blockid,   %ctaid.x;      // Calculate i from thread/block IDs
    cvt.u32.u16   $blocksize, %ntid.x;
    cvt.u32.u16   $tid,       %tid.x;
    mad24.lo.u32  $i, $blockid, $blocksize, $tid;
    ld.param.u32  $n, [N];                   // Nothing to do if n ≤ i
    setp.le.u32   $p1, $n, $i;
    @$p1 bra      $L_finish;
    mul.lo.u32    $offset, $i, 4;            // Load y[i]
    ld.param.u32  $yaddr, [Y];
    add.u32       $yaddr, $yaddr, $offset;
    ld.global.f32 $y_i, [$yaddr+0];
    ld.param.u32  $xaddr, [X];               // Load x[i]
    add.u32       $xaddr, $xaddr, $offset;
    ld.global.f32 $x_i, [$xaddr+0];
    ld.param.f32  $alpha, [ALPHA];           // Compute and store alpha*x[i] + y[i]
    mad.f32       $y_i, $alpha, $x_i, $y_i;
    st.global.f32 [$yaddr+0], $y_i;
$L_finish:
    exit;

38 Scalability
  Thread blocks can run in any order, concurrently or sequentially
  This facilitates scaling of the same code across many devices

39 Hardware Multithreading (diagram): a kernel launched by the host is spread across the device's processor array; each multiprocessor has SPs, an MT issue unit, and shared memory, and all of them share device memory.

40 Example: Matrix Transpose
  Coalesced read/write of device memory gives the best memory bandwidth utilization
  Use shared memory to share data between threads in a block
  Use 2D blockDim & gridDim
  Over 20x faster than the host for large matrices (e.g. 3000x4000)

  Host invocation:
    dim3 grid(n/16 + 1, m/16 + 1, 1);
    dim3 block(16, 16, 1);
    Transpose<<<grid, block>>>(b, a, m, n);

41 Example: Matrix Transpose

    __global__ void Transpose(float *b, float *a, int m, int n)
    {
        __shared__ float s[256];       // shared memory (visible to all threads in the block)
        int x = blockIdx.x*blockDim.x + threadIdx.x;        // compute index (x,y)
        int y = blockIdx.y*blockDim.y + threadIdx.y;
        if (y < m && x < n) {
            int i = y*n + x;           // input address (lowest index follows threadIdx.x)
            int t = threadIdx.y*blockDim.x + threadIdx.x;   // shared-memory address
            s[t] = a[i];               // coalesced input (read a 16x16 tile)
        }
        __syncthreads();               // synchronize threads in the block
        x = blockIdx.x*blockDim.x + threadIdx.y;            // swap threadIdx (x,y) usage
        y = blockIdx.y*blockDim.y + threadIdx.x;
        if (y < m && x < n) {
            int o = x*m + y;           // output address (lowest index follows threadIdx.x)
            int t = threadIdx.x*blockDim.y + threadIdx.y;   // shared-memory address
            b[o] = s[t];               // coalesced output (write a 16x16 tile)
        }
    }

42 Hardware (SPA, Streaming Processors Array) (diagram): the GPU is an array of TPCs.

43 Hardware (TPC, Texture/Processor Cluster) (diagram).

44 Hardware (SM, Streaming Multiprocessor)
  Warp = 32 threads
  SP: MUL & MAD
  SFU: sin(x), cos(x), …
  One block is divided into multiple warps
  An SM can execute one warp in 4 clocks

45 SPA, Streaming Processors Array (diagram labels): double-precision unit, Special Function Unit (SFU), TP array, shared memory.

46 How to Use So Many Cores?
  240 SP thread processors, 30 DP thread processors
  Each is a full scalar processor
  IEEE 754 double-precision floating point
  (Diagram: a Thread Processor Array (TPA) contains Thread Processors (TPs) with FP/Int units, a multi-banked register file, SpcOps ALUs, a double-precision unit, an SFU, and shared memory.)

