1 100M CUDA GPUs. Application areas: Oil & Gas, Finance, Medical, Biophysics, Numerics, Audio, Video, Imaging. Heterogeneous Computing: CPU + GPU. Joy Lee, Senior SW Engineer, Development & Technology

2 Optimization

3 3 Steps to Port your C/C++ code to CUDA
Step 1: Single Thread. Port your C/C++ code to a single-thread CUDA kernel and make sure the output is correct. Focus on data movement between device and host memory.
Step 2: Single Block. Port the single-thread kernel to a single-block kernel and make sure the output is still correct. Focus on parallelizing with the thread index.
Step 3: Multi Blocks & Threads. Port the single-block kernel to a multi-block kernel and make sure the output is still correct. Focus on the two-layer index system (block index plus thread index) and determine the best index utilization.
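A minimal sketch of what Step 3 typically ends up looking like, using a hypothetical vector-add kernel (vec_add, launch_vec_add, and the 256-thread block size are illustrative, not from the slides): the global index is built from the two-layer index system of block index and thread index.

__global__ void vec_add(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // block index + thread index
    if (i < n)                                      // guard the last, partial block
        c[i] = a[i] + b[i];
}

void launch_vec_add(const float* d_a, const float* d_b, float* d_c, int n)
{
    int threads = 256;                           // threads per block
    int blocks  = (n + threads - 1) / threads;   // enough blocks to cover n elements
    vec_add<<<blocks, threads>>>(d_a, d_b, d_c, n);
}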

4 3 Steps to optimize your CUDA kernels
Step 1: Set up timers to measure kernel time. Use CUDA events to measure kernel execution time; use clock() inside the kernel to measure the execution-time weight of each part in detail.
Step 2: Kernel parts & bottleneck division. Analyze your kernel and divide it into multiple parts; determine the bottleneck of each part; use the profiler to help identify bottlenecks.
Step 3: Part-by-part optimization. Optimize each part one by one, starting from the most time-consuming part; make sure the output is still correct after each optimization; make sure the kernel execution time actually becomes shorter after each optimization.
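A minimal sketch of the CUDA-event timing mentioned in Step 1 (my_kernel, grid, and block are placeholders for your own kernel and launch configuration):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
my_kernel<<<grid, block>>>(/* args */);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);               // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);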

5 Bottlenecks Division I
PCIe bound: suffers from too many cudaMemcpy transfers between host and device memory.
Memory bound: suffers from limited global (device) memory bandwidth or a non-coalesced memory access pattern.
Computing bound: suffers from the arithmetic throughput limit (Flops).
Branch bound: suffers from too many branch conditions.
Thread bound: suffers from too few threads.
Register bound: suffers from too few available registers (coupled with the thread bound).

6 Bottlenecks Division II
PCIe bound: try keeping all data in device memory as long as possible; try using CUDA streams for asynchronous data movement.
Memory bound: try using texture memory, shared memory, constant memory, or the cache (Fermi and later) to reduce direct global memory I/O.
Computing bound: try reducing the operations in your algorithm; try using intrinsic functions; try turning on the -use_fast_math compiler option. If you have tried everything and still hit the hardware limit, this part is essentially optimized already; the remaining option is a faster card.
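A minimal sketch of the CUDA-stream approach to the PCIe bound (buffer names, sizes, and the process kernel are hypothetical; the host buffers must be allocated with cudaMallocHost for the copies to actually run asynchronously):

cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

cudaMemcpyAsync(d_in0, h_in0, bytes, cudaMemcpyHostToDevice, s0);
cudaMemcpyAsync(d_in1, h_in1, bytes, cudaMemcpyHostToDevice, s1);

process<<<grid, block, 0, s0>>>(d_in0, d_out0);   // kernels launched into each stream
process<<<grid, block, 0, s1>>>(d_in1, d_out1);

cudaMemcpyAsync(h_out0, d_out0, bytes, cudaMemcpyDeviceToHost, s0);
cudaMemcpyAsync(h_out1, d_out1, bytes, cudaMemcpyDeviceToHost, s1);

cudaStreamSynchronize(s0);
cudaStreamSynchronize(s1);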

7 Bottlenecks Division III
Branch bound: reduce the number of branches, especially diverged branches.
Thread bound: use the compiler option -Xptxas -v to see the number of registers used per thread and the amount of shared memory (smem) used per block. If the total register count per block exceeds the hardware limit, try --maxrregcount to cap register usage per thread; note that this spills variables to local memory (DRAM), which is a performance drawback.
Register bound: note that the number of variables declared in a kernel is not equal to the number of registers used; the compiler optimizes it down to fewer registers and drops unused variables. Try reducing the variables in your algorithm, and try changing the computation order so that the lifetimes of some variables become shorter.
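For reference, a compile line of the kind the slide describes (file name and the register cap of 32 are only examples): -Xptxas -v makes ptxas report registers per thread and shared memory per block, and --maxrregcount caps register usage per thread.

nvcc -O3 -Xptxas -v --maxrregcount=32 -o app kernel.cu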

8 Warp & Branches I
Warp = 32 threads. SP: MUL & MAD. SFU: sin(x), cos(x), etc. One block is divided into multiple warps; an SM can execute one warp in 4 clocks.

9 Warp & Branches II
A branch makes a warp diverge, and each path is executed one after the other in time order. More diverged branches means slower execution. (Figure: non-diverged, 2-fold diverged, and 3-fold diverged warps.)

10 Warp & Branches III
Fat & slim diverged warps: if there are instructions common to both sides of a diverged warp, moving them out of the branch saves execution time. Generally speaking, making the branch as slim as possible saves time; data loads/stores and other common instructions left inside the branch make the diverged warp fatter. (Figure: slim vs. fat diverged warp, with the common instruction either hoisted out of the branch or duplicated in each path.)
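A minimal sketch of slimming a diverged branch, using a hypothetical kernel: in the fat version the common load and store are duplicated in both paths, while in the slim version only the divergent arithmetic stays inside the branch.

// Fat diverged warp: load and store repeated in each path
__global__ void fat(float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) {
        float v = data[i];     // common load inside the branch
        data[i] = v * 2.0f;
    } else {
        float v = data[i];     // duplicated load
        data[i] = v + 1.0f;
    }
}

// Slim diverged warp: common load/store hoisted out of the branch
__global__ void slim(float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = data[i];         // common load, executed by the whole warp together
    if (i % 2 == 0)
        v = v * 2.0f;
    else
        v = v + 1.0f;
    data[i] = v;               // common store, also outside the branch
}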

11 Estimate the computing throughput
Computing weight: isolate the computing parts and measure their percentage of the kernel time with the GPU clock().
Kernel execution time: measure the kernel execution time and compute the time spent in these computing parts.
Achieved computing throughput in Flops: count the total arithmetic operations and divide by this execution time.
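A minimal sketch of using clock() inside a kernel to weigh one part (the arithmetic is a placeholder, and cycles is a per-block output buffer added for illustration):

__global__ void timed_kernel(float* data, long long* cycles)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    clock_t t0 = clock();                 // per-SM cycle counter
    float v = data[i];
    v = v * v + 1.0f;                     // the computing part being measured
    data[i] = v;
    clock_t t1 = clock();

    if (threadIdx.x == 0)                 // one sample per block is enough
        cycles[blockIdx.x] = (long long)(t1 - t0);
}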

12 Example of intrinsic functions
__mul24(x, y): faster than a 32-bit product; computes the product of the 24 least significant bits of the integer parameters.
__sinf(x), __cosf(x), __expf(x): very fast, single precision; less precise, but often good enough.
__sincosf(x, sptr, cptr): computes sine and cosine in a single call.
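A minimal sketch using these intrinsics in one hypothetical kernel (intrinsics_demo and its arguments are illustrative):

__global__ void intrinsics_demo(const float* x, float* out, int n)
{
    int i = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;  // 24-bit multiply is enough here
    if (i >= n) return;

    float s, c;
    __sincosf(x[i], &s, &c);        // fast single-precision sin and cos in one call
    out[i] = s * c + __expf(x[i]);  // __expf: fast, slightly less precise exp
}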

13 Memory system
Thread scope:
- Register: on die, fastest, the default location for local variables.
- Local memory: DRAM, not cached, slow (400~600 clocks).
Block scope:
- Shared memory: on die, fast (4~6 clocks), qualifier __shared__.
Grid scope:
- Global memory: DRAM, not cached, slow (400~600 clocks).
- Constant memory: on die, small (64 KB total), qualifier __constant__.
- Texture memory: read only, DRAM + cache, fast on a cache hit.
- Cache (Fermi only): R/W cache.
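A minimal sketch showing where the qualifiers go (names and the 256-element sizes are assumptions; it also assumes blockDim.x == 256 and that the grid exactly covers the data):

__constant__ float coeff[256];              // grid scope: constant memory

__global__ void scale_block(const float* in, float* out)
{
    __shared__ float tile[256];             // block scope: on-chip shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[i];              // global memory (DRAM) read
    __syncthreads();                        // make the tile visible to the whole block

    float v = tile[threadIdx.x] * coeff[threadIdx.x];  // v lives in a register (thread scope)
    out[i] = v;                             // global memory (DRAM) write
}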

14 Count memory bandwidth (exercise)
Memory access weight: isolate the memory access parts and measure their percentage of the kernel time with the GPU clock().
Kernel execution time: measure the kernel execution time and compute the time spent on memory access.
Achieved memory bandwidth: count the total bytes accessed and divide by the access time.

15 Coalesced global memory I/O
Threads in a half-warp share the same memory controller. If the memory access pattern within the half-warp is densely localized in memory, it forms a single transaction and performance is good; we call this coalesced I/O. If the accesses diverge to different memory segments, performance drops because multiple transactions are needed; we call this non-coalesced I/O. How many threads of a warp share the same memory controller depends on the hardware generation.
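A minimal sketch contrasting the two patterns (hypothetical kernels; the exact coalescing rules depend on the hardware generation, and the strided version assumes in holds at least n * stride elements):

// Coalesced: consecutive threads touch consecutive addresses,
// so a half-warp's loads merge into a single transaction.
__global__ void copy_coalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Non-coalesced: consecutive threads touch addresses `stride` apart,
// scattering the loads over many segments and many transactions.
__global__ void copy_strided(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i * stride];
}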

16 Wrap kernels as standard C/C++ functions
Kernels can be compiled into standard object files or libraries and linked from other languages: Java, Fortran, MATLAB, and so on. It is not necessary to rewrite all non-C code in CUDA; we can call kernels from any other language.

17 Example: Wrap Kernel to C

__global__ void ker_xxx(int* a, int* b) {  // some CUDA kernel
    ...
}

extern "C" {  // export with standard C linkage
    void xxx(int* a, int* b);
}

void xxx(int* a, int* b) {  // wrap the kernel into a C function
    ...
    ker_xxx<<<grid, block>>>(a, b);  // launch configuration omitted on the slide
    ...
}

18 Multi-GPU operations
Before CUDA 4.0, or on non-Tesla cards: one CUDA context can control only one GPU (to send/receive data and launch kernels), which means one CPU thread controls one GPU, since each CPU thread owns one CUDA context. We can use MPI, OpenMP, or pthreads to create multiple CPU threads, then use cudaSetDevice() to assign each CPU thread to its own GPU. Data communication: copy data from global memory back to system memory and exchange it through MPI, OpenMP, or pthreads mechanisms.
CUDA 4.0: UVA (unified virtual addressing); all GPUs and the CPU can see each other's data in one address space.
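A minimal sketch of the pre-CUDA-4.0 pattern, binding one CPU thread to each GPU with OpenMP (run_on_all_gpus is a hypothetical wrapper; MPI or pthreads work the same way):

#include <omp.h>

void run_on_all_gpus()
{
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);

    #pragma omp parallel num_threads(num_gpus)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);     // this CPU thread's CUDA context now drives GPU `dev`

        // allocate device memory, copy inputs, and launch kernels on this GPU;
        // results go back to system memory and are exchanged between CPU
        // threads through ordinary shared memory or MPI messages.
    }
}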

19 Hardware (SPA, Streaming Processors Array)
(Figure: the SPA is built from TPCs.)

20 Hardware (TPC, Texture/Processor Cluster)

21 Hardware (SM, Streaming Multiprocessor)
Warp = 32 threads. SP: MUL & MAD. SFU: sin(x), cos(x), etc. One block is divided into multiple warps; an SM can execute one warp in 4 clocks.

22 SPA, Streaming Processors Array
(Figure: SM block diagram showing the double-precision unit, the Special Function Unit (SFU), the TP array, and shared memory.)

23 How to use so many cores?
240 SP thread processors and 30 DP thread processors; each is a full scalar processor with IEEE 754 double-precision floating point.
(Figure: a Thread Processor (TP) with FP/Int units, a multi-banked register file, and special-ops ALUs, grouped into a Thread Processor Array (TPA) together with the double-precision unit, SFU, and shared memory.)