Introduction to CUDA Programming


Introduction to CUDA Programming: Architecture Overview. Andreas Moshovos, Winter 2009 (updated). Most slides/material from: the UIUC course by Wen-Mei Hwu and David Kirk; Real World Technologies by David Kanter; the HotChips 22 presentation by C. M. Wittenbrink, E. Kilgariff and A. Prabhu.

Programmer’s view of a CPU+GPU system: the GPU acts as a co-processor (data is from 2008). CPU to GPU link: 3GB/s – 8GB/s. CPU to system memory: 6.4GB/s – 31.92GB/s, 8B per transfer. GPU to GPU memory: 141GB/s; 1GB of GPU memory on our systems. These are GTX280 characteristics, the top of the line in 2008-2009. Key suppliers: Nvidia and AMD.

Our Lab Systems: processor, board and memory. CPU: Q9550 @ 2.83GHz, launched Q1’08; 4 cores, one thread/core; L1D/L1I: 64KB; L2: 12MB; memory bus: 1333MHz; 45nm, 95W. Board: ASUS P5E-VM DO, Intel Q35 chipset. Memory: Ballistix 2 x 2GB DDR2 800 PC2-6400 CL4-4-4-12, 6.4 GB/s peak.

System Architecture of a Typical PC / Intel (2008)

Current (2011) Intel System Architecture (desktop)

PCI-Express Programming Model: PCI device registers are mapped into the CPU’s physical address space and accessed through loads/stores (kernel mode). Addresses are assigned to the PCI devices at boot time, and all devices listen for their addresses. That’s one reason why 32-bit Windows XP cannot “see” a full 4GB of RAM.

PCI-E 1.x Architecture: a switched, point-to-point connection. Each card has a dedicated “link” to the central switch; there is no bus arbitration. Packet-switched messages form virtual channels, and packets can be prioritized for QoS (e.g., real-time video streaming). (Diagram: several IO devices sharing a PCI bus on the northbridge, versus IO devices with point-to-point PCI-E links to the northbridge.)

PCI-E 1.x Architecture Contd. Each link consists of one or more lanes. Each lane is 1-bit wide (4 wires; each 2-wire pair can transmit 2.5Gb/s in one direction), with upstream and downstream simultaneous and symmetric, using differential signalling. Each link can combine 1, 2, 4, 8, 12, or 16 lanes: x1, x2, etc. Each data byte is 8b/10b encoded into 10 bits with an equal number of 1’s and 0’s; the net data rate is 2 Gb/s per lane each way. Thus, the net data rates are 250 MB/s (x1), 500 MB/s (x2), 1GB/s (x4), 2 GB/s (x8), 4 GB/s (x16), each way.

PCI-E 2.x and beyond:
Version    Clock Speed    Transfer Rate    Overhead    Data Rate (per lane)
1.x        1.25GHz        2.5GT/s          20%         250MB/s
2.0        2.5GHz         5GT/s            20%         500MB/s
3.0        4GHz           8GT/s            ~0%         1GB/s
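
As a worked version of the arithmetic above, the small host-side sketch below (an illustration, not from the slides) computes the per-direction bandwidth from the transfer rate, the encoding overhead, and the lane count:

    /* Sketch: effective PCI-E bandwidth per direction, in MB/s.
       Uses the overhead figures from the table above. */
    #include <stdio.h>

    double pcie_bandwidth_MBps (double gtransfers_per_s, double overhead, int lanes)
    {
        double bits_per_s = gtransfers_per_s * 1e9 * (1.0 - overhead);
        return bits_per_s / 8.0 / 1e6 * lanes;          /* bits -> bytes -> MB */
    }

    int main (void)
    {
        printf ("PCI-E 1.x x16: %.0f MB/s each way\n", pcie_bandwidth_MBps (2.5, 0.20, 16));
        printf ("PCI-E 2.0 x16: %.0f MB/s each way\n", pcie_bandwidth_MBps (5.0, 0.20, 16));
        return 0;
    }

For example, 2.5 GT/s with 20% encoding overhead gives 2 Gb/s = 250 MB/s per lane, so an x16 link moves about 4 GB/s each way.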

Typical AMD System (for completeness): the AMD HyperTransport™ Technology bus replaces the front-side bus architecture. HyperTransport™ similarities to PCIe: packet based, switching network; dedicated links for both directions. Shown in a 4-socket configuration, 8 GB/sec per link. The northbridge/HyperTransport™ is on die, with glueless logic to DDR, DDR2 memory. PCI-X/PCIe bridges are usually implemented in the southbridge.

“Current” AMD system architecture

Our lab motherboards (2008)

Typical motherboard today (2012)

CUDA Refresher: Grids of Blocks, Blocks of Threads. Why? Realities of integrated circuits: computation and storage need to be clustered to achieve high speeds.

Execution model guarantees: only that threads will execute; nothing is said about the order. Extreme cases: #1: all threads run in parallel; #2: all threads run sequentially, interleaving at synchronization points. The same CUDA program can run on the CPU, on a GPU with 1 SM and on one with N SMs: different models/price points.

Thread Blocks Refresher. The programmer declares a (Thread) Block: block size of 1 to 1024 concurrent threads; block shape 1D, 2D, or 3D; block dimensions in threads. All threads in a Block execute the same thread program. Threads have thread id numbers within the Block. Threads share data and synchronize while doing their share of the work; the thread program uses the thread id to select work and address shared data. (Diagram: thread ids 0, 1, 2, 3, …, m feeding the same thread program.)

My first CUDA Program:

    // GPU (device) code
    __global__ void arradd (float *a, float f, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) a[i] = a[i] + f;
    }

    // CPU (host) code
    int main ()
    {
        float h_a[N];
        float *d_a;
        cudaMalloc ((void **) &d_a, SIZE);
        cudaMemcpy (d_a, h_a, SIZE, cudaMemcpyHostToDevice);
        arradd <<< n_blocks, block_size >>> (d_a, 10.0f, N);
        cudaThreadSynchronize ();
        cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost);
        CUDA_SAFE_CALL (cudaFree (d_a));
    }

Architecture Goals: use multithreading to hide DRAM latency; support fine-grain parallel processing; virtualize the processors to achieve scalability (multiple blocks and threads per processor); simplify programming (develop the program for one thread). Conventional processors are latency optimized: ILP plus caches with ~99% hit rates. GPU caches hit 90% or less, so caches alone are not a good option; GPUs are instead throughput optimized: ILP + TLP.

GF100 Specifications: 3 billion transistors in a 40 nm process (TSMC); up to 512 CUDA / unified shader cores; 384-bit GDDR5 memory interface; 6GB capacity. GeForce GTX480: the graphics-enthusiast part.

GF100 Architecture Overview -- Compute (diagram).

GF100 Architecture - Complete: 512 CUDA cores, 16 PolyMorph Engines, 4 raster units, 64 texture units, 48 ROP units, 384-bit GDDR5 (6 channels, 64-bit / channel).

Terminology: SPA = Streaming Processor Array. TPC = Texture Processor Cluster (3 SM + TEX). SM = Streaming Multiprocessor (32 SP): a multi-threaded processor core, the fundamental processing unit for a CUDA thread block. SP = Streaming Processor: a scalar ALU for a single CUDA thread.

SM Architecture. A Streaming Multiprocessor (SM) has: 32 Streaming Processors (SP), each 32-bit INT or FP, with 16 DP (64-bit) units; 4 Special Function Units (SFU); 16 Load/Store units. Multi-threaded instruction dispatch: up to 1536 threads active (32 x 48), up to 8 concurrent blocks, with a 1024 threads/block limit; instruction fetch is shared per 32 threads; multithreading covers the latency of texture/memory loads. 80+ GFLOPS. 16KB/48KB shared memory and 48KB/16KB L1 cache. DRAM texture and memory access.

Thread Life: a Grid is launched on the SPA. Thread Blocks are serially distributed to all the SMs, potentially more than one Thread Block per SM. Each SM launches Warps of Threads: two levels of parallelism. The SM schedules and executes Warps that are ready to run. As Warps and Thread Blocks complete, resources are freed and the SPA can distribute more Thread Blocks. (Diagram: the host launches Kernel 1 and Kernel 2 on the device as grids of blocks, each block a grid of threads.)

Cooperative Thread Array: break Blocks into warps; allocate resources (registers, shared memory, barriers); then allocate for execution.

Streaming Multiprocessor Architecture

Stream Multiprocessors Execute Blocks. Threads are assigned to SMs at Block granularity: up to 8 Blocks per SM, as resources allow; an SM in GF100 can take up to 1536 threads. That could be 256 (threads/block) * 6 blocks, or 512 (threads/block) * 3 blocks, etc. Threads run concurrently: the SM assigns/maintains thread id #s and manages/schedules thread execution.

Thread Scheduling and Execution. Each Thread Block is divided into 32-thread Warps. This is an implementation decision, not part of the CUDA programming model. The Warp is the primitive scheduling unit: all threads in a warp execute the same instruction, and control flow causes some to become inactive.

SM hardware implements zero-overhead Warp scheduling. Warps whose next instruction has its operands ready for consumption are eligible for execution; eligible Warps are selected for execution by a prioritized scheduling policy. All threads in a Warp execute the same instruction when it is selected. In G200, 4 clock cycles are needed to dispatch the same instruction for all threads in a Warp. (Diagram: the SM multithreaded Warp scheduler interleaving instructions from warps 1, 3 and 8 over time.)

Warp Scheduling: Hiding Thread stalls

How many warps are there? If 3 blocks are assigned to an SM and each Block has 256 threads, how many Warps are there in an SM? Each Block is divided into 256/32 = 8 Warps, so there are 8 * 3 = 24 Warps. At any point in time, only one of the 24 Warps will be selected for instruction fetch and execution.

Warp Scheduling Ramifications. If one global memory access is needed for every 4 instructions, a minimum of 13 Warps is needed to fully tolerate a 200-cycle memory latency. Why? We need to hide 200 cycles every four instructions. Every Warp occupies 4 cycles per instruction (during which the same instruction executes for all its threads), so a thread issues 4 instructions in 16 cycles and then stalls. 200/16 = 12.5, so at least 13 warps are needed.

Granularity Considerations: Block & Thread limits per SM. For Matrix Multiplication or any 2D-type of computation, should I use 8X8, 16X16 or 32X32 tiles? For 8X8, we have 64 threads per Block: the thread/SM limit of 1024 allows up to 16 Blocks, but the Blocks/SM limit is 8, so only 512 threads will go into each SM. For 16X16, we have 256 threads per Block: the thread/SM limit of 1024 allows up to 4 Blocks, within the Blocks/SM limit of 8, so the SM runs at full capacity unless other resource considerations overrule. For 32X32, we have 1024 threads per Block, right at the thread/block limit of 1024; on hardware with a 512 threads/block limit, not even one such Block fits into an SM. (The arithmetic is sketched in code below.)
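
A small host-side helper (an illustration, not from the slides; the limits are parameters, so plug in your device’s values) that reproduces the arithmetic above:

    /* Sketch: how many blocks and threads end up resident on one SM,
       considering only the threads/SM, blocks/SM and threads/block limits. */
    #include <stdio.h>

    void residency (int threads_per_block, int max_threads_sm,
                    int max_blocks_sm, int max_threads_block)
    {
        if (threads_per_block > max_threads_block) {
            printf ("%4d threads/block: exceeds the threads/block limit, does not fit\n",
                    threads_per_block);
            return;
        }
        int blocks = max_threads_sm / threads_per_block;       /* limited by threads */
        if (blocks > max_blocks_sm) blocks = max_blocks_sm;    /* limited by blocks  */
        printf ("%4d threads/block -> %d blocks, %d resident threads per SM\n",
                threads_per_block, blocks, blocks * threads_per_block);
    }

    int main (void)
    {
        residency (  64, 1024, 8, 1024);   /*  8x8  tile */
        residency ( 256, 1024, 8, 1024);   /* 16x16 tile */
        residency (1024, 1024, 8,  512);   /* 32x32 tile on hardware with a 512 threads/block limit */
        return 0;
    }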

SM Instruction Buffer – Warp Scheduling (ref). Fetch one warp instruction/cycle from the instruction L1 cache into any instruction buffer slot. Issue one “ready-to-go” warp instruction/cycle from any warp-instruction buffer slot; operand scoreboarding is used to prevent hazards. Issue selection is based on round-robin/age of warp (details not public). The SM broadcasts the same instruction to the 32 threads of a Warp. That’s the theory; actual warp scheduling may use heuristics.

Scoreboarding (ref). All register operands of all instructions in the Instruction Buffer are scoreboarded: status becomes ready after the needed values are deposited; this prevents hazards, and cleared instructions are eligible for issue. Decoupled Memory/Processor pipelines: any thread can continue to issue instructions until scoreboarding prevents issue, which allows Memory/Processor ops to proceed in the shadow of other Memory/Processor ops.

WARP scheduling & scoreboarding, dependent case:

    add r1, r2, 10    // r1 = r2 + 10   -> Scoreboard[r1] = 0 (result pending)
    add r3, r1, r1    // r3 = r1 + r1   -> checks Scoreboard[r1]: stalls until Scoreboard[r1] = 1

WARP scheduling & scoreboarding, independent case:

    add r1, r2, 10    // r1 = r2 + 10   -> Scoreboard[r1] = 0 (result pending)
    add r3, r2, r2    // r3 = r2 + r2   -> checks Scoreboard[r2]: ready, issues without stalling;
                      //                   Scoreboard[r1] = 1 once the first add completes

Stream Multiprocessor Detail (diagram).

Scalar Units: 32-bit ALU and multiply-add, IEEE single-precision floating-point and integer; latency is 4 cycles. FP handling: NaN and denormals become signed 0; rounding is to nearest even.

Special Function Units: transcendental function evaluation and per-pixel attribute interpolation. Function evaluator: rcp, rsqrt, log2, exp2, sin, cos approximations, using quadratic interpolation based on Enhanced Minimax Approximation. 1 scalar result per cycle; latency is 16 cycles. Some functions are synthesized: 32 cycles or so.

Memory System Goals: high bandwidth, with as much parallelism as possible. Wide: 512 pins in G200 / many DRAM chips. Fast signalling: max data rate per pin. Maximize utilization: multiple bins of memory requests; coalesce requests to get as wide as possible; the goal is to use every cycle to transfer from/to memory. Compression: lossless and lossy. Caches where it makes sense (small).

DRAM considerations: multiple banks per chip (4-8 typical); 2^N rows (16K typical); 2^M cols (8K typical). Timing constraints: ~10 cycles to open a row, 4 cycles within a row. DDR at 1GHz --> 2Gbit/s per pin; a 32-bit interface --> 8 bytes per clock. GPU to memory: many traffic generators with no correlation if scheduled greedily, so keep separate heaps / coalesce accesses; the price is longer latency.

Parallelism in the Memory System. Local Memory: per-thread, private per thread; auto variables, register spill. Shared Memory: per-Block, shared by the threads of the same block; inter-thread communication. Global Memory: per-application, shared by all threads; inter-Grid communication (sequential Grids in time).

SM Memory Architecture. Threads in a Block share data & results, in Memory and in Shared Memory, and synchronize at barrier instructions. Per-Block Shared Memory allocation keeps data close to the processor and minimizes trips to global Memory. SM Shared Memory is dynamically allocated to Blocks and is one of the limiting resources.

SM Register File. The Register File (RF) is 64 KB: 16K 32-bit registers, providing 4 operands/clock. The TEX pipe can also read/write the RF (3 SMs share 1 TEX), and the Load/Store pipe can also read/write the RF.

Programmer’s View of the Register File. There are 16K registers in each SM in G200; this is an implementation decision, not part of CUDA. Registers are dynamically partitioned across all Blocks assigned to the SM. Once assigned to a Block, a register is NOT accessible by threads in other Blocks, and each thread in the same Block only accesses registers assigned to itself. (Diagram: the register file split across 4 blocks vs. 3 blocks.)

Register Use Implications: example Matrix Multiplication. If each Block has 16X16 threads and each thread uses 10 registers, how many threads can run on each SM? Each Block requires 10*16*16 = 2560 registers, and 16384 = 6*2560 + change, so six Blocks can run on an SM as far as registers are concerned. How about if each thread increases its register use by 1? Each Block now requires 11*256 = 2816 registers, and 16384 < 2816*6, so only five Blocks can run on an SM: parallelism drops to 5/6 of what it was.
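
The same arithmetic as a tiny sketch (illustrative only; the register file size and per-thread register count are parameters):

    /* Sketch: blocks resident per SM when limited only by the register file. */
    #include <stdio.h>

    int blocks_by_registers (int regfile_regs, int regs_per_thread, int threads_per_block)
    {
        return regfile_regs / (regs_per_thread * threads_per_block);
    }

    int main (void)
    {
        /* 16K registers per SM, 16x16 = 256 threads per block */
        printf ("10 regs/thread: %d blocks\n", blocks_by_registers (16384, 10, 256));  /* 6 */
        printf ("11 regs/thread: %d blocks\n", blocks_by_registers (16384, 11, 256));  /* 5 */
        return 0;
    }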

Dynamic partitioning gives more flexibility to compilers/programmers: one can run a smaller number of threads that require many registers each, or a large number of threads that require few registers each. This allows for finer-grain threading than traditional CPU threading models, and the compiler can trade off between instruction-level parallelism and thread-level parallelism.

ILP example: a = a + b[mem]; c = c + 10 * b (a in r2, b in r4, c in r5, r1 is a temporary):

    load r1, 0(r3)     // r1 = b loaded from memory
    add  r2, r2, r1    // a = a + b  (depends on the load)
    mul  r1, r4, 10    // r1 = 10 * b (reuses r1, so it must wait behind the previous use)
    add  r5, r5, r1    // c = c + 10 * b

Final version, with one more register: the multiply/add pair no longer reuses r1 and is independent of the load:

    load r1, 0(r3)     // r1 = b loaded from memory
    add  r2, r2, r1    // a = a + b
    mul  r6, r4, 10    // independent of the load
    add  r5, r5, r6    // c = c + 10 * b

Within or Across Thread Parallelism (ILP vs. TLP) (ref). Assume a kernel with 256-thread Blocks, 4 independent instructions for each global memory load, 21 registers per thread, and 200-cycle global loads. Then 3 Blocks can run on each SM (16K / (256 * 21)). If the compiler can use one more register to change the dependence pattern so that 8 independent instructions exist for each global memory load, only two Blocks can run on each SM. However, one now needs only 200/(8*4) = 7 Warps to tolerate the memory latency, and two Blocks have 16 Warps. Conclusion: it could be better.

How many registers is my kernel using? NVCC flag: --ptxas-options=-v

    ptxas info : Compiling entry function 'acos_main'
    ptxas info : Used 4 registers, 60+56 bytes lmem, 44+40 bytes smem,
                 20 bytes cmem[1], 12 bytes cmem[14]

For shared memory per block: 44 bytes were explicitly allocated by the user and 40 were implicitly allocated by the system/compiler. lmem: local memory per thread. Constant memory: program variables in cmem[1], compiler-generated constants in cmem[14]. Double-check this please.

CUDA Occupancy Calculator http://developer.download.nvidia.com/compute/cuda/3_2_prod/sdk/docs/CUDA_Occupancy_Calculator.xls

Constants: immediate address constants and indexed address constants. Constants are stored in DRAM and cached on chip (L1 per SM); 64KB total. A constant value can be broadcast to all threads in a Warp: an extremely efficient way of accessing a value that is common to all threads in a Block.

Shared Memory: each SM has 16 KB of Shared Memory, organized as 16 banks of 32-bit words. CUDA uses Shared Memory as shared storage visible to all threads in a thread block, with read and write access. It is not used explicitly by pixel shader programs (we dislike pixels talking to each other). Key performance enhancement: move data into Shared Memory and operate on it there.

Parallel Memory Architecture. In a parallel machine, many threads access memory; therefore memory is divided into banks, which is essential to achieve high bandwidth. Each bank can service one address per cycle, so a memory can service as many simultaneous accesses as it has banks. Multiple simultaneous accesses to a bank result in a bank conflict, and conflicting accesses are serialized.

Bank Addressing Examples, no bank conflicts: linear addressing with stride == 1 (each thread of a half-warp hits a different bank), and any random 1:1 permutation of threads to banks.

Bank Addressing Examples, with conflicts: linear addressing with stride == 2 gives 2-way bank conflicts; linear addressing with stride == 8 gives 8-way bank conflicts.

How addresses map to banks on G80: each bank has a bandwidth of 32 bits per clock cycle, and successive 32-bit words are assigned to successive banks. G80 has 16 banks, so bank = (32-bit word address) % 16, the same as the size of a half-warp. There are no bank conflicts between different half-warps, only within a single half-warp. G200 is the same.

Shared memory bank conflicts. Shared memory is as fast as registers if there are no bank conflicts. The fast case: if all threads of a half-warp access different banks, there is no bank conflict; if all threads of a half-warp access the identical address, there is no bank conflict (broadcast). The slow case: a bank conflict occurs when multiple threads in the same half-warp access the same bank; the accesses must be serialized, and the cost = the maximum number of simultaneous accesses to a single bank.
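
A small kernel sketch (illustrative, not from the slides) contrasting the fast and slow cases; with 16 banks, the stride-1 read touches 16 different banks per half-warp, while the stride-16 read makes every thread of a half-warp hit bank 0:

    // Sketch: shared-memory access patterns with and without bank conflicts.
    // Launch with one block of 256 threads, e.g. bank_demo<<<1, 256>>>(d_out);
    __global__ void bank_demo (float *out)
    {
        __shared__ float buf[256];
        int tid = threadIdx.x;

        buf[tid] = (float) tid;                  // stride-1 write: no bank conflicts
        __syncthreads();

        float fast  = buf[tid];                  // stride-1 read: no bank conflicts
        float bcast = buf[0];                    // all threads read one address: broadcast, no conflict
        float slow  = buf[(tid * 16) % 256];     // stride-16 read: all 16 threads of a half-warp
                                                 // hit the same bank -> 16-way conflict, serialized
        out[tid] = fast + bcast + slow;
    }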

Load/Store (Memory read/write) Clustering/Batching: use an LD to hide LD latency (non-dependent LD ops only), i.e., use the same thread to help hide its own latency.
Instead of:
    LD 0 (long latency)
    Dependent MATH 0
    LD 1 (long latency)
    Dependent MATH 1
Do:
    LD 0 (long latency)
    LD 1 (long latency - hidden)
    MATH 0
    MATH 1
The compiler handles this! But you must have enough non-dependent LDs and MATH.

How to get high performance #1: programmer-managed scratchpad memory (“shared memory”). Bring data in from global memory and reuse it: 16KB, banked, accessed in parallel by 16 threads. The programmer needs to decide what to bring in and when, and which thread accesses what and when; coordination is paramount.
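
A minimal staging sketch (illustrative; the array names and sizes are made up) of the bring-in / reuse pattern described above:

    // Sketch: stage a tile of global memory into shared memory, then reuse it.
    __global__ void stage_and_reuse (const float *in, float *out, int N)
    {
        __shared__ float tile[256];                       // one element per thread of the block
        int gid = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = (gid < N) ? in[gid] : 0.0f;   // coalesced load into the scratchpad
        __syncthreads();                                  // coordinate: the tile is now populated

        if (gid < N) {
            float sum = 0.0f;                             // each thread reuses many tile elements
            for (int j = 0; j < blockDim.x; ++j)
                sum += tile[j];
            out[gid] = sum;
        }
    }

Launched as stage_and_reuse<<<(N + 255) / 256, 256>>>(d_in, d_out, N): each global element is read once but used blockDim.x times from shared memory.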

How to get high performance #2. Global memory accesses: 32 threads access memory together and can coalesce into a single reference; e.g., a[threadID] works well. Control flow: 32 threads run together; if they diverge there is a performance penalty. Texture cache: use it when you think there is locality.
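
A sketch (illustrative) of the a[threadID] pattern versus a strided pattern that breaks coalescing:

    // Sketch: coalesced vs. strided global memory access.
    __global__ void copy_coalesced (const float *in, float *out, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)
            out[i] = in[i];        // neighbouring threads touch neighbouring words: coalesced
    }

    __global__ void copy_strided (const float *in, float *out, int N, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < N)
            out[i] = in[i];        // threads of a warp touch words far apart: many separate transactions
    }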

Numerical Accuracy. Single precision (FP) works: mostly OK, with some minor discrepancies. Double precision (DP) works at 1/8 the bandwidth, and is better on newer hardware. Mixed methods break numbers into two single-precision values; you must carefully check for stability/correctness. This will get better with next-generation hardware.

Are GPUs really that much faster than CPUs? 50x – 200x speedups are typically reported. Recent work found that not enough effort goes into optimizing code for CPUs: Intel paper (ISCA 2010), http://portal.acm.org/ft_gateway.cfm?id=1816021&type=pdf. But: the learning curve and expertise needed for CPUs is much larger; then again, so is the potential and flexibility.

Predefined Vector Datatypes. Can be used both in host and in device code: [u]char[1..4], [u]short[1..4], [u]int[1..4], [u]long[1..4], float[1..4]. They are structures accessed with the .x, .y, .z, .w fields, with default constructors and “make_TYPE (…)” helpers, e.g.: float4 f4 = make_float4 (1.0f, 10.0f, 1.2f, 0.5f); The dim3 type is built on uint3, is used to specify dimensions, and has a default value of (1, 1, 1).

Execution Configuration. Must be specified when calling a __global__ function: <<< Dg, Db [, Ns [, S]] >>> where: dim3 Dg: grid dimensions in blocks; dim3 Db: block dimensions in threads; size_t Ns: additional shared memory bytes to allocate per block (optional, defaults to 0; more on this much later on); cudaStream_t S: requested stream (queue) (optional, defaults to 0; compute capability >= 1.1).
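
A hedged example of the full configuration syntax (the kernel name and the sizes are made up for illustration):

    // Sketch: launch 'mykernel' on a 2D grid with dynamic shared memory and a stream.
    dim3 Dg (64, 64);                        // grid: 64 x 64 blocks
    dim3 Db (16, 16);                        // block: 16 x 16 threads
    size_t Ns = 256 * sizeof (float);        // extra shared memory bytes per block
    cudaStream_t S;
    cudaStreamCreate (&S);

    mykernel <<< Dg, Db, Ns, S >>> (d_data); // Ns and S are optional; both default to 0

Inside the kernel, the dynamically allocated bytes are visible through an unsized declaration: extern __shared__ float scratch[];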

Built-in Variables: dim3 gridDim: number of blocks per grid, in 2D (.z is always 1). uint3 blockIdx: block ID, in 2D (blockIdx.z is always 0). dim3 blockDim: number of threads per block, in 3D. uint3 threadIdx: thread ID within the block, in 3D.
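
A common idiom built from these variables (a sketch; the 2D layout is just an example):

    // Sketch: flatten the 2D built-in indices into a global element index.
    __global__ void index_demo (int *out, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // global column
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // global row
        if (x < width && y < height)
            out[y * width + x] = y * width + x;          // row-major global index
    }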

Execution Configuration Examples.
1D grid / 1D blocks:
    dim3 gd(1024);
    dim3 bd(64);
    akernel<<<gd, bd>>>(...);
    // gridDim.x = 1024, gridDim.y = 1, blockDim.x = 64, blockDim.y = 1, blockDim.z = 1
2D grid / 3D blocks:
    dim3 gd(4, 128);
    dim3 bd(64, 16, 4);
    akernel<<<gd, bd>>>(...);
    // gridDim.x = 4, gridDim.y = 128, blockDim.x = 64, blockDim.y = 16, blockDim.z = 4

Error Handling. Most cuda…() functions return a cudaError_t; if it is cudaSuccess, the request completed without a problem. cudaGetLastError() returns the last error to the CPU; use it together with cudaThreadSynchronize():

    cudaError_t code;
    cudaThreadSynchronize ();
    code = cudaGetLastError ();

const char *cudaGetErrorString (cudaError_t code); returns a human-readable description of the error code.

Error Handling Utility Function:

    void cudaDie (const char *msg)
    {
        cudaError_t err;
        cudaThreadSynchronize ();
        err = cudaGetLastError ();
        if (err == cudaSuccess) return;
        fprintf (stderr, "CUDA error: %s: %s.\n", msg, cudaGetErrorString (err));
        exit (EXIT_FAILURE);
    }

adapted from: http://www.ddj.com/hpc-high-performance-computing/207603131

Error Handling Macros: CUDA_SAFE_CALL ( some cuda call ), e.g. CUDA_SAFE_CALL (cudaMemcpy (a_h, a_d, arr_size, cudaMemcpyDeviceToHost)); prints the error and exits on error. You must define _DEBUG (#define _DEBUG); no checking code is emitted when it is undefined (for performance). Use make dbg=1 under NVIDIA_CUDA_SDK.
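
A minimal sketch of how such a macro can be written (illustrative only, not the SDK’s exact cutil definition):

    #include <stdio.h>
    #include <stdlib.h>

    #ifdef _DEBUG
    #define CUDA_SAFE_CALL(call)                                            \
        do {                                                                \
            cudaError_t err = (call);                                       \
            if (err != cudaSuccess) {                                       \
                fprintf (stderr, "CUDA error at %s:%d: %s\n",               \
                         __FILE__, __LINE__, cudaGetErrorString (err));     \
                exit (EXIT_FAILURE);                                        \
            }                                                               \
        } while (0)
    #else
    #define CUDA_SAFE_CALL(call) (call)      /* no checking code emitted */
    #endif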

Measuring Time -- gettimeofday (Unix-based):

    #include <sys/time.h>
    #include <time.h>

    struct timeval start, end;
    gettimeofday (&start, NULL);
    // WHAT WE ARE INTERESTED IN
    gettimeofday (&end, NULL);
    timeCpu = (float) (end.tv_sec - start.tv_sec);
    if (end.tv_usec < start.tv_usec) {
        timeCpu -= 1.0;
        timeCpu += (double) (1000000.0 + end.tv_usec - start.tv_usec) / 1000000.0;
    } else
        timeCpu += (double) (end.tv_usec - start.tv_usec) / 1000000.0;

Using CUDA clock(): clock_t clock(); can be used in device code and returns a counter value. There is one counter per multiprocessor, incremented every clock cycle. Sample at the beginning and at the end of the code; the result is an upper bound, since threads are time-sliced:

    uint start = clock();
    ... compute (less than 3 sec) ...
    uint end = clock();
    if (end > start) time = end - start;
    else             time = end + (0xffffffff - start);

Look at the clock example under projects in the SDK. Using it takes some effort: every thread measures start and end, and you must then find the min start and the max end. It is cycle accurate.

Clock() example -- you’ll get one measurement per thread:

    __global__ static void timedkernel (clock_t *timer_start, clock_t *timer_end)
    {
        int tmid = blockIdx.x * blockDim.x + threadIdx.x;
        timer_start[tmid] = clock();
        // do something
        timer_end[tmid] = clock();
    }

Clock() example 2 -- you’ll get one measurement per block:

    __global__ static void timedkernel (clock_t *timer_start, clock_t *timer_end)
    {
        int tmid = blockIdx.x;
        if (threadIdx.x == 0) timer_start[tmid] = clock();   // first thread in block
        // do something
        __syncthreads();                                     // wait for all threads in the block
        if (threadIdx.x == 0) timer_end[tmid] = clock();
    }

The SDK uses a single timer array with twice as many elements as blocks: timer_end[] becomes timer[bid + gridDim.x].
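
Host-side post-processing, as mentioned above (a sketch; it assumes the per-block timer arrays were copied back into hypothetical h_start[] and h_end[] arrays of n_blocks elements):

    // Sketch: combine the per-block samples into one elapsed-cycle count.
    clock_t min_start = h_start[0], max_end = h_end[0];
    for (int b = 1; b < n_blocks; ++b) {
        if (h_start[b] < min_start) min_start = h_start[b];
        if (h_end[b]   > max_end)   max_end   = h_end[b];
    }
    clock_t elapsed = max_end - min_start;   // upper bound, in GPU clock cycles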

Using cutTimer…() library calls:

    #include <cuda.h>
    #include <cutil.h>

    unsigned int htimer;
    cutCreateTimer (&htimer);
    cudaThreadSynchronize ();
    cutStartTimer (htimer);
    // WHAT WE ARE INTERESTED IN
    cudaThreadSynchronize ();
    cutStopTimer (htimer);
    printf ("time: %f\n", cutGetTimerValue (htimer));

Code Overview: Host side:

    #include <cuda.h>
    #include <cutil.h>

    unsigned int htimer;
    float *ha, *da;

    main (int argc, char *argv[])
    {
        int N = atoi (argv[1]);
        ha = (float *) malloc (sizeof (float) * N);
        for (int i = 0; i < N; i++) ha[i] = i;
        cutCreateTimer (&htimer);
        cudaMalloc ((void **) &da, sizeof (float) * N);
        cudaMemcpy ((void *) da, (void *) ha, sizeof (float) * N, cudaMemcpyHostToDevice);
        int threads_block = 256;   /* e.g., 256 threads per block */
        int blocks = (N + threads_block - 1) / threads_block;
        cudaThreadSynchronize ();
        cutStartTimer (htimer);
        darradd <<<blocks, threads_block>>> (da, 10.0f, N);
        cudaThreadSynchronize ();  /* wait for the kernel before stopping the timer */
        cutStopTimer (htimer);
        cudaMemcpy ((void *) ha, (void *) da, sizeof (float) * N, cudaMemcpyDeviceToHost);
        cudaFree (da);
        free (ha);
        printf ("processing time: %f\n", cutGetTimerValue (htimer));
    }

Code Overview: Device Side:

    __device__ float addmany (float a, float b, int count)
    {
        while (count--) a += b;
        return a;
    }

    __global__ void darradd (float *da, float x, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) da[i] = addmany (da[i], x, 10);
    }

Variable Declarations. __device__: stored in device memory (large, high latency, no cache); allocated with cudaMalloc (the __device__ qualifier is implied); accessible by all threads; lifetime: application. __constant__: same as __device__, but cached and read-only by the GPU; written by the CPU via a cudaMemcpyToSymbol(...) call. __shared__: stored in on-chip shared memory (very low latency); accessible by all threads in the same thread block; lifetime: kernel launch. Unqualified variables: scalars and built-in vector types are stored in registers; arrays of more than 4 elements, or arrays accessed with run-time indices, are stored in device memory.
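
A short example putting the qualifiers together (all names are illustrative):

    __device__   float d_scale;          // device memory, lifetime of the application
    __constant__ float c_coeffs[16];     // cached, read-only on the GPU, written by the CPU

    __global__ void apply (float *data, int N)
    {
        __shared__ float tile[256];      // on-chip, shared by the thread block, lifetime of the launch
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float t;                         // unqualified scalar: lives in a register
        if (i < N) {
            tile[threadIdx.x] = data[i];
            t = tile[threadIdx.x] * d_scale + c_coeffs[0];
            data[i] = t;
        }
    }

    // Host side (inside a host function): the CPU writes the __device__ and __constant__ variables.
    float h_scale = 2.0f, h_coeffs[16] = { 1.0f };
    cudaMemcpyToSymbol (d_scale, &h_scale, sizeof (float));
    cudaMemcpyToSymbol (c_coeffs, h_coeffs, sizeof (h_coeffs));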

Measurement Methodology. You will not get exactly the same time measurement every time: other processes are running and there are external events (e.g., network activity) that you cannot control, hence the apparent “non-determinism”. You must take sufficient samples, say 10 or more (there is theory on what the number of samples must be), and measure the average. We will discuss this next time or provide a handout online.

Handling Large Input Data Sets – 1D Example. Recall gridDim.[xy] <= 65535. The host calls the kernel multiple times:

    float *dac = da;             // starting offset for current kernel
    while (n_blocks) {
        int bn = n_blocks;
        int elems;               // array elements processed in this kernel
        if (bn > 65535) bn = 65535;
        elems = bn * block_size;
        darradd <<<bn, block_size>>> (dac, 10.0f, elems);
        n_blocks -= bn;
        dac += elems;
    }

A better alternative: each thread processes multiple elements (see the sketch below).
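
A sketch of that better alternative (illustrative; this is the usual grid-stride form of darradd):

    __global__ void darradd_strided (float *da, float x, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;       // total threads in the grid
        for (; i < N; i += stride)                 // each thread walks the array in grid-sized steps
            da[i] = da[i] + x;
    }

This is launched once with any grid that respects the 65535-block limit, e.g. darradd_strided <<<1024, block_size>>> (da, 10.0f, N).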