CUDA Performance Considerations (1 of 2) Patrick Cozzi University of Pennsylvania CIS Spring 2011
Administrivia Presentation topic due today by 11:59pm Or I pick your topic Not bonus day eligible Assignment 4 due Friday, 03/04
Agenda Warps Revisited Parallel Reduction Revisited Warp Partitioning Memory Coalescing Dynamic Partitioning of SM Resources Data Prefetching
Warps Recall a warp Group of threads from a block G80 / GT200 – 32 threads per warp Each thread executes the same instruction on different data The unit of scheduling An implementation detail? Not if you want performance
Warps 32 threads per warp but 8 SPs per SM. What gives?
Warps 32 threads per warp but 8 SPs per SM. What gives? When an SM schedules a warp: Its instruction is ready 8 threads enter the SPs on the 1st cycle 8 more on the 2nd, 3rd, and 4th cycles Therefore, 4 cycles are required to execute any instruction
Warps Question A kernel has 1 global memory read (200 cycles) and 4 non-dependent multiplies/adds. How many warps are required to hide the memory latency?
Warps Solution Each warp has 4 multiplies/adds, which take 4 × 4 = 16 cycles. We need to cover 200 cycles: 200 / 16 = 12.5, ceil(12.5) = 13. 13 warps are required
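As a concrete sketch of that situation (the kernel and variable names below are hypothetical, not from the lecture): one global read followed by four multiply/adds that do not depend on it.

__global__ void fourIndependentMads(const float* Md, float* out,
                                    float a, float b, float c, float d)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float m  = Md[i];            // 1 global memory read (~200 cycles)
    float f0 = a * b + c * d;    // 4 multiply/adds independent of the read
    float f1 = a * c + b * d;
    float f2 = a * d + b * c;
    float f3 = f0 * f1 + f2;

    out[i] = m * f3;             // first use of m; other warps run while this warp waits
}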
Efficient data-parallel algorithms + Optimizations based on GPU architecture = Maximum performance
Parallel Reduction Recall Parallel Reduction (sum)
Parallel Reduction Similar to brackets for a basketball tournament log(n) passes for n elements How would you implement this in CUDA?
extern __shared__ float partialSum[];
// ... load into shared memory
unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
{
    __syncthreads();
    if (t % (2 * stride) == 0)
        partialSum[t] += partialSum[t + stride];
}
Computing the sum for the elements in shared memory, in place
Stride: 1, 2, 4, … Why?
As stride increases, what do more threads do?
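For context, here is one way that snippet could sit inside a complete kernel; the kernel name, the input/output arrays, and the bounds check are hypothetical additions, not the lecture's code. Assumes blockDim.x is a power of two and the shared-memory size is passed at launch.

__global__ void reduceInterleaved(const float* in, float* blockSums, int n)
{
    extern __shared__ float partialSum[];

    unsigned int t = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + t;

    // Load one element per thread into shared memory (0 if out of range)
    partialSum[t] = (i < n) ? in[i] : 0.0f;

    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
    {
        __syncthreads();
        if (t % (2 * stride) == 0)
            partialSum[t] += partialSum[t + stride];
    }

    // Thread 0 holds this block's partial sum
    if (t == 0)
        blockSums[blockIdx.x] = partialSum[0];
}

// Example launch (shared-memory size is the third launch parameter):
// reduceInterleaved<<<numBlocks, blockSize, blockSize * sizeof(float)>>>(d_in, d_blockSums, n);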
Parallel Reduction (figure: threads 0–7 performing the reduction)
1st pass: threads 1, 3, 5, and 7 don't do anything. Really only need n/2 threads for n elements
2nd pass: threads 2 and 6 also don't do anything
3rd pass: thread 4 also doesn't do anything
In general, the number of required threads is cut in half after each pass
Parallel Reduction What if we tweaked the implementation?
stride: …, 4, 2, 1
extern __shared__ float partialSum[];
// ... load into shared memory
unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x >> 1; stride > 0; stride >>= 1)
{
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t + stride];
}
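The same hypothetical wrapper as before with the reversed-stride loop; everything except the loop is unchanged from the earlier sketch.

__global__ void reduceSequential(const float* in, float* blockSums, int n)
{
    extern __shared__ float partialSum[];

    unsigned int t = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + t;
    partialSum[t] = (i < n) ? in[i] : 0.0f;

    // Assumes blockDim.x is a power of two
    for (unsigned int stride = blockDim.x >> 1; stride > 0; stride >>= 1)
    {
        __syncthreads();
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
    }

    if (t == 0)
        blockSums[blockIdx.x] = partialSum[0];
}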
Parallel Reduction (figure: threads 0–7 with the reversed stride order)
1st pass: threads 4, 5, 6, and 7 don't do anything. Really only need n/2 threads for n elements
2nd pass: threads 2 and 3 also don't do anything
3rd pass: thread 1 also doesn't do anything
Parallel Reduction What is the difference?
stride = 1, 2, 4, …: if (t % (2 * stride) == 0) partialSum[t] += partialSum[t + stride];
stride = 4, 2, 1, …: if (t < stride) partialSum[t] += partialSum[t + stride];
Warp Partitioning Warp Partitioning: how threads from a block are divided into warps Knowledge of warp partitioning can be used to: Minimize divergent branches Retire warps early
Understand warp partitioning to make your code run faster
Warp Partitioning Partition based on consecutive increasing threadIdx
Warp Partitioning 1D Block threadIdx.x between 0 and 511 (up to 512 threads per block on G80/GT200) Warp n Starts with thread 32n Ends with thread 32(n + 1) – 1 The last warp is padded if the block size is not a multiple of 32
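A small sketch of that mapping (the kernel and output array are illustrative): each thread computes the warp it belongs to.

__global__ void whichWarp1D(int* warpIds)
{
    // For a 1D block, warp n contains threads 32n .. 32(n + 1) - 1
    int warpId = threadIdx.x / warpSize;   // warpSize is a built-in variable (32 on G80/GT200)
    warpIds[threadIdx.x] = warpId;
}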
Warp Partitioning 2D Block Increasing threadIdx means increasing threadIdx.x first, starting with the row threadIdx.y == 0
Warp Partitioning 3D Block Start with threadIdx.z == 0 Partition as a 2D block Increase threadIdx.z and repeat
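A sketch of the linearization this implies for 2D and 3D blocks (illustrative names, not lecture code):

__global__ void whichWarp3D(int* warpIds)
{
    // Threads are linearized x-fastest, then y, then z;
    // warps are carved out of that linear order in groups of warpSize
    int tid = threadIdx.z * blockDim.y * blockDim.x
            + threadIdx.y * blockDim.x
            + threadIdx.x;
    warpIds[tid] = tid / warpSize;
}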
Warp Partitioning Divergent branches are within a warp!
Warp Partitioning For warpSize == 32, does any warp have a divergent branch with this code: if (threadIdx.x > 15) { //... }
Warp Partitioning For any warpSize, does any warp have a divergent branch with this code: if (threadIdx.x > warpSize) { //... }
Warp Partitioning Given knowledge of warp partitioning, which parallel reduction is better?
stride = 1, 2, 4, …: if (t % (2 * stride) == 0) partialSum[t] += partialSum[t + stride];
stride = 4, 2, 1, …: if (t < stride) partialSum[t] += partialSum[t + stride];
Warp Partitioning Pretend warpSize == 2 (figure: 8 threads split into Warps 0–3 for both the stride = 1, 2, 4, … version and the stride = 4, 2, 1, … version)
Warp Partitioning 1st Pass: stride = 1, 2, 4, … has 4 divergent branches; stride = 4, 2, 1, … has 0 divergent branches
Warp Partitioning 2nd Pass: stride = 1, 2, 4, … has 2 divergent branches; stride = 4, 2, 1, … has 0 divergent branches
Warp Partitioning 3rd Pass: stride = 1, 2, 4, … has 1 divergent branch; stride = 4, 2, 1, … has 1 divergent branch. Both still diverge once the number of elements left is <= warpSize
Warp Partitioning Good partitioning also allows warps to be retired early, for better hardware utilization
stride = 1, 2, 4, …: if (t % (2 * stride) == 0) partialSum[t] += partialSum[t + stride];
stride = 4, 2, 1, …: if (t < stride) partialSum[t] += partialSum[t + stride];
Warp Partitioning 1st Pass: stride = 1, 2, 4, … retires 0 warps; stride = 4, 2, 1, … retires 2 warps
Warp Partitioning 2nd Pass (stride = 1, 2, 4, … vs. stride = 4, 2, 1, …): 1 warp retired; 2 warps retired
Memory Coalescing Given a matrix stored row-major in global memory, what is a thread's desirable access pattern? (figure: 4×4 matrix M and its row-major linear layout in memory)
Memory Coalescing Given a matrix stored row-major in global memory, what is a thread's desirable access pattern? a) column after column? b) row after row? (figure: Md and Nd with Threads 0 and 1 traversing each pattern)
Memory Coalescing Given a matrix stored row-major in global memory, what is a thread's desirable access pattern?
a) column after column: individual threads read increasing, consecutive memory addresses
b) row after row: adjacent threads read increasing, consecutive memory addresses
Memory Coalescing (figure: access pattern a, column after column)
Memory Coalescing (figure: access pattern b, row after row)
Memory Coalescing Recall warp partitioning; if these threads are in the same warp, the global memory addresses they access are increasing and consecutive across the warp.
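A minimal sketch of the desirable pattern (b), assuming a hypothetical kernel that sums each column of a row-major n × n matrix: on every loop iteration, consecutive threads of a warp read consecutive addresses, so the reads coalesce.

__global__ void sumColumns(const float* M, float* colSum, int n)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per column
    if (col >= n) return;

    float s = 0.0f;
    for (int row = 0; row < n; ++row)
        s += M[row * n + col];   // the warp reads M[row * n + col .. col + 31]: consecutive

    colSum[col] = s;
}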
Memory Coalescing Global memory bandwidth (DRAM) G80 – 86.4 GB/s GT200 – 150 GB/s Achieve peak bandwidth by requesting large, consecutive locations from DRAM Accessing random locations results in much lower bandwidth
Memory Coalescing Memory coalescing – rearrange access patterns to improve performance Useful today but will be less useful with large on-chip caches
Memory Coalescing The GPU coalesces consecutive reads within a half-warp into a single read. What makes this possible? Strategy: read global memory in a coalesce-able fashion into shared memory, then access shared memory randomly at maximum bandwidth (ignoring bank conflicts – next lecture). See Appendix G in the NVIDIA CUDA C Programming Guide for coalescing alignment requirements
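A minimal sketch of that strategy, using matrix transpose as an example (the TILE size and kernel are assumptions, not the lecture's code; bank conflicts are ignored as noted above): global memory is read and written in a coalesce-able pattern, while the reordering happens in shared memory.

#define TILE 16

__global__ void transposeTiled(const float* in, float* out, int n)
{
    __shared__ float tile[TILE][TILE];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Coalesced read: adjacent threadIdx.x values touch consecutive addresses
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();

    // Coalesced write of the transposed tile; the shared memory reads are
    // "out of order" relative to the loads, which shared memory tolerates
    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    if (tx < n && ty < n)
        out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];
}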
SM Resource Partitioning Recall an SM dynamically partitions resources: registers, thread block slots, thread slots, shared memory
G80 limits per SM: 8 thread block slots, 768 thread slots, 8K (8192) 32-bit registers, 16KB shared memory
SM Resource Partitioning We can have 8 blocks of 96 threads, or 4 blocks of 192 threads, but not 8 blocks of 192 threads (that would exceed the 768 thread slots)
SM Resource Partitioning We can have (assuming blocks of 256 threads) 768 threads (3 blocks) using 10 registers each, or 512 threads (2 blocks) using 11 registers each
More registers per thread decreases thread-level parallelism. Can it ever increase performance?
SM Resource Partitioning Performance Cliff: increasing resource usage leads to a dramatic reduction in parallelism. For example, increasing the number of registers per thread usually hurts, unless doing so hides the latency of global memory access
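One way to keep an eye on this trade-off is the __launch_bounds__ qualifier, a sketch of which is below; the numbers are placeholders, not recommendations from the lecture. Per-kernel register usage can be inspected with nvcc --ptxas-options=-v.

// Request room for at least 3 resident blocks of up to 256 threads per SM;
// the compiler may use fewer registers per thread (or spill) to satisfy this.
__global__ void __launch_bounds__(256, 3) boundedKernel(float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * 2.0f;   // placeholder body
}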
Data Prefetching Independent instructions between a global memory read and its use can hide memory latency:
float m = Md[i];            // read global memory
float f = a * b + c * d;    // execute instructions that are not dependent on the memory read
float f2 = m * f;           // use the value read from global memory; executed by enough warps, the instructions above hide the memory latency
Data Prefetching Prefetching data from global memory can effectively increase the number of independent instructions between a global memory read and its use
Data Prefetching Recall tiled matrix multiply:
for (/* ... */)
{
    // Load current tile into shared memory
    __syncthreads();
    // Accumulate dot product
    __syncthreads();
}
Data Prefetching Tiled matrix multiply with prefetch:
// Load first tile into registers
for (/* ... */)
{
    // Deposit registers into shared memory
    __syncthreads();
    // Load next tile into registers (prefetch for the next iteration of the loop)
    // Accumulate dot product (these instructions, executed by enough threads, hide the memory latency of the prefetch)
    __syncthreads();
}
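A fleshed-out sketch of that structure, assuming square matrices whose width is a multiple of a hypothetical TILE_WIDTH; the names Md, Nd, and Pd mirror the matrix-multiply convention used earlier, but the code itself is illustrative rather than the lecture's exact kernel.

#define TILE_WIDTH 16

__global__ void matMulPrefetch(const float* Md, const float* Nd, float* Pd, int width)
{
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE_WIDTH + ty;
    int col = blockIdx.x * TILE_WIDTH + tx;

    // Load first tile into registers
    float mReg = Md[row * width + tx];
    float nReg = Nd[ty * width + col];

    float sum = 0.0f;
    int numTiles = width / TILE_WIDTH;

    for (int m = 0; m < numTiles; ++m)
    {
        // Deposit registers into shared memory
        Ms[ty][tx] = mReg;
        Ns[ty][tx] = nReg;
        __syncthreads();

        // Load next tile into registers (prefetch for the next iteration)
        if (m + 1 < numTiles)
        {
            mReg = Md[row * width + (m + 1) * TILE_WIDTH + tx];
            nReg = Nd[((m + 1) * TILE_WIDTH + ty) * width + col];
        }

        // Accumulate dot product; executed by enough threads,
        // these instructions hide the latency of the prefetch above
        for (int k = 0; k < TILE_WIDTH; ++k)
            sum += Ms[ty][k] * Ns[k][tx];
        __syncthreads();
    }

    Pd[row * width + col] = sum;
}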