Presentation transcript:

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 Control Flow/ Thread Execution

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 2 Objective To understand the implications of control flow on –Branch divergence overhead –SM execution resource utilization To learn better ways to write code with control flow To understand compiler/HW predication designed to reduce the impact of control flow –There is a cost involved.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 3 Quick terminology review Thread: concurrent code and associated state executed on the CUDA device (in parallel with other threads) –The unit of parallelism in CUDA Warp: a group of threads executed physically in parallel in the CUDA device Block: a group of warps that are executed together and form the unit of resource assignment Grid: a group of thread blocks that must all complete before the next kernel call of the program can take effect

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 408, University of Illinois, Urbana-Champaign 4 Grids and Blocks A kernel is executed as a grid of thread blocks –All threads share the global memory space A thread block is a batch of threads that can cooperate with each other by: –Synchronizing their execution using a barrier –Efficiently sharing data through low-latency shared memory –Two threads from two different blocks cannot cooperate
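
A minimal, self-contained sketch of how a grid of thread blocks is configured and launched from the host; the kernel name myKernel, the buffer d_data, and the problem size are assumptions for illustration, not part of the slides.

    #include <cuda_runtime.h>

    __global__ void myKernel(float *data, int n) {       // hypothetical kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // block index selects the tile, thread index the element
        if (i < n) data[i] *= 2.0f;
    }

    int main() {
        int n = 1 << 20;                                 // assumed problem size
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));          // all blocks of the grid share this global memory
        dim3 block(256);                                 // threads per block (1D)
        dim3 grid((n + block.x - 1) / block.x);          // enough blocks to cover n elements
        myKernel<<<grid, block>>>(d_data, n);
        cudaDeviceSynchronize();                         // the whole grid completes before the host continues
        cudaFree(d_data);
        return 0;
    }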

CUDA Thread Block: Review All threads in a block execute the same kernel program (SPMD) Programmer declares block: –Block size 1 to 1024 concurrent threads –Block shape 1D, 2D, or 3D –Block dimensions in threads Threads have thread index numbers within block –Kernel code uses thread index and block index to select work and address shared data Threads in the same block share data and synchronize while doing their share of the work Threads in different blocks cannot cooperate –Each block can execute in any order relative to other blocks! [Figure: a CUDA thread block with thread IDs 0 … m, each thread running the same thread program. Courtesy: John Nickolls, NVIDIA] © David Kirk/NVIDIA and Wen-mei Hwu, ECE408/CS483/ECE498al, University of Illinois, Urbana-Champaign

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 6 How thread blocks are partitioned Thread blocks are partitioned into warps Thread IDs within a warp are consecutive and increasing. If the block is 1D, then threadID = threadIdx.x –Warp 0: thread 0, …, thread 31 –Warp 1: thread 32, …, thread 63 –Warp n: thread 32*n, …, thread 32*(n+1)-1 –For blocks with a # of threads that is not a multiple of 32, the last warp is padded with extra threads to fill up the 32 threads, e.g. a block of 48 threads: 2 warps; warp 1 padded with 16 extra threads
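
A small sketch (hypothetical kernel name; the output arrays are assumed to have blockDim.x entries) of how a thread recovers its warp number and its lane within the warp from the linear thread index, using the built-in warpSize:

    __global__ void warpInfoKernel(int *warpIds, int *laneIds) {
        int t = threadIdx.x;            // linear thread index within a 1D block
        warpIds[t] = t / warpSize;      // warp 0 holds threads 0..31, warp 1 holds threads 32..63, ...
        laneIds[t] = t % warpSize;      // position of this thread inside its warp
    }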

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 7 How thread blocks are partitioned: 2D and 3D Thread Blocks Threads in a 2D block are linearized with threadIdx.x varying fastest: threads with threadIdx.y = 0 come first, followed by threads with threadIdx.y = 1, and so on. 3D thread blocks are partitioned in a similar way (i.e. threads with threadIdx.z = 1 follow all threads with threadIdx.z = 0, and so on).
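
A helper sketch (not from the slides) of the linearization rule that determines warp membership for 2D and 3D blocks: threadIdx.x varies fastest, then threadIdx.y, then threadIdx.z.

    __device__ unsigned int linearThreadIndex() {
        // Threads are numbered x-first, then y, then z; warps are carved out of this order.
        return threadIdx.x
             + threadIdx.y * blockDim.x
             + threadIdx.z * blockDim.x * blockDim.y;
    }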

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 8 How thread blocks are partitioned (cont.) Partitioning is always the same –Thus you can use this knowledge in control flow –The exact size of warps may change from generation to generation of devices However, DO NOT rely on any ordering between warps –If there are any dependencies between threads, you must use __syncthreads() to get correct results
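
A minimal sketch (assumed kernel and buffer names; blockDim.x assumed to be 256) of why the barrier is needed whenever a thread consumes data another thread produced; without it, nothing guarantees that the producing warp has executed yet.

    __global__ void rotateLeftKernel(const float *in, float *out) {
        __shared__ float tile[256];                     // assumes blockDim.x == 256
        int t = threadIdx.x;
        int base = blockIdx.x * blockDim.x;
        tile[t] = in[base + t];                         // each thread writes one element
        __syncthreads();                                // barrier: writes from ALL warps are now visible
        out[base + t] = tile[(t + 1) % blockDim.x];     // may read an element written by a different warp
    }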

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 9 Control Flow Instructions Hardware executes an instruction for all threads in the same warp before moving to the next instruction –SIMT: single instruction, multiple thread execution Works well when all threads in a warp follow the same control flow path Main performance concern with branching is divergence, i.e. threads within a single warp take different paths –Different execution paths are serialized Example: an if-then-else is executed in two passes –One pass for threads executing the then path –A second pass for threads executing the else path The control paths taken by the threads in a warp are traversed one at a time until there are no more.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 10 Control Flow Instructions (cont.) A common case: avoid divergence when the branch condition is a function of the thread ID Example with divergence: –if (threadIdx.x > 2) { } –This creates two different control paths for threads in a block –Branch granularity < warp size; threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp Example without divergence: –if (threadIdx.x / WARP_SIZE > 2) { } –Also creates two different control paths for threads in a block –Branch granularity is a whole multiple of warp size; all threads in any given warp follow the same path
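
A sketch (hypothetical kernels; the arithmetic in the branches is placeholder work) contrasting the two cases above:

    __global__ void divergentKernel(float *data) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Branch granularity < warp size: threads 0..2 of the first warp take a
        // different path than threads 3..31, so that warp runs both paths serially.
        if (threadIdx.x > 2)
            data[i] *= 2.0f;
        else
            data[i] += 1.0f;
    }

    __global__ void uniformKernel(float *data) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Branch granularity is a whole multiple of warp size: every thread of a
        // given warp evaluates the condition identically, so no warp diverges.
        if (threadIdx.x / warpSize > 2)
            data[i] *= 2.0f;
        else
            data[i] += 1.0f;
    }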

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 11 Parallel Reduction Given an array of values, “reduce” them to a single value in parallel Examples –sum reduction: sum of all values in the array –max reduction: maximum of all values in the array Typical parallel implementation: –Recursively halve the # of threads, add two values per thread –Takes log(n) steps for n elements, requires n/2 threads
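
A host-side sketch in plain C++ (hypothetical function name; assumes the element count is a power of two) of the recursive-halving idea before it is mapped onto threads; each pass halves the number of active positions, so n elements take log2(n) passes.

    #include <vector>

    float reduceHost(std::vector<float> v) {
        // Each pass of the outer loop corresponds to one parallel step with `active` threads.
        for (size_t active = v.size() / 2; active > 0; active /= 2)
            for (size_t i = 0; i < active; ++i)
                v[i] += v[i + active];   // each "thread" adds two values
        return v[0];                     // the final sum ends up in element 0
    }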

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 12 A Vector Reduction Example Assume an in-place reduction using shared memory –The original vector is in device global memory –Shared memory is used to hold a partial sum vector –Each iteration brings the partial sum vector closer to the final sum –The final solution will be in element 0

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 13 A simple implementation Assume we have already loaded the array into __shared__ float partialSum[]

    unsigned int t = threadIdx.x;
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        if (t % (2*stride) == 0)
            partialSum[t] += partialSum[t+stride];
    }
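
For context, a self-contained kernel sketch built around this loop; the kernel name reduceNaive, the fixed BLOCK_SIZE, and the per-block output array are assumptions, since the slide shows only the reduction loop itself.

    #define BLOCK_SIZE 256   // assumed block size

    __global__ void reduceNaive(const float *input, float *blockSums, int n) {
        __shared__ float partialSum[BLOCK_SIZE];
        unsigned int t = threadIdx.x;
        unsigned int i = blockIdx.x * blockDim.x + t;
        partialSum[t] = (i < n) ? input[i] : 0.0f;       // load one element per thread (0 beyond n)

        for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
            __syncthreads();
            if (t % (2 * stride) == 0)
                partialSum[t] += partialSum[t + stride];
        }
        if (t == 0)
            blockSums[blockIdx.x] = partialSum[0];       // one partial result per block
    }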

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 14 Vector Reduction with Bank Conflicts [Figure: the reduction tree over the array elements, one row per iteration, produced by if (t % (2*stride) == 0) partialSum[t] += partialSum[t+stride]; for blockDim.x = 32, t = threadIdx.x = 0, 1, 2, …, 31 and stride = 1, 2, 4, 8, 16.]

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 15 Vector Reduction with Branch Divergence [Figure: the same reduction tree over the array elements and iterations, with the active threads marked; in the first iteration only the even-numbered threads (Thread 0, Thread 2, Thread 4, Thread 6, Thread 8, Thread 10, …) perform an add.]

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 16 Some Observations In each iteration, two control flow paths will be sequentially traversed for each warp –Threads that perform addition and threads that do not –Threads that do not perform addition may still cost extra cycles, depending on the implementation of divergence No more than half of the threads will be executing at any time –All odd-index threads are disabled right from the beginning! –On average, fewer than ¼ of the threads will be active across all warps over time –For a block size of 512 (16 warps), starting after the 5th iteration (stride = 16), entire warps in each block will be disabled: poor resource utilization, but no divergence. This goes on for up to 4 more iterations (512/32 = 16 = 2^4), each with only one active thread per active warp, until all warps retire

Iteration   # Active Warps (in a 512-thread block)   # Active Threads
    1               16                               16*16 = 256
    2               16                               16*8  = 128
    3               16                               16*4  =  64
    4               16                               16*2  =  32
    5               16                               16*1  =  16
    6                8                                8*1  =   8
    7                4                                4*1  =   4
    8                2                                2*1  =   2
    9                1                                1*1  =   1
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 17

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 18 Shortcomings of the implementation Assume we have already loaded the array into __shared__ float partialSum[]

    unsigned int t = threadIdx.x;
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        if (t % (2*stride) == 0)
            partialSum[t] += partialSum[t+stride];
    }

BAD: Divergence due to interleaved branch decisions

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 19 A better implementation Assume we have already loaded the array into __shared__ float partialSum[]

    unsigned int t = threadIdx.x;
    for (unsigned int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
        __syncthreads();
        if (t < stride)
            partialSum[t] += partialSum[t+stride];
    }
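
The same self-contained framing around the revised loop, again with assumed names (reduceConvergent, BLOCK_SIZE) that are not on the slide.

    #define BLOCK_SIZE 256   // assumed block size

    __global__ void reduceConvergent(const float *input, float *blockSums, int n) {
        __shared__ float partialSum[BLOCK_SIZE];
        unsigned int t = threadIdx.x;
        unsigned int i = blockIdx.x * blockDim.x + t;
        partialSum[t] = (i < n) ? input[i] : 0.0f;

        // Active threads stay packed at the low indices, so whole warps retire early.
        for (unsigned int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
            __syncthreads();
            if (t < stride)
                partialSum[t] += partialSum[t + stride];
        }
        if (t == 0)
            blockSums[blockIdx.x] = partialSum[0];
    }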

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 20 Execution of the revised algorithm [Figure: for blockDim.x = 32, the array elements 0, 1, 2, 3, … are combined by if (t < stride) partialSum[t] += partialSum[t+stride]; over the iterations stride = 16, 8, 4, 2, 1; the active threads (thread 0 and its neighbors) stay packed at the low indices.]

No Divergence using the revised algorithm © David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 21 If blockDim.x = 512: in the 1st iteration, threads 0 through 255 execute the add and threads 256 through 511 do not; the pair-wise sums are stored in elements 0 through 255  All threads in warps 0 through 7 execute the add statement  All threads in warps 8 through 15 skip the add  All threads in each warp take the same path: there is no thread divergence!

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 22 Some Observations About the New Implementation Only the last 5 iterations will have divergence Entire warps will be shut down as iterations progress –For a 512-thread block, it takes 4 iterations to shut down all but one warp in each block –Better resource utilization, will likely retire warps faster (and thus blocks faster)

Iteration   # threads performing add   # active warps   # updated elements   Divergence
    1              256                       8                 256              No
    2              128                       4                 128              No
    3               64                       2                  64              No
    4               32                       1                  32              No
    5               16                       1                  16              Yes
    6                8                       1                   8              Yes
    7                4                       1                   4              Yes
    8                2                       1                   2              Yes
    9                1                       1                   1              Yes

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 23 A Potential Further Refinement For the last 6 loop iterations only one warp is active (i.e. tids 0..31) –Shared reads & writes are SIMD-synchronous within a warp So skip __syncthreads() and unroll the last 6 iterations

    unsigned int t = threadIdx.x;
    for (unsigned int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
        __syncthreads();
        if (t < stride)
            partialSum[t] += partialSum[t+stride];
    }

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 24 A Potential Further Refinement, but a bad idea

    unsigned int tid = threadIdx.x;
    for (unsigned int d = n >> 1; d > 32; d >>= 1) {
        __syncthreads();
        if (tid < d)
            shared[tid] += shared[tid + d];
    }
    __syncthreads();
    if (tid < 32) {
        // Unroll the last 6 predicated steps; shared reads
        // and writes are SIMD-synchronous within a warp.
        shared[tid] += shared[tid + 32];
        shared[tid] += shared[tid + 16];
        shared[tid] += shared[tid + 8];
        shared[tid] += shared[tid + 4];
        shared[tid] += shared[tid + 2];
        shared[tid] += shared[tid + 1];
    }

This would not work properly if the warp size decreases; we would need __syncthreads() between each statement. However, having __syncthreads() inside an if statement is problematic.
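
If one does want this warp-synchronous unrolling on hardware where it is valid, a common precaution (not shown on the slide) is to access the shared array through a volatile pointer so the compiler cannot keep intermediate values in registers; note that on newer GPUs with independent thread scheduling even this is not sufficient and explicit synchronization is still required. A sketch, intended as a drop-in replacement for the if (tid < 32) block above:

    if (tid < 32) {
        volatile float *vsmem = shared;    // volatile: each read/write goes to shared memory
        vsmem[tid] += vsmem[tid + 32];
        vsmem[tid] += vsmem[tid + 16];
        vsmem[tid] += vsmem[tid + 8];
        vsmem[tid] += vsmem[tid + 4];
        vsmem[tid] += vsmem[tid + 2];
        vsmem[tid] += vsmem[tid + 1];
    }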

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 25 Predicated Execution Concept <p1> LDR r1, r2, 0 –If p1 is TRUE, the instruction executes normally –If p1 is FALSE, the instruction is treated as a NOP

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 26 Predication Example

Source:
    :
    if (x == 10)
        c = c + 1;
    :

Predicated instructions:
    :
    LDR r5, X
    p1 <- r5 eq 10
    <p1> LDR r1 <- C
    <p1> ADD r1, r1, 1
    <p1> STR r1 -> C
    :
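
At the CUDA C level the same statement can be written without a branch at all, which is essentially the effect predication gives the hardware; a sketch with hypothetical kernel and buffer names:

    __global__ void countTensKernel(int *c, const int *x) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // (x[i] == 10) evaluates to 0 or 1, so every thread executes the same
        // instruction sequence and no divergent path is needed.
        c[i] += (x[i] == 10);
    }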

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 27 [Figure: if-else control-flow graph: block A branches to B or C, which rejoin at D; the diagram contrasts the divergent schedule with a predicated straight-line schedule of A, B, C, D.] Predication is very helpful for avoiding divergence in if-else code

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 28 If-else example

Before scheduling:
    :
    p1, p2 <- r5 eq 10
    <p1> inst 1 from B
    <p1> inst 2 from B
    :
    <p2> inst 1 from C
    <p2> inst 2 from C
    :

After scheduling (B and C interleaved):
    :
    p1, p2 <- r5 eq 10
    <p1> inst 1 from B
    <p2> inst 1 from C
    <p1> inst 2 from B
    <p2> inst 2 from C
    :

The cost is that extra instructions will be issued each time the code is executed. However, there is no branch divergence.