Lecture 3: CUDA Threads & Atomics

Slides:

Advertisements

Similar presentations

© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, ECE408 Applied Parallel Programming Lecture 17 Atomic.

Advertisements

Intermediate GPGPU Programming in CUDA

Background Concurrent access to shared data can lead to inconsistencies Maintaining data consistency among cooperating processes is critical What is wrong.

More on threads, shared memory, synchronization

Introduction to CUDA Programming Histograms and Sparse Array Multiplication Andreas Moshovos Winter 2009 Based on documents from: NVIDIA & Appendix A of.

Prepared 8/8/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

Instructor Notes This lecture deals with how work groups are scheduled for execution on the compute units of devices Also explain the effects of divergence.

CS 193G Lecture 5: Performance Considerations. But First! Always measure where your time is going! Even if you think you know where it is going Start.

1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, March 22, 2011 Branching.ppt Control Flow These notes will introduce scheduling control-flow.

An Introduction to Programming with CUDA Paul Richmond

SAGE: Self-Tuning Approximation for Graphics Engines

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lectures 7: Threading Hardware in G80.

Introduction to CUDA Programming Histograms and Sparse Array Multiplication Andreas Moshovos Winter 2009 Based on documents from: NVIDIA & Appendix A of.

© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, ECE408 Applied Parallel Programming Lecture 12 Parallel.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 CS 395 Winter 2014 Lecture 17 Introduction to Accelerator.

CUDA Optimizations Sathish Vadhiyar Parallel Programming.

Today’s lecture 2-Dimensional indexing Color Format Thread Synchronization within for- loops Shared Memory Tiling Review example programs Using Printf.

GPU Architecture and Programming

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 Control Flow/ Thread Execution.

CS 193G Lecture 7: Parallel Patterns II. Overview Segmented Scan Sort Mapreduce Kernel Fusion.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lectures 8: Threading Hardware in G80.

© David Kirk/NVIDIA and Wen-mei W. Hwu University of Illinois, CS/EE 217 GPU Architecture and Parallel Programming Lecture 15: Atomic Operations.

CUDA All material not from online sources/textbook copyright © Travis Desell, 2012.

© David Kirk/NVIDIA and Wen-mei W. Hwu University of Illinois, CS/EE 217 GPU Architecture and Parallel Programming Lecture 10 Reduction Trees.

Introduction to CUDA CAP 4730 Spring 2012 Tushar Athawale.

© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, ECE408 Applied Parallel Programming Lecture 19: Atomic.

Lecture 19 Revisting Strides, CUDA Threads… Topics Strides through memory Practical Performance considerationsReadings November 7, 2012 CSCE 513 Advanced.

My Coordinates Office EM G.27 contact time:

1 ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 28, 2013 Branching.ppt Control Flow These notes will introduce scheduling control-flow.

CUDA programming Performance considerations (CUDA best practices)

CS 179: GPU Computing LECTURE 2: MORE BASICS. Recap Can use GPU to solve highly parallelizable problems Straightforward extension to C++ ◦Separate CUDA.

CS 179: GPU Computing Recitation 2: Synchronization, Shared memory, Matrix Transpose.

© David Kirk/NVIDIA and Wen-mei W

December 1, 2006©2006 Craig Zilles1 Threads & Atomic Operations in Hardware  Previously, we introduced multi-core parallelism & cache coherence —Today.

Single Instruction Multiple Threads

Lecture 5: Performance Considerations

Sathish Vadhiyar Parallel Programming

Atomic Operations in Hardware

Atomic Operations in Hardware

Atomicity CS 2110 – Fall 2017.

Lecture 2: Intro to the simd lifestyle and GPU internals

Lecture 5: GPU Compute Architecture

Lecture 16 Revisiting Strides, CUDA Threads…

Recitation 2: Synchronization, Shared memory, Matrix Transpose

Mattan Erez The University of Texas at Austin

CS 179: GPU Programming Lecture 7.

L18: CUDA, cont. Memory Hierarchy and Examples

Presented by: Isaac Martin

Lecture 5: GPU Compute Architecture for the last time

CUDA Parallelism Model

© David Kirk/NVIDIA and Wen-mei W. Hwu,

Parallel Computation Patterns (Scan)

Parallel Computation Patterns (Reduction)

CS 179: Lecture 3.

Implementing Mutual Exclusion

ECE 498AL Lecture 15: Reductions and Their Implementation

ECE498AL Spring 2010 Lecture 4: CUDA Threads – Part 2

©Sudhakar Yalamanchili and Jin Wang unless otherwise noted

General Purpose Graphics Processing Units (GPGPUs)

ECE 498AL Lecture 10: Control Flow

© David Kirk/NVIDIA and Wen-mei W. Hwu,

CS333 Intro to Operating Systems

Mattan Erez The University of Texas at Austin

CS179: GPU PROGRAMMING Recitation 2 GPU Memory Synchronization

ECE 498AL Spring 2010 Lecture 10: Control Flow

Lecture 5: Synchronization and ILP

Parallel Computation Patterns (Histogram)

6- General Purpose GPU Programming

CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming

Presentation transcript:

Lecture 3: CUDA Threads & Atomics

atomics

The Problem How do you do global communication? Finish a grid and start a new one

Global Communication Finish a kernel and start a new one All writes from all threads complete before a kernel finishes step1<<<grid1,blk1>>>(...); // The system ensures that all // writes from step1 complete. step2<<<grid2,blk2>>>(...);

Global Communication Would need to decompose kernels into before and after parts

Race Conditions Or, write to a predefined memory location Race condition! Updates can be lost

Race Conditions What is the value of a in thread 0? threadId:0 threadId:1917 // vector[0] was equal to 0 vector[0] += 5; vector[0] += 1; ... ... a = vector[0]; a = vector[0]; What is the value of a in thread 0? What is the value of a in thread 1917?

Race Conditions Thread 0 could have finished execution before 1917 started Or the other way around Or both are executing at the same time

Race Conditions Answer: not defined by the programming model, can be arbitrary

Atomics CUDA provides atomic operations to deal with this problem

Atomics An atomic operation guarantees that only a single thread has access to a piece of memory while an operation completes The name atomic comes from the fact that it is uninterruptable No dropped data, but ordering is still arbitrary Different types of atomic instructions atomic{Add, Sub, Exch, Min, Max, Inc, Dec, CAS, And, Or, Xor} More types in fermi

Example: Histogram __global__ void histogram(int* color, int* buckets) // Determine frequency of colors in a picture // colors have already been converted into ints // Each thread looks at one pixel and increments // a counter atomically __global__ void histogram(int* color, int* buckets) { int i = threadIdx.x + blockDim.x * blockIdx.x; int c = colors[i]; atomicAdd(&buckets[c], 1); }

Example: Workqueue __global__ void workq(int* work_q, int* q_counter, // For algorithms where the amount of work per item // is highly non-uniform, it often makes sense for // to continuously grab work from a queue __global__ void workq(int* work_q, int* q_counter, int* output, int queue_max) { int i = threadIdx.x + blockDim.x * blockIdx.x; int q_index = atomicInc(q_counter, queue_max); int result = do_work(work_q[q_index]); output[i] = result; }

Atomics Atomics are slower than normal load/store You can have the whole machine queuing on a single location in memory Atomics unavailable on G80!

Example: Global Min/Max (Naive) // If you require the maximum across all threads // in a grid, you could do it with a single global // maximum value, but it will be VERY slow __global__ void global_max(int* values, int* gl_max) { int i = threadIdx.x + blockDim.x * blockIdx.x; int val = values[i]; atomicMax(gl_max,val); }

Example: Global Min/Max (Better) // introduce intermediate maximum results, so that // most threads do not try to update the global max __global__ void global_max(int* values, int* max, int *regional_maxes, int num_regions) { // i and val as before … int region = i % num_regions; if(atomicMax(&reg_max[region],val) < val) atomicMax(max,val); }

Global Min/Max Single value causes serial bottleneck Create hierarchy of values for more parallelism Performance will still be slow, so use judiciously See next lecture for even better version!

Summary Can’t use normal load/store for inter-thread communication because of race conditions Use atomic instructions for sparse and/or unpredictable global communication See next lectures for shared memory and scan for other communication patterns Decompose data (very limited use of single global sum/max/min/etc.) for more parallelism

Questions?

SM Execution & Divergence

How an SM executes threads Overview of how a Stream Multiprocessor works SIMT Execution Divergence

Scheduling Blocks onto SMs Streaming Multiprocessor Thread Block 5 Thread Block 27 Thread Block 61 Thread Block 2001 HW Schedules thread blocks onto available SMs No guarantee of ordering among thread blocks HW will schedule thread blocks as soon as a previous thread block finishes

Warps A warp = 32 threads launched together Control Control Control Control Control Control ALU ALU ALU ALU ALU ALU Control ALU ALU ALU ALU ALU ALU A warp = 32 threads launched together Usually, execute together as well

Mapping of Thread Blocks Each thread block is mapped to one or more warps The hardware schedules each warp independently Thread Block N (128 threads) TB N W1 TB N W2 TB N W3 TB N W4

Thread Scheduling Example SM implements zero-overhead warp scheduling At any time, only one of the warps is executed by SM * Warps whose next instruction has its inputs ready for consumption are eligible for execution Eligible Warps are selected for execution on a prioritized scheduling policy All threads in a warp execute the same instruction when selected

Control Flow Divergence What happens if you have the following code? if(foo(threadIdx.x)) { do_A(); } else { do_B();

Control Flow Divergence Branch Path A Path B Branch Path A Path B From Fung et al. MICRO ‘07

Control Flow Divergence Nested branches are handled as well if(foo(threadIdx.x)) { if(bar(threadIdx.x)) do_A(); else do_B(); } do_C();

Control Flow Divergence Branch Branch Branch Path A Path B Path C

Control Flow Divergence You don’t have to worry about divergence for correctness (*) You might have to think about it for performance Depends on your branch conditions * Mostly true, except corner cases (for example intra-warp locks)

Control Flow Divergence Performance drops off with the degree of divergence switch(threadIdx.x % N) { case 0: ... case 1: }

Divergence

Atomics atomicAdd returns the previous value at a certain address Useful for grabbing variable amounts of data from a list

Questions?

Backup

Compare and Swap int compare_and_swap(int* register, int oldval, int newval) { int old_reg_val = *register; if(old_reg_val == oldval) *register = newval; return old_reg_val; }

Compare and Swap Most general type of atomic Can emulate all others with CAS

Locks Use very judiciously Always include a max_iter in your spinloop! Decompose your data and your locks