CS 179: GPU Computing. Recitation 2: Synchronization, Shared memory, Matrix Transpose

Synchronization
The ideal case for parallelism: no resources shared between threads and no communication between threads. Many algorithms that require just a little bit of resource sharing can still be accelerated by the massive parallelism of a GPU. Shared resources and communication between threads are really two views of the same thing.

Examples needing synchronization
Parallel BFS, summing a list of numbers, and loading data into a GPU's shared memory. For the shared-memory example, assume a thread will touch elements in shared memory that were loaded by another thread: we can't start processing until we know all threads have finished loading their data into shared memory. When summing a list of numbers, multiple threads/blocks all need to write to a single memory address, which requires some synchronization. (Classic analogy: the dining philosophers problem.)

__syncthreads()
__syncthreads() synchronizes all threads in a block. Remember that shared memory is per block: every block that is launched has to allocate shared memory for itself on its resident SM. The __syncthreads() call is very useful for kernels using shared memory.
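As a minimal sketch (an assumed example, not course-provided code; the kernel name smoothKernel is hypothetical), this is how __syncthreads() typically pairs with shared memory: every thread loads one element, the block synchronizes, and only then do threads read elements loaded by their neighbors.

__global__ void smoothKernel(const float *in, float *out, int n) {
    // Assumes a launch with 256 threads per block and n a multiple of 256.
    __shared__ float tile[256];              // one element per thread
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[idx];             // each thread loads its own element
    __syncthreads();                         // wait until the whole tile is loaded
    // Now it is safe to read elements loaded by other threads in the block.
    float left  = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : tile[threadIdx.x];
    float right = (threadIdx.x < blockDim.x - 1) ? tile[threadIdx.x + 1] : tile[threadIdx.x];
    out[idx] = (left + tile[threadIdx.x] + right) / 3.0f;
}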

Atomic instructions: motivation
Two threads try to increment the variable x = 42 concurrently. The final value should be 44. A possible execution order:
thread 0 loads x (=42) into register r0
thread 1 loads x (=42) into register r1
thread 0 increments r0 to 43
thread 1 increments r1 to 43
thread 0 stores r0 (=43) into x
thread 1 stores r1 (=43) into x
Actual final value of x: 43 :(

Atomic instructions
An atomic instruction executes as a single unit and cannot be interrupted; it serializes access to the memory location it touches. It is named after the atom as the "smallest unit," even though that isn't really true of atoms. Atomic adds solve the problem on the previous slide.

Atomic instructions in CUDA
atomic{Add, Sub, Exch, Min, Max, Inc, Dec, CAS, And, Or, Xor}
Example syntax: atomicAdd(float *address, float val)
These work on both global and shared memory!
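For example, a minimal sketch (assumed code, not from the slides; naiveSumKernel is a hypothetical name): every thread atomically adds its own element to a single accumulator in global memory, so no increments are lost, but all threads contend for the same address.

__global__ void naiveSumKernel(const float *data, float *result, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        atomicAdd(result, data[idx]);   // one atomic per element: correct but expensive
}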

(Synchronization) budget advice
Do more cheap things and fewer expensive things! Example: computing the sum of a list of numbers. The naive approach: each thread atomically adds its number to an accumulator in global memory (as in the sketch above). Can anyone think of a better solution?

Sum example
A smarter solution: each thread computes its own partial sum in a register, a warp shuffle is used to compute the sum over each warp, and each warp then does a single atomic add to the accumulator in global memory. This reduces the number of atomic instructions by a factor of 32 (the warp size). Optionally, you could compute the sum over the full block in shared memory first, but shared memory atomics are not much faster than global ones (before Maxwell), so it is unclear whether that would perform better.
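A minimal sketch of this scheme (assumed code, not the course's reference solution; warpSumKernel is a hypothetical name), using __shfl_down_sync for the intra-warp reduction:

__global__ void warpSumKernel(const float *data, float *result, int n) {
    // Assumes a 1-D block whose size is a multiple of 32.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (idx < n) ? data[idx] : 0.0f;

    // Tree reduction within the warp: after 5 shuffle steps,
    // lane 0 holds the sum of all 32 lanes.
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);

    // One atomic per warp instead of one per thread.
    if (threadIdx.x % 32 == 0)
        atomicAdd(result, val);
}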

Warp-synchronous programming
What if I only need to synchronize between all threads in a warp? Warps are already synchronized! This can reduce __syncthreads() calls. Why is "synchronizing between threads in a warp" silly?

Quick aside: blur_v from Lab 1
Shared memory is a great place to put blur_v: it is relatively small and easily fits in shared memory, and every thread reads from it. Reads of a single blur_v element are stride-0 (broadcast) accesses, and there are no bank conflicts when i > GAUSSIAN_SIZE (the majority of threads).

Lab 2
Questions on latency hiding, thread divergence, coalesced memory access, bank conflicts, and instruction dependencies. What you actually have to do: comment on all non-coalesced memory accesses and bank conflicts in the provided kernel code, then improve the matrix transpose kernel using cache and memory optimizations.

Matrix Transpose
An interesting I/O problem, because you have a stride-1 access and a stride-n access; it is not a trivial access pattern like blur_v from Lab 1. Transpose is just a fancy memcpy, so memcpy provides a great performance target. Note: the example output on this slide is for a clean project without the shmem and optimal kernels completed; your final output should show a decline in kernel time across the different kernels.

Matrix Transpose: naive kernel

__global__
void naiveTransposeKernel(const float *input, float *output, int n) {
    // launched with (64, 16) block size and (n / 64, n / 64) grid size
    // each block transposes a 64x64 block
    const int i = threadIdx.x + 64 * blockIdx.x;
    int j = 4 * threadIdx.y + 64 * blockIdx.y;
    const int end_j = j + 4;

    for (; j < end_j; j++)
        output[j + n * i] = input[i + n * j];
}

Which variable determines the warp? Which of the two accesses (the read and the write) is non-coalesced? How many cache lines does the write touch? (32)

Shared memory & matrix transpose
Idea to avoid non-coalesced accesses:
load from global memory with stride 1
store into shared memory with stride x
__syncthreads()
load from shared memory with stride y
store to global memory with stride 1
Choose values of x and y that perform the transpose. x = 64, y = 1 works: it gives a huge bank conflict, but all global accesses are coalesced. A sketch of this idea is shown below.
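The following is only a sketch of that idea (assuming the same (64, 16) block and (n / 64, n / 64) grid launch as the naive kernel, with n a multiple of 64), not the official Lab 2 solution; the kernel name and tile layout are illustrative.

__global__ void shmemTransposeKernel(const float *input, float *output, int n) {
    __shared__ float tile[64][64];               // one 64x64 tile of the input

    int i = threadIdx.x + 64 * blockIdx.x;       // input column
    int j = 4 * threadIdx.y + 64 * blockIdx.y;   // first input row for this thread

    // Coalesced global read (stride 1); the shared-memory store has stride 64
    // across a warp (the "x = 64" store), i.e. a 32-way bank conflict.
    for (int k = 0; k < 4; k++)
        tile[threadIdx.x][4 * threadIdx.y + k] = input[i + n * (j + k)];

    __syncthreads();                             // the whole tile must be loaded first

    i = threadIdx.x + 64 * blockIdx.y;           // output column
    j = 4 * threadIdx.y + 64 * blockIdx.x;       // first output row for this thread

    // Conflict-free shared-memory load (stride 1, the "y = 1" load);
    // coalesced global write (stride 1).
    for (int k = 0; k < 4; k++)
        output[i + n * (j + k)] = tile[4 * threadIdx.y + k][threadIdx.x];
}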

Bank Conflicts
(Figure from the slides: an example of an SM's shared memory, populated with random integers, stepping through what the first 8 of the 32 banks look like.)

Avoiding bank conflicts
You can choose x and y to avoid bank conflicts. Remember that there are 32 banks and the GPU runs threads in batches of 32 (called warps). A stride-n access to shared memory avoids bank conflicts iff gcd(n, 32) == 1.
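One common way to satisfy this in the transpose sketch above (again an assumption about one possible implementation, not the official solution) is to pad the tile by one column, turning the stride-64 shared-memory access into a stride-65 one, and gcd(65, 32) == 1:

    __shared__ float tile[64][65];   // 65-column rows: the former stride-64 store
                                     // now has stride 65, and gcd(65, 32) == 1,
                                     // so consecutive threads hit different banks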

ta_utils.cpp
Included in the UNIX version of this problem set. It should minimize lag or infinite waits on GPU function calls. Please leave these functions in the code if you are using Titan/Haru/Maki. Namespace: TA_Utilities.