CS179: GPU PROGRAMMING Recitation 2 GPU Memory Synchronization


CS179: GPU PROGRAMMING Recitation 2: GPU memory, synchronization, instruction-level parallelism, latency hiding, matrix transpose

Main requirements for GPU performance
- Sufficient parallelism: latency hiding and occupancy, instruction-level parallelism
- Coherent execution within warps of threads
- Efficient memory usage: coalesced access to global memory, shared memory and bank conflicts

Latency hiding
Idea: have enough warps in flight to keep the GPU busy during the waiting time.
The number of threads (instructions) needed to hide latency depends on:
- Arithmetic instruction latency
- The number of instructions that can be issued per clock cycle
- The warp scheduler and dispatcher
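As a rough, hypothetical illustration of the arithmetic (the actual numbers vary by GPU generation): the amount of independent work needed is roughly latency × issue rate. With, say, a 10-cycle arithmetic latency and 4 instructions issued per cycle, an SM needs about 10 × 4 = 40 independent instructions in flight at all times; these can come from many resident warps (occupancy) or from independent instructions within each warp (instruction-level parallelism).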

Loop unrolling and ILP
Unrolling reduces loop overhead and increases parallelism when the iterations are independent, at the cost of higher register usage.
Original loop:
for (i = 0; i < 10; i++) {
    output[i] = a[i] + b[i];
}
Unrolled:
output[0] = a[0] + b[0];
output[1] = a[1] + b[1];
output[2] = a[2] + b[2];
...
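A minimal CUDA sketch of the same idea (the kernel and variable names are illustrative, not from the lab code): asking the compiler to unroll a grid-stride loop exposes several independent adds per thread while keeping each warp's global accesses coalesced.

__global__ void addArrays(const float *a, const float *b, float *output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    // Unrolling by 4 removes most of the loop overhead and gives the
    // scheduler 4 independent additions per thread to overlap (ILP).
    #pragma unroll 4
    for (int i = idx; i < n; i += stride) {
        output[i] = a[i] + b[i];
    }
}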

Synchronization
__syncthreads()
- Synchronizes all threads in a block
- Threads in a warp already execute in lockstep, so some __syncthreads() calls can be avoided when only intra-warp synchronization is needed
atomic{Add, Sub, Exch, Min, Max, Inc, Dec, CAS, And, Or, Xor}()
- Work on both global and shared memory
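A minimal sketch of why __syncthreads() matters (illustrative code, not from the lab): each block reverses its 128-element chunk of an array through shared memory, and every thread reads an element that a different thread wrote, so all stores must be finished first.

__global__ void reverseChunks(float *data, int n) {
    __shared__ float tile[128];                 // assumes blockDim.x == 128
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < n)
        tile[threadIdx.x] = data[idx];
    __syncthreads();                            // all stores to tile must be visible

    int src = blockDim.x - 1 - threadIdx.x;     // element written by another thread
    if (idx < n && blockIdx.x * blockDim.x + src < n)
        data[idx] = tile[src];
}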

Synchronization advice
Do more cheap things and fewer expensive things!
Example: computing the sum of a list of numbers.
- Naive: each thread atomically adds its number to an accumulator in global memory.
- Smarter solution (sketched below):
  - Each thread computes its own partial sum in a register
  - Use shared memory to sum across a block (next week: reduction)
  - Each block does a single atomic add to the global accumulator
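A minimal sketch of the smarter scheme (illustrative only; the reduction details are covered next week and the lab may use a different structure). Note that result must be zeroed before launch, and float atomicAdd requires compute capability 2.0 or newer.

__global__ void sumKernel(const float *data, int n, float *result) {
    __shared__ float partial[256];              // assumes blockDim.x == 256

    // 1. Each thread accumulates its own sum in a register.
    float local = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        local += data[i];
    partial[threadIdx.x] = local;
    __syncthreads();

    // 2. Tree reduction across the block in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    // 3. One expensive global atomic per block instead of one per element.
    if (threadIdx.x == 0)
        atomicAdd(result, partial[0]);
}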

Lab 2
Part 1: Conceptual questions
- Latency hiding
- Thread divergence
- Coalesced memory access
- Bank conflicts and instruction dependencies
Part 2: Matrix transpose optimization
- Naïve matrix transpose (given to you)
- Shared memory matrix transpose
- Optimal matrix transpose
You need to comment on all non-coalesced memory accesses and bank conflicts in the provided kernel code.

Matrix transpose
An interesting I/O problem, because it involves both a stride-1 access and a stride-n access; it is not a trivial access pattern like blur_v from Lab 1. The example output compares performance between the CPU implementation and the different GPU implementations.

Matrix transpose

__global__ void naiveTransposeKernel(const float *input, float *output, int n) {
    // launched with (64, 16) block size and (n / 64, n / 64) grid size
    // each block transposes a 64x64 block
    const int i = threadIdx.x + 64 * blockIdx.x;
    int j = 4 * threadIdx.y + 64 * blockIdx.y;
    const int end_j = j + 4;

    for (; j < end_j; j++) {
        output[j + n * i] = input[i + n * j];
    }
}

Questions:
- Which variable determines the warp?
- Which of the two accesses (the read and the write) is non-coalesced?
- How many cache lines does the write touch? (32)

Shared memory matrix transpose
Idea to avoid non-coalesced global accesses:
1. Load from global memory with stride 1
2. Store into shared memory with stride x
3. __syncthreads()
4. Load from shared memory with stride y
5. Store to global memory with stride 1
You need to choose values of x and y that perform the transpose. x = 64, y = 1 works: the global accesses are coalesced, but it gives a huge shared-memory bank conflict.
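A minimal sketch of the x = 64, y = 1 scheme (illustrative, not the lab's provided kernel; it keeps the (64, 16) block and 64x64 tile used by the naive kernel above):

__global__ void shmemTransposeKernel(const float *input, float *output, int n) {
    __shared__ float tile[64][64];

    const int x = threadIdx.x + 64 * blockIdx.x;
    const int y = 4 * threadIdx.y + 64 * blockIdx.y;

    // Coalesced read from global memory; the store into tile has
    // stride 64 across a warp, so all 32 threads hit the same bank
    // (a 32-way bank conflict).
    for (int k = 0; k < 4; k++)
        tile[threadIdx.x][4 * threadIdx.y + k] = input[x + n * (y + k)];
    __syncthreads();

    // Swap the block indices so the write to global memory is also
    // coalesced; the load from tile has stride 1, so no bank conflict.
    const int out_x = threadIdx.x + 64 * blockIdx.y;
    const int out_y = 4 * threadIdx.y + 64 * blockIdx.x;
    for (int k = 0; k < 4; k++)
        output[out_x + n * (out_y + k)] = tile[4 * threadIdx.y + k][threadIdx.x];
}

The remaining problem is the 32-way bank conflict on the store into tile; the next slides show where it comes from and how to remove it.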

Example of a shared memory cache
Let's populate shared memory with random integers. The next few slides step through what the first 8 of the 32 banks look like (the figures are not reproduced in this transcript).

Avoiding bank conflicts You can choose x and y to avoid bank conflicts. Remember that there are 32 banks and the GPU runs threads in batches of 32 (called warps). A stride n access to shared memory avoids bank conflicts iff gcd(n, 32) == 1.
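One common fix (an illustrative choice, not necessarily the one intended for the lab) is to pad each row of the shared-memory tile from the sketch above by one element, so the strided access walks addresses 65 apart, and gcd(65, 32) == 1:

__shared__ float tile[64][65];   // row stride becomes 65 instead of 64

With this padding, the store tile[threadIdx.x][4 * threadIdx.y + k] maps the 32 threads of a warp to 32 different banks, eliminating the conflict without changing any indexing logic.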

ta_utils.cpp
DO NOT DELETE THIS CODE!
- Included in the UNIX version of this problem set
- Should minimize lag or infinite waits on GPU function calls
- Please leave these functions in the code if you are using Titan/Haru/Maki
- Namespace: TA_Utilities