L15: CUDA, cont. Memory Hierarchy and Examples

L15: CUDA, cont. Memory Hierarchy and Examples (November 1, 2011)

Administrative
- Final projects
- Programming Assignment 3

Programming Assignment 3, due 11:59PM Nov. 7
Purpose: Synthesize the concepts you have learned so far: data parallelism, locality, and task parallelism. The image processing computation is adapted from a real application.
Turn in using the handin program on the CADE machines: handin cs4961 proj3 <file>. Include code + README.
Three parts:
- Locality optimization (50%): Improve performance using locality optimizations only (no parallelism).
- Data parallelism (20%): Improve performance using locality optimizations plus data parallel constructs in OpenMP.
- Task parallelism (30%): The code will not be faster by adding task parallelism, but you can improve time to first result. Use task parallelism in conjunction with data parallelism, per my message from the previous assignment.

Here's the code

  /* compute array convolution */
  ... initialize th[i][j] = 0 ...
  for (m = 0; m < IMAGE_NROWS - TEMPLATE_NROWS + 1; m++) {
    for (n = 0; n < IMAGE_NCOLS - TEMPLATE_NCOLS + 1; n++) {
      for (i = 0; i < TEMPLATE_NROWS; i++) {
        for (j = 0; j < TEMPLATE_NCOLS; j++) {
          if (mask[i][j] != 0) {
            th[m][n] += image[i+m][j+n];
          }
        }
      }
    }
  }

  /* scale array with bright count and template bias */
  ... th[i][j] = th[i][j] * bc - bias; ...
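
As a rough illustration of the data-parallelism part of the assignment, here is a minimal sketch (my own example, not a provided solution) of the same convolution with the two outer loops parallelized in OpenMP. Each (m, n) output element is independent, so no synchronization is needed. The array sizes are hypothetical placeholder values; the names image, mask, th, and the size macros follow the snippet above.

  /* Minimal, self-contained sketch; sizes are assumed, not the assignment's. */
  #include <omp.h>

  #define IMAGE_NROWS    1024
  #define IMAGE_NCOLS    1024
  #define TEMPLATE_NROWS 16
  #define TEMPLATE_NCOLS 16

  static float image[IMAGE_NROWS][IMAGE_NCOLS];
  static float mask[TEMPLATE_NROWS][TEMPLATE_NCOLS];
  static float th[IMAGE_NROWS][IMAGE_NCOLS];

  void convolve(void)
  {
    /* Each (m, n) output element is independent, so the two outer loops
       can be collapsed and divided among threads without a race. */
    #pragma omp parallel for collapse(2)
    for (int m = 0; m < IMAGE_NROWS - TEMPLATE_NROWS + 1; m++) {
      for (int n = 0; n < IMAGE_NCOLS - TEMPLATE_NCOLS + 1; n++) {
        float sum = 0.0f;   /* accumulate in a scalar so the hot loop stays in a register */
        for (int i = 0; i < TEMPLATE_NROWS; i++)
          for (int j = 0; j < TEMPLATE_NCOLS; j++)
            if (mask[i][j] != 0)
              sum += image[i + m][j + n];
        th[m][n] = sum;
      }
    }
  }

Accumulating into a local scalar instead of th[m][n] is one example of the kind of locality change the first part of the assignment is after; the OpenMP pragma then layers data parallelism on top without changing the loop body.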

Things to think about
- Beyond the assigned work, how does parallelization affect the profitability of locality optimizations?
- What happens if you make the image size larger (1024x1024 or even 2048x2048)? You'll need to use "unlimit stacksize" to run these.
- What happens if you repeat this experiment on a different architecture with a different memory hierarchy (in particular, a smaller L2 cache)?
- How do SSE or other multimedia extensions affect performance and optimization selection?

Shared Memory Architecture 2: Sun UltraSPARC T2 Niagara ("water")
[Figure: eight processor cores, each with its own FPU, connected through a full crossbar interconnect to four 512KB L2 cache banks, each with its own memory controller.]

More on Niagara
- Target applications are server-class, business operations
- Characterization: Floating point? Array-based computation?
- Support for the VIS 2.0 SIMD instruction set
- 64-way multithreading (8-way per processor, 8 processors)

Targets of Memory Hierarchy Optimizations
- Reduce memory latency: the latency of a memory access is the time (usually in cycles) between a memory request and its completion. Optimizations: data placement in a nearby portion of the memory hierarchy (focus on registers and shared memory in this class).
- Maximize memory bandwidth: bandwidth is the amount of useful data that can be retrieved over a time interval. Optimizations: global memory coalescing, avoiding shared memory bank conflicts.
- Manage overhead: the cost of performing an optimization (e.g., copying) should be less than the anticipated gain. This requires sufficient reuse to amortize, for example, the cost of copies to shared memory.

Global Memory Accesses
- Each thread issues memory accesses to data types of varying sizes, perhaps as small as 1-byte entities.
- Given an address to load or store, memory returns/updates "segments" of either 32 bytes, 64 bytes, or 128 bytes.
- Maximizing bandwidth: operate on an entire 128-byte segment for each memory transfer.

Understanding Global Memory Accesses
Memory protocol for compute capability 1.2* (CUDA Manual 5.1.2.1):
1. Start with the memory request by the smallest-numbered thread. Find the memory segment that contains the address (32-, 64-, or 128-byte segment, depending on data type).
2. Find other active threads requesting addresses within that segment and coalesce.
3. Reduce the transaction size if possible.
4. Access memory and mark the serviced threads as "inactive".
5. Repeat until all threads in the half-warp are serviced.
*Includes Tesla and GTX platforms

Memory Layout of a Matrix in C
[Figure, shown on two slides: a 4x4 matrix M laid out in row-major order in linear memory (M0,0 M1,0 M2,0 M3,0 M0,1 ...), with the access direction in the kernel code running against that layout. Threads T1-T4 access elements over time periods 1 and 2; the two slides contrast the case in which the elements touched by the threads in a given time period are adjacent in the linear layout (coalesced) with the case in which they are spread far apart (not coalesced).]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign
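
To make the coalescing discussion concrete, here is a small sketch (my own example, not from the slides) of two ways a kernel might read a row-major n x n matrix stored as a flat array: one in which consecutive threads read consecutive addresses, and one in which neighboring threads are separated by a whole row. The kernel and variable names are hypothetical.

  // Minimal sketch: assumes m is a row-major n x n matrix in global memory
  // and n is a multiple of blockDim.x.

  // Coalesced: at each iteration, thread t reads m[row*n + t], so the threads
  // of a (half-)warp touch consecutive 4-byte words in the same segment.
  __global__ void col_sum_coalesced(const float* m, float* out, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int row = 0; row < n; row++)
      sum += m[row * n + col];     // consecutive threads -> consecutive addresses
    out[col] = sum;
  }

  // Not coalesced: at each iteration, neighboring threads read addresses that
  // are n floats apart, forcing separate memory transactions.
  __global__ void row_sum_strided(const float* m, float* out, int n) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int col = 0; col < n; col++)
      sum += m[row * n + col];     // stride of n floats between neighboring threads
    out[row] = sum;
  }

On the compute-capability-1.2 protocol described above, the first kernel services a half-warp with one or two segment transactions per iteration, while the second needs a separate transaction per thread.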

Now Let's Look at Shared Memory
Common programming pattern (5.1.2 of CUDA manual):
1. Load data from global memory into shared memory
2. Synchronize (if necessary)
3. Operate on data in shared memory
4. Write intermediate results to global memory
5. Repeat until done
[Figure: data moving between global memory and shared memory.]
Familiar concept???
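
As a minimal illustration of this pattern (my own sketch, not from the lecture), the kernel below stages a block-sized tile of its input in shared memory, synchronizes, computes on the tile, and writes results back to global memory. The kernel name and the simple adjacent-sum computation are placeholders; it assumes blockDim.x == TILE.

  #define TILE 128

  // Sketch of the load -> sync -> compute -> store pattern:
  // out[i] = in[i] + in[i+1], with the needed elements staged in shared memory.
  // Launch with: adjacent_sum<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n);
  __global__ void adjacent_sum(const float* in, float* out, int n) {
    __shared__ float tile[TILE + 1];        // +1 slot for the neighbor of the last element
    int i = blockIdx.x * TILE + threadIdx.x;

    // 1. Load data into shared memory (coalesced: consecutive threads, consecutive addresses)
    if (i < n) tile[threadIdx.x] = in[i];
    if (threadIdx.x == 0 && i + TILE < n)   // one thread fetches the extra boundary element
      tile[TILE] = in[i + TILE];
    __syncthreads();                        // 2. Synchronize before anyone reads the tile

    // 3. Operate on data in shared memory, 4. write results back to global memory
    if (i < n - 1)
      out[i] = tile[threadIdx.x] + tile[threadIdx.x + 1];
  }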

Mechanics of Using Shared Memory
- __shared__ type qualifier required
- Must be allocated from a global/device function, or as "extern"
Examples:

  /* a form of dynamic allocation */
  extern __shared__ float d_s_array[];

  /* MEMSIZE is the size of per-block shared memory */
  __host__ void outerCompute() {
    compute<<<gs, bs, MEMSIZE>>>();
  }

  __global__ void compute() {
    d_s_array[i] = ...;
  }

  __global__ void compute2() {
    __shared__ float d_s_array[M];
    /* create or copy from global memory */
    d_s_array[j] = ...;
    /* write result back to global memory */
    d_g_array[j] = d_s_array[j];
  }

Bandwidth to Shared Memory: Parallel Memory Accesses
- Consider each thread accessing a different location in shared memory
- Bandwidth is maximized if each one is able to proceed in parallel
- Hardware to support this: banked memory, where each bank can support an access on every memory cycle

Bank Addressing Examples
- No bank conflicts: linear addressing, stride == 1
- No bank conflicts: random 1:1 permutation of threads to banks
[Figure: threads 0-15 mapped one-to-one onto banks 0-15 in each case.]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign

Bank Addressing Examples
- 2-way bank conflicts: linear addressing, stride == 2
- 8-way bank conflicts: linear addressing, stride == 8
[Figure: with stride 2, pairs of threads land on the same even-numbered bank; with stride 8, the 16 threads fall on just two banks, 8 threads each.]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign

How addresses map to banks on G80 (older technology)
- Each bank has a bandwidth of 32 bits per clock cycle
- Successive 32-bit words are assigned to successive banks
- G80 has 16 banks, so bank = address % 16
- Same as the size of a half-warp: no bank conflicts between different half-warps, only within a single half-warp
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign

Shared memory bank conflicts
- Shared memory is as fast as registers if there are no bank conflicts
- The fast case:
  - If all threads of a half-warp access different banks, there is no bank conflict
  - If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
- The slow case:
  - Bank conflict: multiple threads in the same half-warp access the same bank
  - The accesses must be serialized
  - Cost = max # of simultaneous accesses to a single bank
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign
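
As a small sketch of how conflicts arise in practice (my own example, assuming the 16-bank, half-warp organization described above, and blockDim.x == 16), the kernel below reads a shared array with stride 1 and stride 2 and shows the usual padding trick for a 2D tile. The names s, tile, and bank_conflict_demo are hypothetical.

  #define NBANKS 16   // bank count on G80-class hardware (half-warp size)

  __global__ void bank_conflict_demo(float* out) {
    __shared__ float s[2 * NBANKS];
    __shared__ float tile[NBANKS][NBANKS + 1];   // +1 column of padding

    int tx = threadIdx.x;                        // assume blockDim.x == NBANKS
    s[tx] = (float)tx;
    s[NBANKS + tx] = (float)(NBANKS + tx);
    for (int i = 0; i < NBANKS; i++) tile[i][tx] = (float)(i * NBANKS + tx);
    __syncthreads();

    // Stride 1: thread tx reads word tx -> bank tx; all different, no conflict.
    float v1 = s[tx];

    // Stride 2: thread tx reads word 2*tx -> bank (2*tx) % 16; threads tx and
    // tx+8 hit the same bank, a 2-way conflict that serializes the access.
    float v2 = s[2 * tx];

    // Column access of an unpadded [16][16] tile would put every thread in the
    // same bank (stride of 16 words). With the padding column the row length is
    // 17 words, so tile[tx][0] is word 17*tx -> bank (17*tx) % 16 = tx: no conflict.
    float v3 = tile[tx][0];

    out[tx] = v1 + v2 + v3;
  }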

Example: Matrix vector multiply

  for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
      a[i] += c[j][i] * b[j];
    }
  }

Remember to:
- Consider data dependences in the parallelization strategy to avoid race conditions
- Derive a partition that performs global memory coalescing
- Exploit locality in shared memory and registers

Let's Take a Closer Look
- Implicitly use tiling to decompose the parallel computation into independent work
- Additional tiling is used to map portions of "b" to shared memory, since it is shared across threads
- "a" has reuse within a thread, so use a register

Resulting CUDA code (Automatically Generated by our Research Compiler)

  __global__ void mv_GPU(float* a, float* b, float** c) {
    int bx = blockIdx.x;
    int tx = threadIdx.x;
    __shared__ float bcpy[32];
    double acpy = a[tx + 32 * bx];
    for (int k = 0; k < 32; k++) {
      bcpy[tx] = b[32 * k + tx];
      __syncthreads();
      // this loop is actually fully unrolled
      for (int j = 32 * k; j < 32 * k + 32; j++) {
        acpy = acpy + c[j][32 * bx + tx] * bcpy[j - 32 * k];
      }
      __syncthreads();
    }
    a[tx + 32 * bx] = acpy;
  }
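
For context, a host-side launch for this kernel might look like the sketch below (my own illustration, not part of the generated code, with the kernel above assumed visible in the same file). Because the kernel hard-codes 32 tiles of 32 elements, the sketch assumes n = 1024, one 32-thread block per 32 elements of a, and c passed as a device array of device row pointers to match the float** parameter.

  #include <cuda_runtime.h>
  #include <stdlib.h>

  // Hypothetical host-side setup for mv_GPU; the kernel above assumes n == 1024.
  void launch_mv(float* h_a, const float* h_b, const float* h_c, int n) {
    float *d_a, *d_b, *d_cdata, **d_c;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMalloc(&d_cdata, n * n * sizeof(float));   // matrix storage, row-major
    cudaMalloc(&d_c, n * sizeof(float*));          // array of row pointers

    cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_cdata, h_c, n * n * sizeof(float), cudaMemcpyHostToDevice);

    // Build the row-pointer array on the host, then copy it to the device.
    float** h_rows = (float**)malloc(n * sizeof(float*));
    for (int i = 0; i < n; i++) h_rows[i] = d_cdata + i * n;
    cudaMemcpy(d_c, h_rows, n * sizeof(float*), cudaMemcpyHostToDevice);
    free(h_rows);

    mv_GPU<<<n / 32, 32>>>(d_a, d_b, d_c);         // one 32-thread block per 32 elements of a
    cudaDeviceSynchronize();
    cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_cdata); cudaFree(d_c);
  }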

What happens if we transpose c?

  for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
      a[i] += c[i][j] * b[j];
    }
  }

What else do we need to worry about?

Resulting CUDA code for Transposed Matrix Vector Multiply

  __global__ void mv_GPU(float* a, float* b, float** c) {
    int bx = blockIdx.x;
    int tx = threadIdx.x;
    __shared__ float bcpy[16];
    __shared__ float P1[16][17];   // padded to 17 columns to avoid bank conflicts
    double acpy = a[tx + 16 * bx];
    for (int k = 0; k < 16; k++) {
      bcpy[tx] = b[16 * k + tx];
      for (int l = 0; l < 16; l++) {
        P1[l][tx] = c[16 * bx + l][16 * k + tx];   // copy in coalesced order
      }
      __syncthreads();
      // this loop is actually fully unrolled
      for (int j = 16 * k; j < 16 * k + 16; j++) {
        acpy = acpy + P1[tx][j - 16 * k] * bcpy[j - 16 * k];
      }
      __syncthreads();
    }
    a[tx + 16 * bx] = acpy;
  }

Summary of Lecture
A deeper probe of performance issues:
- Execution model
- Control flow
- Heterogeneous memory hierarchy
- Locality and bandwidth
- Tiling for CUDA code generation