1 ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 22, 2013, MemCoalescing.ppt
Memory Coalescing
These notes will demonstrate:
The effects of memory coalescing
Use of a matrix transpose to improve matrix multiplication performance

2 Memory coalescing is the combining of separate memory accesses into one combined access, done by the GPU when the locations accessed are sequential locations in global memory. Consider setting the elements of a two-dimensional array to given data values. This could be done across rows or down columns. The following code demonstrates the effects of each approach.

3 Approach
Load numbers into a two-dimensional array.
The flattened global thread ID of each thread is loaded into the array element it accesses, so one can tell which thread accessed which location when the array is printed out.
Access is done across rows and also down columns, and the execution times are compared. In practice, the problem may dictate the access order.
GPU structure: one or more 2-D blocks in a 2-D grid. Each block is fixed at 2-D 32 x 32 threads (the maximum for compute capability 2.x). A sketch of the host-side launch configuration appears below.
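As an illustration (not from the original slides), a minimal sketch of what the host-side launch configuration might look like; the names B, Grid, and Block are assumptions:

// Hypothetical host-side setup for the kernels on the next slides.
// B, Grid, and Block are illustrative names, not from the course code.
#define B 32                        // 32 x 32 threads per block (max for compute capability 2.x)

int N = 512;                        // N x N array
dim3 Block(B, B);                   // 2-D block
dim3 Grid(N / B, N / B);            // 2-D grid covering the array (assumes N divisible by B)

gpu_Comput1<<<Grid, Block>>>(dev_h, N, T);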

4 One way

__global__ void gpu_Comput1 (int *h, int N, int T) {
   int col = threadIdx.x + blockDim.x * blockIdx.x;
   int row = threadIdx.y + blockDim.y * blockIdx.y;
   int threadID = col + row * N;   // thread ID
   int index = col + row * N;      // array index
   for (int t = 0; t < T; t++)     // loop to reduce other time effects
      h[index] = threadID;         // load array with global thread ID
}

5 Another way

__global__ void gpu_Comput2 (int *h, int N, int T) {
   int col = threadIdx.x + blockDim.x * blockIdx.x;
   int row = threadIdx.y + blockDim.y * blockIdx.y;
   int threadID = col + row * N;   // thread ID
   int index = row + col * N;      // array index
   for (int t = 0; t < T; t++)     // loop to reduce other time effects
      h[index] = threadID;         // load array with global thread ID
}

6 /* GPU Computation */
gpu_Comput1<<<Grid, Block>>>(dev_h, N, T);   // launch kernel once outside timing

cudaEventRecord(start, 0);
gpu_Comput1<<<Grid, Block>>>(dev_h, N, T);
cudaEventRecord(stop, 0);                    // measure end time
cudaEventSynchronize(stop);                  // wait for event recording
cudaEventElapsedTime(&elapsed_time_ms1, start, stop);

cudaMemcpy(h, dev_h, size, cudaMemcpyDeviceToHost);   // results to check

printf("\nComputation with memory coalescing possible\n");
printArray(h, N);
printf("\nTime to calculate results on GPU: %f ms.\n", elapsed_time_ms1);

Computation 2 is similar.
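The slide omits the event setup and teardown; a minimal sketch of the complete CUDA event-timing pattern, assuming the slide's variable names, is:

// Full event-timing pattern around the launch shown above (sketch;
// the declarations and create/destroy calls are not on the slide).
cudaEvent_t start, stop;
float elapsed_time_ms1;

cudaEventCreate(&start);
cudaEventCreate(&stop);

gpu_Comput1<<<Grid, Block>>>(dev_h, N, T);              // warm-up launch, outside timing

cudaEventRecord(start, 0);
gpu_Comput1<<<Grid, Block>>>(dev_h, N, T);              // timed launch
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                             // block until the stop event has been recorded
cudaEventElapsedTime(&elapsed_time_ms1, start, stop);   // elapsed time in ms

cudaEventDestroy(start);
cudaEventDestroy(stop);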

7 Some results
A grid of one block, one iteration. Array 32 x 32.
No speedup recorded because the time of other operations dominates the execution time.

8 A grid of one block, many iterations. Array 32 x 32.
Speedup = 17.16

9 Repeated, just to check that the results are consistent.

10 A grid of 16 x 16 blocks, many iterations. Array 512 x 512.
Speedup = … (value not preserved)
Different numbers of iterations produce similar results.

11 Different Array Sizes
(Table of speedup against array size, for a range of array sizes from 32 x 32 upwards; the speedup values were not preserved.)
Fixed number of iterations. Block size 32 x 32, with the number of blocks chosen to suit the array size.

12 Effects of memory access in matrix multiplication
One thread is responsible for computing one result Cij and needs to access a row of A and a column of B.
Each thread accesses one row of A and one column of B: N² row/column combinations, N² threads. A sketch of the kernel under discussion follows.
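A minimal sketch of the naive kernel being analyzed (the kernel itself is not on the slides; the name gpu_matrixmult and the flattened row-major indexing are assumptions consistent with the earlier kernels):

// Hypothetical naive matrix-multiplication kernel: one thread per C element.
// Arrays are flattened row-major, as in the earlier examples.
__global__ void gpu_matrixmult(int *a, int *b, int *c, int N) {
   int col = threadIdx.x + blockDim.x * blockIdx.x;
   int row = threadIdx.y + blockDim.y * blockIdx.y;
   int sum = 0;
   for (int k = 0; k < N; k++)
      sum += a[row * N + k] * b[k * N + col];   // a row of A, a column of B
   c[row * N + col] = sum;
}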

13 Seen another way: in the first time period, each thread accesses the first element in a row of A (Thread 0, …, Thread i, …, Thread N-1, …).
Consider the threads that access different rows. Given the row-major order in which A is stored, the locations those threads access are not consecutive. Bad: memory coalescing cannot be done.
Question: how many threads access the same location?

14 Next, each thread accesses the first element in a column of B (Thread 0, …, Thread i, …, Thread N-1, …).
Consider the threads that access different columns. Given the row-major order in which B is stored, the locations those threads access are consecutive. Good: memory coalescing can be done.
Question: how many threads access the same location?

15 How can we get better memory accesses and memory coalescing?
1. Transpose one array: copy all rows of A to columns and all columns of A to rows before accessing A, and modify the program accordingly.
(Not mentioned in the course textbook or other NVIDIA books, although it appears the obvious way; see the following results for whether it works!)

16 Sequential code for an in-place transpose using the same array:

for (i = 0; i < N; i++)
   for (j = 0; j < i; j++) {
      temp = B[i][j];
      B[i][j] = B[j][i];
      B[j][i] = temp;
   }

(In my code, I use separate arrays.)
This could be done on the host prior to copying to the device. What would the code look like on the device?
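One possible answer, as a sketch rather than the course's actual code: launch one thread per element and copy into a separate output array.

// Hypothetical device-side transpose into a separate array (out = transpose of in).
__global__ void transpose(int *in, int *out, int N) {
   int col = threadIdx.x + blockDim.x * blockIdx.x;
   int row = threadIdx.y + blockDim.y * blockIdx.y;
   if (row < N && col < N)
      out[col * N + row] = in[row * N + col];
}

Note that this simple version itself has a non-coalesced access (the write is strided); a tiled version using shared memory is the usual way to avoid that.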

17 /* COMPUTATION DONE ON GPU USING A TRANSPOSED ARRAY */
transposeArray(a, a_T, N);                 // transpose array on host

cudaEventRecord(start, 0);                 // time measured from before the host-device
                                           // copy, but the transpose is not included
//cudaEventSynchronize(start);             // needed?

cudaMemcpy(dev_a, a_T, size, cudaMemcpyHostToDevice);   // copy transposed A
cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);     // copy B

gpu_matrixmult_T<<<Grid, Block>>>(dev_a, dev_b, dev_c, N);

cudaMemcpy(c_T, dev_c, size, cudaMemcpyDeviceToHost);

cudaEventRecord(stop, 0);                  // measure end time
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed_time_ms2, start, stop);

printf("Time to calculate results on GPU with transposed array: %f ms.\n", elapsed_time_ms2);
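The kernel gpu_matrixmult_T is not shown on the slides; a sketch of what it presumably looks like, given that dev_a now holds the transpose of A, is:

// Hypothetical version of gpu_matrixmult_T: a_T holds the transpose of A,
// so the element A[row][k] is read as a_T[k * N + row].
__global__ void gpu_matrixmult_T(int *a_T, int *b, int *c, int N) {
   int col = threadIdx.x + blockDim.x * blockIdx.x;
   int row = threadIdx.y + blockDim.y * blockIdx.y;
   int sum = 0;
   for (int k = 0; k < N; k++)
      sum += a_T[k * N + row] * b[k * N + col];
   c[row * N + col] = sum;
}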

18 Some results
8 x 8 array, 1 block of 8 x 8 threads.
Speedup = 1.62 over not transposing the array.

19 Some results
32 x 32 array, 1 block of 32 x 32 threads.
Speedup = 1.17 over not transposing the array.

20 Some results
256 x 256 array, 8 x 8 blocks of 32 x 32 threads.
Speedup = 0.89 (slower!) over not transposing the array.

21 Some results
1024 x 1024 array, 32 x 32 blocks of 32 x 32 threads.
Speedup = 0.93 (slower!) over not transposing the array.

Questions