L5: Memory Hierarchy Optimization III, Data Placement, cont., and Memory Bandwidth Optimizations (CS6963)



Administrative
Next assignment available (described on the next four slides). Goals of the assignment:
– simple memory hierarchy management
– block-thread decomposition tradeoff
Due Tuesday, Feb. 8, 5PM. Use the handin program on the CADE machines: "handin cs6963 lab2 ..."
Mailing lists:
– Please use the class list for all questions suitable for the whole class. Feel free to answer your classmates' questions!
– Please use the instructor list for questions to Sriram and me.

Assignment: Signal Recognition
Definition:
– Apply an input signal (a vector) to a set of precomputed transform matrices.
– Examine the result to determine which of the collection of transform matrices is closest to the signal.
– Compute M1·V, M2·V, ..., Mn·V.
– Revised formulation (for class purposes): compute M·V1, M·V2, ..., M·Vn.

/* Sequential version. An output vector parameter is added so the code
   compiles: the slide declared a scalar "result" but then indexed it. */
void ApplySignal(float *mat, float *signal, float *out, int M) {
  for (int i = 0; i < M; i++) {
    float result = 0.0f;                    /* accumulate in a register */
    for (int j = 0; j < M; j++)
      result += mat[i*M + j] * signal[j];   /* linearized mat[i][j] */
    out[i] = result;
  }
}

Requirements:
– Use global memory, registers and shared memory only (no constant memory).
– Explore different ways of laying out data.
– Explore different numbers of blocks and threads.
– Be careful that the formulation is correct.

Assignment 2: What You Will Implement
We provide the sequential code. Your goal is to write two CUDA versions of this code:
(1) one that uses global memory
(2) one that uses a combination of global memory and shared memory
You'll time the code, but you will not be graded on the actual performance. Rather, your score will be based on whether you produce two working versions of code, and on the analysis of tradeoffs.
For your two versions, you should try three different thread and block decomposition strategies:
(1) a small number of blocks and a large number of threads
(2) a large number of blocks and fewer threads
(3) some intermediate point, or a different number of dimensions in the block/thread decomposition
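As a concrete starting point, here is a minimal sketch of the global-memory version, assuming one thread per output element; the kernel name, launch configuration, and linearized layout are illustrative choices, not part of the assignment handout.

__global__ void ApplySignalGlobal(float *mat, float *signal, float *out, int M) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per output element
  if (i < M) {
    float result = 0.0f;                          // accumulate in a register
    for (int j = 0; j < M; j++)
      result += mat[i*M + j] * signal[j];         // row i of mat times signal
    out[i] = result;
  }
}

// Example launch for strategy (2), many blocks with fewer threads:
//   ApplySignalGlobal<<<(M + 63)/64, 64>>>(d_mat, d_signal, d_out, M);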

Assignment 2: Analyzing the Results
You'll need to perform a series of experiments and report on the results. For each measurement, you should compute the average execution time of five runs. What insights can you gain from the performance measurements and the differences in behavior?
EXTRA CREDIT: Can you come up with a better implementation of this code? You can use other memory structures, or simply vary how much work is performed within a thread. How much faster is it?
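One common way to time the five runs is with CUDA events. This is a sketch under the assumption that the kernel and launch parameters are those from the previous slide; the variable names are illustrative.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

float total_ms = 0.0f;
for (int run = 0; run < 5; run++) {
  cudaEventRecord(start);
  ApplySignalGlobal<<<numBlocks, threadsPerBlock>>>(d_mat, d_signal, d_out, M);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);               // wait for the kernel to finish
  float ms;
  cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
  total_ms += ms;
}
printf("average kernel time: %f ms\n", total_ms / 5.0f);

cudaEventDestroy(start);
cudaEventDestroy(stop);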

How to tell if results are correct
Parallel execution may involve reordering the updates to memory locations.
– Recall the "count 6s" reduction from L1: reordering is correct for commutative and associative operations (addition in this case).
– Is IEEE floating point associative? (Not really.)
Also, GPU and CPU arithmetic are not always identically implemented. Even completely independent operations may yield different results when comparing CPU and GPU implementations.
SO, we compare floating point values to a particular level of error tolerance to determine correctness. Example:
  CUTBoolean res = cutComparefe(d_P, h_P, Width*Width, ...);  /* tolerance value elided in the slide */
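If you are not using the cutil library, a hand-rolled comparison with an absolute tolerance is a sketch of the same idea; the function name and the example tolerance value are assumptions, not from the slide.

#include <math.h>

/* Returns 1 if every element pair agrees within eps, else 0. */
int compare_fe(const float *ref, const float *data, int n, float eps) {
  for (int i = 0; i < n; i++)
    if (fabsf(ref[i] - data[i]) > eps)
      return 0;
  return 1;
}

/* e.g.: int ok = compare_fe(h_P_gold, h_P, Width*Width, 1e-3f); */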

Overview of Lecture
Review: tiling for computation partitioning and fixed-capacity storage
Review: more detailed derivation of matrix multiply from the text
Reading:
– Chapter 5, Kirk and Hwu book
– Or, book/Chapter4-CudaMemoryModel.pdf

Targets of Memory Hierarchy Optimizations
Reduce memory latency
– The latency of a memory access is the time (usually in cycles) between a memory request and its completion.
Maximize memory bandwidth
– Bandwidth is the amount of useful data that can be retrieved over a time interval.
Manage overhead
– The cost of performing an optimization (e.g., copying) should be less than the anticipated gain.
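As a worked example of the bandwidth definition (the numbers below are illustrative, not measured): a kernel that reads and writes each element once moves 2 * N * sizeof(float) bytes, so effective bandwidth is bytes moved divided by elapsed time.

// Illustrative arithmetic: copying 16M floats in 4 ms:
//   bytes moved = 2 * 16e6 * 4 B   = 128 MB  (one read + one write per element)
//   bandwidth   = 128e6 B / 4e-3 s = 32 GB/s
float effective_bandwidth_GBs(size_t bytes_moved, float ms) {
  return (bytes_moved / 1.0e9f) / (ms / 1.0e3f);
}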

Tiling (Blocking): Another Loop Reordering Transformation
Tiling reorders loop iterations to bring iterations that reuse data closer in time.
[Figure: the I-J iteration space traversal, before and after tiling]

Tiling Example
Original loop nest:
  for (j=1; j<M; j++)
    for (i=1; i<N; i++)
      D[i] = D[i] + B[j][i];

Strip mine the i loop:
  for (j=1; j<M; j++)
    for (ii=1; ii<N; ii+=s)
      for (i=ii; i<min(ii+s,N); i++)
        D[i] = D[i] + B[j][i];

Permute (sequential view):
  for (ii=1; ii<N; ii+=s)
    for (j=1; j<M; j++)
      for (i=ii; i<min(ii+s,N); i++)
        D[i] = D[i] + B[j][i];

CUDA Version of Example (Tiling for Computation Partitioning)
  for (ii=1; ii<N; ii+=s)                // block dimension
    for (i=ii; i<min(ii+s,N); i++)       // thread dimension
      for (j=1; j<N; j++)                // loop within thread
        D[i] = D[i] + B[j][i];

Host launch (the <<<...>>> configuration was lost in transcription; N/s blocks of s threads matches the index computation in the kernel):
  ComputeI<<<N/s, s>>>(d_D, d_B, N);

Kernel:
  __global__ void ComputeI(float *d_D, float *d_B, int N) {
    int ii = blockIdx.x;
    int i = ii*s + threadIdx.x;
    for (int j = 0; j < N; j++)
      d_D[i] = d_D[i] + d_B[j*N + i];
  }

Textbook Shows Tiling for Limited-Capacity Shared Memory
Compute matrix multiply using shared memory accesses. We'll show how to derive it using tiling.

Matrix Multiplication: A Simple Host Version in C
[Figure: matrices M, N, P of dimension WIDTH, with row index i, column index j, and inner index k]

// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width) {
  for (int i = 0; i < Width; ++i)
    for (int j = 0; j < Width; ++j) {
      double sum = 0;
      for (int k = 0; k < Width; ++k) {
        double a = M[i * Width + k];
        double b = N[k * Width + j];
        sum += a * b;
      }
      P[i * Width + j] = sum;
    }
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

Tiled Matrix Multiply Using Thread Blocks
– One block computes one square sub-matrix Psub of size BLOCK_SIZE.
– One thread computes one element of Psub.
– Assume that the dimensions of M and N are multiples of BLOCK_SIZE and square in shape.
[Figure: M, N, P with sub-matrix Psub of size BLOCK_SIZE x BLOCK_SIZE; block indices (bx, by) and thread indices (tx, ty) select the tile within WIDTH x WIDTH matrices]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

Tiling View (Simplified Code)
Linearized accesses:
  for (int i = 0; i < Width; ++i)
    for (int j = 0; j < Width; ++j) {
      double sum = 0;
      for (int k = 0; k < Width; ++k) {
        double a = M[i * Width + k];
        double b = N[k * Width + j];
        sum += a * b;
      }
      P[i * Width + j] = sum;
    }

Simplified with 2-D subscripts:
  for (int i = 0; i < Width; ++i)
    for (int j = 0; j < Width; ++j) {
      double sum = 0;
      for (int k = 0; k < Width; ++k) {
        sum += M[i][k] * N[k][j];
      }
      P[i][j] = sum;
    }

Let's Look at This Code
  for (int i = 0; i < Width; ++i)          // Tile i
    for (int j = 0; j < Width; ++j) {      // Tile j
      double sum = 0;
      for (int k = 0; k < Width; ++k) {    // Tile k (inside thread)
        sum += M[i][k] * N[k][j];
      }
      P[i][j] = sum;
    }

Strip-Mined Code
  for (int ii = 0; ii < Width; ii+=TI)        // block dimensions
    for (int i = ii; i < ii+TI; i++)          // thread dimensions
      for (int jj = 0; jj < Width; jj+=TJ)    // block dimensions
        for (int j = jj; j < jj+TJ; j++) {    // thread dimensions
          double sum = 0;
          for (int kk = 0; kk < Width; kk+=TK) {   // to be used to stage data in shared memory
            for (int k = kk; k < kk+TK; k++)
              sum += M[i][k] * N[k][j];
          }
          P[i][j] = sum;
        }
(Assumes Width is a multiple of each tile size TI, TJ, TK.)

Derivation of Code in Text
– TI = TJ = TK = "TILE_WIDTH"
– All matrices square, Width x Width
– Copies of tiles of M and N in shared memory, TILE_WIDTH x TILE_WIDTH each
– "Linearized" 2-D array accesses: a[i][j] is equivalent to a[i*Width + j]
– Each thread block computes a "tile" of the output matrix P from a block of consecutive rows of M and a block of consecutive columns of N:
  dim3 Grid(Width/TILE_WIDTH, Width/TILE_WIDTH);
  dim3 Block(TILE_WIDTH, TILE_WIDTH);
– Then location P[i][j] corresponds to P[by*TILE_WIDTH+ty][bx*TILE_WIDTH+tx], or P[Row][Col].
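Putting the grid/block derivation together, host-side setup might look like this minimal sketch. It assumes MatrixMulKernel from the next slide, device pointers that are already allocated and initialized, and an illustrative TILE_WIDTH value.

#define TILE_WIDTH 16   // assumed tile size; the text leaves it symbolic

// Md, Nd, Pd are device pointers; Md and Nd have already been copied in.
void LaunchMatrixMul(float *Md, float *Nd, float *Pd, int Width) {
  dim3 Grid(Width / TILE_WIDTH, Width / TILE_WIDTH);  // one block per output tile
  dim3 Block(TILE_WIDTH, TILE_WIDTH);                 // one thread per tile element
  MatrixMulKernel<<<Grid, Block>>>(Md, Nd, Pd, Width);
}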

Final Code (from text, p. 87)
__global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd, int Width) {
1.   __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
2.   __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
3&4. int bx = blockIdx.x; int by = blockIdx.y;
     int tx = threadIdx.x; int ty = threadIdx.y;
     // Identify the row and column of the Pd element to work on
5&6. int Row = by * TILE_WIDTH + ty; int Col = bx * TILE_WIDTH + tx;
7.   float Pvalue = 0;
     // Loop over the Md and Nd tiles required to compute the Pd element
8.   for (int m = 0; m < Width/TILE_WIDTH; ++m) {
       // Collaborative (parallel) loading of Md and Nd tiles into shared memory
9.     Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
10.    Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
11.    __syncthreads(); // make sure all threads have completed the copy before calculating
12.    for (int k = 0; k < TILE_WIDTH; ++k)  // update Pvalue for the TK x TK tiles in Mds and Nds
13.      Pvalue += Mds[ty][k] * Nds[k][tx];
14.    __syncthreads(); // make sure the calculation is complete before copying the next tile
     } // m loop
15.  Pd[Row*Width + Col] = Pvalue;
}

Performance
This code should run at about 150 Gflops on a GTX- or Tesla-generation device. A state-of-the-art mapping (in CUBLAS 3.2 on a C2050) yields just above 600 Gflops, and higher on GTX-class parts.

Matrix Multiply in CUDA
Imagine you want to compute extremely large matrices, ones that don't fit in global memory. This is where an additional level of tiling could be used, between grids.
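A minimal sketch of that extra, host-level tiling layer follows. Everything here is an assumption for illustration: the CHUNK size, the requirement that Width is a multiple of CHUNK, and the MatMulAccum kernel, which would be the p.87 kernel with line 15 changed to accumulate (Pd[Row*Width + Col] += Pvalue) so partial products across K-chunks add up.

#define CHUNK 4096        // assumed chunk size that fits in global memory
#define TILE_WIDTH 16     // as in the tiled kernel

// Hypothetical accumulating variant of MatrixMulKernel: Pd += Md * Nd.
__global__ void MatMulAccum(float *Md, float *Nd, float *Pd, int W);

void HugeMatMul(float *M, float *N, float *P, int Width) {
  size_t bytes = (size_t)CHUNK * CHUNK * sizeof(float);
  float *Md, *Nd, *Pd;
  cudaMalloc(&Md, bytes); cudaMalloc(&Nd, bytes); cudaMalloc(&Pd, bytes);
  for (int ci = 0; ci < Width; ci += CHUNK)        // output chunk rows
    for (int cj = 0; cj < Width; cj += CHUNK) {    // output chunk columns
      cudaMemset(Pd, 0, bytes);                    // Pd accumulates partial products
      for (int ck = 0; ck < Width; ck += CHUNK) {
        // Stage the (ci,ck) chunk of M and the (ck,cj) chunk of N, row by row.
        for (int r = 0; r < CHUNK; r++) {
          cudaMemcpy(Md + (size_t)r*CHUNK, M + (size_t)(ci+r)*Width + ck,
                     CHUNK*sizeof(float), cudaMemcpyHostToDevice);
          cudaMemcpy(Nd + (size_t)r*CHUNK, N + (size_t)(ck+r)*Width + cj,
                     CHUNK*sizeof(float), cudaMemcpyHostToDevice);
        }
        dim3 Grid(CHUNK/TILE_WIDTH, CHUNK/TILE_WIDTH), Block(TILE_WIDTH, TILE_WIDTH);
        MatMulAccum<<<Grid, Block>>>(Md, Nd, Pd, CHUNK);
      }
      for (int r = 0; r < CHUNK; r++)              // copy the finished chunk back
        cudaMemcpy(P + (size_t)(ci+r)*Width + cj, Pd + (size_t)r*CHUNK,
                   CHUNK*sizeof(float), cudaMemcpyDeviceToHost);
    }
  cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}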

"Tiling" for Registers
A similar technique can be used to map data to registers.
Unroll-and-jam:
– Unroll outer loops in a nest and fuse together the resulting inner loops.
– Equivalent to "strip-mine" followed by permutation and unrolling.
– Fusion is safe if dependences are not reversed.
Scalar replacement:
– May be followed by replacing array references with scalar variables to help the compiler identify register opportunities.

Unroll-and-Jam for Matrix Multiply
Tiling inner loops I and K (+ permutation):
  for (K = 0; K < N; K += TK)
    for (I = 0; I < N; I += TI)
      for (J = 0; J < N; J++)
        for (KK = K; KK < min(K+TK, N); KK++)
          for (II = I; II < min(I+TI, N); II++)
            P[J][II] = P[J][II] + M[KK][II] * N[J][KK];
[Figure: a TI x TK tile of the computation, shown on the C, A, B matrices]

First, Apply Unroll-and-Jam
Unroll the II loop with TI = 4 (equivalent to unroll-and-jam):
  for (K = 0; K < N; K += TK)
    for (I = 0; I < N; I += 4)
      for (J = 0; J < N; J++)
        for (KK = K; KK < min(K+TK, N); KK++) {
          P[J][I]   = P[J][I]   + M[KK][I]   * N[J][KK];
          P[J][I+1] = P[J][I+1] + M[KK][I+1] * N[J][KK];
          P[J][I+2] = P[J][I+2] + M[KK][I+2] * N[J][KK];
          P[J][I+3] = P[J][I+3] + M[KK][I+3] * N[J][KK];
        }
Now parallel computations are exposed.
[Figure: four independent multiply-add chains from M and N accumulating into P]

Scalar Replacement: Replace Accesses to P with Scalars
  for (K = 0; K < N; K += TK)
    for (I = 0; I < N; I += 4)
      for (J = 0; J < N; J++) {
        P0 = P[J][I]; P1 = P[J][I+1]; P2 = P[J][I+2]; P3 = P[J][I+3];
        for (KK = K; KK < min(K+TK, N); KK++) {
          P0 = P0 + M[KK][I]   * N[J][KK];
          P1 = P1 + M[KK][I+1] * N[J][KK];
          P2 = P2 + M[KK][I+2] * N[J][KK];
          P3 = P3 + M[KK][I+3] * N[J][KK];
        }
        P[J][I] = P0; P[J][I+1] = P1; P[J][I+2] = P2; P[J][I+3] = P3;
      }
Now the P accesses can be mapped to "named registers" by exposing them with scalar replacement (or by simply unrolling the KK loop).

Overview of Texture Memory
Recall: the texture cache holds read-only data.
Special protocol for allocating and copying to the GPU:
– texture<Type, Dim, ReadMode> texRef;  (the template arguments were lost in transcription)
– Dim: 1-, 2- or 3-D objects
Special protocol for accesses (macros):
– tex2D(texRef, x, y);
In full glory, you can also apply functions (e.g., filtering) to textures.
Writing is possible, but unsafe if followed by a read in the same kernel.

Using Texture Memory (simpleTexture project from the SDK)
  cudaMalloc((void**)&d_data, size);
  cudaChannelFormatDesc channelDesc =
      cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
  cudaArray* cu_array;
  cudaMallocArray(&cu_array, &channelDesc, width, height);
  cudaMemcpyToArray(cu_array, 0, 0, h_data, size, cudaMemcpyHostToDevice);
  // set texture parameters
  tex.addressMode[0] = tex.addressMode[1] = cudaAddressModeWrap;
  tex.filterMode = cudaFilterModeLinear;
  tex.normalized = true;
  cudaBindTextureToArray(tex, cu_array, channelDesc);
  // execute the kernel (the <<<...>>> launch configuration was lost in transcription)
  transformKernel<<<dimGrid, dimBlock>>>(d_data, width, height, angle);

Kernel function:
  // declare texture reference for 2D float texture
  texture<float, 2, cudaReadModeElementType> tex;
  ...
  ... = tex2D(tex, i, j);

When to Use Texture (and Surface) Memory
(From Section 5.3 of the CUDA manual.) Reading device memory through texture or surface fetching presents some benefits that can make it an advantageous alternative to reading device memory from global or constant memory:
– If memory reads to global or constant memory will not be coalesced, higher bandwidth can be achieved provided that there is locality in the texture fetches or surface reads (this is less likely for devices of compute capability 2.x, given that global memory reads are cached on these devices).
– Addressing calculations are performed outside the kernel by dedicated units.
– Packed data may be broadcast to separate variables in a single operation.
– 8-bit and 16-bit integer input data may be optionally converted to 32-bit floating-point values in the range [0.0, 1.0] or [-1.0, 1.0] (see Section ...).

Memory Bandwidth Optimization
The goal is to maximize the utility of the data obtained by each transfer from global memory.
The memory system will "coalesce" the accesses of a collection of consecutive threads (from a half-warp or warp) if they fall within an aligned 128-byte portion of memory.
Implications for programming:
– It is desirable to have consecutive threads in the tx dimension access consecutive data in memory.
– This has a significant performance impact, but the Fermi data cache makes it slightly less important.
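A minimal sketch of the two access patterns (illustrative kernels, not from the slides): in the first, consecutive threads touch consecutive words, so a warp's loads coalesce; in the second, consecutive threads stride by N floats, so each load can become a separate transaction on pre-Fermi hardware.

// Coalesced: thread t reads in[base + t]; a warp covers one aligned 128-byte segment.
__global__ void copyCoalesced(float *out, const float *in, int n) {
  int t = blockIdx.x * blockDim.x + threadIdx.x;
  if (t < n) out[t] = in[t];
}

// Uncoalesced: thread t reads a column of a row-major N x N matrix,
// so consecutive threads are N floats apart in memory.
__global__ void copyColumn(float *out, const float *in, int N) {
  int t = blockIdx.x * blockDim.x + threadIdx.x;
  if (t < N) out[t] = in[t * N];   // stride-N access: one transaction per thread
}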

Introduction to Global Memory Bandwidth: Understanding Global Memory Accesses
Memory protocol for compute capability 1.2* (CUDA Manual):
– Start with the memory request of the lowest-numbered active thread.
– Find the memory segment that contains that address (a 32-, 64- or 128-byte segment, depending on data type).
– Find the other active threads requesting addresses within that segment and coalesce them.
– Reduce the transaction size if possible.
– Access memory and mark those threads as "inactive".
– Repeat until all threads in the half-warp are serviced.
*Includes Tesla and GTX platforms

The protocol for most systems (including the lab6 machines) is even more restrictive. For compute capability 1.0 and 1.1:
– Threads must access the words in a segment in sequence.
– The kth thread must access the kth word.

Memory Layout of a Matrix in C
[Figure, repeated across two slides whose distinguishing annotations were lost: a 4x4 matrix M laid out row-major in linear memory (M0,0, M1,0, M2,0, M3,0, M0,1, ...), with threads T1..T4 accessing elements during time periods 1 and 2, and an arrow marking the access direction in the kernel code. The pair of slides contrasts an access direction that coalesces with one that does not.]

Summary of Lecture
Tiling transformation:
– for computation partitioning
– for limited capacity in shared memory
– for registers
Matrix multiply example
Unroll-and-jam for registers
Bandwidth optimization:
– global memory coalescing

Next Time
Complete bandwidth optimizations:
– shared memory bank conflicts
– bank conflicts in global memory (briefly)