L4: Memory Hierarchy Optimization II, Locality and Data Placement
CS6235
Administrative
- Assignment due Friday, Jan. 18, 5 PM
- Use the handin program on the CADE machines: "handin CS6235 lab1 <probfile>"
- Mailing list: CS6235@list.eng.utah.edu
  - Please use it for all questions suitable for the whole class
  - Feel free to answer your classmates' questions!
Overview of Lecture
- More on tiling for shared memory and constant memory
- Reading: Chapter 5, Kirk and Hwu book, or
  http://courses.ece.illinois.edu/ece498/al/textbook/Chapter4-CudaMemoryModel.pdf
Review: Targets of Memory Hierarchy Optimizations
- Reduce memory latency
  - The latency of a memory access is the time (usually in cycles) between a memory request and its completion
- Maximize memory bandwidth
  - Bandwidth is the amount of useful data that can be retrieved over a time interval
- Manage overhead
  - The cost of performing an optimization (e.g., copying) should be less than the anticipated gain
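As a rough, hypothetical illustration of the bandwidth definition above (the problem size and timing are made up for this sketch), effective bandwidth is just useful bytes moved divided by elapsed time:

#include <stdio.h>

int main(void)
{
    /* Hypothetical kernel that reads and writes one float per element. */
    long   n       = 1 << 20;                    /* assumed problem size  */
    double bytes   = 2.0 * n * sizeof(float);    /* one read + one write  */
    double seconds = 1.5e-4;                     /* assumed measured time */
    double gbps    = bytes / seconds / 1e9;      /* effective bandwidth   */
    printf("effective bandwidth: %.1f GB/s\n", gbps);
    return 0;
}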
Hardware Implementation: Memory Architecture
[Figure: a device with multiprocessors 1..N; each multiprocessor contains processors 1..M with registers, an instruction unit, on-chip shared memory, a constant cache, and a texture cache, all backed by device memory holding the global, constant, and texture spaces]
- The local, global, constant, and texture spaces are regions of device memory (DRAM)
- Each multiprocessor has:
  - A set of 32-bit registers per processor
  - On-chip shared memory, where the shared memory space resides
  - A read-only constant cache, to speed up access to the constant memory space
  - A read-only texture cache, to speed up access to the texture memory space
  - A data cache (Fermi only)
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign
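To make the memory spaces above concrete, here is a minimal sketch (not from the slides; the kernel and variable names are invented) showing where each kind of declaration lives:

// Sketch only: one declaration per memory space discussed above.
__constant__ float coeff[32];                 // constant memory (cached on chip)

__global__ void placement_demo(const float *g_in, float *g_out)   // global memory
{
    __shared__ float tile[32];                // on-chip shared memory, per block
    int   tx  = threadIdx.x;                  // assumes blockDim.x == 32
    float val = g_in[blockIdx.x * 32 + tx];   // scalar held in a register
    tile[tx]  = val;
    __syncthreads();                          // tile now visible to the whole block
    g_out[blockIdx.x * 32 + tx] = tile[tx] * coeff[tx];
}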
Review: Reuse and Locality
- Consider how data is accessed
  - Data reuse: the same data is used multiple times; intrinsic in the computation
  - Data locality: data is reused while it is still present in "fast memory" (same data, or same data transfer)
- If a computation has reuse, what can we do to get locality?
  - Appropriate data placement and layout
  - Code reordering transformations (as sketched below)
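As a minimal sketch of a code reordering transformation (illustrative only; a and c are assumed row-major C arrays of size n and n x n), interchanging the loops turns the reuse of cache lines of c into locality:

/* Before: the inner loop walks a column of c, touching a new cache line
 * on every iteration, so the reuse of each line is not exploited.        */
for (int j = 0; j < n; j++)
    for (int i = 0; i < n; i++)
        a[i] += c[i][j];

/* After interchange: the inner loop walks a row of c with stride 1, and
 * a[i] stays in a register, so both kinds of reuse become locality.      */
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        a[i] += c[i][j];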
Memory Hierarchy Example: Matrix-vector multiply

for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++) {
    a[i] += c[j][i] * b[j];
  }
}

Remember to:
- Consider correctness of the parallelization strategy (next week)
- Exploit locality in shared memory and registers
Resulting CUDA Code (Automatically Generated by our Research Compiler)

__global__ void mv_GPU(float* a, float* b, float** c) {
  int bx = blockIdx.x;
  int tx = threadIdx.x;
  __shared__ float bcpy[32];
  double acpy = a[tx + 32 * bx];
  for (int k = 0; k < 32; k++) {        // 32 tiles of 32 elements of b
    bcpy[tx] = b[32 * k + tx];          // cooperative copy of one tile of b
    __syncthreads();
    // this loop is actually fully unrolled
    for (int j = 32 * k; j < 32 * k + 32; j++) {
      acpy = acpy + c[j][32 * bx + tx] * bcpy[j - 32 * k];
    }
    __syncthreads();                    // wait before overwriting bcpy
  }
  a[tx + 32 * bx] = acpy;
}
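A hedged host-side sketch for launching the kernel above (not part of the generated code; it assumes n = 1024 so the grid is 32 blocks of 32 threads, and it builds c as a device array of device row pointers simply to match the float** parameter):

#include <cuda_runtime.h>

int main(void)
{
    const int n = 1024;                          // assumed: 32 tiles of 32
    float *d_a, *d_b, **d_c;
    float *rows[1024];                           // host copy of device row pointers

    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    for (int j = 0; j < n; j++)
        cudaMalloc(&rows[j], n * sizeof(float)); // one row of c per allocation
    cudaMalloc(&d_c, n * sizeof(float *));
    cudaMemcpy(d_c, rows, n * sizeof(float *), cudaMemcpyHostToDevice);

    // ... initialize d_a, d_b, and the rows of c here ...

    mv_GPU<<<n / 32, 32>>>(d_a, d_b, d_c);       // one thread per element of a
    cudaDeviceSynchronize();
    return 0;
}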
Ch. 4: Matrix Multiplication, A Simple Host Version in C

// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
  for (int i = 0; i < Width; ++i)
    for (int j = 0; j < Width; ++j) {
      double sum = 0;
      for (int k = 0; k < Width; ++k) {
        double a = M[i * Width + k];
        double b = N[k * Width + j];
        sum += a * b;
      }
      P[i * Width + j] = sum;
    }
}

[Figure: M, N, and P are WIDTH x WIDTH matrices; row i of M and column j of N combine to produce element (i, j) of P]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign
Discussion (Simplified Code)

Linearized array accesses:
for (int i = 0; i < Width; ++i)
  for (int j = 0; j < Width; ++j) {
    double sum = 0;
    for (int k = 0; k < Width; ++k) {
      double a = M[i * Width + k];
      double b = N[k * Width + j];
      sum += a * b;
    }
    P[i * Width + j] = sum;
  }

Equivalent 2-D array accesses:
for (int i = 0; i < Width; ++i)
  for (int j = 0; j < Width; ++j) {
    double sum = 0;
    for (int k = 0; k < Width; ++k) {
      sum += M[i][k] * N[k][j];
    }
    P[i][j] = sum;
  }
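The two versions differ only in how elements are addressed. As a small illustrative helper (not in the text), the row-major linearization used throughout these slides is:

/* Row-major layout: element (i, j) of a Width x Width matrix lives at
 * offset i*Width + j, so P[idx(i, j, Width)] plays the role of P[i][j]. */
static inline int idx(int i, int j, int Width) { return i * Width + j; }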
Let's Look at This Code

for (int i = 0; i < Width; ++i)          // Tile i
  for (int j = 0; j < Width; ++j) {      // Tile j
    double sum = 0;
    for (int k = 0; k < Width; ++k) {    // Tile k for shared memory (inside thread)
      sum += M[i][k] * N[k][j];
    }
    P[i][j] = sum;
  }
Strip-Mined Code

for (int ii = 0; ii < Width; ii += TI)            // Block dimensions
  for (int i = ii; i < ii + TI; i++)              // Thread dimensions
    for (int jj = 0; jj < Width; jj += TJ)        // Block dimensions
      for (int j = jj; j < jj + TJ; j++) {        // Thread dimensions
        double sum = 0;
        for (int kk = 0; kk < Width; kk += TK) {  // Tiling for shared memory
          for (int k = kk; k < kk + TK; k++)
            sum += M[i][k] * N[k][j];
        }
        P[i][j] = sum;
      }
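Strip-mining one loop in isolation may make the transformation clearer (illustrative sketch; it assumes the tile size T divides N, and body() stands for any loop body):

/* Original loop. */
for (int i = 0; i < N; i++)
    body(i);

/* Strip-mined: ii walks tile origins, i walks within a tile.
 * The same iterations execute in the same order.             */
for (int ii = 0; ii < N; ii += T)
    for (int i = ii; i < ii + T; i++)
        body(i);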
But this code doesn't match CUDA constraints

for (int ii = 0; ii < Width; ii += TI)            // Block dimensions – must be unit stride
  for (int i = ii; i < ii + TI; i++)              // Thread dimensions – must be unit stride
    for (int jj = 0; jj < Width; jj += TJ)        // Block dimensions – must be unit stride
      for (int j = jj; j < jj + TJ; j++) {        // Thread dimensions – must be unit stride
        double sum = 0;
        for (int kk = 0; kk < Width; kk += TK) {  // Tiling for shared memory
          for (int k = kk; k < kk + TK; k++)
            sum += M[i][k] * N[k][j];
        }
        P[i][j] = sum;
      }

Can we fix this?
Unit-Stride Tiling – Reflect Stride in Subscript Expressions

for (int ii = 0; ii < Width/TI; ii++)             // Block dimensions – now unit stride
  for (int i = 0; i < TI; i++)                    // Thread dimensions – now unit stride
    for (int jj = 0; jj < Width/TJ; jj++)         // Block dimensions – now unit stride
      for (int j = 0; j < TJ; j++) {              // Thread dimensions – now unit stride
        double sum = 0;
        for (int kk = 0; kk < Width; kk += TK) {  // Tiling for shared memory, no need to change
          for (int k = kk; k < kk + TK; k++)
            sum += M[ii*TI+i][k] * N[k][jj*TJ+j];
        }
        P[ii*TI+i][jj*TJ+j] = sum;
      }
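The same single-loop sketch from before, now with both loops made unit stride (again assuming T divides N): the tile origin moves out of the loop bounds and into the subscript, which is exactly the form that blockIdx and threadIdx can replace:

for (int ii = 0; ii < N / T; ii++)     /* unit-stride tile loop (-> blockIdx)        */
    for (int i = 0; i < T; i++)        /* unit-stride intra-tile loop (-> threadIdx) */
        body(ii * T + i);              /* stride reflected in the subscript          */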
What Does this Look Like in CUDA

#define TI 32
#define TJ 32
#define TK 32   // also assumed; Width is assumed to be a compile-time constant

dim3 dimGrid(Width/TI, Width/TJ);
dim3 dimBlock(TI, TJ);
matMult<<<dimGrid, dimBlock>>>(M, N, P);

__global__ void matMult(float *M, float *N, float *P) {
  int ii = blockIdx.y;   int jj = blockIdx.x;    // block loops disappear
  int i  = threadIdx.y;  int j  = threadIdx.x;   // thread loops disappear
  double sum = 0;
  for (int kk = 0; kk < Width; kk += TK) {
    for (int k = kk; k < kk + TK; k++) {
      // array accesses to global memory are "linearized"
      sum += M[(ii*TI+i)*Width + k] * N[k*Width + jj*TJ+j];
    }
  }
  P[(ii*TI+i)*Width + jj*TJ+j] = sum;
}

Tiling for shared memory: next slides.
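The launch code above assumes M, N, and P are already device pointers. A hedged sketch of that setup (h_M, h_N, h_P are hypothetical host arrays, and Width is assumed to be a multiple of TI and TJ):

size_t bytes = Width * Width * sizeof(float);
float *M, *N, *P;
cudaMalloc(&M, bytes);
cudaMalloc(&N, bytes);
cudaMalloc(&P, bytes);
cudaMemcpy(M, h_M, bytes, cudaMemcpyHostToDevice);   // copy inputs to the device
cudaMemcpy(N, h_N, bytes, cudaMemcpyHostToDevice);

dim3 dimGrid(Width / TI, Width / TJ);
dim3 dimBlock(TI, TJ);
matMult<<<dimGrid, dimBlock>>>(M, N, P);

cudaMemcpy(h_P, P, bytes, cudaMemcpyDeviceToHost);   // copy the result back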
What Does this Look Like in CUDA

#define TI 32
#define TJ 32
#define TK 32

dim3 dimGrid(Width/TI, Width/TJ);
dim3 dimBlock(TI, TJ);
matMult<<<dimGrid, dimBlock>>>(M, N, P);

__global__ void matMult(float *M, float *N, float *P) {
  __shared__ float Mds[TK][TK];   // tile of M; this load/compute scheme assumes TI == TJ == TK
  __shared__ float Nds[TK][TK];   // tile of N
  int ii = blockIdx.y;   int jj = blockIdx.x;    // block loops disappear
  int i  = threadIdx.y;  int j  = threadIdx.x;   // thread loops disappear
  double sum = 0;
  for (int kk = 0; kk < Width; kk += TK) {
    // tiling for shared memory: cooperative load of one tile of M and one tile of N
    Mds[j][i] = M[(ii*TI+i)*Width + kk + j];
    Nds[j][i] = N[(kk+i)*Width + jj*TJ+j];
    __syncthreads();
    for (int k = 0; k < TK; k++) {
      sum += Mds[k][i] * Nds[j][k];
    }
    __syncthreads();   // wait before overwriting the tiles
  }
  P[(ii*TI+i)*Width + jj*TJ+j] = sum;
}

Block and thread loops disappear; array accesses to global memory are "linearized".
Derivation of the Code in the Text
- TI = TJ = TK = "TILE_WIDTH"
- All matrices are square, Width x Width
- Copies of tiles of M and N in shared memory, each TILE_WIDTH x TILE_WIDTH
- "Linearized" 2-D array accesses: a[i][j] is equivalent to a[i*Width + j]
- Each thread block (running on one SM) computes a "tile" of the output matrix P from a block of consecutive rows of M and a block of consecutive columns of N
  dim3 Grid(Width/TILE_WIDTH, Width/TILE_WIDTH);
  dim3 Block(TILE_WIDTH, TILE_WIDTH);
- Then, location P[i][j] corresponds to P[by*TILE_WIDTH+ty][bx*TILE_WIDTH+tx], or P[Row][Col]
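A small worked instance of the P[Row][Col] mapping above, assuming TILE_WIDTH = 32:

// For the thread with blockIdx = (bx=2, by=1) and threadIdx = (tx=4, ty=3):
int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;   // 1*32 + 3 = 35
int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;   // 2*32 + 4 = 68
// so this thread produces P[35][68], i.e., Pd[35*Width + 68].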
Final Code (from the text, p. 87)

__global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd, int Width)
{
  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

  int bx = blockIdx.x;  int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;

  // Identify the row and column of the Pd element to work on
  int Row = by * TILE_WIDTH + ty;
  int Col = bx * TILE_WIDTH + tx;
  float Pvalue = 0;

  // Loop over the Md and Nd tiles required to compute the Pd element
  for (int m = 0; m < Width / TILE_WIDTH; ++m) {
    // Collaborative (parallel) loading of Md and Nd tiles into shared memory
    Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
    Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
    __syncthreads();  // make sure all threads have completed the copy before calculation

    // Update Pvalue for the TILE_WIDTH x TILE_WIDTH tiles in Mds and Nds
    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += Mds[ty][k] * Nds[k][tx];
    __syncthreads();  // make sure the calculation is complete before copying the next tile
  }  // m loop

  Pd[Row*Width + Col] = Pvalue;
}
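A hedged usage sketch (not from the text): launching MatrixMulKernel and comparing its output against the host version MatrixMulOnHost from the earlier slide. Md, Nd, and Pd are assumed to be device buffers already allocated and initialized, and h_M, h_N, h_P, h_ref are hypothetical host buffers; Width is assumed to be a multiple of TILE_WIDTH.

dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
cudaMemcpy(h_P, Pd, Width * Width * sizeof(float), cudaMemcpyDeviceToHost);

MatrixMulOnHost(h_M, h_N, h_ref, Width);     // CPU reference result
for (int i = 0; i < Width * Width; i++)      // element-wise comparison
    assert(fabsf(h_P[i] - h_ref[i]) < 1e-3f);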
Summary of Lecture
- Matrix-matrix multiply example