©Wen-mei W. Hwu and David Kirk/NVIDIA, SSL(2014), ECE408/CS483/ECE498AL, University of Illinois, 2007-2012 ECE408/CS483 Applied Parallel Programming Lecture.

Slides:

Advertisements

Similar presentations

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 More on Performance Considerations.

Advertisements

1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 28, 2011 GPUMemories.ppt GPU Memories These notes will introduce: The basic memory hierarchy.

ECE 598HK Computational Thinking for Many-core Computing Lecture 2: Many-core GPU Performance Considerations © Wen-mei W. Hwu and David Kirk/NVIDIA,

Introduction to CUDA and CELL SpursEngine Multi-core Programming 1 Reference: 1. NVidia CUDA (Compute Unified Device Architecture) documents 2. Presentation.

Weekly Report- Matrix multiplications Ph.D. Student: Leo Lee date: Oct. 16, 2009.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 Structuring Parallel Algorithms.

1 CUDA Threads. © David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana-Champaign 2 Block IDs and Thread IDs Each thread.

L8: Memory Hierarchy Optimization, Bandwidth CS6963.

ECE408/CS483 Applied Parallel Programming Lecture 9: Tiled Convolution

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana-Champaign 1 ECE498AL Lecture 3: A Simple Example, Tools, and.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, ECE 498AL, University of Illinois, Urbana-Champaign Lecture 10: GPU as part of the PC Architecture.

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, ECE408/CS483 Applied Parallel Programming Lecture 7: DRAM.

© David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 – July 2, Taiwan 2008 CUDA Course Programming Massively Parallel Processors: the CUDA experience.

CUDA Programming. Floating Point Operations for the CPU and the GPU.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lectures 8: Memory Hardware in G80.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors CUDA Threads.

© David Kirk/NVIDIA and Wen-mei W. Hwu Urbana, Illinois, August 10-14, VSCSE Summer School 2009 Many-core processors for Science and Engineering.

CUDA Performance Considerations (1 of 2) Patrick Cozzi University of Pennsylvania CIS Spring 2012.

© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, ECE408 Applied Parallel Programming Lecture 12 Parallel.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 CS 395: CUDA Lecture 5 Memory coalescing (from.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 Control Flow.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lectures 9: Memory Hardware in G80.

1 ECE 8823A GPU Architectures Module 3: CUDA Execution Model © David Kirk/NVIDIA and Wen-mei Hwu, ECE408/CS483/ECE498al, University of Illinois,

1. Could I be getting better performance? Probably a little bit. Most of the performance is handled in HW How much better? If you compile –O3, you can.

1 ECE408/CS483 Applied Parallel Programming Lecture 10: Tiled Convolution Analysis © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University.

1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I.

CS/EE 217 GPU Architecture and Parallel Programming Lecture 3: Kernel-Based Data Parallel Execution Model © David Kirk/NVIDIA and Wen-mei Hwu,

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors Lecture.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana-Champaign 1 ECE498AL Lecture 3: A Simple Example, Tools, and.

Matrix Multiplication in CUDA

GPU PROGRAMMING GPU Programming 1. Assignment 4 Consists of two programming assignments Concurrency GPU programming Requires a computer with a CUDA/OpenCL/DirectCompute.

© David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30-July 2, Taiwan 2008 CUDA Course Programming Massively Parallel Processors: the CUDA experience.

CUDA Parallel Execution Model with Fermi Updates © David Kirk/NVIDIA and Wen-mei Hwu, ECE408/CS483/ECE498al, University of Illinois, Urbana-Champaign.

ECE408/CS483 Applied Parallel Programming Lecture 4: Kernel-Based Data Parallel Execution Model © David Kirk/NVIDIA and Wen-mei Hwu, ECE408/CS483/ECE498al,

© David Kirk/NVIDIA and Wen-mei W. Hwu Urbana, Illinois, August 10-14, VSCSE Summer School 2009 Many-core Processors for Science and Engineering.

©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010 VSCSE Summer School Proven Algorithmic Techniques for Many-core Processors Lecture.

CUDA Memory Types All material not from online sources/textbook copyright © Travis Desell, 2012.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana Champaign 1 Programming Massively Parallel Processors CUDA Memories.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Spring 2010 Programming Massively Parallel.

1 ECE 8823A GPU Architectures Module 4: Memory Model and Locality © David Kirk/NVIDIA and Wen-mei Hwu, ECE408/CS483/ECE498al, University of Illinois,

CS/EE 217 GPU Architecture and Parallel Programming Lectures 4 and 5: Memory Model and Locality © David Kirk/NVIDIA and Wen-mei W. Hwu,

©Wen-mei W. Hwu and David Kirk/NVIDIA, University of Illinois, CS/EE 217 GPU Architecture and Parallel Programming Lecture 6: DRAM Bandwidth.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana-Champaign 1 CUDA Threads.

ECE408/CS483 Applied Parallel Programming Lecture 4: Kernel-Based Data Parallel Execution Model © David Kirk/NVIDIA and Wen-mei Hwu, , SSL 2014.

1 CS/EE 217 GPU Architecture and Parallel Programming Lecture 9: Tiled Convolution Analysis © David Kirk/NVIDIA and Wen-mei W. Hwu,

© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, ECE408 Applied Parallel Programming Lecture 19: Atomic.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana Champaign 1 ECE 498AL Programming Massively Parallel Processors.

© David Kirk/NVIDIA and Wen-mei W

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana-Champaign 1 ECE498AL Lecture 3: A Simple Example, Tools, and.

GPU Computing CIS-543 Lecture 09: Shared and Constant Memory

© David Kirk/NVIDIA and Wen-mei W

ECE408/CS483 Fall 2015 Applied Parallel Programming Lecture 7: DRAM Bandwidth ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University.

ECE408/CS483 Applied Parallel Programming Lecture 7: DRAM Bandwidth

L18: CUDA, cont. Memory Hierarchy and Examples

VSCSE Summer School Proven Algorithmic Techniques for Many-core Processors Lecture 5: Data Layout for Grid Applications ©Wen-mei W. Hwu and David Kirk/NVIDIA.

ECE408/CS483 Applied Parallel Programming Lectures 5 and 6: Memory Model and Locality © David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al,

DRAM Bandwidth Slide credit: Slides adapted from

© David Kirk/NVIDIA and Wen-mei W. Hwu,

© David Kirk/NVIDIA and Wen-mei W. Hwu,

L4: Memory Hierarchy Optimization II, Locality and Data Placement

Memory and Data Locality

ECE 8823A GPU Architectures Module 4: Memory Model and Locality

ECE 8823A GPU Architectures Module 3: CUDA Execution Model -I

ECE 8823A GPU Architectures Module 5: Execution and Resources - I

© David Kirk/NVIDIA and Wen-mei W. Hwu,

© David Kirk/NVIDIA and Wen-mei W. Hwu,

Mattan Erez The University of Texas at Austin

Mattan Erez The University of Texas at Austin

© David Kirk/NVIDIA and Wen-mei W

Presentation transcript:

©Wen-mei W. Hwu and David Kirk/NVIDIA, SSL(2014), ECE408/CS483/ECE498AL, University of Illinois, ECE408/CS483 Applied Parallel Programming Lecture 7a: DRAM Bandwidth 1

Objective To understand DRAM bandwidth –Cause of the DRAM bandwidth problem –Programming techniques that address the problem: memory coalescing, corner turning, ©Wen-mei W. Hwu and David Kirk/NVIDIA, SSL(2014), ECE408/CS483/ECE498AL, University of Illinois,

Review: DRAM Bank Organization Each core array has about 1M bits Each bit is stored in a tiny capacitor, made of one transistor Memory Cell Core Array Row Decoder Sense Amps Column Latches Mux Row Addr Column Addr Off-chip Data Wide Narrow Pin Interface 3

Review: DRAM core arrays are slow Reading from a cell in the core array is a very slow process –DDR: Core speed = ½ interface speed –DDR2/GDDR3: Core speed = ¼ interface speed –DDR3/GDDR4: Core speed = ⅛ interface speed –… likely to be worse in the future decode To sense amps A very small capacitance that stores a data bit About 1000 cells connected to each vertical line ©Wen-mei W. Hwu and David Kirk/NVIDIA, SSL(2014), ECE408/CS483/ECE498AL, University of Illinois,

Review: DRAM Bursting ©Wen-mei W. Hwu and David Kirk/NVIDIA, SSL(2014), ECE408/CS483/ECE498AL, University of Illinois, decode 010 Sense amps Mux 5

Review: DRAM Bursting ©Wen-mei W. Hwu and David Kirk/NVIDIA, SSL(2014), ECE408/CS483/ECE498AL, University of Illinois, decode 011 Sense amps and buffer Mux 6

Review: DRAM Bursting for the 8x2 Bank ©Wen-mei W. Hwu and David Kirk/NVIDIA, SSL(2014), ECE408/CS483/ECE498AL, University of Illinois, time Address bits to decoder Core Array access delay 2 bits to pin 2 bits to pin Non-burst timing Burst timing Modern DRAM systems are designed to be always accessed in burst mode. Burst bytes are transferred but discarded when accesses are not to sequential locations. 7

Review: Multiple DRAM Banks ©Wen-mei W. Hwu and David Kirk/NVIDIA, SSL(2014), ECE408/CS483/ECE498AL, University of Illinois, decode Sense amps Mux decode Sense amps Mux 0110 Bank 0 Bank 1 8

Review: DRAM Bursting for the 8x2 Bank ©Wen-mei W. Hwu and David Kirk/NVIDIA, SSL(2014), ECE408/CS483/ECE498AL, University of Illinois, time Address bits to decoder Core Array access delay 2 bits to pin 2 bits to pin Single-Bank burst timing, dead time on interface Multi-Bank burst timing, reduced dead time 9

First-order Look at the GPU off-chip memory subsystem nVidia GTX280 GPU: –Peak global memory bandwidth = 141.7GB/s Global memory (GDDR3) 1.1GHz –(Core 276Mhz) –For a typical 64-bit interface, we can sustain only about 17.6 GB/s (Recall DDR - 2 transfers per clock) –We need a lot more bandwdith (141.7 GB/s) – thus 8 memory channels ©Wen-mei W. Hwu and David Kirk/NVIDIA, SSL(2014), ECE408/CS483/ECE498AL, University of Illinois,

Multiple Memory Channels Divide the memory address space into N parts –N is number of memory channels –Assign each portion to a channel ©Wen-mei W. Hwu and David Kirk/NVIDIA, SSL(2014), ECE408/CS483/ECE498AL, University of Illinois, Channel 0 Channel 1 Channel 2 Channel 3 Bank 11

Memory Controller Organization of a Many-Core Processor GTX280: 30 Stream Multiprocessors (SM) connected to 8-channel DRAM controllers through interconnect –DRAM controllers are interleaved –Within DRAM controllers (channels), DRAM banks are interleaved for incoming memory requests ©Wen-mei W. Hwu and David Kirk/NVIDIA, SSL(2014), ECE408/CS483/ECE498AL, University of Illinois,

M 0,2 M 1,1 M 0,1 M 0,0 M 1,0 M 0,3 M 1,2 M 1,3 M 0,2 M 0,1 M 0,0 M 0,3 M 1,1 M 1,0 M 1,2 M 1,3 M 2,1 M 2,0 M 2,2 M 2,3 M 2,1 M 2,0 M 2,2 M 2,3 M 3,1 M 3,0 M 3,2 M 3,3 M 3,1 M 3,0 M 3,2 M 3,3 M linearized order in increasing address Placing a 2D C array into linear memory space ©Wen-mei W. Hwu and David Kirk/NVIDIA, SSL(2014), ECE408/CS483/ECE498AL, University of Illinois,

Base Matrix Multiplication Kernel __global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width) { // Calculate the row index of the Pd element and M int Row = blockIdx.y*TILE_WIDTH + threadIdx.y; // Calculate the column idenx of Pd and N int Col = blockIdx.x*TILE_WIDTH + threadIdx.x; float Pvalue = 0; // each thread computes one element of the block sub-matrix for (int k = 0; k < Width; ++k) Pvalue += d_M[Row*Width+k]* d_N[k*Width+Col]; d_P[Row*Width+Col] = Pvalue; } 14 ©Wen-mei W. Hwu and David Kirk/NVIDIA, SSL(2014), ECE408/CS483/ECE498AL, University of Illinois,

Two Access Patterns 15 d_M d_N W I D T H WIDTH Thread 1 Thread 2 (a) (b) d_M[Row*Width+k] d_N[k*Width+Col] k is loop counter in the inner product loop of the kernel code ©Wen-mei W. Hwu and David Kirk/NVIDIA, SSL(2014), ECE408/CS483/ECE498AL, University of Illinois,

16 N T0T0 T1T1 T2T2 T3T3 Load iteration 0 T0T0 T1T1 T2T2 T3T3 Load iteration 1 Access direction in Kernel code … N 0,2 N 1,1 N 0,1 N 0,0 N 1,0 N 0,3 N 1,2 N 1,3 N 2,1 N 2,0 N 2,2 N 2,3 N 3,1 N 3,0 N 3,2 N 3,3 N 0,2 N 0,1 N 0,0 N 0,3 N 1,1 N 1,0 N 1,2 N 1,3 N 2,1 N 2,0 N 2,2 N 2,3 N 3,1 N 3,0 N 3,2 N 3,3 N accesses are coalesced. ©Wen-mei W. Hwu and David Kirk/NVIDIA, SSL(2014), ECE408/CS483/ECE498AL, University of Illinois,

M accesses are not coalesced. 17 M T0T0 T1T1 T2T2 T3T3 Load iteration 0 T0T0 T1T1 T2T2 T3T3 Load iteration 1 Access direction in Kernel code … M 0,2 M 1,1 M 0,1 M 0,0 M 1,0 M 0,3 M 1,2 M 1,3 M 2,1 M 2,0 M 2,2 M 2,3 M 3,1 M 3,0 M 3,2 M 3,3 M 0,2 M 0,1 M 0,0 M 0,3 M 1,1 M 1,0 M 1,2 M 1,3 M 2,1 M 2,0 M 2,2 M 2,3 M 3,1 M 3,0 M 3,2 M 3,3 d_M[Row*Width+k] ©Wen-mei W. Hwu and David Kirk/NVIDIA, SSL(2014), ECE408/CS483/ECE498AL, University of Illinois,

__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width) { 1. __shared__float Mds[TILE_WIDTH][TILE_WIDTH]; 2. __shared__float Nds[TILE_WIDTH][TILE_WIDTH]; 3. int bx = blockIdx.x; int by = blockIdx.y; 4. int tx = threadIdx.x; int ty = threadIdx.y; // Identify the row and column of the d_P element to work on 5. int Row = by * TILE_WIDTH + ty; 6. int Col = bx * TILE_WIDTH + tx; 7. float Pvalue = 0; // Loop over the d_M and d_N tiles required to compute the d_P element 8. for (int m = 0; m < Width/TILE_WIDTH; ++m) { // Coolaborative loading of d_M and d_N tiles into shared memory 9. Mds[tx][ty] = d_M[Row*Width + m*TILE_WIDTH+tx]; 10. Nds[tx][ty] = d_N[(m*TILE_WIDTH+ty)*Width + Col]; 11. __syncthreads(); 12. for (int k = 0; k < TILE_WIDTH; ++k) 13. Pvalue += Mds[tx][k] * Nds[k][ty]; 14. __synchthreads(); } 15. d_P[Row*Width+Col] = Pvalue; } ©Wen-mei W. Hwu and David Kirk/NVIDIA, SSL(2014), ECE408/CS483/ECE498AL, University of Illinois,

19 d_M d_N W I D T H WIDTH d_M d_N Original Access Pattern Tiled Access Pattern Copy into scratchpad memory Perform multiplication with scratchpad values Figure 6.10: Using shared memory to enable coalescing ©Wen-mei W. Hwu and David Kirk/NVIDIA, SSL(2014), ECE408/CS483/ECE498AL, University of Illinois,

ANY MORE QUESTIONS? READ CHAPTER 6 ©Wen-mei W. Hwu and David Kirk/NVIDIA, SSL(2014), ECE408/CS483/ECE498AL, University of Illinois,