©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010 VSCSE Summer School Proven Algorithmic Techniques for Many-core Processors Lecture 3: Blocking/Tiling for Locality

Objective
Reuse each piece of data accessed from global memory multiple times:
– Across threads – shared memory blocking
– Within a thread – register tiling
Register tiling is also often used to reuse computation results for increased efficiency.
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Basic Idea
[Figure: without tiling, Thread 1, Thread 2, … each read the input array in directly from global memory; with tiling, in is first staged from global memory into on-chip memory, and the threads read it from there.]
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Basic Concept of Blocking/Tiling
In a congested traffic system, a significant reduction in the number of vehicles greatly improves the delay seen by all vehicles:
– Carpooling for commutes
– Blocking/Tiling for global memory accesses
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Some computations are more challenging to block/tile than others.
Some carpools may be easier than others:
– More efficient if neighbors are also classmates or co-workers
– Some vehicles may be more suitable for carpooling
Similar variations exist in blocking/tiling.
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Carpools need synchronization.
Good – when people have similar schedules
Bad – when people have very different schedules
[Figure: timelines for Worker A and Worker B; in the good case their sleep, work, and dinner periods line up, in the bad case one worker's party shifts the schedule so the two rarely coincide.]
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Same with Blocking/Tiling
Good – when threads have similar access timing
Bad – when threads have very different timing
[Figure: timelines for Thread 1 and Thread 2; in the good case their accesses to the shared tile occur at about the same time, in the bad case they are far apart.]
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Outline of Technique
– Identify a block/tile of global memory contents that are accessed by multiple threads
– Load the block/tile from global memory into on-chip memory
– Have the multiple threads get their data from the on-chip memory
– Move on to the next block/tile
(A minimal skeleton of this loop is sketched below.)
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010
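A generic sketch of the four steps above on a toy computation (each output element is the sum of its tile, so every thread reuses all TILE elements staged on chip). TILE, tileSumKernel, and the in/out arrays are illustrative assumptions, not the matrix-multiplication kernel shown later; the sketch assumes blockDim.x == TILE and n is a multiple of TILE.

#define TILE 256   // tile size; assume blockDim.x == TILE and n is a multiple of TILE

__global__ void tileSumKernel(const float* in, float* out, int n) {
    __shared__ float tile[TILE];                      // on-chip copy of one tile
    for (int base = blockIdx.x * TILE; base < n; base += gridDim.x * TILE) {
        tile[threadIdx.x] = in[base + threadIdx.x];   // steps 1-2: identify and load the tile
        __syncthreads();                              // tile must be complete before use
        float sum = 0.0f;
        for (int k = 0; k < TILE; ++k)
            sum += tile[k];                           // step 3: all threads read from on-chip memory
        out[base + threadIdx.x] = sum;
        __syncthreads();                              // step 4: finish before loading the next tile
    }
}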

Tiled Matrix Multiply
[Figure: Md, Nd, and Pd matrices of size WIDTH x WIDTH, with block indices bx, by, thread indices tx, ty, and a TILE_WIDTH x TILE_WIDTH sub-matrix Pdsub marked; Thread 1 and Thread 2 traverse a row of Md.]
Each row of Md is accessed by multiple threads.
Problem: some threads can be much further along than others
– An entire row may need to be in on-chip memory
– Not enough on-chip memory for large input matrices
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

A Small Example
Can we use two on-chip memory locations to reduce the number of M accesses by the two threads?
– Not if the two threads can have very different timing!
[Figure: the Pd elements of the small example.]
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Every M and N element is used exactly twice in generating a 2x2 tile of P

Access order   P0,0 (thread 0,0)   P1,0 (thread 1,0)   P0,1 (thread 0,1)   P1,1 (thread 1,1)
step 1         M0,0 * N0,0         M0,0 * N1,0         M0,1 * N0,0         M0,1 * N1,0
step 2         M1,0 * N0,1         M1,0 * N1,1         M1,1 * N0,1         M1,1 * N1,1
step 3         M2,0 * N0,2         M2,0 * N1,2         M2,1 * N0,2         M2,1 * N1,2
step 4         M3,0 * N0,3         M3,0 * N1,3         M3,1 * N0,3         M3,1 * N1,3

©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Breaking Md and Nd into Tiles
[Figure: the Md elements (Md0,0–Md3,1), Nd elements (Nd0,0–Nd1,3), and Pd elements (Pd0,0–Pd3,3) of the small example, partitioned into 2x2 tiles.]
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Each phase uses one tile from Md and one from Nd (time runs from Phase 1 to Phase 2; in each phase, the three columns correspond to loading the Md tile, loading the Nd tile, and the multiply-accumulate):

Thread T0,0
  Phase 1: Md0,0 ↓ Mds0,0;  Nd0,0 ↓ Nds0,0;  PValue0,0 += Mds0,0*Nds0,0 + Mds1,0*Nds0,1
  Phase 2: Md2,0 ↓ Mds0,0;  Nd0,2 ↓ Nds0,0;  PValue0,0 += Mds0,0*Nds0,0 + Mds1,0*Nds0,1
Thread T1,0
  Phase 1: Md1,0 ↓ Mds1,0;  Nd1,0 ↓ Nds1,0;  PValue1,0 += Mds0,0*Nds1,0 + Mds1,0*Nds1,1
  Phase 2: Md3,0 ↓ Mds1,0;  Nd1,2 ↓ Nds1,0;  PValue1,0 += Mds0,0*Nds1,0 + Mds1,0*Nds1,1
Thread T0,1
  Phase 1: Md0,1 ↓ Mds0,1;  Nd0,1 ↓ Nds0,1;  PValue0,1 += Mds0,1*Nds0,0 + Mds1,1*Nds0,1
  Phase 2: Md2,1 ↓ Mds0,1;  Nd0,3 ↓ Nds0,1;  PValue0,1 += Mds0,1*Nds0,0 + Mds1,1*Nds0,1
Thread T1,1
  Phase 1: Md1,1 ↓ Mds1,1;  Nd1,1 ↓ Nds1,1;  PValue1,1 += Mds0,1*Nds1,0 + Mds1,1*Nds1,1
  Phase 2: Md3,1 ↓ Mds1,1;  Nd1,3 ↓ Nds1,1;  PValue1,1 += Mds0,1*Nds1,0 + Mds1,1*Nds1,1

©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

First-order Size Considerations
Each thread block should have many threads
– TILE_WIDTH of 16 gives 16*16 = 256 threads
There should be many thread blocks
– A 1024*1024 Pd gives 64*64 = 4096 thread blocks
Each thread block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations.
– Memory bandwidth is no longer a limiting factor
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

CUDA Code – Kernel Execution Configuration

// Set up the execution configuration
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);

©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010
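For completeness, a minimal sketch of the corresponding host-side launch (assuming Width is a multiple of TILE_WIDTH, as on the size-considerations slide, and that Md, Nd, Pd already point to device memory):

// Launch one thread per Pd element, organized into TILE_WIDTH x TILE_WIDTH blocks
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
cudaDeviceSynchronize();   // wait for the kernel to finish before using Pd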

Tiled Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

  int bx = blockIdx.x;  int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;

  // Identify the row and column of the Pd element to work on
  int Row = by * TILE_WIDTH + ty;
  int Col = bx * TILE_WIDTH + tx;
  float Pvalue = 0;

  // Loop over the Md and Nd tiles required to compute the Pd element
  for (int m = 0; m < Width/TILE_WIDTH; ++m) {
    // Collaborative loading of Md and Nd tiles into shared memory
    Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
    Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
    __syncthreads();                       // wait until the whole tile is loaded

    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += Mds[ty][k] * Nds[k][tx];
    __syncthreads();                       // wait until the tile is no longer needed
  }
  Pd[Row*Width + Col] = Pvalue;
}

©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Tiled Multiply
[Figure: Md, Nd, and Pd of size WIDTH x WIDTH, with the tile loop index m, the inner index k, block indices bx, by, and thread indices tx, ty marked on the TILE_WIDTH-wide strips.]
Each block computes one square sub-matrix Pdsub of size TILE_WIDTH x TILE_WIDTH.
Each thread computes one element of Pdsub.
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Shared Memory and Threading
Each SM in Fermi has 64KB of on-chip SRAM, partitioned into 48KB L1 cache and 16KB shared memory, or vice versa
– The on-chip memory size per SM is implementation dependent!
– For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory.
– The SM can potentially have up to 8 thread blocks actively executing. This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block).
– The next TILE_WIDTH, 32, would lead to 2*32*32*4B = 8KB of shared memory usage per thread block, allowing only 2 or 6 thread blocks active at the same time (16KB vs. 48KB shared memory configuration).
Using 16x16 tiling, we reduce the accesses to global memory by a factor of 16
– A 150GB/s bandwidth can now support (150/4)*16 = 600 GFLOPS!
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010
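A back-of-the-envelope sketch of the occupancy arithmetic above; the 8-block cap and the 16KB shared memory size are taken from the slide and hard-coded rather than queried from the device.

#include <stdio.h>

int main(void) {
    const int tile = 16;                                      // TILE_WIDTH
    const int threads_per_block = tile * tile;                // 256 threads
    const size_t smem_per_block = 2UL * threads_per_block * sizeof(float); // Mds + Nds = 2KB
    const size_t smem_per_sm = 16 * 1024;                     // 16KB shared memory configuration
    const int max_blocks_per_sm = 8;                          // Fermi per-SM block limit (assumed)

    int blocks = (int)(smem_per_sm / smem_per_block);         // blocks allowed by shared memory
    if (blocks > max_blocks_per_sm) blocks = max_blocks_per_sm;

    printf("blocks/SM = %d, pending loads = %d\n",
           blocks, blocks * 2 * threads_per_block);           // 8 blocks, 4,096 pending loads
    return 0;
}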

More Difficult Blocking/Tiling Cases
Some applications do not access input data in a uniform way:
– Convolution
– PDE solvers
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

STENCIL CODE EXAMPLE ©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Stencil Computation Describes the class of nearest neighbour computations on structured grids. Each point in the grid is a weighted linear combination of a subset of neighbouring values. ©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Overview
Application: 7-point stencil computation / Stencil Probe.
Optimizations and concepts covered: improving locality and data reuse
– 2D tiling in shared memory
– Register tiling
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Stencil Computation
– High parallelism: conceptually, all points in the grid can be updated in parallel.
– Each computation performs a global sweep through the data structure.
– Low computational intensity: high memory traffic for very few computations.
Challenge: exploit parallelism without overusing memory bandwidth.
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Implementation Details
– General equation (a typical 7-point form is sketched below)
– Separate read and write arrays.
– Mapping of arrays from 3D space to linear array space.
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010
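A minimal sketch of these two items, assuming a typical 7-point stencil with hypothetical weights c0..c3 (the actual coefficients come from the application) and the usual row-major 3D-to-linear mapping:

// Linear index of point (i, j, k) in an nx x ny x nz grid, row-major along x
__host__ __device__ inline int idx(int i, int j, int k, int nx, int ny) {
    return (k * ny + j) * nx + i;
}

// One 7-point stencil update; out is a separate array from in (no read/write overlap):
// out[i,j,k] = c0*in[i,j,k]
//            + c1*(in[i-1,j,k] + in[i+1,j,k])
//            + c2*(in[i,j-1,k] + in[i,j+1,k])
//            + c3*(in[i,j,k-1] + in[i,j,k+1])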

Naïve Implementation
Each thread calculates a one-element thin column along the z-dimension
– Each block computes a rectangular column along the z-dimension
Each thread loads its input elements from global memory, independently of other threads
– High read redundancy, heavy global memory traffic
Optimization – each thread can reuse data along the z-dimension
– The current center input becomes the bottom input
– The current top input becomes the center input
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Sample Naïve Kernel Code ©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010
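A minimal sketch consistent with the description on the previous slide, not the original kernel from the slide: one thread per (x, y) column marching up z, reusing the bottom/center/top values in registers; idx() and the weights c0..c3 are the assumed helpers from the earlier sketch.

__global__ void stencil_naive(const float* in, float* out,
                              int nx, int ny, int nz,
                              float c0, float c1, float c2, float c3) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // x coordinate of this thread's column
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // y coordinate of this thread's column
    if (i < 1 || i >= nx - 1 || j < 1 || j >= ny - 1) return;   // skip boundary columns

    float bottom = in[idx(i, j, 0, nx, ny)];
    float center = in[idx(i, j, 1, nx, ny)];
    for (int k = 1; k < nz - 1; ++k) {
        float top = in[idx(i, j, k + 1, nx, ny)];    // only the top value is a new z-load
        out[idx(i, j, k, nx, ny)] =
              c0 * center
            + c1 * (in[idx(i - 1, j, k, nx, ny)] + in[idx(i + 1, j, k, nx, ny)])
            + c2 * (in[idx(i, j - 1, k, nx, ny)] + in[idx(i, j + 1, k, nx, ny)])
            + c3 * (bottom + top);
        bottom = center;                             // current center becomes the next bottom
        center = top;                                // current top becomes the next center
    }
}

With this register reuse, each output needs 5 global loads (4 planar neighbors plus the new top), matching the count on the next slide.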

Memory Loads in the Naïve Kernel
– Assuming no data reuse along the z-direction within each thread, a thread loads 7 input elements for each output element.
– With data reuse within each thread, a thread loads 5 input elements for each output element.
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Data Reuse Each internal point is used to calculate seven output values –self, 4 planar neighbors, top and bottom neighbors Surface, edge, and corner points are used for fewer output values ©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Improving Locality: 2D Tiling
Assume that all threads of a block march up the z-direction in synchronized phases
In each phase, all threads calculate a 2-D slice of the rectangular output column
For each phase, maintain three slices of relevant input data in the on-chip memories
– One top and one bottom element in each thread's private registers
– All current elements also in shared memory
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Improving Locality: 2D Tiling (cont.)
From one phase to the next, the kernel code
– Moves the current element to the register holding the bottom element
– Moves the top element from the top register to the current register and shared memory
– Loads a new top element from global memory into a register
(a sketch of this per-phase rotation follows below)
Need to load halo data as well
– Needed to calculate the edge elements of the column
– For each 3D n x m x p output block to be computed, we need to load (n+2) x (m+2) x (p+2) inputs.
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010
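A minimal sketch of this phase loop, not the authors' exact code: one thread per (x, y) position in a TILE_X x TILE_Y tile, bottom/current/top kept in registers, and the current slice mirrored in shared memory. To keep the sketch short, halo values at the tile edges are read from global memory instead of loading the (n+2) x (m+2) slice into shared memory as the slide describes; idx() and c0..c3 are the assumed helpers from earlier, and nx, ny are assumed to be multiples of the tile sizes.

#define TILE_X 16
#define TILE_Y 16

__global__ void stencil_tiled(const float* in, float* out,
                              int nx, int ny, int nz,
                              float c0, float c1, float c2, float c3) {
    __shared__ float ds_cur[TILE_Y][TILE_X];           // current z-slice for this block

    int tx = threadIdx.x, ty = threadIdx.y;
    int i = blockIdx.x * TILE_X + tx;
    int j = blockIdx.y * TILE_Y + ty;
    bool interior = (i > 0 && i < nx - 1 && j > 0 && j < ny - 1);

    // Register tiling along z: keep bottom/current/top in registers
    float bottom  = in[idx(i, j, 0, nx, ny)];
    float current = in[idx(i, j, 1, nx, ny)];

    for (int k = 1; k < nz - 1; ++k) {
        float top = in[idx(i, j, k + 1, nx, ny)];      // the only new z-load this phase
        ds_cur[ty][tx] = current;                      // publish the current slice to the block
        __syncthreads();                               // slice complete before anyone reads it

        if (interior) {
            // Planar neighbors come from shared memory inside the tile,
            // and from global memory when they fall in the halo
            float left  = (tx > 0)          ? ds_cur[ty][tx - 1] : in[idx(i - 1, j, k, nx, ny)];
            float right = (tx < TILE_X - 1) ? ds_cur[ty][tx + 1] : in[idx(i + 1, j, k, nx, ny)];
            float front = (ty > 0)          ? ds_cur[ty - 1][tx] : in[idx(i, j - 1, k, nx, ny)];
            float back  = (ty < TILE_Y - 1) ? ds_cur[ty + 1][tx] : in[idx(i, j + 1, k, nx, ny)];
            out[idx(i, j, k, nx, ny)] =
                c0 * current + c1 * (left + right) + c2 * (front + back) + c3 * (bottom + top);
        }
        __syncthreads();                               // finish reads before the next overwrite

        bottom  = current;                             // rotate registers for the next phase
        current = top;
    }
}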

Halo overhead can hurt.
For small n and m, the halo overhead can be very significant
– If n = 16 and m = 8, each slice calculates 16*8 = 128 output elements and needs to load 18*10 = 180 input elements
– In the optimized naive code, each output element needs 5 loads from global memory, a total of 5*128 = 640 loads
– The total ratio of improvement is 640/180 = 3.5, rather than 5 times
– The values of n and m are limited by the amount of registers and shared memory in each SM
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

CONVOLUTION EXAMPLE ©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Convolution
A convolution kernel is used to change an input element into the weighted linear combination of itself and its neighbors.
[Figure: Thread 1, Thread 2, … each read a window of the input array in and produce one element of the output array out.]
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Example: Convolution – Base Parallel Code
Each parallel task calculates an output element.
The figure shows:
– 1D convolution with a K=5 kernel
– Calculation of 3 output elements
Highly parallel but memory-bandwidth inefficient
– Uses massive threading to tolerate memory latency
– Each input element is loaded up to K times
[Figure: input elements in main memory, with overlapping 5-element windows feeding 3 output elements.]
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010
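A minimal sketch of this base parallel version, assuming a 1D input of length n, a convolution kernel of width K stored in a plain array w[], and one thread per output element; boundary outputs are simply skipped rather than padded.

__global__ void conv1d_naive(const float* in, const float* w, float* out,
                             int n, int K) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one output element per thread
    int r = K / 2;                                   // kernel radius
    if (i < r || i >= n - r) return;                 // skip boundary outputs

    float sum = 0.0f;
    for (int k = 0; k < K; ++k)
        sum += w[k] * in[i - r + k];                 // each input element is read up to K times overall
    out[i] = sum;
}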

Example: convolution using on-chip caching
Output elements are calculated from cache contents
– Each input element is loaded only once
– Cache pressure: (K-1+N) input elements are needed for N output elements
– e.g., for K=5 and N=3: 7/3 ≈ 2.3 in 1D, 7²/3² ≈ 5.4 in 2D, 7³/3³ ≈ 12.7 in 3D
For small caches, the benefit can be significantly reduced due to the high ratio of additional elements loaded.
[Figure: input elements are first loaded into the cache, then the output elements are computed from the cached copies.]
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010
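A minimal sketch of the cached version using shared memory, not the authors' exact code: same signature as the naive sketch above, blockDim.x output elements per block, and K-1 halo elements loaded per block; here out-of-range inputs are zero-padded instead of skipping boundary outputs.

// Launch: conv1d_tiled<<<grid, block, (block + K - 1) * sizeof(float)>>>(in, w, out, n, K);
__global__ void conv1d_tiled(const float* in, const float* w, float* out,
                             int n, int K) {
    extern __shared__ float ds_in[];                 // blockDim.x + K - 1 cached input elements
    int r = K / 2;
    int tile_start = blockIdx.x * blockDim.x;        // first output element of this block
    int i = tile_start + threadIdx.x;

    // Cooperatively load the tile plus its halo: inputs [tile_start - r, tile_start + blockDim.x + r)
    for (int s = threadIdx.x; s < blockDim.x + K - 1; s += blockDim.x) {
        int g = tile_start - r + s;
        ds_in[s] = (g >= 0 && g < n) ? in[g] : 0.0f; // zero-pad outside the array
    }
    __syncthreads();                                 // all cached inputs ready before use

    if (i < n) {
        float sum = 0.0f;
        for (int k = 0; k < K; ++k)
            sum += w[k] * ds_in[threadIdx.x + k];    // inputs now come from shared memory
        out[i] = sum;
    }
}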

Tiling for convolution is both easy and hard
Convolution is conceptually easy to block
– All threads that calculate a tile of output can share a tile of input
High-performance tiling can be hard to achieve
– Larger convolution kernels increase data sharing but also increase halo overhead
– Many 3D convolution kernels cannot be done by traversing along the z-dimension sequentially
– Higher-dimension convolutions quickly run out of on-chip memory space
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

ANY MORE QUESTIONS? ©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010