ECE408/CS483 Fall 2015 Applied Parallel Programming
Lecture 9: Tiled Convolution
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, University of Illinois, 2007-2012

Host Code

// global variable, outside any function
__constant__ float Mc[KERNEL_SIZE][KERNEL_SIZE];
…
// allocate N, P, initialize N elements, copy N to Nd
Matrix M;
M = AllocateMatrix(KERNEL_SIZE, KERNEL_SIZE, 1);
// initialize M elements
…
cudaMemcpyToSymbol(Mc, M.elements,
                   KERNEL_SIZE*KERNEL_SIZE*sizeof(float));
ConvolutionKernel<<<dimGrid, dimBlock>>>(Nd, Pd);
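The snippet leans on the course framework's Matrix type and AllocateMatrix helper, and assumes the device arrays Nd and Pd already exist. As a minimal sketch of that elided allocation (h_N, width, and height are hypothetical names, not from the slides):

    float *Nd, *Pd;                                   // device input / output arrays
    size_t bytes = width * height * sizeof(float);    // hypothetical image dimensions
    cudaMalloc((void**)&Nd, bytes);
    cudaMalloc((void**)&Pd, bytes);
    cudaMemcpy(Nd, h_N, bytes, cudaMemcpyHostToDevice);  // h_N: hypothetical host input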

Observation on Convolution

Each N element is used in calculating up to KERNEL_SIZE * KERNEL_SIZE elements of P.

[Figure: the input grid N shown repeatedly with the small mask M overlaid at different positions, illustrating that a single N element can contribute to up to KERNEL_SIZE * KERNEL_SIZE different P elements.]

Input tiles need to be larger than output tiles.

[Figure: an input tile and the smaller output tile it produces; the input tile extends past the output tile by the mask halo on every side, so a 5x5 mask adds two rows/columns of halo cells around the output tile.]

Dealing with Mismatch

Use a thread block that matches the input tile.
  Why? Each thread loads one element of the input tile.
  Special case: halo cells.
Not all threads participate in calculating output.
  There will be if statements and control divergence.

Shifting from output coordinates to input coordinates

[Figure: the BLOCK_SIZE x BLOCK_SIZE thread block laid over the input tile, with the TILE_SIZE x TILE_SIZE output tile inside it; the thread at (tx, ty) is shifted toward the upper-left to locate its input element.]

Shifting from output coordinates to input coordinates

int tx = threadIdx.x;
int ty = threadIdx.y;
int row_o = blockIdx.y * TILE_SIZE + ty;
int col_o = blockIdx.x * TILE_SIZE + tx;
int row_i = row_o - 2; // Assumes kernel size is 5
int col_i = col_o - 2; // Assumes kernel size is 5
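The hard-coded 2 is the mask half-width, (KERNEL_SIZE - 1)/2, for the assumed 5x5 kernel. A generalized form of the last two lines (a sketch, not on the slide) would be:

    int row_i = row_o - (KERNEL_SIZE - 1)/2;  // shift up by the mask half-width
    int col_i = col_o - (KERNEL_SIZE - 1)/2;  // shift left by the mask half-width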

Threads that load halo cells outside N should return 0.0.

Taking Care of Boundaries

float output = 0.0f;
__shared__ float Ns[TILE_SIZE+KERNEL_SIZE-1][TILE_SIZE+KERNEL_SIZE-1];

if((row_i >= 0) && (row_i < N.height) &&
   (col_i >= 0) && (col_i < N.width))
    Ns[ty][tx] = N.elements[row_i*N.width + col_i];
else
    Ns[ty][tx] = 0.0f;   // halo cells outside N contribute 0

Some threads do not participate in calculating output.

__syncthreads();  // make sure the entire input tile is loaded first
if(ty < TILE_SIZE && tx < TILE_SIZE){
    for(int i = 0; i < 5; i++)       // 5 = KERNEL_SIZE, matching row_i/col_i above
        for(int j = 0; j < 5; j++)
            output += Mc[i][j] * Ns[i+ty][j+tx];
}

Some threads do not write output.

// This test applies in addition to the (ty < TILE_SIZE && tx < TILE_SIZE)
// test above: only threads mapped to valid P elements may write, otherwise
// halo threads would clobber neighboring tiles' outputs with 0.0f.
if(row_o < P.height && col_o < P.width)
    P.elements[row_o * P.width + col_o] = output;
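Putting the last few slides together, a hedged sketch of the complete kernel (assumptions: the course framework's Matrix struct with width, height, and elements fields; Mc, TILE_SIZE, and KERNEL_SIZE as declared earlier; an assembly of the fragments above, not verbatim lecture code):

    __global__ void ConvolutionKernel(Matrix N, Matrix P)
    {
        int tx = threadIdx.x;
        int ty = threadIdx.y;
        int row_o = blockIdx.y * TILE_SIZE + ty;
        int col_o = blockIdx.x * TILE_SIZE + tx;
        int row_i = row_o - (KERNEL_SIZE - 1)/2;
        int col_i = col_o - (KERNEL_SIZE - 1)/2;

        __shared__ float Ns[TILE_SIZE+KERNEL_SIZE-1][TILE_SIZE+KERNEL_SIZE-1];

        // Every thread loads one input element; halo cells outside N load 0.0f.
        if((row_i >= 0) && (row_i < N.height) &&
           (col_i >= 0) && (col_i < N.width))
            Ns[ty][tx] = N.elements[row_i*N.width + col_i];
        else
            Ns[ty][tx] = 0.0f;
        __syncthreads();   // the whole input tile must be loaded before use

        // Only threads mapped to the output tile compute and write.
        if(ty < TILE_SIZE && tx < TILE_SIZE){
            float output = 0.0f;
            for(int i = 0; i < KERNEL_SIZE; i++)
                for(int j = 0; j < KERNEL_SIZE; j++)
                    output += Mc[i][j] * Ns[i+ty][j+tx];
            if(row_o < P.height && col_o < P.width)
                P.elements[row_o*P.width + col_o] = output;
        }
    }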

Setting Block Size

#define BLOCK_SIZE (TILE_SIZE + 4)
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);

In general, block size should be tile size + (kernel size - 1).
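With BLOCK_SIZE threads per block but only TILE_SIZE output points per block in each dimension, the grid must be sized by output tiles. A hedged sketch of the matching launch (assuming P holds the output dimensions):

    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid((P.width  + TILE_SIZE - 1) / TILE_SIZE,   // ceiling division
                 (P.height + TILE_SIZE - 1) / TILE_SIZE);  // over output tiles
    ConvolutionKernel<<<dimGrid, dimBlock>>>(Nd, Pd);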

In General

BLOCK_SIZE is limited by the maximum number of threads in a thread block.
Thread coarsening (a sketch follows below):
  Input tile sizes could be integer multiples of BLOCK_SIZE.
  Each thread calculates n output points.
  n is limited by the shared memory size.
KERNEL_SIZE is decided by application needs.
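One common coarsened variant (a sketch under stated assumptions, not the lecture's exact scheme: a flat input pointer N with dimensions width and height) sizes the block to the output tile and lets each thread loop to load several input elements, which also removes the idle halo threads and their control divergence:

    #define IN_TILE (TILE_SIZE + KERNEL_SIZE - 1)

    __shared__ float Ns[IN_TILE][IN_TILE];
    int t = threadIdx.y * TILE_SIZE + threadIdx.x;        // flat thread index
    // Each of the TILE_SIZE*TILE_SIZE threads loads one or more input elements.
    for(int idx = t; idx < IN_TILE * IN_TILE; idx += TILE_SIZE * TILE_SIZE){
        int r = idx / IN_TILE, c = idx % IN_TILE;
        int gr = blockIdx.y * TILE_SIZE + r - (KERNEL_SIZE - 1)/2;
        int gc = blockIdx.x * TILE_SIZE + c - (KERNEL_SIZE - 1)/2;
        Ns[r][c] = (gr >= 0 && gr < height && gc >= 0 && gc < width)
                 ? N[gr * width + gc] : 0.0f;
    }
    __syncthreads();
    // Every thread now computes exactly one output point; no ty/tx guard needed.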

A Simple Analysis

For a small 8x8 output tile example with a 5x5 mask:
  12x12 = 144 N elements need to be loaded into shared memory.
  The calculation of each P element needs to access 25 N elements.
  8x8x25 = 1600 global memory accesses are converted into shared memory accesses.
  A reduction of 1600/144 ≈ 11x.

In General

(Tile_Width + Mask_Width - 1)² elements of N need to be loaded into shared memory.
The calculation of each element of P needs to access Mask_Width² elements of N.
Tile_Width² * Mask_Width² global memory accesses are converted into shared memory accesses.

Bandwidth Reduction for 2D

The reduction is Mask_Width² * Tile_Width² / (Tile_Width + Mask_Width - 1)².

Tile_Width        8     16     32     64
Mask_Width = 5   11.1   16.0   19.7   22.1
Mask_Width = 9   20.3   36.0   51.8   64.0
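Each entry follows directly from the formula above; a quick host-side check (plain C, a sketch, not part of the lecture) reproduces the table:

    #include <stdio.h>

    int main(void)
    {
        int tiles[] = {8, 16, 32, 64};
        int masks[] = {5, 9};
        for(int m = 0; m < 2; m++)
            for(int t = 0; t < 4; t++){
                double T = tiles[t], M = masks[m];
                // reduction = Mask_Width^2 * Tile_Width^2 / (Tile_Width + Mask_Width - 1)^2
                double reduction = (M*M * T*T) / ((T + M - 1) * (T + M - 1));
                printf("Mask_Width=%d Tile_Width=%2d reduction=%.1f\n",
                       masks[m], tiles[t], reduction);
            }
        return 0;
    }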

Any more questions? Read Chapter 8.