CS/EE 217 – GPU Architecture and Parallel Programming


CS/EE 217 – GPU Architecture and Parallel Programming Midterm Review

Material on the exam
Modules 1-10 (Module 5 is mixed in there somewhere); Chapters 3-6, 8, and 9
Understand the CUDA C programming model
Understand the architecture limitations and how to navigate them to improve the performance of your code
Parallel programming patterns: analyze for run time, memory performance (global memory traffic, memory coalescing), work efficiency, resource efficiency, ...

Review Problems
Problem 3.5 from the book: If we use each thread to calculate one output element of a vector addition, what is the expression for mapping the thread/block indices to the data index?
i = blockIdx.x*blockDim.x + threadIdx.x
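For reference, a minimal kernel sketch using this mapping; the kernel name vecAddKernel and its parameter names are illustrative, not from the exam:

__global__ void vecAddKernel(const float *A, const float *B, float *C, int n) {
    // One output element per thread; the boundary check handles vector
    // lengths that are not a multiple of the block size.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        C[i] = A[i] + B[i];
}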

Problem 3.6: We want to use each thread to calculate two adjacent elements of a vector addition. Assume that variable i is the index of the first element to be processed by a thread. What is the expression for mapping the thread/block indices to the data index?
i = (blockIdx.x*blockDim.x + threadIdx.x)*2
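A sketch of the corresponding kernel, with illustrative names; each thread handles elements i and i+1:

__global__ void vecAddTwoPerThread(const float *A, const float *B, float *C, int n) {
    // Each thread covers two adjacent elements, so the thread index is scaled by 2.
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 2;
    if (i < n)     C[i]     = A[i]     + B[i];
    if (i + 1 < n) C[i + 1] = A[i + 1] + B[i + 1];
}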

Assume that the vector length is 2000, each thread calculates one output element, and the block size is 512. How many threads will there be in the grid? ceil(2000/512) = 4 blocks, so 4 x 512 = 2048 threads. How many warps will have divergence? 1: only the warp that straddles the boundary at element 2000 has a mix of active and inactive threads (16 of each); the warp after it is entirely inactive and does not diverge.
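A small host-side sketch of the launch arithmetic behind these answers, using the usual ceiling division for the grid size:

#include <stdio.h>

int main(void) {
    int n = 2000, blockSize = 512;
    int numBlocks = (n + blockSize - 1) / blockSize;   // ceil(2000/512) = 4
    int numThreads = numBlocks * blockSize;            // 4 * 512 = 2048
    printf("blocks = %d, threads in grid = %d\n", numBlocks, numThreads);
    // Threads with index >= 2000 fail the i < n check. The warp covering
    // indices 1984..2015 mixes 16 active and 16 inactive threads (divergent);
    // the warp covering 2016..2047 is entirely inactive (not divergent).
    return 0;
}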

4.4: You need to write a kernel that operates on an image of size 400x900. You would like to allocate one thread to each pixel. You would like the thread blocks to be square and to use the maximum number of threads per block possible on the device (assume the maximum is 1024 threads per block). How would you select the grid and block dimensions?
Block dim = 32x32 (= 1024 threads); Grid dim = ceil(400/32) x ceil(900/32) = 13x29
Assuming next that we use blocks of size 16x16, how many warps would experience thread divergence?
200 warps. A warp spans 2 rows x 16 columns of a block. Since 900 % 16 = 4, only 4 of the 16 columns in the rightmost column of blocks cover valid pixels, so every warp in those blocks diverges; 400 / 16 = 25 such blocks with 8 warps each gives 25 x 8 = 200.
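A hedged host-side sketch of this launch configuration; treating 900 as the image width and 400 as the height is an assumption, and imgKernel is an illustrative name:

__global__ void imgKernel(unsigned char *img, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // spans the 900-pixel dimension
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // spans the 400-pixel dimension
    if (col < width && row < height)                   // one thread per pixel, guard the edges
        img[row * width + col] = 255 - img[row * width + col];
}

void launchImgKernel(unsigned char *d_img) {
    dim3 block(32, 32);                                // square, 32 * 32 = 1024 threads
    dim3 grid((900 + block.x - 1) / block.x,           // ceil(900/32) = 29
              (400 + block.y - 1) / block.y);          // ceil(400/32) = 13
    imgKernel<<<grid, block>>>(d_img, 900, 400);
}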

For the simple reduction kernel, if the block size is 1024, how many warps will have thread divergence during the fifth iteration?
All 32 warps in the thread block will be divergent: in the fifth iteration the stride is 16, so only threads with t % 32 == 0 are active, leaving a single active thread in every warp.
How many for the improved kernel? None: in the fifth iteration the active threads satisfy t < stride with stride still a multiple of the warp size, so every warp is either fully active or fully inactive.
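For reference, a sketch of the simple (divergent) reduction kernel the question refers to, following the variable names partialSum and t from the lecture slides; the load and output lines are assumptions:

__global__ void simpleReduction(const float *X, float *out) {
    __shared__ float partialSum[1024];                 // one element per thread, block size 1024
    unsigned int t = threadIdx.x;
    partialSum[t] = X[blockIdx.x * blockDim.x + t];    // assumed load of this block's section
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        if (t % (2 * stride) == 0)                     // active threads are spread across all warps
            partialSum[t] += partialSum[t + stride];
    }
    if (t == 0) out[blockIdx.x] = partialSum[0];       // thread 0 writes the block's sum
}
// Fifth iteration: stride = 16, condition t % 32 == 0, so each 32-thread warp
// has exactly one active thread and all 32 warps diverge.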

Recall the more efficient reduction kernel:

for (unsigned int stride = blockDim.x; stride > 0; stride /= 2) {
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t + stride];
}

A bright engineer wanted to optimize this kernel by unrolling the last five steps as follows.

What are they thinking? Will this work? Will performance be better?

for (unsigned int stride = blockDim.x; stride >= 32; stride >>= 1) {
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t + stride];
}
__syncthreads();
if (t < 32) {
    partialSum[t] += partialSum[t + 16];
    partialSum[t] += partialSum[t + 8];
    partialSum[t] += partialSum[t + 4];
    partialSum[t] += partialSum[t + 2];
    partialSum[t] += partialSum[t + 1];
}

They eliminated __syncthreads() for the last five iterations and rely on implicit warp synchronization: once the stride drops below 32, all active threads belong to a single warp that executes in lockstep, so no block-wide barrier is needed and the loop overhead disappears. For this to be safe, partialSum should be declared volatile so the compiler does not cache the partial sums in registers.
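A self-contained sketch of the unrolled kernel in context; the two-element load per thread (consistent with the stride starting at blockDim.x), the volatile qualifier, and all names are assumptions rather than the exam's exact code. Note that on architectures with independent thread scheduling (Volta and later), this classic warp-synchronous pattern additionally needs __syncwarp() between the unrolled steps:

__global__ void reduceUnrolled(const float *X, float *out) {
    __shared__ volatile float partialSum[2048];        // 2 * block size of 1024
    unsigned int t = threadIdx.x;
    unsigned int start = 2 * blockIdx.x * blockDim.x;
    partialSum[t] = X[start + t];                      // each thread loads two elements
    partialSum[t + blockDim.x] = X[start + blockDim.x + t];

    for (unsigned int stride = blockDim.x; stride >= 32; stride >>= 1) {
        __syncthreads();
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
    }
    __syncthreads();
    if (t < 32) {                                      // last five steps: one warp, no barriers
        partialSum[t] += partialSum[t + 16];
        partialSum[t] += partialSum[t + 8];
        partialSum[t] += partialSum[t + 4];
        partialSum[t] += partialSum[t + 2];
        partialSum[t] += partialSum[t + 1];
    }
    if (t == 0) out[blockIdx.x] = partialSum[0];
}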

8.5: Consider performing a 2D convolution on a square matrix of size n x n with a mask of size m x m.
How many halo elements will there be? (n+m-1)(n+m-1) - n*n
What percentage of the multiplications involves halo elements? Each output element does m*m multiplications and there are n*n output elements, for a total of m*m*n*n; count the multiplications that fall on halo elements near the corners and edges and divide by this total (the full algebra is too time consuming for the exam).
What is the saving in memory accesses for an internal tile (no ghost elements) vs. an untiled implementation? O_TILE_WIDTH^2 * MASK_WIDTH^2 / (O_TILE_WIDTH + MASK_WIDTH - 1)^2, from slide 50 in Module8-Stencil.pdf.
Assuming the implementation where every element has a thread to load it into shared memory, how many warps will there be per block? Every thread loads one element of the input tile. An output tile of t x t requires (t+m-1)(t+m-1) input elements, so each thread block requires ceil((t+m-1)(t+m-1)/32) warps.
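A small worked-arithmetic sketch of these formulas; the concrete sizes n = 100, m = 5, and O_TILE_WIDTH = 12 are illustrative assumptions, not values from the exam:

#include <stdio.h>

int main(void) {
    int n = 100, m = 5;                                  // assumed matrix and mask sizes
    int halo = (n + m - 1) * (n + m - 1) - n * n;        // (n+m-1)^2 - n^2
    printf("halo elements: %d\n", halo);

    int t = 12;                                          // assumed O_TILE_WIDTH
    double saving = (double)(t * t) * (m * m)
                  / ((t + m - 1) * (t + m - 1));         // tiled vs. untiled global accesses
    printf("access reduction for an internal tile: %.2fx\n", saving);

    int loadedElems = (t + m - 1) * (t + m - 1);         // one thread per loaded input element
    int warps = (loadedElems + 31) / 32;                 // round up to whole warps
    printf("warps per block: %d\n", warps);
    return 0;
}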

9.7: Consider the following array: [4 6 7 1 2 8 5 2]. Perform an inclusive prefix sum on the array using the work-inefficient algorithm. Report the intermediate results at every step.
Initial array:        4  6  7  1  2  8  5  2
Work-inefficient scan:
  Step 1 (stride 1):  4 10 13  8  3 10 13  7
  Step 2 (stride 2):  4 10 17 18 16 18 16 17
  Step 3 (stride 4):  4 10 17 18 20 28 33 35
Repeat with the work-efficient kernel.
Reduction steps:
  Step 1 (stride 1):  4 10  7  8  2 10  5  7
  Step 2 (stride 2):  4 10  7 18  2 10  5 17
  Step 3 (stride 4):  4 10  7 18  2 10  5 35
Reverse phase:
  Step 1 (stride 2):  4 10  7 18  2 28  5 35
  Step 2 (stride 1):  4 10 17 18 20 28 33 35
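For reference, a minimal sketch of the work-inefficient (Kogge-Stone) inclusive scan kernel whose steps are traced above, assuming the whole array fits in a single thread block; the temporary value and second __syncthreads() are a conservative choice to avoid the read/write hazard within a step, and SECTION_SIZE is an assumed constant:

#define SECTION_SIZE 1024                               // assumed maximum elements per block

__global__ void koggeStoneScan(const float *X, float *Y, int n) {
    __shared__ float XY[SECTION_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    XY[threadIdx.x] = (i < n) ? X[i] : 0.0f;            // load one element per thread

    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        float temp = 0.0f;
        if (threadIdx.x >= stride)                      // add the element 'stride' positions to the left
            temp = XY[threadIdx.x] + XY[threadIdx.x - stride];
        __syncthreads();
        if (threadIdx.x >= stride)
            XY[threadIdx.x] = temp;
    }
    if (i < n) Y[i] = XY[threadIdx.x];                  // write the inclusive scan result
}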