ECE408 Fall 2015 Applied Parallel Programming
Lecture 11: Parallel Computation Patterns – Reduction Trees
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, University of Illinois, 2007-2012

Objective
- Introduce the concept of reduction trees, arguably the most widely used parallel computation pattern
- Memory coalescing
- Control divergence
- Thread utilization

Scales of Parallelism
- Instruction level
- Data level
- Task level
- Work level

How would we parallelize this?
(Figure: H.264 video encode pipeline.)

What is a reduction computation?
- Summarize a set of input values into one value using a "reduction operation"
  - Max
  - Min
  - Sum
  - Product
- Often with a user-defined reduction operation function, as long as the operation
  - is associative and commutative
  - has a well-defined identity value (e.g., 0 for sum)

Partition and Summarize
- A commonly used strategy for processing large input data sets when there is no required order of processing elements (the operation is associative and commutative)
- Partition the data set into smaller chunks
- Have each thread process a chunk
- Use a reduction tree to summarize the results from each chunk into the final answer
- We will focus on the reduction tree step for now
- Google MapReduce and Hadoop are examples of frameworks built around this pattern
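As a minimal host-side illustration of the partition-and-summarize idea (a sketch only, assuming a sum reduction over floats; the function and parameter names are illustrative, and the per-chunk loop stands in for the per-thread work):

    #include <stddef.h>

    // Partition the input into chunks, summarize each chunk, then
    // combine the chunk summaries into the final answer.
    float reduce_chunked(const float *in, size_t n, size_t chunk_size) {
        float total = 0.0f;                      // identity value for sum
        for (size_t start = 0; start < n; start += chunk_size) {
            size_t end = (start + chunk_size < n) ? start + chunk_size : n;
            float partial = 0.0f;                // summary of this chunk
            for (size_t i = start; i < end; ++i)
                partial += in[i];
            total += partial;                    // combine chunk summaries
        }
        return total;
    }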

A sequential reduction algorithm performs O(N) operations
- Initialize the result to the identity value for the reduction operation
  - Smallest possible value for max reduction
  - Largest possible value for min reduction
  - 0 for sum reduction
  - 1 for product reduction
- Scan through the input and perform the reduction operation between the result value and the current input value
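A minimal sketch of this sequential algorithm with a user-supplied operation and identity value (names are illustrative; shown here with max as the example operation):

    #include <float.h>   // for FLT_MAX

    // Sequential reduction: O(N) applications of the operation, one pass over the input.
    float sequential_reduce(const float *in, unsigned int n,
                            float (*op)(float, float), float identity) {
        float result = identity;             // initialize to the identity value
        for (unsigned int i = 0; i < n; ++i)
            result = op(result, in[i]);      // combine result with the current input
        return result;
    }

    // Example operation: max, whose identity is the smallest possible value, -FLT_MAX.
    float max_op(float a, float b) { return a > b ? a : b; }
    // Usage: float m = sequential_reduce(data, n, max_op, -FLT_MAX);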

A parallel reduction tree algorithm performs N-1 operations in log(N) steps
(Figure: a max-reduction tree over the inputs 3, 1, 7, 0, 4, 1, 6, 3. The first round of pairwise max operations produces 3, 7, 4, 6; the second round produces 7, 6; the final max produces 7.)

A Quick Analysis
- For N input values, the reduction tree performs (1/2)N + (1/4)N + (1/8)N + ... + (1/2^(log2 N))N = N-1 operations
- In log2(N) steps: 1,000,000 input values take 20 steps
  - Assuming that we have enough execution resources
- Average parallelism is (N-1)/log2(N)
  - For N = 1,000,000, average parallelism is 50,000
  - However, the peak resource requirement is 500,000!
- This is a work-efficient parallel algorithm
  - The amount of work done is comparable to the sequential algorithm
  - Many parallel algorithms are not work-efficient
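The operation count above is a geometric series; written out (assuming N is a power of two):

    \sum_{k=1}^{\log_2 N} \frac{N}{2^k} = N\left(1 - \frac{1}{N}\right) = N - 1,
    \qquad \text{steps} = \log_2 N,
    \qquad \text{average parallelism} = \frac{N-1}{\log_2 N}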

Simple Vector Reduction
(Figure: 16 partial-sum elements, indexed 0-15, with threads 0-7 each responsible for one pair. Step 1 produces the pairwise partial sums 0+1, 2+3, 4+5, 6+7, 8+9, 10+11, ...; step 2 produces partial sums over 0..3, 4..7, 8..11, ...; step 3 produces partial sums over 0..7 and 8..15.)

A Sum Example
(Figure: sum reduction over the inputs 3, 1, 7, 0, 4, 1, 6, 3 using threads 0-3. After step 1, the active partial-sum elements hold 4, 7, 5, 9; after step 2 they hold 11, 14; after step 3, the final sum 25 remains.)

Simple Thread Index to Data Mapping
- Each thread is responsible for an even-index location of the partial sum vector
  - One input is the location it is responsible for
- After each step, half of the threads are no longer needed
- In each step, one of the inputs comes from an increasing distance away
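A trace of this mapping for one block, assuming blockDim.x = 8 (so 16 partial-sum elements); it mirrors the reduction loop shown two slides below:

    // stride 1: all 8 threads active; thread t does partialSum[2t] += partialSum[2t + 1]
    // stride 2: threads 0, 2, 4, 6 active; thread t does partialSum[2t] += partialSum[2t + 2]
    // stride 4: threads 0, 4 active; thread t does partialSum[2t] += partialSum[2t + 4]
    // stride 8: only thread 0 active; it does partialSum[0] += partialSum[8]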

A Simple Thread Block Design
- Each thread block takes 2 * blockDim.x input elements
- Each thread loads 2 elements into shared memory

    __shared__ float partialSum[2 * BLOCK_SIZE];

    unsigned int tx = threadIdx.x;
    unsigned int start = 2 * blockIdx.x * blockDim.x;
    partialSum[tx] = input[start + tx];
    partialSum[blockDim.x + tx] = input[start + blockDim.x + tx];
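A host-side launch sketch under these assumptions (the kernel name reductionKernel and the d_input/d_partialSums arrays are illustrative, not from the slides); since each block consumes 2 * BLOCK_SIZE input elements, the grid is sized accordingly:

    // One block per 2 * BLOCK_SIZE input elements, rounded up.
    unsigned int numBlocks = (N + 2 * BLOCK_SIZE - 1) / (2 * BLOCK_SIZE);
    reductionKernel<<<numBlocks, BLOCK_SIZE>>>(d_input, d_partialSums, N);
    // d_partialSums then holds one partial result per block; reduce it on the
    // host or with another kernel launch to obtain the final value.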

The Reduction Step

    unsigned int stride;
    for (stride = 1; stride <= blockDim.x; stride *= 2) {
        __syncthreads();
        if (tx % stride == 0)
            partialSum[2 * tx] += partialSum[2 * tx + stride];
    }

Let's think about the __syncthreads() placement.
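Putting the two code fragments together, a minimal kernel sketch (an assumption-laden completion, not the slides' code: the kernel name, the bounds checks on the loads, and the final step where thread 0 writes the block's partial sum to output[blockIdx.x] are added here for completeness):

    #define BLOCK_SIZE 256

    __global__ void reductionKernel(const float *input, float *output, unsigned int N) {
        __shared__ float partialSum[2 * BLOCK_SIZE];

        unsigned int tx = threadIdx.x;
        unsigned int start = 2 * blockIdx.x * blockDim.x;

        // Each thread loads two elements; out-of-range elements become 0, the sum identity.
        partialSum[tx] = (start + tx < N) ? input[start + tx] : 0.0f;
        partialSum[blockDim.x + tx] =
            (start + blockDim.x + tx < N) ? input[start + blockDim.x + tx] : 0.0f;

        // Reduction tree: each step halves the number of active threads,
        // and the second operand comes from an increasing distance away.
        for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {
            __syncthreads();   // make sure the loads (or the previous step) are complete
            if (tx % stride == 0)
                partialSum[2 * tx] += partialSum[2 * tx + stride];
        }

        // Thread 0 holds the block's result in partialSum[0].
        if (tx == 0)
            output[blockIdx.x] = partialSum[0];
    }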

Any More Questions?