Parallel Computation Patterns (Scan)

Parallel Computation Patterns (Scan) Slide credit: Slides adapted from © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2016

Parallel Scan (prefix sum) algorithms Frequently used for parallel work assignment and resource allocation A key primitive in many parallel algorithms to convert serial computation into parallel computation A foundational parallel computation pattern Work efficiency in parallel code/algorithms

Inclusive Scan (Prefix-Sum) Definition: The scan operation takes a binary associative operator ⊕ (pronounced "circle plus") and an array of n elements [x0, x1, …, xn-1], and returns the array [x0, (x0 ⊕ x1), …, (x0 ⊕ x1 ⊕ … ⊕ xn-1)]. Example: If ⊕ is addition, then the scan operation on the array [3 1 7 0 4 1 6 3] returns [3 4 11 11 15 16 22 25].

An Inclusive Scan Application Example: Assume we have a 100-inch sandwich to feed 10 people, and we know how much each person wants, in inches: [3 5 2 7 28 4 3 0 8 1]. How do we cut the sandwich quickly, and how much will be left? Method 1: cut the sections sequentially: 3 inches first, 5 inches second, 2 inches third, etc. Method 2: calculate the prefix sum [3, 8, 10, 17, 45, 49, 52, 52, 60, 61] and make all the cuts at those positions at once (61 inches used, 39 inches left).

An Inclusive Sequential Addition Scan: Given a sequence [x0, x1, x2, ...], calculate the output [y0, y1, y2, ...] such that y0 = x0, y1 = x0 + x1, y2 = x0 + x1 + x2, …, using the recursive definition yi = yi-1 + xi.

A Work-Efficient C Implementation
y[0] = x[0];
for (i = 1; i < Max_i; i++) y[i] = y[i-1] + x[i];
Computationally efficient: only N-1 additions for N elements, i.e., O(N) work.

A Naïve Inclusive Parallel Scan: Assign one thread to calculate each y element, and have every thread add up all the x elements needed for its y element: y0 = x0, y1 = x0 + x1, y2 = x0 + x1 + x2, … Is this faster than a sequential algorithm? "Parallel programming is easy as long as you do not care about performance."
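A minimal CUDA sketch of this naïve approach (illustrative only, not from the original slides; the kernel name is made up). One thread per output element, each re-reading its whole prefix on its own:

// Naïve inclusive scan: thread i re-adds X[0..i] by itself, so the total
// work across all threads is O(n^2) additions.
__global__ void naive_inclusive_scan_kernel(const float *X, float *Y, int InputSize) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per output element
  if (i < InputSize) {
    float sum = 0.0f;
    for (int j = 0; j <= i; j++) {                 // walk the prefix independently
      sum += X[j];
    }
    Y[i] = sum;
  }
}

Each output is independent, so no synchronization is needed, but the O(n^2) total work is usually far worse than the sequential O(n) scan unless n is tiny.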

Work-inefficient parallel scan

A Better Parallel Scan Algorithm: (1) Read the input from device global memory into shared memory. (2) Iterate log(n) times, with the stride going from 1 to n-1, doubling the stride each iteration. (3) Write the output from shared memory back to device memory.

Handling Dependencies: During every iteration, each thread can overwrite the input of another thread, so barrier synchronization is needed to ensure all inputs have been properly generated. All threads then secure (read) the input operand that could be overwritten by another thread; a second barrier synchronization ensures that all threads have secured their inputs. Finally, all threads perform the addition and write their output.

A Work-Inefficient Scan Kernel
__global__ void work_inefficient_scan_kernel(float *X, float *Y, int InputSize) {
  __shared__ float XY[SECTION_SIZE];
  int i = blockIdx.x * blockDim.x + threadIdx.x;              // global index
  if (i < InputSize) { XY[threadIdx.x] = X[i]; }              // load into shared memory
  // iterative scan on XY; the guard keeps __syncthreads() non-divergent,
  // since all threads execute every iteration of the loop
  for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {  // double stride every step
    __syncthreads();                                          // all previous values written
    float in1 = 0.0f;
    if (threadIdx.x >= stride)
      in1 = XY[threadIdx.x - stride];                         // secure the input value
    __syncthreads();                                          // all input values read
    if (threadIdx.x >= stride)
      XY[threadIdx.x] += in1;                                 // add and write back
  }
  __syncthreads();
  if (i < InputSize) { Y[i] = XY[threadIdx.x]; }              // write result to global memory
}

Work Efficiency Considerations: This scan executes log(n) parallel iterations. The iterations do (n-1), (n-2), (n-4), …, (n - n/2) adds each, for a total of n·log(n) - (n-1) adds, i.e., O(n·log(n)) work, so this scan algorithm is not work efficient. The sequential scan algorithm does only n-1 adds, and a factor of log(n) can hurt: about 10x for 1024 elements! A parallel algorithm can be slower than a sequential one when execution resources are saturated because of low work efficiency. Note also the two barrier synchronizations in every iteration; more blocks per streaming multiprocessor are needed to ensure high execution throughput.
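A quick check of the factor-of-log(n) claim for n = 1024 (arithmetic added here, not in the original slides): total adds = n·log2(n) - (n-1) = 1024·10 - 1023 = 9,217, versus 1,023 adds for the sequential scan, roughly a 9-10x increase in total work.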

Work-efficient parallel scan

Improving Efficiency with Balanced Trees: For scan, form a balanced binary tree on the input data and sweep it to and from the root. The tree is not an actual data structure, but a concept used to determine what each thread does at each step. Traverse from the leaves toward the root, building partial sums at the internal nodes of the tree; the root holds the sum of all leaves. Then traverse back from the root toward the leaves, building the output from the partial sums.

Parallel Scan - Reduction Phase (figure): the inputs x0 … x7 are summed pairwise, in place, level by level. After the first step the array holds ∑x0..x1, ∑x2..x3, ∑x4..x5, ∑x6..x7; after the second step it also holds ∑x0..x3 and ∑x4..x7; after the third step the last element holds ∑x0..x7 (the value after the reduce). The calculation is done in place.

Reduction Phase Kernel Code
// XY[2*BLOCK_SIZE] is in shared memory
for (unsigned int stride = 1; stride <= BLOCK_SIZE; stride *= 2) {
  int index = (threadIdx.x + 1) * stride * 2 - 1;
  if (index < 2*BLOCK_SIZE)
    XY[index] += XY[index - stride];
  __syncthreads();
}
Index pattern: threadIdx.x+1 = 1, 2, 3, 4, …; for stride = 1, index = 1, 3, 5, 7, …; for stride = 2, index = 3, 7, 11, 15, …

Parallel Scan - Post Reduction Reverse Phase (figures): the partial sums from the reduction phase are combined back down the tree. First, ∑x0..x3 is added to ∑x4..x5 to produce ∑x0..x5, moving (adding) a critical value to a central location where it is needed. Then ∑x0..x1, ∑x0..x3, and ∑x0..x5 are added to x2, x4, and x6 to produce ∑x0..x2, ∑x0..x4, and ∑x0..x6, completing the scan.

Putting it Together (figure): with Block Size = 8 (a 16-element section), the stride runs 1, 2, 4, 8 through the forward (reduction) phase and then 4, 2, 1 through the reverse phase; the check marks in the original figure mark the positions whose final scan values are already complete at the end of the forward phase.

Post Reduction Reverse Phase Kernel Code
for (unsigned int stride = BLOCK_SIZE/2; stride > 0; stride /= 2) {
  __syncthreads();
  int index = (threadIdx.x + 1) * stride * 2 - 1;
  if (index + stride < 2*BLOCK_SIZE) {
    XY[index + stride] += XY[index];
  }
}
__syncthreads();
if (i < InputSize) Y[i] = XY[threadIdx.x];
Index pattern: for stride = 4, threadIdx.x = 0, 1, 2 gives index = 7, 15, 23; for stride = 2, index = 3, 7, 11; for stride = 1, index = 1, 3, 5.

More on parallel scan

Work Analysis of the Work-Efficient Kernel: The work-efficient kernel executes log(n) parallel iterations in the reduction step; the iterations do n/2, n/4, …, 1 adds, for a total of (n-1) adds, i.e., O(n) work. It executes log(n)-1 parallel iterations in the post-reduction reverse step; the iterations do 2-1, 4-1, …, n/2-1 adds, for a total of (n-2) - (log(n)-1) adds, i.e., O(n) work. Both phases together perform no more than 2x(n-1) adds, so the total number of adds is no more than twice that of the efficient sequential algorithm. The benefit of parallelism can easily overcome the 2x work when there is sufficient hardware.
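Plugging in n = 1024 again (arithmetic added here, not in the original slides): the reduction phase does n-1 = 1,023 adds and the reverse phase does (n-2) - (log2(n)-1) = 1,022 - 9 = 1,013 adds, for a total of 2,036 ≤ 2x(n-1) = 2,046, compared with 9,217 adds for the work-inefficient kernel.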

Some Tradeoffs: The work-efficient scan kernel is normally more desirable: better energy efficiency and lower execution resource requirements. However, the work-inefficient kernel can deliver better absolute performance, due to its single-phase nature (forward phase only), when there are sufficient execution resources.

Handling Large Input Vectors: Build on the work-efficient scan kernel. Assign each section of 2*blockDim.x elements to a block and perform a parallel scan on each section. Have each block write the sum of its section into a Sum[] array indexed by blockIdx.x, run the scan kernel on the Sum[] array, and then add the scanned Sum[] values to all elements of the corresponding sections (see the sketch below). The adaptation of the work-inefficient kernel is similar.
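A hedged sketch of that last step, adding the scanned sums back (the kernel name and the assumption that Sum[] holds the inclusive scan of the section sums are mine, not the slides'); it is launched with one thread per element and one block per section:

// Hypothetical kernel: every block k > 0 adds the scanned sum of all preceding
// sections, Sum[k-1], to each element of its own section of Y.
__global__ void add_scanned_sums_kernel(float *Y, const float *Sum, int InputSize) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
  if (blockIdx.x > 0 && i < InputSize) {           // section 0 needs no correction
    Y[i] += Sum[blockIdx.x - 1];
  }
}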

Overall Flow of Complete Scan
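Since the flow diagram itself is not reproduced in this transcript, here is a rough host-side sketch of the three launches (the kernel names, the extra Sum parameter on the scan kernel, and the single-block second launch are all assumptions for illustration, not the original code):

// Assumes a scan kernel variant that also writes each section's total into Sum,
// and that the number of sections fits in one scan section (<= 2*BLOCK_SIZE).
int numSections = (InputSize + 2*BLOCK_SIZE - 1) / (2*BLOCK_SIZE);
scan_kernel<<<numSections, BLOCK_SIZE>>>(d_X, d_Y, d_Sum, InputSize);           // 1. scan each section
scan_kernel<<<1, BLOCK_SIZE>>>(d_Sum, d_Sum, NULL, numSections);                // 2. scan the section sums
add_scanned_sums_kernel<<<numSections, 2*BLOCK_SIZE>>>(d_Y, d_Sum, InputSize);  // 3. add scanned sums back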

Exclusive Scan Definition: The exclusive scan operation takes a binary associative operator ⊕ and an array of n elements [x0, x1, …, xn-1], and returns the array [0, x0, (x0 ⊕ x1), …, (x0 ⊕ x1 ⊕ … ⊕ xn-2)]. Example: If ⊕ is addition, then the exclusive scan operation on the array [3 1 7 0 4 1 6 3] would return [0 3 4 11 11 15 16 22].

Why Use Exclusive Scan? To find the beginning address of allocated buffers. Inclusive and exclusive scans can easily be derived from each other; it is a matter of convenience.
Input: [3 1 7 0 4 1 6 3]
Exclusive: [0 3 4 11 11 15 16 22]
Inclusive: [3 4 11 11 15 16 22 25]
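As a minimal illustration of how one form is derived from the other (a host-side sketch assuming the operator is addition, so the identity is 0; the function name is made up):

// Convert an inclusive scan result into an exclusive one by shifting right
// and inserting the identity element (0 for addition) at the front.
void inclusive_to_exclusive(const float *inc, float *exc, int n) {
  if (n <= 0) return;
  exc[0] = 0.0f;
  for (int i = 1; i < n; i++) {
    exc[i] = inc[i - 1];
  }
}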

A Simple Exclusive Scan Kernel: Adapt the inclusive, work-inefficient scan kernel. In block 0, thread 0 loads 0 into XY[0] and the other threads load X[threadIdx.x-1] into XY[threadIdx.x]. In all other blocks, all threads load X[blockIdx.x*blockDim.x + threadIdx.x - 1] into XY[threadIdx.x] (see the sketch below). A similar adaptation works for the work-efficient scan kernel, but make sure that each thread loads two elements, that only one zero is loaded, and that all elements are shifted to the right by only one position.
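A sketch of just the modified load for the work-inefficient kernel (the scan loop itself is unchanged); it follows the description above but should be read as an illustration, not the original code:

// Shifted load for an exclusive scan: global element 0 receives the identity
// (0); every other position receives the X element one place to its left.
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i == 0) {
  XY[threadIdx.x] = 0.0f;              // only one zero is loaded (block 0, thread 0)
} else if (i < InputSize) {
  XY[threadIdx.x] = X[i - 1];          // covers block 0 and all other blocks alike
}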