
CS/EE 217 GPU Architecture and Parallel Programming, Lecture 12: Parallel Computation Patterns – Parallel Prefix Sum (Scan), Part 2. © David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois.

Recall: A Slightly Better Parallel Inclusive Scan Algorithm
1. Read input from device memory to shared memory. Each thread reads one value from the input array in device memory into shared memory array T0; thread 0 writes 0 into the shared memory array.

A Slightly Better Parallel Scan Algorithm
1. (previous slide)
2. Iterate log(n) times. Threads stride to n: add pairs of elements stride elements apart, and double the stride at each iteration (note: the shared memory arrays must be double buffered). Active threads: stride to n-1 (n - stride threads). Thread j adds elements j and j-stride from T0 and writes the result into shared memory buffer T1 (ping-pong).
[Figure: iteration #1, stride = 1, buffers T0 and T1.]

A Slightly Better Parallel Scan Algorithm (cont.)
1. Read input from device memory to shared memory.
2. Iterate log(n) times. Threads stride to n: add pairs of elements stride elements apart, and double the stride at each iteration (note: the shared memory arrays must be double buffered).
[Figure: iteration #2, stride = 2.]

A Slightly Better Parallel Scan Algorithm (cont.)
1. Read input from device memory to shared memory. Set the first element to zero and shift the others right by one.
2. Iterate log(n) times. Threads stride to n: add pairs of elements stride elements apart, and double the stride at each iteration (note: the shared memory arrays must be double buffered).
3. Write output from shared memory to device memory.
[Figure: iteration #3, stride = 4, showing strides 1, 2, and 4.]
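For concreteness, here is a minimal single-block CUDA sketch of this work-inefficient scan; the kernel name naiveScanKernel, the SECTION_SIZE constant, and the exact buffer handling are illustrative assumptions rather than code from the slides.

#define SECTION_SIZE 1024  // assumed maximum section length (one block)

__global__ void naiveScanKernel(const float *X, float *Y, int n) {
    // Double-buffered shared memory arrays (ping-pong)
    __shared__ float T0[SECTION_SIZE];
    __shared__ float T1[SECTION_SIZE];
    float *src = T0, *dst = T1;

    int i = threadIdx.x;
    // 1. Load, shifted right by one, with 0 in the first slot
    if (i < n)
        src[i] = (i == 0) ? 0.0f : X[i - 1];
    __syncthreads();

    // 2. log(n) iterations, doubling the stride each time
    for (int stride = 1; stride < n; stride *= 2) {
        if (i < n)
            dst[i] = (i >= stride) ? src[i] + src[i - stride] : src[i];
        __syncthreads();
        float *tmp = src; src = dst; dst = tmp;  // swap ping-pong buffers
    }

    // 3. Write the scan result back to device memory
    if (i < n)
        Y[i] = src[i];
}

Launched as naiveScanKernel<<<1, n>>>(d_X, d_Y, n) for n up to SECTION_SIZE; because of the shifted load in step 1, the output is the exclusive scan of X (drop the shift to get the inclusive scan).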

Work Efficiency Considerations
– The first-attempt scan executes log(n) parallel iterations. The steps do (n-1), (n-2), (n-4), ..., (n - n/2) adds each, for total adds of n*log(n) - (n-1), i.e., O(n*log(n)) work.
– This scan algorithm is not very work efficient: the sequential scan algorithm does n adds, so the factor of log(n) hurts (about 20x for 10^6 elements!).
– A parallel algorithm can be slower than a sequential one when execution resources are saturated, due to low work efficiency.
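To see where the n*log(n) - (n-1) total comes from, sum the per-iteration add counts over the log(n) iterations, assuming n is a power of two:

\[
\sum_{d=1}^{\log_2 n}\left(n - 2^{d-1}\right)
  = n\log_2 n - \sum_{d=1}^{\log_2 n} 2^{d-1}
  = n\log_2 n - (n-1)
\]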

Improving Efficiency
A common parallel algorithm pattern: balanced trees.
– Build a balanced binary tree on the input data and sweep it to and from the root.
– The tree is not an actual data structure, but a concept used to determine what each thread does at each step.
For scan:
– Traverse from the leaves to the root, building partial sums at internal nodes of the tree; the root holds the sum of all leaves.
– Traverse back down the tree, building the scan output from the partial sums.

Parallel Scan: Reduction Step
[Figure: inputs x0..x7; pairwise sums ∑x0..x1, ∑x2..x3, ∑x4..x5, ∑x6..x7 are built in the first step, then ∑x0..x3 and ∑x4..x7, and finally ∑x0..x7. Time runs downward; the calculation is in place, and the final value after the reduction is the sum of all elements.]

Inclusive Post Scan Step
[Figure: ∑x0..x3 is added to ∑x4..x5 to produce ∑x0..x5.]
Move (add) a critical value to a central location where it is needed.

Inclusive Post Scan Step (cont.)
[Figure: the remaining additions produce ∑x0..x2, ∑x0..x4, and ∑x0..x6, completing the inclusive scan.]

Putting It Together
[Figure: the reduction step and post scan step combined into one diagram.]

Reduction Step Kernel Code

// XY[2*BLOCK_SIZE] is in shared memory
for (int stride = 1; stride <= BLOCK_SIZE; stride = stride*2) {
    __syncthreads();  // make the previous iteration's writes visible
    int index = (threadIdx.x + 1)*stride*2 - 1;
    if (index < 2*BLOCK_SIZE)
        XY[index] += XY[index - stride];
}

// threadIdx.x+1 = 1, 2, 3, 4, ...; at stride = 1, index = 1, 3, 5, 7, ...


Post Scan Step

for (int stride = BLOCK_SIZE/2; stride > 0; stride /= 2) {
    __syncthreads();
    int index = (threadIdx.x + 1)*stride*2 - 1;
    if (index + stride < 2*BLOCK_SIZE) {
        XY[index + stride] += XY[index];
    }
}
__syncthreads();
int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i < InputSize)
    Y[i] = XY[threadIdx.x];
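Assembled into one kernel, the load, reduction, and post scan phases look roughly as follows. This is a sketch under assumed names (workEfficientScanKernel, BLOCK_SIZE) rather than the slides' exact code; it writes two outputs per thread, since each thread loads two elements.

#define BLOCK_SIZE 512  // assumed: each block scans a 2*BLOCK_SIZE-element section

__global__ void workEfficientScanKernel(const float *X, float *Y, int InputSize) {
    __shared__ float XY[2*BLOCK_SIZE];

    // Load two elements per thread (0-padded past the end of the input)
    int i = 2*blockIdx.x*blockDim.x + threadIdx.x;
    XY[threadIdx.x] = (i < InputSize) ? X[i] : 0.0f;
    XY[threadIdx.x + blockDim.x] =
        (i + blockDim.x < InputSize) ? X[i + blockDim.x] : 0.0f;

    // Reduction step: build partial sums in place
    for (int stride = 1; stride <= BLOCK_SIZE; stride *= 2) {
        __syncthreads();
        int index = (threadIdx.x + 1)*stride*2 - 1;
        if (index < 2*BLOCK_SIZE)
            XY[index] += XY[index - stride];
    }

    // Post scan step: distribute the partial sums
    for (int stride = BLOCK_SIZE/2; stride > 0; stride /= 2) {
        __syncthreads();
        int index = (threadIdx.x + 1)*stride*2 - 1;
        if (index + stride < 2*BLOCK_SIZE)
            XY[index + stride] += XY[index];
    }
    __syncthreads();

    // Write both results back to device memory (inclusive scan of the section)
    if (i < InputSize) Y[i] = XY[threadIdx.x];
    if (i + blockDim.x < InputSize)
        Y[i + blockDim.x] = XY[threadIdx.x + blockDim.x];
}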

Work Analysis
– The work-efficient kernel executes log(n) iterations in the reduction step; identical to a reduction, O(n) operations.
– It executes log(n) - 1 iterations in the post-reduction reverse step, with 2-1, 4-1, 8-1, ..., n/2 - 1 operations each. Total: (n-1) - log(n), or O(n), work.
– Both phases together perform no more than 2*(n-1) adds.
– Is this OK? What needs to happen for the parallel implementation to be better than sequential?

Some Tradeoffs
– The work-efficient kernel is normally superior: better energy efficiency (why?) and lower execution resource requirements.
– However, the work-inefficient kernel can be better under some special circumstances. What needs to happen for that? Small lists, where there are sufficient execution resources.

(Exclusive) Prefix-Sum (Scan) Definition
Definition: the all-prefix-sums operation takes a binary associative operator ⊕ and an array of n elements [x0, x1, ..., xn-1], and returns the array [0, x0, (x0 ⊕ x1), ..., (x0 ⊕ x1 ⊕ ... ⊕ xn-2)].
Example: if ⊕ is addition, then the all-prefix-sums operation on the array [3 1 7 0 4 1 6 3] would return [0 3 4 11 11 15 16 22].

Why Exclusive Scan
– To find the beginning address of allocated buffers.
– Inclusive and exclusive scans can easily be derived from each other; it is a matter of convenience.
Input: [3 1 7 0 4 1 6 3]
Exclusive: [0 3 4 11 11 15 16 22]
Inclusive: [3 4 11 11 15 16 22 25]
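As a small illustration of the buffer-allocation use, a sequential exclusive scan over buffer sizes yields each buffer's starting offset. The sketch below is a plain host-side restatement of the definition, with hypothetical names sizes and offsets:

#include <stdio.h>

// Sequential exclusive scan: offsets[k] = sizes[0] + ... + sizes[k-1]
void exclusiveScan(const int *sizes, int *offsets, int n) {
    int running = 0;
    for (int k = 0; k < n; ++k) {
        offsets[k] = running;  // beginning address of buffer k
        running += sizes[k];
    }
}

int main(void) {
    int sizes[8] = {3, 1, 7, 0, 4, 1, 6, 3};
    int offsets[8];
    exclusiveScan(sizes, offsets, 8);
    for (int k = 0; k < 8; ++k)
        printf("buffer %d starts at offset %d\n", k, offsets[k]);
    return 0;  // offsets printed: 0 3 4 11 11 15 16 22
}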

Simple Exclusive Scan Kernel
Adapt the work-inefficient scan kernel; only the load phase changes (a sketch follows below).
– Block 0: thread 0 loads 0 into XY[0]; other threads load X[threadIdx.x-1] into XY[threadIdx.x].
– All other blocks: load X[blockIdx.x*blockDim.x + threadIdx.x - 1] into XY[threadIdx.x].
A similar adaptation works for the work-efficient kernel, except that each thread loads two values and only one zero should be loaded.
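A minimal sketch of just this modified load phase, using the slides' names (XY in shared memory, X in device memory); the bounds handling is an assumption:

// Exclusive-scan load: shift the input right by one, inserting the
// identity (0) at the very start of the whole array.
int i = blockIdx.x*blockDim.x + threadIdx.x;
if (threadIdx.x == 0 && blockIdx.x == 0)
    XY[threadIdx.x] = 0.0f;        // thread 0 of block 0 loads the 0
else
    XY[threadIdx.x] = X[i - 1];    // everyone else loads its left neighbor
__syncthreads();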

Alternative (Read the Harris Article)
– Uses add-move operation pairs.
– Similar in complexity to the work-efficient algorithm.
– We'll overview it quickly.

An Exclusive Post Scan Step (Add-Move Operation)
[Figure: after the reduction, the last element is replaced by 0; each add-move pair writes the old right value into the left slot (move) and the sum of both values into the right slot (add).]

Exclusive Post Scan Step (cont.)
[Figure: after all add-move steps, the array holds the exclusive scan: 0, x0, ∑x0..x1, ∑x0..x2, ∑x0..x3, ∑x0..x4, ∑x0..x5, ∑x0..x6.]

Exclusive Scan Example: Reduction Step
[Figure: initial array T in shared memory.]
Assume the array is already in shared memory.

Reduction Step (cont.)
[Figure: iteration 1, stride 1, n/2 threads; array T before and after.]
Iterate log(n) times. Each thread adds the value stride elements away to its own value. Each addition corresponds to a single thread.

Reduction Step (cont.)
[Figure: iteration 2, stride 2, n/4 threads.]
Iterate log(n) times. Each thread adds the value stride elements away to its own value. Each addition corresponds to a single thread.

Reduction Step (cont.)
[Figure: iteration log(n), 1 thread; strides 1, 2, 4.]
Iterate log(n) times. Each thread adds the value stride elements away to its own value. Note that this algorithm operates in place: there is no need for double buffering.

Zero the Last Element
[Figure: the array of partial sums with its last element set to 0.]
We now have an array of partial sums. Since this is an exclusive scan, set the last element to zero; it will propagate back to the first element.

Post Scan Step from Partial Sums
[Figure: starting state of the post scan sweep.]

Post Scan Step from Partial Sums (cont.)
[Figure: iteration 1, 1 thread, stride 4.]
Iterate log(n) times. Each thread adds the value stride elements away to its own value, and sets the value stride elements away to its own previous value. Each add-move pair corresponds to a single thread.

Post Scan Step from Partial Sums (cont.)
[Figure: iteration 2, 2 threads, strides 4 and 2.]
Iterate log(n) times. Each thread adds the value stride elements away to its own value, and sets the value stride elements away to its own previous value. Each add-move pair corresponds to a single thread.

Post Scan Step from Partial Sums (cont.)
[Figure: iteration log(n), n/2 threads; strides 4, 2, 1.]
Done! We now have a completed scan that we can write out to device memory.
Total steps: 2*log(n). Total work: 2*(n-1) adds = O(n). Work efficient!
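A hedged CUDA sketch of this reduce-then-post-scan exclusive scan for a single block, in the style of Harris's GPU Gems 3 formulation; the kernel name and launch configuration are illustrative assumptions.

__global__ void exclusiveScanKernel(const float *X, float *Y, int n) {
    extern __shared__ float T[];  // n floats; n assumed a power of two
    int tid = threadIdx.x;        // launched with n/2 threads

    // Load two elements per thread into shared memory
    T[2*tid]     = X[2*tid];
    T[2*tid + 1] = X[2*tid + 1];

    // Reduction step: build partial sums in place
    int offset = 1;
    for (int d = n >> 1; d > 0; d >>= 1) {
        __syncthreads();
        if (tid < d) {
            int ai = offset*(2*tid + 1) - 1;
            int bi = offset*(2*tid + 2) - 1;
            T[bi] += T[ai];
        }
        offset <<= 1;
    }

    // Zero the last element (the exclusive-scan identity)
    if (tid == 0) T[n - 1] = 0.0f;

    // Post scan step: add-move operation pairs
    for (int d = 1; d < n; d <<= 1) {
        offset >>= 1;
        __syncthreads();
        if (tid < d) {
            int ai = offset*(2*tid + 1) - 1;
            int bi = offset*(2*tid + 2) - 1;
            float t = T[ai];
            T[ai]  = T[bi];   // move
            T[bi] += t;       // add
        }
    }
    __syncthreads();

    // Write the exclusive scan back to device memory
    Y[2*tid]     = T[2*tid];
    Y[2*tid + 1] = T[2*tid + 1];
}

A launch of the form exclusiveScanKernel<<<1, n/2, n*sizeof(float)>>>(d_X, d_Y, n) would scan one n-element section.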

Work Analysis
– The parallel scan executes 2*log(n) parallel iterations: log(n) in the reduction step and log(n) in the post scan step.
– The iterations do n/2, n/4, ..., 1, 1, ..., n/4, n/2 adds.
– Total adds: 2*(n-1), i.e., O(n) work.
– The total number of adds is no more than twice that of the efficient sequential algorithm; the benefit of parallelism can easily overcome the 2x work when there is sufficient hardware.

Working on Arbitrary-Length Input
– Build on the scan kernel that handles up to 2*blockDim.x elements.
– Assign each section of 2*blockDim.x elements to a block.
– Have each block write the sum of its section into a Sum array indexed by blockIdx.x.
– Run a parallel scan on the Sum array; the Sum array may need to be broken into multiple sections if it is too big for one block.
– Add the scanned Sum array values to the elements of the corresponding sections (see the host-side sketch below).
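A hedged host-side sketch of this three-phase hierarchical scan. The kernel names scanSections, scanSums, and addSums are hypothetical placeholders for the kernels the list describes, and a one-level Sum hierarchy is assumed:

// Hypothetical kernels matching the steps above (assumed, not from the slides):
//   scanSections: scans each 2*BLOCK_SIZE section, writes its total to Sums
//   scanSums:     scans the Sums array within one block
//   addSums:      adds the scanned Sums[blockIdx.x] to every element of its section
void hierarchicalScan(const float *d_X, float *d_Y, int InputSize) {
    int sectionSize = 2*BLOCK_SIZE;
    int numSections = (InputSize + sectionSize - 1) / sectionSize;

    float *d_Sums;
    cudaMalloc(&d_Sums, numSections*sizeof(float));

    // Phase 1: per-section scans, plus one section total per block
    scanSections<<<numSections, BLOCK_SIZE>>>(d_X, d_Y, d_Sums, InputSize);

    // Phase 2: scan the section totals (assumes numSections <= sectionSize;
    // otherwise recurse on d_Sums)
    scanSums<<<1, BLOCK_SIZE>>>(d_Sums, numSections);

    // Phase 3: add each section's scanned total to all of its elements
    addSums<<<numSections, 2*BLOCK_SIZE>>>(d_Y, d_Sums, InputSize);

    cudaFree(d_Sums);
}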

Overall Flow of Complete Scan
[Figure: the three-phase flow: per-section scan kernels, a scan of the Sum array, then the final per-section addition kernel.]

Any more questions? Read Chapter 9.
© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois.