CS/EE 217 GPU Architecture and Parallel Programming
Lecture 11: Parallel Computation Patterns – Parallel Prefix Sum (Scan)
© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois

Objective
To master parallel Prefix Sum (Scan) algorithms
– Frequently used for parallel work assignment and resource allocation
– A key primitive in many parallel algorithms for converting serial computation into parallel computation
– Based on a reduction tree and a reverse reduction tree
Reading
– Efficient Parallel Scan Algorithms for GPUs (pdf)

(Inclusive) Prefix-Sum (Scan) Definition
Definition: The all-prefix-sums operation takes a binary associative operator ⊕ and an array of n elements [x_0, x_1, …, x_{n-1}], and returns the array [x_0, (x_0 ⊕ x_1), …, (x_0 ⊕ x_1 ⊕ … ⊕ x_{n-1})].
Example: If ⊕ is addition, then the all-prefix-sums operation on the array [3, 1, 7, 0, 4, 1, 6, 3] would return [3, 4, 11, 11, 15, 16, 22, 25].

An Inclusive Scan Application Example
Assume that we have a 100-inch sausage to feed 10 people.
We know how much each person wants, in inches:
– [3, 5, 2, 7, 28, 4, 3, 0, 8, 1]
How do we cut the sausage quickly? How much will be left?
Method 1: cut the sections sequentially: 3 inches first, 5 inches second, 2 inches third, etc.
Method 2: calculate the prefix scan of the requests:
– [3, 8, 10, 17, 45, 49, 52, 52, 60, 61] (39 inches left)
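As a quick check of Method 2, here is a minimal host-side C sketch (my own, not from the slides) that computes the cut positions with an inclusive scan and prints the leftover length for the numbers above:

#include <stdio.h>

int main(void) {
    int wants[10] = {3, 5, 2, 7, 28, 4, 3, 0, 8, 1};   /* inches requested per person */
    int cuts[10];                                       /* cuts[i] = inclusive prefix sum */
    cuts[0] = wants[0];
    for (int i = 1; i < 10; i++)
        cuts[i] = cuts[i-1] + wants[i];
    for (int i = 0; i < 10; i++)
        printf("cut %d at %d inches\n", i + 1, cuts[i]);
    printf("left over: %d inches\n", 100 - cuts[9]);    /* prints 39 */
    return 0;
}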

Typical Applications of Scan
Scan is a simple and useful parallel building block:
– Convert recurrences from sequential:
    for (j = 1; j < n; j++) out[j] = out[j-1] + f(j);
– into parallel:
    forall (j) { temp[j] = f(j); }
    scan(out, temp);
Useful for many parallel algorithms: radix sort, quicksort, string comparison, lexical analysis, stream compaction (sketched below), polynomial evaluation, solving recurrences, tree operations, histograms, etc.
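To make one of these uses concrete, the following is a minimal sequential C sketch of stream compaction (my own, not from the slides): a predicate flags the elements to keep, and an exclusive scan of the flags gives each kept element its output position. The input values are only illustrative.

#include <stdio.h>
#define N 8

int main(void) {
    int in[N]  = {3, 0, 7, 0, 4, 0, 6, 3};
    int flag[N], pos[N], out[N];
    int count = 0;

    for (int j = 0; j < N; j++)
        flag[j] = (in[j] != 0);              /* predicate: keep nonzero elements */

    pos[0] = 0;                              /* exclusive scan of the flags */
    for (int j = 1; j < N; j++)
        pos[j] = pos[j-1] + flag[j-1];

    for (int j = 0; j < N; j++)
        if (flag[j]) {
            out[pos[j]] = in[j];             /* scatter each kept element to its slot */
            count = pos[j] + 1;
        }

    for (int j = 0; j < count; j++)
        printf("%d ", out[j]);               /* prints: 3 7 4 6 3 */
    printf("\n");
    return 0;
}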

Other Applications
– Assigning camp slots
– Assigning farmers market space
– Allocating memory to parallel threads
– Allocating memory buffers for communication channels
– …

An Inclusive Sequential Scan
Given a sequence [x_0, x_1, x_2, …], calculate the output [y_0, y_1, y_2, …] such that
    y_0 = x_0
    y_1 = x_0 + x_1
    y_2 = x_0 + x_1 + x_2
    …
Using a recursive definition: y_i = y_{i-1} + x_i

A Work-Efficient C Implementation
    y[0] = x[0];
    for (i = 1; i < Max_i; i++)
        y[i] = y[i-1] + x[i];
Computationally efficient: N additions needed for N elements – O(N)!

How Do We Do This in Parallel?
What is the relationship between parallel scan and reduction?
– Multiple reduction operations!
– What if we implement it that way? How many operations?
– Each reduction tree is O(n) operations
– Reduction trees of size n, n-1, n-2, …, 1
– Very work inefficient! Important concept
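A rough operation count (my arithmetic, not on the slide) shows just how inefficient: computing output element y_i as its own reduction over x_0..x_i costs i additions, so the total over all outputs is
    1 + 2 + … + (n-1) = n(n-1)/2 = O(n²) additions,
versus n-1 additions for the sequential scan.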

A Slightly Better Parallel Inclusive Scan Algorithm
1. Read input from device memory to shared memory.
   Each thread reads one value from the input array in device memory into shared memory array T0.
   Thread 0 writes 0 into the shared memory array.

A Slightly Better Parallel Scan Algorithm
1. (previous slide)
2. Iterate log(n) times: threads stride to n: add pairs of elements stride elements apart. Double stride at each iteration. (Note: the shared memory arrays must be double buffered.)
   – Active threads: stride to n-1 (n - stride threads)
   – Thread j adds elements j and j-stride from T0 and writes the result into shared memory buffer T1 (ping-pong)
Iteration #1, Stride = 1

A Slightly Better Parallel Scan Algorithm
1. Read input from device memory to shared memory.
2. Iterate log(n) times: threads stride to n: add pairs of elements stride elements apart. Double stride at each iteration. (Note: the shared memory arrays must be double buffered.)
Iteration #2, Stride = 2

A Slightly Better Parallel Scan Algorithm
1. Read input from device memory to shared memory. Set the first element to zero and shift the others right by one.
2. Iterate log(n) times: threads stride to n: add pairs of elements stride elements apart. Double stride at each iteration. (Note: the shared memory arrays must be double buffered.)
3. Write output from shared memory to device memory.
Iteration #3, Stride = 4
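The array snapshots from these three slides did not survive, so here is a small host-side C sketch (my own, not from the slides) that traces the same iterations on eight illustrative values; it traces the inclusive variant, without the initial shift, and swaps two buffers each iteration exactly as step 2 describes:

#include <stdio.h>
#define N 8

int main(void) {
    int a[N] = {3, 1, 7, 0, 4, 1, 6, 3};                /* illustrative input values */
    int b[N];
    int *in = a, *out = b;
    for (int stride = 1; stride < N; stride *= 2) {
        for (int j = 0; j < N; j++)                     /* "forall threads j" */
            out[j] = (j >= stride) ? in[j] + in[j - stride] : in[j];
        int *tmp = in; in = out; out = tmp;             /* ping-pong the buffers */
        printf("stride %d:", stride);
        for (int j = 0; j < N; j++) printf(" %2d", in[j]);
        printf("\n");
    }
    /* prints: stride 1:  3  4  8  7  4  5  7  9
               stride 2:  3  4 11 11 12 12 11 14
               stride 4:  3  4 11 11 15 16 22 25 */
    return 0;
}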

Handling Dependencies
During every iteration, each thread can overwrite the input of another thread:
– Need barrier synchronization to ensure all inputs have been properly generated
– All threads secure an input operand that could be overwritten by another thread
– Barrier synchronization to ensure that all threads have secured their inputs
– All threads perform the addition and write their output

A Work-Inefficient Scan Kernel
// note: the kernel signature, output array Y, and the final write-back are assumed;
// the slide shows only the shared-memory declaration, the load, and the scan loop
__global__ void work_inefficient_scan_kernel(float *X, float *Y, int InputSize) {
    __shared__ float XY[SECTION_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // load into shared memory
    if (i < InputSize) { XY[threadIdx.x] = X[i]; }
    // perform iterative scan on XY
    for (unsigned int stride = 1; stride <= threadIdx.x; stride *= 2) {
        __syncthreads();                           // wait for all writes from the previous iteration
        float in1 = XY[threadIdx.x - stride];      // secure the input operand
        __syncthreads();                           // make sure all reads happen before any overwrite
        XY[threadIdx.x] += in1;
    }
    // write each thread's result back to device memory
    __syncthreads();
    if (i < InputSize) { Y[i] = XY[threadIdx.x]; }
}
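For context, a minimal host-side launch sketch follows (my own assumption, not from the slides); it assumes SECTION_SIZE equals the thread-block size and that the whole input fits in one section:

#define SECTION_SIZE 1024                    /* assumed: elements (and threads) per section */

/* d_X and d_Y are device arrays already allocated, with d_X already filled */
void scan_one_section(float *d_X, float *d_Y, int InputSize) {
    work_inefficient_scan_kernel<<<1, SECTION_SIZE>>>(d_X, d_Y, InputSize);
    cudaDeviceSynchronize();                 /* wait for the kernel to finish */
}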

Work Efficiency Considerations
The first-attempt scan executes log(n) parallel iterations:
– The steps do (n-1), (n-2), (n-4), …, (n - n/2) adds each
– Total adds: n·log(n) - (n-1) → O(n·log(n)) work
This scan algorithm is not very work efficient:
– The sequential scan algorithm does n adds
– A factor of log(n) hurts: 20x for 10^6 elements!
A parallel algorithm can be slow when execution resources are saturated due to low work efficiency.
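A quick check of the 20x figure (my arithmetic, not on the slide):
    n = 10^6  →  log2(n) ≈ 20,  so  n·log(n) ≈ 2×10^7 parallel adds  vs  ≈ 10^6 sequential adds  ≈  20x the work.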

Improving Efficiency
A common parallel algorithm pattern: balanced trees
– Build a balanced binary tree on the input data and sweep it to and from the root
– The tree is not an actual data structure, but a concept used to determine what each thread does at each step
For scan:
– Traverse down from the leaves to the root, building partial sums at internal nodes in the tree
  – The root holds the sum of all leaves
– Traverse back up the tree, building the scan from the partial sums

Parallel Scan – Reduction Step
[Figure: in-place reduction tree over x_0 … x_7. Stride 1 produces ∑x_0..x_1, ∑x_2..x_3, ∑x_4..x_5, ∑x_6..x_7; stride 2 produces ∑x_0..x_3 and ∑x_4..x_7; stride 4 produces ∑x_0..x_7. The calculation is done in place, and the final value after the reduce is ∑x_0..x_7.]

Reduction Step Kernel Code
// scan_array[BLOCK_SIZE*2] is in shared memory
for (int stride = 1; stride <= BLOCK_SIZE; stride *= 2) {
    int index = (threadIdx.x + 1) * stride * 2 - 1;
    if (index < 2 * BLOCK_SIZE)
        scan_array[index] += scan_array[index - stride];
    __syncthreads();
}
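To see what the index arithmetic does, here is a trace (my own, not from the slides) for BLOCK_SIZE = 4, i.e. an 8-element scan_array a[]:

// stride = 1: index = 1, 3, 5, 7  →  a[1] += a[0], a[3] += a[2], a[5] += a[4], a[7] += a[6]
// stride = 2: index = 3, 7        →  a[3] += a[1], a[7] += a[5]
// stride = 4: index = 7           →  a[7] += a[3]
// afterwards: a = [x_0, ∑x_0..x_1, x_2, ∑x_0..x_3, x_4, ∑x_4..x_5, x_6, ∑x_0..x_7]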

Inclusive Post Scan Step
Move (add) a critical value to a central location where it is needed.
[Figure: after the reduction step, ∑x_0..x_3 is added to ∑x_4..x_5 to produce ∑x_0..x_5.]

Inclusive Post Scan Step
[Figure: the remaining post-scan additions fill in the missing prefixes: ∑x_0..x_1 + x_2 → ∑x_0..x_2, ∑x_0..x_3 + x_4 → ∑x_0..x_4, and ∑x_0..x_5 + x_6 → ∑x_0..x_6.]
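The post-scan loop itself is not reproduced on these slides; the following sketch (my own) mirrors the reduction-step kernel code above and assumes the same scan_array and BLOCK_SIZE:

// Post scan step: push the partial sums back down to the remaining positions
for (int stride = BLOCK_SIZE / 2; stride > 0; stride /= 2) {
    __syncthreads();                                  // partial sums from the previous level must be ready
    int index = (threadIdx.x + 1) * stride * 2 - 1;
    if (index + stride < 2 * BLOCK_SIZE)
        scan_array[index + stride] += scan_array[index];
}
__syncthreads();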

Putting It Together
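The combined picture is not reproduced here, so the following is a minimal sketch (my own, with assumptions noted in the comments) of how the load, reduction step, post-scan step, and write-back fit into one block-level inclusive scan kernel; the names X, Y, and InputSize follow the earlier work-inefficient kernel:

#define BLOCK_SIZE 512                      /* assumed: threads per block; each block scans 2*BLOCK_SIZE elements */

__global__ void work_efficient_scan_kernel(float *X, float *Y, int InputSize) {
    __shared__ float scan_array[2 * BLOCK_SIZE];
    int i = 2 * blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread loads two elements; pad with 0 past the end of the input
    scan_array[threadIdx.x] = (i < InputSize) ? X[i] : 0.0f;
    scan_array[threadIdx.x + blockDim.x] =
        (i + blockDim.x < InputSize) ? X[i + blockDim.x] : 0.0f;

    // Reduction step (as on the earlier slide)
    for (int stride = 1; stride <= BLOCK_SIZE; stride *= 2) {
        __syncthreads();
        int index = (threadIdx.x + 1) * stride * 2 - 1;
        if (index < 2 * BLOCK_SIZE)
            scan_array[index] += scan_array[index - stride];
    }

    // Post scan step (as sketched above)
    for (int stride = BLOCK_SIZE / 2; stride > 0; stride /= 2) {
        __syncthreads();
        int index = (threadIdx.x + 1) * stride * 2 - 1;
        if (index + stride < 2 * BLOCK_SIZE)
            scan_array[index + stride] += scan_array[index];
    }

    // Write the inclusive scan of this section back to device memory
    __syncthreads();
    if (i < InputSize) Y[i] = scan_array[threadIdx.x];
    if (i + blockDim.x < InputSize) Y[i + blockDim.x] = scan_array[threadIdx.x + blockDim.x];
}

Launched with BLOCK_SIZE threads per block, each block produces the scan of its own 2*BLOCK_SIZE-element section; combining sections into a full-array scan requires an additional hierarchical step.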

ANY MORE QUESTIONS?
© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois