Mattan Erez The University of Texas at Austin

Slides:

Advertisements

Similar presentations

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 ECE408 / CS483 Applied Parallel Programming.

Advertisements

List Ranking and Parallel Prefix

Lecture 3: Parallel Algorithm Design

Parallel Programming – OpenMP, Scan, Work Complexity, and Step Complexity David Monismith CS599 Based upon notes from GPU Gems 3, Chapter

Lecture 8 – Collective Pattern Collectives Pattern Parallel Computing CIS 410/510 Department of Computer and Information Science.

Advanced Topics in Algorithms and Data Structures Lecture pg 1 Recursion.

Parallel Prefix Computation Advanced Algorithms & Data Structures Lecture Theme 14 Prof. Dr. Th. Ottmann Summer Semester 2006.

CSE621/JKim Lec4.1 9/20/99 CSE621 Parallel Algorithms Lecture 4 Matrix Operation September 20, 1999.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 Structuring Parallel Algorithms.

Parallel Prefix Sum (Scan) GPU Graphics Gary J. Katz University of Pennsylvania CIS 665 Adapted from articles taken from GPU Gems III.

Advanced Topics in Algorithms and Data Structures Page 1 An overview of lecture 3 A simple parallel algorithm for computing parallel prefix. A parallel.

ME964 High Performance Computing for Engineering Applications “There are two ways of constructing a software design: one way is to make it so simple that.

© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, ECE408 Applied Parallel Programming Lecture 11 Parallel.

Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS Spring 2012.

Introduction to CUDA 1 of 2 Patrick Cozzi University of Pennsylvania CIS Fall 2012.

Introduction to CUDA Programming Scans Andreas Moshovos Winter 2009 Based on slides from: Wen Mei Hwu (UIUC) and David Kirk (NVIDIA) White Paper/Slides.

L8: Writing Correct Programs, cont. and Control Flow L8: Control Flow CS6235.

© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, ECE408 Applied Parallel Programming Lecture 12 Parallel.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lecture 12: Application Lessons When the tires.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 Control Flow/ Thread Execution.

Parallel Algorithms Patrick Cozzi University of Pennsylvania CIS Spring 2012.

© David Kirk/NVIDIA, Wen-mei W. Hwu, and John Stratton, ECE 498AL, University of Illinois, Urbana-Champaign 1 CUDA Lecture 7: Reductions and.

CS 193G Lecture 5: Parallel Patterns I. Getting out of the trenches So far, we’ve concerned ourselves with low-level details of kernel programming Mapping.

© David Kirk/NVIDIA and Wen-mei W. Hwu Urbana, Illinois, August 10-14, VSCSE Summer School 2009 Many-core Processors for Science and Engineering.

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, CS/EE 217 GPU Architecture and Parallel Programming Lecture 11 Parallel Computation.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lecture 13: Application Lessons When the tires.

© David Kirk/NVIDIA and Wen-mei W. Hwu University of Illinois, CS/EE 217 GPU Architecture and Parallel Programming Lecture 10 Reduction Trees.

CSCI-455/552 Introduction to High Performance Computing Lecture 23.

Introduction to CUDA 1 of 2 Patrick Cozzi University of Pennsylvania CIS Fall 2014.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Spring 2010 Programming Massively Parallel.

GPGPU: Parallel Reduction and Scan Joseph Kider University of Pennsylvania CIS Fall 2011 Credit: Patrick Cozzi, Mark Harris Suresh Venkatensuramenan.

3/12/2013Computer Engg, IIT(BHU)1 PRAM ALGORITHMS-3.

Section Recursion  Recursion – defining an object (or function, algorithm, etc.) in terms of itself.  Recursion can be used to define sequences.

Parallel primitives – Scan operation CDP – Written by Uri Verner 1 GPU Algorithm Design.

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, CS/EE 217 GPU Architecture and Parallel Programming Lecture 12 Parallel Computation.

© David Kirk/NVIDIA and Wen-mei W

Lecture 3: Parallel Algorithm Design

ECE 498AL Lectures 8: Bank Conflicts and Sample PTX code

ECE408 Fall 2015 Applied Parallel Programming Lecture 11 Parallel Computation Patterns – Reduction Trees © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al,

ECE408/CS483 Fall 2015 Applied Parallel Programming Lecture 7: DRAM Bandwidth ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University.

ECE408 Fall 2015 Applied Parallel Programming Lecture 21: Application Case Study – Molecular Dynamics.

ECE 498AL Spring 2010 Lectures 8: Threading & Memory Hardware in G80

Mattan Erez The University of Texas at Austin

Mattan Erez The University of Texas at Austin

Introduction to CUDA Programming

CS 179: GPU Programming Lecture 7.

L18: CUDA, cont. Memory Hierarchy and Examples

Mattan Erez The University of Texas at Austin

Mattan Erez The University of Texas at Austin

ECE408 / CS483 Applied Parallel Programming Lecture 23: Application Case Study – Electrostatic Potential Calculation.

ECE 498AL Spring 2010 Lecture 11: Floating-Point Considerations

Mattan Erez The University of Texas at Austin

Parallel Computation Patterns (Scan)

GPGPU: Parallel Reduction and Scan

Parallel Computation Patterns (Reduction)

Programming Massively Parallel Processors Lecture Slides for Chapter 9: Application Case Study – Electrostatic Potential Calculation © David Kirk/NVIDIA.

ECE408 Applied Parallel Programming Lecture 14 Parallel Computation Patterns – Parallel Prefix Sum (Scan) Part-2 © David Kirk/NVIDIA and Wen-mei W.

L6: Memory Hierarchy Optimization IV, Bandwidth Optimization

ECE 498AL Lecture 15: Reductions and Their Implementation

Mattan Erez The University of Texas at Austin

© David Kirk/NVIDIA and Wen-mei W. Hwu,

Parallel build blocks.

Mattan Erez The University of Texas at Austin

ECE 498AL Lecture 10: Control Flow

Mattan Erez The University of Texas at Austin

Mattan Erez The University of Texas at Austin

ECE 498AL Spring 2010 Lecture 10: Control Flow

Sorting We have actually seen already two efficient ways to sort:

Algorithm Course Algorithms Lecture 3 Sorting Algorithm-1

Mattan Erez The University of Texas at Austin

Presentation transcript:

Mattan Erez The University of Texas at Austin EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 13 – Parallelism in Software I Mattan Erez The University of Texas at Austin EE382N: Parallelilsm and Locality, Spring 2015 -- Lecture 13 (c) Rodric Rabbah, Mattan Erez

Credits Most of the slides courtesy Dr. Rodric Rabbah (IBM) Taken from 6.189 IAP taught at MIT in 2007 Parallel Scan slides courtesy David Kirk (NVIDIA) and Wen-Mei Hwu (UIUC) Taken from EE493-AI taught at UIUC in Sprig 2007 EE382N: Parallelilsm and Locality, Fall 2011 -- Lecture 10 (c) Mattan Erez

Reductions Many to one Many to many Use Simply multiple reductions Also known as scatter-add and subset of parallel prefix sums Use Histograms Superposition Physical properties EE382N: Parallelilsm and Locality, Spring 2015 -- Lecture 13 (c) Rodric Rabbah, Mattan Erez

Serial Reduction When reduction operator is not associative Usually followed by a broadcast of result A[0:3] EE382N: Parallelilsm and Locality, Spring 2015 -- Lecture 13 (c) Rodric Rabbah, Mattan Erez

Tree-based Reduction n steps for 2n units of execution When reduction operator is associative Especially attractive when only one task needs result EE382N: Parallelilsm and Locality, Spring 2015 -- Lecture 13 (c) Rodric Rabbah, Mattan Erez

Vector Reduction with Bank Conflicts Array elements 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 1 2 3 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign EE382N: Parallelilsm and Locality, Spring 2015 -- Lecture 13 (c) Rodric Rabbah, Mattan Erez

No Bank Conflicts 1 2 3 … 13 14 15 16 17 18 19 1 2 3 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign EE382N: Parallelilsm and Locality, Spring 2015 -- Lecture 13 (c) Rodric Rabbah, Mattan Erez

Recursive-doubling Reduction A[0] A[1] A[2] A[3] A[0:1] A[0:1] A[2:3] A[2:3] A[0:3] A[0:3] A[0:3] A[0:3] n steps for 2n units of execution If all units of execution need the result of the reduction EE382N: Parallelilsm and Locality, Spring 2015 -- Lecture 13 (c) Rodric Rabbah, Mattan Erez

Recursive-doubling Reduction Better than tree-based approach with broadcast Each units of execution has a copy of the reduced value at the end of n steps In tree-based approach with broadcast Reduction takes n steps Broadcast cannot begin until reduction is complete Broadcast can take n steps (architecture dependent) EE382N: Parallelilsm and Locality, Spring 2015 -- Lecture 13 (c) Rodric Rabbah, Mattan Erez

Parallel Prefix Sum (Scan) Definition: The all-prefix-sums operation takes a binary associative operator  with identity I, and an array of n elements [a0, a1, …, an-1] and returns the ordered set [I, a0, (a0  a1), …, (a0  a1  …  an-2)]. Example: if  is addition, then scan on the set [3 1 7 0 4 1 6 3] returns the set [0 3 4 11 11 15 16 22] Exclusive scan: last input element is not included in the result © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign (From Blelloch, 1990, “Prefix Sums and Their Applications) EE382N: Parallelilsm and Locality, Spring 2015 -- Lecture 13 (c) Rodric Rabbah, Mattan Erez

Applications of Scan Scan is a simple and useful parallel building block Convert recurrences from sequential : for(j=1;j<n;j++) out[j] = out[j-1] + f(j); into parallel: forall(j) { temp[j] = f(j) }; scan(out, temp); Useful for many parallel algorithms: radix sort quicksort String comparison Lexical analysis Stream compaction Polynomial evaluation Solving recurrences Tree operations Building data structures Etc. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign EE382N: Parallelilsm and Locality, Spring 2015 -- Lecture 13 (c) Rodric Rabbah, Mattan Erez

Scan on a serial CPU void scan( float* scanned, float* input, int length) { scanned[0] = 0; for(int i = 1; i < length; ++i) scanned[i] = input[i-1] + scanned[i-1]; } Just add each element to the sum of the elements before it Trivial, but sequential Exactly n adds: optimal © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign EE382N: Parallelilsm and Locality, Spring 2015 -- Lecture 13 (c) Rodric Rabbah, Mattan Erez

A First-Attempt Parallel Scan Algorithm Read input to shared memory. Set first element to zero and shift others right by one. In 3 1 7 4 6 T0 3 1 7 4 6 Each UE reads one value from the input array in device memory into shared memory array T0. UE 0 writes 0 into shared memory array. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign EE382N: Parallelilsm and Locality, Spring 2015 -- Lecture 13 (c) Rodric Rabbah, Mattan Erez

A First-Attempt Parallel Scan Algorithm (previous slide) Iterate log(n) times: UEs stride to n: Add pairs of elements stride elements apart. Double stride at each iteration. (note must double buffer shared mem arrays) In 3 1 7 4 6 T0 3 1 7 4 6 Stride 1 T1 3 4 8 7 5 Active UEs: stride to n-1 (n-stride UEs) UE j adds elements j and j-stride from T0 and writes result into shared memory buffer T1 (ping-pong) Iteration #1 Stride = 1 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign EE382N: Parallelilsm and Locality, Spring 2015 -- Lecture 13 (c) Rodric Rabbah, Mattan Erez

A First-Attempt Parallel Scan Algorithm Read input from device memory to shared memory. Set first element to zero and shift others right by one. Iterate log(n) times: UEs stride to n: Add pairs of elements stride elements apart. Double stride at each iteration. (note must double buffer shared mem arrays) In 3 1 7 4 6 T0 3 1 7 4 6 Stride 1 T1 3 4 8 7 5 Stride 2 T0 3 4 11 12 Iteration #2 Stride = 2 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign EE382N: Parallelilsm and Locality, Spring 2015 -- Lecture 13 (c) Rodric Rabbah, Mattan Erez

A First-Attempt Parallel Scan Algorithm Read input from device memory to shared memory. Set first element to zero and shift others right by one. Iterate log(n) times: UEs stride to n: Add pairs of elements stride elements apart. Double stride at each iteration. (note must double buffer shared mem arrays) In 3 1 7 4 6 T0 3 1 7 4 6 Stride 1 T1 3 4 8 7 5 Stride 2 T0 3 4 11 12 T1 3 4 11 15 16 22 Iteration #3 Stride = 4 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign EE382N: Parallelilsm and Locality, Spring 2015 -- Lecture 13 (c) Rodric Rabbah, Mattan Erez

A First-Attempt Parallel Scan Algorithm Read input from device memory to shared memory. Set first element to zero and shift others right by one. Iterate log(n) times: UEs stride to n: Add pairs of elements stride elements apart. Double stride at each iteration. (note must double buffer shared mem arrays) Write output. In 3 1 7 4 6 T0 3 1 7 4 6 Stride 1 T1 3 4 8 7 5 Stride 2 T0 3 4 11 12 T1 3 4 11 15 16 22 Ou t 3 4 11 15 16 22 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign EE382N: Parallelilsm and Locality, Spring 2015 -- Lecture 13 (c) Rodric Rabbah, Mattan Erez

What is wrong with our first-attempt parallel scan? Work Efficient: A parallel algorithm is work efficient if it does the same amount of work as an optimal sequential complexity Scan executes log(n) parallel iterations The steps do n-1, n-2, n-4,... n/2 adds each Total adds: n * (log(n) – 1) + 1  O(n*log(n)) work This scan algorithm is NOT work efficient Sequential scan algorithm does n adds A factor of log(n) hurts: 20x for 10^6 elements! © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign EE382N: Parallelilsm and Locality, Spring 2015 -- Lecture 13 (c) Rodric Rabbah, Mattan Erez

Improving Efficiency A common parallel algorithm pattern: Balanced Trees Build a balanced binary tree on the input data and sweep it to and from the root Tree is not an actual data structure, but a concept to determine what each UE does at each step For scan: Traverse down from leaves to root building partial sums at internal nodes in the tree Root holds sum of all leaves Traverse back up the tree building the scan from the partial sums © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign EE382N: Parallelilsm and Locality, Spring 2015 -- Lecture 13 (c) Rodric Rabbah, Mattan Erez

Build the Sum Tree T 3 1 7 4 6 Assume array is already in shared memory © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign EE382N: Parallelilsm and Locality, Spring 2015 -- Lecture 13 (c) Rodric Rabbah, Mattan Erez

Build the Sum Tree T 3 1 7 4 6 T 3 4 7 5 6 9 Stride 1 4 6 Stride 1 Iteration 1, n/2 UEs T 3 4 7 5 6 9 Each corresponds to a single UE. Iterate log(n) times. Each UE adds value stride elements away to its own value © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign EE382N: Parallelilsm and Locality, Spring 2015 -- Lecture 13 (c) Rodric Rabbah, Mattan Erez

Build the Sum Tree T 3 1 7 4 6 Stride 1 T 3 4 7 5 6 9 Stride 2 Iteration 2, n/4 UEs T 3 4 7 11 5 6 14 Each corresponds to a single UE. Iterate log(n) times. Each UE adds value stride elements away to its own value © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign EE382N: Parallelilsm and Locality, Spring 2015 -- Lecture 13 (c) Rodric Rabbah, Mattan Erez

Build the Sum Tree T 3 1 7 4 6 Stride 1 T 3 4 7 5 6 9 Stride 2 T 3 4 7 11 5 6 14 Stride 4 Iteration log(n), 1 UE T 3 4 7 11 5 6 25 Each corresponds to a single UE. Iterate log(n) times. Each UE adds value stride elements away to its own value. Note that this algorithm operates in-place: no need for double buffering © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign EE382N: Parallelilsm and Locality, Spring 2015 -- Lecture 13 (c) Rodric Rabbah, Mattan Erez

Zero the Last Element T 3 4 7 11 5 6 We now have an array of partial sums. Since this is an exclusive scan, set the last element to zero. It will propagate back to the first element. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign EE382N: Parallelilsm and Locality, Spring 2015 -- Lecture 13 (c) Rodric Rabbah, Mattan Erez

Build Scan From Partial Sums 3 4 7 11 5 6 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign EE382N: Parallelilsm and Locality, Spring 2015 -- Lecture 13 (c) Rodric Rabbah, Mattan Erez

Build Scan From Partial Sums 3 4 7 11 5 6 Iteration 1 1 UE Stride 4 T 3 4 7 5 6 11 Each corresponds to a single UE. Iterate log(n) times. Each UE adds value stride elements away to its own value, and sets the value stride elements away to its own previous value. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign EE382N: Parallelilsm and Locality, Spring 2015 -- Lecture 13 (c) Rodric Rabbah, Mattan Erez

Build Scan From Partial Sums 3 4 7 11 5 6 Stride 4 T 3 4 7 5 6 11 Iteration 2 2 UEs Stride 2 T 3 7 4 11 6 16 Each corresponds to a single UE. Iterate log(n) times. Each UE adds value stride elements away to its own value, and sets the value stride elements away to its own previous value. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign EE382N: Parallelilsm and Locality, Spring 2015 -- Lecture 13 (c) Rodric Rabbah, Mattan Erez

Build Scan From Partial Sums 3 4 7 11 5 6 Stride 4 T 3 4 7 5 6 11 Stride 2 T 3 7 4 11 6 16 Iteration log(n) n/2 UEs Stride 1 T 3 4 11 15 16 22 Each corresponds to a single UE. Done! We now have a completed scan that we can write out to device memory. Total steps: 2 * log(n). Total work: 2 * (n-1) adds = O(n) Work Efficient! © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign EE382N: Parallelilsm and Locality, Spring 2015 -- Lecture 13 (c) Rodric Rabbah, Mattan Erez

Building Data Structures with Scans Fun on the board EE382N: Parallelilsm and Locality, Spring 2015 -- Lecture 13 (c) Rodric Rabbah, Mattan Erez