
Parallel Prefix

The Parallel Prefix Method
- This is our first example of a parallel algorithm.
- Watch closely what is being optimized for: parallel steps.
- A beautiful idea with surprising uses.
- Not sure if the parallel prefix method is used much in the real world. It might be inside MPI scan, and it might be used in some SIMD and SIMD-like cases.
- The real key: what is it about the real world that differs from the naïve mental model of parallelism?

Students' early mental models: look up or figure out how to do things in parallel, then we get speedups! NOT!

Parallel Prefix Algorithms
1. A theoretical (may or may not be practical) secret to turning serial into parallel.
2. Suppose you bump into a parallel algorithm that surprises you: "there is no way to parallelize this algorithm," you say.
3. It is probably a variation on parallel prefix!

Prefix Functions: outputs depend upon an initial string
Example of a prefix: Sum Prefix
Input: x = (x_1, x_2, ..., x_n)
Output: y = (y_1, y_2, ..., y_n), where $y_i = \sum_{j=1}^{i} x_j$
Example: x = (1, 2, 3, 4, 5, 6, 7, 8), y = (1, 3, 6, 10, 15, 21, 28, 36)

What do you think? Can we really parallelize this? It looks like this sort of code:
y(1)=x(1); for i=2:n, y(i)=y(i-1)+x(i); end
The ith iteration of the loop is not at all decoupled from the (i-1)st iteration. Impossible to parallelize, right?

A clue? x = ( 1, 2, 3, 4, 5, 6, 7, 8 ) y = ( 1, 3, 6, 10, 15, 21, 28, 36) Is there any value in adding, say, ? Note if we separately have 1+2+3, what can we do? Suppose we added 1+2, 3+4, etc. pairwise, what could we do?

Prefix Functions: outputs depend upon an initial string
Suffix Functions: outputs depend upon a final string
Other Notations
1. +\ ("plus scan") in APL ("A Programming Language", the source of the very name "scan"; an array-based language that was ahead of its time)
2. MPI_Scan
3. MATLAB command: y=cumsum(x)
4. MATLAB matmul: y=tril(ones(n))*x
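The notations above all name the same operation. A minimal Python sketch (Python is not one of the notations on the slide; `itertools.accumulate` stands in for `cumsum`, and the helper name `tril_ones_times` is made up for illustration) checks the running sum against the lower-triangular matrix product:

```python
from itertools import accumulate

def tril_ones_times(x):
    """Multiply the lower-triangular all-ones matrix by x, like tril(ones(n))*x."""
    n = len(x)
    # Row i of tril(ones(n)) has ones in columns 0..i, so entry i is x[0] + ... + x[i].
    return [sum(x[j] for j in range(i + 1)) for i in range(n)]

x = [1, 2, 3, 4, 5, 6, 7, 8]
y_scan = list(accumulate(x))      # running sum, the "plus scan"
y_matmul = tril_ones_times(x)     # same values, but O(n^2) adds instead of O(n)
print(y_scan)
```

The matrix form is handy for reasoning, but it does far more work than the scan.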

Parallel Prefix: Recursive View
1) Pairwise sums
2) Recursive prefix
3) Update "odds"
Works for any associative operator.

MATLAB simulation

function y=prefix(x)
n=length(x);                            % assumes n is a power of 2
if n==1, y=x; else
    w=x(1:2:n)+x(2:2:n);                % Pairwise adds
    w=prefix(w);                        % Recur
    y(1:2:n)=x(1:2:n)+[0 w(1:end-1)];   % Update adds (odd positions)
    y(2:2:n)=w;                         % Even positions come straight from w
end

What does this reveal? What does this hide?
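For readers without MATLAB, here is a hedged Python sketch of the same recursion. It is a serial simulation, not a parallel implementation; the `op` parameter and the handling of odd lengths are additions for illustration and are not in the original code.

```python
import operator

def prefix(x, op=operator.add):
    """Inclusive prefix of x under an associative op, by the pairwise recursion."""
    n = len(x)
    if n <= 1:
        return list(x)
    # Pairwise combine: w[k] = x[2k] op x[2k+1]; all of these are independent.
    w = [op(x[i], x[i + 1]) for i in range(0, n - 1, 2)]
    w = prefix(w, op)                  # recur on the half-length problem
    y = [None] * n
    y[0] = x[0]
    for k, wk in enumerate(w):
        y[2 * k + 1] = wk              # "evens" (counting from 1) come straight from w
        if 2 * k + 2 < n:
            y[2 * k + 2] = op(wk, x[2 * k + 2])   # "odds" need one more op each
    return y

print(prefix([1, 2, 3, 4, 5, 6, 7, 8]))
```

On the running example the pairwise sums are (3, 7, 11, 15), their prefix is (3, 10, 21, 36), and the update step fills in the remaining positions.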

Operation Count
Notice: # adds is about 2n, while # required is about n (n - 1 for the serial loop).
Parallelism at the cost of more work!
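To make the count concrete, the following sketch tallies adds and parallel steps for the pairwise recursion. It assumes n is a power of two and does not count the trivial "add 0" for the first output; the function name `count_ops` is hypothetical.

```python
def count_ops(n):
    """Return (adds, parallel_steps) for the pairwise recursion, n a power of two."""
    if n == 1:
        return 0, 0
    adds, steps = count_ops(n // 2)
    pairwise = n // 2            # one parallel step of n/2 independent adds
    update = n // 2 - 1          # one more parallel step; the first output needs no add
    return adds + pairwise + update, steps + (2 if update > 0 else 1)

for n in (8, 1024):
    adds, steps = count_ops(n)
    print(n, "serial adds:", n - 1, "prefix adds:", adds, "parallel steps:", steps)
```

Under these assumptions the total is 2n - 2 - log2(n) adds in 2 log2(n) - 1 parallel steps, compared with n - 1 adds in n - 1 steps for the serial loop.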

Any Associative Operation Works
Associative: (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
- Input: bits (Boolean): All (= and), Any (= or)
- Input: reals: Sum (+), Product (*), Max, Min
- Input: matrices: MatMul
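A short Python sketch of the same idea, using the standard library's serial scan `itertools.accumulate` as a stand-in for a parallel prefix (the sample data is made up):

```python
from itertools import accumulate
import operator

bits = [True, True, False, True]
reals = [3.0, 1.0, 4.0, 1.0, 5.0]

all_so_far = list(accumulate(bits, operator.and_))    # All (= and)
any_so_far = list(accumulate(bits, operator.or_))     # Any (= or)
running_max = list(accumulate(reals, max))
running_min = list(accumulate(reals, min))
running_prod = list(accumulate(reals, operator.mul))

print(all_so_far, any_so_far, running_max, running_min, running_prod)
```

Because each operator is associative, the grouping used by a tree-shaped parallel scan would give the same answers as this left-to-right pass.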

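Fibonacci via Matrix Multiply Prefix
$F_{n+1} = F_n + F_{n-1}$
$$\begin{pmatrix} F_{n+1} \\ F_n \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} F_n \\ F_{n-1} \end{pmatrix}$$
Can compute all $F_n$ by matmul_prefix on a list of copies of the matrix [1 1; 1 0], then select the upper left entry.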
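A small sketch of the Fibonacci trick in Python. Here `itertools.accumulate` is a serial stand-in for the slide's matmul_prefix, and the helper names `matmul2` and `fibonacci` are made up for illustration.

```python
from itertools import accumulate

def matmul2(p, q):
    """Product of two 2x2 matrices stored as ((a, b), (c, d))."""
    (a, b), (c, d) = p
    (e, f), (g, h) = q
    return ((a * e + b * g, a * f + b * h),
            (c * e + d * g, c * f + d * h))

M = ((1, 1), (1, 0))

def fibonacci(count):
    """Upper-left entries of M, M^2, ..., M^count, i.e. F_2, F_3, ..., F_(count+1)."""
    powers = accumulate([M] * count, matmul2)
    return [p[0][0] for p in powers]

print(fibonacci(8))
```

Since matrix multiplication is associative, the serial scan could be replaced by any parallel prefix without changing the result.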

Arithmetic Modulo 2 (binary arithmetic)
0+0=0    0*0=0
0+1=1    0*1=0
1+0=1    1*0=0
1+1=0    1*1=1
Add = exclusive or; Mult = and

Carry-Look Ahead Addition (Babbage, 1800s)
Goal: Add two n-bit integers

Notation
Carry         c_2  c_1  c_0
First Int     a_3  a_2  a_1  a_0
Second Int    b_3  b_2  b_1  b_0
Sum           s_3  s_2  s_1  s_0

c_{-1} = 0
for i = 0 : n-1
    s_i = a_i + b_i + c_{i-1}
    c_i = a_i b_i + c_{i-1} (a_i + b_i)
end
s_n = c_{n-1}
(addition mod 2)

$$\begin{pmatrix} c_i \\ 1 \end{pmatrix} = \begin{pmatrix} a_i + b_i & a_i b_i \\ 0 & 1 \end{pmatrix} \begin{pmatrix} c_{i-1} \\ 1 \end{pmatrix}$$

Matmul prefix with binary arithmetic is equivalent to carry-look ahead! Compute c_i by prefix, then s_i = a_i + b_i + c_{i-1} in parallel.
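The carry recurrence can be sketched in Python as a prefix over 2x2 matrices mod 2. This is an illustration under stated assumptions: each matrix [[p, g], [0, 1]] is stored as the pair (p, g), bits are listed least significant first, `itertools.accumulate` is a serial stand-in for the prefix, and all function names are hypothetical.

```python
from itertools import accumulate

def compose(acc, new):
    """Mod-2 product new * acc of matrices [[p, g], [0, 1]], stored as (p, g)."""
    p1, g1 = acc
    p2, g2 = new
    return (p2 & p1, (p2 & g1) ^ g2)

def add_bits(a, b):
    """Add two equal-length bit lists (least significant bit first)."""
    # Matrix for bit i: p = a_i + b_i (xor), g = a_i * b_i (and).
    mats = [(ai ^ bi, ai & bi) for ai, bi in zip(a, b)]
    # Prefix products; with c_{-1} = 0 the carry c_i is the top-right entry.
    carries = [g for _, g in accumulate(mats, compose)]
    carry_in = [0] + carries[:-1]
    s = [ai ^ bi ^ ci for ai, bi, ci in zip(a, b, carry_in)]  # independent per bit
    return s + [carries[-1]]                                   # s_n = c_{n-1}

def to_bits(v, n):
    return [(v >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))

print(from_bits(add_bits(to_bits(11, 4), to_bits(6, 4))))
```

Only the carry computation needs the prefix; the sum bits are then independent of one another.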

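Tridiagonal Factor
$$T = \begin{pmatrix} a_1 & b_1 & & & \\ c_1 & a_2 & b_2 & & \\ & c_2 & a_3 & b_3 & \\ & & c_3 & a_4 & b_4 \\ & & & c_4 & a_5 \end{pmatrix} = \begin{pmatrix} 1 & & & \\ l_1 & 1 & & \\ & l_2 & 1 & \\ & & \ddots & \ddots \end{pmatrix} \begin{pmatrix} d_1 & b_1 & & \\ & d_2 & b_2 & \\ & & d_3 & \ddots \\ & & & \ddots \end{pmatrix}$$
Determinants ($D_0 = 1$, $D_1 = a_1$; $D_k$ is the determinant of the $k \times k$ upper-left block):
$$D_n = a_n D_{n-1} - b_{n-1} c_{n-1} D_{n-2}$$
$$\begin{pmatrix} D_n \\ D_{n-1} \end{pmatrix} = \begin{pmatrix} a_n & -b_{n-1} c_{n-1} \\ 1 & 0 \end{pmatrix} \begin{pmatrix} D_{n-1} \\ D_{n-2} \end{pmatrix}$$
Compute $D_n$ by matmul_prefix, then $d_n = D_n / D_{n-1}$ and $l_n = c_n / d_n$.
3 embarrassing parallels + prefix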
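A hedged Python sketch of the determinant recurrence as a matrix prefix. It assumes 0-based lists `a` (diagonal), `b` (superdiagonal), and `c` (subdiagonal), uses exact fractions to keep the check simple, and again uses a serial `accumulate` in place of a parallel matmul_prefix; the function names are made up.

```python
from fractions import Fraction
from itertools import accumulate

def matmul2(p, q):
    """Product of two 2x2 matrices stored as ((a, b), (c, d))."""
    (a, b), (c, d) = p
    (e, f), (g, h) = q
    return ((a * e + b * g, a * f + b * h),
            (c * e + d * g, c * f + d * h))

def tridiagonal_factor(a, b, c):
    """Return (d, l) with T = L U, from leading-minor determinants D_k."""
    n = len(a)
    # Step matrices [[a_k, -b_{k-1} c_{k-1}], [1, 0]]; the first has no coupling term.
    steps = [((a[0], 0), (1, 0))]
    steps += [((a[k], -b[k - 1] * c[k - 1]), (1, 0)) for k in range(1, n)]
    # Prefix products with the newest matrix on the left.
    prods = accumulate(steps, lambda acc, new: matmul2(new, acc))
    # Applied to (D_0, D_{-1}) = (1, 0), the top-left entry of each product is D_k.
    D = [Fraction(1)] + [Fraction(p[0][0]) for p in prods]
    d = [D[k + 1] / D[k] for k in range(n)]      # d_k = D_k / D_{k-1}
    l = [c[k] / d[k] for k in range(n - 1)]      # l_k = c_k / d_k
    return d, l

d, l = tridiagonal_factor([2, 2, 2], [1, 1], [1, 1])
print(d, l)
```

Building the step matrices and the two divisions are the embarrassingly parallel parts; only the product chain needs the prefix. A zero leading minor would make a division fail, just as it would in the factorization itself.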

The "Myth" of log n
The log_2 n parallel steps are not the main reason for the usefulness of parallel prefix.
Say n = 1000p (1000 summands per processor).
Time = (2000 adds) + (log_2 p message passings): fast and embarrassingly parallel.
(The 2000 local adds are serial within each processor, of course.)
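The blocked scheme behind this estimate can be simulated serially in Python. This sketch assumes p divides the input length and treats each block as one processor's data; the function name `blocked_prefix` is hypothetical.

```python
from itertools import accumulate

def blocked_prefix(x, p):
    """Prefix sum of x split into p equal blocks, one block per simulated processor."""
    m = len(x) // p                          # summands per processor
    blocks = [x[r * m:(r + 1) * m] for r in range(p)]
    # Phase 1: each processor scans its own block (about m adds, no communication).
    local = [list(accumulate(blk)) for blk in blocks]
    # Phase 2: exclusive prefix over the p block totals; on a real machine this is
    # the only communication, about log2(p) message steps on a tree.
    totals = [blk[-1] for blk in local]
    offsets = [0] + list(accumulate(totals))[:-1]
    # Phase 3: each processor adds its offset to its block (about m more adds).
    return [v + off for blk, off in zip(local, offsets) for v in blk]

x = list(range(1, 8001))                     # 1000 summands on each of 8 processors
assert blocked_prefix(x, 8) == list(accumulate(x))
```

With 1000 summands per processor, phases 1 and 3 account for the roughly 2000 local adds, and phase 2 is the short communication step.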

Myth of log n: Example
[Figure: summands divided among processors; the numeric labels did not survive extraction]
...000 adds + 3 communication hops: total speed is as if there is no communication.
log_2 n = number of steps to add n numbers? (NO!!)

Any Prefix Operation May Be Segmented!

Segmented Operations
Inputs = ordered pairs (operand, boolean), e.g. (x, T) or (x, F).
Change of segment indicated by switching T/F.

  ⊕2     | (y, T)     | (y, F)
  (x, T) | (x ⊕ y, T) | (y, F)
  (x, F) | (y, T)     | (x ⊕ y, F)

Example flags: T T F F F T F T
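A minimal Python sketch of the pair operator in the table, applied left to right with `itertools.accumulate`. The operand values 1 to 8 are made up for illustration; only the flag pattern comes from the slide.

```python
from itertools import accumulate

def seg_op(left, right, op=lambda u, v: u + v):
    """Combine (x, fx) with (y, fy): continue the segment if the flags match."""
    (x, fx), (y, fy) = left, right
    return (op(x, y), fy) if fx == fy else (y, fy)

def segmented_scan(values, flags):
    """Segmented inclusive prefix; a new segment starts wherever the flag switches."""
    pairs = list(zip(values, flags))
    return [v for v, _ in accumulate(pairs, seg_op)]

T, F = True, False
print(segmented_scan([1, 2, 3, 4, 5, 6, 7, 8], [T, T, F, F, F, T, F, T]))
```

One caution: with this switching encoding the grouping matters. For (x, T), (y, F), (z, T), combining left to right gives (z, T), while combining the last two first gives (x ⊕ z, T). A tree-shaped parallel scan would therefore need a different encoding, such as flags that mark segment starts.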

Copy Prefix: x ⊕ y = x (is associative)
Segmented, with flags: T T F F F T F T

High Performance Fortran
SUM_PREFIX(ARRAY, DIM, MASK, SEG, EXC)

M = T T T T T
    F F T T T
    T F T F F

Example calls on an array A with the mask M:
SUM_PREFIX(A)
SUM_SUFFIX(A)
SUM_PREFIX(A, DIM = 2)
SUM_PREFIX(A, MASK = M)

More HPF: Segmented

S = T T F F F
    F T T F F
    T T T T T

Sum_Prefix(A, SEGMENTS = S)
Segments change wherever the flag switches: T T | F | T T | F F

Example of Exclusive (Exclusive: don't count myself)
A = 1 2 3 4 5
Sum_Prefix(A) = 1 3 6 10 15
Sum_Prefix(A, EXCLUSIVE = TRUE) = 0 1 3 6 10
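In Python terms, the exclusive variant is the inclusive result shifted right by one with the identity in front. The function below borrows the name `sum_prefix` only as a loose analogy; it is not HPF.

```python
from itertools import accumulate

def sum_prefix(a, exclusive=False):
    """Inclusive running sum, or the exclusive one (don't count myself)."""
    inclusive = list(accumulate(a))
    if not exclusive:
        return inclusive
    # Shift right by one and put the additive identity in front.
    return ([0] + inclusive[:-1])[:len(a)]

A = [1, 2, 3, 4, 5]
print(sum_prefix(A))
print(sum_prefix(A, exclusive=True))
```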

Parallel Prefix
1) Pairwise sums
2) Recursive prefix
3) Update "odds" (counting positions from 1, as in the MATLAB simulation)
Any associative operator.
AKA: +\ (APL), cumsum (MATLAB), MPI_SCAN

Variations on Prefix

exclusive:
1) Pairwise sums
2) Recursive exclusive prefix
3) Update "evens"

reduce:
1) Pairwise sums
2) Recursive reduce
3) Update "odds"

The Family...

Directions              | Left           | Right           | Left/Right
Inclusive (Exc=0)       | Prefix         | Suffix          | Reduce
Exclusive (Exc=1)       | Exc Prefix     | Exc Suffix      | Exc Reduce
Neighbor Exc (Exc=2)    | Left Multipole | Right Multipole | Multipole

Multipole in 2d or 3d etc.
Notice that left/right generalizes more readily to higher dimensions.
Ask yourself what Exc=2 looks like in 3d.
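The whole family can be sketched in one Python function for the sum operator in 1d. The reading of Exc=2 used here (skip myself and my immediate neighbors) is an assumption about what the slide intends, and the function name `family` is hypothetical.

```python
from itertools import accumulate

def family(x, exc):
    """Return (left, right, both) sums per position for exclusion level exc.

    exc=0 includes x[i]; exc=1 drops x[i]; exc=2 also drops x[i-1] and x[i+1].
    """
    n = len(x)
    pre = [0] + list(accumulate(x))                    # pre[k] = x[0] + ... + x[k-1]
    suf = list(accumulate(reversed(x)))[::-1] + [0]    # suf[k] = x[k] + ... + x[n-1]
    left, right, both = [], [], []
    for i in range(n):
        lo = max(i + 1 - exc, 0)      # left sum covers x[0 .. lo-1]
        hi = min(i + exc, n)          # right sum covers x[hi .. n-1]
        left.append(pre[lo])
        right.append(suf[hi])
        # For exc=0 both sides contain x[i], so the reduce is just the total.
        both.append(pre[n] if exc == 0 else pre[lo] + suf[hi])
    return left, right, both

for exc in (0, 1, 2):
    print(exc, family([1, 2, 3, 4, 5], exc))
```

Each row of the family table corresponds to one value of `exc`, and the three returned lists correspond to the Left, Right, and Left/Right columns.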

Not Parallel Prefix but PRAM
- Only concerned with minimizing parallel time (not communication)
- Arbitrary number of processors
- One element per processor

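Csanky's (1977) Matrix Inversion

Lemma 1: The inverse of a triangular matrix can be computed in $O(\log^2 n)$ parallel steps.
Proof idea:
$$\begin{pmatrix} A & 0 \\ C & B \end{pmatrix}^{-1} = \begin{pmatrix} A^{-1} & 0 \\ -B^{-1} C A^{-1} & B^{-1} \end{pmatrix}$$

Lemma 2 (Cayley-Hamilton):
$$p(x) = \det(xI - A) = x^n + c_1 x^{n-1} + \cdots + c_n \qquad (c_n = \pm \det A)$$
$$0 = p(A) = A^n + c_1 A^{n-1} + \cdots + c_n I$$
$$A^{-1} = \left(A^{n-1} + c_1 A^{n-2} + \cdots + c_{n-1} I\right)\left(-1/c_n\right)$$
Powers of A via parallel prefix.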

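Lemma 3 (Leverrier's Lemma): with $s_k = \operatorname{tr}(A^k)$,
$$\begin{pmatrix} 1 & & & & \\ s_1 & 2 & & & \\ s_2 & s_1 & 3 & & \\ \vdots & & \ddots & \ddots & \\ s_{n-1} & \cdots & s_2 & s_1 & n \end{pmatrix} \begin{pmatrix} c_1 \\ c_2 \\ c_3 \\ \vdots \\ c_n \end{pmatrix} = - \begin{pmatrix} s_1 \\ s_2 \\ s_3 \\ \vdots \\ s_n \end{pmatrix}$$

Csanky
1) Parallel prefix powers of A
2) s_k by directly adding diagonals
3) c_i from Lemmas 1 and 3
4) A^{-1} obtained from Lemma 2
Horrible for A = 3I and n > 50!!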
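The four steps can be traced with a small serial Python sketch. It makes several assumptions that are not Csanky's: exact rational arithmetic via `fractions.Fraction`, a plain loop in place of the parallel prefix for the powers, and forward substitution in place of Lemma 1's parallel triangular inverse. The function names are hypothetical.

```python
from fractions import Fraction

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def csanky_inverse(A):
    n = len(A)
    A = [[Fraction(v) for v in row] for row in A]
    identity = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    # 1) Powers A^0 .. A^n (serial here; a matmul prefix would supply them).
    powers = [identity]
    for _ in range(n):
        powers.append(matmul(powers[-1], A))
    # 2) s_k = tr(A^k), by adding diagonals.
    s = [None] + [sum(powers[k][i][i] for i in range(n)) for k in range(1, n + 1)]
    # 3) Leverrier: k c_k = -(s_k + c_1 s_{k-1} + ... + c_{k-1} s_1), with c_0 = 1.
    c = [Fraction(1)]
    for k in range(1, n + 1):
        c.append(-sum(c[j] * s[k - j] for j in range(k)) / k)
    if c[n] == 0:
        raise ZeroDivisionError("matrix is singular")
    # 4) Cayley-Hamilton: A^{-1} = -(A^{n-1} + c_1 A^{n-2} + ... + c_{n-1} I) / c_n.
    return [[-sum(c[j] * powers[n - 1 - j][r][q] for j in range(n)) / c[n]
             for q in range(n)] for r in range(n)]

print(csanky_inverse([[2, 1], [1, 1]]))
```

Exact fractions sidestep rounding entirely. In floating point the traces of high powers grow very quickly (for A = 3I, tr(A^k) = n 3^k), which is presumably what makes the method so poorly behaved in the example on the slide.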

- Matrix multiply can be done in log n steps on n^3 processors in the PRAM model.
- It can be useful to think this way, but we must also remember how real machines are built!
- Parallel steps are not the whole story.
- Nobody puts one element per processor.