CS 240A: Parallel Prefix Algorithms or Tricks with Trees

Some slides from Jim Demmel, Kathy Yelick, Alan Edelman, and a cast of thousands …

Parallel Vector Operations

- Vector add: z = x + y. Embarrassingly parallel if the vectors are aligned.
- DAXPY: z = a*x + y (a is a scalar). Broadcast a, followed by independent * and +.
- DDOT: s = x^T y = Σ_j x[j] * y[j]. Independent * followed by a + reduction.

Broadcast and reduction

- Broadcast of 1 value to p processors with log p span.
- Reduction of p values to 1 with log p span.
- Takes advantage of associativity in +, *, min, max, etc.

[Figure: broadcast of a value a down a tree; add-reduction of 1 3 1 0 4 -6 3 2 yielding 8]
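
A minimal sketch of the reduction half (Python; the function name and structure are mine, not the slides'). Each while-pass corresponds to one tree level, and the pairwise combines within a pass are all independent:

    import operator

    def tree_reduce(op, values):
        """Reduce values under an associative op, pairwise, level by level.
        Span is O(log p) if each level's combines run in parallel."""
        vals = list(values)
        while len(vals) > 1:
            nxt = [op(vals[i], vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
            if len(vals) % 2:          # odd element passes through to the next level
                nxt.append(vals[-1])
            vals = nxt
        return vals[0]

    print(tree_reduce(operator.add, [1, 3, 1, 0, 4, -6, 3, 2]))  # 8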

Parallel Prefix Algorithms

- A theoretical secret for turning serial into parallel.
- Surprising parallel algorithms: if "there is no way to parallelize this algorithm!" …
- … it's probably a variation on parallel prefix!

Example of a prefix: the sum prefix

Input:  x = (x1, x2, …, xn)
Output: y = (y1, y2, …, yn), where y_i = Σ_{j=1:i} x_j

Example:
    x = ( 1, 2, 3,  4,  5,  6,  7,  8 )
    y = ( 1, 3, 6, 10, 15, 21, 28, 36 )

Prefix functions: each output depends on an initial segment (prefix) of the input.

What do you think? Can we really parallelize this?

It looks like this kind of code:

    y(0) = 0;
    for i = 1:n
        y(i) = y(i-1) + x(i);
    end

The ith iteration of the loop depends completely on the (i-1)st iteration. Work = n, span = n, parallelism = 1. Impossible to parallelize, right?
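
The same loop as runnable Python, just to make the dependence chain concrete (a reference version, not from the slides):

    def serial_prefix(x):
        """Inclusive prefix sum: y[i] = x[0] + ... + x[i]."""
        y, total = [], 0
        for xi in x:           # each step needs the previous step's total
            total += xi
            y.append(total)
        return y

    print(serial_prefix([1, 2, 3, 4, 5, 6, 7, 8]))  # [1, 3, 6, 10, 15, 21, 28, 36]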

A clue?

    x = ( 1, 2, 3,  4,  5,  6,  7,  8 )
    y = ( 1, 3, 6, 10, 15, 21, 28, 36 )

Is there any value in adding, say, 4+5+6+7? If we separately have 1+2+3, what can we do? Suppose we added 1+2, 3+4, etc. pairwise: what could we do?

Prefix sum in parallel

Algorithm:
1. Pairwise sum
2. Recursive prefix
3. Pairwise sum (update)

    Input:             1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
    Pairwise sums:        3     7    11    15    19    23    27    31
    Recursive prefix:     3    10    21    36    55    78   105   136
    Final update:      1  3  6 10 15 21 28 36 45 55 66 78 91 105 120 136
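
A minimal recursive sketch of this algorithm (Python; assumes len(x) is a power of two, and the loops mark where a real implementation would fork parallel work):

    def parallel_prefix(x):
        """Inclusive prefix sum: pairwise-sum, recurse, update.
        Work O(n), span O(log n) if each step runs in parallel."""
        n = len(x)
        if n == 1:
            return list(x)
        # Step 1: pairwise sums (all pairs independent).
        pairs = [x[2*i] + x[2*i + 1] for i in range(n // 2)]
        # Step 2: recursive prefix on the half-length array.
        psums = parallel_prefix(pairs)
        # Step 3: odd 0-based indices come straight from psums;
        # even indices add their own element (all independent).
        y = [0] * n
        for i in range(n):
            if i % 2 == 1:
                y[i] = psums[i // 2]
            else:
                y[i] = x[i] if i == 0 else psums[i // 2 - 1] + x[i]
        return y

    print(parallel_prefix(list(range(1, 17))))
    # [1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, 78, 91, 105, 120, 136]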

Parallel prefix cost: work and span

    Input:             1  2  3  4  5  6  7  8
    Pairwise sums:        3     7    11    15
    Recursive prefix:     3    10    21    36
    Update "odds":     1  3  6 10 15 21 28 36

Work:  T1(n) = n/2 + n/2 + T1(n/2) = n + T1(n/2) = 2n - 1
Span:  T∞(n) = 2 log n

Parallelism at the cost of more work!

Non-recursive view of parallel prefix scan

Tree summation in two phases:

Up sweep (leaves to root):
- get values L and R from the left and right child
- save L in a local variable Mine
- compute Tmp = L + R and pass it to the parent

Down sweep (root to leaves):
- get value Tmp from the parent (the root gets 0)
- send Tmp to the left child
- send Tmp + Mine to the right child

In short: up sweep: mine = left, tmp = left + right; down sweep: tmp = parent (root is 0), right = tmp + mine.

[Figure: the two sweeps on input 3 1 2 0 4 1 1 3; after the down sweep, adding the input elementwise (+X) gives the prefix sums 3 4 6 6 10 11 12 15]
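
The standard array formulation of these two sweeps is the Blelloch exclusive scan; a minimal sketch (Python, assuming the length is a power of two), where each while-pass is one tree level and the inner loop's iterations are independent:

    def exclusive_scan(x):
        """Up-sweep/down-sweep scan; returns the exclusive prefix sums of x."""
        a = list(x)
        n = len(a)
        d = 1
        while d < n:                      # up sweep: partial sums up the tree
            for i in range(2*d - 1, n, 2*d):
                a[i] += a[i - d]          # tmp = left + right
            d *= 2
        a[n - 1] = 0                      # root receives the identity
        d = n // 2
        while d >= 1:                     # down sweep: push prefixes back down
            for i in range(2*d - 1, n, 2*d):
                mine = a[i - d]
                a[i - d] = a[i]           # left child gets tmp
                a[i] += mine              # right child gets tmp + mine
            d //= 2
        return a

    x = [3, 1, 2, 0, 4, 1, 1, 3]
    e = exclusive_scan(x)
    print([ei + xi for ei, xi in zip(e, x)])  # inclusive: [3, 4, 6, 6, 10, 11, 12, 15]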

Any associative operation works: (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)

- Sum (+), product (*), max, min (input: reals)
- All (and), any (or) (input: bits, Boolean)
- MatMul (input: matrices)

Scan (Parallel Prefix) Operations

Definition: the parallel prefix operation takes a binary associative operator ⊕ and an array of n elements [a0, a1, a2, …, a(n-1)], and produces the array [a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ a(n-1))].

Example: add scan of [1, 2, 0, 4, 2, 1, 1, 3] is [1, 3, 3, 7, 9, 10, 11, 14].
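
A generic serial reference for this definition (Python; handy for checking the parallel versions above against any associative operator):

    import operator

    def scan(op, x):
        """Inclusive scan of x under the associative operator op."""
        out = [x[0]]
        for v in x[1:]:
            out.append(op(out[-1], v))
        return out

    print(scan(operator.add, [1, 2, 0, 4, 2, 1, 1, 3]))  # [1, 3, 3, 7, 9, 10, 11, 14]
    print(scan(max, [1, 2, 0, 4, 2, 1, 1, 3]))           # [1, 2, 2, 4, 4, 4, 4, 4]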

Applications of scans

Many applications, some more obvious than others:
- lexically compare strings of characters
- add multi-precision numbers
- add binary numbers fast in hardware
- graph algorithms
- evaluate polynomials
- implement bucket sort, radix sort, and even quicksort
- solve tridiagonal linear systems
- solve recurrence relations
- dynamically allocate processors
- search for regular expressions (grep)
- image processing primitives

E.g., using scans for array compression

Given an array of n elements [a0, a1, a2, …, a(n-1)] and an array of flags [1, 0, 1, 1, 0, 0, 1, …], compress the flagged elements into [a0, a2, a3, a6, …]:

- Compute an add scan of [0, flags]: [0, 1, 1, 2, 3, 3, 3, 4, …]. This gives the index of each element in the compressed array.
- If the flag for an element is 1, write it into the result array at the given position.
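
A sketch of that recipe (Python; the scan here is serial, but any parallel add scan would do, and the final writes are all independent):

    def compress(a, flags):
        """Pack the elements of a whose flag is 1, using an exclusive
        add scan of the flags to compute each kept element's output index."""
        idx = [0]                     # add scan of [0, flags], minus its last entry
        for f in flags[:-1]:
            idx.append(idx[-1] + f)
        out = [None] * (idx[-1] + flags[-1])
        for i, f in enumerate(flags):  # independent writes
            if f:
                out[idx[i]] = a[i]
        return out

    print(compress(['a0', 'a1', 'a2', 'a3', 'a4', 'a5', 'a6'],
                   [1, 0, 1, 1, 0, 0, 1]))
    # ['a0', 'a2', 'a3', 'a6']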

E.g., Fibonacci via matrix multiply prefix

F(n+1) = F(n) + F(n-1), i.e.

    [ F(n+1) ]   [ 1  1 ] [ F(n)   ]
    [ F(n)   ] = [ 1  0 ] [ F(n-1) ]

Can compute all F(n) by matmul_prefix on n copies of the matrix [ 1 1 ; 1 0 ], then select the upper-left entry of each prefix product.
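
A sketch of this in Python (matmul_prefix here is just a serial scan under 2-by-2 matrix multiply; since matrix multiplication is associative, any parallel scan applies):

    def matmul2(A, B):
        """Multiply 2x2 matrices given as ((a, b), (c, d))."""
        return ((A[0][0]*B[0][0] + A[0][1]*B[1][0], A[0][0]*B[0][1] + A[0][1]*B[1][1]),
                (A[1][0]*B[0][0] + A[1][1]*B[1][0], A[1][0]*B[0][1] + A[1][1]*B[1][1]))

    Q = ((1, 1), (1, 0))
    prefix = [Q]                             # prefix products Q^1, Q^2, ...
    for _ in range(7):
        prefix.append(matmul2(prefix[-1], Q))
    print([P[0][0] for P in prefix])         # F(2)..F(9): [1, 2, 3, 5, 8, 13, 21, 34]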

Carry-Lookahead Addition (Babbage, 1800s)

Goal: add two n-bit integers.

    Example:                       Notation:
        1 0 1 1 1    Carry         c2 c1 c0
          1 0 1 1 1  First int     a3 a2 a1 a0
          1 0 1 0 1  Second int    b3 b2 b1 b0
      1 0 1 1 0 0    Sum           s3 s2 s1 s0

Serial algorithm (additions mod 2):

    c(-1) = 0
    for i = 0 : n-1
        s(i) = a(i) + b(i) + c(i-1)
        c(i) = a(i) b(i) + c(i-1) (a(i) + b(i))
    end
    s(n) = c(n-1)

The carry recurrence is a 2-by-2 matrix-vector product:

    [ c(i) ]   [ a(i)+b(i)  a(i)b(i) ] [ c(i-1) ]
    [   1  ] = [     0         1     ] [    1   ]

so:
1. Compute the c(i) by binary matmul prefix.
2. Compute s(i) = a(i) + b(i) + c(i-1) in parallel.

Adding two n-bit integers in O(log n) time

Let a = a[n-1]a[n-2]…a[0] and b = b[n-1]b[n-2]…b[0] be two n-bit binary numbers. We want their sum s = a + b = s[n]s[n-1]…s[0]:

    c[-1] = 0                                                       … rightmost carry bit
    for i = 0 to n-1
        c[i] = ( (a[i] xor b[i]) and c[i-1] ) or ( a[i] and b[i] )  … next carry bit
        s[i] = a[i] xor b[i] xor c[i-1]

Challenge: compute all c[i] in O(log n) time via parallel prefix.

    for all 0 <= i <= n-1:  p[i] = a[i] xor b[i]   … propagate bit
    for all 0 <= i <= n-1:  g[i] = a[i] and b[i]   … generate bit

Then c[i] = ( p[i] and c[i-1] ) or g[i], i.e.

    [ c[i] ]   [ p[i]  g[i] ] [ c[i-1] ]
    [   1  ] = [   0     1  ] [    1   ]  =  M[i] * [ c[i-1] ; 1 ]

so [ c[i] ; 1 ] = M[i] * M[i-1] * … * M[0] * [ 0 ; 1 ], a product of 2-by-2 Boolean matrices (matrix multiplication is associative). Evaluate each product M[i] * M[i-1] * … * M[0] by parallel prefix.

Used in all computers to implement addition: carry lookahead.
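
A sketch in Python, folding each M[i] down to its (propagate, generate) pair; composing pairs is the associative operation a parallel scan would use (function names and the bit ordering are mine, not the slides'):

    def compose(later, earlier):
        """Compose carry operators; each (p, g) maps carry-in c to (p and c) or g."""
        pe, ge = earlier
        pl, gl = later
        return (pl & pe, gl | (pl & ge))

    def add_lookahead(a, b):
        """Add two bit lists (index 0 = least significant bit)."""
        ops = [(ai ^ bi, ai & bi) for ai, bi in zip(a, b)]  # (p[i], g[i]) per bit
        pref = [ops[0]]                 # inclusive scan under compose
        for op in ops[1:]:
            pref.append(compose(op, pref[-1]))
        c = [g for (p, g) in pref]      # c[i] = M[i]*...*M[0] applied to c[-1] = 0
        s = [a[i] ^ b[i] ^ (c[i-1] if i else 0) for i in range(len(a))]
        return s + [c[-1]]              # s[n] = c[n-1]

    # 10111 + 10101 = 101100, bits listed least-significant first:
    print(add_lookahead([1, 1, 1, 0, 1], [1, 0, 1, 0, 1]))  # [0, 0, 1, 1, 0, 1]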

Segmented Operations

Inputs = ordered pairs (operand, boolean), e.g. (x, T) or (x, F). A change of segment is indicated by switching T/F.

The segmented operator ⊕2:

                 (y, T)        (y, F)
    (x, T)    (x ⊕ y, T)       (y, F)
    (x, F)       (y, T)     (x ⊕ y, F)

e.g. segmented + scan:

    values   1  2  3  4  5   6  7  8
    flags    T  T  F  F  F   T  F  T
    result   1  3  3  7  12  6  7  8
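
A serial sketch of a segmented + scan, applying the slide's ⊕2 table left to right (Python; names are mine):

    def seg_op(x, y):
        """The segmented-add table on (value, flag) pairs: combine when the
        flags match; a flag switch starts a new segment at y."""
        xv, xf = x
        yv, yf = y
        return (xv + yv, yf) if xf == yf else (yv, yf)

    def seg_scan(values, flags):
        """Segmented + scan over (value, flag) pairs, left to right."""
        out = [(values[0], flags[0])]
        for pair in zip(values[1:], flags[1:]):
            out.append(seg_op(out[-1], pair))
        return [v for v, f in out]

    print(seg_scan([1, 2, 3, 4, 5, 6, 7, 8],
                   ['T', 'T', 'F', 'F', 'F', 'T', 'F', 'T']))
    # [1, 3, 3, 7, 12, 6, 7, 8]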

Any Prefix Operation May Be Segmented!

Graph algorithms by segmented scans

Multiplying n-by-n matrices in O(log n) span

- For all 1 <= i,j,k <= n: P(i,j,k) = A(i,k) * B(k,j)
  span = 1, work = n^3
- For all 1 <= i,j <= n: C(i,j) = Σ_k P(i,j,k)
  span = O(log n), work = n^3, using a tree for each sum
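
A sketch of the dependency structure (Python; the two phases mark the span-1 products and the span-O(log n) tree reductions, with tree_reduce_add standing in for a log-depth reduction):

    def tree_reduce_add(vals):
        """Pairwise (log-depth) summation; each level's adds are independent."""
        vals = list(vals)
        while len(vals) > 1:
            nxt = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
            if len(vals) % 2:
                nxt.append(vals[-1])
            vals = nxt
        return vals[0]

    def matmul_log_span(A, B):
        n = len(A)
        # Phase 1: all n^3 products at once -- span 1, work n^3.
        P = [[[A[i][k] * B[k][j] for k in range(n)] for j in range(n)]
             for i in range(n)]
        # Phase 2: each C(i,j) is a log-depth reduction -- span O(log n).
        return [[tree_reduce_add(P[i][j]) for j in range(n)] for i in range(n)]

    print(matmul_log_span([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
    # [[19, 22], [43, 50]]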

Inverting dense n-by-n matrices in O(log^2 n) span

- Lemma 1 (Cayley-Hamilton theorem): expression for A^-1 via the characteristic polynomial in A.
- Lemma 2 (Newton's identities): triangular system of equations for the coefficients of the characteristic polynomial.
- Lemma 3: trace(A^k) = Σ_{i=1..n} A^k[i,i] = Σ_{i=1..n} [λ_i(A)]^k

Csanky's algorithm (1976):
1. Compute the powers A^2, A^3, …, A^(n-1) by parallel prefix: span = O(log^2 n)
2. Compute the traces s_k = trace(A^k): span = O(log n)
3. Solve Newton's identities for the coefficients of the characteristic polynomial
4. Evaluate A^-1 using the Cayley-Hamilton theorem

Completely numerically unstable!

Evaluating arbitrary expressions

- Let E be an arbitrary expression formed from +, -, *, /, parentheses, and n variables, where each appearance of each variable is counted separately.
- Can think of E as an arbitrary expression tree with n leaves (the variables) and internal nodes labelled by +, -, * and /.
- Theorem (Brent): E can be evaluated with O(log n) span, if we reorganize it using the laws of commutativity, associativity and distributivity.
- Sketch of (modern) proof: evaluate the expression tree E greedily by, at each time step, collapsing all leaves into their parents and evaluating all "chains" in E with parallel prefix.

The myth of log n

The log2 n span is not the main reason for the usefulness of parallel prefix. Say n = 1000000p (1,000,000 summands per processor):

    Cost = (2,000,000 adds) + (log2 p message passings)
           fast & embarrassingly parallel

(The 2,000,000 local adds are serial within each processor, of course.)

Summary of tree algorithms

- Lots of problems can be done quickly (in theory) using trees.
- Some algorithms are widely used:
  - broadcasts, reductions, parallel prefix
  - carry-lookahead addition
- Some are of theoretical interest only:
  - Csanky's method for matrix inversion
  - solving tridiagonal linear systems (without pivoting)
  - both are numerically unstable, and Csanky does too much work
- Embedded in various systems:
  - CM-5 hardware control network
  - MPI, UPC, Titanium, NESL, other languages