1 02/09/05CS267 Lecture 7 CS 267 Tricks with Trees James Demmel www.cs.berkeley.edu/~demmel/cs267_Spr05.

Slides:

Advertisements

Similar presentations

Lecture 3: Parallel Algorithm Design

Advertisements

Analysis of Algorithms CS 477/677 Linear Sorting Instructor: George Bebis ( Chapter 8 )

Grade School Again: A Parallel Perspective Great Theoretical Ideas In Computer Science Steven RudichCS Spring 2004 Lecture 17Mar 16, 2004Carnegie.

1 02/11/2009CS267 Lecture 7+ CS 267 Tricks with Trees James Demmel

1 Parallel Algorithms II Topics: matrix and graph algorithms.

Numerical Algorithms Matrix multiplication

CS 240A: Parallel Prefix Algorithms or Tricks with Trees

Parallel Prefix Computation Advanced Algorithms & Data Structures Lecture Theme 14 Prof. Dr. Th. Ottmann Summer Semester 2006.

Advanced Topics in Algorithms and Data Structures 1 Lecture 4 : Accelerated Cascading and Parallel List Ranking We will first discuss a technique called.

CSE621/JKim Lec4.1 9/20/99 CSE621 Parallel Algorithms Lecture 4 Matrix Operation September 20, 1999.

CS267 L12 Sources of Parallelism(3).1 Demmel Sp 1999 CS 267 Applications of Parallel Computers Lecture 12: Sources of Parallelism and Locality (Part 3)

Lecture 21: Parallel Algorithms

Accelerated Cascading Advanced Algorithms & Data Structures Lecture Theme 16 Prof. Dr. Th. Ottmann Summer Semester 2006.

2/9/2007CS267 Lecure 81 CS 267: Distributed Memory Programming (MPI) and Tree-Based Algorithms Kathy Yelick

VLSI Arithmetic. Multiplication A = a n-1 a n-2 … a 1 a 0 B = b n-1 b n-2 … b 1 b 0  eg)  Shift.

Parallel Prefix and Data Parallel Operations Motivation: basic parallel operations which occurs repeatedly. Let ) be an associative operation. (a 1 ) a.

1 CS 267 Tricks with Trees James Demmel

Cache Oblivious Search Trees via Binary Trees of Small Height

Advanced Topics in Algorithms and Data Structures Page 1 An overview of lecture 3 A simple parallel algorithm for computing parallel prefix. A parallel.

1 Lecture 24: Parallel Algorithms I Topics: sort and matrix algorithms.

Chapter 5 Arithmetic Logic Functions. Page 2 This Chapter..  We will be looking at multi-valued arithmetic and logic functions  Bitwise AND, OR, EXOR,

Dynamic Programming Introduction to Algorithms Dynamic Programming CSE 680 Prof. Roger Crawfis.

Introduction to CMOS VLSI Design Lecture 11: Adders David Harris Harvey Mudd College Spring 2004.

Propositional Equivalence Goal: Show how propositional equivalences are established & introduce the most important such equivalences.

Great Theoretical Ideas in Computer Science.

Systems of Linear Equation and Matrices

Great Theoretical Ideas in Computer Science.

CS 240A: Parallel Prefix Algorithms or Tricks with Trees Some slides from Jim Demmel, Kathy Yelick, Alan Edelman, and a cast of thousands …

Chapter 6-1 ALU, Adder and Subtractor

1 Lecture 11 POLYNOMIALS and Tree sort 2 INTRODUCTION EVALUATING POLYNOMIAL FUNCTIONS Horner’s method Permutation Tree sort.

1 Chapter 7 Computer Arithmetic Smruti Ranjan Sarangi Computer Organisation and Architecture PowerPoint Slides PROPRIETARY MATERIAL. © 2014 The McGraw-Hill.

Have we ever seen this phenomenon before? Let’s do some quick multiplication…

Advanced Algebraic Algorithms on Integers and Polynomials Prepared by John Reif, Ph.D. Analysis of Algorithms.

DISCRETE COMPUTATIONAL STRUCTURES CS Fall 2005.

Lecture 8 Tree.

Outline Priority Queues Binary Heaps Randomized Mergeable Heaps.

Computing Systems Designing a basic ALU.

درس مدارهای منطقی دانشگاه قم مدارهای منطقی محاسباتی تهیه شده توسط حسین امیرخانی مبتنی بر اسلایدهای درس مدارهای منطقی دانشگاه.

Complexity 20-1 Complexity Andrei Bulatov Parallel Arithmetic.

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M

Great Theoretical Ideas in Computer Science.

MA/CSSE 473 Day 28 Dynamic Programming Binomial Coefficients Warshall's algorithm Student questions?

1 Lecture 12 Time/space trade offs Adders. 2 Time vs. speed: Linear chain 8-input OR function with 2-input gates Gates: 7 Max delay: 7.

Computer Architecture Lecture 16 Fasih ur Rehman.

1 Carry Lookahead Logic Carry Generate Gi = Ai Bi must generate carry when A = B = 1 Carry Propagate Pi = Ai xor Bi carry in will equal carry out here.

Fall 2008Simple Parallel Algorithms1. Fall 2008Simple Parallel Algorithms2 Scalar Product of Two Vectors Let a = (a 1, a 2, …, a n ); b = (b 1, b 2, …,

Divide and Conquer Faculty Name: Ruhi Fatima Topics Covered Divide and Conquer Matrix multiplication Recurrence.

Great Theoretical Ideas in Computer Science for Some.

Grade School Again: A Parallel Perspective CS Lecture 7.

CHAPTER 7 Determinant s. Outline - Permutation - Definition of the Determinant - Properties of Determinants - Evaluation of Determinants by Elementary.

Parallel Prefix The Parallel Prefix Method –This is our first example of a parallel algorithm –Watch closely what is being optimized.

Numerical Algorithms Chapter 11.

CDA3101 Recitation Section 5

Somet things you should know about digital arithmetic:

CS 267 Tricks with Trees James Demmel Why talk about this now?

Lecture 3: Parallel Algorithm Design

Propositional Equivalence

CS 267 Tricks with Trees James Demmel Why talk about this now?

ECE/CS 552: Carry Lookahead

PRAM Algorithms.

CS 267 Tricks with Trees Why talk about this now?

CSE Winter 2001 – Arithmetic Unit - 1

Parallel Prefix.

James Demmel CS 267 Tricks with Trees James Demmel 01/31/2012 CS267 Lecture.

Unit –VIII PRAM Algorithms.

James Demmel and Kathy Yelick

74LS283 4-Bit Binary Adder with Fast Carry

James Demmel CS 267 Applications of Parallel Computers Lecture 12: Sources of Parallelism and Locality.

James Demmel and Horst Simon

James Demmel CS 267 Tricks with Trees James Demmel 02/11/2009 CS267 Lecture.

Presentation transcript:

1 02/09/05CS267 Lecture 7 CS 267 Tricks with Trees James Demmel

202/09/05CS267 Lecture 7 Outline °A log n lower bound to compute any function in parallel °Reduction and broadcast in O(log n) time °Parallel prefix (scan) in O(log n) time °Adding two n-bit integers in O(log n) time °Multiplying n-by-n matrices in O(log n) time °Inverting n-by-n triangular matrices in O(log n) time °Inverting n-by-n dense matrices in O(log n) time °Evaluating arbitrary expressions in O(log n) time °Evaluating recurrences in O(log n) time °Solving n-by-n tridiagonal matrices in O(log n) time °Traversing linked lists °Computing minimal spanning trees °Computing convex hulls of point sets 2 2

302/09/05CS267 Lecture 7 A log n lower bound to compute any function of n variables °Assume we can only use binary operations, one per time unit °After 1 time unit, an output can only depend on two inputs °Use induction to show that after k time units, an output can only depend on 2 k inputs °A binary tree performs such a computation

402/09/05CS267 Lecture 7 Broadcasts and Reductions on Trees

502/09/05CS267 Lecture 7 Parallel Prefix, or Scan °If “+” is an associative operator, and x[0],…,x[p-1] are input data then parallel prefix operation computes °Notation: j:k mean x[j]+x[j+1]+…+x[k], blue is final value y[j] = x[0] + x[1] + … + x[j] for j=0,1,…,p-1

602/09/05CS267 Lecture 7 Mapping Parallel Prefix onto a Tree - Details °Up-the-tree phase (from leaves to root ) °By induction, Lsave = sum of all leaves in left subtree °Down the tree phase (from root to leaves) °By induction, S = sum of all leaves to left of subtree rooted at the parent 1) Get values L and R from left and right children 2) Save L in a local register Lsave 3) Pass sum L+R to parent 1) Get value S from parent (the root gets 0) 2) Send S to the left child 3) Send S + Lsave to the right child

702/09/05CS267 Lecture 7 Adding two n-bit integers in O(log n) time °Let a = a[n-1]a[n-2]…a[0] and b = b[n-1]b[n-2]…b[0] be two n-bit binary numbers °We want their sum s = a+b = s[n]s[n-1]…s[0] °Challenge: compute all c[i] in O(log n) time via parallel prefix °Used in all computers to implement addition - Carry look-ahead c[-1] = 0 … rightmost carry bit for i = 0 to n-1 c[i] = ( (a[i] xor b[i]) and c[i-1] ) or ( a[i] and b[i] )... next carry bit s[i] = ( a[i] xor b[i] ) xor c[i-1] for all (0 <= i <= n-1) p[i] = a[i] xor b[i] … propagate bit for all (0 <= i <= n-1) g[i] = a[i] and b[i] … generate bit c[i] = ( p[i] and c[i-1] ) or g[i] = p[i] g[i] * c[i-1] = C[i] * c[i-1] … 2-by-2 Boolean matrix multiplication (associative) = C[i] * C[i-1] * … C[0] * 0 1 … evaluate each P[i] = C[i] * C[i-1] * … * C[0] by parallel prefix

802/09/05CS267 Lecture 7 Multiplying n-by-n matrices in O(log n) time °For all (1 <= i,j,k <= n) P(i,j,k) = A(i,k) * B(k,j) cost = 1 time unit, using n^3 processors °For all (1 <= I,j <= n) C(i,j) =  P(i,j,k) cost = O(log n) time, using a tree with n^3 / 2 processors k =1 n

902/09/05CS267 Lecture 7 Inverting triangular n-by-n matrices in O(log 2 n) time °Fact: °Function TriInv(T) … assume n = dim(T) = 2 m for simplicity °time(TriInv(n)) = time(TriInv(n/2)) + O(log(n)) Change variable to m = log n to get time(TriInv(n)) = O(log 2 n) A 0 C B = A 0 -B CA B If T is 1-by-1 return 1/T else … Write T = A 0 C B In parallel do { invA = TriInv(A) invB = TriInv(B) } … implicitly uses a tree newC = -invB * C * invA Return invA 0 newC invB

1002/09/05CS267 Lecture 7 Inverting Dense n-by-n matrices in O(log n) time °Lemma 1: Cayley-Hamilton Theorem expression for A -1 via characteristic polynomial in A °Lemma 2: Newton’s Identities Triangular system of equations for coefficients of characteristic polynomial, matrix entries = s k °Lemma 3: s k = trace(A k ) =  A k [i,i] =  [ i (A)] k °Csanky’s Algorithm (1976) °Completely numerically unstable 2 i=1 n n 1) Compute the powers A 2, A 3, …,A n-1 by parallel prefix cost = O(log 2 n) 2) Compute the traces s k = trace(A k ) cost = O(log n) 3) Solve Newton identities for coefficients of characteristic polynomial cost = O(log 2 n) 4) Evaluate A -1 using Cayley-Hamilton Theorem cost = O(log n)

1102/09/05CS267 Lecture 7 Evaluating arbitrary expressions °Let E be an arbitrary expression formed from +, -, *, /, parentheses, and n variables, where each appearance of each variable is counted separately °Can think of E as arbitrary expression tree with n leaves (the variables) and internal nodes labeled by +, -, * and / °Theorem (Brent): E can be evaluated in O(log n) time, if we reorganize it using laws of commutativity, associativity and distributivity °Sketch of (modern) proof: evaluate expression tree E greedily by collapsing all leaves into their parents at each time step evaluating all “chains” in E with parallel prefix

1202/09/05CS267 Lecture 7 Evaluating recurrences °Let x i = f i (x i-1 ), f i a rational function, x 0 given °How fast can we compute x n ? °Theorem (Kung): Suppose degree(f i ) = d for all i If d=1, x n can be evaluated in O(log n) using parallel prefix If d>1, evaluating x n takes  (n) time, i.e. no speedup is possible °Sketch of proof when d=1 °Sketch of proof when d>1 degree(x i ) as a function of x 0 is d i After k parallel steps, degree(anything) <= 2 k Computing x i take  (i) steps x i = f i (x i-1 ) = (a i * x i-1 + b i )/( c i * x i-1 + d i ) can be written as x i = num i / den i = (a i * num i-1 + b i * den i-1 )/(c i * num i-1 + d i * den i-1 ) or num i = a i b i * num i-1 = M i * num i-1 = M i * M i-1 * … * M 1 * num 0 dem i c i d i den i-1 den i-1 den 0 Can use parallel prefix with 2-by-2 matrix multiplication

1302/09/05CS267 Lecture 7 Summary of tree algorithms °Lots of problems can be done quickly - in theory - using trees °Some algorithms are widely used broadcasts, reductions, parallel prefix carry look ahead addition °Some are of theoretical interest only Csanky’s method for matrix inversion Solving general tridiagonals (without pivoting) Both numerically unstable Csanky needs too many processors °Embedded in various systems CM-5 hardware control network MPI, Split-C, Titanium, NESL, other languages