Presentation is loading. Please wait.

Presentation is loading. Please wait.

Grade School Again: A Parallel Perspective CS 15-251 Lecture 7.

Similar presentations


Presentation on theme: "Grade School Again: A Parallel Perspective CS 15-251 Lecture 7."— Presentation transcript:

1 Grade School Again: A Parallel Perspective CS 15-251 Lecture 7

2 Grade School Addition  +  

3 Grade School Addition a4a3a2a1a0a4a3a2a1a0b4b3b2b1b0b4b3b2b1b0a4a3a2a1a0a4a3a2a1a0b4b3b2b1b0b4b3b2b1b0 + c5c4c3c2c1c5c4c3c2c1

4 a 4 a 3 a 2 a 1 a 0 b 4 b 3 b 2 b 1 b 0 + c 5 c 4 c 3 c 2 c 1 s1s1

5 Expressed With Logical Operations s 1 = (a 1 XOR b 1 ) XOR c 1 c 2 = (a 1 AND b 1 ) OR (a 1 AND c 1 ) OR (b 1 AND c 1 )

6 OR AND XOR cici bibi aiai OR c i+1 sisi

7 a i b i cici c i+1 sisi a i b i cici c i+1 sisi Ripple-carry adder

8 0 How long to add two n bit numbers? Propagation time through the circuit will be  (n) Ripple-carry adder

9 Circuits compute things in parallel. We can think of the propagation delay as PARALLEL TIME.

10 Is it possible to add two n bit numbers in less than linear parallel-time?

11 If n people agree to help you add two n bit numbers, it is not obvious that they can finish faster than if you had done it yourself.

12 If we knew the carries it would be very easy to do fast parallel addition 0

13 What do we know about the carry- out before we know the carry-in? a b c in c out s ab C out 000 01 C in 10 111

14 What do we know about the carry- out before we know the carry-in? a b c in c out s ab C out 000 01 10 111

15 Grade School Addition  +              

16 Question: Once we have converted the carries to the values 0,1, and , what is a fast parallel way to convert each position to its final 0/1 value?              

17 Associative, Binary Operator Binary Operator: an operation that takes two objects and returns a third. A  B = CAssociative: ( A  B )  C = A  ( B  C )

18 Examples Addition on the integers Min(a,b) Max(a,b) Left(a,b) = a Right(a,b) = b Boolean AND Boolean OR

19 In what we are about to do “+” will mean an arbitrary binary associative operator.

20 Prefix Sum Problem Input: X n-1, X n-2,…,X 1, X 0 Output:Y n-1, Y n-2,…,Y 1, Y 0 where Y 0 = X 0 Y 1 = X 0 + X 1 Y 2 = X 0 + X 1 + X 2 Y 3 = X 0 + X 1 + X 2 + X 3 Y n-1 = X 0 + X 1 + X 2 + X 3 + … + X n-1......

21 Prefix Sum example Input: 6, 9, 2, 3, 4, 7 Output:31, 25, 16, 14, 11, 7 where Y 0 = X 0 Y 1 = X 0 + X 1 Y 2 = X 0 + X 1 + X 2 Y 3 = X 0 + X 1 + X 2 + X 3 Y n-1 = X 0 + X 1 + X 2 + X 3 + … + X n-1......

22 Any ideas for how we can compute the prefix sums for any binary, associative + quickly in parallel?

23 Example circuitry (n = 4) + X1X1 X0X0 y1y1 X0X0 X2X2 + X1X1 + y2y2 X0X0 y0y0 X3X3 + ++ X2X2 X1X1 X0X0 y3y3

24 y n-1 + Recursive algorithm for computing y n-1 X  n/2  -1 … X1X1 X0X0 sum on  n/2  items X n-1 …X  n/2  X n-2 sum on  n/2  items T(1)=0 T(n) = T(  n/2  ) + 1 T(n) =  log 2 n 

25 Modern computers do something slightly different. This algorithm is fast, but how much does it cost to build?

26 y n-1 + Size of Circuit (number of gates) X  n/2  -1 … X1X1 X0X0 Prefix sum on  n/2  items X n-1 …X  n/2  X n-2 Prefix sum on  n/2  items S(1)=0 S(n) = S(  n/2  ) + S(  n/2  ) +1 S(n) = n-1

27 Sum of Sizes X0X0 y0y0 + X1X1 X0X0 y1y1 X0X0 X2X2 + X1X1 + y2y2 X3X3 + ++ X2X2 X1X1 X0X0 y3y3 S(n) = 0 + 1 + 2 + 3 +  + (n-1) = n(n-1)/2

28 Recursive Algorithm n items (n = power of 2) If n = 1, Y 0 = X 0 ; + X3X3 X2X2 + X1X1 X0X0 + X5X5 X4X4 + X n-3 X n-4 + X n-1 X n-2 …

29 Recursive Algorithm n items (n = power of 2) If n = 1, Y 0 = X 0 ; + X3X3 X2X2 + X1X1 X0X0 + X5X5 X4X4 + X n-3 X n-4 + X n-1 X n-2 … … Prefix sum on n/2 items

30 + Recursive Algorithm n items (n = power of 2) If n = 1, Y 0 = X 0 ; + X3X3 X2X2 + X1X1 X0X0 + X5X5 X4X4 + X n-3 X n-4 + X n-1 X n-2 … +++ Prefix sum on n/2 items Y n-1 Y1Y1 Y0Y0 Y2Y2 Y n-2 Y4Y4 Y3Y3 … …

31 Parallel time complexity T(1)=0; T(2) = 1; T(n) = T(n/2) + 2 T(n) = 2 log 2 (n) - 1 + If n = 1, Y 0 = X 0 ; + X3X3 X2X2 + X1X1 X0X0 + X5X5 X4X4 + X n-3 X n-4 + X n-1 X n-2 … + ++ Prefix sum on n/2 items Y n-1 Y1Y1 Y0Y0 Y2Y2 Y n-2 Y4Y4 Y3Y3 … … 1 T(n/2) 1

32 Size S(1)=0; S(n) = S(n/2) + n - 1 S(n) = 2n – log 2 n -2 + If n = 1, Y 0 = X 0 ; + X3X3 X2X2 + X1X1 X0X0 + X5X5 X4X4 + X n-3 X n-4 + X n-1 X n-2 … + ++ Prefix sum on n/2 items Y n-1 Y1Y1 Y0Y0 Y2Y2 Y n-2 Y4Y4 Y3Y3 … … n/2 S(n/2) (n/2)-1

33 Back to the carries…. A binary, associative operator: 01  0000 1111 01

34 Back to the carries…. 01  0000 1111 01                      

35 Carry look-ahead addition s 1 = (a 1 XOR b 1 ) XOR c 1 a a 4 4 a a 3 3 a a 2 2 a a 1 1 a a 0 0 b b 4 4 b b 3 3 b b 2 2 b b 1 1 b b 0 0 + c 5 c 4 c 3 c 2 c 1 s 1

36 Carry Look-Ahead Addition To add two n-bit numbers: a and b 1 step to compute x values (   ) 2 log 2 n - 1 steps to compute carries c 1 step to compute c XOR (a XOR b) 2 log 2 n + 1 steps total

37 And this is how addition works on commercial chips….. Processorn2log 2 n +1 80186169 Pentium3211 Alpha6413

38 2’s-complement representation of signed integers As long as we do not overflow, we can add two integers using our addition algorithm. To negate a number flip each of its bits and add one. Subtract by first negating one number, then adding. -64-64-64-6432 161616168421 1010100 = -44

39 How fast can we multiply two n-bit numbers in parallel?

40 Grade School Multiplication X * * * * * * * * * * * * * * * * * * * * * * * * * * * *

41 Grade School Multiplication X * * * * * * * * * * * * * * * * * * * * * * * * * * * * a1a2a3a1a2a3...... an ana1a2a3a1a2a3...... an an...

42 We need to add n 2n-bit numbers: a 1, a 2, a 3,…, a n

43 A tree of carry look-ahead adders + + + + + + + a1a1 anan

44 + + + + + + + a1a1 anan T(n) = log 2 n (2 log 2 2n + 1) =  (log 2 n)

45 We can do much better than that!

46 Carry-Save Addition The sum of three numbers can be converted into the sum of 2 numbers:  + +   + XOR Carries

47 A tree of carry-save adders a1a1 anan + carry look ahead + + + + + + + + + + + +

48 A tree of carry-save adders T(n)  log 3/2 (2  n/3  + (n mod 3)) + 2log 2 2n + 1 + carry look ahead + + + + + + + + + + + +

49 For a 64-bit word that works out to a parallel time of 22. Contrast that with 13 for addition.

50 Grade School Division * * * * * * * * * * * * * * * * * * * * * * * * * * * * * n bits of precision: n subtractions costing 2log 2 n + 1 each  (n logn)

51 We can do much better than that!

52 Representation: Understand the relationship between different representations of the same information or idea 1 2 3

53 Redundant representation of integers Idea: allow each digit to be –1, 0, or 1 Examples: 0 1 0 1 = 5 1 0 –1 –1 = 5 0 1 1 –1 = 5

54 Addition algorithm for redundant representation 1.Add corresponding digits, no carries 0 1 0 1 -1 0 1 1 75 0 1 0 1 -1 0 1 1 75 + 1 0 –1 0 –1 0 1 0 90 + 1 0 –1 0 –1 0 1 0 90 -------------------------- -------------------------- 1 1 –1 1 –2 0 2 1 165 1 1 –1 1 –2 0 2 1 165 2.Remove +2’s by changing to 0, passing 1 left Remove +1’s by changing to –1, passing 1 left 1 0 -1 0 –1 –2 1 1 -1 165 Now all digits are –2, -1, 0, or 1. Why? Now all digits are –2, -1, 0, or 1. Why? 3.Remove –1 by changing to 1, passing -1 left Remove –2 by changing to 0, passing –1 left 1 –1 1 –1 0 0 1 0 1 165 1 –1 1 –1 0 0 1 0 1 165 We’ve added two numbers in constant parallel time!

55 SRT division algorithm 1 0 1 11 1 1 0 1 1 0 1 23711 21 r 6 Rule: Each bit of quotient is determined by comparing first bit of divisor with first bit of dividend. Easy! -1 0 –1 -1 1 0 –1 1 1 = -1 0 0 1 -1 0 –1 -1 -2 0 1 1 0 1 1 1 2 = 1 0 0 0 -1 0 –1 -1 0 –1 –1 1 10 r –1 –1 1 23711 22 r -5 Time for n bits of precision in result:  3n + 2log 2 (n) + 1 Convert to standard representation by subtracting negative bits from positive. 1 addition per bit

56 Intel Pentium division error The Pentium uses essentially the same algorithm, but computes more than one bit of the result in each step. Several leading bits of the divisor and quotient are examined at each step, and the difference is looked up in a table. The table had several bad entries. Ultimately Intel offered to replace any defective chip, estimating their loss at $475 million.

57 Brent’s Law At best, p processors will give you a factor of p speedup over the time it takes on a single processor.

58 The traditional GCD algorithm will take linear time to operate on two n bit numbers. Can it be done faster in parallel?

59 If n 2 people agree to help you compute the GCD of two n bit numbers, it is not obvious that they can finish faster than if you had done it yourself.

60 Is GCD inherently sequential?

61 No one knows.

62 Deep Blue Many processors help DB look many chess moves ahead. It was not obvious that more processors could really help with game tree search. I remember a heated debate in C.B.’s Ph.D. thesis defense.


Download ppt "Grade School Again: A Parallel Perspective CS 15-251 Lecture 7."

Similar presentations


Ads by Google