1/30 Division by Convergence
Instructor: Prof. 王立洋
Prepared by student: M 蔡鐘葳
2/30 Outline ▓ Speedup of Convergence Division ▓ Hardware Implementation ▓ Analysis of Lookup Table Size ▓ Reference
3/30 Speedup of Convergence Division
4/30 Introduction
Compute $y = 1/d$, then do the multiplication $yz$.
Division can be performed via $2\lceil \log_2 k \rceil - 1$ multiplications.
This is not yet very impressive: with 64-bit numbers and a 5-ns multiplier, the 11 multiplications give a 55-ns division.
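The count on this slide can be checked with a few lines of Python. This is a minimal sketch; the function names are hypothetical, and it assumes the multiplications run strictly one after another with no overlap.

```python
def multiplications_needed(k: int) -> int:
    """Multiplication count 2*ceil(log2 k) - 1 for k-bit operands (k >= 2)."""
    ceil_log2 = (k - 1).bit_length()   # integer form of ceil(log2 k)
    return 2 * ceil_log2 - 1


def division_latency_ns(k: int, multiply_ns: float) -> float:
    """Latency if every multiplication takes multiply_ns and none overlap (assumption)."""
    return multiplications_needed(k) * multiply_ns


print(multiplications_needed(64))     # 11
print(division_latency_ns(64, 5.0))   # 55.0
```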
5/30 Three Types of Speedup
Three types of speedup are possible:
▓ Reducing the number of multiplications (reduce $m$)
▓ Using narrower multiplications (reduce the width of some $x^{(i)}$s)
▓ Performing the multiplications faster
6/30 Initial Approximation
Convergence is slow in the beginning: it takes 6 multiplications to get 8 bits of convergence and another 5 to go from 8 bits to 64 bits.
Since $x^{(0)}x^{(1)}x^{(2)}$ is essentially an approximation to $1/d$, these four initial multiplications can be replaced by a table-lookup step that directly supplies $x^{(0+)}$.
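The slow start can be reproduced with exact rational arithmetic. The sketch below assumes the worst case $d = 1/2$ and counts bits of convergence as the largest $b$ with $1 - d^{(i)} \le 2^{-b}$; the function name is hypothetical. Each iteration stands for one pair of multiplications (the $d$ product and the matching $z$ product), except that the final iteration needs only the $z$ product.

```python
from fractions import Fraction


def bits_of_convergence(d: Fraction, iterations: int) -> list[int]:
    """Bits of agreement between d(i) and 1 after each iteration (needs 1/2 <= d < 1)."""
    history = []
    for _ in range(iterations):
        x = 2 - d            # multiplicative factor x(i) = 2 - d(i)
        d = d * x            # d(i+1) = d(i) * x(i)
        gap = 1 - d
        bits = 0
        while gap * 2 ** (bits + 1) <= 1:   # largest b with gap <= 2^-b
            bits += 1
        history.append(bits)
    return history


print(bits_of_convergence(Fraction(1, 2), 6))   # [2, 4, 8, 16, 32, 64]
```

Three iterations (6 multiplications) reach 8 bits, and three more reach 64 bits, of which the last contributes a single multiplication, giving the "another 5".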
7/30 Initial Approximation via Table Lookup
A $2^w \times w$ lookup table is necessary and sufficient for $w$ bits of convergence after the first pair of multiplications.
$d\,x^{(0)}x^{(1)}x^{(2)} = (0.1111\,1111\ldots)_{\text{two}}$
The product $x^{(0)}x^{(1)}x^{(2)}$ is a better approximation to $1/d$ than $x^{(0)}$ alone. Read this value, $x^{(0+)}$, directly from a table, thereby reducing 6 multiplications to 2.
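8/30 Example with 4-bit lookup
$d = (0.1011\,xxxx\ldots)_{\text{two}}$, so $11/16 \le d < 12/16$.
Inverses of the two extremes are $16/11 \approx (1.0111)_{\text{two}}$ and $16/12 \approx (1.0101)_{\text{two}}$.
So, $(1.0110)_{\text{two}} = 11/8$ is a good estimate for $1/d$:
$(1.0110)_{\text{two}} \times (0.1011)_{\text{two}} = (11/8)(11/16) = 121/128 = (0.1111001)_{\text{two}}$
$(1.0110)_{\text{two}} \times (0.1100)_{\text{two}} = (11/8)(3/4) = 33/32 = (1.000010)_{\text{two}}$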
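The arithmetic in this example can be verified exactly. The sketch below uses Python's `fractions` module; `to_binary` is a hypothetical helper that truncates to a given number of fraction bits.

```python
from fractions import Fraction

x0 = Fraction(11, 8)                                  # table output 11/8
d_low, d_high = Fraction(11, 16), Fraction(12, 16)    # extremes of the interval for d


def to_binary(value: Fraction, frac_bits: int) -> str:
    """Render a non-negative Fraction in binary, truncated to frac_bits fraction bits."""
    scaled = int(value * 2 ** frac_bits)              # int() truncates toward zero
    digits = format(scaled, "b").zfill(frac_bits + 1)
    return digits[:-frac_bits] + "." + digits[-frac_bits:]


for d in (d_low, d_high):
    product = d * x0
    print(product, to_binary(product, 7))
```

Both products lie within $2^{-4}$ of 1, one below and one above, which is the 4 bits of convergence the lookup is meant to deliver.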
9/30 Fig. Convergence in division by repeated multiplications with initial table lookup.
[Figure labels: after table lookup and first pair of multiplications, replacing several iterations; after the second pair of multiplications.]
10/30 Fig
For division by repeated multiplications, we saw that convergence to 1 and $q$ occurred from below.
If at some point in our iterations $d^{(i)}$ overshoots 1 (becomes $1 + \varepsilon$), the next multiplicative factor $2 - d^{(i)} = 1 - \varepsilon$ leads to a value $d^{(i+1)} = (1 + \varepsilon)(1 - \varepsilon) = 1 - \varepsilon^2$ that is smaller than 1 but still closer to 1.
11/30 Analysis of Truncated Multiplicative Factors (1/2)
We begin by noting that
$d\,x^{(0)}x^{(1)}\cdots x^{(i)} = 1 - y^{(i)}$
$x^{(i+1)} = 2 - (1 - y^{(i)}) = 1 + y^{(i)}$
Assume that we truncate $1 - y^{(i)}$ to an $a$-bit fraction, thus obtaining $(1 - y^{(i)})_T$ with an error of $\alpha < 2^{-a}$.
12/30 Analysis of Truncated Multiplicative Factors (2/2)
With this truncated multiplicative factor, we get
$(x^{(i+1)})_T = 2 - (1 - y^{(i)})_T = 1 + y^{(i)} + \alpha$, where $0 \le (x^{(i+1)})_T - x^{(i+1)} < 2^{-a}$.
Thus
$d\,x^{(0)}x^{(1)}\cdots x^{(i)}(x^{(i+1)})_T = (1 - y^{(i)})(1 + y^{(i)} + \alpha) = 1 - (y^{(i)})^2 + \alpha(1 - y^{(i)}) = d\,x^{(0)}x^{(1)}\cdots x^{(i)}x^{(i+1)} + \alpha(1 - y^{(i)})$
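The identity above can be checked numerically with exact fractions. In this sketch, `truncate` and `truncated_step` are hypothetical helper names, and truncation is modelled as dropping all fraction bits beyond the $a$-th.

```python
from fractions import Fraction
from math import floor


def truncate(value: Fraction, a: int) -> Fraction:
    """Keep only the first a fraction bits of a non-negative value."""
    return Fraction(floor(value * 2 ** a), 2 ** a)


def truncated_step(y: Fraction, a: int):
    """Return (alpha, exact_product, truncated_product) for a current product 1 - y."""
    current = 1 - y                          # d x(0) ... x(i)
    alpha = current - truncate(current, a)   # truncation error, 0 <= alpha < 2^-a
    x_exact = 2 - current                    # 1 + y
    x_trunc = 2 - truncate(current, a)       # 1 + y + alpha
    return alpha, current * x_exact, current * x_trunc


y = Fraction(3, 1000)
alpha, exact, trunc = truncated_step(y, a=12)
print(trunc - exact == alpha * (1 - y))      # True
```

Because $1 - y^{(i)} < 1$, the extra term $\alpha(1 - y^{(i)})$ stays below $2^{-a}$, so truncating the factor perturbs the product by less than one unit in the $a$-th fraction position.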
13/30 Fig. Convergence in division by repeated multiplications with initial table lookup and the use of truncated multiplicative factors.
14/30 Fig
The first pair of multiplications following the table lookup involve a narrow multiplier, so they may be faster than full-width multiplications.
If the multiplier is suitably truncated, the result is that convergence occurs from above or below.
15/30 Fig. One step in convergence division with truncated multiplicative factors.
16/30 Fig
If we aim to go from $l$ bits to $2l$ bits of convergence, we can truncate the next multiplicative factor to $2l$ bits.
Consider the figure: point A, the result of the precise iteration, is no more than $2^{-2l}$ below 1.
With $a = 2l$, point B, arrived at by the approximate iteration, will be no more than $2^{-2l}$ above 1.
17/30 Example: 64-bit multiplication
▓ Initial step: table of size $256 \times 8$ = 2K bits
▓ Middle steps: multiplication pairs, with 9-, 17-, and 33-bit multipliers
▓ Final step: full $64 \times 64$ multiplication
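The numbers in this example follow a doubling pattern that is easy to tabulate. The sketch below assumes each narrow multiplier carries one integer bit plus as many fraction bits as the convergence its step yields (8, 16, 32). This reading reproduces the slide's widths but is an interpretation rather than something the slide states. Function names are hypothetical.

```python
def table_bits(w: int) -> int:
    """Size in bits of a 2^w x w lookup table."""
    return 2 ** w * w


def multiplier_widths(k: int, w: int) -> list[int]:
    """Widths of the narrow multipliers used before the final full-width step.

    Assumption: a step yielding `bits` bits of convergence uses a multiplier
    with one integer bit and `bits` fraction bits.
    """
    widths = []
    bits = w
    while bits < k:
        widths.append(bits + 1)
        bits *= 2            # each step doubles the bits of convergence
    return widths


print(table_bits(8))               # 2048
print(multiplier_widths(64, 8))    # [9, 17, 33]
```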
18/30 Hardware Implementation
19/30 Hardware Implementation
Fig. Two multiplications fully overlapped in a 2-stage pipelined multiplier.
20/30 Fig
As the computation of $z^{(i)}x^{(i)}$ moves from the top to the bottom pipeline stage, the next iteration begins by computing the top stage of $d^{(i+1)}x^{(i+1)}$.
21/30 Implementing Division with Reciprocation
Reciprocation: multiplication pairs are data-dependent, so they cannot be pipelined or performed in parallel.
In the recurrence $x^{(i+1)} = x^{(i)}(2 - x^{(i)}d)$, the second multiplication by $x^{(i)}$ needs the result of the first one.
The most promising speedup method relies on deriving a better starting approximation to $1/d$.
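The data dependence is visible when the recurrence is written out step by step. This is a floating-point sketch for illustration only; the starting value `x0 = 1.5` used in the example call is an assumption, not taken from the slides.

```python
def reciprocal(d: float, x0: float, iterations: int) -> float:
    """Approximate 1/d with x(i+1) = x(i) * (2 - x(i) * d)."""
    x = x0
    for _ in range(iterations):
        t = x * d              # first multiplication
        x = x * (2.0 - t)      # second multiplication cannot start until t is known
    return x


print(reciprocal(0.75, 1.5, 6))   # close to 1/0.75
```

Since each line of the loop body consumes the result of the previous one, a pipelined multiplier has nothing to overlap within an iteration, unlike the independent $d$ and $z$ products in division by repeated multiplications.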
22/30 The Required Lookup Table
The required lookup table can be made smaller, or totally eliminated, by a variety of methods:
▓ Store the reciprocal values for fewer points
▓ Use linear or higher-order interpolation to compute the starting approximation
▓ Formulate the starting approximation as a multi-operand addition problem
▓ Use one pass through the multiplier's CSA tree, suitably augmented, to compute it
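The first two methods can be combined as in the sketch below: reciprocals are stored at a small number of evenly spaced points and the starting approximation is obtained by linear interpolation. The even spacing, the 16-segment size, and the function names are assumptions made for illustration.

```python
def build_reciprocal_table(segments: int) -> list[float]:
    """Reciprocals at segments + 1 evenly spaced points of [1/2, 1]."""
    return [1.0 / (0.5 + 0.5 * i / segments) for i in range(segments + 1)]


def approx_reciprocal(d: float, table: list[float]) -> float:
    """Linear interpolation between the table points that bracket d (1/2 <= d < 1)."""
    segments = len(table) - 1
    position = (d - 0.5) * 2 * segments          # fractional index into the table
    index = min(int(position), segments - 1)
    frac = position - index
    return table[index] + frac * (table[index + 1] - table[index])


table = build_reciprocal_table(16)
worst = max(abs(approx_reciprocal(0.5 + j / 2000, table) - 1 / (0.5 + j / 2000))
            for j in range(1000))
print(worst < 2 ** -8)   # True
```

With only 17 stored values, the interpolated start is already accurate to about 8 bits, at the cost of one multiply-add per lookup.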
23/30 Analysis of Lookup Table Size
24/30 Theorem for Table Size
Theorem 16.1: To get $w \ge 5$ bits of convergence after the first iteration of division by repeated multiplications, $w$ bits of $d$ (beyond the mandatory 1) must be inspected. The factor $x^{(0+)}$ read out from the table is of the form $(1.xxx\ldots xxx)_{\text{two}}$, with $w$ bits after the radix point.
Based on the theorem, the required table size is $2^w \times w$.
The cases $w < 5$ are practically uninteresting (they allow a smaller table), so we can ignore them.
25/30 Analysis of Lookup Table Size (1/4)
Recall that our objective is to have $1 - 2^{-w} \le d\,x^{(0+)} \le 1 + 2^{-w}$.
Let $d = (0.1\,d_{-2}d_{-3}\ldots d_{-(w+1)}\,d_{-(w+2)}\ldots d_{-l})_{\text{two}}$, where $d_{-2}d_{-3}\ldots d_{-(w+1)}$ are the $w$ bits to be inspected.
Theorem 16.1 postulates the existence of $x^{(0+)} = (1.x^{+}_{-1}x^{+}_{-2}\ldots x^{+}_{-w})_{\text{two}}$ satisfying the objective inequality.
26/30 Analysis of Lookup Table Size (2/4)
Let $u = (1\,d_{-2}d_{-3}\ldots d_{-(w+1)})_{\text{two}}$, satisfying $2^w \le u < 2^{w+1}$.
We have $2^{-(w+1)}u \le d < 2^{-(w+1)}(u+1)$.
Similarly, let $v = (1\,x^{+}_{-1}x^{+}_{-2}\ldots x^{+}_{-w})_{\text{two}}$.
The objective inequality can be rewritten as $2^w - 1 \le dv \le 2^w + 1$.
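27/30 Analysis of Lookup Table Size (3/4)
We derive the following sufficient conditions:
$2^w - 1 \le 2^{-(w+1)}uv$
$2^{-(w+1)}(u+1)v \le 2^w + 1$
Solving each condition for $v$ leads to the following restrictions on $v$:
$v \ge \dfrac{2^{w+1}(2^w - 1)}{u}$ and $v \le \dfrac{2^{w+1}(2^w + 1)}{u+1}$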
28/30 Analysis of Lookup Table Size (4/4)
The latter condition can be rewritten in an equivalent form; showing that this last inequality always holds is left as an exercise.
This completes the "sufficiency" part of the proof.
For the remaining part: at least $w$ bits of $d$ must be inspected, and $x^{(0+)}$ must have at least $w$ bits after the radix point.
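The sufficiency argument can be exercised by building the table. The sketch below scans every $u$ in $[2^w, 2^{w+1})$, looks for an integer $v$ with $2^{w+1}(2^w - 1) \le uv$ and $(u+1)v \le 2^{w+1}(2^w + 1)$, and stores the $w$ fraction bits of the smallest such $v$. Taking the smallest $v$ is a choice made here for illustration, and `build_table` is a hypothetical name.

```python
def build_table(w: int) -> list[int]:
    """Entry at address u - 2^w holds the w fraction bits of x(0+) for that u."""
    table = []
    for u in range(2 ** w, 2 ** (w + 1)):
        low_num = 2 ** (w + 1) * (2 ** w - 1)
        high_num = 2 ** (w + 1) * (2 ** w + 1)
        lower = -(-low_num // u)          # ceiling of the lower bound on v
        upper = high_num // (u + 1)       # floor of the upper bound on v
        if lower > upper:
            raise ValueError(f"no valid v for u = {u}")
        table.append(lower - 2 ** w)      # drop the leading 1 of v
    return table


table8 = build_table(8)
print(len(table8), table8[55])   # 256 164
```

For $w = 8$ every address receives an entry that fits in $w$ bits, which matches the $2^w \times w$ table size stated by the theorem.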
29/30 Example
Table 16.2 Sample entries in the lookup table replacing the first four multiplications in division by repeated multiplications
Columns: Address; $d = 0.1\,xxxx\,xxxx$; $x^{(0+)} = 1.xxxx\,xxxx$
Example: table entry at address 55 ($311/512 \le d < 312/512$).
For 8 bits of convergence, the table entry $f$ must satisfy
$(311/512)(1 + .f) \ge 1 - 2^{-8}$
$(312/512)(1 + .f) \le 1 + 2^{-8}$
that is, $199/311 \le .f \le 101/156$, or $163.81 \le 256 \times .f \le 165.74$.
Two choices: $164 = (1010\,0100)_{\text{two}}$ or $165 = (1010\,0101)_{\text{two}}$.
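The bounds in this example can be recomputed exactly. The sketch below derives the admissible range of $.f$ for address 55 from the two inequalities and lists the 8-bit entries that fall inside it; the variable names are hypothetical.

```python
from fractions import Fraction
from math import ceil, floor

w, address = 8, 55
u = 2 ** w + address                                   # 311
d_low = Fraction(u, 2 ** (w + 1))                      # 311/512
d_high = Fraction(u + 1, 2 ** (w + 1))                 # 312/512

f_min = (1 - Fraction(1, 2 ** w)) / d_low - 1          # from d_low  * (1 + .f) >= 1 - 2^-w
f_max = (1 + Fraction(1, 2 ** w)) / d_high - 1         # from d_high * (1 + .f) <= 1 + 2^-w

choices = list(range(ceil(f_min * 2 ** w), floor(f_max * 2 ** w) + 1))
print(f_min, f_max)                                    # 199/311 101/156
print([format(c, "08b") for c in choices])             # ['10100100', '10100101']
```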
30/30 Reference [1] Behrooz Parhami, “Computer Arithmetic Algorithms and Hardware Designs,” Oxford University Press