Download presentation
Presentation is loading. Please wait.
1
Optimizing high speed arithmetic circuits using three-term extraction Anup Hosangadi Ryan Kastner Farzan Fallah ECE Department Fujitsu Laboratories University of California, Santa Barbara of America
2
2 Outline Carry Save Arithmetic Related Work Problem formulation Algebraic methods Delay aware optimization Experimental results
3
3 Carry Save Arithmetic Multi-Operand addition F = A + B + C + D + E + F Carry propagation major bottleneck Fast adders: Carry Lookahead Adder (CLA), Carry Select Adders, not fast enough Solution: Eliminate Carry propagation to the final step Generate Sums and Carries separately Treat them as separate numbers Keep adding till only two numbers remain Add the numbers using fast adder (CLA)
4
4 Carry Save Arithmetic CSA + A B CD EF Delay = 3 + log 2 (M + 3) 3 = height of CSA tree M = bitwidth of operands S S S SC C C C F CLA Tree height = log 1.5 (N/2)
5
5 Carry Save arithmetic RCA (M +1) Delay = (M+5) + 4 Using Ripple carry adders (RCAs) (M +2) (M +3) (M +4) (M +5) Delay thru CSA network = 3 + log 1.5 (M + 3)
6
6 Related Work Kim et. al “Arithmetic optimization using Carry Save Adders”, DAC’98 + + + + AB C D E F D E CSA ABC + + F
7
7 Related Work Kim. et. al “Optimal allocation of CSAs”, ICCAD’99 Delay aware CSA allocation Kim et. al “High performance, low power synthesis”, DAC’2000 Synopsys TM Behavioral optimization for arithmetic (BOA) A.Verma and P.Ienne “Improved use of the carry save representation for the synthesis of complex arithmetic circuits”, ICCAD’2004 Arithmetic Optimizer?
8
8 Problem formulation No methodology for detecting redundancy in CSA computations Can reduce the number of CSAs Can reduce the number of wires Common subexpression elimination Standard compiler technique Applied to 2-term arithmetic operations –Polynomial expressions (ICCAD’04, VLSI’05) –Constant multiplications (ASAP’04, ASPDAC’05) CSA expressions (Common 3-term subexpressions)
9
9 Problem formulation Y 1 = X 1 + X 1 <<2 + X 2 + X 2 <<1 + X 2 <<2 Y 2 = X 1 <<2 + X 2 <<2 + X 2 <<3 D 1 = X 1 + X 2 + X 2 <<1 Y 1 = (D 1 S + D 1 C ) + X 1 <<2 + X 2 <<2 Y 2 = (D 1 S + D 1 C )
10
10 Algebraic methods Polynomial transformation X<<i = XL i Detects shifted common subexpressions and also extends to multiple variables C × X = (±X×L i ) (14) 10 × X = (1110) 2 × X = X<<3 + X<<2 + X<<1 = XL 3 + XL 2 + XL 1 = (100-10) CSD × X = XL 4 – XL 1
11
11 Algebraic methods 3-term divisors = All potential common subexpressions Divisor generation One for every combination of 3 terms eg. F 1 = X 1 + X 1 L 2 + X 2 + X 2 L + X 2 L 2 d 1 = X 1 L 2 + X 2 L + X 2 L 2 MinL = L Divisor D 1 = d 1 /L = X 1 L + X 2 + X 2 L # of divisors = Theorem: There exists a 3-term common subexpression iff there exists a non-overlapping intersection among the set of 3-term divisors N 3
12
12 Algebraic methods Greedy Iterative algorithm Extracts the “best” 3-term divisor Rewrites the expressions containing it Terminates when there are no more common subexpressions F 1 = a + b + c + d + e F 2 = a + b + c + d + f >> D 1 = a + b + c F 1 = D 1 S + D 1 C + d + e F 2 = D 1 S + D 1 C + d + f >> D 2 = D 1 S + D 1 C + e F 1 = D 2 S + D 2 C + e F 2 = D 2 S + D 2 C + f
13
13 Algebraic methods Algorithm details Optimize ({P i }) { {P i } = Set of expressions in polynomial form; {D} = Set of divisors = φ; // Step 1. Creating divisors and their frequency statistics for each expression P i in {P i } { {D new } = Divisors(P i ); Update frequency statistics of divisors in {D}; {D} = {D} { D new }; } //Step 2. Iterative selection and elimination of best divisor while (1) { Find d = divisor in {D} with most number of non-overlapping intersections; if (d == NULL) break; Rewrite affected expressions in {P i } using d; Remove divisors in {D} that have become invalid; Update frequency statistics of affected divisors; {D new } = Set of new divisors from new terms added by division; {D} = {D} {D new }; }
14
14 Algebraic methods Algorithm complexity M expressions, each with N terms Divisor generation = M* = O(MN 3 ) Iterative algorithm, worst case – N terms reduced to 2 terms = (N -2) steps – M expressions = O(MN) steps N 3
15
15 Delay aware optimization Sharing subexpressions can increase the total delay Traditional high level synthesis approach: Reduce delay by Tree Height Reduction (THR) Our solution: Control delay during optimization itself Optimal delay CSA allocation (T.Kim, J.Um, “Timing driven synthesis”, ASPDAC’2000) – Use this to get minimum possible delay F 1 = a (2) + b (0) + c (0) + d (0) + e (0) F 2 = a (2) + b (0) + c (0) + d (0) + f (0)
16
16 Delay aware optimization Optimal allocation Delay ignorant extraction CSA 000 bcd 1 e a + F1F1 33 10 222 000 bcd 1 f a + F2F2 33 10 222 Delay(F 1 ) = Delay(F 2 ) = 3 + D(Add)
17
17 Delay aware extraction Control delay during optimization Evaluate each candidate divisor for delay Only consider those divisors that do not increase the delay F 1 = a (2) + b (0) + c (0) + d (0) + e (0) F 2 = a (2) + b (0) + c (0) + d (0) + f (0) >> D 1 (3) = a (2) + b (0) + c (0) F 1 = D 1S (3) + D 1C (3) + d (0) + e (0) F 2 = D 1S (3) + D 1C (3) + d (0) + f (0) Delay = 5 + D(Add)
18
18 Delay aware extraction Control delay during optimization Evaluate each candidate divisor for delay Only consider those divisors that do not increase the delay F 1 = a (2) + b (0) + c (0) + d (0) + e (0) F 2 = a (2) + b (0) + c (0) + d (0) + f (0) >> D 2 (1) = b (0) + c (0) + d (0) F 1 = D 2S (1) + D 2C (1) + e (0) + a (2) F 2 = D 2S (1) + D 2C (1) + f (0) + a (2) Delay = 3 + D(Add)
19
19 Experimental results Comparing # of CSAs Average 38.4% reduction
20
20 Experimental results Synthesis for Standard Cell Designs Synopsys TM Design compiler 0.25 micron library Synthesized for minimum delay Avg 32.7% Area reduction Avg 3.7% increase in delay
21
21 Experimental results FPGA synthesis Virtex II FPGAs Synthesized designs and performed place & route Avg 14.1 % reduction in #Slices and Avg 12.9% reduction in # LUTs Avg 5.7% increase in the delay
22
22 Experimental results Evaluate Delay aware extraction algorithm Consider different arrival times of the signals Assume delay dominated by gate delay (FA delay) Only consider best case delay Example# of CSAsDelay (FA units) Delay ignorant Delay aware Delay Ignorant Delay aware H.264787998 DCT82222321413 IDCT81952011413 FIR6tap111554 FIR20tap344565 FIR41tap799165 Average103.2110.598 Best delay with 15.5% increase in #CSAs
23
23 Conclusions First methodology for common subexpression elimination for Carry Save Arithmetic Significant area/power reduction Delay aware optimization algorithm also developed Can be combined with CSA tree extraction methods for actual application improvement
24
24 Thank you!! Questions?
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.