1 Generation of Optimal Bit-Width Topology of Fast Hybrid Adder in a Parallel Multiplier
Sabyasachi Das, Synplicity Inc.
Sunil P. Khatri, Texas A&M University
Presented by David Pan, UT Austin
2 What is a Multiplier?
- IC block that performs the multiplication operation
- Well-known logic architectures
- Computationally intensive
- Wide usage in DSP, graphics, and microprocessors
3 Structure of Multiplier
A multiplier block consists of 3 parts (listed in data-flow order):
- Partial Product Generator (PPGen)
- Partial Product Reduction Tree (PPRT)
- Final Carry-Propagation Adder (CPA)
[Figure: Inputs -> PPGen -> PPRT -> CPA -> Output]
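The three-stage data flow can be illustrated with a small bit-level model. This is only a sketch under assumptions: the function names are hypothetical, the partial products come from a plain AND array, and the reduction uses a simple sequence of 3:2 carry-save steps rather than any particular tree from the talk.

```python
def partial_products(a: int, b: int, n: int) -> list[int]:
    """PPGen: one shifted copy of a for every set bit of the n-bit operand b."""
    return [(a << i) if (b >> i) & 1 else 0 for i in range(n)]


def carry_save_add(x: int, y: int, z: int) -> tuple[int, int]:
    """One layer of 3:2 compressors: three rows in, a sum row and a carry row out."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c  # invariant: x + y + z == s + c


def reduce_rows(rows: list[int]) -> tuple[int, int]:
    """PPRT: compress the rows until only two remain (no carry propagation yet)."""
    rows = list(rows)
    while len(rows) > 2:
        x, y, z = rows.pop(), rows.pop(), rows.pop()
        rows.extend(carry_save_add(x, y, z))
    while len(rows) < 2:
        rows.append(0)
    return rows[0], rows[1]


def multiply(a: int, b: int, n: int) -> int:
    """PPGen -> PPRT -> CPA for unsigned n-bit operands."""
    s, c = reduce_rows(partial_products(a, b, n))
    return s + c  # CPA: the only stage that propagates carries


print(multiply(13, 11, 4))
```

The final `s + c` stands in for the CPA, which is the block whose topology the rest of the talk optimizes.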
4 Final Adder in a Multiplier
Frequently used adder architectures:
- Ripple-Carry
  - Area-efficient, but slow
  - Timing-efficient if inputs have skewed arrival times
- Parallel-Prefix (Brent-Kung, Kogge-Stone)
  - Faster
  - Requires more area
- Carry-Select
  - Large area overhead (often >100%)
  - Better delay if the C_in signal arrives late
5 3-Stage Hybrid Adder
- Multipliers exhibit a typical arrival-time pattern at the inputs of the CPA
- A hybrid adder produces the best result for multipliers
- It outperforms all stand-alone architectures
Stelling et al., "Design Strategies for Optimal Hybrid Final Adders in a Parallel Multiplier", Journal of VLSI Signal Processing, 1996
6 3-Stage Hybrid Adder
- There are many possible configurations (w_rpl, w_bk and w_cs)
- Exhaustive exploration is not feasible (huge runtime)
- How to identify the best configuration?
[Figure: SubAdder 1 (Ripple), width w_rpl | SubAdder 2 (Brent-Kung), width w_bk | SubAdder 3 (Carry-Select), width w_cs]
7 Identification of Optimal Topology
Width of the Ripple adder:
- At every bit i, compute T(C_{i+1}) and check if T(C_{i+1}) ≤ T(a_{i+1}) or T(C_{i+1}) ≤ T(b_{i+1})
- If the check passes, w_rpl = i+1
- Else continue checking until 3 consecutive bits fail the check (Hill Climbing)
- Return the value i as the Ripple adder width
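One possible reading of this scan is sketched below. The slide does not give a carry-delay model, so the sketch assumes T(C_{i+1}) = max(T(a_i), T(b_i), T(C_i)) + d_carry with a constant per-bit delay. The names `ripple_width`, `d_carry`, `t_cin` and `patience` are hypothetical, and the function returns the last width at which the check passed, which is one interpretation of the final bullet.

```python
def ripple_width(t_a, t_b, d_carry=1.0, t_cin=0.0, patience=3):
    """Scan bits from the LSB and extend the ripple adder while the carry
    into the next bit arrives no later than one of that bit's inputs.

    t_a, t_b : per-bit arrival times of the two CPA operands (LSB first)
    d_carry  : assumed constant carry delay per ripple stage
    patience : consecutive failed checks tolerated before stopping
    """
    n = len(t_a)
    w_rpl = 0
    fails = 0
    t_c = t_cin  # arrival time of the carry into bit 0
    for i in range(n - 1):
        # Assumed model for T(C_{i+1}): wait for the latest input, then one stage.
        t_c = max(t_a[i], t_b[i], t_c) + d_carry
        if t_c <= t_a[i + 1] or t_c <= t_b[i + 1]:
            w_rpl = i + 1
            fails = 0
        else:
            fails += 1
            if fails == patience:
                break
    return w_rpl


# Skewed arrivals on the low bits, flat arrivals afterwards.
arrivals = [0, 2, 4, 6, 8, 8, 8, 8, 8, 8]
print(ripple_width(arrivals, arrivals))
```

With these arrival times the check passes for the first four bits and then fails three times in a row, so the scan stops with a ripple width of 4.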
8 Delay of the Hybrid Adder
T_hybrid = max(T_s2, T_co2 + D_mx, T_s3 + D_mx)
[Figure: SubAdder 1 (Ripple, w_rpl), SubAdder 2 (Brent-Kung, w_bk), SubAdder 3 (Carry-Select, w_cs), with timing paths T_s2, T_co2 + D_mx and T_s3 + D_mx]
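The delay expression is straightforward to evaluate once the three path delays are known. In the sketch below the symbols are read from the figure as follows, which is an assumption: T_s2 is the arrival time of the Brent-Kung sum outputs, T_co2 its carry-out, T_s3 the pre-computed carry-select sums, and D_mx the delay of the output multiplexer. The function name and the returned path labels are hypothetical.

```python
def hybrid_delay(t_s2, t_co2, t_s3, d_mx):
    """Return (T_hybrid, critical path label) for one configuration."""
    paths = {
        "T_s2": t_s2,                 # sum bits of SubAdder 2
        "T_co2 + D_mx": t_co2 + d_mx,  # carry-out of SubAdder 2 selecting the mux
        "T_s3 + D_mx": t_s3 + d_mx,    # SubAdder 3 sums passing through the mux
    }
    critical = max(paths, key=paths.get)
    return paths[critical], critical


print(hybrid_delay(t_s2=10.0, t_co2=9.0, t_s3=8.5, d_mx=1.5))
```

In this made-up example the carry-out path dominates, which would suggest shifting bits away from the Brent-Kung sub-adder.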
9 Identification of Optimal Topology
Width of the BK and Carry-Select adders:
- Initial configuration
  - w_bk = 2^p, where p = ⌊log2(n − w_rpl)⌋
  - w_cs = n − w_bk − w_rpl
  - Example: if n = 32 and w_rpl = 7, then w_bk = 16 and w_cs = 9
- Iterative approach
  - Estimate the delay of a configuration and explore in the appropriate direction (similar to Binary Search)
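The initial configuration follows directly from the formulas above. The refinement loop that follows it is only a sketch of a binary-search-like exploration: the slide does not spell out the update rule, so the step-halving strategy, the function names, and the caller-supplied `estimate_delay(w_bk, w_cs)` callback are all assumptions.

```python
def initial_config(n, w_rpl):
    """w_bk = largest power of two not exceeding n - w_rpl; w_cs = the rest."""
    remaining = n - w_rpl
    if remaining < 1:
        raise ValueError("no bits left for the Brent-Kung and Carry-Select adders")
    p = remaining.bit_length() - 1  # floor(log2(remaining)) without float rounding
    w_bk = 1 << p
    return w_bk, remaining - w_bk


def refine(n, w_rpl, estimate_delay):
    """Move w_bk toward lower estimated delay, halving the step when stuck."""
    w_bk, _ = initial_config(n, w_rpl)
    total = n - w_rpl
    best = estimate_delay(w_bk, total - w_bk)
    step = max(1, w_bk // 2)
    while step >= 1:
        moved = False
        for cand in (w_bk - step, w_bk + step):
            if 0 <= cand <= total:
                d = estimate_delay(cand, total - cand)
                if d < best:
                    w_bk, best, moved = cand, d, True
                    break
        if not moved:
            step //= 2
    return w_bk, total - w_bk


print(initial_config(32, 7))
# Toy delay model (made up) whose minimum sits at w_bk = 13.
print(refine(32, 7, lambda w_bk, w_cs: abs(w_bk - 13)))
```

The first call reproduces the slide's example (w_bk = 16, w_cs = 9). In practice `estimate_delay` would be backed by a timing estimate of the candidate adder rather than a toy function.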
10 Results
- For different adder widths, our approach always found the best configuration in very short runtime.
- Runtime example for a 32-bit adder:
  - Trying all possible configurations (561) takes hours of runtime
  - Our approach takes 4-18 minutes of runtime and always computes the best configuration
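The figure of 561 matches the number of ways to split 32 bits into three non-negative widths (w_rpl, w_bk, w_cs). The slides do not state how the count was obtained, so treat the check below as an observation rather than the authors' derivation; the function name is hypothetical.

```python
from math import comb


def count_configurations(n):
    """Count triples (w_rpl, w_bk, w_cs) of non-negative widths summing to n."""
    return sum(1 for w_rpl in range(n + 1) for w_bk in range(n - w_rpl + 1))


print(count_configurations(32))  # matches comb(34, 2)
```

The count grows quadratically with the adder width, so the cost of exhaustive exploration is driven mainly by the time needed to evaluate each configuration.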
11 Results
- It is now feasible to use this powerful hybrid-adder architecture during synthesis (~12% faster adder).
12 Thank you