1 A Timing-Driven Synthesis Approach of a Fast Four-Stage Hybrid Adder in Sum-of-Products
Sabyasachi Das, University of Colorado, Boulder
Sunil P. Khatri, Texas A&M University
2 What is a Sum-of-Product (SOP)
- An arithmetic Sum-of-Product (SOP) block consists of an arbitrary number of product terms and sum terms.
- General form of SOP:
    p = a * b
    q = c * d
    z = p + q + e + f
[Figure: dataflow of the general SOP, with inputs a, b, c, d, e, f, intermediate products p and q, and output z]
3 Examples of SOP Blocks
- Multiplier { assign z = a * b }: found in microprocessors
- Multiply-Accumulator { assign z = (a * b) + c }: found in cryptographic applications
- Squarer { assign z = a * a }: found in DSP processors
- Addition Tree { assign z = a + b + c + d }: found in ALUs and wireless applications
- Generalized SOP { assign z = (a * b) + (c * d) }: found in FIR filters and IIR filters
4 Synthesis of Sum-of-Products
Synthesis of Sum-of-Product blocks is done in 3 steps (in the order of data flow):
- Creation of Partial Products
- Reduction of Partial Products into 2 operands
- Computation of Final Sum by adding the 2 operands
[Figure: Inputs -> Creation of Partial Products -> Reduction of Partial Products -> Computation of Final Sum -> Output]
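The three steps can be sketched behaviorally for the general form z = (a * b) + (c * d) + e + f. This is a minimal illustration, not the synthesis flow itself: it assumes unsigned inputs, simple AND-row partial products, and a chain of 3:2 carry-save compressors for the reduction. The function names are hypothetical.

```python
def partial_products(a, b, width):
    """Step 1: one shifted copy of a for every set bit of b (unsigned)."""
    return [a << i for i in range(width) if (b >> i) & 1]


def carry_save(x, y, z):
    """3:2 compression: three operands in, two out, with x + y + z == s + c."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def reduce_to_two(operands):
    """Step 2: compress the operand list until only two operands remain."""
    ops = list(operands)
    while len(ops) > 2:
        s, c = carry_save(ops[0], ops[1], ops[2])
        ops = ops[3:] + [s, c]
    while len(ops) < 2:
        ops.append(0)
    return ops[0], ops[1]


def sop(a, b, c, d, e, f, width=8):
    """z = a*b + c*d + e + f, computed in the three data-flow steps."""
    terms = partial_products(a, b, width) + partial_products(c, d, width) + [e, f]
    x, y = reduce_to_two(terms)
    return x + y  # Step 3: the final adder, the subject of these slides


print(sop(3, 5, 7, 9, 2, 4))
```

Only the last line of `sop` involves carry propagation, which is why the final adder is treated as a separate synthesis problem.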
5 Motivation and Problem Statement
- SOP blocks are widely used and computationally intensive.
- The final adder accounts for about 30% to 40% of the delay of the SOP block. This paper focuses on the synthesis of an efficient final adder for an SOP expression.
- Stand-alone adder architectures do not work well in SOP.
6 Stand-alone Adder Architectures
Frequently used adder architectures:
- Ripple-Carry
    - Area-efficient, but slow
    - Timing-efficient if inputs have skewed arrival times
- Parallel-Prefix architecture (Brent-Kung, Kogge-Stone)
    - Faster architecture
    - Requires more area
- Carry-Select
    - Large area overhead (often >100%)
    - Better delay if the C_in signal arrives late
None of these is very suitable in Sum-of-Products. Why?
7 Special Arrival-time Property
- The 2 operands of the final adder in an SOP exhibit a peculiar arrival-time pattern.
- As a result, traditional monolithic adders, which are optimized for equal arrival times, do not work well in SOP.
- Hence, hybrid adders that exploit this arrival-time pattern are required.
- It is therefore critical to synthesize an efficient hybrid adder designed specifically for SOP blocks.
8 Proposed 4-Stage Hybrid Adder
[Figure: the adder split into four SubAdders: SubAdder 1 (Ripple-Carry, w_1 bits), SubAdder 2 (Kogge-Stone, w_2 bits), SubAdder 3 (Carry-Select, w_3 bits), SubAdder 4 (Carry-Select, w_4 bits)]
- Ripple-Carry architecture near the LSB
- Fast Kogge-Stone architecture near the middle
- 2 Carry-Selects (based on Brent-Kung) near the MSB
GOAL: Find w_1, w_2, w_3 and w_4 algorithmically
9 Notations
We use the following notations:
- The bit-width of SubAdder 1 (Ripple) is w_1 bits
- The bit-width of SubAdder 2 (Kogge-Stone) is w_2 bits
- The bit-width of SubAdder 3 (Carry-Select, Brent-Kung) is w_3 bits
- The bit-width of SubAdder 4 (Carry-Select, Brent-Kung) is w_4 bits
- w_1 + w_2 + w_3 + w_4 = n (total width of the hybrid adder)
- T(a_i) = time when input signal a_i is available
- T(S_i) = time when output signal S_i (Sum_i) is available
- T(C_i) = time when output signal C_i (Carry_i) is available
10 SubAdder 1 (Ripple-Carry)
- Most area-efficient architecture
- Very slow
- Timing-efficient if the input arrival times are skewed. We use it for a few bits near the LSB (which arrive earliest).
[Figure: chain of full adders (FA) with inputs x_0, y_0 through x_k, y_k, sum outputs z_0 through z_k, and carry-out z_{k+1}]
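A small timing model shows why skewed arrival times favor the ripple-carry chain. The sketch below assumes a unit full-adder delay and a carry-in available at time 0. The function name and the arrival-time values are illustrative, not taken from the paper.

```python
def ripple_carry_times(t_a, t_b, fa_delay=1.0, t_cin=0.0):
    """Return [T(C_0), T(C_1), ..., T(C_k)] for a ripple-carry chain.

    Each carry is ready one full-adder delay after the latest of its
    two operand bits and the incoming carry.
    """
    times = [t_cin]
    for ta, tb in zip(t_a, t_b):
        times.append(max(ta, tb, times[-1]) + fa_delay)
    return times


# All inputs arrive together: the carry-out lags the inputs by 4 FA delays.
flat = ripple_carry_times([0.0] * 4, [0.0] * 4)

# Each bit arrives 2 time units after the previous one: the carry chain
# keeps up, and the carry-out lags the last input by a single FA delay.
skew = [0.0, 2.0, 4.0, 6.0]
skewed = ripple_carry_times(skew, skew)

print(flat[-1] - 0.0, skewed[-1] - skew[-1])
```

In the skewed case the carry is always waiting for the next operand bit rather than the other way around, so the slow chain costs nothing.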
11 Parallel-Prefix Adders (KS, BK)
- In a Parallel-Prefix adder, the carry for each bit is computed by an efficient tree structure (using the Generate and Propagate concept).
- For each bit i of the adder, Generate (G_i) indicates whether a carry is generated from that bit:
    G_i = a_i · b_i (AND)
- For each bit i of the adder, Propagate (P_i) indicates whether a carry is propagated through that bit:
    P_i = a_i ⊕ b_i (XOR)
- The Generate and Propagate concept extends to blocks comprising multiple bits, as we discuss next.
12 Parallel-Prefix Adders (KS, BK)
- If two blocks (comprising one or more bits) have the GP value-pairs (G_left, P_left) and (G_right, P_right), then the combined block has the following GP values, where + denotes OR and · denotes AND:
    G_{left,right} = G_left + (P_left · G_right)
    P_{left,right} = P_left · P_right
- The above computation is performed by a carry-operator or "o"-operator.
- Once we obtain the carry for each bit, it is trivial to compute the sum output of each bit (XOR and NAND).
[Figure: an "o"-operator combining (G_left, P_left) and (G_right, P_right) into (G_{left,right}, P_{left,right})]
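The bit-level Generate/Propagate pair and the "o"-operator can be written as a short sketch. The names `gp_bit` and `o` are illustrative, and Propagate is taken here as the XOR of the two operand bits, which is an assumption about the symbol lost from the slide.

```python
def gp_bit(a, b):
    """(G, P) for one bit position: generate = a AND b, propagate = a XOR b."""
    return a & b, a ^ b


def o(left, right):
    """Carry operator: merge a more significant block with a less significant one."""
    g_l, p_l = left
    g_r, p_r = right
    return g_l | (p_l & g_r), p_l & p_r


# 2-bit example: a = 0b11, b = 0b01 (3 + 1 = 4, so the 2-bit carry-out is 1).
bit0 = gp_bit(1, 1)   # a_0 = 1, b_0 = 1
bit1 = gp_bit(1, 0)   # a_1 = 1, b_1 = 0
block = o(bit1, bit0)
print(block)  # G of the block is the carry-out when the carry-in is 0
```

The operator is associative, which is what lets the prefix adders regroup the same computation into trees of different shapes.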
13 SubAdder 2 (Kogge-Stone)
- Kogge-Stone parallel-prefix architecture
- Delay: log_2(n) levels of "o"-operators
- Area: (n * log_2(n)) - n + 1 "o"-operators
[Figure: 8-bit Kogge-Stone prefix tree with inputs GP_0 through GP_7 and carry outputs C_1 through C_8]
Kogge and Stone, "A parallel algorithm for the efficient solution of a general class of recurrence equations", IEEE Transactions on Computers, 1973
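The level and operator counts can be checked with a behavioral sketch of a Kogge-Stone prefix computation. It assumes a carry-in of 0 and a power-of-two width, and it models logic only, not gate delays. The function name is hypothetical.

```python
def kogge_stone_add(a, b, n):
    """Add two n-bit unsigned values; return (sum, levels, o_operators)."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    g = [x & y for x, y in zip(a_bits, b_bits)]
    p = [x ^ y for x, y in zip(a_bits, b_bits)]

    G, P = g[:], p[:]
    levels = ops = 0
    dist = 1
    while dist < n:
        new_g, new_p = G[:], P[:]
        for i in range(dist, n):
            # "o"-operator: combine block ending at i with block ending at i - dist
            new_g[i] = G[i] | (P[i] & G[i - dist])
            new_p[i] = P[i] & P[i - dist]
            ops += 1
        G, P = new_g, new_p
        dist *= 2
        levels += 1

    carries = [0] + G          # carries[i] is the carry into bit i
    total = carries[n] << n    # carry-out becomes bit n of the result
    for i in range(n):
        total |= (p[i] ^ carries[i]) << i
    return total, levels, ops


print(kogge_stone_add(200, 100, 8))
```

For n = 8 the sketch uses 3 levels and 17 operators, matching log_2(8) = 3 and 8 * 3 - 8 + 1 = 17.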
14 Brent-Kung (BK)
- Brent-Kung parallel-prefix architecture
- Delay: (2 * log_2(n)) - 2 levels of "o"-operators
- Area: (2 * n) - 2 - log_2(n) "o"-operators
[Figure: 8-bit Brent-Kung prefix tree with inputs GP_0 through GP_7 and carry outputs C_1 through C_8]
Brent and Kung, "A regular layout for parallel adders", IEEE Transactions on Computers, 1982
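To make the delay/area trade-off concrete, the sketch below evaluates the level and operator counts for both prefix architectures, using the formulas exactly as stated on the slides. It assumes n is a power of two, and the function names are illustrative.

```python
def _log2_exact(n):
    """log_2(n) for a power-of-two n (assumption of this sketch)."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    return n.bit_length() - 1


def kogge_stone_cost(n):
    """(levels, o_operators) per the Kogge-Stone slide."""
    k = _log2_exact(n)
    return k, n * k - n + 1


def brent_kung_cost(n):
    """(levels, o_operators) per the Brent-Kung slide."""
    k = _log2_exact(n)
    return 2 * k - 2, 2 * n - 2 - k


for n in (8, 16, 32, 64):
    print(n, kogge_stone_cost(n), brent_kung_cost(n))
```

At n = 32, for instance, Kogge-Stone needs 5 levels and 129 operators while Brent-Kung needs 8 levels and 57 operators: Kogge-Stone buys speed with area, which is why it is reserved for the SubAdder that receives the latest-arriving bits.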
15 SubAdder 3 & SubAdder 4 (Carry-Select)
[Figure: Adder 0 adds x and y with carry-in 1'b0 to produce z0; Adder 1 adds x and y with carry-in 1'b1 to produce z1; a Mux controlled by c_in selects z from z0 and z1]
- Large area overhead
- Used as a special case, since C_in arrives late
- Speed depends on the architecture of the two adders
- But these adders need not be KS (rather, we use BK)
- The arrival times of the inputs of SubAdder 3 and SubAdder 4 are earlier than those for SubAdder 2
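The carry-select structure in the figure can be sketched behaviorally. Both candidate sums are computed before the carry-in is known, and the late-arriving carry-in only drives the final multiplexer. The inner additions stand in for the two Brent-Kung adders, and the function name is hypothetical.

```python
def carry_select_add(x, y, c_in, width):
    """Return (sum, carry_out) of a width-bit carry-select block."""
    mask = (1 << width) - 1
    x &= mask
    y &= mask
    z0 = x + y        # Adder 0: carry-in tied to 1'b0
    z1 = x + y + 1    # Adder 1: carry-in tied to 1'b1
    z = z1 if c_in else z0   # Mux: the only logic that waits for c_in
    return z & mask, z >> width


print(carry_select_add(0b0011, 0b0100, 1, 4))
```

Because only the mux sits behind c_in, the block adds roughly one mux delay after the carry arrives, at the cost of duplicating the adder.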
16 Determination of width of SubAdder 1
Width of the Ripple adder (SubAdder 1):
- At every bit i, compute T(C_{i+1}) and check whether
    T(C_{i+1}) ≤ T(a_{i+1})
    T(C_{i+1}) ≤ T(b_{i+1})
- If the check passes, i = i+1
- Else continue checking until 3 consecutive bits fail the check (Hill Climbing)
- Return the value i as the Ripple Adder width
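One possible reading of this procedure is sketched below. The unit full-adder delay, the carry-in at time 0, and the choice to return the last bit position at which the check passed are assumptions of the sketch, as is the function name.

```python
def ripple_width(t_a, t_b, fa_delay=1.0, max_fails=3):
    """Pick w_1 from per-bit operand arrival times t_a[i], t_b[i]."""
    n = len(t_a)
    t_carry = 0.0   # T(C_0), assumed available at time 0
    width = 0
    fails = 0
    for i in range(n - 1):
        # T(C_{i+1}): one full-adder delay after the latest input of bit i
        t_carry = max(t_a[i], t_b[i], t_carry) + fa_delay
        if t_carry <= t_a[i + 1] and t_carry <= t_b[i + 1]:
            width = i + 1   # carry is ready before the next operand bits
            fails = 0
        else:
            fails += 1      # tolerate short dips before giving up
            if fails == max_fails:
                break
    return width


# Arrival times rise by 2 per bit for the first bits, then flatten out.
arrivals = [0, 2, 4, 6, 8, 8, 8, 8, 8, 8]
print(ripple_width(arrivals, arrivals))
```

Allowing up to three consecutive failures lets the search step over a local dip in the arrival-time profile instead of stopping at the first bit where the carry is late.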
17 Determination of width of SubAdder 2
Width of the Kogge-Stone Adder (SubAdder 2):
- The latest-arriving signals are part of this adder
- Hence keep this adder wide, while ensuring that this does not result in very narrow Carry-Select adders for SubAdder 3 and SubAdder 4
- We determine the width with the following equation:
    w_2 = n - w_1                                   if (n - w_1) ≤ 8
    w_2 = 2^p, where p = floor(log_2(n - w_1))      if (n - w_1) > 8
Example: if n = 32 and w_1 = 7, then n - w_1 = 25, p = 4 and w_2 = 16
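The width rule can be written directly as a small function. The sketch assumes that p is rounded down, which is the reading consistent with the slide's example (n = 32, w_1 = 7 gives w_2 = 16). The function name is illustrative.

```python
def kogge_stone_width(n, w1):
    """Return w_2 for an n-bit hybrid adder whose ripple part is w1 bits wide."""
    remaining = n - w1
    if remaining <= 8:
        return remaining
    p = remaining.bit_length() - 1   # floor(log2(remaining)) for remaining > 0
    return 1 << p                    # largest power of two <= remaining


print(kogge_stone_width(32, 7))
```

Rounding down to a power of two keeps the Kogge-Stone tree fully regular and leaves the remaining n - w_1 - w_2 bits (9 in the example) for the two Carry-Select adders.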
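18 Delay of the Hybrid Adder
[Figure: the four SubAdders (Ripple-Carry w_1, Kogge-Stone w_2, Carry-Select w_3, Carry-Select w_4) with output arrival times T(S_2), T(S_3), T(S_4) and T(C_4) marked]
T_hybrid = max(T(C_4), T(S_4), T(S_3), T(S_2))
On this slide, T(S_k) is the time at which the sum outputs of SubAdder k are available, and T(C_4) is the time at which the carry-out of SubAdder 4 is available.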
19 Determination of widths of SubAdder 3 and SubAdder 4
Width of the two Carry-Select adders:
- Initial width configuration:
    w_3 = (n - w_1 - w_2) / 2
    w_4 = (n - w_1 - w_2 - w_3)
- With this initial configuration, estimate the delay of the overall hybrid adder (based on the previous slide)
- Use an iterative approach to explore in the appropriate direction (similar to Binary Search) and converge on the smallest-delay configuration
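The slides describe the search only as "similar to Binary Search", so the sketch below is one plausible interpretation rather than the authors' procedure: a step-halving local search that starts from the even split and moves toward whichever neighbor lowers the estimated delay. The delay estimator is passed in as a hypothetical callback `estimate_delay(w3, w4)`, and the toy model in the example is invented purely for illustration.

```python
def split_carry_select(remaining, estimate_delay):
    """Split `remaining` = n - w_1 - w_2 bits into (w_3, w_4).

    estimate_delay(w3, w4) is assumed to return an estimate of T_hybrid
    for that split. Returns (w3, w4, best_delay).
    """
    w3 = remaining // 2                       # initial configuration
    best = estimate_delay(w3, remaining - w3)
    step = max(remaining // 4, 1)
    while step >= 1:
        moved = False
        for cand in (w3 - step, w3 + step):   # probe both directions
            if 0 <= cand <= remaining:
                d = estimate_delay(cand, remaining - cand)
                if d < best:
                    w3, best, moved = cand, d, True
                    break
        if not moved:
            step //= 2                        # narrow the search
    return w3, remaining - w3, best


# Toy delay model (an assumption): SubAdder 4 pays a fixed 2-unit penalty.
toy = lambda w3, w4: max(w3, w4 + 2)
print(split_carry_select(9, toy))
```

This kind of search finds the minimum only if the delay is roughly unimodal in w_3. That seems plausible here, since widening one Carry-Select adder narrows the other, but it is not something the slides state.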
20 Experimental Setup
To test our approach, we used:
- Adders in several different types of SOP blocks (Multipliers, MAC, generalized SOP and Squarer)
- Two process technologies (0.13µ and 0.09µ)
- Two commercial library vendors
- Two different arrival-time constraints
We compared the results of our hybrid adder with the adder produced by a commercial datapath synthesis tool.
21 Results
On average, the hybrid adder is 14.31% faster than the adder produced by the commercial synthesis tool (with a 6.62% area penalty).
22 Summary
- The hybrid adder consists of 4 SubAdders
- SubAdder 1 has a Ripple-Carry architecture
- SubAdder 2 has a Kogge-Stone architecture
- SubAdder 3 and SubAdder 4 have a Carry-Select (based on Brent-Kung) architecture
- Widths of all SubAdders are computed based on a timing-driven analysis
- On average, 14.31% faster (with a 6.62% area penalty)
23 Thank you