VLSI Arithmetic Lecture 4 Prof. Vojin G. Oklobdzija University of California http://www.ece.ucdavis.edu/acsel
Review Lecture 3
Variable Block Adder (Oklobdzija, Barnes: IBM 1985) Computer Arithmetic
Carry-chain of a 32-bit Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
The idea behind the Variable Block Adder is to minimize the longest critical path in the carry chain of a Carry Skip Adder by allowing the groups to take different sizes. Such optimization in general does not increase complexity compared to the Carry Skip Adder. A carry chain of a 32-bit Variable Block Adder is shown. The first and the last blocks are smaller, and the intermediate blocks are larger. This compensates for the critical paths originating from the ends: it shortens the path over which the carry signal ripples in the end groups and lets the carry skip over larger groups in the middle.
There are two important consequences of this optimization. First, the total delay is reduced compared to the Carry Skip Adder. Second, the delay is no longer a linear function of the adder size N, as it is in the Carry Skip Adder; it follows a square-root function of N instead.
It is also possible to extend this approach to multiple levels of carry skip, which represents a linear programming problem that does not yield a closed-form solution. The speed of such a multiple-level adder surpasses that of a fixed-group Carry-Lookahead Adder, and it does so with lower area and power consumption. The Variable Block Adder has the lowest energy-delay product compared to the other adders in its class.
Oklobdzija 2004 Computer Arithmetic
Carry-chain of a 32-bit Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
[Figure: carry chain of the 32-bit Variable Block Adder; block-size labels 6, 5, 5, 4, 4, 3, 3, 1, 1; D = 9]
Any-point-to-any-point delay = 9 D as compared to 12 D for CSKA
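The 9 D versus 12 D comparison can be reproduced with a small sketch. It rests on several assumptions that are not stated on the slide: the block-size labels are read as the ordering 1, 3, 4, 5, 6, 5, 4, 3, 1 (small blocks at both ends, summing to 32), the fixed Carry Skip Adder is taken to have eight 4-bit groups, and the delay model charges one D per bit rippled and one D per block skipped. The function name `worst_case_delay` is hypothetical.

```python
def worst_case_delay(blocks, ripple=1, skip=1):
    """Longest carry path, in units of D, through a one-level carry-skip chain.

    blocks lists the block sizes from the least to the most significant end.
    Assumed model: a carry is generated in the lowest bit of some block,
    ripples out of it, skips whole blocks, and ripples to the top bit of
    the block where it is finally consumed.
    """
    worst = 0
    for i, size in enumerate(blocks):
        # Ripple through the remaining (size - 1) bits of the starting block.
        leave = (size - 1) * ripple
        worst = max(worst, leave)  # path that ends inside the same block
        for j in range(i + 1, len(blocks)):
            skipped = (j - i - 1) * skip        # blocks bypassed in between
            arrive = (blocks[j] - 1) * ripple   # ripple to the top bit of block j
            worst = max(worst, leave + skipped + arrive)
    return worst


VBA_BLOCKS = [1, 3, 4, 5, 6, 5, 4, 3, 1]  # assumed ordering, sums to 32
CSKA_BLOCKS = [4] * 8                      # assumed fixed 4-bit groups

print(worst_case_delay(VBA_BLOCKS))   # 9
print(worst_case_delay(CSKA_BLOCKS))  # 12
```

Under this model the tapered block sizes make many start/end pairs tie at 9 D, which is the sense in which the longest path has been balanced.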
Carry-chain block size determination for a 32-bit Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
Delay Calculation for Variable Block Adder (Oklobdzija, Barnes: IBM 1985) Delay model:
Variable Block Adder (Oklobdzija, Barnes: IBM 1985) Variable Group Length Oklobdzija, Barnes, Arith’85
Carry-chain of a 32-bit Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
- Variable block lengths
- No closed-form solution for delay
- It is a dynamic programming problem
Delay Comparison: Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
Delay Comparison: Variable Block Adder
[Figure: delay comparison; curve labels: VBA (square-root dependency), CLA (log dependency), VBA multi-level]
Circuit Issues
Adder speed cannot be estimated based on:
- logic gates in the critical path
- number of transistors in the path
- logic levels in the path
Estimating adder speed is much more complex, and many of the “fast” schemes may be misleading.
Fan-Out Dependency
Fan-In Dependency This looks like “Logical Effort” (1985)
Carry-Lookahead Adder (Weinberger and Smith, 1958)
ARITH-13: Presenting Achievement Award to Arnold Weinberger of IBM (who invented the CLA in 1958)
Ref: A. Weinberger and J. L. Smith, “A Logic for High-Speed Addition”, National Bureau of Standards, Circ. 591, p.3-12, 1958.
CLA Definitions: One-bit adder
First we should examine a realization of a one-bit adder, which is the basic building block for all the more elaborate addition schemes. Operation of a Full Adder is defined by the Boolean equations for the sum and carry signals shown in this slide: ai, bi, and ci are the inputs to the i-th full adder stage, and si and ci+1 are the sum and carry outputs from the i-th stage, respectively. From these equations it is clear that the realization of the Sum function requires two XOR logic gates. The expression for the Carry function can be rewritten using the Carry-Propagate pi and Carry-Generate gi terms. If Carry-Propagate is 1, the carry out of the stage equals the carry into the stage: ci+1 = ci. If Carry-Generate is 1, the carry out of the stage is 1 regardless of the value of the incoming carry.
The logical implementation of the full adder stage is shown in figure (a) of this slide. This implementation results from a direct application of the logic equations. Implementation (b) is more clever because it uses a multiplexer in the carry path. Given that the multiplexer block is often faster than a single gate, using a multiplexer in the critical path helps to achieve better performance.
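The two realizations described in the notes can be sketched and compared bit by bit. The slide's own equations did not survive in this text, so the sketch assumes the usual definitions pi = ai XOR bi and gi = ai AND bi; the function names are hypothetical.

```python
from itertools import product


def full_adder_direct(a, b, c):
    """Direct form: two XORs for the sum, carry = g OR (p AND c)."""
    p = a ^ b            # carry-propagate (assumed XOR definition)
    g = a & b            # carry-generate
    s = p ^ c            # second XOR produces the sum
    c_out = g | (p & c)
    return s, c_out


def full_adder_mux(a, b, c):
    """Multiplexer form: p selects between the incoming carry and a.

    When p = 0 the inputs are equal, so the carry out is simply a (= b = g).
    """
    p = a ^ b
    s = p ^ c
    c_out = c if p else a   # 2:1 multiplexer in the carry path
    return s, c_out


# Exhaustive check of both forms against integer addition.
for a, b, c in product((0, 1), repeat=3):
    expected = ((a + b + c) & 1, (a + b + c) >> 1)
    assert full_adder_direct(a, b, c) == expected
    assert full_adder_mux(a, b, c) == expected
```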
CLA Definitions: 4-bit Adder
Carry-Lookahead Adder: 4-bits [Figure: 4-bit lookahead group with outputs Gj and Pj]
Carry-Lookahead Adder
- One gate delay D to calculate p, g
- One D to calculate P and two for G
- Three gate delays to calculate C4(j+1)
- Compare that to 8 D in RCA!
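For reference, the delay count above matches the textbook form of the 4-bit group equations. The slide's own equations are not preserved here, so the following is an assumed standard formulation for a group j that starts at bit i, with p taken as XOR:

```latex
\begin{aligned}
g_i &= a_i b_i, \qquad p_i = a_i \oplus b_i \\
P_j &= p_{i+3}\, p_{i+2}\, p_{i+1}\, p_i \\
G_j &= g_{i+3} + p_{i+3}\, g_{i+2} + p_{i+3}\, p_{i+2}\, g_{i+1} + p_{i+3}\, p_{i+2}\, p_{i+1}\, g_i \\
c_{i+4} &= G_j + P_j\, c_i
\end{aligned}
```

In this form p and g cost one gate delay, P is a single AND (one more delay), and G is an AND-OR pair (two more delays), which gives the three gate delays quoted for the group carry when it is formed together with G.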
Carry-Lookahead Adder (Weinberger and Smith)
- Additional two gate delays
- C16 will take a total of 5D vs. 32D for RCA!
32-bit Carry Lookahead Adder
A significant speed improvement in the implementation of a parallel adder was introduced by the Carry-Lookahead Adder developed by Weinberger and Smith in 1958. It is theoretically one of the fastest schemes, since the delay to add two numbers depends on the logarithm of the size of the operands. The Carry-Lookahead Adder uses modified full adders for each bit position and Lookahead modules, which generate carry signals independently for a group of k bits. In the most common case the group size is 4 bits. In addition to the carry signal for the group, Lookahead modules produce group carry generate G and group carry propagate P outputs, which indicate that a carry is generated within the group or that an incoming carry would propagate across the group.
The carry out of a 4-bit wide group, ci+4, can be computed in four gate delays: one gate delay to compute pi and gi for bit positions i through i+3, a second gate delay to evaluate Pj, the second and the third to evaluate Gj, and the third and fourth to calculate carry signals ci+1, ci+2, ci+3 and ci+4. Actually, if not limited by fan-in constraints, ci+4 could be calculated concurrently with Gj and would be available after three gate delays.
In a recursive fashion, we can create a "group of groups" or a "super-group". The inputs to the "super-group" are the G and P signals from the previous level. The "super-group" produces P* and G* signals indicating that the carry signal will be propagated across, or generated in, the groups within the "super-group" domain. A "super-group" produces a carry signal out of the "super-group" as well as an input carry signal for each of the groups in the level above.
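The group and super-group recursion can be illustrated with a functional sketch of a 16-bit, two-level adder built from 4-bit groups. This is a behavioural model only: the Python loops evaluate the same Boolean functions that the hardware computes in parallel, and the names `group_pg` and `cla16` are hypothetical. It assumes p = a XOR b and g = a AND b.

```python
def group_pg(a, b, lo, k=4):
    """Return (G, P) for the k-bit group whose lowest bit position is lo."""
    G, P = 0, 1
    for i in range(lo, lo + k):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        g, p = ai & bi, ai ^ bi
        G = g | (p & G)   # generated here, or generated below and propagated
        P = p & P         # every bit in the group must propagate
    return G, P


def cla16(a, b, c0=0):
    """16-bit add via four 4-bit groups and one super-group.

    Returns (sum, c16, G_star, P_star).
    """
    starts = (0, 4, 8, 12)
    groups = [group_pg(a, b, lo) for lo in starts]

    # Super-group: carry into each group from the G, P of the groups below.
    carries = [c0]
    for G, P in groups:
        carries.append(G | (P & carries[-1]))

    # Super-group outputs G*, P* over all four groups.
    G_star, P_star = 0, 1
    for G, P in groups:
        G_star = G | (P & G_star)
        P_star &= P

    # Sum bits: each group starts from its looked-ahead carry-in.
    s = 0
    for lo, c in zip(starts, carries):
        for i in range(lo, lo + 4):
            ai, bi = (a >> i) & 1, (b >> i) & 1
            p = ai ^ bi
            s |= (p ^ c) << i
            c = (ai & bi) | (p & c)
    return s, carries[4], G_star, P_star


s, c16, G_star, P_star = cla16(0x1234, 0xF0F0)
assert s == (0x1234 + 0xF0F0) & 0xFFFF and c16 == (0x1234 + 0xF0F0) >> 16
```

The carry out of the super-group also satisfies c16 = G* + P*·c0, which is what lets the same module be reused at the next level.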
Carry-Lookahead Adder (Weinberger and Smith: original derivation, 1958)
Carry-Lookahead Adder (Weinberger and Smith: original derivation)
Carry-Lookahead Adder (Weinberger and Smith) Please notice the similarity with Parallel-Prefix Adders!
Motorola: CLA Implementation Example A. Naini, D. Bearden and W. Anderson, “A 4.5nS 96b CMOS Adder Design”, Proceedings of the IEEE Custom Integrated Circuits Conference, May 3-6, 1992.
Critical path in Motorola's 64-bit CLA
[Figure: critical path through the 64-bit CLA; timing labels 1.05nS, 1.7nS, 2.0nS, 2.35nS, 2.7nS, 3.75nS, 4.8nS]
As opposed to Ripple or Carry-Skip Adders, the critical path in the Carry-Lookahead Adder travels in the vertical direction rather than the horizontal one, as shown in the previous slide. Therefore the delay of the Carry-Lookahead Adder is not directly proportional to the size of the adder N, but to the number of levels used. Given that the groups and super-groups in the Carry-Lookahead Adder resemble a tree structure, the delay of a Carry-Lookahead Adder is proportional to the logarithm of the size N. This log dependency makes the Carry-Lookahead Adder one of the theoretically fastest structures for addition. However, it can be argued that the speed efficiency of the Carry-Lookahead Adder has passed the point of diminishing returns, given the fan-in and fan-out dependencies of the logic gates and the inadequacy of a delay model based on counting the number of gates in the critical path. In reality, the Carry-Lookahead Adder achieves lower speed than expected, especially when compared to some techniques that consume less hardware for the implementation. An example of a Carry-Lookahead Adder, and its critical path as implemented in a Motorola processor, is shown in this slide.
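The log dependency can be made concrete by counting lookahead levels. The sketch below assumes the same group size k at every level (4 by default) and counts only tree levels; as the notes point out, it says nothing about the real gate delays, which depend on fan-in and fan-out. The name `cla_levels` is hypothetical.

```python
def cla_levels(n_bits, group_size=4):
    """Number of lookahead levels needed to span n_bits with uniform groups."""
    levels, span = 0, 1
    while span < n_bits:
        span *= group_size   # each level multiplies the covered width by k
        levels += 1
    return levels


for n in (4, 16, 32, 64):
    print(n, cla_levels(n))
# 4 1
# 16 2
# 32 3
# 64 3
```

Quadrupling the operand width adds one level, whereas a ripple-carry path would grow fourfold.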
Motorola's 64-bit CLA: conventional PG Block
- No better situation here!
- Carry ripples locally
- 5 transistors in the path
Basically, this is MCC (Manchester Carry Chain) performance with Carry-Skip. One should not expect any better results than VBA.
Motorola's 64-bit CLA: Modified PG Block
- Intermediate propagate signals Pi:0 are generated to speed up C3
- Critical path still resembles MCC
Motorola's 64-bit CLA
[Figure: timing labels 1.8nS, 2.2nS, 2.9nS, 3.2nS, 3.55nS, 3.9nS]
Delay Optimized CLA
B. Lee, V. G. Oklobdzija, Journal of VLSI Signal Processing, Vol.3, No.4, October 1991
Delay Optimized CLA: Lee-Oklobdzija ‘91
(a.) Fixed groups and levels
(b.) variable-sized groups, fixed levels
(c.) variable-sized groups and fixed levels
(d.) variable-sized groups and levels
Two-Levels of Logic Implementation of the Carry Block
Two-Levels of Logic Implementation of the Carry-Lookahead Block
Three-Levels of Logic Implementation of the Carry Block (restricted fan-in)
Three-Levels of Logic Implementation of the Carry Lookahead (restricted fan-in)
Delay Optimized CLA: Lee-Oklobdzija ‘91 Delay: Two-level BCLA Delay: Three-level BCLA
Delay Optimized CLA: Lee-Oklobdzija ‘91
(a.) 2-level BCLA D=8.5nS
(b.) 3-level BCLA D=8.9nS
Ling’s Adder
Huey Ling, “High-Speed Binary Adder”, IBM Journal of Research and Development, Vol.25, No.3, 1981.
Used in: IBM 3033, IBM 168, Amdahl V6, HP etc.
Ling’s Derivations
define: [definitions not recovered; signal labels ai, bi, ci, si, ci+1]
define: gi implies Ci+1, which implies Hi+1; thus: gi = gi Hi+1
[Table: truth table with columns ai, bi, pi, gi, ti]
Ling’s Derivations
From: [equation] and because: [equation] (fundamental expansion)
Now we need to derive the Sum equation.
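The equations on the derivation slides are not preserved in this text. As a reference point, the commonly cited form of Ling's derivation is sketched below; it assumes g_i = a_i b_i, t_i = a_i + b_i, p_i = a_i XOR b_i, and Ling's pseudo-carry defined as H_{i+1} = C_{i+1} + C_i, and it may differ in notation from the original slides.

```latex
\begin{aligned}
g_i &= a_i b_i, \qquad t_i = a_i + b_i, \qquad p_i = a_i \oplus b_i \\
C_{i+1} &= g_i + t_i\, C_i \\
H_{i+1} &= C_{i+1} + C_i = g_i + t_i\, C_i + C_i = g_i + C_i \\
t_i\, H_{i+1} &= t_i\, g_i + t_i\, C_i = g_i + t_i\, C_i = C_{i+1} \\
H_{i+1} &= g_i + t_{i-1}\, H_i \\
S_i &= p_i \oplus C_i = (t_i \oplus H_{i+1}) + g_i\, t_{i-1}\, H_i
\end{aligned}
```

The fourth line uses t_i g_i = g_i, since g_i = 1 implies t_i = 1. The recurrence for H drops the t_i factor from the leading term, which is where the reduced fan-in comes from; the true carry is recovered as C_{i+1} = t_i H_{i+1} when the sum is formed.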
Ling Adder Ling’s equations: Variation of CLA: Ling, IBM J. Res. Dev, 5/81
Ling Adder Ling’s equation: Variation of CLA: Ling uses a different transfer function. Four of those functions have desired properties (Ling’s is one of them) see: Doran, IEEE Trans on Comp. Vol 37, No.9 Sept. 1988.
Ling Adder
Conventional: fan-in of 5
Ling: fan-in of 4
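The fan-in difference can be checked on a 4-bit slice. The sketch below writes out the flattened conventional carry C4 and Ling's pseudo-carry H4 in their usual textbook form (an assumption, since the slide's equations are not preserved), using g = a AND b and t = a OR b, and verifies H4 = C4 + C3 and C4 = t3·H4 for every input. The function names are hypothetical.

```python
from itertools import product


def conventional_c4(g, t, c0):
    """Flattened carry out of 4 bits; the widest AND term has 5 inputs."""
    return (g[3]
            | (t[3] & g[2])
            | (t[3] & t[2] & g[1])
            | (t[3] & t[2] & t[1] & g[0])
            | (t[3] & t[2] & t[1] & t[0] & c0))


def ling_h4(g, t, c0):
    """Ling pseudo-carry H4 = C4 + C3; the widest AND term has 4 inputs."""
    return (g[3]
            | g[2]
            | (t[2] & g[1])
            | (t[2] & t[1] & g[0])
            | (t[2] & t[1] & t[0] & c0))


def check_ling_4bit():
    """Exhaustively compare both forms against integer addition."""
    for a, b, c0 in product(range(16), range(16), (0, 1)):
        g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(4)]
        t = [((a >> i) & 1) | ((b >> i) & 1) for i in range(4)]
        c4 = (a + b + c0) >> 4               # true carry out of bit 3
        c3 = ((a & 7) + (b & 7) + c0) >> 3   # true carry into bit 3
        h4 = ling_h4(g, t, c0)
        assert conventional_c4(g, t, c0) == c4
        assert h4 == (c4 | c3)               # H4 = C4 + C3
        assert (t[3] & h4) == c4             # carry recovered as t3 * H4
    return True


check_ling_4bit()
```

The missing t3 factor in every term of H4 is what shortens the widest gate from five inputs to four.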
Advantages of Ling’s Adder
- Uniform loading in fan-in and fan-out
- H16 contains 8 terms as compared to G16 that contains 15.
- H16 can be implemented with one level of logic (in ECL), while G16 cannot.
(Ling’s adder takes full advantage of wired-OR, of special importance when ECL technology is used)