1
VLSI Arithmetic Adders & Multipliers
Prof. Vojin G. Oklobdzija University of California
2
Introduction
Digital computer arithmetic belongs to computer architecture; however, it is also an aspect of logic design. Its objective is to develop algorithms that use the available hardware as efficiently as possible.
Because the hardware can perform only a relatively simple and primitive set of Boolean operations, arithmetic operations are built as a hierarchy of operations on top of those simple ones.
Speed, power, and chip area are the most commonly used measures of an algorithm's efficiency, so there is a strong link between the algorithms and the technology used to implement them.
Oklobdzija 2004 Computer Arithmetic
3
Basic Operations
Addition
Multiplication
Multiply-Add
Division
Evaluation of Functions
Multi-Media
4
Addition of Binary Numbers
5
Addition of Binary Numbers
Full Adder. The full adder is the fundamental building block of most arithmetic circuits. The sum and carry outputs are described as:
$s_i = a_i \oplus b_i \oplus c_i$
$c_{i+1} = a_i b_i + c_i (a_i \oplus b_i)$
[Figure: full-adder block with inputs ai, bi, Cin and outputs si, Cout]
6
Addition of Binary Numbers
Inputs     | Outputs
ci ai bi   | si ci+1
0  0  0    | 0  0
0  0  1    | 1  0     Propagate
0  1  0    | 1  0     Propagate
0  1  1    | 0  1     Generate
1  0  0    | 1  0
1  0  1    | 0  1     Propagate
1  1  0    | 0  1     Propagate
1  1  1    | 1  1     Generate
7
Full-Adder Implementation
The operation of a Full Adder is defined by the equations:
$s_i = a_i \oplus b_i \oplus c_i = p_i \oplus c_i$
$c_{i+1} = a_i b_i + c_i (a_i \oplus b_i) = g_i + p_i c_i$
Carry-Propagate: $p_i = a_i \oplus b_i$ and Carry-Generate: $g_i = a_i b_i$
One-bit adder could be implemented as shown.
First we examine the realization of a one-bit adder, the basic building block of all the more elaborate addition schemes. In the Boolean equations above, ai, bi, and ci are the inputs to the i-th full-adder stage, and si and ci+1 are its sum and carry outputs. The Sum function requires two XOR gates. The Carry function can be rewritten using the Carry-Propagate pi and Carry-Generate gi terms. If Carry-Propagate is 1, the carry out of the stage equals the carry into the stage: ci+1 = ci. If Carry-Generate is 1, the carry out of the stage is 1 regardless of the value of the incoming carry.
The logical implementation of the full-adder stage is shown in figure (a) of this slide; it results from a direct application of the logic equations. Implementation (b) is more clever because it uses a multiplexer in the carry path. Given that the multiplexer block is often faster than a single gate, using a multiplexer in the critical path helps to achieve better performance.
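The equivalence of the gate-level and multiplexer-based carry paths described in these notes can be checked with a short sketch. This is only an illustration in Python, not the circuit from the slide: the function names `full_adder_gates` and `full_adder_mux` are invented for this example, and it assumes pi = ai XOR bi and gi = ai AND bi.

```python
from itertools import product


def full_adder_gates(a, b, c):
    """Direct form: s = p ^ c, c_out = g | (p & c)."""
    p = a ^ b  # carry-propagate
    g = a & b  # carry-generate
    return p ^ c, g | (p & c)


def full_adder_mux(a, b, c):
    """Multiplexer form: p selects between the incoming carry and a."""
    p = a ^ b
    s = p ^ c
    # When p = 0 the inputs are equal, so a (or b) is the carry out.
    c_out = c if p else a
    return s, c_out


# Exhaustive check over all eight input combinations.
for a, b, c in product((0, 1), repeat=3):
    s, c_out = full_adder_gates(a, b, c)
    assert (s, c_out) == full_adder_mux(a, b, c)
    assert a + b + c == s + 2 * c_out
```

The multiplexer form works because p = 0 means the stage either generates (a = b = 1) or kills (a = b = 0) the carry, so the carry out simply equals a.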
8
High-Speed Addition
One-bit adder could be implemented more efficiently because MUX is faster
9
The Ripple-Carry Adder
10
The Ripple-Carry Adder
From Rabaey
11
Inversion Property
From Rabaey
12
Minimize Critical Path by Reducing Inverting Stages
From Rabaey
13
Ripple Carry Adder Critical Path
Carry chain of an RCA implemented using a multiplexer from the standard cell library:
[Figure: critical path through the multiplexer-based carry chain]
A ripple-carry adder for N-bit numbers is implemented by concatenating N full adders. At the i-th bit position, the i-th bits of operands A and B and the carry signal from the preceding adder stage are used to generate the i-th bit of the sum, si, and a carry, ci+1, to the next adder stage. The scheme is called a Ripple Carry Adder because the carry signal "ripples" from the least significant bit position to the most significant one. With N concatenated full adders, the delay from Cin to Cout is 2N gate delays.
The path from input to output that is likely to take the longest time is designated the "critical path". In a Ripple Carry Adder this is the path from the least significant input a0 or b0 to the last sum bit sN. Assuming a multiplexer-based XOR gate implementation, this critical path consists of N+1 pass-transistor delays. Such a long chain of transistors significantly degrades the signal, so some amplification points are necessary. In practice, the critical path can be built from a multiplexer cell of a standard cell library, as shown in this slide.
Oklobdzija, ISCAS’88
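The ripple-carry scheme described above can be sketched as a bit-level simulation. This is a behavioural model only, written in Python for illustration; the helper names (`ripple_carry_add`, `to_bits`, `from_bits`) and the little-endian bit-list representation are assumptions of this sketch, not part of the slides.

```python
def to_bits(value, width):
    """Little-endian list of bits: index 0 is the least significant bit."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def ripple_carry_add(a_bits, b_bits, c_in=0):
    """Chain N full adders; each stage waits for the carry of the previous one."""
    if len(a_bits) != len(b_bits):
        raise ValueError("operands must have the same width")
    carry = c_in
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        p = a ^ b                # carry-propagate
        g = a & b                # carry-generate
        sum_bits.append(p ^ carry)
        carry = g | (p & carry)  # carry into the next stage
    return sum_bits, carry


# 0b1111 + 0b0001: the carry generated in bit 0 ripples through every stage.
bits, carry_out = ripple_carry_add(to_bits(0b1111, 4), to_bits(0b0001, 4))
assert (from_bits(bits), carry_out) == (0, 1)
```

The loop makes the serial dependency explicit: the sum bit of stage i cannot be produced before the carry of stage i-1 is known, which is why the delay grows linearly with N.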
14
Manchester Carry-Chain Realization of the Carry Path
Simple and very popular scheme for implementation of the carry signal path
The Manchester Carry Chain is a simple addition scheme that was very popular when LSI nMOS technology was emerging. It is an alternative, switch-based technique implemented in pass-transistor logic. The speed achieved with the Manchester Carry Chain is impressive, owing to its simplicity and to the properties of pass-transistor logic. It does not require a large area and consumes substantially less power than Carry-Lookahead or other more elaborate schemes. A realization of the Manchester Carry Chain is shown in the slide. Because of the RC delay of the chain, the signal needs to be regenerated by inserting inverters at appropriately chosen locations.
15
Original Design
T. Kilburn, D. B. G. Edwards, D. Aspinall, "Parallel Addition in Digital Computers: A New Fast "Carry" Circuit", Proceedings of IEE, Vol. 106, pt. B, p. 464, September 1959.
16
Manchester Carry Chain (CMOS)
Implement P with pass-transistors
Implement G with pull-up, kill (delete) with pull-down
Use dynamic logic to reduce the complexity and speed up
Kilburn, et al, IEE Proc, 1959.
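The three switch actions listed above (propagate, generate, kill) can be modelled logically as follows. This sketch deliberately ignores precharge, signal polarity, and RC effects, so it shows only the switching decision at each stage; the function names `manchester_stage` and `manchester_chain` are hypothetical.

```python
def manchester_stage(a, b, c_in):
    """One stage of the chain: exactly one of three switches is closed."""
    generate = a & b           # pull-up: force the carry node to 1
    kill = (1 - a) & (1 - b)   # pull-down: force the carry node to 0
    propagate = a ^ b          # pass transistor: connect c_in to c_out
    assert generate + propagate + kill == 1
    if generate:
        return 1
    if kill:
        return 0
    return c_in


def manchester_chain(a_bits, b_bits, c_in=0):
    """Return the carries [c_1 .. c_N] for little-endian operand bits."""
    carries = []
    carry = c_in
    for a, b in zip(a_bits, b_bits):
        carry = manchester_stage(a, b, carry)
        carries.append(carry)
    return carries


# Stage 0 generates, stages 1 and 2 propagate, stage 3 kills.
print(manchester_chain([1, 1, 0, 0], [1, 0, 1, 0]))  # [1, 1, 1, 0]
```

Because the three conditions are mutually exclusive, the carry node of each stage is always driven by exactly one path, which is what lets the chain be built from simple switches.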
17
Pass-Transistor Realization in DPL
The ability of pass-transistor logic to provide an efficient multiplexer implementation has been exploited in the CPL and DPL logic families. Even an XOR gate is implemented more efficiently using a multiplexer topology. A Full-Adder cell that is entirely multiplexer based was published by Hitachi and is shown in this slide. This realization contains only two transistors in the Input-to-Sum path and only one transistor in the Cin-to-Cout path (not counting the buffer). The short critical path contributes to the remarkable speed of this implementation.
18
Carry-Skip Adder
MacSorley, Proc IRE 1/61
Lehman, Burla, IRE Trans on Comp, 12/61
19
Carry-Skip Adder Bypass
From Rabaey
20
Carry-Skip Adder: N-bits, k-bits/group, r=N/k groups
Since the Cin-to-Cout path is the longest path in the ripple-carry adder, an obvious step is to accelerate carry propagation through the adder. This is accomplished by using the Carry-Propagate signals pi within a group of bits. If all the pi signals within the group are 1, the condition exists for the carry to bypass the entire group:
$P = p_i \cdot p_{i+1} \cdots p_{i+k-1}$
The Carry Skip Adder divides the words to be added into groups of equal size of k bits. The basic structure of an N-bit Carry Skip Adder is shown here. Within a group, the carry propagates in ripple-carry fashion. In addition, an AND gate forms the group propagate signal. If the group propagate signal is true, the carry bypasses the group, as shown in this slide.
The maximal delay of a Carry Skip Adder is encountered when a carry is generated in the least significant bit position, ripples through k-1 bit positions, skips over N/k-2 groups in the middle, ripples through the k-1 bits of the most significant group, and is assimilated in the N-th bit position to produce the sum sN:
$\Delta_{CSA} = (k-1)\Delta_{rca} + (N/k - 2)\Delta_{skip} + (k-1)\Delta_{rca}$
where $\Delta_{rca}$ is the delay of one ripple-carry stage and $\Delta_{skip}$ is the delay of bypassing one group.
Thus the Carry Skip Adder is faster than the Ripple Carry Adder at the expense of a few relatively simple modifications. Its delay still depends linearly on the adder size N; however, this linear dependence is reduced by a factor of 1/k.
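The worst-case path described in these notes (ripple through k-1 bits, skip N/k-2 groups, ripple through k-1 bits) can be turned into a small calculator. The function name `carry_skip_worst_case` and the unit default delays are assumptions of this sketch; real ripple and skip delays depend on the circuit and are generally not equal.

```python
def carry_skip_worst_case(n_bits, group_size, ripple_delay=1.0, skip_delay=1.0):
    """Worst-case carry delay of a fixed-group carry-skip adder.

    Path: generated in the lowest bit, ripples through k-1 bits of the
    first group, skips the N/k - 2 middle groups, then ripples through
    k-1 bits of the last group.
    """
    if n_bits % group_size:
        raise ValueError("n_bits must be a multiple of group_size")
    groups = n_bits // group_size
    if groups < 2:
        raise ValueError("need at least two groups for a skip path")
    ripple_stages = 2 * (group_size - 1)
    skipped_groups = groups - 2
    return ripple_stages * ripple_delay + skipped_groups * skip_delay


# Sweep the group size for a 32-bit adder under the unit-delay assumption.
for k in (2, 4, 8, 16):
    print(k, carry_skip_worst_case(32, k))
```

Under the unit-delay assumption the sweep shows the trade-off: small groups mean many skips, large groups mean long ripples, and the minimum sits in between.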
21
Carry-Skip Adder k Oklobdzija 2004 Computer Arithmetic
22
Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
23
Carry-chain of a 32-bit Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
The idea behind the Variable Block Adder is to minimize the longest critical path in the carry chain of a Carry Skip Adder by allowing the groups to take different sizes. In general, this optimization does not increase complexity compared with the Carry Skip Adder. A carry chain of a 32-bit Variable Block Adder is shown. The first and last blocks are smaller, and the intermediate blocks are larger. This compensates for the critical paths originating at the ends: the carry ripples through shorter end groups and skips over larger groups in the middle.
There are two important consequences of this optimization:
First, the total delay is reduced compared with the Carry Skip Adder.
Second, the delay is no longer a linear function of the adder size N, as in the Carry Skip Adder; it follows a square-root function of N instead.
It is also possible to extend this approach to multiple levels of carry skip, which represents a linear programming problem that does not yield a closed-form solution. The speed of such a multiple-level adder surpasses that of a fixed-group Carry-Lookahead Adder, with lower area and power consumption. The Variable Block Adder has the lowest energy-delay product compared with the other adders in its class.
24
Any-point-to-any-point delay = 9 D as compared to 12 D for CSKA
Carry-chain of a 32-bit Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
[Figure: carry chain with block sizes 1, 3, 4, 5, 6, 5, 4, 3, 1; D = 9]
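The 9 D versus 12 D comparison can be reproduced with a simple unit-delay model. Everything here is an assumption made for illustration: each rippled bit and each skipped block costs one delay unit, and the variable block sizes are read from the figure labels as 1, 3, 4, 5, 6, 5, 4, 3, 1 (which sum to 32). The delay model actually used on the slides is not recoverable from the text, and the function name `worst_case_carry_delay` is hypothetical.

```python
def worst_case_carry_delay(block_sizes):
    """Longest any-point-to-any-point carry path, in unit delays."""
    # A path confined to a single block ripples through size - 1 bits.
    worst = max(size - 1 for size in block_sizes)
    for i, first in enumerate(block_sizes):
        for j in range(i + 1, len(block_sizes)):
            ripple_out = first - 1          # leave block i from its lowest bit
            skips = j - i - 1               # one unit per bypassed block
            ripple_in = block_sizes[j] - 1  # reach the top bit of block j
            worst = max(worst, ripple_out + skips + ripple_in)
    return worst


fixed = [4] * 8                          # conventional carry-skip, k = 4
variable = [1, 3, 4, 5, 6, 5, 4, 3, 1]   # assumed VBA block sizes
assert sum(fixed) == sum(variable) == 32

print(worst_case_carry_delay(fixed))     # 12
print(worst_case_carry_delay(variable))  # 9
```

With small blocks at the ends, a carry that starts or finishes near an end has little to ripple through, so no start and end pair exceeds 9 units in this model.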
25
Carry-chain block size determination for a 32-bit Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
26
Delay Calculation for Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
Delay model:
27
Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
Variable Group Length
Oklobdzija, Barnes, Arith’85
28
Carry-chain of a 32-bit Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
Variable Block Lengths
No closed form solution for delay
It is a dynamic programming problem
29
Delay Comparison: Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
30
Delay Comparison: Variable Block Adder
[Figure: delay comparison curves for VBA, CLA, and multi-level VBA]
31
VLSI Arithmetic Lecture 4
Prof. Vojin G. Oklobdzija University of California
32
Review Lecture 3
33
Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
34
Carry-chain of a 32-bit Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
35
Any-point-to-any-point delay = 9 D as compared to 12 D for CSKA
36
Carry-chain block size determination for a 32-bit Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
37
Delay Calculation for Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
Delay model:
38
Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
Variable Group Length
Oklobdzija, Barnes, Arith’85
39
Carry-chain of a 32-bit Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
Variable Block Lengths
No closed form solution for delay
It is a dynamic programming problem
40
Delay Comparison: Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
41
Delay Comparison: Variable Block Adder
[Figure: delay comparison of VBA (square-root dependency), CLA (log dependency), and multi-level VBA]
42
Circuit Issues
Adder speed cannot be estimated based on:
logic gates in the critical path
number of transistors in the path
logic levels in the path
Estimating adder speed is much more complex, and many of the “fast” schemes may be misleading.
43
Fan-Out Dependency
44
Fan-In Dependency This looks like “Logical Effort” (1985)
45
Delay Comparison: Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
46
47
Carry-Lookahead Adder (Weinberger and Smith, 1958)
ARITH-13: Presenting Achievement Award to Arnold Weinberger of IBM (who invented the CLA adder in 1958)
Ref: A. Weinberger and J. L. Smith, “A Logic for High-Speed Addition”, National Bureau of Standards, Circ. 591, p.3-12, 1958.
48
CLA Definitions: One-bit adder
49
CLA Definitions: 4-bit Adder
50
Carry-Lookahead Adder: 4-bits
[Figure: 4-bit carry-lookahead block with group outputs Gj and Pj]
51
Carry-Lookahead Adder
One gate delay D to calculate p, g
One D to calculate P and two for G
Three gate delays to calculate C4(j+1)
Compare that to 8 D in RCA!
52
Carry-Lookahead Adder (Weinberger and Smith)
Additional two gate delays
C16 will take a total of 5 D vs. 32 D for RCA!
53
32-bit Carry Lookahead Adder
A significant speed improvement in the implementation of a parallel adder was introduced by the Carry-Lookahead Adder developed by Weinberger and Smith in 1958. It is theoretically one of the fastest schemes, since the delay to add two numbers depends on the logarithm of the size of the operands. The Carry-Lookahead Adder uses modified full adders for each bit position and Lookahead modules, which generate carry signals independently for a group of k bits. In the most common case the group size is 4 bits. In addition to the carry signal for the group, the Lookahead modules produce group carry-generate G and group carry-propagate P outputs, which indicate that a carry is generated within the group or that an incoming carry would propagate across the group.
The carry out of a 4-bit group, ci+4, can be computed in four gate delays: one gate delay to compute pi and gi for bits i through i+3, a second gate delay to evaluate Pj, the second and third to evaluate Gj, and the third and fourth to calculate the carry signals ci+1, ci+2, ci+3 and ci+4. Actually, if not limited by fan-in constraints, ci+4 could be calculated concurrently with Gj and would be available after three gate delays.
In a recursive fashion, we can create a "group of groups", or "super-group". The inputs to the super-group are the G and P signals from the previous level. The super-group produces P* and G* signals indicating that the carry will be propagated across, or generated in, the groups within its domain. A super-group produces a carry signal out of the super-group as well as an input carry signal for each of the groups in the level above.
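The group and super-group recursion described above can be sketched in a few lines. This is a behavioural illustration under stated assumptions, not Weinberger and Smith's circuit: the names `all_ones`, `carry_lookahead`, and `cla_add` are hypothetical, the adder is 16 bits wide with 4-bit groups and one super-group, and every carry is written as a flat sum of products rather than mapped onto real gates with fan-in limits.

```python
def all_ones(bits):
    """AND of a list of bits (1 for an empty list)."""
    return int(all(bits))


def carry_lookahead(p, g, c_in):
    """Carries c_1..c_k, each a flat sum of products of p, g and c_in."""
    carries = []
    for i in range(len(p)):
        # g_j reaches position i+1 if stages j+1..i all propagate.
        terms = [g[j] & all_ones(p[j + 1:i + 1]) for j in range(i + 1)]
        terms.append(c_in & all_ones(p[:i + 1]))
        carries.append(int(any(terms)))
    return carries


def cla_add(a, b, width=16, group=4, c_in=0):
    """Two-level carry-lookahead addition; returns (sum, carry_out)."""
    if width % group:
        raise ValueError("width must be a multiple of group")
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]
    starts = range(0, width, group)
    # First level: group propagate P_j and group generate G_j.
    P = [all_ones(p[s:s + group]) for s in starts]
    G = [carry_lookahead(p[s:s + group], g[s:s + group], 0)[-1] for s in starts]
    # Second level (super-group): carry into every group from P_j, G_j.
    group_carries = [c_in] + carry_lookahead(P, G, c_in)
    total = 0
    for k, s in enumerate(starts):
        cin_k = group_carries[k]
        local = [cin_k] + carry_lookahead(p[s:s + group], g[s:s + group], cin_k)
        for i in range(group):
            total |= (p[s + i] ^ local[i]) << (s + i)
    return total, group_carries[-1]


assert cla_add(0xFFFF, 1) == (0, 1)
```

The same `carry_lookahead` function serves both levels: at the bit level it consumes (p, g), and at the super-group level it consumes the group signals (P, G), which is the recursion the notes describe.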
54
Carry-Lookahead Adder (Weinberger and Smith: original derivation, 1958 )
55
Carry-Lookahead Adder (Weinberger and Smith: original derivation )
56
Carry-Lookahead Adder (Weinberger and Smith)
Please notice the similarity with Parallel-Prefix Adders!
57
Carry-Lookahead Adder (Weinberger and Smith)
Please notice the similarity with Parallel-Prefix Adders!
58
Motorola: CLA Implementation Example
A. Naini, D. Bearden and W. Anderson, “A 4.5nS 96b CMOS Adder Design”, Proceedings of the IEEE Custom Integrated Circuits Conference, May 3-6, 1992.
59
Critical path in Motorola's 64-bit CLA
[Figure: critical path with timing annotations 1.05 ns, 1.7 ns, 2.0 ns, 2.35 ns, 2.7 ns, 3.75 ns, 4.8 ns]
As opposed to Ripple or Carry-Skip Adders, the critical path in the Carry-Lookahead Adder travels in the vertical direction rather than the horizontal one, as shown in the previous slide. The delay of the Carry-Lookahead Adder is therefore not directly proportional to the size of the adder N, but to the number of levels used. Since the groups and super-groups form a tree structure, the delay is proportional to the logarithm of N. This log dependency makes the Carry-Lookahead Adder one of the theoretically fastest structures for addition.
However, it can be argued that the speed efficiency of the Carry-Lookahead Adder has passed the point of diminishing returns, given the fan-in and fan-out dependencies of the logic gates and the inadequacy of a delay model based on counting the gates in the critical path. In reality, the Carry-Lookahead Adder is slower than expected, especially when compared to some techniques that consume less hardware. An example of a Carry-Lookahead Adder and its critical path, as implemented in a Motorola processor, is shown in this slide.
60
Motorola's 64-bit CLA conventional PG Block
No better situation here!
Carry ripples locally
5 transistors in the path
Basically, this is MCC performance with Carry-Skip. One should not expect any better results than VBA.
61
Motorola's 64-bit CLA Modified PG Block
Intermediate propagate signals Pi:0 are generated to speed up C3
The critical path still resembles MCC
62
Motorola's 64-bit CLA
[Figure: timing annotations 1.8 ns, 2.2 ns, 2.9 ns, 3.2 ns, 3.55 ns, 3.9 ns]
63
[Figure: Motorola's 64-bit CLA with timing annotations 1.05 ns, 1.7 ns, 2.0 ns, 2.35 ns, 2.7 ns, 3.75 ns, 4.8 ns, 1.8 ns, 2.2 ns, 2.9 ns, 3.2 ns]
64
Delay Optimized CLA
B. Lee, V. G. Oklobdzija, Journal of VLSI Signal Processing, Vol.3, No.4, October 1991
65
Delay Optimized CLA: Lee-Oklobdzija ‘91
(a.) Fixed groups and levels
(b.) variable-sized groups, fixed levels
(c.) variable-sized groups and fixed levels
(d.) variable-sized groups and levels
66
Two-Levels of Logic Implementation of the Carry Block
67
Two-Levels of Logic Implementation of the Carry-Lookahead Block
68
Three-Levels of Logic Implementation of the Carry Block (restricted fan-in)
69
Three-Levels of Logic Implementation of the Carry Lookahead (restricted fan-in)
70
Delay Optimized CLA: Lee-Oklobdzija ‘91
Delay: Two-level BCLA
Delay: Three-level BCLA
71
Delay Optimized CLA: Lee-Oklobdzija ‘91
(a.) 2-level BCLA, D = 8.5 ns
(b.) 3-level BCLA, D = 8.9 ns
72
Ling’s Adder
Huey Ling, “High-Speed Binary Adder”, IBM Journal of Research and Development, Vol. 25, No. 3, 1981.
Used in: IBM 3033, IBM 168, Amdahl V6, HP etc.
73
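Ling’s Derivations
[Figure: full-adder stage with inputs ai, bi, ci and outputs si, ci+1]
define: $p_i = a_i \oplus b_i$, $g_i = a_i b_i$, $t_i = a_i + b_i$
ai bi | pi gi ti
0  0  | 0  0  0
0  1  | 1  0  1
1  0  | 1  0  1
1  1  | 0  1  1
define: $H_{i+1} = c_{i+1} + c_i$
gi implies ci+1, which implies Hi+1, thus: $g_i = g_i H_{i+1}$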
74
Ling’s Derivations
From: [equations not recoverable from the extracted text]
because: [equation not recoverable] (fundamental expansion)
Now we need to derive the Sum equation
75
Ling Adder Ling’s equations: Variation of CLA:
Ling, IBM J. Res. Dev, 5/81
76
Ling Adder Ling’s equation: Variation of CLA:
Ling uses a different transfer function. Four of those functions have desired properties (Ling’s is one of them); see: Doran, IEEE Trans on Comp. Vol 37, No.9 Sept
77
Ling Adder
Conventional: fan-in of 5
Ling: fan-in of 4
78
Advantages of Ling’s Adder
Uniform loading in fan-in and fan-out
H16 contains 8 terms as compared to G16 that contains 15.
H16 can be implemented with one level of logic (in ECL), while G16 cannot.
Ling’s adder takes full advantage of wired-OR, of special importance when ECL technology is used.
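The fan-in advantage can be illustrated with the standard formulation of Ling's pseudo-carry, Hi+1 = ci+1 + ci, which gives ci+1 = ti Hi+1 and Hi+1 = gi + ti-1 Hi. This sketch assumes that textbook form rather than the exact equations on the slides (which did not survive extraction); the function names and the boundary convention t-1 = 1, H0 = c0 are choices made for this example.

```python
from itertools import product


def conventional_c4(g, t, c0):
    """Conventional 4-bit lookahead carry: largest product has 5 inputs."""
    return (g[3] | (t[3] & g[2]) | (t[3] & t[2] & g[1])
            | (t[3] & t[2] & t[1] & g[0])
            | (t[3] & t[2] & t[1] & t[0] & c0))


def ling_h4(g, t, c0):
    """Ling pseudo-carry H4 = c4 + c3: largest product has 4 inputs."""
    return (g[3] | g[2] | (t[2] & g[1]) | (t[2] & t[1] & g[0])
            | (t[2] & t[1] & t[0] & c0))


def ling_add(a, b, width=8, c_in=0):
    """Add using the recurrence H_{i+1} = g_i + t_{i-1} H_i."""
    h, t_prev, total = c_in, 1, 0   # H_0 = c_0 and t_{-1} = 1 by convention
    for i in range(width):
        a_i, b_i = (a >> i) & 1, (b >> i) & 1
        c = t_prev & h                    # c_i = t_{i-1} H_i
        total |= ((a_i ^ b_i) ^ c) << i   # s_i = p_i xor c_i
        h = (a_i & b_i) | c               # H_{i+1} = g_i + c_i
        t_prev = a_i | b_i
    return total, t_prev & h              # carry out = t_{N-1} H_N


# The real carry is recovered from the pseudo-carry with one AND: c4 = t3 H4.
for bits in product((0, 1), repeat=9):
    a, b, c0 = bits[0:4], bits[4:8], bits[8]
    g = [x & y for x, y in zip(a, b)]
    t = [x | y for x, y in zip(a, b)]
    assert conventional_c4(g, t, c0) == t[3] & ling_h4(g, t, c0)
```

Dropping t3 from every product term is what shortens the widest AND from five inputs to four; the missing factor is reapplied later, off the critical lookahead path.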
79
VLSI Arithmetic Lecture 5
Prof. Vojin G. Oklobdzija University of California
80
Review Lecture 4
81
Ling’s Adder
Huey Ling, “High-Speed Binary Adder”, IBM Journal of Research and Development, Vol. 25, No. 3, 1981.
Used in: IBM 3033, IBM S370/168, Amdahl V6, HP etc.
82
Ling’s Derivations
define: $p_i = a_i \oplus b_i$, $g_i = a_i b_i$, $t_i = a_i + b_i$
define: $H_{i+1} = c_{i+1} + c_i$
gi implies ci+1, which implies Hi+1, thus: $g_i = g_i H_{i+1}$
83
Ling’s Derivations
Now we need to derive the Sum equation
84
Ling Adder
Ling’s equations: Variation of CLA:
Ling, IBM J. Res. Dev, 5/81
85
Ling Adder
Ling’s equation: Variation of CLA:
[Figure: adjacent adder stages i and i-1 with inputs ai, bi, ci and ai-1, bi-1, ci-1, outputs si and si-1, internal signals gi, ti and gi-1, ti-1, and signals Hi+1 and Hi]
Ling uses a different transfer function. Four of those functions have desired properties (Ling’s is one of them); see: Doran, IEEE Trans on Comp. Vol 37, No.9 Sept
86
Ling Adder
Conventional: fan-in of 5
Ling: fan-in of 4
87
Advantages of Ling’s Adder
Uniform loading in fan-in and fan-out
H16 contains 8 terms as compared to G16 that contains 15.
H16 can be implemented with one level of logic (in ECL), while G16 cannot (with 8-way wire-OR).
Ling’s adder takes full advantage of wired-OR, of special importance when ECL technology is used; his IBM limitation was a fan-in of 4 and a wire-OR of 8.
88
Ling: Weinberger Notes
89
Ling: Weinberger Notes
90
Ling: Weinberger Notes
91
Advantage of Ling’s Adder
32-bit adder used in: IBM 3033, IBM S370/Model 168, Amdahl V6. Implements 32-bit addition in 3 levels of logic. Implements 32-bit AGEN (B + Index + Disp) in 4 levels of logic (rather than 6). 5 levels of logic for the 64-bit adder used in the HP processor.
92
Implementation of Ling’s Adder in CMOS
S. Naffziger, “A Subnanosecond 64-b Adder”, ISSCC ’96
93
S. Naffziger, ISSCC’96
94
S. Naffziger, ISSCC’96
95
S. Naffziger, ISSCC’96
96
S. Naffziger, ISSCC’96
97
S. Naffziger, ISSCC’96
98
S. Naffziger, ISSCC’96
99
S. Naffziger, ISSCC’96
100
S. Naffziger, ISSCC’96
101
S. Naffziger, ISSCC’96
102
S. Naffziger, ISSCC’96
103
S. Naffziger, ISSCC’96
104
Ling Adder Critical Path
105
Ling Adder: Circuits
106
LCS4 – Critical G Path
107
LCS4 – Logical Effort Delay
108
Results: 0.5 µm Technology
Speed: nS (nominal process, 80 °C, V = 3.3 V). See: S. Naffziger, “A Subnanosecond 64-b Adder”, ISSCC ’96
109
Prefix Adders and Parallel Prefix Adders
110
from: Ercegovac-Lang
111
Prefix Adders
The following recurrence operation is defined:
$$(g, p) \circ (g', p') = (g + p g',\; p p')$$
such that:
$$(G_i, P_i) = \begin{cases} (g_0, p_0) & i = 0 \\ (g_i, p_i) \circ (G_{i-1}, P_{i-1}) & 1 \le i \le n \end{cases}$$
$$c_{i+1} = G_i \quad \text{for } i = 0, 1, \ldots, n$$
With $(g_{-1}, p_{-1}) = (c_{in}, c_{in})$: $c_1 = g_0 + p_0 c_{in}$.
This operation is associative, but not commutative. It can also span a range of bits (overlapping and adjacent).
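The recurrence above can be sketched in a few lines of Python. This is a minimal serial model, assuming per-bit signals $g_i = a_i b_i$ and $p_i = a_i \oplus b_i$ and LSB-first bit lists; the names `prefix_op` and `carries` are hypothetical and chosen only for illustration.

```python
def prefix_op(left, right):
    """(g, p) o (g', p') = (g + p*g', p*p'); `left` is the more significant group."""
    g, p = left
    g2, p2 = right
    return (g | (p & g2), p & p2)


def carries(a_bits, b_bits, cin=0):
    """Return [c_1, ..., c_n] for LSB-first bit lists using the serial recurrence."""
    acc = (cin, cin)                  # (g_-1, p_-1) = (cin, cin), as defined above
    out = []
    for a, b in zip(a_bits, b_bits):
        g, p = a & b, a ^ b           # assumed per-bit generate and propagate
        acc = prefix_op((g, p), acc)  # (G_i, P_i) = (g_i, p_i) o (G_{i-1}, P_{i-1})
        out.append(acc[0])            # c_{i+1} = G_i
    return out


if __name__ == "__main__":
    # 0b0111 + 0b0001, LSB first
    print(carries([1, 1, 1, 0], [1, 0, 0, 0]))
```

The operand order in `prefix_op` matters: swapping `left` and `right` generally changes the result, which is the non-commutativity noted on the slide.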
112
from: Ercegovac-Lang
113
Parallel Prefix Adders: variety of possibilities
from: Ercegovac-Lang
114
Pyramid Adder: M. Lehman, “A Comparative Study of Propagation Speed-up Circuits in Binary Arithmetic Units”, IFIP Congress, Munich, Germany, 1962.
115
Parallel Prefix Adders: variety of possibilities
from: Ercegovac-Lang
116
Parallel Prefix Adders: variety of possibilities
from: Ercegovac-Lang
117
Hybrid BK-KS Adder
118
Parallel Prefix Adders: S. Knowles 1999
Operation is associative: … ($h > i \ge j \ge k$). Operation is idempotent: … ($h > i \ge j \ge k$). Produces carry: … ($c_{in} = 0$).
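The two properties can be checked by brute force. The sketch below assumes one common reading of them: combining two adjacent bit ranges gives the group signal of the joined range (associativity), and combining two overlapping ranges gives the same result (idempotency). The index convention and the helper names `group` and `check_properties` are assumptions made for illustration.

```python
from functools import reduce
from itertools import product


def prefix_op(left, right):
    """(g, p) o (g', p') = (g + p*g', p*p'); `left` is the more significant group."""
    g, p = left
    g2, p2 = right
    return (g | (p & g2), p & p2)


def group(gp, hi, lo):
    """Group (G, P) over bit positions lo..hi (inclusive), folded from the LSB side."""
    return reduce(lambda acc, x: prefix_op(x, acc), gp[lo + 1:hi + 1], gp[lo])


def check_properties(n=4):
    """Exhaustively check adjacent and overlapping range combination for n bits."""
    for bits in product((0, 1), repeat=2 * n):
        a, b = bits[:n], bits[n:]
        gp = [(x & y, x ^ y) for x, y in zip(a, b)]
        for h in range(n):
            for k in range(h + 1):
                whole = group(gp, h, k)
                for i in range(k, h):                  # h > i >= k
                    # adjacent ranges [h..i+1] and [i..k]
                    if prefix_op(group(gp, h, i + 1), group(gp, i, k)) != whole:
                        return False
                    for j in range(k, i + 1):          # i >= j >= k: overlap
                        if prefix_op(group(gp, h, j), group(gp, i, k)) != whole:
                            return False
    return True


if __name__ == "__main__":
    print(check_properties(4))
```

The overlap case is what lets a prefix network reuse partial results whose spans are not disjoint.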
119
Parallel Prefix Adders: Ladner-Fischer
Exploits associativity, but not idempotency. Produces minimal logical depth.
120
Parallel Prefix Adders: Ladner-Fischer (16,8,4,2,1)
Two wires at each level. Uniform fan-in of two. Large fan-out (of 16; n/2). Large capacitive loading combined with long wires (in the last stages).
121
Parallel Prefix Adders: Kogge-Stone
Exploits idempotency to limit the fan-out to 1. Dramatic increase in wires. The wire span remains the same as in Ladner-Fischer. Buffers are needed in both cases: K-S, L-F.
122
Kogge-Stone Adder
123
Parallel Prefix Adders: Brent-Kung
Sets the fan-out to one. Avoids the explosion of wires (as in K-S). Makes no sense in CMOS: the fan-out = 1 limit is arbitrary and extreme, and much of the capacitive load is due to wire anyway. It is more efficient to insert buffers in L-F than to use the B-K scheme.
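The three structures discussed above can be compared with a small behavioral model. This is a sketch under stated assumptions, not a circuit description: each network is a list of levels, each level maps a bit position to the lower position it combines with, the L-F network is modeled in its commonly drawn divide-and-conquer form, the B-K builder assumes a power-of-two width, and "lateral fan-out" counts only how many nodes in a level read the same lower node. All function names are hypothetical.

```python
from collections import Counter


def prefix_op(left, right):
    """(g, p) o (g', p') = (g + p*g', p*p'); `left` is the more significant group."""
    g, p = left
    g2, p2 = right
    return (g | (p & g2), p & p2)


def kogge_stone(n):
    """Level with distance d: every node i >= d combines with node i - d."""
    levels, d = [], 1
    while d < n:
        levels.append({i: i - d for i in range(d, n)})
        d *= 2
    return levels


def ladner_fischer(n):
    """Level with distance d: nodes with bit d set read the top of the block below."""
    levels, d = [], 1
    while d < n:
        levels.append({i: (i // d) * d - 1 for i in range(n) if i & d})
        d *= 2
    return levels


def brent_kung(n):
    """Up-sweep tree followed by a down-sweep; n is assumed to be a power of two."""
    levels, d = [], 1
    while d < n:                                  # up-sweep
        levels.append({i: i - d for i in range(2 * d - 1, n, 2 * d)})
        d *= 2
    d //= 4
    while d >= 1:                                 # down-sweep
        levels.append({i: i - d for i in range(3 * d - 1, n, 2 * d)})
        d //= 2
    return levels


def evaluate(levels, gp):
    """Apply the network to per-bit (g, p) pairs; returns (G_i, P_i) for every bit."""
    cur = list(gp)
    for level in levels:
        cur = [prefix_op(cur[i], cur[level[i]]) if i in level else cur[i]
               for i in range(len(cur))]
    return cur


def max_fanout(levels):
    """Largest number of nodes reading the same source, per level."""
    return [max(Counter(level.values()).values()) for level in levels]


def op_count(levels):
    return sum(len(level) for level in levels)


if __name__ == "__main__":
    for name, build in (("K-S", kogge_stone), ("L-F", ladner_fischer),
                        ("B-K", brent_kung)):
        net = build(32)
        print(name, "depth", len(net), "fan-out", max_fanout(net),
              "operators", op_count(net))
```

For 32 bits this model gives a depth of 5 for K-S and L-F and 9 for B-K, with the L-F lateral fan-out growing to 16 (n/2) in the last level while K-S and B-K stay at 1, in line with the slides.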
124
Brent-Kung Adder
125
Parallel Prefix Adders: Han-Carlson
Is a hybrid synthesis of L-F and K-S. Trades an increase in logic depth for a reduction in fan-out: effectively a higher-radix variant of K-S. Others do it similarly, by serializing the prefix computation at the higher fan-out nodes, trading logical depth for a reduction of fan-out and wire.
126
Parallel Prefix Adders: variety of possibilities
from: Knowles; bounded by L-F and K-S at the ends
127
Parallel Prefix Adders: variety of possibilities Knowles 1999
The following rules are used: Lateral wires at the jth level span $2^j$ bits. Lateral fan-out at the jth level is a power of 2 up to $2^j$. Lateral fan-out at the jth level cannot exceed that at the (j+1)th level.
128
Parallel Prefix Adders: variety of possibilities Knowles 1999
The number of minimal-depth graphs of this type is given in: … At 4 bits there are only K-S and L-F; afterwards there are several new possibilities.
129
Parallel Prefix Adders: variety of possibilities
Knowles 1999: example of a new 32-bit adder [4,4,2,2,1]
130
Parallel Prefix Adders: variety of possibilities
Knowles 1999: example of a new 32-bit adder [4,4,2,2,1]
131
Parallel Prefix Adders: variety of possibilities Knowles 1999
Delay is given in terms of FO4 inverter delay, worst case (the nominal case is 40-50% faster). K-S is the fastest. K-S adders are wire limited (requiring 80% more area). The difference between the examined schemes is less than 15%.
132
Parallel Prefix Adders: variety of possibilities Knowles 1999
Conclusion: Irregular, hybrid schemes are possible. The speed-up of 15% is achieved at the cost of large wiring, hence area and power. Circuits close in speed to K-S are available at significantly lower wiring cost.
133
VLSI Arithmetic Lecture 6
Prof. Vojin G. Oklobdzija University of California
134
Review Lecture 5
135
Prefix Adders and Parallel Prefix Adders
136
from: Ercegovac-Lang
137
Prefix Adders
The following recurrence operation is defined:
$$(g, p) \circ (g', p') = (g + p g',\; p p')$$
such that:
$$(G_i, P_i) = \begin{cases} (g_0, p_0) & i = 0 \\ (g_i, p_i) \circ (G_{i-1}, P_{i-1}) & 1 \le i \le n \end{cases}$$
$$c_{i+1} = G_i \quad \text{for } i = 0, 1, \ldots, n$$
With $(g_{-1}, p_{-1}) = (c_{in}, c_{in})$: $c_1 = g_0 + p_0 c_{in}$.
This operation is associative, but not commutative. It can also span a range of bits (overlapping and adjacent).
138
Parallel Prefix Adders: S. Knowles 1999
Operation is associative: … ($h > i \ge j \ge k$). Operation is idempotent: … ($h > i \ge j \ge k$). Produces carry: … ($c_{in} = 0$).
139
from: Ercegovac-Lang
140
Parallel Prefix Adders: variety of possibilities
from: Ercegovac-Lang
141
Parallel Prefix Adders: variety of possibilities
from: Ercegovac-Lang
142
Parallel Prefix Adders: variety of possibilities
from: Ercegovac-Lang
143
Kogge-Stone Adder
144
Brent-Kung Adder
145
Hybrid BK-KS Adder
146
Pyramid Adder: M. Lehman, “A Comparative Study of Propagation Speed-up Circuits in Binary Arithmetic Units”, IFIP Congress, Munich, Germany, 1962.
147
Parallel Prefix Adders: Ladner-Fischer
Exploits associativity, but not idempotency. Produces minimal logical depth.
148
Parallel Prefix Adders: Ladner-Fischer (16,8,4,2,1)
Two wires at each level. Uniform fan-in of two. Large fan-out (of 16; n/2). Large capacitive loading combined with long wires (in the last stages).
149
Parallel Prefix Adders: Kogge-Stone
Exploits idempotency to limit the fan-out to 1. Dramatic increase in wires. The wire span remains the same as in Ladner-Fischer. Buffers are needed in both cases: K-S, L-F.
150
Parallel Prefix Adders: Brent-Kung
Sets the fan-out to one. Avoids the explosion of wires (as in K-S). Makes no sense in CMOS: the fan-out = 1 limit is arbitrary and extreme, and much of the capacitive load is due to wire anyway. It is more efficient to insert buffers in L-F than to use the B-K scheme.
151
Two Parallel Prefix Adder Structures
Kogge-Stone: log(bits) carry stages; extra wiring. Han-Carlson: log(bits) + 1 carry stages; reduced wiring and gates.
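The stage and gate counts quoted above can be reproduced with a small model. The sketch below assumes the commonly described Han-Carlson arrangement (a Kogge-Stone network on the odd bit positions, plus one initial and one final level) and a power-of-two width of at least 4; the function names are hypothetical and the operator count is only a proxy for wiring and gates.

```python
def prefix_op(left, right):
    """(g, p) o (g', p') = (g + p*g', p*p'); `left` is the more significant group."""
    g, p = left
    g2, p2 = right
    return (g | (p & g2), p & p2)


def kogge_stone(n):
    """Level with distance d: every node i >= d combines with node i - d."""
    levels, d = [], 1
    while d < n:
        levels.append({i: i - d for i in range(d, n)})
        d *= 2
    return levels


def han_carlson(n):
    """Kogge-Stone on odd positions only, then one level to fix the even positions."""
    levels = [{i: i - 1 for i in range(1, n, 2)}]           # pair up odd/even bits
    d = 2
    while d < n:
        levels.append({i: i - d for i in range(d + 1, n, 2)})  # K-S on odd bits
        d *= 2
    levels.append({i: i - 1 for i in range(2, n, 2)})        # complete even bits
    return levels


def evaluate(levels, gp):
    """Apply the network to per-bit (g, p) pairs; returns (G_i, P_i) for every bit."""
    cur = list(gp)
    for level in levels:
        cur = [prefix_op(cur[i], cur[level[i]]) if i in level else cur[i]
               for i in range(len(cur))]
    return cur


def op_count(levels):
    return sum(len(level) for level in levels)


if __name__ == "__main__":
    for name, build in (("Kogge-Stone", kogge_stone), ("Han-Carlson", han_carlson)):
        net = build(16)
        print(name, "stages", len(net), "operators", op_count(net))
```

For 16 bits the model has 4 stages for Kogge-Stone and 5 for Han-Carlson, with fewer prefix operators in the latter.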
152
Parallel Prefix Adders: Han-Carlson
Is a hybrid synthesis of L-F and K-S. Trades an increase in logic depth for a reduction in fan-out: effectively a higher-radix variant of K-S. Others do it similarly, by serializing the prefix computation at the higher fan-out nodes, trading logical depth for a reduction of fan-out and wire.
153
Parallel Prefix Adders: variety of possibilities
from: Knowles; bounded by L-F and K-S at the ends
154
Parallel Prefix Adders: variety of possibilities Knowles 1999
The following rules are used: Lateral wires at the jth level span $2^j$ bits. Lateral fan-out at the jth level is a power of 2 up to $2^j$. Lateral fan-out at the jth level cannot exceed that at the (j+1)th level.
155
Parallel Prefix Adders: variety of possibilities Knowles 1999
The number of minimal-depth graphs of this type is given in: … At 4 bits there are only K-S and L-F; afterwards there are several new possibilities.
156
Parallel Prefix Adders: variety of possibilities
Knowles 1999: example of a new 32-bit adder [4,4,2,2,1]
157
Parallel Prefix Adders: variety of possibilities
Knowles 1999: example of a new 32-bit adder [4,4,2,2,1]
158
Parallel Prefix Adders: variety of possibilities Knowles 1999
Delay is given in terms of FO4 inverter delay, worst case (the nominal case is 40-50% faster). K-S is the fastest. K-S adders are wire limited (requiring 80% more area). The difference between the examined schemes is less than 15%.
159
Parallel Prefix Adders: variety of possibilities Knowles 1999
Conclusion: Irregular, hybrid schemes are possible. The speed-up of 15% is achieved at the cost of large wiring, hence area and power. Circuits close in speed to K-S are available at significantly lower wiring cost.
160
Possibilities for Further Research
The logical depth is important (Knowles was right). The fan-out is less important than fan-in (Knowles was wrong): it is possible to examine a variety of topologies with restricted and varied fan-in. Driving strength and Logical Effort rules were overlooked, or at least neglected: it is possible to create a number of topologies taking LE rules into account. It is further possible to combine the rules with a compound domino implementation, taking advantage of the two different rules governing “dynamic” and “static”. It is still possible to produce a better adder!
161
Other Types of Adders
162
Conditional Sum Adder J. Sklansky, “Conditional-Sum Addition Logic”, IRE Transactions on Electronic Computers, EC-9, p , 1960.
163
Conditional Sum Adder
from: Ercegovac-Lang
164
Conditional Sum Adder
165
Conditional Sum Adder
from: Ercegovac-Lang
166
Conditional Sum Adder
from: Ercegovac-Lang
167
Conditional Sum Adder
168
Carry-Select Adder O. J. Bedrij, “Carry-Select Adder”, IRE Transactions on Electronic Computers, June 1962, p
169
Carry-Select Sum Adder
from: Ercegovac-Lang
170
Carry-Select Adder
Addition under the assumption of Cin = 0 and Cin = 1.
The theoretically fastest scheme for addition of two numbers is "Conditional-Sum Addition", proposed by Sklansky in 1960. The essence of this scheme is the realization that we can add two numbers without waiting for the carry signal to arrive. The numbers are simply added in two instances: one assuming Cin = 0 and the other assuming Cin = 1. The conditionally produced results, Sum0, Sum1 and Carry0, Carry1, are selected by a multiplexer that uses the incoming carry signal Cin as its control. As in the Carry-Lookahead Adder, the input bits are divided into groups, which in this case are added "conditionally".
In a Conditional-Sum Adder the hardware complexity grows rapidly, starting from the Least Significant Bit position. Therefore, in practice, a full-blown implementation of the Conditional-Sum Adder is not found. However, the idea of adding the Most Significant portion of the operands conditionally, and selecting the result once the carry-in signal has been computed in the Least Significant portion, is attractive. Such a scheme, which is a subset of the Conditional-Sum Adder, is known as the "Carry-Select Adder".
The Carry-Select Adder divides the words to be added into blocks and forms two sums for each block in parallel: one with a carry-in of ZERO and the other with a carry-in of ONE. This slide shows an example of a 16-bit carry-select adder. The carry-out from the Least Significant 4-bit block controls a multiplexer that selects the sum from the Most Significant portion. The carry-out is computed using the equation for the carry-out of the group, since the group propagate signal Pi is the carry-out of an adder with a carry input of ONE and the group generate signal Gi is the carry-out of an adder with a carry input of ZERO. This speeds up the computation of the carry signal, which is needed for selection in the next block.
The upper 8 bits are computed conditionally using two Carry-Select Adders similar to the one used in the Least Significant 8-bit portion. The delay of this adder is determined by the speed of the Least Significant k-bit block (a 4-bit RCA in this example) and the delay of the multiplexers in the Most Significant path. Generally, the delay of such an adder is proportional to the logarithm of the adder size.
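The selection scheme described above can be sketched behaviorally. This is a simplified model under stated assumptions: every block, including the least significant one, is computed for both carry-in values, all blocks have the same width, and the nested structure of the slide's 16-bit example is flattened into a simple chain of blocks. The names `ripple_add` and `carry_select_add` are hypothetical.

```python
def ripple_add(a_bits, b_bits, cin):
    """Ripple-carry add of LSB-first bit lists; returns (sum_bits, carry_out)."""
    s, c = [], cin
    for a, b in zip(a_bits, b_bits):
        s.append(a ^ b ^ c)
        c = (a & b) | (c & (a ^ b))
    return s, c


def carry_select_add(a, b, width=16, block=4, cin=0):
    """Add two unsigned integers block by block; returns (sum, carry_out)."""
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    sum_bits, carry = [], cin
    for lo in range(0, width, block):
        blk_a, blk_b = a_bits[lo:lo + block], b_bits[lo:lo + block]
        s0, c0 = ripple_add(blk_a, blk_b, 0)   # result assuming carry-in = 0
        s1, c1 = ripple_add(blk_a, blk_b, 1)   # result assuming carry-in = 1
        sum_bits += s1 if carry else s0        # multiplexer driven by the real carry
        # c0 acts as the group generate and c1 as the group propagate,
        # so the block carry-out is G + P * carry_in.
        carry = c0 | (c1 & carry)
    total = sum(bit << i for i, bit in enumerate(sum_bits))
    return total, carry


if __name__ == "__main__":
    print(carry_select_add(0x1234, 0x0FF0))
```

In hardware the two conditional sums of all blocks are formed in parallel, so only the multiplexer and the short carry expression lie on the path from block to block; the loop here models the logic, not that timing.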
171
Carry Select Adder: combining two 32-b VBAs in select mode
Delay = $D_{VBA32} + D_{MUX}$
172
Carry-Select Adder
O.J. Bedrij, IBM Poughkeepsie, 1962