Asynchronous Datapath Design: Adders, Comparators, Multipliers, Registers, Completion Detection, Bus, Pipeline, …..
Asynchronous Adder Design: Motivation; Background (synchronous and asynchronous adders); Delay-insensitive carry-lookahead adders; Complexity analysis; Conclusions
Motivation: Integer addition is one of the most important operations in digital computer systems. Statistics show that in a prototypical RISC machine (DLX), 72% of the instructions perform additions (or subtractions) in the datapath. In ARM processors the figure reaches 80%. The performance of a processor is therefore significantly influenced by the speed of its adders.
Background: Adders are either synchronous or asynchronous. Synchronous adders run at worst-case performance; asynchronous adders run at average-case performance. For example: Ripple-Carry Adders (synchronous): O(n) time; Carry-Completion Sensing Adders (asynchronous): O(log n) time on average.
Background: Binary Addition. Adders can exhibit average-case behavior. [Figure: worst-case and best-case addition examples, each with sum (S) and carry (C) rows]
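The gap between worst-case and average-case addition can be illustrated with a small experiment. The sketch below is not taken from the slides: it assumes that the time to settle is governed by the longest run of consecutive "propagate" positions (bits where the two operands differ), and the function names are hypothetical.

```python
import random


def longest_propagate_run(a, b, n):
    """Longest run of consecutive bit positions where a and b differ.

    In such positions the carry-out equals the carry-in, so the carry
    has to travel through the whole run before the sum is valid.
    """
    x = (a ^ b) & ((1 << n) - 1)
    longest = run = 0
    for i in range(n):
        if (x >> i) & 1:
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest


def average_run(n, trials=10000, seed=0):
    """Average longest propagate run over random n-bit operand pairs."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += longest_propagate_run(rng.getrandbits(n), rng.getrandbits(n), n)
    return total / trials


if __name__ == "__main__":
    for n in (8, 16, 32, 64):
        print(n, average_run(n))
```

For uniformly random operands the average grows far more slowly than n, which is the behavior an asynchronous adder can exploit. The worst case (every position propagates) still has a run of length n.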
Background: Ripple-Carry Adders (RCA), built from one-stage full adders. Logic complexity: O(n). Time complexity: O(n).
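A bit-level sketch of a ripple-carry adder makes the O(n) chain explicit: each stage waits for the carry of the stage below it. The function names are illustrative, not from the slides.

```python
def full_adder(a, b, cin):
    """One-stage full adder on single bits: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(a, b, n, cin=0):
    """Add two n-bit integers stage by stage; returns (sum, carry_out)."""
    carry = cin
    result = 0
    for i in range(n):  # n sequential stages, hence O(n) delay
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry
```

For example, `ripple_carry_add(15, 1, 4)` wraps around to a zero sum with a carry-out of 1, after the carry has rippled through all four stages.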
Background: Carry-Sensing Completion Detection (CSCD) Adders (asynchronous version of RCA)
Background: One-stage CSCD Adder. Carry-Sensing Completion Detection Adders: Logic complexity: O(n). Time complexity: O(log n) (average case).
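The behavior of such an adder can be approximated with a unit-delay model. This is a behavioral sketch under stated assumptions, not the circuit from the slides: stages whose operand bits are equal announce their carry-out immediately, propagate stages forward a known carry-in after one time step, and completion is signaled once every carry is known. The name `cscd_add` is hypothetical.

```python
def cscd_add(a, b, n, cin=0):
    """Behavioral model of carry-completion sensing.

    Returns (sum, carry_out, steps), where steps is the number of unit
    delays before every stage knows its carry.
    """
    abits = [(a >> i) & 1 for i in range(n)]
    bbits = [(b >> i) & 1 for i in range(n)]

    # carry[i] is the carry INTO stage i; None means "not yet known".
    carry = [None] * (n + 1)
    carry[0] = cin

    # Kill (0,0) and generate (1,1) stages need no carry-in.
    for i in range(n):
        if abits[i] == bbits[i]:
            carry[i + 1] = abits[i]

    steps = 0
    while any(c is None for c in carry):  # completion not yet detected
        nxt = list(carry)
        for i in range(n):
            if carry[i + 1] is None and carry[i] is not None:
                nxt[i + 1] = carry[i]  # propagate stage forwards carry-in
        carry = nxt
        steps += 1

    total = sum((abits[i] ^ bbits[i] ^ carry[i]) << i for i in range(n))
    return total, carry[n], steps
```

In this model the step count equals the longest run of propagate stages, so equal operands finish in zero steps while an all-propagate input needs n steps.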
Background: Delay-Insensitive Ripple-Carry Adders (DIRCA), the DI version of RCA
Background: One-stage DIRCA. DIRCA adders: Logic complexity: O(n). Time complexity: O(log n). One of the most robust adders.
Background: Completion detection for asynchronous adders
Background: DI adder vs. bundling-constraint adder
Carry-Lookahead Adders: An RCA requires n stage-propagation delays. For high-speed processors this is undesirable. One way to improve adder performance is to compute the carries in parallel, which is why Carry-Lookahead Adders (CLA) were introduced. CLAs: Logic complexity: O(n). Time complexity: O(log n).
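The parallel carry computation can be sketched as a prefix computation over (generate, propagate) pairs. Note the assumption: the sketch uses a simple doubling-distance prefix network because it is short to write; it reaches O(log n) levels but uses more combine operations than the O(n)-logic tree of A and B modules the slides refer to. The name `prefix_cla_add` is hypothetical.

```python
def prefix_cla_add(a, b, n, cin=0):
    """Carry-lookahead addition via parallel prefix.

    Returns (sum, carry_out, levels), where levels counts the combine
    levels, each of which could be evaluated in parallel in hardware.
    """
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]  # bit generates
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]  # bit propagates

    G, P = list(g), list(p)
    levels = 0
    d = 1
    while d < n:
        new_g, new_p = list(G), list(P)
        for i in range(d, n):
            # Merge the group ending at i with the group ending at i - d.
            new_g[i] = G[i] | (P[i] & G[i - d])
            new_p[i] = P[i] & P[i - d]
        G, P = new_g, new_p
        d *= 2
        levels += 1

    # G[i], P[i] now describe bits 0..i; carries[i] is the carry into bit i.
    carries = [cin] + [G[i] | (P[i] & cin) for i in range(n)]
    total = sum((p[i] ^ carries[i]) << i for i in range(n))
    return total, carries[n], levels
```

With n = 32 the loop runs for distances 1, 2, 4, 8, and 16, so five levels replace the 32 sequential stages of a ripple-carry adder.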
Carry-Lookahead Adders
A module: B module:
DI Carry-Lookahead Adders: Delay-Insensitive Carry-Lookahead Adders (DICLA) may be implemented using delay-insensitive codes.
1. Dual-rail signaling (inputs, sums, and carry bits):
   a. A1=0 A0=0: no data
   b. A1=0 A0=1: valid 0
   c. A1=1 A0=0: valid 1
   d. A1=1 A0=1: illegal
2. One-hot code (internal signals):
   a. No data: 000
   b. 001
   c. 010
   d. 100
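The dual-rail code can be sketched in a few lines. The pairing of rail values to meanings follows the list above (read in order: no data, valid 0, valid 1, illegal); the helper names are hypothetical and the word-level completion check is a software stand-in for a hardware completion detector.

```python
NO_DATA = (0, 0)  # (A1, A0) spacer: nothing has arrived yet


def encode(bit):
    """Encode one bit as a dual-rail pair (A1, A0)."""
    return (1, 0) if bit else (0, 1)


def decode(pair):
    """Decode a dual-rail pair; None means 'no data yet'."""
    if pair == (0, 1):
        return 0
    if pair == (1, 0):
        return 1
    if pair == NO_DATA:
        return None
    raise ValueError("illegal dual-rail codeword (1, 1)")


def word_complete(pairs):
    """True once every bit of the word carries a valid codeword."""
    return all(a1 ^ a0 for a1, a0 in pairs)
```

A receiver never needs a clock or a timing assumption here: it waits until `word_complete` holds, reads the data, and the sender then returns every pair to `NO_DATA` before the next word.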
QDI Carry-Lookahead Adders: DI C module (counterpart of the CLA A module). 1. Internal signals: one-hot code, k, g, p. 2. Input and sum bits: dual-rail signals.
QDI Carry-Lookahead Adders: DI D module (counterpart of the CLA B module). 1. Internal signals: one-hot code, K, G, P. 2. Carry bits: dual-rail signals.
DI Carry-Lookahead Adders
If A_3 = B_3, then C_3 is either carry-kill or carry-generate (k_3, g_3), whatever the lower bits are.
DI Carry-Lookahead Adders: The group signals G_{3,2} and K_{3,2} can be used to speed up the carry computation too. [Figure labels: k_3, g_3; K_{3,2}, G_{3,2}]
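How a group signal such as G_{3,2} or K_{3,2} follows from the per-bit signals can be sketched with one-hot symbols. The representation (the strings "k", "g", "p") and the function names are illustrative assumptions; the combining rule itself is the usual one, where a propagating upper part defers to the lower part.

```python
def stage_signal(a_bit, b_bit):
    """One-hot k/g/p signal of a single bit position."""
    if a_bit != b_bit:
        return "p"  # propagate: carry-out equals carry-in
    return "g" if a_bit else "k"  # generate on (1,1), kill on (0,0)


def combine(high, low):
    """Signal of a group formed from a higher and a lower part."""
    return low if high == "p" else high


def group_signal(a, b, hi, lo):
    """k/g/p signal of the bit range hi..lo (inclusive, hi >= lo)."""
    sig = stage_signal((a >> lo) & 1, (b >> lo) & 1)
    for i in range(lo + 1, hi + 1):
        sig = combine(stage_signal((a >> i) & 1, (b >> i) & 1), sig)
    return sig
```

For instance, with a = 0b1100 and b = 0b0100, bit 3 propagates and bit 2 generates, so `group_signal(a, b, 3, 2)` is "g": the group produces a carry without waiting for bits 1 and 0.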
Speeding Up DICLA. Idea: Send the carry-generate and carry-kill signals to every stage that needs them, so that each such stage can compute its carry immediately. [Figure: D module with speed-up circuitry]
Speeding Up DICLA. General form of the D module with speed-up circuitry, for carry-kill and for carry-generate. For carry-generate: $g_{j-1} + g_{j-2}\,p_{j-1} + \dots + g_0\,p_1 p_2 \cdots p_{j-1}$. This is in fact the full carry-lookahead scheme.
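The sum-of-products form can be checked directly. In the sketch below the carry-generate expression follows the formula above; using the same expression with k in place of g for carry-kill is an assumption made by symmetry, since the slide's carry-kill formula did not survive. Function names are hypothetical.

```python
def bit_signals(a, b, n):
    """Per-bit kill, generate, and propagate lists for n-bit operands."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    k = [1 - (g[i] | p[i]) for i in range(n)]
    return k, g, p


def lookahead(sig, p, j):
    """sig[j-1] + sig[j-2] p[j-1] + ... + sig[0] p[1] p[2] ... p[j-1]."""
    result = 0
    prod = 1  # product of the propagate bits above position i
    for i in range(j - 1, -1, -1):
        result |= sig[i] & prod
        prod &= p[i]
    return result


def carry_into(a, b, n, j):
    """Return (generate, kill) for the carry entering stage j, 1 <= j <= n."""
    k, g, p = bit_signals(a, b, n)
    return lookahead(g, p, j), lookahead(k, p, j)
```

Each term has up to j factors and the expression has j terms, which is where the fan-in, fan-out, and wiring costs of the full scheme come from.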
Speeding Up DICLA. Problems of the full carry-lookahead scheme: practical limitations on fan-in and fan-out, irregular structure, and many long wires; logic complexity grows more than linearly. Solution: use the properties of the tree-like structure in new speed-up circuitry.
SP focuses on the root node of a subtree. All leftmost root node of its right subtree
Power of Speed-up Circuitry: x is the length of the carry chain, of which x' lies in the right subtree and x - x' in the left subtree.
Power of Speed-up Circuitry Without Speed-up circuitry
Power of Speed-up Circuitry With Speed-up circuitry
Optimization: Simplified D module (D' module). Better logic complexity. Delay-insensitive again.
Complexity Analysis: DICLASP. Logic complexity: Θ(n). Time complexity: Θ(log log n). Best area-time efficiency: Θ(n log log n).
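To get a feel for what these orders of growth mean, the sketch below multiplies the logic and time terms quoted in these slides for RCA, CLA, and DICLASP. It is arithmetic on asymptotic terms only: all constant factors are assumed to be 1, and the comparison mixes worst-case figures for the synchronous adders with the figure quoted for DICLASP, so the numbers are not delay or area predictions.

```python
import math


def area_time(n):
    """Area-time products from the asymptotic terms alone (n >= 4)."""
    return {
        "RCA": n * n,                              # O(n) logic, O(n) time
        "CLA": n * math.log2(n),                   # O(n) logic, O(log n) time
        "DICLASP": n * math.log2(math.log2(n)),    # n log log n
    }


if __name__ == "__main__":
    for n in (16, 32, 64):
        print(n, area_time(n))
```

For n = 32 the three terms are 1024, 160, and roughly 74, which shows how slowly the log log n factor grows compared with log n.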
Complexity Analysis
CMOS: C module
CMOS: SD module
CMOS: SD’ module
SPICE Simulation: The SPICE simulation has two parts. Random-number inputs: randomly generated input pairs. Statistical data: running examples on a 32-bit ARM emulator.
SPICE Simulation: Random number input distribution
SPICE Simulation results, random-number inputs. Speedup: DIRCA vs. RCA: 6.39; DICLASP vs. CLA: 2.64.
SPICE Simulation: Breakdown of addition/subtraction operations, obtained by running three benchmark programs (Dhrystone f1, Dhrystone f2, and Espresso dc2) on a 32-bit ARM simulator.
SPICE Simulation: dynamic traces
SPICE Simulation: dynamic traces. For 83.92% of instructions, |carry chain| < 17.
SPICE Simulation results, dynamic traces. Average computation time: DIRCA 9.61 ns; DICLASP 5.25 ns. Speedup: DIRCA vs. RCA: 4.1; DICLASP vs. CLA: 2.2.
Conclusion: DICLASP. Best area-time efficiency: Θ(n log log n). Correctness: no adder is more robust than DICLASP. Cost (logic complexity): no parallel adder is cheaper than DICLASP (Θ(n)). Speed (time complexity): no adder is faster than DICLASP (Θ(log log n)). Suitable for VLSI implementation.