Introduction to asynchronous circuit design: specification and synthesis Jordi Cortadella, Universitat Politècnica de Catalunya, Spain Michael Kishinevsky, Intel Corporation, USA Alex Kondratyev, Theseus Logic, USA Luciano Lavagno, Università di Udine, Italy
Outline I: Introduction to basic concepts on asynchronous design II: Synthesis of control circuits from STGs III: Advanced topics on synthesis of control circuits from STGs IV: Synthesis from HDL and other synthesis paradigms Note: no references in the tutorial
Introduction to asynchronous circuit design: specification and synthesis Part I: Introduction to basic concepts on asynchronous circuit design
Outline What is an asynchronous circuit ? Asynchronous communication Asynchronous logic blocks Micropipelines Control specification and implementation Delay models Why asynchronous circuits ?
Synchronous circuit RRRRCL CLK Implicit synchronization
Asynchronous circuit RRRRCL Explicit synchronization: Req/Ack handshakes Req Ack
Synchronous communication Clock edges determine the time instants where data must be sampled Data wires may glitch between clock edges (set-up/hold times must be satisfied) Data are transmitted at a fixed rate (clock frequency)
Dual rail Two wires per bit –“00” = spacer, “01” = 0, “10” = 1 n-bit data communication requires 2n wires Each bit is self-timed Other delay-insensitive codes exist
Bundled data Validity signal –Similar to an aperiodic local clock n-bit data communication requires n+1 wires Data wires may glitch when no valid Signaling protocols –level sensitive (latch) –transition sensitive (register): 2-phase / 4-phase
Example: memory read cycle Transition signaling, 4-phase Valid address Address Valid data Data AA DD
Example: memory read cycle Transition signaling, 2-phase Valid address Address Valid data Data AA DD
Outline What is an asynchronous circuit ? Asynchronous communication Asynchronous logic blocks Micropipelines Control specification and implementation Delay models Why asynchronous circuits ?
Asynchronous modules Signaling protocol: reqin+ start+ [computation] done+ reqout+ ackout+ ackin+ reqin- start- [reset] done- reqout- ackout- ackin- (more concurrency is also possible, e.g. by overlapping the return-to- zero phase of step i-1 with the evaluation phase of step i) Data INData OUT req inreq out ack inack out DATA PATH CONTROL startdone
Completion detection Cdone Completion detection tree
Asynchronous latches: C element C A B Z A B Z Z 1 0 Z Vdd Gnd A A A AB B B B Z Z Z
Dual-rail logic A.t A.f B.t B.f C.t C.f Dual-rail AND gate Valid behavior for monotonic environment
Differential cascode voltage switch logic start A.t B.t C.t A.fB.f C.f Z.tZ.f done 3-input AND/NAND gate
Bundled-data logic blocks delay startdone logic Conventional logic + matched delay
Micropipelines (Sutherland 89) LLLLlogic R in A out C C C C R out A in delay
Data-path / Control LLLLlogic R in R out CONTROL A in A out
Outline What is an asynchronous circuit ? Asynchronous communication Asynchronous logic blocks Micropipelines Control specification and implementation Delay models Why asynchronous circuits ?
Control specification A+ B+ A- B- A B A input B output
Control specification A+ B+ A- B- A B
Control specification A+ B- A- B+ A B
Control specification A+ C- A- C+ A C B+ B- B C
Control specification A+ C- A- C+ A C B+ B- B C
Control specification C C Ri Ro Ai Ao Ri+ Ao+ Ri- Ao- Ro+ Ai+ Ro- Ai- Ri Ro Ao Ai FIFO cntrl
A simple filter: specification y := 0; loop x := READ (IN); WRITE (OUT, (x+y)/2); y := x; end loop R in A in A out R out IN OUT filter
A simple filter: block diagram xy + control R in A in R out A out RxRx AxAx RyRy AyAy RaRa AaAa IN OUT x and y are level-sensitive latches (transparent when R=1) + is a bundled-data adder (matched delay between R a and A a ) R in indicates the validity of IN After A in + the environment is allowed to change IN (R out,A out ) control a level-sensitive latch at the output
A simple filter: control spec. xy + control R in A in R out A out RxRx AxAx RyRy AyAy RaRa AaAa IN OUT R in + A in + R in - A in - Rx+Rx+ Ax+Ax+ Rx-Rx- Ax-Ax- Ry+Ry+ Ay+Ay+ Ry-Ry- Ay-Ay- Ra+Ra+ Aa+Aa+ Ra-Ra- Aa-Aa- R out + A out + R out - A out -
A simple filter: control impl. R in + A in + R in - A in - Rx+Rx+ Ax+Ax+ Rx-Rx- Ax-Ax- Ry+Ry+ Ay+Ay+ Ry-Ry- Ay-Ay- Ra+Ra+ Aa+Aa+ Ra-Ra- Aa-Aa- R out + A out + R out - A out - C R in A in RxRx AxAx RyRy AyAy AaAa RaRa A out R out
Control: observable behavior Rx+Rx+ R in + Ax+Ax+Ra+Ra+Aa+Aa+R out +A out +z+R out -A out -Ry+Ry+ Ry-Ry- Ay+Ay+ Rx-Rx-Ax-Ax- Ay-Ay- A in - A in + Ra-Ra- R in - Aa-Aa- z- C R in A in RxRx AxAx RyRy AyAy AaAa RaRa A out R out z
Outline What is an asynchronous circuit ? Asynchronous communication Asynchronous logic blocks Micropipelines Control specification and implementation Delay models Why asynchronous circuits ?
Taking delays into account x+ x- y+ y- z+ z- x z y x’ z’ Delay assumptions: Environment: 3 times units Gates: 1 time unit events: x+ x’- y+ z+ z’- x- x’+ z- z’+ y- time:
Taking delays into account x+ x- y+ y- z+ z- x z y x’ z’ Delay assumptions: unbounded delays events: x+ x’- y+ z+ x- x’+ y- time: very slow failure !
Gate vs wire delay models Gate delay model: delays in gates, no delays in wires Wire delay model: delays in gates and wires
Delay models for async. circuits Bounded delays (BD): realistic for gates and wires. –Technology mapping is easy, verification is difficult Speed independent (SI): Unbounded (pessimistic) delays for gates and “negligible” (optimistic) delays for wires. –Technology mapping is more difficult, verification is easy Delay insensitive (DI): Unbounded (pessimistic) delays for gates and wires. –DI class (built out of basic gates) is almost empty Quasi-delay insensitive (QDI): Delay insensitive except for critical wire forks (isochronic forks). –Formally, it is the same as speed independent –In practice, different synthesis strategies are used BD SI QDI DI
Motivation (designer’s view) Modularity –Plug-and-play interconnectivity Reusability –IPs with abstract timing behaviors High-performance –Average-case performance (no worst-case delay synchronization) –No clock skew (local timing assumptions instead) Many interfaces are asynchronous –Buses, networks,...
Motivation (technology aspects) Low power –Automatic clock gating Electromagnetic compatibility –No peak currents around clock edges Robustness –High immunity to technology and environment variations (in-die variations, temperature, power supply,...)
Problems Concurrent models for specification –CSP, Petri nets,... Difficult to design –Hazards, synchronization Complex timing analysis –Difficult to estimate performance Difficult to test –No way to stop the clock
But we have some success stories... Philips AMULET microprocessors Sharp Intel (RAPPID) IBM (interlocked pipeline) Start-up companies: –Theseus Logic, Cogency, ADD...
Introduction to asynchronous circuit design: specification and synthesis Part II: Synthesis of control circuits from STGs
Outline Overview of the synthesis flow Specification State graph and next-state functions State encoding Implementability conditions Speed-independent circuit –Complex gates –C-element architecture
Specification (STG) State Graph SG with CSC Next-state functions Decomposed functions Gate netlist Reachability analysis State encoding Boolean minimization Logic decomposition Technology mapping Designflow
x y z x+ x- y+ y- z+ z- Signal Transition Graph (STG) x y z
x y z x+ x- y+ y- z+ z-
x+ x- y+ y- z+ z- xyz 000 x+ 100 y+ z+ y x y+ z- 010 y-
xyz 000 x+ 100 y+ z+ y x y+ z- 010 y- Next-state functions
x z y
Outline Overview of the synthesis flow Specification State graph and next-state functions State encoding Implementability conditions Speed-independent circuit –Complex gates –C-element architecture
Specification (STG) State Graph SG with CSC Next-state functions Decomposed functions Gate netlist Reachability analysis State encoding Boolean minimization Logic decomposition Technology mapping Designflow
VME bus Device LDS LDTACK D DSr DSw DTACK VME Bus Controller Data Transceiver Bus DSr LDS LDTACK D DTACK Read Cycle
STG for the READ cycle LDS+LDTACK+D+DTACK+DSr-D- DTACK- LDS-LDTACK- DSr+ LDS LDTACK D DSr DTACK VME Bus Controller
Choice: Read and Write cycles DSr+ LDS+ LDTACK+ D+ DTACK+ DSr- D- LDS- LDTACK-DTACK- DSw+ D+ LDS+ LDTACK+ D- DTACK+ DSw- LDS- LDTACK-DTACK-
Choice: Read and Write cycles DTACK- DSr+ LDS+ LDTACK+ D+ DTACK+ DSr- D- LDS- LDTACK- DSw+ D+ LDS+ LDTACK+ D- DTACK+ DSw- LDS- LDTACK-DTACK-
Choice: Read and Write cycles DTACK- DSr+ LDS+ LDTACK+ D+ DTACK+ DSr- D- LDS- LDTACK- DSw+ D+ LDS+ LDTACK+ D- DTACK+ DSw- LDS- LDTACK-DTACK-
Choice: Read and Write cycles DTACK- DSr+ LDS+ LDTACK+ D+ DTACK+ DSr- D- LDS- LDTACK- DSw+ D+ LDS+ LDTACK+ D- DTACK+ DSw- LDS- LDTACK-DTACK-
Circuit synthesis Goal: –Derive a hazard-free circuit under a given delay model and mode of operation
Outline Overview of the synthesis flow Specification State graph and next-state functions State encoding Implementability conditions Speed-independent circuit –Complex gates –C-element architecture
Specification (STG) State Graph SG with CSC Next-state functions Decomposed functions Gate netlist Reachability analysis State encoding Boolean minimization Logic decomposition Technology mapping Designflow
STG for the READ cycle LDS+LDTACK+D+DTACK+DSr-D- DTACK- LDS-LDTACK- DSr+ LDS LDTACK D DSr DTACK VME Bus Controller
Binary encoding of signals DSr+ DTACK- LDS- LDTACK- D- DSr-DTACK+ D+ LDTACK+ LDS+
Binary encoding of signals DSr+ DTACK- LDS- LDTACK- D- DSr-DTACK+ D+ LDTACK+ LDS (DSr, DTACK, LDTACK, LDS, D)
QR (LDS+) QR (LDS-) Excitation / Quiescent Regions ER (LDS+) ER (LDS-) LDS- LDS+ LDS-
Next-state function 0 1 LDS- LDS+ LDS- 1 0 0 0 1
Karnaugh map for LDS DTACK DSr D LDTACK DTACK DSr D LDTACK LDS = 0 LDS = /1?
Outline Overview of the synthesis flow Specification State graph and next-state functions State encoding Implementability conditions Speed-independent circuit –Complex gates –C-element architecture
Specification (STG) State Graph SG with CSC Next-state functions Decomposed functions Gate netlist Reachability analysis State encoding Boolean minimization Logic decomposition Technology mapping Designflow
Concurrency reduction LDS- LDS+ LDS DSr+
Concurrency reduction LDS+LDTACK+D+DTACK+DSr-D- DTACK- LDS-LDTACK- DSr+
State encoding conflicts LDS- LDTACK- LDTACK+ LDS
Signal Insertion LDS- LDTACK- D- DSr- LDTACK+ LDS+ CSC- CSC
Outline Overview of the synthesis flow Specification State graph and next-state functions State encoding Implementability conditions Speed-independent circuit –Complex gates –C-element architecture
Specification (STG) State Graph SG with CSC Next-state functions Decomposed functions Gate netlist Reachability analysis State encoding Boolean minimization Logic decomposition Technology mapping Designflow
Complex-gate implementation Under what conditions does a hazard-free implementation exist?
Implementability conditions Consistency –Rising and falling transitions of each signal alternate in any trace Complete state coding (CSC) –Next-state functions correctly defined Persistency –No event can be disabled by another event (unless they are both inputs)
Implementability conditions Consistency + CSC + persistency There exists a speed-independent circuit that implements the behavior of the STG (under the assumption that any Boolean function can be implemented with one complex gate)
Persistency a- c+ b+b+ b+b+ a c b a c b is this a pulse ? Speed independence glitch-free output behavior under any delay
Speed-independent implementations How can the implementability conditions –Consistency –Complete state coding –Persistency be satisfied? Standard circuit architectures: –Complex (hazard-free) gates –C elements with monotonic covers –“Standard” gates and latches
a+ b+ c+ d+ a- b- d- a+ c-a a+ b+ c+ a- b- c- a+ c- a- d- d+
a+ b+ c+ a- b- c- a+ c- a- d- d+ ab cd ER(d+) ER(d-)
ab cd a+ b+ c+ a- b- c- a+ c- a- d- d+ Complex gate
Implementation with C elements C R S z S+ z+ S- R+ z- R- S (set) and R (reset) must be mutually exclusive S must cover ER(z+) and must not intersect ER(z-) QR(z-) R must cover ER(z-) and must not intersect ER(z+) QR(z+)
ab cd a+ b+ c+ a- b- c- a+ c- a- d- d+ C S R d
a+ b+ c+ a- b- c- a+ c- a- d- d+ C S R d but...
a+ b+ c+ a- b- c- a+ c- a- d- d+ C S R d Assume that R=ac has an unbounded delay Starting from state 0000 (R=1 and S=0): a+ ; R- ; b+ ; a- ; c+ ; S+ ; d+ ; R+ disabled (potential glitch)
ab cd a+ b+ c+ a- b- c- a+ c- a- d- d+ C S R d Monotonic covers
C-based implementations C S R d C d a b c a b c d weak generalized C element (gC)
Synthesis exercise y- z-w- y+x+ z+ x- w y- y+ x- x+ w+ w- z+ z- w- z- y+ x+ Derive circuits for signals x and z (complex gates and monotonic covers)
Synthesis exercise y- y+ x- x+ w+ w- z+ z- w- z- y+ x+ wx yz Signal x
Synthesis exercise y- y+ x- x+ w+ w- z+ z- w- z- y+ x+ wx yz Signal z
Introduction to asynchronous circuit design: specification and synthesis Part III: Advanced topics on synthesis of control circuits from STGs
Outline Logic decomposition –Hazard-free decomposition –Signal insertion –Technology mapping Optimization based on timing information –Relative timing –Timing assumptions and constraints –Automatic generation of timing assumptions
Specification (STG) State Graph SG with CSC Next-state functions Decomposed functions Gate netlist Reachability analysis State encoding Boolean minimization Logic decomposition Technology mapping Designflow
No Hazards a b c x 0 abcx b a c
Decomposition May Lead to Hazards abcx b a c+ a b z c x
Decomposition Acknowledgement Generating candidates Hazard-free signal insertion –Event insertion –Signal insertion
Global acknowledgement a b c z a b d y d-b+d+y+a-y-c+d- c-d+z-b-z+c+a+c-
a b c z a b d y How about 2-input gates ? d-b+d+y+a-y-c+d- c-d+z-b-z+c+a+c-
a b c z a b d y d-b+d+y+a-y-c+d- c-d+z-b-z+c+a+c- How about 2-input gates ?
a b c z a b d y 0 0 d-b+d+y+a-y-c+d- c-d+z-b-z+c+a+c- How about 2-input gates ?
a b c z a b d y d-b+d+y+a-y-c+d- c-d+z-b-z+c+a+c- How about 2-input gates ?
c z d y a b d-b+d+y+a-y-c+d- c-d+z-b-z+c+a+c- How about 2-input gates ?
Strategy for logic decomposition Each decomposition defines a new internal signal Method: Insert new internal signals such that –After resynthesis, some large gates are decomposed –The new specification is hazard-free Generate candidates for decomposition using standard logic factorization techniques: –Algebraic factorization –Boolean factorization (boolean relations)
y- z-w- y+x+ z+ x- w+ Decomposition example y- y+ x- x+ w+ w- z+ z- w- z- y+ x+ wxyz
yz=1 yz= y- y+ x- x+ w+ w- z+ z- w- z- y+ x y- y+ x- x+ w+ w- z+ z- w- z- y+ x+ C C x y x y w z x y z y z w z w z y
y- z-w- y+x+ z+ x- w+ s- s+ s- s+ s- s=1 s= y+ x- w+ z+ z x+ w- z- y+ x y+ z y-
s- s+ s- s=1 s= y+ x- w+ z+ z x+ w- z- y+ x y+ z C C x y x y w z x y z w z w z y s y-
C C x y x y w z x y z y z w z w z y yz=1yz= y- y+ x- x+ w+ w- z+ z- w- z- y+ x y- y+ x- x+ w+ w- z+ z- w- z- y+ x+ 1001
s- s+ s=1 s= x- w+ z x+ w- z- y+ x y+ z y- z-w- y+x+ z+ x- w+ s- s+ z- is delayed by the new transition s- !
C C x y x y w z x y z w z w z yyyyyyy s- s+ s=1 s= x- w+ z x+ w- z- y+ x y+ z y-
F C Sr D Decomposition (Algebraic, Boolean relations) Hazard-free ? (Event insertion) NO YES C C C C Sr D D
F C D Hazard-free ? (Event insertion) NO YES C C Sr D until no more progress Decomposition (Algebraic, Boolean relations)
Signal insertion for function F State Graph F=0F=1 Insertion by input borders F- F+
Event insertion a b ER(x) c
Event insertion a b ER(x) c x x x x b SR(x) a
Properties to preserve a a b b a a b b a a b b x a a b b a a b b b a a b b x x a is persistent a is disabled by b = hazards
Boolean decomposition F x1x1 xnxn f HG x1x1 xnxn h1h1 hmhm f f = F (x 1,…,x n )f = G(H(x 1,…,x n )) Our problem: Given F and G, find H
C h1h1 h2h2 f state f next(f) (h 1,h 2 ) s (0,-) (-,0) s (1,1) s (0,0) s (-,1) (1,-) dc - - (-,-) This is a Boolean Relation
y- a+c- d- a- c+ a+ y+ a- c- d+ c+ y a c d F Rs y R S
y- a+c- d- a- c+ a+ y+ a- c- d+ c+ y a c d Rs y a c d c d
y- a+c- d- a- c+ a+ y+ a- c- d+ c+ y a c d Rs ya
y- a+c- d- a- c+ a+ y+ a- c- d+ c+ y a c d Rs ya D d c
Technology mapping Merging small gates into larger gates introduces no new hazards Standard synchronous technique can be applied, e.g. BDD-based boolean matching Handles sequential gates and combinational feedbacks Due to hazards there is no guarantee to find correct mapping (some gates cannot be decomposed) Timing-aware decomposition can be applied in these rare cases
Specification (STG) State Graph SG with CSC Next-state functions Decomposed functions Gate netlist Reachability analysis State encoding Boolean minimization Logic decomposition Technology mapping Designflow
Timing assumptions in design flow Speed-independent: wire delays after a fork smaller than fan-out gate delays Burst-mode: circuit stabilizes between two changes at the inputs Timed circuits: Absolute bounds on gate / environment delays are known a priori (before physical design)
Relative Timing Circuits Assumptions: “a before b” –for concurrent events: reduces reachable state space –for ordered events: permits early enabling –both increase don’t care space for logic synthesis => simplify logic (better area and timing) “Assume - if useful - guarantee” approach: assumptions are used by the tool to derive a circuit and required timing constraints that must be met in physical design flow Applied to design of the Rotating Asynchronous Pentium Processor(TM) Instruction Decoder (K.Stevens, S.Rotem et al. Intel Corporation)
Speed-independent C-element Relative Timing Asynchronous Circuits a- before b- Timing assumption (on environment): a b c RT C-element: faster,smaller; correct only under timing constraint: a- before b- a b c
State Graph (Read cycle) DSr+ DTACK- LDS- LDTACK- D- DSr-DTACK+ D+ LDTACK+ LDS+
Lazy Transition Systems ER (LDS+) ER (LDS-) LDS- LDS+ LDS- DTACK- FR (LDS-) Event LDS- is lazy: firing = subset of enabling
Timing assumptions (a before b) for concurrent events: concurrency reduction for firing and enabling (a before b) f or ordered events: early enabling (a simultaneous to b wrt c) for triples of events: combination of the above
Speed-independent Netlist LDS+LDTACK+D+DTACK+DSr-D- DTACK- LDS-LDTACK- DSr+ DTACK D DSr LDS LDTACK csc map
Adding timing assumptions (I) LDS+LDTACK+D+DTACK+DSr-D- DTACK- LDS-LDTACK- DSr+ DTACK D DSr LDS LDTACK csc map LDTACK- before DSr+ FAST SLOW
Adding timing assumptions (I) DTACK D DSr LDS LDTACK csc map LDS+LDTACK+D+DTACK+DSr-D- DTACK- LDS-LDTACK- DSr+ LDTACK- before DSr+
State space domain LDTACK- before DSr+ LDTACK- DSr+
State space domain LDTACK- before DSr+ LDTACK- DSr+
State space domain LDTACK- before DSr+ LDTACK- DSr+ Two more unreachable states
Boolean domain DTACK DSr D LDTACK DTACK DSr D LDTACK LDS = 0 LDS = /1?
Boolean domain DTACK DSr D LDTACK DTACK DSr D LDTACK LDS = 0 LDS = One more DC vector for all signalsOne state conflict is removed
Netlist with one constraint LDS+LDTACK+D+DTACK+DSr-D- DTACK- LDS-LDTACK- DSr+ DTACK D DSr LDS LDTACK csc map
Netlist with one constraint LDS+LDTACK+D+DTACK+DSr-D- DTACK- LDS-LDTACK- DSr+ DTACK D DSr LDS LDTACK LDTACK- before DSr+ TIMING CONSTRAINT
Timing assumptions (a before b) for concurrent events: concurrency reduction for firing and enabling (a before b) f or ordered events: early enabling (a simultaneous to b wrt c) for triples of events: combination of the above
Ordered events: early enabling a c b a a c b a b b c c F G Logic for gate c may change
Adding timing assumptions (II) LDS+LDTACK+D+DTACK+DSr-D- DTACK- LDS-LDTACK- DSr+ DTACK D DSr LDS LDTACK D- before LDS-
State space domain LDS- D- Reachable space is unchanged For LDS- enabling can be changed in one state D- before LDS- Potential enabling for LDS- DSr-
Boolean domain DTACK DSr D LDTACK DTACK DSr D LDTACK LDS = 0 LDS =
Boolean domain DTACK DSr D LDTACK DTACK DSr D LDTACK LDS = 0 LDS = One more DC vector for one signal: LDS If used: LDS = DSr, otherwise: LDS = DSr + D
Before early enabling LDS+LDTACK+D+DTACK+DSr-D- DTACK- LDS-LDTACK- DSr+ DTACK D DSr LDS LDTACK
Netlist with two constraints LDS+LDTACK+D+DTACK+DSr-D- DTACK- LDS-LDTACK- DSr+ LDTACK- before DSr+ and D- before LDS- TIMING CONSTRAINTS DTACK D DSr LDS LDTACK Both timing assumptions are used for optimization and become constraints
Rule I (out of 6): a,b - non-input events –Untimed ordering: a||b and a enabled before b, but not vice versa –Derived assumption: a fires before b –Justification: delay of a gate can be made shorter than delay of two (or more) gates: del(a) < del(c)+del(b) Deriving automatic timing assumptions aaa b b b c c
Rule I (out of 6): a,b - non-input events –Untimed ordering: (a||b) and (a enabled before b), but not vice versa –Derived assumption: a fires before b –Justification: delay of a gate can be made shorter than delay of two (or more) gates Deriving automatic timing assumptions aaa b b b c c –Effect I: a state becomes DC for all signals
Rule I (out of 6): a,b - non-input events –Untimed ordering: (a||b) and (a enabled before b), but not vice versa –Derived assumption: a fires before b –Justification: delay of a gate can be made shorter than delay of two (or more) gates Deriving automatic timing assumptions aaa b b b c c –Effect II: another state becomes local DC for signal of event b
Backannotation of Timing Constraints Timed circuits require post-verification Can synthesis tools help ? –Report the least stringent set of timing constraints required for the correctness of the circuit –Not all initial timing assumptions may be required Petrify reports a set of constraints for order of firing that guarantee the circuit correctness
Timing constraints generation a b c d e d d e e b b c c d a Assumptions: d before b and c before e and a before d
Timing constraints generation a b c d e Assumptions: d before b and c before e and a before d d d e e b b c c d a
Timing constraints generation a b c d e Assumptions: d before b and c before e and a before d d d e e b b c c Correct behavior d a
Timing constraints generation a b c d e Assumptions: d before b and c before e and a before d d d e e b b c c 1 2 Incorrect behavior d a
Covering incorrect behavior a b c d e Assumptions: d before b and c before e and a before d d d e e b b c c {1, 3} d before b {1} d before c d a 5 {2, 4} c before e Other possible constraints remove states from assumption domain => invalid
Covering incorrect behavior a b c d e Assumptions: d before b and c before e and a before d d d e e b b c c {1} d before c d a 5 {2, 4} c before e Constraints for the minimal cost solution: d before c and c before e
Timing aware state encoding Solve only state conflicts reachable in the RT assumptions domain Generate automatic timing assumptions for inserted state signals => state signals can be implemented as RT logic State variables inserted concurrently with I/O events => latency and cycle time reduction
Value of Relative Timing RT circuits provides up to 2-3x (1.3-2x) delay&area reduction with respect to SI circuits synthesized without (with) concurrency reduction Automatic generation of timing assumptions => foundation for automatic synthesis of RT circuits with area/performance comparable/better than manual Back-annotation of timing constraints => minimal required timing information for the back-end tools Timing-aware state encoding allows significant area/performance optimization
Specification (STG + user assumptions) Lazy State Graph Lazy SG with CSC Next-state functions Decomposed functions Gate netlist Reachability analysis Timing-aware state encoding Boolean minimization Logic decomposition Technology mapping Design Flow with Timing Required Timing Constraints Automatic Timing Assumptions
FIFO example FIFO li lo ro ri li- li+ lo+ lo- ro+ ro- ri+ ri-
Speed-Independent Implementation without concurrency reduction 3 state signals are required
SI implementation with concurrency reduction li lo ro ri x li- li+ lo+ lo- ro+ ro- ri+ ri- x+ x- + gC + -
RT implementation li lo ro ri x li- li+ lo+ lo- ro+ ro- ri+ ri- x+ x- OR li- li+ lo+ lo- ro+ ro- ri+ ri- x+ x-
RT implementation li lo ro ri x li- li+ lo+ lo- ro+ ro- ri+ ri- x+ x- OR li- li+ lo+ lo- ro+ ro- ri+ ri- x+ x- To satisfy the constraint: Delay(x- ) < Delay (ri+ ) and Delay(lo+) + Delay(x- ) < Delay(ro+ ) + Delay (ri+ ) All constraints are either satisfied by default or easy to satisfy by sizing
Introduction to asynchronous circuit design: specification and synthesis Part IV: Synthesis from HDL Other synthesis paradigms
Outline Synthesis from standard HDL (Verilog) [L. Lavagno et al Async00] –Subset for asynchronous specification –Data-path/control partitioning –Circuit architecture. Control generation Synthesis from asynchronous HDL (CSP, Tangram) –CSP for control generation [A. Martin et al, Caltech] –Tangram for silicon compilation [K. van Berkel et al, Philips] Control synthesis using FSMs [K. Yun, S. Nowick] –Burst-mode machines –Comparison with STGs Disclaimer: this is NOT a comprehensive review
Motivation Language-based design key enabler to synchronous logic success Use HDL as single language for specification logic simulation and debugging synthesis post-layout simulation HDL must support multiple levels of abstraction
Control-data partitioning Splitting of asynchronous control and synchronous data path Automated insertion of bundling delays CONTROL UNIT DATA PATH delay request acknowledge
Design flow Control/data splitting STG (control) HDL specification Synthesizable HDL (data) Synthesis (petrify) Timing analysis (Synopsys) HDL implementation Synthesis (Synopsys) Logic implementation Delay insertion Logic delays
Asynchronous Verilog subset by example always begin wait(start); R = SMP * 3; RES = SMP * 4 + R; if(RES[7] == 1) RES = 0; else begin if(RES[6] == 1) RES = 1; end; done = 1; wait(!start); done = 0; end R RESRES SMP donestart RES C.U. begin-end for sequencing, fork-join for concurrency, if- else for input choice Only structured mix of sequencing, concurrency and choice can be specified
Controller design flow Trace Expressions Circuit Petri Net Transformations Reductions Synthesis HDL Syntax-directed translation
Trace expressions: example ( a || ( b ; c) ) || (d e) || ; a bc de
Reduction Example a fb c dg h e d; a; ( b || f ) c g; h; e
Transformation: concurrency reduction a fb c d ; || a bcdf ; ; Concurrency in TE: b and f have a common parallel father
a fb c d f and b are ordered ; || a bcdf ; ; ; Transformation: concurrency reduction
Synthesis Place-based encoding ( based on a David-cell approach) Transformations to improve area and performance Structural methods to derive a circuit [Pastor et al.] Transactions on CAD, Nov’98
Place-based encoding p1 p2 p3 p4 t1 t2 p3+ p1- p2- p4+ p3- t1 t2 p1+ p2+ p ER(t1) = 111- ER(t2) = --11
Synthesis example: VME bus p2+ ldtack+ p8-p11- lds+ p1+ D+ p3+ p1- p2- p4+ dtack+ p3- p5+ dsr- p4- p9+p6+ D-p5- p10+p7+ lds-dtack- p9-p6- p11+ ldtack-p8+ dsr+ p10- p7- LDTACK+ D+ DTACK+ DSr- D- DTACK-LDS- LDTACK-DSr+ LDS+ Place encoding
VME bus spec after transforms p2+ ldtack+ p8-p11- lds+ p1+ D+ p3+ p1- p2- p4+ dtack+ p3- p5+ dsr- p4- p9+p6+ D-p5- p10+p7+ lds-dtack- p9-p6- p11+ ldtack-p8+ dsr+ p10- p7- ldtack+ lds+d+ dtack+ dsr-p9+ d- lds-dtack- p9-ldtack- dsr+ Reductions Transforms
Deriving Next state function x+ z+ z- y- x- y+ p1 p2 p3 p4 p5 p6 p7 Next-state function of signal y ?
Deriving Next State function x+ z+ z- y- x- y+ p1 p2 p3 p4 p5 p6 p7 Next-state function of signal y ? y = x + z
Conclusion Initial prototype of automated flow without state explosion for ASIC design –From HDLs (control / data splitting) –Existing tools for data-path synthesis –Direct synthesis guarantees implementation (HDL Petri net, Petri-net-based encoding) –Synthesis of large controllers by efficient spec models (Free-choice Petri nets + trace expressions) –Exploration of the design space (optimization) by property-preserving transformations –Logic synthesis by structural methods Quality of design often acceptable Timing post-optimization can be applied
Synthesis from asynchronous HDL CSP based languages CSP = communicating sequential processes [T. Hoare] Two synthesis techniques –based on program transformations [Caltech] –based on direct compilation [Philips] Tools are more mature than for asynchronous synthesis from standard HDL Complete shift in design methodology is required
Using CSP for control generation After li goes high do full handshake at the right, then complete handshake at the left and iterate. li+ro+ri+ro-ri-lo+li-lo- ro ri li lo Q element *[[li];ro+;[ri];ro-;[not ri];lo+;[not li];lo-] “;” = sequencing operator ro+ = ro goes high; ro- = ro goes low [li] = wait until li is high; [not li] = wait until li is low CSP: STG:
Using CSP for control generation *[[li];ro+;[ri];ro-;[not ri];lo+;[not li];lo-] Conflict: ro+ and ro- are not mutually exclusive (since ri+ and li+ are not) Eliminate conflict by state signal insertion (= CSC) CSP: Production rules: li -> ro+; ri -> ro- not ri -> lo+; not li -> lo- ri li ro weak
Conflict elimination *[[li];ro+;[ri];x+;[x];ro-;[not ri];lo+;[not li];x-;[not x];lo-] CSP: Production rules: not x and li -> ro+; x or not li -> ro- x and not ri -> lo+; not x or ri -> lo- ri -> x+; not li -> x- FF x not x li lo ri ro
Conclusions Generating circuits from CSP control program is similar to STG synthesis One can be reduced to the other Particular technique may vary. Direct CSP program transformations can be (and were) used instead of methods based on state space generation See reference list for more details
Buffer example in Tangram (a?byte & b!byte) begin x0: var byte | forever do a?x0 ; b!x0 od end Buffer * x a b T ; T a b passive port active port Each circle mapped to a netlist Data path Q element
Summary Tangram program is partitioned into data path and control Data path is implemented as dual or single rail Control is mapped to composition of standard elements (“;” “||” etc) Each standard element is mapped to a circuit Post-optimization is done Composing islands of control elements and re-synthesis with STG can give more aggressive optimization Philips made a few chips using Tangram, including a product: 8051 micro-controller in low-power pager Muna (25 wks battery life from one AAA battery) Similar approach used in Balsa (Manchester Univ., public domain)
Burst mode FSM s1 s2 s3 s4 b-/x- a+b+/y+ a-/x+y- c+/y- c-/y+ Close to synchronous FSMs with binary encoded I/O Work in bursts: –Input transitions fire –Output transitions fire –State signals change Mostly limited to fundamental mode: next input burst cannot arrive before stabilization at the outputs
Extended Burst mode s1 s2 s3 s4 b-/x- a+b*/y+ a-/x+y- c+/y- c-/y+ Directed don’t cares (b*): some concurrency is allowed for input transitions that do not influence an output burst Conditional guards = “if b=1 then …”
Synthesis of XBM Next state and output functions free of functional and logic hazards Sequential feedbacks should not introduce new hazards State assignment –one state of the BM spec to one layer of Karnaugh map –compatible layers are merged –layers are compatible if merging does not introduce CSC violations or hazards –Layers are encoded using race free encoding
XBM and STG s1 s2 s3 s4 b-/x- a+b*/y+ a-/x+y- c+/y- c-/y+ x- a+ y+ b+ eps c- a- c+ y- y+ x+ y- b-
Summary Specification: XBM is subclass of STGs Synthesis: techniques are extensions of synchronous state assignment and logic minimization Timing: –environment is limited to fundamental mode (difficult for pipelined and highly concurrent systems) –internals are delay insensitive See reference list for details
Summary Specification: Signal Transition Graph (formalized timing diagram) Synthesis: –state encoding –Boolean function derivation –algebraic and Boolean sequential decomposition –technology mapping Timing: –delay model implies timing constraints –exploiting timing assumptions leads to minimization and generates further assumptions Future work: –integrated flow –testing