Uncle – An RTL Approach to Asynchronous Design Presenter: Chi-Chuan Chuang Date:
Outline Introduction ◦ C-element ◦ Null convention logic (NCL) ◦ NCL asynchronous systems UNCLE synthesis flow ◦ From RTL to gates ◦ Ack generation ◦ Net buffering ◦ Latch balancing ◦ Relaxation, cell merging Comparisons Conclusion
C-element Commonly used asynchronous logic component Hysteresis: the output changes only when all inputs agree, and otherwise holds its previous value Implementations ◦ Semi-static: with two cross-coupled inverters ◦ Static: doesn’t rely on feedback inverters ◦ Gate-level: depends on which gates are used
C-element (cont.) Semi-static
C-element (cont.) Static Gate-level
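The hysteresis behavior of a C-element can be sketched as a small behavioral model. This is only an illustration of the input/output behavior, not of any of the transistor-level implementations named on the slides; the function names `c_element` and `run` are made up for this sketch.

```python
def c_element(a: int, b: int, prev: int) -> int:
    """Two-input C-element: follow the inputs when they agree, else hold."""
    if a == b:
        return a      # both 0 -> output 0, both 1 -> output 1
    return prev       # inputs disagree -> keep the previous output


def run(seq, init=0):
    """Apply a sequence of (a, b) input pairs, feeding the output back as state."""
    out = init
    trace = []
    for a, b in seq:
        out = c_element(a, b, out)
        trace.append(out)
    return trace


if __name__ == "__main__":
    # Output rises only after both inputs rise, and falls only after both fall.
    print(run([(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]))
```

Feeding the previous output back in as `prev` is what models the state-holding feedback path.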
Null convention logic Dual-rail Delay-insensitive logic style Based on threshold logic Uses 27 fundamental threshold gates with 2 to 4 inputs Gates have hysteresis (state-holding capability)
Null convention logic (cont.)
An example implementation of TH23
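A behavioral sketch of an unweighted NCL threshold gate with hysteresis may help here. It assumes the usual THmn convention: the output is set when at least m of the n inputs are 1, cleared only when all inputs are 0, and held otherwise. Weighted gate variants are not modeled, and the helper name `th_gate` is an assumption of this sketch.

```python
def th_gate(m: int, inputs, prev: int) -> int:
    """Unweighted THmn gate: set at >= m ones, reset at all zeros, else hold."""
    n_high = sum(inputs)
    if n_high >= m:
        return 1      # threshold reached: assert the output
    if n_high == 0:
        return 0      # all inputs back to NULL (0): deassert
    return prev       # in between: hysteresis holds the previous value


if __name__ == "__main__":
    # TH23: threshold 2, three inputs.
    out = 0
    for ins in [(1, 0, 0), (1, 1, 0), (1, 0, 0), (0, 0, 0)]:
        out = th_gate(2, ins, out)
        print(ins, "->", out)
```

With m equal to the number of inputs, this model reduces to an n-input C-element.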
Null convention logic (cont.) Comparison of two types of dual-rail (DR) AND2
27 Basic NCL macros
NCL asynchronous systems Data-driven approach ◦ Uses NCL gates for both registers and control Control-driven approach ◦ Uses Balsa-style registers and control
Data-driven approach Uses dual-rail latches with acknowledge signals ki and ko to control the datapath
Dual-rail latches ◦ C_0 = C-element with async reset to 0 ◦ C_1 = C-element with async reset to 1 ◦ t_d/f_d = dual-rail in ◦ ko = ackout ◦ t_q/f_q = dual-rail out ◦ ki = ackin Types of latch ◦ drlatn ◦ drlatr ◦ drlats
Dual-rail latches (cont.) drlatn drlatr drlats
Data-driven approach (cont.) Finite state machine ◦ The middle half-latch contains initial data ◦ All ports and registers are read and written every cycle
Control-driven Approach Registers with selective read/write Control network is separate from the datapath Read ports can easily be added to a register
UNCLE synthesis flow Both data-driven and control-driven styles are supported Lower-level synthesis tool with Verilog as its input language
From RTL to Gates RTL is transformed to a gate-level netlist using commercial synthesis tools The target library read by the tool contains: ◦ AND2, XOR2, OR2, inverter ◦ D-flip-flop (DFF), D-latch (DLAT) ◦ Special gates (T-elements, S-elements…) ◦ Complex gates that have been mapped into NCL Gates have unit delays for timing Area is proportional to transistor counts
Ack Generation Data-driven ◦ Each latch receives an ack signal from each destination latch of its output Control-driven ◦ Each control element receives an ack signal from each destination latch A simple ack merging algorithm: ◦ Any latches having at least one common destination have their ack networks merged An ack checker step is included at the end of the flow to check ack network validity
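The merging rule above can be sketched with a union-find over source latches. This is a minimal illustration, not Uncle's actual implementation: the function name `merge_ack_groups` and the input format (a dict from each source latch to its set of destination latches) are assumptions, and the sketch assumes merging is transitive (if A shares a destination with B, and B with C, all three end up in one group).

```python
def merge_ack_groups(dests):
    """Group source latches that share at least one destination latch.

    dests: dict mapping source latch name -> set of destination latch names
           (hypothetical input format for this sketch).
    Returns a sorted list of sorted groups of source latches.
    """
    parent = {src: src for src in dests}

    def find(x):
        # Find the group representative, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_source = {}  # destination -> first source latch seen driving it
    for src, ds in dests.items():
        for d in ds:
            if d in first_source:
                # Common destination: merge the two ack groups.
                parent[find(src)] = find(first_source[d])
            else:
                first_source[d] = src

    groups = {}
    for src in dests:
        groups.setdefault(find(src), set()).add(src)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    example = {
        "A": {"X", "Y"},
        "B": {"Y"},        # shares Y with A
        "C": {"Z"},
        "D": {"Z", "W"},   # shares Z with C
        "E": {"V"},        # shares nothing
    }
    print(merge_ack_groups(example))
```

Each resulting group corresponds to one shared ack network in this simplified view; the ack checker step mentioned above is not modeled.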
Net Buffering Timing data uses a non-linear delay model (NLDM) The signal net target transition time used for all examples in this paper is approximately equivalent to a 1X inverter driving four separate 4X inverter loads Gate sizing Build a buffer tree with inverters
Latch Balancing An optimization for the data-driven style that moves half-latches in the netlist to balance data delays against ack delays Ack delay ◦ Depends on the number of destinations, which sets the completion network depth Data delay ◦ Depends on the data logic complexity
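To see why the number of destinations drives ack delay, the sketch below estimates the depth of a completion tree. It assumes the tree is built from gates with at most four inputs (the slides state NCL gates have 2 to 4 inputs); how Uncle actually structures its completion networks is not stated here, and `completion_depth` is a hypothetical helper.

```python
def completion_depth(n_dest: int, max_fanin: int = 4) -> int:
    """Gate levels needed to combine n_dest ack signals into a single signal."""
    if n_dest < 1:
        raise ValueError("need at least one destination")
    if max_fanin < 2:
        raise ValueError("gates need at least two inputs")
    depth = 0
    signals = n_dest
    while signals > 1:
        # Each level merges up to max_fanin signals per gate (ceiling division).
        signals = -(-signals // max_fanin)
        depth += 1
    return depth


if __name__ == "__main__":
    for n in (1, 4, 5, 16, 17):
        print(n, "destinations ->", completion_depth(n), "levels")
```

Under these assumptions the depth grows logarithmically with the destination count, so a latch with many destinations sees a longer ack path than one with few.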
Latch Balancing (cont.)
Generally results in more transistors, because the datapath width increases moving towards the source registers, requiring more latches and a larger ack network Implemented by an iterative heuristic algorithm
Latch Balancing (cont.)
Several sorting/pruning stages based on data/ack/cycle delays are used to find the latches that are most likely to improve performance if pushed Chosen latches are pushed one gate level, and affected ack networks are rebuilt Latches that feed only primary outputs are ineligible
Latch Balancing (cont.) Works appropriately for FSMs Has problems with linear pipelines if latches are pushed in one direction only
Relaxation and Cell Merging Relaxation is a technique that ◦ Looks for redundant paths from a PI to a PO ◦ Finds gates that don’t have to be fully expanded to dual-rail versions, but can be implemented by eager versions that require fewer transistors Cell Merging ◦ Adjacent gates with no fanout are merged into more complex gates ◦ Area-driven
Example RTL Statements
Comparison GCD16 with different Uncle versions Conditional port activity caused data-driven designs to be large and slow. Latch balancing helped DD performance. Control-driven produced the best results. DD: data-driven, CD: control-driven, LB: latch balanced, NB: net buffered, *: ratio to best [Table: columns are Uncle versions DD, DD/NB, DD/LB/NB, CD, CD/NB; rows are transistors, cyc. time (ns), energy (pJ), each with its ratio to best (*); values not recoverable]
Comparison (cont.) GCD16 between Uncle and Balsa Balsa used more read ports on registers, reducing loading but increasing transistor count Net buffering helped offset the increased loading in the Uncle design and improved performance [Table: columns are transistors, cyc. time (ns), energy (pJ), each split into Balsa and Uncle (CD/NB), with a * row; values not recoverable]
Comparison (cont.) Viterbi decoder design ◦ Branch Metric Unit (BMU) Just combinational logic With a half latch at the output for UNCLE ack ◦ Path Metric Unit (PMU) It’s a set of parallel accumulator-like registers resulting in many parallel three half-latch loops ◦ History Unit (HU) It has three 16-entry register files(4-bit, 2-bit, and 1-bit) An outer loop writes the registers, and can conditionally trigger an inner while loop that contains register read/write operations and executes a variable number of iterations
Comparison (cont.) Viterbi’s Branch Metric Unit comparison ◦ Combinational logic only ◦ Uncle version is just combinational logic with a half-latch on the output ◦ Balsa version used loop splitting to split the combinational logic into concurrent blocks, which increased parallelism of internal computations at the cost of more transistors [Table: columns are transistors, cyc. time (ns), energy (pJ), each split into Balsa and Uncle (CD/NB), with a * row; values not recoverable]
Comparison (cont.) Uncle’s Viterbi Path Metric Unit (PMU) LB+ = latch-balanced, with two sets of half-latches added to the RTL (one in the FSM loop and one on the output port) [Table: columns are Uncle versions DD/NB, DD/NB/LB, DD/NB/LB+, CD/NB; rows are transistors, cyc. time (ns), energy (pJ), each with a * row; values not recoverable]
Comparison (cont.) Viterbi’s Path Metric Unit comparison [Table: columns are transistors, cyc. time (ns), energy (pJ), each split into Balsa and Uncle (DD/NB/LB+), with a * row; values not recoverable]
Comparison (cont.) Viterbi’s History Unit comparison [Table: columns are Balsa, Uncle CD/NB, Uncle CD; rows are transistors, V1 cyc. time (ns), V1 energy (pJ), V2 cyc. time (ns), V2 energy (pJ), each with a * row; values not recoverable]
Comparison (cont.) Viterbi comparison between Balsa and Uncle The Uncle decoder uses the DD/NB/LB+ PMU RTL [Table: columns are transistors, cyc. time (ns), energy (pJ), each split into Balsa and Uncle (DD/NB/LB+), with a * row; values not recoverable]
Comparison (cont.) Balsa vs. Uncle ◦ Combinational synthesis: Yes ◦ Control synthesis: Balsa: yes; Uncle: data-driven only ◦ Logic style: Balsa: different dual-rail styles, bundled data; Uncle: NCL only ◦ Behavioral simulation: Balsa: yes; Uncle: limited ◦ Area optimizations: Balsa: no; Uncle: relaxation, limited cell merging, ack sharing ◦ (row label not recoverable) Uncle: RTL style allows area/perf. tradeoffs, latch balancing, net buffering ◦ Timing model: Balsa: fixed delay; Uncle: NLDM
Conclusion Uncle requires more effort from the designer than Balsa, but can produce a higher-quality design If the goal is performance of an always-active module, the data-driven style is the better choice The control-driven style is better for modules with conditional port activity
Appendix : Teak Teak is a successor toolset to Balsa that uses a data-driven style One of Teak’s goals is to automatically insert latch stages and balance delays for optimum throughput. Teak is a fairly new tool with only one public release
Reference ◦ Uncle – An RTL Approach to Asynchronous Design ◦ ASYNC12 PowerPoint presentation on Uncle – An RTL Approach to Asynchronous Design ◦ Design of Asynchronous Circuits Using Synchronous CAD Tools ◦ Optimization of NULL Convention Self-Timed Circuits