Gate Transfer Level Synthesis as an Automated Approach to Fine-Grain Pipelining Alexander Smirnov Alexander Taubin Mark Karpovsky Leonid Rozenblyum
Presentation goals Present and overview the synthesis framework Demonstrate a high-level pipeline model Demonstrate the synthesis correctness Illustrate how the correctness is guaranteed Present experimental results Conclusions Future work
Objective Industrial quality EDA flow for automated synthesis of fine-grain pipelined robust circuits from high-level specifications Industrial quality Easy to integrate in RTL oriented environment Capable of handling very large designs – scalability Automated fine-grain pipelining To achieve high performance (throughput) Automated to reduce design time
Choice of paradigm. Synchronous RTL: 8 logic levels per stage is the limit, due to register, clock skew and jitter overhead; timing closure; no pipelining automation available – stage balancing is difficult; performance limitations to guarantee correctness under process variation etc. Asynchronous GTL: lower design time – automated pipelining possible from the RTL specification; higher performance – gate-level (finest possible) pipelining achievable; controllable power consumption – smoothly slows down in case of voltage reduction; improved yield – correct operation regardless of variations.
Easy integration & scalability: Weaver flow architecture RTL tools reuse Creates the impression that nothing has changed Saves development effort Substitution based transformations Linear complexity Enabled by using functionally equivalent DR (dual-rail: physical) and SR (single rail: virtual) libraries
Easy integration & scalability: Weaver flow architecture Synthesis flow Interfacing with host synthesis engine Transforming Synchronous RTL to Asynchronous GTL – Weaving Dedicated library(ies) Dual-rail encoded data logic Cells comprising entire stages Internal delay assumptions only
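To make the substitution idea concrete, below is a minimal Python sketch (cell names and the table are hypothetical, not the actual Weaver libraries) of a linear-complexity pass that replaces each single-rail (SR) cell with its functionally equivalent dual-rail (DR) cell:

```python
# Hypothetical SR -> DR substitution table; the real library names differ.
SR_TO_DR = {
    "AND2": "DR_AND2",   # dual-rail AND stage
    "OR2":  "DR_OR2",    # dual-rail OR stage
    "INV":  "DR_INV",    # inversion becomes a rail swap in dual-rail data
    "DFF":  "DR_HB_BUF", # a clocked DFF maps to half-buffer stage(s)
}

def weave_netlist(instances):
    """Substitute every SR cell with its DR counterpart.

    `instances` is a list of (instance_name, cell_type) pairs.
    One pass over the netlist, so complexity is linear in its size.
    """
    woven = []
    for name, cell in instances:
        dr_cell = SR_TO_DR.get(cell)
        if dr_cell is None:
            raise ValueError(f"no DR equivalent for cell type {cell}")
        woven.append((name, dr_cell))
    return woven

if __name__ == "__main__":
    print(weave_netlist([("u1", "AND2"), ("u2", "INV"), ("r1", "DFF")]))
```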
Automated fine-grain pipelining: Gate Transfer Level (GTL). Gate-level pipeline (register and combinational logic figure): let gates communicate asynchronously and independently; many pipeline styles can be used; templates already exist.
Weaving Critical transformations Mapping combinational gates (basic weaving) Mapping sequential gates Initialization preserving liveness and safeness Optimizations Performance optimization Fine-grain pipelining (natural) Slack matching Area optimization Optimizing out identity function stages
Basic Weaving De Morgan transformation Dual-rail expansion Gate substitution Generating req/ack signals Merge insertion Fork insertion Reset routing
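As an illustration of the dual-rail expansion step, here is a minimal Python sketch of a 2-input AND expanded to dual-rail (1-of-2) data with a NULL spacer, under the usual (true-rail, false-rail) encoding; a real GTL stage would additionally handshake (req/ack) with its neighbors:

```python
NULL = (0, 0)  # spacer: neither rail asserted

def encode(bit):
    """Logic 1 -> (1, 0), logic 0 -> (0, 1)."""
    return (1, 0) if bit else (0, 1)

def dr_and(a, b):
    """Dual-rail expansion of a 2-input AND.

    z.t = a.t AND b.t; the false rail follows from De Morgan: z.f = a.f OR b.f.
    The stage outputs NULL until both inputs carry valid codewords.
    """
    if a == NULL or b == NULL:
        return NULL
    at, af = a
    bt, bf = b
    return (at & bt, af | bf)

assert dr_and(encode(1), encode(1)) == encode(1)
assert dr_and(encode(1), encode(0)) == encode(0)
assert dr_and(NULL, encode(1)) == NULL
```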
Basic Weaving: example (C17 MCNC benchmark)
Linear pipeline (RTL)
Linear pipeline: pipeline PN (PPN) model with local handshake vs. pipeline PN model with global synchronization. The PPN models asynchronous full-buffer pipelines; the figure compares the GTL and RTL implementations.
Correctness. Safeness guarantees that the number of data portions (tokens) stays the same over time. Liveness guarantees that the system operates continuously. Flow equivalence: in both RTL and GTL implementations corresponding sequential elements hold the same data values, on the same iterations (order-wise), for the same input stream.
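One compact way to state the flow-equivalence property above (the notation is introduced here only for illustration):

```latex
% \sigma_r^{RTL}(k \mid I) and \sigma_r^{GTL}(k \mid I): value held by
% sequential element r on its k-th iteration, for the same input stream I.
\forall r \in \mathit{Regs},\ \forall k \ge 0:\quad
  \sigma_r^{RTL}(k \mid I) \;=\; \sigma_r^{GTL}(k \mid I)
```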
Non-linear pipelines. Deterministic token flow: broadcasting tokens to all channels at forks; synchronizing at merges. Data-dependent token flow: Ctrl is also a dual-rail channel; to guarantee liveness MUXes need to match deMUXes – computationally hard.
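A minimal Python sketch of the deterministic fork/merge rules above, with single-place channels modeled as lists (names are illustrative):

```python
def fork_fire(in_ch, out_chs):
    """Broadcast: consume one token and copy it to every output channel."""
    if in_ch and all(len(ch) == 0 for ch in out_chs):
        tok = in_ch.pop(0)
        for ch in out_chs:
            ch.append(tok)
        return True
    return False

def merge_fire(in_chs, out_ch):
    """Synchronize: fire only when every input channel holds a token."""
    if all(in_chs) and not out_ch:
        out_ch.append(tuple(ch.pop(0) for ch in in_chs))
        return True
    return False

a, b, c, z = [1], [], [], []
assert fork_fire(a, [b, c])      # the token is broadcast to both branches
assert merge_fire([b, c], z)     # fires only once both branches have arrived
assert z == [(1, 1)]
```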
Non-linear pipeline liveness Currently guaranteed for deterministic token flow only by construction (weaving) A marking of a marked graph is live if each directed PN circuit has a marker Linear closed pipelines can be considered instead
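The liveness criterion above can be checked without enumerating circuits: every directed circuit carries a marker iff the subgraph of unmarked arcs is acyclic. A minimal Python sketch, assuming the marking assigns token counts to arcs:

```python
def mg_marking_is_live(arcs, marking):
    """Marked-graph liveness: every directed circuit holds at least one token.

    Equivalent test: the subgraph of zero-token arcs must be acyclic
    (a cycle there would be a token-free circuit).
    `arcs` is an iterable of (src, dst); `marking` maps arcs to token counts.
    """
    unmarked = [(u, v) for (u, v) in arcs if marking.get((u, v), 0) == 0]
    succ = {}
    for u, v in unmarked:
        succ.setdefault(u, []).append(v)

    WHITE, GREY, BLACK = 0, 1, 2
    color = {}

    def has_cycle(node):
        color[node] = GREY
        for nxt in succ.get(node, []):
            c = color.get(nxt, WHITE)
            if c == GREY or (c == WHITE and has_cycle(nxt)):
                return True
        color[node] = BLACK
        return False

    nodes = {u for u, _ in unmarked} | {v for _, v in unmarked}
    return not any(color.get(n, WHITE) == WHITE and has_cycle(n) for n in nodes)

# Two-stage ring: a token on the feedback arc makes the marking live.
ring = [("s1", "s2"), ("s2", "s1")]
assert mg_marking_is_live(ring, {("s2", "s1"): 1})
assert not mg_marking_is_live(ring, {})
```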
Closed linear PPN. Every PPN “stage” is a circuit and has a marker by definition. Each implementation loop forms two directed circuits: forward – has at least one token, inferred for a DFF; feedback – has at least one NULL, inferred from CL or added explicitly.
Closed linear PPN pipeline is live iff (for full-buffer pipelines): every loop has at least 2 stages; token capacity for any loop: 1 ≤ C ≤ N − 1. Assumption we made – every loop in a synchronous circuit has a DFF (a loop with no CL is meaningless). Liveness conditions hold by construction (weaving).
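A minimal Python sketch of these per-loop conditions (the half-buffer variant from the backup slides is included for comparison; the N/2 − 1 bound is encoded literally as 2C ≤ N − 2):

```python
def fb_loop_is_live(n_stages, n_tokens):
    """Full-buffer loop: live iff N >= 2 and 1 <= C <= N - 1."""
    return n_stages >= 2 and 1 <= n_tokens <= n_stages - 1

def hb_loop_is_live(n_stages, n_tokens):
    """Half-buffer loop: live iff N >= 3 and 1 <= C <= N/2 - 1."""
    return n_stages >= 3 and n_tokens >= 1 and 2 * n_tokens <= n_stages - 2

# A loop inferred from one DFF plus a few combinational-logic stages:
assert fb_loop_is_live(n_stages=4, n_tokens=1)
assert not fb_loop_is_live(n_stages=4, n_tokens=4)   # no bubble left to move
assert hb_loop_is_live(n_stages=6, n_tokens=2)
assert not hb_loop_is_live(n_stages=4, n_tokens=2)   # exceeds N/2 - 1
```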
Initialization: example
Initialization: FSM example … HB …
Flow equivalence GTL data flow structure is equivalent to the source RTL by weaving No data dependencies are removed No additional dependencies introduced In deterministic flow architecture There are no token races (tokens cannot pass each other) All forks are broadcast and all joins are synchronizers Flow equivalence preserved by construction
Flow equivalence (animated example). GTL initialization is the same as in RTL, but token propagation is independent: in GTL, token “3” hits the first top register output, then the first bottom register output, and token “2” hits the second register output, while in RTL “3” and “2” have moved one stage ahead. The timing is independent, but the order of values at corresponding register outputs is unchanged.
Optimizations Area Optimizing out identity function stages Performance Fine-grain pipelining (natural) Slack matching
Optimizing out identity function stages Identity function stages (buffers) are inferred for clocked DFFs and D-latches Implement no functionality Can be removed as long as The token capacity is not decreased below the RTL level The resulting circuit can still be properly initialized
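A minimal sketch of the removal conditions as a predicate (the data structures and the full-buffer capacity rule used here are illustrative, not the Weaver implementation):

```python
def can_optimize_out(stage, remaining_stages, loop_tokens, rtl_tokens):
    """Decide whether an identity (buffer) stage may be removed from a loop.

    stage:            e.g. {"kind": "BUF", "init": "NULL"} (illustrative)
    remaining_stages: stages left on the loop if this one is dropped
    loop_tokens:      initial tokens on the loop
    rtl_tokens:       tokens the corresponding RTL loop must hold
    """
    if stage["kind"] != "BUF":
        return False                       # only identity stages are candidates
    capacity_after = remaining_stages - 1  # full-buffer loop capacity: N - 1
    return (capacity_after >= rtl_tokens           # capacity not below RTL level
            and loop_tokens <= capacity_after      # still initializable
            and stage["init"] == "NULL")           # no initial token is lost

# A 5-stage loop holding 1 token can spare one NULL-initialized buffer stage.
assert can_optimize_out({"kind": "BUF", "init": "NULL"}, 4, 1, 1)
```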
Optimizing out identity function stages: example (CL, HB, DFF). The final implementation is the same as if the RTL had not been pipelined (except for initialization); saves pipelining effort.
Slack matching implementation. Adjusting the pipeline slack to optimize its throughput. Implementation: leveling gates according to their shortest paths from primary inputs (outputs); inserting buffer stages to break long dependencies; buffer stages initialized to NULL. Currently performed for circuits with no loops only. Complexity O(|X|·|C|²), where |X| is the number of primary inputs and |C| is the number of connection points in the netlist.
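A minimal Python sketch of the leveling and buffer-insertion idea for a loop-free netlist (for simplicity it levels by longest path from the primary inputs, whereas the slide levels by shortest paths from inputs/outputs; names are illustrative and the O(|X||C|²) algorithm itself is not reproduced):

```python
def slack_match(fanins, primary_inputs):
    """Level gates and report, per connection, how many NULL-initialized
    buffer stages to insert so that reconverging paths have equal depth.

    fanins: dict gate -> list of drivers (gates or primary inputs)
    Returns (levels, buffers) with buffers[(driver, gate)] = stages to add.
    """
    levels = {x: 0 for x in primary_inputs}

    def level(g):
        if g not in levels:
            levels[g] = 1 + max(level(d) for d in fanins[g])
        return levels[g]

    for g in fanins:
        level(g)

    buffers = {}
    for g, drivers in fanins.items():
        for d in drivers:
            slack = (levels[g] - 1) - levels[d]   # extra stages on this edge
            if slack > 0:
                buffers[(d, g)] = slack
    return levels, buffers

# Unbalanced reconvergence: b feeds z both directly and through gate a.
levels, buffers = slack_match({"a": ["b"], "z": ["a", "b"]}, primary_inputs=["b"])
assert buffers == {("b", "z"): 1}   # one NULL buffer on the short path
```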
Slack matching correctness Increases the token capacity Potentially increases performance Does not affect the number of initial tokens Liveness is not affected Does not affect the system structure The flow equivalence is not affected
Experimental results: MCNC. RTL implementation: not pipelined. GTL implementation: naturally fine-grain pipelined, slack matching performed. Both implementations obtained automatically from the same VHDL behavioral specification. On average ~4x better performance.
Experimental results: AES. ~36x better performance, ~12x larger (area).
Base line. Demonstrated automatic synthesis of QDI (robust to variations), automatically gate-level pipelined implementations from large behavioral specifications. Synthesis run time is comparable with RTL synthesis (~2.5x slower) – design time could be reduced. Resulting circuits feature increased performance (depth dependent, ~4x for MCNC) at an area overhead. Practical solution – first prerelease at. Demonstrated correctness of the transformations (weaving).
Future work Library design Dynamic (domino-like) library design Low leakage library design to combine high performance of fine-grain pipelining with low power from very aggressive voltage reduction Balanced library for security related applications Extending the concept to other technologies Automated asynchronous fine-grain pipelining for standard FPGAs Synthesis flow development Integration of efficient GTL “design-ware” and architectures
Thank you! Questions? Comments? Suggestions?
Backup slides Slack matching animated example Similar work FSM + datapath example (1-round AES) Experiments setup Linear HB PPN Non-linear HB PPN Closed linear HB pipeline liveness
Slack matching: example (C17)
Similar work: the difference. Null Convention Logic: coarse-grain; slow and large synchronization trees. Phased logic: different encoding provides less switching activity; complicated synthesis algorithm due to the encoding. De-synchronization: bundled data; coarse-grain. None of the above provide support for automated fine-grain pipelining.
Example: data path (block diagram: FSM, CL, REG, MUX/DEMUX).
Experiments setup. Standard gate library: vtvt from Virginia Tech, TSMC 0.25µm. C-elements derived from the PCHB library from USC and simulated to obtain performance.
All correctness prerequisites
1. no additional data dependencies are added and no existing data dependencies are removed during weaving;
2. every gate implementing a logical function is mapped to a GTL gate (stage) implementing the equivalent function for dual-rail encoded data and initialized to NULL (spacer);
3. closed asynchronous HB pipeline maximum token capacity is S/2 - 1 (where S is the number of HB stages);
4. closed asynchronous FB pipeline maximum token capacity is S - 1 (S is the number of FB stages);
5. in HB pipelines distinct tokens are always separated with spacers (there are no two distinct tokens in any two adjacent stages);
6. for each DFF in the RTL implementation there exist in the GTL implementation two HB stages, one initialized to a spacer and another to a token;
7. the number of HB pipeline stages in any cycle of the GTL implementation is greater than the number of DLs (or half-DFFs) in the corresponding synchronous RTL implementation;
8. GTL pipeline token capacity is greater than or equal to that of the synchronous implementation;
9. no stage state is shared between any two stages;
10. exactly one place is marked in every stage state;
11. a HB PPN marking is valid iff every FB-stage in the HB PPN has exactly one marker;
12. a GTL style pipeline is properly modeled by a HB PPN;
13. a live closed HB PPN is at least 3 HB stages long;
14. a live closed HB PPN has at least one token and at most S/2 - 1 tokens;
15. the token flow is deterministic and does not depend on the data itself;
16. a marked graph is live iff M0 assigns at least one token to each directed loop (or circuit);
17. for a HB PPN to be live, each of its directed circuits composed of forward arcs, as a closed HB PPN, must satisfy conditions (xi), (xiii) and (xiv);
18. every feedback loop in the synchronous implementation contains at least one DFF (or a pair of DLs).
Linear pipeline. PPN models full-buffer pipelines; HB PPN models half-buffer pipelines. A PPN stage has two states; a HB PPN stage has three states and properly models the HB GTL implementation.
Non-linear pipeline HB PPN model. The PPN and the MG PN are equivalent to the HB PPN, except for token capacity.
Closed linear HB pipeline is live iff: every loop has at least 3 stages; token capacity for any loop: 1 ≤ C ≤ N/2 − 1. Assumption we made – every loop in a synchronous circuit has a DFF (a loop with no CL is meaningless). Liveness conditions hold.