1
Gate Transfer Level Synthesis as an Automated Approach to Fine-Grain Pipelining
Alexander Smirnov, Alexander Taubin, Mark Karpovsky, Leonid Rozenblyum
2
Presentation goals
- Present an overview of the synthesis framework
- Demonstrate a high-level pipeline model
- Demonstrate the synthesis correctness
- Illustrate how the correctness is guaranteed
- Present experimental results
- Conclusions
- Future work
3
Objective
Industrial-quality EDA flow for automated synthesis of fine-grain pipelined, robust circuits from high-level specifications
- Industrial quality
  - Easy to integrate into an RTL-oriented environment
  - Capable of handling very large designs (scalability)
- Automated fine-grain pipelining
  - To achieve high performance (throughput)
  - Automated to reduce design time
4
Choice of paradigm
Synchronous RTL:
- Roughly 8 logic levels per stage is the limit, due to register, clock skew, and jitter overhead
- Timing closure
- No pipelining automation available; stage balancing is difficult
- Performance limitations needed to guarantee correctness under process variation, etc.
Asynchronous GTL:
- Lower design time: automated pipelining possible from an RTL specification
- Higher performance: gate-level (finest possible) pipelining achievable
- Controllable power consumption: slows down smoothly when the supply voltage is reduced
- Improved yield: correct operation regardless of variations
5
Easy integration & scalability: Weaver flow architecture
- RTL tools reuse
  - Creates the impression that nothing has changed
  - Saves development effort
- Substitution-based transformations
  - Linear complexity
  - Enabled by using functionally equivalent DR (dual-rail: physical) and SR (single-rail: virtual) libraries
6
Easy integration & scalability: Weaver flow architecture
- Synthesis flow
  - Interfacing with the host synthesis engine
  - Transforming synchronous RTL to asynchronous GTL (weaving)
- Dedicated library(ies)
  - Dual-rail encoded data logic
  - Cells comprising entire stages
  - Internal delay assumptions only
7
Automated fine-grain pipelining: Gate Transfer Level (GTL)
Gate-level pipeline [figure: combinational logic between registers (REG)]
8
Automated fine-grain pipelining: Gate Transfer Level (GTL)
Gate-level pipeline: let gates communicate asynchronously and independently.
9
Automated fine-grain pipelining: Gate Transfer Level (GTL)
Gate-level pipeline: let gates communicate asynchronously and independently. Many pipeline styles can be used.
10
Automated fine-grain pipelining: Gate Transfer Level (GTL)
Gate-level pipeline: let gates communicate asynchronously and independently. Many pipeline styles can be used. Templates already exist.
11
Weaving
- Critical transformations
  - Mapping combinational gates (basic weaving)
  - Mapping sequential gates
  - Initialization preserving liveness and safeness
- Optimizations
  - Performance: fine-grain pipelining (natural), slack matching
  - Area: optimizing out identity-function stages
12
Basic Weaving
- De Morgan transformation
- Dual-rail expansion (sketched below)
- Gate substitution
- Generating req/ack signals
- Merge insertion
- Fork insertion
- Reset routing
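The dual-rail expansion step can be illustrated at the purely boolean level. This is a minimal sketch under the usual QDI dual-rail convention (NULL encoded as both rails low), ignoring completion detection, handshake generation, and the actual Weaver cell library; the names DualRail, dr_and, and dr_not are invented for the example.

```python
# Illustrative sketch of dual-rail expansion: each single-rail signal x becomes
# a pair (x_t, x_f); NULL is (0, 0), valid 0/1 is (0, 1)/(1, 0). Only the
# boolean view of one AND gate (after De Morgan normalization) is shown; real
# GTL stages also add completion detection and handshaking.

from collections import namedtuple

DualRail = namedtuple("DualRail", ["t", "f"])  # true rail, false rail

def dr_and(a: DualRail, b: DualRail) -> DualRail:
    """Dual-rail AND: the true rail fires only when both true rails are set;
    the false rail fires once both inputs are valid and at least one is 0,
    so the output stays NULL until both inputs are valid."""
    t = a.t & b.t
    f = (a.f & b.f) | (a.f & b.t) | (a.t & b.f)
    return DualRail(t, f)

def dr_not(a: DualRail) -> DualRail:
    """Dual-rail inversion is just a rail swap (no logic needed)."""
    return DualRail(a.f, a.t)

# Example: 1 AND 0 -> 0, NULL input -> NULL output, NOT 1 -> 0
one, zero, null = DualRail(1, 0), DualRail(0, 1), DualRail(0, 0)
assert dr_and(one, zero) == zero
assert dr_and(one, null) == null
assert dr_not(one) == zero
```

In the actual flow, the substitution step replaces each single-rail library cell with a functionally equivalent dual-rail GTL stage from the dedicated library rather than with plain boolean equations.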
13
Basic Weaving: example (C17 MCNC benchmark)
14
Linear pipeline (RTL)
15
Linear pipeline: pipeline PN (PPN) model with local handshake vs. pipeline PN model with global synchronization
17
Linear pipeline: the PPN models asynchronous full-buffer pipelines; the pipeline PN model uses global synchronization
18
Linear pipeline: GTL implementation vs. RTL implementation [figure]
19
Correctness
- Safeness: guarantees that the number of data portions (tokens) stays the same over time
- Liveness: guarantees that the system operates continuously
- Flow equivalence: in both the RTL and GTL implementations, corresponding sequential elements hold the same data values, on the same iterations (order-wise), for the same input stream
20
Non-linear pipelines
- Deterministic token flow
  - Broadcasting tokens to all channels at forks
  - Synchronizing at merges
- Data-dependent token flow
  - Ctrl is also a dual-rail channel
  - To guarantee liveness, MUXes need to match deMUXes, which is computationally hard
21
Non-linear pipeline liveness
- Currently guaranteed by construction (weaving) for deterministic token flow only
- A marking of a marked graph is live if each directed PN circuit has a marker (see the sketch below)
- Linear closed pipelines can be considered instead
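The marked-graph liveness criterion cited on this slide (every directed circuit must carry a marker) reduces to an acyclicity test: a marking is live exactly when the subgraph of unmarked edges contains no cycle. A minimal sketch, assuming the pipeline is abstracted to a directed graph with per-edge initial token counts; the edge-list format is an illustration, not Weaver's internal representation.

```python
# Sketch of the marked-graph liveness test: a marking is live iff every
# directed circuit contains at least one marked edge, i.e. the subgraph of
# UNmarked edges is acyclic. 'edges' is a hypothetical representation:
# (src_stage, dst_stage, initial_tokens).

def marking_is_live(edges):
    # Keep only edges with no initial token; any cycle left here is token-free.
    adj = {}
    for src, dst, tokens in edges:
        adj.setdefault(src, [])
        adj.setdefault(dst, [])
        if tokens == 0:
            adj[src].append(dst)

    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in adj}

    def has_cycle(v):
        color[v] = GRAY
        for w in adj[v]:
            if color[w] == GRAY or (color[w] == WHITE and has_cycle(w)):
                return True
        color[v] = BLACK
        return False

    return not any(color[v] == WHITE and has_cycle(v) for v in adj)

# A two-stage ring with one token is live; a token-free ring is not.
assert marking_is_live([("s1", "s2", 1), ("s2", "s1", 0)])
assert not marking_is_live([("s1", "s2", 0), ("s2", "s1", 0)])
```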
22
Closed linear PPN: every PPN "stage" is a circuit and has a marker by definition
26
Closed linear PPN
- Every PPN "stage" is a circuit and has a marker by definition
- Each implementation loop forms two directed circuits:
  - Forward: has at least one token, inferred for a DFF
  - Feedback: has at least one NULL, inferred from CL or added explicitly
27
Closed linear PPN pipeline is live iff (for full-buffer pipelines):
- Every loop has at least 2 stages
- Token capacity for any loop of N stages: 1 ≤ C ≤ N - 1
Assumption we made: every loop in the synchronous circuit has a DFF (a loop with no CL is meaningless).
The liveness conditions hold by construction (weaving); a per-loop check is sketched below.
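The two conditions above can be checked loop by loop once the loops and their initial markings are known. A small sketch, assuming each loop is given as a list of per-stage initial token counts; the layout and the function name are illustrative only.

```python
# Check the full-buffer liveness conditions stated above for one loop.
# Each loop is a list of per-stage initial token counts (0 or 1); this layout
# is illustrative, not the tool's internal netlist format.

def fb_loop_is_live(stage_tokens):
    n = len(stage_tokens)          # number of FB stages in the loop
    c = sum(stage_tokens)          # initial token count in the loop
    return n >= 2 and 1 <= c <= n - 1

# A 3-stage loop with one token (e.g. inferred for a DFF) is live;
# a completely full loop deadlocks, and so does an empty one.
assert fb_loop_is_live([1, 0, 0])
assert not fb_loop_is_live([1, 1, 1])
assert not fb_loop_is_live([0, 0, 0])
```

The weaving rules satisfy these conditions automatically: the forward circuit receives a token for each DFF and the feedback circuit receives a NULL, so no explicit check is needed in the flow.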
28
Initialization: example
29
Initialization: FSM example [figure: HB stages]
30
Flow equivalence
- The GTL data-flow structure is equivalent to the source RTL by weaving: no data dependencies are removed and no additional dependencies are introduced
- In the deterministic-flow architecture there are no token races (tokens cannot pass each other): all forks are broadcast and all joins are synchronizers
- Flow equivalence is therefore preserved by construction (an operational check is sketched below)
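Flow equivalence can also be stated operationally: for every sequential element, the RTL and GTL implementations produce the same sequence of values for the same input stream, even though the GTL timing is independent. A hedged sketch of such a check, assuming per-register value traces are already available from simulating both implementations (with the GTL NULL/spacer phases filtered out); the trace format is an assumption made for illustration.

```python
# Compare per-register value sequences from an RTL simulation (one value per
# clock cycle) and a GTL simulation (one value per token, spacers already
# dropped). Flow equivalence requires the sequences to match register by
# register, regardless of when each token actually arrived.

def flow_equivalent(rtl_traces, gtl_traces):
    """rtl_traces / gtl_traces: dict mapping register name -> list of values."""
    if rtl_traces.keys() != gtl_traces.keys():
        return False
    # The GTL run may have produced more values by the time we stop observing,
    # so compare against the prefix of the same length.
    return all(rtl_traces[r] == gtl_traces[r][:len(rtl_traces[r])]
               for r in rtl_traces)

# Same values in the same order, even though GTL has run further ahead.
rtl = {"r0": [2, 1, 1, 2], "r1": [1, 2, 3]}
gtl = {"r0": [2, 1, 1, 2, 3], "r1": [1, 2, 3]}
assert flow_equivalent(rtl, gtl)
```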
31
Flow equivalence (animated token-flow example, slides 31-42): snapshots of token values propagating through the GTL and RTL pipelines. GTL initialization is the same as in RTL, but token propagation is independent: values such as "2" and "3" hit the corresponding register outputs at different times in the two implementations, yet for each register the order of values is unchanged. Timing is independent; the order is preserved.
43
Optimizations
- Area: optimizing out identity-function stages
- Performance: fine-grain pipelining (natural), slack matching
44
Optimizing out identity-function stages
- Identity-function stages (buffers) are inferred for clocked DFFs and D-latches and implement no functionality
- They can be removed as long as:
  - the token capacity is not decreased below the RTL level, and
  - the resulting circuit can still be properly initialized
(A sketch of this check follows below.)
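The two removal conditions above lend themselves to a simple per-loop check. This is a rough sketch under a full-buffer capacity model (a loop of N stages holds at most N - 1 tokens); the loop bookkeeping, data layout, and function names are hypothetical.

```python
# Sketch of deciding whether an identity-function (buffer) stage may be
# removed: for every loop through it, dropping it must keep the loop's token
# capacity at or above the RTL requirement and leave a live initial marking.
# 'loops' maps loop id -> (stage list, required RTL token capacity);
# 'initial_tokens' maps stage -> 0/1. Both formats are hypothetical.

def removable(buffer_stage, loops, initial_tokens):
    for stages, rtl_capacity in loops.values():
        if buffer_stage not in stages:
            continue
        remaining = [s for s in stages if s != buffer_stage]
        capacity = len(remaining) - 1              # full-buffer loop capacity
        tokens = sum(initial_tokens[s] for s in remaining)
        # Capacity must not drop below the RTL level, and the loop must still
        # hold a live marking (at least one token, not completely full).
        if capacity < rtl_capacity or not (1 <= tokens <= capacity):
            return False
    return True

loops = {"L0": (["dff_tok", "dff_spacer", "buf0", "cl0"], 1)}
init = {"dff_tok": 1, "dff_spacer": 0, "buf0": 0, "cl0": 0}
assert removable("buf0", loops, init)
```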
45
Optimizing out identity-function stages: example [figure: CL, HB, DFF stages]. The final implementation is the same as if the RTL had not been pipelined (except for initialization), which saves pipelining effort.
46
Slack matching implementation
- Adjusting the pipeline slack to optimize its throughput
- Implementation:
  - leveling gates according to their shortest paths from the primary inputs (outputs)
  - inserting buffer stages to break long dependencies; buffer stages are initialized to NULL
- Currently performed only for circuits with no loops
- Complexity O(|X|·|C|^2), where |X| is the number of primary inputs and |C| is the number of connection points in the netlist
(A leveling and buffer-insertion sketch follows below.)
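A sketch of the leveling and buffer-insertion steps for loop-free netlists. The slide levels gates by path length from the primary inputs (outputs); this sketch uses longest-path (depth) leveling so that the shorter of two reconvergent paths is the one that receives NULL-initialized buffer stages, which is an assumed reading of the intended scheme. The fanin dictionary format is also an assumption.

```python
# Sketch of the buffer-insertion step of slack matching on an acyclic netlist.
# 'fanin' maps gate -> list of driver gates; primary inputs have an empty
# driver list. The representation is illustrative, not the tool's netlist.

from graphlib import TopologicalSorter

def depth_levels(fanin):
    order = TopologicalSorter(fanin).static_order()   # drivers come first
    level = {}
    for g in order:
        drivers = fanin[g]
        level[g] = 0 if not drivers else 1 + max(level[d] for d in drivers)
    return level

def buffers_to_insert(fanin):
    level = depth_levels(fanin)
    # For every edge d -> g that skips levels, report how many buffer stages
    # (each initialized to NULL) are needed to equalize the slack.
    return [(d, g, level[g] - level[d] - 1)
            for g, drivers in fanin.items()
            for d in drivers
            if level[g] - level[d] > 1]

# x reconverges at b through a deep path (x -> a -> b) and a direct edge,
# so the direct edge x -> b needs one buffer stage.
netlist = {"x": [], "a": ["x"], "b": ["a", "x"]}
print(buffers_to_insert(netlist))    # [('x', 'b', 1)]
```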
47
Slack matching correctness
- Increases the token capacity: potentially increases performance
- Does not affect the number of initial tokens: liveness is not affected
- Does not affect the system structure: flow equivalence is not affected
48
Experimental results: MCNC
- RTL implementation: not pipelined
- GTL implementation: naturally fine-grain pipelined, slack matching performed
- Both implementations obtained automatically from the same VHDL behavioral specification
- On average ~4x better performance
49
Experimental results: AES: ~36x better performance, ~12x larger
50
Base line
- Demonstrated automated synthesis of QDI (variation-robust), gate-level pipelined implementations from large behavioral specifications
- Synthesis run time is comparable with RTL synthesis (~2.5x slower), so design time could be reduced
- Resulting circuits feature increased performance (depth-dependent, ~4x for MCNC) at an area overhead
- Practical solution: first prerelease at http://async.bu.edu/weaver/
- Demonstrated correctness of the transformations (weaving)
51
Future work
- Library design
  - Dynamic (domino-like) library design
  - Low-leakage library design, to combine the high performance of fine-grain pipelining with the low power of very aggressive voltage reduction
  - Balanced library for security-related applications
- Extending the concept to other technologies
  - Automated asynchronous fine-grain pipelining for standard FPGAs
- Synthesis flow development
  - Integration of efficient GTL "design-ware" and architectures
52
Thank you! Questions? Comments? Suggestions?
53
Backup slides
- Slack matching animated example
- Similar work
- FSM + datapath example (1-round AES)
- Experiments setup
- Linear HB PPN
- Non-linear HB PPN
- Closed linear HB pipeline liveness
54
Slack matching: example (C17)
59
Back to backup slides
60
Similar work: the difference
- Null Convention Logic: coarse-grain; slow and large synchronization trees
- Phased logic: a different encoding provides less switching activity; complicated synthesis algorithm due to the encoding
- De-synchronization: bundled data; coarse grain
None of the above provides support for automated fine-grain pipelining.
61
Back to backup slides
62
Example: data path (animated sequence, slides 62-65) [figure: FSM with combinational logic (CL), registers (REG), and MUX/DEMUX blocks, transformed step by step]
66
Back to backup slides
67
Experiments setup
- Standard-gate library: vtvt from Virginia Tech, TSMC 0.25 um
- C-elements: derived from the PCHB library from USC and simulated to obtain performance
68
Back to backup slides
69
All correctness prerequisites
1. No additional data dependencies are added and no existing data dependencies are removed during weaving.
2. Every gate implementing a logical function is mapped to a GTL gate (stage) implementing the equivalent function for dual-rail encoded data and initialized to NULL (spacer).
3. A closed asynchronous HB pipeline's maximum token capacity is S/2 - 1, where S is the number of HB stages.
4. A closed asynchronous FB pipeline's maximum token capacity is S - 1, where S is the number of FB stages.
5. In HB pipelines distinct tokens are always separated by spacers (there are no two distinct tokens in any two adjacent stages).
6. For each DFF in the RTL implementation there exist in the GTL implementation two HB stages, one initialized to a spacer and the other to a token.
7. The number of HB pipeline stages in any cycle of the GTL implementation is greater than the number of DLs (or half-DFFs) in the corresponding synchronous RTL implementation.
8. The GTL pipeline token capacity is greater than or equal to that of the synchronous implementation.
9. No stage state is shared between any two stages.
10. Exactly one place is marked in every stage state.
11. An HB PPN marking is valid iff every FB-stage in the HB PPN has exactly one marker.
12. The GTL-style pipeline is properly modeled by an HB PPN.
13. A live closed HB PPN is at least 3 HB stages long.
14. A live closed HB PPN has at least one token and at most S/2 - 1 tokens.
15. The token flow is deterministic and does not depend on the data itself.
16. A marked graph is live iff M0 assigns at least one token to each directed loop (circuit).
17. For an HB PPN to be live, each of its directed circuits composed of forward arcs must, as a closed HB PPN, satisfy conditions 11, 13, and 14.
18. Every feedback loop in the synchronous implementation contains at least one DFF (or a pair of DLs).
70
Back to backup slides
71
Linear pipeline: a PPN models full-buffer pipelines; an HB PPN models half-buffer pipelines
72
Linear pipeline: an HB PPN stage has three states; a PPN stage has two states
73
Linear pipeline: an HB PPN stage has three states and properly models the HB GTL implementation
74
Back to backup slides
75
Non-linear pipeline HB PPN model: the PPN is equivalent to the HB PPN except for token capacity
76
Non-linear pipeline HB PPN model: the MG PN is equivalent to the HB PPN except for token capacity
77
Back to backup slides
78
Closed linear HB pipeline is live iff:
- Every loop has at least 3 stages
- Token capacity for any loop of N stages: 1 ≤ C ≤ N/2 - 1
Assumption we made: every loop in the synchronous circuit has a DFF (a loop with no CL is meaningless).
The liveness conditions hold.
79
Back to backup slides