
Gate Transfer Level Synthesis as an Automated Approach to Fine-Grain Pipelining Alexander Smirnov Alexander Taubin Mark Karpovsky Leonid Rozenblyum.


1 Gate Transfer Level Synthesis as an Automated Approach to Fine-Grain Pipelining Alexander Smirnov Alexander Taubin Mark Karpovsky Leonid Rozenblyum

2 Presentation goals Present an overview of the synthesis framework; demonstrate a high-level pipeline model; demonstrate synthesis correctness; illustrate how correctness is guaranteed; present experimental results; conclusions; future work

3 Objective An industrial-quality EDA flow for automated synthesis of fine-grain pipelined robust circuits from high-level specifications. Industrial quality: easy to integrate into an RTL-oriented environment; capable of handling very large designs (scalability). Automated fine-grain pipelining: to achieve high performance (throughput); automated to reduce design time

4 Choice of paradigm Synchronous RTL: 8 logic levels per stage is the limit, due to register, clock-skew and jitter overhead; timing closure; no pipelining automation available (stage balancing is difficult); performance limitations (to guarantee correctness under process variation etc.). Asynchronous GTL: lower design time (automated pipelining possible from an RTL specification); higher performance (gate-level, the finest possible, pipelining achievable); controllable power consumption (smoothly slows down under voltage reduction); improved yield (correct operation regardless of variations)

5 Easy integration &amp; scalability: Weaver flow architecture Reuse of RTL tools: creates the impression that nothing has changed; saves development effort. Substitution-based transformations: linear complexity; enabled by using functionally equivalent DR (dual-rail: physical) and SR (single-rail: virtual) libraries

6 Easy integration &amp; scalability: Weaver flow architecture Synthesis flow: interfacing with the host synthesis engine; transforming synchronous RTL to asynchronous GTL (weaving). Dedicated library(ies): dual-rail encoded data logic; cells comprising entire stages; internal delay assumptions only

7 Automated fine-grain pipelining: Gate Transfer Level (GTL) Gate-level pipeline REGCombinational logic

8 Automated fine-grain pipelining: Gate Transfer Level (GTL) Gate-level pipeline Let gates communicate asynchronously and independently

9 Automated fine-grain pipelining: Gate Transfer Level (GTL) Gate-level pipeline Let gates communicate asynchronously and independently Many pipeline styles can be used

10 Automated fine-grain pipelining: Gate Transfer Level (GTL) Gate-level pipeline Let gates communicate asynchronously and independently Many pipeline styles can be used Templates already exist

11 Weaving Critical transformations: mapping combinational gates (basic weaving); mapping sequential gates; initialization preserving liveness and safeness. Optimizations: performance (natural fine-grain pipelining, slack matching); area (optimizing out identity function stages)

12 Basic Weaving De Morgan transformation; dual-rail expansion; gate substitution; generating req/ack signals (Merge insertion, Fork insertion); reset routing
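An illustrative sketch of the dual-rail expansion step above. The encoding and helper names here are mine, not Weaver's: each single-rail wire becomes a (true-rail, false-rail) pair, and after the De Morgan transform pushes inversions to the inputs, an inverter costs nothing but a rail swap.

```python
def dr_and(a, b):
    """Dual-rail AND: a, b are (true_rail, false_rail) pairs of 0/1."""
    return (a[0] & b[0], a[1] | b[1])

def dr_or(a, b):
    """Dual-rail OR (the dual of AND under De Morgan)."""
    return (a[0] | b[0], a[1] & b[1])

def dr_not(a):
    """Dual-rail inversion is just a rail swap -- no gate needed."""
    return (a[1], a[0])

def encode(bit):
    return (1, 0) if bit else (0, 1)

def decode(pair):
    assert pair in ((1, 0), (0, 1)), "not a valid dual-rail codeword"
    return pair == (1, 0)

# Exhaustive check that the dual-rail gates match their Boolean originals.
for x in (0, 1):
    for y in (0, 1):
        assert decode(dr_and(encode(x), encode(y))) == bool(x and y)
        assert decode(dr_or(encode(x), encode(y))) == bool(x or y)
    assert decode(dr_not(encode(x))) == (not x)
```

The (0, 0) codeword is the NULL spacer separating successive tokens; the exhaustive loop checks only valid data codewords.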

13 Basic Weaving: example (C17 MCNC benchmark)

14 Linear pipeline (RTL)

15 Linear pipeline pipeline PN (PPN) model with local handshake pipeline PN model with global synchronization

16 Linear pipeline pipeline PN (PPN) model with local handshake pipeline PN model with global synchronization

17 Linear pipeline PPN models asynchronous full-buffer pipelines pipeline PN model with global synchronization

18 Linear pipeline GTL implementation RTL implementation

19 Correctness Safeness: guarantees that the number of data portions (tokens) stays the same over time. Liveness: guarantees that the system operates continuously. Flow equivalence: in both RTL and GTL implementations, corresponding sequential elements hold the same data values, on the same iterations (order-wise), for the same input stream
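A toy illustration of safeness (mine, not the talk's formal model): one step of a full-buffer ring pipeline moves a token forward only into an empty stage, so the token count is invariant over time, and some stage can always fire as long as there is at least one token and one bubble.

```python
def step(stages):
    """stages: list of tokens/None forming a ring; returns the next marking.

    A token advances into the next stage only if that stage is empty,
    so tokens are conserved and can never overtake each other.
    """
    nxt = list(stages)
    for i, tok in enumerate(stages):
        j = (i + 1) % len(stages)
        if tok is not None and stages[j] is None and nxt[j] is None:
            nxt[j], nxt[i] = tok, None
    return nxt

ring = [1, None, 2, None]
for _ in range(8):
    ring = step(ring)
    # Safeness: the number of tokens never changes.
    assert sum(t is not None for t in ring) == 2
```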

20 Non-linear pipelines Deterministic token flow: broadcasting tokens to all channels at Forks; synchronizing at Merges. Data-dependent token flow: Ctrl is also a dual-rail channel; to guarantee liveness, MUXes need to match deMUXes, which is computationally hard

21 Non-linear pipeline liveness Currently guaranteed only for deterministic token flow, by construction (weaving). A marking of a marked graph is live if each directed PN circuit has a marker; closed linear pipelines can be considered instead
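The marked-graph condition above can be checked mechanically. This sketch is my own formulation, not the talk's: a marked graph is live iff every directed circuit carries at least one token, which is equivalent to saying that after deleting every token-carrying arc, the remaining graph is acyclic.

```python
def is_live(nodes, edges, marking):
    """edges: list of (u, v) arcs; marking: set of token-carrying arcs.

    Live iff the subgraph of unmarked arcs has no directed cycle
    (any remaining cycle would be a token-free circuit).
    """
    adj = {n: [] for n in nodes}
    for u, v in edges:
        if (u, v) not in marking:
            adj[u].append(v)

    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in nodes}

    def has_cycle(n):  # standard three-color DFS cycle detection
        color[n] = GRAY
        for m in adj[n]:
            if color[m] == GRAY or (color[m] == WHITE and has_cycle(m)):
                return True
        color[n] = BLACK
        return False

    return not any(color[n] == WHITE and has_cycle(n) for n in nodes)

# A 3-stage ring: live with one token, deadlocked with none.
ring = [("a", "b"), ("b", "c"), ("c", "a")]
assert is_live("abc", ring, {("c", "a")})
assert not is_live("abc", ring, set())
```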

22 Closed linear PPN Every PPN “stage” is a circuit and has a marker by definition

23 Closed linear PPN Every PPN “stage” is a circuit and has a marker by definition

24 Closed linear PPN Every PPN “stage” is a circuit and has a marker by definition

25 Closed linear PPN Every PPN “stage” is a circuit and has a marker by definition Each implementation loop forms two directed circuits  Forward – has at least one token inferred for a DFF

26 Closed linear PPN Every PPN “stage” is a circuit and has a marker by definition Each implementation loop forms two directed circuits  Forward – has at least one token inferred for a DFF  Feedback – has at least one NULL inferred from CL or added explicitly

27 Closed linear PPN pipeline is live iff (for full-buffer pipelines): every loop has at least 2 stages; token capacity for any loop: 1 ≤ C ≤ N − 1. Assumption we made: every loop in a synchronous circuit has a DFF (a loop with no CL is meaningless). Liveness conditions hold by construction (weaving)
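The full-buffer conditions above reduce to a one-line predicate (a sketch of mine, not Weaver's code): a closed loop of N stages holding C tokens is live iff it has at least 2 stages, at least one token to move, and at least one bubble to move into.

```python
def fb_loop_is_live(n_stages: int, n_tokens: int) -> bool:
    """Liveness of a closed full-buffer loop: 1 <= C <= N - 1, N >= 2."""
    return n_stages >= 2 and 1 <= n_tokens <= n_stages - 1

assert fb_loop_is_live(2, 1)        # smallest live FB loop
assert not fb_loop_is_live(3, 3)    # no bubble: tokens cannot move
assert not fb_loop_is_live(3, 0)    # no token: nothing to move
```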

28 Initialization: example

29 Initialization: FSM example … HB …

30 Flow equivalence The GTL data-flow structure is equivalent to the source RTL by weaving: no data dependencies are removed; no additional dependencies are introduced. In the deterministic-flow architecture: there are no token races (tokens cannot pass each other); all forks are broadcast and all joins are synchronizers; flow equivalence is preserved by construction

31 Flow equivalence [animation: register token streams in the GTL and RTL implementations] GTL initialization is the same as RTL

32 Flow equivalence [animation frame] ...but token propagation is independent

33 Flow equivalence [animation frame] ...but token propagation is independent

34 Flow equivalence [animation frame] ...but token propagation is independent

35 Flow equivalence [animation frame] ...but token propagation is independent

36 Flow equivalence [animation frame] ...but token propagation is independent

37 Flow equivalence [animation frame] ...but token propagation is independent; in GTL "3" hits the first top register output

38 Flow equivalence [animation frame] ...but token propagation is independent

39 Flow equivalence [animation frame] in GTL "3" hits the first bottom register output ...but token propagation is independent

40 Flow equivalence [animation frame] ...but token propagation is independent

41 Flow equivalence [animation frame] in GTL "2" hits the second register output ...but token propagation is independent

42 Flow equivalence [animation frame] ...but token propagation is independent; in RTL "3" and "2" moved one stage ahead; timing is independent, the order is unchanged

43 Optimizations Area: optimizing out identity function stages. Performance: fine-grain pipelining (natural); slack matching

44 Optimizing out identity function stages Identity function stages (buffers) are inferred for clocked DFFs and D-latches. They implement no functionality and can be removed as long as: the token capacity is not decreased below the RTL level; the resulting circuit can still be properly initialized

45 Optimizing out identity function stages: example [diagram: CL, HB, DFF] The final implementation is the same as if the RTL had not been pipelined (except for initialization); saves pipelining effort

46 Slack matching implementation Adjusting the pipeline slack to optimize its throughput. Implementation: leveling gates according to their shortest paths from primary inputs (outputs); inserting buffer stages to break long dependencies; buffer stages initialized to NULL. Currently performed only for circuits with no loops. Complexity O(|X|·|C|²), where |X| is the number of primary inputs and |C| is the number of connection points in the netlist
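A hedged sketch of the buffer-insertion step just described (the function names and data structures are mine, not Weaver's): level each gate topologically from the primary inputs, then pad any connection that skips levels with NULL-initialized buffer stages so that reconvergent paths carry matching slack. Loops are assumed absent, as stated above.

```python
def slack_match(inputs, gates):
    """inputs: primary input names; gates: dict name -> list of fanin names.

    Returns (level, buffers): the level of every node and a list of
    (driver, sink, count) triples of NULL buffer stages to insert.
    Assumes the netlist is acyclic.
    """
    level = {i: 0 for i in inputs}

    def depth(g):                      # memoized longest path from inputs
        if g not in level:
            level[g] = 1 + max(depth(f) for f in gates[g])
        return level[g]

    for g in gates:
        depth(g)

    buffers = []
    for g, fanins in gates.items():
        for f in fanins:
            skip = level[g] - level[f] - 1
            if skip > 0:               # edge skips levels: pad with buffers
                buffers.append((f, g, skip))
    return level, buffers

# x feeds g2 both directly and through g1 -> one buffer pads the short path.
lv, buf = slack_match(["x", "y"], {"g1": ["x", "y"], "g2": ["x", "g1"]})
assert lv["g2"] == 2 and buf == [("x", "g2", 1)]
```

This uses longest-path leveling for simplicity; the slide's shortest-path variant differs only in the path metric, not in the padding idea.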

47 Slack matching correctness Increases the token capacity: potentially increases performance. Does not affect the number of initial tokens: liveness is not affected. Does not affect the system structure: flow equivalence is not affected

48 Experimental results: MCNC RTL implementation: not pipelined. GTL implementation: naturally fine-grain pipelined; slack matching performed. Both implementations obtained automatically from the same VHDL behavioral specification. On average ~4x better performance

49 Experimental results: AES ~36x better performance; ~12x larger

50 Base line Demonstrated automatic synthesis of QDI (robust to variations), automatically gate-level pipelined implementations from large behavioral specifications. Synthesis run time is comparable with RTL synthesis (~2.5x slower); design time could be reduced. Resulting circuits feature increased performance (depth-dependent, ~4x for MCNC) at the cost of area overhead. Practical solution: first prerelease at http://async.bu.edu/weaver/ Demonstrated correctness of the transformations (weaving)

51 Future work Library design: dynamic (domino-like) library design; low-leakage library design, to combine the high performance of fine-grain pipelining with low power from very aggressive voltage reduction; balanced library for security-related applications. Extending the concept to other technologies: automated asynchronous fine-grain pipelining for standard FPGAs. Synthesis flow development: integration of efficient GTL "design-ware" and architectures

52 Thank you! Questions? Comments? Suggestions?

53 Backup slides Slack matching animated example Similar work FSM + datapath example (1-round AES) Experiments setup Linear HB PPN Non-linear HB PPN Closed linear HB pipeline liveness

54 Slack matching: example (C17)

55

56

57

58

59 Back to backup slides

60 Similar work: the difference Null Convention Logic: coarse-grain; slow and large synchronization trees. Phased logic: different encoding provides less switching activity; complicated synthesis algorithm due to the encoding. De-synchronization: bundled data; coarse-grain. None of the above provides support for automated fine-grain pipelining

61 Back to backup slides

62 Example: data path CL FSM CLREG MUX CL REG

63 Example: data path CL FSM CLREG MUX CL

64 Example: data path CL FSM CLREG MUX CL

65 Example: data path CL FSM CL MUX DE MUX CL

66 Back to backup slides

67 Experiments setup Standard-gate library: vtvt from Virginia Tech, TSMC 0.25. C-elements: derived from the PCHB library from USC and simulated to obtain performance

68 Back to backup slides

69 All correctness prerequisites
1. No additional data dependencies are added and no existing data dependencies are removed during weaving.
2. Every gate implementing a logical function is mapped to a GTL gate (stage) implementing the equivalent function for dual-rail encoded data, initialized to NULL (spacer).
3. Closed asynchronous HB pipeline maximum token capacity is ⌈S/2⌉ − 1 (where S is the number of HB stages).
4. Closed asynchronous FB pipeline maximum token capacity is S − 1 (S is the number of FB stages).
5. In HB pipelines distinct tokens are always separated with spacers (there are no two distinct tokens in any two adjacent stages).
6. For each DFF in the RTL implementation there exist in the GTL implementation two HB stages, one initialized to a spacer and the other to a token.
7. The number of HB pipeline stages in any cycle of the GTL implementation is greater than the number of DLs (or half-DFFs) in the corresponding synchronous RTL implementation.
8. GTL pipeline token capacity is greater than or equal to that of the synchronous implementation.
9. No stage state is shared between any two stages.
10. Exactly one place is marked in every stage state.
11. An HB PPN marking is valid iff every FB-stage in the HB PPN has exactly one marker.
12. A GTL-style pipeline is properly modeled by an HB PPN.
13. A live closed HB PPN is at least 3 HB stages long.
14. A live closed HB PPN has at least one token and at most ⌈S/2⌉ − 1 tokens.
15. The token flow is deterministic and does not depend on the data itself.
16. A marked graph is live iff M0 assigns at least one token on each directed loop (circuit).
17. For an HB PPN to be live, each of its directed circuits composed of forward arcs must, as a closed HB PPN, satisfy conditions (11), (13) and (14).
18. Every feedback loop in the synchronous implementation contains at least one DFF (or a pair of DLs).

70 Back to backup slides

71 Linear pipeline PPN models full-buffer pipelines HB PPN models half-buffer pipelines

72 Linear pipeline HB PPN stage has three states PPN stage has two states

73 Linear pipeline HB PPN stage has three states models properly HB GTL implementation

74 Back to backup slides

75 Non-linear pipeline HB PPN model PPN equivalent to HB PPN besides token capacity

76 Non-linear pipeline HB PPN model MG PN equivalent to HB PPN besides token capacity

77 Back to backup slides

78 Closed linear HB pipeline is live iff: every loop has at least 3 stages; token capacity for any loop: 1 ≤ C ≤ ⌈N/2⌉ − 1. Assumption we made: every loop in a synchronous circuit has a DFF (a loop with no CL is meaningless). Liveness conditions hold
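The half-buffer conditions above, as the same style of predicate (a sketch of mine): tokens must be separated by spacers, so a closed loop of N HB stages holds at most ⌈N/2⌉ − 1 tokens and needs at least 3 stages.

```python
from math import ceil

def hb_loop_is_live(n_stages: int, n_tokens: int) -> bool:
    """Liveness of a closed half-buffer loop: 1 <= C <= ceil(N/2) - 1, N >= 3."""
    return n_stages >= 3 and 1 <= n_tokens <= ceil(n_stages / 2) - 1

assert hb_loop_is_live(3, 1)       # smallest live HB loop
assert not hb_loop_is_live(4, 2)   # ceil(4/2) - 1 = 1 token max
assert not hb_loop_is_live(2, 1)   # too short for half-buffer handshaking
```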

79 Back to backup slides

