Gate Transfer Level Synthesis as an Automated Approach to Fine-Grain Pipelining Alexander Smirnov Alexander Taubin Mark Karpovsky Leonid Rozenblyum
Presentation goals Present and overview the synthesis framework Demonstrate a high-level pipeline model Demonstrate the synthesis correctness Illustrate how the correctness is guaranteed Present experimental results Conclusions Future work
Objective Industrial quality EDA flow for automated synthesis of fine-grain pipelined robust circuits from high-level specifications Industrial quality Easy to integrate in RTL oriented environment Capable of handling very large designs – scalability Automated fine-grain pipelining To achieve high performance (throughput) Automated to reduce design time
Choice of paradigm. Synchronous RTL: 8 logic levels per stage is the limit, due to register, clock skew and jitter overhead; timing closure; no pipelining automation available – stage balancing is difficult; performance limitations to guarantee correctness under process variation etc. Asynchronous GTL: lower design time – automated pipelining possible from the RTL specification; higher performance – gate-level (finest possible) pipelining achievable; controllable power consumption – smoothly slows down in case of voltage reduction; improved yield – correct operation regardless of variations.
Easy integration & scalability: Weaver flow architecture RTL tools reuse Creates the impression that nothing has changed Saves development effort Substitution based transformations Linear complexity Enabled by using functionally equivalent DR (dual-rail: physical) and SR (single rail: virtual) libraries
Easy integration & scalability: Weaver flow architecture Synthesis flow Interfacing with host synthesis engine Transforming Synchronous RTL to Asynchronous GTL – Weaving Dedicated library(ies) Dual-rail encoded data logic Cells comprising entire stages Internal delay assumptions only
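To make the substitution idea concrete, below is a minimal Python sketch (cell names and the table are hypothetical, not the actual Weaver libraries) of a linear-complexity pass that replaces each single-rail (SR) cell with its functionally equivalent dual-rail (DR) cell:

```python
# Hypothetical SR -> DR substitution table; the real library names differ.
SR_TO_DR = {
    "AND2": "DR_AND2",   # dual-rail AND stage
    "OR2":  "DR_OR2",    # dual-rail OR stage
    "INV":  "DR_INV",    # inversion becomes a rail swap in dual-rail data
    "DFF":  "DR_HB_BUF", # a clocked DFF maps to half-buffer stage(s)
}

def weave_netlist(instances):
    """Substitute every SR cell with its DR counterpart.

    `instances` is a list of (instance_name, cell_type) pairs.
    One pass over the netlist, so complexity is linear in its size.
    """
    woven = []
    for name, cell in instances:
        dr_cell = SR_TO_DR.get(cell)
        if dr_cell is None:
            raise ValueError(f"no DR equivalent for cell type {cell}")
        woven.append((name, dr_cell))
    return woven

if __name__ == "__main__":
    print(weave_netlist([("u1", "AND2"), ("u2", "INV"), ("r1", "DFF")]))
```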
Automated fine-grain pipelining: Gate Transfer Level (GTL). Gate-level pipeline (register and combinational logic figure): let gates communicate asynchronously and independently; many pipeline styles can be used; templates already exist.
Weaving Critical transformations Mapping combinational gates (basic weaving) Mapping sequential gates Initialization preserving liveness and safeness Optimizations Performance optimization Fine-grain pipelining (natural) Slack matching Area optimization Optimizing out identity function stages
Basic Weaving De Morgan transformation Dual-rail expansion Gate substitution Generating req/ack signals Merge insertion Fork insertion Reset routing
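As an illustration of the dual-rail expansion step, here is a minimal Python sketch of a 2-input AND expanded to dual-rail (1-of-2) data with a NULL spacer, under the usual (true-rail, false-rail) encoding; a real GTL stage would additionally handshake (req/ack) with its neighbors:

```python
NULL = (0, 0)  # spacer: neither rail asserted

def encode(bit):
    """Logic 1 -> (1, 0), logic 0 -> (0, 1)."""
    return (1, 0) if bit else (0, 1)

def dr_and(a, b):
    """Dual-rail expansion of a 2-input AND.

    z.t = a.t AND b.t; the false rail follows from De Morgan: z.f = a.f OR b.f.
    The stage outputs NULL until both inputs carry valid codewords.
    """
    if a == NULL or b == NULL:
        return NULL
    at, af = a
    bt, bf = b
    return (at & bt, af | bf)

assert dr_and(encode(1), encode(1)) == encode(1)
assert dr_and(encode(1), encode(0)) == encode(0)
assert dr_and(NULL, encode(1)) == NULL
```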
Basic Weaving: example (C17 MCNC benchmark)
Linear pipeline (RTL)
Linear pipeline: pipeline PN (PPN) model with local handshake vs. pipeline PN model with global synchronization. The PPN models asynchronous full-buffer pipelines; the figure compares the GTL and RTL implementations.
Correctness. Safeness guarantees that the number of data portions (tokens) stays the same over time. Liveness guarantees that the system operates continuously. Flow equivalence: in both RTL and GTL implementations corresponding sequential elements hold the same data values, on the same iterations (order-wise), for the same input stream.
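One compact way to state the flow-equivalence property above (the notation is introduced here only for illustration):

```latex
% \sigma_r^{RTL}(k \mid I) and \sigma_r^{GTL}(k \mid I): value held by
% sequential element r on its k-th iteration, for the same input stream I.
\forall r \in \mathit{Regs},\ \forall k \ge 0:\quad
  \sigma_r^{RTL}(k \mid I) \;=\; \sigma_r^{GTL}(k \mid I)
```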
Non-linear pipelines. Deterministic token flow: broadcasting tokens to all channels at forks; synchronizing at merges. Data-dependent token flow: Ctrl is also a dual-rail channel; to guarantee liveness MUXes need to match deMUXes – computationally hard.
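A minimal Python sketch of the deterministic fork/merge rules above, with single-place channels modeled as lists (names are illustrative):

```python
def fork_fire(in_ch, out_chs):
    """Broadcast: consume one token and copy it to every output channel."""
    if in_ch and all(len(ch) == 0 for ch in out_chs):
        tok = in_ch.pop(0)
        for ch in out_chs:
            ch.append(tok)
        return True
    return False

def merge_fire(in_chs, out_ch):
    """Synchronize: fire only when every input channel holds a token."""
    if all(in_chs) and not out_ch:
        out_ch.append(tuple(ch.pop(0) for ch in in_chs))
        return True
    return False

a, b, c, z = [1], [], [], []
assert fork_fire(a, [b, c])      # the token is broadcast to both branches
assert merge_fire([b, c], z)     # fires only once both branches have arrived
assert z == [(1, 1)]
```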
Non-linear pipeline liveness Currently guaranteed for deterministic token flow only by construction (weaving) A marking of a marked graph is live if each directed PN circuit has a marker Linear closed pipelines can be considered instead
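The liveness criterion above can be checked without enumerating circuits: every directed circuit carries a marker iff the subgraph of unmarked arcs is acyclic. A minimal Python sketch, assuming the marking assigns token counts to arcs:

```python
def mg_marking_is_live(arcs, marking):
    """Marked-graph liveness: every directed circuit holds at least one token.

    Equivalent test: the subgraph of zero-token arcs must be acyclic
    (a cycle there would be a token-free circuit).
    `arcs` is an iterable of (src, dst); `marking` maps arcs to token counts.
    """
    unmarked = [(u, v) for (u, v) in arcs if marking.get((u, v), 0) == 0]
    succ = {}
    for u, v in unmarked:
        succ.setdefault(u, []).append(v)

    WHITE, GREY, BLACK = 0, 1, 2
    color = {}

    def has_cycle(node):
        color[node] = GREY
        for nxt in succ.get(node, []):
            c = color.get(nxt, WHITE)
            if c == GREY or (c == WHITE and has_cycle(nxt)):
                return True
        color[node] = BLACK
        return False

    nodes = {u for u, _ in unmarked} | {v for _, v in unmarked}
    return not any(color.get(n, WHITE) == WHITE and has_cycle(n) for n in nodes)

# Two-stage ring: a token on the feedback arc makes the marking live.
ring = [("s1", "s2"), ("s2", "s1")]
assert mg_marking_is_live(ring, {("s2", "s1"): 1})
assert not mg_marking_is_live(ring, {})
```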
Closed linear PPN. Every PPN “stage” is a circuit and has a marker by definition. Each implementation loop forms two directed circuits: forward – has at least one token, inferred for a DFF; feedback – has at least one NULL, inferred from CL or added explicitly.
Closed linear PPN pipeline is live iff (for full-buffer pipelines): every loop has at least 2 stages; token capacity for any loop: 1 ≤ C ≤ N − 1. Assumption we made – every loop in a synchronous circuit has a DFF (a loop with no CL is meaningless). Liveness conditions hold by construction (weaving).
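A minimal Python sketch of these per-loop conditions (the half-buffer variant from the backup slides is included for comparison; the N/2 − 1 bound is encoded literally as 2C ≤ N − 2):

```python
def fb_loop_is_live(n_stages, n_tokens):
    """Full-buffer loop: live iff N >= 2 and 1 <= C <= N - 1."""
    return n_stages >= 2 and 1 <= n_tokens <= n_stages - 1

def hb_loop_is_live(n_stages, n_tokens):
    """Half-buffer loop: live iff N >= 3 and 1 <= C <= N/2 - 1."""
    return n_stages >= 3 and n_tokens >= 1 and 2 * n_tokens <= n_stages - 2

# A loop inferred from one DFF plus a few combinational-logic stages:
assert fb_loop_is_live(n_stages=4, n_tokens=1)
assert not fb_loop_is_live(n_stages=4, n_tokens=4)   # no bubble left to move
assert hb_loop_is_live(n_stages=6, n_tokens=2)
assert not hb_loop_is_live(n_stages=4, n_tokens=2)   # exceeds N/2 - 1
```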
Initialization: example
Initialization: FSM example … HB …
Flow equivalence GTL data flow structure is equivalent to the source RTL by weaving No data dependencies are removed No additional dependencies introduced In deterministic flow architecture There are no token races (tokens cannot pass each other) All forks are broadcast and all joins are synchronizers Flow equivalence preserved by construction
Flow equivalence (animated example). GTL initialization is the same as in RTL, but token propagation is independent: in GTL, token “3” hits the first top register output, then the first bottom register output, and token “2” hits the second register output, while in RTL “3” and “2” have moved one stage ahead. The timing is independent, but the order of values at corresponding register outputs is unchanged.
Optimizations Area Optimizing out identity function stages Performance Fine-grain pipelining (natural) Slack matching
Optimizing out identity function stages Identity function stages (buffers) are inferred for clocked DFFs and D-latches Implement no functionality Can be removed as long as The token capacity is not decreased below the RTL level The resulting circuit can still be properly initialized
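A minimal sketch of the removal conditions as a predicate (the data structures and the full-buffer capacity rule used here are illustrative, not the Weaver implementation):

```python
def can_optimize_out(stage, remaining_stages, loop_tokens, rtl_tokens):
    """Decide whether an identity (buffer) stage may be removed from a loop.

    stage:            e.g. {"kind": "BUF", "init": "NULL"} (illustrative)
    remaining_stages: stages left on the loop if this one is dropped
    loop_tokens:      initial tokens on the loop
    rtl_tokens:       tokens the corresponding RTL loop must hold
    """
    if stage["kind"] != "BUF":
        return False                       # only identity stages are candidates
    capacity_after = remaining_stages - 1  # full-buffer loop capacity: N - 1
    return (capacity_after >= rtl_tokens           # capacity not below RTL level
            and loop_tokens <= capacity_after      # still initializable
            and stage["init"] == "NULL")           # no initial token is lost

# A 5-stage loop holding 1 token can spare one NULL-initialized buffer stage.
assert can_optimize_out({"kind": "BUF", "init": "NULL"}, 4, 1, 1)
```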
Optimizing out identity function stages: example (CL, HB, DFF). The final implementation is the same as if the RTL had not been pipelined (except for initialization); saves pipelining effort.
Slack matching implementation. Adjusting the pipeline slack to optimize its throughput. Implementation: leveling gates according to their shortest paths from primary inputs (outputs); inserting buffer stages to break long dependencies; buffer stages initialized to NULL. Currently performed for circuits with no loops only. Complexity O(|X|·|C|²), where |X| is the number of primary inputs and |C| is the number of connection points in the netlist.
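A minimal Python sketch of the leveling and buffer-insertion idea for a loop-free netlist (for simplicity it levels by longest path from the primary inputs, whereas the slide levels by shortest paths from inputs/outputs; names are illustrative and the O(|X||C|²) algorithm itself is not reproduced):

```python
def slack_match(fanins, primary_inputs):
    """Level gates and report, per connection, how many NULL-initialized
    buffer stages to insert so that reconverging paths have equal depth.

    fanins: dict gate -> list of drivers (gates or primary inputs)
    Returns (levels, buffers) with buffers[(driver, gate)] = stages to add.
    """
    levels = {x: 0 for x in primary_inputs}

    def level(g):
        if g not in levels:
            levels[g] = 1 + max(level(d) for d in fanins[g])
        return levels[g]

    for g in fanins:
        level(g)

    buffers = {}
    for g, drivers in fanins.items():
        for d in drivers:
            slack = (levels[g] - 1) - levels[d]   # extra stages on this edge
            if slack > 0:
                buffers[(d, g)] = slack
    return levels, buffers

# Unbalanced reconvergence: b feeds z both directly and through gate a.
levels, buffers = slack_match({"a": ["b"], "z": ["a", "b"]}, primary_inputs=["b"])
assert buffers == {("b", "z"): 1}   # one NULL buffer on the short path
```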
Slack matching correctness Increases the token capacity Potentially increases performance Does not affect the number of initial tokens Liveness is not affected Does not affect the system structure The flow equivalence is not affected
Experimental results: MCNC. RTL implementation: not pipelined. GTL implementation: naturally fine-grain pipelined, slack matching performed. Both implementations obtained automatically from the same VHDL behavioral specification. On average ~4x better performance.
Experimental results: AES. ~36x better performance, ~12x larger (area).
Base line. Demonstrated automatic synthesis of QDI (robust to variations), automatically gate-level pipelined implementations from large behavioral specifications. Synthesis run time is comparable with RTL synthesis (~2.5x slower) – design time could be reduced. Resulting circuits feature increased performance (depth dependent, ~4x for MCNC) at an area overhead. Practical solution – first prerelease at. Demonstrated correctness of the transformations (weaving).
Future work Library design Dynamic (domino-like) library design Low leakage library design to combine high performance of fine-grain pipelining with low power from very aggressive voltage reduction Balanced library for security related applications Extending the concept to other technologies Automated asynchronous fine-grain pipelining for standard FPGAs Synthesis flow development Integration of efficient GTL “design-ware” and architectures
Thank you! Questions? Comments? Suggestions?
Backup slides Slack matching animated example Similar work FSM + datapath example (1-round AES) Experiments setup Linear HB PPN Non-linear HB PPN Closed linear HB pipeline liveness
Slack matching: example (C17)
Similar work: the difference. Null Convention Logic: coarse-grain; slow and large synchronization trees. Phased logic: different encoding provides less switching activity; complicated synthesis algorithm due to the encoding. De-synchronization: bundled data; coarse-grain. None of the above provide support for automated fine-grain pipelining.
Example: data path (block diagram: FSM, CL, REG, MUX/DEMUX).
Experiments setup. Standard gate library: vtvt from Virginia Tech, TSMC 0.25µm. C-elements derived from the PCHB library from USC and simulated to obtain performance.
All correctness prerequisites
1. no additional data dependencies are added and no existing data dependencies are removed during weaving;
2. every gate implementing a logical function is mapped to a GTL gate (stage) implementing the equivalent function for dual-rail encoded data and initialized to NULL (spacer);
3. closed asynchronous HB pipeline maximum token capacity is S/2 - 1 (where S is the number of HB stages);
4. closed asynchronous FB pipeline maximum token capacity is S - 1 (S is the number of FB stages);
5. in HB pipelines distinct tokens are always separated with spacers (there are no two distinct tokens in any two adjacent stages);
6. for each DFF in the RTL implementation there exist in the GTL implementation two HB stages, one initialized to a spacer and another to a token;
7. the number of HB pipeline stages in any cycle of the GTL implementation is greater than the number of DLs (or half-DFFs) in the corresponding synchronous RTL implementation;
8. GTL pipeline token capacity is greater than or equal to that of the synchronous implementation;
9. no stage state is shared between any two stages;
10. exactly one place is marked in every stage state;
11. a HB PPN marking is valid iff every FB-stage in the HB PPN has exactly one marker;
12. a GTL style pipeline is properly modeled by a HB PPN;
13. a live closed HB PPN is at least 3 HB stages long;
14. a live closed HB PPN has at least one token and at most S/2 - 1 tokens;
15. the token flow is deterministic and does not depend on the data itself;
16. a marked graph is live iff M0 assigns at least one token to each directed loop (or circuit);
17. for a HB PPN to be live, each of its directed circuits composed of forward arcs, as a closed HB PPN, must satisfy conditions (xi), (xiii) and (xiv);
18. every feedback loop in the synchronous implementation contains at least one DFF (or a pair of DLs).
Linear pipeline. PPN models full-buffer pipelines; HB PPN models half-buffer pipelines. A PPN stage has two states; a HB PPN stage has three states and properly models the HB GTL implementation.
Non-linear pipeline HB PPN model. The PPN and the MG PN are equivalent to the HB PPN, except for token capacity.
Closed linear HB pipeline is live iff: every loop has at least 3 stages; token capacity for any loop: 1 ≤ C ≤ N/2 − 1. Assumption we made – every loop in a synchronous circuit has a DFF (a loop with no CL is meaningless). Liveness conditions hold.