Synthesis of synchronous elastic architectures Jordi Cortadella (Universitat Politècnica de Catalunya) Mike Kishinevsky (Intel Corp.) Bill Grundmann (Intel Corp.)

Presentation transcript:

Synthesis of synchronous elastic architectures Jordi Cortadella (Universitat Politècnica de Catalunya) Mike Kishinevsky (Intel Corp.) Bill Grundmann (Intel Corp.)

Network of Computing Units
[Figure: computing blocks B1, B2 and B3 connected between In and Out]

Latency-insensitive (elastic) system
[Figure: computing blocks B1, B2 and B3 connected between In and Out]
Every block makes a step only when all its inputs are valid

Why
Scalable
Modular (Plug & Play)
Tolerance to variable latency
 –Communication
 –Computation
Not asynchronous
 –Use existing design paradigms
 –CAD tools

Outline
The cost of elasticity
SELF: an elastic protocol
 –Basic implementation (linear pipelines)
 –General netlists (forks and joins)
 –Formal models and verification
Synthesis of elastic architectures
Related work

Elastic block
[Figure: a core wrapped by a control block with Data, Valid and Stop ports; CLK reaches the core as a gated clock]
What’s the cost of elasticity?

Communication channel
[Figure: sender and receiver connected by a Data wire]
Long wires: slow transmission

Pipelined communication
[Figure sequence: Data moving through pipeline stages from sender to receiver]
What if the sender does not always send valid data?

The Valid bit
[Figure sequence: sender-to-receiver pipeline carrying Data and Valid]
What if the receiver is not always ready?

The Stop bit
[Figure sequence: sender-to-receiver pipeline carrying Data, Valid and Stop]
Back-pressure
Long combinational path

Carloni’s relay stations (double storage)
[Figure sequence: sender and receiver, each a pearl wrapped in a shell, connected through relay stations with main and aux registers and Valid (V) / Stop (S) handshakes]
Handshakes with short wires
Double storage required

Proposal: an elastic protocol
SELF (Synchronous ELastic Flow)
Simple and provably correct
Data-path with no overhead in:
 –Area
 –Latency
 –Energy
Negligible control overhead
Fine-grain elasticity

Flip-flops vs. latches
[Figure sequence: sender and receiver one cycle apart, first through a flip-flop (FF), then through the same flip-flop drawn as a pair of H and L latches]
Flip-flops already have a double storage capability, but …
Not allowed in conventional FF-based design!
Let’s make the master/slave latches independent
[Figure: sender and receiver connected through H and L latches, ½ cycle per latch]
Only half of the latches (H or L) can move tokens

Elastic buffer keeps data while stop is in flight
[Figure: buffer configurations labeled W1R1, W2R1, W1R2 and W2R2]
Cannot be done with single-edge flops without double pumping
Use the latches inside the master/slave (MS) flip-flop
Carloni’s relay station belongs to this class

Shorthand notation (clock lines not shown)
[Figure: latch with D, Q, clk and En pins, and its shorthand symbol]

SELF (linear communication)
[Animation: a linear sender-to-receiver pipeline of latches, each stage with Valid (V), Stop (S) and enable (En) control; Data, Valid and Stop tokens advance frame by frame. The frames are annotated with the value pairs 1 1, 1 0, 0 0, 1 1 and 1 0; the signals these values label are not recoverable from the transcript]

The protocol
[Figure: Sender and Receiver connected by Data, Valid and Stop wires]
Idle cycle: Valid = 0
Transfer cycle: Valid = 1 ∧ Stop = 0 (Data = D)
Retry cycle: Valid = 1 ∧ Stop = 1 (Data = D)
Persistency: G [ V ∧ S ∧ (Data=D) ⇒ Next (V ∧ Data=D) ]
Example Data trace from the slide, with cycles labeled Retry and Transfer: * D D * C C C B * A
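
The three cycle types and the persistency rule can be checked mechanically on a recorded channel trace. The sketch below is illustrative only: the names `classify` and `check_persistency`, the `(valid, stop, data)` tuple encoding and the sample trace are assumptions made for this example, not part of the slides.

```python
from typing import List, Optional, Tuple

# One channel sample per clock cycle: (valid, stop, data). Hypothetical encoding.
Cycle = Tuple[bool, bool, Optional[str]]


def classify(valid: bool, stop: bool) -> str:
    """Name the SELF cycle type for one (Valid, Stop) pair."""
    if not valid:
        return "idle"
    return "retry" if stop else "transfer"


def check_persistency(trace: List[Cycle]) -> bool:
    """Return True if every retry cycle is followed by a valid cycle
    that carries the same data (the persistency property)."""
    for (v, s, d), (v_next, _, d_next) in zip(trace, trace[1:]):
        if v and s and not (v_next and d_next == d):
            return False
    return True


# Hypothetical trace: idle, retry on D, transfer of D, idle, transfer of C.
trace = [
    (False, False, None),
    (True, True, "D"),
    (True, False, "D"),
    (False, False, None),
    (True, False, "C"),
]
kinds = [classify(v, s) for v, s, _ in trace]
```

A sender that drops Valid, or changes Data, right after a retry cycle makes `check_persistency` return False.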

Elastic Half Buffer (EHB)
[Figure: a data latch with enable En_i, and control logic connecting V_{i-1}, S_{i-1} on one side to V_i, S_i on the other]

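Join
[Figure: join controller combining input channels (V1, S1) and (V2, S2) into one output channel (V, S) at an EHB]

Lazy Fork
[Figure: fork controller splitting one input channel (V, S) into output channels (V1, S1) and (V2, S2)]

Eager Fork
[Figure: fork controller with the same ports (V, S), (V1, S1) and (V2, S2)]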
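
The gate-level drawings of these controllers did not survive in the transcript. As a rough illustration, the sketch below writes one common formulation of the lazy join and lazy fork control as Boolean functions; the exact equations are an assumption and may differ from the circuits on the slides. The eager fork needs internal state and is not modeled here.

```python
from itertools import product


def lazy_join(v1: bool, v2: bool, s: bool):
    """Two-input join control (assumed equations). Returns (V, S1, S2)."""
    v = v1 and v2          # output is valid only when both inputs are valid
    s1 = s or not v2       # stall input 1 if output is stopped or input 2 is missing
    s2 = s or not v1       # symmetric for input 2
    return v, s1, s2


def lazy_fork(v: bool, s1: bool, s2: bool):
    """Two-output lazy fork control (assumed equations). Returns (V1, V2, S)."""
    v1 = v and not s2      # offer data to branch 1 only if branch 2 is ready
    v2 = v and not s1      # offer data to branch 2 only if branch 1 is ready
    s = s1 or s2           # stall the sender if either branch is stopped
    return v1, v2, s


def transfers(valid: bool, stop: bool) -> bool:
    """A channel transfers a token in a cycle when Valid = 1 and Stop = 0."""
    return valid and not stop


# Exhaustive check: every branch transfers exactly when the merged channel does.
join_ok = True
for v1, v2, s in product([False, True], repeat=3):
    v, s1, s2 = lazy_join(v1, v2, s)
    join_ok &= transfers(v1, s1) == transfers(v, s) == transfers(v2, s2)

fork_ok = True
for v, s1, s2 in product([False, True], repeat=3):
    v1, v2, s = lazy_fork(v, s1, s2)
    fork_ok &= transfers(v1, s1) == transfers(v, s) == transfers(v2, s2)
```

Under these assumed equations, tokens are consumed and produced on all channels of a join or lazy fork in the same cycle, which is the synchronization property the controllers have to guarantee.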

Elastic combinational paths
[Figure: elastic buffers (EB) connected through Wire, Fork, Join and Join/Fork paths]
Enable signal to data latches

Elastic buffer: formal model
[Figure: unbounded buffer with cells …, i, i+1, …, i+k, a read pointer rd and a write pointer wr; inputs Din, Vin, Sin; outputs Dout, Vout, Sout]
Buffer [0..∞]
Initial state: rd = wr = 0
Invariant: wr ≥ rd
Liveness properties (finite unbounded latencies)
Finite forward latency: G (rd ≠ wr ⇒ F Vout)
Finite backward latency: G (¬Sout ⇒ F ¬Sin)
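
To make the rd/wr model concrete, here is a minimal executable sketch. The abstract model on the slide is nondeterministic in its latencies; this sketch fixes one deterministic instance (one cycle of forward latency, Sin never asserted because storage is unbounded). The class name `ElasticBufferModel` and its `step` interface are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


class ElasticBufferModel:
    """Deterministic instance of the abstract buffer: unbounded storage,
    a read pointer rd and a write pointer wr, with rd = wr = 0 initially."""

    def __init__(self) -> None:
        self.cells: List[str] = []
        self.rd = 0
        self.wr = 0

    def invariant(self) -> bool:
        return self.wr >= self.rd

    def step(self, vin: bool, din: Optional[str], sout: bool) -> Tuple[bool, Optional[str]]:
        """Advance one clock cycle and return (Vout, Dout) for this cycle."""
        vout = self.rd != self.wr              # non-empty buffer offers data
        dout = self.cells[self.rd] if vout else None
        if vout and not sout:                  # output transfer
            self.rd += 1
        if vin:                                # input transfer (Sin is always 0 here)
            self.cells.append(din)
            self.wr += 1
        return vout, dout


# Hypothetical stimulus: write a, write b while the receiver stalls, then drain.
buf = ElasticBufferModel()
stimulus = [(True, "a", False), (True, "b", True), (False, None, False), (False, None, False)]
received = []
invariant_held = True
for vin, din, sout in stimulus:
    vout, dout = buf.step(vin, din, sout)
    if vout and not sout:
        received.append(dout)
    invariant_held = invariant_held and buf.invariant()
```

In this run the stalled cycle is a retry on `a`, and the data still leaves the buffer in order.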

Formal verification
[Figure: the abstract buffer model (rd/wr pointers; ports Din, Vin, Sin, Dout, Vout, Sout) set against an implementation with the same ports; the relation symbol was lost in extraction]

Formal verification
The abstract FSM model is appropriate for compositional verification
Verification of implementations with model checking (1-bit abstractions of the datapath)
 –LTL specs + NuSMV
 –Buffer is a refinement of the spec
 –In-order data transmission
 –Correct synchronization of fork/join structures
 –Absence of deadlocks

Formal verification
[Figure sequence: a single abstract model (NFSM) with ports Din, Vin, Sin, Dout, Vout, Sout, set against a chain of two abstract models (NFSM) with the same external ports; the relation symbol was lost in extraction]
Assuming the same initial contents (e.g. empty)

Observational equivalence
Synchronous: D: a b c d e f g h i j k …
Elastic: D: a a b b b c d e e f g g h i i i j k …
En: …
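
The equivalence can be illustrated by projecting the elastic stream onto its enabled cycles and comparing the result with the synchronous stream. The En values did not survive in the transcript, so the sketch below derives a hypothetical enable vector (1 on the last cycle of each run of repeated values); that choice, and the function names, are assumptions for illustration only.

```python
from typing import List


def transfer_stream(data: List[str], enable: List[bool]) -> List[str]:
    """Keep only the values seen in cycles where the enable is 1."""
    return [d for d, en in zip(data, enable) if en]


def last_of_run_enable(data: List[str]) -> List[bool]:
    """Hypothetical enable: 1 on the last cycle of each run of equal values.
    Assumes consecutive distinct tokens carry different values."""
    return [i == len(data) - 1 or data[i] != data[i + 1] for i in range(len(data))]


synchronous = "a b c d e f g h i j k".split()
elastic = "a a b b b c d e e f g g h i i i j k".split()

enable = last_of_run_enable(elastic)
projected = transfer_stream(elastic, enable)
equivalent = projected == synchronous
```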

Elasticization
[Figure: a synchronous design and its elastic counterpart]
[Figure sequence: a clocked (CLK) pipeline with registers PC, IF/ID, ID/EX, EX/MEM and MEM/WB; JOIN and FORK controllers with Valid (V) and Stop (S) signals are added between the stages, shown with example control values]
Elastic control layer
Generation of gated clocks

Variable-latency units
[Figure: a unit taking [0 - k] cycles, with V/S handshake signals and go/done signals]

Variable-latency units
Telescopic units:
 –1 cycle for fast operations
 –2 cycles for slow operations
Examples:
 –Short / long additions (carry propagation)
 –A × 0, A / 1
 –Dynamic changes in latency (fast if cold, slow if hot)
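
A back-of-the-envelope calculation shows when a telescopic unit pays off: the average number of cycles rises slightly, but the clock period no longer has to cover the slow case. All numbers below (90% fast operations, 1.0 ns and 0.7 ns cycle times) are hypothetical values chosen for illustration, not measurements from the slides.

```python
def average_cycles(p_fast: float, fast_cycles: int = 1, slow_cycles: int = 2) -> float:
    """Expected cycles per operation for a telescopic unit."""
    return p_fast * fast_cycles + (1.0 - p_fast) * slow_cycles


def time_per_operation(cycle_time_ns: float, cycles: float) -> float:
    """Average time per operation in nanoseconds."""
    return cycle_time_ns * cycles


# Fixed-latency unit: the clock period must cover the slowest operation.
fixed_ns = time_per_operation(cycle_time_ns=1.0, cycles=1)

# Telescopic unit: shorter clock period, slow operations take a second cycle.
telescopic_ns = time_per_operation(cycle_time_ns=0.7, cycles=average_cycles(p_fast=0.9))
```

With these assumed numbers the telescopic unit averages about 1.1 cycles and about 0.77 ns per operation, against 1.0 ns for the fixed-latency unit.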

Microarchitectural exploration
Bubble insertion + variable-latency units
 –May improve performance: more bubbles, but a shorter cycle time
 –May reduce power: units designed for the most frequent input data
Exploration at fine granularity

Some related work
Asynchronous design
 –Micropipelines (Sutherland)
 –Rings (Williams, Sparso)
 –CHP and slack-elasticity (Martin, Burns, Manohar et al.)
Latency-insensitive design
 –Carloni and a few follow-ups (large overhead)
 –Wire pipelining: Svensson, Nookala, Casu, …
Interlocked pipelines (H. Jacobson et al.)
De-synchronization
 –J. Cortadella et al.
 –V. Varshavsky
Synchronous implementations of CSP
 –J. O’Leary et al.
 –A. Peeters et al.

Summary
SELF: a specific protocol and implementation for elastic systems with very small overhead buffering
Compositional theory proving correctness (Krstic et al., FMCAD’06)
Library of controllers has been designed and their correctness verified
Elasticization CAD in progress
New micro-architectural opportunities based on bubbles and variable-latency units