Synthesis of synchronous elastic architectures. Jordi Cortadella (Universitat Politècnica de Catalunya), Mike Kishinevsky (Intel Corp.), Bill Grundmann (Intel Corp.)
Network of Computing Units In Out B1 B3 B2
Latency-insensitive (elastic) system (In, Out, B1, B2, B3): every block takes a step only when all of its inputs are valid
Why: Scalable; Modular (Plug & Play); Tolerance to variable latency (communication, computation); Not asynchronous (use existing design paradigms and CAD tools)
Outline: The cost of elasticity; SELF: an elastic protocol (basic implementation for linear pipelines, general netlists with forks and joins, formal models and verification); Synthesis of elastic architectures; Related work
Elastic block (figure labels: Data, Valid, Stop, Control, Core, CLK, gated clock). What’s the cost of elasticity?
Communication channel (sender, receiver, Data). Long wires: slow transmission
Pipelined communication (sender, receiver, Data). What if the sender does not always send valid data?
The Valid bit (sender, receiver; Data, Valid). What if the receiver is not always ready?
The Stop bit (sender, receiver; Data, Valid, Stop). Stop provides back-pressure, but propagating it through every stage creates a long combinational path
Carloni’s relay stations (double storage): sender and receiver pearls wrapped in shells, connected through relay stations with main and aux registers and V/S signals. Handshakes with short wires. Double storage required
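A minimal Python sketch of why a Stop signal that is registered (seen by the sender one cycle late) needs two storage slots. This is not Carloni's relay-station circuit: the class `ElasticBuffer`, its `capacity` parameter and the `run` helper are hypothetical names, and the sender is assumed to always have data to offer.

```python
from collections import deque


class ElasticBuffer:
    """FIFO whose stop output is registered (hypothetical model, not a circuit)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = deque()
        self.stop_out = False  # value the sender sees during the current cycle

    def step(self, valid_in, data_in, stop_in):
        """Simulate one clock cycle; return (accepted, delivered item or None)."""
        accepted = valid_in and not self.stop_out
        delivered = None
        if self.slots and not stop_in:
            delivered = self.slots.popleft()
        if accepted:
            self.slots.append(data_in)
        assert len(self.slots) <= self.capacity, "overflow"
        # Stop for the next cycle depends only on state, never on stop_in directly,
        # so there is no combinational path from the receiver back to the sender.
        self.stop_out = len(self.slots) >= self.capacity
        return accepted, delivered


def run(capacity, stalled_cycles, n_cycles):
    """Sender always offers the next integer; receiver stalls in stalled_cycles."""
    buf = ElasticBuffer(capacity)
    received, next_item = [], 0
    for t in range(n_cycles):
        accepted, delivered = buf.step(True, next_item, t in stalled_cycles)
        if accepted:
            next_item += 1
        if delivered is not None:
            received.append(delivered)
    return received


print(len(run(1, set(), 10)))  # single slot
print(len(run(2, set(), 10)))  # double storage
print(run(2, {3}, 10))         # one receiver stall, no data lost
```

Under these assumptions the single-slot buffer delivers 5 items in 10 cycles while the two-slot buffer delivers 9, and a receiver stall delays data without dropping or reordering it.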
Proposal: an elastic protocol. SELF (Synchronous ELastic Flow): simple and provably correct; data-path with no overhead in area, latency or energy; negligible control overhead; fine-grain elasticity
Flip-flops vs. latches (sender, receiver; a 1-cycle FF is a master/slave pair of H and L latches). Flip-flops already have a double storage capability, but using it is not allowed in conventional FF-based design! Let’s make the master/slave latches independent (½ cycle per latch). Only half of the latches (H or L) can move tokens
Elastic buffer: keeps data while stop is in flight (figure labels: W1R1, W2R1, W1R2, W2R2). Cannot be done with single-edge flops without double pumping. Use latches inside the master/slave (MS) pair. Carloni’s relay station belongs to this class
Shorthand notation (clock lines not shown) D Q clk En …
SELF (linear communication): sender and receiver connected through a chain of latches with V, S and En signals on the Data, Valid and Stop wires (animation of token flow; the pair of values shown per frame steps through 1 1, 1 0, 0 0, 1 1, 1 0)
The protocol (Sender, Receiver; Data, Valid, Stop). Idle cycle: Valid = 0. Transfer cycle: Valid = 1, Stop = 0, Data = D. Retry cycle: Valid = 1, Stop = 1, Data = D. Persistency: G [ V ∧ S ∧ (Data = D) ⇒ Next (V ∧ Data = D) ]
The protocol, example trace (Sender, Receiver; Data, Valid, Stop) with Retry and Transfer cycles marked. Data: * D D * C C C B * A (* denotes a cycle with no valid data)
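The three cycle types and the persistency rule can be sketched in a few lines of Python. The function names and the `(valid, stop, data)` tuple encoding are assumptions made for illustration, and the sample trace is only loosely modelled on the slide's A, B, C, D example.

```python
def classify(valid, stop):
    """Name the cycle type for one (Valid, Stop) pair."""
    if not valid:
        return "Idle"
    return "Retry" if stop else "Transfer"


def is_persistent(trace):
    """Check that every retry is followed by a valid cycle with the same data.

    Only consecutive pairs inside the finite trace are checked, so a retry in
    the very last cycle is left unverified.
    """
    for (v, s, d), (v_next, _, d_next) in zip(trace, trace[1:]):
        if v and s and not (v_next and d_next == d):
            return False
    return True


def transferred(trace):
    """Data items that actually cross the channel (Valid and not Stop)."""
    return [d for v, s, d in trace if v and not s]


# Hypothetical trace of (valid, stop, data), one tuple per clock cycle.
trace = [
    (1, 0, "A"), (0, 0, None), (1, 0, "B"),
    (1, 1, "C"), (1, 1, "C"), (1, 0, "C"),
    (0, 0, None), (1, 1, "D"), (1, 0, "D"), (0, 0, None),
]
print([classify(v, s) for v, s, _ in trace])
print(is_persistent(trace), transferred(trace))
```

Repeated data values in the trace are retries; each item is counted once, in the cycle where Stop is finally low.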
Elastic Half Buffer (EHB): data latch with enable En_i; control signals V_i, S_i, V_{i-1}, S_{i-1}
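Join (EHB +): signals V1, V2, S1, S2 on the input channels and V, S on the output channel
Lazy Fork: signals V, S on the input channel and V1, V2, S1, S2 on the output channels
Eager Fork: signals V, S on the input channel and V1, V2, S1, S2 on the output channels (^ ^ in the figure)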
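The gate-level circuits on these slides are not recoverable from the text, so the following Python sketch uses one common textbook formulation of a join and a lazy fork. The equations are an assumption, not necessarily the controllers shown in the figures, and the stateful eager fork is not modelled. The check confirms that a transfer happens on all channels of a controller in the same cycle, or on none.

```python
from itertools import product


def join(v1, v2, s):
    """Assumed join: output valid when both inputs are valid."""
    v = v1 and v2
    s1 = s or not v2  # hold input 1 if the output is stopped or input 2 is missing
    s2 = s or not v1
    return v, s1, s2


def lazy_fork(v, s1, s2):
    """Assumed lazy fork: a branch sees valid only if the other branch is ready."""
    v1 = v and not s2
    v2 = v and not s1
    s = s1 or s2
    return v1, v2, s


def transfer(valid, stop):
    return valid and not stop


def join_is_synchronized():
    """Exhaustively check that both join inputs transfer exactly when the output does."""
    for v1, v2, s in product([False, True], repeat=3):
        v, s1, s2 = join(v1, v2, s)
        out = transfer(v, s)
        if transfer(v1, s1) != out or transfer(v2, s2) != out:
            return False
    return True


def fork_is_synchronized():
    """Exhaustively check that both fork outputs transfer exactly when the input does."""
    for v, s1, s2 in product([False, True], repeat=3):
        v1, v2, s = lazy_fork(v, s1, s2)
        inp = transfer(v, s)
        if transfer(v1, s1) != inp or transfer(v2, s2) != inp:
            return False
    return True


print(join_is_synchronized(), fork_is_synchronized())
```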
Elastic combinational paths: Fork, Join, Join/Fork and Wire between elastic buffers (EB). Enable signal to data latches
Elastic buffer: formal model (cells … i, i+1, …, i+k; pointers rd and wr; ports Din, Vin, Sin and Dout, Vout, Sout). Buffer [0..∞]. Initial state: rd = wr = 0. Invariant: wr ≥ rd
Elastic buffer: formal model. Liveness properties (finite unbounded latencies). Finite forward latency: G (rd ≠ wr ⇒ F Vout). Finite backward latency: G (¬Sout ⇒ F ¬Sin)
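A small executable reading of the abstract model, as a sketch only. The class name `AbstractElasticBuffer` is hypothetical, and the model's nondeterministic choices (when to raise Vout, when to raise Sin) are passed in by the caller rather than chosen internally; liveness is therefore not enforced here, only the safety invariant.

```python
class AbstractElasticBuffer:
    """Unbounded FIFO with read and write pointers (illustrative model)."""

    def __init__(self):
        self.cells = {}  # index -> data, stands in for an unbounded array
        self.rd = 0
        self.wr = 0

    def step(self, vin, din, sin, vout, sout):
        """One cycle. sin and vout are the buffer's own choices, fixed by the caller.

        Returns Dout (None when Vout is low).
        """
        if vout and self.rd == self.wr:
            raise ValueError("Vout cannot be asserted on an empty buffer")
        dout = self.cells[self.rd] if vout else None
        if vout and not sout:       # transfer on the output channel
            del self.cells[self.rd]
            self.rd += 1
        if vin and not sin:         # transfer on the input channel
            self.cells[self.wr] = din
            self.wr += 1
        assert self.wr >= self.rd   # invariant from the slide
        return dout


buf = AbstractElasticBuffer()
outputs = [
    buf.step(True, "a", False, False, False),  # write a
    buf.step(True, "b", False, True, True),    # write b, output a is retried
    buf.step(False, None, False, True, False), # a transferred
    buf.step(False, None, False, True, False), # b transferred
]
print(outputs, buf.rd, buf.wr)
```

Reading is evaluated before writing within a cycle, so an item written in one cycle can be output in a later cycle at the earliest.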
Formal verification: the implementation (ports Din, Vin, Sin, Dout, Vout, Sout) is compared against the abstract buffer model (cells … i, i+1, …, i+k; pointers rd and wr)
Formal verification. The abstract FSM model is appropriate for compositional verification. Verification of implementations with model checking (1-bit abstractions of the datapath): –LTL specs + NuSMV –Buffer is a refinement of the spec –In-order data transmission –Correct synchronization of fork/join structures –Absence of deadlocks
Formal verification (figure: three blocks with ports Din, Vin, Sin, Dout, Vout, Sout, each an abstract model (NFSM)). Assuming the same initial contents (e.g. empty)
Observational equivalence. Synchronous: D: a b c d e f g h i j k … Elastic: D: a a b b b c d e e f g g h i i i j k … En: …
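Observational equivalence can be illustrated by projecting the elastic stream onto the cycles in which an enable is high. The enable pattern on the slide is elided, so the one below is a hypothetical choice (enabled on the last cycle of each repeated run); `project` is likewise an illustrative name.

```python
def project(elastic_data, enable):
    """Keep only the data values observed in enabled cycles."""
    return [d for d, en in zip(elastic_data, enable) if en]


synchronous = list("abcdefghijk")
elastic = list("aabbbcdeefgghiiijk")

# Hypothetical enable stream: high on the last cycle of each run of equal values.
# In a real design the enable comes from the Valid/Stop control, not from the data.
enable = [
    i + 1 == len(elastic) or elastic[i] != elastic[i + 1]
    for i in range(len(elastic))
]

print(project(elastic, enable) == synchronous)
```

Dropping consecutive duplicates would not be a safe substitute for the enable stream, because a synchronous stream may legitimately contain the same value twice in a row.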
Elasticization: from synchronous to elastic (figure: a pipeline clocked by CLK with registers PC, IF/ID, ID/EX, EX/MEM and MEM/WB; the elastic version adds JOIN and FORK controllers with V/S handshake signals)
Elastic control layer: generation of gated clocks (CLK)
Variable-latency units: [0 - k] cycles; V/S handshake with go and done signals
Variable-latency units. Telescopic units: –1 cycle for fast operations –2 cycles for slow operations. Examples: –Short / long additions (carry propagation) –A × 0, A / 1 –Dynamic changes in latency (fast if cold, slow if hot)
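A sketch of the short/long addition example in Python. The latency rule is an assumption: it uses the longest run of propagate bits (`a XOR b`) as a conservative estimate of carry-chain length, and the `fast_limit` threshold and 16-bit width are hypothetical values, not taken from the slides.

```python
def longest_propagate_run(a, b, width=16):
    """Longest run of consecutive bit positions where a carry would propagate."""
    p = (a ^ b) & ((1 << width) - 1)
    longest = run = 0
    for i in range(width):
        if (p >> i) & 1:
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest


def telescopic_add(a, b, width=16, fast_limit=4):
    """Return (sum modulo 2**width, latency in cycles).

    Latency is 1 cycle when the estimated carry chain is short, else 2 cycles.
    """
    latency = 1 if longest_propagate_run(a, b, width) <= fast_limit else 2
    return (a + b) & ((1 << width) - 1), latency


print(telescopic_add(3, 4))            # short chain
print(telescopic_add(0x00FF, 0x0001))  # long chain
```

In an elastic design the 2-cycle case would simply hold `done` low for one extra cycle, and the surrounding Valid/Stop control absorbs the delay.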
Microarchitectural exploration. Bubble insertion + variable-latency units: –May improve performance (more bubbles, but reduced cycle time) –May reduce power (units designed for the most frequent input data). Exploration at fine granularity
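The trade-off between extra cycles and a shorter cycle time can be worked through with simple arithmetic. All numbers below are hypothetical, and the model assumes the variable-latency unit is the only bottleneck (no back-pressure or loop effects).

```python
import math


def time_per_token(cycle_time_ns, latency_distribution):
    """Average time per token; latency_distribution maps cycles -> probability."""
    assert math.isclose(sum(latency_distribution.values()), 1.0)
    avg_cycles = sum(c * p for c, p in latency_distribution.items())
    return cycle_time_ns * avg_cycles


# Baseline: every operation takes one 1.0 ns cycle.
baseline = time_per_token(1.0, {1: 1.0})
# Telescopic unit: 0.8 ns cycle, 90% of operations fast, 10% need a second cycle.
telescopic = time_per_token(0.8, {1: 0.9, 2: 0.1})
# Same cycle time, but half of the operations are slow.
unfavourable = time_per_token(0.8, {1: 0.5, 2: 0.5})

print(baseline, telescopic, unfavourable)
```

With these illustrative figures the telescopic unit wins when slow operations are rare (about 0.88 ns per token against 1.0 ns) and loses when they are frequent (about 1.2 ns), which is why the gain is stated as "may improve".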
Some related work. Asynchronous design: –Micropipelines (Sutherland) –Rings (Williams, Sparsø) –CHP and slack-elasticity (Martin, Burns, Manohar et al.). Latency-insensitive design: –Carloni and a few follow-ups (large overhead) –Wire pipelining: Svensson, Nookala, Casu, … Interlocked pipelines (H. Jacobson et al.). De-synchronization: –J. Cortadella et al. –V. Varshavsky. Synchronous implementations of CSP: –J. O’Leary et al. –A. Peeters et al.
Summary. SELF: a specific protocol and implementation for elastic systems with very small buffering overhead. Compositional theory proving correctness (Krstic et al., FMCAD’06). A library of controllers has been designed and verified. Elasticization CAD in progress. New micro-architectural opportunities based on bubbles and variable-latency units