Lecture 11 MOUSETRAP: Ultra-High-Speed Transition-Signaling Asynchronous Pipelines
2 MOUSETRAP Pipelines
Simple asynchronous implementation style, uses:
- transparent D-latches
- simple control: 1 gate per pipeline stage
Target = static logic blocks
“MOUSETRAP”: uses a “capture protocol”. Latches:
- are normally transparent: before new data arrives
- become opaque: after data arrives (“capture” data)
Control signaling: transition signaling = 2-phase
- simple protocol: req/ack = only 2 events per handshake (not 4)
- no “return-to-zero”
- each transition (up/down) signals a distinct operation
Our goal: very fast cycle time
- simple inter-stage communication
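The 2-phase protocol above can be sketched in a few lines of Python. This is only an illustration of transition signaling, not circuit code from the lecture; the class name `TwoPhaseChannel` and its methods are hypothetical.

```python
class TwoPhaseChannel:
    """Toy model of a transition-signaling (2-phase) req/ack channel.

    Hypothetical helper for illustration only: an item is pending whenever
    req and ack differ, and neither wire ever returns to zero on purpose.
    """

    def __init__(self):
        self.req = 0
        self.ack = 0
        self.events = 0  # total wire transitions so far

    def pending(self):
        return self.req != self.ack

    def send(self):
        assert not self.pending(), "previous item not yet acknowledged"
        self.req ^= 1          # one transition (up OR down) = one request
        self.events += 1

    def acknowledge(self):
        assert self.pending(), "nothing to acknowledge"
        self.ack ^= 1          # one transition = one acknowledge
        self.events += 1


ch = TwoPhaseChannel()
for _ in range(3):
    ch.send()
    ch.acknowledge()
print(ch.events)  # 6: two events per handshake, for three handshakes
```

A 4-phase (return-to-zero) channel would need four transitions per item (req up, ack up, req down, ack down), so the same three transfers would cost 12 events.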
3 MOUSETRAP: A Basic FIFO
Stages communicate using transition signaling: 1 transition per data item!
[Figure: Stages N-1, N, N+1. Stage N contains a data latch (Data in, Data out) and a latch controller that drives En. Signals: req N, ack N-1, done N, req N+1, ack N. Animation: 1st data item, then 2nd data item, flowing through the pipeline.]
4 MOUSETRAP: A Basic FIFO (contd.)
Latch controller (XNOR) acts as “phase converter”:
- 2 distinct transitions (up or down) are converted into a pulsed latch enable
- 2 transitions per latch cycle
Latch is disabled when current stage is “done”
Latch is re-enabled when next stage is “done”
[Figure: Stages N-1, N, N+1. Stage N contains a data latch (Data in, Data out) and a latch controller that drives En. Signals: req N, ack N-1, done N, req N+1, ack N.]
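The capture behaviour described above can be tried out with a small unit-delay simulation. This is a behavioural sketch under simplifying assumptions (every latch takes one step, the XNOR is evaluated on the current state, and the sender holds its data until acknowledged); the class `MousetrapFifo` and its method names are hypothetical, not part of the lecture.

```python
def xnor(a, b):
    """Latch controller: 1 (latch transparent) when done and ack match."""
    return int(a == b)


class MousetrapFifo:
    def __init__(self, n_stages):
        self.done = [0] * n_stages     # done N = req N+1 = ack N-1
        self.data = [None] * n_stages  # value at each latch output
        self.req_in, self.data_in = 0, None  # left environment
        self.ack_out = 0                      # right environment

    def enables(self):
        acks = self.done[1:] + [self.ack_out]
        return [xnor(d, a) for d, a in zip(self.done, acks)]

    def step(self):
        """One simultaneous update of all stages; True if anything changed."""
        en = self.enables()
        reqs = [self.req_in] + self.done[:-1]
        datas = [self.data_in] + self.data[:-1]
        new_done, new_data = self.done[:], self.data[:]
        for i, transparent in enumerate(en):
            if transparent:            # opaque latches simply hold
                new_done[i], new_data[i] = reqs[i], datas[i]
        changed = (new_done, new_data) != (self.done, self.data)
        self.done, self.data = new_done, new_data
        return changed

    def settle(self):
        while self.step():
            pass

    def send(self, item):
        assert self.req_in == self.done[0], "stage 0 has not acknowledged"
        self.data_in = item
        self.req_in ^= 1               # one transition per data item
        self.settle()

    def receive(self):
        assert self.done[-1] != self.ack_out, "no item pending at the output"
        item = self.data[-1]
        self.ack_out ^= 1              # right environment acknowledges
        self.settle()
        return item


fifo = MousetrapFifo(3)
for item in "ABC":
    fifo.send(item)                    # right environment is stalled
print(fifo.data)      # ['C', 'B', 'A']: each stage has captured one item
print(fifo.enables()) # [0, 0, 0]: all latches opaque while congested
print([fifo.receive() for _ in range(3)])  # ['A', 'B', 'C']
```

With the right environment stalled, each stage disables its own latch as soon as it is “done”, so nothing is overwritten; once the items are drained, all enables return to 1 and the FIFO is transparent again.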
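5 MOUSETRAP: FIFO Cycle Time
Cycle Time = sum of the three numbered steps in the figure:
1. N computes
2. N+1 computes
3. N re-enabled to compute
Fast self-loop: N disables itself
[Figure: Stages N-1, N, N+1. Stage N contains a data latch (Data in, Data out) and a latch controller that drives En. Signals: req N, ack N-1, done N, req N+1, ack N. Steps 1 to 3 and the self-loop are marked on the diagram.]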
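The three-step cycle can be written as a simple sum. The sketch below is an assumption-laden reading of the numbered steps, not the formula from the original slide (which did not survive extraction); the parameter names `t_latch_n`, `t_latch_n1` and `t_xnor_up` are hypothetical.

```python
def fifo_cycle_time(t_latch_n, t_latch_n1, t_xnor_up):
    """Estimate the FIFO cycle time as the sum of the three steps.

    All arguments are delays in the same unit (assumed names):
      t_latch_n  : data passes stage N's transparent latch   (step 1)
      t_latch_n1 : data passes stage N+1's latch, which then
                   signals "done" back to stage N            (step 2)
      t_xnor_up  : stage N's XNOR output rises, re-enabling
                   stage N's latch                           (step 3)
    """
    return t_latch_n + t_latch_n1 + t_xnor_up


# Illustrative, made-up delays in arbitrary units:
print(fifo_cycle_time(2, 2, 1))  # 5
```

The self-loop (N disabling its own latch) does not appear in the sum: it runs in parallel with step 2, which is why it only needs to be fast rather than contributing to the cycle.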
6 Detailed Controller Operation
One pulse per data item flowing through:
- down transition: caused by “done” of N
- up transition: caused by “done” of N+1
No minimum pulse width constraint!
- simply, down transition should start “early enough”
- can be “negative width” (no pulse!)
[Figure: Stage N’s latch controller, with inputs “done from N” and “ack from N+1” and its output going to the latch.]
7 MOUSETRAP: Pipeline With Logic
Simple extension to FIFO: insert logic block + matching delay in each stage
Logic blocks: can use standard single-rail (non-hazard-free)
“Bundled Data” requirement:
- each “req” must arrive after data inputs are valid and stable
[Figure: Stages N-1, N, N+1, each with a data latch, latch controller, logic block and matching delay. Signals: req N, ack N-1, done N, req N+1, ack N.]
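The bundled-data requirement amounts to a one-line comparison between the request path and the slowest data path. The helpers below are a minimal sketch; the function names and the `setup` margin are assumptions introduced for illustration.

```python
def matched_delay(data_path_delays, margin=0.0):
    """Smallest req delay that covers the worst-case logic path plus a margin."""
    if not data_path_delays:
        raise ValueError("need at least one data path delay")
    return max(data_path_delays) + margin


def bundling_ok(data_path_delays, req_delay, setup=0.0):
    """True if req arrives no earlier than the slowest data bit plus setup."""
    return req_delay >= max(data_path_delays) + setup


# Made-up per-bit logic delays, arbitrary units:
delays = [3, 5, 4]
print(matched_delay(delays, margin=1))   # 6
print(bundling_ok(delays, req_delay=6))  # True
print(bundling_ok(delays, req_delay=4))  # False: req would beat the data
```

Because only the worst-case delay matters, the logic block itself may glitch freely before it settles, which is why non-hazard-free single-rail logic is acceptable.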
8 Special Case: Using “Clocked Logic”
Clocked-CMOS = C²MOS: eliminate explicit latches
- latch folded into logic itself
[Figure: a general C²MOS gate. Logic inputs drive a pull-up network and a pull-down network, gated by the enable pair (En, En’), with a “keeper” on the logic output.]
[Figure: C²MOS AND-gate with inputs A and B, enable pair (En, En’), and a “keeper” on the logic output.]
9 Gate-Level MOUSETRAP: with C²MOS
Use C²MOS: eliminate explicit latches
New control optimization = “Dual-Rail XNOR”
- eliminate 2 inverters from critical path
[Figure: Stages N-1 and N built from C²MOS logic. The latch controller produces (En, En’) from the dual-rail pairs (done, done’) and (ack, ack’), using a pair of bit latches. Signals: req N, ack N-1, done N, req N+1, ack N.]
10 Complex Pipelining: Forks & Joins
Problems with linear pipelining:
- handles limited applications; real systems are more complex
Non-linear pipelining: has forks/joins
Contribution: introduce efficient circuit structures
- Forks: distribute data + control to multiple destinations
- Joins: merge data + control from multiple sources
→ Enabling technology for building complex async systems
[Figure: a fork and a join.]
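11 Forks and Joins: Implementation
Join: merge multiple requests
Fork: merge multiple acknowledges
[Figure: join = Stage N with an element labeled “C” combining req1 and req2, and a single ack; fork = Stage N with an element labeled “C” combining ack1 and ack2, and a single req.]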
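Assuming the element labeled “C” in the figure is a Muller C-element (an assumption; the extracted text does not say so), its merging behaviour can be sketched as follows. The class name `CElement` is hypothetical.

```python
class CElement:
    """Behavioural model of a Muller C-element (assumed meaning of "C").

    The output switches to 1 only when all inputs are 1, switches to 0 only
    when all inputs are 0, and otherwise holds its previous value.
    """

    def __init__(self, initial=0):
        self.out = initial

    def update(self, *inputs):
        if all(v == 1 for v in inputs):
            self.out = 1
        elif all(v == 0 for v in inputs):
            self.out = 0
        return self.out


# Fork: stage N sees an acknowledge only after BOTH successors have toggled.
ack_merge = CElement()
print(ack_merge.update(1, 0))  # 0: only ack1 has toggled, keep waiting
print(ack_merge.update(1, 1))  # 1: both acknowledged
print(ack_merge.update(0, 1))  # 1: next item, only ack1 has toggled back
print(ack_merge.update(0, 0))  # 0: both acknowledged again
```

The same element merges req1 and req2 for a join: with transition signaling, the merged request toggles once per item, only after every source has delivered its own transition.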
12 Related Protocols
Day/Woods (’97), and Charlie Boxes (’00)
Similarities: all use:
- transition signaling for handshakes
- phase conversion for latch signals
Differences: MOUSETRAP has:
- higher throughput
- ability to handle fork/join datapaths
- more aggressive timing, less insensitivity to delays
13 Performance, Timing and Optimization
MOUSETRAP with Logic:
- Cycle Time =
- Stage Latency =
MOUSETRAP Using C²MOS Gates:
- Cycle Time =
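The right-hand sides of these expressions are not present in the extracted text. One plausible reconstruction, obtained by extending the three-step FIFO cycle (N computes, N+1 computes, N is re-enabled) with a logic block per stage, is sketched below. The symbols $t_{LT}$ (latch delay), $t_{logic}$ (logic block delay), $t_{C^2MOS}$ (delay of a C²MOS gate) and $t_{XNOR\uparrow}$ (rising delay of the XNOR) are assumed names, and the formulas should be read as an approximation rather than as the slide's own equations.

```latex
\begin{align}
  T_{\text{logic}}      &\approx 2\,t_{LT} + t_{logic} + t_{XNOR\uparrow}
      && \text{(cycle time, MOUSETRAP with logic)} \\
  L_{\text{logic}}      &\approx t_{LT} + t_{logic}
      && \text{(stage latency)} \\
  T_{C^2MOS}            &\approx 2\,t_{C^2MOS} + t_{XNOR\uparrow}
      && \text{(cycle time, C$^2$MOS gates)}
\end{align}
```

Under this reading, the C²MOS case follows from the first line by folding latch and logic into a single gate delay, so each stage contributes one $t_{C^2MOS}$ instead of $t_{LT} + t_{logic}$.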
14 Timing Analysis
Main timing constraint: avoid “data overrun”
Data must be safely “captured” by Stage N before new inputs arrive from Stage N-1
- simple 1-sided timing constraint: fast latch disable
- Stage N’s “self-loop” faster than entire path through previous stage
[Figure: Stages N-1 and N, each with a data latch, latch controller, logic block and matching delay. Signals: req N, ack N-1, done N, req N+1, ack N.]
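The one-sided constraint can be checked with a single inequality. The sketch below assumes the two path delays are already known (for example from simulation); the function and parameter names, including the optional `t_hold` latch hold time, are hypothetical.

```python
def overrun_slack(t_self_loop, t_prev_stage_path, t_hold=0.0):
    """Slack of the data-overrun constraint (positive means safe).

    t_self_loop       : time for stage N to disable its own latch after
                        it is "done" (the fast self-loop)
    t_prev_stage_path : time for new data to travel through the entire
                        previous stage and reach stage N's latch
    t_hold            : assumed hold time of stage N's latch
    """
    return t_prev_stage_path - (t_self_loop + t_hold)


def no_data_overrun(t_self_loop, t_prev_stage_path, t_hold=0.0):
    """True if stage N closes its latch before new inputs can arrive."""
    return overrun_slack(t_self_loop, t_prev_stage_path, t_hold) > 0


# Made-up delays, arbitrary units:
print(no_data_overrun(t_self_loop=1, t_prev_stage_path=4))  # True
print(no_data_overrun(t_self_loop=5, t_prev_stage_path=4))  # False
```

The constraint is one-sided because only an upper bound on the self-loop matters: making the latch disable faster, or the previous stage slower, can never cause a violation.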
15 Timing Optimization: Reducing Cycle Time
Analytical Cycle Time =
Goal: shorten (in steady-state operation)
- steady-state = no undue pipeline congestion
Observation:
- XNOR switches twice per data item
- only the 2nd (up) transition is critical for performance
Solution: reduce XNOR output swing
- degrade “slew” for start of pulse
- allows quick pulse completion: faster rise time
Still safe when congested: pulse starts on time
- pulse maintained until congestion clears
16 Timing Optimization (contd.)
[Figure: waveforms of the “unoptimized” and “optimized” XNOR output. Events: N “done” (N’s latch disabled), then N+1 “done” (N’s latch re-enabled).]
With the optimized output, the latch is only partly disabled; recovers quicker! (no pulse width requirement)
17 Comparison with Wave Pipelining
Two scenarios:
- Steady state: both MOUSETRAP and wave pipelines act like transparent “flow through” combinational pipelines
- Congestion: properly handled in MOUSETRAP
  - right environment stalls: each MOUSETRAP stage safely captures data
  - internal stage slow: MOUSETRAP stages to its left safely capture data
Conclusion: MOUSETRAP has potential of:
- speed of wave pipelining
- greater robustness and flexibility
18 Timing Issues: Handling Wide Datapaths
Buffers inserted to amplify latch signals (En)
Reducing impact of buffers:
- control uses unbuffered signals: buffer delay off of critical path!
- datapath skewed w.r.t. control
Timing assumption: buffer delays roughly equal
[Figure: Stages N-1 and N with buffered En. Signals: req N, done N, req N+1.]
19 Preliminary Results
Pre-layout simulations of FIFOs:
- do not account for wire delays, parasitics, etc.
- careful transistor sizing/verification of timing constraints
20 Conclusions and Future Work
Introduced a new asynchronous pipeline style:
- Static logic blocks
- Simple latches and control:
  - transparent latches, or C²MOS gates
  - single gate control = 1 XNOR gate/stage
- Highly concurrent event-driven protocol
- High throughputs obtained:
  - 3.5 GHz in 0.25 µm, 1.9 GHz in 0.6 µm
  - comparable to wave pipelines; yet more robust/less design effort
- Correctly handle forks and joins in datapaths
- Timing constraints: local, 1-sided, easily met
Ongoing work:
- more realistic performance measurement (incl. parasitics)
- layout and fabrication