1 Clockless Computing Montek Singh Thu, Sep 13, 2007.

2 Dynamic Logic Pipelines (contd.)
  - Drawbacks of Williams’ PS0 Pipelines
  - Lookahead Pipelines [Singh/Nowick 2000]
  - High-Capacity Pipelines [Singh/Nowick 2000]

3 Drawbacks of PS0 Pipelining
  1. Poor throughput:
     - long cycle time: 6 events per cycle
     - data “tokens” are forced far apart in time
  2. Limited storage capacity:
     - at most 50% of stages can hold distinct tokens
     - data tokens must be separated by at least one spacer
  My research goals have been:
     - address both issues
     - still maintain very low latency

4 Recent Approaches
  3 novel styles for high-speed async pipelining:
     - MOUSETRAP Pipelines [Singh/Nowick, TAU-00, ICCD-01]
     - “Lookahead Pipelines” (LP) [Singh/Nowick, Async-00]
     - “High-Capacity Pipelines” (HC) [Singh/Nowick, WVLSI-00]
  Goal: significantly improve throughput of PS0
  Two distinct strategies:
     - LP: introduce protocol optimizations
       ⇒ “shave off” components from critical cycle
     - HC: fundamentally new protocol
       ⇒ greater concurrency: “loosely-coupled” stages

5 Outline
  → New Asynchronous Pipelines:
     - MOUSETRAP Pipelines (static circuit style)
     → Lookahead Pipelines (LP) (dynamic circuit style)
     - High-Capacity Pipelines (HC) (dynamic circuit style)

6 Lookahead Pipeline Styles
  Singh and Nowick, Async-2000 [Best Paper Award]

7 Lookahead Pipelines: Strategy #1
  Use non-neighbor communication:
     - stage receives information from multiple later stages
     - allows “early evaluation”
  Benefit: stage gets head-start on next cycle

8 Lookahead Pipelines: Strategy #2
  Use early completion detection:
     - completion detector moved before stage (not after)
     - stage indicates “early done” in parallel with computation
  Benefit: again, stage gets head-start on next cycle
  [Figure: stage with early completion detector]

9 Lookahead Pipelines: Overview
  5 new designs:
  → “Dual-Rail” Data Signaling:
     - LP3/1: “early evaluation”
     - LP2/2: “early done”
     - LP2/1: “early evaluation” + “early done”
  - “Single-Rail” Bundled-Data Signaling:
     - LP SR 2/2: “early done”
     - LP SR 2/1: “early evaluation” + “early done”

10 Dual-Rail Design #1: LP3/1
  Optimization = “early evaluation”
     - each stage has two control inputs: from stages N+1 and N+2
  Idea: shorten precharge phase
     - terminate precharge early: when N+2 is done evaluating
  [Figure: stages N, N+1, N+2, each a processing block with a completion detector; data in / data out; stage N receives PC and, from N+2, Eval]

11 LP3/1 Protocol
  PRECHARGE N: when N+1 completes evaluation
  EVALUATE N: when N+2 completes evaluation (new! enables “early evaluation”)
  [Figure: event sequence over stages N, N+1, N+2]
     1. N evaluates
     2. N+1 evaluates
     3. N+2 evaluates; N+1 indicates “done”
     4. N+2 indicates “done”

12 LP3/1: Comparison with PS0
  PS0: 6 events in cycle
     - PRECHARGE N: when N+1 completes evaluation
     - EVALUATE N: when N+1 completes precharging
  LP3/1: only 4 events in cycle!
     - PRECHARGE N: when N+1 completes evaluation
     - EVALUATE N: when N+2 completes evaluation (enables “early evaluation”)
  [Figure: side-by-side event diagrams over stages N, N+1, N+2; in both, events 1 to 3 are the successive stage evaluations]

13 LP3/1 Performance
  Savings over PS0: 1 Precharge + 1 Completion Detection
  [Figure: event diagram, events 1 to 4, showing the cycle-time expression and the saved path]
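The event counts quoted on the comparison slide suggest a back-of-the-envelope form for the two cycle times. This is a sketch rather than the slide's own equation: the symbols t_eval, t_prech and t_CD are names assumed here, each event is treated as a single delay, and the overhead of control gates (such as the NAND that merges PC and Eval) is ignored.

```latex
\begin{align*}
T_{\mathrm{PS0}}   &\approx 3\,t_{\mathrm{eval}} + 2\,t_{\mathrm{CD}} + t_{\mathrm{prech}}
                   && \text{(6 events)} \\
T_{\mathrm{LP3/1}} &\approx 3\,t_{\mathrm{eval}} + t_{\mathrm{CD}}
                   && \text{(4 events)} \\
T_{\mathrm{PS0}} - T_{\mathrm{LP3/1}} &\approx t_{\mathrm{prech}} + t_{\mathrm{CD}}
                   && \text{(1 precharge + 1 completion detection)}
\end{align*}
```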

14 LP3/1: Inside a Stage
  Merging 2 control inputs:
     - PC (from stage N+1) and Eval (from stage N+2) are merged by a NAND gate
  Timing issues:
     - must satisfy several simple constraints
     - Ex.: PC must arrive before Eval is de-asserted
        - 1-sided timing requirement
        - easily satisfied in practice
  [Figure: NAND gate merging PC and Eval; labels “early Eval” and “old Eval”]
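One way to read the merged control is as a small Boolean model. The sketch below is an assumption-laden illustration, not the slide's schematic: it assumes `pc_req` is high once stage N+1 has finished evaluating, `eval_req` is high once stage N+2 has finished evaluating, and the stage precharges only while the first is high and the second is not. All names are hypothetical.

```python
from typing import List, Tuple


def precharging(pc_req: bool, eval_req: bool) -> bool:
    """True while the stage precharges, False while it may evaluate.

    Assumed polarity: NAND(pc_req, NOT eval_req) drives an active-low
    precharge control, so a low NAND output means "precharge".
    """
    nand_out = not (pc_req and not eval_req)
    return not nand_out


def trace(sequence: List[Tuple[bool, bool]]) -> List[str]:
    """Map a sequence of (pc_req, eval_req) values to stage phases."""
    return ["precharge" if precharging(pc, ev) else "evaluate"
            for pc, ev in sequence]


# PC is withdrawn before Eval is de-asserted: the stage stays in evaluate.
SAFE = [(False, False), (True, False), (True, True), (False, True), (False, False)]

# Eval is de-asserted while PC is still high: a spurious second precharge.
HAZARD = [(False, False), (True, False), (True, True), (True, False)]

if __name__ == "__main__":
    print(trace(SAFE))
    print(trace(HAZARD))
```

Under these assumptions the second trace shows why the constraint is one-sided: only the ordering of the PC change relative to the falling Eval matters.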

15 Dual-Rail Design #2: LP2/2
  Optimization = “early done”
  Idea: move completion detector before processing block
     ⇒ stage indicates when “about to” precharge/evaluate
  [Figure: “early” completion detector placed ahead of the processing block; data in / data out; “early done” output]

16 LP2/2 Completion Detector
  Modified completion detectors needed:
     - Done = 1 when stage starts evaluating, and inputs valid
     - Done = 0 when stage starts precharging
     ⇒ asymmetric C-element
  [Figure: asymmetric C-element (C) producing Done; its inputs are PC and, on “+” inputs, the OR of the rails of bit 0, bit 1, ..., bit n]
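The set/reset asymmetry described above can be illustrated with a behavioral model. This is a minimal sketch, not a gate-level design: the class name, the `evaluating` flag (standing in for the stage's precharge/evaluate control) and the tuple encoding of dual-rail bits are all assumptions made for the example.

```python
from typing import Sequence, Tuple

DualRailBit = Tuple[int, int]  # (rail0, rail1); (0, 0) means "no data yet"


class EarlyDoneDetector:
    """Behavioral model of an asymmetric C-element used as completion detector."""

    def __init__(self) -> None:
        self.done = False

    def update(self, evaluating: bool, inputs: Sequence[DualRailBit]) -> bool:
        # A dual-rail bit is valid once either of its rails is high.
        all_valid = all(r0 or r1 for r0, r1 in inputs)
        if not evaluating:
            # Reset depends on the control input alone (the asymmetry).
            self.done = False
        elif all_valid:
            # Set requires the control input AND every bit to be valid.
            self.done = True
        # Otherwise the element holds its previous value.
        return self.done


if __name__ == "__main__":
    det = EarlyDoneDetector()
    print(det.update(True, [(0, 0), (0, 1)]))   # one bit still empty
    print(det.update(True, [(1, 0), (0, 1)]))   # all bits valid
    print(det.update(True, [(0, 0), (0, 0)]))   # inputs withdrawn, value held
    print(det.update(False, [(1, 0), (0, 1)]))  # precharge resets regardless
```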

17 LP2/2 Protocol
  Completion detection: performed in parallel with evaluation/precharge of stage
  [Figure: event sequence over stages N, N+1, N+2, events 1 to 4; labels: N evaluates; N+1 evaluates; “early done” of N+1 eval; “early done” of N+2 eval; “early done” of N+1 prech]

18 LP2/2 Performance
  LP2/2 savings over PS0: 1 Evaluation + 1 Precharge
  [Figure: event diagram, events 1 to 4, showing the cycle-time expression]

19 Dual-Rail Design #3: LP2/1
  Hybrid of LP3/1 and LP2/2. Combines:
     - early evaluation of LP3/1
     - early done of LP2/2
  [Figure: cycle-time expression]
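The savings quoted for each dual-rail style can be turned into a small calculator. This is a sketch under stated assumptions: PS0 is modeled as 3 evaluations + 2 completion detections + 1 precharge (six events), LP3/1 and LP2/2 subtract the savings given on their performance slides, and the LP2/1 entry is an assumption that follows the apparent naming pattern (x/y = x evaluations + y completion detections), which the slides do not state. Control-gate overheads are ignored, and the delay values in the usage line are arbitrary placeholders, not measurements.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Delays:
    """Per-event delays in arbitrary time units (placeholder values only)."""
    t_eval: float
    t_prech: float
    t_cd: float


def cycle_times(d: Delays) -> Dict[str, float]:
    """Approximate cycle time per pipeline style, ignoring control-gate delays."""
    ps0 = 3 * d.t_eval + 2 * d.t_cd + d.t_prech
    return {
        "PS0": ps0,
        # Saves 1 precharge + 1 completion detection.
        "LP3/1": ps0 - (d.t_prech + d.t_cd),
        # Saves 1 evaluation + 1 precharge.
        "LP2/2": ps0 - (d.t_eval + d.t_prech),
        # ASSUMPTION: 2 evaluations + 1 completion detection, from the name.
        "LP2/1": 2 * d.t_eval + d.t_cd,
    }


if __name__ == "__main__":
    for style, t in cycle_times(Delays(t_eval=1.0, t_prech=1.0, t_cd=1.0)).items():
        print(f"{style}: {t}")
```

With every event set to one unit, the results simply count events per cycle, which makes it easy to compare against the "6 events" and "4 events" figures on the comparison slide.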

20 Lookahead Pipelines: Overview
  5 new designs:
  - “Dual-Rail” Data Signaling:
     - LP3/1: “early evaluation”
     - LP2/2: “early done”
     - LP2/1: “early evaluation” + “early done”
  → “Single-Rail” Bundled-Data Signaling:
     - LP SR 2/2: “early done”
     - LP SR 2/1: “early evaluation” + “early done”

21 Single-Rail Design: LP SR 2/1
  Derivative of LP2/1, adapted to single-rail:
     - bundled-data: matched delays instead of completion detectors
  “Ack” to previous stages is “tapped off early”
     - once in evaluate (precharge), dynamic logic insensitive to input changes
  [Figure: pipeline stages with matched delay elements]

22 Inside an LP SR 2/1 Stage
  PC and Eval are combined exactly as in LP3/1
  “done” generated by an asymmetric C-element:
     - done = 1 when stage evaluates, and data inputs valid
     - done = 0 when stage precharges
  [Figure: stage with data in / data out; PC (from stage N+1) and Eval (from stage N+2) merged by a NAND; asymmetric C-element (aC, “+” input) producing done; matched delay from “req” in to “req” out; “ack” output]

23 LP SR 2/1 Protocol
  [Figure: event sequence over stages N, N+1, N+2, events 1 to 3, with the cycle-time expression; labels: N evaluates; N+1 evaluates; N+1 indicates “done”; N+2 evaluates; N+2 indicates “done”]

24 Results (simulations)
  FIFO designs: dual-rail and single-rail
  LP dual-rail: over 80% faster than Williams’ PS0
     - comparable latency
  LP single-rail: even faster
  Technology: 0.19 µ CMOS, 3.3 V, 300 K

25 Practicality of Gate-Level Pipelining
  When datapath is wide:
     - can often split into narrow “streams”
     - use “localized” completion detector for each stream:
        - need to examine only a few bits ⇒ small fan-in
        - send “done” to only a few gates ⇒ small fan-out
     - comp. det. fairly low cost!
  [Figure: datapath width = 32 dual-rail bits; comp. det. fan-in = 2; done fan-out = 2]
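The stream-splitting idea can be sketched in a few lines. This is an illustrative model only: the function names, the tuple encoding of dual-rail bits and the choice of two bits per stream (matching the fan-in of 2 in the figure) are assumptions for the example.

```python
from typing import List, Sequence, Tuple

DualRailBit = Tuple[int, int]  # (rail0, rail1); (0, 0) means "no data yet"


def split_streams(bits: Sequence[DualRailBit], width: int) -> List[Sequence[DualRailBit]]:
    """Partition a wide datapath into narrow streams of `width` bits each."""
    return [bits[i:i + width] for i in range(0, len(bits), width)]


def stream_done(stream: Sequence[DualRailBit]) -> bool:
    """Localized completion detector: every bit in this stream is valid."""
    return all(r0 or r1 for r0, r1 in stream)


def localized_done(bits: Sequence[DualRailBit], width: int) -> List[bool]:
    """One 'done' per stream; each detector examines only `width` bits."""
    return [stream_done(s) for s in split_streams(bits, width)]


if __name__ == "__main__":
    datapath = [(1, 0)] * 32      # 32 dual-rail bits, all carrying data
    datapath[5] = (0, 0)          # bit 5 has not arrived yet
    dones = localized_done(datapath, width=2)
    print(len(dones), dones[2], all(dones[:2]))
```

In this model a late bit stalls only its own stream's detector, while the other streams report completion independently.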

26 High-Capacity Pipelines
  Singh/Nowick: WVLSI-00, ISSCC-02, Async-02

27 HC Pipeline Style
  High-Capacity Pipelines (HC):
     - bundled datapaths; dynamic logic function blocks
     - latch-free: no explicit latches needed
       ⇒ dynamic logic provides implicit latching
     - novel highly-concurrent protocol maximizes storage capacity
       ⇒ traditional latch-free approaches: “spacers” limit capacity to 50%
  Key idea: obtain greater control of stage’s operation
     - separate control of pull-up/pull-down
     - result = new “isolate phase”
     - stage holds outputs/impervious to input changes
  Advantage: each stage can hold a distinct data item → 100% storage capacity
  Extra benefit: obtain greater concurrency ⇒ high throughput

28 HC: Basic Structure
  Key idea: 2 independent control signals:
     - pc: controls precharge
     - eval: controls evaluation
  Allows novel 3-phase cycle:
     - Evaluate
     - “Isolate” (hold)
     - Precharge
  Single-rail “bundled datapath”:
     - matched delay: produces delayed “done” signal
       ⇒ worst-case delay: longer than slowest path for data
  [Figure: stages N, N+1, N+2; each has a stage controller driving pc and eval, a matched delay element, and an ack from the next stage]

29 HC: Inside a Stage
  Independent controls of pull-up and pull-down:
     - allows new 3rd phase: “isolate”
     - pc asserted: precharge
     - eval asserted: evaluate
     - pc and eval de-asserted: enter “isolate” (hold) phase
  [Figure: dynamic gate with inputs and outputs; pc controls precharge, eval controls evaluation; “keeper” on the output]

30 HC: Protocol
  Most existing protocols: 3 synchronization arcs
     - 1 forward arc: data dependency
     - 2 backward arcs: control synchronization
  Our protocol: only 2 synchronization arcs
     - only 1 backward arc
       ⇒ once stage N+1 evaluates, N can complete entire next cycle!
  Phase encoding:
     - Eval: pc=1, eval=1
     - Isolate: pc=1, eval=0
     - Precharge: pc=0, eval=0
  [Figure: Eval → Isolate → Precharge cycles of stage N and stage N+1, with one backward arc crossed out (X)]
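The three-phase behavior of a stage can be modeled at a purely behavioral level. The sketch below assumes the signal values listed for the protocol (Eval: pc=1, eval=1; Isolate: pc=1, eval=0; Precharge: pc=0, eval=0), treats the precharged output as logic 0, and models dynamic logic as monotonic during evaluation. The class and method names are hypothetical, and no transistor-level detail is implied.

```python
from typing import Callable


class HCStage:
    """Behavioral model of one dynamic-logic stage with separate pc/eval controls."""

    def __init__(self, func: Callable[..., bool]) -> None:
        self.func = func
        self.output = False  # ASSUMPTION: precharged output reads as 0

    @staticmethod
    def phase(pc: int, eval_: int) -> str:
        if pc == 1 and eval_ == 1:
            return "evaluate"
        if pc == 1 and eval_ == 0:
            return "isolate"
        if pc == 0 and eval_ == 0:
            return "precharge"
        # pc=0 with eval=1 would enable pull-up and pull-down together.
        raise ValueError("pc=0, eval=1 is not a legal phase")

    def step(self, pc: int, eval_: int, *inputs: bool) -> bool:
        current = self.phase(pc, eval_)
        if current == "precharge":
            self.output = False
        elif current == "evaluate":
            # Dynamic logic is monotonic: once fired, it stays fired.
            self.output = self.output or bool(self.func(*inputs))
        # In "isolate" the output is held and the inputs are ignored.
        return self.output


if __name__ == "__main__":
    stage = HCStage(lambda a, b: a and b)
    print(stage.step(1, 1, True, True))    # evaluate
    print(stage.step(1, 0, False, False))  # isolate: inputs change, output held
    print(stage.step(0, 0, True, True))    # precharge
```

The isolate step is what lets a stage keep its data item while the previous stage moves on, which is the basis of the capacity claim for this style.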

31 Formal Specification of Controller
  Problem: specification too concurrent for direct synthesis
     - desired precharge condition: N and N+1 have evaluated same data
     - problem: this condition not uniquely captured by given signals!
       ⇒ N may evaluate next data item, while N+1 stuck on current item!
  [Figure: controller specification; stage N transitions: pc+, eval+ (start evaluate), S+ (evaluate complete), eval- (isolate), pc- (start precharge), S- (precharge complete); stage N+1 transitions: T+ (evaluate of N+1 complete), T- (precharge of N+1 complete)]

32 Modified Specification of Controller
  Solution: add a state variable ok2pc
     - ok2pc records whether N+1 has “absorbed” N’s data item
     - ok2pc resets immediately when N deletes item (N precharges)
     - ok2pc is set when N+1 deletes item (N+1 precharges)
  [Figure: modified specification with transitions ok2pc+, ok2pc-, pc+, eval+, S+, eval-, pc-, S-; T+ (evaluate of N+1 complete), T- (precharge of N+1 complete)]
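The role of ok2pc can be seen by replaying the problematic scenario at the event level. The sketch below is an illustration under assumptions, not the synthesized controller: S and T are taken to mean "N has evaluated" and "N+1 has evaluated", ok2pc is assumed to start high in an empty pipeline, and the method names are hypothetical.

```python
class PrechargeGuard:
    """Event-level model of stage N's precharge condition with ok2pc."""

    def __init__(self) -> None:
        self.S = False      # stage N has evaluated its current item
        self.T = False      # stage N+1 has evaluated its current item
        self.ok2pc = True   # ASSUMPTION: set initially (N+1 starts precharged)

    def n_evaluates(self) -> None:
        self.S = True

    def n_precharges(self) -> None:
        self.S = False
        self.ok2pc = False  # reset as soon as N deletes its item

    def next_evaluates(self) -> None:
        self.T = True

    def next_precharges(self) -> None:
        self.T = False
        self.ok2pc = True   # set when N+1 deletes its item

    def naive_may_precharge(self) -> bool:
        """Condition without the state variable: ambiguous."""
        return self.S and self.T

    def may_precharge(self) -> bool:
        """Condition with the state variable."""
        return self.S and self.T and self.ok2pc


if __name__ == "__main__":
    g = PrechargeGuard()
    g.n_evaluates(); g.next_evaluates()   # item A reaches N, then N+1
    g.n_precharges()                      # N deletes A
    g.n_evaluates()                       # N takes item B; N+1 still holds A
    print(g.naive_may_precharge(), g.may_precharge())
    g.next_precharges(); g.next_evaluates()  # N+1 deletes A, then absorbs B
    print(g.may_precharge())
```

In the middle of the run, S and T are both high even though N+1 still holds the old item; only the guarded condition keeps N from precharging too early.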

33 Controller Implementation
  Controller implementation is very simple:
     - each signal implemented using a single gate
     - ok2pc typically off the critical path
  [Figure: three gates (INV, NAND3, asymmetric C-element aC with “+” input) with inputs S and T, producing eval, pc and ok2pc]

34 HC: Stage Implementation
  [Figure: stage with req, done and ack signals and eval/pc controls, built from a NAND, an INV, an asymmetric C-element (“+” input) and a matched delay]
     - state variable: off the critical path
     - self-loop from current stage: key to fast “isolation”
     - early ack from next stage

35 HC: Operation
  [Figure: event sequence over stages N and N+1, events 1 to 3; labels: N evaluates; N+1 starts to evaluate (early Ack); N isolates (fast self-loop); N precharges; N enables itself for next evaluation (fast self-loop)]
  Cycle Time = 8 CMOS gate delays

36 Performance
  [Figure: event sequence over stages N, N+1, N+2, events 1 to 3, with the cycle-time expression; labels: N evaluates; N+1 evaluates; N isolates; N precharges; N enables itself for next evaluation]

37 Results (simulations)
  FIFO designs: dual-rail and single-rail
  LP dual-rail: over 80% faster than Williams’ PS0
     - comparable latency
  LP single-rail: even faster
  Technology: 0.19 µ CMOS, 3.3 V, 300 K

38 Fabricated Chip: HC FIFO
  ⇒ 2.5 GHz in 0.18 µm
