Published by Morris Mathews. Modified over 9 years ago.
1
Asynchronous Pipelines
Author: Peter Yeh
Advisor: Professor Beerel
2
USC Asynchronous Group
Motivation
Can we reduce the communication overhead of asynchronous pipelines while hiding precharge time?
Can asynchronous pipelines achieve cycle times as fast as, if not faster than, their best synchronous counterparts?
3
Motivation: System Performance
Fixed stage pipeline
– Low pipeline usage: low latency is critical
– High pipeline usage: cycle time is the limiting factor in generating new outputs as fast as possible
Flexible stage pipeline
– With zero forward overhead and a short cycle time, we can achieve a given throughput with fewer stages
4
Motivation: System Performance
Pipelines with loop dependencies
– Optimal cycle time is the sum of the latencies around the loop
– Pipelining is required to ensure precharge/reset is not in the critical path
– Our scheme requires fewer pipeline stages to achieve the same performance
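The loop argument above can be sketched numerically. This is a minimal model, not the author's analysis: it assumes a loop of identical stages carrying a single data token, and the function names and delay values are hypothetical. The loop runs at its optimum (the sum of forward latencies) only once there are enough stages for that sum to cover one stage's local cycle time, which is what keeps precharge off the critical path.

```python
import math


def min_loop_stages(cycle_time: float, forward_latency: float) -> int:
    """Smallest stage count for which the latency around the loop
    (stages * forward_latency) covers one stage's local cycle time."""
    if cycle_time <= 0 or forward_latency <= 0:
        raise ValueError("delays must be positive")
    return math.ceil(cycle_time / forward_latency)


def loop_iteration_time(stages: int, cycle_time: float, forward_latency: float) -> float:
    """Time per loop iteration: limited by whichever is larger, the latency
    around the loop or the local cycle time of a single stage."""
    return max(stages * forward_latency, cycle_time)


# Illustrative numbers only: a scheme with P = 12 and Lf = 3 (arbitrary units).
print(min_loop_stages(12, 3))         # stages needed to hide precharge
print(loop_iteration_time(2, 12, 3))  # too few stages: bound by the local cycle time
print(loop_iteration_time(5, 12, 3))  # enough stages: bound by the loop latency
```

Under this model, a scheme with a smaller ratio of cycle time to forward latency needs fewer stages before the loop becomes latency-bound.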
5
Introduction
Asynchronous pipeline schemes using a Taken Detector (TD)
Best used in coarse-grained pipelines
Two schemes targeting different requirements (and possibly a third, SI scheme as well)
6
Outline
Background review
– Sutherland
– Ted Williams
– Renaudin
– Martin
Taken pipeline
Performance comparison
Conclusion
7
Definitions
Stage: a collection of logic that is precharged or evaluated at the same time
Cycle: the time it takes for a stage to start its next evaluation after starting the current one
Forward Latency: the time between the start of evaluation in the current stage and the start of evaluation in the next stage
8
Background Outline
Sutherland’s Micropipeline scheme
Ted Williams’s PS0 and PC0 pipeline schemes
Renaudin’s DCVSL pipeline scheme
Martin’s deep pipeline scheme
9
Sutherland’s Micropipeline
Father of the asynchronous pipeline; presented in his Turing Award lecture
Delay Insensitive
[Figure: micropipeline built from C-elements (C), registers (REG) and LOGIC blocks; signals R(in), A(in), D(in), R(out), A(out), D(out)]
10
Williams’s PC0
Speed Independent
Cycle Time (P) = 3tF + 1tF + 4tC + 4tD
Forward Latency (Lf) = 1tF + 1tD + 1tC
[Figure: pipeline of precharged function blocks F1–F3 with D1–D3 and C1–C3; signals D(in), R(in), A(in), D(out), R(out), A(out)]
11
PC0 Timing Diagram
The cycle time is shown by the red arrows, while the blue arrows show the precharge phase
12
Dependency Graph
[Figure: Flat Dependency Graph and Folded Dependency Graph over the C, F and D nodes of successive stages]
13
Williams’s PC1
Cycle Time (P) = 2tF + 4tC + 4tD
Forward Latency (Lf) = 1tF + 2tC + 1tD
[Figure: precharged function blocks F1 and F2 with DA, DB, D2, C1, C2 and a C latch; signals D(in), R(in), A(in), D(out), R(out), A(out)]
14
Williams’s PS0
Not Speed Independent
Cycle Time (P) = 3tF + 1tF + 2tD
Forward Latency (Lf) = 1tF
[Figure: pipeline of precharged function blocks F1–F3 with D1–D3; signals D(in), A(in), D(out), A(out)]
15
PS0 Timing Diagram
16
PS0 Timing Assumption
The pipeline has to meet the following timing assumption:
tF
17
Renaudin’s DCVSL Pipeline
Compared with Williams’s PC0 only
Uses DCVSL exclusively
Introduces Latched DCVSL
Improves cycle time but not forward latency
Cycle Time (P) = 1tF + 1tF + 4tC + 2tD
Forward Latency (Lf) = 1tF + 1tC + 1tD
18
DCVS Logic Family
[Figure: DCVS Logic; Latched DCVS Logic]
19
More on DCVSL
Advantages
– Fast, based on dynamic domino-type logic
– Built-in four-phase handshaking
– Robust completion sensing
– Storage element
Disadvantages
– Higher complexity: more transistors and more area
– Higher power dissipation
20
DCVS Pipeline
[Figure: pipeline of precharged function blocks F1–F3 with D1–D3 and C1–C3; signals D(in), R(in), A(in), D(out), R(out), A(out)]
Cycle Time (P) = 1tF + 1tF + 4tC + 2tD (= 2tF + 4tC + 2tD)
Forward Latency (Lf) = 1tF + 1tC + 1tD
21
DCVS Pipeline Timing Diagram
22
DCVS Dependency Graph
[Figure: Folded Dependency Graph over the C, F and D nodes]
Cycle Time (P) = 1tF + 1tF + 4tC + 2tD
Forward Latency (Lf) = 1tF + 1tC + 1tD
23
Martin’s Pipeline Schemes
Deep pipelining
Quasi Delay-Insensitive (QDI): no timing assumptions
Based on different handshaking reshufflings
The best scheme has high concurrency, which reduces control overhead
Control logic is more complex
24
Basic Asynchronous Handshaking
[Figure: three cells (1, 2, 3) with left signals L0, L1, Le and right signals R0, R1, Re, and the ordering of transitions on L1, Le, R1, Re]
Reshuffling eliminates the explicit variable x
Large control overhead
25
Handshaking Reshuffling
Still waits for the predecessor to reset before resetting itself, so overhead grows with the number of inputs
[Figure: three cells (1, 2, 3) with left signals L0, L1, Le and right signals R0, R1, Re, and the ordering of transitions on L1, Le, R1, Re]
26
Precharge-Logic Half-Buffer
Does not wait for the predecessor to reset before resetting its own outputs; the control logic waits for the predecessor’s reset only after the current stage has reset
[Figure: three cells (1, 2, 3) with left signals L0, L1, Le and right signals R0, R1, Re, and the ordering of transitions on L1, Le, R1, Re]
27
Precharge-Logic Full-Buffer
Allows the neutrality test of the output data to overlap with raising the left enables
Complex control logic; requires an extra state variable
[Figure: three cells (1, 2, 3) with left signals L0, L1, Le and right signals R0, R1, Re, the ordering of transitions on L1, Le, R1, Re, and the extra state variable en]
28
Martin’s PCHB Full-adder
29
Martin’s Pipeline in General
The cycle time is limited by the properties of QDI
– The next stage has to finish precharging before the current stage can evaluate its next input
[Figure: pipeline of precharged function blocks F1–F3 with D1–D3 and a Control block; signals D(in), D(out), Le, Re]
30
Performance Analysis on PCFB
The control logic can be viewed as completion detection (D) plus a C-element (C)
Reshuffling the handshake only changes the degree of concurrency; it does not affect the best-case performance analysis
Cycle Time (P) = 3tF + 1tF + 2tC + 2tD
Forward Latency (Lf) = 1tF
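To compare the background schemes side by side, the cycle-time and forward-latency expressions quoted on the slides can be evaluated for a chosen set of delays. This is only a sketch: the `Delays` type, the table layout and the unit delay values are illustrative assumptions, and the two tF terms in each cycle-time expression are simply summed, which presumes equal precharge and evaluation times.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delays:
    t_f: float  # function block delay (precharge assumed equal to evaluation)
    t_c: float  # C-element delay
    t_d: float  # completion detector delay


# name: (cycle-time coefficients, forward-latency coefficients), each as (tF, tC, tD),
# transcribed from the slide formulas.
SCHEMES = {
    "PC0": ((4, 4, 4), (1, 1, 1)),
    "PC1": ((2, 4, 4), (1, 2, 1)),
    "PS0": ((4, 0, 2), (1, 0, 0)),
    "DCVSL": ((2, 4, 2), (1, 1, 1)),
    "PCFB": ((4, 2, 2), (1, 0, 0)),
}


def weighted(coeffs, d: Delays) -> float:
    """Evaluate k_f*tF + k_c*tC + k_d*tD for one coefficient triple."""
    k_f, k_c, k_d = coeffs
    return k_f * d.t_f + k_c * d.t_c + k_d * d.t_d


def summarize(d: Delays) -> dict:
    """Return {scheme: (cycle time P, forward latency Lf)} for the given delays."""
    return {name: (weighted(p, d), weighted(lf, d)) for name, (p, lf) in SCHEMES.items()}


if __name__ == "__main__":
    for name, (p, lf) in summarize(Delays(t_f=1, t_c=1, t_d=1)).items():
        print(f"{name:6s} P = {p:5.1f}   Lf = {lf:4.1f}")
```

With unit delays the schemes that keep control out of the forward path (PS0, PCFB) show the lowest forward latency, matching the slides' emphasis on Lf as well as P.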
31
Outline
Background review
– Sutherland
– Ted Williams
– Renaudin
– Martin
Taken pipeline
Performance comparison
Conclusion
32
Taken Pipeline
Uses a Taken Detector
Two schemes to satisfy different requirements
Neither scheme is speed independent
33
Initial Idea
Precharge: only when the next stage has taken the current result
Evaluation: only when the next stage has precharged
Similar in idea to Martin’s pipeline schemes
34
Further Observation
Precharge
– We can precharge the current stage as soon as the first-level logic of the next stage has evaluated, i.e. the next stage has taken the result
Evaluate
– Evaluation can start as soon as the guarded N-transistor in the first-level logic of the next stage has turned off
35
Relax Precharge (RP) Constraint
The current stage can precharge as soon as the first-level logic of the next stage has evaluated: the next stage has Taken the result
The current stage can evaluate as soon as the first-level logic of the next stage has precharged, which blocks the new result from passing through
No extra control logic is needed except the TD, which is similar to a completion detector
36
RP Pipeline Scheme
[Figure: pipeline of precharged function blocks F1–F3 with taken detectors TD1–TD3; signals D(in), D(out)]
Cycle Time (P) = 2tF + 1tF1 + 1tF1 + 2tTD
Forward Latency (Lf) = 1tF
37
RP Timing Diagram
38
RP Timing Assumption
Easy-to-meet timing assumption
39
RP Timing Assumption Cont.
tF1_i is the delay of the first-level logic of stage i
tF2_i is the delay of the logic after the first level of stage i
The rising and falling delays of the TD are assumed to be the same
40
Relax Evaluation (RE) Constraint
The current stage can start evaluating at about the same time as the next stage turns off the guarded N-transistors in its first-level logic
Requires a generalized C-element, but improves cycle time
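For readers unfamiliar with the gate, a generalized C-element is a state-holding element whose set and reset conditions may depend on different inputs. The behavioral sketch below illustrates that general idea only; the slides do not say which signals drive which side in the RE scheme, so the class name, the input grouping and the example values are all hypothetical.

```python
from typing import Sequence


class GeneralizedCElement:
    """Behavioral model of a generalized C-element (state-holding gate)."""

    def __init__(self, initial: int = 0) -> None:
        self.out = initial

    def update(self, set_inputs: Sequence[int], reset_inputs: Sequence[int]) -> int:
        """Raise the output when every set-side input is high, lower it when
        every reset-side input is low, and otherwise hold the previous value."""
        if not set_inputs or not reset_inputs:
            raise ValueError("both input groups must be non-empty")
        pull_up = all(set_inputs)
        pull_down = not any(reset_inputs)
        if pull_up and pull_down:
            # In a real gate this would turn on both stacks at once.
            raise ValueError("set and reset conditions are both active")
        if pull_up:
            self.out = 1
        elif pull_down:
            self.out = 0
        return self.out


gc = GeneralizedCElement()
print(gc.update([1, 0], [1]))  # set condition incomplete: output holds
print(gc.update([1, 1], [1]))  # all set inputs high: output rises
print(gc.update([0, 1], [1]))  # neither condition met: output holds
print(gc.update([0, 1], [0]))  # all reset inputs low: output falls
```

An input that should take part in both transitions is simply listed in both groups, which recovers the ordinary symmetric C-element.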
41
RE Pipeline Scheme
The TD can be skewed for fast evaluation detection
Cycle Time (P) = 2tF + 1tF1 + 1tTD + 1tC
Forward Latency (Lf) = 1tF
[Figure: pipeline of precharged function blocks F1–F3 with taken detectors TD1–TD3 and generalized C-element GC1; signals D(in), D(out)]
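The two Taken pipeline cycle-time expressions differ only in their control terms, which a short calculation makes concrete. The sketch below evaluates the RP and RE formulas as quoted on the slides; the function names and the sample delays are hypothetical, and the two tF1 terms in the RP formula are assumed to be equal.

```python
def rp_cycle_time(t_f: float, t_f1: float, t_td: float) -> float:
    """RP scheme: P = 2tF + 1tF1 + 1tF1 + 2tTD."""
    return 2 * t_f + 2 * t_f1 + 2 * t_td


def re_cycle_time(t_f: float, t_f1: float, t_td: float, t_c: float) -> float:
    """RE scheme: P = 2tF + 1tF1 + 1tTD + 1tC."""
    return 2 * t_f + t_f1 + t_td + t_c


def forward_latency(t_f: float) -> float:
    """Both schemes: Lf = 1tF."""
    return t_f


# Illustrative coarse-grained stage: the function block dominates the control delays.
t_f, t_f1, t_td, t_c = 4, 1, 1, 1
print(rp_cycle_time(t_f, t_f1, t_td))       # 12
print(re_cycle_time(t_f, t_f1, t_td, t_c))  # 11
```

Subtracting the two expressions shows that RE is faster than RP by tF1 + tTD - tC, so the gain depends on the generalized C-element being quicker than one first-level delay plus one TD delay.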
42
RE Timing Diagram
43
RE Timing Assumption 1
Precharge constraint
44
RE Timing Assumption 2
Evaluation constraint (Min Delay)
45
Issue in Fine-Grained Pipelines
In a fine-grained pipeline, such as Martin’s single-gate pipeline, the RE scheme may require buffering because of process variation
– Buffering is needed because of the second timing assumption: the next gate (stage) may not have turned off its N-stack before the result from the current stage reaches it
46
Taken Detector (TD)
Similar to a completion detector
Detects both evaluation and precharge
Its inputs are the outputs of the first-level logic of each stage
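A behavioral sketch of such a detector may help. It is not the circuit from the presentation: it assumes the first-level outputs are dual-rail pairs with (0, 0) as the precharged state, and it borrows the usual completion-detector behavior of holding state while the outputs are only partly switched. The class and method names are hypothetical.

```python
from typing import Sequence, Tuple

Rail = Tuple[int, int]  # (true rail, false rail); (0, 0) means precharged


class TakenDetector:
    """Goes high once every first-level output has evaluated, goes low once
    every first-level output has precharged, and holds its value in between."""

    def __init__(self) -> None:
        self.taken = 0

    def update(self, first_level_outputs: Sequence[Rail]) -> int:
        if not first_level_outputs:
            raise ValueError("at least one first-level output is required")
        evaluated = all(t != f for t, f in first_level_outputs)
        precharged = all(t == 0 and f == 0 for t, f in first_level_outputs)
        if evaluated:
            self.taken = 1
        elif precharged:
            self.taken = 0
        return self.taken


td = TakenDetector()
print(td.update([(0, 0), (0, 0)]))  # all precharged
print(td.update([(1, 0), (0, 0)]))  # partly evaluated: holds
print(td.update([(1, 0), (0, 1)]))  # all evaluated: taken
print(td.update([(0, 0), (0, 1)]))  # partly precharged: holds
print(td.update([(0, 0), (0, 0)]))  # all precharged again
```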
47
Datapath Merging & Splitting
Datapath merging and splitting can be done in a way similar to Williams’s style
[Figure: F1 (TD1) splitting into F2a (TD2a) and F2b (TD2b), which merge through a C-element into F3 (TD3); signals D(in), D(out)]
48
Outline
Background review
– Sutherland
– Ted Williams
– Renaudin
– Martin
Taken pipeline
Performance comparison
Conclusions
49
Comparison of RE and Synchronous Skew Tolerant
Assume a 4-stage pipeline (stages 1-4) and 4-phase clocking
Synchronous:
– Stage 1 starts its next evaluation after stage 4 starts evaluating
Asynchronous:
– Stage 1 starts its next evaluation after we detect completion of the first-level logic of stage 3
50
Comparison Assumptions
The pipeline is balanced: all stages have equal evaluation time
Precharge time is the same as evaluation time
51
Graphical Comparison
52
Optimum Number of Stages (ONS)
Cycle time is not the only factor in system performance; forward latency is also a limiting factor
A larger cycle time can be compensated for by increasing the number of stages
However, a high Lf means system throughput cannot be increased by adding more stages
53
Conclusion
With Taken logic and some easy-to-meet timing requirements, we can achieve the best cycle time and forward latency
The performance comparison with existing pipeline schemes is favorable
An implementation is still required to prove the theory