University of California Davis

Clocked Storage Elements for High-Performance and Low-Power Systems
ICCD 2001 Tutorial
Vojin G. Oklobdzija
University of California Davis
Integration Corp., Berkeley, CA 94708

Prof. V.G. Oklobdzija, University of California
Outline
• Importance of Clocked Storage Elements (CSE)
• Basic Definitions
• Difference between Latch and Flip-Flop
• Timing and Power metrics
• Representative designs used in High-Performance Microprocessors
• Comparison
• Conclusion, New Directions and Some novel designs
9/16/2018 Prof. V.G. Oklobdzija, University of California

Importance of Clocked Storage Elements (CSE)

Trends in high-performance systems: Higher clock frequency

Power vs. Year
High-end growing at 25% / year; consumer (low-end) at 13% / year.
[Figure: power vs. year, with trend lines labeled 12% / yr and 15% / yr]

Predictions
Source: Shekhar Borkar, Intel

Recent Interest in Clocked Storage Elements
Trends in high-performance systems
• Higher clock frequency: 1.8GHz Pentium 4 (4GHz logic presented)
• More transistors on chip (214 million, ISSCC 2001)
Consequences
• Increased Flip-Flop overhead relative to cycle time
• Pipeline depth of 20 or more
• Cycle time FO4 delays, F-F overhead FO4

Courtesy: Doug Carmean, Hot-Chips-13 presentation

Processor Frequency Trend
Source: Intel, S. Borkar
• Frequency doubles each generation
• Number of gates per clock reduces by 25%

Pentium 3 uArchitecture
Three pipeline stages, each a block of logic followed by a register.
Delay per stage: 0.6 ns (logic) + 0.3 ns (register)
The total delay from pipeline stage to pipeline stage is 0.9 ns. The maximum clock rate for this design is 1.1 GHz.

The Pentium 4 Depends on Pipelines
Six pipeline stages, each a block of logic followed by a register.
Delay per stage: 0.4 ns (logic) + 0.16 ns (register)
The total delay from pipeline stage to pipeline stage is 560 ps. This design, with twice the stages, has a maximum clock rate of 1.8 GHz.
As the design is broken into more pipeline stages, the logic in each stage has less delay, and the registers between stages consume a higher percentage of the delay, causing diminishing returns. At some point the cost of adding more stages, such as in branch prediction, leaves only a very marginal return. The only way out of this bottleneck is a faster register. This is one reason why the P4 is not significantly faster than a slower-clocked P3 for many applications.
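The arithmetic behind the two pipeline examples can be sketched as follows. This is an illustrative calculation only, not part of the original tutorial: the function name `stage_metrics` is hypothetical, the per-stage delays are the ones quoted for the two designs (taken to be in nanoseconds), and the final sweep uses an assumed 2.4 ns of total logic with a fixed 0.16 ns register.

```python
def stage_metrics(logic_ns, register_ns):
    """Return (cycle time in ns, max clock in GHz, register share of the cycle)."""
    cycle_ns = logic_ns + register_ns
    f_max_ghz = 1.0 / cycle_ns            # 1 / ns is GHz
    register_share = register_ns / cycle_ns
    return cycle_ns, f_max_ghz, register_share


# Per-stage delays quoted for the two designs.
p3 = stage_metrics(0.6, 0.3)
p4 = stage_metrics(0.4, 0.16)

for name, (cycle, f_max, share) in (("Pentium 3", p3), ("Pentium 4", p4)):
    print(f"{name}: cycle {cycle:.2f} ns, f_max {f_max:.2f} GHz, "
          f"register share {share:.0%}")

# Assumption for illustration: the register delay stays at 0.16 ns while a
# hypothetical 2.4 ns of logic is split into more and more stages.
for stages in (6, 12, 24):
    cycle, f_max, share = stage_metrics(2.4 / stages, 0.16)
    print(f"{stages} stages: f_max {f_max:.2f} GHz, register share {share:.0%}")
```

With the register delay held constant, each doubling of the stage count less than doubles the clock rate while the register share of the cycle grows, which is the diminishing return described above.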

Courtesy: Doug Carmean, Hot-Chips-13 presentation

Why Interest in Clocked Storage Elements ?
• Higher impact of storage element delay
• High-speed requires low CSE pipeline overhead: 3 FO4 or less
• Logic embedding property
• Limits on performance
• FF delays of 10ps - 100ps
• Higher impact of clock skew
• Ability to control both edges of the clock
• Higher power consumption
• >100W for recent processors
• Clock system burns up to 40%, storage elements up to 20% of total power
• Battery-powered applications

Basic Definitions

Clock Signals
Clocks are defined as pulsed, synchronizing signals that provide the time reference for the movement of data in a synchronous digital system. The clocking in a digital system can be either single-phase or multi-phase (usually two-phase). The clocking strategy is largely determined by the choice of the CSE: latch or flip-flop. The dark rectangles in the figure represent the interval during which the bi-stable element samples its data input. Fig. 4.2 shows the possible types of clocking techniques and the corresponding general finite-state machine structures.

Clock Signal Uncertainty
Effects on cycle time:
– maximum delay restriction – violation of setup time
May cause race:
– minimum delay restriction – violation of hold time
Uncertainty is: Jitter, Skew, and Duty Cycle

Jitter
• Uncertainty in consecutive edges of a periodic signal
• Caused by temporal noise events
• Quantified as:
– cycle-to-cycle or short-term jitter, tJS
– long-term jitter, tJL

Clock Skew
• Time difference between temporally-equivalent or concurrent edges of two periodic signals
• Caused by spatial noise events

Clocking Strategies
• Single-phase clocking and single latch machine
• Edge-triggered clocking and Flip-Flop based machine

Clocking Strategies
• Two-phase clocking and two-phase latch machine with single latch
• Two-phase clocking and two-phase latch machine with double latch

Delay Restrictions
• Clock defines hard boundaries for edge-triggered design
• Clock boundaries are soft for level-sensitive clocking, and they are:
– Tolerant of clock edge uncertainty
– Tolerant of uncertainty in data arrival
– Timing slack can voluntarily be passed forward
– Time can forcefully be borrowed
*Taken from Hamid Partovi’s ISSCC-2000 GHz Processor Design Workshop presentation

Single-Phase Clocking, Single Latch: Timing Constraints

Two-Phase Clocking with Two-Phase Double Latch

Two-Phase Clocking with One-Phase Double Latch
Some people refer to this clocking arrangement as a “negative edge Flip-Flop”, erroneously!

Difference between Latch and Flip-Flop

Difference between Latch and Flip-Flop
After the transition of the clock, data cannot change. Latch is “transparent”.

Flip-Flop and M-S Latch Arrangement
How can one recognize the difference without knowing what is inside the “black-box”?

F-F and M-S Latch: Difference
Experiment:

F-F and M-S Latch: Difference
Structural Difference: No Clock Flip-Flop M-S Latch

Flip-Flop vs. Latch
Flip-Flop:
• Edge sensitive
• Easier to use as frequency increases
• Robustness to duty cycle
• Simpler logic timing requirements
• Fits into CAD tools
Latch:
• Level sensitive
• May consume less power for the operation
• Better clock skew/jitter characteristics
• More difficult clock requirements
The choice between a flip-flop and a latch depends on each individual design and its specifications. Flip-flops are edge sensitive, with simpler timing requirements and lower sensitivity to duty cycle imperfections. Latches are level sensitive and simpler, with lower power consumption and better clock skew/jitter characteristics.

Flip-Flop: Example
HLFF (Partovi)

Pulse-Based Flip-Flops*
*Taken from Hamid Partovi’s ISSCC-2000 GHz Processor Design Workshop presentation

Flip-Flop: Example
SAFF, DEC Alpha 21264
[Figure labels: D=0, pulse, D=1]

Requirements in the Flip-Flop Design
• Small Clk-Output delay, narrow sampling window
• Low power
• Small clock load
• High driving capability (increased levels of parallelism)
– Typical load ranges from 3-4 FO4 to FO4.
– High driving should be achieved by inserting inverters and following “logical effort” rules, starting with a minimal-size CSE.
• Symmetry: balanced D-Q and D-Q/not delay
• Integration of logic into the flop
• Multiplexed or clock scan
• Cross-talk insensitivity: dynamic/high-impedance nodes are affected

Timing and Power metrics

Delay
The sum of setup time U and Clk-Q delay is the only true measure of performance with respect to the system speed:
$T = T_{Clk-Q} + T_{Logic} + T_{setup} + T_{skew}$
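As a worked example of the cycle-time relation above, the sketch below rearranges it to find the time left for logic. The function names and all numeric values are hypothetical, chosen only to make the arithmetic easy to follow; they are not measurements from the tutorial.

```python
def logic_budget_ps(cycle_ps, clk_q_ps, setup_ps, skew_ps):
    """Time left for logic: T_Logic = T - T_Clk-Q - T_setup - T_skew."""
    return cycle_ps - clk_q_ps - setup_ps - skew_ps


def cse_overhead_fraction(cycle_ps, clk_q_ps, setup_ps):
    """Share of the cycle taken by the storage element (setup + Clk-Q)."""
    return (clk_q_ps + setup_ps) / cycle_ps


# Hypothetical values: 1 GHz clock, 200 ps Clk-Q, 50 ps setup, 50 ps skew.
cycle = 1000.0
budget = logic_budget_ps(cycle, clk_q_ps=200.0, setup_ps=50.0, skew_ps=50.0)
overhead = cse_overhead_fraction(cycle, clk_q_ps=200.0, setup_ps=50.0)
print(f"logic budget {budget:.0f} ps, CSE overhead {overhead:.0%}")

# A negative setup time (assumed -20 ps here) returns time to the logic.
budget_negative_setup = logic_budget_ps(cycle, 200.0, -20.0, 50.0)
print(f"logic budget with negative setup {budget_negative_setup:.0f} ps")
```

Because setup time and Clk-Q delay enter the budget as a sum, a storage element with a slightly slower Clk-Q but a negative setup time can still leave more time for logic.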

Delay vs. Setup/Hold Times

Timing Characteristics
The figure shows typical clock-to-output and data-to-output characteristics. In the stable region, the clock-to-output characteristic is constant. As the setup requirement of the device starts to be violated, the clock-to-output curve rises, ending in failure at some point. The data-to-output characteristic, being the simple sum of clock-to-output and data-to-clock time, falls with a slope of 45° in the stable region. In the metastable region, the slope starts to decrease as the clock-to-output characteristic increases. The minimum of the data-to-output curve occurs at the 45° slope of the clock-to-output curve. The data-to-clock time that corresponds to this point is termed the optimal setup time.
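The procedure described above can be sketched numerically: sweep the data-to-clock time, record the clock-to-output delay at each point, and pick the point where their sum (data-to-output) is smallest. The function name `optimal_setup` and the sample curve are hypothetical; real values would come from circuit simulation.

```python
def optimal_setup(samples):
    """samples: iterable of (data_to_clock_ps, clk_to_q_ps) pairs.

    Returns (data_to_clock, clk_to_q, data_to_q) at the minimum D-Q point.
    """
    best = min(samples, key=lambda s: s[0] + s[1])
    return best[0], best[1], best[0] + best[1]


# Hypothetical sweep: Clk-Q is flat in the stable region and rises as the
# data edge moves closer to the clock edge.
samples = [
    (200, 250), (150, 250), (100, 252), (80, 256),
    (60, 264), (40, 280), (20, 312), (0, 376),
]

setup, clk_q, d_q = optimal_setup(samples)
print(f"optimal setup {setup} ps, Clk-Q {clk_q} ps, minimum D-Q {d_q} ps")
```

In this made-up curve the minimum falls where Clk-Q starts rising about as fast as the data-to-clock time shrinks, which corresponds to the 45° slope condition stated above.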

Timing parameters, details
The best point to pick on the delay curve is minimum D-Q

Simulation Condition and Testbench
Power
• Data activity dependence as a FF characteristic
• Consumption with 50% (30%) activity adopted as a figure of merit
• Dissipation of driving inverters is part of total power consumption
In order to evaluate and compare flip-flops, simulation conditions and a testbench are defined. They are set according to the flip-flop characterization presented earlier. Power consumption is measured with several different input activities; power consumption with an input activity of 50% is adopted as a figure of merit. Total dissipation includes the dissipation of the driving inverters.

Simulation Condition and Testbench
Timing
• Total FF overhead is setup + clock-to-output time
• Circuit optimization towards td-q
• Clock skew robustness obtained from observing DQ curve
Power-Delay Product
• Overall performance parameter at fixed frequency
The circuit delay parameter used for evaluation is data-to-output time. Circuits are optimized towards this parameter. The ultimate performance parameter is the power-delay product, measured at a fixed clock frequency. It is calculated as the product of data-to-output time and total power consumption, measured at the optimal setup time.
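A minimal sketch of the power-delay product calculation described above, with the unit conversion made explicit. The function name and the candidate numbers are hypothetical and are not results from the tutorial.

```python
def power_delay_product_fj(total_power_uw, d_q_delay_ps):
    """PDP in femtojoules.

    1 uW * 1 ps = 1e-6 W * 1e-12 s = 1e-18 J = 1e-3 fJ.
    """
    return total_power_uw * d_q_delay_ps / 1000.0


# Hypothetical candidates: (total power in uW, minimum D-Q delay in ps),
# both taken at the optimal setup time and a fixed clock frequency.
candidates = {
    "design A": (100.0, 300.0),
    "design B": (150.0, 220.0),
    "design C": (80.0, 450.0),
}

pdp = {name: power_delay_product_fj(p, d) for name, (p, d) in candidates.items()}
for name in sorted(pdp, key=pdp.get):
    print(f"{name}: PDP = {pdp[name]:.1f} fJ")
```

Ranking by PDP rewards neither the fastest nor the lowest-power design on its own, which is why it is used as the combined figure of merit.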

Flip-Flop Performance Comparison
Test bench
• Total power consumed:
– internal power
– data power
– clock power
• Measured for four cases:
– no activity (0000… and 1111…)
– maximum activity ( )
– average activity (random sequence)
• Delay is (minimum D-Q) Clk-Q + setup time

The sources of internal power consumption

Design & optimization tradeoffs
• Opposite goals:
– Minimal total power consumption
– Minimal delay
• Power-delay tradeoff
• Minimize Power-Delay product (PDPtot)

Clocked Storage Elements in High-Performance Microprocessors

Master-Slave Latches
• Positive setup times
• Two clock phases:
– distributed globally
– generated locally
• Small penalty in delay for incorporating MUX
• Some circuit tricks needed to reduce the overall delay

PowerPC 603 M-S Latch Combination
PowerPC 603 (Gerosa, JSSC 12/94)
• Used in PowerPC family
• Low-power
• High speed
• Big clock load
• Easily embedded scan function
Our simulations show:
• Small internal power consumption
• Low-power feedback
• Double the clock load compared with other latches
• Locally generated second phase (reduces overall clock load)

mC2MOS M-S Latch
Y. Suzuki, “Clocked CMOS Calculator Circuitry”, IEEE J. Solid-State Circuits, Dec. 1973
Our simulations show:
• Small clock load (local clock buffering)
• Low-power feedback
• Big positive setup time
• Robustness to clock slope, unlike classic C2MOS structure

Advanced Flip-Flops

21264 Flip-Flop
• Used in Digital's WD21264 high-performance processor
• Runs at 600MHz
• 450ps Clk-Q delay, simulated in 0.35u technology
Our simulations show:
• Small clock load
• High internal power consumption
• S-R latch ruins the speed by 40%
• Dynamic nodes, potential hazard in low-power applications

Strong Arm 110 Flip-Flop
• Used in SA W low-power processor
• Runs at 200MHz
• One transistor more than flip-flop
• 450ps Clk-Q delay, simulated in 0.35u CMOS technology
Our simulations show:
• Additional transistor provides fully static operation (robustness to leakage currents), essential for low-power applications, but slightly increased internal power consumption

Flip-Flops
• First stage is a pulse generator
– generates a pulse (glitch) on a rising edge of the clock
• Second stage is a latch
– captures the pulse generated in the first stage
• Pulse generation results in a negative setup time
• Frequently exhibit a soft edge property
• Must check for hold time violations
Note: power is always consumed in the clocked pulse generator

Partovi’s HLFF
AMD K-6, Partovi, ISSCC’96
• Hybrid Latch-Flip-Flop combination
• 280ps Clk-Q delay
• Negative set-up time of ps
• Robustness to clock skew and fast clocking
Our simulations show:
• Hybrid design
• Gains: speed (negative setup time), robustness to clock skew
• Drawbacks: sensitivity to clock slope, relatively high internal power (due to precharge)

Hybrid Latch Flip-Flop
Skew absorption
Partovi et al, ISSCC’96

HLFF Flip-Flop
Flip-flop features:
• single-phase clock
• edge triggered, on one clock edge
Features:
• Soft clock edge property
– brief transparency, equal to 3 inverter delays
– negative setup time
– allows slack passing
– absorbs skew
• Hold time is comparable to HLFF delay
– minimum delay between flip-flops must be controlled
• Pseudo static
• Possible to incorporate logic

K-6 Dual-Rail ETL
• Self-reset property
• Hybrid combination
• 260ps Clk-Q delay, simulated in .35u CMOS technology
• Negative setup time: -20ps
• Small clock load
Our simulations show:
• Double-ended, precharge structure is the most power hungry (switching on all input combinations)
• Self-reset property increases power consumption, drives succeeding fast domino stages
• Precharge increases speed

Semi-Dynamic Flip-Flop
F. Klass, VLSI Circuits’98
• Hybrid combination used in UltraSPARC-III
• Very fast circuit (188ps Clk-Q delay, .25u technology, 1.6V, 105°C)
Our simulations show:
• Negative setup time
• Feature of small penalty for embedded logic
• Relatively high internal power consumption and clock load

Modified Sense Amplifier-Based Flip-Flop
Nikolic, Oklobdzija, Stojanovic, ISSCC, 1999
• Delay of each of the outputs is independent of the load on the other output
• Delay of Q and Q/not is symmetrical, as opposed to the NAND-based design
• Convenient for dual-rail logic; driving strength for standard CMOS is effectively doubled
• SAFF presents a small clock load, small setup time and all the advantages of the original design
• Possible tradeoff between speed and robustness to cross-talk

Modified Sense Amplifier-Based Flip-Flop
• The first stage is the unchanged sense amplifier
• Second stage is sized to provide maximum switching speed
• Driver transistors are large
• Keeper transistors are small and disengaged during transitions
Nikolic, Oklobdzija, Stojanovic, ISSCC ’99

New Sense Amplifier-Based Flip-Flop
• New pulse-generating stage
• Inverters relocated to de-couple gates of MN3, MN4
• MN5, MN6 provide leakage current paths
• Second stage is unchanged
Nikolic, Oklobdzija, ESSCIRC’99

Falling edge flip-flop
• Output stage has identical topology
Nikolic, Oklobdzija, ESSCIRC’99

Comparison with Other Flip-Flops
Delay vs. power comparison of different flip-flops
• Flip-flops are optimized for speed with output transistor sizes limited to 7.5µm/4.3µm, driving 200fF
• Total transistor gate width is indicated
Nikolic, Oklobdzija, ESSCIRC’99
[Figure: total power [uW] vs. delay [ps]. Total gate widths: TG M-S 52µm, Original SAFF 60µm, HLFF 54µm, this work 69µm, C2MOS 80µm, SDFF 49µm]

Overall results

Comparison in terms of speed and PDPtot
Delay: below 200ps SDFF ps HLFF ps K-6 ETL ps ps PowerPC latch ps 21264 Alpha FF ps Strong Arm FF ps mC2MOS latch ps above 500ps SSTC latch ps DSTC latch ps SSTC* latch ps DSTC* latch ps
PDPtot: below 30fJ PowerPC latch fJ fJ HLFF fJ SDFF fJ mC2MOS latch fJ 21264 Alpha FF fJ Strong Arm FF fJ fJ K-6 ETL fJ above 70fJ SSTC latch fJ DSTC latch fJ

Delay comparison
F-F design brings the fastest structures

Overall ranking, zoomed
Real signals have the activity between 0 and 0.25 ()
Precharged hybrid structures are the fastest but their power consumption strongly depends on the probability of “ones”
More “ones” above the  point

Overall performance
Real signals have the activity between 0 and 0.5 ()
Precharged hybrid structures are the fastest but their power consumption strongly depends on the probability of “ones”
More “ones” above the  point

Conventional Clk-Q vs. minimum D-Q
Hidden positive setup time
Degradation of Clk-Q

Internal Power distribution
Four sequences characterize the boundaries for internal power consumption:
• …010101… maximum
• random, equal transition probability, average
• …111111… precharge activity
• …000000… leakage + internal clock processing
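The four characterization sequences can be generated and checked with a few lines of code. This is a sketch under stated assumptions: the function name `transition_activity`, the sequence length of 1000 bits, and the fixed random seed are all arbitrary choices made for illustration.

```python
import random


def transition_activity(bits):
    """Fraction of consecutive bit pairs in which the data value changes."""
    if len(bits) < 2:
        return 0.0
    changes = sum(1 for a, b in zip(bits, bits[1:]) if a != b)
    return changes / (len(bits) - 1)


n = 1000                                   # assumed sequence length
alternating = [i % 2 for i in range(n)]    # ...010101..., maximum activity
all_ones = [1] * n                         # ...111111..., no data transitions
all_zeros = [0] * n                        # ...000000..., no data transitions
rng = random.Random(0)                     # fixed seed, arbitrary choice
random_bits = [rng.randint(0, 1) for _ in range(n)]

for name, seq in (("alternating", alternating), ("all ones", all_ones),
                  ("all zeros", all_zeros), ("random", random_bits)):
    print(f"{name}: activity = {transition_activity(seq):.2f}")
```

The two constant sequences have identical data activity, so any difference in measured internal power between them isolates precharge activity from leakage and internal clock processing.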

Comparison of Clock power consumption

Conclusion and New Directions

New Directions
• Reducing CSE power:
– Using conditional pre-charge techniques
– Using conditional data capture techniques
• Reducing clock distribution network power:
– Capture data on each edge: Double Edge Triggered structure
• Improving CSE reliability:
– Fully derived CSE (ESSCIRC’99, ICCD 2000)

Conditional Precharge Flip-Flop Circuit
The proposed flip-flop is shown. The first stage employs feedback from the output to disable the precharge and keep the internal node at the low level if Q is high <Mn4, Mp2>. The second stage implements a conditional keeping function <Mn8, Mp3, Mp4>.
Nedovic, Oklobdzija, SBCCI 2000

Conditional Capture Flip-Flop (Im-CCFF: Nedovic, Oklobdzija, ICECS 2001)
• Use conditional capture idea
• When Q=1, 1=>0 transition of X is prohibited
• To equalize 1=>0 and 0=>1 set-up times, the signal from the middle of the stack (Y) controls HL transition on Q
• Y is output of the first stage of domino-like inverter, obtained almost for free
• Easy logic embedding
• First stage has dynamic behavior only in transparency window
Improved Conditional Capture Flip-Flop:
The first stage computes nodes X and Y. If CLK=1, D=1, and CLKbb=Q=0 (i.e. if D=1, Q=0 in the transparency window), X evaluates to 0. The lower part of the stack is used for Y: Y=not(D) if the clock is at high level (CLK=1). X is the ‘conditional-capture signal’, with activity equal to the activity of D. Y has larger activity.
The second stage uses both X and Y. If X=0 (i.e. D=1, Q=0 in the transparency window), Q is brought to high level. If Y=1 when CLKbb=1 (i.e. D=0 in the transparency window), Q is brought to 0. CLKbb is used in the second stage instead of CLK to leave time for Y to evaluate to 0 and remove the hazard in the second stage.

Power Consumption Comparison: Im-CCFF: Nedovic, Oklobdzija, ICECS-2001
SBCCI 2000
NOTE: Conditional flip-flops behave like MS latches with respect to input data activity

Dual-Edge Triggered Flip-Flops
Structurally, two different designs are distinguished:
a) Latch-Mux (LM)
b) Pulsed Latch (PL, flip-flop)
Classification is very similar to single-edge-triggered SE

DETSE Overall Results

Summary: Double-Edge Flip-Flops
[Figure: PDP [fJ] and PD2P [10^-24 Js]]
Fujitsu 0.18µm, wmin = 0.22µm, wmax = 10µm, le = 0.18µm, fclk = 250/500MHz, activity = 0.5, VDD = 1.8V, Temp = 25º, load = 14 min. inv
• Even ‘local’ performance of DETFFs (not considering power savings of clock distribution) is comparable to that of SETFFs
• Analogy between double-edge flip-flop behavior and that of their single-edge counterparts

SDFF improvement: Nedovic, Oklobdzija ICCD 2000
• Eliminated glitch
• Avoided keeper overpowering
• Faster operation
• Improved power
• PDP improvement over SDFF about 27% (first version only 8% improvement)
• Preserved Logic Embedding Property
• Achieved strong driving capability at the output
• More robust to scaling down supply voltage
0.25u bulk CMOS, VDD=2.5V, T=27°C, fclk=500MHz, load=14 min. inv’s


What to Expect in the Future ?
Important:
• Incorporating logic into the CSE
• Absorbing clock skew
• Quiet state (battery powered applications)
Pipeline boundaries will start to blur:
• CSE will be mixed with logic
• Wave pipelining, domino style, signals used to clock
Synchronous design only in a limited domain:
• Asynchronous communication between synchronous domains

Modified Test Bench and PD2P Optimization

PDP, EDP Comparison
SDFF is best; PowerPC and SAFF are competitive

50%-Data-Activities -- 1GHz Clock -- PD2P Optimization
1.8V VDD, 0.18um CMOS Technology
