University of California Davis


University of California Davis Clocked Storage Elements for High-Performance and Low-Power Systems ICCD 2001 Tutorial Vojin G. Oklobdzija University of California Davis http://www.ece.ucdavis.edu/acsel Integration Corp. Berkeley, CA 94708 http://www.integration-corp.com

Outline: Importance of Clocked Storage Elements (CSE); Basic Definitions; Difference between Latch and Flip-Flop; Timing and Power Metrics; Representative Designs Used in High-Performance Microprocessors; Comparison; Conclusion, New Directions and Some Novel Designs. 9/16/2018 Prof. V.G. Oklobdzija, University of California

Importance of Clocked Storage Elements (CSE)

Trends in high-performance systems: Higher clock frequency

Power vs. Year: High-end growing at 25% / year; RISC @ 12% / yr; X86 @ 15% / yr; Consumer (low-end) at 13% / year

Predictions. Source: Shekhar Borkar, Intel

Recent Interest in Clocked Storage Elements. Trends in high-performance systems: higher clock frequency (1.8 GHz Pentium 4; 4 GHz logic presented); more transistors on chip (214 million, ISSCC 2001). Consequences: increased flip-flop overhead relative to cycle time; pipeline depth of 20 or more; cycle time of 10 - 20 FO4 delays, with flip-flop overhead of 3 - 4 FO4.

Courtesy: Doug Carmean, Hot-Chips-13 presentation

Processor Frequency Trend. Source: S. Borkar, Intel. Frequency doubles each generation. The number of gates per clock is reduced by 25%.

Pentium 3 uArchitecture: three pipeline stages, each consisting of logic followed by a register. Delay per stage: logic 0.6 ns, register 0.3 ns. The total delay from pipeline stage to pipeline stage is 0.9 ns. The maximum clock rate for this design is 1.1 GHz.

The Pentium 4 Depends on Pipelines: six pipeline stages, each consisting of logic followed by a register. Delay per stage: logic 0.4 ns, register 0.16 ns. The total delay from pipeline stage to pipeline stage is 560 ps. This design, with twice the stages, has a maximum clock rate of 1.8 GHz. As the design is broken into more pipeline stages, the logic in each stage has less delay, and the registers between stages consume a higher percentage of the delay, causing diminishing returns. At some point the costs of adding more stages, such as those of branch prediction, leave only a very marginal return. The only way out of this bottleneck is a faster register. This is one reason why the P4 is not significantly faster than a slower-clocked P3 for many applications.
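The arithmetic behind the two pipeline slides can be reproduced with a short sketch. The function names are illustrative, and the last loop is a hypothetical experiment that is not on the slides: it holds the register delay at 0.3 ns and splits a fixed 1.8 ns of logic into more stages, to show how register overhead grows when the register itself does not get faster.

```python
def max_clock_ghz(logic_ns: float, register_ns: float) -> float:
    """Maximum clock rate in GHz when one stage is logic followed by a register."""
    return 1.0 / (logic_ns + register_ns)


def register_overhead(logic_ns: float, register_ns: float) -> float:
    """Fraction of the stage delay that is spent in the register."""
    return register_ns / (logic_ns + register_ns)


# Stage delays quoted on the slides.
print(f"Pentium 3: {max_clock_ghz(0.6, 0.3):.2f} GHz")
print(f"Pentium 4: {max_clock_ghz(0.4, 0.16):.2f} GHz")

# Hypothetical sweep (assumption): 1.8 ns of total logic, register fixed at 0.3 ns.
TOTAL_LOGIC_NS = 1.8
REGISTER_NS = 0.3
for stages in (3, 6, 12):
    logic_ns = TOTAL_LOGIC_NS / stages
    print(
        f"{stages:2d} stages: {max_clock_ghz(logic_ns, REGISTER_NS):.2f} GHz, "
        f"register overhead {register_overhead(logic_ns, REGISTER_NS):.0%}"
    )
```

In this sweep, doubling the stage count from 3 to 6 and again to 12 raises the clock rate by less than a factor of two each time, because the fixed register delay takes up a growing share of the cycle.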

Courtesy: Doug Carmean, Hot-Chips-13 presentation

Why Interest in Clocked Storage Elements? Higher impact of storage element delay; high speed requires low CSE pipeline overhead: 3 FO4 or less; logic embedding property; limits on performance; FF delays of 10 ps - 100 ps; higher impact of clock skew; ability to control both edges of the clock; higher power consumption: >100 W for recent processors; clock system burns up to 40%, storage elements up to 20% of total power; battery-powered applications.

Basic Definitions

Clock Signals: Clocks are defined as pulsed, synchronizing signals that provide the time reference for the movement of data in the synchronous digital system. The clocking in a digital system can be either single-phase or multi-phase (usually two-phase). Clocking strategy depends on, and is largely influenced by, the choice of the CSE: latch or flip-flop. The dark rectangles in the figure represent the interval during which the bi-stable element samples its data input. Fig. 4.2 shows the possible types of clocking techniques and corresponding general finite-state machine structures.

Clock Signal Uncertainty. Effects on cycle time: maximum delay restriction (violation of setup time). May cause race: minimum delay restriction (violation of hold time). Uncertainty is: jitter, skew, and duty cycle.
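The two restrictions named on this slide can be written as simple inequalities. The sketch below assumes the standard edge-triggered formulation: a single worst-case uncertainty term lumps skew and jitter together and is charged against both checks. All names and numbers are hypothetical and in picoseconds.

```python
from dataclasses import dataclass


@dataclass
class StagePath:
    """Hypothetical timing of one register-to-register path, in ps."""
    t_clk_q: int      # clock-to-output delay of the launching element
    t_logic_max: int  # slowest path through the logic
    t_logic_min: int  # fastest path through the logic
    t_setup: int      # setup time of the capturing element
    t_hold: int       # hold time of the capturing element


def setup_ok(path: StagePath, period: int, uncertainty: int) -> bool:
    """Maximum delay restriction: late data must still meet setup."""
    return path.t_clk_q + path.t_logic_max + path.t_setup + uncertainty <= period


def hold_ok(path: StagePath, uncertainty: int) -> bool:
    """Minimum delay restriction: early data must not race through."""
    return path.t_clk_q + path.t_logic_min >= path.t_hold + uncertainty


path = StagePath(t_clk_q=100, t_logic_max=700, t_logic_min=20, t_setup=50, t_hold=80)
print("setup met:", setup_ok(path, period=1000, uncertainty=100))
print("hold met: ", hold_ok(path, uncertainty=100))
```

With these made-up numbers the path meets setup but fails hold, which is the race condition the slide refers to. Lengthening the clock period cannot fix it, since the period does not appear in the hold check.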

Jitter: uncertainty in consecutive edges of a periodic signal; caused by temporal noise events; quantified as cycle-to-cycle or short-term jitter, tJS, and long-term jitter, tJL.

Clock Skew: time difference between temporally-equivalent or concurrent edges of two periodic signals; caused by spatial noise events.

Clocking Strategies: single-phase clocking and single-latch machine; edge-triggered clocking and flip-flop-based machine.

Clocking Strategies: two-phase clocking and two-phase latch machine with single latch; two-phase clocking and two-phase latch machine with double latch.

Delay Restrictions: Clock defines hard boundaries for edge-triggered design. Clock boundaries are soft for level-sensitive clocking and they are: tolerant to clock edge uncertainty; tolerant to uncertainty of data arrival. Timing slack can voluntarily be passed forward. Time can forcefully be borrowed. *Taken from Hamid Partovi’s ISSCC-2000 GHz Processor Design Workshop presentation

Single-Phase Clocking, Single Latch: Timing Constraints

Two-Phase Clocking with Two-Phase Double Latch

Two-Phase Clocking with One-Phase Double Latch: Some people refer to this clocking arrangement as a “negative edge Flip-Flop”, erroneously!

Difference between Latch and Flip-Flop

Difference between Latch and Flip-Flop: After the transition of the clock, data cannot change. Latch is “transparent”.

Flip-Flop and M-S Latch Arrangement: How can one recognize the difference without knowing what is inside the “black box”?

F-F and M-S Latch: Difference. Experiment:

F-F and M-S Latch: Difference. Structural Difference: No Clock Flip-Flop M-S Latch

Flip-Flop vs. Latch. Flip-flop: edge sensitive; easier to use as frequency increases; robustness to duty cycle; simpler logic timing requirements; fits into CAD tools. Latch: level sensitive; may consume less power for the operation; better clock skew/jitter characteristics; more difficult clock requirements. The choice between FF and latch depends on each individual design and its specifications. Flip-flops are edge sensitive, with simpler timing requirements and lower sensitivity to duty cycle imperfections. Latches are level sensitive and simpler, with lower power consumption and better clock skew/jitter characteristics.

Flip-Flop: Example HLFF (Partovi)

Pulse-Based Flip-Flops* *Taken from Hamid Partovi’s ISSCC-2000 GHz Processor Design Workshop presentation

Flip-Flop: Example D=0 pulse D=1 SAFF DEC Alpha 21264

Requirements in the Flip-Flop Design: small Clk-Output delay; narrow sampling window; low power; small clock load; high driving capability (increased levels of parallelism). Typical load ranges from 3-4 FO4 to 15-25 FO4. High driving capability should be achieved by inserting inverters and following “logical effort” rules, starting with a minimal-size CSE. Symmetry: balanced D-Q and D-Q/not delay. Integration of logic into the flop. Multiplexed or clock scan. Cross-talk insensitivity: dynamic/high-impedance nodes are affected.

Timing and Power Metrics

Delay: The sum of setup time U and Clk-Q delay is the only true measure of performance with respect to system speed. T = TClk-Q + TLogic + Tsetup + Tskew

Delay vs. Setup/Hold Times

Timing Characteristics: A figure presenting typical clock-to-output and data-to-output characteristics is shown. In the stable region, the clock-to-output characteristic is constant. As the setup requirement of the device starts to be violated, the clock-to-output curve rises, ending in failure at some point. The data-to-output characteristic, being the simple sum of clock-to-output and data-to-clock time, falls with a slope of 45° in the stable region. In the metastable region, the slope starts to decrease as a function of the increased clock-to-output characteristic. The minimum of the data-to-output curve occurs at the 45° slope of the clock-to-output curve. The data-to-clock time that corresponds to this point is termed the optimal setup time.
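The optimal setup time described on this slide can be located numerically from a clock-to-output curve. The sketch below uses a made-up Clk-Q model: a constant stable delay plus an exponential rise as data arrives closer to the clock edge. The model and all of its parameters are assumptions for illustration, not measured data, and the model does not include the eventual capture failure.

```python
import math


def clk_q_ps(d_clk_ps: float, stable_ps: float = 150.0,
             rise_ps: float = 200.0, tau_ps: float = 20.0) -> float:
    """Hypothetical Clk-Q delay versus data-to-clock time, in ps.

    Flat at stable_ps when data arrives early; rises as data arrives late.
    """
    return stable_ps + rise_ps * math.exp(-d_clk_ps / tau_ps)


def d_q_ps(d_clk_ps: float) -> float:
    """Data-to-output delay: data-to-clock time plus Clk-Q delay."""
    return d_clk_ps + clk_q_ps(d_clk_ps)


def optimal_setup_ps(candidates) -> float:
    """Data-to-clock time that minimizes D-Q over the swept candidates."""
    return min(candidates, key=d_q_ps)


best = optimal_setup_ps(range(0, 301))  # sweep 0..300 ps in 1 ps steps
slope = (clk_q_ps(best + 1) - clk_q_ps(best - 1)) / 2.0
print(f"optimal setup time: {best} ps")
print(f"minimum D-Q:        {d_q_ps(best):.1f} ps")
print(f"Clk-Q slope there:  {slope:.2f}")
```

At the minimum of the D-Q curve the Clk-Q curve has a slope of about -1, which is the 45° condition the slide states.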

Timing parameters, details: The best point to pick on the delay curve is the minimum D-Q.

Simulation Condition and Testbench. Power: data activity dependence as an FF characteristic; consumption with 50% (30%) activity adopted as a figure of merit; dissipation of driving inverters is part of total power consumption. In order to perform evaluation and comparison of flip-flops, simulation conditions and a testbench for simulations are defined. They are set according to the flip-flop characterization presented earlier. Power consumption is measured with several different input activities; power consumption with input activity of 50% is adopted as a figure of merit. Total dissipation includes dissipation of the driving inverters.

Simulation Condition and Testbench. Timing: total FF overhead is setup + clock-to-output time; circuit optimization towards td-q; clock skew robustness obtained from observing the D-Q curve. Power-Delay Product: overall performance parameter at fixed frequency. The circuit delay parameter used for evaluation is data-to-output time. Circuits are optimized towards this parameter. The ultimate performance parameter is the power-delay product, measured at fixed clock frequency. It is calculated as the product of data-to-output time and total power consumption measured at the optimal setup time.

Flip-Flop Performance Comparison: Test bench. Total power consumed: internal power, data power, clock power. Measured for four cases: no activity (0000… and 1111…), maximum activity (0101010..), average activity (random sequence). Delay is (minimum D-Q) Clk-Q + setup time.
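The four test sequences can be told apart by their data activity. The sketch below assumes that activity means the fraction of clock cycles in which the data input changes value; the slides do not spell out the definition, so this is an assumption, and the function name is illustrative.

```python
import random


def data_activity(bits) -> float:
    """Fraction of consecutive cycle pairs in which the data bit changes."""
    transitions = sum(a != b for a, b in zip(bits, bits[1:]))
    return transitions / (len(bits) - 1)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 1000
sequences = {
    "0000... (no activity)": [0] * n,
    "1111... (no activity)": [1] * n,
    "0101... (maximum activity)": [i % 2 for i in range(n)],
    "random (average activity)": [rng.randint(0, 1) for _ in range(n)],
}
for name, bits in sequences.items():
    print(f"{name:28s} activity = {data_activity(bits):.2f}")
```

Under this definition the constant sequences have activity 0, the alternating sequence has activity 1, and a random sequence with equal transition probability comes out close to 0.5.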

The sources of internal power consumption

Design & optimization tradeoffs. Opposite goals: minimal total power consumption; minimal delay. Power-delay tradeoff: minimize the power-delay product (PDPtot).

Clocked Storage Elements in High-Performance Microprocessors

Master-Slave Latches: positive setup times; two clock phases: distributed globally, generated locally; small penalty in delay for incorporating MUX; some circuit tricks needed to reduce the overall delay.

PowerPC 603 M-S Latch Combination (Gerosa, JSSC 12/94). Used in PowerPC family: low power; high speed; big clock load; easily embedded scan function. Our simulations show: small internal power consumption; low-power feedback; double the clock load compared with other latches; locally generated second phase (reduces overall clock load).

mC2MOS M-S Latch. Our simulations show: small clock load (local clock buffering); low-power feedback; big positive setup time; robustness to clock slope, unlike the classic C2MOS structure. Y. Suzuki, “Clocked CMOS Calculator Circuitry”, IEEE J. Solid-State Circuits, Dec. 1973

Advanced Flip-Flops

21264 Flip-Flop: Used in Digital's WD21264 high-performance processor; runs at 600 MHz; 450 ps Clk-Q delay, simulated in 0.35u technology. Our simulations show: small clock load; high internal power consumption; the S-R latch degrades the speed by 40%; dynamic nodes are a potential hazard in low-power applications.

Strong Arm 110 Flip-Flop: Used in SA110 0.5 W low-power processor; runs at 200 MHz; one transistor more than the 21264 flip-flop; 450 ps Clk-Q delay, simulated in 0.35u CMOS technology. Our simulations show: the additional transistor provides fully static operation (robustness to leakage currents), essential for low-power applications, but slightly increases internal power consumption.

Flip-Flops: The first stage is a pulse generator: it generates a pulse (glitch) on a rising edge of the clock. The second stage is a latch: it captures the pulse generated in the first stage. Pulse generation results in a negative setup time. These flip-flops frequently exhibit a soft edge property. Hold time violations must be checked. Note: power is always consumed in the clocked pulse generator.

Partovi’s HLFF (AMD K-6, Partovi, ISSCC’96): Hybrid Latch-Flip-Flop combination; 280 ps Clk-Q delay; negative setup time of -100 ps; robustness to clock skew and fast clocking. Our simulations show: the hybrid design gains speed (negative setup time) and robustness to clock skew; drawbacks are sensitivity to clock slope and relatively high internal power (due to precharge).

Hybrid Latch Flip-Flop: Skew absorption. Partovi et al., ISSCC’96

HLFF Flip-Flop. Flip-flop features: single-phase clock; edge triggered, on one clock edge. Features: soft clock edge property; brief transparency, equal to 3 inverter delays; negative setup time; allows slack passing; absorbs skew. Hold time is comparable to HLFF delay; minimum delay between flip-flops must be controlled. Pseudo-static. Possible to incorporate logic.

K-6 Dual-Rail ETL: self-reset property; hybrid combination; 260 ps Clk-Q delay, simulated in 0.35u CMOS technology; negative setup time: -20 ps; small clock load. Our simulations show: the double-ended precharge structure is the most power hungry (switching on all input combinations); the self-reset property increases power consumption; drives succeeding fast domino stages; precharge increases speed.

Semi-Dynamic Flip-Flop (F. Klass, VLSI Circuits’98): hybrid combination used in UltraSPARC-III; very fast circuit (188 ps Clk-Q delay; 0.25u technology, 1.6 V, 105 °C). Our simulations show: negative setup time; small penalty for embedded logic; relatively high internal power consumption and clock load.

Modified Sense Amplifier-Based Flip-Flop (Nikolic, Oklobdzija, Stojanovic, ISSCC 1999): The delay of each output is independent of the load on the other output. The delays of Q and Q/not are symmetrical, as opposed to the NAND-based design. Convenient for dual-rail logic, and driving strength for standard CMOS is effectively doubled. SAFF presents a small clock load, a small setup time and all the advantages of the original design. Possible tradeoff between speed and robustness to cross-talk.

Modified Sense Amplifier-Based Flip-Flop: The first stage is an unchanged sense amplifier. The second stage is sized to provide maximum switching speed. Driver transistors are large. Keeper transistors are small and disengaged during transitions. Nikolic, Oklobdzija, Stojanovic, ISSCC ‘99

New Sense Amplifier-Based Flip-Flop: new pulse-generating stage; inverters relocated to de-couple the gates of MN3, MN4; MN5, MN6 provide leakage current paths; second stage is unchanged. Nikolic, Oklobdzija, ESSCIRC’99

New Sense Amplifier-Based Flip-Flop: falling-edge flip-flop; output stage has identical topology. Nikolic, Oklobdzija, ESSCIRC’99

Comparison with Other Flip-Flops: Delay vs. power comparison of different flip-flops. Flip-flops are optimized for speed with output transistor sizes limited to 7.5 µm/4.3 µm, driving 200 fF. Total transistor gate width is indicated. Nikolic, Oklobdzija, ESSCIRC’99. Plot of total power [uW] vs. delay [ps], with total gate widths: TG M-S 52 µm; Original SAFF 60 µm; HLFF 54 µm; this work 69 µm; C2MOS 80 µm; SDFF 49 µm.

Overall results

Comparison in terms of speed and PDPtot
Delay, below 200 ps: SDFF 187 ps; HLFF 199 ps; K-6 ETL 200 ps
Delay, 200 - 300 ps: PowerPC latch 266 ps; 21264 Alpha FF 272 ps; Strong Arm FF 275 ps; mC2MOS latch 292 ps
Delay, above 500 ps: SSTC latch 592 ps; DSTC latch 629 ps; SSTC* latch 898 ps; DSTC* latch 1060 ps
PDPtot, below 30 fJ: PowerPC latch 28 fJ
PDPtot, 30 - 50 fJ: HLFF 29 fJ; SDFF 39 fJ; mC2MOS latch 40 fJ; 21264 Alpha FF 43 fJ; Strong Arm FF 45 fJ
PDPtot, 50 - 70 fJ: K-6 ETL 70 fJ
PDPtot, above 70 fJ: SSTC latch 95 fJ; DSTC latch 125 fJ
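The two rankings on this slide can be cross-checked with a few lines of code. The delay and PDPtot values below are copied from the slide. The implied power figure is a derived estimate only: it assumes that the listed delay is the same data-to-output delay that was multiplied by total power to obtain PDPtot, which the slide does not state explicitly. The function and variable names are illustrative.

```python
# (delay in ps, PDPtot in fJ) as listed on the slide; entries without a
# listed PDPtot (SSTC*, DSTC*) are left out.
results = {
    "SDFF": (187, 39),
    "HLFF": (199, 29),
    "K-6 ETL": (200, 70),
    "PowerPC latch": (266, 28),
    "21264 Alpha FF": (272, 43),
    "Strong Arm FF": (275, 45),
    "mC2MOS latch": (292, 40),
    "SSTC latch": (592, 95),
    "DSTC latch": (629, 125),
}


def implied_power_uw(delay_ps: float, pdp_fj: float) -> float:
    """Power implied by PDP = power x delay (assumption), in microwatts."""
    return 1000.0 * pdp_fj / delay_ps  # fJ / ps = mW, so scale to uW


fastest = min(results, key=lambda name: results[name][0])
lowest_pdp = min(results, key=lambda name: results[name][1])
print("fastest:      ", fastest)
print("lowest PDPtot:", lowest_pdp)

for name, (delay_ps, pdp_fj) in sorted(results.items(), key=lambda kv: kv[1][1]):
    print(f"{name:15s} {delay_ps:4d} ps {pdp_fj:4d} fJ "
          f"~{implied_power_uw(delay_ps, pdp_fj):6.1f} uW (derived)")
```

The fastest element and the one with the lowest power-delay product are different designs, which is the speed versus power tradeoff the comparison is meant to expose.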

Delay comparison: F-F design brings the fastest structures

Overall ranking, zoomed: Real signals have activity between 0 and 0.25. Precharged hybrid structures are the fastest, but their power consumption strongly depends on the probability of “ones”. More “ones” above the  point

Overall performance: Real signals have activity between 0 and 0.5. Precharged hybrid structures are the fastest, but their power consumption strongly depends on the probability of “ones”. More “ones” above the  point

Conventional Clk-Q vs. minimum D-Q: hidden positive setup time; degradation of Clk-Q

Internal Power Distribution: Four sequences characterize the boundaries for internal power consumption: …010101… maximum; random, equal transition probability: average; …111111… precharge activity; …000000… leakage + internal clock processing.

Comparison of Clock power consumption

Conclusion and New Directions

New Directions. Reducing CSE power: using conditional pre-charge techniques; using conditional data capture techniques. Reducing clock distribution network power: capture data on each edge (Double Edge Triggered structure). Improving CSE reliability: fully derived CSE (ESSCIRC’99, ICCD 2000).

Conditional Precharge Flip-Flop: Circuit. The proposed flip-flop is shown. The first stage employs feedback from the output to disable the precharge and keep the internal node at the low level if Q is high <Mn4, Mp2>. The second stage implements a conditional keeping function <Mn8, Mp3, Mp4>. Nedovic, Oklobdzija, SBCCI 2000

Conditional Capture Flip-Flop (Im-CCFF: Nedovic, Oklobdzija, ICECS 2001): Uses the conditional capture idea. When Q=1, the 1=>0 transition of X is prohibited. To equalize the 1=>0 and 0=>1 setup times, the signal from the middle of the stack (Y) controls the HL transition on Q. Y is the output of the first stage of a domino-like inverter, obtained almost for free. Easy logic embedding. The first stage has dynamic behavior only in the transparency window. Improved Conditional Capture Flip-Flop: The first stage computes nodes X and Y. If CLK=1, D=1, and CLKbb=Q=0 (i.e. if D=1, Q=0 in the transparency window), X evaluates to 0. The lower part of the stack is used for Y: Y=not(D) if the clock is at high level (CLK=1). X is the ‘conditional-capture signal’, with activity equal to the activity of D. Y has larger activity. The second stage uses both X and Y: if X=0 (i.e. D=1, Q=0 in the transparency window), Q is brought to high level. If Y=1 when CLKbb=1 (i.e. D=0 in the transparency window), Q is brought to 0. CLKbb is used in the second stage instead of CLK to leave time for Y to evaluate to 0 and remove a hazard in the second stage.

Power Consumption Comparison: Im-CCFF: Nedovic, Oklobdzija, ICECS-2001 SBCCI 2000 NOTE: Conditional flip-flops behave like MS latches with respect to input data activity

Dual-Edge Triggered Flip-Flops: Structurally, two different designs are distinguished: a) Latch-Mux (LM); b) Pulsed Latch (PL, flip-flop). The classification is very similar to single-edge-triggered SE.

DETSE Overall Results

Summary: Double-Edge Flip-Flops. Plots of PDP [fJ] and PD2P [10^-24 Js]. Fujitsu 0.18 µm, wmin = 0.22 µm, wmax = 10 µm, le = 0.18 µm, fclk = 250/500 MHz, activity = 0.5, VDD = 1.8 V, Temp = 25º, load = 14 min. inv. Even the ‘local’ performance of DETFFs (not considering power savings of clock distribution) is comparable to that of SETFFs. Analogy between the behavior of double-edge flip-flops and their single-edge counterparts.

SDFF improvement: Nedovic, Oklobdzija, ICCD 2000. Eliminated glitch; avoided keeper overpowering; faster operation; improved power; PDP improvement over SDFF about 27% (first version only 8% improvement); preserved logic embedding property; achieved strong driving capability at the output; more robust to scaling down supply voltage. 0.25u bulk CMOS, VDD = 2.5 V, T = 27 C, fclk = 500 MHz, load = 14 min. inv’s

What to Expect in the Future? Important: incorporating logic into the CSE; absorbing clock skew; quiet state (battery-powered applications). Pipeline boundaries will start to blur. CSE will be mixed with logic. Wave pipelining, domino style, signals used to clock. Synchronous design only in a limited domain. Asynchronous communication between synchronous domains.

Modified Test Bench and PD2P Optimization

PDP, EDP Comparison: SDFF is best; PowerPC and SAFF are competitive

50%-Data-Activities -- 1GHz Clock -- PD2P Optimization; 1.8VDD, 0.18um CMOS Technology