The Cost of Fixing Hold Time Violations in Sub-threshold Circuits
Yanqing Zhang, Benton Calhoun
University of Virginia


Motivation and Background

[Figure: axes labeled Power and Performance]

Near- or sub-threshold circuits are vital for low power
  • Power wall imminent for high-end applications
  • Battery life/form factor constraints for low-end applications

New design problems in sub-threshold
  • Performance degradation
  • More susceptible to process variation
  • Smaller Ion/Ioff: less noise tolerance
  • Different timing characteristics
  • Hold time is one such problem

Hold time is important!
  • Shift register structures in computer architecture, e.g. re-order buffer, result bus reservation, etc.
  • Test structures, e.g. scan chains

Much more problematic in sub-threshold
  • More susceptible to effects of process variations
  • Long variation distribution tail
  • Affects clock skew, slew, and logic delay
  • Largely uncorrelated, so higher chance of failure!

Are conventional methods adequate?
  • Improving the clock network costs power/energy
  • Excessive hold buffer insertion is costly
  • Could undermine the purpose of low power

[Figure: clock skew (t_SKEW) diagram, annotated "Excessive buffer insertion is COSTLY" and "Clock network optimization is COSTLY"]

Thus, need to analyze the new design space
  • How to adapt to sub-threshold?
  • How to design in sub-threshold?
  • Are other, alternative methods needed?
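For reference, the hold constraint that skew and delay variation threaten can be written as a simple slack check. The sketch below uses the standard textbook relation (slack = clock-to-Q delay + minimum data-path delay − hold time − clock skew); it is not taken from the poster, and the function names and numeric values are illustrative assumptions.

```python
def hold_slack(t_clk_to_q, t_logic_min, t_hold, t_skew):
    """Return the hold slack of one register-to-register path.

    All arguments share one time unit (e.g. ns). t_skew is how much later
    the capture clock arrives than the launch clock. Negative slack means
    new data races through before the capture register has held the old
    data long enough, i.e. a hold violation.
    """
    return t_clk_to_q + t_logic_min - t_hold - t_skew


def has_hold_violation(t_clk_to_q, t_logic_min, t_hold, t_skew):
    """True when the path fails its hold check."""
    return hold_slack(t_clk_to_q, t_logic_min, t_hold, t_skew) < 0.0


# Shift-register case: registers are back to back, so t_logic_min = 0.
# The numbers below are made up for illustration only.
slack_balanced = hold_slack(t_clk_to_q=2.0, t_logic_min=0.0, t_hold=1.0, t_skew=0.5)
slack_skewed = hold_slack(t_clk_to_q=2.0, t_logic_min=0.0, t_hold=1.0, t_skew=1.5)
# A hold buffer adds data-path delay and restores positive slack.
slack_buffered = hold_slack(t_clk_to_q=2.0, t_logic_min=1.0, t_hold=1.0, t_skew=1.5)
```

With no logic between registers, the only ways to recover slack are to reduce skew (clock network effort) or to add data-path delay (hold buffers), which are the two costs the poster compares.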
Tool Flow and Simulation Test Setup

45 nm PTM standard cell library used
  • High Vt for low power
  • TT corner
  • Vt-only variation (Gaussian distribution)

Library condition
  • V_DD = 0.35 V, in contrast to using nominal V_DD and then scaling V_DD down
  • Captures sub-threshold delays
  • Nominal margins

Standard synthesis flow
  • Synthesis, place and route
  • Power-aware clock design
  • Simplified delay model for simulation; wire RCs not accounted for

[Figure: tool flow. Standard cell library -> library characterization (Cells.lib, timing arcs at V_DD = 0.35 V) -> synthesis, place and route]

Monte Carlo hold time simulations
  • 128-stage shift register as design under test
  • Each design case subject to 100 iterations (simulation time considerations)

Sweep amount of buffer insertion
  • Hold constraint slowly increased
  • Place and route tool performs timing closure
  • Buffer penalty measured as the power overhead needed to shift data from input to output of the shift register

Sweep design of clock network
  • Both slew and skew are design variables
  • Clock tree synthesis also done by the EDA tool
  • Clock overhead measured as the power needed to shift data from input to output of the shift register

[Figure: 128-stage shift register (In to Out), annotated with the three sweeps: buffer insertion, slew, skew]

Power breakdown:
  1. P_reg = register power
  2. P_clk = clock network power
  3. P_hold = hold buffer power

[Figure: Effects of Process Variation on Cell Delay in Sub-threshold. Count (% total) vs. cell delay, from µ−2σ to µ+3σ]

Results: Effects of Skew

Test setup:
  • Iso-slew at register
  • Same amount of buffer insertion
  • Constant number of clock tree levels (4)
  • Number of clock tree branches (skew) swept

Observations
  • Skew is a major factor
  • Yields are very low for skews > 2 clock buffer delays
  • Process variation is the culprit in undermining clock path balancing
  • Tendency is more levels of clock tree = worse skew (NOT more balancing)!

[Figure: Skew Effects on Yield. Yield (%) vs. max skew (# of clock buffer delays, 1 to 4)]

Results: Effects of Slew

Test setup:
  • 2-level, 4-branch clock tree used (drive sufficient)
  • Iso-skew with similar clock topology
  • 800 ns clock at the clock input
  • Case 1: max allowed clock buffer swept (8X, 16X, 32X, …), no hold buffer insertion
  • Case 2: min clock buffer (8X), hold buffer insertion

Observations
  • Slew is not the most effective hold time solution
  • Little change in yield from improving slew
  • Clock energy becomes expensive
  • For the same power budget, (smaller clock tree + buffer insertion) > (bigger clock tree, no buffer insertion)

[Figure: Case 1 vs. Case 2. Normalized power consumption (P_clk, P_reg, P_hold) and yield (%) for "32X clock tree, no buffers" and "8X clock tree, hold buffers"]

[Figure: Slew Effects on Yield. Relative power consumption (P_clk, P_reg) and yield (%) vs. slew (ns) for 8X, 16X, and 32X clock buffers]

Results: Cost of Buffer Insertion

Test setup:
  • Simple 2-level, 4-branch clock tree used (drive sufficient)
  • Minimizes skew (1 clock buffer delay)
  • Optimum clock at the clock input

Observations
  • Buffers are VERY expensive (>50% of total power)
  • Different size buffers used; data slew is a factor
      - Small buffers add logic delay
      - Large buffers improve data slew
  • Steep penalty as yield increases

[Figure: Cost of Buffer Insertion in Hold-time Fix. % power overhead of buffers and total circuit power (normalized; P_clk, P_reg, P_hold) vs. yield (%)]

Concluding Remarks

Conclusions:
  • Slew is the least effective variable for hold fixing
  • For a given register load, use smaller clock trees
  • Hold buffer insertion is expensive (>50% of total!)
  • Yield requirements may compromise low power
  • Complex clock trees fail miserably

Other solutions worth looking into?
  • Scaling conventional methods into sub-threshold is worrisome
  • Larger designs mean inherently complex clock trees, so skew is a major player
  • Buffer insertion is proven to carry great overhead, so other methods are needed
  • Better place and route algorithms?
  • Delay cell design?
  • Timing scheme 'tricks'?
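The Monte Carlo hold-time experiment described in the test setup (128-stage shift register, Gaussian Vt variation, 100 iterations per design case) can be approximated with a toy model. The sketch below is not the authors' flow, which used a characterized 45 nm library and EDA tools. It assumes that cell delay scales as exp(ΔVt / (n·kT/q)), that all delays are normalized to one nominal buffer delay, and that skew equals a whole number of clock-buffer delays. `SIGMA_VT`, `N_VT`, the default `t_hold`, and all function names are hypothetical.

```python
import math
import random

N_STAGES = 128    # shift-register length (from the poster)
N_ITER = 100      # Monte Carlo iterations per design case (from the poster)
SIGMA_VT = 0.03   # ASSUMED Vt standard deviation in volts
N_VT = 0.039      # ASSUMED n * kT/q in volts (slope factor ~1.5 at room temperature)


def sample_delay(nominal, rng):
    """Delay of one cell under a Gaussian Vt shift.

    Sub-threshold current is exponential in Vt, so a Gaussian Vt shift
    produces a long-tailed (lognormal) delay distribution.
    """
    d_vt = rng.gauss(0.0, SIGMA_VT)
    return nominal * math.exp(d_vt / N_VT)


def stage_meets_hold(n_hold_bufs, n_skew_bufs, t_hold, rng):
    """True if one register-to-register stage has non-negative hold slack."""
    t_clk_to_q = sample_delay(1.0, rng)
    # Each hold buffer adds one independently varying delay to the data path.
    t_data = sum(sample_delay(1.0, rng) for _ in range(n_hold_bufs))
    # Skew is modeled as a number of independently varying clock-buffer delays.
    t_skew = sum(sample_delay(1.0, rng) for _ in range(n_skew_bufs))
    return t_clk_to_q + t_data - t_hold - t_skew >= 0.0


def hold_yield(n_hold_bufs, n_skew_bufs, t_hold=0.5, seed=0):
    """Percentage of Monte Carlo iterations in which every stage meets hold."""
    rng = random.Random(seed)
    passing = 0
    for _ in range(N_ITER):
        if all(stage_meets_hold(n_hold_bufs, n_skew_bufs, t_hold, rng)
               for _ in range(N_STAGES)):
            passing += 1
    return 100.0 * passing / N_ITER


if __name__ == "__main__":
    # Sweep hold-buffer count at a fixed skew of one clock-buffer delay.
    for bufs in (0, 1, 2, 4):
        print(bufs, "hold buffers ->", hold_yield(bufs, n_skew_bufs=1), "% yield")
```

Because a chip passes only if all 128 stages pass, a single draw from the long delay tail is enough to fail an iteration. This is one way to see why yield is so sensitive to skew and why many hold buffers may be needed to reach high yield.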