An Efficient Surface-Based Low-Power Buffer Insertion Algorithm

Presentation transcript:

An Efficient Surface-Based Low-Power Buffer Insertion Algorithm
Rajeev R. Rao, David Blaauw, Dennis Sylvester, Charles Alpert*, Sani Nassif*
Department of EECS, University of Michigan, Ann Arbor, MI
*IBM Austin Research Laboratory, Austin, TX
{rrrao, blaauw, dennis}@eecs.umich.edu, {alpert, nassif}@us.ibm.com

Interconnect Trends
Interconnect power a major issue
Huge power consumption in both global and local signal nets
Repeater counts increasing drastically
IBM: 50% of leakage in inverters/buffers
Assuming continuation of current design styles, dramatic projections for the 32nm technology node
70% of cell count = repeaters
65-80% of dynamic power due to interconnects
Leakage increasing exponentially
Required: optimal repeater usage with the objective of total power minimization
[Figure: Total dynamic power breakdown. Source: N. Magen, SLIP'04]
[Figure: % repeater cells in block-level nets (clk-rep, rep, tot-rep) at the 90nm, 65nm, 45nm and 32nm nodes. Source: P. Saxena, ISPD'04]

Outline
Introduction: Delay and Buffer models
Previous Work
Proposed Algorithm: Library characterization; Generation of different types of candidates; Merging, Propagation, Snapping
Results
Conclusion

Introduction
Wire RC delay is a quadratic function of wire length.
Segmenting wires decreases delay.
The same idea applies to interconnect tree structures.
Buffers are inserted for delay management.
Additional benefit: buffers/inverters decouple large output loads.
Driver to Receiver, unbuffered: wire length = 2, wire delay $\propto (2)^2 = 4$.
Driver to Repeater to Receiver, two segments of length 1: wire length = 2, wire delay $\propto (1)^2 + (1)^2 = 2$.
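The arithmetic on this slide can be reproduced with a minimal sketch. The function names are hypothetical, delays are in relative units (proportional to length squared), and the repeater's own delay is ignored, as on the slide.

```python
def wire_delay_units(length):
    """Relative wire RC delay: proportional to the square of the length."""
    return length ** 2


def segmented_delay_units(length, segments):
    """Relative delay of the same wire cut into equal segments.

    Repeater delay is ignored here, as in the slide's arithmetic.
    """
    piece = length / segments
    return segments * wire_delay_units(piece)


unbuffered = wire_delay_units(2)             # one wire of length 2
with_repeater = segmented_delay_units(2, 2)  # two wires of length 1
print(unbuffered, with_repeater)
```

In practice the repeater adds its own delay, so segmentation only pays off once the wire is long enough for the quadratic term to dominate.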

Elmore Delay model
Represent the interconnect tree with a lumped RC model.
Assume the binary tree topology is fixed by an initial Steiner tree estimation, with n vertices (branch points) and (n-1) edges (i.e., wires).
For a wire e connecting vertices (u, v), the Elmore delay is: [equation not preserved in the transcript], where T(v) is the maximal subtree rooted at v that does not contain buffers.
The total delay from a vertex v to a sink node $s_i$ is: [equation not preserved in the transcript].
Source: Digital Int. Circuits, J. Rabaey
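As an illustration of the model, the sketch below uses the standard Elmore expression for a wire, $R_e (C_e/2 + C(T(v)))$, and sums it along the path to each sink. This is the textbook form and is assumed here, not taken from the slide. The `Node` class, its field names and the example values are hypothetical, and the example tree contains no buffers, so T(v) is simply the whole subtree below v.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    r_wire: float = 0.0   # resistance of the wire from the parent to this node
    c_wire: float = 0.0   # capacitance of that wire
    c_sink: float = 0.0   # sink load (non-zero only at leaves)
    children: list = field(default_factory=list)


def downstream_cap(v):
    """Total capacitance of the subtree below v (wires plus sink loads)."""
    return v.c_sink + sum(ch.c_wire + downstream_cap(ch) for ch in v.children)


def wire_elmore_delay(v):
    """Elmore delay of the wire entering v: R_e * (C_e / 2 + C(T(v)))."""
    return v.r_wire * (v.c_wire / 2.0 + downstream_cap(v))


def sink_delays(v, acc=0.0, out=None):
    """Accumulate wire delays along each root-to-sink path."""
    out = {} if out is None else out
    acc += wire_elmore_delay(v)
    if not v.children:
        out[v.name] = acc
    for ch in v.children:
        sink_delays(ch, acc, out)
    return out


s1 = Node("s1", r_wire=1.0, c_wire=1.0, c_sink=1.0)
s2 = Node("s2", r_wire=2.0, c_wire=1.0, c_sink=2.0)
u = Node("u", r_wire=1.0, c_wire=2.0, children=[s1, s2])
root = Node("source", children=[u])
delays = sink_delays(root)
print(delays)
```

Recomputing `downstream_cap` at every node is quadratic in the worst case; a bottom-up pass that caches capacitances would avoid that, but the recursive form keeps the definition visible.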

Buffer model
Linear gate delay model used for the buffers
Assumption: delay is a linear function of output capacitance
$D_{\text{buffer}} = D_{\text{intrinsic-delay}} + R_{\text{intrinsic-resistance}} \cdot C_{\text{output-load}}$
Isolation property: buffer devices decouple "downstream" output loads from the parent trees
Assumption: Miller effect ("bootstrapping") due to $C_{gd}$ is negligible
Node v "sees" a downstream load = $C_{buf}$; $C_{load}$ is "invisible" to v
[Figure labels: v, Cgd, Cbuf, Cload]
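A minimal sketch of the two properties above, with hypothetical function names and made-up numbers: the linear delay equation, and the isolation property expressed as "the load seen upstream of a buffer is only the buffer's input capacitance".

```python
def buffer_delay(d_intrinsic, r_intrinsic, c_output_load):
    """Linear gate delay: intrinsic delay plus resistance times output load."""
    return d_intrinsic + r_intrinsic * c_output_load


def upstream_load(c_buf, c_load, buffered):
    """Load seen by node v: C_buf if a buffer isolates the subtree, else C_load.

    The Miller effect through C_gd is neglected, as assumed on the slide.
    """
    return c_buf if buffered else c_load


print(buffer_delay(10.0, 2.0, 5.0))      # delay of a buffer driving a load of 5
print(upstream_load(1.0, 50.0, True))    # the buffer hides the load of 50 from v
```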

Buffer Insertion Problem
[Figure: routing tree with source, sinks and legal buffer positions; buffer library BufLib = {b1, b2, b3, …}]
Timing metrics
Required Arrival Time (RAT): each sink is given a RAT($s_i$) value and the source is fixed at RAT($s_o$) = 0. Delay minimization ⇔ maximize the slack at the source, q($s_o$).
Subtree Delay (SD): $SD(s_i) = RAT_{max}(s_i) - RAT(s_i)$. Delay minimization ⇔ minimize SD($s_o$).
Advantage: unlike RAT, equations using SD are additive.
Our approach
Tradeoff surfaces in the 3D space of delay, capacitance and power
Continuously sized buffer libraries
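The sketch below shows how subtree delays could be derived from sink RATs and combined at a parent. It assumes that RAT_max denotes the largest RAT over all sinks, which is one reading of the slide's notation and not confirmed by it; the function names and the numbers are hypothetical.

```python
def subtree_delays(rat_by_sink):
    """SD(s_i) = RAT_max - RAT(s_i), assuming RAT_max is the largest sink RAT."""
    rat_max = max(rat_by_sink.values())
    return {sink: rat_max - rat for sink, rat in rat_by_sink.items()}


def parent_subtree_delay(children):
    """SD at a parent: worst child SD plus the delay from parent to that child.

    children is a list of (sd_child, delay_parent_to_child) pairs.
    """
    return max(sd + delay for sd, delay in children)


sd = subtree_delays({"s1": 300, "s2": 700})
sd_parent = parent_subtree_delay([(sd["s1"], 50), (sd["s2"], 600)])
print(sd, sd_parent)
```

Delays are added to SD on the way up the tree, whereas with RAT they are subtracted and a minimum is taken; this is the sense in which the SD equations are additive.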

Previous Work
L. P. P. P. van Ginneken (VG), ISCAS'90
Two-phase dynamic programming algorithm
Backward traversal up the interconnect tree to compute load and delay values
Forward solution pass to reconstruct the "best" candidate
Function BOTTOM_UP(v)
  If v ∈ sinks { return (Cv, SDv) } Else
  /* compute options for subtrees */
  BOTTOM_UP( left(v) )
  BOTTOM_UP( right(v) )
  Join pairs of subtrees by a merge operation
  Find best cnd among merged cnds to add a buffer
  Add parent wire to both types of cnds
  Prune inferior cnds from set of cnds
  Store cnd list for node v and return
Post-order DFS traversal
Merge operation: Cparent = Cleft + Cright, SDparent = max(SDleft, SDright)
Buffer candidate creation
Pruning provably inferior candidates
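The merge step can be sketched as a two-pointer walk over the children's candidate lists. This is an illustrative sketch, not the authors' code: it assumes each list holds non-dominated (load, subtree delay) pairs sorted by increasing load, and therefore decreasing subtree delay; the function name and data are hypothetical.

```python
def merge(left, right):
    """Merge two candidate lists of (load, subtree_delay) pairs.

    Each merged candidate has C = C_left + C_right and SD = max(SD_left, SD_right).
    Only the child with the larger SD is advanced, because advancing the other
    child would add load without reducing the maximum.
    """
    i = j = 0
    merged = []
    while i < len(left) and j < len(right):
        (c_l, sd_l), (c_r, sd_r) = left[i], right[j]
        merged.append((c_l + c_r, max(sd_l, sd_r)))
        if sd_l >= sd_r:
            i += 1
        if sd_r >= sd_l:
            j += 1
    return merged


left = [(1, 10), (3, 4)]
right = [(2, 8), (5, 3)]
print(merge(left, right))
```

Each iteration advances at least one pointer, so the walk takes time linear in the combined list length.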

VG Algorithm
Candidate format: 2-tuple (Load, Subtree Delay) = (c, s)
Recursive formulas for the two possible cases, taking a candidate $(c_0, s_0)$ to $(c_1, s_1)$:
Only a wire is added at the root of the subtree: $c_1 = c_0 + c_{wire}$, $s_1 = s_0 + d_{wire}$
A buffer and a wire are added at the root of the subtree: $c_1 = c_{buf} + c_{wire}$ (isolation property), $s_1 = s_0 + d_{int} + r_{buf} \cdot c_0 + d_{wire}$
Pruning criterion: $(c_1, s_1)$ is "better" than $(c_2, s_2)$ if both the load and the subtree delay are lower, i.e., $c_1 < c_2$ and $s_1 < s_2$
Merge operation is linear
Complexity = $O(n^2)$, where n = number of buffer locations
Additional objective: minimize buffer count ⇒ complexity is non-polynomial
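The two recursive cases and the pruning rule can be sketched as follows. The function names are hypothetical, and the wire delay is computed with the Elmore expression r_wire * (c_wire / 2 + downstream load), which is an assumption on my part since the slide only names it $d_{wire}$. The pruning here also drops candidates that tie on one coordinate and lose on the other, which is slightly stricter than the strict inequality on the slide.

```python
def add_wire(cand, r_wire, c_wire):
    """Wire case: c1 = c0 + c_wire, s1 = s0 + d_wire."""
    c0, s0 = cand
    d_wire = r_wire * (c_wire / 2.0 + c0)  # Elmore delay of the wire (assumed form)
    return (c0 + c_wire, s0 + d_wire)


def add_buffer(cand, c_buf, d_int, r_buf):
    """Buffer at the subtree root: load becomes c_buf, delay grows by d_int + r_buf * c0."""
    c0, s0 = cand
    return (c_buf, s0 + d_int + r_buf * c0)


def prune(cands):
    """Keep candidates that no other candidate beats in both load and delay."""
    kept = []
    best_s = float("inf")
    for c, s in sorted(cands):       # increasing load, ties by increasing delay
        if s < best_s:               # strictly better delay than every lighter candidate
            kept.append((c, s))
            best_s = s
    return kept


cand = (4.0, 10.0)
wire_only = add_wire(cand, r_wire=1.0, c_wire=2.0)
buffered = add_wire(add_buffer(cand, c_buf=1.0, d_int=2.0, r_buf=0.5),
                    r_wire=1.0, c_wire=2.0)
print(wire_only, buffered, prune([wire_only, buffered, (7.0, 15.0)]))
```

Applying `add_buffer` and then `add_wire` reproduces the buffered case on the slide: the resulting load is c_buf + c_wire and the wire delay is evaluated against c_buf only.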

Previous Work
Extensions to VG by Lillis et al., ICCAD'95, JSSC'96
A buffer library B can be used during buffer insertion ⇒ complexity = $O(n^2 |B|^2)$
Simultaneous wire sizing and buffer insertion
Incorporate signal slew into the buffer delay model
Dynamic power minimization subject to timing constraints
Candidate format: 3-tuple (Load, Subtree Delay, Power) = (c, s, p)
Equate power with effective "total" capacitance
Assumption: all capacitance values can be linearly mapped onto a polynomially bounded integer domain ($c_{max}$ = max cap value)
Sophisticated pruning mechanism using orthogonal range queries
Complexity = $O(n^3 |B| c_{max}^2 \log(n c_{max}))$, based on this assumption
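To make the 3-tuple pruning concrete, here is a brute-force dominance filter over (c, s, p) candidates. It is a plain quadratic check written for clarity, not the orthogonal-range-query structure mentioned on the slide; the function names and data are hypothetical.

```python
def dominates(a, b):
    """a dominates b if a is no worse in load, delay and power, and differs from b."""
    return a != b and all(x <= y for x, y in zip(a, b))


def prune_3d(cands):
    """Return the non-dominated (load, subtree_delay, power) candidates, sorted."""
    unique = set(cands)
    return sorted(c for c in unique
                  if not any(dominates(other, c) for other in unique))


cands = [(1, 5, 3), (2, 6, 4), (2, 4, 5), (1, 5, 3)]
print(prune_3d(cands))
```

With a third coordinate, far fewer candidates dominate one another than in the 2-tuple case, which is why candidate lists grow and the complexity bound on the slide depends on $c_{max}$.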

Previous Work
Several approaches in the literature target power minimization in conjunction with buffer insertion. Examples:
Quadratic programming: Chu et al., TCAD'99
Lagrangian relaxation: C.-P. Chen et al., TCAD'99
ClockTune: J.-L. Tsai et al., TCAD'04
Associate total power with the effective capacitive area of wires + devices: area minimization ≡ power minimization
Ignores the contribution of static leakage power; including this component results in non-polynomial complexity
Adding extra components to candidates generally leads to exponential complexity for dynamic programming

Contributions of this paper
Novel "continuous" buffer insertion algorithm with total power minimization, inclusive of both dynamic and leakage power
Generate tradeoff surfaces in the 3D DCP (Delay, Capacitance, Power) space
User is able to pick any desired point on this 3D surface
Easy to explore trade-offs between the 3 variables
Ability to handle arbitrarily large buffer libraries
Continuously sized cell libraries with numerous buffer sizes
Capable of snapping to discrete buffer sizes if necessary
Worst-case polynomial complexity $O(n^2)$, similar to the "basic" VG algorithm

Library Characterization
Buffer library with a set of continuously sized buffers
Let S = sizing factor of the library. Express delay ($d_b$), capacitance ($c_b$) and leakage ($l_b$) in terms of S.
Determine the constants $c_0, c_1, l_0, l_1, d_0, d_1$ through empirical fitting
The equations combine the discrete buffer sizes to approximate the ideal of continuous buffer sizing:
$c_b \propto$ buffer area ⇒ $c_b = c_0 + c_1 S$
$l_b \propto$ device width ⇒ $l_b = l_0 + l_1 S$
$d_b$: linear gate delay model ⇒ $d_b = d_0 + d_1 (C_{out}/S)$
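As a rough illustration of the characterization step, the sketch below fits $c_b = c_0 + c_1 S$ to a few discrete buffer sizes with ordinary least squares written in plain Python, then evaluates the continuous model at a size that is not in the library. The data points are invented for the example, and `fit_line`, `buffer_cap` and `buffer_delay` are hypothetical names; the same fit would apply to leakage against S and to delay against $C_{out}/S$.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up characterization data: sizing factor S and measured input capacitance.
sizes = [1.0, 2.0, 4.0, 8.0]
input_caps = [2.0, 2.5, 3.5, 5.5]
c0, c1 = fit_line(sizes, input_caps)


def buffer_cap(s):
    """Continuous capacitance model c_b = c0 + c1 * S."""
    return c0 + c1 * s


def buffer_delay(s, c_out, d0, d1):
    """Continuous delay model d_b = d0 + d1 * (C_out / S)."""
    return d0 + d1 * (c_out / s)


print(c0, c1, buffer_cap(3.0))
```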

Generation of candidates
Point Candidate
[Figure labels: nodes o, u, v, t; wires lw1, lw2, lw3; buffers b1, b2, b3, b4; candidate (D0, C0, P0)]
Candidate format: 3-tuple $(D_0, C_0, P_0)$
Node has a point candidate ⇔ there are no buffers in the subtree rooted at that node
All sinks have point candidates
Write equations to determine the candidate at u

Generation of candidates
Curve Candidate
[Figure labels: nodes o, u, v, t; wires lw1, lw2, lw3; buffers b1, b2, b3, b4; candidates (Du, Cu, Pu) and (D0, C0, P0); variable S]
Candidate format: {[$D_u^{min}$, $D_u^{max}$], ($g_i$, $k_i$), i ∈ [0, 2]}
Node has a curve candidate ⇔ exactly one buffer in the subtree rooted at that node
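One way to picture a curve candidate is to sweep the sizing factor S of a single buffer placed above a point candidate and record the resulting (D, C, P) triples. The sketch below does this numerically; it is only an illustration under stated assumptions, not the closed-form representation used on the slide. The library coefficients, the `k_dyn` factor that converts capacitance to dynamic power, and the power expression as a whole are hypothetical.

```python
def curve_candidate(point, sizes, lib, k_dyn=1.0):
    """Sample (D, C, P) for one buffer of varying size S driving a point candidate.

    point is (D0, C0, P0); lib holds the fitted constants d0, d1, c0, c1, l0, l1.
    Assumed power model: P0 + k_dyn * (buffer input cap) + buffer leakage.
    """
    d_pt, c_pt, p_pt = point
    curve = []
    for s in sizes:
        delay = d_pt + lib["d0"] + lib["d1"] * (c_pt / s)   # buffer drives load C0
        cap = lib["c0"] + lib["c1"] * s                     # isolation: only buffer cap is seen
        power = p_pt + k_dyn * cap + lib["l0"] + lib["l1"] * s
        curve.append((delay, cap, power))
    return curve


lib = {"d0": 1.0, "d1": 2.0, "c0": 0.5, "c1": 0.5, "l0": 0.1, "l1": 0.2}
curve = curve_candidate((10.0, 4.0, 3.0), [1.0, 2.0, 4.0], lib)
for delay, cap, power in curve:
    print(delay, cap, power)
```

Larger buffers reduce delay but raise both the capacitance seen upstream and the power, which is the trade-off a curve candidate captures as S varies.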

Generation of candidates
Surface Candidate
[Figure labels: nodes o, u, v, t; wires lw1, lw2, lw3; buffers b1, b2, b3, b4; candidates (Dv, Cv, Pv), (Du, Cu, Pu), (D0, C0, P0); variables S, Du; axes Pv, Dv, Cv; C-plane with "discrete" Cv]
For a given S, Cv is fixed, while Dv and Pv vary based on Du
C-plane format: {Cv, [Dmin, Dmax], (ki), i ∈ [0, 2]}
Candidate format: vector<CPlane>

Generation of candidates
[Figure labels: nodes o, u, v, t; wires lw1, lw2, lw3; buffers b1, b2, b3, b4; candidates (Dt, Ct, Pt), (Dv, Cv, Pv), (Du, Cu, Pu), (D0, C0, P0)]
Similar equations can be written to determine the candidate at t
Ct is a function of S, but Dt and Pt are functions of Cv, Dv and S
New set of C-planes
For each C-plane, the lower envelope ⇒ power-optimal solution
Surface candidate → surface candidate

Design Choices
Wire network is a binary tree
Zero-length wires, dummy nodes
Ignore signal polarity on buffers
Pair of solution sets (similar to Lillis)
Number of surface candidates per node = 2 (buffered/non-buffered)
Trade-off between more fine-grained solutions and efficiency
No impact on optimality or complexity

Merging and Implicit Pruning
First, merge the left and right candidates
Compare equal-delay points by checking the 4 combinations of left and right candidates
Create P/C curves and extract the lower envelope ⇒ pruning
Translate P/C curves with fixed D values into P/D curves with fixed C values ⇒ creation of C-planes for 4 different surface candidates
Next, recombine these 4 surfaces into a single candidate
Map P/D curves from one C-plane to another using linear interpolation
For each (D, C) value, pick the lowest power value ⇒ pruning
Use the composite surface to create the buffered/non-buffered candidates
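The lower-envelope step can be pictured with sampled curves: for each grid point, keep the lowest power among the four left/right combinations and remember which combination supplied it. This is a simplified numerical sketch with hypothetical names and invented data; the slide describes the operation on analytic curves, which this sketch does not reproduce.

```python
def lower_envelope(curves):
    """Pointwise minimum over several power curves sampled on a common grid.

    curves maps a label to a list of power values; returns a list of
    (label, power) pairs, one per grid point.
    """
    labels = list(curves)
    n_points = len(curves[labels[0]])
    envelope = []
    for k in range(n_points):
        best = min(labels, key=lambda name: curves[name][k])
        envelope.append((best, curves[best][k]))
    return envelope


# Power vs. capacitance at one fixed delay, for the 4 left/right combinations
# (nb = non-buffered child candidate, b = buffered child candidate).
curves = {
    "nb+nb": [5.0, 4.0, 6.0],
    "nb+b": [6.0, 3.0, 5.0],
    "b+nb": [7.0, 5.0, 2.0],
    "b+b": [8.0, 6.0, 4.0],
}
print(lower_envelope(curves))
```

Everything above the envelope is discarded, which is why the slide calls the pruning implicit: no separate dominance test is run.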

Reconstruction and Snapping
Pair of candidate solutions created for the source
Any trade-off point on the DCP surface can be picked
Forward solution pass to reconstruct the tree structure with buffer locations
Snapping: if the required size is unavailable, the buffer with the nearest size value is chosen
Problem: discrepancies in D, C, P values ⇒ solution: local refinements in the C-planes
Single pass through the RC tree
Complexity = $O(n^2)$, where n = number of possible buffer locations
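Snapping to the nearest available size can be sketched in one line. The function name and the list of library sizes are hypothetical, and the local refinement in the C-planes mentioned above is not modelled.

```python
def snap_to_library(size, available_sizes):
    """Return the available buffer size closest to the requested continuous size."""
    return min(available_sizes, key=lambda s: abs(s - size))


library_sizes = [1, 2, 4, 8]
print(snap_to_library(2.7, library_sizes))
print(snap_to_library(3.2, library_sizes))
```

When the requested size is exactly halfway between two library sizes, `min` returns the one listed first, so the ordering of `available_sizes` decides the tie.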

Results
Benchmarks = C-tree nets
TSMC 0.13um buffer library
Number of discrete buffer choices = 9
Multilinear fitting models using the GNU Scientific Library
[Figure: example 3D surface]

Results: Snapping

Results: Comparison
Implementation of the Lillis algorithm with leakage included
Pruning less effective

Conclusion
Buffer insertion algorithm with total power ($P_{dyn} + P_{stat}$) minimization as the objective
Generate 3D surfaces in Delay, Capacitance and Power space
Ability to explore different types of trade-offs
Able to handle large buffer libraries with continuous sizes
Worst-case polynomial complexity