An Efficient Surface-Based Low-Power Buffer Insertion Algorithm


1 An Efficient Surface-Based Low-Power Buffer Insertion Algorithm
Rajeev R. Rao, David Blaauw, Dennis Sylvester, Charles Alpert*, Sani Nassif*
Department of EECS, University of Michigan, Ann Arbor, MI
*IBM Austin Research Laboratory, Austin, TX

2 Interconnect Trends
Interconnect power is a major issue
Huge power consumption in both global and local signal nets
Repeater counts are increasing drastically
IBM: 50% of leakage is in inverters/buffers
Assuming continuation of current design styles, projections for the 32nm technology node are dramatic
  70% of cell count = repeaters
  65-80% of dynamic power is due to interconnects
  Leakage is increasing exponentially
Required: optimal repeater usage with the objective of total power minimization
[Figure: Total Dynamic Power Breakdown. Source: N. Magen, SLIP’04]
[Figure: % repeater cells in block-level nets (clk-rep, rep, tot-rep) for the 90nm, 65nm, 45nm and 32nm nodes. Source: P. Saxena, ISPD’04]

3 Outline
Introduction
  Delay and Buffer models
Previous Work
Proposed Algorithm
  Library characterization
  Generation of different types of candidates
  Merging, Propagation, Snapping
Results
Conclusion

4 Introduction
Wire RC delay is a quadratic function of wire length
Segmenting wires decreases delay
The same idea applies to interconnect tree structures
Buffers are inserted for delay management
Additional benefit: buffers/inverters decouple large output loads
Unbuffered (Driver, Receiver): wire length = 2, wire delay ∝ (2)^2 = 4
Buffered (Driver, Repeater, Receiver): two segments of length 1, total wire length = 2, wire delay ∝ (1)^2 + (1)^2 = 2
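The two-segment arithmetic generalizes to any number of equal segments. The sketch below assumes, as the slide's example does, that each segment's delay is proportional to the square of its length, and it ignores the delay of the repeaters themselves. The helper name `segmented_wire_delay` is hypothetical.

```python
def segmented_wire_delay(length: float, segments: int) -> float:
    """Relative wire delay when a wire is split into equal segments.

    Assumes each segment's delay is proportional to its length squared
    and ignores the repeaters' own delay (an idealization).
    """
    if segments < 1:
        raise ValueError("segments must be >= 1")
    seg_len = length / segments
    return segments * seg_len ** 2


print(segmented_wire_delay(2, 1))  # unbuffered wire of length 2
print(segmented_wire_delay(2, 2))  # one repeater at the midpoint
```

Under this idealization, k equal segments cut the wire delay by a factor of k. In practice each repeater adds its own delay, so there is an optimum segment count.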

5 Elmore Delay model
Represent the interconnect tree with a lumped RC model
Assume the binary tree topology is fixed with an initial Steiner tree estimation
n vertices (branch points) and (n-1) edges (i.e., wires)
For a wire e connecting vertices (u, v), the Elmore delay is: [equation]
where T(v) is the maximal subtree rooted at v that does not contain buffers
The total delay from a vertex v to a sink node si is: [equation]
Source: Digital Int. Circuits, J. Rabaey
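The two equations on this slide did not survive in the transcript. For orientation, the form commonly used in the buffer-insertion literature is sketched below. It is an assumption rather than a copy of the slide. The symbols $R_e$ and $C_e$ (lumped resistance and capacitance of wire $e$), $C(T(v))$ (total capacitance of the subtree $T(v)$) and $\mathrm{path}(v, s_i)$ are introduced here for illustration, and the sum covers an unbuffered path only.

```latex
D(e) = R_e \left( \frac{C_e}{2} + C\bigl(T(v)\bigr) \right),
\qquad
D(v, s_i) = \sum_{e \,\in\, \mathrm{path}(v, s_i)} D(e)
```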

6 Buffer model
Linear gate delay model used for the buffers
Assumption: Delay is a linear function of output capacitance
Dbuffer = Dintrinsic-delay + Rintrinsic-resistance * Coutput-load
Isolation Property: Buffer devices decouple “downstream” output loads from the parent trees
Assumption: Miller effect (“bootstrapping”) due to Cgd is negligible
Node v “sees” a downstream load = Cbuf. Cload is “invisible” to v.
[Figure: node v driving a buffer, labeled with Cgd, Cbuf and Cload]

7 Buffer Insertion Problem
[Figure: net from Source to Sink with legal buffer positions and a buffer library BufLib containing b1, b2, b3]
Timing Metrics
Required Arrival Time (RAT)
  Each sink is assigned a given RAT(si) value and the source is fixed as RAT(so)=0
  Delay minimization → Maximize slack at source q(so)
Subtree Delay (SD)
  SD(si) = RATmax(si) – RAT(si)
  Delay minimization → Minimize SD(so)
  Advantage: Unlike RAT, equations using SD are additive
Our approach
  Tradeoff surfaces in 3D space of delay, capacitance and power
  Continuously-sized buffer libraries

9 Previous Work
L. P. P. P. van Ginneken (VG) – ISCAS’90
Two-phase dynamic programming algorithm
  Backward traversal up the interconnect tree to compute load and delay values
  Forward solution pass to reconstruct the “best” candidate
Function BOTTOM_UP (v)
  If v ∈ sinks { return (Cv, SDv) }
  Else
    /* compute options for subtrees */
    BOTTOM_UP( left(v) )
    BOTTOM_UP( right(v) )
    Join pairs of subtrees by a merge operation
    Find best cnd among merged cnds to add a buffer
    Add parent wire to both types of cnds
    Prune inferior cnds from set of cnds
    Store cnd list for node v and return
Post-order DFS traversal
Merge operation
  Cparent = Cleft + Cright
  SDparent = max(SDleft, SDright)
Buffer candidate creation
Pruning provably inferior candidates

10 VG Algorithm
Candidate Format: 2-tuple (Load, Subtree Delay) = (c,s)
Recursive formulas for two possible cases
  Only a wire is added at root of subtree
    c1 = c0 + cwire
    s1 = s0 + dwire
  A buffer and a wire are added at root of subtree
    c1 = cbuf + cwire (Isolation Property)
    s1 = s0 + dint + rbuf*c0 + dwire
Pruning Criteria: (c1,s1) is “better” than (c2,s2) if both load and subtree delay values are lower, i.e., c1<c2 and s1<s2
Merge operation is linear
Complexity = O(n^2), where n = number of buffer locations
Additional objective: Minimize buffer count → Complexity is non-polynomial
[Figure: candidate (c0,s0) at the subtree root propagating to (c1,s1), shown for both cases]
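The bottom-up pass and the two recursive cases can be sketched in a few lines. This is a minimal illustration and not the authors' implementation. It assumes a single buffer type and Elmore wire delay with lumped wire values. For brevity it replaces VG's linear merge with an all-pairs merge followed by pruning. It leaves out the driver resistance at the root and returns only the candidate list, without the forward pass that recovers buffer positions. `Node`, `Buffer` and all field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Cand = Tuple[float, float]  # (load c, subtree delay s)


@dataclass(frozen=True)
class Buffer:
    c_in: float   # input capacitance (cbuf)
    d_int: float  # intrinsic delay (dint)
    r_out: float  # output resistance (rbuf)


@dataclass
class Node:
    r_wire: float = 0.0   # wire from this node up to its parent
    c_wire: float = 0.0
    c_sink: float = 0.0   # used only when the node is a leaf (sink)
    sd_sink: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    buffer_ok: bool = False  # legal buffer position?


def prune(cands: List[Cand]) -> List[Cand]:
    """Keep only candidates not dominated in both load and delay."""
    kept: List[Cand] = []
    for c, s in sorted(cands):
        if not kept or s < kept[-1][1]:
            kept.append((c, s))
    return kept


def add_wire(cand: Cand, r: float, c: float) -> Cand:
    c0, s0 = cand
    return (c0 + c, s0 + r * (c / 2.0 + c0))  # Elmore delay of the wire


def add_buffer(cand: Cand, buf: Buffer) -> Cand:
    c0, s0 = cand
    return (buf.c_in, s0 + buf.d_int + buf.r_out * c0)  # isolation property


def bottom_up(v: Node, buf: Buffer) -> List[Cand]:
    if v.left is None and v.right is None:
        cands = [(v.c_sink, v.sd_sink)]
    else:
        kids = [bottom_up(k, buf) for k in (v.left, v.right) if k is not None]
        cands = kids[0]
        for other in kids[1:]:
            # Merge: loads add, subtree delay is the worse of the two branches.
            cands = prune([(cl + cr, max(sl, sr))
                           for cl, sl in cands for cr, sr in other])
        if v.buffer_ok:
            best = min((add_buffer(c, buf) for c in cands), key=lambda t: t[1])
            cands = prune(cands + [best])
    return prune([add_wire(c, v.r_wire, v.c_wire) for c in cands])
```

Each node's candidate list comes back sorted by increasing load and decreasing subtree delay, which is what makes a linear merge possible in the original algorithm.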

11 Previous Work
Extensions to VG by Lillis et al. – ICCAD’95, JSSC’96
A buffer library B can be used during buffer insertion → Complexity = O(n^2 |B|^2)
Simultaneous wire sizing and buffer insertion
Incorporate signal slew into buffer delay model
Dynamic power minimization subject to timing constraints
  Candidate Format: 3-tuple (Load, Subtree Delay, Power) = (c,s,p)
  Equate power with effective “total” capacitance
  Assumption: All capacitive values can be linearly mapped onto a polynomially-bounded integer domain (cmax = max cap value)
  Sophisticated pruning mechanism using orthogonal range query
  Complexity = O(n^3 |B| cmax^2 log(n cmax)) based on the assumption
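With a third component, pruning becomes a three-dimensional dominance test. The sketch below shows only the dominance rule, using a naive quadratic scan. It does not implement the orthogonal range query the slide refers to. It also treats ties as dominated (≤ in every component), which is an assumption. `Cand3`, `dominates` and `prune3` are hypothetical names.

```python
from typing import List, Tuple

Cand3 = Tuple[float, float, float]  # (load c, subtree delay s, power p)


def dominates(a: Cand3, b: Cand3) -> bool:
    """True if a is no worse than b in every component and a != b."""
    return a != b and all(x <= y for x, y in zip(a, b))


def prune3(cands: List[Cand3]) -> List[Cand3]:
    """Naive O(k^2) removal of dominated 3-tuples (duplicates collapsed)."""
    unique = sorted(set(cands))
    return [b for b in unique if not any(dominates(a, b) for a in unique)]


print(prune3([(1, 5, 3), (2, 4, 3), (2, 6, 4), (1, 5, 3)]))
```

In two dimensions the surviving candidates form a single sorted chain. In three dimensions they do not, which is why each extra component makes pruning less effective.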

12 Previous Work
Several approaches presented in the literature target power minimization in conjunction with buffer insertion. Examples:
  Quadratic programming: Chu et al. – TCAD’99
  Lagrangian relaxation: C.-P. Chen et al. – TCAD’99
  ClockTune: J.-L. Tsai et al. – TCAD’04
Associate total power with effective capacitive area of wires + devices
  Area minimization → Power minimization
  Ignores the contribution of static leakage power
  Inclusion of this component results in non-polynomial complexity
Addition of extra components in candidates generally leads to exponential complexity for dynamic programming

13 Contributions of this paper
Novel “continuous” buffer insertion algorithm with total power minimization
  Inclusive of both dynamic and leakage power
Generate tradeoff surfaces in the 3D DCP (Delay, Capacitance, Power) space
  User is able to pick any desired point on this 3D surface
  Easy to explore trade-offs between the 3 variables
Ability to handle arbitrarily large buffer libraries
  Continuously sized cell libraries with numerous buffer sizes
  Capable of snapping to discrete buffer sizes if necessary
Worst-case polynomial complexity O(n^2)
  Similar to “basic” VG algorithm

15 Library Characterization
Buffer library with a set of continuously sized buffers
Let S = sizing factor of the library. Express delay (db), capacitance (cb) and leakage (lb) in terms of S.
  cb ∝ Buffer Area → cb = c0 + c1*S
  lb ∝ Device width → lb = l0 + l1*S
  db: Linear gate delay model → db = d0 + d1*(Cout/S)
Determine the fitting constants c0, c1, l0, l1, d0, d1 empirically
The fitted equations combine the discrete buffer sizes to approximate the ideal of continuous buffer sizing
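One way to obtain the six constants is an ordinary least-squares line fit per quantity, sketched below in plain Python. It assumes one measurement of each quantity per discrete buffer size, with delay measured at a single output load `c_out`. The function names and data layout are hypothetical, and this is not the multilinear GSL fit mentioned in the results.

```python
from typing import Dict, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit y ~ a + b*x; returns (a, b)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def characterize(sizes: Sequence[float], caps: Sequence[float],
                 leaks: Sequence[float], delays: Sequence[float],
                 c_out: float) -> Dict[str, float]:
    """Fit cb = c0 + c1*S, lb = l0 + l1*S, db = d0 + d1*(c_out/S)."""
    c0, c1 = fit_line(sizes, caps)
    l0, l1 = fit_line(sizes, leaks)
    d0, d1 = fit_line([c_out / s for s in sizes], delays)
    return {"c0": c0, "c1": c1, "l0": l0, "l1": l1, "d0": d0, "d1": d1}
```

With the constants in hand, any intermediate sizing factor S can be evaluated, not only the sizes that exist in the discrete library.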

16 Generation of candidates
[Figure: net with nodes o, u, v, t, wire segments lw1, lw2, lw3 and buffers b1 to b4, annotated with candidate (D0, C0, P0)]
Point Candidate
Candidate Format: 3-tuple (Do, Co, Po)
Node has point candidate → there are no buffers in subtree rooted at that node
All sinks have point candidates
Write equations to determine candidate at u

17 Generation of candidates
[Figure: net with nodes o, u, v, t, wire segments lw1, lw2, lw3 and buffers b1 to b4, annotated with candidates (D0, C0, P0) and (Du, Cu, Pu); variable: S]
Curve Candidate
Candidate Format: {[Dumin,Dumax], (gi, ki) i=[0,2]}
Node has curve candidate → Exactly one buffer in subtree rooted at node

18 Generation of candidates
[Figure: net with nodes o, u, v, t, wire segments lw1, lw2, lw3 and buffers b1 to b4, annotated with candidates (D0, C0, P0), (Du, Cu, Pu) and (Dv, Cv, Pv); variables: S, Du]
[Figure: 3D plot with axes Pv, Dv, Cv showing a C-plane with “discrete” Cv]
Surface Candidate
For a given S, Cv is fixed; Dv and Pv vary based on Du
C-plane Format: {Cv, [Dmin,Dmax], (ki) i=[0,2]}
Candidate Format: vector<CPlane>

19 Generation of candidates
[Figure: net with nodes o, u, v, t, wire segments lw1, lw2, lw3 and buffers b1 to b4, annotated with candidates (D0, C0, P0), (Du, Cu, Pu), (Dv, Cv, Pv) and (Dt, Ct, Pt)]
Similar equations can be written to determine the candidate at t
Ct depends on S, but Dt and Pt depend on Cv, Dv and S
New set of C-planes
For each C-plane, lower envelope → Power-optimal solution
Surface candidate → Surface candidate

20 Design Choices
Wire network is a binary tree
  Zero-length wires, dummy nodes
Ignore signal polarity on buffers
  Pair of solution sets (similar to Lillis)
Number of surface candidates per node = 2 (Buffered/Non-buffered)
  Trade-off between more fine-grained solutions and efficiency
  No impact on optimality or complexity

21 Merging and Implicit Pruning
First, merge the left and right candidates
  Compare equal-delay points by checking 4 combinations of left and right candidates
  Create P/C curves and extract the lower envelope → Pruning
  Translate P/C curves with fixed D value into P/D curves with fixed C values → Creation of C-planes for 4 different surface candidates
Next, recombine these 4 surfaces into a single candidate
  Map P/D curves from one C-plane to another using linear interpolation
  For each (D,C) value, pick the lowest power value → Pruning
  Use the composite surface to create the buffered/non-buffered candidate
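The "interpolate, then keep the lowest power" step can be illustrated on piecewise-linear P/D curves that share one C value. The sketch below samples the lower envelope at caller-supplied delay values. It does not compute the exact crossing points between curves, and the curve representation (a list of breakpoints) is an assumption rather than the slide's coefficient format. `interp` and `lower_envelope` are hypothetical names.

```python
from typing import List, Optional, Sequence, Tuple

# A curve is a list of (delay, power) breakpoints with strictly increasing delay.
Curve = Sequence[Tuple[float, float]]


def interp(curve: Curve, d: float) -> Optional[float]:
    """Linearly interpolate power at delay d; None outside the curve's range."""
    if d < curve[0][0] or d > curve[-1][0]:
        return None
    for (d0, p0), (d1, p1) in zip(curve, curve[1:]):
        if d0 <= d <= d1:
            return p0 + (p1 - p0) * (d - d0) / (d1 - d0)
    return curve[0][1]  # single-point curve evaluated at its own delay


def lower_envelope(curves: Sequence[Curve],
                   delays: Sequence[float]) -> List[Tuple[float, float]]:
    """For each sampled delay, keep the lowest power over all curves."""
    env: List[Tuple[float, float]] = []
    for d in delays:
        vals = [p for p in (interp(c, d) for c in curves) if p is not None]
        if vals:
            env.append((d, min(vals)))
    return env


a = [(0.0, 10.0), (10.0, 0.0)]
b = [(0.0, 4.0), (10.0, 6.0)]
print(lower_envelope([a, b], [0.0, 5.0, 10.0]))
```

Points above the envelope are discarded implicitly, since they are never emitted. This is the sense in which extracting the envelope doubles as pruning.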

22 Reconstruction and Snapping
Pair of candidate solutions created for the source
  Any trade-off point in the DCP surface can be picked
  Forward solution pass to reconstruct the tree structure with buffer locations
Snapping: If the required size is unavailable, then the buffer with the nearest size value is chosen
  Problem: Discrepancies in D, C, P values → Solution: Local refinements in the C-planes
Single pass through the RC tree
Complexity = O(n^2), where n = number of possible buffer locations
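Snapping itself is a nearest-neighbor lookup over the discrete sizes. The sketch below covers only that lookup. It does not attempt the local refinement in the C-planes, whose details the slides do not give. `snap_size` is a hypothetical name.

```python
from typing import Sequence


def snap_size(s: float, library_sizes: Sequence[float]) -> float:
    """Return the discrete library size closest to the continuous size s.

    On an exact tie, the size that appears first in library_sizes wins.
    """
    if not library_sizes:
        raise ValueError("library_sizes must not be empty")
    return min(library_sizes, key=lambda size: abs(size - s))


print(snap_size(2.7, [1, 2, 4, 8]))
print(snap_size(3.5, [1, 2, 4, 8]))
```

Because the snapped size differs from the continuous optimum, the buffer's delay, capacitance and leakage all shift. That shift is the discrepancy the slide's refinement step is meant to absorb.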

24 Results
Benchmarks = C-tree nets
TSMC 0.13um buffer library
Number of discrete buffer choices = 9
Multilinear fitting models using GNU Scientific Library
[Figure: Example 3D surface]

25 Results: Snapping

26 Results: Comparison
Implementation of Lillis algorithm with leakage included
Pruning less effective

27 Conclusion
Buffer insertion algorithm with total power (Pdyn + Pstat) minimization as objective
Generate 3D surfaces in Delay, Capacitance and Power space
Ability to explore different types of trade-offs
Able to handle large buffer libraries with continuous sizes
Worst-case polynomial complexity

