A Fully Polynomial Time Approximation Scheme for Timing Driven Minimum Cost Buffer Insertion Shiyan Hu, Zhuo Li, Charles Alpert Dept of Electrical.

Presentation transcript:

A Fully Polynomial Time Approximation Scheme for Timing Driven Minimum Cost Buffer Insertion
Shiyan Hu*, Zhuo Li**, Charles Alpert**
*Dept of Electrical and Computer Engineering, Michigan Technological University
**IBM Austin Research Lab, Austin, TX

2 Outline
– Introduction
– Previous Works
– Timing-cost approximate dynamic programming
– Double-ε geometric sequence based oracle search
– The Algorithm
– Experimental Results
– Conclusion

Interconnect Delay Dominates
[Figure: delay (psec) versus technology generation (μm), comparing transistor/gate delay with interconnect delay]

4 Timing Driven Buffer Insertion

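Buffers Reduce RC Wire Delay
[Figure: a wire of length x between resistance R and load C, shown unbuffered and with a buffer at the midpoint; each half has length x/2 and resistance rx/2, with capacitance labels cx/4; plot of ∆t versus x]
$\Delta t = t_{buf} - t_{unbuf} = RC + t_b - rcx^2/4$
Delay grows linearly with interconnect length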
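The slide's expression for ∆t can be evaluated directly to see when a midpoint buffer pays off. The sketch below is only an illustration of that simplified formula; the function names and the numeric values in the usage note are assumptions, not taken from the deck.

```python
import math


def buffer_delay_change(x, r, c, R, C, t_b):
    """Slide formula: delta_t = t_buf - t_unbuf = R*C + t_b - r*c*x^2/4.

    x   : wire length
    r, c: wire resistance / capacitance per unit length
    R, C: the R and C labels of the slide's figure (assumed buffer parameters)
    t_b : buffer delay term t_b from the slide
    """
    return R * C + t_b - r * c * x * x / 4.0


def break_even_length(r, c, R, C, t_b):
    """Length at which delta_t = 0; longer wires benefit from the buffer."""
    return math.sqrt(4.0 * (R * C + t_b) / (r * c))
```

With the hypothetical values r = c = R = C = 1 and t_b = 3, the break-even length is 4: a shorter wire gets slower when buffered, a longer one gets faster.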

6 25% Gates are Buffers Saxena, et al. [TCAD 2004]

7 Problem Formulation
Given:
1. A Steiner tree
2. n candidate buffer locations
Find a minimal cost (area/power) solution that satisfies the timing constraint T

8 Solution Characterization
To model the effect on downstream nodes, a candidate solution is associated with:
– v: a node
– C: downstream capacitance
– Q: required arrival time
– W: cumulative buffer cost

9 Dynamic Programming (DP)
Start from sinks
Candidate solutions are generated
Candidate solutions are propagated toward the source
Three operations
– Add Wire
– Insert Buffer
– Merge
Solution Pruning

10 Generating Candidates
[Figure: candidate generation, steps (1), (2), (3)]

11 Pruning Candidates
[Figure: steps (3) and (4), with candidates (a) and (b)]
Both (a) and (b) look the same to the source. Remove the one with the worse slack and cost.

12 Merging Branches
[Figure: left candidates and right candidates]
$O(n_1 n_2)$ solutions after each branch merge.
Worst-case $O((n/m)^m)$ solutions.

13 DP Properties
$(Q_1, C_1, W_1)$ is inferior to (dominated by) $(Q_2, C_2, W_2)$ if $C_1 \ge C_2$, $W_1 \ge W_2$ and $Q_1 \le Q_2$
Non-dominated solutions are maintained: for the same Q and W, pick min C
The number of solutions depends on the number of distinct W and Q values, but not on the values themselves
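The dominance rule on this slide can be sketched as a small pruning routine. This is a minimal quadratic-time illustration, not the deck's implementation; the `Candidate` type and the function names are assumptions introduced here.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    q: float  # required arrival time (larger is better)
    c: float  # downstream capacitance (smaller is better)
    w: float  # cumulative buffer cost (smaller is better)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if b is inferior to a: b has no smaller C or W and no larger Q."""
    return a.c <= b.c and a.w <= b.w and a.q >= b.q


def prune(cands: List[Candidate]) -> List[Candidate]:
    """Keep only non-dominated candidates (simple O(k^2) sketch)."""
    kept: List[Candidate] = []
    for s in cands:
        if any(dominates(t, s) for t in kept):
            continue  # s is inferior (or identical) to a kept candidate
        kept = [t for t in kept if not dominates(s, t)]
        kept.append(s)
    return kept
```

A candidate that is faster but carries more load is not comparable with one that is slower with less load, so both survive.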

14 Previous Works
[Timeline; only the year 1996 survives in the transcript]
– van Ginneken's algorithm
– Lillis' algorithm
– Shi and Li's algorithm
– Chen and Zhou's algorithm
– NP-hardness proof

15 Bridging The Gap
We are bridging the gap!
A Fully Polynomial Time Approximation Scheme (FPTAS)
– Provably good
– Within (1+ε) of the optimal cost for any ε > 0
– Runs in time polynomial in n (nodes), b (buffer types) and 1/ε
– Best solution for an NP-hard problem in theory
– Highly practical

16 The Rough Picture
W*: the cost of the optimal solution
[Flowchart: make a guess on W* → check it → if good (close to W*), return the solution; if not good, guess again]
Key 1: Efficient checking
Key 2: Smart guess

17 Key 1: Efficient Checking
Benefit of guess
– Only maintain the solutions with cost no greater than the guessed cost
– Accelerate DP

18 The Oracle
Oracle(x): the checker, able to decide whether x > W* or not
– Without knowing W*
– Answer efficiently
[Flowchart: set up upper and lower bounds on the cost W* → guess x within the bounds → Oracle(x) → update the bounds]

19 Construction of Oracle(x)
Only interested in whether there is a solution with cost up to x satisfying the timing constraint
Scale and round each buffer cost
Dynamic Programming
– Perform DP on the scaled problem with cost upper bound $n/\epsilon$
– Runtime polynomial in $n/\epsilon$

20 Scaling and Rounding
[Figure: buffer cost axis with ticks at 0, $x\epsilon/n$, $2x\epsilon/n$, $3x\epsilon/n$, $4x\epsilon/n$]
Buffer costs are integers due to rounding and are bounded by $n/\epsilon$.
Rounding error at each buffer $\le x\epsilon/n$, total rounding error $\le x\epsilon$.
– Larger x: larger error, fewer distinct costs and faster
– Smaller x: smaller error, more distinct costs and slower
Rounding is the reason for the acceleration
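A minimal sketch of the scale-and-round step, assuming costs are rounded down to multiples of xε/n (the slide only bounds the per-buffer error by xε/n, so the rounding direction is an assumption here, as are the function and parameter names).

```python
import math


def scale_and_round_costs(costs, x, eps, n):
    """Express each buffer cost as an integer number of units of x*eps/n.

    Rounding down (an assumption of this sketch) keeps each rounded cost
    within one unit, i.e. within x*eps/n, of the true cost.
    Returns (rounded integer costs, unit size).
    """
    unit = x * eps / n
    return [math.floor(w / unit) for w in costs], unit


def cost_budget(n, eps):
    """Integer cost budget n/eps used by the DP on the scaled problem."""
    return math.floor(n / eps)
```

For the hypothetical inputs x = 100, ε = 0.5, n = 10 the unit is 5, so buffer costs 7, 12 and 5 become 1, 2 and 1, and the DP only needs to track integer costs up to 20.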

21 DP Results
DP result with all w being integers $\le n/\epsilon$
Yes, there is a solution satisfying the timing constraint
– With cost rounding back, the solution has cost at most $(n/\epsilon)(x\epsilon/n) + x\epsilon = (1+\epsilon)x$, so $W^* < (1+\epsilon)x$
No, no such solution
– With cost rounding back, the solution has cost at least $(n/\epsilon)(x\epsilon/n) = x$, so $x \le W^*$

22 Rounding on Q
The number of solutions is bounded by the number of distinct W and Q values
Number of distinct W values: $O(n/\epsilon_1)$
– Rounding before DP
Number of distinct Q values: $O(m/\epsilon_2)$
– Round up Q to the nearest value in $\{0, \epsilon_2 T/m, 2\epsilon_2 T/m, 3\epsilon_2 T/m, \ldots, T\}$ in branch merge (m is the number of sinks)
– Rounding during DP
The number of non-dominated solutions is $O(mn/(\epsilon_1 \epsilon_2))$
[Figure: Q axis with ticks at 0, $\epsilon_2 T/m$, $2\epsilon_2 T/m$, $3\epsilon_2 T/m$, $4\epsilon_2 T/m$]
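The Q rounding rule can be sketched as follows. This assumes 0 ≤ Q ≤ T and uses hypothetical names; it only illustrates rounding up to the grid {0, ε2T/m, 2ε2T/m, …, T}.

```python
import math


def round_q(q, T, m, eps2):
    """Round q up to the nearest multiple of eps2*T/m, capped at T.

    Assumes 0 <= q <= T; the grid then has about m/eps2 + 1 values.
    """
    step = eps2 * T / m
    return min(T, math.ceil(q / step) * step)
```

With the hypothetical values T = 80, m = 4 and ε2 = 0.25 the step is 5, so every Q in [0, 80] falls onto one of 17 grid values, matching the O(m/ε2) count on the slide.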

Q-W Rounding Before Branch Merge
[Figure: W axis up to $n/\epsilon_1$ and Q axis up to T, with Q grid lines at multiples of $\epsilon_2 T/m$]

24 Solution Propagation: Add Wire
[Figure: a wire of length x from $(v_1, c_1, w_1, q_1)$ to $(v_2, c_2, w_2, q_2)$]
$c_2 = c_1 + cx$
$q_2 = q_1 - (rcx^2/2 + rxc_1)$
r: wire resistance per unit length
c: wire capacitance per unit length
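The add-wire update can be written as a one-step function. In this sketch the cost w is carried over unchanged, which the slide does not state explicitly and is assumed here; the function name and the tuple layout are also illustrative.

```python
def add_wire(c1, w1, q1, x, r, c):
    """Propagate a candidate (c1, w1, q1) across a wire of length x.

    r, c: wire resistance / capacitance per unit length.
    Returns (c2, w2, q2); w2 = w1 is an assumption (a wire adds no buffer cost).
    """
    c2 = c1 + c * x
    q2 = q1 - (r * c * x * x / 2.0 + r * x * c1)
    return c2, w1, q2
```

For example, with c1 = 2, q1 = 100, x = 2, r = 1 and c = 0.5 (made-up values), the load grows to 3 and the required arrival time drops by 1 + 4 = 5.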

25 Solution Propagation: Insert Buffer
[Figure: candidate $(v_1, c_1, w_1, q_1)$ becomes $(v_1, c_{1b}, w_{1b}, q_{1b})$ after inserting buffer b]
$q_{1b} = q_1 - d(b)$
$c_{1b} = C(b)$
$w_{1b} = w_1 + w(b)$
d(b): buffer delay
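A matching sketch for buffer insertion, taking the buffer delay d(b), input capacitance C(b) and cost w(b) as given numbers. How d(b) depends on the downstream load is not spelled out on the slide, so the caller is assumed to supply it; the names are illustrative.

```python
def insert_buffer(c1, w1, q1, d_b, C_b, w_b):
    """Insert buffer b in front of candidate (c1, w1, q1).

    d_b: buffer delay d(b) for this candidate (assumed precomputed by caller)
    C_b: buffer input capacitance C(b), which replaces the downstream load
    w_b: buffer cost w(b)
    Returns (c1b, w1b, q1b).
    """
    return C_b, w1 + w_b, q1 - d_b
```

The downstream capacitance c1 is hidden behind the buffer, which is why it does not appear in the result.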

Buffer Insertion Runtime

27 Solution Propagation: Merge
[Figure: left candidate $(v, c_l, w_l, q_l)$ and right candidate $(v, c_r, w_r, q_r)$ meeting at node v]
Round q in both branches
$c_{merge} = c_l + c_r$
$w_{merge} = w_l + w_r$
$q_{merge} = \min(q_l, q_r)$
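The merge formulas can be illustrated by combining every left candidate with every right candidate, which is the naive O(n1·n2) pairing mentioned on the Merging Branches slide. The sketch omits the Q rounding and any pruning, and the tuple layout (c, w, q) is an assumption.

```python
def merge_branches(left, right):
    """Combine candidates (c, w, q) of two branches at their common node.

    Naive all-pairs merge: len(left) * len(right) results, no rounding,
    no pruning.
    """
    return [
        (cl + cr, wl + wr, min(ql, qr))
        for (cl, wl, ql) in left
        for (cr, wr, qr) in right
    ]
```

In a full DP the merged list would then be pruned so that only non-dominated candidates move on toward the source.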

Branch Merge Runtime - 1
Target $Q = 0$

Branch Merge Runtime - 2
Target $Q = \epsilon_2 T/m$

Branch Merge Runtime - 3
Target $Q = 2\epsilon_2 T/m$

Branch Merge Runtime - 4

32 Timing-Cost Approximate DP
Lemma: a buffering solution with cost at most $(1+\epsilon_1)W^*$ and with timing at most $(1+\epsilon_2)T$ can be computed in time [bound not captured in the transcript]

33 Key 2: Geometric Sequence Based Guess
U (L): upper (lower) bound on W*
Naive binary search style approach
– Set U and L on W*
– Guess x = (U+L)/2 and query Oracle(x)
– If $W^* < (1+\epsilon)x$, set $U = (1+\epsilon)x$
– If $W^* \ge x$, set $L = x$
Runtime (# iterations) depends on the initial bounds U and L

34 Adapt $\epsilon_1$
Rounding factor $x\epsilon_1/n$ for W
– Larger $\epsilon_1$: faster with rough estimation
– Smaller $\epsilon_1$: slower with accurate estimation
Adapt $\epsilon_1$ according to U and L

35 U/L Related Scale and Round
[Figure: buffer cost axis starting at 0, with the rounding step $x\epsilon/n$ tied to U/L]

36 Conceptually
Begin with large $\epsilon_1$ and progressively reduce it (towards ε) according to U/L as x approaches W*
Fix $\epsilon_2 = \epsilon$ in rounding Q for limiting timing violation
Set $\epsilon_1$ as a geometric sequence of …, 8, 4, 2, 1, 1/2, …, ε
One run of DP takes about $O(n/\epsilon_1)$ time
Total runtime is bounded by the last run as $O(\ldots + n/8 + n/4 + n/2 + \ldots + n/\epsilon) = O(n/\epsilon)$, independent of # iterations
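The claim that the last run dominates can be checked with a geometric sum. As a hedged sketch, assume $\epsilon_1$ takes the values $2^k\epsilon, \ldots, 4\epsilon, 2\epsilon, \epsilon$ (a halving sequence ending at ε; the index k is introduced here for illustration) and that one DP run costs about $n/\epsilon_1$:

```latex
\sum_{i=0}^{k} \frac{n}{2^{i}\epsilon}
  = \frac{n}{\epsilon} \sum_{i=0}^{k} 2^{-i}
  < \frac{n}{\epsilon} \cdot 2
  = O\!\left(\frac{n}{\epsilon}\right)
```

The bound does not depend on k, which is why the total is independent of the number of iterations.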

37 Oracle Query Till U/L<2

38 Mathematically

39 The Algorithmic Flow
1. Set U and L of W*
2. Adapt $\epsilon_1 = [U/L - 1]^{1/2}$
3. Set $x = [UL/(1+\epsilon_1)]^{1/2}$
4. Query Oracle(x)
5. Update U or L
6. Repeat from step 2 until U/L < 2
7. Compute final solution
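The loop in this flow can be sketched in a few lines. Two assumptions are made loudly here. First, the oracle is passed in as a callable `oracle(x, eps1)` that returns True only when $W^* < (1+\epsilon_1)x$ is certified and False only when $W^* \ge x$; that interface is hypothetical. Second, the transcript's exponent placement in the $\epsilon_1$ formula is ambiguous: read as $\sqrt{U/L - 1}$ the ratio U/L would stall at 2, so the sketch uses $\epsilon_1 = \sqrt{U/L} - 1$, under which U/L shrinks to $(U/L)^{3/4}$ per query. This is an interpretation, not a confirmed reading of the slide.

```python
import math


def oracle_search(oracle, L, U):
    """Shrink bounds L <= W* <= U until U/L < 2.

    oracle(x, eps1) is a hypothetical callable: True certifies
    W* < (1 + eps1) * x, False certifies W* >= x.
    """
    while U / L >= 2.0:
        # Assumed reading of the slide's formula (see lead-in).
        eps1 = math.sqrt(U / L) - 1.0
        x = math.sqrt(U * L / (1.0 + eps1))
        if oracle(x, eps1):
            U = (1.0 + eps1) * x  # W* < (1 + eps1) * x
        else:
            L = x  # W* >= x
    return L, U
```

For a quick check one can plug in a mock oracle that knows a made-up optimum, for example `lambda x, e: 37 < x` with starting bounds 1 and 1000; the loop then ends with bounds that still bracket 37 and differ by less than a factor of 2.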

40 When U/L<2
– Scale and round each cost by $L\epsilon/n$
– Run DP with $W = 2n/\epsilon$
– Pick the min cost solution satisfying timing at the driver
– At least one feasible solution, otherwise no solution with cost $(2n/\epsilon)(L\epsilon/n) = 2L \ge U$
– A single DP runtime

Main Theorem
Theorem: a (1+ε) approximation to the timing constrained minimum cost buffering problem can be computed in $O(m^2 n^2 b/\epsilon^3 + n^3 b^2/\epsilon)$ time for $0 < \epsilon < 1$ and in $O(m^2 n^2 b/\epsilon + mn^2 b + n^3 b)$ time for $\epsilon \ge 1$

42 Experiments
Experimental Setup
– 1000 industrial nets
– 48 buffer types including non-inverting buffers and inverting buffers
Compared to Dynamic Programming

43 Cost Ratio Compared to DP
[Figure: buffer cost ratio versus approximation ratio ε]

44 Speedup Compared to DP
[Figure: speedup versus approximation ratio ε]

45 Timing Violations (% nets)
[Figure: timing violations versus approximation ratio ε]

46 Cost Ratio w/ Timing Recovery
[Figure: buffer cost ratio versus approximation ratio ε]

47 Speedup w/ Timing Recovery
[Figure: speedup versus approximation ratio ε]

48 Observations
Without timing recovery
– FPTAS always achieves the theoretical guarantee
– Larger ε leads to more speedup
– On average about 5x faster than dynamic programming
– Can run 4.6x faster with 0.57% solution degradation
– <5% nets with timing violations
With timing recovery
– FPTAS well approximates the optimal solutions
– Can still have >4x speedup

[Figure: "Our Bridge" spanning "NP-Hardness Complexity" and "Exponential Time Algorithm"]

50 Conclusion
Propose a (1+ε) approximation for timing constrained minimum cost buffering for any ε > 0
– Runs in $O(m^2 n^2 b/\epsilon^3 + n^3 b^2/\epsilon)$ time
– Timing-cost approximate dynamic programming
– Double-ε geometric sequence based oracle search
– 5x speedup in experiments
– Few percent additional buffers as guaranteed theoretically
The first provably good approximation algorithm on this problem

Summary on Buffer Insertion and Layer Assignment
[Figure: delay (psec) versus technology generation (μm), comparing transistor/gate delay with interconnect delay. Source: Gordon Moore, Chairman Emeritus, Intel Corp]
This is why Moore's law does not hold anymore.

Interconnect Delay Scaling
Scaling factor s = 0.7 per generation
Elmore delay of a wire of length l: $\tau_{int} = (rl)(cl)/2 = rcl^2/2$ (first order)
Local interconnects: $\tau_{int} = (r/s^2)(c)(ls)^2/2 = rcl^2/2$
– Local interconnect delay roughly unchanged
Global interconnects: $\tau_{int} = (r/s^2)(c)(l)^2/2 = rcl^2/(2s^2)$
– Global interconnect delay doubles, which is unsustainable
Interconnect delay increasingly more dominant
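The two scaling cases can be checked numerically. This is a small sketch of the first-order formulas on the slide; the function names and the unit values in the usage note are illustrative.

```python
def wire_delay(r, c, l):
    """First-order Elmore delay of a wire of length l: r*c*l^2/2."""
    return r * c * l * l / 2.0


def scaled_delays(r, c, l, s=0.7):
    """Delays after one generation of scaling by s.

    Resistance per unit length becomes r/s^2 and c is unchanged; a local
    wire shrinks to length l*s, a global wire keeps length l.
    Returns (local_delay, global_delay).
    """
    local = wire_delay(r / s**2, c, l * s)
    global_ = wire_delay(r / s**2, c, l)
    return local, global_
```

With r = c = l = 1 the unscaled delay is 0.5; after scaling the local delay stays at about 0.5 while the global delay grows by 1/s², roughly 2.04x, which is the doubling the slide refers to.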

Interconnect Optimization

Analogy
Advancing technology = period of city expansion
More transistors = larger city
Buffers = gas stations
Interconnects = streets
– Lower layer = local street
– Higher layer = highways
Signal delay (timing) = time to cross the city
Highway is fast but its power has not been well explored
– Traditional wire sizing = make lane wider
– Layer assignment = highway overpasses

Detailed Analysis
The delay of a wire of length L is $T = rcL^2/2$
Assume N identical buffers with equal inter-buffer length l (L = Nl)
To minimize delay: [equations not captured in the transcript]
r, c: resistance, capacitance per unit length
$R_d$: on-resistance of inverter
$C_g$: gate input capacitance
[Figure: a wire of length L divided into segments of length l]

Quadratic Delay -> Linear Delay
Substituting $l_{opt}$ back into the interconnect delay expression: [equation not captured in the transcript]
Delay grows linearly with L instead of quadratically
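The equations on these two slides did not survive in the transcript. One standard way to arrive at the stated conclusion is sketched below. It assumes each stage is a driver of resistance $R_d$, a wire of length l, and a load $C_g$, evaluated with the Elmore model and with buffer intrinsic delay ignored; the slides' own expressions may differ in detail.

```latex
% Delay of one stage (driver R_d, wire of length l, load C_g):
t_{stage} = R_d\,(cl + C_g) + rl\left(\frac{cl}{2} + C_g\right)

% Total delay over N = L/l stages:
T(l) = \frac{L}{l}\, t_{stage}
     = L\left(\frac{R_d C_g}{l} + R_d c + \frac{rcl}{2} + r C_g\right)

% Setting dT/dl = 0:
-\frac{R_d C_g}{l^2} + \frac{rc}{2} = 0
\quad\Longrightarrow\quad
l_{opt} = \sqrt{\frac{2 R_d C_g}{rc}}

% Substituting l_opt back:
T(l_{opt}) = L\left(\sqrt{2 R_d C_g\, rc} + R_d c + r C_g\right)
```

The bracketed factor does not depend on L, so under these assumptions the buffered delay is linear in L, compared with $rcL^2/2$ for the unbuffered wire.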

Solution Pruning
Needs solution pruning for acceleration
Two candidate solutions
– $(v, c_1, q_1, w_1)$
– $(v, c_2, q_2, w_2)$
Solution 1 is inferior to Solution 2 if
– $c_1 \ge c_2$: larger load
– and $q_1 \le q_2$: tighter timing
– and $w_1 \ge w_2$: larger cost

Car Race - Speed
[Figure: car speed as the analogy for RAT]

Car Race - Load
[Figure: car load as the analogy for load capacitance]

Faster & Smaller Load
Faster & smaller load (larger RAT, smaller capacitance): Good
Slower & larger load (smaller RAT, larger capacitance): Inferior

Faster & Larger Load: Result 1

Faster & Larger Load: Result 2
Who will be the winner? Cannot tell at this moment, so keep both of them.

82 Extension For Layer Assignment
Theorem: a (1+ε) approximation to the timing constrained minimum cost layer assignment problem can be computed in $O(mn^2/\epsilon)$ time for any ε > 0.
Oracle Lemma: given a tree with n wire segments and m layers, the optimal layer assignment subject to cost budget $W = n/\epsilon$ can be computed in $O(mnW) = O(mn^2/\epsilon)$ time.

83 Conclusion
A (1+ε) approximation for timing constrained minimum cost buffering for any ε > 0 (DAC'09)
– Runs in $O(m^2 n^2 b/\epsilon^3 + n^3 b^2/\epsilon)$ time
– Timing-cost approximate dynamic programming
– Double-ε geometric sequence based oracle search
– 5x speedup in experiments
– Few percent additional buffers as guaranteed theoretically
The first provably good approximation algorithm on this problem
A similar algorithm for layer assignment problem (ICCAD'08)

84 Thanks