An Efficient Clustering Algorithm For Low Power Clock Tree Synthesis Rupesh S. Shelar Enterprise Microprocessor Group Intel Corporation, Hillsboro, OR.


1 An Efficient Clustering Algorithm For Low Power Clock Tree Synthesis. Rupesh S. Shelar, Enterprise Microprocessor Group, Intel Corporation, Hillsboro, OR. March 21st, 2007. ISPD 2007, Austin.

2 Outline: Introduction, Problem Formulation, Clustering Algorithm, Experimental Results, Conclusion.

3 Local Clock Capacitance Distribution in a Microprocessor: Interconnects contribute a major portion of total capacitance. Clocks are the most active nets in the design. Minimizing interconnect capacitance in clocks reduces dynamic power. The distribution shown was generated from several blocks in a microprocessor.

4 Microprocessor Clock Hierarchy. Clock network in a processor: – Distributed as a grid followed by a tree. [Figure labels: PLL; global clock distribution using multiple spines; tunable grid buffers; clock grid; regional clock buffers (RCBs); local clock buffers (LCBs); to state elements. Local clock network: CTS solution space.]

5 Previous Work. Zero-skew (unbuffered) trees: Tsay, TCAD’93; Boese et al., ASIC’92; Edahiro, DAC’93, ’94. Buffered trees: – Vittal et al., DAC’95: trades off buffers with wires; unsuitable for controlled implementation of clock gating and delayed clocking. – Mehta et al., ICCD’97: uses a dynamic-programming-based heuristic for clustering. – Tsai et al., ICCAD’05: formulation employing tunable buffers.

6 Clock Tree Synthesis (CTS): Performed after the placement/sizing of sequentials. Converts the logical clock tree into a physical one. Flow employed in several microprocessor designs. [Flow diagram labels (simplified version): RTL; Logic Synthesis; Physical Synthesis; Routing. CTS inputs: logical clock tree; sequentials' (x,y) locations and sizes. CTS steps: Clock Buffer Duplication; Sizing Clock Buffers; Routing Clock Nets.]

7 Clock Buffer Duplication: Given a clock buffer, duplicate it to meet delay, slope, RC, and skew constraints. – Decides which receivers are driven by the same driver, and hence the clock tree topology. Applied recursively in reverse topological order. Driven by clustering or partitioning: – Often intractable when capacity constraints are specified. – Many heuristics available. [Figure labels: K-stage receivers; K-stage buffers; duplication.]

8 Outline: Introduction, Problem Formulation, Clustering Algorithm, Experimental Results, Conclusion.

9 Effect of Clustering on Capacitance: A cluster implies a clock buffer. Interconnect capacitance varies significantly across solutions, even with the same number of clusters. [Figure: 4 placed sequentials clustered three ways: Solution 1, Solution 2, Solution 3.]

10 Clustering Targeting Power: Find the clusters such that total local clock power is minimum. – Power in local clock: $P_{\text{Local Clock}} = P_{\text{Dynamic}} + P_{\text{Leakage}}$. – $P_{\text{Dynamic}} = P_{\text{Sequential Cap}} + P_{\text{Buffer Cap}} + P_{\text{Routing Cap}}$. – $P_{\text{Leakage}}$ and $P_{\text{Buffer Cap}}$ can be shown to be proportional to total cap. – $P_{\text{Sequential Cap}}$ is fixed for CTS purposes. – Reducing $P_{\text{Local Clock}}$ is equivalent to minimizing interconnect cap. Hence: find the clusters such that total interconnect capacitance is minimum.
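The chain of reasoning on this slide can be written out as a short derivation. This is a sketch under stated assumptions, not the paper's own proof: it assumes the standard switching-power model $P = a f V_{dd}^2 C$, and it introduces hypothetical constants $k_b$ and $k_l$ for the proportionality of buffer-cap power and leakage to the driven load $C_{seq} + C_{route}$. The slide only states that such proportionality "can be shown".

```latex
\begin{align*}
P_{\text{Local Clock}} &= P_{\text{Dynamic}} + P_{\text{Leakage}} \\
P_{\text{Dynamic}} &= a f V_{dd}^2 \,\bigl(C_{seq} + C_{buf} + C_{route}\bigr) \\
C_{buf} &\approx k_b \,\bigl(C_{seq} + C_{route}\bigr), \qquad
P_{\text{Leakage}} \approx k_l \,\bigl(C_{seq} + C_{route}\bigr)
  && \text{(assumed proportionality)} \\
P_{\text{Local Clock}} &\approx \bigl[(1 + k_b)\, a f V_{dd}^2 + k_l\bigr]\,\bigl(C_{seq} + C_{route}\bigr)
\end{align*}
```

With $C_{seq}$ fixed by placement and sizing, the only term the clustering step can change is $C_{route}$, which is why minimizing local clock power reduces to minimizing interconnect capacitance.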

11 Routing-aware Clustering: Chicken-and-Egg Problem. Routing cap is unknown until clustering is performed. Clustering cannot be performed until routing cap is known.

12 Problem Simplification: Assume minimum spanning tree (MST) routing estimates. – Other candidates: HPWL, Edahiro metric. – Data in the paper show that MST and the Edahiro metric are strongly correlated with actual clock tree wirelength. – MST possesses a submodularity property suitable for greedy optimization. Can the problem be solved optimally, i.e., can we perform clustering such that the routing cap/overall power is minimum? Yes, if capacity constraints are dropped.

13 Problem Definition. Given: a set of receivers $S = \{s_1, \dots, s_n\}$, their loads $c_{s_i}$, and locations $(x_{s_i}, y_{s_i})$. Find: a set of clusters $S_{clusters} = \{c_1, \dots, c_m\}$ such that $\sum_{i} \bigl(\alpha + \mathrm{MST}(c_i)\bigr)$ is minimum. Subject to constraints (or design parameters): – Maximum # of receivers (due to process, routing, etc.). – Maximum load in a cluster (due to library). – Bounding box width/height (to control RC delay and variations in it).
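To make the objective concrete, the following is a minimal sketch of how the cost of a candidate clustering could be evaluated. It reads the objective as a sum over clusters of α + MST(c_i), treats α as a per-cluster constant, and uses Manhattan distance between receiver locations; all three are assumptions, as are the function names `mst_length` and `clustering_cost`. The slides do not specify the distance metric or the meaning of α.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def mst_length(points: List[Point]) -> float:
    """Length of a minimum spanning tree over the points (Prim's algorithm).

    Assumption: Manhattan distance is used as the wirelength estimate.
    """
    n = len(points)
    if n < 2:
        return 0.0
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known connection of each point to the tree
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        # Pick the point outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        for v in range(n):
            if not in_tree[v]:
                d = abs(points[u][0] - points[v][0]) + abs(points[u][1] - points[v][1])
                if d < best[v]:
                    best[v] = d
    return total


def clustering_cost(clusters: List[List[Point]], alpha: float) -> float:
    """Sum over clusters of (alpha + MST length of the cluster)."""
    return sum(alpha + mst_length(c) for c in clusters)
```

For example, `clustering_cost([[(0, 0), (1, 0)], [(5, 5)]], alpha=2.0)` evaluates to `5.0`: two clusters contribute 2.0 each, and the only wire is the unit-length edge in the first cluster.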

14 Outline: Introduction, Problem Formulation, Clustering Algorithm, Experimental Results, Conclusion.

15 Power-aware Clustering Algorithm: Similar to Kruskal’s MST construction algorithm. Steps in the algorithm: – Create complete graph G(S, E, W). – Assign each edge its estimated capacitance as the weight. – Create a trivial solution with each cluster containing one receiver. – For each edge, in ascending order of weight: merge clusters until the cost function is minimized.
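The steps above can be sketched as a Kruskal-style loop over a union-find structure. This is an illustrative sketch, not the PoAwCl implementation: the merge rule (merge only while the edge weight is below a per-cluster constant `alpha` and the merged cluster stays within `max_receivers`) is an assumption about what "until the cost function is minimized" means, and the function name and signature are hypothetical. Only the receiver-count constraint is modeled; the load and bounding-box constraints are omitted.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int, float]  # (receiver u, receiver v, estimated capacitance)


def power_aware_clusters(
    n: int, edges: List[Edge], alpha: float, max_receivers: int
) -> Tuple[List[List[int]], float]:
    """Greedy Kruskal-like clustering; returns (clusters, sum of accepted edge weights)."""
    parent = list(range(n))  # trivial solution: one cluster per receiver
    size = [1] * n

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    total_wire = 0.0
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        if w >= alpha:
            break  # assumed rule: merging no longer lowers alpha + MST cost
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # already in the same cluster
        if size[ru] + size[rv] > max_receivers:
            continue  # capacity constraint would be violated
        parent[rv] = ru
        size[ru] += size[rv]
        total_wire += w

    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values()), total_wire
```

As a usage example, take four receivers with edge weights 1, 2, 4, 4, 5, 5 assigned to node pairs in a hypothetical way: `edges = [(0, 1, 1), (1, 2, 2), (0, 2, 4), (2, 3, 4), (0, 3, 5), (1, 3, 5)]`. Calling `power_aware_clusters(4, edges, alpha=10, max_receivers=3)` returns `([[0, 1, 2], [3]], 3.0)`: the two lightest edges are accepted, and every later merge is rejected by the receiver limit.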

16 Example. Constraint: maximum # of receivers = 3. [Figure: example graph; legend: a cluster, an edge, the weight; edge weights 1, 2, 4, 4, 5, 5.]

17 Example. Constraint: maximum # of receivers = 3. [Figure: example graph, next step; edge weights 1, 2, 4, 4, 5, 5.]

18 Example. Constraint: maximum # of receivers = 3. [Figure: example graph, next step; edge weights 1, 2, 4, 4, 5, 5.]

19 Example. Constraint: maximum # of receivers = 3. Power-aware clustering results in clusters with a total MST value of 3, which is optimal in this case. [Figure: example graph, final clusters; edge weights 1, 2, 4, 4, 5, 5.]

20 Optimality, Time Complexity of Algorithm: Ensures optimality when no capacity constraints (max. load, # of receivers) are specified. – Reduces to the minimum spanning forest problem. Runs in $O(n^2 \log n)$ time in the number of receivers $n$. – Handles blocks with ~5K sequentials easily. – 1.34 seconds for clustering of 1037 sequentials. Run-times are practical and comparable to competing algorithms. – Clock buffer duplication takes minutes on blocks with ~5K sequentials.
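The stated bound is consistent with a Kruskal-style pass over a complete graph. As a sketch, assuming the dominant cost is sorting the edges (the slide does not say where the time goes):

```latex
\[
|E| = \binom{n}{2} = \frac{n(n-1)}{2}, \qquad
O\bigl(|E| \log |E|\bigr) = O\bigl(n^2 \log n^2\bigr) = O\bigl(n^2 \log n\bigr).
\]
```

For the 1037-sequential block quoted above, this gives $|E| = 1037 \cdot 1036 / 2 = 537{,}166$ candidate edges.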

21 Outline: Introduction, Problem Formulation, Clustering Algorithm, Experimental Results, Conclusion.

22 Evaluation of Power-Aware Clustering (PoAwCl): Implemented the clustering algorithm, PoAwCl, in C++. Incorporated it in the clock buffer duplication step using TCL. Rest of the CTS kept unchanged. Generated clock trees on microprocessor blocks by changing only the clustering/partitioning heuristics. Best of the results compared with PoAwCl.

23 Results on Clock Trees: Int. Cap. Improvement. 13% average improvement.

24 Results on Clock Trees: Total Cap. Improvement. 6% average improvement.

25 Results on Clock Trees: Wirelength Improvement. 11% average improvement.

26 Looking at Cluster Pictures: ●, +, *, ▼ denote locations of sequentials; symbols of the same type denote a cluster. The 4 clusters, in each case, represent 4 clock buffers driving the sequentials in their clusters. [Figure panels: clustering aimed at minimizing # of buffers; power-aware clustering.]

27 Viewing the Routing: Power-aware clustering (on right) results in smaller wirelength.

28 Agenda: Introduction, Motivation, Problem Formulation, Clustering Algorithm, Experimental Results, Conclusion.

29 Conclusion: Power-aware clustering results in a 13% improvement in interconnect cap. It also frees up routing resources by 11%, discounting shielding and spacing of clock wires. Used for other applications such as enable logic (or clock gating) synthesis and trunk routing. Acknowledgment: Intel’s CAD Organization, for providing the source code of the CTS package, which sped up the development.

30 Thank you.

