Andrew B. Kahng and Xu Xu UCSD CSE and ECE Depts.

Presentation transcript:

Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning Andrew B. Kahng and Xu Xu UCSD CSE and ECE Depts. Work supported in part by MARCO GSRC

Outline (current section: Motivation): Motivation; Performance-driven bipartition problem; New bipartitioning algorithm; Experimental results; Conclusion and future work

Partitioning and Performance: The goal of traditional hypergraph partitioning is to minimize cutsize. To meet the performance requirements of current designs, we need a performance-driven partitioner that considers both cutsize and delay.

Previous Work (I) [Cong et al. ISPD-2002]: Global clustering-based algorithm with retiming. Stages: min-delay clustering with retiming; min-cutsize clustering; de-clustering and refinement. Reduces delay by 16% while increasing cutsize by 17%. Requires substantial gate replication.

Previous Work (II) [Ababei et al. ICCAD-2002]: Reweighting-based method. Global timing analysis finds critical paths; the input is reweighted (path-based or net-based); a cutsize-oriented partitioner such as hMetis or MLPart is then run. 14% reduction of delay with 10% increase in cutsize. 139% increase in runtime compared with hMetis.

Motivating Questions: (1) Can we avoid global timing analysis? Global timing analysis is extremely time-consuming. (2) Can we improve path delay without significantly degrading cutsize? We need a smooth tradeoff between delay and cutsize. (3) Can we reduce implementation overheads? Previous methods store thousands of critical paths and continuously update them.

Outline Motivation Performance driven bipartition problem New bipartitioning algorithm Experimental results Conclusion and future work

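Delay Model: The delay of a path is the sum of its hop delays and node delays, Delay = hop_delay + node_delay, where a hop is a crossing of the cut between Part 0 and Part 1. [Cong et al. ISPD-2002]: hop_delay = 5, node_delay = 1; for the example path, Delay = 3x5 + 5x1 = 20. [Ababei et al. ICCAD-2002]: hop_delay = Elmore delay, node_delay = constant. [Figure: a path of FF nodes and combinational nodes crossing the cut between Part 0 and Part 1.]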
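The arithmetic of the [Cong et al. ISPD-2002] example can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `path_delay`, the list-of-part-indices representation, and the reading of "3x5 + 5x1" as three hops and five nodes are assumptions.

```python
from typing import Sequence


def path_delay(parts: Sequence[int], hop_delay: float = 5.0,
               node_delay: float = 1.0) -> float:
    """Delay of one path under a hop/node delay model.

    parts lists the part index (0 or 1) of each node along the path, in
    order. A hop is counted whenever two consecutive nodes lie in
    different parts. (Hypothetical helper; representation is an assumption.)
    """
    hops = sum(1 for a, b in zip(parts, parts[1:]) if a != b)
    return hops * hop_delay + len(parts) * node_delay


# A five-node path that crosses the cut three times.
print(path_delay([0, 1, 1, 0, 1]))  # 3*5 + 5*1 = 20.0
```

Swapping in a different `hop_delay` (for instance an Elmore-style estimate) leaves the structure unchanged, which is why the two delay models on the slide can share one formula.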

Performance Driven Bipartition Problem. Given: a hypergraph H=(V,E); an area balance tolerance s (0<s<1), a parameter that controls the allowable slack in the area constraint; and a, a given parameter that captures the tradeoff between cutsize and path delay (hopcount). Find: a bipartition (V0|V1) that satisfies the area balance constraint and minimizes a(cutsize)+(1-a)(Max_hopcount).
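A small sketch of the objective may help make the tradeoff concrete. The names `cutsize` and `objective`, the net-as-tuple representation, and passing the maximum hopcount in as a precomputed number are assumptions made for illustration; the area balance constraint is not modeled here.

```python
from typing import Dict, Hashable, Iterable, Sequence


def cutsize(nets: Iterable[Sequence[Hashable]], part: Dict[Hashable, int]) -> int:
    """Number of nets (hyperedges) with nodes in both parts."""
    return sum(1 for net in nets if len({part[v] for v in net}) > 1)


def objective(nets, part, max_hopcount: float, a: float) -> float:
    """a * cutsize + (1 - a) * Max_hopcount, with 0 <= a <= 1.

    max_hopcount is supplied by the caller (hypothetical interface).
    """
    return a * cutsize(nets, part) + (1 - a) * max_hopcount


nets = [("a", "b"), ("b", "c", "d"), ("c", "d")]
part = {"a": 0, "b": 0, "c": 1, "d": 1}
print(cutsize(nets, part))                 # 1: only net (b, c, d) is cut
print(objective(nets, part, 1, a=0.5))     # 0.5*1 + 0.5*1 = 1.0
```

With a = 1 the objective reduces to pure cutsize minimization, and with a = 0 it reduces to pure hopcount minimization.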

Outline (current section: New bipartitioning algorithm): Motivation; Performance-driven bipartition problem; New bipartitioning algorithm; Experimental results; Conclusion and future work

Unidirectional Partition: Path delay is minimized, with hopcount = 1, if the partition is unidirectional (“acyclic”), that is, if all cuts are in the same direction. Problems: high cutsize; a unidirectional solution may not exist. Can we achieve a “locally unidirectional” partition? [Figure: two bipartitions into Part 0 and Part 1, one with max hopcount = 5 and one with max hopcount = 3.]

V-Shaped Nodes: A combinational node v is a V-shaped node if there exist nodes vj and vt in the other part and a path from vj to vt that includes only v. [Figure: vj and vt in Part 0, v in Part 1, with the path vj, v, vt crossing the cut twice.]
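The definition translates into a simple local check: v needs at least one fan-in and at least one fan-out in the other part. The sketch below assumes a directed graph given as a successor map; the function name `v_shaped_nodes` and the `ff` argument used to skip flip-flop nodes are hypothetical.

```python
from typing import Dict, Hashable, Iterable, Set


def v_shaped_nodes(succ: Dict[Hashable, Iterable[Hashable]],
                   part: Dict[Hashable, int],
                   ff: Set[Hashable] = frozenset()) -> Set[Hashable]:
    """Combinational nodes with a predecessor and a successor in the other part."""
    pred = {v: [] for v in part}
    for u, outs in succ.items():
        for w in outs:
            pred[w].append(u)
    result = set()
    for v in part:
        if v in ff:
            continue  # only combinational nodes qualify
        enters = any(part[u] != part[v] for u in pred[v])
        leaves = any(part[w] != part[v] for w in succ.get(v, ()))
        if enters and leaves:
            result.add(v)
    return result


succ = {"vj": ["v"], "v": ["vt"], "vt": []}
part = {"vj": 0, "v": 1, "vt": 0}
print(v_shaped_nodes(succ, part))  # {'v'}
```

Because only the immediate neighbors of v are inspected, the check needs no global timing analysis.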

V-Shaped Nodes in Critical Paths: Empirical observations from a study of partitioning solutions: there are V-shaped nodes in the partitioning solutions; every V-shaped node is included in many critical paths; every critical path contains several V-shaped nodes. For testcase 1: number of nets: 16377; number of critical paths: 26772. On average, one critical path contains 27.6 nodes, of which 3.4 are V-nodes. On average, one V-node belongs to 233.7 critical paths.

Key Idea: V-Shaped Nodes Elimination. Move V-shaped node “b” to reduce path hopcount. Before the move: path abc has hopcount=2, path dbc has hopcount=1, path ebc has hopcount=1. After the move: path abc has hopcount=0, path dbc has hopcount=1, path ebc has hopcount=1. [Figure: nodes a, b, c, d, e, f split between Part 0 and Part 1, before and after moving b.]
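The hopcounts on this slide can be reproduced with a few lines. The part assignment used below (a and c in Part 0; b, d and e in Part 1 before the move) is an assumption inferred from the stated hopcounts, and `hopcount` is a hypothetical helper name.

```python
from typing import Dict, Sequence


def hopcount(path: Sequence[str], part: Dict[str, int]) -> int:
    """Number of cut crossings along a path."""
    return sum(1 for u, w in zip(path, path[1:]) if part[u] != part[w])


paths = ["abc", "dbc", "ebc"]
before = {"a": 0, "c": 0, "b": 1, "d": 1, "e": 1}  # assumed placement
after = dict(before, b=0)                           # move b to Part 0

print({p: hopcount(p, before) for p in paths})  # {'abc': 2, 'dbc': 1, 'ebc': 1}
print({p: hopcount(p, after) for p in paths})   # {'abc': 0, 'dbc': 1, 'ebc': 1}
```

The move removes both crossings on path abc while leaving the other two paths at one crossing each, so the maximum hopcount drops from 2 to 1.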

Distance-k V-Shaped Nodes Elimination: Distance-k V-shaped nodes (Vk nodes) are paths of k combinational nodes with neighbors in the other part. Example with k = 2: moving the V2 nodes “b, c” reduces the path hopcount from 2 to 0. Problem with large k: cutsize may be greatly increased. [Figure: nodes a and d in Part 0, nodes b and c in Part 1, before and after moving b and c.]

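New Gain Function: The gain of a node v combines g(v), the traditional FM gain, with rj(v), the reduction of Vj nodes after moving v, weighted by δ(0) and δ(j) respectively. In the slide's example, Gain(v)=δ(0)+δ(1). [Figure: node v before and after the move.]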
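One way to read this slide is as a weighted sum, Gain(v) = δ(0)·g(v) + Σ_j δ(j)·rj(v). The transcript does not preserve the general formula, so the weighted-sum form in the sketch below is an assumption, chosen because it reproduces the slide's example Gain(v)=δ(0)+δ(1) when g(v)=1 and r1(v)=1. The name `biased_gain` is hypothetical.

```python
from typing import Sequence


def biased_gain(g: float, r: Sequence[float], delta: Sequence[float]) -> float:
    """Assumed form: delta[0]*g + sum_j delta[j]*r_j.

    g      -- traditional FM gain of the node
    r      -- r[j-1] holds r_j(v), the reduction of Vj nodes after the move
    delta  -- delta[0] weights g; delta[j] weights r_j (needs len(r)+1 entries)
    """
    return delta[0] * g + sum(delta[j] * rj for j, rj in enumerate(r, start=1))


# Slide example: g(v) = 1 and r_1(v) = 1 give delta(0) + delta(1).
print(biased_gain(1, [1], delta=[1, 10]))          # 11
# A move that worsens the cut by 1 but removes a V1 node can still win.
print(biased_gain(-1, [1, 0], delta=[1, 30, 3]))   # 29
```

The δ values used here are the settings reported in the experimental slides (δ(0)=1, δ(1)=10 and δ(0)=1, δ(1)=30, δ(2)=3).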

Distance-k Unidirectional Algorithm
1. Calculate initial gains for all nodes and store the gains.
2. Select the node v with maximum gain. /* CLIP-like method: move the cluster that v belongs to */
3. Reset the gains of all nodes to zero.
4. Move v and update the gains of v and its neighbors.
5. While at least one node has not been moved: select the node v with the maximum updated gain; move v and update the related gains.
6. Find the point in the move sequence at which the sum of gains is maximum; undo all moves after this point.
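The pass structure above can be sketched on a toy problem. The code below is a simplified illustration under several assumptions, not the authors' implementation: it works on a plain undirected graph rather than a hypergraph, uses only the cut gain (no Vj terms), recomputes gains from scratch instead of updating them incrementally, and enforces balance with a hypothetical `min_size` rule. The CLIP-like step is modeled by ranking nodes, after the first move, on the change in their gain since the start of the pass.

```python
from typing import Dict, List, Tuple


def cut_size(adj: Dict[int, List[int]], part: Dict[int, int]) -> int:
    """Number of edges whose endpoints lie in different parts."""
    return sum(1 for u in adj for w in adj[u] if u < w and part[u] != part[w])


def move_gain(adj, part, v) -> int:
    """Cut reduction if v alone switches sides (plain-graph FM gain)."""
    external = sum(1 for u in adj[v] if part[u] != part[v])
    return external - (len(adj[v]) - external)


def clip_like_pass(adj, part, min_size: int = 1) -> Tuple[Dict[int, int], int]:
    part = dict(part)
    size = [sum(1 for p in part.values() if p == side) for side in (0, 1)]
    initial = {v: move_gain(adj, part, v) for v in adj}  # step 1
    unlocked = set(adj)
    moves: List[Tuple[int, int]] = []  # (node, actual gain when moved)
    while True:
        # A node may move only if its part keeps at least min_size nodes.
        legal = sorted(v for v in unlocked if size[part[v]] - 1 >= min_size)
        if not legal:
            break
        if not moves:  # step 2: first pick uses the initial gains
            v = max(legal, key=lambda u: initial[u])
        else:          # steps 3-5: gains were zeroed, so rank by the update
            v = max(legal, key=lambda u: move_gain(adj, part, u) - initial[u])
        gain = move_gain(adj, part, v)
        size[part[v]] -= 1
        part[v] = 1 - part[v]
        size[part[v]] += 1
        unlocked.remove(v)  # each node moves at most once per pass
        moves.append((v, gain))
    # Step 6: keep the prefix of moves with the largest total gain.
    best_sum = running = best_len = 0
    for i, (_, gain) in enumerate(moves, start=1):
        running += gain
        if running > best_sum:
            best_sum, best_len = running, i
    for v, _ in moves[best_len:]:
        part[v] = 1 - part[v]  # undo everything after the best point
    return part, best_sum


# Two triangles joined by edge 2-3, with nodes 2 and 3 on the wrong sides.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
start = {0: 0, 1: 0, 3: 0, 2: 1, 4: 1, 5: 1}
result, gained = clip_like_pass(adj, start, min_size=2)
print(cut_size(adj, start), cut_size(adj, result), gained)
```

Because the rollback keeps only the best prefix, the returned partition never has a larger cut than the starting one.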

Outline (current section: Experimental results): Motivation; Performance-driven bipartition problem; New bipartitioning algorithm; Experimental results; Conclusion and future work

Experimental Setup: Four industry testcases obtained as LEF/DEF. The model of Ababei et al. (ICCAD-2002) is used to calculate delay. Partitioning solutions are compared to the results of MLPart, the strongest multilevel netlist partitioning code (website: http://nexus6.cs.ucla.edu/GSRC/bookshelf/Slots/Partitioning/MLPart). All tests run on a 600MHz Intel Pentium-III Xeon.

Biasing against V1 Nodes vs. MLPart (δ(0)=1, δ(1)=10)

            MLPart                            MLPart + V-shaped nodes removal
Testcase    cutsize  h    delay  time(s)      cutsize  h    delay  time(s)
1           820.7    5.3  352.8  11.79        856.1    3.3  266.8  12.58
2           169.9    3.5  220.7  13.45        189.8    2.5  211.2  15.32
3           141.3    --   291.6  16.67        152.3    2.3  283.6  18.27
4           408.7    --   302.6  12.43        421.2    3.6  252.7  14.03
(-- = value missing from the transcript)

Reduction of delay: 4.5%-24.4%, average 15.1%. Increase of cutsize: 3.0%-10.0%, average 4.9%. Increase of runtime: 6.3%-11.4%, average 9.7%. Using the delay model in Cong et al. ISPD-2002, reduction of delay: 4.3%-21.2%, average 14.7%.

Biasing against V2 Nodes vs. MLPart (δ(0)=1, δ(1)=30, δ(2)=3)

            MLPart                            MLPart + Vk=2 nodes removal
Testcase    cutsize  h    delay  time(s)      cutsize  h    delay  time(s)
1           820.7    5.3  352.8  11.79        847.5    3    262.1  13.16
2           169.9    3.5  220.7  13.45        183.2    --   202.5  15.67
3           141.3    --   291.6  16.67        149.2    --   275.6  18.92
4           408.7    --   302.6  12.43        416.7    3.4  243.5  14.79
(-- = value missing from the transcript)

Reduction of delay: 8.9%-30.0%, average 18.7%. Increase of cutsize: 3.1%-7.2%, average 3.5%. Increase of runtime: 11.9%-15.9%, average 13.1%. Using the delay model in Cong et al. ISPD-2002, reduction of delay: 8.3%-28.7%, average 17.3%.

Outline (current section: Conclusions and future work): Motivation; Performance-driven bipartition problem; New bipartitioning algorithm; Experimental results; Conclusions and future work

Conclusions: Simple yet efficient timing-driven partitioning that does not require global timing analysis. Negligible implementation and runtime overhead. Significantly reduces path delay, with cutsize and runtime almost the same as leading-edge MLPart. Similar improvements observed with different path delay metrics. Future work: impact of the new partitioner on placement; efficient methods for biasing δ(k), k>2.

Thank you!

Future Work: Impact of the new partitioner on placement; efficient methods for biasing δ(k), k>2.

Why Performance Driven Partitioning? Achieving timing closure becomes increasingly difficult in deep-submicron technologies due to non-ideal scaling of interconnect delay. Routing alone can no longer solve the timing problem, even with aggressive optimizations (buffer insertion, buffer/wire sizing, …). Timing needs to be addressed at all design stages. Partitioning is a critical step in defining interconnect timing properties, but is traditionally driven by a cutsize objective.

Previous Work (I): With logic replication: retiming; replication graph. Without logic replication: net-based reweighting; path-based reweighting.

FM Partitioning and Gain Function: Start with a random partition. Move the node with the max gain and lock it. Keep moving until all nodes are locked. Find the best point in the move sequence. Gain(v) = reduction of cutsize after moving v; in the slide's example, Gain(v)=-1. [Figure: node v in Part 0 and Part 1, before and after the move.]

Procedure to Calculate rj(v): Delete all FF nodes and their related edges. In the remaining graph, run a BFS from v. For each level j from 1 to k: if v is a Vj node before moving, rj’=1 (otherwise 0); if v is a Vj node after moving, rj’’=1 (otherwise 0); then rj=rj’-rj’’, so that a positive rj is a reduction of Vj nodes.
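A possible realization of this procedure is sketched below. Several choices are assumptions: v is taken to be "a Vj node" when it lies on a directed chain of exactly j combinational nodes in its own part whose first node has a fan-in from the other part and whose last node has a fan-out to the other part; the remaining graph is assumed acyclic once FF nodes are deleted; and rj is computed as "before minus after" so that a positive value means a reduction, in line with the description of rj(v) as a reduction of Vj nodes. The names `r_values` and `_touch_depths` are hypothetical.

```python
from typing import Dict, Hashable, Iterable, List, Set


def _touch_depths(nbrs, part, v, k) -> Set[int]:
    """Depths d < k at which a same-part chain from v borders the other part."""
    depths, frontier = set(), {v}
    for d in range(k):  # level-by-level BFS restricted to v's part
        if any(part[u] != part[v] for x in frontier for u in nbrs[x]):
            depths.add(d)
        frontier = {u for x in frontier for u in nbrs[x] if part[u] == part[v]}
    return depths


def r_values(succ: Dict[Hashable, Iterable[Hashable]], part: Dict[Hashable, int],
             ff: Set[Hashable], v: Hashable, k: int) -> List[int]:
    """Return [r_1, ..., r_k] for a combinational node v (v must not be in ff)."""
    nodes = [n for n in part if n not in ff]          # delete FF nodes
    s = {n: [w for w in succ.get(n, ()) if w not in ff] for n in nodes}
    p = {n: [] for n in nodes}
    for n in nodes:
        for w in s[n]:
            p[w].append(n)

    def flags(pt) -> List[int]:
        back, fwd = _touch_depths(p, pt, v, k), _touch_depths(s, pt, v, k)
        # A chain with a nodes before v and b nodes after v has a + b + 1 nodes.
        return [int(any(a + b + 1 == j for a in back for b in fwd))
                for j in range(1, k + 1)]

    before = flags(part)
    after = flags({**part, v: 1 - part[v]})           # v moved to the other part
    return [x - y for x, y in zip(before, after)]


# Chain a -> b -> c -> d with a, d in Part 0 and b, c in Part 1.
succ = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
part = {"a": 0, "b": 1, "c": 1, "d": 0}
print(r_values(succ, part, ff=set(), v="b", k=2))  # [0, 1]
```

In the example, b is not a V1 node but sits on a V2 chain (b, c); moving b alone breaks that chain, so r2 = 1. The sketch only reports the status of v itself, as the procedure on the slide does.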

CLIP Algorithm: The method is reminiscent of CLIP (Deng et al. DAC 1996) in how it induces movement of clusters across the cutline.

Distance-k V-Shaped Nodes: Distance-k V-shaped nodes (Vk-nodes): if k combinational nodes vi,1 … vi,k satisfy (1) vi,1 … vi,k are in the same part, (2) there exist vj, vt in the other part, and (3) there exists a path from vj to vt that passes only through vi,1 … vi,k, then vi,1 … vi,k are distance-k V-shaped nodes. [Figure: vj and vt in Part 0; vi,1 … vi,k in Part 1.]

Notation: H(V,E) = circuit hypergraph; V = set of nodes representing components of the circuit; E = set of signal nets. A bipartition (V0|V1) of H(V,E) divides V into two disjoint subsets such that V = V0 ∪ V1, which are called Part 0 and Part 1. A = the total area of all the nodes in V; A0 = the area of all the nodes in V0.