Global Clustering-Based Performance-Driven Circuit Partitioning Jason Cong University of California Los Angeles Chang Wu Aplus Design.

Presentation transcript:

Global Clustering-Based Performance-Driven Circuit Partitioning Jason Cong University of California Los Angeles Chang Wu Aplus Design Technologies Los Angeles

Problem Definition
Problem: k-way circuit partitioning and retiming with balanced area for delay minimization
- Delay minimization with consideration of cutsize
- Retiming is performed simultaneously with partitioning for the best possible delay reduction
- Generic delay model: node delay $d_v$, intra-block delay $d$, inter-block delay $D$, with $D > d$
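The generic delay model can be illustrated with a minimal Python sketch. It assumes a purely combinational DAG (no flip-flops, no retiming), and the names `circuit_delay`, `block`, `d_intra` and `d_inter` are hypothetical, chosen here for illustration rather than taken from the slides.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def circuit_delay(node_delay, edges, block, d_intra, d_inter):
    """Longest-path delay of a combinational DAG under the generic delay model.

    node_delay: dict node -> d_v
    edges:      list of (u, v) pairs, signal flows from u to v
    block:      dict node -> block id assigned by the partitioner
    d_intra:    delay d of an edge whose endpoints share a block
    d_inter:    delay D of an edge that crosses blocks (D > d)
    """
    preds = {v: [] for v in node_delay}
    for u, v in edges:
        preds[v].append(u)

    arrival = {}
    # static_order() yields every node after all of its predecessors.
    for v in TopologicalSorter(preds).static_order():
        latest_input = 0.0
        for u in preds[v]:
            wire = d_intra if block[u] == block[v] else d_inter
            latest_input = max(latest_input, arrival[u] + wire)
        arrival[v] = latest_input + node_delay[v]
    return max(arrival.values())


# A three-node chain a -> b -> c with unit node delays, d = 1 and D = 5.
delays = {"a": 1, "b": 1, "c": 1}
chain = [("a", "b"), ("b", "c")]
same_block = circuit_delay(delays, chain, {"a": 0, "b": 0, "c": 0}, 1, 5)
cut_once = circuit_delay(delays, chain, {"a": 0, "b": 0, "c": 1}, 1, 5)
```

Cutting the single edge b -> c replaces one intra-block delay with an inter-block delay, which is the effect a performance-driven partitioner tries to keep off critical paths.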

Existing Approaches
Clustering-based approaches
- PRIME: group nodes into clusters with a given area bound
  - Quasi-optimal delay solution with node duplication
  - Huge cutsize (3X)
Partitioning-based approaches
- Partition circuits into k blocks and then iteratively move nodes to improve further
- Cutsize minimization: hMetis
  - Multi-level partitioning, very fast, excellent cutsize, fair circuit delay
- Delay minimization: HPM
  - Performance-driven clustering + cutsize-driven partitioning, tradeoff between delay and cutsize

Existing Approaches (cont)
Clustering-based approaches
- Delay optimization with node duplication is solved optimally
- Node duplication-free clustering is NP-complete, but fairly good results are obtained by resolving duplications heuristically
- Huge cutsize
Partitioning-based approaches
- Very good cutsize
- Difficulty with delay minimization: the delay update for each node move is too costly (linear time)
- hMetis: does not consider delay directly; gradual coarsening is difficult to target at delay
- HPM: clustering and partitioning are separate; clustering does not know its impact on cutsize, and partitioning has little control over delay

HPM: Combination of Clustering and Partitioning
HPM by Cong, et al. [DAC99]
- Clustering followed by partitioning
  - Good balance of delay and cutsize
- Clustering and partitioning are two completely separate steps
  - Clustering uses a very small, fixed area bound (10) on each block: much less than A/K, where A is the circuit area
  - Achieves inferior delay compared with clustering under a cluster area bound of A/K (delay is ~23% larger)
  - Achieves larger cutsize than hMetis because the clustering constraints limit the cutsize reduction available to partitioning
- A better solution is needed

Multi-Level Partitioning for Cutsize
hMetis by Karypis, et al. [DAC97]
- Gradual coarsening to group tightly connected nodes together
- Uncoarsening gradually and reducing cutsize by moving clusters
  - Fast algorithm: reduced solution space at each level, as many nodes are grouped and moved together
  - Smaller cutsize: a more thorough search is possible in the reduced solution space
  - Hyperedge-based coarsening is very suitable for cutsize
- Delay is completely ignored

Existing Multi-level Optimization Engine
V-shape multi-level optimization used in hMetis
- Not very suitable for delay minimization
  - Gradual coarsening has difficulty predicting its impact on delay

MLPR: Performance-Driven Multi-Level Partitioning and Retiming
K-way partitioning algorithm for performance optimization
- Retiming is performed during partitioning for the best possible circuit delay
- Cutsize reduction is also considered
MLPR
- Clustering with area bound of $A/K$, where $A$ is the circuit area
- Partitioning of clusters into $K$ blocks
- For level from 1 to $\log(A/K)$
  - Clustering with area bound of $A/(K \cdot 2^{\text{level}})$
    - Each cluster is bounded by the block it belongs to
  - Moving clusters to reduce cutsize while preserving circuit delay
- Final movement of individual nodes for the best solution
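The shrinking area bound in the MLPR loop can be tabulated with a short sketch. It assumes the logarithm is base 2 (the slide does not state the base, but the bound halves at each level) and that level 0 stands for the initial clustering with bound A/K; the function name `area_bound_schedule` is hypothetical.

```python
import math


def area_bound_schedule(total_area, k):
    """Cluster area bound used at each level of the declustering loop.

    Level 0 is the initial global clustering with bound A/K; levels
    1..floor(log2(A/K)) use the bound A / (K * 2**level).
    """
    last_level = max(0, math.floor(math.log2(total_area / k)))
    return [total_area / (k * 2 ** level) for level in range(last_level + 1)]


# A = 64 area units split into K = 4 blocks.
bounds = area_bound_schedule(64, 4)
```

Under these assumptions the bound falls from the full block area down to a single area unit, at which point the final pass moves individual nodes.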

Our Contribution: Global Clustering Based Multi-Level Optimization Engine
- Start directly from the coarsest level with global clustering for the best possible delay
- Clustering-based gradual declustering to increase the freedom for refinement
- Retiming is considered simultaneously during clustering and partitioning for smaller delay

Global Clustering for Delay Minimization
Clustering: group nodes into clusters with area no more than a given bound
CLUS by Pan, et al. [TCAD98]; PRIME by Cong, et al. [DAC99]
- Quasi-optimal clustering with retiming for delay minimization
By setting the area bound to A/K, clustering can compute a partitioning solution with quasi-optimal delay
- Existing coarsening algorithms that consider only local node connectivity cannot predict circuit delay
Theorem: Let $\phi_c$ be the circuit delay of a clustering solution. For any partitioning solution $P$ on the clusters, its delay is less than or equal to $\phi_c$
- Clustering can compute an upper bound on circuit delay after partitioning
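One way to read the theorem is sketched below under the generic delay model (intra-block delay $d$, inter-block delay $D > d$) and for a fixed register placement. The symbols $\delta_c$, $\delta_P$ and $\phi_P$ are introduced here for illustration and do not appear in the slides; treat this as a plausible argument, not the authors' proof.

```latex
% Edge delay under the clustering solution and under a partition P of the clusters:
\delta_c(e) =
  \begin{cases} d & \text{if both ends of } e \text{ are in the same cluster} \\
                D & \text{otherwise} \end{cases}
\qquad
\delta_P(e) =
  \begin{cases} d & \text{if both ends of } e \text{ are in the same block} \\
                D & \text{otherwise} \end{cases}

% P keeps every cluster inside one block, so an edge internal to a cluster
% stays internal to a block, and an inter-cluster edge costs at most D:
\delta_P(e) \le \delta_c(e) \quad \text{for every edge } e.

% Hence, for every path p,
\sum_{v \in p} d_v + \sum_{e \in p} \delta_P(e)
  \;\le\; \sum_{v \in p} d_v + \sum_{e \in p} \delta_c(e),
\qquad \text{and taking the maximum over paths gives } \phi_P \le \phi_c .
```

Because the inequality holds edge by edge for any register placement, it would also carry over to the minimum over retimings.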

Global Clustering-Based Optimization Engine
Start from the coarsest level with clustering to define a good circuit delay
- Comparison: coarsening with gradually increased cluster size has difficulty predicting the circuit delay after partitioning on clusters
Clustering with a gradually reduced area bound to decluster at each level
- Nodes on a critical path will be grouped together and will NOT be split across different partitions
- Avoid delay increase during partitioning refinement as much as possible
Partition-bounded clustering to guarantee consistent solution improvement and algorithm convergence
- Guarantees that the solution at a finer level is no worse than at a coarser level

Partitioning with Retiming
Retiming is considered during clustering and partitioning at each level for the best possible circuit delay
- Sequential arrival time: $a_v = \max_{p} \sum_{e \in p} l(e)$ over paths $p$ ending at $v$, where $l(e) = d_v + d_e - \phi \cdot w_e$ for a given target clock period $\phi$, $d_v$ is the node delay of $v$, $d_e$ is the edge delay, and $w_e$ is the number of FFs on edge $e$ from $u$ to $v$
- Theorem [Pan98]: if $\max(a_{po}) \le \phi$, the minimum circuit delay after retiming is no more than $\phi + D$
- Timing analysis in both clustering and partitioning is based on sequential arrival time
- Binary search to get the minimum clock period after retiming
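The feasibility check and binary search can be sketched as follows. This is a minimal illustration, not the MLPR implementation: it assumes sequential arrival times are longest-path sums of $l(e)$ from the primary inputs, computes them with Bellman-Ford style relaxation, and treats a target period as infeasible when a cycle has positive total weight. All function and parameter names are hypothetical.

```python
def sequential_arrival_times(nodes, edges, node_delay, phi, primary_inputs):
    """Longest-path sums of l(e) = d_v + d_e - phi * w_e from the primary inputs.

    edges is a list of (u, v, edge_delay, num_ffs). Returns a dict of arrival
    times, or None if some cycle has positive weight (values would diverge).
    """
    neg_inf = float("-inf")
    arrival = {v: neg_inf for v in nodes}
    for pi in primary_inputs:
        arrival[pi] = 0.0
    for _ in range(len(nodes)):
        changed = False
        for u, v, d_e, w_e in edges:
            if arrival[u] == neg_inf:
                continue
            candidate = arrival[u] + node_delay[v] + d_e - phi * w_e
            if candidate > arrival[v] + 1e-12:
                arrival[v] = candidate
                changed = True
        if not changed:
            return arrival
    return None  # still relaxing after |V| passes: positive cycle


def min_feasible_phi(nodes, edges, node_delay, primary_inputs,
                     primary_outputs, lo, hi, tol=1e-6):
    """Binary search for the smallest phi with max(a_po) <= phi."""
    def feasible(phi):
        a = sequential_arrival_times(nodes, edges, node_delay, phi, primary_inputs)
        return a is not None and max(a[po] for po in primary_outputs) <= phi

    if not feasible(hi):
        return None
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible(mid):
            hi = mid
        else:
            lo = mid
    return hi


# Loop g1 -> g2 -> g1 with one flip-flop and total node delay 2 + 3 = 5.
nodes = ["pi", "g1", "g2", "po"]
node_delay = {"pi": 0, "g1": 2, "g2": 3, "po": 0}
edges = [("pi", "g1", 0, 0), ("g1", "g2", 0, 1),
         ("g2", "g1", 0, 0), ("g2", "po", 0, 0)]
phi_min = min_feasible_phi(nodes, edges, node_delay, ["pi"], ["po"], 0.0, 10.0)
```

The search relies on feasibility being monotone in $\phi$: raising the target period lowers every $l(e)$ on a registered edge and loosens the bound on the primary outputs. Per the theorem on the slide, the period achievable after retiming would then be at most the returned value plus $D$.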

Test Results
[Figure: result charts for 16-way partitioning and bi-partitioning]

Conclusion
- Global clustering is more suitable for delay minimization
- The global clustering-based multi-level optimization engine achieves good delay and cutsize
- Retiming further helps delay reduction
  - Retiming performed simultaneously with partitioning achieves better results than partitioning followed by separate retiming
  - Retiming is not required by the main algorithm and can be disabled