Parallelization by SimPLification: A Case Study in VLSI Placement Myung-Chul Kim, Dong-Jin Lee and Igor L. Markov Dept. of EECS, University of Michigan.

Slides:



Advertisements
Similar presentations
Porosity Aware Buffered Steiner Tree Construction C. Alpert G. Gandham S. Quay IBM Corp M. Hrkic Univ Illinois Chicago J. Hu Texas A&M Univ.
Advertisements

MIP-based Detailed Placer for Mixed-size Circuits Shuai Li, Cheng-Kok Koh ECE, Purdue University {li263,
Optimization of Placement Solutions for Routability Wen-Hao Liu, Cheng-Kok Koh, and Yih-Lang Li DAC’13.
1 Advancing Supercomputer Performance Through Interconnection Topology Synthesis Yi Zhu, Michael Taylor, Scott B. Baden and Chung-Kuan Cheng Department.
Natarajan Viswanathan Min Pan Chris Chu Iowa State University International Symposium on Physical Design April 6, 2005 FastPlace: An Analytical Placer.
X-Architecture Placement Based on Effective Wire Models Tung-Chieh Chen, Yi-Lin Chuang, and Yao-Wen Chang Graduate Institute of Electronics Engineering.
MAPLE: Multilevel Adaptive PLacEment for Mixed-Size Designs Myung-Chul Kim †, Natarajan Viswanathan ‡, Charles J. Alpert ‡, Igor L. Markov †, Shyam Ramji.
Meng-Kai Hsu, Sheng Chou, Tzu-Hen Lin, and Yao-Wen Chang Electronics Engineering, National Taiwan University Routability Driven Analytical Placement for.
A Size Scaling Approach for Mixed-size Placement Kalliopi Tsota, Cheng-Kok Koh, Venkataramanan Balakrishnan School of Electrical and Computer Engineering.
Ripple: An Effective Routability-Driven Placer by Iterative Cell Movement Xu He, Tao Huang, Linfu Xiao, Haitong Tian, Guxin Cui and Evangeline F.Y. Young.
SimPL: An Effective Placement Algorithm Myung-Chul Kim, Dong-Jin Lee and Igor L. Markov Dept. of EECS, University of Michigan 1ICCAD 2010, Myung-Chul Kim,
1 Physical Hierarchy Generation with Routing Congestion Control Chin-Chih Chang *, Jason Cong *, Zhigang (David) Pan +, and Xin Yuan * * UCLA Computer.
Coupling-Aware Length-Ratio- Matching Routing for Capacitor Arrays in Analog Integrated Circuits Kuan-Hsien Ho, Hung-Chih Ou, Yao-Wen Chang and Hui-Fang.
Chop-SPICE: An Efficient SPICE Simulation Technique For Buffered RC Trees Myung-Chul Kim, Dong-Jin Lee and Igor L. Markov Dept. of EECS, University of.
FastPlace: Efficient Analytical Placement using Cell Shifting, Iterative Local Refinement and a Hybrid Net Model FastPlace: Efficient Analytical Placement.
Low-power Clock Trees for CPUs Dong-Jin Lee, Myung-Chul Kim and Igor L. Markov Dept. of EECS, University of Michigan 1 ICCAD 2010, Dong-Jin Lee, University.
APLACE: A General and Extensible Large-Scale Placer Andrew B. KahngSherief Reda Qinke Wang VLSICAD lab University of CA, San Diego.
Boosting: Min-Cut Placement with Improved Signal Delay Andrew B. KahngSherief Reda CSE & ECE Departments University of CA, San Diego La Jolla, CA
38 th Design Automation Conference, Las Vegas, June 19, 2001 Creating and Exploiting Flexibility in Steiner Trees Elaheh Bozorgzadeh, Ryan Kastner, Majid.
Supply Voltage Degradation Aware Analytical Placement Andrew B. Kahng, Bao Liu and Qinke Wang UCSD CSE Department {abk, bliu,
Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning Andrew B. Kahng and Xu Xu UCSD CSE and ECE Depts. Work supported.
Placement Feedback: A Concept and Method for Better Min-Cut Placements Andrew B. KahngSherief Reda CSE & ECE Departments University of CA, San Diego La.
Solving the Protein Threading Problem in Parallel Nocola Yanev, Rumen Andonov Indrajit Bhattacharya CMSC 838T Presentation.
Can Recursive Bisection Alone Produce Routable Placements? Andrew E. Caldwell Andrew B. Kahng Igor L. Markov Supported by Cadence.
DUSD(Labs) GSRC bX update March 2003 Aaron Ng, Marius Eriksen and Igor Markov University of Michigan.
Processing Rate Optimization by Sequential System Floorplanning Jia Wang 1, Ping-Chih Wu 2, and Hai Zhou 1 1 Electrical Engineering & Computer Science.
A Resource-level Parallel Approach for Global-routing-based Routing Congestion Estimation and a Method to Quantify Estimation Accuracy Wen-Hao Liu, Zhen-Yu.
POLAR 2.0: An Effective Routability-Driven Placer Chris Chu Tao Lin.
CRISP: Congestion Reduction by Iterated Spreading during Placement Jarrod A. Roy†‡, Natarajan Viswanathan‡, Gi-Joon Nam‡, Charles J. Alpert‡ and Igor L.
CAD for Physical Design of VLSI Circuits
Software Pipelining for Stream Programs on Resource Constrained Multi-core Architectures IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEM 2012 Authors:
Horizontal Benchmark Extension for Improved Assessment of Physical CAD Research Andrew B. Kahng, Hyein Lee and Jiajia Li UC San Diego VLSI CAD Laboratory.
TSV-Aware Analytical Placement for 3D IC Designs Meng-Kai Hsu, Yao-Wen Chang, and Valerity Balabanov GIEE and EE department of NTU DAC 2011.
Solving Hard Instances of FPGA Routing with a Congestion-Optimal Restrained-Norm Path Search Space Keith So School of Computer Science and Engineering.
March 20, 2007 ISPD An Effective Clustering Algorithm for Mixed-size Placement Jianhua Li, Laleh Behjat, and Jie Huang Jianhua Li, Laleh Behjat,
VLSI Physical Design: From Graph Partitioning to Timing Closure Chapter 5: Global Routing © KLMH Lienig 1 EECS 527 Paper Presentation High-Performance.
Archer: A History-Driven Global Routing Algorithm Mustafa Ozdal Intel Corporation Martin D. F. Wong Univ. of Illinois at Urbana-Champaign Mustafa Ozdal.
UC San Diego / VLSI CAD Laboratory Incremental Multiple-Scan Chain Ordering for ECO Flip-Flop Insertion Andrew B. Kahng, Ilgweon Kang and Siddhartha Nath.
Seeing the Forest and the Trees: Steiner Wirelength Optimization in Placement Jarrod A. Roy, James F. Lu and Igor L. Markov University of Michigan Ann.
An Efficient Clustering Algorithm For Low Power Clock Tree Synthesis Rupesh S. Shelar Enterprise Microprocessor Group Intel Corporation, Hillsboro, OR.
Analytic Placement. Layout Project:  Sending the RTL file: −Thursday, 27 Farvardin  Final deadline: −Tuesday, 22 Ordibehesht  New Project: −Soon 2.
Building a Parallel File System Simulator E Molina-Estolano, C Maltzahn, etc. UCSC Lab, UC Santa Cruz. Published in Journal of Physics, 2009.
Tinoosh Mohsenin and Bevan M. Baas VLSI Computation Lab, ECE Department University of California, Davis Split-Row: A Reduced Complexity, High Throughput.
Multilevel Generalized Force-directed Method for Circuit Placement Tony Chan 1, Jason Cong 2, Kenton Sze 1 1 UCLA Mathematics Department 2 UCLA Computer.
Massachusetts Institute of Technology 1 L14 – Physical Design Spring 2007 Ajay Joshi.
1/24/20071 ECO-system: Embracing the Change in Placement Jarrod A. Roy and Igor L. Markov University of Michigan at Ann Arbor.
Placement. Physical Design Cycle Partitioning Placement/ Floorplanning Placement/ Floorplanning Routing Break the circuit up into smaller segments Place.
Jason Cong‡†, Guojie Luo*†, Kalliopi Tsota‡, and Bingjun Xiao‡ ‡Computer Science Department, University of California, Los Angeles, USA *School of Electrical.
Congestion Estimation and Localization in FPGAs: A Visual Tool for Interconnect Prediction David Yeager Darius Chiu Guy Lemieux The University of British.
Quadratic VLSI Placement Manolis Pantelias. General Various types of VLSI placement  Simulated-Annealing  Quadratic or Force-Directed  Min-Cut  Nonlinear.
I N V E N T I V EI N V E N T I V E A Morphing Approach To Address Placement Stability Philip Chong Christian Szegedy.
Physical Synthesis Comes of Age Chuck Alpert, IBM Corp. Chris Chu, Iowa State University Paul Villarrubia, IBM Corp.
Analytical Minimization of Signal Delay in VLSI Placement Andrew B. Kahng and Igor L. Markov UCSD, Univ. of Michigan
An Effective Congestion Driven Placement Framework André Rohe University of Bonn, Germany joint work with Ulrich Brenner.
Physical Synthesis Buffer Insertion, Gate Sizing, Wire Sizing,
Lecture 3 : Performance of Parallel Programs Courtesy : MIT Prof. Amarasinghe and Dr. Rabbah’s course note.
Multi-objective Placement Optimization for High-performance Nanoscale Integrated Circuits Igor L. Markov August 20, 2012.
1 NTUplace: A Partitioning Based Placement Algorithm for Large-Scale Designs Tung-Chieh Chen 1, Tien-Chang Hsu 1, Zhe-Wei Jiang 1, and Yao-Wen Chang 1,2.
High-Performance Global Routing with Fast Overflow Reduction Huang-Yu Chen, Chin-Hsiung Hsu, and Yao-Wen Chang National Taiwan University Taiwan.
International Symposium on Physical Design San Diego, CA April 2002ER UCLA UCLA 1 Routability Driven White Space Allocation for Fixed-Die Standard-Cell.
Design Automation Conference (DAC), June 6 th, Taming the Complexity of Coordinated Place and Route Jin Hu †, Myung-Chul Kim †† and Igor L. Markov.
May Mike Drob Grant Furgiuele Ben Winters Advisor: Dr. Chris Chu Client: IBM IBM Contact – Karl Erickson.
Effective Linear Programming-Based Placement Techniques Sherief Reda UC San Diego Amit Chowdhary Intel Corporation.
6/19/ VLSI Physical Design Automation Prof. David Pan Office: ACES Placement (3)
RTL Design Flow RTL Synthesis HDL netlist logic optimization netlist Library/ module generators physical design layout manual design a b s q 0 1 d clk.
A SimPLR Method for Routability-driven Placement
HeAP: Heterogeneous Analytical Placement for FPGAs
APLACE: A General and Extensible Large-Scale Placer
2 University of California, Los Angeles
EE5780 Advanced VLSI Computer-Aided Design
Presentation transcript:

Parallelization by SimPLification: A Case Study in VLSI Placement Myung-Chul Kim, Dong-Jin Lee and Igor L. Markov Dept. of EECS, University of Michigan 1PAPA2011, University of Michigan

Complexities of Parallel Algorithms & SW 1.Objectives of parallelization A. Improve completion time by using multiple cores in || B. Improve throughput by using stream processing (latency may increase and become less predictable) C. Improve power consumption (by decreasing clk rate) 2.Not an objective (a pitfall) −Come up with a slow algorithm that is easy to parallelize ■In this talk: how to accomplish 1.A without 2 −Take a leading algorithm and speed up its bottlenecks −Design a new algorithm that is (a) better, (b) easy to parallelize 2PAPA2011, University of Michigan

CAD Algorithms ■Sequence of optimizations −Subject to Amdahl’s law −The more the stages, the harder to parallelize effectively ■Additional complications −Elaborate data structures may entail overhead for parallel access −When processing is light, memory bandwidth may become a bottleneck (with 4+ threads) ■Recommendations −A simpler algorithm is often either to parallelize (fewer stages, simpler data structures) −Using standard solvers, e.g., linear algebra helps reuse previous work on parallelization 3PAPA2011, University of Michigan

Global Placement: Motivation ■Interconnect lagging in performance while transistors continue scaling −Circuit delay, power dissipation and area dominated by interconnect −Routing quality highly controlled by placement ■Circuit size and complexity rapidly increasing −Scalable placement algorithm is critical −Simplicity, integration with other optimizations 4PAPA2011, University of Michigan Unloaded Coupling IR drop RC delay

Goals in Placement ■Find good relative ordering of cells −Minimize wire length and congestion −Maximize timing slack ■Find good spacing of cells −Eliminate wiring congestion problems −Provide space for post placement stages –clock trees –buffer insertion –timing correction ■Find good global position 5PAPA2011, University of Michigan

ABC Optimize Relative Order 6PAPA2011, University of Michigan

ABC To spread... 7PAPA2011, University of Michigan

ABC.. or not to spread 8PAPA2011, University of Michigan

ABC Place to the left 9PAPA2011, University of Michigan

ABC … or to the right 10PAPA2011, University of Michigan

ABC Optimize Relative Order Without whitespace, placement is dominated by ordering 11PAPA2011, University of Michigan

Example of Global Placement (APlace 2.04 from UCSD)

Example of Global Placement (mFar from UCSB)

Placement Formulation ■Objective: Minimize estimated wirelength −Half-perimeter wirelength (HPWL) −  (max X – min X) + (max Y – min Y) ■Subject to constraints: −Legality: Row-based placement with no overlaps −Routability: Limiting local interconnect congestion for successful routing −Timing: Meeting performance target of a design 14PAPA2011, University of Michigan xx yy

Quadratic Placement ■Consider a graph first, not a hypergraph ■Minimize Σ(x i -x j ) 2 +(y i -y j ) 2 (the sum is over e ij ) −Seems unrelated to Σ |x i -x j |+|y i -y j | but can still be separated into x- and y- components ■Physical analogy: Hooke’s law −Consider an elastic spring, spread by x −Force F=-kx (k is the spring constant) −Energy E=kx 2 −Our goal: minimize the energy of the system  A system of springs will only settle in a minimum 15PAPA2011, University of Michigan

Iterative Optimization 16PAPA2011, University of Michigan

Prior Work ■Ideal Placer −Low runtime without sacrificing solution quality −Simplicity, integration with other optimizations 17PAPA2011, University of Michigan Speed Solution Quality Non-convex optimization mFAR, Kraftwerk2, FastPlace3 Ideal placer mPL6, APlace2, NTUPlace3 Quadratic and force-directed

Key features of SimPL ■Flat quadratic placement ■Primal dual optimization −Closing the gap between upper and lower bounds 18 Final Solution Lower-Bound Solution by Linear System Solver Wirelength Iteration Final Legal Solution Upper-Bound Solution by Look-ahead Legalization Initial WL Opt. PAPA2011, University of Michigan

Common Analytical Placement Flow 19 Placement Instance Converge yes no Global Placement Initial WL Optimization Legalization and Detailed Placement PAPA2011, University of Michigan

SimPL Flow 20 We delegate final legalization and detailed placement to FastPlace-DP [M. Pan, et al, “An Efficient and Effective Detailed Placement Algorithm”, ICCAD2005] Placement Instance Legalization and Detailed Placement B2B net model[P. Spindler, et al, “Kraftwerk2 - A Fast Force-Directed Quadratic Placement Approach Using an Accurate Net Model,” TCAD 2008] yes no Pseudonet Insertion Look-ahead Legalization (Upper-Bound) B2B Graph Building Linear System Solver (Lower-Bound) Converge Global Placement B2B Graph Building Linear System Solver WL Converge yes no Initial WL Optimization

SimPL: Look-ahead Legalization ■Purpose: Produces almost-legal placement (Upper-Bound) while preserving the relative cell ordering given by linear system solver (Lower-Bound) ■Identify target region −Find overflow bin b −Create a minimal wide enough bin cluster B around b ■Perform geometric top-down partitioning −Find cell area median (C c ) and whitespace median (C B ) −Assign cells (C c ) to corresponding partitions (C B ) ■Non-linear scaling −Form stripe regions −Move cells across stripe regions in-order based on whitespace 21PAPA2011, University of Michigan

SimPL: Look-ahead Legalization (1) Performing geometric top-down partitioning Overfilled bin Cell-area median (C c ) B0 B0 B 1 whitespace median (C B ) Bin cluster (B) 22PAPA2011, University of Michigan

SimPL: Look-ahead Legalization (2) 23PAPA2011, University of Michigan Cell-area median (C c ) whitespace median (C B ) B0 B0

SimPL: Look-ahead Legalization (2) C B Obstacle borders Uniform cutlines Cell Ordering Per-stripe Linear Scaling C B PAPA2011, University of Michigan

SimPL: Look-ahead Legalization (3) ■Example ( adaptec1 ) Look-ahead legalization stops when target regions become small enough

SimPL: Using legal locations as anchors ■Purpose: Gradually perturb the linear system to generate lower-bound solutions with less overlap ■Anchors and Pseudonets −Look-ahead locations used as fixed, zero-area anchors −Anchors and original cells connected with 2-pin pseudonets −Pseudonet weights grow linearly with iterations 26PAPA2011, University of Michigan

Next illustration: Tug-of-war between low-wirelength and legalized placements 27PAPA2011, University of Michigan

SimPL Iterations on Adaptec1 (1) Iteration=0 (Init WL Opt.)Iteration=1 (Upper Bound) Iteration=2 (Lower Bound)Iteration=3 (Upper Bound) 28

SimPL Iterations on Adaptec1 (2) Iteration=11 (Upper Bound) Iteration=20 (Lower Bound)Iteration=21 (Upper Bound) Iteration=11 (Upper Bound) Iteration=20 (Lower Bound)Iteration=21 (Upper Bound) Iteration=10 (Lower Bound) 29

SimPL Iterations on Adaptec1 (3) 30 Iteration=31 (Upper Bound)Iteration=30 (Lower Bound) Iteration=40 (Lower Bound)Iteration=41 (Upper Bound)

Convergence of SimPL ■Legal solution is formed between two bounds 31PAPA2011, University of Michigan

Empirical Results: ISPD05 Benchmarks ■Experimental setup −Single threaded runs on a 3.2GHz Intel core i7 Quad CPU Q660 Linux workstation −HPWL is computed by GSRC Bookshelf Evaluator < 5000 lines of code in C++, including CG solver for sparse linear systems (w Jacobi preconditioner) 32PAPA2011, University of Michigan

Empirical Results: Scalability Study ■Take an existing design (ISPD 2005) and split each movable cell into two cells of smaller size −Each connection to the original cell is inherited by one of two split cells, which are connected by a 2-pin net 33

Speeding Up Placement Using Parallelism ■SimPL has very few components (5KLOC) ■Each bottleneck is amenable to some form of ||-ism −Thread-level −Instruction-level 34PAPA2011, University of Michigan

Parallelism in Conjugate Gradient Solver ■Coarse-grain row partitioning −Implemented using OpenMP3.0 compiler intrinsic ■SSE2 (Streaming SIMD Extensions) instructions −Process 4 multiple data with a single instruction −Marginal runtime improvement in SpMxV ■Reducing memory bandwidth demand of SpMxV −CSR (Compressed Sparse Row) format Y. Saad, “Iterative Methods for Sparse Linear Systems,” SIAM PAPA2011, University of Michigan

Parallelism in CG Solver - Example 36PAPA2011, University of Michigan

Parallelism in B2B Mode Update ■B2B net model update – B2B model is separable – Can process the x and y cases in parallel −Additionally, split the nets of the netlist into equal groups that can be processed by multiple threads. 37PAPA2011, University of Michigan

SSE optimization affects Runtime Profile 38PAPA2011, University of Michigan

Parallelism in Look-ahead Legalization (1) ■Look-ahead legalization (LAL) started consuming a significant fraction of overall runtime ■Top-down geometric partitioning and non-linear scaling (T&N) are amenable to parallelization −Top-down partitioning generates an increasing number of subtasks of similar sizes which can be solved in parallel −After each level of T&N on bin cluster, each thread generates two sub-clusters with similar numbers of cells 39PAPA2011, University of Michigan

Parallelism in Look-ahead Legalization (2) ■LAL keeps the global queue of bin clusters Q ■Static partitioning −Assign initial bin clusters to available threads such that each thread has similar number of bin clusters to start ■Subtask updates −Thread t i processes one of two sub-clusters (for the next level of T&N), the remainder is added to the global cluster queue Q ■Dynamic task scheduling −When thread t i is idle, it dynamically retrieves clusters from the global cluster queue Q. The number of clusters to be retrieved N = max(Q.size()/N_threads, 1) 40PAPA2011, University of Michigan

Empirical Results – Overall Speed-ups ■Experimental setup −Multithreaded runs on a 8-core AMD-based system with four dual-core CPUs and 16GByte RAM −Each CPU was Opteron 880 processor running at 2.4GHz with 1024KB cache 41PAPA2011, University of Michigan

Empirical Results – Component Speed-ups 42PAPA2011, University of Michigan

Empirical Results – Component Speed-ups 43PAPA2011, University of Michigan

Extending the Routability-driven Placement ■Ongoing work: simultaneous place-and-route 44PAPA2011, University of Michigan

Simultaneous Place-and-Route ■After Look-Ahead Legalization (LAL) perform Look-Ahead Routing (LAR) −Integrate an in-house router through clean API −Cell locations in, accurate congestion maps out −The placer accounts for congestion in addition to density (slightly modified formulas, almost no extra work) ■ISPD 2011 contest organized by IBM Research −New, large benchmarks −Placements evaluated by a common global router 45PAPA2011, University of Michigan

SimPL  SimPLR ■Key metric is #overflows (OF) ■Also shown – routed WL (RtWL) 46PAPA2011, University of Michigan

Conclusions ■New flat quadratic placement algorithm: SimPL −Novel primal-dual based approach −Amenable to integration with physical synthesis ■Self-contained, compact implementation −Fastest among available academic placers −Highly competitive solution quality −Amenable to parallelism −Easy to extend to simultaneous place-and-route 47PAPA2011, University of Michigan

Questions and Answers Thank you! Time for Questions 48PAPA2011, University of Michigan