Parallelization by SimPLification: A Case Study in VLSI Placement Myung-Chul Kim, Dong-Jin Lee and Igor L. Markov Dept. of EECS, University of Michigan 1PAPA2011, University of Michigan
Complexities of Parallel Algorithms & SW 1.Objectives of parallelization A. Improve completion time by using multiple cores in || B. Improve throughput by using stream processing (latency may increase and become less predictable) C. Improve power consumption (by decreasing clk rate) 2.Not an objective (a pitfall) −Come up with a slow algorithm that is easy to parallelize ■In this talk: how to accomplish 1.A without 2 −Take a leading algorithm and speed up its bottlenecks −Design a new algorithm that is (a) better, (b) easy to parallelize 2PAPA2011, University of Michigan
CAD Algorithms ■Sequence of optimizations −Subject to Amdahl’s law −The more the stages, the harder to parallelize effectively ■Additional complications −Elaborate data structures may entail overhead for parallel access −When processing is light, memory bandwidth may become a bottleneck (with 4+ threads) ■Recommendations −A simpler algorithm is often either to parallelize (fewer stages, simpler data structures) −Using standard solvers, e.g., linear algebra helps reuse previous work on parallelization 3PAPA2011, University of Michigan
Global Placement: Motivation ■Interconnect lagging in performance while transistors continue scaling −Circuit delay, power dissipation and area dominated by interconnect −Routing quality highly controlled by placement ■Circuit size and complexity rapidly increasing −Scalable placement algorithm is critical −Simplicity, integration with other optimizations 4PAPA2011, University of Michigan Unloaded Coupling IR drop RC delay
Goals in Placement ■Find good relative ordering of cells −Minimize wire length and congestion −Maximize timing slack ■Find good spacing of cells −Eliminate wiring congestion problems −Provide space for post placement stages –clock trees –buffer insertion –timing correction ■Find good global position 5PAPA2011, University of Michigan
ABC Optimize Relative Order 6PAPA2011, University of Michigan
ABC To spread... 7PAPA2011, University of Michigan
ABC.. or not to spread 8PAPA2011, University of Michigan
ABC Place to the left 9PAPA2011, University of Michigan
ABC … or to the right 10PAPA2011, University of Michigan
ABC Optimize Relative Order Without whitespace, placement is dominated by ordering 11PAPA2011, University of Michigan
Example of Global Placement (APlace 2.04 from UCSD)
Example of Global Placement (mFar from UCSB)
Placement Formulation ■Objective: Minimize estimated wirelength −Half-perimeter wirelength (HPWL) − (max X – min X) + (max Y – min Y) ■Subject to constraints: −Legality: Row-based placement with no overlaps −Routability: Limiting local interconnect congestion for successful routing −Timing: Meeting performance target of a design 14PAPA2011, University of Michigan xx yy
Quadratic Placement ■Consider a graph first, not a hypergraph ■Minimize Σ(x i -x j ) 2 +(y i -y j ) 2 (the sum is over e ij ) −Seems unrelated to Σ |x i -x j |+|y i -y j | but can still be separated into x- and y- components ■Physical analogy: Hooke’s law −Consider an elastic spring, spread by x −Force F=-kx (k is the spring constant) −Energy E=kx 2 −Our goal: minimize the energy of the system A system of springs will only settle in a minimum 15PAPA2011, University of Michigan
Iterative Optimization 16PAPA2011, University of Michigan
Prior Work ■Ideal Placer −Low runtime without sacrificing solution quality −Simplicity, integration with other optimizations 17PAPA2011, University of Michigan Speed Solution Quality Non-convex optimization mFAR, Kraftwerk2, FastPlace3 Ideal placer mPL6, APlace2, NTUPlace3 Quadratic and force-directed
Key features of SimPL ■Flat quadratic placement ■Primal dual optimization −Closing the gap between upper and lower bounds 18 Final Solution Lower-Bound Solution by Linear System Solver Wirelength Iteration Final Legal Solution Upper-Bound Solution by Look-ahead Legalization Initial WL Opt. PAPA2011, University of Michigan
Common Analytical Placement Flow 19 Placement Instance Converge yes no Global Placement Initial WL Optimization Legalization and Detailed Placement PAPA2011, University of Michigan
SimPL Flow 20 We delegate final legalization and detailed placement to FastPlace-DP [M. Pan, et al, “An Efficient and Effective Detailed Placement Algorithm”, ICCAD2005] Placement Instance Legalization and Detailed Placement B2B net model[P. Spindler, et al, “Kraftwerk2 - A Fast Force-Directed Quadratic Placement Approach Using an Accurate Net Model,” TCAD 2008] yes no Pseudonet Insertion Look-ahead Legalization (Upper-Bound) B2B Graph Building Linear System Solver (Lower-Bound) Converge Global Placement B2B Graph Building Linear System Solver WL Converge yes no Initial WL Optimization
SimPL: Look-ahead Legalization ■Purpose: Produces almost-legal placement (Upper-Bound) while preserving the relative cell ordering given by linear system solver (Lower-Bound) ■Identify target region −Find overflow bin b −Create a minimal wide enough bin cluster B around b ■Perform geometric top-down partitioning −Find cell area median (C c ) and whitespace median (C B ) −Assign cells (C c ) to corresponding partitions (C B ) ■Non-linear scaling −Form stripe regions −Move cells across stripe regions in-order based on whitespace 21PAPA2011, University of Michigan
SimPL: Look-ahead Legalization (1) Performing geometric top-down partitioning Overfilled bin Cell-area median (C c ) B0 B0 B 1 whitespace median (C B ) Bin cluster (B) 22PAPA2011, University of Michigan
SimPL: Look-ahead Legalization (2) 23PAPA2011, University of Michigan Cell-area median (C c ) whitespace median (C B ) B0 B0
SimPL: Look-ahead Legalization (2) C B Obstacle borders Uniform cutlines Cell Ordering Per-stripe Linear Scaling C B PAPA2011, University of Michigan
SimPL: Look-ahead Legalization (3) ■Example ( adaptec1 ) Look-ahead legalization stops when target regions become small enough
SimPL: Using legal locations as anchors ■Purpose: Gradually perturb the linear system to generate lower-bound solutions with less overlap ■Anchors and Pseudonets −Look-ahead locations used as fixed, zero-area anchors −Anchors and original cells connected with 2-pin pseudonets −Pseudonet weights grow linearly with iterations 26PAPA2011, University of Michigan
Next illustration: Tug-of-war between low-wirelength and legalized placements 27PAPA2011, University of Michigan
SimPL Iterations on Adaptec1 (1) Iteration=0 (Init WL Opt.)Iteration=1 (Upper Bound) Iteration=2 (Lower Bound)Iteration=3 (Upper Bound) 28
SimPL Iterations on Adaptec1 (2) Iteration=11 (Upper Bound) Iteration=20 (Lower Bound)Iteration=21 (Upper Bound) Iteration=11 (Upper Bound) Iteration=20 (Lower Bound)Iteration=21 (Upper Bound) Iteration=10 (Lower Bound) 29
SimPL Iterations on Adaptec1 (3) 30 Iteration=31 (Upper Bound)Iteration=30 (Lower Bound) Iteration=40 (Lower Bound)Iteration=41 (Upper Bound)
Convergence of SimPL ■Legal solution is formed between two bounds 31PAPA2011, University of Michigan
Empirical Results: ISPD05 Benchmarks ■Experimental setup −Single threaded runs on a 3.2GHz Intel core i7 Quad CPU Q660 Linux workstation −HPWL is computed by GSRC Bookshelf Evaluator < 5000 lines of code in C++, including CG solver for sparse linear systems (w Jacobi preconditioner) 32PAPA2011, University of Michigan
Empirical Results: Scalability Study ■Take an existing design (ISPD 2005) and split each movable cell into two cells of smaller size −Each connection to the original cell is inherited by one of two split cells, which are connected by a 2-pin net 33
Speeding Up Placement Using Parallelism ■SimPL has very few components (5KLOC) ■Each bottleneck is amenable to some form of ||-ism −Thread-level −Instruction-level 34PAPA2011, University of Michigan
Parallelism in Conjugate Gradient Solver ■Coarse-grain row partitioning −Implemented using OpenMP3.0 compiler intrinsic ■SSE2 (Streaming SIMD Extensions) instructions −Process 4 multiple data with a single instruction −Marginal runtime improvement in SpMxV ■Reducing memory bandwidth demand of SpMxV −CSR (Compressed Sparse Row) format Y. Saad, “Iterative Methods for Sparse Linear Systems,” SIAM PAPA2011, University of Michigan
Parallelism in CG Solver - Example 36PAPA2011, University of Michigan
Parallelism in B2B Mode Update ■B2B net model update – B2B model is separable – Can process the x and y cases in parallel −Additionally, split the nets of the netlist into equal groups that can be processed by multiple threads. 37PAPA2011, University of Michigan
SSE optimization affects Runtime Profile 38PAPA2011, University of Michigan
Parallelism in Look-ahead Legalization (1) ■Look-ahead legalization (LAL) started consuming a significant fraction of overall runtime ■Top-down geometric partitioning and non-linear scaling (T&N) are amenable to parallelization −Top-down partitioning generates an increasing number of subtasks of similar sizes which can be solved in parallel −After each level of T&N on bin cluster, each thread generates two sub-clusters with similar numbers of cells 39PAPA2011, University of Michigan
Parallelism in Look-ahead Legalization (2) ■LAL keeps the global queue of bin clusters Q ■Static partitioning −Assign initial bin clusters to available threads such that each thread has similar number of bin clusters to start ■Subtask updates −Thread t i processes one of two sub-clusters (for the next level of T&N), the remainder is added to the global cluster queue Q ■Dynamic task scheduling −When thread t i is idle, it dynamically retrieves clusters from the global cluster queue Q. The number of clusters to be retrieved N = max(Q.size()/N_threads, 1) 40PAPA2011, University of Michigan
Empirical Results – Overall Speed-ups ■Experimental setup −Multithreaded runs on a 8-core AMD-based system with four dual-core CPUs and 16GByte RAM −Each CPU was Opteron 880 processor running at 2.4GHz with 1024KB cache 41PAPA2011, University of Michigan
Empirical Results – Component Speed-ups 42PAPA2011, University of Michigan
Empirical Results – Component Speed-ups 43PAPA2011, University of Michigan
Extending the Routability-driven Placement ■Ongoing work: simultaneous place-and-route 44PAPA2011, University of Michigan
Simultaneous Place-and-Route ■After Look-Ahead Legalization (LAL) perform Look-Ahead Routing (LAR) −Integrate an in-house router through clean API −Cell locations in, accurate congestion maps out −The placer accounts for congestion in addition to density (slightly modified formulas, almost no extra work) ■ISPD 2011 contest organized by IBM Research −New, large benchmarks −Placements evaluated by a common global router 45PAPA2011, University of Michigan
SimPL SimPLR ■Key metric is #overflows (OF) ■Also shown – routed WL (RtWL) 46PAPA2011, University of Michigan
Conclusions ■New flat quadratic placement algorithm: SimPL −Novel primal-dual based approach −Amenable to integration with physical synthesis ■Self-contained, compact implementation −Fastest among available academic placers −Highly competitive solution quality −Amenable to parallelism −Easy to extend to simultaneous place-and-route 47PAPA2011, University of Michigan
Questions and Answers Thank you! Time for Questions 48PAPA2011, University of Michigan