Multi-Commodity Flow-Based Spreading in a Commercial Analytic Placer

Slides:



Advertisements
Similar presentations
Using Trees to Depict a Forest
Advertisements

ITS THE FINAL LECTURE! (SING IT, YOU KNOW YOU WANT TO) Operations Research.
Wen-Hao Liu1, Yih-Lang Li, and Cheng-Kok Koh Department of Computer Science, National Chiao-Tung University School of Electrical and Computer Engineering,
MINIMUM COST FLOWS: NETWORK SIMPLEX ALGORITHM A talk by: Lior Teller 1.
Computational Methods for Management and Economics Carla Gomes Module 8b The transportation simplex method.
Shuai Li and Cheng-Kok Koh School of Electrical and Computer Engineering, Purdue University West Lafayette, IN, Mixed Integer Programming Models.
Ripple: An Effective Routability-Driven Placer by Iterative Cell Movement Xu He, Tao Huang, Linfu Xiao, Haitong Tian, Guxin Cui and Evangeline F.Y. Young.
MAE 552 – Heuristic Optimization Lecture 27 April 3, 2002
CAPS RoutePro Routing Environment. Solution Methods. Backhauls. Dispatcher Interface. Demonstration.
Network Optimization Models: Maximum Flow Problems In this handout: The problem statement Solving by linear programming Augmenting path algorithm.
On Legalization of Row-Based Placements Andrew B. KahngSherief Reda CSE & ECE Departments University of CA, San Diego La Jolla, CA 92093
Rewiring – Review, Quantitative Analysis and Applications Matthew Tang Wai Chung CUHK CSE MPhil 10/11/2003.
NetworkModel-1 Network Optimization Models. NetworkModel-2 Network Terminology A network consists of a set of nodes and arcs. The arcs may have some flow.
1 Shortest Path Calculations in Graphs Prof. S. M. Lee Department of Computer Science.
CRISP: Congestion Reduction by Iterated Spreading during Placement Jarrod A. Roy†‡, Natarajan Viswanathan‡, Gi-Joon Nam‡, Charles J. Alpert‡ and Igor L.
Power Reduction for FPGA using Multiple Vdd/Vth
March 20, 2007 ISPD An Effective Clustering Algorithm for Mixed-size Placement Jianhua Li, Laleh Behjat, and Jie Huang Jianhua Li, Laleh Behjat,
Analytic Placement. Layout Project:  Sending the RTL file: −Thursday, 27 Farvardin  Final deadline: −Tuesday, 22 Ordibehesht  New Project: −Soon 2.
Congestion Estimation and Localization in FPGAs: A Visual Tool for Interconnect Prediction David Yeager Darius Chiu Guy Lemieux The University of British.
1 Branch and Bound Searching Strategies Updated: 12/27/2010.
1 Network Models Transportation Problem (TP) Distributing any commodity from any group of supply centers, called sources, to any group of receiving.
1 Field-programmable Gate Array Architectures and Algorithms Optimized for Implementing Datapath Circuits Andy Gean Ye University of Toronto.
1 Chapter 5 Branch-and-bound Framework and Its Applications.
::Network Optimization:: Minimum Spanning Trees and Clustering Taufik Djatna, Dr.Eng. 1.
1 Architecture of Datapath- oriented Coarse-grain Logic and Routing for FPGAs Andy Ye, Jonathan Rose, David Lewis Department of Electrical and Computer.
Unsupervised Learning
Time-borrowing platform in the Xilinx UltraScale+ family of FPGAs and MPSoCs Ilya Ganusov, Benjamin Devlin.
Andreas Klappenecker [partially based on the slides of Prof. Welch]
Floating-Point FPGA (FPFPGA)
Maximum Flow c v 3/3 4/6 1/1 4/7 t s 3/3 w 1/9 3/5 1/1 3/5 u z 2/2
Introduction to Polygons
HeAP: Heterogeneous Analytical Placement for FPGAs
David K. Y. Yau Department of Computer Science Purdue University
Network Simplex Animations
Delay Optimization using SOP Balancing
DS595/CS525 Team#2 - Mi Tian, Deepan Sanghavi, Dhaval Dholakia
Date of download: 1/1/2018 Copyright © ASME. All rights reserved.
Andy Ye, Jonathan Rose, David Lewis
Anne Pratoomtong ECE734, Spring2002
Hidden Markov Models Part 2: Algorithms
Instructor: Shengyu Zhang
Maximum Flow c v 3/3 4/6 1/1 4/7 t s 3/3 w 1/9 3/5 1/1 3/5 u z 2/2
SAT-Based Area Recovery in Technology Mapping
Performance Optimization Global Routing with RLC Crosstalk Constraints
SAT-Based Optimization with Don’t-Cares Revisited
of the Artificial Neural Networks.
3.4 Push-Relabel(Preflow-Push) Maximum Flow Alg.
MURI Kickoff Meeting Randolph L. Moses November, 2008
PRESENTATION COMPUTER NETWORKS
Integrating Logic Synthesis, Technology Mapping, and Retiming
A Semi-Persistent Clustering Technique for VLSI Circuit Placement
Branch and Bound Searching Strategies
ECE 352 Digital System Fundamentals
Successive Shortest Path Algorithm
Successive Shortest Path Algorithm
EE5900 Advanced Embedded System For Smart Infrastructure
Flow Feasibility Problems
Maximum Flow c v 3/3 4/6 1/1 4/7 t s 3/3 w 1/9 3/5 1/1 3/5 u z 2/2
Delay Optimization using SOP Balancing
Continuous Density Queries for Moving Objects
Network Simplex Animations
Maximum Flow Neil Tang 4/8/2008
Topological Signatures For Fast Mobility Analysis
Fast Min-Register Retiming Through Binary Max-Flow
Communication Driven Remapping of Processing Element (PE) in Fault-tolerant NoC-based MPSoCs Chia-Ling Chen, Yen-Hao Chen and TingTing Hwang Department.
Chapter 2: Building a System
INTRODUCTION A graph G=(V,E) consists of a finite non empty set of vertices V , and a finite set of edges E which connect pairs of vertices .
Unsupervised Learning
L. Glimcher, R. Jin, G. Agrawal Presented by: Leo Glimcher
Reconfigurable Computing (EN2911X, Fall07)
Presentation transcript:

Multi-Commodity Flow-Based Spreading in a Commercial Analytic Placer Nima Karimpour Darav Andrew Kennings Kristofer Vorwerk Arun Kundu Physical & Logical Design Group

Outline Background Motivation Implementation Details Experimental Results Conclusions

Background Our FPGA hardware & software products are becoming increasingly complex: 2004: ProASICPLUS 2006: IGLOO 2010: SmartFusion 2012: SmartFusion2 2015: RTG4 2017: PolarFire Increasing demand for both high- performance … High-speed SERDES, DDR4, AXI buses. FIR filters, CNNs, DSP-intensive tasks. … and “group-based” design patterns: Software region constraints. Fixed cells (“obstacles”). Tight timing budgets, and sophisticated timing constraints.

Motivation (1 of 2) Analytic Placer Circuit Clustering Analytic Placer Timing, power, routability objectives. Lower-bound (LB) Placement Cluster Placement Goal of UB: preserve the LB objectives by not perturbing placement. Upper-bound (UB) Spreading Can Stop? Detailed Improvement The closer the UB to Full Legalization, the better the accuracy of timing predictions & decisions. Full Legalization Done

Less perturbation of LB Motivation (2 of 2) Our UB strategy is based on network flow. A grid can be imposed over the placement area. Cells can be snapped into their nearest grid “bin”. The bins are analogous, in many ways, to nodes in a flow problem. The concept of flow can be used to push over-capacitated cells into under-capacitated sites. Our approach Max movement Avg Movement Complex regions Cell order Different cell types Better timing Better HPWL Less perturbation of LB Handling complex regions No deadlock Robust Efficient runtime Not limited to architecture Scalable

High-Level Overview of the Spreading Algorithm (Part 1) Circuit + Architecture Comment The structure of our approach is similar to Edmonds-Karp. Construct Flow Network Assign cells to closest compatible bin Identify the overflowed bins Overflowed bins remain? No Yes Establish limit on amount of movement End For each bin, calculate the set of augmenting paths for the bin using BFS, then sort by cost Next bin All bins explored For each path, move cells along path

Step 1: Build the Flow Network (1 of 6) Architecture Grid DSP Site URAM Site LSRAM Site LUT & Carry Site DSP Grid LUT (and Carries) Grid URAM Grid LSRAM Grid

Step 1: Build the Flow Network (2 of 6) DSP Site URAM Site LSRAM Site LUT & Carry Site DSP Grid DSP Flow Network URAM Grid URAM Flow Network zoom-in … LSRAM Grid LSRAM Flow Network 4-direction connections

Step 1: Build the Flow Network (3 of 6) Fused Flow Network Architecture Grid DSP Site URAM Site LSRAM Site LUT & Carry Site LUT (and Carry) Network DSP Flow Network URAM Flow Network LSRAM Flow Network

Step 1: Build the Flow Network (4 of 6) Distinct grids lend themselves to a convenient problem separation. This may be the case for some FPGA architectures… In Microsemi’s FPGA products, non-carry LUTs can occupy locations within the URAM, DSP, and LSRAM sites, if those sites are unused. This is referred to as using “shadow logic”. Good for logic utilization. Let’s see how this is supported… DSP Site URAM Site LSRAM Site LUT & Carry Site LUT (and Carry) Network LUT (and Carry) Grid

Step 1: Build the Flow Network (5 of 6) This required us to support a notion of “compatibility” to model cell-to-site legality. The LUTs need to be connected to the nearest URAM, LSRAM, and DSP sites because of the ability to occupy shadow locations. LUT Flow Node DSP Flow Nodes LSRAM Flow Nodes URAM Flow Nodes Emphasizes multi-commodity nature of this problem. Fused Flow Network

Step 1: Build the Flow Network (6 of 6) Region1 Region constraints impact “compatibility”. Region2 LUT Flow Node DSP Flow Nodes LSRAM Flow Nodes URAM Flow Nodes The regions require different connections vs. previous slide. Fused Flow Network

High-Level Overview of the Spreading Algorithm (Part 2) Circuit + Architecture Supply Nodes These are the over-capacitated (overfilled) grid sites. We want to push flow through these sites toward empty bins. Construct Flow Network Assign cells to closest compatible bin Identify the overflowed bins Overflowed bins remain? No Yes Establish limit on amount of movement End In practice, the quality of the final solution is mostly independent of the order of resolving overflowed bins because cell movements are limited by Ψ ; however, sorting overflowed bins Λ in order of ascending supply values has been shown to reduce runtime. This is because resolving slightly-overflowed bins first provides more candidate paths for highly-overflowed bins later. For each bin, calculate the set of augmenting paths for the bin using BFS, then sort by cost Next bin All bins explored For each path, move cells along path

Step 2: Find Augmenting Paths (1 of 4) Each augmenting path is a queue of bins whose first element is the overflowed bin. While traversing bins using the BFS, partial paths are costed and saved into the queue. The BFS is repeated until the queue of potential path choices is emptied or the total demand provided by the identified paths can sink the source bin. LUT & Carry Flow Nodes DSP Flow Nodes LSRAM Flow Nodes URAM Flow Nodes First Level Search

Step 2: Find Augmenting Paths (2 of 4) A valid augmenting path starts with the supply node and ends with a sink node. C2 C1 C3 C4 C5 Valid Path Suppose that these are carry chain LUTs which are in overfilled LUT locations. In Microsemi’s FPGA, these cells can only be placed in (non-shadow) LUT locations.

Step 2: Find Augmenting Paths (3 of 4) A valid augmenting path starts with the supply node and ends with a sink node. For adding a node into the end of path (compatibility conditions): There must be at least one object in the source node with the same color as the destination node. C2 C1 C3 C4 C5 Invalid Path A bin is compatible with an object iff the object is not in a region constraint, or the region-color of the object is included in the region-color vector of the bin; and, the object can be legally placed into the bin of the corresponding type.

Step 2: Find Augmenting Paths (4 of 4) A valid augmenting path starts with the supply node and ends with a sink node. For adding a node into the end of path (compatibility conditions): There must be at least one object in the source node with the same color as the destination node. There must be at least one object from the source node to the destination node that must : Not violate the max movement constraint. Be allowed to be placed at the destination node. C2 C1 C3 C4 C5 Valid Path Non-carry LUT; can go into DSP shadow location.

Step 3: Move the Cells (1 of 3) For a given augmenting path: Source Bin In this figure, v1 and v2 are overflowed bins with 2 objects each, while v3 is a sink bin. Bins v1 and v3 are overlapping with region R2 while v2 is overlapping with both regions R1 and R2. Sink Bin

Step 3: Move the Cells (2 of 3) For a given augmenting path: Loop until there are no bins left unexplored in the path… Pop & examine the last bin from the path. Choose the cell, in the bin, which is closest to & compatible with the sink. Move the cell. Update sink to the current bin. Sink Bin First, the closest object to bin v3 is moved to sink bin v3 such that the object must be compatible with the target bin v3. c3 is not compatible with bin v3 because c3 is assigned to Red region while v3 is overlapping with only the Blue region.

Step 3: Move the Cells (3 of 3) For a given augmenting path: Loop until there are no bins left unexplored in the path… Pop & examine the last bin from the path. Choose the cell, in the bin, which is closest to & compatible with the sink. Move the cell. Update sink to the current bin. No newly-overflowed bins are created. Then, the closest compatible object to bin v2 is moved to v2. No new overflowed bin is introduced and the source bin (v1) in the augmentation path is no longer overflowed. The location of each logical cluster is updated based on the location (centre of gravity) of its representing object (or objects). Hence, the solution provided is well-spread as clusters represented by single objects are fully-legalized, while minor overlaps are permitted for clusters represented by multiple objects.

Experimental Results (1 of 5) Conducted on a suite with: 557 designs which we have used internally for benchmarking for many years. A variety of sizes and design utilizations. A variety of timing constraints. Each design was placed into the smallest available PolarFire die which could fit it. Each design was run with 5 different starting seeds, and averages reported here. The placer was set to optimize the timing as best as it could (i.e., don’t stop early). Timing is estimated using the pre-routing timing estimator embedded in the placer. Results are compared after analytic placement, to our previous well-tuned approach based on geometric spreading.

Experimental Results (2 of 5)

Experimental Results (3 of 5)

Experimental Results (4 of 5)

Experimental Results (5 of 5)

Conclusions Our approach 6.4% reduction in HPWL Limits on max movement Handles complex regions Multi-commodity cell types 389ps, or ~6 LUT, improvement in delay Comparable Runtime to Geometric Approaches Generic approach, useful for other architectures

Questions?