ECE 697F Reconfigurable Computing Lecture 4 FPGA Placement

Slides:



Advertisements
Similar presentations
FPGA-Based System Design: Chapter 4 Copyright  2004 Prentice Hall PTR Topics n Logic synthesis. n Placement and routing.
Advertisements

ECE 506 Reconfigurable Computing Lecture 6 Clustering Ali Akoglu.
ISQED’2015: D. Seemuth, A. Davoodi, K. Morrow 1 Automatic Die Placement and Flexible I/O Assignment in 2.5D IC Design Daniel P. Seemuth Prof. Azadeh Davoodi.
Optimization via Search CPSC 315 – Programming Studio Spring 2009 Project 2, Lecture 4 Adapted from slides of Yoonsuck Choe.
MAE 552 – Heuristic Optimization Lecture 6 February 6, 2002.
Nature’s Algorithms David C. Uhrig Tiffany Sharrard CS 477R – Fall 2007 Dr. George Bebis.
Placement 1 Outline Goal What is Placement? Why Placement?
Reconfigurable Computing (EN2911X, Fall07)
Lecture 4: FPGA Placement September 12, 2013 ECE 636 Reconfigurable Computing Lecture 4 FPGA Placement.
EDA (CS286.5b) Day 7 Placement (Simulated Annealing) Assignment #1 due Friday.
Partitioning 1 Outline –What is Partitioning –Partitioning Example –Partitioning Theory –Partitioning Algorithms Goal –Understand partitioning problem.
Introduction to Simulated Annealing 22c:145 Simulated Annealing  Motivated by the physical annealing process  Material is heated and slowly cooled.
ECE 506 Reconfigurable Computing Lecture 7 FPGA Placement.
ECE 506 Reconfigurable Computing Lecture 8 FPGA Placement.
Elements of the Heuristic Approach
Register-Transfer (RT) Synthesis Greg Stitt ECE Department University of Florida.
Placement by Simulated Annealing. Simulated Annealing  Simulates annealing process for placement  Initial placement −Random positions  Perturb by block.
CSE 242A Integrated Circuit Layout Automation Lecture: Partitioning Winter 2009 Chung-Kuan Cheng.
The Basics and Pseudo Code
CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #14 – Placement.
Analytic Placement. Layout Project:  Sending the RTL file: −Thursday, 27 Farvardin  Final deadline: −Tuesday, 22 Ordibehesht  New Project: −Soon 2.
Force-Directed Placement Heuristics A Comparative Study.
Massachusetts Institute of Technology 1 L14 – Physical Design Spring 2007 Ajay Joshi.
Simulated Annealing.
Modern VLSI Design 3e: Chapter 10 Copyright  1998, 2002 Prentice Hall PTR Topics n CAD systems. n Simulation. n Placement and routing. n Layout analysis.
CALTECH CS137 Winter DeHon CS137: Electronic Design Automation Day 10: February 6, 2002 Placement (Simulated Annealing…)
Optimization Problems
FPGA CAD 10-MAR-2003.
Ramakrishna Lecture#2 CAD for VLSI Ramakrishna
An Introduction to Simulated Annealing Kevin Cannons November 24, 2005.
Intro. ANN & Fuzzy Systems Lecture 37 Genetic and Random Search Algorithms (2)
Computational Physics (Lecture 10) PHY4370. Simulation Details To simulate Ising models First step is to choose a lattice. For example, we can us SC,
RTL Design Flow RTL Synthesis HDL netlist logic optimization netlist Library/ module generators physical design layout manual design a b s q 0 1 d clk.
Placement and Routing Algorithms. 2 FPGA Placement & Routing.
Hidden Markov Models BMI/CS 576
Optimization Problems
Partial Reconfigurable Designs
CSCI 4310 Lecture 10: Local Search Algorithms
Computational Physics (Lecture 10)
Department of Computer Science
Adapted by Dr. Adel Ammar
Give qualifications of instructors: DAP
Give qualifications of instructors: DAP
Local Search Algorithms
Artificial Intelligence (CS 370D)
Subject Name: Operation Research Subject Code: 10CS661 Prepared By:Mrs
Optimization Problems
CSE 589 Applied Algorithms Spring 1999
EE5780 Advanced VLSI Computer-Aided Design
ساختمان داده ها پشته ها Give qualifications of instructors: DAP
ساختمان داده ها لیستهای پیوندی
Topics Logic synthesis. Placement and routing..
School of Computer Science & Engineering
Lecture 9: Tabu Search © J. Christopher Beck 2005.
Algorithms for Budget-Constrained Survivable Topology Design
More on Search: A* and Optimization
Xin-She Yang, Nature-Inspired Optimization Algorithms, Elsevier, 2014
ECE 697F Reconfigurable Computing Lecture 5 FPGA Routing
Give qualifications of instructors: DAP
NAND and XOR Implementation
More on HW 2 (due Jan 26) Again, it must be in Python 2.7.
First Exam 18/10/2010.
Local Search Algorithms
Greg Knowles ECE Fall 2004 Professor Yu Hu Hen
Circuit Analysis Procedure by Dr. M
CprE / ComS 583 Reconfigurable Computing
Chapter 3b Leakage Efficient Chip-Level Dual-Vdd Assignment with Time Slack Allocation for FPGA Power Reduction Prof. Lei He Electrical Engineering Department.
Local Search Algorithms
CS 151 Digital Systems Design Lecture 1 Course Overview
Reconfigurable Computing (EN2911X, Fall07)
Presentation transcript:

ECE 697F Reconfigurable Computing Lecture 4 FPGA Placement Give qualifications of instructors: DAP teaching computer architecture at Berkeley since 1977 Co-athor of textbook used in class Best known for being one of pioneers of RISC currently author of article on future of microprocessors in SciAm Sept 1995 RY took 152 as student, TAed 152,instructor in 152 undergrad and grad work at Berkeley joined NextGen to design fact 80x86 microprocessors one of architects of UltraSPARC fastest SPARC mper shipping this Fall

Outline Brief review of important architecture info for homework Basic clustering of BLEs into clusters Timing-driven analysis and clustering Placement techniques (simulated annealing) Timing driven placement

Quality metrics for layout: Placement metrics Quality metrics for layout: Area Delay Dynamic power consumption (relatively recently) Ideally placement and routing would be performed together Some have tried! Both problems are NP-hard For practical considerations placement and routing must be performed separately

Before Placement: Clustering Need to group BLEs into groups Goals: Minimize number of clusters Minimize inter-cluster wiring Minimize critical path (timing-driven) How do we do this Take advantage of cluster architecture

Basic Clustering (Betz) – CICC 97 Iterate until all BLEs consumed Start new cluster by selecting a random BLE Add BLE with most shared inputs with current cluster to cluster Keep adding until either cluster full or input pins used up Hill climbing – if some cluster BLEs unused Add another BLE even if cluster input count temporarily overflowed If input count not eventually reduced select best choice from before hill climbing Does this do anything for timing performance?

netlist with delay for each gate Timing Analysis PO1 PO2 PO3 PI1 PI2 PI3 1 3 4 6 44 5 7 netlist with delay for each gate 1 7 13 18 PI1 1 4 6 5 PO1 9 3 arrival times 15 22 PI2 3 6 6 7 PO2 1 44 7 14 18 PI3 1 5 4 PO3 7 Source: David Pan

arrival time/required time slack = required time - arrival time Timing Analysis 0/4 1/5 7/9 13/15 18/22 PI1 1 4 6 5 PO1 0/0 9/9 3/3 15/15 arrival time/required time 22/22 PI2 3 6 6 7 PO2 0/8 1/9 44 7/15 14/18 18/22 PI3 1 5 4 PO3 7/13 4 4 2 2 4 PI1 1 4 6 5 PO1 slack = required time - arrival time PI2 3 6 6 7 PO2 8 8 44 8 4 4 PI3 1 5 4 PO3 6

Example with interconnect delay 22 3 2 1 1 5 5 5 19 F F 2 1 4 4 4 2 1 3 2

Timing-Driven Clustering – T-VPACK Cost metric now considers both connectivity and timing criticality Perform an analysis of criticality at beginning considering all wires to be inter-cluster As clustering progresses consider the following timing weight ratios LUT delay: 0.1 Intra-cluster delay Inter-cluster delay Determine “Base” BLE criticality

How to break ties? Initially, many paths may have the same number of BLEs Include “tie-breaking” in performance cost function

Results for T-VPACK versus VPACK Timing driven place and route also used for these results.

Wire length measures Estimate wire length by distance between components. Possible distance measures: Euclidean distance (sqrt(x2 + y2)); Manhattan distance (x + y). Multi-point nets must be broken up into trees for good estimates. Manhattan Euclidean

Placement Placement has a set of competing goals. Can’t optimize locally and globally simultaneously. Use heuristic approaches to evaluate quality. A B LUT1 LUT2 C E D C D F A B E 1 2

Placement Algorithms Constructive methods: begin from netlist and generate an initial placement. Partitioning methods: mincut and Kernighan-Lin methods Clustering Iterative improvement Begin with random or constructive placement. Iterate to improve it. Hill-climbing

Iterative Placement Algorithms Pairwise interchange methods Force-directed methods FD relaxation FD pairwise exchange Simulated annealing Generates best results Can be time consuming Macro-based approaches Genetic algorithms Quad swaps

Iterative Improvement Algorithms Force-directed: (classical mechanics) Force vector computed on each module corresponding to all nets Solve set of non-linear differential equations. Simulated annealing: (statistical mechanics) Model a physical annealing process which optimizes energy. Similar to “quenching” metal.

Formulating Force Equations Use Hooke’s Law Modules 1, 2, … N mi mass of module i xi x position of module i Kij Attractive constant between module i and j Fi Net force on module i from rest of modules

Need a repulsive force between modules to prevent overlap Minimization Using previous formulation will collapse all locations to a single point X1 = X2 … Xn Need a repulsive force between modules to prevent overlap repel attract R too small -> modules too close R too large -> modules far apart Pads have fixed Xi

Force-directed (cont.) We know that for the steady state d2Xi / dT2 = 0 Determine set of non-linear equations and solve simultaneously using Newton’s Method. Problem size can grow quite large as size of device increases. Interaction between X and Y coords 3D Device?

Example 0 = R [ 1/X1 + 1/X1-L + 1/X1-X2 + 1/X1-X3] 4 5 7 6 1 2 3 X4 = X5 = 0 X6 = X7 = L 0 = R [ 1/X1 + 1/X1-L + 1/X1-X2 + 1/X1-X3] - [ X1 + (X1-L) + 3(X1-X3) + (X1-X2)] Different results for different R values Solve equations simulataneously Both for X and Y -> different repulsive forces?

Force-Directed Relaxation Start with random placement. Compute forces on each module. Pick the module with the largest force on it Compute zero-force position with Newton’s method Attempt to move to unoccupied position Or swap with existing module Or move to nearest open position Continue until final locations determined.

Hill Climbing Algorithms To avoid getting trapped in local minima, consider “hill-climbing” approach Need to accept worse solutions or make “bad” moves to get global minima. Acceptance is probabalistic. Only accept cost-increasing moves some of the time. Cost Solution space

Physical Annealing Take a metal and heat to high temperature Allow it to cool slowly; metal is annealed to a low temperature Atoms in the metal are at lower energy states after annealing Higher the temperature initially and slower the cooling, the tougher the metal becomes. Atoms transition to high energy states and then move to low energy.

Simulated Annealing Optimization strategy based on physical annealing process Generate random moves. Initially, accept moves that decrease and increase cost. As temperature decreases, the probability of accepting bad moves decreases. Eventually, default to greedy algorithm Only accept positive moves Determine when to terminate.

Annealing Algorithm N = # blocks B = scaling factor T = temperature T = StartingT Moves_per_iteration = BN4/3 While (stopping_criteria(T) = false) { While (Move_Count < Moves_per_Iter) { swap blocks evaluate Δcost if (Accept < Δcost) Move block to new location } T = update(T) N = # blocks B = scaling factor T = temperature

Accept Function Δcost = new cost – initial cost If (Δcost <= 0) return (yes) Else { y = exp ( - Δcost/T) r = random (0, 1) if (r < y) else return (no) }

Annealing Criteria Contemporary FPGA packages use the following parameters: Starting temp – 20 * stand_dev(cost of N swaps) Cost function – weighted sum of wire length and delay Inner loop – B * N4/3 Beta cost function Stopping criteria – T < [.005 * Cost/Nnets]

Range Limiting As temperature drops, limit scope of swaps Increased likelihood of acceptance. Can also used to secure critical path performance.

Timing-driven Placement Take both wire length and critical path into account Problem Critical path changes as I move blocks How do I balance the two objectives How do we go about modeling routing delay during placement?

Estimating delays Marquardt and Betz approach (T-VPlace) Perform initial delay analysis to determine shortest delays between each pair of X, Y locations Store information in table for quick look-up Assumption that router will probably find the minimum delay path (a leap of faith!)

Determining Criticality Same basic approach as used for clustering criticality For each (i, j) connection from source i and sink j Determine arrival times (pre-order BFS) Determine required arrival times (post-order BFS) Determine slack -> required_arrival_time – arrival_time Criticality(i, j) = [1- slack(i, j)]/ (Max slack) What is the purpose of the criticality exponent?

Balancing Wiring and Timing Cost Need to determine relative changes in timing and wiring based on moves Idea: Use relative changes from previous calculation Both values less than 1 Helps balance effect based on scaling parameter This still doesn’t help address changes in delay

Updated Annealing Algorithm

How often to recalculate delay? Recalculating delay once per temperature is good. Also simplifies programming somewhat

How important is timing-driven placement? Run time Penalty – 2.5X

Summary Placement and clustering of modules critically important for subsequent routing step Often initial placement performed and then iteratively improved Mincut partitioning approaches sometimes used for initial placement Efficient timing analysis a key to successful placement Island-style devices benefit from simulated annealing approaches Accurate cost function is the key to success Issues related to power consumption remain