High-Quality, Deterministic Parallel Placement for FPGAs on Commodity Hardware
Adrian Ludwin, Vaughn Betz & Ketan Padalia
FPGA Seminar Presentation, Nov 10, 2009

Overview
- Motivation
- Review simulated annealing
- Approaches
- Summary

Motivation

Simulated Annealing Placement
- Probabilistic approach to finding a near-optimal solution
- Behavior: moves through the solution space both greedily and randomly
  - The balance between greediness and randomness is controlled by a temperature
  - The temperature evolves over time according to a cooling schedule

Simulated Annealing Placement
- For a single move:
  - Compute the change in cost, ΔC
  - Accept the move if ΔC < 0, or if ΔC > 0 with probability e^(-ΔC/T) (toy sketch below)
- Repeat while gradually decreasing T and the move window size
(Figure: placement grid with cells c1–c5.)
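To make the acceptance rule and cooling schedule above concrete, here is a toy, self-contained sketch of simulated-annealing placement. It is not the presented placer: the cost model (half-perimeter wirelength of random two-pin nets), the swap-based move generator, and every parameter are illustrative assumptions.

```python
# Toy simulated-annealing placement sketch (illustrative assumptions only).
import math, random

random.seed(0)
N_BLOCKS, GRID = 20, 10
# Random two-pin nets and random initial positions stand in for a real netlist.
nets = [(random.randrange(N_BLOCKS), random.randrange(N_BLOCKS)) for _ in range(30)]
pos = {b: (random.randrange(GRID), random.randrange(GRID)) for b in range(N_BLOCKS)}

def hpwl(p):
    # Total half-perimeter wirelength over all two-pin nets.
    return sum(abs(p[a][0] - p[b][0]) + abs(p[a][1] - p[b][1]) for a, b in nets)

def try_move(p, t):
    a, b = random.sample(range(N_BLOCKS), 2)
    before = hpwl(p)
    p[a], p[b] = p[b], p[a]                       # propose: swap two blocks
    delta_c = hpwl(p) - before                    # change in cost, ΔC
    if delta_c < 0 or random.random() < math.exp(-delta_c / t):
        return True                               # accept the move
    p[a], p[b] = p[b], p[a]                       # reject: undo the swap
    return False

t = 10.0
while t > 0.01:                                   # cooling loop
    for _ in range(200):
        try_move(pos, t)
    t *= 0.9                                      # geometric cooling schedule
print("final wirelength:", hpwl(pos))
```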

Constraints
- Runs on commodity hardware
- Good quality of results
  - Robust
- Determinism
  - Bug reporting
  - Consistent regression results

Selected Previous Work
- Closely related
  - Move acceleration
  - Parallel moves
- Other methods
  - Independent sets
  - Partitioned placements
  - Speculative

Algorithm #1

Algorithm #2

Objective
- Determine efficacy
- Analyze runtime and categorize it into:
  - Memory
  - Synchronization
  - Infrastructure
  - Evaluation
  - Proposal

Methodology
- Parallel-equivalent flow (see the sketch below)
  - A serial flow that mimics the parallel flow
  - Emulates the behavior of the multithreaded application using only one thread/core
- Useful for comparison
  - Accounts for infrastructure overhead
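The transcript does not show how the parallel-equivalent flow is built, so the following is only a guess at the idea in miniature: work is partitioned into the same per-thread batches and pushed through the same queue infrastructure that the multithreaded version would use, but a single real thread executes everything, so quality and determinism match the serial flow while the infrastructure overhead is still paid and can be measured. The `evaluate` stand-in, the batch split, and the queue are all assumptions.

```python
# Sketch of a "parallel-equivalent" flow run on one thread (assumed design).
from queue import Queue

def evaluate(move):
    return move * move          # placeholder for real move evaluation

def parallel_equivalent_flow(moves, n_virtual_threads=4):
    # Partition the moves exactly as the parallel flow would...
    batches = [moves[i::n_virtual_threads] for i in range(n_virtual_threads)]
    results_q = Queue()
    # ...but run every virtual thread's batch on the single real thread,
    # still going through the same queueing infrastructure.
    for batch in batches:
        for m in batch:
            results_q.put(evaluate(m))
    return [results_q.get() for _ in range(results_q.qsize())]

print(parallel_equivalent_flow(list(range(16))))
```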

Methodology: Attributing Runtime
- Two types of measurements (illustrated below)
  - Bottom-up (bu): measure each component of a move
  - End-to-end (e2e): measure the runtime of the entire run
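A minimal sketch of the two measurement styles, using trivial stand-ins for the components of a move (the real component boundaries are not given in the transcript): bottom-up timers accumulate each component's time, an end-to-end timer wraps the whole loop, and the difference between the two is overhead that per-component timing does not capture.

```python
# Bottom-up (per-component) vs. end-to-end timing, with stand-in components.
import time

def propose():    return sum(range(200))    # stand-in proposal work
def evaluate(x):  return x % 7              # stand-in evaluation work
def commit(x):    return None               # stand-in commit work

bu = {"propose": 0.0, "evaluate": 0.0, "commit": 0.0}
start_e2e = time.perf_counter()
for _ in range(10000):
    t0 = time.perf_counter(); m = propose()
    t1 = time.perf_counter(); r = evaluate(m)
    t2 = time.perf_counter(); commit(r)
    t3 = time.perf_counter()
    bu["propose"]  += t1 - t0
    bu["evaluate"] += t2 - t1
    bu["commit"]   += t3 - t2
e2e = time.perf_counter() - start_e2e

print("bottom-up total:", sum(bu.values()))
print("end-to-end:     ", e2e)   # the gap is unattributed overhead
```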

Methodology

Test Sets
- A set of 11 Stratix® II FPGA benchmark designs
  - IP and customer circuits
  - 10k to 100k logic cells
- Also tested on 40 Stratix II FPGA circuits
  - Obtained similar results

Results for Algorithm #1

Move attribution

Overhead analysis

Observations
- Theoretical speedup: 1.7x
  - Measured: 1.3x at best
- Increase in evaluation runtime
  - Due to reduced cache locality
- Proposal time is “hidden” (overlapped with the evaluation stage)

Analysis
- Time spent on stalls is negligible
- Evaluation accounts for most of the overhead
- Little to gain by removing determinism
  - Serial equivalency costs less than 3% of runtime

Summary for Algorithm #1
- Speedup: 1–1.3x
- Memory inefficiency is the biggest bottleneck
- Theoretically, the algorithm should scale
  - However, it is difficult to partition and balance the two stages (see the sketch below)
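This transcript only identifies Algorithm #1 as a two-stage, pipelined-moves scheme, so the following is a rough sketch of that general shape with placeholder stage bodies, not the actual algorithm: one thread proposes moves, a second evaluates and commits them, and a bounded queue connects the stages. The speedup of such a pipeline is capped by its slowest stage (roughly total time divided by the largest stage time), which is why partitioning and balancing the two stages is hard.

```python
# Two-stage pipelined-moves sketch: proposer thread -> queue -> evaluator thread.
import queue, threading

q = queue.Queue(maxsize=64)     # bounded queue between the two stages
DONE = object()                 # sentinel marking the end of the move stream

def proposer(n_moves):
    for i in range(n_moves):
        q.put(("move", i))      # stage 1: propose a move (stand-in)
    q.put(DONE)

def evaluator(results):
    while True:
        item = q.get()
        if item is DONE:
            break
        results.append(item[1] * 2)   # stage 2: evaluate + commit (stand-in)

results = []
t1 = threading.Thread(target=proposer, args=(1000,))
t2 = threading.Thread(target=evaluator, args=(results,))
t1.start(); t2.start()
t1.join(); t2.join()
print("moves processed:", len(results))
```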

Speedups for Algorithm #2

Attribution on 2 cores

Attribution on 4 cores

Attribution on 4 cores

Observations
- Memory latency due to inter-processor communication
  - Worsens with more cores

Summary for Algorithm #2
- Parallel moves have better scalability than pipelined moves (see the sketch below)
- The bottleneck is still memory
- Again, serial equivalency costs little
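Algorithm #2 is described here only as a parallel-moves scheme whose serial equivalency costs little, so this sketch is an illustration of that general idea rather than the presenters' implementation: a batch of moves is evaluated concurrently, then accept/reject decisions are applied in a fixed, thread-independent order, and any move touching a block already claimed by an earlier accepted move in the batch is rejected. The conflict test, the cost model, and the greedy acceptance rule are placeholders.

```python
# Parallel move evaluation with a deterministic, fixed-order commit pass.
from concurrent.futures import ThreadPoolExecutor

def evaluate(move):
    # Stand-in evaluation: returns (move id, blocks it touches, cost change).
    return move, {move % 10, (move * 3) % 10}, (move % 5) - 2

def run_batch(moves, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        evaluated = list(pool.map(evaluate, moves))   # evaluate moves in parallel
    accepted, touched = [], set()
    for move, blocks, delta_c in evaluated:           # commit in a fixed order
        if blocks & touched:
            continue                                  # conflicts with an earlier accept
        if delta_c < 0:                               # greedy accept (stand-in rule)
            accepted.append(move)
            touched |= blocks
    return accepted

print(run_batch(range(16)))                           # deterministic result
```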

Take-Home Messages
- Memory is important
- Good algorithms are even more important