Scalable and Deterministic Timing-Driven Parallel Placement for FPGAs. Chris Wang. Supervisor: Dr. Guy Lemieux. October 20, 2011.

Presentation transcript:


Motivation: 3.8X gap over the past 5 years (figure labels: 6X and 1.6X)

Solution
Trend suggests multicore processors rather than faster processors
Employ parallel algorithms that utilize multicore CPUs to speed up FPGA CAD algorithms
Specifically, this thesis targets the parallelization of the simulated-annealing based placement algorithm

Thesis Contributions
Parallel placement on multicore CPUs
– Implemented in VPR 5.0.2 using Pthreads
Deterministic
– Result reproducible when the same number of threads is used
Timing-driven
Scalability
– Runtime: scales to 25 threads
– Quality: independent of the number of threads used
– 161X speedup over VPR with 13%, 10% and 7% degradation in post-routing min. channel width, wirelength, and critical-path delay
– Can scale beyond 500X with <30% quality degradation

Publications
[1] C.C. Wang and G.G.F. Lemieux. Scalable and deterministic timing-driven parallel placement for FPGAs. In FPGA, pages , 2011
– Core parallel placement algorithm presented in this thesis
– Best paper award nomination (top 3)
[2] C.C. Wang and G.G.F. Lemieux. Superior quality parallel placement based on individual LUT placement. Submitted for review.
– Places individual LUTs directly and avoids clustering to improve quality
Related work inspired by [1]
J.B. Goeders, G.G.F. Lemieux, and S.J.E. Wilton. Deterministic timing-driven parallel placement by simulated annealing using half-box window decomposition. To appear in ReConFig,

Overview: Motivation, Background, Parallel Placement Algorithm, Result, Future Work, Conclusion

Background
FPGA placement: NP-complete problem

Background - continued
Choice of FPGA placement algorithm: “… simulated-annealing based placement would still be in dominate use for a few more device generations … ” -- H. Bian et al. Towards scalable placement for FPGAs. FPGA 2010
Versatile Place and Route (VPR) has become the de facto simulated-annealing based academic FPGA placement tool

Background - continued (animation over a grid of placed blocks a through n)
1. Random placement
2. Propose swap
3. Evaluate swap
– If rejected, the two blocks stay where they were
– If accepted, the two blocks trade places
And repeat for another block…

Background - continued
Swap evaluation
1. Calculate the change in cost (Δc)
– Δc is a combination of the targeted metrics
2. Compare random(0,1) > e^(-Δc/T)?
– If so, the swap is rejected; otherwise it is accepted
– The temperature T has a big influence on the acceptance rate
– If Δc is negative, it is a good move and will always be accepted
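The swap evaluation above can be sketched in a few lines of Python. This is a minimal illustration of the standard simulated-annealing acceptance test, not code from VPR; the function name `accept_swap` and its parameters are hypothetical.

```python
import math
import random


def accept_swap(delta_c, temperature, rng):
    """Return True if a proposed swap is accepted (hypothetical helper).

    delta_c     -- change in cost caused by the swap (negative is an improvement)
    temperature -- current annealing temperature T
    rng         -- a random.Random instance, passed in so runs are reproducible
    """
    if delta_c <= 0:
        return True                      # never worse: always accepted
    if temperature <= 0:
        return False                     # frozen: no uphill moves
    # Reject when random(0,1) > e^(-delta_c / T), accept otherwise.
    return rng.random() < math.exp(-delta_c / temperature)
```

At a high temperature nearly every uphill move passes the test, while at a low temperature almost none do, which is why the temperature dominates the acceptance rate.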

Background - continued
Simulated-annealing schedule
– Temperature correlates directly with acceptance rate
– Starts at a high temperature and gradually lowers it
– The schedule is a combination of carefully tuned parameters: initial condition, exit condition, temperature update factor, swap range, etc.
– A good schedule is essential for a good QoR curve
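To make the schedule parameters concrete, here is a hedged sketch of an adaptive schedule in the style usually described for VPR. The constants (0.96, 0.8, 0.15, 0.44, 0.005) are assumptions for illustration and are not taken from these slides; the function names are hypothetical as well.

```python
def update_temperature(temperature, accept_rate):
    """Temperature update factor chosen from the observed acceptance rate."""
    if accept_rate > 0.96:
        factor = 0.5      # almost everything accepted: cool quickly
    elif accept_rate > 0.8:
        factor = 0.9
    elif accept_rate > 0.15:
        factor = 0.95     # most productive range: cool slowly
    else:
        factor = 0.8
    return temperature * factor


def update_swap_range(swap_range, accept_rate, max_range):
    """Shrink or grow the swap range to steer the acceptance rate toward 0.44."""
    new_range = swap_range * (1.0 - 0.44 + accept_rate)
    return min(max(new_range, 1.0), max_range)


def should_exit(temperature, cost, num_nets):
    """Exit condition: temperature is small compared to the average cost per net."""
    return temperature < 0.005 * cost / num_nets
```

Each temperature step would run a fixed number of swaps, measure the acceptance rate, and then call these three functions to decide the next temperature, the next swap range, and whether to stop.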

Background - continued
Important FPGA placement algorithm properties:
1. Determinism: for a given constant set of inputs, the outcome is identical regardless of the number of times the program is executed.
– Reproducibility is useful for code debugging, bug reproduction/customer support, and regression testing.
2. Timing-driven (in addition to area-driven): 42% improvement in speed while sacrificing 5% wire length. Marquardt et al. Timing-driven placement for FPGAs. FPGA

Background - continued

Name (year)        Hardware               Deterministic?  Timing-driven?  Result
Casotto (1987)     Sequent Balance 8000   No              No              6.4x on 8 processors
Kravitz (1987)     VAX 11/784             No              No              < 2.3x on 4 processors
Rose (1988)        National 32016         No              No              ~4x on 5 processors
Banerjee (1990)    Hypercube MP           No              No              ~8x on 16 processors
Witte (1991)       Hypercube MP           Yes             No              3.3x on 16 processors
Sun (1994)         Network of machines    No              No              5.3x on 6 machines
Wrighton (2003)    FPGAs                  No              No              500x-2500x over CPUs
Smecher (2009)     MPPAs                  No              No              1/256 fewer swaps needed with 1024 cores
Choong (2010)      GPU                    No              No              10x on NVIDIA GTX280
Ludwin (2008/10)   MPs                    Yes             Yes             2.1x and 2.4x on 4 and 8 processors
This work          MPs                    Yes             Yes             161x using 25 processors

Background - continued (animation over a grid of placed blocks a through n)
The main difficulty in parallelizing FPGA placement is avoiding conflicts
– Hard-conflict: must be avoided
– Soft-conflict: allowed but degrades quality
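The slides show the two kinds of conflict only as pictures, so the following sketch rests on stated assumptions: a hard conflict is taken to mean that two concurrent swaps touch the same grid location, and a soft conflict is taken to mean that the swaps touch different locations but move blocks that share a net, so one swap is evaluated with out-of-date positions. The data layout (`block_at`, `nets_of`) and all function names are hypothetical.

```python
def hard_conflict(swap_a, swap_b):
    """A swap is a pair of grid locations; sharing a location is a hard conflict."""
    return bool(set(swap_a) & set(swap_b))


def nets_touched(swap, block_at, nets_of):
    """Collect the nets attached to the blocks sitting at a swap's two locations."""
    nets = set()
    for loc in swap:
        block = block_at.get(loc)          # None if the location is empty
        nets |= nets_of.get(block, set())
    return nets


def soft_conflict(swap_a, swap_b, block_at, nets_of):
    """Different locations, but the moved blocks are connected by a common net."""
    if hard_conflict(swap_a, swap_b):
        return False
    shared = nets_touched(swap_a, block_at, nets_of) & nets_touched(swap_b, block_at, nets_of)
    return bool(shared)
```

Under these assumptions a hard conflict would corrupt the placement (two blocks on one site), while a soft conflict only makes a cost estimate slightly stale, which matches the slides' "must be avoided" versus "allowed but degrades quality".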


Parallel Placement Algorithm (animation over the FPGA grid; CLB ↔ I/O)
1. Partition the grid for 4 threads (T1, T2, T3, T4)
2. Mark a swap-from region and a swap-to region for each thread
3. Create local copies of global data
4. Each thread evaluates swaps out of its swap-from region
5. Broadcast placement changes
6. Continue to the next swap from/to region…
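The partition, local copy, and broadcast steps can be sketched as follows. This is a simplified illustration under several assumptions, not the thesis implementation: workers are simulated one after another instead of with Pthreads, the grid is split into vertical strips, swaps only exchange two blocks inside one strip, acceptance is greedy (no temperature), and cost is half-perimeter wirelength. All names are hypothetical.

```python
import random


def hpwl(net, pos):
    """Half-perimeter wirelength of one net (a tuple of block ids)."""
    xs = [pos[b][0] for b in net]
    ys = [pos[b][1] for b in net]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def strip_of(x, width, num_threads):
    """Index of the vertical strip (one per worker) that column x belongs to."""
    return min(x * num_threads // width, num_threads - 1)


def parallel_pass(pos, nets, width, num_threads, seed, pass_no, moves=20):
    """One pass: every worker swaps blocks inside its own strip, then broadcasts."""
    snapshot = dict(pos)                      # global data as of the last broadcast
    updates = []
    for t in range(num_threads):              # workers simulated sequentially
        # A per-worker seed keeps the result reproducible for a given thread count.
        rng = random.Random(seed * 1_000_003 + pass_no * 1_009 + t)
        local = dict(snapshot)                # local (soon stale) copy of global data
        mine = sorted(b for b, (x, _) in snapshot.items()
                      if strip_of(x, width, num_threads) == t)
        for _ in range(moves):
            if len(mine) < 2:
                break
            a, b = rng.sample(mine, 2)
            affected = [n for n in nets if a in n or b in n]
            before = sum(hpwl(n, local) for n in affected)
            local[a], local[b] = local[b], local[a]
            after = sum(hpwl(n, local) for n in affected)
            if after > before:                # greedy test stands in for annealing
                local[a], local[b] = local[b], local[a]
        updates.append({b: local[b] for b in mine})
    for update in updates:                    # broadcast placement changes
        pos.update(update)
    return pos
```

Because each worker only moves blocks inside its own strip, no two workers can touch the same location, so no locking is needed; blocks in other strips are seen at their snapshot positions, which is where stale cost estimates come from. Calling `parallel_pass` twice with the same seed and thread count returns the same placement.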


Result
7 synthetic circuits from the Un/DoPack flow
Clustered with T-VPack
Dell R815: 4 sockets, each with an 8-core AMD Opteron at 2.0 GHz; 32 GB of memory
Baseline: VPR –place_only
Only placement time is measured
– Excludes netlist reading, etc.

Quality – Post Routing Wirelength (plots)

Quality – Post Routing Minimum Channel Width (plot)

Quality – Post Routing Critical-Path Delay (plot)

Quality – speed up over VPR (plots)

Effect of scaling on QoR inner_num= 1


Further runtime scaling
Can we scale beyond 25 threads?
Better load-balancing techniques
– Improved region partitioning
New data structures
– Support fully parallelizable timing updates
– Reduce inter-processor communication
Incremental timing analysis update
– May benefit QoR as well!

Future Work - LUT placement (figures: placement grid of blocks a through n; chart labels 21%, 28%, and a third value that is missing)

Conclusion
Determinism without fine-grain synchronization
– Split work into non-overlapping regions
– Local (stale) copy of global data
Runtime scalable, timing-driven
Quality unaffected by the number of threads
Speedup:
– >500X over VPR with <30% quality degradation
– 161X speedup over VPR with 13%, 10% and 7% degradation in post-routing min. channel width, wirelength, and critical-path delay
Limitation: cannot match VPR’s quality
– LUT placement is a promising approach to mitigate this issue

Questions?