1
High-Quality, Deterministic Parallel Placement for FPGAs on Commodity Hardware
Adrian Ludwin, Vaughn Betz & Ketan Padalia
FPGA Seminar Presentation, Nov 10, 2009
2
Overview
- Motivation
- Review of simulated annealing
- Approaches
- Summary
3
Motivation
4
Simulated Annealing Placement
- Probabilistic approach to finding an optimal solution
- Behavior: moves through the solution space both greedily and randomly
- The balance between greediness and randomness is controlled by a temperature
- The temperature evolves over time based on a cooling schedule
5
Simulated Annealing Placement
- For a single move, compute the change in cost ΔC
- Accept the move if ΔC < 0, or if ΔC > 0 with probability e^(−ΔC/T)
- Repeat while gradually decreasing T and the window size
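To make the accept/reject rule and cooling loop above concrete, here is a minimal, self-contained C++ sketch. The cost model, move generator, initial temperature, and cooling constant are placeholders chosen for illustration, not the placer's actual values.

```cpp
// Minimal sketch of the simulated-annealing accept/reject rule (illustrative only).
#include <cmath>
#include <cstdio>
#include <random>

int main() {
    std::mt19937 rng(42);                                    // fixed seed keeps the run deterministic
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    std::normal_distribution<double> delta_cost(0.0, 1.0);   // stand-in for a real ΔC computation

    double temperature = 10.0;                               // initial T (placeholder)
    const double cooling_rate = 0.95;                        // simple geometric cooling schedule (assumption)
    double cost = 100.0;

    while (temperature > 0.01) {
        for (int move = 0; move < 1000; ++move) {
            double dC = delta_cost(rng);                     // ΔC for a proposed swap of two blocks
            // Accept if the move improves cost, or probabilistically if it worsens it.
            if (dC < 0.0 || uniform(rng) < std::exp(-dC / temperature)) {
                cost += dC;                                  // commit the move
            }
        }
        temperature *= cooling_rate;                         // gradually decrease T
        // (a real placer would also shrink the move window here)
    }
    std::printf("final cost: %f\n", cost);
    return 0;
}
```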
6
Constraints
- Runs on commodity hardware
- Good quality of results
- Robust
- Determinism
  - Bug reporting
  - Consistent regression results
7
Selected Previous Work
- Closely related
  - Move acceleration
  - Parallel moves
- Other methods
  - Independent sets
  - Partitioned placements
  - Speculative
8
Algorithm #1
9
Algorithm #2
10
Objective
- Determine efficacy
- Analyze runtime and categorize it:
  - Memory
  - Synchronization
  - Infrastructure
  - Evaluation
  - Proposal
11
Methodology
- Parallel-equivalent flow: a serial flow that mimics the parallel flow
- Emulates the behavior of the multithreaded application using only one thread/core
- Useful for comparison, since it accounts for infrastructure overhead
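The following hedged C++ sketch shows one way to interpret the parallel-equivalent flow: the same per-worker queues and bookkeeping the multithreaded placer would use are serviced round-robin from a single thread. The names (WorkItem, kVirtualThreads) are illustrative assumptions, not the tool's API.

```cpp
// Sketch of a "parallel equivalent" serial flow: parallel-style bookkeeping, one thread.
#include <cstdio>
#include <queue>
#include <vector>

struct WorkItem { int move_id; };

int main() {
    const int kVirtualThreads = 4;                      // pretend we have 4 worker threads
    std::vector<std::queue<WorkItem>> queues(kVirtualThreads);

    // Fill each virtual thread's queue exactly as the parallel flow would.
    for (int m = 0; m < 100; ++m) {
        queues[m % kVirtualThreads].push(WorkItem{m});
    }

    // Service the queues round-robin on one thread/core. All queueing and
    // ordering overhead is still paid, so comparing against the true serial
    // placer isolates the infrastructure cost of the parallel flow.
    int processed = 0;
    bool work_left = true;
    while (work_left) {
        work_left = false;
        for (auto& q : queues) {
            if (!q.empty()) {
                q.pop();                                // a real flow would propose/evaluate this move
                ++processed;
                work_left = true;
            }
        }
    }
    std::printf("processed %d moves on one thread\n", processed);
    return 0;
}
```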
12
Methodology: attributing runtime
- Two types of measurements:
  - Bottom-up (bu): measure each component of a move
  - End-to-end (e2e): measure the runtime of the entire run
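The two measurement styles can be illustrated with a short C++ sketch: bottom-up timers wrap each component of a move, while an end-to-end timer wraps the whole run. The stage names and dummy workloads are placeholders, not the placer's real stages.

```cpp
// Bottom-up (per-component) vs. end-to-end (whole-run) timing, illustrative only.
#include <chrono>
#include <cstdio>

using Clock = std::chrono::steady_clock;

static volatile long sink = 0;                    // keeps the dummy work from being optimized away
static void do_work(long iters) { for (long i = 0; i < iters; ++i) sink += i; }

int main() {
    double propose_s = 0.0, evaluate_s = 0.0;

    auto run_start = Clock::now();                // end-to-end (e2e) timer for the entire run
    for (int move = 0; move < 1000; ++move) {
        auto t0 = Clock::now();
        do_work(100);                             // "propose" stage
        auto t1 = Clock::now();
        do_work(400);                             // "evaluate" stage
        auto t2 = Clock::now();

        // bottom-up (bu): accumulate per-component times
        propose_s  += std::chrono::duration<double>(t1 - t0).count();
        evaluate_s += std::chrono::duration<double>(t2 - t1).count();
    }
    double e2e_s = std::chrono::duration<double>(Clock::now() - run_start).count();

    // Any gap between e2e and the sum of the bu components is unattributed
    // overhead (timer cost, loop bookkeeping, cache effects, etc.).
    std::printf("bu: propose %.4fs, evaluate %.4fs, sum %.4fs; e2e %.4fs\n",
                propose_s, evaluate_s, propose_s + evaluate_s, e2e_s);
    return 0;
}
```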
13
Methodology
14
Test sets
- Set of 11 Stratix® II FPGA benchmark designs
  - IP and customer circuits
  - 10k to 100k logic cells
- Also tested on 40 Stratix II FPGA circuits; obtained similar results
15
Results for Algorithm #1
16
Move attribution
17
Overhead analysis
18
Observations
- Theoretical speedup: 1.7x; measured: 1.3x (best)
- Increase in evaluation runtime, due to reduced cache locality
- Proposal time is “hidden”
19
Analysis
- Time spent on stalls is negligible
- Evaluation accounts for most of the overhead
- Little to gain by removing determinism: serial equivalency is less than 3% of runtime
20
Summary for Algorithm #1
- Speedup: 1 – 1.3x
- Memory inefficiency is the biggest bottleneck
- Theoretically, the algorithm should scale; however, it is difficult to partition and balance the two stages
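Algorithm #1 is characterized here as a pipelined, two-stage scheme (proposal and evaluation, with proposal time "hidden"). The C++ sketch below shows one way such a pipeline could be wired up with a shared queue; the stage split, queue, and dummy accept test are assumptions for illustration, not the actual implementation.

```cpp
// Two-stage pipelined-move sketch: one thread proposes, another evaluates (illustrative only).
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

struct Proposal { int from; int to; };

std::queue<Proposal> g_queue;
std::mutex g_mu;
std::condition_variable g_cv;
bool g_done = false;

void proposer(int num_moves) {
    for (int m = 0; m < num_moves; ++m) {
        Proposal p{m, m + 1};                          // stand-in for picking two blocks to swap
        {
            std::lock_guard<std::mutex> lock(g_mu);
            g_queue.push(p);
        }
        g_cv.notify_one();
    }
    {
        std::lock_guard<std::mutex> lock(g_mu);
        g_done = true;
    }
    g_cv.notify_one();
}

void evaluator(long* accepted) {
    for (;;) {
        std::unique_lock<std::mutex> lock(g_mu);
        g_cv.wait(lock, [] { return !g_queue.empty() || g_done; });
        if (g_queue.empty() && g_done) return;
        Proposal p = g_queue.front();
        g_queue.pop();
        lock.unlock();
        if ((p.from + p.to) % 2 == 0) ++(*accepted);   // stand-in for ΔC evaluation + accept test
    }
}

int main() {
    long accepted = 0;
    std::thread t1(proposer, 10000);
    std::thread t2(evaluator, &accepted);
    t1.join();
    t2.join();
    std::printf("accepted %ld of 10000 proposals\n", accepted);
    return 0;
}
```

In a two-stage pipeline like this, the overall speedup is limited by how evenly the work splits between the stages, which matches the summary's point that the stages are difficult to partition and balance.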
21
Speedups for Algorithm #2
22
Attribution on 2 cores
24
Attribution on 4 cores
25
Attribution on 4 cores
26
Observations
- Memory latency due to inter-processor communication
- Worsens with more cores
27
Summary for Algorithm #2
- Parallel moves have better scalability than pipelined moves
- The bottleneck is still memory
- Again, serial equivalency costs little
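To illustrate why serial equivalency can cost little with parallel moves, here is a hedged C++ sketch: ΔC for a batch of proposed moves is evaluated concurrently, but the accept/reject decisions are applied in a fixed order with a deterministic random stream. Conflict detection between moves is omitted, and the cost function and PRNG are placeholders; this is an illustration of the idea, not the paper's algorithm.

```cpp
// Parallel move evaluation with a deterministic, serial commit order (illustrative only).
#include <cmath>
#include <cstdio>
#include <future>
#include <vector>

static double evaluate_delta_cost(int move_id) {
    // Placeholder for a real incremental cost computation.
    return std::sin(static_cast<double>(move_id));
}

int main() {
    const int kBatch = 8;
    double temperature = 1.0;
    double cost = 100.0;
    unsigned rng_state = 12345;                            // simple deterministic PRNG state

    // Evaluate a batch of proposals concurrently.
    std::vector<std::future<double>> deltas;
    for (int m = 0; m < kBatch; ++m) {
        deltas.push_back(std::async(std::launch::async, evaluate_delta_cost, m));
    }

    // Commit in move-id order, drawing random numbers in that same order,
    // so the result is independent of which evaluation finished first.
    for (int m = 0; m < kBatch; ++m) {
        double dC = deltas[m].get();
        rng_state = rng_state * 1664525u + 1013904223u;    // LCG step
        double u = (rng_state >> 8) / 16777216.0;          // uniform in [0, 1)
        if (dC < 0.0 || u < std::exp(-dC / temperature)) {
            cost += dC;
        }
    }
    std::printf("cost after batch: %f\n", cost);
    return 0;
}
```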
28
Take-Home Messages
- Memory is important
- Good algorithms are even more important