High-Quality, Deterministic Parallel Placement for FPGAs on Commodity Hardware
Adrian Ludwin, Vaughn Betz & Ketan Padalia
FPGA Seminar Presentation, Nov 10, 2009
Overview
- Motivation
- Review of simulated annealing
- Approaches
- Summary
Motivation
Simulated Annealing Placement
- Probabilistic approach to finding a near-optimal solution
- Behavior: moves through the solution space
  - Greedily
  - Randomly
- The balance between greediness and randomness is controlled by a temperature
- The temperature evolves over time according to a cooling schedule
Simulated Annealing Placement
- For a single move, compute the change in cost, ΔC
- Accept the move if:
  - ΔC < 0, or
  - ΔC > 0, with probability e^(-ΔC/T)
- Repeat while gradually decreasing T and the move window size
[Figure: example move, with cell costs c1–c5]
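The acceptance rule above is the standard Metropolis criterion and can be sketched as follows; `accept_move` is a hypothetical helper, not code from the presentation:

```python
import math
import random

def accept_move(delta_c: float, temperature: float, rng: random.Random) -> bool:
    """Metropolis criterion: always take improving moves; take worsening
    moves with probability exp(-dC/T), so the search behaves randomly at
    high temperatures and greedily at low ones."""
    if delta_c < 0:
        return True
    return rng.random() < math.exp(-delta_c / temperature)
```

At a very low temperature, even a small cost increase is almost never accepted, which is what anneals the placement toward a local optimum.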
Constraints
- Runs on commodity hardware
- Good quality of results
- Robust
- Determinism, for:
  - Bug reporting
  - Consistent regression results
Selected Previous Work
- Closely related:
  - Move acceleration
  - Parallel moves
- Other methods:
  - Independent sets
  - Partitioned placements
  - Speculative
Algorithm #1
Algorithm #2
Objective
- Determine efficacy
- Analyze runtime and categorize it:
  - Memory
  - Synchronization
  - Infrastructure
  - Evaluation
  - Proposal
Methodology: parallel equivalent flow
- A serial flow which mimics the parallel flow
- Emulates the behavior of the multithreaded application using only one thread/core
- Useful for comparison: accounts for infrastructure overhead
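One plausible way to build such a serial equivalent is to split the move stream into per-thread streams and then replay them on a single thread in the same deterministic commit order the parallel flow would use. This is only an illustrative sketch; the helper names `partition_round_robin` and `commit_order` are assumptions, not from the talk:

```python
from itertools import zip_longest

def partition_round_robin(moves, n_threads):
    """Deal a serial move stream out to per-thread streams, round-robin."""
    return [moves[i::n_threads] for i in range(n_threads)]

def commit_order(streams):
    """Replay the per-thread streams on one thread, in the deterministic
    round-robin commit order the parallel flow would use."""
    merged = []
    for group in zip_longest(*streams):
        merged.extend(m for m in group if m is not None)
    return merged
```

Because the replay visits moves in the same order as the original serial flow, quality is unchanged, while the partitioning and bookkeeping work is still paid; that is what lets this flow isolate infrastructure overhead.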
Methodology: attributing runtime
- Two types of measurements:
  - Bottom-up (bu): measure each component of a move
  - End-to-end (e2e): measure the runtime of an entire run
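The two measurement styles can be sketched like this (hypothetical helpers; the presentation gives no code). The gap between the end-to-end time and the sum of the bottom-up components is the runtime that cannot be attributed to any single move component:

```python
import time

def bottom_up(components):
    """Bottom-up (bu): time each component of a move separately.
    `components` maps a name, e.g. 'propose' or 'evaluate', to a callable."""
    timings = {}
    for name, fn in components.items():
        start = time.perf_counter()
        fn()
        timings[name] = time.perf_counter() - start
    return timings

def end_to_end(run):
    """End-to-end (e2e): time an entire run as one block."""
    start = time.perf_counter()
    run()
    return time.perf_counter() - start
```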
Methodology
Test Sets
- Set of 11 Stratix® II FPGA benchmark designs
  - IP and customer circuits
  - 10k to 100k logic cells
- Also tested on 40 Stratix II FPGA circuits
  - Obtained similar results
Results for Algorithm #1
Move attribution
Overhead analysis
Observations
- Theoretical speedup: 1.7x
- Measured: 1.3x (best)
- Increase in evaluation runtime, due to reduced cache locality
- Proposal time is “hidden”
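A ~1.7x theoretical ceiling is what a two-stage move pipeline gives when its throughput is limited by the slowest stage. The 40%/60% proposal/evaluation split below is an illustrative assumption, not a figure from the presentation:

```python
def pipeline_speedup(stage_fractions):
    """Ideal speedup of a pipelined move: total serial work divided by the
    slowest stage, since that stage limits pipeline throughput."""
    return sum(stage_fractions) / max(stage_fractions)

# With proposal at 40% and evaluation at 60% of serial move time,
# the ideal speedup is 1 / 0.6, roughly 1.7x.
```

This also explains why proposal time is “hidden”: as the faster stage, it overlaps entirely with evaluation, which alone sets the pipeline's throughput.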
Analysis
- Time spent on stalls is negligible
- Evaluation accounts for most of the overhead
- Little to gain by removing determinism: serial equivalency is less than 3% of runtime
Summary for Algorithm #1
- Speedup: 1–1.3x
- Memory inefficiency is the biggest bottleneck
- Theoretically, the algorithm should scale; however, it is difficult to partition and balance the two stages
Speedups for Algorithm #2
Attribution on 2 cores
Attribution on 4 cores
Attribution on 4 cores
Observations
- Memory latency due to inter-processor communication
- Worsens with more cores
Summary for Algorithm #2
- Parallel moves have better scalability than pipelined moves
- The bottleneck is still memory
- Again, serial equivalency costs little
Take-Home Messages
- Memory is important
- Good algorithms are even more important