© 2008 Altera Corporation High-Quality, Deterministic Parallel Placement for FPGAs on Commodity Hardware Adrian Ludwin, Vaughn Betz and Ketan Padalia.


1 © 2008 Altera Corporation High-Quality, Deterministic Parallel Placement for FPGAs on Commodity Hardware Adrian Ludwin, Vaughn Betz and Ketan Padalia

2 FPGA Size vs CPU Performance
CPUs: 7x faster
FPGAs: 33x bigger
© 2008 Altera Corporation - Public. Altera, Stratix, Cyclone, MAX, HardCopy, Nios, Quartus, and MegaCore are trademarks of Altera Corporation.

3 Our Contributions
Parallelized existing high-quality placer
  - Routability, timing and power driven
  - Deterministic
  - Good speedups with identical quality
Present results on multicore PCs
Identify and quantify bottlenecks

4 Non-Determinism
Extremely difficult to test for correctness
Extremely difficult to reproduce problems
Very unpopular with customers
  - Some outright refuse to use ND algorithms
  - All customers value reproducible results
We show that making our algorithms deterministic has a relatively small impact on performance.

5 Serial Equivalency
Any number of cores returns same result
  - Including a single core (hence “serial”)
Easy if algorithm is already deterministic
Even easier to test than determinism
Serial equivalency has no additional overhead over determinism in our algorithms.

6 Algorithm Runtimes
The placer algorithms in this paper are a significant portion of overall runtime, but are not a majority.

7 Agenda
Part I: Pipelined Moves
Part II: Parallel Moves

8 Algorithm Pseudo-Code
Proposal:
    move = propose(place);
Evaluation:
    cost = evaluate(place, move);
    if(cost < 0) { accept(place, move); }
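
The loop on this slide could be fleshed out roughly as follows. This is only a toy sketch: the 1-D placement, the `wirelength` cost, the swap-on-occupied rule in `apply_move` and the `run_serial` driver are all assumptions made for illustration, and the acceptance test is the slide's simplified `cost < 0` check rather than the real placer's criterion.

```python
import random

def wirelength(place, nets):
    # Hypothetical cost: total 1-D span of every net.
    return sum(max(place[b] for b in net) - min(place[b] for b in net)
               for net in nets)

def propose(place, rng, num_locations):
    # Pick a block and a target location.
    return rng.randrange(len(place)), rng.randrange(num_locations)

def apply_move(place, move):
    # Return a new placement; an occupied target turns the move into a swap.
    block, target = move
    new_place = list(place)
    if target in new_place:
        new_place[new_place.index(target)] = new_place[block]
    new_place[block] = target
    return new_place

def evaluate(place, move, nets):
    # Change in cost if the move were applied (negative means better).
    return wirelength(apply_move(place, move), nets) - wirelength(place, nets)

def run_serial(place, nets, num_locations, num_moves, seed):
    rng = random.Random(seed)   # fixed seed keeps the run reproducible
    place = list(place)
    for _ in range(num_moves):
        move = propose(place, rng, num_locations)
        cost = evaluate(place, move, nets)
        if cost < 0:
            place = apply_move(place, move)   # accept
    return place
```

Running `run_serial` twice with the same seed returns the same placement, which is the kind of reproducibility the earlier slides ask for.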

9 Effect of Pipelining Proposals
Proposal (40% time):
    move = propose(place);
Evaluation (60% time):
    cost = evaluate(place, move);
    if(cost < 0) { accept(place, move); }
Expected speedup: 1/0.6 ≈ 1.7x
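
The 1.7x figure follows from the slowest stage setting the pace of an ideal two-stage pipeline. A minimal sketch of that arithmetic, assuming each stage gets its own core and there is no communication or stall overhead (the function name `pipelined_speedup` is made up here):

```python
def pipelined_speedup(stage_fractions):
    """Ideal speedup when every stage runs on its own core.

    stage_fractions: share of serial runtime spent in each stage.
    The pipeline can go no faster than its slowest stage.
    """
    return sum(stage_fractions) / max(stage_fractions)

# Proposal takes 40% of the serial runtime, evaluation 60%.
print(round(pipelined_speedup([0.4, 0.6]), 2))  # 1.67
```

With evenly balanced stages the same formula would give 2x, so the 40/60 split is what caps this approach below 2x even before any overhead is counted.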

10 Simplistic Implementation
[Diagram: Proposal runs on Core 0, Evaluation runs on Core 1]

11 Simplistic Implementation
[Diagram: Proposal (C0) and Evaluation (C1) working on Move 0 and Move 1]
In this example, C1 has just started evaluating a move, while C0 has just started proposing the next one. Since proposals are faster than evaluations (at least in theory), C0 will finish before C1. It then stalls until C1 is ready to take the move. When C1 is ready, it grabs the proposed move and starts evaluating it, and C0 can begin proposing the next move.

12 Simplistic Implementation
[Diagram: Proposal (C0) and Evaluation (C1) working on Move 1 and Move 2]

13 Naïve Pipelined Problems
1. Proposal/evaluation runtime variability
   If evaluation is faster than proposal, then the stall happens on the critical path.
2. Large penalty for stalling
   After C0 stalls, it takes almost as long to wake it up as it does to propose the move in the first place!

14 Better Implementation
[Diagram: Proposal (C0) and Evaluation (C1)]

15 Better Implementation
[Diagram: Proposal (C0) places each move in an Evaluation Queue, from which Evaluation (C1) takes it]

16 Better Implementation
[Diagram: Proposal (C0), Evaluation Queue, Evaluation (C1)]
The queue buffers proposal/evaluation runtime variability and “hides” the stalls on C0 from the critical path on C1.
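
The queue-based structure can be sketched with two threads and a bounded FIFO. This only illustrates the shape of the pipeline: the names `proposer`, `evaluator`, `run_pipeline` and `queue_depth` are invented for the sketch, moves are reduced to integer IDs, and in standard CPython builds the global interpreter lock means threads like these will not deliver the speedups discussed in the talk.

```python
import queue
import threading

def proposer(eval_queue, num_moves):
    # Stage on C0: produce moves; blocks only when the queue is full.
    for move_id in range(num_moves):
        eval_queue.put(move_id)
    eval_queue.put(None)              # sentinel: no more moves

def evaluator(eval_queue, evaluated):
    # Stage on C1: consume moves; blocks only when the queue is empty.
    while True:
        move_id = eval_queue.get()
        if move_id is None:
            break
        evaluated.append(move_id)     # stand-in for evaluate/accept

def run_pipeline(num_moves, queue_depth=8):
    eval_queue = queue.Queue(maxsize=queue_depth)
    evaluated = []
    c0 = threading.Thread(target=proposer, args=(eval_queue, num_moves))
    c1 = threading.Thread(target=evaluator, args=(eval_queue, evaluated))
    c0.start()
    c1.start()
    c0.join()
    c1.join()
    return evaluated
```

Because the queue is first-in, first-out, the evaluator sees moves in exactly the order they were proposed, whatever the thread timing. A deeper queue lets the proposer run further ahead before it has to wait.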

17 Proposal State Updates
[Diagram: Proposal (C0), Evaluation Queue, Evaluation (C1), and an Accepted Moves Queue]

18 Proposal Example
[Diagram: Block 1, Block 2, Move 1, Move 5]
In this example, we propose a move for block 1 to an empty location. Since we don’t know if it will ultimately be accepted by the evaluation stage, we assume (for the time being) that it will be rejected. Some time later, if we haven’t heard back from the evaluation stage, it might be reasonable to propose a move for another block to the same “empty” location.

19 Evaluation Example
[Diagram: Block 1, Block 2, “Move 1 accepted”, “Move 5?”]
In the meantime, however, the evaluation stage has accepted Move 1; it just wasn’t able to tell the proposal stage about it in time (race condition!). But the later move to the no-longer-empty location is already in the pipe. It can no longer be performed as proposed; what should we do about this?

20 Resolving Collisions
When two moves have collided, we can:
  - Abandon the later moves (non-deterministic)
  - Attempt to “fix” colliding moves
We fix the later move by reproposing it
  - In this example, Move 5 becomes a swap
  - This gives the same move as in the serial flow
Therefore, the placer is serially equivalent
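
The reproposal idea can be tried out in a small single-threaded simulation. Everything here is an assumption made for illustration: the 1-D placement, the `wirelength` cost, the dictionary layout of a move, and the `lag` parameter, which models how many moves the proposal stage runs ahead of evaluation by handing it a stale copy of the placement.

```python
import random

def occupant(place, location):
    """Block currently at `location`, or None if the location is empty."""
    return place.index(location) if location in place else None

def propose(view, rng, num_locations):
    """Draw a block and a target; record who the proposer thinks is there."""
    block = rng.randrange(len(view))
    target = rng.randrange(num_locations)
    return {"block": block, "target": target,
            "swap_with": occupant(view, target)}

def commit(place, move):
    """Apply the move in place; an occupied target makes it a swap."""
    if move["swap_with"] is not None:
        place[move["swap_with"]] = place[move["block"]]
    place[move["block"]] = move["target"]

def wirelength(place, nets):
    return sum(max(place[b] for b in n) - min(place[b] for b in n)
               for n in nets)

def run(start, nets, num_locations, num_moves, seed, lag):
    """lag = moves the proposer runs ahead of evaluation (0 = serial)."""
    rng = random.Random(seed)
    place = list(start)
    history = [list(place)]   # history[k] = placement after k finalized moves
    reproposals = 0
    for i in range(num_moves):
        view = history[max(0, i - lag)]           # possibly stale state
        move = propose(view, rng, num_locations)
        actual = occupant(place, move["target"])
        if move["swap_with"] != actual:           # collided with an accepted move
            move = dict(move, swap_with=actual)   # repropose on current state
            reproposals += 1
        trial = list(place)
        commit(trial, move)
        if wirelength(trial, nets) - wirelength(place, nets) < 0:
            place = trial
        history.append(list(place))
    return place, reproposals

start, nets = [0, 5, 2, 7], [(0, 1), (1, 2), (2, 3)]
serial, _ = run(start, nets, 8, 300, seed=7, lag=0)
pipelined, fixed = run(start, nets, 8, 300, seed=7, lag=4)
print(serial == pipelined, fixed)
```

In this sketch the random draws (block and target) do not depend on the placement, so a reproposed move is the one the serial flow would have produced, and the final placements match for any `lag`.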

21 Platforms
[Diagram: three test platforms, each drawn with its cores (C), caches ($) and memory controller: Netburst x2 (Pentium 4), Dual-core Opteron x2, and Core 2 Duo x2. Configuration labels: nb, opt-mc, c2-mc, opt-dc, opt-dp, c2-dc, c2-dp. Memory sizes shown: 2 GB, 4 GB, 16 GB.]
To test a two-core algorithm on a four-core machine, we can either use two cores on the same package (“dc” = “dual core”) or we can use one core on each package (“dp” = “dual processor”). This decision has a large influence on the performance of the algorithm.

22 Pipelined Results - 11 Circuits
The results are far lower than the 1.7x ideal. Note that the best and worst results are both on the same platform (Core 2). Where is the runtime going on c2-dp?

23 Algorithm Components – c2-dp
This is the pipelined algorithm, but with both stages taking turns on the same core. This uses high-resolution timers to show the runtime of each stage. For the pipelined algorithm, we ignore the proposal time since it’s “hidden.” But why has the evaluation time gotten so big?

24 Explaining the Results
Reproposals, stalls are very fast
Memory is bottleneck on 4/5 platforms
  - Exception: c2-dc has large, shared cache
  - Many, many more details are in the paper

25 Pipelined Moves Summary
Poor inherent scalability, memory usage
Reasonable speedups for amount of work
  - Far less work than fully parallel moves

26 Agenda
Part I: Pipelined Moves
Part II: Parallel Moves

27 Stages with Thread-Safe Code
Processing (propose and evaluate), 99% time:
    move = propose(place);
    cost = evaluate(place, move);
Finalization (resolve collisions and commit), 1% time:
    if(cost < 0) { accept(place, move); }
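
One way to read the 99%/1% split is through Amdahl's law, treating finalization as the part that cannot be spread across cores. That reading is an assumption of this sketch, not a claim from the slides, and it ignores memory and synchronization costs, so it gives an optimistic ceiling rather than a prediction. The function name `amdahl_speedup` is made up here.

```python
def amdahl_speedup(cores, serial_fraction):
    """Upper bound on speedup when `serial_fraction` of the work is serial."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# Assumed split from the slide: 1% finalization, 99% processing.
for cores in (1, 2, 4, 8, 16):
    print(cores, round(amdahl_speedup(cores, 0.01), 2))
```

Under these assumptions the ceiling stays close to the core count for small machines, which is why splitting the work this way scales further than the two-stage pipeline.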

28 [Diagram: Core 0 to Core 3 each run Processing (propose and evaluate) as Process (C0) to Process (C3); a Queue holds processed moves for Finalization (resolve collisions and commit)]

29 [Diagram: Process (C0) to Process (C3) working on Moves 0 to 4, a Queue of processed moves, and Finalize (C0)]
All four cores begin processing moves at the same time. Since finalizing moves is so fast, it would be a waste to devote a core to that task. Instead, all cores have the ability to finalize moves at the appropriate time, as this example will show. If a move finishes out of order, it sits in the priority queue until the earlier moves are finished. Meanwhile, the core that processed it goes on to the next move. It does not stall and wait for any other cores. The priority queue now has two moves ready to be finalized.

30 Supervisor (2)
[Diagram: Queue, Process (C1), Process (C2), Process (C3) and Finalize (C0); Moves 0 to 4]
The core that processed this last move now becomes responsible for finalizing all the moves in the queue. Note that it did not have to wait for any other core; it knows that the move it inserted went to the front of the queue.

31 Supervisor (3)
[Diagram: Queue, Process (C0) to Process (C3) and Finalize (C2); Moves 2 to 7]
Once a core has finished finalizing moves, it immediately goes back to processing them. The algorithm continues, with any core being able to finalize moves whenever it’s appropriate.
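
The finalization rule walked through on these slides can be sketched with a min-heap keyed on move number. This is a single-threaded illustration under stated assumptions: the `Finalizer` class and its method names are invented here, a real multi-threaded version would need a lock around the queue, and the collision-resolution and commit work is reduced to a log entry.

```python
import heapq

class Finalizer:
    def __init__(self):
        self.ready = []     # min-heap of processed moves awaiting finalization
        self.next_id = 0    # lowest move number not yet finalized
        self.log = []       # (move_id, core that finalized it)

    def finish_processing(self, move_id, core):
        """Called by `core` when it has finished processing `move_id`."""
        heapq.heappush(self.ready, move_id)
        # If this move is not next in line, the loop body never runs and
        # the core simply returns to processing without waiting.
        # Otherwise this core finalizes every consecutive ready move.
        while self.ready and self.ready[0] == self.next_id:
            finished = heapq.heappop(self.ready)
            self.log.append((finished, core))   # stand-in for the commit
            self.next_id += 1

f = Finalizer()
# Moves 1 and 2 finish before move 0, then move 3 finishes.
for move_id, core in [(1, "C1"), (2, "C2"), (0, "C0"), (3, "C3")]:
    f.finish_processing(move_id, core)
print(f.log)
```

Moves are always finalized in move-number order regardless of which core finishes first, which is what keeps the result independent of thread timing.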

32 Parallel Results - 11 Circuits
[Charts: opt-mc and c2-mc]

33 Algorithm Components – c2-mc

34 Parallel Moves Summary
Memory still bottleneck
  - Especially at 4 cores
  - But less than in pipelined
Much more scalable (N instead of 1.7x)

35 Conclusions
Significant parallelism in existing placer
  - Believe sufficient parallelism for 8-16 cores
  - More independent moves could scale further
Determinism has a relatively low cost
Memory is largest parallel bottleneck
  - Better hardware will help
  - A first-order concern for algorithm developers

