© 2008 Altera Corporation. High-Quality, Deterministic Parallel Placement for FPGAs on Commodity Hardware. Adrian Ludwin, Vaughn Betz and Ketan Padalia.

© 2008 Altera Corporation - Public. Altera, Stratix, Cyclone, MAX, HardCopy, Nios, Quartus, and MegaCore are trademarks of Altera Corporation.
FPGA Size vs CPU Performance: CPUs have become 7x faster, while FPGAs have become 33x bigger.

Our Contributions. Parallelized an existing high-quality placer: routability-, timing- and power-driven; deterministic; good speedups with identical quality. We also present results on multicore PCs, and identify and quantify the bottlenecks.

Non-Determinism. Non-deterministic algorithms are extremely difficult to test for correctness, make problems extremely difficult to reproduce, and are very unpopular with customers: some outright refuse to use them, and all customers value reproducible results. We show that making our algorithms deterministic has a relatively small impact on performance.

Serial Equivalency. Any number of cores returns the same result, including a single core (hence "serial"). This is easy to achieve if the algorithm is already deterministic, and even easier to test than determinism. Serial equivalency has no additional overhead over determinism in our algorithms.
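Serial equivalency also makes testing straightforward: run the placer at several core counts and assert that the results are identical to the single-core run. A hypothetical harness sketch (the `place` function is a stand-in, not the paper's placer):

```python
def place(circuit, num_cores):
    # Stand-in for a serially equivalent placer: by definition,
    # the result must not depend on num_cores.
    return sorted(circuit)

baseline = place(["b3", "b1", "b2"], num_cores=1)
for cores in [2, 4, 8]:
    # Any core count must reproduce the single-core result exactly.
    assert place(["b3", "b1", "b2"], num_cores=cores) == baseline
print(baseline)  # → ['b1', 'b2', 'b3']
```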

Algorithm Runtimes. The placer algorithms in this paper are a significant portion of overall runtime, but not a majority of it.

Agenda. Part I: Pipelined Moves. Part II: Parallel Moves.

Algorithm Pseudo-Code. The proposal stage generates a move; the evaluation stage costs it and accepts it if the cost is negative: move = propose(place); cost = evaluate(place, move); if (cost < 0) { accept(place, move); }

Effect of Pipelining Proposals. The proposal stage (move = propose(place)) takes 40% of the runtime; the evaluation stage (cost = evaluate(place, move); if (cost < 0) { accept(place, move); }) takes 60%. Expected speedup from overlapping them: 1/0.6 ≈ 1.7x.
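In an ideal pipeline the critical path is the slowest stage, which is where the 1/0.6 ≈ 1.7x figure comes from. A quick sketch of the arithmetic (the function name is illustrative):

```python
def pipeline_speedup(stage_fractions):
    # Ideal speedup when pipeline stages overlap perfectly:
    # the critical path is the slowest stage.
    return 1.0 / max(stage_fractions)

# Proposal takes 40% of serial runtime, evaluation 60%.
print(round(pipeline_speedup([0.4, 0.6]), 2))  # → 1.67
```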

Simplistic Implementation. Core 0 runs the proposal stage; Core 1 runs the evaluation stage.

Simplistic Implementation. In this example, C1 has just started evaluating a move while C0 has just started proposing the next one. Since proposals are faster than evaluations (at least in theory), C0 will finish before C1 and then stalls until C1 is ready to take the move. When C1 is ready, it grabs the proposed move and starts evaluating it, and C0 can begin proposing the next move.

Simplistic Implementation. C0 then proposes Move 2 while C1 evaluates Move 1.

Problems with the Naïve Pipeline. 1. Proposal/evaluation runtime variability: if evaluation is faster than proposal, the stall happens on the critical path. 2. Large penalty for stalling: after C0 stalls, it takes almost as long to wake it up as it does to propose the move in the first place!

Better Implementation. A move queue is inserted between the proposal stage (C0) and the evaluation stage (C1). The queue buffers proposal/evaluation runtime variability and "hides" the stalls on C0 from the critical path on C1.
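The buffered design can be sketched as a classic bounded producer/consumer pair: the proposer blocks only when the queue is full, so short stalls on the proposal side never land on the evaluator's critical path. This is an illustrative sketch, not the paper's implementation (the move costs are made up):

```python
import queue
import threading

def proposer(moves, eval_queue):
    # Produce proposed moves; block only if the buffer is full.
    for move in moves:
        eval_queue.put(move)
    eval_queue.put(None)  # sentinel: no more moves

def evaluator(eval_queue, accepted):
    # Consume moves in order; the buffer absorbs runtime variability.
    while True:
        move = eval_queue.get()
        if move is None:
            break
        if move["cost"] < 0:  # accept only improving moves
            accepted.append(move)

eval_queue = queue.Queue(maxsize=8)  # bounded buffer between stages
accepted = []
moves = [{"id": i, "cost": c} for i, c in enumerate([-2, 3, -1])]
t = threading.Thread(target=evaluator, args=(eval_queue, accepted))
t.start()
proposer(moves, eval_queue)
t.join()
print([m["id"] for m in accepted])  # → [0, 2]
```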

In the full pipeline, the proposal stage (C0) feeds moves into the evaluation queue for the evaluation stage (C1), and accepted moves flow back through an accepted-moves queue to drive proposal state updates.

Proposal Example. In this example, we propose a move for block 1 to an empty location. Since we don't know whether it will ultimately be accepted by the evaluation stage, we assume (for the time being) that it will be rejected. Some time later, if we haven't heard back from the evaluation stage, it might be reasonable to propose a move for another block to the same "empty" location.

Evaluation Example. In the meantime, however, the evaluation stage has accepted Move 1; it just wasn't able to tell the proposal stage about it in time (a race condition!). But the later move to the no-longer-empty location is already in the pipe. It can no longer be performed as proposed; what should we do about it?

Resolving Collisions. When two moves have collided, we can either abandon the later move (which is non-deterministic) or attempt to "fix" colliding moves. We fix a colliding move by reproposing it: in this example, Move 5 becomes a swap. This yields the same move as in the serial flow; therefore, the placer is serially equivalent.
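A minimal sketch of the reproposal idea, with hypothetical names: before committing a buffered move, check whether its destination was taken after the move was proposed; if so, rebuild the move against the current placement (here, turning it into a swap), matching what the serial flow would have produced.

```python
def finalize(move, placement):
    # Finalize a proposed move, reproposing it if the target
    # location was taken after the move was proposed (a collision).
    src, dst = move
    occupant = placement.get(dst)
    if occupant is None:
        # No collision: plain move to an empty location.
        placement[dst] = placement.pop(src)
    else:
        # Collision: repropose as a swap, as the serial flow would.
        placement[src], placement[dst] = occupant, placement[src]
    return placement

# Block 1 was moved into "B" after Move 5 (block 2 from "A" to "B")
# had already been proposed, so Move 5 is reproposed as a swap.
placement = {"A": "block2", "B": "block1"}
finalize(("A", "B"), placement)
print(placement)  # → {'A': 'block1', 'B': 'block2'}
```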

Platforms. We test three machines: a two-processor Netburst (Pentium 4) system with 2 GB of memory, a system with two dual-core Opterons and 4 GB, and a system with two Core 2 Duos and 16 GB, giving the configurations nb, opt-mc, c2-mc, opt-dc, opt-dp, c2-dc and c2-dp. To test a two-core algorithm on a four-core machine, we can either use two cores on the same package ("dc" = "dual core") or one core on each package ("dp" = "dual processor"). This decision has a large influence on the performance of the algorithm.

Pipelined Results - 11 Circuits. The results are far lower than the 1.7x ideal. Note that the best and worst results both occur on the same platform (Core 2). Where is the runtime going on c2-dp?

Algorithm Components – c2-dp. This is the pipelined algorithm, but with both stages taking turns on the same core, instrumented with high-resolution timers to show the runtime of each stage. For the pipelined algorithm, we ignore the proposal time since it is "hidden." But why has the evaluation time gotten so big?

Explaining the Results. Reproposals and stalls are very fast. Memory is the bottleneck on 4 of the 5 platforms; the exception is c2-dc, which has a large, shared cache. Many more details are in the paper.

Pipelined Moves Summary. Pipelined moves have poor inherent scalability and memory usage, but give reasonable speedups for the amount of work involved: far less work than fully parallel moves.

Agenda. Part I: Pipelined Moves. Part II: Parallel Moves.

Stages with Thread-Safe Code. In move = propose(place); cost = evaluate(place, move); if (cost < 0) { accept(place, move); }, the processing stage (propose and evaluate) takes 99% of the time, while the finalization stage (resolve collisions and commit) takes 1%.

In the parallel architecture, all four cores (C0-C3) run the processing stage (propose and evaluate); finished moves go into a queue for the finalization stage (resolve collisions and commit).

All four cores begin processing moves at the same time. Since finalizing moves is so fast, it would be a waste to devote a core to that task; instead, any core can finalize moves at the appropriate time, as this example shows. If a move finishes out of order, it sits in the priority queue until the earlier moves are finished; meanwhile, the core that processed it moves on to the next move, without stalling or waiting for any other core. Once the priority queue has moves ready to be finalized, the core that processed the last of them becomes responsible for finalizing all the moves in the queue. Note that it did not have to wait for any other core: it knows that the move it inserted went to the front of the queue.

Once a core has finished finalizing moves, it immediately goes back to processing them. The algorithm continues, with any core being able to finalize moves whenever it is appropriate.
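The finalization scheme described above can be sketched with a min-heap ordered by move number: each core pushes its finished move, and whichever core completes the move at the front of the sequence drains every consecutive ready move. A simplified, single-threaded sketch (a real implementation would guard the heap with a lock):

```python
import heapq

class Finalizer:
    def __init__(self):
        self.ready = []     # min-heap of (move_id, move) pairs
        self.next_id = 0    # next move that may be finalized
        self.finalized = []

    def complete(self, move_id, move):
        # Called by the core that finished processing move_id.
        heapq.heappush(self.ready, (move_id, move))
        # Only the core holding the front of the sequence finalizes;
        # it drains every consecutive ready move without waiting.
        while self.ready and self.ready[0][0] == self.next_id:
            _, m = heapq.heappop(self.ready)
            self.finalized.append(m)
            self.next_id += 1

f = Finalizer()
for move_id in [1, 3, 0, 2]:  # moves finish out of order
    f.complete(move_id, f"move{move_id}")
print(f.finalized)  # → ['move0', 'move1', 'move2', 'move3']
```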

Parallel Results - 11 Circuits (opt-mc and c2-mc).

Algorithm Components – c2-mc.

Parallel Moves Summary. Memory is still the bottleneck, especially at four cores, though less so than in the pipelined version. Parallel moves are much more scalable: N-fold instead of 1.7x.

Conclusions. There is significant parallelism in our existing placer: we believe there is sufficient parallelism for 8-16 cores, and more independent moves could scale further. Determinism has a relatively low cost. Memory is the largest parallel bottleneck; better hardware will help, but it remains a first-order concern for algorithm developers.