Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs)
Graeme Smecher, Steve Wilton, Guy Lemieux
FPT 2009, Thursday, December 10, 2009

Landscape
Massively Parallel Processor Arrays
– 2D array of processors (Ambric: 336, PicoChip: 273, AsAP: 167, Tilera: 100)
– Processor-to-processor communication: placement (locality) matters
– Tools/algorithms immature

Opportunity
MPPAs track Moore's Law
– Array size grows, e.g. Ambric: 336, Fermi: 512
Opportunity for FPGA-like CAD?
– Compiler-esque speed needed
– Self-hosted parallel placement
  An M x N array of CPUs computes the placement for M x N programs
  Inherently scalable

Overview
– Architecture
– Placement Problem
– Self-Hosted Placement Algorithm
– Experimental Results
– Conclusions

MPPA Architecture
32 x 32 = 1024 PEs
PE = RISC core + router
RISC core
– In-order pipeline
– More powerful PE than in the previous talk
Router
– 1 cycle per hop

MPPA Architecture (cont'd)
Simple RISC core
– More capable than RVEArch
– Small local RAM

Overview
– Architecture
– Placement Problem
– Self-Hosted Placement Algorithm
– Experimental Results
– Conclusions

Placement Problem
Given: netlist graph
– Set of "cluster" programs, one per PE
– Communication paths
Find: good 2D placement
– Use simulated annealing
– E.g., minimum total Manhattan wirelength
[Figure: cluster programs ("C") placed across the 2D PE grid]
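A minimal sketch (not the authors' code) of the cost function this slide describes: total Manhattan wirelength over all communication paths, with each net taken as a source-to-sink pair for simplicity:

```python
def manhattan_wirelength(nets, placement):
    """Sum of Manhattan distances over all point-to-point nets.

    nets:      iterable of (src_block, dst_block) communication paths
    placement: dict mapping block id -> (x, y) PE coordinate
    """
    total = 0
    for src, dst in nets:
        (x1, y1), (x2, y2) = placement[src], placement[dst]
        total += abs(x1 - x2) + abs(y1 - y2)
    return total
```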

Overview
– Architecture
– Placement Problem
– Self-Hosted Placement Algorithm
– Experimental Results
– Conclusions

Self-Hosted Placement
Idea from Wrighton and DeHon, FPGA '03
– Use an FPGA to place itself
– Imbalanced: a tiny problem size needs a HUGE FPGA
– N FPGAs needed to place a 1-FPGA design

Self-Hosted Placement
Use the MPPA to place itself
– Each PE is powerful enough to place its own cluster
– Removes the imbalance
– E.g., 2 x 3 PEs place 6 "clusters" into a 2 x 3 array
[Figure: snapshots of the 6 clusters being swapped within a 2 x 3 PE array]

Regular Simulated Annealing
1. Initial: random placement
2. For T in {temperatures}:
   For n in 1..N_clusters:
     1. Randomly select 2 blocks
     2. Compute the swap cost
     3. Accept the swap if (i) the cost decreases, or (ii) a random trial succeeds
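For concreteness, a runnable Python sketch of this baseline loop, reusing manhattan_wirelength from above; the geometric cooling schedule and all parameter values are illustrative assumptions, not the talk's settings:

```python
import math
import random

def anneal(nets, placement, t0=10.0, t_stop=0.01, alpha=0.9, swaps_per_t=1000):
    """Baseline sequential SA over a block -> (x, y) placement dict."""
    blocks = list(placement)
    cost = manhattan_wirelength(nets, placement)
    t = t0
    while t > t_stop:
        for _ in range(swaps_per_t):
            a, b = random.sample(blocks, 2)      # randomly select 2 blocks
            placement[a], placement[b] = placement[b], placement[a]
            new_cost = manhattan_wirelength(nets, placement)
            delta = new_cost - cost
            # accept if cost decreases, or if the random (Metropolis) trial succeeds
            if delta <= 0 or random.random() < math.exp(-delta / t):
                cost = new_cost
            else:
                # reject: undo the swap
                placement[a], placement[b] = placement[b], placement[a]
        t *= alpha                               # geometric cooling
    return placement, cost
```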

Modified Simulated Annealing
1. Initial: random placement
2. For T in {temperatures}:
   For n in 1..N_clusters:
     1. Consider all pairs in the neighbourhood of n
     2. Compute the swap cost
     3. Accept the swap if (i) the cost decreases, or (ii) a random trial succeeds
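The change is in the move generator: instead of sampling two arbitrary blocks, each block only considers swaps with PEs in a small fixed neighbourhood. A sketch of such a generator, where the metric choice is an assumption used to reproduce the 4-, 8-, and 12-neighbour sets evaluated later (array-boundary clipping omitted):

```python
def neighbourhood(pos, radius=1, metric="manhattan"):
    """PE coordinates within `radius` of pos, excluding pos itself.

    Manhattan radius 1 gives 4 neighbours, radius 2 gives 12;
    Chebyshev radius 1 gives the 8-neighbour set.
    """
    x, y = pos
    cells = []
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            if dx == dy == 0:
                continue
            d = abs(dx) + abs(dy) if metric == "manhattan" else max(abs(dx), abs(dy))
            if d <= radius:
                cells.append((x + dx, y + dy))
    return cells
```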

Self-Hosted Simulated Annealing
1. Initial: random placement
2. For T in {temperatures}:
   For n in 1..N_clusters:
     1. Update the position chain
     2. Consider all pairs in the neighbourhood of n
     3. Compute the swap cost
     4. Accept the swap if (i) the cost decreases, or (ii) a random trial succeeds

Algorithm Data Structures
– Place-to-block maps: pbm, bpm (DYNAMIC)
– Net-to-block maps: nbm, bnm (STATIC)
[Figure: the maps linking nets, blocks (programs), and PEs]

Algorithm Data Structures (cont'd)
[Figure: the four maps (pbm, bpm, bnm, nbm) classified by whether each PE holds a full map or only a partial map]
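A toy illustration of the four maps as plain dictionaries (a 2 x 2 array, four blocks, two two-terminal nets; the static/dynamic split follows the slides, but the concrete contents here are invented for illustration):

```python
# Static structures: the netlist never changes during annealing.
nbm = {0: [0, 2], 1: [1, 3]}            # net id   -> block ids it connects
bnm = {0: [0], 1: [1], 2: [0], 3: [1]}  # block id -> net ids it touches

# Dynamic structures: the placement changes on every accepted swap.
bpm = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}  # block id -> PE (x, y)
pbm = {pe: blk for blk, pe in bpm.items()}          # PE (x, y) -> block id
```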

Swap Transaction
PEs pair up
– Deterministic order, hardcoded in the algorithm
Each PE computes the cost for its own BlockID
– Current placement cost
– Cost after the swap, if its BlockID were swapped
PE 1 sends its swap cost to PE 2
– PE 2 adds the costs and determines whether the swap is accepted
– PE 2 sends the decision back to PE 1
– PE 1 and PE 2 exchange data structures if the swap is accepted
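The two roles in sketch form, assuming a hypothetical blocking send/recv channel object standing in for the MPPA's router network (the channel API and delta bookkeeping are assumptions; the accept rule mirrors the annealing test above):

```python
import math
import random

def initiator(ch, my_delta):
    """PE 1's side: send the locally computed swap-cost delta, await the decision."""
    ch.send(my_delta)        # cost change for swapping my BlockID, as seen locally
    return ch.recv()         # PE 2's accept/reject decision

def responder(ch, my_delta, temperature):
    """PE 2's side: add both deltas, apply the accept test, reply with the decision."""
    total = ch.recv() + my_delta
    accept = total <= 0 or random.random() < math.exp(-total / temperature)
    ch.send(accept)
    # on accept, both PEs would then exchange their static structures (bnm entries)
    return accept
```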

Data Structure Updates
Dynamic structures
– Local: updated on swap
– Other PEs: updated via the update chain
Static structures
– Exchanged with the swap

Data Communication in a Swap Transaction
– PEs exchange BlockIDs
– PEs exchange the nets for their BlockIDs
– PEs exchange the BlockIDs for their nets (already updated)

Overview
– Architecture
– Placement Problem
– Self-Hosted Placement Algorithm
– Experimental Results
– Conclusions

Methodology
Three versions of Simulated Annealing (SA):
– Slow Sequential SA
  Baseline; generates "ideal" placements
  Very slow schedule (200k swaps per temperature drop)
  Impractical, but nearly optimal
– Fast Sequential SA
  Parameters varied across a practical range
– Fast Self-Hosted SA
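In terms of the anneal() sketch above, the versions differ mainly in schedule; aside from the 200k swaps per temperature drop, all values below are illustrative, and nets and placement0 are assumed inputs from the earlier sketches:

```python
from copy import deepcopy

# Slow Sequential SA: near-optimal baseline, impractically slow schedule.
ideal, ideal_cost = anneal(nets, deepcopy(placement0),
                           swaps_per_t=200_000, alpha=0.99)

# Fast Sequential SA: parameters swept across a practical range.
fast, fast_cost = anneal(nets, deepcopy(placement0),
                         swaps_per_t=1_000, alpha=0.90)

quality_gap = (fast_cost - ideal_cost) / ideal_cost
```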

Benchmark "Programs"
Behavioural Verilog dataflow circuits
– Courtesy of Deming Chen, UIUC
– Compiled into parallel programs using RVETool
Hand-coded Motion Estimation kernel
– Handcrafted in RVEArch
– Not exactly a circuit

Benchmark Characteristics
[Table: benchmark characteristics; array sizes up to 32 x 32]

Result Comparisons
Options investigated:
– Neighbourhood size (4, 8, or 12)
– Update-chain frequency
– Stopping temperature

4-Neighbour Swaps
[Chart: results for 4-neighbour swaps]

8-Neighbour Swaps
[Chart: results for 8-neighbour swaps]

12-Neighbour Swaps
[Chart: results for 12-neighbour swaps]

Update-Chain Frequency
[Chart: results for varying update-chain frequency]

Stopping Temperature
[Chart: results for varying stopping temperature]

Limitations and Future Work
These results were simulated on a PC
– Need to target a real MPPA
– Performance in … vs. … vs. …
Need to model the limited RAM per PE
– We assume the complete netlist and placement state can be divided among all PEs
– Incomplete state if memory is limited? E.g., discard some nets?

Conclusions
Self-Hosted Simulated Annealing
– High-quality placements (within 5%)
– Excellent parallelism and speed
  Only 1/256th the number of swaps needed
– Runs on the target architecture itself
  "Eat your own dog food"
– Computationally scalable
– Memory footprint may not scale to uber-large arrays

Thank you!

EOF