Frank Vahid, 1 Embedding-Based Placement of Processing element Networks on FPGAs for Physical Model Simulation Bailey Miller, Frank Vahid, Tony Givargis**

Slides:

Advertisements

Similar presentations

Digitally-Bypassed Transducers: Interfacing Digital Mockups to Real-Time Medical Equipment Scott Sirowy*, Tony Givargis and Frank Vahid* This work was.

Advertisements

© 2004 Wayne Wolf Topics Task-level partitioning. Hardware/software partitioning.  Bus-based systems.

1 A Self-Tuning Cache Architecture for Embedded Systems Chuanjun Zhang*, Frank Vahid**, and Roman Lysecky *Dept. of Electrical Engineering Dept. of Computer.

High Performance Embedded Computing © 2007 Elsevier Chapter 7, part 1: Hardware/Software Co-Design High Performance Embedded Computing Wayne Wolf.

Frank Vahid, UCR 1 Building Fake Body Parts: Digital Mockups Frank Vahid Univ. of California, Riverside Support provided by NSF, SRC, and CareFusion.

Zheming CSCE715.  A wireless sensor network (WSN) ◦ Spatially distributed sensors to monitor physical or environmental conditions, and to cooperatively.

ENGIN112 L38: Programmable Logic December 5, 2003 ENGIN 112 Intro to Electrical and Computer Engineering Lecture 38 Programmable Logic.

CS244-Introduction to Embedded Systems and Ubiquitous Computing Instructor: Eli Bozorgzadeh Computer Science Department UC Irvine Winter 2010.

Application-Specific Customization of Parameterized FPGA Soft-Core Processors David Sheldon a, Rakesh Kumar b, Roman Lysecky c, Frank Vahid a*, Dean Tullsen.

A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning Roman Lysecky, Frank Vahid* Department of Computer Science and Engineering.

IUCEE Workshop presentation-YVJoshi VLSI Signal Processing Y. V. Joshi SGGS Institute of Engineering and Technology, Nanded.

Adaptive MPI Chao Huang, Orion Lawlor, L. V. Kalé Parallel Programming Lab Department of Computer Science University of Illinois at Urbana-Champaign.

Application-Specific Codesign Platform Generation for Digital Mockups in Cyber- Physical Systems Bailey Miller *, Frank Vahid *†, Tony Givargis † *Dept.

1/21 Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration Chen Huang and Frank Vahid Dept. of Computer Science and Engineering.

A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-Window Applications From J. Fowers, G. Brown, P. Cooke, and G. Stitt, University.

1 Energy Savings and Speedups from Partitioning Critical Software Loops to Hardware in Embedded Systems Greg Stitt, Frank Vahid, Shawn Nematbakhsh University.

CS 151 Digital Systems Design Lecture 38 Programmable Logic.

ECE 506 Reconfigurable Computing Lecture 7 FPGA Placement.

1 A survey on Reconfigurable Computing for Signal Processing Applications Anne Pratoomtong Spring2002.

ECE 506 Reconfigurable Computing Lecture 8 FPGA Placement.

Frank Vahid, UCR 1 Building Fake Body Parts: Digital Mockups Frank Vahid Univ. of California, Riverside Support provided by NSF, SRC, Dept. of Educ. Also.

Implementation of Digital Front End Processing Algorithms with Portability Across Multiple Processing Platforms September 20-21, 2011 John Holland, Jeremy.

Register-Transfer (RT) Synthesis Greg Stitt ECE Department University of Florida.

EKT303/4 PRINCIPLES OF PRINCIPLES OF COMPUTER ARCHITECTURE (PoCA)

Networking Virtualization Using FPGAs Russell Tessier, Deepak Unnikrishnan, Dong Yin, and Lixin Gao Reconfigurable Computing Group Department of Electrical.

GPU-accelerated Evaluation Platform for High Fidelity Networking Modeling 11 December 2007 Alex Donkers Joost Schutte.

03/12/20101 Analysis of FPGA based Kalman Filter Architectures Arvind Sudarsanam Dissertation Defense 12 March 2010.

Lecture 2: Field Programmable Gate Arrays September 13, 2004 ECE 697F Reconfigurable Computing Lecture 2 Field Programmable Gate Arrays.

ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Novel Architectures Copyright 2004 Daniel J. Sorin Duke University.

CHALLENGING SCHEDULING PROBLEM IN THE FIELD OF SYSTEM DESIGN Alessio Guerri Michele Lombardi * Michela Milano DEIS, University of Bologna.

Making FPGAs a Cost-Effective Computing Architecture Tom VanCourt Yongfeng Gu Martin Herbordt Boston University BOSTON UNIVERSITY.

Floating Point vs. Fixed Point for FPGA 1. Applications Digital Signal Processing -Encoders/Decoders -Compression -Encryption Control -Automotive/Aerospace.

COMPUTER SCIENCE &ENGINEERING Compiled code acceleration on FPGAs W. Najjar, B.Buyukkurt, Z.Guo, J. Villareal, J. Cortes, A. Mitra Computer Science & Engineering.

Efficient FPGA Implementation of QR

Lessons Learned The Hard Way: FPGA  PCB Integration Challenges Dave Brady & Bruce Riggins.

Improving Capacity and Flexibility of Wireless Mesh Networks by Interface Switching Yunxia Feng, Minglu Li and Min-You Wu Presented by: Yunxia Feng Dept.

High-Level Interconnect Architectures for FPGAs Nick Barrow-Williams.

Tinoosh Mohsenin and Bevan M. Baas VLSI Computation Lab, ECE Department University of California, Davis Split-Row: A Reduced Complexity, High Throughput.

Using Cycle Efficiency as a System Designer Metric to Characterize an Embedded DSP and Compare Hard Core vs. Soft Core Advisor Dr. Vishwani D. Agrawal.

Distributed computing using Projective Geometry: Decoding of Error correcting codes Nachiket Gajare, Hrishikesh Sharma and Prof. Sachin Patkar IIT Bombay.

High Performance Embedded Computing © 2007 Elsevier Lecture 18: Hardware/Software Codesign Embedded Computing Systems Mikko Lipasti, adapted from M. Schulte.

LogP Model Motivation BSP Model Limited to BW of Network (g) and Load of PE Requires large load per super steps. Need Better Models for Portable Algorithms.

Lecture 16: Reconfigurable Computing Applications November 3, 2004 ECE 697F Reconfigurable Computing Lecture 16 Reconfigurable Computing Applications.

CS244-Introduction to Embedded Systems and Ubiquitous Computing Instructor: Eli Bozorgzadeh Computer Science Department UC Irvine Winter 2010.

A Configurable High-Throughput Linear Sorter System Jorge Ortiz Information and Telecommunication Technology Center 2335 Irving Hill Road Lawrence, KS.

Task Graph Scheduling for RTR Paper Review By Gregor Scott.

EKT303/4 PRINCIPLES OF PRINCIPLES OF COMPUTER ARCHITECTURE (PoCA)

Final Presentation Final Presentation OFDM implementation and performance test Performed by: Tomer Ben Oz Ariel Shleifer Guided by: Mony Orbach Duration:

Jason Li Jeremy Fowers 1. Speedups and Energy Reductions From Mapping DSP Applications on an Embedded Reconfigurable System Michalis D. Galanis, Gregory.

Parallel Routing for FPGAs based on the operator formulation

Making Good Points : Application-Specific Pareto-Point Generation for Design Space Exploration using Rigorous Statistical Methods David Sheldon, Frank.

FPGA CAD 10-MAR-2003.

Codesigned On-Chip Logic Minimization Roman Lysecky & Frank Vahid* Department of Computer Science and Engineering University of California, Riverside *Also.

A New Class of High Performance FFTs Dr. J. Greg Nash Centar ( High Performance Embedded Computing (HPEC) Workshop.

SCORES: A Scalable and Parametric Streams-Based Communication Architecture for Modular Reconfigurable Systems Abelardo Jara-Berrocal, Ann Gordon-Ross NSF.

Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009.

1 Hardware-Software Co-Synthesis of Low Power Real-Time Distributed Embedded Systems with Dynamically Reconfigurable FPGAs Li Shang and Niraj K.Jha Proceedings.

Scott Sirowy, Chen Huang, and Frank Vahid † Department of Computer Science and Engineering University of California, Riverside {ssirowy,chuang,

On-Chip Logic Minimization Roman Lysecky & Frank Vahid* Department of Computer Science and Engineering University of California, Riverside *Also with the.

Implementing Tile-based Chip Multiprocessors with GALS Clocking Styles Zhiyi Yu, Bevan Baas VLSI Computation Lab, ECE Department University of California,

A Scalable Pipelined Associative SIMD Array With Reconfigurable PE Interconnection Network For Embedded Applications Hong Wang & Robert A. Walker Computer.

Philipp Gysel ECE Department University of California, Davis

CALTECH CS137 Winter DeHon 1 CS137: Electronic Design Automation Day 8: January 27, 2006 Cellular Placement.

A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation Roman Lysecky a, Frank Vahid a*, Sheldon X.-D. Tan b a Department of Computer.

Optimization of Time-Partitions for Mixed-Criticality Real-Time Distributed Embedded Systems Domițian Tămaș-Selicean and Paul Pop Technical University.

Two-Dimensional Phase Unwrapping On FPGAs And GPUs

Anne Pratoomtong ECE734, Spring2002

Hyunchul Park, Kevin Fan, Manjunath Kudlur,Scott Mahlke

Portable SystemC-on-a-Chip

Jianmin Chen, Zhuo Huang, Feiqi Su, Jih-Kwon Peir and Jeff Ho

Presentation transcript:

Frank Vahid, 1 Embedding-Based Placement of Processing element Networks on FPGAs for Physical Model Simulation Bailey Miller*, Frank Vahid*, Tony Givargis** *Univ. of California, Riverside **Univ. of California, Irvine Supported by the NSF (CNS , CPS ), and Semiconductor Research Corporation (GRC ) Intern at SpaceX

Bailey Miller, UCR 2

Frank Vahid, UCR 3 Models of physical world that run in real-time Test cyber-physical systems Physical mockup Transducer models Environment model Digital mockup

Issue: Real-time achieved via inaccuracy Frank Vahid, UCR 4 Weibel lung complexity 4 gen: 32 ODEs 6 gen: 128 ODEs 8 gen: 512 ODEs 10 gen: 2048 ODEs “2-3 minutes to simulate one breath accurately” V[1],R[1] V[2],R[2] V[7],R[7]

Frank Vahid, UCR Weibel Neuron Weibel + gas Hemodynamic Weibel + hemo Performance (ms) PC(1) PC(4) GPU HLS / FPGAs Speedup vs real-time PC(1): 0.8x PC(4): 3.1x GPU: 1.6x HLS/FPGA: 3.2x Earlier results Parallel computations + Neighbor communication  Seem like great match for FPGAs

Frank Vahid, UCR 6 Network of synchronized PEs on FPGAs FPGA Digital mockup General Processing Element Iterative ODE solver (Euler/RK4) 0.1 ms / 0.01 ms timestep PE 1 PE: 300 MHz

Frank Vahid, UCR Weibel Neuron Weibel + gas Hemodynamic weibel + hemo Performance (ms) PC(1) PC(4) GPU HLS General PEs Speedup vs real-time PC(1):0.8x PC(4):3.1x GPU:1.6x HLS:3.2x General PE:4.9x Earlier results

Frank Vahid, UCR 8 Current work Problem: More PEs  Lower frequency FPGA Inter-PE critical path FPGA DSP INST MEM DATA MEM Internal PE critical path 11-gen Weibel model, Virtex6 240T FPGA, general PEs Real ODEs/sec Lost ODEs/sec due to freq drop

Earlier approach Frank Vahid, UCR 9 10K iterations 150K iterations Convert virtual PEs to physical circuits using FPGA place-route 1 2 Phase Maps ODEs to virtual PEs using simulated annealing

Frank Vahid, UCR 10 FPGA This work: Main idea Graph embedding: Map guest graph to host graph, minim. max wire length Guest Host Virtual PEs Physical PEs Use physical model structure to avoid using FPGA placement (Phase 2)

Frank Vahid, UCR 11 FPGA 12 3 … Phase 2 – Map virtual PEs to physical PEs Embedding algorithm H-tree embedding Linear embedding Direct map embedding Guest Host [1] Zienicke, P Embeddings of Treelike Graphs into 2-Dimensional Meshes. (WG '90). [2] Aleliunas, R., and Rosenberg, A.L On Embedding Rectangular Grids in Square Grids. (Computers ‘82). [3] Berman, F., and Snyder, L On mapping parallel algorithms into parallel architectures, (PDC, ‘87).

2D grid of physical PEs Frank Vahid, UCR 12 EqP1 EqV1 EqP2 EqV2 EqP3 EqV3 EqP4 EqV4 EqP7 EqV7 EqP5 EqV5 EqP6 EqV6 FPGA Bypass FPGA placement EqP1 EqV1 EqP4 EqV4 EqP2 EqV2 EqP5 EqV5 EqP2 EqV2 EqP6 EqV6 EqP7 EqV7 (Phase 1: May require "graph folding" first to reduce #PEs)

FPGA Compare/backup: Simulated annealing Frank Vahid, UCR 13 Cost function: C = w1*sum + w2*max + w3*gaps Sum = sum of wire distances Max = max wire length (Euclidean dist.) Gaps = wires across architectural features FPGA P1 P2 Neighbor function: Swap PEs based on distance to neighbors

Results Frank Vahid, UCR 14 No placement strategy Simulated annealing placement Embedding placement 4 generations shown5 generations shown

Results Frank Vahid, UCR 15 Not routable Strategy# LUTS# BRAM#DSPEquivalent LUTs None SA Embed No impact on size 2D Neuron model - 256PE – Xilinx Virtex6 StrategyTotal power (mW) Dynamic power (mW) Static power (mW) None SA Embed % more power Using Xilinx XPower Analyzer

Results/Conclusions Frank Vahid, UCR 16 Speedup vs real-time (avg) PC(1): 0.8x PC(4): 3.1x GPU: 1.6x HLS: 3.2x General PE: 4.9x Grph emb(GPE): 11.2x Custom PE: 6.1x Heterog PE: 34.5x (Grph emb(HPE): 48.5x) Graph emb. gives additional speedup –FPGAs: Fastest cost-effective execution of physical models – Future –Manycore device –Beyond testing CPS Implement end-products Speedup / dollar Heterog PE 3.0x better than PC(4) 4.5x better than GPU CPU (I Intel X58 board): $480 GPU(GTX460 + I H55 board):$380 FPGA (Xilinx Virtex6 240T-2 board):$1800

Questions? Frank Vahid, UCR 17