Presentation transcript:

Implementing Image Processing Pipelines in a Hardware/Software Environment
Heather Quinn¹, Dr. Miriam Leeser, Dr. Laurie Smith King
Northeastern University; College of the Holy Cross¹

Motivation
- Accelerate image processing tasks through efficient use of FPGAs.
- Combine already-designed components at runtime to implement series of transformations (pipelines).

Our Environment
- A heterogeneous network of workstations (NOW).
- FPGAs are expensive and available on some hosts but not others.
- The NOW provides coarse-grained parallelism; the FPGAs provide fine-grained parallelism.
- Software processing
  - Designed in Java.
  - Done on a single host.
- Hardware processing
  - Designed with JHDL (from BYU).
  - Input and output images are held in the FPGA's on-board memory:
    - the input image is communicated from host to FPGA at the beginning of processing,
    - the output image is communicated from FPGA to host at the end of processing,
    - the host and FPGA are connected through the PCI bus.

Image Processing Pipelines
- A series of image processing algorithms applied to an image.
- Each algorithm has a software and a hardware implementation.
- Finding the crossover point for a pipeline is complicated:
  - exponential number of implementations,
  - reprogramming costs.
- We need a strategy to find a fast pipeline implementation at runtime.

Hardware Systems
- Using hardware incurs execution costs not present in software systems:
  - hardware initialization,
  - communicating the image, and
  - reprogramming.

Efficient Use of FPGAs
- For small images, the software algorithm's runtime is less than the hardware costs.
  - Profiling the hardware and software runtimes for different image sizes determines the crossover point.
  - Deciding at runtime whether to execute in software or hardware is simple for one algorithm processing one image.

Problem Statement
- Inputs: a profiled library of image processing implementations (components), a pipeline, and an image.
- Output: an assignment of each component to a hardware or software implementation.

The Library of Components
- Each algorithm has two implementations: hardware and software.
- Each implementation has known runtimes for a set of image sizes; interpolation is used for the remaining sizes.
- Each hardware implementation has a known area size.

Assumptions
- All components are image in / image out.
- Reprogramming and communication costs are incurred at software/hardware boundaries.
- The image edge may need to be fixed between components.
- Problem sizes of 20 or fewer stages.
- 500 ms to make a decision.

Solving Pipeline Assignment
- Exhaustive Search
  - Finds: optimal solutions.
  - How: searches the entire problem space (a sketch follows this list).
  - Algorithm runtime: O(2^N), where N is the number of pipeline stages.
- ILP
  - Finds: optimal solutions.
  - How: an AMPL model running on CPLEX.
  - Needs: an ILP formulation of the problem statement.
  - Algorithm runtime: unknown.
- Greedy
  - Finds: sub-optimal solutions.
  - How: makes locally optimal decisions for each pipeline stage based on hardware area usage and speedup values.
  - Algorithm runtime: O(N), where N is the number of pipeline stages.
- Local Search
  - Finds: sub-optimal solutions.
  - How: improves upon initial solutions (found through Greedy or randomly).
  - Algorithm runtime: runs for a user-supplied amount of time.
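To make the exhaustive strategy concrete, here is a minimal Java sketch (Java because the software side of the environment is written in Java). It is not the authors' code: the class and method names, and the single flat per-boundary penalty that lumps reprogramming and image-communication costs together, are assumptions made only for illustration.

```java
// Illustrative sketch only (not the poster's implementation): enumerate all
// 2^N hardware/software assignments of an N-stage pipeline and keep the
// fastest. A flat boundaryMs penalty stands in for the reprogramming and
// image-communication costs charged at every sw/hw boundary.
public class ExhaustiveAssigner {

    /** Profiled runtimes of one pipeline stage at the current image size. */
    public static class Stage {
        final double swMs;  // software runtime on the host (ms)
        final double hwMs;  // hardware runtime on the FPGA (ms)
        public Stage(double swMs, double hwMs) { this.swMs = swMs; this.hwMs = hwMs; }
    }

    /** Returns one flag per stage: true = run in hardware, false = run in software. */
    public static boolean[] bestAssignment(Stage[] stages, double boundaryMs) {
        int n = stages.length;
        boolean[] best = new boolean[n];
        double bestMs = Double.MAX_VALUE;
        for (long mask = 0; mask < (1L << n); mask++) {      // all 2^N combinations
            boolean[] asg = new boolean[n];
            for (int i = 0; i < n; i++) asg[i] = ((mask >> i) & 1) != 0;
            double ms = totalMs(stages, asg, boundaryMs);
            if (ms < bestMs) { bestMs = ms; best = asg; }
        }
        return best;
    }

    /** Stage runtimes plus a penalty wherever execution crosses the hw/sw boundary. */
    static double totalMs(Stage[] stages, boolean[] asg, double boundaryMs) {
        double ms = 0;
        for (int i = 0; i < stages.length; i++) {
            ms += asg[i] ? stages[i].hwMs : stages[i].swMs;
            boolean crossing = (i == 0) ? asg[i] : (asg[i] != asg[i - 1]);
            if (crossing) ms += boundaryMs;                  // host -> FPGA or FPGA -> host
        }
        if (stages.length > 0 && asg[stages.length - 1]) ms += boundaryMs; // final FPGA -> host
        return ms;
    }
}
```

For the assumed upper bound of 20 stages this is about a million candidate assignments, which is consistent with the results below: exhaustive search stays inside the 500 ms budget only up to roughly 11 stages.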
Experiments
- Synthetic components arranged into pipelines of length 1 to 20.
- Exhaustive algorithm run to completion:
  - used as a baseline for solution quality,
  - timed to find the 500 ms boundary.
- ILP solver constrained to 500 ms:
  - its ability to solve depends on the components.
- Local Search returns the best solution found within the time limit.

Results
- Exhaustive: optimal solutions for 11 stages.
- Optimal solutions in 500 ms:
  - Exhaustive: up to 11 stages.
  - ILP: all pipelines up to 13 stages, some pipelines up to 18 stages, and none larger.
- Sub-optimal solutions in 500 ms:
  - Greedy and local search: all problem sizes.
- Strategy:
  - Exhaustive and ILP up to 13 stages.
  - Greedy or local search for more than 13 stages.

Future Work
- ADAPT: an algorithm that calls the exhaustive, ILP, and local search algorithms to solve the pipeline assignment problem based on problem size.
- Decision Time: study how the amount of time allotted affects ADAPT results.
- Virtex-II Pro: add scheduling support for using the embedded PowerPC cores.

Publications
- L. Smith King, H. Quinn, M. Leeser, D. Galatopoullos, and E. S. Manolakos, "Run-time Execution of Reconfigurable Hardware in a Java Environment", International Conference on Computer Design, September.
- H. Quinn, M. Leeser, and L. Smith King, "Accelerating Image Processing in a Software/Hardware Environment", MAPLD International Conference, September.

An Example
- We need pipeline implementations that minimize reprogramming and communication costs.
- Example pipeline: Start App, HW Init, Send Data, Median, Repgm, Edge Det, Get Data, Display.
  - Blue boxes are hw/sw boundaries.
  - Red boxes are fixing image edges.
  - Green boxes are reprogramming.

[Figures: possible implementations of the example pipeline; runtime profiles for the median filter, edge detection, and the combined median filter & edge detection.]
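The profiles above are what the component library stores: measured software and hardware runtimes at a set of image sizes, with interpolation in between. As a rough sketch of how one profiled component could be queried at runtime to pick the faster implementation for a single image, assuming simple linear interpolation between profiled sizes (class and method names are illustrative, not taken from the poster):

```java
// Hypothetical sketch of one entry in the profiled component library.
// Assumes sizes[] is strictly ascending and that hwTimes already includes
// the host<->FPGA communication cost for a single, stand-alone component.
public class ProfiledComponent {
    private final int[] sizes;        // profiled image sizes (e.g., total pixels)
    private final double[] swTimes;   // software runtimes (ms) at those sizes
    private final double[] hwTimes;   // hardware runtimes (ms) at those sizes

    public ProfiledComponent(int[] sizes, double[] swTimes, double[] hwTimes) {
        this.sizes = sizes;
        this.swTimes = swTimes;
        this.hwTimes = hwTimes;
    }

    /** Linear interpolation between the two profiled sizes that bracket 'size'. */
    private double interpolate(double[] times, int size) {
        int last = sizes.length - 1;
        if (size <= sizes[0]) return times[0];
        if (size >= sizes[last]) return times[last];
        int hi = 1;
        while (sizes[hi] < size) hi++;            // first profiled size >= size
        int lo = hi - 1;
        double frac = (double) (size - sizes[lo]) / (sizes[hi] - sizes[lo]);
        return times[lo] + frac * (times[hi] - times[lo]);
    }

    /** True when the FPGA implementation is expected to beat software at this size. */
    public boolean preferHardware(int imageSize) {
        return interpolate(hwTimes, imageSize) < interpolate(swTimes, imageSize);
    }
}
```

For a small image the interpolated software time falls below the hardware time (initialization plus communication), so preferHardware returns false; past the crossover point it returns true. The full pipeline assignment problem is harder precisely because these per-component answers interact through the reprogramming and communication costs at every software/hardware boundary.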