K-Nearest Neighbor Digit Recognition ApplicationDomainConstraintsKernels/Algorithms Voice Removal and Pitch ShiftingAudio ProcessingLatency (Real-Time)FFT,

Slides:



Advertisements
Similar presentations
Reconfigurable Computing After a Decade: A New Perspective and Challenges For Hardware-Software Co-Design and Development Tirumale K Ramesh, Ph.D. Boeing.
Advertisements

WATERLOO ELECTRICAL AND COMPUTER ENGINEERING 20s: Computer Hardware 1 WATERLOO ELECTRICAL AND COMPUTER ENGINEERING 20s Computer Hardware Department of.
TIE Extensions for Cryptographic Acceleration Charles-Henri Gros Alan Keefer Ankur Singla.
Reconfigurable Computing: What, Why, and Implications for Design Automation André DeHon and John Wawrzynek June 23, 1999 BRASS Project University of California.
Extensible Processors. 2 ASIP Gain performance by:  Specialized hardware for the whole application (ASIC). −  Almost no flexibility. −High cost.  Use.
Behavioral Design Outline –Design Specification –Behavioral Design –Behavioral Specification –Hardware Description Languages –Behavioral Simulation –Behavioral.
High-Level Synthesis for FPGAs: From Prototyping to Deployment (To appear in IEEE TCAD 2011) Course Presentation By: Murtaza Merchant 1.
Chapter 15 Digital Signal Processing
Center for Embedded Computer Systems University of California, Irvine and San Diego SPARK: A Parallelizing High-Level Synthesis.
Trend towards Embedded Multiprocessors Popular Examples –Network processors (Intel, Motorola, etc.) –Graphics (NVIDIA) –Gaming (IBM, Sony, and Toshiba)
ECE 699: Lecture 2 ZYNQ Design Flow.
A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-Window Applications From J. Fowers, G. Brown, P. Cooke, and G. Stitt, University.
GanesanP91 Synthesis for Partially Reconfigurable Computing Systems Satish Ganesan, Abhijit Ghosh, Ranga Vemuri Digital Design Environments Laboratory.
University of Michigan Electrical Engineering and Computer Science Amir Hormati, Mehrzad Samadi, Mark Woh, Trevor Mudge, and Scott Mahlke Sponge: Portable.
1 A survey on Reconfigurable Computing for Signal Processing Applications Anne Pratoomtong Spring2002.
1 DSP Implementation on FPGA Ahmed Elhossini ENGG*6090 : Reconfigurable Computing Systems Winter 2006.
GPGPU platforms GP - General Purpose computation using GPU
© 2011 Xilinx, Inc. All Rights Reserved Intro to System Generator This material exempt per Department of Commerce license exception TSU.
Delevopment Tools Beyond HDL
Highest Performance Programmable DSP Solution September 17, 2015.
1 © FASTER Consortium Catalin Ciobanu Chalmers University of Technology Facilitating Analysis and Synthesis Technologies for Effective Reconfiguration.
1 3-General Purpose Processors: Altera Nios II 2 Altera Nios II processor A 32-bit soft core processor from Altera Comes in three cores: Fast, Standard,
Architectural Optimizations David Ojika March 27, 2014.
Dr. Alireza Ghorshi Dr. Mohammad Mortazavi Dr. Mohammad Khansari Dr. Alireza Nemany Pour.
Architectures for mobile and wireless systems Ese 566 Report 1 Hui Zhang Preethi Karthik.
COMPUTER SCIENCE &ENGINEERING Compiled code acceleration on FPGAs W. Najjar, B.Buyukkurt, Z.Guo, J. Villareal, J. Cortes, A. Mitra Computer Science & Engineering.
1 of 23 Fouts MAPLD 2005/C117 Synthesis of False Target Radar Images Using a Reconfigurable Computer Dr. Douglas J. Fouts LT Kendrick R. Macklin Daniel.
Automated Design of Custom Architecture Tulika Mitra
NDA Confidential. Copyright ©2005, Nallatech.1 Implementation of Floating- Point VSIPL Functions on FPGA-Based Reconfigurable Computers Using High- Level.
Advanced Computer Architecture, CSE 520 Generating FPGA-Accelerated DFT Libraries Chi-Li Yu Nov. 13, 2007.
ASIP Architecture for Future Wireless Systems: Flexibility and Customization Joseph Cavallaro and Predrag Radosavljevic Rice University Center for Multimedia.
Programming Concepts in GPU Computing Dušan Gajić, University of Niš Programming Concepts in GPU Computing Dušan B. Gajić CIITLab, Dept. of Computer Science.
Advanced SW/HW Optimization Techniques for Application Specific MCSoC m Yumiko Kimezawa Supervised by Prof. Ben Abderazek Graduate School of Computer.
SPREE RTL Generator RTL Simulator RTL CAD Flow 3. Area 4. Frequency 5. Power Correctness1. 2. Cycle count SPREE Benchmarks Verilog Results 3. Architecture.
J. Christiansen, CERN - EP/MIC
Heterogeneous FPGA architecture and CAD Peter Jamieson Supervisor: Jonathan Rose.
IEEE ICECS 2010 SysPy: Using Python for processor-centric SoC design Evangelos Logaras Elias S. Manolakos {evlog, Department of Informatics.
1 Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT.
TEMPLATE DESIGN © Hardware Design, Synthesis, and Verification of a Multicore Communication API Ben Meakin, Ganesh Gopalakrishnan.
ESL and High-level Design: Who Cares? Anmol Mathur CTO and co-founder, Calypto Design Systems.
Array Synthesis in SystemC Hardware Compilation Authors: J. Ditmar and S. McKeever Oxford University Computing Laboratory, UK Conference: Field Programmable.
Embedding Constraint Satisfaction using Parallel Soft-Core Processors on FPGAs Prasad Subramanian, Brandon Eames, Department of Electrical Engineering,
Los Alamos National Lab Streams-C Maya Gokhale, Janette Frigo, Christine Ahrens, Marc Popkin- Paine Los Alamos National Laboratory Janice M. Stone Stone.
R2D2 team R2D2 team Reconfigurable and Retargetable Digital Devices  Application domains Mobile telecommunications  WCDMA/UMTS (Wideband Code Division.
Design Objectives The design should fulfill the functional requirements listed below Functional Requirements Hardware design – able to calculate transforms.
DIPARTIMENTO DI ELETTRONICA E INFORMAZIONE Novel, Emerging Computing System Technologies Smart Technologies for Effective Reconfiguration: The FASTER approach.
Sub-Nyquist Sampling Algorithm Implementation on Flex Rio
6. A PPLICATION MAPPING 6.3 HW/SW partitioning 6.4 Mapping to heterogeneous multi-processors 1 6. Application mapping (part 2)
2D/3D Integration Challenges: Dynamic Reconfiguration and Design for Reuse.
FPL Sept. 2, 2003 Software Decelerators Eric Keller, Gordon Brebner and Phil James-Roxby Xilinx Research Labs.
Hardware Accelerator for Hot-word Recognition Gautam Das Govardan Jonathan Mathews Wasim Shaikh Mojes Koli.
Hy-C A Compiler Retargetable for Single-Chip Heterogeneous Multiprocessors Philip Sweany 8/27/2010.
Computing Systems: Next Call for Proposals Dr. Panagiotis Tsarchopoulos Computing Systems ICT Programme European Commission.
Advanced SW/HW Optimization Techniques for Application Specific MCSoC m Yumiko Kimezawa Supervised by Prof. Ben Abderazek Graduate School of Computer.
ECE 587 Hardware/Software Co- Design Lecture 26/27 CUDA to FPGA Flow Professor Jia Wang Department of Electrical and Computer Engineering Illinois Institute.
ECE 587 Hardware/Software Co- Design Lecture 23 LLVM and xPilot Professor Jia Wang Department of Electrical and Computer Engineering Illinois Institute.
CoDeveloper Overview Updated February 19, Introducing CoDeveloper™  Targeting hardware/software programmable platforms  Target platforms feature.
Heterogeneous Processing KYLE ADAMSKI. Overview What is heterogeneous processing? Why it is necessary Issues with heterogeneity CPU’s vs. GPU’s Heterogeneous.
Dynamic and On-Line Design Space Exploration for Reconfigurable Architecture Fakhreddine Ghaffari, Michael Auguin, Mohamed Abid Nice Sophia Antipolis University.
Sridhar Rajagopal Bryan A. Jones and Joseph R. Cavallaro
Dynamo: A Runtime Codesign Environment
Ph.D. in Computer Science
ENG3050 Embedded Reconfigurable Computing Systems
FPGAs in AWS and First Use Cases, Kees Vissers
Xuechao Wei, Peng Zhang, Cody Hao Yu, and Jim Wu
Anne Pratoomtong ECE734, Spring2002
Optimizing stencil code for FPGA
1CECA, Peking University, China
DSP Architectures for Future Wireless Base-Stations
Cloud-DNN: An Open Framework for Mapping DNN Models to Cloud FPGAs
Presentation transcript:

K-Nearest Neighbor Digit Recognition ApplicationDomainConstraintsKernels/Algorithms Voice Removal and Pitch ShiftingAudio ProcessingLatency (Real-Time)FFT, Inverse FFT, DSP Filters DNA and Protein SequencingBioinformaticsThroughputSmith Waterman Advanced Encryption StandardCryptographyThroughputMatrix Substitutions and Permutations Monte Carlo Option-PricingFinancial AnalysisThroughput / LatencyBlack Scholes, Mersenne Twister, Box Muller Digit RecognitionMachine LearningThroughputPopulation-Count, K-NN, K-Means Convolutional Neural NetworksMachine LearningThroughputConvolution, Soft Max and Max Pool Layers Face DetectionVideo ProcessingThroughputViola Jones Algorithm Lane DetectionVideo ProcessingLatency (Real-Time)Edge Detection Rosetta: A Realistic Benchmark Suite for Software Programmable FPGAs Udit Gupta, Steve Dai, Zhiru Zhang Computer Systems Laboratory, Electrical and Computer Engineering, Cornell University, Ithaca, NY Proposed Initial Benchmarks for Rosetta Realistic benchmarks are important for the development of sophisticated yet scalable high-level synthesis (HLS) tools that allow FPGAs to be programmed in software while achieving high-performance designs. ▪We propose a suite of realistic software applications with enforceable system-level hardware constraints to model hardware accelerators targeting heterogeneous FPGAs. ▪We present a C and OpenCL based design and verification flow that accommodates both the sequential and parallel programming models commonly supported by HLS. Overview Our Research Productivity Performance C-to-gates Manual RTL CPU/GPU Programming ~2-3X Efficiency gap ~5-10X Productivity gap Programming FPGAs? Software Programming Hardware Design Image courtesy of Xilinx, 2015 ▪Modern FPGAs are heterogeneous SoCs composed of multi-core processors, hardened IPs, and reconfigurable fabric. ▪Calls for an unprecedented amount of software programmability as FPGA emerges from a logic to computing device. Modern FPGAs and HLS Tools ▪Insufficient QoR compared to hand-written RTL and significant productivity gap from pure software programming. ▪Needs significant amount of manual code transformations and lacks automatic compiler and synthesis optimizations for performance. ▪Realistic: Representative application domains with large applications and user enforceable design constraints. ▪Reducible: Users can profile applications and sub-kernels for effectively benchmarking QoR. ▪Retargetable: Supports various HLS tools which target multiple substrates and FPGA devices. The Three R’s of Rosetta ▪Span a limited subset of computing domains that FPGAs target and consist of small kernels (less than 1000 lines) which lack the complexity representative of common HLS designs. ▪Provide no support for multiple programming models and require significant user overhead in adding HLS-specific optimization directives. ▪Do not allow convenient parameterization of designs by user. Existing HLS Benchmark Suites for digit in range (0, 10): for training_inst in training_data[digit]: diff = training_inst xor test_inst distance = population_count(diff) if (distance < min): nearest_neighbor = training_inst min = distance return nearest_neighbor ▪Efficient hardware for population count kernel. ▪Parallel execution (by loop unrolling) across digits 0-9. ▪Pipelined execution over training instances. ▪Partitioned training data increases memory bandwidth. Synthesized System ArchitectureSource Data Source Code Baseline Software Design OpenCL Harness OpenCL Kernels C/C++ Harness C/C++ Kernels Rosetta Harness HLS Software Simulation HLS Optimizations Realistic Design Constraints Hardware/Software Co-Simulation Retargetable FPGA Implementation Reconfigurable Fabric Host Processing System Synthesize d Hardware Accelerator Host Application Rosetta Development Flow and Application Architecture Development Flow ▪C-based application-specific kernels can easily be added to the generic harnesses. ▪Synthesis results are used to optimize designs given design constraints. FPGA Implementation ▪Host Application - Responsible for controlling the system and accelerator ▪Kernels - Consist of the compute intensive portions of the applications to be accelerated.