Jason Li Jeremy Fowers 1. Speedups and Energy Reductions From Mapping DSP Applications on an Embedded Reconfigurable System Michalis D. Galanis, Gregory.

Slides:



Advertisements
Similar presentations
Improved Census Transforms for Resource-Optimized Stereo Vision
Advertisements

Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors Abhishek Bhattacharjee Margaret Martonosi.
A Scalable and Reconfigurable Search Memory Substrate for High Throughput Packet Processing Sangyeun Cho and Rami Melhem Dept. of Computer Science University.
TIE Extensions for Cryptographic Acceleration Charles-Henri Gros Alan Keefer Ankur Singla.
1 SECURE-PARTIAL RECONFIGURATION OF FPGAs MSc.Fisnik KRAJA Computer Engineering Department, Faculty Of Information Technology, Polytechnic University of.
Bryan Lahartinger. “The Apriori algorithm is a fundamental correlation-based data mining [technique]” “Software implementations of the Aprioiri algorithm.
Multithreaded FPGA Acceleration of DNA Sequence Mapping Edward Fernandez, Walid Najjar, Stefano Lonardi, Jason Villarreal UC Riverside, Department of Computer.
Real-Time Video Analysis on an Embedded Smart Camera for Traffic Surveillance Presenter: Yu-Wei Fan.
Communication Support for Task-Based Runtime Reconfiguration in FPGAs Shannon Koh COMP4211 Advanced Computer Architectures Seminar 1 June 2005.
2009/04/07 Yun-Yang Ma.  Overview  What is CUDA ◦ Architecture ◦ Programming Model ◦ Memory Model  H.264 Motion Estimation on CUDA ◦ Method ◦ Experimental.
Characterization Presentation Neural Network Implementation On FPGA Supervisor: Chen Koren Maria Nemets Maxim Zavodchik
Efficient Moving Object Segmentation Algorithm Using Background Registration Technique Shao-Yi Chien, Shyh-Yih Ma, and Liang-Gee Chen, Fellow, IEEE Hsin-Hua.
Spring 07, Jan 16 ELEC 7770: Advanced VLSI Design (Agrawal) 1 ELEC 7770 Advanced VLSI Design Spring 2007 Introduction Vishwani D. Agrawal James J. Danaher.
Data Partitioning for Reconfigurable Architectures with Distributed Block RAM Wenrui Gong Gang Wang Ryan Kastner Department of Electrical and Computer.
Storage Assignment during High-level Synthesis for Configurable Architectures Wenrui Gong Gang Wang Ryan Kastner Department of Electrical and Computer.
SSS 4/9/99CMU Reconfigurable Computing1 The CMU Reconfigurable Computing Project April 9, 1999 Mihai Budiu
Mahesh Sukumar Subramanian Srinivasan. Introduction Face detection - determines the locations of human faces in digital images. Binary pattern-classification.
Virtual Architecture For Partially Reconfigurable Embedded Systems (VAPRES) Architecture for creating partially reconfigurable embedded systems Module.
Image Processing With FPGAs Zach Fuchs Sarit Patel EEL April 2008.
HW/SW CODESIGN OF THE MPEG-2 VIDEO DECODER Matjaz Verderber, Andrej Zemva, Andrej Trost University of Ljubljana Faculty of Electrical Engineering Trzaska.
HW/SW CODESIGN OF THE MPEG-2 VIDEO DECODER Matjaz Verderber, Andrej Zemva, Andrej Trost University of Ljubljana Faculty of Electrical Engineering Trzaska.
Dynamic Hardware Software Partitioning A First Approach Komal Kasat Nalini Kumar Gaurav Chitroda.
Jason Li Jeremy Fowers Ground Target Following for Unmanned Aerial Vehicles.
GallagherP188/MAPLD20041 Accelerating DSP Algorithms Using FPGAs Sean Gallagher DSP Specialist Xilinx Inc.
FPGA Based Fuzzy Logic Controller for Semi- Active Suspensions Aws Abu-Khudhair.
Study of AES Encryption/Decription Optimizations Nathan Windels.
An approach for solving the Helmholtz Equation on heterogeneous platforms An approach for solving the Helmholtz Equation on heterogeneous platforms G.
Yongjoo Kim*, Jongeun Lee**, Jinyong Lee*, Toan Mai**, Ingoo Heo* and Yunheung Paek* *Seoul National University **UNIST (Ulsan National Institute of Science.
1 Miodrag Bolic ARCHITECTURES FOR EFFICIENT IMPLEMENTATION OF PARTICLE FILTERS Department of Electrical and Computer Engineering Stony Brook University.
H.264 Deblocking Filter Irfan Ullah Department of Information and Communication Engineering Myongji university, Yongin, South Korea Copyright © solarlits.com.
Graphics on Key by Eyal Sarfati and Eran Gilat Supervised by Prof. Shmuel Wimer, Amnon Stanislavsky and Mike Sumszyk 1.
A Reconfigurable Processor Architecture and Software Development Environment for Embedded Systems Andrea Cappelli F. Campi, R.Guerrieri, A.Lodi, M.Toma,
University of Michigan Electrical Engineering and Computer Science 1 Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-Thread Applications.
Architectures for mobile and wireless systems Ese 566 Report 1 Hui Zhang Preethi Karthik.
COMPUTER SCIENCE &ENGINEERING Compiled code acceleration on FPGAs W. Najjar, B.Buyukkurt, Z.Guo, J. Villareal, J. Cortes, A. Mitra Computer Science & Engineering.
Software Pipelining for Stream Programs on Resource Constrained Multi-core Architectures IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEM 2012 Authors:
Paper Review: XiSystem - A Reconfigurable Processor and System
A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT Nikolaos Vassiliadis N. Kavvadias, G. Theodoridis, S. Nikolaidis Section.
A Flexible Multi-Core Platform For Multi-Standard Video Applications Soo-Ik Chae Center for SoC Design Technology Seoul National University MPSoC 2009.
Automated Design of Custom Architecture Tulika Mitra
Fan Zhang, Yang Gao and Jason D. Bakos
Efficient Mapping onto Coarse-Grained Reconfigurable Architectures using Graph Drawing based Algorithm Jonghee Yoon, Aviral Shrivastava *, Minwook Ahn,
Research on Reconfigurable Computing Using Impulse C Carmen Li Shen Mentor: Dr. Russell Duren February 1, 2008.
Advanced Computer Architecture, CSE 520 Generating FPGA-Accelerated DFT Libraries Chi-Li Yu Nov. 13, 2007.
Mahesh Sukumar Subramanian Srinivasan. Introduction Embedded system products keep arriving in the market. There is a continuous growing demand for more.
Dense Image Over-segmentation on a GPU Alex Rodionov 4/24/2009.
VLSI Algorithmic Design Automation Lab. 1 Integration of High-Performance ASICs into Reconfigurable Systems Providing Additional Multimedia Functionality.
Rinoy Pazhekattu. Introduction  Most IPs today are designed using component-based design  Each component is its own IP that can be switched out for.
By Edward A. Lee, J.Reineke, I.Liu, H.D.Patel, S.Kim
Spatiotemporal Saliency Map of a Video Sequence in FPGA hardware David Boland Acknowledgements: Professor Peter Cheung Mr Yang Liu.
Reconfigurable architectures ESE 566. Outline Static and Dynamic Configurable Systems –Static SPYDER, RENCO –Dynamic FIREFLY, BIOWATCH PipeRench: Reconfigurable.
Jason Jong Kyu Park, Yongjun Park, and Scott Mahlke
WARP PROCESSORS ROMAN LYSECKY GREG STITT FRANK VAHID Presented by: Xin Guan Mar. 17, 2010.
An Automated Development Framework for a RISC Processor with Reconfigurable Instruction Set Extensions Nikolaos Vassiliadis, George Theodoridis and Spiridon.
Advanced SW/HW Optimization Techniques for Application Specific MCSoC m Yumiko Kimezawa Supervised by Prof. Ben Abderazek Graduate School of Computer.
1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) CPRE 583 Reconfigurable Computing Lecture.
1 Hardware-Software Co-Synthesis of Low Power Real-Time Distributed Embedded Systems with Dynamically Reconfigurable FPGAs Li Shang and Niraj K.Jha Proceedings.
CML Path Selection based Branching for CGRAs ShriHari RajendranRadhika Thesis Committee : Prof. Aviral Shrivastava (Chair) Prof. Jennifer Blain Christen.
The Effect of Data-Reuse Transformations on Multimedia Applications for Application Specific Processors N. Vassiliadis, A. Chormoviti, N. Kavvadias, S.
Philipp Gysel ECE Department University of California, Davis
Quantifying Acceleration: Power/Performance Trade-Offs of Application Kernels in Hardware WU DI NOV. 3, 2015.
Dynamic and On-Line Design Space Exploration for Reconfigurable Architecture Fakhreddine Ghaffari, Michael Auguin, Mohamed Abid Nice Sophia Antipolis University.
Two-Dimensional Phase Unwrapping On FPGAs And GPUs
Hiba Tariq School of Engineering
Ph.D. in Computer Science
Evaluating Register File Size
CGRA Express: Accelerating Execution using Dynamic Operation Fusion
Improving java performance using Dynamic Method Migration on FPGAs
CSE-591 Compilers for Embedded Systems Code transformations and compile time data management techniques for application mapping onto SIMD-style Coarse-grained.
Hyunchul Park, Kevin Fan, Manjunath Kudlur,Scott Mahlke
Presentation transcript:

Jason Li Jeremy Fowers 1

Speedups and Energy Reductions From Mapping DSP Applications on an Embedded Reconfigurable System Michalis D. Galanis, Gregory Dimitroulakos, Costas Goutis University of Patras Galanis, M.D.; Dimitroulakos, G.; Goutis, C.E.;, "Speedups and Energy Reductions From Mapping DSP Applications on an Embedded Reconfigurable System," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol.15, no.12, pp , Dec

FPGA-Based Embedded Motion Estimation Sensor Zhaoyi Wei, Dah-Jye Lee, Brent Nelson, James Archibald, Barrett Edwards Brigham Young University Zhaoyi Wei, Dah-Jye Lee, Brent E. Nelson, James K. Archibald, and Barrett B. Edwards, “FPGA-Based Embedded Motion Estimation Sensor,” International Journal of Reconfigurable Computing, vol. 2008, Article ID , 8 pages, doi: /2008/

What is Reconfigurable Computing? Architecture that adapts to specific application Processing unit coupled with reconfigurable hardware After execution, reconfigures hardware for next task Implementing circuits without fabricating device 4

Introduction Reconfigurable Hardware combined with µP µP => noncritical control intensive Hardware => kernels Coarse-grained reconfigurable arrays (CGRA) Array of processing elements (PEs) Word-level parallelism 5

Introduction Seven DSP benchmarks Three 32-bit ARM processors Two CGRA architectures Application Speedup Energy Consumption vs µP/VLIW system 6

Architecture Overview 7 µP with instruction memory for storing program code CGRA with memory for complete configuration Shared data RAM used for communication Load data into PEsPE array with bus connecting rows and columns Manages CGRA execution; set by µP

Design Flow Outline Distinguish kernel from non critical SUIF2 and MachineSUIF compiler Loop Unrolling Scheduler exploits ILP 8

Experimental Setup CGRA1 = 4x4 array, CGRA2 = 6x6 150 MHz CGRA1 power consumption is mW CGRA2 power consumption is 258 mW ARM7 ARM9 ARM MHz, 26.6 mW 325 MHz, 195 mW 250 MHz, mW

Experimental Results - Mapping When II = MII, performance is optimal Optimum performance in 19/23 Average CGRA utilization is 13.3 IPC or 83.1% 71.7% usage in CGRA2 10

Experimental Results - Speedup 11

Experimental Results - Speedup Speedups range from 1.81 to 3.99 ARM7 = 2.86, ARM9 = 2.74, ARM10 = 2.57 ARM7 has highest CPI Speedups are all close to ideal CGRA2 is only 6% less 12

Experimental Results - Energy Energy Estimation Time proc = noncritical software Time CGRA = kernel execution time P mem_icon = shared data RAM and interconnection power consumption 13

Experimental Results - Energy ARM coupled with 4x4 CGRA  ARM coupled with 6x6 CGRA  14

Experimental Results – vs. VLIW CGRA1 is used due to greater energy savings Compared to µP coupled with eight-issue VLIW 15

Experimental Results – vs. VLIW 16 Avg speedup 2.71 compared to 2.53 Avg Energy Savings 57.2% compared to 55%

Conclusion Significant speedup and energy reductions Compared to pure software implementation Compared to VLIW system Reconfigurable Computing Application Optical Flow using FPGA 17

Zhaoyi Wei, Dah-Jye Lee, Brent Nelson, James Archibald, Barrett Edwards Brigham Young University Zhaoyi Wei, Dah-Jye Lee, Brent E. Nelson, James K. Archibald, and Barrett B. Edwards, “FPGA-Based Embedded Motion Estimation Sensor,” International Journal of Reconfigurable Computing, vol. 2008, Article ID , 8 pages, doi: /2008/ /

Optical Flow Measure the motion of pixels between consecutive frames Performed on images’ brightness pattern Major implications for 3D vision, UAVs Applications: Navigation, moving object detection, motion estimation, structure from motion, time-to-impact 19

Optical Flow 20

Real-time Optical Flow Notoriously difficult to execute in real time CPU is too slow, parallelism necessary GPU acceleration works, not embedded friendly FPGAs ideal for embedded optical flow Low power + fast parallel processing Attempted in many previous works 21

Embedded Optical Flow Previous embedded work made compromises Algorithm limitations Ideal algorithm: iterative steps Ideal hardware: data parallel or pipelined Resource limitations Smoothing important, expensive 22

Algorithm (Math Alert) The next slide has all of the equations Use it to get an idea of the computational load Refer to the paper to learn details 23

Algorithm 24

Algorithm Cont. 25

End Math Zone 26

Smoothing Masks Filter every stage to improve accuracy c i is 3x3 spatial, m i is 7x7 spatial w i is 5x5x3 temporal (3 frames) Accessing previous frames very expensive First paper to do this in HW Greatly improves accuracy 27

Hardware Structure SRAM Derivative Module Optical Flow Module High Speed Bus (PLB) Note: Reduced system diagram CameraSDRAM 28

Camera Derivative Module SRAMSDRAMTemporal Gradient Spatial Gradient g t (t) g x (t)g y (t) frame(t-1) frame(t-2) frame(t-3) frame(t-4) frame(t) 29

Optical Flow Module 30

Hardware Platform Virtex-4 FX60 FPGA, 100 MHz Clock 32Mb SDRAM, 4 Mb SRAM 2x built in 400 MHz PowerPC cores CMOS Camera, 30 FPS 640x480 31

Testing Camera data for frame rate MATLAB simulation for accuracy 32

Results Achieved 15 FPS Suitable for some UAV apps Accuracy 2x better than previous work Authors: 6.7 o Previous: 12.7 o SW: 1.0 o Importance of temporal smoothing 10.6 o without it 33