1 Restoration of Star-Field Images Using High-Level Languages and Core Libraries Robin Bruce, Caroline Ruet, Dr Malachy Devlin, Prof Stephen Marshall.

Slides:



Advertisements
Similar presentations
Deep Packet Inspection Which Implementation Platform? Sarang Dharmapurikar Cisco.
Advertisements

CS1104: Computer Organisation School of Computing National University of Singapore.
A Scalable and Reconfigurable Search Memory Substrate for High Throughput Packet Processing Sangyeun Cho and Rami Melhem Dept. of Computer Science University.
TIE Extensions for Cryptographic Acceleration Charles-Henri Gros Alan Keefer Ankur Singla.
Device Tradeoffs Greg Stitt ECE Department University of Florida.
TU/e Processor Design 5Z032 1 Processor Design 5Z032 The role of Performance Henk Corporaal Eindhoven University of Technology 2009.
Performance Analysis of Multiprocessor Architectures
Implementation methodology for Emerging Reconfigurable Systems With minimum optimization an appreciable speedup of 3x is achievable for this program with.
Extensible Processors. 2 ASIP Gain performance by:  Specialized hardware for the whole application (ASIC). −  Almost no flexibility. −High cost.  Use.
Reconfigurable Computing S. Reda, Brown University Reconfigurable Computing (EN2911X, Fall07) Lecture 17: Application-Driven Hardware Acceleration (3/4)
Reconfigurable Computing S. Reda, Brown University Reconfigurable Computing (EN2911X, Fall07) Lecture 08: RC Principles: Software (1/4) Prof. Sherief Reda.
MAPLD 2005 A High-Performance Radix-2 FFT in ANSI C for RTL Generation John Ardini.
Copyright © 1998 Wanda Kunkle Computer Organization 1 Chapter 2.1 Introduction.
EET 4250: Chapter 1 Performance Measurement, Instruction Count & CPI Acknowledgements: Some slides and lecture notes for this course adapted from Prof.
Chapter 4 Assessing and Understanding Performance
A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-Window Applications From J. Fowers, G. Brown, P. Cooke, and G. Stitt, University.
1 Chapter 4. 2 Measure, Report, and Summarize Make intelligent choices See through the marketing hype Key to understanding underlying organizational motivation.
Implementation of DSP Algorithm on SoC. Characterization presentation Student : Einat Tevel Supervisor : Isaschar Walter Accompany engineer : Emilia Burlak.
StaticRoute: A novel router for the dynamic partial reconfiguration of FPGAs Brahim Al Farisi, Karel Bruneel, Dirk Stroobandt 2/9/2013.
1 DSP Implementation on FPGA Ahmed Elhossini ENGG*6090 : Reconfigurable Computing Systems Winter 2006.
DSP-FPGA Based Image Processing System Final Presentation Jessica Baxter  Sam Clanton Simon Fung-Kee-Fung Almaaz Karachi  Doug Keen Computer Integrated.
CMSC 611: Advanced Computer Architecture Performance Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.
Presenter MaxAcademy Lecture Series – V1.0, September 2011 Introduction and Motivation.
Department of Electrical Engineering National Cheng Kung University
Department of Computer and Information Science, School of Science, IUPUI Dale Roberts, Lecturer Computer Science, IUPUI CSCI.
1 Miodrag Bolic ARCHITECTURES FOR EFFICIENT IMPLEMENTATION OF PARTICLE FILTERS Department of Electrical and Computer Engineering Stony Brook University.
University of Veszprém Department of Image Processing and Neurocomputing Emulated Digital CNN-UM Implementation of a 3-dimensional Ocean Model on FPGAs.
Gary MarsdenSlide 1University of Cape Town Computer Architecture – Introduction Andrew Hutchinson & Gary Marsden (me) ( ) 2005.
Reconfigurable Devices Presentation for Advanced Digital Electronics (ECNG3011) by Calixte George.
COMPUTER SCIENCE &ENGINEERING Compiled code acceleration on FPGAs W. Najjar, B.Buyukkurt, Z.Guo, J. Villareal, J. Cortes, A. Mitra Computer Science & Engineering.
A flexible FGPA based Data Acquisition Module for a High Resolution PET Camera Abdelkader Bousselham, Attila Hidvégi, Clyde Robson, Peter Ojala and Christian.
1 of 23 Fouts MAPLD 2005/C117 Synthesis of False Target Radar Images Using a Reconfigurable Computer Dr. Douglas J. Fouts LT Kendrick R. Macklin Daniel.
EET 4250: Chapter 1 Computer Abstractions and Technology Acknowledgements: Some slides and lecture notes for this course adapted from Prof. Mary Jane Irwin.
Introduction Computer Organization and Architecture: Lesson 1.
Research on Reconfigurable Computing Using Impulse C Carmen Li Shen Mentor: Dr. Russell Duren February 1, 2008.
NDA Confidential. Copyright ©2005, Nallatech.1 Implementation of Floating- Point VSIPL Functions on FPGA-Based Reconfigurable Computers Using High- Level.
Image Restoration using Iterative Wiener Filter --- ECE533 Project Report Jing Liu, Yan Wu.
(1) Scheduling for Multithreaded Chip Multiprocessors (Multithreaded CMPs)
Results – Peak Streaming Performance Implementing Closed-Form Expressions on FPGAs Using the NAL, with Comparison to CUDA GPU and Cell BE Implementations.
Accelerating a Software Radio Astronomy Correlator By Andrew Woods Supervisor: Prof. Inggs & Dr Langman.
Reminder Lab 0 Xilinx ISE tutorial Research Send me an if interested Looking for those interested in RC with skills in compilers/languages/synthesis,
1 CS/EE 362 Hardware Fundamentals Lecture 9 (Chapter 2: Hennessy and Patterson) Winter Quarter 1998 Chris Myers.
LATA: A Latency and Throughput- Aware Packet Processing System Author: Jilong Kuang and Laxmi Bhuyan Publisher: DAC 2010 Presenter: Chun-Sheng Hsueh Date:
1 Reconfigurable Acceleration of Microphone Array Algorithms for Speech Enhancement Ka Fai Cedric Yiu, Yao Lu, Xiaoxiang Shi The Hong Kong Polytechnic.
1. 2 Table 4.1 Key characteristics of six passenger aircraft: all figures are approximate; some relate to a specific model/configuration of the aircraft.
1 Design and Integration: Part 2. 2 Plus Delta Feedback Reading and lecture repeat Ambiguous questions on quizzes Attendance quizzes Boring white lecture.
Computer Science Theory & Introduction Week 1 Lecture Material – F'13 Revision Doug Hogan Penn State University CMPSC 201 – C++ Programming for Engineers.
Computer Organization & Assembly Language © by DR. M. Amer.
Rinoy Pazhekattu. Introduction  Most IPs today are designed using component-based design  Each component is its own IP that can be switched out for.
Algorithm and Programming Considerations for Embedded Reconfigurable Computers Russell Duren, Associate Professor Engineering And Computer Science Baylor.
Parallelization of likelihood functions for data analysis Alfio Lazzaro CERN openlab Forum on Concurrent Programming Models and Frameworks.
EEL 4713/EEL 5764 Computer Architecture Spring Semester 2004 Instructor: Dr. Shonda Walker Required Textbook: Computer Organization & Design, by Patterson.
Evaluating and Improving an OpenMP-based Circuit Design Tool Tim Beatty, Dr. Ken Kent, Dr. Eric Aubanel Faculty of Computer Science University of New Brunswick.
Jason Li Jeremy Fowers 1. Speedups and Energy Reductions From Mapping DSP Applications on an Embedded Reconfigurable System Michalis D. Galanis, Gregory.
Written by Changhyun, SON Chapter 5. Introduction to Design Optimization - 1 PART II Design Optimization.
Exploiting Parallelism
Dynamic Scheduling Monte-Carlo Framework for Multi-Accelerator Heterogeneous Clusters Authors: Anson H.T. Tse, David B. Thomas, K.H. Tsoi, Wayne Luk Source:
L12 – Performance 1 Comp 411 Computer Performance He said, to speed things up we need to squeeze the clock Study
Computer Organization Yasser F. O. Mohammad 1. 2 Lecture 1: Introduction Today’s topics:  Why computer organization is important  Logistics  Modern.
ECE 3110: Introduction to Digital Systems Introduction (Contd.)
History of Computers and Performance David Monismith Jan. 14, 2015 Based on notes from Dr. Bill Siever and from the Patterson and Hennessy Text.
Chapter 17 Looking “Under the Hood”
Programmable Logic Devices
Presenter: Darshika G. Perera Assistant Professor
Hiba Tariq School of Engineering
A Streaming FFT on 3GSPS ADC Data using Core Libraries and DIME-C
Improving java performance using Dynamic Method Migration on FPGAs
Computer Performance Read Chapter 4
Minimum Entropy Restoration Using FPGAs And High-Level Techniques
Reconfigurable Computing (EN2911X, Fall07)
Presentation transcript:

1 Restoration of Star-Field Images Using High-Level Languages and Core Libraries Robin Bruce, Caroline Ruet, Dr Malachy Devlin, Prof Stephen Marshall

28/03/2007 – Robin Bruce MRSC Overview Introduction MED – SA algorithm ANSI C implementation DIME-C implementation Comparison software/hardware Application Conclusion & further work

28/03/2007 – Robin Bruce MRSC Introduction There is a need for objective research in reconfigurable computing (RC) Don’t just pick battles you know you’ll win Need to evaluate effectiveness of RC as a general purpose solution How does it work on arbitrarily-selected problems? There is a range of measures that we can apply to determine the performance improvement

28/03/2007 – Robin Bruce MRSC Hardware Comparisons Can Compare FPGAs to: Digital Signal Processors (DSPs) Application-Specific Integrated Circuits (ASICs) Microprocessors Other, could include Graphics Processing Units (GPUs) Cell BE Processor Clearspeed CX600

28/03/2007 – Robin Bruce MRSC Hardware Comparisons II Can compare with respect to: Raw performance Power consumption Unit cost Board footprint Non-Recurring Engineering Cost (NRE) Design time and Design cost The key metrics for a particular application may also include ratios of these metrics, e.g. performance/power, or performance/unit cost.

28/03/2007 – Robin Bruce MRSC Application Choice Implementation of the Minimum Entropy Deconvolution algorithm using Simulated Annealing method: representative of a computationally intense image-processing application Chosen Fairly Arbitrarily Only knew that it was a compute-intensive algorithm Did not know how suitable the algorithm was for implementation on RCs before committing to it

28/03/2007 – Robin Bruce MRSC Chosen Comparison Comparing a 90nm-process commodity microprocessor with a platform based around a 90nm-process FPGA 3.2 GHz Intel Pentium D processor with 2 GB of DRAM, with the gcc compiler Nallatech H101-PCIXM card, with the DIME-C compiler Xilinx Virtex-4 LX160 FPGA, 512 MB of DRAM and 4 banks of 200MHz, 4 MB SRAM. Focussing on design time and raw performance improvement.

28/03/2007 – Robin Bruce MRSC MED – SA algorithm Restore blurred images MED algorithm with SA used to converge towards the globally-optimum solution y = x ٭ h + n y: observed image, x: original image, h: Gaussian filter, n: white Gaussian noise Estimate x from y Computation of 2 gradients: Δx= ∂E/∂x, Δd= ∂E/∂d Estimated imageEstimated filterObserved image * =

28/03/2007 – Robin Bruce MRSC MED – SA algorithm Assumptions PSF is a Gaussian function m1, m2 designates the size of the PSF d corresponds to the width of the PSF (blurring level) γ is a constant to normalise the Gaussian function:

28/03/2007 – Robin Bruce MRSC MED – SA algorithm Algorithm – minimising the Energy E Step 0: Set p=0 and initialise x p, T p, d p and α, β, λ Step 1: Compute the energy E p (x p, h(d p ))

28/03/2007 – Robin Bruce MRSC MED – SA algorithm Step 2: Select a candidate solution x’ p+1 = x p - α Δ x p d’ p+1 = d p - β Δ d p

28/03/2007 – Robin Bruce MRSC MED – SA algorithm Step 3: Compute the energy E’ p+1 (x’ p+1, h(d’ p+1 )) ΔE = E’ p+1 - E p Step 4: If: where r is a random number Є [0,1] Then: Else: where f(.) is a decreasing function Step 5: p = p + 1, #_iterations = #_iterations – 1 Step 6: Output xp+1 is the estimation image

28/03/2007 – Robin Bruce MRSC ANSI C Implementation Algorithm organisation The C program is not initially optimised Functional Hierarchy

28/03/2007 – Robin Bruce MRSC Code modification Loop Fusion Pipelining Spatial parallelism Resource optimisation DIME-C Implementation 5/9 Spatial parallelism Pipelining

28/03/2007 – Robin Bruce MRSC Example Optimisation Filter Application number of cycles: ≈ 480,111 (before optimisation: 12,000,133) number of slices: = (before optimisation: 3184) Graphical representations of the filter implementation

28/03/2007 – Robin Bruce MRSC Core Libraries Made use of single-precision mathematical operators that are integrated into DIME-C Project depended on random number generator and exponential function Not in compute intensive region of algorithm Functions acted as an enabler to full algorithm implementation Hear more about Core Libraries from me later today

28/03/2007 – Robin Bruce MRSC Implementation Procedure 1. Created a DIME-C project using the original source from the ANSI C project 2. Adapted source to allow compilation in both DIME-C and ANSI C environments 3. Took advantage of the most obvious pipelining opportunities to create 1st FPGA implementation 4. Examining the source code and the output of DIME-C, created an equation that expressed the runtime of the algorithm in cycles, as a function of the parameters of the algorithm, divided into key sections. 5. Determined for a typical set of algorithm parameters the section that took up the majority of the runtime, and optimised the DIME-C for this section to create the 2nd FPGA implementation 6. Repeated sections 4&5 to produce the 3rd FPGA implementation,

28/03/2007 – Robin Bruce MRSC Time to Solution Developing the initial ANSI C Implementation 125 Person Hours Developing the DIME-C Implementation 35 Person Hours Most time spent developing the software

28/03/2007 – Robin Bruce MRSC Software versus Hardware Software1 st FPGA2 nd FPGA3 rd FPGA4 th FPGA Cycles 7.98× × × ×10 10 Time in Seconds Speedup vs. Software % Contribution of: De_Dx Filter Rest of Algorithm Several generations of the FPGA implementation compared to software DeDx remains the focus of a 5 th version

28/03/2007 – Robin Bruce MRSC Example Results Simulation of real Black and White pictures 200 x 200 image 7 x 7 filter Observed image: blurred and noisy Restored image 300 iterations Original image

28/03/2007 – Robin Bruce MRSC Conclusion Good performance: speedup ≈ 8 Design productivity was high using DIME-C Increased performance and productivity possible using libraries of low-level IP cores 100-Page report available for those who want to know more

28/03/2007 – Robin Bruce MRSC Microprocessor Speedups

28/03/2007 – Robin Bruce MRSC Speedup in Context Moore’s Law Tells us Performance Doubles Every 18 months Does it really? Hennessey & Patterson (2007) tell us that processor performance improved by 52% a year until Since 2002 it’s been running at around 20% a year

28/03/2007 – Robin Bruce MRSC Speedup in Context II If an FPGA gives you an 8x speedup now, how many years would it take for the microprocessor to catch up? Assumption Alert! Benchmark used to evaluate processors gives a good idea of how our application would perform Comparing two best-effort implementations on the same process node, FPGA and uP 8x Speedup would take years to attain at 20% per annum improvements

28/03/2007 – Robin Bruce MRSC The Magic Numbers How much is being 12 years ahead of the competition worth? Reconfigurable Computing must offer (insert magic number) X improvement over conventional computing to see widespread adoption Such a blanket statement is meaningless Depends on the economics of the application

28/03/2007 – Robin Bruce MRSC Acknowledgements Research is sponsored by Nallatech Institute for System Level Integration made everything possible Caroline did all the hard work

28/03/2007 – Robin Bruce MRSC Quote Alan Perlis - When someone says "I want a programming language in which I need only say what I wish done," give him a lollipop.