Quadratic Programming Solver for Image Deblurring Engine Rahul Rithe, Michael Price Massachusetts Institute of Technology.

Presentation transcript:


Image Deblurring
Blur Kernel
For image deblurring, the solution is constrained to be non-negative, so the bounds are l = 0 and u = +∞.

Algorithm
Cauchy Point (CP) computation: the first local minimum along the gradient projected onto the search space.
Gradient: Ax – b
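The Cauchy point step can be sketched in software as follows. This is a minimal illustration rather than the authors' hardware design: it assumes the cost is 0.5·xᵀAx – bᵀx (consistent with the gradient Ax – b), and the names `cauchy_point`, `matvec`, and `dot` are hypothetical. It is the direct form, recomputing the product with A on every segment of the projected path.

```python
import math


def matvec(A, x):
    """Dense matrix/vector product on plain Python lists."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def cauchy_point(A, b, x, lo, hi):
    """First local minimizer of 0.5*x'Ax - b'x along the projected gradient path.

    Assumption (not from the slides): bounds are given as lists lo <= x <= hi.
    """
    n = len(x)
    g = [ax - bi for ax, bi in zip(matvec(A, x), b)]  # gradient Ax - b
    # Breakpoint t[i]: step length at which coordinate i reaches a bound.
    t = []
    for i in range(n):
        if g[i] > 0:
            t.append((x[i] - lo[i]) / g[i])
        elif g[i] < 0:
            t.append((x[i] - hi[i]) / g[i])
        else:
            t.append(math.inf)
    d = [-g[i] if t[i] > 0 else 0.0 for i in range(n)]  # projected direction
    order = sorted((i for i in range(n) if 0 < t[i] < math.inf), key=lambda i: t[i])
    xc, t_prev = list(x), 0.0
    for i in order + [None]:  # None marks the final, unbounded segment
        seg = math.inf if i is None else t[i] - t_prev
        gc = [ax - bi for ax, bi in zip(matvec(A, xc), b)]
        slope, curv = dot(gc, d), dot(d, matvec(A, d))
        if slope >= 0:
            break  # already at a local minimum along the path
        step = -slope / curv if curv > 0 else math.inf
        if step < seg or i is None:
            if step < math.inf:
                xc = [xi + step * di for xi, di in zip(xc, d)]
            break
        # Minimum is not in this segment: move to the breakpoint, pin coordinate i.
        xc = [xi + seg * di for xi, di in zip(xc, d)]
        xc[i] = lo[i] if g[i] > 0 else hi[i]
        d[i], t_prev = 0.0, t[i]
    return xc
```

For example, with A = [[2, 0], [0, 2]], b = [-2, 4], l = 0, u = +∞ and a start of x = [1, 1], the first coordinate hits its lower bound and the search continues along the second.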

Optimizations: Dimension Reduction
Ignore the dimensions that have active constraints by holding their solution at zero until the next outer iteration. If all but 100 constraints are active: 100×100 matrix/vector operations instead of 1000×
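A small sketch of the dimension reduction idea, under the assumption that a variable counts as "active" when it sits at its lower bound after the CP step. The function name `reduce_problem` and the list-based layout are hypothetical and chosen only for illustration.

```python
def reduce_problem(A, b, x, lo):
    """Keep only the free variables; active ones stay pinned at their bound.

    Assumption: a variable is 'active' when x[i] sits at its lower bound.
    """
    n = len(x)
    free = [i for i in range(n) if x[i] > lo[i]]
    pinned = [j for j in range(n) if x[j] <= lo[j]]
    # Submatrix over the free rows and columns only.
    A_ff = [[A[i][j] for j in free] for i in free]
    # Pinned variables move to the right-hand side; with lo = 0 (the
    # deblurring case) their contribution vanishes.
    b_f = [b[i] - sum(A[i][j] * x[j] for j in pinned) for i in free]
    return free, A_ff, b_f
```

The subsequent refinement then works on `A_ff` and `b_f`, whose size is the number of free variables rather than the full problem size.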

Optimizations: Incremental Update
Incrementally update the matrix/vector product in CP. Incrementally update the gradient throughout both the CP and conjugate gradient (CG) steps, based on incremental changes to x. At the end of each CG refinement, recalculate the cost using the updated gradients. This avoids explicit computation of the Ax product every outer iteration.
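The incremental update relies on the identity A(x + Δx) – b = (Ax – b) + AΔx, so only the columns of A matching changed entries of x need to be read. A minimal sketch, with the hypothetical name `update_gradient` and the assumption that the change is passed as a dictionary from index to increment:

```python
def update_gradient(A, g, delta):
    """Return g + A @ delta, touching only columns where delta is nonzero.

    delta maps a coordinate index to the change in x at that coordinate
    (an illustrative representation, not taken from the slides).
    """
    g_new = list(g)
    for j, dx in delta.items():
        for i in range(len(g_new)):
            g_new[i] += A[i][j] * dx
    return g_new
```

If only k entries of x change, this costs k column passes instead of a full n×n product.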

Optimizations: Performance Improvement
Consider N outer iterations, with M_1 breakpoints checked for CP and M_2 CG iterations per outer iteration.
Direct implementation: N(3 + M_1 + M_2) matrix/vector multiplications.
Optimized implementation: 1 + N(2 + M_2) matrix/vector multiplications.
The optimized implementation typically achieves a ~50% performance improvement.
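To make the two counts concrete, the formulas can be evaluated for sample values. The function names and the values N = 10, M_1 = 5, M_2 = 5 below are assumptions chosen for illustration, not figures from the presentation.

```python
def direct_matvecs(n_outer, m1, m2):
    """Matrix/vector multiplications in the direct implementation."""
    return n_outer * (3 + m1 + m2)


def optimized_matvecs(n_outer, m2):
    """Matrix/vector multiplications in the optimized implementation."""
    return 1 + n_outer * (2 + m2)


# Illustrative values only (assumed, not measured).
direct = direct_matvecs(10, 5, 5)
optimized = optimized_matvecs(10, 5)
saving = 1 - optimized / direct
```

With these assumed values the direct count is 130 and the optimized count is 71, roughly 45% fewer multiplications, which is in line with the ~50% figure quoted on the slide.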

Architecture
Control logic determines resource access. A memory controller connects the design to external DDR2 memory. A, b, and x are stored in DRAM. On-chip SRAMs are used for temporary variables. Arithmetic is single-precision floating point. CP and CG execute iteratively, and because they never run concurrently they share SRAMs.

Matrix Multiplier
Multiplication proceeds in chunks of m: m elements of A are fetched per clock cycle from DRAM. One element of x, b can be accessed per clock cycle from SRAM.

Matrix Multiplier
Active columns: check whether any columns in a group of m columns are active; skip over the group if it has no active columns.
Active rows: check whether any rows in a group of m rows are active; skip over the group if it has no active rows.
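The group-skipping scheme can be modelled in software as below. This is a sketch under two assumptions: "active" marks rows and columns that take part in the product, and the flags are passed as Boolean lists. The name `grouped_matvec` is hypothetical, and the loop order does not claim to match the hardware data path.

```python
def grouped_matvec(A, x, active_rows, active_cols, m):
    """Matrix/vector product that skips groups of m rows or columns with no active member.

    Assumption: 'active' means the row or column participates in the product.
    """
    n = len(x)
    y = [0.0] * n
    for r0 in range(0, n, m):
        rows = range(r0, min(r0 + m, n))
        if not any(active_rows[i] for i in rows):
            continue  # no active row in this group: skip it entirely
        for c0 in range(0, n, m):
            cols = range(c0, min(c0 + m, n))
            if not any(active_cols[j] for j in cols):
                continue  # no active column in this group: skip the fetch
            for i in rows:
                if active_rows[i]:
                    y[i] += sum(A[i][j] * x[j] for j in cols if active_cols[j])
    return y
```

In this model a skipped group corresponds to a chunk of A that never has to be fetched, which is where the saving would come from when most rows or columns are inactive.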

Matrix Multiplier (figure)

Sort
Cauchy Point computation requires sorting an array of breakpoints. The sort is implemented using merge sort.
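For reference, a plain software merge sort over a list of breakpoints looks like this. The recursive top-down form is used here only for brevity and is an assumption; the slides do not say how the hardware sorter is organized.

```python
def merge_sort(values):
    """Return a new list with the values in ascending order (stable)."""
    if len(values) <= 1:
        return list(values)
    mid = len(values) // 2
    left, right = merge_sort(values[:mid]), merge_sort(values[mid:])
    merged, i, j = [], 0, 0
    # Repeatedly take the smaller head element of the two sorted halves.
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```

Merge sort reads and writes its data sequentially, which tends to suit memory-based implementations.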

Main Modules
The control logic in both the CP and CG modules is an FSM that sequences the external operators. Each state corresponds to a discrete step of the algorithm. Each step evaluates as many operations as possible concurrently.
Figure: Conjugate Gradient Architecture
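The idea of one FSM state per algorithm step can be illustrated with a textbook conjugate gradient loop written as an explicit state machine. The state names, the function `cg_fsm`, and the stopping rule are assumptions made for this sketch; they are not the states of the actual design, and the code solves the unconstrained system Ax = b only.

```python
from enum import Enum, auto


class State(Enum):
    INIT = auto()
    MATVEC = auto()
    UPDATE = auto()
    DIRECTION = auto()
    DONE = auto()


def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def cg_fsm(A, b, x, tol=1e-10, max_iters=50):
    """Textbook CG for a symmetric positive definite A, sequenced by an FSM."""
    state, iters = State.INIT, 0
    r = p = Ap = None
    rs = rs_new = 0.0
    while state is not State.DONE:
        if state is State.INIT:
            r = [bi - ax for bi, ax in zip(b, matvec(A, x))]  # residual b - Ax
            p, rs = list(r), dot(r, r)
            state = State.DONE if rs <= tol * tol else State.MATVEC
        elif state is State.MATVEC:
            Ap = matvec(A, p)  # the one matrix/vector product per iteration
            state = State.UPDATE
        elif state is State.UPDATE:
            alpha = rs / dot(p, Ap)
            x = [xi + alpha * pi for xi, pi in zip(x, p)]
            r = [ri - alpha * api for ri, api in zip(r, Ap)]
            rs_new, iters = dot(r, r), iters + 1
            done = rs_new <= tol * tol or iters >= max_iters
            state = State.DONE if done else State.DIRECTION
        elif state is State.DIRECTION:
            beta = rs_new / rs
            p = [ri + beta * pi for ri, pi in zip(r, p)]
            rs, state = rs_new, State.MATVEC
    return x, iters
```

Within a state such as UPDATE, the updates of x and r are independent of each other, which is the kind of concurrency a hardware step could exploit.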

FPGA Implementation: Virtex-5 LX110T
The QP solver design is integrated with DDR2 memory using a Request/Response interface. It is integrated with Sce-Mi to communicate between a processor and the FPGA. Verified in simulation. Performance after synthesis: 51.3 MHz.
Resource utilization during placement:
Resource          Used/Available   Utilization
Total LUTs        78743/           %
LUTs as Logic     76975/           %
LUTs as Memory    1768/17920       9%
FF                69485/           %

FPGA Implementation: Kintex-7 K325T
The QP solver design is integrated with DDR3 memory using a Request/Response interface. It is integrated with a USB interface to communicate between a processor and the FPGA. Performance after synthesis: 67.2 MHz.

FPGA Implementation: Kintex-7 K325T resource utilization
Resource utilization after synthesis:
Resource                 Used/Available    Utilization
Dual Port RAMs           33
Simple Dual Port RAMs    610
Block RAMs               114/148           77%
DSP48s                   58/840            6%
Total LUTs                                 %
Resource utilization after placement:
Resource                 Used/Available    Utilization
Slice LUTs               64,522/203,800    31%
Slice Registers          55,406/407,600    13%
Occupied Slices          23,206/50,950     45%
DSP48E1s                 58/840            6%
RAMB36E1/FIFO36E1s       113/445           25%

Results
Synthetic problem of size 256. Real problem of size 361 from image deblurring.

Results
The FPGA implementation is faster for larger problem sizes.

Conclusions
A QP solver module was designed and implemented on a Kintex-7 FPGA. The implementation was optimized to reduce matrix/vector multiplications. Concurrent execution of processing steps was maximized. The FPGA implementation was verified to be functional for problem sizes ranging from 16 to
Acknowledgements
Priyanka Raina, Richard Uhler, Myron King, Prof. Arvind