Quadratic Programming Solver for Image Deblurring Engine. Rahul Rithe, Michael Price. Massachusetts Institute of Technology.
Image Deblurring. For image deblurring, the solution is constrained to be non-negative: l = 0, u = +∞. (Figure: blur kernel.)
Algorithm. Cauchy Point Computation: the first local minimum along the gradient (Ax – b) projected onto the search space.
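The Cauchy point search can be sketched in a few lines of Python. This is a minimal sketch, not the hardware design: it assumes the cost is q(x) = ½·xᵀAx – bᵀx (so that the gradient is Ax – b), the bounds l = 0 and u = +∞ from the deblurring case, and a textbook piecewise search along the projected steepest-descent path. The helper names `matvec`, `dot`, and `cauchy_point` are hypothetical, and this direct form recomputes Ax on every segment.

```python
import math

def matvec(A, v):
    """Dense matrix/vector product on plain Python lists."""
    return [sum(a * vj for a, vj in zip(row, v)) for row in A]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def cauchy_point(A, b, x):
    """First local minimizer of q(x) = 0.5*x'Ax - b'x along the projected
    steepest-descent path, assuming bounds l = 0 and u = +inf."""
    g = [ax - bi for ax, bi in zip(matvec(A, x), b)]
    # Breakpoint of coordinate i: the step at which x_i - t*g_i reaches 0.
    tbar = [xi / gi if gi > 0 else math.inf for xi, gi in zip(x, g)]
    breakpoints = sorted({t for t in tbar if 0 < t < math.inf}) + [math.inf]
    x = list(x)
    t_prev = 0.0
    for t_next in breakpoints:
        # Coordinates that have already reached their bound stay fixed.
        p = [-gi if ti > t_prev else 0.0 for gi, ti in zip(g, tbar)]
        grad = [ax - bi for ax, bi in zip(matvec(A, x), b)]
        slope = dot(grad, p)
        curvature = dot(p, matvec(A, p))
        if slope >= 0:
            return x  # the cost no longer decreases along the path
        if curvature > 0:
            dt = -slope / curvature
            if dt < t_next - t_prev:
                return [xi + dt * pi for xi, pi in zip(x, p)]
        if math.isinf(t_next):
            return x  # no finite minimizer on an unbounded segment
        step = t_next - t_prev
        x = [max(0.0, xi + step * pi) for xi, pi in zip(x, p)]
        t_prev = t_next
    return x

# Small example: the unconstrained minimum is (-1, 1), outside x >= 0.
A = [[2.0, 0.0], [0.0, 2.0]]
b = [-2.0, 2.0]
cp = cauchy_point(A, b, [1.0, 0.0])
```

In this example the path hits the bound x₁ = 0 at the first breakpoint, bends, and then reaches the minimizer along the remaining free coordinate.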
Optimizations: Dimension Reduction. Ignore the dimensions that have active constraints by holding their solution at zero until the next outer iteration. If all but 100 constraints are active: 100×100 matrix/vector operations instead of 1000×
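To illustrate why the reduced product is safe, the sketch below multiplies only the rows and columns of the free (non-active) dimensions and compares the result with the full product. It assumes the active dimensions are held at exactly zero; `full_matvec`, `reduced_matvec`, and the index list `free` are hypothetical names, not taken from the design.

```python
def full_matvec(A, v):
    """Product over every row and column."""
    return [sum(a * vj for a, vj in zip(row, v)) for row in A]

def reduced_matvec(A, v, free):
    """Product restricted to the rows and columns listed in `free`."""
    return [sum(A[i][j] * v[j] for j in free) for i in free]

A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
x = [1.0, 0.0, 2.0]   # dimension 1 has an active constraint, so x[1] = 0
free = [0, 2]         # dimensions without an active constraint

reduced = reduced_matvec(A, x, free)
full = full_matvec(A, x)
# Because x is zero on the active dimensions, the reduced product matches
# the full product on the free rows while touching fewer entries of A.
matches = reduced == [full[i] for i in free]
```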
Optimizations: Incremental Update. Incrementally update the matrix/vector product in the Cauchy point (CP) computation. Incrementally update the gradient (Ax – b) throughout both the CP and conjugate gradient (CG) steps, based on incremental changes to x. At the end of each CG refinement, recalculate the cost using the updated gradients. This avoids explicit computation of the Ax product every outer iteration.
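The incremental update rests on linearity: if x moves by α·p, the gradient Ax – b moves by α·(Ap), and Ap is typically already needed for the step-length calculation. The sketch below assumes the cost q(x) = ½·xᵀAx – bᵀx, in which case the cost can also be recovered from the gradient as ½·xᵀ(g – b). The function names are hypothetical, and the slides do not state which exact formulas the hardware uses.

```python
def matvec(A, v):
    return [sum(a * vj for a, vj in zip(row, v)) for row in A]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def update_gradient(g, Ap, alpha):
    """Gradient after the step x <- x + alpha*p, reusing the product A*p."""
    return [gi + alpha * api for gi, api in zip(g, Ap)]

def cost_from_gradient(x, g, b):
    """q(x) = 0.5*x'Ax - b'x written with g = Ax - b, so no new product."""
    return 0.5 * dot(x, [gi - bi for gi, bi in zip(g, b)])

A = [[2.0, 1.0], [1.0, 3.0]]
b = [1.0, 1.0]
x = [1.0, 2.0]
g = [ax - bi for ax, bi in zip(matvec(A, x), b)]   # one full product

p, alpha = [-1.0, 0.5], 0.5
Ap = matvec(A, p)                                   # needed for the step anyway
x_new = [xi + alpha * pi for xi, pi in zip(x, p)]
g_new = update_gradient(g, Ap, alpha)               # no product with x_new
q_new = cost_from_gradient(x_new, g_new, b)
```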
Optimizations: Performance Improvement. N outer iterations, with M1 breakpoints checked for CP and M2 CG iterations per outer iteration. Direct implementation: N(3 + M1 + M2) matrix/vector multiplications. Optimized implementation: 1 + N(2 + M2) matrix/vector multiplications. The optimized implementation typically achieves ~50% performance improvement.
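The two multiplication counts are easy to compare numerically. The values of N, M1, and M2 below are made up for illustration and are not measurements from the slides.

```python
def direct_multiplies(n_outer, m1, m2):
    """Matrix/vector multiplications in the direct implementation."""
    return n_outer * (3 + m1 + m2)

def optimized_multiplies(n_outer, m2):
    """Matrix/vector multiplications in the optimized implementation."""
    return 1 + n_outer * (2 + m2)

# Hypothetical workload: 10 outer iterations, 5 breakpoints, 5 CG iterations.
direct = direct_multiplies(10, 5, 5)
optimized = optimized_multiplies(10, 5)
saving = 1 - optimized / direct   # fraction of multiplications avoided
```

The saving grows with M1, since the optimized count does not depend on the number of breakpoints checked.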
Architecture. Control logic determines resource access. A memory controller connects the design to external DDR2 memory. A, b, and x are stored in DRAM. On-chip SRAMs are used for temporary variables. Single-precision floating-point arithmetic. Iterative execution of CP and CG. The non-concurrency of CP and CG is used to share SRAMs.
Matrix Multiplier. Multiplication proceeds in chunks of m: m elements of A are fetched per clock cycle from DRAM. One element of x, b can be accessed per clock cycle from SRAM.
Matrix Multiplier. Active columns: check whether any columns in a group of m columns are active, and skip over the group if it has no active columns. Active rows: check whether any rows in a group of m rows are active, and skip over the group if it has no active rows.
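The chunked multiplication with group skipping can be modeled in software as below. This is a rough behavioral sketch, not the hardware datapath. It assumes that "active" on this slide means a row or column that takes part in the current product, that x is zero on unused columns, and that one fetch delivers m consecutive elements of one row of A. The name `chunked_matvec` and the flag lists `use_row` and `use_col` are hypothetical.

```python
def chunked_matvec(A, x, use_row, use_col, m):
    """Multiply A by x in chunks of m, skipping groups with no used entry.

    Returns the product and the number of m-element chunks of A fetched.
    """
    n = len(A)
    y = [0.0] * n
    fetches = 0
    col_groups = [range(c, min(c + m, n)) for c in range(0, n, m)]
    # The column-group check is done once and reused for every row.
    live_cols = [grp for grp in col_groups if any(use_col[j] for j in grp)]
    for r in range(0, n, m):
        rows = range(r, min(r + m, n))
        if not any(use_row[i] for i in rows):
            continue  # skip the whole group of m rows
        for i in rows:
            for grp in live_cols:
                fetches += 1  # one chunk of m elements of A
                y[i] += sum(A[i][j] * x[j] for j in grp)
    return y, fetches

A = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
x = [1, 1, 0, 0]
y, fetches = chunked_matvec(A, x,
                            use_row=[False, False, True, True],
                            use_col=[True, True, False, False],
                            m=2)
```

With n = 4 and m = 2, a product without skipping would fetch 8 chunks; here only the two used rows and the one used column group are touched.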
Sort. Cauchy point computation requires sorting an array of breakpoints. The sort is implemented using merge sort.
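For reference, a merge sort over a list of breakpoints might look like the following. The slides do not say which merge sort variant the hardware uses. This sketch assumes a bottom-up form, which merges runs of doubling length and needs no recursion, and the function names are hypothetical.

```python
def merge(left, right):
    """Merge two sorted lists into one sorted list."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out

def merge_sort(values):
    """Bottom-up merge sort: repeatedly merge neighbouring sorted runs."""
    runs = [[v] for v in values]
    while len(runs) > 1:
        merged = [merge(runs[k], runs[k + 1])
                  for k in range(0, len(runs) - 1, 2)]
        if len(runs) % 2:
            merged.append(runs[-1])  # odd run out carries over unchanged
        runs = merged
    return runs[0] if runs else []

breakpoints = merge_sort([0.5, 0.125, 0.25, 0.125, 2.0])
```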
Main Modules. The control logic in both the CP and CG modules is a finite-state machine (FSM) that sequences the external operators. Each state corresponds to a discrete step of the algorithm. Each step evaluates as many operations as possible concurrently. (Figure: Conjugate Gradient architecture.)
FPGA Implementation: Virtex-5 LX110T. The QP solver design is integrated with DDR2 memory using a Request/Response interface. It is integrated with Sce-Mi to communicate between a processor and the FPGA. Verified in simulation. Performance after synthesis: 51.3 MHz.
Resource utilization during placement:
Total LUTs        78743/        %
LUTs as Logic     76975/        %
LUTs as Memory    1768/17920    9%
FF                69485/        %
FPGA Implementation: Kintex-7 K325T. The QP solver design is integrated with DDR3 memory using a Request/Response interface. It is integrated with a USB interface to communicate between a processor and the FPGA. Performance after synthesis: 67.2 MHz.
Resource utilization after synthesis:
Dual Port RAMs           33
Simple Dual Port RAMs    610
Block RAMs               114/148    77%
DSP48s                   58/840     6%
Total LUTs                          %
Resource utilization after placement:
Slice LUTs               64,522/203,800    31%
Slice Registers          55,406/407,600    13%
Occupied Slices          23,206/50,950     45%
DSP48E1s                 58/840            6%
RAMB36E1/FIFO36E1s       113/445           25%
Results. Synthetic problem of size 256. Real problem of size 361 from image deblurring.
Results. The FPGA implementation is faster for larger problem sizes.
Conclusions. A QP solver module was designed and implemented on a Kintex-7 FPGA. The implementation was optimized to reduce matrix/vector multiplications. Concurrent execution of processing steps was maximized. The FPGA implementation was verified to be functional for problem sizes ranging from 16 to
Acknowledgements. Priyanka Raina, Richard Uhler, Myron King, Prof. Arvind.