Two-Dimensional Phase Unwrapping On FPGAs And GPUs

Slides:



Advertisements
Similar presentations
Scalable Multi-Cache Simulation Using GPUs Michael Moeng Sangyeun Cho Rami Melhem University of Pittsburgh.
Advertisements

SE263 Video Analytics Course Project Initial Report Presented by M. Aravind Krishnan, SERC, IISc X. Mei and H. Ling, ICCV’09.
Motivation Desktop accelerators (like GPUs) form a powerful heterogeneous platform in conjunction with multi-core CPUs. To improve application performance.
Sparse LU Factorization for Parallel Circuit Simulation on GPU Ling Ren, Xiaoming Chen, Yu Wang, Chenxi Zhang, Huazhong Yang Department of Electronic Engineering,
Implementation methodology for Emerging Reconfigurable Systems With minimum optimization an appreciable speedup of 3x is achievable for this program with.
K-means clustering –An unsupervised and iterative clustering algorithm –Clusters N observations into K clusters –Observations assigned to cluster with.
Zheming CSCE715.  A wireless sensor network (WSN) ◦ Spatially distributed sensors to monitor physical or environmental conditions, and to cooperatively.
L13: Review for Midterm. Administrative Project proposals due Friday at 5PM (hard deadline) No makeup class Friday! March 23, Guest Lecture Austin Robison,
A Parameterized Floating Point Library Applied to Multispectral Image Clustering Xiaojun Wang Dr. Miriam Leeser Rapid Prototyping Laboratory Northeastern.
A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-Window Applications From J. Fowers, G. Brown, P. Cooke, and G. Stitt, University.
GPGPU platforms GP - General Purpose computation using GPU
1 1 © 2011 The MathWorks, Inc. Accelerating Bit Error Rate Simulation in MATLAB using Graphics Processors James Lebak Brian Fanous Nick Moore High-Performance.
FPGA Based Fuzzy Logic Controller for Semi- Active Suspensions Aws Abu-Khudhair.
Presenter MaxAcademy Lecture Series – V1.0, September 2011 Introduction and Motivation.
An Effective Dynamic Scheduling Runtime and Tuning System for Heterogeneous Multi and Many-Core Desktop Platforms Authous: Al’ecio P. D. Binotto, Carlos.
CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA
Two Phase Flow using two levels of preconditioning on the GPU Prof. Kees Vuik and Rohit Gupta Delft Institute of Applied Mathematics.
Implementation of Parallel Processing Techniques on Graphical Processing Units Brad Baker, Wayne Haney, Dr. Charles Choi.
Fast Thermal Analysis on GPU for 3D-ICs with Integrated Microchannel Cooling Zhuo Fen and Peng Li Department of Electrical and Computer Engineering, {Michigan.
Efficient FPGA Implementation of QR
Sherman Braganza, Miriam Leeser, W.C. Warger II, C.M. Warner, C. A. DiMarzio Goal Accelerate the performance of the.
FPGA FPGA2  A heterogeneous network of workstations (NOW)  FPGAs are expensive, available on some hosts but not others  NOW provide coarse- grained.
Research on Reconfigurable Computing Using Impulse C Carmen Li Shen Mentor: Dr. Russell Duren February 1, 2008.
Advanced Computer Architecture, CSE 520 Generating FPGA-Accelerated DFT Libraries Chi-Li Yu Nov. 13, 2007.
HW/SW PARTITIONING OF FLOATING POINT SOFTWARE APPLICATIONS TO FIXED - POINTED COPROCESSOR CIRCUITS - Nalini Kumar Gaurav Chitroda Komal Kasat.
Accelerating a Software Radio Astronomy Correlator By Andrew Woods Supervisor: Prof. Inggs & Dr Langman.
Evaluating FERMI features for Data Mining Applications Masters Thesis Presentation Sinduja Muralidharan Advised by: Dr. Gagan Agrawal.
Fast Support Vector Machine Training and Classification on Graphics Processors Bryan Catanzaro Narayanan Sundaram Kurt Keutzer Parallel Computing Laboratory,
Diploma Project Real Time Motion Estimation on HDTV Video Streams (using the Xilinx FPGA) Supervisor :Averena L.I. Student:Das Samarjit.
DIPARTIMENTO DI ELETTRONICA E INFORMAZIONE Novel, Emerging Computing System Technologies Smart Technologies for Effective Reconfiguration: The FASTER approach.
Floating-Point Divide and Square Root for Efficient FPGA Implementation of Image and Signal Processing Algorithms Xiaojun Wang, Miriam Leeser
Jason Li Jeremy Fowers 1. Speedups and Energy Reductions From Mapping DSP Applications on an Embedded Reconfigurable System Michalis D. Galanis, Gregory.
Spatiotemporal Saliency Map of a Video Sequence in FPGA hardware David Boland Acknowledgements: Professor Peter Cheung Mr Yang Liu.
GPU Based Sound Simulation and Visualization Torbjorn Loken, Torbjorn Loken, Sergiu M. Dascalu, and Frederick C Harris, Jr. Department of Computer Science.
Survey of multicore architectures Marko Bertogna Scuola Superiore S.Anna, ReTiS Lab, Pisa, Italy.
Backprojection and Synthetic Aperture Radar Processing on a HHPC Albert Conti, Ben Cordes, Prof. Miriam Leeser, Prof. Eric Miller
Sherman Braganza, Miriam Leeser Goal Accelerate the performance of the minimum L P Norm phase unwrapping algorithm.
Parallel Computing Presented by Justin Reschke
Fast and parallel implementation of Image Processing Algorithm using CUDA Technology On GPU Hardware Neha Patil Badrinath Roysam Department of Electrical.
1 An FPGA Implementation of the Two-Dimensional Finite-Difference Time-Domain (FDTD) Algorithm Wang Chen Panos Kosmas Miriam Leeser Carey Rappaport Northeastern.
Department of Computer Science, Johns Hopkins University Lecture 7 Finding Concurrency EN /420 Instructor: Randal Burns 26 February 2014.
Accelerating K-Means Clustering with Parallel Implementations and GPU Computing Janki Bhimani Miriam Leeser Ningfang Mi
GPGPU Programming with CUDA Leandro Avila - University of Northern Iowa Mentor: Dr. Paul Gray Computer Science Department University of Northern Iowa.
Sridhar Rajagopal Bryan A. Jones and Joseph R. Cavallaro
2017/12/1 An Architecture Evaluation and Implementation of a Soft GPGPU for FPGAs Kevin Andryc.
These slides are based on the book:
CS203 – Advanced Computer Architecture
Fang Fang James C. Hoe Markus Püschel Smarahara Misra
Presenter: Darshika G. Perera Assistant Professor
Introduction to Parallel Computing: MPI, OpenMP and Hybrid Programming
Employing compression solutions under openacc
Hardware design considerations of implementing neural-networks algorithms Presenter: Nir Hasidim.
Dynamo: A Runtime Codesign Environment
Parallel Plasma Equilibrium Reconstruction Using GPU
I. E. Venetis1, N. Nikoloutsakos1, E. Gallopoulos1, John Ekaterinaris2
Ming Liu, Wolfgang Kuehn, Zhonghai Lu, Axel Jantsch
Nathan Grabaskas: Batched LA and Parallel Communication Optimization
Faster File matching using GPGPU’s Deephan Mohan Professor: Dr
Course Agenda DSP Design Flow.
Matlab as a Development Environment for FPGA Design
All-Pairs Shortest Paths
Chapter 1 Introduction.
Optimizing stencil code for FPGA
Introduction to Scientific Computing II
Performance Lecture notes from MKP, H. H. Lee and S. Yalamanchili.
Multicore and GPU Programming
6- General Purpose GPU Programming
Multicore and GPU Programming
Jianmin Chen, Zhuo Huang, Feiqi Su, Jih-Kwon Peir and Jeff Ho
Presentation transcript:

Two-Dimensional Phase Unwrapping On FPGAs And GPUs Sherman Braganza Prof. Miriam Leeser ReConfigurable Laboratory Northeastern University Boston, MA

Outline Introduction Algorithms Platforms Implementation Results Motivation Optical Quadrature Microscopy Phase unwrapping Algorithms Minimum LP norm phase unwrapping Platforms Reconfigurable Hardware and Graphics Processors Implementation FPGA and GPU specifics Verification details Results Performance Power Cost Conclusions and Future Work

Motivation – Why Bother With Phase Unwrapping? Used in phase based imaging applications IFSAR, OQM microscopy High quality results are computationally expensive Only difficult in 2D or higher Integrating gradients with noisy data Residues and path dependency Wrapped embryo image 0.1 0.3 -0.1 -0.3 0.1 0.3 -0.1 -0.2 No residues Residues

Algorithms – Which One Do We Choose? Many phase unwrapping algorithms Goldstein’s, Flynns, Quality maps, Mask Cuts, multi-grid, PCG, Minimum LP norm (Ghiglia and Pritt, “Two Dimensional Phase Unwrapping”, Wiley, NY, 1998. We need: High quality (performance is secondary) Abilitity to handle noisy data Choose Minimum LP Norm algorithm: Has the highest computational cost a) Software embryo unwrap Using matlab ‘unwrap’ b) Software embryo unwrap Using Minimum LP Norm

Breaking Down Minimum LP Norm Minimizes existence of differences between measured data and calculated data Iterates Preconditioned Conjugate Gradient (PCG) 94% of total computation time Also iterative Two steps to PCG Preconditioner (2D DCT, Poisson calculation and 2D IDCT) Conjugate Gradient

Platforms – Which Accelerator Is Best For Phase Unwrapping? FPGAs Fine grained control Highly parallel Limited program memory Floating point? High implementation cost Xilinx Virtex II Pro architecture http://www.xilinx.com/

Platforms - GPUs GPUs Data parallel architecture Inter processor communication? Data parallel architecture Lower implementation cost Less flexibility Limited # of execution units Floating point Large program memory G80 Architecture [nvidia.com/cuda]

Platform Comparison FPGAs GPUs Absolute control: Can specific custom bit-widths/architectures to optimally suit application Need to fit application to architecture Can have fast processor-processor communication Multiprocessor-multiprocessor communication is slow Low clock frequency Higher frequency High degree of implementation freedom => higher implementation effort. VHDL. Relatively straightforward to develop for. Uses standard C syntax Small program space. High reprogramming time Relatively large program space. Low reprogramming time.

Platform specifications Platform Description FPGA and GPU on different platforms 4 years apart Effects of Moore’s Law Machine 3 in the Results: Cost section has a Virtex 5 and two Core2Quads Platform specifications Software unwrap execution time

Implementation: Preconditioning On An FPGA Need to account for bitwidth Minimum of 28 bit needed – Use 24 bit + block exponent Implement a 2D 1024x512 DCT/IDCT using 1D row/column decomposition Implement a streaming floating point kernel to solve discretized Poisson equation 27 bit software unwrap 28 bit software unwrap

Minimum LP Norm On A GPU NVIDIA provides 2D FFT kernel Use to compute 2D DCT Can use CUDA to implement floating point solver Few accuracy issues No area constraints on GPU Why not implement whole algorithm? Multiple kernels, each computing one CG or LP norm step One host to accelerator transfer per unwrap

Verifying Our Implementations Look at residue counts as algorithm progresses Less than 0.1% difference Visual inspection: Glass bead gives worst case results Software unwrap GPU unwrap FPGA unwrap

Verifying Our Implementations Differences between software and accelerated version GPU vs. Software FPGA vs. Software

Results: FPGA Implemented preconditioner in hardware and measured algorithm speedup Maximum speedup assuming zero preconditioning calculation time : 3.9x We get 2.35x on a V2P70, 3.69x on a V5 (projected)

Results: GPU Implemented entire LP norm kernel on GPU and measured algorithm speedup Speedups for all sections except disk IO 5.24x algorithm speedup. 6.86x without disk IO

Results: FPGAs vs. GPUs Preconditioning only Similar platform generation. Projected FPGA results. Includes FPGA data transfer, not GPU Buses? Currently use PCI-X for FPGA, PCI-E for GPU

Results: Power GPU power consumption increases significantly FPGA power decreases Power consumption (W)

Cost Machine 2 $2200 Machine 3 $10000 Machine 3 includes an AlphaData board with a Xilinx Virtex 5 FPGA platform and two Core2Quads Performance is given by 1/Texec Proportional to FLOPs

Performance To Watt-Dollars Metric to include all parameters

Conclusions And Future Work For phase unwrapping GPUs provide higher performance Higher power consumption FPGAs have low power consumption High reprogramming time OQM: GPUs are the best fit. Cost effective and faster: Images already on processor FPGAs have a much stronger appeal in the embedded domain Future Work Experiment with new GPUs (GTX 280) and platforms (Cell, Larrabee, 4x2 multicore) Multi-FPGA implementation

Thank You! Any Questions? Sherman Braganza (braganza.s@neu.edu) Miriam Leeser (mel@coe.neu.edu) Northeastern University ReConfigurable Laboratory http://www.ece.neu.edu/groups/rcl