LANMC: LSTM-Assisted Non-Rigid Motion Correction

Presentation transcript:

LANMC: LSTM-Assisted Non-Rigid Motion Correction on FPGA for Calcium Image Stabilization
Zhe Chen (1), Hugh T. Blair (2), Jason Cong (1)
(1) Computer Science Department, (2) Department of Psychology, UCLA
zhechen@ucla.edu

Research Background
Miniscope calcium imaging [1]: monitoring neuron activity at large scale in vivo.
Challenge: non-uniform motion artifacts; existing correction algorithms are costly and inefficient.
Motivation: real-time non-rigid motion correction for calcium imaging is in demand.
[1] Denise J. Cai, Daniel Aharoni et al., Nature, 2016

Conventional Non-Rigid Motion Correction Method
Processing steps:
1. 2D contrast filter: removes the bulk of the background. Filter size: the cell diameter in the image.
2. Piecewise rigid motion correction: divide the frame into overlapping patches, compute the cross-correlation of each patch with FFT/IFFT, and take the local maximum as the motion vector.
Algorithm inefficiency: the operation must be repeated for every single patch, which makes the algorithm too costly and inefficient for real-time applications.
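The per-patch step above (cross-correlation via FFT/IFFT, then local maximum to motion vector) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a circular shift between a template patch and the current patch, and it uses a naive O(n^2) DFT from the Python standard library as a stand-in for a real FFT. The names `dft`, `dft2`, `motion_vector`, and `circular_shift` are hypothetical.

```python
import cmath

def dft(x, inverse=False):
    """Naive 1D DFT; gives the same result as an FFT, only slower."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out

def dft2(img, inverse=False):
    """2D DFT: transform every row, then every column."""
    rows = [dft(r, inverse) for r in img]
    cols = [dft(list(c), inverse) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def motion_vector(template, patch):
    """Return (dy, dx) such that patch is template shifted by (dy, dx)."""
    F = dft2(template)
    G = dft2(patch)
    # Cross-power spectrum: G times the complex conjugate of F.
    cross = [[g * f.conjugate() for f, g in zip(fr, gr)]
             for fr, gr in zip(F, G)]
    corr = dft2(cross, inverse=True)
    h, w = len(corr), len(corr[0])
    # The location of the correlation maximum is the shift.
    _, py, px = max((corr[y][x].real, y, x)
                    for y in range(h) for x in range(w))
    # Peaks past the midpoint correspond to negative shifts.
    dy = py - h if py > h // 2 else py
    dx = px - w if px > w // 2 else px
    return dy, dx

def circular_shift(img, dy, dx):
    """Test helper: shift an image with wrap-around."""
    h, w = len(img), len(img[0])
    return [[img[(y - dy) % h][(x - dx) % w] for x in range(w)]
            for y in range(h)]

# Toy 8x8 patch with three bright pixels, shifted by (1, -2).
template = [[0.0] * 8 for _ in range(8)]
for y, x in [(2, 3), (5, 1), (6, 6)]:
    template[y][x] = 1.0
patch = circular_shift(template, 1, -2)
print(motion_vector(template, patch))
```

In the conventional pipeline this whole computation runs once per overlapping patch, which is the cost the slide refers to.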

Proposed Method Based on LSTM Inference
Method: use long short-term memory (LSTM) inference to predict the motion at the overlapping patches.
Offline training: NoRMCorre provides the training targets.
Online inference: rigid motion correction + LSTM inference.
95% of the operations are saved by using a 5-node LSTM.
Accuracy evaluation:
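To make "5-node LSTM inference" concrete, here is a minimal sketch of one time step of a standard LSTM cell with five hidden units. The slide does not give the input features, the trained weights, or the output layer, so everything specific here is an assumption: the two-element input (imagined as a rigid shift), the all-zero placeholder weights, and the names `lstm_step`, `W`, `U`, and `b`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One step of a standard LSTM cell.

    W[g][j] holds the input weights, U[g][j] the recurrent weights and
    b[g][j] the bias of hidden unit j for gate g, where g is one of
    'i' (input), 'f' (forget), 'g' (candidate), 'o' (output).
    """
    n = len(h_prev)

    def pre(gate, j):
        # Pre-activation: W x + U h_prev + b for one unit of one gate.
        return (sum(W[gate][j][k] * x[k] for k in range(len(x)))
                + sum(U[gate][j][k] * h_prev[k] for k in range(n))
                + b[gate][j])

    h, c = [], []
    for j in range(n):
        i = sigmoid(pre('i', j))
        f = sigmoid(pre('f', j))
        cand = math.tanh(pre('g', j))
        o = sigmoid(pre('o', j))
        cj = f * c_prev[j] + i * cand      # new cell state
        c.append(cj)
        h.append(o * math.tanh(cj))        # new hidden state
    return h, c

HIDDEN = 5   # "5-node" LSTM, as stated on the slide
INPUTS = 2   # assumption: a (dy, dx) rigid shift as input

# Placeholder all-zero parameters; a real model would load trained ones.
W = {g: [[0.0] * INPUTS for _ in range(HIDDEN)] for g in 'ifgo'}
U = {g: [[0.0] * HIDDEN for _ in range(HIDDEN)] for g in 'ifgo'}
b = {g: [0.0] * HIDDEN for g in 'ifgo'}

h, c = lstm_step([1.0, -2.0], [0.0] * HIDDEN, [1.0] * HIDDEN, W, U, b)
print(h, c)
```

With only five hidden units, a step costs a few hundred multiply-accumulates, which is small next to a per-patch FFT-based cross-correlation.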

Implementation: Folding Architecture
Leverage the central symmetry of the filter kernel with folding.
Inputs:       I0 I1 I2 I3 I4
Coefficients: C0 C1 C2 C1 C0
Saves >80% of LUTs and FFs and >60% of DSPs compared to the design without folding.

Performance Evaluation
Platform      Frequency      Runtime (ms)
Zynq-7045     100 MHz        3.73
Zynq-7045     300 MHz        1.25
CPU w/ 4T     1.2-1.5 GHz    134.6
CPU w/ 8T                    89.7
CPU w/ 16T                   61.9
At 300 MHz, the FPGA achieves >40x speedup over the CPU.
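The folding idea can be illustrated on the 5-tap example from the slide: because the coefficients are symmetric (C0 C1 C2 C1 C0), the inputs that share a coefficient are added first and multiplied once, so five multiplications become three. The sketch below is a software illustration in one dimension only, not the hardware design; the function names are hypothetical.

```python
def fir_direct(window, coeffs):
    """Reference: one multiplication per tap."""
    return sum(i * c for i, c in zip(window, coeffs))

def fir_folded(window, half_coeffs):
    """Folded form for a symmetric odd-length kernel.

    half_coeffs = [C0, C1, ..., Cmid]; the full kernel is
    C0, C1, ..., Cmid, ..., C1, C0.
    """
    n = len(window)
    mid = n // 2
    acc = window[mid] * half_coeffs[mid]          # center tap
    for k in range(mid):
        # Pre-add the two inputs that share coefficient Ck.
        acc += (window[k] + window[n - 1 - k]) * half_coeffs[k]
    return acc

window = [1, 2, 3, 4, 5]          # I0..I4
half = [1, 2, 4]                  # C0, C1, C2
full = half + half[-2::-1]        # C0 C1 C2 C1 C0
print(fir_direct(window, full), fir_folded(window, half))
```

In hardware, each removed multiplication is a multiplier that does not need to be instantiated, which is consistent with the DSP savings reported on the slide.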

Implementation: Reuse FFT/IFFT and LSTM
Unroll and pipeline the FFT/IFFT operation.
Unroll and pipeline the LSTM inference acceleration.
Reuse the FFT/IFFT IP for the horizontal/vertical (H/V) transformation.
Reuse the LSTM for the H/V directions and all patches.
Vivado HLS

Performance Evaluation
Processing latency and energy efficiency compared to a Xeon E5-2620 CPU.
Low-power, high-efficiency Ultra96 board.
Consistent speedup of the acceleration kernels.
Algorithm simplified by LSTM inference.
82x speedup; gain close to 4 orders of magnitude.

Conclusion
The FPGA design realizes real-time non-rigid motion correction for calcium images.
Its low latency and high energy efficiency make it suitable for closed-loop feedback stimulation.

Acknowledgments
Thank you!