Three-Dimensional Template Correlation: Object Recognition in 3D Voxel Data. Tom VanCourt, Yongfeng Gu, Martin Herbordt. CAAD lab, ECE Department, Boston University.

Presentation transcript:

Slide 1: Three-Dimensional Template Correlation: Object Recognition in 3D Voxel Data
Tom VanCourt, Yongfeng Gu, Martin Herbordt
CAAD lab, ECE Department, Boston University

Slide 2 (Boston University, CAMP '05, 3D Template Matching)
- Increasing use of volumetric data sets
  - MRI / CAT, confocal microscopy, molecular structure
- Increased complexity of correlation
  - 2D: $O(n^2)$ positions $(x,y)$ $\times$ $O(n^1)$ rotations $= O(n^3)$
  - 3D: $O(n^3)$ positions $(x,y,z)$ $\times$ $O(n^3)$ rotations $= O(n^6)$
- Transform techniques help a little:
  - $O(n^3) \to O(n^2 \log n)$ in 2D, $O(n^6) \to O(n^4 \log n)$ in 3D
- Solution: application-specific accelerators
  - Programmable off-the-shelf hardware
  - Custom logic design, unique to each application
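To make the growth rates on this slide concrete, the following minimal sketch evaluates them for a given n. The function names are hypothetical, the logarithm is assumed to be base 2, and constant factors are ignored, so the numbers indicate scaling only and are not measured costs.

```python
import math

def direct_ops(n: int, dims: int) -> int:
    """Positions times rotations: n^2 * n^1 in 2D, n^3 * n^3 in 3D."""
    rotation_exponent = 1 if dims == 2 else 3
    return n ** (dims + rotation_exponent)

def transform_ops(n: int, dims: int) -> float:
    """Transform-based estimate from the slide: n^2 log n (2D), n^4 log n (3D).

    Assumption: log base 2; the slide does not state the base.
    """
    exponent = 2 if dims == 2 else 4
    return n ** exponent * math.log2(n)

if __name__ == "__main__":
    for dims in (2, 3):
        print(dims, direct_ops(100, dims), round(transform_ops(100, dims)))
```

For n = 100 in 3D, the direct count is 10^12 while the transform estimate is a few times 10^8, which illustrates why both the dimension and the rotation search matter.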

Slide 3: Volumetric Data Sets
- Complex data types
  - Multiple fluorescence channels
  - Oriented data: flow vectors
  - Nonlinear scoring models
- True 3D data acquisition
  - Medical imaging (MRI, PET, CAT, …)
  - Confocal microscopy
  - Emerging techniques: diffusion tensor tomography

Slide 4: COTS AND Custom? How?
- Field Programmable Gate Arrays
  - 1000s of uncommitted elements
  - Custom processor built on demand
  - On-chip RAM bandwidth: >1 Tbit/sec
  - Massive parallelism: 100s-1000s of PEs
- Accelerator is tailored to each application
  - ~100% payload computation cycles
    - No load/store cycles
    - No loop overhead cycles
    - No address arithmetic cycles
  - ~0% logic dedicated to unused features

Slide 5: Acceleration Strategy
- Standard approach:
  [Diagram labels: Rotated Image, Molecule Grid, Transform Per Channel (FFT), Products of Transforms (x), FFT^-1, Correlation Result]
- Accelerated approach:
  [Diagram labels: Molecule Grid, Rotated Addressing, Direct Correlation by Systolic Array, Correlation Result]
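The standard approach on this slide (transform, multiply the transforms, inverse transform) can be sketched with NumPy's FFT routines. This is an illustration for a single scalar channel and circular (wrap-around) correlation, not the authors' code; the function name `fft_correlate` and the zero-padding of the template to the image size are assumptions.

```python
import numpy as np

def fft_correlate(image: np.ndarray, template: np.ndarray) -> np.ndarray:
    """Circular cross-correlation of a 3D image with a smaller 3D template.

    Computes c[p] = sum_q image[p + q] * template[q], indices wrapping around.
    """
    # Assumption: the template is zero-padded to the image size.
    padded = np.zeros_like(image, dtype=float)
    t0, t1, t2 = template.shape
    padded[:t0, :t1, :t2] = template
    # Correlation theorem for real data: C = A * conj(T).
    spectrum = np.fft.fftn(image) * np.conj(np.fft.fftn(padded))
    return np.real(np.fft.ifftn(spectrum))
```

With multi-channel data the transform would be repeated per channel, and the whole computation repeated for every rotation of the image, which is where the cost of the standard approach accumulates.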

Slide 6: Correlation Pipeline
[Figure: pipeline stages: Rotated Image Access, Voxel Value Rotation, Systolic 3D Correlation, Data Reduction Filtering]
- Customizable functions
- High data reuse
- Direct correlation
  - Beats FFT for modest problems
  - Generalizes correlation sum: $\sum_{ijk} F(A_{xyz}, T_{ijk})$
- Natural for FPGA implementation
  - Regular structure
  - Simple data elements
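The generalized correlation sum on this slide replaces the usual product with an arbitrary scoring function F. A minimal software sketch, assuming scalar voxels and output only where the template fits entirely inside the image, might look like the following; `direct_correlate` and its `score` parameter are hypothetical names, and the loop structure is chosen for clarity rather than to mirror the hardware.

```python
import numpy as np

def direct_correlate(image, template, score=np.multiply):
    """Direct 3D correlation: result[p] = sum over q of score(image[p + q], template[q]).

    `score` plays the role of F on the slide; np.multiply gives ordinary correlation.
    """
    t0, t1, t2 = template.shape
    o0 = image.shape[0] - t0 + 1
    o1 = image.shape[1] - t1 + 1
    o2 = image.shape[2] - t2 + 1
    result = np.zeros((o0, o1, o2))
    # One pass per template element; each pass scores a shifted window of the image.
    for i in range(t0):
        for j in range(t1):
            for k in range(t2):
                window = image[i:i + o0, j:j + o1, k:k + o2]
                result += score(window, template[i, j, k])
    return result
```

A nonlinear score such as `lambda a, t: -np.abs(a - t)` can be passed in place of the product, which has no direct equivalent in the transform-based approach.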

Slide 7: Rotated Memory Access
- Load image once & reuse
- Access image in rotated order
  - via index transformation

$$
\begin{pmatrix} x_i & x_j & x_k \\ y_i & y_j & y_k \\ z_i & z_j & z_k \end{pmatrix}
\begin{pmatrix} i \\ j \\ k \end{pmatrix}
=
\begin{pmatrix} x \\ y \\ z \end{pmatrix}
$$

- Allows axis scaling, mirror reversal
  - Anisotropic: e.g. X,Y resolution ≠ Z
  - No need for resampling
- ~0 delay & buffer overhead
  - Strength reduction eliminates multiplication
  - Arithmetic cost hidden by pipelining
[Figure: rotated (i, j) axes drawn over the (x, y) image axes]
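The strength-reduction point can be illustrated in software: stepping i, j, or k by one adds one column of the index-transformation matrix to the current (x, y, z), so no multiplication is needed inside the loops. The sketch below is an assumption-laden illustration, not the hardware design; `rotated_coords` and its `origin` offset are hypothetical, and floating point stands in for whatever number format the hardware uses.

```python
import numpy as np

def rotated_coords(matrix, origin, shape):
    """Yield ((i, j, k), xyz) with xyz = origin + matrix @ (i, j, k), using additions only."""
    col_i, col_j, col_k = matrix[:, 0], matrix[:, 1], matrix[:, 2]
    n_i, n_j, n_k = shape
    p_i = np.array(origin, dtype=float)
    for i in range(n_i):
        p_j = p_i.copy()
        for j in range(n_j):
            p_k = p_j.copy()
            for k in range(n_k):
                yield (i, j, k), p_k.copy()
                p_k += col_k  # one step along k: add the k column
            p_j += col_j      # one step along j: add the j column
        p_i += col_i          # one step along i: add the i column
```

Because the matrix is arbitrary, the same loop covers axis scaling and mirror reversal as well as rotation.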

Slide 8: Voxel Value Rotation
- Not needed for scalar data (RGB, gray scale, etc.)
  - Step exists architecturally, as identity transform
- For spatially oriented data (e.g. fluid flow in brain tissue)
  - Perform rigid rotation of image …
  - Then rotate oriented voxel values
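For oriented data, each voxel holds a vector that must itself be rotated after the image has been re-addressed. A minimal sketch, assuming the vector components are stored along the last array axis in (x, y, z) order and that `rotation` is a 3x3 matrix, is shown below; `rotate_voxel_values` is a hypothetical name.

```python
import numpy as np

def rotate_voxel_values(field, rotation=None):
    """Rotate per-voxel vectors; with no rotation the step is the identity transform."""
    if rotation is None:
        return field  # non-oriented data passes through unchanged
    # out[..., a] = sum_b rotation[a, b] * field[..., b]
    return np.einsum("ab,...b->...a", rotation, field)
```

Passing `rotation=None` models the case on the slide where the step exists in the pipeline but does nothing.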

Slide 9: Correlation Array
- 3D extension of conventional array
  - Custom unit cell
    - Holds constant value for template
    - Custom F(a, b)
  - … 1D array + line buffer
    - Extend line to result width
  - … 2D array + plane buffer
    - Extend plane to result size
  - … 3D array
    - One input voxel per cycle, padded
    - One output correlation point per cycle
[Figure: unit cell with inputs A and S_in, stored template value T, function F, adder (+), and output S_out; RAM FIFO buffers]
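A one-dimensional software model can help in reading the unit cell: each cell holds one template value T, receives the current input A and a partial sum S_in from its neighbor, and emits S_out = S_in + F(A, T). The sketch below assumes the input value is presented to every cell in the same cycle, which is one plausible arrangement and may differ from the actual array; `systolic_correlate_1d` is a hypothetical name, and the line and plane FIFOs of the 2D and 3D arrays are not modeled.

```python
def systolic_correlate_1d(samples, template, score=lambda a, t: a * t):
    """Cycle-by-cycle model of a 1D correlation array.

    Returns c[x] = sum_k score(samples[x + k], template[k]) for every x where
    the template fits, producing one result per input cycle once the array fills.
    """
    cells = len(template)
    sums = [0] * cells  # sums[k] is the S_out register of cell k
    results = []
    for cycle, value in enumerate(samples):
        # Update from the last cell backwards so each cell reads its
        # neighbor's S_out from the previous cycle.
        for k in range(cells - 1, 0, -1):
            sums[k] = sums[k - 1] + score(value, template[k])
        sums[0] = score(value, template[0])
        if cycle >= cells - 1:
            results.append(sums[cells - 1])
    return results
```

For example, `systolic_correlate_1d([1, 2, 3, 4], [1, 1])` yields the sliding sums of adjacent pairs, one per cycle after the first.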

Slide 10: 3D Correlation Result
- Template is stored in computation array
- FIFOs hold partial correlation sums
[Figure labels: Template data and computation array; 3D correlation result (whole volume shown); FIFO line buffers (pad to result width); FIFO plane buffers (pad to result depth); Correlation complete (result passed to data reduction filter)]

Slide 11: Peak Capture / Data Reduction
- 3D result ≥ image size
  - Full result would slow host
- Template may occur > 1x
  - Find multiple maxima
  - Reporting N highest points is not effective
- Instead: local max by region
  - 8x8x8 region: 512:1 reduction
  - More maxima, less redundancy
  - Record exact (x,y,z) in region
- BUT may miss close maxima
  - Region ≈ template size may be OK
[Figure labels: Broad maximum reported redundantly; Local maxima missed]
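The region-based reduction can be sketched as follows: the result volume is tiled into cubic regions and only the maximum of each region is reported, together with its exact coordinates. This is a software illustration under the assumption that the array is indexed in (z, y, x) order; `regional_maxima` is a hypothetical name.

```python
import numpy as np

def regional_maxima(result, region=8):
    """Return one (score, (x, y, z)) peak per region x region x region block."""
    peaks = []
    nz, ny, nx = result.shape
    for z0 in range(0, nz, region):
        for y0 in range(0, ny, region):
            for x0 in range(0, nx, region):
                block = result[z0:z0 + region, y0:y0 + region, x0:x0 + region]
                # Position of the maximum inside this block.
                dz, dy, dx = np.unravel_index(np.argmax(block), block.shape)
                score = float(block[dz, dy, dx])
                peaks.append((score, (x0 + int(dx), y0 + int(dy), z0 + int(dz))))
    return peaks
```

With `region=8` each block of 512 correlation values is reduced to a single reported peak, which is the 512:1 reduction on the slide. Two true maxima falling inside the same block would yield only one report, matching the caveat about close maxima.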

Slide 12: Why Reconfigurable?
- Massive parallelism, modest cost
  - COTS hardware, tracks technology
- Application-optimized processing
  - Tracks application changes
    - Ex: 1, 2, 3-channel fluorescence
  - Flexible performance tradeoffs
  - Allows non-linear scoring
- Available now
  - PC add-ins
  - SGI Altix
  - Cray XD1
[Figure labels: 24 bit RGB; 8 bit Mono; 4 bit]

Slide 13: Performance Results
[Table: columns are Voxel value, Voxel bits, Logic per PE (slices), Number of PEs, Clock (MHz), and Speed (10^9 SAC/sec); rows cover several tuple voxel types, including 2-tuple, nonlinear, and oriented variants; the numeric entries are not present in the transcript]
- Xilinx Virtex-II Pro VP70
- Measured: score-accumulate per sec (SAC/sec)
- Complex models not limited in number of bits
- Simple models not limited by worst-case speed

Slide 14: Conclusions
- Accelerators enable 3D template matching
  - >100x speedup over 3D FFT (n~100)
  - Complex data types, including vector values
  - Nonlinear comparisons supported
- Programmability avoids common limitations
  - No penalty due to over-generalization
  - No limit due to data/function restrictions
- 3D data and FPGA coprocessors match well
  - Both are emerging and expanding
  - FPGAs three years ago couldn't do it!