Interleaved Pixel Lookup for Embedded Computer Vision

Interleaved Pixel Lookup for Embedded Computer Vision Kota Yamaguchi, Yoshihiro Watanabe, Takashi Komuro, Masatoshi Ishikawa

Outline: Introduction; Problems in applying interleaving; Techniques; Example: Lucas-Kanade; Conclusion

Purpose: To find a technique for efficiently implementing a parallel memory (interleaving) for pixel lookup operations. [Diagram labels: Camera captures, Images, Image Processing, Computer Vision Tasks, Model objects, Feature space (e.g. Pose, Shape)]

Motivation: Strong influence on downstream performance. Massive memory operations. Always a headache for embedded designers. [Diagram labels: Camera captures, Images, Image Processing, Computer Vision Tasks, Model objects, Feature space (e.g. Pose, Shape)]

Motivation: Interleaving in graphics hardware: Texram [Schilling, 96]; texture memory in recent GPUs. Is it also beneficial to embedded computer vision hardware? Yes, if appropriately implemented.

Pixel lookup operations: Geometry-to-pixel conversion, with the input images acting as a lookup table. [Diagram: geometry stream …, x_{k+2}, x_{k+1}, x_k -> pixel stream …, I(x_{k+2}), I(x_{k+1}), I(x_k)]
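As a minimal sketch of what "images as a lookup table" means in software terms, the snippet below maps a stream of coordinates to a stream of pixel values. The function name `pixel_lookup` and the row-major list-of-lists image are illustrative assumptions, not part of the slides.

```python
def pixel_lookup(image, coords):
    """Map a stream of (x, y) coordinates to a stream of pixel values.

    `image` is a row-major list of rows, so a pixel is image[y][x].
    (Hypothetical helper for illustration only.)
    """
    return [image[y][x] for (x, y) in coords]


# A tiny 4x3 test image whose pixel value encodes its position: 10*y + x.
image = [[10 * y + x for x in range(4)] for y in range(3)]

# Geometry stream in, pixel stream out.
stream = pixel_lookup(image, [(0, 0), (3, 1), (2, 2)])
```

Each output costs one random memory access here, which is the cost the rest of the talk tries to reduce.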

Straightforward implementation: Random access memory. Expensive and slow. [Diagram: geometry stream …, x_{k+2}, x_{k+1}, x_k -> RAM holding the input images -> pixel stream …, I(x_{k+2}), I(x_{k+1}), I(x_k)]

Interleaved implementation: Higher throughput with the same capacity, but it suffers from partitioning and alignment issues. [Diagram: geometry stream -> interleaved memory holding the input images as packed words -> pixel stream]

Partitioning issue: The parallel word does not match the operation, e.g. neighboring 1x4 pixels are packed into a word, but each operation requires 4x1 pixels. [Diagram labels: pixel; read, read, read, align, read]
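The mismatch can be counted directly. The sketch below assumes a word packs a tile that is `word_w` pixels wide and `word_h` pixels high, and reads "1x4 packed, 4x1 required" as a horizontal packing against a vertical access; the slides do not state which axis comes first, so that orientation and the helper name `words_touched` are assumptions.

```python
def words_touched(pixels, word_w, word_h):
    """Return the set of packed words that cover the given pixels.

    A word is identified by its tile index (x // word_w, y // word_h).
    (Hypothetical helper; the packing geometry is an assumption.)
    """
    return {(x // word_w, y // word_h) for (x, y) in pixels}


# One operation needs a vertical run of four pixels.
column = [(0, y) for y in range(4)]

# Packing four horizontal neighbors per word: every pixel is in a different word.
mismatched_reads = len(words_touched(column, word_w=4, word_h=1))

# Packing four vertical neighbors per word: the whole run is one word.
matched_reads = len(words_touched(column, word_w=1, word_h=4))
```

With the mismatched packing the operation costs four reads, and three quarters of every fetched word is discarded.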

Misalignment issue: Unaligned access requires multiple reads and sub-word alignment. [Diagram labels: word boundary; read, align, read]
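A one-dimensional sketch of the same problem, assuming four pixels per word: an access that starts off a word boundary spans two words and needs a shift afterwards. The function name `unaligned_read_plan` is hypothetical.

```python
def unaligned_read_plan(start, width=4):
    """Plan the word reads needed to fetch `width` consecutive pixels.

    Returns (word_indices, offset): the words that must be read and the
    sub-word shift needed to align the result. Assumes `width` pixels per word.
    """
    first = start // width
    last = (start + width - 1) // width
    offset = start % width
    return list(range(first, last + 1)), offset


aligned = unaligned_read_plan(8)     # starts exactly on a word boundary
unaligned = unaligned_read_plan(10)  # starts two pixels into a word
```

The aligned access needs one read and no shift; the unaligned one needs two reads plus a shift by two pixels.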

Techniques: 2D partitioning; Indirect addressing; Data switching

2D partitioning: Treat the entire image as tiled spatial patterns. Packed word = the spatial pattern required by the operation. Avoids the partitioning issue. [Diagram labels: spatial pattern, packed word, memory banks]

Spatial pattern: A certain pattern is present in a lookup sequence, e.g. a 2x2 block for interpolation or a 3x3 block for convolution. [Diagram: lookup sequence over the input images: (i, j), (i+1, j), (i, j+1), (i+1, j+1), …, (i', j'), (i'+1, j'), (i', j'+1), (i'+1, j'+1), …]

2D partitioning and misalignment: Tiled patterns guarantee that the data elements of a word are always distributed across different banks, even if an access overlaps address boundaries. [Diagram: Bank 1 to Bank 4, each holding one of the elements 1 to 4]
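The guarantee can be checked with a small sketch. It assumes an n x n tile with the bank chosen as `(y mod n) * n + (x mod n)`, which matches the four-bank picture for n = 2; the slides do not give the exact mapping, so this formula and the names `bank_of` and `banks_for_window` are assumptions.

```python
def bank_of(x, y, n=2):
    """Bank index for pixel (x, y) under n x n tiling (assumed mapping)."""
    return (y % n) * n + (x % n)


def banks_for_window(x0, y0, n=2):
    """Banks hit by the n x n window whose top-left pixel is (x0, y0)."""
    return [bank_of(x0 + dx, y0 + dy, n)
            for dy in range(n) for dx in range(n)]


# Every window position, aligned or not, hits each bank exactly once.
conflict_free = all(
    sorted(banks_for_window(x0, y0)) == [0, 1, 2, 3]
    for y0 in range(8) for x0 in range(8)
)
```

Because n consecutive coordinates always cover every residue modulo n, no two pixels of a window ever compete for the same bank.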

Indirect addressing: Generating a patterned address for each bank removes the multiple reads otherwise needed for a misaligned access. [Diagram: address generator driving Bank 1 to Bank 4, each holding one of the elements 1 to 4]
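One way such an address generator could work is sketched below: for each bank, find the single pixel of the window that lives in that bank and compute its tile address. The bank mapping, the row-major tile addressing, and the name `bank_addresses` are assumptions for illustration; the slides do not specify them.

```python
def bank_addresses(x0, y0, tiles_per_row, n=2):
    """Per-bank word address for the n x n window at (x0, y0).

    Bank b holds pixels with (x % n, y % n) == (b % n, b // n). Each bank
    stores one pixel per tile, addressed row-major by tile index.
    (Assumed layout, not taken from the slides.)
    """
    addrs = []
    for b in range(n * n):
        bx, by = b % n, b // n
        # Python's % is non-negative here, so this picks the window pixel
        # whose coordinates have the residues (bx, by).
        x = x0 + (bx - x0) % n
        y = y0 + (by - y0) % n
        addrs.append((y // n) * tiles_per_row + (x // n))
    return addrs


aligned = bank_addresses(2, 2, tiles_per_row=4)     # window sits inside one tile
misaligned = bank_addresses(3, 2, tiles_per_row=4)  # window straddles two tiles
```

In the aligned case every bank receives the same address; in the misaligned case the banks receive different addresses, yet all four pixels still arrive in a single read cycle.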

Data switching: A switch removes the throughput decrease caused by sub-word alignment. [Diagram: address generator, Bank 1 to Bank 4, each holding one of the elements 1 to 4]
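The switch can be thought of as a permutation chosen from the low bits of the window origin. The sketch below assumes the same `(y mod n) * n + (x mod n)` bank mapping as a working hypothesis; `switch_outputs` is a hypothetical name.

```python
def switch_outputs(bank_out, x0, y0, n=2):
    """Reorder per-bank outputs into window order (row-major from top-left).

    bank_out[b] is the value read from bank b in this cycle. The lane at
    window offset (dx, dy) is fed by the bank that stores pixel
    (x0 + dx, y0 + dy) under the assumed mapping.
    """
    return [bank_out[((y0 + dy) % n) * n + ((x0 + dx) % n)]
            for dy in range(n) for dx in range(n)]


outputs = ["b0", "b1", "b2", "b3"]
aligned = switch_outputs(outputs, 0, 0)  # identity permutation
shifted = switch_outputs(outputs, 1, 0)  # columns swapped
```

The permutation depends only on `x0 mod n` and `y0 mod n`, so in hardware it could be a small crossbar selected by a few address bits rather than a sequential shifter.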

Techniques overview: [Diagram: geometry stream -> address generator (indirect addressing) -> memory banks holding the input images (2D partitioning) -> data switching -> pixel stream]
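Putting the three techniques together, a software model might look like the sketch below: the image is scattered into n x n banks, and any window is read back with exactly one access per bank. The layout, the function names `build_banks` and `read_window`, and the requirement that the image dimensions be multiples of n are all assumptions for illustration.

```python
N = 2  # tile edge; the model uses N * N banks


def build_banks(image, n=N):
    """2D partitioning: scatter a row-major image into n*n banks."""
    h, w = len(image), len(image[0])
    tiles_per_row = w // n
    banks = [[0] * ((h // n) * tiles_per_row) for _ in range(n * n)]
    for y in range(h):
        for x in range(w):
            bank = (y % n) * n + (x % n)
            addr = (y // n) * tiles_per_row + (x // n)
            banks[bank][addr] = image[y][x]
    return banks, tiles_per_row


def read_window(banks, tiles_per_row, x0, y0, n=N):
    """Read the n x n window at (x0, y0) with one access per bank."""
    window = []
    for dy in range(n):
        for dx in range(n):
            x, y = x0 + dx, y0 + dy
            bank = (y % n) * n + (x % n)                # data switching
            addr = (y // n) * tiles_per_row + (x // n)  # indirect addressing
            window.append(banks[bank][addr])
    return window


image = [[6 * y + x for x in range(6)] for y in range(6)]
banks, tiles_per_row = build_banks(image)
window = read_window(banks, tiles_per_row, 3, 2)
```

The loop visits each bank once per window, so a hardware version could issue all accesses in the same cycle regardless of alignment.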

Example: Lucas-Kanade: Image registration algorithm. Non-linear least squares to solve for the parameters of the affine transformation between input and template [Baker & Matthews, 04]. [Diagram: input image and template image -> Gauss-Newton method -> affine parameters]
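For reference, the standard forward-additive formulation from Baker & Matthews can be written as below. The slides do not state which variant is used, so treating it as forward-additive is an assumption; the symbols I (input image), T (template), W (warp), p (affine parameters), and H (Gauss-Newton Hessian approximation) follow that paper rather than the slides.

```latex
% Objective: align the warped input to the template
\min_{p} \sum_{x} \bigl[ I(W(x;p)) - T(x) \bigr]^{2}

% Gauss-Newton step
\Delta p = H^{-1} \sum_{x}
  \Bigl[ \nabla I \,\frac{\partial W}{\partial p} \Bigr]^{\mathsf T}
  \bigl[ T(x) - I(W(x;p)) \bigr],
\qquad
H = \sum_{x}
  \Bigl[ \nabla I \,\frac{\partial W}{\partial p} \Bigr]^{\mathsf T}
  \Bigl[ \nabla I \,\frac{\partial W}{\partial p} \Bigr]

% Parameter update, repeated until convergence
p \leftarrow p + \Delta p
```

Both I(W(x;p)) and the gradient ∇I are evaluated at the warped coordinates for every pixel x in every iteration, so pixel lookup sits in the innermost loop.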

LK data flow: Bottleneck: the nested for-each-x, for-each-iteration loops, which include the pixel lookup. [Diagram labels: For each iteration, For each]

Pixel lookup in LK: Conversion from affine-warped coordinates to pixels. Looks up the neighboring 4x4 pixels for each output. [Diagram: warped coordinates -> pixel lookup table (input images) -> raw pixels -> interpolation -> warped input pixels and warped gradient pixels]
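A rough sketch of the coordinate side of this step: warp a template coordinate with the affine parameters, then derive the integer origin of the 4x4 neighborhood and the fractional offsets that would select the filter kernels. The six-parameter form follows Baker & Matthews; placing the window so that it starts one pixel before the floor of the coordinate is an assumption, as are the function names.

```python
import math


def affine_warp(p, x, y):
    """Apply the affine warp W(x; p) with p = (p1, ..., p6).

    Uses the parameterization from Baker & Matthews (assumed here):
    x' = (1 + p1) x + p3 y + p5,  y' = p2 x + (1 + p4) y + p6.
    """
    p1, p2, p3, p4, p5, p6 = p
    return ((1 + p1) * x + p3 * y + p5,
            p2 * x + (1 + p4) * y + p6)


def window_origin(wx, wy, size=4):
    """Top-left pixel of the size x size neighborhood and fractional offsets.

    The window placement convention is an assumption for illustration.
    """
    ix, iy = math.floor(wx), math.floor(wy)
    margin = size // 2 - 1
    return ix - margin, iy - margin, wx - ix, wy - iy


# Pure translation by (2.5, 1.25) applied to template pixel (3, 4).
wx, wy = affine_warp((0, 0, 0, 0, 2.5, 1.25), 3, 4)
x0, y0, fx, fy = window_origin(wx, wy)
```

The integer origin `(x0, y0)` is what an address generator would consume, while `(fx, fy)` would drive the interpolation and gradient filter kernels.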

Straightforward implementation: [Diagram: RAM holding the input images -> raw pixels -> multiply-adds with filter kernels]

Interleaved implementation: 4x4 block partitioning. [Diagram: address generator -> interleaved memory (memory banks holding the input images) -> raw pixels -> multiply-adds with filter kernels]

Comparison of memory configurations: Configurations compared: single port; 4x4 multi-port; 4x4 interleaved (SIMD); 4x4 interleaved with alignment support. Rows: Throughput (1, 16, 5-6); Capacity requirement; Peripherals (None; Switch; Address generator and Switch). It is easier to implement peripherals than to increase memory capacity.

FPGA implementation of LK pipeline: Interleaving alone yields 16x larger throughput for the dedicated pipeline. [Diagram labels: Dedicated hardware pipeline; FPU; Affine Warp Calculator; Filter Kernel Generator; Gradient / Interpolation Filter; Jacobian Filter; Hessian Matrix Calculator; FP ALU; Input Pixel Table; SDPU Calculator; Error Calculator; FP Register; Template Pixel Table; For each x; For each iteration]

HDL synthesis: 16x larger throughput, with the same capacity requirement and feasible hardware costs. Estimated performance: 200 fps for registration of 5 pieces of 64x64 8-bit image patches at 100 MHz. Assumption: all registrations converge within 10 iterations. Synthesis results: FPGA: Xilinx Virtex-4 XC4VLX200; Maximum freq.: 264.890 MHz; Slices: 3,108 / 890,833 (3%); DSP slices: 75 / 96 (79%); RAM blocks: 266 / 336 (78%) (4,788 Kb).
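The performance estimate can be turned into a rough cycle budget. The back-of-envelope sketch below assumes every pixel of every patch is visited once per iteration and ignores any per-iteration overhead; those simplifications and the variable names are mine, not the authors'.

```python
# Figures taken from the estimate above.
fps = 200             # registrations per second, per patch set
patches = 5           # image patches registered per frame
patch_w = patch_h = 64
iterations = 10       # assumed convergence bound
clock_hz = 100e6      # 100 MHz pipeline clock

# Output pixels the pipeline must produce per second (simplified model).
outputs_per_second = fps * patches * patch_w * patch_h * iterations

# Clock cycles available per output pixel at that rate.
cycles_per_output = clock_hz / outputs_per_second
```

Under these assumptions the pipeline has roughly 2.4 cycles per output pixel, each of which needs a 4x4 neighborhood, which is why fetching the whole neighborhood in one access matters.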

Summary: Interleaved pixel lookup: sub-word parallel memory operations utilizing spatial patterns in lookup sequences. Techniques: 2D partitioning; indirect addressing; data switching. Example: Lucas-Kanade: 16x larger throughput with the same memory capacity and feasible hardware cost.