Optimization of the Earth-Mover’s Distance algorithm for a GPU-based real-time multispectral computer vision system Andrew Shaffer High.

Presentation transcript:

Optimization of the Earth-Mover’s Distance algorithm for a GPU-based real-time multispectral computer vision system
Andrew Shaffer
High Performance Embedded Computing Workshop, Sept. 22, 2011
The Pennsylvania State University Applied Research Laboratory

Outline
- Overview
- Color-Based Detection and Tracking
- Metamerism
- Multispectral Video
- Earth Mover’s Distance (EMD) Algorithm
- EMD Optimizations
- System Design
- Experimental Setup
- Performance Results
- Future Work
- Questions

Overview
- Problem: Develop a robust real-time detection and tracking system to identify camouflaged objects
- Solution: Exploit multispectral video to identify camouflaged objects using a GPU-accelerated algorithm
- Approach: Develop a high-performance GPU implementation of the Earth Mover’s Distance algorithm
- Figure labels: Multispectral Video Data, NVIDIA C2050 GPU, Detect Camouflaged Objects

Color-Based Detection and Tracking
- Compare color distribution histograms from regions of a scene with the color distribution histogram of the target
- High correlation between target and scene histograms suggests the location of the target
- Correlation is easy to compute for RGB cameras because there are only 3 color channels to compare
- Figure: Red highlights indicate potential targets of interest (e.g. faces)

Metamerism
- Objects with different spectral intensity distributions are perceived as having the same RGB values
- Occurs because RGB cameras integrate a wide range of light wavelengths to form each color channel
- Can prevent overlapping objects from being correctly separated during image analysis
- Figure: CRT monitor and LCD monitor displaying the same RGB value
- Figure: Spectroscopic response of 3 metameric white lights: fluorescent (top), LED (middle), and metal-halide (bottom)

Multispectral Video
- Multispectral video cameras use more than three color channels to describe scene colors more precisely
- Multispectral video cameras have higher data rates than comparable RGB cameras
- Determining correlation of histograms with more than 3 color channels is more difficult
- Figure: FluxData multispectral video camera

Earth Mover’s Distance
- EMD allows calculation of an intuitive correlation measure between arbitrary histograms
- Conceptual description:
  - Each histogram bin corresponds to a pile of “earth”
  - Moving 1 unit of earth from histogram bin a to bin b costs f(a,b)
  - EMD finds the minimum cost to convert a candidate histogram into a specified target histogram
  - Lower cost corresponds to higher correlation between the candidate and target histograms
- Figure caption: How similar?

Earth Mover’s Distance (cont)
- EMD is an instance of the Transportation Problem
- Minimum cost flow routing can be solved using linear programming
- Candidate histogram bins are sources, target histogram bins are sinks, and the cost of moving flow along each edge corresponds to the distance between histogram bins
- …but it is still slow: ~55 hrs to solve 10-bin EMDs for a single 1920x1080 pixel frame using Matlab on Intel Core GHz
- Figure: EMD flow network with Red, Green, and Blue bins as sources and sinks
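To make the flow-network formulation concrete, here is a minimal pure-Python sketch. It is not the presenter's solver: the slides solve the problem with linear programming (Simplex), whereas this sketch swaps in a successive-shortest-path minimum-cost-flow search because it fits in a few lines. The function name `emd_transport`, the integer bin counts, and the |a - b| ground distance in the usage lines are all assumptions made for illustration.

```python
def emd_transport(cand, tgt, cost):
    """Minimum total cost to reshape histogram `cand` into `tgt`.

    Assumes integer bin counts with equal totals; cost[a][b] plays the
    role of f(a, b) from the slides. Hypothetical helper, not the
    presenter's implementation.
    """
    n = len(cand)
    total_mass = sum(cand)
    assert total_mass == sum(tgt), "histograms must hold the same mass"
    # Node layout: 0 = super-source, 1..n = candidate bins (sources),
    # n+1..2n = target bins (sinks), 2n+1 = super-sink.
    src, dst, size = 0, 2 * n + 1, 2 * n + 2
    graph = [[] for _ in range(size)]  # edge = [to, capacity, cost, rev_index]

    def add_edge(u, v, cap, c):
        graph[u].append([v, cap, c, len(graph[v])])
        graph[v].append([u, 0, -c, len(graph[u]) - 1])

    for a in range(n):
        add_edge(src, 1 + a, cand[a], 0)
        add_edge(n + 1 + a, dst, tgt[a], 0)
        for b in range(n):
            add_edge(1 + a, n + 1 + b, total_mass, cost[a][b])

    total_cost = 0
    inf = float("inf")
    while True:
        # Bellman-Ford: cheapest residual path from super-source to super-sink.
        dist = [inf] * size
        prev = [None] * size
        dist[src] = 0
        for _ in range(size - 1):
            changed = False
            for u in range(size):
                if dist[u] == inf:
                    continue
                for k, (v, cap, c, _rev) in enumerate(graph[u]):
                    if cap > 0 and dist[u] + c < dist[v]:
                        dist[v] = dist[u] + c
                        prev[v] = (u, k)
                        changed = True
            if not changed:
                break
        if dist[dst] == inf:
            return total_cost  # no residual path left: all earth is placed
        # Find the bottleneck along the path, then push that much flow.
        push, v = total_mass, dst
        while v != src:
            u, k = prev[v]
            push = min(push, graph[u][k][1])
            v = u
        v = dst
        while v != src:
            u, k = prev[v]
            graph[u][k][1] -= push
            graph[v][graph[u][k][3]][1] += push
            v = u
        total_cost += push * dist[dst]


# Usage with an assumed ground distance f(a, b) = |a - b| over 3 bins.
ground = [[abs(a - b) for b in range(3)] for a in range(3)]
print(emd_transport([4, 0, 0], [0, 0, 4], ground))  # 8: four units moved two bins
```

A brute-force search like this is far too slow for per-pixel use, which is the point the slide makes about the unoptimized Matlab baseline.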

Algorithmic Optimizations
- Create only one source or sink for each histogram bin pair
  - CandBin[n] > TgtBin[n]: Bin n is a source generating flow CandBin[n] - TgtBin[n]
  - CandBin[n] < TgtBin[n]: Bin n is a sink accepting flow TgtBin[n] - CandBin[n]
- Eliminate histogram bins where the candidate and target histogram bins have exactly the same value
  - CandBin[n] == TgtBin[n]: Bin n does not appear in flow network
- Figure: EMD flow network with Red, Green, and Blue bins as sources and sinks
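The reduction described above can be sketched in a few lines of Python. The function name `reduce_flow_network` and the dictionary representation are assumptions for illustration; the slides do not show the actual data layout. As far as I can tell, the reduction leaves the minimum cost unchanged when f(n,n) = 0 and f obeys the triangle inequality, but the slides do not state the cost function, so treat that as an assumption too.

```python
def reduce_flow_network(cand_bins, tgt_bins):
    """Split bins into sources and sinks using the difference histogram.

    Returns two dicts mapping bin index -> flow amount. Bins whose
    candidate and target values match are dropped entirely.
    Hypothetical helper for illustration only.
    """
    sources = {}  # bins with surplus earth to ship out
    sinks = {}    # bins with a deficit to fill
    for n, (cand, tgt) in enumerate(zip(cand_bins, tgt_bins)):
        if cand > tgt:
            sources[n] = cand - tgt
        elif cand < tgt:
            sinks[n] = tgt - cand
        # cand == tgt: bin n does not appear in the flow network
    return sources, sinks


sources, sinks = reduce_flow_network([5, 2, 3], [3, 2, 5])
print(sources, sinks)  # {0: 2} {2: 2}
```

In this example a 3-source, 3-sink network shrinks to a single source and a single sink, so the solver sees a much smaller problem.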

Algorithmic Optimizations (cont)
- Bypass Simplex search to find initial valid flow assignment
  1. Add a column of zeros for “unassigned” flow to the tableau, with higher cost than any other path in the cost matrix
  2. Add a redundant constraint row enforcing that total assigned flow plus “unassigned” flow equals total flow across the network
  3. Perform initial pivot with the new column and new row as pivot point
  4. Perform row operations to eliminate redundancy in the new row
  5. The newly added row and column will not be involved in any subsequent pivots, so we can apply steps 1-4 to the initial tableau without materializing the extra row or column
- Figure (flowchart): Start; Iterative search to find any valid flow assignment (replaced by: Force valid initial flow assignment using steps 1-4); Iteratively optimize flow assignment; Solution

Storage & Computation Optimizations
- All entries in the main body of the tableau always remain in {-1, 0, 1}
  - Eliminates multiplication during pivoting
  - Main tableau content entries can be stored in bitmap format using only 2 bits per entry
  - All tableau contents can be stored directly in registers with one row stored by each thread
- Bitmap allows math to be replaced by bitwise logical operations
- Bitmap storage format (table): columns Value, Pos Bit, Neg Bit
- Example: Val += Addend
  1. Val Pos |= Addend Pos
  2. Val Neg |= Addend Neg
  3. Same = Val Pos ^ Val Neg
  4. Val Pos &= Same
  5. Val Neg &= Same
- When using 32-bit variables, the 5 instructions in the example perform the += operation for up to 32 values at a time (for -=, swap Addend Pos & Addend Neg)
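The five-step update can be checked with a short Python sketch. The rows of the storage-format table did not survive in this transcript, so the encoding used here (1 as Pos=1/Neg=0, -1 as Pos=0/Neg=1, 0 as both bits clear) is an assumption, chosen because it is consistent with the five steps. The helper names `pack`, `unpack`, `packed_add`, and `packed_sub` are hypothetical.

```python
def pack(values):
    """Pack a list of values from {-1, 0, 1} into (pos_bits, neg_bits)."""
    pos = neg = 0
    for i, v in enumerate(values):
        if v == 1:
            pos |= 1 << i
        elif v == -1:
            neg |= 1 << i
        elif v != 0:
            raise ValueError("entries must be -1, 0, or 1")
    return pos, neg


def unpack(pos, neg, count):
    """Recover the first `count` values from the two bit planes."""
    return [((pos >> i) & 1) - ((neg >> i) & 1) for i in range(count)]


def packed_add(val, addend):
    """Val += Addend for every packed entry at once (the slide's 5 steps).

    Only valid when every true sum stays in {-1, 0, 1}, which the slides
    say always holds for the main body of the tableau.
    """
    val_pos, val_neg = val
    add_pos, add_neg = addend
    val_pos |= add_pos          # step 1
    val_neg |= add_neg          # step 2
    same = val_pos ^ val_neg    # step 3: 0 where +1 and -1 cancelled
    val_pos &= same             # step 4
    val_neg &= same             # step 5
    return val_pos, val_neg


def packed_sub(val, addend):
    """Val -= Addend: swap the addend's Pos and Neg planes, then add."""
    add_pos, add_neg = addend
    return packed_add(val, (add_neg, add_pos))


a = pack([1, -1, 0, 0, 1, -1])
b = pack([-1, 1, 1, -1, 0, 0])
print(unpack(*packed_add(a, b), 6))  # [0, 0, 1, -1, 1, -1]
```

Python integers are unbounded, so the sketch is not limited to 32 entries; on the GPU each plane would be a 32-bit register holding up to 32 entries.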

Workload Distribution Optimizations
- For problems with 11 or fewer histogram bins, 16 or fewer threads are needed to store and solve each problem
- For these problems, each physical warp can be divided into 2 or more “virtual warps” (VW) that operate in parallel
- The reduced size of these virtual warps allows fewer stages for binary reduction operations (e.g. find min across all threads)
- Because all threads within a physical warp are implicitly synchronized with each other, there is no need for explicit synchronization calls at any point within the EMD solver routine
- Figure: Each physical warp contains 32 threads; thread color indicates virtual warp assignment and red indicates unused threads. Configurations shown: 11 bins/2 VW, 10 bins/3 VW, 8 bins/4 VW, 6 bins/5 VW, 5 bins/6 VW
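The following Python sketch simulates the virtual-warp partitioning on the CPU. It assumes one thread per histogram bin, which the slides do not state outright; that assumption does reproduce the five configurations listed in the figure. The function names are hypothetical, and the list-based reduction only mimics what the GPU would do with lanes of a real warp.

```python
WARP_SIZE = 32  # threads per physical warp, as stated on the slide


def virtual_warps_per_warp(threads_per_problem):
    """How many equal-sized virtual warps fit in one physical warp."""
    return WARP_SIZE // threads_per_problem


def reduction_stages(threads):
    """Stages in a binary (tree) reduction: ceil(log2(threads))."""
    return (threads - 1).bit_length()


def virtual_warp_min(lane_values, vw_size):
    """Minimum within each virtual warp, using a halving-stride reduction.

    `lane_values` holds one value per lane of a physical warp; lanes
    beyond the last complete virtual warp are treated as unused.
    """
    results = []
    for w in range(virtual_warps_per_warp(vw_size)):
        vals = list(lane_values[w * vw_size:(w + 1) * vw_size])
        stride = (1 << reduction_stages(vw_size)) // 2
        while stride >= 1:
            for lane in range(stride):
                if lane + stride < len(vals):
                    vals[lane] = min(vals[lane], vals[lane + stride])
            stride //= 2
        results.append(vals[0])
    return results


# Assumed mapping: one thread per bin.
print([virtual_warps_per_warp(bins) for bins in (11, 10, 8, 6, 5)])  # [2, 3, 4, 5, 6]
print(reduction_stages(32), reduction_stages(8))  # 5 3
```

With 8-thread virtual warps, a minimum search takes 3 stages instead of the 5 needed across a full 32-thread warp, and four problems are reduced at once.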

System Design
- Figure (data-flow diagram)
  - Stages: EMD Cache, Duplicate Problem Prefiltering, EMD Solver Routine, GPU Device Memory
  - Data flow labels: All Input, Uncached, Unique, Results
  - Annotations: Cached results returned immediately; New results stored in cache; Computed results returned; Copy and return results for duplicate problems
  - Legend: Red = Input data; Green = Results; Arrow size is proportional to data being moved

System Design (cont)
- EMD Cache exploits temporal locality
  - Pixels that appear in one frame are likely to reappear in subsequent frames due to object persistence
  - Uses difference histogram as key to allow multiple targets
  - Uses large GPU device memory to store many EMD values
- Duplicate Prefiltering exploits spatial locality
  - Any given pixel is likely to be surrounded by other pixels with similar histograms due to objects having regions of similar colors
  - Implemented as open-addressed hash table protected by atomically manipulated lock variables for high performance
- Exploitation of locality minimizes workload required to process each frame and maximizes utilization of GPU resources
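The cache-then-prefilter pipeline can be sketched on the CPU as follows. This is only an illustration of the control flow: Python dictionaries stand in for the GPU hash tables, there is no locking, and the names `process_frame` and `solve` are hypothetical. The solver passed in is assumed to take a difference histogram, since the slides use it as the cache key.

```python
def process_frame(candidates, target, cache, solve):
    """Return one EMD value per candidate histogram.

    cache: dict mapping difference histogram -> EMD (persists across frames).
    solve: callable taking a difference histogram tuple, returning its EMD.
    """
    results = [None] * len(candidates)
    pending = {}  # difference histogram -> indices of pixels sharing it
    for idx, cand in enumerate(candidates):
        key = tuple(c - t for c, t in zip(cand, target))
        if key in cache:
            results[idx] = cache[key]  # cached result returned immediately
        else:
            pending.setdefault(key, []).append(idx)  # duplicate prefiltering
    for key, indices in pending.items():
        value = solve(key)   # each unique uncached problem is solved once
        cache[key] = value   # new result stored for later frames
        for idx in indices:  # copy the result to every duplicate
            results[idx] = value
    return results


calls = []


def toy_solver(diff):
    """Stand-in solver (not a real EMD): half the L1 norm of the difference."""
    calls.append(diff)
    return sum(abs(d) for d in diff) // 2


cache = {}
frame = [(3, 1), (3, 1), (2, 2)]
print(process_frame(frame, (2, 2), cache, toy_solver), len(calls))  # [1, 1, 0] 2
print(process_frame(frame, (2, 2), cache, toy_solver), len(calls))  # [1, 1, 0] 2
```

The second call triggers no solver work at all, which mirrors the best-case (maximal cache hit) run described in the experimental setup.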

Experimental Setup
- Generate simulated multispectral input and target images
  - Input and target images generated by computing intensity histograms for windowed regions of grayscale images
- Process simulated multispectral input image twice using NVIDIA C2050 GPU
  - Data moved from CPU to GPU and back again for each frame
  - First run gives worst-case performance (i.e. no cache hits)
  - Second run gives best-case performance (i.e. maximal cache hits)
  - Exclude one-time setup and cleanup processing from timing to obtain steady-state timing values
- Experiment designed to estimate the performance of a real-world multispectral object detection and tracking system
- Figure: NVIDIA C2050 GPU
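One plausible way to generate such simulated input is sketched below. The slides give no details on the window size, the number of bins, or how image borders are handled, so the square window, the equal-width intensity bins, and the clipping at the borders are all assumptions, as is the function name `windowed_histograms`.

```python
def windowed_histograms(image, window, bins, levels=256):
    """Per-pixel intensity histograms over a square neighborhood.

    image: 2-D list of grayscale values in [0, levels).
    window: odd side length of the neighborhood (clipped at the borders).
    bins: number of equal-width intensity bins per histogram.
    Returns a 2-D list with one `bins`-long histogram per pixel.
    """
    height, width = len(image), len(image[0])
    radius = window // 2
    output = []
    for y in range(height):
        row = []
        for x in range(width):
            hist = [0] * bins
            for yy in range(max(0, y - radius), min(height, y + radius + 1)):
                for xx in range(max(0, x - radius), min(width, x + radius + 1)):
                    hist[image[yy][xx] * bins // levels] += 1
            row.append(hist)
        output.append(row)
    return output


tiny = [[0, 255], [128, 64]]
print(windowed_histograms(tiny, window=3, bins=4)[0][0])  # [1, 1, 1, 1]
```

Each per-pixel histogram then plays the role of one multispectral pixel, with the bin count standing in for the number of color channels.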

Performance Results
- Figure: Processing Time vs. Number of Color Channels, with panels for 1280x720 pixel input image and 1920x1080 pixel input image
- Overall system performance is between frames per second for high definition input data, making the EMD algorithm practical for many real-time computer vision applications

Future Work
- Build fully operational real-time system using multispectral video camera
- Support increased number of histogram bins to compute histogram similarities for hyperspectral video input
- Further improve performance by distributing workload across multiple GPUs
- Support other applications that require fast histogram comparisons
- Figure labels: Multispectral Video Camera, NVIDIA C2050 GPU, Detect Camouflaged Objects

Questions?
Andrew Shaffer