Fast Background Subtraction using CUDA Janaka CDA 6938.

What is Background Subtraction?
– Identify foreground pixels
– A preprocessing step for most vision algorithms

Applications
– Vehicle speed computation from video

Why is it Hard?
Naïve method: | frame_i – background | > Threshold
1. Illumination changes
   – Gradual (evening to night)
   – Sudden (overhead clouds)
2. Changes in the background geometry
   – Parked cars (should become part of the background)
3. Camera-related issues
   – Camera oscillations (shaking)
   – Grainy noise
4. Changes in background objects
   – Tree branches
   – Sea waves
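The naïve method above can be sketched in a few lines; a toy example (pixel values invented for illustration) shows why it is fragile — a shadow trips the threshold just like a real object:

```python
# Naive background subtraction: compare a frame against a fixed
# background image, pixel by pixel. 1-D "images" for simplicity.
def naive_mask(frame, background, threshold):
    """Mark a pixel as foreground when |frame - background| > threshold."""
    return [abs(f - b) > threshold for f, b in zip(frame, background)]

background = [100, 100, 100, 100]
frame      = [102, 180, 100,  60]   # pixel 1 is an object, pixel 3 a shadow
print(naive_mask(frame, background, 25))  # -> [False, True, False, True]
```

Note that the shadowed pixel is misclassified as foreground, which is exactly the kind of failure the adaptive methods below address.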

Current Approaches
– Frame difference: | frame_i – frame_{i-1} | > Threshold
– Background as the running average: B_{i+1} = α·F_i + (1 – α)·B_i
– Gaussian Mixture Models
– Kernel density estimators
</gr-replace>

Gaussian Mixture Models
– Each pixel is modeled with a mixture of Gaussians
– Flexible enough to handle variations in the background

GMM Background Subtraction
Two tasks performed in real time:
– Learning the background model
– Classifying pixels as background or foreground
Learning the background model means estimating:
– The parameters of the Gaussians: mean, variance and weight
– The number of Gaussians per pixel
The enhanced GMM is 20% faster than the original GMM*
* Improved Adaptive Gaussian Mixture Model for Background Subtraction, Zoran Zivkovic, ICPR 2004

Classifying Pixels
x(t) = value of a pixel at time t in RGB color space.
The Bayesian decision R determines if a pixel is background (BG) or foreground (FG):
R = p(BG | x(t)) / p(FG | x(t)) = p(x(t) | BG) p(BG) / (p(x(t) | FG) p(FG))
p(x | BG) = background model, estimated from the training set X.
Initially we set p(FG) = p(BG); therefore, if p(x(t) | BG) > c_thr, decide background.

Definitions and Assumptions
x(t) = value of a pixel at time t in RGB color space.
Pixel-based background subtraction involves a decision: does the pixel belong to the background (BG) or to a foreground object (FG)? The Bayesian decision R is made by:
R = p(BG | x(t)) / p(FG | x(t)) = p(x(t) | BG) p(BG) / (p(x(t) | FG) p(FG))
We set p(FG) = p(BG), assuming we know nothing about the foreground objects, and assume a uniform distribution for the foreground object appearance. Therefore, if
p(x(t) | BG) > c_thr
we decide that the pixel belongs to the background.
p(x(t) | BG) = background model, estimated from the training set X.
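The decision rule above can be sketched for a single pixel. This is a simplified illustration, not the slides' implementation: one Gaussian per channel, independent channels, and an invented threshold c_thr:

```python
import math

# Background test from the slide: with p(FG) = p(BG), decide background
# when the background likelihood p(x | BG) exceeds a threshold c_thr.
def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def is_background(pixel, means, variances, c_thr=1e-6):
    # independent channels: likelihood is a product over R, G, B
    p = 1.0
    for x, m, v in zip(pixel, means, variances):
        p *= gaussian_pdf(x, m, v)
    return p > c_thr

means, variances = (100.0, 100.0, 100.0), (25.0, 25.0, 25.0)
print(is_background((102, 99, 101), means, variances))   # near the model: True
print(is_background((200, 40, 40), means, variances))    # far away: False
```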

The GMM Model
For each new sample, update the training data set X_T and re-estimate the density. Choose a reasonable time period T; at time t we have X_T = {x(t), ..., x(t–T)}.
Full scene model (BG + FG): a GMM with M Gaussians,
p(x | X_T, BG + FG) = Σ_{m=1..M} π_m · N(x; μ_m, σ²_m I)
where μ_1, ..., μ_M are estimates of the means, σ²_1, ..., σ²_M are estimates of the variances, and π_m are the mixing weights, non-negative and adding up to one.

The Update Equations
Given a new data sample x(t), the update equations are:
π_m ← π_m + α (o_m(t) – π_m)
μ_m ← μ_m + o_m(t) (α / π_m) δ_m,  with δ_m = x(t) – μ_m
σ²_m ← σ²_m + o_m(t) (α / π_m) (δ_mᵀ δ_m – σ²_m)
where o_m(t) is set to 1 for the ‘close’ Gaussian and 0 for the others, and α ≈ 1/T limits the influence of old data (learning rate).
This is an on-line clustering algorithm; Gaussians with small weights are discarded. If the Gaussians are sorted to have descending weights, the background model is approximated by the first B components:
B = argmin_b ( Σ_{m=1..b} π_m > 1 – c_f )
where c_f is a measure of the maximum portion of data that can belong to FG without influencing the BG model.
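The update equations above can be sketched for a single grayscale pixel. This is a hedged illustration of the on-line update in the cited paper (Zivkovic, ICPR 2004), not the project's code; the constants α, c_f, the initial variance, and the 3-sigma match test are illustrative choices:

```python
# On-line GMM update for one (grayscale) pixel.
ALPHA = 0.01      # learning rate, roughly 1/T
VAR0  = 30.0      # variance given to a newly created Gaussian

def update_pixel(gaussians, x, match_sigmas=3.0):
    """gaussians: list of [weight, mean, var]; returns the updated list,
    sorted by descending weight and with weights renormalised."""
    matched = None
    for g in gaussians:
        if (x - g[1]) ** 2 < match_sigmas ** 2 * g[2]:   # 'close' Gaussian
            matched = g
            break
    for g in gaussians:
        o = 1.0 if g is matched else 0.0                 # ownership o_m(t)
        g[0] += ALPHA * (o - g[0])                       # weight update
        if o:
            d = x - g[1]
            g[1] += (ALPHA / g[0]) * d                   # mean update
            g[2] += (ALPHA / g[0]) * (d * d - g[2])      # variance update
    if matched is None:                                  # no match: new Gaussian
        gaussians.append([ALPHA, float(x), VAR0])
    gaussians.sort(key=lambda g: -g[0])                  # descending weights
    total = sum(g[0] for g in gaussians)
    for g in gaussians:                                  # renormalise weights
        g[0] /= total
    return gaussians

g = [[1.0, 100.0, 30.0]]
for _ in range(50):
    g = update_pixel(g, 100)   # a stable background pixel
print(len(g), g[0][1])         # one Gaussian, mean stays at 100.0
```

A stable pixel keeps a single dominant Gaussian whose variance shrinks; a pixel that starts flickering between two values would grow a second component whose weight rises at rate α per frame.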

Background Subtraction Results
(side-by-side video: original video / foreground pixels)

CPU/GPU Implementation
– Treat each pixel independently
– Use the update equations to change the GMM parameters

How to Parallelize?
– Simple: one thread per pixel
– But each pixel has a different number of Gaussians
– This causes divergence inside a warp
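A toy cost model makes the divergence problem concrete. Warps execute in lock-step, so a warp takes as long as its slowest thread; warp size and the per-pixel Gaussian counts below are invented for illustration (real warps have 32 threads):

```python
# Cost of warp divergence when per-pixel work varies.
WARP = 4
gaussians_per_pixel = [1, 1, 4, 1,   2, 2, 2, 2]

warps = [gaussians_per_pixel[i:i + WARP]
         for i in range(0, len(gaussians_per_pixel), WARP)]
serial_cost = sum(gaussians_per_pixel)          # work actually needed
warp_cost = sum(max(w) * WARP for w in warps)   # lock-step: every lane
                                                # waits for the slowest
print(serial_cost, warp_cost)                   # -> 15 24
```

One pixel with 4 Gaussians stalls its whole warp, so lane-cycles jump from 15 to 24 even though most pixels need a single Gaussian.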

Preliminary Results
– Speedup: a mere 1.5x on QVGA (320 x 240) video
– Still useful, since the CPU is offloaded

Optimization
– Constant memory
– Pinned (non-pageable) memory
– Memory coalescing
  – Structure of Arrays vs. Array of Structures
  – Packing and inflating data
  – 16 x 16 block size
– Asynchronous execution
  – Kernel invocation
  – Memory transfer
  – CUDA streams

Memory Related
Constant memory
– Cached
– Used to store all the configuration parameters
Pinned memory
– Required for asynchronous transfers
– Use cudaMallocHost rather than malloc
– Transfer bandwidth for GeForce 8600M GT using bandwidthTest:

                Pageable     Pinned
  CPU to GPU    981 MB/s     2041 MB/s
  GPU to CPU    566 MB/s     549 MB/s

CUDA Memory Coalescing (recap)*
A coordinated read by 16 threads (a half-warp) of a contiguous region of global memory:
– 64 bytes: each thread reads a word (int, float, …)
– 128 bytes: each thread reads a double-word (int2, float2, …)
– 256 bytes: each thread reads a quad-word (int4, float4, …)
The starting address must be a multiple of the region size.
* Optimizing CUDA, Paulius Micikevicius
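The alignment rule above is simple enough to check with arithmetic; the sketch below encodes the half-warp rule for these older (pre-Fermi) GPUs, with made-up example addresses:

```python
# Coalescing check for a half-warp reading consecutive words:
# the access is coalesced when the region is contiguous and the
# starting address is a multiple of the region size.
HALF_WARP = 16

def is_coalesced(base_addr, word_size):
    region = HALF_WARP * word_size      # 64, 128 or 256 bytes
    return base_addr % region == 0

print(is_coalesced(256, 4))   # float reads starting at 256: coalesced
print(is_coalesced(260, 4))   # misaligned start: not coalesced
```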

Memory Coalescing
– Compaction: uses fewer registers
– Inflation: for coalescing

Memory Coalescing
– SoA over AoS, for coalescing
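Why SoA coalesces and AoS does not comes down to the addresses neighboring threads touch. The sketch below computes them for an element with 3 four-byte fields (sizes invented for illustration):

```python
# Addresses read by the first four threads when each loads field 0
# of "its" element, under the two layouts.
FIELDS, WORD, N = 3, 4, 8   # fields per element, bytes per field, elements

def aos_addr(i, field):     # Array of Structures: fields interleaved
    return (i * FIELDS + field) * WORD

def soa_addr(i, field):     # Structure of Arrays: one array per field
    return (field * N + i) * WORD

print([aos_addr(i, 0) for i in range(4)])  # strided: [0, 12, 24, 36]
print([soa_addr(i, 0) for i in range(4)])  # contiguous: [0, 4, 8, 12]
```

Under SoA, consecutive threads read consecutive words, which is exactly the pattern the coalescing rule rewards; under AoS the stride breaks the contiguous-region requirement.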

Asynchronous Execution

Asynchronous Invocation

int cuda_update(CGMMImage2* pGMM, pUINT8 imagein, pUINT8 imageout)
{
    // wait for the previous memory operations to finish
    cudaStreamSynchronize(pGMM->copyStream);

    // copy into and from pinned memory
    memcpy(pGMM->pinned_in, imagein, ....);
    memcpy(imageout, pGMM->pinned_out, ....);

    // make sure the previous kernel finished before the next memory transfer
    cudaStreamSynchronize(pGMM->execStream);

    // swap pointers
    swap(&(pGMM->d_in1), &(pGMM->d_in2));
    swap(&(pGMM->d_out1), &(pGMM->d_out2));

    // copy the input image to the device, and the previous result back
    cudaMemcpyAsync(pGMM->d_in1, pGMM->pinned_in, ...., pGMM->copyStream);
    cudaMemcpyAsync(pGMM->pinned_out, pGMM->d_out2, ...., pGMM->copyStream);

    // launch the kernel on the execution stream
    backSubKernel<<<grid, block, 0, pGMM->execStream>>>(pGMM->d_in2, pGMM->d_out1, ...);

    return 0;
}

Gain from Optimization
Observe how the running time (in seconds) improved with each optimization technique:
1. Naïve version (use constant memory)
2. Partial asynchronous version (use pinned memory)
3. Memory coalescing (use SoA)
4. More coalescing, with inflation and compaction
5. Complete asynchronous version

Experiments – Speedup
Final speedup: 3.7x on a GeForce 8600M GT

Frame Rate
– 481 fps for 256 x 256 video on the 8600M GT
– HD video formats:
  – 720p (1280 x 720): 40 fps
  – 1080p (1920 x 1080): 17.4 fps
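A quick arithmetic check on the reported frame rates: per-pixel throughput should be roughly constant on the same GPU, and it is, except for the 256 x 256 case, where fixed per-frame overhead weighs more:

```python
# Pixels processed per second for each reported resolution/frame rate.
rates = {
    (256, 256): 481,
    (1280, 720): 40,
    (1920, 1080): 17.4,
}
for (w, h), fps in rates.items():
    print(w, h, round(w * h * fps / 1e6, 1), "Mpixel/s")
# 256x256  -> 31.5 Mpixel/s
# 720p     -> 36.9 Mpixel/s
# 1080p    -> 36.1 Mpixel/s
```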

Foreground Fraction
– Generate video frames with varying numbers of random (foreground) pixels
– The GPU version's running time stays stable as the foreground fraction grows, compared to the CPU version's

Matlab Interface (API)
An interface for developers.

Initialize:
  h = BackSubCUDA(frames{1}, 0, [0.01 5* gpu]);
Add new frames:
  for i=1:numImages
      output = BackSubCUDA(frames{i}, h);
  end;
Destroy:
  clear BackSubCUDA

Conclusions
– Advantages of the GPU version (recap): speed, offloading the CPU, stability
– Overcoming the host/device transfer overhead is key
– Understanding the optimization techniques is necessary to get these gains