Foreground-Background Separation on GPU using order based approaches. Raj Gupta, Sailaja Reddy M., Swagatika Panda, Sushant Sharma and Anurag Mittal, Indian Institute of Technology Madras, Chennai, INDIA.

Presentation transcript:

Foreground-Background Separation on GPU using order based approaches
Raj Gupta, Sailaja Reddy M., Swagatika Panda, Sushant Sharma and Anurag Mittal
Indian Institute of Technology Madras, Chennai, INDIA

Application Domain:
Foreground-background separation; GPU implementation.

Problem Statement:
Foreground-background separation using order-based approaches, with speedup using GPUs.

Contributions:
Transform from intensity space to order space for illumination and noise invariance.
Use of the Stable Monotonic Change Invariant Feature Descriptor [1].
Implementation on a GPU (NVIDIA Tesla C1060 processor) that achieves up to 25x speedup compared to the CPU.

Open Issues Addressed:
Increased robustness of foreground-background separation with respect to illumination and noise.
Improved speed and stability of the process.
Offloading the CPU for subsequent processes (tracking, object recognition, etc.).

Computation of Extremal Regions:
Extremal regions are the regions ($R^{+}$, $R^{-}$) whose intensities lie above or below a given threshold:
$R^{+} = \mathrm{Thresh}^{+}(I, T_1)$, $R^{-} = \mathrm{Thresh}^{-}(I, T_2)$,
where the thresholds $T_1$ and $T_2$ differ by a small value $\delta_I$.

Foreground Background Classification:
Fig. Overlapped patches in an image.
Fig. An example of pairs extracted in a patch.

Matching:
The final weighted matching score $M$ is a weighted sum of order flips, where $I'(p)$ is the intensity of the second patch at the point corresponding to $p$ in the first patch.

Total Stability Measure: the sum of squares of the stability factors of all pixels in a patch.
Homogeneous patches: patches with a low total stability measure. Their average intensities are compared for matching.
Non-homogeneous patches: patches with a high total stability measure. Their matching score $M$ is compared for matching.

Augmenting results with patch information:
If either of two overlapping patches is foreground, the common region of the two is declared foreground; otherwise the common region is declared background.

[1] R. Gupta and A. Mittal, "SMD: A locally stable monotonic change invariant feature descriptor," in Computer Vision – ECCV.
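The classification steps above can be sketched on the CPU in plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the poster's matching formula is not reproduced in the transcript, so weighting each order flip by the pair's stability factor and normalizing by the total weight is an assumption, and the function names and all three thresholds (`stability_thresh`, `intensity_thresh`, `score_thresh`) are hypothetical.

```python
def extremal_regions(patch, t1, t2):
    """Boolean masks R+ (intensity > t1) and R- (intensity < t2).

    Strict comparisons and t2 <= t1 are assumptions of this sketch.
    """
    r_plus = [[v > t1 for v in row] for row in patch]
    r_minus = [[v < t2 for v in row] for row in patch]
    return r_plus, r_minus


def matching_score(pairs, other_patch):
    """Weighted fraction of point-pairs whose intensity order flips.

    pairs: list of ((r1, c1), (r2, c2), s). The first point was the
    brighter one in the patch the pairs were extracted from, and s is the
    pair's stability factor. Using s as the weight is an assumption.
    """
    total = sum(s for _, _, s in pairs)
    if total == 0:
        return 0.0
    flipped = sum(
        s for (r1, c1), (r2, c2), s in pairs
        if other_patch[r1][c1] <= other_patch[r2][c2]
    )
    return flipped / total


def mean_intensity(patch):
    """Average intensity of a patch given as a 2D list."""
    return sum(map(sum, patch)) / (len(patch) * len(patch[0]))


def is_foreground(pairs, ref_patch, test_patch,
                  stability_thresh, intensity_thresh, score_thresh):
    """Classify test_patch against ref_patch (hypothetical thresholds)."""
    # Total stability measure: sum of squared stability factors.
    total_stability = sum(s * s for _, _, s in pairs)
    if total_stability < stability_thresh:
        # Homogeneous patch: compare average intensities.
        diff = abs(mean_intensity(ref_patch) - mean_intensity(test_patch))
        return diff > intensity_thresh
    # Non-homogeneous patch: compare the order-flip matching score.
    return matching_score(pairs, test_patch) > score_thresh


def fuse_overlap(patch_a_is_fg, patch_b_is_fg):
    """Common region is foreground if either overlapping patch is."""
    return patch_a_is_fg or patch_b_is_fg
```

For instance, with two pairs of weights 2 and 1, a patch in which only the first pair's order flips gets a score of 2/3, so it is labeled foreground for any `score_thresh` below that value.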
Computation of Point Pairs:
Compute point-pairs (one point each from $R^{+}$ and $R^{-}$) using the distance transform.
Select the most stable point-pairs based on the distance transform.
The stability factor $s_i$ of each point-pair is the minimum of the distance values of the two points in the pair.
$\{(p_i^1, p_i^2, s_i),\ i = 1 \dots n\}$ forms the feature descriptor for matching the patches of the two frames (the current frame and the background model).
Fig. Extremal regions; extract stable points; form point pairs for comparison at different thresholds.

Learning the Background:
The background model is dynamically updated as follows:
$bk = bk \cdot (1 - \alpha) + fg \cdot \alpha$
where $\alpha$ is the learning rate, $bk$ is the pixel intensity of the existing background model, and $fg$ is the pixel intensity of the current frame.

Results:
R1, R2: indoor scenes from the PETS database.
R3: outdoor scene on a sunny day (with shadows and leaf movements).
R4: outdoor scene on a cloudy day with less illumination.
(a) Input image; background subtraction using (b) LTP, (c) LBP, (d) GMM, (e) the monotonic change invariant method.

Conclusion:
A foreground-background separation method using the order of intensities.
Robust to noise, to distortion in the spatial domain of pixels, and to fluctuations in the background (shadows, clouds, weather change, etc.).
Good results in both indoor and outdoor environments.
Implemented on a GPU to achieve high throughput (speedup of 25x at 960x720 image resolution).

Parallelizing over patches and overlapped regions:
Total number of patches: $(W_i/W_p)(H_i/W_p) + (W_i/W_p - 1)(H_i/W_p - 1)$, where $W_i$ and $H_i$ are the image width and height and $W_p$ is the patch width.
Each patch is processed independently. Each block has 256 threads, and each thread processes one patch.
Each patch is shared by at most 4 regions.
Total number of overlapped regions: $4\,(H_i/W_p - 1)(W_i/W_p - 1)$.
Declare a grid of $(H_i/W_p - 1)$ blocks with $4\,(W_i/W_p - 1)$ threads each.
Keep the data in shared memory for speed.
Facilitates concurrent copying of the binary image data from device to host.
The NVIDIA Tesla C1060 follows the NVIDIA 10-series architecture and has 30 multiprocessors.
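The point-pair, background-update and patch-count formulas above can be tried out with a small CPU sketch in plain Python. Several parts are assumptions rather than the poster's method: the brute-force Euclidean distance transform, pairing the points of $R^{+}$ and $R^{-}$ rank by rank after sorting by distance, integer division of the image size by the patch width, and all function names. Only the stability factor (minimum of the two distance values), the update rule and the two counting formulas come directly from the text.

```python
import math


def distance_transform(mask):
    """Distance from each True pixel to the nearest False pixel.

    Brute force, for small patches only. A mask with no False pixel
    yields all zeros in this sketch.
    """
    h, w = len(mask), len(mask[0])
    outside = [(r, c) for r in range(h) for c in range(w) if not mask[r][c]]
    dist = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c] and outside:
                dist[r][c] = min(
                    math.hypot(r - ro, c - co) for ro, co in outside
                )
    return dist


def select_point_pairs(r_plus, r_minus, n):
    """Return up to n triples (p1, p2, s) with p1 in R+ and p2 in R-.

    Pairing the k-th most stable point of R+ with the k-th most stable
    point of R- is an assumed selection rule.
    """
    def ranked(mask):
        dist = distance_transform(mask)
        points = [
            (dist[r][c], (r, c))
            for r in range(len(mask))
            for c in range(len(mask[0]))
            if mask[r][c]
        ]
        return sorted(points, key=lambda item: -item[0])

    pairs = []
    for (d1, p1), (d2, p2) in zip(ranked(r_plus)[:n], ranked(r_minus)[:n]):
        # Stability factor: minimum of the two distance values.
        pairs.append((p1, p2, min(d1, d2)))
    return pairs


def update_background(bk, fg, alpha):
    """Running update bk = bk * (1 - alpha) + fg * alpha for one pixel."""
    return bk * (1 - alpha) + fg * alpha


def patch_count(wi, hi, wp):
    """Total number of patches, including the overlapped ones."""
    return (wi // wp) * (hi // wp) + (wi // wp - 1) * (hi // wp - 1)


def overlapped_region_count(wi, hi, wp):
    """Total number of overlapped regions (one GPU thread each)."""
    return 4 * (hi // wp - 1) * (wi // wp - 1)
```

With a hypothetical patch width of 16 pixels, a 960x720 image gives `patch_count(960, 720, 16) == 5296` patches and `overlapped_region_count(960, 720, 16) == 10384` overlapped regions.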
Each multiprocessor of the Tesla C1060 has 8 cores, a double-precision unit and an on-chip shared memory, and is interfaced by CUDA.

Performance Optimization:
Maximizing parallel execution.
Optimizing memory transfer.
Optimizing memory usage.
Optimizing instruction usage.

GPU Implementation:
Fig. Physical memory layout of Tesla processors.
Fig. An overview of memory management and program flow on the GPU. Copy command: host-to-device memory transfer. Red arrow: concurrent transfer and kernel execution. Synchronize command: synchronizes all data transfers and kernel executions before further execution.
Fig. Thread batching.
Fig. (a) Parallelizing over overlapped regions: each thread in a block processes one region (yellow). (b) Timeline for serial execution and copy vs. concurrent asynchronous copy.

Both the grid and the block are declared as 2D.
Kernel grid: a 2D group of blocks of size $(W_i/W_p, H_i/W_p)$, one block per patch.
Block: a 2D group of threads of size $(W_p, W_p)$, one thread per pixel.
Each thread makes coalesced access to the pixel information.
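The 2D grid and block layout above can be checked with a short index calculation. The sketch below is written in Python rather than CUDA so that it runs without a GPU; the tuples stand in for CUDA's `blockIdx` and `threadIdx`, and the row-major image layout, the function names and the patch width of 16 are assumptions made for illustration.

```python
def launch_dims(wi, hi, wp):
    """Grid and block sizes: one block per patch, one thread per pixel."""
    grid = (wi // wp, hi // wp)   # blocks along x and y
    block = (wp, wp)              # threads along x and y
    return grid, block


def pixel_index(block_idx, thread_idx, wp, wi):
    """Linear offset of the pixel handled by one thread.

    Assumes the image is stored row-major with wi pixels per row.
    """
    bx, by = block_idx
    tx, ty = thread_idx
    x = bx * wp + tx
    y = by * wp + ty
    return y * wi + x


def row_addresses(block_idx, ty, wp, wi):
    """Offsets touched by the threads of one block row (fixed ty)."""
    return [pixel_index(block_idx, (tx, ty), wp, wi) for tx in range(wp)]


if __name__ == "__main__":
    grid, block = launch_dims(960, 720, 16)
    print(grid, block)
    print(row_addresses((1, 2), 4, 16, 960)[:4])
```

Threads with consecutive x indices map to consecutive offsets, which is the access pattern that coalescing relies on. With the assumed patch width of 16, each block has 16 x 16 = 256 threads.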