A Parallel Implementation of MSER Detection: GPGPU Final Project, Lin Cao

Review
- MSER (maximally stable extremal regions) denotes a set of stable connected components detected in a grayscale image.
- The regions are invariant to affine transformations such as rotation, translation, and scale change.

Review
- An MSER is a stable connected component of a thresholded image.
- All pixels inside an MSER have either higher or lower intensities than the pixels in the surrounding region.
- Regions are selected to be stable over a range of intensity thresholds.
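To make the "connected component of a thresholded image" idea concrete, here is a minimal sketch in Python. The function name `components_at_threshold`, the 4-connectivity, and the toy image are assumptions made for illustration; none of them come from the slides.

```python
from collections import deque

def components_at_threshold(img, t):
    """Return the 4-connected components of pixels with intensity <= t."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for y in range(h):
        for x in range(w):
            if img[y][x] > t or seen[y][x]:
                continue
            # Breadth-first flood fill from an unvisited dark pixel
            queue, comp = deque([(y, x)]), []
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                comp.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and not seen[ny][nx] and img[ny][nx] <= t:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            comps.append(comp)
    return comps

# Toy image (an assumption): a dark blob on a bright background
img = [
    [9, 9, 9, 9, 9],
    [9, 1, 1, 9, 9],
    [9, 1, 2, 9, 9],
    [9, 9, 9, 9, 9],
]
# Area of the dark blob for thresholds 2..8
areas = [len(components_at_threshold(img, t)[0]) for t in range(2, 9)]
```

In this toy image the blob keeps the same area for every threshold from 2 to 8, which is the kind of stability over an intensity range that the review slide describes.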

Sequential and Parallel Approach
Sequential {
    bucketSort();
    Find();
    Union();
    Update();
    GetRegion();
    computeVariation();
    leastVariation();
}
Parallel {
    buildDirectedGraph();
    blockReduction();
    parentCompression();
    // regions are already available at this point
    computeVariation();
    findRoot();
}
leastVariation();
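The sequential column (bucket sort, then Find and Union) can be sketched as follows. This is a generic union-find pass written for illustration, not the author's code; `bucket_sort_pixels`, `find`, `areas_by_level`, the 8-bit intensity range, and the toy image are all assumed names and settings.

```python
def bucket_sort_pixels(img):
    """Counting sort of pixel coordinates by 8-bit intensity (assumed range 0..255)."""
    buckets = [[] for _ in range(256)]
    for y, row in enumerate(img):
        for x, v in enumerate(row):
            buckets[v].append((y, x))
    return [p for b in buckets for p in b]

def find(parent, p):
    """Return the root of p, shortening the path on the way (path halving)."""
    while parent[p] != p:
        parent[p] = parent[parent[p]]
        p = parent[p]
    return p

def areas_by_level(img, seed):
    """Area of the component containing `seed` after each intensity level is merged."""
    parent, area, history = {}, {}, {}
    order = bucket_sort_pixels(img)
    for i, (y, x) in enumerate(order):
        p = (y, x)
        parent[p], area[p] = p, 1
        for q in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if q in parent:  # neighbour already processed (out-of-range never is)
                rp, rq = find(parent, p), find(parent, q)
                if rp != rq:  # Union: attach the neighbour's tree to p's tree
                    parent[rq] = rp
                    area[rp] += area[rq]
        level = img[y][x]
        last_of_level = i + 1 == len(order) or img[order[i + 1][0]][order[i + 1][1]] != level
        if last_of_level and seed in parent:
            history[level] = area[find(parent, seed)]
    return history

# Toy image (an assumption): a dark blob on a bright background
img = [
    [9, 9, 9, 9, 9],
    [9, 1, 1, 9, 9],
    [9, 1, 2, 9, 9],
    [9, 9, 9, 9, 9],
]
history = areas_by_level(img, (1, 1))
```

The data dependency is visible here: every union depends on the state left by all earlier pixels, which is why the parallel column replaces this pass with per-block work followed by a reduction.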

buildDirectedGraph
- Each pixel's parent must have a value no less than the pixel's own value.
- Local memory: visited, members
- Shared memory

buildDirectedGraph
- Memory usage: local memory (visited, members) and shared memory
- Also processes edges for the next step
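One way to satisfy the rule that a parent's value is no less than the pixel's own value is sketched below. The slides do not say how the parent is chosen, so the selection rule here (the smallest qualifying 4-neighbour, with ties broken by linear index to avoid cycles on plateaus) is an assumption, as are the name `build_directed_graph` and the toy image. On a GPU each pixel would be handled by its own thread; the loops below only stand in for that.

```python
def build_directed_graph(img):
    """Give every pixel a parent whose (value, linear index) key is strictly larger."""
    h, w = len(img), len(img[0])
    parent = {}
    for y in range(h):
        for x in range(w):
            own = (img[y][x], y * w + x)
            best = None
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w:
                    key = (img[ny][nx], ny * w + nx)
                    # Assumed rule: the closest key above our own wins
                    if key > own and (best is None or key < best[0]):
                        best = (key, (ny, nx))
            # A pixel with no larger neighbour becomes a root (its own parent)
            parent[(y, x)] = best[1] if best else (y, x)
    return parent

# Toy image (an assumption): a dark blob on a bright background
img = [
    [9, 9, 9, 9, 9],
    [9, 1, 1, 9, 9],
    [9, 1, 2, 9, 9],
    [9, 9, 9, 9, 9],
]
parent = build_directed_graph(img)
roots = [p for p, q in parent.items() if p == q]
```

Because the key strictly increases along every parent link, the graph cannot contain a cycle, and each pixel reaches a root in a finite number of steps.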

Block Reduction 16*16, 8*8

Block Reduction: 3 iterations are needed in total, log2(4) + log2(2) = 2 + 1 = 3
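The iteration count can be checked with a small helper. It assumes that the 4 and the 2 are the numbers of blocks along the two image axes and that each iteration merges neighbouring blocks pairwise along one axis; the slide states neither, and `merge_rounds` is a hypothetical name.

```python
def merge_rounds(blocks_x, blocks_y):
    """Pairwise merge rounds needed to fuse a blocks_x by blocks_y grid into one region."""
    # (n - 1).bit_length() equals ceil(log2(n)) for every n >= 1
    return (blocks_x - 1).bit_length() + (blocks_y - 1).bit_length()

# Assumed 4 x 2 grid of blocks: log2(4) + log2(2) = 2 + 1
rounds = merge_rounds(4, 2)
```

Under the same assumptions, halving the block side doubles the block count on each axis and adds one round per axis.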

Block Reduction: if (horizontal_pixelUpdate), load the edge information into each pixel

Block Reduction History buffer

Parent Compression Shared memory based on parent locality
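Parent compression is commonly done on parallel hardware with pointer jumping, where every node replaces its parent with its grandparent until nothing changes. The sketch below shows that generic technique, not necessarily the author's kernel; `compress_parents` is an assumed name, and the shared-memory tiling mentioned on the slide is not modelled.

```python
def compress_parents(parent):
    """Pointer jumping: point every node at its grandparent until all point at a root.

    `parent` is a list where parent[i] is the parent index of node i
    and a root satisfies parent[i] == i.
    """
    parent = list(parent)
    rounds = 0
    while True:
        # On a GPU this comprehension would be one thread per node
        nxt = [parent[parent[i]] for i in range(len(parent))]
        if nxt == parent:
            return parent, rounds
        parent, rounds = nxt, rounds + 1

# A chain of 8 nodes whose root is node 7
chain = [1, 2, 3, 4, 5, 6, 7, 7]
flat, rounds = compress_parents(chain)
```

The chain has depth 7 and flattens in 3 rounds, since each round halves the remaining distance to the root.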

FindRegion
- FindRoot, so that each region's tree can be processed separately.
- Find each region's parent and child based on delta, so that the variation can be computed: var = (area(parent) - area(child)) / area(current region)
- Send the region information to the CPU.
- Scan every region's tree and find the regions with minimal variation; these are the MSERs.
- Filter the regions.
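The variation formula and the minimum search can be sketched as below. The sketch assumes one nested region per threshold level, so that the "parent" is the region delta levels above and the "child" is the region delta levels below; the function name `variations` and the area values are made up for illustration.

```python
def variations(area, delta):
    """var[t] = (area[t + delta] - area[t - delta]) / area[t] for every valid level t.

    area[t] is the pixel count of the region at threshold level t.
    """
    out = {}
    for t in range(delta, len(area) - delta):
        out[t] = (area[t + delta] - area[t - delta]) / area[t]
    return out

# Made-up area history of one region as the threshold rises
area = [3, 4, 4, 4, 4, 4, 20]
var = variations(area, 1)
# Level with the least variation (the first one on ties)
best = min(var, key=var.get)
```

Levels 2 to 4 have zero variation because the area does not change around them, while level 5 scores 4.0 because the region is about to merge with the background.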

Performance Analysis For 256*256 image,

Performance Analysis For 1024*768 image,

Performance Analysis
Why is 8*8 better than 16*16?
- local memory usage
- recursion times
- block execution
- block reduction times
- parent locality

Performance Analysis GPU vs CPU timing intermediate values Synchronization record information memory transfer

Conclusion
- The problem has very large data dependencies, but it can still be solved in parallel.
- The approach should suit multicore microprocessors, whose individual cores are stronger than a single GPU thread.
- The bottleneck is still memory.

Future Work
- More efficient block reduction (decoder and encoder)
- Random memory access
- GPU code efficiency