Multimodal Templates for Real-Time Detection of Texture-less Objects in Heavily Cluttered Scenes
Stefan Hinterstoisser, Stefan Holzer, Cedric Cagniart, Slobodan Ilic, Kurt Konolige, Nassir Navab, Vincent Lepetit
Department of Computer Science, CAMP, Technische Universität München (TUM), Germany; Willow Garage, Menlo Park, CA, USA
IEEE International Conference on Computer Vision (ICCV) 2011

Outline
- Goal & Challenges
- Related Work
- Modality Extraction
  - Image Cue
  - Depth Cue
- Similarity Measure
- Efficient Computation
- Experiments

Goal

Challenges
- Objects under different poses over a heavily cluttered background
- Online learning
- Real-time object learning and detection

Related Work
Approaches to multi-view 3D object detection fall into two main categories: learning-based methods and template matching.
Learning-based methods:
- Require a large amount of training data
- Require a long offline training phase
- Learning a new object is expensive

Related Work
Template matching:
- Better adapted to low-textured objects than feature-point approaches
- Templates for new objects are easy to add
- However, direct matching is too slow for real-time use
Others (matching in range data):
- Require constructing a full 3D CAD model of the object

Outline
- Goal & Challenges
- Related Work
- Modality Extraction
  - Image Cue
  - Depth Cue
- Similarity Measure
- Efficient Computation
- Experiments

Modality Extraction - Image Cue
- Image gradients have proven to be discriminative and robust to illumination changes and noise.
- Using the normalized gradients, and not their magnitudes, makes the measure robust to contrast changes.
- We compute the normalized gradients on each color channel of the input RGB image.
- For an input image $I$, the gradient map at location $\mathbf{x}$ is taken from the color channel with the largest gradient magnitude:
  $I_g(\mathbf{x}) = \frac{\partial \hat{C}}{\partial \mathbf{x}}$, with $\hat{C}(\mathbf{x}) = \arg\max_{C \in \{R,G,B\}} \left\| \frac{\partial C}{\partial \mathbf{x}} \right\|$

Modality Extraction - Image Cue
- Keep only the gradients whose norms are larger than a threshold.
- Assign to each location the gradient whose quantized orientation occurs most often in its 3 × 3 neighborhood.
- The similarity measure for this modality is
  $f_g(O_g(r), I_g(t)) = \left| \cos\big( \mathrm{ang}(O_g(r)) - \mathrm{ang}(I_g(t)) \big) \right|$
- $O_g(r)$: the normalized gradient map of the reference image at location $r$
- $I_g(t)$: the normalized gradient map of the input image at location $t$
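A minimal sketch of this image cue, assuming the input is an RGB image as a NumPy array; the function names, bin count, and magnitude threshold are illustrative choices, not the authors' code, and the 3 × 3 dominant-orientation filtering step is omitted for brevity:

```python
import numpy as np

def quantized_gradient_map(rgb, n_bins=8, mag_thresh=10.0):
    """Per-pixel gradient orientation, quantized into n_bins, taken from the
    color channel with the strongest gradient; weak gradients are dropped."""
    gy, gx = np.gradient(rgb.astype(float), axis=(0, 1))   # each: H x W x 3
    mag = np.sqrt(gx ** 2 + gy ** 2)
    best = np.argmax(mag, axis=2)                          # strongest channel
    ii, jj = np.indices(best.shape)
    gx_b, gy_b, mag_b = gx[ii, jj, best], gy[ii, jj, best], mag[ii, jj, best]
    ori = np.arctan2(gy_b, gx_b) % np.pi     # orientation without sign
    quant = np.floor(ori / np.pi * n_bins).astype(int) % n_bins
    quant[mag_b < mag_thresh] = -1           # -1 marks "no gradient kept"
    return quant

def f_g(ref_bin, in_bin, n_bins=8):
    """|cos| of the angle difference between two quantized orientations."""
    return abs(np.cos((ref_bin - in_bin) * np.pi / n_bins))
```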

Modality Extraction - Image Cue
[Figure: quantizing the gradient orientations - the input color image, the gradient image computed on the gray image, and the gradient image computed with our approach.]

Modality Extraction - Depth Cue
- We use a standard camera and an aligned depth sensor to obtain the depth map.
- We use quantized surface normals, computed on a dense depth field, for our template representation.
- Consider the first-order Taylor expansion of the depth function $D(\mathbf{x})$:
  $D(\mathbf{x} + d\mathbf{x}) - D(\mathbf{x}) \approx d\mathbf{x}^\top \nabla D$
- Within a patch defined around $\mathbf{x}$, each pixel offset $d\mathbf{x}$ yields one such equation.

Modality Extraction - Depth Cue
- Estimate an optimal gradient $\hat{\nabla} D$ in the least-squares sense from these equations.
- This depth gradient corresponds to a tangent plane going through three points $\mathbf{X}$, $\mathbf{X}_1$ and $\mathbf{X}_2$:
  $\mathbf{X} = D(\mathbf{x})\,\mathbf{v}(\mathbf{x})$, $\mathbf{X}_1 = \big(D(\mathbf{x}) + \hat{D}_x\big)\,\mathbf{v}(\mathbf{x} + [1,0]^\top)$, $\mathbf{X}_2 = \big(D(\mathbf{x}) + \hat{D}_y\big)\,\mathbf{v}(\mathbf{x} + [0,1]^\top)$,
  with $\hat{D}_x, \hat{D}_y$ the components of $\hat{\nabla} D$.
- $\mathbf{v}(\mathbf{x})$: the vector along the line of sight that goes through pixel $\mathbf{x}$ (obtained from the parameters of the depth sensor).

Modality Extraction - Depth Cue
- The normal to the surface can be estimated as the normalized cross product of $\mathbf{X}_1 - \mathbf{X}$ and $\mathbf{X}_2 - \mathbf{X}$.
- Within a patch defined around $\mathbf{x}$, this would not be robust around occluding contours.
- Inspired by bilateral filtering, we therefore ignore the pixels whose depth difference with the central pixel is above a threshold.
[Figure: the depth sensor, the line of sight through pixel x, the points X, X1, X2 on the tangent plane D(x), and the resulting normal at X.]
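A minimal sketch of this normal estimation, assuming a pinhole depth camera with intrinsics fx, fy, cx, cy; the two-neighbor tangent vectors and the depth-difference threshold value are illustrative simplifications of the patch-based estimate described above:

```python
import numpy as np

def normals_from_depth(depth, fx, fy, cx, cy, depth_diff_thresh=50.0):
    """Estimate a surface normal per pixel as the normalized cross product
    of two tangent vectors, invalidating normals across depth jumps."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Back-project every pixel to a 3D point along its line of sight.
    X = np.dstack([(u - cx) * depth / fx,
                   (v - cy) * depth / fy,
                   depth])                               # H x W x 3
    # Tangent vectors toward the right and bottom neighbors (X1 - X, X2 - X).
    dx = np.roll(X, -1, axis=1) - X
    dy = np.roll(X, -1, axis=0) - X
    n = np.cross(dx, dy)
    n /= np.linalg.norm(n, axis=2, keepdims=True) + 1e-12
    # Bilateral-style rejection: zero out normals across occluding contours,
    # i.e. where a neighbor's depth differs too much from the central pixel.
    jump = (np.abs(np.roll(depth, -1, axis=1) - depth) > depth_diff_thresh) | \
           (np.abs(np.roll(depth, -1, axis=0) - depth) > depth_diff_thresh)
    n[jump] = 0.0
    return n
```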

Modality Extraction - Depth Cue
- Quantize the normal directions into $n_0$ bins.
- Assign to each location the quantized value that occurs most often in its 5 × 5 neighborhood.
- The similarity measure for this modality, $f_D(O_D(r), I_D(t))$, compares the reference and input surface normals.
- $O_D(r)$: the normalized surface normal of the reference image at location $r$
- $I_D(t)$: the normalized surface normal of the input image at location $t$
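A short sketch of these two steps, under two labeled assumptions: the normal directions are binned by the in-image-plane angle of the normal (one plausible quantization scheme; the paper's exact binning may differ), and f_D is taken as the dot product of the two unit normals:

```python
import numpy as np

def quantize_normals(normals, n0=8):
    """Bin each unit normal by the angle of its projection onto the image
    plane (assumption: a plausible scheme, not necessarily the paper's)."""
    ang = np.arctan2(normals[..., 1], normals[..., 0]) % (2 * np.pi)
    q = np.floor(ang / (2 * np.pi) * n0).astype(int) % n0
    q[np.linalg.norm(normals, axis=2) < 0.5] = -1  # invalidated normals
    return q

def f_D(o_ref, i_in):
    """Similarity of two unit normals as their dot product (assumption)."""
    return float(np.dot(o_ref, i_in))
```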

Modality Extraction - Depth Cue
[Figure: quantizing the surface normals - the input image, the corresponding depth image, and the surface normals computed with our approach. Details are clearly visible and depth discontinuities are well handled.]

Outline
- Goal & Challenges
- Related Work
- Modality Extraction
  - Image Cue
  - Depth Cue
- Similarity Measure
- Efficient Computation
- Experiments

Similarity Measure
- We define a template as $T = (\{O_m\}_{m \in M}, P)$.
- $P$ is a list of pairs $(r, m)$ made of the locations $r$ of the discriminant features in modality $m$, e.g. $(r_i, \text{gradients})$ and $(r_k, \text{surface normals})$.
- Each template is created by extracting, for each modality $m$, a set $P$ of its most discriminant features.
- Each location $r$ records the feature position with respect to the object center $C$.

Similarity Measure
The object similarity (energy) function is
$E(I, T, c) = \sum_{(r,m) \in P} \max_{t \in R(c+r)} f_m\big(O_m(r), I_m(t)\big)$
- $T = (\{O_m\}_{m \in M}, P)$
- $c$: the detected location (e.g. the object center)
- $R(c+r) = \left[c+r-\frac{N}{2},\, c+r+\frac{N}{2}\right] \times \left[c+r-\frac{N}{2},\, c+r+\frac{N}{2}\right]$ with $N$ constant: the neighborhood of size $N$ centered on $c+r$ in $I_m$
- $f_m(O_m(r), I_m(t))$: the similarity score for modality $m$
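A direct, unoptimized reading of this measure as code; the data layout (a template as a list of (offset, modality, reference value) triples) and all names are illustrative, not the authors' API:

```python
import numpy as np

def energy(template, input_maps, f, c, N=5):
    """E(I, T, c): for each feature (r, m, o), take the best similarity
    f[m](o, t) over the N x N neighborhood R(c + r) in the input map I_m.
    template: list of (r, m, o) with r = (dy, dx) offsets from the center.
    input_maps: dict m -> 2D array of quantized input features.
    f: dict m -> similarity function. c: candidate center (y, x)."""
    score, h = 0.0, N // 2
    for (r, m, o) in template:
        I = input_maps[m]
        y, x = c[0] + r[0], c[1] + r[1]
        patch = I[max(0, y - h):y + h + 1, max(0, x - h):x + h + 1]
        # Best response in the neighborhood; skip invalid (-1) locations.
        score += max((f[m](o, t) for t in patch.flat if t >= 0), default=0.0)
    return score
```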

Efficient Computation
- We first quantize the input data of each modality into a small number $n_0$ of values.
- For each quantized value with index $i$ of modality $m$, we use a lookup table $\tau_{i,m}$ for the energy response:
  $\tau_{i,m}[\mathcal{L}_m] = \max_{l \in \mathcal{L}_m} f_m(i, l)$
- $i$: the index of the quantized value of modality $m$ (we also use $i$ to denote the corresponding value).
- $\mathcal{L}_m$: the list of values of modality $m$ appearing in a local neighborhood of a location in the input $I$.

Efficient Computation “Spread” [11] the data around neighborhood to obtain a robust representation Jm instead of Lm. For each quantized value of one modality m with index i we can now compute the response at each location c: тi,m : the precomputed lookup table, Jm as the index [11] S. Hinterstoisser, C. Cagniart, S. Ilic, P. Sturm, P. Fua, N. Navab, and V. Lepetit. Gradient response maps for realtime detection of texture-less objects. under revision PAMI.

Efficient Computation
- Finally, the similarity measure becomes
  $E(I, T, c) = \sum_{(r,m) \in P} S_{O_m(r),m}(c + r)$
- Since the maps $S_{i,m}$ are shared between the templates, matching several templates against the input image can be done very quickly once the maps are computed.
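A sketch of this linearized pipeline for one modality, assuming at most 16 quantized values encoded as bitmasks; the spreading window size, table layout, and all names are illustrative:

```python
import numpy as np

def spread(quant, T=3):
    """J: at each pixel, a bitmask of all quantized values (bits 0..n0-1)
    appearing in its T x T neighborhood ("spreading" the input data)."""
    mask = np.where(quant >= 0, 1 << quant.clip(min=0), 0).astype(np.uint16)
    J, h = np.zeros_like(mask), T // 2
    for dy in range(-h, h + 1):
        for dx in range(-h, h + 1):
            J |= np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
    return J

def response_maps(J, f, n0=8):
    """One map per quantized reference value i: S[i][c] = max over the
    values j encoded in J[c] of f(i, j), via a 2^n0-entry lookup table."""
    S = []
    for i in range(n0):
        tau = np.zeros(1 << n0)
        for bits in range(1 << n0):
            vals = [j for j in range(n0) if (bits >> j) & 1]
            tau[bits] = max((f(i, j) for j in vals), default=0.0)
        S.append(tau[J])   # fancy indexing: one table lookup per pixel
    return S

# Usage sketch: with f(i, j) = abs(np.cos(np.pi * (i - j) / n0)), matching a
# template reduces to sum(S[o][c[0] + r[0], c[1] + r[1]] for (r, _, o) in template),
# and the maps S are computed once and shared by all templates.
```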

Experiments
Compared methods:
- LINE-MOD: our approach (intensity & depth)
- LINE-2D: introduced in [11] (intensity only)
- LINE-3D: depth map only
Hardware: one core of a standard notebook with an Intel Centrino Core 2 Duo processor at 2.4 GHz and 3 GB of RAM.
Test data: six object sequences of 2000 real images each; each sequence exhibits illumination changes and large viewpoint changes over a heavily cluttered background.

Experiments
Robustness: a single threshold (about 80) separates almost all true positives from the false positives for LINE-MOD.

Experiments
Speed:
- Learning new templates only requires extracting and storing features, which is almost instantaneous.
- The templates cover 360 degrees of tilt rotation, 90 degrees of inclination rotation, in-plane rotations of ±80 degrees, and scale changes from 1.0 to 2.0.
- We parse a 640 × 480 image with over 3000 templates of 126 features each at about 10 fps (real-time).
- The runtime of LINE-MOD depends only on the number of features and is independent of the object/template size.

Experiments
Speed: [timing figure]

Experiments
Occlusion: average recognition score for the six objects with respect to the amount of occlusion (figure on the right). With over 30% occlusion, our method is still able to recognize objects.

Experiments
[Figure: true positive rate curves for the Cup, Toy Car, and Hole Punch sequences.]

Experiments
[Figure: true positive rate curves for the Toy Monkey, Toy Duck, and Camera sequences.]

Experiments
$\text{True positive rate} = \dfrac{\text{True positives}}{\text{True positives} + \text{False negatives}} \times 100\%$
$\text{False positive rate} = \dfrac{\text{False positives}}{\text{True negatives} + \text{False positives}} \times 100\%$
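As a trivial executable check of the two definitions above (the function name is mine):

```python
def rates(tp, fn, fp, tn):
    """True/false positive rates in percent, per the definitions above."""
    tpr = 100.0 * tp / (tp + fn)
    fpr = 100.0 * fp / (tn + fp)
    return tpr, fpr
```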

Experiments