Multi-modal robotic perception Stephen Gould, Paul Baumstarck, Morgan Quigley, Andrew Ng, Daphne Koller PAIL, January 2008.


Multi-modal robotic perception Stephen Gould, Paul Baumstarck, Morgan Quigley, Andrew Ng, Daphne Koller PAIL, January 2008

Gould, et al. Multi-modal robotic perception gratuitous request for comments, suggestions, or insights (especially those that can be implemented in under 24 hours)

Gould, et al. Motivation
How can we design a robot to "see" as well as a human?
Desiderata:
- small household/office objects
- "no excuse" detection
- operate on a timescale commensurate with humans
- scalable to a large number of objects
Observation: 3-d information would greatly help with object detection and recognition.
STAIR: the STanford AI Robot

Gould, et al. 3-d from images
Wouldn't it be nice if we could extract 3-d features from monocular images? The current state of the art [Hoiem et al., 2006; Saxena et al., 2007] is not (yet?) good enough, especially when objects are small.

Gould, et al. Complementary sensors
Image sensors (cameras) provide high-resolution color and intensity data; range sensors (lasers) provide depth and global contextual information.
Solution: augment visual information with 3-d features from a range scanner, e.g. a laser.

Gould, et al. Overview
- Motivation
- Hardware architecture and dataflow
- Constructing a scene representation
  - Super-resolution sensor fusion
  - Dominant planar surface extraction
- Multi-sensor object detection
  - 2-d sliding-window object detector
  - 3-d features
  - Multi-sensor object detector
- Experimental results and analysis
- Future work

Gould, et al. Multi-sensor architecture
- Laser scanlines are combined into a sparse point cloud
- Infer a 3-d location and surface normal for every pixel in the image (super-resolution)
- Dominant planar surfaces are extracted from the sparse point cloud
- Combine 2-d and 3-d cues for better object detection

Gould, et al. Super-resolution sensor fusion
A super-resolution MRF (similar to [Diebel and Thrun, 2006]) is used to infer a depth value for every image pixel from the sparse point cloud: the singleton potential encodes our preference for matching the laser measurements; the pairwise potential encodes our preference for planar surfaces.
We then reconstruct a dense point cloud and estimate surface normals (in the camera coordinate system).
The algorithm can be stopped at any time and later resumed from the previous iteration, enabling a real-time implementation.

Gould, et al. Super-resolution sensor fusion
We define our super-resolution MRF over image pixels x_ij, laser measurements z_ij, and reconstructed depths y_ij as

$$p(y \mid x, z) = \frac{1}{\eta(x, z)} \exp\Biggl\{ -k \sum_{(i,j) \in L} \Phi_{ij} - \sum_{(i,j) \in I} \Psi_{ij} \Biggr\}$$

where

$$\Phi_{ij}(y, z) = h\bigl(y_{i,j} - z_{i,j};\, \lambda\bigr)$$

$$\Psi_{ij}(x, y) = w^{v}_{ij}\, h\bigl(2 y_{i,j} - y_{i,j-1} - y_{i,j+1};\, \lambda\bigr) + w^{h}_{ij}\, h\bigl(2 y_{i,j} - y_{i-1,j} - y_{i+1,j};\, \lambda\bigr)$$

Here h(x; λ) is the Huber penalty, and w^v_ij and w^h_ij are weighting factors indicating how unwilling we are to allow smoothing to occur across edges in the image, as in [Diebel and Thrun, 2006].
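To make the anytime property concrete, here is a minimal NumPy sketch of one gradient-descent step on this energy. The choice of optimizer, the step size, and the constants k and λ are illustrative assumptions, not the paper's actual settings.

```python
# Sketch of one gradient-descent step on the super-resolution MRF energy,
# assuming dense arrays; k, lam, and lr are hypothetical values.
import numpy as np

def huber(r, lam):
    """Huber penalty h(r; lambda) and its derivative."""
    quad = np.abs(r) <= lam
    h = np.where(quad, 0.5 * r**2, lam * (np.abs(r) - 0.5 * lam))
    dh = np.where(quad, r, lam * np.sign(r))
    return h, dh

def mrf_step(y, z, mask, w_v, w_h, k=10.0, lam=0.1, lr=0.1):
    """One step on the energy k*sum_L Phi_ij + sum_I Psi_ij.

    y    -- current dense depth estimate (H x W)
    z    -- sparse laser depths (H x W, valid where mask is True)
    mask -- True at pixels with a laser measurement (the set L)
    w_v, w_h -- edge-aware smoothing weights derived from image gradients
    """
    grad = np.zeros_like(y)

    # Data term Phi: prefer matching the laser measurements.
    _, d = huber(y - z, lam)
    grad += k * d * mask

    # Smoothness term Psi, first part: second differences along each row.
    ry = 2 * y[:, 1:-1] - y[:, :-2] - y[:, 2:]
    _, dry = huber(ry, lam)
    grad[:, 1:-1] += 2 * w_v[:, 1:-1] * dry
    grad[:, :-2]  -= w_v[:, 1:-1] * dry
    grad[:, 2:]   -= w_v[:, 1:-1] * dry

    # Smoothness term Psi, second part: second differences along each column.
    rx = 2 * y[1:-1, :] - y[:-2, :] - y[2:, :]
    _, drx = huber(rx, lam)
    grad[1:-1, :] += 2 * w_h[1:-1, :] * drx
    grad[:-2, :]  -= w_h[1:-1, :] * drx
    grad[2:, :]   -= w_h[1:-1, :] * drx

    # Iterating this step can be stopped and resumed at any time,
    # which is what enables the real-time implementation.
    return y - lr * grad
```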

Gould, et al. Dominant plane extraction
Plane clustering is based on a greedy, region-growing algorithm with a smoothness constraint [Rabbani et al., 2006]; the extracted planes are used to determine object "support".
(a) Compute normal vectors using each point's neighborhood
(b) Grow the region over neighbors with similar normal vectors
(c) Use neighbors with low residual to expand the region
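A minimal sketch of this greedy region growing, assuming precomputed normals and a neighbor graph (e.g., from a k-d tree); the angle threshold and seed ordering are illustrative, not the paper's exact parameters.

```python
# Sketch of greedy, region-growing plane clustering in the spirit of
# [Rabbani et al., 2006]; thresholds and neighbor structure are assumptions.
import numpy as np
from collections import deque

def grow_planes(points, normals, neighbors, angle_thresh_deg=15.0):
    """Cluster points into near-planar regions.

    points    -- (N, 3) array of 3-d points
    normals   -- (N, 3) unit surface normals from local neighborhoods
    neighbors -- list of neighbor-index lists (e.g. from a k-d tree)
    """
    cos_thresh = np.cos(np.deg2rad(angle_thresh_deg))
    labels = -np.ones(len(points), dtype=int)
    region = 0
    for seed in range(len(points)):      # arbitrary seed order for brevity
        if labels[seed] != -1:
            continue
        queue = deque([seed])
        labels[seed] = region
        while queue:
            p = queue.popleft()
            for q in neighbors[p]:
                # Absorb neighbors whose normals agree with the point
                # that reached them (the smoothness constraint).
                if labels[q] == -1 and np.dot(normals[p], normals[q]) > cos_thresh:
                    labels[q] = region
                    queue.append(q)
        region += 1
    return labels
```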

Gould, et al. Multi-modal scene representation
We now have a scene represented by:
- intensity/color for every pixel, x_ij
- a 3-d location for every pixel, X_ij
- a surface orientation for every pixel, n_ij
- a set of dominant planar surfaces, {P_k}
All coordinates are in "image" space.

Gould, et al. Overview
- Motivation
- Hardware architecture and dataflow
- Constructing a scene representation
  - Super-resolution sensor fusion
  - Dominant planar surface extraction
- Multi-sensor object detection
  - 2-d sliding-window object detector
  - 3-d features
  - Multi-sensor object detector
- Experimental results and analysis
- Future work

Gould, et al. Sliding-window object detectors
Task: find the disposable coffee cups.

Gould, et al. 2-d object detectors
We use a sliding-window object detector to compute object probabilities given image data only, P_image(o | x, y, σ).
Features are based on localized patch responses from a pre-trained dictionary and are applied to the image at multiple scales [Torralba et al., 2007]; a GentleBoost [Friedman et al., 1998] classifier is applied to each window.
[Figure: examples from the "mug" dictionary]
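A hedged sketch of this pipeline, with a hypothetical dictionary and classifier; the patch response is simplified to a max cross-correlation, and the multi-scale loop is omitted for brevity. This is an illustration of the technique, not the paper's implementation.

```python
# Sketch of a sliding-window detector over patch-response features.
# `dictionary` is a list of small 2-d patch arrays; `classifier` is any
# trained function (e.g. a boosted classifier) mapping features to P(object).
import numpy as np
from scipy.signal import correlate2d

def patch_features(image, window, dictionary):
    """Max cross-correlation of each dictionary patch within the window.
    Assumes the window is larger than every patch."""
    x0, y0, w, h = window
    crop = image[y0:y0 + h, x0:x0 + w]
    return np.array([correlate2d(crop, patch, mode='valid').max()
                     for patch in dictionary])

def detect(image, dictionary, classifier, win=(64, 64), stride=8):
    """Score every window at a single scale; returns ((x0, y0, w, h), prob)."""
    detections = []
    H, W = image.shape
    for y0 in range(0, H - win[1] + 1, stride):
        for x0 in range(0, W - win[0] + 1, stride):
            f = patch_features(image, (x0, y0, win[0], win[1]), dictionary)
            detections.append(((x0, y0) + win, classifier(f)))
    return detections
```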

Gould, et al. 3-d features
The scene representation is based on 3-d points and surface normals for every pixel in the image, {X_ij, n_ij}, and the set of dominant planes, {P_k}.
We compute 3-d features over candidate windows (in the image plane) by projecting each window into the 3-d scene.

Gould, et al. 3-d features
Features include statistics on height above ground, distance from the robot, surface variation, surface orientation, support (distance and orientation), and the size of the object. These are combined probabilistically with the log-odds ratio from the 2-d detector.
[Figure: scene, height, support, size]
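A sketch of how such window-level statistics might be computed from the points that project into a candidate window; the feature names, the support test, and the choice of summary statistics are illustrative assumptions, not the paper's exact definitions.

```python
# Sketch of 3-d feature statistics over a candidate window's points.
import numpy as np

def window_3d_features(X, n, plane,
                       robot_pos=np.zeros(3), up=np.array([0., 0., 1.])):
    """X: (M, 3) points projecting into the window; n: (M, 3) unit normals;
    plane: (unit normal, point) of the nearest dominant supporting plane."""
    centroid = X.mean(axis=0)
    plane_n, plane_pt = plane
    return {
        'height':       centroid @ up,                     # height above ground
        'distance':     np.linalg.norm(centroid - robot_pos),
        'surface_var':  X.var(axis=0).sum(),               # surface variation
        'orientation':  np.abs(n @ up).mean(),             # mean verticality
        'support_dist': abs((centroid - plane_pt) @ plane_n),
        'size':         np.ptp(X, axis=0).prod(),          # bounding-volume size
    }
```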

Gould, et al. 3-d features by class

Gould, et al. Combining 3-d and visual cues
A simple logistic classifier probabilistically combines the features at location (x, y) and image scale σ:

$$f_{2d}(x, y, \sigma) = \log\left( \frac{P_{image}(o \mid x, y, \sigma)}{1 - P_{image}(o \mid x, y, \sigma)} \right)$$

$$P(o \mid x, y, \sigma) = q\left( \theta_{3d}^{T} f_{3d}(x, y, \sigma) + \theta_{2d}\, f_{2d}(x, y, \sigma) + \theta_{bias} \right)$$

where q(·) is the logistic function.
Advantages: simple and quick to train and evaluate; the 2-d object detectors can be trained separately.
Disadvantages: assumes the 2-d and 3-d features are independent; assumes objects are independent.
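A minimal sketch of this fusion, assuming the weights θ have already been learned (their values here are placeholders):

```python
# Fuse the 2-d detector's probability with 3-d features via a logistic model.
import numpy as np

def combine(p_image, f_3d, theta_3d, theta_2d, theta_bias):
    eps = 1e-6
    p = np.clip(p_image, eps, 1 - eps)
    f_2d = np.log(p / (1 - p))                  # log-odds of the 2-d detector
    a = theta_3d @ f_3d + theta_2d * f_2d + theta_bias
    return 1.0 / (1.0 + np.exp(-a))             # q(.) = logistic function
```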

Gould, et al. Overview
- Motivation
- Hardware architecture and dataflow
- Constructing a scene representation
  - Super-resolution sensor fusion
  - Dominant planar surface extraction
- Multi-sensor object detection
  - 2-d sliding-window object detector
  - 3-d features
  - Multi-sensor object detector
- Experimental results and analysis
- Future work

Gould, et al. NIPS demonstration

Gould, et al. NIPS demonstration
Hardware:
- 2 quad-core servers
- SICK laser on a pan unit
- Axis PTZ camera
Statistics:
- 4 hours, 30 minutes of video (704x480 pixels)
- took about 10 seconds to find new objects
- 77.8 million laser points (4.69 kpps)
- correctly labeled approx. 150k coffee mugs, 70k ski boots, and 50k computer monitors, with very few false detections

Gould, et al. Optimizations for real-time
Bottleneck #1: super-resolution MRF
- initialize using quadratic interpolation of the points
- run on a half-scale version of the image and then up-sample
- update neighborhood gradient information
Bottleneck #2: 2-d feature extraction
- prune candidate windows based on color or depth constancy (see the sketch below)
- integer operations for patch cross-correlation
- share the patch-normalization calculation between patch features
- multi-thread the patch response calculation
General principle: use the Switchyard (ROS) robotic software framework to run modules in parallel on multiple machines, and keep data close to the processor that uses it.
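The window-pruning step might look like the following sketch; the constancy test and thresholds are assumptions about the idea rather than the paper's exact criterion.

```python
# Sketch: skip candidate windows that are nearly constant in color or depth,
# since blank walls and flat surfaces are unlikely to contain objects.
import numpy as np

def prune_window(intensity_crop, depth_crop,
                 min_std_color=5.0, min_std_depth=0.01):
    """Return True if the window should be discarded before feature extraction."""
    return (intensity_crop.std() < min_std_color
            or depth_crop.std() < min_std_depth)
```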

Gould, et al. Scoring results
Non-maximal neighborhood suppression: discard multiple overlapping detections.
Area-of-overlap measure: a detection counts as positive if it has more than 50% overlap with a groundtruth object of the correct class,

$$AO(D_i, G_j) = \frac{\text{area}(D_i \cap G_j)}{\text{area}(D_i \cup G_j)}$$
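A sketch of both scoring steps; the (x0, y0, x1, y1) box format and greedy suppression order are assumptions.

```python
# Area-of-overlap (intersection over union) and greedy non-maximal suppression.
def area_of_overlap(d, g):
    """AO(D, G) = area(D ∩ G) / area(D ∪ G) for boxes (x0, y0, x1, y1)."""
    ix = max(0, min(d[2], g[2]) - max(d[0], g[0]))
    iy = max(0, min(d[3], g[3]) - max(d[1], g[1]))
    inter = ix * iy
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(d) + area(g) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(detections, thresh=0.5):
    """detections: list of (box, score); keep the locally strongest boxes."""
    keep = []
    for box, score in sorted(detections, key=lambda t: -t[1]):
        if all(area_of_overlap(box, k) <= thresh for k, _ in keep):
            keep.append((box, score))
    return keep
```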

Gould, et al. Experimental results
[Figure: detection results without 3-d features vs. with 3-d features]

Gould, et al. Analysis of features
Compare the maximum F1-score of the 2-d detector augmented with each 3-d feature separately.
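For reference, the F1-score here is the standard harmonic mean of precision and recall:

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$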

Gould, et al. Example scenes
[Figure: example detections for mug, cup, monitor, clock, handle, and ski boot, comparing the 2-d-only detector with the detector using 3-d features]

Gould, et al. Future work
- Optimization: can 3-d features help reduce the amount of computation needed? e.g., use surface variance or object size to reduce the candidate rectangles examined by the sliding-window detector.
- Accuracy: can more detailed 3-d features or a more sophisticated 3-d scene model help with recognition? e.g., the location of other objects in the scene.
- Whole-robot integration: what other sensor modalities can be used to help detection, and what active control strategies can be used to improve accuracy? e.g., zooming in for a better view of an object.
- Can the robot actively help in data collection/learning?

Gould, et al. Questions
- Motivation
- Hardware architecture and dataflow
- Constructing a scene representation
  - Super-resolution sensor fusion
  - Dominant planar surface extraction
- Multi-sensor object detection
  - 2-d sliding-window object detector
  - 3-d features
  - Multi-sensor object detector
- Experimental results and analysis
- Future work