
1 Multi-modal robotic perception Stephen Gould, Paul Baumstarck, Morgan Quigley, Andrew Ng, Daphne Koller PAIL, January 2008

2 Gould, et al. Multi-modal robotic perception. Gratuitous request for comments, suggestions, or insights (especially those that can be implemented in under 24 hours)

3 Gould, et al. Motivation
How can we design a robot to "see" as well as a human?
desiderata:
- small household/office objects
- "no excuse" detection
- operate on a timescale commensurate with humans
- scalable to a large number of objects
observation: 3-d information would greatly help with object detection and recognition
(image: the STanford AI Robot, STAIR)

4 Gould, et al. 3-d from images
Wouldn't it be nice if we could extract 3-d features from monocular images? The current state of the art is not (yet?) good enough, especially when objects are small [Hoiem et al., 2006] [Saxena et al., 2007]

5 Gould, et al. Complementary sensors
- Image sensors (cameras) provide high-resolution color and intensity data
- Range sensors (laser) provide depth and global contextual information
solution: augment visual information with 3-d features from a range scanner, e.g. a laser

6 Gould, et al. Overview
- Motivation
- Hardware architecture and dataflow
- Constructing a scene representation
  - Super-resolution sensor fusion
  - Dominant planar surface extraction
- Multi-sensor object detection
  - 2-d sliding-window object detector
  - 3-d features
  - Multi-sensor object detector
- Experimental results and analysis
- Future work

7 Gould, et al. Multi-sensor architecture
- Laser scanlines combined into a sparse point cloud
- Infer 3-d location and surface normal for every pixel in the image (super-resolution)
- Dominant planar surfaces extracted from the sparse point cloud
- Combine 2-d and 3-d cues for better object detection

8 Gould, et al. Super-resolution sensor fusion
- A super-resolution MRF (similar to [Diebel and Thrun, 2006]) is used to infer a depth value for every image pixel from the sparse point cloud: the singleton potential encodes our preference for matching the laser measurements; the pairwise potential encodes our preference for planar surfaces
- Reconstruct the dense point cloud and estimate surface normals (in the camera coordinate system)
- The algorithm can be stopped at any time and later resumed from the previous iteration, enabling a real-time implementation

9 Gould, et al. Super-resolution sensor fusion
We define our super-resolution MRF over image pixels x_{ij}, laser measurements z_{ij}, and reconstructed depths y_{ij} as

p(y \mid x, z) = \frac{1}{\eta(x, z)} \exp\Big\{ -k \sum_{(i,j) \in L} \Phi_{ij} - \sum_{(i,j) \in I} \Psi_{ij} \Big\}

where

\Phi_{ij}(y, z) = h(y_{i,j} - z_{i,j}; \lambda)
\Psi_{ij}(x, y) = w^{v}_{ij} \, h(2 y_{i,j} - y_{i,j-1} - y_{i,j+1}; \lambda) + w^{h}_{ij} \, h(2 y_{i,j} - y_{i-1,j} - y_{i+1,j}; \lambda)

Here h(x; \lambda) is the Huber penalty, and w^{v}_{ij} and w^{h}_{ij} are weighting factors indicating how unwilling we are to allow smoothing across edges in the image, as in [Diebel and Thrun, 2006].
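Below is a minimal Python/NumPy sketch of how an energy of this form could be minimized by plain gradient descent. It is not the authors' implementation: the initialization, step size, and the exact treatment of the weights w^v and w^h are assumptions made for illustration, and in practice one would initialize from interpolated laser depths rather than the raw sparse map.

```python
import numpy as np

def huber(r, lam):
    """Huber penalty h(r; lambda) and its derivative, applied elementwise."""
    quad = np.abs(r) <= lam
    val = np.where(quad, 0.5 * r ** 2, lam * (np.abs(r) - 0.5 * lam))
    grad = np.where(quad, r, lam * np.sign(r))
    return val, grad

def superres_depth(z, mask, w_v, w_h, k=10.0, lam=0.1, steps=200, lr=0.1):
    """Gradient descent on E(y) = k * sum_L h(y - z) + sum_I [w_v * h(2nd diff down) + w_h * h(2nd diff across)].
    z    : HxW sparse depth map (laser measurements)
    mask : HxW boolean, True where a laser measurement exists (the set L)
    w_v, w_h : HxW edge-dependent smoothing weights"""
    y = z.astype(float).copy()          # crude initialization; interpolation works better
    for _ in range(steps):
        g = np.zeros_like(y)
        # data term: keep y close to the laser measurements where they exist
        _, gd = huber(y - z, lam)
        g += k * mask * gd
        # smoothness term: penalize second differences (preference for planar surfaces)
        d = 2 * y[1:-1, :] - y[:-2, :] - y[2:, :]
        _, gs = huber(d, lam)
        gs = gs * w_v[1:-1, :]
        g[1:-1, :] += 2 * gs
        g[:-2, :] -= gs
        g[2:, :] -= gs
        d = 2 * y[:, 1:-1] - y[:, :-2] - y[:, 2:]
        _, gs = huber(d, lam)
        gs = gs * w_h[:, 1:-1]
        g[:, 1:-1] += 2 * gs
        g[:, :-2] -= gs
        g[:, 2:] -= gs
        y -= lr * g                      # each step only refines the current estimate
    return y
```

Because each update only refines the current estimate of y, the loop can be interrupted after any iteration and resumed later, which is what makes the real-time scheduling described above workable.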

10 Gould, et al. Dominant plane extraction
- Plane clustering based on a greedy, region-growing algorithm with a smoothness constraint [Rabbani et al., 2006]
- Extracted planes are used for determining object "support"
(figure: (a) compute normal vectors using a local neighborhood; (b) grow the region over neighbors with similar normal vectors; (c) use neighbors with low residual to expand the region)
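A rough sketch of smoothness-constrained region growing in this spirit (not the exact algorithm of [Rabbani et al., 2006]: the neighbor structure, angle threshold, and minimum region size are illustrative assumptions, and the low-residual expansion step is omitted):

```python
import numpy as np
from collections import deque

def grow_planar_regions(normals, neighbors, angle_thresh_deg=15.0, min_size=50):
    """Greedy region growing: seed at unvisited points and absorb neighbors whose
    surface normals are close to the current point's normal.
    normals   : (N, 3) array of unit surface normals, one per 3-d point
    neighbors : list of lists, neighbors[i] = indices of points adjacent to point i
    Returns a list of index arrays, one per extracted (roughly planar) region."""
    cos_thresh = np.cos(np.deg2rad(angle_thresh_deg))
    visited = np.zeros(len(normals), dtype=bool)
    regions = []
    for seed in range(len(normals)):
        if visited[seed]:
            continue
        region, queue = [seed], deque([seed])
        visited[seed] = True
        while queue:
            cur = queue.popleft()
            for nb in neighbors[cur]:
                if visited[nb]:
                    continue
                # smoothness constraint: only grow over similarly oriented neighbors
                if abs(np.dot(normals[cur], normals[nb])) >= cos_thresh:
                    visited[nb] = True
                    region.append(nb)
                    queue.append(nb)
        if len(region) >= min_size:
            regions.append(np.array(region))
    return regions
```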

11 Gould, et al. Multi-modal scene representation
We now have a scene represented by:
- intensity/color for every pixel, x_{ij}
- 3-d location for every pixel, X_{ij}
- surface orientation for every pixel, n_{ij}
- a set of dominant planar surfaces, {P_k}
All coordinates are in "image" space.

12 Gould, et al. Overview
- Motivation
- Hardware architecture and dataflow
- Constructing a scene representation
  - Super-resolution sensor fusion
  - Dominant planar surface extraction
- Multi-sensor object detection
  - 2-d sliding-window object detector
  - 3-d features
  - Multi-sensor object detector
- Experimental results and analysis
- Future work

13 Gould, et al. Sliding-window object detectors task: find the disposable coffee cups

14 Gould, et al. Sliding-window object detectors task: find the disposable coffee cups

15 Gould, et al. 2-d object detectors
- We use a sliding-window object detector to compute object probabilities given image data only, P_image(o | x, y, σ)
- Features are based on localized patch responses from a pre-trained dictionary, applied to the image at multiple scales [Torralba et al., 2007]
- A GentleBoost [Friedman et al., 1998] classifier is applied to each window
(figure: examples from the "mug" dictionary)
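As an illustration of the patch-response idea (a naive sketch, not the optimized integer-arithmetic implementation, and with the GentleBoost classifier omitted), each feature can be thought of as the best normalized cross-correlation of a dictionary patch within the candidate window:

```python
import numpy as np

def patch_response(window, patch):
    """Maximum normalized cross-correlation of one dictionary patch inside a window."""
    ph, pw = patch.shape
    p = (patch - patch.mean()) / (patch.std() + 1e-8)
    best = -1.0
    for r in range(window.shape[0] - ph + 1):
        for c in range(window.shape[1] - pw + 1):
            w = window[r:r + ph, c:c + pw]
            w = (w - w.mean()) / (w.std() + 1e-8)
            best = max(best, float((w * p).mean()))
    return best

def window_features(window, dictionary):
    """Feature vector for one candidate window: one patch response per dictionary entry.
    (In the real system responses are computed at multiple image scales and fed to a
    boosted classifier; this sketch only shows the raw responses.)"""
    return np.array([patch_response(window, patch) for patch in dictionary])
```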

16 Gould, et al. 3-d features
- Scene representation based on 3-d points and surface normals for every pixel in the image, {X_{ij}, n_{ij}}, and a set of dominant planes, {P_k}
- Compute 3-d features over candidate windows (in the image plane) by projecting each window into the 3-d scene

17 Gould, et al. 3-d features
- Features include statistics on height above the ground, distance from the robot, surface variation, surface orientation, support (distance and orientation), and size of the object
- These are combined probabilistically with the log-odds ratio from the 2-d detector
(figure panels: scene, height, support, size)
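A hedged sketch of what such per-window statistics might look like (the feature names, the choice of "up" axis, and the plane parameterization are assumptions; the slides describe the features only at the level of the quantities they summarize):

```python
import numpy as np

def features_3d(X, N, window, ground_z=0.0, supporting_plane=None):
    """Illustrative 3-d feature statistics for a candidate window.
    X      : HxWx3 array of 3-d point locations for every pixel
    N      : HxWx3 array of unit surface normals
    window : (x0, y0, x1, y1) candidate rectangle in the image plane
    supporting_plane : optional (normal, offset) of the nearest dominant plane"""
    x0, y0, x1, y1 = window
    pts = X[y0:y1, x0:x1].reshape(-1, 3)
    nrm = N[y0:y1, x0:x1].reshape(-1, 3)
    feats = {
        "height_above_ground": pts[:, 2].mean() - ground_z,   # assumes z is "up"
        "distance_from_robot": np.linalg.norm(pts.mean(axis=0)),
        "surface_variation": pts[:, 2].var(),
        "mean_vertical_normal": np.abs(nrm[:, 2]).mean(),      # surface orientation proxy
        "object_size": np.ptp(pts[:, 0]) * np.ptp(pts[:, 1]),  # extent of the projected window
    }
    if supporting_plane is not None:
        n_plane, d_plane = supporting_plane
        feats["support_distance"] = np.min(np.abs(pts @ n_plane + d_plane))
    return feats
```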

18 Gould, et al. 3-d features by class

19 Gould, et al. Combining 3-d and visual cues
A simple logistic classifier probabilistically combines the features at location (x, y) and image scale σ:

P(o \mid x, y, \sigma) = q\big( \theta_{3d}^{T} f_{3d}(x, y, \sigma) + \theta_{2d} \, f_{2d}(x, y, \sigma) + \theta_{bias} \big)

where q(\cdot) is the logistic function and

f_{2d}(x, y, \sigma) = \log\left( \frac{P_{image}(o \mid x, y, \sigma)}{1 - P_{image}(o \mid x, y, \sigma)} \right)

advantages: simple and quick to train and evaluate; the 2-d object detectors can be trained separately
disadvantages: assumes 2-d and 3-d features are independent; assumes objects are independent
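The combination is easy to express directly; a small sketch follows (the clipping of the 2-d probability to avoid infinite log-odds is an implementation detail added here, not taken from the slides):

```python
import numpy as np

def combine(p_image, f_3d, theta_3d, theta_2d, theta_bias):
    """P(o | x, y, sigma) = q(theta_3d' f_3d + theta_2d * f_2d + theta_bias),
    where f_2d is the log-odds of the 2-d detector's probability."""
    p = np.clip(p_image, 1e-6, 1 - 1e-6)      # guard against log(0)
    f_2d = np.log(p / (1.0 - p))              # log-odds from the 2-d detector
    a = np.dot(theta_3d, f_3d) + theta_2d * f_2d + theta_bias
    return 1.0 / (1.0 + np.exp(-a))           # logistic function q(.)

# Example usage with made-up parameters:
# combine(0.7, np.array([0.3, 1.2, 0.0]), np.array([0.5, -0.2, 0.1]), 1.3, -0.4)
```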

20 Gould, et al. Overview
- Motivation
- Hardware architecture and dataflow
- Constructing a scene representation
  - Super-resolution sensor fusion
  - Dominant planar surface extraction
- Multi-sensor object detection
  - 2-d sliding-window object detector
  - 3-d features
  - Multi-sensor object detector
- Experimental results and analysis
- Future work

21 Gould, et al. NIPS demonstration

22 Gould, et al. NIPS demonstration
hardware:
- 2 quad-core servers
- SICK laser on a pan unit
- Axis PTZ camera
statistics:
- 4 hours, 30 minutes of operation
- 54,244 video frames (704x480 pixels; 3.269 fps)
- took about 10 seconds to find new objects
- 77.8 million laser points (4.69 kpps)
- correctly labeled approx. 150k coffee mugs, 70k ski boots, 50k computer monitors
- very few false detections

23 Gould, et al. Optimizations for real-time
Bottleneck #1: super-resolution MRF
- initialize using quadratic interpolation of the laser points
- run on a half-scale version of the image and then up-sample
- update neighborhood gradient information
Bottleneck #2: 2-d feature extraction
- prune candidate windows based on color or depth constancy (see the sketch after this list)
- use integer operations for patch cross-correlation
- share the patch-normalization calculation between patch features
- multi-thread the patch-response calculation
Software framework: use the Switchyard (ROS) robotic framework to run modules in parallel on multiple machines. General principle: keep data close to the processor that uses it.
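The window-pruning step referenced above might look roughly like this (the variance threshold and the interface are assumptions; the slides only state that near-constant windows are discarded):

```python
import numpy as np

def prune_windows(depth, windows, min_depth_var=1e-4):
    """Cheap pruning pass: discard candidate windows whose depth is nearly constant,
    since they are unlikely to contain a foreground object.
    depth   : HxW dense depth map from the super-resolution step
    windows : list of (x0, y0, x1, y1) rectangles in the image plane"""
    keep = []
    for (x0, y0, x1, y1) in windows:
        if np.var(depth[y0:y1, x0:x1]) >= min_depth_var:
            keep.append((x0, y0, x1, y1))
    return keep
```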

24 Gould, et al. Scoring results
- Non-maximal neighborhood suppression: discard multiple overlapping detections
- Area-of-overlap measure: a detection is counted as positive if it overlaps a groundtruth object of the correct class by more than 50%, where

AO(D_i, G_j) = \frac{area(D_i \cap G_j)}{area(D_i \cup G_j)}
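A small sketch of both steps for axis-aligned boxes (the greedy, score-ordered form of the suppression is an assumption; the slide only states that overlapping detections are discarded):

```python
def area_of_overlap(d, g):
    """AO(D, G) = area(D ∩ G) / area(D ∪ G) for boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(d[0], g[0]), max(d[1], g[1])
    ix1, iy1 = min(d[2], g[2]), min(d[3], g[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(d) + area(g) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(detections, overlap_thresh=0.5):
    """Keep only the highest-scoring detection among mutually overlapping boxes.
    detections : list of (score, box) tuples."""
    kept = []
    for score, box in sorted(detections, key=lambda t: -t[0]):
        if all(area_of_overlap(box, kb) < overlap_thresh for _, kb in kept):
            kept.append((score, box))
    return kept
```

With AO defined this way, the 50% criterion above corresponds to requiring AO greater than 0.5 against a groundtruth box of the correct class.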

25 Gould, et al. Experimental results (figure: without 3-d features vs. with 3-d features)

26 Gould, et al. Analysis of features Compare maximum F1-score of 2-d detector augmented with each 3-d feature separately

27 Gould, et al. Example scenes (figure: mug, cup, monitor, clock, handle, ski boot; each shown 2-d only vs. with 3-d)

28 Gould, et al. Future work
- Optimization: can 3-d features help reduce the amount of computation needed? e.g. use surface variance or object size to reduce the candidate rectangles examined by the sliding-window detector
- Accuracy: can more detailed 3-d features or a more sophisticated 3-d scene model help with recognition? e.g. the location of other objects in the scene
- Whole-robot integration: what other sensor modalities can be used to help detection, and what active control strategies can be used to improve accuracy? e.g. zooming in for a better view of an object
- Can the robot actively help in data collection/learning?

29 Gould, et al. Questions
- Motivation
- Hardware architecture and dataflow
- Constructing a scene representation
  - Super-resolution sensor fusion
  - Dominant planar surface extraction
- Multi-sensor object detection
  - 2-d sliding-window object detector
  - 3-d features
  - Multi-sensor object detector
- Experimental results and analysis
- Future work

