
Scene Understanding through Transfer Learning
Stephen Gould, Ben Packer, Geremy Heitz, Daphne Koller
DARPA Update, September 11, 2008

Outline
What is Scene Understanding?
Scene Understanding Projects:
- Cascaded Classification Models [NIPS, 2008]
- TAS: Things and Stuff [ECCV, 2008]
- 3D Context [ECCV Workshop, 2008]
- LOOPS [NIPS, 2008]
- Hierarchical Learning [UAI, 2008]
- Indoor Depth Reconstruction (in progress)
(The slide also shows thumbnails of the TAS plate model and the CCM cascade, which are detailed on later slides.)

What is “Understanding”?
Vision subtask (recognition): “Is there an object of type X in this image?” (Airplane? NO. Human? YES. Dog? YES.)
Scene understanding: “What is happening in this image?” (MAN, DOG: “The man is walking the dog.”)

Computer View of a “Scene” (figure): region labels SEASIDE, PASTURE, GRASS, SKY.

Human View of a “Scene”: “The cow is walking through the grass on a pasture by the sea.” (A cow. Some grass… She’s walking.)

“Context”

What can we do when we have all the components and datasets with ground-truth labels for each?

Scene Understanding CCM (figure): region labels (SEASIDE, PASTURE, GRASS, SKY), surface geometry (Grass = flat, Sky = far, FG = vertical), scene composition (40% grass, 30% sky…), and object counts (1 cow, 2 boats…).

Solution: CCMs
I: image
Φ_D, Φ_S, Φ_Z: image features for each component task (D, S, Z, e.g. detection, segmentation, depth)
Ŷ_D^ℓ, Ŷ_S^ℓ, Ŷ_Z^ℓ for ℓ = 0, 1, …, L: output labels at each level of the cascade
Features for level ℓ+1 are computed from Φ and the labels of level ℓ.
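The level-to-level coupling is the heart of the cascade. Below is a minimal sketch of the idea, assuming binary tasks and scikit-learn-style classifiers; the feature layout and choice of logistic regression are illustrative assumptions, not the paper's implementation:

```python
# Cascaded classification sketch: each level's classifier for a task sees
# its base image features plus the previous level's predictions for all tasks.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_ccm(base_features, labels, num_levels=2):
    """base_features/labels: dicts mapping task name -> per-sample arrays."""
    tasks = sorted(base_features)
    # Level-0 context is a placeholder; level 0 learns from Phi alone.
    preds = {t: np.zeros((len(labels[t]), 1)) for t in tasks}
    cascade = []
    for level in range(num_levels):
        context = np.hstack([preds[t] for t in tasks])   # labels of level l
        stage = {t: LogisticRegression(max_iter=1000).fit(
                      np.hstack([base_features[t], context]), labels[t])
                 for t in tasks}
        # Recompute predictions to serve as context for level l+1.
        preds = {t: stage[t].predict_proba(
                      np.hstack([base_features[t], context]))[:, [1]]
                 for t in tasks}
        cascade.append(stage)
    return cascade
```

Each stage can be any black-box classifier; only the feature-concatenation step ties the levels together, which is what lets the cascade reuse state-of-the-art subcomponents.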

Some Examples: SU-2

Why do we think depth can provide context signals?

Indoor Detection
Image sensors (cameras) provide high-resolution color and intensity data.
Range sensors (lasers) provide depth and global contextual information.
Improving detection: augment the visual information with 3-d features from a range scanner, e.g. a laser.

3-d Features
Scene representation: a 3-d point and surface normal for every pixel in the image, {X_ij, n_ij}, plus a set of dominant planes, {P_k}.
Compute 3-d features over candidate windows (in the image plane) by projecting each window into the 3-d scene.
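As a hedged sketch of what such window features could look like, given per-pixel 3-d points from a registered range scan (the specific features and names are illustrative assumptions, not the slide's feature set):

```python
# Compute simple 3-d features for a candidate detection window by
# projecting its pixels into the scene and summarizing their geometry.
import numpy as np

def window_3d_features(points, window):
    """points: (H, W, 3) per-pixel 3-d coords, z up; window: (u0, v0, u1, v1)."""
    u0, v0, u1, v1 = window
    P = points[v0:v1, u0:u1].reshape(-1, 3)
    P = P[np.isfinite(P).all(axis=1)]          # drop pixels with no laser return
    centroid = P.mean(axis=0)
    extent = P.max(axis=0) - P.min(axis=0)     # physical size in meters
    return {
        "distance": float(np.linalg.norm(centroid)),
        "height_above_ground": float(centroid[2]),   # assumes z=0 is the floor
        "width_m": float(extent[0]),
        "depth_m": float(extent[1]),
        "height_m": float(extent[2]),
    }
```

Features like physical size and height above the ground are what make a mug-sized window on a desk distinguishable from a mug-shaped patch on a wall.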

Example scenes (figure): detections for mug, cup, monitor, clock, handle, and ski boot, comparing “2-d only” against “with 3-d” for each object class.

What if we only have detection ground-truth? Can we still do anything?

Unsupervised Context - TAS
Stuff-thing context, based on intuitive “relationships”:
- Green & textured = no cars
- Red & boxy = cars nearby
- Gray & smooth = cars here

The TAS Model
W_i: window
T_i: object presence
S_j: region label
F_j: region features
R_ij: relationship
(Plate diagram: candidate windows i = 1…N and image regions j = 1…J, with R_ij linking each window-region pair.)
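To make the roles of these variables concrete, here is a much-simplified scoring sketch. It assumes the latent region clusters S_j and the relationship compatibilities have already been learned, and the naive-Bayes-style odds update is an illustrative simplification, not the model's actual inference:

```python
# Reweight a base detector's probability for window i using the labels of
# related "stuff" regions and learned relationship compatibilities.
def tas_posterior(detector_prob, region_clusters, relationships, compat):
    """
    detector_prob:   P(T_i = 1 | W_i) from the base detector.
    region_clusters: S_j, learned cluster index for each related region j.
    relationships:   R_ij, relationship type for each region (e.g. "near").
    compat[r][s]:    learned likelihood ratio for relation r, cluster s.
    """
    odds = detector_prob / (1.0 - detector_prob)
    for s_j, r_ij in zip(region_clusters, relationships):
        odds *= compat[r_ij][s_j]    # context raises or lowers the odds
    return odds / (1.0 + odds)
```

For example, a candidate car window surrounded by “green & textured” regions would have its odds multiplied by ratios below 1, suppressing the false positive.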

TAS Results - Satellite

What about questions that require more than just object bounding boxes?

Finer-grained analysis…
Scene: “man wearing a backpack walking a dog” (man, dog, backpack)
- Objects (context)
- Parts (rough layout): e.g. head, torso, legs for the man; head, fore legs, hind legs for the dog; body for the backpack
- Landmarks (local shape): B_1…B_4, L_1…L_4, T_1…T_4

Shape-based Classification
Goal: classify based on shape characteristics, e.g. “Is the giraffe … or …?” (two poses shown as images).
(Plot: accuracy vs. # training instances per class, with curves for RANDOM, a boosted detector, and GROUND = NN on the “true” shape. Goal: close the gap between the boosted detector and GROUND.)

Classifying Lamps [Submitted to IJCV]
(Two plots: accuracy vs. # training instances per class for NB, BOOSTING, LOOPS, and GROUND; left: wide base (-) vs. thin base (+); right: triangular (-) vs. square (+).)

Learning the Shape Model
Problem: with few training instances, the learned models aren’t robust.
(Figure: training set, MEAN shape, and the leading principal components of shape variation shown at +1 and -1 std.)
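The figure depicts a standard landmark-based PCA shape model. A minimal sketch, assuming aligned landmark coordinates (the function names and two-component truncation are illustrative):

```python
# Fit mean + principal components to aligned landmark shapes, then
# synthesize shapes at +/- 1 std along each component, as in the figure.
import numpy as np

def fit_shape_model(shapes, n_components=2):
    """shapes: (n_instances, 2 * n_landmarks) aligned landmark coords."""
    mean = shapes.mean(axis=0)
    _, S, Vt = np.linalg.svd(shapes - mean, full_matrices=False)
    std = S[:n_components] / np.sqrt(len(shapes))   # per-component std
    return mean, Vt[:n_components], std

def synthesize(mean, components, std, coeffs):
    """Shape at the given coefficients, in units of std per component."""
    return mean + (np.asarray(coeffs) * std) @ components

# The "+1 std" shape along the first principal component:
# mean, comps, std = fit_shape_model(training_shapes)
# shape = synthesize(mean, comps, std, [+1, 0])
```

With few instances the estimated components, and hence the allowable deformations, are noisy; that is the problem the transfer hierarchy on the next slide addresses.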

Undirected Probabilistic Model
F_data: encourage parameters to explain the data.
Divergence: encourage parameters to be similar to their parents.
(Figure: θ_root at the top, with children θ_Elephant and θ_Rhino; the divergence is high on one edge and low on the other.)
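Schematically, and hedged as a reconstruction from the slide's description rather than a formula taken from it, the hierarchy trades data fit against divergence from the parent's parameters:

$$\max_{\{\theta_c\}} \;\; \sum_{c} F_{\text{data}}(\theta_c) \;-\; \lambda \sum_{c} \mathrm{Div}\big(\theta_c,\, \theta_{\mathrm{pa}(c)}\big)$$

If $F_{\text{data}}$ is concave and $\mathrm{Div}$ is convex (e.g. a squared $L_2$ distance), the whole point-estimation problem remains convex.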

Does Hierarchy Help?
(Plot: delta log-loss per instance vs. total number of training instances, for regularized max likelihood on mammal pairs: Bison-Rhino, Elephant-Bison, Elephant-Rhino, Giraffe-Bison, Giraffe-Elephant, Giraffe-Rhino, Llama-Bison, Llama-Elephant, Llama-Giraffe, Llama-Rhino.)
Unregularized max likelihood and shrinkage: much worse, not shown.

How can we use this platform for the vision-to-manipulation transfer task?

Indoor Scene Reconstruction
Motivation: 3D relationships are essential for scene understanding, and most applications must work from monocular input.
Goal: reconstruct the 3D geometry of an indoor scene/workspace from monocular images.
Transfer task: use object detection to add 3D constraints; eventually this will help robotic manipulation.

Indoor Scene Reconstruction
Method: learn how geometric features appear in images:
- depth differences between pairs of points
- co-linearity/co-planarity of triplets of points
- higher-level structures (corners, objects, etc.)
Encode these as “soft” constraints between parts of the scene and use belief propagation to satisfy them (a sketch of the idea follows). Preliminary results: see the CURIS presentation.
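A hedged sketch of the soft-constraint idea, not the project's implementation: treat per-region depths as variables tied by pairwise constraints of varying confidence, and iterate toward a consistent solution. The slide names belief propagation; for Gaussian (quadratic) potentials, the simple Jacobi relaxation below converges to the same MAP depths:

```python
# Reconcile a monocular depth prior with soft pairwise constraints
# d[j] - d[i] ~ delta (weight w) by iterating weighted averages.
import numpy as np

def relax_depths(monocular_prior, constraints, n_iters=200):
    """constraints: list of (i, j, delta, w) meaning d[j] - d[i] ~ delta."""
    prior = np.asarray(monocular_prior, dtype=float)
    d = prior.copy()
    for _ in range(n_iters):
        num = prior.copy()            # unit-weight prior term
        den = np.ones_like(prior)
        for i, j, delta, w in constraints:
            num[j] += w * (d[i] + delta); den[j] += w   # pull d[j] toward d[i]+delta
            num[i] += w * (d[j] - delta); den[i] += w   # pull d[i] toward d[j]-delta
        d = num / den
    return d
```

An object detection could add constraints of exactly this form: if a detected object spans two regions, a high-weight constraint with delta near zero ties their depths together.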