1 Integrating Vision Models for Holistic Scene Understanding. Geremy Heitz, CS223B, March 4th, 2009.

2 Scene/Image Understanding What’s happening in these pictures?

3 Human View of a "Scene": "A car passes a bus on the road, while people walk past a building." (Image labels: road, building, car, bus, people walking.)

4 Computer View of a "Scene": labels such as BUILDING, ROAD, and STREET SCENE. Can we integrate all of these subtasks, so that the whole > the sum of the parts?

5 Outline: Overview; Integrating Vision Models; CCM: Cascaded Classification Models [Heitz et al., NIPS 2008a]; Learning Spatial Context; TAS: Things and Stuff [Heitz & Koller, ECCV 2008]; Future Directions.

6 Image/Scene Understanding: "a man and a dog are walking on a sidewalk in front of a building." The scene spans many levels: primitives; objects and parts (man, dog, backpack, cigarette, building, sidewalk); surfaces and regions; interactions, context, and actions; full scene descriptions. Established techniques address these in isolation, reasoning over image statistics. The complex web of relations among the more abstract entities is well represented by graphical models.

7 Why will integration help? What is this object?

8 More Context Context is key!

9 Outline: Overview; Integrating Vision Models; CCM: Cascaded Classification Models [Heitz et al., NIPS 2008a]; Learning Spatial Context; TAS: Things and Stuff; Future Directions.

10 Human View of a "Scene" (road, building, car, bus, people walking) and the corresponding subtasks: scene categorization, object detection, region labelling, depth reconstruction, surface orientations, boundary/edge detection, outlining/refined localization, occlusion reasoning, ...

11 Related Work: intrinsic images [Barrow and Tenenbaum, 1978], [Tappen et al., 2005]; Hoiem et al., "Closing the Loop in Scene Interpretation", 2008. In contrast, we want to focus more on "semantic" classes, to be flexible enough to use outside models, and to have an extendable framework, not one engineered for a particular set of tasks.

12 How Should we Integrate? Option 1: a single joint model over all variables. Pros: tighter interactions, more designer control. Cons: need expertise in each of the subtasks. Option 2: a simple, flexible combination of existing models through a limited "black-box" interface to components. Pros: state-of-the-art models, easier to extend. Cons: missing some of the modeling power. (Components: detection [Dalal & Triggs, 2006], region labeling [Gould et al., 2007], depth reconstruction [Saxena et al., 2007].)

13 Cascaded Classification Models. Image features f_DET, f_REG, f_REC feed the independent models DET_0, REG_0, REC_0 for object detection, region labeling, and 3D reconstruction; their outputs then feed the context-aware models DET_1, REG_1, REC_1.

14 Integrated Model for Scene Understanding: object detection, multi-class segmentation, depth reconstruction, and scene categorization. I'll show you these.

15 Basic Object Detection. Classes: car, person, motorcycle, boat, sheep, cow. A detection window W is accepted when Score(W) > 0.5.

16 Base Detector - HOG [Dalal & Triggs, CVPR 2006]. HOG detector: a feature vector X fed to an SVM classifier.
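The pipeline on this slide can be sketched roughly as follows (a minimal illustration, not the course or paper code; it assumes scikit-image's hog and scikit-learn's LinearSVC, and fixed-size grayscale training crops with 0/1 labels):

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(window):
    """HOG descriptor for a fixed-size grayscale crop."""
    return hog(window, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)

def train_base_detector(train_crops, train_labels):
    """Fit a linear SVM on HOG features of positive/negative crops."""
    X = np.stack([hog_features(c) for c in train_crops])
    return LinearSVC(C=0.01).fit(X, train_labels)

def score_window(detector, window):
    """Signed SVM score for one candidate window W; threshold it to detect."""
    return float(detector.decision_function(hog_features(window)[None, :])[0])
```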

17 Context-Aware Object Detection. The final classifier P(Y) = Logistic(Φ(W)) combines: from the base detector, the log score D(W); from the scene category, the MAP category and marginals (e.g., scene type: urban scene); from the region labels, how much of each label lies in a window adjacent to W (e.g., % of "road" below W); from the depths, the mean and variance of depths in W and an estimate of the "true" object size.
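One way to picture this second-stage classifier is as a logistic model over a concatenation of the cues listed above; the feature layout below is hypothetical, not the paper's exact Φ(W):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def phi(log_score, scene_marginals, region_fractions, depths_in_window):
    """Assemble contextual features Phi(W) for one candidate window W.
    scene_marginals: per-category scene probabilities
    region_fractions: fraction of each region label adjacent to / below W
    depths_in_window: depth estimates for the pixels inside W"""
    d = np.asarray(depths_in_window, dtype=float)
    return np.concatenate([
        [log_score],                               # base detector log score D(W)
        np.asarray(scene_marginals, dtype=float),  # scene-category cues
        np.asarray(region_fractions, dtype=float), # region-label cues
        [d.mean(), d.var()],                       # depth cues
    ])

def train_context_detector(phi_rows, y):
    """P(Y = 1 | W) = Logistic(w . Phi(W)), fit from labeled candidate windows."""
    return LogisticRegression(max_iter=1000).fit(np.stack(phi_rows), y)
```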

18 Multi-class Segmentation CRF Model [Gould et al., IJCV 2007]. Label each pixel as one of {'grass', 'road', 'sky', etc.} with a conditional random field (CRF) over superpixels. Singleton potentials: log-linear function of boosted detector scores for each class. Pairwise potentials: affinity of classes appearing together, conditioned on (x, y) location within the image.
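As a toy illustration of such a CRF over superpixels (the real singleton and pairwise potentials in Gould et al. are learned; this sketch just scores a labeling with generic unary scores and a Potts-style class-affinity term):

```python
import numpy as np

def segmentation_energy(labels, unary_scores, edges, affinity):
    """Energy of a labeling over superpixels (lower is better).
    labels: (N,) class index per superpixel
    unary_scores: (N, C) per-class scores, e.g. from boosted classifiers
    edges: iterable of (i, j) pairs of adjacent superpixels
    affinity: (C, C) matrix; affinity[a, b] is high when classes a and b
              plausibly appear next to each other."""
    unary = -unary_scores[np.arange(len(labels)), labels].sum()
    pairwise = -sum(affinity[labels[i], labels[j]] for i, j in edges)
    return unary + pairwise
```

MAP inference would then search for the labeling that minimizes this energy, e.g. with graph cuts or message passing.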

19 Context-Aware Multi-class Segmentation. Additional feature: a relative location map ("Where is the grass?").

20 Depth Reconstruction CRF [Saxena et al., PAMI 2008]. Label each pixel with its distance from the camera using a conditional random field (CRF) over superpixels with continuous variables; depth is modeled as a linear function of features with pairwise smoothness constraints.

21 Depth Reconstruction with Context. Treat the depth model as a black box to find d*, then reoptimize the depths with new constraints derived from context (e.g., regions labeled grass or sky): d_CCM = argmin_d α||d - d*|| + β||d - d_CONTEXT||.
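If the two penalties are read as squared Euclidean norms (an assumption; the slide leaves the norm unspecified), the reoptimization decouples per pixel and has a closed form:

```python
import numpy as np

def reoptimize_depths(d_star, d_context, alpha=1.0, beta=1.0):
    """Minimize alpha*||d - d*||^2 + beta*||d - d_context||^2.
    d_star: depths from the black-box reconstruction model
    d_context: depth targets implied by context (e.g., regions labeled sky or grass)
    The minimizer is simply the weighted average of the two estimates."""
    d_star = np.asarray(d_star, dtype=float)
    d_context = np.asarray(d_context, dtype=float)
    return (alpha * d_star + beta * d_context) / (alpha + beta)
```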

22 Training. Notation: I = image, f = image features, Ŷ = output labels. Training regimes: Independent (each first-level model Ŷ_D^0, Ŷ_S^0, Ŷ_Z^0 is trained on its own features f_D, f_S, f_Z alone) and Ground (each second-level model Ŷ_D^1, Ŷ_S^1, Ŷ_Z^1 is trained with the ground-truth outputs Ŷ_S^*, Ŷ_Z^* of the other tasks as additional input).

23 Training: CCM Training Regime. Later models can ignore the mistakes of previous models; training realistically emulates the testing setup; and it allows disjoint datasets. K-CCM: a CCM with K levels of classifiers (level-0 outputs Ŷ_D^0, Ŷ_S^0, Ŷ_Z^0 feed the level-1 models Ŷ_D^1, Ŷ_S^1, Ŷ_Z^1).
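A schematic of the 2-CCM training regime, with generic logistic classifiers standing in for the detection / segmentation / depth models and per-example feature vectors standing in for images (an illustrative sketch, not the actual training code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_2ccm(features, labels):
    """features: dict task -> (N, d_task) array; labels: dict task -> (N,) array.
    Level 0: each task is trained independently on its own features.
    Level 1: each task is retrained on its features plus the level-0 outputs
    of the other tasks, so later models see (and can correct for) earlier mistakes."""
    tasks = list(features)
    level0 = {t: LogisticRegression(max_iter=1000).fit(features[t], labels[t])
              for t in tasks}
    out0 = {t: level0[t].predict_proba(features[t]) for t in tasks}

    level1 = {}
    for t in tasks:
        context = np.hstack([out0[s] for s in tasks if s != t])
        level1[t] = LogisticRegression(max_iter=1000).fit(
            np.hstack([features[t], context]), labels[t])
    return level0, level1
```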

24 Experiments. DS1: 422 images, fully labeled; categorization, detection, multi-class segmentation; 5-fold cross-validation. DS2: images with disjoint labels; detection, multi-class segmentation, 3D reconstruction; 997 train, 748 test.

25 CCM Results – DS1. (Charts: detection results for car, pedestrian, motorbike, and boat; scene categories; region labels.)

26 CCM Results – DS2. (Tables comparing INDEP vs. 2-CCM: detection for car, person, bike, boat, sheep, and cow plus depth error in meters; region labeling for tree, road, grass, water, sky, building, and foreground. The numeric entries did not survive this transcript. Example images: boats.)

27 Example Results. (Side-by-side images: independent models vs. CCM.)

28 Example Results. (Panels: independent objects, independent regions, CCM objects; independent objects, independent regions, CCM regions.)

29 Understanding the man “a man, a dog, a sidewalk, a building”

30 Outline: Overview; Integrating Vision Models; CCM: Cascaded Classification Models; Learning Spatial Context; TAS: Things and Stuff [Heitz & Koller, ECCV 2008]; Future Directions.

31 Things vs. Stuff Stuff (n): Material defined by a homogeneous or repetitive pattern of fine-scale properties, but has no specific or distinctive spatial extent or shape. (REGIONS) Thing (n): An object with a specific size and shape. (DETECTIONS) From: Forsyth et al. Finding pictures of objects in large collections of images. Object Representation in Computer Vision, 1996.

32 Cascaded Classification Models (recap). Image features f_DET, f_REG, f_REC feed the independent models DET_0, REG_0, REC_0 for object detection, region labeling, and 3D reconstruction, whose outputs feed the context-aware models DET_1, REG_1, REC_1.

33 CCMs vs. TAS. CCM is feedforward: image features f_DET, f_REG feed DET_0 and REG_0, which in turn feed DET_1 and REG_1. In TAS, detection and region labels are modeled jointly: f_DET and f_REG feed DET and REG, which are coupled through relationships.

34 Satellite Detection Example. (Image with detections marked as false positive and true positive.)

35 Stuff-Thing Context. Stuff-thing context is based on spatial relationships. Intuition: trees mean no cars, houses mean cars nearby, road means cars here; "cars drive on roads", "cows graze on grass", "boats sail on water". Goal: learn this context unsupervised.

36 Things. Detection variable T_i ∈ {0, 1}; T_i = 1 means the candidate image window W_i contains a positive detection. P(T_i) = Logistic(score(W_i)).

37 Stuff. Coherent image regions: coarse "superpixels", each with a feature vector F_j in R^n and a cluster label S_j in {1…C}. Stuff model: naïve Bayes over (S_j, F_j).

38 Relationships. Descriptive relations ("near", "above", "in front of", etc.): choose a set R = {r_1 … r_K}, where R_ijk = 1 means detection i and region j have relation k. Example: with clusters S_72 = trees, S_4 = houses, S_10 = road, the relation R_{1,10,in} = 1 links candidate T_1 to a road region.

39 Unrolled Model. Candidate windows T_1, T_2, T_3 and image regions S_1 … S_5 are connected by relation variables, e.g., R_{2,1,above} = 0, R_{3,1,left} = 1, R_{1,3,near} = 0, R_{3,3,in} = 1, R_{1,1,left} = 1.

40 Learning the Parameters. Assume we know R; S_j is hidden and everything else is observed (T_i is supervised in the training set; F_j and the image windows W_i are always observed; S_j is always hidden). Learning uses Expectation-Maximization, a form of "contextual clustering", and the resulting parameters are readily interpretable. (Plate model over R_ijk, T_i, S_j, F_j with N windows, J regions, and K relations.)
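A stripped-down version of this EM under strong simplifying assumptions (spherical-Gaussian region features, a single relation per window-region pair, and a per-cluster Bernoulli over whether related candidates are true detections); this is a sketch of the "contextual clustering" idea, not the TAS model itself:

```python
import numpy as np

def contextual_clustering(F, T, related, C=5, iters=50, seed=0):
    """F: (J, n) region features; T: (N,) 0/1 detector labels (supervised at training time);
    related: (N, J) 0/1, related[i, j] = 1 if window i bears the chosen relation to region j.
    Cluster c has weight pi[c], Gaussian mean mu[c], and
    p[c] = P(a related window is a true detection | region in cluster c)."""
    F, T, related = (np.asarray(a, dtype=float) for a in (F, T, related))
    rng = np.random.default_rng(seed)
    J, _ = F.shape
    mu = F[rng.choice(J, C, replace=False)]
    pi = np.full(C, 1.0 / C)
    p = np.full(C, 0.5)
    pos = related.T @ T           # true detections related to each region
    tot = related.sum(axis=0)     # windows related to each region

    for _ in range(iters):
        # E-step: log-responsibility of each cluster for each region
        logr = np.log(pi) - 0.5 * ((F[:, None, :] - mu[None]) ** 2).sum(axis=-1)
        logr += np.outer(pos, np.log(p + 1e-9)) + np.outer(tot - pos, np.log(1 - p + 1e-9))
        r = np.exp(logr - logr.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)

        # M-step: update cluster weights, means, and detection probabilities
        pi = r.mean(axis=0)
        mu = (r.T @ F) / r.sum(axis=0)[:, None]
        p = (r.T @ pos) / np.maximum(r.T @ tot, 1e-9)
    return pi, mu, p, r
```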

41 Which Relationships to Use? R_ijk = spatial relationship between candidate i and region j: R_ij1 = candidate in region; R_ij2 = candidate closer than 2 bounding boxes (BBs) to region; R_ij3 = candidate closer than 4 BBs to region; R_ij4 = candidate farther than 8 BBs from region; R_ij5 = candidate 2 BBs left of region; R_ij6 = candidate 2 BBs right of region; R_ij7 = candidate 2 BBs below region; R_ij8 = candidate more than 2 and less than 4 BBs from region; …; R_ijK = candidate near region boundary. How do we avoid overfitting? (A sketch of computing a few of these relations follows.)
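A rough sketch of how a few of these relations might be computed from a candidate bounding box and a binary region mask (hypothetical helpers; distances are measured in multiples of the box width, which is an assumption about what "bounding boxes" means here):

```python
import numpy as np

def box_center(box):
    """box = (x_min, y_min, x_max, y_max); returns (cx, cy)."""
    return np.array([(box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0])

def region_distance_in_bbs(box, region_mask):
    """Distance from the box center to the nearest region pixel, in box widths."""
    ys, xs = np.nonzero(region_mask)
    if len(xs) == 0:
        return np.inf
    cx, cy = box_center(box)
    d = np.sqrt((xs - cx) ** 2 + (ys - cy) ** 2).min()
    return d / max(box[2] - box[0], 1e-9)

def relations(box, region_mask):
    """An illustrative subset of the R_ij* relations listed on the slide."""
    d = region_distance_in_bbs(box, region_mask)
    cx, cy = box_center(box)
    h, w = region_mask.shape
    iy = int(np.clip(round(cy), 0, h - 1))
    ix = int(np.clip(round(cx), 0, w - 1))
    return {
        "in_region": bool(region_mask[iy, ix]),  # R_ij1
        "closer_than_2bb": d < 2,                # R_ij2
        "closer_than_4bb": d < 4,                # R_ij3
        "farther_than_8bb": d > 8,               # R_ij4
    }
```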

42 Learning the TAS Relations. Intuition: a "detached" R_ijk is an inactive relationship. Structural EM iterates: learn the parameters, then decide which edge to toggle. Candidate structures are evaluated with ℓ(T | F, W, R), which requires inference but gives better results than using the standard expected complete-data likelihood E[ℓ(T, S, F, W, R)].

43 Inference. Goal: the posterior over the detections. Approach: block Gibbs sampling; it is easy to sample the T_i's given the S_j's, and vice versa.
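A generic block Gibbs sampler in that alternating style (a sketch only: the two conditional samplers below are hypothetical placeholders for the model's actual conditionals):

```python
import numpy as np

def block_gibbs(sample_T_given_S, sample_S_given_T, S_init,
                n_samples=1000, burn_in=100, seed=0):
    """Alternate sampling the full block of T's given the S's and vice versa.
    sample_T_given_S(S, rng) and sample_S_given_T(T, rng) stand in for the
    model's conditional distributions. Returns the retained samples of T."""
    rng = np.random.default_rng(seed)
    S = S_init
    samples = []
    for it in range(n_samples + burn_in):
        T = sample_T_given_S(S, rng)
        S = sample_S_given_T(T, rng)
        if it >= burn_in:
            samples.append(T)
    return samples
```

Posterior marginals such as P(T_i = 1) can then be estimated as the fraction of retained samples in which T_i = 1.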

44 Learned Satellite Clusters

45 Results - Satellite. (Panels: prior, detector only; posterior detections; posterior region labels.)

46 Discovered Context - Bicycles. (Example images: bicycles and stuff cluster #3.)

47 TAS Results – Bicycles. Examples show TAS both discovering "true positives" and removing "false positives". (Image: candidate BIKE detections marked with question marks.)

48 Results – VOC 2005. (Plots comparing TAS against the base detector.)

49 Understanding the man: "a man and a dog on a sidewalk, in front of a building".

50 Outline: Overview; Integrating Vision Models; CCM: Cascaded Classification Models; Learning Spatial Context; TAS: Things and Stuff; Future Directions.

51 Shape models for segmentation. We have a good deformable shape model (LOOPS) for outlining objects and good models for segmenting objects; let's combine them by adding terms that encourage landmarks to lie on segmentation boundaries. Ben Packer is working on this. (Panels: outline, segmentation, joint outline, joint segmentation; landmark and segmentation-mask variables.)

52 Refined Segmentation. Our segmentation only knows about pixel "classes"; what about objects? Steve Gould is working on this. (Model variables: region class, region appearance, pixel/region assignment, pixel appearance.)

53 Full TAS-like Integration. Extend the R_ijk, T_i, S_j model to also incorporate depths, occlusion edges, surface edges, and shape models.