Cascaded Classification Models


1 Cascaded Classification Models
Combining Models for Holistic Scene Understanding
Geremy Heitz, Stephen Gould, Ashutosh Saxena, Daphne Koller
Stanford University
NIPS 2008, December 11, 2008

2 Outline
Understanding Scene Understanding
Related Work
CCM Framework
Results

3 Human View of a “Scene”
Scene labels: BUILDING, PEOPLE, BUS, CAR, ROAD
“A car passes a bus on the road, while people walk past a building.”

4 Computer View of a “Scene”
Scene labels: BUILDING, ROAD; scene category: STREET SCENE.
Can we integrate all of these subtasks, so that the whole is greater than the sum of its parts?

5 Related Work
Intrinsic Images [Barrow and Tenenbaum, 1978], [Tappen et al., 2005].
Hoiem et al., “Closing the Loop in Scene Interpretation”, 2008.
We want to focus more on “semantic” classes, and to be flexible enough to use outside models.
Limitations of Hoiem et al.: 1) required every output to be in the form of an image; 2) used only models the authors had developed themselves over the preceding years; 3) at joint learning time, only “surfaces” and “edges/occlusions” were learned; the other models were pre-trained ahead of time.

6 How Should We Integrate?
Option 1: a single joint model over all variables.
  Pros: tighter interactions, more designer control.
  Cons: requires expertise in each of the subtasks.
Option 2: a simple, flexible combination of existing models.
  Pros: state-of-the-art components, easier to extend.
  Requires: only a limited “black-box” interface to the components.
  Cons: misses some of the modeling power.
Component models: DETECTION [Dalal & Triggs, 2006], REGION LABELING [Gould et al., 2007], DEPTH RECONSTRUCTION [Saxena et al., 2007].

7 Other Opportunities for Integration
Audio signals: source separation, speaker recognition, speech recognition.
Text understanding: “Mr. Obama sent himself an important reminder.”
  Part-of-speech tagging: noun, verb, adj.
  Semantic role identification: Verb: sent; Sender: Mr. Obama; Receiver: himself; Content: reminder.
  Anaphora resolution.

8 Outline
Understanding Scene Understanding
Related Work
CCM Framework
Results

9 Cascaded Classification Models
Tasks: object detection (DET), region labeling (REG), 3D reconstruction (REC).
Image features fDET, fREG, fREC feed the independent models DET0, REG0, REC0; their outputs in turn feed the context-aware models DET1, REG1, REC1.

10 Integrated Model for Scene Understanding
Components: object detection, region labeling, depth reconstruction, scene categorization. I’ll show you these.

11 Basic Object Detection
Sliding-window detection: compute a score for each detection window W; report a detection when Score(W) > 0.5.
Object classes: car, person, motorcycle, boat, sheep, cow.
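The sliding-window loop on this slide can be sketched as follows. This is an illustrative stand-in, not the talk's actual detector: `score_fn` is a hypothetical scoring function (the real one would be a trained classifier such as a HOG detector).

```python
# Sliding-window detection sketch: enumerate windows, score each one,
# keep those with Score(W) > 0.5 as on the slide.

def sliding_windows(img_w, img_h, win_w, win_h, stride):
    """Yield (x, y, w, h) for every window that fits inside the image."""
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            yield (x, y, win_w, win_h)

def detect(score_fn, img_w, img_h, win_w=64, win_h=128, stride=16, thresh=0.5):
    """Return every window whose score exceeds the threshold."""
    return [w for w in sliding_windows(img_w, img_h, win_w, win_h, stride)
            if score_fn(w) > thresh]
```

In practice the same scan is repeated over an image pyramid to handle scale; that is omitted here for brevity.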

12 Context-Aware Object Detection
From the scene category: MAP category and marginals (e.g., scene type: urban scene).
From the region labels: how much of each label is in a window adjacent to W (e.g., % of “building” above W).
From the depths: mean and variance of depths (e.g., within W), and an estimate of the object’s “true” size.
Final classifier: P(Y) = Logistic(Φ(W)).
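The final classifier P(Y) = Logistic(Φ(W)) can be sketched by stacking the context cues named on this slide into one feature vector. The specific features and weights below are illustrative assumptions, not the learned model from the talk.

```python
import math

def logistic(z):
    """Standard logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))

def context_features(base_score, pct_building_above, depth_var_in_window,
                     scene_is_urban):
    """Phi(W): the independent detector score plus context cues."""
    return [1.0,                            # bias term
            base_score,                     # independent detector score
            pct_building_above,             # % of "building" above W
            depth_var_in_window,            # variance of depths inside W
            1.0 if scene_is_urban else 0.0] # MAP scene category indicator

def p_object(phi, weights):
    """P(Y = 1 | W) as a logistic function of the context features."""
    return logistic(sum(w * f for w, f in zip(weights, phi)))
```

With a positive weight on the base score, a stronger independent detection yields a higher context-aware probability, which is the intended monotone behavior.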

13 Region Labeling CRF Model
Label each pixel as one of {‘grass’, ‘road’, ‘sky’, ...}.
Conditional random field (CRF) over superpixels:
  Singleton potentials: log-linear function of boosted detector scores for each class.
  Pairwise potentials: affinity of classes appearing together, conditioned on (x, y) location within the image.
[Gould et al., IJCV 2007]
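The energy this CRF minimizes can be sketched on a toy graph. The score tables below are made-up stand-ins for the slide's log-linear singleton potentials, and the pairwise term here is a plain label-affinity table (the slide additionally conditions it on image location, omitted for brevity).

```python
import itertools

LABELS = ["grass", "road", "sky"]

def energy(labeling, singleton, pairwise, edges):
    """Negative log-probability (up to a constant) of one labeling:
    singleton cost per superpixel plus pairwise cost per adjacent pair."""
    e = sum(singleton[i][lab] for i, lab in enumerate(labeling))
    e += sum(pairwise[(labeling[i], labeling[j])] for i, j in edges)
    return e

def map_labeling(n, singleton, pairwise, edges):
    """Brute-force MAP over all assignments (fine only for tiny graphs)."""
    return min(itertools.product(LABELS, repeat=n),
               key=lambda lab: energy(lab, singleton, pairwise, edges))
```

Real implementations use graph cuts or message passing instead of brute force; the enumeration here only makes the objective concrete.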

14 Context-Aware Region Labeling
Where is the grass? Additional feature: the relative location map.

15 Depth Reconstruction CRF
Label each pixel with its distance from the camera.
Conditional random field (CRF) over superpixels with continuous variables: models depth as a linear function of features, with pairwise smoothness constraints.
[Saxena et al., PAMI 2008]

16 Depth Reconstruction with Context
Context constraints: grass is horizontal; sky is far away (GRASS, SKY regions).
Treat the depth model as a black box: find d*, then re-optimize the depths with the new constraints:
  dCCM = argmin α||d - d*|| + β||d - dCONTEXT||
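If the two penalties in the objective above are read as squared L2 norms (an assumption; the slide does not spell this out), the problem decouples per pixel and the minimizer is simply a weighted average of the black-box depths d* and the context-implied depths:

```python
def refit_depths(d_star, d_context, alpha, beta):
    """Per-pixel minimizer of alpha*||d - d*||^2 + beta*||d - d_context||^2,
    which is the weighted average (alpha*d* + beta*d_context) / (alpha + beta)."""
    return [(alpha * a + beta * b) / (alpha + beta)
            for a, b in zip(d_star, d_context)]
```

Raising β pulls the solution toward the context constraints (grass horizontal, sky far); raising α keeps it close to the original reconstruction.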

17 Training
Notation: I: image; f: image features; Ŷ: output labels (D, S, Z index the detection, segmentation, and depth tasks; Ŷ* denotes ground truth).
Training regimes: Independent (each model trained on image features alone) and Ground (second-level models trained with the ground-truth outputs Ŷ* of the other tasks as context input).

18 CCM Training Regime
Later models can learn to correct the mistakes of previous models.
Training realistically emulates the testing setup.
Allows disjoint datasets.
K-CCM: a CCM with K levels of classifiers.
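The data flow of this training regime can be sketched as a loop over cascade levels: level-0 models see only image features, and each later level also sees the previous level's predictions for the other tasks. Here `fit` and `predict` are placeholders for the black-box component models (detector, region labeler, depth estimator), not the actual learners from the talk.

```python
# K-CCM training sketch: train each task's model per level, passing the
# previous level's predictions of the OTHER tasks as context features.

def train_ccm(feats, labels, fit, predict, k_levels=2):
    """feats/labels: dicts mapping task name -> data.
    Returns one dict of trained models per cascade level."""
    tasks = list(feats)
    levels = []
    prev_preds = {t: None for t in tasks}  # level 0 has no context yet
    for _ in range(k_levels):
        models = {}
        for t in tasks:
            # Context = previous-level predictions of all other tasks.
            ctx = {u: prev_preds[u] for u in tasks if u != t}
            models[t] = fit(feats[t], labels[t], ctx)
        # Predictions made here feed the next level as context.
        prev_preds = {t: predict(models[t], feats[t]) for t in tasks}
        levels.append(models)
    return levels
```

Because each model only needs its own features and labels plus the other tasks' predictions, the tasks can be trained from disjoint datasets, as the slide notes.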

19 Experiments
DS1: 422 images, fully labeled. Tasks: categorization, detection, multi-class segmentation. Evaluation: 5-fold cross-validation.
DS2: 1745 images, disjoint labels. Tasks: detection, multi-class segmentation, 3D reconstruction. Split: 997 train, 748 test.

20 CCM Results – DS1
[Plots: categorization accuracy, region-label accuracy, and detection results for pedestrian, car, motorbike, and boat.]

21 CCM Results – DS2

Detection and depth error:
           Car    Person  Bike   Boat   Sheep  Cow    Depth
  INDEP    0.357  0.267   0.410  0.096  0.319  0.395  16.7m
  2-CCM    0.364  0.272   0.212  0.289  0.415         15.4m

Region labels:
           Tree   Road   Grass  Water  Sky    Building  FG
  INDEP    0.541  0.702  0.859  0.444  0.924  0.436     0.828
  2-CCM    0.581  0.692  0.860  0.565  0.930  0.489     0.819

Boats & water confusion (rows: true class, columns: predicted class):
  INDEP:  True Road: Road 4946, Water 251;  True Water: Road 1150, Water 2144
  2-CCM:  True Road: Road 4878, Water 322;  True Water: Road 820, Water 2730
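The boats-and-water improvement the slide highlights can be read off the confusion counts above by computing per-class recall (true positives over all true instances of the class):

```python
def recall(true_pos, false_neg):
    """Per-class recall from confusion-matrix counts."""
    return true_pos / (true_pos + false_neg)

# Water recall from the confusion counts on this slide.
indep_water = recall(2144, 1150)  # INDEP: water correctly labeled water
ccm_water = recall(2730, 820)     # 2-CCM
```

This works out to roughly 0.65 water recall for the independent model versus roughly 0.77 for the 2-CCM, at the cost of slightly more road pixels labeled water.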

22 Example Results
Side by side: INDEPENDENT vs. CCM.

23 Example Results
Shown: independent objects, independent regions, CCM objects, CCM regions.

24 CCM Summary
The various subtasks of computer vision do indeed interact through context cues.
A simple framework can allow off-the-shelf, black-box methods to improve each other.
Future directions: can we train in more sophisticated ways? Downstream models re-training upstream ones; something like EM for missing labels; other applications.

25 Thanks!

