Cascaded Classification Models: Combining Models for Holistic Scene Understanding. Geremy Heitz, Stephen Gould, Ashutosh Saxena, Daphne Koller. Stanford University. NIPS 2008, December 11, 2008
Outline Understanding Scene Understanding Related Work CCM Framework Results
Human View of a “Scene”: BUILDING, PEOPLE, BUS, CAR, ROAD. “A car passes a bus on the road, while people walk past a building.”
Computer View of a “Scene” BUILDING ROAD Can we integrate all of these subtasks, so that whole > sum of parts ? STREET SCENE
Related Work. Intrinsic Images [Barrow and Tenenbaum, 1978], [Tappen et al., 2005]. Hoiem et al., “Closing the Loop in Scene Interpretation”, 2008. We want to focus more on “semantic” classes, and to be flexible enough to use outside models. Limitations of Hoiem et al.: 1) required each subtask’s output to be in the form of an image, 2) relied on the authors’ own purpose-built models, and 3) at joint learning time, only the “surfaces” and “edges/occlusions” models were learned; the other models were pre-trained ahead of time.
How Should We Integrate? Option 1: a single joint model over all variables. Pros: tighter interactions, more designer control. Cons: need expertise in each of the subtasks. Option 2: a simple, flexible combination of existing models. Pros: state-of-the-art models, easier to extend. Requires: a limited “black-box” interface to components. Cons: missing some of the modeling power. DETECTION [Dalal & Triggs, 2006], REGION LABELING [Gould et al., 2007], DEPTH RECONSTRUCTION [Saxena et al., 2007]
Other Opportunities for Integration Text Understanding Audio Signals Source Separation Speaker Recognition Speech Recognition Part-of-speech tagger noun verb adj “Mr. Obama sent himself an important reminder.” Semantic Role Identification Verb: sent Sender: Mr. Obama Receiver: himself Content: reminder Anaphora Resolution
Outline Understanding Scene Understanding Related Work CCM Framework Results
Cascaded Classification Models Image Features fDET fREG fREC DET1 REG1 REC1 DET0 Independent Models REG0 REC0 Context-aware Models Object Detection Region Labeling 3D Reconstruction
Integrated Model for Scene Understanding Object Detection Region Labeling Depth Reconstruction Scene Categorization I’ll show you these
Basic Object Detection Detection Window W = Car = Person Sliding window detection, score for each window = Motorcycle = Boat = Sheep = Cow Score(W) > 0.5
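The sliding-window scheme above can be sketched as follows. This is a minimal illustration, not the authors' detector: `score_fn` stands in for any window classifier (e.g. a HOG-based scorer), and all names and the stride value are hypothetical.

```python
import numpy as np

def sliding_window_scores(image, window_shape, score_fn, stride=8):
    """Score every window of size window_shape in the image with score_fn.

    Returns a list of ((row, col), score) pairs, one per candidate window.
    """
    h, w = image.shape[:2]
    wh, ww = window_shape
    results = []
    for r in range(0, h - wh + 1, stride):
        for c in range(0, w - ww + 1, stride):
            patch = image[r:r + wh, c:c + ww]
            results.append(((r, c), score_fn(patch)))
    return results

def detections(scored_windows, threshold=0.5):
    """Keep windows whose score exceeds the threshold (Score(W) > 0.5)."""
    return [(pos, s) for pos, s in scored_windows if s > threshold]
```

In practice the same scan is repeated over an image pyramid to handle scale; that is omitted here for brevity.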
Context-Aware Object Detection Scene Type: Urban scene From Scene Category MAP category, marginals From Region Labels How much of each label is in a window adjacent to W From Depths Mean, variance of depths, estimate of “true” object size Final Classifier % of “building” above W Variance of depths in W P(Y) = Logistic(Φ(W))
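The final context-aware classifier is a logistic function of a feature vector Φ(W) that concatenates the base detector score with cues from the other modules. A minimal sketch; the feature grouping and all names here are illustrative assumptions, not the paper's exact feature set.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def context_window_features(base_score, scene_marginals, region_fracs, depths_in_window):
    """Assemble a feature vector Phi(W) for one candidate window W.

    Inputs stand in for the outputs of the other modules:
    scene category marginals, region-label fractions near W,
    and depth statistics inside W.
    """
    return np.concatenate([
        [base_score],                 # independent detector score
        scene_marginals,              # P(scene category) from the categorizer
        region_fracs,                 # e.g. % of "building" above W
        [np.mean(depths_in_window),   # mean and variance of depths in W
         np.var(depths_in_window)],
    ])

def detection_probability(weights, bias, phi):
    # P(Y = object | W) = Logistic(w . Phi(W) + b)
    return logistic(np.dot(weights, phi) + bias)
```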
Region Labeling CRF Model. Label each pixel as one of {‘grass’, ‘road’, ‘sky’, etc.}. Conditional random field (CRF) over superpixels. Singleton potentials: log-linear function of boosted detector scores for each class. Pairwise potentials: affinity of classes appearing together, conditioned on (x, y) location within the image. [Gould et al., IJCV 2007]
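The CRF above can be summarized by its energy (the negative log-probability up to a constant): singleton potentials score each superpixel's label, and pairwise potentials score labels on adjacent superpixels. A minimal sketch with hypothetical names and toy potential tables; the actual model conditions the pairwise term on image location, which is omitted here.

```python
import numpy as np

def crf_energy(labels, unary, pairwise, edges):
    """Energy of one superpixel labeling under a simple CRF.

    labels:   label index for each superpixel
    unary:    unary[i, l] = singleton potential of superpixel i taking label l
              (e.g. a log-linear function of boosted detector scores)
    pairwise: pairwise[l1, l2] = cost of labels l1, l2 on adjacent superpixels
    edges:    list of (i, j) pairs of adjacent superpixels
    """
    e = sum(unary[i, labels[i]] for i in range(len(labels)))
    e += sum(pairwise[labels[i], labels[j]] for i, j in edges)
    return e
```

Inference then searches for the labeling minimizing this energy (e.g. with graph cuts or message passing), which is not shown here.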
Context-Aware Region Labeling Where is the grass? Additional Feature: Relative Location Map
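One way to realize a relative location map, sketched under the assumption that the map records where a target class tends to appear relative to the centroid of a reference class across training images. The function name, normalization, and binning scheme are illustrative, not the paper's exact construction.

```python
import numpy as np

def learn_relative_location_map(label_images, ref_class, target_class, map_size=16):
    """Accumulate where target_class pixels fall relative to the centroid
    of ref_class over training label images. Offsets are normalized by
    image size and binned into a map_size x map_size grid of frequencies."""
    votes = np.zeros((map_size, map_size))
    for labels in label_images:
        h, w = labels.shape
        ref = np.argwhere(labels == ref_class)
        if len(ref) == 0:
            continue
        cy, cx = ref.mean(axis=0)
        for y, x in np.argwhere(labels == target_class):
            by = int((y - cy) / h * map_size / 2 + map_size / 2)
            bx = int((x - cx) / w * map_size / 2 + map_size / 2)
            if 0 <= by < map_size and 0 <= bx < map_size:
                votes[by, bx] += 1
    total = votes.sum()
    return votes / total if total > 0 else votes
```

At test time, detected objects would look up this map to vote for likely locations of each region label (e.g. "grass tends to lie below cows").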
Depth Reconstruction CRF. Label each pixel with its distance from the camera. Conditional random field (CRF) over superpixels with continuous variables. Models depth as a linear function of features, with pairwise smoothness constraints. [Saxena et al., PAMI 2008]
Depth Reconstruction with Context. Grass is horizontal; sky is far away. GRASS, SKY. The depth model is a black box that finds d*. Reoptimize depths with the new constraints: d_CCM = argmin_d α||d − d*|| + β||d − d_context||
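Assuming squared Euclidean norms in the objective (the slide omits the exponents), the reoptimization has a simple closed form: each depth is pulled toward a weighted average of the black-box estimate d* and the context-derived target. A minimal sketch; the function and argument names are hypothetical.

```python
import numpy as np

def reoptimize_depths(d_star, d_context, alpha, beta):
    """Minimize alpha*||d - d_star||^2 + beta*||d - d_context||^2.

    Setting the gradient to zero gives the per-pixel weighted mean,
    which trades off the black-box depths against context constraints
    (e.g. grass is horizontal, sky is far away).
    """
    d_star = np.asarray(d_star, dtype=float)
    d_context = np.asarray(d_context, dtype=float)
    return (alpha * d_star + beta * d_context) / (alpha + beta)
```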
Training. Notation: I = image, f = image features, Ŷ = output labels (subscripts D, S, Z denote detection, segmentation, and depth). Two baseline training regimes: Independent, where each model is trained on image features alone, and Ground, where each context-aware model is trained with the ground-truth outputs of the other tasks as input.
Training. The CCM training regime: each tier of models is trained on the actual outputs of the previous tier. Later models can ignore the mistakes of previous models; training realistically emulates the testing setup; and the framework allows disjoint datasets. K-CCM: a CCM with K levels of classifiers.
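The cascade training regime can be sketched as below. The `train_ccm` function and the model-factory interface are hypothetical, but the structure follows the slides: tier 0 is trained on image features alone, and each later tier also sees the other tasks' previous-tier predictions, so training emulates the test-time cascade.

```python
def train_ccm(models, image_features, labels, K=2):
    """Train a K-level cascade of classifiers.

    models:         dict name -> factory returning an object with fit/predict
    image_features: one feature list per training example
    labels:         dict name -> per-example target labels for that task
    """
    tiers = []
    prev_outputs = {}
    for level in range(K):
        tier, outputs = {}, {}
        for name, make_model in models.items():
            X = [list(f) for f in image_features]  # fresh copy of base features
            if level > 0:
                # Append the other tasks' tier-(level-1) predictions as context.
                for other, preds in prev_outputs.items():
                    if other != name:
                        for x, p in zip(X, preds):
                            x.append(p)
            m = make_model()
            m.fit(X, labels[name])
            tier[name] = m
            outputs[name] = m.predict(X)  # outputs feed the next tier
        tiers.append(tier)
        prev_outputs = outputs
    return tiers
```

Because each tier is trained on predicted (not ground-truth) context, the tasks can even come from disjoint datasets, as in DS2 below.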
Experiments. DS1: 422 images, fully labeled; tasks: categorization, detection, multi-class segmentation; evaluated with 5-fold cross validation. DS2: 1745 images, disjoint labels; tasks: detection, multi-class segmentation, 3D reconstruction; 997 train, 748 test.
CCM Results – DS1 CATEGORIES PEDESTRIAN CAR REGION LABELS MOTORBIKE BOAT
CCM Results – DS2

Detection     Car    Person  Bike   Boat   Sheep  Cow    Depth
INDEP         0.357  0.267   0.410  0.096  0.319  0.395  16.7m
2-CCM         0.364  0.272   0.212  0.289  0.415         15.4m

Regions       Tree   Road   Grass  Water  Sky    Building  FG
INDEP         0.541  0.702  0.859  0.444  0.924  0.436     0.828
2-CCM         0.581  0.692  0.860  0.565  0.930  0.489     0.819

Boats & Water (road/water confusion):

INDEP         Pred. Road  Pred. Water
True Road     4946        251
True Water    1150        2144

2-CCM         Pred. Road  Pred. Water
True Road     4878        322
True Water    820         2730
Example Results INDEPENDENT CCM
Example Results Independent Objects Independent Regions CCM Objects CCM Regions
CCM Summary The various subtasks of computer vision do indeed interact through context cues A simple framework can allow off-the-shelf, black-box methods to improve each other Can we train in more sophisticated ways? Downstream models re-train upstream ones Something like EM for missing labels Other applications
Thanks!