Max-Margin Training of Upstream Scene Understanding Models
Jun Zhu, Carnegie Mellon University
Joint work with Li-Jia Li*, Li Fei-Fei*, and Eric P. Xing (* Stanford University)
How to Represent a Scene Image?
Seeing the forest before the trees
– Fast scene categorization with gist features
– Oliva & Torralba: global properties (e.g., openness, mean depth, expansion) for scene gist
– Murphy et al., 2003: use the gist features to see the trees (i.e., to recognize objects)
But the trees compose the forest …
– Object recognition is critical for scene categorization
– Sudderth et al., 2005; Fei-Fei et al., 2005
[Figure: example images labeled badminton, bocce, croquet; caption: "This is a forest scene."]
CMU, March, 2010
Upstream Scene Understanding Models
Erik Sudderth’s “Scene, Object, and Parts” model (CVPR 2005)
– The model is estimated with MLE
Upstream Scene Understanding Models
Kevin Murphy’s “Forest & Tree” model (NIPS 2003)
– The model is estimated with MLE
Upstream Scene Understanding Models
Fei-Fei’s “Total Scene Understanding” model (CVPR 2009)
– The model is estimated with MLE
[Figure labels: Athlete, Horse, Grass, Trees, Sky, Saddle; class: Polo]
We want to answer …
– Are we satisfied with the MLE method?
– Can we learn scene understanding models …
A Simple Working Example
Joint scene categorization and object annotation model
Global features:
– Can be arbitrary!
– Gist (Oliva & Torralba, 2001)
– Sparse SIFT codes (Yang, Yu, Gong & Huang, 2009)
Problem with MLE
– Model
– Joint distribution
– Prediction rules for scene …
Problem with MLE
– Model
– Joint distribution
– Maximum likelihood estimation
Decoupling! Scene Classification | Object Annotation
Problem with MLE
– Model
– Joint distribution
– Weak coupling
Max-Margin Training to Achieve Strong Coupling
Hint: although MLE decouples the scene model and the object model, the joint prediction rule couples them.
Discriminant function & hinge …
Max-Margin Training to Achieve Strong Coupling
Hint: although MLE decouples the scene model and the object model, the joint prediction rule couples them.
Regularized hinge loss minimization
– The hinge loss couples both the scene and object models, while the log-loss is defined on the scene model alone
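To make the contrast between the two losses concrete, here is a minimal sketch of a multi-class hinge loss, a log-loss, and a regularized hinge objective. It is an illustration under stated assumptions, not the talk's formulation: the scores are plain per-class dot products with a hypothetical `weights` matrix, and the margin cost is a 0/1 cost, whereas the discriminant function in the talk combines the scene and object models.

```python
import math

def multiclass_hinge(scores, y_true, cost=1.0):
    """max over y of [cost * 1(y != y_true) + F(y)], minus F(y_true)."""
    augmented = [s + (cost if y != y_true else 0.0) for y, s in enumerate(scores)]
    return max(augmented) - scores[y_true]

def log_loss(scores, y_true):
    """Negative log-softmax probability of the true class."""
    top = max(scores)  # subtract the max for numerical stability
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_norm - scores[y_true]

def regularized_hinge_objective(weights, examples, C=1.0):
    """0.5 * ||w||^2 + C * sum of hinge losses.

    weights[k] is a hypothetical per-class weight vector (an assumption for
    this sketch); each example is a (feature_vector, true_class) pair.
    """
    reg = 0.5 * sum(w * w for row in weights for w in row)
    total = 0.0
    for x, y in examples:
        scores = [sum(wk * xk for wk, xk in zip(row, x)) for row in weights]
        total += multiclass_hinge(scores, y)
    return reg + C * total
```

The hinge loss is zero once the true class beats every other class by the margin cost, while the log-loss stays positive for any finite scores.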
Solving the Optimization Problem
– Approximation to the intractable log-likelihood
– The optimization …
EM-style Algorithm
Posterior inference (inner-max problem):
Parameter estimation (outer-min problem):
– Alternating minimization (next …)
Alternating Minimization
Closed-form solutions
– Gaussian parameters (cf. MLE for a Gaussian mixture)
– Topic parameters
Loss-augmented SVM
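The closed-form Gaussian update can be illustrated with the familiar weighted maximum-likelihood step of a Gaussian mixture. The sketch below is a generic one-dimensional version. The function name, the `resp` weights, and the scalar features are assumptions made for illustration and are not taken from the talk.

```python
def gaussian_m_step(values, resp):
    """Closed-form weighted MLE for 1-D Gaussian components.

    values: list of scalar features.
    resp[n][k]: weight of component k for values[n] (hypothetical posterior
    weights produced by the inference step).
    Assumes every component has a nonzero total weight.
    Returns (means, variances), one entry per component.
    """
    num_components = len(resp[0])
    means, variances = [], []
    for k in range(num_components):
        weight = sum(r[k] for r in resp)
        mean = sum(r[k] * v for r, v in zip(resp, values)) / weight
        var = sum(r[k] * (v - mean) ** 2 for r, v in zip(resp, values)) / weight
        means.append(mean)
        variances.append(var)
    return means, variances
```

With hard 0/1 weights this reduces to the per-group sample mean and (biased) sample variance.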
Experiments
8-category sports data set (Li & Fei-Fei, 2007):
– 1574 images (50/50 split): badminton, bocce, croquet, polo, rowing, snowboarding, sailing, rock climbing
– Pre-segment each image into regions
– Region features: color, texture, and location; patches with SIFT features
– Global features: Gist (Oliva & Torralba, 2001); sparse SIFT codes (Yang, Yu, Gong, & Huang, 2009)
67-category MIT indoor scene data set (Quattoni & Torralba, 2009):
– ~80 images per category for training; ~20 per category for testing
– Same feature representation as above
– Gist global …
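A 50/50 train/test split like the one above could be produced as in the following sketch. The per-category (stratified) splitting, the fixed seed, and the function name are assumptions. The slides only state the split ratio.

```python
import random

def stratified_split(labels, train_fraction=0.5, seed=0):
    """Split example indices into train/test, category by category."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = int(round(len(indices) * train_fraction))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)
```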
Scene Classification
Gist features
– Fei-Fei’s theme model: 0.65 (different image representation)
– SVM: …
Scene Classification Loss
Scene Classification: Confusion Matrix & …
Note: blue for correct; red for wrong
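For readers who want to reproduce this kind of summary, a confusion matrix can be tallied from true and predicted class indices as sketched below. The function names and the integer class encoding are illustrative assumptions. The slide itself shows only the resulting figure.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """matrix[t][p] counts examples of true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

def row_normalize(matrix):
    """Turn counts into per-true-class rates; empty rows stay all zero."""
    rows = []
    for row in matrix:
        total = sum(row)
        rows.append([count / total if total else 0.0 for count in row])
    return rows

def accuracy(matrix):
    """Fraction of examples on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total
```

Diagonal entries are the correct predictions and off-diagonal entries are the confusions.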
MIT Indoor Scene Classification
Note: ROI+Gist(annotation) used human-annotated interest regions.
MIT Indoor Scene
Object Annotation
kNN classifier with features
– Overall:
– Example
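As a reminder of how a kNN classifier assigns a label, here is a minimal sketch using squared Euclidean distance and majority voting. The distance measure, the value of `k`, and the toy feature vectors are assumptions. The slides do not specify which features or settings were used.

```python
from collections import Counter

def knn_predict(train_features, train_labels, query, k=3):
    """Label a query by majority vote among its k nearest training points."""
    distances = []
    for feats, label in zip(train_features, train_labels):
        dist = sum((a - b) ** 2 for a, b in zip(feats, query))  # squared Euclidean
        distances.append((dist, label))
    distances.sort(key=lambda pair: pair[0])  # stable sort: ties keep training order
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]
```

For example, with hypothetical region features `[[0, 0], [0, 1], [5, 5], [5, 6]]` labeled `grass, grass, sky, sky`, the query `[0.2, 0.1]` has two grass neighbors among its three nearest and is labeled `grass`.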
Conclusions & Future Work
Conclusions:
– MLE can result in weak coupling in upstream scene understanding models
– A max-margin approach can be applied to achieve a well-balanced prediction rule
Future work:
– Improve the performance of the object annotation model
  – Incorporate global features with conditional models
  – “Double direction” max-margin learning with supervision on object annotation for scene completion
– Systematic comparison with downstream scene understanding models
  – Multi-class sLDA (Wang et al., 2009)
  – MedLDA (Zhu et al., …)