Context-based vision system for place and object recognition. Antonio Torralba, Kevin Murphy, Bill Freeman, Mark Rubin. Presented by David Lee. Some slides borrowed from Kevin Murphy.
Object out of context
Object in context
Wearable test-bed
System diagram
Computing the features
24 filtered images -> downsample each to 4x4 -> 4 x 4 x 24 = 384 dim -> reduce to 80 dim
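The dimension counts on this slide can be traced with a small sketch. It assumes each of the 24 filter outputs is a 2D array of non-negative magnitudes, averages it over a 4x4 grid of blocks, and concatenates the results into a 384-dimensional vector. The slide does not say how 384 dimensions become 80, so the last step is shown as a generic linear projection; `project` and its `basis` matrix are hypothetical placeholders (PCA would be one common choice), and the 8x8 input size is made up for the example.

```python
def downsample_4x4(image):
    """Average a 2D list of filter magnitudes over a 4x4 grid of blocks."""
    rows, cols = len(image), len(image[0])
    cells = []
    for by in range(4):
        r0, r1 = by * rows // 4, (by + 1) * rows // 4
        for bx in range(4):
            c0, c1 = bx * cols // 4, (bx + 1) * cols // 4
            block = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            cells.append(sum(block) / len(block))
    return cells  # 16 values per filtered image


def gist_384(filtered_images):
    """Concatenate the 4x4 summaries of all filter outputs (24 x 16 = 384)."""
    features = []
    for image in filtered_images:
        features.extend(downsample_4x4(image))
    return features


def project(features, basis):
    """Hypothetical linear projection: one output value per row of `basis`."""
    return [sum(b * f for b, f in zip(row, features)) for row in basis]


# Toy input: 24 constant 8x8 "filter outputs" (assumed sizes, illustration only).
filtered = [[[float(k)] * 8 for _ in range(8)] for k in range(24)]
v384 = gist_384(filtered)

# Placeholder basis that simply keeps the first 80 coordinates; a real system
# would learn this matrix from training data.
basis = [[1.0 if j == i else 0.0 for j in range(384)] for i in range(80)]
v80 = project(v384, basis)
print(len(v384), len(v80))
```

The block averaging requires each filtered image to be at least 4 pixels on a side; anything smaller would leave empty blocks.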
Visualizing the filter bank output: images and their 80-dimensional representation
Place recognition system
Hidden Markov Model: Hidden states = location (63 values). Observations = $v_t^G \in \mathbb{R}^{80}$. Transition model encodes topology of environment. Observation model is a mixture of Gaussians (100 views per place).
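One way to read the observation model: each place keeps a set of stored gist vectors (the slide mentions 100 views per place), and the likelihood of a new gist is a mixture of Gaussians centred on them. The sketch below assumes spherical Gaussians with a shared variance `sigma2` and uniform mixture weights; those choices, and the names `log_gauss` and `log_obs_likelihood`, are illustrative assumptions rather than details given on the slide.

```python
import math


def log_gauss(v, mean, sigma2):
    """Log density of a spherical Gaussian N(v; mean, sigma2 * I)."""
    d = len(v)
    sq = sum((a - b) ** 2 for a, b in zip(v, mean))
    return -0.5 * (d * math.log(2 * math.pi * sigma2) + sq / sigma2)


def log_obs_likelihood(v, views, sigma2):
    """log p(v | place) for a uniform mixture centred on the stored views."""
    logs = [log_gauss(v, m, sigma2) for m in views]
    top = max(logs)  # log-sum-exp trick for numerical stability
    return top + math.log(sum(math.exp(x - top) for x in logs)) - math.log(len(views))


# Toy example with 2-D "gists" and two places (real gists would be 80-D).
place_views = {
    "office": [[0.0, 0.0], [0.2, 0.1]],
    "corridor": [[3.0, 3.0], [3.1, 2.8]],
}
v = [0.1, 0.0]
scores = {q: log_obs_likelihood(v, views, sigma2=0.5) for q, views in place_views.items()}
best = max(scores, key=scores.get)
print(best)
```

Working in log space matters here: in 80 dimensions the raw densities underflow quickly, while their logarithms stay well behaved.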
Hidden Markov Model: posterior over places ∝ observation likelihood (mixture of Gaussians) x prediction prior. The prediction prior comes from the transition matrix, estimated by MLE (counting).
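The labels on this slide match the standard HMM forward (filtering) recursion, $P(Q_t = q \mid v_{1:t}^G) \propto p(v_t^G \mid Q_t = q) \sum_{q'} A(q', q)\, P(Q_{t-1} = q' \mid v_{1:t-1}^G)$, where $A$ is the transition matrix. A minimal sketch, assuming the transition matrix is estimated by counting place-to-place transitions in a labelled training sequence; the add-one smoothing, the toy numbers, and all function names are assumptions made for this example.

```python
def estimate_transitions(labels, n_states, pseudo=1.0):
    """MLE by counting transitions i -> j, with additive smoothing (assumed)."""
    counts = [[pseudo] * n_states for _ in range(n_states)]
    for prev, cur in zip(labels, labels[1:]):
        counts[prev][cur] += 1.0
    return [[c / sum(row) for c in row] for row in counts]


def forward_step(belief, A, likelihood):
    """One filtering update: predict with A, weight by likelihood, normalize."""
    n = len(belief)
    prior = [sum(belief[i] * A[i][j] for i in range(n)) for j in range(n)]
    post = [likelihood[j] * prior[j] for j in range(n)]
    total = sum(post)
    return [p / total for p in post]


# Toy run with 3 places; the likelihood vectors stand in for the
# mixture-of-Gaussians output at each frame.
train_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
A = estimate_transitions(train_labels, n_states=3)
belief = [1 / 3, 1 / 3, 1 / 3]
for likelihood in ([0.9, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.6, 0.3]):
    belief = forward_step(belief, A, likelihood)
estimate = max(range(3), key=lambda q: belief[q])
print(estimate, [round(b, 3) for b in belief])
```

Because the transition matrix favours staying in place or moving to a neighbouring place, a single ambiguous frame does not flip the estimate on its own.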
Scene Categorization: 17 categories (office, corridor, street, etc.). Train a separate HMM on category labels.
Place recognition demo
Performance on known env.: panels for specific location, location category, and indoor/outdoor; ground truth vs. system estimate.
Performance on new env.
Comparison of features: recognition and categorization
Effect of HMM on recognition: with HMM vs. without HMM (but with temporal smoothing)
From place to object recognition
Object priming: Predict object properties based on context (top-down signals): visual gist, $v_t^G$; specific location, $Q_t$; kind of location, $C_t$.
Object Priming Again… Probability of object i given the entire video sequence = sum over places q of [probability of object i given current observation and place] x [estimate of current place (output of HMM)]. The probability of object i given current observation and place ∝ observation likelihood (mixture of Gaussians) x prior probability of object i being in place q (MLE).
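Read as a computation, this slide combines two quantities: the HMM's belief over places, and a per-place probability that object i is present given the current gist. The per-place term follows from Bayes' rule applied to an observation likelihood and a place-conditional object prior. The sketch below assumes the likelihoods have already been evaluated (for instance by a mixture of Gaussians); the function names and all numbers are made up for illustration.

```python
def object_given_place(lik_present, lik_absent, prior_present):
    """P(object present | gist, place) via Bayes' rule for one place."""
    num = lik_present * prior_present
    den = num + lik_absent * (1.0 - prior_present)
    return num / den


def object_given_sequence(place_belief, lik_present, lik_absent, prior_present):
    """Marginalize the per-place object probability over the HMM place belief."""
    return sum(
        place_belief[q] * object_given_place(lik_present[q], lik_absent[q], prior_present[q])
        for q in range(len(place_belief))
    )


# Toy numbers for 3 places (illustrative only).
place_belief = [0.7, 0.2, 0.1]    # P(place = q | gists so far), from the HMM
prior_present = [0.9, 0.5, 0.05]  # P(object present | place q), e.g. from counts
lik_present = [0.4, 0.3, 0.2]     # p(gist | object present, place q)
lik_absent = [0.1, 0.3, 0.4]      # p(gist | object absent, place q)

p_obj = object_given_sequence(place_belief, lik_present, lik_absent, prior_present)
print(round(p_obj, 3))
```

When the place belief is concentrated on one place, the result is close to that place's own object probability, so the place prior can dominate the gist evidence.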
Predicting object presence
ROC curves for object detection
Predicting object position and scale
Estimate of mask = sum over places q of [probability of object i being present and the location being q (output of previous system)] x [estimate of mask given current gist, place, and object]. The mask model is a mixture with delta and Gaussian components.
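One possible reading of this slide, sketched under stated assumptions: for each place, the expected mask is a weighted average of stored training masks, with weights given by a Gaussian kernel on the distance between the current gist and each stored gist; the per-place masks are then combined using the joint probability of object presence and place. The prototype storage format, the shared variance `sigma2`, and every name below are hypothetical, and the case where the object is absent is assumed to contribute an all-zero mask.

```python
import math


def expected_mask_for_place(v, prototypes, sigma2):
    """Average the stored masks, weighted by a Gaussian kernel on gist distance."""
    weights = [
        math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(v, gist)) / sigma2)
        for gist, _mask in prototypes
    ]
    total = sum(weights)
    n = len(prototypes[0][1])
    return [
        sum(w * mask[k] for w, (_gist, mask) in zip(weights, prototypes)) / total
        for k in range(n)
    ]


def expected_mask(v, joint_prob, prototypes_by_place, sigma2):
    """Sum over places of P(object present, place q | gists) * E[mask | gist, q]."""
    n = len(next(iter(prototypes_by_place.values()))[0][1])
    out = [0.0] * n
    for q, prob in joint_prob.items():
        m = expected_mask_for_place(v, prototypes_by_place[q], sigma2)
        out = [o + prob * x for o, x in zip(out, m)]
    return out


# Toy data: 2-D gists and 2x2 masks flattened to 4 values (illustration only).
prototypes_by_place = {
    "desk": [([0.0, 0.0], [1, 1, 0, 0]), ([1.0, 0.0], [0, 0, 1, 1])],
    "hall": [([5.0, 5.0], [0, 1, 0, 1])],
}
# P(object present, place = q | gists so far); the remaining 0.1 is the
# probability that the object is absent, which adds nothing to the mask.
joint_prob = {"desk": 0.8, "hall": 0.1}
mask = expected_mask([0.0, 0.0], joint_prob, prototypes_by_place, sigma2=0.5)
print([round(x, 3) for x in mask])
```

The output is a soft mask with values between 0 and 1; thresholding it would give a predicted region in which to look for the object.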
Predicted segmentation
Conclusion: Real-world problem (and it works!). Uses only global features (context). How much did {HMM / place prior} affect {place recognition / object detection}? Can we really say “context” did the job?