Context Aware Spatial Priors using Entity Relations (CASPER)
Geremy Heitz, Jonathan Laserson, Daphne Koller
December 10th, 2007, DAGS
Outline Goal – Scene Understanding Existing Methods CASPER Preliminary Experiments Future Direction – Going Discriminative
[Motivating scene image, labeled: Building, Tree, Car]
Representation
[Scene image, labeled: Building, Tree, Car, Building, Car]
l = bag of object categories
ρ = locations of the object centroids
I = the image
We model P(ρ, l). Why? Because we use a generative model: P(ρ, l | I) ∝ P(ρ, l) P(I | ρ, l)
[Two candidate labelings of the same scene: Building, Tree, Car, Building, Car vs. Building, Tree, Car, Building, Car, Tree, Car]
Which one makes more sense? Does context matter?
Can it help Object Recognition? LOOPS
Outline Goal – Scene Understanding Existing Methods CASPER Preliminary Experiments Future Direction – Going Discriminative
Fixed Order Model
Each image has the same bag of objects (example: 1 car, 2 buildings, 1 tree)
Object centroids are drawn jointly: P(ρ, l) = 1{l = l_fixed_order} P(ρ | l)
Similar to constellation models (Fergus)
Problem: we don't always know the exact set of objects
TDP (Sudderth, 2005)
Each image has a different bag of objects
Object centroids are drawn independently: P(ρ, l) = P(l) ∏_i P(ρ_i | l_i)
Problem: this doesn't take pairwise constraints into account; we have lost context
Outline Goal – Scene Understanding Existing Methods CASPER Preliminary Experiments Future Direction – Going Discriminative
CASPER
Each image has a different bag of objects
Object centroids are drawn jointly given l: P(ρ, l) = P(l) P(ρ | l)
Questions: How do we represent P(l)? How do we represent P(ρ | l)? How do we learn? How do we infer?
P(l)
Dirichlet Process (we don't want to get into that now)
Other options: Multinomial, Uniform
P(ρ | l) – Desiderata
Correlations between the ρ's
Sharing of parameters between l's
Intuitive parameterization
Continuous multivariate distribution
Easy to learn parameters, evaluate likelihood, and condition
Gaussian?
Multivariate Gaussian – Options
Option 1: learn a different Gaussian for every l. Problems: can't share parameters; large number (∞) of l's.
Option 2: Gaussian Process, ρ(x) ~ GP(μ(x), K(x, x'))
Every finite set of x's produces a Gaussian: [ρ(x_1), ρ(x_2), …, ρ(x_k)] ~ Gaussian
x_t is a hidden function of the class l_t
μ(x_t) = A x_t, K(x_t, x_t') = c exp(-||B(x_t - x_t')||²)
Two objects of the same class -> same x? Is correlation the natural space?
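The GP option can be made concrete in a few lines of numpy. This is a minimal sketch of the squared-exponential kernel and linear mean above; the values for A, c, and the x_t inputs are toy choices, not parameters from the talk.

```python
import numpy as np

def sq_exp_kernel(X, c=1.0, B=None):
    """Gram matrix for K(x_t, x_t') = c * exp(-||B (x_t - x_t')||^2).

    X: (n, d) array of hidden inputs, one row per object."""
    d = X.shape[1]
    B = np.eye(d) if B is None else B
    BX = X @ B.T                                   # apply the linear map B to each x_t
    diff = BX[:, None, :] - BX[None, :, :]         # pairwise differences
    return c * np.exp(-(diff ** 2).sum(axis=-1))   # squared norms -> kernel values

# Every finite set of inputs then gives a Gaussian over centroids:
A = np.array([[1.0, 0.0], [0.0, 1.0]])             # toy mean map mu(x_t) = A x_t
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]) # hypothetical x_t's for three objects
mu, K = X @ A.T, sq_exp_kernel(X)                  # mean vectors and 3x3 covariance
```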
Car Spatial Distribution – Options
"Singleton Expert" P(ρ_i | l_i): Gaussian over absolute object location
"Pairwise Expert" P(ρ_i - ρ_j | l_i, l_j): Gaussian over the offset between two objects
An expert can be one of K mixture components
[Illustration: Tree/Car offsets, components k = 1, 2]
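A minimal sketch of the two expert types, each as a 2-D Gaussian; the means and covariances below are illustrative stand-ins, not learned experts.

```python
import numpy as np
from scipy.stats import multivariate_normal

# "Singleton expert": Gaussian over the absolute location of one object.
car_singleton = multivariate_normal(mean=[0.5, 0.8], cov=np.diag([0.10, 0.02]))

# "Pairwise expert": Gaussian over the offset rho_i - rho_j between two objects
# (one of the K mixture components for the Tree-Car edge).
tree_minus_car = multivariate_normal(mean=[0.0, -0.3], cov=np.diag([0.05, 0.02]))

rho_car = np.array([0.40, 0.85])
rho_tree = np.array([0.45, 0.50])
log_score = (car_singleton.logpdf(rho_car)
             + tree_minus_car.logpdf(rho_tree - rho_car))   # one mixture component k
```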
CASPER P(ρ | l)
How to use the experts? Introduce an auxiliary variable d: P(ρ | d, l)
d tells us which experts are 'on'
[Scene graph: Building, Tree, Car, Building, Car]
For each edge e = (l_i, l_j), d_e indexes all possible experts for this edge; the default is a uniform expert
P(ρ | d, l) ∝ POE_d, where POE_d = ∏ P(ρ_i | l_i) ∏ P(ρ_i - ρ_j | d_ij, l_i, l_j)
A product of Gaussians is a Gaussian
CASPER P(ρ | d, l)
POE_d = Z_d N(ρ; μ_d, Σ_d), so P(ρ | d, l) = N(ρ; μ_d, Σ_d) = (1/Z_d) POE_d
P(d | l) ∝ Z_d (Multinomial), so P(ρ, d | l) ∝ POE_d
Example (three cars): P(ρ, d | l) ∝ P(ρ_2 - ρ_1 | d_12) P(ρ_3 - ρ_2 | d_32)
[Figure: two expert assignments d1, d2 for the three cars]
P(ρ | d1, l) = P(ρ | d2, l), but Z_d2 > Z_d1, hence POE_d2 > POE_d1
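A minimal numpy sketch of the last two slides: multiply Gaussian experts in information form (their precisions add), read off μ_d and Σ_d, and recover Z_d as the ratio POE_d / N(·; μ_d, Σ_d). All expert parameters are toy values, and the weak singleton anchor on car 1 is added only so the toy precision is invertible.

```python
import numpy as np
from scipy.stats import multivariate_normal

def product_of_experts(factors, dim):
    """factors: list of (A, m, S); each expert is a Gaussian N(A @ rho; m, S),
    e.g. A @ rho = rho_i - rho_j for a pairwise expert on edge (i, j)."""
    J, h = np.zeros((dim, dim)), np.zeros(dim)
    for A, m, S in factors:
        P = np.linalg.inv(S)
        J += A.T @ P @ A                  # precisions of the experts add
        h += A.T @ (P @ m)
    Sigma_d = np.linalg.inv(J)
    mu_d = Sigma_d @ h
    # POE_d(rho) = Z_d * N(rho; mu_d, Sigma_d), so Z_d is the ratio at any rho.
    log_poe_at_mu = sum(multivariate_normal(mean=m, cov=S).logpdf(A @ mu_d)
                        for A, m, S in factors)
    log_Zd = log_poe_at_mu - multivariate_normal(mean=mu_d, cov=Sigma_d).logpdf(mu_d)
    return mu_d, Sigma_d, log_Zd

# The three-car example: two pairwise experts chain car1 -> car2 -> car3.
I2, O2 = np.eye(2), np.zeros((2, 2))
factors = [
    (np.hstack([I2, O2, O2]), np.zeros(2), 10.0 * I2),             # weak anchor on rho_1
    (np.hstack([-I2, I2, O2]), np.array([0.3, 0.0]), 0.05 * I2),   # expert for rho_2 - rho_1
    (np.hstack([O2, -I2, I2]), np.array([0.3, 0.0]), 0.05 * I2),   # expert for rho_3 - rho_2
]
mu_d, Sigma_d, log_Zd = product_of_experts(factors, dim=6)
```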
Learning the Experts
Training set with supervised (ρ, l) pairs (one pair per image)
Gibbs over the hidden variables d_e: loop over the edges, updating the expert sufficient statistics with each update
Does it converge? Not as much as we want it to; work in progress
[Scene graph: Building, Tree, Car, Building, Car]
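A minimal sketch of one Gibbs sweep over the d_e variables. Because P(ρ, d | l) ∝ POE_d (previous slide), only the factor for edge e depends on d_e, so its conditional reduces to the expert densities evaluated at the observed offset; the sampled assignments would then be folded back into the experts' sufficient statistics. The data structures here are placeholders, not the talk's implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gibbs_sweep_edges(rho, labels, edges, experts, rng):
    """One sweep over the edges.

    rho: dict instance -> 2-D centroid; labels: dict instance -> class;
    edges: list of (i, j) instance pairs;
    experts[(li, lj)]: list of (mean, cov) mixture components for that class pair."""
    d = {}
    for i, j in edges:
        offset = rho[i] - rho[j]
        components = experts[(labels[i], labels[j])]
        logp = np.array([multivariate_normal(mean=m, cov=S).logpdf(offset)
                         for m, S in components])
        p = np.exp(logp - logp.max())                 # stabilize before normalizing
        d[(i, j)] = rng.choice(len(components), p=p / p.sum())
    return d
```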
Outline Goal – Scene Understanding Existing Methods CASPER Preliminary Experiments Future Direction – Going Discriminative
Preliminary Experiments LabelMe Datasets STREETS BEDROOMS
FEATURES (y_i, w_i, t_i): Harris interest operator -> y_i, SIFT descriptor -> w_i, instance membership -> t_i
INSTANCES (ρ_t, l_t): centroid -> ρ_t, class label -> l_t
The image gives the observed features: P(I | ρ, l) = P(y, w | ρ, l)
[Illustration: interest points grouped into a Car instance with centroid ρ_t]
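A minimal sketch of how these variables relate, with made-up feature locations and memberships; in the model ρ_t is a latent variable, but the mean of the feature locations currently assigned to an instance is a natural summary or initialization for it.

```python
import numpy as np

# Hypothetical feature locations y_i and instance memberships t_i for one image.
y = np.array([[10.0, 12.0], [14.0, 11.0], [80.0, 40.0], [82.0, 44.0]])  # Harris point locations
t = np.array([0, 0, 1, 1])                                              # instance membership t_i
l = {0: "car", 1: "tree"}                                               # instance class labels l_t

# Summarize (or initialize) each instance centroid rho_t as the mean location
# of the features assigned to it.
rho = {k: y[t == k].mean(axis=0) for k in np.unique(t)}
```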
What do the true ρ's look like?
[Scatter plots of pairwise offsets: Car -> Car, Lamp -> Lamp, Bed -> Lamp]
Learning/Inference in the Full Model
TDP – three-stage Gibbs:
1. Assign features to instances (sample t_i for every feature)
2. Assign expert components (sample d_e for every edge)
3. Assign instances to classes (sample l_t, ρ_t for every instance)
Training: supervise the (t, l) variables, Gibbs over d and ρ
Testing: introduce new images, Gibbs over (t, l, d, ρ) of the new images
Independent-TDP: the ρ's are independent; CASPER-TDP: the ρ's are distributed according to CASPER
Learned Experts
[Illustration: detected interest-point features (y_i, w_i, t_i)]
[Qualitative results: IMAGE | GROUND TRUTH | IND – N = 0.1 | IND – N = 0.5]
Evaluation – Generative Model
[Detection results per class (Bed, Lamp, Painting, Window, Table) for N = 0.1, 0.3, 0.5]
"Synthetic appearance": visual words give a strong indicator of the class
Evaluated on detection performance: precision/recall and F1 score for centroid and class identification
Results here are with Independent TDP. Can we hope to do this well?
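The F1 criterion above (centroid and class identification) could look like the following minimal sketch; the matching rule and distance threshold are assumptions of the sketch, not necessarily the ones used in these experiments.

```python
import numpy as np

def detection_f1(pred, gt, dist_thresh=0.1):
    """pred, gt: lists of (class_label, centroid) pairs; a prediction counts as a
    true positive if its class matches an unmatched ground-truth instance whose
    centroid lies within dist_thresh (the threshold is a choice of this sketch)."""
    matched = [False] * len(gt)
    tp = 0
    for cls, c in pred:
        for k, (gcls, gc) in enumerate(gt):
            if (not matched[k] and cls == gcls
                    and np.linalg.norm(np.asarray(c) - np.asarray(gc)) < dist_thresh):
                matched[k] = True
                tp += 1
                break
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
```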
Evaluation – Context
[Per-class results (Bed, Lamp, Painting, Window, Table): Independent-TDP vs. CASPER-TDP, N = 0.5]
Why isn't context helping here?
Problems with this Setup (Bad Feelings)
Supervised setting (detection): our model is not trained to maximize detection ability; we will lose to many/most discriminative approaches; context is NOT the main reason why TDP fails
Unsupervised setting: likelihood? Does anyone care? Object discovery? Context is a lower-order consideration
How would we show that CASPER > Independent?
Outline Goal – Scene Understanding Existing Methods CASPER Preliminary Experiments Future Direction – Going Discriminative
Going Discriminative
Up to now we have been generative: P(I, ρ, l) = P(I | ρ, l) P(ρ, l)
How do we make this discriminative? Include the CASPER distribution over (ρ, l), include a term with boosted object detectors, and slap on a partition function:
P(ρ, l | I) = (1/Z) × CASPER × DETECTORS
Discriminative Framework
Boosted detectors "over-detect"
Each "candidate" has a location ρ_t, a class variable l_t, and a detection score D_I(l_t)
P(ρ, l | I) ∝ P(ρ, l) ∏_t D_I(l_t)
Goal: reassign detection candidates to classes in a way that respects both the "detection strength" and the context between objects
[Example detections: D_I(face) = 0.09 vs. D_I(face) = 0.92]
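A minimal sketch of this scoring rule: an assignment of classes to the over-detected candidates is scored by the detector log-scores plus the context terms. The context term here is a single-component pairwise Gaussian per class pair, a simplification of the CASPER mixture; all interfaces are stand-ins.

```python
import numpy as np
from scipy.stats import multivariate_normal

def assignment_log_score(labels, rho, det_scores, pairwise_experts):
    """Unnormalized log P(rho, l | I): detector log-scores plus context terms.

    labels: list of class names, one per candidate;
    det_scores[t][c]: boosted detector score D_I(c) for candidate t;
    pairwise_experts[(ci, cj)]: (mean, cov) of the offset expert for that class pair."""
    score = sum(np.log(det_scores[t][labels[t]]) for t in range(len(labels)))
    for i in range(len(labels)):
        for j in range(len(labels)):
            key = (labels[i], labels[j])
            if i != j and key in pairwise_experts:
                m, S = pairwise_experts[key]
                score += multivariate_normal(mean=m, cov=S).logpdf(rho[i] - rho[j])
    return score   # the partition function Z cancels when comparing assignments
```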
Similarities to Steve’s work “Over detection” using boosted detectors But some detections don’t make sense in context 3D information allows him to “sort out” which detections are correct
CASPER Learning/Inference
Gibbs inference: loop over images; loop over detection candidates t and sample (l_t | everything else); loop over pairs of candidates and sample (d_e | everything else)
Training: l_t is known, Gibbs over d_e
Evaluation: precision/recall for detections
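A minimal sketch of the inner sampling step, sampling (l_t | everything else): the conditional combines the detector score with whichever CASPER factors involve candidate t. The interfaces here are placeholders, and the context term is supplied by the caller as a stand-in for the CASPER factors.

```python
import numpy as np

def resample_label(t, labels, classes, det_scores, context_logp, rng):
    """One Gibbs step for candidate t: resample l_t proportional to the detector
    score D_I(c) times the context factors that touch t.

    context_logp(t, c, labels): caller-supplied log of the context factors when l_t = c."""
    logp = np.array([np.log(det_scores[t][c]) + context_logp(t, c, labels)
                     for c in classes])
    p = np.exp(logp - logp.max())
    labels[t] = classes[rng.choice(len(classes), p=p / p.sum())]
    return labels
```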
Possible Datasets
Short Term Plan Learn the boosted detectors Determine our baseline performance Add Gibbs inference Submit to a conference that is far far away… ICML = Helsinki, Finland
Alternate Names Spatial Priors for Arbitrary Groups of Objects
Product of Experts – Precision Space View
P_1(x) = N(x; a, A), P_2(x) = N(x; b, B)
P_1(x) P_2(x) = Z N(x; c, C), with Z = N(a; b, A + B), C⁻¹ = A⁻¹ + B⁻¹, c = C(A⁻¹a + B⁻¹b)
What does this mean? The precision matrices of the experts ADD; even if each expert's precision is singular, the sum is still PSD
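A minimal numerical check of these identities, with toy means and covariances.

```python
import numpy as np
from scipy.stats import multivariate_normal

a, A = np.array([0.0, 0.0]), np.diag([1.0, 2.0])
b, B = np.array([1.0, 1.0]), np.diag([0.5, 0.5])

C = np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))          # C^-1 = A^-1 + B^-1
c = C @ (np.linalg.inv(A) @ a + np.linalg.inv(B) @ b)           # c = C (A^-1 a + B^-1 b)
log_Z = multivariate_normal(mean=b, cov=A + B).logpdf(a)        # Z = N(a; b, A + B)

# Check the identity P1(x) P2(x) = Z N(x; c, C) at an arbitrary point.
x = np.array([0.3, -0.2])
lhs = (multivariate_normal(mean=a, cov=A).logpdf(x)
       + multivariate_normal(mean=b, cov=B).logpdf(x))
rhs = log_Z + multivariate_normal(mean=c, cov=C).logpdf(x)
assert np.isclose(lhs, rhs)
```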
CASPER Detection
Detection strength component: D_I(l_t) = P(l_t | I[ρ_t])
Occurrence component: P(l) = ∏_t P(l_t), l_t ~ Multinomial
CASPER component: P(ρ, d | l) ∝ POE_d