1
Loss-based Learning with Latent Variables
M. Pawan Kumar
École Centrale Paris, École des Ponts ParisTech, INRIA Saclay, Île-de-France
Joint work with Ben Packer, Daphne Koller, Kevin Miller and Danny Goodman
2
Image Classification
Images x_i, labels y_i (Bison, Deer, Elephant, Giraffe, Llama, Rhino), boxes h_i.
Example: image x, output y = "Deer".
3
Image Classification
Feature: Ψ(x,y,h), e.g. HOG. Score function f: Ψ(x,y,h) → (-∞, +∞).
[Figure: scores f(Ψ(x,y_1,h)) over candidate boxes h.]
4
Image Classification
Feature: Ψ(x,y,h), e.g. HOG. Score function f: Ψ(x,y,h) → (-∞, +∞).
[Figure: scores f(Ψ(x,y_1,h)) and f(Ψ(x,y_2,h)) over candidate boxes h.]
Prediction: y(f). Goal: learn f.
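With a fixed scoring function, prediction is an exhaustive search over label/box pairs. A minimal sketch in Python, where `score`, `labels` and `boxes` are hypothetical placeholders for f(Ψ(x,y,h)) and the candidate sets:

```python
import numpy as np

def predict(score, x, labels, boxes):
    """Return the (label, box) pair maximizing the score f(Psi(x, y, h)).

    `score(x, y, h)`, `labels` and `boxes` are illustrative stand-ins; any
    feature map and scoring function could be plugged in here.
    """
    best_value, best_pair = -np.inf, None
    for y in labels:
        for h in boxes:
            value = score(x, y, h)
            if value > best_value:
                best_value, best_pair = value, (y, h)
    return best_pair
```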
5
Loss-based Learning
User-defined loss function Δ(y, y(f)).
f* = argmin_f Σ_i Δ(y_i, y_i(f))
Minimize the loss between the predicted and ground-truth outputs. No restriction on the loss function. General framework (object detection, segmentation, …).
6
Outline
Latent SVM; Max-Margin Min-Entropy Models; Dissimilarity Coefficient Learning.
References for this part: Andrews et al., NIPS 2001; Smola et al., AISTATS 2005; Felzenszwalb et al., CVPR 2008; Yu and Joachims, ICML 2009.
7
Image Classification
Images x_i, labels y_i (Bison, Deer, Elephant, Giraffe, Llama, Rhino), boxes h_i.
Example: image x, output y = "Deer".
8
Latent SVM
Scoring function: w^T Ψ(x,y,h), where w are the parameters and Ψ(x,y,h) are the features.
Prediction: (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h)
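For the linear scoring function above, prediction is the same exhaustive maximization with w^T Ψ(x,y,h) as the score. A small sketch, assuming a hypothetical joint feature map `psi` and finite candidate sets:

```python
import numpy as np

def latent_svm_predict(w, psi, x, labels, boxes):
    """(y(w), h(w)) = argmax over (y, h) of w^T Psi(x, y, h).

    `psi(x, y, h)` returns the joint feature vector (e.g. HOG inside box h);
    it and the candidate sets are placeholders for illustration.
    """
    scores = {(y, h): float(w @ psi(x, y, h)) for y in labels for h in boxes}
    return max(scores, key=scores.get)
```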
9
Learning Latent SVM
Training data {(x_i, y_i), i = 1, 2, …, n}
w* = argmin_w Σ_i Δ(y_i, y_i(w))
Highly non-convex in w; cannot regularize w to prevent overfitting.
10
Learning Latent SVM
Training data {(x_i, y_i), i = 1, 2, …, n}
Δ(y_i, y_i(w)) = Δ(y_i, y_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) - w^T Ψ(x_i, y_i(w), h_i(w))
≤ Δ(y_i, y_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) - max_{h_i} w^T Ψ(x_i, y_i, h_i)
≤ max_{y,h} {w^T Ψ(x_i, y, h) + Δ(y_i, y)} - max_{h_i} w^T Ψ(x_i, y_i, h_i)
11
Learning Latent SVM
Training data {(x_i, y_i), i = 1, 2, …, n}
min_w ||w||² + C Σ_i ξ_i
s.t. w^T Ψ(x_i, y, h) + Δ(y_i, y) - max_{h_i} w^T Ψ(x_i, y_i, h_i) ≤ ξ_i, for all y, h
Difference-of-convex program in w; CCCP gives a local minimum or saddle point solution.
Self-Paced Learning, NIPS 2010
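One way to exploit the difference-of-convex structure is the CCCP-style alternation the slide mentions: impute the latent variables with the current w, then (approximately) solve the resulting convex structural-SVM problem. The following is a rough sketch under those assumptions, not the authors' implementation; the inner problem is solved here by plain subgradient descent, and `psi`, `labels`, `boxes`, `loss` and `dim` are hypothetical:

```python
import numpy as np

def cccp_latent_svm(X, Y, psi, labels, boxes, loss, C=1.0, dim=100,
                    outer_iters=10, inner_iters=100, lr=1e-3):
    """Sketch of CCCP for min_w ||w||^2 + C * sum_i xi_i (latent SVM)."""
    w = np.zeros(dim)
    for _ in range(outer_iters):
        # Concave part: impute latent variables with the current w.
        H_star = [max(boxes, key=lambda h: w @ psi(x, y, h)) for x, y in zip(X, Y)]
        # Convex part: approximate subgradient descent on the structural SVM.
        for _ in range(inner_iters):
            grad = 2.0 * w
            for x, y, h_star in zip(X, Y, H_star):
                # Loss-augmented inference over (y', h').
                y_hat, h_hat = max(((yp, hp) for yp in labels for hp in boxes),
                                   key=lambda p: w @ psi(x, p[0], p[1]) + loss(y, p[0]))
                margin = w @ psi(x, y_hat, h_hat) + loss(y, y_hat) - w @ psi(x, y, h_star)
                if margin > 0:  # hinge constraint is active
                    grad += C * (psi(x, y_hat, h_hat) - psi(x, y, h_star))
            w -= lr * grad
    return w
```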
12
Recap
Scoring function: w^T Ψ(x,y,h)
Prediction: (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h)
Learning:
min_w ||w||² + C Σ_i ξ_i
s.t. w^T Ψ(x_i, y, h) + Δ(y_i, y) - max_{h_i} w^T Ψ(x_i, y_i, h_i) ≤ ξ_i, for all y, h
13
Image Classification
Images x_i, labels y_i (Bison, Deer, Elephant, Giraffe, Llama, Rhino), boxes h_i.
Example: image x, output y = "Deer".
14
Image Classification
Score: w^T Ψ(x,y,h) ∈ (-∞, +∞)
[Figure: scores w^T Ψ(x,y_1,h) over candidate boxes h.]
15
Image Classification
Score: w^T Ψ(x,y,h) ∈ (-∞, +∞)
[Figure: scores w^T Ψ(x,y_1,h) and w^T Ψ(x,y_2,h) over candidate boxes h.]
Only the maximum score is used. Is there no other useful cue? The uncertainty in h.
16
Outline
Latent SVM; Max-Margin Min-Entropy (M3E) Models; Dissimilarity Coefficient Learning.
Reference for this part: Miller, Kumar, Packer, Goodman and Koller, AISTATS 2012.
17
M3E
Scoring function: P_w(y,h|x) = exp(w^T Ψ(x,y,h)) / Z(x), where Z(x) is the partition function.
Prediction: y(w) = argmin_y { H_α(P_w(h|y,x)) - log P_w(y|x) }, where H_α is the Rényi entropy and P_w(y|x) is the marginalized probability.
This quantity is the Rényi entropy of the generalized distribution, G_α(y; x, w).
18
Rényi Entropy
G_α(y; x, w) = (1 / (1-α)) log [ Σ_h P_w(y,h|x)^α / Σ_h P_w(y,h|x) ]
α = 1: Shannon entropy of the generalized distribution,
- Σ_h P_w(y,h|x) log P_w(y,h|x) / Σ_h P_w(y,h|x)
19
Rényi Entropy
G_α(y; x, w) = (1 / (1-α)) log [ Σ_h P_w(y,h|x)^α / Σ_h P_w(y,h|x) ]
α = ∞: minimum entropy of the generalized distribution,
- max_h log P_w(y,h|x)
20
Rényi Entropy
G_α(y; x, w) = (1 / (1-α)) log [ Σ_h P_w(y,h|x)^α / Σ_h P_w(y,h|x) ]
α = ∞: minimum entropy of the generalized distribution,
- max_h log P_w(y,h|x) = - max_h w^T Ψ(x,y,h) + log Z(x)
Since log Z(x) does not depend on y, this gives the same prediction as latent SVM.
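The generalized Rényi entropy can be computed stably from the raw scores s_h = w^T Ψ(x,y,h) using log-sum-exp, since P_w(y,h|x) = exp(s_h)/Z(x). A sketch, assuming a hypothetical `score_table` mapping each label to its vector of scores over h:

```python
import numpy as np
from scipy.special import logsumexp

def renyi_generalized_entropy(scores, log_Z, alpha):
    """G_alpha(y; x, w) from scores s_h = w^T Psi(x, y, h), with P = exp(s - log Z)."""
    s = np.asarray(scores, dtype=float)
    if np.isinf(alpha):               # minimum entropy: -max_h log P_w(y,h|x)
        return -s.max() + log_Z
    if alpha == 1.0:                  # Shannon entropy of the generalized distribution
        p = np.exp(s - logsumexp(s))  # conditional P_w(h | y, x)
        return -float(np.sum(p * (s - log_Z)))
    return (logsumexp(alpha * s) - logsumexp(s)) / (1.0 - alpha) + log_Z

def m3e_predict(score_table, alpha):
    """y(w) = argmin_y G_alpha(y; x, w); `score_table[y]` holds the scores over h."""
    log_Z = logsumexp(np.concatenate([np.asarray(v, dtype=float)
                                      for v in score_table.values()]))
    return min(score_table,
               key=lambda y: renyi_generalized_entropy(score_table[y], log_Z, alpha))
```

With `alpha=float('inf')`, `m3e_predict` returns the label whose best latent value has the highest score, matching the latent SVM prediction up to the constant log Z(x).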
21
Learning M3E
Training data {(x_i, y_i), i = 1, 2, …, n}
w* = argmin_w Σ_i Δ(y_i, y_i(w))
Highly non-convex in w; cannot regularize w to prevent overfitting.
22
Learning M3E
Training data {(x_i, y_i), i = 1, 2, …, n}
Δ(y_i, y_i(w)) = Δ(y_i, y_i(w)) + G_α(y_i(w); x_i, w) - G_α(y_i(w); x_i, w)
≤ Δ(y_i, y_i(w)) + G_α(y_i; x_i, w) - G_α(y_i(w); x_i, w)
≤ G_α(y_i; x_i, w) + max_y {Δ(y_i, y) - G_α(y; x_i, w)}
23
Learning M3E
Training data {(x_i, y_i), i = 1, 2, …, n}
min_w ||w||² + C Σ_i ξ_i
s.t. G_α(y_i; x_i, w) + Δ(y_i, y) - G_α(y; x_i, w) ≤ ξ_i, for all y
When α tends to infinity, M3E = latent SVM; other values of α can give better results.
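Each constraint above is a hinge upper bound on the loss; the slack for a sample is the value of its most violated constraint. A minimal sketch, where `G(y)` stands for G_α(y; x_i, w) (e.g. computed as in the earlier snippet) and `loss` for Δ; both names are placeholders:

```python
def m3e_slack(G, y_true, labels, loss):
    """Slack xi_i = max(0, max_y [G(y_i) + loss(y_i, y) - G(y)])."""
    violation = max(G(y_true) + loss(y_true, y) - G(y) for y in labels)
    return max(0.0, violation)
```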
24
Image Classification
Mammals dataset: 271 images, 6 classes; 90/10 train/test split; 5 folds; 0/1 loss.
25
Image Classification: HOG-based model (Dalal and Triggs, 2005).
26
Motif Finding
UniProbe dataset: ~40,000 sequences; 50/50 train/test split; 5 proteins, 5 folds; binding vs. not-binding.
27
Motif Finding: Motif + Markov background model (Yu and Joachims, 2009).
28
Recap
Scoring function: P_w(y,h|x) = exp(w^T Ψ(x,y,h)) / Z(x)
Prediction: y(w) = argmin_y G_α(y; x, w)
Learning:
min_w ||w||² + C Σ_i ξ_i
s.t. G_α(y_i; x_i, w) + Δ(y_i, y) - G_α(y; x_i, w) ≤ ξ_i, for all y
29
Outline
Latent SVM; Max-Margin Min-Entropy Models; Dissimilarity Coefficient Learning.
Reference for this part: Kumar, Packer and Koller, ICML 2012.
30
Object Detection
Images x_i, labels y_i (Bison, Deer, Elephant, Giraffe, Llama, Rhino), boxes h_i.
Example: image x, output y = "Deer".
31
Minimizing General Loss
min_w Σ_i Δ(y_i, h_i, y_i(w), h_i(w))   (supervised samples)
      + Σ_i Δ'(y_i, y_i(w), h_i(w))     (weakly supervised samples)
The latent variable values are unknown for the weakly supervised samples.
32
Minimizing General Loss
min_w Σ_i Σ_{h_i} Δ(y_i, h_i, y_i(w), h_i(w)) P_w(h_i|x_i, y_i)
A single distribution has to achieve two objectives.
33
Problem
Model the uncertainty in the latent variables, and model the accuracy of the latent variable predictions.
34
Solution
Model the uncertainty in the latent variables and the accuracy of the latent variable predictions: use two different distributions for the two different tasks.
35
Solution
Use two different distributions for the two different tasks.
P_θ(h_i|y_i, x_i): a distribution over the latent variables h_i, modeling their uncertainty.
Remaining task: model the accuracy of the latent variable predictions.
36
Solution
Use two different distributions for the two different tasks.
P_θ(h_i|y_i, x_i): a distribution over the latent variables h_i, modeling their uncertainty.
P_w(y_i, h_i|x_i): a delta distribution concentrated on the prediction (y_i(w), h_i(w)), modeling the accuracy of the predictions.
37
The Ideal Case
No latent variable uncertainty and a correct prediction: P_θ(h_i|y_i, x_i) is concentrated on a single value h_i(w), and P_w predicts (y_i, h_i(w)).
38
In Practice
Restrictions in the representation power of the models.
[Figure: P_θ(h_i|y_i, x_i) spread over h_i; P_w(y_i, h_i|x_i) concentrated on the prediction (y_i(w), h_i(w)).]
39
Our Framework
Minimize the dissimilarity between the two distributions P_θ(h_i|y_i, x_i) and P_w(y_i, h_i|x_i), using a user-defined dissimilarity measure.
40
Our Framework
Minimize Rao's dissimilarity coefficient between the two distributions.
First term: H_i(w,θ) = Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h|y_i, x_i)
41
Our Framework
Minimize Rao's dissimilarity coefficient between the two distributions.
H_i(w,θ) - β Σ_{h,h'} Δ(y_i, h, y_i, h') P_θ(h|y_i, x_i) P_θ(h'|y_i, x_i)
42
Our Framework
Minimize Rao's dissimilarity coefficient between the two distributions.
H_i(w,θ) - β H_i(θ,θ) - (1-β) Δ(y_i(w), h_i(w), y_i(w), h_i(w))
where H_i(θ,θ) = Σ_{h,h'} Δ(y_i, h, y_i, h') P_θ(h|y_i, x_i) P_θ(h'|y_i, x_i); the last term vanishes since the loss of the prediction against itself is zero.
43
Our Framework
Minimize Rao's dissimilarity coefficient between the two distributions:
min_{w,θ} Σ_i H_i(w,θ) - β H_i(θ,θ)
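For a delta prediction distribution P_w and a finite set of candidate latent values, the per-sample dissimilarity term can be evaluated directly. A sketch with illustrative names (`p_theta` is a dict giving P_θ(h|y_i, x_i)); the (1-β) self-term of the delta distribution is zero and omitted:

```python
def disc_objective_term(loss, y_i, y_pred, h_pred, p_theta, latent_vals, beta):
    """H_i(w, theta) - beta * H_i(theta, theta) for one training sample."""
    # Expected loss of the prediction under P_theta.
    H_w_theta = sum(loss(y_i, h, y_pred, h_pred) * p_theta[h] for h in latent_vals)
    # Expected pairwise loss under P_theta (diversity of the latent distribution).
    H_theta_theta = sum(loss(y_i, h, y_i, hp) * p_theta[h] * p_theta[hp]
                        for h in latent_vals for hp in latent_vals)
    return H_w_theta - beta * H_theta_theta
```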
44
Optimization
min_{w,θ} Σ_i H_i(w,θ) - β H_i(θ,θ)
Initialize the parameters to w_0 and θ_0.
Repeat until convergence:
  Fix w and optimize θ.
  Fix θ and optimize w.
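The alternation on this slide is plain block-coordinate descent. A generic skeleton, assuming hypothetical inner solvers `optimize_theta` and `optimize_w` that implement the two steps described on the following slides:

```python
def block_coordinate_descent(objective, optimize_theta, optimize_w,
                             w0, theta0, max_iters=50, tol=1e-4):
    """Alternately minimize sum_i H_i(w, theta) - beta * H_i(theta, theta)."""
    w, theta = w0, theta0
    prev = objective(w, theta)
    for _ in range(max_iters):
        theta = optimize_theta(w, theta)   # fix w, optimize theta
        w = optimize_w(w, theta)           # fix theta, optimize w
        cur = objective(w, theta)
        if abs(prev - cur) < tol:          # stop when the objective stalls
            break
        prev = cur
    return w, theta
```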
45
Optimization of θ
min_θ Σ_i Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h|y_i, x_i) - β H_i(θ,θ)
Case I: y_i(w) = y_i
[Figure: P_θ(h_i|y_i, x_i) over h_i, with the predicted box h_i(w) marked.]
46
Optimization of θ (continued)
min_θ Σ_i Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h|y_i, x_i) - β H_i(θ,θ)
Case I: y_i(w) = y_i — the expected-loss term encourages P_θ(h|y_i, x_i) to place its mass on latent values with low loss against h_i(w).
47
Optimization of θ
min_θ Σ_i Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h|y_i, x_i) - β H_i(θ,θ)
Case II: y_i(w) ≠ y_i
[Figure: P_θ(h_i|y_i, x_i) over h_i.]
48
Optimization of θ
min_θ Σ_i Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h|y_i, x_i) - β H_i(θ,θ)
Case II: y_i(w) ≠ y_i
Solved by stochastic subgradient descent.
49
Optimization of w
min_w Σ_i Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h|y_i, x_i)
The expected loss models the uncertainty. The form of the optimization is similar to latent SVM; if Δ is independent of h, it reduces exactly to latent SVM. Solved with the Concave-Convex Procedure (CCCP).
50
Object Detection
Input x, output y = "Deer", latent variable h (bounding box).
Classes: Bison, Deer, Elephant, Giraffe, Llama, Rhino.
Mammals dataset; 60/40 train/test split; 5 folds.
Training: inputs x_i with outputs y_i.
51
Results – 0/1 Loss (statistically significant).
52
Results – Overlap Loss
53
Action Detection
Input x, output y = "Using Computer", latent variable h.
PASCAL VOC 2011: Jumping, Phoning, Playing Instrument, Reading, Riding Bike, Riding Horse, Running, Taking Photo, Using Computer, Walking.
60/40 train/test split; 5 folds.
Training: inputs x_i with outputs y_i.
54
Results – 0/1 Loss (statistically significant).
55
Results – Overlap Loss (statistically significant).
56
Conclusions
Latent SVM ignores latent variable uncertainty.
M3E handles losses that are independent of the latent variables; dissimilarity coefficient (DISC) learning handles losses that depend on the latent variables.
Both are strict generalizations of latent SVM.
Code is available online.
57
Questions? http://cvc.centrale-ponts.fr/personnel/pawan