
1 Loss-based Learning with Latent Variables. M. Pawan Kumar, École Centrale Paris / École des Ponts ParisTech / INRIA Saclay, Île-de-France. Joint work with Ben Packer, Daphne Koller, Kevin Miller and Danny Goodman.

2 Image Classification. Training images x_i with labels y_i (Bison, Deer, Elephant, Giraffe, Llama, Rhino) and boxes h_i. Example: image x with label y = "Deer".

3 Image Classification. Feature vector Ψ(x,y,h), e.g. HOG. Score function f : Ψ(x,y,h) → (-∞, +∞). [Figure: scores f(Ψ(x, y1, h)) for candidate boxes h.]

4 Image Classification. Feature vector Ψ(x,y,h), e.g. HOG. Score function f : Ψ(x,y,h) → (-∞, +∞). Prediction y(f); the task is to learn f. [Figure: scores f(Ψ(x, y1, h)) and f(Ψ(x, y2, h)) for candidate boxes h.]

5 Loss-based Learning. User-defined loss function Δ(y, y(f)). f* = argmin_f Σ_i Δ(y_i, y_i(f)). Minimize the loss between the predicted and ground-truth outputs. No restriction on the loss function. General framework (object detection, segmentation, …).

6 Outline: Latent SVM; Max-Margin Min-Entropy Models; Dissimilarity Coefficient Learning. (This section: Andrews et al., NIPS 2001; Smola et al., AISTATS 2005; Felzenszwalb et al., CVPR 2008; Yu and Joachims, ICML 2009.)

7 Image Classification. Training images x_i with labels y_i (Bison, Deer, Elephant, Giraffe, Llama, Rhino) and boxes h_i. Example: image x with label y = "Deer".

8 Latent SVM. Scoring function w^T Ψ(x,y,h), where w are the parameters and Ψ the features. Prediction: (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h).
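
The prediction rule above can be computed by brute-force enumeration when the label set and latent space are small. Below is a minimal sketch, assuming a user-supplied joint feature function `joint_feature(x, y, h)` that returns a NumPy vector; the names are illustrative, not from the talk.

```python
import numpy as np

def latent_svm_predict(w, x, labels, latent_vals, joint_feature):
    """(y(w), h(w)) = argmax_{y,h} w^T Psi(x, y, h), by enumeration.

    joint_feature(x, y, h) -> np.ndarray is the user-supplied Psi;
    labels and latent_vals are assumed small enough to enumerate.
    """
    best_score, best_pair = -np.inf, None
    for y in labels:
        for h in latent_vals:
            score = float(w @ joint_feature(x, y, h))
            if score > best_score:
                best_score, best_pair = score, (y, h)
    return best_pair
```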

9 Learning Latent SVM. Training data {(x_i, y_i), i = 1, 2, …, n}. Ideally w* = argmin_w Σ_i Δ(y_i, y_i(w)), but this objective is highly non-convex in w and offers no way to regularize w to prevent overfitting.

10 Learning Latent SVM. Training data {(x_i, y_i), i = 1, 2, …, n}. Upper-bounding the loss:
Δ(y_i, y_i(w)) = Δ(y_i, y_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) - w^T Ψ(x_i, y_i(w), h_i(w))
≤ Δ(y_i, y_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) - max_{h_i} w^T Ψ(x_i, y_i, h_i)
≤ max_{y,h} { w^T Ψ(x_i, y, h) + Δ(y_i, y) } - max_{h_i} w^T Ψ(x_i, y_i, h_i)

11 Learning Latent SVM. Training data {(x_i, y_i), i = 1, 2, …, n}.
min_w ||w||^2 + C Σ_i ξ_i
s.t. w^T Ψ(x_i, y, h) + Δ(y_i, y) - max_{h_i} w^T Ψ(x_i, y_i, h_i) ≤ ξ_i for all y, h.
This is a difference-of-convex program in w; CCCP reaches a local minimum or saddle point. (Self-Paced Learning, NIPS 2010.)
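
The standard way to attack this difference-of-convex program is CCCP: repeatedly impute the latent variables of the ground-truth labels with the current w (linearizing the concave part), then solve the resulting convex structural SVM. A hedged sketch of that outer loop, assuming a solver `solve_structural_svm` for the inner convex problem; the helper and its signature are placeholders, not the authors' code.

```python
def cccp_latent_svm(w0, data, latent_vals, joint_feature,
                    solve_structural_svm, C=1.0, max_iters=20, tol=1e-4):
    """CCCP outer loop for the latent SVM objective above.

    data: list of (x_i, y_i) pairs.
    solve_structural_svm: placeholder for a convex solver (cutting plane,
    subgradient, ...) that returns (w, objective_value) given fixed (y_i, h_i).
    """
    w, prev_obj = w0, float('inf')
    for _ in range(max_iters):
        # Impute h_i = argmax_h w^T Psi(x_i, y_i, h) for every training sample.
        imputed = [max(latent_vals, key=lambda h: float(w @ joint_feature(x, y, h)))
                   for (x, y) in data]
        # With (y_i, h_i) fixed, the remaining problem is a standard structural SVM.
        w, obj = solve_structural_svm(data, imputed, C, w_init=w)
        if prev_obj - obj < tol:   # stop once the upper bound no longer decreases
            break
        prev_obj = obj
    return w
```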

12 Recap. Scoring function: w^T Ψ(x,y,h). Prediction: (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h). Learning:
min_w ||w||^2 + C Σ_i ξ_i
s.t. w^T Ψ(x_i, y, h) + Δ(y_i, y) - max_{h_i} w^T Ψ(x_i, y_i, h_i) ≤ ξ_i for all y, h.

13 Image Classification. Training images x_i with labels y_i (Bison, Deer, Elephant, Giraffe, Llama, Rhino) and boxes h_i. Example: image x with label y = "Deer".

14 Image Classification. Score w^T Ψ(x,y,h) ∈ (-∞, +∞). [Figure: scores w^T Ψ(x, y1, h) for candidate boxes h.]

15 Image Classification. Score w^T Ψ(x,y,h) ∈ (-∞, +∞). [Figure: scores w^T Ψ(x, y1, h) and w^T Ψ(x, y2, h) for candidate boxes h.] Only the maximum score is used for prediction. Is there no other useful cue? The uncertainty in h is ignored.

16 Outline: Latent SVM; Max-Margin Min-Entropy (M3E) Models; Dissimilarity Coefficient Learning. (This section: Miller, Kumar, Packer, Goodman and Koller, AISTATS 2012.)

17 M3E. Scoring function: P_w(y,h|x) = exp(w^T Ψ(x,y,h)) / Z(x), where Z(x) is the partition function. Prediction: y(w) = argmin_y { H_α(P_w(h|y,x)) - log P_w(y|x) }, where H_α is the Rényi entropy and P_w(y|x) is the marginalized probability. The minimized quantity is the Rényi entropy of the generalized distribution, G_α(y; x, w).

18 Rényi Entropy.
G_α(y; x, w) = 1/(1-α) · log [ Σ_h P_w(y,h|x)^α / Σ_h P_w(y,h|x) ]
For α = 1 (in the limit), this is the Shannon entropy of the generalized distribution:
- Σ_h P_w(y,h|x) log P_w(y,h|x) / Σ_h P_w(y,h|x)

19 Rényi Entropy.
G_α(y; x, w) = 1/(1-α) · log [ Σ_h P_w(y,h|x)^α / Σ_h P_w(y,h|x) ]
As α → ∞, this becomes the minimum entropy of the generalized distribution:
- max_h log P_w(y,h|x)

20 Rényi Entropy.
G_α(y; x, w) = 1/(1-α) · log [ Σ_h P_w(y,h|x)^α / Σ_h P_w(y,h|x) ]
As α → ∞, the minimum entropy of the generalized distribution, - max_h log P_w(y,h|x), equals - max_h w^T Ψ(x,y,h) up to a constant, so minimizing it over y gives the same prediction as latent SVM.
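
A small numerical sketch of slides 18-20, assuming the scores w^T Ψ(x, y, h) have been precomputed for every (y, h): the Rényi entropy of the generalized distribution for a general α, with the α = 1 and α → ∞ limits handled explicitly, followed by the M3E prediction rule. The function names and the dict-of-arrays layout are assumptions of this sketch.

```python
import numpy as np

def renyi_generalized_entropy(q, alpha):
    """G_alpha of a generalized distribution q(h) = P_w(y, h | x), which sums to P_w(y|x) <= 1.

    alpha == 1   -> Shannon entropy of the generalized distribution
    alpha == inf -> minimum entropy, -max_h log q(h)
    """
    q = np.asarray(q, dtype=float)
    q = q[q > 0]                     # drop zero-mass entries to keep the logs finite
    mass = q.sum()
    if np.isinf(alpha):
        return -np.log(q.max())
    if alpha == 1.0:
        return -(q * np.log(q)).sum() / mass
    return np.log((q ** alpha).sum() / mass) / (1.0 - alpha)

def m3e_predict(scores, alpha):
    """scores: dict mapping each label y to an array of w^T Psi(x, y, h) over h.

    Returns y(w) = argmin_y G_alpha(y; x, w).
    """
    all_s = np.concatenate([np.asarray(s, dtype=float) for s in scores.values()])
    log_Z = np.logaddexp.reduce(all_s)   # log of the partition function Z(x)
    G = {y: renyi_generalized_entropy(np.exp(np.asarray(s, dtype=float) - log_Z), alpha)
         for y, s in scores.items()}
    return min(G, key=G.get)
```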

21 Learning M3E. Training data {(x_i, y_i), i = 1, 2, …, n}. Ideally w* = argmin_w Σ_i Δ(y_i, y_i(w)), but this objective is highly non-convex in w and offers no way to regularize w to prevent overfitting.

22 Learning M3E. Training data {(x_i, y_i), i = 1, 2, …, n}. Upper-bounding the loss:
Δ(y_i, y_i(w)) = Δ(y_i, y_i(w)) + G_α(y_i(w); x_i, w) - G_α(y_i(w); x_i, w)
≤ Δ(y_i, y_i(w)) + G_α(y_i; x_i, w) - G_α(y_i(w); x_i, w)
≤ G_α(y_i; x_i, w) + max_y { Δ(y_i, y) - G_α(y; x_i, w) }

23 Learning M3E. Training data {(x_i, y_i), i = 1, 2, …, n}.
min_w ||w||^2 + C Σ_i ξ_i
s.t. G_α(y_i; x_i, w) + Δ(y_i, y) - G_α(y; x_i, w) ≤ ξ_i for all y.
When α tends to infinity, M3E reduces to latent SVM; other values of α can give better results.
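
For a small label set the constraint can be folded into a hinge term and the objective evaluated directly, e.g. inside a generic optimizer or to check gradients numerically. A sketch assuming a callable `G_alpha(y, x, w)` that wraps the entropy computation above; this illustrates the objective on slide 23, it is not the optimizer used in the paper.

```python
def m3e_slack(w, x_i, y_i, labels, loss, G_alpha):
    """xi_i = max_y [ Delta(y_i, y) + G_alpha(y_i; x_i, w) - G_alpha(y; x_i, w) ]."""
    g_true = G_alpha(y_i, x_i, w)
    return max(loss(y_i, y) + g_true - G_alpha(y, x_i, w) for y in labels)

def m3e_objective(w, data, labels, loss, G_alpha, C=1.0):
    """||w||^2 + C * sum_i xi_i  (w is a NumPy vector, data a list of (x_i, y_i))."""
    return float(w @ w) + C * sum(m3e_slack(w, x, y, labels, loss, G_alpha)
                                  for (x, y) in data)
```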

24 Image Classification. Mammals dataset: 271 images, 6 classes, 90/10 train/test split, 5 folds, 0/1 loss.

25 Image Classification. HOG-based model (Dalal and Triggs, 2005). [Results chart.]

26 Motif Finding. UniProbe dataset: ~40,000 sequences, binding vs. not-binding, 5 proteins, 50/50 train/test split, 5 folds.

27 Motif Finding. Motif + Markov background model (Yu and Joachims, 2009). [Results chart.]

28 Recap. Scoring function: P_w(y,h|x) = exp(w^T Ψ(x,y,h)) / Z(x). Prediction: y(w) = argmin_y G_α(y; x, w). Learning:
min_w ||w||^2 + C Σ_i ξ_i
s.t. G_α(y_i; x_i, w) + Δ(y_i, y) - G_α(y; x_i, w) ≤ ξ_i for all y.

29 Outline: Latent SVM; Max-Margin Min-Entropy Models; Dissimilarity Coefficient Learning. (This section: Kumar, Packer and Koller, ICML 2012.)

30 Object Detection. Training images x_i with labels y_i (Bison, Deer, Elephant, Giraffe, Llama, Rhino) and boxes h_i. Example: image x with label y = "Deer".

31 Minimizing General Loss.
min_w Σ_{i ∈ supervised} Δ(y_i, h_i, y_i(w), h_i(w)) + Σ_{i ∈ weakly supervised} Δ'(y_i, y_i(w), h_i(w))
For supervised samples the latent variables h_i are annotated; for weakly supervised samples their values are unknown.

32 Minimizing General Loss.
min_w Σ_i Σ_{h_i} Δ(y_i, h_i, y_i(w), h_i(w)) P_w(h_i | x_i, y_i)
The unknown h_i is averaged out with P_w(h_i | x_i, y_i): a single distribution has to achieve two objectives.

33 Problem. The same distribution must both model the uncertainty in the latent variables and model the accuracy of the latent variable predictions.

34 Solution. Use two different distributions for the two different tasks: one to model the uncertainty in the latent variables, one to model the accuracy of the latent variable predictions.

35 Solution. Use two different distributions for the two different tasks. A distribution P_θ(h_i | y_i, x_i) over the latent variable h_i models the uncertainty in the latent variables; modelling the accuracy of the latent variable predictions remains.

36 Solution. Use two different distributions for the two different tasks: P_θ(h_i | y_i, x_i), a distribution over the latent variable h_i, and P_w(y_i, h_i | x_i), a distribution over (y_i, h_i) concentrated on the prediction (y_i(w), h_i(w)).

37 The Ideal Case. No latent variable uncertainty and a correct prediction: P_θ(h_i | y_i, x_i) puts all of its mass on a single latent value h_i(w), and P_w(y_i, h_i | x_i) puts all of its mass on (y_i, h_i(w)), i.e. on the ground-truth label.

38 In Practice. Restrictions in the representation power of the models: P_θ(h_i | y_i, x_i) spreads its mass over several latent values, and P_w is concentrated on a prediction (y_i(w), h_i(w)) that need not match the ground truth.

39 Our Framework. Minimize the dissimilarity between the two distributions P_θ(h_i | y_i, x_i) and P_w(y_i, h_i | x_i), for a user-defined dissimilarity measure.

40 Our Framework. Minimize Rao's dissimilarity coefficient between the two distributions. Its first term is the expected loss between them:
H_i(w, θ) = Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h | y_i, x_i)

41 Our Framework. Minimize Rao's dissimilarity coefficient. Its second term is the self-dissimilarity of P_θ, subtracted with weight β:
- β H_i(θ, θ), where H_i(θ, θ) = Σ_{h,h'} Δ(y_i, h, y_i, h') P_θ(h | y_i, x_i) P_θ(h' | y_i, x_i)
Running total: H_i(w, θ) - β H_i(θ, θ).

42 Our Framework. Minimize Rao's dissimilarity coefficient. Its third term is the self-dissimilarity of P_w,
- (1-β) Δ(y_i(w), h_i(w), y_i(w), h_i(w)),
which vanishes because the loss of an output compared against itself is zero. What remains is H_i(w, θ) - β H_i(θ, θ).

43 Our Framework. Minimize Rao's dissimilarity coefficient over all samples:
min_{w,θ} Σ_i [ H_i(w, θ) - β H_i(θ, θ) ]
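
For a small, enumerable latent space the two surviving terms can be written out directly. A sketch assuming `p_theta` is a dict mapping each latent value h to P_θ(h | y_i, x_i) and `delta` is the general loss Δ(y, h, y', h'); the third term of the coefficient, H_i(w, w), is omitted because P_w is concentrated on the prediction and the loss of an output against itself is zero.

```python
def H_cross(y_i, pred_y, pred_h, latent_vals, p_theta, delta):
    """H_i(w, theta) = sum_h Delta(y_i, h, y_i(w), h_i(w)) * P_theta(h | y_i, x_i)."""
    return sum(delta(y_i, h, pred_y, pred_h) * p_theta[h] for h in latent_vals)

def H_self(y_i, latent_vals, p_theta, delta):
    """H_i(theta, theta) = sum_{h, h'} Delta(y_i, h, y_i, h') * P_theta(h) * P_theta(h')."""
    return sum(delta(y_i, h, y_i, h2) * p_theta[h] * p_theta[h2]
               for h in latent_vals for h2 in latent_vals)

def disc_objective_i(y_i, pred_y, pred_h, latent_vals, p_theta, delta, beta):
    """Per-sample objective from slide 43: H_i(w, theta) - beta * H_i(theta, theta)."""
    return (H_cross(y_i, pred_y, pred_h, latent_vals, p_theta, delta)
            - beta * H_self(y_i, latent_vals, p_theta, delta))
```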

44 Optimization.
min_{w,θ} Σ_i [ H_i(w, θ) - β H_i(θ, θ) ]
Initialize the parameters to w_0 and θ_0.
Repeat until convergence:
  Fix w and optimize θ.
  Fix θ and optimize w.
End
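
The block-coordinate scheme written out as a loop. Here `optimize_theta` and `optimize_w` stand in for the two inner solvers described on the following slides (stochastic subgradient descent for θ, CCCP for w); they and the stopping rule are assumptions of this sketch, not the authors' released code.

```python
def disc_learn(w0, theta0, data, optimize_theta, optimize_w, objective,
               max_iters=50, tol=1e-4):
    """Alternating minimization of sum_i [ H_i(w, theta) - beta * H_i(theta, theta) ]."""
    w, theta = w0, theta0
    prev = float('inf')
    for _ in range(max_iters):
        theta = optimize_theta(w, theta, data)   # fix w, optimize theta
        w = optimize_w(w, theta, data)           # fix theta, optimize w
        obj = objective(w, theta, data)
        if prev - obj < tol:                     # stop once the objective plateaus
            break
        prev = obj
    return w, theta
```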

45 Optimization of θ.
min_θ Σ_i [ Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h | y_i, x_i) - β H_i(θ, θ) ]
Case I: y_i(w) = y_i. [Figure: P_θ(h_i | y_i, x_i) and the predicted h_i(w).]

46 Optimization of θ.
min_θ Σ_i [ Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h | y_i, x_i) - β H_i(θ, θ) ]
Case I: y_i(w) = y_i, continued. [Figure: P_θ(h_i | y_i, x_i) and the predicted h_i(w).]

47 Optimization of θ.
min_θ Σ_i [ Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h | y_i, x_i) - β H_i(θ, θ) ]
Case II: y_i(w) ≠ y_i. [Figure: P_θ(h_i | y_i, x_i).]

48 Optimization of θ.
min_θ Σ_i [ Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h | y_i, x_i) - β H_i(θ, θ) ]
Case II: y_i(w) ≠ y_i. Solved by stochastic subgradient descent.

49 Optimization of w.
min_w Σ_i Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h | y_i, x_i)
The expected loss under P_θ models the latent variable uncertainty. The form of the optimization is similar to latent SVM; if Δ is independent of h it reduces to latent SVM. Solved with the Concave-Convex Procedure (CCCP).
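
The only change relative to latent SVM learning is that the fixed loss Δ(y_i, y) becomes an expectation under P_θ. A sketch of that expected loss, which would then be plugged into the loss-augmented term of the latent-SVM-style upper bound and minimized with CCCP as before; the helper names are assumptions, not the paper's code.

```python
def expected_loss(y_i, pred_y, pred_h, latent_vals, p_theta, delta):
    """E_{h ~ P_theta(. | y_i, x_i)} [ Delta(y_i, h, pred_y, pred_h) ].

    If delta does not depend on h, this collapses to Delta(y_i, pred_y)
    because the probabilities sum to one (the latent SVM special case noted above).
    """
    return sum(p_theta[h] * delta(y_i, h, pred_y, pred_h) for h in latent_vals)
```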

50 Object Detection. Mammals dataset: input x, output y ∈ {Bison, Deer, Elephant, Giraffe, Llama, Rhino}, latent variable h (box); example output y = "Deer". Training inputs x_i with outputs y_i. 60/40 train/test split, 5 folds.

51 Results – 0/1 Loss (statistically significant). [Results chart.]

52 Results – Overlap Loss. [Results chart.]

53 Action Detection. PASCAL VOC 2011: input x, output y ∈ {Jumping, Phoning, Playing Instrument, Reading, Riding Bike, Riding Horse, Running, Taking Photo, Using Computer, Walking}, latent variable h; example output y = "Using Computer". Training inputs x_i with outputs y_i. 60/40 train/test split, 5 folds.

54 Results – 0/1 Loss (statistically significant). [Results chart.]

55 Results – Overlap Loss (statistically significant). [Results chart.]

56 Conclusions. Latent SVM ignores latent variable uncertainty. M3E handles losses that are independent of the latent variables; DISC handles losses that depend on the latent variables. Both are strict generalizations of latent SVM. Code is available online.

57 Questions? http://cvc.centrale-ponts.fr/personnel/pawan

