
1 Loss-based Learning with Latent Variables. M. Pawan Kumar, École Centrale Paris / École des Ponts ParisTech / INRIA Saclay, Île-de-France. Joint work with Ben Packer, Daphne Koller, Kevin Miller and Danny Goodman.

2 Image Classification. Training images x_i with labels y_i (Bison, Deer, Elephant, Giraffe, Llama, Rhino) and boxes h_i. Example: image x with label y = "Deer".

3 Image Classification. Feature vector Ψ(x,y,h), e.g. HOG. Score function f : Ψ(x,y,h) → (-∞, +∞). [Figure: scores f(Ψ(x, y1, h)) for candidate boxes h.]

4 Image Classification. Feature vector Ψ(x,y,h), e.g. HOG. Score function f : Ψ(x,y,h) → (-∞, +∞). Prediction y(f); the task is to learn f. [Figure: scores f(Ψ(x, y1, h)) and f(Ψ(x, y2, h)) for candidate boxes h.]

5 Loss-based Learning. User-defined loss function Δ(y, y(f)). f* = argmin_f Σ_i Δ(y_i, y_i(f)). Minimize the loss between the predicted and ground-truth outputs. No restriction on the loss function. General framework (object detection, segmentation, …).

6 Outline: Latent SVM; Max-Margin Min-Entropy Models; Dissimilarity Coefficient Learning. (This section: Andrews et al., NIPS 2001; Smola et al., AISTATS 2005; Felzenszwalb et al., CVPR 2008; Yu and Joachims, ICML 2009.)

7 Image Classification. Training images x_i with labels y_i (Bison, Deer, Elephant, Giraffe, Llama, Rhino) and boxes h_i. Example: image x with label y = "Deer".

8 Latent SVM. Scoring function w^T Ψ(x,y,h), where w are the parameters and Ψ the features. Prediction: (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h).
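
The prediction rule above can be computed by brute-force enumeration when the label set and latent space are small. Below is a minimal sketch, assuming a user-supplied joint feature function `joint_feature(x, y, h)` that returns a NumPy vector; the names are illustrative, not from the talk.

```python
import numpy as np

def latent_svm_predict(w, x, labels, latent_vals, joint_feature):
    """(y(w), h(w)) = argmax_{y,h} w^T Psi(x, y, h), by enumeration.

    joint_feature(x, y, h) -> np.ndarray is the user-supplied Psi;
    labels and latent_vals are assumed small enough to enumerate.
    """
    best_score, best_pair = -np.inf, None
    for y in labels:
        for h in latent_vals:
            score = float(w @ joint_feature(x, y, h))
            if score > best_score:
                best_score, best_pair = score, (y, h)
    return best_pair
```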

9 Learning Latent SVM. Training data {(x_i, y_i), i = 1, 2, …, n}. Ideally w* = argmin_w Σ_i Δ(y_i, y_i(w)), but this objective is highly non-convex in w and offers no way to regularize w to prevent overfitting.

10 Learning Latent SVM. Training data {(x_i, y_i), i = 1, 2, …, n}. Upper-bounding the loss:
Δ(y_i, y_i(w)) = Δ(y_i, y_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) - w^T Ψ(x_i, y_i(w), h_i(w))
≤ Δ(y_i, y_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) - max_{h_i} w^T Ψ(x_i, y_i, h_i)
≤ max_{y,h} { w^T Ψ(x_i, y, h) + Δ(y_i, y) } - max_{h_i} w^T Ψ(x_i, y_i, h_i)

11 Learning Latent SVM. Training data {(x_i, y_i), i = 1, 2, …, n}.
min_w ||w||^2 + C Σ_i ξ_i
s.t. w^T Ψ(x_i, y, h) + Δ(y_i, y) - max_{h_i} w^T Ψ(x_i, y_i, h_i) ≤ ξ_i for all y, h.
This is a difference-of-convex program in w; CCCP reaches a local minimum or saddle point. (Self-Paced Learning, NIPS 2010.)
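
The standard way to attack this difference-of-convex program is CCCP: repeatedly impute the latent variables of the ground-truth labels with the current w (linearizing the concave part), then solve the resulting convex structural SVM. A hedged sketch of that outer loop, assuming a solver `solve_structural_svm` for the inner convex problem; the helper and its signature are placeholders, not the authors' code.

```python
def cccp_latent_svm(w0, data, latent_vals, joint_feature,
                    solve_structural_svm, C=1.0, max_iters=20, tol=1e-4):
    """CCCP outer loop for the latent SVM objective above.

    data: list of (x_i, y_i) pairs.
    solve_structural_svm: placeholder for a convex solver (cutting plane,
    subgradient, ...) that returns (w, objective_value) given fixed (y_i, h_i).
    """
    w, prev_obj = w0, float('inf')
    for _ in range(max_iters):
        # Impute h_i = argmax_h w^T Psi(x_i, y_i, h) for every training sample.
        imputed = [max(latent_vals, key=lambda h: float(w @ joint_feature(x, y, h)))
                   for (x, y) in data]
        # With (y_i, h_i) fixed, the remaining problem is a standard structural SVM.
        w, obj = solve_structural_svm(data, imputed, C, w_init=w)
        if prev_obj - obj < tol:   # stop once the upper bound no longer decreases
            break
        prev_obj = obj
    return w
```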

12 Recap. Scoring function: w^T Ψ(x,y,h). Prediction: (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h). Learning:
min_w ||w||^2 + C Σ_i ξ_i
s.t. w^T Ψ(x_i, y, h) + Δ(y_i, y) - max_{h_i} w^T Ψ(x_i, y_i, h_i) ≤ ξ_i for all y, h.

13 Image Classification. Training images x_i with labels y_i (Bison, Deer, Elephant, Giraffe, Llama, Rhino) and boxes h_i. Example: image x with label y = "Deer".

14 Image Classification. Score w^T Ψ(x,y,h) ∈ (-∞, +∞). [Figure: scores w^T Ψ(x, y1, h) for candidate boxes h.]

15 Image Classification. Score w^T Ψ(x,y,h) ∈ (-∞, +∞). [Figure: scores w^T Ψ(x, y1, h) and w^T Ψ(x, y2, h) for candidate boxes h.] Only the maximum score is used for prediction. Is there no other useful cue? The uncertainty in h is ignored.

16 Outline: Latent SVM; Max-Margin Min-Entropy (M3E) Models; Dissimilarity Coefficient Learning. (This section: Miller, Kumar, Packer, Goodman and Koller, AISTATS 2012.)

17 M3E. Scoring function: P_w(y,h|x) = exp(w^T Ψ(x,y,h)) / Z(x), where Z(x) is the partition function. Prediction: y(w) = argmin_y { H_α(P_w(h|y,x)) - log P_w(y|x) }, where H_α is the Rényi entropy and P_w(y|x) is the marginalized probability. The minimized quantity is the Rényi entropy of the generalized distribution, G_α(y; x, w).

18 Rényi Entropy.
G_α(y; x, w) = 1/(1-α) · log [ Σ_h P_w(y,h|x)^α / Σ_h P_w(y,h|x) ]
For α = 1 (in the limit), this is the Shannon entropy of the generalized distribution:
- Σ_h P_w(y,h|x) log P_w(y,h|x) / Σ_h P_w(y,h|x)

19 Rényi Entropy.
G_α(y; x, w) = 1/(1-α) · log [ Σ_h P_w(y,h|x)^α / Σ_h P_w(y,h|x) ]
As α → ∞, this becomes the minimum entropy of the generalized distribution:
- max_h log P_w(y,h|x)

20 Rényi Entropy.
G_α(y; x, w) = 1/(1-α) · log [ Σ_h P_w(y,h|x)^α / Σ_h P_w(y,h|x) ]
As α → ∞, the minimum entropy of the generalized distribution, - max_h log P_w(y,h|x), equals - max_h w^T Ψ(x,y,h) up to a constant, so minimizing it over y gives the same prediction as latent SVM.
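
A small numerical sketch of slides 18-20, assuming the scores w^T Ψ(x, y, h) have been precomputed for every (y, h): the Rényi entropy of the generalized distribution for a general α, with the α = 1 and α → ∞ limits handled explicitly, followed by the M3E prediction rule. The function names and the dict-of-arrays layout are assumptions of this sketch.

```python
import numpy as np

def renyi_generalized_entropy(q, alpha):
    """G_alpha of a generalized distribution q(h) = P_w(y, h | x), which sums to P_w(y|x) <= 1.

    alpha == 1   -> Shannon entropy of the generalized distribution
    alpha == inf -> minimum entropy, -max_h log q(h)
    """
    q = np.asarray(q, dtype=float)
    q = q[q > 0]                     # drop zero-mass entries to keep the logs finite
    mass = q.sum()
    if np.isinf(alpha):
        return -np.log(q.max())
    if alpha == 1.0:
        return -(q * np.log(q)).sum() / mass
    return np.log((q ** alpha).sum() / mass) / (1.0 - alpha)

def m3e_predict(scores, alpha):
    """scores: dict mapping each label y to an array of w^T Psi(x, y, h) over h.

    Returns y(w) = argmin_y G_alpha(y; x, w).
    """
    all_s = np.concatenate([np.asarray(s, dtype=float) for s in scores.values()])
    log_Z = np.logaddexp.reduce(all_s)   # log of the partition function Z(x)
    G = {y: renyi_generalized_entropy(np.exp(np.asarray(s, dtype=float) - log_Z), alpha)
         for y, s in scores.items()}
    return min(G, key=G.get)
```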

21 Learning M3E. Training data {(x_i, y_i), i = 1, 2, …, n}. Ideally w* = argmin_w Σ_i Δ(y_i, y_i(w)), but this objective is highly non-convex in w and offers no way to regularize w to prevent overfitting.

22 Learning M3E. Training data {(x_i, y_i), i = 1, 2, …, n}. Upper-bounding the loss:
Δ(y_i, y_i(w)) = Δ(y_i, y_i(w)) + G_α(y_i(w); x_i, w) - G_α(y_i(w); x_i, w)
≤ Δ(y_i, y_i(w)) + G_α(y_i; x_i, w) - G_α(y_i(w); x_i, w)
≤ G_α(y_i; x_i, w) + max_y { Δ(y_i, y) - G_α(y; x_i, w) }

23 Learning M3E. Training data {(x_i, y_i), i = 1, 2, …, n}.
min_w ||w||^2 + C Σ_i ξ_i
s.t. G_α(y_i; x_i, w) + Δ(y_i, y) - G_α(y; x_i, w) ≤ ξ_i for all y.
When α tends to infinity, M3E reduces to latent SVM; other values of α can give better results.
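
For a small label set the constraint can be folded into a hinge term and the objective evaluated directly, e.g. inside a generic optimizer or to check gradients numerically. A sketch assuming a callable `G_alpha(y, x, w)` that wraps the entropy computation above; this illustrates the objective on slide 23, it is not the optimizer used in the paper.

```python
def m3e_slack(w, x_i, y_i, labels, loss, G_alpha):
    """xi_i = max_y [ Delta(y_i, y) + G_alpha(y_i; x_i, w) - G_alpha(y; x_i, w) ]."""
    g_true = G_alpha(y_i, x_i, w)
    return max(loss(y_i, y) + g_true - G_alpha(y, x_i, w) for y in labels)

def m3e_objective(w, data, labels, loss, G_alpha, C=1.0):
    """||w||^2 + C * sum_i xi_i  (w is a NumPy vector, data a list of (x_i, y_i))."""
    return float(w @ w) + C * sum(m3e_slack(w, x, y, labels, loss, G_alpha)
                                  for (x, y) in data)
```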

24 Image Classification. Mammals dataset: 271 images, 6 classes, 90/10 train/test split, 5 folds, 0/1 loss.

25 Image Classification. HOG-based model (Dalal and Triggs, 2005). [Results chart.]

26 Motif Finding. UniProbe dataset: ~40,000 sequences, binding vs. not-binding, 5 proteins, 50/50 train/test split, 5 folds.

27 Motif Finding. Motif + Markov background model (Yu and Joachims, 2009). [Results chart.]

28 Recap. Scoring function: P_w(y,h|x) = exp(w^T Ψ(x,y,h)) / Z(x). Prediction: y(w) = argmin_y G_α(y; x, w). Learning:
min_w ||w||^2 + C Σ_i ξ_i
s.t. G_α(y_i; x_i, w) + Δ(y_i, y) - G_α(y; x_i, w) ≤ ξ_i for all y.

29 Outline: Latent SVM; Max-Margin Min-Entropy Models; Dissimilarity Coefficient Learning. (This section: Kumar, Packer and Koller, ICML 2012.)

30 Object Detection. Training images x_i with labels y_i (Bison, Deer, Elephant, Giraffe, Llama, Rhino) and boxes h_i. Example: image x with label y = "Deer".

31 Minimizing General Loss.
min_w Σ_{i ∈ supervised} Δ(y_i, h_i, y_i(w), h_i(w)) + Σ_{i ∈ weakly supervised} Δ'(y_i, y_i(w), h_i(w))
For supervised samples the latent variables h_i are annotated; for weakly supervised samples their values are unknown.

32 Minimizing General Loss.
min_w Σ_i Σ_{h_i} Δ(y_i, h_i, y_i(w), h_i(w)) P_w(h_i | x_i, y_i)
The unknown h_i is averaged out with P_w(h_i | x_i, y_i): a single distribution has to achieve two objectives.

33 Problem. The same distribution must both model the uncertainty in the latent variables and model the accuracy of the latent variable predictions.

34 Solution. Use two different distributions for the two different tasks: one to model the uncertainty in the latent variables, one to model the accuracy of the latent variable predictions.

35 Solution. Use two different distributions for the two different tasks. A distribution P_θ(h_i | y_i, x_i) over the latent variable h_i models the uncertainty in the latent variables; modelling the accuracy of the latent variable predictions remains.

36 Solution. Use two different distributions for the two different tasks: P_θ(h_i | y_i, x_i), a distribution over the latent variable h_i, and P_w(y_i, h_i | x_i), a distribution over (y_i, h_i) concentrated on the prediction (y_i(w), h_i(w)).

37 The Ideal Case. No latent variable uncertainty and a correct prediction: P_θ(h_i | y_i, x_i) puts all of its mass on a single latent value h_i(w), and P_w(y_i, h_i | x_i) puts all of its mass on (y_i, h_i(w)), i.e. on the ground-truth label.

38 In Practice. Restrictions in the representation power of the models: P_θ(h_i | y_i, x_i) spreads its mass over several latent values, and P_w is concentrated on a prediction (y_i(w), h_i(w)) that need not match the ground truth.

39 Our Framework. Minimize the dissimilarity between the two distributions P_θ(h_i | y_i, x_i) and P_w(y_i, h_i | x_i), for a user-defined dissimilarity measure.

40 Our Framework. Minimize Rao's dissimilarity coefficient between the two distributions. Its first term is the expected loss between them:
H_i(w, θ) = Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h | y_i, x_i)

41 Our Framework. Minimize Rao's dissimilarity coefficient. Its second term is the self-dissimilarity of P_θ, subtracted with weight β:
- β H_i(θ, θ), where H_i(θ, θ) = Σ_{h,h'} Δ(y_i, h, y_i, h') P_θ(h | y_i, x_i) P_θ(h' | y_i, x_i)
Running total: H_i(w, θ) - β H_i(θ, θ).

42 Our Framework. Minimize Rao's dissimilarity coefficient. Its third term is the self-dissimilarity of P_w,
- (1-β) Δ(y_i(w), h_i(w), y_i(w), h_i(w)),
which vanishes because the loss of an output compared against itself is zero. What remains is H_i(w, θ) - β H_i(θ, θ).

43 Our Framework. Minimize Rao's dissimilarity coefficient over all samples:
min_{w,θ} Σ_i [ H_i(w, θ) - β H_i(θ, θ) ]
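
For a small, enumerable latent space the two surviving terms can be written out directly. A sketch assuming `p_theta` is a dict mapping each latent value h to P_θ(h | y_i, x_i) and `delta` is the general loss Δ(y, h, y', h'); the third term of the coefficient, H_i(w, w), is omitted because P_w is concentrated on the prediction and the loss of an output against itself is zero.

```python
def H_cross(y_i, pred_y, pred_h, latent_vals, p_theta, delta):
    """H_i(w, theta) = sum_h Delta(y_i, h, y_i(w), h_i(w)) * P_theta(h | y_i, x_i)."""
    return sum(delta(y_i, h, pred_y, pred_h) * p_theta[h] for h in latent_vals)

def H_self(y_i, latent_vals, p_theta, delta):
    """H_i(theta, theta) = sum_{h, h'} Delta(y_i, h, y_i, h') * P_theta(h) * P_theta(h')."""
    return sum(delta(y_i, h, y_i, h2) * p_theta[h] * p_theta[h2]
               for h in latent_vals for h2 in latent_vals)

def disc_objective_i(y_i, pred_y, pred_h, latent_vals, p_theta, delta, beta):
    """Per-sample objective from slide 43: H_i(w, theta) - beta * H_i(theta, theta)."""
    return (H_cross(y_i, pred_y, pred_h, latent_vals, p_theta, delta)
            - beta * H_self(y_i, latent_vals, p_theta, delta))
```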

44 Optimization.
min_{w,θ} Σ_i [ H_i(w, θ) - β H_i(θ, θ) ]
Initialize the parameters to w_0 and θ_0.
Repeat until convergence:
  Fix w and optimize θ.
  Fix θ and optimize w.
End
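
The block-coordinate scheme written out as a loop. Here `optimize_theta` and `optimize_w` stand in for the two inner solvers described on the following slides (stochastic subgradient descent for θ, CCCP for w); they and the stopping rule are assumptions of this sketch, not the authors' released code.

```python
def disc_learn(w0, theta0, data, optimize_theta, optimize_w, objective,
               max_iters=50, tol=1e-4):
    """Alternating minimization of sum_i [ H_i(w, theta) - beta * H_i(theta, theta) ]."""
    w, theta = w0, theta0
    prev = float('inf')
    for _ in range(max_iters):
        theta = optimize_theta(w, theta, data)   # fix w, optimize theta
        w = optimize_w(w, theta, data)           # fix theta, optimize w
        obj = objective(w, theta, data)
        if prev - obj < tol:                     # stop once the objective plateaus
            break
        prev = obj
    return w, theta
```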

45 Optimization of θ.
min_θ Σ_i [ Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h | y_i, x_i) - β H_i(θ, θ) ]
Case I: y_i(w) = y_i. [Figure: P_θ(h_i | y_i, x_i) and the predicted h_i(w).]

46 Optimization of θ.
min_θ Σ_i [ Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h | y_i, x_i) - β H_i(θ, θ) ]
Case I: y_i(w) = y_i, continued. [Figure: P_θ(h_i | y_i, x_i) and the predicted h_i(w).]

47 Optimization of θ.
min_θ Σ_i [ Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h | y_i, x_i) - β H_i(θ, θ) ]
Case II: y_i(w) ≠ y_i. [Figure: P_θ(h_i | y_i, x_i).]

48 Optimization of θ.
min_θ Σ_i [ Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h | y_i, x_i) - β H_i(θ, θ) ]
Case II: y_i(w) ≠ y_i. Solved by stochastic subgradient descent.

49 Optimization of w.
min_w Σ_i Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h | y_i, x_i)
The expected loss under P_θ models the latent variable uncertainty. The form of the optimization is similar to latent SVM; if Δ is independent of h it reduces to latent SVM. Solved with the Concave-Convex Procedure (CCCP).
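
The only change relative to latent SVM learning is that the fixed loss Δ(y_i, y) becomes an expectation under P_θ. A sketch of that expected loss, which would then be plugged into the loss-augmented term of the latent-SVM-style upper bound and minimized with CCCP as before; the helper names are assumptions, not the paper's code.

```python
def expected_loss(y_i, pred_y, pred_h, latent_vals, p_theta, delta):
    """E_{h ~ P_theta(. | y_i, x_i)} [ Delta(y_i, h, pred_y, pred_h) ].

    If delta does not depend on h, this collapses to Delta(y_i, pred_y)
    because the probabilities sum to one (the latent SVM special case noted above).
    """
    return sum(p_theta[h] * delta(y_i, h, pred_y, pred_h) for h in latent_vals)
```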

50 Object Detection. Mammals dataset: input x, output y ∈ {Bison, Deer, Elephant, Giraffe, Llama, Rhino}, latent variable h (box); example output y = "Deer". Training inputs x_i with outputs y_i. 60/40 train/test split, 5 folds.

51 Results – 0/1 Loss (statistically significant). [Results chart.]

52 Results – Overlap Loss. [Results chart.]

53 Action Detection. PASCAL VOC 2011: input x, output y ∈ {Jumping, Phoning, Playing Instrument, Reading, Riding Bike, Riding Horse, Running, Taking Photo, Using Computer, Walking}, latent variable h; example output y = "Using Computer". Training inputs x_i with outputs y_i. 60/40 train/test split, 5 folds.

54 Results – 0/1 Loss (statistically significant). [Results chart.]

55 Results – Overlap Loss (statistically significant). [Results chart.]

56 Conclusions. Latent SVM ignores latent variable uncertainty. M3E handles losses that are independent of the latent variables; DISC handles losses that depend on the latent variables. Both are strict generalizations of latent SVM. Code is available online.

57 Questions? http://cvc.centrale-ponts.fr/personnel/pawan

