Loss-based Learning with Weak Supervision M. Pawan Kumar
About the Talk Methods built on the latent structured SVM A little math-heavy Work still in its initial stages
Outline Latent SSVM Ranking Brain Activation Delays in M/EEG Probabilistic Segmentation of MRI [Andrews et al., NIPS 2001; Smola et al., AISTATS 2005; Felzenszwalb et al., CVPR 2008; Yu and Joachims, ICML 2009]
Weakly Supervised Data Input x Output y ∈ {-1,+1} Hidden h [figure: an image x labelled y = +1 with a latent bounding box h]
Weakly Supervised Classification Feature Φ(x,h) Joint feature vector Ψ(x,y,h)
Weakly Supervised Classification Feature Φ(x,h) Joint feature vector Ψ(x,+1,h) = [Φ(x,h); 0]
Weakly Supervised Classification Feature Φ(x,h) Joint feature vector Ψ(x,-1,h) = [0; Φ(x,h)]
Weakly Supervised Classification Feature Φ(x,h) Joint feature vector Ψ(x,y,h) Score f: Ψ(x,y,h) → (-∞, +∞) Optimize the score over all possible y and h
Latent SSVM Scoring function w^T Ψ(x,y,h) Prediction (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h)
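A minimal sketch of this prediction rule, assuming a user-supplied joint feature map `psi` and finite candidate sets `labels` and `hidden` (all names here are hypothetical):

```python
import numpy as np

def predict(w, psi, x, labels, hidden):
    """Latent SSVM prediction: maximize the score w^T Psi(x, y, h)
    jointly over the label y and the hidden variable h by enumeration."""
    best_score, best_y, best_h = -np.inf, None, None
    for y in labels:        # e.g. {-1, +1}
        for h in hidden:    # e.g. candidate bounding boxes
            score = w @ psi(x, y, h)
            if score > best_score:
                best_score, best_y, best_h = score, y, h
    return best_y, best_h   # (y(w), h(w))
```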
Learning Latent SSVM Training data {(x_i, y_i), i = 1, 2, …, n} Minimize the empirical risk specified by the loss function: w* = argmin_w Σ_i Δ(y_i, y_i(w)) Highly non-convex in w; cannot regularize w to prevent overfitting
Learning Latent SSVM Training data {(x_i, y_i), i = 1, 2, …, n}
Δ(y_i, y_i(w)) = Δ(y_i, y_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) - w^T Ψ(x_i, y_i(w), h_i(w))
≤ Δ(y_i, y_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) - max_{h_i} w^T Ψ(x_i, y_i, h_i)
≤ max_{y,h} {Δ(y_i, y) + w^T Ψ(x_i, y, h)} - max_{h_i} w^T Ψ(x_i, y_i, h_i)
Learning Latent SSVM Training data {(x_i, y_i), i = 1, 2, …, n}
min_w ||w||^2 + C Σ_i ξ_i
s.t. w^T Ψ(x_i, y, h) + Δ(y_i, y) - max_{h_i} w^T Ψ(x_i, y_i, h_i) ≤ ξ_i, for all y, h
Difference-of-convex program in w; CCCP gives a local minimum or saddle point solution
CCCP Start with an initial estimate of w
Impute hidden variables (loss independent): h_i* = argmax_h w^T Ψ(x_i, y_i, h)
Update w (loss dependent): min_w ||w||^2 + C Σ_i ξ_i s.t. w^T Ψ(x_i, y, h) + Δ(y_i, y) - w^T Ψ(x_i, y_i, h_i*) ≤ ξ_i, for all y, h
Repeat until convergence
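A sketch of this alternating scheme; `impute` and `fit_ssvm` are hypothetical callbacks standing in for the two steps above:

```python
import numpy as np

def cccp(data, w0, impute, fit_ssvm, tol=1e-4, max_iter=50):
    """CCCP outer loop for latent SSVM learning: alternate between
    imputing the hidden variables (loss independent) and updating w
    by solving the resulting convex SSVM problem (loss dependent)."""
    w = w0
    for _ in range(max_iter):
        # h_i* = argmax_h w^T Psi(x_i, y_i, h)
        h_star = [impute(w, x, y) for x, y in data]
        # min_w ||w||^2 + C sum_i xi_i, with h_i fixed to h_i*
        w_new = fit_ssvm(data, h_star)
        if np.linalg.norm(w_new - w) < tol:  # simple convergence test
            break
        w = w_new
    return w
```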
Recap Scoring function w^T Ψ(x,y,h) Prediction (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h) Learning min_w ||w||^2 + C Σ_i ξ_i s.t. w^T Ψ(x_i, y, h) + Δ(y_i, y) - max_{h_i} w^T Ψ(x_i, y_i, h_i) ≤ ξ_i, for all y, h
Outline Latent SSVM Ranking Brain Activation Delays in M/EEG Probabilistic Segmentation of MRI [Joint work with Aseem Behl and C. V. Jawahar]
Ranking [figure: six retrieved images at ranks 1-6] Average Precision = 1
Ranking [figure: three rankings of the same six images] Average Precision = 1, Accuracy = 1; Average Precision = 0.92, Accuracy = 0.67; Average Precision = 0.81
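AP is the mean, over the positive samples, of the precision at each positive's rank. A small sketch reproducing the slide's values (the exact orderings below are assumptions consistent with those numbers):

```python
import numpy as np

def average_precision(ranked_labels):
    """AP = mean over positives of the precision at that positive's rank."""
    labels = np.asarray(ranked_labels)
    hits = np.cumsum(labels)                 # positives seen so far
    ranks = np.arange(1, len(labels) + 1)
    return (hits / ranks)[labels == 1].mean()

# Rankings of 3 positives (1) and 3 negatives (0):
print(average_precision([1, 1, 1, 0, 0, 0]))  # 1.0
print(average_precision([1, 1, 0, 1, 0, 0]))  # 0.9167 (slide's 0.92)
print(average_precision([1, 0, 1, 1, 0, 0]))  # 0.8056 (slide's 0.81)
```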
Ranking During testing, AP is frequently used During training, a surrogate loss is used This contradicts loss-based learning: optimize AP directly
Outline Latent SSVM Ranking – Supervised Learning – Weakly Supervised Learning – Latent AP-SVM – Experiments Brain Activation Delays in M/EEG Probabilistic Segmentation of MRI [Yue, Finley, Radlinski and Joachims, 2007]
Supervised Learning - Input Training images X Bounding boxes H = {H_P, H_N} for the positive set P and the negative set N
Supervised Learning - Output Ranking matrix Y Y_ik = +1 if i is better ranked than k; -1 if k is better ranked than i; 0 if i and k are ranked equally Optimal ranking Y*
SSVM Formulation Joint feature vector Ψ(X,Y,{H_P,H_N}) = (1 / (|P||N|)) Σ_{i∈P} Σ_{k∈N} Y_ik (Φ(x_i,h_i) - Φ(x_k,h_k)) Scoring function w^T Ψ(X,Y,{H_P,H_N})
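A sketch of this joint feature map in numpy, with the per-sample features stacked into matrices (the names are mine):

```python
import numpy as np

def joint_feature(Y, Phi_P, Phi_N):
    """Psi(X, Y, {H_P, H_N}): the average of Y_ik * (Phi(x_i,h_i) - Phi(x_k,h_k))
    over all positive-negative pairs (i, k).
    Y: (|P|, |N|) matrix with entries in {-1, 0, +1};
    Phi_P: (|P|, d) positive features; Phi_N: (|N|, d) negative features."""
    P, N = Phi_P.shape[0], Phi_N.shape[0]
    diff = Phi_P[:, None, :] - Phi_N[None, :, :]      # (|P|, |N|, d) pairwise
    return (Y[:, :, None] * diff).sum(axis=(0, 1)) / (P * N)
```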
Prediction using SSVM Y(w) = argmax_Y w^T Ψ(X,Y,{H_P,H_N}) Sort by the value of the sample score w^T Φ(x_i,h_i) Same as a standard binary SVM
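Since the maximizing ranking just sorts samples by score, prediction reduces to a few lines (a sketch; breaking ties in favour of positives is my assumption):

```python
import numpy as np

def predict_ranking(w, Phi_P, Phi_N):
    """SSVM prediction: Y(w) sorts samples by w^T Phi(x_i, h_i), exactly as a
    binary SVM would. Returns the (|P|, |N|) ranking matrix: Y_ik = +1 iff
    positive i scores above negative k."""
    s_P = Phi_P @ w                  # positive sample scores
    s_N = Phi_N @ w                  # negative sample scores
    return np.where(s_P[:, None] >= s_N[None, :], 1, -1)
```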
Learning SSVM min_w Δ(Y*, Y(w)) Loss = 1 - AP of the prediction
Learning SSVM Δ(Y*, Y(w)) = Δ(Y*, Y(w)) + w^T Ψ(X, Y(w), {H_P,H_N}) - w^T Ψ(X, Y(w), {H_P,H_N})
Learning SSVM Δ(Y*, Y(w)) ≤ Δ(Y*, Y(w)) + w^T Ψ(X, Y(w), {H_P,H_N}) - w^T Ψ(X, Y*, {H_P,H_N})
Learning SSVM Δ(Y*, Y(w)) ≤ max_Y {Δ(Y*, Y) + w^T Ψ(X, Y, {H_P,H_N})} - w^T Ψ(X, Y*, {H_P,H_N}) ≤ ξ min_w ||w||^2 + C ξ
Learning SSVM The maximization max_Y {Δ(Y*, Y) + w^T Ψ(X, Y, {H_P,H_N})} is the loss augmented inference problem
Loss Augmented Inference [figure: positives at ranks 1-3] Rank the positives according to their sample scores
Loss Augmented Inference [figure: positives and negatives at ranks 1-6] Rank the negatives according to their sample scores
Loss Augmented Inference [figure: positives and negatives at ranks 1-6] Slide the best negative to a higher rank; continue until the score stops increasing. Slide the next negative to a higher rank; continue until the score stops increasing. Terminate after considering the last negative. This greedy procedure is optimal loss augmented inference.
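A runnable sketch of this greedy procedure. `sp` and `sn` hold the sample scores w^T Φ(x_i, h_i) of the positives and negatives; the interleaving is encoded by `above[j]`, the number of positives ranked above negative j (the encoding and helper names are mine):

```python
import numpy as np

def objective(above, sp, sn):
    """Loss-augmented objective Delta(Y*, Y) + w^T Psi = (1 - AP) + pairwise
    score term, for the interleaving encoded by `above`."""
    P, N = len(sp), len(sn)
    pos = np.arange(1, P + 1)
    m = np.array([(above < i).sum() for i in pos])   # negatives above positive i
    ap = np.mean(pos / (pos + m))
    sign = np.where(above[None, :] >= pos[:, None], 1.0, -1.0)
    feat = (sign * (sp[:, None] - sn[None, :])).sum() / (P * N)
    return (1.0 - ap) + feat

def loss_augmented_inference(sp, sn):
    """Greedy procedure from the slide: each negative, in descending score
    order, slides up past positives while the objective keeps increasing."""
    sp, sn = np.sort(sp)[::-1], np.sort(sn)[::-1]
    P, N = len(sp), len(sn)
    above = np.full(N, P)             # start: all negatives below all positives
    for j in range(N):
        floor = above[j - 1] if j > 0 else 0   # keep negatives in score order
        while above[j] > floor:
            trial = above.copy()
            trial[j] -= 1             # slide negative j up past one positive
            if objective(trial, sp, sn) > objective(above, sp, sn):
                above = trial
            else:
                break
    return above
```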
Recap Scoring function w^T Ψ(X,Y,{H_P,H_N}) Prediction Y(w) = argmax_Y w^T Ψ(X,Y,{H_P,H_N}) Learning using optimal loss augmented inference
Outline Latent SSVM Ranking – Supervised Learning – Weakly Supervised Learning – Latent AP-SVM – Experiments Brain Activation Delays in M/EEG Probabilistic Segmentation of MRI
Weakly Supervised Learning - Input Training images X
Weakly Supervised Learning - Latent Training images X Bounding boxes H_P are latent; all bounding boxes in negative images are negative
Intuitive Prediction Procedure Select the best bounding box in every image
Intuitive Prediction Procedure [figure: best boxes at ranks 1-6] Rank them according to their sample scores
Weakly Supervised Learning - Output Ranking matrix Y Y_ik = +1 if i is better ranked than k; -1 if k is better ranked than i; 0 if i and k are ranked equally Optimal ranking Y*
Latent SSVM Formulation Joint feature vector Ψ(X,Y,{H_P,H_N}) = (1 / (|P||N|)) Σ_{i∈P} Σ_{k∈N} Y_ik (Φ(x_i,h_i) - Φ(x_k,h_k)) Scoring function w^T Ψ(X,Y,{H_P,H_N})
Prediction using Latent SSVM max_{Y,H} w^T Ψ(X,Y,{H_P,H_N})
Prediction using Latent SSVM max_{Y,H} w^T Σ_{i∈P} Σ_{k∈N} Y_ik (Φ(x_i,h_i) - Φ(x_k,h_k)) Chooses the best bounding box for positives and the worst bounding box for negatives Not what we wanted
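To see why, note that for the maximizing ranking (all Y_ik = +1) the inner maximization over H decouples per sample. A sketch (`phi` and `boxes` are hypothetical callables returning a feature vector and an image's candidate boxes):

```python
def joint_max_over_hidden(w, phi, positives, negatives, boxes):
    """With Y_ik = +1 for all pairs, each positive contributes
    +w^T Phi(x_i, h_i) (so its BEST box wins) while each negative
    contributes -w^T Phi(x_k, h_k) (so its WORST box wins)."""
    h_P = [max(boxes(x), key=lambda h: w @ phi(x, h)) for x in positives]
    h_N = [min(boxes(x), key=lambda h: w @ phi(x, h)) for x in negatives]
    return h_P, h_N
```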
Learning Latent SSVM min_w Δ(Y*, Y(w)) Loss = 1 - AP of the prediction
Learning Latent SSVM Δ(Y*,Y(w)) = Δ(Y*,Y(w)) + w^T Ψ(X,Y(w),{H_P(w),H_N(w)}) - w^T Ψ(X,Y(w),{H_P(w),H_N(w)})
Learning Latent SSVM Δ(Y*,Y(w)) ≤ Δ(Y*,Y(w)) + w^T Ψ(X,Y(w),{H_P(w),H_N(w)}) - max_H w^T Ψ(X,Y*,{H_P,H_N})
Learning Latent SSVM Δ(Y*,Y(w)) ≤ max_{Y,H} {Δ(Y*,Y) + w^T Ψ(X,Y,{H_P,H_N})} - max_H w^T Ψ(X,Y*,{H_P,H_N}) ≤ ξ min_w ||w||^2 + C ξ
Learning Latent SSVM The loss augmented inference max_{Y,H} {Δ(Y*,Y) + w^T Ψ(X,Y,{H_P,H_N})} cannot be solved optimally
Recap Unintuitive prediction Unintuitive objective function Non-optimal loss augmented inference Can we do better?
Outline Latent SSVM Ranking – Supervised Learning – Weakly Supervised Learning – Latent AP-SVM – Experiments Brain Activation Delays in M/EEG Probabilistic Segmentation of MRI
Latent AP-SVM Formulation Joint feature vector Ψ(X,Y,{H_P,H_N}) = (1 / (|P||N|)) Σ_{i∈P} Σ_{k∈N} Y_ik (Φ(x_i,h_i) - Φ(x_k,h_k)) Scoring function w^T Ψ(X,Y,{H_P,H_N})
Prediction using Latent AP-SVM Choose the best bounding box for all samples: h_i(w) = argmax_h w^T Φ(x_i,h) Optimize over the ranking: Y(w) = argmax_Y w^T Ψ(X,Y,{H_P(w),H_N(w)}) Sort by sample scores
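A sketch of this two-step prediction (`phi` and `boxes` are again hypothetical):

```python
import numpy as np

def predict_latent_ap_svm(w, phi, images, boxes):
    """Latent AP-SVM prediction: pick the best box for EVERY image,
    h_i(w) = argmax_h w^T Phi(x_i, h), then rank images by that score."""
    best = np.array([max(w @ phi(x, h) for h in boxes(x)) for x in images])
    return np.argsort(-best)    # Y(w): sort by descending best-box score
```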
Learning Latent AP-SVM min_w Δ(Y*, Y(w)) Loss = 1 - AP of the prediction
Learning Latent AP-SVM Δ(Y*,Y(w)) = Δ(Y*,Y(w)) + w^T Ψ(X,Y(w),{H_P(w),H_N(w)}) - w^T Ψ(X,Y(w),{H_P(w),H_N(w)})
Learning Latent AP-SVM Δ(Y*,Y(w)) ≤ Δ(Y*,Y(w)) + w^T Ψ(X,Y(w),{H_P(w),H_N(w)}) - w^T Ψ(X,Y*,{H_P(w),H_N(w)})
Learning Latent AP-SVM Δ(Y*,Y(w)) ≤ max_{Y,H_N} {Δ(Y*,Y) + w^T Ψ(X,Y,{H_P(w),H_N}) - w^T Ψ(X,Y*,{H_P(w),H_N})}
Learning Latent AP-SVM min_{H_P} max_{Y,H_N} {Δ(Y*,Y) + w^T Ψ(X,Y,{H_P,H_N}) - w^T Ψ(X,Y*,{H_P,H_N})} ≤ ξ min_w ||w||^2 + C ξ H_P(w) minimizes the above upper bound
CCCP Start with an initial estimate of w Impute hidden variables Update w Repeat until convergence
Imputing Hidden Variables Choose the best bounding boxes according to sample score The above algorithm is optimal
Loss Augmented Inference Choose the best bounding boxes according to sample score
Loss Augmented Inference [figure: positives and negatives at ranks 1-6] Slide the best negative to a higher rank; continue until the score stops increasing. Slide the next negative to a higher rank; continue until the score stops increasing. Terminate after considering the last negative. This gives optimal loss augmented inference, as in the supervised case.
Recap Intuitive prediction Intuitive objective function Optimal loss augmented inference Performance in practice?
Outline Latent SSVM Ranking – Supervised Learning – Weakly Supervised Learning – Latent AP-SVM – Experiments Brain Activation Delays in M/EEG Probabilistic Segmentation of MRI
Dataset VOC 2011 action classification 10 action classes + other 2424 ‘trainval’ images, 2424 ‘test’ images – Hidden annotations – Evaluated using a remote server – Only AP values are computed
Baselines Latent SSVM with 0/1 loss (latent SVM) – Relative loss weight C – Relative positive sample weight J – Robustness threshold K Latent SSVM with AP loss (latent SSVM) – Relative loss weight C – Approximate greedy inference algorithm 5 random initializations 5-fold cross-validation (80-20 split)
Cross-Validation Statistically significant improvement
Test
Outline Latent SSVM Ranking Brain Activation Delays in M/EEG Probabilistic Segmentation of MRI [Joint work with Wojciech Zaremba, Alexander Gramfort and Matthew Blaschko; IPMI 2013]
M/EEG Data
M/EEG Data Faster activation (familiar with the task)
M/EEG Data Slower activation (bored with task)
Classifying M/EEG Data Statistically significant improvement
Functional Connectivity Visual cortex → deep subcortical source Visual cortex → higher-level cognitive processing Connected components have similar delays
Outline Latent SSVM Ranking Brain Activation Delays in M/EEG Probabilistic Segmentation of MRI [Joint work with Pierre-Yves Baudin, Danny Goodman, Puneet Kumar, Nikos Paragios, Noura Azzabou and Pierre Carlier; MICCAI 2013]
Training Data Annotators provide ‘hard’ segmentation
Training Data Annotators provide a ‘hard’ segmentation Random Walks provides a ‘soft’ segmentation Which ‘soft’ segmentation is best?
Segmentation Statistically significant improvement
To Conclude … The choice of loss function matters during training Many interesting latent variables – Computer Vision (onerous annotations) – Medical Imaging (impossible annotations) Large-scale experiments – Other problems – General losses – Efficient optimization
Questions?
SPLENDID: Self-Paced Learning for Exploiting Noisy, Diverse or Incomplete Data Nikos Paragios (Equipe Galen, INRIA Saclay) and Daphne Koller (DAGS, Stanford) Machine learning with weak and noisy annotations; applications in computer vision and medical imaging Visits between INRIA Saclay and Stanford University