Loss-based Learning with Weak Supervision – M. Pawan Kumar


1 Loss-based Learning with Weak Supervision M. Pawan Kumar

2–5 Computer Vision Data: annotation detail (information) vs. dataset size (log scale). Segmentation: ~2,000; Bounding Box: ~1 M; Image-Level (“Car”, “Chair”): >14 M; Noisy Label: >6 B

6 Computer Vision Data: detailed annotation is expensive, sometimes annotation is impossible, and the desired annotation keeps changing. Therefore, learn with missing information (latent variables).

7 Outline: Two Types of Problems. Part I – Annotation Mismatch; Part II – Output Mismatch

8 Annotation Mismatch (Action Classification): input x, annotation y = “jumping”, latent h. There is a mismatch between the desired and available annotations; the exact value of the latent variable is not “important”; the desired output during test time is y.

9 Output Mismatch (Action Classification): input x, annotation y = “jumping”, latent h

10 Output Mismatch (Action Detection): input x, annotation y = “jumping”, latent h. There is a mismatch between the output and the available annotations; the exact value of the latent variable is important; the desired output during test time is (y,h).

11 Part I

12 Outline – Annotation Mismatch: Latent SVM; Optimization; Practice; Extensions. References: Andrews et al., NIPS 2001; Smola et al., AISTATS 2005; Felzenszwalb et al., CVPR 2008; Yu and Joachims, ICML 2009

13 Weakly Supervised Data: input x, output y ∈ {-1,+1}, hidden h (example shown: y = +1)

14 Weakly Supervised Classification: feature Φ(x,h), joint feature vector Ψ(x,y,h)

15 Weakly Supervised Classification: Ψ(x,+1,h) = [Φ(x,h); 0]

16 Weakly Supervised Classification: Ψ(x,-1,h) = [0; Φ(x,h)]

17 Weakly Supervised Classification: score f: Ψ(x,y,h) → (-∞, +∞); optimize the score over all possible y and h

18 Latent SVM: scoring function w^T Ψ(x,y,h) with parameters w; prediction (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h)
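A minimal sketch of the joint feature vector and this prediction rule, in Python with NumPy (the helper names are illustrative, and labels are indexed 0 and 1 rather than -1 and +1):

```python
import numpy as np

def joint_feature(phi, y, num_labels):
    """Psi(x, y, h): place Phi(x, h) in the block indexed by the label y."""
    d = len(phi)
    psi = np.zeros(num_labels * d)
    psi[y * d:(y + 1) * d] = phi
    return psi

def predict(w, x, latent_space, phi_fn, num_labels=2):
    """(y(w), h(w)) = argmax over all (y, h) of w . Psi(x, y, h)."""
    best, best_score = None, -np.inf
    for y in range(num_labels):
        for h in latent_space:
            score = float(w @ joint_feature(phi_fn(x, h), y, num_labels))
            if score > best_score:
                best, best_score = (y, h), score
    return best
```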

19 Learning Latent SVM: training data {(x_i, y_i), i = 1,2,…,n}; empirical risk minimization min_w Σ_i Δ(y_i, y_i(w)); no restriction on the loss function (annotation mismatch)

20 Learning Latent SVM: min_w Σ_i Δ(y_i, y_i(w)) is non-convex and the parameters cannot be regularized; find a regularization-sensitive upper bound

21 Learning Latent SVM: Δ(y_i, y_i(w)) = Δ(y_i, y_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) - w^T Ψ(x_i, y_i(w), h_i(w))

22 Learning Latent SVM: ≤ Δ(y_i, y_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) - max_{h_i} w^T Ψ(x_i, y_i, h_i), since (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h)

23 Learning Latent SVM: ≤ max_{y,h} {w^T Ψ(x_i, y, h) + Δ(y_i, y)} - max_{h_i} w^T Ψ(x_i, y_i, h_i) ≤ ξ_i, and learn with min_w ||w||² + C Σ_i ξ_i. Parameters can now be regularized. Is this also convex?

24 Learning Latent SVM: the first max term is convex in w, and so is the subtracted term max_{h_i} w^T Ψ(x_i, y_i, h_i), so the problem is a difference-of-convex (DC) program

25 Recap: scoring function w^T Ψ(x,y,h); prediction (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h); learning min_w ||w||² + C Σ_i ξ_i s.t. w^T Ψ(x_i,y,h) + Δ(y_i,y) - max_{h_i} w^T Ψ(x_i,y_i,h_i) ≤ ξ_i

26 Outline – Annotation Mismatch: Latent SVM; Optimization; Practice; Extensions

27 Learning Latent SVM: min_w ||w||² + C Σ_i ξ_i s.t. max_{y,h} {w^T Ψ(x_i,y,h) + Δ(y_i,y)} - max_{h_i} w^T Ψ(x_i,y_i,h_i) ≤ ξ_i, a difference-of-convex (DC) program

28–32 Concave-Convex Procedure: for each sample the objective is max_{y,h} {w^T Ψ(x_i,y,h) + Δ(y_i,y)} - max_{h_i} w^T Ψ(x_i,y_i,h_i). Repeat until convergence: (i) linearly upper-bound the concave part -max_{h_i} w^T Ψ(x_i,y_i,h_i); (ii) optimize the resulting convex upper bound. How is the linear upper bound obtained?

33 Linear Upper Bound: with the current estimate w_t, let h_i* = argmax_{h_i} w_t^T Ψ(x_i,y_i,h_i); then -w^T Ψ(x_i,y_i,h_i*) ≥ -max_{h_i} w^T Ψ(x_i,y_i,h_i), a linear (in w) upper bound on the concave part

34 CCCP for Latent SVM: start with an initial estimate w_0; repeat until convergence: (1) h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i,y_i,h_i); (2) update w_{t+1} as the ε-optimal solution of min ||w||² + C Σ_i ξ_i s.t. w^T Ψ(x_i,y_i,h_i*) - w^T Ψ(x_i,y,h) ≥ Δ(y_i,y) - ξ_i
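A schematic version of this loop, assuming a hypothetical solver solve_structural_svm for the convex inner problem (for instance a cutting-plane or subgradient method over the constraints with slacks ξ_i):

```python
import numpy as np

def cccp_latent_svm(w0, data, latent_space, psi, solve_structural_svm,
                    C=1.0, max_iter=50, eps=1e-3):
    """CCCP for latent SVM: impute h_i*, solve the convex problem, repeat."""
    w = w0
    for _ in range(max_iter):
        # Step 1: impute the latent variables with the current parameters.
        h_star = [max(latent_space, key=lambda h: float(w @ psi(x, y, h)))
                  for (x, y) in data]
        # Step 2: solve min ||w||^2 + C * sum_i xi_i subject to
        # w.Psi(x_i, y_i, h_i*) - w.Psi(x_i, y, h) >= Delta(y_i, y) - xi_i.
        w_new = solve_structural_svm(data, h_star, C, tol=eps)
        if np.linalg.norm(w_new - w) < eps:   # simple convergence check
            break
        w = w_new
    return w
```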

35 Outline – Annotation Mismatch: Latent SVM; Optimization; Practice; Extensions

36 Action Classification: input x (image), output y = “Using Computer”. PASCAL VOC 2011, 80/20 train/test split, 5 folds. Classes: Jumping, Phoning, Playing Instrument, Reading, Riding Bike, Riding Horse, Running, Taking Photo, Using Computer, Walking. Training data: inputs x_i with outputs y_i

37 Setup: 0-1 loss function; poselet-based feature vector; 4 seeds for random initialization; code + data and train/test scripts with hyperparameter settings at http://www.centrale-ponts.fr/tutorials/cvpr2013/

38 Objective

39 Train Error

40 Test Error

41 Time

42 Outline – Annotation Mismatch: Latent SVM; Optimization; Practice (Annealing the Tolerance, Annealing the Regularization, Self-Paced Learning, Choice of Loss Function); Extensions

43 CCCP for Latent SVM: start with an initial estimate w_0; repeat until convergence: h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i,y_i,h_i); update w_{t+1} as the ε-optimal solution of min ||w||² + C Σ_i ξ_i s.t. w^T Ψ(x_i,y_i,h_i*) - w^T Ψ(x_i,y,h) ≥ Δ(y_i,y) - ξ_i. Problem: overfitting in the initial iterations

44 Annealing the Tolerance: the same procedure, but update w_{t+1} as the ε'-optimal solution of min ||w||² + C Σ_i ξ_i s.t. w^T Ψ(x_i,y_i,h_i*) - w^T Ψ(x_i,y,h) ≥ Δ(y_i,y) - ξ_i, annealing the tolerance ε' ← ε'/K after each iteration until ε' = ε; repeat until convergence
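One possible reading of this schedule (an assumption, not stated on the slide): start with a loose inner tolerance ε' so the early, overfitting-prone iterations are solved only approximately, and tighten it by a factor K each iteration until the target ε is reached. A small sketch:

```python
def annealed_tolerances(eps_initial, eps, K, max_iter):
    """Tolerance schedule: divide by K each CCCP iteration, never below the target eps."""
    eps_prime, schedule = eps_initial, []
    for _ in range(max_iter):
        schedule.append(eps_prime)
        eps_prime = max(eps_prime / K, eps)
    return schedule

# annealed_tolerances(1.0, 0.001, 10, 6) -> [1.0, 0.1, 0.01, 0.001, 0.001, 0.001]
```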

45–52 Results (annealing the tolerance): objective, train error, test error, and time plots

53 Outline – Annotation Mismatch: Latent SVM; Optimization; Practice (Annealing the Tolerance, Annealing the Regularization, Self-Paced Learning, Choice of Loss Function); Extensions

54 CCCP for Latent SVM: start with an initial estimate w_0; repeat until convergence: h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i,y_i,h_i); update w_{t+1} as the ε-optimal solution of min ||w||² + C Σ_i ξ_i s.t. w^T Ψ(x_i,y_i,h_i*) - w^T Ψ(x_i,y,h) ≥ Δ(y_i,y) - ξ_i. Problem: overfitting in the initial iterations

55 Annealing the Regularization: the same procedure, but update w_{t+1} as the ε-optimal solution of min ||w||² + C' Σ_i ξ_i with the same constraints, annealing C' ← C' × K after each iteration until C' = C; repeat until convergence
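The symmetric schedule for the regularization, under the same assumption: start with a small C' so the regularizer dominates early on, and multiply by K each iteration until the target C is reached.

```python
def annealed_C(C_initial, C, K, max_iter):
    """Regularization schedule: multiply by K each CCCP iteration, capped at the target C."""
    C_prime, schedule = C_initial, []
    for _ in range(max_iter):
        schedule.append(C_prime)
        C_prime = min(C_prime * K, C)
    return schedule

# annealed_C(0.01, 10.0, 10, 5) -> [0.01, 0.1, 1.0, 10.0, 10.0]
```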

56–63 Results (annealing the regularization): objective, train error, test error, and time plots

64 Outline – Annotation Mismatch: Latent SVM; Optimization; Practice (Annealing the Tolerance, Annealing the Regularization, Self-Paced Learning, Choice of Loss Function); Extensions. Reference: Kumar, Packer and Koller, NIPS 2010

65 CCCP for Human Learning: 1 + 1 = 2, 1/3 + 1/6 = 1/2, e^{iπ} + 1 = 0. “Math is for losers!!” FAILURE … BAD LOCAL MINIMUM

66 Self-Paced Learning: 1 + 1 = 2, 1/3 + 1/6 = 1/2, e^{iπ} + 1 = 0. “Euler was a Genius!!” SUCCESS … GOOD LOCAL MINIMUM

67 Self-Paced Learning: start with “easy” examples, then consider “hard” ones. Deciding easy vs. hard by hand is expensive, and easy for a human ≠ easy for a machine. Instead, simultaneously estimate easiness and parameters; easiness is a property of data sets, not single instances

68 CCCP for Latent SVM: start with an initial estimate w_0; repeat: h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i,y_i,h_i); update w_{t+1} as the ε-optimal solution of min ||w||² + C Σ_i ξ_i s.t. w^T Ψ(x_i,y_i,h_i*) - w^T Ψ(x_i,y,h) ≥ Δ(y_i,y) - ξ_i

69 Self-Paced Learning: min ||w||² + C Σ_i ξ_i s.t. w^T Ψ(x_i,y_i,h_i*) - w^T Ψ(x_i,y,h) ≥ Δ(y_i,y,h) - ξ_i

70 Self-Paced Learning: min ||w||² + C Σ_i v_i ξ_i with the same constraints and v_i ∈ {0,1}; on its own this has the trivial solution v_i = 0 for all i

71 Self-Paced Learning: min ||w||² + C Σ_i v_i ξ_i - Σ_i v_i/K with v_i ∈ {0,1}; the optimal v_i = 1 exactly when C ξ_i < 1/K, so a large K selects only the easiest examples and a small K selects almost all of them

72 Self-Paced Learning: relax v_i ∈ [0,1]; the problem becomes biconvex and can be solved by alternating convex search

73 SPL for Latent SVM: start with an initial estimate w_0; repeat: h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i,y_i,h_i); update w_{t+1} as the ε-optimal solution of min ||w||² + C Σ_i v_i ξ_i - Σ_i v_i/K s.t. w^T Ψ(x_i,y_i,h_i*) - w^T Ψ(x_i,y,h) ≥ Δ(y_i,y) - ξ_i; decrease K (K ← K/μ for some annealing factor μ > 1), as sketched below
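A sketch of this alternating procedure, assuming a hypothetical solve_weighted_svm for the w-step; the v-step uses the closed form noted above (v_i = 1 exactly when Cξ_i < 1/K), and the annealing factor mu = 1.3 is an assumed value, not taken from the slides.

```python
import numpy as np

def self_paced_latent_svm(w0, data, labels, latent_space, psi, delta,
                          solve_weighted_svm, C=1.0, K=1.0, mu=1.3, n_iter=20):
    """Self-paced learning for latent SVM: alternate between selecting easy
    examples (v-step) and updating the parameters (w-step), relaxing K over time."""
    w = w0
    for _ in range(n_iter):
        # Impute the latent variables for the annotated labels.
        h_star = [max(latent_space, key=lambda h: float(w @ psi(x, y, h)))
                  for (x, y) in data]
        # Structured slack xi_i of each example under the current w.
        xi = [max(delta(y, yb) + float(w @ psi(x, yb, hb))
                  for yb in labels for hb in latent_space)
              - float(w @ psi(x, y, hs))
              for (x, y), hs in zip(data, h_star)]
        # v-step (closed form): keep examples whose slack is small enough.
        v = [1.0 if C * s < 1.0 / K else 0.0 for s in xi]
        # w-step: weighted structural SVM on the selected examples (hypothetical solver).
        w = solve_weighted_svm(data, h_star, v, C)
        K /= mu   # anneal K so that harder examples are admitted in later passes
    return w
```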

74–81 Results (self-paced learning): objective, train error, test error, and time plots

82 Outline – Annotation Mismatch: Latent SVM; Optimization; Practice (Annealing the Tolerance, Annealing the Regularization, Self-Paced Learning, Choice of Loss Function); Extensions. Reference: Behl, Jawahar and Kumar, In Preparation

83 Ranking: Rank 1, Rank 2, Rank 3, Rank 4, Rank 5, Rank 6; Average Precision = 1

84 Ranking: three rankings of the same six items give Average Precision = 1 (Accuracy = 1), Average Precision = 0.92 (Accuracy = 0.67), and Average Precision = 0.81; the sketch below reproduces these values
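These numbers follow from the standard average-precision computation; the sketch below assumes 6 ranked items with 3 relevant ones placed at ranks {1,2,3}, {1,2,4} and {1,3,4}, a guess at the configurations behind the three values on the slide.

```python
def average_precision(relevant_ranks):
    """AP = mean, over the relevant items, of the precision at each item's rank."""
    ranks = sorted(relevant_ranks)
    return sum((k + 1) / r for k, r in enumerate(ranks)) / len(ranks)

print(round(average_precision([1, 2, 3]), 2))  # 1.0  (all positives ranked first)
print(round(average_precision([1, 2, 4]), 2))  # 0.92
print(round(average_precision([1, 3, 4]), 2))  # 0.81
```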

85 Ranking: during testing, AP is frequently used, but during training a surrogate loss is used, which is contradictory to loss-based learning; optimize AP directly

86 Results: statistically significant improvement

87 Speed – Proximal Regularization: start with a good initial estimate w_0; repeat until convergence: h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i,y_i,h_i); update w_{t+1} as the ε-optimal solution of min ||w||² + C Σ_i ξ_i + C_t ||w - w_t||² s.t. w^T Ψ(x_i,y_i,h_i*) - w^T Ψ(x_i,y,h) ≥ Δ(y_i,y) - ξ_i

88 Speed – Cascades Weiss and Taskar, AISTATS 2010 Sapp, Toshev and Taskar, ECCV 2010

89 Accuracy – (Self) Pacing: pacing the sample complexity (NIPS 2010); pacing the model complexity; pacing the problem complexity

90 Building Accurate Systems: Model 85%, Inference 5%, Learning 10%. Learning cannot provide huge gains without a good model; inference cannot provide huge gains without a good model

91 Outline – Annotation Mismatch: Latent SVM; Optimization; Practice; Extensions (Latent Variable Dependent Loss, Max-Margin Min-Entropy Models). Reference: Yu and Joachims, ICML 2009

92 Latent Variable Dependent Loss: Δ(y_i, y_i(w), h_i(w)) = Δ(y_i, y_i(w), h_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) - w^T Ψ(x_i, y_i(w), h_i(w))

93 Latent Variable Dependent Loss: ≤ Δ(y_i, y_i(w), h_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) - max_{h_i} w^T Ψ(x_i, y_i, h_i), since (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h)

94 Latent Variable Dependent Loss: min_w ||w||² + C Σ_i ξ_i s.t. max_{y,h} {w^T Ψ(x_i,y,h) + Δ(y_i, y, h)} - max_{h_i} w^T Ψ(x_i,y_i,h_i) ≤ ξ_i

95 Optimizing Precision@k: input X = {x_i, i = 1,…,n}; annotation Y = {y_i, i = 1,…,n} ∈ {-1,+1}^n; latent H = ranking; Δ(Y*, Y, H) = 1 - Precision@k
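A minimal sketch of this loss for a predicted ranking over items with ground-truth labels in {-1,+1} (the helper name and argument layout are illustrative):

```python
def precision_at_k_loss(y_true, ranking, k):
    """Delta = 1 - Precision@k: fraction of the top-k ranked items that are not positive.
    `ranking` lists the item indices from highest to lowest score."""
    top_k = ranking[:k]
    precision = sum(1 for i in top_k if y_true[i] == +1) / k
    return 1.0 - precision

# Example: y_true = [+1, -1, +1, -1], ranking = [0, 2, 1, 3] -> loss 0.0 at k = 2.
```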

96 Outline – Annotation Mismatch: Latent SVM; Optimization; Practice; Extensions (Latent Variable Dependent Loss, Max-Margin Min-Entropy (M3E) Models). Reference: Miller, Kumar, Packer, Goodman and Koller, AISTATS 2012

97 Running vs. Jumping Classification: score w^T Ψ(x,y,h) ∈ (-∞, +∞); a table of scores w^T Ψ(x,y_1,h) over the latent values h

98 Running vs. Jumping Classification: comparing the score tables w^T Ψ(x,y_1,h) and w^T Ψ(x,y_2,h), only the maximum score is used for prediction. Is there no other useful cue? The uncertainty in h

99 M3E: scoring function P_w(y,h|x) = exp(w^T Ψ(x,y,h)) / Z(x), where Z(x) is the partition function; prediction y(w) = argmin_y {H_α(P_w(h|y,x)) - log P_w(y|x)}, combining the Rényi entropy of the latent conditional with the marginalized probability, which equals the Rényi entropy of the generalized distribution, G_α(y;x,w)

100 Rényi Entropy: G_α(y;x,w) = (1/(1-α)) log [ Σ_h P_w(y,h|x)^α / Σ_h P_w(y,h|x) ]. In the limit α = 1 this is the Shannon entropy of the generalized distribution, - Σ_h P_w(y,h|x) log P_w(y,h|x) / Σ_h P_w(y,h|x)

101 Rényi Entropy: for α = infinity it is the minimum entropy of the generalized distribution, - max_h log P_w(y,h|x)

102 Rényi Entropy: for α = infinity this reduces (up to the constant log Z(x)) to - max_h w^T Ψ(x,y,h), so M3E makes the same prediction as latent SVM
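A sketch of the M3E prediction rule built directly from these formulas, working with the per-(y,h) scores w^T Ψ(x,y,h); the partition function Z(x) shifts G_α by the same constant for every label, so it can be dropped for prediction, and for very large α the rule coincides with the latent SVM argmax. The function names and the example value of α are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def g_alpha(scores_h, alpha):
    """Renyi entropy of the generalized distribution for one label y, computed from the
    scores s[h] = w . Psi(x, y, h); requires alpha != 1 (use the Shannon limit there)."""
    s = np.asarray(scores_h, dtype=float)
    return (logsumexp(alpha * s) - logsumexp(s)) / (1.0 - alpha)

def m3e_predict(score_table, alpha=10.0):
    """y(w) = argmin_y G_alpha(y; x, w); score_table has shape (num_labels, num_h)."""
    return int(np.argmin([g_alpha(row, alpha) for row in score_table]))

# As alpha grows, g_alpha(row, alpha) -> -max(row), so the argmin over labels becomes
# argmax_y max_h of the score table, i.e. the latent SVM prediction.
```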

103 Learning M3E: training data {(x_i,y_i), i = 1,2,…,n}; w* = argmin_w Σ_i Δ(y_i, y_i(w)), which is highly non-convex in w, and w cannot be regularized to prevent overfitting

104 Learning M3E: Δ(y_i, y_i(w)) = Δ(y_i, y_i(w)) + G_α(y_i(w);x_i,w) - G_α(y_i(w);x_i,w) ≤ Δ(y_i, y_i(w)) + G_α(y_i;x_i,w) - G_α(y_i(w);x_i,w) ≤ max_y {Δ(y_i,y) + G_α(y_i;x_i,w) - G_α(y;x_i,w)}

105 Learning M3E: training data {(x_i,y_i), i = 1,2,…,n}; min_w ||w||² + C Σ_i ξ_i s.t. G_α(y_i;x_i,w) + Δ(y_i,y) - G_α(y;x_i,w) ≤ ξ_i. When α tends to infinity, M3E = latent SVM; other values of α can give better results

106 Motif Finding Results: motif + Markov background model (Yu and Joachims, 2009)

107 Part II

108 Output Mismatch (Action Detection): input x, annotation y = “jumping”, latent h. There is a mismatch between the output and the available annotations; the exact value of the latent variable is important; the desired output during test time is (y,h)

109 Outline – Output Mismatch: Problem Formulation; Dissimilarity Coefficient Learning; Optimization; Experiments

110 Weakly Supervised Data: input x, output y ∈ {0,1,…,C}, hidden h (example shown: y = 0)

111 Weakly Supervised Detection: feature Φ(x,h), joint feature vector Ψ(x,y,h)

112 Weakly Supervised Detection: Ψ(x,0,h) = [Φ(x,h); 0; …; 0]

113 Weakly Supervised Detection: Ψ(x,1,h) = [0; Φ(x,h); 0; …; 0]

114 Weakly Supervised Detection: Ψ(x,C,h) = [0; …; 0; Φ(x,h)]

115 Weakly Supervised Detection: score f: Ψ(x,y,h) → (-∞, +∞); optimize the score over all possible y and h

116 Linear Model: scoring function w^T Ψ(x,y,h) with parameters w; prediction (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h)

117 Minimizing General Loss: min_w Σ_i Δ(y_i,h_i,y_i(w),h_i(w)) over the supervised samples + Σ_i Δ'(y_i,y_i(w),h_i(w)) over the weakly supervised samples, whose latent variable values are unknown

118 Minimizing General Loss: min_w Σ_i Σ_{h_i} P_w(h_i|x_i,y_i) Δ(y_i,h_i,y_i(w),h_i(w)); a single distribution has to achieve two objectives

119 Outline – Output Mismatch: Problem Formulation; Dissimilarity Coefficient Learning; Optimization; Experiments. Reference: Kumar, Packer and Koller, ICML 2012

120 Problem: model the uncertainty in the latent variables, and model the accuracy of the latent variable predictions

121 Solution: use two different distributions for the two different tasks, one to model the uncertainty in the latent variables and one to model the accuracy of the latent variable predictions

122 Solution: P_θ(h_i|y_i,x_i), a distribution over the latent variables h_i, models the uncertainty

123 Solution: P_w(y_i,h_i|x_i), concentrated at the prediction (y_i(w), h_i(w)), models the accuracy

124 The Ideal Case: no latent variable uncertainty and a correct prediction; P_θ(h_i|y_i,x_i) concentrates on h_i(w) and P_w(y_i,h_i|x_i) on (y_i, h_i(w))

125 In Practice: restrictions in the representation power of the models; P_w(y_i,h_i|x_i) is concentrated at (y_i(w), h_i(w)) while P_θ(h_i|y_i,x_i) retains uncertainty over h_i

126 Our Framework: minimize the dissimilarity between the two distributions P_θ(h_i|y_i,x_i) and P_w(y_i,h_i|x_i), for a user-defined dissimilarity measure Δ

127 Our Framework: minimize Rao's dissimilarity coefficient, whose cross term is H_i(w,θ) = Σ_h Δ(y_i,h,y_i(w),h_i(w)) P_θ(h|y_i,x_i)

128 Our Framework: subtract β times the self-dissimilarity of P_θ, H_i(θ,θ) = Σ_{h,h'} Δ(y_i,h,y_i,h') P_θ(h|y_i,x_i) P_θ(h'|y_i,x_i)

129 Our Framework: the remaining term, (1-β) Δ(y_i(w),h_i(w),y_i(w),h_i(w)), is the self-dissimilarity of the prediction and equals zero

130 Our Framework: learn with min_{w,θ} Σ_i H_i(w,θ) - β H_i(θ,θ)
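To make the objective concrete, a small sketch that evaluates H_i(w,θ) and H_i(θ,θ) for one training example, given a user-supplied loss Δ over latent values and P_θ(h|y_i,x_i) as a vector over h (all names are illustrative; the (1-β)-weighted self-term of the prediction is omitted because it equals zero).

```python
import numpy as np

def dissimilarity_coefficient_term(delta_to_pred, delta_hh, p_theta, beta):
    """Per-example objective  H_i(w, theta) - beta * H_i(theta, theta).

    delta_to_pred[h]  = Delta(y_i, h, y_i(w), h_i(w))  for every latent value h
    delta_hh[h, h']   = Delta(y_i, h, y_i, h')         loss between latent values
    p_theta[h]        = P_theta(h | y_i, x_i)
    """
    p = np.asarray(p_theta, dtype=float)
    H_w_theta = float(np.asarray(delta_to_pred) @ p)       # expected loss w.r.t. the prediction
    H_theta_theta = float(p @ np.asarray(delta_hh) @ p)    # expected self-dissimilarity of P_theta
    return H_w_theta - beta * H_theta_theta

# Example with 3 latent values, a 0/1 loss over h, and the prediction at h = 0:
# dissimilarity_coefficient_term([0, 1, 1], 1 - np.eye(3), [0.7, 0.2, 0.1], beta=0.5)
```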

131 Outline – Output Mismatch: Problem Formulation; Dissimilarity Coefficient Learning; Optimization; Experiments

132 Optimization: min_{w,θ} Σ_i H_i(w,θ) - β H_i(θ,θ). Initialize the parameters to w_0 and θ_0; repeat until convergence: fix w and optimize θ, then fix θ and optimize w

133–134 Optimization of θ: min_θ Σ_i Σ_h Δ(y_i,h,y_i(w),h_i(w)) P_θ(h|y_i,x_i) - β H_i(θ,θ); Case I: y_i(w) = y_i

135–136 Optimization of θ: the same objective in Case II: y_i(w) ≠ y_i; solved with stochastic subgradient descent

137 Optimization of w: min_w Σ_i Σ_h Δ(y_i,h,y_i(w),h_i(w)) P_θ(h|y_i,x_i), an expected loss that models the uncertainty. The form of the optimization is similar to latent SVM (a Δ independent of h recovers latent SVM exactly) and is solved with the concave-convex procedure (CCCP)

138 Outline – Output Mismatch: Problem Formulation; Dissimilarity Coefficient Learning; Optimization; Experiments

139 Action Detection: input x (image), output y = “Using Computer”, latent variable h. PASCAL VOC 2011, 60/40 train/test split, 5 folds. Classes: Jumping, Phoning, Playing Instrument, Reading, Riding Bike, Riding Horse, Running, Taking Photo, Using Computer, Walking. Training data: inputs x_i with outputs y_i

140 Results – 0/1 Loss: statistically significant

141 Results – Overlap Loss: statistically significant

142 Questions? http://www.centrale-ponts.fr/personnel/pawan

