Discriminative Machine Learning – Topic 4: Weak Supervision
M. Pawan Kumar
Slides available online
Computer Vision Data: Information vs. Log(Size)
–Segmentation: ~ 2,000
–Bounding Box: ~ 1 M
–Image-Level (“Car”, “Chair”): > 14 M
–Noisy Label: > 6 B
Computer Vision Data: learn with missing information (latent variables)
–Detailed annotation is expensive
–Sometimes annotation is impossible
–Desired annotation keeps changing
Annotation Mismatch (Action Classification)
Input x, annotation y = “jumping”, latent variable h
Mismatch between desired and available annotations; the exact value of the latent variable is not “important”
Desired output during test time is y
Output Mismatch (Action Detection)
Input x, annotation y = “jumping”, latent variable h
Mismatch between the desired output and the available annotations; the exact value of the latent variable is important
Desired output during test time is (y, h)
Annotation Mismatch (Action Classification)
We will focus on this case; output mismatch is out of scope
Desired output during test time is y
Outline
–Latent SVM
–Optimization
–Practice
Andrews et al., NIPS 2001; Smola et al., AISTATS 2005; Felzenszwalb et al., CVPR 2008; Yu and Joachims, ICML 2009
Weakly Supervised Data
Input x, output y ∈ {-1, +1}, hidden variable h
Example: an image x with y = +1 and latent h
Weakly Supervised Classification
Feature vector Φ(x, h); joint feature vector Ψ(x, y, h):
Ψ(x, +1, h) = [Φ(x, h); 0]
Ψ(x, -1, h) = [0; Φ(x, h)]
Weakly Supervised Classification
Score f: Ψ(x, y, h) → (-∞, +∞)
Optimize the score over all possible y and h
Latent SVM
Parameters w; scoring function wᵀΨ(x, y, h)
Prediction: (y(w), h(w)) = argmax_{y,h} wᵀΨ(x, y, h)
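To make the joint feature construction and the argmax prediction rule concrete, here is a minimal Python sketch for the binary case above. The feature map `phi`, the candidate latent set `H(x)`, and the feature dimension `d` are illustrative assumptions, not something specified on the slides.

```python
import numpy as np

def joint_feature(phi, x, y, h, d):
    """Psi(x, y, h): place Phi(x, h) in the block selected by the binary label y."""
    psi = np.zeros(2 * d)
    block = 0 if y == +1 else 1          # y = +1 -> first block, y = -1 -> second block
    psi[block * d:(block + 1) * d] = phi(x, h)
    return psi

def predict(w, phi, x, H, d):
    """(y(w), h(w)) = argmax over y and h of w^T Psi(x, y, h)."""
    best, best_score = None, -np.inf
    for y in (+1, -1):
        for h in H(x):                   # H(x): candidate latent values for input x
            score = w @ joint_feature(phi, x, y, h, d)
            if score > best_score:
                best, best_score = (y, h), score
    return best
```

Enumerating H(x) is only feasible when the latent space is small; in practice the argmax is computed by a problem-specific inference routine.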
Learning Latent SVM
Training data {(x_i, y_i), i = 1, 2, …, n}
Empirical risk minimization: min_w Σ_i Δ(y_i, y_i(w))
No restriction on the loss function Δ (annotation mismatch)
Learning Latent SVM
min_w Σ_i Δ(y_i, y_i(w)) is non-convex, and the parameters cannot be regularized
Find a regularization-sensitive upper bound
Learning Latent SVM
Δ(y_i, y_i(w)) = Δ(y_i, y_i(w)) + wᵀΨ(x_i, y_i(w), h_i(w)) - wᵀΨ(x_i, y_i(w), h_i(w))
             ≤ Δ(y_i, y_i(w)) + wᵀΨ(x_i, y_i(w), h_i(w)) - max_{h_i} wᵀΨ(x_i, y_i, h_i)
since (y(w), h(w)) = argmax_{y,h} wᵀΨ(x, y, h)
Learning Latent SVM
min_w ||w||² + C Σ_i ξ_i
s.t. max_{y,h} [Δ(y_i, y) + wᵀΨ(x_i, y, h)] - max_{h_i} wᵀΨ(x_i, y_i, h_i) ≤ ξ_i
Parameters can now be regularized. Is this also convex?
Learning Latent SVM
Each constraint is the difference of two convex functions of w, so this is a difference-of-convex (DC) program, not a convex one
Recap
Scoring function: wᵀΨ(x, y, h)
Prediction: (y(w), h(w)) = argmax_{y,h} wᵀΨ(x, y, h)
Learning: min_w ||w||² + C Σ_i ξ_i
s.t. wᵀΨ(x_i, y, h) + Δ(y_i, y) - max_{h_i} wᵀΨ(x_i, y_i, h_i) ≤ ξ_i for all y, h
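The constraints above are equivalent to setting each ξ_i to a hinge-style upper bound computed by loss-augmented inference. A sketch building on the code after the prediction slide (same assumed `phi`, `H`, `joint_feature`), with the loss Δ passed in as a generic function:

```python
def slack_upper_bound(w, phi, x_i, y_i, H, d, delta):
    """xi_i = max_{y,h} [w^T Psi(x_i,y,h) + Delta(y_i,y)] - max_{h_i} w^T Psi(x_i,y_i,h_i)."""
    # Loss-augmented inference over all (y, h)
    augmented = max(w @ joint_feature(phi, x_i, y, h, d) + delta(y_i, y)
                    for y in (+1, -1) for h in H(x_i))
    # Best completion of the latent variable for the ground-truth label
    ground_truth = max(w @ joint_feature(phi, x_i, y_i, h, d) for h in H(x_i))
    return augmented - ground_truth      # >= 0 because y = y_i is included in the max
```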
Outline – Optimization
Learning Latent SVM
min_w ||w||² + C Σ_i ξ_i
s.t. max_{y,h} [Δ(y_i, y) + wᵀΨ(x_i, y, h)] - max_{h_i} wᵀΨ(x_i, y_i, h_i) ≤ ξ_i
A difference-of-convex (DC) program
Concave-Convex Procedure (CCCP)
Objective per example: max_{y,h} [Δ(y_i, y) + wᵀΨ(x_i, y, h)] (convex) + [- max_{h_i} wᵀΨ(x_i, y_i, h_i)] (concave)
Repeat until convergence:
–Linearly upper-bound the concave part
–Optimize the resulting convex upper bound
How do we obtain the linear upper bound?
Linear Upper Bound
Current estimate w_t; let h_i* = argmax_{h_i} w_tᵀΨ(x_i, y_i, h_i)
Then -wᵀΨ(x_i, y_i, h_i*) ≥ -max_{h_i} wᵀΨ(x_i, y_i, h_i), a linear (in w) upper bound on the concave part
CCCP for Latent SVM
Start with an initial estimate w_0
Repeat until convergence:
–Impute the latent variables: h_i* = argmax_{h_i ∈ H} w_tᵀΨ(x_i, y_i, h_i)
–Update w_{t+1} as the ε-optimal solution of
  min ||w||² + C Σ_i ξ_i
  s.t. wᵀΨ(x_i, y_i, h_i*) - wᵀΨ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i
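A sketch of the overall CCCP loop, building on the earlier sketches. Here `data` is a list of (x_i, y_i) pairs, and `solve_ssvm(data, latents, phi, H, d, C, eps)` is an assumed placeholder that returns an ε-optimal (w, objective) pair for the convex problem with the imputed h_i* fixed (e.g. a cutting-plane solver); it is not the code used for the experiments later in the deck.

```python
def impute_latent(w, phi, x_i, y_i, H, d):
    """h_i* = argmax_{h in H} w^T Psi(x_i, y_i, h) for the current parameters."""
    return max(H(x_i), key=lambda h: w @ joint_feature(phi, x_i, y_i, h, d))

def cccp_latent_svm(data, phi, H, d, C, eps, solve_ssvm, max_iter=50, tol=1e-4):
    """Alternate latent-variable imputation with convex structural-SVM updates."""
    w = np.zeros(2 * d)                          # initial estimate w_0
    prev_obj = np.inf
    for _ in range(max_iter):
        # Linearize the concave part: impute h_i* with the current parameters
        latents = [impute_latent(w, phi, x, y, H, d) for x, y in data]
        # eps-optimal solution of the resulting convex (structural SVM) problem
        w, obj = solve_ssvm(data, latents, phi, H, d, C, eps)
        if prev_obj - obj < tol:                 # stop once the objective stops decreasing
            break
        prev_obj = obj
    return w
```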
Outline – Practice
Action Classification
Input x: image; output y, e.g. y = “Using Computer”
10 classes: Jumping, Phoning, Playing Instrument, Reading, Riding Bike, Riding Horse, Running, Taking Photo, Using Computer, Walking
PASCAL VOC; …/20 train/test split; 5 folds
Training data: inputs x_i with outputs y_i
Setup
–0-1 loss function
–Poselet-based feature vector
–4 seeds for random initialization
–Code + data, train/test scripts with hyperparameter settings
[Result plots: objective value, train error, test error, and training time.]
Outline – Practice
–Annealing the Tolerance
–Annealing the Regularization
–Self-Paced Learning
–Choice of Loss Function
CCCP for Latent SVM (as above)
Problem: overfitting in the initial iterations
Annealing the Tolerance
Start with an initial estimate w_0
Repeat until convergence:
–h_i* = argmax_{h_i ∈ H} w_tᵀΨ(x_i, y_i, h_i)
–Update w_{t+1} as the ε'-optimal solution of
  min ||w||² + C Σ_i ξ_i
  s.t. wᵀΨ(x_i, y_i, h_i*) - wᵀΨ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i
–Anneal the tolerance: ε' ← ε'/K, until ε' = ε
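One way to realize the annealed tolerance on top of the CCCP sketch above; the starting tolerance `eps_start` and the factor `K` are hypothetical knobs chosen purely for illustration.

```python
def cccp_annealed_tolerance(data, phi, H, d, C, eps, solve_ssvm,
                            eps_start=1.0, K=10.0, max_iter=50):
    """CCCP with a loose inner tolerance early on, tightened by a factor K per iteration."""
    w = np.zeros(2 * d)
    eps_t = eps_start
    for _ in range(max_iter):
        latents = [impute_latent(w, phi, x, y, H, d) for x, y in data]
        w, _ = solve_ssvm(data, latents, phi, H, d, C, eps_t)
        eps_t = max(eps_t / K, eps)              # anneal eps' <- eps'/K until eps' = eps
    return w
```

The intent, per the slides, is to avoid overfitting in the initial iterations by not solving the inner problem too accurately while the imputed latent variables are still unreliable.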
[Result plots: objective value, train error, test error, and training time, with and without annealing the tolerance.]
Outline – Practice: Annealing the Regularization
CCCP for Latent SVM (as above)
Problem: overfitting in the initial iterations
Annealing the Regularization
Start with an initial estimate w_0
Repeat until convergence:
–h_i* = argmax_{h_i ∈ H} w_tᵀΨ(x_i, y_i, h_i)
–Update w_{t+1} as the ε-optimal solution of
  min ||w||² + C' Σ_i ξ_i
  s.t. wᵀΨ(x_i, y_i, h_i*) - wᵀΨ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i
–Anneal the regularization: C' ← C' × K, until C' = C
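The regularization-annealing variant mirrors the tolerance annealing above, changing only the schedule: C' starts small (strong regularization) and grows by a factor K until it reaches C. A tiny sketch of that schedule alone, with an assumed starting point:

```python
def regularization_schedule(C, K=10.0, num_steps=4):
    """Yield C' for each CCCP iteration: C / K**num_steps, ..., C/K, C (then constant)."""
    C_t = C / K ** num_steps                     # assumed starting point
    while True:
        yield C_t
        C_t = min(C_t * K, C)
```

For example, `regularization_schedule(10.0)` yields 0.001, 0.01, 0.1, 1.0, 10.0, 10.0, …; inside the CCCP loop, the current value replaces C in the call to the inner solver.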
[Result plots: objective value, train error, test error, and training time, with and without annealing the regularization.]
Outline – Practice: Self-Paced Learning (Kumar, Packer and Koller, NIPS 2010)
CCCP for Human Learning
All examples at once: 1 + 1 = 2, 1/3 + 1/6 = 1/2, e^{iπ} + 1 = 0
“Math is for losers!!” → FAILURE … bad local minimum
Self-Paced Learning
Easy examples first: 1 + 1 = 2, then 1/3 + 1/6 = 1/2, then e^{iπ} + 1 = 0
“Euler was a genius!!” → SUCCESS … good local minimum
Self-Paced Learning
Start with “easy” examples, then consider “hard” ones
Deciding easy vs. hard is expensive, and easy for a human need not mean easy for the machine
So simultaneously estimate easiness and parameters: easiness is a property of data sets, not of single instances
CCCP for Latent SVM (recap)
Start with an initial estimate w_0
Repeat:
–h_i* = argmax_{h_i ∈ H} w_tᵀΨ(x_i, y_i, h_i)
–Update w_{t+1} as the ε-optimal solution of
  min ||w||² + C Σ_i ξ_i
  s.t. wᵀΨ(x_i, y_i, h_i*) - wᵀΨ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i
Self-Paced Learning
min ||w||² + C Σ_i ξ_i
s.t. wᵀΨ(x_i, y_i, h_i*) - wᵀΨ(x_i, y, h) ≥ Δ(y_i, y, h) - ξ_i
Self-Paced Learning
min ||w||² + C Σ_i v_i ξ_i, with v_i ∈ {0,1} indicating whether example i is used
s.t. wᵀΨ(x_i, y_i, h_i*) - wᵀΨ(x_i, y, h) ≥ Δ(y_i, y, h) - ξ_i
Trivial solution: set every v_i = 0
Self-Paced Learning
min ||w||² + C Σ_i v_i ξ_i - Σ_i v_i / K, with v_i ∈ {0,1}
s.t. wᵀΨ(x_i, y_i, h_i*) - wᵀΨ(x_i, y, h) ≥ Δ(y_i, y, h) - ξ_i
Large K: only the easiest examples are selected; medium K: more examples; small K: (almost) all examples
Self-Paced Learning
Relax v_i ∈ [0,1]:
min ||w||² + C Σ_i v_i ξ_i - Σ_i v_i / K
s.t. wᵀΨ(x_i, y_i, h_i*) - wᵀΨ(x_i, y, h) ≥ Δ(y_i, y, h) - ξ_i
Biconvex problem → alternating convex search
SPL for Latent SVM
Start with an initial estimate w_0
Repeat:
–h_i* = argmax_{h_i ∈ H} w_tᵀΨ(x_i, y_i, h_i)
–Update w_{t+1} as the ε-optimal solution of
  min ||w||² + C Σ_i v_i ξ_i - Σ_i v_i / K
  s.t. wᵀΨ(x_i, y_i, h_i*) - wᵀΨ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i
–Decrease K by a constant factor
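A sketch of the alternating convex search for SPL, building on the earlier helpers. For fixed w (hence fixed slacks ξ_i), minimizing over v_i ∈ [0,1] has a closed form: the coefficient of v_i is C*xi_i - 1/K, so v_i = 1 exactly when C*xi_i < 1/K. The weighted solver `solve_weighted_ssvm`, the initial threshold `K0`, and the factor `mu` are assumptions for illustration.

```python
def zero_one_loss(y_true, y_other):
    """Delta(y_i, y): the 0-1 loss used in the action-classification setup."""
    return 0.0 if y_true == y_other else 1.0

def select_easy_examples(slacks, C, K):
    """Closed-form v update: v_i = 1 iff C * xi_i < 1/K (the example is currently 'easy')."""
    return [1.0 if C * xi < 1.0 / K else 0.0 for xi in slacks]

def spl_latent_svm(data, phi, H, d, C, eps, solve_weighted_ssvm,
                   K0=1.0, mu=1.3, max_iter=50):
    """Self-paced learning: alternate (h*, v, w) updates while decreasing K,
    so that harder examples are gradually allowed into the objective."""
    w = np.zeros(2 * d)
    K = K0
    for _ in range(max_iter):
        latents = [impute_latent(w, phi, x, y, H, d) for x, y in data]
        slacks = [slack_upper_bound(w, phi, x, y, H, d, zero_one_loss)
                  for x, y in data]
        v = select_easy_examples(slacks, C, K)
        # Weighted structural SVM: each slack term xi_i is scaled by v_i
        w, _ = solve_weighted_ssvm(data, latents, v, phi, H, d, C, eps)
        K = K / mu                               # decrease K -> more examples selected
    return w
```

Because the objective is linear in each v_i, the relaxed problem over v_i ∈ [0,1] is still optimized at the endpoints, which is why the binary selection rule above suffices.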
[Result plots: objective value, train error, test error, and training time, comparing CCCP and self-paced learning.]
Outline – Practice: Choice of Loss Function (Behl, Mohapatra, Jawahar and Kumar, PAMI 2015)
Ranking
Six images ranked 1–6; three example rankings:
–Average Precision = 1, Accuracy = 1
–Average Precision = 0.92, Accuracy = 0.67
–Average Precision = 0.81
As the ranking degrades, AP falls smoothly while accuracy changes only coarsely
Ranking
During testing, AP is frequently used; during training, a surrogate loss is used
This contradicts the principle of loss-based learning → optimize AP directly
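To make the AP numbers on the previous slide concrete, here is a small helper that computes average precision for a ranked list of binary relevance labels. The three rankings are hypothetical orderings chosen to reproduce the values quoted above (the actual images on the slide are not recoverable from the text), and nothing here is specific to the AP-SVM of Behl et al.

```python
def average_precision(ranked_relevance):
    """AP = mean, over relevant items, of the precision at the rank where each appears."""
    precisions, num_relevant = [], 0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            num_relevant += 1
            precisions.append(num_relevant / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Hypothetical rankings of 3 relevant (1) and 3 irrelevant (0) images:
print(average_precision([1, 1, 1, 0, 0, 0]))   # 1.0
print(average_precision([1, 1, 0, 1, 0, 0]))   # ~0.92
print(average_precision([1, 0, 1, 1, 0, 0]))   # ~0.81
```

Under these hypothetical orderings, a fixed-cutoff accuracy is 0.67 for both the second and third rankings while AP separates them, which is the motivation for optimizing AP directly during training.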
Results: statistically significant improvement
Questions?