1
Discriminative Machine Learning Topic 4: Weak Supervision. M. Pawan Kumar, http://www.robots.ox.ac.uk/~oval/. Slides available online at http://mpawankumar.info
2
Computer Vision Data (information vs. log size): Segmentation ~2000.
3
Computer Vision Data (information vs. log size): Segmentation ~2000; Bounding Box ~1M.
4
Computer Vision Data (information vs. log size): Segmentation ~2000; Bounding Box ~1M; Image-Level ("Car", "Chair") >14M.
5
Computer Vision Data (information vs. log size): Segmentation ~2000; Bounding Box ~1M; Image-Level >14M; Noisy Label >6B.
6
Computer Vision Data. Detailed annotation is expensive. Sometimes annotation is impossible. The desired annotation keeps changing. Therefore: learn with missing information (latent variables).
7
Annotation Mismatch. Action Classification: input x, annotation y = "jumping", latent variable h. Mismatch between the desired and the available annotations. The exact value of the latent variable is not "important". The desired output during test time is y.
8
Output Mismatch. Action Classification: input x, annotation y = "jumping", latent variable h.
9
Output Mismatch. Action Detection: input x, annotation y = "jumping", latent variable h. Mismatch between the output and the available annotations. The exact value of the latent variable is important. The desired output during test time is (y, h).
10
Annotation Mismatch. Action Classification: input x, annotation y = "jumping", latent variable h. We will focus on this case; output mismatch is out of scope. The desired output during test time is y.
11
Outline: Latent SVM, Optimization, Practice. Andrews et al., NIPS 2001; Smola et al., AISTATS 2005; Felzenszwalb et al., CVPR 2008; Yu and Joachims, ICML 2009.
12
Weakly Supervised Data. Input x, output y ∈ {-1,+1}, hidden h. Example: y = +1.
13
Weakly Supervised Classification. Feature Φ(x,h). Joint feature vector Ψ(x,y,h).
14
Weakly Supervised Classification. Feature Φ(x,h). Joint feature vector Ψ(x,+1,h) = [Φ(x,h); 0].
15
Weakly Supervised Classification. Feature Φ(x,h). Joint feature vector Ψ(x,-1,h) = [0; Φ(x,h)].
16
Weakly Supervised Classification. Feature Φ(x,h). Joint feature vector Ψ(x,y,h). Score f : Ψ(x,y,h) → (-∞, +∞). Optimize the score over all possible y and h.
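As a concrete illustration of this construction, here is a minimal sketch in hypothetical numpy code (not the tutorial's implementation): Ψ(x,+1,h) places Φ(x,h) in the top block with zeros below, Ψ(x,-1,h) swaps the two blocks, and the score is a dot product with w.

import numpy as np

def joint_feature(phi_xh, y):
    # Joint feature vector Psi(x, y, h) built from the feature Phi(x, h).
    # For y = +1, Phi(x, h) fills the top block and the bottom block is zero;
    # for y = -1 the two blocks are swapped. (phi_xh: 1-D numpy array.)
    zeros = np.zeros_like(phi_xh)
    if y == +1:
        return np.concatenate([phi_xh, zeros])
    return np.concatenate([zeros, phi_xh])

def score(w, phi_xh, y):
    # Scoring function f = w^T Psi(x, y, h), a real number in (-inf, +inf).
    return float(np.dot(w, joint_feature(phi_xh, y)))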
17
Latent SVM. Scoring function: w^T Ψ(x,y,h), with parameters w. Prediction: (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h).
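To make the prediction rule concrete, the sketch below enumerates a finite set of candidate latent values; score is assumed to be a callable returning w^T Ψ(x, y, h) (for instance the dot-product score above), and candidate_labels and latent_values are hypothetical inputs.

def predict(w, x, candidate_labels, latent_values, score):
    # Prediction: (y(w), h(w)) = argmax over y and h of w^T Psi(x, y, h).
    # Assumes a finite label set and a finite latent space that can be enumerated.
    return max(
        ((y, h) for y in candidate_labels for h in latent_values),
        key=lambda yh: score(w, x, yh[0], yh[1]),
    )

In the binary setting above, candidate_labels would simply be (+1, -1).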
18
Learning Latent SVM. Training data {(x_i, y_i), i = 1, 2, …, n}. Empirical risk minimization: min_w Σ_i Δ(y_i, y_i(w)). No restriction on the loss function Δ. Annotation mismatch.
19
Learning Latent SVM. Empirical risk minimization: min_w Σ_i Δ(y_i, y_i(w)). Non-convex, and the parameters cannot be regularized. Find a regularization-sensitive upper bound.
20
Learning Latent SVM. Δ(y_i, y_i(w)) = Δ(y_i, y_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) - w^T Ψ(x_i, y_i(w), h_i(w)).
21
Learning Latent SVM. Δ(y_i, y_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) - max_{h_i} w^T Ψ(x_i, y_i, h_i), using (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h).
22
Learning Latent SVM. min_w ||w||² + C Σ_i ξ_i, s.t. max_{y,h} [Δ(y_i, y) + w^T Ψ(x_i, y, h)] - max_{h_i} w^T Ψ(x_i, y_i, h_i) ≤ ξ_i. Parameters can be regularized. Is this also convex?
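Spelled out in the slides' notation, the chain of inequalities behind this upper bound is:

\begin{align*}
\Delta(y_i, y_i(w))
  &\le \Delta(y_i, y_i(w)) + w^\top \Psi(x_i, y_i(w), h_i(w)) - \max_{h_i} w^\top \Psi(x_i, y_i, h_i) \\
  &\le \max_{y,h}\big[\Delta(y_i, y) + w^\top \Psi(x_i, y, h)\big] - \max_{h_i} w^\top \Psi(x_i, y_i, h_i).
\end{align*}

The first inequality holds because w^T Ψ(x_i, y_i(w), h_i(w)) = max_{y,h} w^T Ψ(x_i, y, h) ≥ max_{h_i} w^T Ψ(x_i, y_i, h_i); the second replaces the particular pair (y_i(w), h_i(w)) by a maximum over all (y, h). Constraining the right-hand side to be at most ξ_i therefore makes Σ_i ξ_i an upper bound on the empirical risk.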
23
Learning Latent SVM. min_w ||w||² + C Σ_i ξ_i, s.t. max_{y,h} [Δ(y_i, y) + w^T Ψ(x_i, y, h)] - max_{h_i} w^T Ψ(x_i, y_i, h_i) ≤ ξ_i. Convex minus convex: a difference-of-convex (DC) program.
24
Recap. Scoring function: w^T Ψ(x,y,h). Prediction: (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h). Learning: min_w ||w||² + C Σ_i ξ_i, s.t. w^T Ψ(x_i, y, h) + Δ(y_i, y) - max_{h_i} w^T Ψ(x_i, y_i, h_i) ≤ ξ_i for all y, h.
25
Outline: Latent SVM, Optimization, Practice.
26
Learning Latent SVM. min_w ||w||² + C Σ_i ξ_i, s.t. max_{y,h} [Δ(y_i, y) + w^T Ψ(x_i, y, h)] - max_{h_i} w^T Ψ(x_i, y_i, h_i) ≤ ξ_i. Difference-of-convex (DC) program.
27
Concave-Convex Procedure. Objective: max_{y,h} [Δ(y_i, y) + w^T Ψ(x_i, y, h)] - max_{h_i} w^T Ψ(x_i, y_i, h_i). Linear upper bound of the concave part.
28
Concave-Convex Procedure (same objective). Optimize the convex upper bound.
29
Concave-Convex Procedure (same objective). Linear upper bound of the concave part.
30
Concave-Convex Procedure (same objective). Repeat until convergence.
31
Concave-Convex Procedure (same objective). How do we obtain the linear upper bound?
32
Linear Upper Bound. Let w_t be the current estimate and h_i* = argmax_{h_i} w_t^T Ψ(x_i, y_i, h_i). Then -w^T Ψ(x_i, y_i, h_i*) ≥ -max_{h_i} w^T Ψ(x_i, y_i, h_i).
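A minimal sketch of this linearization step (hypothetical helper; score is again assumed to return w^T Ψ(x, y, h) and latent_values is a finite latent space):

def impute_latent(w_t, x_i, y_i, latent_values, score):
    # h_i* = argmax over h_i of w_t^T Psi(x_i, y_i, h_i), computed at the current w_t.
    # Fixing h_i* replaces the concave term -max_h w^T Psi(x_i, y_i, h) by the
    # upper bound -w^T Psi(x_i, y_i, h_i*), which is linear in w.
    return max(latent_values, key=lambda h: score(w_t, x_i, y_i, h))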
33
CCCP for Latent SVM. Start with an initial estimate w_0. Repeat until convergence: (1) update h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i, y_i, h_i); (2) update w_{t+1} as the ε-optimal solution of min ||w||² + C Σ_i ξ_i, s.t. w^T Ψ(x_i, y_i, h_i*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i.
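Putting the two steps together, a schematic CCCP outer loop could look as follows; impute_latents and solve_convex_ssvm are hypothetical stand-ins for the latent-variable imputation above and for any ε-optimal structural SVM solver (e.g. cutting planes), so this is a sketch rather than the tutorial's code.

import numpy as np

def cccp_latent_svm(data, w0, C, eps, impute_latents, solve_convex_ssvm, max_iter=50):
    # data: list of (x_i, y_i) pairs; w0: initial estimate (numpy array).
    # impute_latents(w, data): h_i* = argmax_h w^T Psi(x_i, y_i, h) for every i.
    # solve_convex_ssvm(data, h_star, C, eps): eps-optimal solution of
    #   min ||w||^2 + C * sum_i xi_i
    #   s.t. w^T Psi(x_i, y_i, h_i*) - w^T Psi(x_i, y, h) >= Delta(y_i, y) - xi_i.
    w = w0
    for _ in range(max_iter):
        h_star = impute_latents(w, data)                 # linearize the concave part
        w_new = solve_convex_ssvm(data, h_star, C, eps)  # optimize the convex upper bound
        if np.linalg.norm(w_new - w) < 1e-6:             # crude convergence test
            return w_new
        w = w_new
    return w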
34
Latent SVM Optimization Practice Outline
35
Action Classification. Input x, output y = "Using Computer". Classes: Jumping, Phoning, Playing Instrument, Reading, Riding Bike, Riding Horse, Running, Taking Photo, Using Computer, Walking. PASCAL VOC 2011, 80/20 train/test split, 5 folds. Training data: inputs x_i with outputs y_i.
36
Setup. 0-1 loss function, poselet-based feature vector, 4 seeds for random initialization. Code + data and train/test scripts with hyperparameter settings: http://mpawankumar.info/tutorials/cvpr2013/
37
Objective
38
Train Error
39
Test Error
40
Time
41
Outline: Latent SVM, Optimization, Practice (Annealing the Tolerance, Annealing the Regularization, Self-Paced Learning, Choice of Loss Function).
42
CCCP for Latent SVM. Start with an initial estimate w_0. Repeat until convergence: (1) update h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i, y_i, h_i); (2) update w_{t+1} as the ε-optimal solution of min ||w||² + C Σ_i ξ_i, s.t. w^T Ψ(x_i, y_i, h_i*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i. Problem: overfitting in the initial iterations.
43
Annealing the Tolerance. Start with an initial estimate w_0 and a loose tolerance ε'. Repeat until convergence and ε' = ε: (1) update h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i, y_i, h_i); (2) update w_{t+1} as the ε'-optimal solution of min ||w||² + C Σ_i ξ_i, s.t. w^T Ψ(x_i, y_i, h_i*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i; (3) anneal ε' ← ε'/K.
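In code, the only change to the CCCP sketch above is that the solver tolerance starts loose and is divided by K each iteration until it reaches the target ε; eps_start and the helper functions are hypothetical, as before.

def cccp_annealed_tolerance(data, w0, C, eps, K, impute_latents, solve_convex_ssvm,
                            eps_start=1.0, max_iter=50):
    # CCCP with an annealed tolerance: early iterations, where the imputed h_i*
    # are unreliable, are solved only coarsely; the tolerance tightens over time.
    w, eps_prime = w0, eps_start
    for _ in range(max_iter):
        h_star = impute_latents(w, data)
        w = solve_convex_ssvm(data, h_star, C, eps_prime)
        eps_prime = max(eps_prime / K, eps)   # anneal: eps' <- eps'/K, floored at eps
    return w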
44
Objective
46
Train Error
48
Test Error
50
Time
52
Outline: Latent SVM, Optimization, Practice (Annealing the Tolerance, Annealing the Regularization, Self-Paced Learning, Choice of Loss Function).
53
CCCP for Latent SVM. Start with an initial estimate w_0. Repeat until convergence: (1) update h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i, y_i, h_i); (2) update w_{t+1} as the ε-optimal solution of min ||w||² + C Σ_i ξ_i, s.t. w^T Ψ(x_i, y_i, h_i*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i. Problem: overfitting in the initial iterations.
54
Annealing the Regularization. Start with an initial estimate w_0 and a small regularization weight C'. Repeat until convergence and C' = C: (1) update h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i, y_i, h_i); (2) update w_{t+1} as the ε-optimal solution of min ||w||² + C' Σ_i ξ_i, s.t. w^T Ψ(x_i, y_i, h_i*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i; (3) anneal C' ← C' × K.
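The analogous sketch for annealing the regularization starts with a small C' (strong regularization) and multiplies it by K each iteration until it reaches C; the starting value and the helper functions are hypothetical.

def cccp_annealed_regularization(data, w0, C, eps, K, impute_latents, solve_convex_ssvm,
                                 C_start=0.01, max_iter=50):
    # CCCP with an annealed regularization weight: early iterations are heavily
    # regularized to limit overfitting to unreliable imputed latent variables.
    w, C_prime = w0, C_start           # C_start is a hypothetical initial value
    for _ in range(max_iter):
        h_star = impute_latents(w, data)
        w = solve_convex_ssvm(data, h_star, C_prime, eps)
        C_prime = min(C_prime * K, C)  # anneal: C' <- C' * K, capped at C
    return w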
55
Objective
57
Train Error
59
Test Error
61
Time
63
Outline: Latent SVM, Optimization, Practice (Annealing the Tolerance, Annealing the Regularization, Self-Paced Learning, Choice of Loss Function). Kumar, Packer and Koller, NIPS 2010.
64
CCCP for Human Learning: 1 + 1 = 2, 1/3 + 1/6 = 1/2, e^{iπ} + 1 = 0. "Math is for losers!!" FAILURE … BAD LOCAL MINIMUM.
65
Self-Paced Learning: 1 + 1 = 2, 1/3 + 1/6 = 1/2, e^{iπ} + 1 = 0. "Euler was a genius!!" SUCCESS … GOOD LOCAL MINIMUM.
66
Self-Paced Learning. Start with "easy" examples, then consider "hard" ones. Easy vs. hard: determining what is easy for a human is expensive, so use what is easy for the machine. Simultaneously estimate easiness and parameters. Easiness is a property of data sets, not single instances.
67
CCCP for Latent SVM. Start with an initial estimate w_0. Update h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i, y_i, h_i). Update w_{t+1} as the ε-optimal solution of min ||w||² + C Σ_i ξ_i, s.t. w^T Ψ(x_i, y_i, h_i*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i.
68
Self-Paced Learning. min_w ||w||² + C Σ_i ξ_i, s.t. w^T Ψ(x_i, y_i, h_i*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y, h) - ξ_i.
69
Self-Paced Learning. min_w ||w||² + C Σ_i v_i ξ_i, s.t. w^T Ψ(x_i, y_i, h_i*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y, h) - ξ_i, with v_i ∈ {0,1}. Trivial solution: set all v_i = 0.
70
Self-Paced Learning. min_{w,v} ||w||² + C Σ_i v_i ξ_i - Σ_i v_i/K, s.t. w^T Ψ(x_i, y_i, h_i*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y, h) - ξ_i, with v_i ∈ {0,1}. Anneal: large K, medium K, small K.
71
Self-Paced Learning. min_{w,v} ||w||² + C Σ_i v_i ξ_i - Σ_i v_i/K, s.t. w^T Ψ(x_i, y_i, h_i*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y, h) - ξ_i, with v_i ∈ [0,1]. Anneal: large K, medium K, small K. Biconvex problem, solved by alternating convex search.
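With w (and hence the slacks ξ_i) fixed, the v-step of the alternating convex search has a simple closed form: the per-example objective C·v_i·ξ_i - v_i/K is linear in v_i, so even under the relaxation v_i ∈ [0,1] the optimum sits at a vertex, namely v_i = 1 exactly when C·ξ_i < 1/K. A minimal sketch (hypothetical variable names):

import numpy as np

def update_example_weights(slacks, C, K):
    # Given the current slacks xi_i, choose v_i in {0,1} minimizing
    # C * v_i * xi_i - v_i / K for each example independently:
    # v_i = 1 ("easy" example) iff C * xi_i < 1/K, else v_i = 0 ("hard" example).
    slacks = np.asarray(slacks, dtype=float)
    return (C * slacks < 1.0 / K).astype(int)

Decreasing K raises the threshold 1/K, so more and more examples are treated as easy as learning proceeds.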
72
SPL for Latent SVM. Start with an initial estimate w_0. Update h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i, y_i, h_i). Update w_{t+1} as the ε-optimal solution of min ||w||² + C Σ_i v_i ξ_i - Σ_i v_i/K, s.t. w^T Ψ(x_i, y_i, h_i*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i. Decrease K.
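A schematic version of the full self-paced outer loop, reusing the hypothetical helpers from the CCCP sketch plus a weighted solver and a slack oracle (all assumed names; the annealing factor is likewise hypothetical):

import numpy as np

def spl_latent_svm(data, w0, C, eps, K0, impute_latents, solve_weighted_ssvm,
                   compute_slacks, anneal_factor=1.3, max_iter=50):
    # solve_weighted_ssvm(data, h_star, v, C, eps): eps-optimal solution of
    #   min ||w||^2 + C * sum_i v_i * xi_i
    #   s.t. w^T Psi(x_i, y_i, h_i*) - w^T Psi(x_i, y, h) >= Delta(y_i, y) - xi_i
    #   (the - sum_i v_i / K term of the full objective is constant once v is fixed).
    # compute_slacks(w, data, h_star): current slack xi_i of every example.
    w, K = w0, K0
    for _ in range(max_iter):
        h_star = impute_latents(w, data)                      # impute latent variables
        slacks = np.asarray(compute_slacks(w, data, h_star))  # current slacks
        v = (C * slacks < 1.0 / K).astype(int)                # select the "easy" examples
        w = solve_weighted_ssvm(data, h_star, v, C, eps)      # learn on the selected set
        K = K / anneal_factor                                 # decrease K: admit more examples
    return w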
73
Objective
75
Train Error
77
Test Error
79
Time
81
Outline: Latent SVM, Optimization, Practice (Annealing the Tolerance, Annealing the Regularization, Self-Paced Learning, Choice of Loss Function). Behl, Mohapatra, Jawahar and Kumar, PAMI 2015.
82
Ranking: six items, Rank 1 through Rank 6. Average Precision = 1.
83
Ranking: six items, Rank 1 through Rank 6. One ranking gives Average Precision = 1, Accuracy = 1; another gives Average Precision = 0.92, Accuracy = 0.67; a third gives Average Precision = 0.81.
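For reference, average precision of a ranking of binary-labelled items can be computed as below; the function and the example rankings are a generic sketch, not the slide's actual images.

def average_precision(ranked_labels):
    # AP of a ranked list of binary relevance labels (1 = positive, 0 = negative):
    # the mean, over the positives, of the precision at each positive's rank.
    num_pos, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == 1:
            num_pos += 1
            precisions.append(num_pos / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Three positives ranked 1-3 out of six items: AP = 1.0.
print(average_precision([1, 1, 1, 0, 0, 0]))
# Positives at ranks 1, 2, 4: AP = (1/1 + 2/2 + 3/4) / 3 ~= 0.92.
print(average_precision([1, 1, 0, 1, 0, 0]))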
84
Ranking. During testing, AP is frequently used; during training, a surrogate loss is used. This is contradictory to loss-based learning. Optimize AP directly.
85
Results: statistically significant improvement.
86
Questions?