1
Loss-based Learning with Weak Supervision. M. Pawan Kumar
2
Computer Vision Data, plotted as annotation information vs. log(size): Segmentation ~ 2,000.
3
Computer Vision Data, information vs. log(size): Segmentation ~ 2,000; Bounding Box ~ 1 M.
4
Computer Vision Data, information vs. log(size): Segmentation ~ 2,000; Bounding Box ~ 1 M; Image-Level ("Car", "Chair") > 14 M.
5
Computer Vision Data, information vs. log(size): Segmentation ~ 2,000; Bounding Box ~ 1 M; Image-Level > 14 M; Noisy Label > 6 B.
6
Computer Vision Data: detailed annotation is expensive, sometimes annotation is impossible, and the desired annotation keeps changing. Hence, learn with missing information (latent variables).
7
Outline: Two Types of Problems. Part I – Annotation Mismatch; Part II – Output Mismatch.
8
Annotation Mismatch (Action Classification): input x is an image, annotation y = "jumping", latent variable h. There is a mismatch between the desired and available annotations; the exact value of the latent variable is not "important". The desired output during test time is y.
9
Output Mismatch (Action Classification): input x is an image, annotation y = "jumping", latent variable h.
10
Output Mismatch (Action Detection): input x is an image, annotation y = "jumping", latent variable h. There is a mismatch between the output and the available annotations; the exact value of the latent variable is important. The desired output during test time is (y, h).
11
Part I
12
Outline – Annotation Mismatch: Latent SVM; Optimization; Practice; Extensions. (Andrews et al., NIPS 2001; Smola et al., AISTATS 2005; Felzenszwalb et al., CVPR 2008; Yu and Joachims, ICML 2009)
13
Weakly Supervised Data: input x, output y ∈ {-1, +1}, hidden variable h. Example: y = +1.
14
Weakly Supervised Classification: feature Φ(x,h), joint feature vector Ψ(x,y,h).
15
Weakly Supervised Classification: feature Φ(x,h). Joint feature vector for the positive class: Ψ(x,+1,h) = [Φ(x,h); 0].
16
Weakly Supervised Classification: feature Φ(x,h). Joint feature vector for the negative class: Ψ(x,-1,h) = [0; Φ(x,h)].
17
Weakly Supervised Classification: feature Φ(x,h), joint feature vector Ψ(x,y,h), score f : Ψ(x,y,h) → (-∞, +∞). Optimize the score over all possible y and h.
18
Latent SVM: parameters w. Scoring function: w^T Ψ(x,y,h). Prediction: (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h).
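To make the prediction rule concrete, here is a minimal Python sketch of latent SVM inference by exhaustive enumeration over labels and latent variables. The toy joint feature map and the representation of x as a list of candidate features Φ(x,h) are assumptions for illustration, not code from the tutorial.

```python
import numpy as np

def joint_feature(x, y, h):
    # Hypothetical Psi(x, y, h): x is a list of candidate features phi(x, h),
    # one per latent value h. Psi places phi(x, h) in the block for label y,
    # as on the Psi(x, +1, h) / Psi(x, -1, h) slides above.
    phi = np.asarray(x[h], dtype=float)
    psi = np.zeros(2 * phi.size)
    start = (0 if y == +1 else 1) * phi.size
    psi[start:start + phi.size] = phi
    return psi

def predict(w, x, labels=(-1, +1)):
    # (y(w), h(w)) = argmax_{y, h} w^T Psi(x, y, h), by exhaustive search.
    scored = [((y, h), float(w @ joint_feature(x, y, h)))
              for y in labels for h in range(len(x))]
    return max(scored, key=lambda t: t[1])[0]
```

In real systems the inner maximization over h is a structured inference problem rather than a short loop, but the argmax rule is the same.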
19
Learning Latent SVM: training data {(x_i, y_i), i = 1, 2, …, n}. Empirical risk minimization: min_w Σ_i Δ(y_i, y_i(w)). There is no restriction on the loss function Δ (annotation mismatch).
20
Learning Latent SVM: the empirical risk min_w Σ_i Δ(y_i, y_i(w)) is non-convex, and the parameters cannot be regularized. Instead, find a regularization-sensitive upper bound.
21
Learning Latent SVM: Δ(y_i, y_i(w)) = Δ(y_i, y_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) - w^T Ψ(x_i, y_i(w), h_i(w)).
22
Learning Latent SVM: since (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h), it follows that Δ(y_i, y_i(w)) ≤ Δ(y_i, y_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) - max_{h_i} w^T Ψ(x_i, y_i, h_i).
23
Learning Latent SVM: min_w ||w||^2 + C Σ_i ξ_i, subject to max_{y,h} [w^T Ψ(x_i, y, h) + Δ(y_i, y)] - max_{h_i} w^T Ψ(x_i, y_i, h_i) ≤ ξ_i. The parameters can now be regularized. Is this also convex?
24
Learning Latent SVM: min_w ||w||^2 + C Σ_i ξ_i, subject to max_{y,h} [w^T Ψ(x_i, y, h) + Δ(y_i, y)] - max_{h_i} w^T Ψ(x_i, y_i, h_i) ≤ ξ_i. It is not convex: each constraint is a difference of two convex functions, so this is a difference-of-convex (DC) program.
25
Recap. Scoring function: w^T Ψ(x,y,h). Prediction: (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h). Learning: min_w ||w||^2 + C Σ_i ξ_i, subject to w^T Ψ(x_i, y, h) + Δ(y_i, y) - max_{h_i} w^T Ψ(x_i, y_i, h_i) ≤ ξ_i for all (y, h).
26
Outline – Annotation Mismatch: Latent SVM; Optimization; Practice; Extensions.
27
Learning Latent SVM: min_w ||w||^2 + C Σ_i ξ_i, subject to max_{y,h} [w^T Ψ(x_i, y, h) + Δ(y_i, y)] - max_{h_i} w^T Ψ(x_i, y_i, h_i) ≤ ξ_i. This is a difference-of-convex (DC) program.
28
Concave-Convex Procedure: the per-sample objective is the convex term max_{y,h} [w^T Ψ(x_i, y, h) + Δ(y_i, y)] plus the concave term -max_{h_i} w^T Ψ(x_i, y_i, h_i). Step 1: linearly upper-bound the concave part.
29
Concave-Convex Procedure: max_{y,h} [w^T Ψ(x_i, y, h) + Δ(y_i, y)] - max_{h_i} w^T Ψ(x_i, y_i, h_i). Step 2: optimize the resulting convex upper bound.
30
Concave-Convex Procedure: max_{y,h} [w^T Ψ(x_i, y, h) + Δ(y_i, y)] - max_{h_i} w^T Ψ(x_i, y_i, h_i). Step 3: linearly upper-bound the concave part again.
31
Concave-Convex Procedure: repeat these two steps until convergence.
32
Concave-Convex Procedure: max_{y,h} [w^T Ψ(x_i, y, h) + Δ(y_i, y)] - max_{h_i} w^T Ψ(x_i, y_i, h_i). What is the linear upper bound?
33
Linear Upper Bound: let the current estimate be w_t and h_i* = argmax_{h_i} w_t^T Ψ(x_i, y_i, h_i). Then -w^T Ψ(x_i, y_i, h_i*) ≥ -max_{h_i} w^T Ψ(x_i, y_i, h_i), i.e. the linear function -w^T Ψ(x_i, y_i, h_i*) upper-bounds the concave part (and is tight at w_t).
34
CCCP for Latent SVM: start with an initial estimate w_0. Repeat until convergence: (1) update h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i, y_i, h_i); (2) update w_{t+1} as the ε-optimal solution of min ||w||^2 + C Σ_i ξ_i, subject to w^T Ψ(x_i, y_i, h_i*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i for all (y, h).
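The algorithm above can be turned into a short, self-contained sketch. This is an illustrative implementation under simplifying assumptions: the toy joint feature map, the 0-1 loss, and a plain subgradient solver for the inner convex problem (in place of a dedicated structural-SVM solver) are all stand-ins, not the tutorial's code.

```python
import numpy as np

def psi(x, y, h):
    # Same toy joint feature map as in the earlier sketch: x is a list of
    # candidate features phi(x, h); Psi places phi(x, h) in the block for y.
    phi = np.asarray(x[h], dtype=float)
    out = np.zeros(2 * phi.size)
    start = (0 if y == +1 else 1) * phi.size
    out[start:start + phi.size] = phi
    return out

def zero_one(y_true, y_pred):
    return float(y_true != y_pred)   # Delta(y_i, y), the 0-1 loss

def cccp_latent_svm(data, dim, C=1.0, n_outer=10, n_inner=200, lr=0.01):
    """Sketch of CCCP: impute h_i*, then (approximately) solve the convex
    structural SVM by subgradient descent on its unconstrained form."""
    w = np.zeros(dim)
    for _ in range(n_outer):
        # Step 1: h_i* = argmax_h w_t^T Psi(x_i, y_i, h)  (latent imputation).
        latent = [max(range(len(x)), key=lambda h: float(w @ psi(x, y, h)))
                  for x, y in data]
        # Step 2: minimise ||w||^2 + C * sum_i { max_{y,h} [w^T Psi(x_i,y,h)
        # + Delta(y_i,y)] - w^T Psi(x_i, y_i, h_i*) } by subgradient descent.
        for _ in range(n_inner):
            grad = 2.0 * w
            for (x, y), h_star in zip(data, latent):
                y_hat, h_hat = max(
                    ((yy, hh) for yy in (-1, +1) for hh in range(len(x))),
                    key=lambda p: float(w @ psi(x, p[0], p[1])) + zero_one(y, p[0]))
                grad += C * (psi(x, y_hat, h_hat) - psi(x, y, h_star))
            w -= lr * grad
        # (In practice the inner problem is solved only to a tolerance epsilon.)
    return w
```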
35
Outline – Annotation Mismatch: Latent SVM; Optimization; Practice; Extensions.
36
Action Classification. Input x: an image; output y = "Using Computer". PASCAL VOC 2011, 80/20 train/test split, 5 folds. Ten action classes: Jumping, Phoning, Playing Instrument, Reading, Riding Bike, Riding Horse, Running, Taking Photo, Using Computer, Walking. Training data: inputs x_i with outputs y_i.
37
Setup: 0-1 loss function; poselet-based feature vector; 4 seeds for random initialization. Code + data and train/test scripts with hyperparameter settings: http://www.centrale-ponts.fr/tutorials/cvpr2013/
38
Objective
39
Train Error
40
Test Error
41
Time
42
Outline – Annotation Mismatch: Latent SVM; Optimization; Practice (Annealing the Tolerance, Annealing the Regularization, Self-Paced Learning, Choice of Loss Function); Extensions.
43
CCCP for Latent SVM (recap): start with an initial estimate w_0. Repeat until convergence: update h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i, y_i, h_i); update w_{t+1} as the ε-optimal solution of min ||w||^2 + C Σ_i ξ_i, subject to w^T Ψ(x_i, y_i, h_i*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i. Problem: overfitting in the initial iterations.
44
Annealing the Tolerance: start with an initial estimate w_0 and a loose tolerance ε'. Repeat until convergence: update h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i, y_i, h_i); update w_{t+1} as the ε'-optimal solution of min ||w||^2 + C Σ_i ξ_i, subject to w^T Ψ(x_i, y_i, h_i*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i; then tighten the tolerance, ε' ← ε'/K, until ε' = ε.
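One plausible way to realise this schedule is a small generator that the CCCP loop consumes; the starting tolerance and the factor K below are assumed values for illustration, since the slide only fixes the update rule.

```python
def tolerance_schedule(eps_final, K=2.0, eps_start=1.0):
    """Sketch of the annealed tolerance: start loose, tighten by a factor K
    each CCCP iteration, never going below the target eps_final.
    (K and eps_start are illustrative choices, not values from the slides.)"""
    eps = eps_start
    while True:
        yield max(eps, eps_final)
        eps /= K

# Usage (sketch): pass the yielded eps_t to the inner structural-SVM solve
# in place of the fixed tolerance eps in the recap above.
```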
45
Objective
47
Train Error
49
Test Error
51
Time
53
Outline – Annotation Mismatch: Latent SVM; Optimization; Practice (Annealing the Tolerance, Annealing the Regularization, Self-Paced Learning, Choice of Loss Function); Extensions.
54
CCCP for Latent SVM (recap): start with an initial estimate w_0. Repeat until convergence: update h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i, y_i, h_i); update w_{t+1} as the ε-optimal solution of min ||w||^2 + C Σ_i ξ_i, subject to w^T Ψ(x_i, y_i, h_i*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i. Problem: overfitting in the initial iterations.
55
Annealing the Regularization: start with an initial estimate w_0 and a small regularization weight C'. Repeat until convergence: update h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i, y_i, h_i); update w_{t+1} as the ε-optimal solution of min ||w||^2 + C' Σ_i ξ_i, subject to w^T Ψ(x_i, y_i, h_i*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i; then increase C' ← C' × K, until C' = C.
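The mirror-image schedule for the regularization weight can be sketched the same way; again the starting value and factor K are assumptions for illustration.

```python
def regularization_schedule(C_final, K=2.0, C_start=0.01):
    """Sketch of the annealed regularization weight: start with a small C',
    multiply by K each CCCP iteration, and cap at the target C.
    (K and C_start are illustrative choices, not values from the slides.)"""
    C = C_start
    while True:
        yield min(C, C_final)
        C *= K
```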
56
Objective
58
Train Error
60
Test Error
62
Time
64
Outline – Annotation Mismatch: Latent SVM; Optimization; Practice (Annealing the Tolerance, Annealing the Regularization, Self-Paced Learning, Choice of Loss Function); Extensions. (Kumar, Packer and Koller, NIPS 2010)
65
CCCP for Human Learning: 1 + 1 = 2; 1/3 + 1/6 = 1/2; e^{iπ} + 1 = 0. "Math is for losers!!" FAILURE … BAD LOCAL MINIMUM.
66
Self-Paced Learning: 1 + 1 = 2; 1/3 + 1/6 = 1/2; e^{iπ} + 1 = 0. "Euler was a Genius!!" SUCCESS … GOOD LOCAL MINIMUM.
67
Self-Paced Learning: start with "easy" examples, then consider "hard" ones. Labelling easy vs. hard by hand is expensive, and what is easy for a human need not be easy for the machine. Instead, simultaneously estimate easiness and parameters: easiness is a property of data sets, not of single instances.
68
CCCP for Latent SVM: start with an initial estimate w_0; update h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i, y_i, h_i); update w_{t+1} as the ε-optimal solution of min ||w||^2 + C Σ_i ξ_i, subject to w^T Ψ(x_i, y_i, h_i*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i.
69
Self-Paced Learning: min ||w||^2 + C Σ_i ξ_i, subject to w^T Ψ(x_i, y_i, h_i*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y, h) - ξ_i.
70
Self-Paced Learning: min ||w||^2 + C Σ_i v_i ξ_i, with v_i ∈ {0,1}, subject to w^T Ψ(x_i, y_i, h_i*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y, h) - ξ_i. This has a trivial solution: select no examples (v_i = 0 for all i).
71
Self-Paced Learning: min ||w||^2 + C Σ_i v_i ξ_i - Σ_i v_i/K, with v_i ∈ {0,1}, subject to w^T Ψ(x_i, y_i, h_i*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y, h) - ξ_i. A large K selects only the easiest examples, a medium K selects more, and a small K selects nearly all.
72
Self-Paced Learning: relax to v_i ∈ [0,1]: min ||w||^2 + C Σ_i v_i ξ_i - Σ_i v_i/K, subject to w^T Ψ(x_i, y_i, h_i*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y, h) - ξ_i. This is a biconvex problem, solved by alternating convex search.
73
SPL for Latent SVM: start with an initial estimate w_0. Repeat: update h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i, y_i, h_i); update w_{t+1} as the ε-optimal solution of min ||w||^2 + C Σ_i v_i ξ_i - Σ_i v_i/K, subject to w^T Ψ(x_i, y_i, h_i*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i; then decrease K (divide it by a constant factor).
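For a fixed w, the selection variables have a closed-form update, which is the easiness step of self-paced learning. The snippet below shows that step and sketches the outer loop in comments; the helper names in the comments are placeholders, not functions from the tutorial code.

```python
import numpy as np

def select_easy(slacks, C, K):
    """SPL selection step for fixed w: with v_i in {0,1}, the objective
    C * v_i * xi_i - v_i / K is minimised by picking example i iff its
    slack xi_i is below 1 / (C * K)."""
    return (np.asarray(slacks, dtype=float) < 1.0 / (C * K)).astype(float)

# Sketch of the outer self-paced loop (placeholder helper names):
#
#   v, K = np.ones(n), K0                 # start by trusting every example
#   while not converged:
#       latent = impute_latent(w)                      # h_i* step
#       w, slacks = solve_weighted_ssvm(latent, v, C)  # weights C * v_i
#       v = select_easy(slacks, C, K)                  # easiness update
#       K /= mu                                        # anneal K (mu > 1)
```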
74
Objective
76
Train Error
78
Test Error
80
Time
82
Outline – Annotation Mismatch: Latent SVM; Optimization; Practice (Annealing the Tolerance, Annealing the Regularization, Self-Paced Learning, Choice of Loss Function); Extensions. (Behl, Jawahar and Kumar, In Preparation)
83
Ranking: a ranking of six samples (ranks 1–6) with Average Precision = 1.
84
Ranking (ranks 1–6): Average Precision = 1, Accuracy = 1; Average Precision = 0.92, Accuracy = 0.67; Average Precision = 0.81.
85
Ranking: during testing, AP is frequently used; during training, a surrogate loss is used. This is contradictory to loss-based learning, so optimize AP directly.
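As a reference for the quantity being optimized, here is a small sketch of how the average precision of a ranking can be computed. The relevance patterns in the example are my own guesses that happen to reproduce the AP values quoted on the ranking slides (three relevant items among six); they are not taken from the slides themselves.

```python
def average_precision(relevance):
    """Average precision of a ranked list given binary relevance labels
    ordered by rank (1 = relevant, 0 = not relevant)."""
    hits, precisions = 0, []
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)   # precision at each relevant rank
    return sum(precisions) / hits if hits else 0.0

# Hypothetical rankings of six items with three relevant ones.
print(average_precision([1, 1, 1, 0, 0, 0]))   # 1.0
print(average_precision([1, 1, 0, 1, 0, 0]))   # ~0.92
print(average_precision([1, 0, 1, 1, 0, 0]))   # ~0.81
```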
86
Results: statistically significant improvement.
87
Speed – Proximal Regularization: start with a good initial estimate w_0. Repeat until convergence: update h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i, y_i, h_i); update w_{t+1} as the ε-optimal solution of min ||w||^2 + C Σ_i ξ_i + C_t ||w - w_t||^2, subject to w^T Ψ(x_i, y_i, h_i*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i.
88
Speed – Cascades. (Weiss and Taskar, AISTATS 2010; Sapp, Toshev and Taskar, ECCV 2010)
89
Accuracy – (Self) Pacing: pacing the sample complexity (NIPS 2010); pacing the model complexity; pacing the problem complexity.
90
Building Accurate Systems: Model (85%), Learning (10%), Inference (5%). Learning cannot provide huge gains without a good model; inference cannot provide huge gains without a good model.
91
Outline – Annotation Mismatch: Latent SVM; Optimization; Practice; Extensions (Latent Variable Dependent Loss, Max-Margin Min-Entropy Models). (Yu and Joachims, ICML 2009)
92
Latent Variable Dependent Loss: Δ(y_i, y_i(w), h_i(w)) = Δ(y_i, y_i(w), h_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) - w^T Ψ(x_i, y_i(w), h_i(w)).
93
Latent Variable Dependent Loss: since (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h), Δ(y_i, y_i(w), h_i(w)) ≤ Δ(y_i, y_i(w), h_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) - max_{h_i} w^T Ψ(x_i, y_i, h_i).
94
Latent Variable Dependent Loss: min_w ||w||^2 + C Σ_i ξ_i, subject to max_{y,h} [w^T Ψ(x_i, y, h) + Δ(y_i, y, h)] - max_{h_i} w^T Ψ(x_i, y_i, h_i) ≤ ξ_i.
95
Optimizing Precision@k: input X = {x_i, i = 1, …, n}; annotation Y = {y_i, i = 1, …, n} ∈ {-1,+1}^n; latent H = a ranking. Loss: Δ(Y*, Y, H) = 1 - Precision@k.
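A minimal sketch of this loss, assuming binary labels and a ranking given as an ordered list of sample indices (these representation choices are mine, not from the tutorial):

```python
def precision_at_k_loss(y_true, ranking, k):
    """Sketch of Delta = 1 - Precision@k: y_true maps sample index -> label
    in {-1, +1}; ranking lists sample indices from best to worst."""
    top_k = ranking[:k]
    relevant = sum(1 for i in top_k if y_true[i] == +1)
    return 1.0 - relevant / k

# Example: four samples, a ranking that puts one negative in the top 2.
y = {0: +1, 1: -1, 2: +1, 3: -1}
print(precision_at_k_loss(y, ranking=[0, 1, 2, 3], k=2))  # 0.5
```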
96
Outline – Annotation Mismatch: Latent SVM; Optimization; Practice; Extensions (Latent Variable Dependent Loss, Max-Margin Min-Entropy (M3E) Models). (Miller, Kumar, Packer, Goodman and Koller, AISTATS 2012)
97
Running vs. Jumping Classification: the score w^T Ψ(x,y,h) ∈ (-∞, +∞). (Figure: scores w^T Ψ(x,y_1,h) over the latent variables h, e.g. 0.00, 0.25, 0.00, 0.25, 0.00, 0.25, 0.00.)
98
Running vs. Jumping Classification: scores w^T Ψ(x,y_1,h) (0.00, 0.25, 0.00, 0.25, 0.00, 0.25, 0.00) vs. w^T Ψ(x,y_2,h) (0.00, 0.24, 0.00, 0.01, 0.00). Latent SVM prediction uses only the maximum score; is there no other useful cue? The uncertainty in h is also informative.
99
M3E: scoring function P_w(y,h|x) = exp(w^T Ψ(x,y,h)) / Z(x), where Z(x) is the partition function. Prediction: y(w) = argmin_y [H_α(P_w(h|y,x)) - log P_w(y|x)], combining the Rényi entropy of the conditional over h with the marginalized probability; this quantity is the Rényi entropy of the generalized distribution, G_α(y;x,w).
100
Rényi Entropy: G_α(y;x,w) = (1/(1-α)) log [ Σ_h P_w(y,h|x)^α / Σ_h P_w(y,h|x) ]. For α = 1 (in the limit), this is the Shannon entropy of the generalized distribution: - Σ_h P_w(y,h|x) log P_w(y,h|x) / Σ_h P_w(y,h|x).
101
Rényi Entropy: G_α(y;x,w) = (1/(1-α)) log [ Σ_h P_w(y,h|x)^α / Σ_h P_w(y,h|x) ]. For α → ∞, this is the minimum entropy of the generalized distribution: -max_h log P_w(y,h|x).
102
Rényi Entropy: for α → ∞, G_α(y;x,w) = -max_h log P_w(y,h|x), so minimizing it over y is equivalent to maximizing max_h w^T Ψ(x,y,h): the same prediction as latent SVM.
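A small numerical sketch of G_α computed from unnormalized scores; the score tables are illustrative (loosely mirroring the running vs. jumping figure), not data from the slides.

```python
import numpy as np

def renyi_entropy_generalized(scores, alpha):
    """G_alpha(y; x, w) from the scores w^T Psi(x, y, h) of one label y over
    all latent values h (a sketch). Uses P_w(y,h|x) proportional to
    exp(score); dropping the partition function Z(x) only shifts G_alpha by
    log Z(x), which is shared by all y and does not change the argmin."""
    p = np.exp(np.asarray(scores, dtype=float))     # unnormalized P_w(y,h|x)
    if np.isinf(alpha):                             # minimum entropy
        return -np.log(p.max())
    if np.isclose(alpha, 1.0):                      # Shannon (generalized)
        return -(p * np.log(p)).sum() / p.sum()
    return np.log((p ** alpha).sum() / p.sum()) / (1.0 - alpha)

scores_y1 = [0.00, 0.25, 0.00, 0.25, 0.00, 0.25, 0.00]  # several equal peaks: uncertain h
scores_y2 = [0.00, 0.24, 0.00, 0.01, 0.00]              # one dominant peak: more certain h
for a in (1.0, 2.0, np.inf):
    print(a, renyi_entropy_generalized(scores_y1, a),
             renyi_entropy_generalized(scores_y2, a))
```

For α = ∞ only the peak heights matter (nearly identical here), while finite α also rewards the lower uncertainty over h.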
103
Learning M3E: training data {(x_i, y_i), i = 1, 2, …, n}. w* = argmin_w Σ_i Δ(y_i, y_i(w)). This is highly non-convex in w, and w cannot be regularized to prevent overfitting.
104
Learning M3E: Δ(y_i, y_i(w)) = Δ(y_i, y_i(w)) + G_α(y_i(w);x_i,w) - G_α(y_i(w);x_i,w) ≤ Δ(y_i, y_i(w)) + G_α(y_i;x_i,w) - G_α(y_i(w);x_i,w) ≤ max_y {Δ(y_i, y) + G_α(y_i;x_i,w) - G_α(y;x_i,w)}.
105
Learning M3E: training data {(x_i, y_i), i = 1, 2, …, n}. min_w ||w||^2 + C Σ_i ξ_i, subject to G_α(y_i;x_i,w) + Δ(y_i, y) - G_α(y;x_i,w) ≤ ξ_i. When α tends to infinity, M3E = latent SVM; other values of α can give better results.
106
Motif Finding Results: motif + Markov background model (Yu and Joachims, 2009).
107
Part II
108
Output Mismatch (Action Detection): input x is an image, annotation y = "jumping", latent variable h. There is a mismatch between the output and the available annotations; the exact value of the latent variable is important. The desired output during test time is (y, h).
109
Outline – Output Mismatch: Problem Formulation; Dissimilarity Coefficient Learning; Optimization; Experiments.
110
Weakly Supervised Data: input x, output y ∈ {0, 1, …, C}, hidden variable h. Example: y = 0.
111
Weakly Supervised Detection: feature Φ(x,h), joint feature vector Ψ(x,y,h).
112
Weakly Supervised Detection: Ψ(x,0,h) = [Φ(x,h); 0; …; 0], i.e. Φ(x,h) in the block for class 0 and zeros elsewhere.
113
Weakly Supervised Detection: Ψ(x,1,h) = [0; Φ(x,h); 0; …; 0], i.e. Φ(x,h) in the block for class 1 and zeros elsewhere.
114
Weakly Supervised Detection: Ψ(x,C,h) = [0; …; 0; Φ(x,h)], i.e. Φ(x,h) in the block for class C and zeros elsewhere.
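The block construction above is easy to write down directly; this sketch assumes Φ(x,h) has already been computed for the chosen latent value, and the function name is illustrative.

```python
import numpy as np

def joint_feature(phi, y, num_classes):
    """Sketch of the block joint feature vector Psi(x, y, h): place phi(x, h)
    in the block indexed by y in {0, 1, ..., C} and zeros elsewhere."""
    phi = np.asarray(phi, dtype=float)
    psi = np.zeros((num_classes + 1) * phi.size)
    psi[y * phi.size:(y + 1) * phi.size] = phi
    return psi

# Example: a 3-dimensional phi(x, h) placed in the block for y = 2 of C = 4.
print(joint_feature([0.5, 1.0, -0.2], y=2, num_classes=4))
```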
115
Weakly Supervised Detection: feature Φ(x,h), joint feature vector Ψ(x,y,h), score f : Ψ(x,y,h) → (-∞, +∞). Optimize the score over all possible y and h.
116
Linear Model: parameters w. Scoring function: w^T Ψ(x,y,h). Prediction: (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h).
117
Minimizing General Loss: min_w Σ_i Δ(y_i, h_i, y_i(w), h_i(w)) over the supervised samples + Σ_i Δ'(y_i, y_i(w), h_i(w)) over the weakly supervised samples, whose latent variable values are unknown.
118
Minimizing General Loss: min_w Σ_i Σ_{h_i} P_w(h_i|x_i,y_i) Δ(y_i, h_i, y_i(w), h_i(w)). A single distribution has to achieve two objectives.
119
Outline – Output Mismatch: Problem Formulation; Dissimilarity Coefficient Learning; Optimization; Experiments. (Kumar, Packer and Koller, ICML 2012)
120
Problem: model the uncertainty in the latent variables, and model the accuracy of the latent variable predictions.
121
Solution: use two different distributions for the two different tasks, modelling the uncertainty in the latent variables and the accuracy of the latent variable predictions separately.
122
Solution: the distribution P_θ(h_i|y_i,x_i) over h_i models the uncertainty in the latent variables; the accuracy of the latent variable predictions still needs to be modelled.
123
Solution: use two different distributions for the two different tasks. P_θ(h_i|y_i,x_i) models the uncertainty in h_i, while P_w(y_i,h_i|x_i), a point mass at the prediction (y_i(w), h_i(w)), models the accuracy of the latent variable predictions.
124
The Ideal Case: no latent variable uncertainty (P_θ(h_i|y_i,x_i) concentrates on a single h_i(w)) and a correct prediction (P_w puts its mass at (y_i, h_i(w))).
125
In Practice: restrictions in the representation power of the models mean the two distributions P_θ(h_i|y_i,x_i) and P_w(y_i,h_i|x_i) need not agree.
126
Our Framework: minimize the dissimilarity between the two distributions P_θ(h_i|y_i,x_i) and P_w(y_i,h_i|x_i), using a user-defined dissimilarity measure.
127
Our Framework: minimize Rao's dissimilarity coefficient between the two distributions. The cross term is Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h|y_i,x_i).
128
Our Framework: minimize Rao's dissimilarity coefficient. The cross term, denoted H_i(w,θ), is combined with - β Σ_{h,h'} Δ(y_i, h, y_i, h') P_θ(h|y_i,x_i) P_θ(h'|y_i,x_i), the self-term of P_θ.
129
Our Framework: minimize Rao's dissimilarity coefficient H_i(w,θ) - β H_i(θ,θ) - (1-β) Δ(y_i(w), h_i(w), y_i(w), h_i(w)); the last term, the self-term of the point-mass prediction distribution, vanishes for a loss with Δ(y, h, y, h) = 0.
130
Our Framework: minimize Rao's dissimilarity coefficient, min_{w,θ} Σ_i [H_i(w,θ) - β H_i(θ,θ)].
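To make the objective concrete, here is a sketch that evaluates the per-sample dissimilarity coefficient from a tabular P_θ and a point-mass prediction; the representation (a dict over latent values), the loss, and the value of β are illustrative choices, not part of the original formulation beyond what the slides state.

```python
import numpy as np

def dissimilarity_coefficient(delta, p_theta, pred, y_i, beta=0.5):
    """Sketch of H_i(w,theta) - beta * H_i(theta,theta) for one sample:
    p_theta maps h -> P_theta(h | y_i, x_i); pred = (y_i(w), h_i(w)) is the
    point-mass prediction; delta is the user-defined loss Delta(y,h,y',h')."""
    y_w, h_w = pred
    cross = sum(p * delta(y_i, h, y_w, h_w) for h, p in p_theta.items())
    self_term = sum(p * q * delta(y_i, h, y_i, h2)
                    for h, p in p_theta.items()
                    for h2, q in p_theta.items())
    return cross - beta * self_term

# Toy example: latent values are integers, loss is 0-1 on (y, h) pairs.
delta = lambda y, h, y2, h2: float((y, h) != (y2, h2))
p_theta = {0: 0.7, 1: 0.3}
print(dissimilarity_coefficient(delta, p_theta, pred=(1, 0), y_i=1))  # 0.09
```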
131
Outline – Output Mismatch: Problem Formulation; Dissimilarity Coefficient Learning; Optimization; Experiments.
132
Optimization: min_{w,θ} Σ_i [H_i(w,θ) - β H_i(θ,θ)]. Initialize the parameters to w_0 and θ_0; repeat until convergence: fix w and optimize θ, then fix θ and optimize w.
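The alternating scheme itself is generic block-coordinate descent; the toy sketch below shows the pattern on a stand-in biconvex objective with numerical gradients, purely to illustrate the "fix one block, optimize the other" loop (the objective, step sizes, and iteration counts are all assumptions).

```python
import numpy as np

def alternate_minimize(f, w0, theta0, step=0.1, n_outer=50, n_inner=25):
    """Generic sketch of alternating minimization: fix w and take gradient
    steps in theta, then fix theta and take gradient steps in w. f is a
    stand-in for sum_i [H_i(w,theta) - beta * H_i(theta,theta)]."""
    def num_grad(g, z, h=1e-5):
        e = np.eye(z.size)
        return np.array([(g(z + h * e[k]) - g(z - h * e[k])) / (2 * h)
                         for k in range(z.size)])
    w, theta = np.asarray(w0, float), np.asarray(theta0, float)
    for _ in range(n_outer):
        for _ in range(n_inner):                       # fix w, optimize theta
            theta -= step * num_grad(lambda t: f(w, t), theta)
        for _ in range(n_inner):                       # fix theta, optimize w
            w -= step * num_grad(lambda v: f(v, theta), w)
    return w, theta

# Toy usage: a biconvex objective f(w, theta) = ||w - A theta||^2.
A = np.array([[1.0, 0.5], [0.0, 2.0]])
f = lambda w, t: float(np.sum((w - A @ t) ** 2))
print(alternate_minimize(f, w0=[1.0, 1.0], theta0=[0.5, -0.5]))
```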
133
Optimization of θ: min_θ Σ_i [Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h|y_i,x_i) - β H_i(θ,θ)]. Case I: y_i(w) = y_i.
134
Optimization of θ (Case I continued): min_θ Σ_i [Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h|y_i,x_i) - β H_i(θ,θ)], with y_i(w) = y_i.
135
Optimization of θ: min_θ Σ_i [Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h|y_i,x_i) - β H_i(θ,θ)]. Case II: y_i(w) ≠ y_i.
136
Optimization of θ: min_θ Σ_i [Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h|y_i,x_i) - β H_i(θ,θ)]. Case II: y_i(w) ≠ y_i. Solved by stochastic subgradient descent.
137
Optimization of w: min_w Σ_i Σ_h Δ(y_i, h, y_i(w), h_i(w)) P_θ(h|y_i,x_i). This is an expected loss, which models the uncertainty. The form of the optimization is similar to a latent SVM; if Δ is independent of h it reduces to a latent SVM. Solved with the Concave-Convex Procedure (CCCP).
138
Outline – Output Mismatch: Problem Formulation; Dissimilarity Coefficient Learning; Optimization; Experiments.
139
Action Detection. Input x: an image; output y = "Using Computer"; latent variable h. PASCAL VOC 2011, 60/40 train/test split, 5 folds. Ten action classes: Jumping, Phoning, Playing Instrument, Reading, Riding Bike, Riding Horse, Running, Taking Photo, Using Computer, Walking. Training data: inputs x_i with outputs y_i.
140
Results – 0/1 Loss (statistically significant).
141
Results – Overlap Loss (statistically significant).
142
Questions? http://www.centrale-ponts.fr/personnel/pawan