Loss-based Learning with Weak Supervision – M. Pawan Kumar


1 Loss-based Learning with Weak Supervision M. Pawan Kumar

2–5 Computer Vision Data: annotation detail (information) vs. dataset size (log scale). Segmentation: ~2,000; Bounding Box: ~1 M; Image-Level (“Car”, “Chair”): >14 M; Noisy Label: >6 B

6 Computer Vision Data: detailed annotation is expensive, sometimes annotation is impossible, and the desired annotation keeps changing. Therefore, learn with missing information (latent variables).

7 Outline: Two Types of Problems. Part I – Annotation Mismatch; Part II – Output Mismatch

8 Annotation Mismatch (Action Classification): input x, annotation y = “jumping”, latent h. There is a mismatch between the desired and available annotations; the exact value of the latent variable is not “important”; the desired output during test time is y.

9 Output Mismatch (Action Classification): input x, annotation y = “jumping”, latent h

10 Output Mismatch (Action Detection): input x, annotation y = “jumping”, latent h. There is a mismatch between the output and the available annotations; the exact value of the latent variable is important; the desired output during test time is (y,h).

11 Part I

12 Outline – Annotation Mismatch: Latent SVM; Optimization; Practice; Extensions. References: Andrews et al., NIPS 2001; Smola et al., AISTATS 2005; Felzenszwalb et al., CVPR 2008; Yu and Joachims, ICML 2009

13 Weakly Supervised Data: input x, output y ∈ {-1,+1}, hidden h (example shown: y = +1)

14 Weakly Supervised Classification: feature Φ(x,h), joint feature vector Ψ(x,y,h)

15 Weakly Supervised Classification: Ψ(x,+1,h) = [Φ(x,h); 0]

16 Weakly Supervised Classification: Ψ(x,-1,h) = [0; Φ(x,h)]

17 Weakly Supervised Classification: score f: Ψ(x,y,h) → (-∞, +∞); optimize the score over all possible y and h

18 Latent SVM: scoring function w^T Ψ(x,y,h) with parameters w; prediction (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h)
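A minimal sketch of the joint feature vector and this prediction rule, in Python with NumPy (the helper names are illustrative, and labels are indexed 0 and 1 rather than -1 and +1):

```python
import numpy as np

def joint_feature(phi, y, num_labels):
    """Psi(x, y, h): place Phi(x, h) in the block indexed by the label y."""
    d = len(phi)
    psi = np.zeros(num_labels * d)
    psi[y * d:(y + 1) * d] = phi
    return psi

def predict(w, x, latent_space, phi_fn, num_labels=2):
    """(y(w), h(w)) = argmax over all (y, h) of w . Psi(x, y, h)."""
    best, best_score = None, -np.inf
    for y in range(num_labels):
        for h in latent_space:
            score = float(w @ joint_feature(phi_fn(x, h), y, num_labels))
            if score > best_score:
                best, best_score = (y, h), score
    return best
```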

19 Learning Latent SVM: training data {(x_i, y_i), i = 1,2,…,n}; empirical risk minimization min_w Σ_i Δ(y_i, y_i(w)); no restriction on the loss function (annotation mismatch)

20 Learning Latent SVM: min_w Σ_i Δ(y_i, y_i(w)) is non-convex and the parameters cannot be regularized; find a regularization-sensitive upper bound

21 Learning Latent SVM: Δ(y_i, y_i(w)) = Δ(y_i, y_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) - w^T Ψ(x_i, y_i(w), h_i(w))

22 Learning Latent SVM: ≤ Δ(y_i, y_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) - max_{h_i} w^T Ψ(x_i, y_i, h_i), since (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h)

23 Learning Latent SVM: ≤ max_{y,h} {w^T Ψ(x_i, y, h) + Δ(y_i, y)} - max_{h_i} w^T Ψ(x_i, y_i, h_i) ≤ ξ_i, and learn with min_w ||w||² + C Σ_i ξ_i. Parameters can now be regularized. Is this also convex?

24 Learning Latent SVM: the first max term is convex in w, and so is the subtracted term max_{h_i} w^T Ψ(x_i, y_i, h_i), so the problem is a difference-of-convex (DC) program

25 Recap: scoring function w^T Ψ(x,y,h); prediction (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h); learning min_w ||w||² + C Σ_i ξ_i s.t. w^T Ψ(x_i,y,h) + Δ(y_i,y) - max_{h_i} w^T Ψ(x_i,y_i,h_i) ≤ ξ_i

26 Outline – Annotation Mismatch: Latent SVM; Optimization; Practice; Extensions

27 Learning Latent SVM: min_w ||w||² + C Σ_i ξ_i s.t. max_{y,h} {w^T Ψ(x_i,y,h) + Δ(y_i,y)} - max_{h_i} w^T Ψ(x_i,y_i,h_i) ≤ ξ_i, a difference-of-convex (DC) program

28–32 Concave-Convex Procedure: for each sample the objective is max_{y,h} {w^T Ψ(x_i,y,h) + Δ(y_i,y)} - max_{h_i} w^T Ψ(x_i,y_i,h_i). Repeat until convergence: (i) linearly upper-bound the concave part -max_{h_i} w^T Ψ(x_i,y_i,h_i); (ii) optimize the resulting convex upper bound. How is the linear upper bound obtained?

33 Linear Upper Bound: with the current estimate w_t, let h_i* = argmax_{h_i} w_t^T Ψ(x_i,y_i,h_i); then -w^T Ψ(x_i,y_i,h_i*) ≥ -max_{h_i} w^T Ψ(x_i,y_i,h_i), a linear (in w) upper bound on the concave part

34 CCCP for Latent SVM: start with an initial estimate w_0; repeat until convergence: (1) h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i,y_i,h_i); (2) update w_{t+1} as the ε-optimal solution of min ||w||² + C Σ_i ξ_i s.t. w^T Ψ(x_i,y_i,h_i*) - w^T Ψ(x_i,y,h) ≥ Δ(y_i,y) - ξ_i
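A schematic version of this loop, assuming a hypothetical solver solve_structural_svm for the convex inner problem (for instance a cutting-plane or subgradient method over the constraints with slacks ξ_i):

```python
import numpy as np

def cccp_latent_svm(w0, data, latent_space, psi, solve_structural_svm,
                    C=1.0, max_iter=50, eps=1e-3):
    """CCCP for latent SVM: impute h_i*, solve the convex problem, repeat."""
    w = w0
    for _ in range(max_iter):
        # Step 1: impute the latent variables with the current parameters.
        h_star = [max(latent_space, key=lambda h: float(w @ psi(x, y, h)))
                  for (x, y) in data]
        # Step 2: solve min ||w||^2 + C * sum_i xi_i subject to
        # w.Psi(x_i, y_i, h_i*) - w.Psi(x_i, y, h) >= Delta(y_i, y) - xi_i.
        w_new = solve_structural_svm(data, h_star, C, tol=eps)
        if np.linalg.norm(w_new - w) < eps:   # simple convergence check
            break
        w = w_new
    return w
```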

35 Outline – Annotation Mismatch: Latent SVM; Optimization; Practice; Extensions

36 Action Classification: input x (image), output y = “Using Computer”. PASCAL VOC 2011, 80/20 train/test split, 5 folds. Classes: Jumping, Phoning, Playing Instrument, Reading, Riding Bike, Riding Horse, Running, Taking Photo, Using Computer, Walking. Training data: inputs x_i with outputs y_i

37 Setup: 0-1 loss function; poselet-based feature vector; 4 seeds for random initialization; code + data and train/test scripts with hyperparameter settings at http://www.centrale-ponts.fr/tutorials/cvpr2013/

38 Objective

39 Train Error

40 Test Error

41 Time

42 Outline – Annotation Mismatch: Latent SVM; Optimization; Practice (Annealing the Tolerance, Annealing the Regularization, Self-Paced Learning, Choice of Loss Function); Extensions

43 CCCP for Latent SVM: start with an initial estimate w_0; repeat until convergence: h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i,y_i,h_i); update w_{t+1} as the ε-optimal solution of min ||w||² + C Σ_i ξ_i s.t. w^T Ψ(x_i,y_i,h_i*) - w^T Ψ(x_i,y,h) ≥ Δ(y_i,y) - ξ_i. Problem: overfitting in the initial iterations

44 Annealing the Tolerance: the same procedure, but update w_{t+1} as the ε'-optimal solution of min ||w||² + C Σ_i ξ_i s.t. w^T Ψ(x_i,y_i,h_i*) - w^T Ψ(x_i,y,h) ≥ Δ(y_i,y) - ξ_i, annealing the tolerance ε' ← ε'/K after each iteration until ε' = ε; repeat until convergence
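One possible reading of this schedule (an assumption, not stated on the slide): start with a loose inner tolerance ε' so the early, overfitting-prone iterations are solved only approximately, and tighten it by a factor K each iteration until the target ε is reached. A small sketch:

```python
def annealed_tolerances(eps_initial, eps, K, max_iter):
    """Tolerance schedule: divide by K each CCCP iteration, never below the target eps."""
    eps_prime, schedule = eps_initial, []
    for _ in range(max_iter):
        schedule.append(eps_prime)
        eps_prime = max(eps_prime / K, eps)
    return schedule

# annealed_tolerances(1.0, 0.001, 10, 6) -> [1.0, 0.1, 0.01, 0.001, 0.001, 0.001]
```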

45–52 Results (annealing the tolerance): objective, train error, test error, and time plots

53 Outline – Annotation Mismatch: Latent SVM; Optimization; Practice (Annealing the Tolerance, Annealing the Regularization, Self-Paced Learning, Choice of Loss Function); Extensions

54 CCCP for Latent SVM: start with an initial estimate w_0; repeat until convergence: h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i,y_i,h_i); update w_{t+1} as the ε-optimal solution of min ||w||² + C Σ_i ξ_i s.t. w^T Ψ(x_i,y_i,h_i*) - w^T Ψ(x_i,y,h) ≥ Δ(y_i,y) - ξ_i. Problem: overfitting in the initial iterations

55 Annealing the Regularization: the same procedure, but update w_{t+1} as the ε-optimal solution of min ||w||² + C' Σ_i ξ_i with the same constraints, annealing C' ← C' × K after each iteration until C' = C; repeat until convergence
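The symmetric schedule for the regularization, under the same assumption: start with a small C' so the regularizer dominates early on, and multiply by K each iteration until the target C is reached.

```python
def annealed_C(C_initial, C, K, max_iter):
    """Regularization schedule: multiply by K each CCCP iteration, capped at the target C."""
    C_prime, schedule = C_initial, []
    for _ in range(max_iter):
        schedule.append(C_prime)
        C_prime = min(C_prime * K, C)
    return schedule

# annealed_C(0.01, 10.0, 10, 5) -> [0.01, 0.1, 1.0, 10.0, 10.0]
```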

56–63 Results (annealing the regularization): objective, train error, test error, and time plots

64 Outline – Annotation Mismatch: Latent SVM; Optimization; Practice (Annealing the Tolerance, Annealing the Regularization, Self-Paced Learning, Choice of Loss Function); Extensions. Reference: Kumar, Packer and Koller, NIPS 2010

65 CCCP for Human Learning: 1 + 1 = 2, 1/3 + 1/6 = 1/2, e^{iπ} + 1 = 0. “Math is for losers!!” FAILURE … BAD LOCAL MINIMUM

66 Self-Paced Learning: 1 + 1 = 2, 1/3 + 1/6 = 1/2, e^{iπ} + 1 = 0. “Euler was a Genius!!” SUCCESS … GOOD LOCAL MINIMUM

67 Self-Paced Learning: start with “easy” examples, then consider “hard” ones. Deciding easy vs. hard by hand is expensive, and easy for a human ≠ easy for a machine. Instead, simultaneously estimate easiness and parameters; easiness is a property of data sets, not single instances

68 CCCP for Latent SVM: start with an initial estimate w_0; repeat: h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i,y_i,h_i); update w_{t+1} as the ε-optimal solution of min ||w||² + C Σ_i ξ_i s.t. w^T Ψ(x_i,y_i,h_i*) - w^T Ψ(x_i,y,h) ≥ Δ(y_i,y) - ξ_i

69 Self-Paced Learning: min ||w||² + C Σ_i ξ_i s.t. w^T Ψ(x_i,y_i,h_i*) - w^T Ψ(x_i,y,h) ≥ Δ(y_i,y,h) - ξ_i

70 Self-Paced Learning: min ||w||² + C Σ_i v_i ξ_i with the same constraints and v_i ∈ {0,1}; on its own this has the trivial solution v_i = 0 for all i

71 Self-Paced Learning: min ||w||² + C Σ_i v_i ξ_i - Σ_i v_i/K with v_i ∈ {0,1}; the optimal v_i = 1 exactly when C ξ_i < 1/K, so a large K selects only the easiest examples and a small K selects almost all of them

72 Self-Paced Learning: relax v_i ∈ [0,1]; the problem becomes biconvex and can be solved by alternating convex search

73 SPL for Latent SVM: start with an initial estimate w_0; repeat: h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i,y_i,h_i); update w_{t+1} as the ε-optimal solution of min ||w||² + C Σ_i v_i ξ_i - Σ_i v_i/K s.t. w^T Ψ(x_i,y_i,h_i*) - w^T Ψ(x_i,y,h) ≥ Δ(y_i,y) - ξ_i; decrease K (K ← K/μ for some annealing factor μ > 1), as sketched below
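A sketch of this alternating procedure, assuming a hypothetical solve_weighted_svm for the w-step; the v-step uses the closed form noted above (v_i = 1 exactly when Cξ_i < 1/K), and the annealing factor mu = 1.3 is an assumed value, not taken from the slides.

```python
import numpy as np

def self_paced_latent_svm(w0, data, labels, latent_space, psi, delta,
                          solve_weighted_svm, C=1.0, K=1.0, mu=1.3, n_iter=20):
    """Self-paced learning for latent SVM: alternate between selecting easy
    examples (v-step) and updating the parameters (w-step), relaxing K over time."""
    w = w0
    for _ in range(n_iter):
        # Impute the latent variables for the annotated labels.
        h_star = [max(latent_space, key=lambda h: float(w @ psi(x, y, h)))
                  for (x, y) in data]
        # Structured slack xi_i of each example under the current w.
        xi = [max(delta(y, yb) + float(w @ psi(x, yb, hb))
                  for yb in labels for hb in latent_space)
              - float(w @ psi(x, y, hs))
              for (x, y), hs in zip(data, h_star)]
        # v-step (closed form): keep examples whose slack is small enough.
        v = [1.0 if C * s < 1.0 / K else 0.0 for s in xi]
        # w-step: weighted structural SVM on the selected examples (hypothetical solver).
        w = solve_weighted_svm(data, h_star, v, C)
        K /= mu   # anneal K so that harder examples are admitted in later passes
    return w
```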

74–81 Results (self-paced learning): objective, train error, test error, and time plots

82 Outline – Annotation Mismatch: Latent SVM; Optimization; Practice (Annealing the Tolerance, Annealing the Regularization, Self-Paced Learning, Choice of Loss Function); Extensions. Reference: Behl, Jawahar and Kumar, In Preparation

83 Ranking: Rank 1, Rank 2, Rank 3, Rank 4, Rank 5, Rank 6; Average Precision = 1

84 Ranking: three rankings of the same six items give Average Precision = 1 (Accuracy = 1), Average Precision = 0.92 (Accuracy = 0.67), and Average Precision = 0.81; the sketch below reproduces these values
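These numbers follow from the standard average-precision computation; the sketch below assumes 6 ranked items with 3 relevant ones placed at ranks {1,2,3}, {1,2,4} and {1,3,4}, a guess at the configurations behind the three values on the slide.

```python
def average_precision(relevant_ranks):
    """AP = mean, over the relevant items, of the precision at each item's rank."""
    ranks = sorted(relevant_ranks)
    return sum((k + 1) / r for k, r in enumerate(ranks)) / len(ranks)

print(round(average_precision([1, 2, 3]), 2))  # 1.0  (all positives ranked first)
print(round(average_precision([1, 2, 4]), 2))  # 0.92
print(round(average_precision([1, 3, 4]), 2))  # 0.81
```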

85 Ranking: during testing, AP is frequently used, but during training a surrogate loss is used, which is contradictory to loss-based learning; optimize AP directly

86 Results: statistically significant improvement

87 Speed – Proximal Regularization: start with a good initial estimate w_0; repeat until convergence: h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i,y_i,h_i); update w_{t+1} as the ε-optimal solution of min ||w||² + C Σ_i ξ_i + C_t ||w - w_t||² s.t. w^T Ψ(x_i,y_i,h_i*) - w^T Ψ(x_i,y,h) ≥ Δ(y_i,y) - ξ_i

88 Speed – Cascades Weiss and Taskar, AISTATS 2010 Sapp, Toshev and Taskar, ECCV 2010

89 Accuracy – (Self) Pacing: pacing the sample complexity (NIPS 2010); pacing the model complexity; pacing the problem complexity

90 Building Accurate Systems: Model 85%, Inference 5%, Learning 10%. Learning cannot provide huge gains without a good model; inference cannot provide huge gains without a good model

91 Outline – Annotation Mismatch: Latent SVM; Optimization; Practice; Extensions (Latent Variable Dependent Loss, Max-Margin Min-Entropy Models). Reference: Yu and Joachims, ICML 2009

92 Latent Variable Dependent Loss: Δ(y_i, y_i(w), h_i(w)) = Δ(y_i, y_i(w), h_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) - w^T Ψ(x_i, y_i(w), h_i(w))

93 Latent Variable Dependent Loss: ≤ Δ(y_i, y_i(w), h_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) - max_{h_i} w^T Ψ(x_i, y_i, h_i), since (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h)

94 Latent Variable Dependent Loss: min_w ||w||² + C Σ_i ξ_i s.t. max_{y,h} {w^T Ψ(x_i,y,h) + Δ(y_i, y, h)} - max_{h_i} w^T Ψ(x_i,y_i,h_i) ≤ ξ_i

95 Optimizing Precision@k: input X = {x_i, i = 1,…,n}; annotation Y = {y_i, i = 1,…,n} ∈ {-1,+1}^n; latent H = ranking; Δ(Y*, Y, H) = 1 - Precision@k
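A minimal sketch of this loss for a predicted ranking over items with ground-truth labels in {-1,+1} (the helper name and argument layout are illustrative):

```python
def precision_at_k_loss(y_true, ranking, k):
    """Delta = 1 - Precision@k: fraction of the top-k ranked items that are not positive.
    `ranking` lists the item indices from highest to lowest score."""
    top_k = ranking[:k]
    precision = sum(1 for i in top_k if y_true[i] == +1) / k
    return 1.0 - precision

# Example: y_true = [+1, -1, +1, -1], ranking = [0, 2, 1, 3] -> loss 0.0 at k = 2.
```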

96 Outline – Annotation Mismatch: Latent SVM; Optimization; Practice; Extensions (Latent Variable Dependent Loss, Max-Margin Min-Entropy (M3E) Models). Reference: Miller, Kumar, Packer, Goodman and Koller, AISTATS 2012

97 Running vs. Jumping Classification: score w^T Ψ(x,y,h) ∈ (-∞, +∞); a table of scores w^T Ψ(x,y_1,h) over the latent values h

98 Running vs. Jumping Classification: comparing the score tables w^T Ψ(x,y_1,h) and w^T Ψ(x,y_2,h), only the maximum score is used for prediction. Is there no other useful cue? The uncertainty in h

99 M3E: scoring function P_w(y,h|x) = exp(w^T Ψ(x,y,h)) / Z(x), where Z(x) is the partition function; prediction y(w) = argmin_y {H_α(P_w(h|y,x)) - log P_w(y|x)}, combining the Rényi entropy of the latent conditional with the marginalized probability, which equals the Rényi entropy of the generalized distribution, G_α(y;x,w)

100 Rényi Entropy: G_α(y;x,w) = (1/(1-α)) log [ Σ_h P_w(y,h|x)^α / Σ_h P_w(y,h|x) ]. In the limit α = 1 this is the Shannon entropy of the generalized distribution, - Σ_h P_w(y,h|x) log P_w(y,h|x) / Σ_h P_w(y,h|x)

101 Rényi Entropy: for α = infinity it is the minimum entropy of the generalized distribution, - max_h log P_w(y,h|x)

102 Rényi Entropy: for α = infinity this reduces (up to the constant log Z(x)) to - max_h w^T Ψ(x,y,h), so M3E makes the same prediction as latent SVM
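A sketch of the M3E prediction rule built directly from these formulas, working with the per-(y,h) scores w^T Ψ(x,y,h); the partition function Z(x) shifts G_α by the same constant for every label, so it can be dropped for prediction, and for very large α the rule coincides with the latent SVM argmax. The function names and the example value of α are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def g_alpha(scores_h, alpha):
    """Renyi entropy of the generalized distribution for one label y, computed from the
    scores s[h] = w . Psi(x, y, h); requires alpha != 1 (use the Shannon limit there)."""
    s = np.asarray(scores_h, dtype=float)
    return (logsumexp(alpha * s) - logsumexp(s)) / (1.0 - alpha)

def m3e_predict(score_table, alpha=10.0):
    """y(w) = argmin_y G_alpha(y; x, w); score_table has shape (num_labels, num_h)."""
    return int(np.argmin([g_alpha(row, alpha) for row in score_table]))

# As alpha grows, g_alpha(row, alpha) -> -max(row), so the argmin over labels becomes
# argmax_y max_h of the score table, i.e. the latent SVM prediction.
```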

103 Learning M3E: training data {(x_i,y_i), i = 1,2,…,n}; w* = argmin_w Σ_i Δ(y_i, y_i(w)), which is highly non-convex in w, and w cannot be regularized to prevent overfitting

104 Learning M3E: Δ(y_i, y_i(w)) = Δ(y_i, y_i(w)) + G_α(y_i(w);x_i,w) - G_α(y_i(w);x_i,w) ≤ Δ(y_i, y_i(w)) + G_α(y_i;x_i,w) - G_α(y_i(w);x_i,w) ≤ max_y {Δ(y_i,y) + G_α(y_i;x_i,w) - G_α(y;x_i,w)}

105 Learning M3E: training data {(x_i,y_i), i = 1,2,…,n}; min_w ||w||² + C Σ_i ξ_i s.t. G_α(y_i;x_i,w) + Δ(y_i,y) - G_α(y;x_i,w) ≤ ξ_i. When α tends to infinity, M3E = latent SVM; other values of α can give better results

106 Motif Finding Results: motif + Markov background model (Yu and Joachims, 2009)

107 Part II

108 Output Mismatch (Action Detection): input x, annotation y = “jumping”, latent h. There is a mismatch between the output and the available annotations; the exact value of the latent variable is important; the desired output during test time is (y,h)

109 Outline – Output Mismatch: Problem Formulation; Dissimilarity Coefficient Learning; Optimization; Experiments

110 Weakly Supervised Data: input x, output y ∈ {0,1,…,C}, hidden h (example shown: y = 0)

111 Weakly Supervised Detection: feature Φ(x,h), joint feature vector Ψ(x,y,h)

112 Weakly Supervised Detection: Ψ(x,0,h) = [Φ(x,h); 0; …; 0]

113 Weakly Supervised Detection: Ψ(x,1,h) = [0; Φ(x,h); 0; …; 0]

114 Weakly Supervised Detection: Ψ(x,C,h) = [0; …; 0; Φ(x,h)]

115 Weakly Supervised Detection: score f: Ψ(x,y,h) → (-∞, +∞); optimize the score over all possible y and h

116 Linear Model: scoring function w^T Ψ(x,y,h) with parameters w; prediction (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h)

117 Minimizing General Loss: min_w Σ_i Δ(y_i,h_i,y_i(w),h_i(w)) over the supervised samples + Σ_i Δ'(y_i,y_i(w),h_i(w)) over the weakly supervised samples, whose latent variable values are unknown

118 Minimizing General Loss: min_w Σ_i Σ_{h_i} P_w(h_i|x_i,y_i) Δ(y_i,h_i,y_i(w),h_i(w)); a single distribution has to achieve two objectives

119 Outline – Output Mismatch: Problem Formulation; Dissimilarity Coefficient Learning; Optimization; Experiments. Reference: Kumar, Packer and Koller, ICML 2012

120 Problem: model the uncertainty in the latent variables, and model the accuracy of the latent variable predictions

121 Solution: use two different distributions for the two different tasks, one to model the uncertainty in the latent variables and one to model the accuracy of the latent variable predictions

122 Solution: P_θ(h_i|y_i,x_i), a distribution over the latent variables h_i, models the uncertainty

123 Solution: P_w(y_i,h_i|x_i), concentrated at the prediction (y_i(w), h_i(w)), models the accuracy

124 The Ideal Case: no latent variable uncertainty and a correct prediction; P_θ(h_i|y_i,x_i) concentrates on h_i(w) and P_w(y_i,h_i|x_i) on (y_i, h_i(w))

125 In Practice: restrictions in the representation power of the models; P_w(y_i,h_i|x_i) is concentrated at (y_i(w), h_i(w)) while P_θ(h_i|y_i,x_i) retains uncertainty over h_i

126 Our Framework: minimize the dissimilarity between the two distributions P_θ(h_i|y_i,x_i) and P_w(y_i,h_i|x_i), for a user-defined dissimilarity measure Δ

127 Our Framework: minimize Rao's dissimilarity coefficient, whose cross term is H_i(w,θ) = Σ_h Δ(y_i,h,y_i(w),h_i(w)) P_θ(h|y_i,x_i)

128 Our Framework: subtract β times the self-dissimilarity of P_θ, H_i(θ,θ) = Σ_{h,h'} Δ(y_i,h,y_i,h') P_θ(h|y_i,x_i) P_θ(h'|y_i,x_i)

129 Our Framework: the remaining term, (1-β) Δ(y_i(w),h_i(w),y_i(w),h_i(w)), is the self-dissimilarity of the prediction and equals zero

130 Our Framework: learn with min_{w,θ} Σ_i H_i(w,θ) - β H_i(θ,θ)
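To make the objective concrete, a small sketch that evaluates H_i(w,θ) and H_i(θ,θ) for one training example, given a user-supplied loss Δ over latent values and P_θ(h|y_i,x_i) as a vector over h (all names are illustrative; the (1-β)-weighted self-term of the prediction is omitted because it equals zero).

```python
import numpy as np

def dissimilarity_coefficient_term(delta_to_pred, delta_hh, p_theta, beta):
    """Per-example objective  H_i(w, theta) - beta * H_i(theta, theta).

    delta_to_pred[h]  = Delta(y_i, h, y_i(w), h_i(w))  for every latent value h
    delta_hh[h, h']   = Delta(y_i, h, y_i, h')         loss between latent values
    p_theta[h]        = P_theta(h | y_i, x_i)
    """
    p = np.asarray(p_theta, dtype=float)
    H_w_theta = float(np.asarray(delta_to_pred) @ p)       # expected loss w.r.t. the prediction
    H_theta_theta = float(p @ np.asarray(delta_hh) @ p)    # expected self-dissimilarity of P_theta
    return H_w_theta - beta * H_theta_theta

# Example with 3 latent values, a 0/1 loss over h, and the prediction at h = 0:
# dissimilarity_coefficient_term([0, 1, 1], 1 - np.eye(3), [0.7, 0.2, 0.1], beta=0.5)
```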

131 Outline – Output Mismatch: Problem Formulation; Dissimilarity Coefficient Learning; Optimization; Experiments

132 Optimization: min_{w,θ} Σ_i H_i(w,θ) - β H_i(θ,θ). Initialize the parameters to w_0 and θ_0; repeat until convergence: fix w and optimize θ, then fix θ and optimize w

133–134 Optimization of θ: min_θ Σ_i Σ_h Δ(y_i,h,y_i(w),h_i(w)) P_θ(h|y_i,x_i) - β H_i(θ,θ); Case I: y_i(w) = y_i

135–136 Optimization of θ: the same objective in Case II: y_i(w) ≠ y_i; solved with stochastic subgradient descent

137 Optimization of w: min_w Σ_i Σ_h Δ(y_i,h,y_i(w),h_i(w)) P_θ(h|y_i,x_i), an expected loss that models the uncertainty. The form of the optimization is similar to latent SVM (a Δ independent of h recovers latent SVM exactly) and is solved with the concave-convex procedure (CCCP)

138 Outline – Output Mismatch: Problem Formulation; Dissimilarity Coefficient Learning; Optimization; Experiments

139 Action Detection: input x (image), output y = “Using Computer”, latent variable h. PASCAL VOC 2011, 60/40 train/test split, 5 folds. Classes: Jumping, Phoning, Playing Instrument, Reading, Riding Bike, Riding Horse, Running, Taking Photo, Using Computer, Walking. Training data: inputs x_i with outputs y_i

140 Results – 0/1 Loss: statistically significant

141 Results – Overlap Loss: statistically significant

142 Questions? http://www.centrale-ponts.fr/personnel/pawan

