Learning from Big Data Lecture 5


1 Learning from Big Data Lecture 5
M. Pawan Kumar Slides available online

2 Outline Structured Output Prediction Structured Output SVM
Optimization Results

3 Image Classification Is this an urban or rural area? Input: x
Output: y ∈ {-1,+1}

4 Image Classification Is this scan healthy or unhealthy? Input: x
Output: y ∈ {-1,+1}

5 Image Classification Probabilistic Graphical Model
Unobserved output y ∈ {-1,+1}, observed input x

6 Feature Vector x Feature Φ(x)

7 Feature Vector Pre-Trained CNN: x → conv1 → conv2 → conv3 → … → fc7 → Feature Φ(x)

8 Joint Feature Vector Input: x Output: y ∈ {-1,+1} Ψ(x,y)

9 Joint Feature Vector Input: x Output: y ∈ {-1,+1} Ψ(x,-1) = [Φ(x); 0]

10 Joint Feature Vector Input: x Output: y ∈ {-1,+1} Ψ(x,+1) = [0; Φ(x)]

11 Score Function
Input: x Output: y ∈ {-1,+1} f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wTΨ(x,y)

12 Prediction
Input: x Output: y ∈ {-1,+1} f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wTΨ(x,y) Maximize the score over all possible outputs: y* = argmaxy f(Ψ(x,y))
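A minimal sketch (not from the slides) of the binary joint feature map and prediction; Φ(x) is assumed to be a fixed feature vector, e.g. the fc7 activations of the pre-trained CNN:

```python
import numpy as np

def joint_feature(phi_x, y):
    """Psi(x, y): stack Phi(x) into the slot selected by y in {-1, +1}."""
    d = phi_x.shape[0]
    psi = np.zeros(2 * d)
    if y == -1:
        psi[:d] = phi_x      # Psi(x, -1) = [Phi(x); 0]
    else:
        psi[d:] = phi_x      # Psi(x, +1) = [0; Phi(x)]
    return psi

def predict(w, phi_x):
    """y* = argmax_y w^T Psi(x, y), maximised over the two possible outputs."""
    return max((-1, +1), key=lambda y: w @ joint_feature(phi_x, y))
```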

13 Outline
Structured Output Prediction (Binary Output, Multi-label Output, Structured Output Learning), Structured Output SVM, Optimization, Results

14 Image Classification Which city is this? Input: x
Output: y ∈ {1,2,…,C}

15 Image Classification What type of tumor does this scan contain?
Input: x Output: y ∈ {1,2,…,C}

16 Image Classification Graphical Model
Unobserved output y ∈ {1,2,…,C}, observed input x

17 Feature Vector Pre-Trained CNN: x → conv1 → conv2 → conv3 → … → fc7 → Feature Φ(x)

18 Joint Feature Vector Input: x Output: y ∈ {1,2,…,C} Ψ(x,y)

19 Joint Feature Vector Input: x Output: y ∈ {1,2,…,C} Ψ(x,1) = [Φ(x); 0; …; 0]

20 Joint Feature Vector Input: x Output: y ∈ {1,2,…,C} Ψ(x,2) = [0; Φ(x); 0; …; 0]

21 Joint Feature Vector Input: x Output: y ∈ {1,2,…,C} Ψ(x,C) = [0; …; 0; Φ(x)]

22 Object Detection Where is the object in the image? Input: x
Output: y ∈ {Pixels}

23 Object Detection Where is the rupture in the scan? Input: x
Output: y ∈ {Pixels}

24 Object Detection Graphical Model
Unobserved output y, observed input x

25 Joint Feature Vector Pre-Trained CNN: (x, y) → conv1 → conv2 → conv3 → … → fc7 → Ψ(x,y)

26 Joint Feature Vector Pre-Trained CNN: (x, y) → conv1 → conv2 → conv3 → … → fc7 → Ψ(x,y)

27 Joint Feature Vector Pre-Trained CNN: (x, y) → conv1 → conv2 → conv3 → … → fc7 → Ψ(x,y)

28 Score Function
Input: x Output: y ∈ {1,2,…,C} f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wTΨ(x,y)

29 Prediction
Input: x Output: y ∈ {1,2,…,C} f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wTΨ(x,y) Maximize the score over all possible outputs: y* = argmaxy f(Ψ(x,y))
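A minimal sketch (not from the slides) of the multi-class joint feature map and brute-force prediction over the C labels; Φ(x) is again assumed to be a fixed feature vector:

```python
import numpy as np

def joint_feature(phi_x, y, C):
    """Psi(x, y): place Phi(x) in the y-th of C blocks, zeros elsewhere."""
    d = phi_x.shape[0]
    psi = np.zeros(C * d)
    psi[(y - 1) * d:y * d] = phi_x   # labels are 1, 2, ..., C
    return psi

def predict(w, phi_x, C):
    """y* = argmax_y w^T Psi(x, y); brute force over the C labels."""
    return max(range(1, C + 1), key=lambda y: w @ joint_feature(phi_x, y, C))
```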

30 Outline
Structured Output Prediction (Binary Output, Multi-label Output, Structured Output Learning), Structured Output SVM, Optimization, Results

31 Segmentation
What is the semantic class of each pixel (car, road, grass, tree, sky)? Input: x Output: y ∈ {1,2,…,C}^m

32 Segmentation What is the muscle group of each pixel? Input: x
Output: y ∈ {1,2,…,C}^m

33 Segmentation Graphical Model: grid of observed inputs x1,…,x9 with unobserved outputs y1,…,y9

34 Feature Vector Pre-Trained CNN: x1 → conv1 → conv2 → conv3 → … → fc7 → Feature Φ(x1)

35 Joint Feature Vector
Input: x1 Output: y1 ∈ {1,2,…,C} Ψu(x1,1) = [Φ(x1); 0; …; 0]

36 Joint Feature Vector
Input: x1 Output: y1 ∈ {1,2,…,C} Ψu(x1,2) = [0; Φ(x1); 0; …; 0]

37 Joint Feature Vector
Input: x1 Output: y1 ∈ {1,2,…,C} Ψu(x1,C) = [0; …; 0; Φ(x1)]

38 Feature Vector Pre-Trained CNN: x2 → conv1 → conv2 → conv3 → … → fc7 → Feature Φ(x2)

39 Joint Feature Vector
Input: x2 Output: y2 ∈ {1,2,…,C} Ψu(x2,1) = [Φ(x2); 0; …; 0]

40 Joint Feature Vector
Input: x2 Output: y2 ∈ {1,2,…,C} Ψu(x2,2) = [0; Φ(x2); 0; …; 0]

41 Joint Feature Vector
Input: x2 Output: y2 ∈ {1,2,…,C} Ψu(x2,C) = [0; …; 0; Φ(x2)]

42 Overall Joint Feature Vector
Input: x Output: y ∈ {1,2,…,C}^m Ψu(x,y) = [Ψu(x1,y1); Ψu(x2,y2); …; Ψu(xm,ym)]

43 Score Function
Input: x Output: y ∈ {1,2,…,C}^m f: Ψu(x,y) → (-∞,+∞), f(Ψu(x,y)) = wTΨu(x,y)

44 Prediction
Input: x Output: y ∈ {1,2,…,C}^m f: Ψu(x,y) → (-∞,+∞), f(Ψu(x,y)) = wTΨu(x,y) y* = argmaxy f(Ψu(x,y))

45 Prediction
Input: x Output: y ∈ {1,2,…,C}^m f: Ψu(x,y) → (-∞,+∞), f(Ψu(x,y)) = wTΨu(x,y) y* = argmaxy wTΨu(x,y)

46 Prediction
Input: x Output: y ∈ {1,2,…,C}^m y* = argmaxy wTΨu(x,y) = argmaxy ∑a (wa)TΨu(xa,ya) Maximize for each a ∈ {1,2,…,m} independently
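With only unary terms the argmax decomposes over pixels, so prediction is m independent maximisations. A minimal sketch (not from the slides), assuming w is split into per-label blocks w1,…,wC stacked as the rows of a matrix W:

```python
import numpy as np

def predict_unary(W, Phi):
    """W: (C, d) matrix whose rows are the per-label weights w_1..w_C.
    Phi: (m, d) matrix of per-pixel features Phi(x_a).
    Returns, for each pixel a, the label in 1..C maximising (w_y)^T Phi(x_a)."""
    scores = Phi @ W.T            # (m, C): score of every label at every pixel
    return scores.argmax(axis=1) + 1
```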

47 Segmentation Graphical Model: grid of observed inputs x1,…,x9 with unobserved outputs y1,…,y9

48 Unary Joint Feature Vector
Input: x Output: y ∈ {1,2,…,C}^m Ψu(x,y) = [Ψu(x1,y1); Ψu(x2,y2); …; Ψu(xm,ym)]

49 Pairwise Joint Feature Vector
Grid of observed inputs x1,…,x9 with unobserved outputs y1,…,y9; pairwise terms connect neighbouring outputs

50 Pairwise Joint Feature Vector
For the neighbouring pair (y1, y2): Ψp(x12,y12) = δ(y1=y2)

51 Pairwise Joint Feature Vector
For the neighbouring pair (y2, y3): Ψp(x23,y23) = δ(y2=y3)

52 Pairwise Joint Feature Vector
Input: x Output: y ∈ {1,2,…,C}^m Ψp(x,y) = [Ψp(x12,y12); Ψp(x23,y23); …]

53 Overall Joint Feature Vector
Input: x Output: y ∈ {1,2,…,C}^m Ψ(x,y) = [Ψu(x,y); Ψp(x,y)]
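A sketch (assumed layout, not from the slides) of assembling the overall joint feature vector from the per-pixel unary blocks and the pairwise indicators δ(ya = yb) over neighbouring pairs:

```python
import numpy as np

def unary_feature(phi_a, y_a, C):
    """Psi_u(x_a, y_a): Phi(x_a) placed in the y_a-th of C blocks."""
    d = phi_a.shape[0]
    psi = np.zeros(C * d)
    psi[(y_a - 1) * d:y_a * d] = phi_a
    return psi

def joint_feature(Phi, y, edges, C):
    """Phi: (m, d) per-pixel features, y: length-m labelling in {1..C},
    edges: list of neighbouring pairs (a, b). Psi(x,y) = [Psi_u(x,y); Psi_p(x,y)]."""
    psi_u = np.concatenate([unary_feature(Phi[a], y[a], C) for a in range(len(y))])
    psi_p = np.array([float(y[a] == y[b]) for (a, b) in edges])  # delta(y_a = y_b)
    return np.concatenate([psi_u, psi_p])
```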

54 Score Function
Input: x Output: y ∈ {1,2,…,C}^m f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wTΨ(x,y)

55 Prediction
Input: x Output: y ∈ {1,2,…,C}^m f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wTΨ(x,y) y* = argmaxy f(Ψ(x,y))

56 Prediction
Input: x Output: y ∈ {1,2,…,C}^m f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wTΨ(x,y) y* = argmaxy wTΨ(x,y)

57 Prediction
Input: x Output: y ∈ {1,2,…,C}^m y* = argmaxy wTΨ(x,y) = argmaxy ∑a (wa)TΨu(xa,ya) + ∑a,b (wab)TΨp(xab,yab) Week 5 “Optimization” lectures

58 Summary
Input x and outputs {y1,y2,…} → Extract features Ψ(x,yi) → Compute scores f(Ψ(x,yi)) → Prediction y(f) = argmaxyi f(Ψ(x,yi)). How do I fix “f”?

59 Outline
Structured Output Prediction (Binary Output, Multi-label Output, Structured Output Learning), Structured Output SVM, Optimization, Results

60 Learning Objective
Data distribution P(x,y) (the distribution is unknown). Measure of prediction quality: f* = argminf EP(x,y) Error(y(f),y), the expectation over the data distribution of the error between prediction y(f) and ground truth y

61 Learning Objective
Training data {(xi,yi), i = 1,2,…,n}. Measure of prediction quality: f* = argminf EP(x,y) Error(y(f),y), the expectation over the data distribution of the error between prediction and ground truth

62 Learning Objective
Training data {(xi,yi), i = 1,2,…,n}: finite samples. f* = argminf Σi Error(yi(f),yi), the expectation over the empirical distribution

63 Learning Objective
Training data {(xi,yi), i = 1,2,…,n}: finite samples. f* = argminf Σi Error(yi(f),yi) + λ R(f), where R(f) is a regularizer and λ its relative weight (a hyperparameter)

64 Learning Objective
Training data {(xi,yi), i = 1,2,…,n}: finite samples. f* = argminf Σi Error(yi(f),yi) + λ R(f). For a probabilistic model, Error can be the negative log-likelihood

65 Outline Structured Output Prediction Structured Output SVM
Optimization Results Taskar et al. NIPS 2003; Tsochantaridis et al. ICML 2004

66 Score Function and Prediction
Input: x Output: y Joint feature vector of input and output: Ψ(x,y) f(Ψ(x,y)) = wTΨ(x,y) Prediction: maxy wTΨ(x,y) Predicted Output: y(w) = argmaxy wTΨ(x,y)

67 Error Function
Loss or risk of prediction given ground-truth: Δ(y,y(w)), user specified. Classification loss (e.g. “New York” vs “Paris”): Δ(y,y(w)) = δ(y ≠ y(w))

68 Error Function
Loss or risk of prediction given ground-truth: Δ(y,y(w)), user specified. Detection loss: based on the overlap score = area of intersection / area of union

69 Error Function
Loss or risk of prediction given ground-truth: Δ(y,y(w)), user specified. Segmentation loss: fraction of incorrect pixels (micro-average or macro-average)
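Illustrative sketches (assumed implementations, not from the slides) of the three user-specified losses Δ on the preceding slides:

```python
import numpy as np

def classification_loss(y_true, y_pred):
    """0-1 loss: 1 if the predicted label differs from the ground truth."""
    return float(y_true != y_pred)

def detection_loss(box_true, box_pred):
    """1 - overlap score, where overlap = area of intersection / area of union.
    Boxes are (x1, y1, x2, y2); degenerate boxes are not handled in this sketch."""
    ix1, iy1 = max(box_true[0], box_pred[0]), max(box_true[1], box_pred[1])
    ix2, iy2 = min(box_true[2], box_pred[2]), min(box_true[3], box_pred[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_true) + area(box_pred) - inter
    return 1.0 - inter / union

def segmentation_loss(y_true, y_pred):
    """Micro-averaged fraction of incorrectly labelled pixels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))
```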

70 Learning Objective Training data {(xi,yi), i = 1,2,…,n}
Loss function for i-th sample Δ(yi,yi(w)) Minimize the regularized sum of loss over training data Highly non-convex in w Regularization plays no role (overfitting may occur)

71 Learning Objective
Training data {(xi,yi), i = 1,2,…,n}. Upper bound on the loss: Δ(yi,yi(w)) = wTΨ(xi,yi(w)) + Δ(yi,yi(w)) - wTΨ(xi,yi(w)) ≤ wTΨ(xi,yi(w)) + Δ(yi,yi(w)) - wTΨ(xi,yi) ≤ maxy { wTΨ(xi,y) + Δ(yi,y) } - wTΨ(xi,yi). The bound is convex in w and sensitive to the regularization of w

72 Learning Objective
Training data {(xi,yi), i = 1,2,…,n}: minw ||w||² + C Σi ξi s.t. wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi) ≤ ξi for all y. A quadratic program with a large number of constraints; many polynomial time algorithms

73 Outline Structured Output Prediction Structured Output SVM
Optimization Stochastic subgradient descent Conditional gradient aka Frank-Wolfe Results Shalev-Shwartz et al. Mathematical Programming 2011

74 Gradient
Convex function g(z): the gradient s at a point z0 satisfies g(z) – g(z0) ≥ sT(z-z0). Example: g(z) = z², gradient at z0 is 2z0

75 Gradient Descent minz g(z) Start at some point z0
Move along the negative gradient direction: zt+1 ← zt – λt g’(zt). Estimate the step-size via line search. Example: g(z) = z²

76 Gradient
Convex function g(z): the gradient s at a point z0 may not exist. g(z) – g(z0) ≥ sT(z-z0). Example: g(z) = |z|

77 Subgradient
Convex function g(z): a subgradient s at a point z0 satisfies g(z) – g(z0) ≥ sT(z-z0); it may not be unique. Example: g(z) = |z|

78 Subgradient Descent minz g(z) Start at some point z0
Move along the negative subgradient direction zt+1 ← zt – λtg’(zt) Estimate step-size via line search g(z) = |z| Doesn’t always work

79 Subgradient Descent
Example: minz max{z2 + 2z1, z2 – 2z1} (figure: level sets g(z) = 3, 4, 5 in the (z1, z2) plane)

80 Subgradient Descent minz g(z) Start at some point z0
Move along the negative subgradient direction zt+1 ← zt – λtg’(zt) Estimate step-size via line search g(z) = |z| Doesn’t always work

81 Subgradient Descent
minz g(z): start at some point z0 and move along the negative subgradient direction, zt+1 ← zt – λt g’(zt). Convergence requires limt→∞ λt = 0 and limT→∞ ∑t=1T λt = ∞. Example: g(z) = |z|
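As a small illustration (not from the slides), subgradient descent on g(z) = |z| with the decaying step size λt = 1/(t+1), which satisfies both conditions:

```python
def subgradient_abs(z):
    """A subgradient of g(z) = |z| (any value in [-1, 1] is valid at z = 0)."""
    return 1.0 if z > 0 else (-1.0 if z < 0 else 0.0)

z = 5.0
for t in range(1000):
    step = 1.0 / (t + 1)                 # lambda_t -> 0, sum_t lambda_t -> infinity
    z = z - step * subgradient_abs(z)
print(z)  # ends up close to the minimiser z = 0
```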

82 Learning Objective Training data {(xi,yi), i = 1,2,…,n}
minw ||w||² + C Σi ξi s.t. wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi) ≤ ξi for all y. Constrained problem?

83 Learning Objective
Training data {(xi,yi), i = 1,2,…,n}: minw ||w||² + C Σi maxy{wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi)}. Subgradient? g(z) – g(z0) ≥ sT(z-z0)
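For concreteness, a sketch (not from the slides) of evaluating this regularised structured hinge objective; joint_feature, delta and the enumerable label set are hypothetical names standing in for the quantities defined earlier:

```python
import numpy as np

def ssvm_objective(w, data, joint_feature, delta, labels, C):
    """||w||^2 + C * sum_i max_y { w^T Psi(x_i,y) + Delta(y_i,y) - w^T Psi(x_i,y_i) }.
    data: list of (x_i, y_i); labels: output space small enough to brute force."""
    obj = float(w @ w)
    for x_i, y_i in data:
        score_gt = w @ joint_feature(x_i, y_i)
        hinge = max(w @ joint_feature(x_i, y) + delta(y_i, y) - score_gt for y in labels)
        obj += C * hinge
    return obj
```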

84 Subgradient C Σi maxy{wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi)}
Ψ(xi,y) - Ψ(xi,yi)

85 Subgradient ŷ = argmaxy{wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi)}
C Σi maxy{wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi)} Ψ(xi,ŷ) - Ψ(xi,yi) Proof?

86 Subgradient ŷ = argmaxy{wTΨ(xi,y) + Δ(yi,y)} Inference
C Σi maxy{wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi)} Ψ(xi,ŷ) - Ψ(xi,yi)

87 Inference ŷ = argmaxy{wTΨ(xi,y) + Δ(yi,y)} Classification inference
Output: y ∈ {1,2,…,C} Brute-force search
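A sketch of this loss-augmented inference by brute force over a small label set (hypothetical names, not from the slides); structured outputs need the dedicated solvers discussed later:

```python
def loss_augmented_inference(w, x_i, y_i, joint_feature, delta, labels):
    """y_hat = argmax_y { w^T Psi(x_i, y) + Delta(y_i, y) }, by brute-force search
    over an enumerable label set (e.g. classification with C labels)."""
    return max(labels, key=lambda y: w @ joint_feature(x_i, y) + delta(y_i, y))
```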

88 Inference ŷ = argmaxy{wTΨ(xi,y) + Δ(yi,y)} Detection inference
Output: y ∈ {Pixels} Brute-force search

89 Inference
ŷ = argmaxy{wTΨ(xi,y) + Δ(yi,y)} Segmentation inference: maxy ∑a (wa)TΨu(xia,ya) + ∑a,b (wab)TΨp(xiab,yab) + ∑a Δ(yia,ya) Week 5 “Optimization” lectures

90 Subgradient Descent
Start at some parameter w0. For t = 0 to T (number of iterations): s = 2wt; for i = 1 to n (number of samples): ŷ = argmaxy{wtTΨ(xi,y) + Δ(yi,y)}, s = s + C(Ψ(xi,ŷ) - Ψ(xi,yi)); end; wt+1 = wt - λt s with λt = 1/(t+1). End

91 Subgradient Descent
Start at some parameter w0. For t = 0 to T (number of iterations): s = 2wt; for i = 1 to n (number of samples): ŷ = argmaxy{wtTΨ(xi,y) + Δ(yi,y)}, s = s + C(Ψ(xi,ŷ) - Ψ(xi,yi)); end; wt+1 = wt - λt s with λt = 1/(t+1). End
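A sketch of this procedure in Python (not from the slides); joint_feature, delta and a small enumerable label set are assumed, and loss-augmented inference is done by brute force:

```python
import numpy as np

def ssvm_subgradient_descent(data, joint_feature, delta, labels, C, dim, T):
    """Batch subgradient descent for
    ||w||^2 + C * sum_i max_y { w^T Psi(x_i,y) + Delta(y_i,y) - w^T Psi(x_i,y_i) }."""
    w = np.zeros(dim)
    for t in range(T):
        s = 2.0 * w                                   # subgradient of ||w||^2
        for x_i, y_i in data:
            y_hat = max(labels, key=lambda y: w @ joint_feature(x_i, y) + delta(y_i, y))
            s += C * (joint_feature(x_i, y_hat) - joint_feature(x_i, y_i))
        w -= (1.0 / (t + 1)) * s                      # step along the negative subgradient
    return w
```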

92 Learning Objective
Training data {(xi,yi), i = 1,2,…,n}: minw ||w||² + C Σi maxy{wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi)}

93 Stochastic Approximation
Training data {(xi,yi), i = 1,2,…,n} minw ||w||2 + C Σi maxy{wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi)} Choose a sample ‘i’ with probability 1/n

94 Stochastic Approximation
Training data {(xi,yi), i = 1,2,…,n} minw ||w||2 + Cn maxy{wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi)} Choose a sample ‘i’ with probability 1/n Expected value? Original objective function

95 Stochastic Subgradient Descent
Start at some parameter w0. For t = 0 to T (number of iterations): s = 2wt; choose a sample ‘i’ with probability 1/n; ŷ = argmaxy{wtTΨ(xi,y) + Δ(yi,y)}; s = s + Cn(Ψ(xi,ŷ) - Ψ(xi,yi)); wt+1 = wt - λt s with λt = 1/(t+1). End
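A sketch of the stochastic variant under the same assumptions; the only change from the batch version above is that one sample is drawn per iteration and its subgradient term is scaled by n:

```python
import numpy as np

def ssvm_sgd(data, joint_feature, delta, labels, C, dim, T, seed=0):
    rng = np.random.default_rng(seed)
    n, w = len(data), np.zeros(dim)
    for t in range(T):
        x_i, y_i = data[rng.integers(n)]              # pick sample i with probability 1/n
        y_hat = max(labels, key=lambda y: w @ joint_feature(x_i, y) + delta(y_i, y))
        s = 2.0 * w + C * n * (joint_feature(x_i, y_hat) - joint_feature(x_i, y_i))
        w -= (1.0 / (t + 1)) * s                      # unbiased estimate of the full subgradient
    return w
```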

96 Convergence Rate Compute an ε-optimal solution C: SSVM hyperparameter
d: Number of non-zeros in the feature vector O(dC/ε) iterations Each iteration requires solving an inference problem

97 Side Note: Structured Output CNN
conv1 → conv2 → conv3 → conv4 → conv5 → fc6 → fc7 → SSVM; back-propagate the subgradients

98 Outline Structured Output Prediction Structured Output SVM
Optimization Stochastic subgradient descent Conditional gradient aka Frank-Wolfe Results Lacoste-Julien et al. ICML 2013

99 Conditional Gradient Slide courtesy Martin Jaggi

100 Conditional Gradient Slide courtesy Martin Jaggi

101 Conditional Gradient Slide courtesy Martin Jaggi

102 Conditional Gradient Slide courtesy Martin Jaggi

103 SSVM Primal
minw ||w||² + C Σi ξi s.t. wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi) ≤ ξi for all y. Derive the dual on the board

104 SSVM Dual
maxα -||Mα||²/4 + bTα s.t. ∑y αi(y) = C for all i, αi(y) ≥ 0 for all i, y; where w = Mα/2, b = [Δ(yi,y)], and the columns of M are Ψ(xi,yi) - Ψ(xi,y)

105 Linear Program
maxα bTα - (Mα)Twt s.t. ∑y αi(y) = C for all i, αi(y) ≥ 0 for all i, y. Solve this over all possible α: standard Frank-Wolfe. Solve this over all possible αi for a sample ‘i’: Block Coordinate Frank-Wolfe

106 Linear Program
maxα bTα - (Mα)Twt s.t. ∑y αi(y) = C for all i, αi(y) ≥ 0 for all i, y. Vertices? αi(y) = C if y = ŷ, 0 otherwise

107 Solution
maxα bTα - (Mα)Twt s.t. ∑y αi(y) = C for all i, αi(y) ≥ 0 for all i, y. Which vertex maximizes the linear function? ŷ = argmaxy{wtTΨ(xi,y) + Δ(yi,y)} (inference); si(y) = C if y = ŷ, 0 otherwise

108 Update αt+1 = (1-μ) αt + μs Standard Frank-Wolfe
s contains the solution for all the samples Block Coordinate Frank-Wolfe s contains the solution for sample ‘i’ sj = αtj for all other samples

109 Step-Size αt+1 = (1-μ) αt + μs
Maximizing a quadratic function in one variable μ Analytical computation of optimal step-size
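A sketch of Block Coordinate Frank-Wolfe for this dual (my reading of the update, not the authors' exact pseudocode): instead of storing the exponentially large α, each block keeps its primal contribution wi = Mi αi / 2 and ℓi = biT αi, and Δ(y,y) = 0 is assumed so that initialising every block at the ground-truth vertex gives w = 0.

```python
import numpy as np

def bcfw_ssvm(data, joint_feature, delta, labels, C, dim, T, seed=0):
    """Block-Coordinate Frank-Wolfe on the SSVM dual sketched above."""
    rng = np.random.default_rng(seed)
    n = len(data)
    w = np.zeros(dim)                 # w = sum_i w_i
    w_blocks = np.zeros((n, dim))     # w_i for each sample (alpha_i at the ground-truth vertex)
    l_blocks = np.zeros(n)            # l_i = b_i^T alpha_i (zero since Delta(y_i, y_i) = 0)
    for _ in range(T):
        i = rng.integers(n)
        x_i, y_i = data[i]
        # Loss-augmented inference selects the Frank-Wolfe vertex for block i.
        y_hat = max(labels, key=lambda y: w @ joint_feature(x_i, y) + delta(y_i, y))
        w_s = 0.5 * C * (joint_feature(x_i, y_i) - joint_feature(x_i, y_hat))
        l_s = C * delta(y_i, y_hat)
        dw, dl = w_s - w_blocks[i], l_s - l_blocks[i]
        # Analytic optimal step size: maximise the one-dimensional quadratic in mu, clipped to [0, 1].
        denom = 2.0 * float(dw @ dw)
        mu = 1.0 if denom == 0 else float(np.clip((dl - 2.0 * float(dw @ w)) / denom, 0.0, 1.0))
        w_blocks[i] += mu * dw
        l_blocks[i] += mu * dl
        w += mu * dw                  # keep w consistent with the sum of the blocks
    return w
```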

110 Comparison OCR Dataset

111 Outline Structured Output Prediction Structured Output SVM
Optimization Results Exact Inference Approximate Inference Choice of Loss Function

112 Optical Character Recognition
Identify each letter in a handwritten word Taskar, Guestrin and Koller, NIPS 2003

113 Optical Character Recognition
X1 X2 X3 X4 Labels L = {a, b, …., z} Logistic Regression Multi-Class SVM Taskar, Guestrin and Koller, NIPS 2003

114 Optical Character Recognition
X1 X2 X3 X4 Labels L = {a, b, …., z} Maximum Likelihood Structured Output SVM Taskar, Guestrin and Koller, NIPS 2003

115 Optical Character Recognition
Taskar, Guestrin and Koller, NIPS 2003

116 Image Segmentation Szummer, Kohli and Hoiem, ECCV 2006

117 Image Segmentation Labels L = {0, 1}
X1 X2 X3 X4 X5 X6 X7 X8 X9 Labels L = {0, 1} Szummer, Kohli and Hoiem, ECCV 2006

118 Image Segmentation Szummer, Kohli and Hoiem, ECCV 2006

119 Outline Structured Output Prediction Structured Output SVM
Optimization Results Exact Inference Approximate Inference Choice of Loss Function

120 Scene Dataset Finley and Joachims, ICML 2008

121 Reuters Dataset Finley and Joachims, ICML 2008

122 Yeast Dataset Finley and Joachims, ICML 2008

123 Mediamill Dataset Finley and Joachims, ICML 2008

124 Outline Structured Output Prediction Structured Output SVM
Optimization Results Exact Inference Approximate Inference Choice of Loss Function

125 “Jumping” Classification

126 Standard Pipeline Collect dataset D = {(xi,yi), i = 1, …., n}
Learn your favourite classifier Classifier assigns a score to each test sample Threshold the score for classification

127 “Jumping” Ranking
Ranked list (Rank 1, Rank 2, Rank 3, Rank 4): Average Precision = 1

128 Ranking vs. Classification
Rankings with the same accuracy can differ in Average Precision (figure: Average Precision = 0.81, 1, 0.92; Accuracy = 0.67, 1)

129 Standard Pipeline Collect dataset D = {(xi,yi), i = 1, …., n}
Learn your favourite classifier Classifier assigns a score to each test sample Sort the score for ranking
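For concreteness, a small sketch (assumed implementation, not from the slides) of Average Precision for a ranked list of binary relevance labels, showing how it depends on the full ranking rather than a single threshold:

```python
import numpy as np

def average_precision(labels_sorted):
    """AP of a ranked list; labels_sorted[k] is 1 if the item at rank k+1 is
    a positive ('jumping') example and 0 otherwise."""
    labels_sorted = np.asarray(labels_sorted, dtype=float)
    hits = np.cumsum(labels_sorted)                     # positives seen up to each rank
    precision_at_k = hits / np.arange(1, len(labels_sorted) + 1)
    return float(np.sum(precision_at_k * labels_sorted) / labels_sorted.sum())

print(average_precision([1, 1, 0, 0]))  # 1.0: all positives ranked first
print(average_precision([1, 0, 1, 0]))  # (1/1 + 2/3) / 2, roughly 0.83
```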

130 Computes subgradients of the AP loss

131 Yue, Finley, Radlinski and Joachims, SIGIR 2007
Optimizing Average Precision instead of the 0-1 loss: about 4% AP improvement for free, but training is 5x slower

132 Efficient Optimization of Average Precision
C. V. Jawahar Pritish Mohapatra M. Pawan Kumar

133 Training Time
Each iteration of AP optimization is slightly slower than for the 0-1 loss, but it takes fewer iterations to converge in practice

134 Questions?

