1
Learning from Big Data Lecture 5
M. Pawan Kumar Slides available online
2
Outline Structured Output Prediction Structured Output SVM
Optimization Results
3
Image Classification Is this an urban or rural area? Input: x
Output: y ∈ {-1,+1}
4
Image Classification Is this scan healthy or unhealthy? Input: x
Output: y ∈ {-1,+1}
5
Image Classification Probabilistic graphical model: observed input x, unobserved output y (here, the label +1).
6
Feature Vector x Feature Φ(x)
7
Feature Vector A pre-trained CNN (conv1, conv2, conv3, …, fc7) maps the input x to the feature Φ(x).
8
Joint Feature Vector Input: x Output: y ∈ {-1,+1} Ψ(x,y)
9
Joint Feature Vector Input: x Output: y ∈ {-1,+1} Ψ(x,-1) = [Φ(x); 0]
10
Joint Feature Vector Input: x Output: y ∈ {-1,+1} Ψ(x,+1) = [0; Φ(x)]
11
Score Function Input: x Output: y ∈ {-1,+1} f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wTΨ(x,y)
12
Prediction Input: x Output: y ∈ {-1,+1} f(Ψ(x,y)) = wTΨ(x,y). Maximize the score over all possible outputs: y* = argmaxy f(Ψ(x,y))
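As a concrete illustration of the binary joint feature vector and the argmax prediction rule, here is a minimal sketch (Python; the feature Φ(x) and the weight vector w are assumed to be given, and all names are illustrative):

```python
import numpy as np

def joint_feature(phi_x, y):
    """Ψ(x,y): place Φ(x) in the block indexed by the label y ∈ {-1,+1}."""
    d = phi_x.shape[0]
    psi = np.zeros(2 * d)
    if y == -1:
        psi[:d] = phi_x          # Ψ(x,-1) = [Φ(x); 0]
    else:
        psi[d:] = phi_x          # Ψ(x,+1) = [0; Φ(x)]
    return psi

def predict(w, phi_x):
    """y* = argmax_y wᵀΨ(x,y), maximized over the two possible outputs."""
    scores = {y: w @ joint_feature(phi_x, y) for y in (-1, +1)}
    return max(scores, key=scores.get)
```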
13
Outline Structured Output Prediction Structured Output SVM
Binary Output Multi-label Output Structured Output Learning Structured Output SVM Optimization Results
14
Image Classification Which city is this? Input: x
Output: y ∈ {1,2,…,C}
15
Image Classification What type of tumor does this scan contain?
Input: x Output: y ∈ {1,2,…,C}
16
Image Classification Graphical model: observed input x, unobserved output y ∈ {1,2,…,C}.
17
Feature Vector A pre-trained CNN (conv1, conv2, conv3, …, fc7) maps the input x to the feature Φ(x).
18
Joint Feature Vector Input: x Output: y ∈ {1,2,…,C} Ψ(x,y)
19
Joint Feature Vector Input: x Output: y ∈ {1,2,…,C} Ψ(x,1) = [Φ(x); 0; …; 0]
20
Joint Feature Vector Input: x Output: y ∈ {1,2,…,C} Ψ(x,2) = [0; Φ(x); …; 0]
21
Joint Feature Vector Input: x Output: y ∈ {1,2,…,C} Ψ(x,C) = [0; …; 0; Φ(x)]
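The multi-class construction on the slides above places Φ(x) in the y-th of C blocks. A minimal sketch under that assumption (Python, illustrative names):

```python
import numpy as np

def joint_feature_multiclass(phi_x, y, C):
    """Ψ(x,y): copy Φ(x) into the y-th block (1-indexed) of a C-block vector."""
    d = phi_x.shape[0]
    psi = np.zeros(C * d)
    psi[(y - 1) * d : y * d] = phi_x
    return psi
```

Prediction is again the argmax of w @ joint_feature_multiclass(phi_x, y, C) over y in {1, 2, …, C}.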
22
Object Detection Where is the object in the image? Input: x
Output: y ∈ {Pixels}
23
Object Detection Where is the rupture in the scan? Input: x
Output: y ∈ {Pixels}
24
Object Detection Graphical model: observed input x, unobserved output y (the location of the object).
25
Joint Feature Vector A pre-trained CNN (conv1, conv2, conv3, …, fc7) maps the input x and a candidate output y to the joint feature Ψ(x,y).
28
Score Function Input: x Output: y ∈ {1,2,…,C} f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wTΨ(x,y)
29
Prediction Input: x Output: y ∈ {1,2,…,C} f(Ψ(x,y)) = wTΨ(x,y). Maximize the score over all possible outputs: y* = argmaxy f(Ψ(x,y))
30
Outline Structured Output Prediction Structured Output SVM
Binary Output Multi-label Output Structured Output Learning Structured Output SVM Optimization Results
31
Segmentation What is the semantic class of each pixel (e.g. car, road, grass, tree, sky)? Input: x Output: y ∈ {1,2,…,C}^m
32
Segmentation What is the muscle group of each pixel? Input: x
Output: y ∈ {1,2,…,C}^m
33
Segmentation Graphical model: a grid of observed regions x1, …, x9, each with an unobserved output y1, …, y9.
34
Feature Vector A pre-trained CNN (conv1, conv2, conv3, …, fc7) maps the region x1 to the feature Φ(x1).
35
Joint Feature Vector Input: x1 Output: y1 ∈ {1,2,…,C} Ψu(x1,1) = [Φ(x1); 0; …; 0]
36
Joint Feature Vector Input: x1 Output: y1 ∈ {1,2,…,C} Ψu(x1,2) = [0; Φ(x1); …; 0]
37
Joint Feature Vector Input: x1 Output: y1 ∈ {1,2,…,C} Ψu(x1,C) = [0; …; 0; Φ(x1)]
38
Feature Vector A pre-trained CNN (conv1, conv2, conv3, …, fc7) maps the region x2 to the feature Φ(x2).
39
Joint Feature Vector Input: x2 Output: y2 ∈ {1,2,…,C} Ψu(x2,1) = [Φ(x2); 0; …; 0]
40
Joint Feature Vector Input: x2 Output: y2 ∈ {1,2,…,C} Ψu(x2,2) = [0; Φ(x2); …; 0]
41
Joint Feature Vector Input: x2 Output: y2 ∈ {1,2,…,C} Ψu(x2,C) = [0; …; 0; Φ(x2)]
42
Overall Joint Feature Vector
Input: x Output: y ∈ {1,2,…,C}^m Ψu(x,y) = [Ψu(x1,y1); Ψu(x2,y2); …; Ψu(xm,ym)]
43
Score Function Input: x Output: y ∈ {1,2,…,C}^m f: Ψu(x,y) → (-∞,+∞), f(Ψu(x,y)) = wTΨu(x,y)
44
Prediction Input: x Output: y ∈ {1,2,…,C}^m f(Ψu(x,y)) = wTΨu(x,y) y* = argmaxy f(Ψu(x,y))
45
Prediction Input: x Output: y ∈ {1,2,…,C}^m f(Ψu(x,y)) = wTΨu(x,y) y* = argmaxy wTΨu(x,y)
46
Prediction Input: x Output: y ∈ {1,2,…,C}^m y* = argmaxy wTΨu(x,y) = argmaxy ∑a (wa)TΨu(xa,ya), which can be maximized for each a ∈ {1,2,…,m} independently.
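Because the unary score decomposes over the regions a, the argmax can be computed one region at a time. A minimal sketch of this independent maximization (Python; the per-class weight blocks wa are assumed to be stacked as the rows of a matrix W, an illustrative arrangement):

```python
import numpy as np

def predict_unary(W, Phi):
    """Independent per-region prediction.

    W:   (C, d) matrix whose rows are the per-class weight blocks wa.
    Phi: (m, d) matrix whose rows are the region features Φ(x_a).
    Returns y* with y*_a = argmax_c w_c · Φ(x_a) for each region a.
    """
    scores = Phi @ W.T                  # (m, C): score of every class at every region
    return scores.argmax(axis=1) + 1    # labels in {1, …, C}
```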
47
Segmentation Graphical model: a grid of observed regions x1, …, x9, each with an unobserved output y1, …, y9.
48
Unary Joint Feature Vector
Input: x Output: y ∈ {1,2,…,C}^m Ψu(x,y) = [Ψu(x1,y1); Ψu(x2,y2); …; Ψu(xm,ym)]
49
Pairwise Joint Feature Vector
[Grid graphical model over regions x1, …, x9 and outputs y1, …, y9, with edges between neighbouring outputs.]
50
Pairwise Joint Feature Vector
For the edge between regions 1 and 2: Ψp(x12,y12) = δ(y1=y2)
51
Pairwise Joint Feature Vector
For the edge between regions 2 and 3: Ψp(x23,y23) = δ(y2=y3)
52
Pairwise Joint Feature Vector
Input: x Output: y ∈ {1,2,…,C}^m Ψp(x,y) = [Ψp(x12,y12); Ψp(x23,y23); …]
53
Overall Joint Feature Vector
Input: x Output: y ∈ {1,2,…,C}^m Ψ(x,y) = [Ψu(x,y); Ψp(x,y)]
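To make the unary-plus-pairwise construction concrete, here is a small sketch assuming the Potts-style edge feature δ(yi=yj) from the previous slides; the edge list, array shapes, and function names are illustrative, not from the slides:

```python
import numpy as np

def pairwise_feature(y, edges):
    """Ψp(x,y): one entry δ(y_i = y_j) per edge (i, j) of the grid."""
    return np.array([float(y[i] == y[j]) for (i, j) in edges])

def overall_feature(psi_u, y, edges):
    """Ψ(x,y) = [Ψu(x,y); Ψp(x,y)]: unary block stacked on pairwise block.

    psi_u is the stacked unary feature [Ψu(x1,y1); …; Ψu(xm,ym)].
    """
    return np.concatenate([psi_u, pairwise_feature(y, edges)])
```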
54
Score Function Input: x Output: y ∈ {1,2,…,C}^m f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wTΨ(x,y)
55
Prediction Input: x Output: y ∈ {1,2,…,C}^m f(Ψ(x,y)) = wTΨ(x,y) y* = argmaxy f(Ψ(x,y))
56
Prediction Input: x Output: y ∈ {1,2,…,C}^m f(Ψ(x,y)) = wTΨ(x,y) y* = argmaxy wTΨ(x,y)
57
Prediction Input: x Output: y ∈ {1,2,…,C}^m y* = argmaxy wTΨ(x,y) = argmaxy ∑a (wa)TΨu(xa,ya) + ∑a,b (wab)TΨp(xab,yab) (see the Week 5 “Optimization” lectures).
58
Summary Input x, outputs {y1,y2,…}: extract features Ψ(x,yi), compute scores f(Ψ(x,yi)), and predict y(f) = argmaxyi f(Ψ(x,yi)). Open question: how do I fix “f”?
59
Outline Structured Output Prediction Structured Output SVM
Binary Output Multi-label Output Structured Output Learning Structured Output SVM Optimization Results
60
Learning Objective Data distribution P(x,y); the distribution is unknown. Measure of prediction quality: f* = argminf EP(x,y)[Error(y(f),y)], the expectation over the data distribution of the error between the prediction y(f) and the ground truth y.
61
Learning Objective Training data {(xi,yi), i = 1,2,…,n}. Measure of prediction quality: f* = argminf EP(x,y)[Error(y(f),y)], an expectation over the (unknown) data distribution.
62
Learning Objective Training data {(xi,yi), i = 1,2,…,n} (finite samples). Measure of prediction quality: f* = argminf Σi Error(yi(f),yi), the expectation over the empirical distribution.
63
Learning Objective Training data {(xi,yi), i = 1,2,…,n} (finite samples). f* = argminf Σi Error(yi(f),yi) + λ R(f), where R(f) is a regularizer and λ its relative weight (a hyperparameter).
64
Learning Objective Training data {(xi,yi), i = 1,2,…,n} (finite samples). f* = argminf Σi Error(yi(f),yi) + λ R(f). In a probabilistic model, the error can be the negative log-likelihood.
65
Outline Structured Output Prediction Structured Output SVM
Optimization Results Taskar et al. NIPS 2003; Tsochantaridis et al. ICML 2004
66
Score Function and Prediction
Input: x Output: y Joint feature vector of input and output: Ψ(x,y) f(Ψ(x,y)) = wTΨ(x,y) Prediction: maxy wTΨ(x,y) Predicted Output: y(w) = argmaxy wTΨ(x,y)
67
Error Function Loss or risk of prediction given ground-truth Δ(y,y(w))
User specified. Classification loss: Δ(y,y(w)) = δ(y ≠ y(w)), e.g. ground truth “New York”, prediction “Paris”, loss 1.
68
Error Function Loss or risk of prediction given ground-truth Δ(y,y(w))
User specified. Detection loss: based on the overlap score = area of intersection / area of union of the predicted and ground-truth regions.
69
Error Function Loss or risk of prediction given ground-truth Δ(y,y(w))
User specified. Segmentation loss: fraction of incorrect pixels (micro-average or macro-average over classes such as car, road, grass, tree, sky).
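For concreteness, the three user-specified losses just described could look as follows (a Python sketch; the bounding-box and segmentation representations are assumptions for illustration):

```python
import numpy as np

def classification_loss(y_true, y_pred):
    """0-1 loss: Δ(y, y(w)) = δ(y ≠ y(w))."""
    return float(y_true != y_pred)

def detection_loss(box_true, box_pred):
    """1 - overlap, with overlap = area of intersection / area of union.
    Boxes are (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_true[0], box_pred[0]), max(box_true[1], box_pred[1])
    ix2, iy2 = min(box_true[2], box_pred[2]), min(box_true[3], box_pred[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_true) + area(box_pred) - inter
    return 1.0 - (inter / union if union > 0 else 0.0)

def segmentation_loss(y_true, y_pred):
    """Micro-averaged fraction of incorrect pixels."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))
```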
70
Learning Objective Training data {(xi,yi), i = 1,2,…,n}
Loss function for i-th sample Δ(yi,yi(w)) Minimize the regularized sum of loss over training data Highly non-convex in w Regularization plays no role (overfitting may occur)
71
Learning Objective Training data {(xi,yi), i = 1,2,…,n}
Δ(yi,yi(w)) = wTΨ(xi,yi(w)) + Δ(yi,yi(w)) - wTΨ(xi,yi(w)) ≤ wTΨ(xi,yi(w)) + Δ(yi,yi(w)) - wTΨ(xi,yi) (since yi(w) maximizes the score, wTΨ(xi,yi) ≤ wTΨ(xi,yi(w))) ≤ maxy { wTΨ(xi,y) + Δ(yi,y) } - wTΨ(xi,yi). This upper bound is convex in w and sensitive to the regularization of w.
72
Learning Objective Training data {(xi,yi), i = 1,2,…,n}
minw ||w||2 + C Σi ξi, s.t. wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi) ≤ ξi for all i and all y. A quadratic program with a large number of constraints; many polynomial-time algorithms exist.
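Equivalently, at the optimum each slack ξi equals the structured hinge term maxy{wTΨ(xi,y) + Δ(yi,y)} - wTΨ(xi,yi), so the objective can be evaluated directly. A minimal sketch, assuming an output space small enough to enumerate (Python; names are illustrative):

```python
import numpy as np

def structured_hinge(w, x_i, y_i, outputs, joint_feature, loss):
    """max_y { wᵀΨ(x_i,y) + Δ(y_i,y) } - wᵀΨ(x_i,y_i): the slack ξ_i at optimality."""
    aug = [w @ joint_feature(x_i, y) + loss(y_i, y) for y in outputs]
    return max(aug) - w @ joint_feature(x_i, y_i)

def ssvm_objective(w, data, outputs, joint_feature, loss, C):
    """||w||² + C Σ_i max_y { wᵀΨ(x_i,y) + Δ(y_i,y) - wᵀΨ(x_i,y_i) }."""
    return w @ w + C * sum(structured_hinge(w, x, y, outputs, joint_feature, loss)
                           for (x, y) in data)
```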
73
Outline Structured Output Prediction Structured Output SVM
Optimization Stochastic subgradient descent Conditional gradient aka Frank-Wolfe Results Shalev-Shwartz et al. Mathematical Programming 2011
74
Gradient Convex function g(z); the gradient s at a point z0 satisfies g(z) – g(z0) ≥ sT(z - z0). Example: g(z) = z², gradient at z0 is 2z0.
75
Gradient Descent minz g(z) Start at some point z0
Move along the negative gradient direction: zt+1 ← zt – λt g’(zt). Estimate the step-size via line search. Example: g(z) = z².
76
Gradient Convex function g(z); a gradient satisfying g(z) – g(z0) ≥ sT(z - z0) may not exist at a point z0. Example: g(z) = |z| at z0 = 0.
77
Subgradient Convex function g(z); a subgradient s at a point z0 satisfies g(z) – g(z0) ≥ sT(z - z0). It may not be unique. Example: g(z) = |z| at z0 = 0.
78
Subgradient Descent minz g(z) Start at some point z0
Move along the negative subgradient direction: zt+1 ← zt – λt g’(zt), where g’(zt) denotes a subgradient. Estimate the step-size via line search. Example: g(z) = |z|. Doesn’t always work.
79
Subgradient Descent Example: minz g(z) with g(z) = max{z2 + 2z1, z2 - 2z1}. [Contour plot over (z1, z2) with level sets g(z) = 3, 4, 5.] At the point (z1, z2) = (0, 5), g(z) = 5 and s = (2, 1) is a valid subgradient; a step of size λ along -s reaches (-2λ, 5 - λ), where g(z) = 5 + 3λ. The objective increases for every λ > 0, so line search cannot fix the step-size.
81
Subgradient Descent minz g(z) Start at some point z0
Move along the negative subgradient direction: zt+1 ← zt – λt g’(zt). Convergence is guaranteed for diminishing step-sizes: limt→∞ λt = 0 and limT→∞ ∑t=1..T λt = ∞. Example: g(z) = |z|.
82
Learning Objective Training data {(xi,yi), i = 1,2,…,n}
minw ||w||2 + C Σi ξi, s.t. wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi) ≤ ξi for all i and all y. How do we handle this constrained problem?
83
Learning Objective Training data {(xi,yi), i = 1,2,…,n} minw ||w||2 + C Σi maxy{wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi)}. What is a subgradient of this objective? Recall: g(z) – g(z0) ≥ sT(z - z0).
84
Subgradient of C Σi maxy{wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi)}: it has the form C Σi (Ψ(xi,y) - Ψ(xi,yi)), but for which y?
85
Subgradient Let ŷ = argmaxy{wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi)}. Then C Σi (Ψ(xi,ŷ) - Ψ(xi,yi)) is a subgradient of C Σi maxy{wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi)}. Proof?
86
Subgradient ŷ = argmaxy{wTΨ(xi,y) + Δ(yi,y)} (inference; dropping the constant wTΨ(xi,yi) does not change the argmax). A subgradient of C Σi maxy{wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi)} is C Σi (Ψ(xi,ŷ) - Ψ(xi,yi)).
87
Inference ŷ = argmaxy{wTΨ(xi,y) + Δ(yi,y)} Classification inference
Output: y ∈ {1,2,…,C} Brute-force search
88
Inference ŷ = argmaxy{wTΨ(xi,y) + Δ(yi,y)} Detection inference
Output: y ∈ {1,2,…,C} Brute-force search
89
Inference ŷ = argmaxy{wTΨ(xi,y) + Δ(yi,y)} Segmentation inference: maxy ∑a (wa)TΨu(xia,ya) + ∑a,b (wab)TΨp(xiab,yab) + ∑a Δ(yia,ya), see the Week 5 “Optimization” lectures.
90
Subgradient Descent
Start at some parameter w0
For t = 0 to T  // number of iterations
  s = 2wt
  For i = 1 to n  // number of samples
    ŷ = argmaxy{wtTΨ(xi,y) + Δ(yi,y)}
    s = s + C(Ψ(xi,ŷ) - Ψ(xi,yi))
  End
  wt+1 = wt - λt s, with λt = 1/(t+1)
End
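A runnable sketch of this loop (Python with NumPy; joint_feature, loss and the candidate output set are assumed to be user-supplied, brute-force loss-augmented inference stands in for the problem-specific solvers of the inference slides, and the update moves along the negative subgradient, w ← w - λt s):

```python
import numpy as np

def subgradient_descent(data, outputs, joint_feature, loss, C, dim, T=100):
    """Batch subgradient descent on ||w||² + C Σ_i max_y {wᵀΨ(x_i,y) + Δ(y_i,y) - wᵀΨ(x_i,y_i)}."""
    w = np.zeros(dim)
    for t in range(T):
        s = 2 * w                                   # subgradient of ||w||²
        for x_i, y_i in data:
            # loss-augmented inference (brute force over candidate outputs)
            y_hat = max(outputs, key=lambda y: w @ joint_feature(x_i, y) + loss(y_i, y))
            s += C * (joint_feature(x_i, y_hat) - joint_feature(x_i, y_i))
        w -= (1.0 / (t + 1)) * s                    # diminishing step-size λ_t = 1/(t+1)
    return w
```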
92
Learning Objective Training data {(xi,yi), i = 1,2,…,n} minw ||w||2 + C Σi maxy{wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi)}
93
Stochastic Approximation
Training data {(xi,yi), i = 1,2,…,n} minw ||w||2 + C Σi maxy{wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi)} Choose a sample ‘i’ with probability 1/n
94
Stochastic Approximation
Training data {(xi,yi), i = 1,2,…,n} minw ||w||2 + C·n·maxy{wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi)}. Choose a sample ‘i’ with probability 1/n. What is the expected value? The original objective function.
95
Stochastic Subgradient Descent
Start at some parameter w0
For t = 0 to T  // number of iterations
  s = 2wt
  Choose a sample ‘i’ with probability 1/n
  ŷ = argmaxy{wtTΨ(xi,y) + Δ(yi,y)}
  s = s + C·n·(Ψ(xi,ŷ) - Ψ(xi,yi))
  wt+1 = wt - λt s, with λt = 1/(t+1)
End
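The stochastic variant changes only the inner loop of the previous sketch: one sample is drawn per iteration and its term is scaled by C·n (same assumptions as before):

```python
import numpy as np

def stochastic_subgradient_descent(data, outputs, joint_feature, loss, C, dim, T=1000, seed=0):
    """One randomly chosen sample per iteration; its subgradient term is scaled by C·n."""
    rng = np.random.default_rng(seed)
    n, w = len(data), np.zeros(dim)
    for t in range(T):
        x_i, y_i = data[rng.integers(n)]            # sample 'i' with probability 1/n
        y_hat = max(outputs, key=lambda y: w @ joint_feature(x_i, y) + loss(y_i, y))
        s = 2 * w + C * n * (joint_feature(x_i, y_hat) - joint_feature(x_i, y_i))
        w -= (1.0 / (t + 1)) * s                    # λ_t = 1/(t+1)
    return w
```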
96
Convergence Rate Compute an ε-optimal solution C: SSVM hyperparameter
d: Number of non-zeros in the feature vector O(dC/ε) iterations Each iteration requires solving an inference problem
97
Side Note: Structured Output CNN
An SSVM layer on top of a CNN (conv1, …, conv5, fc6, fc7): back-propagate the subgradients through the network.
98
Outline Structured Output Prediction Structured Output SVM
Optimization Stochastic subgradient descent Conditional gradient aka Frank-Wolfe Results Lacoste-Julien et al. ICML 2013
99
Conditional Gradient Slide courtesy Martin Jaggi
103
SSVM Primal minw ||w||2 + C Σi ξi, s.t. wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi) ≤ ξi for all i and all y. Derive the dual on the board.
104
SSVM Dual maxα -||Mα||2/4 + bTα, s.t. ∑y αi(y) = C for all i, αi(y) ≥ 0 for all i, y. Recover the primal solution as w = Mα/2; b collects the loss terms, bT = [Δ(yi,y)].
105
Linear Program maxα (Mα)Twt + bTα, s.t. ∑y αi(y) = C for all i, αi(y) ≥ 0 for all i, y. Solving this over all of α gives standard Frank-Wolfe; solving it only over the block αi of a single sample ‘i’ gives Block-Coordinate Frank-Wolfe.
106
Linear Program maxα (Mα)Twt + bTα, s.t. ∑y αi(y) = C for all i, αi(y) ≥ 0 for all i, y. The vertices of the feasible set have the form αi(y) = C if y = ŷ, and 0 otherwise.
107
Solution maxα (Mα)Twt + bTα, s.t. ∑y αi(y) = C for all i, αi(y) ≥ 0 for all i, y. Which vertex maximizes the linear function? The one given by inference: ŷ = argmaxy{wtTΨ(xi,y) + Δ(yi,y)}, i.e. si(y) = C if y = ŷ, and 0 otherwise.
108
Update αt+1 = (1 - μ) αt + μ s. Standard Frank-Wolfe: s contains the LP solution for all the samples. Block-Coordinate Frank-Wolfe: s contains the LP solution for sample ‘i’, and sj = αtj for all other samples j.
109
Step-Size αt+1 = (1 - μ) αt + μ s. The dual objective restricted to this line segment is a quadratic function of the single variable μ, so the optimal step-size can be computed analytically.
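Putting the pieces together, the following sketch follows the standard Block-Coordinate Frank-Wolfe algorithm for the SSVM dual (Lacoste-Julien et al., ICML 2013), which tracks only primal quantities and uses the analytic step-size above. It is written in the common λ/2·||w||² + (1/n)·Σi parametrization rather than the C-parametrization of these slides, and all names and the brute-force inference are illustrative assumptions:

```python
import numpy as np

def bcfw_ssvm(data, outputs, joint_feature, loss, dim, lam=0.01, K=1000, seed=0):
    """Block-Coordinate Frank-Wolfe for the SSVM dual, tracking primal quantities.

    Objective: min_w  lam/2 ||w||² + (1/n) Σ_i max_y { Δ(y_i,y) - w·(Ψ(x_i,y_i) - Ψ(x_i,y)) }.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    w = np.zeros(dim)
    w_i = np.zeros((n, dim))   # per-block primal contributions
    l_i = np.zeros(n)
    l = 0.0
    for k in range(K):
        i = rng.integers(n)
        x, y_true = data[i]
        # loss-augmented inference = solving the block linear program over α_i
        y_hat = max(outputs, key=lambda y: w @ joint_feature(x, y) + loss(y_true, y))
        psi = joint_feature(x, y_true) - joint_feature(x, y_hat)
        w_s = psi / (lam * n)
        l_s = loss(y_true, y_hat) / n
        # analytic optimal step-size, clipped to [0, 1]
        gap_dir = w_i[i] - w_s
        denom = lam * (gap_dir @ gap_dir)
        gamma = 1.0 if denom == 0 else np.clip((lam * gap_dir @ w - l_i[i] + l_s) / denom, 0.0, 1.0)
        # block update, then propagate the change to the global iterate
        w_i_new = (1 - gamma) * w_i[i] + gamma * w_s
        l_i_new = (1 - gamma) * l_i[i] + gamma * l_s
        w += w_i_new - w_i[i]
        l += l_i_new - l_i[i]
        w_i[i], l_i[i] = w_i_new, l_i_new
    return w
```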
110
Comparison OCR Dataset
111
Outline Structured Output Prediction Structured Output SVM
Optimization Results Exact Inference Approximate Inference Choice of Loss Function
112
Optical Character Recognition
Identify each letter in a handwritten word Taskar, Guestrin and Koller, NIPS 2003
113
Optical Character Recognition
X1 X2 X3 X4 Labels L = {a, b, …, z} Logistic Regression vs. Multi-Class SVM Taskar, Guestrin and Koller, NIPS 2003
114
Optical Character Recognition
X1 X2 X3 X4 Labels L = {a, b, …, z} Maximum Likelihood vs. Structured Output SVM Taskar, Guestrin and Koller, NIPS 2003
115
Optical Character Recognition
Taskar, Guestrin and Koller, NIPS 2003
116
Image Segmentation Szummer, Kohli and Hoiem, ECCV 2008
117
Image Segmentation X1 X2 X3 X4 X5 X6 X7 X8 X9 Labels L = {0, 1} Szummer, Kohli and Hoiem, ECCV 2008
118
Image Segmentation Szummer, Kohli and Hoiem, ECCV 2008
119
Outline Structured Output Prediction Structured Output SVM
Optimization Results Exact Inference Approximate Inference Choice of Loss Function
120
Scene Dataset Finley and Joachims, ICML 2008
121
Reuters Dataset Finley and Joachims, ICML 2008
122
Yeast Dataset Finley and Joachims, ICML 2008
123
Mediamill Dataset Finley and Joachims, ICML 2008
124
Outline Structured Output Prediction Structured Output SVM
Optimization Results Exact Inference Approximate Inference Choice of Loss Function
125
“Jumping” Classification
126
Standard Pipeline Collect dataset D = {(xi,yi), i = 1, …, n}
Learn your favourite classifier Classifier assigns a score to each test sample Threshold the score for classification
127
“Jumping” Ranking [Example ranking of images at ranks 1, 2, 3, 4.] Average Precision = 1
128
Ranking vs. Classification [Figure comparing example rankings: Average Precision = 0.81, 1, 0.92; Accuracy = 0.67, 1.]
129
Standard Pipeline Collect dataset D = {(xi,yi), i = 1, …, n}
Learn your favourite classifier Classifier assigns a score to each test sample Sort the score for ranking
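To make the ranking measure concrete, here is a small sketch of computing average precision from the classifier scores (Python; this is a standard AP definition, which may differ in minor details from the one used in the cited papers):

```python
import numpy as np

def average_precision(scores, labels):
    """AP of a ranking: mean of precision@k over the positions k of the positive samples."""
    order = np.argsort(-np.asarray(scores))          # sort by decreasing score
    ranked = np.asarray(labels)[order]               # 1 for positive, 0 for negative
    hits = np.cumsum(ranked)
    precisions = hits / (np.arange(len(ranked)) + 1) # precision@k for every k
    return float(precisions[ranked == 1].mean())

# Example: a perfect ranking has AP = 1
print(average_precision([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # -> 1.0
```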
130
Computes subgradients of the AP loss
131
Yue, Finley, Radlinski and Joachims, SIGIR 2007 [Bar charts comparing 0-1 loss and AP loss training: Average Precision improves by about 4% “for free”, while training time is 5x slower.]
132
Efficient Optimization of Average Precision
C. V. Jawahar Pritish Mohapatra M. Pawan Kumar
133
Training Time [Bar chart comparing 0-1 and AP training times: 5x slower before, slightly faster now.] Each iteration of AP optimization is slightly slower, but it takes fewer iterations to converge in practice.
134
Questions?