Learning from Big Data Lecture 5


1 Learning from Big Data Lecture 5
M. Pawan Kumar Slides available online

2 Outline Structured Output Prediction Structured Output SVM
Optimization Results

3 Image Classification Is this an urban or rural area? Input: x
Output: y ∈ {-1,+1}

4 Image Classification Is this scan healthy or unhealthy? Input: x
Output: y ∈ {-1,+1}

5 Image Classification Probabilistic Graphical Model
Unobserved output y ∈ {-1,+1}, observed input x

6 Feature Vector x Feature Φ(x)

7 Feature Vector Pre-Trained CNN: x → conv1 → conv2 → conv3 → … → fc7 → Feature Φ(x)

8 Joint Feature Vector Input: x Output: y ∈ {-1,+1} Ψ(x,y)

9 Joint Feature Vector Input: x Output: y ∈ {-1,+1} Ψ(x,-1) = [Φ(x); 0]

10 Joint Feature Vector Input: x Output: y ∈ {-1,+1} Ψ(x,+1) = [0; Φ(x)]

11 Score Function
Input: x Output: y ∈ {-1,+1} f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wTΨ(x,y)

12 Prediction
Input: x Output: y ∈ {-1,+1} f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wTΨ(x,y) Maximize the score over all possible outputs: y* = argmaxy f(Ψ(x,y))
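A minimal sketch (not from the slides) of the binary joint feature map and prediction; Φ(x) is assumed to be a fixed feature vector, e.g. the fc7 activations of the pre-trained CNN:

```python
import numpy as np

def joint_feature(phi_x, y):
    """Psi(x, y): stack Phi(x) into the slot selected by y in {-1, +1}."""
    d = phi_x.shape[0]
    psi = np.zeros(2 * d)
    if y == -1:
        psi[:d] = phi_x      # Psi(x, -1) = [Phi(x); 0]
    else:
        psi[d:] = phi_x      # Psi(x, +1) = [0; Phi(x)]
    return psi

def predict(w, phi_x):
    """y* = argmax_y w^T Psi(x, y), maximised over the two possible outputs."""
    return max((-1, +1), key=lambda y: w @ joint_feature(phi_x, y))
```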

13 Outline
Structured Output Prediction (Binary Output, Multi-label Output, Structured Output Learning), Structured Output SVM, Optimization, Results

14 Image Classification Which city is this? Input: x
Output: y ∈ {1,2,…,C}

15 Image Classification What type of tumor does this scan contain?
Input: x Output: y ∈ {1,2,…,C}

16 Image Classification Graphical Model
Unobserved output y ∈ {1,2,…,C}, observed input x

17 Feature Vector Pre-Trained CNN: x → conv1 → conv2 → conv3 → … → fc7 → Feature Φ(x)

18 Joint Feature Vector Input: x Output: y ∈ {1,2,…,C} Ψ(x,y)

19 Joint Feature Vector Input: x Output: y ∈ {1,2,…,C} Ψ(x,1) = [Φ(x); 0; …; 0]

20 Joint Feature Vector Input: x Output: y ∈ {1,2,…,C} Ψ(x,2) = [0; Φ(x); 0; …; 0]

21 Joint Feature Vector Input: x Output: y ∈ {1,2,…,C} Ψ(x,C) = [0; …; 0; Φ(x)]

22 Object Detection Where is the object in the image? Input: x
Output: y ∈ {Pixels}

23 Object Detection Where is the rupture in the scan? Input: x
Output: y ∈ {Pixels}

24 Object Detection Graphical Model
Unobserved output y, observed input x

25 Joint Feature Vector Pre-Trained CNN: (x, y) → conv1 → conv2 → conv3 → … → fc7 → Ψ(x,y)

26 Joint Feature Vector Pre-Trained CNN: (x, y) → conv1 → conv2 → conv3 → … → fc7 → Ψ(x,y)

27 Joint Feature Vector Pre-Trained CNN: (x, y) → conv1 → conv2 → conv3 → … → fc7 → Ψ(x,y)

28 Score Function
Input: x Output: y ∈ {1,2,…,C} f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wTΨ(x,y)

29 Prediction
Input: x Output: y ∈ {1,2,…,C} f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wTΨ(x,y) Maximize the score over all possible outputs: y* = argmaxy f(Ψ(x,y))
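A minimal sketch (not from the slides) of the multi-class joint feature map and brute-force prediction over the C labels; Φ(x) is again assumed to be a fixed feature vector:

```python
import numpy as np

def joint_feature(phi_x, y, C):
    """Psi(x, y): place Phi(x) in the y-th of C blocks, zeros elsewhere."""
    d = phi_x.shape[0]
    psi = np.zeros(C * d)
    psi[(y - 1) * d:y * d] = phi_x   # labels are 1, 2, ..., C
    return psi

def predict(w, phi_x, C):
    """y* = argmax_y w^T Psi(x, y); brute force over the C labels."""
    return max(range(1, C + 1), key=lambda y: w @ joint_feature(phi_x, y, C))
```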

30 Outline
Structured Output Prediction (Binary Output, Multi-label Output, Structured Output Learning), Structured Output SVM, Optimization, Results

31 Segmentation
What is the semantic class of each pixel (car, road, grass, tree, sky)? Input: x Output: y ∈ {1,2,…,C}^m

32 Segmentation What is the muscle group of each pixel? Input: x
Output: y ∈ {1,2,…,C}^m

33 Segmentation Graphical Model: grid of observed inputs x1,…,x9 with unobserved outputs y1,…,y9

34 Feature Vector Pre-Trained CNN: x1 → conv1 → conv2 → conv3 → … → fc7 → Feature Φ(x1)

35 Joint Feature Vector
Input: x1 Output: y1 ∈ {1,2,…,C} Ψu(x1,1) = [Φ(x1); 0; …; 0]

36 Joint Feature Vector
Input: x1 Output: y1 ∈ {1,2,…,C} Ψu(x1,2) = [0; Φ(x1); 0; …; 0]

37 Joint Feature Vector
Input: x1 Output: y1 ∈ {1,2,…,C} Ψu(x1,C) = [0; …; 0; Φ(x1)]

38 Feature Vector Pre-Trained CNN: x2 → conv1 → conv2 → conv3 → … → fc7 → Feature Φ(x2)

39 Joint Feature Vector
Input: x2 Output: y2 ∈ {1,2,…,C} Ψu(x2,1) = [Φ(x2); 0; …; 0]

40 Joint Feature Vector
Input: x2 Output: y2 ∈ {1,2,…,C} Ψu(x2,2) = [0; Φ(x2); 0; …; 0]

41 Joint Feature Vector
Input: x2 Output: y2 ∈ {1,2,…,C} Ψu(x2,C) = [0; …; 0; Φ(x2)]

42 Overall Joint Feature Vector
Input: x Output: y ∈ {1,2,…,C}^m Ψu(x,y) = [Ψu(x1,y1); Ψu(x2,y2); …; Ψu(xm,ym)]

43 Score Function
Input: x Output: y ∈ {1,2,…,C}^m f: Ψu(x,y) → (-∞,+∞), f(Ψu(x,y)) = wTΨu(x,y)

44 Prediction
Input: x Output: y ∈ {1,2,…,C}^m f: Ψu(x,y) → (-∞,+∞), f(Ψu(x,y)) = wTΨu(x,y) y* = argmaxy f(Ψu(x,y))

45 Prediction
Input: x Output: y ∈ {1,2,…,C}^m f: Ψu(x,y) → (-∞,+∞), f(Ψu(x,y)) = wTΨu(x,y) y* = argmaxy wTΨu(x,y)

46 Prediction
Input: x Output: y ∈ {1,2,…,C}^m y* = argmaxy wTΨu(x,y) = argmaxy ∑a (wa)TΨu(xa,ya) Maximize for each a ∈ {1,2,…,m} independently
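With only unary terms the argmax decomposes over pixels, so prediction is m independent maximisations. A minimal sketch (not from the slides), assuming w is split into per-label blocks w1,…,wC stacked as the rows of a matrix W:

```python
import numpy as np

def predict_unary(W, Phi):
    """W: (C, d) matrix whose rows are the per-label weights w_1..w_C.
    Phi: (m, d) matrix of per-pixel features Phi(x_a).
    Returns, for each pixel a, the label in 1..C maximising (w_y)^T Phi(x_a)."""
    scores = Phi @ W.T            # (m, C): score of every label at every pixel
    return scores.argmax(axis=1) + 1
```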

47 Segmentation Graphical Model: grid of observed inputs x1,…,x9 with unobserved outputs y1,…,y9

48 Unary Joint Feature Vector
Input: x Output: y ∈ {1,2,…,C}^m Ψu(x,y) = [Ψu(x1,y1); Ψu(x2,y2); …; Ψu(xm,ym)]

49 Pairwise Joint Feature Vector
Grid of observed inputs x1,…,x9 with unobserved outputs y1,…,y9; pairwise terms connect neighbouring outputs

50 Pairwise Joint Feature Vector
For the neighbouring pair (y1, y2): Ψp(x12,y12) = δ(y1=y2)

51 Pairwise Joint Feature Vector
For the neighbouring pair (y2, y3): Ψp(x23,y23) = δ(y2=y3)

52 Pairwise Joint Feature Vector
Input: x Output: y ∈ {1,2,…,C}^m Ψp(x,y) = [Ψp(x12,y12); Ψp(x23,y23); …]

53 Overall Joint Feature Vector
Input: x Output: y ∈ {1,2,…,C}^m Ψ(x,y) = [Ψu(x,y); Ψp(x,y)]
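A sketch (assumed layout, not from the slides) of assembling the overall joint feature vector from the per-pixel unary blocks and the pairwise indicators δ(ya = yb) over neighbouring pairs:

```python
import numpy as np

def unary_feature(phi_a, y_a, C):
    """Psi_u(x_a, y_a): Phi(x_a) placed in the y_a-th of C blocks."""
    d = phi_a.shape[0]
    psi = np.zeros(C * d)
    psi[(y_a - 1) * d:y_a * d] = phi_a
    return psi

def joint_feature(Phi, y, edges, C):
    """Phi: (m, d) per-pixel features, y: length-m labelling in {1..C},
    edges: list of neighbouring pairs (a, b). Psi(x,y) = [Psi_u(x,y); Psi_p(x,y)]."""
    psi_u = np.concatenate([unary_feature(Phi[a], y[a], C) for a in range(len(y))])
    psi_p = np.array([float(y[a] == y[b]) for (a, b) in edges])  # delta(y_a = y_b)
    return np.concatenate([psi_u, psi_p])
```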

54 Score Function
Input: x Output: y ∈ {1,2,…,C}^m f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wTΨ(x,y)

55 Prediction
Input: x Output: y ∈ {1,2,…,C}^m f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wTΨ(x,y) y* = argmaxy f(Ψ(x,y))

56 Prediction
Input: x Output: y ∈ {1,2,…,C}^m f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wTΨ(x,y) y* = argmaxy wTΨ(x,y)

57 Prediction
Input: x Output: y ∈ {1,2,…,C}^m y* = argmaxy wTΨ(x,y) = argmaxy ∑a (wa)TΨu(xa,ya) + ∑a,b (wab)TΨp(xab,yab) Week 5 “Optimization” lectures

58 Summary
Input x and outputs {y1,y2,…} → Extract features Ψ(x,yi) → Compute scores f(Ψ(x,yi)) → Prediction y(f) = argmaxyi f(Ψ(x,yi)). How do I fix “f”?

59 Outline
Structured Output Prediction (Binary Output, Multi-label Output, Structured Output Learning), Structured Output SVM, Optimization, Results

60 Learning Objective
Data distribution P(x,y) (the distribution is unknown). Measure of prediction quality: f* = argminf EP(x,y) Error(y(f),y), the expectation over the data distribution of the error between prediction y(f) and ground truth y

61 Learning Objective
Training data {(xi,yi), i = 1,2,…,n}. Measure of prediction quality: f* = argminf EP(x,y) Error(y(f),y), the expectation over the data distribution of the error between prediction and ground truth

62 Learning Objective
Training data {(xi,yi), i = 1,2,…,n}: finite samples. f* = argminf Σi Error(yi(f),yi), the expectation over the empirical distribution

63 Learning Objective
Training data {(xi,yi), i = 1,2,…,n}: finite samples. f* = argminf Σi Error(yi(f),yi) + λ R(f), where R(f) is a regularizer and λ its relative weight (a hyperparameter)

64 Learning Objective
Training data {(xi,yi), i = 1,2,…,n}: finite samples. f* = argminf Σi Error(yi(f),yi) + λ R(f). For a probabilistic model, Error can be the negative log-likelihood

65 Outline Structured Output Prediction Structured Output SVM
Optimization Results Taskar et al. NIPS 2003; Tsochantaridis et al. ICML 2004

66 Score Function and Prediction
Input: x Output: y Joint feature vector of input and output: Ψ(x,y) f(Ψ(x,y)) = wTΨ(x,y) Prediction: maxy wTΨ(x,y) Predicted Output: y(w) = argmaxy wTΨ(x,y)

67 Error Function
Loss or risk of prediction given ground-truth: Δ(y,y(w)), user specified. Classification loss (e.g. “New York” vs “Paris”): Δ(y,y(w)) = δ(y ≠ y(w))

68 Error Function
Loss or risk of prediction given ground-truth: Δ(y,y(w)), user specified. Detection loss: based on the overlap score = area of intersection / area of union

69 Error Function
Loss or risk of prediction given ground-truth: Δ(y,y(w)), user specified. Segmentation loss: fraction of incorrect pixels (micro-average or macro-average)
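Illustrative sketches (assumed implementations, not from the slides) of the three user-specified losses Δ on the preceding slides:

```python
import numpy as np

def classification_loss(y_true, y_pred):
    """0-1 loss: 1 if the predicted label differs from the ground truth."""
    return float(y_true != y_pred)

def detection_loss(box_true, box_pred):
    """1 - overlap score, where overlap = area of intersection / area of union.
    Boxes are (x1, y1, x2, y2); degenerate boxes are not handled in this sketch."""
    ix1, iy1 = max(box_true[0], box_pred[0]), max(box_true[1], box_pred[1])
    ix2, iy2 = min(box_true[2], box_pred[2]), min(box_true[3], box_pred[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_true) + area(box_pred) - inter
    return 1.0 - inter / union

def segmentation_loss(y_true, y_pred):
    """Micro-averaged fraction of incorrectly labelled pixels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))
```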

70 Learning Objective Training data {(xi,yi), i = 1,2,…,n}
Loss function for i-th sample Δ(yi,yi(w)) Minimize the regularized sum of loss over training data Highly non-convex in w Regularization plays no role (overfitting may occur)

71 Learning Objective
Training data {(xi,yi), i = 1,2,…,n}. Upper bound on the loss: Δ(yi,yi(w)) = wTΨ(xi,yi(w)) + Δ(yi,yi(w)) - wTΨ(xi,yi(w)) ≤ wTΨ(xi,yi(w)) + Δ(yi,yi(w)) - wTΨ(xi,yi) ≤ maxy { wTΨ(xi,y) + Δ(yi,y) } - wTΨ(xi,yi). The bound is convex in w and sensitive to the regularization of w

72 Learning Objective
Training data {(xi,yi), i = 1,2,…,n}: minw ||w||² + C Σi ξi s.t. wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi) ≤ ξi for all y. A quadratic program with a large number of constraints; many polynomial time algorithms

73 Outline Structured Output Prediction Structured Output SVM
Optimization Stochastic subgradient descent Conditional gradient aka Frank-Wolfe Results Shalev-Shwartz et al. Mathematical Programming 2011

74 Gradient
Convex function g(z): the gradient s at a point z0 satisfies g(z) – g(z0) ≥ sT(z-z0). Example: g(z) = z², gradient at z0 is 2z0

75 Gradient Descent minz g(z) Start at some point z0
Move along the negative gradient direction: zt+1 ← zt – λt g’(zt). Estimate the step-size via line search. Example: g(z) = z²

76 Gradient
Convex function g(z): the gradient s at a point z0 may not exist. g(z) – g(z0) ≥ sT(z-z0). Example: g(z) = |z|

77 Subgradient
Convex function g(z): a subgradient s at a point z0 satisfies g(z) – g(z0) ≥ sT(z-z0); it may not be unique. Example: g(z) = |z|

78 Subgradient Descent minz g(z) Start at some point z0
Move along the negative subgradient direction zt+1 ← zt – λtg’(zt) Estimate step-size via line search g(z) = |z| Doesn’t always work

79 Subgradient Descent
Example: minz max{z2 + 2z1, z2 – 2z1} (figure: level sets g(z) = 3, 4, 5 in the (z1, z2) plane)

80 Subgradient Descent minz g(z) Start at some point z0
Move along the negative subgradient direction zt+1 ← zt – λtg’(zt) Estimate step-size via line search g(z) = |z| Doesn’t always work

81 Subgradient Descent
minz g(z): start at some point z0 and move along the negative subgradient direction, zt+1 ← zt – λt g’(zt). Convergence requires limt→∞ λt = 0 and limT→∞ ∑t=1T λt = ∞. Example: g(z) = |z|
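As a small illustration (not from the slides), subgradient descent on g(z) = |z| with the decaying step size λt = 1/(t+1), which satisfies both conditions:

```python
def subgradient_abs(z):
    """A subgradient of g(z) = |z| (any value in [-1, 1] is valid at z = 0)."""
    return 1.0 if z > 0 else (-1.0 if z < 0 else 0.0)

z = 5.0
for t in range(1000):
    step = 1.0 / (t + 1)                 # lambda_t -> 0, sum_t lambda_t -> infinity
    z = z - step * subgradient_abs(z)
print(z)  # ends up close to the minimiser z = 0
```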

82 Learning Objective Training data {(xi,yi), i = 1,2,…,n}
minw ||w||² + C Σi ξi s.t. wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi) ≤ ξi for all y. Constrained problem?

83 Learning Objective
Training data {(xi,yi), i = 1,2,…,n}: minw ||w||² + C Σi maxy{wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi)}. Subgradient? g(z) – g(z0) ≥ sT(z-z0)
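For concreteness, a sketch (not from the slides) of evaluating this regularised structured hinge objective; joint_feature, delta and the enumerable label set are hypothetical names standing in for the quantities defined earlier:

```python
import numpy as np

def ssvm_objective(w, data, joint_feature, delta, labels, C):
    """||w||^2 + C * sum_i max_y { w^T Psi(x_i,y) + Delta(y_i,y) - w^T Psi(x_i,y_i) }.
    data: list of (x_i, y_i); labels: output space small enough to brute force."""
    obj = float(w @ w)
    for x_i, y_i in data:
        score_gt = w @ joint_feature(x_i, y_i)
        hinge = max(w @ joint_feature(x_i, y) + delta(y_i, y) - score_gt for y in labels)
        obj += C * hinge
    return obj
```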

84 Subgradient C Σi maxy{wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi)}
Ψ(xi,y) - Ψ(xi,yi)

85 Subgradient ŷ = argmaxy{wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi)}
C Σi maxy{wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi)} Ψ(xi,ŷ) - Ψ(xi,yi) Proof?

86 Subgradient ŷ = argmaxy{wTΨ(xi,y) + Δ(yi,y)} Inference
C Σi maxy{wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi)} Ψ(xi,ŷ) - Ψ(xi,yi)

87 Inference ŷ = argmaxy{wTΨ(xi,y) + Δ(yi,y)} Classification inference
Output: y ∈ {1,2,…,C} Brute-force search
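A sketch of this loss-augmented inference by brute force over a small label set (hypothetical names, not from the slides); structured outputs need the dedicated solvers discussed later:

```python
def loss_augmented_inference(w, x_i, y_i, joint_feature, delta, labels):
    """y_hat = argmax_y { w^T Psi(x_i, y) + Delta(y_i, y) }, by brute-force search
    over an enumerable label set (e.g. classification with C labels)."""
    return max(labels, key=lambda y: w @ joint_feature(x_i, y) + delta(y_i, y))
```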

88 Inference ŷ = argmaxy{wTΨ(xi,y) + Δ(yi,y)} Detection inference
Output: y ∈ {Pixels} Brute-force search

89 Inference
ŷ = argmaxy{wTΨ(xi,y) + Δ(yi,y)} Segmentation inference: maxy ∑a (wa)TΨu(xia,ya) + ∑a,b (wab)TΨp(xiab,yab) + ∑a Δ(yia,ya) Week 5 “Optimization” lectures

90 Subgradient Descent
Start at some parameter w0. For t = 0 to T (number of iterations): s = 2wt; for i = 1 to n (number of samples): ŷ = argmaxy{wtTΨ(xi,y) + Δ(yi,y)}, s = s + C(Ψ(xi,ŷ) - Ψ(xi,yi)); end; wt+1 = wt - λt s with λt = 1/(t+1). End

91 Subgradient Descent
Start at some parameter w0. For t = 0 to T (number of iterations): s = 2wt; for i = 1 to n (number of samples): ŷ = argmaxy{wtTΨ(xi,y) + Δ(yi,y)}, s = s + C(Ψ(xi,ŷ) - Ψ(xi,yi)); end; wt+1 = wt - λt s with λt = 1/(t+1). End
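A sketch of this procedure in Python (not from the slides); joint_feature, delta and a small enumerable label set are assumed, and loss-augmented inference is done by brute force:

```python
import numpy as np

def ssvm_subgradient_descent(data, joint_feature, delta, labels, C, dim, T):
    """Batch subgradient descent for
    ||w||^2 + C * sum_i max_y { w^T Psi(x_i,y) + Delta(y_i,y) - w^T Psi(x_i,y_i) }."""
    w = np.zeros(dim)
    for t in range(T):
        s = 2.0 * w                                   # subgradient of ||w||^2
        for x_i, y_i in data:
            y_hat = max(labels, key=lambda y: w @ joint_feature(x_i, y) + delta(y_i, y))
            s += C * (joint_feature(x_i, y_hat) - joint_feature(x_i, y_i))
        w -= (1.0 / (t + 1)) * s                      # step along the negative subgradient
    return w
```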

92 Learning Objective
Training data {(xi,yi), i = 1,2,…,n}: minw ||w||² + C Σi maxy{wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi)}

93 Stochastic Approximation
Training data {(xi,yi), i = 1,2,…,n} minw ||w||2 + C Σi maxy{wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi)} Choose a sample ‘i’ with probability 1/n

94 Stochastic Approximation
Training data {(xi,yi), i = 1,2,…,n} minw ||w||2 + Cn maxy{wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi)} Choose a sample ‘i’ with probability 1/n Expected value? Original objective function

95 Stochastic Subgradient Descent
Start at some parameter w0. For t = 0 to T (number of iterations): s = 2wt; choose a sample ‘i’ with probability 1/n; ŷ = argmaxy{wtTΨ(xi,y) + Δ(yi,y)}; s = s + Cn(Ψ(xi,ŷ) - Ψ(xi,yi)); wt+1 = wt - λt s with λt = 1/(t+1). End
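A sketch of the stochastic variant under the same assumptions; the only change from the batch version above is that one sample is drawn per iteration and its subgradient term is scaled by n:

```python
import numpy as np

def ssvm_sgd(data, joint_feature, delta, labels, C, dim, T, seed=0):
    rng = np.random.default_rng(seed)
    n, w = len(data), np.zeros(dim)
    for t in range(T):
        x_i, y_i = data[rng.integers(n)]              # pick sample i with probability 1/n
        y_hat = max(labels, key=lambda y: w @ joint_feature(x_i, y) + delta(y_i, y))
        s = 2.0 * w + C * n * (joint_feature(x_i, y_hat) - joint_feature(x_i, y_i))
        w -= (1.0 / (t + 1)) * s                      # unbiased estimate of the full subgradient
    return w
```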

96 Convergence Rate Compute an ε-optimal solution C: SSVM hyperparameter
d: Number of non-zeros in the feature vector O(dC/ε) iterations Each iteration requires solving an inference problem

97 Side Note: Structured Output CNN
conv1 → conv2 → conv3 → conv4 → conv5 → fc6 → fc7 → SSVM; back-propagate the subgradients

98 Outline Structured Output Prediction Structured Output SVM
Optimization Stochastic subgradient descent Conditional gradient aka Frank-Wolfe Results Lacoste-Julien et al. ICML 2013

99 Conditional Gradient Slide courtesy Martin Jaggi

100 Conditional Gradient Slide courtesy Martin Jaggi

101 Conditional Gradient Slide courtesy Martin Jaggi

102 Conditional Gradient Slide courtesy Martin Jaggi

103 SSVM Primal
minw ||w||² + C Σi ξi s.t. wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi) ≤ ξi for all y. Derive the dual on the board

104 SSVM Dual
maxα -||Mα||²/4 + bTα s.t. ∑y αi(y) = C for all i, αi(y) ≥ 0 for all i, y; where w = Mα/2, b = [Δ(yi,y)], and the columns of M are Ψ(xi,yi) - Ψ(xi,y)

105 Linear Program
maxα bTα - (Mα)Twt s.t. ∑y αi(y) = C for all i, αi(y) ≥ 0 for all i, y. Solve this over all possible α: standard Frank-Wolfe. Solve this over all possible αi for a sample ‘i’: Block Coordinate Frank-Wolfe

106 Linear Program
maxα bTα - (Mα)Twt s.t. ∑y αi(y) = C for all i, αi(y) ≥ 0 for all i, y. Vertices? αi(y) = C if y = ŷ, 0 otherwise

107 Solution
maxα bTα - (Mα)Twt s.t. ∑y αi(y) = C for all i, αi(y) ≥ 0 for all i, y. Which vertex maximizes the linear function? ŷ = argmaxy{wtTΨ(xi,y) + Δ(yi,y)} (inference); si(y) = C if y = ŷ, 0 otherwise

108 Update αt+1 = (1-μ) αt + μs Standard Frank-Wolfe
s contains the solution for all the samples Block Coordinate Frank-Wolfe s contains the solution for sample ‘i’ sj = αtj for all other samples

109 Step-Size αt+1 = (1-μ) αt + μs
Maximizing a quadratic function in one variable μ Analytical computation of optimal step-size
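A sketch of Block Coordinate Frank-Wolfe for this dual (my reading of the update, not the authors' exact pseudocode): instead of storing the exponentially large α, each block keeps its primal contribution wi = Mi αi / 2 and ℓi = biT αi, and Δ(y,y) = 0 is assumed so that initialising every block at the ground-truth vertex gives w = 0.

```python
import numpy as np

def bcfw_ssvm(data, joint_feature, delta, labels, C, dim, T, seed=0):
    """Block-Coordinate Frank-Wolfe on the SSVM dual sketched above."""
    rng = np.random.default_rng(seed)
    n = len(data)
    w = np.zeros(dim)                 # w = sum_i w_i
    w_blocks = np.zeros((n, dim))     # w_i for each sample (alpha_i at the ground-truth vertex)
    l_blocks = np.zeros(n)            # l_i = b_i^T alpha_i (zero since Delta(y_i, y_i) = 0)
    for _ in range(T):
        i = rng.integers(n)
        x_i, y_i = data[i]
        # Loss-augmented inference selects the Frank-Wolfe vertex for block i.
        y_hat = max(labels, key=lambda y: w @ joint_feature(x_i, y) + delta(y_i, y))
        w_s = 0.5 * C * (joint_feature(x_i, y_i) - joint_feature(x_i, y_hat))
        l_s = C * delta(y_i, y_hat)
        dw, dl = w_s - w_blocks[i], l_s - l_blocks[i]
        # Analytic optimal step size: maximise the one-dimensional quadratic in mu, clipped to [0, 1].
        denom = 2.0 * float(dw @ dw)
        mu = 1.0 if denom == 0 else float(np.clip((dl - 2.0 * float(dw @ w)) / denom, 0.0, 1.0))
        w_blocks[i] += mu * dw
        l_blocks[i] += mu * dl
        w += mu * dw                  # keep w consistent with the sum of the blocks
    return w
```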

110 Comparison OCR Dataset

111 Outline Structured Output Prediction Structured Output SVM
Optimization Results Exact Inference Approximate Inference Choice of Loss Function

112 Optical Character Recognition
Identify each letter in a handwritten word Taskar, Guestrin and Koller, NIPS 2003

113 Optical Character Recognition
X1 X2 X3 X4 Labels L = {a, b, …., z} Logistic Regression Multi-Class SVM Taskar, Guestrin and Koller, NIPS 2003

114 Optical Character Recognition
X1 X2 X3 X4 Labels L = {a, b, …., z} Maximum Likelihood Structured Output SVM Taskar, Guestrin and Koller, NIPS 2003

115 Optical Character Recognition
Taskar, Guestrin and Koller, NIPS 2003

116 Image Segmentation Szummer, Kohli and Hoiem, ECCV 2006

117 Image Segmentation Labels L = {0, 1}
X1 X2 X3 X4 X5 X6 X7 X8 X9 Labels L = {0, 1} Szummer, Kohli and Hoiem, ECCV 2006

118 Image Segmentation Szummer, Kohli and Hoiem, ECCV 2006

119 Outline Structured Output Prediction Structured Output SVM
Optimization Results Exact Inference Approximate Inference Choice of Loss Function

120 Scene Dataset Finley and Joachims, ICML 2008

121 Reuters Dataset Finley and Joachims, ICML 2008

122 Yeast Dataset Finley and Joachims, ICML 2008

123 Mediamill Dataset Finley and Joachims, ICML 2008

124 Outline Structured Output Prediction Structured Output SVM
Optimization Results Exact Inference Approximate Inference Choice of Loss Function

125 “Jumping” Classification

126 Standard Pipeline Collect dataset D = {(xi,yi), i = 1, …., n}
Learn your favourite classifier Classifier assigns a score to each test sample Threshold the score for classification

127 “Jumping” Ranking
Ranked list (Rank 1, Rank 2, Rank 3, Rank 4): Average Precision = 1

128 Ranking vs. Classification
Rankings with the same accuracy can differ in Average Precision (figure: Average Precision = 0.81, 1, 0.92; Accuracy = 0.67, 1)

129 Standard Pipeline Collect dataset D = {(xi,yi), i = 1, …., n}
Learn your favourite classifier Classifier assigns a score to each test sample Sort the score for ranking
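For concreteness, a small sketch (assumed implementation, not from the slides) of Average Precision for a ranked list of binary relevance labels, showing how it depends on the full ranking rather than a single threshold:

```python
import numpy as np

def average_precision(labels_sorted):
    """AP of a ranked list; labels_sorted[k] is 1 if the item at rank k+1 is
    a positive ('jumping') example and 0 otherwise."""
    labels_sorted = np.asarray(labels_sorted, dtype=float)
    hits = np.cumsum(labels_sorted)                     # positives seen up to each rank
    precision_at_k = hits / np.arange(1, len(labels_sorted) + 1)
    return float(np.sum(precision_at_k * labels_sorted) / labels_sorted.sum())

print(average_precision([1, 1, 0, 0]))  # 1.0: all positives ranked first
print(average_precision([1, 0, 1, 0]))  # (1/1 + 2/3) / 2, roughly 0.83
```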

130 Computes subgradients of the AP loss

131 Yue, Finley, Radlinski and Joachims, SIGIR 2007
Optimizing Average Precision instead of the 0-1 loss: about 4% AP improvement for free, but training is 5x slower

132 Efficient Optimization of Average Precision
C. V. Jawahar Pritish Mohapatra M. Pawan Kumar

133 Training Time
Each iteration of AP optimization is slightly slower than for the 0-1 loss, but it takes fewer iterations to converge in practice

134 Questions?

