Kai-Wei Chang University of Virginia

Kai-Wei Chang CS @ University of Virginia kw@kwchang.net
Lecture 9: Learning. Kai-Wei Chang, University of Virginia. Some slides are adapted from Vivek Srikumar’s course on Structured Prediction. Advanced ML: Inference

So far we saw
What is structured output prediction? Different ways of modeling structured prediction: conditional random fields, factor graphs, constraints. What we only occasionally touched upon: algorithms for training and inference, namely Viterbi (inference in sequences) and the structured perceptron (training in general).

Up next
Structural Support Vector Machine: how it naturally extends multiclass SVM. Empirical Risk Minimization: or, how structural SVM and CRF are solving very similar problems. Training Structural SVM via stochastic gradient descent, and some tricks.

Where are we? Empirical Risk Minimization

Recall: Binary and Multiclass SVM
Binary SVM: maximize the margin. Equivalently, minimize the norm of the weights such that the closest points to the hyperplane have a score of ±1. Multiclass SVM: each label has a different weight vector (like one-vs-all). Maximize the multiclass margin. Equivalently, minimize the total norm of the weights such that the true label is scored at least 1 more than the second best one.

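Multiclass SVM in the separable case
We have a data set D = {<xi, yi>}. Recall hard binary SVM:
$\min_w \frac{1}{2} w^T w$ such that $y_i w^T x_i \ge 1$ for all $i$.
Multiclass SVM:
$\min_w \frac{1}{2} \sum_k w_k^T w_k$ such that $w_{y_i}^T x_i - w_k^T x_i \ge 1$ for all $i$ and all $k \ne y_i$.
The objective is the size of the weights (effectively, a regularizer). The constraint says that the score for the true label is higher than the score for any other label by at least 1.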

Structural SVM: First attempt
Suppose we have some definition of a structure (a factor graph) and feature definitions for each “part” p, written $\phi_p(x, y_p)$. Remember: we can talk about the feature vector for the entire structure, because features sum over the parts: $\phi(x, y) = \sum_p \phi_p(x, y_p)$. We also have a data set D = {<xi, yi>}.
What do we want from training (following the multiclass idea)? For each example, the annotated structure yi gets the highest score among all structures, and yi gets a score of at least one more than any other structure.

Maximize margin by minimizing norm of w
$\min_w \frac{1}{2} w^T w$ such that, for every training example $(x_i, y_i)$ (input with gold structure) and every other structure $y \ne y_i$:
$w^T \phi(x_i, y_i) \ge w^T \phi(x_i, y) + 1$
The left-hand side is the score for the gold structure; $w^T \phi(x_i, y)$ is the score for some other structure.
Problem: no partial credit! Take a gold structure and two competitors: other structure A, with only one mistake, and other structure B, which is fully incorrect. Structure B is more wrong, but this formulation will be happy if both A and B are scored one less than gold.

Structural SVM: Second attempt
Maximize margin by minimizing norm of w
$\min_w \frac{1}{2} w^T w$ such that, for every input with gold structure $(x_i, y_i)$ and every structure $y$ (which could be $y_i$):
$w^T \phi(x_i, y_i) \ge w^T \phi(x_i, y) + \Delta(y, y_i)$
Here $\Delta(y, y_i)$ is the Hamming distance between structures, defined to be zero if y = yi.
Intuition: it is okay for a structure that is close (in the Hamming sense) to the true one to get a score that is close to that of the true structure. Structures that are more different from the true structure should score much lower.
Problem: what if these constraints are not satisfied by any w for a given dataset? That is, what if the data is not separable?
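The difference between the first and second attempts can be checked on a toy case. The sketch below is illustrative only: the label sequences and the scores are made-up values, and `satisfies_margin` is a hypothetical helper name. It shows that a fully incorrect structure scored just one below gold passes the "+1" constraint but fails the Hamming-scaled one.

```python
def hamming(y, y_gold):
    """Number of positions where two equal-length label sequences differ."""
    assert len(y) == len(y_gold)
    return sum(a != b for a, b in zip(y, y_gold))

def satisfies_margin(score_gold, score_other, y_other, y_gold):
    """Second-attempt constraint: gold must beat y_other by its Hamming distance."""
    return score_gold >= score_other + hamming(y_other, y_gold)

gold = ["A", "B", "B", "A"]
struct_a = ["A", "B", "A", "A"]   # only one mistake
struct_b = ["B", "A", "A", "B"]   # fully incorrect

# Hypothetical scores: both competitors are scored exactly one below gold.
score_gold, score_a, score_b = 10.0, 9.0, 9.0
ok_a = satisfies_margin(score_gold, score_a, struct_a, gold)  # required margin: 1
ok_b = satisfies_margin(score_gold, score_b, struct_b, gold)  # required margin: 4
print(ok_a, ok_b)
```

Under the first attempt both competitors would be acceptable; with the Hamming-scaled margin only structure A is.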

Structural SVM: Third attempt
Maximize margin and minimize slack:
$\min_{w, \xi} \frac{1}{2} w^T w + C \sum_i \xi_i$
such that, for every input with gold structure $(x_i, y_i)$ and every other structure $y$:
$w^T \phi(x_i, y_i) \ge w^T \phi(x_i, y) + \Delta(y, y_i) - \xi_i$ and $\xi_i \ge 0$
C is the tradeoff parameter, $\Delta$ is the Hamming distance, and $\xi_i$ is the slack variable for each example. Slacks must be non-negative. Slack variables allow some examples to be misclassified; minimizing the total slack forces this to happen as few times as possible.

Structural SVM
Equivalent formulation:
$\min_w \frac{1}{2} w^T w + C \sum_i \max_y \left[ w^T \phi(x_i, y) + \Delta(y, y_i) - w^T \phi(x_i, y_i) \right]$
Questions?
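In the equivalent formulation, the per-example term is the structured hinge loss: max over y of [score(x, y) + Δ(y, yi)] minus score(x, yi). A minimal sketch follows that evaluates it by brute-force enumeration on a tiny sequence. The feature map (emission and transition counts), the label set, and the example sentence are all assumptions made for illustration, and enumeration is only feasible for very short sequences.

```python
from itertools import product

LABELS = ("A", "B")  # hypothetical label set

def features(x, y):
    """Hypothetical feature map: counts of (token, label) and (label, label) pairs."""
    phi = {}
    for tok, lab in zip(x, y):
        phi[("emit", tok, lab)] = phi.get(("emit", tok, lab), 0) + 1
    for prev, cur in zip(y, y[1:]):
        phi[("trans", prev, cur)] = phi.get(("trans", prev, cur), 0) + 1
    return phi

def score(w, x, y):
    """w^T phi(x, y) with w stored as a sparse dict."""
    return sum(w.get(k, 0.0) * v for k, v in features(x, y).items())

def hamming(y, y_gold):
    return sum(a != b for a, b in zip(y, y_gold))

def structured_hinge(w, x, y_gold):
    """max_y [score(x, y) + Hamming(y, y_gold)] - score(x, y_gold), by enumeration."""
    best = max(score(w, x, y) + hamming(y, y_gold)
               for y in product(LABELS, repeat=len(x)))
    return best - score(w, x, y_gold)

x = ("the", "dog", "runs")
y_gold = ("A", "B", "B")
loss_zero_w = structured_hinge({}, x, y_gold)
w_good = {("emit", "the", "A"): 5.0, ("emit", "dog", "B"): 5.0, ("emit", "runs", "B"): 5.0}
loss_good_w = structured_hinge(w_good, x, y_gold)
print(loss_zero_w, loss_good_w)
```

The loss is never negative, because y = yi always contributes a value of zero inside the max.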

Comments
Other slightly different formulations exist; generally the same principle applies. Multiclass is a special case of structure: structural SVM strictly generalizes multiclass SVM. It can be seen as minimizing a structured version of the hinge loss. Remember empirical risk minimization? Learning as optimization: we have framed the optimization problem, but we haven’t seen how it can be solved yet. That is, we don’t have a learning algorithm yet. Exercise: Work it out.

How to train the model?
Model definition: What are the parts of the output? What are the inter-dependencies? How to train the model? How to do inference? Background knowledge about the domain. Data annotation difficulty. For training: e.g., maximizing log-likelihood, which is equivalent to $\max_w \prod_i P(y_i \mid x_i, w)$; or maximizing the margin (e.g., SVM). Semi-supervised/indirectly supervised?

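Recap: Structural SVM
Maximize margin, minimize slack: $\min_{w, \xi} \frac{1}{2} w^T w + C \sum_i \xi_i$ such that $w^T \phi(x_i, y_i) \ge w^T \phi(x_i, y) + \Delta(y, y_i) - \xi_i$ and $\xi_i \ge 0$, for every input with gold structure $(x_i, y_i)$ and every other structure $y$. C is the tradeoff parameter, $\Delta$ is the Hamming distance, and $\xi_i$ is the slack variable for each example.
Equivalent formulation: $\min_w \frac{1}{2} w^T w + C \sum_i \max_y \left[ w^T \phi(x_i, y) + \Delta(y, y_i) - w^T \phi(x_i, y_i) \right]$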

Broader picture: Learning as loss minimization
Collect some annotated data; more is generally better. Pick a hypothesis class (also called a model): decide how the score decomposes over the parts of the output. Choose a loss function: decide how to penalize incorrect decisions. Learning = minimize empirical risk + regularizer; typically an optimization procedure is needed here. This must look familiar. We have seen this before for binary classification!

Empirical risk minimization
Suppose the function Loss scores the quality of a prediction with respect to the true structure: Loss(f(x), y) tells us how good f is for this x by comparing it against y. Evaluate the quality of the predictor f by averaging over the unknown distribution P that generates the data.
Expected risk: $E_{(x, y) \sim P}\left[ \mathrm{Loss}(f(x), y) \right]$
We don’t know P, so use the empirical risk over the N training examples, possibly with a regularizer: $\frac{1}{N} \sum_{i=1}^{N} \mathrm{Loss}(f(x_i), y_i)$
Learning: minimize the regularized risk; various algorithms exist.

The loss function zoo: binary classification
[Figure: loss functions for binary classification: zero-one; hinge (SVM); exponential (AdaBoost); perceptron; logistic (logistic regression).]
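The losses named on this slide can each be written as a function of the margin m = y wᵀx. The definitions below are the usual textbook forms, not recovered from the slide's figure. Counting m = 0 as an error for the zero-one loss and using the natural logarithm for the logistic loss are assumptions.

```python
import math

def zero_one(m):
    """1 if the example is misclassified (margin <= 0), else 0."""
    return 1.0 if m <= 0 else 0.0

def hinge(m):
    """SVM loss: penalizes margins below 1."""
    return max(0.0, 1.0 - m)

def perceptron(m):
    """Perceptron loss: penalizes only negative margins."""
    return max(0.0, -m)

def exponential(m):
    """AdaBoost loss."""
    return math.exp(-m)

def logistic(m):
    """Logistic regression loss (natural log)."""
    return math.log(1.0 + math.exp(-m))

for m in (-1.0, 0.0, 0.5, 2.0):
    print(m, zero_one(m), hinge(m), perceptron(m), exponential(m), logistic(m))
```

Hinge and exponential upper-bound the zero-one loss at every margin, which is one reason they serve as convex surrogates for it.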

Structured classifiers: Different learning objectives
Structural SVM: $\min_w \frac{1}{2} w^T w + C \sum_i \max_y \left[ w^T \phi(x_i, y) + \Delta(y, y_i) - w^T \phi(x_i, y_i) \right]$
CRF: $\min_w \frac{1}{2} w^T w - C \sum_i \log P(y_i \mid x_i, w)$, where P is defined as $P(y \mid x, w) = \frac{\exp(w^T \phi(x, y))}{Z(x, w)}$
In both objectives the first term is the regularizer and the second term measures how badly w does on the training data. For structural SVM that term is the structured hinge loss; for the CRF it is the log loss (negative log-likelihood).
“Machine learning is stochastic optimization” – Nati Srebro
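To see how the two data terms behave on the same input, here is a small sketch that evaluates the structured hinge loss and the CRF log loss over an explicitly enumerated set of candidate structures. The scores and Hamming distances are invented numbers, and listing every candidate is only possible for toy output spaces.

```python
import math

def log_loss(scores, gold):
    """-log P(y_gold | x, w), where P is a softmax over candidate scores."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[gold]

def structured_hinge(scores, deltas, gold):
    """max_y [score(y) + Delta(y, y_gold)] - score(y_gold)."""
    return max(s + d for s, d in zip(scores, deltas)) - scores[gold]

scores = [3.0, 2.5, -1.0]  # hypothetical w^T phi(x, y) for three candidates
deltas = [0, 1, 3]         # Hamming distance of each candidate to the gold (index 0)

print(structured_hinge(scores, deltas, gold=0))
print(log_loss(scores, gold=0))
```

Both losses are driven by the same scores; the hinge looks only at the single worst violator, while the log loss sums over all candidates through the partition function Z.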

Training structural SVM
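We want to minimize $\frac{1}{2} w^T w + C \sum_i \max_y \left[ w^T \phi(x_i, y) + \Delta(y, y_i) - w^T \phi(x_i, y_i) \right]$. The function being minimized is convex in w. Why? Many, many algorithms exist. Let’s look at a gradient-based one.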

Recall: General gradient descent
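To minimize a differentiable convex function g(x): initialize $x_0$ to any value, then iterate until convergence. At each step, find the gradient of g at $x_t$, $\nabla g(x_t)$, and update $x_{t+1} \leftarrow x_t - \gamma_t \nabla g(x_t)$. What we know: $g(x_0) \ge g(x_1) \ge g(x_2) \ge \dots$ Guarantee: will converge to a local minimum, which for a convex function is also a global minimum (with some assumptions, e.g., on the step sizes).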
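As a concrete instance of the iteration, the following sketch runs gradient descent with a fixed step size on a one-dimensional quadratic. The function, starting point, and step size are arbitrary choices for illustration.

```python
def gradient_descent(grad, x0, step, iters):
    """Repeat x <- x - step * grad(x) for a fixed number of iterations."""
    x = x0
    for _ in range(iters):
        x = x - step * grad(x)
    return x

# g(x) = (x - 3)^2 has gradient 2 (x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0, step=0.1, iters=200)
print(x_min)
```

With this step size the distance to the minimizer shrinks by a constant factor of 0.8 per iteration.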

Stochastic gradient descent
To minimize a function g(x) that has the form $\sum_i g_i(x)$ (each $g_i$ is differentiable): initialize $x_0$, then iterate until convergence. At each step, pick a random $g_i$, compute its gradient at $x_t$, $\nabla g_i(x_t)$, and update $x_{t+1} \leftarrow x_t - \gamma_t \nabla g_i(x_t)$. General idea: replace the gradient with a noisy estimate.
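A minimal sketch of the stochastic variant: each step uses the gradient of one randomly chosen component instead of the full sum. The components g_i(x) = (x - a_i)² / 2, the 1/t step size, and the fixed seed are assumptions chosen so that the iterate is simply the running mean of the sampled targets.

```python
import random

def sgd(grads, x0, iters, seed=0):
    """SGD with step size 1/t; `grads` is a list of per-component gradient functions."""
    rng = random.Random(seed)
    x = x0
    for t in range(1, iters + 1):
        g_i = rng.choice(grads)      # pick a random component g_i
        x = x - (1.0 / t) * g_i(x)   # noisy step using only grad g_i
    return x

targets = [1.0, 2.0, 3.0, 4.0]
# g_i(x) = (x - a_i)^2 / 2, so grad g_i(x) = x - a_i; the sum is minimized at the mean.
grads = [lambda x, a=a: x - a for a in targets]
x_hat = sgd(grads, x0=0.0, iters=5000)
print(x_hat)
```

The result lands near the mean of the targets (2.5) but not exactly on it, which reflects the noise in the gradient estimate.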

Gradient descent vs SGD
[Figure: stochastic gradient descent vs. gradient descent. Images from Ng.]

GD v.s. SGD -- ask direction
Stochastic gradient descent
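We want to minimize $\frac{1}{2} w^T w + C \sum_i \max_y \left[ w^T \phi(x_i, y) + \Delta(y, y_i) - w^T \phi(x_i, y_i) \right]$. General SGD strategy: pretend our training set has only one example (say xi, yi), compute the gradient, and update w. (C=1, C=N: this C is different from the above one.) Problem: the gradient doesn’t exist, because the max makes the objective non-differentiable at some points.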

Subgradients of max
Consider the function $f(x) = \max\{f_1(x), f_2(x)\}$.
If $f_1(x) > f_2(x)$: unique subgradient $\nabla f_1(x)$.
If $f_2(x) > f_1(x)$: unique subgradient $\nabla f_2(x)$.
If $f_1(x) = f_2(x)$: subgradients in $[\nabla f_1(x), \nabla f_2(x)]$.
Strategy: solve the max first, then compute the gradient of whichever function is the argmax.
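The strategy "solve the max first, then differentiate the winner" can be sketched in a few lines. The two functions below are arbitrary one-dimensional examples, and `max_subgradient` is a hypothetical helper name. On ties the code returns the gradient of the first maximizer, which is one valid subgradient.

```python
def max_subgradient(fs, grads, x):
    """Return (value, subgradient) of f(x) = max_k fs[k](x)."""
    values = [f(x) for f in fs]
    k = max(range(len(fs)), key=lambda j: values[j])  # solve the max first
    return values[k], grads[k](x)                      # gradient of the argmax piece

# f(x) = max(x^2, 2x + 3); the two pieces cross at x = -1 and x = 3.
fs = [lambda x: x * x, lambda x: 2 * x + 3]
grads = [lambda x: 2 * x, lambda x: 2.0]

print(max_subgradient(fs, grads, 0.0))  # the linear piece is active here
print(max_subgradient(fs, grads, 4.0))  # the quadratic piece is active here
```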

Recap: Binary Classification
Stochastic (sub)-gradient descent for SVM
$f(w) = \frac{1}{2} w^T w + C \sum_i \max(0, 1 - y_i w^T x_i)$
Given a training set $\mathcal{D} = \{(x, y)\}$
Initialize $w \leftarrow 0 \in \mathbb{R}^n$
For epoch 1…T:
    For (x, y) in $\mathcal{D}$:
        if $y \, w^T x < 1$: $w \leftarrow (1 - \eta) w + \eta C y x$
        else: $w \leftarrow (1 - \eta) w$
Return w
CS6501 Lecture 2
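The pseudocode on this slide translates almost line for line into runnable code. The sketch below uses plain Python lists; the four data points and the hyperparameter values are made up for illustration, and no bias term is used, matching the slide.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_svm(data, C=1.0, eta=0.01, epochs=50):
    """SGD on 1/2 w^T w + C * sum_i max(0, 1 - y_i w^T x_i)."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            if y * dot(w, x) < 1:
                # margin violated: shrink w and step toward y * x
                w = [(1 - eta) * wj + eta * C * y * xj for wj, xj in zip(w, x)]
            else:
                # margin satisfied: only the regularizer contributes
                w = [(1 - eta) * wj for wj in w]
    return w

# Hypothetical linearly separable data: (features, label in {+1, -1}).
data = [((2.0, 1.0), 1), ((1.5, 2.0), 1), ((-1.0, -1.5), -1), ((-2.0, -0.5), -1)]
w = train_svm(data)
print(w, [y * dot(w, x) for x, y in data])
```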

Recap: Multi-class SVM
$\min_w \frac{1}{2} \sum_k w_k^T w_k + C \sum_i \max_k \left( \Delta(y_i, k) + w_k^T x_i - w_{y_i}^T x_i \right)$
Let’s define: $\Delta(y, y') = \delta$ if $y \ne y'$, and $0$ if $y = y'$.
CS6501 Lecture 3

Recap: SGD for multi-class SVM
$\min_w \frac{1}{2} \sum_k w_k^T w_k + C \sum_i \max_k \left( \Delta(y_i, k) + w_k^T x_i - w_{y_i}^T x_i \right)$

Global features for sequential tagging
The feature function decomposes over the sequence

SGD for Sequential tagging
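Given a training set $\mathcal{D} = \{(x, y)\}$
Initialize $w \leftarrow 0 \in \mathbb{R}^n$
For epoch 1…T:
    For (x, y) in $\mathcal{D}$:
        $\hat{y} = \arg\max_{y'} w^T \phi(x, y')$ (inference in the training loop)
        $w \leftarrow (1 - \eta) w + \eta \left( \phi(x, y) - \phi(x, \hat{y}) \right)$
Return w
Prediction: $\arg\max_y w^T \phi(x, y)$
CS6501 Lecture 3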

Back to Structural SVM: Stochastic subgradient descent
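We want to minimize $\frac{1}{2} w^T w + C \sum_i \max_y \left[ w^T \phi(x_i, y) + \Delta(y, y_i) - w^T \phi(x_i, y_i) \right]$. General SGD strategy: pretend our training set has only one example (say xi, yi), compute a subgradient, and update w.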

Subgradient computation
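Solve the max. Suppose the solution is y’: $y' = \arg\max_y \left[ w^T \phi(x_i, y) + \Delta(y, y_i) \right]$. This is the loss-augmented (also called loss-sensitive or cost-augmented) inference step.
Compute the gradient of $\frac{1}{2} w^T w + C \left[ w^T \phi(x_i, y') + \Delta(y', y_i) - w^T \phi(x_i, y_i) \right]$.
The subgradient is $w + C \left( \phi(x_i, y') - \phi(x_i, y_i) \right)$.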

SGD for structural SVM: The update
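Gradient: $w + C \left( \phi(x_i, y') - \phi(x_i, y_i) \right)$
At each step, go down the gradient: $w \leftarrow w - \eta \left[ w + C \left( \phi(x_i, y') - \phi(x_i, y_i) \right) \right]$
Or equivalently: $w \leftarrow (1 - \eta) w + \eta C \left( \phi(x_i, y_i) - \phi(x_i, y') \right)$
If y’ = yi: $w \leftarrow (1 - \eta) w$
Otherwise: $w \leftarrow (1 - \eta) w + \eta C \left( \phi(x_i, y_i) - \phi(x_i, y') \right)$
Even if loss-augmented inference gives the true answer, shrink w. (Increase margin!)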

Loss augmented inference
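Solve $\arg\max_y \left[ w^T \phi(x_i, y) + \Delta(y, y_i) - w^T \phi(x_i, y_i) \right]$. Recall: Δ(y, yi) is the Hamming distance between y and yi. The last term in the inference is constant, so effectively we solve $\arg\max_y \left[ w^T \phi(x_i, y) + \Delta(y, y_i) \right]$.
How difficult is this? Computationally, it has identical complexity to inference. You need to implement inference anyway; a minor tweak will give you loss-augmented inference.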

Example of loss-augmented inference
[Figure: a sequence-labeling example over labels A and B with transition scores A -> A: 1, A -> B: 2, B -> A: 3, B -> B: 4. Totals of 10 and 12 appear in the figure, the latter including a "+ 2" term.]

A trick for Hamming loss
[Figure: the same example with transition scores A -> A: 1, A -> B: 2, B -> A: 3, B -> B: 4, where "+ 1" terms appear next to individual scores. Totals of 12 and 10 appear in the figure.]
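Because Hamming loss decomposes over positions, one way to get loss-augmented inference from an existing Viterbi routine is to add 1 to the score of every non-gold label at every position and then decode as usual. The sketch below assumes that reading of the trick. It reuses the transition scores listed on the slide, but the emission scores and the gold sequence are invented, since the figure's values could not be recovered.

```python
def viterbi(emit, trans, labels):
    """emit[t][lab]: score of lab at position t; trans[(prev, cur)]: transition score."""
    best = {lab: (emit[0][lab], [lab]) for lab in labels}
    for t in range(1, len(emit)):
        new = {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[p][0] + trans[(p, cur)])
            s = best[prev][0] + trans[(prev, cur)] + emit[t][cur]
            new[cur] = (s, best[prev][1] + [cur])
        best = new
    return max(best.values(), key=lambda sp: sp[0])  # (score, label sequence)

def loss_augmented(emit, trans, labels, y_gold):
    """Add +1 to every non-gold label's score, then run ordinary Viterbi."""
    aug = [{lab: s + (lab != y_gold[t]) for lab, s in e.items()}
           for t, e in enumerate(emit)]
    return viterbi(aug, trans, labels)

labels = ["A", "B"]
trans = {("A", "A"): 1.0, ("A", "B"): 2.0, ("B", "A"): 3.0, ("B", "B"): 4.0}
# Hypothetical emission scores for a three-token input.
emit = [{"A": 2.5, "B": 0.0}, {"A": 0.0, "B": 0.5}, {"A": 1.5, "B": 0.0}]
y_gold = ["A", "B", "A"]

print(viterbi(emit, trans, labels))
print(loss_augmented(emit, trans, labels, y_gold))
```

Here plain decoding recovers the gold sequence, while loss-augmented decoding prefers a sequence with two mistakes, so the margin constraint is still violated for this example.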

Recall: Structured Perceptron algorithm
Given a training set D = {(x, y)}
Initialize w = 0
For epoch = 1 … T:
    For each training example (x, y) ∈ D:
        Predict $y' = \arg\max_{y'} w^T \phi(x, y')$ (inference in the training loop!)
        If y ≠ y’, update w ← w + learningRate (𝜙(x, y) - 𝜙(x, y’))
Return w
Update only on an error: the perceptron is a mistake-driven algorithm. If there is a mistake, promote y and demote y’.

SGD for Structural SVM
Given a training set D = {(xi, yi)}
Initialize w = 0
For epoch = 1 … T:
    Shuffle data
    For each training example (xi, yi) ∈ D:
        Let $y' = \arg\max_y \left[ w^T \phi(x_i, y) + \Delta(y_i, y) \right]$
        If y’ = yi: shrink w ← (1 - 𝜂) w
        Else: update w ← (1 - 𝜂) w + 𝜂 C (𝜙(xi, yi) - 𝜙(xi, y’))
Return w
The update is a lot like the Perceptron update. Two differences: loss-augmented inference instead of standard inference, and w shrinks even if the inference is correct.
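A runnable sketch of this training loop follows, with several simplifications that are assumptions rather than part of the slides: loss-augmented inference is done by brute-force enumeration instead of Viterbi, the feature map is a hypothetical emission-plus-transition count, the two training sequences are made up, and shuffling is omitted so the run is deterministic.

```python
from itertools import product

LABELS = ("A", "B")  # hypothetical label set

def phi(x, y):
    """Hypothetical sparse features: emission and transition counts."""
    f = {}
    for tok, lab in zip(x, y):
        f[("emit", tok, lab)] = f.get(("emit", tok, lab), 0.0) + 1.0
    for p, c in zip(y, y[1:]):
        f[("trans", p, c)] = f.get(("trans", p, c), 0.0) + 1.0
    return f

def score(w, x, y):
    return sum(w.get(k, 0.0) * v for k, v in phi(x, y).items())

def hamming(y, y_gold):
    return sum(a != b for a, b in zip(y, y_gold))

def predict(w, x, y_gold=None):
    """Brute-force argmax; adds the Hamming term when y_gold is given."""
    def total(y):
        return score(w, x, y) + (hamming(y, y_gold) if y_gold is not None else 0)
    return max(product(LABELS, repeat=len(x)), key=total)

def sgd_step(w, x, y_gold, C=1.0, eta=0.1):
    y_hat = predict(w, x, y_gold)                  # loss-augmented inference
    w = {k: (1 - eta) * v for k, v in w.items()}   # always shrink w
    if y_hat != y_gold:
        for k, v in phi(x, y_gold).items():        # promote the gold structure
            w[k] = w.get(k, 0.0) + eta * C * v
        for k, v in phi(x, y_hat).items():         # demote the loss-augmented argmax
            w[k] = w.get(k, 0.0) - eta * C * v
    return w

def train(data, epochs=20):
    w = {}
    for _ in range(epochs):
        for x, y_gold in data:
            w = sgd_step(w, x, y_gold)
    return w

data = [(("a", "b", "a"), ("A", "B", "A")), (("b", "b", "a"), ("B", "B", "A"))]
w = train(data)
print([predict(w, x) for x, _ in data])
```

Swapping `predict` for a Viterbi-based routine leaves `sgd_step` unchanged, which is the sense in which inference is the only problem-specific piece of the solver.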

SGD for structural SVM and Perceptron
The SGD update is a lot like the Perceptron update. Two differences: (1) Loss-augmented inference: if Δ is uniformly zero, this difference disappears. (2) Shrink w even if the inference is correct: if there is no regularization (or C is very large), this difference disappears.

Structural Support Vector Machine: how it naturally extends multiclass SVM. Empirical Risk Minimization: or, how structural SVM and CRF are solving very similar problems. Training Structural SVM via stochastic gradient descent: the algorithm and loss-augmented inference, comparison to Perceptron, practical tips.

1. Convergence and learning rates
With enough iterations, it will converge in expectation, provided the step sizes are square summable but not summable. That is: the step sizes 𝜂t are positive; the sum of squares of the step sizes over t = 1 to ∞ is finite; and the sum of the step sizes over t = 1 to ∞ is infinite. Example: a schedule such as $\eta_t = \eta_0 / t$ satisfies these conditions. Or use the AdaGrad update (coming up).

2. The AdaGrad update
Intuition: frequently updated features should get smaller learning rates; pay more attention to infrequent, possibly informative features. Suppose the update is g = 𝜙(xi, yi) - 𝜙(xi, y’). Maintain the sum of squared updates so far as a separate vector, and set the learning rate for feature j at step t as $\eta_{t,j} = \eta / \sqrt{\sum_{\tau \le t} g_{\tau,j}^2}$, where $g_{\tau,j}$ is component j of the update at step τ. This gives a per-feature learning rate. Easy to implement, with great theoretical guarantees.
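The per-feature learning rate can be sketched on dense lists as follows. The small `eps` added to the denominator is an assumption to avoid dividing by zero for features that have never been updated, and `g` is treated as an update direction that is added to w, as in the perceptron-style update above.

```python
import math

def adagrad_update(w, g, sum_sq, eta=1.0, eps=1e-8):
    """One AdaGrad step: feature j moves by eta / sqrt(sum of its squared updates) * g[j]."""
    for j, gj in enumerate(g):
        sum_sq[j] += gj * gj                              # running sum of squared updates
        w[j] += eta / (math.sqrt(sum_sq[j]) + eps) * gj   # per-feature learning rate
    return w, sum_sq

w, sum_sq = [0.0, 0.0], [0.0, 0.0]
# Feature 0 fires at every step; feature 1 fires only at the last step.
for g in ([1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 1.0]):
    w, sum_sq = adagrad_update(w, g, sum_sq)
print(w, sum_sq)
```

At the last step the frequent feature moves by about 0.5 while the rare feature moves by about 1.0, which is the intended behavior.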

2. The AdaGrad update
AdaGrad changes the geometry of the underlying space. [Image from Duchi]

3. Mini-batches for SGD
What we saw: approximate the gradient with a single example. Instead, select a small set (10, 100) of examples (called a mini-batch) at each step and use those to approximate the gradient. Compute the subgradient on this mini-batch and update w. Converges faster and is more stable.

4. Miscellaneous tips
Shuffle the dataset at the beginning of each epoch. Monitor training cost and validation error: training error should “generally” decrease, and you can stop training when validation error doesn’t change for some time. Very important: check your gradient computation, because even small errors will make SGD erratic. Use finite differences; see Leon Bottou’s “Stochastic Gradient Descent Tricks”. Experiment with learning rates using a small dataset.
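A finite-difference gradient check can be as small as the sketch below. It uses central differences on a toy objective; the objective, the tolerance, and the helper names are assumptions for illustration. For a non-differentiable objective such as the hinge loss, the check is only meaningful at points away from the kinks.

```python
def numerical_grad(f, w, h=1e-6):
    """Central differences: (f(w + h e_j) - f(w - h e_j)) / (2h) for each coordinate j."""
    grad = []
    for j in range(len(w)):
        up = list(w)
        up[j] += h
        dn = list(w)
        dn[j] -= h
        grad.append((f(up) - f(dn)) / (2 * h))
    return grad

def check_grad(f, grad_f, w, tol=1e-4):
    """True if the analytic gradient matches the numerical one in every coordinate."""
    num = numerical_grad(f, w)
    ana = grad_f(w)
    return max(abs(a - b) for a, b in zip(num, ana)) < tol

# Toy objective: 1/2 ||w||^2 + 3 w_0.
f = lambda w: 0.5 * sum(v * v for v in w) + 3.0 * w[0]
good = lambda w: [w[0] + 3.0, w[1]]
bad = lambda w: [w[0], w[1]]  # forgets the linear term

print(check_grad(f, good, [1.0, -2.0]), check_grad(f, bad, [1.0, -2.0]))
```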

Summary: Optimizing the SVM objective
We have seen a solver for structural SVMs: stochastic sub-gradient descent and some variants. It is easy to implement. Other approaches exist: dual coordinate ascent/descent, cutting planes, etc. To use it in your problem (for all solvers), you need to define the model (independence assumptions between predictions), the features, inference, and loss-augmented inference. Important: SGD is a general strategy, not just for SVMs. Use it for other objective functions, neural networks, etc.
