Administration HW4 is due on Saturday 3/11. No extensions! We will release solutions on Saturday night, so there is enough time for you to look at them before the exam. Midterm exam on Thursday 3/16: closed book; in class; ~4 questions; covers all the material presented before the midterm. Practice midterms will be released over the weekend. Next Tuesday 3/14: Review. Additional office hours: 8:30-9:30 tomorrow (Wednesday). Questions?
Projects Project proposals are due on March 10, 2017. Scale of projects: 25% of the grade. Within a week we will give you approval to continue with your project, along with comments and/or a request to modify/augment/do a different project. There will also be a mechanism for peer comments. We encourage team projects – a team can be up to 3 people. Please start thinking and working on the project now. Your proposal is limited to 1-2 pages, but needs to include references and, ideally, some of the ideas you have developed in the direction of the project (maybe even some preliminary results). Any project that has a significant Machine Learning component is good. You can do experimental work, theoretical work, a combination of both, or a critical survey of results in some specialized topic. The work has to include some reading. Even if you do not do a survey, you must read (at least) two related papers or book chapters and relate your work to them. Originality is not mandatory but is encouraged. Try to make it interesting!
Examples Fake News Challenge: http://www.fakenewschallenge.org/ KDD Cup 2013, "Author-Paper Identification": given an author and a small set of papers, identify which papers are really written by the author. https://www.kaggle.com/c/kdd-cup-2013-author-paper-identification-challenge "Author Profiling": given a set of documents, profile the author: identification, gender, native language, … Caption Control: is it gibberish? Spam? High-quality text? Adapt an NLP program to a new domain. Work on making learned hypotheses (e.g., linear threshold functions, NN) more comprehensible: explain the prediction. Develop a (multi-modal) People Identifier. Compare regularization methods: e.g., Winnow vs. L1 regularization. Large-scale clustering of documents + naming the clusters. Deep networks: convert a state-of-the-art NLP program to an efficient deep network architecture. Try to prove something.
COLT approach to explaining Learning No distributional assumption. The training distribution is the same as the test distribution. Generalization bounds depend on this view and affect model selection: Err_D(h) < Err_TR(h) + P(VC(H), log(1/δ), 1/m). This is also called the “Structural Risk Minimization” principle.
COLT approach to explaining Learning No distributional assumption. The training distribution is the same as the test distribution. Generalization bounds depend on this view and affect model selection: Err_D(h) < Err_TR(h) + P(VC(H), log(1/δ), 1/m). As presented, the VC dimension is a combinatorial parameter associated with a class of functions. We know that the class of linear functions has a lower VC dimension than the class of quadratic functions. But this notion can be refined to depend on a given data set, and in this way directly affect the hypothesis chosen for a given data set.
Data Dependent VC dimension So far we discussed the VC dimension in the context of a fixed class of functions. We can also parameterize the class of functions in interesting ways. Consider the class of linear functions, parameterized by their margin. Note that this is a data-dependent notion.
Linear Classification Let X = ℝ², Y = {−1, +1}. Which of these classifiers would be likely to generalize better? (Figure: two separating hyperplanes, h1 and h2.)
VC and Linear Classification Recall the VC-based generalization bound: Err(h) ≤ Err_TR(h) + Poly{VC(H), 1/m, log(1/δ)}. Here we get the same bound for both classifiers: Err_TR(h1) = Err_TR(h2) = 0, and h1, h2 ∈ H_lin(2), with VC(H_lin(2)) = 3. How, then, can we explain our intuition that h2 should give better generalization than h1?
Linear Classification Although both classifiers separate the data, the distance with which the separation is achieved is different. (Figure: h1 separates with a small margin, h2 with a large one.)
Concept of Margin The margin γ_i of a point x_i ∈ ℝⁿ with respect to a linear classifier h(x) = sign(wᵀx + b) is defined as the distance of x_i from the hyperplane wᵀx + b = 0: γ_i = |wᵀx_i + b| / ||w||. The margin of a set of points {x_1, …, x_m} with respect to a hyperplane w is defined as the margin of the point closest to the hyperplane: γ = min_i γ_i = min_i |wᵀx_i + b| / ||w||.
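A minimal numpy sketch of these two definitions (the hyperplane and points below are hypothetical toy values):

```python
import numpy as np

def point_margin(w, b, x):
    # Distance of a single point x from the hyperplane w^T x + b = 0.
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

def set_margin(w, b, X):
    # Margin of a point set: the margin of the point closest to the hyperplane.
    return min(point_margin(w, b, x) for x in X)

w, b = np.array([1.0, 1.0]), -1.0          # hypothetical hyperplane
X = np.array([[2.0, 2.0], [0.0, 3.0]])     # hypothetical points
print(set_margin(w, b, X))                 # gamma = min_i gamma_i
```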
VC and Linear Classification If H_γ is the space of all linear classifiers in ℝⁿ that separate the training data with margin at least γ, then: VC(H_γ) ≤ min(R²/γ², n) + 1, where R is the radius of the smallest sphere (in ℝⁿ) that contains the data. Thus, for such classifiers, we have a bound of the form: Err(h) ≤ Err_TR(h) + {(O(R²/γ²) + log(4/δ)) / m}^{1/2}.
Data Dependent VC dimension Namely, when we consider the class H_γ of linear hypotheses that separate a given data set with a margin γ, we see that: Large margin γ ⇒ small VC dimension of H_γ. This is the benefit of the VC-dimension view. Consequently, our goal could be to find a separating hyperplane w that maximizes the margin of the set S of examples. A second observation that drives an algorithmic approach is that: Small ||w|| ⇒ large margin. This leads to an algorithm: from among all those w’s that agree with the data, find the one with the minimal size ||w||.
Maximal Margin This discussion motivates the notion of a maximal margin. Recall (PS0, PS1) that the distance between a point x and the hyperplane defined by (w, b) is |wᵀx + b| / ||w||. The maximal margin of a data set S is defined as: γ(S) = max_{||w||=1} min_{(x,y)∈S} |y wᵀx|. That is: 1. For a given w, find the closest point. 2. Then, find the w (of size 1) that gives the maximal margin value: argmax_{||w||=1} min_{(x,y)∈S} |y wᵀx|. Note: the selection of the point is in the min, and therefore the max does not change if we scale w, so it’s okay to deal only with normalized w’s. How does it help us to derive these h’s?
Margin and VC dimension Theorem (Vapnik): If H_γ is the space of all linear classifiers in ℝⁿ that separate the training data with margin at least γ, then VC(H_γ) ≤ R²/γ², where R is the radius of the smallest sphere (in ℝⁿ) that contains the data. (This one we take on faith.) This is the first observation that will lead to an algorithmic approach. The second observation, which we will prove, is that: Small ||w|| ⇒ large margin. Consequently, the algorithm will be: from among all those w’s that agree with the data, find the one with the minimal size ||w||.
Hard SVM We want to choose the hyperplane that achieves the largest margin. That is, given a data set S, find: w* = argmax_{||w||=1} min_{(x,y)∈S} |y wᵀx|. How do we find this w*? Claim: Define w0 to be the solution of the optimization problem: w0 = argmin {||w||² : ∀(x,y) ∈ S, y wᵀx ≥ 1}. Then: w0/||w0|| = argmax_{||w||=1} min_{(x,y)∈S} y wᵀx. That is, the normalization of w0 corresponds to the largest-margin separating hyperplane. The original formulation says: 1. For a given w, find the closest point. 2. Among all w’s (of size 1), find the one that maximizes this point’s margin. (The selection of the point is in the min, and therefore the largest-margin w does not change if we scale w, so it’s okay to deal only with normalized w’s.) The new formulation says: 1. Consider the set of “good” w’s (those that separate the data with y wᵀx ≥ 1). 2. Among those, choose the one with minimal size.
Hard SVM (2) Claim: Define w0 to be the solution of the optimization problem: w0 = argmin {||w||² : ∀(x,y) ∈ S, y wᵀx ≥ 1} (**). Then: w0/||w0|| = argmax_{||w||=1} min_{(x,y)∈S} y wᵀx. That is, the normalization of w0 corresponds to the largest-margin separating hyperplane. Proof: Define w’ = w0/||w0|| and let w* be the largest-margin separating hyperplane of size 1. We need to show that w’ = w*. Note first that w*/γ(S) satisfies the constraints in (**); therefore, by the definition of w0: ||w0|| ≤ ||w*/γ(S)|| = 1/γ(S). Consequently, for all (x,y) ∈ S: y w’ᵀx = (1/||w0||) y w0ᵀx [definition of w’] ≥ 1/||w0|| [constraints in (**), definition of w0] ≥ γ(S) [previous inequality]. But since ||w’|| = 1, this implies that w’ corresponds to the largest margin, that is, w’ = w*.
Margin of a Separating Hyperplane Assumption: the data is linearly separable. A separating hyperplane: wᵀx + b = 0, with wᵀx_i + b ≥ 1 if y_i = 1 and wᵀx_i + b ≤ −1 if y_i = −1; that is, y_i(wᵀx_i + b) ≥ 1. Let x0 be a point on the hyperplane wᵀx + b = 1. Then its distance to the separating plane wᵀx + b = 0 is |wᵀx0 + b| / ||w|| = 1/||w||, so the distance between wᵀx + b = +1 and wᵀx + b = −1 is 2/||w||. What we did: consider all possible w with different angles; scale w such that the constraints are tight; pick the one with the largest margin / minimal size.
Hard SVM Optimization We have shown that the sought-after weight vector w is the solution of the following optimization problem: SVM Optimization (***): Minimize ½||w||² subject to ∀(x,y) ∈ S: y wᵀx ≥ 1. This is a quadratic optimization problem in (n+1) variables, with |S| = m inequality constraints. It has a unique solution.
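As a sanity check, (***) can be handed to a generic solver. The sketch below uses scipy.optimize.minimize with SLSQP on hypothetical toy data; real SVM packages use specialized QP, SGD, or coordinate-descent solvers instead.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical linearly separable data in R^2, labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
Xa = np.hstack([X, np.ones((len(X), 1))])  # augment: w <- (w, b)

# Minimize (1/2)||w||^2 subject to y_i (w^T x_i + b) >= 1.
objective = lambda v: 0.5 * np.dot(v[:-1], v[:-1])  # bias not regularized, as in (***)
constraints = [{'type': 'ineq', 'fun': (lambda v, i=i: y[i] * np.dot(Xa[i], v) - 1.0)}
               for i in range(len(Xa))]

res = minimize(objective, x0=np.zeros(Xa.shape[1]),
               constraints=constraints, method='SLSQP')
w, b = res.x[:-1], res.x[-1]
print(w, b, 1.0 / np.linalg.norm(w))  # weight vector, bias, achieved margin
```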
Maximal Margin The margin of a linear separator wᵀx + b = 0 is 2/||w||, so max 2/||w|| = min ||w|| = min ½wᵀw. This gives: min_{w,b} ½wᵀw s.t. y_i(wᵀx_i + b) ≥ 1, ∀(x_i, y_i) ∈ S.
Support Vector Machines The name “Support Vector Machine” stems from the fact that w* is supported by (i.e., is in the linear span of) the examples that are exactly at a distance 1/||w*|| from the separating hyperplane. These vectors are therefore called support vectors. Theorem: Let w* be the minimizer of the SVM optimization problem (***) for S = {(x_i, y_i)}. Let I = {i : y_i w*ᵀx_i = 1}. Then there exist coefficients α_i > 0 such that: w* = Σ_{i∈I} α_i y_i x_i. This representation should ring a bell…
(recap) Kernel Perceptron If n’ is large, we cannot represent w explicitly. However, the weight vector w can be written as a linear combination of examples: w = Σ_j α_j y^(j) x^(j), where α_j is the number of mistakes made on x^(j). Then we can compute f(x) based on {x^(j)} and α.
(recap) Kernel Perceptron In the training phase, we initialize α to the all-zeros vector. For a training example (x^(k), y^(k)), instead of using the original Perceptron update rule in the ℝ^{n’} space, we maintain α based on the relationship between w and α: on a mistake on (x^(k), y^(k)), increment α_k.
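A compact sketch of this mistake-driven training loop (the quadratic kernel at the end is just one hypothetical choice of K):

```python
import numpy as np

def kernel_perceptron(X, y, K, epochs=10):
    # alpha[j] counts the number of mistakes made on example j.
    m = len(X)
    alpha = np.zeros(m)
    for _ in range(epochs):
        for k in range(m):
            # f(x_k) = sum_j alpha_j y_j K(x_j, x_k): w is never formed explicitly.
            score = sum(alpha[j] * y[j] * K(X[j], X[k]) for j in range(m))
            if y[k] * score <= 0:   # mistake on example k
                alpha[k] += 1
    return alpha

K = lambda u, v: (1.0 + np.dot(u, v)) ** 2   # hypothetical quadratic kernel
```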
Duality This, and other properties of Support Vector Machines, are shown by moving to the dual problem. Theorem: Let w* be the minimizer of the SVM optimization problem (***) for S = {(x_i, y_i)}. Let I = {i : y_i(w*ᵀx_i + b) = 1}. Then there exist coefficients α_i > 0 such that: w* = Σ_{i∈I} α_i y_i x_i.
Footnote about the threshold Similar to Perceptron, we can augment vectors to handle the bias term: x ← (x, 1); w ← (w, b), so that the augmented wᵀx equals wᵀx + b. Then consider the following formulation: min_w ½wᵀw s.t. y_i wᵀx_i ≥ 1, ∀(x_i, y_i) ∈ S. However, this formulation is slightly different from (***), because it is equivalent to: min_{w,b} ½wᵀw + ½b² s.t. y_i(wᵀx_i + b) ≥ 1, ∀(x_i, y_i) ∈ S. That is, the bias term is included in the regularization. This usually doesn’t matter. For simplicity, we ignore the bias term.
Key Issues Computational issues: training of an SVM used to be very time consuming – solving a quadratic program. Modern methods are based on Stochastic Gradient Descent and Coordinate Descent. Is it really optimal? Is the objective function we are optimizing the “right” one?
Real Data (Figure: histogram of the distances of points from the hyperplane, for a 17,000-dimensional context-sensitive spelling task.) In practice, even in the separable case, we may not want to depend on the points closest to the hyperplane but rather on the distribution of the distances. If only a few points are close, maybe we can dismiss them. This applies both to the generalization bounds and to the algorithm.
Soft SVM Notice that the relaxation of the constraint y_i wᵀx_i ≥ 1 can be done by introducing a slack variable ξ_i (per example) and requiring: y_i wᵀx_i ≥ 1 − ξ_i; ξ_i ≥ 0. Now we want to solve: min_{w,ξ} ½wᵀw + C Σ_i ξ_i s.t. y_i wᵀx_i ≥ 1 − ξ_i; ξ_i ≥ 0, ∀i. A large value of C means that misclassifications are bad – resulting in smaller margins and less training error (but more expected true error). A small C results in more training error but, hopefully, better true error. (What is the value of the slack variable when a data point is classified correctly?)
Soft SVM (2) Now we want to solve: min_{w,ξ} ½wᵀw + C Σ_i ξ_i s.t. y_i wᵀx_i ≥ 1 − ξ_i; ξ_i ≥ 0, ∀i. Equivalently: min_{w,ξ} ½wᵀw + C Σ_i ξ_i s.t. ξ_i ≥ 1 − y_i wᵀx_i; ξ_i ≥ 0, ∀i. At the optimum, ξ_i = max(0, 1 − y_i wᵀx_i), so this can be written as: min_w ½wᵀw + C Σ_i max(0, 1 − y_i wᵀx_i). What is the interpretation of this?
Soft SVM (3) The hard SVM formulation assumes linearly separable data. A natural relaxation: maximize the margin while minimizing the number of examples that violate the margin (separability) constraints. However, this leads to a non-convex problem that is hard to solve. Instead, we move to a surrogate loss function that is convex. SVM relies on the hinge loss (note that the dual formulation can give some intuition for that too): min_w ½||w||² + C Σ_{(x,y)∈S} max(0, 1 − y wᵀx), where the parameter C controls the tradeoff between a large margin (small ||w||) and a small hinge loss.
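A direct numpy transcription of this objective (a sketch; the function name is ours):

```python
import numpy as np

def soft_svm_objective(w, X, y, C):
    # (1/2)||w||^2 + C * sum_i max(0, 1 - y_i w^T x_i)
    margins = y * (X @ w)
    hinge = np.maximum(0.0, 1.0 - margins)  # zero for points with margin >= 1
    return 0.5 * np.dot(w, w) + C * np.sum(hinge)
```

Note that examples classified correctly with margin at least 1 contribute nothing to the loss, matching the observation that their slack variables are zero.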
SVM Objective Function The problem we solved is: min ½||w||² + C Σ_i ξ_i, where ξ_i ≥ 0 is called a slack variable and is defined by ξ_i = max(0, 1 − y_i wᵀx_i). Equivalently, we can say that: y_i wᵀx_i ≥ 1 − ξ_i; ξ_i ≥ 0. This has the general form of a learning algorithm: minimize the empirical loss, and regularize (to avoid overfitting). Here, ½||w||² is the regularization term (it can be replaced by other regularization functions) and C Σ_i ξ_i is the empirical loss (it can be replaced by other loss functions). This is a theoretically motivated improvement over the original algorithms we have seen at the beginning of the semester.
Balance between regularization and empirical loss
Balance between regularization and empirical loss (DEMO)
Underfitting and Overfitting (Figure: expected error vs. model complexity; bias decreases and variance increases with complexity, with underfitting on the left and overfitting on the right.) Simple models: high bias and low variance. Complex models: high variance and low bias. A smaller C corresponds to a simpler model; a larger C to a more complex one.
What Do We Optimize?
What Do We Optimize (2)? We get an unconstrained problem. We can use the gradient descent algorithm! However, it is quite slow. There are many other methods: iterative scaling; nonlinear conjugate gradient; quasi-Newton methods; truncated Newton methods; trust-region Newton methods. All of these are iterative methods that generate a sequence w_k converging to the optimal solution of the optimization problem above. Currently, limited-memory BFGS is very popular.
Optimization: How to Solve 1. Earlier methods used Quadratic Programming. Very slow. 2. The soft SVM problem is an unconstrained optimization problem. It is possible to use the gradient descent algorithm! Still, it is quite slow. Many options within this category: iterative scaling; nonlinear conjugate gradient; quasi-Newton methods; truncated Newton methods; trust-region Newton methods. All of these are iterative methods that generate a sequence w_k converging to the optimal solution. Currently, limited-memory BFGS is very popular. 3. Third-generation algorithms are based on Stochastic Gradient Descent. The runtime does not depend on n = #(examples); this is an advantage when n is very large. The stopping criterion is a problem: the method tends to be too aggressive at the beginning and reaches a moderate accuracy quite fast, but its convergence becomes slow if we are interested in more accurate solutions. 4. Dual Coordinate Descent (and a stochastic version).
SGD for SVM Goal: min_w f(w) ≡ ½wᵀw + (C/m) Σ_i max(0, 1 − y_i wᵀx_i), where m is the data size. (The 1/m factor is there for mathematical correctness; it does not matter from the modeling point of view.) A sub-gradient of f(w) is: ∇f(w) = w − C y_i x_i if 1 − y_i wᵀx_i ≥ 0; otherwise ∇f(w) = w. The algorithm: 1. Initialize w = 0 ∈ ℝⁿ. 2. For every example (x_i, y_i) ∈ D: if y_i wᵀx_i ≤ 1, update the weight vector to w ← (1 − γ)w + γC y_i x_i (γ is the learning rate); otherwise w ← (1 − γ)w. 3. Continue until convergence is achieved. Convergence can be proved for a slightly more complicated version of SGD (e.g., Pegasos). This algorithm should ring a bell…
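A minimal sketch of this update loop with a fixed learning rate (Pegasos instead uses a decaying rate and a projection step, which is what its convergence proof relies on):

```python
import numpy as np

def sgd_svm(X, y, C=1.0, lr=0.01, epochs=100):
    # Subgradient step: shrink w toward 0; add lr*C*y_i*x_i only when the margin is violated.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in np.random.permutation(len(X)):
            if y[i] * np.dot(w, X[i]) <= 1:
                w = (1 - lr) * w + lr * C * y[i] * X[i]
            else:
                w = (1 - lr) * w
    return w
```

Dropping the (1 − γ) shrinkage leaves exactly the Perceptron update rule – that is the bell this algorithm should ring.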
Nonlinear SVM We can map the data to a high-dimensional space: x → φ(x) (DEMO), and then use the kernel trick: K(x_i, x_j) = φ(x_i)ᵀφ(x_j) (DEMO2). Primal: min_{w,ξ} ½wᵀw + C Σ_i ξ_i s.t. y_i wᵀφ(x_i) ≥ 1 − ξ_i; ξ_i ≥ 0, ∀i. Dual: min_α ½αᵀQα − eᵀα s.t. 0 ≤ α_i ≤ C, ∀i, where Q_ij = y_i y_j K(x_i, x_j). Theorem: Let w* be the minimizer of the primal problem and α* the minimizer of the dual problem. Then w* = Σ_i α*_i y_i φ(x_i).
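For reference, a hedged sketch of solving the kernelized dual with scikit-learn (assuming the library is available; the XOR-style data is hypothetical):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)  # not linearly separable
y = np.array([1, 1, -1, -1])

clf = SVC(kernel='rbf', C=10.0)  # solves the dual above with K = RBF kernel
clf.fit(X, y)
print(clf.support_)    # indices of the support vectors
print(clf.dual_coef_)  # alpha_i * y_i for the support vectors
print(clf.predict(X))
```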
Nonlinear SVM There is a tradeoff between training time and accuracy, and between a complex model and a simple one. From: http://www.csie.ntu.edu.tw/~cjlin/papers/lowpoly_journal.pdf
PEGASOS Pegasos is a subgradient method for the SVM objective that uses a subset A_t of the data at each step: with A_t = S it is the (projected) subgradient method; with |A_t| = 1 it is stochastic gradient. LibLinear, from the LibSVM authors, implements similar ideas in the dual space. At this point these work only for linear kernels – not a real problem.
Support Vector Machines – Short Summary SVM = Linear Classifier + Regularization + [Kernel Trick]. Typically we are looking for the best w: w_opt = argmin_w Σ_{i=1}^m L(f(x_i, w, b), y_i), where L is a loss function that penalizes f when it makes a mistake. For example, L can be the 0-1 loss: L(y’, y) = ½|y − sgn y’| (y, y’ ∈ {−1, 1}). Just optimizing with respect to the training data may lead to overfitting. We want to achieve good performance on the training data while maintaining some notion of a “simple” hypothesis. We therefore change our goal to: w_opt = argmin_w Σ_{i=1}^m L(f(x_i, w, b), y_i) + λR(w), where L is a loss function and R is a regularization function, which penalizes a w that results in a model that is too complex (high VC dimension). λ determines the tradeoff between the empirical error and the regularization.
Support Vector Machines – Short Version SVM = Linear Classifier + Regularization + [Kernel Trick]. This leads to an algorithm: from among all those w’s that agree with the data, find the one with the minimal size ||w||: Minimize ½||w||² subject to y(w·x + b) ≥ 1, for all x ∈ S. This is an optimization problem that can be solved using techniques from optimization theory. By introducing Lagrange multipliers we can rewrite the dual formulation of this optimization problem as: w = Σ_i α_i y_i x_i, where the α’s are such that the following functional is maximized: L(α) = −½ Σ_i Σ_j α_i α_j y_i y_j x_iᵀx_j + Σ_i α_i. The optimal setting of the α’s turns out to satisfy: α_i [y_i(w·x_i + b) − 1] = 0, ∀i.
Support Vector Machines – Short Version SVM = Linear Classifier + Regularization + [Kernel Trick]. Minimize ½||w||² subject to y(w·x + b) ≥ 1, for all x ∈ S. The dual formulation of this optimization problem gives: w = Σ_i α_i y_i x_i, with the optimal α’s satisfying α_i [y_i(w·x_i + b) − 1] = 0, ∀i. That is, α_i ≠ 0 only when y_i(w·x_i + b) = 1. Those are the points sitting on the margin, called support vectors. We get: f(x, w, b) = w·x + b = Σ_i α_i y_i x_i·x + b. The value of the function depends on the support vectors, and only on their dot product with the point of interest x. This dependence on the dot product leads to the ability to introduce kernels (just like in Perceptron). What if the data is not linearly separable? What is the difference from regularized Perceptron/Winnow?
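A small sketch of this prediction rule; replacing the dot product with a kernel K gives the nonlinear version:

```python
import numpy as np

def svm_predict(x, sv_x, sv_y, alpha, b):
    # f(x) = sum_{i in SV} alpha_i y_i (x_i . x) + b -- only support vectors appear.
    return np.sign(sum(a * yi * np.dot(xi, x)
                       for a, yi, xi in zip(alpha, sv_y, sv_x)) + b)
```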
SVM Notes Chapter 5 of “An Introduction to Support Vector Machines” [Cristianini and Shawe-Taylor] provides an introduction to the theory of optimization required in order to understand SVMs.
Optimization We are interested in finding the minimum or maximum of a function, typically subject to some constraints. The general problem is: Minimize f(w), w ∈ ℝⁿ, subject to g_i(w) ≤ 0, i = 1,…,k and h_i(w) = 0, i = 1,…,m. An inequality constraint g_i(w) ≤ 0 is called active if the solution w* satisfies g_i(w*) = 0. An inequality constraint can be transformed into an equality constraint by introducing a slack variable: g_i(w) ≤ 0 ≡ g_i(w) + ξ_i = 0, ξ_i ≥ 0.
Optimization Problems An optimization problem in which the objective function and the constraints are linear is called a linear program. If the objective function is quadratic and the constraints linear, it is called a quadratic program. If the objective function, the constraints, and the domain of the functions are convex, the problem is a convex optimization problem.
Convexity A real-valued function f(w) is called convex for w ∈ ℝⁿ if, ∀w, u ∈ ℝⁿ and for any a ∈ (0,1): f(aw + (1−a)u) ≤ a f(w) + (1−a) f(u). Claim: If a function is convex, any local minimum w* of the unconstrained optimization problem with objective function f is also a global minimum. Proof: Take any u ≠ w*. By the definition of a local minimum, there exists a ∈ (0,1) sufficiently close to 1 such that: f(w*) ≤ f(aw* + (1−a)u) ≤ a f(w*) + (1−a) f(u), where the first inequality is a result of the local-minimum assumption and the second is from the convexity assumption. But that implies (algebra) that f(w*) ≤ f(u). This is important: it renders optimization problems tractable when the functions involved are convex.
Solving Optimization Problems (1) Theorem 1: A necessary condition for w* to be a minimum of a differentiable function f is df(w*)/dw = 0. This condition, together with convexity, is also a sufficient condition.
Constrained Optimization Problems These are solved by converting constrained problems to unconstrained ones. When the problems are constrained, there is a need to define the Lagrangian function, which incorporates information about both the objective function and the constraints, and which can be used to detect solutions. The Lagrangian is defined as the objective function plus a linear combination of the constraints, where the coefficients of the combination are called the Lagrange multipliers. Given an optimization problem with objective function f(w) and equality constraints h_i(w) = 0, i = 1,…,m, we define the Lagrangian function as: L(w, β) = f(w) + Σ_{i=1}^m β_i h_i(w), where the coefficients β_i are called the Lagrange multipliers.
Constrained Optimization Problems (2) If w* is a local minimum for an optimization problem with only equality constraints, it is possible that df(w*)/dw ≠ 0 (so, without constraints it would not be a candidate for an extreme point), but the direction in which we could move to reduce f would violate one or more of the constraints. In order to respect the equality constraints h_i we must move in directions that are perpendicular to dh_i(w*)/dw, that is, directions that do not change the constraints. Therefore, in order to respect all the constraints we must move perpendicular to the subspace V spanned by dh_i(w*)/dw (i = 1,…,m). So, we make the constrained problem an unconstrained one in which we require that there exist β_i such that: df(w*)/dw + Σ_{i=1}^m β_i dh_i(w*)/dw = 0.
Constrained Optimization Problems (3) Theorem 2 (Lagrange): A necessary condition for w* to be a minimum of an optimization problem with objective f(w) and equality constraints h_i(w) = 0, i = 1,…,m, is: df(w*)/dw + Σ_{i=1}^m β_i* dh_i(w*)/dw = 0, for some values β*. These conditions are also sufficient provided that L(w, β*) is a convex function of w. Another way to write this is: dL(w*, β*)/dw = 0 and dL(w*, β*)/dβ = 0. The first condition results in a new system of equations; the second returns the equality constraints. By jointly solving the two systems, one obtains the solution.
Lagrange Theorem: Example Consider the standard problem of finding the largest-volume box with a given surface area. Let the sides of the box be u, v, w. Optimization problem: Minimize −uvw subject to uv + wu + vw = c. The Lagrangian: L = −uvw + λ(uv + wu + vw − c). Necessary conditions: dL/dw = −uv + λ(u + v) = 0; dL/du = −vw + λ(v + w) = 0; dL/dv = −uw + λ(u + w) = 0. Solving the equations gives u = v = w, and the constraint then implies that the only non-trivial solution is u = v = w = (c/3)^{1/2}.
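A quick symbolic check of this example (a sketch, assuming sympy is available):

```python
import sympy as sp

u, v, w, lam = sp.symbols('u v w lam', positive=True)
c = sp.Symbol('c', positive=True)

L = -u*v*w + lam*(u*v + w*u + v*w - c)           # the Lagrangian above
eqs = [sp.diff(L, s) for s in (u, v, w)] + [u*v + w*u + v*w - c]
print(sp.solve(eqs, [u, v, w, lam], dict=True))  # u = v = w = sqrt(c/3)
```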
Example (Maximum Entropy Distribution) Suppose we wish to find the discrete probability distribution p = (p_1, p_2, …, p_n) with maximal entropy H(p) = −Σ_{i=1}^n p_i log p_i. The optimization problem: minimize −H(p), subject to Σ_{i=1}^n p_i = 1. The Lagrangian: L(p, β) = Σ_{i=1}^n p_i log p_i + β(Σ_{i=1}^n p_i − 1). Differentiating, we get: log(e p_i) + β = 0 (i = 1,…,n), which implies that all the p_i are the same: p = (1/n, …, 1/n). It is not hard to show that the function −H(p) is convex, so the uniform distribution is the global solution.
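The same conclusion can be verified numerically (a sketch using scipy; the starting point is a random distribution):

```python
import numpy as np
from scipy.optimize import minimize

n = 5
neg_entropy = lambda p: float(np.sum(p * np.log(p)))        # -H(p), to be minimized
constraints = [{'type': 'eq', 'fun': lambda p: np.sum(p) - 1.0}]
bounds = [(1e-9, 1.0)] * n

res = minimize(neg_entropy, x0=np.random.dirichlet(np.ones(n)),
               bounds=bounds, constraints=constraints)
print(res.x)   # approaches the uniform distribution (0.2, 0.2, 0.2, 0.2, 0.2)
```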
General Optimization Problems We now move to the general case, where the optimization problem contains both equality and inequality constraints. Define the generalized Lagrangian: L(w, α, β) = f(w) + Σ_{i=1}^k α_i g_i(w) + Σ_{i=1}^m β_i h_i(w). Theorem 3 (Kuhn-Tucker): Given an optimization problem on ℝⁿ: Minimize f(w), w ∈ ℝⁿ, subject to g_i(w) ≤ 0, i = 1,…,k and h_i(w) = 0, i = 1,…,m, necessary and sufficient conditions (under convexity, as in Theorem 2) for w* to be an optimum are the existence of α*, β* such that: dL(w*, α*, β*)/dw = dL(w*, α*, β*)/dβ = 0, and α_i* g_i(w*) = 0, g_i(w*) ≤ 0, α_i* ≥ 0, for i = 1,…,k.
General Optimization Problems (2) The last conditions are called the KKT (Karush-Kuhn-Tucker) conditions: α_i* g_i(w*) = 0, g_i(w*) ≤ 0, α_i* ≥ 0, for i = 1,…,k. They imply that solutions can be in one of two positions with respect to each inequality constraint. Either the solution is in the interior of the feasible region, so that the constraint is inactive (g_i(w*) < 0), or it is on the boundary defined by that constraint, where the constraint is active (g_i(w*) = 0). In the first case, the α_i must be zero. In the second case, α_i can be non-zero. This condition gives rise to the notion of support vectors.