Discriminative Machine Learning, Topic 3: SVM Duality. M. Pawan Kumar (based on Prof. A. Zisserman's course material). Slides available online.

Linear Classifier: The training loss is 0, so a linear classifier is an appropriate choice. [Figure: linearly separable data in the (x1, x2) plane]

Linear Classifier: The training loss is small, so a linear classifier is an appropriate choice. [Figures: two data sets in the (x1, x2) plane]

Linear Classifier: When the training loss is small, a linear classifier is an appropriate choice; when the training loss is large, a linear classifier is not an appropriate choice. [Figures: two data sets in the (x1, x2) plane]

Linear Classifier: When the training loss is small, a linear classifier is an appropriate choice; when the training loss is large, the feature vector is not appropriate. [Figures: two data sets in the (x1, x2) plane]

Feature Vector: We were using Φ(x) = [x1, x2]. Instead, let us use Φ(x) = [x1², x2², √2 x1x2]. [Figures: the data in the original (x1, x2) space and in the mapped (x1², x2², √2 x1x2) space]
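As a concrete illustration, here is a minimal sketch of this quadratic feature map (the function name phi and the toy points are illustrative, not from the slides):

```python
import numpy as np

def phi(x):
    """Quadratic feature map from the slides: (x1, x2) -> (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

# A point inside a circle and a point outside it become linearly separable in the
# mapped space, because the circle x1^2 + x2^2 = r^2 turns into a hyperplane there.
print(phi(np.array([0.5, 0.5])))   # inside the circle
print(phi(np.array([2.0, -1.0])))  # outside the circle
```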

Feature Vector: Use a D-dimensional feature vector; the parameters will also be D-dimensional. For a large D the data may be linearly separable, giving accurate classification, but there is a large number of parameters to learn, giving inefficient optimization. Can we somehow avoid this when D >> n?

Outline: Reformulation (Examples, SVM Learning Problem); SVM Dual; Kernels

Optimization – Simple Example:
min ξ s.t. ξ ≥ 2, ξ ≥ 4
Any feasible ξ must satisfy both constraints, so we have to use the maximum lower bound: the optimum is ξ = 4. Let us make this a bit more abstract.

Constrained Optimization – Example:
min ξ s.t. ξ ≥ w1 + w2, ξ ≥ 2w1 + 3w2
Again we have to use the maximum lower bound: the optimum is ξ = max{w1 + w2, 2w1 + 3w2}. Let us consider the other direction.

Unconstrained Optimization – Example:
min_w f(w1, w2) + max{w1 + w2, 2w1 + 3w2}

Unconstrained Optimization – Example:
min_{w,ξ} f(w1, w2) + ξ s.t. ξ ≥ w1 + w2, ξ ≥ 2w1 + 3w2
This is an equivalent constrained optimization problem. We will call ξ a slack variable. We now reformulate the SVM learning problem in the same way.
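To make the equivalence concrete, here is a small sketch (using scipy's linprog, an assumed choice of solver; the test point w is arbitrary): for any fixed w, the inner minimization over the slack ξ recovers exactly max{w1 + w2, 2w1 + 3w2}, so the constrained and unconstrained forms have the same value.

```python
import numpy as np
from scipy.optimize import linprog

# For a fixed w, the inner problem  min_xi xi  s.t.  xi >= w1 + w2,  xi >= 2*w1 + 3*w2
# is a tiny linear program whose optimum is max{w1 + w2, 2*w1 + 3*w2}.
w = np.array([0.3, -1.2])
a, b = w[0] + w[1], 2 * w[0] + 3 * w[1]
res = linprog(c=[1.0], A_ub=[[-1.0], [-1.0]], b_ub=[-a, -b], bounds=[(None, None)])
assert np.isclose(res.x[0], max(a, b))  # the best slack equals the maximum lower bound
```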

Outline: Reformulation (Examples, SVM Learning Problem); SVM Dual; Kernels

SVM Learning Problem:
min_w λ||w||² + Σ_i [ max_y { w^T Ψ(x_i, y) + Δ(y_i, y) } − w^T Ψ(x_i, y_i) ]

SVM Learning Problem:
min_{w,ξ} λ||w||² + Σ_i ξ_i
s.t. w^T Ψ(x_i, y) + Δ(y_i, y) − w^T Ψ(x_i, y_i) ≤ ξ_i for all i and all y
(Slight abuse of notation.)

SVM Learning Problem:
min_{w,ξ} λ||w||² + Σ_i ξ_i
s.t. w^T Ψ(x_i, y_i, y) + Δ(y_i, y) ≤ ξ_i for all i and all y
where (with a slight abuse of notation) Ψ(x_i, y_i, y) = Ψ(x_i, y) − Ψ(x_i, y_i). This is a convex quadratic program.

Convex Quadratic Program:
min_z z^T Q z + z^T q + C s.t. a_i^T z ≤ b_i, i = 1, …, m, with Q ⪰ 0
Many efficient solvers exist. But we already know how to optimize the SVM objective; the reformulation matters because it allows us to write down the dual.
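To illustrate "many efficient solvers", here is a minimal sketch using cvxpy (an assumed off-the-shelf modelling tool, not something prescribed by the slides; the toy Q, q, a_i, b_i are invented for illustration):

```python
import cvxpy as cp
import numpy as np

# Toy problem data (illustrative only); Q must be positive semidefinite.
Q = np.array([[2.0, 0.5], [0.5, 1.0]])
q = np.array([1.0, -1.0])
A = np.array([[1.0, 1.0], [-1.0, 2.0]])  # rows are the a_i
b = np.array([1.0, 2.0])

z = cp.Variable(2)
objective = cp.Minimize(cp.quad_form(z, Q) + q @ z)  # z^T Q z + z^T q (constant C omitted)
problem = cp.Problem(objective, [A @ z <= b])        # a_i^T z <= b_i for all i
problem.solve()
print(problem.value, z.value)
```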

Outline: Reformulation; SVM Dual (Example, Generalization); Kernels

Example:
min_{w,ξ} (w1² + w2²) + ξ s.t. 2w1 + w2 + 1 ≤ ξ, w1 + 2w2 + 1 ≤ ξ, 0 ≤ ξ
Since 0 ≤ ξ, we have (w1² + w2²) + 0 ≤ (w1² + w2²) + ξ: a lower bound on the objective.

Example: Minimize the lower bound (w1² + w2²) + 0 over w by setting the derivatives with respect to w to 0: w1 = 0, w2 = 0, giving a lower bound of 0.

Example:
min_{w,ξ} (w1² + w2²) + ξ s.t. 2w1 + w2 + 1 ≤ ξ, w1 + 2w2 + 1 ≤ ξ, 0 ≤ ξ
Multiplying each constraint by 1/3 and summing gives w1 + w2 + 2/3 ≤ ξ, so (w1² + w2²) + w1 + w2 + 2/3 is another lower bound on the objective.

Example: Minimize the lower bound (w1² + w2²) + w1 + w2 + 2/3 over w by setting the derivatives with respect to w to 0: w1 = −1/2, w2 = −1/2, giving a lower bound of 1/6. How can we find the maximum lower bound?

Example:
min_{w,ξ} (w1² + w2²) + ξ s.t. 2w1 + w2 + 1 ≤ ξ, w1 + 2w2 + 1 ≤ ξ, 0 ≤ ξ
Multiply the constraints by α1, α2, α3 respectively, with α1 ≥ 0, α2 ≥ 0, α3 ≥ 0 and α1 + α2 + α3 = 1, and sum:
(2α1 + α2)w1 + (α1 + 2α2)w2 + (α1 + α2) ≤ (α1 + α2 + α3)ξ

Example:
min_{w,ξ} (w1² + w2²) + ξ s.t. 2w1 + w2 + 1 ≤ ξ, w1 + 2w2 + 1 ≤ ξ, 0 ≤ ξ
Since α1 + α2 + α3 = 1, the combined constraint becomes
(2α1 + α2)w1 + (α1 + 2α2)w2 + (α1 + α2) ≤ ξ

Example: For any α1 ≥ 0, α2 ≥ 0, α3 ≥ 0 with α1 + α2 + α3 = 1, the quantity (w1² + w2²) + (2α1 + α2)w1 + (α1 + 2α2)w2 + (α1 + α2) is a lower bound on the objective. Minimizing it over w by setting the derivatives with respect to w to 0 gives w1 = −(2α1 + α2)/2, w2 = −(α1 + 2α2)/2.

Example: Substituting w1 = −(2α1 + α2)/2 and w2 = −(α1 + 2α2)/2, the maximum lower bound is
max_α −(2α1 + α2)²/4 − (α1 + 2α2)²/4 + (α1 + α2)
s.t. α1 ≥ 0, α2 ≥ 0, α3 ≥ 0, α1 + α2 + α3 = 1

Example: Equivalently, the dual problem is the convex quadratic program
min_α (2α1 + α2)²/4 + (α1 + 2α2)²/4 − (α1 + α2)
s.t. α1 ≥ 0, α2 ≥ 0, α3 ≥ 0, α1 + α2 + α3 = 1
Weak duality: the lower bound given by any feasible α is at most the value of the primal for any feasible w.

Example: Dual problem (convex quadratic program):
min_α (2α1 + α2)²/4 + (α1 + 2α2)²/4 − (α1 + α2)
s.t. α1 ≥ 0, α2 ≥ 0, α3 ≥ 0, α1 + α2 + α3 = 1
Strong duality: the lower bound at the optimal α equals the value of the primal at the optimal w.
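A small numerical sanity check of weak and strong duality for this example, sketched with cvxpy (the solver choice is an assumption, not part of the slides):

```python
import cvxpy as cp

# Primal: min (w1^2 + w2^2) + xi  s.t.  2*w1 + w2 + 1 <= xi,  w1 + 2*w2 + 1 <= xi,  0 <= xi
w = cp.Variable(2)
xi = cp.Variable()
primal = cp.Problem(
    cp.Minimize(cp.sum_squares(w) + xi),
    [2 * w[0] + w[1] + 1 <= xi, w[0] + 2 * w[1] + 1 <= xi, 0 <= xi],
)
primal.solve()

# Dual (minimization form): min (2a1+a2)^2/4 + (a1+2a2)^2/4 - (a1+a2)  over the simplex
a = cp.Variable(3, nonneg=True)
dual = cp.Problem(
    cp.Minimize(cp.square(2 * a[0] + a[1]) / 4 + cp.square(a[0] + 2 * a[1]) / 4 - (a[0] + a[1])),
    [cp.sum(a) == 1],
)
dual.solve()

# Strong duality: the primal optimum equals the negative of this minimization-form dual optimum.
print(primal.value, -dual.value)  # the two numbers should match up to solver tolerance
```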

Outline: Reformulation; SVM Dual (Example, Generalization); Kernels

SVM Learning Problem:
min_{w,ξ} λ||w||² + Σ_i ξ_i
s.t. w^T Ψ(x_i, y_i, y) + Δ(y_i, y) ≤ ξ_i for all i and all y, where Ψ(x_i, y_i, y) = Ψ(x_i, y) − Ψ(x_i, y_i)
Introduce multipliers α_i(y) with α_i(y) ≥ 0 for all i, y and Σ_y α_i(y) = 1 for all i.

SVM Learning Problem: Multiplying the constraints by α_i(y) and summing over y gives, for each i,
Σ_y α_i(y) (w^T Ψ(x_i, y_i, y) + Δ(y_i, y)) ≤ ξ_i
with α_i(y) ≥ 0 for all i, y and Σ_y α_i(y) = 1 for all i.

SVM Learning Problem: This makes λ||w||² + Σ_i Σ_y α_i(y) (w^T Ψ(x_i, y_i, y) + Δ(y_i, y)) a lower bound on the objective. Setting its derivatives with respect to w to 0 gives
w = −Σ_i Σ_y α_i(y) Ψ(x_i, y_i, y) / (2λ)
a linear combination of the joint feature vectors, for any α_i(y) ≥ 0 with Σ_y α_i(y) = 1 for all i.

SVM Learning Problem: Substituting w back in, the maximum lower bound is
max_α −(1/4λ) Σ_{i,j} Σ_{y,ŷ} α_i(y) α_j(ŷ) Q(i,y,j,ŷ) + Σ_i Σ_y α_i(y) Δ(y_i, y)
s.t. α_i(y) ≥ 0 for all i, y and Σ_y α_i(y) = 1 for all i
where Q(i,y,j,ŷ) = Ψ(x_i, y_i, y)^T Ψ(x_j, y_j, ŷ) and w = −Σ_i Σ_y α_i(y) Ψ(x_i, y_i, y) / (2λ).

SVM Learning Problem: Equivalently,
min_α (1/4λ) Σ_{i,j} Σ_{y,ŷ} α_i(y) α_j(ŷ) Q(i,y,j,ŷ) − Σ_i Σ_y α_i(y) Δ(y_i, y)
s.t. α_i(y) ≥ 0 for all i, y and Σ_y α_i(y) = 1 for all i
where Q(i,y,j,ŷ) = Ψ(x_i, y_i, y)^T Ψ(x_j, y_j, ŷ) and w = −Σ_i Σ_y α_i(y) Ψ(x_i, y_i, y) / (2λ).

SVM Dual Problem:
min_α (1/4λ) Σ_{i,j} Σ_{y,ŷ} α_i(y) α_j(ŷ) Q(i,y,j,ŷ) − Σ_i Σ_y α_i(y) Δ(y_i, y)
s.t. α_i(y) ≥ 0 for all i, y and Σ_y α_i(y) = 1 for all i
where Q(i,y,j,ŷ) = Ψ(x_i, y_i, y)^T Ψ(x_j, y_j, ŷ) and w = −Σ_i Σ_y α_i(y) Ψ(x_i, y_i, y) / (2λ).
How do we deal with high-dimensional features?

Outline: Reformulation; SVM Dual; Kernels (Prediction, Learning, Results)

Prediction: Write w = −Σ_i Σ_y α_i(y) Ψ(x_i, y_i, y) / (2λ) and define β_i(y) = α_i(y) / (2λ). We consider the multiclass SVM (M-SVM); binary classification is covered in the example sheet.

Prediction: w = −Σ_i Σ_y β_i(y) Ψ(x_i, y_i, y)

Prediction: Since Ψ(x_i, y_i, y) = Ψ(x_i, y) − Ψ(x_i, y_i),
w = Σ_i Σ_y β_i(y) (Ψ(x_i, y_i) − Ψ(x_i, y))
Given a test input x, the prediction is y(w) = argmax_ŷ w^T Ψ(x, ŷ).

Prediction: With w = Σ_i Σ_y β_i(y) (Ψ(x_i, y_i) − Ψ(x_i, y)), the score of a candidate label ŷ is w^T Ψ(x, ŷ).

Prediction: The score expands to
Σ_i Σ_y β_i(y) (Ψ(x_i, y_i) − Ψ(x_i, y))^T Ψ(x, ŷ)
so we only need to compute dot products of joint feature vectors. Let us take a closer look at one product term.

Dot Product: For the multiclass joint feature vector, Ψ(x, 1) = [Φ(x), 0, …], Ψ(x, 2) = [0, Φ(x), …], and so on, with Φ(x) placed in the block of the corresponding class. Hence Ψ(x_i, y)^T Ψ(x, ŷ) = 0 if y ≠ ŷ.

Dot Product: Similarly, Ψ(x_i, y)^T Ψ(x, ŷ) = Φ(x_i)^T Φ(x) if y = ŷ.

Dot Product: Since Ψ(x_i, y)^T Ψ(x, ŷ) = Φ(x_i)^T Φ(x) when y = ŷ (and 0 otherwise), we do not need the feature vector Φ(·) itself; we only need a function that computes the dot product: a kernel. Isn't that as expensive as computing the features? Naively, yes: it is an O(D) operation for D-dimensional features.

Kernel: For Φ(x) = [x1, x2], the corresponding kernel is k(x, x') = x1 x'1 + x2 x'2. [Figure: data in the (x1, x2) plane]

Kernel: For Φ(x) = [x1², x2², √2 x1x2], the corresponding kernel is k(x, x') = x1²(x'1)² + x2²(x'2)² + 2 x1 x'1 x2 x'2 = (x^T x')², which can be evaluated without ever forming Φ. [Figure: data in the (x1, x2) plane]
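A quick numerical check of this equivalence (the helper names phi and k_quad, and the test points, are mine):

```python
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

def k_quad(x, xp):
    # x1^2*x1'^2 + x2^2*x2'^2 + 2*x1*x1'*x2*x2', which simplifies to (x^T x')^2
    return float(x @ xp) ** 2

x, xp = np.array([1.5, -0.7]), np.array([0.2, 2.0])
assert np.isclose(phi(x) @ phi(xp), k_quad(x, xp))  # same dot product, no explicit features needed
```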

Kernel: k(x, x') = exp(−||x − x'||² / 2σ²) (the Gaussian or RBF kernel). The corresponding feature map is infinite dimensional. [Figure: data in the (x1, x2) plane]

Prediction – Summary: To predict y(w) = argmax_ŷ Σ_i Σ_y β_i(y) (Ψ(x_i, y_i) − Ψ(x_i, y))^T Ψ(x, ŷ): many of the dot products are 0; compute the non-zero dot products using kernels; compute the score for every possible ŷ; choose the maximum score to make the prediction.
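A minimal sketch of this kernelized prediction (the data structures X_train, y_train, beta, labels and the dense loops are illustrative assumptions; a real implementation would exploit the fact that most β_i(y) are zero):

```python
import numpy as np

def rbf(x, xp, sigma=1.0):
    return np.exp(-np.linalg.norm(x - xp) ** 2 / (2 * sigma ** 2))

def predict(x, X_train, y_train, beta, labels, kernel=rbf):
    """Return argmax over yhat of w^T Psi(x, yhat), using only kernel evaluations.

    beta[i][y] plays the role of beta_i(y); X_train, y_train, labels are assumed inputs.
    """
    scores = {}
    for yhat in labels:
        s = 0.0
        for i, (xi, yi) in enumerate(zip(X_train, y_train)):
            k_ix = kernel(x, xi)  # the only place the inputs enter: one kernel value
            for y in labels:
                # (Psi(x_i,y_i) - Psi(x_i,y))^T Psi(x,yhat) = k(x_i,x)*[y_i==yhat] - k(x_i,x)*[y==yhat]
                s += beta[i][y] * (k_ix * (yi == yhat) - k_ix * (y == yhat))
        scores[yhat] = s
    return max(scores, key=scores.get)
```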

Kernel: Commonly used kernels:
Linear: k(x, x') = x^T x'
Polynomial: k(x, x') = (1 + x^T x')^d, where Φ(·) has all polynomial terms up to degree d
Gaussian or RBF: k(x, x') = exp(−||x − x'||² / 2σ²), where Φ(·) is infinite dimensional
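The three kernels from this slide written out as a minimal Python sketch (scalar versions for clarity; in practice one would compute whole Gram matrices at once):

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp

def polynomial_kernel(x, xp, d=3):
    # Phi(.) contains all polynomial terms of the inputs up to degree d
    return (1.0 + x @ xp) ** d

def rbf_kernel(x, xp, sigma=1.0):
    # Phi(.) is infinite dimensional, but the kernel is a single scalar computation
    return np.exp(-np.linalg.norm(x - xp) ** 2 / (2 * sigma ** 2))
```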

Outline: Reformulation; SVM Dual; Kernels (Prediction, Learning, Results)

SVM Dual Problem:
min_α (1/4λ) Σ_{i,j} Σ_{y,ŷ} α_i(y) α_j(ŷ) Q(i,y,j,ŷ) − Σ_i Σ_y α_i(y) Δ(y_i, y)
s.t. α_i(y) ≥ 0 for all i, y and Σ_y α_i(y) = 1 for all i
where Q(i,y,j,ŷ) = Ψ(x_i, y_i, y)^T Ψ(x_j, y_j, ŷ). We need to compute Q, which only requires dot products: the kernel trick.

Computational Efficiency:
min_α (1/4λ) Σ_{i,j} Σ_{y,ŷ} α_i(y) α_j(ŷ) Q(i,y,j,ŷ) − Σ_i Σ_y α_i(y) Δ(y_i, y)
s.t. α_i(y) ≥ 0 for all i, y and Σ_y α_i(y) = 1 for all i
where Q(i,y,j,ŷ) = Ψ(x_i, y_i, y)^T Ψ(x_j, y_j, ŷ). Is this a convex quadratic program? Yes: Q ⪰ 0 whenever the kernel is a Mercer kernel.
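A quick empirical sanity check of the Mercer condition (a sketch, assuming an RBF kernel and a random sample; this does not prove the condition, it only checks that one Gram matrix is positive semidefinite):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))

# Gram matrix of the RBF kernel on the sample
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * 1.0 ** 2))

# All eigenvalues should be (numerically) non-negative for a Mercer kernel
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-9)
```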

Outline: Reformulation; SVM Dual; Kernels (Prediction, Learning, Results)

Data: The data is not linearly separable in the original space, so we use an RBF kernel.
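The plots on the following slides come from the course's own implementation; as a rough, hedged reproduction of the setup one could train an RBF-kernel SVM with scikit-learn (an assumed off-the-shelf choice, not what the slides use), noting that SVC's gamma corresponds to 1/(2σ²) and its C behaves roughly like 1/λ (the objectives are scaled differently):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Toy data that is not linearly separable in the original space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.1, random_state=0)

# RBF kernel: gamma = 1 / (2 * sigma^2); larger C means weaker regularization (smaller lambda)
sigma = 1.0
clf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma ** 2), C=10.0)
clf.fit(X, y)
print(clf.score(X, y), len(clf.support_))  # training accuracy and number of support vectors
```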

Results: σ = 1.0, λ = 0

Results: σ = 1.0, λ = 0.01. Increasing λ increases the margin.

Results: σ = 1.0, λ = 0.1

Results: σ = 1.0, λ = 0

Results: σ = 0.25, λ = 0

Results: σ = 0.1, λ = 0. How does σ affect prediction?

Results: σ = 0.1, λ = 0 (see the example sheet).

Questions?