Support Vector Machines Tao, Department of Computer Science, University of Illinois
Adapting content and slides from: Gentle Guide to Support Vector Machines, Ming-Hsuan Yang; Support Vector and Kernel Methods, Thorsten Joachims
Problem: find the optimal hyper-plane to classify data points. How do we choose this hyper-plane?
What is “optimal”? Intuition: to maximize the margin
What is “optimal”? Statistically: risk minimization. Risk function: Risk_P(h) = P(h(x) ≠ y) = ∫ Δ(h(x) ≠ y) dP(x,y), for h ∈ H, where h is the hyper-plane function, x is the input vector, y ∈ {1, −1}, and Δ is the indicator function. Minimization: h_opt = argmin_{h ∈ H} Risk_P(h)
In practice… Given N observations (x_i, y_i), with labels y_i ∈ {1, −1}, we look for a mapping x → f(x, α) ∈ {1, −1}. Expected risk vs. empirical risk (definitions below). Question: are the two consistent in terms of minimization, i.e., does minimizing the empirical risk also minimize the expected risk?
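For reference, the standard definitions of the two risks in Vapnik's formulation (the slide's own formulas are not preserved in this text, but are presumably equivalent):

R(\alpha) = \int \tfrac{1}{2}\,\lvert y - f(x,\alpha)\rvert \, dP(x,y)
\qquad
R_{\mathrm{emp}}(\alpha) = \frac{1}{2N} \sum_{i=1}^{N} \lvert y_i - f(x_i,\alpha)\rvert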
Vapnik/Chervonenkis (VC) dimension. Definition: the VC dimension of H is the maximum number h of examples that can be split into two sets in all 2^h ways using functions from H. Example: for hyper-planes in R^2 the VC dimension is 3 (in R^n it is n+1). But 4 points in R^2 cannot always be shattered; see the sketch below.
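A minimal sketch (not from the original slides) that checks shattering empirically: for every labeling of the points, it tests whether some hyper-plane w·x + b realizes it, posed as the linear feasibility problem "find (w, b) with y_i (w·x_i + b) ≥ 1 for all i".

import itertools
import numpy as np
from scipy.optimize import linprog

def shatterable(points):
    points = np.asarray(points, dtype=float)
    n, d = points.shape
    for labels in itertools.product([-1.0, 1.0], repeat=n):
        y = np.array(labels)
        # Constraints -y_i (w·x_i + b) <= -1 over the variables z = (w, b).
        A_ub = -y[:, None] * np.hstack([points, np.ones((n, 1))])
        b_ub = -np.ones(n)
        res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                      bounds=[(None, None)] * (d + 1))
        if not res.success:
            return False  # this labeling has no separating hyper-plane
    return True

print(shatterable([[0, 0], [1, 0], [0, 1]]))          # True: 3 points in R^2
print(shatterable([[0, 0], [1, 1], [0, 1], [1, 0]]))  # False: the XOR labeling fails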
Upper bound for the expected risk. The bound below holds with probability 1 − η, where h is the VC dimension. The first term is the training (empirical) error, the second term is the VC confidence; keeping both small avoids overfitting.
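The bound itself is not preserved in this text; the standard form of Vapnik's result, which the slide presumably showed, is:

R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha)
  + \sqrt{\frac{h\left(\ln\frac{2N}{h} + 1\right) - \ln\frac{\eta}{4}}{N}}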
Error vs. VC dimension
Want to minimize the expected risk? It is not enough just to minimize the empirical risk; we also need to choose an appropriate VC dimension, so that both parts of the bound are small. Solution: Structural Risk Minimization (SRM)
Structural risk minimization. Use a nested structure of hypothesis spaces H_1 ⊂ H_2 ⊂ …, with h(n) ≤ h(n+1), where h(n) is the VC dimension of H_n. This trades off VC dimension against empirical risk: pick the H_n whose minimum empirical risk, together with its VC confidence, minimizes the risk bound.
Linear SVM. Given x_i in R^n, the data are linearly separable if there exist w in R^n and b in R such that y_i (w ● x_i + b) ≥ 1. Scale (w, b) so that the distance from the closest point, say x_j, to the hyper-plane equals 1/||w||. Optimal separating hyper-plane (OSH): maximize 1/||w||.
Linear SVM example. Given (x, y), find (w, b) such that w ● x + b = 0, with the additional requirement min_i |w ● x_i + b| = 1. Decision function: f(x, w, b) = sgn(w ● x + b)
VC dimension upper bound. Lemma [Vapnik 1995]: let R be the radius of the smallest ball {x : ||x − a|| < R} covering all x, and let f_{w,b}(x) = sgn(w ● x + b) be the decision functions with ||w|| ≤ A. Then the VC dimension h < R^2 A^2 + 1. Note that ||w|| = 1/δ, where δ is the margin.
So… maximizing the margin δ ⇒ minimizing ||w|| ⇒ smallest acceptable VC dimension ⇒ constructing an optimal hyper-plane. Is everything clear? How do we actually do it? Quadratic programming!
Constrained quadratic programming. Minimize ½ (w ● w) subject to y_i (w ● x_i + b) ≥ 1. Solve it with Lagrange multipliers to find the saddle point. For more details, see the book An Introduction to Support Vector Machines.
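A minimal sketch of this step with scikit-learn (not part of the original slides): SVC solves the soft-margin dual described later, so a very large C approximates the hard-margin problem above; the toy data here are invented for illustration.

import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (made up for illustration).
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],
              [6.0, 5.5], [7.0, 7.0], [8.0, 6.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

# Large C ~ hard margin: minimize ½ w·w subject to y_i (w·x_i + b) >= 1.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin = 1/||w|| =", 1.0 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)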
What are “support vectors”? In y_i (w ● x_i + b) ≥ 1, most x_i satisfy the inequality strictly; the x_i that achieve equality are called support vectors.
Inseparable data
Soft margin classifier. Loosen the margin by introducing N nonnegative slack variables ξ = (ξ_1, ξ_2, …, ξ_N), so that y_i (w ● x_i + b) ≥ 1 − ξ_i. Problem: minimize ½ (w ● w) + C ∑ ξ_i subject to y_i (w ● x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0.
C and ξ. C: when C is small, we favor maximizing the minimum distance (a wide margin); when C is large, we favor minimizing the number of misclassified points. ξ_i: ξ_i > 1 means x_i is misclassified; 0 < ξ_i < 1 means x_i is correctly classified but closer to the hyper-plane than 1/||w||; ξ_i = 0 for margin vectors.
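A small illustration of the C tradeoff (not from the original slides; the data and C values are made up): a small C yields a wider margin and typically more support vectors, a large C yields a narrower margin that tries harder to classify every point correctly.

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Two slightly overlapping clusters (made up for illustration).
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C}: margin 1/||w|| = {1.0 / np.linalg.norm(w):.3f}, "
          f"#support vectors = {len(clf.support_)}, "
          f"training accuracy = {clf.score(X, y):.2f}")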
Nonlinear SVM (slide figure: data in R^2 mapped into R^3, where it becomes linearly separable)
Feature space. A mapping Φ takes the input space to a feature space: input attributes (a, b, c) are mapped by Φ to the features (a, b, c, aa, ab, ac, bb, bc, cc).
Problem: very many parameters! O(N^p) attributes in the feature space for N input attributes and a degree-p mapping. Solution: kernel methods!
Dual representation. Introduce Lagrange multipliers α_i ≥ 0, one per constraint, giving the Lagrangian L(w, b, α) = ½ (w ● w) − ∑ α_i [y_i (w ● x_i + b) − 1]. Require the derivatives with respect to w and b to vanish, which gives w = ∑ α_i y_i x_i and ∑ α_i y_i = 0, then substitute back into L.
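Carrying out the substitution gives the standard dual objective (the slide's own formula is not preserved in this text, but this is the usual result), to be maximized over α:

W(\alpha) = \sum_{i=1}^{N} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j \, y_i y_j \, (x_i \cdot x_j),
\qquad \alpha_i \ge 0, \quad \sum_{i=1}^{N} \alpha_i y_i = 0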
Constrained QP using the dual. In matrix form the quadratic term uses the N×N matrix D with D_{ij} = y_i y_j (x_i ● x_j). Observation: the only way the data points appear in the training problem is in the form of dot products x_i ● x_j.
Back to the nonlinear SVM… Originally the training problem uses x_i ● x_j; expanding to a high-dimensional space replaces this with Φ(x_i) ● Φ(x_j). Problem: computing Φ explicitly is expensive. Fortunately, we only ever need the dot product Φ(x_i) ● Φ(x_j).
Kernel function: K(x_i, x_j) = Φ(x_i) ● Φ(x_j). We can replace the dot product by K(x_i, x_j) without ever knowing Φ explicitly, and all the previous derivations for the linear SVM still hold.
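A quick numerical check of this identity (a sketch, not from the original slides): for the homogeneous degree-2 polynomial kernel K(u, v) = (u ● v)^2 on R^2, the explicit feature map Φ(x) = (x_1^2, x_2^2, √2 x_1 x_2) gives exactly the same value as the kernel.

import numpy as np

def phi(x):
    # Explicit degree-2 feature map for R^2: (x1^2, x2^2, sqrt(2)*x1*x2).
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def k(u, v):
    # Homogeneous degree-2 polynomial kernel.
    return float(np.dot(u, v)) ** 2

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

print(np.dot(phi(u), phi(v)))  # dot product computed in feature space
print(k(u, v))                 # identical value, without ever forming phi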
How do we decide whether a function is a valid kernel? Mercer's condition (necessary and sufficient): K(u, v) is symmetric and ∫∫ K(u, v) g(u) g(v) du dv ≥ 0 for every square-integrable g, i.e., K is positive semi-definite.
Some examples of kernel functions
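The specific examples on the original slide are not preserved in this text; commonly cited kernels of this kind are:

K(u, v) = (u \cdot v + 1)^p                                    % polynomial of degree p
K(u, v) = \exp\!\bigl(-\lVert u - v \rVert^2 / (2\sigma^2)\bigr) % Gaussian RBF
K(u, v) = \tanh(\kappa \, (u \cdot v) - \delta)                 % sigmoid (a Mercer kernel only for some κ, δ)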
Multiple classes (k). One-against-the-rest: k SVMs. One-against-one: k(k−1)/2 SVMs. Direct k-class SVM formulations. John Platt’s DAG method (DAGSVM).
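A brief sketch of the first two strategies using scikit-learn's wrappers (not from the original slides; the digits dataset is just a stand-in with k = 10 classes):

from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)  # k = 10 classes

# One-against-the-rest: k binary SVMs.
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
# One-against-one: k(k-1)/2 binary SVMs.
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

print(len(ovr.estimators_), "one-vs-rest SVMs")  # 10
print(len(ovo.estimators_), "one-vs-one SVMs")   # 45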
Application in text classification. Count each term in an article; an article therefore becomes a vector x. Attributes: terms; values: occurrence counts or frequencies. (Slide figure: an example article excerpt alongside its term-count vector.)
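A minimal end-to-end sketch of this setup (not from the original slides; the tiny corpus and labels are invented for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Invented toy corpus: label 1 = about regression, -1 = about classification.
docs = [
    "linear regression predicts a continuous target",
    "ridge regression is regularized linear regression",
    "classification assigns discrete class labels",
    "support vector machines are classifiers with maximum margin",
]
labels = [1, 1, -1, -1]

# Each article becomes a vector of term counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

clf = LinearSVC(C=1.0).fit(X, labels)
test = vectorizer.transform(["maximum margin classification"])
print(clf.predict(test))  # expected: [-1]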
Conclusions: linear SVM; VC dimension; soft margin classifier; dual representation; nonlinear SVM; kernel methods; multi-class classification
Thank you!