CS 478 – Tools for Machine Learning and Data Mining SVM

Maximal-Margin Classification (I)
Consider a 2-class problem in R^d
As needed (and without loss of generality), relabel the classes to -1 and +1
Suppose we have a separating hyperplane
– Its equation is: w·x + b = 0
– w is normal to the hyperplane
– |b|/||w|| is the perpendicular distance from the hyperplane to the origin
– ||w|| is the Euclidean norm of w
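A minimal NumPy sketch of the geometry above (the hyperplane and the test point are illustrative assumptions, not values from the lecture): it checks that |b|/||w|| is the hyperplane's distance from the origin, and that (w·x + b)/||w|| is the signed distance of any point x.

```python
import numpy as np

w = np.array([3.0, 4.0])   # assumed normal vector
b = -10.0                  # assumed offset

# Perpendicular distance from the hyperplane w.x + b = 0 to the origin
print(abs(b) / np.linalg.norm(w))        # 2.0, i.e., |b|/||w||

# Signed distance of an arbitrary point x to the hyperplane
x = np.array([2.0, 1.0])
print((w @ x + b) / np.linalg.norm(w))   # 0.0: this x lies on the hyperplane
```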

Maximal-Margin Classification (II)
We can certainly choose w and b in such a way that:
– w·x_i + b > 0 when y_i = +1
– w·x_i + b < 0 when y_i = -1
Rescaling w and b so that the closest points to the hyperplane satisfy |w·x_i + b| = 1, we can rewrite the above as:
– w·x_i + b ≥ +1 when y_i = +1   (1)
– w·x_i + b ≤ -1 when y_i = -1   (2)
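A small sketch of what the rescaled constraints (1) and (2) look like in code, using made-up toy data and a hand-picked (w, b): every training point should satisfy y_i(w·x_i + b) ≥ 1, with equality for the points closest to the hyperplane.

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])  # toy inputs
y = np.array([+1, +1, -1, -1])                                   # toy labels
w = np.array([0.5, 0.5])                                         # assumed rescaled normal
b = -1.0

print(y * (X @ w + b))   # [1. 2. 1. 1.5] -> all >= 1; the closest points hit exactly 1
```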

Maximal-Margin Classification (III)
Consider the case when (1) is an equality:
– w·x_i + b = +1   (H+), with normal w and distance |1-b|/||w|| from the origin
Similarly for (2):
– w·x_i + b = -1   (H-), with normal w and distance |-1-b|/||w|| from the origin
We now have two hyperplanes, both parallel to the original one

Maximal-Margin Classification (IV)

Maximal-Margin Classification (V)
Note that the points on H- and H+ are sufficient to define H- and H+, and therefore are sufficient to build a linear classifier
Define the margin as the distance between H- and H+
What would be a good choice for w and b?
– Maximize the margin

Maximal-Margin Classification (VI)
From the equations of H- and H+, we have:
– Margin = distance between H+ and H- = |(1-b) - (-1-b)|/||w|| = 2/||w||
So, we can maximize the margin by:
– Minimizing ||w||^2
– Subject to: y_i(w·x_i + b) - 1 ≥ 0 for all i (see (1) and (2) above)
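As a sketch only, the optimization stated above can be solved numerically with SciPy's SLSQP solver on a tiny toy set (a real SVM implementation would use a dedicated QP or SMO solver):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])  # toy data
y = np.array([+1.0, +1.0, -1.0, -1.0])

def objective(p):                 # p = (w1, w2, b); minimize ||w||^2
    w = p[:2]
    return w @ w

# One inequality constraint y_i (w.x_i + b) - 1 >= 0 per training instance
constraints = [{'type': 'ineq',
                'fun': lambda p, xi=xi, yi=yi: yi * (p[:2] @ xi + p[2]) - 1.0}
               for xi, yi in zip(X, y)]

res = minimize(objective, x0=np.zeros(3), constraints=constraints, method='SLSQP')
w, b = res.x[:2], res.x[2]
print(w, b)                       # approx. w = [0.5, 0.5], b = -1 for this toy set
```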

Minimizing ||w||^2
Use Lagrange multipliers α_i, one per constraint (i.e., one per training instance)
– For constraints of the form c_i ≥ 0 (as above), the constraint equations are multiplied by positive Lagrange multipliers and subtracted from the objective function
Hence, we have the primal Lagrangian:
– L_P = (1/2)||w||^2 - Σ_i α_i [y_i(w·x_i + b) - 1], with α_i ≥ 0
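A one-function sketch of the primal Lagrangian written out above (the function and argument names are illustrative):

```python
import numpy as np

def primal_lagrangian(w, b, alpha, X, y):
    """L_P = 0.5*||w||^2 - sum_i alpha_i * [ y_i (w.x_i + b) - 1 ]."""
    return 0.5 * (w @ w) - np.sum(alpha * (y * (X @ w + b) - 1.0))
```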

Maximizing L_D
It turns out, after some transformations beyond the scope of our discussion, that minimizing L_P is equivalent to maximizing the following dual Lagrangian:
– L_D = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j ⟨x_i, x_j⟩
– where ⟨x_i, x_j⟩ denotes the dot product x_i·x_j
– subject to: α_i ≥ 0 and Σ_i α_i y_i = 0
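A sketch of maximizing L_D numerically, again with SciPy on toy data (production SVMs use SMO or a specialized QP solver): minimize -L_D subject to α_i ≥ 0 and Σ_i α_i y_i = 0, then recover w and b from the support vectors (the points with α_i > 0).

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])  # toy data
y = np.array([+1.0, +1.0, -1.0, -1.0])
G = (y[:, None] * X) @ (y[:, None] * X).T    # G_ij = y_i y_j (x_i . x_j)

def neg_dual(alpha):                         # -L_D
    return 0.5 * alpha @ G @ alpha - alpha.sum()

res = minimize(neg_dual, x0=np.zeros(len(y)),
               bounds=[(0.0, None)] * len(y),
               constraints=[{'type': 'eq', 'fun': lambda a: a @ y}],
               method='SLSQP')
alpha = res.x
w = (alpha * y) @ X                          # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                            # support vectors
b = np.mean(y[sv] - X[sv] @ w)               # from y_i (w.x_i + b) = 1 on the support vectors
print(alpha, w, b)                           # nonzero alphas only for the closest points
```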

SVM Learning (I)
We could stop here and we would have a nice linear classification algorithm.
SVM goes one step further:
– It assumes that non-linearly separable problems in low dimensions may become linearly separable in higher dimensions (e.g., XOR)

SVM Learning (II)
SVM thus:
– Creates a non-linear mapping from the low-dimensional space to a higher-dimensional space
– Uses maximal-margin (MM) learning in the new space
Computation is efficient when "good" transformations are selected:
– The kernel trick
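A sketch of this idea for the XOR example mentioned earlier (the {-1, +1} toy encoding is an assumption): XOR is not linearly separable in 2-D, but adding the product feature x_1·x_2 as a third coordinate makes it linearly separable.

```python
import numpy as np

X = np.array([[-1, -1], [-1, +1], [+1, -1], [+1, +1]], dtype=float)
y = np.array([-1, +1, +1, -1])     # XOR-style labels

phi = np.c_[X, X[:, 0] * X[:, 1]]  # map (x1, x2) -> (x1, x2, x1*x2)

# In the mapped 3-D space, the hyperplane with w = (0, 0, -1), b = 0 separates the classes
w, b = np.array([0.0, 0.0, -1.0]), 0.0
print(np.sign(phi @ w + b) == y)   # all True
```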

Choosing a Transformation (I)
Recall the formula for L_D
Note that it involves a dot product
– Expensive to compute in high dimensions
What if we did not have to?

Choosing a Transformation (II)
It turns out that it is possible to design transformations φ such that:
– the dot product φ(x_i)·φ(x_j) can be expressed in terms of the original inputs x_i and x_j, via a kernel function K(x_i, x_j) = φ(x_i)·φ(x_j)
Hence, one needs only compute K in the original, lower-dimensional space
Example:
– φ: R^2 → R^3 where φ(x) = (x_1^2, √2·x_1x_2, x_2^2), for which φ(x)·φ(z) = (x·z)^2
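A quick numerical check of this example (the identity φ(x)·φ(z) = (x·z)^2 holds for this particular φ; the test points below are arbitrary):

```python
import numpy as np

def phi(x):
    return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(z))    # 1.0: dot product in the 3-D feature space
print((x @ z) ** 2)       # 1.0: same value, computed entirely in the 2-D input space
```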

Choosing a Kernel
Can start from a desired feature space and try to construct a kernel
More often, one starts from a reasonable kernel and may not analyze the feature space
Some kernels are a better fit for certain problems; domain knowledge can be helpful
Common kernels:
– Polynomial
– Gaussian
– Sigmoidal
– Application-specific
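Minimal sketches of the kernels listed above (the parameter names and default values are illustrative choices, not fixed by the slides):

```python
import numpy as np

def polynomial_kernel(x, z, degree=3, c=1.0):
    return (x @ z + c) ** degree

def gaussian_kernel(x, z, sigma=1.0):        # a.k.a. the RBF kernel
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, z, kappa=1.0, theta=0.0):
    return np.tanh(kappa * (x @ z) + theta)
```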

SVM Notes
Excellent empirical and theoretical potential
Multi-class problems are not handled naturally
Choice of kernel is the main learning parameter
– It also brings other parameters to be set (degree of polynomials, variance of Gaussians, etc.)
Speed and size: both training and testing can be costly; how to handle very large training sets is not yet solved
MM can lead to overfitting due to noise, or the problem may not be linearly separable within a reasonable feature space
– Soft Margin is a common solution: it allows slack variables
– The α_i are constrained to satisfy 0 ≤ α_i ≤ C; C controls how many outliers (margin violations) are tolerated. How to pick C?
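A sketch of how C is typically picked in practice, using scikit-learn's soft-margin SVC with cross-validation (assuming scikit-learn is available; the data here is synthetic):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)   # XOR-like, non-linearly-separable labels

grid = GridSearchCV(SVC(kernel='rbf'),       # soft-margin SVM with a Gaussian kernel
                    param_grid={'C': [0.1, 1, 10, 100], 'gamma': [0.1, 1, 10]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)   # cross-validated choice of C (and gamma)
```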