Support Vector Machine & Its Applications Mingyue Tan The University of British Columbia Nov 26, 2004 A portion (1/3) of the slides are taken from Prof. Andrew Moore’s SVM tutorial at

Overview
- Intro. to Support Vector Machines (SVM)
- Properties of SVM
- Applications: Gene Expression Data Classification; Text Categorization (if time permits)
- Discussion

Main Ideas
- Max-Margin Classifier: formalize the notion of the best linear separator
- Lagrange Multipliers: a way to convert a constrained optimization problem into one that is easier to solve
- Kernels: projecting data into a higher-dimensional space can make it linearly separable
- Complexity: depends only on the number of training examples, not on the dimensionality of the kernel space!

Linear Classifiers: x → f(x, w, b) = sign(w·x + b) → y_est, where +1 and -1 denote the two classes. How would you classify this data? [Figure: labeled points and the line w·x + b = 0 separating the regions w·x + b > 0 and w·x + b < 0.]

Linear Classifiers: f(x, w, b) = sign(w·x + b). Any of these separators would be fine... but which is best?

Linear Classifiers: f(x, w, b) = sign(w·x + b). How would you classify this data? [Figure: one point ends up misclassified into the +1 class.]

Classifier Margin: f(x, w, b) = sign(w·x + b). Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

Maximum Margin: f(x, w, b) = sign(w·x + b). The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM). Support vectors are the datapoints that the margin pushes up against.
1. Maximizing the margin is good according to intuition and PAC theory.
2. It implies that only the support vectors matter; the other training examples can be ignored.
3. Empirically it works very well.

Estimate the Margin: What is the distance from a point x to the line w·x + b = 0? [Figure: points labeled +1 and -1, a point x, and the line w·x + b = 0.]

EXAMPLE: Find the distance from the point (2,1) to the line 4x+2y+7=0.
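One worked answer (added here), using the standard point-to-line distance formula:

```latex
d = \frac{|a x_0 + b y_0 + c|}{\sqrt{a^2 + b^2}}
  = \frac{|4 \cdot 2 + 2 \cdot 1 + 7|}{\sqrt{4^2 + 2^2}}
  = \frac{17}{\sqrt{20}} \approx 3.80
```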

Linear SVM Mathematically. What we know: w·x_+ + b = +1 and w·x_- + b = -1, so w·(x_+ - x_-) = 2. [Figure: the "Predict Class = +1" zone and the "Predict Class = -1" zone, bounded by the lines w·x + b = 1, w·x + b = 0, w·x + b = -1, with x_+ and x_- on the two margin lines.] The margin width is M = (x_+ - x_-)·w / ||w|| = 2 / ||w||.

Linear SVM Mathematically. Goal: 1) Correctly classify all training data: w·x_i + b ≥ +1 if y_i = +1 and w·x_i + b ≤ -1 if y_i = -1, i.e. y_i(w·x_i + b) ≥ 1 for all i. 2) Maximize the margin 2/||w||, which is the same as minimizing ½ w^T w. We can formulate this as a quadratic optimization problem and solve for w and b: minimize ½ w^T w subject to y_i(w·x_i + b) ≥ 1 for all i.

Recap of Constrained Optimization. Suppose we want to minimize f(x) subject to g(x) = 0. A necessary condition for x_0 to be a solution is ∇f(x_0) + λ∇g(x_0) = 0, where λ is the Lagrange multiplier. For multiple constraints g_i(x) = 0, i = 1, …, m, we need a Lagrange multiplier λ_i for each of the constraints.

Example: take f(x,y) = x and g(x,y) = x^2 + y^2 - 1. There are two solutions, x = 1, y = 0 and x = -1, y = 0. They correspond to the right and left endpoints of the circle x^2 + y^2 - 1 = 0; their x coordinates are 1 and -1, one giving the maximum possible value and the other the minimum.
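Filling in the intermediate steps of this example (added for clarity):

```latex
\nabla f + \lambda \nabla g = 0
\;\Rightarrow\;
\begin{cases} 1 + 2\lambda x = 0 \\ 2\lambda y = 0 \end{cases}
\qquad \text{together with } x^2 + y^2 = 1 .
```

Since λ = 0 would contradict the first equation, the second equation forces y = 0; then x = ±1 and λ = ∓1/2, matching the two solutions above.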

Recap of Constrained Optimization. The case of inequality constraints g_i(x) ≤ 0 is similar, except that the Lagrange multipliers α_i must be non-negative. If x_0 is a solution to the constrained optimization problem, there must exist α_i ≥ 0 for i = 1, …, m such that x_0 satisfies ∇f(x_0) + Σ_i α_i ∇g_i(x_0) = 0. The function L(x, α) = f(x) + Σ_i α_i g_i(x) is also known as the Lagrangian; we want to set its gradient to 0.

Back to the Original Problem. The Lagrangian is L = ½ w^T w + Σ_i α_i (1 - y_i(w^T x_i + b)), with α_i ≥ 0. Note that ||w||^2 = w^T w. Setting the gradient of L w.r.t. w and b to zero, we have w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0 (the latter is the result when we differentiate the Lagrangian w.r.t. b).

The Dual Problem: maximize W(α) = Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j subject to α_i ≥ 0 and Σ_i α_i y_i = 0. This is a quadratic programming (QP) problem, so a global maximum over the α_i can always be found. w can then be recovered by w = Σ_i α_i y_i x_i.
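To illustrate "a QP solver can be used", here is a minimal sketch of handing this hard-margin dual to a generic QP solver (assuming NumPy and the cvxopt package; the helper name and structure are mine, not part of the slides):

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_dual(X, y):
    """Solve  max_a sum(a) - 1/2 a'Qa  s.t. a >= 0, sum_i a_i y_i = 0,
    where Q_ij = y_i y_j x_i.x_j.  cvxopt minimizes, so signs are flipped."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    Q = (y[:, None] * X) @ (y[:, None] * X).T   # label-weighted Gram matrix
    P = matrix(Q)
    q = matrix(-np.ones(n))                     # the -sum(a) term
    G = matrix(-np.eye(n))                      # -a_i <= 0, i.e. a_i >= 0
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, -1))                # equality constraint sum_i a_i y_i = 0
    b = matrix(0.0)
    alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
    w = ((alpha * y)[:, None] * X).sum(axis=0)  # w = sum_i a_i y_i x_i
    return alpha, w
```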

Dataset with noise. Hard margin: so far we require that all data points be classified correctly - no training error. What if the training set is noisy? - Solution 1: use very powerful kernels. [Figure: a noisy dataset where an extremely wiggly boundary separates the +1 and -1 points - OVERFITTING!]

Non-linearly Separable Problems. We allow an "error" ξ_i in classification; it is based on the output of the discriminant function w^T x + b. Σ_i ξ_i approximates the number of misclassified samples. [Figure: Class 1 and Class 2 points, with a few samples inside or on the wrong side of the margin.]

Soft Margin Hyperplane. If we minimize Σ_i ξ_i, ξ_i can be computed by ξ_i = max(0, 1 - y_i(w^T x_i + b)). The ξ_i are "slack variables" in the optimization. Note that ξ_i = 0 if there is no error for x_i, and Σ_i ξ_i is an upper bound on the number of errors. We want to minimize ½ ||w||^2 + C Σ_i ξ_i, where C is a tradeoff parameter between error and margin. The optimization problem becomes: minimize ½ ||w||^2 + C Σ_i ξ_i subject to y_i(w^T x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0 for all i.

The Optimization Problem. The dual of this new constrained optimization problem is: maximize W(α) = Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0. w is recovered as w = Σ_i α_i y_i x_i. This is very similar to the optimization problem in the linearly separable case, except that there is now an upper bound C on α_i. Once again, a QP solver can be used to find the α_i.

Non-linear SVMs. Datasets that are linearly separable with some noise work out great. But what are we going to do if the dataset is just too hard? How about mapping the data to a higher-dimensional space? [Figure: 1-D data on the x axis that is not linearly separable becomes separable when mapped to (x, x^2).]
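A tiny illustration of that idea (my own toy data, not from the slides): 1-D points that no threshold can separate become separable by a horizontal line after mapping x to (x, x^2).

```python
import numpy as np

# Toy 1-D data: the middle points (class -1) cannot be split off by a single threshold
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([ 1,    1,   -1,  -1,  -1,   1,   1 ])

# Map to the 2-D feature space phi(x) = (x, x^2)
phi = np.column_stack([x, x ** 2])
print(phi)   # class +1 points have x^2 >= 4, class -1 points have x^2 <= 1,
             # so the line x^2 = 2.5 now separates the classes
```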

Non-linear SVMs: Feature spaces General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x)

The Kernel Trick. Recall the SVM optimization problem: the data points only appear as inner products x_i^T x_j. As long as we can calculate the inner product in the feature space, we do not need the mapping φ explicitly. Many common geometric operations (angles, distances) can be expressed by inner products. Define the kernel function K by K(x_i, x_j) = φ(x_i)^T φ(x_j).

An Example for φ(.) and K(.,.). Suppose φ(.) is given as φ([x_1, x_2]) = (1, √2 x_1, √2 x_2, x_1^2, x_2^2, √2 x_1 x_2). An inner product in the feature space is φ([x_1, x_2])^T φ([y_1, y_2]) = (1 + x_1 y_1 + x_2 y_2)^2. So, if we define the kernel function as K(x, y) = (1 + x_1 y_1 + x_2 y_2)^2, there is no need to carry out φ(.) explicitly. This use of a kernel function to avoid carrying out φ(.) explicitly is known as the kernel trick.
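The identity behind this example can be checked by expanding the square (added for clarity):

```latex
(1 + x_1 y_1 + x_2 y_2)^2
= 1 + x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 y_1 + 2 x_2 y_2 + 2 x_1 x_2 y_1 y_2
= \varphi(x)^{\top} \varphi(y),
\qquad \varphi(x) = \bigl(1,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}x_1 x_2\bigr).
```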

Examples of Kernel Functions
- Polynomial kernel with degree d: K(x, y) = (x^T y + 1)^d
- Radial basis function kernel with width σ: K(x, y) = exp(-||x - y||^2 / (2σ^2)). Closely related to radial basis function neural networks; the feature space is infinite-dimensional.
- Sigmoid with parameters κ and θ: K(x, y) = tanh(κ x^T y + θ). It does not satisfy the Mercer condition for all κ and θ.
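These three kernels written out as plain NumPy functions (a sketch; the function names and default parameter values are my own choices):

```python
import numpy as np

def polynomial_kernel(x, y, d=2):
    """K(x, y) = (x.y + 1)^d -- polynomial kernel of degree d."""
    return (np.dot(x, y) + 1) ** d

def rbf_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2)) -- radial basis function kernel."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, theta=0.0):
    """K(x, y) = tanh(kappa * x.y + theta) -- not a Mercer kernel for all parameters."""
    return np.tanh(kappa * np.dot(x, y) + theta)
```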

Non-linear SVMs Mathematically. Dual problem formulation: find α_1 … α_N such that Q(α) = Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j) is maximized and (1) Σ_i α_i y_i = 0, (2) α_i ≥ 0 for all α_i. The solution is f(x) = Σ_i α_i y_i K(x_i, x) + b. Optimization techniques for finding the α_i's remain the same!

Nonlinear SVM - Overview. SVM locates a separating hyperplane in the feature space and classifies points in that space. It does not need to represent the feature space explicitly; it simply defines a kernel function. The kernel function plays the role of the dot product in the feature space.

Example. Suppose we have 5 1-D data points: x_1 = 1, x_2 = 2, x_3 = 4, x_4 = 5, x_5 = 6, with 1, 2, 6 as class 1 and 4, 5 as class 2, i.e. y_1 = 1, y_2 = 1, y_3 = -1, y_4 = -1, y_5 = 1. We use the polynomial kernel of degree 2, K(x, y) = (xy + 1)^2, and C is set to 100. We first find α_i (i = 1, …, 5) by maximizing Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j) subject to 0 ≤ α_i ≤ 100 and Σ_i α_i y_i = 0.

Example. By using a QP solver, we get α_1 = 0, α_2 = 2.5, α_3 = 0, α_4 = 7.333, α_5 = 4.833. Note that the constraints are indeed satisfied. The support vectors are {x_2 = 2, x_4 = 5, x_5 = 6}. The discriminant function is f(z) = 2.5·(1)·(2z + 1)^2 + 7.333·(-1)·(5z + 1)^2 + 4.833·(1)·(6z + 1)^2 + b ≈ 0.6667 z^2 - 5.333 z + b. b is recovered by solving f(2) = 1, or f(5) = -1, or f(6) = 1, since x_2 and x_5 lie on the line w^T φ(x) + b = 1 and x_4 lies on the line w^T φ(x) + b = -1. All three give b = 9.
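A quick way to reproduce this example (a sketch assuming scikit-learn; the fitted values should match the slide's numbers up to solver tolerance):

```python
import numpy as np
from sklearn.svm import SVC

# 1-D toy data from the slide: classes +1 and -1
X = np.array([[1.0], [2.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1])

# Polynomial kernel of degree 2: K(x, y) = (gamma * x.y + coef0)^degree = (xy + 1)^2, C = 100
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=100.0)
clf.fit(X, y)

print(clf.support_)     # indices of the support vectors (expect x_2, x_4, x_5)
print(clf.dual_coef_)   # alpha_i * y_i for each support vector
print(clf.intercept_)   # b (expect approximately 9)
```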

Example. [Figure: value of the discriminant function over the 1-D input, showing the class 1 and class 2 regions.]

Properties of SVM
- Flexibility in choosing a similarity function
- Sparseness of solution when dealing with large data sets: only support vectors are used to specify the separating hyperplane
- Ability to handle large feature spaces: complexity does not depend on the dimensionality of the feature space
- Overfitting can be controlled by the soft margin approach
- Nice math property: a simple convex optimization problem which is guaranteed to converge to a single global solution
- Feature selection

Software. Some implementations (such as LIBSVM) can handle multi-class classification. SVMlight is among the earliest implementations of SVM. Several MATLAB toolboxes for SVM are also available, as is Weka.

SVM Applications SVM has been used successfully in many real-world problems - text (and hypertext) categorization - image classification - bioinformatics (Protein classification, Cancer classification) - hand-written character recognition

Application 1: Cancer Classification
- High dimensional: p > 1000 genes, n < 100 patients
- Imbalanced: fewer positive samples
- Many irrelevant features; noisy
[Table: gene expression matrix with genes g-1 … g-p as columns and patients p-1 … p-n as rows.]
FEATURE SELECTION: in the linear case, w_i^2 gives the ranking of dimension i (a sketch follows below). SVM is sensitive to noisy (mis-labeled) data.
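A minimal sketch of that feature-selection idea (assuming scikit-learn; the function name is mine): fit a linear SVM and rank genes by the squared weight it assigns to each dimension.

```python
import numpy as np
from sklearn.svm import LinearSVC

def rank_genes_by_weight(X, y, C=1.0):
    """Fit a linear SVM and rank dimensions (genes) by w_i^2, largest first."""
    clf = LinearSVC(C=C).fit(X, y)
    importance = clf.coef_.ravel() ** 2     # w_i^2 for each gene
    return np.argsort(importance)[::-1]     # gene indices, most informative first
```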

Weakness of SVM
- It is sensitive to noise: a relatively small number of mislabeled examples can dramatically decrease performance.
- It only considers two classes. How to do multi-class classification with SVM? Answer: 1) With output arity m, learn m SVMs: SVM 1 learns "Output == 1" vs "Output != 1", SVM 2 learns "Output == 2" vs "Output != 2", …, SVM m learns "Output == m" vs "Output != m". 2) To predict the output for a new input, predict with each SVM and find the one that puts the prediction furthest into the positive region (a sketch follows below).
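A hedged sketch of that one-vs-rest scheme (assuming scikit-learn; the helper names are mine): learn one binary SVM per class and predict with the classifier whose decision value is furthest into the positive region.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_one_vs_rest(X, y, classes):
    """Train one 'class c vs rest' linear SVM per class."""
    return {c: LinearSVC().fit(X, np.where(y == c, 1, -1)) for c in classes}

def predict_one_vs_rest(models, X):
    """Pick the class whose SVM pushes the point furthest into the positive region."""
    classes = list(models.keys())
    scores = np.column_stack([models[c].decision_function(X) for c in classes])
    return np.array([classes[i] for i in scores.argmax(axis=1)])
```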

Application 2: Text Categorization. Task: the classification of natural text (or hypertext) documents into a fixed number of predefined categories based on their content - filtering, web searching, sorting documents by topic, etc. A document can be assigned to more than one category, so this can be viewed as a series of binary classification problems, one for each category.

Representation of Text. IR's vector space model (aka bag-of-words representation): a document is represented by a vector indexed by a pre-specified set or dictionary of terms. The values of the entries can be binary or weights. Preprocessing: normalization, stop-word removal, word stemming. Doc x => φ(x).

Text Categorization using SVM. The similarity between two documents is φ(x)·φ(z); K(x, z) = φ(x)·φ(z) is a valid kernel, so SVM can be used with K(x, z) for discrimination. Why SVM? - High dimensional input space - Few irrelevant features (dense concept) - Sparse document vectors (sparse instances) - Text categorization problems are typically linearly separable.
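A small end-to-end sketch of this setup (assuming scikit-learn; the toy corpus and labels are placeholders, not from the slides): documents become sparse term-weight vectors φ(x), and a linear SVM discriminates one category versus the rest.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Placeholder toy corpus: label 1 = one category, -1 = everything else
docs   = ["the team won the match", "stocks fell sharply today",
          "a great goal in the final", "the central bank raised rates"]
labels = [1, -1, 1, -1]

# TF-IDF produces the phi(x) document vectors (normalized, stop words removed);
# LinearSVC is the linear SVM operating on those sparse vectors.
model = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
model.fit(docs, labels)

print(model.predict(["a late goal won the match"]))   # likely predicts category 1
```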