
1 Support Vector Machine & Its Applications Mingyue Tan The University of British Columbia Nov 26, 2004. A portion (1/3) of the slides is taken from Prof. Andrew Moore's SVM tutorial at http://www.cs.cmu.edu/~awm/tutorials

2 Overview: Intro. to Support Vector Machines (SVM); Properties of SVM; Applications (Gene Expression Data Classification; Text Categorization, if time permits); Discussion.

3 Main Ideas. Max-Margin Classifier: formalize the notion of the best linear separator. Lagrange Multipliers: a way to convert a constrained optimization problem to one that is easier to solve. Kernels: projecting data into a higher-dimensional space can make it linearly separable. Complexity: depends only on the number of training examples, not on the dimensionality of the kernel space!

4 Linear Classifiers. f(x, w, b) = sign(w · x + b), mapping an input x to an estimated label y_est. [Figure: 2-D data points labeled +1 and -1, with a candidate separating line w · x + b = 0; w · x + b > 0 on one side and w · x + b < 0 on the other.] How would you classify this data?

5 Linear Classifiers. f(x, w, b) = sign(w · x + b). [Figure: the same +1/-1 data with several different candidate separating lines drawn.] Any of these would be fine... but which is best?

6 Linear Classifiers. f(x, w, b) = sign(w · x + b). [Figure: the +1/-1 data with a separating line that misclassifies a point into the +1 class.] How would you classify this data?

7 Classifier Margin. f(x, w, b) = sign(w · x + b). Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint. [Figure: the +1/-1 data with a separating line and its margin band shaded.]

8 Maximum Margin. f(x, w, b) = sign(w · x + b). The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM, called a Linear SVM (LSVM). Support vectors are those datapoints that the margin pushes up against. 1. Maximizing the margin is good according to intuition and PAC theory. 2. It implies that only support vectors are important; other training examples are ignorable. 3. Empirically it works very, very well.

9 Estimate the Margin. What is the distance expression for a point x to the line w · x + b = 0? It is d(x) = |w · x + b| / ||w||, where ||w|| is the Euclidean norm of w. [Figure: the +1/-1 data with the line w · x + b = 0 and the perpendicular distance from a point x to it.]

10 EXAMPLE: Find the distance from the point (2,1) to the line 4x+2y+7=0.
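Worked out (the original slide showed this computation as an image; the reconstruction below simply applies the distance formula from slide 9):

```latex
d = \frac{|4\cdot 2 + 2\cdot 1 + 7|}{\sqrt{4^2 + 2^2}}
  = \frac{17}{\sqrt{20}}
  = \frac{17}{2\sqrt{5}}
  \approx 3.80
```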

11 Linear SVM Mathematically. Take a positive support vector x+ on the plus-plane w · x + b = +1 and a negative support vector x- on the minus-plane w · x + b = -1. What we know: w · x+ + b = +1 and w · x- + b = -1, so w · (x+ - x-) = 2. [Figure: the "Predict Class = +1" zone above w · x + b = 1, the "Predict Class = -1" zone below w · x + b = -1, the decision boundary w · x + b = 0 between them, and M = margin width.] Projecting x+ - x- onto the unit normal w/||w|| gives the margin width M = 2/||w||.

12 Linear SVM Mathematically. Goal: 1) Correctly classify all training data: w · x_i + b ≥ +1 if y_i = +1 and w · x_i + b ≤ -1 if y_i = -1, i.e. y_i (w · x_i + b) ≥ 1 for all i. 2) Maximize the margin M = 2/||w||, which is the same as minimizing ½ wᵀw. We can formulate this as a quadratic optimization problem and solve for w and b: minimize ½ wᵀw subject to y_i (w · x_i + b) ≥ 1 for all i.
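The slides contain no code; the following is a minimal Python sketch (not from the original deck) of the hard-margin formulation in practice, using scikit-learn's SVC with a linear kernel and a very large C to approximate the hard margin. The toy data points are made up for this illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (made up for illustration).
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.5],   # class +1
              [6.0, 5.0], [7.0, 6.5], [8.0, 6.0]])  # class -1
y = np.array([+1, +1, +1, -1, -1, -1])

# A very large C approximates the hard-margin SVM:
# minimize (1/2) w^T w  subject to  y_i (w . x_i + b) >= 1.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                   # weight vector w
b = clf.intercept_[0]              # bias b
margin = 2.0 / np.linalg.norm(w)   # margin width M = 2 / ||w||

print("w =", w, " b =", b)
print("margin width =", margin)
print("support vectors:\n", clf.support_vectors_)
```

The w and b returned by the solver define the maximum-margin hyperplane described on the slide, with margin width 2/||w||.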

13 Recap of Constrained Optimization. Suppose we want to minimize f(x) subject to g(x) = 0. A necessary condition for x_0 to be a solution is ∇f(x_0) + λ ∇g(x_0) = 0, where λ is the Lagrange multiplier. For multiple constraints g_i(x) = 0, i = 1, …, m, we need a Lagrange multiplier λ_i for each of the constraints.

14 Example: take f(x,y) = x and g(x,y) = x² + y² - 1. There are two solutions, x = 1, y = 0 and x = -1, y = 0, each with its own multiplier λ. They correspond to the two endpoints of the circle x² + y² - 1 = 0 on the x-axis; their x-coordinates, 1 and -1, are the maximum and minimum possible values of f on the circle.
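Working the example through (the algebra was only implicit on the slide):

```latex
L(x, y, \lambda) = x + \lambda\,(x^2 + y^2 - 1), \qquad
\nabla L = 0 \;\Longrightarrow\;
\begin{cases}
1 + 2\lambda x = 0 \\
2\lambda y = 0 \\
x^2 + y^2 = 1
\end{cases}
```

Since λ = 0 would contradict the first equation, y = 0, hence x = ±1 with λ = ∓1/2; f attains its maximum 1 at (1, 0) and its minimum -1 at (-1, 0).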

15 Recap of Constrained Optimization. The case of an inequality constraint g_i(x) ≤ 0 is similar, except that the Lagrange multiplier λ_i should be non-negative. If x_0 is a solution to the constrained optimization problem, there must exist λ_i ≥ 0 for i = 1, …, m such that x_0 satisfies ∇f(x_0) + Σ_i λ_i ∇g_i(x_0) = 0. The function f(x) + Σ_i λ_i g_i(x) is also known as the Lagrangian; we want to set its gradient to 0.

16 Back to the Original Problem. The Lagrangian is L(w, b, α) = ½ wᵀw - Σ_i α_i [ y_i (w · x_i + b) - 1 ], with α_i ≥ 0. Note that ||w||² = wᵀw. Setting the gradient of L w.r.t. w and b to zero, we have w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0; the latter is the result when we differentiate the Lagrangian w.r.t. b.

17 The Dual Problem. The dual problem is: maximize W(α) = Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j) subject to α_i ≥ 0 and Σ_i α_i y_i = 0. This is a quadratic programming (QP) problem: a global maximum of the α_i can always be found. w can be recovered by w = Σ_i α_i y_i x_i.
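The algebra connecting slides 16 and 17 was shown as images; a standard reconstruction is that substituting w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0 back into the Lagrangian eliminates w and b and leaves the dual objective:

```latex
L = \tfrac{1}{2}\,\mathbf{w}^{\top}\mathbf{w}
    - \sum_i \alpha_i\bigl[y_i(\mathbf{w}\cdot\mathbf{x}_i + b) - 1\bigr]
  = \sum_i \alpha_i
    - \tfrac{1}{2}\sum_i\sum_j \alpha_i\alpha_j\, y_i y_j\,(\mathbf{x}_i\cdot\mathbf{x}_j)
```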

18 Dataset with noise. Hard margin: so far we require all data points to be classified correctly - no training error. What if the training set is noisy? - Solution 1: use very powerful kernels. [Figure: a noisy +1/-1 dataset where a very flexible boundary separates every point perfectly.] OVERFITTING!

19 Non-linearly Separable Problems. We allow an "error" ξ_i in classification; it is based on the output of the discriminant function wᵀx + b. Σ_i ξ_i approximates the number of misclassified samples. [Figure: two overlapping classes, Class 1 and Class 2, with slack for points on the wrong side of the margin.]

20 Soft Margin Hyperplane. If we minimize Σ_i ξ_i, the ξ_i can be computed by ξ_i = max(0, 1 - y_i (wᵀx_i + b)). The ξ_i are "slack variables" in the optimization. Note that ξ_i = 0 if there is no error for x_i, and Σ_i ξ_i is an upper bound on the number of errors. We want to minimize ½ wᵀw + C Σ_i ξ_i, where C is the tradeoff parameter between error and margin. The optimization problem becomes: minimize ½ wᵀw + C Σ_i ξ_i subject to y_i (wᵀx_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0 for all i.
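As a small illustration (not from the slides) of how the tradeoff parameter C behaves in practice, the sketch below fits soft-margin SVMs with different C values on made-up noisy data using scikit-learn:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two noisy, overlapping Gaussian blobs (made up for illustration).
X = np.vstack([rng.normal(loc=0.0, scale=1.5, size=(50, 2)),
               rng.normal(loc=2.0, scale=1.5, size=(50, 2))])
y = np.array([+1] * 50 + [-1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:>6}: margin width={margin:.3f}, "
          f"#support vectors={len(clf.support_)}, "
          f"training accuracy={clf.score(X, y):.2f}")
```

Smaller C tolerates more slack (wider margin, more support vectors); larger C penalizes training errors more heavily and narrows the margin.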

21 The Optimization Problem. The dual of this new constrained optimization problem is: maximize W(α) = Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j) subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0. w is recovered as w = Σ_i α_i y_i x_i. This is very similar to the optimization problem in the linearly separable case, except that there is now an upper bound C on α_i. Once again, a QP solver can be used to find the α_i.

22 Non-linear SVMs. Datasets that are linearly separable with some noise work out great. But what are we going to do if the dataset is just too hard? How about mapping the data to a higher-dimensional space? [Figure: three 1-D number lines: separable data on the x-axis; hard, interleaved data on the x-axis; the same hard data mapped to the (x, x²) plane, where it becomes linearly separable.]

23 Non-linear SVMs: Feature spaces General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x)

24 The Kernel Trick. Recall the SVM optimization problem: the data points only appear as inner products x_i · x_j. As long as we can calculate the inner product in the feature space, we do not need the mapping φ explicitly. Many common geometric operations (angles, distances) can be expressed by inner products. Define the kernel function K by K(x_i, x_j) = φ(x_i) · φ(x_j).

25 An Example for φ(·) and K(·,·). Suppose φ(·) is given as shown below. An inner product in the feature space can then be computed directly from the original inputs, so if we define the kernel function accordingly, there is no need to carry out φ(·) explicitly. This use of the kernel function to avoid carrying out φ(·) explicitly is known as the kernel trick.
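The slide's φ(·) and K(·,·) did not survive extraction; the standard example for a degree-2 polynomial kernel on x = (x₁, x₂), consistent with the kernel K(x,y) = (xy+1)² used on slide 29, is:

```latex
\varphi(\mathbf{x}) = \bigl(1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2\bigr),
\qquad
\varphi(\mathbf{x})\cdot\varphi(\mathbf{z}) = (1 + \mathbf{x}\cdot\mathbf{z})^2 =: K(\mathbf{x},\mathbf{z})
```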

26 Examples of Kernel Functions. Polynomial kernel with degree d. Radial basis function (RBF) kernel with width σ: closely related to radial basis function neural networks; the corresponding feature space is infinite-dimensional. Sigmoid kernel with parameters κ and θ: it does not satisfy the Mercer condition for all κ and θ. (Standard forms of these kernels are given below.)
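The kernel formulas themselves were images on the slide; their standard forms are:

```latex
K_{\mathrm{poly}}(\mathbf{x},\mathbf{z}) = (\mathbf{x}\cdot\mathbf{z} + 1)^d, \qquad
K_{\mathrm{rbf}}(\mathbf{x},\mathbf{z}) = \exp\!\left(-\frac{\lVert\mathbf{x}-\mathbf{z}\rVert^2}{2\sigma^2}\right), \qquad
K_{\mathrm{sig}}(\mathbf{x},\mathbf{z}) = \tanh(\kappa\,\mathbf{x}\cdot\mathbf{z} + \theta)
```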

27 Non-linear SVMs Mathematically. Dual problem formulation: find α_1 … α_N such that Q(α) = Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j) is maximized and (1) Σ_i α_i y_i = 0, (2) α_i ≥ 0 for all α_i. The solution is f(x) = Σ_i α_i y_i K(x_i, x) + b. Optimization techniques for finding the α_i's remain the same!
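A short scikit-learn sketch (not part of the slides) illustrating that only the kernel changes while the optimization stays the same; the dataset is a made-up pair of concentric circles that no linear boundary can separate:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# A dataset that is not linearly separable in the input space.
X, y = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)

for kernel, params in [("linear", {}),
                       ("poly",   {"degree": 2, "coef0": 1.0}),
                       ("rbf",    {"gamma": 1.0})]:
    clf = SVC(kernel=kernel, C=1.0, **params).fit(X, y)
    print(f"{kernel:>6} kernel: training accuracy = {clf.score(X, y):.2f}")
```

The linear kernel fails on this data, while the degree-2 polynomial and RBF kernels separate it by implicitly working in a higher-dimensional feature space.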

28 Nonlinear SVM - Overview. SVM locates a separating hyperplane in the feature space and classifies points in that space. It does not need to represent the feature space explicitly; it simply defines a kernel function. The kernel function plays the role of the dot product in the feature space.

29 Example. Suppose we have 5 1-D data points: x_1=1, x_2=2, x_3=4, x_4=5, x_5=6, with 1, 2, 6 as class 1 and 4, 5 as class 2, so y_1=1, y_2=1, y_3=-1, y_4=-1, y_5=1. We use the polynomial kernel of degree 2, K(x,y) = (xy+1)², and C is set to 100. We first find α_i (i=1, …, 5) by maximizing Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j) subject to 0 ≤ α_i ≤ 100 and Σ_i α_i y_i = 0.

30 Example. By using a QP solver, we get α_1=0, α_2=2.5, α_3=0, α_4=7.333, α_5=4.833. Note that the constraints are indeed satisfied. The support vectors are {x_2=2, x_4=5, x_5=6}. The discriminant function is f(x) = 2.5 (2x+1)² - 7.333 (5x+1)² + 4.833 (6x+1)² + b = 0.6667 x² - 5.333 x + b. b is recovered by solving f(2)=1, f(5)=-1, or f(6)=1, since x_2 and x_5 lie on the line f(x)=1 and x_4 lies on the line f(x)=-1. All three give b=9.
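A minimal sketch (not in the original slides) that sets up this dual with the cvxopt QP solver and should reproduce the slide's numbers, assuming cvxopt and numpy are installed:

```python
import numpy as np
from cvxopt import matrix, solvers

# The five 1-D points and labels from the slide.
x = np.array([1.0, 2.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
C = 100.0
n = len(x)

# Degree-2 polynomial kernel K(x, z) = (x z + 1)^2.
K = (np.outer(x, x) + 1.0) ** 2

# Dual: maximize sum(a) - 1/2 a^T (yy^T * K) a, s.t. 0 <= a_i <= C, sum(a_i y_i) = 0.
# cvxopt solves: minimize 1/2 a^T P a + q^T a  s.t.  G a <= h,  A a = b.
P = matrix(np.outer(y, y) * K)
q = matrix(-np.ones(n))
G = matrix(np.vstack([-np.eye(n), np.eye(n)]))       # -a_i <= 0 and a_i <= C
h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
A = matrix(y.reshape(1, -1))
b = matrix(0.0)

solvers.options["show_progress"] = False
alpha = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])
print("alpha =", np.round(alpha, 3))   # expected roughly (0, 2.5, 0, 7.333, 4.833)

# Recover b from a support vector with 0 < alpha_i < C, e.g. x_2 = 2 (f(2) = 1).
sv = np.argmax((alpha > 1e-5) & (alpha < C - 1e-5))
b_val = y[sv] - np.sum(alpha * y * K[:, sv])
print("b =", round(b_val, 3))          # expected roughly 9
```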

31 Example. [Figure: plot of the value of the discriminant function f(x) over the points 1, 2, 4, 5, 6; it is positive around the class 1 points (1, 2, 6) and negative around the class 2 points (4, 5).]

32 Properties of SVM. Flexibility in choosing a similarity function; sparseness of the solution when dealing with large data sets (only support vectors are used to specify the separating hyperplane); ability to handle large feature spaces (complexity does not depend on the dimensionality of the feature space); overfitting can be controlled by the soft margin approach; nice math property: a simple convex optimization problem which is guaranteed to converge to a single global solution; feature selection.

33 Software. Some implementations (such as LIBSVM) can handle multi-class classification. SVMLight is among the earliest implementations of SVM. Several Matlab toolboxes for SVM are also available, as is an implementation in Weka.

34 SVM Applications SVM has been used successfully in many real-world problems - text (and hypertext) categorization - image classification - bioinformatics (Protein classification, Cancer classification) - hand-written character recognition

35 Application 1: Cancer Classification. High dimensional: p > 1000 genes, n < 100 patients. Imbalanced: fewer positive samples. Many irrelevant features. Noisy. [Figure: a gene-expression data matrix with genes g-1 … g-p as columns and patients p-1 … p-n as rows.] FEATURE SELECTION: in the linear case, w_i² gives the ranking of dimension i. SVM is sensitive to noisy (mislabeled) data.
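A rough sketch (not from the slides) of the weight-based feature ranking just described, run on synthetic, made-up "gene expression" data with a linear SVM:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic data (made up): 60 patients, 500 genes,
# where only the first 5 genes actually carry the class signal.
n, p = 60, 500
y = rng.choice([-1, 1], size=n)
X = rng.normal(size=(n, p))
X[:, :5] += 1.5 * y[:, None]   # informative genes

clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)

# Rank dimensions by w_i^2, as on the slide.
scores = clf.coef_[0] ** 2
top = np.argsort(scores)[::-1][:10]
print("top-ranked gene indices:", top)
```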

36 Weakness of SVM. It is sensitive to noise: a relatively small number of mislabeled examples can dramatically decrease performance. It only considers two classes. How to do multi-class classification with SVM? Answer: 1) With output arity m, learn m SVMs: SVM 1 learns "Output==1" vs "Output != 1", SVM 2 learns "Output==2" vs "Output != 2", …, SVM m learns "Output==m" vs "Output != m". 2) To predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region (a sketch of this recipe follows below).
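A sketch (not from the slides) of the one-vs-rest recipe just described, using scikit-learn's SVC on the Iris dataset; the class whose SVM pushes the point furthest into the positive region wins:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # 3 classes

# One SVM per class: "Output==c" vs "Output!=c".
classifiers = {}
for c in np.unique(y):
    binary_y = np.where(y == c, 1, -1)
    classifiers[c] = SVC(kernel="linear", C=1.0).fit(X, binary_y)

def predict(x):
    x = x.reshape(1, -1)
    # decision_function gives the signed distance to each SVM's boundary;
    # pick the class whose SVM is furthest into the positive region.
    scores = {c: clf.decision_function(x)[0] for c, clf in classifiers.items()}
    return max(scores, key=scores.get)

preds = np.array([predict(x) for x in X])
print("training accuracy:", np.mean(preds == y))
```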

37 Application 2: Text Categorization. Task: the classification of natural text (or hypertext) documents into a fixed number of predefined categories based on their content - email filtering, web searching, sorting documents by topic, etc. A document can be assigned to more than one category, so this can be viewed as a series of binary classification problems, one for each category.

38 Representation of Text. IR's vector space model (aka bag-of-words representation): a doc is represented by a vector indexed by a pre-fixed set or dictionary of terms. Values of an entry can be binary or weights. Preprocessing: normalization, stop words, word stems. Doc x => φ(x).

39 Text Categorization using SVM. The similarity between two documents is measured by φ(x)·φ(z). K(x,z) = ⟨φ(x), φ(z)⟩ is a valid kernel, so SVM can be used with K(x,z) for discrimination. Why SVM? - High dimensional input space - Few irrelevant features (dense concept) - Sparse document vectors (sparse instances) - Most text categorization problems are linearly separable
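A compact sketch (not from the slides) of the full pipeline: a bag-of-words/tf-idf representation φ(x) followed by a linear SVM, using scikit-learn on a two-category subset of 20 Newsgroups (the dataset is downloaded on first use, so this needs network access):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

categories = ["sci.med", "rec.autos"]          # a two-class subset
train = fetch_20newsgroups(subset="train", categories=categories)
test = fetch_20newsgroups(subset="test", categories=categories)

clf = make_pipeline(
    TfidfVectorizer(stop_words="english"),     # normalization, stop words
    LinearSVC(C=1.0),                          # linear SVM on sparse doc vectors
)
clf.fit(train.data, train.target)
print("test accuracy:", clf.score(test.data, test.target))
```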

