
1 Support Vector Machines Tao, Department of Computer Science, University of Illinois

2 Adapted, in content and even slides, from: Gentle Guide to Support Vector Machines (Ming-Hsuan Yang) and Support Vector and Kernel Methods (Thorsten Joachims)

3 Problem Find an optimal hyper-plane to classify the data points. How should this hyper-plane be chosen?

4 What is “optimal”? Intuition: to maximize the margin

5 What is “optimal”? Statistically: risk minimization. Risk function: Risk_P(h) = P(h(x) ≠ y) = ∫ Δ(h(x) ≠ y) dP(x,y), for h ∈ H, where h is the hyper-plane function, x is a vector, y ∈ {1, -1}, and Δ is the indicator function. Minimization: h_opt = argmin_{h ∈ H} Risk_P(h)

6 In practice… Given N observations (x_i, y_i) with labels y_i ∈ {1, -1}, look for a mapping x → f(x, α) ∈ {1, -1}. Expected risk: R(α) = ∫ ½|y − f(x, α)| dP(x,y). Empirical risk: R_emp(α) = (1/N) Σ_i ½|y_i − f(x_i, α)|. Question: are they consistent in terms of minimization?
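
As a concrete illustration (my own, not from the slides): a minimal numpy sketch that computes the empirical risk of a fixed linear decision function f(x) = sgn(⟨w, x⟩ + b) on a made-up toy sample.

    import numpy as np

    # Toy sample: 4 points in R^2 with labels in {+1, -1} (made-up data).
    X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -0.5]])
    y = np.array([1, 1, -1, -1])

    # A fixed linear decision function f(x) = sgn(<w, x> + b).
    w, b = np.array([1.0, 1.0]), 0.0
    predictions = np.sign(X @ w + b)

    # Empirical risk: fraction of training points the function misclassifies.
    empirical_risk = np.mean(predictions != y)
    print(empirical_risk)  # 0.0 here, since all four points are classified correctly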

7 Vapnik/Chervonenkis (VC) dimension Definition: the VC dimension of H is the maximum number h of examples that can be split into two sets in all 2^h ways using functions from H. Example: in R^2 the VC dimension of linear separators is 3 (in R^n it is n+1). But no set of 4 points in R^2 can be shattered by a line (e.g., the XOR labeling).
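
To make the “split in all 2^h ways” idea concrete, here is a small sketch (my own illustration, assuming scikit-learn is available) that checks whether three non-collinear points in R^2 can be shattered by linear separators, using a large-C linear SVM as a stand-in for a hard-margin classifier.

    import numpy as np
    from itertools import product
    from sklearn.svm import SVC

    # Three non-collinear points in R^2.
    X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

    shattered = True
    for labels in product([-1, 1], repeat=3):
        y = np.array(labels)
        if len(set(labels)) == 1:
            continue  # a single-class labeling is trivially realizable
        clf = SVC(kernel="linear", C=1e6)  # very large C approximates a hard margin
        clf.fit(X, y)
        if (clf.predict(X) != y).any():
            shattered = False
    print("3 points shattered by lines:", shattered)  # expected: True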

8 Upper bound for expected risk With probability 1−η, R(α) ≤ R_emp(α) + √[ (h(ln(2N/h) + 1) − ln(η/4)) / N ], where h is the VC dimension. The first term is the training error; the second term is the VC confidence. Keeping both small avoids overfitting.
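
A small sketch (my own, under the standard form of the bound written above) that evaluates the VC confidence term for some illustrative values of N, h, and η:

    import numpy as np

    def vc_confidence(n_samples, vc_dim, eta):
        """Second term of the bound: sqrt((h*(ln(2N/h)+1) - ln(eta/4)) / N)."""
        h, N = vc_dim, n_samples
        return np.sqrt((h * (np.log(2 * N / h) + 1) - np.log(eta / 4)) / N)

    # Illustrative values: the confidence term shrinks with more data
    # and grows with VC dimension.
    for N in (100, 1000, 10000):
        print(N, round(vc_confidence(N, vc_dim=10, eta=0.05), 3))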

9 Error vs. VC dimension

10 Want to minimize the expected risk? It is not enough just to minimize the empirical risk; we also need to choose an appropriate VC dimension and make both parts of the bound small. Solution: Structural Risk Minimization (SRM)

11 Structural risk minimization Use a nested structure of hypothesis spaces H_1 ⊆ H_2 ⊆ … with h(n) ≤ h(n+1), where h(n) is the VC dimension of H_n. This trades off VC dimension against empirical risk: choose the hypothesis space that balances a small VC dimension against a minimum empirical risk.

12 Linear SVM Given x_i ∈ R^n, the data are linearly separable if there exist w ∈ R^n and b ∈ R such that y_i(w·x_i + b) ≥ 1. Scale (w, b) so that the distance of the closest points, say x_j, equals 1/||w||. Optimal separating hyper-plane (OSH): maximize 1/||w||.

13 Linear SVM example Given (x_i, y_i), find (w, b) such that ⟨w, x⟩ + b = 0 is the separating hyper-plane, with the additional requirement min_i |⟨w, x_i⟩ + b| = 1. Decision function: f(x, w, b) = sgn(⟨w, x⟩ + b)
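
A minimal numpy sketch (illustration only, with made-up w, b, and data) of the decision function f(x, w, b) = sgn(⟨w, x⟩ + b) and the normalization min_i |⟨w, x_i⟩ + b| = 1:

    import numpy as np

    def decision(X, w, b):
        """f(x, w, b) = sgn(<w, x> + b), applied row-wise."""
        return np.sign(X @ w + b)

    # Made-up separable points and a candidate hyper-plane.
    X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
    y = np.array([1, 1, -1, -1])
    w, b = np.array([0.25, 0.25]), 0.0

    # Rescale (w, b) so that the closest point satisfies |<w, x> + b| = 1.
    closest = np.min(np.abs(X @ w + b))
    w, b = w / closest, b / closest

    print(decision(X, w, b))           # [ 1.  1. -1. -1.]
    print(np.min(np.abs(X @ w + b)))   # 1.0
    print(1.0 / np.linalg.norm(w))     # geometric margin 1/||w||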

14 VC dimension upper bound Lemma [Vapnik 1995]: Let R be the radius of the smallest ball covering all the x: {x : ||x − a|| < R}, and let f_{w,b}(x) = sgn(⟨w, x⟩ + b) be the decision functions with ||w|| ≤ A. Then the VC dimension h < R²A² + 1. Since ||w|| = 1/δ, where δ is the margin, a larger margin gives a smaller bound. (Figure: margin δ, ball radius R, normal vector w.)

15 So … Maximizing the margin δ ⇒ minimizing ||w|| ⇒ smallest acceptable VC dimension ⇒ constructing an optimal hyper-plane. Is everything clear? How to do it? Quadratic programming!

16 Constrained quadratic programming Minimize ½⟨w, w⟩ subject to y_i(⟨w, x_i⟩ + b) ≥ 1. Solve it with Lagrange multipliers to find the saddle point. For more details, see the book An Introduction to Support Vector Machines.
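
A sketch of how one could solve this constrained QP numerically; the slides do not prescribe a solver, so this uses scipy's general-purpose SLSQP routine on a made-up separable toy set.

    import numpy as np
    from scipy.optimize import minimize

    # Made-up, linearly separable toy data (illustration only).
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    # params = (w_1, w_2, b); minimize 1/2 <w, w>.
    def objective(params):
        w = params[:-1]
        return 0.5 * np.dot(w, w)

    # One inequality constraint per point: y_i (<w, x_i> + b) - 1 >= 0.
    constraints = [
        {"type": "ineq",
         "fun": lambda p, xi=xi, yi=yi: yi * (np.dot(p[:-1], xi) + p[-1]) - 1.0}
        for xi, yi in zip(X, y)
    ]

    res = minimize(objective, x0=np.zeros(X.shape[1] + 1),
                   constraints=constraints, method="SLSQP")
    w, b = res.x[:-1], res.x[-1]
    print("w =", w, "b =", b, "margin =", 1.0 / np.linalg.norm(w))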

17 What are “support vectors”? For y_i(w·x_i + b) ≥ 1, most of the x_i satisfy the strict inequality; the x_i that achieve equality are called support vectors.

18 Inseparable data

19 Soft margin classifier Loosen the margin by introducing N nonnegative slack variables ξ = (ξ_1, ξ_2, …, ξ_N) so that y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i. Problem: minimize ½⟨w, w⟩ + C Σ_i ξ_i subject to y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i and ξ_i ≥ 0.

20 C and ξ C: when C is small, the emphasis is on maximizing the minimum distance (the margin); when C is large, the emphasis is on minimizing the number of misclassified points. ξ_i: ξ_i > 1 means the point is misclassified; 0 < ξ_i < 1 means it is correctly classified but closer to the hyper-plane than 1/||w||; ξ_i = 0 corresponds to margin vectors.
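
To see the role of C in practice, here is a short sketch (assuming scikit-learn; the data are made up) comparing a small and a large C on two noisy, overlapping blobs:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    # Two noisy, overlapping blobs in R^2 (made-up data).
    X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
    y = np.hstack([np.ones(50), -np.ones(50)])

    for C in (0.01, 100.0):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        w = clf.coef_[0]
        print(f"C={C}: margin 1/||w|| = {1 / np.linalg.norm(w):.2f}, "
              f"training errors = {(clf.predict(X) != y).sum()}")
    # Small C favors a wide margin (more slack); large C penalizes
    # misclassified and margin-violating points more heavily.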

21 Nonlinear SVM (figure: mapping the data from R² to R³)

22 Feature space The map Φ takes the input space to a feature space; e.g., Φ maps attributes (a, b, c) to (a, b, c, aa, ab, ac, bb, bc, cc).

23 Problem: very many parameters! O(N^p) attributes in the feature space, for N attributes and degree p. Solution: kernel methods!

24 Dual representations Introduce Lagrange multipliers α_i ≥ 0: L(w, b, α) = ½⟨w, w⟩ − Σ_i α_i [y_i(⟨w, x_i⟩ + b) − 1]. Require ∂L/∂w = 0 and ∂L/∂b = 0, which give w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0; substitute these back into L.

25 Constrained QP using the dual Maximize W(α) = Σ_i α_i − ½ αᵀDα subject to α_i ≥ 0 and Σ_i α_i y_i = 0, where D is an N×N matrix such that D_{i,j} = y_i y_j ⟨x_i, x_j⟩. Observation: the only way the data points appear in the training problem is in the form of dot products ⟨x_i, x_j⟩.
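
A small numpy sketch (my own illustration, with made-up data and multipliers) of how the matrix D and the dual objective look:

    import numpy as np

    # Made-up training data and labels.
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    # D[i, j] = y_i * y_j * <x_i, x_j>: data enter only through dot products.
    D = (y[:, None] * y[None, :]) * (X @ X.T)

    def dual_objective(alpha):
        """W(alpha) = sum_i alpha_i - 1/2 * alpha^T D alpha."""
        return alpha.sum() - 0.5 * alpha @ D @ alpha

    alpha = np.array([0.2, 0.0, 0.2, 0.0])   # candidate multipliers (illustrative)
    print(np.allclose(alpha @ y, 0.0))       # the constraint sum_i alpha_i y_i = 0
    print(dual_objective(alpha))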

26 Go back to nonlinear SVM… Original: dot products ⟨x_i, x_j⟩. Expanding to a high-dimensional space: ⟨Φ(x_i), Φ(x_j)⟩. Problem: Φ is computationally expensive. Fortunately, we only need Φ(x_i)·Φ(x_j).

27 Kernel function K(x_i, x_j) = Φ(x_i)·Φ(x_j), without knowing the exact Φ. Replace ⟨x_i, x_j⟩ by K(x_i, x_j); all the previous derivations for the linear SVM still hold.
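
A quick numerical check (my own example) that a kernel computes a dot product in feature space without forming Φ explicitly, using the degree-2 polynomial kernel K(u, v) = ⟨u, v⟩² and the map Φ(x) = (x₁², √2·x₁x₂, x₂²):

    import numpy as np

    def phi(x):
        """Explicit degree-2 feature map for x in R^2."""
        return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

    def poly2_kernel(u, v):
        """K(u, v) = <u, v>^2, computed without ever building phi."""
        return np.dot(u, v) ** 2

    u, v = np.array([1.0, 2.0]), np.array([3.0, 0.5])
    print(np.dot(phi(u), phi(v)))   # dot product in feature space: 16.0
    print(poly2_kernel(u, v))       # same value via the kernel: 16.0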

28 How to choose the kernel function? Mercer's condition (necessary and sufficient): K(u, v) is symmetric and ∫∫ K(u, v) g(u) g(v) du dv ≥ 0 for every g with ∫ g(u)² du < ∞.

29 Some examples of kernel functions: polynomial K(x, y) = (⟨x, y⟩ + 1)^p; Gaussian (RBF) K(x, y) = exp(−||x − y||² / 2σ²); sigmoid K(x, y) = tanh(κ⟨x, y⟩ − δ).
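
A compact numpy sketch of these three kernels (illustrative only; the parameter values p, σ, κ, δ are made up):

    import numpy as np

    def polynomial_kernel(x, y, p=3):
        return (np.dot(x, y) + 1) ** p

    def gaussian_kernel(x, y, sigma=1.0):
        return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

    def sigmoid_kernel(x, y, kappa=1.0, delta=1.0):
        # Note: this one satisfies Mercer's condition only for some parameter choices.
        return np.tanh(kappa * np.dot(x, y) - delta)

    x, y = np.array([1.0, 2.0]), np.array([2.0, 1.0])
    print(polynomial_kernel(x, y), gaussian_kernel(x, y), sigmoid_kernel(x, y))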

30 Multiple classes (k) One-against-the-rest: k SVMs. One-against-one: k(k−1)/2 SVMs. K-class SVM. John Platt’s DAG method.
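
A short sketch (assuming scikit-learn; the data are made up) contrasting the one-against-the-rest and one-against-one strategies on a 4-class toy problem:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

    rng = np.random.RandomState(0)
    # Four made-up Gaussian blobs, one per class (k = 4).
    X = np.vstack([rng.randn(30, 2) + c for c in ([0, 0], [4, 4], [0, 4], [4, 0])])
    y = np.repeat([0, 1, 2, 3], 30)

    ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)   # k SVMs
    ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)    # k(k-1)/2 SVMs
    print("one-against-the-rest SVMs:", len(ovr.estimators_))   # 4
    print("one-against-one SVMs:", len(ovo.estimators_))        # 6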

31 Application in text classification Count each term in an article; the article thereby becomes a vector x. Attributes: terms. Values: term occurrence or frequency. (Figure: a sample article excerpt shown next to its term-count table.)
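
A minimal text-classification sketch in the spirit of this slide (my own toy example, assuming scikit-learn): each article becomes a term-count vector, and a linear SVM is trained on those vectors.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC

    # Tiny made-up corpus; labels: 1 = "regression", -1 = "classification".
    docs = [
        "the problem of linear regression is much older than classification",
        "linear regression fits a real valued output",
        "classification assigns a discrete class label",
        "support vector machines are used for classification",
    ]
    labels = [1, 1, -1, -1]

    # Attributes are terms; values are occurrence counts.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)          # sparse document-term matrix

    clf = LinearSVC().fit(X, labels)
    test = vectorizer.transform(["a new article about linear regression"])
    print(clf.predict(test))                     # expected: [1]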

32 Conclusions: linear SVM; VC dimension; soft margin classifier; dual representation; nonlinear SVM; kernel methods; multi-class classifiers.

33 Thank you!

