1 Support Vector Machine (SVM)
MUMT611, Beinan Li, Music Tech @ McGill, 2005-3-17

2 Content
- Related problems in pattern classification
- VC theory and VC dimension
- Overview of SVM
- Application example

3 Related problems in pattern classification
- Small sample-size effect (peaking effect)
  - An overly small or overly large sample size results in large error.
  - In a typical Bayesian classifier, probability densities estimated from finite sample sets are inaccurate for the global set.
- Training data vs. test data
- Empirical risk vs. structural risk
  - Misclassifying yet-to-be-seen data
Picture taken from (Ridder 1997)

4 Related problems in pattern classification
- Avoid solving a more general problem as an intermediate step (Vapnik 1995): classify without estimating probability densities.
- ANN
  - Depends on prior knowledge
  - Empirical-risk minimization (ERM): generalization is problematic (over-fitting is hard to control)
- Goal: a theoretical analysis of the validity of ERM.

5 VC theory and VC dimension
- VC dimension (a measure of classifier complexity): the maximum size of a sample set that the decision-function class can shatter, i.e. separate under every possible labelling.
- Finite VC dimension -> consistency of ERM
  - Theoretical basis of ANN and SVM
- Linear decision function: VC dim = number of parameters (illustrated below)
- Non-linear decision function: VC dim may be smaller or larger than the number of parameters
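As an illustration of the definition (a standard result, not spelled out in the deck), for linear decision functions the VC dimension matches the parameter count:

```latex
% Standard fact assumed here (not from the original slides):
% a set of points is "shattered" if the function class realizes every possible labelling of it.
\[
  \mathcal{F} \;=\; \{\, x \mapsto \operatorname{sign}(w^{\top}x + b) \;:\; w \in \mathbb{R}^{d},\ b \in \mathbb{R} \,\}
  \qquad\Longrightarrow\qquad
  \operatorname{VCdim}(\mathcal{F}) \;=\; d + 1 .
\]
```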

6 Overview of SVM
- Structural-risk minimization (SRM)
  - Minimize the empirical risk (ER)
  - Control the VC dimension
  - Result: a tradeoff between ER and over-fitting (see the bound sketched below)
- Focus on the explicit problem of classification: find the optimal hyperplane dividing the two classes.
- Supervised learning
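For reference, one standard form of the VC generalization bound that motivates SRM; the notation here is mine, not the deck's. With probability at least 1 - η over a training set of size ℓ, for a function class of VC dimension h:

```latex
\[
  R(f) \;\le\; R_{\mathrm{emp}}(f) \;+\;
  \sqrt{\frac{h\left(\ln\frac{2\ell}{h} + 1\right) + \ln\frac{4}{\eta}}{\ell}}
\]
% R(f): expected risk; R_emp(f): empirical risk on the training set.
% SRM chooses the function class (VC dimension h) that minimizes the sum of the two terms.
```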

7 Margin and Support Vectors (SV)
- The case of two-category, linearly separable data.
- Small vs. large margin
Picture taken from (Ferguson 2004)

8 Margin and Support Vectors (SV)
- The case of two-category, linearly separable data.
- Find the hyperplane with the largest margin to the sample vectors of both classes.
- D(x) = w^T x + b; in augmented form D(x') = a^T x', with a = (b, w) and x' = (1, x).
- Multiple solutions exist in weight space: find the weights that give the largest margin (formulated below).
- The margin is determined by the SVs.
Picture taken from (Ferguson 2004)
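A common way to write the largest-margin problem sketched above, in the standard (w, b) form rather than the augmented a used on the slide:

```latex
\[
  \min_{w,\,b}\ \tfrac{1}{2}\,\lVert w\rVert^{2}
  \quad\text{subject to}\quad
  y_i\,(w^{\top}x_i + b) \ge 1, \qquad i = 1,\dots,n .
\]
% The geometric margin is 2 / ||w||, so minimizing ||w|| maximizes the margin.
% The samples for which the constraint holds with equality are the support vectors.
```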

9 Mathematical detail
- y_i D(x_i) >= 1, with y_i in {+1, -1}
- y_i D(x_i') / ||a|| >= margin, where D(x_i') = a^T x_i'
- Maximum margin -> minimum ||a||
- Quadratic programming: find the minimum ||a|| under linear constraints.
  - The weights are expressed through Lagrange multipliers.
  - The problem reduces to a dot-product-based dual problem (Kuhn-Tucker construction); see below.
- The parameters of the decision function, and its complexity, are completely determined by the SVs.
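The dot-product-based problem referred to above is the Wolfe dual, obtained through the Kuhn-Tucker conditions (standard derivation, not spelled out in the deck):

```latex
\[
  \max_{\alpha}\ \sum_{i} \alpha_i
  \;-\; \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, x_i^{\top} x_j
  \quad\text{s.t.}\quad \alpha_i \ge 0,\ \ \sum_i \alpha_i y_i = 0 ,
\]
\[
  D(x) \;=\; \sum_{i \in \mathrm{SV}} \alpha_i\, y_i\, x_i^{\top} x \;+\; b .
\]
% Only the support vectors have non-zero multipliers, so they alone determine D(x).
```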

10 Linearly non-separable case
- Example: the XOR problem
  - Sample set size: 4
  - VC dim of a linear decision function in 2-D: 3
Pictures taken from (Ferguson 2004)

11 Linearly non-separable case
- Map the data to a higher-dimensional space, where it becomes linearly separable, and make the linear decision there.
- Example: XOR in a 6-D feature space, with decision function D(x) = x_1 x_2 (a sketch follows below).
Picture taken from (Ferguson 2004)
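A minimal Python sketch of the idea on this slide: map the four XOR points through the usual degree-2 polynomial feature map (one common choice of 6-D space; the deck does not list its exact coordinates) and check that a linear SVM separates them there.

```python
import numpy as np
from sklearn.svm import SVC

# The four XOR samples (label +1 if x1*x2 > 0, else -1).
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([1, 1, -1, -1])

def phi(x):
    """Degree-2 polynomial feature map into a 6-D space (one common choice)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2,
                     x1 ** 2,
                     x2 ** 2])

# Linear SVM in the 6-D space: the data become linearly separable there,
# and the decision essentially reduces to the sign of x1 * x2.
Phi = np.array([phi(x) for x in X])
clf = SVC(kernel="linear", C=1e6).fit(Phi, y)
print(clf.predict(Phi))   # -> [ 1  1 -1 -1 ]
```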

12 Linearly non-separable case
- The separating hyperplane in both the original and the higher-D space (projected back onto the 2-D plane).
- All 4 samples are SVs.
Picture taken from (Ferguson 2004; Luo 2002)

13 Linearly non-separable case
- Modify the quadratic programming: "soft margin"
  - Slack variables: y_i D(x_i) >= 1 - ε_i
  - A penalty function on the slack; this puts an upper bound C on the Lagrange multipliers.
- Kernel function:
  - The dot product in the higher-D space, computed in terms of the original variables.
  - Yields a symmetric, positive semi-definite (Gram) matrix, satisfying Mercer's theorem.
  - Standard candidates: polynomial and Gaussian radial-basis function (sketched below).
  - The choice of kernel depends on prior knowledge.
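A minimal Python sketch of the two standard kernels named above, together with an empirical check that the resulting Gram matrix is symmetric and positive semi-definite (Mercer's condition); the sample points and parameter values are illustrative only.

```python
import numpy as np

def poly_kernel(x, z, degree=2, c=1.0):
    """Polynomial kernel: a dot product in a higher-D space, computed in the original space."""
    return (x @ z + c) ** degree

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian radial-basis-function kernel."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

# Build the Gram (kernel) matrix for a few samples and check Mercer's condition
# empirically: symmetric, with all eigenvalues >= 0.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])
assert np.allclose(K, K.T)
assert np.all(np.linalg.eigvalsh(K) >= -1e-10)
```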

14 Implementation with large sample sets
- Large computation: one Lagrange multiplier per sample.
- Reductionist approach (sketched below):
  - Divide the sample set into batches (subsets).
  - Accumulate the SV set from batch-by-batch operations.
  - Assumption: samples that are not local SVs are not global SVs either.
- Several algorithms, varying in subset size:
  - Vapnik: chunking algorithm
  - Osuna: Osuna's algorithm
  - Platt: SMO algorithm (only 2 samples per step; the most popular)
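A very rough Python sketch of the batch-wise (reductionist) idea described above, phrased here with scikit-learn's SVC rather than Vapnik's chunking, Osuna's, or Platt's actual algorithms; the batch size and the linear kernel are arbitrary choices.

```python
import numpy as np
from sklearn.svm import SVC

def chunked_svm(X, y, batch_size=200, C=1.0):
    """Rough sketch of the reductionist idea: keep only the support vectors
    found so far and carry them into the next batch (not the exact published algorithms)."""
    keep_X = np.empty((0, X.shape[1]))
    keep_y = np.empty((0,), dtype=y.dtype)
    clf = SVC(kernel="linear", C=C)
    for start in range(0, len(X), batch_size):
        batch_X = np.vstack([keep_X, X[start:start + batch_size]])
        batch_y = np.concatenate([keep_y, y[start:start + batch_size]])
        clf.fit(batch_X, batch_y)
        # Non-SVs are discarded here, assuming they would not be global SVs either.
        keep_X, keep_y = clf.support_vectors_, batch_y[clf.support_]
    return clf
```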

15 From 2-category to multi-category SVM
- There is no single uniform way to extend.
- Common ways (sketched below):
  - One-against-all
  - One-against-one, e.g. arranged as a binary tree
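A minimal sketch of the two common schemes using scikit-learn, whose SVC trains one-against-one classifiers internally and whose OneVsRestClassifier wraps one-against-all; the iris data is just a stand-in, and the binary-tree arrangement mentioned above is not shown.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)             # a small 3-class example dataset

ovo = SVC(kernel="rbf")                       # one-against-one internally
ova = OneVsRestClassifier(SVC(kernel="rbf"))  # explicit one-against-all wrapper

print(ovo.fit(X, y).score(X, y), ova.fit(X, y).score(X, y))
```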

16 Advantages of SVM
- Strong mathematical basis.
- The decision function and its complexity are completely determined by the SVs.
- Training time does not depend on the dimensionality of the (kernel-induced) feature space, only on the fixed input space.
- Good generalization; insensitive to the "curse of dimensionality".
- Versatile choice of kernel function.
- Feature-less classification: the kernel acts as a data-similarity measure.

17 Drawbacks of SVM
- Still relies on prior knowledge: the choices of C, kernel, and penalty function.
  - C: how strongly the decision function is adapted to avoid any training error.
  - Kernel: how much freedom (dimensionality) the SVM has to adapt itself.
- Overlapping classes.
- The reductionist approach may discard promising SVs at any batch step.
- Classification can be limited by the size of the problem.
- No uniform way to extend from 2-category to multi-category.
- "Still not an ideal, optimally generalizing classifier."

18 Applications
- Vapnik et al. at AT&T: handwritten digit recognition, with an error rate lower than that of ANN.
- Speech recognition
- Face recognition
- MIR
- SVM-light: an open-source C library.

19 Application example of SVM in MIR
- Li & Guo 2000 (Microsoft Research China)
- Problem: classify 16 classes of sounds in a database of 409 sounds.
- Features: concatenated perceptual and cepstral feature vectors.
- Similarity measure: distance from the (SV-based) boundary.
- Evaluation: average retrieval accuracy and average retrieval efficiency.

20 Application example of SVM in MIR
- Details in applying the SVM (a sketch follows below):
  - Both linear and kernel-based approaches are tested.
  - Kernel: exponential radial basis function; C = 200.
  - The corpus is randomly partitioned into training and test sets.
  - One-against-one (binary tree) for the multi-category task.
- Compared with other approaches:
  - NFL: Nearest Feature Line, an unsupervised approach.
  - Muscle Fish: normalized Euclidean metric with nearest-neighbor classification.
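A sketch of how a setup like the one above could be reproduced with scikit-learn; the random feature matrix stands in for the 409-sound corpus and its perceptual + cepstral features, and the exponential RBF kernel below uses one common definition, exp(-||a - b|| / (2σ²)), since it is not built into scikit-learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def erbf_kernel(A, B, sigma=1.0):
    """Exponential RBF kernel (one common definition): exp(-||a - b|| / (2*sigma^2))."""
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.exp(-dists / (2.0 * sigma ** 2))

# Placeholder data: 409 "sounds", 16 class labels; real features would be
# the concatenated perceptual and cepstral vectors described on the slide.
rng = np.random.default_rng(0)
X = rng.normal(size=(409, 24))
y = rng.integers(0, 16, size=409)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
clf = SVC(kernel=erbf_kernel, C=200)   # C = 200 as reported in the slides
print(clf.fit(X_tr, y_tr).score(X_te, y_te))
```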

21 Application example of SVM in MIR
- Comparison of average error rates: different feature sets across the different approaches.
Picture taken from (Li & Guo 2000)

22 Application example of SVM in MIR
- Complexity comparison
- SVM:
  - Training: required
  - Classification complexity: C(C-1)/2 binary classifiers (binary tree), where C is the number of classes.
  - Inner-class complexity: the number of SVs.
- NFL:
  - Training: none
  - Classification complexity: linear in the number of classes.
  - Inner-class complexity: Nc(Nc-1)/2, where Nc is the number of samples in a class.

23 Future work
- Speed up the quadratic programming.
- Choice of kernel functions.
- Find opportunities to solve problems that have so far been out of reach.
- Generalize the non-linear kernel approach to methods other than SVM, e.g. kernel PCA (principal component analysis); a sketch follows below.
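As a pointer to the kernel-PCA direction mentioned above, a minimal scikit-learn sketch on placeholder data:

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Kernel PCA: the same "kernel trick" applied to PCA rather than to a classifier.
X = np.random.default_rng(0).normal(size=(100, 10))   # placeholder data
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1)
X_2d = kpca.fit_transform(X)                          # nonlinear 2-D embedding
print(X_2d.shape)                                     # (100, 2)
```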

24 Bibliography
- Summary: http://www.music.mcgill.ca/~damonli/MUMT611/week9_summary.pdf
- HTML bibliography: http://www.music.mcgill.ca/~damonli/MUMT611/week9_bib.htm

