Slide 1: Support Vector Machine (SVM)
MUMT611, Beinan Li
Music Tech @ McGill, 2005-3-17
Slide 2: Content
- Related problems in pattern classification
- VC theory and VC dimension
- Overview of SVM
- Application example
Slide 3: Related problems in pattern classification
- Small sample-size effect (peaking effect): an overly small or overly large sample size results in large error.
- In a typical Bayesian classifier, probability densities for the global population are estimated inaccurately from finite sample sets.
- Training data vs. test data
- Empirical risk vs. structural risk
- Misclassifying yet-to-be-seen data
Picture taken from (Ridder 1997)
Slide 4: Related problems in pattern classification
- Avoid solving a more general problem as an intermediate step (Vapnik 1995): classify without estimating probability densities.
- ANN: depends on prior knowledge; follows the empirical-risk method (ERM).
- ERM has a generalization problem: over-fitting is hard to control.
- Goal: a theoretical analysis of the validity of ERM.
Slide 5: VC theory and VC dimension
- VC dimension (classifier complexity): the maximum size of a sample set that a decision function can separate under every possible labeling.
- A finite VC dimension guarantees the consistency of ERM.
- Theoretical basis of both ANN and SVM.
- Linear decision function: VC dimension = number of parameters.
- Non-linear decision function: VC dimension <= number of parameters.
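To make the linear case concrete: a linear decision function in the 2-D plane has three parameters (two weights and a bias) and VC dimension 3. Any three points in general position can be given arbitrary labels and still be separated, but no set of four points can; the XOR configuration on slide 10 is the standard counterexample.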
Slide 6: Overview of SVM
- Structural-risk minimization (SRM): minimize the empirical risk while controlling the VC dimension.
- Result: a tradeoff between empirical risk and over-fitting.
- Focuses on the explicit classification problem: find the optimal hyperplane dividing two classes.
- Supervised learning.
Slide 7: Margin and Support Vectors (SV)
- The 2-category, linearly-separable case.
- Small vs. large margin.
Picture taken from (Ferguson 2004)
Slide 8: Margin and Support Vectors (SV)
- The 2-category, linearly-separable case.
- Find the hyperplane with the largest margin to the sample vectors of both classes.
- Decision function D(x) = w^T x + b, or D(x') = a^T x' in augmented notation (a collects w and b).
- Multiple solutions exist in weight space: find the weight that gives the largest margin.
- The margin is determined by the support vectors (see the sketch below).
Picture taken from (Ferguson 2004)
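As a concrete illustration of this slide, a minimal sketch using scikit-learn (not part of the original presentation); the toy data, the choice of C, and all variable names are assumptions made for demonstration only:

```python
# Illustrative sketch: a hard-margin-style linear SVM on toy 2-D data.
# A very large C approximates the hard-margin case for separable data.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.5],        # class +1
              [-1.0, -1.0], [-2.0, -2.5], [-3.0, -1.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                   # weight vector w
b = clf.intercept_[0]              # bias b, so D(x) = w.x + b
margin = 2.0 / np.linalg.norm(w)   # width between the hyperplanes D(x) = +1 and D(x) = -1

print("support vectors:\n", clf.support_vectors_)
print("w =", w, "b =", b, "margin =", margin)
```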
Slide 9: Mathematical detail
- Constraints: y_i D(x_i) >= 1, with labels y_i in {+1, -1}.
- Margin condition: y_i D(x_i') / ||a|| >= margin, where D(x_i') = a^T x_i'.
- Maximum margin corresponds to minimum ||a||.
- Quadratic programming: find the minimum ||a|| under the linear constraints.
- Each constraint receives a Lagrange multiplier; the Kuhn-Tucker conditions reduce the problem to an unconstrained, dot-product-based form (spelled out below).
- The parameters of the decision function, and its complexity, are completely determined by the SVs.
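For reference, since the slide describes the optimization only verbally, here is the standard textbook primal and dual of the hard-margin SVM (the generic form, not notation taken from the slides; the slide's augmented vector a collects w and b):

```latex
% Primal: maximize the margin by minimizing the weight norm under the constraints.
\[
  \min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^{2}
  \quad\text{s.t.}\quad y_i\bigl(w^{\mathsf T}x_i + b\bigr)\ \ge\ 1,
  \qquad i = 1,\dots,n.
\]
% One Lagrange multiplier \alpha_i per constraint; the Kuhn--Tucker conditions
% lead to a dual that depends on the data only through dot products:
\[
  \max_{\alpha}\ \sum_{i}\alpha_i
    \;-\; \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j\,y_i y_j\,\bigl(x_i^{\mathsf T}x_j\bigr)
  \quad\text{s.t.}\quad \alpha_i \ge 0,\qquad \sum_{i}\alpha_i y_i = 0.
\]
```

Samples with a non-zero multiplier are exactly the support vectors, which is why they alone determine the decision function.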
Slide 10: Linearly non-separable case
- Example: the XOR problem.
- Sample set size: 4, but a linear decision function in the plane has VC dimension 3, so the four samples cannot be separated linearly (sketch below).
Pictures taken from (Ferguson 2004)
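A brute-force check of this claim, written as a hedged sketch: it uses scipy's linear-programming solver (an assumption; the slides name no tools) to test whether a labeling of points is linearly separable, and confirms that three points in general position can be shattered while the four XOR points cannot:

```python
# Separability of a labeling is tested as an LP feasibility problem:
# find (w, b) with y_i * (w . x_i + b) >= 1 for all i.
from itertools import product
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    """True if some (w, b) satisfies y_i * (w . x_i + b) >= 1 for all i."""
    # Rewrite the constraints as  -y_i * [x_i, 1] . [w, b] <= -1  for linprog.
    A_ub = -(y[:, None] * np.hstack([X, np.ones((len(X), 1))]))
    b_ub = -np.ones(len(X))
    res = linprog(c=np.zeros(X.shape[1] + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (X.shape[1] + 1))
    return res.success

def shattered(X):
    """True if every +1/-1 labeling of the points in X is linearly separable."""
    return all(linearly_separable(X, np.array(lab))
               for lab in product([1, -1], repeat=len(X)))

three_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
xor_points   = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])

print(shattered(three_points))  # True  -> 3 points can be shattered
print(shattered(xor_points))    # False -> the XOR labeling is not separable
```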
Slide 11: Linearly non-separable case
- Map the data to a higher-dimensional space in which they become linearly separable, and make the linear decision there.
- Example: XOR mapped to a 6-D space; the decision function reduces to D(x) = x_1 x_2 (sketch below).
Picture taken from (Ferguson 2004)
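A minimal numpy sketch of this mapping (illustrative only; the feature ordering and the {-1, +1} input encoding are assumptions): the degree-2 polynomial map sends 2-D inputs into a 6-D space, and there the single product coordinate x1*x2 already separates the XOR data, matching the slide's D(x) = x_1 x_2:

```python
import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([1, -1, -1, 1])          # XOR: equal signs -> +1, different -> -1

# Degree-2 polynomial feature map of (x1, x2): (1, x1, x2, x1^2, x2^2, x1*x2).
def phi(x):
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x1, x2 * x2, x1 * x2])

Phi = np.vstack([phi(x) for x in X])

# In this 6-D space the weight vector (0, 0, 0, 0, 0, 1) separates the data,
# i.e. the decision function is the sign of x1 * x2:
w = np.array([0, 0, 0, 0, 0, 1.0])
print(np.sign(Phi @ w))               # [ 1. -1. -1.  1.]  == y
```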
Slide 12: Linearly non-separable case
- The hyperplane in both the original and the higher-dimensional space (projected back onto the 2-D plane).
- All 4 samples are SVs.
Picture taken from (Ferguson 2004; Luo 2002)
Slide 13: Linearly non-separable case
- Modify the quadratic programming: "soft margin".
  - Slack variables: y_i D(x_i) >= 1 - ε_i, with a penalty function.
  - Upper bound C on the Lagrange multipliers.
- Kernel function: the dot product in the higher-dimensional space, expressed in terms of the original parameters.
  - Must yield a symmetric, positive semi-definite matrix, i.e. satisfy Mercer's theorem.
  - Standard candidates: polynomial and Gaussian radial-basis-function kernels.
  - The choice of kernel depends on prior knowledge (see the sketch below).
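A small scikit-learn sketch of the soft-margin machinery described above; the library, the toy data, and the parameter values are assumptions for illustration, not choices made in the presentation:

```python
# The soft-margin SVM exposes exactly the two knobs named on the slide:
# the slack-penalty upper bound C and the kernel.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # non-linear boundary

# Polynomial and Gaussian-RBF kernels, the two "standard candidates":
poly_svm = SVC(kernel="poly", degree=2, C=10.0).fit(X, y)
rbf_svm  = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)

print("polynomial-kernel SVs:", poly_svm.n_support_.sum())
print("RBF-kernel SVs:       ", rbf_svm.n_support_.sum())
```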
Slide 14: Implementation with large sample sets
- Large computation: one Lagrange multiplier per sample.
- Reductionist approach: divide the sample set into batches (subsets) and accumulate the SV set from batch-by-batch operations (simplified sketch below).
  - Assumption: samples that are not SVs locally are not SVs globally either.
- Several algorithms that vary in the size of the subsets:
  - Vapnik: chunking algorithm
  - Osuna: Osuna's algorithm
  - Platt: SMO algorithm (only 2 samples per operation; the most popular)
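A deliberately simplified sketch of the reductionist idea (not Vapnik's chunking algorithm, Osuna's algorithm, or SMO themselves): train on one batch at a time with an off-the-shelf solver, keep only the support vectors found so far, and feed them into the next batch. Everything here (scikit-learn, batch size, toy data) is an assumption for illustration:

```python
import numpy as np
from sklearn.svm import SVC

def chunked_svm(X, y, batch_size=100, C=10.0, kernel="rbf"):
    """Heuristic batch-by-batch training that carries support vectors forward."""
    sv_X = np.empty((0, X.shape[1]))
    sv_y = np.empty((0,), dtype=y.dtype)
    clf = None
    for start in range(0, len(X), batch_size):
        batch_X = np.vstack([sv_X, X[start:start + batch_size]])
        batch_y = np.concatenate([sv_y, y[start:start + batch_size]])
        clf = SVC(C=C, kernel=kernel).fit(batch_X, batch_y)
        # Keep only the SVs of this subproblem, per the slide's assumption
        # that local non-SVs are not global SVs either.
        sv_X, sv_y = batch_X[clf.support_], batch_y[clf.support_]
    return clf

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)
model = chunked_svm(X, y)
print("final number of support vectors:", len(model.support_))
```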
Slide 15: From 2-category to multi-category SVM
- No uniform way to extend.
- Common ways: one-against-all; one-against-one (binary tree), as sketched below.
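A short sketch of the two common extensions, using scikit-learn's meta-estimators as a stand-in (the slides do not prescribe any library; the iris data and parameter values are placeholders):

```python
# One-against-all trains one binary SVM per class;
# one-against-one trains one binary SVM per pair of classes.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # 3 classes, used only as a stand-in

ova = OneVsRestClassifier(SVC(kernel="rbf", C=1.0)).fit(X, y)
ovo = OneVsOneClassifier(SVC(kernel="rbf", C=1.0)).fit(X, y)

print("one-against-all binary classifiers:", len(ova.estimators_))  # 3
print("one-against-one binary classifiers:", len(ovo.estimators_))  # 3*2/2 = 3
```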
Slide 16: Advantages of SVM
- Strong mathematical basis.
- The decision function and its complexity are completely determined by the SVs.
- Training time does not depend on the dimensionality of the feature space, only on the fixed input space.
- Good generalization; insensitive to the "curse of dimensionality".
- Versatile choice of kernel function.
- Feature-less classification: the kernel acts as a data-similarity measure.
Slide 17: Drawbacks of SVM
- Still relies on prior knowledge: the choices of C, kernel, and penalty function.
  - C: how strongly the decision function adapts to avoid any training error.
  - Kernel: how much freedom the SVM has to adapt itself (the dimensionality of the feature space).
- Overlapping classes.
- The reductionist approach may discard promising SVs at any batch step, so classification can be limited by the size of the problem.
- No uniform way to extend from 2-category to multi-category.
- "Still not an ideal optimally-generalizing classifier."
Slide 18: Applications
- Vapnik et al. at AT&T: handwritten digit recognition; error rate lower than that of ANN.
- Speech recognition
- Face recognition
- MIR
- SVM-light: open-source C library
Slide 19: Application example of SVM in MIR
- Li and Guo (2000), Microsoft Research China.
- Problem: classify 16 classes of sounds in a database of 409 sounds.
- Features: concatenated perceptual and cepstral feature vectors.
- Similarity measure: distance from the SV-based class boundary.
- Evaluation: average retrieval accuracy and average retrieval efficiency.
Slide 20: Application example of SVM in MIR
- Details of the SVM setup:
  - Both linear and kernel-based approaches were tested.
  - Kernel: exponential radial basis function; C = 200.
  - The corpus was randomly partitioned into training and test sets.
  - One-against-one (binary tree) for the multi-category task (illustrative sketch below).
- Compared with other approaches:
  - NFL: Nearest Feature Line, an unsupervised approach.
  - Muscle Fish: normalized Euclidean metric with a nearest-neighbor rule.
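A hedged reconstruction of the reported setup, not the authors' code: an exponential radial-basis-function kernel k(x, z) = exp(-gamma * ||x - z||) passed to scikit-learn as a custom kernel with C = 200. The gamma value, the 50/50 random split, and the random stand-in data are assumptions; only the kernel family, C, the class count, and the corpus size come from the slides:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def exp_rbf_kernel(X, Z, gamma=0.1):
    """Exponential RBF: exp(-gamma * ||x - z||), i.e. L2 distance, not squared."""
    dists = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=-1)
    return np.exp(-gamma * dists)

# Random stand-in for the concatenated perceptual + cepstral feature vectors,
# so the printed accuracy is meaningless; it only shows the plumbing.
rng = np.random.default_rng(0)
X = rng.normal(size=(409, 20))
y = rng.integers(0, 16, size=409)          # 16 sound classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

clf = SVC(C=200, kernel=exp_rbf_kernel).fit(X_train, y_train)
print("test accuracy on the stand-in data:", clf.score(X_test, y_test))
```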
Slide 21: Application example of SVM in MIR
- Comparison of average error rates: different feature sets across the different approaches.
Picture taken from (Li & Guo 2000)
Slide 22: Application example of SVM in MIR
- Complexity comparison:
  - SVM: requires training; classification complexity C * (C - 1) / 2 with the binary tree (here C = number of classes); inner-class complexity given by the number of SVs.
  - NFL: no training; classification complexity linear in the number of classes; inner-class complexity N_c * (N_c - 1) / 2.
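To put numbers on the comparison: with the 16 sound classes of the task above, the one-against-one SVM needs 16 * 15 / 2 = 120 binary classifiers, whereas NFL needs no training but evaluates, for each query, the feature lines built from the N_c * (N_c - 1) / 2 prototype pairs within each of the 16 classes.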
Slide 23: Future work
- Speed up the quadratic programming.
- Choice of kernel functions.
- Find opportunities to attack previously intractable problems.
- Generalize the non-linear kernel approach beyond SVM, e.g. kernel PCA (principal component analysis).
Slide 24: Bibliography
- Summary: http://www.music.mcgill.ca/~damonli/MUMT611/week9_summary.pdf
- HTML bibliography: http://www.music.mcgill.ca/~damonli/MUMT611/week9_bib.htm