Support Vector Machine (SVM)  MUMT611  Beinan Li  Music McGill  2005-3-17

1 Support Vector Machine (SVM)  MUMT611  Beinan Li  Music McGill  2005-3-17

2 Content  Related problems in pattern classification  VC theory and VC dimension  Overview of SVM  Application example

3 Related problems in pattern classification  Small sample-size effect (peaking effect)  An overly small or overly large sample size results in large error.  Finite sample sets give inaccurate estimates of the probability densities that a typical Bayesian classifier needs for the whole population.  Training data vs. test data  Empirical risk vs. structural risk  Misclassifying yet-to-be-seen data. Picture taken from (Ridder 1997)

4 Related problems in pattern classification  Avoid solving a more general problem as an intermediate step (Vapnik 1995).  Classify without estimating probability densities.  ANN:  depends on prior knowledge.  Empirical risk minimization (ERM):  problem of generalization (over-fitting is hard to control).  Goal: a theoretical analysis of when ERM is valid.

5 VC theory and VC dimension  VC dimension (a measure of classifier complexity):  the maximum number of samples that a family of decision functions can shatter, i.e. separate under every possible labeling (see the sketch below).  Finite VC dimension implies consistency of ERM.  Theoretical basis of both ANN and SVM.  Linear decision function:  VC dim = number of free parameters.  Non-linear decision function:  VC dim need not equal the number of parameters; it can be smaller or even infinite.
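A minimal sketch of the shattering idea, assuming NumPy and scikit-learn (neither is mentioned in the slides): it brute-forces every labeling of 3 non-collinear points in 2-D and checks that a linear SVM separates each one, i.e. that the VC dimension of 2-D linear classifiers is at least 3.

# Brute-force check that a 2-D linear classifier shatters 3 points (VC dim >= 3).
import itertools
import numpy as np
from sklearn.svm import SVC

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # 3 points in general position

for labels in itertools.product([-1, 1], repeat=3):
    y = np.array(labels)
    if len(set(labels)) < 2:
        continue  # single-class labelings are trivially separable
    clf = SVC(kernel="linear", C=1e6)  # large C approximates a hard margin
    clf.fit(points, y)
    assert clf.score(points, y) == 1.0  # every labeling is separated perfectly

print("All labelings of 3 points are linearly separable -> VC dim >= 3")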

6 Overview of SVM  Structural risk minimization (SRM):  minimize the empirical risk while controlling the VC dimension.  Result: a tradeoff between empirical risk and over-fitting (one standard bound is sketched below).  Focus on the explicit problem of classification:  find the optimal hyperplane dividing the two classes.  Supervised learning.
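One standard form of the bound behind this tradeoff (quoted from the usual statistical-learning-theory references, not from the slides): with probability at least 1 - \eta, for a function class of VC dimension h trained on l samples,

R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha) + \sqrt{\frac{h\left(\ln\tfrac{2l}{h} + 1\right) - \ln\tfrac{\eta}{4}}{l}}

so SRM minimizes the sum of the empirical risk R_emp and a confidence term that grows with the VC dimension h.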

7 Margin and Support Vectors (SV)  The case of two-category, linearly separable data.  Small vs. large margin. Picture taken from (Ferguson 2004)

8 Margin and Support Vectors (SV)  The case of two-category, linearly separable data.  Find the hyperplane that has the largest margin to the sample vectors of both classes.  D(x) = w^T x + b, or in augmented notation D(x') = a^T x'.  Multiple solutions exist in weight space:  find the weight vector that gives the largest margin.  The margin is determined by the SVs. Picture taken from (Ferguson 2004)

9 Mathematical detail  Constraints: y_i D(x_i) >= 1, with y_i in {+1, -1}.  Margin: y_i D(x_i') / ||a|| >= margin, where D(x_i') = a^T x_i'.  Maximizing the margin -> minimizing ||a||.  Quadratic programming:  find the minimum ||a|| under the linear constraints.  The weights are expressed through Lagrange multipliers.  Via the Kuhn-Tucker conditions the problem reduces to a dual expressed purely in dot products (written out below).  The parameters of the decision function, and its complexity, are completely determined by the SVs.
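Written out explicitly, using the standard w, b notation rather than the augmented vector a (a standard formulation consistent with the slide, not copied from it):

\min_{w,b}\ \tfrac12\|w\|^2 \quad \text{s.t.}\quad y_i\,(w^\top x_i + b) \ge 1,\ \ i = 1,\dots,n

and, introducing Lagrange multipliers \alpha_i \ge 0, the dual dot-product problem

\max_{\alpha}\ \sum_i \alpha_i - \tfrac12 \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, x_i^\top x_j
\quad \text{s.t.}\quad \alpha_i \ge 0,\ \ \sum_i \alpha_i y_i = 0,

with decision function D(x) = \sum_{i \in \mathrm{SV}} \alpha_i y_i\, x_i^\top x + b, where only the support vectors have \alpha_i > 0.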

10 Linearly non-separable case  Example: the XOR problem.  Sample set size: 4.  VC dim of a linear classifier in 2-D is 3, so the 4 XOR samples cannot be separated linearly. Pictures taken from (Ferguson 2004)

11 Linearly non-separable case  Map the data to a higher-dimensional space.  The data become linearly separable in that higher-D space.  Make a linear decision in the higher-D space.  Example: XOR mapped to a 6-D quadratic feature space:  D(x) = x_1 x_2 (a sketch follows below). Picture taken from (Ferguson 2004)
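A minimal runnable sketch of this slide's example, assuming scikit-learn (the slides themselves do not use it): a degree-2 polynomial kernel (1 + x^T z)^2 corresponds to the 6-D quadratic feature map, and it separates the four XOR points.

# XOR becomes separable under the degree-2 polynomial kernel (1 + x.z)^2.
import numpy as np
from sklearn.svm import SVC

X = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
y = np.array([1, -1, -1, 1])  # labeling chosen so the separating function is D(x) = x_1 x_2, as above

clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1e6)  # (1 + x.z)^2
clf.fit(X, y)
print(clf.predict(X))     # reproduces the labels
print(len(clf.support_))  # all 4 samples end up as support vectors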

12 Linearly non-separable case  The separating hyperplane in both the original and the higher-D space (projected back onto the 2-D plane).  All 4 samples are SVs. Picture taken from (Ferguson 2004; Luo 2002)

13 Linearly non-separable case  Modify the quadratic program:  "Soft margin"  Slack variables: y_i D(x_i) >= 1 - ε_i.  Penalty term in the objective.  Upper bound C on the Lagrange multipliers.  Kernel function:  computes the dot product in the higher-D space in terms of the original inputs.  Yields a symmetric, positive semi-definite kernel matrix.  Must satisfy Mercer's theorem.  Standard candidates: polynomial, Gaussian radial basis function.  The choice of kernel depends on prior knowledge (the soft-margin/kernel formulation is written out below).
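The soft-margin/kernel version, written out in a standard form (consistent with the slide, not taken from it): the primal problem becomes

\min_{w,b,\varepsilon}\ \tfrac12\|w\|^2 + C\sum_i \varepsilon_i
\quad \text{s.t.}\quad y_i\,D(x_i) \ge 1 - \varepsilon_i,\ \ \varepsilon_i \ge 0,

and its dual replaces the dot product with the kernel and caps the multipliers at C:

\max_{\alpha}\ \sum_i \alpha_i - \tfrac12 \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, K(x_i, x_j)
\quad \text{s.t.}\quad 0 \le \alpha_i \le C,\ \ \sum_i \alpha_i y_i = 0,

with e.g. K(x,z) = (x^\top z + 1)^d (polynomial) or K(x,z) = \exp\!\big(-\|x - z\|^2 / 2\sigma^2\big) (Gaussian RBF).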

14 Implementation with large sample sets  Large computation: one Lagrange multiplier per sample.  Reductionist approach (a simplified sketch follows below):  divide the sample set into batches (subsets).  Accumulate the SV set from batch-by-batch operations.  Assumption: samples that are not SVs locally are not global SVs either.  Several algorithms that vary in the size of the subsets:  Vapnik: chunking algorithm.  Osuna: decomposition algorithm.  Platt: SMO (sequential minimal optimization) algorithm.  Only 2 samples per sub-problem.  Most popular.
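A deliberately simplified sketch of the batch ("reductionist") idea, assuming scikit-learn; it illustrates the accumulate-the-SVs strategy only, and is not Vapnik's chunking, Osuna's decomposition, or Platt's SMO.

# Fit on a batch, keep only its support vectors, append the next batch, refit.
import numpy as np
from sklearn.svm import SVC

def chunked_svm(X, y, batch_size=200, C=1.0, kernel="rbf"):
    kept_X = np.empty((0, X.shape[1]))
    kept_y = np.empty((0,), dtype=y.dtype)
    clf = SVC(C=C, kernel=kernel)
    for start in range(0, len(X), batch_size):
        batch_X = np.vstack([kept_X, X[start:start + batch_size]])
        batch_y = np.concatenate([kept_y, y[start:start + batch_size]])
        clf.fit(batch_X, batch_y)        # assumes each batch contains both classes
        kept_X = batch_X[clf.support_]   # keep only the current support vectors
        kept_y = batch_y[clf.support_]
    return clf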

15 From 2-category to multi-category SVM  No uniform way to extend the binary SVM to multiple classes.  Common ways (sketched below):  One-against-all.  One-against-one: binary tree.
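Both extensions, sketched with scikit-learn wrappers (an assumption; the slides do not prescribe a library). SVC itself already uses one-against-one internally; the explicit wrappers make the two strategies visible.

# One-against-all vs. one-against-one on a small synthetic 3-class problem.
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           n_classes=3, random_state=0)
base = SVC(kernel="rbf", C=1.0)
ova = OneVsRestClassifier(base).fit(X, y)  # one classifier per class
ovo = OneVsOneClassifier(base).fit(X, y)   # C*(C-1)/2 pairwise classifiers
print(ova.score(X, y), ovo.score(X, y))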

16 Advantages of SVM  Strong mathematical basis.  The decision function and its complexity are completely determined by the SVs.  Training time does not depend on the dimensionality of the feature space, only on the fixed input space.  Good generalization.  Relatively insensitive to the "curse of dimensionality".  Versatile choice of kernel functions.  Feature-less classification:  the kernel acts as a data-similarity measure.

17 Drawbacks of SVM  Still relies on prior knowledge:  the choices of C, kernel and penalty function.  C: how strongly the decision function adapts to avoid any training error.  Kernel: how much freedom the SVM has to adapt itself (dimensionality).  Overlapping classes remain difficult.  The reductionist approach may discard promising SVs at any batch step.  Classification can be limited by the size of the problem.  No uniform way to extend from 2-category to multi-category.  "Still not an ideal, optimally-generalizing classifier."

18 Applications  Vapnik et al. at AT&T:  handwritten digit recognition.  Error rate lower than that of ANNs.  Speech recognition.  Face recognition.  MIR.  SVM-light: an open-source C library.

19 Application example of SVM in MIR  Li & Guo 2000 (Microsoft Research China).  Problem:  classify 16 classes of sounds in a database of 409 sounds.  Features:  concatenated perceptual and cepstral feature vectors.  Similarity measure:  distance from the boundary (an SV-based boundary).  Evaluation:  average retrieval accuracy.  Average retrieval efficiency.

20 Application example of SVM in MIR  Details of applying SVM (a comparable setup is sketched below):  both linear and kernel-based approaches were tested.  Kernel: exponential radial basis function (ERBF).  C = 200.  The corpus is randomly partitioned into training/test sets.  One-against-one (binary tree) in the multi-category task.  Compared with other approaches:  NFL: Nearest Feature Line, an unsupervised approach.  Muscle Fish: normalized Euclidean metric with nearest-neighbor classification.
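A hedged sketch of a comparable setup, not the authors' code: scikit-learn's built-in "rbf" kernel is the Gaussian form, so the exponential RBF is supplied as a callable kernel; the audio features and the 409-sound corpus are replaced by random placeholder data, so the printed accuracy will be near chance and is only meant to show the plumbing.

# Exponential RBF kernel, C = 200, random train/test split, one-against-one (SVC's default).
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def erbf_kernel(A, B, sigma=1.0):
    return np.exp(-cdist(A, B) / (2 * sigma ** 2))  # exp(-||a-b|| / 2*sigma^2), distance not squared

rng = np.random.default_rng(0)
X = rng.normal(size=(409, 20))      # placeholder for the perceptual+cepstral features
y = rng.integers(0, 16, size=409)   # placeholder for the 16 sound classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
clf = SVC(C=200, kernel=erbf_kernel).fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))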

21 Application example of SVM in MIR  Comparison of average error rates:  different feature sets across the different approaches. Picture taken from (Li & Guo 2000)

22 Application example of SVM in MIR  Complexity comparison (C = number of classes, Nc = samples per class):  SVM:  training: yes.  Classification complexity: C * (C-1) / 2 binary classifiers (binary tree).  Within-class complexity: number of SVs.  NFL:  training: no.  Classification complexity: linear in the number of classes.  Within-class complexity: Nc * (Nc-1) / 2 feature lines.

23 Future work  Speed up the quadratic programming.  Choice of kernel functions.  Find opportunities for solving previously intractable problems.  Generalize the non-linear kernel approach to methods other than SVM,  e.g. kernel PCA (principal component analysis); a small sketch follows below.
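A small example of the kernel trick outside SVM, assuming scikit-learn's KernelPCA (the slides only name the idea):

# Nonlinear principal components via an RBF kernel.
import numpy as np
from sklearn.decomposition import KernelPCA

X = np.random.default_rng(0).normal(size=(100, 5))   # placeholder data
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.5)
X_2d = kpca.fit_transform(X)
print(X_2d.shape)   # (100, 2)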

24 Bibliography  Summary:  k9_summary.pdf  HTML bibliography:  k9_bib.htm