Kernels Slides from Andrew Moore and Mingyue Tan.


Kernels Slides from Andrew Moore and Mingyue Tan

How to classify non-linearly separable datasets?

Non-linear SVMs Datasets that are linearly separable (perhaps with some noise) work out great. But what are we going to do if the dataset is just too hard? How about mapping the data to a higher-dimensional space? (Figure: 1-D points on the x axis that no single threshold can split become separable once each point x is mapped to (x, x²).)
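
Below is a minimal sketch (toy data, not from the slides) of the idea in the figure: 1-D points that no single threshold separates become linearly separable once each x is lifted to (x, x²).

```python
# Lifting 1-D data to 2-D with the map x -> (x, x^2).
import numpy as np

x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([ 1,    1,   -1,  -1,  -1,   1,   1 ])   # +1 outside [-1, 1], -1 inside

# No single threshold on x separates the classes...
# ...but in the lifted space (x, x^2) the horizontal line x^2 = 1.5 does.
phi = np.column_stack([x, x**2])
pred = np.where(phi[:, 1] > 1.5, 1, -1)
print(np.all(pred == y))   # True: the lifted data is linearly separable
```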

Non-linear SVMs: Feature spaces General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x)

Nonlinear SVMs: The Kernel Trick With this mapping, our discriminant function is now g(x) = w^T φ(x) + b = Σ_{i∈SV} α_i y_i φ(x_i)^T φ(x) + b. There is no need to know this mapping explicitly, because we only use the dot product of feature vectors in both training and testing. A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space: K(x_i, x_j) = φ(x_i)^T φ(x_j).

Nonlinear SVMs: The Kernel Trick An example with 2-dimensional vectors x = [x_1, x_2]. Let K(x_i, x_j) = (1 + x_i^T x_j)². We need to show that K(x_i, x_j) = φ(x_i)^T φ(x_j):
K(x_i, x_j) = (1 + x_i^T x_j)²
= 1 + x_i1² x_j1² + 2 x_i1 x_j1 x_i2 x_j2 + x_i2² x_j2² + 2 x_i1 x_j1 + 2 x_i2 x_j2
= [1, x_i1², √2 x_i1 x_i2, x_i2², √2 x_i1, √2 x_i2]^T [1, x_j1², √2 x_j1 x_j2, x_j2², √2 x_j1, √2 x_j2]
= φ(x_i)^T φ(x_j), where φ(x) = [1, x_1², √2 x_1 x_2, x_2², √2 x_1, √2 x_2]
(This slide is courtesy of …)
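
As a quick numerical check (the two example vectors below are chosen arbitrarily), the expansion above can be verified in a few lines:

```python
# Verify that the degree-2 polynomial kernel equals the dot product
# of the expanded feature vectors phi(x) given above.
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2)*x1*x2, x2**2, np.sqrt(2)*x1, np.sqrt(2)*x2])

xi = np.array([0.7, -1.2])
xj = np.array([2.0,  0.5])

K = (1 + xi @ xj) ** 2
print(K, phi(xi) @ phi(xj))              # the two numbers agree (3.24)
print(np.isclose(K, phi(xi) @ phi(xj)))  # True
```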

Modification Due to Kernel Function Change all inner products to kernel functions. For training:
Original: maximize Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0
With kernel function: maximize Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j) subject to the same constraints

Modification Due to Kernel Function For testing, the new data z is classified as class 1 if f ≥ 0, and as class 2 if f < 0.
Original: f(z) = w^T z + b = Σ_{i∈SV} α_i y_i x_i^T z + b
With kernel function: f(z) = Σ_{i∈SV} α_i y_i K(x_i, z) + b

An Example for φ(.) and K(.,.) Suppose φ(.) is given as follows: φ([x_1, x_2]) = [1, √2 x_1, √2 x_2, x_1², x_2², √2 x_1 x_2]. An inner product in the feature space is then φ([x_1, x_2])^T φ([y_1, y_2]) = (1 + x_1 y_1 + x_2 y_2)². So, if we define the kernel function as K(x, y) = (1 + x_1 y_1 + x_2 y_2)², there is no need to carry out φ(.) explicitly. This use of a kernel function to avoid carrying out φ(.) explicitly is known as the kernel trick.

Kernel Functions In practical use of SVM, the user specifies the kernel function; the transformation φ(.) is not explicitly stated. Given a kernel function K(x_i, x_j), the transformation φ(.) is given by its eigenfunctions (a concept in functional analysis). Eigenfunctions can be difficult to construct explicitly; this is why people only specify the kernel function without worrying about the exact transformation. Another view: a kernel function, being an inner product, is really a similarity measure between objects.

Examples of Kernel Functions Polynomial kernel with degree d: K(x, z) = (x^T z + 1)^d. Radial basis function (RBF) kernel with width σ: K(x, z) = exp(-‖x - z‖² / (2σ²)); closely related to radial basis function neural networks; the corresponding feature space is infinite-dimensional. Sigmoid kernel with parameters κ and θ: K(x, z) = tanh(κ x^T z + θ).
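
A small sketch of the three kernels listed above, each producing an n-by-n kernel matrix from an n-by-d data matrix X (the data and the parameter values d, σ, κ, θ are assumptions for illustration):

```python
import numpy as np

X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])   # n = 3 toy points

def poly_kernel(X, d=2):
    return (X @ X.T + 1) ** d                          # (xi.T xj + 1)^d

def rbf_kernel(X, sigma=1.0):
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T       # squared pairwise distances
    return np.exp(-d2 / (2 * sigma**2))                # exp(-||xi - xj||^2 / (2 sigma^2))

def sigmoid_kernel(X, kappa=0.1, theta=-1.0):
    return np.tanh(kappa * X @ X.T + theta)            # tanh(kappa xi.T xj + theta)

print(poly_kernel(X))
print(rbf_kernel(X))
print(sigmoid_kernel(X))
```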

Nonlinear SVMs: The Kernel Trick Mercer's Theorem: Let X be a finite input space with K(x, z) a symmetric function on X. Then K(x, z) is a kernel function if and only if the Gram matrix K, with entries K_ij = K(x_i, x_j) for i, j = 1, …, N, is positive semi-definite (has non-negative eigenvalues).
Gram matrix:
K = | K(x_1,x_1)  K(x_1,x_2)  K(x_1,x_3)  …  K(x_1,x_N) |
    | K(x_2,x_1)  K(x_2,x_2)  K(x_2,x_3)  …  K(x_2,x_N) |
    | …                                                 |
    | K(x_N,x_1)  K(x_N,x_2)  K(x_N,x_3)  …  K(x_N,x_N) |
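
A sketch of this check in code: build the Gram matrix of a candidate kernel on some sample points (random toy data, with the RBF kernel as the candidate) and confirm that its eigenvalues are non-negative up to numerical tolerance.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

sq = np.sum(X**2, axis=1)
d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
K = np.exp(-d2 / 2.0)                       # Gram matrix K[i, j] = K(x_i, x_j)

eigvals = np.linalg.eigvalsh(K)             # symmetric matrix, so use eigvalsh
print(eigvals.min() >= -1e-10)              # True: positive semi-definite
```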

More on Kernel Functions Not every similarity measure can be used as a kernel function, however. The kernel function needs to satisfy the Mercer condition, i.e., the function must be "positive semi-definite". This implies that the n-by-n kernel matrix, whose (i, j)-th entry is K(x_i, x_j), is always positive semi-definite. This also means that the QP is convex and can be solved in polynomial time.

Example Suppose we have five 1-D data points: x_1 = 1, x_2 = 2, x_3 = 4, x_4 = 5, x_5 = 6, with 1, 2, 6 as class 1 and 4, 5 as class 2, i.e. y_1 = 1, y_2 = 1, y_3 = -1, y_4 = -1, y_5 = 1. We use the polynomial kernel of degree 2, K(x_i, x_j) = (x_i x_j + 1)², and C is set to 100. We first find α_i (i = 1, …, 5) by maximizing Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j) subject to 0 ≤ α_i ≤ 100 and Σ_i α_i y_i = 0.

Example By using a QP solver, we get α_1 = 0, α_2 = 2.5, α_3 = 0, α_4 = 7.333, α_5 = 4.833. Note that the constraints are indeed satisfied. The support vectors are {x_2 = 2, x_4 = 5, x_5 = 6}. The discriminant function is f(z) = 2.5 (1)(2z + 1)² + 7.333 (-1)(5z + 1)² + 4.833 (1)(6z + 1)² + b = 0.6667 z² - 5.333 z + b. b is recovered by solving f(2) = 1, or f(5) = -1, or f(6) = 1, since x_2 and x_5 lie on the margin line f(z) = +1 and x_4 lies on the margin line f(z) = -1. All three give b = 9, so f(z) = 0.6667 z² - 5.333 z + 9.
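
Plugging these numbers back in gives a quick verification sketch (not part of the original slides) that the discriminant function behaves as stated: f equals +1 and -1 on the support vectors and classifies all five points correctly.

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0, 6.0])
y = np.array([1, 1, -1, -1, 1])
alpha = np.array([0.0, 2.5, 0.0, 22/3, 29/6])   # 7.333 and 4.833 as exact fractions
b = 9.0

def f(z):
    return np.sum(alpha * y * (x * z + 1) ** 2) + b   # sum_i alpha_i y_i K(x_i, z) + b

print([round(f(z), 3) for z in x])     # [4.333, 1.0, -1.667, -1.0, 1.0]
print(np.sign([f(z) for z in x]))      # [ 1.  1. -1. -1.  1.]  matches the labels
```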

Properties of SVM
- Flexibility in choosing a similarity function
- Sparseness of solution when dealing with large data sets: only support vectors are used to specify the separating hyperplane
- Ability to handle large feature spaces: complexity does not depend on the dimensionality of the feature space
- Overfitting can be controlled by the soft margin approach
- Nice math property: a simple convex optimization problem which is guaranteed to converge to a single global solution
- Feature selection

SVM Applications SVM has been used successfully in many real-world problems:
- text (and hypertext) categorization
- image classification
- bioinformatics (protein classification, cancer classification)
- hand-written character recognition

Some Issues Choice of kernel:
- Start with a linear kernel; a Gaussian or polynomial kernel is the usual default
- If ineffective, more elaborate kernels are needed
- Domain experts can give assistance in formulating appropriate similarity measures
Choice of kernel parameters:
- e.g. σ in the Gaussian kernel
- One heuristic: set σ to the distance between the closest points with different classifications (see the sketch below)
- In the absence of reliable criteria, applications rely on a validation set or cross-validation to set such parameters
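
A sketch of that σ heuristic on assumed toy data: take the distance between the closest pair of points that carry different labels.

```python
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 0.2], [3.0, 3.0], [3.5, 2.8]])
y = np.array([1, 1, -1, -1])

diff = y[:, None] != y[None, :]                               # pairs with different labels
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
sigma = dists[diff].min()
print(sigma)    # ~3.44: the closest cross-class pair is (1.0, 0.2) vs (3.0, 3.0)
```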

Weakness of SVM Sensitive to noise: a relatively small number of mislabeled examples can dramatically decrease performance. Sensitive to class imbalance; possible remedies:
- Resampling methods (e.g. support cluster machines)
- Ensemble learning methods

Multiclass SVM One vs. All (OVA) -> m SVMs
- SVM 1 learns "Output == 1" vs. "Output != 1"
- SVM 2 learns "Output == 2" vs. "Output != 2"
- …
- SVM m learns "Output == m" vs. "Output != m"
To predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region (see the sketch below). What is the problem here?
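
A minimal sketch of the OVA decision rule; the per-class scorers below are hypothetical stand-ins for the m trained SVM decision functions, not real models.

```python
import numpy as np

def ova_predict(x, decision_functions):
    scores = np.array([f(x) for f in decision_functions])   # one decision value per class
    return int(np.argmax(scores))                            # most-positive SVM wins

# toy stand-in "SVMs" for 3 classes, each scoring by closeness to a class centre
fs = [lambda x, c=c: 1.0 - np.linalg.norm(x - c)
      for c in (np.zeros(2), np.ones(2), 2 * np.ones(2))]
print(ova_predict(np.array([1.1, 0.9]), fs))   # prints 1
```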

Multiclass SVM One vs. One (OVO) -> m(m-1)/2 SVMs
- SVM 1 learns "Output == 1" vs. "Output == 2"
- SVM 2 learns "Output == 1" vs. "Output == 3"
- …
- The last SVM learns "Output == m-1" vs. "Output == m"
To predict the output for a new input, predict with each pairwise SVM and choose the class that wins the most pairwise comparisons (majority voting).

Multiclass SVM Directed Acyclic Graph (DAG) -> m-1 SVM evaluations at test time
- Training is the same as for OVO
- During testing, create a list L of all classes
- Test with the SVM corresponding to the first and last elements in L; the class the classifier rejects is eliminated from the list
- Repeat the process; after m-1 evaluations the single class remaining is the prediction (see the sketch below)
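
A sketch of the DAG elimination loop described above; `pairwise` is a hypothetical dictionary of trained one-vs-one classifiers, where pairwise[(a, b)](x) returns whichever of the two classes the classifier prefers.

```python
def dag_predict(x, classes, pairwise):
    remaining = list(classes)
    while len(remaining) > 1:                       # m - 1 evaluations in total
        a, b = remaining[0], remaining[-1]
        winner = pairwise[(a, b)](x)
        remaining.remove(b if winner == a else a)   # drop the rejected class
    return remaining[0]

# toy example with 3 classes, where every pairwise classifier prefers class 2
pairwise = {(1, 3): lambda x: 3, (1, 2): lambda x: 2, (2, 3): lambda x: 2}
print(dag_predict(None, [1, 2, 3], pairwise))   # prints 2
```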

Application 1: Cancer Classification High dimensional: dimension > 1000, but patients < 100. Imbalanced: fewer positive samples. Many irrelevant features. Noisy. (Data matrix: rows are patients p-1 … p-n, columns are genes g-1 … g-p.) FEATURE SELECTION: in the linear case, w_i² gives the ranking of dimension i (see the sketch below).
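
A sketch of that ranking rule with made-up numbers: alpha, y and X below stand in for the outputs of a linear-SVM training run, and w_i² ranks dimension i.

```python
import numpy as np

X = np.array([[1.0, 0.1, 5.0], [0.9, -0.2, 4.0], [-1.1, 0.0, 5.5], [-1.0, 0.3, 4.5]])
y = np.array([1, 1, -1, -1])
alpha = np.array([0.5, 0.5, 0.5, 0.5])          # pretend these came from the QP

w = (alpha * y) @ X                             # w = sum_i alpha_i y_i x_i (linear case)
ranking = np.argsort(w**2)[::-1]                # most informative dimension first
print(w**2, ranking)                            # dimension 0 dominates here
```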

Application 2: Text Categorization Task: the classification of natural text (or hypertext) documents into a fixed number of predefined categories based on their content (filtering, web searching, sorting documents by topic, etc.). A document can be assigned to more than one category, so this can be viewed as a series of binary classification problems, one for each category. What is multi-label classification?

Representation of Text IR's vector space model (aka bag-of-words representation): a document is represented by a vector indexed by a pre-fixed set or dictionary of terms. The value of an entry can be binary or a weight. Normalization, stop words, and word stems are typically applied. Doc x => φ(x) (see the sketch below).
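
A bag-of-words sketch of this representation, with a toy dictionary and toy documents (a real system would also handle stop words and stemming more carefully):

```python
import numpy as np

dictionary = ["svm", "kernel", "margin", "gene", "protein"]
docs = ["the kernel trick lets the svm use a nonlinear kernel",
        "gene expression data and protein classification"]

def bag_of_words(doc, dictionary, binary=False):
    words = doc.lower().split()
    counts = np.array([words.count(term) for term in dictionary], dtype=float)
    if binary:
        return (counts > 0).astype(float)
    norm = np.linalg.norm(counts)
    return counts / norm if norm > 0 else counts     # length-normalised term weights

phi = np.array([bag_of_words(d, dictionary) for d in docs])
print(phi)   # each row is the phi(x) vector for one document
```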

Text Categorization using SVM The similarity between two documents is φ(x)·φ(z). K(x, z) = ⟨φ(x), φ(z)⟩ is a valid kernel, so the SVM can be used with K(x, z) for discrimination. Why SVM?
- High dimensional input space
- Few irrelevant features (dense concept)
- Sparse document vectors (sparse instances)
- Text categorization problems are linearly separable

Some Issues Optimization criterion: hard margin vs. soft margin. Choosing it, and the kernel parameters, typically takes a lengthy series of experiments in which various values are tested; perform a grid search for the best parameters (see the sketch below).
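
One way to run that grid search, sketched with scikit-learn on toy data; the parameter ranges are assumptions for illustration, not values from the slides.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # a non-linear toy problem

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)   # 5-fold cross-validation
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```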

Additional Resources An excellent tutorial on VC-dimension and Support Vector Machines: C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. The VC/SRM/SVM Bible: Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience, 1998.

End of SVM Week of the 27th: Unsupervised learning by Anil (your TA); best practices for feature extraction (Gaurav).

Happy Diwali