
1 Kernels Slides from Andrew Moore and Mingyue Tan

2 How to classify non-linearly separable datasets?

3 Non-linear SVMs Datasets that are linearly separable with some noise work out great. But what are we going to do if the dataset is just too hard? How about mapping the data to a higher-dimensional space? [Figure: 1-D data shown on the x axis, then mapped to (x, x²), where it becomes linearly separable]

4 Non-linear SVMs: Feature spaces General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set becomes separable: x → φ(x)

5 Nonlinear SVMs: The Kernel Trick With this mapping, our discriminant function is now g(x) = wᵀφ(x) + b = Σ_{i in SV} α_i y_i φ(x_i)ᵀφ(x) + b. There is no need to know the mapping explicitly, because we only use the dot product of feature vectors in both training and testing. A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space: K(x_i, x_j) = φ(x_i)ᵀφ(x_j)
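As an illustration (not from the slides), here is a minimal Python sketch of evaluating such a discriminant purely through kernel calls; the trained quantities support_vectors, alphas, labels and b are assumed to come from some SVM solver.

```python
import numpy as np

def decision_function(x, support_vectors, alphas, labels, b, K):
    """Evaluate f(x) = sum_i alpha_i * y_i * K(x_i, x) + b without ever forming phi(x)."""
    return sum(a * yi * K(xi, x)
               for xi, a, yi in zip(support_vectors, alphas, labels)) + b

# Illustrative kernel choice: degree-2 polynomial K(u, v) = (1 + u.v)^2
K = lambda u, v: (1.0 + np.dot(u, v)) ** 2
```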

6 Nonlinear SVMs: The Kernel Trick An example: for 2-dimensional vectors x = [x_1, x_2], let K(x_i, x_j) = (1 + x_iᵀx_j)². We need to show that K(x_i, x_j) = φ(x_i)ᵀφ(x_j):
K(x_i, x_j) = (1 + x_iᵀx_j)²
= 1 + x_i1² x_j1² + 2 x_i1 x_j1 x_i2 x_j2 + x_i2² x_j2² + 2 x_i1 x_j1 + 2 x_i2 x_j2
= [1, x_i1², √2 x_i1 x_i2, x_i2², √2 x_i1, √2 x_i2]ᵀ [1, x_j1², √2 x_j1 x_j2, x_j2², √2 x_j1, √2 x_j2]
= φ(x_i)ᵀφ(x_j), where φ(x) = [1, x_1², √2 x_1 x_2, x_2², √2 x_1, √2 x_2]
This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt
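A quick numeric sanity check of this identity (not from the slides); the vectors xi and xj below are arbitrary illustrative values.

```python
import numpy as np

def phi(x):
    """The explicit feature map from the slide for a 2-D input."""
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

xi, xj = np.array([0.3, -1.2]), np.array([2.0, 0.5])
lhs = (1.0 + xi @ xj) ** 2      # kernel value K(xi, xj)
rhs = phi(xi) @ phi(xj)         # dot product in the expanded feature space
assert np.isclose(lhs, rhs)     # the two quantities agree
```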

7 Modification Due to Kernel Function Change all inner products to kernel functions. For training:
- Original: maximize Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j x_iᵀx_j subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0
- With kernel function: maximize Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j) subject to the same constraints

8 Modification Due to Kernel Function For testing, the new data point z is classified as class 1 if f ≥ 0 and as class 2 if f < 0:
- Original: f(z) = Σ_{i in SV} α_i y_i x_iᵀz + b
- With kernel function: f(z) = Σ_{i in SV} α_i y_i K(x_i, z) + b

9 An Example for Φ(.) and K(.,.) Suppose φ(.) is given explicitly [mapping shown on the slide]. Computing an inner product in that feature space shows that, if we define the kernel function accordingly, there is no need to carry out φ(.) explicitly. This use of a kernel function to avoid carrying out φ(.) explicitly is known as the kernel trick.

10 Kernel Functions
- In practical use of SVMs, the user specifies the kernel function; the transformation φ(.) is not explicitly stated
- Given a kernel function K(x_i, x_j), the transformation φ(.) is given by its eigenfunctions (a concept in functional analysis)
  - Eigenfunctions can be difficult to construct explicitly
  - This is why people usually only specify the kernel function without worrying about the exact transformation
- Another view: a kernel function, being an inner product, is really a similarity measure between the objects

11 Examples of Kernel Functions
- Polynomial kernel with degree d: K(x_i, x_j) = (1 + x_iᵀx_j)^d
- Radial basis function (RBF) kernel with width σ: K(x_i, x_j) = exp(-||x_i - x_j||² / (2σ²))
  - Closely related to radial basis function neural networks
  - The feature space is infinite-dimensional
- Sigmoid kernel with parameters κ and θ: K(x_i, x_j) = tanh(κ x_iᵀx_j + θ)
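For concreteness, a minimal Python sketch of these three kernels; the parameter names (d, sigma, kappa, theta) follow the slide and the default values are arbitrary.

```python
import numpy as np

def polynomial_kernel(x, z, d=3):
    """Polynomial kernel of degree d."""
    return (1.0 + np.dot(x, z)) ** d

def rbf_kernel(x, z, sigma=1.0):
    """Radial basis function kernel with width sigma."""
    diff = np.asarray(x) - np.asarray(z)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma**2))

def sigmoid_kernel(x, z, kappa=1.0, theta=0.0):
    """Sigmoid 'kernel'; note it satisfies Mercer's condition only for some (kappa, theta)."""
    return np.tanh(kappa * np.dot(x, z) + theta)
```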

12 Nonlinear SVMs: The Kernel Trick Mercer's Theorem: let X be a finite input space and K(x, z) a symmetric function on X. Then K(x, z) is a kernel function if and only if the Gram matrix K, with entries K_ij = K(x_i, x_j) for i, j = 1, ..., N, is positive semi-definite (has non-negative eigenvalues).
Gram matrix:
K = [ K(x_1,x_1)  K(x_1,x_2)  K(x_1,x_3)  ...  K(x_1,x_N)
      K(x_2,x_1)  K(x_2,x_2)  K(x_2,x_3)  ...  K(x_2,x_N)
      ...
      K(x_N,x_1)  K(x_N,x_2)  K(x_N,x_3)  ...  K(x_N,x_N) ]
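A minimal sketch of checking Mercer's condition numerically on a finite sample: build the Gram matrix for a kernel and verify that its eigenvalues are non-negative. The random sample and the RBF kernel below are illustrative choices.

```python
import numpy as np

def gram_matrix(X, K):
    """N x N Gram matrix with entries K(x_i, x_j)."""
    n = len(X)
    return np.array([[K(X[i], X[j]) for j in range(n)] for i in range(n)])

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2))                     # arbitrary finite sample
K = lambda u, v: np.exp(-np.sum((u - v) ** 2))       # RBF kernel (sigma^2 = 1/2)
G = gram_matrix(X, K)
eigvals = np.linalg.eigvalsh(G)                      # G is symmetric
print(eigvals.min() >= -1e-10)                       # non-negative up to round-off
```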

13 [Figure-only slide]

14 More on Kernel Functions
- Not every similarity measure can be used as a kernel function, however
- The kernel function needs to satisfy Mercer's condition, i.e., the function is "positive semi-definite"
- This implies that the n-by-n kernel matrix, in which the (i, j)-th entry is K(x_i, x_j), is always positive semi-definite
- This also means that the QP is convex and can be solved in polynomial time

15 Example
- Suppose we have 5 one-dimensional data points: x_1=1, x_2=2, x_3=4, x_4=5, x_5=6, with 1, 2, 6 as class 1 and 4, 5 as class 2, i.e. y_1=1, y_2=1, y_3=-1, y_4=-1, y_5=1
- We use the polynomial kernel of degree 2, K(x_i, x_j) = (x_i x_j + 1)², and C is set to 100
- We first find α_i (i = 1, ..., 5) by maximizing Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j) subject to 0 ≤ α_i ≤ 100 and Σ_i α_i y_i = 0

16 Example
- By using a QP solver, we get α_1=0, α_2=2.5, α_3=0, α_4=7.333, α_5=4.833
- Note that the constraints are indeed satisfied (all α_i lie in [0, 100] and Σ_i α_i y_i = 0)
- The support vectors are {x_2=2, x_4=5, x_5=6}
- The discriminant function is f(z) = 2.5(1)(2z+1)² + 7.333(-1)(5z+1)² + 4.833(1)(6z+1)² + b = 0.6667 z² - 5.333 z + b
- b is recovered by solving f(2)=1, f(5)=-1, or f(6)=1, since x_2 and x_5 lie on the margin line wᵀφ(x)+b = 1 and x_4 lies on the line wᵀφ(x)+b = -1; all three give b = 9, so f(z) = 0.6667 z² - 5.333 z + 9
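The same worked example can be reproduced with an off-the-shelf QP-based solver; the sketch below uses scikit-learn's SVC, whose 'poly' kernel with gamma=1, coef0=1, degree=2 is exactly (x_i x_j + 1)², and should recover the support vectors, the α_i y_i values, and b ≈ 9.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0], [2.0], [4.0], [5.0], [6.0]])   # the five 1-D points
y = np.array([1, 1, -1, -1, 1])                      # class labels

clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=100)
clf.fit(X, y)

print(clf.support_)              # indices of the support vectors (expect x_2, x_4, x_5)
print(clf.dual_coef_)            # alpha_i * y_i for the support vectors
print(clf.intercept_)            # b (expect roughly 9)
print(clf.decision_function(X))  # f(x_i); +/-1 at the margin support vectors
```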

17 Properties of SVM
- Flexibility in choosing a similarity function
- Sparseness of solution when dealing with large data sets: only support vectors are used to specify the separating hyperplane
- Ability to handle large feature spaces: complexity does not depend on the dimensionality of the feature space
- Overfitting can be controlled by the soft-margin approach
- Nice math property: a simple convex optimization problem which is guaranteed to converge to a single global solution
- Feature selection

18 SVM Applications SVM has been used successfully in many real-world problems:
- text (and hypertext) categorization
- image classification
- bioinformatics (protein classification, cancer classification)
- hand-written character recognition

19 Some Issues
- Choice of kernel
  - Start with a linear kernel; a Gaussian or polynomial kernel is the usual default
  - If these are ineffective, more elaborate kernels are needed
  - Domain experts can give assistance in formulating appropriate similarity measures
- Choice of kernel parameters
  - e.g. σ in the Gaussian kernel
  - A common heuristic sets σ to the distance between the closest points with different classifications
  - In the absence of reliable criteria, applications rely on a validation set or cross-validation to set such parameters (see the grid-search sketch below)
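A minimal sketch of such a cross-validated parameter search with scikit-learn; the toy dataset and the parameter ranges below are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)   # placeholder data

param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10, 100]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]},
    {"kernel": ["poly"], "C": [1, 10], "degree": [2, 3], "coef0": [1.0]},
]
search = GridSearchCV(SVC(), param_grid, cv=5)   # 5-fold cross-validation
search.fit(X, y)
print(search.best_params_, search.best_score_)
```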

20 Weakness of SVM
- It is sensitive to noise: a relatively small number of mislabeled examples can dramatically decrease performance
- It is sensitive to class imbalance; proposed remedies include:
  - resampling methods (e.g., the support cluster machine)
  - ensemble learning methods

21 Multiclass SVM One vs. All (OVA) -> m SVMs
- SVM 1 learns "Output == 1" vs. "Output != 1"
- SVM 2 learns "Output == 2" vs. "Output != 2"
- :
- SVM m learns "Output == m" vs. "Output != m"
To predict the output for a new input, predict with each SVM and find the one that puts the prediction the furthest into the positive region (see the sketch below). What is the problem here?
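A minimal sketch of the OVA prediction rule described above; the list binary_svms of m trained one-vs-all classifiers is assumed to exist.

```python
import numpy as np

def ova_predict(x, binary_svms):
    """binary_svms[k] is an SVM trained on 'class k' vs. 'not class k'."""
    scores = [svm.decision_function([x])[0] for svm in binary_svms]
    return int(np.argmax(scores))   # class whose SVM pushes x furthest into the positive region
```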

22 Multiclass SVM One vs. One (OVO) -> m(m-1)/2 SVMs
- SVM 1 learns "Output == 1" vs. "Output == 2"
- SVM 2 learns "Output == 1" vs. "Output == 3"
- :
- The last SVM learns "Output == m-1" vs. "Output == m"
To predict the output for a new input, predict with each SVM and find the one that puts the prediction the furthest into the positive region (scikit-learn's built-in wrappers are sketched below).
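Both schemes are available as ready-made wrappers in scikit-learn; a minimal sketch with a placeholder 3-class toy dataset.

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, n_classes=3,
                           n_informative=4, random_state=0)   # placeholder data

ova = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)   # m binary SVMs
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)    # m(m-1)/2 binary SVMs
print(ova.predict(X[:5]), ovo.predict(X[:5]))
```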

23 Multiclass SVM Directed Acyclic Graph (DAG) -> m-1 SVM evaluations at test time
- Training is the same as OVO
- During testing, create a list L of the m classes
- Test with the SVM corresponding to the first and last element in list L. If the classifier prefers one of the two classes, the other one is eliminated from the list
- Repeat the process; after m-1 SVM evaluations, the last class remaining in the list is the prediction (a small sketch of the test procedure follows)
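A minimal sketch of the DAG test procedure described above; pairwise_svms is an assumed dictionary of OVO classifiers indexed by class pairs.

```python
def dag_predict(x, pairwise_svms, classes):
    """pairwise_svms[(a, b)] predicts +1 for class a and -1 for class b."""
    remaining = list(classes)
    while len(remaining) > 1:                       # m-1 evaluations for m classes
        a, b = remaining[0], remaining[-1]
        if pairwise_svms[(a, b)].predict([x])[0] == 1:
            remaining.pop()                         # classifier prefers a, eliminate b
        else:
            remaining.pop(0)                        # classifier prefers b, eliminate a
    return remaining[0]
```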

24 Application 1: Cancer Classification
- High dimensional: >1000 gene dimensions, <100 patients
- Imbalanced: fewer positive samples
- Many irrelevant features
- Noisy
[Figure: patients-by-genes data matrix, patients p-1 ... p-n as rows and genes g-1 ... g-p as columns]
FEATURE SELECTION: in the linear case, w_i² gives the ranking of dimension i (see the sketch below)
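A minimal sketch of this feature-ranking idea with a linear SVM in scikit-learn; the patients-by-genes matrix below is a randomly generated placeholder.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Placeholder "microarray" data: 80 patients, 1000 genes, few informative ones.
X, y = make_classification(n_samples=80, n_features=1000,
                           n_informative=20, random_state=0)

clf = LinearSVC(C=1.0).fit(X, y)
ranking = np.argsort(clf.coef_[0] ** 2)[::-1]   # rank dimension i by w_i^2, largest first
print(ranking[:10])                             # top-10 candidate genes
```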

25 Application 2: Text Categorization Task: the classification of natural text (or hypertext) documents into a fixed number of predefined categories based on their content (email filtering, web searching, sorting documents by topic, etc.). A document can be assigned to more than one category, so this can be viewed as a series of binary classification problems, one for each category. What is multi-label classification?

26 Representation of Text IR's vector space model (aka the bag-of-words representation):
- A document is represented by a vector indexed by a pre-fixed set or dictionary of terms
- Values of an entry can be binary or weights
- Common preprocessing: normalization, stop-word removal, word stemming
- Doc x => φ(x) (see the sketch below)
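A minimal sketch of this document-to-vector mapping using scikit-learn's TF-IDF vectorizer; the two example documents are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "support vector machines classify text"]

vectorizer = TfidfVectorizer(stop_words="english")   # stop-word removal + TF-IDF weights
Phi = vectorizer.fit_transform(docs)                 # sparse document-term matrix, phi(x) per row
print(vectorizer.get_feature_names_out())            # the dictionary of terms
print(Phi.shape)                                     # (n_documents, n_terms)
```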

27 Text Categorization using SVM The similarity between two documents is measured by φ(x)·φ(z); K(x, z) = φ(x)·φ(z) is a valid kernel, so an SVM can be used with K(x, z) for discrimination. Why SVM?
- High dimensional input space
- Few irrelevant features (dense concept)
- Sparse document vectors (sparse instances)
- Text categorization problems are typically linearly separable
(a pipeline sketch follows below)
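A minimal sketch of the resulting pipeline: TF-IDF vectors fed to a linear-kernel SVM, so that K(x, z) is exactly the dot product of the document vectors; the tiny corpus and labels below are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["cheap pills buy now", "limited offer click here",
        "meeting agenda for monday", "quarterly report attached"]
labels = ["spam", "spam", "work", "work"]            # one binary problem per category in general

text_clf = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
text_clf.fit(docs, labels)                           # linear kernel: K(x, z) = phi(x) . phi(z)
print(text_clf.predict(["please review the attached agenda"]))
```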

28 Some Issues Optimization criterion: hard margin vs. soft margin
- Choosing between them usually requires a lengthy series of experiments in which various parameters are tested
- Perform a grid search for the best parameters

29 Additional Resources
- An excellent tutorial on VC-dimension and Support Vector Machines: C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
- The VC/SRM/SVM Bible: Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience, 1998
- http://www.kernel-machines.org/

30 End of SVM Week of the 27th: Unsupervised learning by Anil (your TA); best practices for feature extraction (Gaurav)

31 Happy Diwali

