An Introduction to Kernel-Based Learning Algorithms. K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda and B. Schölkopf. Presented by: Joanna Giforos. CS8980: Topics in Machine Learning, 9 March 2006.

Outline:
- Problem Description
- Nonlinear Algorithms in Kernel Feature Space
- Supervised Learning: Nonlinear SVM, Kernel Fisher Discriminant Analysis
- Unsupervised Learning: Kernel Principal Component Analysis
- Applications
- Model-specific kernels

Problem Description. 2-class classification: estimate a function $f$ from input-output training data $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^N \times \{-1, +1\}$ such that $f$ will correctly classify unseen examples, i.e., find a mapping $f: \mathbb{R}^N \to \{-1, +1\}$. Assume: training and test data are drawn from the same probability distribution $P(x, y)$.

Problem Description. A learning machine is a family of functions $\{f(\cdot, \alpha)\}_\alpha$. For the task of learning two classes, $f(x, \alpha) \in \{-1, 1\}$, $\forall x, \alpha$. Too complex $\Rightarrow$ overfitting; not complex enough $\Rightarrow$ underfitting. We want to find the right balance between accuracy and complexity.

Problem Description. The best $f(\cdot, \alpha)$ is the one that minimizes the expected error (risk): $R(\alpha) = \int \tfrac{1}{2} |f(x, \alpha) - y| \, dP(x, y)$. Empirical risk (training error): $R_{\mathrm{emp}}(\alpha) = \tfrac{1}{n} \sum_{i=1}^{n} \tfrac{1}{2} |f(x_i, \alpha) - y_i|$. Under mild conditions, $R_{\mathrm{emp}}(\alpha) \to R(\alpha)$ as $n \to \infty$.

Structural Risk Minimization. Construct a nested family of function classes $F_1 \subset F_2 \subset \cdots \subset F_k$ with non-decreasing VC dimension $h_1 \le h_2 \le \cdots \le h_k$. Let $f_1, \ldots, f_k$ be the solutions of the empirical risk minimization in $F_1, \ldots, F_k$. SRM chooses the function class $F_i$ and the function $f_i$ such that an upper bound on the generalization error is minimized.
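For reference (not spelled out on the slide, but this is the standard Vapnik-style bound that the SRM principle uses), the upper bound holds with probability at least $1 - \delta$ for a class of VC dimension $h < n$:

$$ R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha) + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\delta}{4}}{n}} $$

SRM then trades the shrinking training error of richer classes against the growing confidence term.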

Nonlinear Algorithms in Kernel Feature Space. Via a nonlinear mapping $\Phi: \mathbb{R}^N \to \mathcal{F}$, the data is mapped into a potentially much higher dimensional feature space $\mathcal{F}$. Given this mapping, we can compute scalar products in feature space using kernel functions: $k(x, y) = (\Phi(x) \cdot \Phi(y))$. $\Phi$ does not need to be known explicitly $\Rightarrow$ every linear algorithm that only uses scalar products can implicitly be executed in $\mathcal{F}$ by using kernels.

Nonlinear Algorithms in Kernel Feature Space: Example
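The worked example itself is not reproduced in the transcript. As an illustrative stand-in (a minimal sketch, not taken from the slides), the snippet below checks numerically that the explicit degree-2 polynomial feature map $\Phi(x_1, x_2) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ and the kernel $k(x, y) = (x \cdot y)^2$ yield the same scalar product, so the map never has to be evaluated explicitly:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D inputs: R^2 -> R^3."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(x, y):
    """Degree-2 polynomial kernel k(x, y) = (x . y)^2."""
    return np.dot(x, y) ** 2

rng = np.random.default_rng(0)
x, y = rng.normal(size=2), rng.normal(size=2)

# Scalar product in feature space equals the kernel value in input space.
assert np.isclose(np.dot(phi(x), phi(y)), poly_kernel(x, y))
```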

Supervised Learning: Nonlinear SVM. Consider linear classifiers in feature space using dot products: $f(x) = \mathrm{sgn}\left((w \cdot \Phi(x)) + b\right)$. Conditions for classification without training error: $y_i \left((w \cdot \Phi(x_i)) + b\right) \ge 1$, $i = 1, \ldots, n$. GOAL: find $w$ and $b$ such that the empirical risk and the regularization term are minimized. But we cannot explicitly access $w$ in the feature space, so we introduce Lagrange multipliers $\alpha_i$, one for each of the above constraints.
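For reference (a standard step not written out on the slide), the Lagrangian that introduces these multipliers is

$$ L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i\left((w \cdot \Phi(x_i)) + b\right) - 1 \right], \qquad \alpha_i \ge 0, $$

minimized over $w, b$ and maximized over $\alpha$. Setting the derivatives with respect to $w$ and $b$ to zero gives $w = \sum_i \alpha_i y_i \Phi(x_i)$ and $\sum_i \alpha_i y_i = 0$, which leads to the dual on the next slide.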

Supervised Learning: Nonlinear SVM. Last class we saw that the nonlinear SVM primal problem is $\min_{w, b} \; \tfrac{1}{2} \|w\|^2$ subject to $y_i\left((w \cdot \Phi(x_i)) + b\right) \ge 1$, $i = 1, \ldots, n$. This leads to the dual: $\max_{\alpha} \; \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j)$ subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$.

Supervised Learning: Nonlinear SVM. Applying the Karush-Kuhn-Tucker (KKT) optimality conditions to the dual SVM problem, we obtain $\alpha_i \left[ y_i \left((w \cdot \Phi(x_i)) + b\right) - 1 \right] = 0$ for all $i$. The solution is sparse in $\alpha$ $\Rightarrow$ many patterns are outside the margin area and their optimal $\alpha_i$'s are zero. Without sparsity, SVM would be impractical for large data sets.

Supervised Learning: Nonlinear SVM. The dual problem can be rewritten as $\max_{\alpha} \; \mathbf{1}^\top \alpha - \tfrac{1}{2} \alpha^\top Q \alpha$, where $Q_{ij} = y_i y_j \, k(x_i, x_j)$. Since this is a concave maximization (convex optimization) problem, every local maximum is a global maximum, but there can be several optimal solutions (in terms of the $\alpha_i$). Once the $\alpha_i$'s are found using QP solvers, simply plug them into the prediction rule $f(x) = \mathrm{sgn}\left( \sum_i \alpha_i y_i \, k(x_i, x) + b \right)$.
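As a concrete illustration of this prediction rule and of the sparsity mentioned above (a minimal sketch using scikit-learn, not the paper's own code), the snippet below fits a kernel SVM with a QP-based solver and then re-evaluates the decision function by hand from the stored dual coefficients; `dual_coef_` holds the products $y_i \alpha_i$ for the support vectors only:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

gamma = 1.0
clf = SVC(kernel="rbf", C=10.0, gamma=gamma).fit(X, y)

# Sparsity: only the support vectors carry non-zero alpha_i.
print("support vectors:", len(clf.support_), "of", len(X))

# Prediction rule f(x) = sgn( sum_i alpha_i y_i k(x_i, x) + b ).
X_test = X[:5]
K = rbf_kernel(X_test, clf.support_vectors_, gamma=gamma)
decision_manual = K @ clf.dual_coef_.ravel() + clf.intercept_
assert np.allclose(decision_manual, clf.decision_function(X_test))
print(np.sign(decision_manual))
```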

Supervised Learning: KFD. Discriminant analysis seeks to find a projection of the data in a direction that is efficient for discrimination. (Image from: R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification, John Wiley & Sons, Inc., 2001.)

Supervised Learning: KFD. Solve Fisher's linear discriminant in kernel feature space. It aims at finding linear projections such that the classes are well separated:
- How far are the projected means apart? (should be large)
- How big is the variance of the data in this direction? (should be small)
Recall that this can be achieved by maximizing the Rayleigh quotient $J(w) = \frac{w^\top S_B w}{w^\top S_W w}$, where $S_B = (m_1 - m_2)(m_1 - m_2)^\top$ is the between-class scatter and $S_W = \sum_{k=1,2} \sum_{i \in \text{class } k} (x_i - m_k)(x_i - m_k)^\top$ is the within-class scatter.

Supervised Learning: KFD. In kernel feature space, express $w$ in terms of the mapped training patterns, $w = \sum_{i=1}^{n} \alpha_i \Phi(x_i)$, to get $J(\alpha) = \frac{\alpha^\top M \alpha}{\alpha^\top N \alpha}$, where $M = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^\top$ with $(\mu_k)_j = \frac{1}{n_k} \sum_{i \in \text{class } k} k(x_j, x_i)$, and $N = \sum_{k=1,2} K_k (I - \mathbf{1}_{n_k}) K_k^\top$, with $K_k$ the kernel matrix between all training points and the points of class $k$ and $\mathbf{1}_{n_k}$ the matrix with all entries $1/n_k$.

Supervised Learning: KFD. The projection of a test point onto the discriminant is computed by $(w \cdot \Phi(x)) = \sum_i \alpha_i \, k(x_i, x)$. We can solve the generalized eigenvalue problem $M \alpha = \lambda N \alpha$. But $N$ and $M$ may be large and non-sparse, so KFD can be transformed into a convex QP problem. Question: can we use numerical approximations to the eigenvalue problem?

Supervised Learning: KFD. KFD can also be reformulated as a constrained optimization problem. FD tries to minimize the variance of the data along the projection whilst maximizing the distance between the projected means. This QP is equivalent to maximizing $J(\alpha)$ since:
- $M$ is a matrix of rank 1 (its columns are linearly dependent), and
- solutions $w$ of $J(\alpha)$ are invariant under scaling,
$\Rightarrow$ we can fix the distance of the means to some arbitrary positive value and just minimize the variance.
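The rank-one structure of $M$ also gives a direct way to compute the discriminant: the leading solution of $M\alpha = \lambda N\alpha$ is $\alpha \propto N^{-1}(\mu_1 - \mu_2)$, with a small ridge added to $N$ for numerical stability. The sketch below is my own minimal numpy version of this standard formulation, not the authors' code:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kernel_fisher_discriminant(X, y, gamma=1.0, ridge=1e-3):
    """Minimal KFD sketch: returns alpha such that the projection of a point x
    onto the discriminant is sum_i alpha_i k(x_i, x)."""
    K = rbf_kernel(X, X, gamma=gamma)              # K[i, j] = k(x_i, x_j)
    n = X.shape[0]
    mus, N = [], np.zeros((n, n))
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        K_c = K[:, idx]                            # n x n_c block of K
        n_c = len(idx)
        mus.append(K_c.mean(axis=1))               # (mu_c)_j = (1/n_c) sum_i k(x_j, x_i in class c)
        # Within-class term: N = sum_c K_c (I - 1_{n_c}) K_c^T, with (1_{n_c})_{ij} = 1/n_c.
        N += K_c @ (np.eye(n_c) - np.full((n_c, n_c), 1.0 / n_c)) @ K_c.T
    # M has rank one, so the leading generalized eigenvector of M a = lambda N a
    # is alpha proportional to N^{-1}(mu_1 - mu_2); a small ridge keeps N invertible.
    alpha = np.linalg.solve(N + ridge * np.eye(n), mus[0] - mus[1])
    return alpha, K

# Usage: project the training data onto the kernel Fisher direction.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (30, 2)), rng.normal(3.0, 1.0, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
alpha, K = kernel_fisher_discriminant(X, y, gamma=0.5)
projections = K @ alpha                            # (w . Phi(x_i)) for each training point
```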

Connection Between Boosting and Kernel Methods. One can show that boosting maximizes the smallest margin $\rho$. Recall that the SVM also maximizes a margin, by minimizing $\|w\|_2$. In general, using an $\ell_p$-norm constraint on the weight vector leads to maximizing the $\ell_q$ distance between the hyperplane and the training points, where $1/p + 1/q = 1$:
- Boosting uses the $\ell_1$ norm
- SVM uses the $\ell_2$ norm
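For concreteness (standard definitions, not written out on the slide), the two margins being maximized can be written as

$$ \text{SVM:}\quad \min_i \frac{y_i\left((w \cdot \Phi(x_i)) + b\right)}{\|w\|_2}, \qquad \text{Boosting:}\quad \min_i \frac{y_i \sum_t \alpha_t h_t(x_i)}{\|\alpha\|_1}, $$

so both maximize a smallest margin, measured with respect to the $\ell_2$ and $\ell_1$ norm of the weight vector, respectively.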

Unsupervised Methods: Linear PCA. Principal Component Analysis (PCA) attempts to efficiently represent the data by finding orthonormal axes which maximally decorrelate the data. Given centered observations $x_1, \ldots, x_n \in \mathbb{R}^N$ with $\sum_i x_i = 0$, PCA finds the principal axes by diagonalizing the covariance matrix $C = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top$. Note that $C$ is positive semi-definite, and thus can be diagonalized with nonnegative eigenvalues.

Unsupervised Methods: Linear PCA. The eigenvectors satisfy $\lambda v = C v$ and lie in the span of $x_1, \ldots, x_n$: indeed, $\lambda v = C v = \frac{1}{n} \sum_i (x_i \cdot v)\, x_i$. But $(x_i \cdot v)$ is just a scalar, so all solutions $v$ with $\lambda \neq 0$ lie in the span of $x_1, \ldots, x_n$, i.e. $v = \sum_i \alpha_i x_i$.

Unsupervised Methods: Kernel PCA. If we first map the data into another space via $\Phi: \mathbb{R}^N \to \mathcal{F}$, then, assuming we can center the data ($\sum_i \Phi(x_i) = 0$), we can write the covariance matrix as $\bar{C} = \frac{1}{n} \sum_{i=1}^{n} \Phi(x_i) \Phi(x_i)^\top$, which can be diagonalized with nonnegative eigenvalues $\lambda$ satisfying $\lambda V = \bar{C} V$.

Unsupervised Methods: Kernel PCA. As in linear PCA, all solutions $V$ with $\lambda \neq 0$ lie in the span of $\Phi(x_1), \ldots, \Phi(x_n)$, i.e. $V = \sum_i \alpha_i \Phi(x_i)$. Substituting, we get $\lambda \sum_i \alpha_i \Phi(x_i) = \bar{C} \sum_i \alpha_i \Phi(x_i)$. With $K$ the inner-product (kernel) matrix, $K_{ij} = (\Phi(x_i) \cdot \Phi(x_j)) = k(x_i, x_j)$, premultiplying both sides by $\Phi(x_k)^\top$ we finally get $n \lambda\, K \alpha = K^2 \alpha$, which reduces to the eigenvalue problem $n \lambda\, \alpha = K \alpha$.

Unsupervised Methods: Kernel PCA. The resulting eigenvectors $\alpha^k$ are then used to extract the principal components of a test point $x$ by $(V^k \cdot \Phi(x)) = \sum_{i} \alpha_i^k \, k(x_i, x)$.
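Putting the last few slides together, here is a minimal numpy sketch of the procedure (my own illustration, not the authors' implementation): build $K$, center it in feature space via $K_c = K - \mathbf{1}_n K - K \mathbf{1}_n + \mathbf{1}_n K \mathbf{1}_n$, solve $n\lambda\,\alpha = K_c \alpha$, normalize, and project:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kernel_pca(X, gamma=1.0, n_components=2):
    """Minimal kernel PCA sketch following the derivation above.
    Returns the dual coefficients alpha^k and the projections of the
    training points onto the first feature-space principal components."""
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma=gamma)
    # Center the data implicitly in feature space:
    # K_c = K - 1_n K - K 1_n + 1_n K 1_n, with (1_n)_{ij} = 1/n.
    one_n = np.full((n, n), 1.0 / n)
    K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # Eigenproblem n*lambda*alpha = K_c alpha (eigh returns ascending order).
    eigvals, eigvecs = np.linalg.eigh(K_c)
    idx = np.argsort(eigvals)[::-1][:n_components]
    alphas, kvals = eigvecs[:, idx], eigvals[idx]
    # Normalize so the feature-space eigenvectors V^k have unit length:
    # (V^k . V^k) = alpha^k' K_c alpha^k = eigval_k * |alpha^k|^2 = 1.
    alphas = alphas / np.sqrt(np.maximum(kvals, 1e-12))
    # Principal components of the training points:
    # (V^k . Phi(x_j)) = sum_i alpha_i^k k(x_i, x_j); a new test point would
    # additionally need its kernel row centered in the same way.
    return alphas, K_c @ alphas

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
alphas, Z = kernel_pca(X, gamma=0.5)
print(Z.shape)  # (100, 2)
```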

Unsupervised Methods: Kernel PCA. Nonlinearities only enter the computation at two points:
- in the calculation of the matrix $K$
- in the evaluation of new points
Drawback of kernel PCA: for large data sets there are storage and computational complexity issues; sparse approximations of $K$ can be used. Question: can we think of other unsupervised methods which can make use of kernels? Kernel k-means, kernel ICA, spectral clustering.

Unsupervised Methods: Linear PCA (figure-only slide; not reproduced in the transcript)

Unsupervised Methods: Kernel PCA (figure-only slide; not reproduced in the transcript)

Applications. Support Vector Machines and Kernel Fisher Discriminant:
- Bioinformatics: protein classification
- OCR
- Face recognition
- Content-based image retrieval
- Decision tree predictive modeling
- …
Kernel PCA:
- Denoising
- Compression
- Visualization
- Feature extraction for classification

Kernels for Specific Applications. Image segmentation: Gaussian-weighted $\chi^2$-distance between local color histograms; can be shown to be robust for color and texture discrimination. Text classification: vector space kernels. Structured data (strings, trees, etc.): spectrum kernels. Generative models: P-kernels, Fisher kernels.
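As an illustrative sketch of the first item (the exact weighting and normalization used in the segmentation work behind the slide may differ, so treat the form below as an assumption), a Gaussian-weighted $\chi^2$ kernel between two histograms can be written as $k(h, h') = \exp(-\chi^2(h, h') / \sigma)$ with $\chi^2(h, h') = \sum_i (h_i - h'_i)^2 / (h_i + h'_i)$:

```python
import numpy as np

def chi2_kernel(h1, h2, sigma=1.0, eps=1e-12):
    """Gaussian-weighted chi^2 kernel between two (normalized) histograms.
    chi2(h, h') = sum_i (h_i - h'_i)^2 / (h_i + h'_i);  k = exp(-chi2 / sigma)."""
    h1, h2 = np.asarray(h1, float), np.asarray(h2, float)
    chi2 = np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))
    return np.exp(-chi2 / sigma)

# Usage on two toy color histograms (assumed normalized to sum to 1).
h_a = np.array([0.2, 0.5, 0.3])
h_b = np.array([0.25, 0.45, 0.3])
print(chi2_kernel(h_a, h_b, sigma=0.5))
```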