Sparse Kernel Methods 1 Sparse Kernel Methods for Classification and Regression
October 17, 2007
Kyungchul Park, SKKU

Sparse Kernel Methods 2 General Model of Learning
Learning Model (Vapnik, 2000)
- Generator (G): generates random vectors x, drawn independently from a fixed but unknown distribution F(x).
- Supervisor (S): returns an output value y according to a conditional distribution F(y|x), also fixed but unknown.
- Learning Machine (LM): capable of implementing a set of functions f(x, w), indexed by parameters w.
- The Learning Problem: choose the function that best approximates the supervisor's response, based on a training set of N i.i.d. observations drawn from the distribution F(x, y) = F(x)F(y|x).

Sparse Kernel Methods 3 Risk Minimization
To best approximate the supervisor's response, find the function that minimizes the risk functional (a standard form is sketched below).
- L: loss function.
- Note that F(x, y) is fixed but unknown; the only information available is contained in the training set.
- How, then, can the risk be estimated?
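The equation itself is not reproduced in the transcript; in Vapnik's standard notation the risk functional is presumably

```latex
R(w) = \int L\bigl(y, f(x, w)\bigr)\, dF(x, y)
```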

Sparse Kernel Methods 4 Classification and Regression
Classification Problem
- Supervisor's output y ∈ {0, 1}.
- The loss function: 0-1 loss (see below).
Regression Problem
- Supervisor's output y: a real value.
- The loss function: squared error (see below).
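The loss functions are omitted from the transcript; consistent with the next slide (minimum training error for classification, least squares for regression), they are presumably

```latex
L\bigl(y, f(x, w)\bigr) =
\begin{cases}
0, & y = f(x, w)\\
1, & y \neq f(x, w)
\end{cases}
\quad\text{(classification)}
\qquad\qquad
L\bigl(y, f(x, w)\bigr) = \bigl(y - f(x, w)\bigr)^2 \quad\text{(regression)}
```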

Sparse Kernel Methods 5 Empirical Risk Minimization Framework
Empirical Risk
- F(x, y) is unknown.
- Estimate R(w) from the training data (a standard form is sketched below).
Empirical Risk Minimization (ERM) Framework
- Find a function that minimizes the empirical risk.
- This is the fundamental assumption in inductive learning.
- For the classification problem, it leads to finding a function with minimum training error.
- For the regression problem, it leads to the least squares method.
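Filling in the omitted equation, the empirical risk in its standard form is

```latex
R_{\mathrm{emp}}(w) = \frac{1}{N} \sum_{n=1}^{N} L\bigl(y_n, f(x_n, w)\bigr)
```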

Sparse Kernel Methods 6 Over-Fitting Problem
Over-Fitting
- Small training error (empirical risk) but large generalization error.
- Consider polynomial curve fitting: a polynomial of sufficiently high degree can fit any finite training set perfectly, yet its predictions on new (unseen) data can be very poor (a small numerical illustration follows below).
Why Over-Fitting?
- Many possible causes: insufficient data, noise, etc.; the precise cause is a source of continuing debate.
- However, over-fitting is known to be closely related to model complexity (the expressive power of the learning machine).
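A minimal numerical sketch of this effect (not part of the original slides): fit noisy samples of sin(2πx) with polynomials of increasing degree and compare training error with error on fresh data.

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)

# 10 noisy training points and a separate test set from the same distribution
x_train = rng.uniform(0, 1, 10)
y_train = true_f(x_train) + rng.normal(0, 0.2, x_train.shape)
x_test = rng.uniform(0, 1, 100)
y_test = true_f(x_test) + rng.normal(0, 0.2, x_test.shape)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)            # least squares fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
# The degree-9 fit drives the training error toward 0 while the test error typically blows up.
```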

Sparse Kernel Methods 7 Over-Fitting Problem: Illustration (Bishop, 2006)
[Figure not reproduced: polynomial fits for several degrees M; the green curve is the true function, the red curve the least-squares estimate.]

Sparse Kernel Methods 8 How to Avoid the Over-Fitting Problem
General Idea
- Penalize models with high complexity (Occam's razor).
Regularization
- Add a regularization functional to the risk functional, e.g., ridge regression (a sketch follows below).
SRM (Structural Risk Minimization) Principle
- Due to Vapnik (1998); h denotes the capacity of the set of functions.
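The omitted formulas, supplied here in a standard form: a regularized (ridge) risk, and the schematic shape of the SRM bound, whose confidence term grows with the capacity h.

```latex
\tilde{R}(w) = \underbrace{\sum_{n=1}^{N}\bigl(y_n - w^{\top} x_n\bigr)^2}_{\text{empirical risk}}
             + \underbrace{\lambda \lVert w \rVert^2}_{\text{complexity penalty}}
\qquad\qquad
R(w) \;\le\; R_{\mathrm{emp}}(w) + \Phi\!\left(\tfrac{h}{N}\right)
```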

Sparse Kernel Methods 9 How to Avoid the Over-Fitting Problem
Bayesian Methods
- Incorporate prior knowledge on the form of the functions via a prior distribution F(w).
- Final result: the predictive distribution F(y|D), where D is the training set, obtained by marginalizing over w.
Remarks
1) The Bayesian framework gives probabilistic generative models.
2) There is a strong connection with regularization theory.
3) Kernels can be generated from generative models.

Sparse Kernel Methods 10 Motivation: Linear Regression
Primal Problem
- Use ridge regression.
- Solution: the closed-form minimizer of the regularized least-squares objective (sketched below).
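In the usual notation (X the N x D design matrix, y the target vector), the omitted objective and solution are presumably

```latex
\min_{w}\ \lVert y - X w \rVert^2 + \lambda \lVert w \rVert^2
\qquad\Longrightarrow\qquad
w^{*} = \bigl(X^{\top} X + \lambda I_D\bigr)^{-1} X^{\top} y
```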

Sparse Kernel Methods 11 Motivation: Linear Regression
Dual Problem (a standard dual representation is sketched below).
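The dual representation, omitted from the transcript, can be written in the standard form: setting w = X^T a, only inner products between data points appear.

```latex
a = \bigl(K + \lambda I_N\bigr)^{-1} y,
\qquad K = X X^{\top},\quad K_{nm} = k(x_n, x_m) = x_n^{\top} x_m,
\qquad
\hat{y}(x) = \sum_{n=1}^{N} a_n\, k(x, x_n)
```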

Sparse Kernel Methods 12 Motivation: Linear Regression
Discussion
- In the primal formulation we invert a D x D matrix; in the dual formulation, an N x N matrix.
- The dual representation shows that the predicted value is a linear combination of the observed values, with weights given by the function k(·,·).
So Why the Dual?
- The solution of the dual problem is determined entirely by K.
- K is called the Gram matrix and is defined by the function k(·,·), called the kernel function.
- The key observation: we can solve the regression problem knowing only the Gram matrix K, or equivalently the kernel function k.
- We can therefore generalize to other classes of functions simply by defining a new kernel function (a code sketch follows below).
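A minimal numpy sketch of this dual (kernel ridge) regression, assuming a Gaussian kernel; this is an illustration, not code from the original slides.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=0.5):
    """Gram matrix K[i, j] = exp(-||A_i - B_j||^2 / (2 sigma^2))."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, (30, 1))                       # training inputs
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.1, 30)

lam = 1e-2
K = gaussian_kernel(X, X)
a = np.linalg.solve(K + lam * np.eye(len(X)), y)     # dual coefficients: a = (K + lam I)^{-1} y

X_new = np.linspace(0, 1, 5)[:, None]
y_pred = gaussian_kernel(X_new, X) @ a               # prediction: sum_n a_n k(x, x_n)
print(y_pred)
```

Only the kernel function appears; swapping in a different kernel changes the class of functions being fitted without touching the rest of the code.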

Sparse Kernel Methods 13 Beyond Linear Relations
Extension to Nonlinear Functions
- Apply a feature space transform x → φ(x) and define the set of functions f(x) = w^T φ(x), for example polynomials of degree D.
- Using a feature space transform, the linear relation is extended to nonlinear relations.
- These models are still linear models, since the function is linear in the unknown parameters w.
Note: φ(x) denotes a vector of basis functions.

Sparse Kernel Methods 14 Beyond Linear Relations
Problems with the Feature Space Transform
- Difficulty in finding the appropriate transform.
- Curse of dimensionality: the number of parameters grows rapidly.
So, Kernel Functions!
- In the dual formulation, the only necessary information is the kernel function.
- A kernel function is defined as an inner product of two feature vectors.
- If we can find an appropriate kernel function, we can solve the problem without explicitly constructing the feature space transform.
- Some kernel functions correspond to an infinite-dimensional feature space.

Sparse Kernel Methods 15 Kernel Functions
A kernel is a function k that, for all x, z ∈ X, satisfies k(x, z) = ⟨φ(x), φ(z)⟩, where φ is a mapping from X to a feature space F.
Example: see the worked expansion below.
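A worked example (the slide's own example is not in the transcript): the simple quadratic kernel on R^2 corresponds to an explicit three-dimensional feature map.

```latex
k(x, z) = \bigl(x^{\top} z\bigr)^2
        = (x_1 z_1 + x_2 z_2)^2
        = \bigl(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\bigr)\,
          \bigl(z_1^2,\ \sqrt{2}\,z_1 z_2,\ z_2^2\bigr)^{\top}
        = \phi(x)^{\top} \phi(z)
```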

Sparse Kernel Methods 16 Characterization of Kernel Functions
How to Find a Kernel Function?
- One way: first define a feature space transform, then define the kernel as an inner product in that space.
- Alternatively, there is a direct method to characterize a kernel (a quick numerical check is sketched below).
Characterization of Kernels (Shawe-Taylor and Cristianini, 2004)
- A function k, either continuous or with a finite domain, can be decomposed as k(x, z) = ⟨φ(x), φ(z)⟩ if and only if it is a finitely positive semi-definite function, that is, for any choice of a finite set {x_1, ..., x_n}, the matrix with entries k(x_i, x_j) is positive semi-definite.
- For the proof, see the reference (it rests on the Reproducing Kernel Hilbert Space, RKHS, construction).
- An alternative characterization is given by Mercer's theorem.
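A quick numerical sanity check of this property, assuming a Gaussian kernel (an illustration only, not a proof):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))                    # an arbitrary finite set of points

# Gram matrix of the Gaussian kernel k(x, z) = exp(-||x - z||^2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists)

eigvals = np.linalg.eigvalsh(K)                 # K is symmetric, so use eigvalsh
print(eigvals.min())                            # >= 0 up to round-off: K is positive semi-definite
```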

Sparse Kernel Methods 17 Examples of Kernel Functions
Examples (the formulas are sketched below)
- 1st: polynomial kernel; 2nd: Gaussian kernel.
- 3rd: a kernel derived from a generative model, where p(x) is a probability.
- 4th: a kernel defined on the power set of a given set S.
- There are many known techniques for constructing new kernels from existing kernels; see the reference.
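The slide's own equations are not in the transcript; the commonly cited forms (the last two as given in Bishop, 2006) are presumably

```latex
k(x, z) = \bigl(x^{\top} z + c\bigr)^{M}
\qquad
k(x, z) = \exp\!\left(-\frac{\lVert x - z \rVert^2}{2\sigma^2}\right)
\qquad
k(x, z) = p(x)\, p(z)
\qquad
k(A_1, A_2) = 2^{\lvert A_1 \cap A_2 \rvert}
```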

Sparse Kernel Methods 18 Kernel in Practice
In practical applications, you can choose a kernel that reflects the similarity between two objects.
- Note that k(x, z) = ⟨φ(x), φ(z)⟩ is an inner product; hence, if appropriately normalized, e.g. k̃(x, z) = k(x, z) / sqrt(k(x, x) k(z, z)), the kernel represents the similarity (the cosine of the angle) between the two objects in some feature space.
Remarks
1) Kernel trick: develop a learning algorithm in terms of inner products, then replace the inner product with a kernel (e.g., regression, classification).
2) Generalized distance: the notion of a kernel can be generalized to one that represents dissimilarity in some feature space (a conditionally positive semi-definite kernel). Such kernels can then be used in learning algorithms based on distances between objects (e.g., clustering, nearest neighbor).

Sparse Kernel Methods 19 Support Vector Machines
Two-Class Classification Problem
- Given a training set {(x_n, y_n)}, n = 1, ..., N, with labels y_n ∈ {−1, +1}, find a function f(x) = w^T φ(x) + b that satisfies f(x_n) > 0 for all points having y_n = +1 and f(x_n) < 0 for points having y_n = −1.
- Equivalently, y_n f(x_n) > 0 for all n.

Sparse Kernel Methods 20 Support Vector Machines: Linearly Separable Case
Linearly Separable Case
- The case in which such a function f(x) exists.
- The training points are then separated by a hyperplane (the separating hyperplane) f(x) = 0 in the feature space.
- There can be infinitely many such functions.
Margin
- The margin is the distance between the hyperplane and the closest point.

Sparse Kernel Methods 21 Support Vector Machines: Linearly Separable Case
Maximum Margin Classifiers
- Find the hyperplane with the maximum margin.
- Why maximum margin? Recall SRM: the maximum margin hyperplane corresponds to the case with the smallest capacity (Vapnik, 2000), so it is the solution obtained under the SRM framework.

Sparse Kernel Methods 22 Support Vector Machines: Linearly Separable Case
Formulation: Quadratic Programming (sketched below)
1) The parameters (w, b) are rescaled so that the closest point satisfies y_n (w^T φ(x_n) + b) = 1.
2) The margin is then 1/||w||, so maximizing the margin amounts to minimizing ||w||^2.
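The primal QP itself is omitted from the transcript; in the standard form it is presumably

```latex
\min_{w,\, b}\ \frac{1}{2}\lVert w \rVert^2
\qquad \text{s.t.}\quad y_n\bigl(w^{\top}\phi(x_n) + b\bigr) \ge 1,\quad n = 1, \dots, N
```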

Sparse Kernel Methods 23 Support Vector Machines: Linearly Separable Case
Dual Formulation (sketched below)
- Obtained by applying Lagrange duality.
- The resulting maximum margin classifier is a kernel expansion over the training points.
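The omitted dual problem and resulting classifier, in the standard form:

```latex
\max_{a}\ \sum_{n=1}^{N} a_n - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m y_n y_m\, k(x_n, x_m)
\quad \text{s.t.}\quad a_n \ge 0,\quad \sum_{n=1}^{N} a_n y_n = 0
\qquad\qquad
f(x) = \sum_{n=1}^{N} a_n y_n\, k(x, x_n) + b
```

The offset b is then fixed from the KKT conditions using any support vector, for which y_n f(x_n) = 1.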

Sparse Kernel Methods 24 Support Vector Machines: Linearly Separable Case
Discussion
- By the KKT conditions, a_n > 0 only if the point lies on the margin boundary, i.e. y_n f(x_n) = 1. Such vectors are called support vectors, and the maximum margin hyperplane depends only on them: sparsity.
- To solve the dual problem we only need the kernel function k, so the feature space transform need not be considered explicitly.
- The form of the maximum margin hyperplane shows that the prediction is a combination of observations (with weights given by kernels), specifically of the support vectors (a small code illustration follows below).
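A minimal illustration of this sparsity using scikit-learn's SVC (not part of the original slides); the fitted model stores only the support vectors and their dual coefficients.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
# Two well-separated Gaussian blobs in 2-D
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="rbf", C=10.0).fit(X, y)

print("training points:", len(X))
print("support vectors:", len(clf.support_vectors_))   # typically a small subset of the data
print("prediction:", clf.predict([[0.0, 0.5]]))        # uses only the support vectors
```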

Sparse Kernel Methods 25 Support Vector Machines: Linearly Separable Case
Example (Bishop, 2006). [Figure not reproduced; Gaussian kernels are used in the example.]

Sparse Kernel Methods 26 Support Vector Machines: Overlapping Classes
Overlapping Classes
- Introduce slack variables (a standard soft-margin formulation is sketched below).
- The results are almost the same as in the separable case, except for additional constraints.
- For details, see the reference.
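A standard soft-margin primal, supplied here for reference since the slide gives none:

```latex
\min_{w,\, b,\, \xi}\ \frac{1}{2}\lVert w \rVert^2 + C\sum_{n=1}^{N}\xi_n
\qquad \text{s.t.}\quad y_n\bigl(w^{\top}\phi(x_n) + b\bigr) \ge 1 - \xi_n,\quad \xi_n \ge 0
```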

Sparse Kernel Methods 27 SVM for Regression
ε-insensitive Error Function (sketched below): errors smaller than ε are ignored; larger errors are penalized linearly.
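The standard form of the ε-insensitive loss, filling in the omitted equation:

```latex
E_{\epsilon}\bigl(f(x) - y\bigr) =
\begin{cases}
0, & \lvert f(x) - y \rvert < \epsilon \\
\lvert f(x) - y \rvert - \epsilon, & \text{otherwise}
\end{cases}
```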

Sparse Kernel Methods 28 SVM for Regression
Formulation: minimize the ε-insensitive error plus a regularization term, introducing slack variables for points lying above and below the ε-tube (a standard form is sketched below).
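The omitted primal, in the standard form with slacks ξ_n and ξ̂_n for points above and below the tube:

```latex
\min_{w,\, b,\, \xi,\, \hat{\xi}}\ C\sum_{n=1}^{N}\bigl(\xi_n + \hat{\xi}_n\bigr) + \frac{1}{2}\lVert w \rVert^2
\quad \text{s.t.}\quad
y_n \le f(x_n) + \epsilon + \xi_n,\qquad
y_n \ge f(x_n) - \epsilon - \hat{\xi}_n,\qquad
\xi_n, \hat{\xi}_n \ge 0
```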

Sparse Kernel Methods 29 SVM for Regression
Solution
- As with SVM for classification, use the Lagrange dual; the solution is again a kernel expansion (sketched below).
- By the KKT conditions, a dual variable is positive only if the corresponding point lies on the boundary of, or outside, the ε-tube.
- So sparsity results.
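The omitted prediction formula, in the standard form with dual variables a_n and â_n for the two sets of constraints:

```latex
f(x) = \sum_{n=1}^{N} \bigl(a_n - \hat{a}_n\bigr)\, k(x, x_n) + b
```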

Sparse Kernel Methods 30 Summary
Classification and Regression Based on Kernels
- Dual formulation: enables extension to arbitrary kernels.
- Sparsity: support vectors.
Some Limitations of SVM
- Choice of kernel.
- Solution algorithm: efficiency of solving large-scale QP problems.
- The multi-class problem.

Sparse Kernel Methods 31 Related Topics
Relevance Vector Machines
- Use prior knowledge on the distribution of the functions (parameters).
- Choose the hyperparameters α and β that maximize the marginal likelihood function (sketched below).
- Then, using them, find the predictive distribution of y for a new value x via the posterior of w.
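The marginal likelihood being maximized, written in Bishop's RVM notation since the slide's equation is not in the transcript; α collects the per-weight prior precisions and β is the noise precision:

```latex
p(\mathbf{y} \mid X, \alpha, \beta)
= \int p(\mathbf{y} \mid X, w, \beta)\, p(w \mid \alpha)\, dw,
\qquad
p(w \mid \alpha) = \prod_{i} \mathcal{N}\bigl(w_i \mid 0, \alpha_i^{-1}\bigr)
```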

Sparse Kernel Methods 32 Related Topics
Gaussian Process
- For any finite set of points, the function values jointly have a Gaussian distribution.
- Usually, due to lack of prior knowledge, the mean is taken to be 0.
- The covariance is defined by a kernel function k.
- Given a set of observations, the regression problem reduces to finding the conditional distribution of y at a new point (a standard form is sketched below).
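The standard GP regression predictive distribution, assuming zero mean, kernel k, and Gaussian observation noise of variance σ² (supplied for reference, not taken from the slide); k_* denotes the vector of kernel values between the new point x_* and the training points.

```latex
p\bigl(y_{*} \mid x_{*}, X, \mathbf{y}\bigr) = \mathcal{N}\bigl(\mu_{*}, \sigma_{*}^2\bigr),
\qquad
\mu_{*} = \mathbf{k}_{*}^{\top}\bigl(K + \sigma^2 I\bigr)^{-1}\mathbf{y},
\qquad
\sigma_{*}^2 = k(x_{*}, x_{*}) + \sigma^2 - \mathbf{k}_{*}^{\top}\bigl(K + \sigma^2 I\bigr)^{-1}\mathbf{k}_{*}
```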

Sparse Kernel Methods 33 References
General introductory material for machine learning
[1] Pattern Recognition and Machine Learning by C. M. Bishop, Springer, 2006. A very well written book with an emphasis on Bayesian methods.
Fundamentals of statistical learning theory and kernel methods
[2] Statistical Learning Theory by V. Vapnik, John Wiley and Sons, 1998.
[3] The Nature of Statistical Learning Theory, 2nd ed., by V. Vapnik, Springer, 2000.
Both books cover essentially the same material, but [3] keeps the mathematical details to a minimum while [2] gives them in full. The origin of SVM.
Kernel engineering
[4] Kernel Methods for Pattern Analysis by J. Shawe-Taylor and N. Cristianini, Cambridge University Press, 2004. Covers various kernel methods with applications to text, sequences, trees, etc.
Gaussian processes
[5] Gaussian Processes for Machine Learning by C. Rasmussen and C. Williams, MIT Press, 2006. An up-to-date survey of Gaussian processes and related topics.