Presentation transcript:

Ch 6. Kernel Methods. Kernels were introduced by Aizerman et al. (1964) and re-introduced in the context of large margin classifiers by Boser et al. (1992); see also Vapnik (1995), Burges (1998), Cristianini and Shawe-Taylor (2000), Müller et al. (2001), Schölkopf and Smola (2002), and Herbrich (2002). Based on Pattern Recognition and Machine Learning, C. M. Bishop, 2006.

Recall: linear methods for classification and regression. Classical approaches are linear, parametric or non-parametric; a set of training data is used to obtain a parameter vector (step 1: train, step 2: recognize). Kernel methods, by contrast, are memory-based: they store the entire training set in order to make predictions for future data points (e.g. nearest neighbours), and they transform the data to a higher-dimensional space to achieve linear separability.

Kernel methods approach: stick with linear functions, but work in a high-dimensional feature space. The expectation is that the feature space has a much higher dimension than the input space.

Example: consider a nonlinear mapping of a two-dimensional input into a feature space of quadratic terms. A linear equation in this feature space is actually an ellipse, i.e. a non-linear shape in the input space.
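To make the example concrete, here is a minimal Python sketch (my own; the mapping shown is the quadratic feature map usually used for this example, not taken verbatim from the slides):

```python
import numpy as np

# Hypothetical quadratic feature map: phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2).
def phi(x):
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

# A linear function w^T phi(x) = c in feature space ...
w = np.array([1.0, 4.0, 0.0])   # illustrative weights
c = 1.0

# ... corresponds to x1^2 + 4*x2^2 = 1 in input space, i.e. an ellipse.
x = np.array([0.6, 0.4])
print(np.isclose(w @ phi(x), x[0] ** 2 + 4 * x[1] ** 2))  # True
```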

Capacity of feature spaces: the capacity is proportional to the dimension; the slide illustrates this for the 2-dimensional case.

Form of the functions: kernel methods use linear functions in a feature space, y(x) = w^T φ(x). For regression this function can be used directly; for classification the output is thresholded.

Problems of high dimensions: the capacity may easily become too large and lead to over-fitting (being able to realise every classifier means the learner is unlikely to generalise well), and there are computational costs involved in dealing with large vectors.

Recall: two theoretical approaches converged on similar algorithms: (1) the Bayesian approach led to Bayesian inference using Gaussian processes; (2) the frequentist approach, e.g. maximum likelihood estimation. We first briefly discuss the Bayesian approach before mentioning some of the frequentist results.

I. Bayesian approach: relies on a probabilistic analysis by positing a pdf model and a prior distribution over the function class. Inference involves updating the prior distribution with the likelihood of the data. Possible outputs: the MAP function, or the Bayesian posterior average.

Bayesian approach: avoids overfitting by controlling the prior distribution and by averaging over the posterior.

Bayesian approach: subject to assumptions about the pdf model and the prior distribution, we can obtain error bars on the output and compute the evidence for the model, which can be used for model selection. The approach has been developed for different pdf models (e.g. classification), and typically requires approximate inference.

II. Frequentist approach: the source of randomness is assumed to be a distribution that generates the training data i.i.d., with the same distribution generating the test data. It makes different and weaker assumptions than the Bayesian approach, so it is more general, but less analysis can typically be derived. The main focus is on generalisation error analysis.

Generalisation: what do we mean by generalisation?

Generalisation of a learner: the expected loss of the learned function on new examples drawn from the same distribution as the training data.

Example of generalisation: we consider the Breast Cancer dataset and use the simple Parzen window classifier: the weight vector is w = w_+ - w_-, where w_+ (w_-) is the average of the positive (negative) training examples, and the threshold is set so that the hyperplane bisects the line joining these two points.

Example of generalisation: by repeatedly drawing random training sets S of size m, we estimate the distribution of the generalisation error, using the test-set error as a proxy for the true generalisation. We plot the histogram and the average of the distribution for various training-set sizes: 648, 342, 273, 205, 137, 68, 34, 27, 20, 14, 7. (A sketch of the classifier follows.)
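As an illustration, here is a minimal sketch of the simple classifier described above: the weight vector is the difference of the class means and the threshold bisects the line joining them. The data below are synthetic stand-ins, not the actual Breast Cancer dataset.

```python
import numpy as np

def fit_mean_classifier(X_pos, X_neg):
    # Weight vector is the difference of class means; threshold bisects their join.
    w_pos, w_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    w = w_pos - w_neg
    b = 0.5 * (w @ w_pos + w @ w_neg)
    return w, b

def predict(X, w, b):
    return np.sign(X @ w - b)

# Hypothetical data standing in for the Breast Cancer dataset.
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=+1.0, size=(50, 9))
X_neg = rng.normal(loc=-1.0, size=(50, 9))
w, b = fit_mean_classifier(X_pos, X_neg)

X_test = np.vstack([rng.normal(+1.0, size=(20, 9)), rng.normal(-1.0, size=(20, 9))])
y_test = np.r_[np.ones(20), -np.ones(20)]
print("test error:", np.mean(predict(X_test, w, b) != y_test))
```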

Example of generalisation: since the expected classifier is in all cases the same, we do not expect large differences in the average of the distribution, though the non-linearity of the loss function means they won't be exactly the same.

Error distribution histograms (figures not reproduced) are shown for the full dataset and for training-set sizes 342, 273, 205, 137, 68, 34, 27, 20, 14 and 7.

Observations: things can get bad if the number of training examples is small compared to the dimension; the mean can be a bad predictor of true generalisation, i.e. things can look okay in expectation but still go badly wrong. The key ingredient of learning is to keep flexibility high while still ensuring good generalisation.

Controlling generalisation: the critical method of controlling generalisation for classification is to force a large margin on the training data.

Kernel methods approach (recap): stick with linear functions, but work in a high-dimensional feature space; the expectation is that the feature space has a much higher dimension than the input space.

Study: Hilbert space. Functionals: a map from a vector space to a field. Duality, inner product, norm, similarity, distance, metric.

Kernel functions: k(x, x') = φ(x)^T φ(x'). For example, k(x, x') = (x^T x' + c)^M. What if x and x' are two images? The kernel then represents a particular weighted sum of all possible products of M pixels in the first image with M pixels in the second image.

Kernel function: evaluated at the training data points, k(x, x') = φ(x)^T φ(x'). Linear kernel: k(x, x') = x^T x'. Stationary kernels: k(x, x') = k(x - x'), invariant to translation. Homogeneous kernels, i.e. radial basis functions: k(x, x') = k(‖x - x'‖). (A short sketch of these kernels follows.)
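A minimal sketch of the kernels named on this slide (parameter values are illustrative assumptions):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z                                    # k(x, z) = x^T z

def polynomial_kernel(x, z, c=1.0, M=2):
    return (x @ z + c) ** M                         # k(x, z) = (x^T z + c)^M

def gaussian_kernel(x, z, sigma=1.0):
    # A stationary, homogeneous (radial basis function) kernel.
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, z), polynomial_kernel(x, z), gaussian_kernel(x, z))
```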

Kernel trick: if we have an algorithm in which the input vector x enters only in the form of scalar products, then we can replace that scalar product with some other choice of kernel.

Dual representations (1/4): consider a linear regression model with the regularized sum-of-squares error J(w) = ½ Σ_n {w^T φ(x_n) - t_n}² + (λ/2) w^T w. Setting the gradient with respect to w to zero gives w = Φ^T a, where the n-th row of the design matrix Φ is φ(x_n)^T and a_n = -(1/λ){w^T φ(x_n) - t_n}.

Dual representations (2/4): we can now reformulate the least-squares algorithm in terms of a (the dual representation). We substitute w = Φ^T a into J(w) to obtain an error function expressed in terms of the Gram matrix K = ΦΦ^T, with entries K_nm = φ(x_n)^T φ(x_m) = k(x_n, x_m).

Dual representations (3/4): the sum-of-squares error function can be written as J(a) = ½ a^T K K a - a^T K t + ½ t^T t + (λ/2) a^T K a. Setting the gradient of J(a) with respect to a to zero, we obtain the optimal a = (K + λ I_N)^{-1} t. Recall that a was defined in terms of w.

Dual representations (4/4): substituting back into the regression model, the prediction for a new input x is y(x) = k(x)^T (K + λ I_N)^{-1} t, where the vector k(x) has elements k_n(x) = k(x_n, x). The prediction y(x) is thus a linear combination of the target values t and is expressed entirely in terms of the kernel function k(x, x'); correspondingly, w = Φ^T a is a linear combination of the feature vectors φ(x_n).

Recall: the linear regression solution is w = (Φ^T Φ + λ I_M)^{-1} Φ^T t, while the dual representation gives a = (K + λ I_N)^{-1} t. Note that K is N×N, whereas Φ^T Φ is M×M, where M is the number of basis functions and N the number of data points. (A numerical check follows.)
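The following sketch (assumptions mine; it uses a linear kernel so that φ(x) = x and K = XX^T) checks numerically that the primal and dual solutions give identical predictions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, lam = 20, 3, 0.1
X = rng.normal(size=(N, M))                   # design matrix Phi (here phi(x) = x)
t = rng.normal(size=N)

# Primal: w = (Phi^T Phi + lambda I_M)^{-1} Phi^T t
w = np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ t)

# Dual: a = (K + lambda I_N)^{-1} t, prediction y(x) = k(x)^T a
K = X @ X.T
a = np.linalg.solve(K + lam * np.eye(N), t)

x_new = rng.normal(size=M)
y_primal = w @ x_new
y_dual = (X @ x_new) @ a                      # k(x) has elements k(x_n, x) = x_n^T x
print(np.isclose(y_primal, y_dual))           # True
```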

Constructing kernels (1/5): a kernel function is defined as an inner product of feature vectors, k(x, x') = φ(x)^T φ(x'). Example: in two dimensions, k(x, z) = (x^T z)² corresponds to the feature map φ(x) = (x_1², √2 x_1 x_2, x_2²)^T.

Basis functions and corresponding kernels. Figure 6.1: the upper plots show basis functions (polynomials, Gaussians, logistic sigmoids), and the lower plots show the corresponding kernel functions.

Constructing kernels: a necessary and sufficient condition for a function to be a valid kernel is that the Gram matrix K be positive semidefinite for all possible choices of the input set. Techniques for constructing new kernels: given valid kernels k1(x, x') and k2(x, x'), new kernels such as c k1, f(x) k1(x, x') f(x'), k1 + k2, k1 k2, exp(k1), and x^T A x' for positive semidefinite A are also valid. (A small numerical check follows.)
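A small numerical illustration (mine, not from the text): Gram matrices built from valid kernels, and from their sums and products, are positive semidefinite up to numerical error.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))

def gram(kernel, X):
    # Build the Gram matrix K_nm = k(x_n, x_m).
    return np.array([[kernel(x, z) for z in X] for x in X])

k1 = lambda x, z: x @ z                                     # linear kernel
k2 = lambda x, z: np.exp(-np.sum((x - z) ** 2))             # Gaussian kernel

for K in (gram(k1, X), gram(k2, X),
          gram(lambda x, z: k1(x, z) + k2(x, z), X),        # sum of kernels
          gram(lambda x, z: k1(x, z) * k2(x, z), X)):       # product of kernels
    print(np.linalg.eigvalsh(K).min() >= -1e-9)             # True for each
```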

Gaussian kernel: k(x, x') = exp(-‖x - x'‖²/2σ²). Show that this is a valid kernel, and that the feature vector corresponding to the Gaussian kernel has infinite dimensionality.
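For reference, the standard argument runs as follows: write the Gaussian kernel as

```latex
k(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\frac{\|\mathbf{x}-\mathbf{x}'\|^{2}}{2\sigma^{2}}\right)
  = \exp\!\left(-\frac{\mathbf{x}^{\top}\mathbf{x}}{2\sigma^{2}}\right)
    \exp\!\left(\frac{\mathbf{x}^{\top}\mathbf{x}'}{\sigma^{2}}\right)
    \exp\!\left(-\frac{\mathbf{x}'^{\top}\mathbf{x}'}{2\sigma^{2}}\right).
```

The middle factor, expanded as a power series exp(x^T x'/σ²) = Σ_{m≥0} (x^T x')^m / (σ^{2m} m!), is a sum of polynomial kernels of all orders, so the corresponding feature space is infinite-dimensional; the outer factors have the form f(x) k(x, x') f(x'), which preserves validity.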

Construction of kernels from generative models: given p(x), define a kernel function k(x, x') = p(x)p(x'). More generally, k(x, x') = Σ_z p(x|z) p(x'|z) p(z) measures the similarity of two inputs through a hidden variable z; this leads to kernels based on hidden Markov models when x and x' are sequences of observations.

Fisher kernel: consider the Fisher score g(θ, x) = ∇_θ ln p(x|θ). The Fisher kernel is then defined as k(x, x') = g(θ, x)^T F^{-1} g(θ, x'), where F is the Fisher information matrix, F = E_x[g(θ, x) g(θ, x)^T].

Sigmoid kernel: k(x, x') = tanh(a x^T x' + b). This kernel form gives the support vector machine a superficial resemblance to a neural network model, although its Gram matrix is not in general positive semidefinite.

How to select the basis functions? Assume a fixed nonlinear transformation: transform the inputs using a vector of basis functions φ(x). The resulting decision boundaries will be linear in the feature space, y(x) = w^T φ(x).

Radial basis function networks: each basis function depends only on the radial distance from a centre μ_j, so that φ_j(x) = h(‖x - μ_j‖).

Radial basis function networks (2/3): consider the interpolation problem when the input variables are noisy. If the noise on the input vector x is described by a variable ξ with distribution ν(ξ), the sum-of-squares error function becomes E = ½ Σ_n ∫ {y(x_n + ξ) - t_n}² ν(ξ) dξ. Using the calculus of variations, the optimal y(x) is a weighted sum of the targets with normalised basis functions centred on the data points.

Radial basis function networks (3/3). Figure 6.2: Gaussian basis functions and the corresponding normalised basis functions.

Nadaraya-Watson model (1/2): starting from a kernel density estimate of the joint distribution p(x, t), the regression function is y(x) = Σ_n k(x, x_n) t_n, where the kernel k(x, x_n) = g(x - x_n)/Σ_m g(x - x_m) gives a set of weights that sum to one (kernel regression).

Nadaraya-Watson model (2/2). Figure 6.3: illustration of the Nadaraya-Watson kernel regression model for the sinusoidal data set. The original sine function is the green curve, the data points are shown in blue, and the resulting regression curve is red.
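A minimal sketch of the Nadaraya-Watson estimator with Gaussian kernels (the bandwidth and data are illustrative assumptions):

```python
import numpy as np

def nadaraya_watson(x_query, X, t, h=0.2):
    # Gaussian component g(x - x_n) with bandwidth h.
    g = np.exp(-0.5 * ((x_query - X) / h) ** 2)
    k = g / g.sum()                 # normalised kernel weights, summing to one
    return k @ t                    # weighted average of the targets

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 25))
t = np.sin(2 * np.pi * X) + rng.normal(scale=0.2, size=X.shape)
print(nadaraya_watson(0.25, X, t))  # should be near sin(2*pi*0.25) = 1
```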

Gaussian processes: we extend the role of kernels to probabilistic discriminative models. We dispense with the parametric model and instead define a prior probability distribution over functions directly.

Linear regression revisited (1/3): consider y(x) = w^T φ(x) with prior distribution p(w) = N(w | 0, α^{-1} I). This induces a distribution over functions; evaluated at the training points, the vector y = Φw is Gaussian.

Linear regression revisited (2/3): a key point about Gaussian stochastic processes is that the joint distribution over N variables is specified completely by the second-order statistics, namely the mean and the covariance. Here the mean is zero and the covariance is given by the Gram matrix, E[y y^T] = (1/α) ΦΦ^T = K, with K_nm = k(x_n, x_m) = (1/α) φ(x_n)^T φ(x_m).

Linear regression revisited (3/3). Figure 6.4: samples from Gaussian processes for a 'Gaussian' kernel (left) and an exponential kernel (right).

Gaussian processes for regression (1/7): the observed targets are t_n = y_n + ε_n with Gaussian noise, p(t_n | y_n) = N(t_n | y_n, β^{-1}). With a Gaussian process prior p(y) = N(y | 0, K), the marginal distribution of t is p(t) = N(t | 0, C), where C = K + β^{-1} I_N.

Gaussian processes for regression (2/7): one widely used kernel function for Gaussian process regression is k(x_n, x_m) = θ_0 exp(-(θ_1/2)‖x_n - x_m‖²) + θ_2 + θ_3 x_n^T x_m. Figure 6.5: samples from a Gaussian process prior defined by this covariance function, for different settings of the hyperparameters.

Gaussian processes for regression (3/7): the predictive distribution p(t_{N+1} | t_N) is Gaussian with mean m(x_{N+1}) = k^T C_N^{-1} t and variance σ²(x_{N+1}) = c - k^T C_N^{-1} k, where k has elements k(x_n, x_{N+1}) and c = k(x_{N+1}, x_{N+1}) + β^{-1}. The mean and variance follow from the Gaussian conditioning results (2.80) and (2.81).

Gaussian processes for regression (4/7): an advantage of the Gaussian process viewpoint is that we can consider covariance functions that can only be expressed in terms of an infinite number of basis functions. (A sketch of the predictive equations follows.)
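A minimal sketch of the Gaussian process predictive equations above, using a Gaussian kernel (the kernel, noise precision β and hyperparameter values are assumptions for illustration):

```python
import numpy as np

def gp_predict(x_query, X, t, beta=25.0, theta=4.0):
    # Predictive mean m(x) = k^T C_N^{-1} t and variance sigma^2(x) = c - k^T C_N^{-1} k.
    kern = lambda a, b: np.exp(-theta * 0.5 * (a - b) ** 2)
    C = kern(X[:, None], X[None, :]) + np.eye(len(X)) / beta   # C_N = K + beta^{-1} I
    k = kern(X, x_query)                                        # k_n = k(x_n, x_query)
    c = kern(x_query, x_query) + 1.0 / beta
    mean = k @ np.linalg.solve(C, t)
    var = c - k @ np.linalg.solve(C, k)
    return mean, var

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 10)
t = np.sin(2 * np.pi * X) + rng.normal(scale=0.2, size=X.shape)
print(gp_predict(0.5, X, t))   # mean near sin(pi) = 0, plus a predictive variance
```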

Gaussian processes for regression (5/7). Figure 6.6: illustration of the sampling of data points {t_n} from a Gaussian process. The blue curve shows a sample function and the red points show the values of y_n; the corresponding values of {t_n}, shown in green, are obtained by adding independent Gaussian noise to each of the {y_n}.

Gaussian processes for regression (6/7): Gaussian process regression for the case of one training point and one test point, in which the red ellipses show contours of the joint distribution p(t_1, t_2). Here t_1 is the training data point, and conditioning on the value of t_1, corresponding to the vertical blue line, we obtain p(t_2 | t_1), shown as a function of t_2 by the green curve.

Gaussian processes for regression (7/7). Figure 6.8: illustration of Gaussian process regression applied to the sinusoidal data set. The green curve shows the sinusoidal function from which the data points, shown in blue, are obtained by sampling and the addition of Gaussian noise. The red line shows the mean of the Gaussian process predictive distribution.

Learning the hyperparameters: in practice, rather than fixing the covariance function, we may prefer to use a parametric family of functions and then infer the parameter values from the data. The simplest approach is to make a point estimate of θ by maximizing the log likelihood function ln p(t | θ), which has the standard form of a multivariate Gaussian log density.
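For reference, the log likelihood being maximized has the form

```latex
\ln p(\mathbf{t}\mid\boldsymbol{\theta})
  = -\tfrac{1}{2}\ln|\mathbf{C}_N|
    -\tfrac{1}{2}\mathbf{t}^{\top}\mathbf{C}_N^{-1}\mathbf{t}
    -\tfrac{N}{2}\ln(2\pi),
```

with gradient ∂/∂θ_i ln p(t | θ) = -½ Tr(C_N^{-1} ∂C_N/∂θ_i) + ½ t^T C_N^{-1} (∂C_N/∂θ_i) C_N^{-1} t, which can be used with standard gradient-based optimization; the log likelihood is in general non-convex, so multiple maxima may exist.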

Automatic relevance determination (1/2): if a particular parameter η_i becomes small, the function becomes insensitive to the corresponding input variable x_i. In Figure 6.10, the target is obtained by evaluating the function sin(2π x_1) and adding Gaussian noise; values of x_2 are given by copying the corresponding values of x_1 and adding noise, and values of x_3 are sampled from an independent Gaussian distribution. The figure shows the evolution of η_1 (red), η_2 (green), and η_3 (blue) during learning.

Automatic relevance determination (2/2). Figure 6.9: samples from the ARD prior for Gaussian processes. The left plot corresponds to η_1 = η_2 = 1, and the right plot corresponds to η_1 = 1, η_2 = 0.01.

Gaussian processes for classification (1/2): we place a Gaussian process prior over a function a(x) and pass it through a logistic sigmoid, so that p(t = 1 | a) = σ(a); a_{N+1} is the argument of the logistic function at the test point. The required predictive distribution is intractable. One technique is based on variational inference, which yields a lower bound on the likelihood function; a second approach uses expectation propagation.

Gaussian processes for classification (2/2). Figure 6.11: the left plot shows a sample from a Gaussian process prior over functions a(x), and the right plot shows the result of transforming this sample using a logistic sigmoid function.

Laplace approximation (1/8): the third approach to Gaussian process classification is based on the Laplace approximation.

Laplace approximation (2/8): we obtain the Laplace approximation by Taylor expanding the logarithm of p(a_N | t_N), which, up to an additive normalization constant, is given by the quantity Ψ(a_N) = ln p(a_N) + ln p(t_N | a_N). We resort to an iterative scheme based on the Newton-Raphson method, which gives rise to an iterative reweighted least squares (IRLS) algorithm.

Laplace approximation (3/8): the gradient and Hessian of Ψ are ∇Ψ = t_N - σ_N - C_N^{-1} a_N and ∇∇Ψ = -W_N - C_N^{-1}, where W_N is a diagonal matrix with elements σ(a_n){1 - σ(a_n)}. Using the Newton-Raphson formula, the iterative update equation for a_N is a_N^new = C_N (I + W_N C_N)^{-1} {t_N - σ_N + W_N a_N}.

Laplace approximation (4/8): at the mode, the gradient of Ψ vanishes, and hence a*_N satisfies a*_N = C_N (t_N - σ_N). Our Gaussian approximation to the posterior distribution p(a_N | t_N) is then q(a_N) = N(a_N | a*_N, H^{-1}), where H = W_N + C_N^{-1} is the negative Hessian of Ψ. (A sketch of the iteration follows.)
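A minimal sketch of the Newton-Raphson (IRLS) iteration above and the resulting Gaussian approximation (the toy data and kernel are assumptions for illustration):

```python
import numpy as np

def laplace_mode(C, t, n_iter=20):
    # Find the mode a* of Psi(a_N) and form q(a_N) = N(a*, H^{-1}), H = W_N + C_N^{-1}.
    sigma = lambda a: 1.0 / (1.0 + np.exp(-a))
    N = len(t)
    a = np.zeros(N)
    for _ in range(n_iter):
        s = sigma(a)
        W = np.diag(s * (1.0 - s))
        # a_new = C (I + W C)^{-1} (t - sigma + W a)
        a = C @ np.linalg.solve(np.eye(N) + W @ C, t - s + W @ a)
    s = sigma(a)
    H = np.diag(s * (1.0 - s)) + np.linalg.inv(C)    # negative Hessian at the mode
    return a, np.linalg.inv(H)                       # mode and posterior covariance

# Hypothetical toy problem: 1-D inputs, Gaussian-kernel covariance with jitter, binary targets.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 8)
t = (X > 0).astype(float)
C = np.exp(-10.0 * (X[:, None] - X[None, :]) ** 2) + 1e-6 * np.eye(len(X))
a_star, Sigma = laplace_mode(C, t)
print(a_star.round(2))
```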

Laplace approximation (5/8): solving the integral for p(a_{N+1} | t_N) gives a Gaussian with mean k^T (t_N - σ_N) and variance c - k^T (W_N^{-1} + C_N)^{-1} k. We are interested in the decision boundary corresponding to p(t_{N+1} | t_N) = 0.5.

Laplace approximation (6/8).

Laplace approximation (7/8): rearranging the terms gives:

Laplace approximation (8/8). Figure 6.12: illustration of the use of a Gaussian process for classification. On the left, the true distribution is green and the decision boundary from the Gaussian process is black. On the right is the predicted posterior probability for the blue and red classes, together with the Gaussian process decision boundary.

Connection to neural networks: Neal has shown that, for a broad class of prior distributions over w, the distribution of functions generated by a neural network will tend to a Gaussian process in the limit M → ∞, where M is the number of hidden units. By working directly with the covariance function we have implicitly marginalized over the distribution of weights. If the weight prior is governed by hyperparameters, then their values will determine the length scales of the distribution over functions. Note that we cannot marginalize out the hyperparameters analytically, and must instead resort to techniques of the kind discussed in Section 6.4.