The Kernel Trick
Kenneth D. Harris, 3/6/15

Multiple linear regression
- What are you predicting? Data type: continuous; dimensionality: 1
- What are you predicting it from? Data type: continuous; dimensionality: p
- How many data points do you have? Enough
- What sort of prediction do you need? Single best guess
- What sort of relationship can you assume? Linear

GLMs, SVMs…
- What are you predicting? Data type: discrete, integer, whatever; dimensionality: 1
- What are you predicting it from? Data type: continuous; dimensionality: p
- How many data points do you have? Not enough
- What sort of prediction do you need? Single best guess or probability distribution
- What sort of relationship can you assume? Linear or nonlinear

Kernel approach
- What are you predicting? Data type: discrete, integer, whatever; dimensionality: 1
- What are you predicting it from? Data type: anything; dimensionality: p
- How many data points do you have? Not enough
- What sort of prediction do you need? Single best guess or probability distribution
- What sort of relationship can you assume? Nonlinear

The basic idea
Before, our predictor variables lived in a Euclidean space, and predictions from them were linear. Now they live in any sort of space, but we have a measure of how similar any two predictors are.

The Kernel Matrix
For data in a Euclidean space, define the kernel matrix as $\mathbf{K} = \mathbf{X}\mathbf{X}^T$. $\mathbf{K}$ is an $N \times N$ matrix containing the dot products of the predictors for each pair of data points: $K_{ij} = \mathbf{x}_i \cdot \mathbf{x}_j$. It tells you how similar every two data points are. By contrast, the covariance matrix $\mathbf{X}^T\mathbf{X}$ is a $p \times p$ matrix that tells you how similar any two variables are.
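
As a minimal illustration (my own sketch, not from the original slides; the data and variable names are invented), here is the distinction in numpy: the kernel matrix compares data points, the covariance-style matrix compares variables.

```python
import numpy as np

# N data points, p predictor variables (made-up example data)
rng = np.random.default_rng(0)
N, p = 100, 5
X = rng.standard_normal((N, p))

K = X @ X.T   # N x N kernel (Gram) matrix: K[i, j] = x_i . x_j, similarity of data points
C = X.T @ X   # p x p matrix of dot products between variables

print(K.shape, C.shape)   # (100, 100) (5, 5)
```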

The Kernel Trick
You can fit many models using only the kernel matrix; the original observations don't come into it at all, other than via the kernel matrix. So you never actually needed the predictors $\mathbf{x}_i$, just a measure of their similarity $K_{ij}$. It doesn't matter whether they live in a Euclidean space or not, as long as you can define and compute a kernel. Even when they do live in a Euclidean space, you can use a kernel that isn't their actual dot product.

Some of what you can do with the kernel trick
- Support vector machines (where it was first used)
- Kernel ridge regression
- Kernel PCA
- Density estimation
- Kernel logistic regression and other GLMs
- Bayesian methods (also called Gaussian process regression)
- Kernel adaptive filters (for time series)
- Many more…

The Matrix Inversion Lemma
For any $n \times m$ matrix $\mathbf{U}$ and $m \times n$ matrix $\mathbf{V}$:
$(\mathbf{I}_n + \mathbf{U}\mathbf{V})^{-1} = \mathbf{I}_n - \mathbf{U}(\mathbf{I}_m + \mathbf{V}\mathbf{U})^{-1}\mathbf{V}$
Proof: multiply $\mathbf{I}_n + \mathbf{U}\mathbf{V}$ by $\mathbf{I}_n - \mathbf{U}(\mathbf{I}_m + \mathbf{V}\mathbf{U})^{-1}\mathbf{V}$, and watch everything cancel. This is not an approximation or a Taylor series: it is exact. We have replaced inverting an $n \times n$ matrix with inverting an $m \times m$ matrix.
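
A quick numerical check of the lemma (my own sketch, not part of the slides; the dimensions and random matrices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 6, 3
U = rng.standard_normal((n, m))
V = rng.standard_normal((m, n))

# Left side: invert an n x n matrix
lhs = np.linalg.inv(np.eye(n) + U @ V)
# Right side: only an m x m inverse is needed
rhs = np.eye(n) - U @ np.linalg.inv(np.eye(m) + V @ U) @ V

print(np.allclose(lhs, rhs))   # True - the identity is exact, not an approximation
```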

Kernel Ridge Regression
Remember the ridge regression model: $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}$, with loss
$L = \frac{1}{2}\|\mathbf{y} - \hat{\mathbf{y}}\|^2 + \frac{1}{2}\lambda\|\mathbf{w}\|^2$
The optimum weight is $\mathbf{w} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}_p)^{-1}\mathbf{X}^T\mathbf{y}$, which involves the $p \times p$ covariance matrix $\mathbf{X}^T\mathbf{X}$. Using the matrix inversion lemma, one can show this is equal to $\mathbf{w} = \mathbf{X}^T(\mathbf{X}\mathbf{X}^T + \lambda\mathbf{I}_N)^{-1}\mathbf{y}$, which involves the $N \times N$ kernel matrix $\mathbf{X}\mathbf{X}^T$.
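
To see that the two forms agree, here is a small numpy check (my own sketch with made-up data, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, lam = 50, 8, 0.1
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)

# Primal form: solve with the p x p matrix X^T X + lambda I_p
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Dual form: solve with the N x N matrix X X^T + lambda I_N (kernel matrix plus ridge)
w_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(N), y)

print(np.allclose(w_primal, w_dual))   # True
```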

Response to a new observation
Given a new observation $\mathbf{x}_{test}$, what do we predict?
$\hat{y} = \mathbf{x}_{test} \cdot \mathbf{w} = \mathbf{x}_{test}\mathbf{X}^T(\mathbf{X}\mathbf{X}^T + \lambda\mathbf{I}_N)^{-1}\mathbf{y} = \sum_i (\mathbf{x}_{test} \cdot \mathbf{x}_i)\,\alpha_i$
where $\boldsymbol{\alpha} = (\mathbf{X}\mathbf{X}^T + \lambda\mathbf{I}_N)^{-1}\mathbf{y}$, the "dual weight", depends on $\mathbf{X}$ only via the kernel matrix. The prediction is a sum of the $\alpha_i$ times $\mathbf{x}_{test} \cdot \mathbf{x}_i$, so it too depends on $\mathbf{x}_{test}$ only through its dot products with the training-set predictors.
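
Written entirely in terms of kernel values, fitting and prediction might look like the sketch below (my own illustration; the helper names fit_dual_weights and predict are hypothetical, not from the slides):

```python
import numpy as np

def fit_dual_weights(K_train, y, lam):
    """Dual weights: alpha = (K + lambda I_N)^{-1} y, where K[i, j] = k(x_i, x_j)."""
    N = K_train.shape[0]
    return np.linalg.solve(K_train + lam * np.eye(N), y)

def predict(k_test, alpha):
    """Prediction for one test point: sum_i alpha_i * k(x_test, x_i)."""
    return k_test @ alpha

# Example with the linear kernel k(x, x') = x . x' on made-up data
rng = np.random.default_rng(3)
X, y = rng.standard_normal((50, 8)), rng.standard_normal(50)
alpha = fit_dual_weights(X @ X.T, y, lam=0.1)
x_test = rng.standard_normal(8)
print(predict(X @ x_test, alpha))
```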

Network view
[Figure: a network diagram of the kernel predictor. The input layer has one node per input dimension of $\mathbf{x}_{test}$; a middle layer has one node per training-set point, each computing $K(\mathbf{x}_{test}, \mathbf{x}_i)$; the output is $f = \sum_i \alpha_i K(\mathbf{x}_{test}, \mathbf{x}_i)$.]

Intuition 1: “Feature space”
Consider a nonlinear map $\boldsymbol{\phi}$ into a higher-dimensional “feature space” such that $K(\mathbf{x}_i, \mathbf{x}_j) = \boldsymbol{\phi}(\mathbf{x}_i) \cdot \boldsymbol{\phi}(\mathbf{x}_j)$, and fit a linear model there: $\hat{y} = \mathbf{w} \cdot \boldsymbol{\phi}(\mathbf{x})$. But you never actually compute $\boldsymbol{\phi}$ – you just use the equivalent kernel.

Quadratic Kernel
$K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + c)^2$
The higher-dimensional feature space contains all pairwise products of the variables. A hyperplane in the higher-dimensional space corresponds to an ellipsoid in the original space.
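
One can check the feature-space picture directly for the quadratic kernel (my own sketch; the feature map phi below is one explicit choice consistent with the formula above, not from the slides):

```python
import numpy as np

def phi(x, c):
    """Explicit feature map whose dot product reproduces (x . z + c)^2."""
    pairwise = np.outer(x, x).ravel()                 # all products x_i * x_j
    return np.concatenate([pairwise, np.sqrt(2 * c) * x, [c]])

rng = np.random.default_rng(4)
x, z, c = rng.standard_normal(3), rng.standard_normal(3), 1.0

k_direct = (x @ z + c) ** 2        # kernel evaluated in the original space
k_feature = phi(x, c) @ phi(z, c)  # dot product in the higher-dimensional space
print(np.allclose(k_direct, k_feature))   # True
```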

Radial basis function kernel
$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right)$
Predictors are considered similar if they are close together. The feature space would be infinite-dimensional, but it doesn't matter since you never actually use it.
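
A short sketch of computing an RBF kernel matrix (my own example; rbf_kernel is a hypothetical helper, not from the slides):

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf_kernel(XA, XB, sigma):
    sq_dists = cdist(XA, XB, 'sqeuclidean')      # ||x_i - x_j||^2 for every pair
    return np.exp(-sq_dists / (2 * sigma ** 2))

X = np.random.default_rng(5).standard_normal((20, 4))
K = rbf_kernel(X, X, sigma=1.0)
print(K.shape, K[0, 0])   # (20, 20) 1.0 - every point is maximally similar to itself
```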

Something analogous in the brain?
DiCarlo, Zoccolan, Rust, “How does the brain solve visual object recognition?”, Neuron 2012

Intuition 2: “Function space”
We are trying to fit a function $f(\mathbf{x})$ that minimizes
$L = \sum_i E\left(f(\mathbf{x}_i), y_i\right) + \frac{1}{2}\lambda \|f\|^2$
$E\left(f(\mathbf{x}_i), y_i\right)$ is the error function: it could be squared error, hinge loss, whatever. $\|f\|^2$ is the penalty term, which penalizes “rough” functions. For kernel ridge regression, $\hat{y} = f(\mathbf{x})$. The weights are gone!

Function norms
$\|f\|$ is a “function norm”: it has to be larger for wiggly functions and smaller for smooth functions.
[Figure: two example curves, a wiggly one labelled “$\|f\|$ large” and a smooth one labelled “$\|f\|$ small”.]

Norms and Kernels
If we are given a kernel $K(\mathbf{x}_1, \mathbf{x}_2)$, we can define a function norm by
$\|f\|^2 = \int f(\mathbf{x}_1)\, K^{-1}(\mathbf{x}_1, \mathbf{x}_2)\, f(\mathbf{x}_2)\, d\mathbf{x}_1\, d\mathbf{x}_2$
Here $K^{-1}(\mathbf{x}_1, \mathbf{x}_2)$ is the “inverse filter” of $K$: if $K$ is smooth, $K^{-1}$ is a high-pass filter, which is why wiggly functions have a larger norm. This is called a “Reproducing Kernel Hilbert Space” (RKHS) norm. (It doesn't matter why, but you may hear the term.)

Representer theorem
For this kind of norm, the $f$ that minimizes our loss function
$L = \sum_i E\left(f(\mathbf{x}_i), y_i\right) + \frac{1}{2}\lambda \|f\|^2$
will always be of the form
$f(\mathbf{x}) = \sum_i \alpha_i K(\mathbf{x}, \mathbf{x}_i)$
So to find the best function $f$, you just need to find the best vector $\boldsymbol{\alpha}$.
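
As an illustration of the representer form (my own sketch on made-up 1-D data with an RBF kernel, not from the slides), kernel ridge regression finds $\boldsymbol{\alpha}$ and then evaluates $f(\mathbf{x}) = \sum_i \alpha_i K(\mathbf{x}, \mathbf{x}_i)$ at new points:

```python
import numpy as np

def rbf(a, b, sigma=0.5):
    """RBF kernel between two sets of 1-D points."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(6)
x_train = np.linspace(0, 2 * np.pi, 30)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(30)

lam = 0.1
K = rbf(x_train, x_train)
alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)   # best alpha for squared error

x_test = np.linspace(0, 2 * np.pi, 200)
f_test = rbf(x_test, x_train) @ alpha     # f(x) = sum_i alpha_i K(x, x_i)
print(np.max(np.abs(f_test - np.sin(x_test))))   # should be fairly small: a smooth fit close to sin(x)
```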

Two views of the same technique
- Nonlinearly map the data into a high-dimensional feature space, then fit a linear function with a weight penalty.
- Fit a nonlinear function, penalized by its roughness.

Practical issues
- Need to choose a good kernel. The RBF kernel is very popular, but you then need to choose $\sigma^2$: too small gives overfitting, too big gives a poor fit (see the sketch after this list).
- Can apply to any sort of data if you pick a good kernel: genome sequences, text, neuron morphologies, …
- Computation cost is $O(N^2)$: good for high-dimensional problems, not always good when you have lots of data.
- May need to store the entire training set, but with a support vector machine most $\alpha_i$ are zero, so you don't.
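
A rough way to choose $\sigma$ is by held-out validation error, as in this sketch (my own example with made-up 1-D data, not from the slides):

```python
import numpy as np

def rbf(a, b, sigma):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0, 2 * np.pi, 80))
y = np.sin(x) + 0.2 * rng.standard_normal(80)
x_tr, y_tr, x_val, y_val = x[::2], y[::2], x[1::2], y[1::2]   # split into train / validation

lam = 0.1
for sigma in [0.01, 0.1, 0.5, 2.0, 10.0]:
    alpha = np.linalg.solve(rbf(x_tr, x_tr, sigma) + lam * np.eye(len(x_tr)), y_tr)
    y_hat = rbf(x_val, x_tr, sigma) @ alpha
    # Validation error is typically lowest at an intermediate sigma:
    # very small sigma overfits, very large sigma underfits.
    print(sigma, np.mean((y_val - y_hat) ** 2))
```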

If you are serious…
And if you are really serious:
- T. Evgeniou, M. Pontil, T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 2000.
- Ryan M. Rifkin and Ross A. Lippert. Value regularization and Fenchel duality. Journal of Machine Learning Research, 2007.