1 Christopher M. Bishop, Pattern Recognition and Machine Learning

2 Outline
- Introduction to kernel methods
- Support vector machines (SVM)
- Relevance vector machines (RVM)
- Applications
- Conclusions

3 Supervised Learning
In machine learning, applications in which the training data comprises examples of the input vectors along with their corresponding target vectors are called supervised learning. The model y(x) maps an input to an output. Example training data: (1, 60, pass), (2, 53, fail), (3, 77, pass), (4, 34, fail), where the final entry of each example is the target output.

4 Classification
Figure: a two-dimensional input space (x1, x2) with a linear decision boundary y = 0 separating the region y > 0 (class t = 1) from the region y < 0 (class t = -1).

5 Regression
Figure: a curve fitted to training points (x, t), with x and t on [0, 1]; the fitted function is used to predict t at a new input x.

6 Linear Models
Linear models for regression and classification take the form y(x) = w^T x, where x is the input and w holds the model parameters. If we apply feature extraction φ, the model becomes y(x) = w^T φ(x).
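To make the notation concrete, here is a minimal sketch (illustrative names and numbers, not code from the slides) of a linear-in-the-parameters model y(x) = w^T φ(x) with a hand-picked polynomial feature map:

```python
import numpy as np

def phi(x):
    """Map a scalar input to a polynomial feature vector (1, x, x^2)."""
    return np.array([1.0, x, x**2])

def predict(w, x):
    """Linear-in-the-parameters model: y(x) = w^T phi(x)."""
    return w @ phi(x)

w = np.array([0.5, -1.0, 2.0])   # model parameters
print(predict(w, 3.0))           # 0.5 - 3.0 + 18.0 = 15.5
```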

7 Problems with Feature Space
Why feature extraction? Working in high-dimensional feature spaces makes it possible to express complex functions.
Problems:
- a computational problem (working with very large vectors)
- the curse of dimensionality

8 Kernel Methods (1)
Kernel function: an inner product in some feature space, k(x, x') = φ(x)^T φ(x'), which acts as a nonlinear similarity measure.
Examples:
- polynomial: k(x, x') = (x^T x' + c)^d
- Gaussian: k(x, x') = exp(-||x - x'||^2 / (2σ^2))
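As a concrete illustration, the two example kernels can be written in a few lines of Python; the constants c, d and σ below are illustrative choices, not values from the slides:

```python
import numpy as np

def polynomial_kernel(x, z, c=1.0, d=2):
    """k(x, z) = (x^T z + c)^d"""
    return (x @ z + c) ** d

def gaussian_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(polynomial_kernel(x, z), gaussian_kernel(x, z))
```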

9 Kernel Methods (2)
Many linear models can be reformulated using a "dual representation" in which the kernel functions arise naturally; the model then requires only inner products between data (input) points.

10 Kernel Methods (3)
We can benefit from the kernel trick:
- choosing a kernel function is equivalent to choosing φ, so there is no need to specify which features are being used
- we can save computation by never explicitly mapping the data into feature space, instead evaluating the kernel directly on the input data
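A small sketch of the kernel trick for the quadratic kernel k(x, z) = (x^T z)^2 on 2-D inputs: evaluating the kernel directly gives the same value as an inner product in the explicit feature space φ(x) = (x1^2, √2 x1 x2, x2^2), but never constructs φ. (This kernel and feature map are standard textbook examples, not taken from these slides.)

```python
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel on 2-D inputs."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x, z):
    """Quadratic kernel evaluated directly in the input space."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(phi(x) @ phi(z))   # 1.0  (explicit mapping)
print(k(x, z))           # 1.0  (kernel trick, no mapping needed)
```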

11 Kernel Methods (4)
Kernel methods exploit information about the inner products between data items. We can construct kernels indirectly by choosing a feature-space mapping φ, or directly by choosing a valid kernel function. If a poor kernel function is chosen, it will map the data to a space with many irrelevant features, so we need some prior knowledge of the target.

12 Kernel Methods (5)
Two basic modules for kernel methods:
- a general-purpose learning model
- a problem-specific kernel function

13 Kernel Methods (6)
Limitation: the kernel function k(x_n, x_m) must be evaluated for all possible pairs x_n and x_m of training points when making predictions for new data points. A sparse kernel machine makes predictions using only a subset of the training data points.
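The cost of this limitation is easy to see in code: a non-sparse kernel method needs the full N x N Gram matrix, which grows quadratically with the number of training points. A toy sketch with invented data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                       # 500 training points (toy data)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2.0)                         # Gaussian kernel, sigma = 1
print(K.shape)                                      # (500, 500): storage grows as O(N^2)
```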

14 Outline
- Introduction to kernel methods
- Support vector machines (SVM)
- Relevance vector machines (RVM)
- Applications
- Conclusions

15 Support Vector Machines (1)
Support vector machines are a system for efficiently training linear machines in kernel-induced feature spaces, while respecting the insights of generalization theory and exploiting optimization theory. Generalization theory describes how to control a learning machine to prevent it from overfitting.

16 Support Vector Machines (2)
To avoid overfitting, SVMs modify the error function to a "regularized form" E(w) = E_D(w) + λ E_W(w), where the hyperparameter λ balances the trade-off between the data-fit term E_D and the regularizer E_W. The aim of E_W is to restrict the estimated functions to smooth functions. As a side effect, SVMs obtain a sparse model.
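A minimal sketch of such a regularized error, assuming a sum-of-squares data term and the quadratic penalty E_W(w) = ||w||^2 / 2; the slide does not spell these out, so both choices are illustrative:

```python
import numpy as np

def regularized_error(w, Phi, t, lam=0.1):
    """E(w) = E_D(w) + lambda * E_W(w) for a linear model y = Phi @ w."""
    E_D = 0.5 * np.sum((Phi @ w - t) ** 2)   # data-fit term
    E_W = 0.5 * np.sum(w ** 2)               # penalizes large weights, i.e. non-smooth solutions
    return E_D + lam * E_W
```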

17 Support Vector Machines (3)
Fig. 1: Architecture of the SVM.

18 SVM for Classification (1)
The mechanism that prevents overfitting in classification is the "maximum margin classifier". The SVM is fundamentally a two-class classifier.

19 Maximum Margin Classifiers (1)
The aim of classification is to find a (D-1)-dimensional hyperplane that separates the data in a D-dimensional space. A 2-D example follows.

20 Maximum Margin Classifiers (2)
Figure: the margin is the distance from the decision boundary to the closest training points; those closest points are the support vectors.

21 Maximum Margin Classifiers (3)
Figure: comparison of a decision boundary with a small margin and one with a large margin.

22 Maximum Margin Classifiers (4)
- Intuitively it is a "robust" solution: if we have made a small error in the location of the boundary, this gives us the least chance of causing a misclassification.
- The concept of maximum margin is usually justified using Vapnik's statistical learning theory.
- Empirically it works well.

23 SVM for Classification (2)
After the optimization process, we obtain the prediction model y(x) = Σ_n a_n t_n k(x, x_n) + b, where (x_n, t_n) are the N training data. We find that a_n is zero except for the support vectors, so the model is sparse.
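A minimal scikit-learn sketch (not code from the slides) that fits a Gaussian-kernel SVM and inspects how few training points become support vectors; only those points enter the prediction sum above. The data set is an invented toy example:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, t = make_blobs(n_samples=200, centers=2, random_state=0)   # toy two-class data
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, t)       # Gaussian-kernel SVM

print(len(clf.support_), "of", len(X), "points are support vectors")
print(clf.dual_coef_.shape)        # nonzero a_n * t_n, one per support vector
print(clf.predict(X[:5]))          # predictions use only the support vectors
```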

24 SVM for Classification (3)
Fig. 2: Data from two classes in two dimensions, showing contours of constant y(x) obtained from an SVM with a Gaussian kernel function.

25 SVM for Classification (4)
For overlapping class distributions, SVMs allow some of the training points to be misclassified, at the cost of a penalty on the margin violations: this is the soft margin.

26 SVM for Classification (5)
For multiclass problems, there are several ways to combine multiple two-class SVMs (a sketch follows below):
- one versus the rest
- one versus one (requires more training time)
Fig. 3: Problems in multiclass classification using multiple SVMs.
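A sketch of the two combination schemes using scikit-learn's generic wrappers; the iris data set below is a stand-in, since the slides do not specify one:

```python
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.datasets import load_iris

X, t = load_iris(return_X_y=True)                             # 3-class problem
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, t)        # one binary SVM per class
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, t)         # one binary SVM per pair of classes
print(ovr.predict(X[:3]), ovo.predict(X[:3]))
```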

27 SVM for Regression (1)
For regression problems, the mechanism that prevents overfitting is the "ε-insensitive error function": errors smaller than ε are ignored, unlike the quadratic error function, which penalizes every deviation.
Figure: comparison of the quadratic error function and the ε-insensitive error function.

28 SVM for Regression (2)
Fig. 4: The ε-tube. Points inside the tube incur no error; points outside incur error |y(x) - t| - ε.

29 SVM for Regression (3)
After the optimization process, we obtain the prediction model y(x) = Σ_n (a_n - â_n) k(x, x_n) + b. We find that the coefficients are zero except for the support vectors, so the model is sparse.
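A minimal scikit-learn sketch of ε-insensitive SVM regression (the data and hyperparameters are illustrative); again only a subset of the training points end up as support vectors:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))[:, None]                   # toy inputs
t = np.sin(2 * np.pi * x).ravel() + 0.1 * rng.normal(size=100) # noisy targets

reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(x, t)
print(len(reg.support_), "of", len(x), "points are support vectors")  # sparse solution
print(reg.predict([[0.25]]))   # prediction near sin(pi/2) = 1
```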

30 SVM for Regression (4)
Fig. 5: Regression results. Support vectors lie on the boundary of the ε-tube or outside it.

31 Disadvantages
- The solution is not sparse enough: the number of support vectors required typically grows linearly with the size of the training set.
- Predictions are not probabilistic.
- The error/margin trade-off parameters must be estimated by cross-validation, which wastes computation.
- Kernel functions are limited (they must be valid, positive-definite kernels).
- Multiclass classification problems require combining several binary SVMs.

32 Outline
- Introduction to kernel methods
- Support vector machines (SVM)
- Relevance vector machines (RVM)
- Applications
- Conclusions

33 Relevance Vector Machines (1)
The relevance vector machine (RVM) is a Bayesian sparse kernel technique that shares many of the characteristics of the SVM whilst avoiding its principal limitations. The RVM is based on a Bayesian formulation and provides posterior probabilistic outputs, as well as having much sparser solutions than the SVM.

34 Relevance Vector Machines (2)
The RVM is intended to mirror the structure of the SVM and uses a Bayesian treatment to remove the limitations of the SVM; the kernel functions are simply treated as basis functions, rather than as dot products in some feature space.

35 Bayesian Inference
Bayesian inference allows one to model uncertainty about the world and outcomes of interest by combining common-sense knowledge and observational evidence.

36 Relevance Vector Machines (3)
In the Bayesian framework, we use a prior distribution over w to avoid overfitting, p(w | α) = N(w | 0, α^-1 I), where α is a hyperparameter that controls the model parameters w.

37 Relevance Vector Machines (4)
Goal: find the most probable α* and β* in order to compute the predictive distribution over t_new for a new input x_new, i.e. p(t_new | x_new, X, t, α*, β*), where X and t are the training data and their target values. The values α* and β* are obtained by maximizing the marginal likelihood p(t | X, α, β).

38 Relevance Vector Machines (5)
The RVM uses "automatic relevance determination" to achieve sparsity: the prior becomes p(w | α) = Π_m N(w_m | 0, α_m^-1), where α_m represents the precision of w_m. In the procedure of finding α_m*, some α_m become infinite, which drives the corresponding w_m to zero; the remaining weights correspond to the relevance vectors.
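A compact sketch of the resulting sparse Bayesian learning loop for RVM regression, following the standard re-estimation equations (Tipping, 2001). This is a simplified illustration, not a production implementation; Phi is the N x M design matrix of kernel/basis functions evaluated at the training inputs, and the toy data are invented for the example:

```python
import numpy as np

def rvm_regression(Phi, t, n_iter=100, alpha_max=1e9):
    """Simplified sparse Bayesian learning for regression (illustrative sketch)."""
    N, M = Phi.shape
    alpha = np.ones(M)          # one precision hyperparameter per weight
    beta = 1.0                  # noise precision
    for _ in range(n_iter):
        A = np.diag(alpha)
        Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)        # posterior covariance
        m = beta * Sigma @ Phi.T @ t                         # posterior mean
        gamma = 1.0 - alpha * np.diag(Sigma)                 # "well-determinedness" of each weight
        alpha = np.minimum(gamma / (m ** 2 + 1e-12), alpha_max)   # re-estimate each alpha_m
        beta = (N - gamma.sum()) / (np.sum((t - Phi @ m) ** 2) + 1e-12)
    relevant = alpha < alpha_max    # weights whose alpha diverged are pruned
    return m, relevant

# Toy usage: Gaussian basis functions centred on the training inputs.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=50)
Phi = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.1 ** 2))
m, relevant = rvm_regression(Phi, t)
print(relevant.sum(), "of", len(x), "basis functions kept (relevance vectors)")
```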

39 Comparisons - Regression
Figure: SVM regression versus RVM regression (the RVM plot shows one standard deviation of the predictive distribution).

40 Comparisons - Regression
Figure: further regression comparison results.

41 Comparison - Classification
Figure: classification results for the RVM and the SVM.

42 Comparison - Classification
Figure: further classification comparison results.

43 Comparisons
- RVMs are much sparser and make probabilistic predictions.
- The RVM gives better generalization in regression.
- The SVM gives better generalization in classification.
- The RVM is computationally demanding during learning.

44 Outline
- Introduction to kernel methods
- Support vector machines (SVM)
- Relevance vector machines (RVM)
- Applications
- Conclusions

45 Applications (1)
SVM for face detection.

46 Applications (2)
Marti Hearst, "Support Vector Machines", 1998.

47 Applications (3)
In feature-matching-based object tracking, SVMs are used to detect false feature matches.
Weiyu Zhu et al., "Tracking of Object with SVM Regression", 2001.

48 Applications (4)
Recovering 3D human poses with the RVM.
A. Agarwal and B. Triggs, "3D Human Pose from Silhouettes by Relevance Vector Regression", 2004.

49 Outline
- Introduction to kernel methods
- Support vector machines (SVM)
- Relevance vector machines (RVM)
- Applications
- Conclusions

50 Conclusions
The SVM is a learning machine based on kernel methods and generalization theory which can perform binary classification and real-valued function approximation tasks. The RVM has the same functional form as the SVM but provides probabilistic predictions and sparser solutions.

51 References
- www.support-vector.net
- N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000.
- M. E. Tipping, "Sparse Bayesian Learning and the Relevance Vector Machine", Journal of Machine Learning Research, 2001.

52 Underfitting and Overfitting
Figure: underfitting (model too simple) versus overfitting (model too complex), illustrated on new data. Adapted from http://www.dtreg.com/svm.htm.

