Some Aspects of Bayesian Approach to Model Selection. Vetrov Dmitry, Dorodnicyn Computing Centre of RAS, Moscow.

Our research team. My colleague: Dmitry Kropotov, PhD student at MSU. Students: Nikita Ptashko, Pavel Tolpegin, Igor Tolstov.

Overview: Problem formulation; Ways of solution; Bayesian paradigm; Bayesian regularization of kernel classifiers.

Quality vs. Reliability. A general problem: what means should we use for solving a task? Either sophisticated and complex but accurate, or simple but reliable? A trade-off between quality and reliability is needed.

Machine learning interpretation

Regularization. The easiest way to establish a compromise is to regularize the criterion function using some heuristic regularizer. The general problem is HOW to express accuracy and reliability in the same terms; in other words, how should the regularization coefficient be defined?
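As a sketch of the kind of criterion meant here (the symbols Q, R, and λ are our notation, not taken from the slides):

\[
Q_{\mathrm{reg}}(w) \;=\; Q_{\mathrm{train}}(w) \;+\; \lambda\, R(w),
\]

where \(Q_{\mathrm{train}}\) measures accuracy on the training sample, \(R\) is the heuristic regularizer penalizing complexity, and the open question of the slide is how to choose the coefficient \(\lambda\).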

General ways of compromise I. Structural Risk Minimization (SRM) penalizes the flexibility of classifiers, expressed via the VC-dimension of the given classifier family. Drawback: the VC-dimension is very difficult to compute and its estimates are too rough; the resulting upper bound on the test error is too loose and often exceeds 1.

General ways of compromise II. Minimum Description Length (MDL) penalizes the algorithmic complexity of the classifier. The classifier is considered as a coding algorithm: we encode both the training data and the algorithm itself, trying to minimize the total description length.

Important aspect. All the described schemes penalize the flexibility or complexity of the classifier, but is that what we really need? "A complex classifier does not always mean a bad classifier." (Ludmila Kuncheva, private communication)

Maximum likelihood principle. The well-known maximum likelihood principle states that we should select the classifier with the largest likelihood (i.e. accuracy on the training sample).
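In symbols (our notation), the maximum likelihood choice is

\[
w_{\mathrm{ML}} = \arg\max_{w}\; p(D \mid w),
\]

i.e. the parameters that best fit the training sample \(D\), with no penalty at all for the complexity or instability of the resulting classifier.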

Bayesian view. Bayes' rule expresses the posterior over the parameters as likelihood times prior divided by evidence:

\[
p(w \mid D, \theta) = \frac{\overbrace{p(D \mid w)}^{\text{likelihood}}\; \overbrace{p(w \mid \theta)}^{\text{prior}}}{\underbrace{p(D \mid \theta)}_{\text{evidence}}}.
\]

Model selection. Suppose we have different classifier families and want to know which family is better without performing computationally expensive cross-validation. This problem is also known as the model selection task.

Bayesian framework I. Find the best model, i.e. the optimal value of the hyperparameter \(\theta\):

\[
\hat{\theta} = \arg\max_{\theta}\, p(\theta \mid D) = \arg\max_{\theta}\, p(D \mid \theta)\, p(\theta).
\]

If all models are equally likely a priori, then \(\hat{\theta} = \arg\max_{\theta}\, p(D \mid \theta)\). Note that it is exactly the evidence \(p(D \mid \theta)\) which should be maximized to find the best model.

Bayesian framework II. Now compute the posterior parameter distribution…

\[
p(w \mid D, \hat{\theta}) = \frac{p(D \mid w)\, p(w \mid \hat{\theta})}{p(D \mid \hat{\theta})},
\]

… and the final likelihood of test data:

\[
p(t_{\mathrm{new}} \mid x_{\mathrm{new}}, D) = \int p(t_{\mathrm{new}} \mid x_{\mathrm{new}}, w)\, p(w \mid D, \hat{\theta})\, dw.
\]

Why do we need model selection? The answer is simple: many classifiers (e.g. neural networks or support vector machines) require some additional parameters to be set by the user before training starts. IDEA: these parameters can be viewed as model hyperparameters, and the Bayesian framework can be applied to select their best values.

What is evidence? The red model has a larger likelihood, but the green model has better evidence. It is more stable, and we may hope for better generalization.

Support vector machines. The separating surface is defined as a linear combination of kernel functions. The weights are determined by solving a QP optimization problem.
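Written out in the standard form (not verbatim from the slides), the separating surface is

\[
f(x) = \sum_{i=1}^{N} w_i\, K(x_i, x) + b,
\]

and the weights \(w_i\) are obtained by solving the maximum-margin quadratic programming problem with regularization coefficient \(C\); only the examples with nonzero \(w_i\) (the support vectors) enter the sum.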

Bottlenecks of SVM. SVM proved to be one of the best classifiers due to the use of the maximal margin principle and the kernel trick. BUT… how do we define the best kernel for a particular task and the regularization coefficient C? Bad kernels may lead to very poor performance due to overfitting or undertraining.

Relevance Vector Machines. A probabilistic approach to kernel models: the weights are interpreted as random variables with a Gaussian prior distribution, and the maximal evidence principle is used to select the best values of the prior hyperparameters. Most of them tend to infinity; hence the corresponding weights become exactly zero, which makes the classifier quite sparse.
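The Gaussian prior mentioned here is the usual ARD prior of the RVM (standard form, our notation):

\[
p(w \mid \alpha) = \prod_{i} \mathcal{N}\!\left(w_i \mid 0, \alpha_i^{-1}\right),
\]

where each precision \(\alpha_i\) is a hyperparameter tuned by evidence maximization; \(\alpha_i \to \infty\) forces \(w_i \to 0\) and prunes the corresponding kernel from the model.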

Sparseness of RVM (figure: SVM with C=10 vs. RVM).

Numerical implementation of RVM. We use the Laplace approximation to avoid integration: the product of likelihood and prior is approximated by a Gaussian centred at the most probable weights \(w_{\mathrm{MP}}\),

\[
p(D \mid w)\, p(w \mid \alpha) \;\approx\; p(D \mid w_{\mathrm{MP}})\, p(w_{\mathrm{MP}} \mid \alpha)\, \exp\!\left(-\tfrac{1}{2}(w - w_{\mathrm{MP}})^{\top} H\, (w - w_{\mathrm{MP}})\right),
\]

where \(H\) is the Hessian of the negative log-posterior at \(w_{\mathrm{MP}}\). Then the evidence can be computed analytically, and iterative optimization of the hyperparameters \(\alpha\) becomes possible.
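A minimal Python sketch of this training loop, assuming a precomputed kernel design matrix and binary labels; this is our own illustration of the standard RVM scheme (Laplace approximation plus MacKay-style hyperparameter updates), not code from the talk:

```python
import numpy as np

def rvm_train(K, t, n_outer=50, n_newton=25):
    """Sketch of RVM classification training with the Laplace approximation.

    K : (N, M) design matrix of kernel values, t : (N,) labels in {0, 1}.
    Returns the MAP weights w and the prior precisions alpha.
    """
    M = K.shape[1]
    alpha = np.ones(M)                # one precision hyperparameter per weight
    w = np.zeros(M)
    for _ in range(n_outer):
        # Inner loop: find the most probable weights w_MP for fixed alpha (Newton/IRLS).
        for _ in range(n_newton):
            y = 1.0 / (1.0 + np.exp(-K @ w))              # sigmoid outputs
            g = K.T @ (t - y) - alpha * w                 # gradient of the log-posterior
            B = y * (1.0 - y)                             # Bernoulli output variances
            H = K.T @ (K * B[:, None]) + np.diag(alpha)   # Hessian of the negative log-posterior
            w = w + np.linalg.solve(H, g)
        # Outer loop: re-estimate alpha from the Laplace (Gaussian) approximation.
        Sigma = np.linalg.inv(H)                          # approximate posterior covariance
        gamma = 1.0 - alpha * np.diag(Sigma)              # how well-determined each weight is
        alpha = gamma / (w ** 2 + 1e-12)                  # MacKay-style update
        alpha = np.minimum(alpha, 1e12)                   # weights with huge alpha are effectively pruned
    return w, alpha
```

In practice the columns whose \(\alpha_i\) exceed a threshold would also be dropped from K between iterations, which is what produces the sparse classifier.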

Evidence interpretation. Then the evidence is given by

\[
p(D \mid \alpha) \;\approx\; p(D \mid w_{\mathrm{MP}})\, p(w_{\mathrm{MP}} \mid \alpha)\, (2\pi)^{d/2}\, \left|H\right|^{-1/2},
\]

where \(d\) is the number of weights, but… this is exactly STABILITY with respect to weight changes! The larger the Hessian, the smaller the evidence.
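Taking logarithms (same notation as above) makes the trade-off explicit:

\[
\log p(D \mid \alpha) \;\approx\; \log p(D \mid w_{\mathrm{MP}}) + \log p(w_{\mathrm{MP}} \mid \alpha) - \tfrac{1}{2}\log\det H + \tfrac{d}{2}\log 2\pi,
\]

so high evidence requires both a good fit at \(w_{\mathrm{MP}}\) and a small Hessian, i.e. a log-posterior that stays flat when the weights are perturbed.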

Kernel selection. IDEA: use the same technique for kernel determination, e.g. for finding the best width of the Gaussian kernel.
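The hyperparameter in question is the width \(\sigma\) of the Gaussian (RBF) kernel (standard definition, not quoted from the slides):

\[
K_\sigma(x, x') = \exp\!\left(-\frac{\lVert x - x' \rVert^2}{2\sigma^2}\right),
\]

so the idea is to choose \(\sigma\) by maximizing the same evidence \(p(D \mid \alpha, \sigma)\) instead of by cross-validation.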

A sudden problem. It turned out that narrow Gaussians are more stable with respect to weight changes.

Solution. We allow the centres of the kernels to be located at arbitrary points (relevant points). The trade-off between narrow Gaussians (high accuracy on the training set) and wide Gaussians (stable answers) can finally be found. The resulting classifier turned out to be even sparser than the RVM!
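Schematically (our notation, reconstructing what the slide describes), the resulting classifier has the form

\[
f(x) = \sum_{i} w_i \exp\!\left(-\frac{\lVert x - c_i \rVert^2}{2\sigma^2}\right) + b,
\]

where both the width \(\sigma\) and the centres \(c_i\) (the relevant points) are chosen by evidence maximization, which lets the method balance narrow and wide kernels.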

Sparseness of GRVM (figure: RVM vs. GRVM solutions).

Some experimental results (table comparing errors and numbers of kernels for RVM LOO, SVM LOO, and RVM ME on the Australian, Bupa, Hepatitis, Pima, and Credit datasets).

Future work: develop quick optimization procedures; optimize the kernel width and the weight hyperparameters simultaneously during evidence maximization; use a different width for different features to get more sophisticated kernels; apply this approach to polynomial kernels; apply this approach to regression tasks.

Thank you! Contact information: