Some Aspects of Bayesian Approach to Model Selection
Dmitry Vetrov
Dorodnicyn Computing Centre of RAS, Moscow

Our research team
Colleague: Dmitry Kropotov, PhD student at MSU
Students: Nikita Ptashko, Pavel Tolpegin, Igor Tolstov

Overview
- Problem formulation
- Ways of solution
- Bayesian paradigm
- Bayesian regularization of kernel classifiers

Quality vs. Reliability
A general problem: what means should we use to solve a task? Sophisticated and complex but accurate, or simple but reliable? A trade-off between quality and reliability is needed.

Machine learning interpretation

Regularization
The easiest way to establish a compromise is to regularize the criterion function with some heuristic regularizer, i.e. to minimize Q(w) + λR(w), where Q measures the error on the training sample and R penalizes complexity. The general problem is HOW to express accuracy and reliability in the same terms; in other words, how do we define the regularization coefficient λ?

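To make the compromise concrete, here is a minimal sketch (an illustration, not part of the original deck) of a regularized least-squares criterion, with a squared-norm penalty standing in for the heuristic regularizer; all names are illustrative:

```python
import numpy as np

def regularized_criterion(w, X, y, lam):
    """Training error plus a heuristic complexity penalty.

    lam is the regularization coefficient: it trades accuracy on the
    training sample against the simplicity (reliability) of the model.
    How to choose lam is exactly the open question above.
    """
    residuals = X @ w - y
    accuracy_term = np.sum(residuals ** 2)   # quality: fit to the data
    stability_term = lam * np.sum(w ** 2)    # reliability: keep weights small
    return accuracy_term + stability_term
```
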
General ways of compromise I
Structural Risk Minimization (SRM) penalizes the flexibility of classifiers, expressed through the VC-dimension of the given classifier family. Drawback: the VC-dimension is very difficult to compute and its estimates are too rough; the resulting upper bound on the test error is too loose and often exceeds 1.

General ways of compromise II
Minimum Description Length (MDL) penalizes the algorithmic complexity of the classifier. The classifier is treated as a coding algorithm: we encode both the training data and the algorithm itself, trying to minimize the total description length.

Important aspect
All the described schemes penalize the flexibility or complexity of the classifier, but is that what we really need?
“A complex classifier does not always mean a bad classifier.” (Ludmila Kuncheva, private communication)

Maximum likelihood principle
The well-known maximum likelihood principle states that we should select the classifier with the largest likelihood (i.e. accuracy on the training sample).

Bayesian view
Bayes' rule combines the likelihood of the data and the prior over the parameters, normalized by the evidence:
p(w | D, α) = p(D | w) p(w | α) / p(D | α)
(likelihood × prior / evidence)

Model selection
Suppose we have different classifier families and want to know which family is better without performing computationally expensive cross-validation. This problem is also known as the model selection task.

Bayesian framework I
Find the best model, i.e. the optimal value of the hyperparameter α. If all models are equally likely a priori, then
p(α | D) ∝ p(D | α),
so note that it is exactly the evidence p(D | α) which should be maximized to find the best model.

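As an illustration of evidence maximization (a sketch, not part of the original deck), here is the case of Bayesian linear regression, where the evidence p(D | α) has a standard closed form (see e.g. Bishop's PRML, ch. 3); the prior precision alpha plays the role of the hyperparameter, and the function names and toy data are assumptions of this sketch:

```python
import numpy as np

def log_evidence(X, y, alpha, beta):
    """Log evidence p(y | alpha) for Bayesian linear regression with
    prior w ~ N(0, alpha^{-1} I) and Gaussian noise of precision beta."""
    n, d = X.shape
    A = alpha * np.eye(d) + beta * X.T @ X   # posterior precision matrix
    m = beta * np.linalg.solve(A, X.T @ y)   # posterior mean of the weights
    E = beta / 2 * np.sum((y - X @ m) ** 2) + alpha / 2 * m @ m
    return (d / 2 * np.log(alpha) + n / 2 * np.log(beta) - E
            - np.linalg.slogdet(A)[1] / 2 - n / 2 * np.log(2 * np.pi))

# choose the model (value of alpha) with maximal evidence -- no cross-validation
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=50)
best_alpha = max(np.logspace(-3, 3, 25),
                 key=lambda a: log_evidence(X, y, a, beta=100.0))
```
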
Bayesian framework II
Now compute the posterior parameter distribution
p(w | D, α*) = p(D | w) p(w | α*) / p(D | α*)
…and the final likelihood of the test data
p(t | x, D) = ∫ p(t | x, w) p(w | D, α*) dw

Why do we need model selection?
The answer is simple: many classifiers (e.g. neural networks or support vector machines) require additional parameters to be set by the user before training starts. IDEA: these parameters can be viewed as model hyperparameters, and the Bayesian framework can be applied to select their best values.

What is evidence?
[Figure: two models fitted to the same data.] The red model has a larger likelihood, but the green model has better evidence: it is more stable, and we may hope for better generalization.

Support vector machines
The separating surface is defined as a linear combination of kernel functions:
f(x) = Σᵢ wᵢ K(x, xᵢ) + b
The weights are determined by solving a QP optimization problem.

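A small sketch of such a decision rule with a Gaussian kernel (an illustration, not part of the deck; the QP that produces the weights is delegated to any SVM solver and not shown, and all names are illustrative):

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def decision_function(x, support_vectors, weights, b, sigma=1.0):
    """f(x) = sum_i w_i K(x, x_i) + b; sign(f) gives the predicted class.
    weights and b are assumed to come from the QP solver."""
    return sum(w * rbf_kernel(x, sv, sigma)
               for w, sv in zip(weights, support_vectors)) + b
```
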
Bottlenecks of SVM
SVM proved to be one of the best classifiers due to the maximal-margin principle and the kernel trick. BUT… how do we choose the best kernel for a particular task, and the regularization coefficient C? Bad kernels may lead to very poor performance due to overfitting or underfitting.

Relevance Vector Machines
A probabilistic approach to kernel models. The weights are interpreted as random variables with a Gaussian prior distribution, p(wᵢ | αᵢ) = N(wᵢ | 0, αᵢ⁻¹). The maximal evidence principle is used to select the best values of the hyperparameters αᵢ. Most of them tend to infinity; hence the corresponding weights take zero values, which makes the classifier quite sparse.

Sparseness of RVM
[Figure: decision surfaces of SVM (C=10) and RVM on the same data.]

Numerical implementation of RVM
We use the Laplace approximation to avoid integration: the posterior over the weights is replaced by a Gaussian centred at the most probable weights w_MP, where H = −∇∇ log p(w | D, α) at w_MP is the Hessian at that point. The evidence can then be computed analytically,
p(D | α) ≈ p(D | w_MP) p(w_MP | α) (2π)^(d/2) |H|^(−1/2),
and iterative optimization of α becomes possible.

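For classification the deck uses the Laplace approximation; as a runnable illustration (not part of the deck), here is a sketch of the evidence-driven re-estimation of α in the simpler regression case, where no approximation is needed. The γ/μ² update is Tipping's (2001); all names are illustrative:

```python
import numpy as np

def rvm_regression(Phi, t, beta=100.0, n_iter=50, cap=1e6):
    """Iterative evidence maximization for the RVM (regression case).

    Phi: n x d design matrix of kernel columns; t: targets;
    beta: assumed-known noise precision.  Each alpha_i is re-estimated
    as gamma_i / mu_i^2; alphas driven to infinity switch their
    kernels off, which is the source of sparsity.
    """
    n, d = Phi.shape
    alpha = np.ones(d)
    for _ in range(n_iter):
        A = np.diag(alpha) + beta * Phi.T @ Phi   # posterior precision
        Sigma = np.linalg.inv(A)                  # posterior covariance
        mu = beta * Sigma @ Phi.T @ t             # posterior mean
        gamma = 1 - alpha * np.diag(Sigma)        # "well-determined" measure
        alpha = gamma / (mu ** 2 + 1e-12)         # evidence-based update
        alpha = np.minimum(alpha, cap)            # capped alpha = pruned kernel
    return mu, alpha
```
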
Evidence interpretation
From the formula above, the evidence is proportional to |H|^(−1/2): the larger the Hessian at the optimum, the smaller the evidence. But this is exactly STABILITY with respect to weight changes!

Kernel selection
IDEA: use the same technique for kernel determination, e.g. for finding the best width of a Gaussian kernel (see the sketch below).

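A minimal sketch of this idea (an illustration, not part of the deck), assuming a Bayesian kernel regression model whose evidence is available in the same closed form as in the earlier sketch; the width with maximal evidence is selected:

```python
import numpy as np

def rbf_design(X, centres, sigma):
    """Design matrix Phi[i, j] = exp(-||x_i - c_j||^2 / (2 sigma^2))."""
    d2 = np.sum((X[:, None, :] - centres[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def log_evidence(Phi, t, alpha=1.0, beta=25.0):
    """Closed-form log evidence of Bayesian linear regression on kernel features."""
    n, d = Phi.shape
    A = alpha * np.eye(d) + beta * Phi.T @ Phi
    m = beta * np.linalg.solve(A, Phi.T @ t)
    E = beta / 2 * np.sum((t - Phi @ m) ** 2) + alpha / 2 * m @ m
    return (d / 2 * np.log(alpha) + n / 2 * np.log(beta) - E
            - np.linalg.slogdet(A)[1] / 2 - n / 2 * np.log(2 * np.pi))

# sweep candidate kernel widths and keep the one with maximal evidence
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(40, 1))
t = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
best_sigma = max(np.logspace(-1, 1, 15),
                 key=lambda s: log_evidence(rbf_design(X, X, s), t))
```
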
Sudden problem
It turned out that narrow Gaussians are more stable with respect to weight changes, so naive evidence maximization favours them.

Solution
We allow the centres of the kernels to be located at arbitrary points (relevant points). The trade-off between narrow Gaussians (high accuracy on the training set) and wide Gaussians (stable answers) can then finally be found. The resulting classifier turned out to be even sparser than the RVM!

Sparseness of GRVM
[Figure: decision surfaces of RVM and GRVM on the same data.]

Some experimental results

| Dataset    | Error (%): RVM LOO | Error (%): SVM LOO | Error (%): RVM ME | Kernels: RVM LOO | Kernels: SVM LOO | Kernels: RVM ME |
|------------|--------------------|--------------------|-------------------|------------------|------------------|-----------------|
| Australian | 14.9               | 11.54              | 10.58             | 37               | 188              | 19              |
| Bupa       | 25                 | 26.92              | 21.15             | 6                | 179              | 7               |
| Hepatitis  | 36.17              | 31.91              | —                 | 34               | 102              | 11              |
| Pima       | 22.08              | 21.65              | 21.21             | 29               | 309              | 13              |
| Credit     | 16.35              | 15.38              | 15.87             | 57               | 217              | 36              |

LOO = hyperparameters selected by leave-one-out cross-validation; ME = selected by maximal evidence.

Future work
- Develop quick optimization procedures
- Optimize the hyperparameters and the kernel widths simultaneously during evidence maximization
- Use different widths for different features to get more sophisticated kernels
- Apply this approach to polynomial kernels
- Apply this approach to regression tasks

Thank you!
Contact information: VetrovD@yandex.ru, DKropotov@yandex.ru
http://vetrovd.narod.ru