Learning Kernel Classifiers
Chap. 3.3 Relevance Vector Machine / Chap. 3.4 Bayes Point Machines
Summarized by Sang Kyun Lee, 13th May 2005
3.3 Relevance Vector Machine
● [M. Tipping, JMLR 2001]
● A modification of the Gaussian process (GP) model
  – GP
    ● Prior: w ~ Normal(0, I)
    ● Likelihood: t | x, w ~ Normal(⟨x, w⟩, σ_t²)
    ● Posterior: Gaussian, Normal(μ, Σ)
  – RVM
    ● Prior: w ~ Normal(0, Θ) with Θ = diag(θ_1, ..., θ_n)
    ● Likelihood: same as GP
    ● Posterior: again Gaussian, now with Σ = (σ_t^-2 X^T X + Θ^-1)^-1 and μ = σ_t^-2 Σ X^T t
● Reasons
  – To get a sparse representation of the weight vector w
  – The expected risk of the classifier can be bounded in terms of the number of non-zero coefficients of w
  ● Thus, we favor weight vectors with a small number of non-zero coefficients
  – One way to achieve this is to modify the prior: w ~ Normal(0, Θ), Θ = diag(θ_1, ..., θ_n)
  – Consider θ_i → 0
    ● Then w_i = 0 is the only possible value
    ● Computation of the posterior becomes easier than before, since only the dimensions with θ_i > 0 effectively remain
● Prediction function
  – GP: t̂(x) = k(x)^T (G + σ_t² I)^-1 t, where G is the m×m Gram matrix and k(x)_i = k(x_i, x)
  – RVM: t̂(x) = ⟨x, μ⟩, using the (sparse) posterior mean μ of the weights
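A minimal NumPy sketch of these two predictive means (not code from the book); the RBF kernel, the noise variance, and all names below are assumptions:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gram matrix of an RBF kernel between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def gp_predict(X, t, X_new, noise_var=0.1):
    # GP regression mean: k(x)^T (G + sigma_t^2 I)^{-1} t
    G = rbf_kernel(X, X)
    alpha = np.linalg.solve(G + noise_var * np.eye(len(X)), t)
    return rbf_kernel(X_new, X) @ alpha

def rvm_predict(X, t, X_new, theta, noise_var=0.1):
    # RVM mean in the kernel basis, with diagonal prior Theta = diag(theta)
    Phi = rbf_kernel(X, X)                      # design matrix of kernel basis functions
    Sigma = np.linalg.inv(Phi.T @ Phi / noise_var + np.diag(1.0 / theta))
    mu = Sigma @ Phi.T @ t / noise_var          # posterior mean of the expansion coefficients
    return rbf_kernel(X_new, X) @ mu
```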
● How can we learn the sparse vector θ?
  – To find the best θ, employ evidence maximization
  – The evidence is given explicitly by P(t | X, θ, σ_t²) = Normal(0, X Θ X^T + σ_t² I)
  – Derived update rules (App'x B.8) re-estimate θ and σ_t² iteratively
● Evidence Maximization
  – Interestingly, many of the θ_i decrease quickly toward zero, which leads to high sparsity in the weight vector w
  – For faster convergence, delete the ith column from the design matrix whenever θ_i falls below a pre-defined threshold
  – After termination, set w_i = 0 for all i with θ_i below the threshold; the remaining w_i are set equal to the corresponding values of the posterior mean μ (see the sketch below)
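A simplified sketch of this re-estimation loop with pruning. It uses the standard update rules from Tipping (2001) rather than the book's Appendix B.8 derivation, and the function name, initialization, and tolerance are assumptions:

```python
import numpy as np

def rvm_evidence_maximization(Phi, t, noise_var=0.1, n_iter=200, prune_tol=1e-6):
    """Sketch of RVM evidence maximization with pruning of collapsed theta_i.

    Phi: (m, n) design matrix (e.g. kernel basis functions), t: (m,) targets.
    Returns the kept column indices, their posterior mean weights, and theta.
    """
    m = Phi.shape[0]
    theta = np.ones(Phi.shape[1])          # prior variances theta_i
    keep = np.arange(Phi.shape[1])         # indices of columns still in the model
    for _ in range(n_iter):
        # Gaussian posterior over the currently kept weights
        Sigma = np.linalg.inv(Phi.T @ Phi / noise_var + np.diag(1.0 / theta))
        mu = Sigma @ Phi.T @ t / noise_var
        # Re-estimate theta_i and the noise variance (Tipping-style updates)
        gamma = 1.0 - np.diag(Sigma) / theta
        theta = mu ** 2 / np.maximum(gamma, 1e-12)
        noise_var = np.sum((t - Phi @ mu) ** 2) / max(m - gamma.sum(), 1e-12)
        # For faster convergence, drop columns whose prior variance has collapsed
        mask = theta > prune_tol
        Phi, theta, keep, mu = Phi[:, mask], theta[mask], keep[mask], mu[mask]
    # Remaining weights are set to the posterior mean; all pruned weights are 0
    return keep, mu, theta
```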
● Application to Classification
  – Consider latent target variables t_i behind the observed classes y_i ∈ {−1, +1}
  – Training objects: (x_1, y_1), ..., (x_m, y_m)
  – Test object: x_{m+1}
  – Compute the predictive distribution of t_{m+1} at the new object
    ● by applying a latent weight vector w to all the m+1 objects
    ● and marginalizing over all w, we get the predictive distribution P(t_{m+1} | x_1, ..., x_{m+1}, y_1, ..., y_m)
  – Note
  – As in the case of GPs, we cannot solve this analytically, because the posterior over the latent variables is no longer Gaussian
  – Laplace approximation: approximate this density by a Gaussian whose mean is the mode of the density and whose covariance is the inverse Hessian (of the negative log density) at that mode
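The Laplace approximation itself is easy to state in code. The sketch below is generic rather than the book's specific derivation: it locates the mode with Newton's method and uses the inverse Hessian there as the covariance; all function and argument names are assumptions.

```python
import numpy as np

def laplace_approximation(grad, hess, x0, n_iter=50, tol=1e-8):
    """Fit a Gaussian N(mode, H^{-1}) to a density given the gradient and Hessian
    of its negative log (grad, hess are callables); x0 is a starting point."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        step = np.linalg.solve(hess(x), grad(x))   # Newton step toward the mode
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    mean = x                                       # mode of the density
    cov = np.linalg.inv(hess(x))                   # inverse Hessian at the mode
    return mean, cov
```

For the classification case above, the negative log density would be that of the latent targets given the observed classes.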
● Kernel trick
  – Think about the RKHS generated by the kernel k
  – Then the ith component of the feature representation of a training object x_j is given by k(x_i, x_j)
  – Now, think about regression: the weights become the expansion coefficients α_i of the desired hyperplane, such that f(x) = Σ_i α_i k(x_i, x)
  – In this sense, all the training objects which have non-zero α_i are termed relevance vectors
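A small illustration of prediction from this expansion, keeping only the relevance vectors; it reuses the hypothetical `keep` and `mu` names from the earlier evidence-maximization sketch and is an assumption, not the book's code:

```python
import numpy as np

def rv_predict(relevance_X, alpha, X_new, kernel):
    # f(x) = sum_i alpha_i k(x_i, x), summing only over the relevance vectors
    return np.array([sum(a * kernel(xi, x) for a, xi in zip(alpha, relevance_X))
                     for x in X_new])

# Hypothetical usage, with `keep` and `mu` from the evidence-maximization sketch:
# y_hat = rv_predict(X[keep], mu, X_test,
#                    kernel=lambda a, b: np.exp(-np.sum((a - b) ** 2)))
```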
3.4 Bayes Point Machines
● [R. Herbrich, JMLR 2000]
● In GPs and RVMs, we tried to solve the classification problem via regression estimation
● Before, we assumed a prior distribution over weights and used the logit transformation to model the likelihood of the classes
● Now we try to model the classification likelihood directly
● Prior
  – For classification, only the spatial direction of the weight vector w matters; note that sign(⟨x, w⟩) = sign(⟨x, λw⟩) for all λ > 0
  – Thus we consider only the vectors on the unit sphere, ||w|| = 1
  – Then assume a uniform prior over this ball-shaped hypothesis space
● Likelihood
  – Use the PAC likelihood, based on the 0-1 loss: it assigns zero likelihood to any weight vector that misclassifies a training example
● Posterior
  – Remark: using the PAC likelihood, the posterior is uniform over version space (the set of weight vectors that classify the whole training sample correctly) and zero elsewhere
● Predictive distribution
  – In the two-class case, the Bayesian decision can be written as Bayes_z(x) = sign( E_{W|z}[ sign(⟨x, W⟩) ] )
    ● That is, the Bayes classification strategy performs majority voting involving all version space classifiers
    ● However, the expectation is hard to compute
    ● Hence we approximate it by a single classifier (see the sketch below)
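A toy contrast between the two strategies, assuming W_samples holds weight vectors drawn (approximately) from the posterior, i.e. uniformly from version space; both function names are assumptions:

```python
import numpy as np

def bayes_vote(W_samples, x):
    # Majority vote of version-space classifiers sign(<x, w>) over posterior samples w
    return np.sign(np.mean(np.sign(W_samples @ x)))

def single_classifier(W_samples, x):
    # Approximation by one classifier: the normalized mean of the samples (the Bayes point)
    w_bp = W_samples.mean(axis=0)
    w_bp /= np.linalg.norm(w_bp)
    return np.sign(w_bp @ x)
```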
  – That is, the Bayes point is the optimal projection of the Bayes classification strategy onto a single classifier w.r.t. generalization error
  – However, this also is intractable, because we would need to know the input distribution as well as the posterior
  – Another reasonable approximation: the center of mass of version space, w_cm = E_{W|z}[W] / ||E_{W|z}[W]||
● Now the Bayes classification of a new object equals the classification w.r.t. the single weight vector w_cm
● Estimate w_cm by MCMC sampling (the 'kernel billiard' algorithm)
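The kernel billiard algorithm itself is beyond a short sketch; as a naive stand-in for illustration only, the centre of mass can be estimated by rejection sampling on the unit sphere (function name and arguments are assumptions, and this is far less efficient than the actual algorithm):

```python
import numpy as np

def estimate_bayes_point(X, y, n_samples=100000, seed=0):
    # Draw directions uniformly from the unit sphere and average those lying in
    # version space (i.e. classifying every training example correctly).
    rng = np.random.default_rng(seed)
    kept = []
    for _ in range(n_samples):
        w = rng.standard_normal(X.shape[1])
        w /= np.linalg.norm(w)                 # uniform direction on the unit sphere
        if np.all(y * (X @ w) > 0):            # w is consistent with the training sample
            kept.append(w)
    if not kept:
        raise RuntimeError("no version-space sample found; increase n_samples")
    w_cm = np.mean(kept, axis=0)               # centre of mass of the sampled version space
    return w_cm / np.linalg.norm(w_cm)
```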