Learning Kernel Classifiers, Chap. 3.3 Relevance Vector Machine and Chap. 3.4 Bayes Point Machines. Summarized by Sang Kyun Lee, 13th May, 2005.

3.3 Relevance Vector Machine
● [M. Tipping, JMLR 2001]
● A modification of the Gaussian process (GP) model:
  – GP: prior, likelihood, posterior
  – RVM: a new (sparsity-inducing) prior; the likelihood is the same as in the GP; the posterior changes accordingly

● Reasons
  – To obtain a sparse representation of the weight vector w
  – To keep the expected risk of the resulting classifier under control
● Thus, we favor weight vectors with a small number of non-zero coefficients.
  – One way to achieve this is to modify the prior to w ~ Normal(0, Θ) with Θ = diag(θ₁, …, θₙ)
  – Consider the limit θᵢ → 0: then wᵢ = 0 is the only possible value (see the sketch below)
  – With this diagonal prior, computation of the posterior is easier than before
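A tiny numpy illustration (my own, not from the book) of why the modified prior induces sparsity: a component whose prior variance θᵢ is driven toward zero can effectively only take the value wᵢ = 0.

```python
import numpy as np

# Toy sketch of the sparsity-inducing prior w ~ Normal(0, Theta),
# Theta = diag(theta_1, ..., theta_n): tiny theta_i pins w_i to zero.
rng = np.random.default_rng(0)

theta = np.array([1.0, 1.0, 1e-12, 1e-12])   # last two components effectively switched off
w_samples = rng.normal(loc=0.0, scale=np.sqrt(theta), size=(5, theta.size))

print(w_samples.round(3))   # columns with tiny theta are numerically zero
```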

● Prediction function
  – GP: the predictive mean is an expansion over all m training objects
  – RVM: only the training objects with non-zero weight contribute to the expansion

● How can we learn the vector θ that induces the sparsity?
  – To find the best θ, employ evidence maximization
  – The evidence is given explicitly by P(t | θ, σ²) = Normal(t; 0, σ²I + XΘXᵀ)
  – Update rules for θ and σ² are derived in Appendix B.8 of the book

● Evidence Maximization
  – Interestingly, many of the θᵢ decrease quickly toward zero, which leads to a high sparsity in w
  – For faster convergence, delete the ith column of the data matrix whenever θᵢ falls below a pre-defined threshold
  – After termination, set wᵢ = 0 for every i whose θᵢ is below the threshold; the remaining wᵢ are set equal to the corresponding components of the posterior mean (a sketch of the updates and pruning follows below)
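The exact update rules live in Appendix B.8, which the slide only points to. As a stand-in, here is a minimal numpy sketch of the standard Tipping-style re-estimation for RVM regression (posterior mean and covariance, update of the prior variances θᵢ and noise variance σ², plus the pruning step described above). The function name and the variance parameterization θᵢ = 1/αᵢ are my assumptions, not the book's notation.

```python
import numpy as np

def rvm_evidence_maximization(Phi, t, n_iter=200, prune_tol=1e-6, sigma2=0.1):
    """Sketch of RVM evidence maximization with pruning (variance parameterization)."""
    n, m = Phi.shape
    keep = np.arange(m)                  # indices of still-active basis functions
    theta = np.ones(m)                   # prior variances of the weights
    for _ in range(n_iter):
        Theta_inv = np.diag(1.0 / theta)
        Sigma = np.linalg.inv(Theta_inv + Phi.T @ Phi / sigma2)   # posterior covariance
        mu = Sigma @ Phi.T @ t / sigma2                           # posterior mean
        gamma = 1.0 - np.diag(Sigma) / theta                      # "well-determinedness" factors
        theta = mu**2 / np.maximum(gamma, 1e-12)                  # update rule for theta_i
        sigma2 = np.sum((t - Phi @ mu) ** 2) / max(n - gamma.sum(), 1e-12)
        active = theta > prune_tol       # prune columns whose theta collapsed toward zero
        Phi, theta, keep, mu = Phi[:, active], theta[active], keep[active], mu[active]
    return keep, mu, sigma2              # surviving (relevance) indices, their weights, noise
```

The training objects whose indices survive the pruning are exactly the relevance vectors referred to later in the slides.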

● Application to Classification
  – Consider latent (real-valued) target variables behind the binary class labels
  – Training objects: (x₁, y₁), …, (x_m, y_m)
  – Test object: x_{m+1}
  – Compute the predictive distribution of the label at the new object,
    ● by applying a latent weight vector to all the m+1 objects
    ● and marginalizing over all latent variables, we get the predictive distribution

– Note: as in the GP classification case, we cannot solve this integral analytically because the posterior is no longer Gaussian
– Laplace approximation: approximate this density by a Gaussian whose mean is the posterior mode and whose covariance is the inverse of the negative Hessian at the mode
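A minimal sketch of that Laplace step, assuming a logistic (sigmoid) likelihood, labels y ∈ {0, 1}, a design matrix X and the diagonal prior Normal(0, diag(θ)); the book's exact link function and notation may differ. The mode is found by Newton's method and the covariance is the inverse of the negative Hessian at the mode.

```python
import numpy as np

def laplace_approximation(X, y, theta, n_iter=50):
    """Gaussian (Laplace) approximation to the non-Gaussian posterior over w."""
    n, m = X.shape
    Theta_inv = np.diag(1.0 / theta)
    w = np.zeros(m)
    for _ in range(n_iter):                       # Newton / IRLS iterations
        p = 1.0 / (1.0 + np.exp(-X @ w))          # sigmoid of the latent values
        grad = X.T @ (y - p) - Theta_inv @ w      # gradient of the log posterior
        W = np.diag(p * (1.0 - p))                # likelihood curvature
        H = X.T @ W @ X + Theta_inv               # negative Hessian at current w
        w = w + np.linalg.solve(H, grad)
    return w, np.linalg.inv(H)                    # mode and covariance of the Gaussian
```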

● Kernel trick
  – Think about the RKHS generated by a kernel k
  – Then the ith training object xᵢ is represented by the function k(xᵢ, ·)
  – Now, think about regression: each weight wᵢ becomes an expansion coefficient of the desired hyperplane, such that f(x) = Σᵢ wᵢ k(xᵢ, x)
  – In this sense, all the training objects with a non-zero expansion coefficient are termed relevance vectors
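A small illustrative sketch of the resulting prediction function: only the relevance vectors (training objects with non-zero expansion coefficients) enter the sum. The RBF kernel and the function names are my illustrative choices, not prescribed by the book.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian RBF kernel between two input vectors (illustrative choice)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def rvm_predict(x_new, relevance_vectors, coeffs, kernel=rbf_kernel):
    """f(x) = sum_i w_i * k(x_i, x), summed over the relevance vectors only."""
    return sum(w * kernel(x_i, x_new) for x_i, w in zip(relevance_vectors, coeffs))
```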

3.4 Bayes Point Machines
● [R. Herbrich et al., JMLR 2001]
● In GPs and RVMs, we tried to solve the classification problem via regression estimation
● Before, we assumed a prior distribution over weight vectors and used logit transformations to model the likelihood
● Now we try to model the classification likelihood directly

● Prior
  – For classification, only the spatial direction of the weight vector w matters; note that sign(⟨w, x⟩) = sign(⟨λw, x⟩) for every λ > 0
  – Thus we consider only the vectors on the unit sphere
  – Then assume a uniform prior over this ball-shaped hypothesis space

● Likelihood
  – Use the PAC likelihood (0-1 loss): a weight vector gets likelihood one iff it classifies every training example correctly
● Posterior
  – Remark: using the PAC likelihood, the posterior is the uniform distribution over version space (the set of weight vectors consistent with the training sample)

● Predictive distribution
  – In the two-class case, the Bayesian decision can be written as Bayes(x) = sign( E_{w|data}[ sign(⟨w, x⟩) ] )
    ● That is, the Bayes classification strategy performs majority voting over all version-space classifiers
    ● However, the expectation is hard to compute
    ● Hence we approximate it by a single classifier (see the sketch below)
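A toy sketch of the two decision rules being contrasted here, assuming a matrix whose rows are (hypothetically obtained) version-space weight vectors: the exact Bayes decision is a majority vote, and the approximation replaces it with a single representative weight vector.

```python
import numpy as np

def bayes_vote(x, version_space_W):
    """Majority vote of all version-space classifiers on a new point x
    (the exact Bayes decision under the PAC likelihood)."""
    return np.sign(np.mean(np.sign(version_space_W @ x)))

def single_classifier(x, w_single):
    """The approximation: classify with one representative weight vector."""
    return np.sign(w_single @ x)
```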

– That is, the Bayes point (BP) is the optimal projection of the Bayes classifier onto a single classifier w.r.t. generalization error
– However, this too is intractable, because we would need to know the input distribution and the posterior
– Another reasonable approximation: the center of mass of version space, w_cm ∝ E_{w|data}[w]

● Now the Bayes classification of a new object reduces to the classification w.r.t. this single weight vector, sign(⟨w_cm, x⟩)
● Estimate w_cm by MCMC-style sampling (the 'kernel billiard' algorithm)
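The book estimates the center of mass with the kernel billiard algorithm. As a hedged substitute, here is a naive rejection-sampling sketch that is only feasible for a low-dimensional, linearly separable toy problem (it is not the book's algorithm): sample directions uniformly from the unit sphere, keep those lying in version space (PAC likelihood equal to one), and average them.

```python
import numpy as np

def bayes_point_rejection(X, y, n_samples=100_000, rng=None):
    """Crude rejection-sampling approximation of the Bayes point
    (center of mass of version space) for labels y in {-1, +1}."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    W = rng.normal(size=(n_samples, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)          # uniform directions on the sphere
    in_version_space = np.all((X @ W.T) * y[:, None] > 0, axis=0)
    consistent = W[in_version_space]                       # classifiers with zero training error
    if len(consistent) == 0:
        raise ValueError("no sampled classifier lies in version space")
    w_bp = consistent.mean(axis=0)
    return w_bp / np.linalg.norm(w_bp)                     # single weight vector ~ Bayes point

# Classification of a new object x then reduces to sign(w_bp @ x),
# matching the single-weight-vector rule on the previous slide.
```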