CS 678: Ensembles, Model Combination and Bayesian Combination

COD: Classifier Output Distance
How functionally similar are different learning models? Independent of accuracy.

Mapping Learning Algorithm Space – based on 30 Irvine data sets

Mapping Learning Algorithm Space – 2-dimensional rendition

Mapping Task Space – How similarly do different tasks react to different learning algorithms? (COD-based similarity)

Bayesian Learning
P(h|D) – Posterior probability of h; this is what we usually want to know in machine learning
P(h) – Prior probability of the hypothesis, independent of D. Do we usually know it?
– Could assign equal probabilities
– Could assign probability based on inductive bias (e.g. simpler hypotheses have higher probability)
P(D) – Prior probability of the data
P(D|h) – Probability ("likelihood") of the data given the hypothesis
– Approximated by the accuracy of h on the data set
Bayes rule: P(h|D) = P(D|h)P(h)/P(D)
P(h|D) increases with P(D|h) and P(h). When seeking the best h for a particular D, P(D) is the same for every hypothesis and is dropped.
This is a good approach when P(D|h)P(h) is easier to calculate than P(h|D)
– If we do have knowledge about the prior P(h), then that is useful information
– P(D|h) can be easy to compute in many cases (generative models)

Bayesian Learning
Maximum a posteriori (MAP) hypothesis: h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h)P(h)/P(D) = argmax_{h∈H} P(D|h)P(h)
Maximum likelihood (ML) hypothesis: h_ML = argmax_{h∈H} P(D|h)
MAP = ML if all priors P(h) are equally likely
Note that the prior can act like an inductive bias (i.e. simpler hypotheses are more probable)
For machine learning, P(D|h) is usually measured using the accuracy of the hypothesis on the data
– If the hypothesis is very accurate on the data, that implies the data is more likely given that the hypothesis is true (correct)
Example (assume only 3 possible hypotheses) with different priors – see the sketch below
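
A minimal sketch of the three-hypothesis example hinted at on the slide. The hypothesis names, priors, and likelihoods (here standing in for accuracy on the data set) are all made up for illustration; it simply compares the argmax choices for ML and MAP:

```python
# Hypothetical example: three hypotheses with made-up priors P(h)
# and likelihoods P(D|h), the latter approximated by accuracy on the data set.
priors      = {"h1": 0.70, "h2": 0.20, "h3": 0.10}
likelihoods = {"h1": 0.80, "h2": 0.90, "h3": 0.95}

h_ml  = max(likelihoods, key=likelihoods.get)                  # argmax P(D|h)
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])  # argmax P(D|h)P(h)

print("ML hypothesis: ", h_ml)    # h3 (highest likelihood)
print("MAP hypothesis:", h_map)   # h1 (the prior outweighs the likelihood gap)
```

With these particular numbers the ML and MAP choices disagree, which is exactly the point of assigning non-uniform priors.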

Bayesian Learning (cont.)
The brute force approach is to test each h ∈ H to see which maximizes P(h|D)
Note that the argmax score is not the real probability since P(D) is unknown, but P(D) is not needed if we are just trying to find the best hypothesis
We can still get the real probability (if desired) by normalization when there is a limited number of hypotheses
– Assume only two possible hypotheses h1 and h2
– The true posterior probability of h1 would be P(h1|D) = P(D|h1)P(h1) / (P(D|h1)P(h1) + P(D|h2)P(h2))
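
A quick sketch of that normalization for two hypotheses; the numbers plugged in for P(D|h) and P(h) are invented:

```python
# Two hypothetical hypotheses; the values for P(D|h) and P(h) are made up.
scores = {"h1": 0.80 * 0.6, "h2": 0.70 * 0.4}         # unnormalized P(D|h)P(h)
p_d = sum(scores.values())                            # P(D), by summing over all of H
posteriors = {h: s / p_d for h, s in scores.items()}
print(posteriors)   # {'h1': 0.63..., 'h2': 0.36...}; sums to 1
```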

Bayesian Learning
The Bayesian view is that we measure uncertainty, which we can do even if there are not a lot of examples
– What is the probability that your team will win the championship this year? We cannot answer this with a frequentist approach
– What is the probability that a particular coin will come up heads?
– Without much data we put our initial belief in the prior
But as more data becomes available we transfer more of our belief to the data (likelihood)
With infinite data, we do not consider the prior at all

Bayesian Example
Assume we want to learn the mean μ of a random variable x where the variance σ² is known and we have not yet seen any data
P(μ|D,σ²) = P(D|μ,σ²)P(μ)/P(D) ∝ P(D|μ,σ²)P(μ)
A Bayesian would want to represent the prior (over μ, centered at μ₀) and the likelihood as parameterized distributions (e.g. Gaussian, multinomial, uniform, etc.)
Let's assume a Gaussian
Since the prior is a Gaussian, we would like the likelihood distribution to be one we can multiply it by and still get a posterior which is a parameterized distribution
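
Written out under the Gaussian assumption above (known variance σ², prior mean μ₀ and prior variance σ₀², N i.i.d. observations x_n), the proportionality is the product of the Gaussian likelihood terms with the Gaussian prior:

```latex
p(\mu \mid D, \sigma^2) \;\propto\; p(D \mid \mu, \sigma^2)\, p(\mu)
  \;=\; \left[\prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2)\right]
        \mathcal{N}(\mu \mid \mu_0, \sigma_0^2)
```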

Conjugate Priors
P(μ|D,σ²) = P(D|μ)P(μ)/P(D) ∝ P(D|μ)P(μ)
If the posterior is the same family of distribution as the prior after the multiplication, then we say the prior and posterior are conjugate distributions and the prior is a conjugate prior for the likelihood
In the case of a known variance and a Gaussian prior, we can use a Gaussian likelihood and the product (posterior) will also be a Gaussian
If the likelihood is multinomial, then we would need to use a Dirichlet prior and the posterior would be a Dirichlet
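
A minimal sketch of the multinomial/Dirichlet case just mentioned: conjugacy means the posterior is obtained by adding the observed counts to the Dirichlet pseudo-counts. The pseudo-counts and observed counts below are made up:

```python
# Dirichlet prior over a 3-outcome multinomial; pseudo-counts and data are made up.
alpha_prior = [2.0, 2.0, 2.0]            # Dirichlet(2, 2, 2) prior
counts      = [10, 3, 1]                 # observed outcome counts

# Conjugacy: Dirichlet prior x multinomial likelihood -> Dirichlet posterior,
# obtained by simply adding the observed counts to the pseudo-counts.
alpha_post = [a + c for a, c in zip(alpha_prior, counts)]

total = sum(alpha_post)
posterior_mean = [a / total for a in alpha_post]   # posterior mean of each outcome prob.
print(alpha_post)        # [12.0, 5.0, 3.0]
print(posterior_mean)    # [0.6, 0.25, 0.15]
```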

Some Discrete Conjugate Distributions

Some Continuous Conjugate Distributions

More Continuous Conjugate Distributions

Bayesian Example
Prior: P(μ) = N(μ | μ₀, σ₀²)
Posterior: P(μ|D) = N(μ | μ_N, σ_N²)
Note how belief transfers from the prior to the data as more data is seen
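
The slide does not spell out μ_N and σ_N², so below is a sketch using the standard known-variance Gaussian result: the posterior precision is the prior precision plus N data precisions, and the posterior mean is a precision-weighted blend of μ₀ and the sample mean. The specific prior and data-generating numbers are made up:

```python
import random

def gaussian_posterior(data, mu_0, var_0, var):
    """Posterior N(mu_N, var_N) over the mean mu, with known data variance var."""
    n = len(data)
    if n == 0:
        return mu_0, var_0                      # no data: posterior == prior
    sample_mean = sum(data) / n
    var_n = 1.0 / (1.0 / var_0 + n / var)       # precisions (1/variance) add
    mu_n = var_n * (mu_0 / var_0 + n * sample_mean / var)
    return mu_n, var_n

random.seed(0)
true_mu, var = 5.0, 4.0
for n in (0, 1, 10, 1000):
    data = [random.gauss(true_mu, var ** 0.5) for _ in range(n)]
    mu_n, var_n = gaussian_posterior(data, mu_0=0.0, var_0=1.0, var=var)
    print(f"n={n:5d}  mu_N={mu_n:6.3f}  var_N={var_n:.4f}")
# n=0 gives the prior; as n grows, mu_N approaches the sample mean and var_N shrinks,
# i.e. belief moves from the prior to the data.
```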

Bayesian Example
If for this problem the mean had been known and the variance was the unknown, then the conjugate prior would need to be the inverse gamma distribution
– If we use precision (the inverse of variance), then we use a gamma distribution
If both the mean and variance are unknown (the typical case), then the conjugate prior is a combination of a normal (Gaussian) and an inverse gamma, called the normal-inverse-gamma distribution
For the typical multivariate case this becomes the normal-inverse-Wishart distribution (the inverse Wishart is the conjugate prior for the covariance matrix on its own)

Bayesian Inference
A Bayesian would frown on using an MLP/decision tree/nearest neighbor model, etc. as the maximum likelihood part of the equation
– Why?

Bayesian Inference
A Bayesian would frown on using an MLP/decision tree/SVM/nearest neighbor model, etc. as the maximum likelihood part of the equation
– Why?
– These models are not standard parameterized distributions, and there is no direct way to multiply the model with a prior distribution to get a posterior distribution
– We can do things to make MLP, decision tree, etc. outputs behave like probabilities, and even add variance, but they are not really exact probabilities/distributions (softmax, ad hoc adjustments, etc.)
A distribution would be nice, but usually the most important goal is best overall accuracy
– If we can have an accurate model that is also a distribution, then that is advantageous; otherwise…
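
For the softmax point above, a minimal sketch of mapping raw model output scores to values that at least sum to one (the example scores are made up, and the result should not be read as a true posterior distribution):

```python
import math

def softmax(scores):
    """Map arbitrary real-valued scores to positive values that sum to 1."""
    m = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

print(softmax([2.0, 1.0, -0.5]))   # made-up output scores for a 3-class model
```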

Bayes Optimal Classifiers
The better question is: what is the most probable classification for a given instance, rather than what is the most probable hypothesis for a data set?
Let all possible hypotheses vote for the instance in question, weighted by their posterior (an ensemble approach); this is usually better than the single best MAP hypothesis
Bayes optimal classification: c_BO = argmax_{c∈C} Σ_{h∈H} P(c|h) P(h|D)
Example: 3 hypotheses with different priors and posteriors; show results for ML, MAP, bagging, and Bayes optimal (see the sketch below)
– Discrete and probabilistic outputs
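
A sketch of that weighted vote with three hypotheses. The posteriors P(h|D) and per-hypothesis class probabilities P(c|h) are made up, chosen to show a case where the Bayes optimal classification differs from the MAP hypothesis's classification:

```python
# Made-up posteriors P(h|D) and per-hypothesis class probabilities P(c|h).
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
class_probs = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

# MAP: classify with the single most probable hypothesis.
h_map = max(posteriors, key=posteriors.get)
map_class = max(class_probs[h_map], key=class_probs[h_map].get)

# Bayes optimal: every hypothesis votes, weighted by its posterior.
vote = {c: sum(posteriors[h] * class_probs[h][c] for h in posteriors)
        for c in ("+", "-")}
bayes_optimal_class = max(vote, key=vote.get)

print("MAP hypothesis class:", map_class)              # '+'
print("Bayes optimal class: ", bayes_optimal_class)    # '-' (weight 0.6 vs 0.4)
```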

Bayes Optimal Classifiers (cont.)
No other classification method using the same hypothesis space can outperform a Bayes optimal classifier on average, given the available data and prior probabilities over the hypotheses
Large or infinite hypothesis spaces make this impractical in general
Also, it is only as accurate as our knowledge of the priors (background knowledge) for the hypotheses, which we often do not have
– But if we do have some insights, priors can really help – for example, they would automatically handle overfit, with no need for a validation set, early stopping, etc.
If our priors are bad, then Bayes optimal will not be optimal. For example, if we just assumed uniform priors, we might have a situation where the many lower-posterior hypotheses dominate the fewer high-posterior ones.
However, this is an important theoretical concept, and it leads to many practical algorithms which are simplifications based on the concepts of full Bayes optimality

Bayesian Model Averaging
The most common Bayesian approach to "model combining"
– A Bayesian would not call BMA a model combining approach, and that really isn't its goal
Assumes the correct h is in the hypothesis space H and that the data was generated by this correct h (with possible noise)
The Bayes equation simply expresses the uncertainty over whether the correct h has been chosen
Looks like model combination, but as more data is given, the P(h|D) of the highest-likelihood model dominates
– This is a problem with practical Bayes optimal: the MAP hypothesis will eventually dominate
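
A tiny numeric illustration of this dominance effect, assuming i.i.d. data so that each model's unnormalized weight is its average per-example likelihood raised to the number of examples; the likelihoods 0.901 and 0.900 are made up:

```python
# Two hypothetical models whose average per-example likelihoods differ by only 0.001.
per_example_likelihood = {"m_best": 0.901, "m_runner_up": 0.900}

for n in (10, 1000, 5000):                      # number of i.i.d. training examples
    weights = {m: p ** n for m, p in per_example_likelihood.items()}
    z = sum(weights.values())
    posterior = {m: w / z for m, w in weights.items()}
    print(n, {m: round(p, 3) for m, p in posterior.items()})
# n=10: roughly 0.503 vs 0.497; n=5000: roughly 0.996 vs 0.004 -- the tiny edge
# eventually captures essentially all of the posterior mass.
```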

Bayesian Model Averaging
Even if the top 3 models have accuracies of 90.1%, 90%, and 90%, only the top model will be significantly considered as the data increases
– All posteriors must sum to 1; as data increases, the variance decreases and the probability mass converges to the MAP hypothesis
This would be overfit for typical ML, but it is exactly what BMA seeks
– And in fact, empirically, BMA is usually less accurate than even simple model combining techniques (bagging, etc.)
How to select the M models
– Heuristic: keep models with a combination of simplicity and highest accuracy
– Gibbs: randomly sample models based on their probability (see the sketch below)
– MCMC: start at model M_i, sample, then probabilistically transition to itself or a neighbor model
Gibbs and MCMC require the ability to generate many arbitrary models and possibly many samples before convergence
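
A minimal sketch of the Gibbs-style selection listed above: sample a single model at random in proportion to its (approximate) posterior weight and let that model do the classifying. The model names and weights are placeholders:

```python
import random

# Placeholder posterior weights for a handful of candidate models (must sum to 1).
model_weights = {"m1": 0.5, "m2": 0.3, "m3": 0.2}

def gibbs_pick(weights, rng=random):
    """Sample a single model with probability proportional to its weight."""
    models = list(weights)
    return rng.choices(models, weights=[weights[m] for m in models], k=1)[0]

random.seed(0)
picks = [gibbs_pick(model_weights) for _ in range(10_000)]
print({m: picks.count(m) / len(picks) for m in model_weights})  # close to the weights
```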

Model Combination – Ensembles
One of the significant potential advantages of model combination is an enrichment of the original hypothesis space H, or an easier ability to arrive at accurate members of H
The (omitted) figure shows three members of H, drawn as spheres
BMA would give almost all of the weight to the top sphere
The optimal solution is a uniform vote between the 3 spheres (all h's)
This optimal solution h' is not in the original H, but is part of the larger H' created when we combine models

Bayesian Model Combination
We could do Bayesian model combination, where we still have priors but they are over combinations of models
E is the space of model combinations using hypotheses from H
This would move confidence over time to one particular combination of models
Ensembles, on the other hand, are typically ad hoc but still often lead empirically to more accurate overall solutions
– BMC would actually be the fairer comparison between ensembles and Bayes optimal, since in that case Bayes would be trying to find exactly one ensemble, whereas usually it tries to find one h