Bayesian Learning Rong Jin.

Outline
- MAP learning vs. ML learning
- Minimum description length principle
- Bayes optimal classifier
- Bagging

Maximum Likelihood Learning (ML)
- Find the model that best fits the data by maximizing the log-likelihood of the training data
- Example: logistic regression, whose parameters are found by maximizing the likelihood of the training data
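A minimal sketch of ML estimation for logistic regression, assuming binary labels y in {0, 1} and plain gradient ascent on the log-likelihood (the toy data, step size, and iteration count are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_ml(X, y, lr=0.1, n_iters=1000):
    """Maximum likelihood fit of logistic regression by gradient ascent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        p = sigmoid(X @ w)           # P(y = 1 | x, w)
        grad = X.T @ (y - p)         # gradient of the log-likelihood
        w += lr * grad / n
    return w

# toy data: two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.r_[np.zeros(50), np.ones(50)]
print("ML weights:", fit_logistic_ml(X, y))
```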

Maximum A Posteriori Learning (MAP)
- In ML learning, the model is determined solely by the training examples
- Very often we have prior knowledge or a preference about the parameters/models, which ML learning cannot incorporate
- In MAP learning, knowledge/preference about the parameters/models is incorporated through a prior distribution over the parameters
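In symbols (a standard formulation, with w the parameters and D the training data; the notation is mine, not from the slide):

```latex
\hat{w}_{\mathrm{ML}}  = \arg\max_{w}\; \log P(D \mid w)
\qquad
\hat{w}_{\mathrm{MAP}} = \arg\max_{w}\; \big[\, \log P(D \mid w) + \log P(w) \,\big]
```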

Example: Logistic Regression
- ML learning: maximize the likelihood of the training data
- Prior knowledge/preference: no feature should dominate over all other features, so prefer small weights
- This preference is expressed as a Gaussian prior on the parameters/models
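A common concrete choice for that prior, assuming a zero-mean isotropic Gaussian with variance sigma^2 (the exact form on the slide did not survive extraction):

```latex
P(w) \;\propto\; \exp\!\left( -\frac{\lVert w \rVert^{2}}{2\sigma^{2}} \right)
```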

Example (cont'd)
- MAP learning for logistic regression with the Gaussian prior
- Compared to regularized logistic regression: the two objectives coincide
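Spelling the comparison out (a standard derivation; the slide's own equations were lost in extraction):

```latex
\hat{w}_{\mathrm{MAP}}
 = \arg\max_{w} \Big[\, \sum_{i=1}^{m} \log P(y_i \mid x_i, w) \;-\; \frac{\lVert w \rVert^{2}}{2\sigma^{2}} \,\Big]
```

which is exactly the L2-regularized logistic regression objective with regularization weight 1/(2σ²). In scikit-learn, for instance, this corresponds to LogisticRegression(penalty='l2', C=...), where C is the inverse regularization strength and plays roughly the role of the prior variance σ².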

Minimum Description Length Principle
- Occam's razor: prefer the simplest hypothesis
- Simplest hypothesis = the hypothesis with the shortest description length
- Minimum description length: prefer the hypothesis h that minimizes L_C1(h) + L_C2(D|h), where L_C(x) is the description length of message x under coding scheme C
- L_C1(h): # of bits to encode hypothesis h (the complexity of the model)
- L_C2(D|h): # of bits to encode the data D given h (the # of mistakes)

Minimum Description Length Principle (sender/receiver view)
- A sender wants to transmit the training labels D to a receiver
- Options: send only D, send only h, or send h plus the exceptions D|h

Example: Decision Tree
- H = decision trees, D = training data labels
- L_C1(h) is the # of bits to describe tree h
- L_C2(D|h) is the # of bits to describe D given tree h
- Note: L_C2(D|h) = 0 if the examples are classified perfectly by h; only the exceptions need to be described
- h_MDL trades off tree size against training errors

MAP vs. MDL
- Fact from information theory: the optimal (shortest expected coding length) code for an event with probability p uses -log2 p bits
- Interpreting MAP with the MDL principle: -log2 P(h) is the description length of h under the optimal coding, and -log2 P(D|h) is the description length of the exceptions under the optimal coding
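Making the correspondence explicit (a standard manipulation):

```latex
h_{\mathrm{MAP}}
 = \arg\max_{h} P(D \mid h)\, P(h)
 = \arg\min_{h} \big[\, -\log_2 P(D \mid h) \;-\; \log_2 P(h) \,\big]
```

so the MAP hypothesis is exactly the hypothesis with the shortest combined description length under the optimal codes.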

Problems with Maximum Approaches
- Consider three possible hypotheses with known posterior probabilities, where h1 has the highest posterior: maximum approaches will pick h1
- Given a new instance x, the maximum approaches will output h1's prediction, +
- However, is + really the most probable result once the other hypotheses are taken into account?
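A concrete illustration (the particular posteriors and predictions below are assumed for illustration, not taken from the slide): suppose P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3, h1 predicts + while h2 and h3 predict -.

```python
# Illustrative posteriors and predictions (assumed numbers, not from the slide)
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": "+", "h2": "-", "h3": "-"}

# MAP picks the single most probable hypothesis ...
h_map = max(posteriors, key=posteriors.get)
print("MAP hypothesis:", h_map, "-> predicts", predictions[h_map])   # h1 -> +

# ... but the posterior-weighted vote over all hypotheses disagrees
vote = {}
for h, p in posteriors.items():
    vote[predictions[h]] = vote.get(predictions[h], 0.0) + p
print("Posterior-weighted vote:", vote)                       # {'+': 0.4, '-': 0.6}
print("Bayes optimal prediction:", max(vote, key=vote.get))   # '-'
```

Even though h1 is individually the most probable hypothesis, the class - carries 0.6 of the posterior mass, so + is not the most probable prediction.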

Bayes Optimal Classifier (Bayesian Average)
- Bayes optimal classification: weight each hypothesis's prediction by its posterior P(h|D) and output the class with the largest total weight
- In the example above, the most probable class is -
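The standard formula (V is the set of possible labels, H the hypothesis space; this is the textbook form, reconstructed since the slide's equation was lost):

```latex
y^{*} \;=\; \arg\max_{v \in V} \; \sum_{h \in H} P(v \mid x, h)\, P(h \mid D)
```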

When Do We Need Bayesian Average?
- When the posterior P(h|D) has multiple modes
- When the optimal mode is flat (many hypotheses are nearly as good as the best one)
- When NOT to use Bayesian average: when P(h|D) cannot be estimated accurately

Computational Issues with the Bayes Optimal Classifier
- Bayes optimal classification requires summing over all possible models/hypotheses h
- This is expensive or impossible when the model/hypothesis space is large (e.g., the space of decision trees)
- Solution: sampling!

Gibbs Classifier
- Gibbs algorithm: choose one hypothesis at random according to P(h|D), and use it to classify the new instance
- Surprising fact: this simple scheme works well, and it can be improved by sampling multiple hypotheses from P(h|D) and averaging their classification results
- Sampling tools: Markov chain Monte Carlo (MCMC) sampling, importance sampling
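A minimal sketch of the Gibbs classifier and its averaged variant, assuming we already have a set of hypotheses with (approximate) posterior weights; the hypotheses and weights here are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: each hypothesis is a function x -> label, with a posterior weight
hypotheses = [lambda x: "+", lambda x: "-", lambda x: "-"]
posterior  = np.array([0.4, 0.3, 0.3])

def gibbs_classify(x):
    """Sample a single hypothesis according to P(h|D) and use its prediction."""
    h = rng.choice(len(hypotheses), p=posterior)
    return hypotheses[h](x)

def averaged_classify(x, n_samples=100):
    """Improve on Gibbs by sampling several hypotheses and taking a majority vote."""
    votes = [gibbs_classify(x) for _ in range(n_samples)]
    return max(set(votes), key=votes.count)

print(gibbs_classify(x=None))      # prediction from one randomly drawn hypothesis
print(averaged_classify(x=None))   # usually '-', matching the Bayes-optimal answer
```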

Bagging Classifiers
- In general, sampling from P(h|D) is difficult because P(h|D) itself is hard to compute
- Example: how do we compute P(h|D) for a decision tree?
- P(h|D) cannot be computed for non-probabilistic classifiers such as SVMs
- P(h|D) is extremely small when the hypothesis space is large
- Bagging classifiers: realize sampling from P(h|D) through a sampling of the training examples

Bootstrap Sampling
- Bagging = Bootstrap aggregating
- Bootstrap sampling: given a set D containing m training examples, create Di by drawing m examples at random with replacement from D
- Di is expected to leave out about 0.37 of the examples in D (since (1 - 1/m)^m ≈ 1/e)
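A quick sketch that draws one bootstrap sample and checks the left-out fraction empirically (m and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000
D = np.arange(m)                          # indices of the m training examples

# One bootstrap sample: m draws with replacement
Di = rng.choice(D, size=m, replace=True)

left_out = 1.0 - len(np.unique(Di)) / m
print(f"fraction of D left out of Di: {left_out:.3f}")   # close to 1/e ~ 0.368
```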

Bagging Algorithm
- Create k bootstrap samples D1, D2, ..., Dk
- Train a distinct classifier hi on each Di
- Classify a new instance by a vote of the k classifiers with equal weights
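A compact sketch of the algorithm with decision trees as the base classifier (assuming scikit-learn is available; the data set and k = 50 are illustrative, the latter matching the empirical study below):

```python
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

k, m = 50, len(X)
classifiers = []
for _ in range(k):
    idx = rng.choice(m, size=m, replace=True)      # bootstrap sample Di
    classifiers.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def bagged_predict(x):
    """Equal-weight vote of the k bootstrapped trees."""
    votes = [h.predict(x.reshape(1, -1))[0] for h in classifiers]
    return Counter(votes).most_common(1)[0][0]

print(bagged_predict(X[0]), "vs. true label", y[0])
```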

Bagging ≈ Bayesian Average
- Bayesian average: sample hypotheses h1, h2, ..., hk from the posterior P(h|D) and average their predictions
- Bagging: draw bootstrap samples D1, D2, ..., Dk from D and train classifiers h1, h2, ..., hk on them
- Bootstrap sampling is almost equivalent to sampling from the posterior P(h|D)

Empirical Study of Bagging
- Bagging decision trees: bootstrap 50 different samples from the original training data
- Learn a decision tree over each bootstrap sample
- Predict the class labels of test instances by the majority vote of the 50 decision trees
- Bagged decision trees perform better than a single decision tree

Bias-Variance Tradeoff
- Why does bagging work better than a single classifier? The bias-variance tradeoff
- Real-valued case: the output y for input x follows y = f(x) + ε, with noise ε ~ N(0, σ²)
- ĥ(x|D) denotes a predictor learned from the training data D
- Bias-variance decomposition of the expected squared error:
- Model variance: the simpler ĥ(x|D) is, the smaller the variance
- Model bias: the simpler ĥ(x|D) is, the larger the bias
- Irreducible variance: the noise variance σ², which no predictor can remove
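The decomposition itself, in its standard form (expectations over training sets D, at a fixed input x; reconstructed since the slide's equation did not survive):

```latex
\mathbb{E}\big[(y - \hat{h}(x \mid D))^{2}\big]
 = \underbrace{\sigma^{2}}_{\text{irreducible}}
 + \underbrace{\big(f(x) - \mathbb{E}_{D}[\hat{h}(x \mid D)]\big)^{2}}_{\text{bias}^{2}}
 + \underbrace{\mathbb{E}_{D}\big[(\hat{h}(x \mid D) - \mathbb{E}_{D}[\hat{h}(x \mid D)])^{2}\big]}_{\text{variance}}
```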

Bias-Variance Tradeoff (figure): fitting the data with complicated models gives small model bias but large model variance around the true model.

Bias-Variance Tradeoff (figure): fitting the data with simple models gives large model bias but small model variance around the true model.

Bagging
- Bagging performs better than a single classifier because it effectively reduces the model variance
- (Figure: bias and variance of a single decision tree vs. a bagged decision tree)
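A one-line way to see the variance reduction, under the standard (not-from-the-slide) assumption that the k bagged predictors each have variance σ² and pairwise correlation ρ:

```latex
\mathrm{Var}\!\left( \frac{1}{k} \sum_{i=1}^{k} \hat{h}_{i}(x) \right)
 \;=\; \rho\,\sigma^{2} \;+\; \frac{1-\rho}{k}\,\sigma^{2}
```

Averaging many imperfectly correlated predictors shrinks the variance term while leaving the bias essentially unchanged.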