Bayesian Learning
Rong Jin
Outline
MAP learning vs. ML learning
Minimum description length principle
Bayes optimal classifier
Bagging
Maximum Likelihood Learning (ML)
Find the model that best fits the data by maximizing the log-likelihood of the training data.
Example: logistic regression, whose parameters are found by maximizing the likelihood of the training data.
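The slide's formula was not captured in the transcript; below is a minimal sketch of the ML objective for binary logistic regression, assuming labels y_i in {-1, +1} and weight vector w (this notation is an assumption, not taken from the slides).

```latex
\begin{align*}
% Conditional likelihood assumed by logistic regression
P(y \mid \mathbf{x}; \mathbf{w}) &= \frac{1}{1 + \exp(-y\,\mathbf{w}^{\top}\mathbf{x})} \\
% ML estimate: maximize the log-likelihood of the training data
\mathbf{w}_{\mathrm{ML}} &= \arg\max_{\mathbf{w}} \sum_{i=1}^{n} \log P(y_i \mid \mathbf{x}_i; \mathbf{w})
  = \arg\max_{\mathbf{w}} \; -\sum_{i=1}^{n} \log\!\left(1 + e^{-y_i\,\mathbf{w}^{\top}\mathbf{x}_i}\right)
\end{align*}
```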
Maximum A Posteriori Learning (MAP)
In ML learning, models are determined solely by the training examples.
Very often we have prior knowledge or preferences about the parameters/models, and ML learning is unable to incorporate them.
In maximum a posteriori (MAP) learning, knowledge/preferences about the parameters/models are incorporated through a prior over the parameters.
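A sketch of the standard MAP formulation being described (the slide's own equation was not transcribed):

```latex
\begin{align*}
h_{\mathrm{MAP}} &= \arg\max_{h} P(h \mid D)
  = \arg\max_{h} \frac{P(D \mid h)\,P(h)}{P(D)}
  = \arg\max_{h} P(D \mid h)\,P(h) \\
 &= \arg\max_{h} \big[\log P(D \mid h) + \log P(h)\big]
  \qquad \text{(log-likelihood + log-prior)}
\end{align*}
```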
Example: Logistic Regression
ML learning: maximize the likelihood of the training data alone.
Prior knowledge/preference: no feature should dominate over all other features, i.e., prefer small weights.
Gaussian prior for the parameters/models:
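A sketch of the Gaussian prior referred to above; the zero mean and the shared variance σ² are assumptions, since the slide's formula was not transcribed.

```latex
\begin{align*}
P(\mathbf{w}) = \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi}\,\sigma}
  \exp\!\left(-\frac{w_j^{2}}{2\sigma^{2}}\right)
  \;\propto\; \exp\!\left(-\frac{\|\mathbf{w}\|_2^{2}}{2\sigma^{2}}\right)
\end{align*}
```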
Example (cont'd)
MAP learning for logistic regression:
Compare to regularized logistic regression (see the sketch below).
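A sketch of how the comparison works out under the Gaussian prior above (again, the slide's own equations were not transcribed): the log-prior contributes an L2 penalty, so MAP learning coincides with L2-regularized logistic regression with λ = 1/(2σ²).

```latex
\begin{align*}
\mathbf{w}_{\mathrm{MAP}}
 &= \arg\max_{\mathbf{w}} \big[\log P(D \mid \mathbf{w}) + \log P(\mathbf{w})\big] \\
 % constants from the Gaussian normalization do not affect the argmax
 &= \arg\max_{\mathbf{w}} \Big[-\sum_{i=1}^{n} \log\!\big(1 + e^{-y_i\,\mathbf{w}^{\top}\mathbf{x}_i}\big)
    - \frac{\|\mathbf{w}\|_2^{2}}{2\sigma^{2}}\Big]
\end{align*}
```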
Minimum Description Length Principle
Occam's razor: prefer the simplest hypothesis.
Simplest hypothesis = the hypothesis with the shortest description length.
Minimum description length: prefer the hypothesis that minimizes the total description length, where LC(x) is the description length of message x under coding scheme C.
The two terms are the number of bits to encode hypothesis h (complexity of the model) and the number of bits to encode data D given h (the mistakes).
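A sketch of the MDL criterion those bullets describe, using coding scheme C1 for hypotheses and C2 for the data given a hypothesis (matching the LC notation above):

```latex
\begin{align*}
h_{\mathrm{MDL}} = \arg\min_{h \in H}\;
  \underbrace{L_{C_1}(h)}_{\text{\# bits to encode } h \text{ (model complexity)}}
  + \underbrace{L_{C_2}(D \mid h)}_{\text{\# bits to encode } D \text{ given } h \text{ (mistakes)}}
\end{align*}
```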
Minimum Description Length Principle
Coding view: a sender must transmit the labels D to a receiver.
Send only D? Possible, but ignores any structure in the data.
Send only h? Only enough if h classifies every example correctly.
Send h plus D given h? Send the hypothesis and the exceptions it gets wrong.
Example: Decision Tree
H = decision trees, D = training data labels.
LC1(h) is the number of bits to describe tree h.
LC2(D|h) is the number of bits to describe D given tree h.
Note that LC2(D|h) = 0 if the examples are classified perfectly by h; otherwise we only need to describe the exceptions.
hMDL trades off tree size against training errors.
MAP vs. MDL
Fact from information theory: the optimal (shortest expected coding length) code for an event with probability p uses -log2 p bits.
Interpret MAP using the MDL principle: under optimal coding, -log2 P(h) is the description length of h and -log2 P(D|h) is the description length of the exceptions.
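Written out, the interpretation is a short derivation (a sketch; the slide's own derivation was not transcribed):

```latex
\begin{align*}
h_{\mathrm{MAP}} &= \arg\max_{h} P(D \mid h)\,P(h)
  = \arg\max_{h} \big[\log_2 P(D \mid h) + \log_2 P(h)\big] \\
 &= \arg\min_{h} \Big[\underbrace{-\log_2 P(h)}_{\text{description length of } h}
    + \underbrace{\big(-\log_2 P(D \mid h)\big)}_{\text{description length of the exceptions}}\Big]
\end{align*}
```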
Problems with Maximum Approaches
Consider three possible hypotheses; the maximum approaches (ML/MAP) will pick h1.
Given a new instance x, the maximum approaches will output +.
However, is this the most probable result?
Bayes Optimal Classifier (Bayesian Average)
Bayes optimal classification: weight every hypothesis by its posterior probability and average their predictions.
Example: in the scenario above, the most probable class is -.
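A sketch of the Bayes optimal rule, with an illustrative set of posteriors (these numbers are assumed for illustration; the slide's actual example was not transcribed). Suppose P(h1|D) = 0.4 with h1(x) = +, and P(h2|D) = P(h3|D) = 0.3 with h2(x) = h3(x) = -; MAP picks h1 and outputs +, but the Bayesian average outputs -.

```latex
\begin{align*}
v_{\mathrm{BO}} &= \arg\max_{v \in V} \sum_{h \in H} P(v \mid h)\,P(h \mid D) \\
% With the assumed posteriors above:
P(+ \mid D) &= 0.4, \qquad P(- \mid D) = 0.3 + 0.3 = 0.6
  \;\Rightarrow\; \text{output } -
\end{align*}
```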
When Do We Need Bayesian Averaging?
Use the Bayesian average when the posterior P(h|D) has multiple modes, or when the optimal mode is flat.
Do NOT use the Bayesian average when Pr(h|D) cannot be estimated accurately.
Computational Issues with the Bayes Optimal Classifier
Bayes optimal classification needs to sum over all possible models/hypotheses h.
This is expensive or impossible when the model/hypothesis space is large (for example, the space of decision trees).
Solution: sampling!
Gibbs Classifier
Gibbs algorithm: choose one hypothesis at random according to P(h|D), and use it to classify the new instance.
Surprising fact: assuming the target concept is drawn from the same prior, the expected error of the Gibbs classifier is at most twice the expected error of the Bayes optimal classifier.
Improve on this by sampling multiple hypotheses from P(h|D) and averaging their classification results, e.g., via Markov chain Monte Carlo (MCMC) sampling or importance sampling (see the sketch below).
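A minimal Python sketch of the "sample several hypotheses and average" idea; `sample_hypothesis` stands in for whatever posterior sampler is available (one MCMC or importance-sampling draw from P(h|D)) and is an assumed helper, not something from the slides.

```python
import numpy as np

def averaged_gibbs_predict(sample_hypothesis, x, num_samples=25, seed=0):
    """Approximate Bayesian averaging by sampling hypotheses from P(h|D).

    sample_hypothesis(rng) is assumed to return a classifier h with
    h(x) in {-1, +1}, drawn (approximately) from the posterior P(h|D).
    """
    rng = np.random.default_rng(seed)
    # Average the sampled classifiers' predictions and take the sign.
    votes = [sample_hypothesis(rng)(x) for _ in range(num_samples)]
    return 1 if np.mean(votes) >= 0 else -1
```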
Bagging Classifiers
In general, sampling from P(h|D) is difficult:
P(h|D) is hard to compute. Example: how would we compute P(h|D) for a decision tree?
P(h|D) is impossible to compute for non-probabilistic classifiers such as SVMs.
P(h|D) is extremely small when the hypothesis space is large.
Bagging classifiers realize (an approximation of) sampling from P(h|D) through sampling of the training examples.
Bootstrap Sampling
Bagging = Bootstrap aggregating.
Bootstrap sampling: given a set D containing m training examples, create Di by drawing m examples at random with replacement from D.
Each Di is expected to leave out about 37% of the examples in D, since (1 - 1/m)^m ≈ e^(-1) ≈ 0.37 (see the sketch below).
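A minimal sketch of bootstrap sampling, plus an empirical check of the roughly 37% leave-out rate (the function names are assumptions for illustration):

```python
import numpy as np

def bootstrap_sample(D, rng):
    """Draw len(D) examples uniformly at random *with replacement* from D."""
    m = len(D)
    idx = rng.integers(0, m, size=m)
    return [D[i] for i in idx]

# Empirical check: each D_i leaves out about (1 - 1/m)^m ~= 1/e ~= 0.37 of D.
rng = np.random.default_rng(0)
D = list(range(1000))
left_out = [1 - len(set(bootstrap_sample(D, rng))) / len(D) for _ in range(100)]
print(round(float(np.mean(left_out)), 3))  # roughly 0.368
```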
Bagging Algorithm
Create k bootstrap samples D1, D2, …, Dk.
Train a distinct classifier hi on each Di.
Classify a new instance by a vote of the k classifiers with equal weights (see the sketch below).
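A minimal sketch of the bagging algorithm above; the scikit-learn DecisionTreeClassifier base learner and the function names are assumptions, not part of the slides. X is an array of shape (m, d) and y an array of shape (m,).

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier  # assumed base learner

def bagging_fit(X, y, k=50, seed=0):
    """Train k classifiers, each on its own bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    m = len(y)
    classifiers = []
    for _ in range(k):
        idx = rng.integers(0, m, size=m)               # bootstrap sample D_i
        classifiers.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return classifiers

def bagging_predict(classifiers, X):
    """Classify each instance by an equally weighted majority vote."""
    votes = np.stack([h.predict(X) for h in classifiers])   # shape (k, n)
    return np.array([Counter(votes[:, j]).most_common(1)[0][0]
                     for j in range(votes.shape[1])])
```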
Bagging vs. Bayesian Average
Bayesian average: sample hypotheses h1, h2, …, hk from the posterior P(h|D) and average them.
Bagging: draw bootstrap samples D1, D2, …, Dk from D and train h1, h2, …, hk on them.
Bootstrap sampling is almost equivalent to sampling from the posterior P(h|D).
Empirical Study of Bagging
Bagging decision trees: bootstrap 50 different samples from the original training data, learn a decision tree over each bootstrap sample, and predict the class labels of test instances by the majority vote of the 50 decision trees.
Bagged decision trees perform better than a single decision tree.
Bias-Variance Tradeoff
Why does bagging work better than a single classifier? The bias-variance tradeoff.
Real-valued case: the output y for input x follows y ~ f(x) + ε, with noise ε ~ N(0, σ²); f̂(x|D) is a predictor learned from the training data D.
Bias-variance decomposition (see the sketch below):
Model variance: the simpler the predictor f̂(x|D), the smaller the variance.
Model bias: the simpler the predictor f̂(x|D), the larger the bias.
Irreducible variance: the noise variance σ², which no predictor can remove.
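The standard decomposition those bullets refer to, written out as a sketch (the slide's own equation was not transcribed); the expectation is over training sets D and the noise ε:

```latex
\begin{align*}
\mathbb{E}_{D,\varepsilon}\Big[\big(y - \hat{f}(x \mid D)\big)^{2}\Big]
 &= \underbrace{\big(f(x) - \mathbb{E}_{D}[\hat{f}(x \mid D)]\big)^{2}}_{\text{model bias}^{2}}
  + \underbrace{\mathbb{E}_{D}\Big[\big(\hat{f}(x \mid D) - \mathbb{E}_{D}[\hat{f}(x \mid D)]\big)^{2}\Big]}_{\text{model variance}}
  + \underbrace{\sigma^{2}}_{\text{irreducible variance}}
\end{align*}
```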
Bias-Variance Tradeoff
(Figure: fitting with complicated models vs. the true model; small model bias, large model variance.)
Bias-Variance Tradeoff
(Figure: fitting with simple models vs. the true model; large model bias, small model variance.)
Bagging
Bagging performs better than a single classifier because it effectively reduces the model variance.
(Figure: variance and bias of a single decision tree vs. a bagged decision tree.)