Bayesian Learning Rong Jin
Outline MAP learning vs. ML learning Minimum description length principle Bayes optimal classifier Bagging
Maximum Likelihood Learning (ML) Find the model that best fits the data by maximizing the log-likelihood of the training data. Example: logistic regression, whose parameters are found by maximizing the likelihood of the training data.
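For logistic regression the ML criterion can be written as follows (a standard formulation; the slide's own equation was not preserved, so this is a reconstruction):

```latex
\ell(\mathbf{w}) \;=\; \sum_{i=1}^{n} \log P(y_i \mid \mathbf{x}_i, \mathbf{w}),
\qquad
P(y = 1 \mid \mathbf{x}, \mathbf{w}) \;=\; \frac{1}{1 + \exp(-\mathbf{w}^{\top}\mathbf{x})}
```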
Maximum A Posteriori Learning (MAP) In ML learning, models are determined solely by the training examples. Very often we have prior knowledge or preferences about the parameters/models, but ML learning is unable to incorporate them. In maximum a posteriori (MAP) learning, knowledge/preference about the parameters/models is incorporated through a prior on the parameters.
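The two criteria can be stated side by side (standard definitions, reconstructed here since the original slide formulas were stripped):

```latex
h_{\mathrm{ML}}  \;=\; \arg\max_{h \in H} P(D \mid h),
\qquad
h_{\mathrm{MAP}} \;=\; \arg\max_{h \in H} P(D \mid h)\, P(h)
                 \;=\; \arg\max_{h \in H} \big[\log P(D \mid h) + \log P(h)\big]
```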
Example: Logistic Regression ML learning has no way to express the prior knowledge/preference that no feature should dominate over all other features, i.e., that small weights are preferred. This preference can be encoded as a Gaussian prior on the parameters/models: p(w) ∝ exp(−‖w‖²/(2σ²)) (zero mean, favoring small weights).
Example (cont'd) MAP learning for logistic regression: with the Gaussian prior on the weights, the objective becomes the log-likelihood plus the log-prior. Compared to regularized logistic regression, this is the same optimization: the Gaussian prior contributes the L2 regularization term.
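A minimal sketch of the two fits, assuming a simple gradient-descent implementation (function names and toy data are illustrative, not from the slides): with lam = 0 this is plain ML learning; with lam > 0 it adds the negative log of a zero-mean Gaussian prior on w, which is exactly L2-regularized logistic regression.

```python
import numpy as np

def fit_logistic(X, y, lam=0.0, lr=0.1, n_iter=1000):
    """Binary logistic regression by gradient descent.

    lam = 0  -> ML learning (maximize the data likelihood only).
    lam > 0  -> MAP learning with a Gaussian prior N(0, 1/lam) on each weight,
                i.e. L2-regularized logistic regression.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))        # P(y = 1 | x, w)
        grad = X.T @ (p - y) / n + lam * w      # neg. log-likelihood + prior term
        w -= lr * grad
    return w

# Toy data: the label depends only on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(float)

w_ml = fit_logistic(X, y, lam=0.0)    # ML: weights are free to grow large
w_map = fit_logistic(X, y, lam=1.0)   # MAP: the Gaussian prior shrinks the weights
print(w_ml, w_map)
```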
Minimum Description Length Principle Occam's razor: prefer the simplest hypothesis, where the simplest hypothesis is the one with the shortest description length. The minimum description length (MDL) principle prefers the hypothesis h that minimizes L_C1(h) + L_C2(D|h), where L_C(x) is the description length of message x under coding scheme C: L_C1(h) is the number of bits to encode hypothesis h (the complexity of the model), and L_C2(D|h) is the number of bits to encode the data D given h (the number of mistakes).
Minimum Description Length Principle: sender/receiver view. A sender must transmit the training labels D to a receiver. Options: send only D (no model, full cost of the data), send only h (not enough if h misclassifies some examples), or send h plus D given h (the hypothesis together with its exceptions). MDL chooses the h that minimizes the total message length.
Example: Decision Tree H = decision trees, D = training data labels. L_C1(h) is the number of bits to describe tree h; L_C2(D|h) is the number of bits to describe D given tree h. Note that L_C2(D|h) = 0 if the examples are classified perfectly by h; otherwise we only need to describe the exceptions. h_MDL therefore trades off tree size against training errors.
MAP vs. MDL A fact from information theory: the optimal code (the one with the shortest expected coding length) assigns −log₂ p bits to an event with probability p. This lets us interpret MAP learning using the MDL principle: −log₂ P(h) is the description length of h under optimal coding, and −log₂ P(D|h) is the description length of the exceptions under optimal coding.
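Spelling out the connection (a standard derivation, reconstructed here):

```latex
h_{\mathrm{MAP}}
  \;=\; \arg\max_{h} P(D \mid h)\, P(h)
  \;=\; \arg\min_{h} \big[\, -\log_2 P(D \mid h) \;-\; \log_2 P(h) \,\big]
```

So MAP learning chooses the hypothesis with the minimum total description length, where −log₂ P(h) plays the role of the hypothesis code and −log₂ P(D|h) the code for the exceptions.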
Problems with Maximum Approaches Consider three possible hypotheses whose posteriors P(h|D) are close to one another, with h1 slightly the most probable. Maximum approaches will pick h1. Given a new instance x on which h1 predicts + while h2 and h3 both predict −, maximum approaches will output +. However, is this the most probable result?
Bayes Optimal Classifier (Bayesian Average) Bayes optimal classification weights each hypothesis's prediction by its posterior P(h|D). In the example above, the most probable class is −.
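A worked instance of the Bayes-optimal rule; the numeric posteriors below are illustrative assumptions, since the slide's original numbers were not preserved:

```latex
v^{*} \;=\; \arg\max_{v \in V} \sum_{h_i \in H} P(v \mid h_i)\, P(h_i \mid D)
```

Suppose P(h1|D) = 0.4 with h1 predicting +, and P(h2|D) = P(h3|D) = 0.3 with both predicting −. Then P(+|D) = 0.4 and P(−|D) = 0.3 + 0.3 = 0.6, so the Bayes optimal (most probable) class is −, even though the single most probable hypothesis h1 predicts +.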
When Do We Need Bayesian Averaging? Bayesian averaging (Bayes optimal classification) is needed when the posterior has multiple modes, or when the optimal mode is flat (many hypotheses are nearly equally probable). It is NOT appropriate when P(h|D) cannot be estimated accurately.
Computational Issues with the Bayes Optimal Classifier Bayes optimal classification needs to sum over all possible models/hypotheses h, which is expensive or impossible when the model/hypothesis space is large (e.g., the space of decision trees). Solution: sampling!
Gibbs Classifier Gibbs algorithm: choose one hypothesis at random according to P(h|D) and use it to classify the new instance. Surprising fact: under suitable assumptions, this simple strategy has expected error at most twice that of the Bayes optimal classifier. It can be improved by sampling multiple hypotheses from P(h|D) and averaging their classification results, using e.g. Markov chain Monte Carlo (MCMC) sampling or importance sampling.
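A minimal sketch of the idea, assuming a small finite hypothesis set with known posterior weights (the hypotheses, posteriors, and function names are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative finite hypothesis "space": each hypothesis maps x -> label,
# with an assumed posterior weight P(h | D).
hypotheses = [lambda x: +1, lambda x: -1, lambda x: -1]
posterior = np.array([0.4, 0.3, 0.3])        # must sum to 1

def gibbs_classify(x):
    """Gibbs classifier: draw ONE hypothesis h ~ P(h|D) and use its prediction."""
    i = rng.choice(len(hypotheses), p=posterior)
    return hypotheses[i](x)

def averaged_classify(x, n_samples=100):
    """Average many posterior samples -> approximates the Bayes optimal vote."""
    votes = [gibbs_classify(x) for _ in range(n_samples)]
    return +1 if np.mean(votes) > 0 else -1

print(gibbs_classify(0.0), averaged_classify(0.0))   # the averaged answer tends to -1
```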
Bagging Classifiers In general, sampling from P(h|D) is difficult: P(h|D) is hard to compute (e.g., how would we compute P(h|D) for a decision tree?), it is impossible to compute for non-probabilistic classifiers such as SVM, and it is extremely small for any individual hypothesis when the hypothesis space is large. Bagging classifiers approximate sampling from P(h|D) through sampling of the training examples.
Bootstrap Sampling Bagging = Bootstrap aggregating. Bootstrap sampling: given a set D containing m training examples, create D_i by drawing m examples at random with replacement from D. Each D_i is expected to leave out about 37% of the examples in D, since (1 − 1/m)^m ≈ e^(−1) ≈ 0.368 for large m.
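A quick empirical check of the "about 37% left out" claim (a sketch with illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000
fractions = []
for _ in range(200):
    sample = rng.integers(0, m, size=m)       # bootstrap sample of m indices, with replacement
    left_out = m - len(np.unique(sample))     # original indices never drawn
    fractions.append(left_out / m)
print(np.mean(fractions))   # ~ 0.368, matching (1 - 1/m)^m -> e^{-1}
```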
Bagging Algorithm Create k bootstrap samples D_1, D_2, …, D_k. Train a distinct classifier h_i on each D_i. Classify a new instance by a vote of the classifiers with equal weights, as in the sketch below.
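A compact sketch of the bagging procedure, using scikit-learn decision trees as the base classifier (the dataset and base learner here are illustrative choices, not prescribed by the slides):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 1) create k bootstrap samples, 2) train one tree per sample,
# 3) classify new instances by an equal-weight majority vote.
k = 50
trees = []
for _ in range(k):
    idx = rng.integers(0, len(X), size=len(X))           # sample with replacement
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def bagged_predict(X_new):
    votes = np.stack([t.predict(X_new) for t in trees])  # shape (k, n_new)
    return (votes.mean(axis=0) >= 0.5).astype(int)       # majority vote (labels 0/1)

print((bagged_predict(X) == y).mean())   # accuracy of the bagged ensemble
```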
Bagging vs. Bayesian Average (diagram): Bayesian averaging samples hypotheses h_1, h_2, …, h_k directly from the posterior P(h|D); bagging instead draws bootstrap samples D_1, D_2, …, D_k from D and trains h_1, h_2, …, h_k on them. Bootstrap sampling is almost equivalent to sampling from the posterior P(h|D).
Empirical Study of Bagging Bagging decision trees: bootstrap 50 different samples from the original training data, learn a decision tree over each bootstrap sample, and predict the class labels of test instances by a majority vote of the 50 decision trees. The bagged decision trees perform better than a single decision tree.
Bias-Variance Tradeoff Why does bagging work better than a single classifier? The bias-variance tradeoff. Real-valued case: the output y for x follows y = f(x) + ε with ε ~ N(0, σ²), and ŷ(x|D) is a predictor learned from training data D. The bias-variance decomposition splits the expected squared error into: model variance (the simpler ŷ(x|D), the smaller the variance), model bias (the simpler ŷ(x|D), the larger the bias), and the irreducible variance σ².
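The decomposition can be written out explicitly (a standard result, in the setting y = f(x) + ε stated above; the slide's own equation was not preserved):

```latex
\mathbb{E}_{D,\varepsilon}\!\big[(y - \hat{y}(x \mid D))^{2}\big]
  \;=\; \underbrace{\big(f(x) - \mathbb{E}_{D}[\hat{y}(x \mid D)]\big)^{2}}_{\text{model bias}^{2}}
  \;+\; \underbrace{\mathbb{E}_{D}\big[(\hat{y}(x \mid D) - \mathbb{E}_{D}[\hat{y}(x \mid D)])^{2}\big]}_{\text{model variance}}
  \;+\; \underbrace{\sigma^{2}}_{\text{irreducible variance}}
```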
Bias-Variance Tradeoff (figure): fitting with complicated models gives small model bias but large model variance relative to the true model.
Bias-Variance Tradeoff (figure): fitting with simple models gives large model bias but small model variance relative to the true model.
Bagging Bagging performs better than a single classifier because it effectively reduces the model variance (figure: variance and bias of a single decision tree vs. a bagged decision tree).