Ch 14. Combining Models. Pattern Recognition and Machine Learning, C. M. Bishop.


Ch 14. Combining Models
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.
Summarized by J.-H. Eom
Biointelligence Laboratory, Seoul National University
http://bi.snu.ac.kr/

Contents
14.1 Bayesian Model Averaging
14.2 Committees
14.3 Boosting
  14.3.1 Minimizing exponential error
  14.3.2 Error functions for boosting
14.4 Tree-based Models
14.5 Conditional Mixture Models
  14.5.1 Mixtures of linear regression models
  14.5.2 Mixtures of logistic models
  14.5.3 Mixtures of experts

14. Combining Models
Committees: combining multiple models in some way can improve performance over using any single model.
Boosting: a variant of the committee method.
- Trains multiple models in sequence.
- The error function used to train a particular model depends on the performance of the previous models.
- Can achieve substantial improvements in performance over a single model.
Two views of model combination:
a. Averaging the predictions of a set of models.
b. Selecting one of the models to make the prediction, so that different models become responsible for different regions of input space (e.g., decision trees).
- A decision tree uses hard splits, so only one model is responsible for making predictions for any given value of the input variables.
- Softening this view of a decision tree as model combination leads to mixture distributions and the mixture-of-experts model, with an input-dependent mixing coefficient p(k|x).

14.1 Bayesian Model Averaging
BMA vs. model combination, illustrated with density estimation using a mixture of Gaussians.
Model combination example (mixture of Gaussians):
- Model in terms of a joint distribution over the observed variable x and a latent variable z: p(x, z).
- Corresponding density over the observed variable: p(x) = Σ_z p(x, z).
- For a Gaussian mixture: p(x) = Σ_k π_k N(x | μ_k, Σ_k).
- For i.i.d. data: p(X) = Π_n p(x_n) = Π_n [ Σ_{z_n} p(x_n, z_n) ], so each observed data point x_n has its own latent variable z_n.
Bayesian model averaging:
- With several different models indexed by h = 1, ..., H and prior p(h), the marginal distribution over the data set is p(X) = Σ_h p(X | h) p(h).
- Here just one model is responsible for generating the whole data set.
- The distribution over h reflects our uncertainty about which model that is; this uncertainty reduces as the data set size increases, and the posterior p(h | X) becomes increasingly focused on just one of the models.
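A minimal numpy sketch (mine, not from the slides) of the two quantities above, the per-point mixture density p(x_n) and the i.i.d. likelihood p(X), for a 1-D Gaussian mixture; the parameter values and function names are illustrative.

import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def mixture_density(x, pis, mus, sigmas):
    # Marginalize the latent component index k for each data point.
    return sum(pi * gaussian_pdf(x, mu, s) for pi, mu, s in zip(pis, mus, sigmas))

pis, mus, sigmas = [0.3, 0.7], [-2.0, 1.0], [0.5, 1.0]   # illustrative parameters
X = np.array([-1.8, 0.2, 1.1])
print(mixture_density(X, pis, mus, sigmas))              # per-point density p(x_n)
print(np.prod(mixture_density(X, pis, mus, sigmas)))     # i.i.d. likelihood p(X)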

14.1 Bayesian Model Averaging (cont'd)
BMA: the whole data set is generated by a single model.
Model combination: different parts of the data are generated by different components, so different data points in the data set can potentially be generated from different values of the latent variable z (see Section 14.5).

14.2 Committees
Ways to construct a committee:
- Simplest: average the predictions of a set of individual models.
- Motivated, from the frequentist perspective, by the trade-off between bias and variance.
- In practice we have only a single data set, so we need a way to introduce variability between the different models within the committee: bootstrap data sets.
Regression setting: predict the value of a single continuous variable.
- Generate M bootstrap data sets and use each to train a separate copy y_m(x) of a predictive model, m = 1, ..., M.
- This procedure is known as bootstrap aggregation, or bagging.
Committee prediction: y_COM(x) = (1/M) Σ_{m=1}^M y_m(x).
If the true regression function is h(x), the output of each model can be written as the true value plus an error: y_m(x) = h(x) + ε_m(x).
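A minimal bagging sketch for regression (my own illustration, not the slides' code): M bootstrap data sets, one base model y_m(x) trained on each, and the committee prediction as their average. The choice of a degree-5 polynomial base model and the toy data are assumptions for the example.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(50)     # noisy targets

M, models = 10, []
for m in range(M):
    idx = rng.integers(0, len(x), len(x))                     # bootstrap sample with replacement
    models.append(np.polyfit(x[idx], t[idx], deg=5))          # base model y_m(x)

x_test = np.linspace(0, 1, 200)
y_com = np.mean([np.polyval(w, x_test) for w in models], axis=0)   # committee prediction
print(y_com[:5])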

14.2 Committees (cont'd)
The sum-of-squares error of model m: E_x[ {y_m(x) − h(x)}^2 ] = E_x[ ε_m(x)^2 ].
The average error of the individual models acting alone: E_AV = (1/M) Σ_m E_x[ ε_m(x)^2 ].
The expected error of the committee: E_COM = E_x[ { (1/M) Σ_m y_m(x) − h(x) }^2 ] = E_x[ { (1/M) Σ_m ε_m(x) }^2 ].
If we assume the errors have zero mean and are uncorrelated, E_x[ε_m(x)] = 0 and E_x[ε_m(x) ε_l(x)] = 0 for m ≠ l, then E_COM = E_AV / M.
- The average error of a model is apparently reduced by a factor of M simply by averaging M versions of the model; the key assumption is that the errors of the individual models are uncorrelated.
- In practice the errors are typically highly correlated, and the reduction in error is generally small.
- It can be shown, however, that the expected committee error will not exceed the expected error of the constituent models: E_COM ≤ E_AV.
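A small numerical check (my own, with made-up error distributions) of E_COM = E_AV / M for zero-mean, uncorrelated errors, and of how a shared (correlated) error component limits the benefit of averaging.

import numpy as np

rng = np.random.default_rng(1)
M, n_points = 10, 100000
eps = rng.standard_normal((M, n_points))          # uncorrelated errors epsilon_m(x)

E_AV = np.mean(eps ** 2)                          # average individual error
E_COM = np.mean(eps.mean(axis=0) ** 2)            # error of the averaged committee
print(E_AV, E_COM, E_AV / M)                      # E_COM is close to E_AV / M

# Correlated errors: a shared component is not reduced by averaging.
shared = rng.standard_normal(n_points)
eps_corr = 0.8 * shared + 0.2 * rng.standard_normal((M, n_points))
print(np.mean(eps_corr ** 2), np.mean(eps_corr.mean(axis=0) ** 2))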

14.3 Boosting
[Figure 14.1: schematic illustration of the boosting framework.]
Boosting: combine multiple 'base' classifiers to produce a powerful committee.
AdaBoost ('adaptive boosting'; Freund & Schapire, 1996):
- The most widely used form of boosting.
- Gives good results even when the base classifiers are 'weak' learners.
Characteristics:
- Base classifiers are trained in sequence.
- Each base classifier is trained using a weighted form of the data set.
- The weighting coefficient associated with each data point depends on the performance of the previous classifiers: data points misclassified by the previous classifiers receive greater weight.
- The final prediction is made using a weighted majority voting scheme.

14.3 Boosting – AdaBoost algorithm
1. Initialize the data weighting coefficients {w_n} by setting w_n^{(1)} = 1/N for n = 1, ..., N.
2. For m = 1, ..., M:
  (a) Fit a classifier y_m(x) to the training data by minimizing the weighted error function
      J_m = Σ_n w_n^{(m)} I(y_m(x_n) ≠ t_n),   (Eq. 14.15)
      where I(·) is the indicator function.
  (b) Evaluate the weighted measure of the error rate of the base classifier on the data set,
      ε_m = Σ_n w_n^{(m)} I(y_m(x_n) ≠ t_n) / Σ_n w_n^{(m)},   (Eq. 14.16)
      and the weighting coefficient
      α_m = ln{ (1 − ε_m) / ε_m }.   (Eq. 14.17)
  (c) Update the data weighting coefficients
      w_n^{(m+1)} = w_n^{(m)} exp{ α_m I(y_m(x_n) ≠ t_n) }.   (Eq. 14.18)
3. Make predictions using the final model,
      Y_M(x) = sign( Σ_{m=1}^M α_m y_m(x) ).   (Eq. 14.19)
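A compact sketch of the algorithm above using decision-stump base classifiers. The function names, the exhaustive stump search, the epsilon clipping, and the toy data set are my own choices for illustration; labels are assumed to be in {-1, +1}.

import numpy as np

def fit_stump(X, t, w):
    # Exhaustively choose feature, threshold and sign minimizing the weighted error (14.15).
    best = (np.inf, None)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = np.sum(w * (pred != t))
                if err < best[0]:
                    best = (err, (j, thr, sign))
    return best[1]

def stump_predict(stump, X):
    j, thr, sign = stump
    return sign * np.where(X[:, j] > thr, 1, -1)

def adaboost(X, t, M=20):
    N = len(t)
    w = np.full(N, 1.0 / N)                       # step 1: w_n^{(1)} = 1/N
    stumps, alphas = [], []
    for m in range(M):
        stump = fit_stump(X, t, w)                # step 2(a)
        miss = stump_predict(stump, X) != t
        eps = np.sum(w * miss) / np.sum(w)        # (14.16)
        eps = np.clip(eps, 1e-10, 1 - 1e-10)      # numerical safeguard (my addition)
        alpha = np.log((1 - eps) / eps)           # (14.17)
        w = w * np.exp(alpha * miss)              # (14.18)
        stumps.append(stump); alphas.append(alpha)
    return stumps, alphas

def predict(stumps, alphas, X):                   # step 3: sign of the weighted vote (14.19)
    return np.sign(sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas)))

# Toy usage with 30 points and labels t in {-1, +1}.
rng = np.random.default_rng(2)
X = rng.standard_normal((30, 2))
t = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
stumps, alphas = adaboost(X, t, M=10)
print(np.mean(predict(stumps, alphas, X) == t))   # training accuracy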

14.3 Boosting – AdaBoost (cont'd)
[Figure 14.2: 30 data points from a binary classification problem, showing the decision boundary of the most recent base learner, the base learners trained so far, and the combined decision boundary.]
Each base learner consists of a threshold on one of the input variables: a 'decision stump', i.e. a decision tree with a single node.

14.3.1 Minimizing exponential error
The actual performance of boosting is much better than the original bounds suggest.
Friedman et al. (2000) gave a different and simple interpretation of boosting as the sequential minimization of an exponential error function
    E = Σ_n exp{ −t_n f_m(x_n) },   (Eq. 14.20)
where f_m(x) = (1/2) Σ_{l=1}^m α_l y_l(x) (Eq. 14.21) is a linear combination of the base classifiers y_l(x) and t_n ∈ {−1, 1}.
Instead of a global minimization of this error function, we fix the first m − 1 classifiers and their coefficients and minimize only with respect to the m-th classifier y_m(x) and its coefficient α_m, giving
    E = Σ_n w_n^{(m)} exp{ −(1/2) t_n α_m y_m(x_n) },   w_n^{(m)} = exp{ −t_n f_{m−1}(x_n) }.   (Eq. 14.22)

14.3.1 Minimizing exponential error (cont'd)
Rewriting the error function by separating correctly and incorrectly classified points:
    E = (e^{α_m/2} − e^{−α_m/2}) Σ_n w_n^{(m)} I(y_m(x_n) ≠ t_n) + e^{−α_m/2} Σ_n w_n^{(m)}.   (Eq. 14.23)
- Minimizing (14.23) with respect to y_m(x) is equivalent to minimizing (14.15).
- Minimizing with respect to α_m gives (14.17), with ε_m defined by (14.16).
Weight update of the data points: from (14.22),
    w_n^{(m+1)} = w_n^{(m)} exp{ −(1/2) t_n α_m y_m(x_n) },
and using t_n y_m(x_n) = 1 − 2 I(y_m(x_n) ≠ t_n) this becomes
    w_n^{(m+1)} = w_n^{(m)} exp(−α_m/2) exp{ α_m I(y_m(x_n) ≠ t_n) }.
Since exp(−α_m/2) is independent of n it can be dropped, which is equivalent to (14.18).
Classification of new data: evaluate the sign of the combined function (14.21); omitting the 1/2 factor, which does not affect the sign, this is (14.19).
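The step from (14.23) to (14.17) is not spelled out on the slide; a short derivation (my own rendering, in LaTeX) is:

\[
E = \left(e^{\alpha_m/2} - e^{-\alpha_m/2}\right) A + e^{-\alpha_m/2} B,
\qquad A = \sum_n w_n^{(m)} I\big(y_m(x_n) \neq t_n\big),\quad
B = \sum_n w_n^{(m)} .
\]
\[
\frac{\partial E}{\partial \alpha_m}
= \tfrac{1}{2}\left(e^{\alpha_m/2} + e^{-\alpha_m/2}\right) A - \tfrac{1}{2} e^{-\alpha_m/2} B = 0
\;\Rightarrow\;
e^{\alpha_m} = \frac{B - A}{A} = \frac{1 - \epsilon_m}{\epsilon_m},
\qquad \epsilon_m = \frac{A}{B},
\]
so \(\alpha_m = \ln\{(1 - \epsilon_m)/\epsilon_m\}\), which is Eq. (14.17).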

14.3.2 Error functions for boosting
The exponential error function:
- Its expected error is E_{x,t}[ exp{ −t y(x) } ].
- A variational minimization over all possible functions y(x) gives y(x) = (1/2) ln{ p(t = 1 | x) / p(t = −1 | x) }, i.e. half the log-odds.
Pros: its sequential minimization leads to the simple AdaBoost scheme.
Cons:
- It penalizes large negative values of t y(x) much more strongly than the cross-entropy error.
- It is therefore less robust to outliers or misclassified data points.
- It cannot be interpreted as the log-likelihood function of any well-defined probabilistic model.
[Figure 14.3: misclassification error and the error functions discussed above, plotted against t y(x). Figure 14.4: absolute error vs. squared error.]
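A small sketch (mine, not from the slides) that tabulates the exponential error, a rescaled cross-entropy error, and the misclassification error as functions of the margin z = t y(x), illustrating why the exponential error is less robust to badly misclassified points.

import numpy as np

z = np.linspace(-2, 2, 9)
exp_err = np.exp(-z)                        # exponential error used by AdaBoost
ce_err = np.log(1 + np.exp(-2 * z))         # rescaled cross-entropy (logistic) error
mis_err = (z < 0).astype(float)             # misclassification (0/1) error
for row in zip(z, exp_err, ce_err, mis_err):
    print("z=%5.2f  exp=%7.3f  xent=%7.3f  misclass=%.0f" % row)
# For large negative margins the exponential error grows much faster than the
# cross-entropy error, so a single outlier can dominate the fit.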

14.4 Tree-based Models
Tree-based models partition the input space into cuboid regions, with edges aligned to the axes, and assign a simple model to each region.
As a model combination method: only one model is responsible for making predictions at any given point in the input space.
CART (and related methods such as ID3 and C4.5):
- Finding the optimal tree structure that minimizes the sum-of-squares error is infeasible, so greedy optimization is used.
- Trees are readily interpretable, which makes them popular in fields such as medical diagnosis.
- Training proceeds by growing the tree, with a stopping criterion, followed by pruning.
- For classification, alternative impurity measures are used: cross-entropy and the Gini index.
- Drawbacks: the tree structure is sensitive to the data, and the greedy, axis-aligned splits are suboptimal.
[Figure 14.5: partitioning of the input space into axis-aligned regions. Figure 14.6: the corresponding binary tree.]
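A minimal sketch (mine, not from the slides) of the two node-impurity measures just mentioned, evaluated on the vector of class proportions p_k at a leaf.

import numpy as np

def cross_entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # 0 * log 0 is taken to be 0
    return -np.sum(p * np.log(p))

def gini(p):
    p = np.asarray(p, dtype=float)
    return np.sum(p * (1.0 - p))

print(gini([0.5, 0.5]), cross_entropy([0.5, 0.5]))   # impure node: both measures are large
print(gini([0.9, 0.1]), cross_entropy([0.9, 0.1]))   # nearly pure node: both are small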

14.5 Conditional Mixture Models
Relaxing the constraint of axis-aligned, hard splits reduces interpretability but allows soft, probabilistic splits that can be functions of all of the input variables.
The hierarchical mixture of experts is a fully probabilistic tree-based model of this kind.
An alternative way to motivate hierarchical mixtures of experts: start with a standard probabilistic mixture of unconditional density models and replace the component densities with conditional distributions ('experts'). The mixing coefficients may be
- independent of the input variables (the simplest case),
- dependent on the input variables, giving the mixture-of-experts model,
- and, if each component in the mixture is itself a mixture of experts, the hierarchical mixture of experts.

14.5.1 Mixtures of linear regression models
Consider K linear regression models, each governed by its own weight parameter w_k, with a common noise variance governed by a precision parameter β, and a single target variable t.
Mixture distribution: p(t | θ) = Σ_k π_k N(t | w_k^T φ, β^{−1}).   (Eq. 14.34)
Log likelihood: ln p(t | θ) = Σ_n ln[ Σ_k π_k N(t_n | w_k^T φ_n, β^{−1}) ].   (Eq. 14.35)
Use EM to maximize the likelihood: introduce a binary latent variable z_nk indicating which component generated each data point, and work with the joint distribution over latent and observed variables, i.e. the complete-data log likelihood.
E-step: evaluate the responsibilities γ_nk = E[z_nk] = π_k N(t_n | w_k^T φ_n, β^{−1}) / Σ_j π_j N(t_n | w_j^T φ_n, β^{−1}), and form the expectation Q of the complete-data log likelihood.
[Figure 14.7: mixture of linear regression models.]

14.5.1 Mixtures of linear regression models (cont'd)
M-step: keeping the responsibilities fixed, maximize the function Q with respect to the parameters:
- mixing coefficients: π_k = (1/N) Σ_n γ_nk
- weights: w_k = (Φ^T R_k Φ)^{−1} Φ^T R_k t, a weighted least-squares solution with R_k = diag(γ_nk)
- noise precision: 1/β = (1/N) Σ_n Σ_k γ_nk (t_n − w_k^T φ_n)^2
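A compact EM sketch (my own implementation of the scheme above, not the book's code) for a mixture of K linear regression models with a shared noise precision; the toy two-line data set and all names are illustrative.

import numpy as np

def em_mixlin(Phi, t, K=2, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    N, D = Phi.shape
    W = 0.1 * rng.standard_normal((K, D))      # regression weights w_k
    pi = np.full(K, 1.0 / K)                   # mixing coefficients
    beta = 1.0                                 # shared noise precision
    for _ in range(n_iter):
        # E-step: responsibilities gamma_nk proportional to pi_k N(t_n | w_k^T phi_n, 1/beta)
        means = Phi @ W.T                                        # (N, K)
        log_p = (-0.5 * beta * (t[:, None] - means) ** 2
                 + 0.5 * np.log(beta) + np.log(pi)[None, :])
        log_p -= log_p.max(axis=1, keepdims=True)
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: weighted least squares for each w_k, then pi and beta
        for k in range(K):
            R = gamma[:, k]
            A = Phi.T @ (R[:, None] * Phi)
            W[k] = np.linalg.solve(A, Phi.T @ (R * t))
        pi = gamma.mean(axis=0)
        resid = (t[:, None] - Phi @ W.T) ** 2
        beta = N / np.sum(gamma * resid)
    return W, pi, beta, gamma

# Toy usage: data generated from two different lines.
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 200)
comp = rng.integers(0, 2, 200)
t = np.where(comp == 0, 1.0 + 2.0 * x, -1.0 - 2.0 * x) + 0.1 * rng.standard_normal(200)
Phi = np.column_stack([np.ones_like(x), x])               # basis functions: [1, x]
W, pi, beta, gamma = em_mixlin(Phi, t, K=2)
print(W, pi, beta)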

[Figure 14.8 and Figure 14.9: illustrative results for the mixture of linear regression models.]

14.5.2 Mixtures of logistic models
A mixture of K logistic regression models for a binary target can be treated in the same way: the model is fitted by EM, where the M-step has no closed-form solution for the component weights and requires iterative optimization (e.g., iterative reweighted least squares) for each component.

[Figure 14.10: illustrative results for the mixture of logistic regression models.]

14.5.3 Mixtures of experts
The mixture-of-experts model further increases the capability of the framework by allowing the mixing coefficients themselves to be functions of the input variable, so that
    p(t | x) = Σ_{k=1}^K π_k(x) p_k(t | x).
- Different components can model the distribution in different regions of input space (they are 'experts' at making predictions in their own regions).
- The mixing coefficients π_k(x) act as a gating function that determines which components are dominant in which regions of input space.
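A minimal prediction sketch (mine, not from the slides) of a mixture of experts with linear experts and a softmax (linear) gating function pi_k(x); the parameter values, shapes, and function names are illustrative assumptions.

import numpy as np

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def moe_predict(x, V, W):
    # Mean prediction of K linear experts, mixed by input-dependent coefficients pi_k(x).
    # x: (N, D) inputs; V: (K, D) gating weights; W: (K, D) expert weights.
    gates = softmax(x @ V.T)          # gating function pi_k(x)
    experts = x @ W.T                 # expert means y_k(x)
    return np.sum(gates * experts, axis=1)

# Toy usage with two experts and a bias-augmented 1-D input.
rng = np.random.default_rng(4)
x = np.column_stack([np.ones(5), rng.uniform(-1, 1, 5)])
V = rng.standard_normal((2, 2))       # gating parameters (hypothetical values)
W = np.array([[1.0, 2.0], [-1.0, -2.0]])
print(moe_predict(x, V, W))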

14.5.3 Mixtures of experts (cont'd)
Limitation: the use of linear models for the gating and expert functions. For a more flexible model, use a multilevel gating function, giving the hierarchical mixture of experts (HME).
- The HME can be viewed as a probabilistic version of the decision trees of Section 14.4.
Advantage of the mixture of experts: it can be optimized by EM, in which the M-step for each mixture component and for the gating model involves a convex optimization, although the overall optimization is non-convex.
The mixture density network (Section 5.6) offers a more flexible alternative.
