
1 Ch 14. Combining Models (Pattern Recognition and Machine Learning, C. M. Bishop)
Ch 14. Combining Models. Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Summarized by J.-H. Eom, Biointelligence Laboratory, Seoul National University.

2 Contents
Contents
14.1 Bayesian Model Averaging
14.2 Committees
14.3 Boosting
14.3.1 Minimizing exponential error
14.3.2 Error functions for boosting
14.4 Tree-based Models
14.5 Conditional Mixture Models
14.5.1 Mixtures of linear regression models
14.5.2 Mixtures of logistic models
14.5.3 Mixtures of experts

3 14. Combining Models
14. Combining Models
Committees: combining multiple models in some way can often improve performance over using any single model.
Boosting is a variant of the committee method:
- Multiple models are trained in sequence.
- The error function used to train a particular model depends on the performance of the previous models.
- It can achieve substantial improvements in performance over a single model.
Forms of model combination:
a. Averaging the predictions of a set of models.
b. Selecting one of the models to make the prediction, so that different models become responsible for different regions of input space. Decision trees are an example: splits are hard, and only one model is responsible for making predictions for any given value of the input variables.
Viewing decision trees as model combination leads to mixture distributions and to the mixture-of-experts model, with input-dependent mixing coefficients p(k|x).

4 14.1 Bayesian Model Averaging
※ BMA vs. model combination. Example: density estimation using a mixture of Gaussians.
A mixture model is defined in terms of a joint distribution over the observed variable x and a latent component label z; the corresponding density over the observed variable is obtained by summing out z, and for i.i.d. data each observed data point x_n has its own latent variable z_n.
※※ Bayesian model averaging: given several different models h = 1, …, H with prior p(h), the marginal distribution over the data set is a sum over the models weighted by their priors.
In BMA, one model is responsible for generating the whole data set, and the probability distribution over h merely reflects our uncertainty about which model that is. As the size of the data set increases this uncertainty is reduced, and the posterior p(h|X) becomes increasingly focused on just one of the models.
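The contrast in formulas, reconstructed from Bishop Section 14.1 (the equation images did not survive extraction, so these are restated rather than copied from the slide):

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad p(\mathbf{X}) = \prod_{n=1}^{N} p(\mathbf{x}_n) = \prod_{n=1}^{N} \Big[ \sum_{\mathbf{z}_n} p(\mathbf{x}_n, \mathbf{z}_n) \Big]$$

$$p(\mathbf{X}) = \sum_{h=1}^{H} p(\mathbf{X} \mid h)\, p(h)$$

In the mixture model each x_n carries its own latent variable z_n, whereas in the BMA sum a single h is shared by the entire data set X.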

5 14.1 Bayesian Model Averaging (cont’d)
BMA vs. model combination: in BMA the entire data set is generated by a single model, whereas in model combination different data points are generated by different components, i.e. different data points in the data set can potentially be generated from different values of the latent variable z (see Section 14.5).

6 14.2 Committees
14.2 Committees
The simplest way to construct a committee is to average the predictions of a set of individual models; from a frequentist perspective, this trades off bias against variance.
In practice we have only a single data set, so we need some way to introduce variability between the different models in the committee: bootstrap data sets.
Regression setting (predicting a single continuous variable): generate M bootstrap data sets and use each one to train a separate copy y_m(x) of a predictive model, m = 1, …, M. This procedure is known as bootstrap aggregation, or 'bagging'.
The committee prediction is the average of the individual predictions; writing the true regression function as h(x), the output of each model can be expressed as the true value plus an error term (see the sketch below).
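A minimal runnable sketch of bagging (my own illustrative code, not from the slides; the polynomial base model and all names are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_poly(x, t, degree=3):
    """Least-squares polynomial fit; returns the coefficients."""
    return np.polyfit(x, t, degree)

def bagged_predict(x_train, t_train, x_test, M=20, degree=3):
    """Average the predictions of M models, each trained on a bootstrap sample."""
    N = len(x_train)
    preds = []
    for _ in range(M):
        idx = rng.integers(0, N, size=N)        # sample N points with replacement
        coeffs = fit_poly(x_train[idx], t_train[idx], degree)
        preds.append(np.polyval(coeffs, x_test))
    return np.mean(preds, axis=0)               # committee prediction y_COM(x)

# Toy usage: a noisy sinusoid
x = rng.uniform(0.0, 1.0, size=30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)
x_new = np.linspace(0.0, 1.0, 5)
print(bagged_predict(x, t, x_new))
```

Averaging the M bootstrap-trained copies is exactly the committee prediction analyzed on the next slide.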

7 14.2 Committees (cont’d)
14.2 Committees (cont’d)
Compare the sum-of-squares error of the individual models, the average error of the models acting individually, and the expected error of the committee.
If the errors of the individual models have zero mean and are uncorrelated, the expected committee error is reduced by a factor of M relative to the average error of the individual models, simply by averaging M versions of the model (the key assumption being uncorrelated errors).
In practice the errors are typically highly correlated and the overall error reduction is small, but the expected committee error will not exceed the expected error of the constituent models.
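The error decomposition behind this claim, reconstructed in LaTeX from Bishop Section 14.2 (the original equations were lost in extraction):

$$y_m(\mathbf{x}) = h(\mathbf{x}) + \epsilon_m(\mathbf{x}), \qquad E_{\mathrm{AV}} = \frac{1}{M}\sum_{m=1}^{M}\mathbb{E}_{\mathbf{x}}\big[\epsilon_m(\mathbf{x})^2\big], \qquad E_{\mathrm{COM}} = \mathbb{E}_{\mathbf{x}}\bigg[\Big\{\frac{1}{M}\sum_{m=1}^{M}\epsilon_m(\mathbf{x})\Big\}^2\bigg]$$

If $\mathbb{E}_{\mathbf{x}}[\epsilon_m(\mathbf{x})] = 0$ and $\mathbb{E}_{\mathbf{x}}[\epsilon_m(\mathbf{x})\,\epsilon_l(\mathbf{x})] = 0$ for $m \neq l$, then

$$E_{\mathrm{COM}} = \frac{1}{M}\,E_{\mathrm{AV}}, \qquad \text{and in general} \qquad E_{\mathrm{COM}} \le E_{\mathrm{AV}}.$$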

8 14.3 Boosting
14.3 Boosting (Figure 14.1: schematic illustration of the boosting framework)
Boosting combines multiple 'base' classifiers to produce a powerful committee.
AdaBoost ('adaptive boosting', Freund and Schapire, 1996) is the most widely used form of boosting and gives good results even with 'weak' base learners.
Characteristics:
- Base classifiers are trained in sequence.
- Each base classifier is trained on a weighted form of the data; the weighting coefficient associated with each data point depends on the performance of the previous classifiers, with greater weight given to data points misclassified by the previous classifiers.
- The final prediction uses a weighted majority voting scheme.

9 14.3 Boosting – AdaBoost algorithm
1. Initialize the data weighting coefficients {w_n} by setting w_n^(1) = 1/N for n = 1, …, N.
2. For m = 1, …, M:
(a) Fit a classifier y_m(x) to the training data by minimizing the weighted error function J_m = Σ_n w_n^(m) I(y_m(x_n) ≠ t_n), where I(·) is the indicator function and w_n^(m) is the weighting coefficient of point n at round m.
(b) Evaluate ε_m = Σ_n w_n^(m) I(y_m(x_n) ≠ t_n) / Σ_n w_n^(m), the weighted error rate of the base classifier on the data set, and α_m = ln{(1 − ε_m)/ε_m}.
(c) Update the data weighting coefficients: w_n^(m+1) = w_n^(m) exp{α_m I(y_m(x_n) ≠ t_n)}.
3. Make predictions using the final model, Y_M(x) = sign( Σ_{m=1}^{M} α_m y_m(x) ).
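A compact runnable sketch of the algorithm above (my own illustrative code, not taken from the slides or the book), using one-dimensional decision stumps as the weak learners; all function and variable names are my choices:

```python
import numpy as np

def fit_stump(X, t, w):
    """Pick the (feature, threshold, sign) stump with minimal weighted error."""
    best = None
    for d in range(X.shape[1]):
        for thr in np.unique(X[:, d]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, d] > thr, 1, -1)
                err = np.sum(w * (pred != t))
                if best is None or err < best[0]:
                    best = (err, d, thr, sign)
    return best[1:]

def stump_predict(stump, X):
    d, thr, sign = stump
    return sign * np.where(X[:, d] > thr, 1, -1)

def adaboost(X, t, M=10):
    """AdaBoost with decision stumps; labels t must be in {-1, +1}."""
    N = X.shape[0]
    w = np.full(N, 1.0 / N)                   # step 1: uniform initial weights
    stumps, alphas = [], []
    for _ in range(M):
        stump = fit_stump(X, t, w)            # step 2a: fit weighted weak learner
        miss = stump_predict(stump, X) != t
        eps = np.sum(w * miss) / np.sum(w)    # step 2b: weighted error rate
        eps = np.clip(eps, 1e-10, 1 - 1e-10)  # guard against a perfect/useless stump
        alpha = np.log((1 - eps) / eps)
        w = w * np.exp(alpha * miss)          # step 2c: up-weight misclassified points
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def predict(stumps, alphas, X):
    """Step 3: sign of the weighted combination of base classifiers."""
    agg = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.sign(agg)

# Tiny usage example on toy two-dimensional data
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
t = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
stumps, alphas = adaboost(X, t, M=5)
print("training accuracy:", np.mean(predict(stumps, alphas, X) == t))
```

The per-round quantities eps and alpha in the loop correspond directly to ε_m and α_m in steps 2(b) and 2(c) of the slide.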

10 14.3 Boosting – AdaBoost (cont’d)
Figure 14.2: AdaBoost on a toy binary classification data set of 30 points, showing the decision boundary of the most recent base learner, the base learners trained so far, and the combined decision boundary.
Each base learner consists of a threshold on one of the input variables: a 'decision stump', i.e. a decision tree with a single node.

11 14.3.1 Minimizing exponential error
14.3.1 Minimizing exponential error
Boosting's actual performance is much better than the original bounds suggest.
Friedman et al. (2000) gave a different and simple interpretation of boosting as the sequential minimization of an exponential error function.
Instead of minimizing this error function globally, AdaBoost uses an alternative approach: fix the first m − 1 classifiers and their coefficients, and minimize only with respect to the m-th classifier and its coefficient.
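The exponential error being minimized, restated from Bishop Eqs. (14.20)–(14.21) since the equation images were lost in extraction:

$$E = \sum_{n=1}^{N} \exp\{-t_n f_m(\mathbf{x}_n)\}, \qquad f_m(\mathbf{x}) = \frac{1}{2}\sum_{l=1}^{m} \alpha_l\, y_l(\mathbf{x})$$

where $t_n \in \{-1, +1\}$ are the target values and $f_m(\mathbf{x})$ is the classifier built from the first $m$ base learners.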

12 14.3.1 Minimizing exponential error (cont’d)
14.3.1 Minimizing exponential error (cont’d)
Rewriting the error function and minimizing it with respect to y_m(x) and α_m recovers Eqs. (14.16) and (14.17): minimizing with respect to y_m(x) is equivalent to minimizing the weighted error function (14.15) (via Eq. (14.22)), and the resulting weight update of the data points is equivalent to the AdaBoost update (14.18).
Classification of new data: evaluate the sign of the combined function (14.21); omitting the factor of 1/2 does not affect the sign, which recovers (14.19).
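For reference, the key intermediate steps, reconstructed following Bishop Section 14.3.1 (equation numbers as in the book; the slide's equation images were lost):

$$E = \sum_{n=1}^{N} w_n^{(m)} \exp\Big\{-\tfrac{1}{2}\, t_n\, \alpha_m\, y_m(\mathbf{x}_n)\Big\}, \qquad w_n^{(m)} \equiv \exp\{-t_n f_{m-1}(\mathbf{x}_n)\}$$

$$E = \big(e^{\alpha_m/2} - e^{-\alpha_m/2}\big)\sum_{n=1}^{N} w_n^{(m)}\, I\big(y_m(\mathbf{x}_n) \neq t_n\big) \;+\; e^{-\alpha_m/2}\sum_{n=1}^{N} w_n^{(m)}$$

Minimizing over $y_m$ gives the weighted error (14.15); setting $\partial E/\partial\alpha_m = 0$ gives (14.17); and the weights update as $w_n^{(m+1)} = w_n^{(m)}\exp\{\alpha_m I(y_m(\mathbf{x}_n)\neq t_n)\}$ up to a factor independent of $n$.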

13 14.3.2 Error functions for boosting
Exponential error function: consider the expected error under the joint distribution of x and t. A variational minimization over all possible functions y(x) shows that the minimizer is one half of the log-odds.
Comparison with the cross-entropy error:
+ Sequential minimization of the exponential error leads to the simple AdaBoost scheme.
− It penalizes large negative values of t y(x) much more strongly than cross-entropy, so it is less robust to outliers and misclassified data points, and it cannot be interpreted as the log-likelihood function of any well-defined probabilistic model.
Figure 14.3: comparison of error functions (exponential, cross-entropy, misclassification). Figure 14.4: absolute error vs. squared error.
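The expected exponential error and its variational minimizer, written out (a restatement of the result described above; the displayed equations were lost from the slide):

$$\mathbb{E}_{\mathbf{x},t}\big[\exp\{-t\,y(\mathbf{x})\}\big] = \sum_{t\in\{-1,+1\}}\int \exp\{-t\,y(\mathbf{x})\}\, p(t\mid\mathbf{x})\, p(\mathbf{x})\,\mathrm{d}\mathbf{x}, \qquad y^{\star}(\mathbf{x}) = \frac{1}{2}\ln\frac{p(t=+1\mid\mathbf{x})}{p(t=-1\mid\mathbf{x})}$$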

14 14.4 Tree-based Models
14.4 Tree-based Models
Tree-based models partition the input space into cuboid regions with axis-aligned edges and assign a simple model to each region.
They can be viewed as a model combination method in which only one model is responsible for making predictions at any given point in the input space.
CART (and related methods such as ID3 and C4.5):
- Finding the optimal tree structure that minimizes the sum-of-squares error is infeasible, so a greedy optimization is used: tree growing followed by pruning, or growing with a stopping criterion.
- Alternative measures for classification: cross-entropy and the Gini index (see below).
- Trees are readily interpretable and popular in fields such as medical diagnosis, but they are sensitive to the data and the greedy construction can be suboptimal.
Figure 14.5, Figure 14.6.
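For reference, the pruning criterion and the two classification purity measures mentioned above, written in their standard form (a reconstruction of the lost equations; sign conventions follow common usage):

$$C(T) = \sum_{\tau=1}^{|T|} Q_\tau(T) + \lambda\,|T|$$

$$Q_\tau(T) = -\sum_{k=1}^{K} p_{\tau k}\ln p_{\tau k}\ \ \text{(cross-entropy)}, \qquad Q_\tau(T) = \sum_{k=1}^{K} p_{\tau k}\,(1 - p_{\tau k})\ \ \text{(Gini index)}$$

where $p_{\tau k}$ is the proportion of data points in region $R_\tau$ assigned to class $k$, $|T|$ is the number of leaf nodes, and $\lambda$ trades residual error against tree complexity.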

15 14.5 Conditional Mixture Models
Relaxing the constraint of axis-aligned splits reduces interpretability but allows soft, probabilistic splits that can be functions of all of the input variables; the hierarchical mixture of experts is a fully probabilistic tree-based model of this kind.
An alternative way to motivate the hierarchical mixture of experts is to start with a standard probabilistic mixture of unconditional density models and replace the component densities with conditional distributions (the experts).
Mixing coefficients (see the formulas below):
- Independent of the input variables: the simplest case.
- Dependent on the input variables: the mixture-of-experts model.
- Hierarchical mixture of experts: each component in the mixture model is itself a mixture of experts.
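The distinction in formulas (a sketch of the two general forms, not copied from the slide): fixed mixing coefficients on the left, input-dependent mixing coefficients (the mixture of experts) on the right.

$$p(\mathbf{t}\mid\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, p_k(\mathbf{t}\mid\mathbf{x}) \qquad\text{vs.}\qquad p(\mathbf{t}\mid\mathbf{x}) = \sum_{k=1}^{K} \pi_k(\mathbf{x})\, p_k(\mathbf{t}\mid\mathbf{x})$$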

16 14.5.1 Mixtures of linear regression models
14.5.1 Mixtures of linear regression models
Consider K linear regression models, each governed by its own weight parameter w_k, sharing a common noise variance governed by a precision parameter β, and a single target variable t.
This defines a mixture distribution over t and a corresponding log likelihood, which is maximized with EM by introducing the joint distribution over the latent and observed variables (Figure 14.7) and the complete-data log likelihood.
E-step: evaluate the responsibilities and take the expectation of the complete-data log likelihood with respect to them (the key quantities are written out below).
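The mixture distribution, log likelihood and responsibilities referred to above, reconstructed from Bishop Section 14.5.1 (the equation images were lost from the slide):

$$p(t\mid\boldsymbol\theta) = \sum_{k=1}^{K}\pi_k\,\mathcal{N}\big(t \mid \mathbf{w}_k^{\mathsf T}\boldsymbol\phi,\ \beta^{-1}\big), \qquad \ln p(\mathbf{t}\mid\boldsymbol\theta) = \sum_{n=1}^{N}\ln\Bigg(\sum_{k=1}^{K}\pi_k\,\mathcal{N}\big(t_n \mid \mathbf{w}_k^{\mathsf T}\boldsymbol\phi_n,\ \beta^{-1}\big)\Bigg)$$

$$\gamma_{nk} = \mathbb{E}[z_{nk}] = \frac{\pi_k\,\mathcal{N}(t_n\mid\mathbf{w}_k^{\mathsf T}\boldsymbol\phi_n,\beta^{-1})}{\sum_{j}\pi_j\,\mathcal{N}(t_n\mid\mathbf{w}_j^{\mathsf T}\boldsymbol\phi_n,\beta^{-1})}, \qquad Q(\boldsymbol\theta,\boldsymbol\theta^{\text{old}}) = \sum_{n=1}^{N}\sum_{k=1}^{K}\gamma_{nk}\Big\{\ln\pi_k + \ln\mathcal{N}\big(t_n\mid\mathbf{w}_k^{\mathsf T}\boldsymbol\phi_n,\beta^{-1}\big)\Big\}$$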

17 14.5.1 Mixtures of linear regression models (cont’d)
14.5.1 Mixtures of linear regression models (cont’d)
M-step: maximize the expected complete-data log likelihood Q with respect to the parameters, keeping the responsibilities fixed. This gives closed-form updates: the mixing coefficients π_k are set to the average responsibilities, each w_k is obtained from a responsibility-weighted least-squares problem, and the precision β is updated from the weighted residuals (a runnable sketch of the full EM loop follows).
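A runnable EM sketch for this model (my own illustrative code, not from the slides; variable names and the toy data are arbitrary choices). The E-step computes the responsibilities γ_nk and the M-step performs the three updates described above:

```python
import numpy as np

def em_mixture_linreg(phi, t, K=2, n_iter=50, seed=0):
    """EM for a mixture of K linear regression models with shared precision beta.
    phi: N x D design matrix, t: N-vector of targets."""
    rng = np.random.default_rng(seed)
    N, D = phi.shape
    pi = np.full(K, 1.0 / K)                  # mixing coefficients
    W = rng.normal(scale=0.1, size=(K, D))    # one weight vector per component
    beta = 1.0                                # common noise precision

    for _ in range(n_iter):
        # E-step: gamma[n, k] proportional to pi_k * N(t_n | w_k^T phi_n, 1/beta)
        means = phi @ W.T                                     # N x K component means
        log_gauss = 0.5 * np.log(beta / (2 * np.pi)) \
                    - 0.5 * beta * (t[:, None] - means) ** 2
        log_r = np.log(pi)[None, :] + log_gauss
        log_r -= log_r.max(axis=1, keepdims=True)             # numerical stability
        gamma = np.exp(log_r)
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M-step: mixing coefficients, weighted least squares per component, beta
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        for k in range(K):
            R = gamma[:, k]                                   # per-point weights
            A = phi.T @ (R[:, None] * phi)
            b = phi.T @ (R * t)
            W[k] = np.linalg.solve(A, b)
        resid = (t[:, None] - phi @ W.T) ** 2
        beta = N / np.sum(gamma * resid)
    return pi, W, beta

# Usage: two noisy lines with different slopes
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=200)
z = rng.integers(0, 2, size=200)
t = np.where(z == 0, 2.0 * x, -1.5 * x) + rng.normal(scale=0.1, size=200)
phi = np.column_stack([np.ones_like(x), x])                   # bias + x basis
pi, W, beta = em_mixture_linreg(phi, t, K=2)
print(pi, W, beta)
```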

18 Figures 14.8 and 14.9
Figure 14.8, Figure 14.9 (illustrations for the mixture of linear regression models).

19 14.5.2 Mixtures of logistic models
14.5.2 Mixtures of logistic models: the same conditional-mixture framework applied to classification, with logistic regression models as the mixture components, again fitted by EM.
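The model itself, which presumably appeared on this slide as an image (restated from Bishop Section 14.5.2):

$$p(t\mid\boldsymbol\phi,\boldsymbol\theta) = \sum_{k=1}^{K}\pi_k\, y_k^{\,t}\,(1-y_k)^{1-t}, \qquad y_k = \sigma\big(\mathbf{w}_k^{\mathsf T}\boldsymbol\phi\big), \qquad t\in\{0,1\}$$

where $\sigma(\cdot)$ is the logistic sigmoid; the EM updates follow the same pattern as for the mixture of linear regression models, with the M step requiring an iterative (IRLS-style) optimization of each $\mathbf{w}_k$.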

20 Figure 14.10
Figure 14.10.

21 14.5.3 Mixtures of experts
14.5.3 Mixtures of experts
The mixture-of-experts model further increases the capability of the framework by allowing the mixing coefficients themselves to be functions of the input variable.
Different components can then model the distribution in different regions of input space (they are 'experts' at making predictions in their own regions), while the mixing coefficients act as gating functions that determine which experts are dominant in which region.
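In formulas (the mixture form is Bishop's; the linear-softmax gating shown for the coefficients is one standard choice, and the parameters v_k are illustrative):

$$p(\mathbf{t}\mid\mathbf{x}) = \sum_{k=1}^{K}\pi_k(\mathbf{x})\,p_k(\mathbf{t}\mid\mathbf{x}), \qquad \pi_k(\mathbf{x}) = \frac{\exp\big(\mathbf{v}_k^{\mathsf T}\mathbf{x}\big)}{\sum_{j=1}^{K}\exp\big(\mathbf{v}_j^{\mathsf T}\mathbf{x}\big)}$$

The gating functions satisfy $0 \le \pi_k(\mathbf{x}) \le 1$ and $\sum_k \pi_k(\mathbf{x}) = 1$ for every $\mathbf{x}$, so they softly partition the input space among the experts $p_k(\mathbf{t}\mid\mathbf{x})$.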

22 14.5.3 Mixtures of experts (cont’d)
14.5.3 Mixtures of experts (cont’d)
Limitation: the use of linear models for the gating and expert functions. A more flexible model is obtained with multilevel gating functions, giving the hierarchical mixture of experts (HME) model, which can be viewed as a probabilistic version of the decision trees of Section 14.4.
Advantage of the mixture of experts: it can be optimized by EM, in which the M step for each mixture component and for the gating model involves a convex optimization, even though the overall optimization is non-convex.
The mixture density network (Section 5.6) is a more flexible alternative.


