1
Ensemble Classification Methods
Rayid Ghani
IR Seminar – 9/26/00
2
What is Ensemble Classification?
- A set of classifiers whose decisions are combined in "some" way
- Often more accurate than the individual classifiers
- What properties should the base learners have?
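As a minimal concrete reading of "decisions combined in some way", here is a plain unweighted majority-vote combiner; the function name and array shapes are my own choices, not from the talk:

```python
import numpy as np

def majority_vote(predictions):
    """Combine an ensemble's decisions by unweighted majority vote.
    `predictions` has shape (n_classifiers, n_examples)."""
    predictions = np.asarray(predictions)
    combined = []
    for col in predictions.T:                      # one column per example
        labels, counts = np.unique(col, return_counts=True)
        combined.append(labels[counts.argmax()])   # most common vote wins
    return np.array(combined)

# three classifiers disagreeing on four examples
print(majority_vote([[1, 0, 1, 1],
                     [1, 1, 0, 1],
                     [0, 0, 1, 1]]))               # -> [1 0 1 1]
```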
3
Why should it work?
- More accurate ONLY if the individual classifiers disagree
- Need each base error rate < 0.5 and errors that are independent
- The ensemble's error rate is highly correlated with the correlation among the errors made by the different learners (Ali & Pazzani)
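A quick numerical illustration of the independence argument (my example, not from the slides; the setting of 21 classifiers at error 0.3 is the classic one from Dietterich's overview):

```python
from math import comb

def majority_vote_error(n, p):
    """Error of a majority vote over n independent classifiers, each with error p:
    the ensemble errs only when more than half of the members err."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

print(majority_vote_error(21, 0.30))   # ~0.026, far below the base rate of 0.30
print(majority_vote_error(21, 0.55))   # ~0.68: combining hurts once p > 0.5
```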
4
Averaging Fails!
- Use delta functions as classifiers (predict +1 at a single point and -1 everywhere else)
- For a training sample of size m, construct a set of at most 2m classifiers such that the majority vote is always correct
- Associate one delta function with every example
- Add M+ (# of positive examples) copies of the function that predicts +1 everywhere and M- (# of negative examples) copies of the function that predicts -1 everywhere
- Applying boosting to this results in zero training error but bad generalization
- Applying the margin analysis also gives zero training error, but the margin is small, O(1/m)
5
Ideas?
- Subsampling training examples: Bagging, Cross-Validated Committees, Boosting (see the bagging sketch below)
- Manipulating input features: choose different feature subsets
- Manipulating output targets: ECOC and variants
- Injecting randomness: NN (different initial weights), DT (pick different splits), injecting noise, MCMC
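For the first family, a minimal bagging sketch: each base learner is trained on a bootstrap resample and the members are combined by majority vote. Using scikit-learn's DecisionTreeClassifier as the base learner is my stand-in, not something the slide specifies; prediction reuses the majority_vote helper sketched earlier.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier   # stand-in base learner

def bagging_fit(X, y, n_estimators=25, seed=0):
    """Train each base learner on a bootstrap resample of (X, y) (numpy arrays)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)           # sample n indices with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # unweighted majority vote over the members (majority_vote defined above)
    return majority_vote([m.predict(X) for m in models])
```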
6
Combining Classifiers
- Unweighted voting: Bagging, ECOC, etc.
- Weighted voting: weight by accuracy (on the training or a holdout set), LSR (weights proportional to 1/variance); see the sketch below
- Bayesian model averaging
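A sketch of the accuracy-weighted variant (weighting by 1/variance for the LSR scheme would follow the same pattern); the function and argument names are mine, and the models are assumed to be already fitted:

```python
import numpy as np

def accuracy_weighted_vote(models, X, X_holdout, y_holdout):
    """Weight each classifier's vote by its accuracy on a holdout set."""
    weights = [np.mean(m.predict(X_holdout) == y_holdout) for m in models]
    classes = np.unique(y_holdout)
    scores = np.zeros((len(X), len(classes)))
    for w, m in zip(weights, models):
        preds = m.predict(X)
        for j, c in enumerate(classes):
            scores[:, j] += w * (preds == c)       # add this member's weighted vote
    return classes[scores.argmax(axis=1)]
```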
7
BMA
- All possible models in the model space are used, weighted by their probability of being the "correct" model
- Optimal given the correct model space and priors
- Not widely used, even though it was claimed not to overfit (Buntine, 1990)
8
BMA - Equations
P(c | x, D) = \sum_{h \in H} P(c | x, h) P(h | D)
P(h | D) \propto P(h) P(D | h),  with  P(D | h) = \prod_{(x_i, y_i) \in D} P(y_i | x_i, h)
(P(h) is the prior, P(D | h) the likelihood, and P(y_i | x_i, h) the noise model)
9
Equations (continued)
- Posterior; uniform noise model; pure classification model
- Model space too large, so an approximation is required: take the model with the highest posterior, or use sampling
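One concrete and commonly used instantiation of these pieces (a hedged reconstruction, not necessarily the exact form on the slide): with a uniform prior and a uniform class-noise model with noise level \epsilon, a hypothesis h that misclassifies e of the m training examples gets likelihood

P(D | h) = \prod_{i=1}^{m} P(y_i | x_i, h) = (1 - \epsilon)^{m - e} \left(\frac{\epsilon}{C - 1}\right)^{e}

(for C classes; (1 - \epsilon)^{m-e} \epsilon^{e} in the two-class case), so under a uniform prior the posterior of h is determined entirely by its training error count e.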
10
BMA of Bagged C4.5 Rules
- Bagging can be viewed as a form of importance sampling in which all samples are weighted equally
- Experimental results: every version of BMA performed worse than bagging on 19 out of 26 datasets
- Posteriors are skewed: dominated by a single rule model, so BMA effectively does model selection rather than averaging
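A small numerical sketch (my illustration, not the paper's code) of why the posteriors come out so skewed: weighting bootstrapped models with the simple likelihood above concentrates almost all the weight on the model with the fewest training errors, even when the error counts differ only slightly.

```python
import numpy as np

def bma_weights(train_errors, m, eps=0.1):
    """Posterior-style weights for models that make e_i training errors out of m,
    under a uniform prior and a (1-eps)^(m-e) * eps^e likelihood."""
    e = np.asarray(train_errors, dtype=float)
    log_lik = (m - e) * np.log(1 - eps) + e * np.log(eps)
    log_lik -= log_lik.max()                # stabilise before exponentiating
    w = np.exp(log_lik)
    return w / w.sum()

# ten bagged models whose training-error counts differ only slightly
print(bma_weights([12, 13, 14, 14, 15, 15, 16, 16, 17, 18], m=300))
# the lowest-error model takes ~88% of the weight; bagging would give each model 0.1
```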
11
BMA of various learners
- RISE (rule sets with partitioning): 8 databases from UCI; BMA worse than RISE in every domain
- Trading rules: the intuition is that there is no single right rule, so BMA should help; in practice BMA was similar to choosing the single best rule
12
Overfitting in BMA
- The issue of overfitting is usually ignored (Freund et al. 2000)
- Is overfitting the explanation for the poor performance of BMA?
- Overfitting here: preferring a hypothesis that does not truly have the lowest error of any hypothesis considered, but by chance has the lowest error on the training data
- Overfitting is the result of the likelihood's exponential sensitivity to random fluctuations in the sample, and it increases with the # of models considered (see the note below)
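To make the exponential sensitivity concrete (my gloss, continuing the simple likelihood and uniform prior used above): if two hypotheses differ by \Delta e training errors, their posterior weights differ by the ratio

P(h_1 | D) / P(h_2 | D) = \left(\frac{1 - \epsilon}{\epsilon}\right)^{\Delta e}

so a handful of chance errors on the sample swings the weights by orders of magnitude, and the probability that some model in a large pool benefits from such a fluctuation grows with the number of models considered.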
13
To BMA or not to BMA?
- The net effect depends on which of two effects prevails:
  - Increased overfitting (small if few models are considered)
  - Reduction in error obtained by giving some weight to alternative models (skewed weights => small effect)
- Ali & Pazzani (1996) report good results, but bagging wasn't tried
- Domingos (2000) used bootstrapping before BMA, so the models were built from less data
14
Why do they work?
- Bias/variance decomposition
- Training data insufficient for choosing a single best classifier
- Learning algorithms not "smart" enough
- Hypothesis space may not contain the true function
15
Definitions
- Bias is the persistent/systematic error of a learner, independent of the training set; it is zero for a learner that always makes the optimal prediction
- Variance is the error incurred by fluctuations in response to different training sets; it is independent of the true value of the predicted variable and zero for a learner that always predicts the same class regardless of the training set
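Written out formally (a reconstruction of Domingos' definitions rather than the slide's own notation), with y the learner's prediction after training on a set D and y_* the optimal prediction at x:

y_m = \arg\min_{y'} E_D[L(y, y')]   (the main prediction: the mean prediction under squared loss, the most frequent prediction under zero-one loss)
B(x) = L(y_*, y_m)                  (bias: loss of the main prediction relative to the optimal one)
V(x) = E_D[L(y_m, y)]               (variance: expected loss of the individual predictions relative to the main prediction)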
16
Bias-Variance Decomposition
- Kong & Dietterich (1995): variance can be negative and noise is ignored
- Breiman (1996): undefined for any given example, and variance can be zero even when the learner's predictions fluctuate
- Tibshirani (1996); Hastie (1997)
- Kohavi & Wolpert (1996): allows the bias of the Bayes optimal classifier to be non-zero
- Friedman (1997): leaves bias and variance for zero-one loss undefined
17
Domingos (2000)
- A single definition of bias and variance
- Applicable to "any" loss function
- Explains the margin effect (Schapire et al. 1997) using the decomposition
- Incorporates variable misclassification costs
- Experimental study
18
Unified Decomposition
- Loss functions:
  - Squared: L(t, y) = (t - y)^2
  - Absolute: L(t, y) = |t - y|
  - Zero-one: L(t, y) = 0 if y = t, else 1
- Goal: minimize the average L(t, y) over all (weighted) examples
- The expected loss decomposes as c_1 N(x) + B(x) + c_2 V(x) (spelled out below)
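With the noise term N(x) = E_t[L(t, y_*)] added to the bias and variance defined two slides back, the decomposition reads as follows (again a reconstruction; treat the constants as my summary of Domingos 2000 rather than the slide's text):

E_{D,t}[L(t, y)] = c_1 N(x) + B(x) + c_2 V(x)

For squared loss, c_1 = c_2 = 1 and the familiar additive decomposition is recovered. For two-class zero-one loss, c_2 = +1 on unbiased examples and -1 on biased ones, which is why variance can reduce error exactly where the learner is biased (a point the kNN results below rely on).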
19
Properties of the unified decomposition
- Relation to an order-correct learner
- Relation to the margin of a learner: maximizing margins is a combination of reducing the number of biased examples, decreasing variance on unbiased examples, and increasing it on biased ones
20
Experimental Study
- 30 UCI datasets
- Methodology: 100 bootstrap samples, averaged over the test set with uniform weights (see the estimation sketch below)
- Estimate bias, variance, and zero-one loss
- Learners: DT, kNN, boosting
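A minimal sketch of this kind of bootstrap estimation under zero-one loss (my reconstruction, ignoring the noise term; not the study's actual code):

```python
import numpy as np

def estimate_bias_variance(fit, X_train, y_train, X_test, y_test, n_boot=100, seed=0):
    """Estimate average zero-one loss, bias, and variance of a learner by
    retraining it on bootstrap resamples of the training set.
    `fit(X, y)` must return an object with a .predict(X) method."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)              # bootstrap resample
        preds.append(fit(X_train[idx], y_train[idx]).predict(X_test))
    preds = np.stack(preds)                           # shape (n_boot, n_test)

    def mode(col):                                    # most frequent prediction
        labels, counts = np.unique(col, return_counts=True)
        return labels[counts.argmax()]

    main = np.array([mode(col) for col in preds.T])   # main prediction per test point
    loss = np.mean(preds != y_test)                   # average zero-one loss
    bias = np.mean(main != y_test)                    # fraction of biased test examples
    variance = np.mean(preds != main)                 # disagreement with the main prediction
    # Under the unified decomposition, variance on biased examples enters with the
    # opposite sign for zero-one loss; this sketch just reports the raw quantities.
    return loss, bias, variance

# hypothetical usage with a scikit-learn tree:
#   from sklearn.tree import DecisionTreeClassifier
#   estimate_bias_variance(lambda X, y: DecisionTreeClassifier().fit(X, y),
#                          X_train, y_train, X_test, y_test)
```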
21
Boosting C4.5 - Results
- Decreases both bias and variance
- The bulk of the bias reduction happens in the first few rounds
- Variance reduction is more gradual and is the dominant effect
22
kNN Results
- As k increases, the growth in bias dominates the reduction in variance
- However, increasing k reduces variance on unbiased examples while increasing it on biased ones
23
Issues
- Does not work with "any" loss function, e.g., absolute loss
- The decomposition is not purely additive, unlike the original one for squared loss
24
Spectrum of Ensembles
(figure: Bagging, Boosting, and BMA placed along a spectrum of weight asymmetry and overfitting)
25
Open Issues Concerning Ensembles
- What is the best way to construct ensembles? No extensive comparison has been done
- Computationally expensive
- Not easily comprehensible
26
Bibliography
- Overview: Dietterich; Bauer & Kohavi
- Averaging: Domingos; Freund, Mansour & Schapire; Ali & Pazzani
- Bias-Variance Decomposition: Kohavi & Wolpert; Domingos; Friedman; Kong & Dietterich