Diverse Ensembles for Active Learning


Diverse Ensembles for Active Learning Prem Melville and Raymond J. Mooney June 27, 2004

Motivation Actively selecting the most useful training examples is an important approach to reducing the amount of supervision. Pool-based sample selection is the most popular approach: the learner chooses the best instance for labeling from a set of unlabeled examples. Query by Committee (QBC) is a theoretically well-motivated approach to sample selection [Seung et al. 92]: a committee of consistent hypotheses is learned, and examples that cause maximum disagreement amongst this committee are selected for labeling. Bagging and AdaBoost have been used to learn effective committees for QBC [Abe & Mamitsuka 98], known as Query by Bagging (QBag) and Query by Boosting (QBoost).

Motivation A good ensemble for QBC should be diverse, i.e., consist of consistent hypotheses that are very different from each other; only a committee that effectively samples the version space is productive for sample selection [Cohn 94]. Decorate is a recently-developed ensemble method that explicitly builds diverse ensembles [Melville & Mooney 03,04]. It is more accurate than Bagging & AdaBoost when training data is limited, and does at least as well as AdaBoost when training sets are large. How effective are Decorate ensembles for sample selection? Can the added diversity help select more informative examples than QBag and QBoost?

Outline Background on DECORATE Active-DECORATE Experimental Evaluation Additional Experiments Future Work and Conclusions

Outline Background on DECORATE Active-DECORATE Experimental Evaluation Additional Experiments Future Work and Conclusions

Ensemble Diversity Combining classifiers is only useful if they disagree on some inputs. Diversity refers to a measure of disagreement (ambiguity). Increasing diversity while maintaining the error of the ensemble members → decreases ensemble error [Krogh & Vedelsby 95]. We use disagreement with the ensemble prediction as a measure of diversity. If Ci(x) is the prediction of the i-th classifier for the label of x, and C*(x) is the prediction of the entire ensemble, then the diversity of the i-th classifier on example x is di(x) = 0 if Ci(x) = C*(x), and 1 otherwise. The diversity of an ensemble of size m on a training set of size n is the average disagreement D = (1/nm) Σi=1..m Σj=1..n di(xj). Our approach: build ensembles consistent with the training data while maximizing diversity.
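A minimal sketch of this disagreement measure, assuming NumPy arrays of integer-encoded predictions (the helper names are illustrative):

```python
import numpy as np

def ensemble_diversity(member_preds, ensemble_preds):
    """Average disagreement of ensemble members with the ensemble prediction.

    member_preds:   (m, n) array, predictions of each of m members on n examples
    ensemble_preds: (n,)  array, majority-vote prediction of the whole ensemble
    d_i(x_j) is 1 when member i disagrees with the ensemble on example j, else 0.
    """
    disagreements = member_preds != ensemble_preds   # broadcasts to (m, n)
    return disagreements.mean()                       # (1 / nm) * sum of d_i(x_j)

# toy example: 3 members, 4 training examples
members = np.array([[0, 1, 1, 0],
                    [0, 1, 0, 0],
                    [1, 1, 1, 0]])
majority = np.array([0, 1, 1, 0])              # column-wise majority vote of the rows above
print(ensemble_diversity(members, majority))   # 2 disagreements / 12 predictions ≈ 0.167
```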

DECORATE: Basic Approach The ensemble is generated iteratively. Artificially constructed examples are added to the training set when building new members; these artificial examples are given labels that disagree with the current ensemble's decisions. The new classifier is trained on this augmented data, thereby forcing it to differ from the current ensemble, so adding it to the ensemble increases diversity. While forcing diversity we still maintain accuracy: a new classifier is rejected if adding it to the existing ensemble decreases the ensemble's accuracy. To produce predictions we take the majority vote of the ensemble.
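A condensed sketch of this loop, assuming numeric features only, integer class labels 0..K-1 (all present in the training data), scikit-learn's DecisionTreeClassifier in place of J48, and a crude Gaussian generator for the artificial examples (a fuller generator is sketched under the Artificial Data slide below):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def majority_vote(ensemble, X):
    votes = np.array([clf.predict(X) for clf in ensemble]).astype(int)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

def decorate(X, y, n_members=15, max_iters=50, rng=np.random.default_rng(0)):
    ensemble = [DecisionTreeClassifier().fit(X, y)]
    best_acc = (majority_vote(ensemble, X) == y).mean()
    trials = 1
    while len(ensemble) < n_members and trials < max_iters:
        trials += 1
        # 1. generate artificial examples (1:1 with the training set) from a
        #    crude Gaussian model of each feature
        X_art = rng.normal(X.mean(axis=0), X.std(axis=0) + 1e-9, size=X.shape)
        # 2. label them so that they DISAGREE with the current ensemble:
        #    labels sampled inversely proportional to the ensemble's probabilities
        probs = np.mean([clf.predict_proba(X_art) for clf in ensemble], axis=0)
        inv = 1.0 / (probs + 1e-9)
        inv /= inv.sum(axis=1, keepdims=True)
        y_art = np.array([rng.choice(len(p), p=p) for p in inv])
        # 3. train a candidate member on the augmented data
        clf = DecisionTreeClassifier().fit(np.vstack([X, X_art]),
                                           np.concatenate([y, y_art]))
        # 4. keep it only if the ensemble's training accuracy does not drop
        candidate = ensemble + [clf]
        acc = (majority_vote(candidate, X) == y).mean()
        if acc >= best_acc:
            ensemble, best_acc = candidate, acc
    return ensemble
```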

Overview of DECORATE [Three diagram slides: at each step the labeled training examples are augmented with artificial examples, the base learner is trained on the augmented set, and the resulting classifiers C1, C2, C3 are added one at a time to the current ensemble]

Artificial Data Examples are generated at each iteration; the number of examples is proportional to the training size (1:1). Points are picked randomly from an approximation of the training data distribution: for numeric attributes, compute the mean and standard deviation & generate values from that Gaussian; for nominal attributes, compute the probability of occurrence of each distinct value & generate values from this distribution, using Laplace smoothing so that nominal attribute values not represented in the data have a non-zero probability of occurrence. To label the examples, find the class membership probabilities predicted by the current ensemble and select labels such that the probability of selection is inversely proportional to the ensemble's predictions.
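A hedged sketch of this generator for mixed numeric/nominal data; the representation (an object array plus a set of nominal column indices) and the helper names are assumptions, not the paper's code:

```python
import numpy as np

def generate_artificial(X, nominal_cols, n_examples, rng=np.random.default_rng(0)):
    """X: (n, d) object array; nominal_cols: set of column indices holding nominal values."""
    n, d = X.shape
    X_art = np.empty((n_examples, d), dtype=object)
    for j in range(d):
        col = X[:, j]
        if j in nominal_cols:
            # nominal attribute: sample from Laplace-smoothed value frequencies
            # (the full method smooths over the attribute's entire value set so that
            #  unseen values also get non-zero probability; here we smooth over the
            #  observed values only, for brevity)
            values = np.unique(col)
            counts = np.array([(col == v).sum() for v in values], dtype=float)
            probs = (counts + 1.0) / (counts.sum() + len(values))
            X_art[:, j] = rng.choice(values, size=n_examples, p=probs)
        else:
            # numeric attribute: fit a Gaussian (mean, std dev) and sample from it
            mu = col.astype(float).mean()
            sigma = col.astype(float).std() + 1e-9
            X_art[:, j] = rng.normal(mu, sigma, size=n_examples)
    return X_art

def label_against_ensemble(ensemble_probs, rng=np.random.default_rng(0)):
    """ensemble_probs: (n_examples, K) class-membership probabilities predicted
    by the current ensemble. Labels are chosen with probability inversely
    proportional to these predictions, so new members are pushed to disagree."""
    inv = 1.0 / (ensemble_probs + 1e-9)
    inv /= inv.sum(axis=1, keepdims=True)
    return np.array([rng.choice(ensemble_probs.shape[1], p=p) for p in inv])
```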

Outline Background on DECORATE Active-DECORATE Experimental Evaluation Additional Experiments Future Work and Conclusions

Active-DECORATE [Diagram slides: DECORATE builds a committee (C1…C4) from the current training examples; the committee assigns a utility score (e.g. 0.1, 0.9, 0.3, 0.2, 0.5) to each unlabeled example, and the label of the highest-utility example is acquired and added to the training set] QBag/QBoost are similarly implemented, using Bagging/AdaBoost in place of Decorate.

Measure of Utility To evaluate the expected utility of unlabeled examples we use the margins on the examples, similar to [Abe and Mamitsuka 98]. Given the class membership probabilities predicted by the committee, the margin is defined as the difference between the highest and second highest predicted class probability. Smaller margins imply greater uncertainty in the class label. Other measures of utility will be discussed later.
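A small sketch of this margin-based utility, assuming the committee's averaged class-membership probabilities are available as a NumPy array:

```python
import numpy as np

def margin_utility(committee_probs):
    """committee_probs: (n_unlabeled, K) averaged class probabilities from the committee.
    Returns one utility score per example: smaller margin -> higher utility."""
    sorted_p = np.sort(committee_probs, axis=1)
    margins = sorted_p[:, -1] - sorted_p[:, -2]   # highest minus second-highest probability
    return 1.0 - margins

probs = np.array([[0.50, 0.45, 0.05],   # uncertain example: small margin
                  [0.90, 0.05, 0.05]])  # confident example: large margin
print(margin_utility(probs).argmax())   # 0 -> the uncertain example is selected for labeling
```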

Summary of Data Sets

Name        Cases   Classes   Attributes
Vowel         990      11         14
Statlog       270       2          –
Primary       339      21         18
Breast-w      699       –          9
Sonar         208       –         61
Glass         214       6          –
Heart-c       303       –         13
Hepatitis     155       –         19
Diabetes      768       –          –
Iris          150       3          4
Labor          57       –         16
Lymph         148       –          –
Credit-g     1000       –          –
Soybean       683       –         35
Heart-h       294       –          –

Experimental Methodology Compared Active-Decorate with QBag, QBoost and Decorate (using random sampling). Used ensembles of size 15 and J48 as the base learner (J48 is a Java implementation of C4.5 decision tree induction). 2x10-fold cross-validations were run on 15 UCI datasets. In each fold, learning curves were generated: the set of available examples was treated as the unlabeled pool, and at each iteration the active learner selected a sample of points to be labeled and added to the training set. For the passive learner, Decorate, examples were selected randomly. At the end of the learning curve all algorithms have seen the same examples, so the curves evaluate how well an active learner orders the set of examples in terms of utility.

Metrics – Data Utilization Ratio [Plot: accuracy vs. number of training examples for active and random selection, illustrating the examples saved] Primary aim of active learning – reduce the amount of data needed to induce an accurate model.

Metrics – Data Utilization Ratio Define the target error rate as the error that Decorate can achieve on a given dataset (error averaged over the points of the learning curve corresponding to the last 50 examples). Record the smallest number of examples required by a learner to achieve the same or lower error.

Metrics – Data Utilization Ratio Data utilization ratio = (number of examples required by the active learner) / (number of examples required by Decorate). Reflects how efficiently the active learner is using data. Similar to a measure used by Abe & Mamitsuka [98].
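A tiny worked example of this ratio with made-up numbers (120 and 300 are purely illustrative):

```python
examples_active = 120   # hypothetical: examples the active learner needs to hit the target error
examples_random = 300   # hypothetical: examples Decorate with random sampling needs
data_utilization_ratio = examples_active / examples_random
print(data_utilization_ratio)   # 0.4 -> the active learner uses 40% of the data
```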

Metrics - Percentage Error Reduction How much an active learner improves accuracy over random sampling given a fixed amount of labeled data: compute the % reduction in error over Decorate and average over points on the learning curve.

Metrics - Percentage Error Reduction Towards the end of the learning curve all methods see almost the same examples; hence, the main impact of active learning is lower on the curve. Capture this by reporting % error reduction on the 20% of points on the curve where the largest improvements are produced. Similar to a measure used by Saar-Tsechansky & Provost [01].

Metrics - Percentage Error Reduction Error reduction is considered significant if the difference in error of the two systems, averaged across the selected points of the curve, is statistically significant (p<0.05).
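A tiny worked example of the percentage error reduction at one point on the learning curve, again with made-up numbers:

```python
err_active = 0.12   # hypothetical error of the active learner at some training-set size
err_random = 0.16   # hypothetical error of Decorate with random sampling at the same size
pct_error_reduction = 100.0 * (err_random - err_active) / err_random
print(f"{pct_error_reduction:.1f}% error reduction")   # 25.0%
```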

Results – Data Utilization On all but one dataset Active-Decorate produces improvements over Decorate. On average it requires 78% of the number of examples that Decorate needs, with as few as 29% of the examples on soybean. On breast-w we notice a ceiling effect where none of the active methods improve on Decorate. Active-Decorate outperforms both QBag and QBoost on 10 datasets. On some datasets (vowel & primary), QBag & QBoost failed to achieve the target error. Decorate itself achieves the target error with far fewer examples than are available, e.g. on breast-w it achieves the target error with only 30 of the available 630 examples; hence improving on the data utilization of Decorate is fairly challenging.

Results – Error Reduction On all datasets Active-Decorate produces significant reductions in error over Decorate. On 8 datasets Active-Decorate produces higher reductions than the other active methods. It produces a wide range of improvements, from moderate (4.2% on credit-g) to high (70.68% on vowel), with an average reduction of 21.2%.

                 QBag     QBoost   Active-Decorate
Mean Err. Red.   13.13%   15.64%   21.15%
No. of Wins      4        3        8

Learning Curve for Soybean This graph shows the advantage of Active-Decorate, both in terms of data utilization and error reduction.

Outline Background on DECORATE Active-DECORATE Experimental Evaluation Additional Experiments Future Work and Conclusions

Measures of Utility There are two main aspects of any QBC approach: the method employed to construct the committee, and the measure used to rank the utility of unlabeled examples. So far we have compared different methods for constructing committees, ranking examples based on margins. An alternative approach is to use Jensen-Shannon (JS) divergence [Cover & Thomas 91]; JS-div is a measure of the similarity between probability distributions.

Jensen-Shannon Divergence If Pi(x) is the class probability distribution given by the i-th classifier for example x, then the JS-divergence of an ensemble of size n is JS(P1,…,Pn) = H( (1/n) Σi Pi(x) ) − (1/n) Σi H(Pi(x)), where H(P) is the Shannon entropy of a distribution P = {pj, j=1,…,K}, defined as H(P) = −Σj pj log pj. Higher values of JS-div indicate greater spread in the predicted class probability distributions; it is zero iff the distributions are identical. A similar measure was used by [McCallum & Nigam 98]. We ran experiments, as before, comparing JS-div with margins.
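A sketch of this measure, assuming each committee member reports a class probability distribution for the example being scored:

```python
import numpy as np

def shannon_entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def js_divergence(member_probs):
    """member_probs: (n_members, K) class distributions for ONE unlabeled example.
    JS = H(mean distribution) - mean of the members' entropies; zero iff identical."""
    member_probs = np.asarray(member_probs, dtype=float)
    mean_dist = member_probs.mean(axis=0)
    return shannon_entropy(mean_dist) - np.mean([shannon_entropy(p) for p in member_probs])

print(js_divergence([[0.9, 0.1], [0.9, 0.1]]))   # 0.0   -> members agree, low utility
print(js_divergence([[0.9, 0.1], [0.1, 0.9]]))   # ~0.53 -> members disagree, high utility
```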

Results – Utility Measures

               Data Utilization      % Error Reduction
               Margins   JS-Div      Margins   JS-Div
Mean           0.78      0.83        21.15     17.22
Num. of Wins   7         8           11        4

In terms of data utilization, the two methods are evenly matched. On error reduction, using margins is more effective. JS-div selects examples so as to reduce uncertainty in the predicted class membership probabilities, which indirectly helps improve accuracy; margins focus more directly on determining the decision boundary. Cost-sensitive decisions require accurate class probability estimates, so using JS-div could be more effective in such cases.

Learning Curve for Vowel Often both measures achieve target error with comparable number of examples But error reduction produced by margins is higher

Committees for Sample Selection vs. Prediction All the active methods described use committees to select examples; in addition to sample selection, they also use the committees for prediction. So we are evaluating the combination of sample selection and ensemble method: Active-Decorate doing better than QBag could just be because Decorate is better than Bagging. Claim: Decorate not only produces accurate committees, but the committees it produces are more effective for sample selection.

Committees for Sample Selection vs. Prediction We implemented a variant of Active-Decorate in which, at each iteration, a committee constructed by Bagging is used to select the examples given to Decorate, thus separating the evaluation of the selector from the predictor. Similarly, we implemented a variant using AdaBoost as the selector, and compared the 3 variants on 4 datasets. On 3 of the 4 datasets, using any selector with Decorate as the predictor performed better than random selection; on the 4th dataset the trends are the same, but not statistically significant. Compared to AdaBoost and Bagging, Decorate committees select more informative examples for training Decorate.

Learning Curve for Soybean

Related Work Dagan & Engelson [95] measure the utility of examples using vote entropy, i.e. the entropy of the class distribution based on the majority votes of the committee members; [McCallum & Nigam 98] showed that it does not perform as well as JS-div. Another committee-based active learner is Co-Testing [Muslea et al. 00], which requires 2 redundant views of the data and hence has limited applicability. Expected-error reduction methods [Cohn et al. 96, Roy & McCallum 01, Zhu et al. 03] select examples that are expected to minimize error on the actual test distribution; this is computationally intensive and must be tailored to specific learners, whereas active meta-learners like Active-Decorate can be applied to any learner.

Future Work & Conclusions Active-Decorate is a simple, yet effective approach to active learning Produces significant improvements over Decorate In general, it leads to more effective sample selection than QBag and QBoost Using JS-divergence to evaluate effectiveness of examples is less effective for improving classification accuracy than margins JS-div may be a better measure when the objective is improving class probability estimates Active-Decorate is a meta-learning scheme – so it can be applied to other base learners We can compare with other active learners, such as approaches for SVMs [Tong et al. 01]

Questions? DECORATE is now available as part of the Weka ML package. Machine Learning Group, UT-Austin www.cs.utexas.edu/users/ml

Ensemble Diversity Combining classifiers is only useful if they disagree on some inputs. Diversity refers to a measure of disagreement (ambiguity). For regression, using mean squared error to measure accuracy and variance to measure diversity, the ensemble generalization error [Krogh & Vedelsby ′95] is E = Ē − D̄, i.e. the average error of the ensemble members minus the average diversity of the ensemble. Increasing diversity while maintaining the error of the ensemble members → decreases ensemble error.

Diversity for Classification For classification the simple linear relation doesn't hold. We still have reason to believe that diversity is related to error reduction [Cunningham ′00]. Many measures of diversity have been used in the literature; [Kuncheva et al. ′03] compared different measures and showed that most of them are highly correlated. No conclusive study points to which measure of diversity is best to use for the classification task, which is our focus.

Learning Curve for Soybean (Full)

Learning Curve for Vowel (Full)

Learning Curve for Soybean (Full)

Related Work There have been other ensemble methods that focus on diversity: [Liu & Yao ′99], [Rosen ′96], [Opitz & Shavlik ′96], [Zenobi & Cunningham ′01], [Tumer & Ghosh ′96], [Opitz ′99]. How our work differs from others: other methods attempt to optimize the accuracy and diversity of individual ensemble members, whereas we try to minimize the error of the entire ensemble by increasing diversity. Some methods are dependent on the underlying learner (e.g. neural networks), whereas DECORATE is a general meta-learner applicable to any base learner. We compare with standard ensemble methods – the others, except for [Opitz ′99], don't. We present learning curves, which evaluate performance with varying amounts of data.

Modeling Artificial Data We use a very crude approximation of the data distribution: assume independence of features and a Gaussian distribution for numeric attributes. We could do a better job of modeling the data, but we get good results with the current method. It is unclear that a better model would improve results; it would, however, increase run time.

Artificial vs. Unlabeled Data The way we use artificial examples may appear counterintuitive compared to the way unlabeled data is used in semi-supervised learning, where the labels given to the unlabeled data by the supervised learner are preserved (instead of being flipped). Why does semi-supervised learning work? Unlabeled data provides more information about the data distribution; artificial data does not. Why does flipping labels not hurt Decorate? If the current ensemble is accurate, aren't we forcing subsequent members to not be accurate? No – we make sure that the accuracy of the ensemble never decreases.

When Should You Use DECORATE? When you have few training examples, or acquiring labeled data is expensive. For large amounts of training data you may still do better than Boosting: DECORATE performs better on 6 of 15 datasets given 100% of the data, so for your dataset there is a good chance that DECORATE will outperform Boosting even with large amounts of data. When your base classifier cannot handle weighted examples (Boosting can be done with resampling – but that might not be desirable). When you have noisy data: Boosting often increases error due to overfitting noisy data [Dietterich 00], whereas DECORATE is resilient to noise in the data [Melville et al. 04].

Other Ensemble Methods There are other ensemble methods that we could compare to: Error-Correcting Output Coding [Dietterich & Bakiri ′95], injecting randomness into the learning algorithm, etc. We chose to compare to Bagging and Boosting because they are the most widely used and studied. We also compared to Random Forests: RF is not a meta-learner, but since we use decision trees as the base learner we included it in the comparison.

[Learning curves on four datasets: Labor, Iris, Heart-C, Breast-W] On many data sets, Decorate achieves the same or higher accuracy with fewer training examples. These learning curves on four of our datasets clearly demonstrate this point. Hence, in domains where little data is available or acquiring labels is expensive, Decorate has an advantage over other ensemble methods.

Bagging [Breiman ′96] Each classifier is trained on a set of m training examples Examples drawn randomly with replacement from the original set of size m Such a set is called a bootstrap replicate Predictions are made by taking the majority vote of the ensemble Ensemble members differ because they’re trained on different subsets of the data Bagging reduces error due to variance of the base classifier
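A minimal bagging sketch following this description, assuming scikit-learn's DecisionTreeClassifier as the base learner and integer class labels:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging(X, y, n_members=15, rng=np.random.default_rng(0)):
    m = len(X)
    ensemble = []
    for _ in range(n_members):
        idx = rng.integers(0, m, size=m)   # bootstrap replicate: m draws with replacement
        ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return ensemble

def bagging_predict(ensemble, X):
    votes = np.array([clf.predict(X) for clf in ensemble]).astype(int)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)  # majority vote
```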

Boosting (AdaBoost.M1) [Freund & Schapire ′96] Maintains a set of weights over the training examples. In each iteration classifier Ci is trained to minimize the weighted error, and the weighted error of Ci is used to update the distribution of weights: weights of misclassified examples are increased and weights of correctly classified examples are decreased. The next classifier is trained on examples with the updated distribution, and this process is repeated for a specified number of iterations. Ensemble predictions are made using a weighted vote of the individual classifiers, where the weight of each classifier is computed according to its training accuracy.
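A compact AdaBoost.M1-style sketch of these updates, assuming integer labels 0..K-1 and scikit-learn decision stumps as weighted base learners (the early-stopping check when the weighted error exceeds 0.5 is omitted for brevity):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, n_rounds=15):
    n = len(X)
    w = np.full(n, 1.0 / n)                              # weights over the training examples
    members, alphas = [], []
    for _ in range(n_rounds):
        clf = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = clf.predict(X) != y
        eps = np.clip(w[miss].sum(), 1e-10, 1 - 1e-10)   # weighted training error
        beta = eps / (1.0 - eps)
        w[~miss] *= beta                                 # decrease weights of correct examples
        w /= w.sum()                                     # renormalize (misclassified gain weight)
        members.append(clf)
        alphas.append(np.log(1.0 / beta))                # classifier weight from its accuracy
    return members, alphas

def adaboost_predict(members, alphas, X, n_classes):
    scores = np.zeros((len(X), n_classes))
    for clf, a in zip(members, alphas):
        preds = clf.predict(X).astype(int)
        scores[np.arange(len(X)), preds] += a            # weighted vote
    return scores.argmax(axis=1)
```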

Metrics – Data Utilization Ratio Primary aim of active learning – reduce the amount of data needed to induce an accurate model. First, define the target error rate as the error that Decorate can achieve on a given dataset (error averaged over the points of the learning curve corresponding to the last 50 examples). Then, record the smallest number of examples required by a learner to achieve the same or lower error. Data utilization ratio: (number of examples required by the active learner) / (number of examples required by Decorate). Reflects how efficiently the active learner is using data. Similar to a measure used by Abe & Mamitsuka [98].

Metrics - Percentage Error Reduction How much an active learner improves accuracy over random sampling given a fixed amount of labeled data: compute the % reduction in error over Decorate and average over points on the learning curve. Towards the end of the learning curve all methods see almost the same examples; hence, the main impact of active learning is lower on the curve. Capture this by reporting % error reduction on the 20% of points on the curve where the largest improvements are produced. Similar to a measure used by Saar-Tsechansky & Provost [01]. Error reduction is considered significant if the difference in the error of the two systems, averaged across the selected points of the curve, is determined to be statistically significant (p<0.05).