Data Mining: Practical Machine Learning Tools and Techniques


Data Mining: Practical Machine Learning Tools and Techniques. Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall.

Ensemble learning
- Combining multiple models: the basic idea
- Bagging: bias-variance decomposition, bagging with costs
- Randomization: random forests, rotation forests
- Boosting: AdaBoost, the power of boosting
- Additive regression: numeric prediction, additive logistic regression
- Interpretable ensembles: option trees, alternating decision trees, logistic model trees
- Stacking

Combining multiple models
- Basic idea: build different "experts", let them vote
- Advantage: often improves predictive performance
- Disadvantage: usually produces output that is very hard to analyze
  - But: there are approaches that aim to produce a single comprehensible structure

Bagging

Bagging
- Combining predictions by voting/averaging
  - Simplest way: each model receives equal weight
- "Idealized" version:
  - Sample several training sets of size n (instead of just having one training set of size n)
  - Build a classifier for each training set
  - Combine the classifiers' predictions
- Learning scheme is unstable → bagging almost always improves performance
  - A small change in the training data can make a big change in the model (e.g. decision trees)

Bias-variance decomposition
- Used to analyze how much the selection of any specific training set affects performance
- Assume infinitely many classifiers, built from different training sets of size n
- For any learning scheme:
  - Bias = expected error of the combined classifier on new data
  - Variance = expected error due to the particular training set used
- Total expected error ≈ bias + variance
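For numeric prediction with squared error this decomposition can be written out explicitly. The identity below is the standard textbook form, assuming a learner $\hat{f}_D$ trained on a random training set $D$ and targets $y = f(x) + \varepsilon$ with noise variance $\sigma^2$; it is not from the original slides, which fold the irreducible noise into the bias term:

$$
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat{f}_D(x)\big)^2\right]
= \underbrace{\big(f(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_D\!\left[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\right]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{noise}}
$$

Bagging attacks the variance term: averaging many models built from (approximately) independent training sets leaves the bias essentially unchanged but shrinks the variance.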

More on bagging
- Bagging works because it reduces variance by voting/averaging
  - Note: in some pathological hypothetical situations the overall error might increase
  - Usually, the more classifiers the better
- Problem: we only have one dataset!
- Solution: generate new datasets of size n by sampling from it with replacement
- Can help a lot if the data is noisy
- Can also be applied to numeric prediction
  - Aside: the bias-variance decomposition was originally only known for numeric prediction

Bagging classifiers

Model generation:
  Let n be the number of instances in the training data.
  For each of t iterations:
    Sample n instances from the training set (with replacement).
    Apply the learning algorithm to the sample.
    Store the resulting model.

Classification:
  For each of the t models:
    Predict the class of the instance using the model.
  Return the class that is predicted most often.
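A minimal Python sketch of this pseudocode, assuming numpy and scikit-learn's DecisionTreeClassifier as the (unstable) base learner; the function names bagged_fit and bagged_predict are illustrative, not from WEKA or the book:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_fit(X, y, t=25, seed=None):
    """Build t classifiers, each from a bootstrap sample of the n training instances."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(t):
        idx = rng.integers(0, n, size=n)                 # sample n instances with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagged_predict(models, X):
    """Return, for each instance, the class predicted most often by the t models.
    Assumes integer class labels 0..k-1 (needed for np.bincount)."""
    preds = np.array([m.predict(X) for m in models])     # shape (t, n_instances)
    return np.array([np.bincount(col).argmax() for col in preds.T])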

Bagging with costs
- Bagging unpruned decision trees is known to produce good probability estimates
  - Here, instead of voting, the individual classifiers' probability estimates are averaged
  - Note: this can also improve the success rate
- Can be used with the minimum-expected-cost approach for learning problems with costs
- Problem: the resulting ensemble is not interpretable
  - MetaCost re-labels the training data using bagging with costs and then builds a single tree
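The cost-sensitive variant above is short to sketch: average the bagged trees' probability estimates and pick the class with minimum expected cost. The cost-matrix convention and function name below are my assumptions, not MetaCost itself:

import numpy as np

def predict_min_expected_cost(models, X, cost):
    """cost[i, j] = cost of predicting class j when the true class is i.
    Average the ensemble's probability estimates, then choose the class
    whose expected cost under those probabilities is smallest."""
    avg_prob = np.mean([m.predict_proba(X) for m in models], axis=0)   # (n_instances, n_classes)
    expected_cost = avg_prob @ cost                                    # (n_instances, n_classes)
    return expected_cost.argmin(axis=1)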

Randomization
- Can randomize the learning algorithm instead of the input
- Some algorithms already have a random component: e.g. initial weights in a neural net
- Most algorithms can be randomized, e.g. greedy algorithms:
  - Pick from the N best options at random instead of always picking the best option
  - E.g.: attribute selection in decision trees
- More generally applicable than bagging: e.g. random subsets in a nearest-neighbor scheme
- Can be combined with bagging
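As a tiny illustration of "pick from the N best options at random" (helper and parameter names are illustrative, not WEKA's API):

import random

def pick_randomized(options, score, n_best=3, rng=random):
    """Instead of always taking the single best option (e.g. the best split attribute),
    rank the options by score and pick uniformly among the n_best highest-scoring ones."""
    ranked = sorted(options, key=score, reverse=True)
    return rng.choice(ranked[:n_best])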

Random forests, rotation forests
- Bagging creates ensembles of accurate classifiers with relatively low diversity
  - Bootstrap sampling creates training sets with a distribution that resembles the original data
- Randomness in the learning algorithm increases diversity but sacrifices the accuracy of individual ensemble members
- Random forests and rotation forests aim to create ensemble members that are both accurate and diverse

Random forests
- Combine random attribute sets and bagging to generate an ensemble of decision trees
- Each iteration involves:
  - Generating a bootstrap sample
  - Constructing a decision tree in which each split chooses its attribute from a small random subset of k attributes
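A sketch combining the two ingredients: bootstrap sampling plus a random attribute subset at each split. Using scikit-learn's DecisionTreeClassifier(max_features=k) as a stand-in for the per-node attribute sampling is my assumption, not the book's implementation; prediction works exactly as in the bagging sketch earlier (majority vote):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, t=100, k="sqrt", seed=None):
    rng = np.random.default_rng(seed)
    n = len(X)
    forest = []
    for _ in range(t):
        idx = rng.integers(0, n, size=n)               # bootstrap sample of size n
        tree = DecisionTreeClassifier(max_features=k)  # each split considers only k random attributes
        forest.append(tree.fit(X[idx], y[idx]))
    return forest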

Rotation forests
- Combine random attribute sets, bagging and principal components to generate an ensemble of decision trees
- Each iteration involves:
  - Randomly dividing the input attributes into k disjoint subsets
  - Applying PCA to each of the k subsets in turn
  - Learning a decision tree from the k sets of PCA directions
- Further increases in diversity can be achieved by creating a bootstrap sample in each iteration before applying PCA
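A simplified sketch of one rotation-forest iteration as described above: partition the attributes into k disjoint subsets, run PCA on each, assemble the directions into one block-diagonal rotation and learn a tree on the rotated data. This omits details of the full algorithm (per-subset bootstrapping, class subsampling, mean-centering of the projections), so treat it as an illustration only:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

def rotation_iteration(X, y, k=3, seed=None):
    rng = np.random.default_rng(seed)
    n_attr = X.shape[1]
    subsets = np.array_split(rng.permutation(n_attr), k)   # k disjoint attribute subsets
    R = np.zeros((n_attr, n_attr))                          # block-diagonal rotation matrix
    for cols in subsets:
        pca = PCA().fit(X[:, cols])                         # assumes more instances than attributes per subset
        R[np.ix_(cols, cols)] = pca.components_.T           # principal directions of this subset
    tree = DecisionTreeClassifier().fit(X @ R, y)           # learn the tree in the rotated space
    return tree, R                                          # keep R to rotate test instances the same way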

Boosting

Boosting
- Also uses voting/averaging
- Weights models according to their performance
- Iterative: new models are influenced by the performance of previously built ones
  - Encourage the new model to become an "expert" for instances misclassified by earlier models
  - Intuitive justification: models should be experts that complement each other
- Several variants exist

AdaBoost.M1

Model generation:
  Assign equal weight to each training instance.
  For t iterations:
    Apply the learning algorithm to the weighted dataset, store the resulting model.
    Compute the model's error e on the weighted dataset.
    If e = 0 or e ≥ 0.5: terminate model generation.
    For each instance in the dataset:
      If classified correctly by the model: multiply the instance's weight by e/(1-e).
    Normalize the weights of all instances.

Classification:
  Assign weight = 0 to all classes.
  For each of the t (or fewer) models:
    Add -log(e/(1-e)) to the weight of the class this model predicts.
  Return the class with the highest weight.
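A direct Python transcription of the pseudocode above, assuming numpy and a scikit-learn-style base learner that accepts sample_weight (a decision stump here); the function names are illustrative:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1_fit(X, y, t=10):
    n = len(X)
    w = np.full(n, 1.0 / n)                        # equal weight to each training instance
    models, betas = [], []
    for _ in range(t):
        m = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        wrong = m.predict(X) != y
        e = w[wrong].sum() / w.sum()               # weighted error on the training data
        if e == 0 or e >= 0.5:
            break                                  # terminate model generation
        beta = e / (1 - e)
        w[~wrong] *= beta                          # down-weight correctly classified instances
        w /= w.sum()                               # normalize the weights
        models.append(m)
        betas.append(beta)
    return models, betas

def adaboost_m1_predict(models, betas, X, classes):
    votes = np.zeros((len(X), len(classes)))       # weight = 0 for every class
    for m, beta in zip(models, betas):
        pred = m.predict(X)
        for j, c in enumerate(classes):
            votes[pred == c, j] += -np.log(beta)   # add -log(e/(1-e)) to the predicted class
    return np.asarray(classes)[votes.argmax(axis=1)]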

More on boosting I
- Boosting needs weights ... but:
  - Can adapt the learning algorithm, or
  - Can apply boosting without weights: resample with probability determined by the weights
    - Disadvantage: not all instances are used
    - Advantage: if the error exceeds 0.5, can simply resample again
- Boosting stems from computational learning theory
- Theoretical result: the training error decreases exponentially
- Also: works if the base classifiers are not too complex and their error doesn't become too large too quickly

More on boosting II
- Continue boosting after the training error reaches 0?
  - Puzzling fact: the generalization error continues to decrease!
  - Seems to contradict Occam's razor
- Explanation: consider the margin (confidence), not the error
  - Difference between the estimated probability for the true class and that of the nearest other class (between -1 and 1)
- Boosting works with weak learners: the only condition is that their error doesn't exceed 0.5
- In practice, boosting sometimes overfits (in contrast to bagging)
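Written as a formula (my notation for the definition in the bullet above, where $y$ is the true class of instance $x$):

$$
\operatorname{margin}(x, y) \;=\; \hat{P}(y \mid x) \;-\; \max_{c \neq y} \hat{P}(c \mid x) \;\in\; [-1, 1]
$$

Boosting keeps pushing this margin up on the training data even after the training error has reached zero, which is the usual explanation for the continued drop in generalization error.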

Additive regression I
- It turns out that boosting is a greedy algorithm for fitting additive models
- More specifically, it implements forward stagewise additive modeling
- The same kind of algorithm works for numeric prediction:
  - Build a standard regression model (e.g. a tree)
  - Gather the residuals, learn a model predicting the residuals (e.g. another tree), and repeat
- To predict, simply sum up the individual predictions from all models

Additive regression II
- Minimizes the squared error of the ensemble if the base learner minimizes squared error
- It doesn't make sense to use it with standard multiple linear regression. Why? The least-squares fit's residuals are orthogonal to the inputs, so a second linear model fit to them adds nothing
- Can use it with simple linear regression to build a multiple linear regression model
- Use cross-validation to decide when to stop
- Another trick: shrink the predictions of the base models by multiplying them with a positive constant < 1
  - Caveat: then we need to start with a model 0 that predicts the mean
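A sketch of forward stagewise additive regression with regression trees as the base learner, including the shrinkage trick and the model-0 mean prediction mentioned above (numpy/scikit-learn assumed; choosing the number of iterations t by cross-validation is left out):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def additive_regression_fit(X, y, t=50, shrinkage=0.5):
    mean = y.mean()                               # model 0: predict the mean
    residual = y - mean
    models = []
    for _ in range(t):
        m = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        residual = residual - shrinkage * m.predict(X)   # next model is fit to what is left over
        models.append(m)
    return mean, models

def additive_regression_predict(mean, models, X, shrinkage=0.5):
    # Prediction is simply the (shrunken) sum of the individual models' predictions.
    return mean + shrinkage * sum(m.predict(X) for m in models)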

Additive logistic regression
- Can use the logit transformation to get an algorithm for classification
  - More precisely, for class probability estimation
- The probability estimation problem is transformed into a regression problem
- A regression scheme is used as the base learner (e.g. a regression tree learner)
- Can use a forward stagewise algorithm: at each stage, add the model that maximizes the probability of the data
- If $f_j$ is the $j$th regression model, the ensemble predicts the probability of the first class as
  $p(1 \mid a) = \dfrac{1}{1 + \exp\!\left(-\sum_j f_j(a)\right)}$

LogitBoost

Model generation:
  For j = 1 to t iterations:
    For each instance a[i]:
      Set the target value for the regression to
        z[i] = (y[i] - p(1|a[i])) / [p(1|a[i]) * (1 - p(1|a[i]))]
      Set the weight of instance a[i] to w[i] = p(1|a[i]) * (1 - p(1|a[i]))
    Fit a regression model f[j] to the data with target values z[i] and weights w[i].

Classification:
  Predict the first class if p(1|a) > 0.5, otherwise predict the second class.

- Maximizes the probability if the base learner minimizes squared error
- Difference to AdaBoost: optimizes probability/likelihood instead of exponential loss
- Can be adapted to multi-class problems
- Shrinking and cross-validation-based selection apply here too
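A two-class Python sketch following the pseudocode above, with y coded as 0/1 and probabilities clipped away from 0 and 1 (a practical detail not on the slide); numpy/scikit-learn assumed:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def logitboost_fit(X, y, t=30):
    F = np.zeros(len(X))                                  # sum of regression models, starts at 0
    models = []
    for _ in range(t):
        p = 1.0 / (1.0 + np.exp(-F))                      # p(1|a) under the current ensemble
        p = np.clip(p, 1e-5, 1 - 1e-5)
        w = p * (1 - p)                                   # instance weights
        z = (y - p) / w                                   # regression targets
        m = DecisionTreeRegressor(max_depth=2).fit(X, z, sample_weight=w)
        F += m.predict(X)
        models.append(m)
    return models

def logitboost_predict(models, X):
    F = sum(m.predict(X) for m in models)
    return (1.0 / (1.0 + np.exp(-F)) > 0.5).astype(int)   # predict class 1 if p(1|a) > 0.5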

Option trees
- Ensembles are not interpretable
- Can we generate a single model instead?
  - One possibility: "cloning" the ensemble by using lots of artificial data that is labeled by the ensemble
  - Another possibility: generating a single structure that represents the ensemble in a compact fashion
- Option tree: a decision tree with option nodes
  - Idea: follow all possible branches at an option node
  - Predictions from the different branches are merged using voting or by averaging probability estimates

Example
- An option tree can be learned by modifying a decision tree learner:
  - Create an option node if there are several equally promising splits (within a user-specified interval)
  - When pruning, the error at an option node is the average error of its options

Alternating decision trees
- Can also grow an option tree by incrementally adding nodes to it
- The resulting structure is called an alternating decision tree, with splitter nodes and prediction nodes
  - Prediction nodes are leaves if no splitter nodes have been added to them yet
  - The standard alternating tree applies to two-class problems
- To obtain a prediction, filter the instance down all applicable branches and sum the predictions
  - Predict one class or the other depending on whether the sum is positive or negative

Example
- For the instance (overcast, high, false), summing the prediction values along all applicable branches of the alternating tree gives -0.255 + 0.213 + 0.486 - 0.331 = 0.113; the sum is positive, so the corresponding class is predicted

Growing alternating trees
- The tree is grown using a boosting algorithm, e.g. the LogitBoost algorithm described earlier
- Assume that the base learner produces a single conjunctive rule in each boosting iteration (note: a rule for regression)
- Each rule could simply be added into the tree, including the numeric prediction obtained from the rule
  - Problem: the tree would grow very large very quickly
  - Solution: the base learner should only consider candidate rules that extend existing branches
- An extension adds a splitter node and two prediction nodes (assuming binary splits)
- The standard algorithm chooses the best extension among all possible extensions applicable to the tree
  - More efficient heuristics can be employed instead

Logistic model trees
- Option trees may still be difficult to interpret
- Can also use boosting to build decision trees with linear models at the leaves (i.e. trees without options)
- Algorithm for building logistic model trees:
  - Run LogitBoost with simple linear regression as the base learner (choosing the best attribute in each iteration)
  - Interrupt boosting when the cross-validated performance of the additive model no longer increases
  - Split the data (e.g. as in C4.5) and resume boosting in the subsets of the data
  - Prune the tree using the cross-validation-based pruning strategy from the CART tree learner

Stacking

Stacking
- To combine the predictions of base learners, don't vote: use a meta learner
  - Base learners: level-0 models
  - Meta learner: level-1 model
  - Predictions of the base learners are the input to the meta learner
- Base learners are usually different learning schemes
- Can't use predictions on the training data to generate the data for the level-1 model!
  - Instead use a cross-validation-like scheme
- Hard to analyze theoretically: "black magic"

More on stacking
- If the base learners can output probabilities, use those as input to the meta learner instead
- Which algorithm to use for the meta learner?
  - In principle, any learning scheme
  - Prefer a "relatively global, smooth" model
    - The base learners do most of the work
    - This reduces the risk of overfitting
- Stacking can be applied to numeric prediction too
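A sketch of the whole stacking scheme from the last two slides, using scikit-learn's cross_val_predict to build the level-1 training data from cross-validated probability estimates and logistic regression as a "relatively global, smooth" level-1 model. The particular base learners and all function names are illustrative choices:

import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def stacking_fit(X, y):
    level0 = [DecisionTreeClassifier(), GaussianNB(), KNeighborsClassifier()]
    # Level-1 training data: cross-validated probability estimates of the base learners,
    # never their predictions on the data they were trained on.
    meta_X = np.hstack([
        cross_val_predict(m, X, y, cv=5, method="predict_proba") for m in level0
    ])
    level1 = LogisticRegression(max_iter=1000).fit(meta_X, y)
    level0 = [m.fit(X, y) for m in level0]        # finally, refit the base learners on all the data
    return level0, level1

def stacking_predict(level0, level1, X):
    meta_X = np.hstack([m.predict_proba(X) for m in level0])
    return level1.predict(meta_X)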