Kansas State University, Department of Computing and Information Sciences
CIS 690: Implementation of High-Performance Data Mining Systems
Lecture 3: Combining Classifiers: Weighted Majority, Bagging, and Stacking
Friday, 23 May 2003
William H. Hsu, Department of Computing and Information Sciences, KSU
Readings:
– Section 7.5, Mitchell
– "Bagging, Boosting, and C4.5", Quinlan
– Section 5, "MLC++ Utilities 2.0", Kohavi and Sommerfield

Lecture Outline
Readings
– Section 7.5, Mitchell
– Section 5, MLC++ manual, Kohavi and Sommerfield
This Week's Paper Review: "Bagging, Boosting, and C4.5", J. R. Quinlan
Combining Classifiers
– Problem definition and motivation: improving accuracy in concept learning
– General framework: a collection of weak classifiers to be improved
Weighted Majority (WM)
– Weighting system for a collection of algorithms
– "Trusting" each algorithm in proportion to its training-set accuracy
– Mistake bound for WM
Bootstrap Aggregating (Bagging)
– Voting system for a collection of algorithms (trained on subsamples)
– When to expect bagging to work (unstable learners)
Next Lecture: Boosting the Margin, Hierarchical Mixtures of Experts

Combining Classifiers
Problem Definition
– Given:
  Training data set D for supervised learning, drawn from a common instance space X
  A collection of inductive learning algorithms and hypothesis languages (inducers)
– Hypotheses are produced by applying the inducers to s(D), where s: X → X′ is a sampling, transformation, or partitioning operator
  A hypothesis can be viewed as the definition of a prediction algorithm (a "classifier")
– Return: a new prediction algorithm (not necessarily in H) for x ∈ X that combines the outputs of the collection of prediction algorithms
Desired Properties
– Guarantees on the performance of the combined prediction
– e.g., mistake bounds; the ability to improve weak classifiers
Two Solution Approaches
– Train and apply each inducer; learn combiner function(s) from the results
– Train inducers and combiner function(s) concurrently
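To make the first solution approach concrete (train and apply each inducer, then learn the combiner from the results), here is a minimal Python sketch; the fit/predict estimator interface and all names are illustrative assumptions rather than anything from the lecture:

```python
def train_then_combine(inducers, combiner, X, y, sample_fns):
    """Single-pass framework: apply each inducer L_i to its own view
    s_i(D) of the data, then train a combiner on the pool's outputs."""
    classifiers = []
    for make_inducer, s in zip(inducers, sample_fns):
        Xi, yi = s(X, y)          # s: sample, transform, or partition D
        clf = make_inducer()
        clf.fit(Xi, yi)
        classifiers.append(clf)
    # the combiner learns from the pool's predictions on the original data
    Z = [[clf.predict([x])[0] for clf in classifiers] for x in X]
    combiner.fit(Z, y)
    return classifiers, combiner
```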

Principle: Improving Weak Classifiers
[Figure: mixture model combining two weak classifiers — panels show the decision regions of the first classifier, the second classifier, and both classifiers combined]

Framework: Data Fusion and Mixtures of Experts
What Is a Weak Classifier?
– One not guaranteed to do better than random guessing (1 / number of classes)
– Goal: combine multiple weak classifiers to get one at least as accurate as the strongest
Data Fusion
– Intuitive idea
  Multiple sources of data (sensors, domain experts, etc.)
  Need to combine them systematically and plausibly
– Solution approaches
  Control of intelligent agents: Kalman filtering
  General: mixture estimation (sources of data → predictions to be combined)
Mixtures of Experts
– Intuitive idea: "experts" express hypotheses (drawn from a hypothesis space)
– Solution approach (next time)
  Mixture model: estimate mixing coefficients
  Hierarchical mixture models: divide-and-conquer estimation method

Weighted Majority: Idea
Weight-Based Combiner
– Weighted votes: each prediction algorithm (classifier) h_i maps x ∈ X to h_i(x)
– The resulting prediction lies in the set of legal class labels
– NB: as with the Bayes Optimal Classifier, the resulting predictor is not necessarily in H
Intuitive Idea
– Collect votes from a pool of prediction algorithms for each training example
– Decrease the weight associated with each algorithm that guessed wrong (by a multiplicative factor)
– The combiner predicts the weighted-majority label
Performance Goals
– Improving training-set accuracy
  Want to combine weak classifiers
  Want to bound the number of mistakes in terms of the minimum made by any one algorithm
– Hope that this results in good generalization quality

Weighted Majority: Procedure
Algorithm Combiner-Weighted-Majority (D, L)
– n ← L.size // number of inducers in pool
– m ← D.size // number of examples
– FOR i ← 1 TO n DO
    P[i] ← L[i].Train-Inducer (D) // P[i]: ith prediction algorithm
    w_i ← 1 // initial weight
– FOR j ← 1 TO m DO // compute WM label for example j
    q_0 ← 0, q_1 ← 0
    FOR i ← 1 TO n DO
      IF P[i](D[j]) = 0 THEN q_0 ← q_0 + w_i // vote for 0 (−)
      IF P[i](D[j]) = 1 THEN q_1 ← q_1 + w_i // else vote for 1 (+)
    Prediction[j] ← (q_0 > q_1) ? 0 : ((q_0 = q_1) ? Random (0, 1) : 1)
    FOR i ← 1 TO n DO // penalize each inducer that guessed wrong
      IF P[i](D[j]) ≠ D[j].target THEN // c(x) ≡ D[j].target
        w_i ← β · w_i // β < 1
– RETURN Make-Predictor (w, P)
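The pseudocode above maps directly onto a short Python sketch (a minimal illustration, assuming binary labels and a pool of already-trained prediction functions; the names weighted_majority and beta are mine, not the slides'):

```python
import random

def weighted_majority(predictors, examples, targets, beta=0.5):
    """Weighted Majority over a pool of trained binary classifiers.

    predictors: prediction functions h_i(x) -> {0, 1}
    beta: multiplicative penalty (0 < beta < 1) applied to every
          predictor that mislabels an example
    Returns the final weights, a combined predictor, and the
    on-line mistake count of the combiner.
    """
    weights = [1.0] * len(predictors)
    mistakes = 0
    for x, y in zip(examples, targets):
        q0 = sum(w for w, h in zip(weights, predictors) if h(x) == 0)
        q1 = sum(w for w, h in zip(weights, predictors) if h(x) == 1)
        # the combiner's weighted-majority label, ties broken at random
        wm_label = 0 if q0 > q1 else 1 if q1 > q0 else random.randint(0, 1)
        mistakes += (wm_label != y)
        for i, h in enumerate(predictors):
            if h(x) != y:            # penalize each algorithm that guessed wrong
                weights[i] *= beta
    def combined(x):
        q0 = sum(w for w, h in zip(weights, predictors) if h(x) == 0)
        return 0 if q0 > sum(weights) - q0 else 1
    return weights, combined, mistakes
```

Note that every pool member that errs is penalized on every example, matching the "decrease the weight of each algorithm that guessed wrong" rule on the previous slide.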

Weighted Majority: Properties
Advantages of the WM Algorithm
– Can be adjusted incrementally (without retraining)
– Mistake bound for WM
  Let D be any sequence of training examples and L any set of inducers
  Let k be the minimum number of mistakes made on D by any L[i], 1 ≤ i ≤ n
  Property: the number of mistakes made on D by Combiner-Weighted-Majority is at most 2.4 (k + lg n)
Applying Combiner-Weighted-Majority to Produce a Test-Set Predictor
– Make-Predictor applies abstraction; it returns a funarg that takes an input x ∈ D_test
– Can be used for incremental learning (if c(x) is available for new x)
Generalizing Combiner-Weighted-Majority
– Different input to the inducers
  Can add an argument s to sample, transform, or partition D
  Replace P[i] ← L[i].Train-Inducer (D) with P[i] ← L[i].Train-Inducer (s(i, D))
  Still compute weights based on performance on D
– Can have q_c range over more than 2 class labels
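The 2.4 (k + lg n) bound follows from a short potential argument; here is a sketch in LaTeX for the penalty β = 1/2 (following Littlestone and Warmuth; 2.4 rounds 1 / lg(4/3) ≈ 2.41):

```latex
% Let W be the total weight of the pool; initially W = n.
% Whenever WM errs, at least half of W sat on wrong predictions and is
% halved, so each WM mistake shrinks W by a factor of at least 3/4.
% The best inducer errs only k times, so its weight alone is (1/2)^k.
\[
\left(\tfrac{1}{2}\right)^{k} \;\le\; W \;\le\; n \left(\tfrac{3}{4}\right)^{M}
\quad\Longrightarrow\quad
M \;\le\; \frac{k + \lg n}{\lg \tfrac{4}{3}} \;\approx\; 2.41\,(k + \lg n).
\]
```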

Bagging: Idea
Bootstrap Aggregating, aka Bagging
– Application of bootstrap sampling
  Given: set D containing m training examples
  Create S[i] by drawing m examples at random with replacement from D
  S[i] of size m is expected to leave out about 0.37 of the examples in D, since (1 − 1/m)^m → 1/e ≈ 0.368
– Bagging
  Create k bootstrap samples S[1], S[2], …, S[k]
  Train a distinct inducer on each S[i] to produce k classifiers
  Classify a new instance by classifier vote (equal weights)
Intuitive Idea
– "Two heads are better than one"
– Produce multiple classifiers from one data set
  NB: the same inducer (multiple instantiations) or different inducers may be used
  Differences among the samples "smooth out" the sensitivity of L, H to D
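A quick empirical check of the 0.37 figure in Python (a throwaway sketch, not part of the lecture):

```python
import random

m = 100_000
# draw m indices with replacement; the distinct ones form the bootstrap sample
bootstrap = {random.randrange(m) for _ in range(m)}
print(f"fraction of D left out: {1 - len(bootstrap) / m:.3f}")  # about 0.368
```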

Bagging: Procedure
Algorithm Combiner-Bootstrap-Aggregation (D, L, k)
– FOR i ← 1 TO k DO
    S[i] ← Sample-With-Replacement (D, m)
    Train-Set[i] ← S[i]
    P[i] ← L[i].Train-Inducer (Train-Set[i])
– RETURN (Make-Predictor (P, k))
Function Make-Predictor (P, k)
– RETURN (fn x → Predict (P, k, x))
Function Predict (P, k, x)
– FOR i ← 1 TO k DO
    Vote[i] ← P[i](x)
– RETURN (the plurality label: argmax over labels of their vote counts)
Function Sample-With-Replacement (D, m)
– RETURN (m data points sampled i.i.d. uniformly from D)
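A minimal Python sketch of the procedure, assuming scikit-learn-style estimators with fit/predict (the function names are mine):

```python
import random
from collections import Counter

def bagging_fit(make_inducer, X, y, k):
    """Train k classifiers, each on a bootstrap sample of (X, y)."""
    m = len(X)
    classifiers = []
    for _ in range(k):
        idx = [random.randrange(m) for _ in range(m)]  # sample with replacement
        clf = make_inducer()      # fresh instantiation of the inducer each time
        clf.fit([X[j] for j in idx], [y[j] for j in idx])
        classifiers.append(clf)
    return classifiers

def bagging_predict(classifiers, x):
    """Equal-weight vote: return the plurality label over the k classifiers."""
    votes = Counter(clf.predict([x])[0] for clf in classifiers)
    return votes.most_common(1)[0][0]
```

For instance, with an unstable base learner such as a decision tree: classifiers = bagging_fit(lambda: DecisionTreeClassifier(), X, y, k=50), using from sklearn.tree import DecisionTreeClassifier.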

Bagging: Properties
Experiments
– [Breiman, 1996]: Given a sample S of labeled data, repeat 100 times and report the average:
  1. Divide S randomly into a test set D_test (10%) and a training set D_train (90%)
  2. Learn a decision tree from D_train; e_S ← error of the tree on D_test
  3. Repeat 50 times: create a bootstrap sample S[i] from D_train, learn a decision tree, prune it using D_train; e_B ← error of the majority vote of the trees on D_test
– [Quinlan, 1996]: Results on data sets from the UCI Machine Learning Database Repository
When Should This Help?
– When the learner is unstable
  A small change to the training set causes a large change in the output hypothesis
  True for decision trees and neural networks; not true for k-nearest neighbor
– Experimentally, bagging can help substantially for unstable learners, but can somewhat degrade results for stable learners

Bagging: Continuous-Valued Data
Voting System: Discrete-Valued Target Function Assumed
– The assumption is used for WM (the version described here) as well
  Weighted vote
  Discrete choices
– Stacking generalizes to continuous-valued targets iff the combiner inducer does
Generalizing Bagging to Continuous-Valued Target Functions
– Use the mean, not the mode (aka argmax, majority vote), to combine classifier outputs
– Mean = expected value: φ_A(x) = E_D[φ(x, D)], where φ(x, D) is the base classifier and φ_A(x) is the aggregated classifier
– E_D[(y − φ(x, D))²] = y² − 2y · E_D[φ(x, D)] + E_D[φ²(x, D)]
  Using E_D[φ(x, D)] = φ_A(x) and E[Z²] ≥ (E[Z])², we get E_D[(y − φ(x, D))²] ≥ (y − φ_A(x))²
  Therefore, we expect lower squared error for the bagged predictor φ_A (see the derivation below)
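Written out in LaTeX, the argument on this slide (with φ the base predictor and φ_A its average over D) is:

```latex
\[
E_D\!\left[(y - \varphi(x, D))^2\right]
  = y^2 - 2y\,E_D[\varphi(x, D)] + E_D\!\left[\varphi^2(x, D)\right]
\]
% Apply E_D[\varphi(x, D)] = \varphi_A(x) and Jensen's inequality
% E[Z^2] \ge (E[Z])^2 to the last term:
\[
E_D\!\left[(y - \varphi(x, D))^2\right]
  \;\ge\; y^2 - 2y\,\varphi_A(x) + \varphi_A(x)^2
  \;=\; \left(y - \varphi_A(x)\right)^2 .
\]
```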

Stacked Generalization: Idea
Stacked Generalization, aka Stacking
Intuitive Idea
– Train multiple learners
  Each uses a subsample of D
  May be an ANN, a decision tree, etc.
– Train the combiner on a validation segment
– See [Wolpert, 1992; Bishop, 1995]
[Figure: stacked generalization network — base combiner inducers map inputs (x_11, x_12 and x_21, x_22) to predictions, and a top-level combiner maps those predictions to the final output y]

Stacked Generalization: Procedure
Algorithm Combiner-Stacked-Gen (D, L, k, n, m′, Levels)
– Divide D into k segments S[1], S[2], …, S[k] // Assert D.size = m
– FOR i ← 1 TO k DO
    Validation-Set ← S[i] // m/k examples
    FOR j ← 1 TO n DO
      Train-Set[j] ← Sample-With-Replacement (D − S[i], m′) // m − m/k examples
      IF Levels > 1 THEN
        P[j] ← Combiner-Stacked-Gen (Train-Set[j], L, k, n, m′, Levels − 1)
      ELSE // base case: 1 level
        P[j] ← L[j].Train-Inducer (Train-Set[j])
    Combiner ← L[0].Train-Inducer (Validation-Set.targets, Apply-Each (P, Validation-Set.inputs))
– Predictor ← Make-Predictor (Combiner, P)
– RETURN Predictor
Function Sample-With-Replacement: same as for Bagging
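A single-level, single-fold Python sketch of the idea (the k-fold rotation and the recursive Levels > 1 case are omitted; the estimator interface and names are illustrative assumptions):

```python
def stacked_gen_fit(base_makers, make_combiner, X, y, holdout_frac=0.2):
    """Train base inducers on the training segment, then train the
    combiner on their predictions over the held-out validation segment."""
    cut = int(len(X) * (1 - holdout_frac))
    X_train, y_train = X[:cut], y[:cut]
    X_val, y_val = X[cut:], y[cut:]       # validation segment for the combiner
    bases = []
    for make in base_makers:
        clf = make()
        clf.fit(X_train, y_train)
        bases.append(clf)
    # level-1 inputs: each base classifier's prediction on the validation set
    Z_val = [[clf.predict([x])[0] for clf in bases] for x in X_val]
    combiner = make_combiner()
    combiner.fit(Z_val, y_val)
    def predict(x):
        z = [clf.predict([x])[0] for clf in bases]
        return combiner.predict([z])[0]
    return predict
```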

Stacked Generalization: Properties
Similar to Cross-Validation
– k-fold: rotate the validation set
– The combiner mechanism is based on the validation set as well as the training set
  Compare: committee-based combiners [Perrone and Cooper, 1993; Bishop, 1995], aka consensus under uncertainty / fuzziness, consensus models
  Common application with cross-validation: treat as an overfitting control method
– Usually improves generalization performance
Can Apply Recursively (Hierarchical Combiner)
– Adapts to inducers on different subsets of the input
  Can apply s(Train-Set[j]) to transform each input data set
  e.g., attribute partitioning [Hsu, 1998; Hsu, Ray, and Wilkins, 2000]
– Compare: Hierarchical Mixtures of Experts (HME) [Jordan et al, 1991]
  Many differences (validation-based vs. mixture estimation; online vs. offline)
  Some similarities (hierarchical combiner)

Other Combiners
So Far: Single-Pass Combiners
– First, train each inducer
– Then, train the combiner on their outputs and evaluate based on a criterion
  Weighted majority: training-set accuracy
  Bagging: training-set accuracy
  Stacking: validation-set accuracy
– Finally, apply the combiner function to get a new prediction algorithm (classifier)
  Weighted majority: weight coefficients (penalized based on mistakes)
  Bagging: voting committee of classifiers
  Stacking: validated hierarchy of classifiers with a trained combiner inducer
Next: Multi-Pass Combiners
– Train inducers and combiner function(s) concurrently
– Learn how to divide and balance the learning problem across multiple inducers
– Framework: mixture estimation

Terminology
Combining Classifiers
– Weak classifiers: not guaranteed to do better than random guessing
– Combiners: functions f: prediction vector × instance → prediction
Single-Pass Combiners
– Weighted Majority (WM)
  Weights the prediction of each inducer according to its training-set accuracy
  Mistake bound: maximum number of mistakes before converging to the correct h
  Incrementality: ability to update parameters without complete retraining
– Bootstrap Aggregating (aka Bagging)
  Takes a vote among multiple inducers trained on different samples of D
  Subsampling: drawing one sample from another (D′ drawn from D)
  Unstable inducer: a small change to D causes a large change in h
– Stacked Generalization (aka Stacking)
  Hierarchical combiner: can apply recursively to re-stack
  Trains the combiner inducer using a validation set

Summary Points
Combining Classifiers
– Problem definition and motivation: improving accuracy in concept learning
– General framework: a collection of weak classifiers to be improved (data fusion)
Weighted Majority (WM)
– Weighting system for a collection of algorithms
  Weights each algorithm in proportion to its training-set accuracy
  Uses this weight in the performance element (and on test-set predictions)
– Mistake bound for WM
Bootstrap Aggregating (Bagging)
– Voting system for a collection of algorithms
– Training set for each member: sampled with replacement
– Works for unstable inducers
Stacked Generalization (aka Stacking)
– Hierarchical system for combining inducers (ANNs or other inducers)
– Training sets for the "leaves": sampled with replacement; for the combiner: a validation set
Next Lecture: Boosting the Margin, Hierarchical Mixtures of Experts