CS 540 - Fall 2016 (© Jude Shavlik), Lecture 7, Week 4


Today's Topics
- Ensembles
- Decision Forests (actually, Random Forests)
- Bagging and Boosting
- Decision Stumps
- Feature Selection
- ID3 as Searching a Space of Possible Solutions
- ID3 Wrapup

Ensembles (Bagging, Boosting, and all that)
- Old View: learn one good model
- New View: learn a good set of models (Naïve Bayes, k-NN, neural net, d-tree, SVM, etc)
- Probably the best example of the interplay between 'theory & practice' in machine learning

Ensembles of Neural Networks (or any supervised learner)
[Diagram: INPUT feeds several networks in parallel; their outputs feed a Combiner that produces the OUTPUT]
- Ensembles often produce accuracy gains of 5-10 percentage points!
- Can combine 'classifiers' of various types, eg, decision trees, rule sets, neural networks, etc

Combining Multiple Models
Three ideas for combining predictions:
- Simple (unweighted) votes - the standard choice
- Weighted votes - eg, weight by tuning-set accuracy
- Learn a combining function - prone to overfitting? 'Stacked generalization' (Wolpert)
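
A minimal sketch (added here, not from the original slides) of the first two combining schemes; 'models' is assumed to be a list of objects with a predict(x) method, and 'weights' could be each model's tuning-set accuracy:

from collections import defaultdict

def ensemble_vote(models, x, weights=None):
    # simple (unweighted) vote when no weights are given
    if weights is None:
        weights = [1.0] * len(models)
    tally = defaultdict(float)
    for model, w in zip(models, weights):
        tally[model.predict(x)] += w        # each model adds its weight to its predicted label
    return max(tally, key=tally.get)        # label with the largest total weight wins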

Random Forests (Breiman, Machine Learning 2001; related to Ho, 1995)
A variant of something called BAGGING ('multi-sets')
Let N = # of examples, F = # of features, i = some number << F
Algorithm: repeat k times
- Draw with replacement N examples, put in train set
- Build d-tree, but in each recursive call
  - Choose (w/o replacement) i features
  - Choose the best of these i as the root of this (sub)tree
- Do NOT prune
In HW2, we give you 101 'bootstrapped' samples of the WINE dataset
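
A minimal sketch of the training loop above; build_tree(examples, num_candidate_features) stands in for an ID3 variant that, at each recursive split, considers only a random subset of that many features and does no pruning (a hypothetical helper, not course code):

import random

def random_forest(train_set, k, i, build_tree):
    forest = []
    n = len(train_set)
    for _ in range(k):
        # bootstrap sample: draw N examples WITH replacement
        bootstrap = [random.choice(train_set) for _ in range(n)]
        forest.append(build_tree(bootstrap, num_candidate_features=i))
    return forest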

Drawing with Replacement vs Drawing w/o Replacement
<show on board>
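
Since the comparison was shown on the board, here is a small stand-in example (an added illustration): drawing WITH replacement can repeat some items and miss others, which is what makes each bootstrapped training set different, while drawing N items WITHOUT replacement from N items merely reorders the original set.

import random

examples = ['a', 'b', 'c', 'd', 'e']

with_replacement = [random.choice(examples) for _ in range(len(examples))]
without_replacement = random.sample(examples, len(examples))

print(with_replacement)     # eg ['b', 'e', 'b', 'a', 'e'] - duplicates appear, 'c' and 'd' are missed
print(without_replacement)  # always a permutation of ['a', 'b', 'c', 'd', 'e']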

Using Random Forests
- After training we have K decision trees
- How to use on TEST examples? Some variant of: if at least L of these K trees say 'true' then output 'true'
- How to choose L? Use a tune set to decide!
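
A minimal sketch of choosing L on a tuning set; it assumes each tree has a predict(example) method returning True/False and that tune_set is a list of (example, label) pairs (both assumptions of this illustration):

def choose_threshold(forest, tune_set):
    best_L, best_acc = 1, -1.0
    for L in range(1, len(forest) + 1):
        correct = 0
        for example, label in tune_set:
            votes = sum(tree.predict(example) for tree in forest)  # number of trees saying 'true'
            correct += ((votes >= L) == label)
        acc = correct / len(tune_set)
        if acc > best_acc:
            best_L, best_acc = L, acc
    return best_L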

More on Random Forests
Increasing i
- Increases correlation among individual trees (BAD)
- Also increases accuracy of individual trees (GOOD)
- Can also use a tuning set to choose a good value for i
Overall, random forests
- Are very fast (eg, 50K examples, 10 features, 10 trees/min on a 1 GHz CPU back in 2004)
- Deal well with large # of features
- Reduce overfitting substantially; NO NEED TO PRUNE!
- Work very well in practice
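
The same tuning-set idea can pick i; this sketch builds on the hypothetical random_forest() and choose_threshold() helpers sketched above and is likewise only an illustration:

def forest_accuracy(forest, L, data):
    correct = sum((sum(tree.predict(x) for tree in forest) >= L) == y for x, y in data)
    return correct / len(data)

def choose_i(train_set, tune_set, k, candidate_is, build_tree):
    best_i, best_acc = None, -1.0
    for i in candidate_is:                                      # eg [1, 2, 4, 8] when F = 10
        forest = random_forest(train_set, k, i, build_tree)     # sketched earlier
        L = choose_threshold(forest, tune_set)                  # sketched earlier
        acc = forest_accuracy(forest, L, tune_set)
        if acc > best_acc:
            best_i, best_acc = i, acc
    return best_i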

HW2 - Programming Portion
- You will simply run your ID3 on 101 'drawn-with-replacement' copies of the WINE train set (feel free to implement the full random-forest idea)
- Use the WINE tune set to choose the best L in: if at least L of these 101 trees say 'true' then output 'true'
- Evaluate on the WINE test set

Three Explanations of Why Ensembles Help
- Statistical (sample effects)
- Computational (limited cycles for search)
- Representational (wrong hypothesis space)
From: Dietterich, T. G. (2002). Ensemble Learning. In The Handbook of Brain Theory and Neural Networks, Second Edition (M. A. Arbib, Ed.), Cambridge, MA: The MIT Press, pp. 405-408.
[Figure: the concept space considered, with a key marking the true concept, the learned models, and the search path]

A Relevant Early Paper on ENSEMBLES
Hansen & Salamon, PAMI 12, 1990
If (a) the combined predictors have errors that are independent of one another, and (b) the probability that any given model correctly predicts any given test-set example is > 50%, then the error of a majority vote over the models goes to zero as the number of models grows.
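
To make the claim concrete (an added calculation, not part of the slide): if each of K models is independently correct with probability p > 0.5 on a test example, a majority vote errs only when more than half of the models err, so

P(\text{majority vote wrong}) = \sum_{k > K/2} \binom{K}{k} (1-p)^{k}\, p^{K-k}

which shrinks toward zero as K grows. For instance, with p = 0.7 and K = 21 the majority vote is wrong on fewer than 3% of examples.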

Some More Relevant Early Papers
Schapire, Machine Learning 5, 1990 ('Boosting')
- If you have an algorithm that gets > 50% on any distribution of examples, you can create an algorithm that gets > (100% - ε), for any ε > 0
- Need an infinite (or very large, at least) source of examples
- Later extensions (eg, AdaBoost) address this weakness
Also see Wolpert, 'Stacked Generalization,' Neural Networks, 1992

Some Methods for Producing 'Uncorrelated' Members of an Ensemble
- K times randomly choose (with replacement) N examples from a training set of size N; give each training set to a std ML algo
  - 'Bagging' by Breiman (Machine Learning, 1996)
  - Want unstable algorithms (so learned models vary)
- Reweight examples each cycle (if wrong, increase weight; else decrease weight)
  - 'AdaBoosting' by Freund & Schapire (1995, 1996)
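
A minimal sketch of the reweighting idea in AdaBoost style (an added illustration, not the exact algorithm covered in lecture); learn(examples, labels, weights) stands in for any weak learner that accepts per-example weights and returns a model with a predict(x) method:

import math

def boost(examples, labels, learn, rounds):
    n = len(examples)
    weights = [1.0 / n] * n
    ensemble = []                                       # list of (model, vote_weight) pairs
    for _ in range(rounds):
        model = learn(examples, labels, weights)
        wrong = [model.predict(x) != y for x, y in zip(examples, labels)]
        error = sum(w for w, bad in zip(weights, wrong) if bad)
        if error == 0 or error >= 0.5:                  # weak-learner assumption violated
            break
        alpha = 0.5 * math.log((1 - error) / error)     # this model's vote weight
        # if wrong, increase weight; else decrease weight
        weights = [w * math.exp(alpha if bad else -alpha)
                   for w, bad in zip(weights, wrong)]
        total = sum(weights)
        weights = [w / total for w in weights]          # renormalize
        ensemble.append((model, alpha))
    return ensemble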

Stable Algorithms
An algorithm is stable if small changes to the training data mean small changes to the learned model
- Are d-trees stable? NO
- What about k-NN? (Recall Voronoi diagrams) YES

Stable Algorithms (2)
An idea from the stats community
- D-trees unstable since one different example can change the root
- k-NN stable since the impact of examples is local
- Ensembles work best with unstable algos since we want the N learned models to differ

Empirical Studies (from Freund & Schapire; reprinted in Dietterich's AI Magazine paper)
[Scatter plot: error rate of bagged (boosted) C4.5 vs error rate of C4.5 (an ID3 successor); each point is one data set] Boosting and Bagging helped almost always!
[Scatter plot: error rate of AdaBoost vs error rate of Bagging] On average, Boosting slightly better?

Some More Methods for Producing 'Uncorrelated' Members of an Ensemble
- Directly optimize accuracy + diversity: Opitz & Shavlik (1995; used genetic algo's), Melville & Mooney (2004-5)
- Different number of hidden units in a neural network, different k in k-NN, tie-breaking scheme, example ordering, diff ML algos, etc: various people
See the 2005-2008 papers of Rich Caruana's group for large-scale empirical studies of ensembles

Boosting/Bagging/etc Wrapup
- An easy-to-use and usually highly effective technique; always consider it (Bagging, at least) when applying ML to practical problems
- Does reduce the 'comprehensibility' of models, though see work by Craven & Shavlik on 'rule extraction'
- Increases runtime, but cycles are usually much cheaper than examples (and easily parallelized)

Decision 'Stumps' (formerly part of HW; try on your own!)
Holte (ML journal) compared:
- Decision trees with only one decision (decision stumps), vs
- Trees produced by C4.5 (with its pruning algorithm used)
Decision 'stumps' do remarkably well on UC Irvine data sets
- Archive too easy? Some datasets seem to be
Decision stumps are a 'quick and dirty' control for comparing to new algorithms, but ID3/C4.5 is easy to use and probably a better control
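
A minimal sketch (an added illustration, not Holte's exact 1R) of learning a decision stump over Boolean features: try every single-feature test, predict the majority label on each side, and keep the feature whose one-question tree gets the most training examples right.

from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def learn_stump(examples, labels):
    # examples: list of Boolean feature vectors; labels: class labels
    best = None
    for f in range(len(examples[0])):
        true_side  = [y for x, y in zip(examples, labels) if x[f]]
        false_side = [y for x, y in zip(examples, labels) if not x[f]]
        pred_t = majority(true_side) if true_side else majority(labels)
        pred_f = majority(false_side) if false_side else majority(labels)
        correct = sum(y == (pred_t if x[f] else pred_f)
                      for x, y in zip(examples, labels))
        if best is None or correct > best[0]:
            best = (correct, f, pred_t, pred_f)
    _, f, pred_t, pred_f = best
    return lambda x: pred_t if x[f] else pred_f          # the one-decision 'tree'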

C4.5 Compared to 1R ('Decision Stumps') - Testset Accuracy

Dataset   C4.5     1R
BC        72.0%    68.7%
CH        99.2%    -
GL        63.2%    67.6%
G2        74.3%    53.8%
HD        73.6%    72.9%
HE        81.2%    76.3%
HO        83.6%    81.0%
HY        99.1%    97.2%
IR        93.8%    93.5%
LA        77.2%    71.5%
LY        77.5%    70.7%
MU       100.0%    98.4%
SE        97.7%    95.0%
SO        97.5%    -
VO        95.6%    95.2%
V1        89.4%    86.8%

(Only one value survives in the transcript for the CH and SO rows.)
See the Holte paper in Machine Learning for the dataset key (eg, HD = heart disease)

Feature Selection
Sometimes we want to preprocess our dataset before running an ML algo to select a good set of features
- Simple idea: collect the i features with the most infoGain (over all the training examples)
- Weakness: redundancy (consider duplicating the best-scoring feature; it will also score well!)
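
A minimal sketch of the 'simple idea' for Boolean features, using the standard entropy-based information gain from the ID3 lectures (an added illustration):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, labels, f):
    on  = [y for x, y in zip(examples, labels) if x[f]]
    off = [y for x, y in zip(examples, labels) if not x[f]]
    remainder = (len(on) * entropy(on) + len(off) * entropy(off)) / len(labels)
    return entropy(labels) - remainder

def top_i_features(examples, labels, i):
    # score every feature independently, then keep the i highest-scoring ones
    scores = [(info_gain(examples, labels, f), f) for f in range(len(examples[0]))]
    return [f for _, f in sorted(scores, reverse=True)[:i]]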

D-Trees as Feature Selectors
In feature selection, we want features that distinguish examples of the various categories, but we don't want redundant features, and we want features that cover all the training examples
D-trees do just that! They pick informative features CONDITIONED on the features chosen so far, until all examples are covered
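
A minimal sketch of this idea with scikit-learn (an added illustration; the course's own ID3 would work the same way): fit a tree, then keep exactly the features it split on.

from sklearn.tree import DecisionTreeClassifier

def tree_selected_features(X, y):
    tree = DecisionTreeClassifier().fit(X, y)
    # tree_.feature gives the split feature of each node (negative entries mark leaves)
    return sorted({int(f) for f in tree.tree_.feature if f >= 0})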

ID3 Recap: Questions Addressed
- How closely should we fit the training data? Fit it completely, then prune, using tuning sets to score candidates; or learn forests and there is no need to prune! Why?
- How do we judge features? Use info theory (Shannon)
- What if a feature has many values? Convert it to Boolean-valued features
- D-trees can also handle missing feature values (but we won't cover this for d-trees)
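
A minimal sketch of the 'convert to Boolean-valued features' step; the feature name 'color' and its values are made up for illustration.

def to_boolean_features(examples, feature, values):
    # examples: list of dicts mapping feature names to values, eg {'color': 'red', ...}
    converted = []
    for ex in examples:
        new_ex = {k: v for k, v in ex.items() if k != feature}
        for v in values:
            new_ex[f'{feature}={v}'] = (ex[feature] == v)   # eg 'color=red' -> True/False
        converted.append(new_ex)
    return converted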

ID3 Recap (cont.)
Looks like a d-tree!
- What if some features cost more to evaluate (eg, a CAT scan vs a temperature reading)? Use an ad-hoc correction factor
- Best way to use in an ensemble? Random forests often perform quite well
- Batch vs incremental (aka online) learning? ID3 is basically a 'batch' approach; incremental variants exist, but since ID3 is so fast, why not simply rerun it 'from scratch' whenever a mistake is made?

ID3 Recap (cont.)
- What about real-valued outputs? Could learn a linear approximation for various regions of the feature space [figure: a Venn-style diagram with a different linear model per region, eg 3*f1 - f2, f1 + 2*f2, f4]
- How rich is our language for describing examples? Limited to fixed-length feature vectors (but they are surprisingly effective)

Summary of ID3
Strengths
- Good technique for learning models from 'real world' (eg, noisy) data
- Fast, simple, and robust
- Potentially considers the complete hypothesis space
- Successfully applied to many real-world tasks
- Results (trees or rules) are human-comprehensible
- One of the most widely used techniques in data mining

Summary of ID3 (cont.)
Weaknesses
- Requires fixed-length feature vectors
- Only makes axis-parallel (univariate) splits
- Not designed to make probabilistic predictions
- Non-incremental
- Hill-climbing algorithm (poor early decisions can be disastrous)
However, extensions exist

A Sample Search Tree - so we can use another search method besides hill climbing (the 'greedy' algo)
- Nodes are PARTIALLY COMPLETE D-TREES
- Expand the 'left most' (in yellow) question mark (?) of the current node
- All possible trees can be generated (given the thresholds 'implied' by the real values in the train set)
[Figure: a search tree whose root is a single '?' leaf; operators such as 'create leaf node (+ or -)' and 'add F1', ..., 'add FN' expand the left-most '?', and expansion continues below the best-scoring feature (F2, by assumption)]

Viewing ID3 as a Search Algorithm
- Search Space
- Operators
- Search Strategy
- Heuristic Function
- Start Node
- Goal Node

Viewing ID3 as a Search Algorithm
- Search Space: the space of all decision trees constructible using the current feature set
- Operators: add a node (ie, grow the tree)
- Search Strategy: hill climbing
- Heuristic Function: information gain (other d-tree algo's use similar 'purity measures')
- Start Node: an isolated leaf node marked '?'
- Goal Node: a tree that separates all the training data ('post pruning' may be done later to reduce overfitting)

What We've Covered So Far
Algo's
- Supervised ML algorithms: instance-based (kNN), logic-based (ID3, decision stumps), ensembles (Random Forests, Bagging, Boosting)
Methodology Issues
- Train/Tune/Test Sets, N-Fold Cross Validation
- Feature Space, (Greedily) Searching Hypothesis Spaces
- Parameter Tuning ('Model Selection'), Feature Selection (info gain)
- Dealing w/ Real-Valued and Hierarchical Features
- Overfitting Reduction, Occam's Razor
- Fixed-Length Feature Vectors, Graph/Logic-Based Reps of Examples
- Understandability of Learned Models, 'Generalizing not Memorizing'
- Briefly: Missing Feature Values, Stability (to small changes in training sets)