CS 540 - Fall 2015 (© Jude Shavlik), Lecture 7, Week 3


Today's Topics (9/22/15)
- Ensembles
- Decision Forests (actually, Random Forests)
- Bagging and Boosting
- Decision Stumps

Ensembles (Bagging, Boosting, and All That)
- Old view: learn one good model
- New view: learn a good set of models
- Probably the best example of the interplay between 'theory & practice' in machine learning
Naïve Bayes, k-NN, neural net, d-tree, SVM, etc

Ensembles of Neural Networks (or Any Supervised Learner)
[Diagram: INPUT feeds several networks in parallel; a combiner merges their outputs into one OUTPUT]
- Ensembles often produce accuracy gains of 5-10 percentage points!
- Can combine "classifiers" of various types, eg, decision trees, rule sets, neural networks, etc.

Three Explanations of Why Ensembles Help
- Statistical (sample effects)
- Computational (limited cycles for search)
- Representational (wrong hypothesis space)
From: Dietterich, T. G. (2002). Ensemble learning. In The Handbook of Brain Theory and Neural Networks, Second edition (M. A. Arbib, Ed.), Cambridge, MA: The MIT Press, pp. 405-408.
[Figure: the concept space considered, marking the true concept, the learned models, and their search paths]

Combining Multiple Models
Three ideas for combining predictions
- Simple (unweighted) votes: the standard choice
- Weighted votes: eg, weight by tuning-set accuracy
- Learn a combining function: prone to overfitting? 'Stacked generalization' (Wolpert)
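As a concrete illustration of the first two schemes, here is a minimal Python sketch (my own, not from the slides; the labels, weights, and numbers are made up):

```python
# Hypothetical sketch: combine the class labels predicted by the ensemble
# members with (a) a simple majority vote and (b) a vote weighted by each
# member's tuning-set accuracy.
from collections import Counter

def majority_vote(predictions):
    """predictions: one predicted label per ensemble member."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    """weights: eg, each member's accuracy on a held-out tuning set."""
    totals = {}
    for label, w in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get)

# Three classifiers vote on one test example.
preds = ['true', 'false', 'true']
tuning_accuracies = [0.80, 0.65, 0.70]
print(majority_vote(preds))                     # 'true'
print(weighted_vote(preds, tuning_accuracies))  # 'true' (weight 1.50 vs 0.65)
```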

Random Forests (Breiman, Machine Learning 2001; related to Ho, 1995)
A variant of something called BAGGING ('multi-sets')

Let N = # of examples, F = # of features, i = some number << F

Algorithm: repeat k times
- Draw with replacement N examples, put in train set
- Build d-tree, but in each recursive call
  - Choose (w/o replacement) i features
  - Choose the best of these i as the root of this (sub)tree
- Do NOT prune

In HW2, we'll give you 101 'bootstrapped' samples of the Thoracic Surgery Dataset
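Below is a small Python sketch of the algorithm on this slide, written from its pseudocode; the purity-based split score is a simple stand-in for the information-gain test an actual implementation (or HW2 solution) would use.

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(examples, features, i):
    """examples: list of (feature_dict, label); grow the tree fully (no pruning)."""
    labels = [y for _, y in examples]
    if len(set(labels)) == 1 or not features:
        return ('leaf', majority(labels))
    # Choose (without replacement) i candidate features for this call only.
    candidates = random.sample(features, min(i, len(features)))

    def split_score(f):
        # How many training examples a one-level split on f gets right;
        # a crude stand-in for information gain.
        groups = {}
        for x, y in examples:
            groups.setdefault(x[f], []).append(y)
        return sum(ys.count(majority(ys)) for ys in groups.values())

    best = max(candidates, key=split_score)
    remaining = [f for f in features if f != best]
    branches = {}
    for value in set(x[best] for x, _ in examples):
        subset = [(x, y) for x, y in examples if x[best] == value]
        branches[value] = build_tree(subset, remaining, i)
    return ('node', best, branches, majority(labels))

def random_forest(examples, features, k, i):
    """N = # examples, F = len(features), i << F; build k unpruned trees."""
    N = len(examples)
    forest = []
    for _ in range(k):
        bootstrap = [random.choice(examples) for _ in range(N)]  # draw N with replacement
        forest.append(build_tree(bootstrap, features, i))
    return forest
```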

Using Random Forests
- After training we have K decision trees
- How to use on TEST examples? Some variant of: if at least L of these K trees say 'true', then output 'true'
- How to choose L? Use a tune set to decide
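Continuing the sketch from the previous slide (and assuming boolean labels stored as the strings 'true'/'false'), prediction and the tuning-set choice of L might look like this:

```python
def classify(tree, x):
    """Walk one tree; fall back to the node's majority label on unseen values."""
    if tree[0] == 'leaf':
        return tree[1]
    _, feature, branches, default = tree
    child = branches.get(x.get(feature))
    return classify(child, x) if child is not None else default

def forest_predict(forest, x, L):
    """Output 'true' iff at least L of the K trees say 'true'."""
    votes = sum(1 for tree in forest if classify(tree, x) == 'true')
    return 'true' if votes >= L else 'false'

def choose_L(forest, tune_set):
    """Try every threshold 1..K and keep the one with the best tuning-set accuracy."""
    K = len(forest)
    def correct(L):
        return sum(forest_predict(forest, x, L) == y for x, y in tune_set)
    return max(range(1, K + 1), key=correct)
```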

More on Random Forests
Increasing i
- Increases correlation among individual trees (BAD)
- Also increases accuracy of individual trees (GOOD)
- Can also use a tuning set to choose a good value for i
Overall, random forests
- Are very fast (eg, 50K examples, 10 features, 10 trees/min on a 1 GHz CPU back in 2004)
- Deal well with a large # of features
- Reduce overfitting substantially; NO NEED TO PRUNE!
- Work very well in practice
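The same tuning-set idea extends to i: retrain the forest for a few candidate values and keep whichever scores best (again continuing the sketch above; `candidate_is` is just an illustrative argument name):

```python
def choose_i(train_set, tune_set, features, k, candidate_is):
    """candidate_is: eg, [1, 2, int(len(features) ** 0.5)]."""
    def tune_correct(i):
        forest = random_forest(train_set, features, k, i)
        L = choose_L(forest, tune_set)
        return sum(forest_predict(forest, x, L) == y for x, y in tune_set)
    return max(candidate_is, key=tune_correct)
```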

A Relevant Early Paper on ENSEMBLES
Hansen & Salamon, PAMI:12, 1990
If (a) the combined predictors have errors that are independent of one another, and (b) the probability that any given model correctly predicts any given testset example is > 50%, then the testset error of the majority vote falls toward zero as more models are added.
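The conclusion appears as a plot on the original slide; written out (my addition, for an odd number K of models each erring independently with rate ε < 0.5), the majority vote errs only when more than half the models err:

\[
\Pr(\text{majority vote errs}) \;=\; \sum_{m > K/2} \binom{K}{m}\,\varepsilon^{m}\,(1-\varepsilon)^{K-m} \;\longrightarrow\; 0 \quad \text{as } K \to \infty .
\]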

Some More Relevant Early Papers
Schapire, Machine Learning:5, 1990 ('Boosting')
- If you have an algorithm that gets > 50% on any distribution of examples, you can create an algorithm that gets > (100% - ε), for any ε > 0
- Need an infinite (or very large, at least) source of examples; later extensions (eg, AdaBoost) address this weakness
Also see Wolpert, 'Stacked Generalization,' Neural Networks, 1992

Some Methods for Producing 'Uncorrelated' Members of an Ensemble
- K times, randomly choose (with replacement) N examples from a training set of size N; give each training set to a std ML algo
  - 'Bagging' by Breiman (Machine Learning, 1996)
  - Want unstable algorithms (so learned models vary)
- Reweight examples each cycle (if wrong, increase weight; else decrease weight)
  - 'AdaBoosting' by Freund & Schapire (1995, 1996)
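The reweighting step is the heart of AdaBoost; here is a hedged Python sketch of one common form of it (the base learner `learn`, which must accept per-example weights and return a ±1-valued model, is assumed here rather than given on the slide):

```python
import math

def boost(examples, labels, learn, T):
    """labels in {-1, +1}; learn(examples, labels, weights) -> model, model(x) in {-1, +1}."""
    N = len(examples)
    weights = [1.0 / N] * N
    ensemble = []                              # list of (alpha, model) pairs
    for _ in range(T):
        model = learn(examples, labels, weights)
        # Weighted training error of this round's model.
        err = sum(w for w, x, y in zip(weights, examples, labels) if model(x) != y)
        if err == 0 or err >= 0.5:             # perfect, or no better than chance: stop
            break
        alpha = 0.5 * math.log((1 - err) / err)
        # If wrong, increase weight; else decrease weight; then renormalize.
        weights = [w * math.exp(alpha if model(x) != y else -alpha)
                   for w, x, y in zip(weights, examples, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
        ensemble.append((alpha, model))
    return ensemble

def boosted_predict(ensemble, x):
    """Final prediction is an alpha-weighted vote of the weak models."""
    return 1 if sum(alpha * model(x) for alpha, model in ensemble) >= 0 else -1
```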

Empirical Studies (from Freund & Schapire; reprinted in Dietterich's AI Magazine paper)
[Scatter plot: error rate of C4.5 (an ID3 successor) vs error rate of bagged (boosted) C4.5; each point is one data set] Boosting and Bagging helped almost always!
[Scatter plot: error rate of AdaBoost vs error rate of Bagging] On average, Boosting slightly better?

Some More Methods for Producing "Uncorrelated" Members of an Ensemble
- Directly optimize accuracy + diversity
  - Opitz & Shavlik (1995; used genetic algo's)
  - Melville & Mooney (2004-5)
- Different number of hidden units in a neural network, different k in k-NN, tie-breaking scheme, example ordering, diff ML algos, etc (various people)
- See 2005-2008 papers of Rich Caruana's group for large-scale empirical studies of ensembles

Boosting/Bagging/etc Wrapup
- An easy-to-use and usually highly effective technique; always consider it (Bagging, at least) when applying ML to practical problems
- Does reduce 'comprehensibility' of models, though see the work by Craven & Shavlik ('rule extraction')
- Increases runtime, but cycles are usually much cheaper than examples (and easily parallelized)

Decision "Stumps" (formerly part of HW; try on your own!)
Holte (ML journal) compared:
- Decision trees with only one decision (decision stumps), vs
- Trees produced by C4.5 (with its pruning algorithm used)
Decision 'stumps' do remarkably well on UC Irvine data sets
- Archive too easy? Some datasets seem to be
Decision stumps are a 'quick and dirty' control for comparing to new algorithms, but ID3/C4.5 is easy to use and probably a better control
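For reference, a 1R-style decision stump takes only a few lines; this sketch (with a tiny made-up dataset) keeps the single feature whose one-level split makes the fewest training errors:

```python
from collections import Counter, defaultdict

def learn_stump(examples):
    """examples: list of (feature_dict, label); returns (feature, value->label rule, default)."""
    best = None
    for f in examples[0][0]:
        by_value = defaultdict(list)
        for x, y in examples:
            by_value[x[f]].append(y)
        # Predict the majority label within each value of feature f.
        rule = {v: Counter(ys).most_common(1)[0][0] for v, ys in by_value.items()}
        errors = sum(y != rule[x[f]] for x, y in examples)
        if best is None or errors < best[0]:
            default = Counter(y for _, y in examples).most_common(1)[0][0]
            best = (errors, f, rule, default)
    _, feature, rule, default = best
    return feature, rule, default

def stump_predict(stump, x):
    feature, rule, default = stump
    return rule.get(x.get(feature), default)

# Made-up data, just to show the calls.
data = [({'Hair': 'Blonde', 'Lotion': 'No'},  'Sunburn'),
        ({'Hair': 'Blonde', 'Lotion': 'Yes'}, 'None'),
        ({'Hair': 'Brown',  'Lotion': 'Yes'}, 'None')]
stump = learn_stump(data)
print(stump_predict(stump, {'Hair': 'Brown', 'Lotion': 'No'}))  # 'Sunburn' (splits on Lotion)
```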

C4.5 Compared to 1R ('Decision Stumps'): Testset Accuracy

Dataset   C4.5     1R
BC        72.0%    68.7%
CH        99.2%
GL        63.2%    67.6%
G2        74.3%    53.8%
HD        73.6%    72.9%
HE        81.2%    76.3%
HO        83.6%    81.0%
HY        99.1%    97.2%
IR        93.8%    93.5%
LA        77.2%    71.5%
LY        77.5%    70.7%
MU        100.0%   98.4%
SE        97.7%    95.0%
SO        97.5%
VO        95.6%    95.2%
V1        89.4%    86.8%

See Holte paper in Machine Learning for key (eg, HD = heart disease)