Scaling up Decision Trees

Decision tree learning

A decision tree

A regression tree (figure: each leaf stores the training values routed to it and predicts their mean, e.g. Play = 0m, 0m, 15m gives Play ~= 5; Play = 30m, 45m gives Play ~= 37; Play = 45m, 45m, 60m, 40m gives Play ~= 48)

Most decision tree learning algorithms
1. Given dataset D:
– return leaf(y) if all examples are in the same class y, or if some other stopping criterion holds
– pick the best split, on the best attribute a:
  a < θ or a ≥ θ
  a or not(a)
  a = c1 or a = c2 or …
  a in {c1, …, ck} or not
– split the data into D1, D2, … Dk and recursively build trees for each subset
2. "Prune" the tree
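
A minimal sketch of this recursion in Python (assumptions: numeric attributes with a < θ splits only, weighted entropy as the split score, and made-up helper names; a sketch, not any particular system's implementation):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels):
    """Score every (attribute, threshold) pair by the weighted entropy
    of the partition it induces; lower is better."""
    best = None  # (score, attribute index, threshold)
    for a in range(len(rows[0])):
        for theta in sorted({r[a] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[a] < theta]
            right = [y for r, y in zip(rows, labels) if r[a] >= theta]
            if not left or not right:
                continue  # degenerate split, skip
            score = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, a, theta)
    return best

def grow_tree(rows, labels):
    """The recursion above: return a leaf when the node is pure (or no
    valid split exists), else split on the best attribute and recurse."""
    split = best_split(rows, labels) if len(set(labels)) > 1 else None
    if split is None:
        return ("leaf", Counter(labels).most_common(1)[0][0])
    _, a, theta = split
    lpart = [(r, y) for r, y in zip(rows, labels) if r[a] < theta]
    rpart = [(r, y) for r, y in zip(rows, labels) if r[a] >= theta]
    return ("node", a, theta,
            grow_tree([r for r, _ in lpart], [y for _, y in lpart]),
            grow_tree([r for r, _ in rpart], [y for _, y in rpart]))
```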

Most decision tree learning algorithms (same algorithm as above). Popular splitting criterion: try to lower the entropy of the y labels on the resulting partition, i.e. prefer splits whose subsets contain fewer distinct labels, or that have very skewed distributions of labels.
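
For intuition, using the entropy helper from the sketch above: a 50/50 label mix is maximally uncertain, while a 90/10 mix is nearly pure, so the criterion prefers the skewed partition.

```python
# Skewed label distributions score lower (better) under entropy.
print(entropy([0] * 5 + [1] * 5))   # 1.0 bit: 50/50 mix
print(entropy([0] * 9 + [1]))       # ~0.469 bits: 90/10 mix
```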

Most decision tree learning algorithms
"Pruning" a tree:
– avoid overfitting by removing subtrees somehow
– trade variance for bias
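
Reduced-error pruning is one concrete reading of "removing subtrees somehow"; a hedged sketch over the tuple-based tree format from the code above, scored against a held-out set:

```python
from collections import Counter

def predict(tree, row):
    """Follow splits down to a leaf (tree format from the sketch above)."""
    while tree[0] == "node":
        _, a, theta, left, right = tree
        tree = left if row[a] < theta else right
    return tree[1]

def prune(tree, rows, labels):
    """Reduced-error pruning: bottom-up, replace a subtree by a
    majority-class leaf whenever the replacement does not increase
    error on the held-out (rows, labels)."""
    if tree[0] == "leaf" or not rows:
        return tree
    _, a, theta, left, right = tree
    lpart = [(r, y) for r, y in zip(rows, labels) if r[a] < theta]
    rpart = [(r, y) for r, y in zip(rows, labels) if r[a] >= theta]
    tree = ("node", a, theta,
            prune(left, [r for r, _ in lpart], [y for _, y in lpart]),
            prune(right, [r for r, _ in rpart], [y for _, y in rpart]))
    leaf = ("leaf", Counter(labels).most_common(1)[0][0])
    err = lambda t: sum(predict(t, r) != y for r, y in zip(rows, labels))
    return leaf if err(leaf) <= err(tree) else tree
```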

Most decision tree learning algorithms (same algorithm as above; the same idea applies).

Decision trees: plus and minus
– simple and fast to learn
– arguably easy to understand (if compact)
– very fast to use: often you don't even need to compute all attribute values
– can find interactions between variables (play if it's cool and sunny, or …) and hence non-linear decision boundaries
– don't need to worry about how numeric values are scaled

Decision trees: plus and minus
– hard to prove things about
– not well-suited to probabilistic extensions
– don't (typically) improve over linear classifiers when you have lots of features
– sometimes fail badly on problems that linear classifiers perform well on

Another view of a decision tree

(figure: axis-parallel decision boundaries, e.g. Sepal_length < 5.7 and Sepal_width > 2.8)

Another view of a decision tree

Another picture…

Fixing decision trees…
– hard to prove things about
– don't (typically) improve over linear classifiers when you have lots of features
– sometimes fail badly on problems that linear classifiers perform well on
Solution: build ensembles of decision trees

Most decision tree learning algorithms
"Pruning" a tree (as above): avoid overfitting by removing subtrees somehow, trading variance for bias.
Alternative: build a big ensemble to reduce the variance of the algorithm, via boosting, bagging, or random forests.

Example: random forests
Repeat T times:
– draw a bootstrap sample S (n examples taken with replacement) from the dataset D
– select a subset of features to use for S (usually half to 1/3 of the full feature set)
– build a tree considering only the features in the selected subset; don't prune
Vote the classifications of all the trees at the end.
OK, how well does this work? (A sketch of the recipe follows below.)
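
A sketch of this recipe, reusing grow_tree and predict from the sketches above (the feature_frac default, the zero-masking of unselected features, and the function names are illustration choices, not the canonical implementation):

```python
import random
from collections import Counter

def random_forest(rows, labels, T=50, feature_frac=0.5, seed=0):
    """T unpruned trees, each grown on a bootstrap sample with a random
    per-tree feature subset."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    forest = []
    for _ in range(T):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feats = set(rng.sample(range(d), max(1, int(d * feature_frac))))
        # Zero out unselected features in the training copy only, so
        # best_split can never find a useful threshold on them.
        boot = [[rows[i][a] if a in feats else 0.0 for a in range(d)]
                for i in idx]
        forest.append(grow_tree(boot, [labels[i] for i in idx]))
    return forest

def forest_predict(forest, row):
    """Majority vote over all trees."""
    return Counter(predict(t, row) for t in forest).most_common(1)[0][0]
```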

Generally, bagged decision trees eventually outperform the linear classifier, provided the data is large enough and clean enough.

Scaling up decision tree algorithms
1. Given dataset D:
– return leaf(y) if all examples are in the same class y
– pick the best split, on the best attribute a:
  a < θ or a ≥ θ
  a or not(a)
  a = c1 or a = c2 or …
  a in {c1, …, ck} or not
– split the data into D1, D2, … Dk and recursively build trees for each subset
2. "Prune" the tree

Scaling up decision tree algorithms (same algorithm as above): some of these steps are easy cases!

Scaling up decision tree algorithms (same algorithm as above).
Numeric attributes:
– sort examples by a, retaining label y
– scan through once and update the histograms of y | a < θ and y | a ≥ θ at each point θ
– pick the threshold θ with the best entropy score
– O(n log n) due to the sort, but repeated for each attribute
(A one-pass scan sketch follows below.)
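
A sketch of the sort-and-scan for a single numeric attribute (helper names assumed): sort by value, then move examples one at a time from the right histogram to the left, scoring each boundary between distinct values.

```python
import math
from collections import Counter

def counter_entropy(counts, n):
    """Entropy of a label histogram holding n examples."""
    return -sum((c / n) * math.log2(c / n) for c in counts.values() if c)

def best_threshold(values, labels):
    """One pass over the sorted examples; returns (score, threshold)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    left, right = Counter(), Counter(y for _, y in pairs)
    best = None  # (weighted entropy, threshold)
    for i in range(1, n):
        v, y = pairs[i - 1]
        left[y] += 1      # example i-1 crosses to the left side
        right[y] -= 1
        if pairs[i][0] == v:
            continue      # only boundaries between distinct values are valid
        theta = (v + pairs[i][0]) / 2
        score = (i * counter_entropy(left, i)
                 + (n - i) * counter_entropy(right, n - i)) / n
        if best is None or score < best[0]:
            best = (score, theta)
    return best
```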

Scaling up decision tree algorithms (same algorithm as above).
Numeric attributes, alternatively:
– fix a set of possible split-points θ in advance
– scan through once and compute the histogram of y's in each bin
– O(n) per attribute
(A sketch follows below.)
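
A sketch of the fixed split-point pass (assuming split_points is sorted ascending): one O(n) scan fills per-bin label histograms, and the histogram on either side of any candidate threshold is then just a prefix/suffix sum.

```python
import bisect
from collections import Counter

def binned_histograms(values, labels, split_points):
    """Bin b holds values v with split_points[b-1] <= v < split_points[b],
    so the left side of threshold split_points[k] is bins 0..k."""
    hists = [Counter() for _ in range(len(split_points) + 1)]
    for v, y in zip(values, labels):
        hists[bisect.bisect_right(split_points, v)][y] += 1
    return hists
```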

Scaling up decision tree algorithms (same algorithm as above).
Subset splits:
– expensive but useful
– there is a similar sorting trick that works for the regression case (sketched below)
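
One well-known version of this trick, from CART-style regression trees: rank the categories by their mean target value; the best subset split is then a prefix of that ranking, so only |C| - 1 candidates need scoring instead of all 2^(|C|-1) subsets. A sketch:

```python
from collections import defaultdict

def categorical_split_candidates(categories, y):
    """Return the |C| - 1 prefix subsets of the categories ranked by
    mean target value; the best subset split is among them."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, v in zip(categories, y):
        sums[c] += v
        counts[c] += 1
    order = sorted(sums, key=lambda c: sums[c] / counts[c])
    return [set(order[:k]) for k in range(1, len(order))]
```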

Scaling up decision tree algorithms (same algorithm as above).
Points to ponder:
– different subtrees are distinct tasks
– once the data is in memory, this algorithm is fast: each example appears only once in each level of the tree, and the depth of the tree is usually O(log n)
– as you move down the tree, the datasets get smaller

Scaling up decision tree algorithms (same algorithm as above).
The classifier is sequential, and so is the learning algorithm: it's really hard to see how you can learn the lower levels without learning the upper ones first!

Scaling up decision tree algorithms (same algorithm as above).
Bottleneck points:
– what's expensive is picking the attributes, especially at the top levels
– also, moving the data around in a distributed setting

Key ideas
A controller to generate Map-Reduce jobs:
– distribute the task of building subtrees of the main decision tree
– handle "small" and "large" tasks differently
  small: build the tree in-memory
  large: build the tree (mostly) depth-first; send the whole dataset to all mappers and let them classify instances to nodes on-the-fly
(A controller-loop sketch follows below.)
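
A sketch of the controller loop (the queue names, the Task record, and the SMALL cutoff are all assumptions for illustration, not the paper's API):

```python
from collections import deque, namedtuple

# Hypothetical task record: a node to expand plus the size of its data.
Task = namedtuple("Task", "node num_examples")

SMALL = 100_000  # hypothetical cutoff between "small" and "large" tasks

def controller(root, expand_with_mapreduce, expand_in_memory):
    """Large expansion tasks become MapReduce jobs; small ones are
    finished in memory; children are re-enqueued by size."""
    mr_queue, inmem_queue = deque(), deque()
    (inmem_queue if root.num_examples <= SMALL else mr_queue).append(root)
    model = []
    while mr_queue or inmem_queue:
        if inmem_queue:                    # small: build the whole subtree now
            model.append(expand_in_memory(inmem_queue.popleft()))
            continue
        task = mr_queue.popleft()          # large: one MR job per expansion
        split, children = expand_with_mapreduce(task)
        model.append(split)
        for child in children:
            (inmem_queue if child.num_examples <= SMALL
             else mr_queue).append(child)
    return model
```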

Walk through
DeQueue "large" job for root, A. Submit job to compute split quality for each attribute:
– reduction in var
– size of left/right branches
– branch predictions (Left/Right)
After completion: pick the best split and add it to the "model".
Key point: we need the split criterion (var) to be something that can be computed in a map-reduce framework. If we just need to aggregate histograms, we're OK. (A map/reduce sketch follows below.)
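
A sketch of why "just aggregating histograms" is enough for variance: mappers emit per-(attribute, bin) sufficient statistics (count, sum, sum of squares) of the target y; reducers simply add them, and the variance on either side of any candidate split falls out of the totals. All names here are assumptions:

```python
import bisect
from collections import defaultdict

def map_phase(records, split_points):
    """Mapper: per-(attribute, bin) sufficient statistics [n, sum, sum of
    squares] for the regression target y."""
    stats = defaultdict(lambda: [0, 0.0, 0.0])
    for x, y in records:                    # x: feature vector, y: target
        for a, v in enumerate(x):
            s = stats[(a, bisect.bisect_right(split_points[a], v))]
            s[0] += 1; s[1] += y; s[2] += y * y
    return stats

def reduce_phase(all_stats):
    """Reducer: sufficient statistics just add across mappers."""
    total = defaultdict(lambda: [0, 0.0, 0.0])
    for stats in all_stats:
        for key, (n, s, s2) in stats.items():
            t = total[key]
            t[0] += n; t[1] += s; t[2] += s2
    return total

def variance(n, s, s2):
    """Var(y) from the aggregates: E[y^2] - E[y]^2."""
    return s2 / n - (s / n) ** 2 if n else 0.0
```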

Walk through
DeQueue "large" job for root, A. Submit job to compute split quality for each attribute (reduction in var, size of left/right branches, branch predictions). After completion: pick the best split and add it to the "model".
Key point: the model file is shared with the mappers. The model file also contains information on job completion status (for error recovery).

Walk through
DeQueue job for the left branch: it's small enough to stop, so we do.
DeQueue job for B:
– submit a map-reduce job that scans through the whole dataset, identifies the records sent to B, and computes possible splits for these records
– on completion, pick the best split and enQueue…

Walk through
DeQueue jobs for C and D; submit a single map-reduce job for {C, D} that:
– scans through the data
– picks the records sent to C or D
– computes split quality
– sends results to the controller

Walk through
DeQueued jobs for E, F, G are small enough for memory: put them on a separate "inMemQueue".
DeQueued job H is sent to map-reduce…

MR routines for expanding nodes
Update all histograms incrementally; use a fixed number of split points to save memory. (A sketch follows below.)
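
A minimal sketch of such a histogram (interface assumed): fixed split points give constant memory per node, and two partial histograms combine by adding counts, which is what keeps the incremental update MapReduce-friendly.

```python
import bisect

class FixedBinHistogram:
    """Counts over a fixed, sorted set of split points; constant memory."""
    def __init__(self, split_points):
        self.split_points = sorted(split_points)
        self.counts = [0] * (len(split_points) + 1)

    def add(self, value):
        """Incremental update: bump the bin containing value."""
        self.counts[bisect.bisect_right(self.split_points, value)] += 1

    def merge(self, other):
        """Combine partial histograms, e.g. in a combiner or reducer."""
        self.counts = [a + b for a, b in zip(self.counts, other.counts)]
```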

MR routines for expanding nodes

MR routines for expanding nodes