1
Ensemble Learning, Boosting, and Bagging: Scaling up Decision Trees (with thanks to William Cohen of CMU, Michael Malohlava of 0xdata, and Manish Amde of Origami Logic)
2
Decision tree learning
3
A decision tree
4
A regression tree: each leaf predicts the average play time (in minutes) of the training examples that reach it, e.g. a leaf containing Play = 45m, 45m, 60m, 40m predicts Play ≈ 48.
5
Most decision tree learning algorithms
1. Given dataset D:
– return leaf(y) if all examples are in the same class y, or if some other stopping criterion is met
– pick the best split, on the best attribute a:
  a < θ or a ≥ θ
  a or not(a)
  a = c1 or a = c2 or …
  a in {c1, …, ck} or not
– split the data into D1, D2, …, Dk and recursively build trees for each subset
2. "Prune" the tree
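To make the recursion concrete, here is a minimal sketch of the procedure above (my own illustration, not code from the slides), assuming an in-memory dataset of numeric feature vectors, threshold splits only, and a depth-based stopping criterion:

    from collections import Counter

    def misclass_impurity(labels):
        # fraction of examples not in the majority class
        if not labels:
            return 0.0
        return 1.0 - Counter(labels).most_common(1)[0][1] / len(labels)

    def build_tree(X, y, depth=0, max_depth=3, impurity=misclass_impurity):
        # return leaf(y) if all examples share a class, or the stopping criterion fires
        if len(set(y)) == 1 or depth == max_depth:
            return ("leaf", Counter(y).most_common(1)[0][0])
        best = None  # (score, feature index, threshold)
        for a in range(len(X[0])):
            # naive quadratic scan over candidate thresholds;
            # the sorted one-pass version appears later in the deck
            for theta in sorted(set(row[a] for row in X)):
                left  = [yi for row, yi in zip(X, y) if row[a] <  theta]
                right = [yi for row, yi in zip(X, y) if row[a] >= theta]
                if not left or not right:
                    continue
                score = (len(left) * impurity(left) + len(right) * impurity(right)) / len(y)
                if best is None or score < best[0]:
                    best = (score, a, theta)
        if best is None:
            return ("leaf", Counter(y).most_common(1)[0][0])
        _, a, theta = best
        L = [(row, yi) for row, yi in zip(X, y) if row[a] <  theta]
        R = [(row, yi) for row, yi in zip(X, y) if row[a] >= theta]
        return ("split", a, theta,
                build_tree([r for r, _ in L], [yi for _, yi in L], depth + 1, max_depth, impurity),
                build_tree([r for r, _ in R], [yi for _, yi in R], depth + 1, max_depth, impurity))

    def predict(tree, row):
        while tree[0] == "split":
            _, a, theta, left, right = tree
            tree = left if row[a] < theta else right
        return tree[1]

Entropy-based split scoring (the criterion introduced a few slides below) can be dropped in for misclass_impurity.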
6
Tree planting Training sample of points covering area [0, 3] x [0, 3] Two possible colors of points Model should predict color of a new point
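A quick way to reproduce this toy setup (a hypothetical data generator; the slides' actual points are not specified): random points in [0, 3] x [0, 3], colored by which side of a simple boundary they fall on, with a little label noise.

    import random

    random.seed(0)

    def make_points(n=200, noise=0.1):
        data = []
        for _ in range(n):
            x1, x2 = random.uniform(0, 3), random.uniform(0, 3)
            color = "red" if x1 + x2 < 3 else "blue"      # simple ground-truth boundary
            if random.random() < noise:                    # flip a few labels as noise
                color = "blue" if color == "red" else "red"
            data.append(((x1, x2), color))
        return data

    train = make_points()
    X = [p for p, _ in train]
    y = [c for _, c in train]

The resulting X and y can be fed straight into the build_tree sketch above.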
7
Decision Tree
8
How to grow a decision tree?
Split the rows in a given node into two sets with respect to an impurity measure
Popular splitting criterion: try to lower the entropy of the y labels on the resulting partition
– i.e., prefer splits whose partitions contain fewer distinct labels, or have very skewed label distributions
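For concreteness, a small sketch of entropy and the information gain of a candidate split (again my own code, under the same assumptions as the earlier sketch):

    import math
    from collections import Counter

    def entropy(labels):
        # H(y) = -sum_c p(c) log2 p(c) over the class distribution at a node
        n = len(labels)
        return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

    def information_gain(parent_labels, left_labels, right_labels):
        # drop in entropy from the parent node to the size-weighted children
        n = len(parent_labels)
        weighted = (len(left_labels) / n) * entropy(left_labels) + \
                   (len(right_labels) / n) * entropy(right_labels)
        return entropy(parent_labels) - weighted

Passing entropy as the impurity argument of the build_tree sketch turns it into an entropy-lowering splitter.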
9
When to stop growing a tree? 1.Build a full tree OR 2.Apply stopping criterion – Tree depth – Minimum number of points on a leaf
10
How to assign leaf value? Leaf value is – Color, if leaf contains only one point – Majority color or color distribution, if leaf contains multiple points
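Both leaf-value options in one small (hypothetical) helper:

    from collections import Counter

    def leaf_value(labels):
        counts = Counter(labels)                        # e.g. Counter({'blue': 7, 'red': 1})
        majority = counts.most_common(1)[0][0]          # majority color
        distribution = {c: k / len(labels) for c, k in counts.items()}  # color distribution
        return majority, distribution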
11
Trained decision tree: the tree covers the entire area with axis-aligned rectangles, each predicting a point color
12
Decision tree scoring: the model predicts the color of a new point based on its coordinates
13
Decision trees: plus and minus
– Simple and fast to learn
– Arguably easy to understand (if compact)
– Very fast to use: often you don't even need to compute all attribute values
– Can find interactions between variables (play if it's cool and sunny, or …) and hence non-linear decision boundaries
– Don't need to worry about how numeric values are scaled
14
Decision trees: plus and minus
– Hard to prove things about
– Not well-suited to probabilistic extensions
– Don't (typically) improve over linear classifiers when you have lots of features
– Sometimes fail badly on problems that linear classifiers perform well on
16
Another view of a decision tree
17
Example splits: Sepal_length < 5.7, Sepal_width > 2.8
18
Another view of a decision tree
19
Another picture…
20
Problem of overfitting: the tree perfectly represents the training data (100% classification accuracy on the training set), but it has also learned the noise
21
How to handle overfitting?
– Pre-pruning: stopping criterion (see the sketch below)
– Post-pruning
– Random Forests! Randomize tree building and combine the trees together
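As an illustration of pre-pruning (a sketch using scikit-learn, which the slides don't reference; X and y are the toy data from earlier), the two stopping criteria map directly onto constructor arguments:

    from sklearn.tree import DecisionTreeClassifier

    # Unconstrained tree: tends to reach ~100% training accuracy and overfit the noise.
    full_tree = DecisionTreeClassifier()

    # Pre-pruned tree: limit the depth and require a minimum number of points per leaf.
    pruned_tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5)

    full_tree.fit(X, y)
    pruned_tree.fit(X, y)
    print(full_tree.score(X, y), pruned_tree.score(X, y))  # training accuracy of each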
22
“Bagging”
– Create bootstrap samples of the training data
– Build independent trees on these samples
23
“Bagging”
24
Each tree sees only a sample of the training data and captures only part of the information
Build multiple "weak" trees which vote together to give the final prediction
– Voting based on majority or a weighted average
Ensemble learning! Boosting!
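A hand-rolled sketch of bagging under the same toy assumptions, reusing the hypothetical build_tree and predict helpers from earlier:

    import random
    from collections import Counter

    def bagged_trees(X, y, n_trees=25, max_depth=4):
        trees = []
        n = len(X)
        for _ in range(n_trees):
            # bootstrap sample: draw n rows with replacement
            idx = [random.randrange(n) for _ in range(n)]
            Xb = [X[i] for i in idx]
            yb = [y[i] for i in idx]
            trees.append(build_tree(Xb, yb, max_depth=max_depth))
        return trees

    def bagged_predict(trees, row):
        # majority vote over the individual trees' predictions
        votes = [predict(t, row) for t in trees]
        return Counter(votes).most_common(1)[0][0]

scikit-learn packages the same idea as BaggingClassifier and, with per-split feature sampling added, RandomForestClassifier.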
25
Validation
Each tree is built over a sample of the training points; the remaining points are called "out-of-bag" (OOB)
The OOB error is a good approximation of the generalization error
– Almost identical to N-fold cross-validation
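scikit-learn's random forest exposes the OOB estimate directly (sketch; assumes the toy X, y from earlier):

    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(n_estimators=100, oob_score=True, bootstrap=True, random_state=0)
    rf.fit(X, y)
    print("OOB accuracy estimate:", rf.oob_score_)  # approximates held-out accuracy with no extra split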
28
Generally, bagged decision trees eventually outperform a linear classifier if the data is large enough and clean enough.
29
Scaling up decision tree algorithms
1. Given dataset D:
– return leaf(y) if all examples are in the same class y
– pick the best split, on the best attribute a:
  a < θ or a ≥ θ
  a or not(a)
  a = c1 or a = c2 or …
  a in {c1, …, ck} or not
– split the data into D1, D2, …, Dk and recursively build trees for each subset
2. "Prune" the tree
30
Scaling up decision tree algorithms (algorithm outline as above): some of these steps are easy cases!
31
Scaling up decision tree algorithms (algorithm outline as above)
Numeric attributes:
– sort the examples by a, retaining the label y
– scan through once, updating the histogram of y | a < θ at each candidate threshold θ
– pick the threshold θ with the best entropy score
– O(n log n) due to the sort – but repeated for each attribute
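A sketch of that single scan (my own illustration): sort once, then move one example at a time from the right-hand label histogram to the left-hand one and score the threshold between consecutive distinct values.

    import math
    from collections import Counter

    def entropy_of(counts, n):
        return -sum((k / n) * math.log2(k / n) for k in counts.values() if k) if n else 0.0

    def best_threshold(values, labels):
        # values: the numeric attribute a; labels: the class y
        pairs = sorted(zip(values, labels))               # O(n log n) sort
        left, right = Counter(), Counter(labels)          # y-histograms on each side of θ
        n = len(pairs)
        best = None                                       # (weighted entropy, threshold)
        for i in range(1, n):
            v_prev, y_prev = pairs[i - 1]
            left[y_prev] += 1                             # move one example into the left histogram...
            right[y_prev] -= 1                            # ...and out of the right one
            v_cur = pairs[i][0]
            if v_cur == v_prev:
                continue                                  # can only split between distinct values
            theta = (v_prev + v_cur) / 2
            score = (i / n) * entropy_of(left, i) + ((n - i) / n) * entropy_of(right, n - i)
            if best is None or score < best[0]:
                best = (score, theta)
        return best                                       # None if the attribute is constant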
32
Scaling up decision tree algorithms (algorithm outline as above)
Numeric attributes:
– or, fix a set of possible split points θ in advance
– scan through once and compute the histogram of y's in each bin
– O(n) per attribute
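And a sketch of the binned variant (also mine; it reuses entropy_of from the previous sketch). Fixing, say, 32 equal-width bins per attribute gives one O(n) pass to fill the histograms, after which only the bin boundaries are scored as thresholds:

    from collections import Counter

    def binned_split_scores(values, labels, num_bins=32):
        lo, hi = min(values), max(values)
        if lo == hi:
            return []                                     # constant attribute, nothing to split on
        width = (hi - lo) / num_bins
        # one pass: per-bin histogram of labels, O(n)
        bins = [Counter() for _ in range(num_bins)]
        for v, yi in zip(values, labels):
            b = min(int((v - lo) / width), num_bins - 1)
            bins[b][yi] += 1
        # candidate thresholds are the bin boundaries; accumulate left/right histograms
        n = len(values)
        left, right = Counter(), Counter(labels)
        scores = []
        for b in range(num_bins - 1):
            left.update(bins[b])
            right.subtract(bins[b])
            n_left = sum(left.values())
            theta = lo + (b + 1) * width
            score = (n_left / n) * entropy_of(left, n_left) + \
                    ((n - n_left) / n) * entropy_of(right, n - n_left)
            scores.append((score, theta))
        return scores                                     # pick min(scores) for the best binned threshold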
33
Scaling up decision tree algorithms (algorithm outline as above)
Subset splits (a in {c1, …, ck} or not):
– expensive but useful
– there is a similar sorting trick that works for the regression case
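The sorting trick referred to here is, I believe, the classic one for regression (and two-class) targets: sort the categories by their mean target value and score only the k-1 contiguous prefixes of that ordering instead of all 2^(k-1) subsets. A sketch (not from the slides):

    from collections import defaultdict

    def candidate_subset_splits(categories, targets):
        # group the numeric target by category and sort the categories by mean target
        sums, counts = defaultdict(float), defaultdict(int)
        for c, t in zip(categories, targets):
            sums[c] += t
            counts[c] += 1
        order = sorted(counts, key=lambda c: sums[c] / counts[c])
        # only the k-1 prefixes of this ordering need to be scored,
        # instead of all 2^(k-1) possible subsets
        return [set(order[:i]) for i in range(1, len(order))]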
34
Scaling up decision tree algorithms (algorithm outline as above)
Points to ponder:
– different subtrees are distinct tasks
– once the data is in memory, this algorithm is fast:
  each example appears only once in each level of the tree
  the depth of the tree is usually O(log n)
– as you move down the tree, the datasets get smaller
36
Scaling up decision tree algorithms (algorithm outline as above)
The classifier is sequential, and so is the learning algorithm
– it's really hard to see how you can learn the lower levels without learning the upper ones first!
37
Scaling up decision tree algorithms (algorithm outline as above)
Bottleneck points:
– what's expensive is picking the attributes, especially at the top levels
– also, moving the data around in a distributed setting
38
Boosting
Maintain weights on the training examples
– Start with uniform weights
– Train a weak learner on the weighted data
– If the weak learner gets an example WRONG: upweight that example
– If it gets an example RIGHT: downweight it
– Combine the weak learners with a weighted vote
AdaBoost
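A hand-rolled sketch of (binary) AdaBoost along these lines, reusing scikit-learn decision stumps as the weak learners (my own code; the round count and constants are illustrative, and X, y are the toy data from earlier):

    import math
    from sklearn.tree import DecisionTreeClassifier

    def adaboost(X, y, n_rounds=20):
        n = len(X)
        w = [1.0 / n] * n                       # uniform example weights to start
        ensemble = []                           # list of (weak learner, learner weight alpha)
        for _ in range(n_rounds):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)    # weak learner trained on the weighted data
            preds = stump.predict(X)
            err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
            err = max(err, 1e-10)               # avoid division by zero on a perfect round
            alpha = 0.5 * math.log((1 - err) / err)
            # upweight mistakes, downweight correct examples, then renormalize
            w = [wi * math.exp(alpha if p != yi else -alpha)
                 for wi, p, yi in zip(w, preds, y)]
            total = sum(w)
            w = [wi / total for wi in w]
            ensemble.append((stump, alpha))
        return ensemble

    def boosted_predict(ensemble, row):
        # weighted vote of the weak learners
        votes = {}
        for stump, alpha in ensemble:
            label = stump.predict([row])[0]
            votes[label] = votes.get(label, 0.0) + alpha
        return max(votes, key=votes.get)

scikit-learn's AdaBoostClassifier implements the same loop.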
39
Decision Trees in Spark MLlib
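For reference, a minimal PySpark sketch of training a tree with the ml API (my own example; the toy DataFrame and column names are made up, and maxBins is the binned-split parameter corresponding to the O(n) histogram approach above):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler, StringIndexer
    from pyspark.ml.classification import DecisionTreeClassifier

    spark = SparkSession.builder.appName("dt-demo").getOrCreate()

    # hypothetical toy DataFrame: two numeric features and a string label
    df = spark.createDataFrame(
        [(0.5, 2.1, "red"), (2.7, 0.3, "blue"), (1.0, 1.0, "red"), (2.9, 2.8, "blue")],
        ["x1", "x2", "color"],
    )

    features = VectorAssembler(inputCols=["x1", "x2"], outputCol="features").transform(df)
    indexed = StringIndexer(inputCol="color", outputCol="label").fit(features).transform(features)

    dt = DecisionTreeClassifier(labelCol="label", featuresCol="features",
                                maxDepth=5, maxBins=32)  # binned splits
    model = dt.fit(indexed)
    print(model.toDebugString)                           # text dump of the learned tree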
41
Scaling results
– Synthetic dataset (unspecified)
– 10 to 50 million instances
– 10 to 50 features
– 2 to 16 machines
– 700 MB to 18 GB datasets
42
Large-scale experiment 500M instances, 20 features, 90GB dataset
44
The End is coming
– Student lecture tomorrow
– Final lecture Friday: overview of distributed / big data frameworks
– Thanksgiving break
– PROJECT TALKS!
– Final project write-ups: Dec. 9
  6-10 pages (I stop reading after pg. 10)
  NIPS format (see course website)