1
Ensemble Learning, Boosting, and Bagging: Scaling up Decision Trees (with thanks to William Cohen of CMU, Michael Malohlava of 0xdata, and Manish Amde of Origami Logic)
2
Decision tree learning
3
A decision tree
4
A regression tree: each leaf predicts the average play time (in minutes) of the training examples that reach it, e.g. a leaf containing Play = 45m, 45m, 60m, 40m predicts Play ≈ 48.
5
Most decision tree learning algorithms
1. Given dataset D:
– return leaf(y) if all examples are in the same class y, or if some other stopping criterion is met
– pick the best split, on the best attribute a:
  a < θ or a ≥ θ
  a or not(a)
  a = c1 or a = c2 or …
  a in {c1, …, ck} or not
– split the data into D1, D2, …, Dk and recursively build trees for each subset
2. "Prune" the tree
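To make the recursion concrete, here is a minimal sketch of the procedure above (my own illustration, not code from the slides), assuming an in-memory dataset of numeric feature vectors, threshold splits only, and a depth-based stopping criterion:

    from collections import Counter

    def misclass_impurity(labels):
        # fraction of examples not in the majority class
        if not labels:
            return 0.0
        return 1.0 - Counter(labels).most_common(1)[0][1] / len(labels)

    def build_tree(X, y, depth=0, max_depth=3, impurity=misclass_impurity):
        # return leaf(y) if all examples share a class, or the stopping criterion fires
        if len(set(y)) == 1 or depth == max_depth:
            return ("leaf", Counter(y).most_common(1)[0][0])
        best = None  # (score, feature index, threshold)
        for a in range(len(X[0])):
            # naive quadratic scan over candidate thresholds;
            # the sorted one-pass version appears later in the deck
            for theta in sorted(set(row[a] for row in X)):
                left  = [yi for row, yi in zip(X, y) if row[a] <  theta]
                right = [yi for row, yi in zip(X, y) if row[a] >= theta]
                if not left or not right:
                    continue
                score = (len(left) * impurity(left) + len(right) * impurity(right)) / len(y)
                if best is None or score < best[0]:
                    best = (score, a, theta)
        if best is None:
            return ("leaf", Counter(y).most_common(1)[0][0])
        _, a, theta = best
        L = [(row, yi) for row, yi in zip(X, y) if row[a] <  theta]
        R = [(row, yi) for row, yi in zip(X, y) if row[a] >= theta]
        return ("split", a, theta,
                build_tree([r for r, _ in L], [yi for _, yi in L], depth + 1, max_depth, impurity),
                build_tree([r for r, _ in R], [yi for _, yi in R], depth + 1, max_depth, impurity))

    def predict(tree, row):
        while tree[0] == "split":
            _, a, theta, left, right = tree
            tree = left if row[a] < theta else right
        return tree[1]

Entropy-based split scoring (the criterion introduced a few slides below) can be dropped in for misclass_impurity.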
6
Tree planting Training sample of points covering area [0, 3] x [0, 3] Two possible colors of points Model should predict color of a new point
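A quick way to reproduce this toy setup (a hypothetical data generator; the slides' actual points are not specified): random points in [0, 3] x [0, 3], colored by which side of a simple boundary they fall on, with a little label noise.

    import random

    random.seed(0)

    def make_points(n=200, noise=0.1):
        data = []
        for _ in range(n):
            x1, x2 = random.uniform(0, 3), random.uniform(0, 3)
            color = "red" if x1 + x2 < 3 else "blue"      # simple ground-truth boundary
            if random.random() < noise:                    # flip a few labels as noise
                color = "blue" if color == "red" else "red"
            data.append(((x1, x2), color))
        return data

    train = make_points()
    X = [p for p, _ in train]
    y = [c for _, c in train]

The resulting X and y can be fed straight into the build_tree sketch above.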
7
Decision Tree
8
How to grow a decision tree?
Split the rows in a given node into two sets with respect to an impurity measure
Popular splitting criterion: try to lower the entropy of the y labels on the resulting partition
– i.e., prefer splits whose partitions contain fewer distinct labels, or have very skewed label distributions
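For concreteness, a small sketch of entropy and the information gain of a candidate split (again my own code, under the same assumptions as the earlier sketch):

    import math
    from collections import Counter

    def entropy(labels):
        # H(y) = -sum_c p(c) log2 p(c) over the class distribution at a node
        n = len(labels)
        return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

    def information_gain(parent_labels, left_labels, right_labels):
        # drop in entropy from the parent node to the size-weighted children
        n = len(parent_labels)
        weighted = (len(left_labels) / n) * entropy(left_labels) + \
                   (len(right_labels) / n) * entropy(right_labels)
        return entropy(parent_labels) - weighted

Passing entropy as the impurity argument of the build_tree sketch turns it into an entropy-lowering splitter.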
9
When to stop growing a tree? 1.Build a full tree OR 2.Apply stopping criterion – Tree depth – Minimum number of points on a leaf
10
How to assign leaf value? Leaf value is – Color, if leaf contains only one point – Majority color or color distribution, if leaf contains multiple points
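Both leaf-value options in one small (hypothetical) helper:

    from collections import Counter

    def leaf_value(labels):
        counts = Counter(labels)                        # e.g. Counter({'blue': 7, 'red': 1})
        majority = counts.most_common(1)[0][0]          # majority color
        distribution = {c: k / len(labels) for c, k in counts.items()}  # color distribution
        return majority, distribution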
11
Trained decision tree: the tree covers the entire area with axis-aligned rectangles, each predicting a point color
12
Decision tree scoring: the model predicts the color of a new point based on its coordinates
13
Decision trees: plus and minus
– Simple and fast to learn
– Arguably easy to understand (if compact)
– Very fast to use: often you don't even need to compute all attribute values
– Can find interactions between variables (play if it's cool and sunny, or …) and hence non-linear decision boundaries
– Don't need to worry about how numeric values are scaled
14
Decision trees: plus and minus
– Hard to prove things about
– Not well-suited to probabilistic extensions
– Don't (typically) improve over linear classifiers when you have lots of features
– Sometimes fail badly on problems that linear classifiers perform well on
16
Another view of a decision tree
17
Example splits: Sepal_length < 5.7, Sepal_width > 2.8
18
Another view of a decision tree
19
Another picture…
20
Problem of overfitting: the tree perfectly represents the training data (100% classification accuracy on the training set), but it has also learned the noise
21
How to handle overfitting?
– Pre-pruning: stopping criterion (see the sketch below)
– Post-pruning
– Random Forests! Randomize tree building and combine the trees together
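As an illustration of pre-pruning (a sketch using scikit-learn, which the slides don't reference; X and y are the toy data from earlier), the two stopping criteria map directly onto constructor arguments:

    from sklearn.tree import DecisionTreeClassifier

    # Unconstrained tree: tends to reach ~100% training accuracy and overfit the noise.
    full_tree = DecisionTreeClassifier()

    # Pre-pruned tree: limit the depth and require a minimum number of points per leaf.
    pruned_tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5)

    full_tree.fit(X, y)
    pruned_tree.fit(X, y)
    print(full_tree.score(X, y), pruned_tree.score(X, y))  # training accuracy of each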
22
“Bagging”
– Create bootstrap samples of the training data
– Build independent trees on these samples
23
“Bagging”
24
Each tree sees only a sample of the training data and captures only part of the information
Build multiple "weak" trees which vote together to give the final prediction
– Voting based on majority or a weighted average
Ensemble learning! Boosting!
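A hand-rolled sketch of bagging under the same toy assumptions, reusing the hypothetical build_tree and predict helpers from earlier:

    import random
    from collections import Counter

    def bagged_trees(X, y, n_trees=25, max_depth=4):
        trees = []
        n = len(X)
        for _ in range(n_trees):
            # bootstrap sample: draw n rows with replacement
            idx = [random.randrange(n) for _ in range(n)]
            Xb = [X[i] for i in idx]
            yb = [y[i] for i in idx]
            trees.append(build_tree(Xb, yb, max_depth=max_depth))
        return trees

    def bagged_predict(trees, row):
        # majority vote over the individual trees' predictions
        votes = [predict(t, row) for t in trees]
        return Counter(votes).most_common(1)[0][0]

scikit-learn packages the same idea as BaggingClassifier and, with per-split feature sampling added, RandomForestClassifier.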
25
Validation
Each tree is built over a sample of the training points; the remaining points are called "out-of-bag" (OOB)
The OOB error is a good approximation of the generalization error
– Almost identical to N-fold cross-validation
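scikit-learn's random forest exposes the OOB estimate directly (sketch; assumes the toy X, y from earlier):

    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(n_estimators=100, oob_score=True, bootstrap=True, random_state=0)
    rf.fit(X, y)
    print("OOB accuracy estimate:", rf.oob_score_)  # approximates held-out accuracy with no extra split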
28
Generally, bagged decision trees eventually outperform a linear classifier if the data is large enough and clean enough.
29
Scaling up decision tree algorithms
1. Given dataset D:
– return leaf(y) if all examples are in the same class y
– pick the best split, on the best attribute a:
  a < θ or a ≥ θ
  a or not(a)
  a = c1 or a = c2 or …
  a in {c1, …, ck} or not
– split the data into D1, D2, …, Dk and recursively build trees for each subset
2. "Prune" the tree
30
Scaling up decision tree algorithms (algorithm outline as above): some of these steps are easy cases!
31
Scaling up decision tree algorithms (algorithm outline as above)
Numeric attributes:
– sort the examples by a, retaining the label y
– scan through once, updating the histogram of y | a < θ at each candidate threshold θ
– pick the threshold θ with the best entropy score
– O(n log n) due to the sort – but repeated for each attribute
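A sketch of that single scan (my own illustration): sort once, then move one example at a time from the right-hand label histogram to the left-hand one and score the threshold between consecutive distinct values.

    import math
    from collections import Counter

    def entropy_of(counts, n):
        return -sum((k / n) * math.log2(k / n) for k in counts.values() if k) if n else 0.0

    def best_threshold(values, labels):
        # values: the numeric attribute a; labels: the class y
        pairs = sorted(zip(values, labels))               # O(n log n) sort
        left, right = Counter(), Counter(labels)          # y-histograms on each side of θ
        n = len(pairs)
        best = None                                       # (weighted entropy, threshold)
        for i in range(1, n):
            v_prev, y_prev = pairs[i - 1]
            left[y_prev] += 1                             # move one example into the left histogram...
            right[y_prev] -= 1                            # ...and out of the right one
            v_cur = pairs[i][0]
            if v_cur == v_prev:
                continue                                  # can only split between distinct values
            theta = (v_prev + v_cur) / 2
            score = (i / n) * entropy_of(left, i) + ((n - i) / n) * entropy_of(right, n - i)
            if best is None or score < best[0]:
                best = (score, theta)
        return best                                       # None if the attribute is constant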
32
Scaling up decision tree algorithms (algorithm outline as above)
Numeric attributes:
– or, fix a set of possible split points θ in advance
– scan through once and compute the histogram of y's in each bin
– O(n) per attribute
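And a sketch of the binned variant (also mine; it reuses entropy_of from the previous sketch). Fixing, say, 32 equal-width bins per attribute gives one O(n) pass to fill the histograms, after which only the bin boundaries are scored as thresholds:

    from collections import Counter

    def binned_split_scores(values, labels, num_bins=32):
        lo, hi = min(values), max(values)
        if lo == hi:
            return []                                     # constant attribute, nothing to split on
        width = (hi - lo) / num_bins
        # one pass: per-bin histogram of labels, O(n)
        bins = [Counter() for _ in range(num_bins)]
        for v, yi in zip(values, labels):
            b = min(int((v - lo) / width), num_bins - 1)
            bins[b][yi] += 1
        # candidate thresholds are the bin boundaries; accumulate left/right histograms
        n = len(values)
        left, right = Counter(), Counter(labels)
        scores = []
        for b in range(num_bins - 1):
            left.update(bins[b])
            right.subtract(bins[b])
            n_left = sum(left.values())
            theta = lo + (b + 1) * width
            score = (n_left / n) * entropy_of(left, n_left) + \
                    ((n - n_left) / n) * entropy_of(right, n - n_left)
            scores.append((score, theta))
        return scores                                     # pick min(scores) for the best binned threshold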
33
Scaling up decision tree algorithms (algorithm outline as above)
Subset splits (a in {c1, …, ck} or not):
– expensive but useful
– there is a similar sorting trick that works for the regression case
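The sorting trick referred to here is, I believe, the classic one for regression (and two-class) targets: sort the categories by their mean target value and score only the k-1 contiguous prefixes of that ordering instead of all 2^(k-1) subsets. A sketch (not from the slides):

    from collections import defaultdict

    def candidate_subset_splits(categories, targets):
        # group the numeric target by category and sort the categories by mean target
        sums, counts = defaultdict(float), defaultdict(int)
        for c, t in zip(categories, targets):
            sums[c] += t
            counts[c] += 1
        order = sorted(counts, key=lambda c: sums[c] / counts[c])
        # only the k-1 prefixes of this ordering need to be scored,
        # instead of all 2^(k-1) possible subsets
        return [set(order[:i]) for i in range(1, len(order))]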
34
Scaling up decision tree algorithms (algorithm outline as above)
Points to ponder:
– different subtrees are distinct tasks
– once the data is in memory, this algorithm is fast:
  each example appears only once in each level of the tree
  the depth of the tree is usually O(log n)
– as you move down the tree, the datasets get smaller
36
Scaling up decision tree algorithms (algorithm outline as above)
The classifier is sequential, and so is the learning algorithm
– it's really hard to see how you can learn the lower levels without learning the upper ones first!
37
Scaling up decision tree algorithms (algorithm outline as above)
Bottleneck points:
– what's expensive is picking the attributes, especially at the top levels
– also, moving the data around in a distributed setting
38
Boosting
Maintain weights on the training examples
– Start with uniform weights
– Train a weak learner on the weighted data
– If the weak learner gets an example WRONG: upweight that example
– If it gets an example RIGHT: downweight it
– Combine the weak learners with a weighted vote
AdaBoost
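A hand-rolled sketch of (binary) AdaBoost along these lines, reusing scikit-learn decision stumps as the weak learners (my own code; the round count and constants are illustrative, and X, y are the toy data from earlier):

    import math
    from sklearn.tree import DecisionTreeClassifier

    def adaboost(X, y, n_rounds=20):
        n = len(X)
        w = [1.0 / n] * n                       # uniform example weights to start
        ensemble = []                           # list of (weak learner, learner weight alpha)
        for _ in range(n_rounds):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)    # weak learner trained on the weighted data
            preds = stump.predict(X)
            err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
            err = max(err, 1e-10)               # avoid division by zero on a perfect round
            alpha = 0.5 * math.log((1 - err) / err)
            # upweight mistakes, downweight correct examples, then renormalize
            w = [wi * math.exp(alpha if p != yi else -alpha)
                 for wi, p, yi in zip(w, preds, y)]
            total = sum(w)
            w = [wi / total for wi in w]
            ensemble.append((stump, alpha))
        return ensemble

    def boosted_predict(ensemble, row):
        # weighted vote of the weak learners
        votes = {}
        for stump, alpha in ensemble:
            label = stump.predict([row])[0]
            votes[label] = votes.get(label, 0.0) + alpha
        return max(votes, key=votes.get)

scikit-learn's AdaBoostClassifier implements the same loop.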
39
Decision Trees in Spark MLlib
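For reference, a minimal PySpark sketch of training a tree with the ml API (my own example; the toy DataFrame and column names are made up, and maxBins is the binned-split parameter corresponding to the O(n) histogram approach above):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler, StringIndexer
    from pyspark.ml.classification import DecisionTreeClassifier

    spark = SparkSession.builder.appName("dt-demo").getOrCreate()

    # hypothetical toy DataFrame: two numeric features and a string label
    df = spark.createDataFrame(
        [(0.5, 2.1, "red"), (2.7, 0.3, "blue"), (1.0, 1.0, "red"), (2.9, 2.8, "blue")],
        ["x1", "x2", "color"],
    )

    features = VectorAssembler(inputCols=["x1", "x2"], outputCol="features").transform(df)
    indexed = StringIndexer(inputCol="color", outputCol="label").fit(features).transform(features)

    dt = DecisionTreeClassifier(labelCol="label", featuresCol="features",
                                maxDepth=5, maxBins=32)  # binned splits
    model = dt.fit(indexed)
    print(model.toDebugString)                           # text dump of the learned tree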
41
Scaling results
– Synthetic dataset (unspecified)
– 10 to 50 million instances
– 10 to 50 features
– 2 to 16 machines
– 700 MB to 18 GB datasets
42
Large-scale experiment 500M instances, 20 features, 90GB dataset
44
The End is coming
– Student lecture tomorrow
– Final lecture Friday: overview of distributed / big data frameworks
– Thanksgiving break
– PROJECT TALKS!
– Final project write-ups: Dec. 9
  6-10 pages (I stop reading after pg. 10)
  NIPS format (see course website)