
1 Statistical Learning Dong Liu Dept. EEIS, USTC

2 Chapter 8. Decision Tree
Tree model
Tree building
Tree pruning
Tree and ensemble

3 Taxonomy
How does a biologist determine the category of an animal? By applying a hierarchy of rules: Kingdom of Animalia, Phylum of Chordata, Class of Mammalia, Order of Carnivora, and so on.

4 Taxonomy as a tree
(Figure: the taxonomic hierarchy drawn as a tree.)

5 Decision tree
(Figure: a decision tree over the attributes Refund, MarSt, and TaxInc. Root: Refund? Yes leads to leaf NO; No leads to MarSt. MarSt = Married leads to leaf NO; MarSt = Single or Divorced leads to TaxInc. TaxInc <= 80K leads to leaf NO; TaxInc > 80K leads to leaf YES.)

6 Using a decision tree
Start from the root of the tree and follow the branch that matches each test: Refund? (Yes: NO; No: go to MarSt), MarSt? (Married: NO; Single, Divorced: go to TaxInc), TaxInc? (<= 80K: NO; > 80K: YES), until a leaf gives the prediction.

7 Tree models
A tree model consists of a set of conditions and a set of base models, organized in a tree.
Each internal node represents a condition on the input attributes; each condition is a division (split) of the input space.
Each leaf node represents a base model.
Classification: a class (simplest case) or a classifier.
Regression: a constant (simplest case) or a regressor.
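As an illustration only (not from the slides), here is a minimal Python sketch of such a structure; the class and field names are my own choices:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional

@dataclass
class TreeNode:
    """One node of a tree model: internal nodes hold a condition, leaves hold a base model."""
    condition: Optional[Callable[[dict], Any]] = None   # maps an input record to a branch key
    children: Dict[Any, "TreeNode"] = field(default_factory=dict)
    base_model: Any = None                               # e.g. a class label or a constant

    def predict(self, x: dict) -> Any:
        """Follow the conditions from this node down to a leaf and return its base model."""
        if self.condition is None:          # leaf node
            return self.base_model
        branch = self.condition(x)          # evaluate the split condition on the input
        return self.children[branch].predict(x)

# Toy usage (unrelated to the slides' example): a one-split tree on "Refund".
leaf_a, leaf_b = TreeNode(base_model="NO"), TreeNode(base_model="YES")
root = TreeNode(condition=lambda x: x["Refund"], children={"Yes": leaf_a, "No": leaf_b})
print(root.predict({"Refund": "Yes"}))  # -> "NO"
```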

8 Chapter 8. Decision Tree
Tree model
Tree building
Tree pruning
Tree and ensemble

9 Tree induction
Assume we have defined the form of the base models.
How do we find the optimal tree structure (the set of conditions, i.e. the division of the input space)? Exhaustive search is computationally expensive, so we use a heuristic approach: Hunt's algorithm.

10 Hunt's algorithm
Input: a set of training data D = {(x_n, y_n)}
Output: a classification or regression tree T
Function T = Hunt_Algorithm(D):
  If D need not or cannot be divided, return a leaf node.
  Else:
    Find an attribute of x, say x_d, and decide a condition g(x_d).
    Divide D into D_1, D_2, ..., according to the output of g(x_d).
    T_1 = Hunt_Algorithm(D_1), T_2 = Hunt_Algorithm(D_2), ...
    Let T_1, T_2, ..., be the children of T.
    Return T.
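A hedged Python sketch of this recursion for classification (assuming majority-class leaves and a caller-supplied split chooser named choose_split, which is not defined in the slides):

```python
from collections import Counter

def hunt_algorithm(D, choose_split):
    """D is a non-empty list of (x, y) pairs.

    choose_split(D) returns (g, branches) or None when D cannot be usefully divided;
    g maps an input x to a branch key, branches lists the possible keys.
    Returns a nested dict representing the tree (a sketch, not the slides' exact code).
    """
    labels = [y for _, y in D]
    # Need not be divided: all labels equal. Cannot be divided: no useful split found.
    split = choose_split(D) if len(set(labels)) > 1 else None
    if split is None:
        return {"leaf": Counter(labels).most_common(1)[0][0]}  # majority class as base model
    g, branches = split
    children = {}
    for b in branches:
        D_b = [(x, y) for x, y in D if g(x) == b]   # divide D according to g's output
        if not D_b:                                  # empty branch: fall back to parent majority
            children[b] = {"leaf": Counter(labels).most_common(1)[0][0]}
        else:
            children[b] = hunt_algorithm(D_b, choose_split)
    return {"condition": g, "children": children}
```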

11 Example of Hunt's algorithm (1)
(Figure: the full training set D and the current tree T, whose root splits on Refund with branches Yes and No leading to undetermined subtrees T_1 and T_2.)
Does D need to be divided? Yes, because D's class labels have different values.
Can D be divided? Yes, because D's input attributes have different values.

12 Example of Hunt's algorithm (2)
(Figure: the subset D_1 with Refund = Yes; in T, the Yes branch now ends in the leaf NO, while T_2 is still undetermined.)
Does D_1 need to be divided? No, so T_1 is a leaf node.

13 Example of Hunt's algorithm (3)
(Figure: the subset D_2 with Refund = No; T_2 becomes an internal node splitting on MarSt, with branches Single, Divorced and Married leading to undetermined subtrees T_21 and T_22.)

14 Example of Hunt's algorithm (4)
(Figure: the subset D_22 with Refund = No and MarSt = Married; the Single, Divorced branch grows a TaxInc split with leaves NO for <= 80K and YES for > 80K.)
Can D_22 be divided? No, so T_22 is a leaf node (NO).

15 Find an attribute and decide a condition
Discrete values: multi-way split (e.g. MarSt into Single / Married / Divorced) or two-way split (e.g. MarSt into {Single, Divorced} / Married, or into Single / {Married, Divorced}).
Continuous values: two-way or multi-way split.
Which attribute and which condition shall be selected? Define a criterion that describes the "gain" of dividing a set into several subsets.

16 Purity of a set
The purity of a set describes how easily the set can be classified, e.g. compare two sets with 0-1 classes: {0,0,0,0,0,0,0,0,0,1} vs. {0,1,0,1,0,1,0,1,0,1}.
Measures (p_0 and p_1 stand for the fractions of class 0 and class 1):
Entropy: -p_0 log p_0 - p_1 log p_1
Gini index: 1 - p_0^2 - p_1^2
Misclassification error (if predicting the dominant class): min(p_0, p_1)
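A small Python sketch of these three measures for a binary set (the helper name is my own, not from the slides):

```python
import math

def purity_measures(labels):
    """Return (entropy, gini, misclassification error) for a list of 0/1 labels."""
    p1 = sum(labels) / len(labels)
    p0 = 1.0 - p1
    entropy = -sum(p * math.log2(p) for p in (p0, p1) if p > 0)  # 0*log(0) treated as 0
    gini = 1.0 - p0**2 - p1**2
    misclass = min(p0, p1)
    return entropy, gini, misclass

# The two example sets from the slide: the first is much purer than the second.
print(purity_measures([0]*9 + [1]))   # low entropy and Gini index
print(purity_measures([0, 1]*5))      # maximal entropy and Gini index for two classes
```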

17 Criterion to find an attribute and decide a condition
Information gain: g = H(D) - Σ_i (|D_i|/|D|) H(D_i), where H(D) is the entropy.
Information gain ratio: gr = g / (-Σ_i (|D_i|/|D|) log(|D_i|/|D|)), which suppresses splits into too many subsets.
Gini index gain: gig = G(D) - Σ_i (|D_i|/|D|) G(D_i), where G(D) is the Gini index.
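Continuing the sketch above (reusing purity_measures and the math import, with illustrative names only), the three criteria can be computed like this:

```python
def entropy(labels):
    return purity_measures(labels)[0]

def gini(labels):
    return purity_measures(labels)[1]

def information_gain(parent, subsets):
    """parent: list of labels; subsets: list of label lists produced by the split."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

def gain_ratio(parent, subsets):
    """Information gain divided by the split information, penalizing many small subsets."""
    n = len(parent)
    split_info = -sum(len(s) / n * math.log2(len(s) / n) for s in subsets if s)
    return information_gain(parent, subsets) / split_info if split_info > 0 else 0.0

def gini_gain(parent, subsets):
    n = len(parent)
    return gini(parent) - sum(len(s) / n * gini(s) for s in subsets)
```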

18 Example: Gini index gain
Before the split: G(D) = 0.42 (consistent with a 3 vs. 7 class distribution over the 10 training samples: 1 - 0.3^2 - 0.7^2 = 0.42).
After splitting with {TaxInc <= 97}: gig = 0.12.
(Figure: the table of training samples.)

19 Chapter 8. Decision Tree
Tree model
Tree building
Tree pruning
Tree and ensemble

20 Control the complexity of the tree
Using Hunt's algorithm, we build a tree that is as accurate as possible, which may incur over-fitting.
Two ways to control the complexity (and thus avoid over-fitting):
Early termination: stop splitting if the gain is less than a threshold, if the tree is too deep, or if the set is too small.
Tree pruning: remove branches from the tree so as to minimize the joint cost C_alpha(T) = C(T) + alpha * |T|, where C(T) is the empirical risk (e.g. error rate on the training data) and |T| is the tree complexity (e.g. number of leaf nodes).

21 Tree pruning example (1)
(Figure: the unpruned tree. Refund = Yes: leaf NO, 3 correct. Refund = No, MarSt = Married: leaf NO, 2 correct, 1 error. Refund = No, MarSt = Single or Divorced, TaxInc <= 80K: leaf NO, 1 correct; TaxInc > 80K: leaf YES, 3 correct.)
C(T) = 1/10, |T| = 4

22 Tree pruning example (2)
There are different possible prunings.
(Figure: one pruning choice that removes the Refund test, leaving a tree rooted at MarSt with three leaf nodes.)
C(T) = 3/10, |T| = 3

23 Tree pruning example (2)
(Figure: another pruning choice that collapses the TaxInc subtree into the leaf YES. Refund = Yes: NO, 3 correct; Married: NO, 2 correct, 1 error; Single or Divorced: YES, 3 correct, 1 error.)
C(T) = 2/10, |T| = 3

24 Tree pruning example (2)
(Figure: another pruning choice that removes the MarSt test, so Refund = No leads directly to the TaxInc split. Refund = Yes: NO, 3 correct; TaxInc <= 80K: NO, 3 correct, 1 error; TaxInc > 80K: YES, 3 correct.)
C(T) = 1/10, |T| = 3

25 Tree pruning example (2)
Among the three-leaf candidates, select the tree with minimal C(T).
(Figure: the three candidate trees, with C(T) = 1/10, 3/10, and 2/10, all with |T| = 3; the tree that keeps the Refund and TaxInc tests, with C(T) = 1/10, is selected.)

26 Tree pruning example (3)
Continue pruning, keeping 2 leaf nodes.
(Figure: the candidate two-leaf trees, with C(T) = 3/10, 4/10, and 4/10, all with |T| = 2; the best one splits only on Refund, with Yes: NO, 3 correct and No: YES, 4 correct, 3 errors.)

27 Tree pruning example (4)
Continue pruning, keeping 1 leaf node: a single leaf NO, with 6 correct and 4 errors.
C(T) = 4/10, |T| = 1

28 Tree pruning example (5)
In summary, the best subtree of each size has:
C(T) = 1/10, |T| = 4
C(T) = 1/10, |T| = 3
C(T) = 3/10, |T| = 2
C(T) = 4/10, |T| = 1
Therefore, the optimal tree depends on alpha:
alpha >= 0.15: one leaf node
0 <= alpha <= 0.15: three leaf nodes
alpha = 0: four leaf nodes (at alpha = 0 the three- and four-leaf trees have equal joint cost)
(Figure: the four-leaf tree and the selected three-leaf tree.)
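A small Python check of this selection rule, purely illustrative, using the costs and sizes listed above (alphas are chosen away from the tie points):

```python
# Best subtree of each size: (empirical risk C(T), number of leaves |T|)
subtrees = [(1/10, 4), (1/10, 3), (3/10, 2), (4/10, 1)]

def best_subtree(alpha):
    """Return the (C, |T|) pair minimizing the joint cost C(T) + alpha * |T|."""
    return min(subtrees, key=lambda ct: ct[0] + alpha * ct[1])

for alpha in (0.0, 0.05, 0.2):
    c, size = best_subtree(alpha)
    print(f"alpha={alpha}: optimal tree has {size} leaf nodes (C={c})")
```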

29 Chapter 8. Decision Tree
Tree model
Tree building
Tree pruning
Tree and ensemble

30 Decision tree for regression
Consider the simplest case: each leaf node corresponds to a constant.
Each time we find an attribute and decide a condition, we minimize a (e.g. quadratic) cost:
min_{d,t} [ min_{c1} Σ_{i: x_id <= t} (y_i - c1)^2 + min_{c2} Σ_{i: x_id > t} (y_i - c2)^2 ]
The final regression tree is thus a piecewise-constant function.
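A hedged numpy sketch of this split search for one node; the inner minimizations over the constants are solved by the subset means, and the function name is my own:

```python
import numpy as np

def best_regression_split(X, y):
    """Find (d, t) minimizing the total squared error of a two-way split.

    X: (n, D) array of continuous attributes; y: (n,) array of targets.
    The optimal constant for each side is that side's mean of y.
    """
    best = (np.inf, None, None)
    n, D = X.shape
    for d in range(D):
        for t in np.unique(X[:, d])[:-1]:          # candidate thresholds (both sides non-empty)
            left, right = y[X[:, d] <= t], y[X[:, d] > t]
            cost = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
            if cost < best[0]:
                best = (cost, d, t)
    return best  # (cost, attribute index d, threshold t)
```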

31 Equivalence of decision tree and boosting tree for regression
Hunt's algorithm: "divide and conquer", conditions + base models, where each base model is a constant.
Boosting: a linear combination of base models (Model 1 + Model 2 + Model 3), where each base model is a decision stump.
Both approaches yield piecewise-constant functions, so they are equivalent.
(Figure: the same piecewise-constant function shown as a tree of constants and as a sum of stumps.)
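A tiny illustration of that equivalence on one input variable (the thresholds 3 and 5 and the constants are my own example, not from the slides):

```python
def tree_form(x):
    """Piecewise-constant function written as a decision tree over thresholds 3 and 5."""
    if x <= 3:
        return 1.0
    elif x <= 5:
        return 3.0
    else:
        return 4.0

def boosting_form(x):
    """The same function written as a constant plus two decision stumps."""
    return 1.0 + (2.0 if x > 3 else 0.0) + (1.0 if x > 5 else 0.0)

# Both representations agree everywhere.
assert all(tree_form(x) == boosting_form(x) for x in [0, 3, 4, 5, 7])
```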

32 Implementation
ID3: uses information gain.
C4.5: uses information gain ratio (by default); one of the most famous classification algorithms.
CART: uses the Gini index (for classification) and quadratic cost (for regression); only 2-way splits.
For pruning, according to C_alpha(T) = C(T) + alpha * |T|, increase alpha gradually to get a series of subtrees, then determine which subtree is optimal by validation (or cross-validation).
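For reference, scikit-learn's CART-style trees expose this cost-complexity pruning; a minimal sketch, with a stand-in dataset and cross-validation used to pick alpha:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset

# Increasing alpha yields a nested series of pruned subtrees.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Pick the alpha (i.e. the subtree) with the best cross-validated accuracy.
best_alpha = max(
    path.ccp_alphas,
    key=lambda a: cross_val_score(
        DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=5
    ).mean(),
)
print("selected alpha:", best_alpha)
```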

33 Remarks on tree models
Easy to interpret.
Irrelevant/redundant attributes can be filtered out.
Good at handling discrete variables.
How to handle complex conditions such as x_1 + x_2 < c? Use an oblique tree, whose splits are linear combinations of attributes.

34 Random forest
A combination of decision trees and ensemble learning.
Following bagging, first generate multiple datasets (bootstrap samples), each of which gives rise to a tree model.
During tree building, consider only a random subset of features at each split.
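A minimal scikit-learn sketch of these two ingredients, bootstrap sampling and random feature subsets (again with a stand-in dataset):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees in the ensemble
    bootstrap=True,        # each tree sees a bootstrap sample of the data
    max_features="sqrt",   # random feature subset considered at each split
    random_state=0,
)
print(cross_val_score(forest, X, y, cv=5).mean())
```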

35 Chap 5. Non-Parametric Supervised Learning: chapter summary
Dictionary: decision tree, Gini index, pruning (of a decision tree), Hunt's algorithm, information gain, information gain ratio.
Toolbox: CART, C4.5, random forest.

