Statistical Learning Dong Liu Dept. EEIS, USTC
Chapter 8. Decision Tree: Tree model, Tree building, Tree pruning, Tree and ensemble.
Taxonomy. How does a biologist determine the category of an animal? By a hierarchy of rules: Kingdom of Animalia, Phylum of Chordata, Class of Mammalia, Order of Carnivora, …
Taxonomy as tree
Decision tree (example). [Figure: a decision tree. The root tests Refund (Yes / No); Refund = Yes leads to leaf NO; Refund = No leads to a test on MarSt (Single, Divorced / Married); Married leads to leaf NO; Single, Divorced leads to a test on TaxInc (<= 80K / > 80K), with leaves NO and YES respectively.]
Using the decision tree. Start from the root of the tree and follow, at each internal node, the branch matching the record's attribute value until a leaf is reached. [Figure: the same tree, showing the path taken for a test record.]
Tree models. A tree model consists of a set of conditions and a set of base models, organized in a tree. Each internal node represents a condition on the input attributes; a condition is a division (split) of the input space. Each leaf node represents a base model: for classification, a class (simplest case) or a classifier; for regression, a constant (simplest case) or a regressor.
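To make this structure concrete, here is a minimal sketch of such a tree model in Python; the class and field names are illustrative, not part of the slides:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional

@dataclass
class TreeNode:
    # Internal node: `condition` maps an input record to a branch label,
    # and `children` maps each branch label to a subtree.
    condition: Optional[Callable[[dict], Any]] = None
    children: Dict[Any, "TreeNode"] = field(default_factory=dict)
    # Leaf node: `prediction` is the base model in its simplest form,
    # a class label (classification) or a constant (regression).
    prediction: Any = None

    def predict(self, x: dict) -> Any:
        if self.condition is None:               # leaf node
            return self.prediction
        return self.children[self.condition(x)].predict(x)
```

For instance, the tree above could be written as `TreeNode(condition=lambda x: x["Refund"], children={"Yes": TreeNode(prediction="NO"), "No": ...})`.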
Chapter 8. Decision Tree: Tree model, Tree building, Tree pruning, Tree and ensemble.
Tree induction. Assume we have defined the form of the base models. How do we find the optimal tree structure (the set of conditions, i.e. the division of the input space)? Exhaustive search is computationally expensive; a heuristic approach is Hunt's algorithm.
Hunt's algorithm. Input: a set of training data $\mathcal{D} = \{(\boldsymbol{x}_n, y_n)\}$. Output: a classification tree or regression tree $T$.
Function $T$ = Hunt_Algorithm($\mathcal{D}$):
  If $\mathcal{D}$ need not or cannot be divided, return a leaf node.
  Else:
    Find an attribute of $\boldsymbol{x}$, say $x_d$, and decide a condition $g(x_d)$.
    Divide $\mathcal{D}$ into $\mathcal{D}_1, \mathcal{D}_2, \ldots$ according to the output of $g(x_d)$.
    $T_1$ = Hunt_Algorithm($\mathcal{D}_1$), $T_2$ = Hunt_Algorithm($\mathcal{D}_2$), ...
    Let $T_1, T_2, \ldots$ be the children of $T$.
    Return $T$.
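A runnable sketch of Hunt's algorithm for classification, reusing the TreeNode sketch above and assuming a user-supplied choose_split function; the helper names are illustrative:

```python
from collections import Counter

def hunt_algorithm(D, choose_split, min_size=1):
    """D: list of (x, y) pairs with x a dict of attribute values.
    choose_split(D): returns a condition g(x) giving a branch label, or None."""
    labels = [y for _, y in D]
    majority = Counter(labels).most_common(1)[0][0]
    # D need not be divided (pure labels) or cannot usefully be divided (too small).
    if len(set(labels)) == 1 or len(D) <= min_size:
        return TreeNode(prediction=majority)
    g = choose_split(D)
    if g is None:                                  # no attribute can divide D
        return TreeNode(prediction=majority)
    # Divide D into D_1, D_2, ... according to the output of g(x).
    parts = {}
    for x, y in D:
        parts.setdefault(g(x), []).append((x, y))
    if len(parts) < 2:                             # degenerate split
        return TreeNode(prediction=majority)
    # Recurse: the subtrees T_1, T_2, ... become the children of T.
    children = {v: hunt_algorithm(Dv, choose_split, min_size)
                for v, Dv in parts.items()}
    return TreeNode(condition=g, children=children)
```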
Example of Hunt's algorithm (1). [Figure: the training set $\mathcal{D}$ and the partial tree $T$: the root splits on Refund (Yes / No), with subtrees $T_1 = ?$ and $T_2 = ?$.] Does $\mathcal{D}$ need to be divided? Yes: its class labels have different values. Can $\mathcal{D}$ be divided? Yes: its input attributes have different values.
Example of Hunt's algorithm (2). [Figure: the subset $\mathcal{D}_1$ (Refund = Yes) and the partial tree, where the Refund = Yes branch now ends in leaf NO and $T_2 = ?$.] Does $\mathcal{D}_1$ need to be divided? No, so $T_1$ is a leaf node.
Example of Hunt's algorithm (3). [Figure: the subset $\mathcal{D}_2$ (Refund = No) and the partial tree, where $T_2$ splits on MarSt (Single, Divorced / Married), with subtrees $T_{21} = ?$ and $T_{22} = ?$.]
Example of Hunt's algorithm (4). [Figure: the subset $\mathcal{D}_{22}$ and the completed tree: Refund = Yes → NO; Refund = No → MarSt; MarSt = Single, Divorced → TaxInc (<= 80K → NO, > 80K → YES); MarSt = Married → leaf.] Can $\mathcal{D}_{22}$ be divided? No, so $T_{22}$ is a leaf node.
Find an attribute and decide a condition. For discrete values, the split can be multi-way (e.g. MarSt → Single / Married / Divorced) or two-way (e.g. MarSt → {Single, Divorced} / {Married}, or {Single} / {Married, Divorced}). For continuous values, the split can be two-way or multi-way. Which attribute and which condition should be selected? Define a criterion that describes the "gain" of dividing a set into several subsets.
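For a discrete attribute, the candidate two-way splits are the ways of dividing its value set into two non-empty groups; a small illustrative sketch:

```python
from itertools import combinations

def two_way_splits(values):
    """Enumerate all divisions of a set of discrete values into two non-empty groups."""
    values = sorted(set(values))
    for r in range(1, len(values) // 2 + 1):
        for left in combinations(values, r):
            right = tuple(v for v in values if v not in left)
            if len(left) < len(right) or left < right:   # skip mirrored duplicates
                yield set(left), set(right)

# list(two_way_splits(["Single", "Married", "Divorced"])) gives the 3 splits:
# ({'Divorced'}, {'Married', 'Single'}), ({'Married'}, {'Divorced', 'Single'}),
# ({'Single'}, {'Divorced', 'Married'})
```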
Purity of a set. The purity of a set describes how easily the set can be classified; e.g. compare two sets with 0-1 classes: {0,0,0,0,0,0,0,0,0,1} vs. {0,1,0,1,0,1,0,1,0,1}. Measures ($p_0$ and $p_1$ denote the fractions of class 0 and class 1): entropy $-p_0 \log p_0 - p_1 \log p_1$; Gini index $1 - p_0^2 - p_1^2$; misclassification error (when predicting the dominant class) $\min(p_0, p_1)$.
Criterion to find an attribute and decide a condition.
Information gain: $g = H(\mathcal{D}) - \sum_i \frac{|\mathcal{D}_i|}{|\mathcal{D}|} H(\mathcal{D}_i)$, where $H(\mathcal{D})$ is the entropy.
Information gain ratio: $gr = g \big/ \left(-\sum_i \frac{|\mathcal{D}_i|}{|\mathcal{D}|} \log \frac{|\mathcal{D}_i|}{|\mathcal{D}|}\right)$, which penalizes splits into too many subsets.
Gini index gain: $gig = G(\mathcal{D}) - \sum_i \frac{|\mathcal{D}_i|}{|\mathcal{D}|} G(\mathcal{D}_i)$, where $G(\mathcal{D})$ is the Gini index.
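A self-contained sketch of these purity measures and split criteria (function names are illustrative); each gain compares the parent set with the weighted average over its subsets:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def misclassification_error(labels):
    return 1.0 - max(Counter(labels).values()) / len(labels)

def weighted(measure, subsets, n):
    return sum(len(s) / n * measure(s) for s in subsets)

def information_gain(labels, subsets):           # g = H(D) - sum_i |D_i|/|D| H(D_i)
    return entropy(labels) - weighted(entropy, subsets, len(labels))

def gain_ratio(labels, subsets):                 # g divided by the split information
    n = len(labels)
    split_info = -sum(len(s) / n * log2(len(s) / n) for s in subsets)
    return information_gain(labels, subsets) / split_info if split_info else 0.0

def gini_gain(labels, subsets):                  # gig = G(D) - sum_i |D_i|/|D| G(D_i)
    return gini(labels) - weighted(gini, subsets, len(labels))

# e.g. [0]*9 + [1]: entropy ~ 0.47, Gini = 0.18, error = 0.1  (nearly pure)
#      [0, 1]*5   : entropy = 1.0,  Gini = 0.5,  error = 0.5  (least pure)
```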
Example: Gini index gain. Before the split: $G(\mathcal{D}) = 0.42$. After splitting with {TaxInc <= 97}: $gig = 0.12$. [Table: the training samples.]
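To see where the before-split value comes from, assume (this class distribution is inferred from $G(\mathcal{D}) = 0.42$, not stated in the text) that 3 of the 10 training samples belong to the positive class and 7 to the negative class:

$G(\mathcal{D}) = 1 - p_0^2 - p_1^2 = 1 - 0.7^2 - 0.3^2 = 1 - 0.49 - 0.09 = 0.42$.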
Chapter 8. Decision Tree: Tree model, Tree building, Tree pruning, Tree and ensemble.
Control the complexity of the tree. Using Hunt's algorithm, we build a tree that is as accurate as possible, which may incur over-fitting. There are two ways to control the complexity (and thus avoid over-fitting). Early termination: stop splitting if the gain is less than a threshold, if the tree is too deep, or if the set is too small. Tree pruning: remove branches from the tree so as to minimize the joint cost $C_\alpha(T) = C(T) + \alpha|T|$, where $C(T)$ is the empirical risk (e.g. the error rate on the training data) and $|T|$ is the tree complexity (e.g. the number of leaf nodes).
Tree pruning example (1). [Figure: the full tree with per-leaf counts: Refund = Yes → NO (3 correct); MarSt = Married → NO (2 correct, 1 error); TaxInc <= 80K → NO (1 correct); TaxInc > 80K → YES (3 correct).] $C(T) = 1/10$, $|T| = 4$.
Tree pruning example (2). We have different pruning choices. [Figure: the original tree and one pruned tree with three leaf nodes.] $C(T) = 3/10$, $|T| = 3$.
Tree pruning example (2), continued. [Figure: another pruned tree with three leaf nodes, where the TaxInc subtree is collapsed into a YES leaf (3 correct, 1 error), keeping Refund = Yes → NO (3 correct) and MarSt = Married → NO (2 correct, 1 error).] $C(T) = 2/10$, $|T| = 3$.
Tree pruning example (2), continued. [Figure: a third pruned tree with three leaf nodes, where the MarSt test is removed and Refund = No leads directly to TaxInc (<= 80K → NO, 3 correct; > 80K → YES, 3 correct, 1 error), keeping Refund = Yes → NO (3 correct).] $C(T) = 1/10$, $|T| = 3$.
Tree pruning example (2), continued. Among the three-leaf candidates, select the tree with minimal $C(T)$: the one that tests Refund and then TaxInc, with $C(T) = 1/10$; the other two have $C(T) = 3/10$ and $C(T) = 2/10$.
Tree pruning example (3). Continue pruning, keeping 2 leaf nodes. [Figure: three candidate two-leaf trees, with $C(T) = 3/10$, $4/10$ and $4/10$ respectively.]
Tree pruning example (4). Continue pruning, keeping 1 leaf node: a single NO leaf (6 correct, 4 errors), $C(T) = 4/10$, $|T| = 1$.
Tree pruning example (5). In summary, the best subtrees of each size give $C(T) = 1/10$ with $|T| = 4$; $C(T) = 1/10$ with $|T| = 3$; $C(T) = 3/10$ with $|T| = 2$; and $C(T) = 4/10$ with $|T| = 1$. Therefore the optimal tree depends on $\alpha$: $\alpha \ge 0.15$ gives one leaf node; $0 \le \alpha \le 0.15$ gives three leaf nodes; at $\alpha = 0$ the four-leaf tree is equally good. [Figure: the four-leaf tree and the three-leaf tree.]
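The thresholds on $\alpha$ follow directly from comparing the joint costs $C_\alpha(T) = C(T) + \alpha|T|$ of the best subtree of each size:

$C_\alpha = 0.1 + 4\alpha$ (4 leaves), $\quad 0.1 + 3\alpha$ (3 leaves), $\quad 0.3 + 2\alpha$ (2 leaves), $\quad 0.4 + \alpha$ (1 leaf).

The three-leaf tree beats the single leaf while $0.1 + 3\alpha \le 0.4 + \alpha$, i.e. $\alpha \le 0.15$; it ties with the four-leaf tree only at $\alpha = 0$; and the two-leaf tree is dominated for every $\alpha \ge 0$.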
Chapter 8. Decision Tree: Tree model, Tree building, Tree pruning, Tree and ensemble.
Decision tree for regression. Consider the simplest case, where each leaf node corresponds to a constant. Each time we find an attribute and decide a condition, we minimize the (e.g. quadratic) cost: $\min_{d,t}\left[\min_{c_1}\sum_{x_{id} \le t}(y_i - c_1)^2 + \min_{c_2}\sum_{x_{id} > t}(y_i - c_2)^2\right]$. The final regression tree is thus a piecewise constant function.
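A sketch of this greedy split search under the quadratic cost; the optimal constants $c_1, c_2$ are simply the means of $y$ on each side of the threshold (names are illustrative):

```python
import numpy as np

def best_regression_split(X, y):
    """X: (n, D) array of inputs, y: (n,) array of targets.
    Returns the attribute d, threshold t and cost minimizing the quadratic criterion."""
    best_d, best_t, best_cost = None, None, np.inf
    for d in range(X.shape[1]):
        for t in np.unique(X[:, d])[:-1]:        # candidate thresholds along attribute d
            left, right = y[X[:, d] <= t], y[X[:, d] > t]
            cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if cost < best_cost:
                best_d, best_t, best_cost = d, t, cost
    return best_d, best_t, best_cost
```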
Equivalence of decision tree and boosting tree for regression. Hunt's algorithm is "divide and conquer": conditions plus base models, where each base model is a constant. Boosting is a linear combination of base models (Model 1 + Model 2 + Model 3), where each base model is a decision stump. Both approaches produce piecewise constant functions, so they are equivalent. [Figure: the same piecewise constant function represented as a tree of constants and as a sum of stumps.]
Implementations. ID3 uses information gain. C4.5 uses the information gain ratio (by default) and is one of the most famous classification algorithms. CART uses the Gini index (for classification) and the quadratic cost (for regression), with two-way splits only; following $C_\alpha(T) = C(T) + \alpha|T|$, it increases $\alpha$ gradually to obtain a series of subtrees and determines the optimal subtree by validation (or cross-validation).
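For practical experiments, scikit-learn's DecisionTreeClassifier implements a CART-style tree (Gini criterion, binary splits) together with cost-complexity pruning; a minimal sketch, assuming scikit-learn is installed and using a synthetic dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# The pruning path gives the effective alphas at which subtrees are removed.
path = DecisionTreeClassifier(criterion="gini", random_state=0) \
    .cost_complexity_pruning_path(X_train, y_train)

# Refit with each alpha and pick the subtree that does best on held-out data.
scores = [DecisionTreeClassifier(ccp_alpha=a, random_state=0)
          .fit(X_train, y_train).score(X_val, y_val) for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
```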
Remarks on tree models. They are easy to interpret; irrelevant or redundant attributes can be filtered out; they handle discrete variables well. How to handle complex conditions such as $x_1 + x_2 < c$? Use an oblique tree.
Random forest: a combination of decision trees and ensemble learning. Following bagging, first generate multiple datasets (bootstrap samples), each of which gives rise to a tree model. During tree building, consider a random subset of features at each split.
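A minimal sketch of that recipe, combining bootstrap samples with a random feature subset per split (the latter via scikit-learn's max_features option); the function names and majority-vote helper are illustrative:

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))            # bootstrap sample of the data
        tree = DecisionTreeClassifier(max_features="sqrt")    # random feature subset per split
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def random_forest_predict(trees, X):
    votes = np.array([t.predict(X) for t in trees])           # shape (n_trees, n_samples)
    return np.array([Counter(votes[:, j]).most_common(1)[0][0]
                     for j in range(votes.shape[1])])          # majority vote per sample
```

scikit-learn's RandomForestClassifier provides the same combination out of the box.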
Chapter summary. Dictionary: decision tree, Gini index, pruning (of decision tree), information gain, information gain ratio. Toolbox: Hunt's algorithm, C4.5, CART, random forest.