Decision trees (concept learning) 02/03/2016
These slides are based on Tom M. Mitchell’s Machine Learning slides (2011) http://www.cs.cmu.edu/~tom/10701_sp11/lectures.shtml
Concept learning
Example concept: "the days on which the weather is suitable for playing (outdoor) tennis."
Outlook  Temp.  Humidity  Wind  Tennis?
Sunny    Hot    High      Weak  No
Further values appearing in the table: Wind=Strong; Outlook=Overcast, Rain; Temp.=Cold, Mild; Humidity=Normal; Tennis?=Yes
Hypothesis
A hypothesis h in "concept learning" is the disjunctive normal form of feature values.
A feature value can be a particular value, e.g. Wind=Strong, or any value, e.g. Wind=?
A possible h is:
Outlook  Temp.  Humidity  Wind
Sunny    ?      ?         Strong
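As a minimal sketch of how such a hypothesis can be applied to a day, the snippet below treats h as a tuple of constraints and checks whether an instance satisfies all of them. The names `ANY` and `matches` are illustrative choices, not part of the slides.

```python
# A hypothesis is a tuple of constraints, one per feature, in the order
# (Outlook, Temp., Humidity, Wind). "?" means "any value is acceptable".
ANY = "?"

def matches(hypothesis, instance):
    """Return True if every constraint accepts the corresponding feature value."""
    return all(c == ANY or c == v for c, v in zip(hypothesis, instance))

h = ("Sunny", ANY, ANY, "Strong")  # the example hypothesis from the slide
print(matches(h, ("Sunny", "Hot", "High", "Strong")))  # True
print(matches(h, ("Sunny", "Hot", "High", "Weak")))    # False
```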
Decision trees
Decision Tree for PlayTennis
Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes
Decision Tree for PlayTennis
Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast
  Rain
Each internal node tests an attribute.
Each branch corresponds to an attribute value.
Each leaf node assigns a classification.
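Outlook  Temperature  Humidity  Wind  PlayTennis
Sunny    Hot          High      Weak  ? -> No
Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes
The instance follows the branch Outlook=Sunny, then Humidity=High, and is classified as No.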
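The lookup above can be sketched in a few lines. The nested-dict encoding of the tree and the function name `classify` are assumptions made for illustration; the slides do not prescribe a data structure.

```python
# The PlayTennis tree as nested dicts: an internal node is
# {attribute: {value: subtree}}, a leaf is just the class label.
TREE = {"Outlook": {
    "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}

def classify(tree, instance):
    """Follow the branches matching the instance until a leaf is reached."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[instance[attribute]]
    return tree

x = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Weak"}
print(classify(TREE, x))  # No
```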
Decision Tree for Conjunction: Outlook=Sunny ∧ Wind=Weak
Outlook
  Sunny -> Wind
    Strong -> No
    Weak -> Yes
  Overcast -> No
  Rain -> No
Decision Tree for Disjunction: Outlook=Sunny ∨ Wind=Weak
Outlook
  Sunny -> Yes
  Overcast -> Wind
    Strong -> No
    Weak -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes
Decision Tree
Decision trees represent disjunctions of conjunctions.
Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes
(Outlook=Sunny ∧ Humidity=Normal) ∨ (Outlook=Overcast) ∨ (Outlook=Rain ∧ Wind=Weak)
Advantages of decision trees
  explicit relationship among features
  human-interpretable model
Training a decision tree (algorithm ID3)
1. A <- the "best" decision attribute for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant
4. Sort the training examples to the leaf nodes according to the attribute value of the branch
5. If all training examples are perfectly classified (same value of the target attribute), stop; else iterate over the new leaf nodes
Which attribute is "best"?
A1=? splits [29+,35-] into True: [21+,5-] and False: [8+,30-]
A2=? splits [29+,35-] into True: [18+,33-] and False: [11+,2-]
Entropy
$Entropy(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$
S is a sample of training examples
$p_+$ is the proportion of positive examples
$p_-$ is the proportion of negative examples
Entropy measures the impurity of S
Entropy
Entropy(S) = the expected number of bits needed to encode the class (+ or -) of a randomly drawn member of S (under the optimal, shortest-length code).
Why? In information theory, the optimal-length code assigns $-\log_2 p$ bits to a message having probability p.
So the expected number of bits needed to encode the class (+ or -) of a random member of S is $-p_+ \log_2 p_+ - p_- \log_2 p_-$.
Note that $0 \log_2 0 = 0$.
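A minimal sketch of the entropy formula for a two-class sample, taking the counts of positive and negative examples as input. The function name `entropy` and its count-based signature are illustrative assumptions.

```python
from math import log2

def entropy(pos, neg):
    """Entropy of a sample with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count > 0:  # convention: 0 * log2(0) = 0
            p = count / total
            result -= p * log2(p)
    return result

print(round(entropy(29, 35), 2))  # 0.99
print(entropy(7, 7))              # 1.0 (maximally impure)
print(entropy(4, 0))              # 0.0 (pure)
```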
Information Gain
Gain(S,A): the expected reduction in entropy due to sorting S on attribute A
$Gain(S,A) = Entropy(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$
$Entropy([29+,35-]) = -\frac{29}{64} \log_2 \frac{29}{64} - \frac{35}{64} \log_2 \frac{35}{64} = 0.99$
A1=? splits [29+,35-] into True: [21+,5-] and False: [8+,30-]
A2=? splits [29+,35-] into True: [18+,33-] and False: [11+,2-]
Information Gain
A1=? splits [29+,35-] into True: [21+,5-] and False: [8+,30-]
Entropy([21+,5-]) = 0.71
Entropy([8+,30-]) = 0.74
Gain(S,A1) = Entropy(S) - 26/64*Entropy([21+,5-]) - 38/64*Entropy([8+,30-]) = 0.27
A2=? splits [29+,35-] into True: [18+,33-] and False: [11+,2-]
Entropy([18+,33-]) = 0.94
Entropy([11+,2-]) = 0.62
Gain(S,A2) = Entropy(S) - 51/64*Entropy([18+,33-]) - 13/64*Entropy([11+,2-]) = 0.12
A1 yields the larger gain, so it is the better attribute to test.
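The two gains can be re-derived from the class counts alone. This is a small sketch; the helper names `entropy` and `gain`, and the representation of each subset as a (positives, negatives) pair, are assumptions for illustration.

```python
from math import log2

def entropy(pos, neg):
    """Two-class entropy from counts; empty classes contribute 0."""
    total = pos + neg
    return -sum(c / total * log2(c / total) for c in (pos, neg) if c > 0)

def gain(parent, children):
    """Information gain; parent and each child are (positives, negatives) pairs."""
    n = sum(parent)
    remainder = sum((p + q) / n * entropy(p, q) for p, q in children)
    return entropy(*parent) - remainder

print(round(gain((29, 35), [(21, 5), (8, 30)]), 2))   # A1: 0.27
print(round(gain((29, 35), [(18, 33), (11, 2)]), 2))  # A2: 0.12
```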
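Training Examples
Day  Outlook   Temp.  Humidity  Wind    Play Tennis
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Strong  Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Weak    Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No
(ICS320 - Foundations of Adaptive and Learning Systems, Part 3: Decision Tree Learning)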
Selecting the Next Attribute
Humidity: High [3+,4-], E=0.985; Normal [6+,1-], E=0.592
Wind: Weak [6+,2-], E=0.811; Strong [3+,3-], E=1.0
Gain(S,Humidity) = 0.940 - (7/14)*0.985 - (7/14)*0.592 = 0.151
Gain(S,Wind) = 0.940 - (8/14)*0.811 - (6/14)*1.0 = 0.048
Selecting the Next Attribute
Outlook: Sunny [2+,3-], E=0.971; Overcast [4+,0-], E=0.0; Rain [3+,2-], E=0.971
Gain(S,Outlook) = 0.940 - (5/14)*0.971 - (4/14)*0.0 - (5/14)*0.971 = 0.247
ID3 Algorithm
S = [D1,D2,…,D14], [9+,5-]
Outlook
  Sunny -> Ssunny = [D1,D2,D8,D9,D11], [2+,3-] -> ?
  Overcast -> [D3,D7,D12,D13], [4+,0-] -> Yes
  Rain -> [D4,D5,D6,D10,D14], [3+,2-] -> ?
Gain(Ssunny, Humidity) = 0.970 - (3/5)*0.0 - (2/5)*0.0 = 0.970
Gain(Ssunny, Temp.) = 0.970 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0 = 0.570
Gain(Ssunny, Wind) = 0.970 - (2/5)*1.0 - (3/5)*0.918 = 0.019
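The whole procedure can be sketched compactly. The data below is the 14-example PlayTennis table from Mitchell's textbook, which is assumed here to be the table these slides use (its partitions agree with the sets listed above). The function names, the nested-dict tree encoding and the majority-vote fallback when attributes run out are illustrative choices, not the author's implementation.

```python
from collections import Counter
from math import log2

ATTRS = ["Outlook", "Temp", "Humidity", "Wind"]
ROWS = """Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No"""
DATA = [dict(zip(ATTRS + ["Play"], line.split())) for line in ROWS.splitlines()]

def entropy(examples):
    """Entropy of the target label over a list of examples."""
    n = len(examples)
    counts = Counter(e["Play"] for e in examples)
    return -sum(c / n * log2(c / n) for c in counts.values())

def gain(examples, attr):
    """Expected reduction in entropy from splitting on attr."""
    n = len(examples)
    remainder = 0.0
    for value in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(examples) - remainder

def id3(examples, attrs):
    labels = {e["Play"] for e in examples}
    if len(labels) == 1:  # perfectly classified: make a leaf
        return labels.pop()
    if not attrs:         # nothing left to test: fall back to the majority label
        return Counter(e["Play"] for e in examples).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(examples, a))  # greedy choice
    rest = [a for a in attrs if a != best]
    return {best: {v: id3([e for e in examples if e[best] == v], rest)
                   for v in sorted({e[best] for e in examples})}}

print(round(gain(DATA, "Outlook"), 3))  # 0.247
print(id3(DATA, ATTRS))
```

On this data the sketch selects Outlook at the root, then Humidity under Sunny and Wind under Rain.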
Converting a Tree to Rules
Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes
R1: If (Outlook=Sunny) ∧ (Humidity=High) Then PlayTennis=No
R2: If (Outlook=Sunny) ∧ (Humidity=Normal) Then PlayTennis=Yes
R3: If (Outlook=Overcast) Then PlayTennis=Yes
R4: If (Outlook=Rain) ∧ (Wind=Strong) Then PlayTennis=No
R5: If (Outlook=Rain) ∧ (Wind=Weak) Then PlayTennis=Yes
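One rule per root-to-leaf path is all the conversion needs. The following sketch assumes the tree is stored as nested dicts (an illustrative encoding); `tree_to_rules` is a hypothetical helper name.

```python
TREE = {"Outlook": {
    "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}

def tree_to_rules(tree, conditions=()):
    """Yield one (conditions, label) pair per root-to-leaf path."""
    if not isinstance(tree, dict):  # leaf: the path so far is a complete rule
        yield conditions, tree
        return
    attribute, branches = next(iter(tree.items()))
    for value, subtree in branches.items():
        yield from tree_to_rules(subtree, conditions + ((attribute, value),))

for i, (conds, label) in enumerate(tree_to_rules(TREE), start=1):
    body = " ∧ ".join(f"({a}={v})" for a, v in conds)
    print(f"R{i}: If {body} Then PlayTennis={label}")
```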
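ID3 and the space of hypotheses
(Figure: ID3 searching the space of decision trees, from the empty tree through progressively larger trees that test attributes A1, A2, A3, A4 on examples labelled + and -.)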
ID3 and the space of hypotheses
  Hill-climbing algorithm that moves from a simple to a complex hypothesis.
  The hypothesis space is complete (the optimal solution is present).
  ID3 outputs a single hypothesis.
  No backtracking (greedy) → it can get stuck in a local optimum.
  It prefers smaller trees (as attributes with greater information gain are placed close to the root).
ID3 → C4.5
  GainRatio()
  Continuous features
  Missing feature values
  Pruning
Attributes with many values
Gain() prefers attributes with many values.
$GainRatio(S,A) = \frac{Gain(S,A)}{Split(S,A)}$, where
$Split(S,A) = -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$
$S_i$ is the subset of S where A takes the value $v_i$.
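A short sketch of the gain ratio computed from class counts, applied to the Outlook split of the PlayTennis data ([9+,5-] into [2+,3-], [4+,0-], [3+,2-]). The function names and the choice to return 0 when Split(S,A) is 0 are assumptions of this sketch.

```python
from math import log2

def entropy(pos, neg):
    """Two-class entropy from counts; empty classes contribute 0."""
    total = pos + neg
    return -sum(c / total * log2(c / total) for c in (pos, neg) if c > 0)

def gain_ratio(parent, children):
    """parent and each child are (positives, negatives) count pairs."""
    n = sum(parent)
    sizes = [p + q for p, q in children]
    gain = entropy(*parent) - sum(
        s / n * entropy(p, q) for s, (p, q) in zip(sizes, children))
    split = -sum(s / n * log2(s / n) for s in sizes if s > 0)
    # Guard: an attribute with a single value has Split = 0 (assumed to score 0).
    return gain / split if split > 0 else 0.0

print(round(gain_ratio((9, 5), [(2, 3), (4, 0), (3, 2)]), 3))  # 0.156
```

The denominator grows with the number of (evenly populated) values, which is what counteracts the preference of Gain() for many-valued attributes.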
Continuous attributes
Convert continuous attributes to discrete intervals.
Temperature = 24°C, Temperature = 27°C
(Temperature > 20.0°C) = {true, false}
How to recognise a good threshold?
Temperature  15°C  18°C  19°C  22°C  24°C  27°C
Tennis       No    Yes
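One common way to pick a threshold, sketched below, is to try the midpoints between adjacent sorted values where the class label changes and keep the one with the highest information gain. The temperatures are the ones listed above; the label sequence is NOT from the slides (it is a hypothetical assignment made up for this example), and `best_threshold` is an illustrative name.

```python
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n) for c in set(labels))

def best_threshold(values, labels):
    """Return (threshold, gain) for the best binary split `value > threshold`."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy([y for _, y in pairs])
    best = None
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 == y2 or v1 == v2:  # only class boundaries are candidates
            continue
        t = (v1 + v2) / 2
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        g = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        if best is None or g > best[1]:
            best = (t, g)
    return best

temps = [15, 18, 19, 22, 24, 27]
tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]  # hypothetical labels
t, g = best_threshold(temps, tennis)
print(t, round(g, 3))  # 18.5 0.459
```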
Missing attributes
During training, at node n and attribute A, either:
  use the most frequent value of A in n;
  use the most frequent value of A in n among the instances with the same class label;
  use the expected value of A estimated from n.
At prediction time:
  evaluate each possible value;
  aggregate the predictions of the leaves (calculate the probability).
Overfitting
Pruning of decision trees to avoid overtraining
  Stop growing the tree if the improvement is minor, or
  prune back the complete tree.
Pruning of decision trees
Split your training data into training and validation sets, then repeat while the performance on the validation set improves:
  Evaluate each tree in which one of the lowest-level branches is pruned (replaced with a leaf).
  Prune the one with the greatest improvement (greedy iteration).
Corollary: there will be heterogeneous leaves.
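The greedy loop above can be sketched as follows. Everything concrete here is an assumption made for illustration: the node layout (`attr`, `branches`, and a `majority` label that would be recorded from the training set), the toy tree, and the tiny validation set.

```python
import copy

def classify(node, x):
    """Walk down the tree; unseen attribute values fall back to the majority label."""
    while isinstance(node, dict):
        node = node["branches"].get(x[node["attr"]], node["majority"])
    return node

def accuracy(tree, data):
    return sum(classify(tree, x) == y for x, y in data) / len(data)

def prunable_paths(node, path=()):
    """Yield paths (branch values from the root) to nodes whose children are all leaves."""
    if not isinstance(node, dict):
        return
    if all(not isinstance(c, dict) for c in node["branches"].values()):
        yield path
    for value, child in node["branches"].items():
        yield from prunable_paths(child, path + (value,))

def pruned(tree, path):
    """Copy of the tree with the node at `path` replaced by its majority leaf."""
    if not path:
        return tree["majority"]
    new = copy.deepcopy(tree)
    node = new
    for value in path[:-1]:
        node = node["branches"][value]
    node["branches"][path[-1]] = node["branches"][path[-1]]["majority"]
    return new

def reduced_error_prune(tree, validation):
    best_acc = accuracy(tree, validation)
    while isinstance(tree, dict):
        candidates = [pruned(tree, p) for p in prunable_paths(tree)]
        top = max(candidates, key=lambda t: accuracy(t, validation))
        if accuracy(top, validation) <= best_acc:  # no improvement: stop
            break
        tree, best_acc = top, accuracy(top, validation)
    return tree

# Hypothetical over-grown tree and validation set.
tree = {"attr": "Outlook", "majority": "Yes", "branches": {
    "Sunny": {"attr": "Humidity", "majority": "No",
              "branches": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain": {"attr": "Temp", "majority": "Yes",
             "branches": {"Mild": "Yes", "Cool": "No"}},
}}
validation = [
    ({"Outlook": "Rain", "Temp": "Cool", "Humidity": "Normal"}, "Yes"),
    ({"Outlook": "Rain", "Temp": "Mild", "Humidity": "High"}, "Yes"),
    ({"Outlook": "Sunny", "Temp": "Hot", "Humidity": "High"}, "No"),
    ({"Outlook": "Sunny", "Temp": "Cool", "Humidity": "Normal"}, "Yes"),
    ({"Outlook": "Overcast", "Temp": "Hot", "Humidity": "High"}, "Yes"),
]
result = reduced_error_prune(tree, validation)
print(result["branches"]["Rain"])    # Yes (the Temp test was pruned away)
print(accuracy(result, validation))  # 1.0
```

The pruned Rain leaf now covers training examples of both classes, which is the "heterogeneous leaves" corollary in practice.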
General questions of machine learning
Generalization ability
  overfitting
  bias-variance dilemma
Overfitting
A model $h \in H$ overfits if there is an $h' \in H$ such that
$error_{train}(h) < error_{train}(h')$ and $error_X(h) > error_X(h')$
Occam’s razor
Among competing hypotheses, the one with the fewest assumptions should be selected.
If a long hypothesis fits the data, it may do so only by chance (imagine that a shorter one also fits the data).
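The bias-variance dilemma
Regression problem for the function F(x); g(x;D) is the prediction of the model trained on D.
We use multiple training sets D.
bias = the systematic error of the model: how far the prediction g(x;D), averaged over the training sets D, is from F(x) (a low-bias model is flexible enough to fit the training data closely)
variance = the difference among the models learnt on the various D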
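For squared error, the two terms are usually made precise by the standard decomposition below, stated at a fixed input x and assuming a noise-free target F(x); $E_D$ denotes the average over training sets D.

```latex
E_D\big[(g(x;D) - F(x))^2\big]
  = \underbrace{\big(E_D[g(x;D)] - F(x)\big)^2}_{\text{bias}^2}
  + \underbrace{E_D\big[(g(x;D) - E_D[g(x;D)])^2\big]}_{\text{variance}}
```

The cross term vanishes because $E_D\big[g(x;D) - E_D[g(x;D)]\big] = 0$.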
The bias-variance dilemma
Overfitting = low bias but high variance
© Ethem Alpaydin: Introduction to Machine Learning. 2nd edition (2010)
"Generalisation" metaparameter
Each machine learning approach has one (or a few) metaparameters to fine-tune the bias/variance tradeoff:
  kNN: k
  Parzen windows: window size
  Naive Bayes: m-estimate
  decision tree: pruning
The curse of dimensionality
Feature space
Introducing new features (new information to the system) helps, BUT it can easily induce overfitting!
The curse of dimensionality
If we have 100 binary features, we would need $2^{100}$ training samples to see every combination of values.
What does "similarity" mean at d=1000?
In practice, the performance can decrease when new features are introduced.
"Curse of dimensionality":
  parameter estimation becomes more complex
  the model overfits easily
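Both points can be made concrete with a small experiment. The sketch below prints $2^{100}$ and then measures, for uniformly random points in the unit cube, how close the nearest neighbour of a query is relative to the farthest one. The function name `contrast`, the sample size and the seed are arbitrary assumptions.

```python
import random
from math import dist

def contrast(d, n_points=200, seed=0):
    """Ratio of nearest to farthest distance from a random query point in [0,1]^d."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(d)]
    distances = [dist(query, [rng.random() for _ in range(d)])
                 for _ in range(n_points)]
    return min(distances) / max(distances)

print(2 ** 100)  # 1267650600228229401496703205376 value combinations
for d in (1, 10, 100, 1000):
    print(d, round(contrast(d), 3))
```

As d grows the ratio moves towards 1: the nearest and the farthest point become almost equally distant, so distance-based "similarity" carries less and less information.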
Performance measure of supervised machine learners
Estimation of the performance of a supervised learner
Supervised learning: based on training examples, learn a model which works well on previously unseen examples.
Selection among models:
  selection of the machine learning approach
  feature space construction
  metaparameter optimisation (e.g. pruning of decision trees)
Leave-out technique
Split your dataset D = {(v1,y1),…,(vn,yn)} into a training set (Dt) and a test set (Dv = D\Dt).
Train on Dt, test on D\Dt.
This simulates the performance on unseen data.
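A minimal sketch of such a split. The function name `holdout_split`, the 30% test fraction and the fixed seed are illustrative assumptions; the slides do not fix any of them.

```python
import random

def holdout_split(data, test_fraction=0.3, seed=0):
    """Shuffle a copy of the data and return (training set Dt, test set Dv)."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)  # avoid any ordering bias in D
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

D = [(v, v % 2) for v in range(10)]  # toy (v_i, y_i) pairs
Dt, Dv = holdout_split(D)
print(len(Dt), len(Dv))  # 7 3
```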
train-dev-test
(Figure: the dataset split either into train | test, or into train | dev | test.)
n-fold cross validation
n-fold cross validation splits the training data D into n disjoint folds: D1, D2, …, Dn.
Train your model on D\Di, then evaluate it on Di.
Repeat this for each of the n folds and average the results achieved.
(Figure: with four folds D1, D2, D3, D4, each fold is held out once for evaluation while the other three are used for training.)
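The procedure can be sketched generically, with the learner passed in as two functions. All names here (`cross_validate`, `train_majority`, `accuracy`) and the toy data are assumptions for illustration; in practice the data would normally be shuffled before it is cut into folds.

```python
from collections import Counter

def cross_validate(data, n, train, evaluate):
    """Average the score of `train`/`evaluate` over n disjoint folds."""
    folds = [data[i::n] for i in range(n)]  # D1, ..., Dn
    scores = []
    for i, held_out in enumerate(folds):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(training)                     # train on D \ Di
        scores.append(evaluate(model, held_out))    # evaluate on Di
    return sum(scores) / n

# Toy learner: always predict the majority class of the training set.
def train_majority(examples):
    return Counter(y for _, y in examples).most_common(1)[0][0]

def accuracy(model, examples):
    return sum(model == y for _, y in examples) / len(examples)

data = [(i, "Yes" if i % 3 else "No") for i in range(12)]
print(cross_validate(data, 4, train_majority, accuracy))  # about 0.667
```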
Summary
Decision trees
  Concept learning
  Entropy
  ID3 -> C4.5
General questions of machine learning
  Generalization ability
  Curse of dimensionality
  Performance evaluation in supervised learning