Decision trees (concept learning)


Decision trees (concept learning) 02/03/2016

These slides are based on Tom M. Mitchell’s Machine Learning slides (2011) http://www.cs.cmu.edu/~tom/10701_sp11/lectures.shtml

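Concept learning example
Concept: "the days where the weather is suitable for playing (outdoor) tennis."
Features and the values they take:
- Outlook: Sunny, Overcast, Rain
- Temp.: Hot, Mild, Cold
- Humidity: High, Normal
- Wind: Weak, Strong
Target: Tennis? (Yes or No)
Example row: Outlook=Sunny, Temp.=Hot, Humidity=High, Wind=Weak, Tennis?=No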

Hypothesis
- A hypothesis h in "concept learning" is a conjunction of constraints on the feature values (a single term of a disjunctive normal form).
- Each constraint is either a particular value, e.g. Wind=Strong, or any value, e.g. Wind=?
- A possible h is: Outlook=Sunny, Temp.=?, Humidity=?, Wind=Strong
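A minimal sketch of how such a hypothesis can be matched against an instance. The tuple encoding, the `ANY` marker and the function name `covers` are assumptions of this example, not part of the slides.

```python
# Attribute order follows the slide: Outlook, Temp., Humidity, Wind.
ANY = "?"  # wildcard constraint: any value is acceptable


def covers(hypothesis, instance):
    """Return True if the instance satisfies every constraint of the hypothesis."""
    return all(h == ANY or h == x for h, x in zip(hypothesis, instance))


h = ("Sunny", ANY, ANY, "Strong")
print(covers(h, ("Sunny", "Hot", "High", "Strong")))  # both constraints hold
print(covers(h, ("Sunny", "Hot", "High", "Weak")))    # the Wind constraint fails
```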

Decision trees

Decision Tree for PlayTennis
Outlook
  Sunny: Humidity
    High: No
    Normal: Yes
  Overcast: Yes
  Rain: Wind
    Strong: No
    Weak: Yes

Decision Tree for PlayTennis
- Each internal node tests an attribute (e.g. Outlook, Humidity).
- Each branch corresponds to an attribute value (e.g. Sunny, Overcast, Rain).
- Each leaf node assigns a classification (No or Yes).

Instance: Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Weak, PlayTennis=?
In the PlayTennis tree, following Outlook=Sunny and then Humidity=High leads to the leaf No, so PlayTennis=No.

Decision Tree for Conjunction: Outlook=Sunny ∧ Wind=Weak
Outlook
  Sunny: Wind
    Strong: No
    Weak: Yes
  Overcast: No
  Rain: No

Decision Tree for Disjunction: Outlook=Sunny ∨ Wind=Weak
Outlook
  Sunny: Yes
  Overcast: Wind
    Strong: No
    Weak: Yes
  Rain: Wind
    Strong: No
    Weak: Yes

Decision Tree
Decision trees represent disjunctions of conjunctions. The PlayTennis tree (Outlook at the root, Humidity under Sunny, Wind under Rain) corresponds to:
(Outlook=Sunny ∧ Humidity=Normal) ∨ (Outlook=Overcast) ∨ (Outlook=Rain ∧ Wind=Weak)
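The equivalence between the tree and the formula can be checked mechanically. The sketch below encodes the PlayTennis tree as nested dictionaries (the encoding and the names `TREE`, `classify` and `dnf` are choices of this example) and compares it with the disjunction of conjunctions on every combination of the three attributes the tree tests.

```python
from itertools import product

# An internal node is {"attribute": name, "branches": {value: subtree}};
# a leaf is simply the class label.
TREE = {
    "attribute": "Outlook",
    "branches": {
        "Sunny": {"attribute": "Humidity",
                  "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"attribute": "Wind",
                 "branches": {"Strong": "No", "Weak": "Yes"}},
    },
}


def classify(tree, instance):
    """Walk from the root to a leaf, following the instance's attribute values."""
    while isinstance(tree, dict):
        tree = tree["branches"][instance[tree["attribute"]]]
    return tree


def dnf(x):
    """The disjunction of conjunctions read off the 'Yes' leaves."""
    return ((x["Outlook"] == "Sunny" and x["Humidity"] == "Normal")
            or x["Outlook"] == "Overcast"
            or (x["Outlook"] == "Rain" and x["Wind"] == "Weak"))


instances = [
    {"Outlook": o, "Humidity": h, "Wind": w}
    for o, h, w in product(["Sunny", "Overcast", "Rain"],
                           ["High", "Normal"],
                           ["Strong", "Weak"])
]
agree = all((classify(TREE, x) == "Yes") == dnf(x) for x in instances)
print(agree)
```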

Advantages of decision trees
- explicit relationships among features
- human-interpretable model

Training a decision tree (algorithm ID3)
1. A ← the "best" decision attribute for the next node.
2. Assign A as the decision attribute for the node.
3. For each value of A, create a new descendant.
4. Sort the training examples to the leaf nodes according to the attribute value of the branch.
5. If all training examples are perfectly classified (same value of the target attribute), stop; else iterate over the new leaf nodes.

Which attribute is "best"?
A1=? splits [29+,35-] into True: [21+,5-] and False: [8+,30-]
A2=? splits [29+,35-] into True: [18+,33-] and False: [11+,2-]

Entropy
- S is a sample of training examples
- $p_+$ is the proportion of positive examples in S
- $p_-$ is the proportion of negative examples in S
- Entropy measures the impurity of S:
$\mathrm{Entropy}(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$

Entropy
Entropy(S) = expected number of bits needed to encode the class (+ or -) of a randomly drawn member of S (under the optimal, shortest-length code).
Why? In information theory, an optimal-length code assigns $-\log_2 p$ bits to a message having probability p. So the expected number of bits needed to encode the class (+ or -) of a random member of S is:
$-p_+ \log_2 p_+ - p_- \log_2 p_-$
Note that $0 \log_2 0 = 0$ by convention.
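A small sketch of the two-class entropy, written directly from the definition. The function name and the choice to pass raw counts of positive and negative examples are assumptions of this example.

```python
from math import log2


def entropy(pos, neg):
    """Entropy of a sample with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:  # a class with zero examples contributes 0 (0 * log2 0 = 0)
            p = count / total
            result -= p * log2(p)
    return result


print(round(entropy(29, 35), 2))  # the [29+,35-] sample used in the slides
print(entropy(32, 32))            # evenly mixed sample: maximal impurity
print(entropy(64, 0))             # pure sample: no impurity
```

A perfectly mixed sample needs one full bit per example, while a pure sample needs none.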

Information Gain
Gain(S,A): expected reduction in entropy due to sorting S on attribute A
$\mathrm{Gain}(S,A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$
$\mathrm{Entropy}([29+,35-]) = -\frac{29}{64}\log_2\frac{29}{64} - \frac{35}{64}\log_2\frac{35}{64} = 0.99$
A1=? splits [29+,35-] into True: [21+,5-] and False: [8+,30-]
A2=? splits [29+,35-] into True: [18+,33-] and False: [11+,2-]

Information Gain
A1 (True: [21+,5-], False: [8+,30-]):
- Entropy([21+,5-]) = 0.71
- Entropy([8+,30-]) = 0.74
- Gain(S,A1) = Entropy(S) - 26/64*Entropy([21+,5-]) - 38/64*Entropy([8+,30-]) = 0.27
A2 (True: [18+,33-], False: [11+,2-]):
- Entropy([18+,33-]) = 0.94
- Entropy([11+,2-]) = 0.62
- Gain(S,A2) = Entropy(S) - 51/64*Entropy([18+,33-]) - 13/64*Entropy([11+,2-]) = 0.12
A1 has the larger gain, so it is the better attribute to split on.
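The two gains can be recomputed from the class counts alone. This is a sketch under the assumption that each node is described by a `(positives, negatives)` tuple; the helper names are illustrative.

```python
from math import log2


def entropy(counts):
    """Entropy of a node given its class counts, e.g. (29, 35)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)


def gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)


S = (29, 35)
gain_a1 = gain(S, [(21, 5), (8, 30)])   # split on A1
gain_a2 = gain(S, [(18, 33), (11, 2)])  # split on A2
print(round(gain_a1, 2), round(gain_a2, 2))
```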

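Training Examples
Day  Outlook   Temp.  Humidity  Wind    Play Tennis
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Strong  Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Weak    Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No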

Selecting the Next Attribute
S = [9+,5-], E=0.940
Humidity: High [3+,4-], E=0.985; Normal [6+,1-], E=0.592
Gain(S,Humidity) = 0.940 - (7/14)*0.985 - (7/14)*0.592 = 0.151
Wind: Weak [6+,2-], E=0.811; Strong [3+,3-], E=1.0
Gain(S,Wind) = 0.940 - (8/14)*0.811 - (6/14)*1.0 = 0.048

Selecting the Next Attribute
Outlook: Sunny [2+,3-], E=0.971; Overcast [4+,0-], E=0.0; Rain [3+,2-], E=0.971
Gain(S,Outlook) = 0.940 - (5/14)*0.971 - (4/14)*0.0 - (5/14)*0.971 = 0.247

ID3 Algorithm
S = [D1,D2,…,D14], [9+,5-], split on Outlook:
- Sunny: Ssunny = [D1,D2,D8,D9,D11], [2+,3-] → ? (needs a further test)
- Overcast: [D3,D7,D12,D13], [4+,0-] → Yes
- Rain: [D4,D5,D6,D10,D14], [3+,2-] → ? (needs a further test)
Choosing the attribute for the Sunny branch:
Gain(Ssunny, Humidity) = 0.970 - (3/5)0.0 - (2/5)0.0 = 0.970
Gain(Ssunny, Temp.) = 0.970 - (2/5)0.0 - (2/5)1.0 - (1/5)0.0 = 0.570
Gain(Ssunny, Wind) = 0.970 - (2/5)1.0 - (3/5)0.918 = 0.019
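The whole recursive procedure can be sketched in a few lines. This is an illustrative implementation, not the slides' own code. The 14 rows are typed in as the standard PlayTennis table from Mitchell's book (an assumption insofar as the slide text does not list every row, although the counts such as [9+,5-] and Ssunny = [D1,D2,D8,D9,D11] agree with it). The nested-dict tree format and all function names are choices of this sketch.

```python
from collections import Counter
from math import log2

ATTRS = ["Outlook", "Temp", "Humidity", "Wind"]
ROWS = """Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No""".splitlines()
DATA = [dict(zip(ATTRS + ["Play"], row.split())) for row in ROWS]  # D1..D14


def entropy(examples):
    counts = Counter(e["Play"] for e in examples)
    n = len(examples)
    return -sum(c / n * log2(c / n) for c in counts.values())


def gain(examples, attr):
    """Expected reduction in entropy when sorting the examples on attr."""
    n = len(examples)
    remainder = 0.0
    for value in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(examples) - remainder


def id3(examples, attrs):
    labels = {e["Play"] for e in examples}
    if len(labels) == 1:   # all examples agree: make a leaf
        return labels.pop()
    if not attrs:          # nothing left to test: majority-vote leaf
        return Counter(e["Play"] for e in examples).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(examples, a))  # greedy choice
    rest = [a for a in attrs if a != best]
    return {best: {value: id3([e for e in examples if e[best] == value], rest)
                   for value in sorted({e[best] for e in examples})}}


tree = id3(DATA, ATTRS)
print(tree)
```

On this data the sketch places Outlook at the root, Humidity under Sunny and Wind under Rain, matching the tree shown on the earlier slides.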

Converting a Tree to Rules
Each path from the root of the PlayTennis tree to a leaf becomes one rule:
R1: If (Outlook=Sunny) ∧ (Humidity=High) Then PlayTennis=No
R2: If (Outlook=Sunny) ∧ (Humidity=Normal) Then PlayTennis=Yes
R3: If (Outlook=Overcast) Then PlayTennis=Yes
R4: If (Outlook=Rain) ∧ (Wind=Strong) Then PlayTennis=No
R5: If (Outlook=Rain) ∧ (Wind=Weak) Then PlayTennis=Yes
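A short sketch of the conversion: every root-to-leaf path is collected as a list of (attribute, value) tests plus the leaf label. The nested-dict tree format `{attribute: {value: subtree}}` and the name `tree_to_rules` are assumptions of this example.

```python
TREE = {"Outlook": {
    "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}


def tree_to_rules(tree, conditions=()):
    """Return a list of (conditions, label) pairs, one per leaf."""
    if not isinstance(tree, dict):       # reached a leaf: emit one rule
        return [(conditions, tree)]
    (attr, branches), = tree.items()     # a node tests exactly one attribute
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, conditions + ((attr, value),))
    return rules


rules = tree_to_rules(TREE)
for i, (conds, label) in enumerate(rules, 1):
    body = " and ".join(f"({a}={v})" for a, v in conds)
    print(f"R{i}: If {body} Then PlayTennis={label}")
```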

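ID3 and the space of hypotheses
(Figure: ID3 searching the space of decision trees, from small trees to larger candidate trees that test the attributes A1, A2, A3 and A4 on positive and negative examples.)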

ID3 and the space of hypotheses
- Hill-climbing search from simple to complex hypotheses.
- The hypothesis space is complete (it contains the optimal solution).
- ID3 outputs a single hypothesis.
- No backtracking (greedy), so it can get stuck in local optima.
- It prefers smaller trees (attributes with greater information gain are placed close to the root).

ID3 → C4.5
- GainRatio()
- Continuous features
- Missing feature values
- Pruning

Attributes with many values
Gain() prefers attributes with many values.
$\mathrm{GainRatio}(S,A) = \frac{\mathrm{Gain}(S,A)}{\mathrm{Split}(S,A)}$, where $\mathrm{Split}(S,A) = -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$
and $S_i$ is the subset of S where A takes the value $v_i$.
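As a rough illustration, the gain ratio can be computed from class counts. Split(S,A) is simply the entropy of the partition sizes, so one helper serves both purposes. The function names and the use of the PlayTennis counts from the earlier slides as inputs are choices of this sketch.

```python
from math import log2


def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)


def gain_ratio(parent, children):
    """Gain(S,A) / Split(S,A), with nodes given as (pos, neg) class counts."""
    n = sum(parent)
    sizes = [sum(ch) for ch in children]
    gain = entropy(parent) - sum(s / n * entropy(ch)
                                 for s, ch in zip(sizes, children))
    split = entropy(sizes)  # entropy of the partition sizes |S_i| / |S|
    return gain / split     # undefined when the split has a single branch


outlook = gain_ratio((9, 5), [(2, 3), (4, 0), (3, 2)])  # three-way split
humidity = gain_ratio((9, 5), [(3, 4), (6, 1)])         # two-way split
print(round(outlook, 3), round(humidity, 3))
```

The three-valued Outlook is divided by a larger Split term than the two-valued Humidity, which narrows the difference between their scores compared with plain Gain().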

Continuous attributes
Convert continuous attributes to discrete intervals:
Temperature=24°C, Temperature=27°C → (Temperature > 20.0°C) = {true, false}
How to recognise a good threshold?
Temperature: 15°C, 18°C, 19°C, 22°C, 24°C, 27°C, each with a Tennis label (No or Yes)
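One common way to pick a threshold, sketched below, is to sort the examples by the attribute, take the midpoints between neighbouring values whose class differs as candidates, and keep the candidate with the highest information gain. The temperatures are those on the slide; the Tennis labels attached to them here are hypothetical, because the per-value labels are not recoverable from the slide text.

```python
from math import log2


def entropy(labels):
    n = len(labels)
    return -sum(labels.count(v) / n * log2(labels.count(v) / n)
                for v in set(labels))


def best_threshold(values, labels):
    """Return (gain, threshold) of the best binary split 'value > threshold'."""
    pairs = sorted(zip(values, labels))
    base = entropy([l for _, l in pairs])
    best = (0.0, None)
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if l1 == l2 or v1 == v2:
            continue  # only boundaries where the class changes are candidates
        t = (v1 + v2) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        g = (base
             - len(left) / len(pairs) * entropy(left)
             - len(right) / len(pairs) * entropy(right))
        if g > best[0]:
            best = (g, t)
    return best


temps = [15, 18, 19, 22, 24, 27]
tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]  # hypothetical labels
best_gain, threshold = best_threshold(temps, tennis)
print(threshold, round(best_gain, 3))
```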

Missing attributes
During training, at node n and attribute A, either:
- use the most frequent value of A in n
- use the most frequent value of A in n among the instances with the same class label
- use the expected value of A estimated from n
At prediction time:
- evaluate each possible value
- aggregate the predictions of the leaves (calculate the probability)

Overfitting

Pruning of decision trees to avoid overtraining
- Stop growing the tree if the improvement is minor, or
- grow the complete tree and prune it back

Pruning of decision trees
Split your training data into training and validation sets, and repeat as long as the validation performance improves:
- Evaluate each tree in which one of the lowest-level branches is pruned (replaced with a leaf).
- Prune the one with the greatest improvement (greedy iteration).
Corollary: there will be heterogeneous leaves.
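The greedy loop can be sketched as follows. This is an illustration under several assumptions: each internal node stores the majority class of its training examples under the key `"majority"`, a pruned node is replaced by that majority label, and the small tree and validation set are hypothetical data invented for the example.

```python
import copy


def classify(tree, x):
    while isinstance(tree, dict):
        tree = tree["branches"].get(x[tree["attr"]], tree["majority"])
    return tree


def accuracy(tree, data):
    return sum(classify(tree, x) == y for x, y in data) / len(data)


def lowest_nodes(tree, path=()):
    """Yield the paths of internal nodes whose children are all leaves."""
    if not isinstance(tree, dict):
        return
    if all(not isinstance(c, dict) for c in tree["branches"].values()):
        yield path
    for value, child in tree["branches"].items():
        yield from lowest_nodes(child, path + (value,))


def pruned_copy(tree, path):
    """Copy of the tree with the node at `path` replaced by its majority leaf."""
    if not path:
        return tree["majority"]
    new = copy.deepcopy(tree)
    node = new
    for value in path[:-1]:
        node = node["branches"][value]
    node["branches"][path[-1]] = node["branches"][path[-1]]["majority"]
    return new


def reduced_error_prune(tree, validation):
    best_acc = accuracy(tree, validation)
    while isinstance(tree, dict):
        candidates = [pruned_copy(tree, p) for p in lowest_nodes(tree)]
        top = max(candidates, key=lambda t: accuracy(t, validation))
        if accuracy(top, validation) <= best_acc:
            break  # no candidate improves on the validation set
        tree, best_acc = top, accuracy(top, validation)
    return tree


# Hypothetical tree: the Temp test under Rain stands for a split fitted to noise.
TREE = {"attr": "Outlook", "majority": "Yes", "branches": {
    "Sunny": {"attr": "Humidity", "majority": "No",
              "branches": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain": {"attr": "Temp", "majority": "Yes",
             "branches": {"Mild": "Yes", "Cool": "No"}},
}}
VALIDATION = [  # hypothetical held-out examples
    ({"Outlook": "Sunny", "Humidity": "High", "Temp": "Hot"}, "No"),
    ({"Outlook": "Sunny", "Humidity": "Normal", "Temp": "Cool"}, "Yes"),
    ({"Outlook": "Overcast", "Humidity": "High", "Temp": "Hot"}, "Yes"),
    ({"Outlook": "Rain", "Humidity": "Normal", "Temp": "Cool"}, "Yes"),
    ({"Outlook": "Rain", "Humidity": "High", "Temp": "Mild"}, "Yes"),
]
pruned = reduced_error_prune(TREE, VALIDATION)
print(accuracy(TREE, VALIDATION), accuracy(pruned, VALIDATION))
```

In this toy case the Temp test under Rain is replaced by a Yes leaf, while the Humidity test under Sunny is kept because removing it would hurt the validation accuracy.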

General questions of machine learning

Generalization ability
- overfitting
- bias-variance dilemma

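Overfitting
A model $h \in H$ overfits if there is an $h' \in H$ such that $\mathrm{error}_{\mathrm{train}}(h) < \mathrm{error}_{\mathrm{train}}(h')$ and $\mathrm{error}_X(h) > \mathrm{error}_X(h')$
(error_train: error on the training data; error_X: error over the whole instance space X)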

Occam's razor
Among competing hypotheses, the one with the fewest assumptions should be selected. If a long hypothesis fits the data, this can be due to chance (imagine that a shorter one also fits the data).

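The bias-variance dilemma
- Regression problem for the function F(x)
- g(x;D) is the prediction of the model trained on D
- We use multiple training sets D
- bias = how far the prediction averaged over the D is from F(x) (a high-bias model has a large error even on its training data)
- variance = the difference among the models learnt on the various D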

The bias-variance dilemma
Overfitting = low bias but high variance
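The tradeoff can be made tangible with a small simulation. Everything below is a hypothetical setup chosen for this sketch: the target F(x) = sin(2πx), the noise level, and the two learners (a constant model that predicts the mean, and a 1-nearest-neighbour model). For each learner, many training sets D are drawn, and the squared bias and the variance of g(x;D) are averaged over a grid of x values.

```python
import math
import random


def F(x):
    return math.sin(2 * math.pi * x)


def sample(rng, n=20, noise=0.3):
    """Draw one training set D of n noisy observations of F."""
    xs = [rng.random() for _ in range(n)]
    return [(x, F(x) + rng.gauss(0, noise)) for x in xs]


def fit_constant(D):
    mean = sum(y for _, y in D) / len(D)
    return lambda x: mean                 # rigid model: ignores x


def fit_nearest(D):
    return lambda x: min(D, key=lambda p: abs(p[0] - x))[1]  # flexible model


def bias_variance(fit, rng, runs=200):
    grid = [i / 20 for i in range(21)]
    models = [fit(sample(rng)) for _ in range(runs)]  # one g(.;D) per D
    bias2 = var = 0.0
    for x in grid:
        preds = [g(x) for g in models]
        avg = sum(preds) / runs
        bias2 += (avg - F(x)) ** 2
        var += sum((p - avg) ** 2 for p in preds) / runs
    return bias2 / len(grid), var / len(grid)


rng = random.Random(0)
b_const, v_const = bias_variance(fit_constant, rng)
b_nn, v_nn = bias_variance(fit_nearest, rng)
print("constant: bias^2=%.3f variance=%.3f" % (b_const, v_const))
print("1-NN:     bias^2=%.3f variance=%.3f" % (b_nn, v_nn))
```

The rigid model shows high bias and low variance. The flexible one shows the opposite pattern, which is the overfitting regime named on the slide.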

© Ethem Alpaydin: Introduction to Machine Learning. 2nd edition (2010)


"Generalisation" metaparameter
Each machine learning approach has one (or a few) metaparameters to fine-tune the bias/variance tradeoff:
- kNN: k
- Parzen windows: window size
- Naive Bayes: m-estimate
- decision tree: pruning

The curse of dimensionality

Feature space
Introducing new features (new information to the system) helps, BUT it can easily lead to overfitting!

The curse of dimensionality
- If we have 100 binary features, we'd need $2^{100}$ training samples to cover every combination…
- What does "similarity" mean at d=1000?
- In practice, performance can decrease when new features are introduced → "curse of dimensionality"
- Parameter estimation becomes more complex
- The model easily overfits
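Both points can be illustrated numerically. The sketch below prints the number of distinct combinations of 100 binary features, then measures how the nearest and the farthest of 200 random points compare as the dimension d grows. The uniform random data, the sample size and the name `contrast` are assumptions of this example.

```python
import math
import random

print(2 ** 100)  # number of distinct combinations of 100 binary features


def contrast(d, n=200, seed=0):
    """Ratio of nearest to farthest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(d)]
    dists = [math.dist(query, [rng.random() for _ in range(d)])
             for _ in range(n)]
    return min(dists) / max(dists)


for d in (2, 10, 100, 1000):
    print(d, round(contrast(d), 3))
```

As d grows the ratio approaches 1: the nearest and the farthest neighbour are almost equally far away, so distance-based "similarity" carries little information.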

Performance measure of supervised machine learners

Estimation of the performance of a supervised learner
Supervised learning: based on training examples, learn a model which works well on previously unseen examples.
Selection among models:
- Selection of the machine learning approach
- Feature space construction
- Metaparameter optimisation (e.g. pruning of decision trees)

Leave-out technique
Split your dataset D = {(v1,y1),…,(vn,yn)} into training (Dt) and test (Dv = D\Dt) sets.
- Train on Dt
- Test on D\Dt
This simulates the performance on unseen data.
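A minimal sketch of such a split. The shuffling, the 25% test fraction and the function name are choices made for illustration, and the dataset is a hypothetical toy set of (features, label) pairs.

```python
import random


def holdout_split(D, test_fraction=0.25, seed=0):
    """Shuffle D and return (D_t, D_v): the training part and the held-out part."""
    items = list(D)
    random.Random(seed).shuffle(items)  # avoid any ordering effect in D
    n_test = max(1, round(len(items) * test_fraction))
    return items[n_test:], items[:n_test]


D = [((i,), i % 2) for i in range(20)]  # hypothetical (v_i, y_i) pairs
D_t, D_v = holdout_split(D)
print(len(D_t), len(D_v))
```

The model is then trained only on `D_t`, and its error on `D_v` serves as the estimate for unseen data.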

train-dev-test
The data is split either into train and test parts, or into train, dev and test parts.

n-fold cross validation
n-fold cross validation splits the training data D into n disjoint folds: D1, D2, …, Dn.
Train your model on D\Di, then evaluate it on Di.
Repeat this n times (i = 1, …, n) and average the results achieved.
(Figure: with four folds D1 D2 D3 D4, each fold in turn serves as the evaluation set.)
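The procedure can be sketched generically, with the learner passed in as a pair of functions. The majority-class "learner", the toy dataset and all names below are hypothetical stand-ins. In practice D would usually be shuffled before the folds are formed.

```python
from collections import Counter


def folds(D, n):
    """Split D into n disjoint folds by taking every n-th example."""
    return [D[i::n] for i in range(n)]


def cross_validate(D, n, train, evaluate):
    parts = folds(D, n)
    scores = []
    for i, D_i in enumerate(parts):
        rest = [ex for j, part in enumerate(parts) if j != i for ex in part]
        model = train(rest)                   # train on D \ D_i
        scores.append(evaluate(model, D_i))   # evaluate on D_i
    return sum(scores) / n                    # average over the n rounds


def train_majority(examples):
    """Trivial stand-in learner: always predict the most frequent label."""
    return Counter(y for _, y in examples).most_common(1)[0][0]


def accuracy(model, examples):
    return sum(model == y for _, y in examples) / len(examples)


D = [(i, "Yes" if i % 3 else "No") for i in range(12)]  # toy data
score = cross_validate(D, 4, train_majority, accuracy)
print(round(score, 3))
```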

Summary
- Decision trees
  - Concept learning
  - Entropy
  - ID3 -> C4.5
- General questions of machine learning
  - Generalization ability
  - Curse of dimensionality
  - Performance evaluation in supervised learning