Induction of Decision Trees


Induction of Decision Trees J. Ross Quinlan Presentation by: Ashwin Kumar Kayyoor

Outline
What is a decision tree?
Building a decision tree
Which attribute should we split on?
Information theory and the splitting criterion
ID3 example
ID3 enhancements: windowing
Dealing with noisy data - expected error pruning

Decision Trees Each branch node represents a choice between a number of alternatives. Each leaf node represents a classification or decision. For example, we might have a decision tree to help a financial institution decide whether a person should be offered a loan.

Example Instance Data

We shall now describe an algorithm for inducing a decision tree from such a collection of classified instances. Originally termed CLS (Concept Learning System), it has been successively enhanced. At the highest level of enhancement that we shall describe, the system is known as ID3 (Iterative Dichotomiser 3).

Tree Induction Algorithm The algorithm operates over a set of training instances, C. If all instances in C are in class P, create a node P and stop; otherwise select a feature or attribute F and create a decision node. Partition the training instances in C into subsets according to the values of F. Apply the algorithm recursively to each of the subsets of C.
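A minimal Python sketch of this recursion is given below; the data structures and the choose_attribute helper (defined later with the information-gain criterion) are illustrative assumptions, not Quinlan's original code.

from collections import Counter

def induce_tree(instances, attributes, target):
    # instances: a list of dicts mapping attribute names (and the target name) to values
    labels = [inst[target] for inst in instances]
    if len(set(labels)) == 1:      # all instances in one class: return a leaf for that class
        return labels[0]
    if not attributes:             # no attributes left to split on: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = choose_attribute(instances, attributes, target)   # e.g. highest information gain
    tree = {best: {}}
    for value in set(inst[best] for inst in instances):
        subset = [inst for inst in instances if inst[best] == value]
        rest = [a for a in attributes if a != best]
        tree[best][value] = induce_tree(subset, rest, target)
    return tree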

Output of Tree Induction Algorithm
if (Outlook == Overcast) return Yes;
else if (Outlook == Sunny) {
    if (Humidity == High) return No;
    if (Humidity == Normal) return Yes;
}
else if (Outlook == Rain) {
    if (Wind == Strong) return No;
    if (Wind == Weak) return Yes;
}

Choosing Attributes and ID3 The order in which attributes are chosen determines how complicated the tree is. ID3 uses information theory to determine the most informative attribute. A measure of the information content of a message is the inverse of the probability of receiving the message: information1(M) = 1/probability(M). Taking logs (base 2) makes information correspond to the number of bits required to encode a message: information(M) = -log2(probability(M)).

Information The information content of a message should be related to the degree of surprise in receiving the message. Messages with a high probability of arrival are not as informative as messages with low probability. Learning aims to predict accurately, i.e. reduce surprise.

Entropy Different messages have different probabilities of arrival. The overall level of uncertainty (termed entropy) is: -Σi Pi log2 Pi. Frequency can be used as a probability estimate, e.g. if there are 5 positive examples and 3 negative examples in a node, the estimated probability of positive is 5/8 = 0.625.
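A small Python helper for this formula, using the same frequency-based probability estimate (an illustrative sketch, not code from the paper):

import math

def entropy(class_counts):
    # Entropy of a class distribution, given the number of examples in each class
    total = sum(class_counts)
    ent = 0.0
    for count in class_counts:
        if count > 0:              # 0 * log2(0) is treated as 0
            p = count / total
            ent -= p * math.log2(p)
    return ent

# entropy([5, 3]) ≈ 0.954, entropy([9, 5]) ≈ 0.940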

Information and Learning We can think of learning as building many-to-one mappings between input and output. Learning tries to reduce the information content of the inputs by mapping them to fewer outputs. Hence we try to minimise entropy. The simplest mapping is to map everything to one output. We seek a trade-off between accuracy and simplicity.

Splitting Criterion Work out the entropy based on the distribution of classes. Try splitting on each attribute. Work out the expected information gain for each attribute. Choose the best attribute.

Information Gain Gain(S, A), the information gain of example set S on attribute A, is defined as
Gain(S, A) = Entropy(S) - Σv ((|Sv| / |S|) * Entropy(Sv))
where:
the sum runs over each value v of all possible values of attribute A
Sv = subset of S for which attribute A has value v
|Sv| = number of elements in Sv
|S| = number of elements in S
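Continuing the sketch from earlier, information gain and the attribute-selection step might look like this (again illustrative; it reuses the entropy helper and supplies the choose_attribute assumed above):

def information_gain(instances, attribute, target):
    # Gain(S, A) = Entropy(S) - sum over values v of (|Sv| / |S|) * Entropy(Sv)
    def class_counts(insts):
        counts = {}
        for inst in insts:
            counts[inst[target]] = counts.get(inst[target], 0) + 1
        return list(counts.values())
    total = len(instances)
    gain = entropy(class_counts(instances))
    for value in set(inst[attribute] for inst in instances):
        subset = [inst for inst in instances if inst[attribute] == value]
        gain -= (len(subset) / total) * entropy(class_counts(subset))
    return gain

def choose_attribute(instances, attributes, target):
    # Pick the attribute with the highest information gain
    return max(attributes, key=lambda a: information_gain(instances, a, target))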

Example Instance Data

Example If S is a collection of 14 examples with 9 YES and 5 NO examples, then Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940. Notice entropy is 0 if all members of S belong to the same class (the data is perfectly classified). For a two-class problem, the range of entropy is 0 ("perfectly classified") to 1 ("totally random").

Example For attribute Wind, suppose there are 8 occurrences of Wind = Weak and 6 occurrences of Wind = Strong.
Entropy(Sweak) = -(6/8)*log2(6/8) - (2/8)*log2(2/8) = 0.811
Entropy(Sstrong) = -(3/6)*log2(3/6) - (3/6)*log2(3/6) = 1.00
Gain(S, Wind) = Entropy(S) - (8/14)*Entropy(Sweak) - (6/14)*Entropy(Sstrong) = 0.940 - (8/14)*0.811 - (6/14)*1.00 = 0.048
For each attribute, the gain is calculated and the attribute with the highest gain is used in the decision node.
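Using the entropy helper sketched earlier, the slide's figures can be reproduced directly:

# Wind = Weak: 6 YES, 2 NO;  Wind = Strong: 3 YES, 3 NO;  whole set S: 9 YES, 5 NO
gain_wind = entropy([9, 5]) - (8/14) * entropy([6, 2]) - (6/14) * entropy([3, 3])
print(round(gain_wind, 3))   # 0.048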

Example We need to find which attribute will be the root node in our decision tree. The gain is calculated for all four attributes:
Gain(S, Outlook) = 0.246
Gain(S, Temperature) = 0.029
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048 (calculated above)
The Outlook attribute has the highest gain, therefore it is used as the decision attribute in the root node.

Output of Tree Induction Algorithm

Example What attribute should be tested at the Sunny branch node? Ssunny = {D1, D2, D8, D9, D11} = the 5 examples from the instance data with Outlook = Sunny.
Gain(Ssunny, Humidity) = 0.970
Gain(Ssunny, Temperature) = 0.570
Gain(Ssunny, Wind) = 0.019
Humidity has the highest gain; therefore, it is used as the decision node. This process is repeated until all attributes are exhausted or every branch ends in a leaf.

Output of Tree Induction Algorithm

Windowing ID3 can deal with very large data sets by performing induction on subsets, or windows, of the data:
1. Select a random subset of the whole set of training instances.
2. Use the induction algorithm to form a rule to explain the current window.
3. Scan through all of the training instances looking for exceptions to the rule.
4. Add the exceptions to the window.
5. Repeat steps 2 to 4 until there are no exceptions left.
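A rough Python sketch of this loop, reusing induce_tree from earlier; classify and the window-size parameter are illustrative assumptions, and the sketch assumes every attribute value in the data also occurs in the window:

import random

def classify(tree, instance):
    # Walk the nested-dict tree produced by induce_tree down to a leaf label
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute][instance[attribute]]
    return tree

def id3_with_windowing(instances, attributes, target, window_size):
    window = random.sample(instances, window_size)        # step 1: random subset
    while True:
        tree = induce_tree(window, attributes, target)    # step 2: fit the current window
        exceptions = [inst for inst in instances          # step 3: scan for exceptions
                      if classify(tree, inst) != inst[target]]
        if not exceptions:                                # step 5: stop when none remain
            return tree
        window.extend(e for e in exceptions if e not in window)   # step 4: add exceptions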

Noisy Data Frequently, training data contains "noise" - i.e. examples which are misclassified, or where one or more of the attribute values is wrong. If there are any unused attributes, we might be able to use them to elaborate the tree to take care of such a misclassified example, but the subtree we would be building would in fact be wrong and would likely misclassify real data. Thus, particularly if we know there is noise in the training data, it may be wise to "prune" the decision tree to remove nodes which, statistically speaking, seem likely to arise from noise in the training data. A question to consider: how fiercely should we prune?

Expected Error Pruning Approximate expected error assuming that we prune at a particular node. Approximate backed-up error from children assuming we did not prune. If expected error is less than backed-up error, prune.

(Static) Expected Error If we prune a node, it becomes a leaf labelled C. What will be the expected classification error at this leaf?
E(S) = (N - n + k - 1) / (N + k)
where:
S is the set of examples in the node
k is the number of classes
N is the number of examples in S
C is the majority class in S
n out of N examples in S belong to C
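A Python sketch of this estimate (the helper name is an assumption):

def static_expected_error(class_counts, num_classes):
    # E(S) = (N - n + k - 1) / (N + k), where n is the size of the majority class C
    N = sum(class_counts)
    n = max(class_counts)
    k = num_classes
    return (N - n + k - 1) / (N + k)

# e.g. a node holding 3 positives and 1 negative in a 2-class problem:
# static_expected_error([3, 1], 2) = (4 - 3 + 2 - 1) / (4 + 2) = 2/6 ≈ 0.33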

Backed-up Error For a non-leaf node, let the children of Node be Node1, Node2, etc. Then:
BackedUpError(Node) = Σi Pi × Error(Nodei)
where the probabilities Pi can be estimated by the relative frequencies of attribute values, i.e. the fractions of examples that fall into each child node.
Error(Node) = min(E(Node), BackedUpError(Node))
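A hedged sketch of the resulting pruning rule, building on static_expected_error; the node representation (an object with class_counts per class and a list of children) is assumed for illustration:

def error_with_pruning(node, num_classes):
    # Returns Error(Node) = min(E(Node), BackedUpError(Node)), pruning in place when E is smaller
    leaf_error = static_expected_error(node.class_counts, num_classes)
    if not node.children:
        return leaf_error
    total = sum(sum(child.class_counts) for child in node.children)
    backed_up = sum((sum(child.class_counts) / total) * error_with_pruning(child, num_classes)
                    for child in node.children)
    if leaf_error < backed_up:
        node.children = []     # prune: the static expected error beats the backed-up error
        return leaf_error
    return backed_up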

Pruning Example

Summary The ID3 family of decision tree induction algorithms uses information theory to decide which attribute, shared by a collection of instances, to split the data on next. Attributes are chosen repeatedly in this way until a complete decision tree that classifies every input is obtained. If the data is noisy, some of the original instances may be misclassified. It may be possible to prune the decision tree in order to reduce classification errors in the presence of noisy data. The speed of this learning algorithm is reasonably high, as is the speed of the resulting decision tree classification system. Generalisation ability can be reasonable too.

References
Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann. (http://www.rulequest.com/Personal/)
Bill Wilson, http://www.cse.unsw.edu.au/~billw/cs9414/notes/ml/06prop/id3/id3.html