Decision Trees

Example of a Decision Tree
The slide shows a small training set (categorical attributes Refund and MarSt, continuous attribute TaxInc, class Cheat) and the decision tree model learned from it. The splitting attributes label the internal nodes:

Refund?
  Yes → NO
  No  → MarSt?
          Married          → NO
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES

Another Example of a Decision Tree
For the same training data (categorical MarSt and Refund, continuous TaxInc, class Cheat), a different tree also fits:

MarSt?
  Married          → NO
  Single, Divorced → Refund?
                       Yes → NO
                       No  → TaxInc?
                               < 80K → NO
                               > 80K → YES

There could be more than one tree that fits the same data!

Apply Model to Test Data
Start from the root of the tree and, at each internal node, follow the branch that matches the test record's value for that node's attribute; repeat until a leaf is reached and predict the class label at that leaf. For the test record on the slides the traversal goes Refund = No, then MarSt = Married, ending at the leaf under "Married", so we assign Cheat to "No".
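As a concrete illustration of this traversal, here is a minimal Python sketch of the tree above applied to a record. The field names, the example record values, and the treatment of income exactly equal to 80K are my assumptions for illustration, not taken from the slides.

```python
def classify(record):
    """Walk the Refund / MarSt / TaxInc tree and return the predicted Cheat label."""
    if record["Refund"] == "Yes":
        return "No"                                         # Refund = Yes -> NO
    if record["MarSt"] == "Married":
        return "No"                                         # Married -> NO
    return "No" if record["TaxInc"] < 80_000 else "Yes"     # Single/Divorced -> split on income

# A record like the one traced above (values assumed for illustration):
print(classify({"Refund": "No", "MarSt": "Married", "TaxInc": 80_000}))  # No
```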

Digression: Entropy

Bits
We are watching a set of independent random samples of a variable X that has four possible values (say A, B, C, D), so we might see a stream like BAACBADCDADDDA… To transmit the data over a binary serial link we can encode each reading with two bits (e.g. A = 00, B = 01, C = 10, D = 11), giving 0100001001001110110011111100…

Fewer Bits
Someone tells us that the probabilities are not equal, e.g. P(A) = 1/2, P(B) = 1/4, P(C) = 1/8, P(D) = 1/8. It is then possible to invent a coding for the transmission that uses only 1.75 bits on average per symbol, for example A = 0, B = 10, C = 110, D = 111 (average length 1/2·1 + 1/4·2 + 1/8·3 + 1/8·3 = 1.75).

General Case
Suppose X can have one of m values v1, …, vm with probabilities p1, …, pm. What is the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X's distribution? It is the entropy

H(X) = -p1*log2(p1) - p2*log2(p2) - … - pm*log2(pm) = -Σj pj*log2(pj)

Shannon arrived at this formula by setting down several desirable properties for a measure of uncertainty and then finding the function that satisfies them.
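As a quick sanity check of the formula (not part of the original slides), the snippet below computes H for the uniform four-value distribution and for the unequal distribution used in the "Fewer Bits" example:

```python
from math import log2

def entropy(probs):
    """H(X) in bits; 0*log(0) is taken to be 0 by convention."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0  bits: the naive 2-bit code is optimal
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits: matches the "fewer bits" coding
```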

Back to Decision Trees

Constructing decision trees (ID3)
Normal procedure: top-down, in a recursive divide-and-conquer fashion.
First: an attribute is selected for the root node and a branch is created for each possible attribute value.
Then: the instances are split into subsets (one for each branch extending from the node).
Finally: the same procedure is repeated recursively for each branch, using only the instances that reach that branch.
The process stops when all instances at a node have the same class. (A runnable sketch of this recursion follows the weather data below.)

Weather data (the standard 14-instance data set):

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No
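The following is a runnable Python sketch of the ID3 recursion described on the previous slide, applied to this weather data. It is an illustration of the idea, not Weka's or the slides' code; the attribute, variable, and function names are my own.

```python
from collections import Counter
from math import log2

ATTRS = ["Outlook", "Temp", "Humidity", "Windy"]
ROWS = [
    ("Sunny", "Hot", "High", "False", "No"),
    ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),
    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "High", "True", "No"),
]
DATA = [dict(zip(ATTRS + ["Play"], row)) for row in ROWS]

def entropy(labels):
    """info([a, b, ...]) from the slides: entropy of a class distribution."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def expected_info(instances, attr):
    """Size-weighted average entropy of the subsets produced by splitting on attr."""
    n = len(instances)
    values = {inst[attr] for inst in instances}
    return sum(
        len(subset) / n * entropy([i["Play"] for i in subset])
        for subset in ([i for i in instances if i[attr] == v] for v in values)
    )

def id3(instances, attributes):
    """Recursive divide-and-conquer construction, as described on the slide."""
    labels = [inst["Play"] for inst in instances]
    if len(set(labels)) == 1 or not attributes:       # pure node, or nothing left to split on
        return Counter(labels).most_common(1)[0][0]   # leaf: (majority) class
    # Choosing minimum expected info is the same as choosing maximum information gain.
    best = min(attributes, key=lambda a: expected_info(instances, a))
    return {best: {v: id3([i for i in instances if i[best] == v],
                          [a for a in attributes if a != best])
                   for v in {inst[best] for inst in instances}}}

print(id3(DATA, ATTRS))
# -> {'Outlook': {'Sunny': {'Humidity': {'High': 'No', 'Normal': 'Yes'}},
#                 'Overcast': 'Yes',
#                 'Rainy': {'Windy': {'False': 'Yes', 'True': 'No'}}}}  (branch order may vary)
```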

Which attribute to select? (The slide shows the four candidate tree stumps, splitting on Outlook, Temperature, Humidity, and Windy respectively.)

A criterion for attribute selection
Which is the best attribute? The one that will result in the smallest tree.
Heuristic: choose the attribute that produces the "purest" nodes.
A popular impurity criterion is the entropy of a node: the lower the entropy, the purer the node.
Strategy: choose the attribute that results in the lowest (size-weighted) entropy of the child nodes.

Attribute "Outlook"
outlook=sunny:    info([2,3]) = entropy(2/5,3/5) = -2/5*log(2/5) - 3/5*log(3/5) = .971
outlook=overcast: info([4,0]) = entropy(4/4,0/4) = -1*log(1) - 0*log(0) = 0
outlook=rainy:    info([3,2]) = entropy(3/5,2/5) = -3/5*log(3/5) - 2/5*log(2/5) = .971
Expected info: .971*(5/14) + 0*(4/14) + .971*(5/14) = .693
(Note: 0*log(0) is normally not defined; by convention it is taken to be 0 here.)
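These numbers can be checked with a few lines of Python (an illustration, not part of the slides):

```python
from math import log2

def info(counts):
    """Entropy of a class distribution given as counts, e.g. info([2, 3])."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)  # 0*log(0) treated as 0

print(info([2, 3]))   # ≈ 0.971        (outlook = sunny)
print(info([4, 0]))   # -0.0, i.e. 0   (outlook = overcast)
print(info([3, 2]))   # ≈ 0.971        (outlook = rainy)
print(5/14 * info([2, 3]) + 4/14 * info([4, 0]) + 5/14 * info([3, 2]))  # ≈ 0.693
```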

Attribute "Temperature"
temperature=hot:  info([2,2]) = entropy(2/4,2/4) = -2/4*log(2/4) - 2/4*log(2/4) = 1
temperature=mild: info([4,2]) = entropy(4/6,2/6) = -4/6*log(4/6) - 2/6*log(2/6) = .918
temperature=cool: info([3,1]) = entropy(3/4,1/4) = -3/4*log(3/4) - 1/4*log(1/4) = .811
Expected info: 1*(4/14) + .918*(6/14) + .811*(4/14) = .911

Attribute "Humidity"
humidity=high:   info([3,4]) = entropy(3/7,4/7) = -3/7*log(3/7) - 4/7*log(4/7) = .985
humidity=normal: info([6,1]) = entropy(6/7,1/7) = -6/7*log(6/7) - 1/7*log(1/7) = .592
Expected info: .985*(7/14) + .592*(7/14) = .788

Attribute "Windy"
windy=false: info([6,2]) = entropy(6/8,2/8) = -6/8*log(6/8) - 2/8*log(2/8) = .811
windy=true:  info([3,3]) = entropy(3/6,3/6) = -3/6*log(3/6) - 3/6*log(3/6) = 1
Expected info: .811*(8/14) + 1*(6/14) = .892

And the winner is... "Outlook", which gives the lowest expected info (.693, versus .911, .788 and .892). So, the root will be "Outlook".

Continuing to split (for Outlook="Sunny"), using the five instances that reach this branch:

Temp  Humidity  Windy  Play
Hot   High      False  No
Hot   High      True   No
Mild  High      False  No
Cool  Normal    False  Yes
Mild  Normal    True   Yes

Which attribute should we choose here?

Continuing to split (for Outlook="Sunny"):
temperature=hot:  info([2,0]) = entropy(2/2,0/2) = 0
temperature=mild: info([1,1]) = entropy(1/2,1/2) = 1
temperature=cool: info([1,0]) = entropy(1/1,0/1) = 0
Expected info: 0*(2/5) + 1*(2/5) + 0*(1/5) = .4
humidity=high:   info([3,0]) = 0
humidity=normal: info([2,0]) = 0
Expected info: 0
windy=false: info([1,2]) = entropy(1/3,2/3) = -1/3*log(1/3) - 2/3*log(2/3) = .918
windy=true:  info([1,1]) = entropy(1/2,1/2) = 1
Expected info: .918*(3/5) + 1*(2/5) = .951
The winner is "Humidity" (expected info 0: both of its branches are pure).

Tree so far: "Outlook" at the root; the Sunny branch splits on "Humidity" (high → no, normal → yes), while the Overcast and Rainy branches are still to be processed.

Continuing to split (for Outlook="Overcast"), using the four instances that reach this branch:

Temp  Humidity  Windy  Play
Hot   High      False  Yes
Cool  Normal    True   Yes
Mild  High      True   Yes
Hot   Normal    False  Yes

Nothing to split here: "play" is always "yes", so this branch becomes a pure leaf. Tree so far: the Overcast branch is now a leaf labelled "yes".

Continuing to split (for Outlook="Rainy"), using the five instances that reach this branch:

Temp  Humidity  Windy  Play
Mild  High      False  Yes
Cool  Normal    False  Yes
Cool  Normal    True   No
Mild  Normal    False  Yes
Mild  High      True   No

We can easily see that "Windy" is the one to choose. (Why? Because windy=false gives all "yes" and windy=true gives all "no": both branches are pure, so the expected info is 0.)

The final decision tree
"Outlook" at the root: Sunny → split on "Humidity" (high → no, normal → yes); Overcast → yes; Rainy → split on "Windy" (false → yes, true → no).
Note: not all leaves need to be pure; sometimes identical instances have different classes ⇒ splitting stops when the data can't be split any further.

Information gain
Instead of using the expected entropy of the child nodes directly, people usually use the information gain: the entropy of the node before the split minus the expected entropy after it. The greater the information gain, the purer the resulting nodes, so the two criteria rank attributes in the same way. For the weather data, info([9,5]) = .940, so:
gain("Outlook")     = .940 - .693 = .247
gain("Temperature") = .940 - .911 = .029
gain("Humidity")    = .940 - .788 = .152
gain("Windy")       = .940 - .892 = .048
So, we choose "Outlook" for the root.
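These gains can be reproduced with a small Python helper (an illustration with my own names, not the slides' code):

```python
from math import log2

def info(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def gain(parent_counts, child_counts):
    """info(parent) minus the size-weighted expected info of the children."""
    n = sum(parent_counts)
    expected = sum(sum(c) / n * info(c) for c in child_counts)
    return info(parent_counts) - expected

parent = [9, 5]                                 # 9 "yes", 5 "no" overall
print(gain(parent, [[2, 3], [4, 0], [3, 2]]))   # Outlook     ≈ 0.247
print(gain(parent, [[2, 2], [4, 2], [3, 1]]))   # Temperature ≈ 0.029
print(gain(parent, [[3, 4], [6, 1]]))           # Humidity    ≈ 0.152
print(gain(parent, [[6, 2], [3, 3]]))           # Windy       ≈ 0.048
```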

Highly-branching attributes: consider the weather data with an added "ID code" attribute that takes a different value for every instance.

Tree stump for the ID code attribute: splitting on ID code creates one single-instance branch per record, so every branch is trivially pure, the expected info of the split is 0, and the information gain is maximal (.940).

Highly-branching attributes
Subsets are more likely to be pure if there is a large number of attribute values.
Information gain is therefore biased towards choosing attributes with a large number of values.
This may result in overfitting (selection of an attribute that is non-optimal for prediction).

The gain ratio
Gain ratio: a modification of the information gain that reduces its bias.
The gain ratio takes the number and size of branches into account when choosing an attribute: it corrects the information gain by taking the intrinsic information of a split into account.
Intrinsic information: the entropy of the node to be split, computed with respect to the attribute in question (i.e. over the branch sizes rather than the class labels).

Computing the gain ratio
gain_ratio(A) = gain(A) / intrinsic_info(A), where intrinsic_info(A) = -Σi (|Si|/|S|) * log(|Si|/|S|) over the subsets Si produced by splitting on A.
Example (ID code): intrinsic_info = info([1,1,…,1]) = 14 * (-1/14 * log(1/14)) = log(14) ≈ 3.807, so gain_ratio("ID code") = .940 / 3.807 ≈ .247.

Gain ratios for weather data (values rounded to three decimals):

Attribute    Gain   Intrinsic info  Gain ratio
Outlook      0.247  1.577           0.157
Temperature  0.029  1.557           0.019
Humidity     0.152  1.000           0.152
Windy        0.048  0.985           0.049
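The table can be reproduced with a small extension of the earlier helper (again a sketch, not the slides' code). Note that at full precision the Outlook gain ratio prints as ≈ 0.156; the 0.157 in the table comes from dividing the rounded 0.247 by 1.577.

```python
from math import log2

def info(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def gain_ratio(parent_counts, child_counts):
    n = sum(parent_counts)
    expected = sum(sum(c) / n * info(c) for c in child_counts)
    gain = info(parent_counts) - expected
    intrinsic = info([sum(c) for c in child_counts])   # entropy of the branch sizes
    return gain / intrinsic

parent = [9, 5]
print(gain_ratio(parent, [[2, 3], [4, 0], [3, 2]]))   # Outlook     ≈ 0.156
print(gain_ratio(parent, [[2, 2], [4, 2], [3, 1]]))   # Temperature ≈ 0.019
print(gain_ratio(parent, [[3, 4], [6, 1]]))           # Humidity    ≈ 0.152
print(gain_ratio(parent, [[6, 2], [3, 3]]))           # Windy       ≈ 0.049
```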

More on the gain ratio
"Outlook" still comes out on top, but "Humidity" is now a much closer contender because it splits the data into two subsets instead of three.
However, "ID code" still has a greater gain ratio; its advantage is just greatly reduced.
Problem with the gain ratio: it may overcompensate, choosing an attribute just because its intrinsic information is very low.
Standard fix: choose the attribute that maximizes the gain ratio, provided the information gain for that attribute is at least as great as the average information gain over all the attributes examined.

Discussion
The algorithm for top-down induction of decision trees ("ID3") was developed by Ross Quinlan (University of Sydney, Australia).
The gain ratio is just one modification of this basic algorithm, which led to the development of C4.5; C4.5 can deal with numeric attributes, missing values, and noisy data.
There are many other attribute selection criteria (but they make almost no difference to the accuracy of the resulting trees).