Data Mining CSCI 307, Spring 2019 Lecture 15

Data Mining CSCI 307, Spring 2019 Lecture 15 Constructing Trees

Wishlist for a Purity Measure
Properties we require from a purity measure:
When a node is pure, the measure should be zero.
When impurity is maximal (i.e. all classes equally likely), the measure should be maximal.
The measure should obey the multistage property (i.e. decisions can be made in several stages):
measure([2,3,4]) = measure([2,7]) + (7/9) x measure([3,4])
Here the decision is made in two stages: first decide between the first class and the other two (the [2,7] split), then, in the second case, decide between the remaining two classes (the [3,4] split).
Entropy is the only function that satisfies all three properties.
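A minimal sketch of such a measure in Python (my own illustration, not code from the course; the function name entropy is an assumption) that checks all three properties, including the multistage identity for [2,3,4]:

    import math

    def entropy(*counts):
        # Entropy in bits of a class distribution given as raw counts, e.g. entropy(2, 3).
        total = sum(counts)
        return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

    # Multistage property: measure([2,3,4]) = measure([2,7]) + (7/9) x measure([3,4])
    print(round(entropy(2, 3, 4), 3))                         # 1.53
    print(round(entropy(2, 7) + (7 / 9) * entropy(3, 4), 3))  # 1.53 (same value)

    # A pure node gives 0; an even split of two classes gives the maximal 1 bit
    print(entropy(4, 0), entropy(7, 7))                       # 0.0 1.0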

Example: attribute Outlook
            Yes  No
Sunny        2    3
Overcast     4    0
Rainy        3    2
Compute the information for each branch: Outlook = Sunny, Outlook = Overcast, Outlook = Rainy.

Example: attribute Outlook
Outlook = Sunny: info([2,3]) = 0.971 bits
Outlook = Overcast: info([4,0]) = 0 bits
Outlook = Rainy: info([3,2]) = 0.971 bits
Expected information for the attribute (weight each branch by the number of instances that go down it):
info([2,3],[4,0],[3,2]) = 5/14 x 0.971 + 4/14 x 0 + 5/14 x 0.971 = 0.693 bits

Computing Information Gain
gain = information before splitting − information after splitting
We've calculated the information BEFORE splitting, info([9,5]) = 0.940 bits, and the information AFTER the split for the Outlook attribute, so we can calculate the information gain for Outlook:
gain(Outlook) = info([9,5]) − info([2,3],[4,0],[3,2]) = 0.940 − 0.693 = 0.247 bits
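As a rough cross-check (my own sketch; the helper name info is an assumption, not Weka code), the Outlook gain can be reproduced directly from the branch counts:

    import math

    def info(*counts):
        # Entropy in bits of a class distribution, e.g. info(9, 5) for 9 yes / 5 no.
        total = sum(counts)
        return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

    # [yes, no] counts down the sunny, overcast and rainy branches
    branches = [(2, 3), (4, 0), (3, 2)]
    after = sum(sum(b) / 14 * info(*b) for b in branches)  # expected info after the split
    gain_outlook = info(9, 5) - after
    print(round(gain_outlook, 3))                          # 0.247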

attribute: Temperature
            Yes  No
Hot          2    2
Mild         4    2
Cool         3    1
Temperature = Hot: info([2,2]) = entropy(2/4, 2/4) = −2/4 log2(2/4) − 2/4 log2(2/4) = 1 bit
Temperature = Mild: info([4,2]) = entropy(4/6, 2/6) = −2/3 log2(2/3) − 1/3 log2(1/3) = 0.918 bits
Temperature = Cool: info([3,1]) = entropy(3/4, 1/4) = −3/4 log2(3/4) − 1/4 log2(1/4) = 0.811 bits

attribute: Temperature
Temperature = Hot: info([2,2]) = 1 bit
Temperature = Mild: info([4,2]) = 0.918 bits
Temperature = Cool: info([3,1]) = 0.811 bits
Expected information for the attribute (average information value, weighting by the number of instances that go down each branch):
info([2,2],[4,2],[3,1]) = 4/14 x 1 + 6/14 x 0.918 + 4/14 x 0.811 = 0.911 bits
gain(Temperature) = info([9,5]) − info([2,2],[4,2],[3,1]) = 0.940 − 0.911 = 0.029 bits

attribute: Humidity
            Yes  No
High         3    4
Normal       6    1
Humidity = High: info([3,4]) = entropy(3/7, 4/7) = −3/7 log2(3/7) − 4/7 log2(4/7) = 0.985 bits
Humidity = Normal: info([6,1]) = entropy(6/7, 1/7) = −6/7 log2(6/7) − 1/7 log2(1/7) = 0.592 bits

attribute: Humidity
Humidity = High: info([3,4]) = 0.985 bits
Humidity = Normal: info([6,1]) = 0.592 bits
Expected information for the attribute (average information value, weighting by the number of instances that go down each branch):
info([3,4],[6,1]) = 7/14 x 0.985 + 7/14 x 0.592 = 0.788 bits
gain(Humidity) = info([9,5]) − info([3,4],[6,1]) = 0.940 − 0.788 = 0.152 bits

attribute: Windy
            Yes  No
False        6    2
True         3    3
Windy = False: info([6,2]) = entropy(6/8, 2/8) = −6/8 log2(6/8) − 2/8 log2(2/8) = 0.811 bits
Windy = True: info([3,3]) = entropy(3/6, 3/6) = −3/6 log2(3/6) − 3/6 log2(3/6) = 1 bit

attribute: Windy
Windy = False: info([6,2]) = 0.811 bits
Windy = True: info([3,3]) = 1 bit
Expected information for the attribute (average information value, weighting by the number of instances that go down each branch):
info([6,2],[3,3]) = 8/14 x 0.811 + 6/14 x 1 = 0.892 bits
gain(Windy) = info([9,5]) − info([6,2],[3,3]) = 0.940 − 0.892 = 0.048 bits

Which Attribute to Select as Root?
For all the attributes from the weather data:
gain(Outlook) = 0.247 bits
gain(Temperature) = 0.029 bits
gain(Humidity) = 0.152 bits
gain(Windy) = 0.048 bits
Outlook is the way to go ... it's the root.
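The whole comparison can be reproduced in a few lines (again my own sketch, with the [yes, no] counts copied from the tables above; gain is an assumed helper name):

    import math

    def info(*counts):
        total = sum(counts)
        return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

    def gain(branches):
        # info([9,5]) before splitting minus the weighted average info after splitting.
        n = sum(sum(b) for b in branches)
        return info(9, 5) - sum(sum(b) / n * info(*b) for b in branches)

    splits = {                                    # [yes, no] counts per attribute value
        "Outlook":     [(2, 3), (4, 0), (3, 2)],
        "Temperature": [(2, 2), (4, 2), (3, 1)],
        "Humidity":    [(3, 4), (6, 1)],
        "Windy":       [(6, 2), (3, 3)],
    }
    for name, branches in splits.items():
        print(name, round(gain(branches), 3))
    # Outlook 0.247, Temperature 0.029, Humidity 0.152, Windy 0.048 -> Outlook wins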

Continuing to Split
Now, determine the gain for EACH of Outlook's branches: sunny, overcast, and rainy.
For the sunny branch we know at this point the entropy is 0.971; it is our "before" split information as we calculate our gain from here.
The rainy branch entropy is also 0.971; use it as our "before" split information as we calculate our gain from here on down.
Splitting stops when we can't split any further; that is the case with the value overcast (all four overcast instances are yes). We don't need to consider Outlook further.

Continuing the Split at Sunny
Now we must determine the gain of each remaining attribute on the sunny branch. For the sunny branch we know the "before" split entropy is 0.971 bits.

Find Subroot for Sunny
humidity = high: info([0,3]) = entropy(0,1) = 0 bits
humidity = normal: info([2,0]) = entropy(1,0) = 0 bits
info([0,3],[2,0]) = 3/5 x 0 + 2/5 x 0 = 0 bits
gain(Humidity) = info([2,3]) − info([0,3],[2,0]) = 0.971 − 0 = 0.971 bits

Find Subroot for Sunny (continued)
windy = false: info([1,2]) = entropy(1/3, 2/3) = −1/3 log2(1/3) − 2/3 log2(2/3) = 0.918 bits
windy = true: info([1,1]) = entropy(1/2, 1/2) = −1/2 log2(1/2) − 1/2 log2(1/2) = 1 bit
info([1,2],[1,1]) = 3/5 x 0.918 + 2/5 x 1 = 0.951 bits
gain(Windy) = info([2,3]) − info([1,2],[1,1]) = 0.971 − 0.951 = 0.020 bits

Find Subroot for Sunny (continued)
temperature = hot: info([0,2]) = entropy(0,1) = −0 log2(0) − 1 log2(1) = 0 bits (by convention, 0 log 0 = 0)
temperature = mild: info([1,1]) = entropy(1/2, 1/2) = −1/2 log2(1/2) − 1/2 log2(1/2) = 1 bit
temperature = cool: info([1,0]) = entropy(1,0) = −1 log2(1) − 0 log2(0) = 0 bits
info([0,2],[1,1],[1,0]) = 2/5 x 0 + 2/5 x 1 + 1/5 x 0 = 0.4 bits
gain(Temperature) = info([2,3]) − info([0,2],[1,1],[1,0]) = 0.971 − 0.4 = 0.571 bits

Finish the Split at Sunny
gain(Humidity) = 0.971 bits
gain(Temperature) = 0.571 bits
gain(Windy) = 0.020 bits
Humidity has the largest gain (and yields pure subsets), so it becomes the test below the sunny branch.
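Restricting the same computation to the five sunny instances reproduces these three numbers (my own sketch; the [yes, no] counts per value come from the slides above):

    import math

    def info(*counts):
        total = sum(counts)
        return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

    before = info(2, 3)                            # 0.971 bits: 2 yes / 3 no reach the sunny branch
    splits = {
        "Humidity":    [(0, 3), (2, 0)],           # high, normal
        "Temperature": [(0, 2), (1, 1), (1, 0)],   # hot, mild, cool
        "Windy":       [(1, 2), (1, 1)],           # false, true
    }
    for name, branches in splits.items():
        after = sum(sum(b) / 5 * info(*b) for b in branches)
        print(name, round(before - after, 3))
    # Humidity 0.971, Temperature 0.571, Windy 0.02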

Possible Splits at Rainy
No need to actually do the calculations: splitting the rainy subset on windy yields pure subsets (windy = false is all yes, windy = true is all no), so windy is chosen.

Final Decision Tree
Note: not all leaves need to be pure; sometimes identical instances have different classes.
Splitting stops when the data can't be split any further.
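Putting the pieces together, here is a compact sketch of the top-down construction loop (my own illustration of the idea, not the course's or Weka's implementation; info, gain and build are assumed names). It picks the highest-gain attribute, splits, and recurses until a subset is pure or no attributes remain:

    import math
    from collections import Counter

    def info(labels):
        # Entropy in bits of a list of class labels.
        total = len(labels)
        return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())

    def gain(rows, labels, attr):
        # Information gain of splitting the instances on attribute attr.
        after = 0.0
        for value in set(row[attr] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            after += len(subset) / len(labels) * info(subset)
        return info(labels) - after

    def build(rows, labels, attrs):
        # Return a leaf (class label) or a node (attribute, {value: subtree}).
        if len(set(labels)) == 1 or not attrs:
            # Pure subset, or nothing left to split on: majority class (leaves need not be pure).
            return Counter(labels).most_common(1)[0][0]
        best = max(attrs, key=lambda a: gain(rows, labels, a))
        branches = {}
        for value in sorted(set(row[best] for row in rows)):
            keep = [i for i, row in enumerate(rows) if row[best] == value]
            branches[value] = build([rows[i] for i in keep], [labels[i] for i in keep],
                                    [a for a in attrs if a != best])
        return (best, branches)

    # Tiny made-up example; on the full 14-instance weather data this procedure reproduces
    # the tree above (Outlook at the root, Humidity under sunny, Windy under rainy).
    rows = [{"Outlook": "sunny", "Windy": "false"},
            {"Outlook": "sunny", "Windy": "true"},
            {"Outlook": "overcast", "Windy": "false"}]
    print(build(rows, ["no", "no", "yes"], ["Outlook", "Windy"]))
    # ('Outlook', {'overcast': 'yes', 'sunny': 'no'})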

Highly-branching Attributes
Problematic: attributes with a large number of values (extreme case: ID code).
Subsets are more likely to be pure if there is a large number of values.
Information gain is therefore biased towards choosing attributes with a large number of values.
This may result in overfitting (selection of an attribute that is non-optimal for prediction).
Another problem: fragmentation (the data is broken into many small subsets).

Tree Stump for the ID code Attribute
This seems like a bad idea for a split: each of the 14 branches contains a single instance, so every branch is pure.
Entropy of the split:
info(ID code) = info([0,1]) + info([0,1]) + info([1,0]) + ... + info([1,0]) + info([0,1]) = 0 bits
Information gain is maximal for ID code (namely 0.940 bits, i.e. the before-split information).
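A quick sanity check of this claim (my own sketch): with 14 single-instance branches the expected information after splitting is zero, so the gain equals the full before-split information.

    import math

    def info(*counts):
        total = sum(counts)
        return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

    # 14 branches, one instance each: 9 of them pure "yes", 5 pure "no"
    branches = [(1, 0)] * 9 + [(0, 1)] * 5
    after = sum(sum(b) / 14 * info(*b) for b in branches)
    print(after, round(info(9, 5) - after, 3))    # 0.0 0.94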

Gain Ratio
Gain ratio: a modification of the information gain that reduces its bias.
Gain ratio takes the number and size of branches into account when choosing an attribute.
It corrects the information gain by taking the intrinsic information of a split into account.
Intrinsic information: the entropy of the distribution of instances into branches (i.e. how much information we need to tell which branch an instance belongs to).

Computing the Gain Ratio
Example: intrinsic information for ID code
info([1,1,...,1]) = 14 x (−1/14 x log2(1/14)) = 3.807 bits
The value of an attribute decreases as its intrinsic information gets larger.
Definition of gain ratio:
gain_ratio(attribute) = gain(attribute) / intrinsic_info(attribute)
Example:
gain_ratio(ID code) = 0.940 bits / 3.807 bits = 0.247
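And the gain-ratio arithmetic itself (my own sketch; intrinsic and gain_id are assumed variable names, and the intrinsic information is just info applied to the fourteen branch sizes):

    import math

    def info(*counts):
        total = sum(counts)
        return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

    intrinsic = info(*[1] * 14)   # intrinsic information of the ID code split: log2(14) = 3.807 bits
    gain_id = info(9, 5)          # the ID code split is pure, so its gain is the full 0.940 bits
    print(round(intrinsic, 3), round(gain_id / intrinsic, 3))   # 3.807 0.247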