Data Mining CSCI 307, Spring 2019, Lecture 15: Constructing Trees
Wishlist for a Purity Measure
Properties we require from a purity measure:
- When a node is pure, the measure should be zero.
- When impurity is maximal (i.e. all classes equally likely), the measure should be maximal.
- The measure should obey the multistage property (i.e. decisions can be made in several stages):
    measure([2,3,4]) = measure([2,7]) + (7/9) x measure([3,4])
  Here the decision is made in two stages: first decide between the first class and the other two taken together, then (for the 7 of 9 instances that reach the second stage) decide between the remaining two classes.
Entropy is the only function that satisfies all three properties. (A numerical check of the multistage property appears in the sketch below.)
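A quick numerical check of the multistage property, as a minimal sketch in Python (the helper name `entropy_bits` and the use of raw class counts are my own choices, not part of the slides):

```python
from math import log2

def entropy_bits(counts):
    """Entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Multistage property: measure([2,3,4]) = measure([2,7]) + (7/9) x measure([3,4])
lhs = entropy_bits([2, 3, 4])
rhs = entropy_bits([2, 7]) + (7 / 9) * entropy_bits([3, 4])
print(round(lhs, 4), round(rhs, 4))  # both print 1.5305
```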
Example: attribute Outlook

  Outlook    Yes  No
  Sunny       2    3
  Overcast    4    0
  Rainy       3    2
Example: attribute Outlook (continued)

Outlook = Sunny:    info([2,3]) = 0.971 bits
Outlook = Overcast: info([4,0]) = 0 bits
Outlook = Rainy:    info([3,2]) = 0.971 bits

Expected information for the attribute (weight each branch by the number of instances that go down it):
  info([2,3],[4,0],[3,2]) = 5/14 x 0.971 + 4/14 x 0 + 5/14 x 0.971 = 0.693 bits
Computing Information Gain
Information gain = information before splitting - information after splitting.
We have already calculated the information BEFORE splitting, info([9,5]) = 0.940 bits, and the information AFTER the split for the Outlook attribute, so we can calculate the information gain for Outlook:
  gain(Outlook) = info([9,5]) - info([2,3],[4,0],[3,2]) = 0.940 - 0.693 = 0.247 bits
(A sketch of this calculation in code follows below.)
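The whole gain computation is only a few lines of code. A minimal sketch in Python (the helper names `entropy_bits`, `expected_info`, and `info_gain` are my own, not from the slides):

```python
from math import log2

def entropy_bits(counts):
    """Entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def expected_info(branches):
    """Average information after a split: entropy of each branch, weighted by its size."""
    n = sum(sum(b) for b in branches)
    return sum(sum(b) / n * entropy_bits(b) for b in branches)

def info_gain(before, branches):
    """Information gain = info before the split - expected info after the split."""
    return entropy_bits(before) - expected_info(branches)

# Outlook splits [9,5] into sunny [2,3], overcast [4,0], rainy [3,2].
print(expected_info([[2, 3], [4, 0], [3, 2]]))      # about 0.693 bits
print(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))  # about 0.247 bits
```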
attribute: Temperature

  Temperature  Yes  No
  Hot           2    2
  Mild          4    2
  Cool          3    1

Temperature = Hot:  info([2,2]) = entropy(2/4, 2/4) = -2/4 log2(2/4) - 2/4 log2(2/4) = 1 bit
Temperature = Mild: info([4,2]) = entropy(4/6, 2/6) = -2/3 log2(2/3) - 1/3 log2(1/3) = 0.918 bits
Temperature = Cool: info([3,1]) = entropy(3/4, 1/4) = -3/4 log2(3/4) - 1/4 log2(1/4) = 0.811 bits
attribute: Temperature (continued)

Temperature = Hot:  info([2,2]) = 1 bit
Temperature = Mild: info([4,2]) = 0.918 bits
Temperature = Cool: info([3,1]) = 0.811 bits

Expected information for the attribute (the average information value, weighting each branch by the number of instances that go down it):
  info([2,2],[4,2],[3,1]) = 4/14 x 1 + 6/14 x 0.918 + 4/14 x 0.811 = 0.911 bits

  gain(Temperature) = info([9,5]) - info([2,2],[4,2],[3,1]) = 0.940 - 0.911 = 0.029 bits
attribute: Humidity

  Humidity  Yes  No
  High       3    4
  Normal     6    1

Humidity = High:   info([3,4]) = entropy(3/7, 4/7) = -3/7 log2(3/7) - 4/7 log2(4/7) = 0.985 bits
Humidity = Normal: info([6,1]) = entropy(6/7, 1/7) = -6/7 log2(6/7) - 1/7 log2(1/7) = 0.592 bits
attribute: Humidity (continued)

Humidity = High:   info([3,4]) = 0.985 bits
Humidity = Normal: info([6,1]) = 0.592 bits

Expected information for the attribute (the average information value, weighting each branch by the number of instances that go down it):
  info([3,4],[6,1]) = 7/14 x 0.985 + 7/14 x 0.592 = 0.788 bits

  gain(Humidity) = info([9,5]) - info([3,4],[6,1]) = 0.940 - 0.788 = 0.152 bits
attribute: Windy

  Windy  Yes  No
  False   6    2
  True    3    3

Windy = False: info([6,2]) = entropy(6/8, 2/8) = -6/8 log2(6/8) - 2/8 log2(2/8) = 0.811 bits
Windy = True:  info([3,3]) = entropy(3/6, 3/6) = -3/6 log2(3/6) - 3/6 log2(3/6) = 1 bit
attribute: Windy (continued)

Windy = False: info([6,2]) = 0.811 bits
Windy = True:  info([3,3]) = 1 bit

Expected information for the attribute (the average information value, weighting each branch by the number of instances that go down it):
  info([6,2],[3,3]) = 8/14 x 0.811 + 6/14 x 1 = 0.892 bits

  gain(Windy) = info([9,5]) - info([6,2],[3,3]) = 0.940 - 0.892 = 0.048 bits
Which Attribute to Select as Root?
For all the attributes in the weather data:
  gain(Outlook)     = 0.247 bits
  gain(Temperature) = 0.029 bits
  gain(Humidity)    = 0.152 bits
  gain(Windy)       = 0.048 bits
Outlook has the largest gain ... it's the root. (See the short selection sketch below.)
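Choosing the root is then just an argmax over the gains. A tiny sketch (the gain values are taken from the slides; the dictionary is my own encoding):

```python
# Information gain (in bits) for each candidate root attribute.
gains = {"Outlook": 0.247, "Temperature": 0.029, "Humidity": 0.152, "Windy": 0.048}
root = max(gains, key=gains.get)
print(root)  # Outlook
```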
Continuing to Split
Now determine the gain for EACH of Outlook's branches: sunny, overcast, and rainy.
For the sunny branch, we know the entropy at this point is 0.971 bits; it is our "before" split information as we calculate the gain from here.
The rainy branch entropy is also 0.971 bits; use it as our "before" split information as we calculate the gain from there on down.
Splitting stops when we can't split any further; that is already the case for the value overcast. We don't need to consider Outlook further.
Continuing the Split at Sunny
Now we must determine the gain for each of Outlook's branches. For the sunny branch we know the "before" split entropy is 0.971 bits.
Find Subroot for Sunny

humidity = high:   info([0,3]) = entropy(0, 1) = 0 bits
humidity = normal: info([2,0]) = entropy(1, 0) = 0 bits

info([0,3],[2,0]) = 3/5 x 0 + 2/5 x 0 = 0 bits
gain(Humidity) = info([2,3]) - info([0,3],[2,0]) = 0.971 - 0 = 0.971 bits
Find Subroot for Sunny (continued)

windy = false: info([1,2]) = entropy(1/3, 2/3) = -1/3 log2(1/3) - 2/3 log2(2/3) = 0.918 bits
windy = true:  info([1,1]) = entropy(1/2, 1/2) = -1/2 log2(1/2) - 1/2 log2(1/2) = 1 bit

info([1,2],[1,1]) = 3/5 x 0.918 + 2/5 x 1 = 0.951 bits
gain(Windy) = info([2,3]) - info([1,2],[1,1]) = 0.971 - 0.951 = 0.020 bits
Find Subroot for Sunny (continued)

temperature = hot:  info([0,2]) = entropy(0, 1) = -0 log2(0) - 1 log2(1) = 0 bits
temperature = mild: info([1,1]) = entropy(1/2, 1/2) = -1/2 log2(1/2) - 1/2 log2(1/2) = 1 bit
temperature = cool: info([1,0]) = entropy(1, 0) = -1 log2(1) - 0 log2(0) = 0 bits
(taking 0 log2(0) = 0, by convention)

info([0,2],[1,1],[1,0]) = 2/5 x 0 + 2/5 x 1 + 1/5 x 0 = 0 + 0.4 + 0 = 0.4 bits
gain(Temperature) = info([2,3]) - info([0,2],[1,1],[1,0]) = 0.971 - 0.4 = 0.571 bits
Finish the Split at Sunny

gain(Humidity)    = 0.971 bits
gain(Temperature) = 0.571 bits
gain(Windy)       = 0.020 bits

Humidity has the largest gain, so it becomes the split below sunny, and both of its child nodes are pure. (A code sketch for the sunny branch follows below.)
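The same gain computation, restricted to the five sunny instances, reproduces these numbers. A minimal sketch (the branch count lists come from the slides above; the helper functions are my own, as in the earlier sketch):

```python
from math import log2

def entropy_bits(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(before, branches):
    n = sum(sum(b) for b in branches)
    after = sum(sum(b) / n * entropy_bits(b) for b in branches)
    return entropy_bits(before) - after

sunny = [2, 3]  # "before" distribution on the sunny branch: info([2,3]) = 0.971 bits
print(info_gain(sunny, [[0, 3], [2, 0]]))          # Humidity:    about 0.971 bits
print(info_gain(sunny, [[0, 2], [1, 1], [1, 0]]))  # Temperature: about 0.571 bits
print(info_gain(sunny, [[1, 2], [1, 1]]))          # Windy:       about 0.020 bits
```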
Possible Splits at Rainy
No need to actually do the calculations: splitting on windy already yields pure nodes (windy = false is all yes, windy = true is all no), so windy is chosen for the rainy branch.
Final Decision Tree
Note: not all leaves need to be pure; sometimes identical instances have different classes.
Splitting stops when the data can't be split any further.
Highly-branching Attributes
Problematic: attributes with a large number of values (extreme case: an ID code).
- Subsets are more likely to be pure if there is a large number of values.
- Information gain is biased toward choosing attributes with a large number of values.
- This may result in overfitting (selection of an attribute that is non-optimal for prediction).
- Another problem: fragmentation (the data is broken into many very small subsets).
Tree Stump for the ID code Attribute
This seems like a bad idea for a split, yet:
Entropy of the split:
  info(ID code) = info([0,1]) + info([0,1]) + info([1,0]) + ... + info([1,0]) + info([0,1]) = 0 bits
(every branch contains a single instance, so each term is 0)
Information gain is maximal for ID code (namely 0.940 bits, i.e. the before-split information). (See the sketch below.)
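A tiny sketch of why the ID code looks so attractive to information gain: every branch holds exactly one instance, so every branch entropy is zero and the gain is the full before-split information (the branch lists are my own encoding of the stump, with 9 "yes" and 5 "no" singletons):

```python
from math import log2

def entropy_bits(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# One branch per instance: nine [yes] leaves and five [no] leaves.
id_branches = [[1, 0]] * 9 + [[0, 1]] * 5
after = sum(sum(b) / 14 * entropy_bits(b) for b in id_branches)
print(after)                         # 0.0 bits
print(entropy_bits([9, 5]) - after)  # about 0.940 bits, the maximum possible gain
```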
Gain Ratio
- Gain ratio: a modification of the information gain that reduces its bias.
- Gain ratio takes the number and size of branches into account when choosing an attribute.
- It corrects the information gain by taking the intrinsic information of a split into account.
- Intrinsic information: the entropy of the distribution of instances into branches (i.e. how much information do we need to tell which branch an instance belongs to?).
Computing the Gain Ratio
Example: intrinsic information for ID code:
  info([1,1,...,1]) = 14 x (-1/14 x log2(1/14)) = 3.807 bits
The value of an attribute decreases as its intrinsic information gets larger.
Definition of gain ratio:
  gain_ratio(attribute) = gain(attribute) / intrinsic_info(attribute)
Example:
  gain_ratio(ID code) = 0.940 bits / 3.807 bits = 0.246
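A minimal sketch of the gain-ratio calculation for the ID code attribute (the helper name `entropy_bits` is my own; the numbers follow the slide):

```python
from math import log2

def entropy_bits(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

intrinsic = entropy_bits([1] * 14)  # intrinsic info of the split: log2(14), about 3.807 bits
gain = entropy_bits([9, 5])         # ID code's gain is the full before-split info, about 0.940 bits
print(gain / intrinsic)             # about 0.247 (the slide rounds to 0.246)
```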