1 Data Mining CSCI 307, Spring 2019 Lecture 15
Constructing Trees

2 Wishlist for a Purity Measure
Properties we require from a purity measure:
When a node is pure, the measure should be zero.
When impurity is maximal (i.e. all classes equally likely), the measure should be maximal.
The measure should obey the multistage property (i.e. decisions can be made in several stages):
measure([2,3,4]) = measure([2,7]) + (7/9) x measure([3,4])
This decision is made in two stages: make the first decision, then decide on the second case.
Entropy is the only function that satisfies all three properties.
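A minimal Python sketch, not from the slides, of this entropy measure over class counts, checking the multistage property on the [2,3,4] example (the helper name info is an assumption):

from math import log2

def info(counts):
    # Entropy of a list of class counts, in bits, e.g. info([2, 3]).
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Multistage property: measure([2,3,4]) = measure([2,7]) + (7/9) x measure([3,4])
print(info([2, 3, 4]))                    # 1.5305 bits
print(info([2, 7]) + 7/9 * info([3, 4]))  # 1.5305 bits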

3 Example: attribute Outlook
Count the yes/no instances that reach each branch:
Outlook = Sunny : 2 yes, 3 no
Outlook = Overcast : 4 yes, 0 no
Outlook = Rainy : 3 yes, 2 no

4 Example: attribute Outlook
Outlook = Sunny : info([2,3]) = 0.971 bits
Outlook = Overcast : info([4,0]) = 0 bits
Outlook = Rainy : info([3,2]) = 0.971 bits
Expected information for the attribute: the average information value, weighted by the number of instances that go down each branch.
info([2,3],[4,0],[3,2]) = 5/14 x 0.971 + 4/14 x 0 + 5/14 x 0.971 = 0.693 bits
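A short sketch that reproduces these numbers, reusing info() from the earlier sketch (the helper name expected_info is an assumption):

def expected_info(branches):
    # Weighted average of branch entropies, weighted by branch size.
    # Reuses info() from the earlier sketch.
    total = sum(sum(b) for b in branches)
    return sum(sum(b) / total * info(b) for b in branches)

print(info([2, 3]))                              # 0.971 bits (Sunny)
print(info([4, 0]))                              # 0 bits     (Overcast)
print(info([3, 2]))                              # 0.971 bits (Rainy)
print(expected_info([[2, 3], [4, 0], [3, 2]]))   # 0.6935... (0.693 on the slide)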

5 Computing Information Gain
gain = information before splitting – information after splitting
We've calculated the information BEFORE splitting, info([9,5]) = 0.940 bits, and the information AFTER the split for the Outlook attribute, 0.693 bits. So we can calculate the information gain for Outlook:
gain(Outlook) = info([9,5]) – info([2,3],[4,0],[3,2]) = 0.940 – 0.693 = 0.247 bits
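Continuing the sketch (the helper name gain is an assumption), the same calculation in code:

def gain(before_counts, branches):
    # Information gain = entropy before the split minus the
    # weighted-average entropy after the split.
    return info(before_counts) - expected_info(branches)

print(gain([9, 5], [[2, 3], [4, 0], [3, 2]]))   # 0.247 bits for Outlook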

6 attribute: Temperature
Temperature = Hot : 2 yes, 2 no
Temperature = Mild : 4 yes, 2 no
Temperature = Cool : 3 yes, 1 no
info([2,2]) = entropy(2/4, 2/4) = −2/4 log2(2/4) − 2/4 log2(2/4) = 1 bit
info([4,2]) = entropy(4/6, 2/6) = −2/3 log2(2/3) − 1/3 log2(1/3) = 0.918 bits
info([3,1]) = entropy(3/4, 1/4) = −3/4 log2(3/4) − 1/4 log2(1/4) = 0.811 bits

7 attribute: Temperature
Temperature = Hot : info([2,2]) = 1 bit
Temperature = Mild : info([4,2]) = 0.918 bits
Temperature = Cool : info([3,1]) = 0.811 bits
Expected information for the attribute: the average information value, weighted by the number of instances that go down each branch.
info([2,2],[4,2],[3,1]) = 4/14 x 1 + 6/14 x 0.918 + 4/14 x 0.811 = 0.911 bits
gain(Temperature) = info([9,5]) – info([2,2],[4,2],[3,1]) = 0.940 – 0.911 = 0.029 bits

8 attribute: Humidity
Humidity = High : 3 yes, 4 no
Humidity = Normal : 6 yes, 1 no
info([3,4]) = entropy(3/7, 4/7) = −3/7 log2(3/7) − 4/7 log2(4/7) = 0.985 bits
info([6,1]) = entropy(6/7, 1/7) = −6/7 log2(6/7) − 1/7 log2(1/7) = 0.592 bits

9 attribute: Humidity
Humidity = High : info([3,4]) = 0.985 bits
Humidity = Normal : info([6,1]) = 0.592 bits
Expected information for the attribute: the average information value, weighted by the number of instances that go down each branch.
info([3,4],[6,1]) = 7/14 x 0.985 + 7/14 x 0.592 = 0.788 bits
gain(Humidity) = info([9,5]) – info([3,4],[6,1]) = 0.940 – 0.788 = 0.152 bits

10 attribute: Windy
Windy = False : 6 yes, 2 no
Windy = True : 3 yes, 3 no
info([6,2]) = entropy(6/8, 2/8) = −6/8 log2(6/8) − 2/8 log2(2/8) = 0.811 bits
info([3,3]) = entropy(3/6, 3/6) = −3/6 log2(3/6) − 3/6 log2(3/6) = 1 bit

11 attribute: Windy
Windy = False : info([6,2]) = 0.811 bits
Windy = True : info([3,3]) = 1 bit
Expected information for the attribute: the average information value, weighted by the number of instances that go down each branch.
info([6,2],[3,3]) = 8/14 x 0.811 + 6/14 x 1 = 0.892 bits
gain(Windy) = info([9,5]) – info([6,2],[3,3]) = 0.940 – 0.892 = 0.048 bits

12 Which Attribute to Select as Root?
For all the attributes from the weather data:
gain(Outlook) = 0.247 bits
gain(Temperature) = 0.029 bits
gain(Humidity) = 0.152 bits
gain(Windy) = 0.048 bits
Outlook has the largest gain, so it is the way to go ... it's the root.
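Pulling the pieces together, a hedged sketch that picks the root, reusing gain from the earlier sketches (the branch-count dictionary is just the tallies shown on the slides):

# Class counts [yes, no] per value of each attribute in the weather data.
branch_counts = {
    "Outlook":     [[2, 3], [4, 0], [3, 2]],   # Sunny, Overcast, Rainy
    "Temperature": [[2, 2], [4, 2], [3, 1]],   # Hot, Mild, Cool
    "Humidity":    [[3, 4], [6, 1]],           # High, Normal
    "Windy":       [[6, 2], [3, 3]],           # False, True
}

gains = {a: gain([9, 5], b) for a, b in branch_counts.items()}
print(gains)                      # Outlook 0.247, Temperature 0.029, Humidity 0.152, Windy 0.048
print(max(gains, key=gains.get))  # 'Outlook' becomes the root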

13 Continuing to Split
Now, determine the gain for EACH of Outlook's branches: sunny, overcast, and rainy.
For the sunny branch we know at this point the entropy is 0.971; it is our "before" split information as we calculate the gain from here.
The rainy branch entropy is also 0.971; use it as our "before" split information as we calculate the gain from there on down.
Splitting stops when we can't split any further; that is the case with the value overcast. We don't need to consider Outlook further.

14 Continuing the Split at Sunny
Now we must determine the gain for EACH of Outlook's branches. For the sunny branch we know the "before" split entropy is 0.971 bits.

15 Find Subroot for Sunny
humidity = high: info([0,3]) = entropy(0,1) = 0 bits
humidity = normal: info([2,0]) = entropy(1,0) = 0 bits
info([0,3],[2,0]) = 0 bits
gain(Humidity) = info([2,3]) – info([0,3],[2,0]) = 0.971 – 0 = 0.971 bits

16 Find Subroot for Sunny (continued)
windy = false: info([1,2]) = entropy(1/3, 2/3) = −1/3 log2(1/3) − 2/3 log2(2/3) = 0.918 bits
windy = true: info([1,1]) = entropy(1/2, 1/2) = −1/2 log2(1/2) − 1/2 log2(1/2) = 1 bit
info([1,2],[1,1]) = 3/5 x 0.918 + 2/5 x 1 = 0.951 bits
gain(Windy) = info([2,3]) – info([1,2],[1,1]) = 0.971 – 0.951 = 0.020 bits

17 Find Subroot for Sunny (continued)
temperature = hot: info([0,2]) = entropy(0,1) = 0 bits
temperature = mild: info([1,1]) = entropy(1/2, 1/2) = −1/2 log2(1/2) − 1/2 log2(1/2) = 1 bit
temperature = cool: info([1,0]) = entropy(1,0) = 0 bits
info([0,2],[1,1],[1,0]) = 2/5 x 0 + 2/5 x 1 + 1/5 x 0 = 0.4 bits
gain(Temperature) = info([2,3]) – info([0,2],[1,1],[1,0]) = 0.971 – 0.4 = 0.571 bits

18 Finish the Split at Sunny
gain(Humidity) = 0.971 bits
gain(Temperature) = 0.571 bits
gain(Windy) = 0.020 bits
Humidity has the largest gain, so it becomes the node below the sunny branch.
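The same helpers reproduce this sub-split, now measured against the sunny branch's own entropy, info([2,3]) = 0.971 bits, rather than the root's 0.940 bits (a sketch reusing gain from the earlier sketches; the counts are the sunny-subset tallies from these slides):

# Class counts [yes, no] per attribute value, restricted to Outlook = Sunny.
sunny_counts = {
    "Humidity":    [[0, 3], [2, 0]],           # High, Normal
    "Temperature": [[0, 2], [1, 1], [1, 0]],   # Hot, Mild, Cool
    "Windy":       [[1, 2], [1, 1]],           # False, True
}
for attr, b in sunny_counts.items():
    print(attr, round(gain([2, 3], b), 3))     # Humidity 0.971, Temperature 0.571, Windy 0.02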

19 Possible Splits at Rainy
No need to actually do the calculations, because splitting the rainy branch on windy gives pure subsets: windy = false is all yes, windy = true is all no. Windy becomes the node below the rainy branch.

20 Final Decision Tree
Note: not all leaves need to be pure; sometimes identical instances have different classes. Splitting stops when the data can't be split any further.
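The procedure the last few slides walked through is a simple recursion. Here is a hedged sketch of that loop, an assumed ID3-style simplification building on info and expected_info from the earlier sketches (build_tree is an assumed name, the class labels "yes"/"no" are hardcoded, and data is a list of (attribute-value dict, class) pairs):

def build_tree(data, attributes):
    # Stop when the node is pure or no attributes remain; predict the majority class.
    classes = [c for _, c in data]
    if len(set(classes)) == 1 or not attributes:
        return max(set(classes), key=classes.count)

    def split_counts(attr):
        # Per attribute value, count [yes, no] among the instances at this node.
        values = sorted({row[attr] for row, _ in data})
        return [[sum(1 for r, c in data if r[attr] == v and c == cls)
                 for cls in ("yes", "no")] for v in values], values

    # Pick the attribute with the largest information gain at this node.
    best = max(attributes,
               key=lambda a: info([classes.count("yes"), classes.count("no")])
                             - expected_info(split_counts(a)[0]))
    counts, values = split_counts(best)
    rest = [a for a in attributes if a != best]
    return {best: {v: build_tree([(r, c) for r, c in data if r[best] == v], rest)
                   for v in values}}

Run on the 14-instance weather table, this should reproduce the tree on this slide: Outlook at the root, Humidity under sunny, Windy under rainy.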

21 Highly-branching Attributes
Problematic: attributes with a large number of values (extreme case: ID code).
Subsets are more likely to be pure if there is a large number of values.
Information gain is biased towards choosing attributes with a large number of values.
This may result in overfitting (selection of an attribute that is non-optimal for prediction).
Another problem: fragmentation.

22 Tree Stump for ID code Attribute
This seems like a bad idea for a split.
Entropy of the split: info(ID code) = info([0,1]) + info([0,1]) + info([1,0]) + ... + info([0,1]) = 0 bits (every branch holds a single instance, so every branch is pure).
Information gain is maximal for ID code (namely 0.940 bits, i.e. the before-split information).

23 Gain Ratio
Gain ratio: a modification of the information gain that reduces its bias.
Gain ratio takes the number and size of branches into account when choosing an attribute.
It corrects the information gain by taking the intrinsic information of a split into account.
Intrinsic information: the entropy of the distribution of instances into branches (i.e. how much information we need to tell which branch an instance belongs to).

24 Computing the Gain Ratio
Example: intrinsic information for ID code:
info([1,1,...,1]) = 14 x (−1/14 x log2(1/14)) = 3.807 bits
The value of an attribute decreases as its intrinsic information gets larger.
Definition of gain ratio:
gain_ratio(attribute) = gain(attribute) / intrinsic_info(attribute)
Example:
gain_ratio(ID code) = 0.940 bits / 3.807 bits = 0.246
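A final sketch that reproduces these numbers, reusing info and gain from the earlier sketches (gain_ratio and the ID-code branch list are assumptions):

# ID code splits the 14 instances into 14 one-instance branches:
# 9 branches hold a lone "yes" and 5 a lone "no", so every branch is pure.
id_branches = [[1, 0]] * 9 + [[0, 1]] * 5

def gain_ratio(before_counts, branches):
    # Intrinsic information: entropy of the branch sizes themselves.
    intrinsic = info([sum(b) for b in branches])
    return gain(before_counts, branches) / intrinsic

print(info([1] * 14))                    # 3.807 bits of intrinsic information
print(gain([9, 5], id_branches))         # 0.940 bits
print(gain_ratio([9, 5], id_branches))   # 0.2469..., which the slide reports as 0.246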

