
1 CS4445 Data Mining, B term 2014, WPI. Solutions HW4: Classification Rules using RIPPER. By Chiying Wang

2 Car Dataset

The car dataset contains 22 instances, four predictive attributes (buying, maint, persons, safety) and one class attribute.

Instance  Buying  Maint  Persons  Safety  Class
1         med     vhigh  more     low     unacc
2         med     vhigh  2        med     unacc
3         vhigh   vhigh  more     med     unacc
4         med     high   4        low     unacc
5         high    med    4        high    good
6         low     med    2        med     unacc
7         low     high   2        high    unacc
8         low     vhigh  more     med     acc
9         med     vhigh  4        med     acc
10        med     vhigh  4        med     acc
11        vhigh   vhigh  4        med     unacc
12        med     med    more     med     acc
13        med     vhigh  2        med     unacc
14        med     med    4        low     unacc
15        med     vhigh  more     low     unacc
16        med     low    4        med     acc
17        high    low    2        high    unacc
18        high    med    4        low     unacc
19        med     low    4        low     unacc
20        high    high   4        low     unacc
21        low     med    4        high    good
22        low     low    2        high    unacc

3 Ripper 1st Rule: Selecting a consequent

We will use the Ripper algorithm to construct the first rule for the car dataset. A rule has the form antecedent -> consequent. In Ripper, we construct rules for the least frequent class first. The following table shows the frequencies of the class values in the dataset.

Class          Frequency
class = good   2
class = acc    5
class = unacc  15

'class = good' has the lowest frequency, so we choose it as the consequent of the first rule. Thus, we start from the empty rule '-> class = good'.
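This class-ordering step is easy to check in code. Below is a minimal Python sketch (assumed helper code, not from the slides) that tallies the class column and picks the least frequent value:

```python
from collections import Counter

# Class column of the 22-instance car dataset from slide 2.
classes = ["unacc"] * 15 + ["acc"] * 5 + ["good"] * 2

freq = Counter(classes)                 # {'unacc': 15, 'acc': 5, 'good': 2}
consequent = min(freq, key=freq.get)    # least frequent class value
print(consequent)                       # 'good' -> consequent of the first rule
```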

4 Ripper 1st Rule: 1st Candidate Conditions

Next we attempt to find the first condition for the antecedent of the rule '-> class = good'. In the dataset, there are two instances covered by the rule:

Buying  Maint  Persons  Safety  Class
high    med    4        high    good
low     med    4        high    good

We only need to look at possible conditions appearing in these two instances where class = good. All possible conditions are listed below:

Rule: -> class = good
Candidate conditions: Buying = high, Maint = med, Persons = 4, Safety = high, Buying = low

5 Ripper 1st Rule: Comparing candidate conditions

Next, we determine the information gain of each possible condition for the rule, and we will add one of them as the first condition of the antecedent. The instances used in this construction are the full 22-instance dataset shown on slide 2.

6 Ripper 1st Rule: Calculating info gain

To calculate FOIL's information gain, we start by calculating p0 and n0 for the rule '-> class = good' before adding a new condition. p0 is the number of instances with class = good; the corresponding instances are:

Instance  Buying  Maint  Persons  Safety  Class
5         high    med    4        high    good
21        low     med    4        high    good

n0 is the number of instances with class ≠ good; these are the remaining 20 instances of the dataset (all instances except 5 and 21). Hence p0 = 2 and n0 = 20.
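For reference, the measure used on the following slides is FOIL's information gain:

Gain = p1 * (log2(p1/(p1+n1)) - log2(p0/(p0+n0)))

where (p0, n0) are the numbers of positive and negative instances covered by the rule before adding the candidate condition, and (p1, n1) are the corresponding numbers after adding it. The leading factor p1 favors conditions that keep many positive instances covered.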

7 1st Rule: Calculating info gain for candidate 1

Consider the first candidate condition, "buying = high". We need to calculate the information gain of adding this condition to the empty rule, obtaining: buying = high -> class = good. Given:

p0 is the number of instances such that class = good (instances on slide 6)
n0 is the number of instances such that class ≠ good (instances on slide 6)
p1 is the number of instances such that buying = high and class = good:

Instance  Buying  Maint  Persons  Safety  Class
5         high    med    4        high    good

n1 is the number of instances such that buying = high and class ≠ good:

Instance  Buying  Maint  Persons  Safety  Class
17        high    low    2        high    unacc
18        high    med    4        low     unacc
20        high    high   4        low     unacc

Measure  Value
p0       2
n0       20
p1       1
n1       3
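Plugging these counts into FOIL's gain: 1 * (log2(1/(1+3)) - log2(2/(2+20))) = 1 * (-2.000 - (-3.459)) = 1.459. The same computation is repeated for the remaining candidates and the results are collected on slide 12.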

8 1st Rule: Calculating info gain for candidate 2

For "maint = med": maint = med -> class = good. Given:

p0 and n0 are as on slide 6 (p0 = 2, n0 = 20).
p1 is the number of instances such that maint = med and class = good:

Instance  Buying  Maint  Persons  Safety  Class
5         high    med    4        high    good
21        low     med    4        high    good

n1 is the number of instances such that maint = med and class ≠ good:

Instance  Buying  Maint  Persons  Safety  Class
6         low     med    2        med     unacc
12        med     med    more     med     acc
14        med     med    4        low     unacc
18        high    med    4        low     unacc

Measure  Value
p0       2
n0       20
p1       2
n1       4

9 1st Rule: Calculating info gain for candidate 3

For "persons = 4": persons = 4 -> class = good. Given:

p0 and n0 are as on slide 6 (p0 = 2, n0 = 20).
p1 is the number of instances such that persons = 4 and class = good:

Instance  Buying  Maint  Persons  Safety  Class
5         high    med    4        high    good
21        low     med    4        high    good

n1 is the number of instances such that persons = 4 and class ≠ good:

Instance  Buying  Maint  Persons  Safety  Class
4         med     high   4        low     unacc
9         med     vhigh  4        med     acc
10        med     vhigh  4        med     acc
11        vhigh   vhigh  4        med     unacc
14        med     med    4        low     unacc
16        med     low    4        med     acc
18        high    med    4        low     unacc
19        med     low    4        low     unacc
20        high    high   4        low     unacc

Measure  Value
p0       2
n0       20
p1       2
n1       9

10 1st Rule: Calculating info gain for candidate 4

For "safety = high": safety = high -> class = good. Given:

p0 and n0 are as on slide 6 (p0 = 2, n0 = 20).
p1 is the number of instances such that safety = high and class = good:

Instance  Buying  Maint  Persons  Safety  Class
5         high    med    4        high    good
21        low     med    4        high    good

n1 is the number of instances such that safety = high and class ≠ good:

Instance  Buying  Maint  Persons  Safety  Class
7         low     high   2        high    unacc
17        high    low    2        high    unacc
22        low     low    2        high    unacc

Measure  Value
p0       2
n0       20
p1       2
n1       3

11 1st Rule: Calculating info gain for candidate 5

For "buying = low": buying = low -> class = good. Given:

p0 and n0 are as on slide 6 (p0 = 2, n0 = 20).
p1 is the number of instances such that buying = low and class = good:

Instance  Buying  Maint  Persons  Safety  Class
21        low     med    4        high    good

n1 is the number of instances such that buying = low and class ≠ good:

Instance  Buying  Maint  Persons  Safety  Class
6         low     med    2        med     unacc
7         low     high   2        high    unacc
8         low     vhigh  more     med     acc
22        low     low    2        high    unacc

Measure  Value
p0       2
n0       20
p1       1
n1       4

12 1st Rule: Choosing the 1st condition

Now we have the information gain of each possible condition for the antecedent of the rule, as shown in the following table. (Note that the leading factor in each expression is p1, which is 1 or 2 depending on the condition.)

Possible Conditions  Information Gain
Buying = high        1*(log2(1/(1+3)) - log2(2/(2+20))) = 1.459
Buying = low         1*(log2(1/(1+4)) - log2(2/(2+20))) = 1.138
Maint = med          2*(log2(2/(2+4)) - log2(2/(2+20))) = 3.749
Persons = 4          2*(log2(2/(2+9)) - log2(2/(2+20))) = 2.000
Safety = high        2*(log2(2/(2+3)) - log2(2/(2+20))) = 4.275

Then, we select the condition "safety = high", which has the highest information gain, as the first condition of the antecedent. Since its gain is > 0, we add this condition to the rule:

Safety = high -> class = good
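The table above can be reproduced with a few lines of code. Below is a minimal Python sketch (assumed helper code, not part of the homework) that evaluates all five candidates:

```python
import math

def foil_gain(p0, n0, p1, n1):
    """FOIL's information gain: positive coverage after the refinement
    times the change in the rule's log-precision."""
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# (p1, n1) counts for each first-round candidate; p0 = 2, n0 = 20.
candidates = {
    "buying = high": (1, 3),
    "buying = low":  (1, 4),
    "maint = med":   (2, 4),
    "persons = 4":   (2, 9),
    "safety = high": (2, 3),
}
for cond, (p1, n1) in candidates.items():
    print(f"{cond:14s} gain = {foil_gain(2, 20, p1, n1):.3f}")
# safety = high scores highest (4.275) and becomes the first condition.
```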

13 1st Rule: Checking termination criteria

Then, we need to determine whether the construction should stop. Since the current rule still covers negative examples (shown in the table below), the construction continues to refine the rule.

Buying  Maint  Persons  Safety  Class
low     high   2        high    unacc
high    low    2        high    unacc
low     low    2        high    unacc

14 Ripper 1st Rule: 2nd Candidate Conditions

Next we attempt to find the second condition of the antecedent. We only need to look at possible conditions appearing in the two instances where safety = high and class = good, shown in the following table.

Buying  Maint  Persons  Safety  Class
high    med    4        high    good
low     med    4        high    good

The list of possible conditions is given below.

Rule: safety = high and ... -> class = good
Candidate conditions: Buying = high, Maint = med, Persons = 4, Buying = low

15 Ripper 1st Rule: Comparing 2nd candidate conditions

Next, we determine the information gain of each possible condition, and we will add one of them as the second condition of the antecedent. Only the five instances covered by the current rule (safety = high) are used in this step; they are renumbered 1-5 in the following table.

Instance  Buying  Maint  Persons  Safety  Class
1         high    med    4        high    good
2         low     high   2        high    unacc
3         high    low    2        high    unacc
4         low     med    4        high    good
5         low     low    2        high    unacc

p1 is the number of instances such that safety = high and class = good; from the table above:

Instance  Buying  Maint  Persons  Safety  Class
1         high    med    4        high    good
4         low     med    4        high    good

n1 is the number of instances such that safety = high and class ≠ good; from the table above:

Instance  Buying  Maint  Persons  Safety  Class
2         low     high   2        high    unacc
3         high    low    2        high    unacc
5         low     low    2        high    unacc

16 1st Rule: Calculating info gain for 2nd candidate 1

For "buying = high": safety = high and buying = high -> class = good. Given:

p1 is the number of instances such that safety = high and class = good (instances on slide 15)
n1 is the number of instances such that safety = high and class ≠ good (instances on slide 15)
p2 is the number of instances such that safety = high and buying = high and class = good:

Instance  Buying  Maint  Persons  Safety  Class
1         high    med    4        high    good

n2 is the number of instances such that safety = high and buying = high and class ≠ good:

Instance  Buying  Maint  Persons  Safety  Class
3         high    low    2        high    unacc

Measure  Value
p1       2
n1       3
p2       1
n2       1
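Plugging these counts into FOIL's gain, with (p1, n1) now playing the role of the before-refinement counts: 1 * (log2(1/(1+1)) - log2(2/(2+3))) = 1 * (-1.000 - (-1.322)) = 0.322; see the summary table on slide 20.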

17 1st Rule: Calculating info gain for 2nd candidate 2

For "maint = med": safety = high and maint = med -> class = good. Given:

p1 and n1 are as on slide 15 (p1 = 2, n1 = 3).
p2 is the number of instances such that safety = high and maint = med and class = good:

Instance  Buying  Maint  Persons  Safety  Class
1         high    med    4        high    good
4         low     med    4        high    good

n2 is the number of instances such that safety = high and maint = med and class ≠ good: there are no such instances.

Measure  Value
p1       2
n1       3
p2       2
n2       0

18 1st Rule: Calculating info gain for 2nd candidate 3

For "persons = 4": safety = high and persons = 4 -> class = good. Given:

p1 and n1 are as on slide 15 (p1 = 2, n1 = 3).
p2 is the number of instances such that safety = high and persons = 4 and class = good:

Instance  Buying  Maint  Persons  Safety  Class
1         high    med    4        high    good
4         low     med    4        high    good

n2 is the number of instances such that safety = high and persons = 4 and class ≠ good: there are no such instances.

Measure  Value
p1       2
n1       3
p2       2
n2       0

19 1st Rule: Calculating info gain for 2nd candidate 4

For "buying = low": safety = high and buying = low -> class = good. Given:

p1 and n1 are as on slide 15 (p1 = 2, n1 = 3).
p2 is the number of instances such that safety = high and buying = low and class = good:

Instance  Buying  Maint  Persons  Safety  Class
4         low     med    4        high    good

n2 is the number of instances such that safety = high and buying = low and class ≠ good:

Instance  Buying  Maint  Persons  Safety  Class
2         low     high   2        high    unacc
5         low     low    2        high    unacc

Measure  Value
p1       2
n1       3
p2       1
n2       2

20 1st Rule: Choosing the 2nd condition

Now we have the information gain of each possible second condition, as shown in the following table.

Possible Conditions  Information Gain
Buying = high        1*(log2(1/(1+1)) - log2(2/(2+3))) = 0.322
Buying = low         1*(log2(1/(1+2)) - log2(2/(2+3))) = -0.263
Maint = med          2*(log2(2/(2+0)) - log2(2/(2+3))) = 2.644
Persons = 4          2*(log2(2/(2+0)) - log2(2/(2+3))) = 2.644

There is a tie for the highest info gain; we can pick either of the two conditions that attain the maximum. Let's select the first of the two, "Maint = med". Since its gain is > 0, we add this condition to the rule:

Safety = high and Maint = med -> class = good

21 1st Rule: Checking termination criteria

Then, we need to determine whether the construction should stop. The current rule no longer covers any negative examples, only the two positive examples below:

Buying  Maint  Persons  Safety  Class
high    med    4        high    good
low     med    4        high    good

Therefore, there is no need to add more conditions to the rule. RIPPER's construction of the first rule is now complete!
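As a quick sanity check, the following small Python sketch (assumed code, not from the slides) confirms that among the safety = high instances the finished rule covers only positives; instances without safety = high cannot be covered at all:

```python
# (buying, maint, persons, safety, class) for the five safety = high instances.
rows = [
    ("high", "med",  "4", "high", "good"),   # instance 5
    ("low",  "high", "2", "high", "unacc"),  # instance 7
    ("high", "low",  "2", "high", "unacc"),  # instance 17
    ("low",  "med",  "4", "high", "good"),   # instance 21
    ("low",  "low",  "2", "high", "unacc"),  # instance 22
]
covered = [r for r in rows if r[3] == "high" and r[1] == "med"]
assert all(r[4] == "good" for r in covered)  # the rule covers no negatives
print(covered)  # the two positive instances, 5 and 21
```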

22 Ripper: Pruning the First Rule

The first rule: Safety = high and Maint = med -> class = good

To prune the above rule, the RIPPER algorithm will:

Prepare a validation set, kept separate from the training dataset.
Apply the following metric to evaluate each candidate pruned rule over the validation set:

v = (p - n) / (p + n)

where
p: the number of positive examples in the validation set covered by the rule.
n: the number of negative examples in the validation set covered by the rule.

Pruning method: First we consider pruning the last condition of the rule, "Maint = med". If the v value of the rule

Safety = high -> class = good

is no lower than the v value of the rule

Safety = high and Maint = med -> class = good

then: (1) remove the last condition "Maint = med" from the rule, and (2) repeat this pruning method recursively with Safety = high -> class = good. Otherwise, stop the pruning procedure (that is, do not consider removing any other conditions of the rule).
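The pruning loop above can be sketched in a few lines of Python. This is a minimal illustration, assuming the metric v = (p - n)/(p + n) and that every evaluated rule covers at least one validation example; the coverage helper, which returns the (p, n) counts of a condition list on the validation set, is a hypothetical stand-in:

```python
def v_metric(p, n):
    """RIPPER's rule-value metric on the validation set."""
    return (p - n) / (p + n)

def prune_rule(conditions, coverage):
    """Repeatedly drop the last condition while that does not lower v.
    `conditions` is a list such as ["safety = high", "maint = med"];
    `coverage(conds)` returns (p, n) counts on the validation set."""
    while len(conditions) > 1:  # this sketch keeps at least one condition
        if v_metric(*coverage(conditions[:-1])) >= v_metric(*coverage(conditions)):
            conditions = conditions[:-1]   # prune, then retry on the shorter rule
        else:
            break                          # pruning lowers v: stop
    return conditions

# Hypothetical validation counts, purely for illustration:
counts = {
    ("safety = high", "maint = med"): (2, 0),   # v = 1.0
    ("safety = high",):               (2, 1),   # v = 1/3
}
rule = prune_rule(["safety = high", "maint = med"],
                  lambda conds: counts[tuple(conds)])
print(rule)  # pruning would lower v here, so the rule is kept intact
```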

