Data Mining – Algorithms: Prism – Learning Rules via Separating and Covering Chapter 4, Section 4.4.

1 Data Mining – Algorithms: Prism – Learning Rules via Separating and Covering Chapter 4, Section 4.4

2 Rules
Rules can be directly read off a decision tree, but those might not be the most compact or effective rules
Common approach: take each class in turn and find a way of “covering” all instances in it, while excluding instances not in the class

3 Let’s use My Weather Data Again
Again, let’s make this a little more realistic than the book does
Divide into training and test data
Let’s save the last record as a test (using my weather, nominal) …
… and assuming we’re working on the play?=yes class first …
We’re looking for a rule in the form: if ___ Then play? = yes
Possible ways of filling include:
–Outlook = sunny
–Outlook = overcast
–…
–Temperature = hot
–…

4 Find the best filler using training data
We look at proportion of instances that match the left hand side that also match the right hand side
LHS                  Matches LHS   Of those, Match RHS   Ratio
Outlook = sunny      5             4                     .80
Outlook = Overcast   4             2                     .50
Outlook = Rainy      4             0                     .00
Temp = Hot           4             1                     .25
Temp = Mild          5             3                     .60
Temp = Cool          4             2                     .50
Humid = High         6             3                     .50
Humid = Normal       7             3                     .43
Windy = TRUE         5             4                     .80
Windy = False        8             2                     .25
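A minimal Python sketch of the counting behind this table, assuming the training set is the first 13 rows of the weather data (the full table is on the last slide of this deck). The names `COLS`, `RAW`, `TRAIN`, and `score_fillers` are made up for illustration; this is not WEKA's code.

```python
COLS = ("outlook", "temp", "humid", "windy", "play")
# Training data: first 13 rows of the weather table; row 14 is held out as the test.
RAW = """
sunny hot high FALSE no
sunny hot high TRUE yes
overcast hot high FALSE no
rainy mild high FALSE no
rainy cool normal FALSE no
rainy cool normal TRUE no
overcast cool normal TRUE yes
sunny mild high FALSE yes
sunny cool normal FALSE yes
rainy mild normal FALSE no
sunny mild normal TRUE yes
overcast mild high TRUE yes
overcast hot normal FALSE no
"""
TRAIN = [dict(zip(COLS, line.split())) for line in RAW.strip().splitlines()]


def score_fillers(rows, target):
    """Map (attribute, value) -> (matches LHS, of those match RHS)."""
    counts = {}
    for row in rows:
        for attr in COLS[:-1]:
            key = (attr, row[attr])
            lhs, rhs = counts.get(key, (0, 0))
            counts[key] = (lhs + 1, rhs + (row["play"] == target))
    return counts


scores = score_fillers(TRAIN, "yes")
for (attr, value), (lhs, rhs) in scores.items():
    print(f"{attr} = {value:<9}{lhs:>3}{rhs:>3}  {rhs / lhs:.2f}")
```

For play? = yes this gives 5 matches and 4 hits for Outlook = sunny, and the same for Windy = TRUE, which is the .80 tie in the table.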

5 Refining Rule
If this rule is not accurate enough for us (based on a threshold), we’re going to try to refine it by adding a clause(s)
Now, we’re looking to fill in a clause in the following: if outlook = sunny and _____ then play? = yes
We consider the accuracy of all possible ways of filling this blank …

6 Find the best filler using training data
We look at proportion of instances that match the left hand side that also match the right hand side
LHS                                Matches LHS   Of those, Match RHS   Ratio
Outlook = Sunny & Temp = Hot       2             1                     .50
Outlook = Sunny & Temp = Mild      2             2                     1.00
Outlook = Sunny & Temp = Cool      1             1                     1.00
Outlook = Sunny & Humid = High     3             2                     .67
Outlook = Sunny & Humid = Normal   2             2                     1.00
Outlook = Sunny & Windy = TRUE     2             2                     1.00
Outlook = Sunny & Windy = False    3             2                     .67
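The refinement step is the same counting, restricted to the rows the rule already matches. A hedged sketch, again assuming the first 13 weather rows as training data; `score_refinements` and the other names are illustrative only.

```python
COLS = ("outlook", "temp", "humid", "windy", "play")
RAW = """
sunny hot high FALSE no
sunny hot high TRUE yes
overcast hot high FALSE no
rainy mild high FALSE no
rainy cool normal FALSE no
rainy cool normal TRUE no
overcast cool normal TRUE yes
sunny mild high FALSE yes
sunny cool normal FALSE yes
rainy mild normal FALSE no
sunny mild normal TRUE yes
overcast mild high TRUE yes
overcast hot normal FALSE no
"""
TRAIN = [dict(zip(COLS, line.split())) for line in RAW.strip().splitlines()]


def score_refinements(rows, conditions, target):
    """Score each extra test on the rows already matched by `conditions`."""
    matched = [r for r in rows if all(r[a] == v for a, v in conditions.items())]
    counts = {}
    for row in matched:
        for attr in COLS[:-1]:
            if attr in conditions:
                continue  # an attribute is tested at most once per rule
            key = (attr, row[attr])
            lhs, rhs = counts.get(key, (0, 0))
            counts[key] = (lhs + 1, rhs + (row["play"] == target))
    return counts


refined = score_refinements(TRAIN, {"outlook": "sunny"}, "yes")
# Tests that make the rule 100% accurate on the training data
perfect = [key for key, (lhs, rhs) in refined.items() if lhs == rhs]
print(perfect)
```

Four fillers come out perfect here (Temp = Mild, Temp = Cool, Humid = Normal, Windy = TRUE), so some tie-breaking choice is needed. Values that match no rows simply do not appear in the result, where the slides show them as "---".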

7 Still more to cover though
This rule (Outlook = Sunny & Temp = Mild, the first rule listed on slide 19) only covers 2 of the 6 play=yes days
–This approach looks more for pockets of success, whereas ID3 is looking more at the big picture
So we temporarily toss those 2 instances and work on another rule
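The "toss the covered instances" step can be sketched in a few lines. This assumes the first rule is Outlook = Sunny & Temp = Mild (as listed on slide 19) and the first 13 weather rows as training data; `covers` is an illustrative helper name.

```python
COLS = ("outlook", "temp", "humid", "windy", "play")
RAW = """
sunny hot high FALSE no
sunny hot high TRUE yes
overcast hot high FALSE no
rainy mild high FALSE no
rainy cool normal FALSE no
rainy cool normal TRUE no
overcast cool normal TRUE yes
sunny mild high FALSE yes
sunny cool normal FALSE yes
rainy mild normal FALSE no
sunny mild normal TRUE yes
overcast mild high TRUE yes
overcast hot normal FALSE no
"""
TRAIN = [dict(zip(COLS, line.split())) for line in RAW.strip().splitlines()]

rule = {"outlook": "sunny", "temp": "mild"}  # first rule for play? = yes


def covers(conditions, row):
    """True when the row satisfies every test on the rule's left-hand side."""
    return all(row[a] == v for a, v in conditions.items())


# Separate: drop every row the rule covers, then keep working on what is left.
remaining = [r for r in TRAIN if not covers(rule, r)]
yes_left = sum(r["play"] == "yes" for r in remaining)
print(len(remaining), yes_left)  # 11 rows remain, 4 of them play=yes
```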

8 Example: My Weather (Nominal)
Outlook    Temp   Humid    Windy   Play?
sunny      hot    high     FALSE   no
sunny      hot    high     TRUE    yes
overcast   hot    high     FALSE   no
rainy      mild   high     FALSE   no
rainy      cool   normal   FALSE   no
rainy      cool   normal   TRUE    no
overcast   cool   normal   TRUE    yes
sunny      cool   normal   FALSE   yes
rainy      mild   normal   FALSE   no
overcast   mild   high     TRUE    yes
overcast   hot    normal   FALSE   no
rainy      mild   high     TRUE    no      TEST

9 We’re Looking for another rule …
… in the form: if ___ Then play? = yes
Again, possible ways of filling include:
–Outlook = sunny
–Outlook = overcast
–…
–Temperature = hot
–…
However, our data is a little different now

10 Find the best filler using training data
We look at proportion of instances that match the left hand side that also match the right hand side
LHS                  Matches LHS   Of those, Match RHS   Ratio
Outlook = sunny      3             2                     .67
Outlook = Overcast   4             2                     .50
Outlook = Rainy      4             0                     .00
Temp = Hot           4             1                     .25
Temp = Mild          3             1                     .33
Temp = Cool          4             2                     .50
Humid = High         5             2                     .40
Humid = Normal       6             2                     .33
Windy = TRUE         4             3                     .75
Windy = False        7             1                     .14

11 Refining Rule
If this rule is not accurate enough for us (based on a threshold), we’re going to try to refine it by adding a clause(s)
Now, we’re looking to fill in a clause in the following: if windy = TRUE and _____ then play? = yes
We consider the accuracy of all possible ways of filling this blank …

12 Find the best filler using training data
We look at proportion of instances that match the left hand side that also match the right hand side
LHS                                 Matches LHS   Of those, Match RHS   Ratio
Windy = TRUE & Outlook = sunny      1             1                     1.00
Windy = TRUE & Outlook = Overcast   2             2                     1.00
Windy = TRUE & Outlook = Rainy      1             0                     .00
Windy = TRUE & Temp = Hot           1             1                     1.00
Windy = TRUE & Temp = Mild          1             1                     1.00
Windy = TRUE & Temp = Cool          2             1                     .50
Windy = TRUE & Humid = High         2             2                     1.00
Windy = TRUE & Humid = Normal       2             1                     .50

13 Still more to cover though
The rules so far cover 4 of the 6 play=yes days
So we temporarily toss the 2 instances covered by the second rule and work on another rule

14 Example: My Weather (Nominal)
Outlook    Temp   Humid    Windy   Play?
sunny      hot    high     FALSE   no
overcast   hot    high     FALSE   no
rainy      mild   high     FALSE   no
rainy      cool   normal   FALSE   no
rainy      cool   normal   TRUE    no
overcast   cool   normal   TRUE    yes
sunny      cool   normal   FALSE   yes
rainy      mild   normal   FALSE   no
overcast   hot    normal   FALSE   no
rainy      mild   high     TRUE    no      TEST

15 We’re Looking for another rule …
… in the form: if ___ Then play? = yes
Again, we’ll try all possible ways of filling … on our reduced data

16 Find the best filler using training data
We look at proportion of instances that match the left hand side that also match the right hand side
LHS                  Matches LHS   Of those, Match RHS   Ratio
Outlook = sunny      2             1                     .50
Outlook = Overcast   3             1                     .33
Outlook = Rainy      4             0                     .00
Temp = Hot           3             0                     .00
Temp = Mild          2             0                     .00
Temp = Cool          4             2                     .50
Humid = High         3             0                     .00
Humid = Normal       6             2                     .33
Windy = TRUE         2             1                     .50
Windy = False        7             1                     .14

17 Refining Rule
If this rule is not accurate enough for us (based on a threshold – and at 50% it almost assuredly isn’t), we’re going to try to refine it by adding a clause(s)
Now, we’re looking to fill in a clause in the following: if temp = cool and _____ then play? = yes
We consider the accuracy of all possible ways of filling this blank …

18 Find the best filler using training data
We look at proportion of instances that match the left hand side that also match the right hand side
LHS                                Matches LHS   Of those, Match RHS   Ratio
Temp = Cool & Outlook = sunny      1             1                     1.00
Temp = Cool & Outlook = Overcast   1             1                     1.00
Temp = Cool & Outlook = Rainy      2             0                     .00
Temp = Cool & Humid = High         0             0                     ---
Temp = Cool & Humid = Normal       4             2                     .50
Temp = Cool & Windy = True         2             1                     .50
Temp = Cool & Windy = False        2             1                     .50

19 So Far, We Have 3 Rules …
–If Outlook = Sunny & Temp = Mild Then Play? = yes
–If Windy = TRUE & Humid = High Then Play? = yes
–If Temp = Cool & Outlook = Sunny Then Play? = yes
Still more to cover though
The rules so far cover 5 of the 6 play=yes days
So we temporarily toss the 1 instance covered by the third rule and work on another rule

20 Example: My Weather (Nominal)
Outlook    Temp   Humid    Windy   Play?
sunny      hot    high     FALSE   no
overcast   hot    high     FALSE   no
rainy      mild   high     FALSE   no
rainy      cool   normal   FALSE   no
rainy      cool   normal   TRUE    no
overcast   cool   normal   TRUE    yes
rainy      mild   normal   FALSE   no
overcast   hot    normal   FALSE   no
rainy      mild   high     TRUE    no      TEST

21 Again we’re looking for another rule …
… in the form: if ___ Then play? = yes
Again, we’ll try all possible ways of filling … on our reduced data

22 Find the best filler using training data
We look at proportion of instances that match the left hand side that also match the right hand side
LHS                  Matches LHS   Of those, Match RHS   Ratio
Outlook = sunny      1             0                     .00
Outlook = Overcast   3             1                     .33
Outlook = Rainy      4             0                     .00
Temp = Hot           3             0                     .00
Temp = Mild          2             0                     .00
Temp = Cool          3             1                     .33
Humid = High         3             0                     .00
Humid = Normal       5             1                     .20
Windy = TRUE         2             1                     .50
Windy = False        6             0                     .00

23 Refining Rule
If this rule is not accurate enough for us (based on a threshold – and at 50% it almost assuredly isn’t), we’re going to try to refine it by adding a clause(s)
Now, we’re looking to fill in a clause in the following: if Windy = True and _____ then play? = yes
We consider the accuracy of all possible ways of filling this blank …

24 Find the best filler using training data
We look at proportion of instances that match the left hand side that also match the right hand side
LHS                                 Matches LHS   Of those, Match RHS   Ratio
Windy = True & Outlook = sunny      0             0                     ---
Windy = True & Outlook = Overcast   1             1                     1.00
Windy = True & Outlook = Rainy      1             0                     .00
Windy = True & Temp = Hot           0             0                     ---
Windy = True & Temp = Mild          0             0                     ---
Windy = True & Temp = Cool          2             1                     .50
Windy = True & Humid = High         0             0                     ---
Windy = True & Humid = Normal       2             1                     .50

25 We’ve Covered all Yes Instances
We Have 4 Rules …
–If Outlook = Sunny & Temp = Mild Then Play? = yes
–If Windy = TRUE & Humid = High Then Play? = yes
–If Temp = Cool & Outlook = Sunny Then Play? = yes
–If Windy = TRUE & Outlook = Overcast Then Play? = yes
It’s time to work on the next class
–(remember to bring back all of the instances)
–(since it is the last class, we might create a default rule – anything else is play?=no)

26 Find the best filler using training data
We look at proportion of instances that match the left hand side that also match the right hand side (play? = no)
LHS                  Matches LHS   Of those, Match RHS   Ratio
Outlook = sunny      5             1                     .20
Outlook = Overcast   4             2                     .50
Outlook = Rainy      4             4                     1.00
Temp = Hot           4             3                     .75
Temp = Mild          5             2                     .40
Temp = Cool          4             2                     .50
Humid = High         6             3                     .50
Humid = Normal       7             4                     .57
Windy = TRUE         5             1                     .20
Windy = False        8             6                     .75

27 Still more to cover though
This rule (Outlook = Rainy) only covers 4 of the 7 play=no days
So we temporarily toss those 4 instances and work on another rule

28 Example: My Weather (Nominal)
Outlook    Temp   Humid    Windy   Play?
sunny      hot    high     FALSE   no
sunny      hot    high     TRUE    yes
overcast   hot    high     FALSE   no
overcast   cool   normal   TRUE    yes
sunny      mild   high     FALSE   yes
sunny      cool   normal   FALSE   yes
sunny      mild   normal   TRUE    yes
overcast   mild   high     TRUE    yes
overcast   hot    normal   FALSE   no
rainy      mild   high     TRUE    no      TEST

29 We’re Looking for another rule … in the form if ___ Then play? = no

30 Find the best filler using training data
We look at proportion of instances that match the left hand side that also match the right hand side
LHS                  Matches LHS   Of those, Match RHS (no)   Ratio
Outlook = sunny      5             1                          .20
Outlook = Overcast   4             2                          .50
Outlook = Rainy      0             0                          ---
Temp = Hot           4             3                          .75
Temp = Mild          3             0                          .00
Temp = Cool          2             0                          .00
Humid = High         5             2                          .40
Humid = Normal       4             1                          .25
Windy = TRUE         4             0                          .00
Windy = False        5             3                          .60

31 Refining Rule
If this rule is not accurate enough for us (based on a threshold), we’re going to try to refine it by adding a clause(s)
Now, we’re looking to fill in a clause in the following: if Temp = Hot and _____ then play? = no
We consider the accuracy of all possible ways of filling this blank …

32 Find the best filler using training data
We look at proportion of instances that match the left hand side that also match the right hand side
LHS                               Matches LHS   Of those, Match RHS (no)   Ratio
Temp = Hot & Outlook = sunny      2             1                          .50
Temp = Hot & Outlook = Overcast   2             2                          1.00
Temp = Hot & Outlook = Rainy      0             0                          ---
Temp = Hot & Humid = High         3             2                          .67
Temp = Hot & Humid = Normal       1             1                          1.00
Temp = Hot & Windy = True         1             0                          .00
Temp = Hot & Windy = False        3             3                          1.00

33 We’ve Done It!
The 2 rules so far cover all 7 of the play=no days
So we have a set of 6 rules based on this training data
–If Outlook = Sunny & Temp = Mild Then Play? = yes
–If Windy = TRUE & Humid = High Then Play? = yes
–If Temp = Cool & Outlook = Sunny Then Play? = yes
–If Windy = TRUE & Outlook = Overcast Then Play? = yes
–If Outlook = Rainy Then Play? = no
–If Temp = Hot & Windy = False Then Play? = no
Note that the rules for a given category are considered an ordered set of rules, but between categories there is no order implied – there may be a conflict!

34 Now, suppose we must predict the test instance: Rainy, mild, high, true
Rule 2 concludes play?=yes (incorrectly)
Rule 5 concludes play?=no (correctly)
One possible way of dealing with this conflict is to favor the rule that has greatest coverage (most instances in support of it) in the training data
In this case, Rule 2 has 2 instances in support, and Rule 5 has 4 instances in support
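One way this coverage-based tie-break could look in Python, using the six rules from slide 33 and the first 13 weather rows as training data. This is a sketch of the idea on the slide, not what WEKA does; `fires`, `support`, and `predict` are illustrative names.

```python
COLS = ("outlook", "temp", "humid", "windy", "play")
RAW = """
sunny hot high FALSE no
sunny hot high TRUE yes
overcast hot high FALSE no
rainy mild high FALSE no
rainy cool normal FALSE no
rainy cool normal TRUE no
overcast cool normal TRUE yes
sunny mild high FALSE yes
sunny cool normal FALSE yes
rainy mild normal FALSE no
sunny mild normal TRUE yes
overcast mild high TRUE yes
overcast hot normal FALSE no
"""
TRAIN = [dict(zip(COLS, line.split())) for line in RAW.strip().splitlines()]
TEST = dict(zip(COLS, "rainy mild high TRUE no".split()))

RULES = [  # the six rules from slide 33, in order
    ({"outlook": "sunny", "temp": "mild"}, "yes"),
    ({"windy": "TRUE", "humid": "high"}, "yes"),
    ({"temp": "cool", "outlook": "sunny"}, "yes"),
    ({"windy": "TRUE", "outlook": "overcast"}, "yes"),
    ({"outlook": "rainy"}, "no"),
    ({"temp": "hot", "windy": "FALSE"}, "no"),
]


def fires(conditions, row):
    return all(row[a] == v for a, v in conditions.items())


def support(conditions, label, rows):
    """Training rows that match the rule and have the class it predicts."""
    return sum(1 for r in rows if fires(conditions, r) and r["play"] == label)


def predict(row, rules, rows, default=None):
    """Among the rules that fire, trust the one with the most support."""
    fired = [(support(c, label, rows), label) for c, label in rules if fires(c, row)]
    return max(fired, key=lambda f: f[0])[1] if fired else default


fired_rules = [i + 1 for i, (c, _) in enumerate(RULES) if fires(c, TEST)]
print(fired_rules, predict(TEST, RULES, TRAIN))  # rules 2 and 5 fire; prediction is "no"
```

The `default` argument is a placeholder for the case where no rule fires at all; how to fill it is a separate design choice.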

35 In a 14-fold cross validation, this would continue 13 more times Let’s run WEKA on this … Prism …
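With 14 rows and 14 folds, each fold holds out exactly one row, so the splits can be sketched as below. The walkthrough so far corresponds to the fold that holds out the last row. The name `leave_one_out` is illustrative, and WEKA may order or shuffle its folds differently.

```python
RAW = """
sunny hot high FALSE no
sunny hot high TRUE yes
overcast hot high FALSE no
rainy mild high FALSE no
rainy cool normal FALSE no
rainy cool normal TRUE no
overcast cool normal TRUE yes
sunny mild high FALSE yes
sunny cool normal FALSE yes
rainy mild normal FALSE no
sunny mild normal TRUE yes
overcast mild high TRUE yes
overcast hot normal FALSE no
rainy mild high TRUE no
"""
ROWS = [line.split() for line in RAW.strip().splitlines()]  # all 14 rows


def leave_one_out(rows):
    """Yield (training rows, held-out row) once per row."""
    for i, held_out in enumerate(rows):
        yield rows[:i] + rows[i + 1:], held_out


folds = list(leave_one_out(ROWS))
print(len(folds), len(folds[0][0]))  # 14 folds, 13 training rows each
```

Each fold would then learn a fresh rule set on its 13 training rows and be scored on its single held-out row.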

36 WEKA results – first look near the bottom
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances      12    85.7143 %
Incorrectly Classified Instances     2    14.2857 %
On the cross validation – it got 12 out of 14 tests correct
Wins BIG over other approaches tried so far!

37 More Detailed Results
=== Confusion Matrix ===
 a b   <-- classified as
 5 1 | a = yes
 1 7 | b = no
Here we see
–The program 6 times predicted play=yes, on 5 of those it was correct
–The program 8 times predicted play=no, on 7 of those it was correct
There were 6 instances whose actual value was play=yes, the program correctly predicted that on 5 of them
There were 8 instances whose actual value was play=no, the program correctly predicted that on 7 of them
All-in-all, uniformly good prediction
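The numbers read off this matrix can be checked with a few lines of Python. A small sketch: rows are actual classes and columns are predicted classes, as in the WEKA output; "precision" and "recall" are the standard names for the two per-class ratios described above and do not appear on the slide.

```python
labels = ("yes", "no")
matrix = [[5, 1],   # actual yes: 5 predicted yes, 1 predicted no
          [1, 7]]   # actual no:  1 predicted yes, 7 predicted no

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / total

stats = {}
for i, label in enumerate(labels):
    predicted = sum(row[i] for row in matrix)  # column total: times this class was predicted
    actual = sum(matrix[i])                    # row total: instances really in this class
    stats[label] = (matrix[i][i] / predicted, matrix[i][i] / actual)

print(f"accuracy = {accuracy:.4f}")
for label, (precision, recall) in stats.items():
    print(f"{label}: precision = {precision:.3f}, recall = {recall:.3f}")
```

Accuracy comes out as 12/14, matching the 85.7143 % in the summary.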

38 Again, part of our purpose is to have a take-home message for humans
Not 14 take home messages!
So instead of reporting each of the things learned on each of the 14 training sets …
… The program runs again on all of the data and builds a pattern for that – a take home message

39 WEKA - Take-Home
=== Classifier model (full training set) ===
Prism rules
----------
If outlook = sunny and temperature = mild then yes
If outlook = sunny and temperature = cool then yes
If windy = TRUE and outlook = overcast then yes
If outlook = sunny and windy = TRUE then yes
If outlook = rainy then no
If temperature = hot and windy = FALSE then no

40 Let’s Try WEKA Prism on njcrimenominal
Try 10-fold
=== Confusion Matrix ===
  a  b   <-- classified as
  5  2 | a = bad
  6 19 | b = ok
This represents the same accuracy as with Naïve Bayes
We note that OneR chose unemployment as the attribute to use. With Prism, it is the first thing tested for each class, but if it is not high or low, other attributes are taken into account …

41 Prism’s rules for njcrimenominal:
=== Classifier model (full training set) ===
Prism rules
If unemploy = hi then bad
If popdens = med and education = low then bad
If pop = med and popdens = med then bad
If unemploy = med and education = low and pop = low then bad
If education = med and unemploy = med and twoparent = med then bad
If unemploy = low then ok
If education = hi then ok
If pop = med and popdens = low then ok
If twoparent = low and unemploy = med and popdens = low then ok

42 Figure 4.8 Pseudo-code for Prism basic rule learner.
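The figure itself did not survive in this transcript. As a stand-in, here is a hedged Python sketch of a basic Prism-style covering loop, pieced together from the walkthrough on the earlier slides. Two assumptions are mine: ties in accuracy are broken by the larger number of correct matches and then by whichever test is seen first, and the training data is the first 13 weather rows. Because the slides break ties differently in places, the rules printed here may differ from the ones on slide 33.

```python
COLS = ("outlook", "temp", "humid", "windy", "play")
RAW = """
sunny hot high FALSE no
sunny hot high TRUE yes
overcast hot high FALSE no
rainy mild high FALSE no
rainy cool normal FALSE no
rainy cool normal TRUE no
overcast cool normal TRUE yes
sunny mild high FALSE yes
sunny cool normal FALSE yes
rainy mild normal FALSE no
sunny mild normal TRUE yes
overcast mild high TRUE yes
overcast hot normal FALSE no
"""
TRAIN = [dict(zip(COLS, line.split())) for line in RAW.strip().splitlines()]


def covers(conditions, row):
    return all(row[a] == v for a, v in conditions.items())


def best_condition(rows, used, target):
    """Pick the unused attribute = value test with the best ratio (ties: more hits)."""
    counts = {}
    for row in rows:
        for attr in COLS[:-1]:
            if attr not in used:
                t, p = counts.get((attr, row[attr]), (0, 0))
                counts[(attr, row[attr])] = (t + 1, p + (row["play"] == target))
    # max() keeps the first best key, so remaining ties go to the earliest test seen
    return max(counts, key=lambda k: (counts[k][1] / counts[k][0], counts[k][1]))


def prism(rows, classes):
    rules = []
    for target in classes:
        remaining = list(rows)  # bring back all instances for each class
        while any(r["play"] == target for r in remaining):
            conditions, covered = {}, remaining
            # refine until the rule is perfect or no attributes are left
            while (any(r["play"] != target for r in covered)
                   and len(conditions) < len(COLS) - 1):
                attr, value = best_condition(covered, conditions, target)
                conditions[attr] = value
                covered = [r for r in covered if r[attr] == value]
            rules.append((conditions, target))
            remaining = [r for r in remaining if not covers(conditions, r)]
    return rules


rules = prism(TRAIN, ("yes", "no"))
for conditions, target in rules:
    lhs = " and ".join(f"{a} = {v}" for a, v in conditions.items())
    print(f"If {lhs} then {target}")
```

The outer loop is the "separate" part (toss what a finished rule covers) and the inner loop is the "cover" part (keep adding clauses until the rule stops matching the wrong class).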

43 Prism – Missing Values
Prism cannot handle missing values

44 Prism – Numeric Values
Prism cannot handle numeric values
Easy to imagine a simple rule learner that could handle them (in regular attributes)
–See the example introducing the section, where thresholds are chosen for numeric attributes as part of adding clauses to rules
No chance of ever handling numeric prediction

45 Prism – Discussion
Prism tries to fit training data 100%
This presents a serious risk of overfitting!!
A simple variation is to lower the accuracy threshold
–May need experimentation to find a suitable threshold
Needs conflict resolution between classes if more than one class is predicted
Needs a means of dealing with instances for which no class is predicted
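A sketch of the "lower the accuracy threshold" variation: stop adding clauses as soon as the rule's training accuracy reaches a chosen threshold instead of insisting on 100%. The function `grow_rule`, its `threshold` parameter, and the tie-breaking (more correct matches, then first test seen) are assumptions for illustration; the data is the first 13 weather rows.

```python
COLS = ("outlook", "temp", "humid", "windy", "play")
RAW = """
sunny hot high FALSE no
sunny hot high TRUE yes
overcast hot high FALSE no
rainy mild high FALSE no
rainy cool normal FALSE no
rainy cool normal TRUE no
overcast cool normal TRUE yes
sunny mild high FALSE yes
sunny cool normal FALSE yes
rainy mild normal FALSE no
sunny mild normal TRUE yes
overcast mild high TRUE yes
overcast hot normal FALSE no
"""
TRAIN = [dict(zip(COLS, line.split())) for line in RAW.strip().splitlines()]


def grow_rule(rows, target, threshold=1.0):
    """Add the best test until accuracy reaches `threshold` or attributes run out."""
    conditions, covered = {}, list(rows)

    def accuracy(rs):
        return sum(r["play"] == target for r in rs) / len(rs)

    while accuracy(covered) < threshold and len(conditions) < len(COLS) - 1:
        counts = {}
        for row in covered:
            for attr in COLS[:-1]:
                if attr not in conditions:
                    t, p = counts.get((attr, row[attr]), (0, 0))
                    counts[(attr, row[attr])] = (t + 1, p + (row["play"] == target))
        attr, value = max(counts, key=lambda k: (counts[k][1] / counts[k][0], counts[k][1]))
        conditions[attr] = value
        covered = [r for r in covered if r[attr] == value]
    return conditions, accuracy(covered)


loose = grow_rule(TRAIN, "yes", threshold=0.8)
strict = grow_rule(TRAIN, "yes")
print(loose)
print(strict)
```

With a threshold of 0.8 the first play? = yes rule stops at the single clause outlook = sunny (4 of 5 correct); with the default of 1.0 it goes on to add a second clause.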

46 Class Exercise
Let’s run WEKA Prism on japanbank
Need nominal attributes – so discretize first

47 End Section 4.4

48 Example: My Weather (Nominal)
Outlook    Temp   Humid    Windy   Play?
sunny      hot    high     FALSE   no
sunny      hot    high     TRUE    yes
overcast   hot    high     FALSE   no
rainy      mild   high     FALSE   no
rainy      cool   normal   FALSE   no
rainy      cool   normal   TRUE    no
overcast   cool   normal   TRUE    yes
sunny      mild   high     FALSE   yes
sunny      cool   normal   FALSE   yes
rainy      mild   normal   FALSE   no
sunny      mild   normal   TRUE    yes
overcast   mild   high     TRUE    yes
overcast   hot    normal   FALSE   no
rainy      mild   high     TRUE    no

