CS4445/B12 Provided by: Kenneth J. Loomis. CLASSIFICATION RULES: RIPPER ALGORITHM.

1 CS4445/B12 Provided by: Kenneth J. Loomis

2 CLASSIFICATION RULES: RIPPER ALGORITHM

3 The first thing that needs to be determined is the consequent of the rule. Recall that a rule has the form antecedent -> consequent. The table below contains the frequency counts of the possible consequents from the userprofile dataset, using budget as the classification attribute:

Rule                    Frequency
… -> budget=low         35
… -> budget=medium      91
… -> budget=high        5
… -> budget=?           7

We can see that budget=high has the lowest frequency count in our training dataset, so we choose it as the first consequent to find rules for. Note: I have included missing values here, as one could classify the target as missing. Alternatively, these instances could be removed.
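As a small illustration of this step, the sketch below orders the class values by increasing frequency, using the counts from the table. The variable names are illustrative and do not come from the slides.

```python
# Class frequencies for the budget attribute, copied from the table.
class_counts = {"low": 35, "medium": 91, "high": 5, "?": 7}

# Order the class values from least to most frequent;
# the walkthrough learns rules for the rarest class first.
classes_by_frequency = sorted(class_counts, key=class_counts.get)
first_consequent = classes_by_frequency[0]

print(classes_by_frequency)
print(f"first consequent: budget={first_consequent}")
```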

4 Next we attempt to find the first condition in the antecedent. We need only look at conditions that exist in the 5 instances that have budget=high. The possible conditions are listed in the table below.

Rule: ___ -> budget=high

smoker=true                       ambience=family          personality=hard-worker
smoker=false                      ambience=friends         personality=conformist
drink_level=abstemious            transport=car owner      personality=hunter-ostentatious
drink_level=casual drinker        transport=public         personality=thrifty-protector
drink_level=social drinker        marital_status=single    religion=none
dress_preference=no preference    interest=technology      religion=mormon
dress_preference=informal         interest=none            religion=christian
dress_preference=formal           interest=variety         activity=student


6 Here we see the information gain for each possible first condition in the antecedent.

Rule: ___ -> budget=high

Condition                            Info Gain
smoker=true                          0.0862
smoker=false                         0.07365
drink_level=abstemious               2.0974
drink_level=casual drinker           -0.7680
drink_level=social drinker           -0.5353
dress_preference=no preference       0.1174
dress_preference=informal            -0.3426
dress_preference=formal              -0.5710
ambience=family                      -0.6854
ambience=friends                     2.5440
transport=car owner                  6.7865
transport=public                     -1.5710
marital_status=single                0.8889
interest=technology                  3.6049
interest=none                        -0.1203
interest=variety                     3.6049
personality=hard-worker              -1.1441
personality=conformist               1.9792
personality=hunter-ostentatious      1.2016
personality=thrifty-protector        -0.1428
religion=none                        -0.1203
religion=mormon                      4.7866
religion=christian                   1.9792
activity=student                     -0.1343

transport=car owner has the highest information gain (6.7865), so it is selected as the first condition.
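The slides do not state the formula behind these values. Some of them can be reproduced with FOIL's information gain, the measure RIPPER is usually described as using when it grows a rule, so the sketch below assumes that formula. The function name foil_gain and its parameter names are illustrative. The counts come from the slides: the class frequencies sum to 138 instances, 5 of them have budget=high, 34 instances have transport=car owner (4 of them budget=high), and religion=mormon occurs in a single instance.

```python
from math import log2


def foil_gain(p0: int, n0: int, p1: int, n1: int) -> float:
    """FOIL's information gain for adding one condition to a rule.

    p0, n0: positive/negative instances covered before adding the condition.
    p1, n1: positive/negative instances covered after adding it.
    """
    if p1 == 0:
        return 0.0  # convention assumed here: no positives covered, no gain
    return p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))


# Empty rule: covers all 138 instances, 5 of them budget=high.
# religion=mormon covers 1 instance, and it is a budget=high instance.
gain_first = foil_gain(p0=5, n0=133, p1=1, n1=0)

# Rule "transport=car owner": covers 34 instances, 4 of them budget=high.
# Adding religion=mormon leaves 1 positive and 0 negatives.
gain_second = foil_gain(p0=4, n0=30, p1=1, n1=0)

print(round(gain_first, 4), round(gain_second, 4))
```

Under this assumption the two calls give 4.7866 and 3.0875, which agree with the values the slides list for religion=mormon as a first and as a second condition.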


8 Next we attempt to find the second condition in the antecedent. We need only look at conditions that exist in the 4 instances that have transport=car owner and budget=high. The possible conditions are listed in the table below.

Rule: transport=car owner and ___ -> budget=high

smoker=false                      ambience=friends                   personality=thrifty-protector
drink_level=abstemious            marital_status=single              religion=none
drink_level=casual drinker        interest=technology                religion=mormon
dress_preference=no preference    interest=none                      religion=christian
dress_preference=informal         interest=variety                   activity=student
dress_preference=elegant          personality=hard-worker
ambience=family                   personality=hunter-ostentatious

9 Here we see the information gain for each possible second condition in the antecedent.

Rule: transport=car owner and ___ -> budget=high

Condition                            Info Gain
smoker=false                         2.5121
drink_level=abstemious               5.0173
drink_level=casual drinker           -0.6130
dress_preference=no preference       -.06097
dress_preference=informal            0.7655
dress_preference=elegant             3.0875
ambience=family                      -0.6130
ambience=friends                     1.5075
marital_status=single                2.7570
interest=technology                  2.5602
interest=none                        0.0875
interest=variety                     2.5602
personality=hard-worker              -1.1605
personality=hunter-ostentatious      0.7655
personality=thrifty-protector        1.5311
religion=none                        -0.0824
religion=mormon                      3.0875
religion=christian                   3.0875
activity=student                     -0.0840

drink_level=abstemious has the highest information gain (5.0173), so it is selected as the second condition.


11 Next we attempt to find the third condition in the antecedent. We need only look at conditions that exist in the 3 instances that have transport=car owner, drink_level=abstemious, and budget=high. The possible conditions are listed in the table below.

Rule: transport=car owner and drink_level=abstemious and ___ -> budget=high

smoker=false                      interest=technology                personality=thrifty-protector
dress_preference=no preference    interest=none                      religion=none
dress_preference=formal           interest=variety                   religion=catholic
ambience=family                   personality=hard-worker            religion=christian
ambience=friends                  personality=hunter-ostentatious    activity=student
marital_status=single

12 Here we see the information gain for each possible third condition in the antecedent.

Rule: transport=car owner and drink_level=abstemious and ___ -> budget=high

Condition                            Info Gain
smoker=false                         0
dress_preference=no preference       -0.3399
dress_preference=formal              1.4513
ambience=family                      -0.5850
ambience=friends                     2.8300
marital_status=single                1.2415
interest=technology                  0.4515
interest=none                        -0.5850
interest=variety                     0.4515
personality=hard-worker              -0.5850
personality=hunter-ostentatious      1.4150
personality=thrifty-protector        -0.1699
religion=none                        -0.1699
religion=catholic                    -0.5850
religion=christian                   1.4150
activity=student                     .01826

13 Since ambience=friends results in the highest information gain, we select it as the third condition of our rule:

transport = car owner and drink_level = abstemious and ambience = friends -> budget = high

Note that this rule covers only positive examples (i.e., budget=high data instances). Since it covers no negative examples, there is no need to add more conditions to the rule. RIPPER’s construction of the first rule is now complete.

14 First rule: transport = car owner and drink_level = abstemious and ambience = friends -> budget = high

In order to decide if/how to prune this rule, RIPPER will:
- use a validation set (that is, a piece of the training set that was kept apart and not used to construct the rule)
- use a metric for pruning: v = (p-n)/(p+n), where
  p: # of positive examples covered by the rule in the validation set
  n: # of negative examples covered by the rule in the validation set
- apply the pruning method: delete the final sequence of conditions whose removal maximizes v. That is, it calculates v for each of the following pruned versions of the rule and keeps the version of the rule with maximum v:
  transport = car owner & drink_level = abstemious & ambience = friends -> budget = high
  transport = car owner & drink_level = abstemious -> budget = high
  transport = car owner -> budget = high
  -> budget = high
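The pruning step can be sketched as follows. The slides do not show the validation set, so the five instances below are hypothetical and chosen only to exercise the metric. The helper names (pruning_value, covers, best_prefix) are illustrative, and the handling of a rule that covers no validation instances is an assumption.

```python
def pruning_value(p: int, n: int) -> float:
    """v = (p - n) / (p + n) for a rule covering p positives and n negatives."""
    if p + n == 0:
        return float("-inf")  # assumption: a rule covering nothing is never preferred
    return (p - n) / (p + n)


def covers(conditions, instance):
    """True if the instance satisfies every (attribute, value) condition."""
    return all(instance.get(attr) == value for attr, value in conditions)


def best_prefix(conditions, validation, target):
    """Score the full rule and every version with final conditions removed."""
    target_attr, target_value = target
    scored = []
    for k in range(len(conditions), -1, -1):
        prefix = conditions[:k]
        covered = [x for x in validation if covers(prefix, x)]
        p = sum(1 for x in covered if x.get(target_attr) == target_value)
        n = len(covered) - p
        scored.append((pruning_value(p, n), prefix))
    # max() keeps the first maximal entry, so ties favour the longer rule.
    return max(scored, key=lambda pair: pair[0])


rule = [("transport", "car owner"), ("drink_level", "abstemious"), ("ambience", "friends")]

# Hypothetical validation instances (not from the slides).
validation = [
    {"transport": "car owner", "drink_level": "abstemious", "ambience": "friends", "budget": "medium"},
    {"transport": "car owner", "drink_level": "abstemious", "ambience": "family", "budget": "high"},
    {"transport": "car owner", "drink_level": "abstemious", "ambience": "friends", "budget": "high"},
    {"transport": "car owner", "drink_level": "casual drinker", "ambience": "friends", "budget": "medium"},
    {"transport": "public", "drink_level": "abstemious", "ambience": "friends", "budget": "low"},
]

best_v, best_rule = best_prefix(rule, validation, ("budget", "high"))
print(best_v, best_rule)
```

On this made-up validation set the full rule scores v = 0, the two-condition version scores v = 1/3, the one-condition version scores 0, and the empty antecedent scores -0.2, so the last condition would be pruned.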

15 ASSOCIATION RULES: APRIORI ALGORITHM

16 We begin the Apriori algorithm by determining the order of the items: here I will use the order in which the attributes appear, with the values of each attribute in alphabetical order. Then all the possible single-item itemsets are generated and the support is calculated for each one. The following slide shows the complete list of possible items. Support is calculated in the following manner:

support(X) = (number of instances that contain X) / (total number of instances)

Since we know the minimum acceptable support count is 55, we need only look at the numerator of this ratio to determine whether or not to keep an item.

17 Candidate Itemsets with Support Count

Item                                 Support Count
smoker=false                         109 *
smoker=true                          26
drink_level=abstemious               51
drink_level=casual drinker           47
drink_level=social drinker           40
dress_preference=elegant             4
dress_preference=formal              41
dress_preference=informal            53
dress_preference=no preference       35
ambience=family                      70 *
ambience=friends                     46
ambience=solitary                    16
transport=car owner                  34
transport=on foot                    14
transport=public                     82 *
marital_status=single                122 *
marital_status=married               10
interest=eco-friendly                16
interest=none                        30
interest=technology                  36
interest=variety                     50
personality=conformist               7
personality=hard-worker              61 *
personality=hunter-ostentatious      12
personality=thrifty-protector        58 *
religion=catholic                    99 *
religion=christian                   7
religion=jewish                      1
religion=mormon                      1
religion=none                        30
activity=professional                15
activity=student                     113 *
activity=unemployed                  2
activity=working-class               1
budget=high                          5
budget=low                           35
budget=medium                        91 *

We keep the ones marked with * (bold in the original slide), as they meet the minimum support threshold.

18 Itemsets with Support

We keep the following itemsets as they have enough support, and use them to generate candidate itemsets for the next level.

Itemset                              Support Count
smoker=false                         109
ambience=family                      70
transport=public                     82
marital_status=single                122
personality=hard-worker              61
personality=thrifty-protector        58
religion=catholic                    99
activity=student                     113
budget=medium                        91
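The level 1 filtering can be reproduced with a few lines of code. This is a minimal sketch: the counts are copied from the candidate table, and the names item_counts, MIN_SUPPORT_COUNT, and frequent_items are illustrative.

```python
MIN_SUPPORT_COUNT = 55  # minimum support count stated in the slides

# Support count of every single item, copied from the candidate table.
item_counts = {
    "smoker=false": 109, "smoker=true": 26,
    "drink_level=abstemious": 51, "drink_level=casual drinker": 47,
    "drink_level=social drinker": 40,
    "dress_preference=elegant": 4, "dress_preference=formal": 41,
    "dress_preference=informal": 53, "dress_preference=no preference": 35,
    "ambience=family": 70, "ambience=friends": 46, "ambience=solitary": 16,
    "transport=car owner": 34, "transport=on foot": 14, "transport=public": 82,
    "marital_status=single": 122, "marital_status=married": 10,
    "interest=eco-friendly": 16, "interest=none": 30,
    "interest=technology": 36, "interest=variety": 50,
    "personality=conformist": 7, "personality=hard-worker": 61,
    "personality=hunter-ostentatious": 12, "personality=thrifty-protector": 58,
    "religion=catholic": 99, "religion=christian": 7, "religion=jewish": 1,
    "religion=mormon": 1, "religion=none": 30,
    "activity=professional": 15, "activity=student": 113,
    "activity=unemployed": 2, "activity=working-class": 1,
    "budget=high": 5, "budget=low": 35, "budget=medium": 91,
}

# Keep only the items whose support count reaches the threshold.
frequent_items = {
    item: count for item, count in item_counts.items()
    if count >= MIN_SUPPORT_COUNT
}

for item, count in frequent_items.items():
    print(f"{item:32s}{count}")
```

Nine items pass the threshold, the same nine listed on this slide.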

19 We merge pairs from the level 1 set. Since there are no prefixes at this level, we must consider all combinations. (Continued on next slide)

Candidate Itemsets with Support Count

Itemset                                                  Support Count
smoker=false, ambience=family                            59
smoker=false, transport=public                           69
smoker=false, marital_status=single                      98
smoker=false, personality=hard-worker                    49
smoker=false, personality=thrifty-protector              48
smoker=false, religion=catholic                          79
smoker=false, activity=student                           90
smoker=false, budget=medium                              75
ambience=family, transport=public                        46
ambience=family, marital_status=single                   63
ambience=family, personality=hard-worker                 26
ambience=family, personality=thrifty-protector           33
ambience=family, religion=catholic                       57
ambience=family, activity=student                        61
ambience=family, budget=medium                           54
transport=public, marital_status=single                  76
transport=public, personality=hard-worker                28
transport=public, personality=thrifty-protector          44
transport=public, religion=catholic                      62
transport=public, activity=student                       71
transport=public, budget=medium                          54

20 Candidate Itemsets with Support Count

Itemset                                                  Support Count
marital_status=single, personality=hard-worker           52
marital_status=single, personality=thrifty-protector     51
marital_status=single, religion=catholic                 91
marital_status=single, activity=student                  107
marital_status=single, budget=medium                     79
personality=hard-worker, personality=thrifty-protector   0
personality=hard-worker, religion=catholic               40
personality=hard-worker, activity=student                46
personality=hard-worker, budget=medium                   40
personality=thrifty-protector, religion=catholic         45
personality=thrifty-protector, activity=student          50
personality=thrifty-protector, budget=medium             41
religion=catholic, activity=student                      84
religion=catholic, budget=medium                         67
activity=student, budget=medium                          71
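Generating the level 2 candidates amounts to taking every pair of frequent single items. A minimal sketch using the standard library follows; the list name level1 is illustrative, and the items are kept in the order used on the slides.

```python
from itertools import combinations

# Frequent single items from level 1, in the slides' attribute order.
level1 = [
    "smoker=false", "ambience=family", "transport=public",
    "marital_status=single", "personality=hard-worker",
    "personality=thrifty-protector", "religion=catholic",
    "activity=student", "budget=medium",
]

# With no shared prefix to match at this level, every pair is a candidate.
level2_candidates = list(combinations(level1, 2))

print(len(level2_candidates))
for pair in level2_candidates[:3]:
    print(", ".join(pair))
```

Nine items yield 36 pairs, matching the 21 + 15 candidates on the two slides. The pair of two personality values is generated as well, which is why it appears in the table with a support count of 0.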

21 Itemsets with Support Count

We keep the following itemsets as they have enough support, and use them to generate candidate itemsets for the next level.

Itemset                                                  Support Count
smoker=false, ambience=family                            59
smoker=false, transport=public                           69
smoker=false, marital_status=single                      98
smoker=false, religion=catholic                          79
smoker=false, activity=student                           90
smoker=false, budget=medium                              75
ambience=family, marital_status=single                   63
ambience=family, religion=catholic                       57
ambience=family, activity=student                        61
transport=public, marital_status=single                  76
transport=public, religion=catholic                      62
transport=public, activity=student                       71
marital_status=single, religion=catholic                 91
marital_status=single, activity=student                  107
marital_status=single, budget=medium                     79
religion=catholic, activity=student                      84
religion=catholic, budget=medium                         67
activity=student, budget=medium                          71

22 We generate the next level of candidate sets, but before we calculate the support we can use the Apriori principle to determine if they are viable candidates.

Itemsets from Level 2
smoker=false, ambience=family
smoker=false, transport=public
smoker=false, marital_status=single
smoker=false, religion=catholic
smoker=false, activity=student
smoker=false, budget=medium
ambience=family, marital_status=single
ambience=family, religion=catholic
ambience=family, activity=student
transport=public, marital_status=single
transport=public, religion=catholic
transport=public, activity=student
marital_status=single, religion=catholic
marital_status=single, activity=student
marital_status=single, budget=medium
religion=catholic, activity=student
religion=catholic, budget=medium
activity=student, budget=medium

23 First we determine the candidates by “joining” itemsets with like prefixes (i.e., the first k-1 items in the itemsets are the same). Here we need only match the first item in the itemset. The join is applied to the Level 2 itemsets listed on the previous slide.

24 That results in this set of potential candidate itemsets.

Potential Candidate Itemsets
smoker=false, ambience=family, transport=public
smoker=false, ambience=family, marital_status=single
smoker=false, ambience=family, religion=catholic
smoker=false, ambience=family, activity=student
smoker=false, ambience=family, budget=medium
smoker=false, transport=public, marital_status=single
smoker=false, transport=public, religion=catholic
smoker=false, transport=public, activity=student
smoker=false, transport=public, budget=medium
smoker=false, marital_status=single, religion=catholic
smoker=false, marital_status=single, activity=student
smoker=false, marital_status=single, budget=medium
smoker=false, activity=student, budget=medium
ambience=family, marital_status=single, religion=catholic
ambience=family, marital_status=single, activity=student
ambience=family, religion=catholic, activity=student
transport=public, marital_status=single, religion=catholic
transport=public, marital_status=single, activity=student
transport=public, religion=catholic, activity=student
marital_status=single, religion=catholic, activity=student
marital_status=single, religion=catholic, budget=medium
marital_status=single, activity=student, budget=medium
religion=catholic, activity=student, budget=medium

25 We have one final step before calculating the support: we can eliminate unnecessary candidates. We must check that all subsets of size 2 in each of these itemsets also exist in the level 2 set. We can make this a little easier by ignoring the prefix subsets, as we know those exist because we used them to create the itemsets. The following itemsets can be removed because the subset shown in parentheses (bolded in the original slide) does not appear in the Level 2 itemsets.

Candidate Itemsets That Can Be Removed
smoker=false, ambience=family, transport=public    (missing subset: ambience=family, transport=public)
smoker=false, ambience=family, budget=medium       (missing subset: ambience=family, budget=medium)
smoker=false, transport=public, budget=medium      (missing subset: transport=public, budget=medium)

This leaves us the candidate itemsets on the next slide.
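The join and prune steps described on the last few slides can be sketched as below. The function names join and has_infrequent_subset, the single-letter aliases, and the rank table are illustrative. Itemsets are stored as tuples in the slides' attribute order so that prefixes can be compared directly. The join is demonstrated only on the three level 2 itemsets that start with ambience=family, and the prune check is run on the three removed candidates plus one that survives.

```python
from itertools import combinations

# Short aliases for the items, in the slides' attribute order.
S, F, T, M, R, A, B = (
    "smoker=false", "ambience=family", "transport=public",
    "marital_status=single", "religion=catholic",
    "activity=student", "budget=medium",
)
rank = {item: i for i, item in enumerate((S, F, T, M, R, A, B))}

# The 18 frequent 2-itemsets from level 2.
level2 = {
    (S, F), (S, T), (S, M), (S, R), (S, A), (S, B),
    (F, M), (F, R), (F, A),
    (T, M), (T, R), (T, A),
    (M, R), (M, A), (M, B),
    (R, A), (R, B), (A, B),
}


def join(itemsets):
    """Join k-itemsets that agree on their first k-1 items."""
    ordered = sorted(itemsets, key=lambda s: [rank[i] for i in s])
    return [a + (b[-1],) for a, b in combinations(ordered, 2) if a[:-1] == b[:-1]]


def has_infrequent_subset(candidate, frequent):
    """True if some (k-1)-subset of the candidate is not a frequent itemset."""
    k = len(candidate) - 1
    return any(sub not in frequent for sub in combinations(candidate, k))


# Join only the itemsets whose first item is ambience=family.
family_group = {s for s in level2 if s[0] == F}
print(join(family_group))

# Prune check on the three removed candidates and one that is kept.
for cand in [(S, F, T), (S, F, B), (S, T, B), (S, F, M)]:
    print(cand, "remove" if has_infrequent_subset(cand, level2) else "keep")
```

The join on the ambience=family group produces the three candidates that start with ambience=family on the previous slide, and the prune check flags exactly the three itemsets listed here for removal.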

26 Finally we can calculate the support for these candidate itemsets.

Candidate Itemsets with Support Count

Itemset                                                            Support Count
smoker=false, ambience=family, marital_status=single               53
smoker=false, ambience=family, religion=catholic                   46
smoker=false, ambience=family, activity=student                    52
smoker=false, transport=public, marital_status=single              63
smoker=false, transport=public, religion=catholic                  52
smoker=false, transport=public, activity=student                   58
smoker=false, marital_status=single, religion=catholic             72
smoker=false, marital_status=single, activity=student              85
smoker=false, marital_status=single, budget=medium                 65
smoker=false, activity=student, budget=medium                      58
ambience=family, marital_status=single, religion=catholic          50
ambience=family, marital_status=single, activity=student           57
ambience=family, religion=catholic, activity=student               51
transport=public, marital_status=single, religion=catholic         57
transport=public, marital_status=single, activity=student          67
transport=public, religion=catholic, activity=student              59
marital_status=single, religion=catholic, activity=student         80
marital_status=single, religion=catholic, budget=medium            80
marital_status=single, activity=student, budget=medium             59
religion=catholic, activity=student, budget=medium                 53

27 Level 3 Itemsets with Support

We keep the following itemsets as they have enough support, and use them to generate candidate itemsets for the next level.

Itemset                                                            Support Count
smoker=false, transport=public, marital_status=single              63
smoker=false, transport=public, activity=student                   58
smoker=false, marital_status=single, religion=catholic             72
smoker=false, marital_status=single, activity=student              85
smoker=false, marital_status=single, budget=medium                 65
smoker=false, activity=student, budget=medium                      58
ambience=family, marital_status=single, activity=student           57
transport=public, marital_status=single, religion=catholic         57
transport=public, marital_status=single, activity=student          67
transport=public, religion=catholic, activity=student              59
marital_status=single, religion=catholic, activity=student         80
marital_status=single, religion=catholic, budget=medium            80
marital_status=single, activity=student, budget=medium             59

28 We generate the next level of candidate sets, but before we calculate the support we can use the Apriori principle to determine if they are viable candidates.

Level 3 Itemsets
smoker=false, transport=public, marital_status=single
smoker=false, transport=public, activity=student
smoker=false, marital_status=single, religion=catholic
smoker=false, marital_status=single, activity=student
smoker=false, marital_status=single, budget=medium
smoker=false, activity=student, budget=medium
ambience=family, marital_status=single, activity=student
transport=public, marital_status=single, religion=catholic
transport=public, marital_status=single, activity=student
transport=public, religion=catholic, activity=student
marital_status=single, religion=catholic, activity=student
marital_status=single, religion=catholic, budget=medium
marital_status=single, activity=student, budget=medium

29 First we determine the candidates by “joining” itemsets with like prefixes (i.e., the first k-1 items in the itemsets match). Here we need only match the first two items in the itemset. The join is applied to the Level 3 itemsets listed on the previous slide.

30 That results in this set of potential candidate itemsets.

Potential Candidate Itemsets
smoker=false, transport=public, marital_status=single, activity=student
smoker=false, marital_status=single, religion=catholic, activity=student
smoker=false, marital_status=single, religion=catholic, budget=medium
smoker=false, marital_status=single, activity=student, budget=medium
transport=public, marital_status=single, religion=catholic, activity=student
marital_status=single, religion=catholic, activity=student, budget=medium

We have one final step before calculating the support: we can eliminate unnecessary candidates. We must check that all subsets of size 3 in each of these itemsets also exist in the level 3 set. We can make this a little easier by ignoring the prefix subsets, as we know those exist because we used them to create the itemsets. Here we again eliminate candidates from consideration; the offending subsets are bolded in the original slide.

31 Candidate Itemsets with Support Count

Itemset                                                                     Support Count
smoker=false, marital_status=single, religion=catholic, activity=student    63
smoker=false, marital_status=single, activity=student, budget=medium        53

In the end we keep only a single itemset that has enough support for this level. The following slide shows the complete list of itemsets.

Level 4 Itemsets with Support Count

Itemset                                                                     Support Count
smoker=false, marital_status=single, religion=catholic, activity=student    63

32 Itemsets with Support Count

Level 1
smoker=false                                                       109
ambience=family                                                    70
transport=public                                                   82
marital_status=single                                              122
personality=hard-worker                                            61
personality=thrifty-protector                                      58
religion=catholic                                                  99
activity=student                                                   113
budget=medium                                                      91

Level 2
smoker=false, ambience=family                                      59
smoker=false, transport=public                                     69
smoker=false, marital_status=single                                98
smoker=false, religion=catholic                                    79
smoker=false, activity=student                                     90
smoker=false, budget=medium                                        75
ambience=family, marital_status=single                             63
ambience=family, religion=catholic                                 57
ambience=family, activity=student                                  61
transport=public, marital_status=single                            76
transport=public, religion=catholic                                62
transport=public, activity=student                                 71
marital_status=single, religion=catholic                           91
marital_status=single, activity=student                            107
marital_status=single, budget=medium                               79
religion=catholic, activity=student                                84
religion=catholic, budget=medium                                   67
activity=student, budget=medium                                    71

Level 3
smoker=false, transport=public, marital_status=single              63
smoker=false, transport=public, activity=student                   58
smoker=false, marital_status=single, religion=catholic             72
smoker=false, marital_status=single, activity=student              85
smoker=false, marital_status=single, budget=medium                 65
smoker=false, activity=student, budget=medium                      58
ambience=family, marital_status=single, activity=student           57
transport=public, marital_status=single, religion=catholic         57
transport=public, marital_status=single, activity=student          67
transport=public, religion=catholic, activity=student              59
marital_status=single, religion=catholic, activity=student         80
marital_status=single, religion=catholic, budget=medium            80
marital_status=single, activity=student, budget=medium             59

Level 4
smoker=false, marital_status=single, religion=catholic, activity=student    63

33 Largest itemset: Let’s call this itemset I4:

I4: smoker=false, marital_status=single, religion=catholic, activity=student

Rules constructed from I4 with 2 items in the antecedent:

R1: smoker=false, marital_status=single -> religion=catholic, activity=student
    conf(R1) = supp(I4)/supp(smoker=false, marital_status=single) = 63/98 = 64.29%
R2: smoker=false, religion=catholic -> marital_status=single, activity=student
    conf(R2) = supp(I4)/supp(smoker=false, religion=catholic) = 63/79 = 79.75%
R3: smoker=false, activity=student -> marital_status=single, religion=catholic
    conf(R3) = supp(I4)/supp(smoker=false, activity=student) = 63/90 = 70%
R4: marital_status=single, religion=catholic -> smoker=false, activity=student
    conf(R4) = supp(I4)/supp(marital_status=single, religion=catholic) = 63/91 = 69.23%
R5: marital_status=single, activity=student -> smoker=false, religion=catholic
    conf(R5) = supp(I4)/supp(marital_status=single, activity=student) = 63/107 = 58.88%
R6: religion=catholic, activity=student -> smoker=false, marital_status=single
    conf(R6) = supp(I4)/supp(religion=catholic, activity=student) = 63/84 = 75%
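The confidence calculations can be checked with a short script. This is a sketch only: the support counts are copied from the slides, and the names support, I4, and confidence are illustrative.

```python
from itertools import combinations

items = ["smoker=false", "marital_status=single", "religion=catholic", "activity=student"]
I4 = frozenset(items)

# Support counts copied from the slides.
support = {
    I4: 63,
    frozenset({"smoker=false", "marital_status=single"}): 98,
    frozenset({"smoker=false", "religion=catholic"}): 79,
    frozenset({"smoker=false", "activity=student"}): 90,
    frozenset({"marital_status=single", "religion=catholic"}): 91,
    frozenset({"marital_status=single", "activity=student"}): 107,
    frozenset({"religion=catholic", "activity=student"}): 84,
}


def confidence(antecedent, itemset):
    """conf(antecedent -> rest) = supp(itemset) / supp(antecedent)."""
    return support[frozenset(itemset)] / support[frozenset(antecedent)]


# Every rule with two items in the antecedent and the other two in the consequent.
for antecedent in combinations(items, 2):
    consequent = [i for i in items if i not in antecedent]
    conf = confidence(antecedent, I4)
    print(f"{', '.join(antecedent)} -> {', '.join(consequent)}: {conf:.2%}")
```

Enumerating the antecedents with combinations visits the six rules in the same order as R1 through R6.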

