EECS 800 Research Seminar: Mining Biological Data
The University of Kansas. Instructor: Luke Huan. Fall 2006.
Classification III (lecture of 10/25/2006)

Slide 2: Overview
- Rule-based classification methods: overview
- CBA: classification based on association
- Applying rule-based methods to microarray analysis
Slide 3: Rule-Based Methods vs. SVM
Based on: Gao Cong, Kian-Lee Tan, Anthony K. H. Tung, Xin Xu. "Mining Top-k Covering Rule Groups for Gene Expression Data". SIGMOD'05.
Slide 4: Rule-Based Classifier
- Classify records by using a collection of "if...then..." rules
- Rule: (Condition) → y, where Condition is a conjunction of attribute tests and y is the class label
- Examples of classification rules:
  (Blood Type = Warm) ∧ (Lay Eggs = Yes) → Birds
  (Taxable Income < 50K) ∧ (Refund = Yes) → Evade = No
Slide 5: Rule-Based Classifier (Example)
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Slide 6: Application of Rule-Based Classifier
- A rule r covers an instance x if the attributes of x satisfy the condition of r
- Using rules R1-R5 of slide 5:
  Rule R1 covers a hawk → Bird
  Rule R3 covers the grizzly bear → Mammal
Slide 7: Rule Coverage and Accuracy
- Coverage of a rule: the fraction of records that satisfy the antecedent of the rule
- Accuracy of a rule: among the records that satisfy the antecedent, the fraction that also satisfy the consequent
- Example: (Status = Single) → No has Coverage = 40%, Accuracy = 50%
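The two definitions above can be sketched as a short function. The records and the rule below are hypothetical, chosen so the numbers match the slide's example (40% coverage, 50% accuracy).

```python
# Coverage = fraction of records satisfying the rule's antecedent;
# Accuracy = among those covered records, the fraction whose class
# matches the rule's consequent.

def coverage_and_accuracy(records, antecedent, consequent):
    covered = [r for r in records
               if all(r.get(k) == v for k, v in antecedent.items())]
    cov = len(covered) / len(records)
    acc = (sum(1 for r in covered if r["Class"] == consequent) / len(covered)
           if covered else 0.0)
    return cov, acc

# Hypothetical 10-record data set: 4 records are Single, 2 of which are "No".
records = (
    [{"Status": "Single", "Class": "No"}] * 2
    + [{"Status": "Single", "Class": "Yes"}] * 2
    + [{"Status": "Married", "Class": "No"}] * 6
)
cov, acc = coverage_and_accuracy(records, {"Status": "Single"}, "No")
print(cov, acc)  # 0.4 0.5
```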
Slide 8: How Does a Rule-Based Classifier Work?
Using rules R1-R5 of slide 5:
- A lemur triggers rule R3, so it is classified as a mammal
- A turtle triggers both R4 and R5
- A dogfish shark triggers none of the rules
Slide 9: Characteristics of Rule-Based Classifiers
- Mutually exclusive rules: every record is covered by at most one rule
- Exhaustive rules: the classifier accounts for every possible combination of attribute values, so each record is covered by at least one rule
Slide 10: From Decision Trees to Rules
- Rules extracted from a decision tree are mutually exclusive and exhaustive
- The rule set contains as much information as the tree
Slide 11: Rules Can Be Simplified
- Initial rule: (Refund = No) ∧ (Status = Married) → No
- Simplified rule: (Status = Married) → No
Slide 12: Effect of Rule Simplification
- Rules are no longer mutually exclusive: a record may trigger more than one rule.
  Solutions: an ordered rule set, or an unordered rule set with a voting scheme
- Rules are no longer exhaustive: a record may not trigger any rule.
  Solution: use a default class
Slide 13: Ordered Rule Set
- Rules are rank-ordered according to their priority; an ordered rule set is known as a decision list
- When a test record is presented to the classifier:
  It is assigned the class label of the highest-ranked rule it triggers
  If no rule fires, it is assigned the default class
(Rules R1-R5 as on slide 5.)
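A decision list is easy to sketch directly: try the rules in priority order and let the first match win, falling back to a default class. The rule encodings below mirror R1-R5 from the slides; the attribute names are assumptions about how the animal records would be represented.

```python
# A decision list: rules are tried in rank order; the first rule whose
# conditions all hold fires, otherwise the default class is returned.

RULES = [
    ({"Give Birth": "no", "Can Fly": "yes"}, "Birds"),          # R1
    ({"Give Birth": "no", "Live in Water": "yes"}, "Fishes"),   # R2
    ({"Give Birth": "yes", "Blood Type": "warm"}, "Mammals"),   # R3
    ({"Give Birth": "no", "Can Fly": "no"}, "Reptiles"),        # R4
    ({"Live in Water": "sometimes"}, "Amphibians"),             # R5
]

def classify(record, rules=RULES, default="Unknown"):
    for cond, label in rules:
        if all(record.get(k) == v for k, v in cond.items()):
            return label  # highest-ranked triggered rule wins
    return default

lemur = {"Give Birth": "yes", "Blood Type": "warm", "Can Fly": "no"}
turtle = {"Give Birth": "no", "Can Fly": "no", "Live in Water": "sometimes"}
print(classify(lemur))   # Mammals
print(classify(turtle))  # Reptiles (R4 outranks R5)
```

Note how the ordering resolves the turtle's conflict from slide 8: R4 and R5 both cover it, but R4 ranks higher.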
Slide 14: Rule Ordering Schemes
- Rule-based ordering: individual rules are ranked by their quality
- Class-based ordering: rules that belong to the same class appear together
Slide 15: Building Classification Rules
- Direct methods extract rules directly from data, e.g. CBA
- Indirect methods extract rules from other classification models (e.g. decision trees, neural networks), e.g. C4.5rules
Slide 16: Direct Method: Sequential Covering
1. Start from an empty rule
2. Grow a rule using the Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat steps (2) and (3) until the stopping criterion is met
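The four steps above can be sketched as follows. This is a deliberately simplified stand-in: Learn-One-Rule here greedily picks the single attribute test with the best accuracy on the remaining data, whereas real implementations grow conjunctions using a gain measure (see the RIPPER slide below).

```python
# Sequential covering sketch: repeatedly learn one rule for the target
# class, then remove the records it covers.

def learn_one_rule(data, target):
    # Greedy single-conjunct Learn-One-Rule (simplified stand-in).
    best, best_acc = None, -1.0
    for rec in data:
        for attr, val in rec.items():
            if attr == "Class":
                continue
            covered = [r for r in data if r.get(attr) == val]
            acc = sum(r["Class"] == target for r in covered) / len(covered)
            if acc > best_acc:
                best, best_acc = (attr, val), acc
    return best, best_acc

def sequential_covering(data, target, min_acc=0.6):
    data, rules = list(data), []
    while any(r["Class"] == target for r in data):
        rule, acc = learn_one_rule(data, target)
        if rule is None or acc < min_acc:
            break  # stopping criterion: no sufficiently accurate rule left
        rules.append(rule)
        attr, val = rule
        data = [r for r in data if r.get(attr) != val]  # step 3: remove covered
    return rules

data = [
    {"A": "x", "B": "u", "Class": "pos"},
    {"A": "x", "B": "v", "Class": "pos"},
    {"A": "y", "B": "u", "Class": "neg"},
    {"A": "y", "B": "v", "Class": "neg"},
]
print(sequential_covering(data, "pos"))  # [('A', 'x')]
```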
Slide 17: Example of Sequential Covering (figure)
Slide 18: Example of Sequential Covering, continued (figure)
Slide 19: Aspects of Sequential Covering
- Rule growing
- Rule evaluation
- Stopping criterion
- Rule pruning
Slide 20: Rule Growing
Two common strategies (figure): general-to-specific (start from an empty rule and add conjuncts) and specific-to-general (start from a specific rule and drop conjuncts).
Slide 21: Rule Growing (Example)
RIPPER algorithm:
- Start from an empty rule: {} → class
- Add the conjunct that maximizes FOIL's information gain measure:
  R0: {} → class (initial rule)
  R1: {A} → class (rule after adding a conjunct)
  Gain(R0, R1) = t * [ log(p1/(p1+n1)) - log(p0/(p0+n0)) ]
  where
  t: number of positive instances covered by both R0 and R1
  p0, n0: positive and negative instances covered by R0
  p1, n1: positive and negative instances covered by R1
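The gain formula above is direct to compute; a minimal sketch follows, using base-2 logarithms (FOIL's convention; the slide leaves the base unspecified). The instance counts are made-up illustration values.

```python
import math

# FOIL's information gain for growing R1 from R0:
# Gain = t * (log2(p1/(p1+n1)) - log2(p0/(p0+n0))),
# where t = positive instances covered by both R0 and R1.

def foil_gain(p0, n0, p1, n1, t):
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# A conjunct that narrows coverage from 100 pos / 100 neg down to
# 90 pos / 10 neg (all 90 positives were already covered by R0):
g = foil_gain(p0=100, n0=100, p1=90, n1=10, t=90)
print(round(g, 2))  # 76.32
```

The gain is large because the added conjunct raises the rule's precision from 0.5 to 0.9 while keeping most of the positive coverage, which is exactly the trade-off the t factor rewards.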
Slide 22: Rule Evaluation
Metrics, where n is the number of instances covered by the rule, nc the number of covered instances that belong to the rule's class, k the number of classes, and p the prior probability of the rule's class:
- Accuracy = nc / n
- Laplace = (nc + 1) / (n + k)
- M-estimate = (nc + k*p) / (n + k)
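The slide lists the metric names but the formulas were lost in extraction; the functions below use the usual definitions of these metrics, with made-up counts for illustration. Note that with p = 1/k the m-estimate reduces to the Laplace estimate.

```python
# Rule-evaluation metrics: n = instances covered by the rule,
# nc = covered instances of the rule's class, k = number of classes,
# p = prior probability of the rule's class.

def accuracy(nc, n):
    return nc / n

def laplace(nc, n, k):
    return (nc + 1) / (n + k)

def m_estimate(nc, n, k, p):
    return (nc + k * p) / (n + k)

# A rule covering 50 instances, 45 of its own class, 2-class problem:
print(accuracy(45, 50))            # 0.9
print(laplace(45, 50, 2))          # (45+1)/(50+2)
print(m_estimate(45, 50, 2, 0.5))  # equals Laplace when p = 1/k
```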
Slide 23: Stopping Criterion and Rule Pruning
- Stopping criterion: compute the gain; if the gain is not significant, discard the new rule
- Rule pruning: similar to post-pruning of decision trees.
  Reduced-error pruning: remove one of the conjuncts in the rule, compare the error rate on a validation set before and after pruning, and prune the conjunct if the error improves
Slide 24: Summary of the Direct Method
- Grow a single rule
- Remove the instances covered by the rule
- Prune the rule (if necessary)
- Add the rule to the current rule set
- Repeat
Slide 25: Direct Method: RIPPER
- For a 2-class problem, choose one class as the positive class and the other as the negative class; learn rules for the positive class, and let the negative class be the default
- For a multi-class problem: order the classes by increasing prevalence (fraction of instances belonging to each class), learn the rule set for the smallest class first while treating the rest as the negative class, then repeat with the next smallest class as the positive class
Slide 26: Indirect Methods (figure)
Slide 27: Indirect Method: C4.5rules
- Extract rules from an unpruned decision tree
- For each rule r: A → y, consider an alternative rule r': A' → y, where A' is obtained by removing one conjunct from A
- Compare the pessimistic error rate of r against all r'; prune r if some r' has a lower pessimistic error rate
- Repeat until the generalization error can no longer be improved
Slide 28: Advantages of Rule-Based Classifiers
- As highly expressive as decision trees
- Easy to interpret and easy to generate
- Can classify new instances rapidly
- Performance comparable to decision trees
Slide 29: Overview of CBA
Classification rule mining versus association rule mining:
- Aim: a small set of rules to form a classifier, versus all rules satisfying minsup and minconf
- Syntax: X → y (y is a class label), versus X → Y (Y is any itemset)
Slide 30: Association Rules for Classification
- Classification mines a small set of rules from the data to form a classifier or predictor; it has a target attribute, the class attribute
- Association rules have no fixed target, but we can fix one
- Class association rules (CARs) have the class attribute as target, e.g. Own_house = true → Class = Yes [sup = 6/15, conf = 6/6]
- CARs can obviously be used for classification
B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. KDD'98.
Slide 31: Decision Tree vs. CARs
The decision tree (figure) generates the following 3 rules:
- Own_house = true → Class = Yes [sup = 6/15, conf = 6/6]
- Own_house = false, Has_job = true → Class = Yes [sup = 5/15, conf = 5/5]
- Own_house = false, Has_job = false → Class = No [sup = 4/15, conf = 4/4]
But there are many other rules that the decision tree does not find.
Slide 32: There Are Many More Rules
- CAR mining finds all of them
- In many cases, rules not in the decision tree (or rule list) may classify better
- Such rules may also be actionable in practice
Slide 33: Decision Tree vs. CARs (cont.)
- Association mining requires discrete attributes, so CAR mining requires continuous attributes to be discretized first; several discretization algorithms exist. Decision tree learning handles both discrete and continuous attributes.
- A decision tree is not constrained by minsup or minconf, and is thus able to find rules with very low support; of course, such rules may be pruned to avoid overfitting.
Slide 34: CBA: Three Steps
1. Discretize continuous attributes, if any
2. Generate all class association rules (CARs)
3. Build a classifier from the generated CARs
Slide 35: RG: The Algorithm
- Rule generation (RG) finds the complete set of all possible rules
- This usually takes a long time to finish
Slide 36: RG: Basic Concepts
- Frequent ruleitem: a ruleitem is frequent if its support is above minsup
- Accurate rule: a rule is accurate if its confidence is above minconf
- Possible rule: among all ruleitems that share the same condset, the ruleitem with the highest confidence is the possible rule (PR) of that set of ruleitems
- The set of class association rules (CARs) consists of all possible rules that are both frequent and accurate
Slide 37: Further Considerations in CAR Mining
- Multiple minimum class supports handle imbalanced class distributions, e.g. a rare class with 98% negative and 2% positive instances: we can set minsup(positive) = 0.2% and minsup(negative) = 2%
- If we are not interested in classifying the negative class, we may not want to generate rules for it; we can set minsup(negative) to 100% or more
- Rule pruning may also be performed
Slide 38: Building Classifiers
- There are many ways to build classifiers using CARs; several systems exist
- Simplest: after CARs are mined, do nothing more. For each test case, simply choose the most confident rule that covers it. Microsoft SQL Server has a similar method. Alternatively, use a combination of rules.
- Another method (used in the CBA system) is similar to sequential covering: choose a set of rules to cover the training data
Slide 39: Classifier Builder: Three Steps
The basic idea is to choose a set of high-precedence rules in R to cover D:
1. Sort the set of generated rules R
2. Select rules from R in sorted order and add them to the classifier C; each selected rule must correctly classify at least one additional case. Also select the default class and compute the errors.
3. Discard the rules in C that do not improve the classifier's accuracy: locate the rule with the lowest error rate and discard the remaining rules in the sequence
Slide 40: Rules Are Sorted First
Definition: given two rules ri and rj, ri precedes rj (ri has a higher precedence than rj) if:
- the confidence of ri is greater than that of rj; or
- their confidences are the same, but the support of ri is greater than that of rj; or
- both the confidences and supports are the same, but ri was generated earlier than rj
A CBA classifier L has the form L = <r1, r2, ..., rk, default-class>.
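The three-level precedence relation above is just a comparator; a minimal sketch, representing each rule as a (confidence, support, generation_index, label) tuple (a hypothetical encoding, not CBA's actual data structure):

```python
from functools import cmp_to_key

# CBA rule precedence: higher confidence first, tie-break on higher
# support, final tie-break on earlier generation order.

def precedes(ri, rj):
    if ri[0] != rj[0]:
        return -1 if ri[0] > rj[0] else 1   # confidence
    if ri[1] != rj[1]:
        return -1 if ri[1] > rj[1] else 1   # support
    return -1 if ri[2] < rj[2] else 1       # generation order

rules = [
    (0.90, 3, 0, "r0"),
    (0.95, 2, 1, "r1"),
    (0.95, 4, 2, "r2"),
    (0.95, 4, 3, "r3"),
]
ordered = sorted(rules, key=cmp_to_key(precedes))
print([r[3] for r in ordered])  # ['r2', 'r3', 'r1', 'r0']
```

r2 and r3 tie on both confidence and support, so the earlier-generated r2 wins, exactly as the definition prescribes.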
Slide 41: Classifier Building Using CARs
- Selection: each selected rule makes at least one correct prediction
- Each case is covered by the highest-precedence rule among those that cover it
- This algorithm is correct but not efficient
Slide 42: Classifier Building Using CARs (cont.)
- For each case d in D, compute coverRules(d), the set of rules covering d
- Sort D according to the precedence of the first rule that correctly predicts each case
- Initialize RuleSet to empty
- Scan D again to find the optimal rule set
Slide 43: Refined Classification Based on Top-k Rule Groups (RCBT)
- General idea: construct the RCBT classifier from the top-k covering rule groups, so the number of rule groups generated is bounded
- Efficiency and accuracy are validated by experimental results
- Based on: Gao Cong, Kian-Lee Tan, Anthony K. H. Tung, Xin Xu. "Mining Top-k Covering Rule Groups for Gene Expression Data". SIGMOD'05.
Slide 44: Dataset
In a microarray dataset:
- Each row corresponds to a sample
- Each item value corresponds to a discretized gene-expression value
- Class labels correspond to sample categories (cancer / not cancer)
- Useful for diagnostic purposes
Slide 45: Introduction to Gene Expression Data (Microarray)
Format of gene expression data:
- Columns are genes (thousands); rows are samples with class labels (tens to hundreds of patients)
- After discretizing the numeric expression values for gene1-gene4 (original numeric table not recoverable from the slide), the example data become:
  row1: {a, b, e, h}, class C1
  row2: {c, d, e, f}, class C1
  row3: {a, b, g, h}, class C2
Slide 46: Rule Example
Rule r: {a, e, h} → C
Support(r) = 3, Confidence(r) = 66%
Slide 47: RG: General Solution
- Step 1: find all frequent itemsets in dataset D
- Step 2: generate rules of the form itemset → C; prune rules that do not have enough support and confidence
Slide 48: RG: Previous Algorithms
Item enumeration: search all frequent itemsets by checking all possible combinations of items. The search process can be simulated in an item enumeration tree:
{ }
{a} {b} {c}
{ab} {ac} {bc}
{abc}
Slide 49: Microarray Data
Features of microarray data:
- A few rows but a large number of items
- The space of all item combinations is therefore very large
Slide 50: Motivations
- Existing rule-mining algorithms are very slow: the item search space is exponential in the number of items. Idea: use row enumeration to design a new algorithm.
- The number of association rules is huge, even for a fixed consequent. Idea: mine only the top-k interesting rule groups for each row.
Slide 51: Definitions
- Row support set: given a set of items I', R(I') is the largest set of rows that contain I'
- Item support set: given a set of rows R', I(R') is the largest set of items that are common among the rows in R'
Slide 52: Example
If I' = {a, e, h}, then R(I') = {r2, r3, r4}
If R' = {r2, r3}, then I(R') = {a, e, h}
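The two operators are set containment and set intersection in disguise. A minimal sketch, on a hypothetical toy dataset chosen to be consistent with the slide's example (the slide's actual table is not shown):

```python
# R(I') = rows containing every item of I';
# I(R') = items common to every row of R'.

DATA = {
    "r1": {"b", "c"},
    "r2": {"a", "b", "e", "h"},
    "r3": {"a", "c", "e", "h"},
    "r4": {"a", "d", "e", "h"},
}

def R(items):
    return {rid for rid, row in DATA.items() if items <= row}

def I(rows):
    return set.intersection(*(DATA[rid] for rid in rows))

print(sorted(R({"a", "e", "h"})))  # ['r2', 'r3', 'r4']
print(sorted(I({"r2", "r3"})))     # ['a', 'e', 'h']
```

Note the asymmetry the slide illustrates: R(I({r2, r3})) = {r2, r3, r4}, a superset of the rows we started from; composing the two operators gives a closure.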
Slide 53: Rule Groups
What is a rule group? Given a one-row dataset {a, b, c, d, e, Cancer}, there are 31 rules of the form LHS → Cancer, all covering the same row with the same confidence (100%): 1 upper bound (abcde → Cancer) and 5 lower bounds (the single-item rules).
A rule group is a set of association rules whose LHS itemsets occur in the same set of rows; each rule group has a unique upper bound.
Slide 54: Rule Groups: Example
Dataset: row1 = {a, b, c} (C1), row2 = {a, b, c, d} (C1), row3 = {c, d, e} (C1), row4 = {c, d, e} (C2).
The rules abc → C1, ab → C1, ac → C1, bc → C1, a → C1, b → C1 (all 100% confidence) are covered by the same rows and form one rule group, with abc → C1 as the upper-bound rule. The rule c → C1 is not in the group (it covers a different set of rows).
Slide 55: Significance of Rule Groups
Rule group r1 is more significant than r2 if r1.conf > r2.conf, or if r1.conf = r2.conf and r1.sup > r2.sup.
Slide 56: Finding Top-k Rule Groups
Given dataset D, for each row of the dataset, find the k most significant covering rule groups (represented by their upper bounds), subject to the minimum support constraint. No minimum confidence is required.
Slide 57: Top-k Covering Rule Groups
For each row, we find the k most significant covering rule groups, ranked by confidence first and then support. Given minsup = 1, the top-1 results on the dataset of slide 54 are:
- row 1: abc → C1 (sup = 2, conf = 100%)
- row 2: abc → C1; the group abcd → C1 (sup = 1, conf = 100%) also covers row 2 but is less significant
- row 3: cd → C1 (sup = 2, conf = 66.7%)
- row 4: cde → C2 (sup = 1, conf = 50%)
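The support and confidence figures quoted above can be checked mechanically on the four-row dataset. A small sketch (the row encoding is the slide's table rebuilt as Python sets):

```python
# Four-row dataset from the rule-group example:
ROWS = [
    ({"a", "b", "c"}, "C1"),
    ({"a", "b", "c", "d"}, "C1"),
    ({"c", "d", "e"}, "C1"),
    ({"c", "d", "e"}, "C2"),
]

def sup_conf(items, label):
    # Support = covering rows of the target class; confidence = that
    # count divided by all covering rows.
    covering = [cls for row, cls in ROWS if items <= row]
    sup = sum(cls == label for cls in covering)
    return sup, sup / len(covering)

print(sup_conf({"a", "b", "c"}, "C1"))  # (2, 1.0)
print(sup_conf({"c", "d"}, "C1"))       # sup = 2, conf = 2/3
print(sup_conf({"c", "d", "e"}, "C2"))  # (1, 0.5)
```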
Slide 58: Relationship with CBA
- The rules selected by CBA for classification are a subset of the rules of TopkRGS with k = 1, so the top-1 covering rule groups can be used to build the CBA classifier
- The authors also propose a refined classification method based on TopkRGS: it reduces the chance that test data fall to the default class, and uses a subset of rules to make a collective decision
Slide 59: Main Advantages of Top-k Covering Rule Groups
- Their number is bounded by k times the number of samples
- Each sample is treated equally, providing a (small) complete description for each row
- The parameter k replaces the minimum-confidence parameter
- Sufficient to build classifiers while avoiding excessive computation
Slide 60: Naive Method of Finding TopkRGS
- Find the complete set of upper-bound rules with the row-wise algorithm, then pick the top-k covering rule groups for each row
- This is inefficient; to improve it, keep track of the top-k rule groups at each enumeration node dynamically and use effective pruning strategies
Slide 61: References
- W. Cohen. Fast effective rule induction. ICML'95.
- Xiaoxing Yin and Jiawei Han. CPAR: Classification based on Predictive Association Rules. SDM'03.
- B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. KDD'98.
- Jiuyong Li. On Optimal Rule Discovery. IEEE Transactions on Knowledge and Data Engineering, 18(4), 2006.
Some slides are courtesy of Zhang Xiang.