Associative Classification of Imbalanced Datasets
Sanjay Chawla
School of IT, University of Sydney

Overview
Data Mining Tasks
Associative Classifiers
Downside of Support and Confidence
Mining Rules from Imbalanced Data Sets
  – Fisher’s Exact Test
  – Class Correlation Ratio (CCR)
  – Searching and Pruning Strategies
  – Experiments

Data Mining
Data mining research has settled into an equilibrium involving four tasks:
  – Pattern Mining (Association Rules)
  – Classification
  – Clustering
  – Anomaly or Outlier Detection
[Figure labels: Associative Classifier, DB, ML]

Association Rule Mining
In terms of impact, nothing rivals association rule mining within the data mining community.
  – SIGMOD 93 (~4100 citations): Agrawal, Imielinski, Swami
  – VLDB 94 (~4900 citations): Agrawal, Srikant
  – C4.5 93 (~7000 citations): Ross Quinlan
  – Gibbs Sampling 84 (IEEE PAMI, ~5000 citations): Geman & Geman
  – Content Addressable Network (~3000 citations): Ratnasamy, Francis, Handley, Karp

Association Rules (Agrawal, Imielinski and Swami, SIGMOD 93)
An association rule is an implication expression of the form X → Y, where X and Y are itemsets.
  – Example: {Milk, Diaper} → {Beer}
Rule Evaluation Metrics
  – Support (s): fraction of transactions that contain both X and Y
  – Confidence (c): measures how often items in Y appear in transactions that contain X
From “Introduction to Data Mining”, Tan, Steinbach and Kumar
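As a minimal sketch of the two metrics, the snippet below computes support and confidence for {Milk, Diaper} → {Beer}. The five transactions are the ones listed on the "Associative Classifiers (2)" slide; the helper names `support` and `confidence` are illustrative and not from the source.

```python
# Market-basket transactions (items only), as listed on a later slide.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """How often the consequent appears in transactions containing the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


s = support({"Milk", "Diaper", "Beer"}, transactions)        # 2 of 5 transactions
c = confidence({"Milk", "Diaper"}, {"Beer"}, transactions)   # 2 of the 3 with Milk and Diaper
```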

Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
  – Generate all itemsets whose support ≥ minsup
2. Rule Generation
  – Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
Frequent itemset generation is computationally expensive.
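The two steps can be sketched as follows. This is a deliberately naive, level-wise enumeration meant only to make the two phases concrete; it is not an optimized Apriori implementation. The function names, the thresholds, and the toy transactions are assumptions made for illustration.

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def frequent_itemsets(transactions, minsup):
    """Step 1: map every itemset with support >= minsup to its support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    for k in range(1, len(items) + 1):
        found_any = False
        for cand in combinations(items, k):
            s = sum(set(cand) <= t for t in transactions) / n
            if s >= minsup:
                frequent[frozenset(cand)] = s
                found_any = True
        if not found_any:
            break  # support is anti-monotone: no larger itemset can be frequent
    return frequent


def generate_rules(frequent, minconf):
    """Step 2: split each frequent itemset into antecedent -> consequent."""
    rules = []
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                conf = s / frequent[lhs]  # every subset of a frequent itemset is frequent
                if conf >= minconf:
                    rules.append((lhs, itemset - lhs, s, conf))
    return rules


freq = frequent_itemsets(transactions, minsup=0.5)
found = generate_rules(freq, minconf=0.8)
```

The brute-force loop in step 1 is where the cost lies: the number of candidate itemsets grows exponentially with the number of items.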


Associative Classifiers
Most associative classifiers are based on rules discovered using the support-confidence criterion.
The classifier itself is a collection of rules ranked by their support or confidence.

Associative Classifiers (2)
TID  Items                        Gender
1    Bread, Milk                  F
2    Bread, Diaper, Beer, Eggs    M
3    Milk, Diaper, Beer, Coke     M
4    Bread, Milk, Diaper, Beer    M
5    Bread, Milk, Diaper, Coke    F
In a classification task we want to predict the class label (Gender) using the attributes.
A good (albeit stereotypical) rule is {Beer, Diaper} → Male, whose support is 60% and confidence is 100%.
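A minimal sketch of a rule-list classifier over this table: rules are ranked by confidence, then support, and the first matching rule decides the class. The second rule, the default class, and all function names are hypothetical choices made for illustration; real associative classifiers such as CBA use more elaborate rule selection.

```python
data = [
    ({"Bread", "Milk"}, "F"),
    ({"Bread", "Diaper", "Beer", "Eggs"}, "M"),
    ({"Milk", "Diaper", "Beer", "Coke"}, "M"),
    ({"Bread", "Milk", "Diaper", "Beer"}, "M"),
    ({"Bread", "Milk", "Diaper", "Coke"}, "F"),
]


def rule_stats(antecedent, label, data):
    """Return (support, confidence) of the class rule antecedent -> label."""
    covered = [cls for items, cls in data if antecedent <= items]
    hits = covered.count(label)
    support = hits / len(data)
    confidence = hits / len(covered) if covered else 0.0
    return support, confidence


# Hypothetical rule set; only the first rule appears in the slide.
rules = [({"Milk"}, "F"), ({"Beer", "Diaper"}, "M")]

# Rank by (confidence, support), highest first.
ranked = sorted(rules, key=lambda r: rule_stats(r[0], r[1], data)[::-1], reverse=True)


def classify(items, ranked, default):
    """Predict with the first (highest-ranked) rule whose antecedent matches."""
    for antecedent, label in ranked:
        if antecedent <= items:
            return label
    return default
```

For example, `rule_stats({"Beer", "Diaper"}, "M", data)` reproduces the 60% support and 100% confidence quoted above.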


Imbalanced Data Sets
In some application domains, data sets are imbalanced:
  – The proportion of samples from one class is much smaller than that of the other class or classes.
  – The smaller class is the class of interest.
Support and confidence are biased toward the majority class and do not perform well in such cases.

Downsides of Support
Support is biased towards the majority class.
  – E.g. classes = {yes, no}, sup({yes}) = 90%
  – minSup > 10% wipes out any rule predicting “no”
  – Suppose X → no has confidence 1 and support 3%. The rule is discarded if minSup > 3%, even though it perfectly predicts 30% of the instances in the minority class!
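The arithmetic behind this example can be checked directly. The data set size of 1000 is an assumption introduced here so that the percentages become counts; only the percentages come from the slide.

```python
n_total = 1000    # assumed data set size (not given in the slide)
n_no = 100        # minority class "no": sup({no}) = 10%
n_x = 30          # transactions matching X
n_x_and_no = 30   # all of them are "no", so confidence is 1

support = n_x_and_no / n_total         # 3% of all transactions
confidence = n_x_and_no / n_x          # the rule is never wrong
minority_coverage = n_x_and_no / n_no  # share of the minority class it captures
support_ceiling_no = n_no / n_total    # no rule predicting "no" can exceed this support


def survives(min_sup):
    """Would the rule X -> no pass a given minimum-support threshold?"""
    return support >= min_sup
```

With these counts the rule covers 30% of the minority class yet fails any threshold above 3%, and no rule predicting "no" can pass a threshold above 10%.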

Downside of Confidence (1)
        C     ¬C    Total
A       20    5     25
¬A      70    5     75
Total   90    10    100
Conf(A → C) = 20/25 = 0.8
Support(A → C) = 20/100 = 0.2
Correlation between A and C: $\frac{P(A, C)}{P(A)\,P(C)} = \frac{0.2}{0.25 \times 0.9} \approx 0.89 < 1$
Thus, when the data set is imbalanced, a high-support, high-confidence rule does not necessarily imply that the antecedent and the consequent are positively correlated.
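A short check of these numbers, assuming the usual ratio P(A, C) / (P(A) P(C)) as the correlation measure. It also computes the confidence of ¬A → C, which is not on the slide but makes the point sharper: C is more likely without A than with it. Variable names are illustrative.

```python
# Cell counts of the 2 x 2 table for antecedent A and class C.
n_a_c, n_a_notc = 20, 5
n_nota_c, n_nota_notc = 70, 5
n = n_a_c + n_a_notc + n_nota_c + n_nota_notc  # 100

p_a = (n_a_c + n_a_notc) / n   # P(A)
p_c = (n_a_c + n_nota_c) / n   # P(C)
p_ac = n_a_c / n               # P(A, C), i.e. the support of A -> C

conf_a_c = p_ac / p_a          # confidence of A -> C
corr_a_c = p_ac / (p_a * p_c)  # below 1 means A and C are negatively correlated

# Confidence of the complementary rule (not A) -> C, for comparison.
conf_nota_c = n_nota_c / (n_nota_c + n_nota_notc)
```

Here `conf_a_c` is 0.8, which looks strong in isolation but is lower than both the prior P(C) = 0.9 and `conf_nota_c`.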

Downside of Confidence (2)
It is reasonable to expect that for “good rules” the antecedent and consequent are not independent!
Suppose
  – P(Class=Yes) = 0.9
  – P(Class=Yes|X) = 0.9
Then the rule X → Yes has confidence 0.9 even though X and the class are independent.

Downside of Confidence (3)
Another useful observation:
Higher confidence (support) for a rule in the minority class implies higher correlation, and lower correlation in the minority class implies lower confidence, but neither of these holds for the majority class.
Confidence (support) is biased toward the majority class.


Contingency Table
A 2 × 2 contingency table for X → y.
We will use the notation [a, b; c, d] to represent this table.
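A minimal sketch of how the four cell counts could be tallied from labelled transactions. It assumes a particular layout: rows index the class (y, ¬y) and columns index the antecedent (X, ¬X), so a counts X with y, b counts ¬X with y, c counts X with ¬y, and d counts ¬X with ¬y. That layout, the function name, and the reuse of the Gender table from the earlier slide are assumptions for illustration.

```python
data = [
    ({"Bread", "Milk"}, "F"),
    ({"Bread", "Diaper", "Beer", "Eggs"}, "M"),
    ({"Milk", "Diaper", "Beer", "Coke"}, "M"),
    ({"Bread", "Milk", "Diaper", "Beer"}, "M"),
    ({"Bread", "Milk", "Diaper", "Coke"}, "F"),
]


def contingency(data, antecedent, label):
    """Return (a, b, c, d) for the rule antecedent -> label.

    Assumed layout:  a = X and y,      b = not X and y,
                     c = X and not y,  d = not X and not y.
    """
    a = b = c = d = 0
    for items, cls in data:
        has_x = antecedent <= items
        if cls == label:
            if has_x:
                a += 1
            else:
                b += 1
        else:
            if has_x:
                c += 1
            else:
                d += 1
    return a, b, c, d


table_beer_diaper = contingency(data, {"Beer", "Diaper"}, "M")
table_milk = contingency(data, {"Milk"}, "F")
```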

Fisher Exact Test
Given a table [a, b; c, d], the Fisher Exact Test finds the probability (p-value) of obtaining a table at least as extreme as the given one under the hypothesis that {X, ¬X} and {y, ¬y} are independent.
The margin sums (∑rows, ∑cols) are fixed.

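Fisher Exact Test (2)
The p-value is given by:
$$p([a, b; c, d]) = \sum_{i=0}^{\min(b, c)} \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{n!\,(a+i)!\,(b-i)!\,(c-i)!\,(d+i)!}$$
where n = a + b + c + d.
We only use rules whose p-values are below the desired significance level (e.g. 0.01).
Rules that pass this test are statistically significant in the positively associated direction (e.g. X → y).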
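A sketch of the one-sided test using only the standard library. It sums the hypergeometric probabilities of the observed table and of every table with the same margins and a larger count a, which is the positively associated direction. The function name is an assumption; each term is written with binomial coefficients, which is algebraically the same as the factorial form.

```python
from math import comb


def fisher_p_value(a, b, c, d):
    """One-sided Fisher exact p-value for the table [a, b; c, d].

    With all margins fixed, shifting i units from b and c into a and d
    gives the tables that are at least as positively associated.
    """
    n = a + b + c + d
    total = comb(n, a + c)  # number of ways to fill the first column
    tail = sum(
        comb(a + b, a + i) * comb(c + d, c - i)
        for i in range(min(b, c) + 1)
    )
    return tail / total


p_small = fisher_p_value(3, 0, 0, 2)    # five transactions in total
p_neg = fisher_p_value(20, 70, 5, 5)    # a table with a below its expected value
```

If SciPy is available, `scipy.stats.fisher_exact([[a, b], [c, d]], alternative="greater")` should give the same one-sided value and can serve as a cross-check.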


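Class Correlation Ratio
In class correlation, we are interested in rules X → y where X is more positively correlated with y than it is with ¬y.
The correlation is defined by:
$$\mathrm{corr}(X \to y) = \frac{\mathrm{sup}(X \cup y)\,|T|}{\mathrm{sup}(X)\,\mathrm{sup}(y)} = \frac{a\,n}{(a+c)(a+b)}$$
where sup(·) is a support count and |T| is the number of transactions n.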

Class Correlation Ratio (2)
We then use corr() to measure how correlated X is with y compared to ¬y.
X and y are positively correlated if corr(X → y) > 1, and negatively correlated if corr(X → y) < 1.

Class Correlation Ratio (3)
Based on the correlation corr(), we define the Class Correlation Ratio (CCR):
$$\mathrm{CCR}(X \to y) = \frac{\mathrm{corr}(X \to y)}{\mathrm{corr}(X \to \neg y)}$$
The CCR measures how much more positively the antecedent is correlated with the class it predicts (e.g. y), relative to the alternative class (e.g. ¬y).
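A small sketch of corr() and the CCR on a table [a, b; c, d]. It assumes a is the count of X with y, c the count of X with ¬y, and that a + b and c + d are the class totals; under that assumption the table for X → ¬y is the same table with its two rows swapped. Function names are illustrative, and the sketch assumes X occurs at least once and both classes are non-empty.

```python
import math


def corr(a, b, c, d):
    """corr(X -> y) = a * n / ((a + c) * (a + b)); values above 1 mean positive correlation."""
    n = a + b + c + d
    return a * n / ((a + c) * (a + b))


def ccr(a, b, c, d):
    """Class Correlation Ratio: corr(X -> y) divided by corr(X -> not y)."""
    corr_other = corr(c, d, a, b)  # rows swapped: table for X -> not y
    if corr_other == 0:
        return math.inf            # X never occurs with the other class
    return corr(a, b, c, d) / corr_other


# 100 transactions, 90 in class y; X matches 25 of them, 20 of which are in class y.
corr_major = corr(20, 70, 5, 5)
ccr_major = ccr(20, 70, 5, 5)
```

For this table the confidence of X → y is 0.8, yet `ccr_major` is below 1, so X is in fact more strongly tied to the other class.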

Class Correlation Ratio (4)
We only use rules with a CCR higher than a desired threshold, so that no rules are used that are more positively associated with the classes they do not predict.

The two measurements
We perform the following tests to determine whether a potentially interesting rule is indeed interesting:
  – Check the significance of a rule X → y by performing Fisher’s Exact Test.
  – Check whether CCR(X → y) > 1.
Rules that pass both tests are candidates for the classification task.
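The two checks can be combined into a single predicate, sketched below under stated assumptions: the table is [a, b; c, d] with a = X with y and c = X with ¬y, the p-value is the one-sided Fisher value, and the CCR is written in the closed form a(c + d) / (c(a + b)) that follows from taking the ratio of the two correlations. The function name and the default level of 0.01 are illustrative.

```python
from math import comb, inf


def passes_both_tests(a, b, c, d, alpha=0.01):
    """True if X -> y is significant (Fisher, one-sided) and has CCR > 1."""
    n = a + b + c + d
    # One-sided Fisher exact p-value with all margins fixed.
    tail = sum(
        comb(a + b, a + i) * comb(c + d, c - i)
        for i in range(min(b, c) + 1)
    )
    p_value = tail / comb(n, a + c)
    # CCR = corr(X -> y) / corr(X -> not y); assumes class y is non-empty.
    ccr = inf if c == 0 else (a * (c + d)) / (c * (a + b))
    return p_value < alpha and ccr > 1


# Minority class y with 10 of 100 transactions; X matches 8, of which 7 are in y.
minority_rule = passes_both_tests(7, 3, 1, 89)
# Majority class y with 90 of 100 transactions; X matches 25, of which 20 are in y.
majority_rule = passes_both_tests(20, 70, 5, 5)
```

The minority-class rule has only 7% support but passes both checks, while the majority-class rule with 20% support and 80% confidence fails them.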


Search and Pruning Strategies
To avoid examining the whole set of possible rules, we use search strategies that ensure the concept of being potentially interesting is anti-monotonic:
X → y may be considered potentially interesting if and only if all rules {X’ → y | X’ ⊂ X} have been found to be potentially interesting.
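One way to exploit this property is level-wise candidate generation: an antecedent of size k is examined only if every one of its (k − 1)-item subsets was already found potentially interesting. The sketch below shows only that pruning step; the function name and the example itemsets are hypothetical, and the interestingness test itself is left out.

```python
from itertools import combinations


def next_level_candidates(prev_level):
    """Given the potentially interesting antecedents of size k - 1,
    return the size-k antecedents that are still worth testing."""
    prev = {frozenset(x) for x in prev_level}
    if not prev:
        return set()
    k = len(next(iter(prev))) + 1
    items = sorted(set().union(*prev))
    return {
        frozenset(cand)
        for cand in combinations(items, k)
        # Checking the (k - 1)-subsets is enough: each of them was admitted
        # only because its own subsets had passed earlier.
        if all(frozenset(sub) in prev for sub in combinations(cand, k - 1))
    }


level2 = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"b", "d"}]
level3 = next_level_candidates(level2)
```

Here {a, b, d} and {b, c, d} are never tested because {a, d} and {c, d} did not survive the previous level.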

Search and Pruning Strategies (2)
For the Aggressive search strategy, the contingency table [a, b; c, d] is used to test the significance of the rule X → y in comparison to one of its generalizations X − {z} → y.

Example
Suppose we have already determined that the rules (A = a1) → 1 and (A = a2) → 1 are significant.
Now we want to test whether X → 1, with X = (A = a1) ∧ (A = a2), is significant.
We carry out a Fisher Exact Test (FET) and calculate the CCR on X and X − {A = a1} (i.e. z = {a2}), and on X and X − {A = a2} (i.e. z = {a1}).
If the minimum of their p-values is less than the significance level and their CCR is greater than 1, we keep the rule X → 1; otherwise we discard it.

Ranking Rules
Strength Score (SS):
  – To determine how interesting a rule is, we need a ranking (ordering) of the rules; this ordering is defined by the Strength Score.


Experiments (Balanced Data)
The preceding approach is referred to as “SPARCCC”.
Experiments on balanced data sets show that the average accuracy of SPARCCC compares favourably to CBA and C4.5.
  – The table below gives the prediction accuracy on balanced data sets.

Experiments (Imbalanced Data)
True Positive Rate (Recall/Sensitivity) is a better performance measure for imbalanced data sets.
SPARCCC outperforms other rule-based techniques such as CBA and CCCS.
  – The table below gives the True Positive Rate of the minority class on imbalanced versions of the data sets.
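A small illustration of why accuracy can mislead on imbalanced data while the true positive rate of the minority class does not. The labels and the trivial "always predict the majority class" classifier are made up for this sketch and are not taken from the experiments.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def true_positive_rate(y_true, y_pred, positive):
    """TP / (TP + FN) for the class treated as positive (recall, sensitivity)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn)


# Hypothetical imbalanced data: 10 minority ("no") and 90 majority ("yes") instances.
y_true = ["no"] * 10 + ["yes"] * 90
always_yes = ["yes"] * 100  # a classifier that ignores the minority class

acc = accuracy(y_true, always_yes)
tpr_minority = true_positive_rate(y_true, always_yes, positive="no")
```

The trivial classifier reaches 90% accuracy while finding none of the minority instances, which is why the minority-class true positive rate is the figure to compare.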

References
Florian Verhein and Sanjay Chawla. Using Significant, Positively Associated and Relatively Class Correlated Rules for Associative Classification of Imbalanced Datasets. The 2007 IEEE International Conference on Data Mining, Omaha, NE, USA, October 28-31, 2007.

