Download presentation
Presentation is loading. Please wait.
1
Bulut, Singh # Selecting the Right Interestingness Measure for Association Patterns Pang-Ning Tan, Vipin Kumar, and Jaideep Srivastava Department of Computer Science and Engineering University of Minnesota Presented by Ahmet Bulut
2
2 Bulut, Singh # Motivation Major Data mining problem: analysis of relationships among variables Finding sets of binary variables that co- occur together –Association Rule Mining [Agrawal et al.] How to find a suitable metric to capture the dependencies between variables (defined in terms of contingency tables) Many metrics provide conflicting information Goal: Automated way of choosing the best metric for an application domain
3
3 Bulut, Singh #
4
4 Justification for Conflicts E10 is ranked highest by I measure but lowest according to coefficient. Recognize intrinsic properties of the existing measures
5
5 Bulut, Singh # Analysis of a Measure The relationship between the gender of a student and the grade obtained in a course Number of male students X 2, Number of female students X 10 One expects scale-invariance in this particular application Most measures are sensitive to scaling of rows and columns such as: gini index, interest, mutual information etc.
6
6 Bulut, Singh # Solutions to zero-in Support based pruning –Eliminate uncorrelated and poorly correlated patterns Table Standardization –To modify contingency tables to have uniform margins –Many measures provide non-conflicting information Expectation of domain experts –Choose the measure that agrees with the expectations the most. –Number of contingency tables, |T|, is high –It is possible to extract a small set, S, of contingency tables –Find the best measure for S to approximate for T
7
7 Bulut, Singh # Preliminaries The similarity between any two measures M1 and M2: the similarity between O M1 (T) and O M2 (T) The similarity metric used is correlation coefficient corr(O M1 (T), O M2 (T) ) > threshold then similar
8
8 Bulut, Singh # Desired Properties of a Measure M
9
9 Bulut, Singh # Properties of a Measure M cont’d. Denote 2X2 contingency table as a contingency matrix Interestingness measure is a matrix operator, O such that –OM = k where k is a scalar. –For instance for Coefficient as the interestingness measure k equals to normalized form of the determinant operator Det(M) = f 11 f 00 – f 01 f 10 Statistical Independence a singular matrix M whose determinant equal to 0.
10
10 Bulut, Singh # Properties of a Measure M cont’d. Property 1: Symmetry under variable permutation: O(M T ) = O(M) –cosine (IS), interest factor(I), odds ratio ( ) Property 2: Row/Column Scaling Invariance: R=C=[k1 0; 0 k2] –R x M is row scaling and M x R is column scaling –If O(RM) = O(M) and O(MR) = O(M), then M is row/scale invariant – odds ratio ( ) satisfies this property along with Yule’s Q and Y Property 3: Antisymmetry under row/column permutation –S = [0 1;1 0] –If O(SM) = -O(M), antisymmetric under row permutation –If O(MS) = -O(M), antisymmetric under column permutation –Measures that are symmetric under the row and column permutation operations: no distinction between positive and negative correlations of a table For example: gini index Property 4: Inversion Invariance –S=[0 1;1 0] –row and column permutation together –If O(SMS)=O(M), inversion invariant –Insight: flip presence with absence and vice versa for binary variables. – coefficient, odds ratio, collective strength are symmetric binary measures –Jaccard measure is asymmetric
11
11 Bulut, Singh # Property 4 and Property 5 Market Basket analysis requires unequal treatment of binary values of a variable –A symmetric measure like the one above is not suitable Property 5: Null Invariance: If O(M+C) = O(M) where C=[0 0;0 k] and k is a positive constant –For binary variables; more records added that do not contain the two variables under consideration: Co-occurrence emphasized
12
12 Bulut, Singh # Effect of Support Based Pruning Randomly generated synthetic dataset of 10,000 contingency tables Darker cells, correlation > 0.85, and lighter cells indicate otherwise Tighter bounds on the support of the patterns: many measures become correlated
13
13 Bulut, Singh # Elimination of poorly correlated tables using Support-based Pruning Minimum support threshold to prune out the low support patterns Having a maximum support threshold: equal elimination of uncorrelated, negatively correlated and positively correlated tables Having a lower bound of support will prune out the negatively correlated or uncorrelated tables.
14
14 Bulut, Singh # Table standardization A standardized table: visual depiction of the disjoint distribution of two variables after elimination of non-uniform marginals
15
15 Bulut, Singh # Implications of standardization The rankings from different measures become identical
16
16 Bulut, Singh # Implications of standardization cont’d After standardization, a matrix has [x y ; y x] where y = N/2-x and x = f * 11 If you consider monotonically increasing functions of x (nearly all of the measures are) –Identical rankings on standardized, positively correlated tables –Some measures do not satisfy this property Consider the values of x where N/4 < x < N/2 IPF favors “odds ratio” measure, therefore final rankings agree with odds ratio rankings before standardization Leave with: Different standardization techniques may be more appropriate for different application domains
17
17 Bulut, Singh # Measure Selection Based on Rankings by Experts Ideally, experts rank all the contingency tables, choose the best measure accordingly Laborious task if the number of tables is too large Provide a smaller set of tables to decide the best measure
18
18 Bulut, Singh # Table Selection via Disjoint Algorithm Use Disjoint algorithm to choose a subset of tables of cardinality k. Rank tables according to various measures Compute the similarity between different measures A good table selection scheme minimizes
19
19 Bulut, Singh # Experimental Results
20
20 Bulut, Singh # Conclusions Key properties to consider for selecting the right measure No measure is consistently better than others Situations where most measures provide correlated info Choosing the right measure on a non-biased small set of all the tables give good estimates to the ideal solution As a future work –Extension to k-way contingency tables –Association between mixed data types
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.