Discriminative Pattern Mining By Mohammad Hossain
Based on the paper "Mining Low-Support Discriminative Patterns from Dense and High-Dimensional Data" by Gang Fang, Gaurav Pandey, Wen Wang, Manish Gupta, Michael Steinbach, and Vipin Kumar
What is a Discriminative Pattern?
A pattern is said to be discriminative when its occurrence in two data sets (or in two different classes of a single data set) is significantly different. One way to measure the discriminative power of a pattern is the difference between its supports in the two data sets. When this support difference (DiffSup) is at least a given threshold, the pattern is called discriminative.
An example

Data set D+
Transaction-id   Items
10               A, C
20               B, C
30               A, B, C
40               A, B, C, D

Data set D-
Transaction-id   Items
10               A, B
20               A, C
30               A, B, E
40               A, C, D

Pattern   Support in D+   Support in D-   DiffSup
A         3               4               1
B         3               2               1
C         4               2               2
AB        2               2               0
AC        3               2               1
ABC       2               0               2

If we set the DiffSup threshold to 2, the patterns C and ABC in this table become interesting.
Importance
Discriminative patterns have been shown to be useful for improving classification performance on data sets where combinations of features have better discriminative power than the individual features. For example, in biomarker discovery from case-control data (e.g. disease vs. normal samples), it is important to identify groups of biological entities, such as genes and single-nucleotide polymorphisms (SNPs), that are collectively associated with a certain disease or other phenotype.
DiffSup is NOT Anti-monotonic
P1 = {i1, i2, i3}, P2 = {i5, i6, i7}, P3 = {i9, i10}, P4 = {i12, i13, i14}

P     C1   C2   DiffSup
P1    6
P2    6    6    0
P3    3
P4    9    2    7

[Figure: data matrix over items i1 to i14 with the patterns P1, P2, P3 and P4 marked]

As a result, DiffSup cannot be used directly for pruning in an Apriori-like framework.
Apriori: A Candidate Generation-and-Test Approach
Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested.
Method:
1. Scan the DB once to get the frequent 1-itemsets.
2. Generate length (k+1) candidate itemsets from the length k frequent itemsets.
3. Test the candidates against the DB.
4. Terminate when no frequent or candidate set can be generated.
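The method above can be sketched in a few lines of Python. This is a minimal illustration rather than an optimized implementation; the function name `apriori`, its parameters, and the representation of transactions as Python sets are assumptions made for the example. The sample database is the TDB used in the Apriori example of these slides.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support count} for all itemsets with count >= min_sup."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One pass over the database per level: count each candidate.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # First scan: frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    singles = count({frozenset([i]) for i in items})
    level = {c: s for c, s in singles.items() if s >= min_sup}
    frequent = dict(level)
    k = 1
    while level:
        # Join: unions of two frequent k-itemsets that have exactly k + 1 items.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        # Test the surviving candidates against the database.
        level = {c: s for c, s in count(candidates).items() if s >= min_sup}
        frequent.update(level)
        k += 1
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(tdb, min_sup=2)
```

The prune step is where the anti-monotonic property of support is used: a candidate with an infrequent subset is discarded without ever being counted.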
The Apriori Algorithm: An Example (Supmin = 2)

Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan
C1: {A} 2, {B} 3, {C} 3, {D} 1, {E} 3
L1: {A} 2, {B} 3, {C} 3, {E} 3

2nd scan
C2: {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
C2 with counts: {A, B} 1, {A, C} 2, {A, E} 1, {B, C} 2, {B, E} 3, {C, E} 2
L2: {A, C} 2, {B, C} 2, {B, E} 3, {C, E} 2

3rd scan
C3: {B, C, E}
L3: {B, C, E} 2
Pattern   Support in D+   Support in D-   DiffSup
A         3               4               1
B         3               2               1
C         4               2               2
AB        2               2               0
AC        3               2               1
ABC       2               0               2

Here we see that although the patterns AB and AC both have DiffSup below the threshold (2), their superset ABC has DiffSup = 2, which equals the threshold and therefore makes ABC interesting. So AB and AC cannot be pruned.
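The observation above can be checked directly. The following is a small sketch, with hypothetical helper names `support` and `diff_sup`, that recomputes the DiffSup column from the two transaction tables of the example (absolute support counts, as in the table). Patterns are written as strings of single-letter items purely for brevity.

```python
def support(pattern, transactions):
    """Number of transactions that contain every item of the pattern."""
    return sum(1 for t in transactions if set(pattern) <= t)

def diff_sup(pattern, d_pos, d_neg):
    """Absolute difference of the support counts in the two data sets."""
    return abs(support(pattern, d_pos) - support(pattern, d_neg))

# Transactions of D+ and D- from the example.
d_pos = [{"A", "C"}, {"B", "C"}, {"A", "B", "C"}, {"A", "B", "C", "D"}]
d_neg = [{"A", "B"}, {"A", "C"}, {"A", "B", "E"}, {"A", "C", "D"}]

table = {p: diff_sup(p, d_pos, d_neg)
         for p in ["A", "B", "C", "AB", "AC", "ABC"]}

# Anti-monotonicity would require DiffSup(ABC) <= DiffSup(AB) and <= DiffSup(AC).
violates = table["ABC"] > table["AB"] and table["ABC"] > table["AC"]
```

Here `violates` evaluates to `True`, which is exactly why a threshold on DiffSup alone cannot drive Apriori-style pruning.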
BASIC TERMINOLOGY AND PROBLEM DEFINITION
Let D be a dataset with a set of m items, I = {i1, i2, ..., im}, and two class labels S1 and S2. The instances of classes S1 and S2 are denoted by D1 and D2, so |D| = |D1| + |D2|. For a pattern (itemset) α = {α1, α2, ..., αl}, the sets of instances in D1 and D2 that contain α are denoted by D1^α and D2^α. The relative supports of α in classes S1 and S2 are
RelSup1(α) = |D1^α| / |D1| and RelSup2(α) = |D2^α| / |D2|.
The absolute difference of the relative supports of α in D1 and D2 is denoted as
DiffSup(α) = |RelSup1(α) − RelSup2(α)|.
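As a rough illustration of these definitions, the sketch below splits a labelled dataset into D1 and D2 and evaluates RelSup1, RelSup2 and DiffSup for one pattern. The dataset, the item names and the function names `rel_sup` and `diff_sup` are all made up for the example.

```python
# Hypothetical labelled dataset: (itemset, class label) pairs.
D = [
    ({"a", "b"}, 1), ({"a", "b", "c"}, 1), ({"a"}, 1), ({"b", "c"}, 1),
    ({"a", "c"}, 2), ({"c"}, 2), ({"a", "b"}, 2), ({"b"}, 2),
]
D1 = [items for items, label in D if label == 1]
D2 = [items for items, label in D if label == 2]

def rel_sup(pattern, instances):
    """Fraction of the instances that contain every item of the pattern."""
    pattern = set(pattern)
    return sum(1 for t in instances if pattern <= t) / len(instances)

def diff_sup(pattern, d1, d2):
    """Absolute difference of the relative supports in the two classes."""
    return abs(rel_sup(pattern, d1) - rel_sup(pattern, d2))

alpha = {"a", "b"}
r1 = rel_sup(alpha, D1)      # 2 of the 4 instances in D1 contain alpha
r2 = rel_sup(alpha, D2)      # 1 of the 4 instances in D2 contains alpha
d = diff_sup(alpha, D1, D2)
```

Using relative rather than absolute supports keeps DiffSup comparable when the two classes have different sizes.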
New function
Some new functions are proposed that have the anti-monotonic property and can therefore be used for pruning in an Apriori-like framework. One of them is BiggerSup, defined as
BiggerSup(α) = max(RelSup1(α), RelSup2(α)).
BiggerSup is anti-monotonic and is an upper bound of DiffSup. So if BiggerSup(α) is below the threshold, neither α nor any of its supersets can be discriminative, and BiggerSup can be used for pruning in an Apriori-like framework.
BiggerSup is a weak upper bound of DiffSup. For instance, in the previous example, if we want to find discriminative patterns with threshold 4, P3 can be pruned because it has a BiggerSup of 3. P2 cannot be pruned (BiggerSup(P2) = 6), even though it is not discriminative (DiffSup(P2) = 0). More generally, BiggerSup-based pruning can only prune infrequent non-discriminative patterns with relatively low support, but not frequent non-discriminative patterns.
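Both properties of BiggerSup, and the weakness of the bound, can be seen on a small data set. The sketch below uses the D+ and D- transactions of the earlier A/B/C example as D1 and D2 and enumerates every itemset by brute force, which is only feasible because the data is tiny. The helper names and the threshold of 0.5 are assumptions for the illustration.

```python
from itertools import combinations

d1 = [{"A", "C"}, {"B", "C"}, {"A", "B", "C"}, {"A", "B", "C", "D"}]
d2 = [{"A", "B"}, {"A", "C"}, {"A", "B", "E"}, {"A", "C", "D"}]
items = sorted(set().union(*d1, *d2))

def rel_sup(pattern, instances):
    return sum(1 for t in instances if pattern <= t) / len(instances)

def diff_sup(pattern):
    return abs(rel_sup(pattern, d1) - rel_sup(pattern, d2))

def bigger_sup(pattern):
    return max(rel_sup(pattern, d1), rel_sup(pattern, d2))

# Every non-empty itemset over the items that occur in the data.
patterns = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k)]

# BiggerSup bounds DiffSup from above for every pattern.
bound_holds = all(diff_sup(p) <= bigger_sup(p) for p in patterns)
# Adding an item never increases BiggerSup (anti-monotonicity).
anti_monotone = all(bigger_sup(p | {i}) <= bigger_sup(p)
                    for p in patterns for i in items)

threshold = 0.5
kept = [p for p in patterns if bigger_sup(p) >= threshold]        # survive pruning
discriminative = [p for p in kept if diff_sup(p) >= threshold]    # truly discriminative
```

On this data 7 patterns survive BiggerSup pruning but only 3 of them are discriminative. The frequent pattern {A}, for example, has BiggerSup 1.0 and DiffSup 0.25, so it is kept even though it is not discriminative.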
A new measure: SupMaxK
The SupMaxK of an itemset α in D1 and D2 is defined as
SupMaxK(α) = RelSup1(α) − max{RelSup2(β) : β ⊆ α, |β| = K}.
If K = 1, it is called SupMax1 and is defined as
SupMax1(α) = RelSup1(α) − max{RelSup2({a}) : a ∈ α}.
Similarly, with K = 2 we obtain SupMax2, which is also called SupMaxPair.
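The definition translates almost literally into code. The sketch below is a direct, unoptimized reading of the formula; the function name `sup_max_k` and the choice to raise an error when the pattern has fewer than K items are assumptions. The data is again the D+ and D- example, used as D1 and D2.

```python
from itertools import combinations

d1 = [{"A", "C"}, {"B", "C"}, {"A", "B", "C"}, {"A", "B", "C", "D"}]
d2 = [{"A", "B"}, {"A", "C"}, {"A", "B", "E"}, {"A", "C", "D"}]

def rel_sup(pattern, instances):
    pattern = set(pattern)
    return sum(1 for t in instances if pattern <= t) / len(instances)

def sup_max_k(pattern, k):
    """RelSup1 of the pattern minus the largest RelSup2 among its size-k subsets."""
    pattern = frozenset(pattern)
    if len(pattern) < k:
        raise ValueError("pattern must contain at least k items")
    best_in_d2 = max(rel_sup(beta, d2) for beta in combinations(pattern, k))
    return rel_sup(pattern, d1) - best_in_d2

alpha = {"A", "B", "C"}
values = [sup_max_k(alpha, k) for k in (1, 2, 3)]   # SupMax1, SupMaxPair, SupMax3
```

For this pattern the three values are -0.5, 0.0 and 0.5: the estimate grows with K and, at K = |α|, equals RelSup1(α) − RelSup2(α).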
Properties of the SupMaxK Family
Relationship between DiffSup, BiggerSup and the SupMaxK Family
SupMaxPair: A Special Member Suitable for High-Dimensional Data
In SupMaxK, as K increases we get a more complete set of discriminative patterns, but the cost of computing SupMaxK also increases. In fact, the complexity of computing SupMaxK is O(m^K). So for high-dimensional data (where m is large), a high value of K (K > 2) makes the computation infeasible. In that case SupMaxPair can be used.
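To get a feel for this growth, one can count how many size-K itemsets exist over m items, since in the worst case the class-2 support of each of them may be needed. The value m = 1000 below is an arbitrary illustrative choice, not a figure from the paper.

```python
from math import comb

m = 1000  # hypothetical number of items
# Number of distinct size-K itemsets over m items, for K = 1, 2, 3.
counts = {k: comb(m, k) for k in (1, 2, 3)}
```

This gives 1,000 single items, 499,500 pairs and 166,167,000 triples, so moving from K = 2 to K = 3 multiplies the number of subsets by more than 300.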