Efficient Mining of Emerging Patterns and Emerging Substrings


1 Efficient Mining of Emerging Patterns and Emerging Substrings
Speaker: Sarah Chan CSIS DB Seminar Feb 1, 2002

2 Presentation Outline
Introduction – Why EPs?
Definitions and mining of EPs
EP-based classifier – CAEP
Mining of EP variants
Definitions of ESs and JESs
Mining of ESs and JESs
Conclusions

3 1. Introduction – Why EPs?
What are emerging patterns (EPs)?
EPs are itemsets whose supports increase significantly from one dataset to another
E.g. Itemset X is an EP from D1 to D2 if:
GrowthRate = suppD2(X) / suppD1(X) ≥ some threshold
Introduced by Dong and Li, 1999
Why are EPs useful?
EPs capture multi-attribute contrasts between data classes, or trends over time
EPs provide knowledge for building classifiers

4 1. Introduction – Why EPs?
Example 1 – Mushroom Data (UCI repository)
Typical EPs:
X = {(ODOR=none), (GILL_SIZE=broad), (RING_NUMBER=one)}
Y = {(BRUISES=no), (GILL_SPACING=close), (VEIL_COLOR=white)}

EP  supp in poisonous  supp in edible  growth rate
X   0%                 63.9%           ∞
Y   81.4%              3.8%            21.4

Some EPs contain more than 8 items
EPs capture differentiating characteristics between edible and poisonous mushrooms
Accurate classifiers can be built

5 1. Introduction – Why EPs?
Example 2 – Sales figures
X = {COMPUTER, MODEMS, EDU-SOFTWARES}
In 1985: 10,000 purchases out of 2M transactions → supp1(X) = 0.5%
In 1986: 21,000 purchases out of 2.1M transactions → supp2(X) = 1%
Growth rate = 1% / 0.5% = 2
EPs with low to medium support (e.g. 1%–20%) can give very useful new insights and guidance to experts

6 2. Definitions and mining of EPs
Database model
D = dataset (a set of transactions)
I = the set of all (binary) items
itemset = a set of items = a subset of I
T = a transaction in D; it is an itemset
Some definitions
A transaction T contains an itemset X if X ⊆ T
count of itemset X in dataset D:
countD(X) = no. of transactions in D that contain X

7 2. Definitions and mining of EPs
Some definitions
support of itemset X in dataset D:
suppD(X) = countD(X) / |D|
Itemset X is σ-large in dataset D if suppD(X) ≥ σ; otherwise X is σ-small in D
Growth rate of itemset X from D1 to D2:
GrowthRateD1→D2(X) =
  0, if suppD1(X) = 0 and suppD2(X) = 0
  ∞, if suppD1(X) = 0 and suppD2(X) ≠ 0
  suppD2(X) / suppD1(X), otherwise

8 2. Definitions and mining of EPs
More definitions
Given growth-rate threshold ρ > 1 and itemset X, if GrowthRateD1→D2(X) ≥ ρ:
X is a ρ-EP (or simply EP) from D1 to D2
X is an EP of D2
Given n ≥ 2 datasets (e.g. D1, D2, …, Dn):
EPs of Dk = EPs from D′ to Dk, where D′ = ∪i≠k Di
The EP mining problem: for a given ρ, find all ρ-EPs

9 2. Definitions and mining of EPs
Support plane
[Figure: the unit square with suppD2(X) on the horizontal axis and suppD1(X) on the vertical axis. An itemset X satisfying suppD1(X) = σ1 and suppD2(X) = σ2 is plotted at the point (σ2, σ1).]

10 2. Definitions and mining of EPs
EPs from D1 to D2 fall onto ΔACE
GrowthRateD1→D2(X) = suppD2(X) / suppD1(X) ≥ ρ
[Figure: the line l1: suppD1(X) = suppD2(X) / ρ runs from the origin A to E = (1, 1/ρ); with C = (1, 0), all EPs lie in the triangle ACE under l1.]

11 2. Definitions and mining of EPs
EPs from D1 to D2 fall onto various regions
θmin and δmin are related by: ρ = θmin / δmin
[Figure: the support plane of the previous slide, with two threshold lines added: l2: suppD1(X) = δmin (horizontal) and l3: suppD2(X) = θmin (vertical).]

12 2. Definitions and mining of EPs
EPs from D1 to D2 fall onto 3 regions of ΔACE:
ΔABG: contains a lot of EPs
low supports in both datasets → less useful, can be neglected
[Figure: A is the origin, B = (θmin, 0) and G = (θmin, δmin), so ΔABG is the corner of ΔACE to the left of l3: suppD2(X) = θmin.]

13 2. Definitions and mining of EPs
EPs from D1 to D2 fall onto 3 regions of ΔACE:
ΔGDE: high supports in both datasets
usually not many EPs → itemset enumeration possible
if many EPs → solve recursively
[Figure: with D = (1, δmin) and E = (1, 1/ρ), ΔGDE is the part of ΔACE above l2: suppD1(X) = δmin.]

14 2. Definitions and mining of EPs
EPs from D1 to D2 fall onto 3 regions of ΔACE:
BCDG rectangle: high support in D2 but low support in D1
dedicated mining algorithms developed
[Figure: the rectangle with corners B = (θmin, 0), C = (1, 0), D = (1, δmin) and G = (θmin, δmin), lying to the right of l3 and below l2.]

15 2. Definitions and mining of EPs
Challenges in EP mining?
Apriori property does not hold
{a,b,c} is frequent → {a}, {b}, {c}, {a,b}, {b,c} and {c,a} are all frequent
{a,b,c} is an EP does NOT imply that every subset of {a,b,c} is an EP
E.g. Mushroom dataset:
X = {(ODOR=none), (GILL_SIZE=broad), (RING_NUMBER=one)} is an EP, but {(ODOR=none)}, {(GILL_SIZE=broad)} and {(RING_NUMBER=one)} are not
Often too many candidates
E.g. PUMS (U.S. census) dataset with 350 items
Naïve algorithm: process 2^350 ≈ 10^105 itemsets → impossible task
Clever naïve algorithm: process 2^40 ≈ 10^12 itemsets → still takes too long

16 2. Definitions and mining of EPs
To mine EPs efficiently, we need:
a more concise way to describe collections of large itemsets
a more efficient way to discover EPs
Border representation can help!

17 2. Definitions and mining of EPs
Border representation
Based on the interval-closedness property of collections of large itemsets
Borders of large itemsets are efficiently discovered by Bayardo's Max-Miner
EPs mined efficiently
EPs concisely represented by borders

18 2. Definitions and mining of EPs
Interval-closedness
A collection S of sets is interval-closed if: ∀X, Z ∈ S and ∀Y, X ⊆ Y ⊆ Z ⇒ Y ∈ S
E.g. S = { {1,2}, {2,3}, {1,2,3}, {1,2,4}, {2,3,4}, {1,2,3,4} }
For any fixed support threshold σ, the collection of all σ-large itemsets is interval-closed
Proof: by the Apriori property, Z is σ-large and Y ⊆ Z ⇒ Y is σ-large
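The interval-closedness condition can be checked mechanically; a brute-force sketch (my own illustrative helper, exponential in the gap between X and Z, so for small examples only):

```python
from itertools import combinations

def is_interval_closed(S):
    """For all X, Z in S with X subset of Z, every Y between them is in S."""
    S = set(S)
    for X in S:
        for Z in S:
            if X <= Z:
                extra = Z - X
                # enumerate every Y with X <= Y <= Z
                for r in range(len(extra) + 1):
                    for mid in combinations(extra, r):
                        if X | frozenset(mid) not in S:
                            return False
    return True
```

On the slide's example collection the check succeeds; dropping {1,2,3} breaks it, since {1,2} ⊆ {1,2,3} ⊆ {1,2,3,4}.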

19 2. Definitions and mining of EPs
Border: <L, R>
L: left-hand bound, R: right-hand bound
Set interval of border <L, R>: [L, R]
Collection of sets represented by <L, R>:
[L, R] = {Y | ∃X ∈ L, ∃Z ∈ R such that X ⊆ Y ⊆ Z}
E.g. The set interval of <{ {1}, {2,3} }, { {1,2,3}, {2,3,4} }> is:
{ {1}, {1,2}, {1,3}, {1,2,3}, {2,3}, {2,3,4} }
Each interval-closed collection S of sets has a unique border <L, R>; L and R are the collections of minimal and maximal sets in S
For the collection of large itemsets in a dataset D, L = {}
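Expanding a border into its set interval can be sketched as follows (an illustrative helper of mine, intended for small borders since the interval can be exponentially large):

```python
from itertools import combinations

def set_interval(L, R):
    """[L, R] = all Y with X subset of Y subset of Z for some X in L, Z in R."""
    out = set()
    for X in L:
        for Z in R:
            if X <= Z:
                extra = Z - X
                # add every Y between X and Z
                for r in range(len(extra) + 1):
                    for mid in combinations(extra, r):
                        out.add(X | frozenset(mid))
    return out
```

Running it on the slide's border <{{1},{2,3}}, {{1,2,3},{2,3,4}}> reproduces the six listed sets.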

20 2. Definitions and mining of EPs
EP mining in region BCDG
Find border representations for the large itemsets (large borders) in D1 and D2, with support thresholds δmin and θmin respectively (by Max-Miner)
Required set of EPs = set interval of the large border of D2 − set interval of the large border of D1 (by border differential)
Border differential: computed by MBD-LLBORDER and its subroutine BORDER-DIFF

21 2. Definitions and mining of EPs
BORDER-DIFF
Input: two borders <{}, {U}> and <{}, R1>
Output: <L2, {U}> s.t. [L2, {U}] = [{}, {U}] − [{}, R1]
E.g. BORDER-DIFF(<{}, {{1,2,3,4}}>, <{}, {{2,3},{2,4},{3,4}}>)
Element-wise differences: {1,2,3,4} − {2,3} = {1,4}, {1,2,3,4} − {2,4} = {1,3}, {1,2,3,4} − {3,4} = {1,2}
Candidates for L2: unions formed by picking one element from each difference:
{1}, {1,2}, {1,3}, {1,2,3}, {1,4}, {1,2,4}, {1,3,4}, {2,3,4}
Removing non-minimal itemsets ⇒ L2 = {{1},{2,3,4}}
Output border: <{{1},{2,3,4}}, {{1,2,3,4}}>
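The cross-product-then-minimize procedure described above can be sketched compactly (a naive sketch of BORDER-DIFF, not the paper's optimized iterative version):

```python
from itertools import product

def border_diff(U, right_sets):
    """Return L2 with [L2, {U}] = [{}, {U}] - [{}, right_sets].
    U and the members of right_sets are frozensets."""
    # element-wise differences U - Ri
    diffs = [U - R for R in right_sets]
    # cross-product step: pick one element from each difference, union them
    candidates = {frozenset(pick) for pick in product(*diffs)}
    # keep only the minimal candidates
    return {c for c in candidates if not any(o < c for o in candidates)}
```

On the slide's example the differences are {1,4}, {1,3}, {1,2}, and the minimal candidates are {1} and {2,3,4}.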

22 2. Definitions and mining of EPs
BORDER-DIFF
Correctness: the set interval of the output border equals the first input's set interval minus the second's (see table below)
Efficiency: much higher than naïve algorithms
Only examines borders; itemset enumeration unnecessary
Improvement: by iterative removal of non-minimal itemsets, we avoid generating large intermediate results

Border                           Set interval of border
<{}, {{1,2,3,4}}>                { {1},{2},{3},{4},{1,2},{1,3},{1,4},{2,3},{2,4},{3,4},{1,2,3},{1,2,4},{1,3,4},{2,3,4},{1,2,3,4} }
<{}, {{2,3},{2,4},{3,4}}>        { {2},{3},{4},{2,3},{2,4},{3,4} }
<{{1},{2,3,4}}, {{1,2,3,4}}>     { {1},{1,2},{1,3},{1,4},{1,2,3},{1,2,4},{1,3,4},{2,3,4},{1,2,3,4} }

23 2. Definitions and mining of EPs
MBD-LLBORDER
Discovers all EPs in rectangle BCDG by calling BORDER-DIFF multiple times
Large border of D1 = <{}, {C1, C2, …, Cm}>
Large border of D2 = <{}, {D1, D2, …, Dn}>
Basic idea: all EPs in BCDG have support ≥ θmin in D2 but < δmin in D1, so they are elements of:
∪j=1..n PowerSet(Dj) − ∪i=1..m PowerSet(Ci)
= ∪j=1..n (PowerSet(Dj) − ∪i=1..m PowerSet(Ci))
= ∪j=1..n (PowerSet(Dj) − ∪i=1..m PowerSet(Ci ∩ Dj))

24 2. Definitions and mining of EPs
MBD-LLBORDER
For each Dj, call BORDER-DIFF(<{}, {Dj}>, <{}, {C1′, C2′, …, Ck′}>), where Ci′ denotes Ci ∩ Dj
All non-maximal Ci′s are pruned first
This subroutine is called at most n times
The algorithm returns a collection of up to n borders
The collection of all EPs in BCDG is the union of the set intervals of all borders derived

25 2. Definitions and mining of EPs
MBD-LLBORDER
E.g. Large border of D1 = <{}, {{2,3,5},{3,4,6,7,8},{2,4,5,8,9}}>
Large border of D2 = <{}, {{1,2,3,4},{6,7,8}}>
PowerSet({1,2,3,4}) − PowerSet({2,3,5} ∩ {1,2,3,4}) − PowerSet({3,4,6,7,8} ∩ {1,2,3,4}) − PowerSet({2,4,5,8,9} ∩ {1,2,3,4})
⇒ BORDER-DIFF(<{}, {{1,2,3,4}}>, <{}, {{2,3},{3,4},{2,4}}>)
⇒ 1st border returned: <{{1},{2,3,4}}, {{1,2,3,4}}>
PowerSet({6,7,8}) − PowerSet({3,4,6,7,8} ∩ {6,7,8}) = ∅
⇒ No need to call BORDER-DIFF a 2nd time
⇒ MBD-LLBORDER returns {<{{1},{2,3,4}}, {{1,2,3,4}}>}
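The per-Dj loop can be sketched end to end; this is a compact illustrative sketch of the procedure as described on these slides (helper names are mine, and the naive cross-product BORDER-DIFF is inlined for self-containment):

```python
from itertools import product

def border_diff(U, right_sets):
    """Minimal subsets of U not covered by any set in right_sets."""
    diffs = [U - R for R in right_sets]
    cands = {frozenset(pick) for pick in product(*diffs)}
    return {c for c in cands if not any(o < c for o in cands)}

def mbd_llborder(large_d1, large_d2):
    """large_d1 / large_d2: right bounds of the large borders of D1 / D2.
    Returns one border (left bound, Dj) per Dj holding EPs."""
    borders = []
    for Dj in large_d2:
        # intersect each Ci with Dj, keep only the maximal intersections
        inters = [Ci & Dj for Ci in large_d1]
        inters = [c for c in inters if not any(c < o for o in inters)]
        if any(c == Dj for c in inters):
            continue  # PowerSet(Dj) fully covered by D1's border -> no EPs here
        left = border_diff(Dj, inters)
        if left:
            borders.append((left, Dj))
    return borders
```

On the slide's example this returns the single border <{{1},{2,3,4}}, {{1,2,3,4}}>, skipping {6,7,8} exactly as shown.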

26 3. EP-based classifier - CAEP
CAEP: Classification by Aggregating EPs
High accuracy
Each EP is a multi-attribute test
CAEP uses the combined power of a set of EPs to arrive at a classification decision
Usually equally accurate on all classes even if their populations are unbalanced
Reported to outperform C4.5 and CBA on all but one of the datasets tested

27 3. EP-based classifier - CAEP
Aggregating score
Let all EPs of a class Ci that s contains contribute to the decision of whether s should be labeled as Ci
Given an instance s and the set E(Ci) of EPs of a class Ci, the (aggregate) score of s for Ci is:
score(s, Ci) = Σ X∈E(Ci), X⊆s [GrowthRate(X) / (GrowthRate(X) + 1)] × suppCi(X)
Normalizing score
norm_score(s, Ci) = score(s, Ci) / base_score(Ci)
Base score: score at a fixed percentile for training instances of each class
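The aggregation can be sketched as follows; the growth-rate weighting gr/(gr+1) follows the CAEP paper's scoring function, and the helper names and data layout are my own illustrative choices:

```python
def strength(gr):
    """Weight an EP by its growth rate: gr/(gr+1), tending to 1 as gr -> inf."""
    return 1.0 if gr == float("inf") else gr / (gr + 1.0)

def score(instance, eps_of_class):
    """Aggregate score of `instance` (a frozenset of items) for one class.
    eps_of_class: iterable of (itemset, growth_rate, supp_in_class) triples."""
    return sum(strength(gr) * supp
               for itemset, gr, supp in eps_of_class
               if itemset <= instance)  # only EPs the instance contains
```

A JEP (infinite growth rate) contributes its full class support; a finite-growth-rate EP contributes a discounted share.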

28 3. EP-based classifier - CAEP
CAEP claims to "approximate" Pr(s|Ci) × Pr(Ci) using the normalized score
→ given a test instance s, this yields an estimate of Pr(Ci|s)
Reduction of EPs used
EPs with relatively high supports → larger coverage
EPs with high growth rates → stronger differentiating power
Filter away those with low supports and growth rates
Reduction may increase predictive accuracy

29 3. EP-based classifier - CAEP
Overview of CAEP
Training phase
1. Mine the EPs of each class Ci
2. Optionally filter away less significant EPs in each E(Ci)
3. Find base_score(Ci) for each class Ci
Testing phase (given a test instance s)
1. Calculate norm_score(s, Ci) for each class Ci
2. Classify s to the class Ci with the largest norm_score(s, Ci)

30 4. Mining of EP variants
I. Strong EPs
EPs all of whose subsets are also EPs
Can be mined in a way similar to Apriori, by using the subset-closure property
II. Jumping EPs (JEPs)
Itemsets whose support in one dataset is zero but non-zero in the other dataset (i.e. growth rate = ∞)
E.g. Mushroom dataset:
{(ODOR=foul), (VEIL_COLOR=white)} is a JEP from the edible category to the poisonous category, with a support of 55.2%
A JEP has sharper discriminating power than a general EP

31 4. Mining of EP variants II. Jumping EPs (JEPs)
[Figure: on the support plane, all JEPs from D1 to D2 lie on the horizontal (suppD2) axis, excluding the origin, since suppD1(X) = 0 and suppD2(X) > 0.]

32 4. Mining of EP variants II. Jumping EPs (JEPs)
Find horizontal borders of D1 and D2 resp. by HORIZON-MINER
Horizontal border: a large border representing all itemsets with non-zero support in a dataset
The support thresholds are very small → Max-Miner fails to do the job
Find JEPs by MBD-LLBORDER
Input: the two discovered horizontal borders
Output: all EPs on the horizontal axis

33 4. Mining of EP variants II. Jumping EPs (JEPs)
BCDG rectangle → JEPs
support in D2 ≥ θmin (non-zero support)
support in D1 < δmin (zero support)
[Figure: with θmin and δmin made arbitrarily small, the BCDG rectangle collapses onto the horizontal axis, so the EPs it contains are exactly the JEPs.]

34 4. Mining of EP variants III. Most Expressive JEPs (MEJEPs)
E.g. JEPs from D1 to D2: <{{a,b}}, {{a,b,c,d}}>
JEPs from D2 to D1: <{{a,e},{c,d,e}}, {{a,c,d,e}}> and <{{b,e},{c,d,e}}, {{b,c,d,e}}>
For each border, the JEPs in the left bound have the highest supports
MEJEPs in D1 and D2 are the union of the left bounds of all the above borders, i.e. { {a,b}, {a,e}, {b,e}, {c,d,e} }
Building a JEP-Classifier with MEJEPs
High support and growth rate → strong discriminating power
Strengthens resistance to noise in training data
Reduces complexity

35 4. Mining of EP variants The JEP-Classifier and CAEP
Both are based on the aggregated power of EPs
Both are almost consistently better than C4.5 & CBA
The JEP-Classifier is simpler, as growth rate is not a concern
CAEP is better for cases with few or even no JEPs; the JEP-Classifier is better when there are many JEPs
DeEPs (Decision Making by Emerging Patterns)
Instance-based classification
New way of selecting sharp and relevant EPs
Better accuracy, speed and dimensional scalability

36 5. Definitions of ESs and JESs
A sequence database consists of sequences
A sequence contains one or more substrings
A string (sequence) is an ordered list of symbols
A substring of a string s is a contiguous part of s, and is itself a string
Some definitions
count of string s in sequence database D:
countD(s) = no. of sequences in D that contain s
support of string s in sequence database D:
suppD(s) = countD(s) / |D|

37 5. Definitions of ESs and JESs
More definitions
String s is σ-large in database D if suppD(s) ≥ σ
Support ratio of string s from D1 to D2:
suppRatioD1→D2(s) = suppD2(s) / suppD1(s)
Given support-ratio threshold ρ > 1 and string s, if suppRatioD1→D2(s) ≥ ρ:
s is a ρ-ES (or simply ES) from D1 to D2
s is an ES of D2
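These string-side definitions mirror the itemset versions, with substring containment in place of subset containment; a minimal sketch (helper names are mine):

```python
def count_db(s, db):
    """countD(s): number of sequences in db containing s as a contiguous substring."""
    return sum(s in seq for seq in db)

def supp_db(s, db):
    """suppD(s) = countD(s) / |D|."""
    return count_db(s, db) / len(db)

def supp_ratio(s, d1, d2):
    """suppRatio from d1 to d2; infinite when s is absent from d1 but present in d2."""
    s1, s2 = supp_db(s, d1), supp_db(s, d2)
    if s1 == 0:
        return float("inf") if s2 > 0 else 0.0
    return s2 / s1
```

A string with zero support in d1 and non-zero support in d2 is a JES (ratio ∞).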

38 5. Definitions of ESs and JESs
Jumping Emerging Substrings (JESs)
suppD1(s) = 0 and suppD2(s) ≠ 0, i.e. suppRatioD1→D2(s) = ∞
E.g. With a support-ratio threshold of 1.2:

Class C1: abcd, bd, a, c
Class C2: abd, bc, cd, b

ESs from C1 to C2: b, abd
ESs from C2 to C1: a, abc, bcd, abcd
JESs (highlighted in the original slide): abd, abc, bcd, abcd

39 6. Mining of ESs and JESs
Can we transform the ES mining problem into an EP mining problem? Theoretically, yes
Go through the sequence database, extract all possible substrings in the database and treat them as attributes of an itemset database
E.g. The previous example has 12 possible substrings:
a, b, c, d, ab, bc, bd, cd, abc, abd, bcd, abcd
String ab is transformed into {1,1,0,0,1,0,0,0,0,0,0,0}
Single-attribute EPs found in the itemset database are ESs in the original sequence database
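The encoding step can be sketched directly (an illustrative helper of mine; attributes are ordered by length then lexicographically, matching the slide's listing):

```python
def to_itemset_db(sequences, substrings):
    """Encode each sequence as a binary itemset vector over substring attributes."""
    attrs = sorted(substrings, key=lambda s: (len(s), s))
    # attribute i is 1 iff the sequence contains that substring
    vectors = [[1 if a in seq else 0 for a in attrs] for seq in sequences]
    return attrs, vectors
```

Encoding the string ab over the 12 substrings above reproduces the slide's vector {1,1,0,0,1,0,0,0,0,0,0,0}.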

40 6. Mining of ESs and JESs
Can we transform the ES mining problem into an EP mining problem? Practically, no
The conversion process would be too time-consuming
The resulting itemset database would contain too many attributes → inefficient EP mining
Difficulties in ES mining
Border approach not possible (for general ESs)
No existing algorithms can find the borders of the set containing all σ-large substrings in a sequence database
Frequent substring mining is too time-consuming
The no. of possible substrings in a sequence is far greater than the no. of possible itemsets in a transaction T

41 6. Mining of ESs and JESs
Brute-force approach – substring enumeration
Enumerate all possible substrings in the database and find their support counts in each class
Time complexity of the brute-force approach
A length-n sequence contains O(n²) substrings
k sequences in the database → O(kn²) substrings
Each substring s is matched against each sequence in the database to see if s is contained in the sequence → O(k²n²) substring-matching operations
Each substring matching: O(n) → overall: O(k²n³)
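The brute-force enumeration above can be sketched as follows (an illustrative sketch with my own helper names; the optional θmin support floor corresponds to the l3 threshold used for EPs):

```python
def mine_es(d1, d2, rho, theta_min=0.0):
    """Brute-force ES mining from d1 to d2: enumerate all substrings of d2's
    sequences, keep those whose support ratio from d1 to d2 is >= rho."""
    def supp(s, db):
        return sum(s in seq for seq in db) / len(db)
    # O(k n^2) candidate substrings; the set comprehension removes duplicates
    candidates = {seq[i:j] for seq in d2
                  for i in range(len(seq))
                  for j in range(i + 1, len(seq) + 1)}
    result = set()
    for s in candidates:
        s2 = supp(s, d2)
        if s2 < theta_min:
            continue
        s1 = supp(s, d1)
        ratio = float("inf") if s1 == 0 else s2 / s1
        if ratio >= rho:
            result.add(s)
    return result
```

On the two-class example from the previous section (C1: abcd, bd, a, c; C2: abd, bc, cd, b) with ρ = 1.2, this yields exactly the ESs listed there.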

42 6. Mining of ESs and JESs
Shortcomings of the brute-force approach
1. A sequence often contains repeated substrings; all O(n²) substrings are enumerated before duplicates are removed
2. Different sequences often contain some common substrings; all O(kn²) substrings are enumerated
3. Redundant matching
If sequence S contains substring abcd, we know that S also contains abc
But any given substring t is matched against both abc and abcd
In fact, if t does not match abc, it will not match abcd either
Also, if abc is found to be a JES, then any occurring superstring such as abcd is a JES as well

43 6. Mining of ESs and JESs
Merged suffix tree approach (a cleverer brute-force approach)
To extract unique substrings in each sequence:
transform each sequence into a suffix tree
To extract unique substrings in the database:
merge the suffix trees to form a merged suffix tree (search tree)
maintain the support count of each substring for each class while merging trees
To extract all ESs efficiently:
traverse the resulting search tree and extract all substrings which satisfy the support and support-ratio thresholds
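The approach uses suffix trees to get O(n) extraction per sequence; as an illustrative stand-in, per-sequence unique-substring sets merged into shared counters show the same merge-and-count idea (my own simplified sketch, still O(n²) per sequence rather than the suffix tree's O(n)):

```python
from collections import Counter

def substring_counts(db):
    """For every distinct substring, count how many sequences contain it.
    The per-sequence set de-duplicates repeats (shortcoming 1); the shared
    Counter merges common substrings across sequences (shortcoming 2)."""
    counts = Counter()
    for seq in db:
        counts.update({seq[i:j] for i in range(len(seq))
                       for j in range(i + 1, len(seq) + 1)})
    return counts

def mine_es_merged(d1, d2, rho):
    """Extract all ESs from d1 to d2 by traversing the merged counts once."""
    c1, c2 = substring_counts(d1), substring_counts(d2)
    k1, k2 = len(d1), len(d2)
    result = set()
    for s, n2 in c2.items():
        n1 = c1.get(s, 0)
        ratio = float("inf") if n1 == 0 else (n2 / k2) / (n1 / k1)
        if ratio >= rho:
            result.add(s)
    return result
```

Each distinct substring is matched and counted once per class, which is exactly the redundancy the merged suffix tree removes.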

44 6. Mining of ESs and JESs
Merged suffix tree example
[Figure: the merged suffix tree built from the two classes below; each node carries a pair (c1, c2) = (count in C1, count in C2).]
Class C1: abcd, bd, a, c
Class C2: abd, bc, cd, b

45 6. Mining of ESs and JESs
Time complexity of the merged suffix tree approach
To extract only the unique substrings in each sequence:
given a sequence of length n, a suffix tree can be constructed in O(n) time and O(n) space
total time for k sequences: O(kn)
To extract only the unique substrings in the database:
merging k suffix trees takes O(k²n) time
To extract all ESs efficiently:
since the merged suffix tree takes O(kn) space, a complete tree traversal takes O(kn) time
Overall: O(kn + k²n + kn) = O(k²n)

46 6. Mining of ESs and JESs
Mining of JESs by the border approach
We can find the border representation of the set of all substrings with non-zero supports
We can define the set intervals of substrings and the operations involved, e.g. set difference and union
So we can use the border approach to discover all JESs in a sequence database

47 7. Conclusions
EPs and ESs are useful since they can be used for analysis and for building powerful classifiers
Mining EPs in itemset databases and mining ESs in sequence databases are challenging problems, due to the gigantic number of itemsets or substrings involved
It is not easy to apply techniques for extracting EPs to the extraction of ESs
There is room for improvement in the brute-force approach, and for development of novel and efficient algorithms, for extraction of ESs

48 References
G. Dong and J. Li. Efficient Mining of Emerging Patterns: Discovering Trends and Differences. (KDD'99)
G. Dong, X. Zhang, L. Wong, and J. Li. CAEP: Classification by Aggregating Emerging Patterns. (DS-99)
G. Dong, J. Li, and X. Zhang. Discovering Jumping Emerging Patterns and Experiments on Real Datasets. (IDC'99)
J. Li, G. Dong, and K. Ramamohanarao. Making Use of the Most Expressive Jumping Emerging Patterns for Classification. (PAKDD-00)
F. Tang. Sequence Classification and Melody Tracks Selection. M.Phil. dissertation

49 Efficient Mining of Emerging Patterns and Emerging Substrings
- The End -

