IncSpan: Incremental Mining of Sequential Patterns in Large Databases
Hong Cheng, Xifeng Yan, Jiawei Han
University of Illinois at Urbana-Champaign

1 IncSpan: Incremental Mining of Sequential Patterns in Large Databases
  Hong Cheng, Xifeng Yan, Jiawei Han
  University of Illinois at Urbana-Champaign

2 Sequence Database Is Growing!
  - Sequential pattern mining is an important problem with broad applications
      - Customer shopping sequences
      - Medical treatment sequences
      - Web log mining
  - Many real-life sequence databases grow incrementally
      - A customer continues shopping
      - A patient has new treatment records
      - A web log grows with subsequent visits

3 Incremental Mining Is Challenging
  - It is undesirable to mine from scratch each time a small fraction of the sequences grows
  - It is nontrivial to mine sequential patterns incrementally because
      - Database growth brings in new patterns
      - Growing subsequences interact with the original ones
  - IncSpan: major new techniques
      - Buffering semi-frequent patterns
      - Reverse pattern matching

4 Major Challenge: Appending to Existing Sequences
  - Two kinds of sequence database growth
      - Insert new sequences
      - Append new transactions to existing sequences (more challenging, and our focus)
  - Example: minimum support = 10%
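
The two kinds of growth named on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the dictionary layout (sequence id mapped to a list of itemsets) and the function names `insert_sequence` and `append_transactions` are assumptions made for the example.

```python
from typing import Dict, FrozenSet, List

# Assumed layout: sequence id -> ordered list of itemsets (transactions).
Sequence = List[FrozenSet[str]]
SequenceDB = Dict[int, Sequence]


def insert_sequence(db: SequenceDB, sid: int, seq: Sequence) -> None:
    """INSERT: add a brand-new sequence, so the number of sequences grows."""
    if sid in db:
        raise KeyError(f"sequence id {sid} already exists")
    db[sid] = list(seq)


def append_transactions(db: SequenceDB, sid: int, new_itemsets: Sequence) -> None:
    """APPEND: extend an existing sequence; the number of sequences is unchanged."""
    db[sid].extend(new_itemsets)


db: SequenceDB = {1: [frozenset({"a"}), frozenset({"b"})]}
append_transactions(db, 1, [frozenset({"c"})])   # customer 1 shops again
insert_sequence(db, 2, [frozenset({"a", "c"})])  # a new customer arrives
```

Appending leaves the sequence count alone but can change which patterns an existing sequence supports, which is why the slide calls it the more challenging case.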

5 Semi-Frequent: A Buffer In Between
  - Given min_sup and a factor μ ≤ 1, a sequence a is
      - frequent if sup(a) ≥ min_sup
      - semi-frequent if μ·min_sup ≤ sup(a) < min_sup
      - infrequent if sup(a) < μ·min_sup
  - Incremental sequential pattern mining
      - Given a sequence database D, a min_sup threshold, the set of frequent subsequences FS in D, and an appended sequence database D’ of D
      - Mine the set of frequent subsequences FS’ in D’ based on FS, instead of mining D’ from scratch
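
The three-way split defined on this slide translates directly into a small classifier. The sketch below assumes that supports and `min_sup` are absolute counts and that 0 < μ ≤ 1; the function name `classify` is hypothetical.

```python
def classify(sup: int, min_sup: int, mu: float) -> str:
    """Label a pattern by its support, following the slide's definition.

    Assumes absolute support counts and 0 < mu <= 1.
    """
    if not 0 < mu <= 1:
        raise ValueError("mu must be in (0, 1]")
    if sup >= min_sup:
        return "frequent"
    if sup >= mu * min_sup:
        return "semi-frequent"  # buffered: close to the threshold
    return "infrequent"


# With min_sup = 10 and mu = 0.5, the buffer covers supports 5 through 9.
labels = [classify(s, min_sup=10, mu=0.5) for s in (10, 5, 4)]
```

A smaller μ widens the buffer, keeping more patterns at the cost of more memory.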

6 Semi-Frequent Sequence Buffering and Maintenance
  - Keep some additional information about the original database for incremental mining
  - Buffer the semi-frequent subsequences (SFS) of the original database
      - SFS are “almost frequent”, so they are likely to become frequent in the growing database
      - SFS is a boundary between frequent and infrequent sequences
      - Keep both FS and SFS of the original database

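7 Possible State Transitions After Appending
  Status in D   | Status in D’  | Comment
  Frequent      | Frequent      | Easy
  Semi-frequent | Frequent      | Easy
  Semi-frequent | Semi-frequent | Easy
  Not appear    | Appear        | Have no information of infrequent pattern or new items
  Infrequent    | Frequent      | Have no information of infrequent pattern or new items
  Infrequent    | Semi-frequent | Have no information of infrequent pattern or new items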

8 Buffering Technique (I)
  - Handle the “infrequent-to-frequent” case
      - If an infrequent pattern p’ in D becomes frequent in D’, then at least one of its prefix subsequences p is in FS
      - Solution: start from its frequent prefix p and construct the p-projected database to discover p’
  - Theorem (used for search space pruning)
      - For a frequent pattern p, if its support in [database symbol lost in the transcript] satisfies the condition [formula lost in the transcript], then there is no sequence p’ having p as prefix that changes from infrequent in D to frequent in D’

9 Buffering Technique (II)
  - Handle the “infrequent-to-semi-frequent” case
      - If an infrequent pattern p’ in D becomes semi-frequent in D’, then at least one of its prefix subsequences p is in either FS or SFS
      - Solution: start from its frequent or semi-frequent prefix p and construct the p-projected database to discover p’
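
Both buffering slides rely on building a p-projected database from a known prefix p and growing p from there. The sketch below shows that step in the PrefixSpan style, under a loud simplification: each sequence is a tuple of single items rather than of itemsets. The names `project` and `frequent_extensions` are hypothetical, and this is not the authors' implementation.

```python
from collections import Counter
from typing import Dict, List, Tuple

Seq = Tuple[str, ...]  # simplification: one item per transaction


def project(db: List[Seq], prefix: Seq) -> List[Seq]:
    """Return the suffix after the earliest match of prefix in each sequence."""
    projected = []
    for seq in db:
        pos = 0
        for item in prefix:
            try:
                pos = seq.index(item, pos) + 1  # earliest match at or after pos
            except ValueError:
                break  # this sequence does not contain the prefix
        else:
            projected.append(seq[pos:])
    return projected


def frequent_extensions(db: List[Seq], prefix: Seq, min_sup: int) -> Dict[Seq, int]:
    """Grow prefix by one item, keeping extensions with support >= min_sup."""
    counts: Counter = Counter()
    for suffix in project(db, prefix):
        counts.update(set(suffix))  # count each item once per sequence
    return {prefix + (item,): c for item, c in counts.items() if c >= min_sup}


db = [("a", "b", "c"), ("a", "c"), ("b", "c")]
grown = frequent_extensions(db, ("a",), min_sup=2)
```

Starting the search from prefixes already stored in FS or SFS restricts the work to their projected databases instead of rescanning everything; lowering `min_sup` to μ·min_sup in the same routine would surface semi-frequent extensions.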

10 Reverse Pattern Matching
  - An optimization technique: match a pattern against a sequence from the end towards the front
  - Since the itemsets are appended at the end, reverse matching can save some computation
  - If the last item of pattern p does not appear in Sa, then appending Sa to S will not increase sup(p)
  - So just scan Sa for the last item of p, and prune the search if the condition above holds
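
A minimal sketch of the pruning check and the back-to-front match described on this slide, again assuming sequences of single items (the slide speaks of itemsets) and using hypothetical names `contains_reverse` and `gains_support`:

```python
from typing import Tuple

Seq = Tuple[str, ...]  # simplification: one item per transaction


def contains_reverse(seq: Seq, pattern: Seq) -> bool:
    """Test whether pattern is a subsequence of seq, scanning from the end."""
    i = len(pattern) - 1
    for item in reversed(seq):
        if i < 0:
            break  # every pattern item has been matched
        if item == pattern[i]:
            i -= 1
    return i < 0


def gains_support(old_seq: Seq, appended: Seq, pattern: Seq) -> bool:
    """True if appending makes this sequence newly support the pattern."""
    # Pruning from the slide: the last pattern item must occur in the appended part.
    if not pattern or pattern[-1] not in appended:
        return False
    if contains_reverse(old_seq, pattern):
        return False  # already counted in sup(p) before the append
    return contains_reverse(old_seq + appended, pattern)


newly_supported = gains_support(("a", "b"), ("c",), ("a", "c"))
pruned = gains_support(("a", "b"), ("d",), ("a", "c"))
```

The cheap membership test on the appended part runs first, so sequences whose new transactions cannot complete the pattern are skipped without any matching.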

11 Performance Study
  - Compare with
      - The ISM algorithm [Parthasarathy, Zaki, Ogihara and Dwarkadas, CIKM’99]
      - PrefixSpan, a mining-from-scratch approach, to see how much we can save
  - Compare CPU time and memory usage
  Figure 1. Memory usage under varied min_sup

12 Performance Study (II)
  Figure 2. Varying min_sup
  Figure 3. Varying percentage of updated sequences

13 Discussion and Conclusion
  - Buffering semi-frequent patterns is effective
      - The user can control the size of SFS through μ
      - Patterns in SFS are within (1 − μ)·min_sup of being frequent, so they are likely to become frequent as the database grows
  - When only a small portion (5%) of the database is appended, IncSpan is more efficient than mining from scratch
  - IncSpan can be easily extended to handle inserting sequences into, or deleting sequences from, the database
  - Does it handle incremental mining in stream data? No: it still needs more than one scan of the database

