IncSpan: Incremental Mining of Sequential Patterns in Large Databases. Hong Cheng, Xifeng Yan, Jiawei Han. University of Illinois at Urbana-Champaign.


Sequence Database Is Growing! Sequential pattern mining is an important problem with broad applications: customer shopping sequences, medical treatment sequences, and Web log mining. Many real-life sequence databases grow incrementally: a customer continues shopping, a patient has new treatment records, and a Web log grows with subsequent visits.

Incremental Mining Is Challenging. It is undesirable to mine from scratch each time a small fraction of the sequences grows. It is nontrivial to mine sequential patterns incrementally because database growth brings in new patterns and the growing subsequences interact with the original ones. IncSpan introduces two major new techniques: buffering semi-frequent patterns, and reverse pattern matching.

Major Challenge: Appending to Existing Sequences. There are two kinds of sequence database growth: inserting new sequences, and appending new transactions to existing sequences (the more challenging case, and our focus). Example: minimum support = 10%.

Semi-Frequent: A Buffer In Between. Given min_sup and μ ≤ 1, a sequence a is: frequent if sup(a) ≥ min_sup; semi-frequent if μ·min_sup ≤ sup(a) < min_sup; infrequent if sup(a) < μ·min_sup. Incremental sequential pattern mining: given a sequence database D, a min_sup threshold, the set of frequent subsequences FS in D, and an appended sequence database D’ of D, mine the set of frequent subsequences FS’ in D’ based on FS instead of mining D’ from scratch.
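The three cases above can be sketched as a small function. This is only an illustration: the name classify, the use of absolute support counts, and the extra check that μ is positive are assumptions, not part of the slides.

```python
def classify(support: int, min_sup: int, mu: float) -> str:
    """Label a pattern by its support count, following the three cases above.

    support and min_sup are absolute counts (an assumption for this sketch);
    mu is the buffer ratio, assumed here to satisfy 0 < mu <= 1.
    """
    if not 0 < mu <= 1:
        raise ValueError("mu must satisfy 0 < mu <= 1")
    if support >= min_sup:
        return "frequent"
    if support >= mu * min_sup:
        return "semi-frequent"
    return "infrequent"
```

With min_sup = 10 and μ = 0.5, a pattern with support 5 falls in the semi-frequent buffer, while one with support 4 is treated as infrequent and is not kept.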

Semi-Frequent Sequence Buffering and Maintenance. Keep some additional information about the original database for incremental mining: buffer the semi-frequent subsequences SFS of the original database. Patterns in SFS are "almost frequent", so they are likely to become frequent in the growing database. SFS is a boundary between frequent and infrequent sequences. Keep both FS and SFS of the original database.
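A minimal sketch of how the two buffers might be refreshed after an append. It makes several simplifying assumptions that are not from the slides: each sequence element is a single item rather than an itemset, supports are recounted naively over the whole updated database, and the helper names contains, support and refresh_buffers are hypothetical. It only re-labels patterns that are already buffered; it does not discover new patterns.

```python
from typing import Iterable, Sequence, Set, Tuple

Pattern = Tuple[str, ...]


def contains(sequence: Sequence[str], pattern: Pattern) -> bool:
    """True if pattern is a (not necessarily contiguous) subsequence of sequence."""
    it = iter(sequence)
    # "item in it" advances the iterator until item is found, preserving order.
    return all(item in it for item in pattern)


def support(db: Iterable[Sequence[str]], pattern: Pattern) -> int:
    """Number of sequences in db that contain pattern."""
    return sum(1 for seq in db if contains(seq, pattern))


def refresh_buffers(fs: Set[Pattern], sfs: Set[Pattern],
                    db_new: Sequence[Sequence[str]],
                    min_sup: int, mu: float) -> Tuple[Set[Pattern], Set[Pattern]]:
    """Recount every buffered pattern on the updated database and re-split FS/SFS."""
    new_fs: Set[Pattern] = set()
    new_sfs: Set[Pattern] = set()
    for p in fs | sfs:
        sup = support(db_new, p)
        if sup >= min_sup:
            new_fs.add(p)
        elif sup >= mu * min_sup:
            new_sfs.add(p)
    return new_fs, new_sfs
```

Because appending can only add occurrences, a buffered pattern's support never drops, so the interesting movement in this sketch is from SFS into FS.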

Possible State Transitions After Appending
Status in D      Status in D’     Comment
Frequent         Frequent         Easy
Semi-frequent    Frequent         Easy
Semi-frequent    Semi-frequent    Easy
Not appear       Appear           Have no information of infrequent pattern or new items
Infrequent       Frequent
Infrequent       Semi-frequent

Buffering Technique (I). Handles the "infrequent-to-frequent" case. If an infrequent pattern p’ in D becomes frequent in D’, then at least one of its prefix subsequences p is in FS. Solution: start from its frequent prefix p and construct the p-projected database to discover p’. Theorem (used for search space pruning): for a frequent pattern p, if its support in [...] satisfies the condition [...], then there is no sequence p’ having p as prefix that changes from infrequent in D to frequent in D’.
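The exact condition of the pruning theorem did not survive in this transcript, so the sketch below does not reproduce it. Instead it uses a bound worked out here from the definitions of frequent and infrequent: a pattern p’ that is infrequent in D has support below μ·min_sup, so to reach min_sup in D’ it must gain more than (1 − μ)·min_sup, and every sequence that newly supports p’ must be an appended sequence containing its prefix p. The function name and the parameter sup_in_appended (the number of appended sequences of D’ that contain p) are assumptions for illustration.

```python
def can_skip_prefix(sup_in_appended: int, min_sup: int, mu: float) -> bool:
    """Return True if no extension of prefix p can go from infrequent to frequent.

    sup_in_appended: how many appended sequences contain p (assumed input).
    An extension p' needs a support gain strictly greater than
    (1 - mu) * min_sup, and its gain is at most sup_in_appended.
    This bound is derived for this sketch, not quoted from the slides.
    """
    return sup_in_appended <= (1 - mu) * min_sup
```

With min_sup = 10 and μ = 0.5, a prefix contained in only 5 appended sequences can be skipped, while one contained in 6 still has to be projected and searched.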

Buffering Technique (II). Handles the "infrequent-to-semi-frequent" case. If an infrequent pattern p’ in D becomes semi-frequent in D’, then at least one of its prefix subsequences p is either in FS or in SFS. Solution: start from its frequent or semi-frequent prefix p and construct the p-projected database to discover p’.

Reverse Pattern Matching. An optimization technique: match a pattern against a sequence from the end towards the front. Since the itemsets are appended at the end, reverse matching can save some computation. If the last item of pattern p does not appear in Sa, then appending Sa to S will not increase sup(p). So it suffices to scan Sa for the last item of p and prune the search when that item is absent.
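The check described above, together with a back-to-front matcher, might look like the following sketch. It assumes single-item elements rather than itemsets, and the function names may_gain_support and contains_reverse are hypothetical.

```python
from typing import Sequence


def may_gain_support(pattern: Sequence[str], appended: Sequence[str]) -> bool:
    """Cheap pre-check: the appended part Sa must contain the last item of p.

    If it does not, appending Sa cannot add a new occurrence of p,
    so the full match can be skipped.
    """
    return bool(pattern) and pattern[-1] in appended


def contains_reverse(sequence: Sequence[str], pattern: Sequence[str]) -> bool:
    """Match pattern against sequence from the end toward the front."""
    i = len(pattern) - 1              # index of the next pattern item to match
    for item in reversed(sequence):
        if i < 0:
            break                     # whole pattern already matched
        if item == pattern[i]:
            i -= 1
    return i < 0
```

A caller would typically run may_gain_support on Sa first and fall back to contains_reverse on the concatenation of S and Sa only when the pre-check passes.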

Performance Study. Compare with the ISM algorithm [Parthasarathy, Zaki, Ogihara and Dwarkadas, CIKM’99] and with PrefixSpan, the mining-from-scratch approach, to see how much we can save. Compare CPU time and memory usage. Figure 1. Memory usage under varied min_sup.

Performance Study (II). Figure 2. Varying min_sup. Figure 3. Varying percentage of updated sequences.

Discussion and Conclusion. Buffering semi-frequent patterns is effective: the user can control the size of SFS through μ, and patterns in SFS are within (1 − μ)·min_sup of being frequent, so they are likely to become frequent as the database grows. When only a small portion (5%) of the database is appended, IncSpan is more efficient than mining from scratch. IncSpan can be easily extended to handle inserting sequences into, or deleting sequences from, the database. Can it handle incremental mining in stream data? No: it still needs more than one scan of the database.
