Mining Sequential Patterns Dimitrios Gunopulos, UCR.

Mining Sequential Patterns Dimitrios Gunopulos, UCR

Finding Frequent Sequential Patterns The problem: Given a sequence of discrete events that may repeat: A B A C D A C E B A B C… Find patterns that repeat frequently. For example: A followed by B (A->B), or A followed by C (A->C) The patterns should occur within a window W. Applications in telecommunication data, networks, biology

Sequences Sequence ((T=90F)  (H=60%, P=1.1atm)) time t1 Later time t2 attributevalue item itemset k-sequence: sequence with k items T1  H2P1T3  P2, P1T2  H4P2T5: 5-sequences S1 is subsequence of S2 (S1  S2) T1  P1T2  H1T1  P2  H2P1T2 (T1  H1T1, P1T2  H2P1T2) H1P1  T2  H1T1  P2  H2P1T2

Sequential Patterns: The Problem support or frequency of a sequence S (  (S)): = the total number of times sequence S is encountered user specified minimum threshold min_sup S is frequent   (S)  min_sup S:maximal frequent sequence  S is frequent and all of its supersequences are non-frequent S:minimal non-frequent sequence  S is non frequent and all of its subsequences are frequent The problem Given: database D and min_sup the problem: find all frequent sequences in D

Example Database

Algorithms for Sequential Patterns Apriori, GSP [Srikant, Agrawal, EDBT 1996] [Mannila, Toivonen, Verkamo, DMKD 1997] SPADE, Parallel Spade [Zaki, 2001] FreeSpan, PrefixSpan [Han et al, SIGKDD 2000], [Pei et al, ICDE 2001] Sequential Patterns with constraints [Garofalakis et al, VLDB 99] DFS-Mine [Tsoukatos and Gunopulos, SSTD 2001]

The Lattice Structure Lemma: All subsequences of a frequent sequence are frequent

SPADE ([Zaki, 2001]) Lattice-based approach vertical id-list format enumerates all frequent sequences equivalence classes to decompose the problem: two k-sequences belong in the same [  ]  i class if they have the same i-length prefix  each class fits in main memory generates a (k+1)-sequence by intersecting two k-sequences that have common (k-1)-length prefix minimizes I/O cost - 2 database scans: frequent 1-sequences, frequent 2-sequences

DFS_MINE Depth-First-Search approach fast discoveries of long maximal frequent patterns uses minimal amount of memory some frequent sequences are deduced to be frequent from lattice candidate (k+1)-sequence: intersect a k-sequence with all frequent items (FreqItems) in main memory: S.Useless: set of items sequence S must not be intersected with MaxFreqList: List of Maximal Frequent Sequences MinNonFreqList: List of Minimal Non Frequent Sequences scan database to determine the support of candidate sequences

BCDBCD CDCD In MinNonFreqList candidate BCDBCD ABCDEABCDE In MaxFreqList candidate MaxFreqList - MinNonFreqList Lemma: All subsequences of a frequent sequence are also frequent Supersequences S is inserted in MinNonFreqList if: S is not in MinNonFreqList S is not a supersequence of a sequence in MinNonFreqList S was scanned in database and was found to be non-frequent Supersequences of S in MinNonFreqList are removed.  S is inserted in MaxFreqList if: S is not in MaxFreqList S is not a subsequence of a sequence in MaxFreqList S was scanned in database and was found to be frequent Subsequences of S in MaxFreqList are removed.

Examining Candidate Sequences k-sequence S is intersected with all items I j in FreqItems-S.Useless resulting sets SET(S+I j ) for all I j each sequence S: check MinNonFreqList check MaxFreqList scan database for all unknown sequences (if any) in SET(S+I j ) for all I j (1pass) update MaxFreqList, MinNonFreqList

Generating sequences k-sequence S + I j in FreqItems-S.Useless = candidate (k+1)-sequences A  B  C  D + E 1. E  A  B  C  D 2. AE  B  C  D 3. A  E  B  C  D 4. A  BE  C  D 5. A  B  E  C  D 6. A  B  CE  D 7. A  B  C  E  D 8. A  B  C  DE 9. A  B  C  D  E A  B  C  D + D 1. D  A  B  C  D 2. AD  B  C  D 3. A  D  B  C  D 4. A  BD  C  D 5. A  B  D  C  D 6. A  B  CD  D 7. A  B  C  D  D 8. A  B  C  DD  9. A  B  C  D  D  insert item I j in all possible positions that follow its rightmost occurrence is a k-sequence S. If the item does not occur at all in the sequence, then it is inserted in all positions. AAAAAA AD  A  AA  A  AD AD  A  AD DD DD 

Useless Set of a sequence S after intersecting S with item I j, it is inserted in S.Useless when intersecting S with item I j, all items I k (k<j) are in S.Useless S.Useless is ‘inherited’ by the (k+1)-sequences produced SET(S+A,A)SET(S+B,B) SET(S+A) SET(S+B) Sequence S SET(S+A,B)=SET(S+B,A) A ABAB B A  B+E E  A  B AE  B A  E  B A  BE A  B  E  A  B+D D  A  B AD  B A  D  B A  BD A  B  D D  A  B +E E  D  A  B DE  A  B D  E  A  B D  AE  B D  A  E  B D  A  BE D  A  B  E Sequence SSET(S+D) D SET(S+E) E SET(S+D,E) E not frequent Bound to be not frequent  Scenario 1 Scenario 2

Open Problems Output subexponential maximal sequential pattern algorithms Efficient algorithms for finding episodes (approximate sequential patterns – edit distance)

Spatiotemporal Datasets Temperature Map US Snow-ice-rain radar USSnow-ice-rain radar NEPrecipitation radar Bay Area Precipitation radar Lakes

Mining Spatiotemporal Data CONQUEST, [Stolorz et al, KDD 1995] –Patterns in global climate change SKICAT, [Fayyad et al, 1996] –Image processing techniques and classification techniques to identify objects in satellite pictures GeoMiner [Han et al, 1997] MultiMediaMiner, [Zaiane et al, 1998] –Data Cube structure. Mining of association and classification rules. DFS-Mine, [Tsoukatos et al, 2001] –Discovery of spatiotemporal patterns

Open Problems Similarity models and indexing techniques for higher- dimensional time series Efficient trend detection/subsequence matching algorithms Algorithms to capture the data distribution when it changes over time New models for capturing the evolution of spatial phenomena over time

Mining Sequential Patterns Dimitrios Gunopulos, UCR.

Similar presentations

Presentation on theme: "Mining Sequential Patterns Dimitrios Gunopulos, UCR."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Mining Sequential Patterns Dimitrios Gunopulos, UCR.

Similar presentations

Presentation on theme: "Mining Sequential Patterns Dimitrios Gunopulos, UCR."— Presentation transcript:

Similar presentations

About project

Feedback