1
Mining Sequential Patterns
Dimitrios Gunopulos, UCR
2
Finding Frequent Sequential Patterns
The problem: given a sequence of discrete events that may repeat, e.g. A B A C D A C E B A B C…, find patterns that repeat frequently.
For example: A followed by B (A->B), or A followed by C (A->C).
The patterns should occur within a window W.
Applications: telecommunication data, networks, biology.
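A minimal sketch of the windowed pattern count, under assumptions the slide does not state: the window W is measured in event positions, every occurrence of a pair is counted, and a pair is called frequent when its count reaches a hypothetical threshold of 2. The function name `count_pairs_in_window` is illustrative only.

```python
from collections import Counter

def count_pairs_in_window(events, window):
    """Count ordered pairs (first, second) where `second` occurs
    at most `window` positions after `first`."""
    counts = Counter()
    for i, first in enumerate(events):
        for second in events[i + 1 : i + 1 + window]:
            counts[(first, second)] += 1
    return counts

# The event sequence from the slide, with an assumed window of 2 positions.
events = list("ABACDACEBABC")
counts = count_pairs_in_window(events, window=2)

# Assumed threshold: keep pairs seen at least twice.
frequent = {pair: n for pair, n in counts.items() if n >= 2}
print(counts[("A", "B")], counts[("A", "C")])
```

Other occurrence definitions (for example counting windows rather than pairs) would give different counts.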
3
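Sequences
Sequence: ((T=90F) (H=60%, P=1.1atm))
(T=90F) occurs at time t1; (H=60%, P=1.1atm) occurs at a later time t2
item: an attribute-value pair, e.g. T=90F
itemset: a set of items, e.g. (H=60%, P=1.1atm)
k-sequence: sequence with k items
(T1) (H2 P1 T3) (P2) and (P1 T2) (H4 P2 T5): 5-sequences
S1 is a subsequence of S2 (S1 ⊆ S2):
(T1) (P1 T2) is a subsequence of (H1 T1) (P2) (H2 P1 T2), since T1 ⊆ H1 T1 and P1 T2 ⊆ H2 P1 T2
(H1 P1) (T2) is not a subsequence of (H1 T1) (P2) (H2 P1 T2)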
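The subsequence relation can be sketched as follows, assuming a sequence is a list of itemsets (Python sets) and that S1 is a subsequence of S2 when its itemsets can be matched, in order, to supersets in S2. The function name `is_subsequence` and the greedy matching are illustrative choices, not taken from the slides.

```python
def is_subsequence(s1, s2):
    """Return True if every itemset of s1 is contained in a distinct
    itemset of s2, with the order of itemsets preserved."""
    pos = 0
    for itemset in s1:
        # Advance to the first remaining itemset of s2 that contains this one.
        while pos < len(s2) and not set(itemset) <= set(s2[pos]):
            pos += 1
        if pos == len(s2):
            return False
        pos += 1  # the next itemset of s1 must match strictly later
    return True

s2 = [{"H1", "T1"}, {"P2"}, {"H2", "P1", "T2"}]
print(is_subsequence([{"T1"}, {"P1", "T2"}], s2))
print(is_subsequence([{"H1", "P1"}, {"T2"}], s2))
```

Matching each itemset to the earliest possible position is sufficient here, because an earlier match never rules out a later one.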
4
Sequential Patterns: The Problem
support (or frequency) of a sequence S, support(S): the total number of times sequence S is encountered
min_sup: user-specified minimum threshold
S is frequent: support(S) ≥ min_sup
S is a maximal frequent sequence: S is frequent and all of its supersequences are non-frequent
S is a minimal non-frequent sequence: S is non-frequent and all of its subsequences are frequent
The problem: given a database D and min_sup, find all frequent sequences in D
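A small sketch of these definitions, with simplifying assumptions: each database entry is a string of single items, support is counted as the number of database sequences that contain S, and maximality is checked only against a hand-picked candidate list. The toy database and all names are hypothetical.

```python
def is_subsequence(s, t):
    """True if the items of s appear in t in the same order."""
    remaining = iter(t)
    return all(item in remaining for item in s)

def support(s, database):
    """Number of database sequences that contain s (assumed definition)."""
    return sum(is_subsequence(s, seq) for seq in database)

def maximal(frequent):
    """Frequent sequences with no other frequent supersequence in the list."""
    return [s for s in frequent
            if not any(s != t and is_subsequence(s, t) for t in frequent)]

database = ["ABCD", "ABD", "ACD", "BCD"]   # toy data
min_sup = 2
candidates = ["A", "D", "AD", "ABD", "ACD", "ABCD"]
frequent = [s for s in candidates if support(s, database) >= min_sup]
print(frequent, maximal(frequent))
```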
5
Example Database
6
Algorithms for Sequential Patterns
Apriori, GSP [Srikant, Agrawal, EDBT 1996] [Mannila, Toivonen, Verkamo, DMKD 1997]
SPADE, Parallel SPADE [Zaki, 2001]
FreeSpan, PrefixSpan [Han et al, SIGKDD 2000], [Pei et al, ICDE 2001]
Sequential Patterns with constraints [Garofalakis et al, VLDB 99]
DFS-Mine [Tsoukatos and Gunopulos, SSTD 2001]
7
The Lattice Structure
Lemma: All subsequences of a frequent sequence are frequent
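One common use of this lemma is candidate pruning: a (k+1)-sequence can only be frequent if all of its k-subsequences are. The sketch below assumes sequences of single items written as strings; `can_be_frequent` is a hypothetical helper, not a function from any of the cited algorithms.

```python
from itertools import combinations

def can_be_frequent(candidate, frequent_k):
    """Necessary condition from the lemma: every subsequence obtained by
    dropping one item must already be known to be frequent."""
    k = len(candidate) - 1
    return all("".join(sub) in frequent_k
               for sub in combinations(candidate, k))

print(can_be_frequent("ABC", {"AB", "AC", "BC"}))
print(can_be_frequent("ABC", {"AB", "BC"}))
```

The check is only a filter: a candidate that passes still needs its support counted.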
8
SPADE ([Zaki, 2001])
Lattice-based approach
vertical id-list format
enumerates all frequent sequences
equivalence classes to decompose the problem: two k-sequences belong to the same [ ] i class if they have the same i-length prefix
each class fits in main memory
generates a (k+1)-sequence by intersecting two k-sequences that have a common (k-1)-length prefix
minimizes I/O cost: 2 database scans (frequent 1-sequences, frequent 2-sequences)
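A simplified sketch of the vertical id-list idea, not the full SPADE algorithm: each item maps to a list of (sid, eid) pairs (sequence id, event time), and the id-list of "a followed by b" is obtained by joining the two lists on sid with a time-order check. The toy database, its layout and the function names are assumptions made for illustration.

```python
from collections import defaultdict

def build_id_lists(database):
    """database: {sid: [(eid, itemset), ...]} -> {item: [(sid, eid), ...]}"""
    id_lists = defaultdict(list)
    for sid, events in database.items():
        for eid, itemset in events:
            for item in itemset:
                id_lists[item].append((sid, eid))
    return id_lists

def temporal_join(list_a, list_b):
    """Id-list of 'a followed by b': keep each (sid, eid) of b that has
    an occurrence of a with a smaller eid in the same sequence."""
    first_a = {}
    for sid, eid in list_a:
        first_a[sid] = min(eid, first_a.get(sid, eid))
    return sorted({(sid, eid) for sid, eid in list_b
                   if sid in first_a and eid > first_a[sid]})

def support(id_list):
    """Number of distinct sequences in which the pattern occurs."""
    return len({sid for sid, _ in id_list})

database = {  # toy data
    1: [(10, {"A", "B"}), (20, {"B"}), (30, {"A", "C"})],
    2: [(15, {"A"}), (25, {"C"})],
    3: [(5, {"B"}), (40, {"A"})],
}
ids = build_id_lists(database)
a_then_b = temporal_join(ids["A"], ids["B"])
a_then_c = temporal_join(ids["A"], ids["C"])
print(a_then_b, support(a_then_c))
```

Because support is read directly from the joined id-list, no further database scan is needed for the longer pattern.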
9
SPADE
10
DFS_MINE
Depth-First-Search approach
fast discovery of long maximal frequent patterns
uses a minimal amount of memory
some sequences are deduced to be frequent from the lattice
candidate (k+1)-sequence: intersect a k-sequence with all frequent items (FreqItems)
in main memory:
S.Useless: set of items that sequence S must not be intersected with
MaxFreqList: list of maximal frequent sequences
MinNonFreqList: list of minimal non-frequent sequences
scan the database to determine the support of candidate sequences
11
MaxFreqList - MinNonFreqList
Lemma: All subsequences of a frequent sequence are also frequent
CD in MinNonFreqList, candidate BCD: BCD is a supersequence of CD, so it is non-frequent
ABCDE in MaxFreqList, candidate BCD: BCD is a subsequence of ABCDE, so it is frequent
S is inserted in MinNonFreqList if:
S is not in MinNonFreqList
S is not a supersequence of a sequence in MinNonFreqList
S was scanned in the database and was found to be non-frequent
Supersequences of S in MinNonFreqList are removed.
S is inserted in MaxFreqList if:
S is not in MaxFreqList
S is not a subsequence of a sequence in MaxFreqList
S was scanned in the database and was found to be frequent
Subsequences of S in MaxFreqList are removed.
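The list maintenance rules can be sketched as below, assuming sequences of single items written as strings and plain Python lists for MaxFreqList and MinNonFreqList. The function names are hypothetical, and the sketch assumes the caller has already determined from a database scan whether S is frequent.

```python
def is_subsequence(s, t):
    """True if the items of s appear in t in the same order."""
    remaining = iter(t)
    return all(item in remaining for item in s)

def insert_max_freq(max_freq, s):
    """Insert a frequent s unless a listed sequence already covers it,
    then remove listed subsequences of s."""
    if any(is_subsequence(s, m) for m in max_freq):
        return
    max_freq[:] = [m for m in max_freq if not is_subsequence(m, s)]
    max_freq.append(s)

def insert_min_non_freq(min_non_freq, s):
    """Insert a non-frequent s unless it is a supersequence of a listed
    sequence, then remove listed supersequences of s."""
    if any(is_subsequence(m, s) for m in min_non_freq):
        return
    min_non_freq[:] = [m for m in min_non_freq if not is_subsequence(s, m)]
    min_non_freq.append(s)

def classify(s, max_freq, min_non_freq):
    """Decide a candidate from the lists alone, if possible."""
    if any(is_subsequence(s, m) for m in max_freq):
        return "frequent"
    if any(is_subsequence(m, s) for m in min_non_freq):
        return "non-frequent"
    return "unknown"  # needs a database scan

max_freq, min_non_freq = [], []
insert_max_freq(max_freq, "ABC")
insert_max_freq(max_freq, "ABCDE")   # replaces its subsequence ABC
insert_min_non_freq(min_non_freq, "BCF")
insert_min_non_freq(min_non_freq, "CF")  # replaces its supersequence BCF
print(max_freq, min_non_freq, classify("BCD", max_freq, min_non_freq))
```

Only candidates classified as "unknown" require their support to be counted against the database.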
12
Examining Candidate Sequences
a k-sequence S is intersected with all items Ij in FreqItems - S.Useless
resulting sets: SET(S+Ij) for all Ij
for each candidate sequence:
check MinNonFreqList
check MaxFreqList
scan the database for all unknown sequences (if any) in SET(S+Ij) for all Ij (1 pass)
update MaxFreqList, MinNonFreqList
13
Generating sequences
k-sequence S + Ij in FreqItems - S.Useless = candidate (k+1)-sequences
A B C D + E:
1. E A B C D
2. AE B C D
3. A E B C D
4. A BE C D
5. A B E C D
6. A B CE D
7. A B C E D
8. A B C DE
9. A B C D E
A B C D + D:
1. D A B C D
2. AD B C D
3. A D B C D
4. A BD C D
5. A B D C D
6. A B CD D
7. A B C D D
8. A B C DD
9. A B C D D
Insert item Ij in all possible positions that follow its rightmost occurrence in the k-sequence S. If the item does not occur at all in the sequence, then it is inserted in all positions.
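A sketch of this candidate generation under one reading of the rule: a sequence is a tuple of itemsets, and the item is either placed as a new itemset or added to an existing itemset, but only at positions after the rightmost itemset that already contains it. The names `extend` and `show` are illustrative, and the treatment of the A B C D + D case is an assumption rather than something the slide spells out.

```python
def extend(seq, item):
    """Return candidate (k+1)-sequences from inserting `item` into `seq`
    (a tuple of frozensets) at positions after its rightmost occurrence."""
    n = len(seq)
    # Index of the rightmost itemset containing the item, or -1 if absent.
    last = max((i for i, itemset in enumerate(seq) if item in itemset),
               default=-1)
    candidates = []
    for i in range(last + 1, n + 1):
        # The item as a new itemset placed before position i.
        candidates.append(seq[:i] + (frozenset([item]),) + seq[i:])
        if i < n:
            # The item added to the existing itemset at position i.
            candidates.append(seq[:i] + (seq[i] | {item},) + seq[i + 1:])
    return candidates

def show(seq):
    """Render a sequence the way the slide does, e.g. 'AE B C D'."""
    return " ".join("".join(sorted(itemset)) for itemset in seq)

abcd = tuple(frozenset(c) for c in "ABCD")
with_e = [show(c) for c in extend(abcd, "E")]
with_d = [show(c) for c in extend(abcd, "D")]
print(with_e)
print(with_d)
```

With this reading, A B C D + E yields the nine candidates in the order listed on the slide, while A B C D + D keeps only the insertion after the existing D.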
14
Useless Set of a sequence S
after S is intersected with item Ij, Ij is inserted in S.Useless
when intersecting S with item Ij, all items Ik (k<j) are in S.Useless
S.Useless is ‘inherited’ by the (k+1)-sequences produced
[Figure: Sequence S with SET(S+A), SET(S+B), SET(S+A,A), SET(S+B,B), where SET(S+A,B) = SET(S+B,A)]
[Figure: Scenario 1 and Scenario 2. A B extended with E and with D; Sequence S, SET(S+D), SET(S+E), SET(S+D,E); E not frequent; bound to be not frequent]
15
Open Problems
Output-subexponential maximal sequential pattern algorithms
Efficient algorithms for finding episodes (approximate sequential patterns – edit distance)
16
Spatiotemporal Datasets
Temperature Map US
Snow-ice-rain radar US
Snow-ice-rain radar NE
Precipitation radar Bay Area
Precipitation radar Lakes
17
Mining Spatiotemporal Data
CONQUEST [Stolorz et al, KDD 1995]
–Patterns in global climate change
SKICAT [Fayyad et al, 1996]
–Image processing techniques and classification techniques to identify objects in satellite pictures
GeoMiner [Han et al, 1997]
MultiMediaMiner [Zaiane et al, 1998]
–Data Cube structure. Mining of association and classification rules.
DFS-Mine [Tsoukatos et al, 2001]
–Discovery of spatiotemporal patterns
18
Open Problems
Similarity models and indexing techniques for higher-dimensional time series
Efficient trend detection/subsequence matching algorithms
Algorithms to capture the data distribution when it changes over time
New models for capturing the evolution of spatial phenomena over time