CloSpan: Mining Closed Sequential Patterns in Large Datasets Xifeng Yan, Jiawei Han and Ramin Afshar Proceedings of 2003 SIAM International Conference.

1 CloSpan: Mining Closed Sequential Patterns in Large Datasets
Xifeng Yan, Jiawei Han and Ramin Afshar
Proceedings of 2003 SIAM International Conference on Data Mining (SDM'03), pp. 166-177, San Francisco, CA, May 2003.
Advisor: Professor Hsin-Hsi Chen
Reporter: Clarence Min-Chi Hsieh
Natural Language Processing Laboratory, Dept. of Computer Science and Info. Engineering, NTU
2006/01/10

2 Outline (Copyright © Natural Language Processing Lab., NTU, 2006)
– Introduction
– Search Space Pruning
– CloSpan
– Experimental Results
– Conclusions

3 Introduction
– An Apriori-like algorithm generates a huge set of candidate sequences.
  Ex. With 1000 frequent sequences of length 1, there are 1000×1000 + (1000×999)/2 = 1,499,500 candidate sequences of length 2.
– Mining requires many scans of the database.
  Ex. For the sequential pattern {(abc)(abc)(abc)(abc)(abc)}, an Apriori-based method must scan the database at least 15 times.
– Mining long sequential patterns is difficult.
  Ex. With only a single sequence of length 100 and min_sup = 1, there are 100 length-1 candidate sequences, 14,950 length-2 candidates, …, for a total of 2^100 − 1 ≈ 10^30.
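The candidate counts on this slide can be checked with a few lines of Python. This is only an arithmetic sketch; the function name `length2_candidates` is made up for illustration.

```python
def length2_candidates(n):
    """Length-2 candidates from n frequent items:
    n*n ordered pairs <xy> plus n*(n-1)/2 two-item elements <(xy)>."""
    return n * n + n * (n - 1) // 2


print(length2_candidates(1000))  # 1499500
print(length2_candidates(100))   # 14950
print(2 ** 100 - 1)              # total for the length-100 example, about 1.27e30
```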

4 Introduction (Cont.)
Definition – Sequence, Elements, Subsequence and Sequential Pattern
– A sequence is an ordered list of elements; items within an element are listed alphabetically.
– Given support threshold min_sup_count = 2, a subsequence contained in at least 2 sequences of the database is a sequential pattern.
A sequence database
SID  sequence
10
20   <(ad)c(bc)(ae)>
30
40   <eg(af)cbc>

5 Introduction (Cont.)
Definition
– Frequent Sequential Pattern (FS): includes all the sequences whose support is no less than min_sup.
– Closed Frequent Sequential Pattern (CS): includes no sequence which has a super-sequence with the same support.
– CS ⊆ FS

6 Introduction (Cont.)
Example – FS & CS (min_sup_count = 2)
ID  Sequence
0   (af)dea
1   eab
2   e(abf)(bde)
FS: a:3, b:2, d:2, e:3, f:2, ab:2, ad:2, ae:2, (af):2, ea:3, eb:2, fd:2, fe:2, (af)d:2, (af)e:2, eab:2
CS: ea:3, (af)d:2, (af)e:2, eab:2
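The FS and CS sets of this example can be re-derived by brute force. The sketch below is not CloSpan itself: it takes the FS list from the slide as given, recounts each support by a plain subsequence test, and keeps the patterns that have no super-sequence in FS with the same support. The helper names `parse`, `contains` and `support` are assumptions made for illustration.

```python
DB = ["(af)dea", "eab", "e(abf)(bde)"]
FS = ["a", "b", "d", "e", "f", "ab", "ad", "ae", "(af)", "ea", "eb",
      "fd", "fe", "(af)d", "(af)e", "eab"]


def parse(text):
    """Turn slide notation such as '(af)dea' into a list of itemsets."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def contains(sequence, pattern):
    """Greedy left-to-right test that pattern is a subsequence of sequence."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(pattern):
    """Number of database sequences that contain the pattern."""
    return sum(contains(parse(s), parse(pattern)) for s in DB)


supports = {p: support(p) for p in FS}
# A pattern is closed if no other frequent pattern contains it with equal support.
closed = [p for p in FS
          if not any(q != p and supports[q] == supports[p]
                     and contains(parse(q), parse(p)) for q in FS)]
print(supports)
print(closed)  # ['ea', '(af)d', '(af)e', 'eab']
```

Checking closedness against FS alone is enough here, because any super-sequence with the same support is itself frequent.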

7 Introduction (Cont.)
Definition – Prefix and Postfix (Projection)
Given sequence <a(abc)(ac)d(cf)>:
Prefix  Postfix / Projection
<a>     <(abc)(ac)d(cf)>
<aa>    <(_bc)(ac)d(cf)>
<ab>    <(_c)(ac)d(cf)>
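The prefix/postfix table can be reproduced with a small projection routine. This is a minimal sketch, assuming the prefix is matched at its leftmost occurrence and that items left over in the last matched element are written with the `_` marker; `parse`, `show` and `postfix` are hypothetical helper names.

```python
def parse(text):
    """Split '(abc)(ac)d' into elements, each a string of sorted items."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append("".join(sorted(text[i + 1:j])))
            i = j + 1
        else:
            elements.append(text[i])
            i += 1
    return elements


def show(element):
    """Write a one-item element bare and a larger one in parentheses."""
    return element if len(element) == 1 else "(" + element + ")"


def postfix(sequence, prefix):
    """Postfix of sequence w.r.t. a non-empty prefix, or None if it does not occur."""
    seq, pre = parse(sequence), parse(prefix)
    pos = 0
    for element in pre:
        while pos < len(seq) and not set(element) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return None
        pos += 1
    # Items of the last matched element that sort after the prefix's last item.
    rest = "".join(item for item in seq[pos - 1] if item > max(pre[-1]))
    head = "(_" + rest + ")" if rest else ""
    return head + "".join(show(e) for e in seq[pos:])


for p in ["a", "aa", "ab"]:
    print(p, postfix("a(abc)(ac)d(cf)", p))
```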

8 Introduction (Cont.)
Definition
– A sequence s = <t_1, …, t_m> and an item α
– I-Step extension: s ◇_i α = <t_1, …, t_m ∪ {α}>
  Ex: <(ab)> is an I-Step extension of <(a)>
– S-Step extension: s ◇_s α = <t_1, …, t_m, {α}>
  Ex: <(a)(b)> is an S-Step extension of <(a)>
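The two extension steps are simple enough to write down directly. In this sketch a sequence is a tuple of elements and each element is a tuple of items in alphabetical order; the function names `i_step` and `s_step` are illustrative, not taken from the paper.

```python
def i_step(seq, item):
    """I-Step: add item to the last element of a non-empty sequence."""
    *head, last = seq
    if item <= last[-1]:
        raise ValueError("item must sort after the items of the last element")
    return (*head, last + (item,))


def s_step(seq, item):
    """S-Step: append item as a new one-item element."""
    return (*seq, (item,))


a = (("a",),)
print(i_step(a, "b"))               # (('a', 'b'),)        i.e. <(ab)>
print(s_step(a, "b"))               # (('a',), ('b',))     i.e. <(a)(b)>
print(i_step(s_step(a, "b"), "c"))  # (('a',), ('b', 'c')) i.e. <(a)(bc)>
```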

9 Introduction (Cont.)
Definition – Prefix Search Tree
(x_i labels an I-Step extension by item x, x_s an S-Step extension)
<>
  a_s: <(a)>
    b_i: <(ab)>
      a_s: <(ab)(a)>
      b_s: <(ab)(b)>
    a_s: <(a)(a)>
    b_s: <(a)(b)>
      c_i: <(a)(bc)>
      d_i: <(a)(bd)>
  b_s: <(b)>
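A depth-first walk of the prefix search tree can be sketched as follows. This is plain generate-and-test on the three-sequence example database used elsewhere in the deck, with support counted by rescanning the whole database; it uses neither projected databases nor any CloSpan pruning, and all names are illustrative.

```python
DB = [
    [{"a", "f"}, {"d"}, {"e"}, {"a"}],           # (af)dea
    [{"e"}, {"a"}, {"b"}],                       # eab
    [{"e"}, {"a", "b", "f"}, {"b", "d", "e"}],   # e(abf)(bde)
]
MIN_SUP = 2
ITEMS = sorted(set().union(*(element for seq in DB for element in seq)))


def contains(sequence, pattern):
    """Greedy test that pattern (tuple of item tuples) occurs in sequence."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not set(element) <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(pattern):
    return sum(contains(seq, pattern) for seq in DB)


def show(pattern):
    return "".join(e[0] if len(e) == 1 else "(" + "".join(e) + ")" for e in pattern)


def grow(pattern, found):
    """Try every I-Step child, then every S-Step child, and recurse on frequent ones."""
    i_steps = [pattern[:-1] + (pattern[-1] + (x,),)
               for x in ITEMS if pattern and x > pattern[-1][-1]]
    s_steps = [pattern + ((x,),) for x in ITEMS]
    for child in i_steps + s_steps:
        count = support(child)
        if count >= MIN_SUP:
            found[show(child)] = count
            grow(child, found)
    return found


frequent = grow((), {})
print(len(frequent))  # 16
print(frequent)
```

Because every sequence is reached by exactly one chain of I-Step and S-Step extensions, each frequent pattern is visited once.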

10 Search Space Pruning
Definition
– Common Prefix
  Example
  – D_s = {de(af), de(fg)}
  – All sequences in D_s share a common prefix, so an extension of s that does not start with this prefix is not closed and it is unnecessary to extend it.
– Partial Order
  Example
  – Before projecting D into D_a, D_b, D_d, D_e, D_f
  – a always occurs before f in all the sequences
  – Need not search any sequence beginning with f

11 Search Space Pruning (Cont.)
Definition
– I(D): total number of items in D
– Equivalence of Projected Database
  For two sequences s and s', s ⊑ s':
  D_s = D_s' ⇔ I(D_s) = I(D_s')
Example
– D_(af) = D_f = {de, (de)}
– I(D_(af)) = I(D_f) = 4
ID  Sequence
0   (af)dea
1   eab
2   e(abf)(bde)
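The size I(D) is just an item count, so the equivalence test reduces to comparing two integers. The sketch below counts the items of projected databases written in the slide notation; `size` is a hypothetical helper, and the databases are the pruned ones shown in the worked example later in the deck.

```python
def size(projected_db):
    """I(D): total number of items, ignoring the '(', ')' and '_' markers."""
    return sum(ch.isalpha() for seq in projected_db for ch in seq)


D_af = ["de", "(de)"]              # projected database of (af)
D_f = ["de", "(de)"]               # projected database of f
D_a = ["(_f)de", "b", "(_f)(bde)"]
D_e = ["a", "ab", "(ab)b"]

print(size(D_af), size(D_f))       # 4 4  -> (af) and f are equivalent
print(size(D_a), size(D_e))        # 8 6
```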

12 Search Space Pruning (Cont.)
Definition
– Early Termination by Equivalence
  Two sequences s and s', s ⊑ s'
  And also I(D_s) = I(D_s')
  Then ∀γ, support(s ◇ γ) = support(s' ◇ γ)
Example
– I(D_(af)) = I(D_f)
– (af)d and (af)e are frequent
– support((af)d) = support(fd)
– support((af)e) = support(fe)
– So the supports of fd and fe are known without searching D_f

13 Search Space Pruning (Cont.)
Definition
– Backward Sub-Pattern
  Sequence s < s' and s ⊐ s'
  I(D_s) = I(D_s')
  Stop searching any descendant of s' in the prefix search tree
(Figure: s = (af), s' = f)

14 Search Space Pruning (Cont.)
Definition
– Backward Super-Pattern
  Sequence s < s' and s ⊏ s'
  I(D_s) = I(D_s')
  Transplant the descendants of s to s' instead of searching any descendant of s' in the prefix search tree
(Figure: s = b, s' = eb)

15 Search Space Pruning (Cont.)
Definition
– Partial Prefix Sequence Lattice
(Figure: the search space drawn as a lattice rooted at <>, with edges a_s, f_i, f_s, e_s, b_s, d_s)
I(D_eb) = I(D_b)
I(D_eab) = I(D_ab)
I(D_af) = I(D_f)

16 CloSpan
CloSpan(s, D_s, min_sup, L)
– Input: a sequence s, a projected DB D_s, and min_sup
– Output: the prefix search lattice L
– Check whether a discovered sequence s' exists such that either s ⊑ s' or s' ⊑ s, and I(D_s) = I(D_s');
– if such a super-pattern or sub-pattern exists then
    modify the link in L, return;
– else insert s into L;
– scan D_s once, find every frequent item α such that
    s can be extended to (s ◇_i α), or
    s can be extended to (s ◇_s α);
– if no valid α is available then
    return;
– for each valid α do (I-Step)
    call CloSpan(s ◇_i α, D_{s ◇_i α}, min_sup, L);
– for each valid α do (S-Step)
    call CloSpan(s ◇_s α, D_{s ◇_s α}, min_sup, L);
– return;

17 CloSpan (Cont.)
Hash for Fast Condition Checking
(Figure: the prefix search lattice rooted at <>, with edges f_i, a_s, e_s, b_s, d_s, next to a hash table whose empty buckets are nil)

18 CloSpan (Cont.)
Example (min_sup_count = 2)
ID  Sequence
0   (af)dea
1   eab
2   e(abf)(bde)
Hash function: I(D_s) mod 4
Frequent items: a:3, b:2, d:2, e:3, f:2
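The hash-based condition check can be sketched as a small index keyed by I(D_s) mod 4. This is an assumption-laden illustration, not the authors' data structure: `PatternIndex` and its methods are hypothetical names, patterns are tuples of frozensets, and the sizes 8 and 4 are the I(D_s) values the worked example gives for a and (af).

```python
from collections import defaultdict


def contains(sequence, pattern):
    """Greedy test that pattern is a subsequence of sequence."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


class PatternIndex:
    """Hypothetical helper: buckets discovered patterns by I(D_s) mod 4."""

    def __init__(self, buckets=4):
        self.buckets = buckets
        self.table = defaultdict(list)

    def insert(self, pattern, size):
        self.table[size % self.buckets].append((pattern, size))

    def find_equivalent(self, pattern, size):
        """Return a stored sub- or super-pattern with the same I(D_s), if any."""
        for other, other_size in self.table[size % self.buckets]:
            if other_size == size and (contains(other, pattern)
                                       or contains(pattern, other)):
                return other
        return None


A = (frozenset("a"),)
AF = (frozenset("af"),)
F = (frozenset("f"),)
E = (frozenset("e"),)

index = PatternIndex()
index.insert(A, 8)    # I(D_a) = 8, bucket 0
index.insert(AF, 4)   # I(D_(af)) = 4, bucket 0
print(index.find_equivalent(F, 4) == AF)  # True: f is pruned against (af)
print(index.find_equivalent(E, 6))        # None: e is inserted as a new node
```

Only patterns in the same bucket are compared, so the containment test runs on a small candidate list instead of the whole lattice.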

19 CloSpan (Cont.)
Example (Cont.)
Projected databases:
D_a = {(_f)dea, b, (_bf)(bde)}
D_b = {(_f)(bde)}
D_d = {ea, (_e)}
D_e = {a, ab, (abf)(bde)}
D_f = {dea, (bde)}
After removing infrequent items:
D_a = {(_f)de, b, (_f)(bde)}, I(D_a) = 8, frequent items (_f):2, b:2, d:2, e:2
D_b = {}, I(D_b) = 0
D_d = {}, I(D_d) = 0
D_e = {a, ab, (ab)b}, I(D_e) = 6, frequent items a:3, b:2
D_f = {de, (de)}, I(D_f) = 4, frequent items d:2, e:2
Hash table: buckets 0 to 3 are all nil.

20 CloSpan (Cont.)
Example (Cont.)
D_a = {(_f)de, b, (_f)(bde)}, I(D_a) = 8, frequent items (_f):2, b:2, d:2, e:2
8 mod 4 = 0: key 8 goes into bucket 0.
Tree: <> → a_s:3

21 CloSpan (Cont.)
Example (Cont.)
Projecting D_a: D_(af) = {de, (bde)}, D_ab = {de}, D_ad = {e, e}, D_ae = {}
After removing infrequent items:
D_(af) = {de, (de)}, I = 4, frequent items d:2, e:2
D_ab = {}, I = 0
D_ad = {e, e}, I = 2, frequent items e:2
D_ae = {}, I = 0

22 CloSpan (Cont.)
Example (Cont.)
D_(af) = {de, (de)}, I(D_(af)) = 4, frequent items d:2, e:2
4 mod 4 = 0: key 4 goes into bucket 0, after key 8.
Tree: <> → a_s:3 → f_i:2

23 CloSpan (Cont.)
Example (Cont.)
Projecting D_(af): D_(af)d = {e, (_e)}, D_(af)e = {}
After removing infrequent items: D_(af)d = {}, I = 0; D_(af)e = {}, I = 0

24 CloSpan (Cont.)
Example (Cont.)
D_(af)d = {}, I = 0, 0 mod 4 = 0: key 0 goes into bucket 0.
Tree: <> → a_s:3 → f_i:2 → d_s:2

25 CloSpan (Cont.)
Example (Cont.)
D_(af)e = {}, I = 0, 0 mod 4 = 0: key 0 goes into bucket 0.
Tree: <> → a_s:3 → f_i:2 → {d_s:2, e_s:2}

26 CloSpan (Cont.)
Example (Cont.)
D_ab = {}, I = 0, 0 mod 4 = 0: key 0 goes into bucket 0.
Tree: <> → a_s:3 → {f_i:2 → {d_s:2, e_s:2}, b_s:2}

27 CloSpan (Cont.)
Example (Cont.)
D_b = {}, I = 0, 0 mod 4 = 0. The tree is unchanged.

28 CloSpan (Cont.)
Example (Cont.)
D_d = {}, I = 0, 0 mod 4 = 0. The tree is unchanged.

29 CloSpan (Cont.)
Example (Cont.)
D_e = {a, ab, (ab)b}, I(D_e) = 6, frequent items a:3, b:2
6 mod 4 = 2: key 6 goes into bucket 2.
Tree: <> → {a_s:3 → {f_i:2 → {d_s:2, e_s:2}, b_s:2}, e_s:3}

30 CloSpan (Cont.)
Example (Cont.)
Projecting D_e: D_ea = {b, (_b)b}
After removing infrequent items: D_ea = {b, b}, I = 2, frequent items b:2; D_eb = {}, I = 0

31 CloSpan (Cont.)
Example (Cont.)
D_ea = {b, b}, I(D_ea) = 2, frequent items b:2
2 mod 4 = 2: key 2 goes into bucket 2, after key 6.
Tree: <> → {a_s:3 → {f_i:2 → {d_s:2, e_s:2}, b_s:2}, e_s:3 → a_s:3}

32 CloSpan (Cont.)
Example (Cont.)
D_eab = {}, I(D_eab) = 0, 0 mod 4 = 0

33 CloSpan (Cont.)
Example (Cont.)
D_eab = {}, I(D_eab) = 0, 0 mod 4 = 0: bucket 0 is checked.
Tree: <> → {a_s:3 → {f_i:2 → {d_s:2, e_s:2}, b_s:2}, e_s:3 → a_s:3}

34 CloSpan (Cont.)
Example (Cont.)
D_eab = {}, I(D_eab) = 0, 0 mod 4 = 0
Tree: <> → {a_s:3 → {f_i:2 → {d_s:2, e_s:2}, b_s:2}, e_s:3 → a_s:3 → b_s:2}

35 CloSpan (Cont.)
Example (Cont.)
D_eb = {}, I(D_eb) = 0, 0 mod 4 = 0: bucket 0 is checked.

36 CloSpan (Cont.)
Example (Cont.)
D_eb = {}, I(D_eb) = 0, 0 mod 4 = 0. The tree is unchanged.

37 CloSpan (Cont.)
Example (Cont.)
D_f = {de, (de)}, I(D_f) = 4, frequent items d:2, e:2
4 mod 4 = 0: bucket 0 is checked.

38 CloSpan (Cont.)
Example (Cont.)
D_f = {de, (de)}, I(D_f) = 4, 4 mod 4 = 0. The tree is unchanged.

39 CloSpan (Cont.)
Example (Cont.)
Final tree: <> → {a_s:3 → {f_i:2 → {d_s:2, e_s:2}, b_s:2}, e_s:3 → a_s:3 → b_s:2}
Closed sequential patterns: (af)d:2, (af)e:2, eab:2, ea:3

40 Experimental Results
Synthetic Data
– Parameters
  D: number of sequences (in thousands)
  C: average itemsets per sequence
  T: average items per itemset
  N: number of different items (in thousands)
  S: average itemsets in maximal sequences
  I: average items in maximal sequences
– Two data sets
  D10 C10 T2.5 N10 S6 I2.5
  D5 C20 T20 N10 S20 I20
Real world datasets
– KDDCup2000
– Gazelle Click Stream

41 Experimental Results (Cont.)
Synthetic Data: D10 C10 T2.5 N10 S6 I2.5

42 Experimental Results (Cont.)
Synthetic Data: D5 C20 T20 N10 S20 I20

43 Experimental Results (Cont.)
Real world datasets
– KDDCup2000
  29,369 sequences
  35,722 sessions
  87,546 page views
  The average number of sessions in a sequence is around 1
  The average number of page views in a session is 2
  The largest session contains 342 views
  The longest sequence has 140 sessions
  The largest sequence contains 651 page views

44 Experimental Results (Cont.)

45 Conclusions
– CloSpan mines frequent closed sequences efficiently.
– CloSpan outperforms PrefixSpan.

46 Lexicographic Order
Definition – Lexicographic Order
– t = {i_1, i_2, …, i_k}, i_1 ≤ i_2 ≤ … ≤ i_k
– t' = {j_1, j_2, …, j_l}, j_1 ≤ j_2 ≤ … ≤ j_l
– t < t' iff either of the following is true:
  – for some h, 0 ≤ h ≤ min{k, l}, we have i_r = j_r for r < h, and i_h < j_h, or
  – k < l, and i_1 = j_1, i_2 = j_2, …, i_k = j_k
Example
– (a,f) < (b,f)
– (a,b) < (a,b,c)
– (a,b,c) < (b,c)
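The itemset order defined here is the usual dictionary order on sorted item lists. A minimal sketch, with `itemset_less` as an illustrative name and itemsets written as sorted tuples:

```python
def itemset_less(t, u):
    """t < u: the first differing item decides; otherwise the shorter set is smaller."""
    for x, y in zip(t, u):
        if x != y:
            return x < y
    return len(t) < len(u)


print(itemset_less(("a", "f"), ("b", "f")))       # True
print(itemset_less(("a", "b"), ("a", "b", "c")))  # True
print(itemset_less(("a", "b", "c"), ("b", "c")))  # True
```

Python's built-in tuple comparison follows the same rule, so `t < u` on sorted tuples gives the same answers.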

47 Sequence Lexicographic Order
Definition – Sequence Lexicographic Order
– If s' = s ◇ p, then s < s'
– If s = α ◇_i p and s' = α ◇_s p', then s < s' no matter what the order relation between p and p' is
– If s = α ◇_i p and s' = α ◇_i p', then p < p' indicates s < s'
– If s = α ◇_s p and s' = α ◇_s p', then p < p' indicates s < s'
Example
– (ab) < (ab)(a)
– (ac) < (a)(d), (ad) < (a)(c)
– (ab) < (ac)
– (a)(b) < (a)(c)
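One way to read this definition is to flatten each sequence into the chain of extension steps that builds it and compare the chains. The sketch below assumes that reading: every item becomes a (step type, item) pair, I-Step sorts before S-Step, and a proper prefix is smaller. The names `key` and `seq_less` are illustrative.

```python
I_STEP, S_STEP = 0, 1  # an I-Step extension sorts before an S-Step extension


def key(seq):
    """Flatten a sequence into (step type, item) pairs in extension order."""
    steps = []
    for element in seq:
        for n, item in enumerate(element):
            steps.append((S_STEP if n == 0 else I_STEP, item))
    return steps


def seq_less(s, t):
    """s < t under the sequence lexicographic order, via list comparison."""
    return key(s) < key(t)


print(seq_less([("a", "b")], [("a", "b"), ("a",)]))  # (ab) < (ab)(a)
print(seq_less([("a", "c")], [("a",), ("d",)]))      # (ac) < (a)(d)
print(seq_less([("a", "d")], [("a",), ("c",)]))      # (ad) < (a)(c)
print(seq_less([("a", "b")], [("a", "c")]))          # (ab) < (ac)
print(seq_less([("a",), ("b",)], [("a",), ("c",)]))  # (a)(b) < (a)(c)
```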

48 Lexicographic Sequence Tree
Definition – Lexicographic Sequence Tree
<>
  <(a)>
    <(ab)>
      <(ab)(a)>
      <(ab)(b)>
    <(a)(a)>
    <(a)(b)>
      <(a)(bc)>
      <(a)(bd)>
  <(b)>

49 Search Space Pruning
Definition – Common Prefix
– Given a subsequence s and its projected database D_s:
– if there exists α that is a common prefix of all the sequences with the same extension type (either itemset-extension or sequence-extension) in D_s,
– then for every β, if s ◇ β is closed, α must be a prefix of β;
– so we need not search s ◇ β and its descendants, except for the branch of s ◇ α.
Example
– D_s = {de(af), de(fg)}
– An extension of s that does not start with the common prefix is not closed, so it is unnecessary to extend it.

50 Search Space Pruning (Cont.)
CommonPrefix
– An intermediate algorithm
– It adopts the PrefixSpan framework plus the common prefix pruning technique
– It outperforms PrefixSpan

51 Search Space Pruning (Cont.)
Definition – Partial Order
– Given a sequence s and its projected database D_s:
– if among all the sequences in D_s an item α always occurs before an item β (either in the same itemset for all sequences in D_s or in a different itemset, but not both), then D_{s ◇ α ◇ β} = D_{s ◇ β};
– therefore s ◇ β is not closed, and we need not search any sequence in the branch of s ◇ β.
Example
– Before projecting D into D_a, D_b, D_d, D_e, D_f
– a always occurs before f in all the sequences
– Need not search any sequence beginning with f

