Published by Brice Owens. Modified over 9 years ago.
1 An Efficient Algorithm for Mining Frequent Sequences by a New Strategy without Support Counting
Ding-Ying Chiu, Yi-Hung Wu, Arbee L.P. Chen
ICDE 2004
Speaker: Ming Jing Tsai
2 Strategies
Candidate pruning
Database partitioning
Customer reducing
DISC: Direct Sequence Comparison
  Reducing the costs for support counting
  Reducing decomposition of customer sequences
3 Order of sequences
For two sequences having a common prefix, identify the leftmost items located in different transactions.
Examine the leftmost distinct items in alphabetic order.
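The ordering can be sketched as a sort key. This is a minimal illustration, assuming that a sequence is a list of transactions (tuples of items) and that two sequences of equal length are compared position by position, first by item in alphabetic order and then by transaction number. The name `order_key` and the data layout are assumptions made for this example, not notation from the slides.

```python
def order_key(seq):
    """Flatten a sequence into (item, transaction number) pairs.

    Tuple comparison then checks the leftmost differing position:
    the item decides first (alphabetic order), and for equal items
    the smaller transaction number wins. Assumed reading of the slide.
    """
    return tuple(
        (item, t)
        for t, txn in enumerate(seq, start=1)
        for item in sorted(txn)
    )


def show(seq):
    """Render a sequence in the slide notation, e.g. (b,f)(b)."""
    return "".join("(" + ",".join(sorted(txn)) + ")" for txn in seq)


# Four 3-sequences taken from the example tables in the slides.
examples = [
    [("b",), ("f",), ("b",)],
    [("b", "f", "g")],
    [("b", "f"), ("b",)],
    [("b",), ("d",), ("e",)],
]
ordered = [show(s) for s in sorted(examples, key=order_key)]
print(ordered)
```

Under this reading, a sequence that keeps an item in the same transaction sorts before one that moves the same item into a later transaction.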
4 DISC
Frequent k-sequences
CID  Customer Sequences          3-minimum Subsequences
1    (a,e,g)(b)(h)(f)(c)(b,f)    (a)(b)(b)
2    (b)(d,f)(e)                 (b)(d)(e)
3    (b,f,g)                     (b,f,g)
4    (f)(a,g)(b,f,h)(b,f)        (a)(b)(b)
5 3-sorted database
CID  Customer Sequences          3-minimum Subsequences
1    (a,e,g)(b)(h)(f)(c)(b,f)    (a)(b)(b)
4    (f)(a,g)(b,f,h)(b,f)        (a)(b)(b)
2    (b)(d,f)(e)                 (b)(d)(e)
3    (b,f,g)                     (b,f,g)
6 Compare α1 and αδ
α1: the k-minimum subsequence at the first position of the k-sorted database
αδ: the k-minimum subsequence at the δ-th position (conditional k-minimum sequence)
If α1 = αδ, α1 is frequent; the next potential frequent k-sequence is > αδ.
If α1 ≠ αδ, α1 is not frequent; the next potential frequent k-sequence is ≥ αδ.
7 Re-sorting the 3-sorted database
CID  Customer Sequences          3-minimum Subsequences
2    (b)(d,f)(e)                 (b)(d)(e)
4    (f)(a,g)(b,f,h)(b,f)        (b,f)(b)
3    (b,f,g)                     (b,f,g)
1    (a,e,g)(b)(h)(f)(c)(b,f)    (b)(f)(b)
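The compare-and-re-sort loop can be sketched as follows. This is a brute-force illustration, not the paper's implementation: k-minimum subsequences are found by enumerating every k-subsequence, and the function names (`k_subsequences`, `cond_min`, `disc`, `show`) are hypothetical. The database is the four-customer example from the slides; the minimum support δ for that example is not stated there, so it is left as a parameter.

```python
from itertools import combinations

# Four-customer example database from the slides (CID -> customer sequence).
DB = {
    1: [("a", "e", "g"), ("b",), ("h",), ("f",), ("c",), ("b", "f")],
    2: [("b",), ("d", "f"), ("e",)],
    3: [("b", "f", "g")],
    4: [("f",), ("a", "g"), ("b", "f", "h"), ("b", "f")],
}


def k_subsequences(seq, k):
    """All k-subsequences, each as a tuple of (item, transaction number) pairs."""
    positions = [(item, t) for t, txn in enumerate(seq) for item in sorted(txn)]
    keys = set()
    for combo in combinations(positions, k):
        renumber = {}  # original transaction index -> 1, 2, 3, ... in the subsequence
        keys.add(tuple((item, renumber.setdefault(t, len(renumber) + 1))
                       for item, t in combo))
    return keys


def cond_min(seq, k, bound=None, strict=False):
    """Smallest k-subsequence >= bound (> bound if strict); None if none exists."""
    cands = [s for s in k_subsequences(seq, k)
             if bound is None or s > bound or (s == bound and not strict)]
    return min(cands, default=None)


def disc(db, k, delta):
    """Frequent k-sequences found by comparing position 1 with position delta."""
    entries = []
    for cid, seq in db.items():
        key = cond_min(seq, k)
        if key is not None:
            entries.append((key, cid))
    entries.sort()
    frequent = []
    while len(entries) >= delta:
        alpha_1, alpha_d = entries[0][0], entries[delta - 1][0]
        strict = alpha_1 == alpha_d          # equal: alpha_1 is frequent
        if strict:
            frequent.append(alpha_1)
        advanced = []
        for key, cid in entries:
            # Customers below the bound move to their conditional minimum.
            if key < alpha_d or (strict and key == alpha_d):
                key = cond_min(db[cid], k, alpha_d, strict)
            if key is not None:
                advanced.append((key, cid))
        entries = sorted(advanced)           # re-sort the k-sorted database
    return frequent


def show(key):
    """Render a key in the slide notation, e.g. (b,f)(b)."""
    groups = {}
    for item, t in key:
        groups.setdefault(t, []).append(item)
    return "".join("(" + ",".join(items) + ")" for items in groups.values())


print([show(s) for s in disc(DB, k=3, delta=2)])
```

With (b)(d)(e) as the bound, `cond_min` returns (b)(f)(b) for customer 1 and (b,f)(b) for customer 4, which agrees with the re-sorted table. No candidate set is generated and no support counter is kept; a sequence is reported only when the first and δ-th entries coincide.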
8 Advantages
No candidate sequence is generated.
The cost of decomposing customer sequences is reduced.
Frequent k-sequences can be discovered directly.
9 DISC_ALL
10 Running example (δ=3)
CID  Customer Sequences
1    (a,d)(d)(a,g,h)(c)
2    (b)(a)(f)(a,c,e,g)(c)
3    (a,g)
4    (a,f,g)(a,e,g,h)(c,g,h)
5    (b,f)(b,e)(e,f,h)
6    (d,f)(d,f,g,h)
7    (b,f,g)(c,e,h)
Item supports: a 4, b 3, c 4, d 2, e 4, f 5, g 6, h 5
First-level partition (figure labels): (a) (b) (a) (b) (d)
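The item supports of the running example can be recomputed with a short sketch. It assumes that an item's support is the number of customers whose sequence contains it, and that a customer belongs to the first-level partition of its smallest item; both are readings of the slide, and the names `item_supports` and `partition_a` are made up for this example.

```python
from collections import Counter

# Running example from the slide (CID -> customer sequence).
DB = {
    1: [("a", "d"), ("d",), ("a", "g", "h"), ("c",)],
    2: [("b",), ("a",), ("f",), ("a", "c", "e", "g"), ("c",)],
    3: [("a", "g")],
    4: [("a", "f", "g"), ("a", "e", "g", "h"), ("c", "g", "h")],
    5: [("b", "f"), ("b", "e"), ("e", "f", "h")],
    6: [("d", "f"), ("d", "f", "g", "h")],
    7: [("b", "f", "g"), ("c", "e", "h")],
}
DELTA = 3  # minimum support given on the slide


def item_supports(db):
    """Count, for every item, the number of customers that contain it."""
    sup = Counter()
    for seq in db.values():
        sup.update({item for txn in seq for item in txn})  # once per customer
    return sup


supports = item_supports(DB)
frequent_items = sorted(i for i, s in supports.items() if s >= DELTA)

# Assumed partitioning rule: group customers by their smallest item.
partition_a = [cid for cid, seq in DB.items()
               if min(item for txn in seq for item in txn) == "a"]

print(dict(sorted(supports.items())))
print(frequent_items, partition_a)
```

Item d has support 2, below δ=3, so it is the only infrequent item; customers 1 to 4 form the partition for λ=a.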
11 First-level partition 1 (λ=a, δ=3)
CID  Customer Sequences
1    (a,d)(d)(a,g,h)(c)
2    (b)(a)(f)(a,c,e,g)(c)
3    (a,g)
4    (a,f,g)(a,e,g,h)(c,g,h)
Counting arrays Sup and Last_CID over (a)(b)(c)(d)(e)(f)(g)(h) and over (_a)(_b)(_c)(_d)(_e)(_f)(_g)(_h)
Array values shown in the figure: 30312132 30313233 00112152 00213353
Frequent 2-sequences: (a)(a), (a)(c), (a)(g), (a,g)
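A counting pass over this partition can be sketched with two arrays, a support counter and the last customer ID that touched each entry, so that a customer is counted at most once per pattern. This is an assumed reading of the Sup and Last_CID rows; the function `count_2_sequences` and the string encoding of patterns are illustrative only.

```python
# Partition for lambda = a from the slide (CID -> customer sequence).
PARTITION_A = {
    1: [("a", "d"), ("d",), ("a", "g", "h"), ("c",)],
    2: [("b",), ("a",), ("f",), ("a", "c", "e", "g"), ("c",)],
    3: [("a", "g")],
    4: [("a", "f", "g"), ("a", "e", "g", "h"), ("c", "g", "h")],
}


def count_2_sequences(partition, lam):
    """Support of every 2-sequence (lam)(x) and (lam,x) in the partition."""
    sup, last_cid = {}, {}

    def touch(pattern, cid):
        # Count each pattern once per customer, tracked through last_cid.
        if last_cid.get(pattern) != cid:
            last_cid[pattern] = cid
            sup[pattern] = sup.get(pattern, 0) + 1

    for cid, seq in partition.items():
        # Every customer in the partition is assumed to contain lam.
        first = next(t for t, txn in enumerate(seq) if lam in txn)
        for t, txn in enumerate(seq):
            for x in txn:
                if lam in txn and x > lam:
                    touch(f"({lam},{x})", cid)   # x in the same transaction
                if t > first:
                    touch(f"({lam})({x})", cid)  # x in a later transaction
    return sup


sup = count_2_sequences(PARTITION_A, "a")
frequent_2 = sorted(p for p, s in sup.items() if s >= 3)
print(frequent_2)
```

With δ=3 the patterns that survive are (a)(a), (a)(c), (a)(g) and (a,g), the same four frequent 2-sequences listed on the slide.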
12 Whether an item x to the right of the min point can be removed or not
Condition 1: The transaction having x contains λ.
Condition 2: The min point is to the left of the transaction having x.
x can be removed when:
Condition 1 does not hold, and […] is not frequent.
Condition 1 holds, condition 2 does not hold, and […] is not frequent.
Conditions 1 and 2 both hold, and […] and […] are not frequent.
13 DISC (λ=(a), δ=3)
CID  3-minimum subsequences  Customer Sequences    Apriori pointer
1    (a)(a)(c)               (a)(a,g)(c)           1
2    (a)(a,c)                (b)(a)(a,c,g)(c)      1
4    (a)(a)(c)               (a,g)(a,g)(c,g)       1
The 2-sorted list
No  Frequent 2-sequences
1   (a)(a)
2   (a)(c)
3   (a)(g)
4   (a,g)
3-order DB (CID): 2 (a)(a,c); 1 (a)(a)(c); 4 (a)(a,g)
3-order DB (CID): 1 (a)(a)(c); 4; 2 (a)(a,g)
Frequent 3-sequences: (a)(a,g)
Removed: (a)(c,g)
Apriori pointers after the update: 2 2 2
14 Bi-level
          (a) (b) (c) (d) (e) (f) (g) (h)
Sup        0   0   3   0   0   0   1   0
Last_CID   0   0   4   0   0   0   4   0
          (_a)(_b)(_c)(_d)(_e)(_f)(_g)(_h)
Sup        0   0   0   0   0   0   0   0
Last_CID   0   0   0   0   0   0   0   0
CID  Customer Sequences
1    (a)(a,g)(c)
2    (b)(a)(a,c,g)(c)
4    (a,g)(a,g)(c,g)
Frequent 4-sequence: (a)(a,g)(c)
15 First-level partition 2
First-level partitioning
CID  Customer Sequences
1    (a,d)(d)(a,g,h)(c)
2    (b)(a)(f)(a,c,e,g)(c)
3    (a,g)
4    (a,f,g)(a,e,g,h)(c,g,h)
5    (b,f)(b,e)(e,f,h)
6    (d,f)(d,f,g,h)
7    (b,f,g)(c,e,h)
Figure labels: (c) (b) removed (c) (b) (d) (b)
16 Experiment
Intel P4 2.8 GHz with 512 MB main memory
Windows XP
IBM data generator
Compared with PrefixSpan; the pseudo-projection version is named Pseudo
17 Parameter
18 Different database sizes
δ = 0.0025
19 Different minimum supports
DB=10k, Slen=8, Tlen=8, Seq.patlen=8
20 Multi-level partitioning
DB=10k
$\mathrm{NRR}_Q = \frac{1}{N_Q} \sum_{P} \frac{\mathrm{Size}_P}{\mathrm{Size}_Q}$, where P is a child partition of Q
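As a worked reading of the NRR formula, the sketch below averages the size ratios of the child partitions. The slide does not define N_Q; taking it as the number of child partitions of Q is an assumption made here, and the sizes in the example are invented for illustration, not measurements from the experiments.

```python
def nrr(size_q, child_sizes):
    """NRR_Q = (1 / N_Q) * sum over child partitions P of Size_P / Size_Q.

    ASSUMPTION: N_Q is the number of child partitions of Q.
    """
    n_q = len(child_sizes)
    return sum(size_p / size_q for size_p in child_sizes) / n_q


# Hypothetical sizes: a partition of 4 customers with children of 3 and 1.
example = nrr(4, [3, 1])
print(example)
```

Under this reading, a value close to 1 means the child partitions are nearly as large as their parent, and a small value means partitioning shrinks the data quickly.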
21 Dynamic DISC-all
Customers = 50k, Items = 1000
θ: number of transactions per customer
22 Comparison on different θ