An Efficient Algorithm for Mining Frequent Sequences by a New Strategy without Support Counting. Ding-Ying Chiu, Yi-Hung Wu, Arbee L.P. Chen. ICDE 2004. Speaker: Ming Jing Tsai


1 An Efficient Algorithm for Mining Frequent Sequences by a New Strategy without Support Counting. Ding-Ying Chiu, Yi-Hung Wu, Arbee L.P. Chen. ICDE 2004. Speaker: Ming Jing Tsai

2 Strategies
- Candidate pruning
- Database partitioning
- Customer reducing
- DISC: Direct Sequence Comparison
Reducing the cost of support counting
Reducing the decomposition of customer sequences

3 Order of sequences
- Identify the leftmost items located in different transactions in two sequences that have a common prefix
- Examine the leftmost distinct items in alphabetical order
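The ordering can be sketched in Python. This is a minimal sketch under an assumed reading of the rule: two sequences are compared position by position, first by item in alphabetical order and, when the items coincide, by the number of the transaction that holds the item. That reading reproduces the re-sorted order shown on slide 7, but it is an interpretation, and the names `order_key` and `precedes` are illustrative rather than taken from the paper.

```python
from typing import List, Tuple

Sequence = List[Tuple[str, ...]]  # a sequence is a list of transactions (itemsets)


def order_key(seq: Sequence) -> List[Tuple[str, int]]:
    """Flatten a sequence into (item, transaction number) pairs.

    Python compares lists of tuples lexicographically, so the key orders
    sequences by item first and by transaction number second (assumed rule).
    """
    return [(item, t) for t, txn in enumerate(seq, start=1) for item in sorted(txn)]


def precedes(alpha: Sequence, beta: Sequence) -> bool:
    """True if alpha comes strictly before beta in the assumed order."""
    return order_key(alpha) < order_key(beta)


# The four 3-sequences from the re-sorted database on slide 7, shuffled.
seqs = [
    [("b",), ("f",), ("b",)],
    [("b", "f", "g")],
    [("b", "f"), ("b",)],
    [("b",), ("d",), ("e",)],
]
ordered = sorted(seqs, key=order_key)
```

Sorting with `order_key` places (b)(d)(e) first, then (b,f)(b), (b,f,g), and (b)(f)(b).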

4 DISC: frequent k-sequences
CID  Customer sequence            3-minimum subsequence
1    (a,e,g)(b)(h)(f)(c)(b,f)     (a)(b)(b)
2    (b)(d,f)(e)                  (b)(d)(e)
3    (b,f,g)                      (b,f,g)
4    (f)(a,g)(b,f,h)(b,f)         (a)(b)(b)

5 3-sorted database
CID  Customer sequence            3-minimum subsequence
1    (a,e,g)(b)(h)(f)(c)(b,f)     (a)(b)(b)
4    (f)(a,g)(b,f,h)(b,f)         (a)(b)(b)
2    (b)(d,f)(e)                  (b)(d)(e)
3    (b,f,g)                      (b,f,g)
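As an illustration of how the 3-sorted database can be obtained, the sketch below finds each customer's k-minimum subsequence by brute force and then sorts the customers by it. Enumerating every k-item subsequence is a stand-in chosen for clarity, not the method of the paper, and the comparison key (item first, then transaction number) is an assumed reading of the sequence order. All function names are hypothetical.

```python
from itertools import combinations


def order_key(seq):
    """(item, transaction number) pairs; assumed comparison key for sequences."""
    return [(item, t) for t, txn in enumerate(seq, start=1) for item in sorted(txn)]


def k_subsequences(customer, k):
    """Yield every k-item subsequence of a customer sequence (brute force)."""
    flat = [(t, item) for t, txn in enumerate(customer) for item in sorted(txn)]
    for combo in combinations(flat, k):
        groups = {}
        for t, item in combo:
            groups.setdefault(t, []).append(item)  # regroup items by transaction
        yield [tuple(items) for _, items in sorted(groups.items())]


def k_minimum(customer, k):
    """Smallest k-subsequence under the assumed order, or None if there is none."""
    return min(k_subsequences(customer, k), key=order_key, default=None)


# Customer sequences from slide 4.
db = {
    1: [("a", "e", "g"), ("b",), ("h",), ("f",), ("c",), ("b", "f")],
    2: [("b",), ("d", "f"), ("e",)],
    3: [("b", "f", "g")],
    4: [("f",), ("a", "g"), ("b", "f", "h"), ("b", "f")],
}

minima = [(cid, k_minimum(seq, 3)) for cid, seq in db.items()]
# Customers with fewer than k items have no k-minimum and are left out.
k_sorted = sorted((p for p in minima if p[1] is not None), key=lambda p: order_key(p[1]))
```

With the data from slide 4, `k_sorted` lists the customers in the order 1, 4, 2, 3, matching the 3-sorted database.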

6 Compare α1 and αδ
- k-minimum subsequences in the k-sorted database: α1 is at the first position, αδ is at the δ-th position
- Conditional k-minimum sequence
- α1 = αδ: α1 is frequent; the next potential frequent k-sequence is > αδ
- α1 ≠ αδ: α1 is not frequent; the next potential frequent k-sequence is ≥ αδ
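The comparison rule can be written as a small function. The reasoning behind it: in a sorted list, if the first and δ-th entries are equal, at least δ customers share that minimum, so it is frequent; if they differ, fewer than δ customers have a minimum below αδ, so no k-sequence below αδ can reach support δ. The sketch assumes the list is already sorted and uses plain strings for the sequences, since only equality matters here. The function name and return shape are illustrative.

```python
def compare_first_and_delta(k_sorted, delta):
    """Apply the alpha_1 vs alpha_delta test to an already sorted list.

    Returns (is_frequent, bound). When is_frequent is True, the next potential
    frequent k-sequence must be strictly greater than bound; otherwise it must
    be greater than or equal to bound.
    """
    alpha_1 = k_sorted[0]
    alpha_delta = k_sorted[delta - 1]  # delta is 1-based on the slide
    return alpha_1 == alpha_delta, alpha_delta


# 3-minimum subsequences in the order of the 3-sorted database (slide 5).
k_sorted = ["(a)(b)(b)", "(a)(b)(b)", "(b)(d)(e)", "(b,f,g)"]

with_delta_3 = compare_first_and_delta(k_sorted, 3)
with_delta_2 = compare_first_and_delta(k_sorted, 2)
```

With δ = 3 the first sequence is not frequent and the search continues from (b)(d)(e); with δ = 2 it is frequent.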

7 Re-sorting the 3-sorted database
CID  Customer sequence            3-minimum subsequence
2    (b)(d,f)(e)                  (b)(d)(e)
4    (f)(a,g)(b,f,h)(b,f)         (b,f)(b)
3    (b,f,g)                      (b,f,g)
1    (a,e,g)(b)(h)(f)(c)(b,f)     (b)(f)(b)
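The compare-and-re-sort loop can be sketched end to end. This is a simplified illustration, not the paper's implementation: conditional minima are found by scanning a brute-force list of all k-subsequences per customer, and the comparison key (item first, then transaction number) is an assumed reading of the sequence order. The names `k_subsequence_keys`, `conditional_minimum`, and `disc` are hypothetical.

```python
from itertools import combinations


def k_subsequence_keys(customer, k):
    """Sorted, de-duplicated keys of all k-item subsequences (brute force).

    A key is a tuple of (item, transaction number) pairs, with transactions
    renumbered from 1 inside the subsequence.
    """
    flat = [(t, item) for t, txn in enumerate(customer) for item in sorted(txn)]
    keys = set()
    for combo in combinations(flat, k):
        rank = {t: i + 1 for i, t in enumerate(sorted({t for t, _ in combo}))}
        keys.add(tuple((item, rank[t]) for t, item in combo))
    return sorted(keys)


def conditional_minimum(keys, bound, strict):
    """Smallest key > bound (strict) or >= bound (not strict), else None."""
    for key in keys:
        if key > bound or (not strict and key == bound):
            return key
    return None


def disc(db, k, delta):
    """Frequent k-sequences (as keys) found without counting supports."""
    all_keys = {cid: k_subsequence_keys(seq, k) for cid, seq in db.items()}
    # The k-sorted database: (current minimum, CID), smallest minimum first.
    entries = sorted((keys[0], cid) for cid, keys in all_keys.items() if keys)
    frequent = []
    while len(entries) >= delta:
        alpha_1, alpha_delta = entries[0][0], entries[delta - 1][0]
        strict = alpha_1 == alpha_delta
        if strict:
            frequent.append(alpha_1)
        updated = []
        for key, cid in entries:
            # Only customers below the new bound need a conditional minimum.
            if key < alpha_delta or (strict and key == alpha_delta):
                key = conditional_minimum(all_keys[cid], alpha_delta, strict)
            if key is not None:
                updated.append((key, cid))
        entries = sorted(updated)  # re-sort, as on the slide
    return frequent


db = {
    1: [("a", "e", "g"), ("b",), ("h",), ("f",), ("c",), ("b", "f")],
    2: [("b",), ("d", "f"), ("e",)],
    3: [("b", "f", "g")],
    4: [("f",), ("a", "g"), ("b", "f", "h"), ("b", "f")],
}
frequent_with_delta_2 = disc(db, k=3, delta=2)
```

The smallest current minimum grows strictly in every round, so the loop ends once fewer than δ customers still have a candidate.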

8 Advantages
- No candidate sequences are generated
- The cost of decomposing customer sequences is reduced
- Frequent k-sequences can be discovered directly

9 DISC_ALL

10 Running example, δ = 3
CID  Customer sequence
1    (a,d)(d)(a,g,h)(c)
2    (b)(a)(f)(a,c,e,g)(c)
3    (a,g)
4    (a,f,g)(a,e,g,h)(c,g,h)
5    (b,f)(b,e)(e,f,h)
6    (d,f)(d,f,g,h)
7    (b,f,g)(c,e,h)
Item supports: a 4, b 3, c 4, d 2, e 4, f 5, g 6, h 5
First-level partition: (a) (b) (a) (b) (d)
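The item supports on this slide can be checked with a few lines of Python. The partitioning step in the sketch is an assumption: each customer is assigned to the partition of its smallest item, which is consistent with customers 1 to 4 forming the λ = a partition on the next slide. Function names are illustrative.

```python
from collections import Counter, defaultdict

# Customer sequences of the running example.
db = {
    1: [("a", "d"), ("d",), ("a", "g", "h"), ("c",)],
    2: [("b",), ("a",), ("f",), ("a", "c", "e", "g"), ("c",)],
    3: [("a", "g")],
    4: [("a", "f", "g"), ("a", "e", "g", "h"), ("c", "g", "h")],
    5: [("b", "f"), ("b", "e"), ("e", "f", "h")],
    6: [("d", "f"), ("d", "f", "g", "h")],
    7: [("b", "f", "g"), ("c", "e", "h")],
}


def item_supports(db):
    """Number of customers whose sequence contains each item."""
    counts = Counter()
    for seq in db.values():
        counts.update({item for txn in seq for item in txn})  # once per customer
    return counts


def first_level_partitions(db):
    """Group CIDs by the smallest item of each customer sequence (assumed rule)."""
    parts = defaultdict(list)
    for cid, seq in db.items():
        parts[min(item for txn in seq for item in txn)].append(cid)
    return dict(parts)


supports = item_supports(db)
partitions = first_level_partitions(db)
```

Under this assumption the partitions are a: customers 1 to 4, b: customers 5 and 7, d: customer 6.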

11 First-level partition 1: λ = a, δ = 3
CID  Customer sequence
1    (a,d)(d)(a,g,h)(c)
2    (b)(a)(f)(a,c,e,g)(c)
3    (a,g)
4    (a,f,g)(a,e,g,h)(c,g,h)

          (a)  (b)  (c)  (d)  (e)  (f)  (g)  (h)
Sup        3    0    3    1    2    1    3    2
Last_CID   3    0    3    1    3    2    3    3

          (_a) (_b) (_c) (_d) (_e) (_f) (_g) (_h)
Sup        0    0    1    1    2    1    5    2
Last_CID   0    0    2    1    3    3    5    3

Frequent 2-sequences: (a)(a), (a)(c), (a)(g), (a,g)
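A minimal sketch of the counting behind the Sup and Last_CID rows follows. It assumes that (x) stands for the 2-sequence (λ)(x), that (_x) stands for (λ,x), and that Last_CID records the last customer counted so that each customer contributes at most once. The function name and the dictionary layout are illustrative.

```python
def count_two_sequences(partition, lam):
    """Count supports of (lam)(x) and (lam, x) in one pass over a partition."""
    sup_seq, last_seq = {}, {}  # for (lam)(x), shown as (x) on the slide
    sup_set, last_set = {}, {}  # for (lam, x), shown as (_x) on the slide

    def bump(sup, last, x, cid):
        if last.get(x) != cid:  # Last_CID check: skip repeats within a customer
            sup[x] = sup.get(x, 0) + 1
            last[x] = cid

    for cid, seq in partition.items():
        seen_lam = False  # True once lam occurred in an earlier transaction
        for txn in seq:
            for x in txn:
                if seen_lam:
                    bump(sup_seq, last_seq, x, cid)
                if lam in txn and x > lam:
                    bump(sup_set, last_set, x, cid)
            seen_lam = seen_lam or lam in txn
    return sup_seq, last_seq, sup_set, last_set


# The lam = a partition of the running example.
partition_a = {
    1: [("a", "d"), ("d",), ("a", "g", "h"), ("c",)],
    2: [("b",), ("a",), ("f",), ("a", "c", "e", "g"), ("c",)],
    3: [("a", "g")],
    4: [("a", "f", "g"), ("a", "e", "g", "h"), ("c", "g", "h")],
}
sup_seq, last_seq, sup_set, last_set = count_two_sequences(partition_a, "a")
delta = 3
frequent_seq = sorted(x for x, s in sup_seq.items() if s >= delta)
frequent_set = sorted(x for x, s in sup_set.items() if s >= delta)
```

With δ = 3 this yields (a)(a), (a)(c), (a)(g) and (a,g), the frequent 2-sequences listed on the slide.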

12 Whether an item x to the right of the min point can be removed
- Condition 1: the transaction containing x also contains λ
- Condition 2: the min point is to the left of the transaction containing x
- x can be removed if:
  - Condition 1 does not hold, and (λ)(x) is not frequent
  - Condition 1 holds, condition 2 does not hold, and (λ,x) is not frequent
  - Conditions 1 and 2 both hold, and neither (λ)(x) nor (λ,x) is frequent
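The removal conditions can be sketched as a filter over one customer sequence. The sketch assumes that the min point is the first transaction containing λ, that the 2-sequences tested are (λ)(x) and (λ,x), and that items to the left of the min point are left untouched, as in the reduced sequence of customer 2 on the next slide. `reduce_customer` and its parameters are hypothetical names.

```python
def reduce_customer(seq, lam, freq_seq, freq_set):
    """Drop items that cannot extend lam into a frequent 2-sequence.

    freq_seq: items x for which (lam)(x) is frequent.
    freq_set: items x for which (lam, x) is frequent.
    """
    min_point = next(i for i, txn in enumerate(seq) if lam in txn)
    reduced = []
    for i, txn in enumerate(seq):
        if i < min_point:
            kept = tuple(txn)  # left of the min point: outside the rule
        else:
            cond1 = lam in txn     # condition 1
            cond2 = min_point < i  # condition 2
            kept = tuple(
                x for x in txn
                if (x == lam and not cond2)                # lam at the min point
                or (cond2 and x in freq_seq)               # may form (lam)(x)
                or (cond1 and x != lam and x in freq_set)  # may form (lam, x)
            )
        if kept:  # drop transactions that became empty
            reduced.append(kept)
    return reduced


# Frequent 2-sequences of the lam = a partition: (a)(a), (a)(c), (a)(g), (a,g).
freq_seq, freq_set = {"a", "c", "g"}, {"g"}
c1 = reduce_customer([("a", "d"), ("d",), ("a", "g", "h"), ("c",)], "a", freq_seq, freq_set)
c2 = reduce_customer([("b",), ("a",), ("f",), ("a", "c", "e", "g"), ("c",)], "a", freq_seq, freq_set)
c4 = reduce_customer([("a", "f", "g"), ("a", "e", "g", "h"), ("c", "g", "h")], "a", freq_seq, freq_set)
```

Applied to customers 1, 2 and 4 of the running example, the filter gives (a)(a,g)(c), (b)(a)(a,c,g)(c) and (a,g)(a,g)(c,g).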

13 DISC: λ = (a), δ = 3
CID  Customer sequence     3-minimum subsequence  Apriori pointer
1    (a)(a,g)(c)           (a)(a)(c)              1
2    (b)(a)(a,c,g)(c)      (a)(a,c)               1
4    (a,g)(a,g)(c,g)       (a)(a)(c)              1

The 2-sorted list
No.  Frequent 2-sequence
1    (a)(a)
2    (a)(c)
3    (a)(g)
4    (a,g)

3-order DB: CID 2 (a)(a,c); CID 1 (a)(a)(c); CID 4 (a)(a,g)
3-order DB: CID 1 (a)(a)(c); CID 4; CID 2 (a)(a,g)
Frequent 3-sequences: (a)(a,g); removed (a)(c,g); 2 2 2

14 Bi-level
CID  Customer sequence
1    (a)(a,g)(c)
2    (b)(a)(a,c,g)(c)
4    (a,g)(a,g)(c,g)

          (a)  (b)  (c)  (d)  (e)  (f)  (g)  (h)
Sup        0    0    3    0    0    0    1    0
Last_CID   0    0    4    0    0    0    4    0

          (_a) (_b) (_c) (_d) (_e) (_f) (_g) (_h)
Sup        0    0    0    0    0    0    0    0
Last_CID   0    0    0    0    0    0    0    0

Frequent 4-sequence: (a)(a,g)(c)

15 First-level partition 2
CID  Customer sequence          First-level partitioning
1    (a,d)(d)(a,g,h)(c)         (c)
2    (b)(a)(f)(a,c,e,g)(c)      (b)
3    (a,g)                      removed
4    (a,f,g)(a,e,g,h)(c,g,h)    (c)
5    (b,f)(b,e)(e,f,h)          (b)
6    (d,f)(d,f,g,h)             (d)
7    (b,f,g)(c,e,h)             (b)

16 Experiment
- Intel P4 2.8 GHz with 512 MB main memory, Windows XP
- IBM data generator
- Compared with PrefixSpan; pseudo-projection is named Pseudo

17 Parameters

18 Different database sizes, δ = 0.0025

19 Different minimum supports: DB = 10k, Slen = 8, Tlen = 8, Seq.patlen = 8

20 Multi-level partitioning, DB = 10k
$\mathrm{NRR}_Q = \frac{1}{N_Q} \sum_{P} \frac{\mathrm{Size}_P}{\mathrm{Size}_Q}$, where P is a child partition of Q
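The NRR formula is simple enough to evaluate directly. The sketch below assumes that N_Q is the number of child partitions of Q, which the slide does not state, and the sizes in the example are made-up numbers used only to show the arithmetic.

```python
def nrr(size_q, child_sizes):
    """NRR of partition Q: mean of Size_P / Size_Q over the child partitions P.

    Assumes N_Q equals the number of child partitions (not stated on the slide).
    """
    n_q = len(child_sizes)
    return sum(size_p / size_q for size_p in child_sizes) / n_q


# Hypothetical sizes: a parent of 100 customers with three child partitions.
example = nrr(100, [50, 30, 20])
```

A smaller value means the child partitions are much smaller than their parent on average.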

21 Dynamic DISC-all: customers = 50k, items = 1000, θ: number of transactions per customer

22 Comparison on different θ

