Published by Gabriel Blake. Modified over 8 years ago.
1 The Strategies for Mining Fault-Tolerant Patterns
Jia-Ling Koh
Department of Information and Computer Education, National Taiwan Normal University
2 Two Topics Introduced in This Talk
The strategies for mining fault-tolerant frequent itemsets (patterns) from a transaction database
The strategies for mining fault-tolerant repeating patterns from a data sequence
3 An Efficient Approach for Mining Fault-Tolerant Frequent Patterns Based on Bit Vector Representations
Jia-Ling Koh and Pei-Wy Yo
DASFAA 2005
4 Motivation
Related works
Problem definition
Appearing bit vectors
VB_FT_Mine algorithm (Vector-Based Fault-Tolerant frequent patterns Mining)
Experiments
Conclusion and future works
5 Sample database:
TID  Items
T1   B D E F
T2   A C D E
T3   B E F G
T4   C F G
T5   A B D E G
Min-sup = 4: frequent pattern E
Min-sup = 3: frequent patterns B, D, E, F, G, BE, DE
Under an expected minimum support, few frequent patterns are discovered.
Under a low minimum support, no general information and no representative frequent patterns are returned.
6 Fault tolerance: decide whether a transaction contains a pattern with fault tolerance, e.g. whether it contains 4 out of the 5 items {B, D, E, F, G}.
Sample database:
T1   B D E F      (contains B, D, E, F)
T2   A C D E
T3   B E F G      (contains B, E, F, G)
T4   C F G
T5   A B D E G    (contains B, D, E, G)
T1, T3 and T5 each contain 4 out of the 5 items, which gives a longer "approximate" pattern (BDEFG) with support count 3.
7 FT-Apriori algorithm (ACM SIGMOD, 2001)
Follows the Apriori approach and applies the "downward closure" property.
Suffers from generating a large number of candidates and from repeatedly scanning the database.
8 When the fault tolerance is set to 1, a transaction FT-contains BDE if it contains any (|BDE| - 1) items of BDE.
That is, a transaction that contains BD, BE, or DE FT-contains BDE.
9 δ (fault tolerance) = 1, itemset P = {B, D, E}
FT-body_1(P) = {T1, T2, T3, T5}
FT-sup_1(P) = 4
Item-Sup(B) = 3, Item-Sup(D) = 3, Item-Sup(E) = 4
TID  Items        Items of P contained
T1   B D E F      B D E
T2   A C D E      D E
T3   B E F G      B E
T4   C F G        (fewer than 2)
T5   A B D E G    B D E
10 A fault-tolerant frequent pattern P satisfies:
(1) FT-sup_δ(P) ≥ min-sup_FT
(2) ∀ p ∈ P, Item-Sup(p) ≥ min-sup_item
11 δ = 1, min-sup_FT = 4, min-sup_item = 3
Itemset P = {B, D, E}
FT-sup_1(P) = 4 (T1 contains BDE, T2 contains DE, T3 contains BE, T5 contains BDE)
Item-Sup(B) = 3, Item-Sup(D) = 3, Item-Sup(E) = 4
BDE is an FT-frequent pattern.
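The definition can be checked by direct counting on the sample database. The following is a minimal sketch, not the authors' implementation; the names `DB`, `ft_body`, `item_sup` and `is_ft_frequent` are hypothetical and introduced only for illustration.

```python
# Sample database from the slides: TID -> set of items.
DB = {
    "T1": set("BDEF"),
    "T2": set("ACDE"),
    "T3": set("BEFG"),
    "T4": set("CFG"),
    "T5": set("ABDEG"),
}


def ft_body(pattern, delta, db=DB):
    """TIDs of transactions containing at least |P| - delta items of P."""
    items = set(pattern)
    need = len(items) - delta
    return [tid for tid, t in db.items() if len(t & items) >= need]


def item_sup(item, body, db=DB):
    """Number of transactions in the FT-body that contain the item."""
    return sum(item in db[tid] for tid in body)


def is_ft_frequent(pattern, delta, min_sup_ft, min_sup_item, db=DB):
    """Check conditions (1) and (2) of the definition by direct counting."""
    body = ft_body(pattern, delta, db)
    if len(body) < min_sup_ft:
        return False
    return all(item_sup(p, body, db) >= min_sup_item for p in pattern)


body = ft_body("BDE", 1)
print(body, [item_sup(p, body) for p in "BDE"])
print(is_ft_frequent("BDE", 1, min_sup_ft=4, min_sup_item=3))
```

This version rescans every transaction for every pattern, which is the cost the bit-vector representation on the following slides is meant to avoid.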
12 Sample database:
TID  Items
T1   B D E F
T2   A C D E
T3   B E F G
T4   C F G
T5   A B D E G
Appearing vector table:
Item  Appearing vector (Appear_P)
A     0 1 0 0 1
B     1 0 1 0 1
C     0 1 0 1 0
D     1 1 0 0 1
E     1 1 1 0 1
F     1 0 1 1 0
G     0 0 1 1 1
13 Appearing vector: the support count of an item is the number of bits with value 1 in its appearing vector.
14 Appear_A = 01001, I_5 = 11111
Vector(Appear_A) · I_5 = 2
15 Appear_A = 01001, Appear_D = 11001
Appear_AD = Appear_A ∧ Appear_D = 01001 ∧ 11001 = 01001
T2 and T5 contain AD.
16 Pattern P = AD, δ = 1
T1, T2 and T5 FT-contain AD (T1 contains D; T2 and T5 contain AD).
FT-Appear_AD(1) = 11001
17 FT-Appear_AD(1) = 11001
FT-sup_1(AD) = Vector(FT-Appear_AD(1)) · I_5 = 11001 · 11111 = 3
Item-Sup(A) = Vector(FT-Appear_AD(1)) · Vector(Appear_A) = 11001 · 01001 = 2
Item-Sup(D) = Vector(FT-Appear_AD(1)) · Vector(Appear_D) = 11001 · 11001 = 3
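The vector operations above map directly onto integer bit operations. A small sketch, assuming the appearing vectors are stored as Python ints with the leftmost bit standing for T1; the helper names `ones` and `dot` are hypothetical.

```python
# Appearing vectors from the slides, stored as Python ints.
# The leftmost bit corresponds to T1, the rightmost to T5.
APPEAR = {
    "A": 0b01001, "B": 0b10101, "C": 0b01010, "D": 0b11001,
    "E": 0b11101, "F": 0b10110, "G": 0b00111,
}


def ones(v):
    """Count the 1 bits; equals the dot product with the all-ones vector."""
    return bin(v).count("1")


def dot(u, v):
    """Dot product of two 0/1 vectors, i.e. the number of 1 bits they share."""
    return ones(u & v)


appear_ad = APPEAR["A"] & APPEAR["D"]        # transactions containing A and D
ft_appear_ad_1 = APPEAR["A"] | APPEAR["D"]   # delta = 1: A or D is enough

ft_sup = ones(ft_appear_ad_1)
item_sup_a = dot(ft_appear_ad_1, APPEAR["A"])
item_sup_d = dot(ft_appear_ad_1, APPEAR["D"])
print(format(ft_appear_ad_1, "05b"), ft_sup, item_sup_a, item_sup_d)
```

Because the vectors hold only 0s and 1s, each dot product reduces to an AND followed by a bit count, so no transaction needs to be rescanned.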
18 Itemset AB:
FT-Appear_AB(1) = Appear_A ∨ Appear_B
Itemset ABC:
FT-Appear_ABC(1) = Appear_AB ∨ Appear_BC ∨ Appear_AC
FT-Appear_ABC(2) = Appear_A ∨ Appear_B ∨ Appear_C
Computing FT-Appear_P(δ) this way performs C(|P|, |P|-δ) - 1 OR operations.
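The enumeration on this slide can be sketched as an OR over all sub-itemsets of size |P| - δ. This is an illustrative sketch only: the function names are hypothetical, and the operation count assumes the slide's figure is read as C(|P|, |P|-δ) - 1.

```python
from functools import reduce
from itertools import combinations
from math import comb
from operator import and_, or_

# Appearing vectors of the sample database (leftmost bit = T1).
APPEAR = {
    "A": 0b01001, "B": 0b10101, "C": 0b01010, "D": 0b11001,
    "E": 0b11101, "F": 0b10110, "G": 0b00111,
}


def appear(itemset):
    """AND of the items' appearing vectors: transactions containing them all."""
    return reduce(and_, (APPEAR[i] for i in itemset))


def ft_appear_enum(pattern, delta):
    """OR over all sub-itemsets of size |P| - delta (assumes |P| > delta)."""
    size = len(pattern) - delta
    return reduce(or_, (appear(s) for s in combinations(pattern, size)))


def or_operations(pattern, delta):
    """Number of OR operations the enumeration needs."""
    return comb(len(pattern), len(pattern) - delta) - 1


print(format(ft_appear_enum("ABC", 1), "05b"), or_operations("ABC", 1))
```

The number of sub-itemsets grows combinatorially with |P|, which motivates deriving the vector of a longer pattern from the vectors of its prefix instead.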
19 [Theorem] Let P' = P ∪ {x}. A transaction T FT-contains P' with fault tolerance δ if and only if
(1) T FT-contains P with fault tolerance δ-1, or
(2) T FT-contains P with fault tolerance δ and T contains x.
20 P = ABD, P' = P ∪ {G} = ABDG
T FT-contains P' with fault tolerance 2:
Case 1: T1 = BDEF FT-contains ABD with fault tolerance 1 (it contains B and D).
Case 2: T3 = BEFG FT-contains ABD with fault tolerance 2 (it contains B) and contains G.
21 If δ = 0, FT-Appear_P'(δ) = Appear_P'.
If |P'| ≤ δ, FT-Appear_P'(δ) = I_|DB|.
Otherwise, FT-Appear_P'(δ) = FT-Appear_P(δ-1) ∨ (FT-Appear_P(δ) ∧ Appear_x).
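The three cases translate into a short recursive function. This is a minimal sketch under the assumption that a pattern is a string of items and that the last item plays the role of x; `ft_appear` is a hypothetical name, and a real implementation would presumably keep the vectors of P for reuse rather than recompute them.

```python
# Appearing vectors of the sample database (leftmost bit = T1).
APPEAR = {
    "A": 0b01001, "B": 0b10101, "C": 0b01010, "D": 0b11001,
    "E": 0b11101, "F": 0b10110, "G": 0b00111,
}
N = 5                      # |DB|
ALL_ONES = (1 << N) - 1    # I_|DB|


def ft_appear(pattern, delta):
    """FT-appearing vector of `pattern` under fault tolerance `delta`."""
    if len(pattern) <= delta:
        return ALL_ONES                  # every transaction qualifies
    if delta == 0:
        v = ALL_ONES
        for item in pattern:             # plain appearing vector: AND of items
            v &= APPEAR[item]
        return v
    prefix, x = pattern[:-1], pattern[-1]
    return ft_appear(prefix, delta - 1) | (ft_appear(prefix, delta) & APPEAR[x])


for p in ("BD", "BDE", "BDEF", "BDEG"):
    v = ft_appear(p, 1)
    print(p, format(v, "05b"), bin(v).count("1"))
```

Each extension by one item costs one AND and one OR once the prefix vectors for δ and δ-1 are known, independent of how many sub-itemsets the pattern has.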
22 δ = 1
Itemset A:
FT-Appear_A(1) = I_|DB|
FT-Appear_A(0) = Appear_A
Itemset AB:
FT-Appear_AB(1) = FT-Appear_A(0) ∨ (FT-Appear_A(1) ∧ Appear_B) = Appear_A ∨ (I_|DB| ∧ Appear_B) = Appear_A ∨ Appear_B
FT-Appear_AB(0) = Appear_AB = FT-Appear_A(0) ∧ Appear_B = Appear_A ∧ Appear_B
23 δ = 2
Itemset AB:
FT-Appear_AB(2) = I_|DB|
FT-Appear_AB(1) = FT-Appear_A(0) ∨ (FT-Appear_A(1) ∧ Appear_B)
FT-Appear_AB(0) = Appear_AB
Itemset ABC:
FT-Appear_ABC(2) = FT-Appear_AB(1) ∨ (FT-Appear_AB(2) ∧ Appear_C)
FT-Appear_ABC(1) = FT-Appear_AB(0) ∨ (FT-Appear_AB(1) ∧ Appear_C)
FT-Appear_ABC(0) = Appear_ABC = FT-Appear_AB(0) ∧ Appear_C
24 Construct the appearing vector table from the sample database:
Item  Appear_P
A     0 1 0 0 1
B     1 0 1 0 1
C     0 1 0 1 0
D     1 1 0 0 1
E     1 1 1 0 1
F     1 0 1 1 0
G     0 0 1 1 1
25 Check the item supports against min-sup_item (min-sup_item = 3, min-sup_FT = 4, δ = 1):
A: 2, B: 3, C: 2, D: 3, E: 4, F: 3, G: 3
Candidate items for constructing frequent FT-patterns: B, D, E, F, G
27 FT-appearing vectors of BD:
FT-Appear_BD(1) = FT-Appear_B(0) ∨ (FT-Appear_B(1) ∧ Appear_D) = 10101 ∨ (11111 ∧ 11001) = 11101
FT-Appear_BD(0) = FT-Appear_B(0) ∧ Appear_D = 10101 ∧ 11001 = 10001
FT-support:
FT-sup_1(BD) = Vector(FT-Appear_BD(1)) · I_5 = 11101 · 11111 = 4
Item support:
Item-Sup(B) = Vector(FT-Appear_BD(1)) · Vector(Appear_B) = 11101 · 10101 = 3
Item-Sup(D) = Vector(FT-Appear_BD(1)) · Vector(Appear_D) = 11101 · 11001 = 3
29 FT-appearing vectors of BDE:
FT-Appear_BDE(1) = FT-Appear_BD(0) ∨ (FT-Appear_BD(1) ∧ Appear_E) = 10001 ∨ (11101 ∧ 11101) = 11101
FT-Appear_BDE(0) = FT-Appear_BD(0) ∧ Appear_E = 10001 ∧ 11101 = 10001
FT-support:
FT-sup_1(BDE) = Vector(FT-Appear_BDE(1)) · I_5 = 11101 · 11111 = 4
Item support:
Item-Sup(B) = 11101 · 10101 = 3
Item-Sup(D) = 11101 · 11001 = 3
Item-Sup(E) = 11101 · 11101 = 4
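31 FT-appearing vectors of BDEF:
FT-Appear_BDEF(1) = FT-Appear_BDE(0) ∨ (FT-Appear_BDE(1) ∧ Appear_F) = 10001 ∨ (11101 ∧ 10110) = 10101
FT-Appear_BDEF(0) = FT-Appear_BDE(0) ∧ Appear_F = 10001 ∧ 10110 = 10000
FT-support:
FT-sup_1(BDEF) = Vector(FT-Appear_BDEF(1)) · I_5 = 10101 · 11111 = 3
BDEF is not an FT-frequent pattern.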
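33 FT-appearing vectors of BDEG:
FT-Appear_BDEG(1) = FT-Appear_BDE(0) ∨ (FT-Appear_BDE(1) ∧ Appear_G) = 10001 ∨ (11101 ∧ 00111) = 10101
FT-Appear_BDEG(0) = FT-Appear_BDE(0) ∧ Appear_G = 10001 ∧ 00111 = 00001
FT-support:
FT-sup_1(BDEG) = Vector(FT-Appear_BDEG(1)) · I_5 = 10101 · 11111 = 3
BDEG is not an FT-frequent pattern.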
35 Implementation environment: Visual C++ 6.0, P4 2.4 GHz CPU, 256 MB main memory, OS: Windows XP Professional
Synthetic data generator: IBM website, http://www.almaden.ibm.com/cs/quest/DEMOS.html
36 Experiment 1: min-sup_item is changed. T10I8D100kN450 (δ = 1)
37 Experiment 2: min-sup_FT is changed. T10I8D10kN1k (δ = 1)
38 Experiment 3: the fault tolerance is changed. T10I8D100kN450
39 Experiment 4: the database size is changed. T10I8N450 (δ = 1)
40 Experiment 5: the number of distinct items in the database is changed. T10I8D100k (δ = 1)
41 Conclusion
The VB-FT-Mine algorithm is proposed. It constructs the FT-appearing vectors of candidates and computes FT-support and item support efficiently, which gives a significant improvement in execution time over the FT-Apriori algorithm.
Future work
Extend the VB-FT-Mine algorithm to mine frequent patterns in data streams.
42 An Efficient Approach for Mining Top-K Fault-Tolerant Repeating Patterns
Jia-Ling Koh and Yu-Ting Kung
DASFAA 2006
43 Outline
Introduction
Basic Terms
Bit Sequence Representation
Mining Top-K non-trivial FT-RPs with min_len Constraints
Performance Study
Conclusion & Future Works
44 Introduction
Repeating patterns: the sub-patterns that appear repeatedly in a data sequence.
Applications: music feature extraction, user behavior monitoring.
In most studies, only exact matching was considered.
45 Introduction (Cont.)
For example: data sequence = ACDE……ACEDE….
Using an exact matching approach, the frequency of "ACDE" is 1.
Allowing insertion errors finds the implicit repeating pattern "ACDE".
46 Introduction (Cont.)
Idea:
1. Discover fault-tolerant repeating patterns (FT-RPs for short), and
2. Avoid finding "duplicated" information and "short" patterns.
Mining "top-K non-trivial FT-RPs with length no less than min_len".
47 Outline
Introduction
Basic Terms
Bit Sequence Representation
Mining Top-K non-trivial FT-RPs with min_len Constraints
Performance Study
Conclusion & Future Works
48 Data Sequence
E = {A, B, …, Z}: the set of data items
DSeq = D_1 D_2 … D_n is a data sequence, where D_i ∈ E (i = 1…n)
e.g. DSeq = ABCDABCACDEE, |DSeq| = 12 (the length of DSeq)
49 Contain & Appear
DSeq = ABCDABCDA, P = CDA
DSeq contains P on positions 3 and 7, i.e. P appears on positions 3 and 7.
freq(P) = 2
50 FT-contain: insertion error
DSeq = ABCDABCA, P = ABCA
DSeq FT-contains P on position 1 with 1 insertion error (ABC, one inserted item, A).
DSeq FT-contains P on position 5 with 0 insertion errors (ABCA).
51 FT-contain: deletion error
DSeq = ABCBCA, P = BCD
DSeq FT-contains P on positions 2 and 4 with 1 deletion error (BC without D).
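52 IFT-contain & IFT-appear
Insertion errors: 0, 1, or 2
DSeq = ABCDABCA, P = ABCA
DSeq IFT-contains P, and P IFT-appears in DSeq, at the occurrence ABC·A (one inserted item) and at the exact occurrence ABCA.
53 DFT-contain & DFT-appear
Deletion errors: 0, 1, or 2
DSeq = ABCBCA, P = BCD
DSeq DFT-contains P, and P DFT-appears in DSeq, at the two occurrences of BC.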
54 Fault-Tolerant Frequency
DSeq = ABCDABCAECDAA, P = CA
The fault-tolerant appearances of P start at the three occurrences of C (positions 3, 7 and 10).
FT-freq_DSeq(P) = 3
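The fault-tolerant frequency under insertion errors can be checked by a direct scan. The sketch below is an assumption-laden illustration, not the bit-sequence method of the talk: it assumes an insertion error is an extra item between two consecutive pattern items, that a position counts when the pattern can be matched there with at most `delta` insertions, and it uses hypothetical function names.

```python
def min_insertions(dseq, pattern, start):
    """Fewest inserted items for a match beginning at 0-based `start`.

    Returns None if the pattern cannot be matched from that position.
    Taking the earliest next occurrence of each item minimizes the span.
    """
    if dseq[start] != pattern[0]:
        return None
    pos = start
    for item in pattern[1:]:
        pos = dseq.find(item, pos + 1)
        if pos == -1:
            return None
    return (pos - start + 1) - len(pattern)


def ft_positions(dseq, pattern, delta):
    """1-based positions where dseq holds pattern with <= delta insertions."""
    found = []
    for start in range(len(dseq)):
        extra = min_insertions(dseq, pattern, start)
        if extra is not None and extra <= delta:
            found.append(start + 1)
    return found


def ft_freq(dseq, pattern, delta):
    """Fault-tolerant frequency: number of qualifying positions."""
    return len(ft_positions(dseq, pattern, delta))


print(ft_positions("ABCDABCA", "ABCA", 1))
print(ft_freq("ABCDABCAECDAA", "CA", 1))
```

With `delta = 0` the same function falls back to exact matching, so the gap between the two counts shows how many appearances are only found by tolerating insertions.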
55 Fault-tolerant Repeating Patterns (FT-RPs)
Given DSeq and P: if FT-freq_DSeq(P) ≥ min_freq, P is an FT-RP.
56 Outline
Introduction
Basic Terms
Bit Sequence Representation
Mining Top-K non-trivial FT-RPs with min_len Constraints
Performance Study
Conclusion & Future Works
57 Appearing Bit Sequence
DSeq = ABCDABCACDEEABCCDEAC
Appear_A is initially 00000000000000000000; the bits at the five positions where A occurs (1, 5, 8, 13, 19) are then set to 1.
freq("A") = 5
58 Bit Index Table
DSeq = ABCDABCACDEEABCCDEAC
Data item  Appearing bit sequence (Appear_N)
A          10001001000010000010
B          01000100000001000000
C          00100010100000110001
D          00010000010000001000
E          00000000001100000100
59 Appearing Bit Sequence of longer patterns
DSeq = ABCDBCCADEECAABCCDEA, Appear_AB = ?
Appear_A = 10000001000011000001
Appear_B = 01001000000000100000
l_shift(Appear_B, 1) = 10010000000001000000
Appear_AB = Appear_A ∧ l_shift(Appear_B, 1) = 10000000000001000000
freq("AB") = 2
60 Appearing Bit Sequence of longer patterns (Cont.)
DSeq = ABCDBCCADEECAABCCDEA, Appear_ABC = ?
Appear_AB = 10000000000001000000
Appear_C = 00100110000100011000
l_shift(Appear_C, 2) = 10011000010001100000
Appear_ABC = Appear_AB ∧ l_shift(Appear_C, 2) = 10000000000001000000
freq("ABC") = 2
61 Recursive Function: Appear_P
P = P_1 P_2 … P_{m-1} P_m, P' = P_1 … P_{m-1}, X = P_m
Appear_P = Appear_P' ∧ l_shift(Appear_X, m-1)
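The construction of appearing bit sequences for longer patterns can be sketched with integers, where shifting left moves bits toward position 1. The sketch uses the sequence of the bit index table (slide 58) and reads the recursion as Appear_P = Appear_P' ∧ l_shift(Appear_X, |P|-1), unrolled into a loop; all function names are hypothetical.

```python
DSEQ = "ABCDABCACDEEABCCDEAC"   # sequence used for the bit index table
N = len(DSEQ)
MASK = (1 << N) - 1             # keeps shifted values within N bits


def appear_item(x):
    """Appearing bit sequence of one item; the leftmost bit is position 1."""
    return int("".join("1" if d == x else "0" for d in DSEQ), 2)


def l_shift(v, k):
    """Shift left by k positions, dropping bits that leave the sequence."""
    return (v << k) & MASK


def appear(pattern):
    """Appearing bit sequence of an exact (non-fault-tolerant) pattern."""
    v = appear_item(pattern[0])
    for offset, x in enumerate(pattern[1:], start=1):
        v &= l_shift(appear_item(x), offset)   # x must sit `offset` items later
    return v


def bits(v):
    """Render a bit sequence as a fixed-width string."""
    return format(v, f"0{N}b")


for p in ("A", "AB", "ABC"):
    print(p, bits(appear(p)), bin(appear(p)).count("1"))
```

A bit set at position i means the pattern starts at position i, so the frequency is simply the number of 1 bits.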
62 Fault-Tolerant Appearing Bit Sequences
Represent the positions where the data sequence IFT/DFT-contains P under fault tolerance:
Insertion fault tolerance
Deletion fault tolerance
63 Appearing Bit Sequence of Insertion Fault Tolerance
For E = 0, 1, …: the appearing bit sequence of P with E insertion errors.
64 How to get it when |P| > 1 and E > 0?
P = ABC, E = 2, with P' = AB and X = C:
1) A B x x C (0 insertion errors in P')
2) A x B x C (1 insertion error in P')
3) A x x B C (2 insertion errors in P')
In each case X is reached by shifting 4 = |P| + E - 1 bit positions.
65 Recursive Function (insertion fault tolerance)
P = P_1 P_2 … P_{m-1} P_m, for E = 0, 1, …
P' = P_1 … P_{m-1}, X = P_m
66 Example (figure; content not recoverable)
67 Fault-Tolerant Frequency
FT-freq_DSeq(P) is obtained by counting the number of bits with value "1" in the fault-tolerant appearing bit sequence of P.
Whether a pattern P is an FT-RP can therefore be evaluated efficiently.
68 Appearing Bit Sequence of Deletion Fault Tolerance
P = P_1 P_2 … P_m: the appearing bit sequence of P with E deletion errors.
P is decomposed as Y P'': 0, 1, …, D deletion errors in P''; shift 1 bit position.
69 Recursive Function (deletion fault tolerance)
P'' = P_2 … P_m, for E = 0, 1, …
Q = P_2 … P_{m-1}, X = P_m
70 Example (figure; content not recoverable)
71 Outline
Introduction
Basic Terms
Bit Sequence Representation
Mining Top-K non-trivial FT-RPs with min_len Constraints
Performance Study
Conclusion & Future Works
72 TFTRP-Mine Algorithm
DSeq = ABCDABCACDEEABCCDEAC, min_len = 2 and K = 3; min_freq is set to 3.
1. Scan DSeq once to construct the bit index table.
73 TFTRP-Mine (Cont.)
2. Generate candidate patterns in depth-first order (min_freq = 3, K = 3, min_len = 2).
[Figure: depth-first candidate tree under the root, starting from A. Nodes are tested against min_freq and min_len, and patterns are checked for being non-trivial; the figure also shows Minlen_Set and the Temporal Output Set.]
Patterns shown: AB (3), ABC (3), ABCD (3), AC (5), ACD (3)
74 TFTRP-Mine (Cont.)
[Figure: the candidate tree continued from B, with Minlen_Set and the Temporal Output Set; min_freq = 3, K = 3, min_len = 2.]
Patterns shown: AB (3), ABC (3), ABCD (3), BC (3), BCD (3), AC (5), ACD (3)
75 TFTRP-Mine (Cont.)
Temporal Output Set: ABCD (3), CAC (3), CDAC (3), CDA (4), CDEAC (3), CDE (4), CEA (3), AC (5), ACD (3), CD (5)
Sorted Temporal Output Set: AC (5), CD (5), CDA (4), CDE (4), CAC (3), ACD (3), CEA (3), ABCD (3), CDEAC (3), CDAC (3)
Results: AC (5), CD (5), CDA (4), CDE (4)
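The final selection step can be sketched as a sort followed by a cut at the K-th frequency. The tie rule below is an assumption inferred from this slide, where K = 3 returns four patterns: every pattern whose frequency is at least the K-th largest is kept. The function name is hypothetical, and the non-trivial check is taken as already done.

```python
def top_k_with_ties(freqs, k):
    """Patterns whose frequency is at least the k-th largest frequency."""
    ranked = sorted(freqs.items(), key=lambda kv: -kv[1])  # stable sort
    if len(ranked) <= k:
        return ranked
    threshold = ranked[k - 1][1]
    return [(p, f) for p, f in ranked if f >= threshold]


# Temporal Output Set from the slide: pattern -> fault-tolerant frequency.
temporal_output_set = {
    "ABCD": 3, "CAC": 3, "CDAC": 3, "CDA": 4, "CDEAC": 3,
    "CDE": 4, "CEA": 3, "AC": 5, "ACD": 3, "CD": 5,
}

results = top_k_with_ties(temporal_output_set, 3)
print(results)
```

Sorting the whole output set only at the end means every candidate above the initial min_freq has to be generated first, which is the cost RE-TFTRP-Mine addresses on the next slides.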
76 RE-TFTRP-Mine Algorithm
min_len = 2, K = 3, min_freq = 3
Minlen_Set: AC (5), CD (5), AB (3), BC (3), CA (3), CE (3), DA (3), EA (3)
Temporal Top-K Set: AC (5), CD (5), AB (3)
77 RE-TFTRP-Mine (Cont.)
Minlen_Set: AC (5), CD (5), AB (3), BC (3), CA (3), CE (3), DA (3), EA (3)
Temporal Top-K Set: AC (5), CD (5), AB (3); min_freq = 3
Frequencies of the one-item extensions:
Prefix  A B C D E
A       1 3 5 2 1
AC      2 0 2 3 1
C       3 1 2 5 3
CD      4 1 2 0 4
ACD (3), CDA (4) and CDE (4) are generated and checked for being non-trivial.
78 RE-TFTRP-Mine (Cont.)
Minlen_Set: AB (3), BC (3), CA (3), CE (3), DA (3), EA (3), ACD (3), CDA (4), CDE (4)
Temporal Top-K Set: AC (5), CD (5), AB (3); min_freq = 3
Frequencies of the one-item extensions:
Prefix  A B C D E
CDA     1 2 3 0 0
CDAC (3) is generated and checked for being non-trivial.
After the update the figure shows CDA (4) and the value 4.
79 Performance Study
Implementation environment: Borland C++ Builder 5.0, 2.4 GHz Intel Pentium IV PC, 512 MB memory, Microsoft Windows XP Professional
Data sequence generator:
Parameter  Meaning
L          Length of the generated data sequence
E          Number of distinct data items in the generated data sequence
80 Performance Study (Cont.)
Experiment 1: performance evaluation on efficiency, varying one of the five parameters
Data parameters: L, E
Runtime parameters: insertion fault tolerance, min_len, K
Experiment 2: performance evaluation on effectiveness on music objects
81 Experiment 1
Changing the size of a data sequence L (min_len = 8, K = 5 and E = 5)
Candidate patterns (unit: number)
L     TFTRP-Mine  RE-TFTRP-Mine
1000  10,295      8,705
2000  41,730      24,300
3000  120,075     24,760
4000  348,610     36,090
5000  533,770     41,280
82 Experiment 1 (Cont.)
Changing the insertion fault tolerance (L2000.E5, min_len = 8, K = 5)
Candidate patterns (unit: number)
Insertion fault tolerance  TFTRP-Mine  RE-TFTRP-Mine
1                          5,760       5,465
2                          41,730      24,300
3                          394,515     36,600
4                          3,434,805   76,195
83 Experiment 1 (Cont.)
Changing the setting of min_len (L2000.E5 and K = 5)
Candidate patterns (unit: number)
min_len  TFTRP-Mine  RE-TFTRP-Mine
5        41,730      3,795
10       41,730      25,240
15       41,730      26,375
20       41,730      27,615
25       41,730      29,000
30       41,730      30,500
35       41,730      38,095
40       41,730      39,620
45       41,730
50       41,730
84 Experiment 1 (Cont.)
Changing the setting of K: K = max_K x 1%, max_K x 20%, …, max_K x 100% (L2000.E5 and min_len = 8)
Candidate patterns (unit: number)
K/max_K  TFTRP-Mine  RE-TFTRP-Mine
1%       41,730      14,320
20%      41,730      24,300
40%      41,730      31,735
60%      41,730      37,325
80%      41,730
100%     41,730
85 Experiment 2 (min_freq = 3, K = 2 and min_len = 8)
Columns: Music Object; Found FT-RPs under insertion fault tolerance 0; Found FT-RPs under insertion fault tolerance 1; Found FT-RPs under insertion fault tolerance 2; Motif in Music Object
1 (252 seconds): None; 1. Ecgcegcdcgebbdgbbcaeaecaecba aegegeegecbbggdgacaffbfacfcgee ccgeb 2. None; ecgcegcdcgebbdgbbcaeaecaecba aegegeegecbbggdgacaffbfacfcgee ccgebceaadfd
2 (270 seconds): None; 1. dcbgddebdefgaabbadgcfbge dcdbaccaaccffaaccccddddcca afdd 2. None; ddcbgddebdefgaabbadgcfbg edcdbaccaaccffaaccccddddc caaffddgend
3 (256 seconds): None; 1. gggbbgfffdcbeeedccfeeeffgf 2. gggbbgfffdcbfeedccfeeeffgf; cbcgggbbgfffdcbfeedccfeeef fgfccc
4 (288 seconds): None; 1. deededacededaceded aagggagg 2. deecedacededaceded aagggagg; 1. gedeededacededacededaagg gaggedgggcbbaaagabeeegaaa agedeededacedegedcaageedd eccgacaagegdegedcacdegedc dagagedddedcacccccbbaaaga beegaaaagedeededacededace dedaagggaggedgggc 2. None; baaagabeeegaaaagedeededa cededacededaagggaggedggg cbbaaagabeeegaaaagedeede dacedegedcaageeddeccgaca agegdegedcacdegedcdagage dddedcacccccbbaaagabeeeg aaaagedeededacededaceded aagggaggedgggcbbaaagab
5 (291 seconds): None; 1. aegcfaecaholdonfdbfebgfba holdonbgfeebfbgefdabfddbfd holdonbfcaccaecegcholdoncd fbdf 2. None; ecgacaegcfaecaholdonfdbfe bgfbaholdonbgfeebfbgefdab fddbfdholdonbfccaccaecegc holdoncdfbdf
86 Conclusion and Future Works
Conclusion
Fault-tolerant appearing bit sequences are introduced, and the TFTRP-Mine and RE-TFTRP-Mine algorithms are proposed for efficiently mining top-K non-trivial FT-RPs with length no less than min_len in data sequences.
Future works
Partition the bit index table into several parts to perform parallel mining.