Published by Erick May. Modified over 9 years ago.
1
1 Efficient Algorithms for Incremental Update of Frequent Sequences
Minghua ZHANG
Dec. 7, 2001
2
2 Content
Introduction
Problem Definition
Related Works
Incremental Update Algorithms
Performance
Conclusion
3
3 Introduction
Sequences arise in many areas of everyday life.
– An online bookstore: customers' buying sequences
– A web site: web-log sequences
Knowledge of frequent sequences is useful.
Several mining algorithms have been proposed, such as AprioriAll, GSP, SPADE, MFS and PrefixSpan.
These algorithms assume that the database is static.
In practice, the content of a sequence database changes continually.
4
4 Problem Definition
Item
– $I = \{i_1, i_2, \ldots, i_M\}$: a set of literals called items.
Transaction (or Itemset)
– Transaction $t$: a set of items such that $t \subseteq I$.
Sequence
– Sequence $s = \langle t_1, t_2, \ldots, t_n \rangle$: an ordered list of transactions.
– The length of $s$, written $|s|$, is the number of items contained in $s$. E.g. a sequence whose transactions hold five items in total has $|s| = 5$.
5
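5 Problem Definition
Subsequence
– Let $s_1 = \langle a_1, a_2, \ldots, a_m \rangle$ and $s_2 = \langle b_1, b_2, \ldots, b_n \rangle$.
– If there exist integers $j_1, j_2, \ldots, j_n$ such that $1 \le j_1 < j_2 < \cdots < j_n \le m$ and $b_1 \subseteq a_{j_1}, b_2 \subseteq a_{j_2}, \ldots, b_n \subseteq a_{j_n}$,
– then $s_2$ is a subsequence of $s_1$, or $s_1$ contains $s_2$ (written $s_2 \sqsubseteq s_1$).
– Example: if $s_1 = \langle \ldots \rangle$ and $s_2 = \langle \ldots \rangle$, then $s_2 \sqsubseteq s_1$.
Maximal Sequence
– Given a set of sequences $V$, a sequence $s$ in $V$ is maximal if $s$ is not a subsequence of any other sequence in $V$.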
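The length, containment and maximality definitions above can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes a sequence is a tuple of frozensets and that items are single characters, and the helper names `seq`, `length`, `is_subsequence` and `maximal` are hypothetical, not taken from the slides.

```python
from typing import FrozenSet, List, Tuple

Transaction = FrozenSet[str]
Seq = Tuple[Transaction, ...]


def seq(*transactions: str) -> Seq:
    """Hypothetical helper: seq("ab", "c") builds the sequence <{a, b}, {c}>."""
    return tuple(frozenset(t) for t in transactions)


def length(s: Seq) -> int:
    """|s|: the total number of items over all transactions of s."""
    return sum(len(t) for t in s)


def is_subsequence(s2: Seq, s1: Seq) -> bool:
    """True if s1 contains s2, using greedy left-to-right matching."""
    j = 0
    for b in s2:
        # Advance to the earliest remaining transaction of s1 that includes b.
        while j < len(s1) and not b <= s1[j]:
            j += 1
        if j == len(s1):
            return False
        j += 1  # the next transaction of s2 must match strictly later in s1
    return True


def maximal(seqs: List[Seq]) -> List[Seq]:
    """Keep the sequences that are not subsequences of another one in the list."""
    return [s for s in seqs
            if not any(s != t and is_subsequence(s, t) for t in seqs)]
```

Matching each transaction of the shorter sequence to the earliest possible position is sufficient here, because an earlier match never rules out a later one.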
6
6 Problem Definition
Given a sequence database $D$ and a sequence $s$:
– support count: the number of sequences in $D$ that contain $s$.
– support: the fraction of sequences in $D$ that contain $s$.
– frequent: $s$ is frequent if its support is no less than a threshold $\rho_s$.
Mining Frequent Sequences
– Inputs:
  – a database $D$ of sequences
  – a user-specified minimum support threshold $\rho_s$ (e.g. $\rho_s = 1\%$)
– Output: maximal frequent sequences
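As a rough illustration of these three definitions, the sketch below counts support by brute force over an in-memory database. It assumes sequences are tuples of frozensets; the function names and the tiny example database are made up for illustration.

```python
def contains(s1, s2):
    """True if sequence s1 contains s2 (each is a tuple of frozensets)."""
    it = iter(s1)
    # For each transaction b of s2, consume s1 until a superset of b is found.
    return all(any(b <= a for a in it) for b in s2)


def support_count(db, s):
    """Number of sequences in db that contain s."""
    return sum(1 for d in db if contains(d, s))


def support(db, s):
    """Fraction of sequences in db that contain s."""
    return support_count(db, s) / len(db)


def is_frequent(db, s, min_support):
    """s is frequent if its support is no less than the threshold."""
    return support_count(db, s) >= min_support * len(db)


fs = frozenset
db = [
    (fs("ab"), fs("c")),
    (fs("a"), fs("c")),
    (fs("b"),),
    (fs("a"), fs("d"), fs("c")),
]
s = (fs("a"), fs("c"))
print(support_count(db, s), support(db, s))
```

Real miners such as GSP avoid this per-sequence scan by counting many candidates in one pass over the database; the brute-force version is only meant to pin down the definitions.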
7
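7 Problem Definition
Database update
– $\Delta^-$: the sequences deleted from $D$; $D^-$: the sequences of $D$ that remain unchanged; $\Delta^+$: the newly inserted sequences.
– $D = D^- \cup \Delta^-$ is the old database and $D' = D^- \cup \Delta^+$ is the updated database.
Incremental Update
– Inputs:
  – $\Delta^-$, $D^-$, $\Delta^+$
  – $\rho_s$
  – Frequent sequences in $D$ and their supports
– Output: Maximal frequent sequences in $D'$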
8
8 Problem Definition Notations
9
9 Related Works--GSP
GSP was proposed by Srikant and Agrawal (EDBT '96).
10
10 Related Works--GSP
Candidate Generation Function GGen()
– Input: $L_i$ (the frequent sequences of length $i$)
– Output: $C_{i+1}$ (the candidate sequences of length $i+1$)
– Join: for each pair of sequences $s_1, s_2 \in L_i$, if the sequence obtained by deleting the first item of $s_1$ equals the sequence obtained by deleting the last item of $s_2$ (or vice versa), a candidate sequence is generated and inserted into $C_{i+1}$. E.g. if $s_1 = \langle \ldots \rangle$ and $s_2 = \langle \ldots \rangle$ (with common part $s' = \langle \ldots \rangle$), then $c_1 = \langle \ldots \rangle$ is generated.
– Prune: if a sequence $s$ in $C_{i+1}$ has an infrequent subsequence, delete $s$ from $C_{i+1}$. Reason: if a sequence is frequent, then all of its subsequences must be frequent.
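The join and prune steps can be sketched as follows. This is a simplified illustration, not the original implementation: it assumes a sequence is a tuple of transactions, each transaction a tuple of items in sorted order, and the names `drop_item` and `ggen` are hypothetical. The prune step here checks every subsequence obtained by deleting a single item, which follows from the stated reason that all subsequences of a frequent sequence are frequent.

```python
from itertools import product


def drop_item(s, e, k):
    """Return s with the k-th item of its e-th transaction removed."""
    elem = s[e][:k] + s[e][k + 1:]
    # A transaction that becomes empty disappears from the sequence.
    return s[:e] + ((elem,) if elem else ()) + s[e + 1:]


def ggen(frequent):
    """Generate length-(i+1) candidates from frequent length-i sequences."""
    frequent = set(frequent)
    candidates = set()
    # Ordered pairs cover both join directions ("or vice versa").
    for s1, s2 in product(frequent, repeat=2):
        tail_of_s1 = drop_item(s1, 0, 0)
        head_of_s2 = drop_item(s2, len(s2) - 1, len(s2[-1]) - 1)
        if tail_of_s1 != head_of_s2:
            continue
        x = s2[-1][-1]  # last item of s2, appended to s1
        if len(s2[-1]) == 1:
            # x was a transaction of its own in s2: append a new transaction.
            candidates.add(s1 + ((x,),))
        if len(s2[-1]) > 1 or sum(map(len, s2)) == 1:
            # x shared a transaction in s2 (or both sequences hold one item):
            # add x to the last transaction of s1, keeping items sorted.
            if x > s1[-1][-1]:
                candidates.add(s1[:-1] + (s1[-1] + (x,),))
    # Prune: every one-item deletion of a candidate must itself be frequent.
    return {
        c for c in candidates
        if all(drop_item(c, e, k) in frequent
               for e in range(len(c)) for k in range(len(c[e])))
    }


L3 = {
    ((1, 2), (3,)), ((1, 2), (4,)), ((1,), (3, 4)),
    ((1, 3), (5,)), ((2,), (3, 4)), ((2,), (3,), (5,)),
}
print(ggen(L3))
```

In this example the join yields two candidates, `((1, 2), (3, 4))` and `((1, 2), (3,), (5,))`; the second is pruned because `((1,), (3,), (5,))` is not among the frequent sequences.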
11
11 Related Works--MFS
The I/O cost of GSP is high in some cases.
MFS (IC-AI 2001) tries to reduce the I/O cost incurred by GSP.
– It makes use of a suggested set of frequent sequences, $S_{est}$, obtained from
  – mining a sample of the database using GSP, or
  – the results of a previous mining run.
– It generalizes the candidate generation function of GSP.
  – Its input: frequent sequences of various lengths
  – Its output: candidate sequences of various lengths
– Longer sequences can be generated and counted early, so MFS reduces I/O cost.
12
12 Related Works--MFS MFS algorithm
13
13 Incremental Update Algorithms
Applying GSP or MFS to mine the new database from scratch is inefficient.
– Information available: frequent sequences in $D$ and their supports
Basic Idea:
– If a sequence $s$ is frequent in $D$, its support count in $D'$ can be deduced by scanning only the deleted sequences $\Delta^-$ and the inserted sequences $\Delta^+$, without scanning the unchanged part $D^-$.
– If a sequence $s$ is infrequent in $D$, it cannot be frequent in $D'$ unless
  – its support count in $\Delta^+$ is large enough, and
  – its support count in $\Delta^-$ is small enough.
14
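14 Incremental Update Algorithms
Mathematical formulae (here $\delta_X^s$ denotes the support count of $s$ in a set of sequences $X$, and $\rho_s$ the support threshold):
– $\delta_{D'}^s = \delta_D^s - \delta_{\Delta^-}^s + \delta_{\Delta^+}^s$
– Define $b_X^s = \min_{s'} \delta_X^{s'}$ ($s'$ is a subsequence of $s$); then $b_X^s$ is an upper bound of $\delta_X^s$.
– Lemma 1: For a sequence $s$ to be frequent in $D'$, the following formula must be true: $\delta_D^s + b_{\Delta^+}^s \ge \delta_D^s + \delta_{\Delta^+}^s \ge |D'| \times \rho_s$
– Lemma 2: If a sequence $s$ is infrequent in $D$ but frequent in $D'$, the following formula must be true: $b_{\Delta^+}^s \ge b_{\Delta^+}^s - \delta_{\Delta^-}^s \ge \delta_{\Delta^+}^s - \delta_{\Delta^-}^s > (|\Delta^+| - |\Delta^-|) \times \rho_s$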
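Both lemmas follow from the update identity in a few steps. The derivation below is a sketch under assumed notation, which may differ from the original slide: $\delta_X^s$ is the support count of $s$ in a set of sequences $X$, $b_X^s$ is an upper bound on $\delta_X^s$, $\rho_s$ is the support threshold, $\Delta^-$ and $\Delta^+$ are the deleted and inserted sequences, and $D'$ is the updated database.

```latex
\begin{align*}
\delta_{D'}^{s} &= \delta_{D}^{s} - \delta_{\Delta^-}^{s} + \delta_{\Delta^+}^{s},
  \qquad |D'| = |D| - |\Delta^-| + |\Delta^+| \\[4pt]
% Lemma 1: s frequent in D' means \delta_{D'}^{s} \ge |D'| \rho_s
\text{Lemma 1:}\quad |D'|\,\rho_s
  &\le \delta_{D}^{s} - \delta_{\Delta^-}^{s} + \delta_{\Delta^+}^{s}
   \le \delta_{D}^{s} + \delta_{\Delta^+}^{s}
   \le \delta_{D}^{s} + b_{\Delta^+}^{s} \\[4pt]
% Lemma 2: additionally s infrequent in D means \delta_{D}^{s} < |D| \rho_s
\text{Lemma 2:}\quad \delta_{\Delta^+}^{s} - \delta_{\Delta^-}^{s}
  &= \delta_{D'}^{s} - \delta_{D}^{s}
   > |D'|\,\rho_s - |D|\,\rho_s
   = \bigl(|\Delta^+| - |\Delta^-|\bigr)\,\rho_s
\end{align*}
```

Lemma 1 drops the non-negative term $\delta_{\Delta^-}^{s}$ and replaces $\delta_{\Delta^+}^{s}$ by its upper bound, so it can be tested without any scan when $\delta_D^s$ is known. Lemma 2 does not involve $\delta_D^s$ at all, which is why it applies to sequences whose old support count was never recorded.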
15
15 Incremental Update Algorithms
Algorithms GSP+ and MFS+
– Their structures are similar to those of GSP and MFS.
– Difference: each time candidates are generated, the two lemmas are used to delete some of them, scanning $\Delta^-$ and/or $\Delta^+$ when necessary.
  – For a sequence $s$ that is frequent in $D$, its support count in $D$, $\delta_D^s$, is known: apply Lemma 1.
  – For a sequence $s$ that is infrequent in $D$, $\delta_D^s$ is not known: apply Lemma 2.
– CPU savings are achieved by avoiding processing $D^-$ for some candidates.
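The candidate screening described above might look roughly like the sketch below. It is a simplified illustration rather than the GSP+ or MFS+ code: the function names, the dictionary layout and the placeholder candidate names are assumptions, and the upper bounds on the support counts in the inserted part are taken as given. `Fraction` is used for the threshold to keep the comparisons exact.

```python
from fractions import Fraction


def updated_count(count_d, count_del, count_ins):
    """Support count in the updated database, without touching the unchanged part."""
    return count_d - count_del + count_ins


def survives_lemma1(count_d, bound_ins, new_db_size, min_support):
    """Necessary condition for a sequence that was frequent in the old database."""
    return count_d + bound_ins >= new_db_size * min_support


def survives_lemma2(bound_ins, ins_size, del_size, min_support, count_del=0):
    """Necessary condition for a sequence that was infrequent in the old database."""
    return bound_ins - count_del > (ins_size - del_size) * min_support


def screen(candidates, old_counts, bound_ins,
           new_db_size, ins_size, del_size, min_support):
    """Drop candidates that cannot be frequent in the updated database.

    old_counts: support counts in the old database, known only for the
                sequences that were frequent there (assumed layout).
    bound_ins:  upper bound on each candidate's support count in the
                inserted part.
    """
    kept = []
    for c in candidates:
        if c in old_counts:  # old count known: apply Lemma 1
            ok = survives_lemma1(old_counts[c], bound_ins[c],
                                 new_db_size, min_support)
        else:                # old count unknown: apply Lemma 2
            ok = survives_lemma2(bound_ins[c], ins_size, del_size, min_support)
        if ok:
            kept.append(c)
    return kept


kept = screen(
    candidates=["a", "b", "c", "d"],
    old_counts={"a": 9, "b": 4},
    bound_ins={"a": 2, "b": 3, "c": 0, "d": 1},
    new_db_size=1000, ins_size=100, del_size=100,
    min_support=Fraction(1, 100),
)
print(kept)
```

Candidates that survive the screen still have to be counted; the saving comes from those that are discarded before the unchanged part of the database is processed.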
16
16 Performance
Synthetic dataset
– Parameters of the dataset

Parameter   Description                                                            Value
|D|         Number of customers                                                    1,500,000
|C|         Average number of transactions per customer                            10
|T|         Average number of items per transaction                                2.5
|S|         Average No. of itemsets in maximal potentially frequent sequences      4
|I|         Average size of itemsets in maximal potentially frequent sequences     1.25
N_S         Number of maximal potentially frequent sequences                       5,000
N_I         Number of maximal potentially frequent itemsets                        25,000
N           Number of items                                                        10,000
17
17 Performance
Comparison of four algorithms under different support thresholds
– $|D| = |D'|$ = 1,500,000; $|\Delta^+| = |\Delta^-|$ = 150,000 = 10% of $|D|$
– $\rho_s$ = 0.35% to 0.65%
18
18 Performance
Comparison of four algorithms under different support thresholds
– GSP+ and MFS+ need less CPU time.
– GSP+ and MFS+ usually incur slightly more I/O cost because they process $\Delta^-$, which GSP and MFS do not need to read.
– The MFS-based algorithms perform better, especially in I/O cost.
  – They use the old frequent sequences as $S_{est}$.
– MFS+ is the overall winner in terms of both CPU and I/O costs.

$\rho_s$                          0.35%    0.4%     0.45%    0.5%    0.55%   0.6%    0.65%
Total No. of candidates           34,065   18,356   10,024   5,812   3,365   2,053   1,160
Those requiring a scan of $D^-$   13,042   7,161    3,966    2,313   1,353   867     509
Percentage (row 3 / row 2)        38%      39%      40%      40%     40%     42%     44%
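The percentage row can be recomputed from the two count rows of the table. The short check below uses the counts exactly as reported; the variable names are illustrative.

```python
# Candidate counts for support thresholds 0.35%, 0.4%, ..., 0.65%.
total_candidates = [34065, 18356, 10024, 5812, 3365, 2053, 1160]
need_unchanged_scan = [13042, 7161, 3966, 2313, 1353, 867, 509]

# Share of candidates that still require scanning the unchanged part.
percentages = [round(100 * n / t)
               for n, t in zip(need_unchanged_scan, total_candidates)]
print(percentages)  # [38, 39, 40, 40, 40, 42, 44]
```

Across the tested thresholds, roughly 60% of the candidates are resolved without processing the unchanged part of the database.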
19
19 Performance
Varying $|\Delta^+|$ and $|\Delta^-|$
– $\rho_s$ = 0.5%, $|D| = |D'|$ = 1,500,000
– $|\Delta^+| = |\Delta^-|$ varies from 1% to 40% of $|D|$
20
20 Performance
Varying $|\Delta^+|$ and $|\Delta^-|$
– The CPU costs of GSP and MFS stay relatively steady.
  – GSP and MFS deal with $D'$ only, and $|D'|$ does not change.
– The CPU costs of GSP+ and MFS+ increase linearly with $|\Delta^+|$ and $|\Delta^-|$.
  – GSP+ and MFS+ need more time to process $\Delta^+$ and $\Delta^-$.
– MFS+ is the most CPU-efficient algorithm when $|\Delta^+| = |\Delta^-|$ is less than 25% of $|D|$.
21
21 Conclusion
GSP+ and MFS+ outperform their non-incremental counterparts in CPU cost, at the expense of a small penalty in I/O cost.
The MFS-based algorithms perform better than the GSP-based ones, particularly in I/O cost.
The performance gains of GSP+ and MFS+ are most prominent when the changed part of the database is small compared with the unchanged part.
22
22 The End ?