Spring 2016 Presentation by: Julianne Daly Mining Sequential Patterns Rakesh Agrawal & Ramakrishnan Srikant Proc. of the Int'l Conference on Data Engineering (ICDE), Taipei, Taiwan, March 1995.
Outline: Introduction; Problem Description; Finding Sequential Patterns; Performance; Conclusion; Final Exam Questions
Introduction Bar-code technology allows the collection of massive amounts of sales data (basket data). A typical data record consists of the transaction date, the items bought, and a customer-id.
Introduction So far, we have seen frequent pattern mining in the context of association rules, where we were interested in which items were purchased in the same transaction. These are intra-transactional patterns. Sequential pattern mining is concerned with inter-transactional patterns. A pattern in the first case is an unordered set of items: {a,c,d,g}. A pattern in the second case is an ordered list of sets of items: <{a},{c,d},{g}>.
Introduction An example of a sequential pattern: Customers typically rent* “Star Wars”, then “The Empire Strikes Back”, followed by “Return of the Jedi”. * Note that these rentals do not need to be consecutive. Customers who rent other videos in between also support this sequential pattern. 6
Introduction Elements of a sequential pattern can be sets of items as well. For example: “Fitted sheet, flat sheet, and pillow cases”, followed by “comforter”, followed by “drapes and ruffles”. 7
Outline: Introduction; Problem Description; Finding Sequential Patterns; Performance; Conclusion; Final Exam Questions
Problem Description We are given a database D of customer transactions. Each transaction consists of three fields: customer-id, transaction-time, and the items purchased in the transaction.
Problem Description No customer has more than one transaction with the same transaction-time. Quantities of items bought are not considered: each item is a binary variable representing whether an item was bought or not. 10
Problem Description (Terminology and definitions)
Itemset: A non-empty set of items. Each itemset is mapped to an integer.
Sequence: An ordered list of itemsets.
Customer Sequence: The list of a customer's transactions ordered by increasing transaction time.
Containment: A sequence <a1 a2 ... an> is contained in a sequence <b1 b2 ... bm> if there exist integers i1 < i2 < ... < in such that a1 ⊆ bi1, a2 ⊆ bi2, ..., an ⊆ bin.
A customer supports a sequence if the sequence is contained in the customer sequence.
Support for a Sequence: The fraction of total customers that support the sequence.
Maximal Sequence: In a set of sequences, a sequence that is not contained in any other sequence of the set.
Problem Description (Terminology and definitions)
Large Sequence: A sequence that meets the minimum support (minisup).
Length of a sequence: The number of itemsets in the sequence. A sequence of length k is called a k-sequence.
The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction. An itemset with minimum support is called a large itemset or Litemset.
* Note that each itemset in a large sequence must have minimum support. Therefore, any large sequence must be a list of Litemsets.
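The containment and support definitions above can be sketched in Python. This is a minimal illustration rather than the paper's implementation: the names `is_contained`, `support`, and `seq`, and the five-customer database `DB`, are assumptions made for this example.

```python
from typing import FrozenSet, List

Itemset = FrozenSet[int]
Sequence = List[Itemset]


def is_contained(pattern: Sequence, customer_seq: Sequence) -> bool:
    """True if every itemset of `pattern` is a subset of some transaction of
    `customer_seq`, with the matched transactions in strictly increasing order."""
    pos = 0
    for element in pattern:
        # Greedily take the earliest transaction that contains this element.
        while pos < len(customer_seq) and not element <= customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1  # the next element must match a strictly later transaction
    return True


def support(pattern: Sequence, customer_seqs: List[Sequence]) -> float:
    """Fraction of customers whose sequence contains `pattern`."""
    hits = sum(is_contained(pattern, c) for c in customer_seqs)
    return hits / len(customer_seqs)


def seq(*itemsets) -> Sequence:
    """Hypothetical helper: build a sequence from tuples of items."""
    return [frozenset(i) for i in itemsets]


# Hypothetical database of five customer sequences (an assumption for this sketch).
DB = [
    seq((30,), (90,)),
    seq((10, 20), (30,), (40, 60, 70)),
    seq((30, 50, 70)),
    seq((30,), (40, 70), (90,)),
    seq((90,)),
]

print(support(seq((30,), (90,)), DB))     # supported by the first and fourth customers
print(support(seq((10, 20), (30,)), DB))  # supported by the second customer only
```

Greedy earliest matching is sufficient here: taking the earliest possible transaction for each element never rules out a later match.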
Problem Description Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain specified minimum support. Each such sequence represents a sequential pattern. 14
Problem Description Example: sequences with minimum support. Note: with a minisup of 25%, at least two customers must support a sequence. < (10 20) (30) > does not have enough support (it is supported only by Customer #2). < (30) >, < (70) >, < (30) (40) > … have minimum support but are not maximal.
Outline: Introduction; Problem Description; Finding Sequential Patterns; Performance; Conclusion; Final Exam Questions
Finding Sequential Patterns The problem of finding sequential patterns is split into five phases: 1. Sort Phase; 2. Large itemset (Litemset) Phase; 3. Transformation Phase; 4. Sequence Phase; 5. Maximal Phase.
Finding Sequential Patterns: 1. Sort Phase The DB is sorted, with customer-id as the major key and transaction-time as the minor key. This step implicitly converts the original transaction DB into a DB of customer sequences. Recall that a Customer Sequence is a list of customer transactions ordered by increasing transaction time.
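The sort phase can be sketched as a sort followed by a group-by. The function name `to_customer_sequences` and the `(customer_id, time, items)` record layout are assumptions for this illustration.

```python
from itertools import groupby
from operator import itemgetter


def to_customer_sequences(transactions):
    """Sort by (customer_id, time), then group each customer's transactions
    into an ordered list of itemsets.

    `transactions` is an iterable of (customer_id, time, items) records.
    """
    # customer-id is the major key, transaction-time the minor key.
    ordered = sorted(transactions, key=itemgetter(0, 1))
    return {
        customer_id: [frozenset(items) for _, _, items in group]
        for customer_id, group in groupby(ordered, key=itemgetter(0))
    }


# Hypothetical unsorted transaction records.
records = [
    (2, 3, [40, 60, 70]),
    (1, 1, [30]),
    (2, 1, [10, 20]),
    (1, 2, [90]),
    (2, 2, [30]),
]
customer_sequences = to_customer_sequences(records)
print(customer_sequences[2])
```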
Finding Sequential Patterns: 2. Litemset Phase In this phase we find the set of all Large itemsets (Litemsets) L. We are also simultaneously finding the set of large 1-sequences, since this set is just: {< l > | l ∈ L }. The authors state that it is straightforward to adapt previously seen algorithms for finding Litemsets. The main difference is that the support count should be incremented only once per customer, even if the customer buys the same set of items in two different transactions.
Finding Sequential Patterns: 2. Litemset Phase In Apriori, the support for an itemset was defined as the fraction of transactions in which an itemset is present. In the sequential pattern finding problem, the support is the fraction of customers who bought the itemset in any one of their possibly many transactions. 20
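The per-customer counting rule can be illustrated with a brute-force sketch. It enumerates every subset of every transaction, which is exponential in transaction size; it is not the Apriori-style search the authors refer to, only a demonstration of counting each itemset once per customer. The name `find_litemsets` and the database are assumptions.

```python
from collections import Counter
from itertools import combinations


def find_litemsets(customer_seqs, min_count):
    """Return {itemset: number of supporting customers} for itemsets
    supported by at least `min_count` customers."""
    counts = Counter()
    for customer_seq in customer_seqs:
        seen = set()  # itemsets this customer bought in some single transaction
        for transaction in customer_seq:
            items = sorted(transaction)
            for size in range(1, len(items) + 1):
                seen.update(combinations(items, size))
        counts.update(seen)  # each itemset counted once per customer
    return {itemset: n for itemset, n in counts.items() if n >= min_count}


# Hypothetical database: one list of transactions per customer.
DB = [
    [(30,), (90,)],
    [(10, 20), (30,), (40, 60, 70)],
    [(30, 50, 70)],
    [(30,), (40, 70), (90,)],
    [(90,)],
]
litemsets = find_litemsets(DB, min_count=2)
print(sorted(litemsets))
```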
Finding Sequential Patterns: 2. Litemset Phase The set of Litemsets is mapped to a set of contiguous integers. By treating Litemsets as single entities, two Litemsets can be compared for equality in constant time, reducing the time required to check if a sequence is contained in a customer sequence. 21
Finding Sequential Patterns: 2. Litemset Phase Example with a minimum support of 40% (figure not reproduced).
Finding Sequential Patterns: 3. Transformation Phase As we shall see later, we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence. In order to make this test fast, the customer sequences are transformed into an alternative representation. 23
Finding Sequential Patterns: 3. Transformation Phase Each transaction is replaced by the set of all Litemsets contained in the transaction. Transactions with no Litemsets are dropped. (A customer sequence that becomes empty still contributes to the total customer count.) A customer sequence is now represented by a list of sets of Litemsets.
Finding Sequential Patterns: 3. Transformation Phase Note: (10 20) is dropped because it lacks minimum support. (40 60 70) is replaced with the set of Litemsets {(40), (70), (40 70)}, since 60 does not have minimum support.
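A sketch of the transformation, assuming the Litemsets have already been found and mapped to integers. The function name `transform` and the particular id assignment in `LITEMSET_IDS` are assumptions for this illustration.

```python
def transform(customer_seqs, litemset_ids):
    """Replace each transaction by the set of ids of the Litemsets it contains.

    `litemset_ids` maps each Litemset (a frozenset of items) to an integer id.
    Transactions containing no Litemset are dropped, but a customer whose
    sequence becomes empty is kept so the customer count is unchanged.
    """
    transformed = []
    for customer_seq in customer_seqs:
        new_seq = []
        for transaction in customer_seq:
            ids = {i for litemset, i in litemset_ids.items() if litemset <= transaction}
            if ids:
                new_seq.append(ids)
        transformed.append(new_seq)
    return transformed


# Hypothetical mapping of Litemsets to contiguous integers.
LITEMSET_IDS = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}

customer = [frozenset({10, 20}), frozenset({30}), frozenset({40, 60, 70})]
print(transform([customer], LITEMSET_IDS))
```

In the printed result, the transaction (10 20) disappears and (40 60 70) becomes the ids of (40), (70), and (40 70), matching the note above.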
Finding Sequential Patterns 4. Sequence Phase Each pass: start from a seed set of large sequences; create candidate sequences; scan the data to find the support of the candidate sequences; determine the large sequences.
Finding Sequential Patterns 4. Sequence Phase Use the set of Litemsets to find the desired sequences. Two families of algorithms are presented: Count-all and Count-some.
Finding Sequential Patterns 4. Sequence Phase Count-all algorithms count all the large sequences, including non-maximal sequences, which are pruned out in the maximal phase. Count-some algorithms try to avoid counting non-maximal sequences by first counting longer sequences in a forward phase, then counting the sequences skipped in a backward phase.
Finding Sequential Patterns 4. Sequence Phase Count-all algorithm: AprioriAll. Count-some algorithms: AprioriSome and DynamicSome.
Finding Sequential Patterns 4. Sequence Phase: AprioriAll
L1 = {large 1-sequences}; // result of Litemset phase
for (k = 2; Lk-1 ≠ {}; k++) do begin
    Ck = New candidates generated from Lk-1
    foreach customer-sequence c in the database do
        Increment the count of all candidates in Ck that are contained in c.
    Lk = Candidates in Ck with minimum support.
end
Answer = Maximal Sequences in ∪k Lk
Notation: Lk: set of all large k-sequences; Ck: set of candidate k-sequences.
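The counting step inside the loop can be sketched as a plain scan over the transformed database. This naive version tests every candidate against every customer sequence; it does not use the hash-tree structure the authors cite. The names `contains` and `large_sequences` and the transformed database `DT` are assumptions for this illustration.

```python
def contains(candidate, customer_seq):
    """True if the Litemset ids in `candidate` occur, in order, in strictly
    later and later elements of the transformed `customer_seq`."""
    pos = 0
    for litemset_id in candidate:
        while pos < len(customer_seq) and litemset_id not in customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1
    return True


def large_sequences(candidates, transformed_db, min_count):
    """One counting pass: return {candidate: count} for candidates with
    at least `min_count` supporting customers."""
    counts = {c: 0 for c in candidates}
    for customer_seq in transformed_db:
        for c in candidates:
            if contains(c, customer_seq):
                counts[c] += 1  # at most once per customer
    return {c: n for c, n in counts.items() if n >= min_count}


# Hypothetical transformed database: lists of sets of Litemset ids.
DT = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
C2 = [(1, 2), (1, 3), (4, 5), (5, 1)]
print(large_sequences(C2, DT, min_count=2))
```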
Finding Sequential Patterns 4. AprioriAll Candidate Generation The apriori-generate function takes as argument Lk-1, the set of all large (k-1)-sequences. The function works as follows. First, join Lk-1 with Lk-1:
insert into Ck
select p.litemset1, ..., p.litemsetk-1, q.litemsetk-1
from Lk-1 p, Lk-1 q
where p.litemset1 = q.litemset1, ..., p.litemsetk-2 = q.litemsetk-2;
Next, delete all sequences c ∈ Ck such that some (k-1)-subsequence of c is not in Lk-1.
Finding Sequential Patterns 4. AprioriAll Candidate Generation Example <1 2 4 3> is pruned out because the subsequence <2 4 3> is not in L3. The authors cite a previous paper for the proof of correctness of the candidate generation procedure. 32
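The join and prune steps can be sketched as follows. Sequences are tuples of Litemset ids; the function name `apriori_generate` and the set `L3` are assumptions, with `L3` chosen to be consistent with the pruning of <1 2 4 3> described above. The join is taken literally from the pseudocode, so a sequence may be joined with itself; the prune step removes any resulting candidate that has a non-large subsequence.

```python
from itertools import combinations


def apriori_generate(large_prev):
    """Generate candidate k-sequences from the large (k-1)-sequences."""
    large_prev = set(large_prev)
    if not large_prev:
        return []
    k = len(next(iter(large_prev))) + 1
    # Join: p and q agree on their first k-2 elements; append q's last element to p.
    joined = {
        p + (q[-1],)
        for p in large_prev
        for q in large_prev
        if p[:-1] == q[:-1]
    }
    # Prune: every (k-1)-subsequence (one element dropped) must itself be large.
    return sorted(
        c for c in joined
        if all(sub in large_prev for sub in combinations(c, k - 1))
    )


# Hypothetical set of large 3-sequences.
L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
print(apriori_generate(L3))
```

`itertools.combinations` preserves element order, so each tuple it yields is a valid subsequence of the candidate.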
Finding Sequential Patterns 4. AprioriAll Maximal Phase Having found the set S of all large sequences in the sequence phase, the following algorithm can be used to find the maximal sequences. Let n = length of the longest sequence.
for (k = n; k > 1; k--) do
    foreach k-sequence sk do
        Delete from S all subsequences of sk
The authors note that data structures (hash trees) and an algorithm exist to do this efficiently, citing two of their earlier papers.
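For small inputs, the maximal phase can be sketched with a simple quadratic check in place of the hash-tree approach: keep a sequence only if no other large sequence contains it. Sequences are tuples of frozensets so that itemset containment (for example, (40) within (40 70)) is respected. The names `is_contained`, `maximal_sequences`, and `s`, and the list `LARGE`, are assumptions for this illustration.

```python
def is_contained(a, b):
    """True if sequence `a` is contained in sequence `b`. Each itemset of `a`
    must be a subset of a later and later itemset of `b`; the shared iterator
    guarantees the order is preserved."""
    remaining = iter(b)
    return all(any(x <= y for y in remaining) for x in a)


def maximal_sequences(large):
    """Keep only sequences not contained in any other sequence of `large`."""
    large = list(large)
    return [s for s in large if not any(s != t and is_contained(s, t) for t in large)]


def s(*itemsets):
    """Hypothetical helper: build a hashable sequence from tuples of items."""
    return tuple(frozenset(i) for i in itemsets)


# Hypothetical set of large sequences.
LARGE = [
    s((30,)), s((40,)), s((70,)), s((40, 70)), s((90,)),
    s((30,), (40,)), s((30,), (70,)), s((30,), (90,)), s((30,), (40, 70)),
]
for seq in maximal_sequences(LARGE):
    print([sorted(itemset) for itemset in seq])
```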
Finding Sequential Patterns 4. AprioriAll Example 34
Finding Sequential Patterns 4. AprioriSome AprioriSome uses the function “next” to determine which sequence lengths to skip. Let hitk = |Lk| / |Ck| (i.e., the ratio of large k-sequences to candidate k-sequences).
function next(k: integer) // k is the length of the sequences counted in the last pass
begin
    if (hitk < 0.666) return k + 1;
    elseif (hitk < 0.75) return k + 2;
    elseif (hitk < 0.80) return k + 3;
    elseif (hitk < 0.85) return k + 4;
    else return k + 5;
end
next returns the length of sequences to count in the next pass. The idea behind this heuristic is that as the percentage of candidates counted in the current pass which had minimum support increases, the time wasted counting extensions of small candidates when skipping a length goes down. In other words, if almost all sequences of length k are large (frequent), then many of the sequences of length k+1 are also likely to be large (frequent).
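The heuristic translates directly into Python. The name `next_length` and passing the hit ratio as an explicit argument are choices made for this sketch; the thresholds are the ones listed above.

```python
def next_length(k: int, hit_ratio: float) -> int:
    """Length of sequences to count in the next pass, given that length k
    was counted last and hit_ratio = |Lk| / |Ck|."""
    if hit_ratio < 0.666:
        return k + 1
    elif hit_ratio < 0.75:
        return k + 2
    elif hit_ratio < 0.80:
        return k + 3
    elif hit_ratio < 0.85:
        return k + 4
    else:
        return k + 5


# A low hit ratio means few candidates were large, so no length is skipped.
print(next_length(2, 0.50))
# A high hit ratio means most candidates were large, so several lengths are skipped.
print(next_length(2, 0.90))
```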
Finding Sequential Patterns 4. AprioriSome Forward Phase
L1 = {large 1-sequences}; // result of Litemset phase
C1 = L1;
last = 1; // we last counted Clast
for (k = 2; Ck-1 ≠ {} and Llast ≠ {}; k++) do begin
    if (Lk-1 known) then
        Ck = New candidates generated from Lk-1
    else
        Ck = New candidates generated from Ck-1
    if (k == next(last)) then begin // is k the next length to count?
        foreach customer-sequence c in the database do
            Increment the count of all candidates in Ck that are contained in c.
        Lk = Candidates in Ck with minimum support.
        last = k;
    end
end
We use the apriori-generate function given earlier to generate new candidate sequences. However, in the kth pass, we may not have the large sequence set Lk-1 available, as we did not count the (k-1)-candidate sequences. In that case, we use the candidate set Ck-1 to generate Ck. Correctness is maintained because Lk-1 is contained in Ck-1.
Finding Sequential Patterns 4. AprioriSome Backward Phase
for (k--; k >= 1; k--) do
    if (Lk not found in forward phase) then begin
        Delete all sequences in Ck contained in some Li, i > k;
        foreach customer-sequence c in DT do
            Increment the count of all candidates in Ck that are contained in c
        Lk = Candidates in Ck with minimum support
    end
    else // Lk already known
        Delete all sequences in Lk contained in some Li, i > k;
Answer = ∪k Lk // (Maximal Phase not needed)
Notation: DT is the transformed database.
In the backward phase, we count sequences for the lengths we skipped over during the forward phase, after first deleting all sequences contained in some large sequence. These smaller sequences cannot be in the answer because we are only interested in maximal sequences. We also delete the large sequences found in the forward phase that are non-maximal.
Finding Sequential Patterns 4. AprioriSome Example Minimum Support = 40% (2 customer sequences). 39
Finding Sequential Patterns 4. AprioriSome Example For simplicity of illustration, the next function is taken to be f(k) = 2k. Using the same database as for the AprioriAll example, we find the large 1-sequences (L1) in the Litemset phase. In the second pass, we count C2 to get L2. In the third pass, apriori-generate is called with L2 as its argument to get C3. In this pass, however, next takes 2 as its argument (the length of the candidate sequences counted last) and returns f(2) = 2 × 2 = 4. Since k = 3 ≠ next(last) = 4, C3 is not counted and L3 is not determined. Next, since L3 is not known, apriori-generate is called with C3 to generate C4, which after pruning gives the C4 shown in the figure. After counting C4 to get L4, we try to generate C5, which turns out to be empty, so we exit the loop. During the backward phase, nothing is deleted from L4, since there are no sequences longer than 4. We skipped counting the support for sequences in C3 during the forward phase. After deleting those sequences in C3 that are contained in a sequence of L4, i.e., subsequences of <1 2 3 4>, we are left with the sequences <1 3 5> and <3 4 5>. These are counted, yielding <1 3 5> as a maximal large 3-sequence. Next, all the sequences in L2 except <4 5> are deleted, since each is contained in some longer large sequence. For the same reason, all sequences in L1 are also deleted. Answer: <1 2 3 4>, <1 3 5>, <4 5>
Finding Sequential Patterns 4. DynamicSome Like AprioriSome, DynamicSome skips counting candidate sequences of certain lengths in the forward phase. AprioriSome generates Ck from Lk-1 or Ck-1. DynamicSome generates Ck “on the fly”, based on the large sequences found in previous passes and the customer sequences read from the database.
Finding Sequential Patterns 4. DynamicSome In the initialization phase, we count only sequences of length up to and including the value of the variable step. If step is 3, we count sequences of length 1, 2, and 3. In the forward phase, we generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database. If step is 3, these are lengths 6, 9, 12, etc.
Finding Sequential Patterns 4. DynamicSome In the intermediate phase, we generate the candidate sequences for the skipped lengths. For example, if we have counted L3 and L6, and L9 turns out to be empty, we generate C7 and C8, count C8 followed by C7 after deleting non-maximal sequences, and repeat the process for C4 and C5. The backward phase is identical to that of AprioriSome.
Finding Sequential Patterns 4. DynamicSome Initialization Phase
// step is an integer ≥ 1
L1 = {large 1-sequences}; // result of Litemset phase
for (k = 2; k <= step and Lk-1 ≠ {}; k++) do begin
    Ck = New candidates generated from Lk-1;
    foreach customer-sequence c in DT do
        Increment the count of all candidates in Ck that are contained in c.
    Lk = Candidates in Ck with minimum support.
end
Finding Sequential Patterns 4. DynamicSome Forward Phase
for (k = step; Lk ≠ {}; k += step) do begin
    // find Lk+step from Lk and Lstep
    Ck+step = {};
    foreach customer-sequence c in DT do begin
        X = otf-generate(Lk, Lstep, c);
        foreach sequence x ∈ X, increment its count in Ck+step (adding it to Ck+step if necessary).
    end
    Lk+step = Candidates in Ck+step with minimum support.
end
Finding Sequential Patterns 4. DynamicSome OTF-Generate
// c is the customer sequence < c1 c2 ... cn >
Xk = subseq(Lk, c);
forall sequences x ∈ Xk do
    x.end = min{ j | x ⊆ < c1 c2 ... cj > };
Xj = subseq(Lj, c);
forall sequences x ∈ Xj do
    x.start = max{ j | x ⊆ < cj cj+1 ... cn > };
Answer = join of Xk with Xj with the join condition: Xk.end < Xj.start;
The otf-generate function takes as arguments Lk, the set of large k-sequences, Lj, the set of large j-sequences, and the customer sequence c. It returns the set of candidate (k + j)-sequences contained in c. The intuition behind this generation procedure is that if sk ∈ Lk and sj ∈ Lj are both contained in c, and they don't overlap in c, then < sk.sj > is a candidate (k + j)-sequence.
DynamicSome uses apriori-generate in the initialization and intermediate phases, and otf-generate in the forward phase. The reason is that apriori-generate generates fewer candidates than otf-generate when we generate Ck+1 from Lk. However, this may not hold when we try to find Lk+step from Lk and Lstep, as is the case in the forward phase. In addition, if |Lk| + |Lstep| is less than the size of Ck+step generated by apriori-generate, it may be faster to find all members of Lk and Lstep contained in c than to find all members of Ck+step contained in c.
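The otf-generate procedure can be sketched as follows, with sequences as tuples of Litemset ids and the customer sequence as a list of sets of ids. The function names, the set `L2`, and the customer sequence `c` are assumptions for this illustration, and positions are 0-based rather than 1-based.

```python
def earliest_end(seq, c):
    """Smallest index j such that seq is contained in c[0..j], or None."""
    pos = 0
    for litemset_id in seq:
        while pos < len(c) and litemset_id not in c[pos]:
            pos += 1
        if pos == len(c):
            return None
        pos += 1
    return pos - 1


def latest_start(seq, c):
    """Largest index j such that seq is contained in c[j..], or None."""
    pos = len(c) - 1
    for litemset_id in reversed(seq):
        while pos >= 0 and litemset_id not in c[pos]:
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
    return pos + 1


def otf_generate(large_k, large_j, c):
    """Candidate (k + j)-sequences contained in customer sequence c."""
    ends = {}
    for x in large_k:
        end = earliest_end(x, c)
        if end is not None:
            ends[x] = end
    starts = {}
    for y in large_j:
        start = latest_start(y, c)
        if start is not None:
            starts[y] = start
    # Join condition: x must end strictly before y starts (no overlap in c).
    return {x + y for x, end in ends.items() for y, start in starts.items() if end < start}


# Hypothetical large 2-sequences and one transformed customer sequence.
L2 = {(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5)}
c = [{1}, {2}, {3, 7}, {4}]
print(otf_generate(L2, L2, c))
```

In this example only <1 2> (ending at index 1) and <3 4> (starting at index 2) satisfy the join condition, so the single candidate is <1 2 3 4>.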
Finding Sequential Patterns 4. DynamicSome Intermediate Phase
for (k--; k > 1; k--) do
    if (Lk not yet determined) then
        if (Lk-1 known) then
            Ck = New candidates generated from Lk-1;
        else
            Ck = New candidates generated from Ck-1;
Finding Sequential Patterns 4. DynamicSome Example Let step = 2. In the initialization phase we determine L2. Then, in the forward phase, we pass L2 and L2 as arguments to otf-generate to generate C4. We get two candidate sequences in C4: <1 2 3 4> and <1 3 4 5>, with support of 2 and 1, respectively. Only <1 2 3 4> is large. In the next pass, we find C6 to be empty. In the intermediate phase, we generate C3 from L2 and C5 from L4. C5 is empty, so we only count C3 to get L3.
Finding Sequential Patterns 4. DynamicSome Example We get two candidate sequences. C4 (sequence: support): <1 2 3 4>: 2; <1 3 4 5>: 1.
Finding Sequential Patterns 4. DynamicSome Example Only <1 2 3 4> is large. C4 (sequence: support): <1 2 3 4>: 2; <1 3 4 5>: 1. L4 (sequence: support): <1 2 3 4>: 2.
Finding Sequential Patterns 4. DynamicSome Example L2 and L4 are passed as arguments to otf-generate to get C6. L4 (sequence: support): <1 2 3 4>: 2.
Finding Sequential Patterns 4. DynamicSome Example C6 is found to be empty: C6 = {}. L4 (sequence: support): <1 2 3 4>: 2.
Finding Sequential Patterns 4. DynamicSome Example In the intermediate phase, C3 is generated from L2, and C5 from L4, using apriori-generate. L4 (sequence: support): <1 2 3 4>: 2.
Finding Sequential Patterns 4. DynamicSome Example C5 is found to be empty (C5 = {}), so only C3 is counted during the backward phase to get L3.
Outline: Introduction; Problem Description; Finding Sequential Patterns (Sequence Phase); Performance; Conclusion; Final Exam Questions
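Performance: Synthetic Data The synthetic datasets use |D| = 250,000 customers, NS = 5000 maximal potentially large sequences, NI = 25,000 maximal potentially large itemsets, and N = 10,000 items.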
Performance: Execution Times “We have not plotted the execution times for DynamicSome for low values of minimum support since it generated too many candidates and ran out of memory. Even if DynamicSome had more memory, the cost of finding the support for that many candidates would have ensured execution times much larger than those for AprioriAll or AprioriSome. As expected, the execution times of all the algorithms increase as the support is decreased because of a large increase in the number of large sequences in the result.
DynamicSome performs worse than the other two algorithms mainly because it generates and counts a much larger number of candidates in the forward phase. The difference in the number of candidates generated is due to the otf-generate candidate generation procedure it uses. The apriori-generate does not count any candidate sequence that contains any subsequence which is not large. The otf-generate does not have this pruning capability.
The major advantage of AprioriSome over AprioriAll is that it avoids counting many non-maximal sequences. [...] Second, although AprioriSome skips over counting candidates of some lengths, they are generated nonetheless and stay memory resident. If memory gets filled up, AprioriSome is forced to count the last set of candidates generated even if the heuristic suggests skipping some more candidate sets. This effect decreases the skipping distance between the two candidate sets that are indeed counted, and AprioriSome starts behaving more like AprioriAll. For lower supports, there are longer large sequences, and hence more non-maximal sequences, and AprioriSome does better.”
Performance: Scale-Up [Figure: scale-up with the number of customers]
Performance: Scale-Up Scale-up properties with respect to the number of transactions per customer and the number of items per transaction. “For support level of 200, the execution time actually went down a little when the transaction size was increased. The reason for this decrease is that there is an overhead associated with reading a transaction. At high level of support, this overhead comprises a significant part of the total execution time. Since this decreases when the number of transactions decrease, the total execution time also decreases a little.”
Outline: Introduction; Problem Description; Finding Sequential Patterns (Sequence Phase); Performance; Conclusion; Final Exam Questions
Conclusion The problem of mining sequential patterns from a customer DB was introduced. Two families of algorithms were presented to find sequential patterns: Count-all (AprioriAll) and Count-some (AprioriSome, DynamicSome). AprioriAll and AprioriSome have comparable performance, with AprioriSome slightly better for lower minisup. AprioriAll and AprioriSome have excellent scale-up properties.
Outline: Introduction; Problem Description; Finding Sequential Patterns (Sequence Phase); Performance; Conclusion; Final Exam Questions
Final Exam Question 1: Compare and contrast association rules and sequential patterns. How do they relate to each other in the context of the Apriori algorithms? Association rules describe intra-transaction patterns (which items are bought together), while sequential patterns describe inter-transaction patterns (which itemsets a customer buys over time, in order). The two are related in the Apriori algorithms studied here: the Litemset phase finds large itemsets much as in association rule mining, except that support is counted per customer, and the sequence phase then builds sequential patterns whose elements are those large itemsets.
Final Exam Question 2: What is the major difference between the two algorithm families, Count-some and Count-all? Count-all (AprioriAll) is careful with respect to minimum support and careless with respect to maximality. (Minimum support is checked for candidates of every length, but non-maximal sequences must be removed later, in the maximal phase.) Count-some (AprioriSome) is careful with respect to maximality, but careless with respect to minimum support. (Non-maximal sequences are pruned out during the run, but minimum support is not tested for every value of k in the forward phase.)
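Final Exam Question 3: Why is the Transformation stage of these pattern mining algorithms so important to their speed? The algorithms repeatedly test which candidate sequences are contained in each customer sequence. The transformation replaces each transaction with the set of Litemsets it contains, each represented by an integer, so two Litemsets can be compared for equality in constant time. This reduces the time required for every containment test and therefore the overall run time.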