Sequential PAttern Mining using A Bitmap Representation
Jay Ayres, Johannes Gehrke, Tomi Yiu and Jason Flannick
Dept. of Computer Science, Cornell University
Presenters: 0259636 林哲存, 0259639 林庭宇, 0159638 徐敏容
Outline
- Introduction
- The SPAM Algorithm
  - Lexicographic Tree for Sequences
  - Depth First Tree Traversal
  - Pruning
- Data Representation
  - Data Structure
  - Candidate Generation
- Experimental Evaluation
  - Synthetic data generation
  - Comparison With SPADE and PrefixSpan
  - Consideration of space requirements
- Conclusion
Introduction
Let I = {i1, i2, ..., in} be a set of items.
A subset X ⊆ I is called an itemset; |X| is the size of X.
A sequence s = (s1, s2, ..., sm) is an ordered list of itemsets, where si ⊆ I for i ∈ {1, ..., m}.
The length l of a sequence s = (s1, s2, ..., sm) is defined as l = |s1| + |s2| + ... + |sm|.
Introduction
Sa: a sequence. Example: Sa = ({a}, {b, c}).
sup_D(Sa): the support of Sa in database D, i.e. the number of sequences of D that contain Sa. For Sa = ({a}, {b, c}) in the example database, sup_D(Sa) = 2.
Given a support threshold minSup, a sequence Sa is called a frequent sequential pattern on D if sup_D(Sa) >= minSup.
Introduction
Sa = ({a}, {b, c})
Consider the sequence of customer 2: the size of this sequence is 2 and its length is 4.
The support of Sa is 2, or 0.67. If minSup <= 0.67, then Sa is deemed frequent.

Table 1: Dataset sorted by CID and TID
CID | TID | Itemset
1   | 1   | {a, b, d}
1   | 3   | {b, c, d}
1   | 6   | {b, c, d}
2   | 2   | {b}
2   | 4   | {a, b, c}
3   | 5   | {a, b}
3   | 7   | {b, c, d}

Table 2: Sequence for each customer
CID | Sequence
1   | ({a, b, d}, {b, c, d}, {b, c, d})
2   | ({b}, {a, b, c})
3   | ({a, b}, {b, c, d})
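These definitions can be checked directly in code. Below is a minimal Python sketch (ours, not from the paper) that counts the support of a candidate sequence over the database of Table 2; the helper names are our own.

```python
def is_subsequence(candidate, sequence):
    """True if `candidate` (a list of itemsets) occurs in order within
    `sequence`, each candidate itemset contained in a distinct,
    strictly later itemset of the customer's sequence."""
    pos = 0
    for itemset in candidate:
        # Advance to the next transaction that contains this itemset.
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next candidate itemset must match a later transaction
    return True

def support(candidate, database):
    """Number of customer sequences in `database` that contain `candidate`."""
    return sum(is_subsequence(candidate, seq) for seq in database)

# Table 2: one sequence of itemsets per customer (CID 1, 2, 3).
database = [
    [{'a', 'b', 'd'}, {'b', 'c', 'd'}, {'b', 'c', 'd'}],  # CID 1
    [{'b'}, {'a', 'b', 'c'}],                             # CID 2
    [{'a', 'b'}, {'b', 'c', 'd'}],                        # CID 3
]

sa = [{'a'}, {'b', 'c'}]
print(support(sa, database))  # 2, i.e. relative support 2/3 ~ 0.67
```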
Introduction
Contributions of SPAM:
- A novel depth-first search strategy that integrates a depth-first traversal of the search space with effective pruning mechanisms.
- A vertical bitmap representation of the database with efficient support counting.
- SPAM outperforms previous work by up to an order of magnitude.
The SPAM Algorithm: The Lexicographic Sequence Tree
Consider all sequences arranged in a sequence tree with the following structure: the root of the tree is labeled with null ({}).
Assume that there is a lexicographical ordering <= of the items I in the database. If item i occurs before item j in the ordering, then we denote this by i <=_I j.
This ordering can be extended to sequences by defining sa <= sb if sa is a subsequence of sb. If sa is not a subsequence of sb, then there is no relationship in this ordering.
The SPAM Algorithm: The Lexicographic Sequence Tree
Each sequence in the sequence tree can be considered as either a sequence-extended sequence (S-step) or an itemset-extended sequence (I-step).
[Figure: lexicographic sequence tree for I = {a, b}. The root {} has child ({a}); the S-step children of ({a}) are ({a}, {a}) and ({a}, {b}), its I-step child is ({a, b}); ({a}, {a}) in turn has S-step children ({a}, {a}, {a}) and ({a}, {a}, {b}) and I-step child ({a}, {a, b}); and so on.]
The SPAM Algorithm: The Lexicographic Sequence Tree
For example, let Sa = ({a, b, c}, {a, b}).
S-step (a sequence-extended sequence of Sa): ({a, b, c}, {a, b}) -> ({a, b, c}, {a, b}, {d}), ...
I-step (an itemset-extended sequence of Sa): ({a, b, c}, {a, b}) -> ({a, b, c}, {a, b, d})
The SPAM Algorithm: Depth First Tree Traversal
The tree is traversed in a standard depth-first manner.
If sup_D(S) >= minSup, store S and continue the DFS below S.
If sup_D(S) < minSup, there is no need to continue the DFS below S: by the Apriori principle, no extension of an infrequent sequence can be frequent.
The search space is huge, so pruning is essential.
The SPAM Algorithm: S-step Pruning
Suppose we are at node ({a}) in the tree, with S({a}) = {a, b, c, d} and I({a}) = {b, c, d}, and suppose that ({a}, {c}) and ({a}, {d}) are not frequent.
By the Apriori principle, the following are not frequent either:
({a}, {a}, {c}), ({a}, {b}, {c}), ({a}, {a, c}), ({a}, {b, c}), ({a}, {a}, {d}), ({a}, {b}, {d}), ({a}, {a, d}), ({a}, {b, d})
Hence, when we are at node ({a}, {a}) or ({a}, {b}), we do not have to perform the I-step or S-step using items c and d, i.e.:
S({a}, {a}) = S({a}, {b}) = {a, b}, I({a}, {a}) = {b}, and I({a}, {b}) = null.
The SPAM Algorithm: I-step Pruning
Consider the same node ({a}) described in the previous section. The possible itemset-extended sequences are ({a, b}), ({a, c}), and ({a, d}).
If ({a, c}) is not frequent, then ({a, b, c}) must also not be frequent by the Apriori principle.
Hence, I({a, b}) = {d}, S({a, b}) = {a, b}, and S({a, d}) = {a, b}.
Pseudocode for DFS with pruning
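A hedged Python reconstruction of the paper's DFS-Pruning procedure, reusing the `support` helper sketched earlier (SPAM itself performs this frequency test with the bitmap ANDs described in the next section, not by rescanning sequences). Each candidate S-step and I-step extension is tested once; only the frequent ones (s_temp, i_temp) are passed down to the children, which is exactly the pruning of the previous two slides.

```python
def dfs_pruning(node, s_candidates, i_candidates, database, min_sup, results):
    """node: the current sequence, a list of itemsets (sets of items).
    s_candidates: items considered for sequence extensions (S-steps).
    i_candidates: items considered for itemset extensions (I-steps)."""
    # Keep only the S-step extensions that are frequent; items failing the
    # test disappear from the candidate sets of every child (Apriori pruning).
    s_temp = [i for i in s_candidates
              if support(node + [{i}], database) >= min_sup]
    for i in s_temp:
        child = node + [{i}]
        results.append(child)
        dfs_pruning(child, s_temp, [j for j in s_temp if j > i],
                    database, min_sup, results)

    # Same test for the I-step extensions of this node.
    i_temp = [i for i in i_candidates
              if support(node[:-1] + [node[-1] | {i}], database) >= min_sup]
    for i in i_temp:
        child = node[:-1] + [node[-1] | {i}]
        results.append(child)
        dfs_pruning(child, s_temp, [j for j in i_temp if j > i],
                    database, min_sup, results)
```

Mining starts by calling `dfs_pruning([{i}], F1, [j for j in F1 if j > i], database, min_sup, results)` for every frequent item i, where F1 is the lexicographically sorted list of frequent items.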
Data Representation: Data Structure
Our algorithm uses a vertical bitmap representation of the data. A vertical bitmap is created for each item in the dataset, and each bitmap has a bit corresponding to each transaction in the dataset.
In the bitmap of a sequence, a bit is set to 1 if the transaction it represents contains the last itemset of the sequence and earlier transactions of the same customer contain all previous itemsets (i.e. the customer contains the sequence of itemsets).
If item i appears in transaction j, then the bit corresponding to transaction j of the bitmap for item i is set to one; otherwise, the bit is set to zero.
If the size of a sequence (the number of transactions of the customer) is between 2^k + 1 and 2^(k+1), we consider it as a 2^(k+1)-bit sequence. Example: if the size of a sequence is 3, then k = 1 and a 2^(1+1) = 4-bit slice is used.
Data Representation: Data Structure
Bitmaps for the dataset sorted by CID and TID (each customer is padded to a 4-bit slice; "-" marks padding bits):

CID | TID | {a} | {b} | {c} | {d}
1   | 1   |  1  |  1  |  0  |  1
1   | 3   |  0  |  1  |  1  |  1
1   | 6   |  0  |  1  |  1  |  1
-   | -   |  0  |  0  |  0  |  0
2   | 2   |  0  |  1  |  0  |  0
2   | 4   |  1  |  1  |  1  |  0
-   | -   |  0  |  0  |  0  |  0
-   | -   |  0  |  0  |  0  |  0
3   | 5   |  1  |  1  |  0  |  0
3   | 7   |  0  |  1  |  1  |  1
-   | -   |  0  |  0  |  0  |  0
-   | -   |  0  |  0  |  0  |  0
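A minimal sketch of how these vertical bitmaps could be built from Table 1. The layout (one Python integer per item, one bit per transaction slot, every customer padded to a 4-bit slice as in the figure above) is our illustrative choice, not the paper's actual in-memory structure.

```python
from collections import defaultdict

# Table 1 rows, already sorted by CID and TID: (CID, TID, itemset).
rows = [
    (1, 1, {'a', 'b', 'd'}), (1, 3, {'b', 'c', 'd'}), (1, 6, {'b', 'c', 'd'}),
    (2, 2, {'b'}),           (2, 4, {'a', 'b', 'c'}),
    (3, 5, {'a', 'b'}),      (3, 7, {'b', 'c', 'd'}),
]

SLICE = 4  # every customer here has at most 4 transactions -> 4-bit slices

# Assign each (CID, TID) a global bit position, customer by customer.
tids = defaultdict(list)
for cid, tid, _ in rows:
    tids[cid].append(tid)
positions = {}
for slice_index, cid in enumerate(sorted(tids)):
    for offset, tid in enumerate(tids[cid]):
        positions[(cid, tid)] = slice_index * SLICE + offset

# One bitmap per item: bit j is 1 iff transaction slot j contains the item.
bitmaps = defaultdict(int)
for cid, tid, itemset in rows:
    for item in itemset:
        bitmaps[item] |= 1 << positions[(cid, tid)]

for item in 'abcd':
    bits = format(bitmaps[item], '012b')[::-1]  # LSB (first transaction) first
    print(item, ' '.join(bits[k:k + SLICE] for k in range(0, 12, SLICE)))
# a 1000 0100 1000
# b 1110 1100 1100
# c 0110 0100 0100
# d 1110 0000 0100
```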
Data Representation: Candidate Generation
The S-step requires that we first transform the current sequence's bitmap: within each customer's bit slice, the first 1 and every bit before it are set to 0, and all bits after it are set to 1, to indicate that the new item can only come in a transaction after the last transaction in the current sequence. The transformed bitmap is then ANDed with the appended item's bitmap.
For the I-step, only the AND is necessary, since the appended item occurs within the same transaction as the last itemset in the current sequence.
S-step Process & S-step result process 1 0 0 0 0 1 0 0 0 1 1 1 0 0 1 1 0 1 1 0 ({a})s ({a}) {b} ({a},{b}) S-step process & result ({a, b}, {b, c, d}) 3 ({b}, {a, b, c}) 2 ({a, b, d}, {b, c, d}, {b, c, d}) 1 Sequence CID
I-step Process & result 0 1 0 0 0 0 0 0 1 1 1 0 0 1 1 0 {d} ({a},{b, d}) & result ({a},{b}) ({a, b}, {b, c, d}) 3 ({b}, {a, b, c}) 2 ({a, b, d}, {b, c, d}, {b, c, d}) 1 Sequence CID
Experimental Evaluation: Synthetic Data Generation
Datasets were generated using the IBM AssocGen program.
We also compared the performance of the algorithms as the minimum support was varied for several datasets of different sizes.
Experimental Evaluation: Comparison With SPADE and PrefixSpan
We compared the three algorithms on several small, medium, and large datasets for various minimum support values.
This set of tests shows that SPAM outperforms SPADE by about a factor of 2.5 on small datasets and by more than an order of magnitude on reasonably large datasets.
PrefixSpan outperforms SPAM slightly on very small datasets, but on large datasets SPAM outperforms PrefixSpan by over an order of magnitude.
Sample running time graph Figure 6: Varying support for small dataset #1 Figure 7: Varying support for small dataset #2
Sample running time graph Figure 8: Varying support for medium-sized dataset #1 Figure 9: Varying support for medium-sized dataset #2
Sample running time graph Figure 10: Varying support for large dataset #1 Figure 11: Varying support for large dataset #2
Experimental Evaluation: Comparison With SPADE and PrefixSpan
SPAM performs very well on large datasets, since its support counting reduces to fast bitwise operations.
PrefixSpan runs slightly faster on small datasets.
Overall, SPAM excels at finding the frequent sequences for many different types of large datasets.
Figure 12: Varying number of customers Figure 13: Varying number of transactions per customer
Figure 14: Varying number of items per transaction Figure 15: Varying average length of maximal sequences
Figure 16: Varying average length of transactions within maximal sequences Figure 17: Varying support with large number of customers in dataset
Experimental Evaluation: Consideration of Space Requirements
Because SPAM uses a depth-first traversal of the search space, it is quite space-inefficient in comparison to SPADE.
Using the vertical bitmap representation to store transactional data is inefficient when an item is not present in a transaction, since a zero must still be stored to represent that fact.
The representation of the data that SPADE uses is more efficient in this respect.
Experimental Evaluation: Consideration of Space Requirements
SPAM is less space-efficient than SPADE as long as 16T < N, where T is the total number of transactions and N is the total number of items across all of the transactions.
Thus the choice between SPADE and SPAM is clearly a space-time tradeoff.
Conclusion
We presented an algorithm to quickly find all frequent sequences in a list of transactions.
The algorithm utilizes a depth-first traversal of the search space combined with a vertical bitmap representation to store each sequence.
Experimental results demonstrated that our algorithm outperforms SPADE and PrefixSpan on large datasets.
Discussion
Strongest part of this paper:
- SPAM outperforms other algorithms on large datasets.
Weak points of this paper:
- Space efficiency may not be good.
- All data must be loaded into main memory.
Possible improvement:
- Use a linked list to improve the space efficiency of the vertical bitmap.
Possible extensions & applications:
- Forecasting in the stock market or financial investment.
Sequential PAttern Mining using A Bitmap Representation Q&A Sequential PAttern Mining using A Bitmap Representation Presenter 0259636 林哲存 0259639 林庭宇 0159638 徐敏容