Sequential PAttern Mining using A Bitmap Representation


1 Sequential PAttern Mining using A Bitmap Representation
Jay Ayres, Johannes Gehrke, Tomi Yiu and Jason Flannick, Dept. of Computer Science, Cornell University. Presenters: 林哲存, 林庭宇, 徐敏容

2 Outline
Introduction
The SPAM Algorithm
  Lexicographic Tree for Sequences
  Depth First Tree Traversal
  Pruning
Data Representation
  Data Structure
  Candidate Generation
Experimental Evaluation
  Synthetic data generation
  Comparison with SPADE and PrefixSpan
  Consideration of space requirements
Conclusion

3 Introduction
Let I = {i1, i2, ..., in} be a set of items.
A subset X ⊆ I is called an itemset; |X| is the size of X.
A sequence s = (s1, s2, ..., sm) is an ordered list of itemsets, where si ⊆ I for i ∈ {1, ..., m}.
The length l of a sequence s = (s1, s2, ..., sm) is the total number of items it contains: l = |s1| + |s2| + ... + |sm|.

4 Introduction
Sa: a sequence, e.g. Sa = ({a}, {b, c}).
sup_D(Sa): the support of Sa in database D, i.e. the number of customer sequences in D that contain Sa. For Sa = ({a}, {b, c}) and the example database on the next slide, sup_D(Sa) = 2.
Given a support threshold minSup, a sequence Sa is called a frequent sequential pattern on D if sup_D(Sa) >= minSup.

5 Introduction
Consider the sequence of customer 2:
- the size of this sequence is 2 (two itemsets).
- the length of this sequence is 4 (four items in total).
For Sa = ({a}, {b, c}), the support of Sa is 2, or 0.67. If minSup <= 0.67, then Sa is deemed frequent.

Table 1: Dataset sorted by CID and TID
CID  TID  Itemset
1    1    {a, b, d}
1    3    {b, c, d}
1    6    {b, c, d}
2    2    {b}
2    4    {a, b, c}
3    5    {a, b}
3    7    {b, c, d}

Table 2: Sequence for each customer
CID  Sequence
1    ({a, b, d}, {b, c, d}, {b, c, d})
2    ({b}, {a, b, c})
3    ({a, b}, {b, c, d})
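To make the support computation concrete, here is a minimal Python sketch (not from the slides; the helper names are mine) of the naive containment test and support count over Table 2:

```python
def contains(cust_seq, pattern):
    """True if pattern (a list of itemsets) is contained in cust_seq:
    each itemset of the pattern must be a subset of some transaction,
    in order, with later itemsets matching strictly later transactions."""
    pos = 0
    for itemset in pattern:
        while pos < len(cust_seq) and not itemset <= cust_seq[pos]:
            pos += 1
        if pos == len(cust_seq):
            return False
        pos += 1  # next itemset must match a strictly later transaction
    return True

def support(db, pattern):
    """Number of customer sequences in db that contain pattern."""
    return sum(contains(seq, pattern) for seq in db)

# Table 2 from the slides, itemsets as Python sets
db = [
    [{'a', 'b', 'd'}, {'b', 'c', 'd'}, {'b', 'c', 'd'}],  # CID 1
    [{'b'}, {'a', 'b', 'c'}],                              # CID 2
    [{'a', 'b'}, {'b', 'c', 'd'}],                         # CID 3
]
print(support(db, [{'a'}, {'b', 'c'}]))  # 2
```

SPAM never scans sequences like this at mining time; it obtains the same counts from bitmap operations (slides 14-19). The naive version only pins down the semantics.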

6 Introduction
Contributions of SPAM:
- A novel depth-first search strategy that integrates a depth-first traversal of the search space with effective pruning mechanisms.
- A vertical bitmap representation of the database with efficient support counting.
- SPAM outperforms previous work by up to an order of magnitude.

7 The SPAM Algorithm
The Lexicographic Sequence Tree
Consider all sequences arranged in a sequence tree with the following structure: the root of the tree is labeled with {} (the null sequence). The example tree on the next slide uses I = {a, b}.
Assume that there is a lexicographical ordering of the items in the database. If item i occurs before item j in the ordering, we denote this by i <=_I j.
This ordering can be extended to sequences by defining sa <= sb if sa is a subsequence of sb. If sa is not a subsequence of sb, the two sequences are incomparable in this ordering.

8 The SPAM Algorithm
The Lexicographic Sequence Tree
Each sequence in the sequence tree can be considered as either a sequence-extended sequence (S-step) or an itemset-extended sequence (I-step).
[Figure: lexicographic sequence tree for I = {a, b}. From the root {}, node (a) has S-step children (a, a) and (a, b) and I-step child ((a, b)); node (a, a) in turn has S-step children (a, a, a) and (a, a, b) and I-step child (a, (a, b)).]

9 The SPAM Algorithm
The Lexicographic Sequence Tree
For example, let Sa = ({a, b, c}, {a, b}).
S-step (a sequence-extended sequence of Sa): ({a, b, c}, {a, b}) -> ({a, b, c}, {a, b}, {d}), ...
I-step (an itemset-extended sequence of Sa): ({a, b, c}, {a, b}) -> ({a, b, c}, {a, b, d})

10 The SPAM Algorithm
Depth First Tree Traversal
The tree is traversed in a standard depth-first manner:
- If sup_D(S) >= minSup, store S and repeat the DFS on S's children.
- If sup_D(S) < minSup, there is no need to recurse on S: by the Apriori principle, none of its descendants can be frequent.
The search space is huge, so pruning is essential.

11 The SPAM Algorithm
S-step Pruning by the Apriori principle
Suppose we are at node ({a}) in the tree, with S({a}) = {a, b, c, d} and I({a}) = {b, c, d}, and suppose that ({a}, {c}) and ({a}, {d}) are not frequent.
By the Apriori principle, ({a}, {a}, {c}), ({a}, {b}, {c}), ({a}, {a, c}), ({a}, {b, c}), ({a}, {a}, {d}), ({a}, {b}, {d}), ({a}, {a, d}), and ({a}, {b, d}) are not frequent either.
Hence, when we are at node ({a}, {a}) or ({a}, {b}), we do not have to perform an I-step or S-step using items c and d, i.e. S({a}, {a}) = S({a}, {b}) = {a, b}, I({a}, {a}) = {b}, and I({a}, {b}) is empty.

12 The SPAM Algorithm
I-step Pruning
Consider the same node ({a}) from the previous slide. The possible itemset-extended sequences are ({a, b}), ({a, c}), and ({a, d}).
If ({a, c}) is not frequent, then ({a, b, c}) must also not be frequent by the Apriori principle.
Hence, I({a, b}) = {d}, S({a, b}) = {a, b}, and S({a, d}) = {a, b}.

13 Pseudocode for DFS with pruning
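This slide's content is an image of the paper's DFS-Pruning pseudocode, which did not survive extraction. Below is a hedged Python transcription of that recursion as described on the two previous slides; `is_frequent` stands in for the bitmap-based support test, and all names are mine:

```python
def dfs_pruning(node, s_candidates, i_candidates, is_frequent, output):
    """DFS over the lexicographic sequence tree with S-step/I-step pruning.

    node         -- current frequent sequence, a list of frozensets
    s_candidates -- items to try as sequence extensions (S-step)
    i_candidates -- items to try as itemset extensions (I-step)
    is_frequent  -- predicate: sup_D(seq) >= minSup (bitmap-based in SPAM)
    output       -- list collecting every frequent sequence found
    """
    output.append(node)
    # S-step pruning: keep only items whose sequence extension is frequent.
    s_temp = [i for i in s_candidates
              if is_frequent(node + [frozenset({i})])]
    for i in s_temp:
        dfs_pruning(node + [frozenset({i})],
                    s_temp,                        # pruned S-candidates
                    [j for j in s_temp if j > i],  # I-candidates after i
                    is_frequent, output)
    # I-step pruning: keep only items whose itemset extension is frequent.
    i_temp = [i for i in i_candidates
              if is_frequent(node[:-1] + [node[-1] | frozenset({i})])]
    for i in i_temp:
        dfs_pruning(node[:-1] + [node[-1] | frozenset({i})],
                    s_temp,
                    [j for j in i_temp if j > i],
                    is_frequent, output)
```

Mining would start by calling dfs_pruning([frozenset({i})], F, [j for j in F if j > i], ...) for every frequent item i, where F is the list of frequent items.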

14 Data Representation
Data Structure
Our algorithm uses a vertical bitmap representation of the data. A vertical bitmap is created for each item in the dataset, and each bitmap has a bit corresponding to each transaction in the dataset.
In the bitmap of a sequence, a bit is set to 1 if the transaction it represents contains the last itemset of the sequence and previous transactions of the same customer contain all previous itemsets of the sequence (i.e. the customer contains the sequence of itemsets).

15 If item i appears in transaction j, then the bit corresponding to transaction j of the bitmap for item i is set to one; otherwise, the bit is set to zero.
If the size of a customer's sequence (its number of transactions) is between 2^k + 1 and 2^(k+1), we consider it as a 2^(k+1)-bit sequence. For example, if the size of a sequence is 3, then k = 1 and it is stored as a 2^(1+1) = 4-bit sequence.
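As a small illustration of this rule (the function name is mine), the padded width is simply the transaction count rounded up to the next power of two:

```python
def padded_bits(num_transactions: int) -> int:
    """Round up to the next power of two: 3 -> 4, 4 -> 4, 5 -> 8."""
    width = 1
    while width < num_transactions:
        width *= 2
    return width
```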

16 Data Representation
Data Structure
Dataset sorted by CID and TID, with the vertical bitmaps for items {a}, {c}, and {d}:

CID  TID  Itemset      {a}  {c}  {d}
1    1    {a, b, d}     1    0    1
1    3    {b, c, d}     0    1    1
1    6    {b, c, d}     0    1    1
2    2    {b}           0    0    0
2    4    {a, b, c}     1    1    0
3    5    {a, b}        1    0    0
3    7    {b, c, d}     0    1    1
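A minimal sketch of building these vertical bitmaps from Table 1, reusing padded_bits from the previous slide (all other names are mine). Each item maps to a list of per-customer bit sections, padded with trailing zeros to a power of two:

```python
from collections import defaultdict

# Table 1: (CID, TID, itemset), already sorted by CID and TID.
rows = [
    (1, 1, {'a', 'b', 'd'}), (1, 3, {'b', 'c', 'd'}), (1, 6, {'b', 'c', 'd'}),
    (2, 2, {'b'}), (2, 4, {'a', 'b', 'c'}),
    (3, 5, {'a', 'b'}), (3, 7, {'b', 'c', 'd'}),
]

def build_bitmaps(rows):
    """Return {item: list of per-customer bit lists}; a bit is 1 iff the
    item occurs in that transaction, sections padded with trailing zeros."""
    customers = defaultdict(list)          # CID -> itemsets in TID order
    for cid, _tid, itemset in rows:
        customers[cid].append(itemset)
    items = sorted({i for _, _, t in rows for i in t})
    bitmaps = {item: [] for item in items}
    for cid in sorted(customers):
        txns = customers[cid]
        pad = padded_bits(len(txns)) - len(txns)
        for item in items:
            bits = [1 if item in t else 0 for t in txns] + [0] * pad
            bitmaps[item].append(bits)
    return bitmaps

bitmaps = build_bitmaps(rows)
print(bitmaps['a'])   # [[1, 0, 0, 0], [0, 1], [1, 0]]
```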

17 Data Representation (Candidate Generation)
S-step process: transform the current sequence's bitmap by setting the first 1 in each customer's bit section to 0 and all bits after it to 1, then AND the result with the item's bitmap. This indicates that the new itemset can only come in a transaction strictly after the last transaction of the current sequence.
I-step: only the AND is necessary, since the appended item must occur within the same transaction as the last itemset of the current sequence.
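A sketch of both steps on the per-customer bit sections built above (names are mine; a real implementation would do this with bitwise operations on machine words):

```python
def s_transform(section):
    """S-step transform of one customer's section: zero everything up to
    and including the first set bit, set every later bit to 1."""
    out = [0] * len(section)
    if 1 in section:
        for j in range(section.index(1) + 1, len(section)):
            out[j] = 1
    return out

def bitmap_and(a, b):
    """Bitwise AND of two bitmaps (lists of per-customer sections)."""
    return [[x & y for x, y in zip(sa, sb)] for sa, sb in zip(a, b)]

def s_step(seq_bitmap, item_bitmap):
    """Sequence extension: transform, then AND with the item's bitmap."""
    return bitmap_and([s_transform(s) for s in seq_bitmap], item_bitmap)

def i_step(seq_bitmap, item_bitmap):
    """Itemset extension: a plain AND suffices."""
    return bitmap_and(seq_bitmap, item_bitmap)

def support_of(bitmap):
    """A customer supports the sequence iff its section has any 1 bit."""
    return sum(1 in section for section in bitmap)

ab = s_step(bitmaps['a'], bitmaps['b'])   # ({a}) -> ({a}, {b})
print(support_of(ab))                      # 2, matching slide 18
abd = i_step(ab, bitmaps['d'])             # ({a}, {b}) -> ({a}, {b, d})
print(support_of(abd))                     # 2, matching slide 19
```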

18 S-step process & result
Example: ({a}) extended by S-step with item b gives ({a}, {b}).

CID  TID  ({a})  ({a})_s  {b}  ({a}, {b})
1    1     1       0       1      0
1    3     0       1       1      1
1    6     0       1       1      1
2    2     0       0       1      0
2    4     1       0       1      0
3    5     1       0       1      0
3    7     0       1       1      1

(({a})_s is the transformed bitmap; ANDing it with {b} yields ({a}, {b}). Customer sequences are as in Table 2.)

19 I-step process & result
Example: ({a}, {b}) extended by I-step with item d gives ({a}, {b, d}).

CID  TID  ({a}, {b})  {d}  ({a}, {b, d})
1    1     0           1      0
1    3     1           1      1
1    6     1           1      1
2    2     0           0      0
2    4     0           0      0
3    5     0           0      0
3    7     1           1      1

(Only an AND is needed. Customer sequences are as in Table 2.)

20 Experimental Evaluation
Synthetic data generation
Datasets were generated using the IBM AssocGen program. We also compared the performance of the algorithms as the minimum support was varied for several datasets of different sizes.

21 Experimental Evaluation
Comparison with SPADE and PrefixSpan
We compared the three algorithms on several small, medium, and large datasets for various minimum support values. These tests show that SPAM outperforms SPADE by about a factor of 2.5 on small datasets and by more than an order of magnitude on reasonably large datasets. PrefixSpan outperforms SPAM slightly on very small datasets, but on large datasets SPAM outperforms PrefixSpan by over an order of magnitude.

22 Sample running time graph
Figure 6: Varying support for small dataset #1 Figure 7: Varying support for small dataset #2

23 Sample running time graph
Figure 8: Varying support for medium-sized dataset #1 Figure 9: Varying support for medium-sized dataset #2

24 Sample running time graph
Figure 10: Varying support for large dataset #1 Figure 11: Varying support for large dataset #2

25 Experimental Evaluation
Comparison with SPADE and PrefixSpan
SPAM performs especially well on large datasets, while PrefixSpan runs slightly faster on small datasets. Overall, SPAM excels at finding the frequent sequences in many different types of large datasets.

26 Figure 12: Varying number of customers
Figure 13: Varying number of transactions per customer

27 Figure 14: Varying number of items per transaction
Figure 15: Varying average length of maximal sequences

28 Figure 16: Varying average length of transactions within maximal sequences
Figure 17: Varying support with large number of customers in dataset

29 Experimental Evaluation
Consideration of space requirements
Because SPAM uses a depth-first traversal of the search space, it is quite space-inefficient in comparison to SPADE.
The vertical bitmap representation stores a zero for every transaction in which an item is not present, which is wasteful; the data representation that SPADE uses is more space-efficient.

30 Experimental Evaluation
Consideration of space requirements
SPAM is less space-efficient than SPADE as long as 16T < N, where N is the total number of items across all of the transactions. Thus the choice between SPADE and SPAM is clearly a space-time trade-off.

31 Conclusion
We presented an algorithm that quickly finds all frequent sequences in a list of transactions. The algorithm uses a depth-first traversal of the search space combined with a vertical bitmap representation to store each sequence. Experimental results demonstrate that our algorithm outperforms SPADE and PrefixSpan on large datasets.

32 Discussion
Strongest part of this paper:
- SPAM outperforms other algorithms on large datasets.
Weak points of this paper:
- Space utilization may not be good; all data must be loaded into main memory.
Possible improvement:
- Use a linked list to improve the space utilization of the vertical bitmaps.
Possible extensions & applications:
- Forecasting in the stock market or financial investment.

33 Q&A
Sequential PAttern Mining using A Bitmap Representation
Presenters: 林哲存, 林庭宇, 徐敏容

