Sequential PAttern Mining using A Bitmap Representation
  Jay Ayres, Johannes Gehrke, Tomi Yiu, and Jason Flannick, Dept. of Computer Science, Cornell University (SIGKDD 2002)
  2014/11/20
  Presenters: 李佩書 P76034525, 楊璨瑜 P76034672, 陳奕廷 P78031125, 李昕純 Q56034035
Outline
  Introduction
  The SPAM algorithm
  Data representation
  Experimental results
  Conclusion & Discussion
Introduction
Sequential Patterns
  R. Agrawal and R. Srikant (ICDE 1995)
  Algorithms: AprioriAll, AprioriSome, PrefixSpan…
Problem: mining sequential patterns
  Given a minimum support minSup, find all frequent sequential patterns Sa, i.e. all Sa with supD(Sa) ≥ minSup, where supD(Sa) is the support of Sa in database D
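The support count in this problem statement can be sketched in a few lines of Python. This is a minimal illustration, assuming that supD(Sa) counts the customers whose transaction sequence contains Sa as a subsequence. The names `contains` and `support` and the toy database are hypothetical and do not come from the slides.

```python
from typing import FrozenSet, Sequence

Itemset = FrozenSet[str]


def contains(customer: Sequence[Itemset], pattern: Sequence[Itemset]) -> bool:
    """Return True if `pattern` occurs in `customer` as a subsequence.

    Each itemset of the pattern must be a subset of some transaction, and the
    matched transactions must appear in strictly increasing order.
    """
    pos = 0
    for itemset in pattern:
        # Greedily take the earliest transaction that contains this itemset.
        while pos < len(customer) and not itemset <= customer[pos]:
            pos += 1
        if pos == len(customer):
            return False
        pos += 1  # the next itemset must match a later transaction
    return True


def support(db: Sequence[Sequence[Itemset]], pattern: Sequence[Itemset]) -> int:
    """Number of customers whose sequence contains `pattern`."""
    return sum(contains(customer, pattern) for customer in db)


# Hypothetical toy database: one list of transactions per customer.
toy_db = [
    [frozenset("abd"), frozenset("bcd"), frozenset("bcd")],
    [frozenset("b"), frozenset("abc")],
    [frozenset("ab"), frozenset("bcd")],
]

pattern = [frozenset("a"), frozenset("b")]
print(support(toy_db, pattern))  # prints 2
```

With minSup = 2, the pattern ({a}, {b}) would count as frequent in this toy database, while ({a}, {a}) would not.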
SPAM Algorithm (Sequential PAttern Mining)
  The first depth-first search (DFS) strategy for mining sequential patterns
  Vertical bitmap representation for simple, efficient counting
The SPAM Algorithm
Lexicographic Tree
  Sequence-extended sequence (S-step): generated by adding a new transaction consisting of a single item to the end of the sequence
    Ex: ({a, b, c}, {a, b}) → ({a, b, c}, {a, b}, {a})
  Itemset-extended sequence (I-step): generated by adding an item to the last itemset in the sequence
    Ex 1: ({a, b, c}, {a, b}) → ({a, b, c}, {a, b, d})
    Ex 2: ({a, b, c}, {a, b, d}) → ({a, b, c}, {a, b, d, c})
  Each node n is associated with two sets
    Sn: the set of candidate items for S-step extensions
    In: the set of candidate items for I-step extensions
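The two extension steps can be written as small pure functions. This is only a sketch: it assumes a sequence is modelled as a tuple of frozensets, and the names `s_step` and `i_step` are chosen here for illustration.

```python
from typing import FrozenSet, Tuple

Seq = Tuple[FrozenSet[str], ...]


def s_step(seq: Seq, item: str) -> Seq:
    """Sequence extension: append a new transaction holding a single item."""
    return seq + (frozenset({item}),)


def i_step(seq: Seq, item: str) -> Seq:
    """Itemset extension: add an item to the last itemset of the sequence."""
    return seq[:-1] + (seq[-1] | {item},)


# The examples from the slide.
sa = (frozenset("abc"), frozenset("ab"))
print(s_step(sa, "a") == (frozenset("abc"), frozenset("ab"), frozenset("a")))
print(i_step(sa, "d") == (frozenset("abc"), frozenset("abd")))
```

Both functions return a new tuple and leave the parent sequence untouched, which keeps a depth-first traversal of the tree simple to reason about.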
Figure: lexicographic tree for I = {a, b}
Pruning
  Apriori-based
  Minimizes the size of Sn and In
  Prunes candidates during the DFS
  Two techniques: S-step pruning and I-step pruning
S-step Pruning
  S({a}) = {a, b, c, d}, I({a}) = {b, c, d}
  S({a}, {a}) = S({a}, {b}) = {a, b, c, d}
  I({a}, {a}) = {b, c, d}
  I({a}, {b}) = {c, d}
I-step Pruning
  S({a, b}) = S({a, d}) = {a, b}
  I({a}, {b}) = {c, d}
  I({a}, {d}) = {}
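The way the candidate sets shrink during the depth-first traversal can be sketched as follows. This is a hedged illustration rather than the presenters' implementation: support is counted naively instead of with bitmaps, and the names `dfs`, `mine`, and the toy database are hypothetical. At each node, only items whose extension is frequent are passed down to the children as Sn and In.

```python
def contains(customer, pattern):
    """True if `pattern` is a subsequence of the customer's transactions."""
    pos = 0
    for itemset in pattern:
        while pos < len(customer) and not itemset <= customer[pos]:
            pos += 1
        if pos == len(customer):
            return False
        pos += 1
    return True


def support(db, pattern):
    return sum(contains(customer, pattern) for customer in db)


def dfs(db, node, s_cands, i_cands, min_sup, out):
    """Visit `node`, then recurse using only the candidates that survive."""
    out.append(node)
    # S-step pruning: keep items whose sequence extension is frequent.
    s_temp = [i for i in s_cands
              if support(db, node + (frozenset({i}),)) >= min_sup]
    for i in s_temp:
        child = node + (frozenset({i}),)
        dfs(db, child, s_temp, [j for j in s_temp if j > i], min_sup, out)
    # I-step pruning: keep items whose itemset extension is frequent.
    i_temp = [i for i in i_cands
              if support(db, node[:-1] + (node[-1] | {i},)) >= min_sup]
    for i in i_temp:
        child = node[:-1] + (node[-1] | {i},)
        dfs(db, child, s_temp, [j for j in i_temp if j > i], min_sup, out)


def mine(db, min_sup):
    """Return all frequent sequences as tuples of frozensets."""
    items = sorted({x for customer in db for t in customer for x in t})
    frequent = [i for i in items if support(db, (frozenset({i}),)) >= min_sup]
    out = []
    for i in frequent:
        dfs(db, (frozenset({i}),), frequent,
            [j for j in frequent if j > i], min_sup, out)
    return out


# Hypothetical toy database: one list of transactions per customer.
toy_db = [
    [frozenset("abd"), frozenset("bcd"), frozenset("bcd")],
    [frozenset("b"), frozenset("abc")],
    [frozenset("ab"), frozenset("bcd")],
]
patterns = mine(toy_db, min_sup=2)
print(len(patterns), "frequent sequences found")
```

Because an item that fails at a node is never offered to that node's descendants, the number of support checks falls quickly as the search goes deeper.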
Data Representation
We store each candidate sequence as a vertical bitmap
  Each customer is assigned a fixed slice of each bitmap for all of its transactions
  If the size of a sequence is between 2^k + 1 and 2^(k+1), it is treated as a 2^(k+1)-bit sequence
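The slice-size rule amounts to rounding the number of transactions up to the next power of two. A minimal sketch, assuming no further lower or upper bound on the slice size (the function name `slice_bits` is hypothetical):

```python
def slice_bits(n_transactions: int) -> int:
    """Smallest power of two that is >= n_transactions.

    A customer with between 2^k + 1 and 2^(k+1) transactions gets a slice of
    2^(k+1) bits.
    """
    if n_transactions < 1:
        raise ValueError("a customer needs at least one transaction")
    return 1 << (n_transactions - 1).bit_length()


for n in (3, 4, 5, 8, 9):
    print(n, "->", slice_bits(n))
```

Fixed power-of-two slices let every bitmap for a given customer line up bit for bit, so a bitwise AND across bitmaps needs no per-customer offset bookkeeping.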
Bitmap of itemset
  B({a}) & B({b}) = B({a,b}): the bitmap of an itemset is the bitwise AND of the bitmaps of its items
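A short sketch of the itemset bitmap, assuming one bit per transaction stored as a Python list of 0/1 values. The transactions below are made up for illustration, and `item_bitmap` and `and_bitmaps` are hypothetical helper names.

```python
def item_bitmap(transactions, item):
    """One bit per transaction: 1 if the transaction contains `item`."""
    return [1 if item in t else 0 for t in transactions]


def and_bitmaps(x, y):
    """Bitwise AND of two bitmaps of equal length."""
    return [a & b for a, b in zip(x, y)]


# Hypothetical transactions of a single customer.
transactions = [{"a", "b"}, {"b"}, {"a"}, {"a", "b", "c"}]

b_a = item_bitmap(transactions, "a")   # [1, 0, 1, 1]
b_b = item_bitmap(transactions, "b")   # [1, 1, 0, 1]
b_ab = and_bitmaps(b_a, b_b)           # [1, 0, 0, 1]
print(b_ab)
```

A bit of the result is 1 exactly when the corresponding transaction contains both items.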
Bitmap of sequence
  Define B(s) as the bitmap for sequence s.
  Bit j of B(s) is set to 1 if the last itemset of s is in transaction j and the other itemsets of s are in transactions before j; otherwise it is set to 0.
  Example 1: sequence ({b},{c})
    Transaction ID   Itemset   B(({b},{c}))
    1                {b}       0
    2                {d}       0
    3                {e}       0
    4                {c}       1
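The definition of B(s) can be checked directly against Example 1. The sketch below builds the bitmap straight from the definition for one customer. It is not the incremental S-step/I-step construction, and the names `contains` and `sequence_bitmap` are hypothetical.

```python
def contains(transactions, pattern):
    """True if `pattern` is a subsequence of `transactions`."""
    pos = 0
    for itemset in pattern:
        while pos < len(transactions) and not itemset <= transactions[pos]:
            pos += 1
        if pos == len(transactions):
            return False
        pos += 1
    return True


def sequence_bitmap(transactions, pattern):
    """Bit j is 1 if the last itemset of `pattern` is in transaction j and
    the preceding itemsets occur, in order, in transactions before j."""
    *prefix, last = pattern
    return [
        1 if last <= t and contains(transactions[:j], prefix) else 0
        for j, t in enumerate(transactions)
    ]


# Example 1 from the slide: transactions 1..4 of one customer.
example1 = [{"b"}, {"d"}, {"e"}, {"c"}]
print(sequence_bitmap(example1, [{"b"}, {"c"}]))  # prints [0, 0, 0, 1]
```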
Example 2: sequence ({a},{b,d})
    Transaction ID   Itemset
    1                {a,b,d}
    3                {b,c,d}
    6                --
  The bit for transaction 3 is set to 1: {a} is in transaction 1 and {b,d} is in transaction 3.
S-step Process
  Step 1: apply the S-step transformation to B({a}) to construct the transformed bitmap B({a})s
  Step 2: AND B({a})s with B({b})
  Support = 2
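The two steps can be sketched on per-customer bit slices. This assumes the transformation described in the SPAM paper: within each customer's slice, every bit up to and including the first 1 is cleared and every later bit is set. The bitmap values are toy data for three customers, and the helper names are hypothetical.

```python
def s_transform(bits):
    """Clear bits up to and including the first 1, then set all later bits."""
    if 1 not in bits:
        return [0] * len(bits)
    k = bits.index(1)
    return [0] * (k + 1) + [1] * (len(bits) - k - 1)


def s_step_bitmap(seq_bitmap, item_bitmap):
    """Step 1: transform each customer slice. Step 2: AND with the item."""
    return [
        [x & y for x, y in zip(s_transform(seq_slice), item_slice)]
        for seq_slice, item_slice in zip(seq_bitmap, item_bitmap)
    ]


def support(bitmap):
    """Number of customers whose slice has at least one bit set."""
    return sum(1 for customer_slice in bitmap if any(customer_slice))


# Toy bitmaps: one 4-bit slice per customer.
B_a = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0]]
B_b = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 0, 0]]

B_ab = s_step_bitmap(B_a, B_b)
print(B_ab)           # prints [[0, 1, 1, 0], [0, 0, 0, 0], [0, 1, 0, 0]]
print(support(B_ab))  # prints 2
```

The transformed bitmap marks every transaction that comes strictly after the first occurrence of the prefix, so the AND keeps only occurrences of the new item that follow it.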
I-step Process
  Support = 2
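For the I-step, the SPAM paper ANDs the sequence bitmap with the bitmap of the added item, with no transformation beforehand. A minimal sketch under that assumption, using toy bitmaps for three customers and hypothetical helper names:

```python
def i_step_bitmap(seq_bitmap, item_bitmap):
    """AND each customer slice of the sequence with the item's slice."""
    return [
        [x & y for x, y in zip(seq_slice, item_slice)]
        for seq_slice, item_slice in zip(seq_bitmap, item_bitmap)
    ]


def support(bitmap):
    """Number of customers whose slice has at least one bit set."""
    return sum(1 for customer_slice in bitmap if any(customer_slice))


# Toy bitmaps: one 4-bit slice per customer.
B_a_b = [[0, 1, 1, 0], [0, 0, 0, 0], [0, 1, 0, 0]]   # sequence ({a},{b})
B_d = [[1, 1, 1, 0], [0, 0, 0, 0], [0, 1, 0, 0]]     # item d

B_a_bd = i_step_bitmap(B_a_b, B_d)                    # sequence ({a},{b,d})
print(B_a_bd)           # prints [[0, 1, 1, 0], [0, 0, 0, 0], [0, 1, 0, 0]]
print(support(B_a_bd))  # prints 2
```

No transformation is needed here because the new item must fall in the same transaction as the last itemset, which is exactly where the sequence bitmap already has its 1 bits.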
Experimental Results
Comparison With SPADE and PrefixSpan
  Method 1: compare for various minimum support values on
    Small datasets
    Medium datasets
    Large datasets
  Method 2: compare across several parameters of the dataset
    Number of customers
    Number of transactions per customer
    Number of items per transaction
    Average length of the maximal sequences
Conclusion & Discussion
Conclusion
  Algorithm
    Outperforms SPADE and PrefixSpan on large datasets
    Faster than SPADE and PrefixSpan
  Data representation
    Bitmap representation
    S-step/I-step traversal
    S-step/I-step pruning
    Especially efficient when the sequential patterns are very long
Implementing the SPAM algorithm
  SPMF is an open-source data mining framework written in Java
  http://www.philippe-fournier-viger.com/spmf/index.php
  Philippe Fournier-Viger, Antonio Gomariz, Ted Gueniche, Azadeh Soltani, Cheng-Wei Wu and Vincent S. Tseng, "SPMF: a Java Open-Source Pattern Mining Library," accepted and to appear in Journal of Machine Learning Research.
Discussion
  SPAM assumes that the entire database fits completely into main memory. What is the solution?
  Why is the size of a sequence set between 2^k + 1 and 2^(k+1)?