Sequential Pattern Mining Using A Bitmap Representation Authors: Jay Ayres, Johannes Gehrke, Tomi Yiu and Jason Flannick Source: The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, 2002.
Outline: Introduction; SPAM (Sequential PAttern Mining) algorithm; Lexicographic tree for sequences; Depth-first tree traversal; Pruning; S-step; I-step; Data representation (bitmap)
S = ({a}, {b, c}) is a sequence. The support of S, SupD(S), is the number of customers in database D whose sequence contains S. Frequent sequential pattern: SupD(S) >= Min Support. Example: SupD(S) = SupD(({a}, {b, c})) = 2
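The support definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the three-customer `database` is hypothetical data chosen so that the pattern ({a}, {b, c}) has support 2, and the helper names `contains` and `support` are assumptions.

```python
def contains(customer, pattern):
    """Return True if `pattern` (a list of sets) is a subsequence of
    `customer` (a list of transaction sets, in time order)."""
    pos = 0
    for itemset in pattern:
        # Skip ahead to the first remaining transaction holding this itemset.
        while pos < len(customer) and not itemset <= customer[pos]:
            pos += 1
        if pos == len(customer):
            return False
        pos += 1  # the next itemset must match a strictly later transaction
    return True


def support(database, pattern):
    """Number of customers whose sequence contains `pattern`."""
    return sum(1 for customer in database if contains(customer, pattern))


# Hypothetical database: one list of transactions per customer.
database = [
    [{"a", "b", "d"}, {"b", "c", "d"}, {"b", "c", "d"}],
    [{"b"}, {"a", "b", "c"}],
    [{"a", "b"}, {"b", "c", "d"}],
]

pattern = [{"a"}, {"b", "c"}]
print(support(database, pattern))  # 2
```

With a minimum support of 2, this pattern would count as frequent in the sample data.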
SPAM (Sequential PAttern Mining) S = ({a, b, c}, {a, b}) Sequence length (number of items): Length(S) = 5 Sequence size (number of itemsets): Size(S) = 2 Sequence-extended sequence: S’ = ({a, b, c}, {a, b}, {a}) Itemset-extended sequence: S’ = ({a, b, c}, {a, b, d})
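The two extension types can be sketched as small Python functions. Sequences are modelled here as lists of sets; the function names are assumptions, and the ordering check in `i_extend` reflects the usual convention that an itemset extension adds an item larger than every item already in the last itemset.

```python
def length(seq):
    """Total number of items in the sequence."""
    return sum(len(itemset) for itemset in seq)


def size(seq):
    """Number of itemsets in the sequence."""
    return len(seq)


def s_extend(seq, item):
    """Sequence extension: append a new itemset {item} at the end."""
    return seq + [{item}]


def i_extend(seq, item):
    """Itemset extension: add `item` to the last itemset."""
    if item <= max(seq[-1]):
        raise ValueError("item must be greater than all items in the last itemset")
    return seq[:-1] + [seq[-1] | {item}]


S = [{"a", "b", "c"}, {"a", "b"}]
print(length(S), size(S))   # 5 2
print(s_extend(S, "a"))     # three itemsets, the last one is {'a'}
print(i_extend(S, "d"))     # two itemsets, the last one is {'a', 'b', 'd'}
```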
SPAM (Sequential pattern mining) [Figure: lexicographic sequence tree, Max Size = 3, Items = {a, b}, Levels 1 to 6; edges are either sequence-extended or itemset-extended]
SPAM (Sequential pattern mining) [Figure: the same tree, Max Size = 3, Items = {a, b}, Levels 1 to 6]
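A depth-first walk over the lexicographic tree can be sketched as follows, with no pruning. The function name `dfs` and the representation (a sequence as a list of sorted item lists) are assumptions made for this illustration. Each node first generates its sequence-extended children, then its itemset-extended children.

```python
def dfs(seq, items, max_size, visit):
    """Visit `seq`, then recurse into its S-extensions and I-extensions."""
    visit(seq)
    # Sequence extensions: append a new single-item itemset.
    if len(seq) < max_size:
        for item in items:
            dfs(seq + [[item]], items, max_size, visit)
    # Itemset extensions: add a larger item to the last itemset.
    if seq:
        for item in items:
            if item > seq[-1][-1]:
                dfs(seq[:-1] + [seq[-1] + [item]], items, max_size, visit)


nodes = []
dfs([], ["a", "b"], 3, nodes.append)  # the empty list is the root

print(len(nodes) - 1)                           # 39 sequences below the root
print(max(sum(len(s) for s in n) for n in nodes))  # deepest level: 6
```

With items {a, b} and a maximum size of 3, the deepest node is ({a, b}, {a, b}, {a, b}) at level 6, which matches the six levels in the figure.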
SPAM (Sequential pattern mining) Pruning Items = {a, b, c, d}
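Candidate pruning during the depth-first search can be sketched as below. This is a simplified illustration under stated assumptions: it counts support by scanning lists of sets instead of using bitmaps, the `database` is hypothetical, and the names `mine`, `s_cands`, and `i_cands` are invented for the example. The idea is Apriori-style: an item whose extension is infrequent at a node is dropped from the candidate lists passed down to that node's children.

```python
def contains(customer, pattern):
    pos = 0
    for itemset in pattern:
        while pos < len(customer) and not itemset <= customer[pos]:
            pos += 1
        if pos == len(customer):
            return False
        pos += 1
    return True


def support(database, pattern):
    return sum(1 for customer in database if contains(customer, pattern))


def mine(database, min_sup):
    """Return all frequent sequences found by a pruned depth-first search."""
    frequent = []

    def dfs(seq, s_cands, i_cands):
        frequent.append(seq)
        # Keep only the candidates whose extension is still frequent.
        s_ok = [i for i in s_cands
                if support(database, seq + [{i}]) >= min_sup]
        i_ok = [i for i in i_cands
                if support(database, seq[:-1] + [seq[-1] | {i}]) >= min_sup]
        # Children inherit the pruned lists, not the full item set.
        for i in s_ok:
            dfs(seq + [{i}], s_ok, [j for j in s_ok if j > i])
        for i in i_ok:
            dfs(seq[:-1] + [seq[-1] | {i}], s_ok, [j for j in i_ok if j > i])

    items = sorted({i for c in database for t in c for i in t})
    freq_items = [i for i in items if support(database, [{i}]) >= min_sup]
    for i in freq_items:
        dfs([{i}], freq_items, [j for j in freq_items if j > i])
    return frequent


# Hypothetical database over the items {a, b, c, d}.
database = [
    [{"a", "b", "d"}, {"b", "c", "d"}, {"b", "c", "d"}],
    [{"b"}, {"a", "b", "c"}],
    [{"a", "b"}, {"b", "c", "d"}],
]
result = mine(database, 2)
print([{"a"}, {"b", "c"}] in result)  # True
```

In this sample, ({a}, {a}) is infrequent, so item a is removed from the S-candidates of every descendant of ({a}) and is never tested again in that subtree.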
Data Representation – BitMap 2^k + 1 <= 3 <= 2^(k+1), so k = 1: a customer sequence of size 3 is stored as a 2^(k+1) = 4-bit bitmap section
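Building a vertical bitmap per item can be sketched as follows. Each customer gets one section per item, with one bit per transaction, padded to a power-of-two width. The minimum width of 4 bits, the list-of-lists layout, the sample `database`, and the function names are assumptions of this sketch; a real implementation would pack the bits into machine words.

```python
def section_width(n):
    """Smallest power of two >= n, with an assumed minimum of 4 bits."""
    width = 4
    while width < n:
        width *= 2
    return width


def build_bitmaps(database):
    """Map each item to a list of per-customer bit sections."""
    items = sorted({i for customer in database for t in customer for i in t})
    bitmaps = {i: [] for i in items}
    for customer in database:
        width = section_width(len(customer))
        for i in items:
            # Bit j is 1 if the item occurs in the customer's j-th transaction.
            bits = [1 if i in t else 0 for t in customer]
            bits += [0] * (width - len(bits))  # pad the section with zeros
            bitmaps[i].append(bits)
    return bitmaps


# Hypothetical database: one list of transactions per customer.
database = [
    [{"a", "b", "d"}, {"b", "c", "d"}, {"b", "c", "d"}],
    [{"b"}, {"a", "b", "c"}],
    [{"a", "b"}, {"b", "c", "d"}],
]
bitmaps = build_bitmaps(database)
print(bitmaps["a"])  # [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0]]
```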
S-step (sequence extension) S = ({a}) S’ = ({a}, {b}) S’ = ({a}, {c}) …
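On bitmaps, the S-step can be sketched in two moves: transform the sequence bitmap so that, within each customer section, every bit after the first 1 is set and all earlier bits are cleared; then AND the result with the bitmap of the appended item. The bitmaps below are hypothetical 4-bit sections for three customers, and the function names are assumptions.

```python
def s_transform(section):
    """Clear bits up to and including the first 1, set every bit after it."""
    if 1 not in section:
        return [0] * len(section)
    k = section.index(1)
    return [0] * (k + 1) + [1] * (len(section) - k - 1)


def s_step(seq_bitmap, item_bitmap):
    """Bitmap of the sequence extended by a new itemset {item}."""
    return [
        [x & y for x, y in zip(s_transform(sec), item_sec)]
        for sec, item_sec in zip(seq_bitmap, item_bitmap)
    ]


# Hypothetical bitmaps: one 4-bit section per customer.
bitmap_a = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0]]  # S = ({a})
bitmap_b = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 0, 0]]  # item b

bitmap_a_then_b = s_step(bitmap_a, bitmap_b)            # S' = ({a}, {b})
print(bitmap_a_then_b)  # [[0, 1, 1, 0], [0, 0, 0, 0], [0, 1, 0, 0]]
```

The transform also sets the padding bits, but the item bitmap is zero there, so the AND clears them again.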
I-step (itemset extension) S = ({a}) S’ = ({a, b}) S’ = ({a, c}) …
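The I-step needs no transformation: the sequence bitmap is ANDed directly with the item bitmap, because the new item must occur in the same transaction as the last itemset. Support is then the number of customer sections that still contain a 1. The sketch below uses hypothetical 4-bit sections and assumed function names.

```python
def i_step(seq_bitmap, item_bitmap):
    """Bitmap of the sequence with `item` added to its last itemset."""
    return [
        [x & y for x, y in zip(sec, item_sec)]
        for sec, item_sec in zip(seq_bitmap, item_bitmap)
    ]


def support(bitmap):
    """Count customers whose section has at least one bit set."""
    return sum(1 for section in bitmap if any(section))


# Hypothetical bitmaps: one 4-bit section per customer.
bitmap_a_then_b = [[0, 1, 1, 0], [0, 0, 0, 0], [0, 1, 0, 0]]  # ({a}, {b})
bitmap_c = [[0, 1, 1, 0], [0, 1, 0, 0], [0, 1, 0, 0]]         # item c

bitmap_a_then_bc = i_step(bitmap_a_then_b, bitmap_c)           # ({a}, {b, c})
print(bitmap_a_then_bc)           # [[0, 1, 1, 0], [0, 0, 0, 0], [0, 1, 0, 0]]
print(support(bitmap_a_then_bc))  # 2
```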
Experiments and results Dataset parameters: D3 C2.5 T3 Algorithms compared: SPAM, SPADE, PrefixSpan
[Figures: SPAM, SPADE, and PrefixSpan compared on small databases and on medium databases]
[Figure: large database]
Conclusions: SPAM searches the lexicographic sequence tree depth-first, extending sequences with S-steps and I-steps. It is efficient on large databases but inefficient on small ones, and it is space-inefficient in comparison to SPADE.