Generalized Sequential Pattern Mining with Item Intervals. Yu Hirate, Hayato Yamana. PAKDD 2006
Outline: Introduction; Generalized Sequential Pattern Mining with Item Intervals (based on the PrefixSpan algorithm); Sequential Pattern Mining with Constraints on Large Protein Databases (based on the SPAM algorithm); Conclusion
Introduction Sequential pattern mining extracts patterns that appear more frequently than a user-specified minimum support while preserving the order in which their items occur. Existing sequential pattern mining algorithms (PrefixSpan, SPADE, SPAM, …) consider only the item occurrence order and do not consider the intervals between successive items. EX: the same ordered pattern with an interval of 1 year (not interesting) versus 1 day (interesting).
Introduction How can this be solved? We generalize sequential pattern mining with item intervals by (a) handling two kinds of item-interval measurement, item gap and time interval, and (b) adopting four item-interval constraints.
Sequential Pattern Mining Min_sup=0.5,,,, and, are extracted
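The notion of support used on this slide can be sketched in a few lines of Python. This is only an illustration: the toy database `sdb` and the helper names `is_subsequence` and `support` are assumptions made for the example, not the sequences shown on the original slide, and each sequence element is simplified to a single item.

```python
from typing import List, Sequence


def is_subsequence(pattern: Sequence[str], sequence: Sequence[str]) -> bool:
    """Return True if the items of `pattern` occur in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    # `item in it` advances the iterator, so matches must appear left to right.
    return all(item in it for item in pattern)


def support(pattern: Sequence[str], sdb: List[Sequence[str]]) -> float:
    """Fraction of sequences in the database that contain `pattern`."""
    return sum(is_subsequence(pattern, s) for s in sdb) / len(sdb)


# Hypothetical toy database (not the one from the slide).
sdb = [list("abcd"), list("acbd"), list("bd"), list("ca")]
min_sup = 0.5

for pattern in (["a", "d"], ["b", "d"], ["c", "a"]):
    s = support(pattern, sdb)
    print(pattern, s, "frequent" if s >= min_sup else "not frequent")
```

With Min_sup = 0.5, a pattern is reported only if at least half of the sequences contain it in order.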
B. PrefixSpan Algorithm SIDSequence SIDSequence SIDSequence 10 Min_sup=0.5 supSDB( ) =3 supSDB( ) =2 supSDB( ) =2. SIDSequence 10 SIDSequence 10 SIDSequence 10 proj_sdb
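The prefix-projection idea behind PrefixSpan can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes every sequence element is a single item (no itemsets), uses an absolute support count `min_count` instead of a ratio, and runs on a hypothetical toy database.

```python
from collections import Counter


def prefixspan(sdb, min_count, prefix=()):
    """Yield (pattern, support_count) pairs for sequences of single items."""
    # Count each item at most once per (projected) sequence.
    counts = Counter()
    for seq in sdb:
        counts.update(set(seq))
    for item, count in sorted(counts.items()):
        if count < min_count:
            continue
        pattern = prefix + (item,)
        yield pattern, count
        # Projected database: the suffix after the first occurrence of `item`.
        projected = [seq[seq.index(item) + 1:] for seq in sdb if item in seq]
        yield from prefixspan(projected, min_count, pattern)


# Hypothetical toy database; min_count = 2 corresponds to Min_sup = 0.5 of 4 sequences.
sdb = [("a", "b", "c"), ("a", "c"), ("b", "c"), ("a", "b")]
patterns = dict(prefixspan(sdb, min_count=2))
print(patterns)
```

Each recursive call only scans the projected database of the current prefix, which is what keeps the search from generating candidates that cannot occur.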
GENERALIZED SEQUENTIAL PATTERN MINING WITH ITEM INTERVALS An interval extended sequence is a list of items with item intervals: is = When the dataset has item occurrence time information, such as time-stamps, t_αβ becomes the time interval and is defined by the following equation: When the dataset does not have item occurrence time information, t_αβ becomes an item gap and is defined by the following equation:
GENERALIZED SEQUENTIAL PATTERN MINING WITH ITEM INTERVALS Anti-monotone constraint: when a sequence A does not satisfy the constraint, no super-sequence of A satisfies the constraint either. Monotone constraint: when a sequence A satisfies the constraint, every super-sequence of A also satisfies the constraint.
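The two interval measurements can be illustrated with a small Python sketch. The formulas are assumptions made for this example: the time interval is taken as the difference of the two occurrence times, the item gap as the difference of the two positions (so adjacent items have gap 1), and every interval in the extended sequence is measured from the first item. The function names and the sample events are hypothetical.

```python
def time_interval(events, alpha, beta):
    """Assumed definition: occurrence time of item beta minus that of item alpha."""
    return events[beta][0] - events[alpha][0]


def item_gap(alpha, beta):
    """Assumed definition: difference of positions, so adjacent items have gap 1."""
    return beta - alpha


def interval_extended(events, use_time=True):
    """Build a list of (interval from the first item, item) pairs."""
    if use_time:
        return [(time_interval(events, 0, k), item)
                for k, (_, item) in enumerate(events)]
    return [(item_gap(0, k), item) for k, (_, item) in enumerate(events)]


# Hypothetical events as (timestamp in seconds, item).
events = [(1000, "a"), (87400, "b"), (260200, "c")]
print(interval_extended(events, use_time=True))   # time intervals
print(interval_extended(events, use_time=False))  # item gaps
```

If the item gap is instead counted as the number of items strictly between two positions, `item_gap` would return `beta - alpha - 1`.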
Example,, represent item a, b, c occur respectively. represents once item a occurs, item c will occur with item interval (172800, ]. represents item a, b occur at the same time represents once item a occurs, item a will occur again with item interval (86400, ]. Min_sup=0.5 IF max_interval = (c2) is not extracted
Algorithm-interval extended projection Level 1 Projection: EX:a sequence projection result with ,, and. Level 2 or later Projection:
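One plausible reading of the level 1 projection is sketched below, under stated assumptions: a sequence is a flat list of (interval, item) pairs with intervals measured from the first item, and projecting on an item produces one postfix per occurrence of that item, with the intervals re-based to that occurrence. The function name `project_level1` and the sample sequence are hypothetical, and later projection levels are not covered.

```python
def project_level1(sequence, item):
    """Return one re-based postfix for every occurrence of `item`.

    `sequence` is a list of (interval, item) pairs; this flat layout is an
    assumption of the sketch, not the representation used on the slide.
    """
    postfixes = []
    for k, (t_k, x) in enumerate(sequence):
        if x != item:
            continue
        # Measure the remaining intervals from this occurrence of `item`.
        postfix = [(t - t_k, y) for t, y in sequence[k + 1:]]
        if postfix:
            postfixes.append(postfix)
    return postfixes


# Hypothetical sequence: a at 0 s, b after 1 day, a after 2 days, c after 3 days.
seq = [(0, "a"), (86400, "b"), (172800, "a"), (259200, "c")]
for postfix in project_level1(seq, "a"):
    print(postfix)
```

Keeping one postfix per occurrence matters once intervals are part of the pattern, because the interval from the first a to c differs from the interval from the second a to c.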
Algorithm for Example Min_sup=0.5 Max_interval=172800
Sequential Pattern Mining with Constraints on Large Protein Databases. COMAD 2005b. Joshua Ho, Lior Lukov, Sanjay Chawla, School of Information Technologies, University of Sydney
Introduction We generalize a well-known sequential pattern mining algorithm, SPAM [1], by incorporating gap and regular expression constraints along the lines proposed in SPIRIT [2]. SPAM is chosen because (a) it allows us to push the constraints deeper inside the mining process by exploiting the prefix-antimonotone property of some constraints, (b) it uses a simple vertical bitmap data structure for counting, and (c) it is known to be efficient for mining long patterns.
The SPAM Algorithm (Lexicographic Tree for Sequences)
S_n is the set of candidate items that are considered for possible S-step extensions of node n (abbreviated s-extensions). Example: S({a}) = {a, b, c, d}
Sequence for each customer:
CID  Sequence
1    ({a, b, d}, {b, c, d}, {b, c, d})
2    ({b}, {a, b, c})
3    ({a, b}, {b, c, d})
S-Step: a -> a, a | a, b | a, c | a, d
The SPAM Algorithm (Lexicographic Tree for Sequences)
I_n is the set of candidate items that are considered for possible I-step extensions of node n (abbreviated i-extensions). Example: I({a}) = {b, c, d}
Sequence for each customer:
CID  Sequence
1    ({a, b, d}, {b, c, d}, {b, c, d})
2    ({b}, {a, b, c})
3    ({a, b}, {b, c, d})
I-Step: a -> (a, b) | (a, c) | (a, d)
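The two extension steps can be tried out on the three customer sequences listed on the S-step and I-step slides. The sketch below counts support by direct containment checks rather than with SPAM's vertical bitmaps, and the helper names (`contains`, `support`, `s_extend`, `i_extend`) are assumptions made for the illustration.

```python
# Customer sequences from the slides, keyed by CID.
db = {
    1: [{"a", "b", "d"}, {"b", "c", "d"}, {"b", "c", "d"}],
    2: [{"b"}, {"a", "b", "c"}],
    3: [{"a", "b"}, {"b", "c", "d"}],
}


def contains(sequence, pattern):
    """True if the itemsets of `pattern` map, in order, to distinct itemsets of `sequence`."""
    pos = 0
    for itemset in pattern:
        # Greedily find the earliest remaining transaction that covers this itemset.
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(pattern):
    """Number of customers whose sequence contains `pattern`."""
    return sum(contains(seq, pattern) for seq in db.values())


def s_extend(pattern, item):
    """S-step: append `item` as a new itemset at the end of the pattern."""
    return pattern + [{item}]


def i_extend(pattern, item):
    """I-step: add `item` to the last itemset of the pattern."""
    return pattern[:-1] + [pattern[-1] | {item}]


node = [{"a"}]
print(support(s_extend(node, "b")))  # support of ({a}, {b})
print(support(i_extend(node, "b")))  # support of ({a, b})
```

The s-extension requires b in a later transaction than a, while the i-extension requires a and b in the same transaction, which is why the two counts can differ for the same pair of items.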
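Lexicographic tree (figure) rooted at a. Level 2: a,a | a,b | a,c | a,d | {a,b} | {a,c} | {a,d}. Level 3 (partial): a,a,a | a,a,b | a,a,c | a,a,d | a,{a,b} | a,{a,c} | a,{a,d} | a,b,a | a,b,b | a,b,c | a,b,d | a,{b,c} | a,{b,d}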
Overview of SPAM
Pushing Gap Constraints Here we describe a way to push minGap and maxGap constraints into SPAM at the bitmap level. With minGap and maxGap constraints, the transformation step is modified to restrict the positions at which {b} can appear after {a}. For every position p with a one bit in the original bitmap section of {a}, we set the bits from position (p + minGap + 1) to position (p + maxGap + 1), inclusive, to one; all other bits are set to zero. If maxGap is set to infinity (no maxGap constraint), all bits from (p + minGap + 1) to the end of the bitmap are set to one.
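The modified transformation step can be sketched on a single customer's bitmap section. This is a minimal illustration using plain Python lists of 0/1 instead of packed bitmaps; the function names `gap_transform` and `s_step`, and the convention that `max_gap=None` means no maxGap constraint, are assumptions of the sketch.

```python
def gap_transform(bits, min_gap=0, max_gap=None):
    """Transform the bitmap section of {a} under minGap/maxGap.

    For every position p holding a one, positions p+min_gap+1 through
    p+max_gap+1 (inclusive) are set to one; everything else is zero.
    max_gap=None stands for an unbounded maxGap.
    """
    n = len(bits)
    out = [0] * n
    for p, bit in enumerate(bits):
        if not bit:
            continue
        start = p + min_gap + 1
        end = n - 1 if max_gap is None else min(p + max_gap + 1, n - 1)
        for q in range(start, end + 1):
            out[q] = 1
    return out


def s_step(bits_a, bits_b, min_gap=0, max_gap=None):
    """AND the transformed section of {a} with the section of {b}."""
    transformed = gap_transform(bits_a, min_gap, max_gap)
    return [x & y for x, y in zip(transformed, bits_b)]


bits_a = [1, 0, 0, 0, 0, 0]  # {a} occurs in the first transaction
bits_b = [0, 1, 0, 1, 0, 1]  # {b} occurs in the 2nd, 4th and 6th transactions
print(gap_transform(bits_a, min_gap=1, max_gap=2))
print(s_step(bits_a, bits_b, min_gap=1, max_gap=2))
```

A customer supports the extended sequence under the gap constraints exactly when the result of `s_step` still contains a one bit.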
Pushing Gap Constraints
Pushing Regular Expression Constraints Definition 1: Let R' be a constraint such that a sequence s satisfies R' if s is legal w.r.t. R. Lemma 1: R' is a relaxed constraint of R. Lemma 2: R' is a prefix-antimonotonic constraint.
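The relationship between R and the relaxed constraint can be illustrated with a small automaton. The sketch assumes that "legal w.r.t. R" means the sequence can be read from the start state of a deterministic automaton for R without reaching an undefined transition; that reading, the example expression a b* c, and the names `is_legal` and `satisfies` are assumptions made for the illustration.

```python
def run(sequence, transitions, start):
    """Follow the automaton; return the final state, or None if a transition is undefined."""
    state = start
    for symbol in sequence:
        state = transitions.get((state, symbol))
        if state is None:
            return None
    return state


def is_legal(sequence, transitions, start):
    """Relaxed constraint: every transition along the sequence is defined."""
    return run(sequence, transitions, start) is not None


def satisfies(sequence, transitions, start, accepting):
    """Original constraint: the sequence ends in an accepting state."""
    return run(sequence, transitions, start) in accepting


# Hypothetical automaton for the regular expression a b* c.
transitions = {(0, "a"): 1, (1, "b"): 1, (1, "c"): 2}
start, accepting = 0, {2}

for seq in ("ab", "abbc", "ba", "bac"):
    print(seq, is_legal(seq, transitions, start),
          satisfies(seq, transitions, start, accepting))
```

In this toy automaton every accepted sequence is also legal, which matches the relaxation stated in Lemma 1, and once a prefix such as "ba" is illegal no extension of it can become legal, which is the pruning behaviour that Lemma 2 relies on.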
Pushing Regular Expression Constraints
Overall Algorithm The first round of support counting does not include any constraint and thus prunes the search tree with minSup only. The second round of support counting incorporates the constraints and prunes all child nodes that contain a sequence that does not satisfy the constraints.
Conclusion Sequential pattern mining with constraints is a worthwhile problem. PrefixSpan and SPAM are popular base algorithms for constraint-based mining.
Related work Item intervals are represented in two ways: item gap and time interval. Item gap is defined as the number of items between successive items; time interval is defined as the length of time between the occurrence times of successive items.
1. Item constraint approach using item gap: EX: minimum gap is 0 and maximum gap is 1. is count is not count
2. Item constraint approach using time interval
3. Extended sequence approach using item gap: EX: are difference
4. Extended sequence approach using time interval: let x and y be pseudo items that represent a user-specified time unit.,, and as different sequences.