Verifying and Mining Frequent Patterns from Large Windows over Data Streams. Barzan Mozafari, Hetal Thakkar, Carlo Zaniolo. ICDE 2008
Outline: Introduction and motivation; SWIM algorithm; DTV and DFV algorithms; Experiments; Conclusion
Introduction and motivation: Conditional counting. Verifiers (DTV, DFV) verify the frequency of previously frequent itemsets over newly arriving windows. A fast verifier supports incremental frequent itemset mining: the Sliding Window Incremental Miner (SWIM).
SWIM algorithm: The difficulty: when a new pattern is added to the pattern tree for the first time, its true frequency over the whole window is not known, since the pattern wasn't frequent in the previous n-1 slides. Notation: W: the window; PT (pattern tree): a superset of the frequent patterns over W; aux_array: stores the frequency of a pattern for each window for which the frequency is unknown; p.f_i: the frequency of p in the i-th slide; p.freq: p's cumulative frequency in the current window.
SWIM algorithm (cont.) Example:
Slides: S2 S3 S4 S5 S6 S7 ...   Windows: W4 W5 W6 W7 (each window spans three slides, so W7 covers S5 to S7)
W4: aux_array=    p.freq = p.f_4
W5: aux_array=    p.freq = p.f_4 + p.f_5
W6: aux_array=    p.freq = p.f_4 + p.f_5 + p.f_6
W7: p.freq = p.f_5 + p.f_6 + p.f_7
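The bookkeeping in this example can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function name `cumulative_freqs`, the per-slide counts, and the window size n = 3 are assumptions chosen to match the example, and only p.freq is modeled (aux_array is left out).

```python
from collections import deque

def cumulative_freqs(per_slide, n):
    """Return {window index: (p.freq, fully_counted)}.

    per_slide maps slide index -> frequency of p in that slide, starting at
    the slide where p first entered the pattern tree (hypothetical input with
    contiguous slide indices). n is the number of slides per window.
    """
    first = min(per_slide)
    counted = deque()   # (slide index, frequency) pairs still inside the window
    freq = 0
    result = {}
    for i in sorted(per_slide):
        counted.append((i, per_slide[i]))
        freq += per_slide[i]
        # Drop slides that have fallen out of window W_i = S_(i-n+1) .. S_i.
        while counted[0][0] <= i - n:
            freq -= counted.popleft()[1]
        # p.freq covers the whole window only once n slides have been counted.
        result[i] = (freq, i - first + 1 >= n)
    return result

# Made-up per-slide frequencies p.f_4 .. p.f_7
for w, (freq, full) in cumulative_freqs({4: 3, 5: 1, 6: 2, 7: 4}, n=3).items():
    print(f"W{w}: p.freq={freq}, covers whole window: {full}")
```

With these made-up numbers, W4 and W5 hold the partial sums p.f_4 and p.f_4 + p.f_5, and from W6 onward p.freq covers all three slides of the window.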
Analysis of SWIM algorithm: Delay: a pattern may be reported late, only once its frequency over the window turns out to be larger than the minimum support. Maximum delay: n-1 slides (n: number of slides per window). Bottleneck: counting the frequencies of itemsets over a given dataset (delay = L, n-L+1 slides).
Conditional counting: Goal: verify the counts for a given set of patterns. 1. Report p's true frequency in D if p has occurred at least min_freq times. 2. Otherwise, report that p has occurred fewer than min_freq times (the exact frequency is not required in this case, so the verifier can skip any pattern whose frequency is less than min_freq).
Conditional counting (cont.): Verification: given a set of transactions T, a set of patterns P and a threshold s, the goal is to find the exact frequency of each p ∈ P w.r.t. T iff its frequency is ≥ s. If s = 0, verification = counting, but if s > 0, extra computation can be avoided. Proposed fast verifiers: DTV, DFV, hybrid.
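The definition of verification can be illustrated with a deliberately naive Python sketch. It is not DTV or DFV: the function name `verify`, the early-abandon test, and the use of `None` to mean "below s" are assumptions made here only to show why a threshold s > 0 lets a verifier skip work that plain counting must do.

```python
def verify(transactions, patterns, s):
    """Return {pattern: exact frequency}, or None where the frequency is < s.

    transactions: list of sets of items; patterns: list of tuples of items.
    """
    total = len(transactions)
    result = {}
    for p in patterns:
        items = frozenset(p)
        count = 0
        for seen, t in enumerate(transactions, start=1):
            if items <= t:          # pattern is contained in the transaction
                count += 1
            # Abandon early: even if every remaining transaction matched,
            # the pattern could no longer reach the threshold s.
            if count + (total - seen) < s:
                count = None
                break
        result[p] = count
    return result

T = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
P = [("a", "b"), ("c",), ("b", "c")]
print(verify(T, P, s=2))   # ("b", "c") is reported as below the threshold
print(verify(T, P, s=0))   # with s = 0 this degenerates to plain counting
```

With s = 0 the abandon test can never fire, so every pattern is counted exactly, which matches the remark that verification with s = 0 is just counting.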
Double-Tree Verifier (DTV) [Figure: worked example over items a to h, with panels: original fp-tree; fp-tree conditionalized on g; fp-tree |g conditionalized on d; initial pattern tree; pattern tree | "g"; pattern tree | "g" after verification against the fp-tree; filling the original pattern tree using reverse pointers.]
Double-Tree Verifier (DTV): For very small min_freq values, it becomes impossible to run FP-growth due to the exponential number of paths. Advantage: DTV is useful when the minimum support decreases.
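The idea of conditionalizing the transactions and the patterns in parallel can be sketched without any tree structures. The Python below is a simplified stand-in, assuming plain sorted tuples instead of an fp-tree and a pattern tree, and no reverse pointers; the name `verify_by_conditionalization` is hypothetical. It conditions both sides on the last item of the patterns, prunes a whole branch when the conditional transaction set is smaller than min_freq, and recurses.

```python
def verify_by_conditionalization(trans, pats, min_freq, suffix=()):
    """Return {pattern: frequency} for patterns with frequency >= min_freq.

    trans: list of sorted item tuples; pats: set of non-empty sorted item
    tuples. Patterns missing from the result occurred fewer than min_freq
    times. Both arguments are interpreted relative to `suffix`.
    """
    out = {}
    for x in {p[-1] for p in pats}:
        # Transactions conditionalized on x: keep only the items before x.
        cond_trans = [t[:t.index(x)] for t in trans if x in t]
        if len(cond_trans) < min_freq:
            continue  # every pattern ending in x + suffix is infrequent
        new_suffix = (x,) + suffix
        # Patterns conditionalized on x: strip the last item.
        cond_pats = {p[:-1] for p in pats if p[-1] == x}
        if () in cond_pats:
            out[new_suffix] = len(cond_trans)   # pattern fully matched
            cond_pats.discard(())
        if cond_pats:
            out.update(verify_by_conditionalization(
                cond_trans, cond_pats, min_freq, new_suffix))
    return out

transactions = [("a", "b", "d"), ("a", "b"), ("b", "d"), ("a", "b", "d"), ("c",)]
patterns = {("a", "b"), ("b", "d"), ("a", "b", "d"), ("c",), ("a", "c")}
print(verify_by_conditionalization(transactions, patterns, min_freq=2))
```

In this sketch the branch for item c is abandoned after a single count, so neither ("c",) nor ("a", "c") is examined further; a real DTV would perform the same kind of pruning on compressed trees rather than on lists.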
Depth-First Verifier (DFV): Three pruning rules. (1) Ancestor Failure: if a path in the fp-tree has already been shown not to contain a prefix of the pattern p, then it does not contain p itself either (Apriori property). (2) Smaller Sibling Equivalence: if a path in the fp-tree has already been marked as containing (or not containing) a smaller sibling of the pattern p, then it does (or does not) contain p itself too. (3) Parent Success: if a path in the fp-tree has already been marked as containing the parent pattern of p, then it also contains p.
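The first rule, Ancestor Failure, can be sketched in Python as a depth-first walk over a pattern trie. This is an illustration under stated assumptions, not DFV itself: plain transactions stand in for fp-tree paths, the other two rules and the min_freq threshold are not modeled, and the names `build_trie`, `count_path` and `collect` are hypothetical.

```python
def new_node():
    return {"count": 0, "children": {}, "is_pattern": False}

def build_trie(patterns):
    """Build a trie of patterns, each inserted with its items in sorted order."""
    root = new_node()
    for p in patterns:
        node = root
        for item in sorted(p):
            node = node["children"].setdefault(item, new_node())
        node["is_pattern"] = True
    return root

def count_path(node, path_items):
    """Depth-first walk: descend only while the prefix is contained in the path."""
    for item, child in node["children"].items():
        if item in path_items:
            child["count"] += 1
            count_path(child, path_items)
        # else: Ancestor Failure. The prefix ending at `child` is missing
        # from this path, so no pattern below it can match; skip the subtree.

def collect(node, prefix=()):
    """Gather {pattern: count} for the nodes that mark complete patterns."""
    out = {}
    for item, child in node["children"].items():
        pattern = prefix + (item,)
        if child["is_pattern"]:
            out[pattern] = child["count"]
        out.update(collect(child, pattern))
    return out

trie = build_trie([("a", "b"), ("a", "b", "c"), ("b", "c")])
for t in [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b"}]:
    count_path(trie, t)
print(collect(trie))
```

For the transaction {"a", "c"}, the walk stops at the node for ("a", "b"), so ("a", "b", "c") is never tested against that transaction.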
Hybrid Version: When there are many transactions in the fp-tree and many patterns in the pattern tree, DTV is faster than DFV. When the trees are small, DFV is faster than DTV. Hybrid: start with DTV until the conditionalized trees are "small enough", and after that point switch to DFV.
Experiments
Experiments (cont.): transactions = 100K
Conclusion: Verifiers can speed up many other applications: incremental mining (SWIM); enhancing static algorithms (counting phase); privacy-preserving techniques (long transactions); monitoring / concept shift detection. Hybrid: there is no exact point at which to switch from DTV to DFV.