False Positive or False Negative: Mining Frequent Itemsets from High Speed Transactional Data Streams. Jeffrey Xu Yu, Zhihong Chong (崇志宏), Hongjun Lu, Aoying Zhou. VLDB 2004.
Introduction. Mining data streams: data items arrive continuously; only one scan of the data is possible; memory is limited; the error must be bounded.
Introduction. This paper develops an algorithm that mines frequent itemsets effectively with a bound on memory consumption, using a false-negative approach.
False Positive. Most existing algorithms for mining frequent itemsets are false-positive oriented. They control memory consumption with an error parameter ε and report as frequent any item whose support is below the minimum support s but above s − ε. Example: Approximate Frequency Counts over Data Streams (VLDB 02).
False Positive. Memory bound: O((1/ε) log(εN)). The dilemma of the false-positive approach: a smaller ε admits fewer false-positive items, but memory consumption grows in proportion to 1/ε. In Apriori, the frequent k-itemsets generate the candidate (k+1)-itemsets, so false-positive k-itemsets also inflate the candidates at the next level.
False Positive & False Negative. [Figure: a support axis marked s + ε, s, and s − ε, with regions labeled "all itemsets will be output" and "some will be output" for the false-positive and the false-negative approach.]
False Negative. Error control and pruning: ε_n is used to prune data and to control the error bound, and it is not fixed. ε_n decreases and approaches zero as the number of observations increases, since it is inversely related to n. Here s is the minimum support and n is the number of observations.
False Negative. Memory control: δ is the reliability parameter, and δ rather than ε controls memory consumption, which is related to ln(1/δ). This approach does not allow 1-itemsets with support below s to be reported as frequent.
Comparison: False Positive & False Negative. Recall and precision, where A is the set of true frequent itemsets and B is the set of obtained frequent itemsets: Recall = |A ∩ B| / |A|, Precision = |A ∩ B| / |B|.
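The two measures above can be computed directly from sets of itemsets. The following is a minimal sketch; the function name `recall_precision` and the convention of returning 1.0 for an empty denominator are assumptions made for illustration, not part of the slides.

```python
def recall_precision(true_itemsets, mined_itemsets):
    """Return (recall, precision) for A = true_itemsets, B = mined_itemsets.

    Each itemset should be hashable, for example a frozenset of items.
    An empty denominator yields 1.0 (an assumed convention).
    """
    a = set(true_itemsets)
    b = set(mined_itemsets)
    hits = len(a & b)  # |A ∩ B|
    recall = hits / len(a) if a else 1.0
    precision = hits / len(b) if b else 1.0
    return recall, precision


# A false-negative style result: one true itemset is missed,
# but nothing infrequent is reported.
true_sets = {frozenset({"A"}), frozenset({"B"}), frozenset({"A", "B"})}
mined_sets = {frozenset({"A"}), frozenset({"A", "B"})}
recall, precision = recall_precision(true_sets, mined_sets)
print(recall, precision)
```

A false-positive result would show the opposite pattern: every true itemset is found, and extra itemsets lower the precision.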
Comparison: False Positive (δ = 0.1)
s (%)   True Size   Mined Size   Recall   Precision
0.08    21,361      126,307      1.00     0.17
0.10    12,252      68,275                0.18
0.20    2,359       23,154                0.16
Comparison: False Negative (s + ε: minimum support)
s (%)   True Size   Mined Size   Recall   Precision
0.08    21,361      18,351       0.86     1.00
0.10    12,252      10,411       0.85
0.20    2,359       1,739        0.74
Chernoff Bound. The Chernoff bound gives a probabilistic guarantee on estimates of statistics about the underlying data: Pr{T ≥ e·E[T]} ≤ e^{−E[T]}. Example: a lottery number is picked from 0000, 0001, …, 9999, and 1,000,000 people each buy a $1 ticket. Then E[#winners] = 100 and Pr{T ≥ 273} ≤ e^{−100}.
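The arithmetic of the lottery example can be checked numerically. This is a small sketch; the helper name `chernoff_tail` is hypothetical, and it only evaluates the two sides of the inequality Pr{T ≥ e·E[T]} ≤ e^{−E[T]} rather than proving it.

```python
import math


def chernoff_tail(expected):
    """Return (threshold, bound) for Pr{T >= e * E[T]} <= exp(-E[T])."""
    threshold = math.e * expected  # the level e * E[T]
    bound = math.exp(-expected)    # the probability bound e^{-E[T]}
    return threshold, bound


tickets = 1_000_000
numbers = 10_000                      # 0000 through 9999
expected_winners = tickets / numbers  # one winning number, uniform picks
threshold, bound = chernoff_tail(expected_winners)
print(expected_winners, threshold, bound)
```

The threshold e·100 is just under 272, so the slide's event T ≥ 273 lies beyond it and the bound e^{−100} applies.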
Chernoff Bound. Bernoulli trials (coin flips) o_1, …, o_n with Pr[o_i = 1] = p and Pr[o_i = 0] = 1 − p. Let r be the number of heads in n coin flips; np is the expectation of r. The bound on the deviation of r from np holds for any γ > 0.
Chernoff Bound. Let r̄ = r/n and take the minimum support s as p. Replace sγ with ε and let the right-hand side of the inequality be δ. Then Pr{|RunningSupport − TrueSupport| ≥ ε_n} ≤ δ.
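The substitution steps can be written out as a short derivation. This is a sketch that assumes the two-sided bound takes the form Pr{|r − np| ≥ npγ} ≤ 2e^{−npγ²/2}; that form, and the closed form for ε_n that follows from it, are assumptions and are not stated in the slide text.

```latex
\begin{align*}
\Pr\{\,|r - np| \ge np\gamma\,\} &\le 2e^{-np\gamma^{2}/2}
  && \text{(assumed form of the bound)} \\
\Pr\{\,|\bar r - p| \ge p\gamma\,\} &\le 2e^{-np\gamma^{2}/2}
  && \text{(divide by } n,\ \bar r = r/n\text{)} \\
\Pr\{\,|\bar r - s| \ge \varepsilon\,\} &\le 2e^{-n\varepsilon^{2}/(2s)}
  && \text{(set } p = s,\ \varepsilon = s\gamma\text{)} \\
\delta = 2e^{-n\varepsilon_n^{2}/(2s)}
  \;&\Longrightarrow\;
  \varepsilon_n = \sqrt{\frac{2s\,\ln(2/\delta)}{n}}
  && \text{(solve for } \varepsilon\text{)}
\end{align*}
```

Under this assumption ε_n shrinks as n grows and depends on δ only through ln(2/δ).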
Frequent or Infrequent. A pattern X is potentially infrequent in terms of n if count(X)/n < s − ε_n. A pattern X is potentially frequent in terms of n if it is not potentially infrequent.
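The test above is a one-line predicate once ε_n is available. The sketch below assumes ε_n = sqrt(2·s·ln(2/δ)/n), a closed form obtained from a Chernoff-style bound and not given in the slide text; the function names are hypothetical.

```python
import math


def epsilon_n(s, delta, n):
    """Running error bound, ASSUMED to be sqrt(2 * s * ln(2/delta) / n)."""
    return math.sqrt(2.0 * s * math.log(2.0 / delta) / n)


def is_potentially_infrequent(count, n, s, delta):
    """True if count/n < s - epsilon_n, i.e. the pattern may be pruned."""
    return count / n < s - epsilon_n(s, delta, n)


s, delta = 0.1, 0.1
# Early in the stream the bound is loose, so little can be pruned.
print(epsilon_n(s, delta, 100), is_potentially_infrequent(5, 100, s, delta))
# Later the bound tightens and a support of 9% falls below s - epsilon_n.
print(epsilon_n(s, delta, 10_000), is_potentially_infrequent(900, 10_000, s, delta))
```

A pattern with running support slightly below s, such as 950 out of 10,000 here, is still treated as potentially frequent because it lies inside the ε_n margin.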
FDPM-1(s, δ)
FDPM-1(s, δ). [Figure: items A, B, C, D arrive from the source and are counted in memory; when memory is full, a new ε_n is computed and infrequent items are deleted.]
FDPM-1(s, δ). The algorithm ensures that: every item whose true frequency exceeds sN is output with probability at least 1 − δ; no item whose true frequency is less than sN is output; and the probability that the estimated support equals the true support is no less than 1 − δ.
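The count, prune, and output steps for single items can be sketched as follows. This is an illustration under stated assumptions, not the authors' pseudocode: the `capacity` parameter, the function names, and the closed form ε_n = sqrt(2·s·ln(2/δ)/n) are all assumptions, and the sketch does not enforce a hard memory cap if pruning frees nothing.

```python
import math


def epsilon_n(s, delta, n):
    """ASSUMED running error bound: sqrt(2 * s * ln(2/delta) / n)."""
    return math.sqrt(2.0 * s * math.log(2.0 / delta) / n)


def fdpm1_sketch(stream, s, delta, capacity):
    """Count items in one pass, pruning potentially infrequent ones.

    capacity is a hypothetical number of entries at which memory
    counts as full. Returns the items whose kept count is >= s * n.
    """
    counts = {}
    n = 0
    for item in stream:
        n += 1
        if item in counts:
            counts[item] += 1
            continue
        if len(counts) >= capacity:
            # Memory is full: compute a new epsilon_n and delete
            # every entry with count/n < s - epsilon_n.
            eps = epsilon_n(s, delta, n)
            counts = {x: c for x, c in counts.items() if c / n >= s - eps}
        counts[item] = 1
    # Only items that reach the minimum support itself are reported,
    # so kept counts never overstate an item's true frequency.
    return {x: c for x, c in counts.items() if c >= s * n}


stream = ["a"] * 60 + ["b"] * 30 + list("cdefghijkl")
result = fdpm1_sketch(stream, s=0.25, delta=0.1, capacity=4)
print(result)
```

Because a pruned item restarts from a count of 1 if it reappears, kept counts can only undercount, which is why errors in this style of algorithm are false negatives rather than false positives.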
Memory Bound. Every pattern X kept in P satisfies Sup(X) ≥ (s − ε_n)·n, so |P| ≤ 1/(s − ε_n) when s − ε_n > 0. Setting |P| = n gives n = 1/(s − ε_n).
FDPM-2(s, δ)
Mining Frequent Itemsets from a Data Stream. [Figure: transactions from the source, such as {A,B} … {E,F,G}, update a table P of itemsets ({A}, {B}, {AB}, {E}, {F}, {EF}, …) and their counts; when memory is full, a new ε_n is computed and infrequent itemsets are deleted.]
Conclusion. A false-negative approach; limited memory; an error bound that holds with some probability.