False Positive or False Negative: Mining Frequent Itemsets from High Speed Transactional Data Streams. Jeffrey Xu Yu, Zhihong Chong (崇志宏), Hongjun Lu.

1 False Positive or False Negative: Mining Frequent Itemsets from High Speed Transactional Data Streams. Jeffrey Xu Yu, Zhihong Chong (崇志宏), Hongjun Lu, Aoying Zhou. VLDB 2004

2 Introduction Mining a data stream: data items arrive continuously.
Constraints: one scan of the data, limited memory, bounded error.

3 Introduction This paper develops an algorithm that mines frequent itemsets effectively while keeping memory consumption bounded. It takes a false-negative approach.

4 False Positive Most existing algorithms for mining frequent itemsets are false-positive oriented. They control memory consumption through an error parameter ε, and they report an item as frequent when its support is below the minimum support s but above s − ε. Example: "Approximate Frequency Counts over Data Streams" (VLDB 02).

5 False Positive Memory bound: O((1/ε) · log(εN))
Dilemma of the false-positive approach: a smaller ε lets in fewer false-positive items, but memory consumption grows in proportion to 1/ε. In Apriori, the frequent k-itemsets generate the candidate (k+1)-itemsets, so each false-positive itemset that is kept also inflates the candidates at the next level.
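To make the dilemma concrete, the following sketch evaluates the worst-case entry count for a few values of ε. It assumes the bound has the form (1/ε) · log(εN) from the cited VLDB 02 paper and that the logarithm is natural; the function name `lossy_counting_entries` and the chosen N are illustrative only.

```python
import math

def lossy_counting_entries(eps, n):
    """Worst-case number of entries, assuming the form (1/eps) * ln(eps * n)."""
    return (1.0 / eps) * math.log(eps * n)

n = 10_000_000  # illustrative stream length, not a figure from the slides
bounds = {eps: lossy_counting_entries(eps, n) for eps in (1e-2, 1e-3, 1e-4)}

for eps, entries in bounds.items():
    # Each tenfold reduction of eps multiplies the bound by roughly 6x to 8x here.
    print(f"eps={eps:g}: at most about {entries:,.0f} entries")
```

Tightening ε to cut false positives is therefore paid for directly in memory, which is the trade-off the false-negative approach tries to avoid.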

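6 False Positive & False Negative
False positive: all itemsets with support at or above s are output; some itemsets with support between s − ε and s are output as well.
False negative: all itemsets with support at or above s + ε are output; only some itemsets with support between s and s + ε are output.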

7 False Negative Error control and pruning
ε: used to prune data and to control the error bound; it is not fixed. ε decreases and approaches zero as the number of observations increases (ε varies inversely with n). s: minimum support. n: number of observations.

8 False Negative Memory control
δ: reliability parameter; δ, rather than ε, controls memory consumption. Memory consumption is related to ln(1/δ). This approach does not allow 1-itemsets with support below s to be reported as frequent.

9 Comparison: False Positive & False Negative
Recall and Precision. A: true frequent itemsets. B: obtained frequent itemsets. Recall = |A∩B| / |A|. Precision = |A∩B| / |B|.
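The two measures can be computed directly from sets. The sketch below uses a hypothetical helper `recall_precision` and made-up itemset labels; it is only meant to show how a false-positive result keeps recall at 1 while a false-negative result keeps precision at 1.

```python
def recall_precision(true_frequent, mined):
    """Return (recall, precision) of a mined result B against the true answer A."""
    a, b = set(true_frequent), set(mined)
    hits = len(a & b)
    recall = hits / len(a) if a else 1.0
    precision = hits / len(b) if b else 1.0
    return recall, precision

truth = {"A", "B", "AB"}  # illustrative labels, not data from the slides

# False-positive style result: every true itemset is found, plus extras.
fp_result = recall_precision(truth, {"A", "B", "AB", "C", "AC", "BC"})

# False-negative style result: nothing wrong is reported, but one itemset is missed.
fn_result = recall_precision(truth, {"A", "B"})
```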

10 Comparison: False Positive
δ=0.1
S (%)   True Size   Mined Size   Recall   Precision
0.08    21,361      126,307      1.00     0.17
0.10    12,252      68,275       ?        0.18
0.20    2,359       23,154       ?        0.16

11 Comparison: False Negative
s+ε: minimum support
S (%)   True Size   Mined Size   Recall   Precision
0.08    21,361      18,351       0.86     1.00
0.10    12,252      10,411       0.85     ?
0.20    2,359       1,739        0.74     ?

12 Chernoff Bound The Chernoff bound gives a probabilistic guarantee on estimates of statistics about the underlying data: Pr{T ≥ e·E[T]} ≤ e^(−E[T]). Example: pick a lottery number from 0000, 0001, …, 9999. If 1,000,000 people each buy a $1 ticket, then E[#winners] = 100 and Pr{T ≥ 273} ≤ e^(−100).
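The numbers in the lottery example can be checked in a few lines. This sketch assumes each ticket wins independently with probability 1/10,000, which is what the bound needs; the variable names are illustrative.

```python
import math

tickets = 1_000_000
numbers = 10_000                       # lottery numbers 0000 .. 9999
expected_winners = tickets / numbers   # E[T] = 100

threshold = math.e * expected_winners  # e * E[T], about 271.83
bound = math.exp(-expected_winners)    # e^(-100), on the order of 1e-44

# Any cut-off at or above e * E[T], including the slide's 273, is covered by the bound.
slide_cutoff_is_covered = 273 >= threshold
```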

13 Chernoff Bound Bernoulli trials (coin flips):
Pr[o_i = 1] = p, Pr[o_i = 0] = 1 − p. r: number of heads in n coin flips. np: expectation of r. The bound holds for any γ > 0.

14 Chernoff Bound Let r̄ be r/n and take the minimum support s as p. Replace sγ with ε.
Set the right-hand side of the inequality to δ: Pr{|RunningSupport − TrueSupport| ≥ ε_n} ≤ δ
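The inequality these substitutions are applied to is not preserved in the transcript. One form that fits the surrounding definitions is a commonly quoted simplified two-sided Chernoff bound; treating it as the intended starting point is an assumption, and the constant in the exponent differs between textbooks. Under that assumption the substitutions give a closed form for ε_n:

```latex
\begin{align*}
\Pr\{\,|r - np| \ge np\gamma\,\} &\le 2e^{-np\gamma^{2}/2}
  && \text{(assumed form of the bound)}\\
\Pr\{\,|\bar r - s| \ge s\gamma\,\} &\le 2e^{-ns\gamma^{2}/2}
  && \bar r = r/n,\; p = s\\
\Pr\{\,|\bar r - s| \ge \varepsilon\,\} &\le 2e^{-n\varepsilon^{2}/(2s)}
  && \varepsilon = s\gamma\\
2e^{-n\varepsilon_n^{2}/(2s)} = \delta
  \;&\Longrightarrow\;
  \varepsilon_n = \sqrt{\frac{2s\ln(2/\delta)}{n}}
\end{align*}
```

In this form ε_n shrinks like 1/√n, and δ enters only through ln(2/δ).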

15 Frequent or Infrequent
A pattern X is potentially infrequent, in terms of n, if count(X)/n < s − ε_n. A pattern X is potentially frequent, in terms of n, if it is not potentially infrequent.
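The test can be written as a small predicate. The closed form used for ε_n below, sqrt(2·s·ln(2/δ)/n), is an assumption derived from a simplified Chernoff bound and is not stated in the transcript; the function names are hypothetical.

```python
import math

def epsilon_n(s, delta, n):
    """Running error bound after n observations (assumed closed form)."""
    return math.sqrt(2.0 * s * math.log(2.0 / delta) / n)

def is_potentially_infrequent(count, n, s, delta):
    """Slide 15 test: count(X) / n < s - epsilon_n."""
    return count / n < s - epsilon_n(s, delta, n)

s, delta = 0.01, 0.1  # illustrative parameters

# Early in the stream epsilon_n exceeds s, so nothing can be pruned yet.
early = is_potentially_infrequent(0, 100, s, delta)

# After a million observations the pruning threshold sits just below s.
low = is_potentially_infrequent(9_000, 1_000_000, s, delta)
borderline = is_potentially_infrequent(9_900, 1_000_000, s, delta)
```

With these parameters `early` and `borderline` are false and `low` is true: a pattern slightly under s is still kept as potentially frequent, which is what protects it from premature deletion.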

16 FDPM-1(s, δ)

17 FDPM-1(s, δ) [Figure: items A, B, C, D arrive from the source and are counted in memory. When memory is full, a new ε_n is computed and infrequent items are deleted.]

18 FDPM-1(s, δ) The algorithm ensures:
Every item whose true frequency exceeds sN is output with probability at least 1 − δ. No item whose true frequency is less than sN is output. The probability that the estimated support equals the true support is no less than 1 − δ.
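The algorithm listing itself is not preserved in the transcript, so the following is only a minimal sketch of the count, prune, and output cycle described on the neighbouring slides. The function name `fdpm1_sketch`, the capacity formula (2 + 2·ln(2/δ))/s, and the closed form for ε_n are all assumptions and may differ from the authors' pseudocode.

```python
import math
from collections import Counter

def fdpm1_sketch(stream, s, delta):
    """Single-item, false-negative frequency counting in bounded memory (sketch)."""
    log_term = math.log(2.0 / delta)
    capacity = math.ceil((2 + 2 * log_term) / s)   # assumed memory budget
    counts = Counter()
    n = 0
    for item in stream:
        n += 1
        counts[item] += 1
        if len(counts) >= capacity:                # "memory is full"
            eps_n = math.sqrt(2 * s * log_term / n)  # assumed closed form
            threshold = (s - eps_n) * n
            # Delete potentially infrequent items: count(X) / n < s - eps_n.
            for key in [k for k, c in counts.items() if c < threshold]:
                del counts[key]
    # Stored counts never exceed true counts, so nothing below s * n is output.
    return {k: c for k, c in counts.items() if c >= s * n}

# Illustrative stream: one item fills half the stream, the rest are all distinct.
stream = ["hot" if i % 2 == 0 else f"x{i}" for i in range(10_000)]
result = fdpm1_sketch(stream, s=0.2, delta=0.1)
```

Because a deleted item restarts from zero if it reappears, the stored counts can only undercount. That is why errors in this scheme take the form of missed items rather than spurious ones.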

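19 Memory Bound Every entry kept in P satisfies Sup(X) ≥ (s − ε_n)·n, and the counts sum to at most n, so |P| ≤ 1/(s − ε_n) when s − ε_n > 0.
Setting |P| = n = 1/(s − ε_n) and solving for n gives the memory bound.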
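The closed form of the bound is not preserved in the transcript. Assuming ε_n = sqrt(2·s·ln(2/δ)/n), substituting n = c/s into n = 1/(s − ε_n) gives c² − (2 + 2·ln(2/δ))·c + 1 = 0, whose larger root lies just below 2 + 2·ln(2/δ). The sketch below checks numerically that n0 = (2 + 2·ln(2/δ))/s is then a safe budget; both function names are hypothetical.

```python
import math

def memory_bound(s, delta):
    """Assumed closed form n0 = (2 + 2 ln(2/delta)) / s."""
    return (2 + 2 * math.log(2 / delta)) / s

def entry_limit(s, delta, n):
    """1 / (s - eps_n), the cap on |P|; only defined when s > eps_n."""
    eps_n = math.sqrt(2 * s * math.log(2 / delta) / n)
    if s <= eps_n:
        raise ValueError("s - eps_n must be positive")
    return 1 / (s - eps_n)

s, delta = 0.001, 0.1            # illustrative parameters
n0 = memory_bound(s, delta)      # about 7,991 entries
at_n0 = entry_limit(s, delta, n0)
later = entry_limit(s, delta, 10 * n0)
```

Here `at_n0` is already below `n0`, and `later` is smaller still because ε_n keeps shrinking. The budget depends on δ only through ln(2/δ), which matches the statement on slide 8 that memory is related to ln(1/δ).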

20 FDPM-2(s, δ)

21 Mining Frequent Itemsets from a Data Stream
[Figure: transactions arrive from the source; frequent itemsets such as {A}, {B}, {AB}, {E}, {F}, {EF} are mined and stored with their counts in the pool P. When memory is full, a new ε_n is computed and infrequent itemsets are deleted.]

22 Conclusion False negative Limited memory
Error bound with some probability

