
1 False Positive or False Negative: Mining Frequent Itemsets from High Speed Transactional Data Streams Jeffrey Xu Yu, Zhihong Chong, Hongjun Lu, Aoying Zhou VLDB 2004

2 Introduction
Mining data streams:
–Data items arrive continuously
–One scan of the data
–Limited memory
–Bounded error

3 Introduction
This paper develops an algorithm that effectively mines frequent itemsets with a bound on memory consumption.
It uses a false-negative approach.

4 False Positive
Most existing algorithms for mining frequent itemsets are false-positive oriented:
–Control memory consumption by an error parameter ε
–Allow itemsets with support below the minimum support s but above s − ε to be reported as frequent
Example: Approximate frequency counts over data streams (VLDB '02)
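The false-positive algorithm cited above (Lossy Counting, VLDB '02) can be sketched for single items as follows. The bucket width, undercount tracking, and pruning rule follow the published algorithm; variable names are illustrative:

```python
from math import ceil

def lossy_count(stream, s, eps):
    """Lossy Counting sketch for single items: one pass, bounded memory.

    Reports items whose maintained count is at least (s - eps) * N, i.e. a
    false-positive result: no truly frequent item is missed, but items with
    support in [s - eps, s) may be reported too.
    """
    width = ceil(1 / eps)              # bucket width
    counts = {}                        # item -> (count, max undercount Delta)
    n = 0
    for item in stream:
        n += 1
        if item in counts:
            c, d = counts[item]
            counts[item] = (c + 1, d)
        else:
            counts[item] = (1, (n - 1) // width)   # Delta = current bucket - 1
        if n % width == 0:             # bucket boundary: prune low-count entries
            bucket = n // width
            counts = {i: (c, d) for i, (c, d) in counts.items() if c + d > bucket}
    return {i for i, (c, d) in counts.items() if c >= (s - eps) * n}
```

The number of maintained entries is what gives the O((1/ε)·log(εN)) memory bound discussed on the next slide.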

5 False Positive
Memory bound: O((1/ε) · log(εN))
Dilemma of the false-positive approach:
–The smaller ε is, the fewer false-positive itemsets are included
–But memory consumption increases reciprocally with ε
–In Apriori-style mining, the k-th frequent itemsets generate the (k+1)-th candidate itemsets, so false positives compound across levels

6 False Positive & False Negative
–False positive: all itemsets with support above s are output; some itemsets with support in (s − ε, s) are also output
–False negative: all itemsets with support above s + ε are output; some itemsets with support in (s, s + ε) are output, and the rest are missed

7 False Negative
Error control and pruning:
–ε: prunes data and controls the error bound; it is changeable
–ε decreases and approaches zero as the number of observations increases
–ε varies inversely with n
–s: minimum support
–n: number of observations

8 False Negative
Memory control:
–δ: reliability; δ, instead of ε, controls memory consumption
–Memory consumption is related to ln(1/δ)
This approach does not report 1-itemsets with support below s as frequent.

9 Comparison: False Positive & False Negative
Recall and precision:
–A: true frequent itemsets
–B: obtained frequent itemsets
–Recall = |A ∩ B| / |A|
–Precision = |A ∩ B| / |B|
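The two measures can be computed directly from the definitions; a minimal sketch:

```python
def recall_precision(true_sets, mined_sets):
    """Recall = |A ∩ B| / |A|, Precision = |A ∩ B| / |B|."""
    a, b = set(true_sets), set(mined_sets)
    hits = len(a & b)
    return hits / len(a), hits / len(b)
```

A false-positive result contains all of A plus extras, so recall is 1 and precision drops; a false-negative result contains only members of A, so precision is 1 and recall drops. That is exactly the pattern in the two tables that follow.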

10 Comparison: False Positive
ε = s/10, δ = 0.1

S (%)  True Size  Mined Size  Recall  Precision
0.08   21,361     126,307     1.00    0.17
0.10   12,252      68,275     1.00    0.18
0.20    2,359      23,154     1.00    0.16

11 Comparison: False Negative
s + ε is used as the minimum support

S (%)  True Size  Mined Size  Recall  Precision
0.08   21,361     18,351      0.86    1.00
0.10   12,252     10,411      0.85    1.00
0.20    2,359      1,739      0.74    1.00
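Because precision is 1.00 in every row, the mined itemsets are a subset of the true ones, so the recall column is simply Mined Size / True Size; the table can be checked directly (sizes copied from the rows above):

```python
def check_recall(true_size, mined_size):
    """When precision is 1, B ⊆ A, so recall = |B| / |A| = mined / true."""
    return round(mined_size / true_size, 2)

# Rows of the false-negative table: (True Size, Mined Size, reported Recall)
rows = [(21361, 18351, 0.86), (12252, 10411, 0.85), (2359, 1739, 0.74)]
```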

12 Chernoff Bound
The Chernoff bound gives a probabilistic guarantee on the estimation of statistics about the underlying data:
Pr{T ≥ e·E[T]} ≤ e^(−E[T])
Example: pick a lottery number from 0000, 0001, …, 9999.
1,000,000 people each buy a $1 ticket.
E[#winners] = 100
Pr{T ≥ 273} ≤ e^(−100)
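The slide's lottery arithmetic can be checked directly against the quoted bound Pr{T ≥ e·E[T]} ≤ e^(−E[T]):

```python
from math import e, exp

# 10,000 equally likely numbers, 1,000,000 independent $1 tickets.
expected_winners = 1_000_000 / 10_000   # E[T] = 100
threshold = e * expected_winners        # e * E[T] ≈ 271.8; the slide rounds up to 273
bound = exp(-expected_winners)          # Pr{T >= e*E[T]} <= e^(-100), astronomically small
```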

13 Chernoff Bound
Bernoulli trials (coin flips):
–Pr[o_i = 1] = p, Pr[o_i = 0] = 1 − p
–r: # of heads in n coin flips
–np: expectation of r
For any γ > 0:
Pr{|r − np| ≥ npγ} ≤ 2e^(−npγ²/2)

14 Chernoff Bound
Divide by n: let r̄ = r/n, and take the minimum support s as p.
Replace sγ with ε, and set the right-hand side of the bound equal to δ:
–Pr{| RunningSupport − TrueSupport | ≥ ε_n} ≤ δ
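Solving the two-sided bound for ε with the right-hand side set to δ gives a running error bound of the form ε_n = sqrt(2·s·ln(2/δ)/n); the exact constant assumes the standard two-sided Chernoff form, so treat this as a sketch:

```python
from math import log, sqrt

def epsilon_n(s, delta, n):
    """Running error bound: set 2*exp(-n*eps^2 / (2*s)) = delta,
    then solve for eps.  Shrinks like 1/sqrt(n) as observations accumulate."""
    return sqrt(2 * s * log(2 / delta) / n)
```

This is what makes the approach false-negative with a vanishing error: as n grows, ε_n approaches zero, unlike the fixed ε of the false-positive schemes.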

15 Frequent or Infrequent
A pattern X is potentially infrequent in terms of n if count(X)/n < s − ε_n.
A pattern X is potentially frequent in terms of n if it is not potentially infrequent.
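The two tests translate directly from the definitions (ε_n is the running error bound; names are illustrative):

```python
def potentially_infrequent(count, n, s, eps_n):
    """X is potentially infrequent if its running support falls below s - eps_n."""
    return count / n < s - eps_n

def potentially_frequent(count, n, s, eps_n):
    """X is potentially frequent iff it is not potentially infrequent."""
    return not potentially_infrequent(count, n, s, eps_n)
```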

16 FDPM-1(s, δ)
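The original slide showed the FDPM-1 pseudocode as a figure, which did not survive transcription. A Python sketch reconstructed from the surrounding slides (the pruning test follows slide 15 and the output guarantee follows slide 18; the capacity parameter is an illustrative stand-in for the paper's derived memory bound):

```python
from math import log, sqrt

def fdpm_1(stream, s, delta, capacity=1000):
    """Sketch of FDPM-1: false-negative mining of frequent single items.

    Pruned items restart from zero if they reappear, so maintained counts
    never overestimate true counts.
    """
    counts, n = {}, 0
    for item in stream:
        n += 1
        counts[item] = counts.get(item, 0) + 1
        if len(counts) > capacity:
            # Memory full: recompute the running error bound eps_n and
            # delete potentially infrequent items (support < s - eps_n).
            eps_n = sqrt(2 * s * log(2 / delta) / n)
            counts = {i: c for i, c in counts.items() if c / n >= s - eps_n}
    # Report only items whose maintained count reaches s*n; since counts
    # never overestimate, no item with true frequency below s*n is output.
    return {i for i, c in counts.items() if c >= s * n}
```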

17 (Figure: running example of FDPM-1. Items A, B, C, D, … arrive from the source and per-item counts are maintained; when memory is full, a new ε_n is computed and potentially infrequent items are deleted.)

18 FDPM-1(s, δ)
The algorithm ensures:
–Every item whose true frequency exceeds sN is output with probability at least 1 − δ
–No item whose true frequency is less than sN is output
–The probability that an item's estimated support equals its true support is no less than 1 − δ

19 Memory Bound
Every retained pattern X satisfies Sup(X) ≥ (s − ε_n)·n.
For single items the retained counts sum to at most n, so
|P| ≤ n / ((s − ε_n)·n) = 1/(s − ε_n), when s − ε_n > 0.
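The bound is a one-line computation; a quick numeric check under assumed values of s and ε_n:

```python
def max_retained(s, eps_n):
    """Upper bound on retained single items: each needs count >= (s - eps_n)*n
    and the counts sum to at most n, hence |P| <= 1/(s - eps_n)."""
    assert s - eps_n > 0, "bound only holds when s - eps_n > 0"
    return 1 / (s - eps_n)
```

Because ε_n shrinks as n grows, the bound tightens toward 1/s over the life of the stream.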

20 FDPM-2(s, δ)
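The FDPM-2 pseudocode was also shown as a figure. The idea extends FDPM-1 from items to itemsets; a simplified sketch that counts each transaction's subsets and prunes with ε_n as before (the max_len cap is an illustrative simplification — the paper grows candidates Apriori-style rather than enumerating all subsets):

```python
from itertools import combinations
from math import log, sqrt

def fdpm_2(transactions, s, delta, capacity=5000, max_len=3):
    """Sketch of FDPM-2: false-negative mining of frequent itemsets."""
    counts, n = {}, 0
    for t in transactions:
        n += 1
        items = sorted(set(t))
        # Count every subset of the transaction up to max_len items.
        for k in range(1, min(max_len, len(items)) + 1):
            for sub in combinations(items, k):
                counts[sub] = counts.get(sub, 0) + 1
        if len(counts) > capacity:
            # Memory full: recompute eps_n, delete potentially infrequent itemsets.
            eps_n = sqrt(2 * s * log(2 / delta) / n)
            counts = {x: c for x, c in counts.items() if c / n >= s - eps_n}
    return {x for x, c in counts.items() if c >= s * n}
```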

21 Mining Frequent Itemsets from a Data Stream
(Figure: running example of FDPM-2. Transactions such as {A,B}, …, {E,F,G} arrive and counts of itemsets {A}, {B}, {AB}, {E}, {F}, {EF}, … are maintained; when memory is full, a new ε_n is computed and potentially infrequent itemsets such as {F} and {EF} are deleted.)

22 Conclusion
–False-negative approach
–Limited memory
–Error bound with some probability
