
1 Finding Recent Frequent Itemsets Adaptively over Online Data Streams
J. H. Chang and W. S. Lee, in Proc. of the 9th ACM International Conference on Knowledge Discovery and Data Mining, 2003.
Adviser: Jia-Ling Koh
Speaker: Shu-Ning Shin
Date: 2004.8.12

2 Introduction
This paper proposes a method of finding recent frequent itemsets:
– Significant itemsets are maintained in a prefix-tree lattice structure called a monitoring lattice.
– The old occurrence count of each itemset is decayed as time goes by.
– The number of significant itemsets is minimized by delayed-insertion and pruning operations.

3 Preliminaries (1)
A data stream can be defined as follows:
– $I = \{i_1, i_2, \ldots, i_n\}$: the set of current items.
– $e$: an itemset, i.e. a set of items.
– Tid: transaction id; transaction $T_k$ is generated at the $k$th turn.
– $D_k = \langle T_1, T_2, \ldots, T_k \rangle$: the current data stream when a new transaction $T_k$ is generated.
– $|D|_k$: the number of transactions in $D_k$.
– $C_k(e)$: the number of transactions in $D_k$ that contain the itemset $e$.
– $S_k(e)$: the support of itemset $e$ in $D_k$, i.e. $C_k(e)/|D|_k$.

4 Preliminaries (2)
Decay rate $d$: the rate at which a weight is reduced per decay-unit.
$d = b^{-(1/h)}$, with $b > 1$, $h \ge 1$, $b^{-1} \le d < 1$
– decay-unit: the chunk of information that is decayed together.
– decay-base $b$: the amount of weight reduction per decay-unit; it is greater than 1.
– decay-base-life $h$: the number of decay-units after which the current weight becomes $b^{-1}$.
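
The definition of the decay rate can be checked numerically. The following is a minimal sketch; the function name `decay_rate` and the sample values of `b` and `h` are chosen only for illustration and do not come from the slides.

```python
def decay_rate(b: float, h: float) -> float:
    """Return d = b ** (-1 / h) for decay-base b > 1 and decay-base-life h >= 1."""
    if b <= 1 or h < 1:
        raise ValueError("requires b > 1 and h >= 1")
    return b ** (-1.0 / h)


# Illustrative values (assumptions): halve a weight every 10,000 decay-units.
d = decay_rate(b=2, h=10000)
print(d)           # slightly below 1
print(d ** 10000)  # close to 1 / b, the weight left after h decay-units
```

With $h = 1$ the rate reaches its lower bound $b^{-1}$; a larger $h$ moves $d$ toward 1, so old information fades more slowly.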

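5 Preliminaries (3)
The total number of transactions $|D|_k$ in the current data stream $D_k$ is a decayed total, in which the weight of every earlier transaction is reduced by the decay rate $d$:
– The value of $|D|_k$ converges to $1/(1-d)$ as $k$ increases infinitely.
The count $C_k(e)$ of an itemset $e$ in the current data stream $D_k$ is decayed in the same way.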
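
The formulas on this slide did not survive in the transcript. The sketch below assumes the simplest recurrences consistent with the stated limit $1/(1-d)$: one transaction per decay-unit, $|D|_k = |D|_{k-1} \times d + 1$, and $C_k(e) = C_{k-1}(e) \times d + 1$ when $T_k$ contains $e$ (otherwise only the decay is applied). The function names are hypothetical.

```python
def update_total(prev_total: float, d: float) -> float:
    """Assumed recurrence: |D|_k = |D|_{k-1} * d + 1, starting from |D|_0 = 0."""
    return prev_total * d + 1.0


def update_count(prev_count: float, d: float, contains_e: bool) -> float:
    """Assumed recurrence: decay the old count, then add 1 if T_k contains e."""
    return prev_count * d + (1.0 if contains_e else 0.0)


d = 0.9  # illustrative decay rate
total = 0.0
for _ in range(500):
    total = update_total(total, d)

print(total)        # approaches the limit below
print(1 / (1 - d))  # the limit 1 / (1 - d)
```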

6 Count Estimation of an itemset (1)
The maximum possible count of an itemset is estimated as the minimum among the maximum possible counts of all of its subsets.

7 Count Estimation of an itemset (2)
Definition 1 (for an itemset $e$):
– the set of all subsets of $e$
– the set of $e$'s $m$-subsets
– the set of counts of $e$'s $m$-subsets
Definition 2 (for two itemsets $e_1$ and $e_2$):
– The union-itemset $e_1 \cup e_2$ is composed of all items that are members of either $e_1$ or $e_2$.
– The intersection-itemset $e_1 \cap e_2$ is composed of all items that are members of both $e_1$ and $e_2$.

8 Count Estimation of an itemset (3)
– Least exclusively distributed (LED): the items of an itemset appear together in as many transactions as possible.
– Most exclusively distributed (MED): the items of an itemset appear exclusively in as many transactions as possible.
The maximum count $C^{max}(e)$ of an $n$-itemset $e$ corresponds to the LED case: it is the minimum count among the $(n-1)$-subsets of $e$.

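9 Count Estimation of an itemset (4)
For two itemsets $e_1$ and $e_2$, the minimum count of their union-itemset is estimated from the counts of $e_1$, $e_2$ and their intersection-itemset.
The minimum count $C^{min}(e)$ of an $n$-itemset $e$ can be estimated from the unions of its $(n-1)$-subsets.
Estimation error:
– $E(e) = C^{max}(e) - C^{min}(e)$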
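
The estimation formulas themselves are missing from the transcript, so the following is a sketch under stated assumptions rather than the paper's exact definitions. The maximum is taken as the smallest count among the $(n-1)$-subsets, and the minimum uses the inclusion-exclusion bound $C(e_1 \cup e_2) \ge C(e_1) + C(e_2) - C(e_1 \cap e_2)$, with the total number of transactions standing in for the count of an empty intersection. The function names and the sample counts are hypothetical.

```python
from itertools import combinations


def subsets_of_size(e, m):
    """All m-subsets of itemset e, as frozensets."""
    return [frozenset(s) for s in combinations(sorted(e), m)]


def c_max(e, counts):
    """Maximum possible count: the smallest count among the (n-1)-subsets of e."""
    return min(counts[s] for s in subsets_of_size(e, len(e) - 1))


def c_min_union(e1, e2, counts, total):
    """Inclusion-exclusion lower bound on the count of the union-itemset."""
    inter = e1 & e2
    # Every transaction contains the empty itemset, so its count is the total.
    inter_count = counts[inter] if inter else total
    return max(0.0, counts[e1] + counts[e2] - inter_count)


def c_min(e, counts, total):
    """Best lower bound over all pairs of distinct (n-1)-subsets of e."""
    subs = subsets_of_size(e, len(e) - 1)
    return max(c_min_union(a, b, counts, total) for a, b in combinations(subs, 2))


def estimation_error(e, counts, total):
    """E(e) = C_max(e) - C_min(e)."""
    return c_max(e, counts) - c_min(e, counts, total)


# Made-up counts over 10 transactions, for illustration only.
counts = {frozenset(k): v for k, v in
          {"a": 6, "b": 5, "c": 4, "ab": 4, "ac": 3, "bc": 3}.items()}
abc = frozenset("abc")
print(c_max(abc, counts), c_min(abc, counts, 10), estimation_error(abc, counts, 10))
```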

10 estDec Method (1)
Every node in a monitoring lattice maintains a triple (cnt, err, MRtid) for its corresponding itemset $e$:
– cnt: the count of $e$.
– err: the maximum error count of $e$.
– MRtid: the id of the most recent transaction that contains $e$.

11 estDec Method (2)
The estDec method is composed of four phases:
– Phase I: parameter updating phase
– Phase II: count updating phase
– Phase III: delayed-insertion phase
– Phase IV: frequent itemset selection phase

12 estDec Method (3)
Phase I: $|D|_k$ is updated.
Phase II: the counts of those itemsets in the monitoring lattice (ML) that appear in $T_k$ are updated.
– $S_{prn}$: the threshold for pruning.
– If a 1-itemset is pruned from ML, it is impossible to estimate its count later.
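
Phases I and II can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes a node's cnt and err are stored as of MRtid and decayed by $d^{(k - MRtid)}$ only when the node is touched (consistent with the Phase IV support formula), and that a node becomes a pruning candidate when its updated support drops below $S_{prn}$. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    cnt: float   # decayed count of the itemset, valid as of transaction mrtid
    err: float   # maximum error count, valid as of transaction mrtid
    mrtid: int   # id of the most recent transaction that contained the itemset


def update_total(prev_total: float, d: float) -> float:
    """Phase I (assumed form): |D|_k = |D|_{k-1} * d + 1."""
    return prev_total * d + 1.0


def update_node(node: Node, k: int, d: float) -> None:
    """Phase II (assumed form): decay lazily, then count the occurrence in T_k."""
    factor = d ** (k - node.mrtid)
    node.cnt = node.cnt * factor + 1.0
    node.err = node.err * factor
    node.mrtid = k


def below_pruning_threshold(node: Node, total: float, s_prn: float) -> bool:
    """True if the updated support cnt / |D|_k is below S_prn."""
    return node.cnt / total < s_prn


node = Node(cnt=2.0, err=1.0, mrtid=3)
update_node(node, k=5, d=0.5)
print(node)
```

Storing MRtid lets the update skip every node that does not appear in $T_k$; the decay for the skipped transactions is applied in one step the next time the node is visited.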

13 estDec Method (4)
Phase III: find new itemsets that have a high possibility of becoming frequent. A new itemset is inserted into ML in two cases:
– A new 1-itemset: the cnt of a 1-itemset is its actual count.
– An itemset $e$ with $C^{max}(e)/|D|_k \ge S_{ins}$, where $S_{ins}$ is the threshold for delayed-insertion.
The upper bound $C^{upper}(e)$ on the count of such an itemset $e$:
– $cnt\_for\_subsets = (1 - d^{|e|-1})/(1 - d)$
– $max\_cnt\_before\_subsets = S_{ins} \times |D|_{k-(|e|-1)} \times d^{|e|-1}$
– $C^{upper}(e) = max\_cnt\_before\_subsets + cnt\_for\_subsets$
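
The upper-bound formulas of Phase III translate directly into code. In this sketch the closed form used for $|D|_k$ is an assumption (one transaction per decay-unit, so $|D|_k = (1 - d^k)/(1 - d)$), and the function names and sample values are illustrative.

```python
def decayed_total(k: int, d: float) -> float:
    """Assumed closed form of |D|_k: 1 + d + ... + d**(k-1)."""
    return (1 - d ** k) / (1 - d)


def c_upper(size: int, k: int, d: float, s_ins: float) -> float:
    """Upper bound on the count of an itemset with `size` items at transaction k."""
    m = size - 1
    # Largest count the itemset can have gained during the last m transactions.
    cnt_for_subsets = (1 - d ** m) / (1 - d)
    # Largest count it could have had before that, decayed over m transactions.
    max_cnt_before_subsets = s_ins * decayed_total(k - m, d) * d ** m
    return max_cnt_before_subsets + cnt_for_subsets


print(c_upper(size=3, k=10, d=0.5, s_ins=0.1))
```

For a 1-itemset the exponent $|e|-1$ is zero, so the bound reduces to $S_{ins} \times |D|_k$.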

14 estDec Method (5)
Phase IV: produces all current frequent itemsets in ML.
– An itemset $e$ is frequent if its current support $(cnt \times d^{(k - MRtid)})/|D|_k$ is greater than $S_{min}$.
– Its current support error is $(err \times d^{(k - MRtid)})/|D|_k$.
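
The selection rule of Phase IV can be sketched as below. For brevity the monitoring lattice is represented as a flat dictionary from itemsets to (cnt, err, MRtid) triples, which is a simplification of the prefix-tree structure; the function name and sample values are hypothetical.

```python
def select_frequent(nodes, k, d, total, s_min):
    """Return {itemset: (current support, current support error)} for frequent itemsets."""
    result = {}
    for itemset, (cnt, err, mrtid) in nodes.items():
        # Bring the stored values up to transaction k, then normalise by |D|_k.
        factor = d ** (k - mrtid) / total
        support = cnt * factor
        if support > s_min:
            result[itemset] = (support, err * factor)
    return result


# Illustrative lattice contents (assumptions, not data from the paper).
nodes = {
    frozenset("a"): (1.5, 0.0, 10),
    frozenset("ab"): (1.0, 0.5, 8),
}
print(select_frequent(nodes, k=10, d=0.5, total=2.0, s_min=0.2))
```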

15 estDec Method (6)
Force-pruning operation:
– All insignificant itemsets in ML are pruned.
– It is performed when the current size of ML reaches a threshold.
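
A force-pruning pass might look like the sketch below. It rests on two assumptions that the slide does not spell out: an itemset is treated as insignificant when its current support is below $S_{prn}$, and 1-itemsets are always kept because their counts cannot be estimated once they are pruned. The lattice is again flattened to a dictionary, so the removal of a pruned node's descendants is not modelled; all names are hypothetical.

```python
def force_prune(nodes, k, d, total, s_prn):
    """Return a copy of `nodes` without insignificant itemsets (1-itemsets are kept)."""
    kept = {}
    for itemset, (cnt, err, mrtid) in nodes.items():
        support = cnt * d ** (k - mrtid) / total
        if len(itemset) == 1 or support >= s_prn:
            kept[itemset] = (cnt, err, mrtid)
    return kept


def maybe_force_prune(nodes, k, d, total, s_prn, max_size):
    """Run force-pruning only when the lattice has reached the size threshold."""
    if len(nodes) >= max_size:
        return force_prune(nodes, k, d, total, s_prn)
    return nodes


# Illustrative lattice contents (assumptions, not data from the paper).
nodes = {
    frozenset("a"): (1.5, 0.0, 10),
    frozenset("b"): (0.1, 0.0, 2),
    frozenset("ab"): (1.0, 0.5, 8),
}
print(sorted("".join(sorted(s)) for s in force_prune(nodes, 10, 0.5, 2.0, 0.2)))
```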

16 Experimental (1)
Performance of the estDec method for the data set T10.I4.D1000K:
– $S_{ins}$ is denoted as $p\%$; its actual value is $S_{min} \times p\%$.
– The force-pruning operation is performed every 1,000 transactions.
– (a) memory usage (b) processing time of Phases I~III (c) processing time of Phase IV

17 Experimental (2)
Accuracy of the mining result:
– Average support error $ASE(R_{estDec} | R_{dApriori})$

18 Experimental (3)
The adaptability of the estDec method to changes of information in a data stream:
– Coverage rate $CR(X)$
– $|R|$: the total number of frequent itemsets in ML

