1 Finding Recent Frequent Itemsets Adaptively over Online Data Streams
J. H. Chang and W. S. Lee, in Proc. of the 9th ACM International Conference on Knowledge Discovery and Data Mining
Adviser: Jia-Ling Koh
Speaker: Shu-Ning Shin
Date:
2 Introduction
This paper proposes a method of finding recent frequent itemsets over an online data stream:
–Significant itemsets are maintained in a prefix-tree lattice structure called a monitoring lattice.
–The old occurrence count of each itemset is decayed as time goes by.
–The number of significant itemsets is minimized by delayed-insertion and pruning operations.
3 Preliminaries (1)
A data stream is defined as follows:
–$I = \{i_1, i_2, \ldots, i_n\}$: the set of current items.
–$e$: an itemset, i.e. a set of items.
–$T_k$: the transaction generated at the $k$th turn; $k$ is its transaction id (Tid).
–$D_k = \langle T_1, T_2, \ldots, T_k \rangle$: the current data stream when a new transaction $T_k$ is generated.
–$|D|_k$: the number of transactions in $D_k$.
–$C_k(e)$: the number of transactions in $D_k$ that contain the itemset $e$.
–$S_k(e)$: the support of itemset $e$ in $D_k$, i.e. $S_k(e) = C_k(e) / |D|_k$.
4 Preliminaries (2)
Decay rate: the reducing rate of a weight for a fixed decay-unit, $d = b^{-(1/h)}$ with $b > 1$, $h \ge 1$, $b^{-1} \le d < 1$.
–decay-unit: the chunk of information that is decayed together.
–decay-base $b$: determines the amount of weight reduction per decay-unit; it is greater than 1.
–decay-base-life $h$: the number of decay-units after which the current weight becomes $b^{-1}$.
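The relation between $b$, $h$ and $d$ can be checked numerically. The following is a minimal sketch; the function name `decay_rate` and the sample values `b=2`, `h=1000` are illustrative assumptions, not values taken from the slides.

```python
def decay_rate(b: float, h: float) -> float:
    """Return d = b ** (-1 / h) for a decay-base b > 1 and decay-base-life h >= 1."""
    if b <= 1 or h < 1:
        raise ValueError("requires b > 1 and h >= 1")
    return b ** (-1.0 / h)


# Illustrative values (assumed): the weight halves every 1000 decay-units.
d = decay_rate(b=2, h=1000)

# After h decay-units a weight of 1 has shrunk to d ** h, which equals 1 / b.
weight_after_h = d ** 1000
```

With $h = 1$ the rate reaches its lower bound $d = b^{-1}$; larger $h$ pushes $d$ toward 1, so old transactions are forgotten more slowly.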
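5 Preliminaries (3)
The total number of transactions $|D|_k$ in the current data stream $D_k$, with every older transaction decayed once per decay-unit:
$|D|_k = |D|_{k-1} \cdot d + 1$ for $k \ge 2$, with $|D|_1 = 1$.
–The value of $|D|_k$ converges to $1/(1-d)$ as $k$ increases infinitely.
The count $C_k(e)$ of an itemset $e$ in the current data stream $D_k$:
$C_k(e) = C_{k-1}(e) \cdot d + W_k(e)$, where $W_k(e) = 1$ if $T_k$ contains $e$ and $0$ otherwise.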
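The convergence of $|D|_k$ to $1/(1-d)$ can be observed with a short simulation. This sketch assumes one transaction per decay-unit and the recurrences $|D|_k = |D|_{k-1} \cdot d + 1$ and $C_k(e) = C_{k-1}(e) \cdot d + W_k(e)$; the function names and the value `d = 0.99` are hypothetical.

```python
def update_total(total_prev: float, d: float) -> float:
    """Decay the previous total by d and add the new transaction."""
    return total_prev * d + 1.0


def update_count(count_prev: float, d: float, contains: bool) -> float:
    """Decay the previous count by d and add 1 if the new transaction contains the itemset."""
    return count_prev * d + (1.0 if contains else 0.0)


d = 0.99          # assumed decay rate, for illustration only
total = 0.0       # no transactions seen yet
for _ in range(5000):
    total = update_total(total, d)

limit = 1.0 / (1.0 - d)   # the value the slide says |D|_k converges to
```

Because the total is bounded by `limit`, a support threshold corresponds to a bounded decayed count no matter how long the stream runs.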
6 Count Estimation of an itemset (1)
The maximum possible count of an itemset is estimated by the minimum value among the maximum possible counts of all of its subsets, since an itemset cannot occur more often than any of its subsets.
7 Count Estimation of an itemset (2)
Definition 1 names three sets for an itemset $e$:
–the set of $e$'s subsets
–the set of $e$'s $m$-subsets
–the set of counts of $e$'s $m$-subsets
Definition 2, for two itemsets $e_1$ and $e_2$:
–The union-itemset $e_1 \cup e_2$ is composed of all items that are members of either $e_1$ or $e_2$.
–The intersection-itemset $e_1 \cap e_2$ is composed of all items that are members of both $e_1$ and $e_2$.
8 Count Estimation of an itemset (3)
–Least exclusively distributed (LED): the items of an itemset appear together in as many transactions as possible.
–Most exclusively distributed (MED): the items of an itemset appear exclusively in as many transactions as possible.
The maximum count of an $n$-itemset $e$ is the smallest count among its $(n-1)$-subsets:
$C_{max}(e) = \min \{ C_k(\alpha) : \alpha \subset e,\ |\alpha| = n-1 \}$
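9 Count Estimation of an itemset (4)
For two itemsets $e_1$ and $e_2$, inclusion-exclusion bounds the count of their union-itemset from below:
$C_{min}(e_1 \cup e_2) = \max(0,\ C_k(e_1) + C_k(e_2) - C_k(e_1 \cap e_2))$, where the count of an empty intersection is taken as $|D|_k$.
The minimum count $C_{min}(e)$ of an $n$-itemset $e$ can be estimated from the unions of its $(n-1)$-subsets:
$C_{min}(e) = \max \{ C_{min}(\alpha_i \cup \alpha_j) : \alpha_i, \alpha_j \text{ distinct } (n-1)\text{-subsets of } e \}$
Estimation error:
–$E(e) = C_{max}(e) - C_{min}(e)$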
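The two estimates can be sketched on a small example. The code assumes that $C_{max}(e)$ is the smallest count among the $(n-1)$-subsets of $e$ and that $C_{min}(e)$ is the largest inclusion-exclusion lower bound over pairs of $(n-1)$-subsets, with an empty intersection counted as $|D|_k$. The names `c_max`, `c_min` and the sample counts are hypothetical.

```python
from itertools import combinations


def c_max(e, counts):
    """Upper estimate: the smallest count among the (n-1)-subsets of e (needs len(e) >= 2)."""
    n = len(e)
    return min(counts[frozenset(s)] for s in combinations(sorted(e), n - 1))


def c_min(e, counts, total):
    """Lower estimate: best inclusion-exclusion bound over pairs of (n-1)-subsets."""
    n = len(e)
    subsets = [frozenset(s) for s in combinations(sorted(e), n - 1)]
    best = 0
    for a, b in combinations(subsets, 2):
        inter = a & b
        # An empty intersection is contained in every transaction.
        c_inter = counts[inter] if inter else total
        best = max(best, counts[a] + counts[b] - c_inter)
    return best


# Hypothetical counts over a stream of 10 transactions.
counts = {
    frozenset("a"): 6, frozenset("b"): 5, frozenset("c"): 4,
    frozenset("ab"): 4, frozenset("ac"): 3, frozenset("bc"): 3,
}
e = frozenset("abc")
upper = c_max(e, counts)              # min(4, 3, 3)
lower = c_min(e, counts, total=10)    # max(4+3-6, 4+3-5, 3+3-4)
error = upper - lower                 # E(e)
```

A small `error` means the subsets already pin down the count of `e` well, so little is lost by not having tracked `e` from the start.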
10 estDec Method (1)
Every node in a monitoring lattice maintains a triple (cnt, err, MRtid) for its corresponding itemset $e$:
–cnt: the count of $e$.
–err: the maximum error count of $e$.
–MRtid: the id of the most recent transaction that contains $e$.
11 estDec Method (2)
The estDec method is composed of four phases:
–Phase Ⅰ: parameter updating phase
–Phase Ⅱ: count updating phase
–Phase Ⅲ: delayed-insertion phase
–Phase Ⅳ: frequent itemset selection phase
12 estDec Method (3)
Phase I: $|D|_k$ is updated.
Phase II: the counts of those itemsets in the monitoring lattice (ML) that appear in $T_k$ are updated.
–$S_{prn}$: the threshold for pruning.
–If a 1-itemset is pruned from ML, it is impossible to estimate its count later.
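Phases I and II can be sketched for a single transaction as follows. This is a simplification under stated assumptions: ML is modelled as a flat dict from `frozenset` to a `(cnt, err, MRtid)` tuple rather than a prefix-tree lattice, counts are decayed lazily by $d^{k - MRtid}$ when an itemset is next seen, and an updated itemset of size 2 or more is pruned when its support falls below $S_{prn}$. The function name and sample values are hypothetical.

```python
def process_transaction(ml, transaction, k, total_prev, d, s_prn):
    """Run Phases I and II for transaction T_k and return the updated |D|_k."""
    # Phase I: decay the old total and count the new transaction.
    total = total_prev * d + 1.0

    # Phase II: update every monitored itemset contained in T_k.
    for itemset in list(ml):
        if not itemset <= transaction:
            continue
        cnt, err, mrtid = ml[itemset]
        decay = d ** (k - mrtid)          # decay-units since the last update
        cnt = cnt * decay + 1.0
        err = err * decay
        # 1-itemsets are never pruned: their counts could not be estimated later.
        if len(itemset) > 1 and cnt / total < s_prn:
            del ml[itemset]
        else:
            ml[itemset] = (cnt, err, k)
    return total


# Hypothetical state after two transactions, with d = 0.5.
ml = {frozenset("a"): (1.0, 0.0, 1), frozenset("ab"): (1.0, 0.0, 1)}
total = process_transaction(ml, frozenset("ab"), k=3, total_prev=1.5, d=0.5, s_prn=0.8)
```

Storing MRtid lets the update touch only itemsets that occur in the current transaction; all others keep a stale count that is decayed on demand.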
13 estDec Method (4)
Phase III: find each new itemset that has a high possibility of becoming frequent. An itemset is inserted into ML in two cases:
–A new 1-itemset; the cnt of a 1-itemset is its actual count.
–An itemset $e$ with $C_{max}(e) / |D|_k \ge S_{ins}$, where $S_{ins}$ is the threshold for delayed-insertion.
The upper bound $C_{upper}(e)$ of the count is:
–cnt_for_subsets $= (1 - d^{|e|-1}) / (1 - d)$
–max_cnt_before_subsets $= S_{ins} \times |D|_{k-(|e|-1)} \times d^{|e|-1}$
–$C_{upper}(e) =$ max_cnt_before_subsets $+$ cnt_for_subsets
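The upper bound can be evaluated directly from the three expressions on this slide. The sketch below assumes the closed form $|D|_k = (1 - d^k)/(1 - d)$, which holds when one transaction arrives per decay-unit and $|D|_1 = 1$; the function names and sample arguments are hypothetical.

```python
def total_at(k: int, d: float) -> float:
    """Closed form of |D|_k assuming one transaction per decay-unit (assumption)."""
    return (1.0 - d ** k) / (1.0 - d)


def c_upper(size: int, k: int, d: float, s_ins: float) -> float:
    """Upper bound on the count of an itemset of the given size at transaction k."""
    m = size - 1
    # Largest count gained in the last m transactions, after decay.
    cnt_for_subsets = (1.0 - d ** m) / (1.0 - d)
    # Largest count held before those transactions (support below S_ins), decayed m times.
    max_cnt_before_subsets = s_ins * total_at(k - m, d) * d ** m
    return max_cnt_before_subsets + cnt_for_subsets


bound = c_upper(size=3, k=5, d=0.5, s_ins=0.1)
```

One reading of the two terms: inserting the $(|e|-1)$-subsets takes at least $|e|-1$ transactions, so the first term covers those transactions and the second covers everything earlier.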
14 estDec Method (5)
Phase IV: produces all current frequent itemsets in ML.
–An itemset $e$ is frequent if its current support $(cnt \times d^{\,k - MRtid}) / |D|_k$ is greater than $S_{min}$.
–Its current support error is $(err \times d^{\,k - MRtid}) / |D|_k$.
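The selection rule of Phase IV can be sketched as a single pass over ML. As an assumption for illustration, ML is a flat dict from `frozenset` to a `(cnt, err, MRtid)` tuple; the function name and the sample entries are hypothetical.

```python
def frequent_itemsets(ml, k, total, d, s_min):
    """Return {itemset: (support, support_error)} for itemsets with support > s_min."""
    result = {}
    for itemset, (cnt, err, mrtid) in ml.items():
        decay = d ** (k - mrtid)      # bring the stored values up to transaction k
        support = cnt * decay / total
        if support > s_min:
            result[itemset] = (support, err * decay / total)
    return result


# Hypothetical lattice content at k = 4 with d = 0.5, so |D|_4 = 1.875.
ml = {
    frozenset("a"): (1.5, 0.0, 4),
    frozenset("b"): (1.0, 0.0, 2),
    frozenset("ab"): (1.0, 0.5, 3),
}
selected = frequent_itemsets(ml, k=4, total=1.875, d=0.5, s_min=0.5)
```

The stored values are not modified here; the decay factor is applied only to compute the support at the moment of selection.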
15 estDec Method (6)
Force-pruning operation:
–All insignificant itemsets in ML can be pruned.
–It is performed when the current size of ML reaches a threshold.
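A force-pruning pass can be sketched as a full scan of ML. The sketch assumes that "insignificant" means a current support below the pruning threshold $S_{prn}$, that 1-itemsets are always kept, and that ML is a flat dict from `frozenset` to a `(cnt, err, MRtid)` tuple; all names and values are hypothetical.

```python
def force_prune(ml, k, total, d, s_prn):
    """Remove every itemset of size >= 2 whose current support is below s_prn."""
    for itemset in list(ml):
        cnt, _err, mrtid = ml[itemset]
        support = cnt * d ** (k - mrtid) / total
        if len(itemset) > 1 and support < s_prn:
            del ml[itemset]


# Hypothetical lattice content at k = 4 with d = 0.5, so |D|_4 = 1.875.
ml = {
    frozenset("a"): (0.5, 0.0, 1),    # stale, but a 1-itemset is kept
    frozenset("ab"): (1.0, 0.0, 2),   # support about 0.13, pruned
    frozenset("bc"): (1.5, 0.0, 4),   # support 0.8, kept
}
force_prune(ml, k=4, total=1.875, d=0.5, s_prn=0.2)
remaining = sorted("".join(sorted(s)) for s in ml)
```

A full scan is needed because itemsets that stopped appearing are never visited by the per-transaction update, so their stale entries would otherwise stay in memory.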
16 Experimental (1)
Performance of the estDec method for the data set T10.I4.D1000K
–$S_{ins}$ is denoted by p%; the actual value is $S_{min} \times p\%$.
–The force-pruning operation is performed every 1,000 transactions.
–Figure panels: (a) memory usage, (b) processing time of Phases I~III, (c) processing time of Phase IV
17 Experimental (2)
Accuracy of the mining result
–Average support error $ASE(R_{estDec} \mid R_{dApriori})$
18 Experimental (3)
The adaptability of the estDec method to changes of information in a data stream.
–Coverage rate $CR(X)$
–$|R|$: the total number of frequent itemsets in ML