Maintaining Frequent Itemsets over High-Speed Data Streams. James Cheng, Yiping Ke, and Wilfred Ng. Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2006). Advisor: Jia-Ling Koh. Speaker: Chun-Wei Hsieh. 05/26/06
Introduction: Existing approximation techniques for mining frequent itemsets are mainly false-positive and use an error parameter ε, where ε = γ × min_sup and 0 ≤ γ ≤ 1. A smaller ε means a larger number of itemsets must be maintained; a larger ε means lower accuracy.
Introduction: MineSW uses a progressively increasing ε. It is a false-negative approach, it mines recent data (a sliding window), and it works on batches, not on individual transactions.
Introduction: Example with w_size = 2 and min_sup σ = 3/5: sup(bc, W1) = 4 and sup(bc, W2) = 2; the set of FIs over W1 is {b, c, bc}; the set of FIs over W2 is {b, c, d, bd}.
Preliminaries: Notation is introduced for a computed support; a time interval; the computed support of an itemset X over a time interval T; and the number of transactions that arrive in a time interval T.
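MST Function: The support required of an itemset increases progressively as the itemset stays longer in a window. k: the number of time units an itemset has stayed in the window. MST function: minsup(k) = ⌈m_k × r_k⌉, where m_k = σ × (number of transactions in k time units) and r_k = [(1 − r)/w](k − 1) + r.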
MST Function
Example: let σ = 0.01, r = 0.1, and w = 10, with 2000 transactions in each time unit.
r1 = [(1-0.1)/10](1-1)+0.1 = 0.1      m1 = 0.01*2000*1 = 20
r2 = [(1-0.1)/10](2-1)+0.1 = 0.19     m2 = 0.01*2000*2 = 40
r3 = [(1-0.1)/10](3-1)+0.1 = 0.28     m3 = 0.01*2000*3 = 60
r4 = [(1-0.1)/10](4-1)+0.1 = 0.37     m4 = 0.01*2000*4 = 80
r5 = [(1-0.1)/10](5-1)+0.1 = 0.46     m5 = 0.01*2000*5 = 100
r6 = [(1-0.1)/10](6-1)+0.1 = 0.55     m6 = 0.01*2000*6 = 120
r7 = [(1-0.1)/10](7-1)+0.1 = 0.64     m7 = 0.01*2000*7 = 140
r8 = [(1-0.1)/10](8-1)+0.1 = 0.73     m8 = 0.01*2000*8 = 160
r9 = [(1-0.1)/10](9-1)+0.1 = 0.82     m9 = 0.01*2000*9 = 180
r10 = [(1-0.1)/10](10-1)+0.1 = 0.91   m10 = 0.01*2000*10 = 200
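The r_k and m_k values in this example can be reproduced with a short script. The sketch below assumes the threshold is obtained as minsup(k) = ⌈m_k × r_k⌉, which matches the values minsup(1) = 2 and minsup(2) = 8 quoted on the next slide; the function and constant names are illustrative and not taken from the paper. Exact fractions are used so that the ceiling is not affected by floating-point rounding.

```python
from fractions import Fraction
from math import ceil

def relaxation(k, r, w):
    """r_k = [(1 - r) / w] * (k - 1) + r, as in the slide's example."""
    return (1 - r) / w * (k - 1) + r

def m(k, sigma, unit_size):
    """m_k = sigma * (transactions per time unit) * k."""
    return sigma * unit_size * k

def minsup(k, sigma, r, w, unit_size):
    """Assumed threshold: ceil(m_k * r_k)."""
    return ceil(m(k, sigma, unit_size) * relaxation(k, r, w))

# Parameters from the slide: sigma = 0.01, r = 0.1, w = 10, 2000 transactions per unit.
SIGMA = Fraction(1, 100)
R = Fraction(1, 10)
W = 10
UNIT = 2000

for k in range(1, W + 1):
    r_k = relaxation(k, R, W)
    m_k = m(k, SIGMA, UNIT)
    print(k, float(r_k), int(m_k), minsup(k, SIGMA, R, W, UNIT))
```

The last printed column is the support an itemset must reach after staying k time units in the window; it grows from a small fraction of σ toward the full threshold as k approaches w.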
MST Function
With Lossy Counting (ε = 20), ab and cd are retained in the windows.
With MineSW, the computed support of ab:
t1: 3, sup(ab) = 3 > minsup(1) = 2
t2: 0, sup(ab) = 3 < minsup(2) = 8
...
t7: 4, sup(ab) = 4 > minsup(1) = 2
t8: 7, sup(ab) = 11 > minsup(2) = 8
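The trace for ab can be replayed with a small simulation. This is a simplified sketch, not the MineSW procedure itself: it assumes an itemset is kept while its accumulated support is at least minsup(k) (the slide only shows strict comparisons), that counting restarts from k = 1 once an itemset has been dropped, and that window expiry can be ignored. The helper names are hypothetical, and the parameters are those of the earlier example (σ = 0.01, r = 0.1, w = 10, 2000 transactions per time unit).

```python
from fractions import Fraction
from math import ceil

def minsup(k, sigma=Fraction(1, 100), r=Fraction(1, 10), w=10, unit_size=2000):
    """Assumed threshold ceil(m_k * r_k) with m_k = sigma * unit_size * k."""
    r_k = (1 - r) / w * (k - 1) + r
    return ceil(sigma * unit_size * k * r_k)

def trace(per_unit_supports):
    """Accumulate support unit by unit; stop at the first unit that fails.

    Returns a list of (k, computed_support, retained) tuples.
    """
    steps = []
    total = 0
    for k, support in enumerate(per_unit_supports, start=1):
        total += support
        retained = total >= minsup(k)  # assumption: ties count as retained
        steps.append((k, total, retained))
        if not retained:
            break  # the itemset is pruned; later units would start a new trace
    return steps

# t1 = 3, t2 = 0: ab passes minsup(1) = 2 but fails minsup(2) = 8.
print(trace([3, 0]))
# t7 = 4, t8 = 7: ab re-enters, then reaches 11 > minsup(2) = 8.
print(trace([4, 7]))
```

Only the supports stated on the slide (t1, t2, t7, t8) are used; the elided time units are not filled in.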
MineSW Algorithm: Mine FIs from each batch using the relaxed threshold γσ. Use a prefix tree to keep the FIs and semi-FIs of the window. Each node in the prefix tree stores: item, uid(X), sup(X).
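A minimal sketch of such a prefix tree node follows. The three fields come from the slide; everything else is an assumption made for illustration: uid(X) is taken to be the ID of the time unit in which X was inserted, sup(X) is taken to be the computed support, and the `children` map and the `insert` and `find` helpers are hypothetical. The supports given to a and b are made-up values; those for ab mirror the t7 and t8 figures from the previous slide.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional

@dataclass
class Node:
    item: Optional[str] = None   # None for the root
    uid: int = 0                 # assumed: time unit in which the itemset was inserted
    sup: int = 0                 # assumed: computed support of the itemset
    children: Dict[str, "Node"] = field(default_factory=dict)

def find(root: Node, itemset: Iterable[str]) -> Optional[Node]:
    """Follow the sorted items from the root; return the node or None."""
    node = root
    for item in sorted(itemset):
        node = node.children.get(item)
        if node is None:
            return None
    return node

def insert(root: Node, itemset: Iterable[str], uid: int, sup: int) -> Node:
    """Create any missing nodes along the path and set uid and sup on the last one."""
    node = root
    for item in sorted(itemset):
        if item not in node.children:
            node.children[item] = Node(item=item, uid=uid)
        node = node.children[item]
    node.uid, node.sup = uid, sup
    return node

root = Node()
insert(root, ("a",), uid=7, sup=9)       # hypothetical support
insert(root, ("b",), uid=7, sup=6)       # hypothetical support
insert(root, ("a", "b"), uid=7, sup=4)   # ab enters at t7 with support 4
find(root, ("a", "b")).sup += 7          # t8 adds 7, giving 11
print(find(root, ("a", "b")))
```

Each itemset X corresponds to the path from the root to its node, so itemsets sharing a prefix share nodes.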
MineSW Algorithm When the first window is not full:
MineSW Algorithm: Processing the expiring time unit
MineSW Algorithm: Processing the new time unit
MineSW Algorithm Pruning and Outputting
Approximation Quality: The error bound of the computed support of a semi-frequent itemset X over T k: The set of false-negatives is:
Experiments: Compared with LCSW. Machine: 900 MHz CPU, 4 GB RAM. Data streams: t10i4 and t15i6, where t is the average size of a transaction and i is the average size of a maximal frequent itemset. Stream: 3M transactions. W_Size: 20 time units. 1 time unit: 50K transactions.