1 Online Mining (Recently) Maximal Frequent Itemsets over Data Streams
  Hua-Fu Li, Suh-Yin Lee, Man Kwan Shan
  RIDE-SDMA '05
  Speaker: 董原賓
  Advisor: 柯佳伶
2 Introduction
  Difficulties of data stream mining
    Huge
    High speed
    Continuous
  Solution: a one-pass algorithm
    Summary data structure
    Mines the maximal frequent itemsets
3 Definition
  Ψ = {i_1, i_2, …, i_n}: a set of items
  W_i: basic window i
  Data stream = [W_1, W_2, …, W_N): an infinite sequence of basic windows
  N: the window identifier of the latest basic window
  Current length of the data stream: CL = |W_1| + |W_2| + … + |W_N|
  Example (three transactions per basic window, so CL = 3 × N):
    W_1: abc, bcd, acd
    W_2: cd, abd, bc
    ···
    W_N: a, b, cd
    (time runs from W_1 to W_N)
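The CL definition above can be checked with a few lines of Python. This is only an illustrative sketch: the name `current_length` and the choice to write each transaction as a string of item letters are assumptions, not part of the slides.

```python
from typing import List

# Assumption: a transaction is a string of single-letter items,
# and a basic window is a list of such transactions.
Window = List[str]

def current_length(windows: List[Window]) -> int:
    """CL = |W_1| + |W_2| + ... + |W_N|: total number of transactions seen."""
    return sum(len(w) for w in windows)

# The three basic windows drawn on the slide.
windows = [["abc", "bcd", "acd"], ["cd", "abd", "bc"], ["a", "b", "cd"]]
cl = current_length(windows)  # 3 windows of 3 transactions each
```

With three transactions per window, `cl` equals 3 × N for N = 3.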
4 Definition
  X.tsup: true support of itemset X
  X.esup: estimated support of itemset X, 1 ≤ X.esup ≤ X.tsup
  X.CL = |W_j| + |W_{j+1}| + … + |W_N|
    W_j: the first window containing X in the summary data structure
  s: minimum support threshold
  ε: maximum support error threshold
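As a rough illustration of how X.CL and the two thresholds interact, the sketch below computes X.CL from a list of window sizes and applies the two comparisons the talk uses (esup < ε × CL for pruning, esup ≥ s × CL for frequency). The function names and the list-of-sizes representation are hypothetical.

```python
def itemset_cl(window_sizes, first_window):
    """X.CL = |W_j| + ... + |W_N|, where j = first_window (1-based)."""
    return sum(window_sizes[first_window - 1:])

def is_prunable(esup, cl, eps):
    """True when the estimated support falls below the error threshold."""
    return esup < eps * cl

def is_frequent(esup, cl, s):
    """True when the estimated support reaches the minimum support."""
    return esup >= s * cl

# Assumed setting: three windows of three transactions; X first seen in W_2.
window_sizes = [3, 3, 3]
cl_x = itemset_cl(window_sizes, first_window=2)
```

Here `cl_x` covers only W_2 and W_3, so an itemset that enters the summary late is judged against the shorter stream it could have appeared in.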
5 Data Stream Mining for Maximal Frequent Itemsets (DSM-MFI)
  Step 1: read a window of transactions
  Step 2: construct and maintain the summary data structure
  Step 3: prune the infrequent information
  Step 4: search for the maximal frequent itemsets
6 Summary Frequent Itemsets forest (SFI-forest)
  Composed of an FI-list and a set of SFI-trees
  SFI-tree node fields
    item-id: the item identifier
    esup: the number of transactions reaching the node with this item-id
    window-id: the identifier of the current basic window, assigned when the node is created
    node-link: link to the next node with the same item-id in the same SFI-tree
7 Summary Frequent Itemsets forest (SFI-forest)
  FI-list entry fields
    item-id: the item identifier
    esup: the number of transactions containing the item
    window-id: the identifier of the current basic window, assigned when the entry is created
    head link: link to the root node of the item-id.SFI-tree
8 Summary Frequent Itemsets forest (SFI-forest)
  Each SFI-tree has its own opposite frequent item list (OFI-list)
  OFI-list entry: (item-id, esup, window-id, head link)
    head link: link to the first node carrying the item-id in the SFI-tree
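The three record types described on the SFI-forest slides could be modelled roughly as follows. This is a sketch under assumptions: the class names, the `children` dictionary, and storing each OFI-list inside its FI-list entry are illustrative choices, not the authors' layout.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class SFINode:
    item_id: str
    esup: int                                # transactions reaching this node
    window_id: int                           # basic window in which the node was created
    node_link: Optional["SFINode"] = None    # next node with the same item-id
    children: Dict[str, "SFINode"] = field(default_factory=dict)  # assumed child map

@dataclass
class OFIEntry:
    item_id: str
    esup: int
    window_id: int
    head_link: Optional[SFINode] = None      # first node carrying item_id in the tree

@dataclass
class FIEntry:
    item_id: str
    esup: int                                # transactions containing the item
    window_id: int
    head_link: Optional[SFINode] = None      # root of item_id.SFI-tree
    ofi_list: Dict[str, OFIEntry] = field(default_factory=dict)  # assumed placement

# Tiny usage: item a with a single child b, both created in window 1.
root = SFINode("a", 1, 1)
b = SFINode("b", 1, 1)
root.children["b"] = b
entry = FIEntry("a", 1, 1, head_link=root)
entry.ofi_list["b"] = OFIEntry("b", 1, 1, head_link=b)
```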
9 Example
  W_1: abc, bcd, acd; first transaction T = abc
  Entry format: (item-id, esup, window-id, node link)
  FI-list: (1,1,1) (2,1,1) (3,1,1)
  Transaction projection (T): abc (X = a), bc (X = b), c (X = c)
  SFI-tree-maintenance (abc), SFI-tree-maintenance (bc), SFI-tree-maintenance (c)
  a.OFI-list: (2,1,1) (3,1,1); a.SFI-tree: 1:1:1, 2:1:1, 3:1:1
  b.OFI-list: (3,1,1); b.SFI-tree: 2:1:1, 3:1:1
  c.OFI-list: empty; c.SFI-tree: 3:1:1
10 Example
  W_1: abc, bcd, acd; second transaction T = bcd
  Entry format: (item-id, esup, window-id, node link)
  Transaction projection (T): bcd (X = b), cd (X = c), d (X = d)
  SFI-tree-maintenance (bcd), SFI-tree-maintenance (cd), SFI-tree-maintenance (d)
  [Figure: FI-list and the a, b, c, d SFI-trees with their OFI-lists after inserting bcd; a new entry for item d, (4,1,1), and a new d.SFI-tree appear]
11 Example
  W_1: abc, bcd, acd; third transaction T = acd
  Entry format: (item-id, esup, window-id, node link)
  Transaction projection (T): acd (X = a), cd (X = c), d (X = d)
  SFI-tree-maintenance (acd), SFI-tree-maintenance (cd), SFI-tree-maintenance (d)
  [Figure: FI-list and the a, b, c, d SFI-trees with their OFI-lists after inserting acd]
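The projection and insertion steps walked through in the example can be sketched in Python as below. It is a simplified model, not the authors' code: node-links and OFI-lists are omitted, items are assumed to be ordered lexicographically, and the names `project`, `Node`, and `insert_window` are hypothetical.

```python
def project(items):
    """Item-suffix projections: [a, b, c] -> [a, b, c], [b, c], [c]."""
    return [items[i:] for i in range(len(items))]

class Node:
    def __init__(self, item, window_id):
        self.item = item
        self.esup = 0
        self.window_id = window_id
        self.children = {}

def insert_window(forest, fi_list, window, window_id):
    """Insert every transaction of one basic window into the forest."""
    for transaction in window:
        items = sorted(transaction)  # assumption: lexicographic item order
        for item in items:
            entry = fi_list.setdefault(item, {"esup": 0, "window_id": window_id})
            entry["esup"] += 1
        for suffix in project(items):
            # The first item of each projection selects the SFI-tree.
            node = forest.setdefault(suffix[0], Node(suffix[0], window_id))
            node.esup += 1
            for item in suffix[1:]:
                node = node.children.setdefault(item, Node(item, window_id))
                node.esup += 1

fi_list, forest = {}, {}
insert_window(forest, fi_list, ["abc", "bcd", "acd"], window_id=1)
```

After W_1 is inserted, `fi_list` holds the per-item counts (a: 2, b: 2, c: 3, d: 2) and each `forest[x]` is the prefix tree of the projections that start with x.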
12 Pruning infrequent items from SFI-forest
  X: a 1-itemset in the FI-list
  If X.esup < ε × X.CL, then X and its supersets are deleted from the SFI-forest
  Step 1: delete the item-id.OFI-list, the item-id.SFI-tree, and the entry with item-id from the FI-list
  Step 2: remove the infrequent item from the other OFI-lists by traversing the FI-list
13 Pruning infrequent items from SFI-forest
  Step 3: delete the infrequent item from the other SFI-trees
  Step 4: reconstruct the SFI-trees by reinserting the modified item-suffix transactions, or by joining the remaining subtrees into the SFI-tree
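The pruning rule and its steps can be illustrated with a deliberately flattened model. In this sketch each SFI-tree is assumed to be stored as a `Counter` of item-suffix paths rather than as linked nodes, OFI-lists are left out, and the name `prune` and the example data are hypothetical.

```python
from collections import Counter

def prune(fi_list, trees, window_sizes, eps):
    """fi_list: item -> (esup, first_window_id); trees: item -> Counter of paths."""
    # An item is infrequent when esup < eps * X.CL.
    infrequent = {
        x for x, (esup, wid) in fi_list.items()
        if esup < eps * sum(window_sizes[wid - 1:])
    }
    # Drop the item's own tree and its FI-list entry.
    for x in infrequent:
        del fi_list[x]
        trees.pop(x, None)
    # Remove the item from the other trees and reinsert the shortened paths.
    for root, paths in list(trees.items()):
        rebuilt = Counter()
        for path, count in paths.items():
            kept = tuple(i for i in path if i not in infrequent)
            rebuilt[kept] += count
        trees[root] = rebuilt
    return infrequent

# Assumed data: one window with transactions abc, ab, ab, a, b; eps = 0.3.
fi_list = {"a": (4, 1), "b": (4, 1), "c": (1, 1)}
trees = {
    "a": Counter({("a", "b", "c"): 1, ("a", "b"): 2, ("a",): 1}),
    "b": Counter({("b", "c"): 1, ("b",): 3}),
    "c": Counter({("c",): 1}),
}
removed = prune(fi_list, trees, window_sizes=[5], eps=0.3)
```

Item c has esup 1, below 0.3 × 5 = 1.5, so it is removed and the paths that contained it merge with their shorter counterparts.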
14 Example
  s = 0.3, ε = 0.2
  FI-list: (1,1,3) (2,1,2) (3,1,3) (4,1,3)
  a.CL = b.CL = c.CL = d.CL; pruning threshold: CL × 0.2 = 2.4
  [Figure: a.SFI-tree, b.SFI-tree, c.SFI-tree, and d.SFI-tree with their OFI-lists]
15 Determining maximal frequent itemsets
  Let e_1, e_2, …, e_k be the k frequent 1-itemsets in the FI-list
  Let o_1, o_2, …, o_j be the items in the e_i.OFI-list
  Generate a candidate maximal frequent (j+1)-itemset E = (e_i, o_1, o_2, …, o_j)
    Start from the frequent item with the smallest estimated support
    Traverse the paths via node-links to count E's estimated support
16 Determining maximal frequent itemsets
  If E.esup ≥ s × e_i.CL, then E is a maximal frequent itemset (MFI)
  Otherwise, enumerate the subsets of E of size |E| − 1 and test each in the same way
  Repeat until all maximal frequent itemsets with respect to entry e_i are found
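The top-down search described above can be sketched as follows. This is an approximation under assumptions: support is counted over a `Counter` of item-suffix paths instead of by following node-links, subsets already covered by a found MFI are skipped, and the function names and sample paths are hypothetical.

```python
from collections import Counter

def support(paths, itemset):
    """Estimated support: total count of stored paths containing the itemset."""
    return sum(count for path, count in paths.items() if itemset <= set(path))

def maximal_frequent(root, paths, ofi_items, min_count):
    """Top-down search for maximal frequent itemsets that contain `root`."""
    found = []
    level = [frozenset([root, *ofi_items])]   # candidate E = (e_i, o_1, ..., o_j)
    while level:
        next_level = set()
        for cand in level:
            if any(cand <= m for m in found):
                continue                      # already covered by a found MFI
            if support(paths, cand) >= min_count:
                found.append(cand)
            elif len(cand) > 1:
                # Enumerate the subsets of size |E| - 1 that still contain root.
                for item in cand - {root}:
                    next_level.add(cand - {item})
        level = next_level
    return found

# Assumed b.SFI-tree contents; threshold s * CL = 0.3 * 5 = 1.5.
paths = Counter({("b", "c", "d"): 1, ("b", "c"): 1, ("b", "d"): 1})
mfis = maximal_frequent("b", paths, ["c", "d"], min_count=0.3 * 5)
```

With these assumed counts, bcd has support 1 and fails the threshold, while bc and bd each have support 2 and are returned as maximal.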
17 Example
  s = 0.3, ε = 0.2
  FI-list: (1,1,3) (2,1,2) (3,1,3) (4,1,3)
  a.CL = b.CL = c.CL = d.CL = 5; frequency threshold: 5 × 0.3 = 1.5
  Calculate support (bcd)
  Calculate support (bc) = 1
  [Figure: a.SFI-tree, b.SFI-tree, c.SFI-tree, and d.SFI-tree with their OFI-lists]
18 Sliding Window Mining over Data Streams
  Modifications:
    Use the DSM-MFI algorithm to construct an SFI-forest_i for each basic window W_i
    Find the local maximal frequent itemsets (local MFI_i); all local MFIs are stored in a queue
    The global MFI-list stores all local MFIs from W_1 to W_N
19 Sliding Window Mining over Data Streams
  When basic window W_{N+1} arrives:
    Remove local MFI_1 from the queue
    Subtract the support of local MFI_1 from the global MFI-list
    Use the DSM-MFI algorithm to mine all local maximal frequent itemsets of W_{N+1}
    Increase the support of the matching itemsets in the global MFI-list, or insert the itemsets of local MFI_{N+1} into it
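The queue and global-list bookkeeping above might look like the following sketch. It assumes each local MFI result arrives as a mapping from itemset to support, and that itemsets whose global support drops to zero are discarded; the class name `SlidingMFI` and its fields are hypothetical, and the per-window mining itself is not shown.

```python
from collections import Counter, deque

class SlidingMFI:
    def __init__(self, max_windows):
        self.max_windows = max_windows   # number of basic windows kept
        self.queue = deque()             # local MFI results, oldest first
        self.global_mfi = Counter()      # itemset -> summed support

    def add_window(self, local_mfi):
        """local_mfi: dict mapping frozenset itemsets to their local support."""
        if len(self.queue) == self.max_windows:
            oldest = self.queue.popleft()
            for itemset, sup in oldest.items():
                self.global_mfi[itemset] -= sup
                if self.global_mfi[itemset] <= 0:
                    del self.global_mfi[itemset]   # assumed cleanup rule
        self.queue.append(local_mfi)
        for itemset, sup in local_mfi.items():
            self.global_mfi[itemset] += sup

ab, c = frozenset("ab"), frozenset("c")
sw = SlidingMFI(max_windows=2)
sw.add_window({ab: 2})
sw.add_window({ab: 1, c: 3})
sw.add_window({c: 1})   # the first window expires here
```

After the third call the oldest window's contribution is subtracted, leaving ab with support 1 and c with support 4.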
20 Experiment
  Environment: 1 GHz IBM x24, 384 MB memory, Visual C
  Parameters: s = 0.1%, ε = 0.01%
  IBM synthetic datasets: T10.I5.D1000K and T30.I20.D1000K
  The data is broken into 20 basic windows to simulate streaming data
21 Experiment