Mining Multidimensional Sequential Patterns over Data Streams
Chedy Raïssi and Marc Plantevit, DaWaK 2008
Outline
▫Introduction
▫Problem Definition
▫The MDSDS Approach
▫Experimental Results
▫Conclusions
Introduction
We propose to exploit the intrinsic multidimensionality of data streams to extract more interesting sequential patterns. Because the search space in a multidimensional framework is huge, we focus only on the most specific abstraction level for items instead of mining at all possible levels.
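Problem Definition
Multidimensional item: $a = (d_1, \ldots, d_m)$, where ∗ is a wildcard value that can be interpreted as ALL.
Multidimensional itemset: $i = \{a_1, \ldots, a_k\}$
Multidimensional sequence: $s = \langle i_1, \ldots, i_l \rangle$, an ordered list of multidimensional itemsets.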
Cont.
We focus on the most specific frequent items to generate the multidimensional sequential patterns.
▫E.g., if items (LA, ∗, M, ∗) and (∗, ∗, M, Wii) are frequent, we do not consider the frequent items (LA, ∗, ∗, ∗), (∗, ∗, M, ∗) and (∗, ∗, ∗, Wii), because each of them is a generalization of one of the first two.
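The filtering rule above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the names `generalizes` and `most_specific` are hypothetical, items are modelled as plain tuples, and the ASCII string "*" stands in for the wildcard.

```python
WILDCARD = "*"  # assumed stand-in for the slide's wildcard value (ALL)

def generalizes(general, specific):
    """True if `general` equals `specific` or wildcards some of its values."""
    return len(general) == len(specific) and all(
        g == WILDCARD or g == s for g, s in zip(general, specific)
    )

def most_specific(frequent_items):
    """Keep only the items that are not a strict generalization of another item."""
    items = set(frequent_items)
    return {
        a for a in items
        if not any(a != b and generalizes(a, b) for b in items)
    }

# The five frequent items from the example above.
frequent = [
    ("LA", "*", "M", "*"), ("*", "*", "M", "Wii"),
    ("LA", "*", "*", "*"), ("*", "*", "M", "*"), ("*", "*", "*", "Wii"),
]
print(sorted(most_specific(frequent)))
```

The pairwise comparison is quadratic in the number of frequent items, which is acceptable for a sketch but would likely need an index for large item sets.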
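Cont.
Data stream: $DS = B_0, B_1, \ldots, B_n$, a sequence of batches.
Each batch: $B_i = \{B_1, B_2, B_3, \ldots, B_k\}$
[Figure: the data stream drawn as consecutive batches $B_0$, $B_1$, $B_2$, $B_3$]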
Cont.
[Figure: specialization example with min_sup = 50%]
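To make the 50% threshold concrete, here is a minimal sketch of how the support of a wildcard item could be counted inside one batch. The support definition used here (the fraction of sequences in the batch that contain at least one item covered by the candidate) is an assumption, as are the function names and the batch contents, which are invented for illustration only.

```python
WILDCARD = "*"  # assumed stand-in for the wildcard value (ALL)

def covers(general, specific):
    """True if every dimension of `general` is a wildcard or matches `specific`."""
    return all(g == WILDCARD or g == s for g, s in zip(general, specific))

def item_support(item, batch):
    """Fraction of the sequences in `batch` containing an item covered by `item`."""
    hits = sum(
        1 for sequence in batch
        if any(covers(item, a) for itemset in sequence for a in itemset)
    )
    return hits / len(batch)

# Hypothetical batch: four sequences, each a list of itemsets of 4-dimensional items.
batch = [
    [[("LA", "Edu", "M", "Wii")], [("NY", "Pro", "M", "PS3")]],
    [[("LA", "Pro", "F", "Wii")]],
    [[("NY", "Edu", "M", "DS")]],
    [[("SF", "Pro", "F", "PS3")]],
]
MIN_SUP = 0.5
for candidate in [("LA", "*", "*", "*"), ("*", "*", "M", "*"), ("LA", "*", "M", "*")]:
    sup = item_support(candidate, batch)
    print(candidate, sup, "frequent" if sup >= MIN_SUP else "not frequent")
```

In this toy batch the two general items reach the threshold while their common specialization (LA, ∗, M, ∗) does not, so the general items are already the most specific frequent ones.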
The MDSDS Approach
MDSDS extracts the most specific multidimensional items. It uses a data structure consisting of a prefix-tree and tilted-time window tables. Patterns fall into three classes: (1) frequent patterns, (2) sub-frequent patterns, and (3) infrequent patterns, which are not stored in the prefix-tree.
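The three pattern classes and the prefix-tree can be sketched as follows. The slides do not give the exact thresholds, so this sketch assumes a second, lower threshold `epsilon` below `min_sup` that separates sub-frequent from infrequent patterns. The class and method names are hypothetical, and a plain list of supports stands in for the tilted-time window table attached to each node.

```python
def classify(support, min_sup, epsilon):
    """Assumed rule: frequent >= min_sup > sub-frequent >= epsilon > infrequent."""
    if support >= min_sup:
        return "frequent"
    if support >= epsilon:
        return "sub-frequent"
    return "infrequent"

class Node:
    def __init__(self):
        self.children = {}   # item -> child Node
        self.supports = []   # placeholder for a tilted-time window table

class PatternTree:
    """Prefix-tree that stores frequent and sub-frequent patterns only."""

    def __init__(self, min_sup, epsilon):
        self.min_sup, self.epsilon = min_sup, epsilon
        self.root = Node()

    def update(self, pattern, support):
        label = classify(support, self.min_sup, self.epsilon)
        if label == "infrequent":
            return label                      # infrequent patterns are not stored
        node = self.root
        for item in pattern:                  # walk or create the path for the pattern
            node = node.children.setdefault(item, Node())
        node.supports.append(support)
        return label

    def lookup(self, pattern):
        node = self.root
        for item in pattern:
            node = node.children.get(item)
            if node is None:
                return None
        return node.supports

tree = PatternTree(min_sup=0.5, epsilon=0.1)
print(tree.update(("a", "b"), 0.6), tree.update(("a", "c"), 0.2), tree.update(("d",), 0.05))
```

Keeping sub-frequent patterns in the tree is what allows their history to be available if they become frequent in a later batch.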
Cont.
Step 1: mine the most specific multidimensional items
▫Multidimensional representation: (LA, ∗, ∗, ∗), (∗, ∗, M, ∗)
▫Detecting specialization or generalization
Cont.
Step 2:
▫Sub-frequent sequences may become frequent in future batches.
▫The PrefixSpan algorithm is used to mine the multidimensional sequences efficiently.
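PrefixSpan algorithm (min_sup = 2)
1. Find length-1 sequential patterns: ⟨a⟩:4, ⟨b⟩:4, ⟨c⟩:4, ⟨d⟩:3, ⟨e⟩:3, ⟨f⟩:3.
2. Divide the search space into six subsets: (1) the patterns having prefix ⟨a⟩; …; and (6) the patterns having prefix ⟨f⟩.
▫⟨a⟩-projected database: ⟨(abc)(ac)d(cf)⟩, ⟨(_d)c(bc)(ae)⟩, ⟨(_b)(df)cb⟩, ⟨(_f)cbc⟩.
▫The length-2 sequential patterns having prefix ⟨a⟩: ⟨aa⟩:2, ⟨ab⟩:4, ⟨(ab)⟩:2, ⟨ac⟩:4, ⟨ad⟩:2, ⟨af⟩:2.
▫…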
Cont.
3. Find subsets of sequential patterns.
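The three PrefixSpan steps (find frequent items, split the search space by prefix, recurse on each projected database) can be illustrated with a deliberately simplified sketch. It assumes every element of a sequence is a single item, so itemset extensions are not handled, and it uses (sequence index, offset) pairs as a pseudo-projection instead of copying suffixes. The function name and the toy database are hypothetical.

```python
from collections import defaultdict

def prefixspan(sequences, min_sup):
    """Return {pattern: support} for all frequent patterns (single-item elements only)."""
    results = {}

    def mine(prefix, projected):
        # Step 1: count the items occurring in the projected suffixes,
        # at most once per sequence.
        counts = defaultdict(int)
        for seq_id, start in projected:
            for item in set(sequences[seq_id][start:]):
                counts[item] += 1
        # Step 2: each frequent item defines one subset of the search space.
        for item, support in sorted(counts.items()):
            if support < min_sup:
                continue
            pattern = prefix + (item,)
            results[pattern] = support
            # Build the projected database: the suffix after the first
            # occurrence of `item` in each sequence that contains it.
            new_projected = []
            for seq_id, start in projected:
                seq = sequences[seq_id]
                if item in seq[start:]:
                    new_projected.append((seq_id, seq.index(item, start) + 1))
            # Step 3: mine the subset recursively.
            mine(pattern, new_projected)

    mine((), [(i, 0) for i in range(len(sequences))])
    return results

# Toy database, absolute min_sup = 2 as on the slide.
db = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b"]]
for pattern, support in prefixspan(db, min_sup=2).items():
    print(pattern, support)
```

Because the items only need to be hashable and sortable, the same sketch also accepts multidimensional items written as tuples of strings.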
Cont.
Step 3:
▫Tilted-time window table
▫Updating and pruning are performed after each batch is received from the data stream.
Tilted-time windows
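A tilted-time window table keeps recent batches at fine granularity and older batches at coarser granularity. The slides do not specify the exact scheme, so the sketch below assumes a simple logarithmic variant: each level holds at most two entries, and when a third arrives the two oldest are merged and pushed to the next, coarser level. The class name and methods are hypothetical, and pruning is left out.

```python
class TiltedTimeWindow:
    """Logarithmic tilted-time window (assumed variant, without pruning)."""

    def __init__(self):
        # levels[i] lists supports newest-first; each entry covers 2**i batches.
        self.levels = []

    def add(self, support):
        """Record the support observed in the newest batch."""
        carry = support
        for level in self.levels:
            level.insert(0, carry)
            if len(level) <= 2:
                return
            # Level overflow: merge the two oldest entries and push them up.
            oldest = level.pop()
            second_oldest = level.pop()
            carry = oldest + second_oldest
        self.levels.append([carry])

    def windows(self):
        """All stored supports, from the most recent window to the oldest."""
        return [s for level in self.levels for s in level]

    def total(self):
        return sum(self.windows())

table = TiltedTimeWindow()
for batch_support in [1, 2, 3, 4, 5, 6, 7]:
    table.add(batch_support)
print(table.levels, table.total())
```

With this scheme the number of stored windows grows logarithmically with the number of batches, while the total support over the whole stream is preserved.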
Experimental Results
Conclusions
Experiments on real data gathered from TCP/IP network traffic provide compelling evidence that accurate and fast results can be obtained for multidimensional sequential pattern mining. We propose to take the multidimensional framework into account in order to detect high-level changes such as trends.