
Data Mining on Streams


1 Data Mining on Streams

We should use run lists for stream data mining, since a run list can be truncated at one end and appended to at the other very easily. A Ptree, even a 1-D Ptree, cannot accommodate such activity gracefully. The exception is data with spatial structure: if the data is spatial and the continuity advantage of 2-D Ptrees is needed, then Ptrees (spatially oriented techniques) should be used.

We begin with some slides reviewing Ptrees and run lists, then move to stream data mining.

2 Ptrees

A table, R(A1..An), is a horizontal structure (a set of horizontal records) that is processed vertically (vertical scans). Ptrees vertically partition the table, compress each vertical bit slice into a basic Ptree, and then horizontally process these Ptrees using one multi-operand logical AND operation.

R(A1 A2 A3 A4), three bits per attribute:

  A1  A2  A3  A4
  010 111 110 001
  011 111 110 000
  010 110 101 001
  010 111 101 111
  101 010 001 100
  010 010 001 101
  111 000 001 100
  111 000 001 100

Splitting each record into its 12 bits gives the vertical bit slices R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43, where Rij is bit j of attribute Ai. For example, R11 = 00001011.

1-dimensional Ptrees are built by recording the truth of the predicate "pure 1" recursively on halves, until there is purity. For P11, built from R11:

1. Whole file is not pure1: 0
2. 1st half is not pure1: 0
3. 2nd half is not pure1: 0
4. 1st half of 2nd half is not pure1: 0
5. 2nd half of 2nd half is pure1: 1
6. 1st half of 1st half of 2nd half is pure1: 1
7. 2nd half of 1st half of 2nd half is not pure1: 0

E.g., to count the occurrences of 111 000 001 100, use "pure 111000001100":

P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43

The resulting Ptree is:

  2^3-level: 0
  2^2-level: 0 0
  2^1-level: 0 1

The single 1 sits at the 2^1 level, so the count is 2.
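The construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`bit_slices`, `build_ptree`, `pattern_mask`, `count_ones`) and the nested-list tree representation are hypothetical, and the AND is done on uncompressed slices for clarity, whereas the slides AND the compressed Ptrees directly. The eighth table row is assumed to repeat the seventh, as implied by R11 = 00001011 and the count of 2.

```python
ROWS = [
    "010 111 110 001",
    "011 111 110 000",
    "010 110 101 001",
    "010 111 101 111",
    "101 010 001 100",
    "010 010 001 101",
    "111 000 001 100",
    "111 000 001 100",  # assumed repeat of row 7 (implied by R11 and the count)
]

def bit_slices(rows):
    """Return the vertical bit slices R11..R43 as bit strings."""
    records = [row.replace(" ", "") for row in rows]
    return ["".join(rec[i] for rec in records) for i in range(len(records[0]))]

def build_ptree(bits):
    """1-D Ptree: 1 = pure1 segment, 0 = pure0 segment,
    [0, left, right] = mixed segment split into halves."""
    if "0" not in bits:
        return 1
    if "1" not in bits:
        return 0
    mid = len(bits) // 2
    return [0, build_ptree(bits[:mid]), build_ptree(bits[mid:])]

def count_ones(tree, size):
    """Number of 1 bits represented by a Ptree covering `size` bits."""
    if isinstance(tree, list):
        half = size // 2
        return count_ones(tree[1], half) + count_ones(tree[2], size - half)
    return size if tree == 1 else 0

def pattern_mask(slices, pattern):
    """AND of the slices, complementing each slice whose pattern bit is 0."""
    mask = ["1"] * len(slices[0])
    for slice_bits, wanted in zip(slices, pattern):
        for i, b in enumerate(slice_bits):
            if b != wanted:
                mask[i] = "0"
    return "".join(mask)

slices = bit_slices(ROWS)
p11 = build_ptree(slices[0])
mask = pattern_mask(slices, "111000001100")
result = build_ptree(mask)
print(slices[0], p11, result, count_ones(result, len(mask)))
```

The root-count of the result tree is obtained by weighting each pure1 leaf by the size of the segment it covers, which is how the 1 at the 2^1 level yields a count of 2.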

3 Run Lists

Run lists are another way to handle vertical data: generalized Ptrees that use standard run-length compression of the vertical bit files (alternatively Lempel-Ziv? Golomb? other?). A run list records the type and start offset of each pure run, written truth:start. E.g., RL11, built from the bit slice R11 = 00001011 of R(A1 A2 A3 A4):

1. 1st run is pure0: 0:000
2. 2nd run is pure1: 1:100
3. 3rd run is pure0: 0:101
4. 4th run is pure1: 1:110

RL11 = 0:000 1:100 0:101 1:110 (to complement, flip the purity bits).

E.g., to count the occurrences of 111 000 001 100, use "pure 111000001100":

RL11 ^ RL12 ^ RL13 ^ RL'21 ^ RL'22 ^ RL'23 ^ RL'31 ^ RL'32 ^ RL33 ^ RL41 ^ RL'42 ^ RL'43

Run lists for all twelve bit slices:

  RL11  0:000 1:100 0:101 1:110
  RL12  1:000 0:100 1:101
  RL13  0:000 1:001 0:010 1:100 0:101 1:110
  RL21  1:000 0:100
  RL22  1:000 0:110
  RL23  1:000 0:010 1:011 0:100
  RL31  1:000 0:100
  RL32  1:000 0:010
  RL33  0:000 1:010
  RL41  0:000 1:011
  RL42  0:000 1:011 0:100
  RL43  1:000 0:001 1:010 0:100 1:101 0:110
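As a rough sketch of the truth:start encoding described on this slide, the following Python builds a run list from a bit slice, complements it by flipping the purity bits, and expands it back to bits. The function names (`run_list`, `complement`, `expand`, `show`) are illustrative assumptions, not names used in the slides.

```python
def run_list(bits):
    """Return [(purity_bit, start_offset), ...], one entry per maximal pure run."""
    runs = []
    for i, b in enumerate(bits):
        if i == 0 or b != bits[i - 1]:
            runs.append((int(b), i))
    return runs

def complement(runs):
    """Complement a run list: flip the purity bits, keep the start offsets."""
    return [(1 - bit, start) for bit, start in runs]

def expand(runs, length):
    """Rebuild the bit string of the given total length from a run list."""
    parts = []
    for k, (bit, start) in enumerate(runs):
        end = runs[k + 1][1] if k + 1 < len(runs) else length
        parts.append(str(bit) * (end - start))
    return "".join(parts)

def show(runs, width=3):
    """Format a run list in the slide's truth:start notation."""
    return " ".join(f"{bit}:{start:0{width}b}" for bit, start in runs)

rl11 = run_list("00001011")
print(show(rl11))              # 0:000 1:100 0:101 1:110
print(show(complement(rl11)))  # 1:000 0:100 1:101 0:110
```

Because only the start offset of each run is stored, the length of the final run is implied by the total length of the slice, which is why `expand` takes `length` as a parameter.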

4 Other Indexes on RunLists

We could put Pure0-Run, Pure1-Run and Mixed-Run indexes on RLs. For R11 = 0000101101010101:

  RL11   00:0 11:100 00:101 11:110 01:1000

Indexes (start:length):

  0RL11  0000:4 0101:1
  1RL11  0100:1 0110:2
  MRL11  1000:8

Or, since we would not traverse the RL very often, make it a linked list and just concatenate the indexes:

  START: 0RL11 0000:4 0101:1, MRL11 1000:8, 1RL11 0100:1 0110:2

5 Indexed RunLists ANDing

Mixed-run entries carry their bits: start:length:bits.

R11 = 0000101101010101
  0RL11  0000:4 0101:1
  1RL11  0100:1 0110:2
  MRL11  1000:8:01010101

R34 = 0010100001011111
  0RL34  0000:2 0101:4
  1RL34  1011:5
  MRL34  0010:3:101 1001:2:10

RL11^34
  0RL11^34  0000:4 0101:4
  1RL11^34  0100:1
  MRL11^34  1001:7:1010101

6 Indexed RunLists ANDing: AND the 0RLs first, then?

R11 = 0000101101010101
  0RL11  0000:4 0101:1
  1RL11  0100:1 0110:2
  MRL11  1000:8:01010101

R34 = 0010100001011111
  0RL34  0000:2 0101:4
  1RL34  1011:5
  MRL34  0010:3:101 1001:2:10

RL11^34
  0RL11^34  0000:4 0101:4 ...

7 Indexed Pure RunLists (no mixed) ANDing

Only the 0RLs are needed. Of course, you need 1RLs to use as 0RL complements (maintain the 1RLs, or construct the 0RL complements on the fly?). To get 1-counts, count the 0s and subtract from the total.

R11 = 0000101101010101
  0RL11  0000:4 0101:1 1000:1 1010:1 1100:1 1110:1

R34 = 0010100001011111
  0RL34  0000:2 0011:1 0101:4 1010:1

  0RL11^34  0000:4 0101:4 1010:1 1100:1 1110:1

0-count = sum of lengths = 11
1-count = 16 - 11 = 5
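One way to read this slide is that the 0RL of an AND is the union of the operands' zero runs, since a result bit is 0 wherever either operand is 0. The sketch below assumes that reading and represents each run as a (start, length) pair; `zero_runs` and `and_zero_runs` are hypothetical helper names.

```python
def zero_runs(bits):
    """Return (start, length) for every maximal run of 0s in a bit string."""
    runs, start = [], None
    for i, b in enumerate(bits):
        if b == "0" and start is None:
            start = i
        elif b == "1" and start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(bits) - start))
    return runs

def and_zero_runs(a, b):
    """0RL of (A AND B): merge overlapping or adjacent zero runs of A and B."""
    merged = []
    for start, length in sorted(a + b):
        end = start + length
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e - s) for s, e in merged]

r11 = "0000101101010101"
r34 = "0010100001011111"
result = and_zero_runs(zero_runs(r11), zero_runs(r34))
zero_count = sum(length for _, length in result)
print(result, zero_count, len(r11) - zero_count)
```

Sorting the combined runs by start offset lets a single pass merge them, so the 1-count falls out as the total length minus the summed run lengths without ever expanding the bits.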

8 Zero RunLists ANDing of 34 and 11' (with pure1 gaps)

R11  = 0000101101010101
R11' = 1111010010101010
R34  = 0010100001011111

Each list gives the lengths of the pure0 runs alternating with the lengths of the pure1 gaps between them:

  0RL11'  0,4,1,1,2,1,1,1,1,1,1,1,1   (0RL11' is 0RL11 with a prefixed 0)
  0RL34   2,1,1,1,4,1,1,5

[Animation: intra-run cursors step through 0RL11' and 0RL34 while the result 0RL11'^34 is built.]

The 1-count of the result is the total minus the 0-count, or 16 - 13 = 3.

So the coding of this AND program seems straightforward following the animation: an intra-run cursor for each operand, a list cursor for each operand, and one for the result. We, of course, need the 1RLs too (e.g., for the 0RL of a complement).

Next, let's allow the red gaps to be mixed, and insist that the gaps in a 0RL and its corresponding 1RL be compatible.
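The cursor scheme described on this slide might be coded roughly as follows. This is a sketch under assumptions: each operand is a list of alternating lengths (pure0 run, pure1 gap, pure0 run, ...), both operands cover the same number of bits, and the name `and_zero_rl` is hypothetical.

```python
def and_zero_rl(a, b):
    """AND two vectors given as alternating run lengths.

    Even list positions are pure0 runs, odd positions are pure1 gaps;
    a leading 0 means the vector starts with a 1.
    """
    ia = ib = 0                   # list cursors, one per operand
    left_a, left_b = a[0], b[0]   # intra-run cursors: bits left in current run
    out = [0]                     # result list, same alternating convention
    while ia < len(a) and ib < len(b):
        if left_a == 0:           # current run of a is used up: advance
            ia += 1
            left_a = a[ia] if ia < len(a) else 0
            continue
        if left_b == 0:           # current run of b is used up: advance
            ib += 1
            left_b = b[ib] if ib < len(b) else 0
            continue
        step = min(left_a, left_b)
        bit = (ia % 2) & (ib % 2)       # 1 only if both cursors sit in a pure1 gap
        if (len(out) - 1) % 2 != bit:   # result changes value: open a new run
            out.append(0)
        out[-1] += step
        left_a -= step
        left_b -= step
    return out

rl_11c = [0, 4, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1]   # 0RL11'
rl_34 = [2, 1, 1, 1, 4, 1, 1, 5]                   # 0RL34
res = and_zero_rl(rl_11c, rl_34)
zero_count = sum(res[0::2])
print(res, zero_count, sum(res) - zero_count)
```

Each loop iteration consumes the overlap of the two current runs, so the work is proportional to the number of runs rather than the number of bits.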

9 0rl ANDing of 34 and 11' with selected mixed gaps

Mixed gaps are differentiated from pure gaps by a prefix bit (pure gap = 1, mixed gap = 0); the slides use colors. Below, a mixed gap is written length:bits.

R11  = 0000101101010101
R11' = 1111010010101010
R34  = 0010100001011111

  0rl11   4,12:101101010101
  1rl11   0,6:000010,2,8:01010101
  0rl11'  0,6:111101,2,8:10101010
  0rl34   2,3:101,4,1,1,5
  1rl34   0,11:00101000010,5

Note that we have to flip the mixed gaps when complementing (compare 1rl11 with 0rl11').
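To make the "flip the mixed gaps" note concrete, here is a small sketch that decodes a run list with mixed gaps and complements it. It assumes the textual form used in this transcript (comma-separated entries, pure runs at even positions, gaps at odd positions, mixed gaps written length:bits); `decode` and `complement` are illustrative names only.

```python
FLIP = str.maketrans("01", "10")

def decode(rl, run_bit):
    """Expand a run list to bits; run_bit is 0 for a 0rl and 1 for a 1rl."""
    out = []
    for k, tok in enumerate(rl.split(",")):
        if ":" in tok:                      # mixed gap: bits stored verbatim
            length, payload = tok.split(":")
            if int(length) != len(payload):
                raise ValueError(f"bad mixed gap: {tok}")
            out.append(payload)
        else:                               # pure run (even k) or pure gap (odd k)
            bit = run_bit if k % 2 == 0 else 1 - run_bit
            out.append(str(bit) * int(tok))
    return "".join(out)

def complement(rl):
    """Turn a 1rl into the 0rl of the complement (or vice versa).

    Lengths are unchanged; only the bits inside mixed gaps are flipped.
    """
    toks = []
    for tok in rl.split(","):
        if ":" in tok:
            length, payload = tok.split(":")
            tok = f"{length}:{payload.translate(FLIP)}"
        toks.append(tok)
    return ",".join(toks)

one_rl_11 = "0,6:000010,2,8:01010101"
zero_rl_11c = complement(one_rl_11)
print(zero_rl_11c, decode(zero_rl_11c, 0))
```

The pure entries need no change because the role of "run" and "gap" swaps along with the list type, which is why a maintained 1rl can serve directly as the 0rl of the complement.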

10 Take the philosophy that we will follow a pointer to long mixed runs only when necessary. Otherwise we will sequence straight across.

R11  = 0000101101010101
R11' = 1111010010101010
R34  = 0010100001011111

  zmrl11   4,1,1,2,0,8    (mixed run: 01010101)
  zmrl11'  0,4,1,1,2,0,8
  zmrl34   2,3,4,1,1,5    (mixed run: 101)

[Animation: cursors step across zmrl11' and zmrl34 to build zmrl11'^34.]

11 When the 16-bit window moves left (e.g., add 100 to 0rl11)

R11 = 0000101101010101

Before the move:

  0rl11'  4,1,1,2,1,1,1,1,1,1,1,1
  zmrl11  4,1,1,2,0,8        (mixed run: 01010101)

After the move:

  0rl11'  0,1,6,1,1,2,1,1,1,1,1
  zmrl11  0,1,5,1,1,2,0,5    (mixed run: 01010)
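The window move can be sketched with a double-ended queue of runs, which is the property that makes run lists convenient for streams: one end grows while the other is truncated, and only the two end runs are ever touched. The class below is a minimal illustration; `RunWindow` and its methods are hypothetical names, and new bits are assumed to enter at the left end as in this slide's example.

```python
from collections import deque

class RunWindow:
    """Fixed-width window over a bit stream, stored as [bit, length] runs."""

    def __init__(self, bits):
        self.runs = deque()
        for ch in reversed(bits):
            self._push_left(int(ch))

    def _push_left(self, bit):
        # Extend the leftmost run if it has the same bit, else start a new run.
        if self.runs and self.runs[0][0] == bit:
            self.runs[0][1] += 1
        else:
            self.runs.appendleft([bit, 1])

    def _drop_right(self):
        # Shorten the rightmost run by one bit; remove it when it is empty.
        self.runs[-1][1] -= 1
        if self.runs[-1][1] == 0:
            self.runs.pop()

    def slide_left(self, new_bits):
        """Bring new_bits in at the left end and drop as many bits at the right."""
        for ch in reversed(new_bits):
            self._push_left(int(ch))
            self._drop_right()

    def zero_rl(self):
        """Alternating run lengths, starting with a pure0 run (possibly 0)."""
        lengths = [length for _, length in self.runs]
        return lengths if self.runs[0][0] == 0 else [0] + lengths

window = RunWindow("0000101101010101")
before = window.zero_rl()
window.slide_left("100")
after = window.zero_rl()
print(before, after)
```

Each incoming bit costs a constant amount of work regardless of the window width, whereas a Ptree over the same window would have to be rebuilt because every half boundary shifts.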

12 Network Security Application (network security through vertically structured data)

Network layers do their own partitioning (packets, frames, etc.), usually independent of any intrinsic data structuring such as record structure: fragmentation/reassembly, segmentation/reassembly.

Data privacy is compromised when the horizontal (stream) message content is eavesdropped upon at the reassembled level (in the network).

A standard solution is to host-encrypt the horizontal structure so that any network-reassembled message is meaningless.

Alternative: vertically structure (decompose, partition) the data (e.g., into basic Ptrees).
Send one Ptree per packet.
Send intra-message packets separately.
Trick flow classifiers into thinking the multiple packets associated with a particular message are unrelated.
The message is only meaningful after destination demultiplexing.
Note: the only basic Ptree that holds actual information is the high-order bit Ptree. Therefore encrypt it!

It seems like there ought to be a whole range of killer ideas associated with the concept of vertically structuring data within network transmission units.

Active networking? (AND basic Ptrees, or just certain levels of them, at active network nodes?)

