Published by Marilynn Barrett. Modified over 9 years ago.
1
Data Mining on Streams
We should use run lists for stream data mining, because a run list can be truncated at one end and appended to at the other very easily. A Ptree, even a 1-D Ptree, cannot accommodate such activity gracefully. The exception is data with spatial structure: if the data is spatial and the continuity advantage of 2-D Ptrees is needed, then Ptrees (spatially oriented techniques) should be used. We begin with some slides reviewing Ptrees and run lists, then move to stream data mining.
2
Ptrees
A table, R(A1..An), is a horizontal structure (a set of horizontal records) that is processed vertically (vertical scans). Ptrees vertically partition the table, compress each vertical bit slice into a basic Ptree, and horizontally process these Ptrees using one multi-operand logical AND operation.

R(A1 A2 A3 A4):
  010 111 110 001
  011 111 110 000
  010 110 101 001
  010 111 101 111
  101 010 001 100
  010 010 001 101
  111 000 001 100
  111 000 001 100

Vertical bit slices R[A1] R[A2] R[A3] R[A4]:
  R11 R12 R13  R21 R22 R23  R31 R32 R33  R41 R42 R43
   0   1   0    1   1   1    1   1   0    0   0   1
   0   1   1    1   1   1    1   1   0    0   0   0
   0   1   0    1   1   0    1   0   1    0   0   1
   0   1   0    1   1   1    1   0   1    1   1   1
   1   0   1    0   1   0    0   0   1    1   0   0
   0   1   0    0   1   0    0   0   1    1   0   1
   1   1   1    0   0   0    0   0   1    1   0   0
   1   1   1    0   0   0    0   0   1    1   0   0

1-dimensional Ptrees are built by recording the truth of the predicate "pure 1" recursively on halves, until there is purity. E.g., P11, built from R11 = 00001011:
  1. Whole file is not pure1: 0
  2. 1st half is not pure1: 0
  3. 2nd half is not pure1: 0
  4. 1st half of 2nd half is not pure1: 0
  5. 2nd half of 2nd half is pure1: 1
  6. 1st half of 1st half of 2nd half is pure1: 1
  7. 2nd half of 1st half of 2nd half is not pure1: 0
The same construction gives the basic Ptrees P11, P12, P13, P21, P22, P23, P31, P32, P33, P41, P42, P43.

E.g., to count the occurrences of 111 000 001 100, use "pure111000001100":
  P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43 =
    2^3-level: 0
    2^2-level: 0 0
    2^1-level: 0 1
  The single 1 at the 2^1-level gives a count of 2.
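The recursive "pure 1" construction and the AND-based count can be sketched in Python. This is a minimal illustration under stated assumptions, not the presenters' implementation: the nested-tuple tree representation and the names `build_ptree`, `ptree_not`, `ptree_and`, `one_count` and `count_pattern` are all hypothetical, and the slice length is assumed to be a power of two.

```python
# Table R(A1 A2 A3 A4) from the slide, one 12-bit string per record.
ROWS = ["010111110001", "011111110000", "010110101001", "010111101111",
        "101010001100", "010010001101", "111000001100", "111000001100"]

def build_ptree(bits):
    """Pure-1 segment -> 1, pure-0 segment -> 0, mixed segment -> (left, right)."""
    if all(bits):
        return 1
    if not any(bits):
        return 0
    mid = len(bits) // 2
    return (build_ptree(bits[:mid]), build_ptree(bits[mid:]))

def ptree_not(p):
    """Complement a Ptree by flipping every pure leaf."""
    if isinstance(p, tuple):
        return (ptree_not(p[0]), ptree_not(p[1]))
    return 1 - p

def ptree_and(a, b):
    """AND two Ptrees built over slices of the same length."""
    if a == 0 or b == 0:
        return 0
    if a == 1:
        return b
    if b == 1:
        return a
    left, right = ptree_and(a[0], b[0]), ptree_and(a[1], b[1])
    if left == 0 and right == 0:
        return 0          # collapse a segment that became pure 0
    return (left, right)

def one_count(p, size):
    """Number of 1 bits represented by Ptree p over `size` positions."""
    if isinstance(p, tuple):
        return one_count(p[0], size // 2) + one_count(p[1], size // 2)
    return p * size

def count_pattern(rows, pattern):
    """AND the basic Ptrees (complemented where the pattern bit is 0)."""
    result = 1
    for k, want in enumerate(pattern):
        p = build_ptree([int(r[k]) for r in rows])
        result = ptree_and(result, p if want == "1" else ptree_not(p))
    return result, one_count(result, len(rows))

P11 = build_ptree([int(r[0]) for r in ROWS])
tree, count = count_pattern(ROWS, "111000001100")
print(P11, tree, count)
```

Here `P11` mirrors the seven steps listed for R11 = 00001011, and `count` is the number of records equal to 111 000 001 100.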
3
Run Lists: Another way to handle vertical data
Run lists generalize Ptrees by applying standard run-length compression to the vertical bit files (alternatively, Lempel-Ziv? Golomb? other?). A run list records the type and start offset of each pure run. The table R(A1 A2 A3 A4) and its bit slices R11, ..., R43 are the same as on the previous slide.

E.g., RL11, built from R11 = 00001011 (entries are truth:start):
  1. 1st run is Pure0: 0:000
  2. 2nd run is Pure1: 1:100
  3. 3rd run is Pure0: 0:101
  4. 4th run is Pure1: 1:110
  RL11 = 0:000 1:100 0:101 1:110 (to complement, flip the purity bits)

Run lists for all twelve bit slices:
  RL11  0:000 1:100 0:101 1:110
  RL12  1:000 0:100 1:101
  RL13  0:000 1:001 0:010 1:100 0:101 1:110
  RL21  1:000 0:100
  RL22  1:000 0:110
  RL23  1:000 0:010 1:011 0:100
  RL31  1:000 0:100
  RL32  1:000 0:010
  RL33  0:000 1:010
  RL41  0:000 1:011
  RL42  0:000 1:011 0:100
  RL43  1:000 0:001 1:010 0:100 1:101 0:110

E.g., to count the occurrences of 111 000 001 100, use "pure111000001100":
  RL11 ^ RL12 ^ RL13 ^ RL'21 ^ RL'22 ^ RL'23 ^ RL'31 ^ RL'32 ^ RL33 ^ RL41 ^ RL'42 ^ RL'43
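A short Python sketch of the type:start run-list encoding follows. It is illustrative only: the function names `run_list`, `complement`, `one_count` and `show` are assumptions for this example, and the list-of-tuples layout is just one convenient way to hold the (purity bit, start offset) pairs.

```python
def run_list(bits):
    """Record (purity bit, start offset) for every maximal pure run."""
    runs = []
    for offset, bit in enumerate(bits):
        if offset == 0 or bit != bits[offset - 1]:
            runs.append((bit, offset))
    return runs

def complement(runs):
    """Complement a run list by flipping the purity bits; offsets are unchanged."""
    return [(1 - bit, start) for bit, start in runs]

def one_count(runs, total_length):
    """Count the 1 bits; each run ends where the next one starts."""
    count = 0
    for idx, (bit, start) in enumerate(runs):
        end = runs[idx + 1][1] if idx + 1 < len(runs) else total_length
        if bit == 1:
            count += end - start
    return count

def show(runs, width=3):
    """Format as on the slide, e.g. 0:000 1:100 (offsets in binary)."""
    return " ".join(f"{bit}:{start:0{width}b}" for bit, start in runs)

R11 = [0, 0, 0, 0, 1, 0, 1, 1]
RL11 = run_list(R11)
print(show(RL11))
print(show(complement(RL11)))
print(one_count(RL11, len(R11)))
```

Because only the purity bits change under complement, a complemented run list costs no more to build than a copy of the original.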
4
Other Indexes on RunLists
We could put Pure0-run, Pure1-run and Mixed-run indexes on RLs. For R11 = 0000101101010101:
  RL11   00:0 11:100 00:101 11:110 01:1000   (type:start, with 00 = pure0, 11 = pure1, 01 = mixed)
  0RL11  0000:4 0101:1                       (start:length)
  1RL11  0100:1 0110:2
  MRL11  1000:8
Or, since we would not traverse the RL very often, make it a linked list and just concatenate the indexes:
  START: 0RL11 0000:4 0101:1, then MRL11 1000:8, then 1RL11 0100:1 0110:2
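The three indexes can be derived from a bit string as in the sketch below. The slides do not state a rule for deciding which stretches count as mixed, so this example takes the mixed spans as an explicit argument; that parameter, the function name `indexed_run_lists` and the tuple layouts are all assumptions made for illustration.

```python
def indexed_run_lists(bits, mixed_spans=()):
    """Split a bit string into pure-0, pure-1 and mixed run indexes.

    bits        -- string of '0'/'1' characters
    mixed_spans -- (start, length) stretches the caller wants treated as mixed
    Returns (zero_index, one_index, mixed_index); pure entries are
    (start, length) and mixed entries are (start, length, raw_bits).
    """
    in_mixed = [False] * len(bits)
    mixed = []
    for start, length in mixed_spans:
        for i in range(start, start + length):
            in_mixed[i] = True
        mixed.append((start, length, bits[start:start + length]))

    zeros, ones = [], []
    i = 0
    while i < len(bits):
        if in_mixed[i]:
            i += 1
            continue
        j = i
        # Extend the pure run until the bit changes or a mixed span begins.
        while j < len(bits) and not in_mixed[j] and bits[j] == bits[i]:
            j += 1
        (ones if bits[i] == "1" else zeros).append((i, j - i))
        i = j
    return zeros, ones, mixed

R11 = "0000101101010101"
zero_idx, one_idx, mixed_idx = indexed_run_lists(R11, mixed_spans=[(8, 8)])
print(zero_idx)   # pure-0 runs as (start, length)
print(one_idx)    # pure-1 runs as (start, length)
print(mixed_idx)  # mixed runs as (start, length, bits)
```

Keeping the raw bits with each mixed entry matches the start:length:bits form used when the indexes are ANDed.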
5
Indexed RunLists ANDing
Pure entries are start:length; mixed entries are start:length:bits.
  R11 = 0000101101010101
    0RL11  0000:4 0101:1
    1RL11  0100:1 0110:2
    MRL11  1000:8:01010101
  R34 = 0010100001011111
    0RL34  0000:2 0101:4
    1RL34  1011:5
    MRL34  0010:3:101 1001:2:10
  RL11^34:
    0RL11^34  0000:4 0101:4
    1RL11^34  0100:1
    MRL11^34  1001:7:1010101
6
Indexed RunLists ANDing: AND the 0RLs first, then?
  R11 = 0000101101010101
    0RL11  0000:4 0101:1
    1RL11  0100:1 0110:2
    MRL11  1000:8:01010101
  R34 = 0010100001011111
    0RL34  0000:2 0101:4
    1RL34  1011:5
    MRL34  0010:3:101 1001:2:10
  RL11^34:
    0RL11^34  0000:4 0101:4 ...
7
Indexed Pure RunLists (no mixed) ANDing
Only the 0RLs are needed. Of course, you need the 1RLs to use as 0RL-complements (maintain 1RLs, or construct 0RL-complements on the fly?). To get 1-counts, count the 0s and subtract from the total.
  R11 = 0000101101010101
    0RL11  0000:4 0101:1 1000:1 1010:1 1100:1 1110:1
  R34 = 0010100001011111
    0RL34  0000:2 0011:1 0101:4 1010:1
  0RL11^34  0000:4 0101:4 1010:1 1100:1 1110:1
  0-count = sum of lengths = 11
  1-count = 16 - 11 = 5
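Since a position of the AND is 0 whenever either operand is 0 there, the 0RL of the result is the union of the operands' zero runs. The Python sketch below works under that observation; the names `zero_runs` and `and_zero_runs` and the (start, length) tuple form are assumptions for this example, not taken from the slides.

```python
def zero_runs(bits):
    """Return the maximal 0-runs of a bit string as (start, length) pairs."""
    runs, start = [], None
    for i, b in enumerate(bits):
        if b == "0" and start is None:
            start = i
        elif b != "0" and start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(bits) - start))
    return runs

def and_zero_runs(a, b):
    """0RL of (A AND B): merge the zero intervals of both operands."""
    merged = []
    for start, length in sorted(a + b):
        end = start + length
        if merged and start <= merged[-1][1]:
            # Overlapping or touching intervals fuse into one zero run.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e - s) for s, e in merged]

R11 = "0000101101010101"
R34 = "0010100001011111"
zrl = and_zero_runs(zero_runs(R11), zero_runs(R34))
zero_count = sum(length for _, length in zrl)
one_count = len(R11) - zero_count
print(zrl, zero_count, one_count)
```

The sort makes this a simple sketch rather than an efficient one; two run lists that are already ordered by start offset could be merged in a single linear pass instead.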
8
Zero RunLists ANDing of 34 and 11' (with pure1 gaps)
Each 0RL lists run lengths, alternating between a 0-run and the pure1 gap that follows it, starting with a 0-run (so 0RL11' is 0RL11 with a prefixed 0).
  R11  = 0000101101010101
  R11' = 1111010010101010    0RL11'  0,4,1,1,2,1,1,1,1,1,1,1,1
  R34  = 0010100001011111    0RL34   2,1,1,1,4,1,1,5
  0RL11'^34  2,1,9,1,1,1,1
The 1-count of the result is the total minus the 0-count, or 16 - 13 = 3.
So the coding of this AND program seems straightforward, following the animation: an intra-run cursor for each operand, plus a list cursor for each operand and one for the result. We, of course, need the 1RLs too (e.g., for the 0RL of a complement).
Next, let's allow the (red) gaps to be mixed, and insist that the gaps in a 0RL and its corresponding 1RL be compatible.
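The cursor-driven AND described on this slide can be sketched as follows. It assumes the alternating encoding used here (lengths of a 0-run, then a 1-run, then a 0-run, and so on, starting with a possibly empty 0-run) and two operands covering the same number of bits; the function name `and_runs` and the variable names are hypothetical.

```python
def and_runs(a, b):
    """AND two alternating run-length lists (0-run first, possibly of length 0).

    i and j are the list cursors; rem_a and rem_b are the intra-run cursors
    (how much of the current run of each operand is still unconsumed).
    """
    out = [0]                      # the result also starts with a 0-run
    i = j = 0
    rem_a, rem_b = a[0], b[0]
    while i < len(a) and j < len(b):
        if rem_a == 0:             # current run of A exhausted: advance list cursor
            i += 1
            if i < len(a):
                rem_a = a[i]
            continue
        if rem_b == 0:             # current run of B exhausted: advance list cursor
            j += 1
            if j < len(b):
                rem_b = b[j]
            continue
        step = min(rem_a, rem_b)
        bit = (i % 2) & (j % 2)    # even index = 0-run, odd index = 1-run
        if (len(out) - 1) % 2 != bit:
            out.append(0)          # the output bit changed: open a new run
        out[-1] += step
        rem_a -= step
        rem_b -= step
    return out

ZRL_11_COMP = [0, 4, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1]   # R11' = 1111010010101010
ZRL_34 = [2, 1, 1, 1, 4, 1, 1, 5]                        # R34  = 0010100001011111
result = and_runs(ZRL_11_COMP, ZRL_34)
zero_count = sum(result[0::2])
one_count = sum(result[1::2])
print(result, zero_count, one_count)
```

Each loop iteration consumes the shorter of the two current runs, so the work is proportional to the number of runs rather than the number of bits.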
9
0rl ANDing of 34 and 11' with selected mixed gaps, differentiated by a prefix bit
We will use colors on the slides: pure gap = 1, mixed gap = 0. A mixed gap is written length:bits.
  R11  = 0000101101010101    0rl11   4,12:101101010101
                             1rl11   0,6:000010,2,8:01010101
  R34  = 0010100001011111    0rl34   2,3:101,4,1,1,5
                             1rl34   0,11:00101000010,5
  R11' = 1111010010101010    0rl11'  0,6:111101,2,8:10101010
Note that when complementing we have to flip the bits of the mixed gaps (compare 1rl11 with 0rl11').
10
Take the philosophy that we will follow a pointer to long mixed runs only when necessary. Otherwise we will sequence straight across.
  R11  = 0000101101010101    zmrl11   4,1,1,2,0,8     (mixed run of 8 points to 01010101)
  R34  = 0010100001011111    zmrl34   2,3,4,1,1,5     (mixed run of 3 points to 101)
  R11' = 1111010010101010    zmrl11'  0,4,1,1,2,0,8
  zmrl11'^34  2, ...
11
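When the 16-bit window moves left (e.g., add 100 to 0rl11), the new bits are added at the front and the same number of bits is truncated from the far end.
  Before: R11 = 0000101101010101
    0rl11   4,1,1,2,1,1,1,1,1,1,1,1
    zmrl11  4,1,1,2,0,8      (mixed run: 01010101)
  After: R11 = 1000000101101010
    0rl11   0,1,6,1,1,2,1,1,1,1,1
    zmrl11  0,1,6,1,1,2,0,5  (mixed run: 01010)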
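The window update can be sketched on the alternating run-length form (0-run first) with a double-ended queue, which is what makes appending at one end and truncating at the other cheap. This is an illustration under assumptions: the function name `slide_window`, the use of `collections.deque`, and the choice to drop exactly as many bits as arrive are not taken from the slides.

```python
from collections import deque

def slide_window(runs, new_bits):
    """Prepend new_bits (a '0'/'1' string) and drop the same number of bits
    from the far end of an alternating run-length list (0-run first)."""
    dq = deque(runs)
    # Prepend right to left so the new bits end up in their original order.
    for bit in reversed(new_bits):
        if bit == "0":
            dq[0] += 1                 # extend the leading 0-run
        elif dq[0] == 0:
            if len(dq) > 1:
                dq[1] += 1             # extend the leading 1-run
            else:
                dq.append(1)
        else:
            dq.appendleft(1)           # new 1-run in front of the old 0-run
            dq.appendleft(0)           # keep the convention: list starts with a 0-run
    # Truncate the same number of bits from the other end.
    drop = len(new_bits)
    while drop > 0:
        if dq[-1] == 0:
            if len(dq) == 1:
                break                  # nothing left to drop
            dq.pop()
            continue
        take = min(drop, dq[-1])
        dq[-1] -= take
        drop -= take
    while len(dq) > 1 and dq[-1] == 0:
        dq.pop()                       # remove an emptied trailing run
    return list(dq)

before = [4, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1]   # R11 = 0000101101010101
after = slide_window(before, "100")
print(after)
```

Only the runs at the two ends are touched, so the cost of an update depends on the number of arriving bits, not on the window length.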
12
Network Security Application (network security through vertically structured data)
Network layers do their own partitioning:
  Packets, frames, etc. (usually independent of any intrinsic data structuring, e.g., record structure)
  Fragmentation/reassembly, segmentation/reassembly
Data privacy is compromised when the horizontal (stream) message content is eavesdropped upon at the reassembled level (in network).
A standard solution is to host-encrypt the horizontal structure so that any network-reassembled message is meaningless.
Alternative: vertically structure (decompose, partition) the data (e.g., into basic Ptrees).
  Send one Ptree per packet.
  Send intra-message packets separately.
  Trick flow classifiers into thinking the multiple packets associated with a particular message are unrelated.
  The message is only meaningful after destination demultiplexing.
Note: the only basic Ptree that holds actual information is the high-order bit Ptree. Therefore encrypt it!
It seems like there ought to be a whole range of killer ideas associated with the concept of vertically structuring data within network transmission units.
Active networking? (AND basic Ptrees, or just certain levels of them, at active net nodes?)
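The vertical decomposition idea can be illustrated with a toy Python sketch: a column of values is split into one bit slice per "packet", the packets are shuffled to stand in for independent delivery, and the destination reassembles the values. Everything here is an assumption for illustration (the names `to_packets` and `from_packets`, the tuple used as a packet, the shuffle as a stand-in for the network); no real networking, Ptree compression or encryption is performed.

```python
import random

def to_packets(values, width):
    """Split fixed-width integers into (slice_index, bit_slice) packets,
    slice 0 being the high-order bit."""
    return [(k, [(v >> (width - 1 - k)) & 1 for v in values])
            for k in range(width)]

def from_packets(packets, width):
    """Destination demultiplexing: rebuild the values from all bit slices."""
    slices = dict(packets)
    n = len(slices[0])
    return [sum(slices[k][i] << (width - 1 - k) for k in range(width))
            for i in range(n)]

# Attribute A1 of the example table R(A1 A2 A3 A4).
A1 = [0b010, 0b011, 0b010, 0b010, 0b101, 0b010, 0b111, 0b111]
packets = to_packets(A1, width=3)
random.shuffle(packets)            # packets may arrive in any order
restored = from_packets(packets, width=3)
print(restored == A1)
```

A single intercepted packet carries only one bit position of every value, which is the property the slide relies on; how much that actually protects is a separate question this sketch does not address.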