
Presentation on theme: "MYRRH ManY-Relationship-Rule Harvester" — Presentation transcript:

1 MYRRH ManY-Relationship-Rule Harvester

2 pTrees (predicate Tree technologies) provide fast, accurate horizontal processing of compressed, data-mining-ready, vertical data structures. Applications:
MYRRH (ManY-Relationship-Rule Harvester) uses pTrees for association rule mining of multiple relationships.
ConCur (Concurrency Control) uses pTrees for ROCC and ROLL concurrency control.
PGP-D (Pretty Good Protection of Data) protects vertical pTree data, e.g., key = array(offset,pad): 5,54 | 7,539 | 87,3 | 209,126 | 25,896 | 888,23 | ...
FAUST (Fast Accurate Unsupervised, Supervised Treemining) uses pTrees for classification and clustering of spatial data.
PINE (Podium Incremental Neighborhood Evaluator) uses pTrees for Closed k Nearest Neighbor Classification.
DOVE (DOmain VEctors) uses pTrees for database query processing.

3 predicate Trees (pTrees): 1st, Vertical Processing of Horizontal Data (VPHD); for horizontally structured records, we scan vertically. Given R(A1 A2 A3 A4) with tuples (base 10) 2 7 6 1 | 6 7 6 0 | 3 7 5 1 | 2 7 5 7 | 3 2 1 4 | 2 2 1 5 | 7 0 1 4 | 7 0 1 4 (base 2: 010 111 110 001, 110 111 110 000, 011 111 101 001, 010 111 101 111, 011 010 001 100, 010 010 001 101, 111 000 001 100, 111 000 001 100), project each attribute (now 4 files), then vertically slice off each bit position (now 12 files, R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43), then compress each bit slice into a pTree. Record the truth of the predicate "pure1 (all 1's)" in a tree, recursively on halves, until the half is pure (all 1's or all 0's). E.g., the compression of R11 into P11 goes as follows: 1. Whole thing pure1? false -> 0. 2. Left half pure1? false -> 0. 3. Right half pure1? false -> 0. 4. Left half of right half? false -> 0, but it's pure0 so this branch ends. 5. Right half of right half? true -> 1. 2nd, pTrees find the number of occurrences of (7,0,1,4) without scanning: since 7 = 111, 0 = 000, 1 = 001, 4 = 100, count the (7,0,1,4)s with P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43; reading the resulting pTree's node levels (0*2^3 + 0*2^2 + 1*2^1 + 0*2^0) gives the count = 2.
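A minimal sketch of the VPHD counting trick above, assuming Python: bit slices are kept as uncompressed integer bitmasks (tree compression is omitted; the AND/complement semantics are identical), and the eighth tuple (7,0,1,4), dropped by the transcript but implied by the slide's count of 2, is restored.

```python
# Bit slices as Python ints: bit r of P[c][j] is row r's j-th bit (MSB first)
# of attribute c. Complements stand in for the primed pTrees P'.
ROWS = [(2, 7, 6, 1), (6, 7, 6, 0), (3, 7, 5, 1), (2, 7, 5, 7),
        (3, 2, 1, 4), (2, 2, 1, 5), (7, 0, 1, 4), (7, 0, 1, 4)]  # R(A1..A4)
N, BITS = len(ROWS), 3
ALL = (1 << N) - 1

def bit_slices(col):
    """One bitmask per bit position of column `col`, MSB first."""
    slices = []
    for b in reversed(range(BITS)):
        mask = 0
        for r, row in enumerate(ROWS):
            if (row[col] >> b) & 1:
                mask |= 1 << r
        slices.append(mask)
    return slices

P = [bit_slices(c) for c in range(4)]       # P[c][j] ~ basic pTree P_{c+1,j+1}

def count_tuple(target):
    """AND the matching slice (or its complement) for every bit of every
    attribute, then popcount: no horizontal scan of R is needed."""
    acc = ALL
    for c, v in enumerate(target):
        for j in range(BITS):
            bit = (v >> (BITS - 1 - j)) & 1
            acc &= P[c][j] if bit else (ALL & ~P[c][j])
    return bin(acc).count("1")

print(count_tuple((7, 0, 1, 4)))            # -> 2
```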

4 PINE (Podium Incremental Neighborhood Evaluator) uses pTrees for Closed k Nearest Neighbor Classification (CkNNC). First, 3NN using horizontal data to classify an unclassified sample a = (0 0 0 0 0 0) on attributes a5, a6, a11, a12, a13, a14; the class attribute is a10 = C. Training set (attributes a1..a20):

Key a1 a2 a3 a4 a5 a6 a7 a8 a9 C a11 a12 a13 a14 a15 a16 a17 a18 a19 a20
t12  1  0  1  0  0  0  1  1  0 1  0   1   1   0   1   1   0   0   0   1
t13  1  0  1  0  0  0  1  1  0 1  0   1   0   0   1   0   0   0   1   1
t15  1  0  1  0  0  0  1  1  0 1  0   1   0   1   0   0   1   1   0   0
t16  1  0  1  0  0  0  1  1  0 1  1   0   1   0   1   0   0   0   1   0
t21  0  1  1  0  1  1  0  0  0 1  1   0   1   0   0   0   1   1   0   1
t27  0  1  1  0  1  1  0  0  0 1  0   0   1   1   0   0   1   1   0   0
t31  0  1  0  0  1  0  0  0  1 1  1   0   1   0   0   0   1   1   0   1
t32  0  1  0  0  1  0  0  0  1 1  0   1   1   0   1   1   0   0   0   1
t33  0  1  0  0  1  0  0  0  1 1  0   1   0   0   1   0   0   0   1   1
t35  0  1  0  0  1  0  0  0  1 1  0   1   0   1   0   0   1   1   0   0
t51  0  1  0  1  0  0  1  1  0 0  1   0   1   0   0   0   1   1   0   1
t53  0  1  0  1  0  0  1  1  0 0  0   1   0   0   1   0   0   0   1   1
t55  0  1  0  1  0  0  1  1  0 0  0   1   0   1   0   0   1   1   0   0
t57  0  1  0  1  0  0  1  1  0 0  0   0   1   1   0   0   1   1   0   0
t61  1  0  1  0  1  0  0  0  1 0  1   0   1   0   0   0   1   1   0   1
t72  0  0  1  1  0  0  1  1  0 0  0   1   1   0   1   1   0   0   0   1
t75  0  0  1  1  0  0  1  1  0 0  0   1   0   1   0   0   1   1   0   0

The scan seeds the area for the 3 nearest nbrs with the first three tuples and then compares each remaining tuple, replacing a current neighbor only when strictly closer:

Key a5 a6 C a11 a12 a13 a14  distance from a = 000000
t12  0  0 1  0   1   1   0   2
t13  0  0 1  0   1   0   0   1
t15  0  0 1  0   1   0   1   2

t16..t75 are scanned in turn (distances 2, 4, 4, 3, 3, 2, 3, 2, 1, 2, 2, 3, 2, 2: "don't replace" in every case but one); only t53 = (0 0 0 0 1 0 0), at distance 1, replaces a current neighbor. Vote of the resulting 3NN set: C=1 wins!

5 Next, Closed 3NN (C3NN) using horizontal data: a second pass is necessary to find all other voters that are at distance <= 2 from a. Unclassified sample: a = (0 0 0 0 0 0); same training set as the previous slide. 3NN set after the 1st scan:

Key a5 a6 C a11 a12 a13 a14  distance
t12  0  0 1  0   1   1   0   2
t13  0  0 1  0   1   0   0   1
t53  0  0 0  0   1   0   0   1

The second scan includes every other tuple at distance 2 (t15, t16, t33, t51, t55, t57, t72, t75: "include it also"), skips those at distance 3 or 4 ("don't include"), and does not re-count those that already voted. The full closed-3NN vote is then C=1: 5 votes (t12, t13, t15, t16, t33) vs. C=0: 6 votes (t51, t53, t55, t57, t72, t75). C=0 wins now!
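A sketch of the two-pass closed-3NN vote of slides 4-5, assuming Python; the seven columns below are the slide's (a5, a6, C, a11, a12, a13, a14) values, and the helper names are mine.

```python
# Closed 3NN: find the distance of the 3rd-nearest neighbor, then let EVERY
# training tuple within that distance vote (ties are not dropped).
TRAIN = {  # key: (a5, a6, C, a11, a12, a13, a14)
    "t12": (0,0,1,0,1,1,0), "t13": (0,0,1,0,1,0,0), "t15": (0,0,1,0,1,0,1),
    "t16": (0,0,1,1,0,1,0), "t21": (1,1,1,1,0,1,0), "t27": (1,1,1,0,0,1,1),
    "t31": (1,0,1,1,0,1,0), "t32": (1,0,1,0,1,1,0), "t33": (1,0,1,0,1,0,0),
    "t35": (1,0,1,0,1,0,1), "t51": (0,0,0,1,0,1,0), "t53": (0,0,0,0,1,0,0),
    "t55": (0,0,0,0,1,0,1), "t57": (0,0,0,0,0,1,1), "t61": (1,0,0,1,0,1,0),
    "t72": (0,0,0,0,1,1,0), "t75": (0,0,0,0,1,0,1),
}
sample = (0, 0, 0, 0, 0, 0)          # values for a5, a6, a11, a12, a13, a14

def dist(row):                       # Hamming distance, class column skipped
    feats = row[:2] + row[3:]
    return sum(f != s for f, s in zip(feats, sample))

d = sorted(dist(r) for r in TRAIN.values())[2]    # 3rd-nearest distance (k=3)
voters = [r for r in TRAIN.values() if dist(r) <= d]
ones = sum(r[2] for r in voters)                  # C=1 votes among voters
print(len(voters), "voters; C =", int(ones > len(voters) - ones))
# -> 11 voters; C = 0   (plain 3NN on the first scan had said C=1)
```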

6 PINE: a Closed 3NN method using pTrees (vertical data structures). Each attribute column becomes one bit vector over the keys t12 t13 t15 t16 t21 t27 t31 t32 t33 t35 t51 t53 t55 t57 t61 t72 t75:

a1 = 11110000000000100    a6 = 00001100000000000
a2 = 00001111111111000    a7 = 11110000001111011
a3 = 11111100000000111    a8 = 11110000001111011
a4 = 00000000001111011    a9 = 00000011110000100
a5 = 00001111110000100    C = a10 = 11111111110000000

(and similarly for a11..a20). 1st, pTree-based C3NN goes as follows: let all training points at distance=0 vote, then distance=1, then distance=2, ..., until >= 3 votes are cast. For distance=0 (exact matches), construct the pTree Ps by ANDing, for each of the six relevant attributes, the attribute pTree if the sample bit is 1, else its complement; then AND with P_C and P_C' to compute the vote. Here Ps = 00000000000000000: no neighbors at distance=0.

7 pTree-based C3NN: find all distance=1 neighbors. Construct the pTree

P_S(s,1) = OR_i P_i, where P_i = P_{|s_i - t_i| = 1; |s_j - t_j| = 0, j != i} = OR_{i in {5,6,11,12,13,14}} P_{S(s_i,1) and S(s_j,0) for j in {5,6,11,12,13,14} - {i}}

i.e., OR together the six pTrees that match the sample in all but exactly one of a5, a6, a11, a12, a13, a14 (each term is an AND over P5, P6, P11, P12, P13, P14 or their complements, with exactly one factor complemented relative to the distance=0 case). For our sample, P_D(s,1) has 1-bits exactly at t13 and t53, so two votes are cast (one for C=1, one for C=0), still fewer than 3, so continue to distance=2.

8 pTree-based C3NN, distance=2 neighbors: OR all double-dimension interval pTrees:

P_D(s,2) = OR_{i,j} P_{i,j}, i,j in {5,6,11,12,13,14}, where P_{i,j} = P_{S(s_i,1) and S(s_j,1) and S(s_k,0) for k in {5,6,11,12,13,14} - {i,j}}

i.e., the 15 pairs P_5,6, P_5,11, P_5,12, P_5,13, P_5,14, P_6,11, P_6,12, P_6,13, P_6,14, P_11,12, P_11,13, P_11,14, P_12,13, P_12,14, P_13,14. Partway through these pairs we already have 3 nearest nbrs and could quit and declare C=1 the winner; completing all pairs yields the full C3NN set, and we can declare C=0 the winner! PINE = CkNN in which all training samples vote weighted by their nearness to a (~Olympic podiums).
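A sketch of the distance-ring construction of slides 6-8, assuming Python, with pTrees as integer bitmasks (bit r = training tuple r, t12 = bit 0); the a11..a14 vectors are read off the slide-4 table, and the ring/voting helpers are my naming.

```python
from itertools import combinations

def bits(s):                  # "0101..." -> int, first char = bit 0 (t12)
    return int(s[::-1], 2)

N = 17
ALL = (1 << N) - 1
A = {                          # attribute pTrees over t12..t75
    5:  bits("00001111110000100"), 6:  bits("00001100000000000"),
    11: bits("00011010001000100"), 12: bits("11100001110110011"),
    13: bits("10011111001001110"), 14: bits("00100100010011001"),
}
C = bits("11111111110000000")  # class pTree P_C
s = {i: 0 for i in A}          # unclassified sample a = (0,...,0)

def ring(d):
    """P_D(s,d): OR over all ways to differ from s in exactly d attributes."""
    acc = 0
    for flip in combinations(A, d):
        term = ALL
        for i, p in A.items():
            match = p if s[i] == 1 else ALL & ~p   # bit set where t_i == s_i
            term &= (ALL & ~match) if i in flip else match
        acc |= term
    return acc

ones = zeros = 0
for d in range(7):             # widen the ring until >= 3 votes are cast
    r = ring(d)
    ones += bin(r & C).count("1")
    zeros += bin(r & (ALL & ~C)).count("1")
    if ones + zeros >= 3:
        break
print("C =", int(ones > zeros))   # closed vote over the final ring -> C = 0
```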

9 FAUST using impure pTrees (ipTrees). All pTrees are defined by row-set predicates (T/F on any row set). E.g., on T(A,B,C), the units bit-slice pTree of T.A using the predicate "> 60% 1-bits" is true iff > 60% of the A-values are odd. To cluster the IRIS dataset (150 iris flower samples: 50 setosa, 50 versicolor, 50 virginica; downloadable from the UCI Data Repository), we use 2-level 60% ipTrees, with each upper-level bit representing the predicate truth applied to 10 consecutive iris samples. FAUST clusters perfectly using only level-1 (an order of magnitude smaller bit vectors, so faster processing!). level-1 values:

            SL  SW  PL  PW
setosa      38  38  14   2
setosa      50  38  15   2
setosa      50  34  16   2
setosa      48  42  15   2
setosa      50  34  12   2
versicolor   1  24  45  15
versicolor  56  30  45  14
versicolor  57  28  32  14
versicolor  54  26  45  13
versicolor  57  30  42  12
virginica   73  29  58  17
virginica   64  26  51  22
virginica   72  28  49  16
virginica   77  30  48  22
virginica   67  26  50  19

Level-1 means: overall 54.2 30.8 35.8 11.6; setosa 47.2 37.2 14.4 2; versicolor 45 27.6 41.8 13.6; virginica 70.6 27.8 51.2 19.2. Sorted class means and gaps per attribute:

SL: ve 45, se 47.2 (gap 2.2), vi 70.6 (gap 23.4)
SW: ve 27.6, vi 27.8 (gap .2), se 37.2 (gap 9.4)
PL: se 14.4, ve 41.8 (gap 27.4), vi 51.2 (gap 9.4)
PW: se 2, ve 13.6 (gap 11.6), vi 19.2 (gap 5.6)

E.g., for the PW units slice: the 150 level_0 raw bits [raw bit dump omitted] compress to level_1 = s10gt60_P_PW,1 = 111111110001011 (each of the 15 level_1 bits strides 10 raw bits) and level_2 = s150_s10_gt60_P_PW,1 = 1 (the level_2 bit strides the 150 level_0 bits). FAUST (Fast Accurate Unsupervised, Supervised Treemining) uses pTrees for classification and clustering of spatial data.
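A sketch of forming one level-1 bit vector of such a 2-level gt60% ipTree, assuming Python; the sample bits are illustrative, not the slide's exact PW slice.

```python
# Each level-1 bit answers: "are more than 60% of the bits in this
# stride-10 leaf segment 1-bits?" (the slide's gt60% predicate).
def level1(bits, stride=10, threshold=0.60):
    return [int(sum(chunk) > threshold * len(chunk))
            for chunk in (bits[i:i + stride]
                          for i in range(0, len(bits), stride))]

raw = [1]*8 + [0]*2 + [1]*6 + [0]*4      # two illustrative stride-10 leaves
print(level1(raw))                       # -> [1, 0]: 8/10 passes, 6/10 fails
```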

10 FAUST (simplest version): for each attribute (column), 1. calculate the mean of each class; 2. sort those means ascending; 3. calc mean_gaps = differences of consecutive means (gap_L is the gap on the low side of a mean, gap_H on the high side); 4. choose the best class and attribute for cutting, i.e., the largest mean_gap RELATIVE(ly), cut the gap at its midpoint, and remove the cut-off class. Steps 1-3 were done on the previous slide. First cut: PW, with gap 11.6 above setosa's mean 2, so c_H = 2 + 11.6/2 = 7.8:

CLASS       PW
setosa       2
versicolor  15 14 13 12
virginica   17 22 16 22 19

(perfect on setosa!) Removing setosa, the sorted means and gaps become SL: ve 45, vi 70.6 (gap 25.6); SW: ve 27.6, vi 27.8 (gap .2); PL: ve 41.8, vi 51.2 (gap 9.4); PW: ve 13.6, vi 19.2 (gap 5.6). Cut SL at c_H = 45 + 25.6/2 = 57.8:

CLASS       SL
versicolor   1 56 57 54 57
virginica   73 64 72 77 67

(perfect classification of the rest!) FAUST using impure pTrees (ipTrees), page 2.
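A sketch of this cut procedure, assuming Python. The slide does not define "relatively"; gap divided by the lower mean is assumed here, which happens to reproduce both of the slide's choices (PW at 7.8, then SL at 57.8).

```python
# Repeatedly pick the attribute/class-pair with the largest relative gap,
# cut at the gap midpoint, and peel off the class below the cut.
MEANS = {  # attribute -> {class: level-1 mean} (from slide 9)
    "SL": {"se": 47.2, "ve": 45.0, "vi": 70.6},
    "SW": {"se": 37.2, "ve": 27.6, "vi": 27.8},
    "PL": {"se": 14.4, "ve": 41.8, "vi": 51.2},
    "PW": {"se": 2.0,  "ve": 13.6, "vi": 19.2},
}

def best_cut(means, classes):
    best = None                         # (score, attr, cut, class_below)
    for attr, cm in means.items():
        ms = sorted((cm[c], c) for c in classes)
        for (m_lo, c_lo), (m_hi, c_hi) in zip(ms, ms[1:]):
            score = (m_hi - m_lo) / m_lo          # assumed "relative" gap
            if best is None or score > best[0]:
                best = (score, attr, m_lo + (m_hi - m_lo) / 2, c_lo)
    return best

classes = {"se", "ve", "vi"}
while len(classes) > 1:                 # one class peeled off per cut
    _, attr, cut, below = best_cut(MEANS, classes)
    print(f"cut {attr} at {cut:.1f}: {below} falls below")
    classes.remove(below)
# -> cut PW at 7.8: se falls below
# -> cut SL at 57.8: ve falls below
```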

11 In the previous two FAUST slides, three-level 60% ipTrees were used (leaves are level=0, root is level=2), with each level=1 bit representing the predicate truth applied to 10 consecutive iris samples (leaf bits; i.e., level=1 stride=10). Below, instead of taking the entire 150 IRIS samples, 24 from each class are used as training samples (every other one in the list of 50); the 60% is replaced by 50%, and level=1 stride=10 is replaced first with stride=12, then with stride=24 (i.e., just a root above 3 leaf strides, one per class). Resulting level-1 values:

stride=12 (each of the 2 level=1 bits per class strides 12 of the 24):
se 51 38 15  0      se 50 34 14  2
ve 57 28 45 14      ve 63 30 40  8
vi 72 28 49 18      vi 69 30 48 22

stride=24 (each level=1 bit strides all 24):
se 51 34 15  2
ve 57 30 41 14
vi 73 30 49 22

Note: the means (averages) are almost the same in all cases. Conclusion: uncompressed 50% ipTree root truth values are close to the mean? FAUST using impure pTrees (ipTrees), page 3.

12 Can ipTree construction be done during the [one-time] construction of the basic pTrees? Node naming: (level, offset(left-to-right)); e.g., the lower-left corner node is (0,0), and the array of nodes at level=L is [L,*]. pTree naming: S_{n-1}_..._S_1_S_0_gteX%_ipTree for an n-level ipTree with predicate gteX%, where S = stride = the number of leaf bits strided by a node at that level; if it is a basic pTree, subscripts specify attribute and bit slice. E.g., for bit slice R11, the 8_4_2_1_gte50%_ipTree_11 and the binary pure1 pTree_11 = 8_4_2_1_gte100%_ipTree_11 differ only in the predicate. Note on bottom-up ipTree construction: one must record the 1-count of the stride of each inode. E.g., in binary trees, if one child is 1 and the other is 0, it could be that the 1-child is pure1 and the 0-child is just below 50% (so parent_node = 1), or that the 1-child is just above 50% and the 0-child has almost no 1-bits (so parent_node = 0) (example on the next slide). This can be done during the one pass through each bit slice required for bottom-up construction of pure1 pTrees.

13 Bottom-up ipTree construction (R11 changed so that this issue of recording 1-counts as you go is pertinent). The two ambiguous cases: 1. the 1-child is pure1 and the 0-child is just below 50% (so parent_node = 1); 2. the 1-child is just above 50% and the 0-child has almost no 1-bits (so parent_node = 0). In the 8_4_2_1_gte50%_ipTree_11: 0 or 1? The 1-count of the left branch = 1 and the 1-count of the right branch = 0, so the stride=4 subtree 1-count = 1 (< 50%): node = 0. We know the 1-count of the right branch = 0 (pure0), but we wouldn't know the 1-count of the left branch unless it was recorded. 0 or 1? At the root we need to know that the left branch 1-count = 1 and the right branch 1-count = 3, so this stride=8 subtree 1-count = 4 (>= 50%): node = 1. Finally, note that recording the 1-counts as we build the tree upwards is a near-zero-extra-cost step.
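A sketch of bottom-up gte50% ipTree construction with recorded 1-counts, assuming Python; the 8-bit slice is illustrative (the slide's exact R11 is not recoverable from the transcript) but chosen to reproduce the counts discussed above.

```python
# A parent's truth bit cannot be derived from its children's truth bits
# alone, but it can from their recorded 1-counts.
def build_ip_tree(bits, threshold=0.5):
    """Return levels bottom-up; each node is (one_count, stride, truth_bit)."""
    level = [(b, 1, b) for b in bits]
    levels = [level]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            count = sum(n[0] for n in pair)        # recorded 1-count
            stride = sum(n[1] for n in pair)
            nxt.append((count, stride, int(count >= threshold * stride)))
        level = nxt
        levels.append(level)
    return levels

# Left half holds 1 one, right half holds 3: root is 1 (4 of 8 >= 50%),
# while the left stride-4 node is 0 despite having a 1-valued child.
for lvl in reversed(build_ip_tree([1, 0, 0, 0, 1, 1, 1, 0])):
    print([n[2] for n in lvl])     # truth bits, root first
```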

14 Three ways to model entities and relationships. Relational model: Items (i1..i5), People (p1..p5, with attributes such as |0 100|A|M|), Terms (t1..t6), and a Relationship table of (person, item, term) triples. DataCube model for 3 entities (items, people and terms): a people x items x terms cube. RoloDex model: 2 entities, many relationship "cards", e.g., cust-item card, author-doc card, term-doc card, term-term card (share stem?), doc-doc card, exp-gene card, gene-gene card (ppi), exp-PI card, customer-rates-movie card (and customer-rates-movie-as-5 card), people-course enrollments, itemset-itemset card (ItemSet antecedent: Supp(A) = CusFreq(ItemSet); Conf(A -> B) = Supp(A u B)/Supp(A)). MYRRH (pTree-based ManY-Relationship-Rule Harvester) uses pTrees for ARM of multiple relationships.

15 MYRRH_2e_2r (note: standard pARM is MYRRH_2e_1r): 2 entities, 2 relationships, e.g., Rate5(Cust,Book) or R5(C,B), and Purchase(Book,Cust) or P(B,C), over C = {2,3,4,5} and B = {1,2,3,4}:

R5(C,B): 0001 / 0010 / 0001 / 0100      P(B,C): 1001 / 0111 / 1000 / 1100

Rule: if cust c rates book b as 5, then c purchases b; for b in B, {c | rate5(c,b)=y} is a subset of {c | purchase(c,b)=y}. Thresholds:
ct(R5pTree_b & PpTree_b) / ct(R5pTree_b) >= mncnf
ct(R5pTree_b) / sz(R5pTree_b) >= mnsp
Speed of AND: R5pTreeSet & PpTreeSet computes each ct(R5pTree_b & PpTree_b), i.e., slice counts for all b in B with one AND. Schema: size(C) = size(R5pTree_b) = size(BpTree_b) = 4; size(B) = size(R5pTree_c) = size(BpTree_c) = 4. Pre-computed 1-counts make the tests cheap: BpTree_c: 3 2 1 2; R5pTree_c: 1 1 1 1; BpTree_b: 2 3 1 2; R5pTree_b: 0 1 1 2; R5pTree_b & PpTree_b: 0 1 0 1.

More generally, given e in E: if R(e,f), then S(e,f), with ct(R_e & S_e)/ct(R_e) >= mncnf and ct(R_e)/sz(R_e) >= mnsp, and with quantified variants over sets A, B of E-values:
If for all e in A R(e,f), then for all e in B S(e,f):  ct(AND_{e in A} R_e & AND_{e in B} S_e) / ct(AND_{e in A} R_e) >= mncnf ...
If for all e in A R(e,f), then for some e in B S(e,f):  ct(AND_{e in A} R_e & OR_{e in B} S_e) / ct(AND_{e in A} R_e) >= mncnf ...
If for some e in A R(e,f), then for all e in B S(e,f):  ct(OR_{e in A} R_e & AND_{e in B} S_e) / ct(OR_{e in A} R_e) >= mncnf ...
If for some e in A R(e,f), then for some e in B S(e,f):  ct(OR_{e in A} R_e & OR_{e in B} S_e) / ct(OR_{e in A} R_e) >= mncnf ...

Consider 2 customer classes, Class1 = {C=2|3} and Class2 = {C=4|5}. Then P(B,C) is a training set:

C\B  1 2 3 4
2    1 0 1 1
3    0 1 0 1
4    0 1 0 0
5    1 1 0 0

and the DiffSup table is:

B=1 B=2 B=3 B=4
 0   1   1   2

Book=4 is very discriminative of Class1 and Class2 (e.g., Class1 = salary > $100K). For P1 = {B=1|2} and P2 = {B=3|4}: C1: 0 1; C2: 1 0; DS: 1 1. P1 [and P2, B=2 and B=3] is somewhat discriminative of the classes, whereas B=1 is not. Are "Discriminative Patterns" covered by ARM? E.g., does the same information come out of strong-rule mining? Does "DP" yield information across multiple relationships, e.g., determining the classes via the other relationship?
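A sketch of the per-book confidence/support test, assuming Python, with pTrees as integer bitmasks (bit i = customer 2+i) taken from the R5 and P matrices above; the mncnf/mnsp values are illustrative.

```python
# Rule "if c rates b as 5 then c purchases b", checked per book with one AND.
SZ = 4                                             # size(C)
R5 = {1: 0b0000, 2: 0b1000, 3: 0b0010, 4: 0b0101}  # R5pTree_b: {c | rate5(c,b)}
P  = {1: 0b1001, 2: 0b1110, 3: 0b0001, 4: 0b0011}  # PpTree_b: {c | purchase(c,b)}
mncnf, mnsp = 0.5, 0.25                            # assumed thresholds

for b in R5:
    supp_ct = bin(R5[b]).count("1")
    if supp_ct / SZ < mnsp:               # support prune: too few 5-raters
        continue
    conf = bin(R5[b] & P[b]).count("1") / supp_ct
    if conf >= mncnf:
        print(f"rate5 -> purchase holds for book {b} (conf {conf:.2f})")
```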

16 MYRRH_2e_3r: 2 entities, 3 relationships, e.g., Rate1(Cust,Book) or R1(C,B), Purchase(Book,Cust) or P(B,C), and Sell(Cust,Book) or S(B,C). Rule: if cust c rates book b as 1 and c purchases b, likely c sells b at term end. For b in B, {c | R1(c,b)=y & P(c,b)=y} is a subset of {c | S(c,b)=y}:
ct(R1pTree_b & PpTree_b & SpTree_b) / ct(R1pTree_b & PpTree_b) >= minconf

3e_3r: 3 entities (student, book, course) and 3 relationships (Buy(Student,Book), Text(Book,Course), Enroll(Student,Course)). Rule: if students buy b and courses use b as text, do those students enroll in those courses? {(s,c) | Buy(s,b)=y & Text(b,c)=y} is a subset of {(s,c) | Enroll(s,c)=y}:
ct(EpTreeSubSet(BpTree_b x TpTree_b)) / (ct(BpTree_b) * ct(TpTree_b)) > mncf

3e_2r: Rate5(Student,Course) or R5(S,C), and PurchHardCov(Book,Student) or PHC(B,S). Rule: if a student s rates any course as 5, then s purchases a hardcover book.

4e_4r: add Offering and Location relationships: if s Enrolls in c, and c is Offered at L, and L uses Text = b, then s Buys b.

Collapsing: any 2 adjacent relationships can be collapsed into 1: RP(c,e) iff R(c,b) and P(b,e) for some b. By doing so, we have a whole new relationship to analyze. Given c, {b | R(c,b)} is List(P_{R,c}); for b in List(P_{R,c}), {e in C | P(b,e)} is List(P_{P,b}); therefore {e | RP(c,e)} = OR_{b in List(P_{R,c})} P_{P,b}.

17 Collapsed (composite) relationships among S=STUDENT, B=BOOK, C=COURSE, given P=PURCHASE(S,B), E=ENROLL(S,C), T=TEXT(C,B). Let T_c = the C-pTree of T for C=c, with list = {b | T(c,b)}. Then:
PT = PURCHASE_TEXT(S,C): PT_c = OR_{b in List(T_c)} P_b; also PT_s = OR_{b in List(P_s)} T_b.
ET = ENROLL_TEXT(S,B): ET_s = OR_{c in List(E_s)} T_c; also ET_b = OR_{c in List(T_b)} E_c.
PE = PURCHASE_ENROLL(C,B): PE_c = OR_{s in List(E_c)} P_s; also PE_b = OR_{s in List(P_b)} E_s.
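A sketch of collapsing two adjacent relationships by OR-ing pTrees, assuming Python; the matrices and bit conventions here are illustrative, not the slide's exact ones.

```python
# Relationships as dicts: key -> bitmask over the far entity.
# RP(c,e) holds iff R(c,b) and P(b,e) for some middle-entity value b.
def compose(R, P):
    """RP_c = OR of P_b over b in List(P_{R,c}); bit b of R[c] selects P[b]."""
    RP = {}
    for c, r_bits in R.items():
        acc = 0
        for b in P:                      # b ranges over the middle entity
            if (r_bits >> b) & 1:        # b is in List(P_{R,c})
                acc |= P[b]
        RP[c] = acc
    return RP

R = {0: 0b1000, 1: 0b0100, 2: 0b1000, 3: 0b0010}   # R(C,B), bit b = book b
P = {0: 0b1001, 1: 0b1110, 2: 0b0001, 3: 0b0011}   # P(B,C), bit e = customer e
print({c: bin(v) for c, v in compose(R, P).items()})  # the collapsed RP(C,C)
```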

18 PGP-D (Pretty Good Protection of Data) protects vertical pTree data. Example key = array(offset,pad): 5,54 | 7,539 | 87,3 | 209,126 | 25,896 | 888,23 | ...

pTrees are compressed, data-mining-ready vertical data structures which need not be uncompressed to be used. With PGP-D, to get pTree info you need the ordering (the mapping of bit position to table row) and the predicate (e.g., the table column id and bit-slice number or bitmap involved). PGP-D is a mechanism that "scrambles" pTree information (predicate info, but possibly also ordering info) in such a way that the data can be processed without unscrambling. For data-mining purposes, the scrambled pTrees reveal nothing of the raw data to anyone, but a qualified person can issue a data-mining request (classification/ARM/clustering). This is different from encrypting: individual pTrees are intact/unaltered.

The Predicate Key (PK) reveals the pTree predicates (for basic pTrees, e.g., the "predicate" specifies which column and which bit position). Make all pTrees (over the entire [distributed] DB) the same length; pad in the front [and the back?] so that statistics cannot reveal the pTree start position; and scramble the locations of the pTrees. For basic pTrees, the PK would reveal offset and pre-pad: the example PK above reveals that the 1st pTree is found at offset=5 (it has been shuffled forward 5 pTree slots, of the slots reserved for that table) and that its first 54 bits are pad bits. If the DB had 5,000 files with 50 columns each (on average) and each column had 32 bits (on average), we would have 8 million pTrees. We could pad with statistically indistinguishable additions to make it impossible to try enough alternatives in human time to break the key.

An additional thought: in the distributed case (multiple sites), since we'd want lots of pTrees, it would make sense to always fully replicate (making all retrievals local). Thus we are guaranteed that all pTrees are statistically "real looking" (because they ARE real), and we might not need to pad with bogus pTrees.

A hacker could extract just the first bit of every pTree (e.g., the 8M bits that constitute the first horizontal record), then shuffle those bits until something meaningful appears (or starts to appear); from all meaningful shuffles (and likewise the 2nd, 3rd, etc. bits), he or she might be able to break the key code. To get around that possibility, we could store the entire database as one massive "Big Bit String" and include in the Predicate Key the start offset of each pTree (shuffled randomly), together with a column giving the [randomly determined, now variable] amount of padding, so that the position of the first start bits is unknowable. Alternatively, we could use a common length but leave random "non-pTree" gaps between pTrees; or the key could simply specify the start address (and length?) of each pTree. We could also construct a large collection of bogus key-lookup-tables (identifying the correct one to the authorized subgroup only) as an additional layer; encrypt?

For multiple users at different levels of security, with rights to parts of the DB and not others, we would have a separate key for each user level. Using the key would be simple and quick, and once the key is applied, accessing and processing the data comes at zero additional time cost (the current thinking is that we would not encrypt or otherwise alter the pTrees themselves, just their identity). One would only need to work on the "key mechanism" to improve the method's speed and protection level.

Some data collections need not be protected in their entirety. Protection needs tend to run by column rather than by row (i.e., it is usually the case that certain attributes are sensitive and others are routine public information), and pTrees are good for column protection. When protection levels do differ by row (subsets of instances of the entity require different protection levels), we would simply create each subset as a separate "file" (all of the same massive length, through padding) and protect each at the proper level.
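A minimal sketch of applying such a predicate key, assuming Python and a fixed slot length; all names here (fetch_ptree, SLOT_BITS) are mine, not the authors'.

```python
SLOT_BITS = 1 << 10                       # assumed fixed pTree slot length

def fetch_ptree(big_bit_string, key, i):
    """Return pTree i's payload bits, given key[i] = (offset, pad):
    offset locates the shuffled slot, pad strips the front padding."""
    offset, pad = key[i]
    slot = big_bit_string[offset * SLOT_BITS:(offset + 1) * SLOT_BITS]
    return slot[pad:]                     # the rest is the real bit vector

key = [(5, 54), (7, 539), (87, 3)]        # the slide's example key prefix
store = "01" * (SLOT_BITS * 100)          # stand-in for the shared bit store
print(fetch_ptree(store, key, 2)[:8])     # first payload bits of pTree 2
```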

19 ROLL CC: data items requested for [read and/or write] access by a transaction are represented in a REQUEST VECTOR (RV), a bit vector. Each data item is mapped to a bit position (or it can be assumed that the ordering is the table ordering); a 1-bit at a position indicates that the item is requested by the transaction, and a 0-bit means it is not. If read and write modes are distinguished, ROLL uses a read-bit and a write-bit for each item.

ROLL has 3 basic methods:
POST (allows a transaction to request its data-item needs). POST is an atomic "enqueue" operation (only atomicity is required; it is the only critical section): the POST of the next RV_T(i+1) is done by copying tail_ptr to RV_T(i+1)_ptr and then resetting tail_ptr to RV_T(i+1). POSTs can be batched so that low-priority transaction POSTs are delayed in favor of higher-priority ones.
CHECK (determines requested data-item availability). CHECK returns the logical OR of all RVs behind it; the result is called the "Access Vector" or AV. A background, ever-running process can be creating and attaching AVs to each RV (repeatedly ORing RVs from the head toward the tail); then a transaction's CHECK need only proceed until it encounters (ORs in) an AV which specifies new item availability. Re-CHECKing can be done at any time. A CHECK beginning at RV_T(j) ORs the next RVs, moving toward the head (for maximum recency; else it can just check its own AV), building an AV_T(j), until it determines sufficient availability; then it suspends the CHECK and begins processing the newly available data items (though it may go all the way to the head before suspending). It can also maintain the list of RVs blocking its access, so that its next CHECK need only OR those RVs to get an AV_T(j) (or check only those AVs).
RELEASE: sets some or all of a transaction's 1-bits to 0-bits as the corresponding data items are no longer needed (in RV_T(j)).

Designate a separate ROLL for each partition, or use multi-level pTrees where the upper level is the file level. RVs and AVs are pTrees with the same structure as the basic pTrees representing the data in the file itself: e.g., for an image file where the ordering of tuples (pixels) is Peano or Z ordering, the RV and AV would (except for the top file level) indicate pixel access needs with the same pTree structure (1 means "need that pixel"). So the ROLL elements (RVs and AVs) are just coded record-level bit slices (or trees, in the multi-level pTree case). AVs for each POSTed RV would be created by a background process in reverse POST order (time-stamped?). As soon as a CHECK encounters an AV which provides additional accesses not previously available to that transaction, it can stop the CHECK and use those items; or it can continue, ignoring the AV and ORing only the RVs it encounters, to gain a larger set of available items (this makes sense if the timestamp is old and/or an entire set of accesses is required to make any progress at all, e.g., an entire file). A record is "available" iff the entire record, i.e., every field, is available; a field is available iff its record and that field are available. Scheduling is First Come First Served, except that low-priority transactions are delayed for incoming high-priority transactions; a read-only data mine ignores concurrency altogether.
ConCur (Concurrency Control) uses pTrees for ROCC and ROLL concurrency control.
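A sketch of the POST/CHECK/RELEASE cycle, assuming Python, with RVs as integer bitmasks (one bit per data item); the class and method names are mine, not the authors'.

```python
# CHECK ORs the RVs posted ahead of a transaction into an Access Vector;
# the transaction may use any requested item not set in that AV.
class Roll:
    def __init__(self):
        self.queue = []                  # POST order: head first

    def post(self, rv):                  # the atomic enqueue (critical section)
        self.queue.append(rv)
        return len(self.queue) - 1       # transaction's queue position

    def check(self, pos):
        """Requested items not blocked by any earlier-posted RV."""
        av = 0
        for rv in self.queue[:pos]:      # AV = OR of the RVs ahead of pos
            av |= rv
        return self.queue[pos] & ~av

    def release(self, pos, items):       # clear 1-bits of finished items
        self.queue[pos] &= ~items

r = Roll()
t0 = r.post(0b1010)                      # T0 wants items 1 and 3
t1 = r.post(0b0110)                      # T1 wants items 1 and 2
print(bin(r.check(t1)))                  # -> 0b100: item 2 free, 1 blocked
r.release(t0, 0b0010)                    # T0 is done with item 1
print(bin(r.check(t1)))                  # -> 0b110: both now available
```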

20 DOVE (DOmain VEctors) uses pTrees for DB query processing. Domain Vectors (DVs) are bitmaps representing the presence of a domain's values; the mapping which assigns domain-vector positions to domain values is the Domain Vector Table (DVT). Given a domain D for a field, e.g., D = {3-letter strings} for a name field, the DVT is:

nam | surrogate
====|==========
aaa | 0
aab | 1
...
aaz | 25
...
zzz | 17575

Then an attribute R.A in a relation R has Domain Vector DV(R.A) = (0010100100110...0), with a 1-bit in the nth position iff the domain value with surrogate n occurs in R.A. E.g., DV(CUSTOMER.nam) has 1-bits at surrogates 1886 ("JAN" is the 1886th domain value, i.e., has surrogate 1886), 1897 ("JAY"), 3289 ("JON"), and 13395 ("SUE").

The DV accelerator method is as follows. Keep DVs for some fields (particularly primary keys and frequently joined attributes). Note: to reduce the size of these vectors, surrogate the "extant domain" (the currently appearing domain values), assigning the next surrogate to each new value.

Insert of a new record: i. form a Modify Vector (MV), e.g., if ABE joins the buying club, form the MV with a 1 in the 31st position and 0 elsewhere; ii. OR the MV into the DV.
Delete of a tuple (assuming the field value was not duplicated): i. form the MV for the deleted value (e.g., ABE drops membership); ii. XOR the MV into the DV.
Join: i. materialize the primary DV; ii. logically AND the other DV into it, producing a JOIN VECTOR (we note that a JV is a key-value-sorted list of matches); iii. apply the JV to each file index, producing surrogate lists; iv. sort the surrogate lists, read the files, sort them, and merge-join (this should minimize page reads and page faults). A nested loop is efficient since all records match, but inefficient rereading of pages may occur; step iv is a guess for sparse joins.
Projection: depth-first retrieval on the index (already optimal).
Selection: i. form a Select Vector (SV), with a 1 for all values to be selected (if the filter is a logical combination of key ranges, form key-range vectors and use the corresponding logical ops (OR, AND, NOT)); e.g., SELECT ALL CUSTOMERS STARTING WITH J: SV = (0..01..10..0) with 1s from surrogate 6760 through 7436; ii. logically AND the DV into the SV; iii. apply the SV to the file index, producing a surrogate list; iv. sort the surrogate list, read the file.

http://web.cs.ndsu.nodak.edu/~perrizo/classes/765/dvex1.html
http://web.cs.ndsu.nodak.edu/~perrizo/classes/765/qpo.html
http://web.cs.ndsu.nodak.edu/~perrizo/classes/765/dvex0.html
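A sketch of DV maintenance and join/selection filtering, assuming Python ints as bitmaps (bit n = domain surrogate n); the surrogate mapping below follows the DVT above (aaa=0, ..., zzz=17575), though the slide's extant-domain surrogates differ, and the helper names are mine.

```python
def surrogate(name):                     # 3-letter name -> 0..17575
    a, b, c = (ord(ch) - ord('a') for ch in name.lower())
    return (a * 26 + b) * 26 + c

def dv(values):                          # build a Domain Vector
    v = 0
    for x in values:
        v |= 1 << surrogate(x)           # insert: OR a Modify Vector in
    return v

cust = dv(["jan", "jay", "jon", "sue"])
sales = dv(["jay", "sue", "zoe"])
jv = cust & sales                        # Join Vector: surrogates in both

# Select Vector for names starting with 'j': 1-bits from "jaa" to "jzz".
lo, hi = surrogate("jaa"), surrogate("jzz")
sv = ((1 << (hi + 1)) - 1) ^ ((1 << lo) - 1)

match = jv & sv                          # joined customers starting with 'j'
print(match.bit_length() - 1 == surrogate("jay"))   # -> True
```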

