Slide 1
PDR PTreeSet Distribution Revealer
An SpTS, S, in a PTreeSet produces a Distribution Tree, DT(S). Let b ≡ BitWidth(S), h = depth, k = node offset. Node_h,k has a pointer to the pTree of {x∈S | F(x) ∈ [k·2^(b-h), (k+1)·2^(b-h))} together with its 1-count; depth=0 is the root node (the full range [0, 2^b)), depth=1 splits it in half, and so on.

[Figure: example DT built over the SPTS xofM for the sample points (pa=(13,4), pb=(10,9), pc=(11,10), pd=(9,11), pe=(11,11), pf=(7,8), ...), showing the bit slices p6..p0, their complements p6'..p0', and the per-node interval 1-counts, e.g., p6': 5/64 on [0,64), p6: 10/64 on [64,128), p5': 3/32 on [0,32), p5: 2/32 on [32,64), down to the width-8 intervals.]

Pre-compute and enter into the ToC all DT(Y_k), plus the DTs for selected Linear Functionals (e.g., d = main diagonals, ModeVector). Suggestion: in our pTree-base, every pTree (basic, mask, ...) should be referenced in the ToC as (pTree, pTreeLocationPointer, pTreeOneCount), and these OneCounts should be repeated everywhere (e.g., in every DT as defined above). The reason is that these OneCounts help us select the pertinent pTrees to access, and in fact are often all we need to know about a pTree to get the answers we are after.
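The DT definition above amounts to counting, per node, how many of the column's values fall in that node's interval. A minimal sketch under that reading; build_dt and the dict-of-levels representation are my own illustration, not the author's implementation:

```python
# Minimal sketch (assumption: function name and dict-of-levels layout are
# illustrative, not the author's implementation).
def build_dt(values, bit_width, max_depth):
    """Count, per DT level h, how many values fall in each interval
    [k*2**(bit_width-h), (k+1)*2**(bit_width-h))."""
    counts = {}
    for h in range(max_depth + 1):
        width = 2 ** (bit_width - h)        # interval width at depth h
        level = [0] * (2 ** h)
        for v in values:
            level[v // width] += 1          # node offset k = v // width
        counts[h] = level
    return counts

# Example: the xofM column from the figure (only 13 of its values are legible).
xofM = [11, 27, 23, 34, 53, 80, 118, 114, 125, 110, 121, 109, 83]
print(build_dt(xofM, bit_width=7, max_depth=1)[1])   # [5, 8] for the 13 listed values
```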
Slide 2
Assume an SpTS (a vertical column of numbers in bit-slice format) produced by a linear functional, F(y) = y∘d for some unit vector d (e.g., d = e_k). Cut at gaps, but more generally at, e.g., all >25% count changes. Why? Big Data may produce no gaps (remote clusters fill in local gaps). However, remote clusters will almost never fill in the local cluster gaps seamlessly (with no abrupt change in count distribution): "the linear projection of any cluster boundary will produce a noticeable count change" and "every noticeable count change reveals a cluster boundary". Note that a gap is just two consecutive count changes (to 0 and back). The 25% is a parameter to be studied (varied by dataset?). What we see from this preliminary look at the SEEDS dataset (from the UCI MLR) is that there are few gaps but many count changes, which reveal sub-cluster info. It looks like 25% is a better parameter value than 45% for SEEDS.

Value distribution of Column 1 of SEEDS (value : count), classes lo / me / hi:
11:18  12:25  13:18  14:18  15:15  16:13  17:8  18:8  19:21  20:2  21:4
25% cuts -> [11,11] [12,12] [13,16] [17,18] [19,19] [20,20] [21,21]
Note: NO GAPS, but several >25% count changes: 6 cut points, 7 clusters.

SEEDS_2:  5:68  6:75  7:7
          25% cuts -> [5,6] [7,7]
SEEDS_3:  1:11  2:26  3:30  4:42  5:24  6:9  7:5  8:3
          25% cuts -> [1,1] [2,3] [4,4] [5,5] [6,6] [7,7] [8,8]
          45% cuts -> [1,1] [2,5] [6,8]
SEEDS_4:  5:99  6:51
          25% and 45% cuts -> [5,5] [6,6]

With the 45% rule (cut if a count and its successor count differ by >45% of the high count):
SEEDS_1:  [11,18] [19,19] [20,21]
SEEDS_2:  [5,6] [7,7]
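The cut rule is easy to state in code. Below is a minimal sketch, assuming (inferred from the worked SEEDS numbers) that "high count" means the larger of the two adjacent counts; it reproduces the 25% clustering of SEEDS column 1 shown above, though the 45% examples may use a slightly different normalization. The function name find_clusters is mine, not the slides'.

```python
# Hedged sketch of the "cut at >t count changes" rule. Assumption (inferred
# from the SEEDS examples): a cut is placed between consecutive values v, v+1
# when |count(v) - count(v+1)| > t * max(count(v), count(v+1)).
def find_clusters(dist, t=0.25):
    """dist: list of (value, count) pairs over consecutive values, no gaps."""
    clusters, lo = [], dist[0][0]
    for (v, c), (nv, nc) in zip(dist, dist[1:]):
        if abs(c - nc) > t * max(c, nc):      # noticeable count change -> cut
            clusters.append((lo, v))
            lo = nv
    clusters.append((lo, dist[-1][0]))
    return clusters

seeds1 = [(11, 18), (12, 25), (13, 18), (14, 18), (15, 15), (16, 13),
          (17, 8), (18, 8), (19, 21), (20, 2), (21, 4)]
print(find_clusters(seeds1, 0.25))
# -> [(11,11), (12,12), (13,16), (17,18), (19,19), (20,20), (21,21)]
```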
Slide 3
Cut at all 25% Count Changes
[Results table: the >25% count-change cuts applied recursively across the SEEDS, IRIS, WINE and CONCRETE columns. A cluster found on one attribute is re-cut on another, e.g., C3 is cut using SEEDS attribute 2 into C31 and C32, C31 and C3 are re-cut using attribute 3, and C311 and C321 using attribute 4; each entry lists the sub-cluster's value/count distribution, the cut intervals, and per-class counts. Recoverable notes: one pair of C311 values is only 3.32 apart (a small gap) with counts of 1 and 1, so no count change is detected; in one WINE step "none of the 6 L were isolated"; CONC1/4 is labeled "hopeless!". Recoverable accuracy figures scattered through the table: ACCR 100%; 140/150 = 93.3% to 96.67% (~= GM); 132/150 = 88% (~5% > GM); 135/150 = 90% (~10% > GM).]
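The table applies the cuts recursively: each cluster found on one attribute is re-projected onto the next attribute and cut again. Using the hypothetical find_clusters helper from the earlier sketch, and assuming integer-binned attribute values as in these distributions, that loop might look like the following (my framing of the procedure, not the author's code):

```python
# Illustrative only: my framing of the recursive procedure the table documents,
# reusing the hypothetical find_clusters() from the earlier sketch.
from collections import Counter

def recursive_cuts(rows, attrs, t=0.25):
    """rows: list of dicts; attrs: ordered attribute names to cut on."""
    if not attrs or len(rows) < 2:
        return [rows]
    a, rest = attrs[0], attrs[1:]
    vals = [r[a] for r in rows]
    counts = Counter(vals)
    dist = [(v, counts.get(v, 0)) for v in range(min(vals), max(vals) + 1)]
    out = []
    for lo, hi in find_clusters(dist, t):          # cut on this attribute,
        sub = [r for r in rows if lo <= r[a] <= hi]
        if sub:
            out.extend(recursive_cuts(sub, rest, t))   # then recurse on the next
    return out
```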
Slide 4
So there is potential in analyzing the more general concept of Functional Distribution Changes, FDCs (gaps are pairs of consecutive FDCs). Let's now walk through storing SEEDS, CONCRETE, IRIS and WINE in a pTreeDataBase and using that pDB efficiently in this clustering algorithm (possibly requiring only the pre-computed DT info from the Catalogue and never requiring access to the pTrees themselves?). The basic pDB storage object will be a PTreeSet or PTS, which is a sequence of ScalarPTreeSets or SPTSs, each of which is a sequence of pTrees, each of which is a BitArray (compressed or uncompressed, multilevel or single-level). A SPTS is a PTS with SequenceLength = 1. A pTree is a SPTS with SequenceLength = 1. So, an alternative definition: a PTS is a pTreeSequence and a BitWidthSequence, PTS(pS, BWS), where Length(pS) = Σ_b BWS_b. A SPTS has associated with it in the pDB Catalogue its DT, its Avg, its Median, its Minimum and its Maximum; a pTree has associated with it in the pDB Catalogue its 1-count. We note that if all DTs are full (built down to singleton intervals), the DT contains every pre-computation, but such DTs are massive structures!
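A minimal sketch of how these objects and their Catalogue entries could be laid out, assuming plain Python containers; the class and field names are mine, and a real pDB would use compressed, possibly multilevel bit arrays rather than lists:

```python
# Illustrative layout only (assumption: class and field names are mine, and a
# real pDB would use compressed, multilevel bit arrays, not Python lists).
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class PTree:                      # a BitArray; the ToC also repeats its 1-count
    bits: List[int]
    def one_count(self) -> int:
        return sum(self.bits)

@dataclass
class SPTS:                       # one column: a sequence of pTrees, one per bit
    slices: List[PTree]           # slices[0] = most significant bit
    catalogue: Dict[str, object] = field(default_factory=dict)  # DT, Avg, Median, Min, Max

@dataclass
class PTS:                        # a sequence of SPTSs; Length(pS) = sum of bit widths
    columns: List[SPTS]

def spts_from_column(values: List[int], bit_width: int) -> SPTS:
    """Vertically bit-slice an integer column and fill its Catalogue stats."""
    slices = [PTree([(v >> (bit_width - 1 - i)) & 1 for v in values])
              for i in range(bit_width)]
    col = SPTS(slices)
    col.catalogue.update(Avg=sum(values) / len(values),
                         Median=sorted(values)[len(values) // 2],
                         Min=min(values), Max=max(values))
    return col
```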
Slide 5
[Figure: the SPAETH toy data Y, 15 points y1..y9, ya..yf (e.g., ya=(13,4), yb=(10,9), yc=(11,10), yd=(9,11), ye=(11,11), yf=(7,8)), its projection SPTS yofM = 11, 27, 23, 34, 53, 80, 118, 114, 125, 110, 121, 109, 83, ... (as far as legible), the bit slices p6..p0 and complements p6'..p0', and the DT interval 1-counts, e.g., p6': 5/64 on [0,64), p6: 10/64 on [64,128), p5': 3/32 on [0,32), p5: 2/32 on [32,64).]
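The interval annotations in figures like this one (e.g., p6': 5/64 on [0,64)) are just 1-counts of bit slices, or of ANDs of the top slices and their complements for deeper nodes. A hedged sketch of that correspondence; the function name and list-of-lists encoding are illustrative:

```python
# Sketch of how the figure's interval counts arise from pTree 1-counts
# (assumption: the function name and list-of-lists encoding are illustrative).
def interval_one_count(slices, k, h):
    """slices[i][j] = bit i (MSB first) of value j; count of values in DT node
    (h, k), i.e. in [k*2**(b-h), (k+1)*2**(b-h)) for b-bit values."""
    mask = [1] * len(slices[0])
    for i in range(h):                         # AND the top h slices/complements
        want = (k >> (h - 1 - i)) & 1          # bit i of k picks p_i or p_i'
        mask = [m & (bit if want else 1 - bit) for m, bit in zip(mask, slices[i])]
    return sum(mask)

# 7-bit example: node (h=1, k=0) is [0,64), node (1,1) is [64,128).
vals = [11, 27, 23, 34, 53, 80, 118, 114, 125, 110, 121, 109, 83]  # legible part
slices = [[(v >> (6 - i)) & 1 for v in vals] for i in range(7)]
print(interval_one_count(slices, 0, 1), interval_one_count(slices, 1, 1))  # 5 8
```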
Slide 6
Assume a MainMemory pDB and that each bit position has an address.
[Figure: the SPAETH points plotted on a coordinate grid, and the ToC for the Spaeth MMpDB: for each attribute (A1 bit slices p13..p10, A2 bit slices p23..p20) and for each pre-computed linear functional y∘d (d = DIAGnnxx, DIAGnxxn, furthest-from-Avg dfA, Avg-to-Median dAM; bit slices p3..p0), the ToC holds a pTrees array (labels a..o), a Count array (the 1-counts), and a LOCATION POINTER array, plus a KEY listing pTree addresses and pad lengths.]

The data portion of SpaethMMpDB is 495 bits with 24 15-bit pTrees (360 data bits, 135 red pad bits). The key is 24 pTrees + 24 pad lengths of 5 bits each (or just randomly generate the array and send the seed?). If there are 15 trillion rows (not just 15), the green ToC stays the same size, the key stays the same size, and the data array grows to roughly 480 Tb = 60 TB, or smaller since the pads can stay small (~30 TB?). Next, put the DTs in the ToC?
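A minimal sketch of that main-memory layout under my own naming: the pTrees packed into one flat, addressable bit array, a ToC entry holding a location pointer and 1-count per pTree, and a key of the pad lengths inserted between pTrees. Only the padding idea and the (pointer, 1-count) ToC come from the slide; the rest is illustrative:

```python
# Illustrative MMpDB packing (assumption: names and details are mine; the
# pad-between-pTrees idea and the (pointer, 1-count) ToC come from the slide).
import random

def pack_ptrees(ptrees, max_pad=31):           # 5-bit pad lengths -> pad <= 31
    """Pack bit-lists into one flat addressable bit array.
    Returns (bits, toc, pad_key): toc holds a location pointer and 1-count per
    pTree; pad_key holds the pad length written after each pTree."""
    bits, toc, pad_key = [], [], []
    for p in ptrees:
        toc.append({"loc": len(bits), "one_count": sum(p)})
        bits.extend(p)
        pad = random.randint(0, max_pad)       # random filler bits between pTrees
        pad_key.append(pad)
        bits.extend(random.randint(0, 1) for _ in range(pad))
    return bits, toc, pad_key

def read_ptree(bits, toc, i, length=15):       # each Spaeth pTree is 15 bits
    loc = toc[i]["loc"]
    return bits[loc:loc + length]
```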
Slide 7
[Figure: the SPTS y∘dfA = 1.34, 3.13, 2.68, 4.02, 5.36, 9.39, 13.8, 13.4, 14.7, 12.9, 14.2, 9.82, 15, ... (as far as legible), shown with the same bit-slice (p6..p0, p6'..p0') and DT interval-count layout as the earlier figures.]