This suggests a clustering method:


1 This suggests a clustering method:
Value/count arrays of F = (y-M)o(x-M)/|x-M| - min, restricted to a cosine cone, on IRIS (s = setosa, e = versicolor, i = virginica):

x=s1 cone=1/√2: 60 3 61 4 62 3 63 10 64 15 65 9 66 3 67 1 69 2 (total 50)
x=s2 cone=1/√2: 47 1 59 2 60 4 61 3 62 6 63 10 64 10 65 5 66 4 67 4 69 1 70 1 (total 51)
x=s2 cone=.9: 59 2 60 3 61 3 62 5 63 9 64 10 65 5 66 4 67 4 69 1 70 1 (total 47)
x=s2 cone=.1: 39 2 40 1 41 1 44 1 45 1 46 1 47 1 i39 59 2 60 4 61 3 62 6 63 10 64 10 65 5 66 4 67 4 69 1 70 1 (total 59)
x=i1 w maxs-to-mins cone=.939: i25 i40 i16 i42 i17 i38 i11 i48 22 2 23 1 i34 i50 i24 i28 i27 27 5 28 3 29 2 30 2 31 3 32 4 34 3 35 4 36 2 37 2 38 2 39 3 40 1 41 2 46 1 47 2 48 1 i39 53 1 54 2 55 1 56 1 57 8 58 5 59 4 60 7 61 4 62 5 63 5 64 1 65 3 66 1 67 1 68 1 (total 114: 14 i and 100 s/e, so picks i as 0)
x=i1 w naaa-xaaa cone=.95: 12 1 13 2 14 1 15 2 16 1 17 1 18 4 19 3 20 2 21 3 22 5 i21 24 5 25 1 27 1 28 1 29 2 i7 (41/43 e, so picks e)

This suggests a clustering method:
1. Find cosine cone gaps emanating from a corner point (or any circumscribing point).
2. "Cap" the cone gap on the open end with a linear gap. (Actually, for IRIS the cap seems unnecessary; the cone gaps themselves seem to separate the three classes.)

Figure labels: corner points; gap in dot product projections onto the corner-points line.
x=e1 cone=.707: 33 1 36 2 37 2 38 3 39 1 40 5 41 4 42 2 43 1 44 1 45 6 46 4 47 5 48 1 49 2 50 5 51 1 52 2 54 2 55 1 57 2 58 1 60 1 62 1 63 1 64 1 65 2 (total 60)
x=i1 cone=.707: 34 1 35 1 36 2 37 2 38 3 39 5 40 4 42 6 43 2 44 7 45 5 47 2 48 3 49 3 50 3 51 4 52 3 53 2 54 2 55 4 56 2 57 1 58 1 59 1 60 1 61 1 62 1 63 1 64 1 66 1 (total 75)
x=i1 w maxs cone=.707: 0 2 8 1 10 3 12 2 13 1 14 3 15 1 16 3 17 5 18 3 19 5 20 6 21 2 22 4 23 3 24 3 25 9 26 3 27 3 28 3 29 5 30 3 31 4 32 3 33 2 34 2 35 2 36 4 37 1 38 1 40 1 41 4 42 5 43 5 44 7 45 3 46 1 47 6 48 6 49 2 51 1 52 2 53 1 55 1 (total 137)
x=i1 w maxs cone=.93: 8 1 i10 13 1 14 3 16 2 17 2 18 1 19 3 20 4 21 1 24 1 25 4 e21 e34 27 2 29 2 i7 (27/29 are i's)
x=i1 w aaan-aaax cone=.54: 7 3 i27 i28 8 1 9 3 i20 i34 11 7 12 13 13 5 14 3 15 7 19 1 20 1 21 7 22 7 23 28 24 6 (100/104 s or e, so 0 picks i)
x=i1 w maxs cone=.925: 8 1 i10 13 1 14 3 16 3 17 2 18 2 19 3 20 4 21 1 24 1 25 5 e21 e34 27 2 28 1 29 2 e35 i7 (31/34 are i's)
x=i1 w xnnn-nxxx cone=.95: 8 2 i22 i50 10 2 i28 i24 i27 i34 13 2 14 4 15 3 16 8 17 4 18 7 19 3 20 5 21 1 22 1 23 1 i39 (43/50 e, so picks out e)

Figure label: cosine cone gap (over some angle).

Cosine conical gapping seems quick and easy. Cosine = dot product divided by both lengths. The length of the fixed vector, x-M, is a one-time calculation. The length of y-M changes with y, so build the PTreeSet. But we can't divide PTreeSets yet!
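The cone test described above can be sketched in plain Python. This is a minimal illustration, not the pTree implementation: it assumes the cone has its apex at M and its axis along x-M (as the formula for F suggests), and the names `cone_projections`, `find_gaps`, and `min_cos` are hypothetical. Unlike the slides, it does not subtract the minimum or round F to integers.

```python
import math

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def sub(a, b):
    return [u - v for u, v in zip(a, b)]

def cone_projections(points, x, M, min_cos):
    """Return (F, y) for every y whose cosine with the axis x-M is >= min_cos.

    F is the projection length (y-M)o(x-M)/|x-M|.
    """
    axis = sub(x, M)
    axis_len = math.sqrt(dot(axis, axis))   # one-time calculation
    kept = []
    for y in points:
        v = sub(y, M)
        v_len = math.sqrt(dot(v, v))        # changes with every y
        if v_len == 0:
            continue                        # y == M has no direction
        proj = dot(v, axis) / axis_len
        if proj / v_len >= min_cos:         # cosine = dot / both lengths
            kept.append((proj, y))
    return kept

def find_gaps(values, min_gap):
    """Return (low, high) pairs of consecutive distinct values more than min_gap apart."""
    vs = sorted(set(values))
    return [(a, b) for a, b in zip(vs, vs[1:]) if b - a > min_gap]

# Hypothetical 2-D data: three points near the axis, one far off it.
pts = [(1, 0), (2, 0.1), (6, 0.2), (0, 5)]
kept = cone_projections(pts, x=(10, 0), M=(0, 0), min_cos=0.9)
print(find_gaps([f for f, _ in kept], min_gap=2))
```

The per-point division by |y-M| in the loop is exactly the step the slide flags as unavailable for PTreeSets.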

2 Squared y on f Projection Distance = $y\circ y - \frac{(y\circ f)^2}{f\circ f}$
The downside of [capped] cone clustering is that we need to divide by the PTreeSet |y|. So far we can't do that (without a loop)? Instead of a "capped cone", a better shape might be a "[double] capped tube". For a fixed point, f, and a variable point, y, we need, in addition to the dot product projection length, the dot product projection distance as well (shown in red on the slide). Here $\circ$ (written "o" on the slides) is the dot product.

Dot product projection length of y on f: $\frac{y\circ f}{|f|}$

Dot product projection distance squared:
$\left(y - \frac{y\circ f}{f\circ f}f\right)\circ\left(y - \frac{y\circ f}{f\circ f}f\right) = y\circ y - 2\frac{(y\circ f)^2}{f\circ f} + \frac{(y\circ f)^2}{(f\circ f)^2}\,f\circ f = y\circ y - \frac{(y\circ f)^2}{f\circ f}$

Now replace the origin by a corner point (or some other circumscribing hyper-rectangle point), p; that is, replace y with y-p and replace f (the furthest point or mean point) with M-p:

Squared y-p on M-p Projection Distance = $(y-p)\circ(y-p) - \frac{((y-p)\circ(M-p))^2}{(M-p)\circ(M-p)} = y\circ y - 2\,y\circ p + p\circ p - \left(y\circ\frac{M-p}{|M-p|} - p\circ\frac{M-p}{|M-p|}\right)^2$

1st: compute the constant [vector] $\frac{M-p}{|M-p|}$.
2nd: compute the PTreeSet $y\circ\frac{M-p}{|M-p|} - p\circ\frac{M-p}{|M-p|}$ (1 dot, 1 minus). The dot product length projections (caps) already needed it, since $(y-p)\circ\frac{M-p}{|M-p|} = y\circ\frac{M-p}{|M-p|} - p\circ\frac{M-p}{|M-p|}$.
3rd: compute the PTreeSets in $y\circ y - 2\,y\circ p + p\circ p$ (2 dots, 1 minus, 1 plus). Do not compute y-p (it shifts the entire vector space)?

That is, we needed to compute the green constants and the blue and red dot product functionals in an optimal way (and then do the PTreeSet additions/subtractions/multiplications). What is optimal? (Minimizing PTreeSet functional creations and PTreeSet operations.)

Figure labels: gaps in dot product lengths [projections] on the line; cap gap width; tubular gap width; origin (or p).
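As a quick numerical check of the identity above, the sketch below computes the squared projection distance two ways: directly from y-p and M-p, and from the expanded form that never builds y-p. It is plain Python over single points, not PTreeSets, and the function names `spd_direct` and `spd_expanded` are hypothetical.

```python
import math

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def spd_direct(y, p, M):
    """(y-p)o(y-p) - ((y-p)o(M-p))^2 / ((M-p)o(M-p)), forming y-p explicitly."""
    yp = [a - b for a, b in zip(y, p)]
    mp = [a - b for a, b in zip(M, p)]
    return dot(yp, yp) - dot(yp, mp) ** 2 / dot(mp, mp)

def spd_expanded(y, p, M):
    """yoy - 2yop + pop - (yod - pod)^2 with d = (M-p)/|M-p|; y-p is never formed."""
    mp = [a - b for a, b in zip(M, p)]
    length = math.sqrt(dot(mp, mp))
    d = [c / length for c in mp]            # 1st: the constant unit vector
    proj = dot(y, d) - dot(p, d)            # 2nd: 1 dot (per y), 1 minus
    return dot(y, y) - 2 * dot(y, p) + dot(p, p) - proj ** 2   # 3rd

y, p, M = (3, 4), (1, 1), (5, 1)
print(spd_direct(y, p, M), spd_expanded(y, p, M))
```

For this hypothetical point, y-p = (2, 3) and M-p = (4, 0), so both forms give 13 - 4 = 9.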

3 $SPD_{pM}(y) = y\circ y - 2\,y\circ p + p\circ p - \left(y\circ\frac{M-p}{|M-p|} - p\circ\frac{M-p}{|M-p|}\right)^2$
There are three functionals in the "dot product" group for "functional gap clustering" of a VectorSpace subset, Y (y ∈ Y):
1. $SD(y) = (y-p)\circ(y-p)$, p a fixed vector: the "Square Distance from a point", primarily for outlier identification and densities.
2. $PL(y) = y\circ d$, d a unit vector: the "Projected Length" functional.
3. $SPD(y) = y\circ y - (y\circ d)^2$, d a unit vector: the "Square Projection Distance" functional.

E.g., if $d \equiv \frac{M-p}{|M-p|}$, the unit vector from vector p to vector M, then
$SPD(y-p) = (y-p)\circ(y-p) - \frac{((y-p)\circ(M-p))^2}{(M-p)\circ(M-p)}$.
But to avoid creating an entirely new VectorPTreeSet(Y-p) for the space (with origin shifted to p), we think it useful to alter the expression to:
$SPD_{pM}(y) = y\circ y - 2\,y\circ p + p\circ p - \left(y\circ\frac{M-p}{|M-p|} - p\circ\frac{M-p}{|M-p|}\right)^2$
where we might compute:
1st, the constant vector $\frac{M-p}{|M-p|}$;
2nd, the ScalarPTreeSet $y\circ\frac{M-p}{|M-p|}$;
3rd, the constant $p\circ\frac{M-p}{|M-p|}$;
4th, the SPTreeSet $y\circ\frac{M-p}{|M-p|} - p\circ\frac{M-p}{|M-p|}$;
5th, the SPTreeSet of its square;
6th, the constant $p\circ p$;
7th, the SPTreeSets $y\circ y$ and $y\circ p$;
8th, the SPTreeSet $SPD_{pM}(y)$.
Is it better to leave all the additions and subtractions for one mega-step at the end? (Md?) Other efficiency thoughts?

We note that PL shares many construction steps with SPD, since $(y-p)\circ\frac{M-p}{|M-p|} = y\circ\frac{M-p}{|M-p|} - p\circ\frac{M-p}{|M-p|}$.

Algorithm?:
1. Remove "edge outliers" (use SD to check each point touching the circumscribing rectangle).
2. Look for thin gaps in SPD using a unit vector, d, from an edge point (e.g., CirRectCorner) to the Mean.
3. For each thin gap in 2, use PL to find projected length gaps of points within the "thin-gap bounded tube".
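Steps 2 and 3 of the proposed algorithm can be sketched as follows. This is an illustrative reading, not the authors' method: it works on plain tuples rather than PTreeSets, shifts points by p explicitly (which the slides advise against for pTrees), and leaves step 1 (SD-based edge-outlier removal) to the caller. The names `split_by_gap`, `tube_clusters`, `tube_gap`, and `length_gap` are hypothetical.

```python
import math

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def SD(y, p):
    """Square distance from the point p."""
    diff = [a - b for a, b in zip(y, p)]
    return dot(diff, diff)

def PL(y, d):
    """Projected length of y on the unit vector d."""
    return dot(y, d)

def SPD(y, d):
    """Square projection distance of y from the line through the origin along d."""
    return dot(y, y) - dot(y, d) ** 2

def split_by_gap(items, key, min_gap):
    """Sort items by key and start a new group wherever consecutive keys differ by > min_gap."""
    ordered = sorted(items, key=key)
    groups = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if key(cur) - key(prev) > min_gap:
            groups.append([])
        groups[-1].append(cur)
    return groups

def tube_clusters(points, p, M, tube_gap, length_gap):
    mp = [m - q for m, q in zip(M, p)]
    norm = math.sqrt(dot(mp, mp))
    d = [c / norm for c in mp]                       # unit vector from p to M
    shifted = {pt: [a - b for a, b in zip(pt, p)] for pt in points}
    clusters = []
    # Step 2: thin gaps in SPD separate tubes around the p-to-M line.
    for tube in split_by_gap(points, lambda pt: SPD(shifted[pt], d), tube_gap):
        # Step 3: within each tube, gaps in PL separate clusters along the line.
        clusters.extend(split_by_gap(tube, lambda pt: PL(shifted[pt], d), length_gap))
    return clusters

# Hypothetical data: two groups on the line, one group far from it.
pts = [(1, 0), (2, 0), (10, 0), (11, 0), (1, 8), (2, 8)]
print(tube_clusters(pts, p=(0, 0), M=(5, 0), tube_gap=10, length_gap=3))
```

Points must be hashable (tuples) because they are used as dictionary keys; both gap thresholds are free parameters that the slides do not fix.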

4 Algorithm?:
1. Remove edge outliers using $SD(y) = (y-p)\circ(y-p)$ (check each point touching the circumscribing rectangle), with threshold > 2*50.
2. Look for thin gaps in SPD using a unit vector, d, from an edge point (e.g., CirRectCorner) to the Mean.
3. For each thin gap in 2, use PL to find projected length gaps of points within the "thin-gap bounded tube".

x=s (58=avg(y1))
V Ct members
0 3 s15, s17, s34
1 12 s 6,11,16,19,20,22,28,32,33,37,47,49
2 12 s 1,10,13,18,21,27,29,40,41,44,45,50
3 7 s 2,12,23,24,35,36,38
4 10 s 2,3,7,13,25,26,30,31,46,48
5 2 s4, s43
6 2 s9, s39
7 1 s14
8 1 i39
9 1 s32
(all 50 setosa + i39)

e49 16 2 17 2 19 1 20 2 21 5 22 4 23 3 24 4 25 1 27 8 28 2 29 2 30 4 31 1 32 4 34 2 35 2 36 2 37 3 38 2 39 2 40 4 41 1 43 2 44 4 45 2 46 1 47 2 48 1 50 4 52 2 53 2 54 2 56 2 57 1 i1 i31
9 virginica: i10 i8 i36 i32 i16 i18 i23 i19

5 FxM(x,y) = yo(x-M)/|x-M| - min on X×X ≡ {(x,y) | x,y ∈ X}, where X(x,y) is a Spaeth image table
Cluster by splitting at all F_gaps > 2.

[Figure: Spaeth image of the 15 points z1..z15 (hex labels 1..f) with the mean vector M (slide row: 9 5); the 15 Value_Arrays and the 15 Count_Arrays, one for each x; Level0, stride=z1 PointSet (as a pTree mask).]

F values for x=z1:
y: z1 z2 z3 z4 z5 z6 z7 z8 z9 z10 z11 z12 z13 z14 z15
F: 14 12 12 11 10 6 1 2 0 2 2 1 2 0 5

Gaps: 10-6 and 5-2, giving the pTree masks of the 3 z1_clusters, z11, z12, z13 (obtained by ORing).

The FAUST algorithm:
1. Project onto each Mx line using the dot product with the unit vector from M to x (only x=z1 is shown).
2. Generate each Value Array, F[x](y), x ∈ X (also generate the Count_Arrays and the mask pTrees).
3. Analyze all gaps and create sub-cluster pTree masks.
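Step 3 for a single stride can be sketched in a few lines of Python, using the F values listed on the slide for x=z1. This is an illustration with ordinary dictionaries and lists rather than pTree masks; `value_and_count_arrays` and `split_at_gaps` are hypothetical names.

```python
from collections import Counter

def value_and_count_arrays(f_values):
    """Return the sorted distinct F values and how many points take each value."""
    counts = Counter(f_values.values())
    values = sorted(counts)
    return values, [counts[v] for v in values]

def split_at_gaps(f_values, min_gap=2):
    """Cluster point names by cutting the sorted F values at every gap > min_gap."""
    ordered = sorted(f_values.items(), key=lambda kv: kv[1])
    clusters = [[ordered[0][0]]]
    for (_, prev_f), (name, f) in zip(ordered, ordered[1:]):
        if f - prev_f > min_gap:
            clusters.append([])
        clusters[-1].append(name)
    return clusters

# F(z1, y) for y = z1..z15, as given on the slide.
F_Z1 = dict(zip([f"z{i}" for i in range(1, 16)],
                [14, 12, 12, 11, 10, 6, 1, 2, 0, 2, 2, 1, 2, 0, 5]))

print(value_and_count_arrays(F_Z1))
print(split_at_gaps(F_Z1, min_gap=2))
```

With the slide's threshold of 2, the cuts fall between 2 and 5 and between 6 and 10, reproducing the three z1 clusters: {z7..z14}, {z6, z15}, and {z1..z5}.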

6 Cluster by splitting at gaps > 2
[Figure: yo(z7-M)/|z7-M| ValueArrays and CountArrays, one per z, next to the Spaeth image and the F values for z1 from the previous slide. z1 cluster masks: z11, z12, z13. Gap for z7: 6-9, giving masks z71, z72.]

In Step 3 of the algorithm we can analyze one of the gap arrays (e.g., as done for z1; its subclusters are shown above) and then start over on each subcluster. Or we can analyze all gap arrays concurrently (in parallel, using the same F, saving the [substantial?] re-compute costs?) and then intersect the subcluster partitions we get from each x_ValueArray gap analysis, for the final subclustering. Here we use the second alternative, judiciously choosing only the x's that are likely to be productive (choosing z7 next). Many are likely to produce redundant partitions (e.g., z1, z2, z3, z4, z6), as their projection lines will be nearly coincident. How should we choose the sequence of "productive" strides? One way would be to always choose the remaining stride with the shortest ValueArray, so that the chance of decent-sized gaps is maximized. Other ways of choosing?

7 Cluster by splitting at gaps > 2
[Figure: yo(x-M)/|x-M| Value Arrays and Count Arrays, one per z, next to the Spaeth image and the F values for z1. Masks so far: z11, z12, z13 (from z1) and z71, z72 (from z7). Gap for zd: 3-7, giving masks zd1, zd2.]

We choose zd=z13 next. (Should it have been first, since its ValueArray is shortest?) Note that the z8, z9, and za projection lines will be nearly coincident with that of z7.

8 Cluster by splitting at gaps > 2
[Figure: yo(x-M)/|x-M| Value Arrays and Count Arrays, one per z, next to the Spaeth image and the F values for z1. Masks: z11, z12, z13 (from z1); z71, z72 (from z7); zd1, zd2 (from zd).]

AND each red mask with each blue mask with each green mask to get the subcluster masks (12 ANDs producing 5 sub-clusters).
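The mask intersection can be sketched with Python integers standing in for pTree masks (bit j set means point j belongs to the mask). The function name `intersect_partitions` and the six-point example masks are hypothetical; the actual z1, z7, and zd masks are not recoverable from the transcript.

```python
from itertools import product

def intersect_partitions(partitions):
    """AND one mask from each partition, in every combination; keep the non-empty results."""
    subclusters = []
    for combo in product(*partitions):
        mask = combo[0]
        for other in combo[1:]:
            mask &= other
        if mask:                       # an all-zero mask is an empty subcluster
            subclusters.append(mask)
    return subclusters

# Two hypothetical partitions of six points (bit 0 = first point).
by_stride_a = [0b001111, 0b110000]
by_stride_b = [0b000011, 0b111100]
print([bin(m) for m in intersect_partitions([by_stride_a, by_stride_b])])
```

With partitions of sizes 3, 2, and 2, as on the slide, `product` yields the 12 combinations mentioned there; only the non-empty ANDs survive as subclusters.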

9 F1(x,y) = L1Distance(x,y) = (|x1-y1|+|x2-y2|) on X×X ≡ {(x,y) | x,y ∈ X}
Cluster by splitting at all F1_gaps.

[Figure: L1(x,y) Value Array and Count Array, one row per z, next to the Spaeth image.]

gap: 10-5 (redundant subclustering). There is a z1-gap, but it produces a subclustering that was already discovered by a previous round. Which z values will give new subclusterings?

10 Re-confirms zf an anomaly.
[Figure: L1(x,y) Value Array and Count Array, one row per z, next to the Spaeth image.]

This re-confirms z6 as an anomaly or outlier, since it was already declared so during the linear gap analysis. It also re-confirms zf as an anomaly.

After having subclustered with linear gap analysis, which is best for determining larger subclusters, we run this round gap algorithm out only 2 steps to determine whether there are any singleFvalue gaps > 2 (the points in the singleFvalueGapped set are then declared anomalies). That is, we find those points for which the one initial gap determined by the first two values is sufficient to declare outlierness. Doing that here, we reconfirm the outlierness of z6 and zf, while finding new outliers, z5 and za.
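One way to read the "two steps only" rule: for each point x, the first two values of its L1 value array are 0 (x itself) and the distance to its nearest neighbor, so x is declared an outlier when that first gap exceeds the threshold. The sketch below follows that reading, which is an assumption; it assumes no duplicate points, and the coordinates are hypothetical because the Spaeth image coordinates are not recoverable from the transcript.

```python
def l1(a, b):
    """L1 (Manhattan) distance between two points."""
    return sum(abs(u - v) for u, v in zip(a, b))

def l1_gap_outliers(points, min_gap=2):
    """Names of points whose nearest-neighbor L1 distance exceeds min_gap."""
    outliers = []
    for name, x in points.items():
        nearest = min(l1(x, y) for other, y in points.items() if other != name)
        if nearest > min_gap:          # gap between the first two values, 0 and nearest
            outliers.append(name)
    return outliers

# Hypothetical points: a tight group of three and one isolated point.
pts = {"a": (1, 1), "b": (2, 1), "c": (1, 2), "d": (9, 9)}
print(l1_gap_outliers(pts, min_gap=2))
```

This needs at least two points and costs one pass over all pairs, which is the cost the per-stride Rank procedure on the later slides is meant to avoid.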

11 Using F=yo(x-M)/|x-M|-MIN on IRIS, one stride at a time (s1=setosa1 first)
For virginica1 Val Ct 0 1 1 1 2 2 3 5 4 6 5 11 6 12 7 4 8 2 9 5 10 1 17 1 22 1 24 2 25 1 27 1 28 1 29 2 30 1 31 3 32 4 33 1 34 4 35 2 36 2 37 4 38 4 39 5 40 4 42 6 43 2 44 7 45 5 47 2 48 3 49 3 50 3 51 4 52 3 53 2 54 2 55 4 56 2 57 1 58 1 59 1 60 1 61 1 62 1 63 1 64 1 66 1 F(i39)=17 F<17 (50 Setosa) vers1 Val Ct 0 1 2 4 3 1 4 1 5 3 6 3 7 8 8 3 9 7 10 6 11 4 12 4 13 3 15 2 19 2 20 2 21 1 26 2 27 3 28 4 30 2 31 5 32 4 33 3 34 1 36 3 37 5 38 4 39 5 40 7 41 4 42 2 43 2 44 1 45 6 46 4 47 5 48 1 49 2 50 5 51 1 52 2 54 2 55 1 57 2 58 1 60 1 62 1 63 1 64 1 65 2 F<19 (50 setosa) 19<F<22 {vers8,12,39,44,49} 22<F yo(s1-M)/|s1-M|-69) Val Ct 0 1 3 1 4 2 7 1 8 1 9 2 10 1 12 4 14 5 15 2 16 4 17 1 18 4 19 5 20 1 21 2 22 2 23 8 24 4 25 3 26 2 27 5 28 3 29 4 30 4 31 3 32 2 33 2 34 4 35 5 36 2 37 2 38 1 39 1 40 1 41 1 43 1 44 1 45 1 52 1 60 3 61 4 62 3 63 10 64 15 65 9 66 3 67 1 69 2 F(i39)=52 virginica39 is an outlier. 2 clusters, F<52 (ct=99) and F>52 (50 Setosa) virgini39 Val Ct 0 1 1 2 2 1 4 2 6 1 7 1 8 7 9 2 10 2 11 7 12 2 13 3 14 7 15 4 16 10 17 4 18 6 19 9 20 3 21 6 22 3 23 6 24 3 25 1 27 3 28 2 32 1 39 1 40 1 41 1 42 8 43 13 44 17 45 4 46 5 47 1 F=32 vers49 outlier. 32<F (50 Setosa, vir39) AVG(ver8,12,39,44,49) Val Ct 0 1 1 1 7 5 10 3 12 2 13 2 14 3 15 5 16 2 17 5 18 8 19 4 20 3 21 4 22 3 23 8 24 4 25 4 26 3 27 7 28 7 29 4 30 5 31 4 32 5 33 8 34 2 35 6 36 5 37 3 38 2 39 8 40 6 41 3 43 1 44 2 45 1 47 1 F=0 vir32 outlier F=1 vir18 outlier F=7 vir6,10,19,23,36 subcluster?

12 F=yo(x-M)/|x-M|-MIN on IRIS, subclustering as we go.
On Clus(F<52) ver1 F(virg7)=0 outlier F(virg32)=25 outlier Val Ct 0 1 4 1 5 5 6 3 7 5 8 3 9 8 10 11 11 14 12 8 13 8 14 5 15 3 16 7 17 5 18 6 19 2 20 1 21 1 22 1 25 1 F=yo(x-M)/|x-M|-MIN on IRIS, subclustering as we go. On Remaining, mx mn mx mx Val Ct 0 3 1 4 2 11 3 14 4 14 5 9 6 10 7 2 8 6 9 2 11 2 For s1 (i.e., yo(s1-M)/|s1-M|-69) Val Ct 0 1 3 1 4 2 7 1 8 1 9 2 10 1 12 4 14 5 15 2 16 4 17 1 18 4 19 5 20 1 21 2 22 2 23 8 24 4 25 3 26 2 27 5 28 3 29 4 30 4 31 3 32 2 33 2 34 4 35 5 36 2 37 2 38 1 39 1 40 1 41 1 43 1 44 1 45 1 outlier 60 3 61 4 62 3 63 10 64 15 65 9 66 3 67 1 69 2 F(i39)=52 i39=virgi39 outlier. Clusters, F<52 (ct=99) and F>52 (50 Setosa) On Remaining, max's Val Ct 0 2 e8 outlier 1 2 e11 outlier 7 2 8 1 9 4 10 1 11 2 12 2 13 4 14 3 15 1 16 4 17 2 18 2 19 3 20 4 21 6 22 5 23 5 24 4 25 2 26 2 27 1 28 2 29 4 30 5 31 1 32 3 33 2 34 2 35 3 36 2 37 1 38 1 i8 i10 i36 i6 i23 i19 i18 i6 i8 i10 i19 i23 i35 i i i i i i i6 i10 i18 i19 i23 i35 all declared outliers e4 e38 e19 i20 F e e e outlier i outlier On Remaining, max's Val Ct e44 outlier 6 1 7 2 8 1 9 3 10 1 11 3 12 5 13 2 14 2 15 3 17 3 18 3 19 5 20 1 21 9 22 5 23 4 24 2 26 4 27 2 28 2 29 4 30 2 31 3 32 3 33 2 34 3 35 2 36 1 37 1 38 1 39 1 e36 outlier? 
On Remaining, mx mx mx mn Val Ct 0 1 1 2 2 3 3 1 5 5 6 4 7 5 8 2 9 3 10 5 11 4 12 7 13 5 14 2 15 4 16 4 17 7 18 4 19 4 20 2 21 2 22 1 24 1 25 1 27 2 29 2 On Remaining, mn mn mx mx Val Ct 0 1 1 3 2 3 3 7 4 7 5 7 6 5 7 5 8 3 9 8 10 4 11 4 12 11 13 4 14 8 15 4 16 1 18 1 On Remaining, mn mx mx mx Val Ct 0 1 2 1 3 4 4 3 5 5 6 4 7 5 8 7 9 8 10 3 11 5 12 2 13 4 14 5 15 7 16 5 17 4 18 1 20 1 On Remaining w e35 Val Ct 0 1 i26 outlier 3 2 On remaining vir1 Val Ct 0 1 1 2 2 1 4 1 5 1 6 2 7 2 8 2 9 4 10 1 11 4 12 3 13 4 14 2 15 6 16 4 17 6 19 4 20 5 21 5 22 2 23 1 24 2 25 5 26 4 27 4 28 1 29 2 30 6 31 2 32 1 33 1 34 1 35 2 36 1 38 1 39 1 e35 e10 e e outlier i44 i3 i i ^^outlier i3 i30 i31 i26 i8 i36 i i outlier i outlier i outlier i outlier i outlier Rem mn mx mn mx Val Ct 0 1 1 1 2 1 3 1 4 1 5 1 6 1 8 1 9 3 10 5 11 5 12 3 13 7 14 6 15 4 16 6 17 7 18 5 19 4 20 2 21 3 22 7 23 4 24 3 25 1 26 1 27 2 e49 outlier On Remaining, mn mx mx mx Val Ct 0 1 1 1 2 1 3 5 4 6 5 5 6 4 7 9 8 4 9 4 10 4 11 3 12 5 13 6 14 6 15 7 16 5 17 4 18 4 20 1 22 1 Could look at distances for 0,1 and 20,22? e13 e30 e32 e outlier e outlier e i44 i45 i49 i5 i37 i1 i i i i i not outlier i outlier

13 outliers gap>L1=32.1 s6 s14 s15 s16 s17 s19 s21 s23 s24 s32 s33 s34 s37 s42 s45 e1 e2 e3 e5 e6 e7 e9 e10 e11 e12 e13 e15 e18 e19 e21 e22 e23 e27 e28 e29 e30 e34 e36 e37 e38 e41 e49 i1 i3 i4 i5 i6 i7 i8 i9 i10 i12 i14 i15 i16 i18 i19 i20 i22 i23 i25 i26 i28 i30 i31 i32 i34 i35 i36 i37 i39 i41 i42 i45 i46 i47 i49 i50 outliers gap>L1=42.8 s15 s16 s19 s23 s33 s34 s37 s42 s45 e1 e2 e7 e10 e11 e12 e13 e15 e19 e21 e22 e23 e27 e28 e30 e34 e36 e38 e41 e49 i1 i3 i5 i6 i7 i8 i9 i10 i12 i14 i15 i16 i18 i19 i20 i22 i23 i26 i30 i31 i32 i34 i35 i36 i39 outliers gp>L1=53.5 s15 s16 s23 s33 s34 s42 e10 e13 e15 e27 e28 e30 e36 e49 i1 i3 i7 i9 i10 i12 i15 i18 i19 i20 i26 i30 i32 i35 i36 i39 F=L1(x,y) on IRIS, masking to subclusters (go right down the table). Two rounds only If we use L1gap=6, remove those outliers, then use linear gap analysis for larger subcluster revalation, let's see if we can separate Versicolor (e) from virginica (i). outliers gap>L1=64.3 s15 s16 s23 s42 e10 e13 e49 i3 i7 i9 i10 i18 i19 i20 i32 i35 i36 i39 outliers gap>L1=74.95 L1gap s42 9 e13 8 i7 10 i9 12 i10 12 i35 9 i36 9 i39 26

14 Val=0; p=K; c=0; P=Pure1; For i=n to 0 { c=Ct(P&P_i); If (c>=p) { Val=Val+2^i; P=P&P_i } else { p=p-c; P=P&P'_i } }; return Val, P;

[Table: IDX/IDY pairs over z1..zf with coordinate columns X1..X4, the distance column d(xy), and its bit-slice pTrees P3..P0 with complements P'3..P'0; the cell values were scattered in extraction.]

Need Rank(n-1) applied to each stride instead of the entire pTree. The result from stride=j gives the jth entry of SpS(X, d(x,X-x)). Parallelize over a large cluster? Ct(P&P_i): revise the Count procedure to kick out a count for each stride (involves a loop down the pTree by register-lengths?). What does P represent after each step? How does the algorithm go on 2pDoop (with 2-level pTrees), where each stride is separate? Note: using d, not d^2 (fewer pTrees). Can we estimate d (using a truncated Maclaurin series)?

Traces for four strides (p starts at K=14):

Val = 2^3*0 + 2^2*0 + 2^1*0 + 2^0*1 = 1
n=3: c=Ct(P&P3)=10 < 14, p=14-10=4; P=P&P'3 (elim 10 val≥8)
n=2: c=Ct(P&P2)=1 < 4, p=4-1=3; P=P&P'2 (elim 1 val≥4)
n=1: c=Ct(P&P1)=2 < 3, p=3-2=1; P=P&P'1 (elim 2 val≥2)
n=0: c=Ct(P&P0)=2 >= 1; P=P&P0 (elim 1 val<1)

Val = 2^3*0 + 2^2*0 + 2^1*0 + 2^0*1 = 1
n=3: c=Ct(P&P3)=9 < 14, p=14-9=5; P=P&P'3 (elim 9 val≥8)
n=2: c=Ct(P&P2)=0 < 5, p=5-0=5; P=P&P'2 (elim 0 val≥4)
n=1: c=Ct(P&P1)=4 < 5, p=5-4=1; P=P&P'1 (elim 4 val≥2)
n=0: c=Ct(P&P0)=1 >= 1; P=P&P0 (elim 1 val<1)

Val = 2^3*0 + 2^2*0 + 2^1*0 + 2^0*1 = 1
n=3: c=Ct(P&P3)=9 < 14, p=14-9=5; P=P&P'3 (elim 9 val≥8)
n=2: c=Ct(P&P2)=2 < 5, p=5-2=3; P=P&P'2 (elim 2 val≥4)
n=1: c=Ct(P&P1)=2 < 3, p=3-2=1; P=P&P'1 (elim 2 val≥2)
n=0: c=Ct(P&P0)=2 >= 1; P=P&P0 (elim 1 val<1)

Val = 2^3*0 + 2^2*0 + 2^1*1 + 2^0*1 = 3
n=3: c=Ct(P&P3)=6 < 14, p=14-6=8; P=P&P'3 (elim 6 val≥8)
n=2: c=Ct(P&P2)=7 < 8, p=8-7=1; P=P&P'2 (elim 7 val≥4)
n=1: c=Ct(P&P1)=11, p=1-1=0; P=P&P (elim 0 val2)
n=0: c=Ct(P&P0)=1; P=P&P0 (elim 0)
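The Rank loop above can be run directly if each bit-slice pTree is modeled as a Python integer used as an uncompressed bit vector (bit j of slice i is bit i of the j-th value). That modeling, and the helper names `bit_slices` and `rank_k`, are assumptions for illustration; real pTrees are compressed and counted differently. The function returns the K-th largest value and the mask of rows holding it.

```python
def bit_slices(values, nbits):
    """slices[i] has bit j set when bit i of values[j] is 1."""
    slices = []
    for i in range(nbits):
        mask = 0
        for pos, v in enumerate(values):
            if (v >> i) & 1:
                mask |= 1 << pos
        slices.append(mask)
    return slices

def rank_k(slices, n_rows, k):
    """Return (Val, P): the k-th largest value and the mask of rows equal to it."""
    pure1 = (1 << n_rows) - 1          # Pure1: every row still a candidate
    P, val, p = pure1, 0, k
    for i in range(len(slices) - 1, -1, -1):   # high-order bit first
        hit = P & slices[i]
        c = bin(hit).count("1")        # c = Ct(P & P_i)
        if c >= p:
            val += 1 << i              # the answer has bit i set
            P = hit
        else:
            p -= c                     # skip the c larger values
            P = P & ~slices[i] & pure1 # P = P & P'_i
    return val, P

values = [5, 3, 9, 1, 7]               # hypothetical 4-bit data
slices = bit_slices(values, nbits=4)
print(rank_k(slices, len(values), k=2))
```

After each step, P holds exactly the rows that agree with the answer on all bits examined so far, which is one answer to the slide's question about what P represents.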

15 Level-1 key map Red=pure stride (so no Level-0)
e f g h i a j b c k d m 0 0 13 12 11 10 23 22 21 20 33 32 31 30 43 42 41 40 a b c d e f g h i j k m 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Level-0: key map 13 12 11 10 23 22 21 20 (6-e) = e else pur0 (6-e) = f else pur0 (6-e) = g else pur0 (6-e) = h else pur0 In this 2pDoop KEY-VALUE DB, we list keys. Should we bitmap? Each bitmap is a pTree in the KVDB. Each of these is existing, e.g., e here 5,7-a,f=f else pur0 5,7-a,f=g else pur0 5,7-a,f=h else pur0 234789bcefg els pr0 234789bcefh else pr0 124-79c-f h else pr0 (b-f) = i else pur0 (b-f) = j else pur0 (b-f) = k else pur0 (b-f) = m else pur0 (a) = j else pur0 (a) = k else pur0 (a) = m else pur0 =SpS(XX, -27( p13p33 + p13p32 + p23p43 p23p42 (3-6,8,9) k, els pr0 (3-6,8,9) m els pr0 + p13p31 + 26( p13+p23+p33+p43 +p13p12+ p23p22+ p33p32 + +p43p42 ) -26( p23p41 124679bd m els pr0 25( p13p11+ p23p21 + p33p31 + p43p41 ) -25( p13p30 +p23p40 +p12p31 +p22p41 +p12p32 +p22p42 e f 5 6 g 7 h i a j b c k d m 33 32 31 30 43 42 41 40 24( p12+p22+p32+p42 +p13p10+ +p23p20 +p33p30 +p43p40 -24(p12p30 +p22p40 +p12p11+ +p22p21 +p32p31 +p42p41 ) 23( p12p10+ p22p20 + p32p30 + p42p40 ) -23(p11p31 +p11p30 +p21p41 +p21p40 p11+p21+p31+p41 +p11p10 + +p21p20 + +p31p30 +p41p40 ) -22(p10p30 +p20p40 p10+p20+p30+p40 ) 22(

16 If (Ct ≡ Count(P&P_i) ≥ p)
[Table: RankK traces for strides x = z1 through z5 (15 rows y each). Header columns as on the slide: x, y, yo(x-M) - MIN, ID1, ID2, then V, P, C1 through V, P, Cc for successive rounds. The cell values were not preserved.]

RankK: V=0; p=K; P=Pure1; For i=n..0 { If (Ct ≡ Count(P&P_i) ≥ p) { V=V+2^i; P=P&P_i } else { p=p-Ct; P=P&P'_i } }

RankK reveals the full gap situation for any functional on any vector space. (Here, the functional is the dot product with d = unit vector from the mean to each point.) Use RankK with K = N, N-C1, N-C1-C2, ..., N-C1-...-C_mxL1 to mine useful subcluster information (Ci = Ct(P&P_i) after the ith round, N=|X|=15, mxL1=20).

17 [Table continued: RankK traces for strides x = z6 through z10, with the same columns (x, y, yo(x-M) - MIN, ID1, ID2, then V, P, C1 through V, P, Cc). The cell values were not preserved.]

18 [Table continued: RankK traces for strides x = z11 through z15, with the same columns (x, y, yo(x-M) - MIN, ID1, ID2, then V, P, C1 through V, P, Cc). The cell values were not preserved.]

19 ptree P=Pure1; ptreeSet P[n]; ptree T1,T2,a; ptreeSet c,t,rv,k;
For i=(n-1) to 0 { T1=P&P[i]; T2=P&P[i]'; c=PartialCount(T1); a=Compare(c,k); rv[i]=a; P=(T1&a)|(T2&a'); t=Subtract(k,c); k=(k&a)|(t&a'); }

The difference between the two algorithms is in the method of handling (resetting) P and the parameters (rv[i] ↔ V) and (k ↔ p). Mohammad uses PTreeSets for the array of [real number] parameter values and can then avoid looping through the strides. Which is faster for big data? Should 2-level pTrees be used? If so, which is better?

RankK: V=0; p=K; P=Pure1; For i=n..0 { If (Ct ≡ Count(P&P_i) ≥ p) { V=V+2^i; P=P&P_i } else { p=p-Ct; P=P&P'_i } }
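The per-stride bookkeeping of the first loop (one c, one k, and one result per stride) can be mimicked in plain Python as below. This is only a sketch of the data flow: it keeps the stride state in ordinary lists and still loops over strides explicitly, whereas the slide's version keeps c, k, and rv in PTreeSets precisely to avoid that loop. The name `rank_k_per_stride` and the sample strides are hypothetical, and Compare(c,k) is assumed to mean c ≥ k.

```python
def rank_k_per_stride(strides, nbits, k):
    """For each stride (a list of non-negative ints), return its k-th largest value."""
    n = len(strides)
    P = [(1 << len(s)) - 1 for s in strides]   # per-stride candidate mask, starts pure-1
    remaining = [k] * n                        # the slide's k, one entry per stride
    rv = [0] * n                               # result values, built bit by bit
    for i in range(nbits - 1, -1, -1):
        for j, s in enumerate(strides):
            bit_i = sum(1 << pos for pos, v in enumerate(s) if (v >> i) & 1)
            T1 = P[j] & bit_i                  # candidates with bit i set
            T2 = P[j] & ~bit_i                 # candidates with bit i clear
            c = bin(T1).count("1")             # PartialCount(T1) for this stride
            if c >= remaining[j]:              # a = Compare(c, k), assumed c >= k
                rv[j] |= 1 << i
                P[j] = T1
            else:
                remaining[j] -= c              # t = Subtract(k, c)
                P[j] = T2
    return rv

strides = [[5, 3, 9, 1, 7], [2, 2, 8, 0, 4]]   # hypothetical 4-bit data
print(rank_k_per_stride(strides, nbits=4, k=2))
```

The branch on `c >= remaining[j]` plays the role of the mask `a` in `P=(T1&a)|(T2&a')` and `k=(k&a)|(t&a')`: where a is 1 the first operand is kept, otherwise the second.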

20 [Figure: LEVEL-0 of the PTreeSet yo(z1-M)/|z1-M|, Stride 1, drawn as a binary tree of pTree ANDs with their 1-counts: P3 and P'3 at the first level; P3P2, P3P'2, P'3P2, P'3P'2 at the second; then the eight three-factor products; then the sixteen four-factor products from P3P2P1P0 to P'3P'2P'1P'0.]

If all these pTree ANDs are pre-computed and stored (with their 1-counts) for each stride, the Rank algorithm can be run accessing the counts only. E.g., if n=1,000,000=1M, then N=1T and there are 1M strides to pre-compute ;-( If the bitwidth is 4, then each stride requires these 30 = 2+4+8+16 = 2^5 - 2 pre-computed level-0 pTrees and counts. If the bitwidth is b, each stride requires Σ_{i=1..b} 2^i = 2^(b+1) - 2 pre-computations, e.g., about 8.6×10^9 for b=32, so one would do this only for, say, the high order 8 bits. Descending the tree, 1-bits turn to 0-bits only. Therefore the counts are non-increasing, and the count across any level stays at n = 1M = 10^6.
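The counts-only idea can be sketched by storing, for one stride, the 1-count of every AND combination keyed by its bit prefix (1 for P_i, 0 for P'_i, high-order bit first), and then running Rank against those stored counts. The dictionary layout and the names `prefix_counts` and `rank_k_from_counts` are assumptions made for this illustration.

```python
from itertools import product

def prefix_counts(values, nbits):
    """Map every high-order bit prefix (tuple of 0/1) to the number of values sharing it."""
    counts = {}
    for depth in range(1, nbits + 1):
        for prefix in product((0, 1), repeat=depth):
            counts[prefix] = 0
    for v in values:
        bits = tuple((v >> i) & 1 for i in range(nbits - 1, -1, -1))
        for depth in range(1, nbits + 1):
            counts[bits[:depth]] += 1
    return counts

def rank_k_from_counts(counts, nbits, k):
    """k-th largest value, using only the stored counts (no pTree ANDs at query time)."""
    prefix, p = (), k
    for _ in range(nbits):
        c = counts[prefix + (1,)]      # stored count for "current prefix AND next bit set"
        if c >= p:
            prefix += (1,)
        else:
            p -= c
            prefix += (0,)
    return int("".join(map(str, prefix)), 2)

values = [5, 3, 9, 1, 7]               # hypothetical 4-bit stride
counts = prefix_counts(values, nbits=4)
print(len(counts), rank_k_from_counts(counts, 4, k=2))
```

For a bitwidth of 4 the table has 2 + 4 + 8 + 16 = 30 entries, matching the slide, and the counts at each depth sum to the stride length.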

21 Squared y-p on M-p Projection Distance = $y\circ y - 2\,y\circ p + p\circ p - \left(y\circ\frac{M-p}{|M-p|} - p\circ\frac{M-p}{|M-p|}\right)^2$
Using a "capped tube". Given a unit vector, d, we need the d_dot_product_projection_lengths and the d_dot_product_projection_distances.

The dot product projection of y is $(y\circ d)d$, and the dot product projection distance is $|y - (y\circ d)d|$. Squared, it is
$(y-(y\circ d)d)\circ(y-(y\circ d)d) = y\circ y - 2(y\circ d)^2 + (y\circ d)^2 = y\circ y - (y\circ d)^2$

Note that this projection_distance is the perpendicular distance from the point, y, to the d_line and has nothing to do with the origin of the vector, y.

With the origin replaced by p and the target taken as the furthest point or mean point, f (or M):
Squared y-p on M-p Projection Distance = $(y-p)\circ(y-p) - \frac{((y-p)\circ(M-p))^2}{(M-p)\circ(M-p)} = y\circ y - 2\,y\circ p + p\circ p - \left(y\circ\frac{M-p}{|M-p|} - p\circ\frac{M-p}{|M-p|}\right)^2$

1st: compute the constant [vector] $\frac{M-p}{|M-p|}$.
2nd: compute the PTreeSet $y\circ\frac{M-p}{|M-p|} - p\circ\frac{M-p}{|M-p|}$ (1 dot, 1 minus). The dot product length projections (caps) already needed it, since $(y-p)\circ\frac{M-p}{|M-p|} = y\circ\frac{M-p}{|M-p|} - p\circ\frac{M-p}{|M-p|}$.
3rd: compute the PTreeSets in $y\circ y - 2\,y\circ p + p\circ p$ (2 dots, 1 minus, 1 plus). Do not compute y-p (it shifts the entire vector space)?

That is, we needed to compute the green constants and the blue and red dot product functionals in an optimal way (and then do the PTreeSet additions/subtractions/multiplications). What is optimal? (Minimizing PTreeSet functional creations and PTreeSet operations.)

Figure labels: gaps in dot product lengths [projections] on the line; cap gap width; tubular gap width; origin; p.

