1
Please email the following information to william.perrizo@ndsu.edu:
1. Name and address
2. Thesis advisor (if you have one currently)
3. Thesis topic or topic area (if you have one currently)
4. Are you currently:
   a. a TA
   b. an RA (whose grant funds you?)
   c. a C.S. Department paper grader
   d. working for an external (to NDSU) entity (which one? How many hours/week?)

The reason for requesting this information is that:
1. I want to get to know who you are, what group you're in, and your interests and situation.
2. Since my data mining technology has been patented by NDSURF and licensed to Treeminer Inc. of Maryland, I have to be somewhat careful with it. If you should use it in a thesis or paper, please let me know ahead of time.
3. So that problems do not arise, I ask that you trust me to decide degree of authorship etc. in all publications involving this work. Why can you trust me to be fair? I have some 250 refereed publications and don't need any more. I have done this with hundreds of students, all of whom will tell you that I have always been completely fair. It is important that attributions are correct.
2
This is my research group meeting
This is my research group meeting. Remember that these are not lectures for a class and are not required for anyone.

First, what do I do? I Data Mine Big Data in Human Time (big data = zillions of rows! And sometimes also thousands of columns, which can complicate the data mining of a zillion rows).

What is data mining? Roughly, it's CLASSIFICATION (assigning a class label to an unclassified row based on a training set of already classified rows). What about clustering and ARM? They are important and related! Roughly, the purpose of clustering is to create or improve a training set, and the purpose of ARM is to data mine complex data (relationships).

CLASSIFICATION is case-based reasoning. To make a decision, we tend to search our memory for similar cases (near-neighbor cases) and base our current decision on those cases - we do what worked before (for us or for others). We let those near-neighbor cases vote. What does near mean? How many? How near? How do they vote? These are all questions to be answered for the particular decision we are making. (A minimal voting sketch appears after the project list below.)

"The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information" [2] is one of the most highly cited papers in psychology. It was published in 1956 by the cognitive psychologist George A. Miller of Princeton University's Department of Psychology in Psychological Review. It is often interpreted to argue that the number of objects an average human can hold in working memory is 7 ± 2. This is referred to as Miller's Law. We can think of classification as providing a better 7 (decision support, not decision making).

Some current Data Mining research projects:
1. MapReduce FAUST. Current_Relevancy_Score=10, Killer_Idea_Score=5. MapReduce and Hadoop are a key-value approach to organizing complex BigData. In FAUST PREDICT/CLASSIFY we start with a Training TABLE, and in FAUST CLUSTER/ANOMALIZER we start with a vector space. FAUST = Fast, Accurate, Unsupervised and Supervised machine Teaching.
2. pTree Text Mining. Current_Relevancy_Score=10, Killer_Idea_Score=9. I think FAUST is the way to do this. Also there is the very new idea of capturing the reading sequence, not just the term-frequency matrix (lossless capture) of a corpus. Preliminary work suggests that attribute selection via simple standard deviations really helps (select those columns, or more generally those functionals, with the highest StDev, because they are the ones with the highest potential for large gaps!).
3. FAUST CLUSTER/ANOMALIZER. Current_Relevancy_Score=10, Killer_Idea_Score=10. No one has taken up the proof that this is a breakthrough method. The applications are unlimited!
4. Secure pTreeBases. Current_Relevancy_Score=9, Killer_Idea_Score=8. This seems straightforward and a certainty (to be a killer advance)! It would involve becoming the world expert on what data security really means and how it has been done by others, and then comparing our approach to theirs. Truly a complete career is waiting for someone here!
5. FAUST PREDICTOR/CLASSIFIER. Current_Relevancy_Score=10, Killer_Idea_Score=10. No one has done a complete analysis showing that this is a breakthrough method. The applications are unlimited here too!
6. pTree Algorithmic Tools. Current_Relevancy_Score=10, Killer_Idea_Score=10. Mohammad Hossain has expanded the algorithmic tool set to include quadratic tools and even higher-degree tools, and now division is added to the arithmetic tools.
7. pTree Alternative Algorithm Implementation. Current_Relevancy_Score=9, Killer_Idea_Score=8. Implementing pTree algorithms in hardware/firmware (e.g., FPGAs) - orders of magnitude performance improvement?
8. pTree O/S Infrastructure. Current_Relevancy_Score=10, Killer_Idea_Score=10. Matt Piehl finished his M.S. on this (available in the library).
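As a concrete (and hedged) illustration of the near-neighbor voting described above, here is a minimal plain-numpy sketch of k-nearest-neighbor voting with k=7 (nodding to Miller's 7 ± 2). This is not the pTree/FAUST implementation; the function name and parameters are illustrative only.

```python
import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, unclassified_row, k=7):
    """Let the k nearest training rows (the 'near neighbor cases') vote on a label."""
    dists = np.linalg.norm(train_X - unclassified_row, axis=1)  # how near?
    nearest = np.argsort(dists)[:k]                             # how many?
    votes = Counter(train_y[i] for i in nearest)                # how do they vote?
    return votes.most_common(1)[0][0]                           # majority vote
```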
3
Functional-Gap-based clustering methods (the unsupervised part of FAUST)
This class of partitioning or clustering methods relies on choosing a functional (a mapping of each row to a real number) which is distance dominated (i.e., the difference between two functional values, |F(x) - F(y)|, is always at most the distance between x and y; so if we find a gap in the F-values, we know that points on opposite sides of that gap are at least as far apart as the gap width). Here are some of the functionals we have already used productively (in each, we can check actual pair-wise distances at the extreme ends for outliers):

Coordinate Projection (e_j): check gaps in e_j(y) ≡ y_j.

Square Distance (SD): check gaps in SD_p(y) ≡ (y-p)o(y-p) (parameterized over a p grid).

Dot Product Projection (DPP): check gaps in DPP_d(y) or DPP_pq(y) ≡ (y-p)o(q-p)/|q-p| (parameterized over a grid of p values and d = (q-p)/|q-p| values).

Square Dot Product Radius (SDPR): check gaps in SDPR_pq(y) ≡ SD_p(y) - DPP_pq(y)².

DPP-KM: 1. Check DPP_p,d(y) gaps (grids of p and d?); check distances at sparse extremes. 2. After several rounds of 1, apply k-means to the resulting clusters (when k seems to be determined).

DPP-DA: 1. Check DPP_p,d(y) gaps (grids of p and d?) against the density of the subcluster; check distances at sparse extremes against subcluster density. 2. Apply other methods once DPP ceases to be effective.

DPP-SD: check DPP_p,d(y) (over a p-grid and a d-grid) and SD_p(y) (over a p-grid); check sparse-end distances against subcluster density. (DPP_pd and SD_p share construction steps!)

SD-DPP-SDPR: DPP_pq, SD_p and SDPR_pq share construction steps:
SD_p(y) ≡ (y-p)o(y-p) = yoy - 2yop + pop
DPP_pq(y) ≡ (y-p)od = yod - pod = (1/|q-p|)yoq - (1/|q-p|)yop - pod
Calculate yoy, yop, yoq concurrently? Then the constant multiples 2*yop and (1/|q-p|)*yop concurrently. Then add/subtract. Calculate DPP_pq(y)², then subtract it from SD_p(y) to get SDPR_pq(y).
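To make the shared machinery concrete, here is a minimal numpy sketch (not the pTree implementation) of gap-based partitioning with the DPP functional; the function names and the gap threshold are illustrative.

```python
import numpy as np

def dpp(Y, p, q):
    """Dot Product Projection: DPP_pq(y) = (y-p).(q-p)/|q-p|  (distance dominated)."""
    d = (q - p) / np.linalg.norm(q - p)
    return (Y - p) @ d

def gap_partition(Y, F, min_gap=4.0):
    """Split the rows of Y wherever consecutive sorted F-values gap by >= min_gap."""
    order = np.argsort(F)
    cuts = np.where(np.diff(F[order]) >= min_gap)[0]
    bounds = zip(np.r_[0, cuts + 1], np.r_[cuts + 1, len(F)])
    return [Y[order[s:e]] for s, e in bounds]
```

Because DPP is distance dominated, every part returned by gap_partition is guaranteed to lie at least min_gap away from the other parts.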
4
Dot Product Projection (DPP)
Dot Product Projection (DPP): 1. Check F(y) = (y-p)o(q-p)/|q-p| for gaps or thin intervals. 2. Check actual distances at sparse ends.

Here we apply DPP to the IRIS data set: 150 iris samples (rows) and 4 columns (Petal Length, Petal Width, Sepal Length, Sepal Width). We assume we don't know ahead of time that the first 50 are the Setosa class, the next 50 the Versicolor class and the final 50 the Virginica class. We cluster with DPP and then see how close it comes to separating into the 3 known classes (s=setosa, e=versicolor, i=virginica).

p=nnnn, q=xxxx. F:Count = 0:1 1:1 2:1 3:3 4:1 5:6 6:4 7:5 8:7 9:3 10:8 11:5 12:1 13:2 14:1 15:1 19:1 20:1 21:3 26:2 28:1 29:4 30:2 31:2 32:2 33:4 34:3 36:5 37:2 38:2 39:2 40:5 41:6 42:5 43:7 44:2 45:1 46:3 47:2 48:1 49:5 50:4 51:1 52:3 53:2 54:2 55:3 56:2 57:1 58:1 59:1 61:2 64:2 66:2 68:1.

Sparse lower end: checking [0,4] distances among s14 s42 s45 s23 s16 s43 s3. s42 is revealed as an outlier, because F(s42)=1 is ≥4 from 5, 6, ... and it is ≥4 from the others in [0,4].

Sparse upper end: checking [57,68] distances among i26 i31 i8 i10 i36 i6 i23 i19 i32 i18. i10, i36, i19, i32, i18 are singleton outliers because their F-values are ≥4 from 56 and ≥4 from each other; {i6, i23} is a doubleton outlier set.

Gaps = [15,19] and [21,26]. Check distances in [12,28] to see whether s16, i39, e49, e8, e11, e44 are outliers: s34 s6 s45 s19 s16 i39 e49 e8 e11 e44 e32 e30 e31. So s16, i39, e49, e11 are outliers, and {e8, e44} is a doubleton outlier set. Separate at 17 and 23, giving:
CLUS1, F<17 (50 Setosa, with s16 and s42 declared as outliers);
CLUS2, 17<F<23 (e8, e11, e44, e49, i39 - all already declared outliers);
CLUS3, 23<F (46 versicolor, 49 virginica, with i6, i10, i18, i19, i23, i32 declared as outliers).

CLUS3, outliers removed, p=aaax, q=aaan. F:Cnt = 0:4 1:2 2:5 3:13 4:8 5:12 6:4 7:2 8:11 9:5 10:4 11:5 12:2 13:7 14:3 15:2. Sparse low end: check [0,8] distances among i30 i35 i20 e34 i34 e23 e19 e27. i30, i35, i20 are outliers because, at F≤3, they are ≥4 from the points at 5,6,7,8; {e34, i34} is a doubleton outlier set (gap ≥ 4). Thinning = [6,7]: CLUS3.1 <6.5 (44 versicolor, 4 virginica); CLUS3.2 >6.5 (2 versicolor, 39 virginica). No sparse ends.

CLUS3.1, p=anxa, q=axna. F:Cnt = 0:2 3:1 5:2 6:1 8:2 9:4 10:3 11:6 12:6 13:7 14:7 15:4 16:3 19:2. Sparse upper end: check [16,19] distances among e7 e32 e33 e30 e15. e15 is an outlier, so CLUS3.1 = 42 versicolor.

CLUS3.2 = 39 virginica, 2 versicolor (unable to separate the 2 versicolor from the 39 virginica).
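A hedged sketch of this run on IRIS, using sklearn's bundled copy of the data (which has the same 50/50/50 class ordering); p is the coordinate-wise minimums (nnnn) and q the coordinate-wise maximums (xxxx). Rounding may make the printed F/Count list differ slightly from the slide's.

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data                    # 150 rows x 4 columns
p, q = X.min(axis=0), X.max(axis=0)     # p = nnnn (mins), q = xxxx (maxs)
d = (q - p) / np.linalg.norm(q - p)
F = np.rint((X - p) @ d).astype(int)    # rounded DPP values

for v, c in zip(*np.unique(F, return_counts=True)):
    print(v, c)                         # the slide's "F Count" array
```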
5
DPP (other corners): check Dot_p,d(y) for gaps ≥ 4; check sparse ends.
CLUS1, p=nxnn q=xnxx: 0 1 2 1 4 1 6 2 9 1 10 1 11 2 12 2 13 3 14 3 15 2 16 2 17 4 18 3 19 3 20 2 21 5 22 6 23 5 24 2 25 7 26 3 27 2 28 2 29 1 30 3 31 3 32 7 33 4 34 1 35 1 36 2 37 2 39 1 41 1 42 1 43 1.
Sparse low end (check [0,9]): i23 i6 i36 i8 i31 i3 i26. i3, i26, i36 are gap ≥ 4 singleton outliers; {i23,i6} and {i8,i31} are doubleton outlier sets.
Sparse low end (checking [0,7]): i1 i18 i19 i10 i37 i5 i6 i23 i32 i44 i45 i49 i25 i8 i15 i41 i21 i33 i29 i4 i3 i16. i1, i18, i19, i10, i37, i32 are gap ≥ 4 outliers.

Dot gap ≥ 4, p=xnnn q=nxxx: 0 1 1 1 2 1 3 2 4 7 5 1 6 7 7 5 8 9 9 3 10 7 11 3 12 5 13 4 14 5 15 4 16 8 17 4 18 7 19 3 20 5 21 1 22 4 23 1 24 1 31 2 33 2 34 12 35 8 36 17 37 6 38 2 39 2.
Sparse hi end (checking [34,43]): e20 e31 e10 e32 e15 e30 e11 e44 e8 e49. e30, e49, e15, e11 are gap ≥ 4 singleton outliers; {e44,e8} is a doubleton outlier set.
Gap (24,31): CLUS1 < 27.5 (50 versicolor, 49 virginica); CLUS2 > 27.5 (50 setosa, 1 virginica).
Sparse hi end (checking [38,39]): s42 s36 s37 s1. s37, s1 outliers.

CLUS1, Dot gap ≥ 4, p=nnnn q=xxxx: 0 1 1 2 2 2 3 1 4 2 5 1 6 6 7 2 8 3 9 1 10 2 11 2 12 2 13 6 14 6 15 7 16 2 17 2 18 3 19 3 20 2 21 2 22 3 23 4 24 2 25 1 26 2 27 3 28 1 29 1.
Thinning (8,13), split in the middle = 10.5: CLUS_1.1 < 10.5 (21 virginica, 2 versicolor); CLUS_1.2 > 10.5 (12 virginica, 42 versicolor).

CLUS1, p=nnxn q=xxnx: 0 2 1 1 2 5 3 8 4 9 5 6 6 9 7 14 8 11 9 7 10 4 11 2 13 2.
Thinning (7,9), split in the middle = 7.5: CLUS_1.2.1 < 7.5 (10 virginica, 4 versicolor); CLUS_1.2.2 > 7.5 (1 virginica, 38 versicolor). i15 is a gap ≥ 4 outlier at F=0.
Sparse hi end (checking [10,13]): e34 i2 i14 i43 e41 i20 i7 i35. i7, i35 are gap ≥ 4 singleton outliers.

CLUS1, Dot gap ≥ 4, p=nnnx q=xxxn: 0 1 4 1 5 3 6 5 7 4 8 3 9 6 10 7 11 3 12 4 13 8 14 4 15 4 16 3 17 8 18 5 19 3 20 1 21 1 22 3 23 1.

CLUS1.2, Dot gap ≥ 4, p=aaan q=aaax: 0 1 4 4 5 3 6 3 7 4 8 1 9 5 10 7 11 3 12 5 13 3 14 6 15 1 16 4 17 1 18 1 19 2. Hi-end gap outlier: i30.

CLUS1.2.1, Dot gap ≥ 4, p=anaa q=axaa: 0 1 1 1 2 1 4 2 6 3 7 4 9 2.

CLUS1.2.1, Dot gap ≥ 4, p=aana q=aaxa: 0 5 1 2 2 3 3 2 4 1 6 1. Checking: i24 e7 i34 i47 i27 i28 e34 e36 e21 i50 i2 i43 i14 i22.

CLUS1.2.1, p=naaa q=xaaa: 0 4 1 1 2 1 3 2 4 2 5 2 6 1 7 1.
6
The next slide attempts to analyze "gap hill climbing" mathematically.
HILL CLIMBING GAP WIDTH. Check Dot_p,d(y) for thinnings; use the AVG of each side of the thinning for p, q; redo.

Dot F, p=aaan q=aaax: 0:3 1:3 2:8 3:3 4:6 5:6 6:5 7:12 8:2 9:4 10:12 11:8 12:13 13:5 14:3 15:7 19:1 20:1 21:7 22:7 23:28 24:6.

p=avg<12, q=avg>12: 0:2 2:1 3:2 5:1 6:1 8:1 9:1 10:3 11:2 12:1 13:4 14:1 15:3 16:5 17:2 18:2 19:3 21:4 22:1 23:6 24:5 25:5 26:4 27:4 28:2 29:3 30:3 31:3 33:4 34:4 35:2 36:3 37:3 38:1 39:1 40:2 44:1 45:1 46:2 47:1. Inconclusive! There isn't a more prominent gap than before.

p=aaan+.005*avg<12, q=aaax+.005*avg>12: 0:3 1:3 2:8 3:3 4:6 5:6 6:5 7:12 8:2 9:4 10:12 11:8 12:13 13:5 14:3 15:7. Here we tweak d just a little toward the means and get a more prominent gap??

Cut=8: CLUS_1.1 < 8 (45 virginica, 1 versicolor); 8 < CLUS_1.2 < 17 (5 virginica, 49 versicolor). Cut=9: CLUS_1.1 < 9 (46 virginica, 2 versicolor); CLUS_1.2 > 9 (4 virginica, 48 versicolor). Cut=17: CLUS_1 < 17; CLUS_2 > 17 (50 Setosa).

These are attempts at "hill-climbing" the gaps to make them more prominent - to see if they are wider than they appear to be via the choice of F, in the case that the projection line cuts the gap at a severe angle and therefore reports a much narrower gap than actually exists. The next slide attempts to analyze "gap hill climbing" mathematically.
7
"Gap Hill Climbing": mathematical analysis
One way to increase the size of the functional gaps is to hill-climb the standard deviation of the functional F (hoping that a "rotation" of d toward a higher StDev would increase the likelihood that gaps would be larger, since more dispersion allows for more and/or larger gaps). This is very general. We are more interested in growing one particular gap of interest (the largest gap or largest thinning). To do this we can proceed as follows: F-slices are hyperplanes (assuming F = Dot_d), so it would make sense to try to "re-orient" d so that the gap grows. Instead of taking the "improved" p and q to be the means of the entire n-dimensional half-spaces cut off by the gap (or thinning), take p and q to be the means of the (n-1)-dimensional F-slice hyperplanes defining the gap or thinning. This is easy, since our method produces the pTree mask of each F-slice ordered by increasing F-value (in fact it is the sequence of F-values, and the sequence of counts of points giving us those values, that we use to find large gaps in the first place).

The d2-gap is much larger than the d1-gap. It is still not the optimal gap, though. Would it be better to use a weighted mean (weighted by the distance from the gap - that is, by the d-barrel radius, from the center of the gap, on which each point lies)? In this example it seems to make for a larger gap, but what weightings should be used (e.g., 1/radius²)? (Zero weighting after the first gap is identical to the previous.) Also, we really want to identify the support-vector pair of the gap (the pair, one from each side, which are closest together) as p and q (in this case 9 and a, but we were just lucky to draw our vector through them). We could check the d-barrel radius of just these gap-slice pairs and select the closest pair as p and q???

[Figure: the same point set (labels a-f, 6-9, j-s) projected on two lines; the d1-gap on the first line is narrow, while the re-oriented d2 line through p and q yields the much wider d2-gap.]
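A minimal sketch of one such re-orientation step, under the simplifying assumption that the thinning has already been located at F = cut; a fuller version would re-locate the thinning after each re-projection and could apply the distance weightings discussed above.

```python
import numpy as np

def hill_climb_gap(Y, d, cut, rounds=3):
    """Re-orient d toward the gap: p, q = means of the two sides, new d = (q-p)/|q-p|."""
    for _ in range(rounds):
        F = Y @ d
        p = Y[F < cut].mean(axis=0)    # mean of the low side of the thinning
        q = Y[F >= cut].mean(axis=0)   # mean of the high side
        d = (q - p) / np.linalg.norm(q - p)
        cut = (p @ d + q @ d) / 2      # re-center the cut between the side means
    return d, cut
```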
8
Barrel Clustering: (This method attempts to build barrel-shaped gaps around clusters)
Furthest Point or Mean Point q: allows for a better fit around convex clusters that are elongated in one direction (not round). We look for gaps in the dot-product lengths [projections] on the line, and in the barrel radii around the line (barrel-cap gap widths and barrel-radius gap widths).

Exhaustive search for all barrel gaps takes two parameters for a pseudo-exhaustive search (exhaustive modulo a grid width):
1. A StartPoint, p (an n-vector, so n-dimensional).
2. A UnitVector, d (a direction, so (n-1)-dimensional - a grid on the surface of the sphere in Rⁿ).

Then for every choice of (p,d) (e.g., in a grid of points in R^(2n-1)), two functionals are used to enclose subclusters in barrel-shaped gaps:
a. SquareBarrelRadius functional, SBR(y) = (y-p)o(y-p) - ((y-p)od)²
b. BarrelLength functional, BL(y) = (y-p)od

Given a p, do we need a full grid of ds (directions)? No! d and -d give the same BL-gaps. Given d, do we need a full grid of p starting points? No! All p' such that p' = p + cd give the same gaps. Hill-climb the gap width from a good starting point and direction.

MATH: We need the dot-product projection length and the dot-product projection distance. The projection of y on f is ((yof)/(fof))f, so the squared projection distance of y from the f-line is
|y - ((yof)/(fof))f|² = yoy - (yof)²/(fof).
The squared projection distance of y-p from the q-p line is
(y-p)o(y-p) - ((y-p)o(q-p))²/((q-p)o(q-p)) = yoy - 2yop + pop - ( yo(q-p)/|q-p| - po(q-p)/|q-p| )².
For the dot-product length projections (caps) we already needed (y-p)o(M-p)/|M-p| = yo(M-p)/|M-p| - po(M-p)/|M-p|. That is, we need to compute the green constants and the blue and red dot-product functionals in an optimal way (and then do the pTreeSet additions/subtractions/multiplications). What is optimal? (Minimizing pTreeSet functional creations and pTreeSet operations.)
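A minimal numpy sketch of the two barrel functionals (d is assumed to be a unit vector); gap analysis is then run on both arrays to enclose a subcluster in a barrel:

```python
import numpy as np

def barrel_length(Y, p, d):
    """BL(y) = (y-p).d : position along the barrel axis (cap gaps)."""
    return (Y - p) @ d

def square_barrel_radius(Y, p, d):
    """SBR(y) = (y-p).(y-p) - ((y-p).d)^2 : squared distance from the axis (radial gaps)."""
    Yp = Y - p
    return np.einsum('ij,ij->i', Yp, Yp) - ((Yp @ d) ** 2)
```

Note that SBR is exactly the squared projection distance derived above, so the yoy, yop and (y-p)od pieces can be shared between the two functionals.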
9
4 functionals in the dot product group of gap clusterers on a VectorSpace subset, Y (y∈Y):
1. SL_p(y) = (y-p)o(y-p), p a fixed vector: the Square Length functional, primarily for outlier identification and densities.

2. Dot_d(y) = yod (d a unit vector): the Dot-product functional. Using d = (q-p)/|q-p| and y-p: Dot_p,q(y) = (y-p)o(q-p)/|q-p|. The projection of y on d is (yod)d; squaring the length of the residual y - (yod)d gives (y - (yod)d)o(y - (yod)d) = yoy - (yod)², whether yod is positive or negative.

3. SPD_d(y) = yoy - (yod)² (d a unit vector): the Square Projection Distance functional. E.g., if d ≡ (q-p)/|q-p| (the unit vector from p to q), then SPD applied to y-p gives
(y-p)o(y-p) - ((y-p)o(q-p))²/((q-p)o(q-p)) = yoy - 2yop + pop - ( yo(q-p)/|q-p| - po(q-p)/|q-p| )².
But to avoid creating an entirely new VectorPTreeSet(Y-p) for the space (with origin shifted to p), we think it useful to compute from the altered expression, e.g.:
1st, the constant vector (q-p)/|q-p| and the ScalarPTreeSet yo(q-p)/|q-p|;
2nd, the constant po(q-p)/|q-p|;
3rd, the SPTreeSet yo(q-p)/|q-p| - po(q-p)/|q-p|;
4th, the SPTreeSets yoy and yop;
5th, the constant pop;
6th, the combination yoy - 2yop + pop - (the 3rd result)².
Is it better to leave all the additions and subtractions for one mega-step at the end? Other efficiency thoughts? We note that Dot_d(y) = yod shares many construction steps with SPD.

4. CA_d(y) = yod/|y| (d a unit vector): the Cone Angle functional. Using d = (q-p)/|q-p| and y-p: CA_p,q(y) = (y-p)od/|y-p|, and SCA_p,q(y) = ((y-p)od)²/((y-p)o(y-p)) is the Squared Cone Angle functional.
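For concreteness, a plain-numpy sketch of the four functionals (not the pTreeSet construction; d is assumed to be a unit vector):

```python
import numpy as np

def SL(Y, p):        # 1. Square Length: SL_p(y) = (y-p).(y-p)
    D = Y - p
    return np.einsum('ij,ij->i', D, D)

def Dot(Y, d):       # 2. Dot product: Dot_d(y) = y.d
    return Y @ d

def SPD(Y, d):       # 3. Square Projection Distance: y.y - (y.d)^2
    return np.einsum('ij,ij->i', Y, Y) - (Y @ d) ** 2

def CA(Y, d):        # 4. Cone Angle: y.d / |y|
    return (Y @ d) / np.linalg.norm(Y, axis=1)
```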
10
CLUS1.2 is pure Versicolor (45 of the 50).
SPD, p q e14, V Ct: 2 10 3 12 4 12 5 12 6 8 7 11 8 9 9 5 10 9 11 4 12 4 13 2 14 1 17 2 18 3 19 10 20 5 21 6 22 5 23 6 24 6 25 3 27 2 29 2 30 1.

SPD on CLUS1, p e11 q=MN, V Ct: 2 3 3 4 4 5 5 7 6 2 7 2 8 6 9 6 10 3 11 4 12 2 13 4 14 4 15 3 16 2 17 1 18 5 19 1 20 2 22 2 23 1 24 1 25 1 26 1 29 1.

SPD, p q e14, V Ct: 1 6 2 4 3 8 4 4 5 10 6 2 7 2 8 2 9 7 10 2 11 2 12 2 13 1 15 2 17 1 18 4 19 2 20 4 22 1 24 1 25 1 26 1 29 1 31 2 32 2 33 3 (i15 i36 i32).

SPD, p q, V Ct: 2 8 3 10 4 10 5 10 6 5 7 10 8 6 9 8 10 6 11 1.

Masks: V < 8.5 - CLUS1. V < 12.5 (5 SMe, 24 SMi) - CLUS1.1. Thin gap at 8.5 < V < 15.5 - CLUS2. V > 12.5 (45 SMe, 0 SMi) - CLUS1.2. V > 15.5: this tube contains 49 setosa + 2 virginica - CLUS3.

However, I cheated a bit. I used p = MinVect(e) and q = MaxVect(e), which makes it somewhat supervised.

CLUS1.2 is pure Versicolor (45 of the 50). CLUS3 is almost pure Setosa (49 of the 50, plus 2 virginica). CLUS2 is almost purely [half of] virginica (24 of 50, plus 1 setosa). CLUS1.1 is the other 24 virginicas, plus the other 5 versicolors. So this method clusters IRIS quite well (albeit into 4 clusters, not three). Note that caps were not put on these tubes.

Also, this was NOT unsupervised clustering! I took advantage of my knowledge of the classes to carefully choose the unit-vector points p and q, e.g., p = MinVector(Versicolor) and q = MaxVector(Versicolor). True, if one sequenced through a fine enough d-grid of all unit vectors [directions], one would happen upon a unit vector closely aligned to d = (q-p)/|q-p|, but that would be a whole lot more work than I did here (it would take much longer). In the worst case, though, for totally unsupervised clustering, there would be no other way than to sequence through a grid of unit vectors. However, a good heuristic might be to try all unit vectors "corner-to-corner" and "middle-of-face-to-middle-of-opposite-face" first, etc. Another thought would be to introduce some sort of hill climbing to "work our way" toward a good combination of a radial gap plus two good linear cap gaps for that radial gap.
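A hedged sketch of the tube run, reproducing the admittedly supervised choice p = MinVector(Versicolor), q = MaxVector(Versicolor); the printed class counts per mask may not match the slide's exactly (rounding and threshold details differ).

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target              # 0=setosa, 1=versicolor, 2=virginica
e = X[y == 1]                              # versicolor rows
p, q = e.min(axis=0), e.max(axis=0)        # p = MinVect(e), q = MaxVect(e)
d = (q - p) / np.linalg.norm(q - p)

Xp = X - p
V = np.einsum('ij,ij->i', Xp, Xp) - (Xp @ d) ** 2   # squared tube radius
for name, mask in [("V<12.5", V < 12.5), ("V>12.5", V > 12.5)]:
    print(name, np.bincount(y[mask], minlength=3))  # counts per class in the tube
```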
11
SPD on CLUS1, p C1US1axxx q C1US1aaaa, V Ct: 1 3 2 5 3 9 4 13 5 18 6 12 7 4 8 1 9 2 - no thinnings.
SPD on CLUS1, p C1US1xaxx q C1US1aaaa, V Ct: 1 4 2 13 3 7 4 19 5 9 6 7 7 9 8 2.
SPD on CLUS1, p C1US1xxax q C1US1aaaa, V Ct: 1 1 2 4 3 3 4 9 5 9 6 14 7 9 8 4 9 6 10 3 11 3 12 1 14 2 15 1 - no thinnings.
SPD on CLUS1, p C1US1xxxa q C1US1aaaa, V Ct: 1 1 2 3 3 10 4 15 5 16 6 12 7 7 8 3 9 1 10 1 - no thinnings.

SPD, p axxx q aaaa, V Ct: 2 1 3 5 4 6 5 6 6 8 7 6 8 8 9 15 10 7 11 8 12 13 13 8 14 14 15 9 16 13 17 6 18 4 19 4 20 3 21 4 23 1 25 1.
mask V<3.5: 14 SM versicolor, 10 SM virginica - CL1.1?
mask V<11.5: 0 SM setosa, 46 SM versicolor, 24 SM virginica - CLUS1.
mask V>3.5: 0 SM setosa, 32 SM versicolor, 14 SM virginica - CLUS1.2?
mask V>11.5: 50 SM setosa, 4 SM versicolor, 26 SM virginica - CLUS2.

SPD on CLUS2, p C1US2axxx q C1US2aaaa, V Ct: 6 2 7 2 8 6 9 13 10 7 11 7 12 4 13 5 14 11 15 9 16 2 18 4 21 2 22 1 23 3 25 1 26 1.
SPD on CLUS1, p C1US1axax q C1US1aaaa, V Ct: 1 1 2 3 3 4 4 2 5 12 6 13 7 9 8 7 9 2 10 7 11 4 13 2 14 1 17 2 18 1.
SPD on CLUS1, p C11aaxx q C11aaaa, V Ct: 1 1 2 7 3 10 4 13 5 13 6 13 7 6 8 2 9 2 11 1 - no thinnings.
SPD on CLUS1, p C1US1axxa q C1US1aaaa, V Ct: 1 1 2 2 3 6 4 9 5 12 6 17 7 8 8 6 9 5 10 1 11 1 - no thinnings.
mask V<13.5: 44 SM setosa, 0 SM versicolor, 2 SM virginica - CLUS2.1.
mask V<9.5: 37 SM versicolor, 16 SM virginica - CL1.1?
mask 100>V>13.5: 6 SM setosa, 4 SM versicolor, 24 SM virginica - CLUS2.2.
mask V>9.5: 9 SM versicolor, 8 SM virginica - CL1.2?

SPD on CLUS1, C11xaax C11aaaa, V Ct: 1 2 2 3 3 4 4 8 5 8 6 14 7 8 8 4 9 5 10 6 11 1 12 3 14 1 15 2 - no thinnings.
C11axaa C11aaaa, V Ct: 1 2 2 2 3 2 4 10 5 3 6 13 7 8 8 7 9 4 10 3 11 6 12 2 13 2 14 2 17 2 18 1 19 1.
SPD on CLUS1, C11xxaa C11aaaa, V Ct: 1 1 2 4 3 6 4 9 5 10 6 7 7 9 8 5 9 3 10 4 11 2 12 4 13 1 14 3 17 2.
SPD on C1, C11aaax C11aaaa, V Ct: 1 3 2 1 3 3 4 4 5 12 6 15 7 4 8 5 9 4 10 7 11 4 12 2 13 1 14 1 15 1 17 1 18 1 19 1.
SPD on CLUS1, C11xaxa C11aaaa, V Ct: 1 2 2 3 3 12 4 12 5 10 6 15 7 7 8 4 9 1 10 2 11 1 - no thinnings.
C11aaxa C11aaaa, V Ct: 1 2 2 3 3 6 4 12 5 11 6 9 7 11 8 5 9 5 10 1 11 3 13 2.
C11xaaa C11aaaa, V Ct: 1 2 2 4 3 5 4 9 5 10 6 9 7 5 8 6 9 2 10 6 11 3 12 1 13 2 14 2 15 2 17 2.
mask V<5.5: 16 versicolor, 3 virginica - CL1.1? mask V<5.5: 26 versicolor, 4 virginica - CL1.1?
mask V>5.5: 30 versicolor, 21 virginica - CL1.1? mask V>5.5: 20 versicolor, 20 virginica - CL1.1?
12
95 remaining versicolor and virginica=SubClus1.
i p max, V Ct: 0 2 1 2 2 2 3 5 4 3 5 3 6 4 7 4 8 7 9 2 10 3 11 1 12 4 13 5 14 4 15 7 16 2 17 5 18 3 19 1 20 1 21 4 23 2 24 2 25 4 26 1 27 2 28 1 29 2 30 1 32 1. {e4, e40} form a doubleton outlier set; i7 and e10 are singleton outliers.

x=s (58 = avg(y1)), V Ct: 0 3 (s15, s17, s34); 1 12 (s6,11,16,19,20,22,28,32,33,37,47,49); 2 12 (s1,10,13,18,21,27,29,40,41,44,45,50); 3 7 (s2,12,23,24,35,36,38); 4 10 (s2,3,7,13,25,26,30,31,46,48); 5 2 (s4, s43); 6 2 (s9, s39); 7 1 (s14); 8 1 (i39); 9 1 (s32) - all 50 setosa plus i39 and e49. Then: 16 2 17 2 19 1 20 2 21 5 22 4 23 3 24 4 25 1 27 8 28 2 29 2 30 4 31 1 32 4 34 2 35 2 36 2 37 3 38 2 39 2 40 4 41 1 43 2 44 4 45 2 46 1 47 2 48 1 50 4 52 2 53 2 54 2 56 2 57 1 (i1, i31; 9 virginica: i10 i8 i36 i32 i16 i18 i23 i19). But here I mistakenly used the mean rather than the max corner. So I will redo - but note the high level of cluster and outlier revelation!

i p max, V Ct: 0 2 2 6 3 3 4 4 5 4 6 2 7 6 8 9 9 2 10 2 11 2 12 5 13 7 14 2 15 6 16 2 17 5 19 3 20 2 22 3 23 2 24 3 25 2 26 1 27 1 28 1 29 3 30 1 31 2 (e32 e11 e8,44 e49 i39); 60 1 61 1 62 1 63 1 64 1 65 1 66 1 67 3 68 4 69 4 70 3 71 3 72 4 73 2 74 5 75 1 76 2 77 1 78 3 79 1 (s3 s9 s39,43 s42 s23 s14). 2 actual gap-outliers; checking distances reveals 4 e-outliers (versicolor) and 5 s-outliers (setosa).

i p max, V Ct: 0 2 1 1 2 3 3 3 4 4 5 2 6 6 7 3 8 5 9 4 10 4 11 2 12 3 13 4 14 6 15 4 16 1 17 7 18 2 19 3 20 2 22 2 23 1 24 2 25 4 26 4 27 1 28 2 29 2 30 1 32 2 33 1 34 1 35 1. No new outliers revealed.

95 remaining versicolor and virginica = SubClus1. Continue outlier-identification rounds on SC1 (maxSL, maxSW, maxPW), then do "capped tube" (further subclusters): 1. (y-p)o(y-p), remove edge outliers (thr > 2*50); 2. thin gaps in SPD: d, from an edge point to MN; 3. for each thin PL, do length-gap analysis of the points in the "tube". Outliers so far: e13 i7 e40 e4 e10; e32 e11 e8 e44 e49.

45 remaining setosa: this is SubCluster 2 (it may have additional outliers or sub-subclusters, but we will not analyze it further; that would be done in practice, though).

SPD(y) = (y-p)o(y-p) - ((y-p)od)², d: mn-mx, V Ct: next slide.

i p max, V Ct: 0 2 1 10 2 11 3 6 4 15 5 4 6 8 7 9 8 4 9 5 10 2 11 7 13 4 14 2 15 2 16 1 17 1 18 1 19 1. e30, e15 outliers; e20, e31, e32 form SC12 - a declared tripleton outlier set? (But they are not singleton outliers.) (s3 s9 s39 s43 s42 s23; e13 e20 e15 e31 e32 e30.)
13
Cone Clustering: (finding cone-shaped clusters)
Corner points; gaps in the dot-product projections onto the corner-points line. F = (y-M)o(x-M)/|x-M| - mn, restricted to a cosine cone on IRIS.

x=s2, cone=.1: 39 2 40 1 41 1 44 1 45 1 46 1 47 1 (i39) 59 2 60 4 61 3 62 6 63 10 64 10 65 5 66 4 67 4 69 1 70 1 (total 59).

w maxs-to-mins, cone=.939: (i25 i40 i16 i42 i17 i38 i11 i48) 22 2 23 1 (i34 i50 i24 i28 i27) 27 5 28 3 29 2 30 2 31 3 32 4 34 3 35 4 36 2 37 2 38 2 39 3 40 1 41 2 46 1 47 2 48 1 (i39) 53 1 54 2 55 1 56 1 57 8 58 5 59 4 60 7 61 4 62 5 63 5 64 1 65 3 66 1 67 1 68 1 (total 114: 14 i and 100 s/e, so it picks out i at the 0 end).

w naaa-xaaa, cone=.95: 12 1 13 2 14 1 15 2 16 1 17 1 18 4 19 3 20 2 21 3 22 5 (i21) 24 5 25 1 27 1 28 1 29 2 (i7). 41/43 are e, so it picks out e.

x=s1, cone=1/√2: 60 3 61 4 62 3 63 10 64 15 65 9 66 3 67 1 69 2 (total 50).
x=s2, cone=1/√2: 47 1 59 2 60 4 61 3 62 6 63 10 64 10 65 5 66 4 67 4 69 1 70 1 (total 51).
x=s2, cone=.9: 59 2 60 3 61 3 62 5 63 9 64 10 65 5 66 4 67 4 69 1 70 1 (total 47).

Cosine cone gap (over some angle). w maxs, cone=.707: 0 2 8 1 10 3 12 2 13 1 14 3 15 1 16 3 17 5 18 3 19 5 20 6 21 2 22 4 23 3 24 3 25 9 26 3 27 3 28 3 29 5 30 3 31 4 32 3 33 2 34 2 35 2 36 4 37 1 38 1 40 1 41 4 42 5 43 5 44 7 45 3 46 1 47 6 48 6 49 2 51 1 52 2 53 1 55 1 (total 137).

w maxs, cone=.93: 8 1 (i10) 13 1 14 3 16 2 17 2 18 1 19 3 20 4 21 1 24 1 25 4 (e21 e34) 27 2 29 2 (i7). 27/29 are i's.

w aaan-aaax, cone=.54: 7 3 (i27 i28) 8 1 9 3 (i20 i34) 11 7 12 13 13 5 14 3 15 7 19 1 20 1 21 7 22 7 23 28 24 6. 100/104 are s or e, so the 0 end picks out i.

x=i1, cone=.707: 34 1 35 1 36 2 37 2 38 3 39 5 40 4 42 6 43 2 44 7 45 5 47 2 48 3 49 3 50 3 51 4 52 3 53 2 54 2 55 4 56 2 57 1 58 1 59 1 60 1 61 1 62 1 63 1 64 1 66 1 (total 75).
x=e1, cone=.707: 33 1 36 2 37 2 38 3 39 1 40 5 41 4 42 2 43 1 44 1 45 6 46 4 47 5 48 1 49 2 50 5 51 1 52 2 54 2 55 1 57 2 58 1 60 1 62 1 63 1 64 1 65 2 (total 60).

w maxs, cone=.925: 8 1 (i10) 13 1 14 3 16 3 17 2 18 2 19 3 20 4 21 1 24 1 25 5 (e21 e34) 27 2 28 1 29 2 (e35 i7). 31/34 are i's.

w xnnn-nxxx, cone=.95: 8 2 (i22 i50) 10 2 (i28 i24 i27 i34) 13 2 14 4 15 3 16 8 17 4 18 7 19 3 20 5 21 1 22 1 23 1 (i39). 43/50 are e, so it picks out e.

Cosine conical gapping seems quick and easy (cosine = dot product divided by both lengths; the length of the fixed vector x-M is a one-time calculation; the length |y-M| changes with y, so build the PTreeSet).
14
FxM(x,y) = yo(x-M)/|x-M| - min, on XX ≡ {(x,y) | x,y∈X}, where X is a Spaeth image table. Cluster by splitting at all F-gaps > 2.

APPENDIX. [Figure: the 15-point Spaeth set z1..zf plotted on its grid, with MeanVector M = (9,5); the 15 Value_Arrays (one for each x) and the 15 Count_Arrays.]

Level-0, stride=z1 PointSet (as a pTree mask). The Value Array for x=z1, FxM(z1,y) for y = z1..z15: 14 12 12 11 10 6 1 2 0 2 2 1 2 0 5. Gaps: 10-6 and 5-2, giving the pTree masks of the 3 z1_clusters (obtained by ORing): {z1,z2,z3,z4,z5}, {z6,z15}, {z7,...,z14}.

The FAUST algorithm:
1. Project onto each M-x line using the dot product with the unit vector from M to x (only x = z1 is shown).
2. Generate each Value Array F[x](y), x∈X (also generate the Count_Arrays and the mask pTrees).
3. Analyze all gaps and create sub-cluster pTree masks.
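A minimal executable sketch of the three steps. Since the original Spaeth coordinates did not fully survive, the point set below is a made-up stand-in; gaps > 2 split the set.

```python
import numpy as np

X = np.array([[1, 1], [3, 1], [2, 2], [9, 8], [10, 9], [11, 10],
              [7, 14], [8, 13]], float)     # stand-in for the Spaeth table
M = X.mean(axis=0)

def subcluster_masks(x, Y, min_gap=2):
    d = (x - M) / np.linalg.norm(x - M)     # 1. unit vector from M to x
    F = np.rint(Y @ d - (Y @ d).min())      # 2. the Value Array for this x
    order = np.argsort(F)
    cuts = np.where(np.diff(F[order]) > min_gap)[0]
    masks = []                              # 3. sub-cluster masks from the gaps
    for s, e in zip(np.r_[0, cuts + 1], np.r_[cuts + 1, len(Y)]):
        m = np.zeros(len(Y), bool)
        m[order[s:e]] = True
        masks.append(m)
    return masks

for m in subcluster_masks(X[0], X):         # only x = X[0] is shown, as on the slide
    print(np.nonzero(m)[0])
```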
15
Cluster by splitting at gaps > 2
yo(z7-M)/|z7-M| Value Arrays and Count Arrays (z = z1..zf). Cluster by splitting at gaps > 2. [Figure: the Spaeth point set with Mean = (9,5), the z1 Value Array as before, and the z7 projection with a gap at 6-9, giving masks z7_1 and z7_2.]

In Step 3 of the algorithm we can either analyze one of the gap arrays (e.g., as was done for z1; its subclusters are shown above) and then start over on each subcluster, or we can analyze all gap arrays concurrently (in parallel, using the same F - saving the [substantial?] re-compute costs?) and then intersect the subcluster partitions we get from each x Value Array's gap analysis for the final subclustering. Here we use the second alternative, judiciously choosing only the x's that are likely to be productive (choosing z7 next). Many are likely to produce redundant partitions - e.g., z1, z2, z3, z4, z6 - as their projection lines will be nearly coincident. How should we choose the sequence of "productive" strides? One way would be to always choose the remaining stride with the shortest Value Array, so that the chances of decent-sized gaps are maximized. Other ways of choosing?
16
Cluster by splitting at gaps > 2
yo(x-M)/|x-M| Value Arrays and Count Arrays. Cluster by splitting at gaps > 2. [Figure: the Spaeth point set; the zd = z13 projection shows a gap at 3-7, giving masks zd_1 and zd_2, alongside the earlier z1 and z7 masks.]

We choose zd = z13 next. (Should it have been first, since its Value Array is shortest?) Note that the z8, z9, za projection lines will be nearly coincident with that of z7.
17
Cluster by splitting at gaps > 2
yo(x-M)/|x-M| Value Arrays and Count Arrays. Cluster by splitting at gaps > 2. [Figure: the Spaeth point set with the z1, z7 and zd mask pTrees.]

AND each red mask with each blue mask with each green mask to get the subcluster masks (12 ANDs, producing 5 sub-clusters).
18
F1(x,y) = L1Distance(x,y) = |x1-y1| + |x2-y2|, on XX ≡ {(x,y) | x,y∈X}.
Cluster by splitting at all F1-gaps > 2. L1(x,y) Value Array and Count Array (z = z1..zf). [Figure: the Spaeth point set.]

For z1 there is a gap (10-5), but it produces a subclustering that was already discovered by a previous round (a redundant subclustering). Which z values will give new subclusterings?
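The same gap machinery with the L1 functional, as a short hedged sketch (names illustrative):

```python
import numpy as np

def l1_value_array(x, Y):
    """F1(x, y) = |x1-y1| + |x2-y2| (+ ...) for every y in Y."""
    return np.abs(Y - x).sum(axis=1)

def f_gaps(F, min_gap=2):
    """All (lo, hi) pairs of consecutive distinct F-values gapping by > min_gap."""
    v = np.unique(F)
    return [(lo, hi) for lo, hi in zip(v, v[1:]) if hi - lo > min_gap]
```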
19
Re-confirms zf as an anomaly.
L1(x,y) Value Array and Count Array (z = z1..zf). [Figure: the Spaeth point set with Mean M.]

This re-confirms z6 as an anomaly or outlier, since it was already declared so during the linear gap analysis; it also re-confirms zf as an anomaly. After having subclustered with linear gap analysis (which is best for determining larger subclusters), we run this round-gap algorithm out only 2 steps, to determine whether there are any single-F-value gaps > 2 (the points in such a single-F-value-gapped set are then declared anomalies). That is, we run it out two steps only, then find those points for which the one initial gap determined by those first two values is sufficient to declare outlierness. Doing that here, we reconfirm the outlierness of z6 and zf, while finding new outliers z5 and za.
20
Using F=yo(x-M)/|x-M|-MIN on IRIS, one stride at a time (s1=setosa1 first)
For virginica1, Val Ct: 0 1 1 1 2 2 3 5 4 6 5 11 6 12 7 4 8 2 9 5 10 1 17 1 22 1 24 2 25 1 27 1 28 1 29 2 30 1 31 3 32 4 33 1 34 4 35 2 36 2 37 4 38 4 39 5 40 4 42 6 43 2 44 7 45 5 47 2 48 3 49 3 50 3 51 4 52 3 53 2 54 2 55 4 56 2 57 1 58 1 59 1 60 1 61 1 62 1 63 1 64 1 66 1. F(i39)=17; F<17 (50 Setosa).

For versicolor1, Val Ct: 0 1 2 4 3 1 4 1 5 3 6 3 7 8 8 3 9 7 10 6 11 4 12 4 13 3 15 2 19 2 20 2 21 1 26 2 27 3 28 4 30 2 31 5 32 4 33 3 34 1 36 3 37 5 38 4 39 5 40 7 41 4 42 2 43 2 44 1 45 6 46 4 47 5 48 1 49 2 50 5 51 1 52 2 54 2 55 1 57 2 58 1 60 1 62 1 63 1 64 1 65 2. F<19 (50 setosa); 19<F<22 {vers8,12,39,44,49}; 22<F.

For s1 (i.e., yo(s1-M)/|s1-M| - 69), Val Ct: 0 1 3 1 4 2 7 1 8 1 9 2 10 1 12 4 14 5 15 2 16 4 17 1 18 4 19 5 20 1 21 2 22 2 23 8 24 4 25 3 26 2 27 5 28 3 29 4 30 4 31 3 32 2 33 2 34 4 35 5 36 2 37 2 38 1 39 1 40 1 41 1 43 1 44 1 45 1 52 1 60 3 61 4 62 3 63 10 64 15 65 9 66 3 67 1 69 2. F(i39)=52: virginica39 is an outlier. 2 clusters: F<52 (ct=99) and F>52 (50 Setosa).

For virginica39, Val Ct: 0 1 1 2 2 1 4 2 6 1 7 1 8 7 9 2 10 2 11 7 12 2 13 3 14 7 15 4 16 10 17 4 18 6 19 9 20 3 21 6 22 3 23 6 24 3 25 1 27 3 28 2 32 1 39 1 40 1 41 1 42 8 43 13 44 17 45 4 46 5 47 1. F=32: vers49 is an outlier; 32<F (50 Setosa, vir39).

For AVG(ver8,12,39,44,49), Val Ct: 0 1 1 1 7 5 10 3 12 2 13 2 14 3 15 5 16 2 17 5 18 8 19 4 20 3 21 4 22 3 23 8 24 4 25 4 26 3 27 7 28 7 29 4 30 5 31 4 32 5 33 8 34 2 35 6 36 5 37 3 38 2 39 8 40 6 41 3 43 1 44 2 45 1 47 1. F=0: vir32 outlier. F=1: vir18 outlier. F=7: vir6,10,19,23,36 - a subcluster?
21
F=yo(x-M)/|x-M|-MIN on IRIS, subclustering as we go.
F = yo(x-M)/|x-M| - MIN on IRIS, subclustering as we go.

For s1 (i.e., yo(s1-M)/|s1-M| - 69), Val Ct: 0 1 3 1 4 2 7 1 8 1 9 2 10 1 12 4 14 5 15 2 16 4 17 1 18 4 19 5 20 1 21 2 22 2 23 8 24 4 25 3 26 2 27 5 28 3 29 4 30 4 31 3 32 2 33 2 34 4 35 5 36 2 37 2 38 1 39 1 40 1 41 1 43 1 44 1 45 1 (outlier) 52 1 60 3 61 4 62 3 63 10 64 15 65 9 66 3 67 1 69 2. F(i39)=52: i39 = virginica39 is an outlier. Clusters: F<52 (ct=99) and F>52 (50 Setosa).

On Clus(F<52), versicolor1: F(virg7)=0 outlier; F(virg32)=25 outlier. Val Ct: 0 1 4 1 5 5 6 3 7 5 8 3 9 8 10 11 11 14 12 8 13 8 14 5 15 3 16 7 17 5 18 6 19 2 20 1 21 1 22 1 25 1.

On Remaining, mx mn mx mx, Val Ct: 0 3 1 4 2 11 3 14 4 14 5 9 6 10 7 2 8 6 9 2 11 2.

On Remaining, max's, Val Ct: 0 2 (e8 outlier) 1 2 (e11 outlier) 7 2 8 1 9 4 10 1 11 2 12 2 13 4 14 3 15 1 16 4 17 2 18 2 19 3 20 4 21 6 22 5 23 5 24 4 25 2 26 2 27 1 28 2 29 4 30 5 31 1 32 3 33 2 34 2 35 3 36 2 37 1 38 1 (high end: i8 i10 i36 i6 i23 i19 i18). i6, i10, i18, i19, i23, i35 all declared outliers; e4, e38, e19 and i20 are also outliers.

On Remaining, max's, Val Ct: (e44 outlier) 6 1 7 2 8 1 9 3 10 1 11 3 12 5 13 2 14 2 15 3 17 3 18 3 19 5 20 1 21 9 22 5 23 4 24 2 26 4 27 2 28 2 29 4 30 2 31 3 32 3 33 2 34 3 35 2 36 1 37 1 38 1 39 1 (e36 outlier?).

On Remaining, mx mx mx mn, Val Ct: 0 1 1 2 2 3 3 1 5 5 6 4 7 5 8 2 9 3 10 5 11 4 12 7 13 5 14 2 15 4 16 4 17 7 18 4 19 4 20 2 21 2 22 1 24 1 25 1 27 2 29 2.

On Remaining, mn mn mx mx, Val Ct: 0 1 1 3 2 3 3 7 4 7 5 7 6 5 7 5 8 3 9 8 10 4 11 4 12 11 13 4 14 8 15 4 16 1 18 1.

On Remaining, mn mx mx mx, Val Ct: 0 1 2 1 3 4 4 3 5 5 6 4 7 5 8 7 9 8 10 3 11 5 12 2 13 4 14 5 15 7 16 5 17 4 18 1 20 1.

On Remaining with e35, Val Ct: 0 1 (i26 outlier) 3 2.

On remaining, virginica1, Val Ct: 0 1 1 2 2 1 4 1 5 1 6 2 7 2 8 2 9 4 10 1 11 4 12 3 13 4 14 2 15 6 16 4 17 6 19 4 20 5 21 5 22 2 23 1 24 2 25 5 26 4 27 4 28 1 29 2 30 6 31 2 32 1 33 1 34 1 35 2 36 1 38 1 39 1. (e35, e10 outliers; i44, i3 outliers; i3, i30, i31, i26, i8, i36 outliers.)

Rem, mn mx mn mx, Val Ct: 0 1 1 1 2 1 3 1 4 1 5 1 6 1 8 1 9 3 10 5 11 5 12 3 13 7 14 6 15 4 16 6 17 7 18 5 19 4 20 2 21 3 22 7 23 4 24 3 25 1 26 1 27 2 (e49 outlier).

On Remaining, mn mx mx mx, Val Ct: 0 1 1 1 2 1 3 5 4 6 5 5 6 4 7 9 8 4 9 4 10 4 11 3 12 5 13 6 14 6 15 7 16 5 17 4 18 4 20 1 22 1. Could look at distances for 0,1 and 20,22? (e13, e30, e32 outliers; i44, i45, i49, i5, i37 not outliers; i1 outlier.)
22
F = L1(x,y) on IRIS, masking to subclusters (go right down the table), two rounds only. If we use L1-gap = 6, remove those outliers, and then use linear gap analysis for larger-subcluster revelation, let's see if we can separate versicolor (e) from virginica (i).

Outliers, gap > L1 = 32.1: s6 s14 s15 s16 s17 s19 s21 s23 s24 s32 s33 s34 s37 s42 s45 e1 e2 e3 e5 e6 e7 e9 e10 e11 e12 e13 e15 e18 e19 e21 e22 e23 e27 e28 e29 e30 e34 e36 e37 e38 e41 e49 i1 i3 i4 i5 i6 i7 i8 i9 i10 i12 i14 i15 i16 i18 i19 i20 i22 i23 i25 i26 i28 i30 i31 i32 i34 i35 i36 i37 i39 i41 i42 i45 i46 i47 i49 i50

Outliers, gap > L1 = 42.8: s15 s16 s19 s23 s33 s34 s37 s42 s45 e1 e2 e7 e10 e11 e12 e13 e15 e19 e21 e22 e23 e27 e28 e30 e34 e36 e38 e41 e49 i1 i3 i5 i6 i7 i8 i9 i10 i12 i14 i15 i16 i18 i19 i20 i22 i23 i26 i30 i31 i32 i34 i35 i36 i39

Outliers, gap > L1 = 53.5: s15 s16 s23 s33 s34 s42 e10 e13 e15 e27 e28 e30 e36 e49 i1 i3 i7 i9 i10 i12 i15 i18 i19 i20 i26 i30 i32 i35 i36 i39

Outliers, gap > L1 = 64.3: s15 s16 s23 s42 e10 e13 e49 i3 i7 i9 i10 i18 i19 i20 i32 i35 i36 i39

Outliers, gap > L1 = 74.95 (with each point's L1-gap): s42 9, e13 8, i7 10, i9 12, i10 12, i35 9, i36 9, i39 26
23
The Rank(K) procedure on the bit slices P_n..P_0 of a pTree column (P is the candidate mask):

Val = 0; p = K; P = Pure1;
for i = n down to 0 {
    c = Ct(P & P_i);
    if (c >= p) { Val = Val + 2^i; P = P & P_i; }
    else { p = p - c; P = P & P'_i; }
}
return Val, P;

[Table residue: the IDX/IDY key columns z1..zf, the value columns X1..X4 with their bit slices P3..P0 and complements P'3..P'0, and the pairwise-distance row d(x,y) = 2 1 3 4 8 14 13 12 9 6 11 10 ... 7 5.]

We need Rank(n-1) applied to each stride instead of to the entire pTree. The result from stride j gives the j-th entry of SpS(X, d(x, X-x)). Parallelize over a large cluster? For Ct(P&P_i), revise the Count procedure to kick out the count for each stride (involves looping down the pTree by register lengths?). What does P represent after each step? How does the algorithm go on 2pDoop (with 2-level pTrees), where each stride is separate? Note: we are using d, not d² (fewer pTrees). Can we estimate d (using a truncated Maclaurin series)?

Example runs of Rank(14):
Run 1: n=3: c=Ct(P&P3)=10 < 14, so p=14-10=4, P=P&P'3 (eliminates 10 values ≥ 8). n=2: c=1 < 4, so p=4-1=3, P=P&P'2 (eliminates 1 value ≥ 4). n=1: c=2 < 3, so p=3-2=1, P=P&P'1 (eliminates 2 values ≥ 2). n=0: c=2 >= 1, P=P&P0 (eliminates 1 value < 1). Val = 2³·0 + 2²·0 + 2¹·0 + 2⁰·1 = 1.
Run 2: n=3: c=9 < 14, p=5, P=P&P'3. n=2: c=0 < 5, p=5, P=P&P'2. n=1: c=4 < 5, p=1, P=P&P'1. n=0: c=1 >= 1, P=P&P0. Val = 1.
Run 3: n=3: c=9 < 14, p=5, P=P&P'3. n=2: c=2 < 5, p=3, P=P&P'2. n=1: c=2 < 3, p=1, P=P&P'1. n=0: c=2 >= 1, P=P&P0. Val = 1.
Run 4: n=3: c=6 < 14, p=8, P=P&P'3. n=2: c=7 < 8, p=1, P=P&P'2. n=1: c=1 >= 1, Val += 2¹, P=P&P1. n=0: c=1 >= 1, Val += 2⁰, P=P&P0. Val = 2¹·1 + 2⁰·1 = 3.
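A minimal executable sketch of this Rank(K) procedure in plain numpy rather than pTrees; bit_slices plays the role of P_n..P_0 and the returned mask plays the role of the final P (the function and variable names are mine).

```python
import numpy as np

def rank_k(bit_slices, K):
    """K-th largest value of a column given as bit slices (most significant first)."""
    P = np.ones_like(bit_slices[0], dtype=bool)   # candidate mask (Pure1)
    val, p = 0, K
    for i, Pi in enumerate(bit_slices):
        weight = 1 << (len(bit_slices) - 1 - i)
        c = np.count_nonzero(P & Pi)              # c = Ct(P & P_i)
        if c >= p:                                # the K-th largest has this bit set
            val += weight
            P = P & Pi
        else:                                     # bit clear; skip the c larger values
            p -= c
            P = P & ~Pi
    return val, P

x = np.array([3, 14, 7, 9, 1, 12])
slices = [((x >> b) & 1).astype(bool) for b in range(3, -1, -1)]
print(rank_k(slices, 2))   # -> (12, mask of the rows equal to 12): 2nd largest value
```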
24
Level-1 key map (red = pure stride, so no Level-0)
[Figure: the Level-1 and Level-0 key maps over the keys a-m and bit positions 1_3..4_0; red entries are pure strides, so they have no Level-0 component. Example key-map entries: "(6-e) = e, else pure0", "(5,7-a,f) = f, else pure0", "(b-f) = i, else pure0", "(a) = j, else pure0".]

In this 2pDoop KEY-VALUE DB, we list keys. Should we bitmap? Each bitmap is a pTree in the KVDB. Each of these already exists, e.g., e here.

= SpS(XX, ...), the bit-slice expansion (p_ij = bit j of column i):
-2⁷(p13p33 + p13p32 + p23p43 + p23p42) - 2⁶(p13p31 + p23p41)
+2⁶(p13 + p23 + p33 + p43 + p13p12 + p23p22 + p33p32 + p43p42)
+2⁵(p13p11 + p23p21 + p33p31 + p43p41) - 2⁵(p13p30 + p23p40 + p12p31 + p22p41 + p12p32 + p22p42)
+2⁴(p12 + p22 + p32 + p42 + p13p10 + p23p20 + p33p30 + p43p40) - 2⁴(p12p30 + p22p40 + p12p11 + p22p21 + p32p31 + p42p41)
+2³(p12p10 + p22p20 + p32p30 + p42p40) - 2³(p11p31 + p11p30 + p21p41 + p21p40)
+2²(p11 + p21 + p31 + p41 + p11p10 + p21p20 + p31p30 + p41p40) - 2²(p10p30 + p20p40 + p10 + p20 + p30 + p40)
25
Dot Product Projection (DPP)
Dot Product Projection (DPP): 1. Check F = Dot_p,d(y) for gaps or thin intervals. 2. Check actual distances at sparse ends.

Here we apply DPP to the IRIS data set: 150 iris samples (rows) and 4 columns (Petal Length, Petal Width, Sepal Length, Sepal Width). We assume we don't know ahead of time that the first 50 are the Setosa class, the next 50 the Versicolor class and the final 50 the Virginica class. We cluster with DPP and then see how close we came to separating the 3 classes (s=setosa, e=versicolor, i=virginica).

Round 1: p=nnnn (n=min), q=xxxx (x=max), gap ≥ 4. F:Count = 0:2 3:2 4:1 5:1 7:1 9:2 10:1 11:1 12:1 13:2 14:1 15:3 16:4 17:3 18:2 19:8 20:2 21:3 22:1 23:4 24:5 25:4 26:5 27:5 28:4 29:3 30:2 31:2 32:2 33:4 34:5 36:3 37:2 38:4 40:1 42:1 43:1 44:2 45:5 47:6 48:3 49:4 50:6 51:4 52:3 53:5 54:3 55:5 56:1 57:1 58:2 59:1 60:1.

Sparse lower end: i32 i18 i19 i23 i6 i36. i32, i18, i19 are gap ≥ 4 outliers. Sparse upper end: s23 s43 s9 s39 s42 s14 - no gap ≥ 4 outliers.

Thin interval: e4 i7 e10 e31 e32 s14 i39 s16 s19 e49 s15 e44 e11 e8 s6 s34. So i39, s16, s19, s15 are "thin area" outliers. Separate at 41, giving CLUS1 < 41 (50 Setosa plus 4 Versicolor: e8, e11, e44, e49) and CLUS2 ≥ 41.

Round 2, on CLUS2: p=aaan (a=avg), q=aaax, gap ≥ 4. F:Count = 0:3 1:3 2:8 3:2 4:6 5:5 6:5 7:11 8:2 9:4 10:12 11:8 12:13 13:5 14:3 15:7.

Analyzing the thin interval [8,9]: e21 i4 i8 i9 i17 i24 i26 i27 i28 i38 i50 e2 e3 e12 e5 e17 e19 e23 e29 e35 e37 i20 i34. Each actual distance from an F=7 point to an F=10 point is ≥ 4, so the F-gap from 6 to 11 is effectively ≥ 4. Separate at F=8.5 into CLUS2.1 < 8.5 (2 versicolor, 43 virginica) and CLUS2.2 > 8.5 (44 versicolor, 4 virginica).

So two rounds of Dot_p,d(y) gap analysis yield CLUS1 (50 Setosa plus 4 Versicolor), CLUS2.1 (43 Virginica plus 2 Versicolor) and CLUS2.2 (44 Versicolor plus 4 Virginica), and pick out 3 Virginica and 5 Setosa as outliers. (More outliers would result from applying step 1.1 to the sparse ends of the 2nd round?)

Sparse-end analysis should accomplish the same outlier detection that a few steps of SL accomplish: if an outlier is surrounded at a fixed distance, then those neighbors will show up as sparse-end neighbors, and the outlier-ness of the point will be detected by looking at the pairwise distances of that sparse end.
26
Check Dot_p,d(y) for thinnings. Calculate the AVG of each side of the thinning as p, q; redo.
p=nnnn, q=xxxx: 0:2 3:2 4:1 5:1 7:1 9:2 10:1 11:1 12:1 13:2 14:1 15:3 16:4 17:3 18:2 19:8 20:2 21:3 22:1 23:4 24:5 25:4 26:5 27:5 28:4 29:3 30:2 31:2 32:2 33:4 34:5 36:3 37:2 38:4 40:1 42:1 43:1 44:2 45:5 47:6 48:3 49:4 50:6 51:4 52:3 53:5 54:3 55:5 56:1 57:1 58:2 59:1 60:1.

Dot, p=AVG>22, q=AVG<22: 0:1 1:1 2:2 3:2 4:4 5:5 6:9 7:11 8:6 9:3 10:3 11:3 19:1 23:1 24:1 25:1 26:1 29:1 30:1 31:2 32:2 34:6 35:2 36:4 37:2 38:2 39:3 40:3 41:4 42:4 43:2 44:3 45:6 46:7 47:3 48:2 49:1 50:3 52:7 54:5 55:1 56:3 57:3 58:2 59:2 61:1 62:2 64:1 66:1 67:1 68:2 70:1.

Cut=15: CLUS_1 < 15 (50 Setosa); CLUS_2 > 15.