Method-1: Find a furthest point from M, f0 = MaxPt[SpS((x-M)o(x-M))].


1 Method-1: Find a furthest point from M, f0 = MaxPt[SpS((x-M)o(x-M))].
Do M round gap analysis using SpS((x-M)o(x-M)).
Do f0 round gap analysis using SpS((x-f0)o(x-f0)). d0 ≡ (M-f0)/|M-f0|. Do d0 linear gap analysis on SpS((x-f0)od0).
Find a furthest point from f0, f1 ∈ MaxPt[SpS((x-f0)o(x-f0))]. d1 ≡ (f1-f0)/|f1-f0|. Do d1 linear gap analysis on SpS((x-f0)od1). Do f1 round gap analysis on SpS((x-f1)o(x-f1)).
Let X' ≡ the space perpendicular to d1. The projection of x-f0 onto X' is x' ≡ (x-f0) - ((x-f0)od1)d1, so
x'ox' = [(x-f0) - ((x-f0)od1)d1] o [(x-f0) - ((x-f0)od1)d1] = (x-f0)o(x-f0) - ((x-f0)od1)2, and therefore
SpS(x'ox') = SpS[(x-f0)o(x-f0)] - SpS[((x-f0)od1)2].
Let f2 ∈ MaxPt[SpS(x'ox')] and d2 ≡ f2'/|f2'| = [(f2-f0) - ((f2-f0)od1)d1]/|f2'|. Then d2od1 = [(f2-f0)od1 - ((f2-f0)od1)(d1od1)]/|f2'| = 0.
Next, x'' ≡ x' - (x'od2)d2. Then x''od1 = x'od1 - (x'od2)(d2od1) = (x-f0)od1 - ((x-f0)od1)(d1od1) = 0, and x''od2 = x'od2 - (x'od2)(d2od2) = 0.
In general, x(k) ≡ x(k-1) - (x(k-1)odk)dk, where fk ∈ MaxPt[SpS(x(k-1)ox(k-1))] and dk ≡ fk'/|fk'|. {dk} forms an orthonormal basis. Do fk round gap analysis on SpS[(x-fk)o(x-fk)]. Do dk linear gap analysis on SpS[(x-f0)odk].
Linear gap analysis includes: coordinate gap analysis, SpS(xo(a1,...,an)) or SpS((x-p)o(a1,...,an)) or SpS(Σi=1..n ai(x-p)i2) (squared length is a sub-case); truncated Taylor series, SpS(Σk=1..N bk Σi=1..n ai(x-p)ik); square gradient-like length, Σi=1..n MaxVal(x(k-1)ox(k-1)); and MaxLength itself, MaxVal(x(k-1)ox(k-1)). Each of these defines dk+1 ≡ (fk+1-f0)/|fk+1-f0| rather than dk+1 ≡ fk+1'/|fk+1'|.
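Read as an algorithm, one Method-1 step is short. Below is a minimal NumPy sketch of it, with ordinary arrays standing in for the SpS scalar pTreeSets (the pTree machinery is what makes this fast at scale, not the logic); the helper names and the gap threshold of 4 (borrowed from the later run logs) are illustrative only.

import numpy as np

def furthest_point(X, p):
    # MaxPt[SpS((x-p)o(x-p))]: the row of X furthest from p
    sq = ((X - p) ** 2).sum(axis=1)
    return X[np.argmax(sq)]

def gap_cuts(values, threshold=4.0):
    # sort the scalar values and report consecutive gaps wider than threshold
    v = np.sort(values)
    return [(v[i], v[i + 1]) for i in range(len(v) - 1) if v[i + 1] - v[i] > threshold]

def method1_step(X, threshold=4.0):
    M = X.mean(axis=0)                                 # mean vector M
    f0 = furthest_point(X, M)                          # f0 from SpS((x-M)o(x-M))
    round_vals = np.sqrt(((X - f0) ** 2).sum(axis=1))  # distances for f0 round gaps
    f1 = furthest_point(X, f0)                         # f1 from SpS((x-f0)o(x-f0))
    d1 = (f1 - f0) / np.linalg.norm(f1 - f0)
    linear_vals = (X - f0) @ d1                        # SpS((x-f0)od1)
    return gap_cuts(round_vals, threshold), gap_cuts(linear_vals, threshold)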

2 [Data table: the 150 IRIS samples over attributes SL, SW, PL, PW, with 50 setosa (set), 50 versicolor (ver) and 50 virginica (vir), plus 30 added outlier tuples (t1, ..., t234, tall; b1, ..., b234, ball). The MINS, MAXS and MEAN vectors are the same before and after the additions.]

3 Summarizing, the methodology is to:
1. Choose a point f0 (chosen for high outlier potential, e.g., furthest from the mean, M?).
2. Do f0-round-gap outlier analysis (+ subcluster analysis?).
3. Let f1 be such that no x is further away from f0 (in some direction) than f1 (so that all d1 dot products are ≥ 0).
4. Do f1-round-gap outlier analysis (+ subcluster analysis?).
5. Do d1-linear-gap analysis, where d1 ≡ (f0-f1)/|f0-f1|.
6. Let f2 be such that no x is further away (in some direction) from the d1-line than f2.
7. Do f2-round-gap analysis.
8. Do d2-linear-gap analysis, d2 ≡ [(f0-f2) - ((f0-f2)od1)d1]/length, ...

Run log on the augmented IRIS data (gap threshold 4), condensed from the slide's distance tables: the first round gap analysis flags t123, b13, b234, tal, b134, b123 and ball: all outliers! Successive f0- and f1-round-gap and linear-gap analyses on the resulting subclusters flag the remaining added tuples (b2, b23, b3, t3, t34, b124, b12, b14, b24, t24, b1, b34, ...), while runs such as f1=ver49, f1=set42 and f0=ver19 (round and linear) find none. Observe that SubClust-2 consists precisely of the 50 setosa iris samples! Likely f2, f3 and f4 analysis will find no more.
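A driver for steps 1-8 is easy to state once a gap finder exists. The recursive sketch below is a plain-NumPy simplification of my own: it uses only the f0 round gaps of steps 1-2 and declares every singleton piece between wide gaps an outlier; the f1/d1 and f2/d2 refinements of steps 3-8 would slot in the same way.

import numpy as np

def find_outliers(X, threshold=4.0, outliers=None):
    if outliers is None:
        outliers = []
    if len(X) < 2:
        outliers.extend(X)
        return outliers
    M = X.mean(axis=0)
    f0 = X[np.argmax(((X - M) ** 2).sum(axis=1))]   # step 1: high outlier potential
    dist = np.sqrt(((X - f0) ** 2).sum(axis=1))     # step 2: f0-round-gap values
    order = np.argsort(dist)
    gaps = np.where(np.diff(dist[order]) > threshold)[0]
    if len(gaps) == 0:
        return outliers                             # no wide gap: one cluster
    for piece in np.split(order, gaps + 1):         # cut at every wide gap
        if len(piece) == 1:
            outliers.append(X[piece[0]])            # singleton between gaps
        else:
            find_outliers(X[piece], threshold, outliers)   # recurse on subcluster
    return outliers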

4 Method-2: f and g are opposite corners of the circumscribing box: f ≡ MaxVectorX ≡ (maxXx1, ..., maxXxn), g ≡ MinVectorX ≡ (minXx1, ..., minXxn). Then sequence thru the other opposite corner pairs (the d's are ortho-normal iff the box is a cube). Advantages? No calculation is needed for f and g. Calc SpS((x-f|g)o(x-f|g)) for round-gap analysis; calc SpS(do(x-f|g)) for linear gap analysis (if it is deemed productive).

Run log (gap threshold 4), condensed: f=MinVector round gaps flag tal, t123, t124, ..., b23, b124, b13, b234, b134, b123, ball; g=MaxVector round gaps flag t23, t234, t12, t124, t13, t134. Corner pairs such as f=(mx1,mn2,mn3,mn4), g=(mx1,mx2,mx3,mn4), g=(mn1,mx2,mx3,mx4), f=(mn1,mn2,mx3,mn4) and g=(vmx1,vmx2,vmn3,vmx4) flag the remaining added tuples plus a few genuine samples (e.g., vir18, b24, b3, b23; vir32, vir6; b34, vir1; b4, vir39, b1, b14, b12; ver16, vir19, t24), and one d=(f-g)/len linear gap flags t23, set14; most of the other linear gap analyses find none (ver49, ver11, ver44 and ver8 are not separated by 4 from all others; one SubClus-2 run flags set19, t34). Method-2 finds all 30 added outliers (but they were added as "circum-corners"), plus 4 virginica, 1 setosa and 1 versicolor outliers. In fact, Sub-Cluster-2 consists precisely of the remaining 49 setosa iris samples.
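What changes from Method-1 is only where f and g come from. A sketch of one Method-2 pass (illustrative names; the three returned scalar sets would be fed to the same gap finder sketched under Method-1):

import numpy as np

def method2_pass(X):
    # f, g: opposite corners of the coordinate-circumscribing box; no SpS scan
    # is needed to obtain them, unlike the furthest-point searches of Method-1.
    f = X.max(axis=0)                               # MaxVectorX
    g = X.min(axis=0)                               # MinVectorX
    d = (f - g) / np.linalg.norm(f - g)
    sps_f = np.sqrt(((X - f) ** 2).sum(axis=1))     # SpS((x-f)o(x-f)) lengths
    sps_g = np.sqrt(((X - g) ** 2).sum(axis=1))     # SpS((x-g)o(x-g)) lengths
    proj = X @ d                                    # SpS(xod) for linear gaps
    return sps_f, sps_g, proj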

5 In Method-2 we used, as our projection lines, the diagonals of the circumscribing coordinate rectangle. In Method-3.1 we used a circumscribing rectangle in which the corners are actual points from X and the diagonals are diameters.

DEFINITIONS: Given an aggregate function A: 2R → R (e.g., max, min, median, mean) and V ⊆ Rn, with Vk ≡ projk(V) ⊆ R1 for k=1..n, define AVector(V) ≡ (A(V1), A(V2), ..., A(Vn)) and call it "the A-Vector or the Vector of As" (e.g., MinVector, MaxVector, MedianVector, MeanVector). Each of the first three is actually a RankVector for the right choice of rank: MinVector(V) = RankVector1(V), MaxVector(V) = RankVector|V|(V) and MedianVector(V) = RankVector|V|/2(V), where, as is customary, if |V| is even, Rank|V|/2 ≡ (Rank(|V|-1)/2 + Rank(|V|+1)/2)/2. Other [non-rank] vectors include SumVector, StdVector and DiameterVector.

In the previous Method-2 example, I just picked some diagonals. Ideally we should sequence through the main diagonals first and after that, possibly, the sub-main diagonals, then the sub-sub-main diagonals, and so on. What are these? Let b' be the bit complement of b. The [4] main 3D diagonals run from b1b2b3 to b1'b2'b3' and can be sequenced by b1=0 and b2b3 = 00, 01, 10, 11. The [8] main 4D diagonals: b1=0 and b2b3b4 = 000, 001, 010, 011, 100, 101, 110, 111; etc.

Next, we redo Method-2 using the eight main diagonals in the order given (doing round-gap analysis first, with fk,1=0 and gk,1=1 and the other bits sequencing identically for both fk and gk through k = 000, 001, ..., 111), then linear analysis with dk ≡ (fk-gk)/|fk-gk|. The advantage [may be] that fk and gk are already known (no SpS has to be built and analyzed to get them) and the dk's are [close to] orthogonal. Next we test this revision of Method-2 (called 2.1) against Method-3.1, to see if orienting the rectangle to "fit" X (all corners fk, gk are from X) is worth the extra work (better accuracy? clearly it is slower processing!).
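The corner-pair sequencing is just bit arithmetic. A small sketch (hypothetical helper names) that enumerates the 2^(n-1) main-diagonal pairs and maps a bit pattern to an actual box corner:

from itertools import product

def main_diagonal_pairs(n):
    # The main diagonals of the n-cube run from corner b to its bit complement b';
    # fixing the first bit of f to 0 sequences each diagonal exactly once.
    for rest in product((0, 1), repeat=n - 1):
        f_bits = (0,) + rest
        g_bits = tuple(1 - b for b in f_bits)   # b' = bit complement
        yield f_bits, g_bits

def corner(bits, mins, maxs):
    # map a bit pattern to a box corner: bit 0 -> min_k, bit 1 -> max_k
    return [maxs[k] if b else mins[k] for k, b in enumerate(bits)]

For n=4 this yields the eight pairs 0000/1111, 0001/1110, ..., 0111/1000 used in the 2.1 runs below.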

6 Method-3: cluster X ⊆ Rn. Calculate M = MeanVector(X) directly, using only the residualized 1-counts of the basic pTrees of X. And BTW, use residualized STD calculations to guide the choice of good gap-width thresholds (which define what an outlier is going to be and also determine when we divide into sub-clusters).

Pick f1 ∈ MxPt(SpS[(M-x)o(M-x)]). d1 ≡ (M-f1)/|M-f1|. If d1k ≠ 0, Gram-Schmidt {d1, e1=(1,0,...,0), ..., ek-1, ek+1, ..., en=(0,...,0,1)}, giving an orthonormal basis {d1, ..., dn}. Assume k=1:
d2 ≡ (e2 - (e2od1)d1) / |e2 - (e2od1)d1|
d3 ≡ (e3 - (e3od1)d1 - (e3od2)d2) / |e3 - (e3od1)d1 - (e3od2)d2|
...
dh ≡ (eh - (ehod1)d1 - (ehod2)d2 - ... - (ehodh-1)dh-1) / |eh - (ehod1)d1 - (ehod2)d2 - ... - (ehodh-1)dh-1|

Theorem: for a fixed point M, the extreme points of SpS((M-x)od) and SpS(xod) coincide. Since (M-x)od = Mod - xod, the projected values {(M-x)od | x∈X} are just the values {xod | x∈X}, negated and shifted by Mod; therefore the max of one is generated at the same point(s) as the min of the other. So we can always use xod in place of (M-x)od when calculating MxPt (and MnPt).

Repick f1 ∈ MnPt[SpS(xod1)]. Pick g1 ∈ MxPt[SpS(xod1)]. Pick f2 ∈ MnPt[SpS(xod2)] and g2 ∈ MxPt[SpS(xod2)]. ... Pick fh ∈ MnPt[SpS(xodh)] and gh ∈ MxPt[SpS(xodh)]. Do some combination of round gap analysis using the f's and g's and linear gap analysis with the d's (possibly only the round gap analysis?).

Notes on implementation speed: for a fixed point p in Rn (e.g., p=M or p=fk or p=gk), (p-x)o(p-x) = pop + xox - 2xop = pop + Σk=1..n xk2 + Σk=1..n (-2pk)xk. Since loading is the most expensive step in our logical operations (AND/OR/XOR/...), there should be ways to, e.g., load x once and then use it for nearly all the binary operations above (rather than reloading it for each one individually).
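The Gram-Schmidt completion can be written directly from the displayed formulas. A minimal sketch, assuming (my choice, one safe way to guarantee d1k ≠ 0) that the dropped ek is the coordinate where |d1| is largest; the slide simply assumes k=1:

import numpy as np

def gram_schmidt_from(d1):
    # Complete d1 to an orthonormal basis {d1, ..., dn} using the standard
    # basis vectors e_j, skipping the e_k most parallel to d1.
    n = len(d1)
    k = int(np.argmax(np.abs(d1)))
    basis = [d1 / np.linalg.norm(d1)]
    for j in range(n):
        if j == k:
            continue
        e = np.zeros(n)
        e[j] = 1.0
        v = e - sum((e @ d) * d for d in basis)   # subtract projections on earlier d's
        basis.append(v / np.linalg.norm(v))
    return np.array(basis)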

7 In this first attempt with Method-3, I will use SpS((M-x)o(M-x)) to find f1, then do all the linear gap analyses.

Run log (gap threshold 4), condensed: f1=ball, g1=tall: LnGp>4 flags ball, b123, b134, b234, b13, ..., t13, t134, t123, tal. f2=vir11, g2=set16: none. f3=t34, g3=vir18: none. f4=t4, g4=b4: vir1, b4, b14. f4=t4, g4=vir1: none. This ends the process. We found only added anomalies, but we missed t34, t14, t4, t1, t3, b1 and b3. [Scatter plot comparing Method-3's f1, g1 with Method-2's f1=MinVector, g1=MaxVector omitted.] On the subclusters: f1=b13, g1=b2: none. f2=t2, g2=b2: set16, b2. f2=t2, g2=t234: t23, t234, t12, t24, t124, t2, ver11. f2=vir11, g2=b23: b12, b34, b124, b23, t13, b13. f2=vir11, g2=b12: set16, b24, b2, b12.

8 Meth-3.1 start: f1 ∈ Mx(SpS((M-x)o(M-x))), round gaps first, then linear gaps.

Run log (gap threshold 4), condensed: f1=ball RnGp>4 flags ball, b123, ..., t4, t34, t12, t23, t124, t234, t13, t134, t123, tal and splits off SubClus1 and SubClus2. In SubClus1: f1=b123 flags b13, vir32, vir18, b23, vir6; f1=b134 flags vir19, with ver49, ver8, ver44 and ver11 almost outliers (Subcluster2.2: which type? must classify); f1=b234 flags b34, vir10; f1=b124 flags b12, b14, b24; f1=vir19 flags t4, b2; g1=b2 flags t4, ver36; f2=ver13 flags ver43; g2=vir10 flags vir44; f4=b1 flags ver1; g4=b4 flags vir15. SubClus1 then has 91 samples, only versicolor and virginica. In SubClus2: f1=t14 flags t1, t14, ..., t3, t34; f1=set23 flags vir39, ver49, ver8, ver44, ver11, t24, t2; in SubClus2.1, g1=vir39 flags vir39, set21 and f2=set42 flags set9; the remaining round and linear runs (f2=set9, g2=set16, f3=set16, g3=set9, g1=set19, ...) find none. Note: what remains in SubClus2.1 is exactly the 50 setosa. But we wouldn't know that, so we continue to look for outliers and subclusters.

Commentary: Finally we would classify within SubCluster1 using the means of another training set (with FAUST Classify). We would also classify SubCluster2.1 and SubCluster2.2, and we would find SubCluster2.1 to be all setosa and SubCluster2.2 to be all versicolor (as we did before). In SubCluster1 we would separate versicolor from virginica perfectly (as we did before). We could FAUST Classify each outlier (if so desired) to find out which class they are outliers from. However, what about the rogue outliers I added? What would we expect? They are not represented in the training set, so what would happen to them? My thinking: they are real iris samples, so we should not really do the outlier analysis and subsequent classification on the original 150. We already know (assuming the "other training set" has the same means as these 150 do) that we can separate setosa, versicolor and virginica perfectly using FAUST Classify.

If this is typical (though concluding from one example is definitely "over-fitting"), then we have to conclude that Mark's round gap analysis is more productive than linear dot-product projection gap analysis! In 3.1, I computed SpS((M-x)o(M-x)) for f1 (expensive? grab any point? a corner point?), then computed SpS((x-f1)o(x-f1)) for f1-round-gap analysis, then computed SpS(xod1) to get g1 (and for d1 linear gap analysis). Too expensive? The gk-round-gap and linear analyses contributed very little, but we need them to get f2, etc. Are there other, cheaper ways to get a good f2? We would also need SpS((x-g1)o(x-g1)) for g1-round-gap analysis (too expensive!).

9 Meth-2.1 on IRIS: f and g are opposite corners of the X-circumscribing box: f ≡ MinVecX ≡ (minXx1, ..., minXxn), g ≡ MaxVecX ≡ (maxXx1, ..., maxXxn), d ≡ (g-f)/|g-f|. Sequence thru the main diagonal pairs {f, g} lexicographically; for each, create d.
2.1.a Do SpS((x-f)o(x-f)) round gap analysis.
2.1.b Do SpS((x-g)o(x-g)) round gap analysis.
2.1.c Do SpS(xod) linear gap analysis.
Notes: No calculation is required to find f and g (assuming MaxVecX and MinVecX have been calculated and residualized when pTreeSetX was captured). 2.1.c (and 2.1.b?) may be unproductive in finding new subclusters/anomalies (either because 2.1.a finds almost all, or because 2.1.b and/or 2.1.c find the same ones) and could be skipped. This is very likely if the dimension is high, since the main-diagonal corners are typically far from X in a high-dimensional vector space; thus the radii of round gaps are large, and large-radius round gaps are nearly linear, suggesting 2.1.a will find all the subclusters that 2.1.b and 2.1.c would find.

Run log (gap threshold 4), condensed: f1=MnVec RnGp>4: none. g1=MxVec RnGp>4 splits off SubClus1 (vir18, ..., ver30) from SubClus2 (ver49, ..., set14). Within each subcluster, the corner pairs f1=0000/g1=1111 through f8=0111/g8=1000 find almost nothing: in SubClus2, f6=0101 RnGp>4 flags set26 and separates ver49, set42, ver8, set36, ver44, ver11 (Subc2.1), ending SubClus2 = 47 setosa only; in SubClus1, f7=0110 RnGp>4 flags ver13, vir49, ending SubClus1 = 95 ver and vir samples only; all the linear gap analyses find none.

Final notes: Clearly 2.1.b is very productive in this example! Without it we would not have separated setosa from versicolor+virginica! But 2.1.c was unproductive. This suggests that it is productive to calculate 2.1.a and 2.1.b but that, having done so, 2.1.c will probably not be productive. Next we consider doing only 2.1.c, to see if it is as productive as 2.1.a and 2.1.b.

10 Meth-2.1c on IRIS: f and g are opposite corners of the X-circumscribing box: f ≡ MinVecX ≡ (minXx1, ..., minXxn), g ≡ MaxVecX ≡ (maxXx1, ..., maxXxn), d ≡ (g-f)/|g-f|. Sequence thru the main diagonal pairs {f, g} lexicographically; for each, create d and do only 2.1.c, the SpS(xod) linear gap analysis.

Run log (gap threshold 4), condensed: f0000=MinVec, g1111=MaxVec LinGp>4 splits set14, ..., ver11 from ver30, ..., vir18: looks like the same split as before! Within SubClus1, f0011/g1100 separates SubClus1.1 = {ver11, ver44, ver8, ver49}; the class means Mset, Mver, Mvir show SubClus1.1 ≈ versicolor, and none are outliers. Within SubClus2, f0100/g1011 flags set21 and vir39. All other main-diagonal pairs find none. SubClus1.2 is exactly the 50 setosa; SubClus2 is the 46 remaining versicolor and the 49 remaining virginica.

FAUST Classifier (midpoint-of-means version) mis-classifies 5 of the versicolor and 5 of the virginica of SubCluster-2 (and it mis-classifies vir39 as versicolor). So it is 93% accurate overall (100% on setosa, 89% on versicolor and 90% on virginica).

So 2.1.c is as good as the combination of 2.1.a and 2.1.b (projection on d appears to be as accurate as the combination of the square lengths from f and from g). This is probably because the round gaps (centered at the corners) are nearly linear by the time they reach the set X itself. To compare the time costs, we note: for 2.1.a and 2.1.b, (p-x)o(p-x) = pop + xox - 2xop = pop + Σk=1..n xk2 + Σk=1..n (-2pk)xk has n multiplications in the second term and n scalar multiplications and n additions in the third term; doing this for both p=f and p=g then takes 2n multiplications, 2n scalar multiplications and 2n additions. For 2.1.c, xod = Σk=1..n dkxk involves only n scalar multiplications and n additions. So it appears to be much cheaper (timewise).
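The cost claim is easy to see in code. A sketch contrasting the two scalar sets (illustrative names; with pTrees each term becomes a horizontal bit-slice computation, but the per-point operation counts are the same):

import numpy as np

def squared_dists_from(p, X):
    # 2.1.a / 2.1.b: (p-x)o(p-x) = pop + xox - 2xop, done for both p=f and p=g
    pop = p @ p
    xox = (X ** 2).sum(axis=1)   # n multiplications per point
    xop = X @ p                  # n scalar multiplications + n additions per point
    return pop + xox - 2 * xop

def projections(d, X):
    # 2.1.c: xod only -- n scalar multiplications + n additions, no xox term
    return X @ d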

11 Thin interval finder on the fM line, using the scalar pTreeSet PTreeSet(xofM) (the pTree slices of these projection lengths), looking for Width24_Count=1 thin intervals (W24_C1_TIs) or W16_C1_TIs. [Worked example: a 15-point 2D dataset z1, ..., zf with medoid M; the slide shows the coordinate table, the xofM column (11, 27, 23, 34, 53, 80, 118, 114, 125, 110, 121, 109, 83, ...), and the bit slices p6, ..., p0 and their complements p6', ..., p0', which are ANDed to produce the interval counts.]

W=24, C=1, interval [0,16): z1ofM=11 is 5 units from 16, so z1 is not declared an anomaly.
W=24, C=1, interval [32,48): z4ofM=34 is within 2 of 32, so z4 is not declared an anomaly.
W=24, C=1, interval [48,64): z5ofM=53 is 19 from z4ofM=34 and 11 from 64; but the next interval, [64,80), is empty, and 53 is 27 from 80 (>24), so z5 is an anomaly and we make a cut through z5.
W=24, C=0, interval [64,80): ordinarily we cut thru the midpoint of C=0 intervals, but in this case it's unnecessary since it would duplicate the z5 cut just made.

Here we started with xofM distances. The same process works starting with any distance-based ScalarPTreeSet, e.g., xox, etc.
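Stripped of the pTree slice arithmetic, the thin-interval scan is a fixed-width histogram. A minimal sketch (my own names; the slide's additional rule, that a count-1 interval only yields an anomaly when the point is also far from the neighboring occupied intervals, is left to the caller):

import numpy as np

def thin_intervals(proj, width=16, count=1):
    # Bucket projection lengths into fixed-width intervals (here via numpy;
    # on pTrees this is ANDs of high-order bit slices and their complements)
    # and flag intervals whose point count is <= `count` as anomaly candidates.
    edges = np.arange(proj.min(), proj.max() + width, width)
    counts, _ = np.histogram(proj, bins=edges)
    thin = [(edges[i], edges[i + 1]) for i, c in enumerate(counts) if c <= count]
    return thin, counts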

12 Method-1 restated. Find a furthest point from M, f0 ∈ MaxPt[SpS((x-M)o(x-M))].
F: X → R is any distance-dominated functional (= ScalarPTreeSet(x, F(x))) such that |F(x)-F(y)| ≤ dis(x,y), for gap-based FAUST machine teaching. E.g., the dot product with any fixed vector v (gaps in the projections along the line generated by the vector). E.g., use the vectors: fM, fM/|fM|, or in general a(fM), a constant (where M is a medoid (mean or vector of medians) and f is a "furthest point" from M); fF, fF/|fF|, or in general a(fF), a constant (where F is a "furthest point" from f); or ek ≡ (0, ..., 0, 1, 0, ..., 0) (1 in the kth position). But also, if one takes the ScalarPTreeSet(x, xox) of square vector lengths (or just lengths), the gaps are rounded gaps as one proceeds out from the origin. One can note that this is just the column of xox values, so it is dot-product generated also.

Find a furthest point from M, f0 ∈ MaxPt[SpS((x-M)o(x-M))]. Do f0 round gap analysis on SpS((x-f0)o(x-f0)) to identify/eliminate anomalies (repeating if f0 is eliminated). d0 ≡ (M-f0)/|M-f0|. Find a furthest point from f0, f1 ∈ MaxPt[SpS((x-f0)o(x-f0))]. d1 ≡ (f1-f0)/|f1-f0|. Do f1 round gap analysis { SpS((x-f1)o(x-f1)) } to identify/eliminate anomalies on the f1 end (repeating if f1 is eliminated). Do d1 linear gap analysis (SpS((x-f0)od1)).

X' ≡ d1⊥ ≡ the space perpendicular to d1. The projection of x-f0 onto X' is x' ≡ (x-f0) - ((x-f0)od1)d1, so x'ox' = (x-f0)o(x-f0) - ((x-f0)od1)2 and SpS(x'ox') = SpS[(x-f0)o(x-f0)] - SpS[((x-f0)od1)2]. For each subcluster, find f2 ∈ MaxPt[SpSSubCluster(x'ox')] and d2 ≡ f2'/|f2'| = [(f2-f0) - ((f2-f0)od1)d1]/|f2'|. In general, x(k) ≡ x(k-1) - (x(k-1)odk)dk, where fk ∈ MaxPtSubCluster[SpS(x(k-1)ox(k-1))] and dk ≡ fk'/|fk'|. Do fk round gap analysis { SpS[(x-fk)o(x-fk)] } to identify/eliminate anomalies on the fk end (repeating if fk is eliminated). Do dk linear gap analysis { SpS[(x-f0)odk] } to separate sub-clusters.

13 APPENDIX: FAUST = Fast, Accurate Unsupervised and Supervised Teaching (teaching big data to reveal info).

FAUST CLUSTER-fmg (furthest-to-mean gaps, for finding round clusters): C = X (e.g., X ≡ {p1, ..., pf}, the 15-pixel dataset shown on the slide: interlocking horseshoes with an outlier). While an incomplete cluster C remains: find M ≡ Medoid(C) (mean or vector of medians or ?); pick f ∈ C furthest from M, from S ≡ SPTreeSet(D(x,M)) (e.g., HOBbit furthest f: take any point from the highest-order S-slice); if ct(C)/dis2(f,M) > DT (DensThresh), C is complete; else split C where P ≡ PTreeSet(cofM/|fM|) has a gap > GT (GapThresh). End while.
Notes: a. Euclidean or HOBbit furthest. b. fM/|fM| or just fM in P. c. Find gaps by sorting P, or by an O(log n) pTree method?

Worked example: f1=p3, so C1 doesn't split (complete). C2={p5} is complete (singleton = outlier). C3={p6,pf} will split (details omitted), so {p6} and {pf} are complete (outliers). That leaves C1={p1,p2,p3,p4} and C4={p7,p8,p9,pa,pb,pc,pd,pe} still incomplete. C1 is dense (density(C1) ≈ .5 > DT = .3?), thus C1 is complete. Applying the algorithm to C4: {pa} is an outlier, and further splits give {p9} and {pb,pc,pd} complete. In both cases those probably are the best "round" clusters, so the accuracy seems high. The speed will be very high! [The slide's coordinate table and D(x,M0) column are omitted.]
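A minimal sketch of CLUSTER-fmg, assuming arrays in place of pTrees, the Euclidean furthest point, a sort-based gap finder, and (my assumption, not stated on the slide) that a cluster with no projection gap > GT is also declared complete:

import numpy as np

def faust_cluster_fmg(X, DT=0.3, GT=4.0):
    clusters, work = [], [X]
    while work:
        C = work.pop()
        M = C.mean(axis=0)                           # Medoid(C): mean stands in
        dist = np.sqrt(((C - M) ** 2).sum(axis=1))
        i = int(np.argmax(dist))                     # f: furthest from M
        if dist[i] == 0 or len(C) / dist[i] ** 2 > DT:   # ct(C)/dis2(f,M) > DT
            clusters.append(C)                       # dense enough: complete
            continue
        d = (C[i] - M) / dist[i]                     # fM/|fM|
        proj = C @ d                                 # P: projections on the fM line
        order = np.argsort(proj)
        gap_at = np.where(np.diff(proj[order]) > GT)[0]
        if len(gap_at) == 0:
            clusters.append(C)                       # no gap > GT: complete
        else:
            cut = gap_at[0] + 1                      # split at the first wide gap
            work += [C[order[:cut]], C[order[cut:]]]
    return clusters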

14 FAUST Oblique: PR = P(Xod)<a, where the d-line D ≡ mRmV is the oblique vector and d = D/|D|.

Separate classR and classV using the midpoint-of-means (mom) method: view mR and mV as vectors (mR ≡ the vector from the origin to the point mR) and calculate a = (mR + (mV-mR)/2)od = ((mR+mV)/2)od. (The very same formula works when D = mVmR, i.e., points to the left.)

Training ≡ choosing the "cut hyperplane" (CHP), which is always an (n-1)-dimensional hyperplane (which cuts space in two). Classifying is one horizontal program (AND/OR) across pTrees to get a mask pTree for each entire class (bulk classification).

Improve accuracy? E.g., by considering the dispersion within classes when placing the CHP:
1. Use the vector of medians, vom, to represent each class, rather than mV: vomV ≡ (median{v1 | v∈V}, median{v2 | v∈V}, ...).
2. Project each class onto the d-line (e.g., the R class in the slide's figure); then calculate the std of these distances from the origin along the d-line (one horizontal formula per class, using Md's method); then use the std ratio to place the CHP (no longer at the midpoint between mR [vomR] and mV [vomV]).

[Figure: R-class and V-class points in dim 1 x dim 2, projected onto the d-line, with mR, mV, vomR and vomV marked, omitted.]
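The mom training step is one dot product and one scalar. A sketch with illustrative names (the vom and std-ratio refinements above would only change how a is placed along d):

import numpy as np

def faust_oblique_mom(R, V):
    # d = (mV - mR)/|mV - mR|; cut point a = ((mR + mV)/2)od.
    mR, mV = R.mean(axis=0), V.mean(axis=0)
    d = (mV - mR) / np.linalg.norm(mV - mR)
    a = ((mR + mV) / 2) @ d
    return d, a

def classify(X, d, a):
    # One comparison per point; on pTrees this mask is computed horizontally
    # across bit slices for the whole class at once (bulk classification).
    return np.where(X @ d < a, "R", "V")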

15 Research topics, scored (Current_Relevancy_Score / Killer_Idea_Score):

1. MapReduce FAUST: 9 / 2. Nothing comes to mind as to what we would do here. MapReduce/Hadoop is a key-value approach to organizing complex BigData. In FAUST PREDICT/CLASSIFY we start with a training TABLE, and in FAUST CLUSTER/ANOMALIZER we start with a vector space. Mark suggests (my understanding) capturing pTreeBases as Hadoop/MapReduce key-value bases? I suggested to Arjun developing XML to capture Hadoop datasets as pTreeBases. The former is probably wiser. A wish list of great things that might result would be a good start.
2. pTree Text Mining: 10 / 9. I think Oblique FAUST is the way to do this. Also there is the very new idea of capturing the reading sequence, not just the term-frequency matrix (lossless capture), of a corpus.
3. FAUST CLUSTER/ANOMALIZER: 9 / 9. No one has taken up the proof that this is a breakthrough method. The applications are unlimited!
4. Secure pTreeBases: 9 / 10. This seems straightforward and a certainty (to be a killer advance)! It would involve becoming the world expert on what data security really means and how it has been done by others, and then comparing our approach to theirs. Truly a complete career is waiting for someone here!
5. FAUST PREDICTOR/CLASSIFIER: 9 / ?. No one has done a complete analysis of this breakthrough method. The applications are unlimited here too!
6. pTree Algorithmic Tools: 10 / ?. This is Md's work. Expanding the algorithmic tool set to include quadratic tools and even higher-degree tools is very powerful. It helps us all!
7. pTree Alternative Algorithm Implementations: 9 / ?. This is Bryan's work. Implementing pTree algorithms in hardware/firmware (e.g., FPGAs): orders-of-magnitude performance improvement?
8. pTree O/S Infrastructure: 10 / ?. This is Matt's work. I don't yet know the details, but Matt, under the direction of Dr. Wettstein, is finishing up his thesis on this topic (such changes as very large page sizes, cache sizes, prefetching, ...). I give it a 10/10 because I know the people; they do double-digit work always!

From: Sent: Thurs, Aug. Dear Dr. Perrizo, do you think a MapReduce class of FAUST algorithms could be built into a thesis? If the ultimate aim is to process big data, modification of the existing pTree-based FAUST algorithms on the Hadoop framework could be something to look at? I am myself not sure how far I can go, but if you approve, then I can work on it.

From: Mark, to: Arjun, Aug 9. From an industry perspective, Hadoop is king (at least at this point in time). I believe vertical data organization maps really well to a map/reduce approach; these are complementary, as Hadoop is organized more for unstructured data, so these topics are not mutually exclusive. So from the industry side I'd vote Hadoop... from the Treeminer side, text (although we are very interested in both).

From: Sent: Friday, Aug 10. I'm working thru a list of what we need to get done; it will include implementing anomaly detection, which has been on my list for some time. I tried to establish a number of things such that, even if we had some difficulties with some parts, we could show others (w/o digging us too deep).
Once I get this, I'll get a call going. I have another programming resource down here, who's been working with me on our production code, who will also be picking up some of the work to get this across the finish line; and I also have someone who was previously a director at our customer assisting us in packaging it all up, so the customer will perceive value received... I think Dale sounded happy yesterday.

