
1 Find a furthest point from M, f0 = MaxPt[SpS((x-M)o(x-M))].
Do M round gap analysis using SpS((x-M)o(x-M)). Do f0 round gap analysis using SpS((x-f0)o(x-f0)).
d0 ≡ (M-f0)/|M-f0|. Do d0 linear gap analysis on SpS((x-f0)od0).
Find a furthest point from f0, f1 ≡ MaxPt[SpS((x-f0)o(x-f0))]. d1 ≡ (f1-f0)/|f1-f0|.
Do d1 linear gap analysis on SpS((x-f0)od1). Do f1 round gap analysis on SpS((x-f1)o(x-f1)).
Let X' ≡ the space perpendicular to d1. The projection of x-f0 onto d1 is ((x-f0)od1)d1, so the perpendicular component is x' ≡ (x-f0) - ((x-f0)od1)d1, and
x'ox' = [(x-f0) - ((x-f0)od1)d1] o [(x-f0) - ((x-f0)od1)d1] = (x-f0)o(x-f0) - ((x-f0)od1)², so
SpS(x'ox') = SpS[(x-f0)o(x-f0)] - SpS[((x-f0)od1)²].
Let f2 ≡ MaxPt[SpS(x'ox')] and d2 ≡ f2'/|f2'| = [(f2-f0) - ((f2-f0)od1)d1]/|f2'|.
Then d2od1 = [(f2-f0)od1 - ((f2-f0)od1)(d1od1)]/|f2'| = 0.
x'' ≡ x' - (x'od2)d2, so x''od1 = x'od1 - (x'od2)(d2od1) = (x-f0)od1 - ((x-f0)od1)(d1od1) = 0 and x''od2 = x'od2 - (x'od2)(d2od2) = 0.
In general, x(k) ≡ x(k-1) - (x(k-1)odk)dk, where fk ≡ MaxPt[SpS(x(k-1)ox(k-1))] and dk ≡ fk'/|fk'|. {dk} forms an orthonormal basis.
Do fk round gap analysis on SpS[(x-fk)o(x-fk)]. Do dk linear gap analysis on SpS[(x-f0)odk].
Single-coordinate gap analysis: e.g., for the coordinate-k values, d = ek ≡ (0,...,0,1,0,...,0) (1 in the kth coordinate).
We note that linear gap analysis includes all of the following. Gap analysis using:
- a linear combination a1,...,an of coordinate values (i.e., xo(a1,...,an));
- a linear combination a1,...,an of coordinate values starting at p (i.e., (x-p)o(a1,...,an));
- a linear combination of squared coordinate values (i.e., Σi=1..n ai(x-p)i²) (length² is a sub-case);
- a linear combination of cubed values (Σi=1..n ai(x-p)i³);
- any Taylor series truncated at N, e.g., of the form Σk=1..N bk(Σi=1..n ai(x-p)i^k);
- the square gradient-like length, Σi=1..n MaxVal(x(k-1)ox(k-1));
- the x(k-1)-MaxLength itself, MaxVal(x(k-1)ox(k-1)).
We can, of course, also do each of these defining dk+1 ≡ (fk+1-f0)/|fk+1-f0| rather than dk+1 ≡ fk+1'/|fk+1'|.
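The construction above can be read as a furthest-point variant of Gram-Schmidt. A minimal NumPy sketch, assuming X is an n-by-d array of row vectors and using the mean as the medoid; the names find_gaps, furthest_point_directions and gap_threshold are illustrative, not part of the pTree (SpS/PTreeSet) implementation.

import numpy as np

def find_gaps(values, gap_threshold):
    """Return (left, right) value pairs where consecutive sorted values differ by more than gap_threshold."""
    v = np.sort(np.asarray(values, dtype=float))
    jumps = np.diff(v)
    return [(v[i], v[i + 1]) for i in np.where(jumps > gap_threshold)[0]]

def furthest_point_directions(X, n_dirs, gap_threshold=4.0):
    X = np.asarray(X, dtype=float)
    M = X.mean(axis=0)                               # medoid approximated here by the mean
    f0 = X[np.argmax(((X - M) ** 2).sum(axis=1))]    # f0 = MaxPt[SpS((x-M)o(x-M))]
    resid = X - f0                                   # x(0) = x - f0
    dirs, dir_gaps = [], []
    for _ in range(n_dirs):
        fk = resid[np.argmax((resid ** 2).sum(axis=1))]   # fk = MaxPt[SpS(x(k-1)ox(k-1))]
        norm = np.linalg.norm(fk)
        if norm == 0:                                # no residual variation left
            break
        dk = fk / norm                               # dk = fk'/|fk'| (already perpendicular to earlier d's)
        dirs.append(dk)
        dir_gaps.append(find_gaps((X - f0) @ dk, gap_threshold))   # dk linear gap analysis on SpS((x-f0)odk)
        resid = resid - np.outer(resid @ dk, dk)     # x(k) = x(k-1) - (x(k-1)odk)dk
    return f0, np.array(dirs), dir_gaps

Because each residual has its projections onto the earlier directions subtracted out, the returned directions are mutually orthonormal, matching the {dk} basis claim above.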

2 [Table: the original IRIS samples (SL, SW, PL, PW) listed by class - setosa (set), versicolor (ver), virginica (vir) - followed by the added tiny tuples (t, ..., tall) and big tuples (b, ..., ball), with the MINS, MAXS and MEAN rows as they stand before adding the new tuples; the means are essentially the same after the additions.]

3 [Abbreviated version of the same table: SL, SW, PL, PW for setosa through virginica plus the 30 new tuples (t for tiny, b for big), with MINS, MAXS and MEAN before the additions; incidentally, the means remain very nearly the same after the additions.]

4 Find a furthest point from M, f0 ≡ MaxPt[SpS((x-M)o(x-M))].
Do f0 round gap analysis {SpS((x-f0)o(x-f0))} to identify/eliminate anomalies on the f0 end (repeating if f0 is eliminated). d0 ≡ (M-f0)/|M-f0|.
Find a furthest point from f0, f1 ≡ MaxPt[SpS((x-f0)o(x-f0))]. d1 ≡ (f1-f0)/|f1-f0|.
Do f1 round gap analysis {SpS((x-f1)o(x-f1))} to identify/eliminate anomalies on the f1 end (repeating if f1 is eliminated).
Do d1 linear gap analysis {SpS((x-f0)od1)} to separate sub-clusters.
Let X' ≡ the space perpendicular to d1. The projection of x-f0 onto d1 is ((x-f0)od1)d1, so x' ≡ (x-f0) - ((x-f0)od1)d1 and
x'ox' = (x-f0)o(x-f0) - ((x-f0)od1)², so SpS(x'ox') = SpS[(x-f0)o(x-f0)] - SpS[((x-f0)od1)²].
For each sub-cluster, find f2 ≡ MaxPt[SpS_SubCluster(x'ox')] and d2 ≡ f2'/|f2'| = [(f2-f0) - ((f2-f0)od1)d1]/|f2'|.
In general, x(k) ≡ x(k-1) - (x(k-1)odk)dk, where fk ≡ MaxPt_SubCluster[SpS(x(k-1)ox(k-1))] and dk ≡ fk'/|fk'|.
Do fk round gap analysis {SpS[(x-fk)o(x-fk)]} to identify/eliminate anomalies on the fk end (repeating if fk is eliminated).
Do dk linear gap analysis {SpS[(x-f0)odk]} to separate sub-clusters.
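A minimal sketch of one round-gap elimination pass as read from this slide: sort the distances from the furthest point f, find the first gap larger than the threshold, and flag the points isolated on the f side of that gap as anomalies. The names round_gap_outliers and gap_threshold are illustrative, not the pTree implementation.

import numpy as np

def round_gap_outliers(X, f, gap_threshold=4.0):
    d = np.sqrt(((np.asarray(X, dtype=float) - f) ** 2).sum(axis=1))   # distances from f
    order = np.argsort(d)
    jumps = np.diff(d[order])
    big = np.where(jumps > gap_threshold)[0]
    if big.size == 0:
        return np.array([], dtype=int)       # no gap > threshold: nothing to eliminate on the f end
    first_gap = big[0]
    return order[: first_gap + 1]            # indices of points isolated on the f side of the first gap

# Usage sketch: outliers = round_gap_outliers(X, X[np.argmax(((X - X.mean(0)) ** 2).sum(1))])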

5 [Output listing: successive round-gap analyses with gap threshold 4 on the augmented IRIS data. Each column fixes the previously chosen furthest points (e.g., f0=ball; f0=b24 with f1=t2, f1=set42, f1=set14; then f2=vir19, f3=set14, f4=vir19) and lists the sorted distance values at which gaps > 4 occur together with the tuples on either side of each gap; "none" marks rounds with no gap > 4. Further columns repeat the analysis starting from f0 = t234, b123, b134, b234, b12, b14, b24 and vir19.]

6 [Round-gap (>4) distance tables for the b123-to-t4 sub-cluster 2, for f0 = ball, b124, b234, b134, b3 and vir19, listing the distances among the remaining b, vir and t tuples (e.g., vir32, b13, vir18, b23).]
We see that all distances are > 4, so all are outliers.
Move to f1. We would like d to be perpendicular to f0=ball=(90,60,80,40) (or should f0=vir19 be used?). Let's take f0=ball. Since f0=ball has a non-zero e1-component, Gram-Schmidt the basis {f0=ball=(90,60,80,40), e2=(0,1,0,0), e3=(0,0,1,0), e4=(0,0,0,1)} to obtain an orthonormal basis {f1,f2,f3,f4}.

7 Let f1 ≡ MaxPt(SpS(x∈X, xox)) and d1 ≡ f1/|f1|. If d1's e1-component is non-zero, Gram-Schmidt the basis {d1, e2=(0,1,0,0), e3=(0,0,1,0), e4=(0,0,0,1)}, giving an orthonormal basis {d1,d2,d3,d4}:
d1 ≡ f1 / |f1|
d2 ≡ (e2 - (e2od1)d1) / |e2 - (e2od1)d1|;  f2 ≡ MaxPt(SpS(x∈X-{f1}, xod2))
d3 ≡ (e3 - (e3od1)d1 - (e3od2)d2) / |e3 - (e3od1)d1 - (e3od2)d2|;  f3 ≡ MaxPt(SpS(x∈X-{f1,f2}, xod3))
d4 ≡ (e4 - (e4od1)d1 - (e4od2)d2 - (e4od3)d3) / |e4 - (e4od1)d1 - (e4od2)d2 - (e4od3)d3|;  f4 ≡ MaxPt(SpS(x∈X-{f1,f2,f3}, xod4))
[Table: for f1=tall, a round-gap (>4) listing of the tuples t12, b124, t23, ball, set24, t134, t123, t13, t234 and tall with their DISTANCE and DIS-to-up columns.]

8 [Table/plot: points p with coordinates x, y, plotted with the Mean M and the VOM=(34,35) marked; the axis scale runs from 5 to 100.]
No gaps (ct=0 intervals) on the furthest-to-Mean line, but 3 ct=1 intervals. Declare p = p12, p16, p18 an anomaly if pofM is far enough from the boundary points of its interval?
Round 2 is straightforward. So: 1. Given gaps, find the ct=k intervals. 2. Find good gaps (dot product with a constant vector for linear gaps? For rounded gaps, use xox?).
Note: in this example, the VOM works better than the mean.
If the data is shifted, length doesn't work: xofM is independent of point placement with respect to the origin, while length-based gapping is dependent on it.
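A small numeric check of that last claim (illustrative only): gaps in the projections onto the furthest-to-Mean direction are unchanged when the whole data set is translated, while gaps in the raw vector lengths are not.

import numpy as np

X = np.array([[1.0, 1.0], [2.0, 1.0], [9.0, 1.0]])   # toy data with one obvious gap
shift = np.array([-5.0, 0.0])

def proj_gaps(X):
    M = X.mean(axis=0)
    f = X[np.argmax(((X - M) ** 2).sum(axis=1))]      # furthest point from the mean
    d = (M - f) / np.linalg.norm(M - f)
    return np.diff(np.sort((X - f) @ d))              # consecutive gaps in the projections

def length_gaps(X):
    return np.diff(np.sort(np.sqrt((X ** 2).sum(axis=1))))   # consecutive gaps in the lengths

print(proj_gaps(X), proj_gaps(X + shift))     # identical gap structure under the shift
print(length_gaps(X), length_gaps(X + shift)) # the large length gap disappears after the shift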

9 Thin interval finder on the fM line using the scalar pTreeSet PTreeSet(xofM) (the pTree slices of these projection lengths), looking for Width24_Count=1 thin intervals (or W16_C1_TIs).
[Figure/table: the points z1, ..., zf plotted with M, their coordinates x1, x2, the projection values xofM = 11, 27, 23, 34, 53, 80, 118, 114, 125, 110, 121, 109, 83, ..., and the bit slices p6...p0 with complements p6'...p0' whose ANDs give the interval counts.]
W=24 C=1 interval [0,16): z1ofM=11 is 5 units from 16, so z1 is not declared an anomaly.
W=24 C=1 interval [32,48): z4ofM=34 is within 2 of 32, so z4 is not declared an anomaly.
W=24 C=1 interval [48,64): z5ofM=53 is 19 from z4ofM=34 and 11 from 64; the next interval, [64,80), is empty and z5 is 27 from 80 (>24), so z5 is declared an anomaly and we make a cut through z5.
W=24 C=0 interval [64,80): ordinarily we cut through the midpoint of C=0 intervals, but in this case it is unnecessary since it would duplicate the z5 cut just made.
Here we started with xofM distances. The same process works starting with any distance-based ScalarPTreeSet, e.g., xox, etc.
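A minimal sketch of a thin-interval finder under one assumed reading of this slide: bucket the projection values into fixed-width intervals and flag the point in any count-1 interval as an anomaly when it is isolated from its neighbors by more than the threshold on some side. With width 16 and isolation 24 it reproduces the z1/z4/z5 decisions above; the helper name and logic are illustrative, not the pTree-slice implementation.

import numpy as np

def thin_interval_anomalies(xofM, width=16, isolation=24):
    xofM = np.asarray(xofM, dtype=float)
    bins = (xofM // width).astype(int)
    anomalies = []
    for i, v in enumerate(xofM):
        if np.sum(bins == bins[i]) != 1:          # only count-1 ("thin") intervals are candidates
            continue
        others = np.delete(xofM, i)
        below = others[others < v]
        above = others[others > v]
        gap_lo = v - below.max() if below.size else 0.0   # extremes are not flagged by their open side here
        gap_hi = above.min() - v if above.size else 0.0
        if max(gap_lo, gap_hi) > isolation:       # isolated by more than the threshold on some side
            anomalies.append(i)
    return anomalies

# Usage with the slide's projection values (flags only the value 53, i.e., z5):
# thin_interval_anomalies([11, 27, 23, 34, 53, 80, 118, 114, 125, 110, 121, 109, 83])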

10 Defining gaps: any ScalarPTreeSet where the scalar is a distance can be used for gap-based FAUST Clustering / Anomaly Detection or FAUST Classification.
Certainly the dot product with any fixed vector works (gaps in the projections along the line generated by the vector). E.g., use the vectors:
fM; fM/|fM|; or in general a*fM for a constant a (where M is a medoid (mean or vector of medians) and f is a "furthest point" from M).
fF; fF/|fF|; or in general a*fF for a constant a (where F is a "furthest point" from f).
ek = (0,...,0,1,0,...,0) (1 in the kth position).
V1 = (-b, a, 0, ..., 0), where V = (a, b, c, d, e, ...) is any one of the vectors above (gives us a vector orthogonal to V).
V2 = (a, b, C, 0, ..., 0), where C = -(a²+b²)/c (a vector orthogonal to V and to V1); etc. (The Vk, together with V, form an orthogonal basis.)
But also, if one takes the ScalarPTreeSet of all vector lengths (or squares of lengths, to make it easy), that is also a ScalarPTreeSet, and the gaps are radial gaps as one proceeds out from the origin. One can note that this is just the column of xox values, so it is dot-product generated also.
If one takes just the ScalarPTreeSet of all ith coordinate values (V = ei above), that works as well; in this case we get gaps in the value distribution of the ith coordinates. This was used, for instance, in coordinate-wise (non-Oblique) FAUST.
[Table residue: a PTreeSet of the projection values 11, 27, 23, 34, 53, 80, 118, 114, 125, 110, 121, 109, 83 and their bit slices p6...p0 with complements p6'...p0'.]
More generally, take a fixed vector y0: the ScalarPTreeSet (SpS) of all vector lengths (or squares of lengths) of the vectors x-y0 is also a ScalarPTreeSet that works, and the gaps are radial gaps as one proceeds out from the point y0. Note that this is just the column of (x-y0)o(x-y0) values.
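A minimal sketch (illustrative names) of the two families of gap generators listed above: a linear generator that projects onto a fixed vector, and a radial generator that uses squared lengths of x-y0. Both return the same kind of scalar column, so a single gap finder serves them all; fM and y0 in the usage lines stand for vectors defined earlier on this slide.

import numpy as np

def linear_scalars(X, v):
    return np.asarray(X, dtype=float) @ v            # SpS(x o v): projections along v

def radial_scalars(X, y0):
    diff = np.asarray(X, dtype=float) - y0
    return (diff * diff).sum(axis=1)                  # SpS((x-y0) o (x-y0)): squared radial distances

def gaps(scalars, threshold):
    s = np.sort(np.asarray(scalars, dtype=float))
    j = np.diff(s)
    return [(s[i], s[i + 1]) for i in np.where(j > threshold)[0]]

# Usage sketch:
#   gaps(linear_scalars(X, fM / np.linalg.norm(fM)), threshold=4)
#   gaps(radial_scalars(X, y0), threshold=4)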

11 APPENDIX: FAUST = Fast, Accurate Unsupervised and Supervised Teaching (teaching big data to reveal info).
FAUST CLUSTER-fmg (furthest-to-mean gaps, for finding round clusters): C = X (e.g., X ≡ {p1, ..., pf}, the 15-pixel dataset).
While an incomplete cluster C remains:
find M ≡ Medoid(C) (Mean or Vector_of_Medians or ...);
pick f∈C furthest from M from S ≡ SPTreeSet(D(x,M)) (e.g., HOBbit furthest f: take any point from the highest-order S-slice);
if ct(C)/dis²(f,M) > DT (DensityThreshold), C is complete;
else split C where a P ≡ PTreeSet(xofM/|fM|) gap > GT (GapThreshold).
End While.
Notes: a. Euclidean or HOBbit furthest. b. fM/|fM| or just fM in P. c. Find gaps by sorting P, or by an O(log n) pTree method?
[Figure: the 15-point "interlocking horseshoes with an outlier" dataset p1, ..., pf with coordinates x1, x2, the medoids M0, M1, M4, the clusters C1-C4, and the distance column D(x,M0) = 2.2, 3.9, 6.3, 5.4, 3.2, 1.4, 0.8, 2.3, 4.9, 7.3, 3.8, 3.3, 1.8, 1.5.]
C2={p5} is complete (a singleton = outlier). C3={p6,pf} will split (details omitted), so {p6} and {pf} are complete (outliers). That leaves C1={p1,p2,p3,p4} and C4={p7,p8,p9,pa,pb,pc,pd,pe} still incomplete.
C1 is dense (density(C1) = ct/dis² ≈ .5 > DT = .3?), thus C1 is complete; f1=p3, so C1 doesn't split.
Applying the algorithm to C4: {pa} is an outlier, and the remainder splits into {p9} and {pb,pc,pd}, which are complete.
In both cases those probably are the best "round" clusters, so the accuracy seems high. The speed will be very high!
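A minimal sketch of the CLUSTER-fmg loop just described (not the pTree implementation): recursively split a cluster at the widest projection gap along the furthest-point-to-mean line until every cluster passes the density test. DT follows the slide's value; GT and the single-widest-gap split are illustrative choices.

import numpy as np

def cluster_fmg(X, idx=None, DT=0.3, GT=2.5, out=None):
    if idx is None:
        idx, out = np.arange(len(X)), []
    C = np.asarray(X, dtype=float)[idx]
    M = C.mean(axis=0)                                    # medoid (mean variant)
    dists = np.sqrt(((C - M) ** 2).sum(axis=1))
    f = C[np.argmax(dists)]                               # furthest point from M
    dis = dists.max()
    if len(idx) == 1 or dis == 0 or len(idx) / dis**2 > DT:
        out.append(idx)                                   # dense (or singleton) -> complete cluster
        return out
    d = (f - M) / np.linalg.norm(f - M)
    proj = C @ d                                          # projections onto the fM line
    order = np.argsort(proj)
    jumps = np.diff(proj[order])
    if jumps.max() <= GT:
        out.append(idx)                                   # no gap > GT -> accept as is
        return out
    cut = np.argmax(jumps)                                # split at the widest gap
    cluster_fmg(X, idx[order[: cut + 1]], DT, GT, out)
    cluster_fmg(X, idx[order[cut + 1:]], DT, GT, out)
    return out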

12 FAUST Oblique: PR = P(Xod)<a. The d-line is generated by D ≡ mRmV, the oblique vector; d = D/|D|.
Separate classR and classV using the midpoint-of-means (mom) method: calculate a. Viewing mR and mV as vectors (mR ≡ the vector from the origin to the point mR),
a = (mR + (mV-mR)/2) o d = ((mR + mV)/2) o d.
(The very same formula works when D = mVmR, i.e., when d points to the left.)
Training ≡ choosing the "cut hyperplane" (CHP), which is always an (n-1)-dimensional hyperplane (which cuts the space in two). Classifying is one horizontal program (AND/OR) across pTrees to get a mask pTree for each entire class (bulk classification).
Improve accuracy? E.g., by considering the dispersion within classes when placing the CHP. Use:
1. the vector of medians, vom, to represent each class rather than mV: vomV ≡ (median{v1 | v∈V}, median{v2 | v∈V}, ...);
2. project each class onto the d-line (e.g., the R class below); then calculate the std (one horizontal formula per class, using Md's method); then use the std ratio to place the CHP (no longer at the midpoint between mR [vomR] and mV [vomV]).
[Figure: the R and V classes plotted in dim 1 / dim 2 with vomR, vomV and the d-line, and the std of the projected distances from the origin along the d-line.]
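A minimal sketch of the midpoint-of-means cut described above (illustrative, not the pTree/bulk-mask implementation): d points from the R mean toward the V mean, the cut value a is the midpoint of the means projected onto d, and x is labeled R when xod < a.

import numpy as np

def train_oblique_mom(R, V):
    mR, mV = np.asarray(R, float).mean(axis=0), np.asarray(V, float).mean(axis=0)
    D = mV - mR                                   # the oblique vector mRmV
    d = D / np.linalg.norm(D)
    a = (mR + mV) / 2 @ d                         # a = ((mR + mV)/2) o d
    return d, a

def classify_oblique(X, d, a):
    return np.where(np.asarray(X, float) @ d < a, "R", "V")   # PR = P(X o d) < a

# A dispersion-aware variant (the "std ratio" idea) would instead place a between
# mR o d and mV o d in proportion to the projected standard deviations, e.g.:
#   sR, sV = (R @ d).std(), (V @ d).std()
#   a = (mR @ d) * sV / (sR + sV) + (mV @ d) * sR / (sR + sV)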

13 1. MapReduce FAUST: Current_Relevancy_Score = 9, Killer_Idea_Score = 2.
Nothing comes to mind as to what we would do here. MapReduce/Hadoop is a key-value approach to organizing complex BigData. In FAUST PREDICT/CLASSIFY we start with a Training TABLE, and in FAUST CLUSTER/ANOMALIZER we start with a vector space. Mark suggests (my understanding) capturing pTreeBases as Hadoop/MapReduce key-value bases? I suggested to Arjun developing XML to capture Hadoop datasets as pTreeBases. The former is probably wiser. A wish list of great things that might result would be a good start.
2. pTree Text Mining: Current_Relevancy_Score = 10, Killer_Idea_Score = 9. I think Oblique FAUST is the way to do this. Also there is the very new idea of capturing the reading sequence, not just the term-frequency matrix (a lossless capture) of a corpus.
3. FAUST CLUSTER/ANOMALIZER: Current_Relevancy_Score = 9, Killer_Idea_Score = 9. No one has taken up the proof that this is a breakthrough method. The applications are unlimited!
4. Secure pTreeBases: Current_Relevancy_Score = 9, Killer_Idea_Score = 10. This seems straightforward and a certainty (to be a killer advance)! It would involve becoming the world expert on what data security really means and how it has been done by others, and then comparing our approach to theirs. Truly a complete career is waiting for someone here!
5. FAUST PREDICTOR/CLASSIFIER: Current_Relevancy_Score = 9, Killer_Idea_Score = . No one has done a complete analysis showing that this is a breakthrough method. The applications are unlimited here too!
6. pTree Algorithmic Tools: Current_Relevancy_Score = 10, Killer_Idea_Score = . This is Md's work. Expanding the algorithmic tool set to include quadratic tools and even higher-degree tools is very powerful. It helps us all!
7. pTree Alternative Algorithm Implementations: Current_Relevancy_Score = 9, Killer_Idea_Score = . This is Bryan's work. Implementing pTree algorithms in hardware/firmware (e.g., FPGAs) - orders of magnitude performance improvement?
8. pTree O/S Infrastructure: Current_Relevancy_Score = 10, Killer_Idea_Score = . This is Matt's work. I don't yet know the details, but Matt, under the direction of Dr. Wettstein, is finishing up his thesis on this topic - such changes as very large page sizes, cache sizes, prefetching, ... I give it a 10/10 because I know the people - they do double-digit work always!
From: Sent: Thurs, Aug. Dear Dr. Perrizo, Do you think a MapReduce class of FAUST algorithms could be built into a thesis? If the ultimate aim is to process big data, modification of existing P-tree based FAUST algorithms on the Hadoop framework could be something to look at? I am myself not sure how far I can go, but if you approve, then I can work on it.
From: Mark, to: Arjun, Aug 9. From an industry perspective, Hadoop is king (at least at this point in time). I believe vertical data organization maps really well with a map/reduce approach; these are complementary, as Hadoop is organized more for unstructured data, so these topics are not mutually exclusive. So from the industry side I'd vote Hadoop... from the Treeminer side, text (although we are very interested in both).
From: Sent: Friday, Aug 10. I'm working through a list of what we need to get done; it will include implementing anomaly detection, which has been on my list for some time now. I tried to establish a number of things such that even if we had some difficulties with some parts we could show others (w/o digging us too deep).
Once I get this I’ll get a call going.  I have another programming resource down here who’s been working with me on our production code who will also be picking up some of the work to get this across the finish line, and a have also someone who was a director at our customer previously assisting us in packaging it all up so the customer will perceive value received… I think Dale sounded happy yesterday.

