Research of William Perrizo, C.S. Department, NDSU


1 Research of William Perrizo, C.S. Department, NDSU
I data mine big data (big data ≡ trillions of rows and, sometimes, thousands of columns, which can complicate data mining trillions of rows). How do I do it? I structure the data table as [compressed] vertical bit columns (called "predicate Trees" or "pTrees") and process those pTrees horizontally, because processing across thousands of column structures is orders of magnitude faster than processing down trillions of row structures. As a result, some tasks that might have taken forever can be done in a humanly acceptable amount of time.

What is data mining? Largely it is classification (assigning a class label to a row based on a training table of previously classified rows). Clustering and Association Rule Mining (ARM) are important areas of data mining also, and they are related to classification. The purpose of clustering is usually to create [or improve] a training table. It is also used for anomaly detection, a huge area in data mining. ARM is used to data mine more complex data (relationship matrixes between two entities, not just single-entity training tables). Recommenders recommend products to customers based on their previous purchases or rents (or based on their ratings of items).

To make a decision, we typically search our memory for similar situations (near-neighbor cases) and base our decision on the decisions we (or an expert) made in those similar cases. We do what worked before (for us or for others); i.e., we let near-neighbor cases vote. But which neighbors vote? "The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information", by cognitive psychologist George A. Miller of Princeton University's Department of Psychology, published in Psychological Review, is one of the most highly cited papers in psychology. It argues that the number of objects an average human can hold in working memory is 7 ± 2 (called Miller's Law). Classification provides a better 7.

Some current pTree data mining research projects:
1. MapReduce FAUST (FAUST = Functional Analytic Unsupervised and Supervised machine Teaching): MapReduce and Hadoop are key-value approaches to organizing and managing big data. In FAUST CLASSIFY we start with a training table; in FAUST CLUSTER we start with a vector space.
2. pTree Text Mining: I am trying to capture the reading sequence, not just the term-frequency matrix (lossless capture), of a text corpus. Preliminary work on the term-frequency matrix suggests that attribute selection via simple standard deviations really helps (select the columns with high StD because of their separation potential).
3. FAUST CLUSTER/ANOMALASER: a method for finding anomalies very quickly.
4. Secure pTreeBases: anonymizing the identities of the individual pTrees and randomly padding them to mask their initial bit positions.
5. FAUST PREDICTOR/CLASSIFIER: the technology described above.
6. pTree Algorithmic Tools: an expanded algorithmic tool set is being developed to include quadratic tools and even higher-degree tools.
7. pTree Alternative Algorithm Implementations: implementing pTree algorithms in hardware (e.g., FPGAs) should result in orders-of-magnitude performance increases.
8. pTree O/S Infrastructure: computers and operating systems are designed to do logical operations (AND, OR, ...) rapidly. Exploit this for pTree processing speed.
9. pTree Recommenders: Singular Value Decomposition (SVD) recommenders, pTree near-neighbor recommenders, and pTree ARM recommenders.
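To make the structure concrete, here is a minimal Python toy of the vertical bit-slicing idea, with uncompressed Python ints standing in for pTrees (real pTrees are compressed and multi-level, so this is illustration only; all names are mine, not an actual pTree API):

def to_ptrees(column, nbits):
    """Slice a numeric column vertically into nbits bit columns (high bit first)."""
    ptrees = []
    for i in range(nbits - 1, -1, -1):
        bits = 0
        for row, v in enumerate(column):
            if (v >> i) & 1:
                bits |= 1 << row          # row j becomes bit j of the bit column
        ptrees.append(bits)
    return ptrees                         # ptrees[0] = most significant bit column

col = [5, 9, 12, 3, 14, 8]               # one attribute, 6 rows
pt = to_ptrees(col, 4)
print(bin(pt[0]))                         # mask of rows with value >= 8
print(bin(pt[0] & pt[1]))                 # one AND gives rows with value >= 12:
                                          # horizontal processing, no row loop

The point of the example: a selection over trillions of rows becomes a handful of bitwise ANDs across a few bit columns.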

2 FAUST clustering (the unsupervised part of FAUST)
This class of partitioning or clustering methods relies on choosing a functional, F (a mapping of each row of a dim=n table to a real number), which is distance dominated (i.e., the difference between any two functional values, |F(x) - F(y)|, is always ≤ the distance between x and y). The distance dominance of F implies that if we find a gap in the F-values, the two sets of points mapping to opposite sides of that gap are at least as far apart as the gap width.

Functionals we've used effectively:
The Coordinate Projection functionals (ej): check gaps in ej(y) ≡ yj.
The Square Distance functional (SD): check gaps in SDp(y) ≡ (y-p)o(y-p) (parameterized over a grid of p∈Rn).
The Dot Product Projection (DPP): check for gaps in DPPd(y) ≡ yod or DPPpq(y) ≡ (y-p)o(p-q)/|p-q| (parameterized over a grid of unit vectors d = (p-q)/|p-q| on the sphere in Rn).
The Dot Product Radius (DPR): check gaps in DPRpq(y) ≡ √(SDp(y) - DPPpq(y)²).
The Square Dot Product Radius (SDPR): SDPRpq(y) ≡ SDp(y) - DPPpq(y)² (easier pTree processing).

DPP-KM: 1. Check gaps in DPPp,d(y) (over grids of p and d?); check distances at any sparse extremes. 2. After several rounds of 1, apply k-means to the resulting clusters (when k seems to be determined).
DPP-DA: 1. Check gaps in DPPp,d(y) (grids of p and d?) against the density of the subcluster; check distances at sparse extremes against subcluster density. 2. Apply other methods once DPP ceases to be effective.
DPP-SD: Check gaps in DPPp,d(y) (over a p-grid and a d-grid) and in SDp(y) (over a p-grid); check sparse-end distances against subcluster density. (DPPpd and SDp share construction steps!)
SD-DPP-SDPR: DPPpq, SDp and SDPRpq share construction steps:
SDp(y) ≡ (y-p)o(y-p) = yoy - 2yop + pop
DPPpq(y) ≡ (y-p)od = yod - pod = (1/|p-q|)yop - (1/|p-q|)yoq - pod
Calculate yoy, yop, yoq concurrently? Then do the constant multiples 2*yop and (1/|p-q|)*yop concurrently; then add/subtract; then calculate DPPpq(y)² and subtract it from SDp(y).
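A small numpy sketch of the basic gap test (illustrative names and data, not the pTree implementation): project onto the unit vector through p and q and cut wherever consecutive projection values differ by at least a chosen width. Because DPP is distance dominated, each cut certifies that the two sides are at least that far apart.

import numpy as np

def dpp_gaps(X, p, q, min_gap):
    d = (q - p) / np.linalg.norm(q - p)
    F = (X - p) @ d                        # DPP_{p,q}(y) = (y-p)od
    Fs = np.sort(F)
    # midpoints of every gap of width >= min_gap
    return F, [(a + b) / 2 for a, b in zip(Fs, Fs[1:]) if b - a >= min_gap]

X = np.array([[0., 0.], [1., 0.], [0., 1.], [9., 9.], [10., 9.]])
F, cuts = dpp_gaps(X, X.min(0), X.max(0), min_gap=4)
print(cuts)                                # one cut separating the two far groups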

FAUST DPP clustering on IRIS: DPP(y) = (y-p)o(q-p)/|q-p|, with p = min corner (n) and q = max corner (x) of the circumscribing rectangle (midpoints or the average, a, are used also). IRIS: 150 irises (rows), 4 columns (Sepal Length, Sepal Width, Petal Length, Petal Width); the first 50 are Setosa (s), the next 50 are Versicolor (e), the last 50 are Virginica (i).

[The slide shows the F-value count histograms and the projected orderings of the s/e/i samples; only the analysis is reproduced here.]

With gap ≥ 4, p = nnnn, q = xxxx: checking [0,4] distances reveals s42 as a Setosa outlier. CL1: F < 17 (50 Setosa). The thinning 17 < F < 23 gives CL2 (e8, e11, e44, e49, i39); 23 < F gives CL3 (46 Versicolor, 49 Virginica). With outliers removed and p = aaax, q = aaan, CL3 shows a thinning at [6,7]: CL3.1 (F < 6.5) has 44 Versicolor and 4 Virginica; CL3.2 (F > 6.5) has 2 Versicolor and 39 Virginica; no sparse ends. Checking distances in [12,28] shows s16, i39, e49, e11 and the doubleton {e8, e44} are outliers, along with i6, i10, i18, i19, i23, i32. Checking [57,68] distances shows i10, i36, i19, i32, i18 and {i6, i23} are outliers.

Here we project onto lines through the corners and edge midpoints of the coordinate-oriented circumscribing rectangle. It would, of course, get better results if we chose p and q to maximize gaps. Next we consider maximizing the STD of the F-values to ensure strong gaps (a heuristic method).

4 "Gap Hill Climbing": mathematical analysis
One way to increase the size of the functional gaps is to hill climb the standard deviation of the functional, F (hoping that a "rotation" of d toward a higher StDev would increase the likelihood that gaps would be larger ( more dispersion allows for more and/or larger gaps). This is very heuristic. We are more interested in growing the one particular gap of interest (largest gap or largest thinning). To do this we could do: F-slices are hyperplanes (assuming F=dotd) so it would makes sense to try to "re-orient" d so that the gap grows. Instead of taking the "improved" p and q to be the means of the entire n-dimensional half-spaces which is cut by the gap (or thinning), take as p and q to be the means of the F-slice (n-1)-dimensional hyperplanes defining the gap or thinning. This is easy since our method produces the pTree mask of each F-slice ordered by increasing F-value (in fact it is the sequence of F-values and the sequence of counts of points that give us those value that we use to find large gaps in the first place.). The d2-gap is much larger than the d1=gap. It is still not the optimal gap though. Would it be better to use a weighted mean (weighted by the distance from the gap - that is weighted by the d-barrel radius (from the center of the gap) on which each point lies?) In this example it seems to make for a larger gap, but what weightings should be used? (e.g., 1/radius2) (zero weighting after the first gap is identical to the previous). Also we really want to identify the Support vector pair of the gap (the pair, one from one side and the other from the other side which are closest together) as p and q (in this case, 9 and a but we were just lucky to draw our vector through them.) We could check the d-barrel radius of just these gap slice pairs and select the closest pair as p and q??? d1 d1-gap a b c d e f f e d c b a 9 8 7 6 a j k l m n b c q r s d e f o p g h i d1 d1-gap =p q= a b c d e f f e d c b a 9 8 7 6 a j k b c q d e f 2 1 d2 d2-gap p q d2 d2-gap

5 CLUS3 outliers removed p=aaax q=aaan
[F-count histograms for p=aaax,q=aaan and for gap ≥ 4, p=nnnn,q=xxxx are shown on the slide; only the analysis is reproduced here.]

p = aaax, q = aaan (CLUS3, outliers removed): no thinning. Sparse low end: checking [0,8] distances shows i30, i35, i20 are outliers because their F-values are ≥ 4 from 5, 6, 7, 8; {e34, i34} is a doubleton outlier set.

gap ≥ 4, p = nnnn, q = xxxx: sparse lower end: checking [0,4] distances reveals s42 as an outlier because F(s42) = 1 is 4 from 5, 6, ... and it is 4 from the others in [0,4]. Separate at 17 and 23, giving CLUS1: F < 17 (50 Setosa, with s16 and s42 declared outliers); 17 < F < 23: CLUS2 (e8, e11, e44, e49, i39, all already declared outliers); 23 < F: CLUS3 (46 Versicolor, 49 Virginica, with i6, i10, i18, i19, i23, i32 declared outliers). Checking distances in [12,28] confirms s16, i39, e49, e11 are outliers and {e8, e44} is a doubleton outlier set.

Thinning at [6,7]: CLUS3.1 (F < 6.5): 44 Versicolor, 4 Virginica; CLUS3.2 (F > 6.5): 2 Versicolor, 39 Virginica; no sparse ends. CLUS3.2 = 39 Virginica, 2 Versicolor (unable to separate the 2 Versicolor from the 39 Virginica).

CLUS3.1, p = anxa, q = axna: sparse upper end: checking [16,19] distances shows e15 is an outlier, so CLUS3.1 = 42 Versicolor. Gaps = [15,19], [21,26].

Sparse upper end: checking [57,68] distances shows i10, i36, i19, i32, i18 are singleton outliers because their F-values are ≥ 4 from 56 and ≥ 4 from each other; {i6, i23} is a doubleton outlier set.

6 DPP (other corners) Check Dotp,d(y) gaps>=4 Check sparse ends.
[F-count histograms for several corner pairs are shown on the slide; only the analysis is reproduced here.]

p = nxnn, q = xnxx (CLUS1): sparse low end (checking [0,9]): i3, i26, i36 are ≥4 singleton outliers; {i23, i6} and {i8, i31} are doubleton outlier sets. Sparse low end (checking [0,7]): i1, i18, i19, i10, i37, i32 are ≥4 outliers.

p = xnnn, q = nxxx, gap ≥ 4: sparse high end (checking [34,43]): e30, e49, e15, e11 are ≥4 singleton outliers; {e44, e8} is a doubleton outlier set. Gap (24,31): CLUS1 < 27.5 (50 Versicolor, 49 Virginica); CLUS2 > 27.5 (50 Setosa, 1 Virginica). Sparse high end (checking [38,39]): s37 and s1 are outliers.

CLUS1, p = nnnn, q = xxxx, gap ≥ 4: thinning at (8,13); split in the middle at 10.5: CLUS_1.1 < 10.5 (21 Virginica, 2 Versicolor); CLUS_1.2 > 10.5 (12 Virginica, 42 Versicolor).

CLUS1, p = nnxn, q = xxnx: thinning at (7,9); split in the middle at 7.5: CLUS_1.2.1 < 7.5 (10 Virginica, 4 Versicolor); CLUS_1.2.2 > 7.5 (1 Virginica, 38 Versicolor). i15 is a gap ≥ 4 outlier at F = 0. Sparse high end (checking [10,13]): i7 and i35 are ≥4 singleton outliers.

CLUS1.2, p = aaan, q = aaax, gap ≥ 4: high-end gap outlier i30. Further corner pairs on CLUS1.2.1 (p=anaa/q=axaa, p=aana/q=aaxa, p=naaa/q=xaaa) are checked the same way.

7 The next slide attempts to analyze "gap climbing" mathematically.

HILL CLIMBING GAP WIDTH: check Dotp,d(y) for thinnings, use the average of each side of the thinning for p and q, and redo.

[Three F-count histograms are shown on the slide; only the analysis is reproduced here.] Dot F with p = aaan, q = aaax: Cut = 8 gives CLUS_1.1 < 8 (45 Virginica, 1 Versicolor) and 8 < CLUS_1.2 < 17 (5 Virginica, 49 Versicolor); Cut = 9 gives CLUS_1.1 < 9 (46 Virginica, 2 Versicolor) and CLUS_1.2 > 9 (4 Virginica, 48 Versicolor); Cut = 17 gives CLUS_1 < 17 and CLUS_2 > 17 (50 Setosa).

p = avg<12, q = avg>12: inconclusive! There isn't a more prominent gap than before.

p = aaan + .005*avg<12, q = aaax + .005*avg>12: here we tweak d just a little toward the means and get a more prominent gap.

These are attempts at "hill climbing" the gaps to make them more prominent, to see whether they are wider than they appear to be under the chosen F (in case the projection line cuts the gap at a severe angle and therefore reports a much narrower gap than actually exists). The next slide attempts to analyze "gap climbing" mathematically.

8 "Gap Hill Climbing": mathematical analysis
One way to increase the size of the functional gaps is to hill climb the standard deviation of the functional F (hoping that a "rotation" of d toward a higher StDev would increase the likelihood of larger gaps, since more dispersion allows for more and/or larger gaps). This is very heuristic. We are more interested in growing one particular gap of interest (the largest gap or largest thinning). To do that: F-slices are hyperplanes (assuming F = Dotd), so it makes sense to "re-orient" d so that the gap grows. Instead of taking the "improved" p and q to be the means of the two entire n-dimensional half-spaces cut by the gap (or thinning), take p and q to be the means of the (n-1)-dimensional F-slice hyperplanes defining the gap or thinning. This is easy, since our method produces the pTree mask of each F-slice ordered by increasing F-value (in fact it is the sequence of F-values, and the sequence of counts of points giving those values, that we use to find large gaps in the first place).

In the slide's drawing, the d2-gap is much larger than the d1-gap, though it is still not the optimal gap. Would it be better to use a weighted mean (weighted by the distance from the gap, that is, by the d-barrel radius, measured from the center of the gap, on which each point lies)? In this example that seems to make for a larger gap, but what weightings should be used (e.g., 1/radius²)? (Zero weighting after the first gap is identical to the previous.) Also, we really want to identify the support-vector pair of the gap (the pair, one from each side, which are closest together) as p and q (in this case, 9 and a, but we were just lucky to draw our vector through them). We could check the d-barrel radius of just these gap-slice pairs and select the closest pair as p and q.

[The slide's figure, showing the points a-f and 6-9 projected onto the d1 and d2 lines with their d1- and d2-gaps, is omitted here.]

9 Barrel Clustering: (This method attempts to build barrel-shaped gaps around clusters)
Using a furthest point or a mean point q allows for a better fit around convex clusters that are elongated in one direction (not round): gaps in the dot-product lengths [projections] on the line, plus gaps in the barrel radius around it.

An exhaustive search for all barrel gaps takes two parameters for a pseudo-exhaustive search (exhaustive modulo a grid width):
1. a StartPoint, p (an n-vector, so n-dimensional), and
2. a UnitVector, d (a direction, so (n-1)-dimensional: a grid on the surface of the sphere in Rn).
Then for every choice of (p,d) (e.g., in a grid of points in R2n-1), two functionals are used to enclose subclusters in barrel-shaped gaps (see the sketch below):
a. the SquareBarrelRadius functional, SBR(y) = (y-p)o(y-p) - ((y-p)od)²
b. the BarrelLength functional, BL(y) = (y-p)od

Given p, do we need a full grid of d's (directions)? No! d and -d give the same BL-gaps. Given d, do we need a full grid of starting points p? No! All p' such that p' = p + cd give the same gaps. Hill climb the gap width from a good starting point and direction.

MATH: we need the dot-product projection length and the dot-product projection distance. The squared projection of y on f is (yof)²/(fof), so the squared projection distance of y on f is yoy - (yof)²/(fof). For y-p projected on q-p: squared projection distance = (y-p)o(y-p) - ((y-p)o(q-p))²/((q-p)o(q-p)). For the dot-product length projections (caps) we already needed (y-p)o(q-p)/|q-p| = yo(q-p)/|q-p| - po(q-p)/|q-p|. That is, we need to compute the constants and the dot-product functionals in an optimal way (and then do the PTreeSet additions/subtractions/multiplications). What is optimal? (Minimizing PTreeSet functional creations and PTreeSet operations.)
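A minimal numpy sketch of the two barrel functionals above, assuming d is already a unit vector (a real search would also grid over (p, d)); the function names are illustrative:

import numpy as np

def sbr(Y, p, d):
    """SquareBarrelRadius: (y-p)o(y-p) - ((y-p)od)^2 per row of Y."""
    Yp = Y - p
    return np.einsum('ij,ij->i', Yp, Yp) - (Yp @ d) ** 2

def bl(Y, p, d):
    """BarrelLength: (y-p)od per row of Y."""
    return (Y - p) @ d

# Gaps in sbr() bound the barrel side; gaps in bl() bound the two caps.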

10 The 4 functionals in the dot-product group of gap clusterers on a VectorSpace subset, Y (y∈Y):

1. SLp(y) = (y-p)o(y-p), p a fixed vector: the Square Length functional, used primarily for outlier identification and densities.

2. Dotd(y) = yod (d a unit vector): the Dot-product functional. Using d = (q-p)/|q-p| and y-p: Dotp,q(y) = (y-p)o(q-p)/|q-p|. The projection of y onto the d-line is (yod)d; squaring the length of the residual y - (yod)d gives (y-(yod)d)o(y-(yod)d) = yoy - (yod)², and the same holds when the projection is negative.

3. SPDd(y) = yoy - (yod)² (d a unit vector): the Square Projection Distance functional. E.g., if d ≡ (q-p)/|q-p| (the unit vector from p to q), then SPDpq(y) = (y-p)o(y-p) - ((y-p)o(q-p))²/((q-p)o(q-p)) = yoy - 2yop + pop - (yo(q-p)/|q-p| - po(q-p)/|q-p|)². To avoid creating an entirely new VectorPTreeSet(Y-p) for the space (with the origin shifted to p), we use the latter expression, where we might compute: 1st, the constant vector (q-p)/|q-p|; 2nd, the ScalarPTreeSet yo(q-p)/|q-p|; 3rd, the constant po(q-p)/|q-p|; 4th, the SPTreeSet yo(q-p)/|q-p| - po(q-p)/|q-p|; 5th, the SPTreeSets yoy and yop; 6th, the constant pop; 7th and 8th, the remaining SPTreeSet combinations. Is it better to leave all the additions and subtractions for one mega-step at the end? Other efficiency thoughts? We note that Dotd(y) = yod shares many construction steps with SPD.

4. CAd(y) = yod/|y| (d a unit vector): the Cone Angle functional. Using d = (q-p)/|q-p| and y-p: CAp,q(y) = (y-p)od/|y-p|, and SCAp,q(y) = ((y-p)od)²/((y-p)o(y-p)) is the Squared Cone Angle functional.
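The shared-construction-steps remark can be made concrete with a numpy stand-in (illustrative only; in the real system these would be ScalarPTreeSet computations): yoy, yop and yod are built once, and all four functionals fall out by cheap arithmetic.

import numpy as np

def dot_group(Y, p, q):
    d = (q - p) / np.linalg.norm(q - p)
    yoy = np.einsum('ij,ij->i', Y, Y)      # built once, reused below
    yop = Y @ p
    yod = Y @ d
    SL  = yoy - 2 * yop + p @ p            # (y-p)o(y-p)
    Dot = yod - p @ d                      # (y-p)od
    SPD = SL - Dot ** 2                    # squared projection distance
    SCA = Dot ** 2 / SL                    # squared cone angle (assumes no y == p)
    return SL, Dot, SPD, SCA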

11 CLUS1.2 is pure Versicolor (45 of the 50).
[SPD value-count histograms for several (p,q) choices, e.g., SPD on CLUS1 with p = e11 and q = MN, are shown on the slide; only the masks and conclusions are reproduced here.]

Masking V > 12.5 gives CLUS1.2: 45 Versicolor, 0 Virginica. Masking V < 12.5 gives CLUS1.1: 5 Versicolor, 24 Virginica. Masking V > 15.5 gives CLUS3, a tube containing 49 Setosa plus 2 Virginica. Masking the thin gap 8.5 < V < 15.5 gives CLUS2.

CLUS1.2 is pure Versicolor (45 of the 50). CLUS3 is almost pure Setosa (49 of the 50, plus 2 Virginica). CLUS2 is almost purely [half of] Virginica (24 of the 50, plus 1 Setosa). CLUS1.1 is the other 24 Virginica, plus the other 5 Versicolor. So this method clusters IRIS quite well (albeit into 4 clusters, not three). Note that caps were not put on these tubes.

However, I cheated a bit: this was NOT unsupervised clustering! I used p = MinVect(e) and q = MaxVect(e), taking advantage of my knowledge of the classes to carefully choose the unit-vector points p and q, e.g., p = MinVector(Versicolor) and q = MaxVector(Versicolor). True, if one sequenced through a fine enough d-grid of all unit vectors [directions], one would happen upon a unit vector closely aligned to d = (q-p)/|q-p|, but that would be a whole lot more work than I did here (it would take much longer). In the worst case, for totally unsupervised clustering, there would be no other way than to sequence through a grid of unit vectors. However, a good heuristic might be to try all unit vectors "corner-to-corner" and "middle-of-face to middle-of-opposite-face" first, etc. Another thought would be to introduce some sort of hill climbing to work toward a good combination of a radial gap plus two good linear cap gaps for that radial gap.

12 SPD on CLUS1 and CLUS2 with diagonal (p,q) choices
[SPD value-count histograms for many (p,q) corner pairs (e.g., p = C1US1axxx, q = C1US1aaaa) are shown on the slide; most show no thinnings, and only the masks and conclusions are reproduced here.]

SPD, p = axxx, q = aaaa: mask V < 11.5 gives CLUS1 (0 Setosa, 46 Versicolor, 24 Virginica); mask V > 11.5 gives CLUS2 (50 Setosa, 4 Versicolor, 26 Virginica). Within CLUS1: mask V < 3.5 gives 14 Versicolor, 10 Virginica; mask V > 3.5 gives 0 Setosa, 32 Versicolor, 14 Virginica. SPD on CLUS2 with p = C1US2axxx, q = C1US2aaaa: mask V < 13.5 gives CLUS2.1 (44 Setosa, 0 Versicolor, 2 Virginica); mask 13.5 < V < 100 gives CLUS2.2 (6 Setosa, 4 Versicolor, 24 Virginica). Within CLUS1, further splits at V = 9.5 (37 Versicolor, 16 Virginica vs. 9 Versicolor, 8 Virginica) and at V = 5.5 (e.g., 16 vs. 30 Versicolor, 3 vs. 21 Virginica) are possible, but the remaining corner pairs produce no thinnings.

13 95 remaining versicolor and virginica=SubClus1.
[Value-count histograms for the (y-p)o(y-p) outlier-identification rounds are shown on the slide; only the analysis is reproduced here.]

Using x = s (58 = avg(y1)), the low end holds all 50 Setosa plus i39 (e.g., F=0: s15, s17, s34; F=1: s6, s11, s16, s19, s20, s22, s28, s32, s33, s37, s47, s49; ...; F=8: i39), with e49 just beyond. {e4, e40} form a doubleton outlier set; i7 and e10 are singleton outliers. The high end shows 9 Virginica: i1, i31, i10, i8, i36, i32, i16, i18, i23, i19. But here I mistakenly used the mean rather than the max corner, so I will redo it; note, though, the high level of cluster and outlier revelation! Two actual gap-outliers appear, and checking distances reveals 4 e-outliers (Versicolor) and 5 s-outliers (Setosa) among s3, s9, s39, s43, s42, s23, s14. A further round reveals no new outliers.

95 remaining Versicolor and Virginica = SubClus1. Continue outlier-identification rounds on SubClus1 (maxSL, maxSW, maxPW), then do "capped tube" analysis (further subclusters):
1. (y-p)o(y-p): remove edge outliers (threshold > 2*50).
2. Find thin gaps in SPD, with d running from an edge point to the mean.
3. For each thin PL, do length-gap analysis of the points in the "tube".
e30 and e15 are outliers; e20, e31, e32 form SC12, a declared tripleton outlier set (though they are not singleton outliers).

45 remaining Setosa form SubCluster2 (it may have additional outliers or sub-subclusters, but we will not analyze it further, though that would be done in practice). SPD(y) = (y-p)o(y-p) - ((y-p)od)², d: mn-mx (next slide).

14 Cone Clustering: (finding cone-shaped clusters)
[Cone-count histograms for various vertex points x and cosine thresholds are shown on the slide; only the tallies and conclusions are reproduced here.]

F = (y-M)o(x-M)/|x-M| - mn, restricted to a cosine cone on IRIS; corner points give a gap in the dot-product projections onto the corner-points line.

w maxs-to-mins, cone=.939: 114 points, 14 i and 100 s/e, so it picks i as 0 (i25, i40, i16, i42, i17, i38, i11, i48, i34, i50, i24, i28, i27 and i39 appear along the way). w naaa-xaaa, cone=.95: 41/43 e, so it picks e (i21 and i7 intrude). w aaan-aaax, cone=.54: 100/104 s or e, so 0 picks i (i27, i28, i20, i34 at the low end). w xnnn-nxxx, cone=.95: 43/50 e, so it picks out e (i22, i50, i28, i24, i27, i34, i39 intrude). w maxs, cone=.93: 27/29 are i's (i10 at the low end; e21, e34 intrude). w maxs, cone=.925: 31/34 are i's (e21, e34, e35, i7). Vertex choices x=s1, x=s2, x=e1, x=i1 with cone = 1/√2 or .9 give tallies of 50, 51, 47, 60, 75, 137, etc.

Cosine conical gapping seems quick and easy (cosine = dot product divided by both lengths; the length of the fixed vector, x-M, is a one-time calculation; the length of y-M changes with y, so build the PTreeSet).

15 P(mrmv)/|mrmv|oX<a
FAUST Oblique Classifier: formula: P(X dot D)>a X any set of vectors. D=oblique vector (Note: if D=ei, PXi > a ).     r   r r v v        r   mr   r      v v v       r    r       v mv v      r    v v     r            v                     P(mrmv)/|mrmv|oX<a For classes r and v D = mrmv a PX dot d>a = PdiXi>a E.g.,? Let D=vector connecting class means and d= D/|D| To separate r from v: D = (mvmr), a = (mv+mr)/2 o d = midpoint of D projected onto d FAUST-Oblique: Create tbl, TBL(classi, classj, medoid_vectori, medoid_vectorj). Notes: If we just pick the one class which when paired with r, gives max gap, then we can use max gap or max_std_Int_pt instead of max_gap_midpt. Then need stdj (or variancej) in TBL. Best cutpoint? mean, vector_of_medians, outmost, outmost_non-outlier? AND 2 pTrees masks P(mbmr)oX>(mr+m)|/2od P(mvmr)oX>(mr+mv)/2od masks vectors that makes a shadow on mr side of the midpt "outermost = "furthest from means (their projs of D-line); best rankK points, best std points, etc. "medoid-to-mediod" close to optimal provided classes are convex. g b    grb  grb grb        grb    grb  grb    grb grb                    grb  In higher dims same (If "convex" clustered classes, FAUST{div,oblique_gap} finds them.     r   r r v v        r  mr   r      v v v       r    r       v mv v      r    b v v     r            b    b v                     b  mb  b                   b   b                              b    b b   For classes r and b                      bgr      bgr                  bgr       bgr                           bgr  bgr bgr   bgr bgr bgr r D

16 FAUST Oblique: PR = P(X o d) < a on the d-line; D ≡ mV - mR is the oblique vector.

d = D/|D|. Separate classR and classV using the midpoint-of-means (mom) method: calculate a, viewing mR and mV as vectors (mR ≡ the vector from the origin to the point mR): a = (mR + (mV-mR)/2) o d = ((mR+mV)/2) o d. (The very same formula works when D = mR - mV, i.e., points to the left.)

Training ≡ choosing the "cut hyperplane" (CHP), which is always an (n-1)-dimensional hyperplane (which cuts space in two). Classifying is one horizontal program (AND/OR) across pTrees to get a mask pTree for each entire class (bulk classification).

Improve accuracy? E.g., by considering the dispersion within classes when placing the CHP (see the sketch below). Use:
1. the vector of medians, vom, to represent each class rather than the mean mV, where vomV ≡ (median{v1 | v∈V}, median{v2 | v∈V}, ...);
2. project each class onto the d-line (e.g., the R class); then calculate the std of those distances from the origin along the d-line (one horizontal formula per class, using Md's method); then use the std ratio to place the CHP (no longer at the midpoint between mR [vomR] and mV [vomV]).

[The slide's 2-D sketch of the r and v classes, their voms and the d-line is omitted here.]
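A toy numpy sketch of the two placement rules above (midpoint of means vs. a dispersion-aware cut); array arithmetic stands in for the one-horizontal-formula pTree computations, and the std-ratio rule here is one plausible reading of "use the std ratio to place the CHP":

import numpy as np

def oblique_cut(R, V, use_std_ratio=True):
    mR, mV = R.mean(0), V.mean(0)
    d = (mV - mR) / np.linalg.norm(mV - mR)
    pR, pV = R @ d, V @ d                 # project both classes on the d-line
    if use_std_ratio:                     # cut between means, nearer the tight class
        sR, sV = pR.std(), pV.std()
        a = pR.mean() + (pV.mean() - pR.mean()) * sR / (sR + sV)
    else:                                 # mom: a = ((mR+mV)/2) o d
        a = (mR + mV) @ d / 2
    return d, a

# Bulk classification of a table X is then the single mask (X @ d) > a.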

17 L1(x,y) value and count arrays
[The slide shows the L1(x,y) value array and count array over the 15 points z1..zf, together with their (x1,x2) coordinate grid; the numeric detail is not recoverable from this transcript.]

18 L1(x,y) value and count arrays (continued)
[The same L1(x,y) value/count arrays and coordinate grid as the previous slide, now with the mean M marked.] This just confirms z6 as an anomaly or outlier, since it was already declared so during the linear gap analysis; it likewise confirms zf as an anomaly or outlier. After having subclustered with linear gap analysis, it would make sense to run this round-gap algorithm out only 2 steps, to determine whether there are any singleton, gap>2 subclusters (anomalies) that were not found by the previous linear analysis.

19 Cluster by splitting at gaps > 2
F(y) = yo(x-M)/|x-M|, where M is the mean; value and count arrays are built for each fixed x. [The slide also shows the z1..zf coordinate grid.] The F column for x = z1 is:

y:  z1 z2 z3 z4 z5 z6 z7 z8 z9 z10 z11 z12 z13 z14 z15
F:  14 12 12 11 10  6  1  2  0   2   2   1   2   0   5

Splitting at gaps > 2 (here the gaps 6-10 and 2-5) yields cluster pTree masks, built by ORing the point masks within each F-interval (e.g., the masks involving z11, z12, z13).
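A small numpy sketch of the splitting loop used on these four "Cluster by splitting at gaps > 2" slides (a toy stand-in: the real version reads gaps off the F-value/count arrays and ORs pTree masks):

import numpy as np

def gap_split_masks(F, min_gap=2):
    """One boolean mask per F-interval between gaps > min_gap."""
    order = np.argsort(F)
    Fs = F[order]
    cut_after = np.where(np.diff(Fs) > min_gap)[0]
    masks, start = [], 0
    for c in list(cut_after) + [len(Fs) - 1]:
        m = np.zeros(len(F), dtype=bool)
        m[order[start:c + 1]] = True      # OR together the points in this interval
        masks.append(m)
        start = c + 1
    return masks

F = np.array([14, 12, 12, 11, 10, 6, 1, 2, 0, 2, 2, 1, 2, 0, 5])  # x = z1 column
print([m.sum() for m in gap_split_masks(F)])   # sizes of the three F-intervals

ANDing masks obtained from different x's (as on the fourth of these slides) then yields the subcluster masks.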

20 Cluster by splitting at gaps > 2
Same F(y) = yo(x-M)/|x-M| value and count arrays and coordinate grid as the previous slide, for the next choice of x. This round finds a further gap (between 6 and 9), adding the masks z71 and z72 to the earlier z11, z12, z13.

21 Cluster by splitting at gaps > 2
Same value and count arrays again, for another x. A third gap (between 3 and 7) adds the masks zd1 and zd2 to the earlier z11, z12, z13 and z71, z72.

22 Cluster by splitting at gaps > 2
Same value and count arrays. With the three families of masks in hand (z11/z12/z13, z71/z72, zd1/zd2), AND each red mask with each blue mask and each green mask to get the subcluster masks (12 ANDs).

23 SpS[ XX, d2(x=(x1,x2), y=(x3,x4) ] = SpS[ XX, (x1-x3)2+(x2-x4)2 ] =
Computation for SpS(XX,d2(x,y)) shown below is chock full of useful information but is also massive and time 11/25/12 consuming to compute (even with 2 level pTrees). Can we find a simpler functional involving some subset of the 40 terms below that is square distance dominated? SpS[ XX, d2(x=(x1,x2), y=(x3,x4) ] = SpS[ XX, (x1-x3)2+(x2-x4)2 ] = SpS(XX, x1x1 + x2x2 + x3x3 + x4x4 - 2x1x3 -2x2x4) 26( p13+p13p12 + p23+p23p22 + p33+p33p32 + p43+p43p42 -2 p13p33-2p13p32 -2 p23p43-2p23p42 ) + 25( p13p11 + p33p31 + p43p41 -2 p13p31 -2 p23p41 ) + 24( p12+p13p10+p12p11 + -2p12p32-2p13p30-2p12p31 p22+p23p20+p22p21 + p32+p33p30+p32p31 + p42+p43p40+p42p41 -2p22p42-2p23p40-2p22p41) + 23( p12p10 + p22p20 + p32p30 + p42p40 -2 p12p30 -2 p22p40 ) + 22( p11+p11p10 + -2p11p31-2p11p30 -2 p21p41-2p21p40 ) + p21+p21p20 + p31+p31p30 + p41+p41p40 p10 + p20 + p30 + p40 -2p10p30 -2p20p40 ) p23p21 + =SpS(XX, =SpS(XX, - p13p31 - p23p41 p13p33+p13p32 +p23p43+p23p42 )) + 26( p13+p23+p33+p43 +p13p12 + +p23p22 + +p33p32 + +p43p42 -2( 24( p12+p13p10+p12p11 + p22+p23p20+p22p21 + p32+p33p30+p32p31 + p42+p43p40+p42p41 - p12p30 - p22p40 +2( p13p11 + p23p21 + p33p31 + p43p41 - p12p32 - p13p30 - p12p31 - - p22p42 - p23p40 - p22p41 )) + 22( p11+p11p10 + p21+p21p20 + p31+p31p30 + p41+p41p40 - p10p30 -p20p40 ) + p10 + p20 + p30 + p40 +2( p12p10 + p22p20 + p32p30 + p42p40 - p11p31- p11p30 - p21p41- p21p40 )) + p13 p12 p11 p10 p23 p22 p21 p20 p33 p32 p31 p30 p43 p42 p41 p40 * * * * * * * 1 * p13 p12 p11 p10 p23 p22 p21 p20 p33 p32 p31 p30 p43 p42 p41 p40 )+ =SpS(XX, 26( p13+p23+p33+p43 +p13p12+p23p22 + +p33p32 + +p43p42 ) 25( p13p11+ p23p21 + p33p31 + p43p41 ) 24( p12+p22+p32+p42 +p23p20 +p33p30 +p42p41 ) 23( p12p10+ p22p20 + p32p30 + p42p40 ) 22( p11+p21+p31+p41 +p21p20 + +p31p30 +p41p40 ) -27( p13p33 + p13p32 + p13p31 + p23p41 -25( p13p30 +p12p31 +p22p41 +p23p40 +p22p42 -24(p12p30 +p22p40 -23(p11p31 +p11p30 +p21p41 +p21p40 p10+p20+p30+p40 ) -22(p10p30+p20p40 -26( +p12p32 +p13p10+ +p12p11+ +p22p21 +p32p31 +p43p40 +p11p10 + p23p43 p23p42 + piipii=pii (no processing) Only 44 the pairwise products need computing.

24 The computation for SpS(XX, d²(x,y)) is massive and time-consuming (even with 2-level pTrees).

Can we use a simpler distance-dominated functional? Here we try a Manhattan-distance-based functional: L1(x,y) ≡ Σi=1..n |xi - yi|.

Claim (n=2): L1(x,y)/√2 ≤ L2(x,y) ≡ (Σi=1..n (xi-yi)²)^½, so L1/√2 is distance dominated. Proof: maximize f(x,y) = x + y on the unit circle x² + y² = 1: f(x) = x + (1-x²)^½, f'(x) = 1 + ½(1-x²)^(-½)(-2x) = 0 ⟹ 1 = x/(1-x²)^½ ⟹ 1 - x² = x² ⟹ x² = 1/2 ⟹ x = 1/√2, so max f = √2 and L1 ≤ √2·L2.

pTree operations: +, -, *. Input operands: pTrees; output: a pTree.
PTreeSet operations (column ops): +, -, *, max, min, rankK, ... Input operands: PTreeSets; output: a PTreeSet.
Table functionals (pTrees and PTreeSets are tables too!): col_max, col_min, col_rankK, max of a linear combination of columns, ... Input operand: a table; output: a column of reals or a PTreeSet.
Table functional contours: MaxPts, MinPts, RankKPts, ... Input operands: a table, a functional, and a set of reals; output: the mask pTree of the set of points that map into that set of reals under that functional on that table.
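A quick numeric check of the 2-D claim (L2 ≤ L1 ≤ √2·L2), in numpy:

import numpy as np

rng = np.random.default_rng(0)
x, y = rng.random((2, 100000, 2))              # random 2-D point pairs
L1 = np.abs(x - y).sum(axis=1)
L2 = np.linalg.norm(x - y, axis=1)
assert np.all(L2 <= L1 + 1e-12)                # L1 dominates L2
assert np.all(L1 <= np.sqrt(2) * L2 + 1e-12)   # and exceeds it by at most sqrt(2)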

25 F1=pS( XX,(x1-x3)2+(x2-x4)2= +26( p13+p23+p33+p43 +p13p12+p23p22+
F1=Square Dis Functional is a hard pTree computation (Md?). F2 F2=simpler distance dominated functional Manhattan-distance-based (but the masks may be as difficult to construct as the terms of F1?) Consider the functionals G1=(x1-x3)+(x2-x4) G2=(x1-x3)+(x4-x2) G3=(x3-x1)+(x2-x4) G4=(x3-x1)+(x4-x2) | Gi | / √2 is a distance dominated functional on XX, e.g., on z1: since L1((x1,x2),(x3,x4))= |x1-x3| + |x2-x4|  (x1-x3)+(x2-x4) etc.  L1  √2*L2. IDX z1 z2 : ze zf IDY z1 z2 z3 z4 z5 z6 z7 z8 z9 za zb zc zd ze zf X1 1 3 : 11 7 X2 1 : 11 8 X3 X4 p13 : 1 p12 : 1 p11 1 : p10 1 : p23 : 1 p22 : p21 : 1 p20 1 : p33 p32 p31 p30 p43 p42 p41 p40 F1 1 3 2 6 9 15 14 13 10 11 7 1 2 3 4 9 10 11 8 1 1 1 1 1 1 1 1 4 2 8 17 68 196 170 200 153 145 181 164 85 5 40 144 122 148 109 113 136 65 : 162 128 117 116 90 80 53 1 25 61 41 29 89 52 10 20 13 1 3 4 7 10 11 12 13 14 9 2 6 8 : 5 ID1 ID2 G1 G2 G3 G4 MXabs z1 z z1 z z1 z z1 z z1 z z1 z z1 z z1 z z1 z z1 z z1 z z1 z z1 z z1 z z1 z dXY 2 1 3 4 8 14 13 12 9 z1 z2 z3 z4 z5 z6 z7 z8 z9 za zb zc zd ze zf : 1 3 2 6 9 15 14 13 10 11 7 : 1 2 3 4 9 10 11 8 : 1 : 1 : 1 : 1 : 1 : 1 : 1 : 1 : z1 z2 z3 z4 z5 z6 z7 z8 z9 za zb zc zd ze zf 1 3 2 6 9 15 14 13 10 11 7 1 2 3 4 9 10 11 8 1 1 1 1 1 1 1 1 z1 z2 z3 z4 z5 z6 z7 z8 z9 za zb zc zd ze zf 1 3 2 6 9 15 14 13 10 11 7 1 2 3 4 9 10 11 8 1 1 1 1 1 1 1 1 F1=pS( XX,(x1-x3)2+(x2-x4)2= +26( p13+p23+p33+p43 +p13p12+p23p22+ +p33p32 + +p43p42 +25( p13p11+ p23p21+ p33p31+ p43p41 24( p12+p22+p32+p42 +p23p20 +p33p30 -p42p41 ) +23( p12p10+ p22p20 + p32p30 + p42p40 )+ +22( p11+p21+p31+p41 +p21p20 + +p31p30 +p41p40 -27( p13p33+p13p32+ -p13p31- p23p41) -p13p30 -p12p31 -p22p41 -p23p40 -p22p42)+ -p12p30 -p22p40 -p11p31-p11p30 -p21p41-p21p40 + p10+p20+p30+p40 -p10p30-p20p40) -p12p32 +p13p10+ -p12p11 -p22p21 -p32p31 +p43p40 +p11p10 + p23p43+p23p42) F2=SpS(XX,(Rnd((1/√2)((x1>x3&x2>x4)*(x1-x3+x2-x4)+ (x1>x3&x2x4)*(x1-x3+x4-x2)+ (x1x3&x2>x4)*(x3-x1+x2-x4)+ (x1x3&x2x4)*(x3-x1+x4-x2))))

26 ANDing Multi-Level pTrees
The two-level AND rule: 1. A ≡ AND(lev1s) = resultLev1. 2. If (Ak = 0 and ∃ an operand such that its Lev0 stride k is pure0), then resultLev0k = pure0; else if (Ak = 1), resultLev0k = pure1; else resultLev0k = AND(lev0 strides k). Levels are objects with methods: AND, OR, Comp, Add, Mult, Neg, ... (in MapReduce terminology, pTrees = "maps" and methods = "reducers"?).

Example (A = P13P12, B = P33P32, C = P13P33, D = P33P43 at level 0, E = P13P23 at level 1): all level-0 pTrees in the range P33..P40 are identical (= p13..p20 respectively), and here all are mixed; all level-0 pTrees in the range P13..P20 are pure, and that purity is given by p12..p20 respectively. So for A: strides A1-6 are pure0, so resultLev0 strides 1-6 are pure0; A7-a = 1, so resultLev0 strides 7-a are pure1; strides Ab-f are pure0. For B: B1-f are all identical.

2pDoop: a 2-level Hadoop (key-value) pTreebase. Should all pairwise pTrees be put in 2pDoop upon data capture? All 2-level pTrees for SpS[XX, (x1-x3)²+(x2-x4)²] put in 2pDoop are embarrassingly parallelizable. What I'm after here is SpSX(d(x, {y∈X | y≠x})), and I would like to compute this SpS without looping on X.
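A toy Python version of the AND rule, under this simplified model: a pTree is a list of strides, each 'pure1', 'pure0', or a list of level-0 bits (mixed); the level-1 bit of stride k is 1 iff the stride is pure1. (This illustrates the rule only, not the 2pDoop layout.)

def and_2level(trees, stride_len):
    result = []
    for strides in zip(*trees):                  # stride k of every operand
        if all(s == 'pure1' for s in strides):   # A_k = 1: result pure1
            result.append('pure1')
        elif any(s == 'pure0' for s in strides): # short-circuit: no level-0 work
            result.append('pure0')
        else:                                    # A_k = 0 but no pure0 operand:
            mixed = [s if isinstance(s, list) else [1] * stride_len
                     for s in strides]
            result.append([min(col) for col in zip(*mixed)])  # AND the level-0s
    return result

A = ['pure1', 'pure0', [1, 0, 1, 1]]
B = ['pure1', [1, 1, 0, 0], [1, 1, 1, 0]]
print(and_2level([A, B], 4))   # ['pure1', 'pure0', [1, 0, 1, 0]]

Note that the middle stride never touches B's level-0 bits: that short-circuiting is the payoff of the 2-level organization.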

27 Level-1 key map Red=pure stride (so no Level-0)
[The slide shows the level-1 key map (strides a..m against keys 13..40) and the level-0 key map for this 2pDoop key-value DB; pure strides (red) store no level-0.] In this 2pDoop key-value DB we list keys. Should we bitmap? Each bitmap is a pTree in the KVDB, and many level-0 entries coincide with existing pTrees (entries of the form "(6-e) = e, else pure0", and so on). The bit-slice expansion of SpS(XX, (x1-x3)² + (x2-x4)²) from the earlier slide is then evaluated against these keys, stride by stride.

28 pTree Rank(K)
Rank(n-1) applied to SpS(XX, d²(x,y)) gives the 2nd-smallest distance from each x (useful in outlier analysis?).

RankKval=0; p=K; c=0; P=Pure1; /* the RankK points are returned as the resulting pTree, P */
For i=n to 0 { c = Count(P & Pi);
  If (c >= p) { RankKval = RankKval + 2^i; P = P & Pi; }
  Else { p = p - c; P = P & P'i; } };
return RankKval, P;

Below, K = n-1 = 7-1 = 6 (looking for the 6th-highest = 2nd-lowest of the values 10, 5, 6, 7, 11, 9, 3). Notice that each new P has value; we should retain every one of them (how to catalog them in 2pDoop?). Cross out the 0-positions of P at each step:
(i=3) c = Count(P & P4,3) = 3 < 6, so p = 6-3 = 3 and P = P & P'4,3, masking off the highest values (val 8). Bit: {0}
(i=2) c = Count(P & P4,2) = 3 >= 3, so P = P & P4,2, masking off the lowest values. Bit: {1}
(i=1) c = Count(P & P4,1) = 2 < 3, so p = 3-2 = 1 and P = P & P'4,1. Bit: {0}
(i=0) c = Count(P & P4,0) = 1 >= 1, so P = P & P4,0. Bit: {1}
RankKval = 0·2³ + 1·2² + 0·2¹ + 1·2⁰ = 5; P = MapRankKPts; ListRankKPts = {2}.
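A direct Python transcription of the procedure, using ints as bit-vector pTrees (bit j = row j); it finds the K-th largest value and the mask of the rows holding it with one AND and one count per bit slice, no sorting:

def rank_k(ptrees, nrows, K):
    """ptrees[i] is the bit column for bit (len(ptrees)-1-i), as a Python int."""
    full = (1 << nrows) - 1
    P, p, val = full, K, 0
    for i, Pi in enumerate(ptrees):            # high bit slice first
        bit = len(ptrees) - 1 - i
        c = bin(P & Pi).count('1')
        if c >= p:                             # K-th largest has this bit set
            val += 1 << bit
            P &= Pi
        else:                                  # it lies among the 0-bit rows
            p -= c
            P &= full & ~Pi
    return val, P                              # P masks the rank-K point(s)

vals = [10, 5, 6, 7, 11, 9, 3]                 # the slide's 7 values
pt = [sum(((v >> b) & 1) << r for r, v in enumerate(vals)) for b in (3, 2, 1, 0)]
print(rank_k(pt, len(vals), 6))                # (5, 0b10): value 5, held by row #2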

29 Rank(K) when MinVal is duplicated
Suppose MinVal is duplicated (occurs at two points). What does the algorithm return? (Same procedure and K = 6, now with the value 3 occurring twice.)
1. P = P4,3: Count(P) = 3 < 6, so P = P'4,3 masks off the highest values (val 8); p = 6-3 = 3. Bit: {0}
2. Count(P & P4,2) = 2 < 3, so P = P & P'4,2 and p = 3-2 = 1, masking off the highest remaining values (val 4). Bit: {0}
3. Count(P & P4,1) >= 1, so P = P & P4,1. Bit: {1}
4. Count(P & P4,0) >= 1, so P = P & P4,0. Bit: {1}
Result: 0011 = 3 = MinVal = Rank(n-1)Val, and P masks MinPts = Rank(n-1)Pts = {#4, #7}.

30 Rank(K) when MinVal is triplicated
Suppose MinVal is triplicated (occurs at three points). What does the algorithm return? (Same procedure and K = 6, now with the value 3 occurring three times.)
1. P = P4,3: Count(P) = 3 < 6, so P = P'4,3 (masks off the highest values, val 8); p = 6-3 = 3. Bit: {0}
2. Count(P & P4,2) = 1 < 3, so P = P & P'4,2 and p = 3-1 = 2 (masks off the highest remaining value, val 4). Bit: {0}
3. Count(P & P4,1) >= 2, so P = P & P4,1. Bit: {1}
4. Count(P & P4,0) >= 2, so P = P & P4,0. Bit: {1}
Result: 0011 = 3 = MinVal, and P masks MinPts = {#4, #5, #7}.

31 Rank(n-1) applied per stride
Val=0; p=K; c=0; P=Pure1; For i=n to 0 { c = Ct(P&Pi); If (c >= p) { Val = Val + 2^i; P = P&Pi; } else { p = p-c; P = P&P'i; } }; return Val, P;

[The slide shows the IDX/IDY pairwise table for z1..zf, its distance column d(x,y), and the bit-slice pTrees P3..P0 and P'3..P'0, with four worked traces of the procedure, e.g.: i=3: c = Ct(P&P3) = 10 < 14, p = 14-10 = 4, P = P&P'3 (eliminates 10 values of 8); i=2: c = 1 < 4, p = 3, P = P&P'2; i=1: c = 2 < 3, p = 1, P = P&P'1; i=0: c = 2 >= 1, P = P&P0; Val = 1.]

We need Rank(n-1) applied to each stride instead of to the entire pTree; the result from stride j gives the j-th entry of SpS(X, d(x, X-x)). Parallelize over a large cluster? Ct(P&Pi): revise the Count procedure to kick out a count for each stride (involves looping down the pTree by register lengths?). What does P represent after each step? How does the algorithm go on 2pDoop (with 2-level pTrees), where each stride is separate? Note: we use d, not d² (fewer pTrees). Can we estimate d (using a truncated Maclaurin series)?

32 Sparse Gap Revealer (width ≥ 2⁴, count ≤ 2)
This is run on an SpS of unknown origin ;-) (we can't reconstruct the points from it). The projection values xod for z1..zf are 11, 27, 23, 34, 53, 80, 118, 114, 125, 110, 121, 109, 83, examined in intervals of width 2⁴ = 16:

[0,16): one point, z1. z1od = 11 is only 5 units from the right edge, so z1 is not declared an outlier (yet). Next we check the minimum distance from the left edge of the next interval, to see whether z1's right-side gap is actually ≥ 2⁴ (the calculation of the minimum is a pTree process; no looping over x required!).
[16,32): the minimum, z3od = 23, is 7 units from the left edge, 16, so z1 has only a 5+7 = 12-unit gap on its right (not a 2⁴ gap). So z1 is not declared a 2⁴-outlier (it is declared a 2⁴ inlier).
[32,48): z4od = 34 is within 2 of 32, so z4 is not declared an anomaly.
[48,64): z5od = 53 is 19 from z4od = 34 (> 2⁴) but 11 from 64. However, the next interval, [64,80), is empty, so z5 is 27 from its right neighbor. z5 is declared an outlier, and we put a subcluster cut through z5.
[64,80): clearly a 2⁴ gap, but we have already declared the point to its left an outlier and made a subcluster cut.
[80,96): z6od = 80, zfod = 83. [96,112): zbod = 110, zdod = 109. Both {z6, zf} are declared outliers (gap ≥ 16 on both sides).
[112,128): z7od = 118, z8od = 114, z9od = 125, zaod = 114, zcod = 121, zeod = 125: no 2⁴ gaps. But we can consult SpS(d²(x,y)) for actual distances, which reveals that there are no 2⁴ gaps within this subcluster; incidentally, it also reveals a 5.8 gap between {7,8,9,a} and {b,c,d,e}, but that analysis is messy, and the gap would be revealed by the next xofM round on this subcluster anyway.

[The slide's coordinate grid, the bit-slice pTrees p6..p0 and their complements, and the pairwise-distance excerpts for z7..z13 are omitted here.]
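A sketch of the interval scan in numpy (a toy stand-in; the real minimum-near-the-edge checks are pTree processes): bucket the xod values into width-2⁴ intervals and report only the sparse buckets as outlier candidates.

import numpy as np

def sparse_intervals(F, width=16, max_count=2):
    lo = int(F.min()) // width * width
    edges = np.arange(lo, F.max() + width, width)
    idx = np.digitize(F, edges) - 1            # interval index per point
    report = []
    for k in range(len(edges) - 1):
        pts = np.where(idx == k)[0]
        if 0 < len(pts) <= max_count:          # sparse: candidate outliers
            report.append((int(edges[k]), int(edges[k + 1]), pts))
    return report

F = np.array([11, 27, 23, 34, 53, 80, 118, 114, 125, 110, 121, 109, 83])
for a, b, pts in sparse_intervals(F):
    print(f"[{a},{b}): candidate outlier rows {pts}")

Each candidate still needs the edge-distance check described above before it is declared a 2⁴ outlier.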

33 Product pTrees on X XY(where Y=X) p13 p12 p11 p10 p23 p22 p21 p20 IDX
id x1 x2 z z z z z z z z z za 13 4 zb 10 9 zc 11 10 zd 9 11 ze 11 11 zf 7 8 X X pTrees p13 1 p12 1 p11 1 p10 1 p23 1 p22 1 p21 1 p20 1 IDX z1 z2 : ze zf IDY z1 z2 z3 z4 z5 z6 z7 z8 z9 za zb zc zd ze zf : X1 1 3 : 11 7 X2 1 : 11 8 Y1 1 3 2 6 9 15 14 13 10 11 7 : 1 2 3 4 9 10 11 8 Y2 : P13 : 1 P12 : 1 P11 1 : P10 1 : P23 : 1 P22 : P21 : 1 P20 1 : P33 P32 1 : P31 1 : P30 1 : P43 1 : P42 1 : P41 1 : P40 1 : 1 : Constructing the pTrees for XY: Level-0s can be constructed by simple bit replication (for X1 and X2) and by simple pTree concatenation (for Y1, Y2) Level-1 stride=n: All X1 and X2 pTrees are pure determined by corresponding bit for X in the position given by the idX. All Y1 and Y2 are 0 except for any pure X pTrees (none in this example). All Y1 and Y2 level-0 strides are identical. L1P13=p13 L1P23=p23 No L0P1k or L0P2k All L1P1k and L1P2k are pure) k=3..0. L1P12=p12 L1P22=p22 L1P11=p11 L1P21=p21 L1P10=p10 L1P20=p20 Three alternatives: 1. L1P3k=0 L1P3k=0 (Assume all L0s mixed.). All L0 strides are: L0P33s=p13 L0P43s=p23 L0P32s=p12 L0P42s=p22 L0P31s=p11 L0P41s=p21 L0P30s=p10 L0P40s=p20 2. Just two bit maps: Bitmap the pure1 L0s Bitmap the pure0 L0s 3. No L1P3k No L1P4k (always go directly to L0s.)

34 Product pTrees on X (XY PTrees, Y=X)
We want to use SpS(dist²(x,y)) to aggregate the Pairwise Square Distance Matrix, PSDM(X), i.e., to get the Off-Diagonal Max, Min, Avg and Std of each row: PSDM(X) has rows x1..xn, entries d(xi,xj), and per-row aggregates ODMaxi, ODMini, ODAvgi, ODStdi.

Level-1: L1P13 = p13, ..., L1P20 = p20 (no L0P1k or L0P2k; all those L1 strides are pure). L1P3k = 0 and L1P4k = 0 (assume all those L1 strides are mixed); their level-0 strides use p13..p20 as on the previous slide.

Notes: Off-Diagonal Max = Max; ODAvg = Sum/(n-1); for ODStd also divide by n-1. To compute ODMin, mask off the diagonal 0s (create that diagonal mask one time).

Construct the square-distance SpS, SpS(XY, (x-y)o(x-y)), using Md's procedure on (x-y)o(x-y) = Σi=1..n (xi-yi)² = Σi=1,2 (xi² - 2xiyi + yi²) = x1*x1 + x2*x2 - 2*x1*y1 - 2*x2*y2 + y1*y1 + y2*y2.

Possible definitions (which should be easy to calculate): x is an outlier with respect to X iff ODMin(X,x) > T*AVG{ODMin(X,y) | y≠x}; X is dense iff STD{ODMin(X,x)} < T'.
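A toy numpy version of the outlier definition above; it materializes the full PSDM, which is exactly what the pTree formulation avoids, so treat it only as a specification of the test:

import numpy as np

def odmin_outliers(X, T=2.0):
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # PSDM(X), squared
    np.fill_diagonal(D2, np.inf)                          # mask the diagonal 0s
    odmin = np.sqrt(D2.min(axis=1))                       # ODMin per point
    avg_others = (odmin.sum() - odmin) / (len(X) - 1)     # AVG{ODMin(X,y) | y != x}
    return np.where(odmin > T * avg_others)[0]

X = np.array([[1., 1.], [2., 1.], [1., 2.], [2., 2.], [9., 9.]])
print(odmin_outliers(X))    # [4]: only the far point fails the ODMin test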

35 Product pTrees on X XY(where Y=X) IDX IDY X1 X2 Y1 Y2 P13 P12 P11 P10
z1 z2 : ze zf IDY z1 z2 z3 z4 z5 z6 z7 z8 z9 za zb zc zd ze zf : X1 1 3 : 11 7 X2 1 : 11 8 Y1 1 3 2 6 9 15 14 13 10 11 7 : 1 2 3 4 9 10 11 8 Y2 : P13 : 1 P12 : 1 P11 1 : P10 1 : P23 : 1 P22 : P21 : 1 P20 1 : P33 P32 1 : P31 1 : P30 1 : P43 1 : P42 1 : P41 1 : P40 1 : 4 2 8 17 68 196 170 200 153 145 181 164 85 5 40 144 122 148 109 113 136 65 : 162 128 117 116 90 80 53 1 25 61 41 29 89 52 10 20 13 SpS(x-y)o(x-y) 1 : 1 : ODMax(X,z1)=200, ODMin(X,z1)=2 ODMax(X,z2)=164, ODMin(X,z2)=2 ODMax(X,z14)=200, ODMin(X,z14)=1 ODMax(X,z15)=113, ODMin(X,z15)=10

36 Bit-slice expansion of x1*x1 (and y2*y2)
L1(x1*x1) = (2³p13 + 2²p12 + 2¹p11 + p10)·(2³p13 + 2²p12 + 2¹p11 + p10) = 2⁶(p13 + p13p12) + 2⁴(2p13p11 + p12 + p13p10 + p12p11) + 2³(p12p10) + 2²(p11 + p11p10) + p10, using pii·pii = pii. L0(x1*x1) is empty. L0(y2*y2) has the same expansion (in the p2k slices), and L1(y2*y2) is empty. L1ODMask is empty; L0ODMask = m, with mk = 1 off the diagonal.

ODMax(X,zk) = Max(L0(k)[(x-y)o(x-y)]); ODMin(X,zk) = Min(L0(k)[(x-y)o(x-y) * ODMask']), etc.

SpS(XY, (x-y)o(x-y)): (x-y)o(x-y) = Σi=1..n (xi-yi)² = x1*x1 + x2*x2 - 2*x1*y1 - 2*x2*y2 + y1*y1 + y2*y2.

37 FAUST Clustering Methods:
MCR (using midlines of the circumscribing coordinate rectangle). [The slide's cube diagram, with MinVect nv = (nv1,nv2,nv3), MaxVect Xv = (Xv1,Xv2,Xv3) and the midline pairs f1/g1, f2/g2, f3/g3 on its faces, is omitted here.]

For any FAUST clustering method, we proceed in one of two ways: gap analysis of the projections onto a unit vector, d, and/or gap analysis of the distances from a point, f (and another point, g, usually). Given d: f ≡ MinPt(xod) and g ≡ MaxPt(xod). Given f and g: d ≡ (f-g)/|f-g|. So we can do any subset: (d), (df), (dg), (dfg), (f), (fg), (fgd), ...

Define the sequence fk, gk, dk by fk ≡ ((nv1+Xv1)/2, ..., nvk, ..., (nvn+Xvn)/2), gk ≡ ((nv1+Xv1)/2, ..., Xvk, ..., (nvn+Xvn)/2), dk = ek, so SpS(xodk) = Xk. Then f, g, d and SpS(xod) require no processing (gap-finding is the only cost); MCR(fg) adds the cost of SpS((x-f)o(x-f)) and SpS((x-g)o(x-g)).

MCR(dfg) on Iris150: do the SpS(xod) linear gap analysis first (since it is processing-free); on what's left, look for outliers in SubClus1 and SubClus2 by sequencing through the {f,g} pairs with SpS((x-f)o(x-f)) and SpS((x-g)o(x-g)) round gaps. d3 makes the split (set23...set45 | ver49...vir19); d1 and d2 find none. On SubClus1, only f2 finds anything (vir23, vir18, vir32), and d4 (set44, vir39) leaves exactly the 50 Setosa. On SubClus2, all f, g and d4 rounds find none, leaving 50 Versicolor and 49 Virginica.

38 MCR(d) on Iris150+Outlier30, gap>4:
Do SpS(xodk) linear gap analysis, k = 1, 2, 3, 4. Declare subclusters of size 1 or 2 to be outliers. Create the full pairwise distance table for any subcluster of size ≤ 10, and declare a point an outlier if its column values (other than the zero diagonal value) all exceed the threshold (which is 4).

[The slide's gap tables for d1..d4, listing the added t... and b... tuples split off as size-1 or size-2 subclusters, are omitted here.] d3 gives the same split as before (expected). SubClus1, d4: splits off set44 and vir39, leaving exactly the 50 Setosa as SubCluster1. SubClus2, d4: splits off t4, t24, b4, b24, leaving the 49 Virginica (vir39 declared an outlier) and the 50 Versicolor as SubCluster2.

MCR(d) performs well on this dataset. Accuracy: we can't expect a clustering method to separate Versicolor from Virginica, because there is no gap between them. This method does separate off Setosa perfectly and finds all 30 added outliers (subclusters of size 1 or 2). It also finds the Virginica outlier vir39, the most prominent intra-class outlier (distance 29.6 from the other Virginica irises, whereas no other iris is more than 9.1 from its classmates).

Speed: dk = ek, so there is zero calculation cost for the d's, and SpS(xodk) = SpS(xoek) = SpS(Xk), so there is zero calculation cost for it. The only cost is loading the dataset PTreeSet(X) (we use one column, SpS(Xk), at a time), and that loading is required for any method. So MCR(d) is optimal with respect to speed!

39 CCR(fgd) (Corners of Circumscribing Coordinate Rectangle) f1=minVecX≡(minXx1..minXxn) (0000)
g1 = MaxVecX ≡ (MaxXx1..MaxXxn) (1111), d = (g-f)/|g-f|. Sequence through the main-diagonal pairs, {f, g}, lexicographically; for each, create d. CCR(f): do SpS((x-f)o(x-f)) round gap analysis. CCR(g): do SpS((x-g)o(x-g)) round gap analysis. CCR(d): do SpS(xod) linear gap analysis. Notes: no calculation is required to find f and g (assuming MaxVecX and minVecX were calculated and residualized when PTreeSetX was captured). If the dimension is high, the main-diagonal corners are likely far from X, and the large radii make the round gaps nearly linear. Run: start with f1 = MnVec, RnGp>4 none; g1 = MxVec, RnGp>4 (0 7 vir18, ..., 1 47 ver30 | 0 53 ver49, ..., 0 74 set14) splits off SubClus1 and SubClus2. SubClus2: f1=0000, g1=1111, f2=0001, g2=1110, f3=0010, g3=1101, f4=0011, g4=1100, f5=0100, g5=1011, f6=0101, g6=1010, f7=0110, g7=1001, f8=0111, g8=1000 all give RnGp>4 none and Lin>4 none; this ends SubClus2 = 47 setosa only. SubClus1: most pairs find none; f6=0101 RnGp>4 yields (1 19 set26, 0 28 ver49, 0 31 set42, 0 31 ver8, 0 32 set36, 0 32 ver44, 1 35 ver11, 0 41 ver13), giving Subc2.1 = {ver49, ver8, ver44, ver11}; f7=0110 RnGp>4 yields (1 28 ver13, 0 33 vir49). This ends SubClus1 = 95 versicolor and virginica samples only.
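The corner-pair sequencing can be sketched as follows (hypothetical helper; it assumes the nv and Xv corners are already known from capture); each yielded (f, g) pair drives one round of CCR(f), CCR(g) and CCR(d).

  import numpy as np
  from itertools import product

  def corner_pairs(nv, Xv):
      # Yield main-diagonal corner pairs (f, g) of the circumscribing rectangle,
      # lexicographically; bit 0 is fixed so each diagonal appears once.
      n = len(nv)
      for rest in product([0, 1], repeat=n - 1):
          bits = np.array((0,) + rest)
          yield np.where(bits == 0, nv, Xv), np.where(bits == 0, Xv, nv)

  nv, Xv = np.zeros(4), np.ones(4)
  for f, g in corner_pairs(nv, Xv):
      d = (g - f) / np.linalg.norm(g - f)
      print(f, g, d)   # then: round gaps on |x-f|, |x-g|; linear gaps on x o d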

40 [Data table: the 150 Iris samples (SL, SW, PL, PW), 50 setosa, 50 versicolor and 50 virginica, plus the 15 added t-outliers (t1 ... tall) and 15 added b-outliers (b1 ... ball); the numeric values are not recoverable here. Before adding the new tuples: MINS, MAXS and MEAN, the same after the additions.]

41 FM(fgd) (Furthest-from-the-Medoid)
FMO (FM using a Gram-Schmidt orthonormal basis). X ⊆ Rn. Calculate M = MeanVector(X) directly, using only the residualized 1-counts of the basic pTrees of X. (And, by the way, use residualized STD calculations to guide the choice of good gap-width thresholds, which define what an outlier is going to be and also determine when we divide into sub-clusters.) Take f1 ≡ MxPt(SpS[(M-x)o(M-x)]) and d1 ≡ (M-f1)/|M-f1|. If d1k ≠ 0, apply Gram-Schmidt to {d1, e1, ..., ek-1, ek+1, ..., en}: d2 ≡ (e2 - (e2od1)d1) / |e2 - (e2od1)d1|; d3 ≡ (e3 - (e3od1)d1 - (e3od2)d2) / |e3 - (e3od1)d1 - (e3od2)d2|; ...; dh ≡ (eh - (ehod1)d1 - (ehod2)d2 - ... - (ehodh-1)dh-1) / |eh - (ehod1)d1 - (ehod2)d2 - ... - (ehodh-1)dh-1|. Thm: MxPt[SpS((x-M)od)] = MxPt[SpS(xod)] (shifting by Mod leaves the MxPts the same). Re-pick f1 ≡ MnPt[SpS(xod1)]; pick g1 ≡ MxPt[SpS(xod1)]; pick fh ≡ MnPt[SpS(xodh)] and gh ≡ MxPt[SpS(xodh)]. The procedure: 1. Choose f0 (high outlier potential, e.g., furthest from the mean, M). 2. Do f0-round-gap analysis (plus subcluster analysis). 3. Let f1 be such that no x is further away from f0 in some direction (all d1 dot products ≥ 0). 4. Do f1-round-gap analysis (plus subcluster analysis). 5. Do d1-linear-gap analysis, d1 ≡ (f0-f1)/|f0-f1|. 6. Let f2 be such that no x is further away, in some direction, from the d1-line than f2. 7. Do f2-round-gap analysis. 8. Do d2-linear-gap analysis, d2 ≡ (f0-f2 - ((f0-f2)od1)d1), normalized; and so on. Run log: f=M, Gp>4 (1 53 b13, 0 58 t123, 0 59 b234, 0 59 tal, 0 60 b134, 1 61 b123, 0 67 ball); by the distance table, t123, tal, b134, b123, b234, b13 and ball are all outliers! f0=t123 RnGp>4 (1 0 t123, 0 25 t13, 1 28 t134, 0 34 set42, ..., 1 103 b23, 0 108 b13). f0=b124 RnGp>4 (1 0 b124, 0 28 b12, 0 30 b14, 1 32 b24, 0 41 vir10, ..., 1 75 t24, 1 81 t1, 1 86 t14, 1 93 t12, 0 98 t124): b12, b14, b24 all outliers again! f0=b23 RnGp>4 (1 0 b23, 0 30 b3, ..., 1 84 t34, 0 95 t23, 0 96 t234). f0=b34 RnGp>4 (1 0 b34, 0 26 vir1, ..., 1 66 vir39, 0 72 set24, 1 83 t3, 0 88 t34). SubClust-1: f0=b2 (1 0 b2, 0 28 ver36); f0=b3 (1 0 b3, 0 23 vir8, ..., 1 54 b1, 0 62 vir39); f0=t24 (1 0 t24, 1 12 t2, 0 20 ver13); f0=b1 (1 0 b1, 0 23 ver1); f0=ver19 and f1=ver49 find none (round and linear). SubClust-2: f0=t3 RnGp>4 none; f0=t3 LinGap>4 (1 0 t3, 0 12 t34); f0=t34 LinGap>4 (1 0 t34, 0 13 set36); f0=set16 and f1=set42 find none. SubClust-2 is the 50 setosa! Likely the f2, f3 and f4 analyses will find nothing.
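A small numpy sketch of the Gram-Schmidt step defined above (illustration only): extend the unit vector d1 to an orthonormal basis by orthogonalizing the standard basis vectors against everything kept so far, skipping any ek that becomes numerically zero (the dependent one).

  import numpy as np

  def gram_schmidt(d1, n):
      # Build {d1, d2, ..., dn} from {d1, e1, ..., en}.
      basis = [d1 / np.linalg.norm(d1)]
      for k in range(n):
          e = np.zeros(n); e[k] = 1.0
          v = e - sum((e @ d) * d for d in basis)   # subtract projections
          if np.linalg.norm(v) > 1e-10:             # skip the dependent e_k
              basis.append(v / np.linalg.norm(v))
          if len(basis) == n:
              break
      return np.array(basis)

  D = gram_schmidt(np.array([1.0, 1.0, 0.0]), 3)
  print(np.round(D @ D.T, 10))                      # identity, so orthonormal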

42 FMO(d)
f1=ball, g1=tall LnGp>4: ball, b123, b134, b234, b13 | ... | t13, t134, t123, tal. f2=vir11, g2=set16 Ln>4 none. f2=t2, g2=b2 LnGp>4: set16, b2. f2=t2, g2=t234 Ln>4: t23, t234, t12, t24, t124, t2 (near ver11). f2=vir11, g2=b23 Ln>4: b12, b34, b124, b23, t13, b13. f2=vir11, g2=b12 Ln>4 (1 45 set16, 0 61 b24, 0 61 b2, 0 61 b12): b24, b2, b12. f3=t34, g3=vir18 Ln>4 none. f4=t4, g4=b4 Ln>4: vir1, b4, b14. f4=t4, g4=vir1 Ln>4 none. f1=b13, g1=b2 LnGp>4 none. This ends the process. We found all (and only) the added anomalies, but missed t34, t14, t4, t1, t3, b1, b3. [Figure: the CRC method, from f1 = MinVector to g1 = MaxVector, with the MCR f/g and the FMG-GM f/g positions marked on a scatter of X.]

43 FMO(fg): start with f1 ≡ MxPt(SpS((M-x)o(M-x))); round gaps first, then linear gaps. f1=ball RnGp>4 separates ball, b123, ..., t4, t34, t12, t23, t124, t234, t13, t134, t123, tal, leaving SubClus1 and SubClus2. SubClus1: f1=b123 Rn>4 finds b123, b13, vir32, vir18, b23 (vir6 nearly); f1=b134 Rn>4 finds b134, vir19; f1=b234 Rn>4 finds b234, b34, vir10; f1=b124 Rn>4 finds b124, b12, b14, b24, b1, ..., t4, b3; f1=vir19 Rn>4 finds t4, b2; g1=b2 Rn>4 finds t4, ver36; f2=ver13 Rn>4 finds ver13, ver43; g2=vir10 Rn>4 (1 0 vir10, 0 6 vir44); f4=b1 Rn>4 (1 0 b1, 0 23 ver1); g4=b4 Rn>4 (1 0 b4, 0 21 vir15). SubClus1 has 91 samples, only versicolor and virginica. SubClus2: f1=t14 Rn>4 finds t1, t14, ver8, ..., set15, t3, t34; f1=set23 Rn>4 finds vir39, ver49, ver8, ver44, ver11, t24, t2; ver49, ver8, ver44 and ver11 are almost outliers (Subcluster2.2, which type? must classify). SbCl_2.1: g1=vir39 Rn>4 (1 0 vir39, 0 7 set21); f2=set42 Rn>4 finds set42, set9; all other f's, g's and LnG>4 find none. Note: what remains in SubClus2.1 is exactly the 50 setosa, but we wouldn't know that, so we continue to look for outliers and subclusters. Finally, we would classify within SubCluster1 using the means of another training set (with FAUST Classify). We would also classify SubCluster2.1 and SubCluster2.2, and we would find SubCluster2.1 to be all setosa and SubCluster2.2 to be all versicolor (as we did before). In SubCluster1 we would separate versicolor from virginica perfectly (as we did before). We could FAUST Classify each outlier (if so desired) to find out which class it is an outlier from. However, what about the rogue outliers I added? They are not represented in the training set, so what would happen to them? My thinking: they are real iris samples, so we should really do the outlier analysis and subsequent classification on the original 150. We already know (assuming the "other training set" has the same means as these 150 do) that we can separate setosa, versicolor and virginica perfectly using FAUST Classify. If this is typical (though concluding from one example is definitely "over-fitting"), then we have to conclude that Mark's round-gap analysis is more productive than linear dot-product projection gap analysis! FFG (Furthest-to-Furthest) computes SpS((M-x)o(M-x)) for f1 (expensive? grab any point? a corner point?), then computes SpS((x-f1)o(x-f1)) for f1-round-gap analysis, then computes SpS(xod1) to get g1, whose projection is furthest from that of f1 (for d1 linear gap analysis). Too expensive, since gk-round-gap analysis and linear analysis contributed very little, but we need it to get f2, etc. Are there other, cheaper ways to get a good f2? We would need SpS((x-g1)o(x-g1)) for g1-round-gap analysis (too expensive!).

44 Mark 10/15 (“thin” gap using tfxidf)
Mark 10/15 ("thin" gap using tf×idf). Classification on the left, Reuters text on the right. Seems right on! Mining and assays are grouped; the anomalies are gold strikes (vs. production) and livestock. The min gap needs to be measured from the MSB, not the LSB (i.e., how many bits to consider for gaps); reason: as you add attributes, the distances start getting large, so the gap threshold needs to be relative. I seem to get better results with oblique rather than round, but the jury is still out. JAPAN'S DOWA MINING TO PRODUCE GOLD FROM APRIL. TOKYO, 3/16 - Dowa Mining Co Ltd said it will start commercial production of gold, copper, lead and zinc from its Nurukawa Mine in northern Japan in April. A company spokesman said the mine's monthly output is expected to consist of 1,300 tonnes of gold ore and 3,700 of black ore, which consists of copper, lead and zinc ores. A company survey shows the gold ore contains up to 13.3 grams of gold per tonne, he said. Proven gold ore reserves amount to 50,000 tonnes, while estimated reserves of gold and black ores total one mln tonnes, he added. GERMAN BANK SEES HIGHER GOLD PRICE. HAMBURG, March 16 - Gold is expected to continue its rise this year due to renewed inflationary pressures, especially in the U.S., Hamburg-based Vereins- und Westbank AG said. It said in a statement the stabilisation of crude oil prices and the Organisation of Petroleum Exporting Countries' efforts to achieve further firming of the price led to growing inflationary pressures in the U.S., the world's biggest crude oil producer. Money supplies in the U.S., Japan and West Germany exceed the central banks' limits and the real growth of their gross national products, it said. Use of physical gold should rise this year due to increased industrial demand and higher expected coin production, the bank said. Speculative demand, which influences the gold price on futures markets, has also risen. These factors, and South Africa's unstable political situation, which may lead to a temporary reduction in gold supplies from that country, underline the firmer sentiment, it said. However, Australia's output is estimated to rise to 90 tonnes this year from 73.5 tonnes in 1986. SOME 7,000 MINERS GO ON STRIKE IN SOUTH AFRICA, 3/16 - Some 7,000 black miners went on strike at South African gold and coal mines, the National Union of Mineworkers (NUM) said. A NUM spokesman said 6,000 workers began an underground sit-in at the Grootvlei gold mine, owned by General Union Mining Corp, to protest the transfer of colleagues to different jobs. He said about 1,000 employees of Anglo American Corp's New Vaal Colliery also downed tools, but the reason for the stoppage was not immediately clear. Officials of the two companies were not available for comment, and the NUM said it was trying to start negotiations with management. LEVON RESOURCES <LVNVF> GOLD ASSAYS IMPROVED. VANCOUVER, British Columbia, March 16 - Levon Resources Ltd said re-checked gold assays from the Howard tunnel on its Congress, British Columbia property yielded higher gold grades than those reported in January and February. It said assays from zone one averaged ounces of gold a ton over a 40 foot section with an average width of 6.26 feet. Levon previously reported the zone averaged ounces of gold a ton over a 40 foot section with average width of 5.16 feet. Levon said re-checked assays from zone two averaged ounces of gold a ton over a 123 foot section with average width of 4.66 feet.
Levon Resources said the revised zone two assays compared to previously reported averages of ounces of gold a ton over a 103 foot section with average width of feet. The company also said it intersected another vein 90 feet west of zone two, which assayed ounces of gold a ton across a width of 3.87 feet. BP <BP> UNIT SEES MINE PROCEEDING. NEW YORK, March 16 - British Petroleum Co PLC said, based on a feasibility report from Ridgeway Mining Co, its joint venture Ridgeway Project in South Carolina could start commercial gold production by mid. The company said the mine would produce at an approximate rate of 158,000 ounces of gold per year over the first four full years of operation from 1989 through 1992, and at an average of 133,000 ounces a year over the full projected 11-year life of the mine. BP's partner in the venture is Galactic Resources of Toronto. The company said, subject to receipt of all statutory permits, finalization of financing arrangements and management and joint venture review, construction of a 15,000 short ton per day processing facility can start. Capital costs to bring the mine into production are estimated at 76 mln dlrs. BP UNIT SEES U.S. GOLD MINE PROCEEDING. NEW YORK, March 16 - British Petroleum Co PLC said, based on a feasibility report from Ridgeway Mining Co, its joint venture Ridgeway Project in South Carolina could start commercial gold production by mid. The company said the mine would produce approximately 158,000 ounces of gold per year over the first four full years of operation from 1989 through 1992, and an average 133,000 ounces a year over the full projected 11 year life of the mine. BP's partner is Galactic Resources Ltd of Toronto. BP said, subject to receipt of all statutory permits, finalization of financing arrangements and management and joint venture review, construction of a 15,000 short ton per day processing facility can start. Capital costs to bring the mine into production are estimated at 76 mln dlrs. LEVON RESOURCES REPORTS IMPROVED GOLD ASSAYS. VANCOUVER, British Columbia, March 16 - Levon Resources Ltd said re-checked gold assays from the Howard tunnel on its Congress, British Columbia property yielded higher gold grades than those reported in January and February. It said assays from zone one averaged ounces of gold a ton. Levon previously reported the zone averaged ounces of gold a ton. Levon said re-checked assays from zone two averaged ounces of gold a ton. Levon Resources said the revised zone two assays compared to previously reported averages of ounces of gold a ton. The company also said it intersected another vein 90 feet west of zone two, which assayed ounces of gold a ton. VICEROY RESOURCE CORP DETAILS GOLD ASSAYS. Vancouver, British Columbia, March 17 - Viceroy Resource Corp said recent drilling on the Lesley Ann deposit extended the high-grade mineralization over a width of 600 feet. Assays ranged from 0.35 ounces of gold per ton over a 150-foot interval at a depth of 350 to 500 feet, to 1.1 ounces of gold per ton over a 65-foot interval at a depth of 200 to 410 feet. STARREX LINKS SHARE PRICE TO ASSAY SPECULATION. TORONTO, March 16 - Starrex Mining Corp Ltd said a sharp rise in its share price is based on speculation of favorable results from its current underground diamond drilling program at its 35 pct owned Star Lake gold mine in northern Saskatchewan. Starrex Mining shares rose 40 cts to 4.75 dlrs in trading on the Toronto Stock Exchange.
The company said drilling results from the program, which started in late February, are encouraging, "but it is too soon for conclusions." Starrex did not disclose check assay results from the exploration program. U.S. MEAT GROUP TO FILE TRADE COMPLAINTS. WASHINGTON, March 13 - The American Meat Institute, AME, said it intended to ask the U.S. government to retaliate against a European Community meat inspection requirement. AME President C. Manly Molpus also said the industry would file a petition challenging Korea's ban of U.S. meat products. Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section 301 of the General Agreement on Tariffs and Trade against an EC directive that, effective April 30, will require U.S. meat processing plants to comply fully with EC standards. The meat industry will seek to have the U.S. government retaliate against EC and Korean exports if their complaints are upheld.

45 CRMSTD(dfg) Eliminate all columns with STD < threshold.
For speed of text mining (and of other high-dimension data mining), we might do additional dimension reduction (after stemming content words). A simple way is to use the STD of the column of numbers generated by a functional (e.g., Xk, SpS((x-M)o(x-M)), SpS((x-f)o(x-f)), SpS(xod), etc.). The STDs of the columns Xk can be precomputed up front, once and for all. STDs of projection and square-distance functionals must be computed after they are generated (this could be done upon capture too). Good functionals produce many large gaps; in Iris150 and Iris150+Out30, I find that the precomputed STD is a good indicator of that. A text mining scheme might be: 1. Capture the text as a PTreeSET (after stemming the content words) and store the mean, median and STD of every column (content word stem). 2. Throw out low-STD columns. (Alternatively, use a weighted sum of "importance" and STD; if the STD is low, there can't be many large gaps.) A possible attribute-selection algorithm: 1. Peel the outliers from X using CRM-lin, CRC-lin, possibly M-rnd, fM-rnd, fg-rnd (Xin = X - Xout). 2. Calculate the width of each Xin-circumscribing-rectangle edge, crewk. 3. Look for wide gaps top down (or, very simply, order by STD). Variants: divide crewk by count{xk | x in Xin} (but that doesn't account for duplicates); look for a preponderance of wide thin-gaps top down; look for high projection-interval count dispersion (STD). Notes: 1. Maybe an inlier sub-cluster needs to occur in more than one functional projection before being declared an inlier sub-cluster. 2. The STD of a functional projection appears to be a good indicator of the quality of its gap analysis. For FAUST Cluster-d (pick d, then f = MnPt(xod) and g = MxPt(xod)), a full grid of unit vectors (all directions, equally spaced) may be needed. Such a grid can be constructed from angles θ2, ..., θn, each equi-width partitioned on [0,180), with the spherical-coordinate formula d = e1·PRODk=n..2 cos(θk) + e2·sin(θ2)·PRODk=n..3 cos(θk) + e3·sin(θ3)·PRODk=n..4 cos(θk) + ... + en·sin(θn), where the θ's start at 0 and increment by Δ; so di1..in = SUMj=1..n [ ej·sin((ij-1)Δ)·PRODk=n..j+1 cos(ikΔ) ], with i0 ≡ 0 and Δ dividing 180 (e.g., 90, 45, ...). Results of CRMSTD(dfg) on Iris: d3 splits setosa (plus vir39, set25) from versicolor/virginica (ver49, vir19 flagged at the boundary). Just about all the high-STD columns find the subcluster split; in addition, they find the four outliers as well. (d1+d3+d4)/sqr(3) on clus1 flags set19, vir39; d5 with f5=vir19 finds vir23 in clus2; d5 with f5=vir18 finds vir18, vir32, vir6 in clus2; (d1+d3)/sqr(2) on clus2 flags ver49, ver8, ver44, ver11 (near ver10); the other combinations, and d5 with f5=vir23, vir32 or vir6, find none.
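A sketch of the "throw out low-STD columns" step (the threshold is an assumed free parameter), resting on the stated rationale that a low-STD column cannot contain many large gaps.

  import numpy as np

  def high_std_columns(X, threshold):
      # Keep only the columns whose standard deviation reaches the threshold.
      std = X.std(axis=0)
      keep = np.nonzero(std >= threshold)[0]
      return X[:, keep], keep

  X = np.array([[1, 5, 100], [1, 6, 10], [1, 5, 95], [1, 6, 12]], dtype=float)
  Xr, kept = high_std_columns(X, threshold=2.0)
  print(kept)       # -> [2]; columns 0 and 1 are nearly constant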

46 CRMSTD(dfg) using the IRIS rectangle on Satlog (1805 rows of R, G, IR1, IR2 with classes {1,2,3,4,5,7}). Here I made a mistake and left MinVec, MaxVec and M as they were for IRIS (so probably far from the Satlog dataset). The results were good anyway, which suggests trying random f and g? [Results grid: STDs of the functionals were d1 13.6, d2 23.7, d3 17.2, d4 20.3; the diagonal combinations (d1+d2)/sqr2 25.3, (d1+d3)/sqr2 16.8, (d1+d4)/sqr2 15.5, (d2+d3)/sqr2 23.6, (d2+d4)/sqr2 20.4, (d3+d4)/sqr2 25.7, the triples 21.9-25.3, (d1+d2+d3+d4)/sqr4 25.9; the round functionals SQRT((x-M)o(x-M)) 28, SQRT((x-f)o(x-f)) 25-27.8, SQRT((x-g)o(x-g)) 24.9-27.7; the per-functional val/cl/num listings are not recoverable here.] Skipping every functional with STD < 25 finds the same outliers: 2_85, 2_191, 3_361, 3_84, 3_100, 3_315, 5_24, 5_73, 5_75, 5_149, 5_168.

47 CRMSTD(dfg) Satlog corners on Satlog
1=red soil, 2=cotton, 3=grey soil, 4=damp grey soil, 5=soil with stubble, 6=mixture, 7=very damp grey soil. Are classes 2 and 5 isolated from the rest (and from each other)? Classes 2 and 5 produced the greatest number of outliers. Take f5=c2M and g5 to be the other class means (c1M, c3M, c4M, c5M, c7M). Lots of outliers were found, but the classes were not separated as subclusters (keeping in mind that they may butt up against each other, with no gap, so they would never appear as subclusters via gap-analysis methods). Suppose we have a high-quality training set for this dataset with reliably accurate class means; then we can find any class gaps that might exist by using those means as our f and g points. d5(f5=c2M, g5=c7M), g>3, STD=26, splits off SubCluster1, which consists of 191 class=2 samples. SubCluster3 contains every other subcluster; next, on SubCluster3, we use f5=c1M and g5=c7M (d5 g>2, STD=68), which splits off SubCluster4. [Results grid: the dk and diagonal combinations (STDs 13.6-25.4) and the f/g round functionals (STDs 11.6-27.1) mostly find none; distance checks: dis(2_200, 2_160) = 12.4, outlier; dis(2_60, 2_132) = 3.9; dis(2_132, 5_45) = 33.6, outliers; the val/cl/num listings are not recoverable here.]

48 Density: A set is T-dense iff it has no distance gaps greater than T
(Equivalently, every point has a neighbor in its T-neighborhood.) We can use L1, HOB or L-infinity distance, since dis2(x,y) ≤ dis1(x,y), dis2(x,y) ≤ 2*disHOB(x,y) and dis2(x,y) ≤ sqr(n)*disL∞(x,y). Definition: Y ⊆ X is T-dense iff there does not exist y in Y such that dis2(y, Y-{y}) > T. Theorem-1: if for every y in Y, dis2(y, Y-{y}) ≤ T, then Y is T-dense. Theorem-2 (using L1, not L2=Euclidean distance): dis2(x,y) ≤ dis1(x,y) (from here on we write disk for disLk); therefore, if for every y in Y, dis1(y, Y-{y}) ≤ T, then Y is T-dense (proof: dis2(y,Y-{y}) ≤ dis1(y,Y-{y}) ≤ T). Also dis2(x,y) ≤ 2*disHOB(x,y) (proof: let the bit pattern of dis2(x,y) be 001bk-1...b0; then disHOB(x,y) = 2^k, and the most bk-1...b0 can contribute is 2^k - 1, if it's all 1-bits, so dis2(x,y) ≤ 2^k + (2^k - 1) ≤ 2*2^k = 2*disHOB(x,y)). Theorem-3: if, for every y in Y, disHOB(y, Y-{y}) ≤ T/2, then Y is T-dense (proof: dis2(y,Y-{y}) ≤ 2*disHOB(y,Y-{y}) ≤ 2*T/2 = T). Theorem-4: if, for every y in Y, dis∞(y, Y-{y}) ≤ T/sqr(n), then Y is T-dense (proof: dis2(y,Y-{y}) ≤ sqr(n)*dis∞(y,Y-{y}) ≤ sqr(n)*T/sqr(n) = T). Pick T' based on T and the dimension, n (it can be done!). If MaxGap(yoek) = MaxGap(Yk) < T' for all k=1..n, then Y is T-dense (recall, yoek is just Yk as a column of values). Note: we use the log(n) pTreeGapFinder to avoid sorting. Unfortunately, it doesn't immediately find all gaps precisely at their full width (because it descends using power-of-2 widths), but if we find all pTreeGaps we can be assured that MaxPTreeGap(Y) ≤ MaxGap(Y), or we can keep track of "thin gaps" and thereby actually identify all gaps (see the slide on pTreeGapFinder). Theorem-5: if SUMk=1..n MaxGap(Yk) ≤ T, then Y is T-dense (proof: dis1(y,x) ≡ SUMk=1..n |yk-xk|; choosing x so that each |yk-xk| ≤ MaxGap(Yk) gives dis2(y,Y-{y}) ≤ dis1(y,Y-{y}) ≤ SUMk=1..n MaxGap(Yk) ≤ T).
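A sketch of the Theorem-5 test (plain sorting in place of the pTreeGapFinder): sum the per-column max gaps and compare against T. It is a sufficient condition only, so a False answer does not prove that Y fails to be T-dense.

  import numpy as np

  def is_T_dense_by_gaps(Y, T):
      # Sufficient test: per-column max gaps summing to <= T imply T-density.
      max_gaps = [np.diff(np.sort(Y[:, k])).max() for k in range(Y.shape[1])]
      return sum(max_gaps) <= T

  Y = np.array([[0, 0], [1, 1], [2, 1], [3, 2]], dtype=float)
  print(is_T_dense_by_gaps(Y, T=2.5))   # True: the max gaps are 1 + 1 <= 2.5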

49 Alternative definition of density: a set, Y, is kT-dense iff for every y in Y, |Disk(y,T)| ≥ k
(Equivalently, every point has at least k neighbors in its T-neighborhood.) [Figure: the pTreeGapFinder interval tree for IRIS[SL], descending by power-of-2 widths from [0,128) (count 150) through [0,64), [64,128), [0,32), [32,64), [64,96), [96,128), down to the width-4 intervals [40,44), [44,48), ..., [76,80); the interval counts below the root are not recoverable here.]

50 9/15/12 pTree Text Mining data Cube layout
lev2, pred=pure1 on tfP1 -stide 1 hdfP t=a t=again t=all lev-2 (len=VocabLen) 8 1 3 df count <--dfP3 <--dfP0 t=a t=again t=all . . . tfP0 1 tfP1 lev1tfPk eg pred tfP0: mod(sum(mdl-stride),2)=1 2 doc=1 d=2 d=3 term=a t=a t=a d= d= d=3 t=again t=again t=again tf d=1 d= d=3 t=all t=all t=all ... tePt=again t=a d=1 t=a d=2 t=a d=3 1 tePt=a t=again t=again t=again d= d= d=3 tePt=all t=all d=1 t=all d=2 t=all d=3 lev1 (len=DocCt*VocabLen) lev0 corpusP (len=MaxDocLen*DocCt*VocabLen) t=a d=1 t=a d=2 t=a d=3 t=again d=1 1 Math book mask Libry Congress masks (document categories move us up document semantic hierarchy  ptf: positional term frequency The frequency of each term in each position across all documents (Is this any good?). 2 d=1 Preface 1 d=1 commas d=1 References Reading position masks (pos categories) move us up position semantic hierarchy  (and allows puncutation etc., placement.) 1 te ... tf2 1 ... tf1 1 ... tf0 3 2 tf are April apple and an always. all again a Vocab Terms 1 3 2 df . . . . . . 1 JSE HHS LMM documnet Corpus pTreeSet data Cube layout: 1 2 3 4 5 6 7 Position

51 tf matrix: 60 content words from 44 Mother Goose documents (01TBM, 02TLP, 03DDD, 04LMM, 05HDS, 06SPP, 07OMH, 08JSC, 09HBD, 10JAJ, 11OMM, 12OWF, 13RRS, 14ASO, 15PCD, 16PPG, 17FEC, 18HTP, 21LAU, 22HLH, 23MTB, 25WOW, 26SBS, 27CBC, 28BBB, 29LFW, 30HDD, 32JGF, 33BFP, 35SSS, 36LTT, 37MBB, 38YLS, 39LCS, 41OKC, 42BBC, 43HHD, 44HLH, 45BBB, 46TTP, 47CCM, 48OTB, 49WLG, 50LJH; the documents are listed on the next slides). I started with 50 documents, but only documents with at least two content words were kept. [The word-by-document frequency values and the vertically printed word headers are not recoverable here.]

52 Three blind mice. See how they run
Three blind mice! See how they run! They all ran after the farmer's wife, who cut off their tails with a carving knife. Did you ever see such a thing in your life as three blind mice? This little pig went to market. This little pig stayed at home. This little pig had roast beef. This little pig had none. This little pig said Wee, wee. I can't find my way home. Diddle diddle dumpling, my son John. Went to bed with his breeches on, one stocking off, and one stocking on. Diddle diddle dumpling, my son John. Little Miss Muffet sat on a tuffet, eating of curds and whey. There came a big spider and sat down beside her and frightened Miss Muffet away. Humpty Dumpty sat on a wall. Humpty Dumpty had a great fall. All the Kings horses, and all the Kings men cannot put Humpty Dumpty together again. See a pin and pick it up. All the day you will have good luck. See a pin and let it lay. Bad luck you will have all the day. Old Mother Hubbard went to the cupboard to give her poor dog a bone. But when she got there the cupboard was bare and so the poor dog had none. She went to the baker to buy him some bread. When she came back the dog was dead. Jack Sprat could eat no fat. His wife could eat no lean. And so between them both they licked the platter clean. Hush baby. Daddy is near. Mamma is a lady and that is very clear. Jack and Jill went up the hill to fetch a pail of water. Jack fell down, and broke his crown and Jill came tumbling after. When up Jack got and off did trot as fast as he could caper, to old Dame Dob who patched his nob with vinegar and brown paper. One misty moisty morning when cloudy was the weather, I met an old man clothed all in leather. He began to praise and I began to grin. How do you do? And how do you do again? There came an old woman from France who taught grown-up children to dance. But they were so stiff she sent them home in a sniff. This sprightly old woman from France. A robin and a robins son once went to town to buy a bun. They could not decide on plum or plain. And so they went back home again. If all the seas were one sea, what a great sea that would be! And if all the trees were one tree, what a great tree that would be! And if all the axes were one axe, what a great axe that would be! And if all the men were one man, what a great man he would be! And if the great man took the great axe and cut down the great tree and let it fall into the great sea, what a splish splash that would be! Great A. little a. This is pancake day. Toss the ball high. Throw the ball low. Those that come after may sing heigh ho! Flour of England, fruit of Spain, met together in a shower of rain. Put in a bag tied round with a string. If you'll tell me this riddle, I will give you a ring. Here sits the Lord Mayor. Here sit his two men. Here sits the cock. Here sits the hen. Here sit the little chickens. Here they run in. Chin chopper, chin chopper, chin chopper, chin! I had two pigeons bright and gay. They flew from me the other day. What was the reason they did go? I can not tell, for I do not know. The Lion and the Unicorn were fighting for the crown. The Lion beat the Unicorn all around the town. Some gave them white bread and some gave them brown. Some gave them plum cake, and sent them out of town. I had a little husband no bigger than my thumb. I put him in a pint pot, and I bid him drum. I bought a little hanky to wipe his little nose and a pair of little garters to tie his little hose. How many miles to Babylon? Three score miles and ten. Can I get there by candle light? 
Yes, and back again. If your heels are nimble and light, you may get there by candle light. There was an old woman, and what do you think? She lived on nothing but victuals and drink. Victuals and drink were the chief of her diet, yet this old woman could never be quiet. Sleep baby sleep. Our cottage valley is deep. The little lamb is on the green with woolly fleece so soft and clean. Sleep baby sleep. Sleep baby sleep, down where the woodbines creep. Be always like the lamb so mild, a kind and sweet and gentle child. Sleep baby sleep. Cry baby cry. Put your finger in your eye and tell your mother it was not I. Baa baa black sheep, have you any wool? Yes sir yes sir, three bags full. One for my master and one for my dame, but none for the little boy who cries in the lane. When little Fred went to bed, he always said his prayers. He kissed his mamma and then his papa, and straight away went upstairs. Hey diddle diddle! The cat and the fiddle. The cow jumped over the moon. The little dog laughed to see such sport, and the dish ran away with the spoon. Jack, come and give me your fiddle, if ever you mean to thrive. No I will not give my fiddle to any man alive. If I should give my fiddle, they will think that I have gone mad. For many a joyous day, my fiddle and I have had. Buttons, a farthing a pair! Come, who will buy them of me? They are round and sound and pretty and fit for girls of the city. Come, who will buy them ? Buttons, a farthing a pair! Sing a song of sixpence, a pocket full of rye. Four and twenty blackbirds, baked in a pie. When the pie was opened, the birds began to sing. Was not that a dainty dish to set before the king? The king was in his counting house, counting out his money. The queen was in the parlor, eating bread and honey. The maid was in the garden, hanging out the clothes. When down came a blackbird and snapped off her nose. Little Tommy Tittlemouse lived in a little house. He caught fishes in other mens ditches. Here we go round the mulberry bush, the mulberry bush, the mulberry bush. Here we go round the mulberry bush, on a cold and frosty morning. This is the way we wash our hands, wash our hands, wash our hands. This is the way we wash our hands, on a cold and frosty morning. This is the way we wash our clothes, wash our clothes, wash our clothes. This is the way we wash our clothes, on a cold and frosty morning. This is the way we go to school, go to school, go to school. This is the way we go to school, on a cold and frosty morning. This is the way we come out of school, come out of school, come out of school. This is the way we come out of school, on a cold and frosty morning. If I had as much money as I could tell, I never would cry young lambs to sell. Young lambs to sell, young lambs to sell. I never would cry young lambs to sell. A little cock sparrow sat on a green tree. And he chirped and chirped, so merry was he. A naughty boy with his bow and arrow, determined to shoot this little cock sparrow. This little cock sparrow shall make me a stew, and his giblets shall make me a little pie, too. Oh no, says the sparrow, I will not make a stew. So he flapped his wings and away he flew. Old King Cole was a merry old soul. And a merry old soul was he. He called for his pipe and he called for his bowl and he called for his fiddlers three. And every fiddler, he had a fine fiddle and a very fine fiddle had he. There is none so rare as can compare with King Cole and his fiddlers three. Bat bat, come under my hat and I will give you a slice of bacon. 
And when I bake I will give you a cake, if I am not mistaken. Hark hark, the dogs do bark! Beggars are coming to town. Some in jags and some in rags and some in velvet gowns. The hart he loves the high wood. The hare she loves the hill. The Knight he loves his bright sword. The Lady loves her will. Bye baby bunting. Father has gone hunting. Mother has gone milking. Sister has gone silking. And brother has gone to buy a skin to wrap the baby bunting in. Tom Tom the piper's son, stole a pig and away he run. The pig was eat and Tom was beat and Tom ran crying down the street. Cocks crow in the morn to tell us to rise and he who lies late will never be wise. For early to bed and early to rise, is the way to be healthy and wealthy and wise. One two, buckle my shoe. Three four, knock at the door. Five six, pick up sticks. Seven eight, lay them straight. Nine ten, a good fat hen. Eleven twelve, dig and delve. Thirteen fourteen, maids a courting. Fifteen sixteen, maids in the kitchen. Seventeen eighteen, maids a waiting. Nineteen twenty, my plate is empty. There was a little girl who had a little curl right in the middle of her forehead. When she was good she was very very good and when she was bad she was horrid. Little Jack Horner sat in the corner, eating of Christmas pie. He put in his thumb and pulled out a plum and said What a good boy am I! 01TBM 02TLP 03DDD 04LMM 05HDS 06SPP 07OMH 08JSC 09HBD 10JAJ 11OMM 12OWF 13RRS 14ASO 15PCD 16PPG 17FEC 18HTP 21LAU 22HLH 23MTB 25WOW 26SBS 27CBC 28BBB 29LFW 30HDD 32JGF 33BFP 35SSS 36LTT 37MBB 38YLS 39LCS 41OKC 42BBC 43HHD 44HLH 45BBB 46TTP 47CCM 48OTB 49WLG 50LJH

53 te (term existence) matrix with df counts, for the 60 content words across the 44 Mother Goose documents. [The bit values and the vertically printed word headers are not recoverable here.]

54 mtf=10: the te*tf/df matrix for the 60 content words across the 44 documents. [The values are not recoverable here.]

55 mtf0: the mtf0 matrix for the 60 content words across the 44 documents. [The values are not recoverable here.]

56 mtf1: the mtf1 matrix for the 60 content words across the 44 documents. [The values are not recoverable here.]

57 mtf2: the mtf2 matrix for the 60 content words across the 44 documents. [The values are not recoverable here.]

58 mtf3: the mtf3 matrix for the 60 content words across the 44 documents. [The values are not recoverable here.]

59 mtf4: the mtf4 matrix for the 60 content words across the 44 documents. [The values are not recoverable here.]

60 11 docs of the 15 (11 survivors of the content word reduction).
In this slide section, the vocabulary is reduced to content words (8 of them). /25/12: mdl=5, vocab = {baby, cry, dad, eat, man, mother, pig, shower}, VocabLen=8, and there are 11 docs of the 15 (the 11 survivors of the content-word reduction): doc=04, 05, 08, 09, 27, 29, 46, 53, 54, 71, 73. First Content Word Mask, FCWM. Level-0: POSITION. Level-1 (roll up position of level-0, i.e., the rolled vocab of level-0): te and tf (with slices tf1, tf0) per document. Level-2 (roll up document of level-1): df (with slices df1, df0). [The per-document bit tables are not recoverable here.]

61 Level-0 (ordered by position, document, then vocab)
Level-0 (ordered by position, then document, then vocab). [Table: term, doc, tf, tf1, tf0, te for each of the 8 content words (baby, cry, dad, eat, man, mother, pig, shower) in each of the 11 documents (04LMM, 05HDS, 08JSC, 09HBD, 27CBC, 29LFW, 46TTP, 53NAP, 54BOF, 71MWA, 73SSW), plus df with slices df1 and df0; the numeric values are not recoverable here.] There are 5 reading positions per document (mdl=5): 04LMM (Little Miss Muffet) occupies positions 1-5, 05HDS positions 6-10, 08JSC 11-15, 09HBD 16-20, 27CBC 21-25, 29LFW 26-30, 46TTP 31-35, 53NAP 36-40, 54BOF 41-45, 71MWA 46-50, 73SSW 51-55. Level-1 rolls up position; Level-2 rolls up document.
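The three levels can be sketched on a toy corpus as follows (plain Python illustration, not the vertical pTree layout): tf is the level-1 term-frequency table, te its existence rollup, and df the level-2 document-frequency count per term.

  import numpy as np

  docs = {"d1": "baby cry baby", "d2": "pig eat", "d3": "baby pig pig"}
  vocab = sorted({w for text in docs.values() for w in text.split()})

  tf = np.array([[text.split().count(t) for t in vocab]
                 for text in docs.values()])    # level-1: docs x vocab counts
  te = (tf > 0).astype(int)                     # existence rollup of tf
  df = te.sum(axis=0)                           # level-2: one count per term

  print(vocab)   # ['baby', 'cry', 'eat', 'pig']
  print(tf)      # [[2 1 0 0] [0 0 1 1] [1 0 0 2]]
  print(df)      # [2 1 1 2]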

62 Round 2 is straightforward. So: 1. Given gaps, find the ct=k intervals.
No gaps (ct=0 intervals) on the furthest-to-Mean line, but 3 ct=1 intervals. Declare p (= p12, p16, p18) an anomaly if pofM is far enough from the boundary points of its interval? [Scatter plot: the points (x, y), with the Mean M and the VOM at (34, 35).] Round 2 is straightforward. So: 1. Given gaps, find the ct=k intervals. 2. Find good gaps (dot product with a constant vector for linear gaps; for rounded gaps, use xox?). Note: in this example, the VOM works better than the mean.
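Since the slide notes that the VOM works better than the mean here, a quick comparison (illustrative data): the coordinate-wise median stays with the bulk of the points while the mean is dragged toward an outlier.

  import numpy as np

  X = np.array([[1, 1], [2, 2], [3, 1], [100, 90]], dtype=float)
  print(X.mean(axis=0))        # [26.5 23.5], dragged toward the outlier
  print(np.median(X, axis=0))  # [ 2.5  1.5], the VOM stays with the bulk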

63 Length-based gapping is placement-dependent.
Using vector lengths (distance from the origin) works on the left configuration; however, if the data happens to be shifted, as it is on the right, using lengths no longer works in this example. That is, the dot product with a fixed vector, like fM, is independent of the placement of the points with respect to the origin, while length-based gapping is dependent on it. Also, a squared pattern does not lend itself to rounded gap boundaries. [Figure: an 8-row grid of points over x-axis labels a-f, with the distance from the origin shown in red and the distance from (7,0) in blue.]

64 Level-1 TermFreqPTrees
Level-1 TermFreqPTrees (e.g., the predicate of tfP0 is mod(sum(mdl-stride), 2) = 1), 8/04/12. The df counts have bit slices dfP3...dfP0. The level-1 TermExistencePTree (pred = NOTpure0) has length VocabLen * DocCount. The level-0 pTree has length mdl * VocabLen * DocCount (mdl = max doc length), laid out as mdl reading positions for each (doc, term) pair: doc=1 term=a, doc=1 term=again, doc=1 term=all, ..., over the docs JSE, HHS, LMM, ... and the vocabulary a, again, all, always, an, and, apple, April, are, .... Note: dfk isn't a level-2 pTree, since it's not a predicate on level-1 te strides. The next slide shows how to do it differently, so that even the dfk's come out as level-2 pTrees. [The example bit tables are not recoverable here.]

65 pTree Text Mining data Cube layout
Level-2 pTree hdfP (high document frequency): pred = NOTpure0 applied to tfP1. These level-2 pTrees, dfPk, have length = VocabLength. The level-1 pTrees, tfPk (e.g., the pred of tfP0 is mod(sum(mdl-stride), 2) = 1), and the overall level-1 pTree teP (one stride per (term, doc) pair: tePt=a, tePt=again, tePt=all, ..., over doc1, doc2, doc3) have length = DocCount * VocabLength. The overall level-0 pTree, corpusP, has length = MaxDocLen * DocCount * VocabLen. [The example bit tables over vocabulary (a, again, all, always, an, and, apple, April, are, ...) and documents (JSE, HHS, LMM, ...) are not recoverable here.]

66 Level-2 pTree, hdfP (high doc frequency): pred = NOTpure0 applied to tfP1
Same layout as the previous slide: the level-2 dfPk slices (length = VocabLength), the level-1 tfPk and teP (length = DocCount * VocabLength), and the level-0 corpusP (length = MaxDocLen * DocCount * VocabLen), organized as Pt=a,d=1, Pt=a,d=2, Pt=a,d=3, Pt=again,d=1, .... Any of the masks (Preface pTree, LastChpt pTree, Refrncs pTree) can be ANDed into the Pt=, d= pTrees before they are concatenated as above (or repetitions of the mask can be ANDed in after they are concatenated). [The example bit tables are not recoverable here.]

67 I have put together a pBase of 75 Mother Goose Rhymes or Stories
I have put together a pBase of 75 Mother Goose rhymes or stories, then created a pBase of the 15 documents with ≥ 30 words (the Universal Document Length, UDL), using as vocabulary all white-space-separated strings. For each document (e.g., 04LMM "Little Miss Muffet sat on a tuffet, eating of curds and whey. There came a big spider and sat down...", 05HDS "Humpty Dumpty sat on a wall. Humpty Dumpty..."), level-0 is the position sequence (pos 1, 2, ..., 182) and level-1 gives te, tf, tf1, tf0 over the vocabulary (a, again, all, always, an, and, apple, April, are, around, ashes, away, baby, bark!, beans, beat, bed, Beggars, begins, beside, between, ..., your). The level-2 pTrees give the document frequency df, with slices df3, df2, df1, df0, and the te vectors te04, te05, te08, te09, te27, te29, te34, .... [The bit values are not recoverable here.]

68 Latent semantic indexing (LSI) is indexing and retrieval that uses singular value decomposition to find patterns in terms and concepts in text. LSI is based on the principle that words used in the same contexts tend to have similar meanings. A key LSI feature is its ability to extract the conceptual content of a body of text by establishing associations between terms that occur in similar contexts.[1] LSI overcomes synonymy and polysemy, which cause mismatches in information retrieval [3] and cause Boolean keyword queries to fail. LSI performs automatic document categorization (assignment of docs to predefined categories based on similarity to the conceptual content of the categories).[5] LSI uses example docs to establish the conceptual basis for each category: the concepts in a doc are compared to the concepts contained in the example items, and a category (or categories) is assigned to the doc based on the similarities between the concepts it contains and the concepts contained in the example docs. Mathematics of LSI (linear algebra techniques to learn the conceptual correlations in a collection of text): construct a weighted term-document matrix, do a singular value decomposition on it, and use that to identify the concepts contained in the text. Term-document matrix, A: each of the m terms is represented by a row, and each of the n docs by a column, with each matrix cell, aij, initially representing the number of times the associated term appears in the indicated document, tfij. This matrix is usually large and very sparse. Once the term-document matrix is constructed, local and global weighting functions can be applied to it to condition the data; local weighting functions [13] are Binary (if the term exists in the doc) and TermFrequency; global weighting functions are Binary, Normal, GfIdf, Idf and Entropy. SVD basically reduces the dimensionality of the matrix to a tractable size by finding the singular values. It involves matrix operations and may not be amenable to pTree operations (i.e., horizontal methods are highly developed and may be best). We should study it, though, to see if we can identify a pTree-based breakthrough for creating the reduction that SVD achieves. Is a new SVD run required for every new query, or is it a one-time thing? If it is one-time, there is probably little advantage in searching for pTree speedups. If and when it is not a one-time application to the original data, pTree speedups may hold promise. Even if it is one-time, we might take the point of view that we do the SVD reduction (using standard horizontal methods) and then convert the result to vertical pTrees for the data mining (which would be done over and over again). That pTree-ization of the end result of the SVD reduction could be organized as in the previous slides. Here is a good paper on the subject of LSI and SVD: Thoughts for the future: I am now convinced we can do LSI using pTree processing. The heart of LSI is SVD. The heart of SVD is Gaussian elimination (which is adding a constant times a matrix row to another row, and that we can do with pTrees).
We will talk more about this next Saturday and during the week.
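For reference, a minimal horizontal (numpy) LSI sketch of the scheme described above; the term-document matrix and the rank k are toy choices, not data from the slides.

  import numpy as np

  A = np.array([[1, 0, 0, 1, 0],     # toy term-document counts
                [1, 1, 0, 0, 0],
                [0, 1, 1, 0, 0],
                [0, 0, 1, 1, 1]], dtype=float)

  T0, s0, D0t = np.linalg.svd(A, full_matrices=False)
  k = 2
  T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T   # X^ = T S D^T

  doc = D @ S                         # one coordinate row per document
  i, j = 0, 3
  cos = doc[i] @ doc[j] / (np.linalg.norm(doc[i]) * np.linalg.norm(doc[j]))
  print(round(cos, 3))                # doc-doc similarity in the k-space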

69 SVD: Let X be the t by d TermFrequency (tf) matrix
SVD: Let X be the t-by-d TermFrequency (tf) matrix. It can be decomposed as X = T0 S0 D0^T, where T0 and D0 have orthonormal columns and S0 has only the singular values on its diagonal, in descending order. Remove from T0, S0, D0 the rows/columns of all but the highest k singular values, giving T, S, D. Then X ≈ X^ ≡ T S D^T (X^ is the rank-k matrix closest to X). We have reduced the dimension from rank(X) to k, and we note that X^(X^)^T = T S^2 T^T and (X^)^T X^ = D S^2 D^T. There are three sorts of comparisons of interest: 1. terms (how similar are terms i and j? compare rows); 2. documents (how similar are documents i and j? compare columns); 3. terms and documents (how associated are term i and doc j? examine individual cells). Comparing terms: the dot product between two rows of X^ reflects their similarity (a similar occurrence pattern across the documents). X^(X^)^T is the square t x t symmetric matrix containing all these dot products, and since X^(X^)^T = T S^2 T^T, the ij cell of X^(X^)^T is the dot product of rows i and j of TS (the rows of TS can be considered coordinates of the terms). Comparing documents: the dot product of two columns of X^ reflects their similarity (the extent to which the two documents have a similar profile of terms). (X^)^T X^ is the square d x d symmetric matrix containing all these dot products, and since (X^)^T X^ = D S^2 D^T, the ij cell is the dot product of rows i and j of DS (coordinates of the documents). Comparing a term and a document: analyze cell i,j of X^. Since X^ = T S D^T, cell ij is the dot product of the ith row of T S^(1/2) and the jth row of D S^(1/2).
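The two identities can be checked numerically (illustration): after the rank-k truncation, the term-term and doc-doc dot-product matrices equal (TS)(TS)^T and (DS)(DS)^T respectively.

  import numpy as np

  A = np.random.default_rng(0).random((6, 4))
  U, s, Vt = np.linalg.svd(A, full_matrices=False)
  k = 2
  T, S, D = U[:, :k], np.diag(s[:k]), Vt[:k, :].T
  Xhat = T @ S @ D.T

  print(np.allclose(Xhat @ Xhat.T, (T @ S) @ (T @ S).T))   # X^(X^)^T = T S^2 T^T
  print(np.allclose(Xhat.T @ Xhat, (D @ S) @ (D @ S).T))   # (X^)^T X^ = D S^2 D^T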

70 c1 Human machine interface for Lab ABC computer apps
[Term-document matrix: terms {human, interface, computer, user, system, response, time, EPS, survey, trees, graph, minors} by documents c1-c5, m1-m4; the cell counts are not recoverable here.] The documents: c1 Human machine interface for Lab ABC computer apps; c2 A survey of user opinion of comp system response time; c3 The EPS user interface management system; c4 System and human system engineering testing of EPS; c5 Relation of user-perceived response time to error measmnt; m1 The generation of random, binary, unordered trees; m2 The intersection graph of paths in trees; m3 Graph minors IV: Widths of trees and well-quasi-ordering; m4 Graph minors: A survey.

71 [The same term-by-document matrix as the previous slide; the cell values are not recoverable here.] X = T0 S0 D0^T, with T0 and D0 column-orthonormal. Approximate X by keeping only the first 2 singular values and the corresponding columns of T and D; those columns are the coordinates used to position the terms and docs in the 2D representation above. In this reduced model, X ≈ X^ = T S D^T.

72 [Figure: the 2D positioning of the terms and documents from the rank-2 model; not recoverable here.]

73 [Table: the query q and the documents c1-c5, m1-m4 over the terms human, interface, computer, user, system, response, time, EPS, survey, trees, graph, minors, with class means mc and mm, midpoint d = (mc+mm)/2, and cutpoint a = ((mc+mm)/2) o d; the numeric values are not recoverable here.] q o d is far less than a, so q is way into the c class. The full-space distances d(doc-i, q), i.e., for (c1-q), (c2-q), ..., (m4-q): what this tells us is that c1 is closest to q in the full space and that the other c documents are no closer than the m documents. Therefore q would probably be classified as c (one voter in the 1.5 neighborhood), but not clearly. This shows the need for SVD or Oblique FAUST!
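The query comparison above relies on placing q in the reduced space; a standard LSI fold-in (assumed here, the slide does not spell it out) takes q_hat = S^-1 T^T q and compares it to the document coordinates DS by cosine.

  import numpy as np

  A = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]], dtype=float)
  U, s, Vt = np.linalg.svd(A, full_matrices=False)
  k = 2
  T, S, D = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

  q = np.array([1, 1, 0, 0], dtype=float)    # the query as term counts
  q_hat = np.linalg.inv(S) @ T.T @ q         # fold the query into the k-space
  docs = D @ S
  sims = docs @ q_hat / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat))
  print(np.round(sims, 3))                   # the nearest doc suggests the class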

74 Provenance (from Wikipedia, the free encyclopedia), /7/12. Archaeology: evidence of provenance can be of importance in archaeology. Fakes are not unknown, and finds are sometimes removed from the context in which they were found without documentation, reducing their value to the world of learning. Even when apparently discovered in situ, archaeological finds are treated with caution; the provenance of a find may not be properly represented by the context in which it was found. Artifacts can be moved far from their place of origin by mechanisms that include looting, collecting, theft or trade, and further research is often required to establish the true provenance of a find. Paleontology: in paleontology it is recognised that fossils can also move from their primary context and are sometimes found, apparently in situ, in deposits to which they do not belong, moved by, for example, the erosion of nearby but different outcrops. Most museums make strenuous efforts to record how the works in their collections were acquired, and these records are often of use in helping to establish provenance. Seed provenance: seed provenance refers to the specified area in which the plants that produced seed are located or were derived. Data provenance: scientific research is held to be of good provenance when it is documented in detail sufficient to allow reproducibility.[23] Scientific workflows assist scientists and programmers with tracking their data through all transformations, analyses, and interpretations. Data sets are reliable when the process used to create them is reproducible and analyzable for defects.[24] Current initiatives to effectively manage, share, and reuse ecological data are indicative of the increasing importance of data provenance; examples of these initiatives are the National Science Foundation Datanet projects, DataONE and Data Conservancy. Computers and law: the term provenance is used when ascertaining the source of goods such as computer hardware, to assess if they are genuine or counterfeit. Chain of custody is an equivalent term used in law, especially for evidence in criminal or commercial cases. Data provenance covers the provenance of computerized data. There are two main aspects of data provenance: ownership of the data and data usage. Ownership tells the user who is responsible for the source of the data, ideally including information on the originator of the data. Data usage gives details regarding how the data has been used and modified, and often includes info on how to cite the data source or sources. Data provenance is of particular concern with electronic data, as data sets are often modified and copied without proper citation or acknowledgement of the originating data set. Databases make it easy to select specific information from data sets and merge this data with other data sources without any documentation of how the data was obtained or how it was modified from the original data set or sets. Secure provenance refers to providing integrity and confidentiality guarantees to provenance information.
In other words, secure provenance means to ensure that history cannot be rewritten, and users can specify who else can look into their actions on the object.[25]

75 APPENDIX: HADOOP MapReduce
Bad news: lots of programming work. Communication and coordination; recovery from machine failure; status reporting; debugging; optimization; locality. Bad news II: repeat for every problem you want to solve. How can we make it easy to write distributed programs? Data flow in MapReduce: Read a lot of data. Map: extract something you care about from each record. Partition the output: which keys go to which reducer. Shuffle and sort: each reducer expects its keys sorted and, for each key, the list of all its values. Reduce: aggregate, summarize, filter, or transform. Write the results. Map selects; Reduce does grouping and summing. Example: Word histogram. Map(string input_key, string input_value); /*input_key=doc_name, input_value=doc_contents*/ For each word w in input_value: EmitIntermediate(w, "1"); Reduce(string key, Iterator intermediate_values); /*key: a word, same for i/o; intermediate_values: a list of counts*/ int result = 0; for each v in intermediate_values: result += ParseInt(v); Emit(AsString(result)); (A runnable sketch of this example follows below.) HADOOP MapReduce Example: Inverted Web Graph. For each page, generate a list of incoming links. Input: Web documents. Map: for each link L in doc D emit <href(L), D>. Reduce: combine all docs into a list. MapReduce can do Select From Where, but can't join. Example: Joining with other data, e.g., for each major city in our GEO database, create a list of pages that refer to it and where. Need to go over all web docs. Per-host info might be in a per-process data structure, or involve an RPC to a list of machines containing the data for all. Map: go over the document; use a heuristic to decide if the document talks about a place/city; for each city name referred to in the doc, write doc_id and the offset in it. Reduce: concat to a list of top-rated refs for each city.
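To make the word-histogram data flow concrete, here is a minimal self-contained sketch in Python; map_fn, reduce_fn and the in-memory shuffle are illustrative stand-ins for what Hadoop distributes across machines, not part of any Hadoop API.

from collections import defaultdict

def map_fn(doc_name, doc_contents):
    # Map: emit (word, 1) for every word in the document.
    for w in doc_contents.split():
        yield (w, 1)

def reduce_fn(word, counts):
    # Reduce: sum the list of counts for one word.
    yield (word, sum(counts))

def run_mapreduce(docs):
    # Shuffle/sort: group intermediate values by key
    # (the framework does this between map and reduce in Hadoop).
    groups = defaultdict(list)
    for name, contents in docs.items():
        for k, v in map_fn(name, contents):
            groups[k].append(v)
    result = {}
    for word in sorted(groups):          # each reducer sees its keys sorted
        for k, v in reduce_fn(word, groups[word]):
            result[k] = v
    return result

print(run_mapreduce({"d1": "pTree pTree vertical", "d2": "vertical bit columns"}))
# {'bit': 1, 'columns': 1, 'pTree': 2, 'vertical': 2}

The shuffle step is where the partitioning question above ("which keys go to which reducer") would apply; here everything lands in one in-memory reducer.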

76 Singular Value Decomposition (http://mathworld.wolfram...)
If a matrix A has a matrix of eigenvectors that is not invertible (for example, the matrix [[1,1],[0,1]] has the non-invertible system of eigenvectors [[1,0],[0,0]]), then A does not have an eigen decomposition. However, if A is an m×n real matrix with m ≥ n, then A can be written using a so-called singular value decomposition of the form A = U D V^T (1). Note that there are several conflicting notational conventions in use in the literature: Press et al. (1992) define U to be an m×n matrix, D as n×n, and V as n×n, whereas Mathematica defines U as m×m, D as m×n, and V as n×n. In both systems, U and V have orthogonal columns, so that U^T U = I (2) and V^T V = I (3) (where the two identity matrices may have different dimensions), and D has entries only along the diagonal. For a complex matrix A, the singular value decomposition is a decomposition into the form A = U D V^H (4), where U and V are unitary matrices, V^H is the conjugate transpose of V, and D is a diagonal matrix whose elements are the singular values of the original matrix. If A is a complex matrix, then there always exists such a decomposition with positive singular values (Golub and Van Loan 1996, pp. 70 and 73). Singular value decomposition is implemented in Mathematica as SingularValueDecomposition[m], which returns a list {U, D, V}, where U and V are matrices and D is a diagonal matrix made up of the singular values of m. SEE ALSO: Cholesky Decomposition, Eigen Decomposition, Eigen Decomposition Theorem, Eigenvalue, Eigenvector, LU Decomposition, Matrix Decomposition, Matrix Decomposition Theorem, QR Decomposition, Singular Value, Unitary Matrix. REFERENCES: Gentle, J. E. "Singular Value Factorization." §3.2.7 in Numerical Linear Algebra for Applications in Statistics. Springer-Verlag, 1998. Golub, G. H. and Van Loan, C. F. "The Singular Value Decomposition" and "Unitary Matrices." §2.5.3 and 2.5.6 in Matrix Computations, 3rd ed. Baltimore, MD: Johns Hopkins University Press, pp. 70-71 and 73, 1996. Nash, J. C. "The Singular-Value Decomposition and Its Use to Solve Least-Squares Problems." Ch. 3 in Compact Numerical Methods for Computers: Linear Algebra and Function Minimisation, 2nd ed. Bristol, England: Adam Hilger, pp. 30-48, 1990. Press, W. H.; Flannery, B. P.; Teukolsky, S. A.; and Vetterling, W. T. "Singular Value Decomposition." §2.6 in Numerical Recipes in FORTRAN: The Art of Scientific Computing, 2nd ed. Cambridge, England: Cambridge University Press, pp. 51-63, 1992. CITE THIS AS: Weisstein, Eric W. "Singular Value Decomposition." From MathWorld, a Wolfram Web Resource.
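As a quick numerical check of equations (1)-(3), here is a small sketch using numpy (our choice of tool here; the article itself only cites the Mathematica implementation):

import numpy as np

# Decompose a small real matrix A as A = U D V^T (equation (1)).
A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # s = singular values
D = np.diag(s)

assert np.allclose(A, U @ D @ Vt)           # A = U D V^T
assert np.allclose(U.T @ U, np.eye(2))      # U^T U = I, equation (2)
assert np.allclose(Vt @ Vt.T, np.eye(2))    # V^T V = I, equation (3)
print("singular values:", s)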

77 Applying the algorithm to C4:
FAUST = Fast, Accurate Unsupervised and Supervised Teaching (teaching big data to reveal information) 6/9/12. FAUST CLUSTER-fmg (furthest-to-mean gaps, for finding round clusters): C = X (e.g., X ≡ {p1, ..., pf} = the 15-pixel dataset). While an incomplete cluster, C, remains: find M ≡ Medoid(C) (mean or vector_of_medians or ?). Pick f∈C furthest from M, from S ≡ SPTreeSet(D(x,M)) (e.g., HOBbit furthest f: take any point from the highest-order S-slice). If ct(C)/dis²(f,M) > DT (DensityThreshold), C is complete; else split C wherever P ≡ PTreeSet(x ∘ fM/|fM|) has a gap > GT (GapThreshold). End While. Notes: a. Euclidean vs. HOBbit furthest. b. fM/|fM| vs. just fM in P. c. find gaps by sorting P, or by the O(log n) pTree method? On the example: C2={p5} complete (singleton = outlier). C3={p6,pf} will split (details omitted), so {p6}, {pf} complete (outliers). That leaves C1={p1,p2,p3,p4} and C4={p7,p8,p9,pa,pb,pc,pd,pe} still incomplete. C1 is dense (density(C1) = ~4/2² = .5 > DT = .3?), thus C1 is complete. Applying the algorithm to C4: {pa} is an outlier; C4 then splits into {p9} and {pb,pc,pd,pe} complete. In both cases those probably are the best "round" clusters, so the accuracy seems high. The speed will be very high! [Figure: the 15-point "interlocking horseshoes with an outlier" dataset plotted on a hex grid, with the coordinate table X(x1,x2) for p1..pf, the medoids M0..M4, f1=p3 (C1 doesn't split; complete), and the distance column D(x,M0) = 2.2, 3.9, 6.3, 5.4, 3.2, 1.4, 0.8, 2.3, 4.9, 7.3, 3.8, 3.3, 1.8, 1.5.]

78 FAUST CLUSTER-fmg: the O(log n) pTree method for finding P-gaps, where P ≡ ScalarPTreeSet(x ∘ fM/|fM|). HOBbit furthest-point list = {p1}; pick f = p1. dens(C) = 16/8² = 16/64 = .25. If GT = 2^k, it suffices to check the dyadic intervals down to width 2^k: compute the count in each half-interval by ANDing bit-slice pTrees (a sketch follows below), e.g., P3'=[0,7] ct=5; P3=[8,15] ct=10; P3'&P2'=[0,3] ct=3; P3'&P2=[4,7] ct=2; P3&P2'=[8,11] ct=2; P3&P2=[12,15] ct=8; and so on down to the unit intervals P3'&P2'&P1'&P0' (value 0, ct=0), P3'&P2'&P1'&P0 (value 1, ct=1), ..., P3&P2&P1&P0 (value 15, ct=2). Gaps occur at each value whose interval count is 0. Get a mask pTree for each cluster by ORing the pTrees between pairs of gaps. [Figure: the 15-point dataset X(x1,x2), the distance column D(x,M) with bit slices D3..D0, the projection values x∘Up1M = 1, 3, 4, 6, 9, 14, 13, 15, 10, ..., and their bit slices P3..P0.] Next slide: use x∘fM instead of x∘Up1M.
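A sketch of this half-interval counting idea, with pTrees modeled as Python ints used as bit vectors (bit i = row i); interval_mask and find_gaps are illustrative names, not the project's actual implementation, and adjacent empty dyadic intervals returned by find_gaps can be merged into full gaps.

def ct(ptree):                          # root count = number of 1-bits
    return bin(ptree).count("1")

def interval_mask(slices, lo, hi, nbits):
    # AND of bit-slice pTrees selecting rows whose projected value v
    # satisfies lo <= v < hi, for an aligned dyadic interval [lo, hi).
    mask = (1 << nbits) - 1
    level = (hi - lo).bit_length() - 1  # the low 'level' bits are free
    for j in range(len(slices) - 1, level - 1, -1):
        if (lo >> j) & 1:
            mask &= slices[j]
        else:
            mask &= ~slices[j] & ((1 << nbits) - 1)
    return mask

def find_gaps(slices, nbits, lo=0, hi=None):
    # Recurse: an interval with count 0 is (part of) a gap; otherwise split it.
    if hi is None:
        hi = 1 << len(slices)
    if ct(interval_mask(slices, lo, hi, nbits)) == 0:
        return [(lo, hi)]
    if hi - lo == 1:
        return []
    mid = (lo + hi) // 2
    return (find_gaps(slices, nbits, lo, mid) +
            find_gaps(slices, nbits, mid, hi))

For the projection values on this slide (1, 3, 4, 6, 9, 10, 13, 14, 15), the empty unit intervals at 0, 2, 5, 7, 8, 11 and 12 come back as the gaps, with 7 and 8 merging into the two-wide gap between 6 and 9.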

79 FAUST CLUSTER ffd summary
If DT=1.1 then {pa} joins {p7,p8,p9}. If DT=0.5 then also {pf} joins {pb,pc,pd,pe} and {p5} joins {p1,p2,p3,p4}. We call the overall method FAUST CLUSTER because it resembles FAUST CLASSIFY algorithmically, and k (the number of clusters) is dynamically determined. Improvements? A better stop condition? Is fmg better than ffd? In ffd, what if k overshoots its optimal value? Add a fusion step each round? As Mark points out, having k too large can be problematic. The proper definition of outlier or anomaly is a huge question. An outlier or anomaly should be a cluster that is both small and remote. How small? How remote? What combination? Should the definition be global or local? We need to research this (give users options and advice for their use). Md: create f = furthest point from M and d(f,M) while creating D = SPTreeSet(d(x,M))? Or, as a separate procedure, start with P = Dh (h = high bit position), then recursively Pk ← P & Dh-k until Pk+1 = 0; then back up to Pk and take any of those points as f, and that bit pattern is d(f,M). Note that this doesn't necessarily give the furthest point from M, but it gives a point sufficiently far from M. Or use HOBbit distance? Modify to get the absolute furthest point by jumping (when the AND gives zero) to Pk+2 and continuing the AND from there. (Dh gives a decent f, at furthest HOBbit distance.) A sketch of this furthest-point search follows below. [Figure: the 15-point dataset; centroid = mean; h = 1; DT = 1.5 gives 4 outliers and 3 non-outlier clusters.]
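A sketch of the furthest-point procedure just described (the modified version that jumps past an empty AND and continues), again with pTrees as Python ints; D_slices is an assumed list of the bit-slice pTrees of the distance column d(x,M), low bit first.

def ct(p):
    return bin(p).count("1")

def far_point(D_slices, nbits):
    # Work from the high-order slice down; keep the AND when it is
    # non-empty, skip the slice when the AND would be empty
    # (the "jump to Pk+2 and continue" modification above).
    P = (1 << nbits) - 1
    for k in range(len(D_slices) - 1, -1, -1):
        cand = P & D_slices[k]
        if ct(cand) > 0:
            P = cand
    return P    # mask of the points at the maximal distance bit pattern

Any 1-bit of the returned mask can serve as f; returning as soon as an AND empties (instead of skipping) gives the cheaper "sufficiently far" HOBbit variant described above.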

80 APPENDIX: Relative gap size on f-g line for fission pt.
First example: declare 2 gaps (3 clusters): C1={p1,p2,p3,p4,p5,p6,p7,p8,pe,pf}, C2={p9,pb,pd}, C3={pa} (outlier). On C1, no gaps, so C1 has converged and is declared complete. On C2, 1 (relative) gap, and the two subclusters are uniform, so both are complete (skipping that analysis). Second example: declare 2 gaps (3 clusters): C1={p1,p2,p3,p4,p5}, C2={p6} (outlier), C3={p7,p8,p9,pa,pb,pc,pd,pe,pf}. On C1, 1 gap, so declare the (complete) clusters C11={p1,p2,p3,p4} and C12={p5}. On C3, 1 gap, so declare the clusters C31={p7,p8,p9,pa} and C32={pb,pc,pd,pe,pf}. On C31, 1 gap; declare the complete clusters C311={p7,p8,p9} and C312={pa}. On C32, 1 gap; declare the complete clusters C321={pf} and C322={pb,pc,pd,pe}. Does this method also work on the first example? YES. [Figure: the two example point sets plotted on hex grids.]

81 FAUST CLUSTER ffd on the "Linked Horseshoe" type example:
[Figure: FAUST CLUSTER ffd run on the "Linked Horseshoe" example, with the coordinate table X(x1,x2) for p1..pf; at each round it shows the distances to the current medoid Mi, to the furthest point f, and to g, the resulting mask pTrees PC1, PC11, PC12, PC21, PC211, PC221, PC222, and the density checks: dens(C0)=15/6.13²<DT, incomplete; dens(C1)=7/3.39²<DT, incomplete; dens(C2)=8/4.24²<DT, incomplete; dens(C21)=3/4.14²<DT, incomplete; dens(C212)=2/.5²=8>DT, complete; dens(C221)=2/5<DT, incomplete; dens(C222)=1.04<DT, incomplete.] Discussion: Here, DT=.99 (DT=1.5 makes everything a singleton?). We expected FAUST to fail to find the interlocked horseshoes, but hoped, e.g., that pa and p9 would be the only singletons! Can we modify it so it doesn't make almost everything an outlier (singletons, doubles)? a. Look at the upper cluster boundary (margin width)? b. Use std-ratio boundaries? c. Other? d. Use a fusion step to weld the horseshoes back together. Next slide: gaps on the f-g line for the fission point.

82 1. Pick K centroids, {Ci}i=1..K
K-means: Assign each point to the closest mean and increment the sum and count for mean recalculation (1 scan). Iterate until stop_cond. pK-means: Same as above, but both assignment and mean recalculation are done without scanning: 1. Pick K centroids, {Ci}i=1..K. 2. Calc SPTreeSet Di = D(X,Ci) (the column of distances from all x to Ci), to get P(Di≤Dj), i<j (the predicate is dis(x,Ci) ≤ dis(x,Cj)). 4. Calculate the mask pTrees for the clusters as follows: PC1 = P(D1≤D2) & P(D1≤D3) & P(D1≤D4) & ... & P(D1≤DK); PC2 = P(D2≤D3) & P(D2≤D4) & ... & P(D2≤DK) & ~PC1; PC3 = P(D3≤D4) & ... & P(D3≤DK) & ~PC1 & ~PC2; ...; PCK = ~PC1 & ~PC2 & ... & ~PCK-1. 5. Calculate the new centroids, Ci = Sum(X&PCi)/count(PCi). If stop_cond = false, start the next iteration with the new centroids. (A sketch of one such round appears below.) Note: in step 2 above, Md's 2's-complement formulas can be used to get the mask pTrees P(Di≤Dj), or FAUST (using Md's dot-product formula) can be used. Is one faster than the other? pKl-means ("pK-less means", pronounced "pickle means"): For all K: 4'. Calculate the cluster mask pTrees. For K=2..n: PC1K = P(D1≤D2) & P(D1≤D3) & ... & P(D1≤DK); PC2K = P(D2≤D3) & ... & P(D2≤DK) & ~PC1K; ...; PCKK = P(X) & ~PC1K & ... & ~PCK-1,K. 6'. If ∃ k s.t. stop_cond = true, stop and choose that k; else start the next iteration with these new centroids. 3.5'. Continue with certain k's only (e.g., the top t)? Top by what? a. Sum of cluster diameters (use max, min of D(Clusteri, Cj), or D(Clusteri, Clusterj)). b. Sum of diameters of cluster gaps (use D(listPCi, Cj) or D(listPCi, listPCj)). c. Other? Fusion: check for clusters that should be fused, and fuse (decrease k): 1. Fuse empty clusters with any other and reduce k (this is probably assumed in all k-means methods, since there is no mean). 2. For some a>1, if max(D(CLUSTi,Cj)) < a*D(Ci,Cj) and max(D(CLUSTj,Ci)) < a*D(Ci,Cj), fuse CLUSTi and CLUSTj. Is avg better? Fission: split a cluster (increase k) if a. its mean and vom are quite far apart, or b. the cluster is sparse (i.e., max(D(CLUS,C))/count(CLUS) < T). (Pick fission centroid y at max distance from C; pick z at max distance from y: diametric opposites in C.) Sort PTreeSet(dis(x,X-x)) descending: this gives singleton-outlier-ness. Or take the global medoid C and increase r until ct(dis(x,Disk(C,r))) > ct(X)-n, then declare the complement outliers. Or loop over x once; the algorithm is O(n). (O(n²) for horizontal data: for each x, find dis(x,y) for all y≠x, i.e., O(n(n-1)/2) = O(n²).) Or restrict C so it is not X-x but a fixed subset? Or create a 3-column "distance table" DIS(x,y,d(x,y)) (limited to only those distances < thresh?), where dis(x,y) is a PTreeSet of those distances. If we have DIS as a PTreeSet both ways (one for "y-pTrees" and another for "x-pTrees"), the y's close to x are in its cluster; if that set is small, and the next larger d(x,y) is large, the x-cluster members are outliers.
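A sketch of one pK-means round in numpy; boolean arrays stand in for the mask pTrees P(Di≤Dj), which in the pTree setting come from Md's 2's-complement comparison formulas rather than from any scan of X. pkmeans_round is an illustrative name.

import numpy as np

def pkmeans_round(X, C):
    K = len(C)
    # D[:, i] = column of distances d(x, Ci), the SPTreeSet Di of step 2.
    D = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    P = D[:, :, None] <= D[:, None, :]          # P[x, i, j] ~ P(Di <= Dj)
    assigned = np.zeros(len(X), dtype=bool)
    masks = []
    for i in range(K):
        # PCi = AND_{j>i} P(Di <= Dj) & ~PC1 & ... & ~PCi-1  (step 4)
        pc = P[:, i, i + 1:].all(axis=1) & ~assigned
        masks.append(pc)
        assigned |= pc
    # Step 5: new centroids Ci = Sum(X & PCi) / count(PCi); keep Ci if empty.
    newC = np.array([X[m].mean(axis=0) if m.any() else C[i]
                     for i, m in enumerate(masks)])
    return newC, masks

The last cluster's mask reduces to the complement of the earlier ones, matching the PCK formula above; an empty mask signals the fusion case (reduce k) discussed above.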

83 Mark: Curious about one other state it converges to.
Mark Silverman (Treeminer, Inc.): I start randomly; it converges in 3 cycles. Here I increase k from 3 to 5. The 5th centroid could not find a member (at 0,0); the 4th centroid picks up 2 points that look remarkably anomalous. WP: Start with a large k? Each round, "tidy up" by fusing pairs of clusters using max(P(dis(CLUSi, Cj))) < dis(Ci, Cj) and max(P(dis(CLUSj, Ci))) < dis(Ci, Cj)? Eliminate empty clusters and reduce k. (Is avg better than max in the above?) Mark: Curious about one other state it converges to. Seems like when we exceed the optimal k, there is some instability. WP: Tidying up would fuse Series4 and Series3 into Series34, then calc centroid34; next fuse Series34 and Series1 into Series134 and calc its centroid. Also?: Each round, split a cluster (create a 2nd centroid) if its mean and vector_of_medians are far apart. (A second go at this mitosis, based on the density of the cluster: if a cluster is too sparse, split it. A pTree (no looping) sparsity measure: max(dis(CLUSTER,CENTROID))/count(CLUSTER).)

84 FAUST CLASSIFY, d versions (dimensional versions, mm, dens, mmd...)
mm: Choose dim1: 3 clusters, {r1,r2,r3,O}, {v1,v2,v3,v4}, {0}. 1.a: When d(mean,median) > c*width, declare a cluster. 1.b: Same algorithm on the subclusters. Declare {r1,r2,r3,O}. Then declare {0,v1} or {v1,v2}? Take {v1,v2} (on the median side of the mean); that makes {0} a cluster (an outlier, since it's a singleton). Continuing with {v1,v2} in dim2: declare {v1,v2,v3,v4}. We have to loop, but not on the next m projections if they are close? (Can skip doubletons, since the mean is always the same as the median.) dens: 2.a: When density > Density_Threshold, declare a cluster (density ≡ count/size). Oblique: use a grid of oblique direction vectors, e.g., for 3D, a DirVect from each PTM triangle. With projections onto those lines, do 1 or 2 above. Order = any sphere grid: Sn ≡ {x≡(x1,...,xn)∈Rⁿ | Σxi² = 1}, polar coords. Lexicographical polar coords? 180ⁿ too many? Use, e.g., 30-degree units, giving 6ⁿ vectors for dim=n. Attribute relevance is important! mmd: Use the first criterion triggered, from 1.a or 2.a, to declare clusters. Alg4: Calc the mean and vom; do 1a or 1b on the line connecting them; repeat on each cluster, using another line? Adjust projection lines and the stop condition. Alg5: Project onto the mean-vom line: mn=(6.3,5.9), vom=(6,5.5) ((11,10) = outlier); 4.b, a perpendicular line? Other option: use a pK-means approach. Could use K=2 and divisive (using a GA mutation at various times to get off a non-convergent track)? On the 3-D example (mean=(8.18,3.27,3.73), vom=(7,4,3)): 1. no clusters determined yet; 2. (9,2,4) determined as an outlier cluster; 3. using the red dim line, (7,5,2) is an outlier cluster; the maroon points are determined as a cluster, the purple points too. 3.a: using mean-vom again, would the same be determined? Notes: Each round, reduce dim by one (a low bound on the loop). Each round, we just need a good line (in the remaining hyperplane) to project the cluster (so far): 1. pick the line through the projected mean and vom (the vom is dependent on the basis used; a better way?); 2. pick the line through the longest diameter (or a diameter ≥ ½ the previous diameter?); 3. try a direction vector, then hill-climb it in the direction of increase of the diameter of the projected set. [Figure: the dim1/dim2 scatters with mean/median markers, the 2-D example with mn=(6.3,5.9) and vom=(6,5.5), and the 3-D example points.]

85 FAUST Classify, Oblique version
(our best classifier?) D ≡ mR→mV (= mV−mR), d = D/|D|. Separate class R using the midpoint-of-means method: calculate a = (mR + (mV−mR)/2) ∘ d = (mR+mV)/2 ∘ d. A pass with PR = P(X∘d < a) gives the classR pTree (this works also if D is taken as mV→mR, flipping the inequality). Training ≡ placing the cut-hyper-plane(s), CHP (an (n-1)-dim hyperplane cutting the space in two). Classification is 1 horizontal program (AND/OR) across the pTrees, giving a mask pTree for each entire predicted class (all unclassifieds at-a-time); see the sketch below. Accuracy improvement? Consider the dispersion within classes when placing the CHP. E.g., 1. use the vector_of_medians, vomV ≡ (median{v1|v∈V}, median{v2|v∈V}, ...), to represent each class, not the mean mV; 2. midpt_std and vom_std methods: project each class on the d-line; then calculate the std of those projections (one horizontal formula per class, using Md's method); then use the std ratio to place the CHP (no longer at the midpoint between mR and mV). Note: training (finding a and d) is a one-time process. If we don't have training pTrees, we can use horizontal data for a and d (one time), then apply the formula to the test data (as pTrees). [Figure: 2-D scatter of r's and v's with mR, mV, vomR, vomV, the d-line, and the stds of the distances, vod, from the origin along the d-line.]
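A minimal numpy sketch of the one-time training (finding d and a) and the bulk classification step; arrays stand in for the horizontal AND/OR across pTrees, and the helper names and toy R and V points are ours, for illustration.

import numpy as np

def train_cut(R, V):
    # One-time training: cut-hyperplane between classes R and V.
    mR, mV = R.mean(axis=0), V.mean(axis=0)
    D = mV - mR
    d = D / np.linalg.norm(D)          # unit direction, d = D/|D|
    a = (mR + mV) / 2 @ d              # a = (mR+mV)/2 o d
    return d, a

def classify_R(X, d, a):
    # Mask of samples predicted class R: X o d < a, all unclassifieds at once.
    return X @ d < a

R = np.array([[1.0, 2.0], [2.0, 1.0], [1.5, 1.5]])
V = np.array([[6.0, 7.0], [7.0, 6.0], [6.5, 6.5]])
d, a = train_cut(R, V)
print(classify_R(np.array([[1.2, 1.8], [6.8, 6.2]]), d, a))   # [ True False]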

86 PTM_LLRR_LLRR_LR... ordering
What ordering is best for spherical data (e.g., datasets involving astronomical bodies on the celestial sphere, which shares its origin and equatorial plane with the earth but has no radius)? The Hierarchical Triangle Mesh (HTM) orders its recursive equilateral triangulations LLRR from the equator to the south and north poles, with sub-triangle ordering 1,0 1,1 1,2 1,3 (and 1,1 subdividing into 1,1,0 1,1,1 1,1,2 1,1,3, etc.). The pTree Triangular Mesh (PTM) ordering instead peels from the south to the north pole along quadrant great circles and the equator: level-2 follows the level-1 LLRR pattern with another LLRR pattern; level-3 follows level-2 with LR when the level-2 pattern is L and RL when the level-2 pattern is R, giving PTM_LLRR_LLRR_LR... [Figure: the HTM and PTM triangulations of the sphere with their L/R labellings.] Theorem: ∀n, ∃ an n-sphere-filling (n-1)-sphere? Corollary: ∃ a sphere-filling circle (a 2-sphere-filling 1-sphere). Proof of Corollary: Let Cn ≡ the level-n circle; then C ≡ lim(n→∞) Cn is a circle which fills the 2-sphere. Proof: Let x be any point on the 2-sphere. distance(x,Cn) ≤ the sidelength (= diameter) of the level-n triangles, and sidelength(n+1) = ½ · sidelength(n), so d(x,C) ≡ lim d(x,Cn) ≤ lim sidelength(n) ≤ sidelength(1) · lim ½ⁿ = 0.

87 PAPA: Ptree Analysis of Partitions and Anomalies 4/21/12
Algorithm-1: Look for the dimension where clustering is best. Below, dimension=1 (3 clusters: {r1,r2,r3,O}, {v1,v2,v3,v4} and {0}). How to determine this? 1.a: Take each dimension in turn, working left to right; when d(mean,median) > ¼ width, declare a cluster. 1.b: Next take those clusters one at a time to the next dimension for further sub-clustering via the same algorithm. At this point we declare {r1,r2,r3,O} a cluster and start over. At this point we need to declare a cluster, but which one, {0,v1} or {v1,v2}? We will always take the one on the median side of the mean, in this case {v1,v2}. And that makes {0} a cluster (actually an outlier, since it's a singleton). Continuing with {v1,v2}: declare {v1,v2,v3,v4} a cluster. Note we have to loop. However, rather than each single projection, the delta can be the next m projections if they're close. Next we would take one of the clusters and go to the best dimension to subcluster, and so on. (A sketch of the 1.a trigger appears below.) Can skip doubletons, since the mean is always the same as the median. Algorithm-2: 2.a: Take each dim in turn, working left to right; when density > Density_Threshold, declare a cluster (density ≡ count/size). 2.b = 1.b. Oblique version: Take a grid of oblique direction vectors, e.g., for a 3D dataset, a DirVect pointing to the center of each PTM triangle. With projections onto those lines, do 1 or 2 above. Ordering = any sphere-surface grid: Sn ≡ {x≡(x1,...,xn)∈Rⁿ | Σxi² = 1}, in polar coords {p≡(θ1,...,θn-1) | 0≤θi≤179}. Use lexicographical polar coords? 180ⁿ too many? Use, e.g., 30-degree units, giving 6ⁿ vectors for dim=n. Attribute relevance is important. Algorithm-3: Another variation: calculate the dataset mean and vector of medians. Then, on the projections of the dataset onto the line connecting the two, do 1a or 1b. Then repeat on each declared cluster, but use a projection line other than the one through the mean and vom this second time, since the mean-vom line would likely be in approximately the same direction as the first round. Do until no new clusters? Adjust, e.g., the projection lines and stop condition. Algorithm-4: Project onto the line of the dataset mean and vom: mn=(6.3,5.9), vom=(6,5.5) ((11,10) = outlier). 4.b: Repeat on any perpendicular line through the mean (mn and vom far apart indicates multi-modality). Algorithm-4.1: 4.b.1: In each cluster, find the 2 points furthest from the line? (Does that require projection one point at a time, or can we determine those 2 points in one pTree formula?) Algorithm-4.2: 4.b.2: Use a grid of unit direction lines, {dvi | i=1..m}. For each, calc the mn and vom of the projections of each cluster (except singletons). Take the one for which the separation is max. [Figure: the dim1/dim2 scatters with mean/median markers and the 2-D example points.]
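One possible reading of the 1.a trigger, sketched in numpy; the left-to-right walk over sorted projections, the declare-up-to-the-previous-point rule, and the function name are all interpretive assumptions, since the slide leaves those details open.

import numpy as np

def scan_dimension(vals, c=0.25):
    # Walk the sorted projections left to right; declare a cluster when the
    # running d(mean, median) of the prefix exceeds c * (running width).
    vals = np.sort(np.asarray(vals, dtype=float))
    clusters, start = [], 0
    for end in range(2, len(vals) + 1):
        seg = vals[start:end]
        width = seg[-1] - seg[0]
        if width > 0 and abs(seg.mean() - np.median(seg)) > c * width:
            clusters.append(vals[start:end - 1])  # declare all but the trigger point
            start = end - 1
    clusters.append(vals[start:])                 # remainder is the last cluster
    return clusters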

88 My assumption is that all I need to do is to modify as follows:
On the 3-D example (mean=(8.18, 3.27, 3.73), vom=(7,4,3)): 1. No clusters determined yet. 2. (9,2,4) determined as an outlier cluster. 3. Using the red dim line, (7,5,2) is determined as an outlier cluster; the maroon points are determined as a cluster, the purple points too. 3.a: However, continuing to use the line connecting the (new) mean and vom of the projections onto this plane, would the same be determined? Other option: use (at some judicious point) a pK-means type approach. This could be done using K=2 and a divisive top-down approach (using a GA mutation at various times to get us off a non-convergent track)? Notes: Each round, reduce dim by one (a low bound on the loop). Each round, we just need a good line (in the remaining hyperplane) to project the cluster (so far): 1. pick the line through the projected mean and vom (the vom is dependent on the basis used; a better way?); 2. pick the line through the longest diameter (or a diameter ≥ ½ the previous diameter?); 3. try a direction vector, then hill-climb it in the direction of increase of the diameter of the projected set. From: Mark Silverman, April 21. Subject: RE: oblique faust. I've been doing some tests, so far not so accurate (I'm still validating the code; I "unhardcoded" it so I can deal with arbitrary datasets, and it's possible there's a bug, but so far I think it's ok). Something rather unique about the test data I am using is that it has four attributes, but for all of the class decisions it is really one of the attributes driving the classification decision (e.g., for classes 2-10, attribute 2 is the dominant decision; for class 11, attribute 1 is dominant, etc.). I have very wide variability in std deviation in the test data (some very tight, some wider). Thus, I think that placing "a" on the basis of relative deviation makes a lot of sense in my case (and probably in general). My assumption is that all I need to do is to modify as follows: Now: a[r][v] = (Mr + Mv) * d / 2. Changes to: a[r][v] = (Mr + Mv) * d * std(r) / (std(r) + std(v)). Is this correct?

89 Facebook-Buys: A Facebook member, m, purchases item, x, and tells all friends. Let's make everyone a friend of him/herself. Each friend responds back with the items, y, she/he bought and liked. F ≡ Friends(M,M), P ≡ Purchase(M,I), I ≡ Items. For X⊆I: MX ≡ &x∈X Px = the people that purchased everything in X; FX ≡ OR m∈MX Fm = the friends of an MX person. So, for X={x}: Mx ≡ OR m∈Px Fm; "Mx purchases x" is frequent if ct(Mx) is large. This is a tractable calculation: take one x at a time and do the OR. "Mx purchases x" is confident if ct(Mx & Px)/ct(Mx) > minconf, i.e., ct(OR m∈Px Fm & Px)/ct(OR m∈Px Fm) > mncnf. Example: K2={1,2,4}, P2={2,4}: ct(K2)=3, ct(K2&P2)/ct(K2)=2/3. To mine X, start with X={x}. If {x} is not confident, then no superset is (closure): take X={x,y} only for x and y forming confident rules themselves. Kiddos-Buddies variant: F ≡ Friends(K,B) (Buddies), P ≡ Purchase(B,I), with Groupies and Compatriots(G,K) or Others(G,K): Kx ≡ OR g∈(OR b∈Px Fb) Og is frequent if ct(Kx) is large (tractable: one x at a time and OR). A Facebook buddy, b, purchases x and tells friends; a friend tells all friends: a strong purchase possibility? Intersect rather than union (AND rather than OR) to advertise to friends of friends. Examples: K2={2,4}, P2={2,4}: ct(K2)=2, ct(K2&P2)/ct(K2)=2/2; or K2={1,2,3,4}, P2={2,4}: ct(K2)=4, ct(K2&P2)/ct(K2)=2/4. [Figure: the small 0/1 relationship matrices F(M,M), P(M,I), F(K,B), P(B,I) used in these counts; see the sketch below.]
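A sketch of the Mx computation and its frequency/confidence test, with the relationship matrices as Python ints (bit i = member i); the little F and P below are made-up stand-ins for the slide's example matrices.

def ct(p):
    return bin(p).count("1")

# 4 members (bits 0..3); everyone is a friend of him/herself.
F = {0: 0b0011, 1: 0b0111, 2: 0b0110, 3: 0b1100}   # friends pTrees
P = {2: 0b1010}                                    # members 1 and 3 purchased item x=2

def strength(x, n=4):
    Px = P[x]
    Mx = 0
    for m in range(n):            # Mx = OR over m in Px of Fm
        if (Px >> m) & 1:
            Mx |= F[m]
    return ct(Mx), ct(Mx & Px) / ct(Mx)   # (frequency, confidence)

print(strength(2))                # (4, 0.5) for these made-up matrices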

90 The Multi-hop Closure Theorem. A hop is a relationship, R, hopping from entity E to entity F.
[Figure: the hop chain R(E,F), S(F,G), T(G,H), U(H,I) with example sets A⊆E and C.] Downward closure: if a condition is true of A, then it is true for all subsets D of A. Upward closure: if a condition is true of A, then it is true of all supersets D of A. For transitive (a+c)-hop strong rule mining, where the focus (count) entity is a hops from the antecedent and c hops from the consequent: if a (resp. c) is odd, then downward closure applies to frequency (resp. confidence); if even, then upward closure applies. Odd ⇒ downward; even ⇒ upward. The proof of the theorem: a pTree X is said to be "covered by" a pTree Y if, for every 1-bit in X, there is a 1-bit at that same position in Y. Lemma-0: For any two pTrees X and Y, X & Y is covered by X, and ct(X) ≥ ct(X&Y). Proof-0: ANDing with Y may zero some of X's 1-positions but never turns any of X's 0-positions into 1s. Lemma-1: Let A⊆B; then &a∈B Xa is covered by &a∈A Xa. Proof-1: Let Z = &a∈B-A Xa; then &a∈B Xa = Z & (&a∈A Xa), so the result follows from Lemma-0. Lemma-2: For a (or c) = 0, frequency (or confidence) is upward closed. Proof-2: For A⊆B, ct(A) ≤ ct(B), so ct(A) > mnsp ⇒ ct(B) > mnsp, and ct(C&A)/ct(C) > mncf ⇒ ct(C&B)/ct(C) > mncf. Lemma-3: If for a (or c) we have upward/downward closure of frequency (or confidence), then for a+1 (or c+1) we have downward/upward closure. Proof-3: Taking the a case with upward closure and going to a+1 with D⊆A, we are removing ANDs in the numerator for both frequency and confidence; so by Lemma-1 the a+1 numerator covers the a numerator, and therefore the a+1 count ≥ the a count. Therefore the condition (frequency or confidence) holds in the a+1 case and we have downward closure. (A small check of Lemmas 0 and 1 appears below.)
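A tiny check of Lemma-0 and Lemma-1 with pTrees as Python ints; the Xa values and sets are made up.

def ct(p):
    return bin(p).count("1")

X, Y = 0b101101, 0b110110
assert ct(X & Y) <= ct(X)            # Lemma-0: X & Y is covered by X

Xa = {1: 0b1110, 2: 0b1011, 3: 0b0111}
andA = Xa[1] & Xa[2]                 # A = {1, 2}
andB = andA & Xa[3]                  # B = {1, 2, 3}, so A is a subset of B
assert andB & ~andA == 0             # Lemma-1: AND over B is covered by AND over A
assert ct(andB) <= ct(andA)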

91 Given an n-row table, a row predicate, and a row ordering
Given an n-row table, a row predicate (e.g., a bit-slice predicate, or a category map) and a row ordering (e.g., ascending on key; or, for spatial data, col/row-raster, Z, or Hilbert order), the sequence of predicate truth bits is the raw or level-0 predicate Tree (pTree) for that table, row predicate and row order. Given a raw pTree P, a partitioning of it, par, and a bit-set predicate, bsp (e.g., pure1, pure0, gte50%One), the level-1 (par, bsp) pTree is the string of truths of bsp on consecutive partitions of par. If the partition is an equiwidth intervalization of width m, it's called the level-1 stride=m bsp pTree. A level-2 pTree = a level-1 pTree built on a level-1 pTree (a 1-column table). Example: on the 15-row IRIS table (Name, SL, SW, PL, PW, Color; 5 setosa, 5 versicolor, 5 virginica rows, in the given order), the predicate PW<7 in the given order yields a raw pTree P0_PW<7 whose gte50% stride=5 level-1 pTree predicts setosa. Other raw pTrees shown: P0_SL,0 (pred: remainder(SL/2)=1), P0_SL,1 (pred: rem(div(SL/2)/2)=1), and P0_Color=red (pred: Color=red), each with pure1, gte25%, gte50% and gte75% stride=5 level-1 pTrees; gte50% level-1 pTrees at strides 4, 8 and 16 for P0_SL,0; and the level-2 pTree P2_gte50%,s=4,SL,0 = gte50% stride=2 on the level-1 gte50% stride=4 pTree. [Figure: the IRIS table and the corresponding level-0, level-1 and level-2 pTrees.] See the sketch below.
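A sketch of building level-1 pTrees from a raw level-0 bit string; level1 and the three bit-set predicates are illustrative helpers, and the raw bits assume PW<7 holds exactly on the 5 setosa rows, as the slide indicates.

def level1(bits, stride, bsp):
    # Apply the bit-set predicate bsp to consecutive stride-wide partitions.
    return [bsp(bits[i:i + stride]) for i in range(0, len(bits), stride)]

pure1 = lambda seg: all(seg)
pure0 = lambda seg: not any(seg)
gte50 = lambda seg: 2 * sum(seg) >= len(seg)    # gte50%One

# Truth bits of "PW < 7" on the 15-row IRIS table, in the given row order:
raw = [1,1,1,1,1, 0,0,0,0,0, 0,0,0,0,0]
print(level1(raw, 5, gte50))   # [True, False, False]: predicts setosa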

92 Satlog evaluation of FAUST with a = pmr + (pmv-pmr) * 2pstdr/(pstdv+2pstdr)
The doubled-r-std cut-point: a = pmr + (pmv-pmr) * 2pstdr/(pstdv+2pstdr) = (pmr*pstdv + pmv*2pstdr)/(pstdv+2pstdr). [Results residue: Satlog class means and stds for R, G, ir1, ir2, and per-class (1,2,3,4,5,7) true/false-positive counts for: non-oblique level-0; non-oblique level-1 (gte50%); oblique level-0 using the midpoint of means; oblique level-0 using means and stds of projections (without class elimination; with class elimination in order; with doubled pstdr and no elimination; with doubled pstdr eliminating in the orders 2,3,4,5,7,1 and 3,4,7,5,1,2); and a band-class rule-mining variant with cut-points such as G[0,46]→2, G[47,64]→5, G[65,81]→7, G[81,94]→4, G[94,255]→{1,3}, R[0,48]→{1,2}, R[49,62]→{1,5}, R[82,255]→3, ir1[0,88]→{5,7}, ir2[0,52]→5.] With the 2s1/(2s1+s2) cut-point, the number of false positives is reduced and true positives are somewhat reduced. Better? Parameterize the 2 to maximize TPs and minimize FPs. What is the best parameter? Conclusion: MeansMidPoint and Oblique std1/(std1+std2) are best, with the Oblique version slightly better. I wonder how these two methods would work on Netflix? Two ways: above = (std+std_up)/gap_up, below = (std+std_dn)/gap_dn; the suggested order (above/below per band: red, green, ir1, ir2) gives class averages 4: 2.12, 2: 2.36, 5: 4.03, 7: 4.12, 1: 4.71, 3: 5.27. UTbl(User, M1,...,M17770) (u,m); umTrainingTbl = SubUTbl(Support(m), Support(u), m). MTbl(Movie, U1,...,U480189) (m,u); muTrainingTbl = SubMTbl(Support(u), Support(m), u).

93 Netflix data {mk}k=1..17770 UserTable(uID,m1,...,m17770)
[Figure: the Netflix data layouts: UserTable(uID, m1,...,m17770) with ratings r(mh,uk) (~47B cells); its UPTreeSet, 3*17770 bit slices wide; the rotated dual MTbl(mID, u1,...,u480189) with its MPTreeSet, 3*480189 bit slices wide; and the Main fact table (m,u,r,d), with averages of 5655 users/movie and 209 movies/user.] For a (u,m) to be predicted, form umTrainingTbl = SubUTbl(Support(m), Support(u), m). There are lots of 0s in the vector space umTrainingTbl; we want the largest subtable without zeros. How? SubUTbl({n∈Sup(u) | m∈Sup(n)}, Sup(u), m)? Using coordinate-wise FAUST (not Oblique): in each coordinate n∈Sup(u), divide up all users v∈Sup(n)∩Sup(m) into their rating classes, rating(m,v); then 1. calculate the class means and stds; 2. sort the means and calculate the gaps; 3. choose the best gap and define the cut-point using the stds. Coordinate FAUST dually: in each coordinate v∈Sup(m), divide up all movies n∈Sup(v)∩Sup(u) into rating classes, then the same three steps. Of course, the two supports won't be tight together like that, but they are drawn that way for clarity. This may be slow; how can we speed it up? Gaps alone are not the best criterion (especially since the sum of the gaps is no more than 4 and there are 4 gaps). Weighting (correlation(m,n)-based) is useful (the higher the correlation, the more significant the gap?). The cut-points are constructed for just this one prediction, rating(u,m). Does it make sense to find all of them? Should we just find, e.g., which n-class-mean(s) rating(u,n) is closest to and make those the votes?

94 Bill P: FAUST should be great for that.
Mark "Faust is fast... takes ~15 sec on same dataset that takes over 9 hours with knn and 40 min with pTree knn. 3/31/12 I’m ready to take on oblique, need better accuracy (still working on that with cut method ("best gap" method)." FAUST is this many times faster than, Horizontal KNN taking hours = minutes = 32,400 sec. pCKNN: taking hours = minutes = 2,400 sec. while Mdpt FAUST takes hours = minutes = sec. "Doing experiments on faust to assess cutting off classification when gaps got too small (with an eye towards using knn or something from there). Results are pretty darn good…  for faust this is still single gap, working on total gap (max of (min of prev and next gaps)) Here’s a new data sheet I’ve been working on focused on gov’t clients." Bill P: BestClsAttrGap-FAUST using all gaps meeting criteria (e.g., sum of 2 stds < gap width), AND all mask pTrees. Oblique FAUST is more accurate and faster. Md will send what he has and please interact with him on quadratics - he will help you with the implementation. Could get datasets for your performance analysis (with code of competitor algorithms etc.?)  It would help us a lot in writing papers Work together on Oblique FAUST performance analysis using your benchmarks. You'd be co-author.  My students crunch numbers... Mark S:  Vendor opp: Provides data mining solutions to telecom operators for call analysis, etc - using faust in an unsupervised mode - thots on that for anomaly detection. Bill  P:   FAUST should be great for that.

95 kmurph2@clemson.edu Mar 06: Yes, pTrees for med informatics, Bill
Mar 06: Yes, pTrees for med informatics, Bill! We could work so many miracles... the data we can generate requires robust informatics, and comp. bio. would put resources into this. Keith Murphy, Chair Genetics/Biochem, Dir, Clemson U Genomics Inst. WP 3/6: We applied pTrees to bioinformatics too (took second in the 2002 ACM KDD-Cup in bioinformatics and first in the 2006 ACM KDD-Cup in medical informatics: 2006 ACM KDD Cup winning team leader, Task 1; 2002 ACM KDD Cup, Task 2, Yeast Gene Regulation Prediction). Mark Silverman Feb 29: tweaking Greg's FAUST implementation and looking at gap split (looks for max gap, not max gap on both sides of the mean; should it?). WP: looks like 50%-ones impure pTrees can give cut-hyperplanes (for FAUST) as good as raw pTrees. What's the advantage? Since FAUST training is a 1-time process, it isn't speed critical. Very fast impure-pTree batch classification (after training) would be very exciting: once the cut-hyper-planes are identified, an FPGA spits out 50%-ones impure pTrees for incoming unclassified datasets (e.g., satellite images) and sends them through (FPGA) for Md's "One-Pass-Across-Columns = OPAC" batch classification, all happening on-the-fly with nearly zero delay... For PINE (nearest neighbor), we don't even train a model, so the 50%-ones impure pTree classification phase could be very significantly better. Business Intelligence = "What does this customer want next, based on histories?": FAUST is model-based (training phase = build a model of 1 hyperplane for Oblique, or up to 1 per column for non-Oblique). Use the model to classify. In Bus-Intel, with every new unclassified sample a different vector space appears (every customer rates a different set of items), so to use FAUST-PINE there's the non-vector-space problem to solve. Non-Oblique FAUST is better than Oblique there, since the columns have different cardinalities (not a vector space in which to calculate oblique hyperplanes). In general, what we're attempting is to marry MYRRH multi-hop Relationship (Rule) Mining with FAUST-PINE Classification (Table Mining). On Social Network Mining: we have some social network mining research threads percolating: 1. facebook-friends multi-hopped with buying-preference relationships (or multi-hopped with security-threat relationships, or with ?); 2. implications of twitter blooms for event prediction (e.g., commodity/stock changes, events, political trends, bubbles/bursts, purchasing patterns...). I would like to tie image classification with social networks somehow too ;-) WP 3/1/12, note on "...very excited about the discussions on MYRRH and applying it to classification problems, seems hugely innovative...": I want to try to view images as relationships rather than as tables, with each row = a pixel and each column = the photon count in a frequency band. Any table = a relationship (AKA a matrix, a rolodex card) with 2 entity axes: 1. the usual row entity (e.g., pixels); 2. the column entity(s) (e.g., wavelength intervals). Any matrix is a dual pair of tables (via rotation). The Cust-Item rating matrix is the rating-table pair Custs(Items) and its rotated dual Items(Custs).
When there are sufficiently many fine-band hyperspectral sensors in the air (plus on/in the ground), there will be a sufficient number of separate columns to do MYRRH on the relationship between pixels and wavelengths, multi-hopped with the relationship between classes and pixels. (Nearly every measurement is a summarization or an intervalization; even a pixel is a 2-D intervalization of an infinite set of points in space, so viewing wavelength as an intervalization of a continuous phenomenon is just as valid, right?) What if we do FAUST-PINE on the rotated image relationship, Wavelength(pixel_photon_count), instead of Pixel(wavelength_photon_count)? Note that classes which are not convex in Pix(WL) (that are spread out spatially all over the image) might be convex in WL(Pix)? I tried preliminaries; disappointing for classification (tried applying the concept on SatLog-Landsat(R,G,ir1,ir2,class); too few bands or classes? Still, I'm hoping for "Wow! Look at this!" when, e.g., classes aren't known/clear and there are thousands of them and millions of bands...). E.g., 2 huge square-ish relationships to multi-hop: difficult (curse of dimensionality = too many columns; which are the relevant ones?); rule mining comes into its own. One last thought regarding "the curse of dimensionality = too many columns, which are the relevant ones?": FAUST automatically filters irrelevant columns to find those that reveal [convex] classes (all good classes are convex in a proper feature space). E.g., Class=yellow_car may be round-ish in Pix(RedWaveLen, GreenWaveLen, BlueWaveLen, OtherWaveLens), once R,G,B are isolated as the relevant ones. Class=pavement is fragmented in Pix(RWL,GWL,BWL,OWLs) but may be convex in WL(pix_x, pix_y) (because pavement is color-consistent?). Last point: we have to get you a FAUST implementation! It almost has to be orders of magnitude faster than pkNN! The speedup should be very sublinear, almost constant (nearly independent of cardinality), because it is a bulk classifier (one horizontal pass gains us a class_mask_pTree, distinguishing all points predicted to be in that class). So, not only is it model-based, but it is a batch classifier. Model-based classifiers that require scanning horizontal datasets cannot compete! Mark 3/2/12: Very close on FAUST. WP: it's important that the classification step be done in bulk, lest you lose the main huge benefit of FAUST. What happens at the end if you've peeled off all the classes and there are still some unclassified points left? Have a "mixed"/"default" class (e.g., SatLog class=6="mixed"). Potential interest from some folks who have a close relationship with Arbitron. Seems like a netflix story to me...


97 D≡ mrmv. FAUST Oblique formula: P(Xod)<a X any set of vectors (e.g., a training class) /18/12 Let d = D/|D|. To separate rs from vs using means_midpoint as the cut-point, calculate a as follows: Viewing mr, mv as vectors ( e.g., mr ≡ originpt_mr ), a = ( mr+(mv-mr)/2 ) o d = (mr+mv)/2 o d What if d points away from the intersection, , of the Cut-hyperplane (Cut-line in this 2-D case) and the d-line (as it does for class=V, where d = (mvmr)/|mvmr| ? Then a is the negative of the distance shown (the angle is obtuse so its cosine is negative). But each vod is a larger negative number than a=(mr+mv)/2od, so we still want vod < ½(mv+mr) o d     r   r r v v        r  mr   r      v v v       r    r       v mv v      r    v v     r            v                     d d a

98 FAUST Oblique, vector of stds: D ≡ mr→mv, d = D/|D|, P(X∘d) < a
= PdiXi<a To separate r from v: Using the vector of stds cutpoint , calculate a as follows: Viewing mr, mv as vectors, a = ( mr mv ) o d stdr+stdv stdr stdv What are the purple stds? approach-1: for each coordinate (or dimension) calculate the stds of the coordinate values and for the vector of those stds. Let's remind ourselves that the formula given Md's formula, does not require looping through the X-values but requires only one AND program across the pTrees. PX o d < a = PdiXi<a     r   r r v v         r mr   r    v v v       r    r       v mv v      r v v     r            v                     d

99 FAUST Oblique: D ≡ mr→mv, d = D/|D|
P(X∘d) < a = P(Σ di·Xi < a). Approach 2: to separate r from v using the stds of the projections, calculate a as follows: a = pmr + (pmv−pmr)·pstdr/(pstdr+pstdv) = (pmr·pstdv + pmv·pstdr)/(pstdr+pstdv). Next? a = pmr + (pmv−pmr)·2pstdr/(pstdv+2pstdr) = (pmr·pstdv + pmv·2pstdr)/(pstdv+2pstdr). In this case the predicted classes will overlap (i.e., a given sample point may be assigned multiple classes), therefore we will have to order the class predictions. By pmr we mean the distance mr∘d, which is also mean{r∘d | r∈R}; by pstdr, std{r∘d | r∈R}. (A sketch follows below.) [Figure: projections of the r's and v's onto the d-line, showing pmr and pmv.]
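A numpy sketch of Approach 2: project both classes on the d-line, then place a between pmr and pmv in proportion to the projection stds (the double_r_std flag reproduces the 2·pstdr variant; the flag and function names are ours).

import numpy as np

def cut_a(R, V, double_r_std=False):
    mR, mV = R.mean(axis=0), V.mean(axis=0)
    d = (mV - mR) / np.linalg.norm(mV - mR)
    pR, pV = R @ d, V @ d                  # shadows of R and V on the d-line
    pmr, pmv = pR.mean(), pV.mean()
    pstdr, pstdv = pR.std(), pV.std()
    if double_r_std:                       # the "2pstdr" variant above
        pstdr *= 2
    return pmr + (pmv - pmr) * pstdr / (pstdr + pstdv)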

100 PXo(mrmb)>|mrmb|/2
From notes, /21/12: A Multi-attribute EIN Oblique (EINO) based heuristic: instead of finding the best D, take the vector connecting one class mean to another class mean as D. To separate r from v: D = mr→mv and a = |mr→mv|/2. To separate r from b: D = mr→mb and a = |mr→mb|/2. Mistake! It should be d = D/|D| and a = (mr+mv)/2 ∘ d; the |mr→mv|/2 choice is devastating to accuracy! P(X∘(mr→mv)) > |mr→mv|/2 masks the vectors that make a shadow on the mr side of the midpoint, and ANDing the two pTrees (for classes v and b) masks the region which is r. Question: what's the best cut-point? mean, vector_of_medians, outermost, outermost non-outlier? By "outermost" I mean the points furthest from the means in each class (in terms of their projections on the D-line); by "outermost non-outlier" I mean the furthest non-outlier points. Other possibilities: the best rank-K points, the best std points, etc. Comments on where to go from here (assuming we can do the above): I think the "medoid-to-medoid" method on this page is close to optimal, provided the classes are convex. If they are not convex, then some sort of Support Vector Machine, SVM, would be the next step: in SVMs the space is translated to higher dimensions in such a way that the classes ARE convex, and the inner product in that space is equivalent to a kernel function in the original space, so that one need not even do the translation to get inner-product based results (the genius of the method). Final note: I should say "linearly separable" instead of convex (a slightly weaker condition). [Figure: r's, v's and b's with mr, mv, mb.]

101 P(mrmv)/|mrmv|oX<a
4. FAUST Oblique: length, std, rank-K for selecting the best gap and multiple attributes. Formula: P(X∘D) > a, X any set of vectors (note: if D = ei, this is P(Xi > a)); D = the oblique vector. E.g., let D = the vector connecting two class means, d = D/|D|. To separate r from v: D = mr→mv, a = (mv+mr)/2 ∘ d, and P(X∘d) > a = P(Σ di·Xi > a). NOTE: the picture on this page could be misleading; see the next slide for a clearer picture. FAUST-Oblique: create a table TBL(classi, classj, medoid_vectori, medoid_vectorj). Notes: if we just pick the one class which, when paired with r, gives the max gap, then we can use max gap or max_std_Int_pt instead of max_gap_midpt; then we need stdj (or variancej) in TBL. Best cut-point? mean, vector_of_medians, outermost, outermost non-outlier? ANDing 2 pTree masks, P((mr→mb)∘X) > (mr+mb)/2 ∘ d and P((mr→mv)∘X) > (mr+mv)/2 ∘ d, masks the vectors that make a shadow on the mr side of the midpoint. "Outermost" = furthest from the means (in terms of their projections on the D-line); also best rank-K points, best std points, etc. "Medoid-to-medoid" is close to optimal provided the classes are convex. In higher dimensions it is the same (if the classes cluster "convexly", FAUST{div, oblique_gap} finds them). [Figure: r, v, b and g classes with their means, for classes r and b and for classes r and v.]

102 P(X∘d) < a = P(Σ di·Xi < a)
4. FAUST Oblique: midpoint, std, rank-K for selecting the best gap and multiple attributes. Formula: P(X∘D) < a, X any set of vectors (note: if D = ei, this is P(Xi < a)); D ≡ mr→mv is the oblique vector, and d = D/|D|. To separate r from v using the means-midpoint as the cut-point, calculate a as follows: viewing mr and mv as vectors (e.g., mr ≡ the vector from the origin to mr), a = (mr + (mv−mr)/2) ∘ d = (mr+mv)/2 ∘ d = ((½)·mr + (½)·mv) ∘ d. [Figure: r's and v's with the d-line and cut-point a.]

103 4. FAUST Oblique: X any set of vectors. d=(mv-mr)/|mv-mr|
P(v∘d) > a = P(Σ di·vi > a). To separate r from v using midpoints: what happens when we use the previous (mistaken) a = |mv−mr|/2? All r∘d are > a, so all r's are classified incorrectly as v's. [Figure: r's, v's, mr, mv, the d-line, and the mistaken cut-line P(v∘d) > a at a = |mv−mr|/2.]

104 Oblique FAUST (level-0 case): results
[Results residue: Satlog class means and stds for R, G, ir1, ir2, and per-class (1,2,3,4,5,7) true/false-positive counts with class totals for: non-oblique level-0; non-oblique level-1 (gte50%); and oblique level-0 using the midpoint of means (without eliminating classes as they are predicted).]

105 4. FAUST Oblique: X any set of vectors, D ≡ mr→mv, d = D/|D|
P(X∘d) < a = P(Σ di·Xi < a). To separate r from v using the vector-of-std-ratios cut-point, calculate a as follows: viewing mr and mv as vectors, a = ((mr·stdv + mv·stdr)/(stdr+stdv)) ∘ d (coordinatewise products). Just as there is no median for a set of vectors, there is no std either. What is meant by the expression above is: for each coordinate (dimension), one calculates the stds of those coordinate values of the r and v vector sets, and combines them in ratio with those coordinate values of mr and mv. Is that the same as projecting the r-set and v-set onto the d-line (using R∘d and V∘d) and then using the stds of those shadow lengths to adjust the cut-point? Or would that be yet a better way to do it? The next slide shows this approach, and then we compare. [Figure: r's and v's with the d-line.]

106 4. FAUST Oblique: X any set of vectors, D ≡ mr→mv, d = D/|D|
P(X∘d) < a = P(Σ di·Xi < a). To separate r from v using the projection means and standard deviations as the cut-point, calculate a as follows: a = pmr + (pmv−pmr)·pstdr/(pstdr+pstdv). E.g., by pmr we mean the distance mr∘d along the d-line. [Figure: projections of the r's and v's on the d-line, showing pmr and pmv.]

107 Oblique FAUST (level-0 case): results for the cut-point variants
[Results residue: Satlog class means and stds for R, G, ir1, ir2, and per-class (1,2,3,4,5,7) true/false-positive counts with class totals for: non-oblique level-0; non-oblique level-1 (gte50%); midpoint oblique level-0 (without eliminating classes as they are predicted); coordinatewise-stds oblique level-0 (without class elimination); projected-stds oblique level-0 (without class elimination); and projected-stds oblique level-0 with class elimination in 2,3,4,5,6,7,1 order.]

108 MYRRH. A hop is a relationship, R (it hops from one entity, E, to another, F). 1/7/12. Strong Rule Mining (SRM) finds all frequent and confident rules, A⇒C (non-transitive if A,C⊆E, the ARM case; transitive if A⊆E, C⊆F). [Figure: the relationship matrix R(E,F).] Frequency can lower-bound the antecedent, the consequent, or both (ARM: both). Its justification is the elimination of insignificant cases; its purpose is the tractability of SRM: ct(&e∈A∪C Re) ≥ mnsp. Confidence lower-bounds the frequency of both over the frequency of the antecedent: ct(&e∈A Re & &e∈C Re)/ct(&e∈A Re) ≥ mncf. The crux of SRM is frequency counts. To compare these counts meaningfully, they must be on the same entity (the focus entity). SRMs are categorized by the number of hops, k, whether transitive or non-transitive, and by the focus entity. ARM is 1-hop, E-non-transitive (A,C⊆E), F-focused SRM (1nF). (How does one define non-transitive for multi-hop SRM?) 1-hop, transitive (A⊆E, C⊆F), F-focused SRM (1tF) APRIORI: ct(&e∈A Re) ≥ mnsp; ct(&e∈A Re & PC)/ct(&e∈A Re) ≥ mncf. 1. (antecedent downward closure) If A is frequent, all of its subsets are frequent; or, if A is infrequent, then so are all of its supersets. Since frequency involves only A, we can mine for all qualifying antecedents efficiently using downward closure. 2. (consequent upward closure) If A⇒C is non-confident, then so is A⇒D for all subsets D of C. So, for each frequent antecedent A, use upward closure to mine for all of its confident consequents. The theorem we demonstrate throughout this section: for transitive (a+c)-hop Apriori strong rule mining with a focus entity which is a hops from the antecedent and c hops from the consequent, if a (resp. c) is odd/even, then one can use downward/upward closure on that step in the mining of strong (frequent and confident) rules. In this case A is 1 hop from F (odd: use downward closure) and C is 0 hops from F (even: use upward closure). We will be checking more examples to see if the odd⇒downward, even⇒upward theorem seems to hold. 1-hop, transitive, E-focused rule, A⇒C SRM (1tE): |A| = ct(PA) ≥ mnsp; ct(PA & &f∈C Rf)/ct(PA) ≥ mncf. 1. (antecedent upward closure) If A is infrequent, then so are all of its subsets. 2. (consequent downward closure) If A⇒C is non-confident, then so is A⇒D for all supersets D of C. In this case A is 0 hops from E (even: use upward closure) and C is 1 hop from E (odd: use downward closure). (A sketch of the 1tF test appears below.)
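A sketch of the 1tF support/confidence test with pTrees over F as Python ints (bit f = 1 iff the pair is related); R, PC and the thresholds below are made-up illustrations.

def ct(p):
    return bin(p).count("1")

def strong_1tF(A, PC, R, mnsp, mncf):
    # Assumes A is nonempty.
    andA = ~0
    for e in A:                       # &_{e in A} R_e
        andA &= R[e]
    freq = ct(andA)
    if freq < mnsp:                   # fail fast: downward closure then
        return False                  # prunes all supersets of A
    return ct(andA & PC) / freq >= mncf

R = {1: 0b10110, 2: 0b10011, 3: 0b01110}
print(strong_1tF({1, 2}, PC=0b00010, R=R, mnsp=2, mncf=0.5))   # True

The early support exit is exactly where the downward-closure pruning of supersets of A would hook into an Apriori-style enumeration.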

109 ct(&f&eAReSf & PC) / &f&eAReSf
2-hop transitive, F-focused (focus on the middle entity, F), 2tF: A⇒C is strong if ct(&e∈A Re) ≥ mnsp and ct(&e∈A Re & &g∈C Sg)/ct(&e∈A Re) ≥ mncf. 1. (antecedent downward closure) If A is infrequent, then so are all of its supersets. 2. (consequent downward closure) If A⇒C is non-confident, so is A⇒D for all supersets D. 3. Apriori for 2-hops: find all frequent antecedents A using downward closure. For each: find C1G, the set of g's such that A⇒{g} is confident; find C2G, the set of C1G pairs that are confident consequents for antecedent A; find C3G, the set of triples (from C2G) such that all subpairs are in C2G (à la Apriori); etc. The numbers of hops from the focus are 1 and 1, both odd, so both have downward closure. Standard ARM can be viewed as 2tF where E=G, A∩C is empty and S = R-transpose; thus we have no non-transitive situation anymore, so we can drop the t versus n and call this 2F. 2G: ct(&f∈(&e∈A Re) Sf) ≥ mnsp; ct(&f∈(&e∈A Re) Sf & PC)/ct(&f∈(&e∈A Re) Sf) ≥ mncf. The numbers of hops from the focus are 2 and 0, both even, so both have upward closure. 1. (antecedent upward closure) If A is infrequent, then so are all of its subsets. 2. (consequent upward closure) If A⇒C is non-confident, so is A⇒D for all subsets D. 2E: ct(PA) ≥ mnsp; ct(PA & &f∈(&g∈C Sg) Rf)/ct(PA) ≥ mncf. The numbers of hops from the focus are 0 and 2, both even, so both have upward closure. 1. (antecedent upward closure) If A is infrequent, then so are all of its subsets. 2. (consequent upward closure) If A⇒C is non-confident, so is A⇒D for all subsets D. [Figure: the example relationship matrices R(E,F) and S(F,G).]

110 3E: ct(P_A & &_{f∈(&_{g∈(&_{h∈C}T_h)}S_g)}R_f) / ct(P_A) ≥ minconf; 3H: ct(&_{g∈(&_{f∈(&_{e∈A}R_e)}S_f)}T_g & P_C) / ct(&_{g∈(&_{f∈(&_{e∈A}R_e)}S_f)}T_g) ≥ minconf
3-hop SRM. [Figure: R(E,F), S(F,G), T(G,H), with A⊆E and C⊆H.]
Collapse T: T_C ≡ {g∈G | T(g,h) for some h∈C}. That is just the 2-hop case with T_C⊆G replacing C. (∃ can be replaced by ∀ or any other quantifier; the choice of quantifier should match the one intended for C.) Collapse T and S: S_{T_C} ≡ {f∈F | S(f,g) for some g∈T_C}. Then it is the 1-hop case with S_{T_C} replacing C.
3F: ct(&_{e∈A}R_e) ≥ minsup; ct(&_{e∈A}R_e & &_{g∈(&_{h∈C}T_h)}S_g) / ct(&_{e∈A}R_e) ≥ minconf.
Antecedent downward closure: A infrequent implies its supersets are infrequent; A is 1 hop from F (odd: down).
Consequent upward closure: A→C non-confident implies A→D non-confident for all D⊆C; C is 2 hops from F (even: up).
3G: ct(&_{f∈(&_{e∈A}R_e)}S_f) ≥ minsup; ct(&_{f∈(&_{e∈A}R_e)}S_f & &_{h∈C}T_h) / ct(&_{f∈(&_{e∈A}R_e)}S_f) ≥ minconf.
Antecedent upward closure: A infrequent implies all its subsets are infrequent; A is 2 hops from G (even: up).
Consequent downward closure: A→C non-confident implies A→D non-confident for all D⊇C; C is 1 hop from G (odd: down).
Are the F-focus and G-focus rules different? Yes, because the confidences can be different numbers. [Worked example from the figure: focusing on F, ct(&_{g=1,3,4}S_g & (&_{e∈A}R_e)) / ct(&_{e∈A}R_e) = ct(1001 & 1000 & 1100) / 2 = 1/2; focusing on G, ct(&_{f=2,5}S_f & 1101) / ct(&_{f=2,5}S_f) = ct(0001) / ct(0001) = 1/1 = 1.]
3E: ct(P_A) ≥ minsup; ct(P_A & &_{f∈(&_{g∈(&_{h∈C}T_h)}S_g)}R_f) / ct(P_A) ≥ minconf.
Antecedent upward closure: A infrequent implies its subsets are infrequent; A is 0 hops from E (even: up).
Consequent downward closure: A→C non-confident implies A→D non-confident for all D⊇C; C is 3 hops from E (odd: down).
3H: ct(&_{g∈(&_{f∈(&_{e∈A}R_e)}S_f)}T_g) ≥ minsup; ct(&_{g∈(&_{f∈(&_{e∈A}R_e)}S_f)}T_g & P_C) / ct(&_{g∈(&_{f∈(&_{e∈A}R_e)}S_f)}T_g) ≥ minconf.
Antecedent downward closure: A infrequent implies its supersets are infrequent; A is 3 hops from H (odd: down).
Consequent upward closure: A→C non-confident implies A→D non-confident for all D⊆C; C is 0 hops from H (even: up).
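The collapse step is easy to express with pTrees. Below is a sketch on invented data, where each T_h is stored as a G-bitmap: OR over h∈C realizes the ∃-quantifier, AND the ∀-variant. All names are illustrative.

    T = {1: 0b0110, 2: 0b1001, 3: 0b0011}    # T(G,H) as h -> G-bitmap

    def collapse_exists(T, C):
        out = 0
        for h in C:
            out |= T[h]              # bit g set iff some h in C has T(g,h)
        return out

    def collapse_forall(T, C, width=4):
        out = (1 << width) - 1
        for h in C:
            out &= T[h]              # bit g set iff every h in C has T(g,h)
        return out

    print(bin(collapse_exists(T, {1, 3})))   # 0b111: TC = {g0, g1, g2}
    print(bin(collapse_forall(T, {1, 3})))   # 0b10:  only g1 relates to all of C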

111 * ct(&iCUi) ) ct(&f&eAReSf &h&iCUiTh) / ct(&f&eAReSf)
4-hop SRM. [Figure: R(E,F), S(F,G), T(G,H), U(H,I), with A⊆E and C⊆I.]
Collapse U and R: replace C by U_C and A by R_A as above (not different from the 2-hop case?). Collapse R only: use R_A for A and the 3-hop case. Collapse U only: use U_C for C and the 3-hop case.
4G: ct(&_{f∈(&_{e∈A}R_e)}S_f) ≥ minsup; ct(&_{f∈(&_{e∈A}R_e)}S_f & &_{h∈(&_{i∈C}U_i)}T_h) / ct(&_{f∈(&_{e∈A}R_e)}S_f) ≥ minconf.
4G APRIORI:
1. (antecedent upward closure) If A is infrequent, then so are all of its subsets (the "list" will be larger, so the AND over the list will produce fewer ones). Frequency involves only A, so mine all qualifying antecedents using upward closure.
2. (consequent upward closure) If A→C is non-confident, then so is A→D for all subsets, D, of C (the "list" will be larger, so the AND over the list will produce fewer ones). So, for each frequent antecedent, A, use upward closure to mine out all confident consequents, C.
Variant: F=G=H=genes and S,T = gene-gene interactions, i.e., R(E,G), S_1(G,G), ..., S_n(G,G), U(G,I). What if there is more than one S: S_1, ..., S_n? Then confidence becomes
( ct(S_1(&_{e∈A}R_e, &_{i∈C}U_i)) + ct(S_2(&_{e∈A}R_e, &_{i∈C}U_i)) + ... + ct(S_n(&_{e∈A}R_e, &_{i∈C}U_i)) ) / ( n · ct(&_{e∈A}R_e) · ct(&_{i∈C}U_i) ) ≥ minconf.
If the S cube can be implemented so that counts of the 3-rectangle (shown in blue on the slide) can be made directly, calculation of confidence would be fast.

112 ct(&f&eAReSf) ct( &f&eAReSf &h(& )UiTh ) / ct(&f&eAReSf)
5-hop SRM. [Figure: R(E,F), S(F,G), T(G,H), U(H,I), V(I,J), with A⊆E and C⊆J.]
5G: ct(&_{f∈(&_{e∈A}R_e)}S_f) ≥ minsup; ct(&_{f∈(&_{e∈A}R_e)}S_f & &_{h∈(&_{i∈(&_{j∈C}V_j)}U_i)}T_h) / ct(&_{f∈(&_{e∈A}R_e)}S_f) ≥ minconf.
5G APRIORI:
1. (antecedent upward closure) If A is infrequent, then so are all of its subsets (the "list" will be larger, so the AND over the list will produce fewer ones). Frequency involves only A, so mine all qualifying antecedents using upward closure.
2. (consequent downward closure) If A→C is non-confident, then so is A→D for all supersets, D, of C. So, for each frequent antecedent, A, use downward closure to mine out all confident consequents, C.

113 ct( &f(& )ReSf) ct( &f(& )ReSf &h(& )UiTh) / ct( &f(& )ReSf )
6-hop SRM. [Figure: Q(D,E), R(E,F), S(F,G), T(G,H), U(H,I), V(I,J); the antecedent A is now a set of D's and the consequent C⊆J.]
The conclusion we have demonstrated (but not proven) is: for (a+c)-hop transitive Apriori ARM with a focus entity a hops from the antecedent and c hops from the consequent, use downward closure on a step whose hop count from the focus is odd and upward closure on a step whose hop count is even when mining strong (frequent and confident) rules.
6G: ct(&_{f∈(&_{e∈(&_{d∈A}Q_d)}R_e)}S_f) ≥ minsup; ct(&_{f∈(&_{e∈(&_{d∈A}Q_d)}R_e)}S_f & &_{h∈(&_{i∈(&_{j∈C}V_j)}U_i)}T_h) / ct(&_{f∈(&_{e∈(&_{d∈A}Q_d)}R_e)}S_f) ≥ minconf.
6G APRIORI:
1. (antecedent downward closure) If A is infrequent, then so are all of its supersets. Frequency involves only A, so mine all qualifying antecedents using downward closure.
2. (consequent downward closure) If A→C is non-confident, then so is A→D for all supersets, D, of C. So, for each frequent antecedent, A, use downward closure to mine out all confident consequents, C.
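A tiny sketch that tabulates this parity rule for the variants covered in this section; the (a, c) hop counts are read off the text above, and the code simply maps odd to downward and even to upward. Names are illustrative.

    def closure(hops_from_focus):
        return "downward" if hops_from_focus % 2 else "upward"

    # (variant, hops from focus to antecedent, hops from focus to consequent)
    variants = [("1tF", 1, 0), ("1tE", 0, 1), ("2tF", 1, 1), ("2G", 2, 0),
                ("2E", 0, 2), ("3F", 1, 2), ("3G", 2, 1), ("3E", 0, 3),
                ("3H", 3, 0), ("6G", 3, 3)]
    for name, a, c in variants:
        print(f"{name}: antecedent {closure(a)}, consequent {closure(c)}")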

114 Given any 1-hop labeled relationship (e.g., cells have values from {1,2,…,n})
Given any 1-hop labeled relationship (e.g., cells have values from {1,2,…,n}), there is:
1. a natural n-hop transitive relationship, A→D, obtained by alternating entities for each individual label-value bitmap relationship;
2. cards for each entity consisting of the bit-slices of the cell values.
[Figure: the label-value bitmaps R0(M,C), R1(C,M), R2(M,C), R3(C,M), R4(M,C), R5(C,M) between Movies and Customers, and the generic chain R0(E,F), ..., Rn-2(E,F), Rn-1(E,F) from A⊆E to D⊆F.]
E.g., Netflix: Rating(Customer,Movie) has label set {0,1,2,3,4,5}, so by 1. it generates a bona fide 6-hop transitive relationship. Below, as in 2., the Rn-i can be bit-slices.
R1(A) = "movies rated 1 by all customers in A".
R2(R1(A)) = "customers who rate as 2 all R1(A) movies" = "customers who rate as 2 all movies rated as 1 by all A-customers".
R3(R2(R1(A))) = "movies rated as 3 by all R2(R1(A)) customers" = "movies rated as 3 by all customers who rate as 2 all movies rated as 1 by all A-customers".
R4(R3(R2(R1(A)))) = "customers who rate as 4 all R3(R2(R1(A))) movies" = "customers who rate as 4 all movies rated as 3 by all customers who rate as 2 all movies rated as 1 by all A-customers".
R5(R4(R3(R2(R1(A))))) = "movies rated as 5 by all R4(R3(R2(R1(A)))) customers" = "movies rated 5 by all customers who rate as 4 all movies rated as 3 by all customers who rate as 2 all movies rated as 1 by all A-customers".
R0(R5(R4(R3(R2(R1(A)))))) = "customers who rate as 0 all R5(R4(R3(R2(R1(A))))) movies" = "customers who rate as 0 all movies rated 5 by all customers who rate as 4 all movies rated as 3 by all customers who rate as 2 all movies rated as 1 by all A-customers".
The rule is R0(R5(R4(R3(R2(R1(A)))))) → D.
E.g., equity trading on a given day: QuantityBought(Cust,Stock) with labels {0,1,2,3,4,5} (where n means n thousand shares) likewise generates a bona fide 6-hop transitive relationship.
E.g., equity trading, moved similarly: define "moved similarly on a day", giving the relationship StockStock(#DaysMovedSimilarlyOfLast10).
E.g., equity trading, moved similarly 2: define "moved similarly" to mean that stock2 moved similarly to what stock1 did the previous day, and define the relationship StockStock(#DaysMovedSimilarlyOfLast10).
E.g., Gene-Experiment: label values could be expression level; intervalize and go!
Has Strong Transitive Rule Mining (STRM) been done? Are there downward and upward closure theorems for it already? Is it useful? That is, are there good examples of use: stocks, gene-experiment, MBR, the Netflix predictor, ...?
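Here is a minimal sketch, on made-up ratings, of evaluating the first two links of such an alternating chain; the bitmaps, ids and function names are hypothetical.

    def and_all(bitmaps, width):
        out = (1 << width) - 1
        for b in bitmaps:
            out &= b
        return out

    def bits_to_ids(bits):
        return {i for i in range(bits.bit_length()) if bits >> i & 1}

    def hop(Ri, ids, width):
        """Members of the opposite entity related, at this rating, to every id."""
        return bits_to_ids(and_all((Ri[x] for x in ids), width))

    R1 = {0: 0b0110, 1: 0b0111, 2: 0b0010, 3: 0b1110}   # customer -> movies rated 1
    R2 = {0: 0b1010, 1: 0b1011, 2: 0b0110, 3: 0b1110}   # movie -> customers rating it 2

    A = {0, 1}
    movies1 = hop(R1, A, 4)        # {1, 2}: movies rated 1 by all of A
    cust2 = hop(R2, movies1, 4)    # {1}: customers rating all those movies as 2
    print(movies1, cust2)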

115 ct(&iABBi &tDBt) / ct(&iABBi)  mncf
Let Types be an entity which clusters Items (moves Items up the semantic hierarchy); e.g., in a store, Types might include dairy, hardware, household, canned, snacks, baking, meats, produce, bakery, automotive, electronics, toddler, boys, girls, women, men, pharmacy, garden, toys, farm. Let A be an ItemSet wholly of one Type, T_A, and let D be a TypesSet which does not include T_A. [Figure: BoughtBy(I,C) relating Items 1-20 to Customers 1-5, and Buys(C,T) relating Customers to Types, where B(c,t)=1 iff there is an item i of type t such that BB(i,c)=1.] Then A→D might mean any of the following (see the sketch after this list):
If ∀i∈A, BB(i,c), then ∀t∈D, B(c,t).
If ∀i∈A, BB(i,c), then ∃t∈D s.t. B(c,t).
If ∃i∈A s.t. BB(i,c), then ∀t∈D, B(c,t).
If ∃i∈A s.t. BB(i,c), then ∃t∈D s.t. B(c,t).
A→D confident might mean:
ct(&_{i∈A}BB_i & &_{t∈D}B_t) / ct(&_{i∈A}BB_i) ≥ minconf
ct(&_{i∈A}BB_i & |_{t∈D}B_t) / ct(&_{i∈A}BB_i) ≥ minconf
ct(|_{i∈A}BB_i & |_{t∈D}B_t) / ct(|_{i∈A}BB_i) ≥ minconf
ct(|_{i∈A}BB_i & &_{t∈D}B_t) / ct(|_{i∈A}BB_i) ≥ minconf
A→D frequent might mean:
ct(&_{i∈A}BB_i) ≥ minsup; ct(|_{i∈A}BB_i) ≥ minsup; ct(&_{t∈D}B_t) ≥ minsup; ct(|_{t∈D}B_t) ≥ minsup; ct(&_{i∈A}BB_i & &_{t∈D}B_t) ≥ minsup; etc.
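A sketch of how these quantifier choices play out with pTrees, on invented data: AND over a set of pTrees realizes "for all", OR realizes "there exists". BB maps items to customer bitmaps and B maps types to customer bitmaps; all names and values are illustrative.

    def ct(b):
        return bin(b).count("1")

    def agg(bitmaps, op, width=5):
        out = (1 << width) - 1 if op == "&" else 0
        for b in bitmaps:
            out = out & b if op == "&" else out | b
        return out

    BB = {1: 0b10110, 2: 0b10011}      # item -> customers who bought it
    B = {1: 0b01110, 2: 0b10101}       # type -> customers who bought that type

    def confidence(A, D, ante_op, cons_op):
        ante = agg((BB[i] for i in A), ante_op)
        cons = agg((B[t] for t in D), cons_op)
        return ct(ante & cons) / ct(ante)

    for ao in "&|":
        for co in "&|":
            print(f"antecedent {ao}, consequent {co}:",
                  confidence({1, 2}, {1, 2}, ao, co))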

116 A thought on impure pTrees (i.e., with the predicate ≥50% ones)
A thought on impure pTrees (i.e., with the predicate ≥50% ones). The training set was ordered by class (all setosas came first, then all versicolors, then all virginicas) so that level-1 pTrees could be chosen not to span classes much. Take an image as another example: if the classes are RedCars, GreenCars, BlueCars, ParkingLot, Grass, Trees, etc., and Peano ordering is used, what if a class spans Peano squares completely? We now create pTrees from many different predicates. Should we create pTreeSets for many different orderings as well? This would be a one-time expense. It would consume much more space, but space is not an issue, and with more pTrees our PGP-D protection scheme would automatically be more secure. So move the first column of values to the far right for the 1st additional Peano pTreeSet, the first 2 columns to the right for the 2nd Peano pTreeSet, and the first 3 for the 3rd Peano pTreeSet.

117 Move the last column to the left for the 4th, the last 2 columns left for the 5th, and the last 3 left for the 6th additional Peano pTreeSet. For each of these 6 additional Peano pTreeSets, make the same moves vertically (64 Peano pTreeSets in all); e.g., the 25th would start with the 4th horizontal shift, directly above.

118 What about this? Look at the vertical expansions of the 2nd additional pTreeSet (the 13th and 14th additional pTreeSets, respectively). If we're given only pixel reflectance values for GreenCar, then we have to rely on individual pixel reflectances, right? In that case, we might as well just analyze each pixel for GreenCar characteristics, and then we would not benefit from this idea, except that we might be able to data mine GreenCars using level-2 only. Question: how are the training set classes given to us in Aurora, etc.? Are we just given a set of pixels that we're told are GreenCar pixels, or are we given anything that would allow us to use the shapes of GreenCars to identify them? That is, are we given a training set of GreenCar pixels together with their relative positions to one another, or anything like that? The green car is now centered in a level-2 pixel, assuming the level-2 stride is 16 (and the level-1 stride is 4).

119 Notice that a left move of 3 is the same as a right move of 1 (and left 2 is the same as right 2; left 1 is the same as right 3). Thus we have only 4² = 16 orderings (not 64) at level-2, 4¹ = 4 at level-1, and 4ⁿ at level-n. Essentially, the upper-right corner can be in any one of the cells of a level-n pixel, and there are 4ⁿ such cells. If we always create pure1, pure0 (for complements of pure1) and GTE50% predicate trees, there would be 3·4ⁿ separate pTreeSets. Then the question is how to order pixels after a left (or up) shift. We could actually shift and then use the usual Peano order, or we could keep each cell ordering as much the same as possible. One thought is to do the shifting at level-0 and percolate it upward, but we have to understand what that means. We certainly wouldn't store shifted level-0 pTreeSets, since they are the same pixelization. So: construct shifted level-n pixelizations (n>0) concurrently by considering, one at a time, all level-0 pixel shifts, creating an additional pTreeSet only when it yields a new pixelization (e.g., only the first level-0 pixel shift produces a new pixelization at level-1; only the first 3 at level-2; only the first 7 at level-3; etc.). Throw away the bogus level-n pixels (e.g., at right, throw away the right column of level-2 pixels, since it isn't part of the bona fide image). Start with a fresh Z-ordering (the 2nd option).
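A sketch of enumerating the distinct shifted pixelizations: at level n (stride 2ⁿ) only shifts modulo the stride change the level-n pixel boundaries, giving the 4ⁿ orderings counted above. Pure Python on a toy image; all names are illustrative, and cyclic shifting stands in for the shift-plus-discard-bogus-pixels procedure described in the text.

    def shifted_orderings(image, level):
        """Yield (dx, dy, cyclically shifted image) for each distinct shift."""
        stride = 2 ** level
        h, w = len(image), len(image[0])
        for dy in range(stride):
            for dx in range(stride):
                yield dx, dy, [[image[(r + dy) % h][(c + dx) % w]
                                for c in range(w)] for r in range(h)]

    img = [[r * 4 + c for c in range(4)] for r in range(4)]
    print(sum(1 for _ in shifted_orderings(img, 1)))   # 4 orderings at level-1
    print(sum(1 for _ in shifted_orderings(img, 2)))   # 16 orderings at level-2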

120 MYRRH pTree-based ManY-Relationship-Rule Harvester
RoloDex Model: 2 entities, many relationships. MYRRH, the pTree-based ManY-Relationship-Rule Harvester, uses pTrees for ARM over multiple relationships. Supp(A) = CusFreq(ItemSet); Conf(A→B) = Supp(A∪B) / Supp(A).
[Figures: the RoloDex "card" examples: customer-rates-movie card (ratings 1-5); cust-item card; people-items (PI) cards; course enrollments; term-doc card; author-doc card; doc-doc card; gene-gene card (ppi); term-term card (share stem?); exp-PI card; exp-gene card. The slide also shows the DataCube model for 3 entities (items, people, terms) and the equivalent Relational Model tables for Items, People, Terms and the Relationship; their cell values were lost in extraction.]

121 APPENDIX: MYRRH_2e_2r (standard pARM is MYRRH_2e_1r)
E.g., Rate5(Cust,Book), or R5(C,B), and Purchase(Book,Cust), or P(B,C). [Figure: R5(C,B) as R(E,F) and P(B,C) as S(E,F), over C = {1,...,5} and B = {1,...,4}, with pre-computed 1-counts of the R5, P and B pTrees by book and by customer.]
Rule: if customer c rates book b as 5, then c purchases b. For b∈B: {c | rate5(b,c)=y} ⊆ {c | purchase(c,b)=y}. Strong if ct(R5pTree_b & PpTree_b) / ct(R5pTree_b) ≥ minconf and ct(R5pTree_b) / size(R5pTree_b) ≥ minsup.
Speed of AND: R5pTreeSet & PpTreeSet? (Compute each ct(R5pTree_b & PpTree_b).) Can the slice counts, b∈B, of ct(R5pTree_b & PpTree_b) be obtained with one AND?
Schema: size(C) = size(R5pTree_b) = size(BpTree_b) = 4; size(B) = size(R5pTree_c) = size(BpTree_c) = 4.
More generally, given e∈E, if R(e,f) then S(e,f): ct(R_e & S_e) / ct(R_e) ≥ minconf and ct(R_e) / size(R_e) ≥ minsup. The quantified variants:
If ∀e∈A R(e,f), then ∀e∈B S(e,f): ct(&_{e∈A}R_e & &_{e∈B}S_e) / ct(&_{e∈A}R_e) ≥ minconf, ...
If ∀e∈A R(e,f), then ∃e∈B S(e,f): ct(&_{e∈A}R_e & |_{e∈B}S_e) / ct(&_{e∈A}R_e) ≥ minconf, ...
If ∃e∈A R(e,f), then ∀e∈B S(e,f): ct(|_{e∈A}R_e & &_{e∈B}S_e) / ct(|_{e∈A}R_e) ≥ minconf, ...
If ∃e∈A R(e,f), then ∃e∈B S(e,f): ct(|_{e∈A}R_e & |_{e∈B}S_e) / ct(|_{e∈A}R_e) ≥ minconf, ...
Consider 2 customer classes, Class1 = {C=2,3} and Class2 = {C=4,5}; then P(B,C) is a training set. Book B=4 is very discriminative of Class1 and Class2 (e.g., Class1 = salary > $100K). [Table: the DiffSup values for B=1,...,4 and the pairs P1 = {B=1,2}, P2 = {B=3,4}; the cell values were lost in extraction.] P1 [and P2, B=2 and B=3] is somewhat discriminative of the classes, whereas B=1 is not.
Are "Discriminative Patterns" covered by ARM? E.g., does the same information come out of strong rule mining? Does DP yield information across multiple relationships, e.g., determining the classes via the other relationship?
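A hedged sketch of the slice-count check above: for each book b, AND R5pTree_b with PpTree_b over customers and test the confidence. The bitmaps below are invented stand-ins for the figure's data.

    def ct(b):
        return bin(b).count("1")

    R5 = {1: 0b0101, 2: 0b0011, 3: 0b1001}   # book -> customers rating it 5
    P = {1: 0b0111, 2: 0b0010, 3: 0b1101}    # book -> customers purchasing it

    def confident_books(minconf):
        out = []
        for b in R5:
            sup = ct(R5[b])
            # ct(R5pTree_b & PpTree_b) / ct(R5pTree_b) >= minconf
            if sup and ct(R5[b] & P[b]) / sup >= minconf:
                out.append(b)
        return out

    print(confident_books(0.75))   # [1, 3] on this toy data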

122 Let ASL be {6,7} and CPW be {1,2}
Making 3-hops: use 4 feature attributes of an entity. For IRIS(SL,SW,PL,PW), form L(SL,PL), P(PL,PW), W(PW,SW). Let A_SL be {6,7} and C_PW be {1,2}. [Table: stride=10 level-1 values of SL, SW, PL and rnd(PW/10) for the setosa, versicolor and virginica samples; figures: the L(SL,PL), P(PL,PW) and W(PW,SW) cards. The numeric cell values were lost in extraction.]
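A sketch of building the three hops from the four IRIS feature columns as pair-sets. The rows below are invented stand-ins for the binned IRIS values in the (lost) table above.

    # (SL, SW, PL, PW) rows, hypothetical binned values
    rows = [(5, 3, 1, 0), (6, 3, 4, 1), (7, 3, 5, 2), (6, 2, 4, 1)]

    L = {(sl, pl) for sl, _, pl, _ in rows}    # L(SL, PL)
    P = {(pl, pw) for _, _, pl, pw in rows}    # P(PL, PW)
    W = {(pw, sw) for _, sw, _, pw in rows}    # W(PW, SW)

    A_SL = {6, 7}                              # antecedent on SL, as in the title
    print(A_SL, sorted(L), sorted(P), sorted(W), sep="\n")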

123 ct(&bAPb &cDEc) / ct(&bAPb)  minconf
2-hop transitive rules (specific examples).
2-hop Enroll/Book. [Figure: P(B,S), E(S,C) relating Books, Students and Courses, with A⊆B and D⊆C.] A→D: if ∀b∈A P(b,s), then ∀c∈D E(s,c), is a strong rule if ct(&_{b∈A}P_b) ≥ minsupp and ct(&_{b∈A}P_b & &_{c∈D}E_c) / ct(&_{b∈A}P_b) ≥ minconf. If a student purchases every book in A, then that student is likely to enroll in every course in D, and lots of students purchase every book in A. In short, P(A,s)→E(s,D) is confident and P(A,s) is frequent.
2-hop Purchase Dec/Jan. [Figure: PD(I,C), PJ(C,I) relating Items and Customers.] A→D: if ∀i∈A PD(i,c), then ∀i∈D PJ(c,i), is a strong rule if ct(&_{i∈A}PD_i) ≥ minsupp and ct(&_{i∈A}PD_i & &_{i∈D}PJ_i) / ct(&_{i∈A}PD_i) ≥ minconf. If a customer purchases every item in A in December, then that customer is likely to purchase every item in D in January, and lots of customers purchase every item in A in December: PD(A,c)→PJ(c,D) is confident and PD(A,c) is frequent.
2-hop Event/Buy. [Figure: O(E,P), B(P,I) relating Events, People and Items.] A→D: if ∀e∈A O(e,p), then ∀i∈D B(p,i), is a strong rule if ct(&_{e∈A}O_e) ≥ minsupp and ct(&_{e∈A}O_e & &_{i∈D}B_i) / ct(&_{e∈A}O_e) ≥ minconf. If every event in A occurred in a person's life last year, then that person is likely to buy every item in D this year, and lots of people had every event in A occur last year: O(A,p)→B(p,D) is confident and O(A,p) is frequent.

124 ct(&eAOe &mDTm) / ct(&eAOe)  minconf
2-hop stock trading. [Figure: O(E,S), T(S,M) relating Events, Stocks and Moves.] A→D: if ∀e∈A O(e,s), then ∀m∈D T(s,m), is a strong rule if ct(&_{e∈A}O_e) ≥ minsupp and ct(&_{e∈A}O_e & &_{m∈D}T_m) / ct(&_{e∈A}O_e) ≥ minconf. If every event in A occurs for a company in time period 1, then the price of that stock experiences every move in D in time period 2, and lots of companies had every event in A occur in period 1: O(A,s)→T(s,D) is confident and O(A,s) is frequent. (T = True; e.g., m=1 down a lot, m=2 down a little, m=3 up a little, m=4 up a lot.)
2-hop commodity trading. [Figure: O(E,C), T(C,M) relating Events, Commodities and Moves.] A→D: if ∀e∈A O(e,c), then ∀m∈D T(c,m), is a strong rule if ct(&_{e∈A}O_e) ≥ minsupp and ct(&_{e∈A}O_e & &_{m∈D}T_m) / ct(&_{e∈A}O_e) ≥ minconf. If every event in A occurs for a commodity in time period 1, then the price of that commodity experiences every move in D in time period 2, and lots of commodities had every event in A occur in period 1: O(A,c)→T(c,D) is confident and O(A,c) is frequent.
2-hop facebook friends buying. [Figure: F(P,P), B(P,I).] F(p,q)=1 iff q is a facebook friend of p; B(p,i)=1 iff p buys item i. A→D: if ∀p∈A F(p,q), then ∀i∈D B(q,i), is a strong rule if ct(&_{p∈A}F_p) ≥ minsupp and ct(&_{p∈A}F_p & &_{i∈D}B_i) / ct(&_{p∈A}F_p) ≥ minconf. People befriended by everyone in A (F_A = &_{p∈A}F_p for short) likely buy everything in D, and F_A is large. So every time a new person appears in F_A, that person is sent ads for the items in D.

125 ct(&aAAOa&gCB'g)/ct(&aAAOa)mncf ct( AOa=4)mnsp
How do we construct interesting 2-hop examples? Method 1: use a feature attribute of a 1-hop entity. Start with a 1-hop (e.g., customers buy items, stocks have prices, people befriend people), then focus on one feature attribute of one of the entities. The relationship is the projection of that entity's table onto the feature attribute and the entity-id attribute (key), e.g., Age, Gender, Income Level, Ethnicity, Weight, Height, ... of the people or customer entity. These are not bona fide 2-hop transitive relationships, since they are many-to-one, not many-to-many (the original entity is the primary key of its feature table). Thus we don't get a fully transitive relationship, since collapsing the original entity leaves nearly the same information as the transitive situation was intended to add. An example: if, from the new transitive relationship AgeIsAgeOfCustomerPurchasedItem, Customer is collapsed, we have AgePurchaseItem, and the Customer-to-Age information is still available in the Customer table. The relationship between Customers and Items is lost, but presumably the reason for mining AgeIsAgeOfCustomerPurchaseItem is to find AgePurchaseItem rules independent of the customers involved. Then, when a high-confidence Age-implies-Item rule is found, the customers of that age can be looked up from the Customer feature table and sent a flyer for that item. Also, in CustomerPurchaseItem the antecedent, A, could have been chosen to be an age group, so most AgePurchaseItem information would come out of CustomerPurchaseItem directly.
Given a 1-hop relationship, R(E,F), and a feature attribute, A, of E, if there is a pertinent way to raise E up the semantic hierarchy (cluster it), producing E', then the relationship between A and E' is many-to-many; e.g., cluster Customers by Income Level, IL; then AgeIsAgeOfIL is a many-to-many relationship. Note that what we are really doing here is using the many-to-many relationship between two feature attributes in one of the entity tables and then replacing the entity by the second feature. E.g., if B(C,I) is a relationship and IL is a feature attribute in the entity table C(A,G,IL,E,W,H), then clustering (classifying) C by IL produces a relationship, B'(IL,I), given by B'(il,i)=1 iff B(c,i)=1 for ≥50% of c∈il, which is many-to-many provided IL is not a candidate key. So from the 1-hop relationship C-B(C,I)-I we get a bona fide 2-hop relationship A-AO(A,IL)-IL-B'(IL,I)-I.
Frequency and confidence: ct(&_{a∈A}AO_a) ≥ minsup; ct(&_{a∈A}AO_a & &_{g∈C}B'_g) / ct(&_{a∈A}AO_a) ≥ minconf.
[Worked example from the figure, with B'_{il=1} = B_{c=2}, B'_{il=2} = B_{c=3} OR B_{c=5}, B'_{il=3} = B_{c=4}: via the 2-hop AO/B' route, ct(AO_{a=4}) = 1 ≥ minsup and ct(AO_{a=4} & &_{g=3,4}B'_g) / ct(AO_{a=4}) = ct(010 & 100 & 110) / ct(010) = 0/1; via the 1-hop route, ct(&_{c∈C(A)}B_c) = 2 ≥ minsup and ct(&_{c∈C(A)}B_c & 0011) / ct(&_{c∈C(A)}B_c) = 1/2. So these are different rules.]
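A sketch of lifting B(C,I) to B'(IL,I) as defined above: B'(il,i)=1 iff B(c,i)=1 for at least 50% of the customers c in income-level cluster il. The customer, item and cluster data are invented.

    B = {1: {2, 3}, 2: {1, 2}, 3: {2}, 4: {1, 4}, 5: {2, 4}}   # customer -> items bought
    IL = {1: {2}, 2: {3, 5}, 3: {4}}                           # income level -> customers

    def lift(B, clusters, items, threshold=0.5):
        Bp = {}
        for il, custs in clusters.items():
            Bp[il] = {i for i in items
                      if sum(i in B[c] for c in custs) / len(custs) >= threshold}
        return Bp

    print(lift(B, IL, items={1, 2, 3, 4}))   # {1: {1, 2}, 2: {2, 4}, 3: {1, 4}}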


Download ppt "Research of William Perrizo, C.S. Department, NDSU"

Similar presentations


Ads by Google