
From: Mark Silverman
Sent: Wed, May 28, 2014 11:48
Hypothesis: How do you algorithmically resolve conflicts between SPTSs? You place the point in the largest found cluster where the density is greater than the threshold. So point4 may not be a singleton, depending on density. It's a working theory; I will work through some examples, but I think this way may resolve my concerns on sparse, highly dimensional data sets.

This first example shows that recursion is important. A slightly different example shows that recursion order is also important.

Reply, May 29, 2014, 10:31 AM: One point of view is this. The dot-product functional is distance dominated, meaning the distance on the d-line between L_d(x) = x o d and L_d(y) = y o d is always less than or equal to distance(x,y). So any gap in the projected L_d values IS a gap, probably bigger, in the space; but many gaps in the space may not show up as gaps on a projection d-line. Therefore there is never a conflict between SPTS gap-enclosed clusters. Consecutive SPTS gaps imply cluster(s), but not necessarily vice versa (and while a PCI followed by a PCD implies a cluster, there can be ambiguous nesting). Each SPTS reveals only some of the gaps, and the SPTS gap sizes are conservative. That is why recursion is used. E.g., the gap of 110 between point4 and its complement is revealed by the SPTS Attr1. We then use the pTree mask restricting to the remaining points, and there Attr2 reveals 2 more gaps (missed by Attr1): gap2 = 100 between {1,3,8} and {2,5,6}, and gap3 = 22 between {2,5,6} and {7}. But all projection gaps are legitimate gaps, and there are no conflicts. (The example table X(Row, Attr1, Attr2) was lost in extraction.)

In the second example, L_e2 = Attr2 reveals no gaps. So then L_e1 = Attr1 is applied to all of X and reveals a gap of at least 100 for point5, but the substantial gap of 50 between {1,2,3,4} and {6,7,8} is missed. If it were done in the other order: L_e1 = Attr1 is applied to all of X and reveals a gap of at least 100 for point5 (the gap is actually larger); {1,2,3,4,6,7,8} and {5} are split, and {5} is declared an outlier and declared finished; then L_e2 = Attr2 is applied to X-{5} = {1,2,3,4,6,7,8} and reveals a gap of at least 50 between {1,2,3,4} and {6,7,8}. So the standard deviation (StD) ordering doesn't always reveal the best order!
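A minimal sketch of the recursive gap-splitting described above, assuming a small in-memory numpy table in place of actual SPTSs/pTrees: project onto each coordinate direction, split at any projection gap wider than a threshold, and recurse on each piece (singleton pieces are outlier candidates). Function and variable names are illustrative only.

```python
import numpy as np

def gap_split(X, idx, gap_thresh=20, dims=None):
    """Recursively split the rows `idx` of X at 1-D projection gaps.

    X: (N, n) array; idx: array of row indices; returns a list of index arrays.
    In-memory stand-in for SPTS/pTree gap processing (assumption)."""
    if dims is None:
        dims = range(X.shape[1])
    for d in dims:
        vals = X[idx, d]
        order = np.argsort(vals)
        gaps = np.diff(vals[order])
        cut = np.argmax(gaps) if len(gaps) else -1
        if cut >= 0 and gaps[cut] > gap_thresh:
            left = idx[order[:cut + 1]]
            right = idx[order[cut + 1:]]
            # recurse on both sides; a singleton piece is an outlier candidate
            return gap_split(X, left, gap_thresh, dims) + \
                   gap_split(X, right, gap_thresh, dims)
    return [idx]          # no attribute reveals a gap: one cluster

# toy usage
X = np.array([[1, 1], [3, 1], [2, 2], [3, 3], [120, 2], [9, 3], [15, 1], [14, 2]])
clusters = gap_split(X, np.arange(len(X)))
print([c.tolist() for c in clusters])
```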

From: [sender elided]
Sent: Thu, May 29, 2014
I see the confusion. You start with any SPTS, then recurse through the unclustered points {recurse on each sub-cluster set produced by that SPTS}.

I'm thinking how to do this in Hadoop in parallel. I have different SPTSs arriving at different processors, so my issue is what happens if these processors arrive at different conclusions. [Reply:] Good question! Possible answer: parallelize incrementally (as the recursion progresses). Let's assume the entire dataset is replicated to all nodes (if not, modifications need to be made). Do the first dot-product SPTS splitting. We get a mask pTree for each gap-enclosed sub-cluster (a mask pTree, in Hadoop, is a horizontal array of bits specifying which documents are in the sub-cluster). In parallel, send each of those mask pTrees to a different node (also send all of those mask pTrees to the designated dendrogram-building node).

Second issue is that the entire dataset cannot fit into memory, so getting to density is a challenge. [Reply:] If we deal strictly with gaps, density may not be necessary. In any case, barrel density could give a good approximation. Once the linear distribution is done, compute the distribution of radial reach distances from the d-line. The max (or last PCD) of those numbers should give you a good barrel radius. From here one could simply take the max of the max radial reach (mrr) and the linear projection radius (lpr) as a radius, r. The volume would be roughly r^n, to be divided into the count for density. If that proves too rough, the actual barrel volume is roughly mrr^(n-1) * lpr.

From: Mark Silverman
Sent: Thu, May 29, 2014
Each processor sees only some of the attributes, not all. I'm thinking we have one pass through the attributes to get stats, and also we can pull the radius. Then we can sequentially process. The only issue now is cluster merge, which I don't see a way around, but it would be a huge benefit if we can figure it out, which we will :) Sparseness is a major issue; for example, in a million docs the word "arrrgghhh" in a single document could cluster. [Reply:] We might want it to be an outlier (2 emails with "arrrgghhh" probably came from the same sender?). Or we might make a pass through our vocabulary, flagging those words that we don't want to trigger a cut; then we would not use those columns as SPTSs.

Clearly we need to apply some sort of relevance tuning, probably better than stddev. [Reply:] Agree. The previous slide's last example shows that stddev can fail.

Sparseness breaks k-means because the average of any attribute will tend to be pretty close to zero, so I'm trying to keep that in mind. The other issue is that we cannot have a single processor manage every attribute, so we need this notion of merging clusters somehow; that's why I've been thinking about the issue where two processors make two different decisions. I have to think about that more. [Reply:] We could think about those decisions as separate collections of mask pTrees; then we just AND the collections for the combined dendrogram level?

Consider also streaming data, where later-arriving data may need to be placed either in an existing cluster or a new one. [Reply:] I think it's the same problem? We should always save the SPTS cut-points and use those to put new data in the proper cluster. If a new data point, y, is in the middle of a previous gap, it is the start of a new cluster? We could just build dis²(X,y) and calculate the minimum to decide.

I'm going to assume:
1. Gap-based clustering (or splitting) will suffice in text mining (no need to split at PCCs also).
Gap-based clustering never requires a dendrogram and therefore never requires density nor fusion. When I use the term "gap" I will mean gap > Gap_Threshold for some chosen GT.
2. The initial dispersal of attributes to nodes is a partition (mutually exclusive). Then, for d, to get the "X dot d" SPTS, use a master-slave scheme: slaves compute their parts and send their partial SPTSs to the master to be added onto the final SPTS. Of course, if d = e_k, the SPTS is just the given k-th column and is available at one node without any required computation.

Thought: a cluster subset is always expressed as a mask pTree (a row of bits in Hadoop). Those results can be sent to a master node to build the full clustering (by ANDing each pair coming from different sites). I think such a master could keep up with the slaves working in parallel. We just have to AND all pairs of mask pTrees, one coming from each of two different sites. The result is one clustering of the entire dataset (the one our gap-based FAUST Oblique algorithm produces for the chosen Minimum Gap Threshold). In the text-mining case I am going to guess that using only the e_k's will suffice (we won't even need combinations of them). Said another way: a gap in any "X dot d" SPTS gives us two disjoint mask pTrees that split the space in two, and we know that no bona fide cluster subset can span that divide (so there should never be a need to fuse). Therefore it doesn't matter where cluster subsets (or splits) are revealed, at one site or by ANDing pairs of mask pTrees from different sites. From this point of view, a merge step may be unnecessary.

Disclaimer: in order to be guaranteed we have found all linear gaps, we may need to employ all unit vectors d (a full partitioning of the unit (n-1)-half-sphere). And there are gaps that are not linear (not hyperplanar).
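A minimal sketch of the pairwise ANDing just described, assuming the mask pTrees are represented as numpy boolean arrays over the same N documents: clusterings produced at two different sites are combined by intersecting every pair of cluster masks and keeping the non-empty results.

```python
import numpy as np

def combine_site_clusterings(site_masks_a, site_masks_b):
    """AND every pair of cluster masks from two sites.

    Each argument is a list of boolean arrays over the same N documents
    (a stand-in for the horizontal mask pTrees described above - assumption).
    Returns the non-empty intersections, which form the combined clustering."""
    combined = []
    for ma in site_masks_a:
        for mb in site_masks_b:
            both = ma & mb
            if both.any():
                combined.append(both)
    return combined

# toy usage: site A split {0..5} as {0,1,2}|{3,4,5}; site B split it as {0,1}|{2,3,4,5}
a = [np.array([1, 1, 1, 0, 0, 0], bool), np.array([0, 0, 0, 1, 1, 1], bool)]
b = [np.array([1, 1, 0, 0, 0, 0], bool), np.array([0, 0, 1, 1, 1, 1], bool)]
for m in combine_site_clusterings(a, b):
    print(np.flatnonzero(m))
```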

Machine Learning = moving data up some concept hierarchy (increasing information and/or reducing volume). ML takes two forms: clustering and classification (unsupervised and supervised). Clustering groups similar objects into a single higher-level object, a cluster. Classification does the same, but is supervised by an existing class assignment function on a Training Set, f: TS → {Classes}.
1. Clustering can be done for Anomaly Detection (detecting those objects that are dissimilar from the rest, which boils down to finding singleton [and/or doubleton...] clusters).
2. Clustering can be done to develop a Training Set for classification of future unclassified objects.

NCL (k is discovered, not specified): assign each (object, class) a ClassWeight, CW ∈ Reals (could be < 0). Classes "take the next ticket" as they're discovered (tickets are 1, 2, ...). Initially all classes are empty and all CWs = 0.
Do, for the next d: compute L_d = X o d, until, after masking off the new cluster, the count is too high (doesn't drop enough).
For the next PCI in L_d (next larger, starting from the smallest):
If it is followed by a PCD, declare the next Class_k and define it to be the set spherically gapped (or PCDed) around the centroid C_k = Avg or VoM_k over L_d^-1[PCI, PCD]. Mask off this ticketed new Class_k and continue.
If it is followed by a PCI, declare the next Class_k and define it to be the set spherically gapped (or PCDed) around the centroid C_k = Avg or VoM_k over L_d^-1[(3*PCI_1 + PCI_2)/4, PCI_2). Mask off this ticketed new Class_k and continue.
For the next smaller PCI (starting from the largest) in L_d:
If it is preceded by a PCD, declare the next Class_k and define it to be the set spherically gapped (or PCDed) around the centroid C_k = Avg or VoM_k over L_d^-1[PCD, PCI]. Mask off this ticketed new Class_k and continue.
If it is preceded by a PCI, declare the next Class_k and define it to be the set spherically gapped (or PCDed) around the centroid C_k = Avg or VoM_k over L_d^-1(PCI_2, (3*PCI_1 + PCI_2)/4]. Mask off this ticketed new Class_k and continue.

Machine Learning moves data up a concept hierarchy according to some criterion, which we call a similarity. A similarity is usually a function s: X × X → OrdinalSet such that for every x ∈ X, s(x,x) ≥ s(x,y) for all y ∈ X (every x must be at least as similar to itself as it is to any other object) and s(x,y) = s(y,x). OrdinalSet is usually a subset of {0, 1, ...} (e.g., binary {0,1} = {No, Yes}). Classification is binary clustering: s(x,y) = 1 iff f(x) = f(y), using the part of f which is known to predict that which is unknown.

Pillar pk-means clusterer (k is not specified; it reveals itself).
1a. Choose m_1 maximizing D_1 = Dis(X, avgX). 1b. Check if m_1 is an outlier with S_m1. 1c. Repeat until m_1 is a non-outlier.
2a. Choose m_2 maximizing D_2 = Dis(X, m_1). 2b. Check if m_2 is an outlier with S_m2. 2c. Repeat until m_2 is a non-outlier. 2d. Compute M_1,2 = P_{m_1 ≥ m_2} and M_2,1 = P_{m_1 < m_2}.
3a. Choose m_3 maximizing D_3 = D_2 + Dis(X, m_2). 3b. Check if m_3 is an outlier with S_m3. 3c. Repeat until m_3 is a non-outlier. 3d. Compute M_i,3 = P_{m_i ≥ m_3} and M_3,i = P_{m_i < m_3} for i < 3.
4a. Choose m_4 maximizing D_4 = D_3 + Dis(X, m_3). 4b. Check if m_4 is an outlier with S_m4. 4c. Repeat until m_4 is a non-outlier. 4d. Compute M_i,4 = P_{m_i ≥ m_4} and M_4,i = P_{m_i < m_4} for i < 4.
...
Do until MinDist(m_h, m_k), k < h, < Threshold.
M_j = &_{h≠j} M_j,h are the mask pTrees of the k first-round clusters. Apply pk-means from here on.
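A minimal sketch of the pillar-style seeding above, assuming an in-memory numpy table; the outlier screening (steps 1b-1c, 2b-2c, ...) and the mask-pTree bookkeeping are omitted. Seeding stops once the new pillar falls within the distance threshold of an existing one; the number of pillars found is the revealed k, and these seeds can then be fed to an ordinary k-means.

```python
import numpy as np

def pillar_seeds(X, dist_thresh):
    """Pick pillar seeds until the new pillar is within dist_thresh of an old one.

    X: (N, n) array. Returns the list of chosen pillar points.
    Outlier screening (the S_m check above) is omitted for brevity - assumption."""
    pillars = []
    score = np.linalg.norm(X - X.mean(axis=0), axis=1)     # D1 = Dis(X, avgX)
    while True:
        m = X[np.argmax(score)]
        if pillars and min(np.linalg.norm(m - p) for p in pillars) < dist_thresh:
            break                     # new pillar too close to an existing one: k is revealed
        pillars.append(m)
        d_new = np.linalg.norm(X - m, axis=1)
        # D2 = Dis(X, m1); thereafter D_{h+1} = D_h + Dis(X, m_h)
        score = d_new if len(pillars) == 1 else score + d_new
    return pillars

# toy usage: two well-separated groups yield two pillars
X = np.array([[0., 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]])
print(pillar_seeds(X, dist_thresh=3.0))
```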

FAUST Oblique Analytics. X(X_1..X_n) ⊆ R^n, |X| = N; Classes = {C_1..C_K}; d = (d_1..d_n), |d| = 1; p = (p_1..p_n) ∈ R^n.

Functionals:
L_d,p ≡ (X-p) o d = X o d - p o d = L_d - p o d; Lmin_d,p,k = min(L_d,p & C_k), Lmax_d,p,k = max(L_d,p & C_k).
S_p ≡ (X-p) o (X-p) = X o X + X o (-2p) + p o p; Smin_p,k = min(S_p & C_k), Smax_p,k = max(S_p & C_k).
R_d,p ≡ S_p - L_d,p² = L_{-2p-(2pod)d} + p o p + (p o d)² + X o X - L_d²; Rmin_d,p,k = min(R_d,p & C_k), Rmax_d,p,k = max(R_d,p & C_k).
Geometrically, for a point x, (x-p) o d = |x-p| cos θ is the projection of x onto the d-line through p, and R gives the squared radial (perpendicular) reach from that line: (x-p) o (x-p) - ((x-p) o d)².

FAUST CountChange clusterer: if DensityThreshold is unrealized, cut C at PCCs in L_d,p & C with the next (d,p) ∈ dpSet.
FAUST TopKOutliers: use D2NN = SqDist(x, X') = rank_2 S_x for the TopKOutlier slider. Rk_i Ptr(x, Ptr → rank_i S_x) and Rk_i SD(x, rank_i S_x), ordered descending on rank_i S_x as constructed.
FAUST Linear classifier: y ∈ C_k iff y ∈ LH_k ≡ {z | Lmin_d,p,k ≤ (z-p) o d ≤ Lmax_d,p,k, ∀ (d,p) ∈ dpSet}. LH_k is a hull around C_k; dpSet is a set of (d,p) pairs, e.g., (Diag, DiagStartPt).
FAUST LinearSphericalRadial classifier: y ∈ C_k iff y ∈ LSRH_k ≡ {z | Tmin_d,p,k ≤ T(z) ≤ Tmax_d,p,k, ∀ (d,p) ∈ dpSet, ∀ T = L|S|R}.

Pre-compute what? 1. Column stats (min, avg, max, std, ...); 2. X o X and X o p, p = class Avg/Median; 3. X o d, d = interclass Avg/Median unit vector; 4. X o x, d²(X,x), rank_i d²(X,x), x ∈ X, i = 2, ...; 5. L_d,p and R_d,p ∀ (d,p) ∈ dpSet.
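A minimal numpy sketch of the three functionals as just defined, computed column-wise for a whole table; the SPTS/pTree machinery is replaced by ordinary arrays (an assumption, for illustration only).

```python
import numpy as np

def lsr_functionals(X, p, d):
    """Return the L, S, R columns for table X, point p and unit vector d.

    L = (X-p) o d          (projection onto the d-line through p)
    S = (X-p) o (X-p)      (squared distance to p)
    R = S - L**2           (squared radial reach from the d-line)
    In-memory stand-in for the SPTS/pTree computations (assumption)."""
    d = np.asarray(d, float)
    d = d / np.linalg.norm(d)          # enforce |d| = 1
    diff = X - np.asarray(p, float)
    L = diff @ d
    S = np.einsum('ij,ij->i', diff, diff)
    R = S - L**2
    return L, S, R

def class_bookends(F, class_mask):
    """Hull bookends for one class: min/max of a functional restricted to the class."""
    return F[class_mask].min(), F[class_mask].max()
```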

FAUST Oblique LSR Classification on IRIS150: per-class [min, max] intervals of L_d for d = e_1 (1000), e_2 (0100), e_3 (0010), e_4 (0001), each with p = origin and p = AvgS, AvgE, AvgI (e.g., for d = 0001, p = origin, Setosa spans [1,6]; for d = 1000, p = origin, Setosa spans [43,59]). The remaining tabulated values were not recoverable from the extraction.

LSR IRIS150. Consider all 3 functionals: L, S and R. What's the most efficient way to calculate all 3?

L_p,d ≡ (X - p) o d = L_o,d - p o d, so
minL_p,d,k = min[L_p,d & C_k] = min[L_o,d & C_k] - p o d = min[(X & C_k) o d] - p o d, and
maxL_p,d,k = max[L_p,d & C_k] = max[L_o,d & C_k] - p o d = max[(X & C_k) o d] - p o d.

S_p = (X - p) o (X - p) = X o X + X o (-2p) + p o p = L_{-2p} + (S_o + p o p), where S_o = X o X, so
minS_p,k = min[S_p & C_k] = min[(X o (-2p) + X o X) & C_k] + p o p, and similarly for maxS_p,k.

R_p,d ≡ S_p - L_p,d²; minR_p,d,k = min[R_p,d & C_k], maxR_p,d,k = max[R_p,d & C_k].

Here o = origin; p ∈ R^n; d ∈ R^n with |d| = 1; {C_k}, k = 1..K, are the classes. (On the slide, an operation enclosed in a parallelogram denotes a pTree operation rather than a scalar operation on numeric operands.)

I suggest that we use each of the functionals with each of the (p,d) pairs that we select for application (since, to get R, we need to compute L and S anyway). So it would make sense to develop an optimal (minimum work and time) procedure to create L, S and R for any (p,d) in the set.

APPENDIX

LSR on IRIS150, D_se direction:
y isa OTHER if y o D_se ∈ (-∞, 495) ∪ (802, 1061) ∪ (2725, ∞)
y isa OTHER or S if y o D_se ∈ C_1,1 ≡ [495, 802]
y isa OTHER or I if y o D_se ∈ C_1,2 ≡ [1061, 1270]
y isa OTHER or I if y o D_se ∈ C_1,4 ≡ [2010, 2725]
y isa OTHER or E or I if y o D_se ∈ C_1,3 ≡ [1270, 2010]

On C_1,3 (0 s, 49 e, 11 i), apply D_ei:
y isa O if y o D_ei ∈ (-∞, -117) ∪ (-3, ∞)
y isa O or E or I if y o D_ei ∈ C_2,1 ≡ [-62, -44]
y isa O or I if y o D_ei ∈ C_2,2 ≡ [-44, -3]
On C_2,1 (2 e, 4 i), apply D_ei again:
y isa O if y o D_ei ∈ (-∞, 420) ∪ (459, 480) ∪ (501, ∞)
y isa O or E if y o D_ei ∈ C_3,1 ≡ [420, 459]
y isa O or I if y o D_ei ∈ C_3,2 ≡ [480, 501]
Continue this on clusters with OTHER + one class, so that the hull fits tightly (reducing false positives), using diagonals? C_1,1 is then recursed on with a sequence of further D's (the D vectors themselves were lost in extraction), each step declaring y isa O outside an interval and y isa O|S inside it: [43,58], [23,44], [10,19], [1,6], [68,117], [54,146], [44,100], [36,105], [26,61], [12,91], [81,182], [71,137], [55,169], [39,127], [84,204], [10,22], [3,46].

The amount of work yet to be done, even for only 4 attributes, is immense. For each D, we should fit boundaries for each class, not just one class. For each D, not only cut at min(C o D) and max(C o D) but also limit the radial reach for each class (barrel analytics)? Note that limiting the radial reach limits all other directions (other than the D direction) in one step and therefore by the same amount; i.e., it limits all directions assuming perfectly round clusters. Think about Enron: some words (columns) have high count and others have low count. Our radial reach threshold would be based on the highest count and would therefore admit many false positives. We could cluster directions (words) by count and limit radial reach differently for different clusters? For 4 attributes, I count 77 diagonals * 3 classes = 231 cases. How many in the Enron case with 10,000 columns? Too many for sure!!

Dot Product SPTS computation: X o D = Σ_{k=1..n} X_k D_k.

We have extended the Galois field GF(2) = {0,1}, with XOR as addition and AND as multiplication, to pTrees. The bit-slice routine for the i-th result slice (computing P_XoD,i after P_XoD,i-1, with CarrySet = CAR_{i-1,i} and RawSet = RS_i) is:
INPUT: CAR_{i-1,i}, RS_i
ROUTINE: P_XoD,i = RS_i XOR CAR_{i-1,i}; CAR_{i,i+1} = RS_i AND CAR_{i-1,i}
OUTPUT: P_XoD,i, CAR_{i,i+1}

SPTS multiplication (note: pTree multiplication = AND). For 2-bit columns,
X_1 * X_2 = (2^1 p_1,1 + 2^0 p_1,0)(2^1 p_2,1 + 2^0 p_2,0) = 2^2 p_1,1 p_2,1 + 2^1 (p_1,1 p_2,0 + p_2,1 p_1,0) + 2^0 p_1,0 p_2,0.

(The slide's worked bit-pattern examples on the small table X(X_1, X_2) were lost in extraction.)
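A minimal Python sketch of the bit-slice dot product just described. An SPTS (a column of non-negative integers) is stored as vertical bit slices, each slice a Python int used as a bitmask over the rows; multiplying a column by a constant D_k becomes shifted copies of its slices, and slice-columns are summed with full-adder logic (XOR for the sum bit, AND/OR for the carry), mirroring the GF(2) routine above. Names like to_slices and spts_dot are illustrative, not from the original.

```python
def to_slices(values, width):
    """Vertical bit-slice form: bit r of slices[i] is bit i of values[r]."""
    return [sum(((v >> i) & 1) << r for r, v in enumerate(values)) for i in range(width)]

def from_slices(slices, n_rows):
    """Inverse of to_slices: rebuild the column of integers."""
    return [sum(((s >> r) & 1) << i for i, s in enumerate(slices)) for r in range(n_rows)]

def add_slices(a, b):
    """Add two slice-columns with full-adder logic: XOR = sum bit, AND/OR = carry."""
    out, carry = [], 0
    for i in range(max(len(a), len(b)) + 1):
        x = a[i] if i < len(a) else 0
        y = b[i] if i < len(b) else 0
        out.append(x ^ y ^ carry)
        carry = (x & y) | (carry & (x ^ y))
    return out

def spts_dot(X_cols, D, n_rows, width):
    """X o D = sum_k D_k * X_k, done entirely on bit slices."""
    total = [0]
    for col, dk in zip(X_cols, D):
        slices = to_slices(col, width)
        for i in range(dk.bit_length()):
            if (dk >> i) & 1:                       # D_k * X_k = sum of shifted slice-columns
                total = add_slices(total, [0] * i + slices)
    return from_slices(total, n_rows)

# toy usage
X1, X2 = [1, 3, 2, 3], [1, 1, 2, 3]
print(spts_dot([X1, X2], D=[3, 3], n_rows=4, width=2))   # expect [6, 12, 12, 18]
```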

Rank_{N-1}(X o D) = Rank_2(X o D): the slide's worked bit-slice traces for D = x_1, x_2, x_3 on a small table were lost in extraction.

RankK routine: p is what's left of K yet to be counted, initially p = K; V is the RankK value, initially 0. For i = bitwidth-1 down to 0: if Count(P & P_i) ≥ p then {V = V + 2^i; P = P & P_i}, else {p = p - Count(P & P_i); P = P & P'_i}.

Example uses in FAUST Oblique: X o D is used in CCC, TKO, PLC and LARC, and (x-X) o (x-X) = -2 X o x + x o x + X o X is used in TKO. So in FAUST we need to construct lots of SPTSs of the type "X dotted with a fixed vector", a costly pTree calculation. (Note that X o X is costly too, but it is a one-time calculation, a pre-calculation. x o x is calculated for each individual x, but it is a scalar calculation and just a read-off of a row of X o X, once X o X is calculated.) Thus, we should optimize the living he__ out of the X o D calculation!!! The methods on the previous slide seem efficient. Is there a better method? Then, for TKO, we need to compute ranks.

pTree Rank(K) computation (Rank(N-1) gives the 2nd smallest value, which is very useful in outlier analysis).

RankKval = 0; p = K; c = 0; P = Pure1; /* Note: n = bitwidth-1. The RankK points are returned as the resulting pTree, P */
For i = n down to 0: {c = Count(P & P_i); if (c ≥ p) {RankKval = RankKval + 2^i; P = P & P_i} else {p = p - c; P = P & P'_i}};
return RankKval, P;

Worked example with K = 7-1 = 6 (looking for the Rank-6, i.e., 6th-highest, value, which is also the 2nd-lowest value) on a 7-row, 4-bit column with slices P_4,3..P_4,0; cross out the 0-positions of P at each step:
(n=3) c = Count(P & P_4,3) = 3 < 6, so p = 6-3 = 3; P = P & P'_4,3 masks off the highest 3 (values ≥ 8).
(n=2) c = Count(P & P_4,2) = 3 ≥ 3, so RankKval += 2^2 and P = P & P_4,2 masks off the lowest 1 (value < 4).
(n=1) c = Count(P & P_4,1) = 2 < 3, so p = 3-2 = 1; P = P & P'_4,1 masks off the highest 2.
(n=0) c = Count(P & P_4,0) = 1 ≥ 1, so RankKval += 2^0 and P = P & P_4,0.
Result: RankKval = 4 + 1 = 5, with MapRankKPts = P and ListRankKPts = {2}.
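A direct Python transcription of the RankK routine above, a sketch in which each bit-slice pTree is a Python int used as a bitmask over the rows and Count() is a popcount; function and variable names are illustrative.

```python
def rank_k(bit_slices, k, n_rows):
    """Return (value, mask) of the k-th largest value (k=1 is the max).

    bit_slices[i] is the pTree for bit i (bit r set iff bit i of row r is 1),
    lowest-order slice first; mirrors the RankK routine above."""
    P = (1 << n_rows) - 1        # Pure1: all rows still candidates
    p, val = k, 0
    for i in range(len(bit_slices) - 1, -1, -1):
        c = bin(P & bit_slices[i]).count('1')   # Count(P & P_i)
        if c >= p:
            val += 1 << i
            P &= bit_slices[i]
        else:
            p -= c
            P &= ~bit_slices[i]
    return val, P

# toy usage: values [14, 5, 9, 2, 9, 13, 3]; rank 6 of 7 = 2nd smallest = 3 (row 6)
vals = [14, 5, 9, 2, 9, 13, 3]
slices = [sum(((v >> i) & 1) << r for r, v in enumerate(vals)) for i in range(4)]
print(rank_k(slices, k=6, n_rows=len(vals)))
```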

UDR, Univariate Distribution Revealer (illustrated on the Spaeth dataset; the worked bit-slice example, levels p6..p0 over Y(y1, y2), was lost in extraction). Applied to S, a column of numbers in bit-slice format (an SPTS), it produces the DistributionTree of S, DT(S). With b ≡ BitWidth(S), h = depth of a node and k = node offset, Node_{h,k} has a pointer to pTree{x ∈ S | F(x) ∈ [k·2^(b-h+1), (k+1)·2^(b-h+1))} and its 1-count.

Pre-compute and enter into the ToC all DT(Y_k), plus those for selected linear functionals (e.g., d = main diagonals, ModeVector). Suggestion: in our pTree-base, every pTree (basic, mask, ...) should be referenced in ToC(pTree, pTreeLocationPointer, pTreeOneCount), and these OneCounts should be repeated everywhere (e.g., in every DT). The reason is that these OneCounts help us select the pertinent pTrees to access, and in fact are often all we need to know about a pTree to get the answers we are after.
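A minimal sketch of the distribution-tree counts: level h of DT(S) splits the value range [0, 2^bitwidth) into 2^h equal buckets and records the 1-count (here, simply the row count) of each bucket. Plain Python lists stand in for the pTrees and their OneCounts (an assumption, for illustration only).

```python
def udr_counts(values, bitwidth):
    """Level h splits [0, 2**bitwidth) into 2**h equal buckets; return counts per level."""
    tree = []
    for h in range(bitwidth + 1):
        width = 1 << (bitwidth - h)          # bucket width at depth h
        level = [0] * (1 << h)
        for v in values:
            level[v // width] += 1
        tree.append(level)
    return tree

# toy usage on 7-bit values: print the first three levels of the distribution tree
tree = udr_counts([3, 9, 30, 60, 64, 100, 120, 121], bitwidth=7)
print(tree[0], tree[1], tree[2])
```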

So let us look at ways of doing the work to calculate X o D = Σ_{k=1..n} X_k * D_k. As we recall from below, the task is to ADD bit slices, giving a result bit slice and a set of carry bit slices to carry forward.

I believe we add by successive XORs, and the carry set is the raw set with one 1-bit turned off iff the sum at that bit is a 1-bit. Or we can characterize the carry as the raw set minus the result (always carry forward a set of pTrees plus one negative one). We want a routine that constructs the result pTree from a positive set of pTrees plus a negative set always consisting of 1 pTree. The routine is: successive XORs across the positive set, then XOR with the negative-set pTree (because the successive XOR across the positive set gives us the odd-count positions, and if you subtract one pTree, its 1-bits change odd to even and vice versa):

/* For P_XoD,i (after P_XoD,i-1). CarrySetPos = CSP_{i-1,i}; CarrySetNeg = CSN_{i-1,i}; RawSet = RS_i; CSP_{-1} = CSN_{-1} = ∅ */
INPUT: CSP_{i-1,i}, CSN_{i-1,i}, RS_i
ROUTINE: P_XoD,i = RS_i XOR CSP_{i-1,i} XOR CSN_{i-1,i}; CSN_{i,i+1} = CSN_{i-1,i} XOR P_XoD,i; CSP_{i,i+1} = CSP_{i-1,i} XOR RS_{i-1}
OUTPUT: P_XoD,i, CSN_{i,i+1}, CSP_{i,i+1}

(The slide's worked example bit patterns were lost in extraction.)

X o D = Σ_{k=1..n} X_k * D_k. For a table X(X_1..X_n), the SPTS X_k * D_k is the column of numbers x_k * D_k, and X o D is the sum of those SPTSs. With D_k a known constant,
X_k * D_k = D_k Σ_b 2^b p_k,b = (2^B p_k,B + ... + 2^0 p_k,0)(2^B D_k,B + ... + 2^0 D_k,0)
= 2^2B D_k,B p_k,B + 2^(2B-1)(D_k,B p_k,B-1 + D_k,B-1 p_k,B) + 2^(2B-2)(D_k,B p_k,B-2 + D_k,B-1 p_k,B-1 + D_k,B-2 p_k,B) + ... + 2^1(D_k,1 p_k,0 + D_k,0 p_k,1) + 2^0 D_k,0 p_k,0.
Since the D_k,b are known constant bits, each term is just a shifted copy of a bit slice (or is omitted), so DotProduct involves just multi-operand pTree addition (no SPTS multiplications). Engineering shortcut tricks here would be huge!!!

A carryTree is a valueTree or vTree, as is the rawTree at each level (rawTree = valueTree before the carry is included). In what form is it best to carry the carryTree over (for speediest processing)? 1. As multiple pTrees added at the next level (since the pTrees at the next level are in that form and need to be added)? 2. As an SPTS, s_1 (the next-level rawTree is then an SPTS, s_2, and the two are combined to give q_next_level and carry_next_level)?

(The slide's worked q_0..q_3 bit-pattern examples for two small D's were lost in extraction.)

CCC Clusterer: if DT (and/or DUT) is not exceeded at C, partition C further by cutting at each gap and PCC in C o D.

Question: Which primitives are needed and how do we compute them? X(X_1..X_n). D2NN yields a 1.a-type outlier detector (the top-k objects x by dissimilarity from X-{x}); D2NN holds, for each x, the minimum of its squared distances to the other points.

For each x ∈ X, the squared distance from x to its neighbors (near and far) is the column of numbers (vTree or SPTS)
d²(x, X) = (x-X) o (x-X) = Σ_{k=1..n} (x_k - X_k)² = Σ_k (x_k² - 2 x_k X_k + X_k²) = -2 x o X + x o x + X o X.
In bit-slice terms, X o X = Σ_k X_k² = Σ_k Σ_{i,j} 2^(i+j) p_k,i p_k,j, so: 1. pre-compute the pTree products p_k,i p_k,j within each k; 2. calculate this sum one time (it is independent of x); 3. pick x o x off as a row of X o X and add. The -2 x o X cost is linear in |X| = N; the x o x cost is ~zero; X o X is a one-time cost, amortized over x ∈ X (i.e., ~1/N per x) or precomputed. The addition -2 x o X + x o x + X o X is linear in N, so overall the cost is linear in |X| = N. (Expanding (x-X) o (x-X) directly in terms of a_k,b = x_k,b - p_k,b, where a_k,b = p'_k,b when x_k,b = 1 and a_k,b = -p_k,b when x_k,b = 0, shows that D2NN is just multi-operand pTree multiplies/adds/subtracts; each D2NN row, i.e., each x ∈ X, is a separate calculation. Should we pre-compute all p_k,i*p_k,j, p'_k,i*p'_k,j, p_k,i*p'_k,j?)

ANOTHER TRY: X(X_1..X_n), RKN (Rank-K Neighbor), K = |X|-1, yields a 1.a outlier detector (top dissimilarity from X-{x}). Install in RKN each RankK(D2NN(x)) (a one-time construct, but for, e.g., one trillion x's, |X| = N = 1T, slow; parallelization?).

Data parallelization? No! (We need all of X at each site.) Code parallelization? Yes! After replicating X to all sites, each site creates and saves D2NN for its partition of X, then sends the requested number(s) (e.g., RKN(x)) back.
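A minimal numpy sketch of the D2NN primitive via the identity just derived, d²(x, X) = -2 X o x + x o x + X o X, with X o X precomputed once; the row-wise minimum over the other points gives each point's distance to its nearest neighbor (large values flag outliers). In-memory arrays stand in for pTrees (assumption).

```python
import numpy as np

def d2nn(X):
    """Squared distance from each row x to its nearest other row, via
    d^2(x, X) = -2 X o x + x o x + X o X, with X o X precomputed once."""
    XoX = np.einsum('ij,ij->i', X, X)                   # one-time pre-computation
    d2 = XoX[None, :] - 2 * (X @ X.T) + XoX[:, None]    # d2[i, j] = |x_i - x_j|^2
    np.fill_diagonal(d2, np.inf)                        # exclude x itself (rank-2 = nearest other)
    return d2.min(axis=1)

# toy usage: the lone point far from the rest has the largest D2NN value (top outlier)
X = np.array([[0., 0], [0, 1], [1, 0], [10, 10]])
print(d2nn(X))          # expect [1, 1, 1, 181]
```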

LSR on IRIS150-3. Here we use the diagonals. With d = e_1, p = AvgS, L = (X-p) o d, followed by R(p,d,X) on the resulting intervals, the only overlap left is L ∈ [58,70), R ∈ [792,1557] (E(26), I(5)).

With just d = e_1 we get good hulls using LARC: while there exists an interval I_p,d containing more than one class, for the next (d,p) create L(p,d) ≡ X o d - p o d and R(p,d) ≡ X o X + p o p - 2 X o p - L².
1. For the min-class and max-class of L, create a linear boundary.
2. For the min-class and max-class of R, create a radial boundary.
3. Use R & C_k to create intra-C_k radial boundaries.
H_k = ∩ {I | L_p,d includes C_k}.

Here we also try using other p points for the R step (other than the one used for the L step). For d = e_1 on the interval [8,20): p = AvgS gives E=26, I=5 (30 ambiguous, 5 errors); p = AvgE gives E=6, I=4; p = AvgI gives E=25, I=10. There is a best choice of p for the R step (p = AvgE), but how would we decide that ahead of time? For d = e_4 on [11,16): p = AvgS gives E=22, I=7; p = AvgE gives E=17, I=7; p = AvgI gives E=22, I=8, so the best choice of p for the R step is also p = AvgE. (There are mistakes in this column on the previous slide!) The rest of the per-interval class-count tables on this slide were not recoverable from the extraction.

LSR on IRIS150 with D_se; the x o D_se class ranges give the first linear split:
y isa OTHER if y o D_se ∈ (-∞,-184) ∪ (123,381) ∪ (2046,∞)
y isa OTHER or S(50) if y o D_se ∈ C_1,1 ≡ [-184, 123]
y isa OTHER or I(1) if y o D_se ∈ C_1,2 ≡ [381, 590]
y isa OTHER or I(38) if y o D_se ∈ C_1,4 ≡ [1331, 2046]
y isa OTHER or E(50) or I(11) if y o D_se ∈ C_1,3 ≡ [590, 1331]
SRR(AvgS, D_se) is then applied on C_1,1, C_1,2 and C_1,3 to trim false positives (e.g., on C_1,3: y isa O if SRR ∈ (-∞,2) ∪ (143,∞); y isa O or E(10) if SRR ∈ [2,7); y isa O or E(40) or I(10) if SRR ∈ [7,137) ≡ C_2,1; y isa O or I(1) if SRR ∈ [137,143]), and D_ei with SRR(AvgE, D_ei) continues the refinement on C_2,1. (Most of the remaining interval values were lost in extraction.)

We use the radial steps to remove false positives from gaps and ends. We are effectively projecting onto a 2-dimensional range generated by the D-line and the D-perpendicular (which measures the perpendicular radial reach from the D-line). In the perpendicular projections, we can attempt to cluster directions into "similar" clusters in some way and limit the domain of our projections to one of these clusters at a time, accommodating "oval"-shaped or elongated clusters and giving a better hull fit. E.g., in the Enron case the dimensions would be words that have about the same count, reducing false positives.

LSR on IRIS150-2: we use the diagonals, and we set MinGapThres = 2, which means we stay 2 units away from any cut. For d = e_1 = 1000, the x o d limits give: y isa O if y o D ∈ (-∞,43) ∪ (79,∞); y isa O or S(9) if y o D ∈ [43,47]; y isa O or S(41) or E(26) or I(7) if y o D ∈ (47,60) (y ∈ C_1,2); y isa O or E(24) or I(32) if y o D ∈ [60,72] (y ∈ C_1,3); y isa O or I(11) if y o D ∈ (72,79]. SRR bounds then trim each piece (e.g., on [60,72]: y isa O or E(17) if SRR ∈ [1.2,20]; y isa O or E(7) or I(7) if SRR ∈ [20,66]; y isa O or I(25) if SRR ∈ [66,799]; y isa O if SRR ∈ [0,1.2) ∪ (799,∞)). Similar cascades with d = e_2, e_3, e_4 follow on the sub-clusters (d = e_2 on C_1,3 gives zero differentiation).

LSR IRIS150, per-(d,p) interval tables. For each d = e_1..e_4 and p = AvgS, AvgE, AvgI, the slide tabulates the class ranges of L = (X-p) o d and the resulting ambiguous counts and errors (e.g., d = e_1, p = AvgS: 30 ambiguous, 5 errors; d = e_4, p = AvgS: 38 ambiguous, 16 errors; the full tables were not recoverable from the extraction). It also tabulates L for the inter-class directions d = AvgE→AvgI, AvgS→AvgI and AvgS→AvgE with R(p,d,X) refinements.

Note that each L = (X-p) o d is just a shift of X o d by -p o d (for a given d). Next we examine: for a fixed d, the SPTS L_p,d is just a shift of L_d ≡ L_origin,d by -p o d, so we get the same intervals to apply R to, independent of p (shifted by -p o d). Thus we calculate, once, ll_d = min X o d and hl_d = max X o d; then for each different p we shift these interval limit numbers by -p o d, since these numbers are really all we need for our hulls (rather than going through the SPTS calculation of (X-p) o d anew for each new p). There is no reason we have to use the same p on each of those intervals either. So, on the next slide, we consider all 3 functionals, L, S and R. E.g., why not apply S first, to limit the spherical reach (eliminating false positives)? S is calculated anyway.

LSR IRIS150, e_2: the slide tabulates L_d for d = 0100 with p = origin, AvgS, AvgE and AvgI, per class (Setosa, vErsicolor, vIrginica); the numeric values were not recoverable from the extraction.

FAUST Oblique LSR (Linear, Spherical, Radial) classifier. Pre-compute, once per d: X o X, L_d = X o d, n_k,L,d ≡ min(C_k & L_d) and x_k,L,d ≡ max(C_k & L_d). Then, for any p: L_d,p ≡ (X-p) o d = L_d - p o d, so n_k,L,d,p = n_k,L,d - p o d and x_k,L,d,p = x_k,L,d - p o d. Assuming L_d, n_k,L,d and x_k,L,d have been pre-computed and stored, the cut-point pairs (n_k,L,d,p; x_k,L,d,p) are computed without further pTree processing, by those scalar subtractions.

On IRIS150, the slide tabulates these bookends for d = 1000, 0100, 0010, 0001 with p = AvgS, AvgE, AvgI (the numeric values were largely lost in extraction). We have introduced 36 linear bookends to the class hulls: one pair for each of 4 d's, 3 p's and 3 classes. For fixed d and C_k the pTree mask is the same over the 3 p's; however, we need to differentiate anyway to calculate R correctly. That is, for each d-line we get the same set of intervals for every p (just shifted by -p o d); the only reason we need them all is to accurately compute R on each min-max interval. In fact, we compute R on all intervals (even those where a single class has been isolated) to eliminate false positives, if false positives are possible. Sometimes they are not: e.g., if we are to classify IRIS samples known to be Setosa, vErsicolor or vIrginica, then there is no "other".

Form class hulls using linear boundaries (hyperplanes perpendicular to d) through the min and max of L_k,d,p = (C_k & (X-p)) o d. On every interval I_k,p,d ∈ {[ep_i, ep_i+1) | ep_j = min L_k,p,d or max L_k,p,d for some k,p,d}, add spherical and barrel boundaries with S_k,p and R_k,p,d similarly (use enough (p,d) pairs so that no two class hulls overlap). Points outside all hulls are declared "other". Summing dis(y, I_k,p,d) over all p,d gives the unfitness, uf(y,k), of y being classed in k; the fitness of y in k is f(y,k) = 1/(1 - uf(y,k)).
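A small sketch of the linear-hull step just described, assuming in-memory numpy arrays: per class and direction, store the [min, max] bookends of X o d once; a query point is assigned to a class only if its projection lies inside every bookend interval for that class, otherwise it is "other". (Shifting by -p o d moves the bookends and the query projection equally, so p only matters once the S and R bounds are added.) Names are illustrative.

```python
import numpy as np

def build_hulls(X, labels, dirs):
    """hulls[k][j] = (min, max) of L_d = (X & C_k) o d for direction dirs[j]."""
    return {k: [((X[labels == k] @ d).min(), (X[labels == k] @ d).max())
                for d in dirs]
            for k in np.unique(labels)}

def classify(y, hulls, dirs):
    """Return the first class whose every bookend interval contains y o d, else 'other'."""
    for k, intervals in hulls.items():
        if all(lo <= y @ d <= hi for d, (lo, hi) in zip(dirs, intervals)):
            return k
    return "other"

# toy usage on a tiny 2-column table with two classes
X = np.array([[1., 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]])
labels = np.array([0, 0, 0, 1, 1, 1])
dirs = [np.array([1., 0]), np.array([0., 1])]
hulls = build_hulls(X, labels, dirs)
print(classify(np.array([1.5, 1.5]), hulls, dirs), classify(np.array([5., 5]), hulls, dirs))
```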

LSR IRIS150, e_1 only. S_p ≡ (X-p) o (X-p) = X o X + L_{-2p} + p o p; n_k,S,p = min(C_k & S_p), x_k,S,p = max(C_k & S_p). R_p,d ≡ S_p - L_p,d² = L_{-2p-(2pod)d} + p o p + (p o d)² + X o X - L_d²; n_k,R,p,d = min(C_k & R_p,d), x_k,R,p,d = max(C_k & R_p,d). Analyze R: R^n → R^1 (and S: R^n → R^1?) projections on each interval formed by consecutive L: R^n → R^1 cut-points.

If we have computed S: R^n → R^1, how can we utilize it? We can, of course, simply put spherical hull boundaries by centering on the class averages, e.g., S_p with p = AvgS, AvgE, AvgI (the slide's class-interval values for L_d with d = 1000 and for S_p were lost in extraction; with p = AvgS the S hull still leaves E=50, I=11 overlapping; do the AvgE and AvgI choices eliminate false positives better?). What is the cost of these additional cuts (at new p-values within an L-interval)? It looks like we make one additional calculation, L_{-2p-(2pod)d}, then AND the interval masks, then AND the class masks (or, if we already have all interval-class masks, only one mask-AND step).

Recursion works wonderfully on IRIS: after only d = 1000 the remaining hull overlaps are small, and the 4 vIrginicas common to both are {i24, i27, i28, i34}; we could call those "errors". If, on the L_{1000,AvgE} interval [-1,11), we recurse using S_AvgI, the overlap is resolved. Thus, for IRIS at least, with only d = e_1 = 1000, with only the 3 p's AvgS, AvgE, AvgI, using full linear rounds, one R round on each resulting interval and one S, the hulls end up completely disjoint. That's pretty good news! There is a lot of interesting and potentially productive (career-building) engineering to do here: what, precisely, is the best way to intermingle p, d, L, R and S (minimizing time and false positives)?

A pTree Pillar k-means clustering method (the k is not specified; it reveals itself).
Choose m_1 as a point that maximizes Distance(X, avgX).
Choose m_2 as a point that maximizes Distance(X, m_1).
Choose m_3 as a point that maximizes Σ_{h=1..2} Distance(X, m_h).
Choose m_4 as a point that maximizes Σ_{h=1..3} Distance(X, m_h).
Do until min_{h=1..k} Distance(X, m_h) < Threshold (or do until m_k < Threshold). This gives k. Apply pk-means. (Note we already have all Dis(X, m_h)'s for the first round.)

Note: let D = the m_1 → m_2 line. Treat PCCs like parentheses: a "(" corresponds to a PCI and a ")" corresponds to a PCD. Each matched pair should indicate a cluster somewhere in that slice. Where? One could take the VoM as the best-guess centroid, then proceed by restricting to that slice. Or first apply R and do PCC parenthesizing on the R values to identify the radial slice where the cluster occurs; take the VoM of that combined slice (linear and radial) as the centroid, and apply S to confirm. Note: this is a possible clustering method for identifying density clusters (as opposed to round or convex clusters), treating PCCs like parentheses along the d-line: PCI ... PCD ... PCI ... PCD.
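A minimal sketch of detecting PCI/PCD pairs on a 1-D projection, assuming an in-memory histogram: a bucket-to-bucket count increase above a jump threshold opens a "(" (PCI) and a matching decrease closes it with a ")" (PCD); matched pairs bracket candidate cluster slices. A simple stack does the matching, which, as noted above, can be ambiguous when PCIs nest. Names and thresholds are illustrative.

```python
import numpy as np

def pcc_pairs(proj, bucket_width=1.0, jump=3):
    """Match PCI ('(') and PCD (')') count changes in the histogram of a 1-D projection.

    A PCI is a bucket-to-bucket count increase > jump; a PCD is a decrease > jump.
    Matched (PCI, PCD) pairs bracket candidate cluster slices on the d-line."""
    lo, hi = float(proj.min()), float(proj.max())
    edges = np.arange(lo, hi + 2 * bucket_width, bucket_width)
    counts, _ = np.histogram(proj, bins=edges)
    deltas = np.diff(np.concatenate(([0], counts, [0])))  # pad so the ends can be PCI/PCD
    pairs, open_pcis = [], []
    for i, delta in enumerate(deltas):
        if delta > jump:
            open_pcis.append(i)                            # '(' : precipitous count increase
        elif delta < -jump and open_pcis:
            pairs.append((edges[open_pcis.pop()], edges[i]))  # ')' : matching decrease
    return pairs

# toy usage: two dense slices and one straggler on the projection line
proj = np.array([1, 1, 1, 1, 9, 9, 10, 10, 10, 10, 25])
print(pcc_pairs(proj, bucket_width=1.0, jump=1))
```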

Clustering: 1. For anomaly detection. 2. To develop classes against which future unclassified objects are classified (classification = moving up a concept hierarchy using a class assignment function, caf: X → {Classes}).

NewClu (k is discovered, not specified): assign each (object, class) a ClassWeight, CW ∈ Reals (could be < 0). Classes "take the next ticket" as they're discovered (tickets are 1, 2, ...). Initially all classes are empty and all CWs = 0.
Do, for the next d: compute L_d = X o d, until, after masking off the new cluster, the count is too high (doesn't drop enough).
For the next PCI in L_d (next larger, starting from the smallest):
If it is followed by a PCD, declare the next Class_k and define it to be the set spherically gapped (or PCDed) around the centroid C_k = Avg or VoM_k over L_d^-1[PCI, PCD]. Mask off this ticketed new Class_k and continue.
If it is followed by a PCI, declare the next Class_k and define it to be the set spherically gapped (or PCDed) around the centroid C_k = Avg or VoM_k over L_d^-1[(3*PCI_1 + PCI_2)/4, PCI_2). Mask off this ticketed new Class_k and continue.
For the next smaller PCI (starting from the largest) in L_d:
If it is preceded by a PCD, declare the next Class_k and define it to be the set spherically gapped (or PCDed) around the centroid C_k = Avg or VoM_k over L_d^-1[PCD, PCI]. Mask off this ticketed new Class_k and continue.
If it is preceded by a PCI, declare the next Class_k and define it to be the set spherically gapped (or PCDed) around the centroid C_k = Avg or VoM_k over L_d^-1(PCI_2, (3*PCI_1 + PCI_2)/4]. Mask off this ticketed new Class_k and continue.

When is it important not to over-partition? Sometimes it is and sometimes it is not; in case 2 it usually isn't. With gap clustering we never over-partition, but with PCC-based clustering we can. If it is important that each cluster be whole, then, when using a k-means-type clusterer, each round we can fuse C_i and C_j iff their projections onto L_{m_i → m_j} touch or overlap.