FAUST Technology for Clustering (includes Anomaly Detection) and Classification (Where are we now?)

FAUST technology for classification/clustering is built for speed, so that big data can be mined in human time. Oblique FAUST is generalized to CC FAUST, which places cuts at all large Count Changes (CCs). A CC almost always reveals a cluster boundary: a large Count Decrease (CD) occurs iff we are exiting a cluster somewhere on the cut hyperplane, and a large Count Increase (CI) occurs iff we are entering one. CC FAUST makes a cut at each CC in the y∘d values. (A gap is a large CD followed by a large CI, so cutting at all large CCs subsumes the gap cuts of Oblique FAUST.)

CC FAUST is Divisive Hierarchical Clustering which, if continued to singleton sub-clusters, builds a complete dendrogram. If the problem at hand is outlier (anomaly) detection, any singleton sub-cluster separated by a sufficient gap is an outlier. CC FAUST will scale up, because entering and leaving a cluster "smoothly" (i.e., without a noticeable count change) is no more likely for large datasets than for small ones; it is a measure-zero phenomenon.

Do we need BARREL FAUST at all now? BARREL CC FAUST is still useful for estimating the diameter of a set as SQRT((dot-product width onto a d-line)^2 + (max barrel radius from that d-line)^2).

Density Uniformity (DU) of a sub-cluster might be defined as the reciprocal of the variance of the counts. A sub-cluster dendrogram should have a Density label (DE) and a Density Uniformity label (DU) on every edge (sub-cluster). We can end a dendrogram branch as soon as DE and DU are high enough (> thresholds DET and DUT) to save time.

How can we [quickly] estimate DE and DU? DU is easy: just calculate the variance of the point counts. Density DE = count/volume = count/(c_n * r^n). We have the count and n; c_n is a known constant (e.g., c_1 = π, c_2 = 4π/3, ...); and we have the volume once we have the radius. Barrel CC FAUST gives us a good radius estimate.

In advance, we decide on a Density Threshold, DET, and a Density Uniformity Threshold, DUT. To choose the "best" clustering (partitioning of the set into sub-clusters), we proceed depth-first across the dendrogram, left-most branch to right-most branch, going down until the DET and DUT thresholds are exceeded. (See next slides.)
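To make the cutting step concrete, here is a minimal sketch (not the pTree implementation) of placing cuts at large count changes in the y∘d values, using NumPy arrays in place of SPTSs; the function name find_cc_cuts and the p% default are illustrative assumptions, not part of the FAUST codebase.

    # Minimal sketch: cut the projections y.d wherever the count changes
    # by more than pct * max_count between adjacent projection values.
    import numpy as np

    def find_cc_cuts(Y, d, pct=0.25):
        proj = Y @ d                              # the y.d values (an SPTS in FAUST)
        vals, counts = np.unique(np.round(proj).astype(int), return_counts=True)
        dense = np.zeros(vals.max() - vals.min() + 1, dtype=int)
        dense[vals - vals.min()] = counts         # histogram over the full value range
        thresh = pct * dense.max()
        cuts = []
        for i in range(1, len(dense)):
            if abs(dense[i] - dense[i - 1]) > thresh:   # a large Count Change
                cuts.append(vals.min() + i)       # place a cut at this projection value
        return cuts

    # Two clusters separated on the d-line yield cuts at both boundaries
    # (a large CD leaving the first, a large CI entering the second).
    Y = np.vstack([np.random.normal(0, 1, (50, 2)), np.random.normal(10, 1, (50, 2))])
    print(find_cc_cuts(Y, np.array([1.0, 0.0])))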
Oblique FAUST Code Layering?

A layer (or object, black box, or procedure) in the code called the CUTTER:
INPUTS:
 I1. An SPTS (Scalar PTreeSet = bit-sliced column of numbers, presumably coming from a dot-product functional).
 I2. The method: cut at?
  I2a. p% count change (e.g., p = 25%),
  I2b. other, non-uniform count-change thresholds?
  I2c. centers of gaps only.
 I3. Whether the 1-counts of the sub-cluster mask pTrees should be computed and returned (Y/N), since that is an expensive step.
OUTPUTS:
 O1. A pointer to a mask pTree for each new "sub-cluster" (i.e., identifying each set of points separated by consecutive cuts).
 O2. The 1-count of each mask.

The GRAMMER:
INPUTS:
 I1. An existing labeled dendrogram (labeled with, e.g., the unit vector that produced it, the density of each edge sub-cluster, ...), including the tree of pointers to the mask pTree of each node (including the root, which need not be all of the original set).
 I2. The new threshold levels (if, e.g., the new density threshold is lower than that of the existing dendrogram, GRAMMER prunes the dendrogram).
OUTPUTS:
 O1. The new labeled dendrogram.

I like the idea of building a custom dendrogram for the user according to specifications. Then the user can examine it while we churn out the next level (as done in the next two slides, i.e., the next higher density threshold). The reason is that the full dendrogram down to singletons is impossibly large, and the information gain with each new level rises from zero up to a maximum and then falls steadily back to zero at the singleton level (the bottom of the full dendrogram is huge but worthless).

A thought on the sub-cluster dendrogram in general: the root should be labeled with the PTreeSet of the table involved. The sub-root level should be labeled with the particular SPTS of this branch (the D-line or unit vector, d, of the dot product, ...). Each sub-level after that should be labeled likewise.

Hadoop Treeminer principles? Never discard a derived pTree, and never, never discard a computed count (makes catalogue management a serious undertaking?). OR: pTree hoarding is good.
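A hypothetical skeleton of the CUTTER interface above, with plain Python stand-ins for the pTree structures; every name here (CutMethod, SubClusterMask, cutter) is illustrative, and only the p%-count-change method (I2a) is actually sketched.

    from collections import Counter
    from dataclasses import dataclass
    from enum import Enum
    from typing import List, Optional

    class CutMethod(Enum):
        PCT_COUNT_CHANGE = 1    # I2a: cut at every p% count change
        CUSTOM_THRESHOLDS = 2   # I2b: non-uniform thresholds (not sketched)
        GAP_CENTERS = 3         # I2c: centers of gaps only (not sketched)

    @dataclass
    class SubClusterMask:
        mask: List[bool]            # O1: stand-in for a pointer to a mask pTree
        one_count: Optional[int]    # O2: the 1-count, only if requested (I3)

    def _pct_cuts(values, p):
        """Cut wherever adjacent histogram counts differ by more than p * max_count."""
        counts = Counter(values)
        lo, hi = min(counts), max(counts)
        hist = [counts.get(v, 0) for v in range(lo, hi + 1)]
        thresh = p * max(hist)
        return [lo + i for i in range(1, len(hist))
                if abs(hist[i] - hist[i - 1]) > thresh]

    def cutter(spts, method=CutMethod.PCT_COUNT_CHANGE, p=0.25,
               compute_one_counts=True):
        """Partition the SPTS (a column of integer dot-product values) at the
        chosen cuts, returning one SubClusterMask per sub-cluster."""
        if method is not CutMethod.PCT_COUNT_CHANGE:
            raise NotImplementedError("only I2a is sketched here")
        bounds = [float("-inf")] + _pct_cuts(spts, p) + [float("inf")]
        out = []
        for lo, hi in zip(bounds, bounds[1:]):
            m = [lo <= v < hi for v in spts]
            out.append(SubClusterMask(m, sum(m) if compute_one_counts else None))
        return out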
Choosing a clustering from a DEL- and DUL-labeled dendrogram

The algorithm for choosing the optimal clustering from a labeled dendrogram: let DET = .4 and DUT = 1/2; walk the dendrogram depth-first, left-most branch to right-most, descending until both thresholds are exceeded.

[Figure: an example dendrogram with nodes A-G, each edge labeled with a density label and a density-uniformity label, e.g., DEL=.1 DUL=1/6; DEL=.2 DUL=1/8; DEL=.4 DUL=1; DEL=.5 DUL=1/2; DEL=.3; several labels are not legible in the source.]
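A minimal sketch of that selection walk, assuming a simple labeled-node type; the DEL/DUL fields mirror the edge labels above, but the Node class itself is hypothetical.

    # Depth-first, left to right: stop descending at the first node whose
    # density and uniformity labels both exceed the thresholds.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Node:
        name: str
        DEL: float                   # density label on this node's edge
        DUL: float                   # density-uniformity label on this node's edge
        children: List["Node"] = field(default_factory=list)

    def choose_clustering(node, DET=0.4, DUT=0.5):
        if (node.DEL > DET and node.DUL > DUT) or not node.children:
            return [node]            # this sub-cluster becomes a final cluster
        clusters = []
        for child in node.children:  # left-most branch to right-most
            clusters += choose_clustering(child, DET, DUT)
        return clusters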
APPLYING CC FAUST TO SPAETH

[Figure: scatter plot of the Spaeth dataset, points y1-yf on a hexadecimal coordinate grid, with cuts marked at 7 and a on the d-line.]

Density (count/πr^2) labeled dendrogram for LCC FAUST on Spaeth with D=AvgMedian, DET=.3:
 Y (.15)
  {y1,y2,y3,y4,y5} (.37)
  {y6,yf} (.08)
   {y6} ( )
   {yf} ( )
  {y7,y8,y9,ya,yb,yc,yd,ye} (.07)
   {y7,y8,y9,ya} (.39)
   {yb,yc,yd,ye} (1.01)
Continuing with D=AM at DET=.5:
  {y1,y2,y3,y4} (.63), {y5} ( ); {y7,y8,y9} (1.27), {ya} ( )
Continuing with D=AM at DET=1:
  {y1,y2,y3} (2.54), {y4} ( )

Density (count/πr^2) labeled dendrogram for LCC FAUST on Spaeth with D cycling through the diagonals nnxx, nxxn, nnxx, nxxn, ..., DET=.3:
 Y (.15)
  {y1,y2,y3,y4,y5} (.37)
  {y6,y7,y8,y9,ya,yb,yc,yd,ye,yf} (.09)
   {y6,y7,y8,y9,ya} (.17)
    {y6} ( )
    {y7,y8,y9,ya} (.39)
   {yb,yc,yd,ye,yf} (.25)
    {yf} ( )
    {yb,yc,yd,ye} (1.01)

D-line labeled dendrogram for LCC FAUST on Spaeth with D=furthestAvg, DET=.3:
 Y (.15)
  {y1,y2,y3,y4,y5} (.37)
  {y6,yf} (.08)
   {y6} ( )
   {yf} ( )
  {y7,y8,y9,ya,yb,yc,yd,ye} (.07)
   {y7,y8,y9,ya} (.39)
   {yb,yc,yd,ye} (1.01)
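The density labels above are count/(πr^2). A small sketch of computing such a label for a 2-D sub-cluster, taking r as half the barrel diameter estimate from the first slide (dot-product width and barrel radius along a d-line); the helper name and the sample points are illustrative, not the actual Spaeth coordinates.

    import math

    def density_label(points, d):
        """DE = count / (pi * r^2) for 2-D points; d is a unit vector."""
        proj = [x * d[0] + y * d[1] for x, y in points]   # positions along the d-line
        perp = [y * d[0] - x * d[1] for x, y in points]   # positions orthogonal to d
        width = max(proj) - min(proj)                     # dot-product width
        barrel = (max(perp) - min(perp)) / 2              # max barrel radius estimate
        r = math.sqrt(width ** 2 + barrel ** 2) / 2       # radius = diameter estimate / 2
        return len(points) / (math.pi * r ** 2) if r else None  # singletons get no label

    # illustrative 4-point sub-cluster
    print(density_label([(10, 9), (11, 10), (9, 11), (11, 11)], (1.0, 0.0)))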
UDR: Univariate Distribution Revealer (on Spaeth)

Applied to S, a column of numbers in bit-slice format (an SPTS), UDR produces the Distribution Tree of S, DT(S). Let b ≡ BitWidth(S), h = the depth of a node, and k = the node offset. Node(h,k) has a pointer to pTree{x in S | F(x) in [k*2^(b-h+1), (k+1)*2^(b-h+1))} and its 1-count.

Pre-compute and enter into the ToC all DT(Y_k), plus those for selected linear functionals (e.g., d = the main diagonals, ModeVector). Suggestion: in our pTree-base, every pTree (basic, mask, ...) should be referenced in ToC(pTree, pTreeLocationPointer, pTreeOneCount), and these OneCounts should be repeated everywhere (e.g., in every DT). The reason is that the OneCounts help us select the pertinent pTrees to access, and in fact are often all we need to know about a pTree to get the answers we are after.

[Worked figure: the Spaeth table (y1-yf with their coordinates and the derived column yofM = f(y)), bit-sliced into p6..p0 with complements p6'..p0', and the resulting distribution-tree counts, e.g., 5/64 on [0,64), refining through the width-32 and width-16 intervals down to width-8 intervals [0,8), [8,16), ..., [120,128); node(2,3) covers [96,128).]
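A toy sketch of UDR producing DT(S), assuming S is a plain list of non-negative integers of bit width B; in the real system each node's 1-count would come from AND-ing bit-slice pTrees (the p6, p6', p5', ... masks in the figure) rather than from scans, and the depth/offset indexing here (root at h=0 over [0, 2^B), intervals halving at each level) is one consistent reading of the slide's node-interval formula.

    def udr(S, B):
        """Return DT(S) as a dict mapping (h, k) to the 1-count of the node
        pTree for {x in S : k*2^(B-h) <= x < (k+1)*2^(B-h)}."""
        dt = {}
        for h in range(B + 1):                    # depth 0 (root) .. depth B (unit leaves)
            width = 1 << (B - h)                  # interval width halves at each level
            for k in range(1 << h):
                cnt = sum(1 for x in S if k * width <= x < (k + 1) * width)
                if cnt:                           # materialize only non-empty nodes
                    dt[(h, k)] = cnt
        return dt

    # 7-bit example: the counts on [0,64) and [64,128) appear at depth 1.
    S = [3, 5, 9, 64, 70, 99, 120, 127]
    dt = udr(S, 7)
    print(dt.get((1, 0)), dt.get((1, 1)))         # -> 3 5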