Q&A

Let f be a distance-dominated functional. The average gap, $\mathrm{avgGap} = (f_{\max} - f_{\min}) / |f(X)|$, may be a good measurement for setting thresholds. E.g., is x an outlier (= anomaly) if the gap around {x} is > 3*avgGap?

If d and t are trained over DocumentTerm (DT), then Gradient(F) = G = $(G_d, G_t)$. Instead of a LineSearch using $F(s) = F(f + sG)$, always use a 2D-RectangleSearch, $F(s_d, s_t) = F(f + s_d G_d + s_t G_t)$. Set $\partial F/\partial s_d = 0$ and $\partial F/\partial s_t = 0$.

It may be a better approach to find dense cells (sphere, barrel, cone) and then fuse them, because it is difficult to position them around whole clusters (due to bumps, protrusions, etc.). This is not true for outlier clusters (singleton, doubleton).

An algorithm: Start with a line and a small-radius barrel around it. Find dense regions between 2 consecutive gaps in this pipe. This should identify a portion of a dense cluster. There are lots of ways to go from there:
a. Use the centroid of the dense pipe piece as the sphere|barrel center.
b. Move to a better centroid for that cluster by a gradient ascent/descent process.
c. In a "GA mutation" fashion, jump to a nearby centroid, governed by some fitness function (e.g., count in the dense pipe piece).

If the minimum barrel radius is >> 0, we have chosen a d-line far from the data. It may be advisable to pick p to be an actual data point.

Here are the formulas from the spreadsheet:
G = (B12-B$6)*B$9+(C12-C$6)*C$9+(D12-D$6)*D$9+(E12-E$6)*E$9, i.e., (x-p)od
H = G12-$G$9, i.e., L = (x-p)od - min
I = (B12-B$6)^2+(C12-C$6)^2+(D12-D$6)^2+(E12-E$6)^2, i.e., (x-p)o(x-p)
B = SQRT[(x-p)o(x-p) - ((x-p)od)^2]

Note we don't round, so we are calculating pTree bitslices by truncating. We don't even need to do that! For fixed point, the bitslice formulas keep going: take bitslices to the right of the decimal point. Floating point? Bitslice the mantissa. The exponent shifts the slice name. E.g., .1011

Gap analytic tools: $L(x) = x \circ d$, $S(x) = (x-p) \circ (x-p)$ and then, from those, $B(x) = S(x) - L^2(x)$. (If T is the minimum gap threshold, use $T^2$ for S and B.)

Oblique FAUST, Linear-Barrel (OFLB): Alternate $L_{pq}(x)$ and $B_{pq}(x)$ to get a cluster dendrogram (top-down). Take p = first TR point?
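The avgGap outlier question from the Q&A can be tried out with a small sketch. It assumes NumPy, reads |f(X)| as the number of distinct functional values, and reads "gap around {x}" as the gaps to both the previous and the next distinct value; the function names `avg_gap` and `gap_outliers` and the `factor` parameter are hypothetical, not from the slides.

```python
import numpy as np

def avg_gap(values):
    """avgGap = (f_max - f_min) / |f(X)|, with |f(X)| = number of distinct values."""
    distinct = np.unique(values)  # sorted distinct functional values
    return (distinct[-1] - distinct[0]) / len(distinct)

def gap_outliers(values, factor=3.0):
    """Boolean mask of points whose functional value is isolated.

    Assumption: a value is isolated when the gaps on BOTH sides exceed
    factor * avgGap (end values only have one finite neighbouring gap).
    """
    values = np.asarray(values, dtype=float)
    distinct = np.unique(values)
    if len(distinct) < 2:
        # No gaps exist, so nothing can be called isolated.
        return np.zeros(values.shape, dtype=bool)
    threshold = factor * avg_gap(values)
    gaps = np.diff(distinct)                   # gaps[i] = distinct[i+1] - distinct[i]
    left = np.concatenate(([np.inf], gaps))    # gap to the previous distinct value
    right = np.concatenate((gaps, [np.inf]))   # gap to the next distinct value
    isolated = distinct[(left > threshold) & (right > threshold)]
    return np.isin(values, isolated)
```

For example, `gap_outliers([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 50])` flags only the last point. Note that this sketch flags every point sharing an isolated value, so a doubleton with identical functional values is flagged together.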
d=vom avg

Defining average density? $\mathrm{AvD} = \mathrm{count} / \prod_{k=1}^{\dim} (\max_k - \min_k)$? This is for choosing good thresholds.
$\mathrm{MinGapThres} = T_{b,\mathrm{AvD}} \equiv b \cdot (1/\mathrm{AvD})^{1/\dim}$, where b is an adjustable parameter.

If we're given a TrainingSet, TR, with K classes, is $\mathrm{avg}_{k=1..K}\,\mathrm{VoM}_k$ a better medoid than VoM? Is taking p = MinCorner, q = MaxCorner of the box circumscribing $\{\mathrm{VoM}_k\}_{k=1..K}$ better than using the circumscribing box of TR?

USPTS = universe of all SPTSs (columns of reals); V = n-dimensional vector space.

Code operations on USPTS (both 1-level and multi-level):

USPTS → USPTS (unary operations), including:
$\mathrm{DP}_v$ = dot product with a fixed vector, $v \in V$. More efficient to use v's bit pattern?
$\mathrm{SD}_v$ = square distance from a fixed vector, $v \in V$. More efficient to use v's bit pattern only?
$\mathrm{SP}_c$ = scalar product (multiply by a constant c). More efficient to use c's bit pattern?
$\mathrm{ER}_a$ = FP's EinRings. Result = pTree mask of rows < a. Use a's bit pattern only?

USPTS × ... × USPTS → USPTS (n-ary algebraic operations), including: +, -, /, * (= Row_Wise_Product); binary only: comparisons such as =, >, <.

SPTS → R, including:
$\mathrm{AG}_a$ = YC's aggregates: count, sum, avg, max, min, median, rank_k, top_k, IceBergQueries.

Note that USPTS includes SPTSs of all cardinalities (= depths = number of rows). It seems best to code on USPTS rather than on USPTSn (card(SPTS) = n). Of course, it is very important to know what the rows represent so as to avoid nonsense results; however, why restrict the operations themselves? When SPTS operands are of different depths, the result SPTS's depth = the depth of the shallowest operand (operate from the top of each).
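As a rough illustration of the average-density threshold, the sketch below assumes NumPy and reads the denominator of AvD as the volume of the axis-aligned bounding box (the product of the per-dimension extents). The function names `average_density` and `min_gap_threshold` are hypothetical.

```python
import numpy as np

def average_density(X):
    """AvD = count / prod_k (max_k - min_k), assuming a bounding-box volume."""
    X = np.asarray(X, dtype=float)
    extents = X.max(axis=0) - X.min(axis=0)   # per-dimension range
    volume = np.prod(extents)
    if volume == 0:
        # A constant column makes the box degenerate; AvD is undefined then.
        raise ValueError("bounding box has zero volume")
    return X.shape[0] / volume

def min_gap_threshold(X, b=1.0):
    """T_{b,AvD} = b * (1/AvD)^(1/dim); b is the adjustable parameter."""
    X = np.asarray(X, dtype=float)
    dim = X.shape[1]
    return b * (1.0 / average_density(X)) ** (1.0 / dim)
```

The quantity (1/AvD)^(1/dim) is the side length of the cube that would hold one point on average, so b can be read as "how many average spacings wide a gap must be".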
Oblique FAUST (OF) Clustering: Linear (default) OFL, Spherical OFS, Barrel OFB, Conical OFC

Assume a table of real numbers, TBL($C_1..C_n$) (= an n-dimensional vector space). Categorical columns can either be coded as real numbers or bitmapped; e.g., a Month column can be coded as {1,...,12} and a Color column can be bitmapped by Red(yes=1|no=0)...Violet(yes=1|no=0). TBL is converted to a PTreeSet.

Define the distance function ds(x,y): TBL × TBL → R by
$ds(x,y) = \sum_{k \in CR} r_k |x_k - y_k|^2 + \sum_{k \in CC} c_k |x_k - y_k|$
where CR is the set of real columns, CC is the set of categorical columns (consider coded columns as real) and $r_k$, $c_k$ are real coefficients.

Each method uses a real-valued functional from X to R, and all methods are completely data parallel: the data can be distributed over a cluster and processed in parallel (dot product), then the partial results are sent home to be added.

Oblique FAUST Linear (OFL) clustering (enclose clusters between (n-1)-dimensional hyperplanar gaps):
$L_{p,d}: X \to R$, $L_{p,d}(x) = (x-p) \circ d$
Find $a_1 < a_2$ such that
$\emptyset = \mathrm{Gap}_{Lower} = \{x \mid a_1 < L_{p,d}(x) < a_1 + T\}$ and
$\emptyset = \mathrm{Gap}_{Upper} = \{x \mid a_2 < L_{p,d}(x) < a_2 + T\}$ and
$C = \{x \mid a_1 + T < L_{p,d}(x) < a_2\}$.

Oblique FAUST Spherical (OFS) (enclose clusters with spherical gaps):
$S_p(x) = (x-p) \circ (x-p)$
Search $S_p$ for a spherical gap, $\{x \mid r^2 \le S_p(x) < (r+T)^2\} = \emptyset$, so that the interior of the r-sphere about p encloses a sub-cluster.

Oblique FAUST Barrel (OFB) (enclose clusters with barrel gaps):
$B_{p,d}(x) = (x-p) \circ (x-p) - ((x-p) \circ d)^2$
Search for $\mathrm{Gap}_{Lower} > T$, $\mathrm{Gap}_{Upper} > T$ and $\mathrm{Gap}_{Barrel} > T^2$ (BR ≡ Barrel_Radius).

Oblique FAUST Cone (OFC) (enclose clusters with cone gaps):
$C_{p,d}(x) = (x-p) \circ d \,/\, \sqrt{(x-p) \circ (x-p)}$

Note: $B_{p,d}(x) = S_p(x) - L_{p,d}^2(x)$
Note: $C_{p,d}^2(x) = L_{p,d}^2(x) / S_p(x)$

[Figure: a cluster enclosed between Gap_Lower (at $a_1$) and Gap_Upper (at $a_2$) along the d-line through p, with barrel radius $B_{p,d}(x)$ and a barrel gap; a second panel notes that no gaps show on the red, blue or green projection lines.]
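The functionals above and the linear gap search can be sketched in a few lines. This is a minimal illustration on a plain NumPy array rather than on pTrees, assuming d is normalised to unit length so that B is a squared barrel radius; `faust_functionals` and `linear_gap_clusters` are hypothetical names, and cutting wherever consecutive sorted L values differ by more than T is one simple reading of the OFL gap condition.

```python
import numpy as np

def faust_functionals(X, p, d):
    """Return L, S, B for each row x of X.

    L = (x-p) o d, S = (x-p) o (x-p), B = S - L^2.
    Assumption: d is rescaled to unit length here.
    """
    X = np.asarray(X, dtype=float)
    p = np.asarray(p, dtype=float)
    d = np.asarray(d, dtype=float)
    d = d / np.linalg.norm(d)
    Y = X - p
    L = Y @ d                    # projection length along the d-line
    S = np.sum(Y * Y, axis=1)    # squared distance from p
    B = S - L ** 2               # squared distance from the d-line
    return L, S, B

def linear_gap_clusters(L, T):
    """Label points by cutting the sorted L values at every gap wider than T."""
    L = np.asarray(L, dtype=float)
    order = np.argsort(L)
    # breaks[i] = j means there is a gap > T between sorted positions j and j+1.
    breaks = np.flatnonzero(np.diff(L[order]) > T)
    labels = np.empty(len(L), dtype=int)
    # Cluster id of sorted position j = number of breaks strictly before j.
    labels[order] = np.searchsorted(breaks, np.arange(len(L)), side="left")
    return labels
```

The same cutting routine can be applied to S or B to look for spherical or barrel gaps; per the notes, the threshold should then be T^2 rather than T, since S and B are squared quantities.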
OFLB...LB clustering on CONCRETE(STrength, ConcreteMix, WAter, FineAggregate, AGgregate). T=MGW=12, d=x-n. Assess ST error with strength classes L<40, M<60, H.
Each split alternates the functionals (x-p)od/4 and Br/4, tabulating counts (Ct) and gaps (Gp) and cutting at gap 3. If the first B radius >> 0, use p=min_radius_pt.

[Figure: cluster dendrogram without purity, with per-cluster L/M/H counts. Nodes:
c0
c1: c11, c12
c2: c21 (c211), c22, c23 (c231), c24 (c241 (c2411)), c25 (c251), c26, c27
c3: c31, c32, c33
c4]
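OFLB...LB clustering on SEEDS(cls, area, lnker, acoef, lnkrgv).
Each split alternates the functionals (x-p)od*10 and Br*10, tabulating counts (Ct) and gaps (Gp) and cutting at gap 3. If the first B radius >> 0, use p=min_radius_pt.

[Figure: count/gap tables and resulting clusters for the SEEDS splits.]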