Published by Christine Larsen. Modified over 6 years ago.
1
Research of William Perrizo, C.S. Department, NDSU
I data-mine big data (big data ≡ trillions of rows and, sometimes, thousands of columns, which can complicate the mining of trillions of rows). How do I do it? I structure the data table as [compressed] vertical bit columns (called "predicate Trees" or "pTrees") and process those pTrees horizontally, because processing across thousands of column structures is orders of magnitude faster than processing down trillions of row structures. As a result, some tasks that might have taken forever can be done in a humanly acceptable amount of time.
What is data mining? Largely it is classification (assigning a class label to a row based on a training table of previously classified rows). Clustering and Association Rule Mining (ARM) are also important areas of data mining, and they are related to classification. The purpose of clustering is usually to create [or improve] a training table; it is also used for anomaly detection, a huge area in data mining. ARM is used to mine more complex data (relationship matrices between two entities, not just single-entity training tables). Recommenders recommend products to customers based on their previous purchases or rentals (or on their ratings of items).
To make a decision, we typically search our memory for similar situations (near-neighbor cases) and base our decision on the decisions we (or an expert) made in those similar cases. We do what worked before (for us or for others); i.e., we let near-neighbor cases vote. But which neighbors vote? "The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information" is one of the most highly cited papers in psychology, written by cognitive psychologist George A. Miller of Princeton University's Department of Psychology and published in Psychological Review. It argues that the number of objects an average human can hold in working memory is 7 ± 2 (called Miller's Law). Classification provides a better 7.
Some current pTree data mining research projects:
FAUST pTree PREDICTOR/CLASSIFIER (FAUST = Functional Analytic Unsupervised and Supervised machine Teaching).
FAUST pTree CLUSTER/ANOMALASER.
pTrees in MapReduce: MapReduce and Hadoop are key-value approaches to organizing and managing BigData.
pTree Text Mining: capture the reading sequence, not just the term-frequency matrix, of a text corpus (lossless capture).
Secure pTreeBases: anonymize the identities of the individual pTrees and randomly pad them to mask their initial bit positions.
pTree Algorithmic Tools: an expanded algorithmic tool set is being developed to include quadratic tools and even higher-degree tools.
pTree Alternative Algorithm Implementation: implementing pTree algorithms in hardware (e.g., FPGAs) should result in orders-of-magnitude performance increases.
pTree O/S Infrastructure: computers and operating systems are designed to do logical operations (AND, OR, ...) rapidly. Exploit this for pTree processing speed.
pTree Recommender: this includes Singular Value Decomposition (SVD) recommenders, pTree Near Neighbor Recommenders, and pTree ARM Recommenders.
2
FAUST clustering (the unsupervised part of FAUST)
This class of partitioning or clustering methods relies on choosing a dot-product projection so that if we find a gap in the F-values, we know that the two sets of points mapping to opposite sides of that gap are at least as far apart as the gap width.
The Coordinate Projection functionals (ej): check gaps in ej(y) ≡ y∘ej = yj.
The Square Distance functional (SD): check gaps in SDp(y) ≡ (y−p)∘(y−p) (parameterized over a grid of p in Rn).
The Dot Product Projection (DPP): check for gaps in DPPd(y) or DPPpq(y) ≡ (y−p)∘(p−q)/|p−q| (parameterized over a grid of unit vectors d = (p−q)/|p−q| on the n-sphere).
The Dot Product Radius (DPR): check gaps in DPRpq(y) ≡ √( SDp(y) − DPPpq(y)² ).
The Square Dot Product Radius (SDPR): SDPRpq(y) ≡ SDp(y) − DPPpq(y)² (easier pTree processing).
DPP-KM: 1. Check gaps in DPPp,d(y) (over grids of p and d?); check distances at any sparse extremes. 2. After several rounds of 1, apply k-means to the resulting clusters (when k seems to be determined).
DPP-DA: Check gaps in DPPp,d(y) (grids of p and d?) against the density of the subcluster; check distances at sparse extremes against subcluster density; apply other methods once DPP ceases to be effective.
DPP-SD: Check gaps in DPPp,d(y) (over a p-grid and a d-grid) and SDp(y) (over a p-grid); check sparse-end distances against subcluster density. (DPPpd and SDp share construction steps!)
SD-DPP-SDPR: DPPpq, SDp and SDPRpq share construction steps:
SDp(y) ≡ (y−p)∘(y−p) = y∘y − 2 y∘p + p∘p
DPPpq(y) ≡ (y−p)∘d = y∘d − p∘d, where for d = (p−q)/|p−q| we have y∘d = (1/|p−q|) y∘p − (1/|p−q|) y∘q (p∘d is a constant shift, irrelevant to gap-finding).
Calculate y∘y, y∘p, y∘q concurrently? Then do the constant multiplies 2·y∘p and (1/|p−q|)·y∘p concurrently. Then add/subtract. Calculate DPPpq(y)², then subtract it from SDp(y).
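As a minimal sketch of the gap idea (plain Python over horizontal rows, not the slides' vertical pTree implementation; the helper names `dpp` and `gaps` are mine, and I use the sign convention d = (q−p)/|q−p|, which only reverses the ordering relative to (p−q)/|p−q|):

```python
import math

def dpp(y, p, q):
    # Dot Product Projection functional: DPP(y) = (y - p) . d, with d = (q - p)/|q - p|
    d = [qi - pi for pi, qi in zip(p, q)]
    norm = math.sqrt(sum(di * di for di in d))
    return sum((yi - pi) * di / norm for yi, pi, di in zip(y, p, d))

def gaps(values, min_gap):
    # Return (lo, hi) pairs where consecutive sorted F-values differ by >= min_gap.
    # Points on opposite sides of such a gap are at least min_gap apart in space.
    vs = sorted(values)
    return [(a, b) for a, b in zip(vs, vs[1:]) if b - a >= min_gap]

# Two clusters on a line: projecting onto d reveals the gap between F=1 and F=10.
X = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
F = [dpp(x, (0.0, 0.0), (11.0, 0.0)) for x in X]
```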
3
FAUST DPP CLUSTER on IRIS with DPP(y) = (y−p)∘(q−p)/|q−p|, where p is the min (n) corner and q is the max (x) corner of the circumscribing rectangle (midpoints or the average (a) are also used). IRIS: 150 irises (rows), 4 columns (Petal Length, Petal Width, Sepal Length, Sepal Width); the first 50 are Setosa (s), the next 50 are Versicolor (e), the next 50 are Virginica (i).
[Slide table: DPP values of the 150 samples over SL, SW, PL, PW; checking [0,4] distances among s14, s42, s45, s23, s16, s43, s3 flags s42 as a Setosa outlier.]
CL1: F < 17 (50 Setosa), p = aaax, q = aaan (outliers removed).
CL2: 17 < F < 23 (e8, e11, e44, e49, i39), gap ≥ 4, p = nnnn, q = xxxx.
CL3: 23 < F (46 Versicolor, 49 Virginica), thinning at [6,7]; splitting there: CL3.1 (F < 6.5) has 44 Versicolor, 4 Virginica; CL3.2 (F > 6.5) has 2 Versicolor, 39 Virginica. No sparse ends.
Checking distances in [12,28]: s16, i39, e49, e11 and {e8, e44, i6, i10, i18, i19, i23, i32} are outliers. Checking [57,68] distances: i10, i36, i19, i32, i18 and {i6, i23} are outliers.
[Slide tables: F-value/count histograms for each cluster omitted.]
Here we project onto lines through the corners and edge midpoints of the coordinate-oriented circumscribing rectangle. It would, of course, get better results if we chose p and q to maximize gaps. Next we consider maximizing the STD of the F-values to ensure strong gaps (a heuristic method).
4
"Gap Hill Climbing": mathematical analysis
1. To increase gap size, we hill-climb the standard deviation of the functional F, hoping that a "rotation" of d toward a higher StDev increases the likelihood of larger gaps, since more dispersion allows for more and/or larger gaps. This is very heuristic, but it works.
2. We are more interested in growing the largest gap(s) of interest (or the largest thinning). F-slices are hyperplanes (assuming F = dot-product with d), so it makes sense to try to "re-orient" d so that the gap grows. Instead of taking the "improved" p and q to be the means of the two entire n-dimensional half-spaces cut off by the gap (or thinning), take p and q to be the means of the (n−1)-dimensional F-slice hyperplanes defining the gap or thinning. This is easy, since our method produces the pTree mask of each F-slice ordered by increasing F-value (in fact, it is the sequence of F-values and the sequence of counts of points giving those values that we use to find large gaps in the first place).
In the example, the d2-gap is much larger than the d1-gap, though still not the optimal gap. Would it be better to use a weighted mean, weighted by the distance from the gap, that is, by the d-barrel radius (from the center of the gap) on which each point lies? In this example that seems to make for a larger gap, but what weighting should be used (e.g., 1/radius²)? (Zero weighting after the first gap is identical to the previous method.) Also, we really want to identify the support-vector pair of the gap (the pair, one from each side, that are closest together) as p and q (in this case 9 and a, but we were just lucky to draw our vector through them). We could check the d-barrel radius of just these gap-slice pairs and select the closest pair as p and q.
[Slide diagram: points a..s projected on d1 showing the d1-gap, and the re-oriented d2 showing the larger d2-gap.]
5
Maximizing the Variance
How do we use this theory? For dot-product-gap-based clustering, we can hill-climb akk below to a d that gives us the global maximum variance. Heuristically, higher variance means more prominent gaps. For dot-product-gap-based classification, we can start with X = the table of the C training-set class means M1, ..., MC, where Mk ≡ MeanVectorOfClassk.
Given any table X(X1, ..., Xn) and any unit vector d in n-space, let F_d(X) = DPP_d(X) = X∘d, and
V(d) ≡ Variance(X∘d) = mean((X∘d)²) − (mean(X∘d))²
     = (1/N) Σi=1..N (Σj=1..n xi,j dj)² − ( Σj=1..n mean(Xj) dj )²
     = Σj=1..n (mean(Xj²) − mean(Xj)²) dj² + 2 Σj<k (mean(XjXk) − mean(Xj)mean(Xk)) dj dk, subject to Σi=1..n di² = 1.
In matrix form, V(d) = dᵀ∘A∘d, where Ajk = mean(XjXk) − mean(Xj)mean(Xk). We can separate out the diagonal or not: V(d) = Σj ajj dj² + Σj≠k ajk dj dk.
For the class-means table, the means are taken over the C class means, so these computations are O(C) (C = number of classes) and are instantaneous. Once we have the matrix A, we can hill-climb to obtain a d that maximizes the variance of the dot-product projections of the class means.
The gradient is ∇V(d) = 2A∘d, i.e., component i is 2aii di + Σj≠i aij dj. Given d0, one can hill-climb to locally maximize the variance V: d1 ≡ ∇V(d0); d2 ≡ ∇V(d1); ... (renormalizing to the unit sphere each step).
FAUST Classifier MVDI (Maximized Variance Definite Indefinite): build a decision tree. 1. Each round, find the d that maximizes the variance of the dot-product projections of the class means. 2. Apply DI each round (see next slide).
Theorem 1: for some k ∈ {1,...,n}, d = ek will hill-climb V to its global maximum. Theorem 2 (working on it): let d = ek such that akk is a maximal diagonal element of A; then d = ek will hill-climb V to its global maximum.
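The hill-climb above can be sketched in a few lines (a sketch in plain Python, not the pTree formulation; the names `covariance` and `hill_climb_d` are mine). With renormalization to the unit sphere after each gradient step, this is essentially power iteration, which converges toward the dominant eigenvector of A:

```python
import math

def covariance(X):
    # A[j][k] = mean(X_j X_k) - mean(X_j) * mean(X_k), computed from rows of X
    n, N = len(X[0]), len(X)
    m = [sum(row[j] for row in X) / N for j in range(n)]
    return [[sum(row[j] * row[k] for row in X) / N - m[j] * m[k]
             for k in range(n)] for j in range(n)]

def hill_climb_d(A, d, steps=200, rate=0.1):
    # Gradient of V(d) = d^T A d is 2 A d; step uphill, then renormalize to unit length.
    for _ in range(steps):
        g = [2 * sum(A[j][k] * d[k] for k in range(len(d))) for j in range(len(d))]
        d = [dj + rate * gj for dj, gj in zip(d, g)]
        norm = math.sqrt(sum(dj * dj for dj in d))
        d = [dj / norm for dj in d]
    return d

# Nearly all the variance lies along axis 0, so the climb should align d with e1.
X = [[0.0, 0.0], [10.0, 1.0], [20.0, 0.0], [30.0, 1.0]]
A = covariance(X)
d = hill_climb_d(A, [0.6, 0.8])
```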
6
FAUST DI: K-class training set, TK, and a given d (e.g., from D ≡ MeanTK→MedianTK). Let mi ≡ mean(Ci), indexed so that d∘m1 ≤ d∘m2 ≤ ... ≤ d∘mK, and let
Mni ≡ Min{d∘Ci}, Mxi ≡ Max{d∘Ci}, Mn>i ≡ Minj>i{Mnj}, Mx<i ≡ Maxj<i{Mxj}.
Definite_i = ( Mx<i, Mn>i );  Indefinite_i,i+1 = [ Mn>i, Mx<i+1 ]. Then recurse on each Indefinite set.
For IRIS, 15 records were extracted from each class for testing; the rest are the training set, TK.
1st round, D = Means→Meane: F < 18 → Setosa (35 Setosa); 18 < F < 37 → Versicolor (15 Versicolor); 37 ≤ F ≤ 48 → IndefiniteSet (20 Versicolor, 10 Virginica); 48 < F → Virginica (25 Virginica).
IndefSet2 round, D = Meane→Meani: F < 7 → Versicolor (17 Versicolor, 0 Virginica); 7 ≤ F ≤ 10 → IndefSet3 (3 Versicolor, 5 Virginica); 10 < F → Virginica (0 Versicolor, 5 Virginica).
IndefSet3 round, D = Meane→Meani: F < 3 → Versicolor (2 Versicolor, 0 Virginica); 3 ≤ F ≤ 7 → IndefSet4 (2 Versicolor, 1 Virginica), where we assign 0 ≤ F ≤ 7 → Versicolor; 7 < F → Virginica (0 Versicolor, 3 Virginica).
Test: 1st round, D = Means→Meane: F < 15 → Setosa (15 Setosa); 15 ≤ F ≤ 41 → IndefiniteSet (15 Versicolor, 1 Virginica); 41 < F → Virginica. IndefSet2 round, D = Meane→Meani: F < 20 → Versicolor (15 Versicolor, 0 Virginica); 20 < F → Virginica (0 Versicolor, 1 Virginica). 100% accuracy.
Option-1: the sequence of D's is Mean(Classk)→Mean(Classk+1), k = 1, 2, ... (and Mean could be replaced by VOM or ...?).
Option-2: the sequence of D's is Mean(Classk)→Mean(∪h=k+1..n Classh) (VOM?), where k is the class with max count in the subcluster.
Option-3: D sequence: Mean(Classk)→Mean(∪h not yet used Classh), where k is the class with max count in the subcluster (VOM instead?).
Option-4: D sequence: always pick the means pair that are furthest separated from each other.
Option-5: D: start with Median-to-Mean of the IndefiniteSet, then the means pair corresponding to the max separation of F(meani), F(meanj).
Option-6: D: always use Median-to-Mean of the IndefiniteSet, IS (initially, IS = X).
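The Definite/Indefinite interval construction above can be sketched directly from the definitions (a plain-Python sketch; the function name `di_intervals` is mine, and classes are assumed pre-sorted by their projected means, as the slide requires):

```python
def di_intervals(projected_classes):
    # projected_classes: one list of F-values (d.x projections) per class,
    # ordered so that d.m1 <= d.m2 <= ... <= d.mK.
    mins = [min(c) for c in projected_classes]   # Mn_i
    maxs = [max(c) for c in projected_classes]   # Mx_i
    K = len(projected_classes)
    definite, indefinite = [], []
    for i in range(K):
        lo = max(maxs[:i], default=float('-inf'))     # Mx_{<i}
        hi = min(mins[i + 1:], default=float('inf'))  # Mn_{>i}
        definite.append((lo, hi))                     # Definite_i = (Mx_{<i}, Mn_{>i})
        if i + 1 < K:
            # Indefinite_{i,i+1} = [Mn_{>i}, Mx_{<i+1}]; empty if lo > hi here
            indefinite.append((min(mins[i + 1:]), max(maxs[:i + 1])))
    return definite, indefinite

# Two overlapping classes: the overlap [1.5, 2.0] is the indefinite interval to recurse on.
definite, indefinite = di_intervals([[0.0, 1.0, 2.0], [1.5, 3.0, 4.0]])
```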
7
FAUST MVDI on IRIS: 15 records from each class for testing (Virg39 was removed as an outlier).
Round 1, d0 = (.33, −.1, .86, .38): Definite: (−1, 16.5 = avg{23,10}) → Setosa; (16.5, 38) → Versicolor; (48, 128) → Virginica (iCt = 39). Indefinite: [38, 48] → se_i (seCt = 26, iCt = 13).
Round 2 on the indefinite set, d1 = (−.55, −.33, .51, .57): (−1, 8) → Versicolor; (10, 128) → Virginica (Ct = 9); indefinite [8, 10] → e_i (eCt = 5, iCt = 4). In this case, since the indefinite interval is so narrow, we absorb it into the two definite intervals, resulting in the decision tree:
x∘d0 < 16.5 → Setosa
16.5 ≤ x∘d0 < 38 → Versicolor
38 ≤ x∘d0 ≤ 48 → apply d1 = (−.55, −.33, .51, .57): x∘d1 < 9 → Versicolor; x∘d1 ≥ 9 → Virginica
48 < x∘d0 → Virginica
8
FAUST MVDI on SatLog: 413 training samples, 4 attributes, 6 classes, 127 test samples.
Using the class means vs. using the full data for the gradient hill-climb of Variance(d): the full data works much better! At each decision-tree node, a gradient hill-climb of Var(d) on the remaining training subset (t25, t257, t75, t13, t143, ...) produces a d (e.g., d = (−.66, .19, .47, .56) and d = (−.81, .17, .45, .33) at two of the nodes) and the F[a,b) class intervals; at t257 the result is the same whether class means or the training subset is used.
[Slide tables: hill-climb iterates (d1 d2 d3 d4, V(d)), per-class min/max projections, and F[a,b) class assignments omitted.]
On the 127-sample SatLog test set: 4 errors, or 96.8% accuracy. Speed? With horizontal data, DTI is applied one unclassified sample at a time (per execution thread). With this pTree decision tree, we take the entire test set (a PTreeSet), create the various dot-product SPTS (one for each inode), and create the cut SPTS masks. These masks mask the results for the entire test set.
For WINE: awful results! The hill-climbs at t156161, t146156 and t127 were inconclusive both ways, so we predict the plurality class each time.
9
FAUST MVDI on Concrete and Seeds.
Concrete (classes l, m, h): 7 test errors / 30 (77% accuracy). Decision tree: x∘d0 < 320 → Class m (test: 1/1); x∘d0 ≥ 634 → Class l (test: 1/1); x∘d2 < 28 → Class l or m; x∘d2 ≥ 92 → Class m (test: 2/2); x∘d2 ≥ 662 → Class h (test: 11/12); x∘d3 < 544 → Class m (test: 0/0); x∘d3 < 969 → Class l (test: 6/9); x∘d3 ≥ 868 → Class m (test: 1/1); x∘d4 < 640 → Class l (test: 2/2); x∘d4 ≥ 681 → Class l (test: 0/3).
Seeds (classes l, m, h): 8 test errors / 32 (75% accuracy). E.g., x∘d < 13.2 → Class h (errors 0/5); x∘d ≥ 19.3 → Class m (errors 0/1); x∘d ≥ 18.6 → Class m (errors 0/4); remaining leaves: Class h (errors 0/1), Class m (errors 0/0), Class l (errors 0/4), Class m (errors 8/12).
[Slide tables: per-class min/max projection intervals for d0..d4 omitted.]
10
FAUST Oblique Classifier. Formula: P(X∘D > a), X any set of vectors, D an oblique vector (note: if D = ei, this is PXi > a). For classes r and v, use P((mr−mv)/|mr−mv| ∘ X < a) with D = mr−mv.
E.g., let D be the vector connecting the class means and d = D/|D|. To separate r from v: D = (mv−mr), a = ((mv+mr)/2) ∘ d, the midpoint of D projected onto d. Then P(X∘d > a) = P(Σ di Xi > a).
FAUST-Oblique: create a table, TBL(classi, classj, medoid_vectori, medoid_vectorj). Notes: if we just pick the one class which, when paired with r, gives the max gap, then we can use max_gap or max_std_interval_point instead of max_gap_midpoint; then we need stdj (or variancej) in TBL. Best cut point? Mean, vector of medians, outermost, outermost non-outlier?
ANDing the two pTree masks P((mb−mr)∘X > ((mr+mb)/2)∘d) and P((mv−mr)∘X > ((mr+mv)/2)∘d) masks the vectors that make a shadow on the mr side of the midpoint. "Outermost" = furthest from the means (their projections on the D-line); best rank-K points, best std points, etc. "Medoid-to-medoid" is close to optimal provided the classes are convex. The same holds in higher dimensions (if the classes are "convexly" clustered, FAUST{div, oblique_gap} finds them).
[Slide diagrams: r/v/b point clouds with means mr, mv, mb and the separating cuts omitted.]
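The mean-to-mean cut above reduces to a few dot products (a horizontal plain-Python sketch, not the bulk pTree mask computation; the names `oblique_cut` and `classify` are mine):

```python
import math

def oblique_cut(mr, mv):
    # d = (mv - mr)/|mv - mr|; cut value a = midpoint of the means projected onto d.
    D = [b - a for a, b in zip(mr, mv)]
    n = math.sqrt(sum(x * x for x in D))
    d = [x / n for x in D]
    a = sum((p + q) / 2 * dk for p, q, dk in zip(mr, mv, d))
    return d, a

def classify(x, d, a):
    # x.d < a  -> the mr side ('r');  otherwise the mv side ('v')
    return 'r' if sum(xi * di for xi, di in zip(x, d)) < a else 'v'

# Means at (0,0) and (4,0): d = (1,0) and the cut falls at a = 2.
d, a = oblique_cut((0.0, 0.0), (4.0, 0.0))
```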
11
FAUST Oblique: PR = P(X∘d < a), with the d-line D ≡ mR→mV the oblique vector and d = D/|D|.
Separate classR and classV using the midpoint-of-means (mom) method: calculate a. View mR, mV as vectors (mR ≡ the vector from the origin to pt_mR); then a = (mR + (mV−mR)/2) ∘ d = ((mR+mV)/2) ∘ d. (The very same formula works when D = mV→mR, i.e., points to the left.)
Training ≡ choosing the "cut hyperplane" (CHP), which is always an (n−1)-dimensional hyperplane (which cuts space in two). Classifying is one horizontal program (AND/OR) across the pTrees to get a mask pTree for each entire class (bulk classification).
Improve accuracy? E.g., by considering the dispersion within classes when placing the CHP:
1. Use the vector of medians, vom, to represent each class rather than mV, where vomV ≡ ( median{v1 | v ∈ V}, median{v2 | v ∈ V}, ... ).
2. Project each class onto the d-line (e.g., the R-class below); then calculate the std of those distances from the origin along the d-line (one horizontal formula per class, using Md's method); then use the std ratio to place the CHP (no longer at the midpoint between mR [vomR] and mV [vomV]).
[Slide diagram: R and V point clouds in dims 1 and 2 with vomR, vomV and the d-line omitted.]
12
[Slide table, dated 12/8/12: L1(x,y) value and count arrays for the 15 points z1..zf, with their x, y coordinates.]
13
[Slide table: L1(x,y) value and count arrays for the 15 points z1..zf, with their x, y coordinates and mean M.]
This just confirms z6 as an anomaly or outlier, since it was already declared so during the linear gap analysis. It likewise confirms zf as an anomaly or outlier, since it too was already declared so during the linear gap analysis. After having subclustered with linear gap analysis, it would make sense to run this round-gap algorithm out only 2 steps, to determine whether there are any singleton, gap>2 subclusters (anomalies) which were not found by the previous linear analysis.
14
Cluster by splitting at gaps > 2
y∘(x−M)/|x−M| value and count arrays for the points z1..z15, with mean M. F values (relative to z1), as tabulated on the slide:
z1..z15: 14, 12, 12, 11, 10, 6, 1, 2, 0, 2, 2, 1, 2, 0, 5.
Gaps > 2 occur at 10−6 and at 5−2, splitting off cluster pTree masks (built by ORing the point masks, e.g., z11, z12, z13).
[Slide table: x, y coordinates of z1..zf and the mean M omitted.]
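The splitting step can be sketched directly from the slide's F-values (plain Python over a horizontal dict rather than pTree masks; the function name `split_at_gaps` is mine):

```python
def split_at_gaps(points_F, min_gap=2):
    # Sort points by F-value and start a new cluster wherever consecutive
    # values differ by more than min_gap.
    items = sorted(points_F.items(), key=lambda kv: kv[1])
    clusters, cur = [], [items[0][0]]
    for (pid, f), (_, fprev) in zip(items[1:], items):
        if f - fprev > min_gap:
            clusters.append(cur)
            cur = []
        cur.append(pid)
    clusters.append(cur)
    return clusters

# F-values as read off the slide: gaps 2->5 and 6->10 yield three clusters.
F = {'z1': 14, 'z2': 12, 'z3': 12, 'z4': 11, 'z5': 10, 'z6': 6, 'z7': 1,
     'z8': 2, 'z9': 0, 'z10': 2, 'z11': 2, 'z12': 1, 'z13': 2, 'z14': 0, 'z15': 5}
clusters = split_at_gaps(F, 2)
```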
15
Cluster by splitting at gaps > 2
(Same y∘(x−M)/|x−M| value/count arrays and point table as the previous slide.) Continuing the gap>2 splitting: a gap between 6 and 9 yields the additional cluster masks z71, z72 alongside z11, z12, z13.
16
Cluster by splitting at gaps > 2
(Same y∘(x−M)/|x−M| value/count arrays and point table as before.) A further gap between 3 and 7 yields the masks zd1, zd2 in addition to z11, z12, z13 and z71, z72.
17
Cluster by splitting at gaps > 2
(Same y∘(x−M)/|x−M| value/count arrays and point table as before.) Finally, AND each red mask with each blue mask with each green mask to get the subcluster masks (12 ANDs).
18
FAUST Clustering Methods:
MCR (using midlines of the circumscribing coordinate rectangle). Let nv = MinVect = (nv1, ..., nvn) and Xv = MaxVect = (Xv1, ..., Xvn) be the min and max corners of the circumscribing rectangle.
For any FAUST clustering method, we proceed in one of 2 ways: gap analysis of the projections onto a unit vector d, and/or gap analysis of the distances from a point f (and another point g, usually). Given d, f ≡ MinPt(x∘d) and g ≡ MaxPt(x∘d). Given f and g, d ≡ (f−g)/|f−g|. So we can do any subset: (d), (df), (dg), (dfg), (f), (fg), (fgd), ...
Define a sequence fk, gk, dk:
fk ≡ ((nv1+Xv1)/2, ..., nvk, ..., (nvn+Xvn)/2), gk ≡ ((nv1+Xv1)/2, ..., Xvk, ..., (nvn+Xvn)/2), dk = ek, and SpS(x∘dk) = Xk.
f, g, d, SpS(x∘d) require no processing (gap-finding is the only cost). MCR(fg) adds the cost of SpS((x−f)∘(x−f)) and SpS((x−g)∘(x−g)).
MCR(dfg) on Iris150: do SpS(x∘d) linear gap analysis first (since it is processing-free). d3 splits set23..set45 from ver49..vir19; d1 and d2 find none. On what's left, look for outliers in SubClus1 and SubClus2 by sequencing through the {f, g} pairs: SpS((x−f)∘(x−f)) and SpS((x−g)∘(x−g)) round gaps. In SubClus1, f2 flags vir23, vir18, vir32, and d4 flags set44 and vir39, leaving exactly the 50 Setosa; all other checks find none. In SubClus2, all checks find none, leaving 50 Versicolor and 49 Virginica.
[Slide diagram: unit-cube corners 0000..1111 with midline endpoints f1 = 0½½½, g1 = 1½½½, f2 = ½0½½, g2 = ½1½½, f3 = ½½0½, g3 = ½½1½, f4 = ½½½0, g4 = ½½½1 omitted.]
19
MCR(d) on Iris150+Outlier30, gap>4:
Do SpS(x∘dk) linear gap analysis, k = 1, 2, 3, 4. Declare subclusters of size 1 or 2 to be outliers. Create the full pairwise distance table for any subcluster of size ≤ 10 and declare a point an outlier if its column values (other than the zero diagonal value) all exceed the threshold (which is 4).
d1 splits off t124, t14, tall, t134, t13, t12, t1, t123 and b12, b1, b13, b123, b124, b134, b14, ball from set14..vir32. d2 splits off t2, t23, t24, t234 and b24, b2, b234, b23 from ver1..set16. d3 gives the same split (expected). SubClus1 d4 flags set44 and vir39, leaving exactly the 50 Setosa as SubCluster1. SubClus2 d4 flags t4, t24, b4, b24 (and the ver18..vir45 range survives), leaving the 49 Virginica (vir39 declared an outlier) and the 50 Versicolor as SubCluster2.
MCR(d) performs well on this dataset. Accuracy: we can't expect a clustering method to separate Versicolor from Virginica, because there is no gap between them. This method does separate off Setosa perfectly and finds all 30 added outliers (subclusters of size 1 or 2). It finds the Virginica outlier, vir39, which is the most prominent intra-class outlier (distance 29.6 from the other Virginica irises, whereas no other iris is more than 9.1 from its classmates).
Speed: dk = ek, so there is zero calculation cost for the d's. SpS(x∘dk) = SpS(x∘ek) = SpS(Xk), so there is zero calculation cost for it. The only cost is the loading of the dataset PTreeSet(X) (we use one column, SpS(Xk), at a time), and that loading is required for any method. So MCR(d) is optimal with respect to speed!
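The pairwise-distance outlier test above is small enough to sketch directly (plain Python over a horizontal point dict; the name `outliers_by_distance` is mine — the slide builds the same table, then scans each column against the threshold):

```python
import math

def outliers_by_distance(subcluster, threshold=4):
    # subcluster: {point_id: coordinates}. A point is an outlier if every OTHER
    # point in the subcluster lies farther than `threshold` from it (i.e., its
    # whole distance-table column, excluding the zero diagonal, exceeds threshold).
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return [p for p, xp in subcluster.items()
            if all(dist(xp, xq) > threshold for q, xq in subcluster.items() if q != p)]

# 'a' and 'b' are within 1 of each other; 'c' is far from both, so only 'c' is flagged.
pts = {'a': (0.0, 0.0), 'b': (1.0, 0.0), 'c': (10.0, 10.0)}
```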
20
CCR(fgd) (corners of the circumscribing coordinate rectangle): f1 = minVecX ≡ (minX x1, ..., minX xn) (0000); g1 = MaxVecX ≡ (MaxX x1, ..., MaxX xn) (1111); d = (g−f)/|g−f|. Sequence through the main-diagonal corner pairs {f, g} lexicographically; for each, create d.
CCR(f): do SpS((x−f)∘(x−f)) round gap analysis. CCR(g): do SpS((x−g)∘(x−g)) round gap analysis. CCR(d): do SpS(x∘d) linear gap analysis.
Notes: no calculation is required to find f and g (assuming MaxVecX and minVecX have been calculated and residualized when PTreeSetX was captured). If the dimension is high, the main-diagonal corners are likely far from X, and thus the large radii make the round gaps nearly linear.
Starting with f1 = MnVec (round gap > 4: none) and g1 = MxVec (round gap > 4 splits vir18..ver30, ver49.., ..set14): this yields SubClus1 and SubClus2.
SubClus2: the pairs f1=0000/g1=1111 through f8=0111/g8=1000 and the linear checks all find none; SubClus2 ends as the 47 Setosa only.
SubClus1: f6=0101 (round gap > 4) flags set26, ver49, set42, ver8, set36, ver44, ver11 (with ver49, ver8, ver44, ver11 forming Subcluster2.1); f7=0110 flags ver13 and vir49; every other pair and linear check finds none. This ends SubClus1 = 95 Versicolor and Virginica samples only.
[Slide detail: the per-pair round/linear gap listings omitted.]
21
[Slide table: the Iris150+Outlier30 dataset — SL, SW, PL, PW values for the 50 Setosa, 50 Versicolor and 50 Virginica samples, plus the added outliers t1..tall and b1..ball. The MINS, MAXS and MEAN rows are the same before and after the additions.]
22
FM(fgd) (Furthest-from-the-Medoid)
FMO (FM using a Gram-Schmidt orthonormal basis). X ⊆ Rn. Calculate M = MeanVector(X) directly, using only the residualized 1-counts of the basic pTrees of X. (And BTW, use residualized STD calculations to guide the choice of good gap-width thresholds, which define what an outlier is going to be and also determine when we divide into subclusters.)
f1 ≡ MaxPt(SpS[(M−x)∘(M−x)]), d1 ≡ (M−f1)/|M−f1|.
If d1 ≠ ±ek for every k, Gram-Schmidt orthonormalize {d1, e1, ..., ek−1, ek+1, ..., en}:
d2 ≡ (e2 − (e2∘d1)d1) / |e2 − (e2∘d1)d1|
d3 ≡ (e3 − (e3∘d1)d1 − (e3∘d2)d2) / |e3 − (e3∘d1)d1 − (e3∘d2)d2|
...
dh ≡ (eh − (eh∘d1)d1 − (eh∘d2)d2 − ... − (eh∘dh−1)dh−1) / |eh − (eh∘d1)d1 − (eh∘d2)d2 − ... − (eh∘dh−1)dh−1|
Thm: MaxPt[SpS((M−x)∘d)] = MaxPt[SpS(x∘d)] (shifting by M∘d leaves the MaxPts the same). Re-pick f1 ≡ MinPt[SpS(x∘d1)]; pick g1 ≡ MaxPt[SpS(x∘d1)]; and in general pick fh ≡ MinPt[SpS(x∘dh)], gh ≡ MaxPt[SpS(x∘dh)].
The procedure: 1. Choose f0 (high outlier potential? e.g., furthest from the mean, M?). 2. Do f0-round-gap analysis (plus subcluster analysis?). 3. Let f1 be such that no x is further away from f0 (in some direction), i.e., all d1 dot products ≥ 0. 4. Do f1-round-gap analysis (plus subcluster analysis?). 5. Do d1-linear-gap analysis, d1 ≡ (f0−f1)/|f0−f1|. 6. Let f2 be such that no x is further away (in some direction) from the d1-line than f2. 7. Do f2-round-gap analysis. 8. Do d2-linear-gap analysis, d2 ≡ (f0−f2 − ((f0−f2)∘d1)d1), normalized. ...
Results on Iris150+Out30: the f=M gap>4 check flags b13, t123, b234, tall, b134, b123, ball — all outliers! The f0=t123 round gaps flag t123, t13, t134, ..., b23, b13. SubClust-1 rounds flag b2, b3, b1, b23, b124, b12, b14, b24, t24, t2 (all outliers again!) and leave only Versicolor and Virginica; SubClust-2 rounds flag t3, t34, and the remaining checks find none, so SubClust-2 is the 50 Setosa! Likely the f2, f3 and f4 analyses will find none.
[Slide detail: individual round/linear gap listings omitted.]
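One step of the Gram-Schmidt recursion above can be sketched as follows (plain Python; the function name `gram_schmidt_next` is mine — it produces d_h from e_h and the already-built orthonormal d_1..d_{h−1}):

```python
import math

def gram_schmidt_next(basis, e):
    # Subtract from e its components along each existing unit vector in `basis`
    # (modified Gram-Schmidt), then renormalize:
    #   d_h = (e_h - sum_j (e_h . d_j) d_j) / | ... |
    v = list(e)
    for d in basis:
        c = sum(vi * di for vi, di in zip(v, d))
        v = [vi - c * di for vi, di in zip(v, d)]
    n = math.sqrt(sum(vi * vi for vi in v))
    return [vi / n for vi in v]

# With d1 = e1, the component of (1,1,0) along e1 is removed, leaving e2.
d2 = gram_schmidt_next([[1.0, 0.0, 0.0]], [1.0, 1.0, 0.0])
```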
23
FMO(d): f1=ball, g1=tall linear gap>4 flags ball, b123, b134, b234, b13, ..., t13, t134, t123, tall. f2=vir11, g2=set16: none. f3=t34, g3=vir18: none. f4=t4, g4=b4 flags vir1, b4, b14; f4=t4, g4=vir1: none. This ends the process. We found all (and only) the added anomalies, but missed t34, t14, t4, t1, t3, b1, b3.
CRC method: f1=b13, g1=b2: none. f2=t2, g2=b2 flags set16, b2. f2=t2, g2=t234 flags t23, t234, t12, t24, t124, t2, ver11. f2=vir11, g2=b23 flags b12, b34, b124, b23, t13, b13. f2=vir11, g2=b12 flags set16, b24, b2, b12.
[Slide diagram: point clouds marking f and g for FMG-GM, the MCR f and g midline points, and the CRC corners f1 = MinVector, g1 = MaxVector omitted.]
24
f1=bal RnGp>4 ball b123... t4 vir t34 t12 t23 t124 t234 t13 t134 t123 tal Finally we would classify within SubCluster1 using the means of another training set (with FAUST Classify). We would also classify SubCluster2.1 and SubCluster2.2, but would we know we would find SubCluster2.1 to be all Setosa and SubCluster2.2 to be all Versicolor (as we did before). In SubCluster1 we would separate Versicolor from Virginica perfectly (as we did before). FMO(fg) start f1MxPt(SpS((M-x)o(M-x))), Round gaps first, then Linear gaps. Sub Clus1 Sub Clus2 t12 t23 t124 t234 We could FAUST Classify each outlier (if so desired) to find out which class they are outliers from. However, what about the rouge outliers I added? What would we expect? They are not represented in the training set, so what would happen to them? My thinking: they are real iris samples so we should not do the really do the outlier analysis and subsequent classification on the original 150. We already know (assuming the "other training set" has the same means as these 150 do), that we can separate Setosa, Versicolor and Virginica prefectly using FAUST Classify. SubClus2 f1=t14 Rn>4 t1 t14 ver8 ... set15 t3 t34 SubClus1 f1=b123 Rn>4 b123 b13 vir32 vir18 b23 vir6 b13 vir32 vir18 b23 If this is typical (though concluding from one example is definitely "over-fitting"), then we have to conclude that Mark's round gap analysis is more productive than linear dot product proj gap analysis! FFG (Furthest to Furthest), computes SpS((M-x)o(M-x)) for f1 (expensive? Grab any pt?, corner pt?) then compute SpS((x-f1)o(x-f1)) for f1-round-gap-analysis. Then compute SpS(xod1) to get g1 to have projection furthest from that of f1 ( for d1 linear gap analysis) (Too expensive? since gk-round-gap-analysis and linear analysis contributed very little! But we need it to get f2, etc. Are there other cheaper ways to get a good f2? Need SpS((x-g1)o(x-g1)) for g1-round-gap-analysis (too expensive!) 
SubClus2: f1=set23, Rn>4: vir39, ver49, ver8, ver44, ver11, t24, t2.  SubClus1: f1=b134, Rn>4: b134, vir19 | ver49, ver8, ver44, ver11 (almost outliers! SubCluster2.2 — which type? Must classify.)

SubClus1 trace (Rn>4 unless noted): f1=b234: b234, b34, vir10;  f1=b124: b124, b12, b14, b24, b1, ..., t4, b3;  f1=vir19: t4, b2;  g1=b2: t4, ver36;  f2=ver13: ver13, ver43;  g2=vir10: vir10 | ver44;  f4=b1: b1 | ver1;  g4=b4: b4 | vir15.  SubClus1 now has 91 samples, only Versicolor and Virginica.

SbCl_2.1 trace: g1=ver39: vir39 | set21;  f2=set42: set42, set9;  all remaining round and linear gaps (g1=set19, f2=set9, g2=set16, f3=set16, g3=set9, f4=set, g4=set, LnG>4): none.

Note: what remains in SubClus2.1 is exactly the 50 Setosa. But we wouldn't know that, so we continue to look for outliers and subclusters.
CRMSTD(dfg): eliminate all columns whose STD is below a threshold.
For speed of text mining (and of other high-dimension data mining), we might do additional dimension reduction (after stemming the content words). A simple way is to use the STD of the column of numbers generated by a functional (e.g., Xk, SpS((x-M)o(x-M)), SpS((x-f)o(x-f)), SpS(xod), etc.). The STDs of the columns Xk can be precomputed up front, once and for all; STDs of projection and square-distance functionals must be computed after those functionals are generated (which could be done upon capture too). Good functionals produce many large gaps, and in Iris150 and Iris150+Out30 I find that the precomputed STD is a good indicator of that.

A text mining scheme might be:
1. Capture the text as a PTreeSET (after stemming the content words) and store the mean, median, and STD of every column (content word stem).
2. Throw out low-STD columns.
2'. Or use a weighted sum of "importance" and STD? (If the STD is low, there can't be many large gaps.)

A possible attribute selection algorithm:
1. Peel from X the outliers, using CRM-lin, CRC-lin, possibly M-rnd, fM-rnd, fg-rnd (Xin = X - Xout).
2. Calculate the width, crew_k, of each edge of the Xin-circumscribing rectangle.
3. Look for wide gaps top down (or, very simply, order by STD).
3'. Divide crew_k by count{xk | x∈Xin} (but that doesn't account for duplicates).
3''. Look for a preponderance of wide thin gaps top down.
3'''. Look for high projection-interval count dispersion (STD).

Notes:
1. Maybe an inlier subcluster needs to occur in more than one functional projection before being declared an inlier subcluster?
2. The STD of a functional projection appears to be a good indicator of the quality of its gap analysis.

For FAUST Cluster-d (pick d, then f = MnPt(xod) and g = MxPt(xod)), a full grid of unit vectors (all directions, equally spaced) may be needed. Such a grid can be constructed from angles θ2, ..., θn, each equi-width partitioned on [0, 180), with the formula
  d = e1·∏_{k=n..2} cos θk + e2·sin θ2·∏_{k=n..3} cos θk + e3·sin θ3·∏_{k=n..4} cos θk + ... + en·sin θn,
where the θ's start at 0 and increment by the partition width.
So, d_{i1..in} = Σ_{j=1..n} [ e_j · sin(θ_{i_{j-1}}) · ∏_{k=n..j+1} cos(θ_{i_k}) ], with θ_{i_0} ≡ 0 and each angle increment dividing 180 (e.g., 90, 45, ...).

CRMSTD(dfg) results on Iris, by subcluster (eliminate all columns with STD < threshold): d3 finds the split set | set+vir39 | set25 | ver | ver49 | vir | vir19 into SubClus1 and SubClus2. (d1+d3+d4)/sqrt(3): clus1 set19, vir39; clus2 none. (d1+d3)/sqrt(2): clus2 ver49, ver8, ver44, ver11, vir10. d5 (f5=vir19, g5=set14): f5 finds vir19, vir23 in clus2. d5 (f5=vir18, g5=set14): f5 finds vir18, vir32, vir6 in clus2. The remaining combinations ((d3+d4)/sqrt(2), (d1+d2+d3+d4)/sqrt(4), and d5 with f5 = vir23, vir32, or vir6) find none.

Just about all the high-STD columns find the subcluster split; in addition, they find the four outliers (ver49, ver8, ver44, ver11) as well.
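The angle-grid construction of unit direction vectors sketched above can be written out in Python. This is a minimal sketch under the standard n-sphere convention (the first component takes the full cosine product, i.e., sin(θ_{i_0}) is read as 1); `unit_vector` and `direction_grid` are hypothetical names:

```python
import math
from itertools import product

def unit_vector(angles):
    """Build an n-dim unit vector from n-1 spherical angles (radians).
    Component j carries sin(angle_{j-1}) times the product of cosines of
    all later angles; the first component is the pure cosine product."""
    n = len(angles) + 1
    d = []
    for j in range(n):
        c = 1.0
        for k in range(j, len(angles)):   # product of the remaining cosines
            c *= math.cos(angles[k])
        s = 1.0 if j == 0 else math.sin(angles[j - 1])
        d.append(s * c)
    return d

def direction_grid(n, m):
    """All directions from m equally spaced angles per axis on [0, pi)."""
    steps = [i * math.pi / m for i in range(m)]
    return [unit_vector(list(a)) for a in product(steps, repeat=n - 1)]
```

With n = 3 and m = 4 this yields 4² = 16 unit directions; the sum-of-squares of each vector telescopes to 1, which is easy to check.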
CRMSTD(dfg) using the IRIS rectangle on Satlog (1805 rows of R, G, IR1, IR2 with classes {1,2,3,4,5,7}). Here I made a mistake and left MinVec, MaxVec, and M as they were for IRIS (so probably far from the Satlog dataset), yet the results were good. Does this suggest random f and g would work?

Functional STDs: d1 13.6 (no gaps > 3), d2 23.7, d3 17.2, d4 20.3; the averaged d-combinations range from 15.3 to 25.9; SQRT((x-M)o(x-M)) 28; the SQRT((x-f)o(x-f)) and SQRT((x-g)o(x-g)) functionals range from 24.9 (g3, none found) to 27.8 (f4). Skipping every functional with STD < 25 yields the same outliers: 2_85, 2_191, 3_361, 3_84, 3_100, 3_315, 5_24, 5_73, 5_75, 5_149, 5_168. [Per-functional value/class/count columns omitted.]
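The CRMSTD pruning rule used here (eliminate all columns with STD below a threshold) can be sketched as follows. This is a minimal sketch with hypothetical helper names `column_stds` and `prune_low_std`, assuming population STD over a row-major table:

```python
import math

def column_stds(table):
    """Population STD of each column of a row-major table."""
    n = len(table)
    stds = []
    for col in zip(*table):
        mu = sum(col) / n
        stds.append(math.sqrt(sum((v - mu) ** 2 for v in col) / n))
    return stds

def prune_low_std(table, threshold):
    """Keep only columns whose STD reaches the threshold.
    A low-STD column cannot contain large gaps, so it is unlikely
    to contribute to gap-based outlier/subcluster analysis."""
    keep = [j for j, s in enumerate(column_stds(table)) if s >= threshold]
    return [[row[j] for j in keep] for row in table], keep
```

For example, a constant column (STD 0) is dropped while a spread-out column survives, matching the slide's "skip STD < 25" step.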
CRMSTD(dfg) Satlog corners on Satlog
Classes: 1=red soil, 2=cotton, 3=grey soil, 4=damp grey soil, 5=soil with stubble, 6=mixture, 7=very damp grey soil. Are classes 2 and 5 isolated from the rest (and from each other)? Classes 2 and 5 produced the greatest number of outliers.

Suppose we have a high-quality training set for this dataset, with reliably accurate class means c1M, c2M, c3M, c4M, c5M, c7M. Then we can find any class gaps that might exist by using those means as our f and g points. Take f5 = c2M and g5 to be the other means: d5(f5=c2M, g5=c7M), gap > 3, STD=26 yields SubCluster1 and SubCluster2.

Lots of outliers were found, but this did not separate the classes as subclusters (keeping in mind that classes may butt up against each other, with no gap, so that they would never appear as subclusters via gap-analysis methods). SubCluster1 consists of 191 class=2 samples. SubCluster3 contains every subcluster; next, on SubCluster3, we use f5 = c1M and g5 = c7M (d5, gap > 2, STD=68), which yields SubCluster4.

Functional STDs ("none" = no gaps found): d1 13.6 none, d2 23.7, d3 17.2, d4 20.3; the averaged d-combinations range from 15.3 to 25.4, all none; f1 11.8 none, g1 14.5 none, f2 14.9 none, g2 23.6 none, f3 16.9 none, g3 12.7, f4 22.3 none, g4 11.6, f5 24.8 none, g5 27.1 none. Distance checks: dis(2_200, 2_160)=12.4, an outlier; dis(2_60, 2_132)=3.9; dis(2_132, 5_45)=33.6, outliers.
Density: A set is T-dense iff it has no distance gaps greater than T
(Equivalently, every point has neighbors in its T-neighborhood.) We can use L1, HOB, or L∞ distance, since disL2(x,y) ≤ disL1(x,y), disL2(x,y) ≤ 2·disHOB(x,y), and disL2(x,y) ≤ n·disL∞(x,y).

Definition: Y⊆X is T-dense iff there does not exist y∈Y such that dis2(y, Y-{y}) > T.

Theorem-1: If for every y∈Y, dis2(y, Y-{y}) ≤ T, then Y is T-dense.

Using L1 distance instead of L2=Euclidean (from here on we write dis_k for disLk):
Theorem-2: disL1(x,y) ≥ disL2(x,y). Therefore: if, for every y∈Y, dis1(y, Y-{y}) ≤ T, then Y is T-dense. (Proof: dis2(y, Y-{y}) ≤ dis1(y, Y-{y}) ≤ T.)

2·disHOB(x,y) ≥ dis2(x,y). (Proof: let the bit pattern of dis2(x,y) be 0...01b_{k-1}...b_0; then disHOB(x,y) = 2^k, and the most b_{k-1}...b_0 can contribute is 2^k - 1, if it's all 1-bits. So dis2(x,y) ≤ 2^k + (2^k - 1) ≤ 2·2^k = 2·disHOB(x,y).)

Theorem-3: If, for every y∈Y, disHOB(y, Y-{y}) ≤ T/2, then Y is T-dense. (Proof: dis2(y, Y-{y}) ≤ 2·disHOB(y, Y-{y}) ≤ 2·T/2 = T.)

Theorem-4: If, for every y∈Y, dis∞(y, Y-{y}) ≤ T/n, then Y is T-dense. (Proof: dis2(y, Y-{y}) ≤ n·dis∞(y, Y-{y}) ≤ n·T/n = T.)

Pick T' based on T and the dimension n (it can be done!). If MaxGap(yoe_k) = MaxGap(Y_k) < T' for all k=1..n, then Y is T-dense. (Recall, yoe_k is just Y_k as a column of values.) Note: we use the log(n) pTreeGapFinder to avoid sorting. Unfortunately, it doesn't immediately find all gaps precisely at their full width (because it descends using power-of-2 widths), but if we find all pTreeGaps, we can be assured that MaxPTreeGap(Y) ≤ MaxGap(Y); or we can keep track of "thin gaps" and thereby actually identify all gaps (see the slide on pTreeGapFinder).

Theorem-5: If Σ_{k=1..n} MaxGap(Y_k) ≤ T, then Y is T-dense. (Proof: dis1(y,x) ≡ Σ_{k=1..n} |y_k - x_k|, and |y_k - x_k| ≤ MaxGap(Y_k). So dis2(y, Y-{y}) ≤ dis1(y, Y-{y}) ≤ Σ_{k=1..n} MaxGap(Y_k) ≤ T.)
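Theorem-5's sufficient condition is cheap to check per coordinate. A minimal sketch (hypothetical names `max_gap` and `is_T_dense_sufficient`; gaps are found here by sorting rather than by the pTreeGapFinder):

```python
def max_gap(values):
    """Largest gap between consecutive distinct sorted values of a column."""
    s = sorted(set(values))
    return max((b - a for a, b in zip(s, s[1:])), default=0)

def is_T_dense_sufficient(Y, T):
    """Sufficient (not necessary) test from Theorem-5: if the
    per-coordinate MaxGaps sum to at most T, Y is T-dense."""
    n_cols = len(Y[0])
    return sum(max_gap([y[k] for y in Y]) for k in range(n_cols)) <= T
```

Note this test can return False for a set that is in fact T-dense; it only certifies density when the coordinate-gap sum is small enough.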
Round 2 is straightforward. So: 1. Given gaps, find the ct=k intervals.
No gaps (ct=0 intervals) on the furthest-to-mean line, but 3 ct=1 intervals. Declare p = p12, p16, p18 an anomaly if pofM is far enough from the boundary points of its interval? Here VOM = (34, 35), plotted with the mean M. Round 2 is straightforward. So: 1. Given gaps, find the ct=k intervals. 2. Find good gaps (dot product with a constant vector for linear gaps? For rounded gaps, use xox?). Note: in this example, the VOM works better than the mean. [The (p, x, y) data table and scatter plot are omitted.]
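The remark that the VOM (vector of medians) works better than the mean here can be illustrated with a made-up example: one remote outlier drags the mean far from the cluster, while the coordinate-wise median stays put. `vom` and `mean` are hypothetical helper names (low-median convention for even counts):

```python
def vom(points):
    """Vector of medians: coordinate-wise median (low median if n is even)."""
    def med(vals):
        s = sorted(vals)
        return s[(len(s) - 1) // 2]
    return [med([p[k] for p in points]) for k in range(len(points[0]))]

def mean(points):
    """Coordinate-wise mean."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]

# A tight cluster near (2, 1) plus one remote outlier at (100, 100):
cluster = [[1, 1], [2, 2], [3, 1], [2, 1], [100, 100]]
```

The mean of `cluster` lands near (21.6, 21.0), pulled toward the outlier, while the VOM stays at (2, 1) inside the cluster, which is why it serves better as a robust center for gap analysis.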
Length-based gapping depends on the data's placement relative to the origin.
Using vector lengths: here both the distance from the origin (red) and the distance from (7,0) (blue) reveal gaps. However, if the data happens to be shifted, as it is on the right, using lengths no longer works in this example. Dot product with a fixed vector, like fM, is independent of the placement of the points with respect to the origin; length-based gapping is not. Also, a squared pattern does not lend itself to rounded gap boundaries. [Figure: a square grid of points (columns a-f, rows 0-8) with distance contours, omitted.]
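The claim above — gaps in the fixed-vector projection xod survive a shift of the data, while length-based gaps do not — can be demonstrated with a tiny made-up example (hypothetical helpers `proj_gaps`, `length_gaps`):

```python
import math

def proj_gaps(points, d):
    """Gaps between consecutive sorted projections x o d."""
    vals = sorted(sum(x * di for x, di in zip(p, d)) for p in points)
    return [round(b - a, 9) for a, b in zip(vals, vals[1:])]

def length_gaps(points):
    """Gaps between consecutive sorted vector lengths |x|."""
    vals = sorted(math.hypot(*p) for p in points)
    return [round(b - a, 9) for a, b in zip(vals, vals[1:])]

pts = [(0, 3), (4, 0), (0, 5)]
shifted = [(x + 7, y) for x, y in pts]   # same shape, moved away from origin
d = (1.0, 0.0)                           # a fixed unit direction

# Projection gaps are identical before and after the shift,
# but length-based gaps change with placement relative to the origin.
```

Running the two functions on `pts` and `shifted` shows `proj_gaps` unchanged by the translation while `length_gaps` differs, which is exactly why FAUST projects onto fM rather than gapping on |x|.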
Applying the algorithm to C4:
FAUST = Fast, Accurate Unsupervised and Supervised Teaching (teaching big data to reveal information). 6/9/12

FAUST CLUSTER-fmg (furthest-to-mean gaps, for finding round clusters):
C = X (e.g., X ≡ {p1, ..., pf}, the 15-pixel dataset).
While an incomplete cluster C remains:
  find M ≡ Medoid(C) (mean, or vector of medians, or ...?);
  pick f∈C furthest from M, from S ≡ SPTreeSet(D(x,M)) (e.g., HOBbit-furthest f: take any point from the highest-order S-slice);
  if ct(C)/dis2(f,M) > DT (DensityThreshold), C is complete;
  else split C wherever P ≡ PTreeSet(cofM/|fM|) has a gap > GT (GapThreshold).
End While.
Notes: a. Euclidean versus HOBbit furthest. b. fM/|fM| versus just fM in P. c. Find gaps by sorting P, or by the O(log n) pTree method?

On the 15-point example (interlocking horseshoes with an outlier): C2={p5} is complete (a singleton = outlier). C3={p6,pf} will split (details omitted), so {p6} and {pf} are complete (outliers). That leaves C1={p1,p2,p3,p4} and C4={p7,p8,p9,pa,pb,pc,pd,pe} still incomplete. C1 is dense (density(C1) = ~4/2^2 = .5 > DT = .3?), thus C1 is complete. Applying the algorithm to C4: {pa} is an outlier, and the remainder splits into {p9} and {pb,pc,pd}, both complete. In both cases those probably are the best "round" clusters, so the accuracy seems high. The speed will be very high! [The point-grid figures and the D(x,M0) distance column (2.2, 3.9, 6.3, 5.4, 3.2, 1.4, 0.8, 2.3, 4.9, 7.3, 3.8, 3.3, 1.8, 1.5) are omitted.]
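The fmg loop above can be sketched in plain Python under several simplifying assumptions: mean instead of medoid, exact Euclidean furthest point instead of the HOBbit shortcut, and gaps found by sorting projections rather than with pTrees. `fmg_cluster` is a hypothetical name, not the author's implementation:

```python
import math

def fmg_cluster(points, gap_thresh=2.0, dens_thresh=0.3):
    """Sketch of FAUST CLUSTER-fmg: project each cluster onto the line
    from its mean M to its furthest point f, split at projection gaps
    > gap_thresh, and declare a cluster complete when it is a singleton
    or dense enough (count / dis^2(f, M) > dens_thresh)."""
    def mean(C):
        n = len(C)
        return [sum(p[k] for p in C) / n for k in range(len(C[0]))]
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    done, todo = [], [list(points)]
    while todo:
        C = todo.pop()
        if len(C) == 1:
            done.append(C); continue
        M = mean(C)
        f = max(C, key=lambda p: dist(p, M))
        r = dist(f, M)
        if r == 0 or len(C) / r ** 2 > dens_thresh:
            done.append(C); continue                 # dense: complete
        d = [(fk - mk) / r for fk, mk in zip(f, M)]  # unit fM direction
        proj = sorted((sum(pk * dk for pk, dk in zip(p, d)), tuple(p))
                      for p in C)
        parts, cur = [], [list(proj[0][1])]
        for (a, _), (b, p) in zip(proj, proj[1:]):
            if b - a > gap_thresh:                   # gap: start new part
                parts.append(cur); cur = []
            cur.append(list(p))
        parts.append(cur)
        if len(parts) == 1:
            done.append(C)                           # no gap: accept as-is
        else:
            todo.extend(parts)
    return done
```

On two well-separated made-up blobs this returns two clusters of three points each; the stop-on-no-gap branch prevents looping when a sparse cluster has no splittable gap.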
FAUST CLUSTER-fmg: an O(log n) pTree method for finding P-gaps. P ≡ ScalarPTreeSet(c o fM/|fM|). HOBbit furthest-point list = {p1}; pick f = p1. dens(C) = 16/8^2 = 16/64 = .25. If GT = 2^k, then check all k levels, down to the intervals of width 2^k.

Work down the bit-slice interval tree, counting the points in each interval:
  level 3: [0,7] ct=5; [8,15] ct=10.
  level 2: [0,3] ct=3; [4,7] ct=2; [8,11] ct=2; [12,15] ct=8.
  level 1: [0,1] ct=1; [2,3] ct=2; [4,5] ct=1; [6,7] ct=1; [8,9] ct=1; [10,11] ct=1; [12,13] ct=3; [14,15] ct=4.
  level 0 (counts for values 0..15): 0, 1, 0, 2, 0, 0, 1, 0, 0, 1, 1, 0, 0, 4, 2, 2.
Gaps occur at each zero-count value. Get a mask pTree for each cluster by ORing the pTrees between pairs of gaps. Next slide: use xofM instead of xoUfM. [The raw x1, x2, D(x,M), xoUp1M, and P3..P0 bit columns are omitted; the xoUp1M projections included 1, 3, 4, 6, 9, 14, 13, 15, 10.]
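The power-of-2 interval descent can be sketched without pTrees by counting points per dyadic interval (plain scans stand in here for the pTree AND/count operations; `dyadic_gaps` is a hypothetical name). As the slide notes, a gap sliced by dyadic boundaries shows up as thin empty pieces, so adjacent empty intervals are merged to recover each gap at full width:

```python
def dyadic_gaps(values, bits, min_gap):
    """Descend the power-of-2 interval tree over [0, 2**bits), recording
    empty dyadic intervals (no sorting of values), then merge adjacent
    empties and keep those at least min_gap wide."""
    empties = []

    def descend(lo, level):
        width = 1 << level
        if not any(lo <= v < lo + width for v in values):
            empties.append([lo, lo + width])   # whole interval is gap piece
        elif level > 0:                        # nonempty: split and recurse
            descend(lo, level - 1)
            descend(lo + width // 2, level - 1)

    descend(0, bits)
    merged = []
    for lo, hi in sorted(empties):
        if merged and merged[-1][1] == lo:
            merged[-1][1] = hi                 # weld adjacent empty pieces
        else:
            merged.append([lo, hi])
    return [(lo, hi) for lo, hi in merged if hi - lo >= min_gap]
```

On the slide's projection values (1, 3, 4, 6, 9, 10, 13, 14, 15) with 4-bit values and a width-2 gap threshold, this reports the gaps [7,9) and [11,13), i.e., between 6 and 9 and between 10 and 13.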
FAUST CLUSTER ffd summary
If DT=1.1 then {pa} joins {p7,p8,p9}. If DT=0.5 then also {pf} joins {pb,pc,pd,pe} and {p5} joins {p1,p2,p3,p4}. We call the overall method FAUST CLUSTER because it resembles FAUST CLASSIFY algorithmically, and k (the number of clusters) is dynamically determined.

Improvements? A better stopping condition? Is fmg better than ffd? In ffd, what if k overshoots its optimal value? Add a fusion step each round? As Mark points out, having k too large can be problematic. The proper definition of outlier or anomaly is a huge question: an outlier or anomaly should be a cluster that is both small and remote. How small? How remote? What combination? Should the definition be global or local? We need to research this (and give users options and advice for their use).

Md: create f = the furthest point from M, and d(f,M), while creating D = SPTreeSet(d(x,M))? Or, as a separate procedure: start with P = D_h (h = high bit position), then recursively P_k ← P & D_{h-k} until P_{k+1} = 0; then back up to P_k and take any of those points as f, and that bit pattern is d(f,M). Note that this doesn't necessarily give the furthest point from M, but it gives a point sufficiently far from M. (Or use HOBbit distance?) Modify it to get the absolutely furthest point by jumping (when the AND gives zero) to P_{k+2} and continuing the AND from there. (D_h gives a decent f, at furthest HOBbit distance.)

With centroid=mean and h=1, DT=1.5 gives 4 outliers and 3 non-outlier clusters. [Point-grid figure omitted.]
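The Md bit-descent for picking f can be sketched with plain integer distances standing in for the D bit-slice pTrees (`hobbit_furthest` is a hypothetical name). This version backs up on every empty AND and continues with lower bits, so it returns the points at the exact maximum distance rather than merely a sufficiently far point:

```python
def hobbit_furthest(dists, high_bit):
    """Indices of the points whose distance D(x, M) is maximal, found by
    ANDing bit-slices of the distances from the high bit down: keep only
    candidates with the current bit set when any exist, otherwise back up
    (keep the candidate set) and try the next lower bit."""
    cand = set(range(len(dists)))
    for k in range(high_bit, -1, -1):
        slice_k = {i for i in cand if (dists[i] >> k) & 1}
        if slice_k:          # some candidates have this bit: keep only them
            cand = slice_k
        # else: empty AND -- back up and continue with the next lower bit
    return cand
```

Stopping the loop early (after the first nonempty slice) recovers the slide's cheaper "sufficiently far" variant, which only guarantees the furthest HOBbit slice.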
Relative gap size on the f-g line for the fission point.
Declare 2 gaps (3 clusters): C1={p1,p2,p3,p4,p5}, C2={p6} (outlier), C3={p7,p8,p9,pa,pb,pc,pd,pe,pf}. On C1, 1 gap, so declare the complete clusters C11={p1,p2,p3,p4} and C12={p5}. On C3, 1 gap, so declare C31={p7,p8,p9,pa} and C32={pb,pc,pd,pe,pf}. On C31, 1 gap: declare the complete clusters C311={p7,p8,p9} and C312={pa}. On C32, 1 gap: declare the complete clusters C321={pf} and C322={pb,pc,pd,pe}.

Does this method also work on the first example? Yes: declare 2 gaps (3 clusters), C1={p1,p2,p3,p4,p5,p6,p7,p8,pe,pf}, C2={p9,pb,pd}, C3={pa} (outlier). On C1, no gaps, so C1 has converged and is declared complete. On C2, 1 (relative) gap, and the two subclusters are uniform, so both are complete (skipping that analysis). [Point-grid figures for both examples omitted.]
FAUST CLUSTER ffd on the "Linked Horseshoe" type example:
FAUST CLUSTER-ffd on the "Linked Horseshoe" type example. Trace (the distance columns to M0, M1, M2, ..., the f/g distance columns, and the per-cluster mask pTrees PC1, PC21, ... are summarized by the density decisions): dens(C0) = 15/6.13^2 < DT, split; dens(C1) = 7/3.39^2 < DT, split; dens(C2) = 8/4.24^2 < DT, split; dens(C21) = 3/4.14^2 < DT, split; dens(C212) = 2/0.5^2 = 8 > DT, complete; dens(C221) = 2/5 < DT, split; dens(C222) = 1.04 < DT, split.

Discussion: here DT = .99 (with DT = 1.5, would everything become a singleton?). We expected FAUST to fail to find the interlocked horseshoes, but hoped that, e.g., pa and p9 would be the only singletons. Can we modify it so it doesn't make almost everything an outlier (singletons, doubletons)? a. Look at the upper cluster boundary (margin width)? b. Use STD-ratio boundaries? c. Other? d. Use a fusion step to weld the horseshoes back together. Next slide: gaps on the f-g line for the fission point. [The (x1, x2) data table and point-grid figure are omitted.]