Published by Oswin Houston. Modified over 6 years ago.
1
FAUST Classifier: PR = P(x∘d < a), PV = P(x∘d ≥ a)
0. Cut in the middle of the means: a = (mR + (mV − mR)/2)∘d = ((mR + mV)/2)∘d, where D ≡ mR→mV (the vector from mR to mV) and d = D/|D|. Then PR = P(x∘d < a) and PV = P(x∘d ≥ a).
1. Cut in the middle of the VOMs (vectors of medians) instead of the means. (Then use the stdev ratio to better place the cut?)
2. Cut in the middle of {Max{r∘d}, Min{v∘d}} (assuming mR∘d ≤ mV∘d). If there is no gap, move the cut to minimize r_errors + v_errors.
3. Hill-climb d to maximize the gap, or to minimize training set errors, or (simplest) to minimize dis(Max{r∘d}, Min{v∘d}).
4. Replace mR, mV with the average of the margin points? If Min{V∘d} ≥ Max{R∘d}, set CutR = CutV = avg{Min{V∘d}, Max{R∘d}}; else CutR ≡ Min{V∘d}, CutV ≡ Max{R∘d}.
5. PR = P(x∘d < CutR), PV = P(x∘d > CutV). y∈PR or y∈PV are definite classifications; otherwise re-do on the indefinite region, P(CutR ≤ x∘d ≤ CutV), until there is an actual gap (AND with a certain stop condition? E.g., "On the nth round, use definite only (cut at midpt(mR, mV))").
Another way to view FAUST DI is as a decision tree method. With each non-empty indefinite set, descend down the tree to a new level. For each definite set, terminate the descent and make the classification.
Each round, it may be advisable to go through an outlier removal process on each class before setting Min{V∘d} and Max{R∘d} (e.g., iteratively check whether F^-1(Min{V∘d}) consists of V-outliers).
[Figure: scatter of r and v points over dim 1 and dim 2, showing mR, mV, vomR, vomV, the d-line with cut point a, and the d2-line with Max{R∘d} and Min{V∘d}.]
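The two-class cut placement in steps 0, 4 and 5 can be sketched as follows. This is a minimal illustration, not the pTree implementation the slides assume: it loops over plain Python tuples, and the names `faust_cuts` and `classify` are hypothetical.

```python
from math import sqrt

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_vector(points):
    return [sum(col) / len(points) for col in zip(*points)]

def faust_cuts(R, V):
    """Return (d, cut_r, cut_v) for classes R and V along d = unitized (mV - mR)."""
    m_r, m_v = mean_vector(R), mean_vector(V)
    D = [b - a for a, b in zip(m_r, m_v)]
    norm = sqrt(dot(D, D))  # assumes the two class means differ
    d = [c / norm for c in D]
    max_r = max(dot(x, d) for x in R)
    min_v = min(dot(x, d) for x in V)
    if min_v >= max_r:
        # Actual gap: one shared cut in the middle of the margin.
        cut_r = cut_v = (min_v + max_r) / 2
    else:
        # Overlap: everything between the two cuts is indefinite.
        cut_r, cut_v = min_v, max_r
    return d, cut_r, cut_v

def classify(y, d, cut_r, cut_v):
    f = dot(y, d)
    if f < cut_r:
        return "R"
    if f > cut_v:
        return "V"
    return "indefinite"
```

Points that come back as "indefinite" are the ones the slide proposes to recurse on with a new d.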
2
FAUST DI: K-class training set, TK, and a given d (e.g., from D ≡ Mean(TK)→Med(TK))
Let mi ≡ mean(Ci), ordered s.t. d∘m1 ≤ d∘m2 ≤ ... ≤ d∘mK.
Mni ≡ Min{d∘Ci}, Mxi ≡ Max{d∘Ci}, Mn>i ≡ Min_{j>i}{Mnj}, Mx<i ≡ Max_{j<i}{Mxj}
Definite_i = ( Mx<i, Mn>i )
Indefinite_{i,i+1} = [ Mn>i, Mx<i+1 ]
Then recurse on each Indefinite.

For IRIS, 15 records were extracted from each class for testing. The rest are the training set, TK.
[Figure: D = MEAN_s→MEAN_e axis showing s-Mean, e-Mean, i-Mean and the Definite (s, e, i) and Indefinite (se empty, ei) ranges.]

Training:
1st round, D = Mean_s→Mean_e:
  F < setosa (35 seto)
  18 < F < versicolor (15 vers)
  37 F IndefiniteSet (20 vers, 10 virg)
  48 < F virginica (25 virg)
Round on IndefSet2, D = Mean_e→Mean_i:
  F < versicolor (17 vers, 0 virg)
  7 F IndefSet3 (3 vers, 5 virg)
  10 < F virginica (0 vers, 5 virg)
Round on IndefSet3, D = Mean_e→Mean_i:
  F < versicolor (2 vers, 0 virg)
  3 F IndefSet4 (2 vers, 1 virg). Here we will assign 0 F 7 versicolor, 7 < F virginica.
  < F virginica (0 vers, 3 virg)

Test:
1st round, D = Mean_s→Mean_e:
  F < setosa (15 seto)
  15 < F < versicolor (0 vers, 0 virg)
  15 F IndefiniteSet (15 vers, 1 virg)
  41 < F virginica ( virg)
Round on IndefSet2, D = Mean_e→Mean_i:
  F < versicolor (15 vers, 0 virg)
  20 < F virginica (0 vers, 1 virg)
100% accuracy.

Option-1: The sequence of D's is Mean(Class_k)→Mean(Class_k+1) for each k (and Mean could be replaced by VOM or?).
Option-2: The sequence of D's is Mean(Class_k)→Mean(∪_{h=k+1..n} Class_h) for each k (and Mean could be replaced by VOM or?).
Option-3: D sequence: Mean(Class_k)→Mean(∪_{h not used yet} Class_h), where k is the class with max count in the subcluster (VOM instead?).
Option-2: D sequence: Mean(Class_k)→Mean(∪_{h=k+1..n} Class_h) (VOM?), where k is the class with max count in the subcluster.
Option-4: D sequence: always pick the pair of means which are furthest separated from each other.
Option-5: D: start with Median-to-Mean of the IndefiniteSet, then the means pair corresponding to max separation of F(mean_i), F(mean_j).
Option-6: D: always use Median-to-Mean of the IndefiniteSet, IS (initially, IS = X).
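The Definite and Indefinite intervals can be computed directly from the per-class minima and maxima of F = d∘x. The sketch below assumes the classes are already ordered by d∘m_i and takes the projected values as plain lists; the function name and the toy numbers in the usage note are illustrative only.

```python
def definite_indefinite(projections):
    """projections[i] holds the F = d o x values of class i (classes ordered by mean F).

    Returns (definite, indefinite):
      definite[i]   = (Mx<i, Mn>i), an open interval
      indefinite[i] = [Mn>i, Mx<i+1], the region between classes i and i+1
    An interval whose lower bound exceeds its upper bound is empty.
    """
    inf = float("inf")
    mn = [min(p) for p in projections]
    mx = [max(p) for p in projections]
    K = len(projections)
    definite = []
    for i in range(K):
        lower = max(mx[:i], default=-inf)      # Mx<i: largest max among lower classes
        upper = min(mn[i + 1:], default=inf)   # Mn>i: smallest min among higher classes
        definite.append((lower, upper))
    indefinite = []
    for i in range(K - 1):
        lower = min(mn[i + 1:])                # Mn>i
        upper = max(mx[:i + 1])                # Mx<i+1
        indefinite.append((lower, upper))
    return definite, indefinite
```

For example, with hypothetical projections `[[1, 2, 3], [5, 6, 9], [8, 10, 12]]` the first indefinite interval is empty (its bounds are 5 and 3), while classes 2 and 3 overlap on [8, 9] and would be recursed on.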
3
FAUST DI sequential. For SEEDS, 15 records were extracted from each class for testing.
Option-4, means pair most separated in X:
[Table: d(m1,m2), d(m1,m3), d(m2,m3) with DEFINITE and INDEFINITE intervals; surviving entries: inf, 0, F 106, inf.] So totally non-productive!
Option-6: D = Median-to-Mean of IndefSet (initially IS = X):
On whole TR: def3 [-inf, 21), def1 [ ), def2 [58, inf), ind1 [ ), ind2 [ ).
On Indef-1: def3 [-inf, 0), def1 [37, inf), in11 [ ). Cls1 outlier (F=54).
On Indef-11: def3 [-inf, 0), def1 [13, inf), in11 [ ). Cls1 outlier (F=29).
On Indef-111: def3 [-inf, 9), def1 [19, inf), in111 [ ). Cls3 outlier (F=0).
On Indef-1111: def3 [-inf, 9), def1 [19, inf), in1111 [ ). Done! Declare Class=1.
4
FAUST DI sequential. For SEEDS, 15 records were extracted from each class for testing.
Option-6: D = Median-to-Mean of X:
On whole TR: def3 [-inf, 21), def1 [ ), def2 [58, inf), ind1 [ ), ind2 [ ).
D = Mean(loF)-to-Mean(hiF) of IndefSet1: def1 [-inf, 18), def2 [55, inf), in11 [ ).
D = Mean(loF)-to-Mean(hiF) of IndefSet11: def3 [-inf, 10), def1 [20, inf), in111 [ ).
D = Mean(loF)-to-Mean(hiF) of IndefSet111: def3 [-inf, 0), def1 [5, inf), in1111 [ ). The rest, Class=1 (d repeats after this, so in1111 = C1).
D = Mean(loF)-to-Mean(hiF) of IndefSet2: def1 [-inf, 2), def2 [15, inf), in2 [ ).
5
Maximize variance - is it wise?
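FAUST CLUSTERING
Use DPPd(x), but which unit vector, d*, provides the best gap(s)?
1. DPPd exhaustively searches a grid of d's for the best gap provider.
2. Use some heuristic to choose a good d?
   GV: Gradient-optimized Variance.
   MM: Use the d that maximizes |Median(F(X)) − Mean(F(X))|. We have Avg as a function of d. Median? (Can you do it?)
   HMM: Use a heuristic for Median(F(X)): F(VectorOfMedians) = VOM∘d.
   MVM: Use D = MEAN(X)→VOM(X), d = D/|D|.

Finding a good unit vector, d, for the dot product functional, DPP, to maximize gaps. Write X∘d = F_d(X) = DPP_d(X) = (x_1∘d, ..., x_N∘d), and let an overbar denote the average over the N rows of X.

\[ V(d) \equiv \mathrm{Var}\,DPP_d(X) = \overline{(X\circ d)^2} - \big(\overline{X\circ d}\big)^2 = \frac{1}{N}\sum_{i=1}^{N}\Big(\sum_{j=1}^{n} x_{i,j}d_j\Big)^2 - \Big(\sum_{j=1}^{n}\overline{X_j}\,d_j\Big)^2 \]
\[ = \frac{1}{N}\sum_{i}\Big(\sum_{j} x_{i,j}^2 d_j^2 + 2\sum_{j<k} x_{i,j}x_{i,k}d_jd_k\Big) - \Big(\sum_{j}\overline{X_j}^2 d_j^2 + 2\sum_{j<k}\overline{X_j}\,\overline{X_k}\,d_jd_k\Big) \]
\[ = \sum_{j=1}^{n}\big(\overline{X_j^2}-\overline{X_j}^2\big)d_j^2 + 2\sum_{j<k}\big(\overline{X_jX_k}-\overline{X_j}\,\overline{X_k}\big)d_jd_k \]
\[ V(d) = \sum_{j} a_{jj}d_j^2 + \sum_{j\neq k} a_{jk}d_jd_k = \sum_{i,j} a_{ij}d_id_j = d^T\circ VX\circ d, \qquad a_{ij} = \overline{X_iX_j}-\overline{X_i}\,\overline{X_j}, \qquad \text{subject to } \sum_{i=1}^{n} d_i^2 = 1 \]

With A = VX = (a_ij), the k-th component of the gradient is

\[ \frac{\partial V}{\partial d_k} = 2a_{kk}d_k + 2\sum_{j\neq k} a_{kj}d_j, \qquad \text{so} \qquad \mathrm{GRADIENT}(V) = 2A\circ d \]

Hill-climb: d_0 = e_k s.t. a_kk is max (or d_{0,k} = a_kk), then d_1 ≡ ∇V(d_0), d_2 ≡ ∇V(d_1), and so on, re-unitizing each round to keep Σ d_i² = 1.

The mean is also a formula in d and numbers only:

\[ \mathrm{Mean}(DPP_d X) = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{n} x_{i,j}d_j = \sum_{j}\Big(\frac{1}{N}\sum_{i} x_{i,j}\Big)d_j = \sum_{j=1}^{n}\overline{X_j}\,d_j, \qquad \text{subject to } \sum_i d_i^2 = 1 \]

Maximize wrt d: |Mean(DPPd(X)) − Median(DPPd(X))|. Compute Median(DPPd(X))? We want to use only pTree processing, so we want a formula in d and numbers only (like the one above for the mean, which involves only the vector d and the numbers X̄_1, ..., X̄_n).

Maximize variance - is it wise?
[Table: several example sequences compared by median, std, variance, Avg consecutive differences (avgCD), maxCD and |mean−VOM|; values lost.]
MEDIAN picks out the last 2 sequences, which have the best gaps (discounting outlier gaps at the extremes), and it discards 1, 3, 4, which are not so good.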
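The GV hill-climb can be sketched in a few lines, under the assumption that each step replaces d by the unitized gradient 2A∘d, starting from the coordinate vector e_k with the largest a_kk. For a covariance matrix this is the same update as power iteration, so V(d) does not decrease from step to step. The function names, step limit and tolerance are illustrative choices, not taken from the slides.

```python
from math import sqrt

def covariance_matrix(X):
    """a_jk = mean(X_j * X_k) - mean(X_j) * mean(X_k) over the rows of X."""
    N, n = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / N for j in range(n)]
    return [[sum(row[j] * row[k] for row in X) / N - mean[j] * mean[k]
             for k in range(n)] for j in range(n)]

def variance_along(A, d):
    """V(d) = sum_jk a_jk d_j d_k."""
    n = len(d)
    return sum(A[j][k] * d[j] * d[k] for j in range(n) for k in range(n))

def gv_hill_climb(A, steps=100, tol=1e-12):
    n = len(A)
    k = max(range(n), key=lambda j: A[j][j])   # start at e_k with the largest a_kk
    d = [1.0 if j == k else 0.0 for j in range(n)]
    v = variance_along(A, d)
    for _ in range(steps):
        grad = [2 * sum(A[j][m] * d[m] for m in range(n)) for j in range(n)]
        norm = sqrt(sum(g * g for g in grad))
        if norm == 0:                          # constant data: nothing to climb
            break
        d_new = [g / norm for g in grad]       # re-unitize to keep d o d = 1
        v_new = variance_along(A, d_new)
        if v_new - v <= tol:                   # no further improvement
            break
        d, v = d_new, v_new
    return d, v
```

On four points lying on the diagonal, `gv_hill_climb(covariance_matrix([(0, 0), (1, 1), (2, 2), (3, 3)]))` moves from e_1 to approximately (0.707, 0.707), doubling the variance from 1.25 to 2.5.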
6
FAUST Clustering, simple example: Gd(x)=xod Fd(x)=Gd(x)-MinG on a dataset of 15 image points
[Figure: the 15 points z1..zf plotted on a grid, with the 15 Value_Arrays and the 15 Count_Arrays (one for each q = z1, z2, z3, ...). Level0, stride = z1, PointSet shown as a pTree mask over z1..zf. For Fp=MN, q=z1 the projection shows two gaps, [F=2, F=5] and [F=6, F=10], and the pTree masks of the 3 resulting z1_clusters (z11, z12, z13) are obtained by ORing.]
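Splitting a point set at gaps in F can be sketched as below. This is an explicit sort-and-scan over a list of F values, not the Value_Array/Count_Array pTree process from the slide, and the F values in the usage note are hypothetical ones chosen to have gaps at [2, 5] and [6, 10].

```python
def gap_clusters(F, min_gap):
    """Group point indices into clusters, cutting wherever consecutive
    sorted F values differ by at least min_gap."""
    order = sorted(range(len(F)), key=lambda i: F[i])
    clusters, current = [], [order[0]]
    for prev, nxt in zip(order, order[1:]):
        if F[nxt] - F[prev] >= min_gap:
            clusters.append(current)   # close the cluster at the gap
            current = []
        current.append(nxt)
    clusters.append(current)
    return clusters
```

For instance, `gap_clusters([0, 1, 2, 5, 6, 10, 11], 3)` cuts at the two gaps and returns three clusters of point indices.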
7
What is the DPPd FAUST CLUSTER algorithm?
What have we learned?
D = Median→Mean, d1 ≡ D/|D| is a good start. But first, Variance-Gradient hill-climb it. (Median means Vector of Medians.)
[Figure: X split into SubCluster1 and X2 = SubCluster2.]
For X2 = SubCluster2, use a d2 which is perpendicular to d1? In high dimensions, there are many perpendicular directions. GV hill-climb d2 = D2/|D2| (D2 = Median(X2) − Mean(X2)) constrained to be ⊥ to d1, i.e., constrained to d2∘d1 = 0 (in addition to d2∘d2 = 1). We may not want to constrain this second hill-climb to unit vectors perpendicular to d1: the gap might get wider using a d2 which is not perpendicular to d1.
GMP: Gradient hill-climb (wrt d) Variance(DPPd) starting at d2 ≡ Unitized( Vom{x − (x∘d1)d1 | x∈X2} − Mean{x − (x∘d1)d1 | x∈X2} ), hill-climbed subject only to d∘d = 1. (We shouldn't constrain the 2nd hill-climb to d1∘d2 = 0, nor subsequent hill-climbs to dk∘dh = 0, h = 2...k−1, since the gap could be larger.)
GCCP: Gradient hill-climb (wrt d) Variance(DPPd) starting at d2 = D2/|D2| where D2 = CCi(X2) − CCj(X2), hill-climbed subject to d∘d = 1, where the CCs are two of the circumscribing rectangle's corners (the CCs may be faster to calculate than Mean and Vom).
Taking all edges and diagonals of CCR(X) (the Coordinate-wise Circumscribing Rectangle of X) provides a grid of unit vectors. It is an equi-spaced grid iff we use a CCC(X) (Coordinate-wise Circumscribing Cube of X). Note that there may be many CCC(X)s. A canonical one is the one that is furthest from the origin (take the longest side first, then extend each other side the same distance from the origin side of that edge).
A good choice may be to always take the longest side of CR(X) as D, D ≡ LSCR(X). Should outliers on the (n−1)-dim faces at the ends of LSCR(X) be removed first? If so, remove LSCR(X)-endface outliers until, after removal, the same side is still the LSCR(X).
Then use that LSCR(X) as D.
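The Median→Mean starting direction is cheap to compute. The sketch below takes D as the vector of medians minus the mean (the sign convention of D2 = Median(X2) − Mean(X2)) and unitizes it; `mean_to_vom_direction` is a hypothetical name, and returning `None` when the two coincide is an assumption about how to handle that degenerate case.

```python
from math import sqrt
from statistics import median

def mean_to_vom_direction(X):
    """Unit vector from Mean(X) toward VOM(X), the vector of column medians."""
    n = len(X[0])
    mean = [sum(row[j] for row in X) / len(X) for j in range(n)]
    vom = [median(row[j] for row in X) for j in range(n)]
    D = [v - m for m, v in zip(mean, vom)]
    norm = sqrt(sum(c * c for c in D))
    if norm == 0:
        return None  # mean and VOM coincide: no direction is suggested
    return [c / norm for c in D]
```

A skewed set such as `[(0, 0), (0, 0), (3, 0)]` has mean (1, 0) and VOM (0, 0), so the direction points along the negative first axis, away from the outlying point.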
8
WINE: GV, MVM and GM.
[Figure: WINE cluster trees for the GV, MVM and GM choices of d. Each node lists the functional (e.g. F-MN, (F-MN)*3, *8, *16), the gap threshold (gp2, gp3, gp8), and the L/H class counts per F interval, e.g. [0,35) C11 38L 68H, [35,53) C12 10L 13H, [57,115) 51L 83H. The ACCURACY row for WINE (GV, MVM, GM) did not survive.]
9
SEEDS: GV, MVM and GM.
ACCURACY (surviving entries): GV: SEEDS 94, WINE 62.7.
[Table: akk, d1, d2, d3, d4, V(d) for the hill-climb; surviving row: C3 .97 .15 .09 .14 0.]

Top-level split with 10(F-MN), gp6 (k, r, c are the per-class counts):
Interval     k    r    c    Cluster
[0,9)        0    0    18   C1
[9,18)       1    0    24   C2
[18,29)      10   0    8    C3
[29,38)      18   0    0    C4
[38,49)      13   2    0    C5
[49,60)      7    6    0    C6
[60,71)      1    7    0    C7
[71,80)      0    8    0    C8
[80,92)      0    21   0    C9
[92,102)     0    2    0    Ca
[102,105)    0    4    0    Cb

[Further panels, interleaved in the transcript: GM with 10(F-MN) gp3; C2 with 10(F-MN) gp10; C3 and C6 with 200(F-MN) gp12; C4 with 10(F-MN) gp21; C6 with 10(F-M) g12; each with its own interval and k/r/c count table.]
10
IRIS: GV, MVM and GM.
[Figure: IRIS cluster trees for GV, MVM and GM. Nodes list the functional (e.g. F-MN, (F-MN)*3, 4*F-M, 8F-M, 12*F-M), the gap threshold (gp3, gp5, gp8) and the s/e/i counts per cluster, e.g. 50s 1i C1, 19e 1i C22, 50e 49i C1, 46e 21i C12. The ACCURACY table (IRIS, SEEDS, WINE by GV, MVM, GM) lost its values.]
11
CONCRETE: MVM, GV and GM.
[Figure: CONCRETE cluster trees for MVM, GV and GM. Nodes list the functional (e.g. (F-MN)/4, (F-MN)/5, (F-MN)/8), the gap threshold (g2 to g8) and the L/M/H counts per cluster, e.g. 43L 38M 55H C2, 0L 14M 0H C1, 11L 13M 54H. The ACCURACY table (CONCRETE, IRIS, SEEDS, WINE by GV, MVM, GM) lost its values.]
12
ABALONE: GV, MVM and GM.
[Figure: ABALONE cluster trees for GV, MVM and GM. Nodes list the functional (e.g. 100(F-M), 200*F-M, 300(F-M), 400*F-M, 1500*F-M), the gap threshold (g2, g3) and the L/M/H counts per cluster. The ACCURACY table (CONC, IRIS, SEEDS, WINE, ABAL by GV, MVM, GM) lost its values.]
13
KOSblogs d=e841 (highest STD). d=UnitSTDVec g>6*avg
d = e841 (the word with the highest STD), gaps > 6*avg: AvgGp = .0085. Result: clusters C0..C13 plus many outliers; some of the gaps are substantial.
MVM, gaps > 6*avg: 0.1 = AvgGp, 64 = #gaps, 28.2 = MxGp, .6 = GapThreshold.
GV on the 22 highest-STD KOS words: 24 = MxGp.
[Tables: ROW, KOS Doc#, F = DPPd, GAP, CT per run; values lost.]
Clusters found (gap, count):
gap=.65 Ct=9 C1
gap=.6 Ct=2613 C2
gap=.75 Ct=502 C3
gap=.6 Ct=87 C4
gap=.61 Ct=30 C5
gap=.73 Ct=45 C6
gap=.89 Ct=8 C7
gap=.65 Ct=8 C8
gap=.72 Ct=11 C9
gap=.65 Ct=1 outlier
gap=.61 Ct=12 C11
gap=1.2 Ct=6 C12
gap=1.1 Ct=11 C13
gap=.67 Ct=1 outlier
gap=1.1 Ct=3 C15
gap=1.8 Ct=4 C16
gap=1.8 Ct=5 outlier
Cluster size (d=USTD, MVM, GV): 3 4 5 6 10 42 316 3029
14
GV using a grid (Unitized Corners of Unit Cube + Diagonal of the Variance Matrix + Mean-to-Vector_of_Medians)
[Table: CONC. Columns d1, d2, d3, d4, VAR; one block of hill-climb rows per starting vector: UCUC(1000), UCUC(0100), UCUC(0010), UCUC(0001), UCUC(1100), UCUC(1010), UCUC(1001), UCUC(0110), UCUC(0101), UCUC(0011), UCUC(1110), UCUC(1101), UCUC(1011), UCUC(0111), UCUC(1111), akk, MVM. Values lost.]
On these pages we display the variance hill-climb for each of the four datasets (Concrete, IRIS, Seeds, Wine) for a grid of starting unit vectors, d. I took the circumscribing unit non-negative cube and used all the unitized diagonals. In low dimension (all dimension = 4 here) this grid is very nearly a uniform grid. Note that this will work less and less well as the dimension grows. In all cases, the same local max and nearly the same unit vector are reached.
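The starting grid of unitized unit-cube corners is easy to enumerate. The sketch below builds the 2^n − 1 non-zero corners of the non-negative unit cube and scales each to unit length; the dictionary keyed by the bit pattern mirrors the UCUC(1000) ... UCUC(1111) labels, and the function name is hypothetical.

```python
from itertools import product
from math import sqrt

def unitized_cube_corners(n):
    """Map each non-zero 0/1 corner label (e.g. '1010') to its unit vector."""
    grid = {}
    for bits in product((0, 1), repeat=n):
        ones = sum(bits)
        if ones == 0:
            continue  # the origin has no direction
        label = "".join(str(b) for b in bits)
        grid[label] = [b / sqrt(ones) for b in bits]
    return grid
```

For n = 4 this gives the 15 starting vectors listed on the slide, from the four coordinate vectors up to the main diagonal (0.5, 0.5, 0.5, 0.5).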
15
GV using a grid (Unitized Corners of Unit Cube + Diagonal of the Variance Matrix + Mean-to-Vector_of_Medians), part 2
[Tables: IRIS, SEEDS and WINE. Columns d1, d2, d3, d4, VAR; one block of hill-climb rows per starting vector UCUC(1000) through UCUC(1111), plus akk and MVM. Values lost.]
As we all know, Dr. Ubhaya is the best Mathematician on campus and he is attempting to prove three things:
1. That a GV hill-climb that does not reach the global max Variance is rare indeed.
2. That one is guaranteed to reach the global maximum with at least one of the coordinate unit vectors (so a 90 degree grid will always suffice).
3. That akk will always reach the global max.
16
Finding round clusters that aren't DPPd separable? (no linear gap)
Find the golf ball? Suppose we have a white mask pTree. No linear gaps exist to reveal it. Search a grid of d-tubes until a DPPd gap is found in the interior of the tube. (Form the mask pTree for the interior of the d-tube, then apply DPPd to that mask to reveal interior gaps.) Look for conical gaps (fix the cone point at the middle of the tube) over all cone angles (look for an interval of angles with no points). Notice that this method includes DPPd, since a gap for a cone angle of 90 degrees is linear.
17
FAUST Gap Revealer. Width 2^4, so compute all pTree combinations down to p4 and p4'. d = M−p.
[Figure: the 15 points z1..zf and M on a 16x16 grid; table Z of coordinates (surviving rows: za=(13,4), zb=(10,9), zc=(11,10), zd=(9,11), ze=(11,11), zf=(7,8)); the row F = z∘d; the bit slices p6..p0 and their complements p6'..p0'; and the counts C of the AND combinations down to p4, p4'.]
F = z∘d values: z1=11, z2=27, z3=23, z4=34, z5=53, z6=80, z7=118, z8=114, z9=125, za=114, zb=110, zc=121, zd=109, ze=125, zf=83.
[0,16) has 1 point, z1. This is a 2^4 thinning. z1∘d=11 is only 5 units from the right edge, so z1 is not yet declared an outlier. Next, we check the min distance from the left edge of the next interval to see if z1's right-side gap is actually ≥ 2^4 (the calculation of the min is a pTree process, no x looping required!).
[16,32): The minimum, z3∘d=23, is 7 units from the left edge, 16, so z1 has only a 5+7=12 unit gap on its right (not a 2^4 gap). So z1 is not declared a 2^4 outlier (and is declared a 2^4 inlier).
[32,48): z4∘d=34 is within 2 of 32, so z4 is not declared an anomaly.
[48,64): z5∘d=53 is 19 from z4∘d=34 (> 2^4) but only 11 from 64. But the next interval, [64,80), is empty, so z5 is 27 from its right neighbor. z5 is declared an outlier and we put a subcluster cut through z5.
[64,80): empty. This is clearly a 2^4 gap.
[80,96): z6∘d=80, zf∘d=83.
[96,112): zb∘d=110, zd∘d=109. So both {z6,zf} are declared outliers (gap ≥ 16 on both sides).
[112,128): z7∘d=118, z8∘d=114, z9∘d=125, za∘d=114, zc∘d=121, ze∘d=125. No 2^4 gaps. But we can consult SpS(d2(x,y)) for actual distances, which reveals that there are no 2^4 gaps in this subcluster. And, incidentally, it reveals a 5.8 gap between {7,8,9,a} and {b,c,d,e}, but that analysis is messy and the gap would be revealed by the next x∘fM round on this sub-cluster anyway.
[Table: pairwise distances d(X1,X2) among z7..z13; values lost.]
18
FAUST Tube Clustering: (This method attempts to build tubular-shaped gaps around clusters)
Allows for a better fit around convex clusters that are elongated in one direction (not round). Gaps in dot product lengths [projections] on the line.
Exhaustive search for all tubular gaps: it takes two parameters for a pseudo-exhaustive search (exhaustive modulo a grid width).
1. A StartPoint, p (an n-vector, so n-dimensional).
2. A UnitVector, d (an n-direction, so (n−1)-dimensional: a grid on the surface of the sphere in R^n).
Then for every choice of (p,d) (e.g., in a grid of points in R^(2n−1)) two functionals are used to enclose subclusters in tubular gaps:
a. SquareTubeRadius functional, STR(y) = (y−p)∘(y−p) − ((y−p)∘d)²
b. TubeLength functional, TL(y) = (y−p)∘d
[Figure: a tube around the line from p to q enclosing a point y, with the tube radius, the tube cap gap width and the gap width marked.]
Given a p, do we need a full grid of ds (directions)? No! d and −d give the same TL-gaps. Given d, do we need a full grid of p starting points? No! All p' s.t. p' = p + cd give the same gaps. Hill-climb the gap width from a good starting point and direction.
MATH: We need the dot product projection length and the dot product projection distance.

\[ \text{dot product projection length of } y \text{ on } f:\quad y\circ\frac{f}{|f|} \]
\[ \Big|\,y - \frac{y\circ f}{f\circ f}\,f\,\Big|^2 = y\circ y - 2\frac{(y\circ f)^2}{f\circ f} + \frac{(y\circ f)^2}{(f\circ f)^2}\,f\circ f = y\circ y - \frac{(y\circ f)^2}{f\circ f} \]
\[ \text{squared } (y-p) \text{ on } (q-p) \text{ projection distance} = (y-p)\circ(y-p) - \frac{\big((y-p)\circ(q-p)\big)^2}{(q-p)\circ(q-p)} = y\circ y - 2\,y\circ p + p\circ p - \Big(y\circ\frac{q-p}{|q-p|} - p\circ\frac{q-p}{|q-p|}\Big)^2 \]

For the dot product length projections (caps) we already needed:

\[ (y-p)\circ\frac{M-p}{|M-p|} = y\circ\frac{M-p}{|M-p|} - p\circ\frac{M-p}{|M-p|} \]

That is, we need to compute the constants and the dot product functionals (color-coded green, blue and red on the slide) in an optimal way, and then do the PTreeSet additions/subtractions/multiplications. What is optimal? Minimizing PTreeSet functional creations and PTreeSet operations.
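The two tube functionals translate directly into code. This sketch evaluates STR and TL for a single point y with plain tuples (the slides would evaluate them over a whole PTreeSet at once), and assumes d is already a unit vector.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def tube_functionals(y, p, d):
    """Return (STR(y), TL(y)) for start point p and unit direction d.

    TL(y)  = (y-p) o d                    length along the tube axis
    STR(y) = (y-p) o (y-p) - TL(y)^2      squared distance from the axis
    """
    w = [a - b for a, b in zip(y, p)]
    tl = dot(w, d)
    return dot(w, w) - tl * tl, tl
```

For y = (3, 4), p = (0, 0), d = (1, 0) this gives STR = 16 and TL = 3. Moving the start point along the axis to p' = (2, 0) leaves STR at 16 and only shifts TL to 1, which is why a full grid of start points is not needed for a given d.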
19
Cone Clustering: (finding cone-shaped clusters)
Cosine cone gap (over some angle): F = (y−M)∘(x−M)/|x−M| − mn, restricted to a cosine cone, on IRIS.
Corner points: gap in dot product projections onto the corner-points line.
[Histogram panels of (F value, count) pairs, interleaved with labels of individual points such as i39 or e21. Panel titles, point totals and surviving summaries:
  x=s1 cone=1/√2 (50 points)
  x=s2 cone=1/√2 (51 points)
  x=s2 cone=.9 (47 points)
  x=s2 cone=.1 (59 points)
  x=e1 cone=.707 (60 points)
  x=i1 cone=.707 (75 points)
  w maxs cone=.707 (137 points)
  w maxs cone=.93: 27/29 are i's
  w maxs cone=.925: 31/34 are i's
  w maxs-to-mins cone=.939 (114 points): 14 i and 100 s/e, so picks i as 0
  w naaa-xaaa cone=.95: 41/43 e, so picks e
  w aaan-aaax cone=.54: 100/104 s or e, so 0 picks i
  w xnnn-nxxx cone=.95: 43/50 e, so picks out e]
Cosine conical gapping seems quick and easy: cosine = dot product divided by both lengths. The length of the fixed vector, x−M, is a one-time calculation. The length of y−M changes with y, so build the PTreeSet.
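The cone-restricted functional can be sketched as follows: keep only the points y whose cosine with the fixed axis x−M is at least the cone threshold, and report their projection lengths shifted by the minimum. The function name and the four sample points in the usage note are illustrative, not IRIS data.

```python
from math import sqrt

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cone_functional(Y, M, x, cone):
    """F(y) = (y-M) o (x-M) / |x-M| - mn for the points y inside the cosine cone.

    Returns {index of y in Y: F(y)}; points with cosine < cone are left out.
    """
    axis = [a - m for a, m in zip(x, M)]
    axis_len = sqrt(dot(axis, axis))       # one-time calculation
    lengths = {}
    for idx, y in enumerate(Y):
        w = [a - m for a, m in zip(y, M)]
        w_len = sqrt(dot(w, w))            # changes with y
        if w_len == 0:
            continue                       # y == M has no direction
        proj = dot(w, axis) / axis_len
        if proj / w_len >= cone:           # cosine test
            lengths[idx] = proj
    mn = min(lengths.values(), default=0.0)
    return {idx: v - mn for idx, v in lengths.items()}
```

With M = (0, 0), x = (1, 0) and cone = .707, the points (2, 0) and (1, 1) fall inside the cone while (0, 3) and (−1, 0) do not.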
20
"Gap Hill Climbing": mathematical analysis
Rotate d toward a higher F-STD, or grow one gap using support pairs: F-slices are hyperplanes (assuming F = dot∘d), so it would make sense to try to "re-orient" d so that the gap grows. Instead of taking the "improved" p and q to be the means of the entire n-dimensional half-spaces cut by the gap (or thinning), take p and q to be the means of the F-slice (n−1)-dimensional hyperplanes defining the gap or thinning. This is easy since our method produces the pTree mask, the sequence of F-values, and the sequence of counts of points that give us those values, which we use to find large gaps in the first place.

Dot F, p=aaan, q=aaax (F:count):
0:6 1:28 2:7 3:7 4:1 5:1 9:7 10:3 11:5 12:13 13:8 14:12 15:4 16:2 17:12 18:5 19:6 20:6 21:3 22:8 23:3 24:3
C1 < 7 (50 Set), 7 < C2 < 16 (4i, 48e), C3 > 16 (46i, 2e).
[Figure: grid plot of the points with the d1 direction and the d1-gap, then the d2 direction and the d2-gap.]
d2-gap >> d1-gap (still not optimal). Weight the mean by the distance from the gap (d-barrel radius)? In this example it seems to make for a larger gap, but what weightings should be used (e.g., 1/radius²)? (Zero weighting after the first gap is identical to the previous.) Also, we really want to identify the support vector pair of the gap (the pair, one from one side and the other from the other side, which are closest together) as p and q (in this case, 9 and a, but we were just lucky to draw our vector through them). We could check the d-barrel radius of just these gap slice pairs and select the closest pair as p and q?

d1-gap hill-climb of the gap at 16 with half-space averages. C2 ∪ C3, p = avg<16, q = avg>16 (F:count):
0:1 1:1 2:2 3:1 7:2 9:2 10:2 11:3 12:3 13:2 14:5 15:1 16:3 17:3 18:2 19:2 20:4 21:5 22:2 23:5 24:9 25:1 26:1 27:3 28:2 29:1 30:3 31:5 32:2 33:3 34:3 35:1 36:2 37:4 38:1 39:1 42:2 44:1 45:2 47:2
No conclusive gaps. Sparse lo end: check [0,9] (i39 e49 e8 e44 e11 e32 e30 e15 e31; distance table lost). i39, e49, e11 are singleton outliers; {e8, i44} is a doubleton outlier set.
There is a thinning at 22 and it is the same one, but it is not more prominent.

Next we attempt to hill-climb the gap at 16 using the mean of the half-space boundary (i.e., p is avg=14; q is avg=17). C123, p avg=14, q avg=17 (F:count):
0:1 2:3 3:2 4:4 5:7 6:4 7:8 8:2 9:11 10:4 12:3 13:1 20:1 21:1 22:2 23:1 27:2 28:1 29:1 30:2 31:4 32:2 33:3 34:4 35:1 36:3 37:4 38:2 39:2 40:5 41:3 42:3 43:6 44:8 45:1 46:2 47:1 48:3 49:3 51:7 52:2 53:2 54:3 55:1 56:3 57:3 58:1 61:2 63:2 64:1 66:1 67:1
Sparse hi end: check [38,47] distances (i31 i8 i36 i10 i6 i23 i32 i18 i19; distance table lost). i10, i18, i19, i32, i36 are singleton outliers; {i6, i23} is a doubleton outlier.
Here, the gap between C1 and C2 is more pronounced. Why? Is the thinning between C2 and C3 more obscure? It did not grow the gap we wanted to grow (between C2 and C3).
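One re-orientation step of this kind can be sketched as: find the widest gap in F = x∘d, take p and q as the means of the points on the two F-slices bounding that gap, and use the unitized q − p as the new d. This is a toy version on plain tuples with exact F-value matching; the function names and the four sample points in the usage note are hypothetical.

```python
from math import sqrt

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def widest_gap(F):
    """Return (lo, hi), the consecutive distinct F values that are furthest apart.
    Needs at least two distinct values."""
    values = sorted(set(F))
    return max(zip(values, values[1:]), key=lambda pair: pair[1] - pair[0])

def slice_mean(X, F, value):
    rows = [x for x, f in zip(X, F) if f == value]
    return [sum(col) / len(rows) for col in zip(*rows)]

def gap_width(X, d):
    lo, hi = widest_gap([dot(x, d) for x in X])
    return hi - lo

def reorient(X, d):
    """New unit d pointing from the mean of the lower gap slice to the mean of the upper one."""
    F = [dot(x, d) for x in X]
    lo, hi = widest_gap(F)
    p, q = slice_mean(X, F, lo), slice_mean(X, F, hi)
    D = [b - a for a, b in zip(p, q)]
    norm = sqrt(dot(D, D))
    return [c / norm for c in D]
```

On X = [(0, 0), (0, 1), (3, 2), (3, 3)] with d = (1, 0), the gap is 3 wide; after one `reorient` step the new d is proportional to (3, 2) and the gap grows slightly. As the slide notes, growth is not guaranteed in general.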
21
CAINE 2013 Call for Papers
International Conference on Computer Applications in Industry and Engineering, September 25-27, 2013, Omni Hotel, Los Angeles, California, USA. Sponsored by the International Society for Computers and Their Applications (ISCA).
CAINE-2013 will feature contributed papers as well as workshops and special sessions. Papers will be accepted into oral presentation sessions. The topics will include, but are not limited to, the following areas: Agent-Based Systems; Image/Signal Processing; Autonomous Systems; Information Assurance; Big Data Analytics; Information Systems/Databases; Bioinformatics, Biomedical Systems/Engineering; Internet and Web-Based Systems; Computer-Aided Design/Manufacturing; Knowledge-based Systems; Computer Architecture/VLSI; Mobile Computing; Computer Graphics and Animation; Multimedia Applications; Computer Modeling/Simulation; Neural Networks; Computer Security; Pattern Recognition/Computer Vision; Computers in Education; Rough Set and Fuzzy Logic; Computers in Healthcare; Robotics; Computer Networks; Fuzzy Logic; Control Systems; Sensor Networks; Data Communication; Scientific Computing; Data Mining; Software Engineering/CASE; Distributed Systems; Visualization; Embedded Systems; Wireless Networks and Communication.
Important Dates: Workshop/special session proposal: May 2.5. Full Paper Submission: June 5. Notice of Acceptance: July 5, 2013. Pre-registration & Camera-Ready Paper Due: August 5. Event Dates: Sept 25-27, 2013.

SEDE 2013
The SEDE Conference is interested in gathering researchers and professionals in the domains of SE and DE to present and discuss high-quality research results and outcomes in their fields. SEDE 2013 aims at facilitating cross-fertilization of ideas in Software and Data Engineering. The conference topics include, but are not limited to: Requirements Engineering for Data Intensive Software Systems. Software Verification and Model Checking. Model-Based Methodologies. Software Quality and Software Metrics. Architecture and Design of Data Intensive Software Systems. Software Testing. Service- and Aspect-Oriented Techniques. Adaptive Software Systems. Information System Development. Software and Data Visualization. Development Tools for Data Intensive Software Systems. Software Processes. Software Project Management. Applications and Case Studies. Engineering Distributed, Parallel, and Peer-to-Peer Databases. Cloud Infrastructure, Mobile, Distributed, and Peer-to-Peer Data Management. Semi-Structured Data and XML Databases. Data Integration, Interoperability, and Metadata. Data Mining: Traditional, Large-Scale, and Parallel. Ubiquitous Data Management and Mobile Databases. Data Privacy and Security. Scientific and Biological Databases and Bioinformatics. Social Networks, Web, and Personal Information Management. Data Grids, Data Warehousing, OLAP. Temporal, Spatial, Sensor, and Multimedia Databases. Taxonomy and Categorization. Pattern Recognition, Clustering, and Classification. Knowledge Management and Ontologies. Query Processing and Optimization. Database Applications and Experiences. Web Data Management and Deep Web.
May 23, 2013: Paper Submission Deadline. June 30, 2013: Notification of Acceptance. July 20, 2013: Registration and Camera-Ready Manuscript. Conference Website:

ACC-2013
ACC-2013 provides an international forum for presentation and discussion of research on a variety of aspects of advanced computing and its applications, and communication and networking systems.
Important Dates: May 5: Special Sessions Proposal. June 5: Full Paper Submission. July 5: Author Notification. Aug. 5: Advance Registration & Camera-Ready Paper Due.

CBR-MD: International Workshop on Case-Based Reasoning, July 19, 2013, New York, USA
Topics of interest include (but are not limited to): CBR for signals, images, video, audio and text; Similarity assessment; Case representation and case mining; Retrieval and indexing; Conversational CBR; Meta-learning for model improvement and parameter setting for processing with CBR; Incremental model improvement by CBR; Case base maintenance for systems; Case authoring; Life-time of a CBR system; Measuring coverage of case bases; Ontology learning with CBR.
Submission Deadline: March 20th. Notification Date: April 30th. Camera-Ready Deadline: May 12th, 2013.

DMLS: Workshop on Data Mining in Life Sciences
Topics: Discovery of high-level structures, incl. e.g. association networks; Text mining from biomedical literature; Medical images mining; Biomedical signals mining; Temporal and sequential data mining; Mining heterogeneous data; Mining data from molecular biology, genomics, proteomics, phylogenetic classification.
With regard to different methodologies and case studies: Data mining project development methodology for biomedicine; Integration of data mining in the clinic; Ontology-driven data mining in life sciences; Methodology for mining complex data, e.g. a combination of laboratory test results, images, signals, genomic and proteomic samples; Data mining for personal disease management; Utility considerations in DMLS, including e.g. cost-sensitive learning.
Submission Deadline: March 20th. Notification Date: April 30th. Camera-Ready Deadline: May 12th. Workshop date: July 19th, 2013.

DMM: Workshop on Data Mining in Marketing
In the business environment, data warehousing (the practice of creating huge, central stores of customer data that can be used throughout the enterprise) is becoming more and more common practice and, as a consequence, the importance of data mining is growing stronger.
Topics: Applications in Marketing; Methods for User Profiling; Mining Insurance Data; E-Marketing with Data Mining; Logfile Analysis; Churn Management; Association Rules for Marketing Applications; Online Targeting and Controlling; Behavioral Targeting; Juridical Conditions of E-Marketing, Online Targeting and so on; Control of Online-Marketing Activities; New Trends in Online Marketing; Aspects of ing Activities and Newsletter Mailing.
Submission Deadline: March 20th. Notification Date: April 30th. Camera-Ready Deadline: May 12th. Workshop date: July 19th, 2013.

DMA: Workshop on Data Mining in Agriculture
Topics: Data Mining on Sensor and Spatial Data from Agricultural Applications; Analysis of Remote Sensor Data; Feature Selection on Agricultural Data; Evaluation of Data Mining Experiments; Spatial Autocorrelation in Agricultural Data.
Submission Deadline: March 20th. Notification Date: April 30th. Camera-Ready Deadline: May 12th. Workshop date: July 19th, 2013.
22
Hierarchical Clustering
Hierarchical Clustering. Any maximal anti-chain (a maximal set of nodes such that no two are directly connected) is a clustering (the dendrogram offers many). But horizontal anti-chains are the clusterings produced by top-down (or bottom-up) methods.
[Figure: dendrogram with top nodes ABC and DEFG; below them A, BC, DE, FG; leaves B, C, D, E, F, G.]
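To make the anti-chain idea concrete, here is a minimal Python sketch that enumerates every maximal anti-chain of the small dendrogram on this slide. It assumes "anti-chain" means that no chosen node is an ancestor of another, and the root name `ABCDEFG` and the helper `anti_chains` are hypothetical names introduced only for illustration.

```python
from itertools import product

# Dendrogram read off the slide labels; the root name "ABCDEFG" is assumed.
TREE = {
    "ABCDEFG": ["ABC", "DEFG"],
    "ABC": ["A", "BC"],
    "BC": ["B", "C"],
    "DEFG": ["DE", "FG"],
    "DE": ["D", "E"],
    "FG": ["F", "G"],
}


def anti_chains(node, tree=TREE):
    """Yield every maximal anti-chain of the subtree rooted at `node`.

    Either the node itself is the whole anti-chain, or we combine one
    maximal anti-chain from each child subtree.
    """
    yield (node,)
    children = tree.get(node, [])
    if children:
        per_child = [list(anti_chains(child, tree)) for child in children]
        for combo in product(*per_child):
            yield tuple(name for part in combo for name in part)


clusterings = list(anti_chains("ABCDEFG"))
for clustering in clusterings:
    print(clustering)
```

Each printed tuple partitions the seven leaves, so each is a clustering in the sense of the slide; the horizontal cuts such as `('ABC', 'DEFG')` are only a few of them.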
23
CONCRETE GV F=(DPP-MN)/4 Concrete(C, W, FA, A) Accuracy=90%
First round: F=(DPP-MN)/4 on Concrete(C, W, FA, A). F-value histogram (F:count):
0:1 1:1 5:1 6:1 7:1 8:4 9:1 10:1 11:2 12:1 13:5 14:1 15:3 16:3 17:4 18:1 19:3 20:9 21:4 22:3 23:7 24:2 25:4 26:8 27:7 28:7 29:10 30:3 31:1 32:3 33:6 34:4 35:5 37:2 38:2 40:1 42:3 43:1 44:1 45:1 46:4 49:1 56:1 58:1 61:1 65:1 66:1 69:1 71:1 77:1 80:1 83:1 86:1

First-round clusters, cut at the F gaps:
[0,90) 43L 46M 55H
gap=14: [90,113) 0L 6M 0H CLUS_1
gap=6: [74,90) 0L 4M 0H CLUS_2
gap=7: [52,74) 0L 7M 0H CLUS_3
CLUS 4
At this level, FinalClus1={17M}, 0 errors.

Second round: CLUS 4 with F=(DPP-MN)/2, Fgap≥2. F-value histogram (F:count):
0:3 7:4 9:1 10:12 11:8 12:7 15:4 18:10 21:3 22:7 23:2 25:2 26:3 27:1 28:2 29:1 31:3 32:1 34:2 40:4 47:3 52:1 53:3 54:3 55:4 56:2 57:3 58:1 60:2 61:2 62:4 64:4 67:2 68:1 71:7 72:3 79:5 85:1 87:2

Subclusters of CLUS 4 in increasing F order ("?" marks a value lost from the slide; the subcluster labels after "CLUS" and most gap values were also lost):
?: ?L 0M 3H, Median=0, Avg=0
?: ?L 0M 4H, Median=7, Avg=7
[8,14]: 1L 5M 22H, Median=11, Avg=10.7 (L+5M err in H); gap=3
?: ?L 0M 4H, Median=15, Avg=15
?: ?L 0M 10H, Median=18, Avg=18
[20,24): 0L 10M 2H, Median=22, Avg=22 (2H errs in L)
[24,30): ?L 0M 0H, Median=26, Avg=26; gap=2
[30,33]: 0L 4M 0H, Median=31, Avg=32.3
?: ?L 2M 0H, Median=34, Avg=34
?: ?L 4M 0H, Median=40, Avg=40
?: ?L 3M 0H, Median=47, Avg=47
[50,59): ?L 1M 4H, Median=55, Avg=55 (1M+4H errs in L)
[59,63): ?L 0M 0H, Median=61.5, Avg=61.3; gap=2
?: ?L 0M 2H, Median=64, Avg=? (H errs in L)
[66,70): 10L 0M 0H, Median=67, Avg=67.3; gap=3
[70,79): 10L 0M 0H, Median=71, Avg=71.7; gap=7
?: ?L 0M 0H, Median=79, Avg=79; gap=6
[74,90): 2L 0M 1H, Median=87, Avg=86.3 (M err in L)
Accuracy=90%
Dendrogram node medians shown on the figure (positions not recoverable): 10, 14, 9, 18, 17, 21, 23, 40, 34, 33, 56, 61, 57, 62, 71, 71, 86.

Suppose we know (or want) 3 strength clusters: Low, Medium and High. We can use an anti-chain that gives exactly 3 subclusters in two ways, one shown in brown and the other in purple. Which would we choose? The brown seems to give slightly more uniform subcluster sizes. Brown error count: Low (bottom) 11, Medium (middle) 0, High (top) 26, so 96/133=72% accurate. Purple error count: Low 2, Medium 22, High 35, so 74/133=56% accurate.

What about agglomerating using single-link agglomeration (minimum pairwise distance)? Agglomerate (build the dendrogram) by iteratively gluing together the clusters with minimum median separation. Should I have normalized the rounds? Should I have used the same F divisor and made sure the range of values was the same in the 2nd round as it was in the 1st round (on CLUS 4)? Can I normalize after the fact by multiplying 1st round values by 100/88≈1.14 (the slide shows 1.76)? Agglomerate the 1st round clusters and then independently agglomerate the 2nd round clusters?
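The gap-based splitting used on this slide can be sketched in a few lines of Python. The histogram below is the CLUS 4 (F value, count) list as it appears on the slide; the function names `split_on_gaps` and `size`, and the reading of "Fgap2" as "split wherever consecutive occupied F values differ by at least 2", are assumptions made for illustration, so the output need not match the slide's subclusters exactly.

```python
# (F value: count) pairs for CLUS 4 as listed on the slide.
CLUS4_HIST = {
    0: 3, 7: 4, 9: 1, 10: 12, 11: 8, 12: 7, 15: 4, 18: 10, 21: 3, 22: 7,
    23: 2, 25: 2, 26: 3, 27: 1, 28: 2, 29: 1, 31: 3, 32: 1, 34: 2, 40: 4,
    47: 3, 52: 1, 53: 3, 54: 3, 55: 4, 56: 2, 57: 3, 58: 1, 60: 2, 61: 2,
    62: 4, 64: 4, 67: 2, 68: 1, 71: 7, 72: 3, 79: 5, 85: 1, 87: 2,
}


def split_on_gaps(hist, min_gap):
    """Group occupied F values, starting a new group at every gap >= min_gap."""
    values = sorted(hist)
    clusters = [[values[0]]]
    for prev, cur in zip(values, values[1:]):
        if cur - prev >= min_gap:
            clusters.append([cur])      # gap is wide enough: start a new cluster
        else:
            clusters[-1].append(cur)    # otherwise extend the current cluster
    return clusters


def size(cluster, hist):
    """Number of points (not distinct F values) in a cluster."""
    return sum(hist[v] for v in cluster)


subclusters = split_on_gaps(CLUS4_HIST, min_gap=2)
for c in subclusters:
    print(f"F in [{c[0]}, {c[-1]}]: {size(c, CLUS4_HIST)} points")
```

Raising `min_gap` coarsens the clustering, which is one simple way to move up the dendrogram without recomputing F.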
24
GV: Agglomerating using single link (min pairwise distance = min gap size; glue min-gap adjacent clusters first).
[Figure: the CLUS 4 histogram (F=(DPP-MN)/2, Fgap≥2) and subcluster list, repeated from the previous slide. Accuracy=90%.]

The first thing we notice is that outliers mess up agglomerations that are supervised by knowledge of the expected number of subclusters. Therefore we might remove outliers by backing away from all gap≥5 agglomerations and then looking for 3-subcluster maximal anti-chains. What we have done is to declare F<7 and F>84 extreme tripleton outlier sets, and F=79, F=40 and F=47 singleton outlier sets, because they are F-gapped by at least 5 (which is actually 10, since F=(DPP-MN)/2) on either side.

The brown gives more uniform sizes. Brown errors: Low (bottom) 8, Medium (middle) 12 and High (top) 6, so 107/133=80% accurate. The one decision to agglomerate C4.7.1 with C4.7.2 (gap=3) instead of C4.3.2 with C4.7.2 (gap=3) causes lots of error.

C4.7.1 and C4.7.2 are problematic since they separate out, but in increasing F order the pattern is H M L M L, so if we suspected this pattern we would look for 5 subclusters. The 5 orange error counts in increasing F order are 6, 2, 0, 0, 8, so 127/133=95% accurate (as stated on the slide; the five counts listed sum to 16, which would instead give 117/133=88%). If you have ever studied concrete, you know it is a very complex material. The fact that it clusters out with an F-order pattern of HMLML is just bizarre! So we should expect errors.
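The effect of outliers on single-link agglomeration can be illustrated with a short Python sketch on the distinct F values of CLUS 4. The function `single_link_1d` and its leftmost tie-breaking rule are assumptions made for this example (the slide does not say how gap ties are resolved), and since the per-point class labels are not available here, the sketch shows only the cluster boundaries, not the accuracies quoted above.

```python
def single_link_1d(values, k):
    """Merge adjacent 1-D clusters, smallest gap first, until k remain.

    In one dimension the single-link distance between neighbouring clusters
    is the gap between them. Ties go to the leftmost smallest gap, which is
    an arbitrary choice made for this sketch.
    """
    clusters = [[v] for v in sorted(set(values))]
    while len(clusters) > k:
        gaps = [right[0] - left[-1] for left, right in zip(clusters, clusters[1:])]
        i = gaps.index(min(gaps))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters


# Distinct F values of CLUS 4 as listed on the slide.
F_VALUES = [0, 7, 9, 10, 11, 12, 15, 18, 21, 22, 23, 25, 26, 27, 28, 29, 31,
            32, 34, 40, 47, 52, 53, 54, 55, 56, 57, 58, 60, 61, 62, 64, 67,
            68, 71, 72, 79, 85, 87]

# Outlier sets named on the slide: F<7, F>84, and F=40, 47, 79.
OUTLIERS = {0, 40, 47, 79, 85, 87}

with_outliers = single_link_1d(F_VALUES, k=3)
without_outliers = single_link_1d([f for f in F_VALUES if f not in OUTLIERS], k=3)

for name, result in [("with outliers", with_outliers),
                     ("outliers removed", without_outliers)]:
    print(name, [(c[0], c[-1]) for c in result])
```

With the outliers present, the widest gaps sit next to the isolated values, so they dictate where the 3 clusters are cut. With them removed, one cut falls at the wide gap between F=34 and F=52, and the other is decided among tied gaps of 3, which is where a single agglomeration choice can change the error count substantially.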