Where are we at? We are working on perfecting FAUST Clustering (using distance-dominated functional gap analysis). Our primary choice of functional is the dot product with a unit vector, d (called DPPd(x)). But what d should one pick? One can always sequence through an entire grid of unit vectors until one is found that produces great gaps (sequence through a gridding of the unit (n-1)-sphere), but that is time expensive. Is there a way of picking a "great" starting d? That is what we are working on right now. First, though, note that the techniques we develop here will be useful for FAUST Classification too (where we assume a gap or boundary exists between the training classes, usually halfway between the projections onto d of the two class means). A natural starting point for "picking a good d" is to try to find a d that maximizes the variance of F(X). Why? Because if there is low dispersion, there cannot be lots of large gaps. But just because there is high dispersion does not mean there IS a large gap (a simple example follows). The best starting d would be one that maximizes the maximum consecutive difference within the [sorted] array F(X). A candidate "good" heuristic is to find the d that maximizes |Mean(F(X)) - Median(F(X))|, but the median (so far) seems difficult to express in terms of d. We can estimate it with F(VectorOfMedians) = F(VOM), which we can calculate.
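For concreteness, here is a minimal sketch (plain NumPy, not the pTree implementation) of the basic step: compute F = DPPd(X) for a unit vector d and locate the largest consecutive difference in the sorted F-values. The array shapes and the example d are illustrative assumptions, not part of the FAUST code.

import numpy as np

def dpp(X, d):
    # Dot Product Projection: F(x) = x o d for a unit vector d
    d = d / np.linalg.norm(d)
    return X @ d

def largest_gap(F):
    # Largest consecutive difference in the sorted F-value array,
    # returned as (gap_size, cut value at the midpoint of the gap)
    Fs = np.sort(F)
    diffs = np.diff(Fs)
    i = int(np.argmax(diffs))
    return diffs[i], (Fs[i] + Fs[i + 1]) / 2

# Illustrative use on random data with an assumed direction d = e3
X = np.random.rand(150, 4)
gap, cut = largest_gap(dpp(X, np.array([0.0, 0.0, 1.0, 0.0])))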
CONCRETE: Maximizing the Variance of F (over unit vectors, d)
Set-up: X is the N x n data matrix with rows x_i = (x_{i,1}, ..., x_{i,n}); X o d = DPPd(X) is the column of projections x_i o d = Σ_{j=1..n} x_{i,j} d_j. Write mean(.) for a column average.

V(d) ≡ Var(DPPd(X)) = mean((X o d)²) - (mean(X o d))²
= (1/N) Σ_{i=1..N} (Σ_{j=1..n} x_{i,j} d_j)² - (Σ_{j=1..n} mean(X_j) d_j)²
= (1/N) Σ_i (Σ_j x_{i,j}² d_j² + 2 Σ_{j<k} x_{i,j} x_{i,k} d_j d_k) - (Σ_j mean(X_j)² d_j² + 2 Σ_{j<k} mean(X_j) mean(X_k) d_j d_k)
= Σ_{j=1..n} (mean(X_j²) - mean(X_j)²) d_j² + 2 Σ_{j<k} (mean(X_j X_k) - mean(X_j) mean(X_k)) d_j d_k
= d^T o V_X o d, where V_X is the n x n covariance matrix with entries a_{j,k} = mean(X_j X_k) - mean(X_j) mean(X_k).

So V(d) = Σ_{j,k} a_{j,k} d_j d_k, and the gradient components are ∂V/∂d_i = 2 a_{i,i} d_i + 2 Σ_{j≠i} a_{i,j} d_j, i.e., ∇V(d) = 2 V_X o d.

Hill-climbing recipe: choose d_0 = e_k for the k such that a_{k,k} is maximal. Then choose d_{m+1} ≡ ∇V(d_m), renormalized to a unit vector, and repeat until V(d_m) stops changing.

On Concrete4150(C, W, FA, Ag): d = hill-climbed unit gradient (starting with (0,0,0,1) and arriving at (.34, .13, .84, .4)), which [locally] maximizes the variance subject to Σ_{i=1..n} d_i² = 1. [Slide figure: the resulting F-value/count array with gap threshold 5.] In this F-value sequence there are almost no substantial gaps, and if I test the classification for any partition of subintervals, none are near pure! Possibly this lends credence to the hypothesis that "we should not maximize variance!"
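The hill-climbing recipe above (d_{m+1} proportional to ∇V(d_m) = 2 V_X o d_m, renormalized each step) amounts to power iteration on the covariance matrix. A minimal sketch in NumPy, not the pTree implementation; the data matrix X, the tolerance, and the iteration cap are illustrative assumptions.

import numpy as np

def hill_climb_variance_d(X, tol=1e-9, max_iter=100):
    # V(d) = d^T VX d, where VX is the covariance matrix of X
    VX = np.cov(X, rowvar=False)
    # d0 = e_k for the k with maximal variance (diagonal of VX)
    d = np.zeros(X.shape[1])
    d[int(np.argmax(np.diag(VX)))] = 1.0
    for _ in range(max_iter):
        g = 2 * VX @ d                      # gradient of d^T VX d
        d_new = g / np.linalg.norm(g)       # renormalize to a unit vector
        if np.linalg.norm(d_new - d) < tol:
            break
        d = d_new
    return d    # (locally) variance-maximizing unit vector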
Should we maximize variance?
[Slide table: median, std, variance, mean, consecutive differences (avgCD, maxCD) and |mean - median| for several example F-value sequences.] Should we maximize variance? MEAN-MEDIAN picks out the last two sequences, which have the best gaps (discounting outlier gaps at the extremes). Sooo... Finding a good unit vector, d, for the Dot Product functional, DPP, to maximize gaps: maximize, with respect to d and subject to Σ_{i=1..n} d_i² = 1, |Mean(DPPd(X)) - Median(DPPd(X))|. The mean is easy: Mean(DPPd(X)) = (1/N) Σ_{i=1..N} Σ_{j=1..n} x_{i,j} d_j = Σ_{j=1..n} ((1/N) Σ_{i=1..N} x_{i,j}) d_j = Σ_{j=1..n} mean(X_j) d_j. But how do we compute Median(DPPd(X))? We want to use only pTree processing, and we want to end up with a formula involving d and numbers only (like the one above for the mean, which involves only the vector d and the numbers mean(X_1), ..., mean(X_n)). A heuristic is to substitute the Vector of Medians (VOM) for Median(DPPd(X)).
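A minimal sketch of this heuristic, assuming a plain NumPy data matrix rather than pTrees: score a candidate d by |Mean(DPPd(X)) - DPPd(VOM)|, and search over a crude random set of unit vectors (the random sampling is a stand-in for gridding the unit (n-1)-sphere; the function names are illustrative).

import numpy as np

def mean_minus_vom_score(X, d):
    # |Mean(DPP_d(X)) - DPP_d(VOM)|, using the Vector of Medians
    # as a stand-in for Median(DPP_d(X))
    d = d / np.linalg.norm(d)
    col_means = X.mean(axis=0)
    vom = np.median(X, axis=0)
    return abs(col_means @ d - vom @ d)

def best_d_by_search(X, n_candidates=1000, seed=0):
    # Crude search over random unit vectors; return the d with the largest score
    rng = np.random.default_rng(seed)
    cands = rng.normal(size=(n_candidates, X.shape[1]))
    cands /= np.linalg.norm(cands, axis=1, keepdims=True)
    scores = [mean_minus_vom_score(X, d) for d in cands]
    return cands[int(np.argmax(scores))]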
IRIS: Choosing d = e_k for the k with max std (here d = e3 = e_PL) on IRIS150
0 1 5 1 7 2 13 2 16 3 17 1 19 1 20 2 21 1 22 1 23 2 24 5 25 1 26 2 27 1 28 2 29 1 30 2 31 1 32 2 33 1 34 2 35 4 36 3 37 2 38 1 39 2 40 1 41 2 42 1 43 1 44 1 45 4 46 3 47 4 48 3 49 2 50 1 51 2 53 3 54 3 56 3 57 2 58 4 59 1 60 1 61 1 63 1 65 1 66 1 68 1 70 2 76 1 VOM mean on IRIS150(SL,SW,PL,PW) and spread out F values using G=(F-minF)*2 (since F-minF ranges over [0,60]). Choosing d=ek to be the k w max std (Here d=e3 = ePL on IRIS150 10 1 11 1 12 2 13 7 14 12 15 14 16 7 17 4 18 1 19 2 30 1 33 2 35 2 36 1 37 1 38 1 39 3 40 5 41 3 42 4 43 2 44 4 45 8 46 3 47 5 48 3 49 5 50 4 51 8 52 2 53 2 54 2 55 3 56 6 57 3 58 3 59 2 60 2 61 3 63 1 64 1 66 1 67 2 69 1 Splitting at the MaxGap=24 first, [76,100] _________[0.25) setosa virginica CLUS1 CLUS2 Split at next hi MaxGap=6, [7,13], [70,76] (outliers) Split at thinning [39,44] _________[0.39) vericolor virginica CLUS_1.1 5 versicolor virginica CLUS_ [45,76) versicolor virginica CLUS_1.3 _________[25,49) versicolor virginica CLUS2.1 CLUS2.2 _________[49,54) versicolor virginica CLUS2.2.1 CLUS2.2.2 _________[54,70) versicolor virginica CLUS2.2 F[49,69] 47_Virginica with 4_Versicolor errors; CLUS2.1 F[25,48] 46_Versicolor with 2_Virginica errors; CLUS1 F[0, 25] 50_Setosa with 1 Virginica error. So the classification accuracy is 143/150 or 95.3% _________[0.88) versicolor virginica CLUS_1 50 setosa virginica CLUS_2 d=VOMMN DPP on IRIS_150_SEI_(SL,SW,PL,PW). CLUS1.1 F[0,39] 44_Virginica with 2_Versicolor errors; CLUS1.3 F[45,76] 42_Versicolor with 1_Virginica error; CLUS2 F[80,120] 50_Setosa with 2_Virginica errors; and CLUS1.2 F[39,44], if classified as 5_Versicolor has 3_Virginica errors. So the classification accuracy is 142/150 or 94.6% IRIS
WINE VOMmean on WINE150hl(FA,FS,TS,AL) STDs=(1.9,9,23,1.2)
7 1 8 4 9 4 10 5 11 4 12 7 13 7 14 8 15 2 16 5 17 4 18 5 19 7 20 3 21 3 23 2 24 9 25 3 26 1 27 4 28 4 29 5 30 1 31 2 32 1 34 2 35 1 36 1 37 1 39 1 41 1 42 3 43 1 44 2 45 1 47 6 48 4 49 2 50 1 51 1 53 1 55 1 59 1 63 1 65 1 67 2 69 1 74 1 75 1 84 1 85 1 86 1 87 1 88 2 89 1 STDs=(1.9,9,23,1.2) maxSTD=23 for d=eTS on WN150hl(FA,FS,TS,AL 0 1 1 4 2 4 3 5 4 3 5 8 6 7 7 1 8 7 9 2 10 5 11 4 12 5 13 7 14 3 15 3 17 2 18 5 19 4 20 3 21 1 22 4 23 4 24 5 25 1 26 2 27 1 29 2 30 1 31 1 32 1 34 1 36 1 37 3 38 1 39 2 40 1 43 6 44 4 46 2 47 1 48 1 50 1 52 1 56 1 60 1 63 1 65 2 67 1 72 1 74 1 82 1 83 1 85 1 86 1 87 2 88 1 99 1 _________[0 . 16) low CLUS [16,22) low high CLUS _________[0 . 10) low high CLUS_ [10,16) low high CLUS _________[0 . 19) low [19,22) low high But no algorithm would pick 19 as a cut! ___ _____[0 . 13) low [13,16) low high But no alg would pick a 13 cut! _________[0 . 22) low high CLUS_ [22,33) low high CLUS _________[0 . 16) low high CLUS [16,31) low high CLUS _________[0 . 33) low high CLUS_ [33,60) low high CLUS _________[0 . 31) low high CLUS_ [31,58) low high CLUS _________[0 .60) low high CLUS_ [60,72) low high CLUS _________[0 .58) low high CLUS_ [58,70) low high CLUS _________[0.72) low high CLUS1.1.1 [72,80] high CLUS1.1.2 _________[0.80) low high CLUS1.1 [80,95] high CLUS1.2 _________[0 .70) low high CLUS_1.1.1 [70,78) low high CLUS 1.1.2 _________[0 .78) low high CLUS_1.1 [78,94) low high CLUS1.2 _________[0.95) low high CLUS1 4 high CLUS2 _________[0.94) low high CLUS_1 0 low high CLUS_2 d=VOMMEAN DPP on WINE_150_HL_(FA,FSO2,TSO2,ALCOHOL). Some agglomeration required: CLUS is LOW_Quality F[0,10], else HIGH Quality F[13,119] with 15 LOW error. Classification accuracy = 90% (if it had been cut 13, 99.3% accuracy!) Identical cuts and accuracy! Tells us that d=eTotal_SO2 is responsible for all separation. WINE
SEEDS: STDs = (2.9, .6, 1.6, .5); maxSTD = 2.9 for e1
VOMmean w G=DPP(xod*10) SEED4150(AREA,LENKER,ASYMCOEF,LENKERGRV) STDs=(2.9, .6, 1.6, .5)) maxSTD=2.9 for e1 d=eA SEED4150(A,LK,AC,LCG) 0 1 3 1 6 4 9 3 11 1 12 7 14 1 15 3 16 2 17 10 20 10 22 1 24 1 25 1 26 2 27 1 28 4 29 1 31 3 32 3 34 3 35 4 37 2 38 2 39 3 40 1 41 2 42 4 43 5 44 1 46 5 48 2 49 2 51 3 53 1 54 4 56 3 57 3 59 5 60 1 64 1 66 1 67 1 69 1 70 2 71 1 72 1 73 3 75 4 78 8 81 6 84 1 85 2 87 1 88 1 11 18 12 25 13 18 14 18 15 15 16 13 17 8 18 8 19 21 20 2 21 4 [0,14) 1 Kama Canada But no algorithm would pick 14 as a cut! _________[0.19) Kama Rosa 33 Canada CLUS_1.1 [19,62) 50 Kama Rosa 17 Canada CLUS_1.2 [13,14) 10 Kama Canada . That's either 8 or 10 errors and no algorithm would cut at 14. [14,15) 18 Kama But no algorithm would cut at 15. [15,16) 13 Kama 2 Rosa no alg w cut. [16,17) 7 Kama 6 Risa _________[0.17) Kama Rosa 50 Canada CLUS_1 [17,22) 1 Kama Rosa 0 Canada CLUS_2 _________[19,23) 0 Kama Rosa 11 Canada CLUS_1.2.1 [23,62) 50 Kama Rosa 6 Canada CLUS_1.2.2 But that's the only thinning! Therefore, we are unable to separate Kama and Canada at all. _________[23,30) 6 Kama Rosa Canada CLUS_ [30,62) 44 Kama 16 Rosa Canada CLUS_ _________[30,33) 5 Kama Rosa Canada CLUS_ [33,62) 39 Kama Rosa 1 Canada CLUS_ _________[33,36) 6 Kama Rosa Canada CLUS_ [36,62) 33 Kama Rosa 0 Canada CLUS_ _________[36,45) 18 Kama Rosa Canada CLUS_ [45,62) 15 Kama Rosa 0 Canada CLUS_ _________[45,50) 8 Kama Rosa Canada CLUS_ _________ [50,52) 0 Kama Rosa Canada CLUS_ _________ [52,55) 3 Kama Rosa Canada CLUS_ _________ [55,58) 3 Kama Rosa Canada CLUS_ [58,62) 1 Kama Rosa Canada CLUS _________[0.62) Kama Rosa 50 Canada CLUS_1 [62,89) 0 Kama Rosa 0 Canada CLUS_2 11 errors, so accuracy = 93% SEEDS
CONCRETE: Entirely inconclusive using e1! e4 accuracy rate = 104/150 = 69%
VOMmean w F=(DPP-MN)/4 Concrete4150(C, W, FA, Ag) 0 4 6 3 7 2 12 10 13 2 14 3 18 9 20 4 22 5 23 3 24 5 27 3 31 3 36 4 41 2 42 1 43 4 44 3 46 3 48 2 49 2 55 13 58 8 60 6 62 5 65 4 71 16 72 4 74 3 82 4 83 7 97 2 STD=(101,28,99,81) d=e1 Conc4150 0 1 1 1 5 1 6 1 7 1 8 4 9 1 10 1 11 2 12 1 13 5 14 1 15 3 16 3 17 4 18 1 19 3 20 9 21 4 22 3 23 7 24 2 25 4 26 8 27 7 28 7 29 10 30 3 31 1 32 3 33 6 34 4 35 5 37 2 38 2 40 1 42 3 43 1 44 1 45 1 46 4 49 1 56 1 58 1 61 1 65 1 66 1 69 1 71 1 77 1 80 1 83 1 86 1 [9,16) Low Medium High CLUS_ [16,31) Low Medium 0 High CLUS_ _________[0,9) Low Medium High CLUS_ [9,14) Low Medium High CLUS_ [31,39) Low Medium High CLUS_ [14,18) Low Medium High CLUS_ . [39,52) Low Medium High CLUS_1.1.2 [18,23) Low Medium High CLUS_ ______ [52,80) Low 17 Medium High CLUS_1.2 [23,31) Low Medium High CLUS_ Entirely inconclusive using e1 ! ________[0.80) 43Low 46Medium 55High CLUS_1 [80,101) 0 Low Medium High CLUS_2 [31,36) Low Medium High CLUS_ d=e3 Conc4150 0 15 3 2 5 4 17 3 19 8 21 1 29 3 41 28 46 3 47 8 48 3 52 4 53 15 58 3 62 4 63 4 64 1 65 7 67 3 69 4 72 3 73 12 75 2 78 5 83 1 VOM2,4MN2,4 [36,39) Low Medium High CLUS_ ________ [0.9) Low Medium High CLUS_1.1 [9,32) Low Medium High CLUS_1.2 0 2 1 5 2 6 3 12 4 2 5 1 6 6 7 6 8 1 9 4 10 12 11 11 12 5 13 3 18 2 19 9 20 10 21 4 29 1 30 4 31 9 32 4 34 9 35 4 36 1 62 2 64 5 93 4 = L 0M 0H C14 . [39,52) Low 12 Medium High CLUS_1.1.2 = L 4M 0H C13 [2,4) 11L 6M 1H C12 ________ [0.32) Low 24 Medium High CLUS_1 [32,101) 39 Low 28 Medium 47 High CLUS_2 = L 2M 0H C11 [5,9) 14L 0M 0H C10 ______ [52,90) Low 11 Medium High CLUS_1.2 ________ [32,55) Low 12 Medium 28 High CLUS_2.1 [55,101) 1 Low Medium High CLUS_2.2 [9,11) 3L 0 M 13H C9 0 2 3 4 8 4 10 2 12 8 13 4 15 4 16 14 17 3 18 1 19 3 20 8 22 15 24 4 27 9 29 3 30 6 31 3 32 3 33 7 34 4 35 8 37 5 38 3 53 23 = L 2 M 4H C8 = L 0 M 0H C7 ________[0,5) Lo Med 2 Hi CLUS_2 = L 3 M 0H C6 ________= Lo Med 3 Hi CLUS_3 [15,25) 2L 4 M 19H C5 ________= Lo Med 0 Hi CLUS_4 ________[10,14) 1 Lo 4 Med 9 Hi CLUS_5 [25,33) 0L 0 M 18H C4 _________[0.90) Low 46 Medium High CLUS_1 [90,113) 0 Low Medium High CLUS_2 [33,50) 0L 14 M 0H C3 d=e4 Conc4150 ________[14,21) 9 Lo 9 Med Hi CLUS_6 0 17 1 11 3 12 6 35 13 25 22 25 24 8 44 7 67 4 89 2 91 4 ________= Lo 4 Med 0 Hi CLUS_11 ________[21,25) 3 Lo 1 Med Hi CLUS_7 [50,93) 0L 7 M 0H C2 ________= Lo 9 Med 0 Hi CLUS_10 ________[25,28) 5 Lo 1 Med Hi CLUS_8 e4 accuracy rate = 104/150= 69% ________= Lo 0 Med 0 Hi CLUS_9 e2 accuracy= 93/150=62% [93,m) 0L 10M 0H C1 ________= Lo 5 Med 17 Hi CLUS_8 Inconclusive on e2! _______ = Lo 6 Med 19 Hi CLUS_6 ________= Lo 3 Med 19 Hi CLUS_7 _______ = Lo 8 Med 0 Hi CLUS_5 Accuracy= 127/150=85% ________= Lo 7 Med Hi CLUS_3 d=e2 Conc4150 ________[28,36) 14 Lo 12 Med Hi CLUS_9 [36.40) 7 Lo Med Hi CLUS_10 ________= Lo 4 Med Hi CLUS_1 CONCRETE ________= Lo 4 Med Hi CLUS_2 ________= Lo 4 Med Hi CLUS_4 ________= Lo 22 Med 0 Hi CLUS_1
Concrete4150(C, W, FA, Ag): Redo without cheating ;-) Even though the accuracy is high, no algorithm would make all of those cuts. VOM2,4MN2,4 Cut only at gaps 5 on first round. Then we iteratively repeat on each subcluster. C1 and C2 accuracy=100%, so we skip them and concentrate on C3,C4,C5 to see if a second round will purify them. Start with C5: (F-MN)/4 0 2 1 5 2 6 3 12 4 2 5 1 6 6 7 6 8 1 9 4 10 12 11 11 12 5 13 3 18 2 19 9 20 10 21 4 29 1 30 4 31 9 32 4 34 9 35 4 36 1 62 2 64 5 93 4 0 2 6 3 7 2 12 6 13 2 14 2 18 1 19 3 21 3 23 4 24 2 25 1 26 3 27 1 29 2 33 4 37 1 41 2 42 1 43 2 44 3 46 1 49 2 54 3 58 2 60 3 61 2 66 3 69 1 77 4 78 1 87 1 99 2 = L 0M 0H (outliers) C3 next: (F-MN)/4 _______[1,10) 5L 0M 0H C2 [0,15) 41L 17 M 18H C5 0 2 7 1 11 1 13 1 16 1 18 1 21 1 22 2 23 1 25 1 31 2 32 1 34 1 36 1 41 2 47 3 50 1 54 1 68 1 70 1 72 1 94 1 = M 0H (outliers) _______[10,16) 10L 0M 0H C5 = M 1H (outliers) _______[19,20) 3L 0M 0H C9 _______[9,15) 2M 0H C10 _______[15,20) 0M 2H C10 _______[20,24) 0M 4H C10 [15,25) 2L 4 M 19H C4 (gap=5) ________ [31,44) 1M 6H C4 (one 31=M so 1 error!) = M 0H (outliers) _______[20,32) 16L 0M 0H C10 = L 0M 0H c6 [25,50) 0L 14 M 18H C3 (gap=8) ________ [39,45) 0L 8M 0H c8 = L 0M 0H c7 ________ [44,61) 1M 4H C3 (50=M, separated at gap=3) ________ [45,48) 0L 1M 0H outlier ________ [48,52) 0L 2M 0H C3 ________ [61,83) 2M 1H C2 (68=H, separated at gap=2) [50,78) 0L 7 M 0H C2 (gap=26) ________ [52,66) 0L 5M 5H C4 ________ [83,127) 5M 0H C1 [78,m) 0L 10M 0H C1 ((gap=29) [56,59) 0L 2M 0H ' [59,66) 0L 0M 5H (These will show up when we get to gaps of 4 and 2 (actual gap sizes 16 and 8) [52,56) 0L 3M 0H ' Accuracy=100% on C1 and C2! ________ [63,73) 0L 0M 4H C3 1 error on C1,2,3,5 _______ [77,83) 0L 0M 5H C1 = L 0M 1H (outlier) = L 0M 2H (outliers) = L 0M 1H (outlier) Accuracy=100% on C5! C4 next: (F-MN)/3 0 1 5 1 10 2 11 1 13 1 18 1 21 1 27 1 29 1 33 1 35 1 36 1 40 1 42 2 47 1 49 1 51 1 58 1 59 1 64 1 66 1 68 1 So there is but 1 error (in the C3 step) for an accuracy of 149/150=99.3%. However, I realized I am still cheating ;-( How would I know to do as the first round instead of ? VOM2,4MN2,4 VOM1,2,3,4MN1,2,3,4 _______ [0,24) 0L 0M 8H C2 _______ [24,31) 1L 0M 1H C5 (29=L outlier with gaps 2,4) [31,38) 0L 0M 3H C6 _______ [38,45) 0L 0M 3H C4 I need to redo this using all 4 attributes. Another issue is: How can we follow this with an agglomeration step which might glue the intra-class subclusters back together? Agglomerate after FAUST Gap Clustering using "separation of the subcluster medians" [or means?] as the measure?!?! _______ [45,55) 0L 0M 3H C3 _______ [55,83) 1L 3M 1H C1 {58,59} are the L and H: doubleton outlier set bdd by gaps of 7 and 5 = L 1M 0H outlier CONCRETE
CONCRETE VOMmean w F=(DPP-MN)/4 Concrete4150(C, W, FA, Ag)
Redo with all 4 attributes and Fgap5 (which is actually gap=5*4=20). 0 1 1 1 5 1 6 1 7 1 8 4 9 1 10 1 11 2 12 1 13 5 14 1 15 3 16 3 17 4 18 1 19 3 20 9 21 4 22 3 23 7 24 2 25 4 26 8 27 7 28 7 29 10 30 3 31 1 32 3 33 6 34 4 35 5 37 2 38 2 40 1 42 3 43 1 44 1 45 1 46 4 49 1 56 1 58 1 61 1 65 1 66 1 69 1 71 1 77 1 80 1 83 1 86 1 CLUS 4 (F=(DPP-MN)/2, Fgap2 0 3 7 4 9 1 10 12 11 8 12 7 15 4 18 10 21 3 22 7 23 2 25 2 26 3 27 1 28 2 29 1 31 3 32 1 34 2 40 4 47 3 52 1 53 3 54 3 55 4 56 2 57 3 58 1 60 2 61 2 62 4 64 4 67 2 68 1 71 7 72 3 79 5 85 1 87 2 med=10 _______ = L 0M 3H CLUS gap= Median=0 Avg=0 = L 0M 4H CLUS gap= Median=7 Avg=7 [8,14] 1L 5M 22H CLUS L+5M err H Median=11 Avg=10.7 med=14 med=9 med=18 gap=3 ______ = L 0M 4H CLUS gap= Median=15 Avg=15 = L 0M 10H CLUS gap= Median=18 Avg=18 med=17 med=21 ______ [20,24) 0L 10M 2H CLUS gap= Median=22 Avg=22 2H errs in L [24,30) L 0M 0H CLUS_ Median=26 Avg=26 med=23 med=40 gap=2 [30,33] 0L 4M 0H CLUS gap= Median=31 Avg=32.3 = L 2M 0H CLUS gap= Median=34 Avg=34 ______ = L 4M 0H CLUS_ gap= Median=40 Avg=40 = L 3M 0H CLUS_ gap= Median=47 Avt=47 med=34 med=33 med=56 Accuracy=90% med=61 ______ [50,59) L 1M 4H CLUS gap= Median=55 Avg=55 1M+4H errs in L [59,63) L 0M 0H CLUS_ Median=61.5 Avg=61.3 med=57 med=62 gap=2 ______ = L 0M 2H CLUS gap= Median=64 Avg= H errs in L [66,70) 10L 0M 0H CLUS Median=67 Avg=67.3 med=71 gap=3 [70,79) 10L 0M 0H CLUS_ Median=71 Avg=71.7 med=71 ______ gap=7 = L 0M 0H CLUS_ gap=6 Median=79 Avg=79 [74,90) 2L 0M 1H CLUS_ Merr in L Median=87 Avg=86.3 med=86 ______ CLUS 4 gap=7 [52,74) 0L 7M 0H CLUS_3 Let's review agglomerative clustering in general next (dendograms) ______ gap=6 [74,90) 0L 4M 0H CLUS_2 Agglomerate (build dendogram) by iteratively gluing together clusters with min Median separation. Should I have normalize the rounds? Should I have used the same Fdivisor and made sure the range of values was the same in 2nd round as it was in the 1st round (on CLUS 4)? Can I normalize after the fact, I by multiplying 1st round values by 100/88=1.76? Agglomerate the 1st round clusters and then independently agglomerate 2nd round clusters? ________ [0.90) 43L 46 M 55H gap=14 [90,113) 0L 6M 0H CLUS_1 _____________At this level, FinalClus1={17M} 0 errors C1 C2 C3 C4 CONCRETE
Hierarchical Clustering
[Dendrogram figure over points A-G: ABC and DEFG at the top level; DEFG splits into DE and FG, ABC into A and BC; then the singletons F, G, D, E, B, C.] Any maximal anti-chain (a maximal set of nodes in which no 2 are directly connected, i.e., no node is an ancestor of another) is a clustering (a dendrogram offers many).
Hierarchical Clustering
But the “horizontal” anti-chains are the clusterings resulting from the top down (or bottom up) method(s).
CONCRETE VOMmean w F=(DPP-MN)/4 Concrete4150(C, W, FA, Ag)
0 1 1 1 5 1 6 1 7 1 8 4 9 1 10 1 11 2 12 1 13 5 14 1 15 3 16 3 17 4 18 1 19 3 20 9 21 4 22 3 23 7 24 2 25 4 26 8 27 7 28 7 29 10 30 3 31 1 32 3 33 6 34 4 35 5 37 2 38 2 40 1 42 3 43 1 44 1 45 1 46 4 49 1 56 1 58 1 61 1 65 1 66 1 69 1 71 1 77 1 80 1 83 1 86 1 CLUS 4 (F=(DPP-MN)/2, Fgap2 0 3 7 4 9 1 10 12 11 8 12 7 15 4 18 10 21 3 22 7 23 2 25 2 26 3 27 1 28 2 29 1 31 3 32 1 34 2 40 4 47 3 52 1 53 3 54 3 55 4 56 2 57 3 58 1 60 2 61 2 62 4 64 4 67 2 68 1 71 7 72 3 79 5 85 1 87 2 med=10 _______ = L 0M 3H CLUS gap= Median=0 Avg=0 = L 0M 4H CLUS gap= Median=7 Avg=7 [8,14] 1L 5M 22H CLUS L+5M err H Median=11 Avg=10.7 med=14 med=9 med=18 gap=3 ______ = L 0M 4H CLUS gap= Median=15 Avg=15 = L 0M 10H CLUS gap= Median=18 Avg=18 med=17 med=21 ______ [20,24) 0L 10M 2H CLUS gap= Median=22 Avg=22 2H errs in L [24,30) L 0M 0H CLUS_ Median=26 Avg=26 med=23 med=40 gap=2 [30,33] 0L 4M 0H CLUS gap= Median=31 Avg=32.3 = L 2M 0H CLUS gap= Median=34 Avg=34 ______ = L 4M 0H CLUS_ gap= Median=40 Avg=40 = L 3M 0H CLUS_ gap= Median=47 Avt=47 med=34 med=33 med=56 Accuracy=90% med=61 ______ [50,59) L 1M 4H CLUS gap= Median=55 Avg=55 1M+4H errs in L [59,63) L 0M 0H CLUS_ Median=61.5 Avg=61.3 med=57 med=62 gap=2 ______ = L 0M 2H CLUS gap= Median=64 Avg= H errs in L [66,70) 10L 0M 0H CLUS Median=67 Avg=67.3 med=71 gap=3 [70,79) 10L 0M 0H CLUS_ Median=71 Avg=71.7 med=71 ______ gap=7 = L 0M 0H CLUS_ gap=6 Median=79 Avg=79 [74,90) 2L 0M 1H CLUS_ Merr in L Median=87 Avg=86.3 med=86 Suppose we know (or want) 3 clusters, Low, Medium and High Strength. Then we find ______ CLUS 4 gap=7 [52,74) 0L 7M 0H CLUS_3 Suppose we know that we want 3 strength clusters, Low, Medium and High. We can use an anti-chain that gives us exactly 3 subclusters two ways, one show in brown and the other in purple Which would we choose? The brown seems to give slightly more uniform subcluster sizes. Brown error count: Low (bottom) 11, Medium (middle) 0, High (top) 26, so 96/133=72% accurate. The Purple error count: Low 2, Medium 22, High 35, so 74/133=56% accurate. ______ gap=6 [74,90) 0L 4M 0H CLUS_2 ________ [0.90) 43L 46 M 55H gap=14 [90,113) 0L 6M 0H CLUS_1 What about agglomerating using single link agglomeration (minimum pairwise distance? Agglomerate (build dendogram) by iteratively gluing together clusters with min Median separation. Should I have normalize the rounds? Should I have used the same Fdivisor and made sure the range of values was the same in 2nd round as it was in the 1st round (on CLUS 4)? Can I normalize after the fact, I by multiplying 1st round values by 100/88=1.76? Agglomerate the 1st round clusters and then independently agglomerate 2nd round clusters? _____________At this level, FinalClus1={17M} 0 errors C1 C2 C3 C4 CONCRETE
Agglomerating using single link (min pairwise distance = min gap size)
Agglomerating using single link (min pairwise distance = min gap size! (glue min-gap adjacent clusters 1st) CLUS 4 (F=(DPP-MN)/2, Fgap2 0 3 7 4 9 1 10 12 11 8 12 7 15 4 18 10 21 3 22 7 23 2 25 2 26 3 27 1 28 2 29 1 31 3 32 1 34 2 40 4 47 3 52 1 53 3 54 3 55 4 56 2 57 3 58 1 60 2 61 2 62 4 64 4 67 2 68 1 71 7 72 3 79 5 85 1 87 2 _______ = L 0M 3H CLUS gap= Median=0 Avg=0 = L 0M 4H CLUS gap= Median=7 Avg=7 [8,14] 1L 5M 22H CLUS L+5M err H Median=11 Avg=10.7 gap=3 ______ = L 0M 4H CLUS gap= Median=15 Avg=15 = L 0M 10H CLUS gap= Median=18 Avg=18 ______ [20,24) 0L 10M 2H CLUS gap= Median=22 Avg=22 2H errs in L [24,30) L 0M 0H CLUS_ Median=26 Avg=26 gap=2 [30,33] 0L 4M 0H CLUS gap= Median=31 Avg=32.3 = L 2M 0H CLUS gap= Median=34 Avg=34 ______ = L 4M 0H CLUS_ gap= Median=40 Avg=40 = L 3M 0H CLUS_ gap= Median=47 Avt=47 Accuracy=90% ______ [50,59) L 1M 4H CLUS gap= Median=55 Avg=55 1M+4H errs in L [59,63) L 0M 0H CLUS_ Median=61.5 Avg=61.3 gap=2 ______ = L 0M 2H CLUS gap= Median=64 Avg= H errs in L [66,70) 10L 0M 0H CLUS Median=67 Avg=67.3 gap=3 [70,79) 10L 0M 0H CLUS_ Median=71 Avg=71.7 ______ gap=7 = L 0M 0H CLUS_ gap=6 Median=79 Avg=79 [74,90) 2L 0M 1H CLUS_ Merr in L Median=87 Avg=86.3 The first thing we can notice is that outliers mess up agglomerations which are supervised by knowledge of the number of subclusters expected. Therefore we might remove outliers by backing away from all gap5 agglomerations, then looking for a 3 subcluster max anti-chains. What we have done is to declare F<7 and F>84 as extreme tripleton outliers sets; and F=79. F=40 and F=47 as singleton outlier sets because they are F-gapped by at least 5 (which is actually 10) on either side. The brown gives more uniform sizes. Brown errors: Low (bottom) 8, Medium (middle) 12 and High (top) 6, so 107/133=80% accurate. The one decision to agglomerate C4.7.1 to C4.7.2 (gap=3) instead of C4.3.2 to C4.7.2 (gap=3) lots of error. C4.7.1 and C4.7.2 are problematic since they are separate out, but in increasing F order, it's H M L M L, so if we suspected this pattern we would look for 5 subclusters. The 5 orange errors in increasing F-order are: 6, 2, 0, 0, 8 so 127/133=95% accurate. If you have ever studied concrete, you know it is a very complex material. The fact that it clusters out with a F-order pattern of HMLML is just bizarre! So we should expect errors. CONCRETE
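A minimal sketch of this single-link gluing on the F-line, assuming each subcluster is represented by its list of F-values and the subclusters are given in increasing F order (both assumptions are illustrative): repeatedly merge the adjacent pair with the smallest gap until the desired number of clusters remains.

def single_link_agglomerate(clusters, target_k):
    # clusters: list of sorted lists of F-values, ordered along the F-line.
    # Repeatedly glue the adjacent pair with the smallest gap
    # (gap = min pairwise distance between adjacent 1-D clusters).
    clusters = [sorted(c) for c in clusters]
    while len(clusters) > target_k:
        gaps = [clusters[i + 1][0] - clusters[i][-1]
                for i in range(len(clusters) - 1)]
        i = gaps.index(min(gaps))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters

# e.g. single_link_agglomerate([[0, 1], [7, 8, 9], [20, 22], [40]], 3)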
meanVOM w F=(DPP-MN)/4 Concrete4150(C, W, FA, Ag)
Redo: Weight d with |MN-VOM|/Len F Ct gap (gap5) 0 2 C49=(f-mn)/5 F Ct gap (same range so gaps3) 0 1 C Ct gap (range is ~2 so gaps3) 0 1 ___ ___ [0,5) L 1M 0H C491 ___ ___ [0,20) L 6M 0H C1 ___ ___ [0,30) L 3M 0H C41 ___ ___ [20,30) 0L 4M 0H C2 ___ ___ [30,40) L 2M 0H C42 ___ ___ [40,43) L 0M 1H C47 ___ ___ [5,15) 2L 4M 1H C492 ___ ___ [30,40) 0L 7M 0H C3 C4 ___ ___ [43,48) L 4M 0H C48 [48,105) 43L 23M 51H C49 ___ ___ [15,62) 41L 17M 47H C493 ___ ___ [62,70) 0L 0M 2H C494 ___ ___ [70,77) 0L 1M 1H C495 1 err This uncovers the fact that repeated applications of meanVOM can be non-productive when each applications basically removes sets of outliers at the extremes of the F-value array (because when outliers are removed, the VOM may move toward the mean). ___ __ [105,110) L 0M 2H C46 ___ __ [110,115) L 1M 0H C45 ___ __ [115,120) L 0M 1H C44 ___ __ [120,123) L 2M 0H C43 CONCRETE
APPENDIX: Functional Gap Clustering using Fpq(x) = RND[(x-p)o(q-p)/|q-p| - minF] on the Spaeth image (p = avg). [Slide figures: the 15 Value_Arrays and the 15 Count_Arrays, one for each q = z1, z2, z3, ..., zf; the Level-0, stride=z1 PointSet as a pTree mask over z1...zf; gaps at [F=2, F=5] and [F=6, F=10] for Fp=MN,q=z1; and the pTree masks of the 3 z1_clusters (z11, z12, z13), obtained by ORing.] The FAUST algorithm: 1. Project onto each pq line using the dot product with the unit vector from p to q. 2. Generate the ValueArrays (also generate the CountArrays and the mask pTrees). 3. Analyze all gaps and create sub-cluster pTree masks.
Gap Revealer Width 24 so compute all pTree combinations down to p4 and p' d=M-p 1 z1 z z7 2 z z z8 3 z z z9 za M 6 7 zf zb a zc b zd ze c a b c d e f Z z z z z z z z z z za 13 4 zb 10 9 zc 11 10 zd 9 11 ze 11 11 zf 7 8 F=zod 11 27 23 34 53 80 118 114 125 110 121 109 83 p6 1 p5 1 p4 1 p3 1 p2 1 p1 1 p0 1 p6' 1 p5' 1 p4' 1 p3' 1 p2' 1 p1' 1 p0' 1 p= &p5' 1 C=3 p5' C=2 p5 C=8 &p4' 1 C=1 p4' p4 C=2 C=0 C=6 p6' 1 C=5 p6 C10 [ , ] = [ 48, 64). z5od=53 is 19 from z4od=34 (>24) but 11 from 64. But the next int [64,80) is empty z5 is 27 from its right nbr. z5 is declared an outlier and we put a subcluster cut thru z5 [ , ]= [0,15]=[0,16) has 1 point, z1. This is a 24 thinning. z1od=11 is only 5 units from the right edge, so z1 is not declared an outlier) Next, we check the min dis from the right edge of the next interval to see if z1's right-side gap is actually 24 (the calculation of the min is a pTree process - no x looping required!) [ , ] = [16,32). The minimum, z3od=23 is 7 units from the left edge, 16, so z1 has only a 5+7=12 unit gap on its right (not a 24 gap). So z1 is not declared a 24 (and is declared a 24 inlier). [ , ] = [32,48). z4od=34 is within 2 of 32, so z4 is not declared an anomaly. [ , ]= [112,128) z7od=118 z8od=114 z9od=125 zaod=114 zcod=121 zeod=125 No 24 gaps. But we can consult SpS(d2(x,y) for actual distances: [ , ]= [64, 80). This is clearly a 24 gap. [ , ]= [80, 96). z6od=80, zfod=83 [ , ]= [96,112). zbod=110, zdod=109. So both {z6,zf} declared outliers (gap16 both sides. Which reveals that there are no 24 gaps in this subcluster. And, incidentally, it reveals a 5.8 gap between {7,8,9,a} and {b,c,d,e} but that analysis is messy and the gap would be revealed by the next xofM round on this sub-cluster anyway. X1 X2 dX1X2 z7 z z7 z z7 z z7 z z7 z z7 z z7 z z8 z z8 z z8 z z8 z z8 z z8 z X1 X2 dX1X2 z9 z z9 z z9 z z9 z z9 z z10 z z10 z z10 z z10 z X1 X2 dX1X2 z11 z z11 z z11 z z12 z z12 z z13 z
Barrel Clustering: (This method attempts to build barrel-shaped gaps around clusters)
Allows for a better fit around convex clusters that are elongated in one direction (not round). Gaps in dot-product lengths [projections] on the line. Exhaustive search for all barrel gaps: it takes two parameters for a pseudo-exhaustive search (exhaustive modulo a grid width): 1. A StartPoint, p (an n-vector, so n-dimensional). 2. A UnitVector, d (an n-direction, so (n-1)-dimensional: a grid on the surface of the sphere in R^n). Then for every choice of (p, d) (e.g., in a grid of points in R^{2n-1}) two functionals are used to enclose subclusters in barrel-shaped gaps: a. SquareBarrelRadius functional, SBR(y) = (y-p)o(y-p) - ((y-p)od)². b. BarrelLength functional, BL(y) = (y-p)od. [Slide figure: a barrel around a cluster, showing the barrel cap gap width along d and the barrel radius gap width.] Given a p, do we need a full grid of d's (directions)? No! d and -d give the same BL-gaps. Given d, do we need a full grid of p starting points? No! All p' s.t. p' = p + cd give the same gaps. Hill-climb gap width from a good starting point and direction. MATH: We need the dot-product projection length and the dot-product projection distance. The projection of y on f is ((yof)/(fof)) f, with dot-product projection length (yof)/|f|; the squared dot-product projection distance is |y - ((yof)/(fof)) f|² = yoy - 2(yof)²/(fof) + (yof)²(fof)/(fof)² = yoy - (yof)²/(fof). With y-p in place of y and f = q-p: squared (y-p)-on-(q-p) projection distance = (y-p)o(y-p) - ((y-p)o(q-p))²/((q-p)o(q-p)), where the first term = yoy - 2yop + pop and the second = (yo(q-p) - po(q-p))²/|q-p|². For the dot-product length projections (the caps) we already needed (y-p)o(M-p)/|M-p| = (yo(M-p) - po(M-p))/|M-p|. That is, we need to compute the constants and the dot-product functionals in an optimal way (and then do the PTreeSet additions/subtractions/multiplications). What is optimal? (Minimizing PTreeSet functional creations and PTreeSet operations.)
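A minimal sketch of the two barrel functionals under the definitions above (p a start point, d a unit direction); BL gives the cap coordinates and SBR the squared barrel radii, and each would then be gap-analyzed. Array names are illustrative.

import numpy as np

def barrel_functionals(Y, p, d):
    # BL(y)  = (y-p) o d                     (barrel length / cap coordinate)
    # SBR(y) = (y-p)o(y-p) - ((y-p)od)^2     (squared barrel radius)
    d = d / np.linalg.norm(d)
    Z = Y - p
    BL = Z @ d
    SBR = np.einsum('ij,ij->i', Z, Z) - BL ** 2
    return BL, SBR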
Cone Clustering: (finding cone-shaped clusters)
x=s2 cone=.1 39 2 40 1 41 1 44 1 45 1 46 1 47 1 i39 59 2 60 4 61 3 62 6 63 10 64 10 65 5 66 4 67 4 69 1 70 1 59 w maxs-to-mins cone=.939 i25 i40 i16 i42 i17 i38 i11 i48 22 2 23 1 i34 i50 i24 i28 i27 27 5 28 3 29 2 30 2 31 3 32 4 34 3 35 4 36 2 37 2 38 2 39 3 40 1 41 2 46 1 47 2 48 1 i39 53 1 54 2 55 1 56 1 57 8 58 5 59 4 60 7 61 4 62 5 63 5 64 1 65 3 66 1 67 1 68 1 114 14 i and 100 s/e. So picks i as 0 w naaa-xaaa cone=.95 12 1 13 2 14 1 15 2 16 1 17 1 18 4 19 3 20 2 21 3 22 5 i21 24 5 25 1 27 1 28 1 29 2 i7 41/43 e so picks e Corner points Gap in dot product projections onto the cornerpoints line. x=s1 cone=1/√2 60 3 61 4 62 3 63 10 64 15 65 9 66 3 67 1 69 2 50 x=s2 cone=1/√2 47 1 59 2 60 4 61 3 62 6 63 10 64 10 65 5 66 4 67 4 69 1 70 1 51 x=s2 cone=.9 59 2 60 3 61 3 62 5 63 9 64 10 65 5 66 4 67 4 69 1 70 1 47 Cosine cone gap (over some angle) w maxs cone=.707 0 2 8 1 10 3 12 2 13 1 14 3 15 1 16 3 17 5 18 3 19 5 20 6 21 2 22 4 23 3 24 3 25 9 26 3 27 3 28 3 29 5 30 3 31 4 32 3 33 2 34 2 35 2 36 4 37 1 38 1 40 1 41 4 42 5 43 5 44 7 45 3 46 1 47 6 48 6 49 2 51 1 52 2 53 1 55 1 137 w maxs cone=.93 8 1 i10 13 1 14 3 16 2 17 2 18 1 19 3 20 4 21 1 24 1 25 4 e21 e34 27 2 29 2 i7 27/29 are i's F=(y-M)o(x-M)/|x-M|-mn restricted to a cosine cone on IRIS w aaan-aaax cone=.54 7 3 i27 i28 8 1 9 3 i20 i34 11 7 12 13 13 5 14 3 15 7 19 1 20 1 21 7 22 7 23 28 24 6 100/104 s or e so 0 picks i x=i1 cone=.707 34 1 35 1 36 2 37 2 38 3 39 5 40 4 42 6 43 2 44 7 45 5 47 2 48 3 49 3 50 3 51 4 52 3 53 2 54 2 55 4 56 2 57 1 58 1 59 1 60 1 61 1 62 1 63 1 64 1 66 1 75 x=e1 cone=.707 33 1 36 2 37 2 38 3 39 1 40 5 41 4 42 2 43 1 44 1 45 6 46 4 47 5 48 1 49 2 50 5 51 1 52 2 54 2 55 1 57 2 58 1 60 1 62 1 63 1 64 1 65 2 60 Cosine conical gapping seems quick and easy (cosine = dot product divided by both lengths. Length of the fixed vector, x-M, is a one-time calculation. Length y-M changes with y so build the PTreeSet. w maxs cone=.925 8 1 i10 13 1 14 3 16 3 17 2 18 2 19 3 20 4 21 1 24 1 25 5 e21 e34 27 2 28 1 29 2 e35 i7 31/34 are i's w xnnn-nxxx cone=.95 8 2 i22 i50 10 2 i28 i24 i27 i34 13 2 14 4 15 3 16 8 17 4 18 7 19 3 20 5 21 1 22 1 23 1 i39 43/50 e so picks out e
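A minimal sketch of the cosine-cone restriction, assuming M is the data mean and x a fixed corner point: keep the points y whose cosine with x-M is at least the cone threshold, then gap-analyze F(y) = (y-M)o(x-M)/|x-M| within that cone. The function name and default threshold are illustrative.

import numpy as np

def cone_projections(Y, M, x, cos_threshold=0.9):
    # F(y)   = (y-M)o(x-M)/|x-M|          (projection length onto the x-M line)
    # cos(y) = (y-M)o(x-M)/(|y-M||x-M|)   (cosine of the cone angle at M)
    v = x - M
    Z = Y - M
    F = Z @ v / np.linalg.norm(v)
    cos = F / np.maximum(np.linalg.norm(Z, axis=1), 1e-12)
    mask = cos >= cos_threshold
    return F[mask], mask    # gap-analyze F restricted to the cone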
FAUST Classifier: P_r = P(x o d < a), P_v = P(x o d ≥ a)
Separate class_r and class_v using the midpoint of means: P_r = P(x o d < a), P_v = P(x o d ≥ a). Set a = (m_R + (m_V - m_R)/2) o d = ((m_R + m_V)/2) o d, where D ≡ m_V - m_R (the vector from m_R to m_V) and d = D/|D|. Training amounts to choosing the Cut hyperplane, an (n-1)-dimensional hyperplane (which thus cuts the space in two). Classify with one horizontal program (AND/OR) across the pTrees to get a mask pTree for each class (bulk classification). Improve accuracy? E.g., by considering the dispersion within classes: 1. Use the vector_of_medians (vom_v ≡ (median(v1), median(v2), ...)) instead of means; then use the stdev ratio to place the cut. 2. Cut at the midpoint of max{r o d}, min{v o d}. If there is no gap, move the Cut until r_errors + v_errors is minimized. 3. Hill-climb d to maximize the gap (or minimize errors when applied to the training set). 4. Replace m_r, m_v with the averages of the margin points? 5. Round classes expected? Use SD_mr < |D|/2 for the r-class and SD_mv < |D|/2 for the v-class. [Slide figure: a 2-D scatter of r's and v's showing vom_R, vom_V, m_R, m_V, the d-line and the cut value a.]
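A minimal two-class sketch of the midpoint-of-means cut (written horizontally in NumPy rather than vertically with pTree ANDs/ORs); the training arrays R and V are illustrative assumptions.

import numpy as np

def train_faust_cut(R, V):
    # d = unit vector from mean(R) to mean(V); cut a at the midpoint of the means
    mR, mV = R.mean(axis=0), V.mean(axis=0)
    d = (mV - mR) / np.linalg.norm(mV - mR)
    a = ((mR + mV) / 2) @ d
    return d, a

def classify(X, d, a):
    # class r if x o d < a, class v if x o d >= a (bulk classification)
    return np.where(X @ d < a, 'r', 'v')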
2/2/13 Datamining Big Data big data: up to trillions of rows (or more) and, possibly, thousands of columns (or many more). I structure data vertically (pTrees) and process it horizontally. Looping across thousands of columns can be orders of magnitude faster than looping down trillions of rows. So sometimes that means a task can be done in human time only if the data is vertically organized. Data mining is [largely] CLASSIFICATION or PREDICTION (assigning a class label to a row based on a training set of classified rows). What about clustering and ARM? They are important and related! Roughly clustering creates/improves training sets and ARM is used to data mine more complex data (e.g., relationship matrixes, etc.). CLASSIFICATION is [largely] case-based reasoning. To make a decision we typically search our memory for similar situations (near neighbor cases) and base our decision on the decisions we made in those cases (we do what worked before for us or others). We let near neighbors vote. "The Magical Number Seven, Plus or Minus Two... Information"[2] cited to argue that the number of objects (contexts) an average human can hold in working memory is 7 ± 2. We can think of classification as providing a better 7 (so it's decision support, not decision making). One can say that all Classification methods (even model based ones) are a form of Near Neighbor Classification. E.g. in Decision Tree Induction (DTI) the classes at the bottom of a decision branch ARE the Near Neighbor set due to the fact that the sample arrived at that leaf. Rows of an entity table (e.g., Iris(SL,SW,PL,PW) or Image(R,G,B) describe instances of the entity (Irises or Image pixels). Columns are descriptive information on the row instances (e.g., Sepal Length, Sepal Width, Pedal Length, Pedal Width or Red, Green, Blue photon counts). If the table consists entirely of real numbers, then the row set can be viewed [as s subset of] a real vector space with dimension = # of columns. Then, the notion of "near" [in classification and clustering] can be defined using a dissimilarity (~distance) or a similarity. Two rows are near if the distance between them is low or their similarity is high. Near for columns can be defined using a correlation (e.g., Pearson's, Spearman's...) If the columns also describe instances of an entity then the table is really a matrix or relationship between instances of the row entity and the column entity. Each matrix cell measures some attribute of that relationship pair (The simplest: 1 if that row is related to that column, else 0. The most complex: an entire structure of data describing that pair (that row instance and that column instance). In Market Basket Research (MBR), the row entity is customers and the columnis items. Each cell: 1 iff that customer has that item in the basket. In Netflix Cinematch, the row entity is customers and column movies and each cell has the 5-star rating that customer gave to that movie. In Bioinformatics the row entity might be experiments and the column entity might be genes and each cell has the expression level of that gene in that experiment or the row and column entities might both be proteins and each cell has a 1-bit iff the two proteins interact in some way. In Facebook the rows might be people and the columns might also be people (and a cell has a one bit iff the row and column persons are friends) Even when the table appears to be a simple entity table with descriptive feature columns, it may be viewable as a relationship between 2 entities. 
E.g., Image(R,B,G) is a table of pixel instances with columns, R,G,B. The R-values count the photons in a "red" frequency range detected at that pixel over an interval of time. That red frequency range is determined more by the camera technology than by any scientific definition. If we had separate CCD cameras that could count photons in each of a million very thin adjacent frequency intervals, we could view the column values of that image as instances a frequency entity, Then the image would be a relationship matrix between the pixel and the frequency entities. So an entity table can often be usefully viewed as a relationship matrix. If so, it can also be rotated so that the former column entity is now viewed as the new row entity and the former row entity is now viewed as the new set of descriptive columns. The bottom line is that we can often do data mining on a table of data in many ways: as an entity table (classification and clustering), as a relationship matrix (ARM) or upon rotation that matrix, as another entity table. For a rotated entity table, the concepts of nearness that can be used also rotate (e.g., The cosine correlation of two columns morphs into the cosine of the angle between 2 vectors as a row similarity measure.)
DBs, DWs are merging as In-memory DBs:
SAP® In-Memory Computing Enabling Real-Time Computing SAP® In-Memory enables real-time computing by bringing together online transaction proc. OLTP (DB) and online analytical proc. OLAP (DW). Combining advances in hardware technology with SAP InMemory Computing empowers business – from shop floor to boardroom – by giving real-time bus. proc. instantaneous access to data-eliminating today’s info lag for your business. In-memory computing is already under way. The question isn’t if this revolution will impact businesses but when/ how. In-memory computing won’t be introduced because a co. can afford the technology. It will be because a business cannot afford to allow its competitors to adopt the it first. Total cost is 30% lower than traditional RDBMSs due to: • Leaner hardware, less system capacity req., as mixed workloads of analytics, operations, performance mgmt is in a single system, which also reduces redundant data storage. [[Back to a single DB rather than a DB for TP and a DW for boardroom dec. sup.]] • Less extract transform load (ETL) between systems and fewer prebuilt reports, reducing support required to run sofwr. Report runtime improvements of up to 1000 times. Compression rates of up to a 10 times. Performance improvements expected even higher in SAP apps natively developed for inmemory DBs. Initial results: a reduction of computing time from hours to seconds. However, in-memory computing will not eliminate the need for data warehousing. Real-time reporting will solve old challenges and create new opportunities, but new challenges will arise. SAP HANA 1.0 software supports realtime database access to data from the SAP apps that support OLTP. Formerly, operational reporting functionality was transferred from OLTP applications to a data warehouse. With in-memory computing technology, this functionality is integrated back into the transaction system. Product managers will still look at inventory and point-of-sale data, but in the future they will also receive,eg., tell customers broadcast dissatisfaction with a product over Twitter. Or they might be alerted to a negative product review released online that highlights some unpleasant product features requiring immediate action. From the other side, small businesses running real-time inventory reports will be able to announce to their Facebook and Twitter communities that a high demand product is available, how to order, and where to pick up. Bad movies have been able to enjoy a great opening weekend before crashing 2nd weekend when negative word-of-mouth feedback cools enthusiasm. That week-long grace period is about to disappear for silver screen flops. Consumer feedback won’t take a week, a day, or an hour. The very second showing of a movie could suffer from a noticeable falloff in attendance due to consumer criticism piped instantaneously through the new technologies. It will no longer be good enough to have weekend numbers ready for executives on Monday morning. Executives will run their own reports on revenue, Twitter their reviews, and by Monday morning have acted on their decisions. Adopting in-memory computing results in an uncluttered arch based on a few, tightly aligned core systems enabled by service-oriented architecture (SOA) to provide harmonized, valid metadata and master data across business processes. Some of the most salient shifts and trends in future enterprise architectures will be: • A shift to BI self-service apps like data exploration, instead of static report solutions. 
• Central metadata and masterdata repositories that define the data architecture, allowing data stewards to work across all business units and all platforms Real-time in-memory computing technology will cause a decline Structured Query Language (SQL) satellite databases. The purpose of those databases as flexible, ad hoc, more business-oriented, less IT-static tools might still be required, but their offline status will be a disadvantage and will delay data updates. Some might argue that satellite systems with in-memory computing technology will take over from satellite SQL DBs. SAP Business Explorer tools that use in-memory computing technology represent a paradigm shift. Instead of waiting for IT to work on a long queue of support tickets to create new reports, business users can explore large data sets and define reports on the fly. Here is sample of what in-memory computing can do for you: • Enable mixed workloads of analytics, operations, and performance management in a single software landscape. • Support smarter business decisions by providing increased visibility of very large volumes of business information • Enable users to react to business events more quickly through real-time analysis and reporting of operational data. • Deliver innovative real-time analysis and reporting. • Streamline IT landscape and reduce total cost of ownership. In manufacturing enterprises, in-memory computing tech will connect the shop floor to the boardroom, and the shop floor associate will have instant access to the same data as the board [[shop floor = daily transaction processing. Boardroom = executive data mining]]. The shop floor will then see the results of their actions reflected immediately in the relevant Key Performance Indicators (KPI). SAP BusinessObjects Event Insight software is key. In what used to be called exception reporting, the software deals with huge amounts of realtime data to determine immediate and appropriate action for a real-time situation. The final example is from the utilities industry: The most expensive energy a utilities provides is energy to meet unexpected demand during peak periods of consumption. If the company could analyze trends in power consumption based on real-time meter reads, it could offer – in real time – extra low rates for the week or month if they reduce their consumption during the following few hours. This advantage will become much more dramatic when we switch to electric cars; predictably, those cars are recharged the minute the owners return home from work. Hardware: blade servers and multicore CPUs and memory capacities measured in terabytes. Software: in-memory database with highly compressible row / column storage designed to maximize in-memory comp. tech. [[Both row and column storage! They convert to column-wise storage only for Long-Lived-High-Value data?]] Parallel processing takes place in the database layer rather than in the app layer - as it does in the client-server arch.
IRIS(SL,SW,PL,PW) DPPMinVec,MaxVEC
[Slide table: ID, PL, PW, SL and DPP_{MinVec,MaxVec} values for the 150 IRIS(SL,SW,PL,PW) samples, rows labeled by class: setosa (s/set), versicolor (e/ver), virginica (i/vir).]
Dot Product Projection (DPP) Check F(y)=(y-p)o(q-p)/|q-p| for gaps or thin intervals. Check actual distances at sparse ends. To illustrate the DPP algorithm, we use IRIS to see how close it comes to separating into the 3 known classes (s=setosa, e=versicolor, i=virginica) We require a DPP-gap of at least 4. We also check any sparse ends of the DPP-range to find outliers (using a table of pairwise distances). We start with p=MinVector of the 4 column minimums and q=MaxVector of the 4 col. maxs. Then we replace some of those with the average. CLUS3 outliers removed p=aaax q=aaan F Cnt 0 4 1 2 2 5 3 13 4 8 5 12 6 4 7 2 8 11 9 5 10 4 11 5 12 2 13 7 14 3 15 2 No Thining. Sparse Lo end: Check [0,8] distances i30 i35 i20 e34 i34 e23 e19 e27 i i i e i e e e i30,i35,i20 outliers because F3 they are 4 from 5,6,7,8 {e34,i34} doubleton outlier set gap>=4 p=nnnn q=xxxx F Count 0 1 1 1 2 1 3 3 4 1 5 6 6 4 7 5 8 7 9 3 10 8 11 5 12 1 13 2 14 1 15 1 19 1 20 1 21 3 26 2 28 1 29 4 30 2 31 2 32 2 33 4 34 3 36 5 37 2 38 2 39 2 40 5 41 6 42 5 43 7 44 2 45 1 46 3 47 2 48 1 49 5 50 4 51 1 52 3 53 2 54 2 55 3 56 2 57 1 58 1 59 1 61 2 64 2 66 2 68 1 Sparse Lower end: Checking [0,4] distances s14 s42 s45 s23 s16 s43 s3 s s s s s s s s42 is revealed as an outlier because F(s42)= 1 is 4 from 5,6,... and it's 4 from others in [0,4] Thinning=[6,7 ] CLUS3.1 <6.5 44 ver 4 vir LUS3.2 >6.5 2 ver 39 vir No sparse ends CLUS3.1 p=anxa q=axna F Cnt 0 2 3 1 5 2 6 1 8 2 9 4 10 3 11 6 12 6 13 7 14 7 15 4 16 3 19 2 Sparse Upper end: Check [16,19] distances e7 e32 e33 e30 e15 e e e e e e15 outlier. So CLUS3.1 = 42 versicolor Gaps=[15,19] [21,26] Check dis in [12,28] to see if s16, i39,e49,e8,e11,e44 outliers s34 s6 s45 s19 s16 i39 e49 e8 e11 e44 e32 e30 e31 s s s s s i e e e e e e e So s16,,i39,e49, e11 are outlier. {e8,e44} doubleton outlier. Separate at 17 and 23, giving CLUS1 F<17 ( CLUS1 =50 Setosa with s16,s42 declared as outliers). 17<F CLUS2 F<23 (e8,e11,e44,e49,i39 all are already declared outliers) 23<F CLUS ( 46 vers, 49 virg with i6,i10,i18,i19,i23,i32 declared as outliers) CLUS3.2 = 39 virg, 2 vers (unable to separate the 2 vers from the 39 virg) Sparse Upper end: Checking [57.68] distances i26 i31 i8 i10 i36 i6 i23 i19 i32 i18 i i i i i i i i i i i10,i36,i19,i32,i18 singleton outlies because F 4 from 56 and 4 from each other. {i6,i23} is a doubleton outlier set.
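A minimal sketch of the sparse-end check, assuming ordinary arrays rather than pTree distance tables: for each point whose F-value lies in a sparse end interval, look at its distance to its nearest neighbour among all points and declare it an outlier when that distance is at least the gap threshold (4 here, as an illustrative default).

import numpy as np

def sparse_end_outliers(X, end_indices, gap_threshold=4.0):
    # end_indices: indices of the points in the sparse end of the F-range.
    # A point is declared an outlier if its nearest neighbour (over all
    # points) is at least gap_threshold away.
    outliers = []
    for i in end_indices:
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf
        if dists.min() >= gap_threshold:
            outliers.append(i)
    return outliers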
"Gap Hill Climbing": mathematical analysis
One way to increase the size of the functional gaps is to hill-climb the standard deviation of the functional, F (hoping that a "rotation" of d toward a higher STDev would increase the likelihood that gaps would be larger; more dispersion allows for more and/or larger gaps). We can also try to grow one particular gap or thinning using support pairs as follows: F-slices are hyperplanes (assuming F = DPP_d), so it makes sense to try to "re-orient" d so that the gap grows. Instead of taking the "improved" p and q to be the means of the entire n-dimensional half-spaces cut off by the gap (or thinning), take p and q to be the means of the F-slices, the (n-1)-dimensional hyperplanes defining the gap or thinning. This is easy, since our method produces the pTree mask of each F-slice ordered by increasing F-value (in fact it is the sequence of F-values and the sequence of counts of points giving those values that we use to find large gaps in the first place). The d2-gap is much larger than the d1-gap. It is still not the optimal gap, though. Would it be better to use a weighted mean (weighted by the distance from the gap, that is, by the d-barrel radius from the center of the gap on which each point lies)? In this example it seems to make for a larger gap, but what weightings should be used (e.g., 1/radius²)? (Zero weighting after the first gap is identical to the previous.) Also, we really want to identify the support-vector pair of the gap (the pair, one from each side, which are closest together) as p and q (in this case, 9 and a, but we were just lucky to draw our vector through them). We could check the d-barrel radius of just these gap-slice pairs and select the closest pair as p and q??? [Slide figure: the same point cloud (points 6, 7, 8, 9, a, ..., s) projected onto d1 and onto the re-oriented d2, showing the d1-gap and the larger d2-gap with p and q on the gap-bounding slices.]
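A minimal sketch of this slice-mean re-orientation, under the assumption that the "F-slices defining the gap" are taken to be the points attaining the largest F-value below the cut and the smallest F-value above it: p and q become the means of those two slices and d is re-derived from them.

import numpy as np

def hill_climb_gap(X, d, cut):
    # p, q = means of the two boundary F-slices (points with the largest
    # F-value below the cut and the smallest F-value above the cut)
    F = X @ d
    below, above = F[F < cut], F[F >= cut]
    if below.size == 0 or above.size == 0:
        return d
    p = X[F == below.max()].mean(axis=0)
    q = X[F == above.min()].mean(axis=0)
    new_d = (q - p) / np.linalg.norm(q - p)
    return new_d    # re-project with new_d and re-measure the gap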
HILL CLIMBING GAP WIDTH
On CLUS2unionCLUS3 p=avg<16 q=avg>16 0 1 1 1 2 2 3 1 7 2 9 2 10 2 11 3 12 3 13 2 14 5 15 1 16 3 17 3 18 2 19 2 20 4 21 5 22 2 23 5 24 9 25 1 26 1 27 3 28 2 29 1 30 3 31 5 32 2 33 3 34 3 35 1 36 2 37 4 38 1 39 1 42 2 44 1 45 2 47 2 CL123 p is avg=14 q is avg=17 0 1 2 3 3 2 4 4 5 7 6 4 7 8 8 2 9 11 10 4 12 3 13 1 20 1 21 1 22 2 23 1 27 2 28 1 29 1 30 2 31 4 32 2 33 3 34 4 35 1 36 3 37 4 38 2 39 2 40 5 41 3 42 3 43 6 44 8 45 1 46 2 47 1 48 3 49 3 51 7 52 2 53 2 54 3 55 1 56 3 57 3 58 1 61 2 63 2 64 1 66 1 67 1 No conclusive gaps Sparse Lo end: Check [0,9] i39 e49 e8 e44 e11 e32 e30 e15 e31 i e e e e e e e e i39,e49,e11 singleton outliers. {e8,i44} doubleton outlier set Dot F p=aaan q=aaax 0 6 1 28 2 7 3 7 4 1 5 1 9 7 10 3 11 5 12 13 13 8 14 12 15 4 16 2 17 12 18 5 19 6 20 6 21 3 22 8 23 3 24 3 CLUS1< (50 Set) 7 <CLUS2< (4 Virg, 48 Vers) Here, the gap between CLUS1 and CLUS2 is made more pronounced???? (Why?) But the thinning between CLUS2 and CLUS3 seems even more obscure??? Although this doesn't prove anything, it is not good news for the method! It did not grow the gap we wanted to grow (between CLUSTER2 and CLUSTER3. There is a thinning at 22 and it is the same one but it is not more prominent. Next we attempt to hill-climb the gap at 16 using the mean of the half-space boundary. (i.e., p is avg=14; q is avg=17. CLUS3> (46 Virg, 2 Vers) Sparse Hi end: Check [38,47] distances i31 i8 i36 i10 i6 i23 i32 i18 i19 i i i i i i i i i i10,i18,i19,i32,i36 singleton outliers {i6,i23} doubleton outlier Next we attempt to hill-climb the gap at 16 using the half-space averages.
CAINE 2013 Call for Papers th International Conference on Computer Applications in Industry and Engineering September 25{27, 2013, Omni Hotel, Los Angles, Califorria, USA Sponsored by the International Society for Computers and Their Applications (ISCA) CAINE{2013 will feature contributed papers as well as workshops and special sessions. Papers will be accepted into oral presentation sessions. The topics will include, but are not limited to, the following areas: Agent-Based Systems Image/Signal Processing Autonomous Systems Information Assurance Big Data Analytics Information Systems/Databases Bioinformatics, Biomedical Systems/Engineering Internet and Web-Based Systems Computer-Aided Design/Manufacturing Knowledge-based Systems Computer Architecture/VLSI Mobile Computing Computer Graphics and Animation Multimedia Applications Computer Modeling/Simulation Neural Networks Computer Security Pattern Recognition/Computer Vision Computers in Education Rough Set and Fuzzy Logic Computers in Healthcare Robotics Computer Networks Fuzzy Logic Control Systems Sensor Networks Data Communication Scientic Computing Data Mining Software Engineering/CASE Distributed Systems Visualization Embedded Systems Wireless Networks and Communication Important Dates: Workshop/special session proposal . . May 2.5, Full Paper Submis . .June 5, Notice Accept ..July.5 , 2013. Pre-registration & Camera-Ready Paper Due August 5, Event Dates . . .Sept 25-27, 2013 SEDE Conf is interested in gathering researchers and professionals in the domains of SE and DE to present and discuss high-quality research results and outcomes in their fields. SEDE 2013 aims at facilitating cross-fertilization of ideas in Software and Data Engineering, The conference topics include, but not limited to: . Requirements Engineering for Data Intensive Software Systems. Software Verification and Model of Checking. Model-Based Methodologies. Software Quality and Software Metrics. Architecture and Design of Data Intensive Software Systems. Software Testing. Service- and Aspect-Oriented Techniques. Adaptive Software Systems . Information System Development. Software and Data Visualization. Development Tools for Data Intensive. Software Systems. Software Processes. Software Project Mgnt . Applications and Case Studies. Engineering Distributed, Parallel, and Peer-to-Peer Databases. Cloud infrastructure, Mobile, Distributed, and Peer-to-Peer Data Management . Semi-Structured Data and XML Databases. Data Integration, Interoperability, and Metadata. Data Mining: Traditional, Large-Scale, and Parallel. Ubiquitous Data Management and Mobile Databases. Data Privacy and Security. Scientific and Biological Databases and Bioinformatics. Social networks, web, and personal information management. Data Grids, Data Warehousing, OLAP. Temporal, Spatial, Sensor, and Multimedia Databases. Taxonomy and Categorization. Pattern Recognition, Clustering, and Classification. Knowledge Management and Ontologies. Query Processing and Optimization. Database Applications and Experiences. Web Data Mgnt and Deep Web May 23, 2013 Paper Submission Deadline June 30, 2013 Notification of Acceptance July 20, 2013 Registration and Camera-Ready Manuscript Conference Website: ACC-2013 provides an international forum for presentation and discussion of research on a variety of aspects of advanced computing and its applications, and communication and networking systems. Important Dates May 5, Special Sessions Proposal June 5, Full Paper Submission July 5, Author Notification Aug. 
5, Advance Registration & Camera Ready Paper Due CBR International Workshop Case-Based Reasoning CBR-MD July 19, 2013, New York/USA Topics of interest include (but are not limited to): CBR for signals, images, video, audio and text Similarity assessment Case representation and case mining Retrieval and indexing Conversational CBR Meta-learning for model improvement and parameter setting for processing with CBR Incremental model improvement by CBR Case base maintenance for systems Case authoring Life-time of a CBR system Measuring coverage of case bases Ontology learning with CBR Submission Deadline: March 20th, Notification Date: April 30th, Camera-Ready Deadline: May 12th, 2013 Workshop on Data Mining in Life Sciences DMLS Discovery of high-level structures, incl e.g. association networks Text mining from biomedical literatur Medical images mining Biomedical signals mining Temporal and sequential data mining Mining heterogeneous data Mining data from molecular biology, genomics, proteomics, pylogenetic classification With regard to different methodologies and case studies: Data mining project development methodology for biomedicine Integration of data mining in the clinic Ontology-driver data mining in life sciences Methodology for mining complex data, e.g. a combination of laboratory test results, images, signals, genomic and proteomic samples Data mining for personal disease management Utility considerations in DMLS, including e.g. cost-sensitive learning Submission Deadline: March 20th, Notification Date: April 30th, Camera-Ready Deadline: May 12th, Workshop date: July 19th, 2013 Workshop on Data Mining in Marketing DMM' In business environment data warehousing - the practice of creating huge, central stores of customer data that can be used throughout the enterprise - is becoming more and more common practice and, as a consequence, the importance of data mining is growing stronger. Applications in Marketing Methods for User Profiling Mining Insurance Data E-Markteing with Data Mining Logfile Analysis Churn Management Association Rules for Marketing Applications Online Targeting and Controlling Behavioral Targeting Juridical Conditions of E-Marketing, Online Targeting and so one Controll of Online-Marketing Activities New Trends in Online Marketing Aspects of ing Activities and Newsletter Mailing Submission Deadline: March 20th, Notification Date: April 30th, Camera-Ready Deadline: May 12th, Workshop date: July 19th, 2013 Workshop Data Mining in Ag DMA Data Mining on Sensor and Spatial Data from Agricultural Applications Analysis of Remote Sensor Data Feature Selection on Agricultural Data Evaluation of Data Mining Experiments Spatial Autocorrelation in Agricultural Data Submission Deadline: March 20th, Notification Date: April 30th, Camera-Ready Deadline: May 12th, Workshop date: July 19th, 2013
|d| = 1 iff |dd| = 1 (so dd is a unit vector iff d is a unit vector)
Here dd denotes the n x n outer-product matrix (dd)_{i,j} = d_i d_j, viewed as an n²-vector, and V_X denotes the covariance matrix of X, so that V_X o dd = Σ_{i,j} (V_X)_{i,j} d_i d_j = Var(DPPd(X)).

If |d| = 1 then |dd| = 1:
|dd| = SQRT( Σ_i d_i² d_1² + Σ_i d_i² d_2² + ... + Σ_i d_i² d_n² ) = SQRT( Σ_{j=1..n} (Σ_{i=1..n} d_i²) d_j² ); then, if |d| = 1, this = SQRT( Σ_{j=1..n} d_j² ) = 1.

If |dd| = 1 then |d| = 1:
1 = |dd| = SQRT( Σ_i d_i² d_1² + ... + Σ_i d_i² d_n² ) = SQRT( (Σ_{i=1..n} d_i²)(Σ_{j=1..n} d_j²) ) = SQRT( (Σ_{i=1..n} d_i²)² ) = Σ_{i=1..n} d_i², so |d| = 1.
FAUST Functional-Gap clustering (FAUST = Functional Analytic Unsupervised and Supervised machine Teaching) relies on choosing a distance-dominating functional (a map to R1 s.t. |F(x) - F(y)| ≤ Dis(x,y) for all x, y), so that any F-gap implies a linear cluster break.
Dot Product Projection: DPPd(y) ≡ (y-p)od, where the unit vector d can be obtained as d = (p-q)/|p-q| for points p and q.
Square Distance functional: SDp(y) ≡ (y-p)o(y-p).
Coordinate Projection is the simplest DPP: ej(y) ≡ yj.
Dot Product Radius: DPRpq(y) ≡ √( SDp(y) - DPPpq(y)² ).
Square Dot Product Radius: SDPRpq(y) ≡ SDp(y) - DPPpq(y)².
Note: the same DPPd gaps are revealed by DPd(y) ≡ yod, since (y-p)od = yod - pod and thus DP just shifts all DPP values by pod.
Finding a good unit vector, d, for the Dot Product functional DPP, to maximize gaps subject to Σ_{i=1..n} d_i² = 1:
Method-1: maximize Var(DPPd(X)) wrt d. Writing mean(.) for a column average,
Var(DPPd(X)) = mean((X o d)²) - (mean(X o d))²
= (1/N) Σ_{i=1..N} (Σ_{j=1..n} x_{i,j} d_j)(Σ_{k=1..n} x_{i,k} d_k) - (Σ_j mean(X_j) d_j)(Σ_k mean(X_k) d_k)
= (1/N) Σ_i (Σ_j x_{i,j}² d_j² + 2 Σ_{j<k} x_{i,j} x_{i,k} d_j d_k) - (Σ_j mean(X_j)² d_j² + 2 Σ_{j<k} mean(X_j) mean(X_k) d_j d_k)
= Σ_{j=1..n} (mean(X_j²) - mean(X_j)²) d_j² + 2 Σ_{j<k} (mean(X_j X_k) - mean(X_j) mean(X_k)) d_j d_k
= V_X o dd, where V_X is the covariance matrix and dd the outer-product matrix (dd)_{j,k} = d_j d_k.
Algorithm-1 (a heuristic): compute the vector of column variances (mean(X_1²) - mean(X_1)², ..., mean(X_n²) - mean(X_n)²). The unit vector A ≡ (a_1, ..., a_n) maximizing YoA is A = Y/|Y|, so let D ≡ (√var(X_1), ..., √var(X_n)) and d ≡ D/|D|. (Remove outliers first?)
Algorithm-2 (a heuristic): find the k s.t. var(X_k) = mean(X_k²) - mean(X_k)² is maximal; set d_k = 1 and d_h = 0 for h ≠ k. (We've already done this, using the e_k with max stdev.)
Algorithm-3 (an optimum): find the d producing the maximum Var(DPPd(X)). View the n x n matrices V_X and dd as n²-vectors. Then V = V_X o dd, and the dd giving the maximum would be V_X/|V_X|; but dd must be an outer product, so we want the d such that dd forms the minimum possible angle with V_X, i.e., maximize F(d) = V_X o dd / |V_X|; since |V_X| is constant, maximize F(d) = V_X o dd.
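For Algorithm-3, maximizing V_X o dd = d^T V_X d over unit vectors d is attained at the principal eigenvector of the covariance matrix, so a minimal sketch (using NumPy's symmetric eigensolver rather than the pTree machinery) is:

import numpy as np

def optimal_variance_d(X):
    # The unit d maximizing Var(DPP_d(X)) = d^T VX d is the eigenvector
    # of the covariance matrix VX with the largest eigenvalue.
    VX = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(VX)   # eigenvalues in ascending order
    return evecs[:, -1]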
29
Algorithm-1 (a heuristic): Compute the column-variance vector $(\overline{X_1^2}-\overline{X_1}^2,\ldots,\overline{X_n^2}-\overline{X_n}^2)$, let D be its coordinate-wise square root, d ≡ D/|D|, and F(x) = x∘d. Applied to the 150-row Iris data (50 Setosa, 50 Versicolor, 50 Virginica):

[F-value/count histograms on the full set, on CLUS.2, on CLUS.2.1 and on CLUS.2.1.2 omitted; only the cluster and outlier readings are kept.]

CLUS.1 = F < 15 (50 Setosa); CLUS.2 = F > 15 (50 Versicolor, 50 Virginica).
In the neighborhood of F = 15: i39, e49, e11 are outliers (gap = (13,28), i.e. width 15); {e8, e44} is a doubleton outlier set.
Splitting CLUS.2: CLUS.2.1 = F < 19 (42 Versicolor, 14 Virginica); CLUS.2.2 = F ≥ 19 (2 Versicolor, 29 Virginica). Within CLUS.2.1: F < 14 holds 30 Versicolor, 2 Virginica; F = 14 holds 12 Versicolor, 12 Virginica.
On CLUS.2.1.2 the F-intervals read: [0,20) 2 Virg, 1 Vers (all outliers); (20,30) 2 Virg, 2 Vers (all outliers?); (30,40) 0 Virg, 3 Vers (all outliers?); (40,50) 1 Virg, 2 Vers (all outliers?); (50,60) 4 Virg, 3 Vers; (60,88] 1 Virg, 3 Vers (all outliers).

Algorithm-2: Take the e_i corresponding to the maximum STD(X_i). STD(PL) = 17, over twice the others, so F(x) = x_3. The sparse high end of the F-sequence (i3, i26, i44, i31, i8, i36, ...) yields i35 as an outlier; the sparse high end of CLUS.2 (i31, i8, i36, i10, i6, i23, i32, i19, i18, ...) yields i10, i32, i19, i18 as outliers and {i6, i23} as a doubleton outlier set. CLUS.1 holds the 50 Setosa; 25 < CLUS.2 < 49 holds 46 Versicolor (plus some Virginica); CLUS.3 > 49 holds 4 Versicolor plus the bulk of the Virginica. But would one pick out 49 as a gap/thinning?
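A minimal sketch of Algorithm-1 on scikit-learn's bundled Iris data, used here as a stand-in for the 150-row Iris set on the slide (NumPy and scikit-learn assumed):

```python
import numpy as np
from sklearn.datasets import load_iris   # stand-in for the Iris data used on the slide

X = load_iris().data                     # 150 rows, 4 feature columns

# Algorithm-1: d_j proportional to the square root of column j's variance.
col_var = X.var(axis=0)                  # mean(X_j^2) - mean(X_j)^2 per column
D = np.sqrt(col_var)
d = D / np.linalg.norm(D)

F = X @ d                                # F(x) = x∘d for every row
s = np.sort(F)
gaps = np.diff(s)
k = int(np.argmax(gaps))
print("largest gap:", gaps[k], "between F =", s[k], "and F =", s[k + 1])
```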
30
Algorithm-1 (a heuristic), redone on CLUS.2 with the F-values spread out as 2(F − min): D ≡ (√(var(X_1)), ..., √(var(X_n))), d ≡ D/|D|, F(x) = x∘d.

[F-value/count histograms on the full set and on CLUS.2 with the rescaled values omitted; only the interval readings are kept.]

CLUS.1 = F < 15 (50 Setosa); CLUS.2 = F > 15 (50 Versicolor, 50 Virginica). On CLUS.2 with the rescaled F:
[0,1) 2 Vers, 0 Virg (all outliers); [1,2] 1 Vers, 0 Virg (outlier); (2,11) 9 Vers, 1 Virg; [11,12] 4 Vers, 0 Virg (quadruple outlier set?); [14,16) 3 Vers, 0 Virg (all outliers?); (16,22) 7 Vers, 0 Virg, so [0,22) holds 26 Vers, 1 Virg; (22,31) 13 Vers, 3 Virg, so [0,31) holds 39 Vers, 4 Virg; [31,35) 4 Vers, 8 Virg; (35,38) 1 Vers, 1 Virg (outliers); (38,41) 2 Vers, 1 Virg (outliers?); (41,50) 0 Vers, 12 Virg; (50,53) 0 Vers, 4 Virg; (53,56) 0 Vers, 3 Virg (outliers?); (56,58) 0 Vers, 3 Virg (outliers?); (58,61) 0 Vers, 3 Virg (outliers?); (61,71) 0 Vers, 3 Virg (outlier).
In the neighborhood of F = 15: i39, e49, e11 are outliers (gap = (13,28), width 15); {e8, e44} is a doubleton outlier. The sparse high end of CLUS.2 (i31, i8, i36, i10, i6, i23, i32, i19, i18, ...) gives i10, i32, i19, i18 as outliers and {i6, i23} as a doubleton outlier set.
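The 2(F − min) spreading is just an affine rescaling; a minimal sketch, assuming F holds the DPP values of the sub-cluster as a NumPy array:

```python
import numpy as np

def spread(F, factor=2):
    """Spread projection values out as factor*(F - min(F)) so small gaps widen."""
    return factor * (np.asarray(F) - np.min(F))

# Example (F_clus2 assumed to hold the DPP values restricted to CLUS.2):
# F_rescaled = spread(F_clus2)
```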
31
Algorithm-1: Compute the column-variance vector, D its coordinate-wise square root, d ≡ D/|D|, F(x) = x∘d. Applied to Sat150 (the Satlog dataset restricted to 150 pixels).

This Satlog dataset is 150 rows (pixels) and 4 feature columns (R, G, IR1, IR2). There are 6 row-classes with these row counts:
Count  Class#  Class description
19     c=1     red soil
32     c=2     cotton crop
50     c=3     grey soil
12     c=4     damp grey soil
10     c=5     soil with vegetation stubble
27     c=7     very damp grey soil

[F-value/count histogram and per-interval class tallies omitted.]

There are no significant gaps. There is some localization of classes with respect to F, but in a strictly unsupervised setting that would be impossible to detect. This is somewhat expected, since the changes in ground-cover class are gradual and smooth (in general), so the classes butt up against one another (no gaps between them).
32
Algorithm-1 on Concrete149 (Strength = class label; features Mix, Water, FineAggregate, Age): D ≡ (√(var(X_1)), ..., √(var(X_n))), d ≡ D/|D|, F(x) = x∘d, histogrammed as (F − min)/4.

The Concrete149 dataset has 149 rows, 1 class column and 4 feature columns (ST; MX, WA, FA, AG). There are 4 Strength classes with these row counts:
Count  Class#  Class description (Concrete Strength of ...)
19     c=0     [0,10)
32     c=2     [20,30)
50     c=4     [40,50)
12     c=6     [60,100)
I deleted the Strength 10's, 30's and 50's so as to introduce gaps to identify. I really didn't find any!

[(F − min)/4 value/count histogram and per-interval class tallies omitted; every interval mixes several Strength classes.]
33
Algorithm-2 on Concrete149 (Strength = class label; features Mix, Water, FineAggregate, Age): STD(MX) = 101, STD(WA) = 28, STD(FA) = 99, STD(AG) = 81, so we pick MX, i.e. F = MX. For comparison, the projection onto d = STDVector = (101, 28, 99, 81) divided by its length, and F = FA, were also run.

[F-value/count histograms and per-interval class tallies for F = MX, for the STD-vector projection, and for F = FA omitted; most of the counts did not survive extraction.]

The readable tallies (e.g. [14,19): 1 c=0, 4 c=2, 2 c=4, 1 c=6; [19,28): 1 c=0, 13 c=2, 6 c=4, 3 c=6) show every interval mixing several Strength classes, so none of the three choices of d produces clean gaps between the classes.
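A minimal sketch of Algorithm-2 as used here: compute the column standard deviations and project onto the winning coordinate. The Concrete149 file name, column order and loading call are assumptions for illustration:

```python
import numpy as np

def algorithm2_column(X, names):
    """Algorithm-2: project onto the single coordinate with the largest standard deviation."""
    stds = X.std(axis=0)
    k = int(np.argmax(stds))
    print({n: round(float(s), 1) for n, s in zip(names, stds)}, "-> picking", names[k])
    return X[:, k]                       # F = that coordinate, e.g. F = MX for Concrete149

# Hypothetical usage, assuming Concrete149 features are stored as columns MX, WA, FA, AG:
# X = np.loadtxt("Concrete149.csv", delimiter=",", usecols=(1, 2, 3, 4))
# F = algorithm2_column(X, ["MX", "WA", "FA", "AG"])
```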
34
Alg-1 on SEEDS210 (CLS123; features area, compact, asym_coef, len_kernel_groove). The classes are Class1 = Kama, Class2 = Rosa, Class3 = Canadian.

[F-value/count histogram and per-interval class tallies (intervals [56,58), [58,60), [60,72), [72,76), [76,81), [81,91)) omitted; the class counts per interval did not survive extraction.]
35
Alg-1 on SEEDS210 (CLS123; features area, perimeter, compact, kern_length, kern_width, asym_coef, len_kernel_groove).

[F-value/count histogram and per-interval class tallies (intervals [85,92), [92,106), [106,112), [112,126)) omitted; the class counts per interval did not survive extraction.]
36
Alg-1 on WINE_Quality150 (150 wine samples, 4 feature columns, 0-10 quality levels, of which only 4-7 occur).

[F-value/count histogram and per-interval class tallies (intervals [6,15), [15,17), [17,23), [23,30), [30,39), [39,51), [51,59), [59,63), [63,71), [71,79)) omitted; the class counts per interval did not survive extraction.]
37
Alg-1 on WN (149 wine samples; the 4 features with the highest std/(max−min); only quality levels 4, 6 and 8 retained).

[F-value/count histogram and per-interval class tallies (intervals [0,31), [31,41), [41,51), [51,91)) omitted; the class counts per interval did not survive extraction.]
38
Alg-1 and DPP_MinVec-MaxVec on WN (149 wine samples; the 4 features with the highest std/(max−min); quality levels grouped into two classes labeled L and H).

[F-value/count histograms and per-interval L/H tallies omitted (Alg-1 intervals [0,21), [21,37), [37,42), [42,60), [60,72), [72,80), [80,100), [100,122); DPP_MinVec-MaxVec intervals [0,11), [11,20), [20,30)); the class counts per interval did not survive extraction.]
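The slide does not spell out how DPP_MinVec-MaxVec builds its unit vector; a plausible reading is d = (MaxVec − MinVec)/|MaxVec − MinVec|, with MinVec and MaxVec the coordinate-wise minima and maxima of the data. A minimal sketch under that assumption (NumPy assumed):

```python
import numpy as np

def dpp_minvec_maxvec(X):
    """DPP with d taken along the MinVec-to-MaxVec diagonal of the data's bounding box
    (assumed reading of DPP_MinVec-MaxVec; not confirmed by the slide)."""
    min_vec = X.min(axis=0)              # coordinate-wise minima
    max_vec = X.max(axis=0)              # coordinate-wise maxima
    d = max_vec - min_vec
    d = d / np.linalg.norm(d)
    return (X - min_vec) @ d             # DPP_d(y) = (y - p)∘d with p = MinVec

# Usage on any numeric feature matrix X (rows = samples):
# F = dpp_minvec_maxvec(X); then look for large consecutive gaps in sorted F.
```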