Slide 1
FAUST: Fast, Accurate Unsupervised and Supervised Machine Teaching — teaching big data to reveal its information.

Dataset X (15 points p1–pf):
 pt  x1  x2     pt  x1  x2     pt  x1  x2
 p1   1   1     p6   9   3     pb  10   9
 p2   3   1     p7  15   1     pc  11  10
 p3   2   2     p8  14   2     pd   9  11
 p4   3   3     p9  15   3     pe  11  11
 p5   6   2     pa  13   4     pf   7   8

[The slide plots these points on a 16x16 grid, with the mean M and the split lines marked.]

FAUST UMF (Unsupervised, using the Medoid-to-Furthest line). Let C = X be the initial incomplete cluster.
1. While an incomplete cluster C remains, pick M ∈ C and compute F, the point furthest from M.
2. If Density(C) ≡ count(C)/distance^n(F,M) > DT (the DensityThreshold), declare C complete and continue; else split C at each PTreeSet(x∘FM) gap wider than GWT (the GapWidthThreshold).

Applying this to X: C2 = {p5} is complete (a singleton, so an outlier). C3 = {p6,pf} splits (for doubletons, split if the distance > GT), so {p6} and {pf} are complete (outliers). C1 = {p1,p2,p3,p4} and C4 = {p7,p8,p9,pa,pb,pc,pd,pe} remain incomplete. C1 is dense (dens(C1) ≈ 0.5 > DT = 0.3), so C1 is complete. Applying the algorithm to C4: {pa} is an outlier, and C4's remaining sub-cluster splits into {p9} and {pb,pc,pd}, which are complete. With f1 = p3, C1 does not split, so it is complete. (Plot annotations: M0 = (8.3, 4.2), M1 = (6.3, 3.5).)

Example 2: interlocking horseshoes with an outlier.
 pt  x1  x2     pt  x1  x2     pt  x1  x2
 p1   8   2     p6   9   3     pb  12   5
 p2   5   2     p7   9   4     pc  11   6
 p3   2   4     p8   6   4     pd  10   7
 p4   3   3     p9  13   3     pe   8   6
 p5   6   2     pa  15   7     pf   7   5
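A rough NumPy sketch of this UMF loop may make the control flow concrete. Everything below is illustrative rather than the slides' pTree implementation: plain arrays stand in for pTrees, M is taken to be the mean, and the DT/GWT defaults echo the thresholds used on these slides (GWT applied directly to the un-normalized x∘FM values, as on Slide 2).

```python
import numpy as np

def faust_umf(X, DT=0.3, GWT=8.0):
    """Sketch of the FAUST UMF loop: plain NumPy arrays stand in for pTrees.
    Clusters are kept as arrays of row indices into X."""
    complete, work = [], [np.arange(len(X))]
    while work:
        idx = work.pop()
        C = X[idx]
        M = C.mean(axis=0)                       # M taken as the mean of the cluster
        dist = np.linalg.norm(C - M, axis=1)
        F = C[dist.argmax()]                     # a furthest point from M
        radius = dist.max()
        n = C.shape[1]
        density = len(C) / radius ** n if radius > 0 else np.inf
        if density > DT:                         # dense (or a singleton/outlier): complete
            complete.append(idx)
            continue
        proj = C @ (F - M)                       # the x o FM values
        order = np.argsort(proj)
        cuts = np.flatnonzero(np.diff(proj[order]) > GWT)   # gaps wider than GWT
        if cuts.size == 0:                       # no wide gap either: declare complete
            complete.append(idx)
            continue
        for piece in np.split(order, cuts + 1):  # split at every wide gap
            work.append(idx[piece])
    return [list(c) for c in complete]

# Example call (dataset X from this slide):
# X = np.array([[1,1],[3,1],[2,2],[3,3],[6,2],[9,3],[15,1],[14,2],[15,3],
#               [13,4],[10,9],[11,10],[9,11],[11,11],[7,8]], dtype=float)
# print(faust_umf(X))
```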
Slide 2
pTree gap finder using PTreeSet(x∘fM), with f = p1 and gap-width threshold T = 2^3. Illustration of the first round of finding gaps.

Dataset X is as on Slide 1. The projection values x∘fM, for p1 through pf, are:
  11, 27, 23, 34, 53, 80, 118, 114, 125, 114, 110, 121, 109, 125, 83

[The slide lists the bit slices p6..p0 of PTreeSet(x∘fM) and their complements p6'..p0', and steps through the ANDs of those slices that expose the empty intervals below.]

Gaps found:
  width = 2^4 = 16: [100 0000, 100 1111] = [64, 80)
  width = 2^3 = 8:  [010 1000, 010 1111] = [40, 48)
  width = 2^3 = 8:  [011 1000, 011 1111] = [56, 64)
  width = 2^4 = 16: [101 1000, 110 0111] = [88, 104)
  width = 2^3 = 8:  [000 0000, 000 0111] = [0, 8)

ORing the points that fall between consecutive gaps gives the clusters: C1 = {p1,p3,p2,p4} (between [0,8) and [40,48)), C2 = {p5} (between [40,48) and [56,64)), C3 = {p6,pf} (between [64,80) and [88,104)), and C4 = {p7,p8,p9,pa,pb,pc,pd,pe} (above [88,104)).

For FAUST SMM (Oblique), do the same thing on the Mr-to-Mv line: record the number of r errors and v errors that result if each gap's right endpoint (RtEndPt) is used as the split, and take the RtEndPt where that sum is minimal. This parallelizes easily. It may also be useful in pTree sorting.
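The bit-slice ANDs on this slide amount to testing which power-of-two aligned intervals of x∘fM values are empty. Below is a small Python sketch of that test using ordinary integers rather than pTrees; the function name, the explicit prefix sets, and the merge remark are mine, not from the slide.

```python
import numpy as np

def aligned_gaps(values, min_width_bits=3, nbits=7):
    """Report empty, power-of-two aligned intervals among the projection values.
    A width-2^w interval [k*2^w, (k+1)*2^w) is empty exactly when no value has the
    high-order bit prefix k -- the same test the pTree AND of bit slices performs."""
    values = [int(v) for v in values]
    prefixes = {w: {v >> w for v in values} for w in range(min_width_bits, nbits + 1)}
    gaps = []
    for w in range(nbits, min_width_bits - 1, -1):          # widest intervals first
        for k in range(2 ** (nbits - w)):
            if k not in prefixes[w]:
                lo, hi = k << w, (k + 1) << w
                # skip intervals already inside a wider gap reported above
                if not any(glo <= lo and hi <= ghi for glo, ghi in gaps):
                    gaps.append((lo, hi))
    return sorted(gaps)

xofM = [11, 27, 23, 34, 53, 80, 118, 114, 125, 114, 110, 121, 109, 125, 83]
print(aligned_gaps(xofM))
# -> [(0, 8), (40, 48), (56, 64), (64, 80), (88, 96), (96, 104)]
# Adjacent intervals such as (88,96) and (96,104) merge into the slide's [88,104) gap.
```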
Slide 3
FAUST UFF (Unsupervised, using the Furthest-to-Furthest line). Let C = X be the initial incomplete cluster.
1. While an incomplete cluster C remains, pick M ∈ C, compute F (furthest from M) and G (furthest from F).
2. If Density(C) ≡ count(C)/distance^n(F,M) > DT (the DensityThreshold), declare C complete and continue; else split C into {Ci} at each PTreeSet(x∘FG) gap wider than GWT (the GapWidthThreshold).

Worked on dataset X (Slide 1) with DT = 1.5:
  dens(X) = 16/8.4² = 0.23 < DT, so X is incomplete. With F = p1 and G = pe, C1 = the points closer to F than to G, C2 = the rest (M0 = (8.6, 4.7), M1 = (3, 1.8), M2 = (11, 6.2)).
  dens(C1) = 5/3² = 0.55 < DT, so C1 is incomplete; it splits into C11 and C12 = {p5} (complete). dens(C11) = 4/1.4² = 2 > DT, so C11 is complete.
  dens(C2) ≡ count/d(M2,F2)² = 10/6.3² = 0.25 < DT, incomplete; it splits into C21 (the points closer to p7 than to pd) and C22.
  dens(C21) = 5/4.2² = 0.28 < DT (M21 = (13.2, 2.6)), incomplete; splitting again gives dens(C2121) = 3/1² = 3 > DT (complete) and C2122 = {pa} (complete).
  dens(C221) = 4/1.4² = 2.04 > DT, complete; C222 = {pf} is a singleton, hence complete (an outlier). (Further plot annotations: M22 = (9.6, 9.8), M221 = (10.2, 10.2), M212 = (14.2, 2.5).)

Distances used in the first round, for p1 through pf:
  to F (= p1): 0, 2, 1.41, 2.82, 5.09, 8.24, 14, 13.0, 14.1, 12.3, 12.0, 13.4, 12.8, 14.1, 9.21
  to G (= pe): 14.1, 12.8, 12.7, 11.3, 10.2, 8.24, 10.7, 9.48, 8.94, 7.28, 2.23, 1, 2, 0, 5

[The slide also tabulates the distances to M0 and, for each sub-cluster, the point distances to its own M, F and G, and plots X with the means and split lines.]
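Relative to the UMF sketch on Slide 1, only the projection line changes. A minimal sketch of that step (the use of the mean for M and the function name are illustrative assumptions):

```python
import numpy as np

def ftf_projection(C):
    """UFF's projection line (sketch): F is a point of C furthest from the mean M,
    G a point furthest from F; gap finding is then done on the x o FG values."""
    M = C.mean(axis=0)
    F = C[np.linalg.norm(C - M, axis=1).argmax()]   # furthest from M
    G = C[np.linalg.norm(C - F, axis=1).argmax()]   # furthest from F
    return C @ (G - F), F, G                        # projections used for gap splitting
```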
Slide 4
FAUST SMM, the Supervised Medoid-to-Medoid version (AKA FAUST Oblique).

[Sketch: training points of classes r and v scattered about their means m_R and m_V, the d-line through the means, the vom points, and the spread of the projections v∘d along the d-line.]

P_R = P_{X∘d < a}: one pass gives the class-R pTree. Let D ≡ the m_R-to-m_V vector and d = D/|D|.
Separate class R using the midpoint-of-means method: calculate a = (m_R + (m_V − m_R)/2)∘d = ((m_R + m_V)/2)∘d. (It works the same if D is taken in the opposite direction, m_V-to-m_R.)
Training ≡ placing the cut-hyper-plane(s) (CHP), each an (n−1)-dimensional hyperplane cutting the space in two. Classification is one horizontal program (AND/OR) across pTrees, giving a mask pTree for each entire predicted class (all unclassified points at a time).

Accuracy improvement? Consider the dispersion within classes when placing the CHP. E.g.:
1. vectors-of-medians: represent each class by vom_V ≡ (median{v1 | v ∈ V}, median{v2 | v ∈ V}, ...) rather than by the mean m_V.
2. midpt_std and vom_std methods: project each class onto the d-line, calculate each class's std (one horizontal formula per class, using Md's method), then use the ratio of the stds to place the CHP (no longer at the midpoint between m_R and m_V).

Note: training (finding a and d) is a one-time process. If we don't have training pTrees, we can use horizontal data to get a and d (one time), then apply the formula to the test data (as pTrees).

Next, use the "Gap Finder" to find all gaps whose two endpoints come from different classes (rv or vr): record the number of r errors and v errors that result if the GapMidPt is used to split, and select as the split point the GapMidPt where the errors are minimized.
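A minimal sketch of the midpoint-of-means cut in NumPy, again with arrays standing in for pTrees (function names are mine):

```python
import numpy as np

def train_midpoint_of_means(R, V):
    """Midpoint-of-means cut (sketch): returns the unit direction d along the
    m_R-to-m_V line and the cut value a = ((m_R + m_V)/2) o d."""
    mR, mV = R.mean(axis=0), V.mean(axis=0)
    D = mV - mR
    d = D / np.linalg.norm(D)
    a = (mR + mV) / 2 @ d
    return d, a

def predict_is_R(X, d, a):
    """One pass over all unclassified rows: True where X o d < a,
    i.e. the side of the cut hyperplane containing m_R."""
    return X @ d < a
```

Swapping `mean(axis=0)` for `np.median(..., axis=0)` gives the vector-of-medians (vom) variant mentioned in point 1 above.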
Slide 5
FAUST SMM: P_R = P_{X∘d_R < a_R}, with D ≡ the m_R-to-m_B vector. Use the "Gap Finder" to find all gaps whose two endpoints come from different classes: record the number of R errors and B errors that result if each GapEndPoint is used to split, and select as the split point the GEP where the sum of the R errors and the B errors is minimized. That is, on the M_R-to-M_B line, record the sum of the R and B errors for each gap's right endpoint (RtEndPt) used as the split.

Dataset for this slide:
 pt  x1  x2     pt  x1  x2     pt  x1  x2
 p1   3   6     p6   9   3     pb  10   6
 p2   6   1     p7  15   1     pc   9   7
 p3   4   2     p8  14   2     pd   9   8
 p4   3   4     p9  12   3     pe  12   6
 p5   6   2     pa  10   3     pf  12   5

[The slide plots these points with the two class means M marked on the M_R-to-M_B line.]
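A sketch of the error-minimizing cut in NumPy. Note one deliberate simplification, named here rather than hidden: candidate cuts are midpoints between consecutive projected values, not the pTree gap endpoints the slide prescribes.

```python
import numpy as np

def best_cut_on_mean_line(R, B):
    """Pick a split value on the M_R-to-M_B line minimizing (R errors + B errors).
    Sketch only: candidate cuts are midpoints between consecutive distinct projections."""
    d = B.mean(axis=0) - R.mean(axis=0)          # direction from M_R toward M_B
    pr, pb = R @ d, B @ d                        # projections of the two classes
    cand = np.unique(np.concatenate([pr, pb]))
    best_cut, best_err = None, np.inf
    for c in (cand[:-1] + cand[1:]) / 2:         # midpoints between neighbouring values
        err = np.sum(pr >= c) + np.sum(pb < c)   # R points on the B side + B points on the R side
        if err < best_err:
            best_cut, best_err = c, err
    return best_cut, best_err
```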
Slide 6
APPENDIX: FAUST UMF (no density test). Initially C = X.
1. While an incomplete cluster C remains, find M = mean(C) (1 pTree calculation).
2. Create PTreeSet(D(x,M)) (1 pTree calculation) and pick F, a furthest point from M.
3. Create PTreeSet(x∘FM) (1 pTree calculation).
4. Split at each PTreeSet(x∘FM) gap > T (1 pTree calculation); if there are none, continue (declaring C complete).

On dataset X (Slide 1): C2 = {p5} is complete (a singleton, so an outlier). C3 = {p6,pf} will split (details omitted), so {p6} and {pf} are complete (outliers). That leaves C1 = {p1,p2,p3,p4} and C4 = {p7,p8,p9,pa,pb,pc,pd,pe} still incomplete. C1 doesn't split and is complete. Applying the algorithm to C4: {pa} is an outlier, and C2 splits into {p9} and {pb,pc,pd}, which doesn't split further and so is complete. With f1 = p3, C1 doesn't split, so it is complete. (Plot annotations: M0 = (8.3, 4.2), M1 = (6.3, 3.5).)

This algorithm takes only 4 pTree calculations. If we use "any point" rather than M = mean, that eliminates creating the mean (next slide: M = the bottom point rather than the mean).

Example 2 (the interlocking horseshoes with an outlier, Slide 1) is shown worked the same way.
Slide 7
FAUST UMF (no density test, M = bottom point). Initially C = X.
1. While an incomplete cluster C remains, find M (0 pTree calculations).
2. Create S ≡ PTreeSet(D(x,M)) and pick f, a furthest point from M (1 pTree calculation).
3. Split at each PTreeSet(x∘fM) gap > T; if there are none, continue, declaring C complete (1 pTree calculation).

[The slide steps through the rounds on dataset X (Slide 1), marking M and f at each round until "no gaps, so complete".]

This FAUST CLUSTER is minimal, with just 3 pTree calculations:
  Pick a point, e.g. the bottom point (0 pTree calculations).
  1. Find the furthest point, e.g. using ScalarPTreeSet(distance(x,M)) (1 pTree calculation).
  2. Find gaps, e.g. using ScalarPTreeSet(x∘fM) (1 pTree calculation), and split wherever the gap > GT (1 pTree calculation). Continue when there are no gaps, declaring C complete (0 pTree calculations).

However, we may want a density-based stop condition (or a combination). Even if we don't create the mean, we can get a "radius" (for the n-dimensional volume r^n) from the length of fM, so with a density stop condition it is 3 pTree calculations plus a 1-count.

Note: M = bottom point is likely better than M = mean, because that M lies to one side of the mean and closer to an edge, so the fM line is more likely to be close to a diameter than the F-to-mean line is.

Note on stop conditions: "dense" implies "no gaps", but not vice versa.

Example 2 (interlocking horseshoes with an outlier, Slide 1) is shown as well.
Slide 8
FAUST UMF (M = top point, F = bottom point, FM-affinity splitting). Initially C = X.
1. While an incomplete cluster C remains, find the top and bottom points, M and F (0 pTree calculations).
2. Split C into C1 = PTree(x∘FM < FM∘FM/2) and C2 = C − C1; this uses mdpt(F,M), as in Oblique FAUST (1 pTree calculation).
3. If Ci is dense (count(Ci)/dis^n(F,M) > DT), declare Ci complete (1 pTree 1-count).

[The slide steps through the resulting splits frame by frame, listing the membership of each sub-cluster at each round, on dataset X and on Example 2 (both from Slide 1).]

Note: pb is found to be an outlier here but isn't one; otherwise this version works. An absolutely minimal version would use 1 pTree calculation (= PTree(x∘FM < Threshold)) with "large gap" splitting and "no gap" stopping; if density stopping is used, add a 1-count.

My "best UMF version" choice: Top-to-Furthest splitting with the pTree gap finder and density stopping (3 pTree calculations, 1 one-count).

Top research need: a better pTree "gap finder". It is useful in both FAUST UMF clustering and FAUST SMM classification.
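A sketch of one FM-affinity split in NumPy. The slide's labels for top/bottom versus F/M read ambiguously; here, following the title, M is taken as the "top" point and F as the "bottom" point, read off the second coordinate — an illustrative assumption.

```python
import numpy as np

def fm_affinity_split(C):
    """One FM-affinity split (sketch): cut cluster C at the hyperplane through the
    midpoint of F and M, perpendicular to the F-to-M line."""
    M = C[C[:, 1].argmax()]                  # 'top' point (assumed: largest x2)
    F = C[C[:, 1].argmin()]                  # 'bottom' point (assumed: smallest x2)
    FM = M - F
    # x o FM < (midpoint of F and M) o FM, i.e. the slide's x o FM < FM o FM / 2 test
    # written in absolute (not F-relative) coordinates.
    mask = C @ FM < (F + M) / 2 @ FM
    return C[mask], C[~mask]                 # C1 (the F side) and C2 = C - C1
```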
Slide 9
FAUST UMF density parameter settings and their effects (on dataset X, Slide 1). With centroid = mean, h = 1, and DT = 1.5, we get 4 outliers and 3 non-outlier clusters. If DensityThreshold DT = 1.1, then {pa} joins {p7,p8,p9}. If DT = 0.5, then also {pf} joins {pb,pc,pd,pe} and {p5} joins {p1,p2,p3,p4}. We call the overall method FAUST CLUSTER because it resembles FAUST CLASSIFY algorithmically, and k (the number of clusters) is determined dynamically.

Improvements? A better stop condition? Is UMF better than UFF? In affinity splitting, what if k overshoots its optimal value: add a fusion step each round? As Mark points out, having k too large can be problematic.

The proper definition of outlier or anomaly is a huge question. An outlier or anomaly should be a cluster that is both small and remote. How small? How remote? What combination? Should the definition be global or local? We need to research this (and give users options and advice for their use).

Md: create F = a furthest point from M, and d(F,M), while creating PTreeSet(d(x,M))? Or, as a separate procedure: start with P = D_h (h = the high bit position), then recursively P_k ← P & D_{h−k} until P_{k+1} = 0; back up to P_k and take any of those points as f, and that bit pattern is d(f,M). Note that this doesn't necessarily give the furthest point from M, but it gives a point sufficiently far from M (D_h alone gives a decent f, at the furthest HOBbit distance). Or use HOBbit distance directly? Modify it to get the absolutely furthest point by jumping, whenever an AND gives zero, to P_{k+2} and continuing the AND from there.
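A small Python sketch of Md's bit-slice search for a far point, with plain integers standing in for the bit slices of PTreeSet(d(x,M)); the function name and the all-ones starting mask are assumptions of this sketch.

```python
import numpy as np

def far_point_by_high_bits(dists, nbits=8):
    """AND in successive bit columns of the distance values, highest bit first, and stop
    (keeping the previous survivors) as soon as the AND would go to zero. The survivor
    is a 'sufficiently far' point, not necessarily the furthest one."""
    dists = np.asarray(dists, dtype=int)
    P = np.ones(len(dists), dtype=bool)          # start from all points (an assumption)
    for k in range(nbits - 1, -1, -1):           # h = nbits-1 is the high bit position
        nxt = P & (((dists >> k) & 1) == 1)      # P AND the k-th bit slice of the distances
        if not nxt.any():                        # AND went to zero: back up and stop
            break                                # (the slide's modification would instead
        P = nxt                                  #  skip this bit and keep ANDing lower ones)
    return int(np.flatnonzero(P)[0])             # index of one far point f
```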