Make a deal for FAUST_p today!!!


1 Make a deal for FAUST_p today!!!
FAUST (Fast Attribute-based Unsupervised and Supervised Table) and FAUST_p (FAUST using pTrees) clustering. Faust made a deal with the devil (traded his soul) for unlimited knowledge. Make a deal for FAUST_p today!!!

EIN formula: P(A_j > c) = P_{j,m} o_m ... o_{k+1} P_{j,k}, where c = b_m ... b_k ... b_0 in binary, o_i = & iff b_i = 1 (| otherwise), k is the position of the rightmost 0 bit of c, and the operators are right binding.

FAUST_pdq:
Remaining_Classes, RC, is initially all classes; its pTree, PRC (initially pure1), masks the points in the classes of RC.
For each attribute, A, create an Attribute_Table, TA(class, rv(attribute,class), gap), ordered on rv ascending, where rv is a class representative value in that attribute (e.g., we will use the class mean here) and gap = gap to the next higher rv for that attribute.

WHILE RC not empty, DO
  Find the TA record with maximum gap.
  Use PA>c (c = mean + gap/2) to divide RC at cutpoint c into LT and GT, and create pTree masks, PLT and PGT, for them.
  If LT is a singleton {remove that class from RC and from all TA's}
  If GT is a singleton {remove that class from RC and from all TA's}
END_DO

The resulting singleton pTree masks ARE THE CLUSTERS!

We note that FAUST_pdq can use only one attribute_cut_point at a time; otherwise the division will not [necessarily] result in the same two new clusters for the two (or more) different attribute_cut_points.

Next we do a walk-through with a simple example of 30 IRIS records (10 from each class).
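As a rough illustration of the EIN formula, the sketch below builds the mask for "value > c" using only ANDs and ORs of vertical bit slices. It is not the pTree implementation: each bit slice is modeled as a plain Python integer (bit r of the integer belongs to row r), and the names `bit_slices` and `mask_greater_than` are made up for this example.

```python
def bit_slices(values, width):
    """Return vertical bit slices: bit r of slices[i] is bit i of values[r]."""
    slices = [0] * width
    for row, value in enumerate(values):
        for i in range(width):
            if (value >> i) & 1:
                slices[i] |= 1 << row
    return slices


def mask_greater_than(slices, c, n_rows):
    """Mask of rows whose value exceeds c, using only AND/OR of bit slices."""
    width = len(slices)
    if c < 0:
        return (1 << n_rows) - 1   # every non-negative value exceeds c
    if c >= (1 << width) - 1:
        return 0                   # no value of this bit width exceeds c
    # k = position of the rightmost 0 bit of c
    k = 0
    while (c >> k) & 1:
        k += 1
    result = slices[k]
    # Right binding: fold outward from bit k+1 up to the highest bit.
    for i in range(k + 1, width):
        if (c >> i) & 1:
            result = slices[i] & result   # o_i = AND when b_i = 1
        else:
            result = slices[i] | result   # o_i = OR when b_i = 0
    return result


# Hypothetical 7-bit attribute values, one per row.
values = [14, 15, 29, 30, 43, 50, 51, 57, 0, 127]
slices = bit_slices(values, 7)
mask = mask_greater_than(slices, 29, len(values))
rows_above_29 = [r for r in range(len(values)) if (mask >> r) & 1]
```

For c = 29 = 0011101 the rightmost 0 is bit 1, so the fold evaluates P6 | (P5 | (P4 & (P3 & (P2 & P1)))), and the bits below position k never need to be read.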

2 FAUST_pdq epoch1, step1
Remaining_Classes, RC, is initialized to all classes, with pTree PRC (initially pure1) masking the points in the classes of RC. For each attribute, create an Attribute_Table, TA, ordered on mean ascending, with class representative value = class mean and gap = gap to the next mean:

Table  cl  mn  gp
TsLN   se  49  10
       ve  59   7
       vi  66
TsWD   ve  28   1
       vi  29   4
       se  33
TpLN   se  15  28
       ve  43  14
       vi  57
TpWD   se   2  11
       ve  13   7
       vi  20

[Figure: pTree bit columns of the four attributes for the 30 samples (10 se, 10 ve, 10 vi); bit values omitted.]

WHILE RC not empty, DO
  Find the TA record with max gap: TpLN, cl=se, mn=15, gp=28.
  Use PpLN>c (c = mean + gap/2 = 15 + 28/2 = 29) to divide RC at cutpoint c into LT={se} and GT={ve,vi} and create their pTree masks:
    GT={ve,vi} has pTree P{ve,vi} = PpLN>29 & PRC = PpLN>29
    LT={se} has pTree P{se} = PpLN≤29 & PRC = P'pLN>29
  If LT is a singleton {remove its contents from RC and from all TA's}
  If GT is a singleton {remove its contents from RC and from all TA's}
END_DO

3 FAUST_pdq epoch1, step2
Attribute tables (class means and gaps) are unchanged from epoch 1, step 1; se has been removed from RC.

[Figure: pTree bit columns of the four attributes for the 20 remaining samples (10 ve, 10 vi); bit values omitted.]

Non-se TA record with maximum gap: TpLN, cl=ve, mn=43, gp=14.
PpLN>c (c = 43 + 14/2 = 50) divides RC={ve,vi} into {vi} with pTree P{vi} = PpLN>50 & PRC and {ve} with pTree P{ve} = P'pLN>50 & PRC.

Resulting masks after epoch 1:
  P{se}
  P{ve} = P'pLN>50 & PRC
  P{vi} = PpLN>50 & PRC
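The two epoch-1 cuts can be replayed at the class level with a small sketch. This is only an approximation of the procedure: it works on the class means from the tables rather than on pTree masks over the 30 points, it recomputes gaps inside each remaining group of classes, and the function name `faust_pdq_cuts` is invented for the example.

```python
def faust_pdq_cuts(tables):
    """tables: {attribute: {class: representative value}}.

    Returns a list of (attribute, cutpoint, LT classes, GT classes),
    one entry per divisive cut, always cutting at the largest gap.
    """
    classes = set(next(iter(tables.values())))
    groups = [classes]            # groups of classes not yet separated
    cuts = []
    while groups:
        group = groups.pop(0)
        if len(group) < 2:
            continue              # singleton: already a cluster
        best = None               # (gap, attribute, cutpoint)
        for attr, rv in tables.items():
            ordered = sorted((rv[c], c) for c in group)
            for (lo, _), (hi, _) in zip(ordered, ordered[1:]):
                gap = hi - lo
                if best is None or gap > best[0]:
                    best = (gap, attr, lo + gap / 2)
        gap, attr, cut = best
        if gap <= 0:
            continue              # classes coincide in every attribute
        lt = {c for c in group if tables[attr][c] <= cut}
        gt = group - lt
        cuts.append((attr, cut, lt, gt))
        groups += [lt, gt]
    return cuts


# Class means taken from the epoch-1 attribute tables.
iris_means = {
    "sLN": {"se": 49, "ve": 59, "vi": 66},
    "sWD": {"se": 33, "ve": 28, "vi": 29},
    "pLN": {"se": 15, "ve": 43, "vi": 57},
    "pWD": {"se": 2, "ve": 13, "vi": 20},
}
epoch1_cuts = faust_pdq_cuts(iris_means)
```

With these means the sketch selects pLN both times, at cutpoints 29 and 50, matching the walk-through.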

4 FAUST_pdq epoch 2, step 1
Recompute the rv's (using representative value = median):

Table  cl  md  gp
TsLN   se  49  11
       ve  60   6
       vi  66
TsWD   ve  28   2
       vi  30   3
       se  33
TpLN   se  15  30
       ve  45  13
       vi  58
TpWD   se   2  11
       ve  13   7
       vi  20

[Figure: pTree bit columns of the four attributes for the 30 samples (10 se, 10 ve, 10 vi); bit values omitted.]

TA record with max gap: TpLN, cl=se, md=15, gp=30.
Use PpLN>c (c = 15 + 30/2 ≈ 31) to separate RC into
  P{ve,vi} = PpLN>31 & Ppure1
  P{se} = P'pLN>31 & Ppure1

Epoch 1 results:
  P{se}
  P{ve} = P'pLN>53 & PCC
  P{vi} = PpLN>53 & PCC
Epoch 2, step 1 results:
  P{se} = P'pLN>31
  P{ve,vi} = PpLN>31

We stop revising setosa (we consider it converged) since it did not change in epoch 2 (i.e., in all succeeding epochs the initial RC will exclude setosa).

5 FAUST_pdq epoch 2, step 2
Attribute tables (class medians and gaps) are unchanged from epoch 2, step 1.

[Figure: pTree bit columns of the four attributes for the 30 samples (10 se, 10 ve, 10 vi); bit values omitted.]

TA record with max non-se gap: TpLN, cl=ve, md=45, gp=13.
Use PpLN>c (c = 45 + 13/2 ≈ 51) to separate RC={ve,vi} into
  P{vi} = PpLN>51 & PRC
  P{ve} = P'pLN>51 & PRC

Nearly the same clustering of ve and vi! So they could be considered converged also?

6 FAUST_pdq using std's
The 4 Attribute tables with rv = mean, stds, max # of stds in gap, n, and cut point, cp (cp = the value in the gap which allows the max # of stds, n, to fit forward from the mean (using its std) and backward from the next mean (using its std)). n satisfies mn + n*std = mnG - n*stdG, thus n = (mnG - mn)/(std + stdG).

[Tables TsLN, TsWD, TpLN, TpWD with columns cl, rv, std, n, cp; numeric entries missing.]
[Figure: pTree bit columns of the four attributes for the 30 samples (10 se, 10 ve, 10 vi); bit values omitted.]

TA record with max n: TpLN, between se and ve (cp = 19).
Note, since there is also a case with n=4.1 which results in the same partition (into {se} and {ve,vi}), we might use both for improved accuracy - certainly we can do this with sequential!
Remove se from RC (={ve, vi} now) and from the TA's.

7 FAUST_pdq using std's
Use the 4 Attribute tables with rv = mean, stds, max_#_stds_in_gap = n, and cut value, cp (cp = the value in the gap which allows the max # of stds, n, to fit forward from that mean (using its std) and backward from the next mean, meanG (using stdG)). n satisfies mean + n*std = meanG - n*stdG, so n = (meanG - mean)/(std + stdG).

[Tables TsLN, TsWD, TpLN, TpWD with columns cl, rv, std, n, cp; numeric entries missing.]
[Figure: pTree bit columns of the four attributes for the 30 samples (10 se, 10 ve, 10 vi); bit values omitted.]

TA record with max n: TpWD, between ve and vi (cp = 16).
P{vi} = PpWD>16

Note that we get perfect accuracy with one epoch using stds this way!!!
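The relation n = (meanG - mean)/(std + stdG) and the resulting cut point can be checked with a few lines. The numbers in the usage lines are hypothetical (the table values on these slides are not available), and `std_gap` is a name chosen for this sketch.

```python
def std_gap(mean, std, mean_next, std_next):
    """Return (n, cp) for two adjacent classes in one attribute.

    n  = number of standard deviations that fit from each side of the gap,
         solving mean + n*std = mean_next - n*std_next.
    cp = the cut point where the two sides meet.
    """
    spread = std + std_next
    if spread == 0:
        raise ValueError("both standard deviations are zero; n is undefined")
    n = (mean_next - mean) / spread
    cp = mean + n * std
    return n, cp


# Hypothetical class statistics, for illustration only.
n, cp = std_gap(mean=10.0, std=1.0, mean_next=20.0, std_next=3.0)
# Walking back from the next mean reaches the same cut point.
cp_from_above = 20.0 - n * 3.0
```

Because the cut sits n standard deviations from each mean, it lands closer to the tighter class than the midpoint cut mean + gap/2 would.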

8 FAUST_pdq SUMMARY
We conclude that FAUST_pdq will be fast (no loops, one pTree mask per step, may converge within 1 [or just a few] epochs??) and is fairly accurate (completely accurate in this example using the std method!).

FAUST_pdq is improved (accuracy-wise) by using standard_deviation-based gap measurements and choosing the maximum number of stds as the attribute relevancy choice. There may be many other such improvements one can think of, e.g., using an outlier identification method (see Dr. Dongmei Ren's thesis) to determine the set of non-outliers in each attribute and class. Within each attribute, order by means and define gaps to be between the maximum non-outlier value in one class and the minimum non-outlier value in the next (allowing these gap measurements to be negative if the max of one exceeds the minimum of the next). Also, there are many ways of defining representative values (means, medians, rank-points, ...).

In conclusion, FAUST_pdq is intended to be very fast (if raw speed is the need, as it might be for initial processing of the massive and numerous image datasets that the DoD has to categorize and store). It may be fairly accurate as well, depending upon the dataset, but since it uses only one attribute or feature for each division, it is not likely to be of maximal accuracy compared to other methods (such as the FAUST_pms coming up).

Next we look at FAUST_pms (pTree-based, m-attribute cut_points, sequential: 1 class divided off at a time), so we can explore the various choices for m (from 1 to the table width) and alternate distance measures.

9 FAUST_pms
FAUST_p with m_attribute cut_points and sequential class separation (divides off one class at a time). We do it with m=1 first.

Remaining_Classes, RC, is initialized to all classes, and its mask pTree, PRC (masking the points in the RC classes), is initially pure1. For each attribute, create an Attribute_Table, TA(class, rv(attribute,class), gap), ordered on rv ascending, where rv is a class representative value in that attribute (e.g., the class mean, which will be used here) and gap = gap to the next rv.

1. Choose the TAttribute(class, rv, gap) record with maximum gap.
2. Use cL = rv - previous_gap/2 and cG = rv + gap/2 to separate out that class from RC. The pTree masks are
     Pclass = PA>cL & PA≤cG & PRC
     PRC = P'class & PRC (removes class from RC)
   NOTE: If class is first in TAttribute (has no previous_gap), then Pclass = PA≤cG & PRC. If class is last, then Pclass = PA>cL & PRC.
3. Do 1 and 2 x-1 times (assuming there are x classes altogether). (These are the x-1 STEPs of the first EPOCH.)
4. Repeat 1, 2, 3 until the means stop changing (much). (The EPOCHs.)

If m>1, then we define m pairs, (ckL, ckG), k=1...m, as above (after choosing the m TA records with highest gaps) and define
  Pclass = PRC & PA>c1L & PA≤c1G & PA>c2L & PA≤c2G ... & PA>cmL & PA≤cmG
  PRC = P'class & PRC (removes that class from RC).

There are lots of other variations possible here too. E.g., we could choose all TA records with gap > threshold each step and then lower the threshold for the next step...
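To make the interval construction concrete, here is a small sketch that derives (cL, cG) for a class from an ordered attribute table and tests points against the intervals of one or more attributes. Points are plain dictionaries rather than pTrees, the helper names `class_interval` and `class_mask` are invented, the rv values are the ones used in the FAUST_pms walk-through, and the three sample points are made up.

```python
import math


def class_interval(table, cls):
    """table: list of (class, rv) ordered on rv ascending. Returns (cL, cG)."""
    names = [name for name, _ in table]
    i = names.index(cls)
    rv = table[i][1]
    # First class has no previous gap; last class has no following gap.
    c_low = -math.inf if i == 0 else rv - (rv - table[i - 1][1]) / 2
    c_high = math.inf if i == len(table) - 1 else rv + (table[i + 1][1] - rv) / 2
    return c_low, c_high


def class_mask(points, tables, cls, attributes):
    """True for each point lying in (cL, cG] in every chosen attribute."""
    bounds = {a: class_interval(tables[a], cls) for a in attributes}
    return [
        all(bounds[a][0] < p[a] <= bounds[a][1] for a in attributes)
        for p in points
    ]


# rv tables from the FAUST_pms walk-through, ordered on rv ascending.
pms_tables = {
    "sLN": [("se", 51), ("vi", 63), ("ve", 70)],
    "sWD": [("ve", 32), ("vi", 33), ("se", 35)],
    "pLN": [("se", 14), ("ve", 47), ("vi", 60)],
    "pWD": [("se", 2), ("ve", 14), ("vi", 25)],
}
# Hypothetical sample points, for illustration only.
points = [
    {"sLN": 63, "sWD": 33, "pLN": 60, "pWD": 25},
    {"sLN": 70, "sWD": 32, "pLN": 47, "pWD": 14},
    {"sLN": 69, "sWD": 33, "pLN": 55, "pWD": 22},
]
m1_mask = class_mask(points, pms_tables, "vi", ["pLN"])
m4_mask = class_mask(points, pms_tables, "vi", ["sLN", "sWD", "pLN", "pWD"])
```

The third point passes the single-attribute (m=1) test on pLN but falls outside the sLN interval, so it is rejected once all four attributes are ANDed together. The exact midpoints here (for example 66.5 in sLN) are not rounded to integers as they are on the slides.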

10 FAUST_pms epoch1, step1

Table  cl  rv  gp
TsLN   se  51  12
       vi  63   7
       ve  70
TsWD   ve  32   1
       vi  33   2
       se  35
TpLN   se  14  33
       ve  47  13
       vi  60
TpWD   se   2  12
       ve  14  11
       vi  25

[Figure: pTree bit columns of the four attributes; bit values omitted.]

TA record with maximum gap: TpLN, cl=se, rv=14, gp=33.
Use PA≤cG (cG = rv + gap/2 = 14 + 33/2 ≈ 31) in pLN to separate out se from RC. pTree masks:
  Pclass = PA≤cG & PRC
  PRC = P'class & PRC
  P{se} = P'pLN>31
  P{ve,vi} = PpLN>31
And so forth.

Another way to order the dividing of singleton classes off from RC is to use the class that is adjacent to the maximum number of attribute-maximal gaps and use all attributes (or all relevant attributes, with some definition of relevant; first all attributes, on the next slide). This is tantamount to an L-infinity-like Nearest Neighbor set, but the radius differs in each dimension depending upon the gaps.

11 Pvi_sLN = PsLN>011 1001 & PsLN≤100 0011 & Ppure1
Attribute tables (rv and gaps) are unchanged from FAUST_pms epoch 1, step 1.

[Figure: pTree bit columns of the four attributes for the 30 samples (10 se, 10 ve, 10 vi); bit values omitted.]

vi is adjacent to the most maximal gaps, so take it.
In sLN (cL = 63 - 12/2 = 57, cG = 63 + 7/2 ≈ 67): Pvi_sLN = PsLN>011 1001 & PsLN≤100 0011 & Ppure1 (ANDing with pure1 preserves the operand, so it need not be ANDed here)
In sWD (cL = 33 - 1/2 ≈ 32, cG = 33 + 2/2 = 34): Pvi_sWD = PsWD>10 0000 & PsWD≤10 0010
In pLN (cL = 60 - 13/2 ≈ 53): Pvi_pLN = PpLN>11 0101
In pWD (cL = 25 - 11/2 ≈ 20): Pvi_pWD = PpWD>1 0100

12 Pvi_sLN = PsLN>011 1001 & PsLN≤100 0011 & Ppure1
Attribute tables (rp and gaps) are unchanged from FAUST_pms epoch 1, step 1.

In sLN (c1 = 63 - 12/2 = 57, c2 = 63 + 7/2 ≈ 67): Pvi_sLN = PsLN>011 1001 & PsLN≤100 0011 & Ppure1 (ANDing with pure1 preserves the operand, so it need not be ANDed here)

[Figure: bit-slice evaluation of PsLN>011 1001 over the 30 samples (10 se, 10 ve, 10 vi); bit values omitted.]

13 Pvi_sLN = PsLN>011 1001 & PsLN≤100 0011 & Ppure1
Attribute tables (rp and gaps) are unchanged from FAUST_pms epoch 1, step 1.

In sLN (c1 = 63 - 12/2 = 57, c2 = 63 + 7/2 ≈ 67): Pvi_sLN = PsLN>011 1001 & PsLN≤100 0011 & Ppure1 (ANDing with pure1 preserves the operand, so it need not be ANDed here)

[Figure: bit-slice evaluation of PsLN>011 1001 & PsLN≤100 0011 over the 30 samples (10 se, 10 ve, 10 vi); bit values omitted.]

There are lots of misclassified points! Rather than finish the next 3 substeps, we will, after summarizing, try medians instead of means.

14 Pvi_sWD = PsWD>10 0000 & PsWD≤10 0010
Attribute tables (rp and gaps) are unchanged from FAUST_pms epoch 1, step 1.

Do this: sWD (c1 = 33 - 1/2 ≈ 32, c2 = 33 + 2/2 = 34): Pvi_sWD = PsWD>10 0000 & PsWD≤10 0010
Do this: pLN (c1 = 53): Pvi_pLN = PpLN>11 0101
Do this: pWD (c1 = 20): Pvi_pWD = PpWD>1 0100
Then AND all of them together. Will an optimizing compiler be able to use deMorgan's Laws, etc., to combine operations?

So I now consider FAUST_pdq to be the best quick and dirty method: very fast, but it may produce errors. (The error rate and convergence rate still need assessing.) FAUST_pms should be just about as fast, with lots of additional possibilities (as detailed at left - others???).

It may be tempting to think that the number of steps within an epoch is smaller in pdq (divisive). I believe there are the same number of steps:

            divisive   sequential
2 classes   1 step     1 step
3 classes   2 steps    2 steps
4 classes   3 steps    3 steps
...
(With 4 classes, the 1st divisive cut yields either 1,3 or 2,2; the 2nd and 3rd cuts then split the remaining non-singleton parts down to 1,1,1,1.)

Another way to view it: there has to be a cut made between every two consecutive classes eventually (assuming some ordering), and each cut requires a step in either sequential or divisive. Thus the only potential speed advantage of divisive over sequential is that only one pTree is ever needed for each cut in a divisive step, while two are needed in most sequential steps.

Accuracy advantages either way???? The obvious one is that only sequential allows multiple-attribute cutpoints (multi-attribute L-infinity neighborhoods). Note, these are L-infinity-based methods, in terms of distance metrics. We should explore L1 and L2 also.

There is a wide collection of methods defined here. Once we have the collection defined and parameterized (and coded), the task will be to match the method to the data situation and task requirements.

Along the lines of effectively using the capabilities of modern compilers, one can write a program to create, for every class, the pTree programs to perform FAUST_pms where m is the table width (or involves some reduction using a relevancy analysis method). In that collection of programs (involving ANDs, ORs, Comps), would a compiler optimize (both loading and AND/OR/Comp)?

Another idea: calculate a pTree mask for each class using FAUST_pms in parallel. Then consider all samples that are in multiple classes (set to 1 in multiple class mask pTrees) as suspect. Use another method on those.
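The "suspect samples" idea in the last paragraph can be sketched with integer bit masks standing in for the class mask pTrees (bit r set means sample r is in the class). The function name `suspect_mask` and the example masks are hypothetical.

```python
def suspect_mask(class_masks):
    """Return a mask of samples that are set in two or more class masks."""
    seen_once = 0    # samples claimed by at least one class so far
    seen_multi = 0   # samples claimed by at least two classes
    for mask in class_masks.values():
        seen_multi |= seen_once & mask
        seen_once |= mask
    return seen_multi


# Hypothetical masks over 6 samples: sample 3 is claimed by both ve and vi.
masks = {"se": 0b000011, "ve": 0b001100, "vi": 0b011000}
suspects = suspect_mask(masks)
suspect_rows = [r for r in range(6) if (suspects >> r) & 1]
```

Only the samples flagged here would need to be passed to a second, slower method.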

15 Using the Vertical Set Square Distance (VSSD) and Vertical Set Inner Product (VSIP) technology (see the Perera and Abidin theses), we have other interesting options (essentially to try to differentiate between such classes as white_cars and white_roofs, or trees and grass). Since we are given a set, Ak, of 10 training points for class k (k=1...c), we could consider VSSD(x,Ak), then place x with the cluster that minimizes this measurement. Of course this is VPHD and therefore slow. We could do it only for "suspicious" points, x??? We could do this at any epoch (not only the first one) since it doesn't matter too much how big Ak is. E.g., on the second epoch, Ak could be the entire kth cluster. Next we consider ways to get an idea of the shape of the value distribution curve (the modes, etc.) using HPVD.
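One way to see why the size of Ak matters little is that the set square distance expands into three aggregates of Ak that can be computed once: sum over a in Ak of |x - a|^2 = |Ak| * (x.x) - 2 * x.(sum of a) + (sum of a.a). The sketch below checks that identity on plain Python lists; it does not use the pTree-based VSSD machinery from the cited theses, and the function names and sample points are invented.

```python
def ssd_direct(x, points):
    """Sum of squared Euclidean distances from x to every point in the set."""
    return sum(sum((xi - ai) ** 2 for xi, ai in zip(x, a)) for a in points)


def set_aggregates(points):
    """Precompute (count, per-dimension sums, sum of squared entries)."""
    dim = len(points[0])
    count = len(points)
    sums = [sum(a[d] for a in points) for d in range(dim)]
    sum_sq = sum(ai * ai for a in points for ai in a)
    return count, sums, sum_sq


def ssd_from_aggregates(x, aggregates):
    """Same value as ssd_direct, but independent of the set size per query."""
    count, sums, sum_sq = aggregates
    x_dot_x = sum(xi * xi for xi in x)
    x_dot_sums = sum(xi * s for xi, s in zip(x, sums))
    return count * x_dot_x - 2 * x_dot_sums + sum_sq


def nearest_class(x, aggregates_by_class):
    """Class whose training set has the smallest set square distance to x."""
    return min(
        aggregates_by_class,
        key=lambda k: ssd_from_aggregates(x, aggregates_by_class[k]),
    )


# Hypothetical two-dimensional training sets.
training = {"A": [(1, 2), (3, 4)], "B": [(10, 10), (12, 9)]}
aggregates = {k: set_aggregates(v) for k, v in training.items()}
label = nearest_class((2, 3), aggregates)
```

Note that the raw sum grows with the number of points in Ak, so comparing classes this way is only fair when the sets have equal size (as with 10 training points per class) or after dividing by the count.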

16 FAUST_hob to get the modes in an attribute?
The idea is to AND successive bit positions (from the high end downward), counting 1's for each 2^k-part (half the first time, quarter the 2nd, eighth the 3rd, etc.). Any time rc(2^k-part) is significantly larger than its neighbors, there's a mode there? In fact, at each ANDing the set of root-counts produces a value distribution function approximation which is more accurate than the previous one.
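A minimal sketch of this refinement, assuming integer attribute values and modeling each bit slice as a Python integer (bit r belongs to row r): at every level each existing part is split by ANDing with the next lower bit slice and with its complement, and the root counts form a histogram with twice as many bins. The function name `hob_root_counts` and the sample values are made up.

```python
def hob_root_counts(values, width, depth):
    """Root counts of the 2, 4, 8, ... high-order-bit parts of the value range.

    Returns one list of counts per level, ordered by increasing value range.
    Assumes depth <= width.
    """
    n = len(values)
    full = (1 << n) - 1
    # slices[i]: bit r is set iff bit i of values[r] is set
    slices = [
        sum(((v >> i) & 1) << r for r, v in enumerate(values))
        for i in range(width)
    ]
    parts = [full]
    history = []
    for level in range(depth):
        s = slices[width - 1 - level]
        # Split every part into its "bit = 0" half and its "bit = 1" half.
        parts = [m for part in parts for m in (part & (full ^ s), part & s)]
        history.append([bin(m).count("1") for m in parts])
    return history


# Hypothetical 5-bit attribute values.
sample = [2, 2, 3, 13, 13, 14, 20, 20, 21, 25]
counts = hob_root_counts(sample, width=5, depth=2)
```

For these values the first level gives the counts for [0,15] and [16,31], and the second level gives the counts for the four quarters of the range.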

17 APPENDIX: FAUST_hob (separate classes C from D using high order bits distributions)
This method is specifically directed at non-normal distributions? The idea is to separate classes that are difficult to separate by other means (e.g., white roofs and white cars in Aurora). This can be applied divisively (e.g., other methods could result in an accurate "white object" class, and then this method could be used to separate white cars and white roofs within this "white objects" class at the bottom of the division tree) or it can be applied sequentially.

Let PRC be the mask pTree of the remaining records to be clustered. In attribute A of bitwidth b+1, let k = b and let δ = a small "leeway" parameter.

1. P = P & PA,k
2. If ( rc(P) > rc(PRC) - δ ) {
     P = PRC & PA,k
     if ( mean(C)_k=1 && mean(D)_k=0 ) PC = P;
     if ( mean(D)_k=1 && mean(C)_k=0 ) PD = P; }
   If ( rc(P) < δ ) {
     if ( mean(C)_k=0 && mean(D)_k=1 ) PC = P;
     if ( mean(D)_k=0 && mean(C)_k=1 ) PD = P; }
3. reduce δ (by half?); k = k-1;
4. repeat 1., 2., 3. until k<0

18 FAUST_hob  RC={ve,vi}  A=pWD  b=k=4  δ=2  P=PRC

1. P = P & PA,k
2. If ( rc(P) > rc(PRC) - δ ) {
     if ( mean(C)_k=1 && mean(D)_k=0 ) PC = P;
     if ( mean(D)_k=1 && mean(C)_k=0 ) PD = P; }
   If ( rc(P) < δ ) {
     if ( mean(C)_k=0 && mean(D)_k=1 ) PC = P;
     if ( mean(D)_k=0 && mean(C)_k=1 ) PD = P; }
3. δ = δ/2; k = k-1; repeat 1, 2, 3 until k<0

TpWD  cl  mn
      se   2
      ve  13 = 01101
      vi  20 = 10100

[Figure: pTree bit columns of sLN, sWD, pLN, pWD for the 30 samples (10 se, 10 ve, 10 vi); bit values omitted.]

P = PRC: rc = 20, δ = 2, rc - δ = 18
k=4: P = P & PA,4: rc = 11

19 FAUST_hob, k=3

1. P = P & PA,k
2. If ( rc(P) > rc(PRC) - δ ) {
     if ( mean(C)_k=1 && mean(D)_k=0 ) PC = P;
     if ( mean(D)_k=1 && mean(C)_k=0 ) PD = P; }
   If ( rc(P) < δ ) {
     if ( mean(C)_k=0 && mean(D)_k=1 ) PC = P;
     if ( mean(D)_k=0 && mean(C)_k=1 ) PD = P; }
3. δ = δ/2; k = k-1; repeat 1, 2, 3 until k<0

TpWD  cl  mn
      se   2
      ve  13 = 01101
      vi  20 = 10100

[Figure: pTree bit columns of sLN, sWD, pLN, pWD for the 30 samples (10 se, 10 ve, 10 vi); bit values omitted.]

δ = 2, rc - δ = 18
P: rc = 11
k=3: P = P & PA,3: rc = 1
If ( rc(P) < δ=2 ) { if ( mn(vi)_3=0 && mn(ve)_3=1 ) Pvi = P; }

