1
pTrees predicate Tree technologies
provide fast, accurate horizontal processing of compressed, data-mining-ready, vertical data structures. Applications:
PINE (Podium Incremental Neighborhood Evaluator) uses pTrees for Closed k Nearest Neighbor Classification.
FAUST (Fast Accurate Unsupervised, Supervised Treemining) uses pTrees for classification and clustering of spatial data.
MYRRH (ManY-Relationship-Rule Harvester) uses pTrees for association rule mining of multiple relationships. [Figure: document/course/person relationships with Enroll, Buy, Text.]
PGP-D (Pretty Good Protection of Data) protects vertical pTree data, e.g., key=array(offset,pad): 5,54 | 7,539 | 87,3 | 209,126 | 25,896 | 888,23 | ...
ConCur (Concurrency Control) uses pTrees for ROCC and ROLL concurrency control.
DOVE (DOmain VEctors) uses pTrees for database query processing.
2
FAUST using impure pTrees (ipTrees)
FAUST (Fast Accurate Unsupervised, Supervised Treemining) uses pTrees for classification and clustering of spatial data. E.g., to cluster the IRIS dataset of 150 iris flower samples (50 setosa, 50 versicolor, 50 virginica) using 2-level 60% ipTrees, with each upper-level bit representing the predicate truth applied to 10 consecutive iris samples, level-1 is shown below. FAUST clusters perfectly using only this level (the bit vectors are an order of magnitude smaller, so processing is faster!).
All pTrees are defined by Row Set Predicates (T/F on any row set). E.g., on T(A,B,C), the units bit-slice pTree of T.A using the predicate "> 60% 1-bits" is true iff more than 60% of the A-values are odd.
The 150 level_0 raw bits; level_1 = s10gt60_PPW,1 (each of the 15 level_1 bits strides 10 raw bits); level_2 = s150_s10_gt60_PPW,1 (the level_2 bit strides 150 level_0 bits).
[Figure: level-1 value tables s10gt60_PSL,j, s10gt60_PSW,j, s10gt60_PPL,j, s10gt60_PPW,j by class (setosa, versicolor, virginica); level-1 means; mean/gap tables (SL 70.6, SW 37.2, PL 51.2, PW 19.2).]
3
FAUST using impure pTrees (ipTrees) page 2
FAUST (simplest version): for each attribute (column), 1. calculate the mean of each class; 2. sort those means ascending; 3. calculate mean_gaps = differences of consecutive means; 4. choose the largest (relative) mean_gap to cut. (gapL is the gap on the low side of a mean; gapH on the high side.) Choose the best class and attribute for cutting, then remove the class with max gapRELATIVE and repeat (done on the previous slide).
Example cuts: after removing setosa (PW cut at cH = 7.8, perfect on setosa!), cutting SL at cH = (midpoint of the adjacent class means) = 57.8 gives perfect classification of the rest!
[Figure: mean/gap tables (SL 70.6, SW 37.2, PL 51.2, PW 19.2) and CLASS vs. SL / CLASS vs. PW value lists for the remaining versicolor and virginica samples.]
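A minimal Python sketch of these four steps on plain lists (not pTrees); the class/attribute values and the function name faust_cut are made up for illustration:

    # Sketch of the four steps on plain lists (not pTrees); all values and
    # the function name faust_cut are invented for illustration.
    def faust_cut(X):
        """X: {class: {attr: [values]}} -> (rel_gap, attr, low_class, cut)."""
        best = None
        attrs = next(iter(X.values())).keys()
        for attr in attrs:
            # 1-2: per-class means, sorted ascending
            means = sorted((sum(v[attr]) / len(v[attr]), c) for c, v in X.items())
            # 3: gaps between consecutive means
            for (m1, c1), (m2, _) in zip(means, means[1:]):
                rel = (m2 - m1) / ((m1 + m2) / 2)          # relative gap
                if best is None or rel > best[0]:
                    best = (rel, attr, c1, (m1 + m2) / 2)  # 4: cut at midpoint
        return best

    iris = {"setosa":     {"PW": [2, 2, 3, 2, 4]},
            "versicolor": {"PW": [15, 14, 13, 12, 15]},
            "virginica":  {"PW": [25, 19, 21, 18, 23]}}
    print(faust_cut(iris))   # cuts PW between setosa and versicolor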
4
FAUST using impure pTrees (ipTrees)
page 3. In the previous two FAUST slides, three-level 60% ipTrees were used (leaves are level=0, root is level=2), with each level=1 bit representing the predicate truth applied to 10 consecutive iris samples (leaf bits, i.e., level=1 stride=10). Below, instead of taking the entire 150 IRIS samples, 24 are selected from each class as training samples (every other one in the list of 50); the 60% is replaced by 50%, and level=1 stride=10 is replaced first with level=1 stride=12, then with level=1 stride=24.
First, form 3-level gt50% ipTrees with level=1 stride=12 (each of the 2 level=1 bits per class strides 12 of the 24 samples). Second, form 3-level gt50% ipTrees with level=1 stride=24 (i.e., just a root above 3 leaf strides, one per class).
Note: the means (averages) are almost the same in all cases. Conclusion: for uncompressed 50% ipTrees (with root truth values), the root values are close to the mean?
[Figure: level_1 tables s24gt50_PSL,j, s24gt50_PSW,j, s24gt50_PPL,j, s24gt50_PPW,j for se, ve, vi.]
5
(perfect classification of the rest!)
3. Rough pTrees. A pTree is defined by a Tuple Set Predicate (T/F on every set of tuples). E.g., for bit-slices, roughly pure1 might have predicate ">= x% 1-bits", 0<x<100. Pure1 pTrees have predicate "100% 1-bits" and Pure0 pTrees have predicate "0% 1-bits". To be a little more complete: given a table T(A,B,C) and the units bit-slice on T.A (1 iff the A-value is odd), the rough predicate ">= 75% 1-bits" on a set of tuples S is 1 (true) if >= 75% of the A-values in S are odd.
pTree creation is a one-time cost. Storage is effectively infinite (many pTrees is fine); in fact, our security shuffle will benefit from the added pTrees. Research problem: combine multiple pTree levels and roughness. Multi-level pTree upper levels can be info-sparse (mostly 0s?); the rougher the predicate, the more upper-level 1-bits. Metadata of an inode: fanout, segment length it strides, roughness %.
FAUST means-seq on level_1 rough pTrees (60%, 40%). Initially PREMAINING = pure1 (all records yet to be processed). 1. For each attribute, calculate the mean for each class and sort ascending; calculate all mean_gaps = differences between consecutive means (gapL is the gap on the low side of a mean, gapH on the high side). 2. Choose the best class and attribute for cutting and remove the record with max gapRELATIVE.
After removing setosa (PW cut at cH = 7.8, perfect on setosa!), cutting SL at cH = 57.8 gives perfect classification of the rest. Alternatively for the last step (PW): cutting at cH = 16.4 makes one mistake only. Another alternative last step (PL): cutting at cH = 46.5 is perfect!
[Figure: mean/gap tables (SL 70.6, SW 27.8/37.2, PL 51.2, PW 19.2) and CLASS vs. SL, PW, PL value lists for the remaining versicolor and virginica samples.]
6
FAUST Oblique: PR = P(X o d)<a, with d-line and oblique vector D ≡ mRmV.
d = D/|D|. Separate classR from classV using the midpoints-of-means method: view mR, mV as vectors (mR ≡ the vector from the origin to pt_mR) and calculate
a = (mR + (mV-mR)/2) o d = (mR+mV)/2 o d.
(The very same formula works when D = mVmR, i.e., points to the left.) Training ≡ choosing the "cut-hyper-plane" (CHP), which is always an (n-1)-dimensional hyperplane (which cuts the space in two). Classifying is one horizontal program (AND/OR) across pTrees to get a mask pTree for each entire class (bulk classification).
Improve accuracy? E.g., by considering the dispersion within classes when placing the CHP:
1. Use the vector_of_medians, vom, to represent each class rather than mV: vomV ≡ (median{v1 | v in V}, median{v2 | v in V}, ...).
2. Project each class onto the d-line (e.g., the R-class below); then calculate the std of those distances from the origin along the d-line (one horizontal formula per class, using Md's method); then use the std ratio to place the CHP (no longer at the midpoint between mr [vomr] and mv [vomv]).
[Figure: dim1/dim2 scatter of r and v points with mR, mV, vomR, vomV, the d-line, and cut value a.]
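A minimal numpy sketch of this midpoint-of-means cut (the arrays R and V and all values are invented; plain arrays stand in for the pTree masks):

    import numpy as np

    # Midpoint-of-means cut (names mR, mV, D, d, a follow the slide;
    # the point data is invented, and arrays stand in for pTree masks).
    R = np.array([[1.0, 3.0], [1.5, 3.0], [1.2, 2.4]])   # class R training pts
    V = np.array([[4.0, 1.0], [4.5, 1.2], [3.8, 0.8]])   # class V training pts

    mR, mV = R.mean(axis=0), V.mean(axis=0)
    D = mV - mR                        # oblique vector from mR to mV
    d = D / np.linalg.norm(D)          # unit direction
    a = (mR + mV) / 2 @ d              # a = (mR+mV)/2 o d

    X = np.vstack([R, V])
    print(X @ d < a)                   # True on the R side of the CHP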
7
P(mrmv)/|mrmv|oX<a
4. FAUST Oblique: length, std, rkK for selecting the best gap and multiple attributes. Formula: P(X o D)>a, where X is any set of vectors and D is an oblique vector (note: if D=ei, this is just PXi>a). E.g., let D be the vector connecting the class means and d = D/|D|: PX o d>a = P(sum diXi)>a. To separate r from v: D = (mvmr), a = (mv+mr)/2 o d. For classes r and v, the mask is P(mrmv)/|mrmv|oX<a.
NOTE!!! The picture on this page could be misleading; see the next slide for a clearer picture.
FAUST-Oblique: create a table TBL(classi, classj, medoid_vectori, medoid_vectorj). Notes: if we just pick the one class which, when paired with r, gives the max gap, then we can use max_gap or max_std_Int_pt instead of max_gap_midpt; then we need stdj (or variancej) in TBL. Best cutpoint? Mean, vector_of_medians, outermost, outermost_non-outlier? ("Outermost" = furthest from the means, in terms of their projections on the D-line; also best rankK points, best std points, etc.)
ANDing the 2 pTree masks P(mbmr)oX>(mr+mb)/2od and P(mvmr)oX>(mr+mv)/2od masks the vectors that make a shadow on the mr side of the midpoint. "Medoid-to-medoid" is close to optimal provided the classes are convex. In higher dimensions the same holds (if classes cluster "convexly", FAUST{div,oblique_gap} finds them).
[Figure: r/v/b scatter with mr, mv, mb, the D-line, and cut value a; grb/bgr mixed-class example.]
8
4. FAUST Oblique: midpt, std, rkK for selecting the best gap and multiple attributes. Formula: P(X o D)>a, where X is any set of vectors and D ≡ mrmv is the oblique vector (note: if D=ei, PXi>a); let d = D/|D|. To separate r from v using the means_midpoint, calculate a as follows: PX o d>a = P(sum diXi)>a. Viewing mr and mv as vectors (e.g., mr ≡ the vector from the origin to point mr),
a = ( mr + (mv-mr)/2 ) o d = (mr+mv)/2 o d.
[Figure: r/v scatter with mr, mv, the d-line, and cut value a.]
9
3. Rough pTrees. pTrees are defined by Tuple Set Predicates (T/F on every set of tuples). E.g., a predicate for bit-slices, roughly pure1, might be ">= x% 1-bits", 0<x<100. We note that rough pTrees coincide with pure pTrees unless they are multi-level (compressed); the lowest level of a rough pTree is identical to that of the corresponding pure1 pTree (assuming x>0). Pure1 pTrees can be viewed in the same way, as pTrees with predicate ">= 100% 1-bits", and Pure0 pTrees with predicate "<= 0% 1-bits". Given a table T(A,B,C) and the units bit-slice on T.A (1 iff the A-value is odd), the rough predicate ">= 75% 1-bits" on a set of tuples S is 1 (true) if >= 75% of the A-values in S are odd.
pTree creation is a one-time cost. Storage is effectively infinite (many pTrees is fine); our security shuffle benefits from added pTrees. Research problem: combine multiple pTree levels and roughness. Multi-level pTree upper levels are info-sparse (mostly 0s?); the rougher the predicate, the more upper-level 1-bits. Metadata of an inode: fanout, segment length it strides, roughness %.
We can't cluster (or classify image pixels) at level-k if level-k "points" (the tuple sets of level-k segments) substantially span two or more image training classes. For each level-k point that substantially spans clusters 1 and 2 about equally, one would expect the method applied at level-k not to make a clear choice; if it did, something would be wrong, because the information is just not there.
Here's the point (regarding image classification): the IRIS results suggest that if 150 tuples were given for classification into 3 classes (50 training samples each for setosa, versicolor, virginica), then, knowing the classes in the training set, we can adjust our level_strides so that the upper-level pTrees see the same training classes (and just as clearly - that's what's startling and great!) as the full training set does. We have done that (witness: setosa training samples are rows 1-50, versicolor rows 51-100, virginica rows 101-150, and all strides fit those boundaries).
[Figure: level_1 tables s10gt60_PSL,j, s10gt60_PSW,j, s10gt60_PPL,j, s10gt60_PPW,j by class; level-1 means; mean/gap tables (SL 70.6, SW 37.2, PL 51.2, PW 19.2).]
10
(perfect classification of the rest!)
3. (cont.) FAUST means-seq on level_1 rough pTrees (60%, 40%). Initially PREMAINING = pure1 (all records yet to be processed).
1. For each attribute, calculate the mean for each class and sort ascending; calculate all mean_gaps = differences between consecutive means (gapL is the gap on the low side of a mean, gapH on the high side).
2. Choose the best class and attribute for cutting and remove the record with max gapRELATIVE.
After removing setosa (PW cut at cH = 7.8, perfect on setosa!), cutting SL at cH = 57.8 gives perfect classification of the rest. Alternatively for the last step (PW): cutting at cH = 16.4 makes one mistake only. Another alternative last step (PL): cutting at cH = 46.5 is perfect!
[Figure: mean/gap tables (SL 70.6, SW 27.8, PL 51.2, PW 19.2) and CLASS vs. SL, PW, PL value lists for the remaining versicolor and virginica samples.]
11
P(mrmv)/|mrmv|oX<a
4. FAUST Oblique: using length, std, or rankK to determine the best gap and/or using multiple attributes. We have a pTree ALGEBRA (pTree operators AND, OR, COMP, XOR, ... and their algebraic properties). We have a pTree CALCULUS (functions that produce the pTree mask for just about any pTree-defining predicate).
Multi-attribute "FAUST-Oblique" mask pTree formula: P(X o D)>a, where X is any set of vectors and D is an oblique vector (if D = ei = (0,...,1,...,0) then this is just the existing EIN formula for the ith dimension, PXi>a).
FAUST-Oblique based heuristic: instead of finding the best D, take as D the vector connecting a given class mean to another class mean (and d = D/|D|): Pd o X>a = P(sum diXi)>a, where a can be calculated as (mr is a medoid for class r, i.e., the mean or vector_of_medians):
1. a = (d o mr + d o mv)/2;
2. letting ar = max{d o r}, av = min{d o v} (when d o mr < d o mv, else reverse max and min), take a = av;
3. using variance gap fits (or rankK gap fits), as detailed in the appendix slides.
Apply to other classes in a particular order (by quality of gap)?
FAUST-Oblique, for isolating a class: 1. create a table TBL(classi, classj, medoid_vectori, medoid_vectorj); 2. apply the pTree mask formula at left. Note: if we take the fastest route and just pick the one class which, when paired with r, gives the max gap, then we can use max_gap or maximum_std_intersection_point instead of max_gap_midpoint; then we need stdj (or variancej) in TBL.
[Figure: r/v scatter with mr, mv and mask P(mrmv)/|mrmv|oX<a for classes r and v, D = mrmv.]
12
SPD1oX = sum SPD1,iXi ; SPF(X)-min
FAUST Oblique, F(x) = D1 o x: a Scalar pTreeSet (a column of reals), SPF(X), with each pTree calculated as mod( int( SPF(X) / 2^exp ), 2 ). SPD1oX = sum SPD1,iXi; shift by the minimum to get SPF(X)-min.
Example points: a=(1,3), b=(1.5,3), c=(1.2,2.4), d=(.6,2.4), e=(2.2,2.1), f=(2.3,3), g=(2,2.4), h=(2.5,2.4), with D1 = (1, -½):
F(a) = 1*1 - ½*3.0 = -.5; F(b) = 1*1.5 - ½*3.0 = 0; F(c) = 1*1.2 - ½*2.4 = 0; F(d) = 1*.6 - ½*2.4 = -.6; F(e) = 1*2.2 - ½*2.1 = 1.15; F(f) = 1*2.3 - ½*3.0 = .8; F(g) = 1*2 - ½*2.4 = .8; F(h) = 1*2.5 - ½*2.4 = 1.3.
Shifted by the minimum (-.6), SPD1oX-mn gives 0.1, 0.6, 0.6, 0, 1.75, 1.4, 1.4, 1.9. Hull intervals: SPe1oX h1 [.6,1.5], h2 [2,2.5]; SPe2oX h1 [2.4,3], h2 [2.1,3]; SPD1oX-mn h1 [0,.6], h2 [1.4,1.9].
Idea: incrementally build clusters one at a time using all F values. E.g., start with one point, x. Recall F is distance dominated, which means actual distance >= F difference. If the hull is close to the convex hull, max F-diff approximates distance? Then the 1st gap in max F-diffs isolates the x-cluster?
[Figure: bit-slice pTrees pD1,0..pD1,-3, pe1,0..pe1,-3, pe2,0..pe2,-3 and the F ordering F(d) < F(a) < F(b)=F(c) < F(f)=F(g) < F(e) < F(h).]
13
SPD1oX = sum SPD1,iXi ; SPF(X)-min ; max F-diff tables
FAUST Oblique, F(x) = D1 o x: Scalar pTreeSet (column of reals) SPF(X), pTrees calculated as mod( int( SPF(X) / 2^exp ), 2 ). SPD1oX = sum SPD1,iXi; SPF(X)-min.
Max F-diff (mxFdf) tables, one per starting point:
mxFdf(a): .5 .6 1.65 1.3 1.8 -> {a,b,c,d} is the a-cluster, gap = .7.
mxFdf(b): .5 .6 .9 1.15 .8 1.3 -> all in the b-cluster.
mxFdf(c): .6 1.15 1.1 .8 1.3 -> all in the c-cluster.
mxFdf(d): .6 .9 1.75 1.7 1.4 1.9 -> {a,b,c,d} is the d-cluster, gap = .5.
mxFdf(e): 1.65 1.15 1.75 .9 .35 .3 -> {e,g,h} is the e-cluster, gap = .55.
mxFdf(f): 1.3 .8 1.1 1.7 .6 -> all in the f-cluster.
mxFdf(g): 1.4 .5 -> {b,c,e,f,g,h} is the g-cluster, gap = .5.
mxFdf(h): 1.8 1.9 -> {e,f,g,h} is the h-cluster, gap = .7.
Incrementally build clusters one at a time with F values. E.g., start with one point, x. Recall F is distance dominated, which means actual separation >= F separation. If the hull is well developed (close to the convex hull), max F-diff approximates distance? Then the 1st gap in max F-dis isolates the x-cluster?
[Figure: F computations as on the previous slide; hull intervals SPe1oX h1 [.6,1.5] h2 [2,2.5], SPe2oX h1 [2.4,3] h2 [2.1,3], SPD1oX-mn h1 [0,.6] h2 [1.4,1.9]; scatter of points a..h with D1 = (1, -½).]
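A small Python sketch of this max-F-diff clustering on the slide's eight points (the fixed gap threshold min_gap=0.5 is an assumption; the slide reads gaps off the sorted tables rather than using a threshold):

    import numpy as np

    # Max-F-diff clustering on the slide's eight points with D1 = (1, -1/2).
    pts = {"a": (1.0, 3.0), "b": (1.5, 3.0), "c": (1.2, 2.4), "d": (0.6, 2.4),
           "e": (2.2, 2.1), "f": (2.3, 3.0), "g": (2.0, 2.4), "h": (2.5, 2.4)}
    D1 = np.array([1.0, -0.5])
    F = {k: np.array(v) @ D1 for k, v in pts.items()}   # F(x) = D1 o x

    # x-cluster: sort |F(k)-F(x)| and cut at the first gap >= min_gap
    # (F is distance dominated: actual distance >= F difference).
    def x_cluster(x, min_gap=0.5):
        diffs = sorted((abs(F[k] - F[x]), k) for k in F)
        cluster = [diffs[0][1]]                 # x itself, at diff 0
        for (d0, _), (d1, k1) in zip(diffs, diffs[1:]):
            if d1 - d0 >= min_gap:
                break
            cluster.append(k1)
        return cluster

    print(sorted(x_cluster("a")))   # slide: {a,b,c,d} is the a-cluster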
14
P(mvmr)oX>|mr+mv|/2
4. (cont.) Multi-attribute Oblique (FAUST-O) heuristic: instead of finding the best D, take the vector connecting one class mean to another as D.
To separate r from v: D = (mvmr) and a = |mv+mr|/2. To separate r from b: D = (mbmr) and a = |mb+mr|/2.
Best cutpoint? Mean, vector_of_medians, outermost, outermost_non-outlier? By "outermost" I mean the points furthest from the means (in terms of their projections on the D-line); by "outermost non-outlier" I mean the furthest non-outlier points. Other possibilities: the best rankK points, the best std points, etc.
ANDing the two pTree masks, P(mbmr)oX>|mr+mb|/2 and P(mvmr)oX>|mr+mv|/2, masks the region (which is r): each masks the vectors that make a shadow on the mr side of the midpoint.
In higher dimensions the same holds (if classes cluster "convexly", FAUST{div,oblique_gap} can find them; consider greenish-reddish-blue and bluish-greenish-red). "Medoid-to-medoid" is close to optimal provided the classes are convex. Final note: I should say "linearly separable" instead of "convex" (a slightly weaker condition).
[Figure: r/v/b scatter with mr, mv, mb and oblique vector D; grb/bgr mixed-color example.]
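A small numpy sketch of ANDing the two half-space masks to isolate r (boolean arrays in place of pTrees; all point data is invented):

    import numpy as np

    # AND of the two half-space masks isolates r (boolean arrays in place of
    # pTrees; all point data invented).
    r = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]])
    v = np.array([[3.0, 1.0], [3.2, 1.1], [2.9, 0.8]])
    b = np.array([[1.0, 3.0], [1.1, 3.2], [0.8, 2.9]])
    mr, mv, mb = r.mean(0), v.mean(0), b.mean(0)

    def r_side(X, m_other):
        """True where X projects on the mr side of the [mr, m_other] midpoint."""
        D = m_other - mr
        return X @ D < (mr + m_other) / 2 @ D

    X = np.vstack([r, v, b])
    print(r_side(X, mv) & r_side(X, mb))   # True exactly on the three r rows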
15
Consider Node 2.2, which is a s5_s5_gt60 node as described above.
IRIS (3,2,5,5)-leveled 60% rough pTrees: s3_s2_s5_s5_gt60_PPW,1 (each level_4 bit strides 3 bits at level_3); s2_s5_s5_gt60_PPW,1 (each level_3 bit strides 2 bits at level_2); s5_s5_gt60_PPW,1 (each level_2 bit strides 5 bits at level_1); s5_gt60_PPW,1 (each level_1 bit strides 5 bits at level_0); PPW,1 at level_0.
Consider node 2.2, which is a s5_s5_gt60 node as described above. Note that there are only 11 1-bits out of 30 at the leaf (level-0) of its subtree, which is well short of the 15 required for gt60%; thus the node truth value is at least misleading (it does correctly indicate that gt60% of the next-level bits are 1-bits, but it incorrectly suggests that gt60% of the raw level-0 bits are 1-bits).
One way around this problem is to use pure1 above level-1. That way, a 1-bit would indicate that all 5 level-1 bits are 1's and thus that every level-0 5-bit string has a majority of 1-bits (at least 3). Thus the level-0 stride of 2.2 would have at least 15 1-bits, and the "true" at 2.2 would correctly indicate a majority of 1-bits strided by it at level-0 (as well as at level-1). However, what do we do if (as is the case above) 2.2 strides a majority of level-1 1-bits but a minority of level-0 1-bits? Either a 0 or a 1 bit at 2.2 is misleading. I suggest: residualize all rough pTree bit-vectors (as done for the gt60 predicate above) and then, for each inode, residualize the level count arrays (for level-1 and up).
16
Rough pTrees ARE pTrees in which the predicate gives a definition of roughly or nearly pure. Recall that all pTrees are defined by a Tuple Set Predicate (TSP) which evaluates to True or False on every set of tuples (rows) of the horizontal table being represented vertically by those pTrees. E.g., for a bit-slice (which is a 1-column table), roughly pure1 might be defined by the TSP "at least 75% of the bits are 1-bits". Rough pTrees can be raw (uncompressed) or multi-level (with any number of levels from 1 on up, as can any pTree), since they are bona fide pTrees, albeit with a different predicate - a "roughly pure" predicate. These rough pTrees (in which we tune the choices of "roughness" to the data characteristics or statistics?) would be created and residualized along with the pure pTrees (which are also rough pTrees at the extremes: 100% and 0% 1-bits). Creation is a one-time cost. The extra storage space is a non-issue in this age of nearly infinite storage. And the addition of many, many more pTrees redundantly representing a data table means, among other things, that we can apply our "security shuffle" much more effectively (needing fewer, if any, bogus pTrees?).
We can use multiple levels of roughness together in the same algorithm (e.g., FAUST). A research problem: effectively combine multiple pTree levels and roughness. When we create multi-level pTrees, we often see the upper levels become info-sparse or even info-free (all zeros). The consequences include the fact that those levels may be of no data mining value; sometimes only the leaf level is of value. Using a rougher pTree predicate instead populates any pTree with more upper-level 1-bits. For a given table or data area, how many levels and which definition(s) of roughness provide the most data mining advantage? The metadata of an inode would in general include its fanout, the segment length it strides, and its roughness percentage. Another option is to require that any pTree be built using a constant global roughness. We could also switch to pTrees of a different roughness (for the same bit slice) in our data mining as our algorithm reaches a given inode level.
It would be impossible to accurately cluster (or classify image pixels) at, say, level-k of a pTree if level-k "points" (the tuple sets making up level-k segments) substantially spanned two or more of the clusters (or image training classes). For each level-k point that substantially spans clusters 1 and 2 about equally, one would expect that the method applied at level-k would not make a clear choice; if it did, there would be something wrong, because the information is just not there. The intent of the previous slides is to demonstrate that we can get the same accuracy from level-3 as from level-0 (at least sometimes) when the above holds. Another way to look at it: we cannot mine information out of a set of upper-level pTrees if there is no information at that level (keeping in mind that the information may be there at lower levels).
So here's the point (regarding image classification): the IRIS results suggest that if the 150 tuples were given to us as training for classifying other unclassified IRIS tuples into one of the three classes (50 training samples for each of setosa, versicolor, virginica), then what we have shown (only suggested in general, but proved in this particular case) is that, knowing the classes in the training set, we can adjust our level_strides so that the upper-level pTrees see the same training classes (and just as clearly - that's what's startling and great!) as the full training set does. We have done that (witness: setosa training samples are rows 1-50, versicolor rows 51-100, virginica rows 101-150, and all strides fit those boundaries).
17
FAUST: means-sequential
Initially, let PREMAINING be pure1 (all records still remain to be processed).
1. For each attribute, calculate the mean for each class and sort ascending on mean. Calculate all mean_gaps = differences between consecutive means. Create MT(attr, class, mean, gapL, gapH, gapREL), sorted on gapREL = (gapL+gapH)/mean (gapL is on the low side of the mean, gapH on the high side).
2. Choose and remove the MT record with max gapRELATIVE. Use cL = mean - gapL/2 and cH = mean + gapH/2 for PL = PA>cL and PH = P'A>cH. Class mask: PCLASS = PL & PH & PREM; update PREM = PREM & P'CLASS.
3. Repeat 2 until all classes have a pTree mask.
4. Repeat 1, 2, 3 until ?
There are 150 level_0 bits; level_1 = s10gt60_PPW,1 (each of the 15 level_1 bits strides 10 level_0 bits); level_2 (root) = s15_s10_gt60_PPW,1 (the level_2 bit strides 15 level_1 bits).
[Figure: level_1 tables s10gt60_PSL,j, s10gt60_PSW,j, s10gt60_PPL,j, s10gt60_PPW,j by class; level-1 means; mean/gap tables (SL 70.6, SW 37.2, PL 51.2, PW 19.2).]
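A sketch of the MT bookkeeping in plain Python (the pTree masking of steps 2-3 is omitted; treating the missing gapL/gapH of the endpoint classes as 0 is a modeling choice, and all values are made up):

    # MT(attr, class, mean, gapL, gapH, gapREL), sorted desc on gapREL.
    # data: {attr: {class: [values]}} -- all values invented.
    def build_MT(data):
        MT = []
        for attr, per_cls in data.items():
            means = sorted((sum(v) / len(v), c) for c, v in per_cls.items())
            for i, (mn, c) in enumerate(means):
                gapL = mn - means[i - 1][0] if i > 0 else 0.0    # low-side gap
                gapH = means[i + 1][0] - mn if i + 1 < len(means) else 0.0
                MT.append((attr, c, mn, gapL, gapH, (gapL + gapH) / mn))
        return sorted(MT, key=lambda rec: -rec[5])

    data = {"PW": {"se": [2, 2, 4], "ve": [15, 13, 12], "vi": [22, 17, 19]}}
    top = build_MT(data)[0]
    print(top)   # cut this class at cL = mean - gapL/2, cH = mean + gapH/2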
18
(perfect classification of the rest!)
FAUST means-seq on level_1 rough pTrees (60%, 40%). Initially PREMAINING = pure1 (all records yet to be processed).
1. For each attribute, calculate the mean for each class and sort ascending; calculate all mean_gaps = differences between consecutive means (gapL is the gap on the low side of a mean, gapH on the high side).
2. Choose the best class and attribute for cutting and remove the record with max gapRELATIVE.
After removing setosa (PW cut at cH = 7.8, perfect on setosa!), cutting SL at cH = 57.8 gives perfect classification of the rest. Alternatively for the last step (PW): cutting at cH = 16.4 makes one mistake only. Another alternative last step (PL): cutting at cH = 46.5 is perfect!
[Figure: mean/gap tables (SL 70.6, SW 27.8, PL 51.2, PW 19.2) and CLASS vs. SL, PW, PL value lists for the remaining versicolor and virginica samples.]
19
Consider Node 2.2, which is a s5_s5_gt60 node as described above.
IRIS (3,2,5,5)-leveled 60% rough pTrees: s3_s2_s5_s5_gt60_PPW,1 (each level_4 bit strides 3 bits at level_3); s2_s5_s5_gt60_PPW,1 (each level_3 bit strides 2 bits at level_2); s5_s5_gt60_PPW,1 (each level_2 bit strides 5 bits at level_1); s5_gt60_PPW,1 (each level_1 bit strides 5 bits at level_0); PPW,1 at level_0.
Consider node 2.2, which is a s5_s5_gt60 node as described above. Note that there are only 11 1-bits out of 30 at the leaf (level-0) of its subtree, which is well short of the 15 required for gt60%; thus the node truth value is at least misleading (it does correctly indicate that gt60% of the next-level bits are 1-bits, but it incorrectly suggests that gt60% of the raw level-0 bits are 1-bits).
One way around this problem is to use pure1 above level-1. That way, a 1-bit would indicate that all 5 level-1 bits are 1's and thus that every level-0 5-bit string has a majority of 1-bits (at least 3). Thus the level-0 stride of 2.2 would have at least 15 1-bits, and the "true" at 2.2 would correctly indicate a majority of 1-bits strided by it at level-0 (as well as at level-1). However, what do we do if (as is the case above) 2.2 strides a majority of level-1 1-bits but a minority of level-0 1-bits? Either a 0 or a 1 bit at 2.2 is misleading. I suggest: residualize all rough pTree bit-vectors (as done for the gt60 predicate above) and then, for each inode, residualize the level count arrays (for level-1 and up).
20
From the discussion on the previous slide, it seems practical to have the same fanout throughout the tree? Otherwise it is very difficult to even identify inodes (e.g., what does 2.2 mean?). global_fanout = 4 for images, 8 for solids, 64 for sparse numeric non-spatial data columns? And what for very sparse numeric data columns and for high-cardinality bitmapped categorical columns? On the other hand, maybe a database_global_fanout, so that the processing code is simpler?
[Figure: example trees with global_fanout=5 and global_fanout=4, and their count arrays as the table grows.]
21
Notational note: we develop IRIS 6_5_5 pTrees: level_3 (root) segment_stride=6; level_2 segment_stride=5; level_1 segment_stride=5; for the roughly_pure1 predicate ">60% 1-bits". PPW,1 as a 3-level 60% rough pTree with segment strides 6, 5, 5: root = s6_s5_s5_gt60_PPW,1; s5_s5_gt60_PPW,1; s5gt60_PPW,1.
[Figure: per-class sample listings (se, ve, vi) of Sepal Length, Sepal Width, Petal Length, Petal Width.]
22
level_2 (root) = s15_s10_gt60_PPW,1
There are 150 level_0 bits; level_1 = s10gt60_PPW,1 (each of the 15 level_1 bits strides 10 level_0 bits); level_2 (root) = s15_s10_gt60_PPW,1 (the level_2 bit strides 15 level_1 bits).
[Figure: level_1 tables s10gt60_PSL,j, s10gt60_PSW,j, s10gt60_PPL,j, s10gt60_PPW,j by class; level-1 means; mean/gap tables (SL 70.6, SW 37.2, PL 51.2, PW 19.2).]
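A minimal sketch of building these levels from a raw bit slice (Python lists instead of compressed pTrees; the 70% 1-bit density of the made-up level_0 slice is arbitrary):

    import random

    # Build one gt60% level from a bit vector with a given stride, then stack:
    # 150 raw bits -> 15 level_1 bits (stride 10) -> 1 level_2 root bit (stride 15).
    def gt_level(bits, stride, threshold=0.60):
        return [int(sum(bits[i:i + stride]) > threshold * stride)
                for i in range(0, len(bits), stride)]

    random.seed(0)
    level_0 = [random.random() < 0.7 for _ in range(150)]  # made-up bit slice
    level_1 = gt_level(level_0, 10)   # s10gt60: one bit per 10 raw bits
    level_2 = gt_level(level_1, 15)   # s15_s10_gt60: root over the 15 bits
    print(level_1, level_2)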
23
(perfect classification of the rest!)
FAUST means-seq on level_1 rough pTrees (60%, 40%). Initially PREMAINING = pure1 (all records yet to be processed).
1. For each attribute, calculate the mean for each class and sort ascending; calculate all mean_gaps = differences between consecutive means (gapL is the gap on the low side of a mean, gapH on the high side).
2. Choose the best class and attribute for cutting and remove the record with max gapRELATIVE.
After removing setosa (PW cut at cH = 7.8, perfect on setosa!), cutting SL at cH = 57.8 gives perfect classification of the rest. Alternatively for the last step (PW): cutting at cH = 16.4 makes one mistake only. Another alternative last step (PL): cutting at cH = 46.5 is perfect!
[Figure: mean/gap tables (SL 70.6, SW 27.8, PL 51.2, PW 19.2) and CLASS vs. SL, PW, PL value lists for the remaining versicolor and virginica samples.]
24
IRIS 3 level rough pTrees
IRIS 3-level rough pTrees: level_1 tables s5gt60_PSL,j, PSW,j, PPL,j, PPW,j (to the right of the pTrees are the corresponding numbers in decimal). Can we do effective clustering (classification) at level_1? Yes - not surprisingly, since we have already demonstrated that ability for IRIS s5gt60 (level_1 of the previous 2-level IRIS pTrees). But more importantly, can we go to level_2, s5_s5_gt60, and still get reasonable clustering? Interestingly, we can (a 25-fold data reduction).
Note in PW the first two values (2s, from setosa) separate from the next four (cut_point = 7). Within the final four values, the first two (15, 12, from versicolor) separate from the final two (18, 21, from virginica) with cut_point 17. The same PW cut_points, 7 and 17, separate the classes perfectly at level_1. Note: except for SW, the other two attributes cluster the 3 classes perfectly (SL_cutpts = 54, 63; PL_cutpts = 30, 48; and in fact for {se} vs {ve,vi}, SW_cutpt = 32 works!). Can you explain the fact that level_2 clusters as well as or better than level_1?
[Figure: s6_s5_s5_gt (level_3), s5_s5_gt (level_2), s5gt (level_1) tables s5_s5_gt60_PSL,j, s5_s5_gt60_PSW,j, s5_s5_gt60_PPL,j, s5_s5_gt60_PPW,j.]
25
IRIS 4 level rough pTrees
IRIS 4-level rough pTrees: s5gt60 (level_1), s5_s5_gt (level_2), s2_s5_s5_gt (level_3), s3_s2_s5_s5_gt (level_4). PW cut_points 7 and 16 separate the classes perfectly at levels 1, 2, and 3. PL cut_points 27 and 48 separate the classes perfectly at levels 1, 2, and 3 also. Note that level_4 can't separate, since all classes are entirely spanned by the one node at that level; however, the values are close to the global means, and the level_3 values are very good estimates of the means.
Structure for PPW,1: 1 level_4 bit = s3_s2_s5_s5_gt60_PPW,1 (each level_4 bit strides 3 bits at level_3); 3 level_3 bits = s2_s5_s5_gt60_PPW,1 (each level_3 bit strides 2 bits at level_2); 6 level_2 bits = s5_s5_gt60_PPW,1 (each level_2 bit strides 5 bits at level_1); 30 level_1 bits = s5_gt60_PPW,1 (each level_1 bit strides 5 bits at level_0); 150 level_0 bits = PPW,1.
[Figure: level_1 through level_4 tables s5gt60_P..,j, s5_s5_gt60_P..,j, s2_s5_s5_gt60_P..,j for SL, SW, PL, PW.]
26
P(mrmv)/|mrmv|oX<a
FAUST (2011_06_11): using length, std, or rankK to determine the best gap and/or using multiple attributes to improve accuracy. We have a pTree ALGEBRA (pTree operators AND, OR, COMP, XOR, ... and their algebraic properties). We have a pTree CALCULUS (functions that produce the pTree mask for just about any pTree-defining predicate).
Multi-attribute "FAUST-Oblique" mask pTree formula: P(X o D)>a, where X is any set of vectors and D is an oblique vector (if D = ei = (0,...,1,...,0) then this is just the existing EIN formula for the ith dimension, PXi>a).
FAUST-Oblique based heuristic: instead of finding the best D, take as D the vector connecting a given class mean to another class mean (and d = D/|D|): Pd o X>a = P(sum diXi)>a, where a can be calculated as (mr is a medoid for class r, i.e., the mean or vector_of_medians):
1. a = (d o mr + d o mv)/2;
2. letting ar = max{d o r}, av = min{d o v} (when d o mr < d o mv, else reverse max and min), take a = av;
3. using variance gap fits (or rankK gap fits), as detailed in the appendix slides.
Apply to other classes in a particular order (by quality of gap)?
FAUST-Oblique, for isolating a class: 1. create a table TBL(classi, classj, medoid_vectori, medoid_vectorj); 2. apply the pTree mask formula at left. Note: if we take the fastest route and just pick the one class which, when paired with r, gives the max gap, then we can use max_gap or maximum_std_intersection_point instead of max_gap_midpoint; then we need stdj (or variancej) in TBL.
[Figure: r/v scatter with mr, mv and mask P(mrmv)/|mrmv|oX<a for classes r and v, D = mrmv.]
27
Some topics to consider
With all the "break-ins" occurring (e.g., Citibank, etc.), how can data be protected? Can vertical data be protected more easily than horizontal data? Can pTree representation be useful in protecting data? Some modification of standard pTrees? Some preliminary ideas: 1. with pTrees, you need to know the ordering to have any information; 2. you also need to know which pTree stands where (in which column and which bit slice) to have any info; 3. if all pTrees are made the same length (using the max file length over the database), then we can shuffle/scramble/alter the ordering of columns/slices, and even the row ordering, to conceal information.
With pTree representations there are no horizontal data records (as opposed to indexes, which are vertical structures that accompany the horizontal data files): pTrees ARE the data as well as the indexes. My thoughts include: pTrees are compressed, data-mining-ready vertical data structures which need not be uncompressed to be used. Therefore we want to devise a mechanism, based on the above notions (or others?), in which the "scrambled" pTree data can be processed without unscrambling it. So I'm thinking, for data mining purposes, the scrambled pTrees would be unrevealing of the raw data, but anyone qualified could issue a data mining request (a classification/ARM/clustering request) and get the answer even though the actual data would never be exposed. I suppose that's not much different, really, from encrypting the data, but encrypting massive data stores is never a good option, and decryption is usually necessary to mine info from the store.
28
Using a quadratic hyper-surface (instead of a hyper-plane)?
APPENDIX: Using a quadratic hyper-surface? (instead of a hyper-plane). Suppose there are just 2 attributes (red and blue) and we (r,b)-scatter-plot the 10 reddish-blue class training points and the 10 bluish-red class training points.
Take the r and the b points that project closest to the D-line as the "best" support pair; similarly for the "next best" or "second best" support pair, and similarly for the "third best" pair. Form the quadratic support curve from the three r-support points for class r, and the quadratic support curve from the three b-support points for class b (or move each point in each pair 1/3 of the way toward the other and then do the above), or ????
[Figure: r/b scatter with the D-line, the gap, the D-line means for the r and b classes, and the cut-hyperplane, CHP.]
29
Fitting a parabolic hyper-surface
Fitting a parabolic hyper-surface. Suppose there are just 2 attributes (red and blue) and we (r,b)-scatter-plot the 10 reddish-blue class training points and the 10 bluish-red class training points.
Fit a parabola with focus p = b-mean and directrix = the line perpendicular to the D-line through the means midpoint. Letting M = mrmb and X a point, we want the mask pTree P for M o X > d(p,X). Squaring both sides:
(M o X)^2 > d^2(p,X), where (M o X)^2 = (m1x1 + m2x2)^2 and d^2(p,X) = (p1-x1)^2 + (p2-x2)^2.
Expanding: m1^2 x1^2 + 2 m1m2 x1x2 + m2^2 x2^2 > x1^2 - 2 p1x1 + p1^2 + x2^2 - 2 p2x2 + p2^2, i.e.,
(m1^2 - 1) x1^2 + 2 m1m2 x1x2 + (m2^2 - 1) x2^2 + 2 p1x1 + 2 p2x2 > p1^2 + p2^2.
P over this quadratic predicate should do it.
[Figure: r/b scatter with D, the cut-hyperplane CHP, and the mask P(mrmb)oX>|mr+mb|/2.]
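A literal numpy evaluation of the slide's inequality (M o X)^2 > d^2(p,X) (the vectors mr, mb and the test points are invented; the directrix placement and normalization are elided here as on the slide):

    import numpy as np

    # Literal evaluation of the parabolic predicate (M o X)^2 > d^2(p, X),
    # with M = mr - mb and focus p = mb.  All vectors here are invented.
    mr, mb = np.array([3.0, 1.0]), np.array([1.0, 3.0])
    M, p = mr - mb, mb

    def parabola_mask(X):
        lhs = (X @ M) ** 2                    # (m1*x1 + m2*x2)^2
        rhs = ((X - p) ** 2).sum(axis=1)      # (p1-x1)^2 + (p2-x2)^2
        return lhs > rhs

    X = np.array([[3.0, 1.2], [1.1, 2.9], [2.0, 2.0]])
    print(parabola_mask(X))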
30
FAUST is a Near Neighbor Classifier
FAUST is a Near Neighbor Classifier. It is not a voting NNC like pCkNN (where, for each unclassified sample, pCkNN builds around that sample a neighborhood of TrainingSet voters, who then classify the sample through majority, plurality, or weighted (in PINE) vote; pCkNN classifies one unclassified sample at a time). FAUST is meant for speed, and therefore FAUST attempts to classify all unclassified samples at one time. FAUST builds a Big Box Neighborhood (BBN) for each class and then classifies all unclassified samples in the BBN into that class (constructing said class-BBNs with one EIN pTree calculation per class).
The BBNs can overlap, so the classification needs to be done one class at a time, sequentially, in order of maximum gap, maximum number of stds in the gap, or minimum rankK in the gap. The whole process can be iterated, as in k-means classification, using the predicted classes (or subsets of them) as the new training set; this can be continued until convergence.
A BBN can be a coordinate box: for coordinate R, cb(R, class, aR, bR) is all x such that aR < xR < bR (either or both of the < can be <=). aR and bR are what were called the cut_points of the class. Or BBNs can be multi-coordinate boxes, which are INTERSECTIONs of the best k (k <= n-1, assuming n classes) cb's for a given class ("best" can be with respect to any of the above maximizations). And instead of using a fixed number of coordinates, k, we could use only those in which the "quality" of the cb is higher than a threshold, where "quality" might be measured involving the dimensions of the gaps (or in other ways?).
FAUST could be combined with pCkNN (probably in many ways) as follows: FAUST multi-coordinate BBNs could be used first to classify the "easy points" (those that fall in an intersection of high-quality BBNs and are therefore fairly certain to be correctly classified). Then the remaining "difficult points" could be classified using the original training set (or the union of each original TrainingSet class with the new "easy points" of that same class), using L-infinity or Lp, p = 1 or 2.
[Figure: R/G/B coordinate boxes with cut_points aR, bR, aG, bG, aB, bB.]
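A sketch of the sequential BBN application in numpy (boolean masks in place of pTrees; the PW cut_points 7 and 17 are taken from the earlier IRIS slides, everything else here is invented):

    import numpy as np

    # Sequential BBN application: boxes applied strongest-first, with
    # already-classified points removed (P_REMAINING shrinks each round).
    bbns = [("se", (-np.inf, 7.0)),
            ("ve", (7.0, 17.0)),
            ("vi", (17.0, np.inf))]

    def classify(pw_column):
        labels = np.array([None] * len(pw_column), dtype=object)
        remaining = np.ones(len(pw_column), dtype=bool)     # P_REMAINING
        for cls, (a, b) in bbns:
            mask = remaining & (pw_column > a) & (pw_column < b)
            labels[mask] = cls
            remaining &= ~mask
        return labels

    print(classify(np.array([2.0, 13.0, 21.0, 16.0])))   # [se ve vi ve]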
31
P(mbmr)oX>|mr+mb|/2
A multi-attribute Oblique (FAUST-O) based heuristic: instead of finding the best D, take the vector connecting a class mean to another class mean as D. To separate r from v: D = (mvmr) and a = |mv+mr|/2. To separate r from b: D = (mbmr) and a = |mb+mr|/2.
Question: what's the best cutpoint? Mean, vector_of_medians, outermost, outermost_non-outlier? By "outermost" I mean the points furthest from the means in each class (in terms of their projections on the D-line); by "outermost non-outlier" I mean the furthest non-outlier points. Other possibilities: the best rankK points, the best std points, etc.
ANDing the two pTree masks, P(mbmr)oX>|mr+mb|/2 and P(mvmr)oX>|mr+mv|/2, masks the region (which is r): each masks the vectors that make a shadow on the mr side of the midpoint.
Comments on where to go from here (assuming we can do the above): I think the "medoid-to-medoid" method on this page is close to optimal provided the classes are convex. If they are not convex, then some sort of Support Vector Machine, SVM, would be the next step. In SVMs the space is translated to higher dimensions in such a way that the classes ARE convex; the inner product in that space is equivalent to a kernel function in the original space, so that one need not even do the translation to get inner-product-based results (the genius of the method). Final note: I should say "linearly separable" instead of "convex" (a slightly weaker condition).
[Figure: r/v/b scatter with mr, mv, mb for classes r, v, and b.]
32
Suppose there are just 2 attributes (red and blue) and we (r,b)-scatter-plot the 10 reddish-blue (rb) class training points and the 10 bluish-red (br) class training points. Clearly we would want to find a ~45-degree unit vector, D, then calculate the means of the projections of the two training sets onto the D-line, and then use the midpoint of the gap between those two means as the cut_point (erecting a perpendicular-bisector "hyperplane" to D there, which separates the space into the two class big boxes on each side of the hyperplane; can it be masked using one EIN formula?). The consecutive class-mean mid-point = cut_point, giving the cut-hyperplane, CHP (what we are after).
The above "diagonal" cutting produces a perfect classification (of the training points). If we had considered only cut_points along the coordinate axes, it would have been very imperfect!
[Figure: red/blue scatter of rb and br points, the D-line means of each class, the gap, the cut_point, and the CHP.]
33
D-line means for the rb and br classes
How do we search through all possible angles for the D that will maximize that gap? We would have to develop the (pTree-only) formula for the class means for any D and then maximize the gap (the distance between consecutive D-projected means). Take a look at the formulas in the book, think about it, take a look at Mohammad's formulas, and see if you can come up with the mega formula above.
Let D = (D1, ..., Dn) be a unit vector (our cut_line direction vector). D o X = D1X1 + ... + DnXn is the length of the perpendicular projection of X on D (the length of the high-noon shadow that X makes on the D-line, as if D were the earth). So, we project every training point Xc,i (class c, i = 1..10) onto D (i.e., D o Xc,i). Calculate the D-line class means, (1/n) sum_i (D o Xc,i), and select the max consecutive mean gap along D (call it best_gap(D) = bg(D)). Maximize bg(D) over all possible D. Harder? Calculate it for a [polar] grid of D's, maximize over that grid, then use continuity and hill climbing to improve it.
More likely the situation would be: rb's are more blue than red and br's are more red than blue. What if the training points are shifted away from the origin? This should convince you that it still works.
[Figure: red/blue scatters of rb and br points (including shifted versions), the D-line means for each class, the gap, the cut_point, and the CHP.]
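A sketch of the polar-grid search in numpy (2-D only; the two gaussian training blobs and the 1-degree grid are arbitrary choices):

    import numpy as np

    # Polar-grid search for the D maximizing the gap between the projected
    # class means (2-D case; the two gaussian training blobs are made up).
    rb = np.random.RandomState(1).normal([1.0, 3.0], 0.2, (10, 2))
    br = np.random.RandomState(2).normal([3.0, 1.0], 0.2, (10, 2))

    def bg(D):                                       # best_gap(D): with only
        return abs((rb.mean(0) - br.mean(0)) @ D)    # 2 classes, one mean gap

    angles = np.linspace(0.0, np.pi, 180, endpoint=False)   # 1-degree grid
    Ds = np.column_stack([np.cos(angles), np.sin(angles)])  # unit vectors D
    best = max(Ds, key=bg)
    print(best, bg(best))   # roughly the (1,-1)/sqrt(2) direction (or its flip)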
34
In higher dimensions, nothing changes (if there are "convex" clustered classes, FAUST{div,oblique_gap} can find them; consider greenish-reddish-blue and bluish-greenish-red). Before considering the pTree formulas for the above, we note again that any pair of classes (or multi-classes, as in divisive) that are convex can be separated by this method. What if they are not convex? (A 2-D example.)
A couple of comments: FAUST resembles the SVM (Support Vector Machine) method in that it constructs a separating hyperplane in the "margin" between classes. The beauty of SVM (over FAUST and all other methods) is that it is provable that there is a transformation to higher dimensions that renders two non-hyperplane-separable classes hyperplane-separable (and you don't actually have to do the transformation - just determine the kernel that produces it). The problem with SVM is that it is computationally intensive. I think we want to keep FAUST simple (and fast!). If we can do this generalization, I think it will be a real winner!
How do we search over all possible oblique vectors, D, for the one that is "best"? Or, if we are to use multi-box neighborhoods, how do we do that? A heuristic method follows.
[Figure: grb/bgr mixed-class scatter with oblique vector D.]
35
FAUST_pdq_std (using std's) 1.1
Create attribute tables with cl=class, mn=mean, std, n=max_#_stds_in_gap, cp=cut_point (the value in the gap which allows the max number of stds, n, to fit forward from the mean (using its std) and backward from the next mean (using its std)). n satisfies: mean + n*std = meanG - n*stdG, so n = (mnG - mn)/(std + stdG).
Choose the TA record with max n. Note: since there is also a case with n = 4.1 which results in the same partition (into {se} and {ve,vi}), we might use both for improved accuracy - certainly we can do this with sequential! Then remove se from RC (= {ve,vi} now) and from the TA's.
[Figure: class bit masks for se, ve, vi; attribute tables TsLN, TsWD, TpLN, TpWD with columns cl, mn, std, n, cp; per-class means, stds, n's, and cut points.]
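A one-function sketch of the n and cut-point formulas (the class statistics passed in are made up):

    # n = number of stds fitting in the gap between consecutive class means,
    # and the corresponding cut point (slide formula:
    # mn + n*std = mnG - n*stdG  =>  n = (mnG - mn) / (std + stdG)).
    def std_gap(mn, std, mnG, stdG):
        n = (mnG - mn) / (std + stdG)
        cp = mn + n * std              # equals mnG - n*stdG
        return n, cp

    # made-up petal-width statistics for two consecutive classes
    print(std_gap(mn=13.3, std=2.0, mnG=20.3, stdG=2.7))  # pick max n; cut at cp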
36
FAUST_pdq using std's: use the 4 attribute tables with rv=mean, stds, and max_#_stds_in_gap = n, and cut value cp (cp = the value in the gap which allows the max number of stds, n, to fit forward from that mean (using its std) and backward from the next mean, meanG (using stdG)). n satisfies mean + n*std = meanG - n*stdG, so n = (meanG - mean)/(std + stdG).
The TA record with max n is TpWD at cp = 16, so P{vi} = PpWD>16. Note that we get perfect accuracy with one epoch using stds this way!!!
[Figure: class bit masks for se, ve, vi; attribute tables TsLN, TsWD, TpLN, TpWD (cl, mn, std, n, cp).]
37
FAUST_pdq SUMMARY. We conclude that FAUST_pdq will be fast (no loops, one pTree mask per step, may converge within one [or just a few] epochs?) and is fairly accurate (completely accurate in this example using the std method!). FAUST_pdq is improved (accuracy-wise) by using standard-deviation-based gap measurements and choosing the maximum number of stds as the attribute-relevancy choice. There may be many other such improvements, e.g., using an outlier-identification method (see Dr. Dongmei Ren's thesis) to determine the set of non-outliers in each attribute and class. Within each attribute, order by means and define gaps to be between the maximum non-outlier value in one class and the minimum non-outlier value in the next (allowing these gap measurements to be negative if the max of one exceeds the min of the next). Also, there are many ways of defining representative values (means, medians, rank-points, ...).
In conclusion, FAUST_pdq is intended to be very fast (if raw speed is the need - as it might be for initial processing of the massive and numerous image datasets that the DoD has to categorize and store). It may be fairly accurate as well, depending upon the dataset, but since it uses only one attribute or feature for each division, it is not likely to be of maximal accuracy compared to other methods (such as the FAUST_pms coming up). Next, look at FAUST_pms (pTree-based, m-attribute cut_points, sequential (one class divided off at a time)), so we can explore the various choices for m (from 1 to the table width) and alternate distance measures.
38
K=10: For i=4..0 { c=rc(Pc&Patt,i); ... }
The rankK loop (K=10): For i = 4..0 { c = rc(Pc & Patt,i); if (c >= ps) { rankK += 2^i; Pc = Pc & Patt,i } [rank(n-K+1) += 2^i;] else { ps = ps - c; Pc = Pc & P'att,i } }.
pWD_vi_LO = 16; pWD_se_HI = 0 and pWD_ve_HI = 0. So the highest pWD_se_HI and pWD_ve_HI can get is 15, and the lowest pWD_vi_LO will ever be is 16. So cutting at 16 will separate all vi from {se,ve}. This is, of course, with reference to the training set only, and it may not carry over to the test set (a much bigger set?), especially since the gap may be small (=1). Here we will use pWD cutpt 16 to peel off vi! We need a theorem and proof here!!
[Figure: LO/HI rank tables (serc, seRK, verc, veRK, virc, viRK, seps, veps, vips) and bit-slice columns for sLN, sWD, pLN, pWD (values 16, 25, 36, 44, ...).]
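A sketch of this rankK loop over vertical bit slices, using Python ints as bit vectors (rc = 1-bit count; the 4-row example values are invented):

    # rankK over vertical bit slices, with Python ints as bit vectors.
    # slices[i] has a 1 at row r iff bit i of row r's attribute value is 1.
    def rankK(slices, Pc, K, width):
        ps, rank_val = K, 0
        for i in range(width - 1, -1, -1):
            c = bin(Pc & slices[i]).count("1")     # c = rc(Pc & P_att,i)
            if c >= ps:                            # Kth largest has bit i = 1
                rank_val += 1 << i
                Pc &= slices[i]
            else:                                  # bit i = 0; skip c rows
                ps -= c
                Pc &= ~slices[i]
        return rank_val

    vals = [5, 3, 7, 6]
    slices = [sum(((v >> j) & 1) << r for r, v in enumerate(vals))
              for j in range(3)]
    print(rankK(slices, Pc=0b1111, K=2, width=3))   # 2nd largest of vals -> 6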
39
For i=4..0 { c=rc(Pc&Patt,i); if (c>=ps) { rankK += 2^i; Pc = Pc&Patt,i } ... }
The same rankK loop: For i = 4..0 { c = rc(Pc & Patt,i); if (c >= ps) { rankK += 2^i; Pc = Pc & Patt,i } [rank(n-K+1) += 2^i;] else { ps = ps - c; Pc = Pc & P'att,i } }.
pLN_ve_LO = 32; pLN_se_HI = 0. So the highest pLN_se_HI can get is 31, and the lowest pLN_ve_LO will ever be is 32. So cutting at 32 will separate all ve from se! Greater accuracy can be gained by continuing the process for all i and for all K, then looking for the best gaps (all gaps? all gaps weighted?).
[Figure: LO/HI rank tables and bit-slice columns for the pLN attribute.]
40
FAUST{pms,std} (FAUST{pms} using # of gap stds)
FAUST{pdq,gap} (divisive, quiet (no noise), with gaps):
0. For each attribute A, TA(class, rv, gap), ordered on rv asc (rv = class rep, gap = distance to the next rv).
WHILE RC not empty, DO: 1. Find the TA record with maximum gap. 2. Use PA>c (c = rv + gap/2) to divide RC at c into LT and GT (pTree masks PLT and PGT). 3. If LT or GT is a singleton, remove that class. END_DO.
FAUST{pdq,mrk} (FAUST{pdq} with max rank_k): rank_k(S) is the kth largest value in S (rank_k for k=1/2 is the median, k=1 the maximum, k=0 the minimum).
0. For each attribute A, TA(class, md, k, cp), ordered on md asc, where k is the max value such that set_rank_k of the class and set_rank_(1-k) of the next class do not overlap. The same algorithm can clearly be used as pms: FAUST{pms,mrk}.
FAUST{pdq,std} (FAUST{pdq} using # of gap standard devs):
0. For each attribute A, TA(class, mn, std, n, cp), ordered on n asc, where cp = the value in the gap allowing the max number of stds, n; n satisfies mean + n*std = meanG - n*stdG, so n = (mnG - mn)/(std + stdG).
WHILE RC not empty, DO: 1. Find the TA record with maximum n. 2. Use PA>cp to divide RC at cp (the cut point) into LT and GT (pTree masks PLT and PGT). 3. If LT or GT is a singleton, remove that class from RC and from all TA's. END_DO.
FAUST{pms,gap} (FAUST{p}, m attribute cut_points, sequential class separation (one class at a time), m=1):
0. For each A, TA(class, rv, gap, avgap), where avgap is the average of gap and previous_gap (if first, avgap = gap). If x classes, DO x-1 times: 1. Find the TA record with maximum avgap. 2. cL = rv - prev_gap/2, cG = rv + gap/2; masks Pclass = PA>cL & PA<=cG & PRC; PRC = P'class & PRC (if first in TA (no prev_gap), Pclass = PA<=cG & PRC; if last, Pclass = PA>cL & PRC). 3. Remove that class from RC and from all TA's. END_DO.
FAUST{pms,std} (FAUST{pms} using # of gap stds):
0. For each attribute A, TA(class, mn, std, n, avgn, cp), ordered on avgn asc, with cp = cut_point (the value in the gap which allows the max number of stds, n; n satisfies mn + n*std = mnnext - n*stdnext, so n = (mnnext - mn)/(std + stdnext)). DO x-1 times: 1. Find the TA record with maximum avgn. 2. cL = rv - prev_gap/2, cG = rv + gap/2, and pTree masks Pclass = PA>cL & PA<=cG & PRC; PRC = P'class & PRC (if the class is first in TA (has no prev_gap), then Pclass = PA<=cG & PRC; if last, Pclass = PA>cL & PRC). 3. Remove that class from RC and from all TA's. END_DO.
41
Near Neighbor Classifiers and FAUST 2011_04_23
FAUST is really a Near Neighbor Classifier (NNC) in which, for each class, we construct a big box neighborhood (bbn) which we think, based on the training points, is most likely to contain that class and least likely to contain the other classes.
In the current FAUST, each bbn is a coordinate box, i.e., for coordinate (band) R, coordinate_box cb(R, class, aR, bR) is the set of all points x such that aR < xR < bR (either of aR or bR can be infinite; either or both of the < can be <=). The values aR and bR are what we have called the cut_points for that class. bbn's are constructed using the training set and applied to the full set of unclassified pixels.
The bbn's are always applied sequentially, but can be constructed either sequentially or divisively. In case the construction is sequential, the application sequence is the same as the construction sequence (and the application for each class follows the construction for that class immediately, i.e., before the next bbn construction): all pixels in the first bbn are classified into that first class (the class of that bbn); all remaining pixels which are in the second bbn are classified into the second class; and so on - iteratively, all remaining unclassified pixels which are in the next bbn are classified into its class. The reason bbn's are applied sequentially is that they intersect. Thus the first bbn should be the strongest in some sense, then the next strongest, etc. In each round, from the remaining classes, we construct FAUST bbn's by choosing the attribute-class with the maximum gap_between_consecutive_mean_values, or the maximum_number_of_stds_between_consecutive_means, or the gap_between_consecutive_means allowing the minimum rank (i.e., the "best remaining gap"). Note that mean can be replaced by median or any representer.
We could take the bbn's to be "multi-coordinate_band", or mcb: the INTERSECTION of the "best" k (k <= n-1, assuming n classes) cb's for a given class (where "best" can be with respect to any of the above maximizations). And instead of using a fixed number of coordinates, k, we could use only those coordinates in which the "quality" of the cb is higher than a threshold, where "quality" might be measured in many ways involving the dimensions of the gaps (or other ways?). Many pixels may not get classified (this hypothesis needs testing!). It should be accurate, though.
[Figure: R and G coordinate boxes with cut_points aR, bR, aG, bG.]
42
Near Neighbor Classifiers and FAUST-2
We note that mcb's are used for vegetation indexing: high green (aG high and bG = ∞, i.e., all x such that xG > aG) and low red (aR = -∞ and bR low, i.e., all x such that xR < bR) is the standard "vegetation index" and measures crop health well. So if, instead of predicting grass, we were predicting lush grass, we could use the vegetation index, which involves mcb bbn's. Similarly, mcb bbn's would be used for any color object which is not pure (in the bands provided). Therefore a "blue-red" car would ideally involve a bbn that is the intersection of a red cb and a blue cb. Most paint colors are not pure. Worse yet, what does pure mean? Pure only makes sense in the context of the camera taking the image in the first place. The definition of a pure color in a given image is a color entirely within one band (column) of that image dataset (with all other bands showing zero values only). So almost all actual objects are multi-color objects and would require, or at least benefit from, a multi-cb bbn approach.
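For instance, reusing the representation assumed in the sketch above, the vegetation-index mcb is just a two-cb intersection; the band indices and thresholds here are illustrative, not calibrated values.

import numpy as np

# Hypothetical vegetation-index mcb: low red AND high green.
# Band 0 = red, band 1 = green; 50 and 70 are illustrative cut_points.
vi_bbn = ("lush_grass", {0: (-np.inf, 50),    # x_R < 50  (low red)
                         1: (70, np.inf)})    # x_G > 70  (high green)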
43
Appendix Note on problems:
Difficult separations problem: e.g., white cars from white roofs. Include as feature attributes the pixel coordinate value columns as well as the bands. If the color is not sufficiently different to make the distinction (and no other non-visible band makes the distinction either), and if the classes are contiguous objects (as they are in Aurora), then because the white car training points are [likely to be] far from the white roof training points, FAUST may still work well using x and y pixel coordinates as additional feature attributes (and attributes such as "shape", edge_sharpness, etc., if available). CkNN applied to neighbors taken from the training set should also work.

Noise Class problem: In pixel classification there may be a Default_Class or NOise class, NO. (The Aurora classes are Red_Cars, White_Cars, Black_Cars, ASphalt, White_Roof, GRass, SHadow, and in the "Parking Lot Scene" case at least there does not appear to be a NOise class, i.e., every pixel is in one of those 7 classes.) So in some cases we may have 8 classes {RC, WC, BC, AS, WR, GR, SH, NO}. Picking out NO may be a challenge for any algorithm if it contains pixels that match training pixels from several of the legitimate classes, i.e., if NO is composed of tuples with values similar to other classes. Dr. Wettstein calls this the "red shirt" problem: if a person wearing a red shirt is in the field of view, those pixels may be electromagnetically indistinguishable from Red_Car pixels, and no correct algorithm will distinguish them electromagnetically (using only reflectance bands). Other attributes such as x and y position, size and shape (if available), etc. may provide a distinction.

Using FAUST{seq}, where we maximize the size of the gap between consecutive means, or maximize the number of stds in the gap between consecutive means, or minimize the K which produces no overlap (between the rankK set and the rank(n-K+1) set of the next class) in the gap between consecutive classes: instead of taking as cut_point the point produced by that maximization, we should back off from it and narrow the interval around that class mean by going only some parameterized fraction either way, which would remove many of the NO points from that class prediction.

Inconsistent ordering of classes over the various attributes (columns) may be an indicator of something?
44
Attr-Class-Set, ACS(sWD, vi)
An old version of the basic algorithm: I took the first 40 each of setosa, versicolor and virginica and put the other 30 tuples in a class called "noise".
1. Sort ACS's asc by median gap = rankK(this class) - rank(n-K+1)(next class).
2. Do until ( rankK(ACS) ≥ rank(n-K+1)(next higher ACS in the same attribute) | K = n/2 ):
3.   find the gap;
4.   K = K-1. END DO; return K for each (Attribute, Class) pair.
Build the ACS tables (gap > 0); cut_pt = rankK + S*gap, S = 1; minimize K.

The first pass produces a tie for min K, in (pLN, vi) and (pWD, vi) (note: in both cases vi has no higher gap, since it is the highest class). Thus we can take both, and either AND the conditions or OR them. If we OR them, (PpLN,vi ≥ 48) | (PpWD,vi ≥ 16), we get perfect classification [and if we AND them, we get 5 mistakes]. Recomputing, min K falls in (pWD, vi); cutting there gets 9 mistakes. Recomputing again, min K falls in (sLN, no); cutting there gets 12 mistakes. (The per-attribute tables TsLN, TsWD, TpLN, TpWD, with columns (class, md, K, rnK, gap), and the sorted training columns are shown in the original figure.)

FAUST{seq,mrk}, VPHD. The set of training values in one column and one class is called an Attribute-Class-Set, ACS. K(ACS) = |ACS| (all |ACS| = n = 10 here). In the algorithm below, c = root_count and ps = position (there is a separate root_count and position for each ACS, and for each of K and n-K+1 within that ACS, so c = c(attr, class, K | n-K+1)). S = gap enlargement parameter (it can be adjusted to try to clip out the Noise Class, NC).
1. Sort ACS's asc by median gap = rankK(this class) - rank(n-K+1)(next class).
2. Do until ( rankK(ACS) ≥ rank(n-K+1)(next ACS) | K = 0 ):
3.   find the rankK and rank(n-K+1) values of each ACS;
4.   K = K-1. END DO; return K for each (Attribute, Class) pair.
5. Cut_pts are placed above/below that class (using values in the attribute): high cut_pt = rankK + S*(higher_gap); low cut_pt = rank(n-K+1) - S*(lower_gap).
45
1. For every attr and every class, sort the values asc.
2011_04_09. FAUST{pdq,mrk} algorithm, demonstrated with VPHD (Vertical Processing, Horizontal Data) first:
1. For every attribute and every class, sort the values asc.
2. Find and order the medians asc in the TA tables.
3. Find the max k such that rank_k_set ∩ rank_(1-k)_set = ∅.
4. Proceed as in all FAUST algorithms: cut accordingly (pdq or pms or ???).

With VPHD, sort each class in each attribute, find the medians (needed?), find the rank_k_sets (combine this with the sorting?) ... so O(n). With HPVD (Horizontal Processing, Vertical Data), we can avoid the sorting: find the rank_k_sets (the median is rank_.5) and fill the TAs entirely with a pTree program, O(0). (The original slide shows, per attribute, the sorted class values with their rank_k markers and the tables TsLN, TsWD, TpLN, TpWD of class medians, k and cut-points, e.g., (se, k=.7, cp=53) in sLN.)

HPVD_mrk could be made optimal, since we could record exactly which k and cp give minimum error (as we work toward an empty rank_k_set intersection), and we could know the error set. We could use CkNN or ? on each errant sample. To see this, go through the first k/cp animation. In that looping procedure it is clear we could determine se<55, with 3 errors, to be the best cp (se<54, 6 errors; se<52, 5; se<50, 5; se<49, 6). Note: mrk above is lazy; it takes cp to be the average of the rank values, in this case cp=53, which has 6 errors.

One can see from this animation that MaxGap is probably a pretty good method most of the time (provided there is at least one good gap at each step) and MaxGapStd is even better (same proviso). This method is intended to be optimal and to deal with, e.g., non-normal distributions.
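Here is a minimal VPHD sketch of step 3 for one pair of adjacent classes, assuming rank_k means the k-quantile of the class values (so rank_1 is the max and rank_0 the min, as defined earlier); the function name and the ten-step grid for k are my own choices, and the values in the usage lines are illustrative, not the actual iris columns.

import numpy as np

def max_nonoverlap_k(lo_vals, hi_vals, steps=10):
    """Largest k (in steps of 1/steps) such that the k-quantile of the lower
    class lies below the (1-k)-quantile of the higher class, i.e., the
    rank_k set and the rank_(1-k) set no longer overlap."""
    for i in range(steps, -1, -1):          # try k = 1.0 first, then shrink
        k = i / steps
        if np.quantile(lo_vals, k) < np.quantile(hi_vals, 1 - k):
            return k
    return None                              # the classes overlap entirely

# Usage with illustrative sepal-length-like values:
setosa_sl = [44, 46, 47, 49, 50, 54, 50, 49, 48, 51]
versic_sl = [49, 50, 52, 55, 57, 63, 64, 65, 66, 69]
print(max_nonoverlap_k(setosa_sl, versic_sl))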
46
maximum
Example: computing the maximum of the pw attribute over the setosa mask, from the bit-slice pTrees Ppw,4 ... Ppw,0 (rc = root count, the number of 1-bits under the current mask):

c=0; max=0; Pc=pure1;
For i=4..0 {
  c = rc(Pc & Patt,i);
  if (c > 0) { Pc = Pc & Patt,i; max = max + 2^i }
}
return max;

Walking down from the high-order slice: whenever some rows under the current mask have a 1 in slice i, the maximum must have a 1 there, so we add 2^i and restrict the mask to those rows; otherwise the mask is unchanged and that bit of max stays 0. In the slide's setosa example the loop adds 2^4, 2^3 and 2^0, so max = 2^4 + 2^3 + 2^0 = 25.
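A sketch of the same maximum loop in Python, with boolean NumPy arrays standing in for the bit-slice pTrees and .sum() standing in for the root count rc:

import numpy as np

def ptree_max(slices, mask):
    """Maximum of a bit-sliced attribute under a mask.
    slices[0] is the high-order bit slice; mask selects the rows."""
    pc = mask.copy()
    maxv = 0
    nbits = len(slices)
    for i, sl in enumerate(slices):          # high-order slice first
        c = (pc & sl).sum()                  # root count under current mask
        if c > 0:
            pc &= sl                         # keep rows with a 1 in slice i
            maxv += 1 << (nbits - 1 - i)
    return maxv

# Usage: 3-bit values; the mask excludes the 7, so the max is 5.
vals = np.array([2, 5, 3, 7, 1])
slices = [((vals >> b) & 1).astype(bool) for b in (2, 1, 0)]
mask = np.array([True, True, True, False, True])
print(ptree_max(slices, mask))   # 5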
47
minimum
Example: computing the minimum of the pw attribute over the setosa mask, using the complement slices P'att,i; this is the dual of the maximum:

c=0; min=0; Pc=pure1;
For i=4..0 {
  c = rc(Pc & P'att,i);
  if (c > 0) Pc = Pc & P'att,i;
  else min = min + 2^i;
}
return min;

If some rows under the mask have a 0 in slice i, the minimum has a 0 there and we restrict to those rows; only when all remaining rows have a 1 do we add 2^i. In the slide's example only the 2^0 bit is set, so min = 2^0.
48
rank5 (5th largest)
Example: computing rank5 (the 5th largest pw value) over the setosa mask:

c=0; rankK=0; pos=K=5; Pc=pure1;
For i=4..0 {
  c = rc(Pc & Patt,i);
  if (c >= pos) { rankK = rankK + 2^i; Pc = Pc & Patt,i }
  else { pos = pos - c; Pc = Pc & P'att,i }
}
return rankK;

At each slice, c counts the remaining rows whose bit i is 1. If at least pos of them have a 1, the kth largest value lies among them, so that bit of the answer is 1 and we keep those rows; otherwise it lies among the 0-rows, that answer bit is 0, and pos shrinks by the c rows we discard. On the slide the loop sets bits 2^4 and 2^2, returning rank5 = 20.
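The same loop in Python, again with boolean arrays standing in for the pTrees; note that max is rank_1 and min is rank_n, so this one routine generalizes the two previous slides:

import numpy as np

def ptree_rank_k(slices, mask, k):
    """k-th largest value of a bit-sliced attribute under a mask,
    found one bit slice at a time (slices[0] = high-order bit)."""
    pc = mask.copy()
    pos = k
    rank = 0
    nbits = len(slices)
    for i, sl in enumerate(slices):
        c = (pc & sl).sum()
        if c >= pos:                 # the kth largest lies among the 1-rows
            pc &= sl
            rank += 1 << (nbits - 1 - i)
        else:                        # it lies among the 0-rows
            pos -= c
            pc &= ~sl
    return rank

# Usage: the 5th largest of [2,5,3,7,1,6,4] is 3.
vals = np.array([2, 5, 3, 7, 1, 6, 4])
slices = [((vals >> b) & 1).astype(bool) for b in (2, 1, 0)]
print(ptree_rank_k(slices, np.ones(len(vals), bool), 5))   # 3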
49
rank25 (25th largest)
Example: computing rank25 (the 25th largest pw value) over a 30-row mask, with the same loop starting from pos = 25. Here the early root counts (10, 9, 8, ...) fall below pos, so the high-order answer bits are 0 and pos shrinks by c at each such slice; the loop ends with only the 2^1 bit set, returning rank25 = 2.
50
A LO above all other HIs, or a HI below all other LOs:
Check the HI and LO values in each class (over each attribute, in general) for a LO above all other classes' HIs, or a HI below all other classes' LOs. For attr4 = petal_Width, LO_vi = 17 while HI_ve = 15 (and HI_se = 4, lower still), so a cut-point at 16 separates vi from {se, ve}. Note: this cut-point appears early in the loop (i=4). Can a gap be concluded at i=4?

Do this concurrently over all attributes for each K until the first gap is found. This finds the first high or low gap, but there may be none. It could find any gap pair separating one class from the rest (change the 'or' to 'and'), but there may be none either; then take the best negative gap. It can be divisive.

With n=10, K=1 computes rankK = the class HI and n-K+1 = 10 computes rank(n-K+1) = the class LO. Run, per attribute and class, the rank loop
For i=4..0 { c = rc(Pc & Patt,i); if (c ≥ pos) { rankK += 2^i; Pc = Pc & Patt,i } else { pos = pos - c; Pc = Pc & P'att,i } }
exiting when some class in some attribute has a gap (hi or lo) with all other classes in that attribute; then peel that class and repeat. (The slide's per-class bit-slice worksheets for se, ve and vi are omitted.)
51
In FAUST_pdq, the mrk algorithm should be at least as good as gap and std, but it should also be better in the following situation: mrk should beat gap (and std?) when taking the midpoint of a negative gap would be wrong because of a difference in distributions (e.g., one class is normal and the other follows a power distribution); there a rank-based cut such as rank_.9 lands in the right place while the gap or std cut_pt does not. In FAUST_pdq, if there is a gap, mrk will always find it. If there isn't and the distributions are different, it will (should?) perform much better than the other two.
52
2012_02_04. FAUST Oblique formula: P(X o d < a), where X is any set of vectors (e.g., a training class), D ≡ mr - mv, and d = D/|D|.

To separate the r's from the v's using the means' midpoint as the cut-point, calculate a as follows. Viewing mr and mv as vectors (e.g., mr ≡ the vector from the origin to the point mr):

  a = ( mr + (mv - mr)/2 ) o d = (mr + mv)/2 o d

What if d points away from the intersection of the Cut-hyperplane (a Cut-line in this 2-D case) and the d-line, as it does for class = v, where d = (mv - mr)/|mv - mr|? Then a is the negative of the distance shown (the angle is obtuse, so its cosine is negative). But each v o d is then a larger negative number than a = (mr + mv)/2 o d, so we still want v o d < (mr + mv)/2 o d. (The slide's scatter diagram of r's and v's with mr, mv, d and a is omitted.)
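A minimal sketch of this cut in Python (toy data; the function name is mine). It takes d along mv - mr so that the r's project low, computes a at the means' midpoint, and classifies by comparing x o d with a; the slide considers both orientations of d.

import numpy as np

def oblique_cut(R, V):
    """FAUST Oblique midpoint cut: project onto the unit vector d through
    the class means and cut halfway between them. Returns (d, a); with
    d = (mv - mr)/|mv - mr|, x is predicted 'r' when x o d < a."""
    mr, mv = R.mean(axis=0), V.mean(axis=0)
    D = mv - mr
    d = D / np.linalg.norm(D)
    a = (mr + mv) / 2 @ d
    return d, a

# Usage on toy 2-D classes:
rng = np.random.default_rng(0)
R = rng.normal([2, 2], 0.5, (50, 2))
V = rng.normal([6, 5], 0.5, (50, 2))
d, a = oblique_cut(R, V)
x = np.array([2.5, 2.0])
print("r" if x @ d < a else "v")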
53
FAUST Oblique, vector of stds: D ≡ mr - mv, d = D/|D|, P(X o d < a)
P(X o d < a) = P(Σ diXi < a). To separate r from v using the vector-of-stds cut-point, calculate a as follows. Viewing mr and mv as vectors,

  a = ( (mr·stdv + mv·stdr) / (stdr + stdv) ) o d

i.e., the cut-point sits in the gap between the means, stdr/(stdr + stdv) of the way from mr toward mv, so the class with the larger spread claims the larger share of the gap (this matches the projection-based algebra on a later slide).

What are the stds here? Approach 1: for each coordinate (dimension), calculate the std of that coordinate's values, and use the vector of those stds. Let's remind ourselves that the formula (Md's formula) does not require looping through the X-values; it requires only one AND program across the pTrees: P(X o d < a) = P(Σ diXi < a).
54
FAUST Oblique: D ≡ mr - mv, d = D/|D|
P(X o d < a) = P(Σ diXi < a). Approach 2: to separate r from v using the stds of the projections, calculate a as follows:

  a = pmr + (pmv - pmr) · pstdr/(pstdr + pstdv)
    = (pmr·pstdv + pmv·pstdr) / (pstdr + pstdv)

Next, what about doubling the r-side share?

  a = pmr + (pmv - pmr) · 2·pstdr/(2·pstdr + pstdv)
    = (pmr·pstdv + pmv·2·pstdr) / (2·pstdr + pstdv)

In this case the predicted classes will overlap (i.e., a given sample point may be assigned multiple classes), therefore we will have to order the class predictions.

By pmr we mean the projected mean, mr o d, which is also mean{ r o d | r ∈ R }; by pstdr, std{ r o d | r ∈ R }. (The slide's projection diagram is omitted.)
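A sketch of Approach 2 in Python, with a multiplier w standing in for the doubling (w=1 is the plain version, w=2 the doubled one); the names and toy data are illustrative.

import numpy as np

def projection_cut(R, V, w=1.0):
    """Cut on the projected means and stds: a sits w*pstdr/(w*pstdr+pstdv)
    of the way from pmr to pmv along d."""
    mr, mv = R.mean(axis=0), V.mean(axis=0)
    d = (mv - mr) / np.linalg.norm(mv - mr)
    pr, pv = R @ d, V @ d                     # projections r o d, v o d
    pmr, pmv = pr.mean(), pv.mean()
    pstdr, pstdv = pr.std(), pv.std()
    a = pmr + (pmv - pmr) * w * pstdr / (w * pstdr + pstdv)
    return d, a

# A broad class vs. a tight class: the cut leans toward the tight one.
rng = np.random.default_rng(1)
R = rng.normal([0, 0], 1.0, (100, 2))
V = rng.normal([5, 0], 0.3, (100, 2))
print(projection_cut(R, V)[1], projection_cut(R, V, w=2)[1])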
55
FAUST for Satlog (Landsat): per-class results (1's, 2's, 3's, 4's, 5's, 7's)
(Results table omitted here; the original slide lists, for classes 1, 2, 3, 4, 5 and 7, the per-band means and stds (R, G, ir1, ir2), the Class Totals, and True/False Positive counts for each of the following configurations:)
- NonOblique, level-0: True Positives vs. Class Totals.
- NonOblique, level-1 (50%): True Positives and False Positives.
- Oblique, level-0, using the midpoint of means.
- Oblique, level-0, using means and stds of the projections (without class elimination).
- Oblique, level-0, means and stds of projections, with class elimination in 2,3,4,5,6,7,1 order. Note that no elimination occurs!
56
FAUST: X any set of vectors; D ≡ mr - mv, d = D/|D|; P(X o d < a)
P(X o d < a) = P(Σ diXi < a). To separate r from v using the mean and std of the projections as the cut-point:

  a = pmr + (pmv - pmr) · 2·pstdr/(2·pstdr + pstdv) = (pmr·pstdv + pmv·2·pstdr)/(2·pstdr + pstdv)

(Results table omitted here; the original slide lists Class Totals and per-class True/False Positives for (i) Oblique level-0 using means and stds of projections, (ii) the same with pstdr doubled as above, and (iii) doubling pstdr with classify-and-eliminate in 2,3,4,5,7,1 order.)

So the number of FPs is drastically reduced and the TPs are somewhat reduced. Is that better? If we parameterize the 2 (the doubling) and adjust it to maximize TPs and minimize FPs, what is the optimal multiplier parameter value? The next slide shows a low-to-high std elimination ordering.
57
FAUST Oblique: X any set of vectors; D ≡ mr - mv, d = D/|D|; P(X o d < a)
P(X o d < a) = P(Σ diXi < a), with a = pmr + (pmv - pmr) · 2·pstdr/(2·pstdr + pstdv) as before. (Results table omitted here; it repeats the previous slide's three configurations and adds the low-to-high std elimination ordering: doubling pstdr with classify-and-eliminate in 3,4,7,5,1,2 order.)
58
FAUST Oblique, lev-0: cut-point at std1/(std1+std2) of the gap
(Results table omitted here; the original slide lists per-class True/False Positives for level-0 cut-points at std1/(std1+std2) of the gap and at 2std1/(2std1+std2), the latter under several elimination orders, plus totals and a level-1 50% run.)

Defining, per class and band (red, green, ir1, ir2), above = (std + std_up)/gap_up and below = (std + std_dn)/gap_dn suggests the elimination order 425713. On the totals, it is best to use s1/(s1+s2).
59
FAUST Oblique: D ≡ mr - mv, d = D/|D|
(Results table omitted here; the original slide lists, per class, True/False Positives for lev-0 std1/(std1+std2), for lev-0 2std1/(2std1+std2) under several elimination orders, totals for s1/(s1+s2) and 2s1/(2s1+s2), and NonOblique lev-1 50%.)
60
But, depending on definitions, 3 count=1_thin_intervals
So in this case there are zero gaps (count=0 thin intervals) on the fM line, but, depending on definitions, 3 count=1 thin intervals, allowing us to declare the points p12, p16 and p18 anomalies in the first round. My VOM (vector of medians) = (34, 35) (my MEAN = (28, 30), using the point values at left). Round 2, etc., are straightforward in this example. So the two questions are:
1. How to determine count=k thin intervals, given a projection line.
2. How to pick a productive projection line from the nearly infinite number of possibilities. (I like to always start with fM, in case it reveals the one anomaly of interest right away, but then it gets very difficult, especially in high dimensions.)
61
Thin interval finder on the fM line using the scalar pTreeSet PTreeSet(x o fM) (the pTree slices of these projection lengths), looking for Width=2^4, Count=1 thin intervals (W16_C1_TIs):

W16_C1_TI [0,16): we check how close p1 o fM is to the boundaries 0 and 16; it is 5 away, too close, so p1 is not declared an anomaly.
W16_C1_TI [32,48): the distance of p4 o fM to a boundary point is 2, so p4 is not declared an anomaly.
W16_C1_TI [48,64): the distance of p5 o fM to p4 o fM, or to the boundary point 64, is 11, so p5 is an anomaly and we cut through p5.
W16_C1_TI [64,80): ordinarily we would cut through the interval midpoint, but here it is unnecessary since it would duplicate the p5 cut.

(The slide's 15-point dataset, its projection lengths x o fM (11, 27, 23, 34, 53, 80, 118, 114, 125, 110, 121, 109, 83, ...), and the supporting bit-slice counts are shown in the original figure.)
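A rough sketch of question 1 (finding count ≤ 1 thin intervals on the fM line) in Python, using a histogram where the pTree version would AND bit slices; the fixed width and the toy points are illustrative.

import numpy as np

def thin_intervals(points, M, width=16, max_count=1):
    """Project each point onto the fM line (f = furthest point from M),
    bucket the projection lengths into fixed-width intervals, and report
    intervals holding <= max_count points (candidate anomaly cuts)."""
    dists = np.linalg.norm(points - M, axis=1)
    f = points[dists.argmax()]
    d = (f - M) / np.linalg.norm(f - M)
    proj = (points - M) @ d
    edges = np.arange(proj.min() // width * width, proj.max() + width, width)
    hist, _ = np.histogram(proj, bins=edges)
    return [(edges[i], edges[i + 1], int(hist[i]))
            for i in range(len(hist)) if 0 < hist[i] <= max_count]

# Usage on a small 2-D cloud with one far-out point:
pts = np.array([[1, 1], [2, 1], [1, 2], [2, 2], [3, 2], [40, 38]], float)
print(thin_intervals(pts, pts.mean(axis=0)))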
62
1. MapReduce FAUST. Current_Relevancy_Score =9. Killer_Idea_Score=2
1. MapReduce FAUST. Current_Relevancy_Score = 9; Killer_Idea_Score = 2. Nothing comes to mind as to what we would do here. MapReduce/Hadoop is a key-value approach to organizing complex BigData. In FAUST PREDICT/CLASSIFY we start with a Training TABLE, and in FAUST CLUSTER/ANOMALIZER we start with a vector space. Mark suggests (my understanding) capturing pTreeBases as Hadoop/MapReduce key-value bases? I suggested to Arjun developing XML to capture Hadoop datasets as pTreeBases. The former is probably wiser. A wish list of great things that might result would be a good start.

2. pTree Text Mining. Current_Relevancy_Score = 10; Killer_Idea_Score = 9. I think Oblique FAUST is the way to do this. Also there is the very new idea of capturing the reading sequence, not just the term-frequency matrix (lossless capture), of a corpus.

3. FAUST CLUSTER/ANOMALIZER. Current_Relevancy_Score = 9; Killer_Idea_Score = 9. No one has taken up the proof that this is a breakthrough method. The applications are unlimited!

4. Secure pTreeBases. Current_Relevancy_Score = 9; Killer_Idea_Score = 10. This seems straightforward and a certainty (to be a killer advance)! It would involve becoming the world expert on what data security really means and how it has been done by others, and then comparing our approach to theirs. Truly a complete career is waiting for someone here!

5. FAUST PREDICTOR/CLASSIFIER. Current_Relevancy_Score = 9; Killer_Idea_Score = . No one has done a complete analysis of whether this is a breakthrough method. The applications are unlimited here too!

6. pTree Algorithmic Tools. Current_Relevancy_Score = 10; Killer_Idea_Score = . This is Md's work. Expanding the algorithmic tool set to include quadratic tools and even higher degree tools is very powerful. It helps us all!

7. pTree Alternative Algorithm Implementations. Current_Relevancy_Score = 9; Killer_Idea_Score = . This is Bryan's work: implementing pTree algorithms in hardware/firmware (e.g., FPGAs) for orders-of-magnitude performance improvement?

8. pTree O/S Infrastructure. Current_Relevancy_Score = 10; Killer_Idea_Score = . This is Matt's work. I don't yet know the details, but Matt, under the direction of Dr. Wettstein, is finishing up his thesis on this topic (such changes as very large page sizes, cache sizes, prefetching, ...). I give it a 10/10 because I know the people; they do double-digit work always!

From: (sender not recorded). Sent: Thurs, Aug. Dear Dr. Perrizo, do you think a MapReduce class of FAUST algorithms could be built into a thesis? If the ultimate aim is to process big data, modification of existing pTree-based FAUST algorithms on the Hadoop framework could be something to look at? I am myself not sure how far I can go, but if you approve, then I can work on it.

From: Mark, to Arjun, Aug 9. From an industry perspective, Hadoop is king (at least at this point in time). I believe vertical data organization maps really well to a map/reduce approach; these are complementary, as Hadoop is organized more for unstructured data, so these topics are not mutually exclusive. So from the industry side I'd vote Hadoop... from the Treeminer side, text (although we are very interested in both).

From: (sender not recorded). Sent: Friday, Aug 10. I'm working through a list of what we need to get done; it will include implementing anomaly detection, which has been on my list for some time. I tried to establish a number of things such that even if we had some difficulties with some parts, we could show others (without digging us in too deep). Once I get this I'll get a call going.
I have another programming resource down here who has been working with me on our production code and who will also be picking up some of the work to get this across the finish line, and I also have someone who was previously a director at our customer assisting us in packaging it all up so the customer will perceive value received… I think Dale sounded happy yesterday.
63
pTree Text Mining: data Cube layout
Corpus pTreeSet data Cube layout:
- level-0: corpusP (length = MaxDocLen * DocCt * VocabLen), the raw position-by-document-by-term bits.
- level-1 (length = DocCt * VocabLen): term-frequency pTrees tfP_k rolled up over position (stride = mdl), e.g., the predicate for tfP0 is mod(sum(mdl-stride), 2) = 1; also te, term existence (tePt=a, tePt=again, tePt=all, ... over docs d=1, d=2, d=3).
- level-2 (length = VocabLen): document-frequency slices dfP_k and counts df (pred = pure1 on tfP1, stride 1), and hdfP.
- ptf, positional term frequency: the frequency of each term in each position across all documents (is this any good?).
- Masks move us up the semantic hierarchies: document-category masks (e.g., a Math book mask, Library of Congress masks) move us up the document semantic hierarchy; reading-position masks (position categories, e.g., d=1 Preface, d=1 commas, d=1 References) move us up the position semantic hierarchy (and allow punctuation etc. placement).
(The slide shows this for a toy vocabulary (a, again, all, always, an, and, apple, April, are, ...) over documents such as JSE, HHS, LMM, with positions 1..7.)
64
11 docs of the 15 (11 survivors of the content word reduction).
In this slide section, the vocabulary is reduced to content words (8 of them): mdl = 5, vocab = {baby, cry, dad, eat, man, mother, pig, shower}, VocabLen = 8, and there are 11 docs of the 15 (the 11 survivors of the content word reduction): docs 04, 05, 08, 09, 27, 29, 46, 53, 54, 71, 73.

First Content Word Mask, FCWM. Level-1 rolls up the position dimension of level-0 (giving tf and te per document); level-2 rolls up the document dimension of level-1 (giving the df slices df1, df0 and the df counts over the VOCAB {baby, cry, dad, eat, man, mother, pig, shower}).
65
Level-0 (ordered by position, document, then vocab)
Level-0 is ordered by position, then document, then vocab. Each document has 5 reading positions (e.g., positions 1..5 for doc=04LMM, Little Miss Muffet; 7..10 for 05HDS; 12..15 for 08JSC; and so on). The level-0 rows list (term, doc, tf, tf1, tf0, te); for instance: baby occurs in 04LMM, 05HDS, 08JSC, 09HBD, 27CBC, 29LFW, 46TTP, 53NAP, 54BOF, 71MWA, 73SSW; cry in 04LMM, 09HBD, 27CBC, 46TTP; dad in 04LMM, 27CBC, 29LFW; eat in 04LMM, 08JSC; man in 04LMM, 05HDS, 53NAP; mother in 04LMM; pig in 04LMM, 46TTP, 54BOF; shower in 04LMM, 71MWA, 73SSW. Level-1 rolls up position; level-2 rolls up document, giving df (with slices df1, df0) per vocab term.
66
Applying the algorithm to C4:
FAUST = Fast, Accurate Unsupervised and Supervised Teaching (teaching big data to reveal information).

FAUST CLUSTER-fmg (furthest-to-mean gaps, for finding round clusters): C = X (e.g., X ≡ {p1, ..., pf}, the 15-pixel dataset). While an incomplete cluster C remains:
  find M ≡ Medoid(C) (Mean or Vector_of_Medians or ?);
  pick f ∈ C furthest from M, from S ≡ SPTreeSet(D(x,M)) (e.g., HOBbit furthest f: take any from the highest-order S-slice);
  if ct(C)/dis^2(f,M) > DT (DensityThreshold), C is complete;
  else split C where P ≡ PTreeSet(c o fM / |fM|) has a gap > GT (GapThreshold).
End While.
Notes: a. Euclidean vs. HOBbit furthest. b. fM/|fM| vs. just fM in P. c. Find gaps by sorting P, or by an O(log n) pTree method?

Applying the algorithm: C2 = {p5} is complete (a singleton = outlier). C3 = {p6, pf} will split (details omitted), so {p6} and {pf} are complete (outliers). That leaves C1 = {p1,p2,p3,p4} and C4 = {p7,p8,p9,pa,pb,pc,pd,pe} still incomplete. C1 is dense (density(C1) ≈ .5 > DT = .3?), thus C1 is complete; f1 = p3, and C1 doesn't split. Applying the algorithm to C4: {pa} is an outlier, and the remainder splits into {p9} and {pb,pc,pd} (complete). In both cases those probably are the best "round" clusters, so the accuracy seems high. The speed will be very high! (The slide's scatter plots, the 15-point coordinates, the D(x,M0) column (2.2, 3.9, 6.3, 5.4, 3.2, 1.4, 0.8, 2.3, 4.9, 7.3, 3.8, 3.3, 1.8, 1.5, ...), and the "interlocking horseshoes with an outlier" picture accompany the text.)
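A rough scalar sketch of CLUSTER-fmg in Python, under the simplifications flagged in the notes above: mean as the medoid, Euclidean furthest point, and gap-finding by sorting the projections rather than the O(log n) pTree method. The thresholds and names are illustrative.

import numpy as np

def faust_cluster_fmg(X, dens_thresh=0.5, gap_thresh=3.0):
    """Repeatedly: find medoid M, furthest point f; if ct(C)/dis^2(f,M)
    exceeds the density threshold the cluster is complete, else project
    onto the fM line and split at the largest gap (if it exceeds GT)."""
    work, done = [np.arange(len(X))], []
    while work:
        C = work.pop()
        M = X[C].mean(axis=0)
        dists = np.linalg.norm(X[C] - M, axis=1)
        f = X[C][dists.argmax()]
        if len(C) <= 1 or len(C) / dists.max() ** 2 > dens_thresh:
            done.append(C)                     # dense, or singleton outlier
            continue
        d = (f - M) / np.linalg.norm(f - M)
        proj = (X[C] - M) @ d
        order = np.argsort(proj)
        gaps = np.diff(proj[order])
        i = gaps.argmax()
        if gaps[i] > gap_thresh:               # split at the largest gap
            work += [C[order[:i + 1]], C[order[i + 1:]]]
        else:
            done.append(C)                     # no usable gap: complete
    return done

# Two well-separated toy blobs come back as two clusters of 8.
pts = np.vstack([np.random.default_rng(2).normal(c, 0.4, (8, 2))
                 for c in ([0, 0], [9, 9])])
print([len(c) for c in faust_cluster_fmg(pts)])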