Published by Esmond Gray. Modified over 9 years ago.
Revisiting FAUST for One-class Classification (one class C)

I. Let x be an unclassified sample. Let D_x be the vector from VoM_C to x. Use UDR to construct the count distribution of D_x o C (down to 2^k intervals for some small k). Use as a cut point the point where the D_x o C count drops below a threshold (e.g., 0), starting from the VoM_C side (or use as the cut point the last Precipitous Count Decrease?). Classify x according to where D_x o x falls with respect to that cut point. There may be a trick or shortcut that would speed this up markedly! Also, we might classify x as "not in C" if it is gapped away from C in the D_x distribution, e.g., 2 3 7 6 4 8 0 0 D_x o x 0 0 0 4 5 7 8 9

Classifying a large batch, X, this way may be slow, since we'd start over with a new D_x for each x ∈ X. If we have an unclassified sample set X = {x_1..x_n}, deriving the SPTSs D_(x_k) o C may be doable as a batch (one loop through C)?

II. (I is lazy; II is model-based.) Approximate the entire boundary of C first and then use it as our model-based classifier for X. C might be approximated as the set of points "inside" the intersection of half-spaces, each of which is the C side of a hyperplane; i.e., get a series of (d,a) pairs, each of which defines a half-space as, e.g., {z | d o z > a} (simplest? d_k = e_k). Next, AND the masks X o d > a, giving the X points which get classified into C (the > above will be < for some of the half-spaces). The hard question remaining here is how to determine the series of (d,a) pairs:
1. Choose the next d to be perpendicular to all previous ones (e.g., use as the series e_1, e_2, ..., e_n).
2. Use the diagonals, e's, mean-to-median, mean-to-furthest, ...
3. Start with {e_i}. Add a finer and finer grid of unit vectors until the diameter of the C-approximation is close to the diameter of C.

III.
For very high value and durable training sets (e.g., 10 years of normal communications known not to be associated with a terrorist plot, because there was no terrorist activity over those 10 years), we might want to try to build a better model than II by analyzing the corners more carefully.

Let C_1 be the 1st approximation by circumscribing hyperplanes (for each k, the lower bounding hyperplane L_k = {x | x_k = min X_k} and the upper bounding hyperplane H_k = {x | x_k = max X_k}). Classify x into C iff x is in C_1 (min X_k ≤ x_k ≤ max X_k for every k). We can replace min X_k by the lowest large count change and max X_k by the highest, or use some other outlier elimination process.

Does C fill the corners of C_1? (For high dimensions, corners can be huge and C can have a very different shape in each corner!) We could try to cap each corner with a round cap (r 2, r 4, ...). Or take each diagonal and sub-diagonal and cap to it, using the D's: D_12 = e_1 + e_2 (note Y o D_12 = Y_1 + Y_2), D_123 = e_1 + e_2 + e_3 (Y o D_123 = Y_1 + Y_2 + Y_3), etc.

This general method (of enclosing classes with piecewise linear boundaries tied to sums of dimensional unit vectors and their negatives) may be a good model-based method for multiclass classification as well??!!
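The model-based idea in II and III can be sketched in a few lines. This is only an illustration under stated assumptions: the training class is a small made-up list of 2-D points named C, the function names (fit_box, fit_diagonal, in_model) are hypothetical, and plain Python lists stand in for pTree masks. It fits the circumscribing box C_1 from per-dimension min and max, then adds one diagonal cap using D_12 = e_1 + e_2 (so the projection is just the sum of the two coordinates).

```python
def fit_box(C):
    """Per-dimension (min, max) of the training class: the hyperplanes L_k and H_k."""
    dims = range(len(C[0]))
    return [(min(p[k] for p in C), max(p[k] for p in C)) for k in dims]


def fit_diagonal(C, dims):
    """Bounds of C projected on D = sum of e_k for k in dims (Y o D = sum of those columns)."""
    sums = [sum(p[k] for k in dims) for p in C]
    return min(sums), max(sums)


def in_model(x, box, diag_caps):
    """True iff x satisfies every half-space: the box sides and each diagonal cap."""
    inside_box = all(lo <= x[k] <= hi for k, (lo, hi) in enumerate(box))
    inside_caps = all(
        lo <= sum(x[k] for k in dims) <= hi for dims, (lo, hi) in diag_caps.items()
    )
    return inside_box and inside_caps


# Made-up training class that leaves the upper-right corner of its box empty.
C = [(1, 1), (2, 1), (1, 2), (2, 2), (3, 1), (1, 3)]
box = fit_box(C)                            # C_1: the circumscribing box
caps = {(0, 1): fit_diagonal(C, (0, 1))}    # cap along D_12 = e_1 + e_2

print(in_model((2, 2), box, caps))  # inside box and cap
print(in_model((3, 3), box, {}))    # box alone accepts the empty corner
print(in_model((3, 3), box, caps))  # the diagonal cap rejects it
```

The point (3,3) shows why the corners matter: it passes every axis-parallel test but fails the single diagonal test.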
THESES

Mohammad and Arjun are approaching the deadline for finishing, and since they are working in related areas, I thought I would try to lay out my understanding of what your theses will be (to start the discussion).

Mohammad's thesis might be titled something like "Horizontal Operators and Operations for Mining Vertical Data" and could detail and compare the performance of all the SPTS operators we use and the various implementation methods. Keep in mind that "best" may vary depending upon lots of things, such as the type of data, the type of data mining, the size of the data, the complexity of the data, etc. Even though I often recommend a paper-type thesis, it seems that non-paper theses are more valuable (witness how many times I refer to Yue Cui's).

Arjun's thesis could be titled "Performance Evaluation of the FAUST Methodology for Classification, Prediction and Clustering" and will compare the performance of all data mining methods in the FAUST genre (to the others in the FAUST genre and at least roughly to the other main methods out there). The point should be made up front that for big vertical data there aren't many implementations, and the issue is speed, because applying traditional methods (to the corresponding horizontal version of the data) takes much too long. The comparison to traditional horizontal data methods can be limited to showing that pTree methods compare favorably to those others on accuracy; with respect to speed, the comparison can be a rough Big-O comparison (and might also bring in the things Dr.
Wettstein pointed out to us; see the 1_4_14 notes). Of course, give a reference if you do.

The structure chart for FAUST might be:

FAUST
    Classification Method (cut point goes where?)
        Midpt of Means
        Midpt of VoMs
        STD ratio of Means
        STD ratio of Medians
    Clustering Method
        Sequence of D-lines?
            Mean-VoM
            Cycle_diags
            Mean-furthest
        Cut pt goes where?
            gap
            count_change
            others (we did some others that I can't recall at the moment)
    ARM?

Then any of these modules might call any or all of Mohammad's SPTS procedures and some of my stuff, as well as Dr. Wettstein's procedures. These procedures include: dot product, add/subtract/multiply/multiply-by-constant of SPTSs, ...

My thinking was that you would performance analyze the structure chart stuff above, and Mohammad would detail his 2's complement stuff and then performance analyze it (and various implementations of his stuff) as well as the other lower level procedural stuff. Both of you would consider the various dataset types and sizes, and both would probably quote the results of the other.
Here's the kind of thing that Md's thesis will detail (essentially on SPTS operations).

Computing Squared Euclidean Distance, SED, from a point p: (Y - p) o (Y - p), where Y is a set and p is a fixed point in n-space.

\[ Y \circ p = \sum_{i=1}^{n} y_i p_i \]
\[ ED(y,p) = \sqrt{\sum_{i=1}^{n} (y_i - p_i)^2} \]
\[ SED(y,p) = \sum_{i=1}^{n} (y_i - p_i)^2 = \sum_{i=1}^{n} (Y_i - p_i)(Y_i - p_i) = \sum_{i=1}^{n} (Y_i Y_i - 2 p_i Y_i + p_i^2) = \sum_{i=1}^{n} Y_i Y_i - 2 \sum_{i=1}^{n} p_i Y_i + p \circ p \]

Md: I can calculate (Y_i - p_i) using 2's complement and then multiply (Y_i - p_i) by (Y_i - p_i) to get (Y_i - p_i)^2, then add them for i=1..n, which will give me the SED (Squared Euclidean Distance). But if we break it up:

\[ \sum_{i=1}^{n} (Y_i - p_i)^2 = \sum_{i=1}^{n} (Y_i^2 - 2 Y_i p_i + p_i^2) = \sum_{i=1}^{n} Y_i^2 - 2 \sum_{i=1}^{n} Y_i p_i + \sum_{i=1}^{n} p_i^2 \]

I think we need more multiplications than additions, and multiplication is an expensive operation. I have a little example comparing these two methods.
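To make the two formulations concrete, here is a minimal sketch that computes the SED both ways and checks that they agree. It is not Md's example and does not use pTrees or 2's complement arithmetic: rows are plain Python tuples, and the function names sed_direct and sed_expanded are hypothetical. The sample points are the first four Spaeth points, and p = (3, 1) is chosen arbitrarily.

```python
def sed_direct(Y, p):
    """SED per row as sum_i (y_i - p_i)^2: subtract first, then square and add."""
    return [sum((yi - pi) ** 2 for yi, pi in zip(y, p)) for y in Y]


def sed_expanded(Y, p):
    """SED per row as sum_i y_i^2 - 2 * sum_i y_i p_i + p o p."""
    p_dot_p = sum(pi * pi for pi in p)  # scalar, computed once for the whole set
    result = []
    for y in Y:
        y_dot_y = sum(yi * yi for yi in y)               # does not depend on p
        y_dot_p = sum(yi * pi for yi, pi in zip(y, p))   # the only p-dependent column work
        result.append(y_dot_y - 2 * y_dot_p + p_dot_p)
    return result


Y = [(1, 1), (3, 1), (2, 2), (3, 3)]
p = (3, 1)

print(sed_direct(Y, p))
print(sed_expanded(Y, p))
```

One thing the expanded form makes visible: the sum of Y_i^2 does not depend on p, so it could in principle be computed once and reused for many points p, while p o p is a single scalar. Whether that offsets the extra multiplications on bit-sliced data is exactly the kind of question the comparison would have to settle.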
Improved Oblique FAUST

Cuts are made at count changes, not just at gaps. Count changes reveal the entry or exit of a cluster by the perpendicular hyperplane. This improves Oblique FAUST's ability to cluster big data (compared to cutting only at gaps).

We tried Improved Oblique FAUST on the Spaeth dataset successfully: it produces a full dendrogram of sub-clusterings by recursively taking the dot product with the vector from the Mean to the VOM (Vector Of Medians) and by cutting at each 25% count change in the interval count distribution produced by the UDR procedure with interval widths of 2^3.

We claim that an appropriate count change will reveal cluster boundaries almost always; i.e., almost always a precipitous count increase will occur as the cut hyperplane enters a cluster and a precipitous count decrease will occur as the cut hyperplane exits a cluster. We also claim that Improved Oblique FAUST will scale up for big data, because entering and leaving clusters "smoothly" (without noticeable count change) is no more likely for big data than for small (since it's a measure=0 phenomenon).

For the count changes to reveal themselves, it may be necessary in some data settings to look for a change pattern over a distribution window, because entering a round cluster may not produce a large abrupt change in counts but may produce a noticeable change pattern over a window of counts. It may be sufficient for this purpose to just use a naive windowing in which we stop the UDR count distribution generation process at intervals of width 2^k for some small value of k and look for consecutive count changes in that rough count distribution. This approach appears to be effective and is fast. We built the distribution down to intervals of width 2^3 = 8 for the Spaeth dataset, which has diameter 114. So, for Spaeth we stopped UDR at interval widths equal to 7% of the overall diameter (8/114 = .07).

Outliers, especially exterior outliers, can produce a bad diameter estimate.
To get a good cluster diameter estimate, we should identify and mask off exterior outliers first (before applying the Pythagorean diameter estimation formula). Cluster outliers can be identified as singleton sub-clusters that are sufficiently gapped away from the rest of the cluster. Note that a pure outlier or anomaly detection procedure need not use the Improved Oblique FAUST method, since outliers are always surrounded by gaps and they do not produce big count changes.

Points furthest from [or just far from] the VOM are high probability candidates for exterior outliers. These can be identified and then checked for outliers by creating the SPTS of squared distances from the VOM, (Y - VOM) o (Y - VOM), and using just the high end of the UDR to mask those candidates. Of course, points that project at the extremes of any dot product projection set are outlier candidates too.
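The count-change cutting rule for Improved Oblique FAUST can be sketched on a rough interval count distribution. Several things here are assumptions rather than the method as implemented: the function name count_change_cuts is hypothetical, the counts are made up, and "p% count change" is interpreted as the change between adjacent interval counts relative to the larger of the two (the slides do not say what the percentage is relative to).

```python
def count_change_cuts(counts, width, low=0, rel_change=0.25):
    """Return (cut_position, direction) at each large change between adjacent interval counts.

    counts[i] is the number of projected values in [low + i*width, low + (i+1)*width).
    ASSUMPTION: a change is "large" when |b - a| / max(a, b) >= rel_change.
    """
    cuts = []
    for i in range(len(counts) - 1):
        a, b = counts[i], counts[i + 1]
        bigger = max(a, b)
        if bigger == 0:
            continue  # two empty intervals in a row: nothing changed
        if abs(b - a) / bigger >= rel_change:
            position = low + (i + 1) * width  # boundary between interval i and i+1
            cuts.append((position, "increase" if b > a else "decrease"))
    return cuts


# Made-up distribution at interval width 8: two dense runs separated by a sparse stretch.
counts = [1, 8, 9, 8, 1, 1, 7, 8]
cuts = count_change_cuts(counts, width=8, rel_change=0.5)
print(cuts)
```

An "increase" marks the hyperplane entering a cluster and a "decrease" marks it exiting; a gap shows up as a decrease followed later by an increase, so gap cuts are a special case of this rule.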
FAUST Technology for Clustering and Classification is built for speed improvements so that big data can be mined in human time.

Improved Oblique FAUST places cuts at all large Count Changes, each of which reveals a cluster boundary almost always (i.e., almost always a large count decrease occurs iff we are exiting a cluster on the cut hyperplane and a large count increase occurs iff we are entering a cluster). IO FAUST makes a cut at each large count change in the y o d values. (A gap is a large decrease followed by a large increase, so gaps are included.) IO FAUST is Divisive Hierarchical Clustering, which builds a cluster dendrogram. IO FAUST will scale up, because entering and leaving a cluster "smoothly" (w/o noticeable count change) is no more likely for large datasets than for small. (It's a measure=0 phenomenon.) Do we need BARREL FAUST at all now?

A radius estimate for a set, Y, is SQRT( (width(Y o d)/2)^2 + (max d-barrel radius)^2 ), assuming all outer edge outliers have been removed.

Density Uniformity (DU) of a sub-cluster might be defined as the reciprocal of the variance of the counts. A cluster dendrogram should have a Density = count/volume label and a Density Uniformity = reciprocal_of_count_variance label on each edge. We can end a dendrogram branch as soon as Density and Density Uniformity are high enough (> thresholds DET and DUT) to save time.

We can [quickly] estimate Density as count/(c_n r^n). We have the count, a radius estimate, and n; c_n is a known constant (the volume of the unit n-ball, e.g., c_2 = π, c_3 = 4π/3, ...). In advance, we decide on a Density Threshold, DET, and a Density Uniformity Threshold, DUT. To choose the "best" clustering, we proceed depth first until the DET and DUT thresholds are met.

Oblique FAUST Code Layering? A layer (or object or black box or procedure) in the code called the CUTTER:
INPUTS:
I.1. SPTS
I.2. Method: Cut_at?
    I.2.a. p%_CountChange
    I.2.b. non-uniform thresholds?
    I.2.c. centers of gaps only
I.3.
Return sub-cluster masks (Y/N), since it is an expensive step and therefore we wouldn't want to do it unless the count was needed.
OUTPUTS:
O.1. A pointer to a mask pTree for each new "sub-cluster" (i.e., identifying each set of points separated by consecutive cuts).
O.2. The 1-count of each of those mask pTrees.

GRAMMER:
INPUTS:
I.1. An existing labeled dendrogram (labeled with, e.g., the unit vector that produced it, the density of each edge sub-cluster, ...) including the tree of pointers to a mask pTree for each node (incl. the root, which need not be all of the original set).
I.2. The new threshold levels (if, e.g., the density threshold is lower than that of the existing dendrogram, GRAMMER prunes the dendrogram).
OUTPUTS:
O.1. The new labeled dendrogram.

TREEMINER UPDATE

Mark has a Hadoop-MapReduce version going with Oblique FAUST to do classification and 1-class classification. He uses a Smart Distributed File System which turns tables on their side, so columns (SPTSs and therefore bit slices) are MapReduce rows. Then each node has access to a section of rows. So each node gets a section of the original column set. Those columns are also cut into sections.

WHAT IS NEEDED:
1. An Auto K Clusterer, for when there is no preconceived idea as to how many clusters there should be. Improved Oblique FAUST should help.
2. A New Cluster Finder (e.g., for finding anomalies). Improved Oblique FAUST should help. We need to track clusters over time (e.g., in a corpus of documents with new ones coming in). If a new batch of rows is added (e.g., documents), and if IO FAUST has already established a cluster dendrogram from a tree of dot product vectors, density settings, etc., we just apply those to the new batch. We establish the new dendrogram (or just the new version of the single cluster being watched) with:
a.
Establish a new set of count changes based on count changes in the new batch and those in the original (count changes in the new batch that are significant enough to be count changes of the composite and, rarely, count decreases of the batch that coincide with count increases of the original, and vice versa). (However, I don't think this incremental method will work for us!)
b. Redo UDR from scratch on the composite distribution.
3. A real-time Cluster Analyzer (If I change this parameter, how does this cluster change?). The user should be able to isolate a cluster, use sliders to tune weightings (e.g., rotate the D-line), and change density and DU levels.
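The edge labels that IO FAUST attaches to a dendrogram (radius estimate, Density, Density Uniformity) are simple enough to sketch directly from their definitions. The function names are hypothetical, and c_n is taken to be the volume of the unit n-ball, computed with the standard closed form pi^(n/2) / Gamma(n/2 + 1); treating an all-equal count distribution as infinitely uniform is a choice made here, not something the slides specify.

```python
import math


def radius_estimate(projection_width, max_barrel_radius):
    """SQRT((width(Y o d)/2)^2 + (max d-barrel radius)^2), outliers assumed already removed."""
    return math.sqrt((projection_width / 2) ** 2 + max_barrel_radius ** 2)


def unit_ball_volume(n):
    """c_n: volume of the unit ball in n dimensions (c_2 = pi, c_3 = 4*pi/3)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)


def density(count, radius, n):
    """Density estimate count / (c_n * r^n)."""
    return count / (unit_ball_volume(n) * radius ** n)


def density_uniformity(interval_counts):
    """Reciprocal of the (population) variance of the interval counts."""
    mean = sum(interval_counts) / len(interval_counts)
    variance = sum((c - mean) ** 2 for c in interval_counts) / len(interval_counts)
    return math.inf if variance == 0 else 1 / variance  # ASSUMPTION: zero variance -> inf


r = radius_estimate(projection_width=6, max_barrel_radius=4)
print(r)
print(density(count=20, radius=r, n=2))
print(density_uniformity([1, 3]))
```

A branch can then be ended as soon as both labels exceed DET and DUT, which is the depth-first stopping rule described for choosing the "best" clustering.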
Choosing a clustering from a DEL and DUL labeled Dendrogram

The algorithm for choosing the optimal clustering from a labeled dendrogram is as follows: Let DET=.4 and DUT=½.

[Figure: dendrogram with nodes A through G, each labeled with DEL and DUL. Legible labels: DEL=.1, DUL=1/6; DEL=.2, DUL=1/8; DEL=.4, DUL=1; DEL=.5, DUL=½; DEL=.3. The remaining labels are blank.]

Since a full dendrogram is far bigger than the original table, we set threshold(s) and build a partial dendrogram (ending a branch when the threshold(s) are met). Then a slider for density would work as follows:
The user sets the threshold(s). We give the clustering.
The user increases the threshold(s). We prune the dendrogram and give the clustering.
The user decreases the threshold(s). We build each branch down further until the new threshold(s) are exceeded and give the new clustering.
We might also want to display the dendrogram to the user and let him select a "root" for further analysis, etc.
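One way to read "choosing a clustering from a labeled dendrogram" is a depth-first walk that stops at the first node on each branch whose labels meet both thresholds. The sketch below assumes that reading. The Node class, the function choose_clustering, and the tree shape are hypothetical; the label values are borrowed from the figure, but which node carries which label (and the DUL given to E) is an assumption, since the figure did not survive.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    density: float      # DEL label
    uniformity: float   # DUL label
    children: list = field(default_factory=list)


def choose_clustering(node, det, dut):
    """Depth first: keep a node once DEL >= det and DUL >= dut; keep leaves regardless."""
    meets_thresholds = node.density >= det and node.uniformity >= dut
    if meets_thresholds or not node.children:
        return [node.name]
    chosen = []
    for child in node.children:
        chosen.extend(choose_clustering(child, det, dut))
    return chosen


# Hypothetical tree; node-to-label assignment is assumed for illustration.
root = Node("A", 0.1, 1 / 6, [
    Node("B", 0.2, 1 / 8, [
        Node("D", 0.5, 0.5),
        Node("E", 0.3, 0.25),
    ]),
    Node("C", 0.4, 1.0),
])

print(choose_clustering(root, det=0.4, dut=0.5))  # thresholds from the slide
print(choose_clustering(root, det=0.2, dut=0.1))  # lower thresholds: coarser clustering
```

The second call mimics the slider being moved down: with looser thresholds the walk stops higher in the tree, which corresponds to pruning. Raising the thresholds past what the stored tree can satisfy is the case where branches would have to be built down further.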
APPLYING CC FAUST TO SPAETH

[Figure: scatter plot of the 15 Spaeth points y1..yf on a grid with hex axes 0..f, with the count distribution 1 3 1 0 2 0 6 2 along the MA line; MA cut at 7 and 11.]

Density (Count/ r^2) labeled dendrogram for LCC FAUST on Spaeth with D=AvgMedian:
DET=.3: Y (.15); {y1,y2,y3,y4,y5} (.37); {y6,yf} (.08); {y7,y8,y9,ya,} (.07); {y7,y8,y9,ya} (.39); {yb,yc,yd,ye} (1.01); {y6} ( ); {yf} ( )
D=AM, DET=.5: {y1,y2,y3,y4} (.63); {y5} ( ); {y7,y8,y9} (1.27); {ya} ( )
D=AM, DET=1: {y1,y2,y3} (2.54); {y4} ( )

Density (Count/ r^2) labeled dendrogram for LCC FAUST on Spaeth with D=cycles through diagonals nnxx, nxxn, nnxx, nxxn, ...:
DET=.3: Y (.15); {y1,y2,y3,y4,y5} (.37); {y6,y7,y8,y9,ya,,yf} (.09); {y6,y7,y8,y9,ya} (.17); {yb,yc,yd,ye,yf} (.25); {yf} ( ); {yb,yc,yd,ye} (1.01); {y7,y8,y9,ya} (.39); {y6} ( )

D-line labeled dendrogram for LCC FAUST on Spaeth with D=furthestAvg:
DET=.3: Y (.15); {y1,y2,y3,y4,y5} (.37); {y6,yf} (.08); {y7,y8,y9,ya,} (.07); {y7,y8,y9,ya} (.39); {yb,yc,yd,ye} (1.01); {y6} ( ); {yf} ( )
UDR, the Univariate Distribution Revealer (on Spaeth)

Point  y1  y2  yofM
y1      1   1    11
y2      3   1    27
y3      2   2    23
y4      3   3    34
y5      6   2    53
y6      9   3    80
y7     15   1   118
y8     14   2   114
y9     15   3   125
ya     13   4   114
yb     10   9   110
yc     11  10   121
yd      9  11   109
ye     11  11   125
yf      7   8    83

[Figure: the bit slices p6..p0 of yofM and their complements p6'..p0', ANDed level by level to produce the interval masks and counts below.]

Interval counts of yofM at each interval width:
width 64: [0,64) 5; [64,128) 10
width 32: [0,32) 3; [32,64) 2; [64,96) 2; [96,128) 8
width 16: [0,16) 1; [16,32) 2; [32,48) 1; [48,64) 1; [64,80) 0; [80,96) 2; [96,112) 2; [112,128) 6
width 8: [0,8) 0; [8,16) 1; [16,24) 1; [24,32) 1; [32,40) 1; [40,48) 0; [48,56) 1; [56,64) 0; [64,72) 0; [72,80) 0; [80,88) 2; [88,96) 0; [96,104) 0; [104,112) 2; [112,120) 3; [120,128) 3

Pre-compute and enter into the ToC all DT(Y_k), plus those for selected Linear Functionals (e.g., d = main diagonals, ModeVector). Suggestion: In our pTree-base, every pTree (basic, mask, ...) should be referenced in ToC(pTree, pTreeLocationPointer, pTreeOneCount), and these OneCts should be repeated everywhere (e.g., in every DT).
The reason is that these OneCts help us in selecting the pertinent pTrees to access, and in fact are often all we need to know about the pTree to get the answers we are after.

UDR applied to S, a column of numbers in bit-slice format (an SPTS), will produce the Distribution Tree of S, DT(S). depth(DT(S)) = b ≡ BitWidth(S); h = depth of a node; k = node offset. Node (h,k) has a pointer to the pTree {x ∈ S | F(x) ∈ [k·2^(b-h+1), (k+1)·2^(b-h+1))} and its 1-count.

[Figure: DT(S) for the Spaeth yofM column. 1-counts per level:]
depth h=0: 15
depth h=1: 5 10
depth h=2: 3 2 2 8
depth h=3: 1 2 1 1 0 2 2 6
depth h=4: 0 1 1 1 1 0 1 0 0 0 2 0 0 2 3 3
(e.g., node 2,3 is the interval [96,128))
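The level-by-level construction of DT(S) can be sketched with ordinary integers standing in for pTrees. This is an illustration only: Python ints are used as uncompressed bitmaps (bit r of slice j is set when row r has bit j set), the function names are hypothetical, and b here is the number of bits (7 for values below 128), so a node at depth h covers an interval of width 2^(b-h). The input is the yofM column of the Spaeth table.

```python
def bit_slices(values, b):
    """Slice j as an int bitmap: bit r is set when row r's value has bit j set."""
    return [sum(((v >> j) & 1) << r for r, v in enumerate(values)) for j in range(b)]


def distribution_tree(values, b, max_depth):
    """levels[h][k] = 1-count of the mask for [k * 2^(b-h), (k+1) * 2^(b-h))."""
    slices = bit_slices(values, b)
    level = [(1 << len(values)) - 1]   # depth 0: one mask covering every row
    levels = [[len(values)]]
    for h in range(1, max_depth + 1):
        p = slices[b - h]              # next highest-order bit slice
        next_level = []
        for mask in level:
            next_level.append(mask & ~p)  # bit is 0: lower half of the interval (p')
            next_level.append(mask & p)   # bit is 1: upper half of the interval (p)
        level = next_level
        levels.append([bin(m).count("1") for m in level])
    return levels


yofM = [11, 27, 23, 34, 53, 80, 118, 114, 125, 114, 110, 121, 109, 125, 83]
levels = distribution_tree(yofM, b=7, max_depth=4)
for h, counts in enumerate(levels):
    print(h, counts)
```

Each level costs one AND per node against a single bit slice (or its complement), and stopping at depth 4 gives the width-8 distribution used for Spaeth.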