
FAUST Oblique Analytics (based on the dot product, o). Given a table X(X_1..X_n), |X|=N, and vectors D=(D_1..D_n), FAUST Oblique employs the ScalarPTreeSets.


1 FAUST Oblique Analytics (based on the dot product, o). Given a table X(X_1..X_n), |X|=N, and vectors D=(D_1..D_n), FAUST Oblique employs the ScalarPTreeSets (SPTS) of the valueTrees, X o D = Σ_{k=1..n} X_k D_k.
FC (FAUST Count Change clusterer): Choose Density (DT), DensityUniformity (DUT) and PrecipitousCountChange (PCCT) thresholds. If DT (and DUT) are not exceeded at a cluster C, partition C by cutting at each gap and/or PCC in C o D, using nextD.
  FCG cuts in the middle of C o D gaps (only). (This is the old version. It might be faster, but it usually chokes on big data.)
  FCP cuts at PCCs (gaps are PCC-cuts, of course).
Outlier Mining: Find the top k objects most dissimilar from the rest of the objects. This might mean:
  1.a Find {x_h | h=1..k} such that x_h maximizes distance(x_h, X-{x_j | j  h}).
  1.b Find the top set of k objects, S_k, that maximizes distance(X-S_k, S_k).
  2. Given a Training Set, X, identify outliers in each class (correctly classified but noticeably dissimilar from classmates), or fuzzy cluster X, i.e., assign a weight to each (object, cluster) pair. Then x is an outlier iff w(x,k) < OutlierThreshold ∀k.
  3. Examine individual new samples for outlierhood, assuming they come in after normalcy has been established by 1 or 2.
FDO (FAUST Distance-based Outlier Miner) uses D2NN = SquareDistance(x, X-{x}) = rankN(x-X) o (x-X). D2NN provides an instantaneous k-slider for 1.a (useful for the others too). Instantaneous? UDR on D2NN takes log_2 n time (and is a 1-time calculation); then a k-slider works instantaneously off that distribution, so there is no need to sort D2NN.
NextD is a sequence of D's, used when recursively partitioning X into clusters (constructing a Cluster Dendrogram for X), e.g.
  a. recursively, take the diagonal maximizing Standard Deviation, STD(C o D) [or maximizing STD(C o D)/Spread(C o D)];
  b. recursively, take AM(C o D) (Avg-to-Median); AFFA(C o D) (Avg-FurthestFromAvg); FFAFFFFA(C o D) (FFA-FurthestFromFFA);
  c. recursively cycle thru diagonals: e_1,...,e_n, e_1  e_2.. or cycle thru AM, AFFA, FFAFFFFA or cycle through both sets.
FP (FAUST Polygon for k-class classification, k=1..): X_{n+1} = ClassLabel. ∀D ∈ Dset, l_{D,k} = mn(C_k o D) (1st PCI?); h_{D,k} = mx(C_k o D) (last PCD?). y is declared to be class=k iff y ∈ Hull_k, where Hull_k = {z | l_{D,k} ≤ D o z ≤ h_{D,k} for all D}. (If y is in multiple hulls, H_{i_1}..H_{i_h}, y is a C_k for the k maximizing OneCount{P_{C_k} & P_{H_{i_1}} .. & P_{H_{i_h}}}, or fuzzy classify using those OneCounts as k-weights.)
FCO uses FC as an outlier miner. It identifies and removes large clusters using FCP, so outliers reveal themselves.
Dset is a set of D's used to build a model for fast classification (1-class or k-class) by circumscribing each class with a hull. The larger the Dset the better (for accuracy). For each D there is, however, the 1-time construction cost of l_{D,k} and h_{D,k}. Dset should include DA_{i,j}, the connector of Avg(C_i) and Avg(C_j), for i>j=1..k [and also the Median connectors?]. Should Dset include all D ∈ nextD? (Note: The old version used Dset = {DA_{i,j} | i>j=1..k} only.)
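The FP hull test on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `dot`, `build_hulls`, `classify`, `train` and `dset` are hypothetical, points are plain tuples rather than pTrees, and l_{D,k} and h_{D,k} are taken as the plain min and max of C_k o D (the slide also floats the first PCI and last PCD as alternatives).

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def build_hulls(train, dset):
    """For each class k and each D in dset, store (l_Dk, h_Dk) = (min, max) of C_k o D."""
    hulls = {}
    for label, points in train.items():
        bounds = []
        for d in dset:
            projections = [dot(x, d) for x in points]
            bounds.append((min(projections), max(projections)))
        hulls[label] = bounds
    return hulls

def classify(y, hulls, dset):
    """Return every class k with l_Dk <= D o y <= h_Dk for all D in dset."""
    return [label for label, bounds in hulls.items()
            if all(lo <= dot(y, d) <= hi for d, (lo, hi) in zip(dset, bounds))]

# Toy training set (assumed data): two classes in the plane.
train = {"a": [(0, 0), (1, 1), (1, 0)], "b": [(5, 5), (6, 5), (5, 6)]}
dset = [(1, 0), (0, 1), (1, 1)]          # e_1, e_2 and one diagonal
hulls = build_hulls(train, dset)
print(classify((1, 0.5), hulls, dset))   # inside the hull of class "a"
print(classify((3, 3), hulls, dset))     # inside no hull
```

An empty result means y falls outside every hull, which is how the same model doubles as a 1-class (outlier) test; a result with several labels is the multiple-hull case the slide resolves with OneCounts.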

2 X o D = Σ_{k=1..n} X_k*D_k, computed with pTrees.
For a table X(X_1...X_n), the SPTS X_k*D_k is the column of numbers x_k*D_k. X o D is the sum of those SPTSs, Σ_{k=1..n} X_k*D_k. With p_{k,b} the bit-b pTree of X_k and D_{k,b} bit b of D_k (b=B..0):
  X_k*D_k = D_k Σ_b 2^b p_{k,b} = 2^B D_k p_{k,B} +..+ 2^0 D_k p_{k,0} = D_k (2^B p_{k,B} +..+ 2^0 p_{k,0})
          = (2^B p_{k,B} +..+ 2^0 p_{k,0}) (2^B D_{k,B} +..+ 2^0 D_{k,0})
so
  X o D = Σ_{k=1..n} (
      2^{2B}   D_{k,B} p_{k,B}
    + 2^{2B-1} (D_{k,B} p_{k,B-1} + D_{k,B-1} p_{k,B})
    + 2^{2B-2} (D_{k,B} p_{k,B-2} + D_{k,B-1} p_{k,B-1} + D_{k,B-2} p_{k,B})
    + 2^{2B-3} (D_{k,B} p_{k,B-3} + D_{k,B-1} p_{k,B-2} + D_{k,B-2} p_{k,B-1} + D_{k,B-3} p_{k,B})
    + ...
    + 2^3 (D_{k,3} p_{k,0} + D_{k,2} p_{k,1} + D_{k,1} p_{k,2} + D_{k,0} p_{k,3})
    + 2^2 (D_{k,2} p_{k,0} + D_{k,1} p_{k,1} + D_{k,0} p_{k,2})
    + 2^1 (D_{k,1} p_{k,0} + D_{k,0} p_{k,1})
    + 2^0 D_{k,0} p_{k,0} )
with result pTrees q_N..q_0, N=2 2B+roof(log 2 n)+2B+1.
So, DotProduct involves just multi-operand pTree addition (no SPTSs and no multiplications). Engineering shortcut tricks would be huge!!!
Example, n=2, B=1:
  X_1 X_2 | p_{1,1} p_{1,0} p_{2,1} p_{2,0}
   1   1  |    0       1       0       1
   3   0  |    1       1       0       0
   2   1  |    1       0       0       1
  X o D = Σ_{k=1..2} ( 2^2 D_{k,1} p_{k,1} + 2^1 (D_{k,1} p_{k,0} + D_{k,0} p_{k,1}) + 2^0 D_{k,0} p_{k,0} )
        = 2^2 (D_{1,1} p_{1,1} + D_{2,1} p_{2,1}) + 2^1 (D_{1,1} p_{1,0} + D_{1,0} p_{1,1} + D_{2,1} p_{2,0} + D_{2,0} p_{2,1}) + 2^0 (D_{1,0} p_{1,0} + D_{2,0} p_{2,0})
D=(1,2): D_{1,1}=0, D_{1,0}=1, D_{2,1}=1, D_{2,0}=0.
  q_0 = p_{1,0} = 110, no carry.
  raw_1 = p_{1,1} + p_{2,0} = 011 + 101, so q_1 = 110 and carry_1 = 001.
  q_2 = carry_1 + raw_2 = 001 + p_{2,1} (= 000) = 001, no carry.
  Row values: X o D = (3, 3, 4).
D=(3,3): D_{1,1}=D_{1,0}=D_{2,1}=D_{2,0}=1.
  raw_0 = p_{1,0} + p_{2,0} = 110 + 101, so q_0 = 011 and carry_0 = 100.
  carry_0 + raw_1 = 100 + (p_{1,0} + p_{1,1} + p_{2,0} + p_{2,1}) = 100 + (110 + 011 + 101 + 000), so q_1 = 100 and carry_1 = 111.
  carry_1 + raw_2 = 111 + (p_{1,1} + p_{2,1}) = 111 + (011 + 000), so q_2 = 100 and carry_2 = 011.
  q_3 = carry_2 = 011, no carry_3.
  Row values: X o D = (6, 9, 9).
A carryTree is a valueTree (vTree), as is the rawTree at each level (rawTree = valueTree before the carry is included). In what form is it best to carry the carryTree over (for speediest processing)?
  1. As multiple pTrees added at the next level? (The pTrees at the next level are in that form and need to be added anyway.)
  2. As an SPTS, s_1? (The next-level rawTree is an SPTS, s_2; then s_{1,0} & s_{2,0} = q_next_level and carry_next_level?)
FC Clusterer: If DT (and/or DUT) are not exceeded at C, partition C further by cutting at each gap and PCC in C o D.
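The level-by-level addition with carries can be checked with a small Python sketch. It is an assumption-laden illustration, not the slide author's implementation: pTrees are modeled as uncompressed lists of 0/1 bits (one entry per row), the carry is kept as a per-row integer (a valueTree), and the names `bit_slices`, `dot_product_slices` and `slices_to_values` are hypothetical.

```python
def bit_slices(column, nbits):
    """slices[b][row] = bit b of column[row]; an uncompressed stand-in for a pTree."""
    return [[(v >> b) & 1 for v in column] for b in range(nbits)]

def dot_product_slices(columns, d, nbits):
    """Return result slices q[0], q[1], ... of X o D using only slice additions and carries."""
    nrows = len(columns[0])
    p = [bit_slices(col, nbits) for col in columns]               # p[k][b]
    dbits = [[(dk >> b) & 1 for b in range(nbits)] for dk in d]   # dbits[k][b]
    carry = [0] * nrows
    q = []
    level = 0
    while level <= 2 * (nbits - 1) or any(carry):
        raw = list(carry)                      # start from the carry of the level below
        for k in range(len(columns)):
            for i in range(nbits):
                j = level - i                  # D bit i pairs with p bit j, where i + j = level
                if 0 <= j < nbits and dbits[k][i]:
                    raw = [r + bit for r, bit in zip(raw, p[k][j])]
        q.append([r & 1 for r in raw])         # low bit of the raw sum is the result slice
        carry = [r >> 1 for r in raw]          # the rest moves up one level
        level += 1
    return q

def slices_to_values(q):
    """Reassemble per-row integers from the result slices."""
    return [sum(bit << b for b, bit in enumerate(bits)) for bits in zip(*q)]

# The slide's example: X_1 = (1,3,2), X_2 = (1,0,1), B = 1 (two bits per value).
columns = [[1, 3, 2], [1, 0, 1]]
print(slices_to_values(dot_product_slices(columns, (1, 2), 2)))
print(slices_to_values(dot_product_slices(columns, (3, 3), 2)))
```

Because D's bits are constants, each product D_{k,i} p_{k,j} is either the pTree p_{k,j} or nothing, which is why no multiplication appears in the inner loop.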

3 FO Table, X(X_1...X_n). D2NN yields a 1.a-type outlier detector (top k objects, x, by dissimilarity from X-{x}). We install in D2NN each min[D2NN(x)]. (It's a one-time construction, but for a trillion x's it's slow. Parallelization?)
With x_{k,b} bit b of x_k, p_{k,b} the bit-b pTree of X_k, and a_{k,b} = x_{k,b} - p_{k,b}:
  D2NN(x) = Σ_{k=1..n} (x_k - X_k)(x_k - X_k)
          = Σ_{k=1..n} (Σ_{b=B..0} (2^b x_{k,b} - 2^b p_{k,b})) (Σ_{b=B..0} (2^b x_{k,b} - 2^b p_{k,b}))
          = Σ_{k=1..n} (Σ_{b=B..0} 2^b (x_{k,b} - p_{k,b})) (Σ_{b=B..0} 2^b (x_{k,b} - p_{k,b}))
          = Σ_k (2^B a_{k,B} + 2^{B-1} a_{k,B-1} +..+ 2^1 a_{k,1} + 2^0 a_{k,0}) (2^B a_{k,B} + 2^{B-1} a_{k,B-1} +..+ 2^1 a_{k,1} + 2^0 a_{k,0})
          = Σ_k (
              2^{2B}   a_{k,B} a_{k,B}
            + 2^{2B-1} (a_{k,B} a_{k,B-1} + a_{k,B-1} a_{k,B})   { which is 2^{2B} a_{k,B} a_{k,B-1} }
            + 2^{2B-2} (a_{k,B} a_{k,B-2} + a_{k,B-1} a_{k,B-1} + a_{k,B-2} a_{k,B})   { which is 2^{2B-1} a_{k,B} a_{k,B-2} + 2^{2B-2} a_{k,B-1}^2 }
            + 2^{2B-3} (a_{k,B} a_{k,B-3} + a_{k,B-1} a_{k,B-2} + a_{k,B-2} a_{k,B-1} + a_{k,B-3} a_{k,B})   { which is 2^{2B-2} (a_{k,B} a_{k,B-3} + a_{k,B-1} a_{k,B-2}) }
            + 2^{2B-4} (a_{k,B} a_{k,B-4} + a_{k,B-1} a_{k,B-3} + a_{k,B-2} a_{k,B-2} + a_{k,B-3} a_{k,B-1} + a_{k,B-4} a_{k,B})   { which is 2^{2B-3} (a_{k,B} a_{k,B-4} + a_{k,B-1} a_{k,B-3}) + 2^{2B-4} a_{k,B-2}^2 }
            + ... )
          = Σ_k (
              2^{2B}   (a_{k,B}^2 + a_{k,B} a_{k,B-1})
            + 2^{2B-1} (a_{k,B} a_{k,B-2})
            + 2^{2B-2} (a_{k,B-1}^2 + a_{k,B} a_{k,B-3} + a_{k,B-1} a_{k,B-2})
            + 2^{2B-3} (a_{k,B} a_{k,B-4} + a_{k,B-1} a_{k,B-3})
            + 2^{2B-4} (a_{k,B-2}^2 + ...)
            + ... )
Does D2NN involve just multi-operand pTree addition? (Or SPTSs and multiplication?)
Notes: When x_{k,b}=1, a_{k,b} = p'_{k,b}, and when x_{k,b}=0, a_{k,b} = -p_{k,b}. So D2NN has just multi-operand pTree multiplications/additions/subtractions! Of course, each entry in D2NN (each x ∈ X) is a separate [parallelizable] calculation. Should we pre-compute all p_{k,i}*p_{k,j}, p'_{k,i}*p'_{k,j} and p_{k,i}*p'_{k,j}? Is subtraction just a matter of flipping the sign bit and adding, Md?
Data volumes:
  U.S. Library of Congress is archiving all tweets sent since 2006. USLOCTweetTable may have 1 million trillion rows and 50 columns. Volume: 172 billion tweets in 2013 alone (~300 each from 500 million tweeters). Currently > 20 million tweets/hour, 24 hours/day, seven days/week. A tweet is 140 characters. There are 50 fields (Who wrote it. Where. When. To Whom...).
  Enron Dataset: Volume 16GB. 1,000,000 rows. 100,000 columns (terms).
  Drone data? Maybe just RGB (3 columns) and trillions of rows (one for each pixel each hour for 10 years). Each pixel is GPS located (would we sort by location then, before pTree-izing?).
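The expansion of D2NN in terms of a_{k,b} = x_{k,b} - p_{k,b} can be sanity-checked with a short Python sketch. Everything here is an assumption for illustration: bit slices are plain lists instead of pTrees, the names `squared_distance_column` and `d2nn` are hypothetical, and the nearest neighbour is found with a plain `min` rather than the pTree rank operation.

```python
def squared_distance_column(x, columns, nbits):
    """(x - X) o (x - X) for every row, built only from a_{k,b} = x_{k,b} - p_{k,b}."""
    nrows = len(columns[0])
    total = [0] * nrows
    for xk, col in zip(x, columns):
        # a[b][row] is in {-1, 0, 1}: bit b of x_k minus bit b of the row's value.
        a = [[((xk >> b) & 1) - ((v >> b) & 1) for v in col] for b in range(nbits)]
        for i in range(nbits):
            for j in range(nbits):
                for r in range(nrows):
                    total[r] += (1 << (i + j)) * a[i][r] * a[j][r]
    return total

def d2nn(columns, nbits):
    """Squared distance from each row to its nearest other row."""
    nrows = len(columns[0])
    result = []
    for r in range(nrows):
        x = [col[r] for col in columns]
        dist = squared_distance_column(x, columns, nbits)
        result.append(min(d for s, d in enumerate(dist) if s != r))
    return result

# Assumed toy table: points (1,1), (3,0), (2,1), (7,7), three bits per value.
columns = [[1, 3, 2, 7], [1, 0, 1, 7]]
scores = d2nn(columns, 3)
print(scores)                                   # the isolated point (7,7) scores highest
print(max(range(len(scores)), key=scores.__getitem__))
```

Each row's score is computed independently of the others, which is the parallelizable structure the notes point out.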

4 pTree Rank(K) computation: (Rank(n-1) gives the 2nd smallest, which is very useful in outlier analysis?)
  RankKval=0; p=K; c=0; P=Pure1; /* Also RankPts are returned as the resulting pTree, P */
  For i=n to 0 {
    c=Count(P&P_i);
    If (c>=p) {RankKval=RankKval+2^i; P=P&P_i}
    else {p=p-c; P=P&P'_i}
  };
  return RankKval, P;
Example (below, K=n-1=7-1=6: looking for the 6th highest = 2nd lowest value). Cross out the 0-positions of P each step.
   X  | P_{4,3} P_{4,2} P_{4,1} P_{4,0}
  10  |    1       0       1       0
   5  |    0       1       0       1
   6  |    0       1       1       0
   7  |    0       1       1       1
  11  |    1       0       1       1
   9  |    1       0       0       1
   3  |    0       0       1       1
  (bit 3) c=Count(P&P_{4,3})=3 < 6, so p=6-3=3; P=P&P'_{4,3} masks off the highest 3 (val ≥ 8). Result bit {0}.
  (bit 2) c=Count(P&P_{4,2})=3 >= 3, so P=P&P_{4,2} masks off the lowest 1 (val < 4). Result bit {1}.
  (bit 1) c=Count(P&P_{4,1})=2 < 3, so p=3-2=1; P=P&P'_{4,1} masks off the highest 2 (val ≥ 8-2=6). Result bit {0}.
  (bit 0) c=Count(P&P_{4,0})=1 >= 1, so P=P&P_{4,0}. Result bit {1}.
  RankKval = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5. P=MapRankKPts=0100000, ListRankKPts={2}.
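A runnable version of the Rank(K) loop, as a minimal sketch: the function name `rank_k` is hypothetical, and pTrees are modeled as plain lists of 0/1 bits rather than compressed trees, so `sum` plays the role of Count and element-wise `&` plays the role of pTree AND.

```python
def rank_k(bit_slices, k):
    """Return (K-th largest value, mask of the rows holding it).

    bit_slices[i][row] is bit i of the value in that row (i = 0 is the low bit).
    """
    nrows = len(bit_slices[0])
    mask = [1] * nrows                      # P = Pure1
    p = k
    value = 0
    for i in range(len(bit_slices) - 1, -1, -1):
        hits = [m & b for m, b in zip(mask, bit_slices[i])]   # P & P_i
        c = sum(hits)                                         # Count(P & P_i)
        if c >= p:
            value += 1 << i                 # bit i of the answer is 1
            mask = hits
        else:
            p -= c                          # skip the c larger values
            mask = [m & (1 - b) for m, b in zip(mask, bit_slices[i])]  # P & P'_i
    return value, mask

values = [10, 5, 6, 7, 11, 9, 3]
slices = [[(v >> i) & 1 for v in values] for i in range(4)]
print(rank_k(slices, 6))    # 6th highest = 2nd lowest value, with its row mask
```

When the K-th value is duplicated, the returned mask marks every row that holds it, so the mask can contain more than one 1-bit.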

5 Suppose MinVal is duplicated (occurs at two points). What does the algorithm return? (Same Rank(K) procedure as on slide 4, K=6.)
   X  | P_{4,3} P_{4,2} P_{4,1} P_{4,0}
  10  |    1       0       1       0
   5  |    0       1       0       1
   6  |    0       1       1       0
   3  |    0       0       1       1
  11  |    1       0       1       1
   9  |    1       0       0       1
   3  |    0       0       1       1
  1. Ct(P&P_{4,3})=3 < 6, so P=P'_{4,3} masks off the highest 3 (val ≥ 8); p=6-3=3. Result bit {0}.
  2. Ct(P&P_{4,2})=2 < 3, so P=P&P'_{4,2} masks off the highest 2 (val ≥ 4); p=3-2=1. Result bit {0}.
  3. Ct(P&P_{4,1})=2 >= 1, so P=P&P_{4,1}. Result bit {1}.
  4. Ct(P&P_{4,0})=2 >= 1, so P=P&P_{4,0}. Result bit {1}.
  RankKval = 2^3*0 + 2^2*0 + 2^1*1 + 2^0*1 = 3 = MinVal = rank(n-1)Val. P masks MinPts = rank(n-1)Pts = {#4,#7}.

6 Suppose MinVal is triplicated (occurs at three points). What does the algorithm return? (Same Rank(K) procedure as on slide 4, K=6.)
   X  | P_{4,3} P_{4,2} P_{4,1} P_{4,0}
  10  |    1       0       1       0
   3  |    0       0       1       1
   6  |    0       1       1       0
   3  |    0       0       1       1
  11  |    1       0       1       1
   9  |    1       0       0       1
   3  |    0       0       1       1
  1. Ct(P&P_{4,3})=3 < 6, so P=P'_{4,3} masks off the highest 3 (val ≥ 8); p=6-3=3. Result bit {0}.
  2. Ct(P&P_{4,2})=1 < 3, so P=P&P'_{4,2} masks off the highest 1 (val ≥ 4); p=3-1=2. Result bit {0}.
  3. Ct(P&P_{4,1})=3 >= 2, so P=P&P_{4,1}. Result bit {1}.
  4. Ct(P&P_{4,0})=3 >= 2, so P=P&P_{4,0}. Result bit {1}.
  RankKval = 2^3*0 + 2^2*0 + 2^1*1 + 2^0*1 = 3 = MinVal. P masks MinPts = {#2,#4,#7}.

