Analysis of Affinities and Anomalies through pTrees. Attribute relevance is important.

[Figure: 2-D scatter (dim1, dim2) of points 0, r1, r2, r3, v1, v2, v3, v4, o used in the walk-through below, with the running mean and median marked at each step.]

Algorithm-1: Look for the dimension where clustering is best. Below, dimension=1 gives 3 clusters: {r1,r2,r3,o}, {v1,v2,v3,v4} and {0}. How to determine them?
1.a: Take each dimension in turn, working left to right; when d(mean,median) > 1/4 width, declare a cluster (sketched in code below).
1.b: Next take those clusters one at a time to the next dimension for further sub-clustering via the same algorithm.
Walk-through: at the first threshold crossing we declare {r1,r2,r3,o} a cluster and start over. Later we again need to declare a cluster, but which one, {0,v1} or {v1,v2}? We always take the one on the median side of the mean - in this case, {v1,v2}. And that makes {0} a cluster (actually an outlier, since it's a singleton). Continuing with {v1,v2}: declare {v1,v2,v3,v4} a cluster. Note we have to loop. However, rather than advancing by each single projection, the delta can be the next m projections if they're close. Next we would take one of the clusters and go to the best dimension to sub-cluster... Can skip doubletons, since their mean is always the same as their median.

Oblique version: Take a grid of oblique direction vectors, e.g., for a 3-D dataset, a DirVect pointing to the center of each PTM triangle. With projections onto those lines, do 1 or 2 above. Ordering = any sphere-surface grid: S_n ≡ {x≡(x_1,...,x_n) ∈ R^n | Σ x_i^2 = 1}; in polar coords, {p≡(θ_1,...,θ_{n-1}) | 0 ≤ θ_i ≤ 179}. Use lexicographical polar coords? Is 180^n too many? Use, e.g., 30-degree units, giving 6^n vectors for dim=n.

Algorithm-2: 2.a: Take each dimension in turn, working left to right; when density > Density_Threshold, declare a cluster (density ≡ count/size). 2.b = 1.b.

Algorithm-3: Another variation: calculate the dataset mean and the vector of medians (vom). Then do 1.a or 1.b on the projections of the dataset onto the line connecting the two. Then repeat on each declared cluster, but use a projection line other than the one through the mean and vom this second time, since the mean-vom line would likely be in approximately the same direction as in the first round. Do until no new clusters appear? Adjust, e.g., the projection lines and the stopping condition,...

Algorithm-4: Project onto the line through the dataset mean and vom. Example points (dim1,dim2): (4,9), (2,8), (5,8), (4,6), (3,4), (10,5), (9,4), (8,3), (7,2), with (11,10) an outlier; mn=(6.3,5.9), vom=(6,5.5). 4.b: Repeat on any perpendicular line through the mean. (mn and vom far apart suggests multi-modality.)
Algorithm-4.1: 4.b.1: In each cluster, find the 2 points furthest from the line? (Does this require the projection to be done one point at a time, or can we determine those 2 points in one pTree formula?)
Algorithm-4.2: 4.b.2: Use a grid of unit direction lines, {dv_i | i=1..m}. For each, calc the mn and vom of the projections of each cluster (except singletons). Take the one for which the separation is max.
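A minimal sketch of step 1.a, assuming horizontal numpy arrays rather than pTrees; the function name, threshold handling and toy data are ours, and the restart policy is simplified (the median-side-of-the-mean tie-break is applied at each declaration).

```python
import numpy as np

def clusters_1a(values, frac=0.25):
    """Algorithm-1, step 1.a (sketch): scan projections left to right;
    when |mean - median| of the points scanned so far exceeds frac*width,
    declare the sub-cluster on the median side of the mean and start over."""
    order = np.argsort(values)
    clusters, current = [], []
    for idx in order:
        current.append(idx)
        proj = values[np.array(current)]
        mu, med = proj.mean(), np.median(proj)
        width = proj.max() - proj.min()
        if width > 0 and abs(mu - med) > frac * width:
            side = proj <= mu if med <= mu else proj >= mu
            clusters.append([i for i, s in zip(current, side) if s])
            current = [i for i, s in zip(current, side) if not s]
    if current:
        clusters.append(current)   # a singleton here is an outlier
    return clusters

dim1 = np.array([1.0, 1.2, 7.0, 7.2, 16.0])
print(clusters_1a(dim1))   # -> [[0, 1], [2, 3], [4]], point 4 an outlier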
Example: mean=(8.18, 3.27, 3.73), vom=(7,4,3).
1. No clusters determined yet.
2. (9,2,4) determined as an outlier cluster.
3. Using the red dim line, (7,5,2) is determined as an outlier cluster; the maroon points are determined as a cluster, the purple points too.
3.a However, continuing to use the line connecting the (new) mean and vom of the projections onto this plane, would the same be determined? Other option? Use (at some judicious point) a p-Kmeans-type approach. This could be done using K=2 and a divisive top-down approach (using a GA mutation at various times to get us off a non-convergent track)?

Notes: Each round, reduce the dimension by one (a lower bound on the loop). Each round, we just need a good line (in the remaining hyperplane) to project the cluster-so-far onto:
1. Pick the line through the projected mean and vom (the vom is dependent on the basis used - better way?). See the sketch below.
2. Pick the line through the longest diameter? (Or a diameter at least 1/2 the previous diameter?)
3. Try a direction vector, then hill-climb it in the direction of increase in the diameter of the projected set.

From: Mark Silverman, April 21. Subject: RE: oblique
I've been doing some tests, so far not so accurate (I'm still validating the code - I "unhardcoded" it so I can deal with arbitrary datasets, and it's possible there's a bug; so far I think it's ok). Something rather unique about the test data I am using is that it has four attributes, but for all of the class decisions it is really one of the attributes driving the classification decision (e.g., for classes 2-10, attribute 2 is the dominant decision; for class 11, attribute 1 is dominant, etc.). I have very wide variability in std deviation in the test data (some very tight, some wider). Thus, I think that placing "a" on the basis of relative deviation makes a lot of sense in my case (and probably in general). My assumption is that all I need to do is to modify as follows:
Now: a[r][v] = (Mr + Mv) * d / 2
Changes to: a[r][v] = (Mr + Mv) * d * std(r) / (std(r) + std(s))
Is this correct?
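The Notes' option 1 admits a small sketch, assuming horizontal numpy data; mean_vom_projection is a hypothetical helper name. It reproduces the slide's mn=(6.3,5.9), vom=(6,5.5) example; gaps in the printed sorted projections are where 1.a or 2.a would cut.

```python
import numpy as np

def mean_vom_projection(X):
    """Project rows of X onto the line through the dataset mean and the
    vector of medians (vom); returns scalar projections along that line.
    (In the pTree setting, mean and vom come from one horizontal formula
    per column; here we just use numpy on horizontal data.)"""
    mean = X.mean(axis=0)
    vom = np.median(X, axis=0)
    D = vom - mean
    if not np.linalg.norm(D):
        return None                   # mean == vom: pick another line
    d = D / np.linalg.norm(D)
    return (X - mean) @ d

# the slide's points: mn=(6.3,5.9), vom=(6,5.5), (11,10) the outlier
X = np.array([[4,9],[2,8],[5,8],[4,6],[3,4],
              [10,5],[9,4],[8,3],[7,2],[11,10]], float)
print(np.round(np.sort(mean_vom_projection(X)), 2))
```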
FAUST Oblique (our best classifier?)

[Figure: r-class and v-class points in (dim1, dim2) with means m_R, m_V, the d-line, and the cut at a.]

Separate class R using the midpoint-of-means (mom) method: P_R = P_{X o d < a}; 1 pass gives the classR pTree. D ≡ m_R - m_V, d = D/|D| (works also if D = m_V - m_R; the comparison direction follows the sign choice). Calculate a: (m_R + (m_V - m_R)/2) o d = a = ((m_R + m_V)/2) o d. See the sketch below.

Training ≡ placing cut-hyper-plane(s) (CHP) (= an (n-1)-dim hyperplane cutting the space in two). Classification is 1 horizontal program (AND/OR) across pTrees, giving a mask pTree for each entire predicted class (all unclassifieds at-a-time).

Accuracy improvement? Consider the dispersion within classes when placing the CHP. E.g.:
1. Use the vector_of_medians, vom, to represent each class, not the mean m_V, where vom_V ≡ (median{v_1 | v∈V}, median{v_2 | v∈V}, ...).
2. mom_std, vom_std methods: project each class on the d-line; then calculate the std of the distances, v o d, from the origin along the d-line (one horizontal formula per class using Md's method); then use the std ratio to place the CHP (no longer at the midpoint between m_R and m_V).

Note: training (finding a and d) is a one-time process. If we don't have training pTrees, we can use horizontal data for a, d (one time), then apply the formula to test data (as pTrees).
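A minimal numpy sketch of the mom method on horizontal data; the pTree-level evaluation (one AND/OR program over bit slices) is not shown, and the function names are ours. The vom_std variant would swap the means for vectors of medians and shift a by the std ratio.

```python
import numpy as np

def faust_train_mom(XR, XV):
    """Midpoint-of-means (mom): one-time training giving (d, a).
    Here D = m_V - m_R so that class R falls on the X o d < a side;
    with D = m_R - m_V the comparison just flips."""
    mR, mV = XR.mean(axis=0), XV.mean(axis=0)
    D = mV - mR
    d = D / np.linalg.norm(D)
    a = (mR + mV) / 2 @ d               # a = ((m_R + m_V)/2) o d
    return d, a

def faust_classify_R(X, d, a):
    """P_R mask: X o d < a. On vertical data this one comparison is a
    horizontal AND/OR program across the bit-slice pTrees; here it is
    just a dot product on horizontal rows."""
    return X @ d < a

rng = np.random.default_rng(0)
XR = rng.normal([2, 2], 0.5, size=(50, 2))   # toy class R
XV = rng.normal([6, 5], 0.5, size=(50, 2))   # toy class V
d, a = faust_train_mom(XR, XV)
print("R classified R:", faust_classify_R(XR, d, a).mean())
print("V classified R:", faust_classify_R(XV, d, a).mean())
```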
The PTreeSet Genius for Big Data

Big Vertical Data (BVD): the PTreeSet (Dr. G. Wettstein's) is perfect for BVD! (pTrees both horizontal and vertical.) PTreeSets include methods for horizontal querying and vertical DM, multi-hop query/DM, and XML.

T(A_1...A_n) is a PTreeSet data structure = a bit matrix with (typically) each numeric attribute converted to fixed-point(?) (negatives?) and bit-sliced (pt_pos schema), and each categorical attribute bitmapped, or coded then bitmapped, or numerically coded then bit-sliced (or left as-is, i.e., a char(25) NAME column stored outside the PTreeSet?). With A_1..A_k numeric with bitwidths bw_1..bw_k and A_{k+1}..A_n categorical with category counts cc_{k+1}..cc_n, the PTreeSet is the bit matrix with columns A_{1,bw_1},...,A_{1,0}, A_{2,bw_2},..., A_{k+1,c},..., A_{n,cc_n} over row numbers 1..N.

Methods for this data structure can provide fast horizontal row access: e.g., an FPGA could (with zero delay) convert each bit-row back to the original data row (see the sketch below). Methods already exist to provide vertical (level-0 or raw pTree) access. Any level-1 PTreeSet can be added: given any row partition (e.g., an equiwidth=64 row intervalization, with interval numbers 1..roof(N/64)) and a row predicate (e.g., at least 50% 1-bits). Add "level-1 only" DM methods: e.g., the FPGA converts unclassified row-sets to equiwidth=64, 50% level-1 pTrees, then the entire batch is FAUST-classified in one horizontal program. Or level-1 pCKNN.

pDGP (pTree Darn Good Protection): permute the column order (the permutation = the key). A random pre-pad for each bit-column would make it impossible to break the code by simply focusing on the first bit row. More security? Make all pTrees the same (max) depth, with intron-like pads randomly interspersed...

Relationships (rolodex cards) are 2 PTreeSets, e.g., AHGPeoplePTreeSet and AHGBasePairPositionPTreeSet (a rotation of it). Vertical Rule Mining, Vertical Multi-hop Rule Mining and Classification/Clustering methods view AHG as either a People table (cols=BPPs) or a BPP table (cols=People). MRM and Classification done in combination? Any table is a relationship between row and column entities (heterogeneous entities) - e.g., an image = a [reflectance-labelled] relationship between a pixel entity and a wavelength-interval entity. Always PTreeSetting both ways facilitates new research and makes horizontal row methods (using FPGAs) instantaneous (1 pass across the row pTree).

Most bioinformatics done so far is not really data mining but is more toward the database-querying side (e.g., a BLAST search). A radical approach: view the whole Human Genome as 4 binary relationships between People and base-pair-positions. AHG [THG/GHG/CHG] is the relationship between People and adenine(A) [thymine(T)/guanine(G)/cytosine(C)] (1/0 for yes/no): AHG(P,bpp), P ≈ 7B people, bpp ≈ 3B positions. Order bpp? By chromosome first, then by gene or region (level-2 is chromosome, level-1 is gene within chromosome) - do it to facilitate cross-organism bioinformatics data mining? Create both People and BPP PTreeSets with a human health-records feature table (pc, bc, lc, cc, pe, age, ht, wt; the red person's features are used to define classes) as the training set for classification and multi-hop ARM, plus a comprehensive decomposition (ordering of bpps) for cross-species genomic DM. If there are separate PTreeSets for each chromosome (even each region - gene, intron, exon...) then we may be able to data-mine horizontally across all of these vertical pTrees. With AHG pTrees for data mining we can look for similarity (near neighbors) in a particular chromosome, a particular gene sequence, overall, or anything else.
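A minimal sketch of the PTreeSet bit matrix for one numeric and one categorical column, assuming the pt_pos schema; the helper names are ours. row_access plays the role the slide assigns to the FPGA: rebuilding a horizontal row from the vertical slices.

```python
import numpy as np

def ptreeset(column, bitwidth):
    """Bit-slice one numeric column into `bitwidth` vertical bit columns
    (slice i holds bit i of each value, i=0 the low-order bit)."""
    col = np.asarray(column, dtype=np.int64)
    return [((col >> i) & 1).astype(np.uint8) for i in range(bitwidth)]

def row_access(slices, r):
    """Horizontal access: rebuild row r's value from the vertical slices
    (the step the slide imagines an FPGA doing with zero delay)."""
    return sum(int(s[r]) << i for i, s in enumerate(slices))

def bitmap(column):
    """Categorical attribute: one bitmap pTree per category value."""
    col = np.asarray(column)
    return {v: (col == v).astype(np.uint8) for v in np.unique(col)}

age = [37, 25, 60, 41]                       # toy numeric attribute
slices = ptreeset(age, 6)
assert all(row_access(slices, r) == age[r] for r in range(len(age)))
print(bitmap(["red", "blue", "red", "white"])["red"])   # -> [1 0 1 0]
```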
Multi-hop Data Mining (MDM): relationship1 (Buys = B(P,I)) ties table1 (People = P) to table2 (Items = I), tied by relationship2 (Friends = F(P,P)) to table3 (also P). Item attributes: Category, color, size, wt, store, city, state, country. Can we do clustering and/or classification on one of the tables, using the relationships to define "close" or to define the other notions?

Define the NearestNeighborVoterSet of {f} using strong R-rules with F in the consequent? A strong cluster based on several self-relationships (different relationships, so it's not just strong implication both ways) strongly implies itself (or strongly implies itself after several hops, or when closing a loop).

Find all strong A → C, A ⊆ P, C ⊆ I:
Frequent iff ct(P_A) > minsup.
Confident iff ct(&_{p∈A} F_p & &_{i∈C} P_i) / ct(&_{p∈A} F_p) > minconf.
Says: "A friend of all of A will buy C if all of A buy C." (The AND is always AND.)
Closures: if A is frequent then A+ is frequent; if A → C is not confident, then A → C- is not confident.

Variants:
ct(|_{p∈A} F_p & &_{i∈C} P_i) / ct(|_{p∈A} F_p) > minconf: a friend of any in A will buy C if any in A buy C.
ct(|_{p∈A} F_p & |_{i∈C} P_i) / ct(|_{p∈A} F_p) > minconf: change to "a friend of any in A will buy something in C if any in A buy C."
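A toy computation of the "friend of all of A will buy C" support and confidence with boolean bitmaps; the 4-person, 5-item data and the function name are ours.

```python
import numpy as np

# F[p] = friends bitmap of person p; B[:, i] = people who bought item i.
F = np.array([[1,1,0,0],
              [1,1,1,0],
              [0,1,1,1],
              [0,0,1,1]], dtype=bool)      # F(P,P)
B = np.array([[1,0,1,0,1],
              [0,1,1,0,0],
              [1,1,0,1,0],
              [0,0,1,1,1]], dtype=bool)    # B(P,I): row=person, col=item

def strong_rule(A, C, minsup=1, minconf=0.5):
    """ct(&_{p in A} F_p & &_{i in C} B_i) / ct(&_{p in A} F_p):
    'a friend of all of A will buy C'."""
    ante = np.logical_and.reduce(F[list(A)])             # friends of all of A
    cons = np.logical_and.reduce(B[:, list(C)], axis=1)  # bought all of C
    supp = int(ante.sum())
    conf = (ante & cons).sum() / supp if supp else 0.0
    return supp > minsup, conf > minconf, conf

print(strong_rule(A={0, 1}, C={2}))   # -> (True, True, 1.0) on this toy data
```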
Facebook-Buys: A facebook Member, m, purchases Item, x, and tells all friends. Let's make everyone a friend of him/herself. Each friend responds back with the Items, y, she/he bought and liked. F ≡ Friends(M,M), P ≡ Purchase(M,I).

X ⊆ I. MX ≡ &_{x∈X} P_x = the people that purchased everything in X. FX ≡ OR_{m∈MX} F_m = the friends of an MX person. So, for X={x}: is "Mx ⇒ purchases x" strong? Mx = OR_{m∈P_x} F_m. x is frequent if Mx is large - a tractable calculation: take one x at a time and do the OR. x is confident if ct(Mx & P_x) / ct(Mx) > minconf. Example: K_2 = {1,2,4}, P_2 = {2,4}; ct(K_2) = 3, ct(K_2 & P_2)/ct(K_2) = 2/3.

To mine X, start with X={x}. If it is not confident then no superset is. Closure: consider X={x,y} only for x and y forming confident rules themselves... ct(OR_{m∈P_x} F_m & P_x) / ct(OR_{m∈P_x} F_m) > minconf.

Variant (Kiddos/Buddies/Groupies): F ≡ Friends(K,B), P ≡ Purchase(B,I), plus Others(G,K). A facebook buddy, b, purchases x and tells friends; each friend tells all friends. Is a strong purchase possible? Kx = OR_{g ∈ OR_{b∈P_x} F_b} O_g; x is frequent if Kx is large (tractable - one x at a time and OR). Example: K_2 = {1,2,3,4}, P_2 = {2,4}; ct(K_2) = 4, ct(K_2 & P_2)/ct(K_2) = 2/4.

Variant (Compatriots(G,K)): intersect rather than union (AND rather than OR) - advertise to friends of friends. Example: K_2 = {2,4}, P_2 = {2,4}; ct(K_2) = 2, ct(K_2 & P_2)/ct(K_2) = 2/2.
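A toy check of Mx = OR_{m∈P_x} F_m and the confidence ratio; the friend matrix here is our guess at one that reproduces the slide's first example (K_2 = {1,2,4}, P_2 = {2,4}, confidence 2/3), with everyone a friend of him/herself.

```python
import numpy as np

F = np.array([[1,0,0,0],      # member 1: friends {1}
              [0,1,0,1],      # member 2: friends {2,4}
              [0,0,1,0],      # member 3: friends {3}
              [1,0,0,1]],     # member 4: friends {1,4}
             dtype=bool)
P2 = np.array([0,1,0,1], dtype=bool)    # members 2 and 4 bought item 2

Mx = np.logical_or.reduce(F[P2])        # OR of F_m over m in P_2
conf = (Mx & P2).sum() / Mx.sum()
print(Mx.nonzero()[0] + 1, conf)        # -> [1 2 4] 0.666...
```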
The Multi-hop Closure Theorem. A hop is a relationship, R, hopping from entities E to F.
Upward closure: if a condition is true of A then it is true of all supersets D of A.
Downward closure: if a condition is true of A, then it is true for all subsets D of A.
For a transitive (a+c)-hop strong-rule mine where the focus (count) entity is a hops from the antecedent and c hops from the consequent: if a (or c) is odd/even then downward/upward closure applies to frequency (confidence). Odd ⇒ downward; even ⇒ upward.

[Diagram: hop chain E -R(E,F)- F -S(F,G)- G -T(G,H)- H -U(H,I)- I, with antecedent A and consequent C marked.]

Proof of the theorem. A pTree X is said to be "covered by" a pTree Y if for every 1-bit in X there is a 1-bit at that same position in Y.
Lemma-0: For any two pTrees X and Y, X & Y is covered by X, and ct(X) ≥ ct(X&Y).
Proof-0: ANDing with Y may zero some of X's 1-positions but never ones any of X's 0-positions.
Lemma-1: Let A ⊆ B; then &_{a∈B} X_a is covered by &_{a∈A} X_a.
Proof-1: Let Z = &_{a∈B-A} X_a; then &_{a∈B} X_a = Z & (&_{a∈A} X_a), so the result follows from Lemma-0.
Lemma-2: For a (or c) = 0, frequency and confidence are upward closed.
Proof-2: For A ⊆ B, ct(B) ≥ ct(A), so ct(A) > minsup ⇒ ct(B) > minsup, and ct(C&A)/ct(C) > minconf ⇒ ct(C&B)/ct(C) > minconf.
Lemma-3: If for a (or c) we have upward/downward closure of frequency or confidence, then for a+1 (or c+1) we have downward/upward closure.
Proof-3: Taking the a case with upward closure, going to a+1 and D ⊆ A, we are removing ANDs in the numerator for both frequency and confidence; so by Lemma-1 the (a+1)-numerator covers the a-numerator, and therefore the (a+1)-count ≥ the a-count. Therefore the condition (frequency or confidence) holds in the a+1 case and we have downward closure.
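A quick sanity check of Lemma-0 and Lemma-1 on random bit vectors; covered_by and the test data are ours.

```python
import numpy as np

def covered_by(X, Y):
    """X covered by Y: every 1-bit of X has a 1-bit in Y at that position."""
    return not np.any(X & ~Y)

rng = np.random.default_rng(1)
pts = rng.integers(0, 2, size=(5, 16)).astype(bool)   # pTrees X_a, a=0..4

# Lemma-0: X & Y is covered by X, and ct(X) >= ct(X & Y)
X, Y = pts[0], pts[1]
assert covered_by(X & Y, X) and X.sum() >= (X & Y).sum()

# Lemma-1: A subset of B  =>  AND over B is covered by AND over A
A, B = [0, 2], [0, 1, 2, 4]
andA = np.logical_and.reduce(pts[A])
andB = np.logical_and.reduce(pts[B])
assert covered_by(andB, andA)
print("lemmas hold on this random instance")
```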
The Multi-hop Closure Theorem (nested-AND form). A hop is a relationship, R, hopping from entities E to F. Upward closure: if a condition is true of A then it is true of all supersets D of A; downward closure: if true of A, then true for all subsets D of A. For a transitive (a+c)-hop strong-rule mine where the focus entity is a hops from the antecedent and c hops from the consequent: if a (or c) is odd/even then downward/upward closure applies to frequency (confidence). Odd ⇒ downward; even ⇒ upward.

A pTree X is "covered by" a pTree Y if for every 1-bit in X there is a 1-bit at that same position in Y.
Lemma-0: For any two pTrees, X&Y is covered by X and ct(X) ≥ ct(X&Y). Proof-0: ANDing with Y may zero some of X's 1-positions but never ones any of X's 0-positions.
Lemma-1: Let A ⊆ B; then &_{a∈B} X_a is covered by &_{a∈A} X_a. Proof-1: Let Z = &_{a∈B-A} X_a; then &_{a∈B} X_a = Z & (&_{a∈A} X_a), so the result follows from Lemma-0.
Lemma-2: If n is even/odd, the threshold condition ct( &_{a1∈(... &_{a(n-1)∈(&_{an∈A} R_an)} S_{a(n-1)} ...)} T_{a1} ) > thresh is upward/downward closed on A.
Proof-2: Let A ⊆ D. Then &_{an∈D} R_an is covered by &_{an∈A} R_an (Lemma-1). At the next hop the AND on the A side runs over the larger index set, so &_{a(n-1)∈(&_{an∈D} R_an)} S_{a(n-1)} covers &_{a(n-1)∈(&_{an∈A} R_an)} S_{a(n-1)}; the covering direction flips again at the &_{a(n-2)} hop, and so on up the nest. After an even/odd number of hops the counts therefore satisfy the threshold upward/downward, which is the claimed closure.
Dear Dr. Perrizo and All,

I think I found a method to calculate the mode of a dataset using pTrees. Assume we have a dataset that is represented by three pTrees, so the possible data values are 0 to 7. Now if we do the following operations:
F0 = count(P2' & P1' & P0') will give us the frequency of value 0
F1 = count(P2' & P1' & P0 ) will give us the frequency of value 1
F2 = count(P2' & P1  & P0') will give us the frequency of value 2
...
F7 = count(P2  & P1  & P0 ) will give us the frequency of value 7
Now Mode = Max(F0, F1, ..., F7).

The problem with this method: if we have a large number of pTrees then there will be a large number of F operations, and each F operation will involve many AND operations. For example, if we have 8 pTrees then we'll have 2^8 = 256 F's and each F contains 8-1 = 7 AND operations.

I have thought of a solution that may overcome this problem. Assume we have 3 pTrees and value 2 is the mode, so F2 = P2'&P1&P0' gives us the maximum F value; assume it is m. Now if we get the counts of all individual components of F2, that is, the subsets (P2', P1, P0', P2'&P1, P2'&P0', P1&P0', P2'&P1&P0'), then all of them must be greater than or equal to m (downward closure property). So to search for P2'&P1&P0' we can run an Apriori-like algorithm with the singleton itemsets P2, P2', P1, P1', P0, P0', then form the doubletons P2P1, P2P1', ... etc. Now we need a support value for pruning. Obviously the support should be the mode, but we do not know it ahead of time, so we can set the minimum possible value of the mode as the support. (Note: there cannot be any PiPi' doubleton, as it is 0.) The minimum value of the mode is Max(1, ceiling[Datasize/2^n]), where n is the number of pTrees. Sorry, I cannot give any example now but I can try to give an example at the white board.

Thanks. Sincerely, Mohammad
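A sketch of the brute-force scheme from the email, assuming horizontal data bit-sliced on the fly; the function name is ours. It enumerates all 2^n AND-combinations (the cost the email's Apriori-like pruning is meant to avoid).

```python
import numpy as np
from itertools import product

def mode_via_ptrees(values, nbits):
    """Bit-slice, then count every AND of slices/complements; the max
    count is the mode's frequency. Exponential in nbits, as noted."""
    vals = np.asarray(values, dtype=np.int64)
    slices = [((vals >> i) & 1).astype(bool) for i in range(nbits)]
    best_v, best_ct = None, -1
    for bits in product([0, 1], repeat=nbits):    # one F per value
        mask = np.ones(len(vals), dtype=bool)
        for i, b in enumerate(bits):              # AND of Pi or Pi'
            mask &= slices[i] if b else ~slices[i]
        ct = int(mask.sum())
        if ct > best_ct:
            best_v = sum(b << i for i, b in enumerate(bits))
            best_ct = ct
    return best_v, best_ct

data = [2, 5, 2, 7, 2, 5, 1, 2]
print(mode_via_ptrees(data, 3))   # -> (2, 4)
```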
Given an n-row table, a row predicate (e.g., a bit-slice predicate, or a category map) and a row ordering (e.g., ascending on key; or, for spatial data, column/row-raster, Z, or Hilbert order), the sequence of predicate truth bits is the raw or level-0 predicate Tree (pTree) for that table, row predicate and row order.

Given a raw pTree P, a partition of it, par, and a bit-set predicate, bsp (e.g., pure1, pure0, gte50%One), the level-1 (par, bsp) pTree is the string of truths of bsp on consecutive partitions of par. If the partition is an equiwidth=m intervalization, it's called the level-1 stride=m bsp pTree.

Example (IRIS table with columns Name, SL, SW, PL, PW, Color: 5 setosa, 5 versicolor, 5 virginica rows, colors red/blue/white):
- P^0_{SL,0}: predicate remainder(SL/2)=1, order = the given table order.
- P^0_{SL,1}: predicate rem(div(SL/2)/2)=1, order = the given order. Level-1 pTrees from it: gte50% stride=5, pure1 stride=5, gte25% stride=5, gte75% stride=5.
- P^0_{Color=red}: predicate Color=red, order = the given order; same four level-1 variants (gte50%, pure1, gte25%, gte75%, all stride=5).
- P^0_{PW<7}: predicate PW<7, order = given. Its level-1 gte50% stride=5 pTree predicts setosa.
- From P^0_{SL,0}: level-1 gte50% pTrees at stride=4, stride=8 and stride=16.
- A level-2 pTree = a level-1 pTree built on a level-1 pTree (a 1-column table): P^2_{gte50%,s=4,SL,0} is the level-1 gte50% stride=2 pTree (gte50_P^{1,1}) on P^1_{gte50%,s=4,SL,0} ≡ the gte50% stride=4 pTree of P^0_{SL,0}. So: raw = level-0 pTree; level-1 gte50 stride=4 pTree on it; level-2 gte50 stride=2 pTree on that.
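A sketch of the level-0 → level-1 → level-2 chain on hypothetical SL values (the slide's numbers are stand-ins here); level1 and the bsp lambdas are our names.

```python
import numpy as np

# Level-0 pTree: truth bits of a row predicate in the given row order.
SL = np.array([4, 6, 7, 5, 4, 6, 5, 7, 6, 4, 7, 6, 5, 4, 6])  # stand-ins
p0_SL0 = (SL % 2 == 1).astype(np.uint8)          # predicate: rem(SL/2)=1

def level1(ptree, stride, bsp):
    """Level-1 (stride, bsp) pTree: apply the bit-set predicate bsp to
    consecutive equiwidth=stride chunks of a lower-level pTree.
    Other bsps (pure1, pure0, gte25%...) plug in the same way."""
    chunks = [ptree[i:i+stride] for i in range(0, len(ptree), stride)]
    return np.array([bsp(c) for c in chunks], dtype=np.uint8)

gte50 = lambda bits: bits.sum() * 2 >= len(bits)  # gte50%One predicate

p1 = level1(p0_SL0, stride=4, bsp=gte50)   # level-1 gte50 stride=4
p2 = level1(p1, stride=2, bsp=gte50)       # level-2 = level-1 on a level-1
print(p0_SL0, p1, p2, sep="\n")
```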
FAUST Satlog evaluation.

[Table: per-class (1,2,3,4,5,7) mn and std of bands R, G, ir1, ir2, and per-class True Positive / False Positive counts for each run below.]

Runs compared (TPs/FPs per class 1,2,3,4,5,7):
- NonOblique level-0, pure.
- NonOblique level-1, gte50%.
- Oblique level-0 using midpoint of means (MeansMidPoint).
- Oblique level-0 using means and stds of projections, cut at s1/(s1+s2) (without class elimination).
- Oblique level-0, means and stds of projections, with class elimination in order (note that "none" occurs).
- Oblique level-0, means and stds of projections, doubling pstd_r (cut at 2s1/(2s1+s2)), no elimination.
- Same, classify then eliminate in 2,3,4,5,7,1 order.
- Same, classify then eliminate in 3,4,7,5,1,2 order; that order is suggested by above=(std+std_up)/gap_up and below=(std+std_dn)/gap_dn computed per band (red, green, ir1, ir2) and class.

Cut placement with dispersion (see the sketch after this slide): a = pm_r + (pm_v - pm_r) * 2pstd_r/(pstd_v + 2pstd_r) = (pm_r*pstd_v + pm_v*2pstd_r)/(pstd_v + 2pstd_r). Doubling pstd_r: the number of FPs is reduced and TPs are somewhat reduced. Better? Parameterize the 2 to maximize TPs and minimize FPs. What is the best parameter?

BandClass rule mining (interval -> class): G[0,46]->2, G[47,64]->5, G[65,81]->7, G[81,94]->4, G[94,255]->{1,3}; R[0,48]->{1,2}, R[49,62]->{1,5}, R[82,255]->3; ir1[0,88]->{5,7}; ir2[0,52]->5.

Conclusion? MeansMidPoint and Oblique std1/(std1+std2) are best, with the Oblique version slightly better. I wonder how these two methods would work on Netflix? Two ways: UTbl(User, M_1,...,M_17770) (u,m), umTrainingTbl = SubUTbl(Support(m), Support(u), m); or MTbl(Movie, U_1,...,U_480189) (m,u), muTrainingTbl = SubMTbl(Support(u), Support(m), u).
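The dispersion-weighted cut admits a two-line sketch; the function name and toy numbers are ours, and w generalizes the slide's "doubling" factor (w=2) that it suggests parameterizing.

```python
import numpy as np

def cut_with_dispersion(pm_r, pm_v, pstd_r, pstd_v, w=2.0):
    """Place the cut a between projected class means, weighted by the
    projected stds; w=2 is the slide's 'doubling pstd_r' variant."""
    return pm_r + (pm_v - pm_r) * (w * pstd_r) / (pstd_v + w * pstd_r)

# equivalently (pm_r*pstd_v + pm_v*w*pstd_r) / (pstd_v + w*pstd_r):
a1 = cut_with_dispersion(10.0, 20.0, 1.0, 4.0)
a2 = (10.0 * 4.0 + 20.0 * 2.0 * 1.0) / (4.0 + 2.0 * 1.0)
assert abs(a1 - a2) < 1e-12
print(a1)   # -> 13.33...; the cut sits closer to the tighter class (r)
```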
Netflix data: Main table Main(m,u,r,d) = (mID, uID, rating, date); movies {m_k}, k=1..17,770, each m_k holding rows (u, r_{m_k,u}, d_{m_k,u}); users u_1..u_480,189; about 47B cells in the full user-by-movie matrix; averages 5655 users/movie and 209 movies/user.

MTbl(mID, u_1,...,u_480189): 47B cells, storable as an MPTreeSet 3*480189 bit-slices wide (a 3-bit rating code per user column, plus 0/1 presence). UserTable(uID, m_1,...,m_17770): 47B cells, storable as a UPTreeSet 3*17770 bit-slices wide.

(u,m) to be predicted: form umTrainingTbl = SubUTbl(Support(m), Support(u), m). Of course, the two supports won't be tight together like that, but they are put that way for clarity. There are lots of 0s in the vector space umTrainingTbl; we want the largest subtable without zeros. How? SubUTbl(∩_{n∈Sup(u)} Sup(n) ∩ Sup(m), Sup(u), m)?

Using coordinate-wise FAUST (not Oblique): in each coordinate n∈Sup(u), divide up all users v∈Sup(n)∩Sup(m) into their rating classes, rating(m,v). Then: 1. calculate the class means and stds; sort the means. 2. calculate the gaps. 3. choose the best gap and define the cutpoint using the stds (sketched below). This of course may be slow. How can we speed it up?

Coordinate-wise FAUST, transposed: in each coordinate v∈Sup(m), divide up all movies n∈Sup(v)∩Sup(u) into rating classes, then do steps 1-3 as above.

Gaps alone are not best (especially since the sum of the gaps is no more than 4 and there are 4 gaps). Weighting (correlation(m,n)-based) is useful (the higher the correlation, the more significant the gap??). Cutpoints are constructed for just this one prediction, rating(u,m). Does it make sense to find all of them? Should we just find, e.g., which n-class-mean(s) rating(u,n) is closest to, and make those the votes?
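A sketch of one coordinate's steps 1-3, assuming the rating classes have already been gathered into plain lists; best_gap_cut and the toy groupings are ours, and the cut uses the std-ratio placement from the FAUST slides.

```python
import numpy as np

def best_gap_cut(ratings_by_class):
    """Given {rating_class: projections of its members}, sort the class
    means, find the largest gap, and place the cut between the two
    adjacent classes weighted by their stds."""
    stats = sorted((np.mean(v), np.std(v)) for v in ratings_by_class.values())
    means = [m for m, _ in stats]
    i = int(np.argmax(np.diff(means)))            # index of the best gap
    (m1, s1), (m2, s2) = stats[i], stats[i + 1]
    cut = m1 + (m2 - m1) * s1 / (s1 + s2) if s1 + s2 else (m1 + m2) / 2
    return i, cut

# toy: users who rated movie m, grouped by their rating of neighbor n
by_class = {1: [1.0, 1.5, 1.2], 3: [2.8, 3.1, 3.0], 5: [4.6, 4.9, 5.0]}
print(best_gap_cut(by_class))
```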