
1 Mark asked about anomaly detection using pTrees (AKA outlier determination). Some pTree outlier detection (AKA anomaly detection) papers (note this was Dr. Dongmei (Dorothy) Ren's thesis topic):
"A P-tree-based Outlier Detection Algorithm", ISCA CAINE 2004, Orlando, FL, Nov. 2004 (with B. Wang, D. Ren)
"A Cluster-based Outlier Detection Method with Efficient Pruning", ISCA CAINE, Nov. 2004 (with B. Wang, D. Ren)
"A Density-based Outlier Detection Algorithm using Pruning Techniques", ISCA CAINE 2004, Nov. 2004 (with B. Wang, K. Scott, D. Ren)
"Outlier Detection with Local Pruning", ACM Conf. on Information and Knowledge Management, ACM CIKM 2004, Nov. 2004, Washington, D.C. (with D. Ren)
"RDF: A Density-based Outlier Detection Method using Vertical Data Representation", IEEE ICDM 2004, Nov. 2004, Brighton, U.K. (with D. Ren, B. Wang)
"A Vertical Outlier Detection Method with Clusters as a By-Product", Intl. Conf. on Tools in Artificial Intelligence, IEEE ICTAI 2004, Boca Raton, FL (with D. Ren)

Quick outlier filter?: dis(dataset_mean, dataset_vom) > threshold1, and a distance threshold3 on individual points. Depending on how many outliers one wants, one can march threshold3 down from infinity until one has that number, or consider any point more than, say, 2 stds from the mean to be an outlier (in the direction of vector(mean-->vom) only)??? Slide 6 gives an example of mean versus vector of medians (a large difference (class=V) suggests outliers, while mean ~= vom suggests none). Let's look at the last two papers to see how we did things years ago. Better way? Use pTrees to mask each mode; then, for each mode separately, use this filter to determine whether there are outliers in the mode (a code sketch follows at the end of this slide).

1000 Genomes Web Service: The 1000 Genomes Project is an international research effort to establish the most detailed catalogue of human genetic variation. It has grown to 200 TB, including DNA sequenced from more than 1,700 individuals, which researchers can now access on AWS for use in disease research. It aims to include the genomes of more than 2,662 individuals from 26 populations, and NIH will continue to add the remaining genome samples to the data collection this year. The dataset containing the full genomic sequence of 1,700 individuals is available at s3.amazonaws.com/1000genomes.

Accessing 1000 Genomes Data: The data is publicly available for free, in a centralized repository hosted on Amazon Simple Storage Service (Amazon S3). It can be seamlessly accessed from AWS services such as Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Elastic MapReduce (Amazon EMR), which provide organizations with the highly scalable compute resources needed to take advantage of these collections, at no charge to the community. Researchers pay only for the additional AWS resources they need for further processing or analysis of the data. Learn more about Public Data Sets on AWS. All 200 TB of the latest 1000 Genomes Project data is in a publicly available Amazon S3 bucket. You can access the data via simple HTTP, or take advantage of the AWS SDKs in languages such as Ruby, Java, Python, .NET and PHP.

Analyzing 1000 Genomes Data: Researchers can use the Amazon EC2 utility computing service without the usual capital investment required to work with data at this scale.
AWS also provides a number of orchestration and automation services to help teams make their research available to others to remix and reuse. Making the data available via a bucket in Amazon S3 also means that customers can crunch the information using Hadoop via Amazon Elastic MapReduce, and take advantage of the growing collection of tools for running bioinformatics job flows, such as CloudBurst and Crossbow.

Other Sources: The 1000 Genomes data is available free through the 1000 Genomes website, from each of the two institutions that work together as the project's Data Coordination Centre (DCC). NIH National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine at NIH: ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes, ftp6.ncbi.nlm.nih.gov (for IPv6 access), http://www.ncbi.nlm.nih.gov/projects/faspftp/1000genomes/ (via Aspera). European Bioinformatics Institute (EMBL-EBI), with support from the Wellcome Trust: ftp://ftp.1000genomes.ebi.ac.uk/vol1/, http://www.1000genomes.org/aspera (via Aspera).

Education Grants Program: Educators, researchers and students can apply for free credits to take advantage of the utility computing platform offered by AWS, along with Public Datasets such as the 1000 Genomes Project data. If you're running a genomics workshop or have a research project which could take advantage of the hosted 1000 Genomes dataset, you can apply for an AWS Grant.
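A minimal sketch of the quick outlier filter described above, on horizontal data rather than pTrees (the function name, array layout, and the 2-std choice are illustrative assumptions, not from any existing pTree library):

// Sketch: flag points far from the mean along the mean->vom direction only.
// X[i][j] = value of attribute j for point i. Hypothetical names throughout.
#include <vector>
#include <cmath>
#include <algorithm>
#include <cstddef>

std::vector<bool> quickOutlierFilter(const std::vector<std::vector<double>>& X,
                                     double stdMultiple = 2.0) {
    size_t n = X.size(), dims = X[0].size();
    std::vector<double> mean(dims, 0.0), vom(dims);
    for (const auto& row : X)
        for (size_t j = 0; j < dims; ++j) mean[j] += row[j] / n;
    for (size_t j = 0; j < dims; ++j) {            // vector of medians, per column
        std::vector<double> col(n);
        for (size_t i = 0; i < n; ++i) col[i] = X[i][j];
        std::nth_element(col.begin(), col.begin() + n/2, col.end());
        vom[j] = col[n/2];
    }
    std::vector<bool> outlier(n, false);
    // Unit vector d = (vom - mean)/|vom - mean|; if mean == vom, no direction.
    std::vector<double> d(dims);
    double len = 0.0;
    for (size_t j = 0; j < dims; ++j) { d[j] = vom[j] - mean[j]; len += d[j]*d[j]; }
    len = std::sqrt(len);
    if (len == 0.0) return outlier;
    for (size_t j = 0; j < dims; ++j) d[j] /= len;
    // Project each point onto d and flag projections > stdMultiple stds above mean.
    std::vector<double> proj(n, 0.0);
    double mu = 0.0, var = 0.0;
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < dims; ++j) proj[i] += (X[i][j] - mean[j]) * d[j];
        mu += proj[i] / n;
    }
    for (size_t i = 0; i < n; ++i) var += (proj[i] - mu) * (proj[i] - mu) / n;
    double sd = std::sqrt(var);
    for (size_t i = 0; i < n; ++i)
        outlier[i] = (proj[i] - mu) > stdMultiple * sd;
    return outlier;
}

Marching threshold3 down from infinity corresponds to sorting proj and taking the top few, rather than fixing stdMultiple.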

2 Mark: Getting to oblique now. Here's how Diana1 (max gap method) vs Diana2 (max separation method) compare. As expected, max separation performs somewhat better; very interested in seeing oblique…. We need to be seeing 90% on this test.

Md: I wonder if we have any k-means clustering algorithm implemented using pTrees (see my vita "cluster" and below). It will create 1 mask pTree for each cluster. The algorithm is for 3 clusters but can be changed to more (or fewer) than 3. Assume we have n data points X={x1, x2,..., xn} and we have to divide them into 3 clusters (k=3). In the k-means clustering algorithm, in the first iteration we select 3 centroids (centers of the clusters). Then we find the distance of each data point from the centroids. Then we assign a point to the cluster from which it has the minimum distance. Then, for each cluster, we calculate the mean of the data points assigned to that cluster. These means are the new centroids for the next iteration. This algorithm needs a horizontal scan of all data points. But now we can do it without any horizontal scan if the points are in pTrees. Here is how it will work:
1. Represent the data points in pTrees. Iteration 1:
2. Assume three centroids (C1, C2 and C3).
3. Calculate distances D1=D(X,C1), D2=D(X,C2) and D3=D(X,C3) - may be L1 or L2; whichever it is, we can calculate it using pTrees without any horizontal scan.
4. Now calculate P(D1<D2), P(D1<D3) and P(D2<D3). P(D1<D2) is a mask pTree where a bit is 1 if D1<D2, else the bit is 0. I have an algorithm ready to calculate these mask pTrees without any horizontal scan.
5. Now calculate mask pTrees for each cluster by PC1 = P(D1<D2) & P(D1<D3), PC2 = ~P(D1<D2) & P(D2<D3), PC3 = ~P(D1<D3) & ~P(D2<D3).
6. Calculate the new centroids by Ci = Sum(X&PCi)/count(PCi).
Now start the next iteration with the new centroids calculated in step 6. I'll try to implement it using our pTreeSets with the iris data set. (A sketch follows this slide.)

WP: There have been several papers on k-means with pTrees (check my vita). A main thrust is to compute the new means in one horizontal program. The other thing would be to do the assignments (for each iteration) in one horizontal program (or at least significantly sub-linearly). A FAUST-like approach should work (set Cut-Hyper-Planes between each pair of current means???).
"Vertical K-Median Clustering", International Conference on Computers and Their Applications, A. Perera, W. Perrizo, Seattle, March 2006.
"Vertical Set Square Distance Based Clustering without Prior Knowledge", Conference on Intelligent and Adaptive Systems and Software Engineering, Toronto, 2005, A. Perera, T. Abidin, M. Serazi, G. Hamer, W. Perrizo.
"A Cluster-based Outlier Detection Method with Efficient Pruning", International Society of Computer Applications Conf. on Applications in Industry and Engineering, ISCA CAINE, Nov. 2004 (with B. Wang, D. Ren).
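A minimal sketch of steps 4-6 of Md's plan, with std::bitset standing in for mask pTrees and plain arrays standing in for the pTree-computed distances (all names are illustrative; the pTree distance arithmetic of step 3 is elided):

// Sketch of the k=3 mask-pTree assignment and recentering (steps 4-6).
#include <bitset>
#include <vector>
const int N = 150;                       // e.g., iris row count
void assignAndRecenter(const std::vector<std::vector<double>>& X,
                       const double D1[N], const double D2[N], const double D3[N],
                       std::vector<double> C[3]) {
    std::bitset<N> P12, P13, P23;        // P(D1<D2), P(D1<D3), P(D2<D3)
    for (int i = 0; i < N; ++i) {
        P12[i] = D1[i] < D2[i]; P13[i] = D1[i] < D3[i]; P23[i] = D2[i] < D3[i];
    }
    std::bitset<N> PC[3];
    PC[0] = P12 & P13;                   // PC1: closest to C1
    PC[1] = ~P12 & P23;                  // PC2: closest to C2
    PC[2] = ~P13 & ~P23;                 // PC3: closest to C3
    size_t dims = X[0].size();
    for (int c = 0; c < 3; ++c) {        // new centroid: Sum(X & PCc)/count(PCc)
        C[c].assign(dims, 0.0);
        size_t cnt = PC[c].count();
        if (cnt == 0) continue;          // empty cluster: keep zero (or old centroid)
        for (int i = 0; i < N; ++i)
            if (PC[c][i])
                for (size_t j = 0; j < dims; ++j) C[c][j] += X[i][j] / cnt;
    }
}

With real pTrees, the counts and sums come from pTree AND/count operations rather than the per-row loops shown here.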

3 Square Kilometre Array (SKA) technology: In order to provide a million square metres of collecting area, the Square Kilometre Array demands a break from traditional radio telescope design. The SKA will drive technology development, particularly in information and communication technology. Spin-off innovations in this area will benefit other systems that process large volumes of data from geographically dispersed sources. The energy requirements of the SKA also present an opportunity to accelerate technology development in scalable renewable energy generation, distribution, storage and demand reduction. The footprint of the SKA will cover an entire continent. Pivotal SKA technology is being demonstrated with a suite of precursor and pathfinder telescopes and with design studies by SKA groups around the world. Key SKA technologies will be determined from these, and many solutions will be selected and integrated into the final instrument.

To achieve both high sensitivity and high-resolution images of the radio sky, the antennas (radio wave receptors) of the SKA will be densely distributed in the central region of the array and then logarithmically positioned in groups along 5 spiral arms - each group becoming more widely spaced further from the centre. Three antenna types - high-frequency dishes and mid- and low-frequency aperture arrays - will be used by the SKA to provide continuous frequency coverage from 70 MHz to 10 GHz. Combining the signals from all the antennas will create a telescope with a collecting area equivalent to a dish with an area of about one square kilometre. [Figure: artist's impression of the central core region of the SKA telescope.]

PARAMETER - SPECIFICATION
Frequency range - 70 MHz to 10 GHz
Sensitivity (area/system temp) - 5 000 m²/K (400 μJy in 1 minute) between 70 and 300 MHz
Survey figure-of-merit - 4×10^7 to 2×10^10 m^4 K^-2 deg^2, depending on sensor technology and frequency
Field-of-view - 200 sq deg (70-300 MHz); 1-200 sq deg (0.3-1 GHz); 1 sq deg max (1-10 GHz)
Angular resolution - <0.1 arcsecond
Instantaneous bandwidth - band centre ± 50%
Spectral (frequency) channels - 16 384 per band per baseline
Calibrated polarisation purity - 10 000:1
Synthesised image dynamic range - >1 000 000
Imaging processor computation - ~10^17 operations/second
Final processed data output - 10 GB/second

What is radio astronomy? Astronomers use radio telescopes to explore the universe by detecting electromagnetic radiation emitted by objects in space. Radio wave receptors, or antennas, detect relatively long-wavelength (low-frequency) radio waves that penetrate Earth's atmosphere. These radio signals have frequencies between about 30 MHz and 40 GHz, which is equivalent to wavelengths from 10 m down to 7 mm. The SKA will observe at a frequency range from 70 MHz to 10 GHz, which is equivalent to wavelengths of 4 m to 3 cm. Radio telescopes provide alternative views of the Universe to optical telescopes and can reveal areas of space that may be obscured by cosmic dust. Radio telescopes can be linked together to create a larger virtual telescope known as an interferometer. The SKA will be the world's largest interferometer.

The SKA key science projects: The SKA will be a flexible instrument designed to address a wide range of fundamental questions in physics, astrophysics, cosmology and astrobiology. It will probe previously unexplored parts of the distant Universe. 5 key science projects have been selected:

Galaxy evolution, cosmology and dark energy: How do galaxies evolve and what is dark energy?
The acceleration in the expansion of the Universe has been attributed to a mysterious dark energy. The SKA will investigate this expansion after the Big Bang by mapping the cosmic distribution of hydrogen. The map will track young galaxies and help identify the nature of dark energy.

Strong-field tests of gravity using pulsars and black holes: Was Einstein right about gravity? The SKA will investigate the nature of gravity and challenge the theory of general relativity.

The origin and evolution of cosmic magnetism: What generates giant magnetic fields in space? The SKA will create three-dimensional maps of cosmic magnets to understand how they stabilise galaxies, influence the formation of stars and planets, and regulate solar and stellar activity.

Probing the Dark Ages: How were the first black holes and stars formed? The SKA will look back to the Dark Ages, a time before the Universe lit up, to discover how the earliest black holes and stars were formed.

The cradle of life: Are we alone? The SKA will be able to detect very weak extraterrestrial signals and will search for complex molecules, the building blocks of life, in space.

Flexible design to enable exploration of the unknown: While this is truly exciting and transformational science, history has shown that many of the greatest discoveries have happened unexpectedly. The unique sensitivity and versatility of the SKA will make it a discovery machine. We should be prepared for the possibilities.

4 The PTreeSet Genius for Big Data: Big Data is where it's at today! Querying Big Data is where DBMSs are at today. Underneath, a good foundation is needed. Our foundation is Big Vertical Data. The abstract data type, the PTreeSet (Dr. Greg Wettstein's invention!), is a perfect residualization of BVD (for both DB querying and data mining? - since, as data structures, PTreeSets are both horizontal and vertical). PTreeSets include methods for horizontal query, vertical data mining, multihop query/DM, XML.

A table, T(A1...An), as a PTreeSet data structure = a bit matrix with (typically) each numeric attribute converted to fixed point(?) (negative numbers??) and bitsliced (with a pt_pos schema), and each categorical attribute bitmapped; or coded then bitmapped, or numerically coded then bitsliced (or left as-is, i.e., a char(25) NAME column stored outside the PTreeSet? (dates?) (addresses?)). Letting A1..Ak be numeric with bitwidths bw1..bwk (0-based) and Ak+1..An categorical with category-counts cck+1...ccn, the PTreeSet is the bit matrix sketched below (a code sketch of the bit-slicing step follows this slide).

[Figure: the level-0 bit matrix - one bit column per numeric bit slice (A1,bw1 ... A1,0; A2,bw2; ...) and per category bitmap (Ak+1,c1 ... An,ccn); rows numbered 1..N.]

Methods for this data structure can provide fast horizontal row access, e.g., an FPGA could (with zero delay) convert each bit-row back to the original data row. Methods already exist to provide vertical (level-0 or raw pTree) access. Any level-1 PTreeSet can be added, given any row partition (e.g., an equiwidth=64 row intervalization) and a row predicate (e.g., >=50% 1-bits). Add "level-1 only" DM methods, e.g., an FPGA device converts unclassified rowsets to equiwidth=64, >=50% level-1 pTrees, then the entire batch would be FAUST classified in one horizontal program. Or level-1 pCKNN.

[Figure: the corresponding level-1 bit matrix - one bit per stride-64 interval, interval number 1..roof(N/64).]

pDGP (pTree Darn Good Protection): permute the column order (the permutation = the key). A random pre-pad for each bit-column would make it impossible to break the code by simply focusing on the first bit row.

Relationships (rolodex cards), such as AHG = AdenineHumanGenome, are 2 PTreeSets: the AHGPeoplePTreeSet and the AHGBasePairPositionPTreeSet (the rotation of the first). [Figure: AHG(P,bpp) bit matrix, People 1..7B by BPP 1..3B.] Vertical Rule Mining, Vertical Multi-hop Rule Mining and Classification/Clustering methods apply (viewing AHG as either a People table (cols=BPPs) or as a BPP table (cols=People)). MRM and Classification done in combination? Any table is a relationship between row and column entities (heterogeneous entities) - e.g., an image = a [reflectance-labelled] relationship between the pixel entity and the wavelength-interval entity. Always PTreeSetting both ways facilitates new research and makes horizontal row methods (using FPGAs) instantaneous (1 pass across the row pTree). More security?: make all pTrees the same (max) depth, with intron-like pads randomly interspersed...
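A minimal sketch of the basic encoding step described above - bit-slicing one numeric column into vertical bit columns (std::bitset stands in for a real pTree; names and sizes are illustrative):

// Sketch: bit-slice one unsigned numeric column into bw level-0 bit columns.
#include <bitset>
#include <vector>
const int N = 8;                              // number of rows (toy size)
std::vector<std::bitset<N>> bitSlice(const unsigned col[N], int bw) {
    std::vector<std::bitset<N>> slices(bw);   // slices[j][i] = bit j of row i
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < bw; ++j)
            slices[j][i] = (col[i] >> j) & 1u;
    return slices;
}
// Reassembling row i (the horizontal access an FPGA would do in one pass):
// value = sum over j of (slices[j][i] << j).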

5 FAUST Oblique (our best classifier?) [Figure: 2-D scatter (dim 1 vs dim 2) of class R points (r) around mean mR and class V points (v) around mean mV, with vomR and vomV marked, the d-line through the means, and the cut point between them.]

Separate class R using the midpoint-of-means (mom) method: P_R = P_{X∘d_R < a_R}; 1 pass gives the classR pTree. D ≡ mR − mV, d = D/|D|. Calculate a: (mR + (mV−mR)/2) ∘ d = (mR + mV)/2 ∘ d = a (works also if D = mV − mR).

Training ≡ placing cut-hyper-plane(s) (CHP) (= an (n−1)-dim hyperplane cutting the space in two). Classification is 1 horizontal program (AND/OR) across pTrees, giving a mask pTree for each entire predicted class (all unclassifieds at-a-time).

Accuracy improvement? Consider the dispersion within classes when placing the CHP. E.g.:
1. vector_of_medians method: use the vom to represent each class, not the mean mV, where vomV ≡ (median{v1 | v∈V}, median{v2 | v∈V},...).
2. mom_std, vom_std methods: project each class on the d-line; then calculate the std of the distances, v∘d, from the origin along the d-line (one horizontal formula per class using Md's method); then use the std ratio to place the CHP (no longer at the midpoint between mR and mV).

Note: training (finding a and d) is a one-time process. If we don't have training pTrees, we can use horizontal data for a, d (one time), then apply the formula to test data (as pTrees). (A sketch of the mom training step follows.)
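A minimal sketch of the midpoint-of-means training step (finding d and a) on horizontal data, as the note above suggests when training pTrees aren't available (names are illustrative; the one-pass pTree classification itself is not shown):

// Sketch of FAUST Oblique mom training: d = (mR - mV)/|mR - mV|,
// a = ((mR + mV)/2) o d. Classify x as R iff dot(x, d) < a.
#include <cmath>
#include <vector>
struct Cut { std::vector<double> d; double a; };
Cut momCut(const std::vector<double>& mR, const std::vector<double>& mV) {
    size_t n = mR.size();
    Cut c; c.d.resize(n);
    double len = 0.0;
    for (size_t j = 0; j < n; ++j) { c.d[j] = mR[j] - mV[j]; len += c.d[j]*c.d[j]; }
    len = std::sqrt(len);
    for (size_t j = 0; j < n; ++j) c.d[j] /= len;        // d = D/|D|
    c.a = 0.0;
    for (size_t j = 0; j < n; ++j)                       // a = midpoint o d
        c.a += 0.5 * (mR[j] + mV[j]) * c.d[j];
    return c;
}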

6 LandSat Satlog (UCI MLR): Red, Green, Infrared1, Infrared2 bands; 6 classes (1, 2, 3, 4, 5 and 7; no 6). FaustObliqueLanSat.C uses Dr. W's PTree.H and PTreeSet.H plus my PTreeOp.H headers. The 4 attributes are sliced into R.txt, G.txt, IR1.txt, IR2.txt; 2 files each - training and test. The class means (mn), direction vectors (D) and cutpoint constants (a) are calculated from training. ObliqueFunction is called with the data points, the D's and the a's, and returns a mask pTree. If means 1 and 2 are used, D12 and a12 give a mask pTree (0 for class 1, 1 for class 2 and 1/0 (X) for others).

FAUST OBLIQUE LANDSAT.C (reflowed from the slide; the angle-bracket #include targets were stripped in the transcript and are restored here as the obvious standard headers; a few slide-garbled tokens are repaired in place, flagged in comments)

#include "PTree.H"
#include "PTreeSet.H"
#include "PTreeOp.H"
#include <stdio.h>   /* restored: stripped in the transcript */
#include <math.h>    /* restored: stripped in the transcript */
#define BIT_NUM 8
#define ATT_NUM 4
#define DTS_LEN 2000
#define TRN_LEN 10

int readfile(int A[], int n, char *fname) {
  FILE *fp;
  if ((fp = fopen(fname, "rt")) == NULL) {
    fprintf(stderr, "File open error %s\n", fname); return -1; }
  for (int i = 0; i < n; i++) fscanf(fp, "%d", &A[i]);
  return 1;
}

static void ToValue(PTreeSet &p, int bit, int len, bool sign) {
  int i, j, w, v;
  w = bit - 1;
  for (i = 0; i < len; i++) {
    v = 0;
    for (j = 0; j < w; j++) { if (p[j].is_set(i)) v += pow(2, j); }
    if (p[w].is_set(i)) { if (sign) v = -1 * (pow(2, w) - v); else v += pow(2, w); }
    fprintf(stdout, "%d", v);
  }
  fprintf(stdout, "\n");
}

static void ObliqueFunction(int X[][DTS_LEN], int D[], int a, PTree &mask) {
  unsigned long int len, att_num, bit_num;
  int i, j, k;
  att_num = ATT_NUM; len = DTS_LEN; bit_num = BIT_NUM;
  PTreeSet x, r, s, t;
  PTree dmy(len); dmy.clearall();
  for (i = 0; i < bit_num; i++) x.add(dmy);
  for (i = 0; i < 2*bit_num + att_num + 1; i++) { r.add(dmy); s.add(dmy); t.add(dmy); }
  for (k = 0; k < att_num; k++) {
    /* convert attribute X[k] to pTrees and store them in PTreeSet x */
    for (i = 0; i < len; i++) {
      unsigned int bit = 1;
      for (j = 0; j < bit_num; j++) {
        if (bit & X[k][i]) x[j].setbit(i); else x[j].clearbit(i);
        bit <<= 1;   /* "it<<=1" on the slide */
      }
    }
    for (i = 0; i < 2*bit_num + att_num + 1; i++) r[i].clearall();
    if (D[k] < 0) {
      MultiplyByValue(r, x, -D[k], bit_num, bit_num);
      Make2sComplement(r, r, 2*bit_num + att_num, len);
    } else {
      MultiplyByValue(r, x, D[k], bit_num, bit_num);
    }
    AddPTrees(t, s, r, (2*bit_num + att_num), len);   /* accumulate s += D[k]*X[k] */
    for (i = 0; i < 2*bit_num + att_num; i++) s[i] = t[i];
  }
  int a_back; a_back = a;
  if (a < 0) { Make2sComplement(s, s, 2*bit_num + att_num, len); a = -a; }
  GreaterThanValue(mask, s, a, (2*bit_num + att_num));
  mask = mask & (~s[2*bit_num + att_num - 1]);
  if (a_back >= 0) mask = ~mask;
}   /* closing brace missing on the slide */

extern int main(int argc, char *argv[]);
int main(int argc, char **argv) {
  char DataSetName[ATT_NUM][30] = { "LanSat/R.txt", "LanSat/G.txt",
                                    "LanSat/IR1.txt", "LanSat/IR2.txt" };
  int A[ATT_NUM][DTS_LEN];
  int Class[DTS_LEN];
  int i, j;
  for (i = 0; i < ATT_NUM; i++) { readfile(A[i], DTS_LEN, DataSetName[i]); }
  PTree m12(DTS_LEN), m13(DTS_LEN), m14(DTS_LEN), m15(DTS_LEN), m17(DTS_LEN),
        m23(DTS_LEN), m24(DTS_LEN), m25(DTS_LEN), m27(DTS_LEN), m34(DTS_LEN),
        m35(DTS_LEN), m37(DTS_LEN), m45(DTS_LEN), m47(DTS_LEN), m57(DTS_LEN),
        m1(DTS_LEN), m2(DTS_LEN), m3(DTS_LEN), m4(DTS_LEN), m5(DTS_LEN),
        m7(DTS_LEN), cl1(DTS_LEN), cl2(DTS_LEN), cl3(DTS_LEN), cl4(DTS_LEN),
        cl5(DTS_LEN), cl7(DTS_LEN);

FAUST OBLIQUE LANDSAT.C (2)

  /* Calculation of D: class means from training */
  int mC1[ATT_NUM] = {63, 95, 108, 89}, mC2[ATT_NUM] = {49, 40, 114, 118},
      mC3[ATT_NUM] = {87, 105, 111, 87}, mC4[ATT_NUM] = {77, 91, 96, 75},
      mC5[ATT_NUM] = {60, 62, 83, 70},  mC7[ATT_NUM] = {69, 77, 82, 64};
  int d12[ATT_NUM], d13[ATT_NUM], d14[ATT_NUM], d15[ATT_NUM], d17[ATT_NUM],
      d23[ATT_NUM], d24[ATT_NUM], d25[ATT_NUM], d27[ATT_NUM], d34[ATT_NUM],
      d35[ATT_NUM], d37[ATT_NUM], d45[ATT_NUM], d47[ATT_NUM], d57[ATT_NUM];
  int sum, a12=0, a13=0, a14=0, a15=0, a17=0, a23=0, a24=0, a25=0, a27=0,
      a34=0, a35=0, a37=0, a45=0, a47=0, a57=0;
  double mn, dd = 0.0;
  for (i = 0; i < ATT_NUM; i++) {   /* opening brace lost on the slide */
    d12[i]=mC1[i]-mC2[i]; d13[i]=mC1[i]-mC3[i]; d14[i]=mC1[i]-mC4[i];
    d15[i]=mC1[i]-mC5[i]; d17[i]=mC1[i]-mC7[i]; d23[i]=mC2[i]-mC3[i];
    d24[i]=mC2[i]-mC4[i]; d25[i]=mC2[i]-mC5[i]; d27[i]=mC2[i]-mC7[i];
    d34[i]=mC3[i]-mC4[i]; d35[i]=mC3[i]-mC5[i]; d37[i]=mC3[i]-mC7[i];
    d45[i]=mC4[i]-mC5[i]; d47[i]=mC4[i]-mC7[i]; d57[i]=mC5[i]-mC7[i];
  }
  for (i = 0; i < ATT_NUM; i++) {
    a12+=(mC1[i]+mC2[i])*d12[i]; a13+=(mC1[i]+mC3[i])*d13[i];
    a14+=(mC1[i]+mC4[i])*d14[i]; a15+=(mC1[i]+mC5[i])*d15[i];
    a17+=(mC1[i]+mC7[i])*d17[i]; a23+=(mC2[i]+mC3[i])*d23[i];
    a24+=(mC2[i]+mC4[i])*d24[i]; a25+=(mC2[i]+mC5[i])*d25[i];
    a27+=(mC2[i]+mC7[i])*d27[i]; a34+=(mC3[i]+mC4[i])*d34[i];
    a35+=(mC3[i]+mC5[i])*d35[i]; a37+=(mC3[i]+mC7[i])*d37[i];
    a45+=(mC4[i]+mC5[i])*d45[i]; a47+=(mC4[i]+mC7[i])*d47[i];
    a57+=(mC5[i]+mC7[i])*d57[i];
  }
  a12/=2; a13/=2; a14/=2; a15/=2; a17/=2; a23/=2; a24/=2; a25/=2;
  a27/=2; a34/=2; a35/=2; a37/=2; a45/=2; a47/=2; a57/=2;
  /* main continues beyond what the slide shows */

PTreeOp.H

#include "PTree.H"
#include "PTreeSet.H"
#define N 5
PTreeSet inq, outq;
int pin = 0, pout = 0;
void pushin(PTree &p) { inq[pin] = p; pin++; return; }
void popout(PTree &p) { pout--; p = outq[pout]; return; }
void swapq(void) { int i; for (i = 0; i < pin; i++) { outq[i] = inq[i]; } pout = pin; pin = 0; }

static void MultiplyByValue(PTreeSet &s, PTreeSet &p, int v, int n, int m) {
  /* s = v*p: shift-and-add over the bits of v */
  int i, j, k, x[50], y;
  PTree t1, sum, carry, dmy;
  pin = 0; pout = 0; y = v;
  for (j = 0; j < m; j++) { x[j] = y % 2; y /= 2; }
  dmy.clearall();
  for (i = 0; i < 100; i++) { inq.add(dmy); outq.add(dmy); }
  for (i = 0; i < m + n; i++) {
    if (pout != 0) { popout(t1); s[i] = t1; } else { t1.setall(); s[i].clearall(); }
    while (pout > 0) { popout(t1); sum = s[i]^t1; carry = s[i]&t1; s[i] = sum; pushin(carry); }
    /* loop header garbled on the slide ("for(j=0;(j n-1)continue;");
       presumably it convolves bit j of v with slice k=i-j of p, skipping k out of range */
    for (j = 0; j < m; j++) {
      k = i - j;
      if (k < 0 || k > n - 1) continue;
      if (x[j] != 0) { sum = s[i]^p[k]; carry = s[i]&p[k]; s[i] = sum; pushin(carry); }
    }
    swapq();
  }
  return;
}

static void AddPTrees(PTreeSet &ss, PTreeSet &aa, PTreeSet &bb, int end, int len) {
  /* ripple-carry addition of two bit-sliced operands */
  PTree sum(len), carry(len), t1(len), t2(len);
  int i;
  ss[0] = aa[0]^bb[0]; carry = aa[0]&bb[0];
  for (i = 1; i < end; i++) {
    sum = aa[i]^bb[i]; t1 = aa[i]&bb[i];
    ss[i] = sum^carry; t2 = sum&carry; carry = t1|t2;
  }
  ss[end] = carry;
}

static void Make2sComplement(PTreeSet &a, PTreeSet &bb, int n, int len) {
  int i;
  PTree s(len), c(len), t(len);
  PTreeSet b;
  t.clearall();
  for (i = 0; i < n; i++) { b.add(t); b[i] = ~bb[i]; }
  a[0] = ~b[0]; c = b[0];
  for (i = 1; i < n; i++) { a[i] = b[i]^c; c = b[i]&c; }
}

static void GreaterThanValue(PTree &mask, PTreeSet &p, int value, int nn) {
  /* P(A > v): mask bit = 1 where the bit-sliced value exceeds v */
  int i, x, r;
  x = value; i = 0;
  do { r = x % 2; x /= 2; i++; } while (r != 0);
  mask = p[i-1];
  while (i < nn) { r = x % 2; if (r == 1) { mask = mask & p[i]; } else { mask = mask | p[i]; } x /= 2; i++; }
}

PTree.H

#if !defined(PTREE_H)
#define PTREE_H
#include <stdio.h>    /* restored: the five #include targets were stripped in the transcript */
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <stdint.h>
class PTree {
private:
  unsigned int bits_per_word;   /* "bits_per_wordbool loaded" on the slide */
  bool loaded;
  unsigned long long int bitcnt;
  size_t wordcnt;
  unsigned long long int onecnt;   /* 1-count */
  size_t *tree;
  unsigned long long int *index_list;
  bool _init(unsigned long long int);
public:
  PTree(void);
  PTree(const PTree &);
  PTree(unsigned long long int);
  ~PTree(void);
  const PTree & operator=(const PTree &);
  PTree operator& (const PTree &);
  PTree operator| (const PTree &);
  PTree operator^ (const PTree &);
  bool operator==(const PTree &);
  PTree operator~ ();
  bool is_loaded(void) { return loaded; }
  void clearall(void) { memset(tree, '\0', sizeof(size_t)*wordcnt); onecnt = 0; }
  unsigned long long int size(void) { return bitcnt; };
  bool is_clear(unsigned long long int bit) { return !is_set(bit); }
  unsigned long long int get_count(void) { return onecnt; }
  void setall(void) { for (unsigned long long int lp = 0; lp < bitcnt; ++lp) setbit(lp); return; }
  bool is_set(unsigned long long int);
  void setbit(unsigned long long int);
  void clearbit(unsigned long long int);
  unsigned long long int count();
  void reset();
  unsigned long long int *get_indexes(void);
  bool load(FILE *);
  bool load_binary(FILE *);
  bool save(FILE *);
  void dump(FILE *);   /* "oid dump" on the slide */
};
#endif

7 PTree.C (reflowed from the slide; stripped #include targets restored as the obvious standard headers; garbled spans are repaired in place and flagged in comments)

#include <stdio.h>      /* restored: stripped in the transcript */
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <stdint.h>
#include <arpa/inet.h>  /* ntohl/htonl */
#include "PTree.H"

const char * const ascii_header = "# ASCII PTree v1.0";
static size_t _base, _offset;

static bool _get_position(unsigned long long int bit, unsigned long long int bitcnt,
                          unsigned int bits_per_word) {
  if (bit > (bitcnt - 1)) return false;
  _base = bit / bits_per_word;
  _offset = bit % bits_per_word;
  return true;
}

/* Private function implementing a count of the # of bits in a word. Used by the
   count, load and operator methods to set onecnt of the object during any of
   the method calls. */
static inline int _count_word(size_t *word) {
  auto int bitcnt = 0;
  auto uint8_t *byte_ptr = (uint8_t *) word;
  /* 256-entry popcount table; two rows (the second "1,2,..." row and one
     "2,3,..." row) appear to have been dropped in the transcript and are
     restored here */
  static int lookup[256] = {
    0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8};
  for (size_t byte = 0; byte < sizeof(size_t); ++byte) { bitcnt += lookup[*byte_ptr]; ++byte_ptr; }
  return bitcnt;
}

static inline int _count_sparse_word(size_t *word) {
  auto int setbit, bitcnt = 0;
  auto size_t tword = *word;
  while (tword != 0) { ++bitcnt; setbit = __builtin_ffsl(tword) - 1; tword &= ~(1UL << setbit); }
  return bitcnt;
}

bool PTree::_init(unsigned long long int bits) {   /* "PTree::init" on the slide; the header declares _init */
  bits_per_word = sizeof(size_t) * CHAR_BIT;
  if ((bits == 0) || (bits > (bits_per_word * MAXINT))) return false;
  /* set number of bits represented and word count */
  bitcnt = bits;
  wordcnt = bits / bits_per_word;
  if ((bitcnt % bits_per_word) != 0) ++wordcnt;
  if ((tree = (size_t *)malloc(sizeof(size_t) * wordcnt)) == NULL) {
    printf("Memory allocation failed.\n"); return false; }
  memset(tree, '\0', wordcnt * sizeof(size_t));
  return true;
}

PTree::PTree(void) { tree = NULL; index_list = NULL; reset(); return; }

PTree::PTree(const PTree &incoming) {
  if (!_init(incoming.bitcnt)) return;
  index_list = NULL;
  memcpy(tree, incoming.tree, wordcnt * sizeof(size_t));
  count(); loaded = true; return;
}

PTree::PTree(unsigned long long int bits) { tree = NULL; index_list = NULL; reset(); loaded = _init(bits); return; }

const PTree & PTree::operator=(const PTree &incoming) {
  if (&incoming == this) return *this;
  if (bitcnt != incoming.bitcnt) { reset(); if (!_init(incoming.bitcnt)) return *this; }
  memcpy(tree, incoming.tree, wordcnt * sizeof(size_t));
  onecnt = incoming.onecnt; loaded = true; return *this;
}

/* the loop bodies of &, ^ and | were garbled on the slide ("while(word
   tree[word]&incoming.tree[word];..."); presumably word-wise over wordcnt: */
PTree PTree::operator& (const PTree &incoming) {
  auto size_t word = 0;
  auto PTree tmp(incoming.bitcnt);
  while (word < wordcnt) { tmp.tree[word] = tree[word] & incoming.tree[word]; ++word; }
  tmp.count(); return tmp;
}

PTree PTree::operator^ (const PTree &incoming) {
  auto size_t word = 0;
  auto PTree tmp(incoming.bitcnt);
  while (word < wordcnt) { tmp.tree[word] = tree[word] ^ incoming.tree[word]; ++word; }
  tmp.count(); return tmp;
}

PTree PTree::operator| (const PTree &incoming) {
  auto size_t word = 0;
  auto PTree tmp(incoming.bitcnt);
  while (word < wordcnt) { tmp.tree[word] = tree[word] | incoming.tree[word]; ++word; }
  tmp.count(); return tmp;
}

bool PTree::operator==(const PTree &incoming) {
  if (bitcnt != incoming.bitcnt) return false;
  if (onecnt != incoming.onecnt) return false;
  for (size_t word = 0; word < wordcnt; ++word)
    if (tree[word] != incoming.tree[word]) return false;
  return true;
}

PTree PTree::operator~ () {
  auto size_t words, residual_bits;
  auto unsigned long long int bitpos;
  auto PTree tmp(this->bitcnt);
  words = wordcnt;
  residual_bits = bitcnt % bits_per_word;
  if (residual_bits != 0) --words;
  for (size_t word = 0; word < words; ++word) {
    tmp.tree[word] = ~tree[word];
    tmp.onecnt += _count_word(&tmp.tree[word]);   /* " count_word" on the slide */
  }
  if (residual_bits == 0) return tmp;
  bitpos = (wordcnt - 1) * bits_per_word;
  for (size_t bit = 0; bit < residual_bits; ++bit) {
    if (is_set(bitpos)) tmp.clearbit(bitpos); else tmp.setbit(bitpos);
    ++bitpos;
  }
  return tmp;
}

PTree::~PTree(void) { reset(); return; }

bool PTree::is_set(unsigned long long int bit) {
  if (!_get_position(bit, bitcnt, bits_per_word)) return false;
  if (tree[_base] & (1UL << _offset)) return true;
  return false;
}

void PTree::setbit(unsigned long long int bit) {
  auto bool was_clear;
  if (!_get_position(bit, bitcnt, bits_per_word)) return;
  was_clear = is_clear(bit);
  tree[_base] |= (1UL << _offset);
  if (was_clear) ++onecnt;
  return;
}

void PTree::clearbit(unsigned long long int bit) {
  auto bool was_set;
  if (!_get_position(bit, bitcnt, bits_per_word)) return;
  was_set = is_set(bit);
  tree[_base] &= ~(1UL << _offset);
  if (was_set) --onecnt;
  return;
}

unsigned long long int PTree::count(void) {
  onecnt = 0;
  for (size_t word = 0; word < wordcnt; ++word) onecnt += _count_word(&tree[word]);
  return onecnt;
}

void PTree::reset(void) {
  loaded = false; bitcnt = 0; wordcnt = 0; onecnt = 0;
  if (tree != NULL) { free(tree); tree = NULL; }
  if (index_list != NULL) { free(index_list); index_list = NULL; }
  return;
}

unsigned long long int *PTree::get_indexes(void) {
  auto int firstbit, bitbase = 0;
  auto unsigned long long int entry = 0;
  auto size_t tword, allocate;
  if (bitcnt == 0) return NULL;
  allocate = onecnt * sizeof(unsigned long long int);
  index_list = (unsigned long long int *) realloc(index_list, allocate);
  if (index_list == NULL) return NULL;
  for (size_t word = 0; word < wordcnt; ++word) {
    tword = tree[word];
    while (tword != 0) {
      firstbit = __builtin_ffsl(tword) - 1;
      index_list[entry++] = bitbase + firstbit;
      tword &= ~(1UL << firstbit);
    }
    if (entry == onecnt) return index_list;
    bitbase += bits_per_word;
  }
  return index_list;
}

bool PTree::load(FILE *input) {
  auto char *p, bufr[80];
  auto unsigned long long int bits;
  if (fgets(bufr, sizeof(bufr), input) == NULL) return false;
  if ((p = strrchr(bufr, '\n')) != NULL) *p = '\0';
  if (strcmp(bufr, ascii_header) != 0) return false;
  if (fgets(bufr, sizeof(bufr), input) == NULL) return false;
  if ((p = strrchr(bufr, '\n')) != NULL) *p = '\0';
#if 0
  bits = strtoull(bufr, NULL, 16);
#else
  bits = strtoul(bufr, NULL, 16);   /* " trtoul" on the slide */
#endif
  if (!_init(bits)) return false;
  for (size_t word = 0; word < wordcnt; ++word) {
    auto size_t value;
    auto unsigned int lp;
    tree[word] = 0;
    for (lp = 0; lp < sizeof(size_t)/sizeof(uint32_t); ++lp) {
      if (fgets(bufr, sizeof(bufr), input) == NULL) return false;
      if ((p = strrchr(bufr, '\n')) != NULL) *p = '\0';
      value = ntohl(strtoul(bufr, NULL, 16));
      tree[word] |= (value << (lp * sizeof(uint32_t) * CHAR_BIT));
    }
    onecnt += _count_word(&tree[word]);
  }
  loaded = true;
  return true;
}

bool PTree::load_binary(FILE *input) {
  auto size_t bytes;
  auto unsigned long long int bits, blocks;
  if (this->bitcnt == 0) {
    if (fread(&bits, sizeof(unsigned long long int), 1, input) != 1) return false;
    if (!_init(bits)) return false;
    if (fread(&blocks, sizeof(blocks), 1, input) != 1) return false;
    if (sizeof(blocks) == 4) blocks *= 2;
  }
  bytes = bitcnt / CHAR_BIT;
  if ((bitcnt % CHAR_BIT) != 0) ++bytes;
  while ((bytes % sizeof(blocks)) != 0) ++bytes;
  if (fread(tree, sizeof(char), bytes, input) != bytes) { reset(); return false; }
  count();
  loaded = true;
  return true;
}

bool PTree::save(FILE *outfile) {
  if (!loaded) return false;
  fprintf(outfile, "%s\n", ascii_header);
  fprintf(outfile, "%llx\n", bitcnt);
  /* the word loop was garbled on the slide; presumably each word is written
     as 32-bit chunks in network byte order, mirroring load(): */
  for (size_t word = 0; word < wordcnt; ++word) {
    for (unsigned int lp = 0; lp < sizeof(size_t)/sizeof(uint32_t); ++lp) {
      uint32_t output = tree[word] >> (lp * sizeof(uint32_t) * CHAR_BIT);
      fprintf(outfile, "%08x\n", htonl(output));
    }
  }
  return true;
}

void PTree::dump(FILE *output) {
  if (!loaded) { fputs("PTree not loaded.\n", output); return; }
  fprintf(output, "%s: %s\n", __FILE__, __FUNCTION__);
  fprintf(output, "\tBit count:\t%llu\n", bitcnt);
  fprintf(output, "\tWord count:\t%zu\n", wordcnt);
  fprintf(output, "\tOne count:\t%llu\n\n", onecnt);
  auto unsigned long long int bitpos = 0;
  /* the bit loop was garbled on the slide; presumably it prints each word's
     bits, padding past bitcnt with '_': */
  for (size_t word = 0; word < wordcnt; ++word) {
    fprintf(output, "%6llu:", bitpos);
    for (unsigned int bit = 0; bit < bits_per_word; ++bit) {
      if (bitpos >= bitcnt) { fputs("_", output); continue; }
      if (tree[word] & (1UL << bit)) fputs("1", output); else fputs("0", output);
      ++bitpos;
    }
    fprintf(output, ":%llu\n", bitpos - 1);
  }
  return;
}

8 Mark S: "FAUST is fast... takes ~15 sec on the same dataset that takes over 9 hours with knn and 40 min with pTree knn. Ready to take on oblique; need better accuracy (still working on that with the cut method ("best gap" method))."
FAUST is this many times faster than:
Horizontal KNN: 2160x (KNN takes 9.000 hours = 540.00 minutes = 32,400 sec)
pCKNN: 160x (pCKNN takes 0.670 hours = 40.00 minutes = 2,400 sec)
while Mdpt FAUST takes 0.004 hours = 0.25 minutes = 15 sec.
"Doing experiments on FAUST to assess cutting off classification when gaps get too small (with an eye towards using knn or something from there). Results are pretty darn good… for FAUST this is still single gap; working on total gap (max of (min of prev and next gaps)). Here's a new data sheet I've been working on, focused on gov't clients."

Bill P: BestClsAttrGap-FAUST: use all gaps meeting criteria (e.g., sum of 2 stds < gap width), AND all mask pTrees. Oblique FAUST is more accurate and faster. Md will send what he has; please interact with him on quadratics. Could we get datasets for your performance analysis (with code of competitor algorithms etc.)? It would help us a lot in writing papers. We'd work together on Oblique FAUST performance analysis using your benchmarks. You'd be co-author. My students crunch numbers...

Mark S: Vendor opportunity: provides data mining solutions to telecom operators for call analysis, etc. - using FAUST in an unsupervised mode - thoughts on that for anomaly detection?

Mark S 2/29: tweaking Greg's FAUST implementation and looking at the gap split (it looks for the max gap, not the max gap on both sides of the mean - should it?).

WP: it looks like 50%ones impure pTrees can give cut-hyperplanes (for FAUST) as good as raw pTrees. Advantage? Since FAUST training is a 1-time process, it isn't speed critical. Very fast impure-pTree batch classification (after training). Once the cut-hyper-planes are identified, e.g., an FPGA spits out 50%ones impure pTrees for incoming unclassified datasets (e.g., satellite images) and sends them through (FPGA) for Md's "One-Pass-Across-Columns = OPAC" batch classification - all happening on-the-fly with nearly zero delay... For PINE (nearest neighbor), we don't even train a model, so the 50%ones impure-pTree classification phase could be very significantly better.

Business Intelligence = "What does this customer want next, based on histories?": FAUST is model-based (training phase = build a model of 1 hyperplane for Oblique, or up to 1 per column for non-Oblique). Use the model to classify. In Bus-Intel, with every new unclassified sample a different vector space appears (every customer rates a different set of items). So to use FAUST-PINE, there's the non-vector-space problem to solve. non-Oblique FAUST is better than Oblique here, since the columns have different cardinalities (not a vector space in which to calculate oblique hyperplanes). The attempt is to marry MYRRH multi-hop Relationship or Rule Mining with FAUST-PINE Classification or Table Mining.

On Social Network Mining: We have some social network mining research threads percolating: 1. facebook-friends multi-hopped with buying-preference relationships (or multi-hopped with security-threat relationships, or with?); 2. implications of twitter blooms for event prediction (e.g., commodity/stock changes, events, political trends, bubbles/bursts, purchasing patterns...).

WP 3/1: "...very excited about the discussions on MYRRH and applying it to classification problems, seems hugely innovative..." I want to try to view images as relationships rather than as tables: each row = a pixel and each column is "the photon count in a frequency band". Any table = a relationship (AKA a matrix, a rolodex card) with 2 entity axes: 1. the usual row entity (e.g., pixels), 2. the column entity(s) (e.g., wavelength intervals). Any matrix is a dual pair of tables (via rotation). The Cust-Item Rating matrix is the rating-table pair: Custs(Items) and its rotated dual, Items(Custs). When there are sufficiently many fine-band hyper-spectral sensors in the air (plus on/in the ground), there will be a sufficient number of separate columns to do MYRRH on the relationship between pixels and wavelengths, multi-hopped with the relationship between classes and pixels. (...nearly every measurement is a summarization or an intervalization (even a pixel is a 2-D intervalization of an infinite set), so a wavelength as an intervalization of a continuous phenomenon is just as valid?) What if we do FAUST-PINE on the rotated image relationship, Wavelength(pixel_photon_count), instead of Pixel(Wavelength_photon_count)? Note that classes which are not convex in Pix(WL) (that are spread out spatially all over the image) might be convex in WL(Pix)? Tried prelims - disappointing for classification (tried applying the concept on SatLogLandsat(R,G,ir1,ir2,class); too few bands or classes?). Still, I'm hoping for "Wow! Look at this!" when, e.g., classes aren't known/clear and there are thousands of them and millions of bands... E.g., 2 huge square-ish relationships to multi-hop are difficult (curse of dimensionality = too many columns; which are the relevant ones?); that's where rule mining comes into its own.

One last thought regarding "the curse of dimensionality = too many columns - which are the relevant ones?": FAUST automatically filters irrelevant columns to find those that reveal [convex] classes (all good classes are convex in a proper feature space). E.g., Class=yellow_car may be round-ish in Pix(RedWaveLen, GreenWaveLen, BlueWaveLen, OtherWaveLens) once R, G, B are isolated as the relevant ones. Class=pavement is fragmented in Pix(RWL,GWL,BWL,OWLs) but may be convex in WL(pix_x, pix_y) (because pavement is color-consistent?).

Last point: We have to get you a FAUST implementation! It almost has to be orders of magnitude faster than pknn! The speedup should be very sublinear - almost constant (nearly independent of cardinality) - because it is a bulk classifier (one horizontal pass gains us a class_mask_pTree, distinguishing all points predicted to be in that class). So, not only is it model-based, but it is a batch classifier. Model-based classifiers that require scanning horizontal datasets cannot compete!

Mark 3/2/12: Very close on FAUST. WP: it's important that the classification step be done in bulk lest you lose the main huge benefit of FAUST. What happens at the end if you've peeled off all the classes and there are still some unclassified points left? Have a "mixed"/"default" class (e.g., SatLog class=6="mixed"). Potential interest from some folks who have a close relationship with Arbitron. Seems like a netflix story to me...

kmurph2@clemson.edu, Mar 06: Yes, pTrees for med informatics, Bill! We could work so many miracles... the data we can generate requires robust informatics; comp. bio. would put resources into this.
Keith Murphy, Chair Genetics/Biochem, Dir., Clemson U Genomics Inst., kmurph2@clemson.edu

WP 3/6: We have applied pTrees to bioinformatics too (took second in the 2002 ACM KDD-cup in bioinformatics and first in the 2006 ACM KDD-cup in medical informatics).
2006 ACM KDD Cup Winning Team Leader, Task 3: http://www.cs.unm.edu/kdd_cup_2006, http://www.cs.unm.edu/files/kdd-cup-2006-task-spec-final.pdf
2002 ACM KDD Cup, Task 2, Yeast Gene Regulation Prediction: see http://www.acm.org/sigs/sigkdd/kddcup/index.php?section=2002&method=res

9 Multi-hop Data Mining (MDM): relationship1 (Buys = B(P,I)) ties table1 (People = P = an axis with descriptive feature columns pc, bc, lc, cc, pe, age, ht, wt) to table2 (Items, with descriptive columns Category, color, size, wt, store, city, state, country), which is tied by relationship2 (Friends = F(P,P)) to table3 (also P)... [Figure: rolodex-card bit matrices F(P,P)=Friends and B(P,I)=Buys over People 2..5 and Items 2..5.] Can we do interesting clustering and/or classification on one of the tables, using the relationships to define "close" or to define the other notions?

Define the NearestNeighborVoterSet of {f} using strong R-rules with F in the consequent? A correlation is a relationship. A strong cluster based on several self-relationships (but different relationships, so it's not just strong implication both ways) is a set that strongly implies itself (or strongly implies itself after several hops, or when closing a loop).

Find all strong A→C, A⊆P, C⊆I: Frequent iff ct(P_A) > minsup, and Confident iff ct(&_{p∈A} P_p AND &_{i∈C} P_i) / ct(&_{p∈A} P_p) > minconf. This says: "a friend of all of A will buy C if all of A buy C" (the AND is always AND). Closures: if A is frequent then A+ is frequent; if A→C is not confident, then A→C- is not confident.

Variants: ct(|_{p∈A} P_p AND &_{i∈C} P_i) / ct(|_{p∈A} P_p) > minconf: a friend of any in A will buy C if any in A buy C. ct(|_{p∈A} P_p AND |_{i∈C} P_i) / ct(|_{p∈A} P_p) > minconf changes it to: a friend of any in A will buy something in C if any in A buy C. (A sketch of the confidence test follows this slide.)

Dear Amal, We looked at the 2012 cup too and, yes, it would form a good testbed for social media data mining work. Ya Zhu in our Sat gp is leading on "contests" and is looking at the 2012 KDD Cup as well as the Heritage Provider Network Health Prize (see kaggle.com). Hoping also for a nice test bed involving our Netflix datasets (which you and then Dr. Wettstein prepared as pTrees and all have worked on extensively - Matt and Tingda Lu...). Hoping to find (in the netflix-contest-related literature) a real-life social network (a social relationship between two copies of the netflix customers, such as, maybe, facebook friends) that we can use in conjunction with the netflix "rates" relationship between netflix customers and netflix movies. We would be able to do something with that setup (all as PTreeSets both ways). For those new to dataSURG: Dr. Amal Shehan Perera is a Senior Professor in Sri Lanka and was a lead researcher in our group for many years. He is the architect of using GAs to win the KDD Cup in both 2002 and 2006. He gets most of the credit for those wins, as it was definitely GA work in both cases that pushed us over the top (I believe anyway). He's the best!! You would be wise to stay in touch with him.

Sat, Mar 24, Amal Shehan Perera <shehan@uom.lk>: Just had a peek into the slides last week and saw a request for social media data. Just wanted to point out that the 2012 KDD Cup is on social media data. I haven't had a chance to explore the data yet. If I do I will update you. Rgds, -amal
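A minimal sketch of the confidence test above for one candidate rule A→C, with std::bitset standing in for pTrees (column layouts and names are illustrative):

// Confidence of "a friend of all of A will buy C if all of A buy C":
// ct(&_{p in A} F_p AND &_{i in C} P_i) / ct(&_{p in A} F_p) > minconf.
// Fcol[p] = friend pTree of person p (column p of F);
// Pcol[i] = buyer pTree of item i (column i of Buys).
#include <bitset>
const int NP = 4;                               // number of people
bool confident(const std::bitset<NP> Fcol[], const int A[], int nA,
               const std::bitset<NP> Pcol[], const int C[], int nC,
               double minconf) {
    std::bitset<NP> lhs; lhs.set();             // AND over p in A of F_p
    for (int k = 0; k < nA; ++k) lhs &= Fcol[A[k]];
    std::bitset<NP> both = lhs;                 // AND in the item columns of C
    for (int k = 0; k < nC; ++k) both &= Pcol[C[k]];
    if (lhs.count() == 0) return false;
    return (double)both.count() / lhs.count() > minconf;
}

The "any" variants replace the lhs AND with OR (initialize lhs to all-zero and use |=).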

10 Bioinformatics Data Mining: Most bioinformatics done so far is not really data mining but is more toward the database-querying side (e.g., a BLAST search). What would be real Bioinformatics Data Mining (BDM)? A radical approach: view the whole Human Genome as 4 binary relationships between People and base-pair-positions (ordered by chromosome first, then gene region?). [Figure: AHG(P,bpp) bit matrix, People 1..7B by bpp 1..3B.]

AHG is the relationship between People and adenine (A) (1/0 for yes/no); THG is the relationship between People and thymine (T) (1/0 for yes/no); GHG is the relationship between People and guanine (G) (1/0 for yes/no); CHG is the relationship between People and cytosine (C) (1/0 for yes/no). (An encoding sketch follows this slide.)

Order bpp? By chromosome and by gene or region (level 2 is chromosome, level 1 is gene within chromosome). Do it to facilitate cross-organism bioinformatics data mining? This is a comprehensive view of the human genome (plus other genomes). Create both a People PTreeSet and a bpp PTreeSet vertical human genome DB, with a human health-records feature table associated with the people entity. Then use that as a training set for both classification and multi-hop ARM. A challenge would be to use some comprehensive decomposition (ordering of bpps) so that cross-species genomic data mining would be facilitated. On the other hand, if we have separate PTreeSets for each chromosome (or even each region - gene, intron, exon...) then we may be able to data mine horizontally across all of these vertical pTree databases.

[Figure: the same AHG(P,bpp) matrix with one person's feature columns (pc bc lc cc pe age ht wt) highlighted in red; the red person's features are used to define classes. AHG pTrees for data mining: we can look for similarity (near neighbors) in a particular chromosome, a particular gene sequence, or overall, or anything else.]
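A minimal sketch of the 4-relationship encoding, producing one bit vector per base letter for one person (std::bitset stands in for a pTree over bpp positions; sizes and names are illustrative):

// One person's rows of AHG, THG, GHG, CHG from that person's base sequence.
#include <bitset>
#include <string>
const int NBPP = 64;                     // base-pair positions (toy size)
struct PersonGenomeRows {
    std::bitset<NBPP> A, T, G, C;        // rows of AHG, THG, GHG, CHG
};
PersonGenomeRows encode(const std::string& seq) {
    PersonGenomeRows r;                  // all bits start at 0
    for (int i = 0; i < NBPP && i < (int)seq.size(); ++i)
        switch (seq[i]) {
            case 'A': r.A.set(i); break;
            case 'T': r.T.set(i); break;
            case 'G': r.G.set(i); break;
            case 'C': r.C.set(i); break;
        }
    return r;
}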

11 A facebook Member, m, purchases Item, x, and tells all friends. Let's make everyone a friend of him/her self. Each friend responds back with the Items, y, she/he bought and liked. [Figure: Members 1..4 with bit matrices F≡Friends(M,M) and P≡Purchase(M,I), Items 2..5.]

Facebook-Buys: ∀ X⊆I, MX ≡ &_{x∈X} P_x = people that purchased everything in X; FX ≡ OR_{m∈MX} F_m = friends of an MX person. So, for X={x}: is "Mx purchases x" strong? Mx = OR_{m∈Px} F_m. x is frequent if Mx is large. This is a tractable calculation: take one x at a time and do the OR. x is confident if ct(Mx & P_x) / ct(Mx) > minconf, i.e., ct(OR_{m∈Px} F_m & P_x) / ct(OR_{m∈Px} F_m) > minconf. [Worked example from the figure: K2 = {1,2,4}, P2 = {2,4}, ct(K2) = 3, ct(K2&P2)/ct(K2) = 2/3.] To mine X, start with X={x}. If it is not confident then no superset is. Closure: X={x,y} for x and y forming confident rules themselves.... (A sketch follows this slide.)

Two-hop variant: Kiddos, Buddies and Groupies, with F≡Friends(K,B), P≡Purchase(B,I) and Others(G,K): Kx = OR_{g ∈ (OR_{b∈Px} F_b)} O_g; x is frequent if Kx is large (tractable - one x at a time and OR). [Worked example: K2 = {1,2,3,4}, P2 = {2,4}, ct(K2) = 4, ct(K2&P2)/ct(K2) = 2/4.]

A Fcbk buddy, b, purchases x and tells friends; each friend tells all friends. Strong purchase possible? Intersect rather than union (AND rather than OR). Advertise to friends of friends. [Worked example with Compatriots(G,K): K2 = {2,4}, P2 = {2,4}, ct(K2) = 2, ct(K2&P2)/ct(K2) = 2/2.]
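A minimal sketch of the frequency/confidence calculation for one item x: Mx = OR, over purchasers of x, of their friend pTrees, then ct(Mx & Px)/ct(Mx) (std::bitset stands in for pTrees; names are illustrative):

// One-item Facebook-Buys test. Fcol[m] = friend pTree of member m
// (everyone is a friend of him/her self, so Fcol[m][m] = 1); Px = buyers of x.
#include <bitset>
const int NM = 4;                        // number of members
double itemConfidence(const std::bitset<NM> Fcol[], const std::bitset<NM>& Px) {
    std::bitset<NM> Mx;                  // friends of anyone who bought x
    for (int m = 0; m < NM; ++m)
        if (Px[m]) Mx |= Fcol[m];        // one OR per purchaser: tractable
    if (Mx.count() == 0) return 0.0;
    return (double)(Mx & Px).count() / Mx.count();
}

The intersect variant replaces Mx |= Fcol[m] with Mx &= Fcol[m] (starting from all-ones).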

12 Given an n-row table, a row predicate (e.g., a bit-slice predicate, or a category map) and a row ordering (e.g., ascending on key; or, for spatial data, col/row-raster, Z, or Hilbert order), the sequence of predicate truth bits is the raw or level-0 predicate Tree (pTree) for that table, row predicate and row order.

Given a raw pTree, P, a partitioning of it, par, and a bit-set predicate, bsp (e.g., pure1, pure0, gte50%One), the level-1 par,bsp pTree is the string of truths of bsp on consecutive partitions of par. If the partition is an equiwidth=m intervalization, it's called the level-1 stride=m bsp pTree. (A sketch of this derivation follows this slide.)

IRIS Table:
Name        SL  SW  PL  PW  Color
setosa      38  38  14   2  red
setosa      50  38  15   2  blue
setosa      50  34  16   2  red
setosa      48  42  15   2  white
setosa      50  34  12   2  blue
versicolor  51  24  45  15  red
versicolor  56  30  45  14  red
versicolor  57  28  32  14  white
versicolor  54  26  45  13  blue
versicolor  57  30  42  12  white
virginica   73  29  58  17  white
virginica   64  26  51  22  red
virginica   72  28  49  16  blue
virginica   74  30  48  22  red
virginica   67  26  50  19  red

[Figure: example pTrees on this table, all in the given table order: P0_SL,0 with predicate remainder(SL/2)=1; P0_SL,1 with predicate rem(div(SL/2)/2)=1; P0_Color=red with predicate Color=red; P0_PW<7 with predicate PW<7; and their level-1 stride=5 versions under the gte50%, pure1, gte25% and gte75% bit-set predicates. The gte50% stride=5 pTree of PW<7 predicts setosa. Also shown: level-1 gte50% pTrees of P0_SL,0 at stride=4, stride=8 and stride=16.]

A level-2 pTree = a level-1 pTree built on a level-1 pTree (a 1-column table): e.g., from P0_SL,0 (predicate rem(SL/2)=1, given order), form P1_gte50%,s=4,SL,0 ≡ the gte50% stride=4 level-1 pTree, and then the level-2 gte50% stride=2 pTree, P2_gte50%,s=4,SL,0, on top of it. [Figure: gte50_P11 - the raw level-0 pTree, the level-1 gte50 stride=4 pTree, and the level-1 gte50 stride=2 pTree.]
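A minimal sketch of deriving a level-1 gte50% stride=m pTree from a raw (level-0) pTree, per the definition above (std::vector<bool> stands in for the raw pTree; names are illustrative):

// One bit per stride: 1 iff >=50% of the stride's level-0 bits are 1.
#include <vector>
#include <algorithm>
std::vector<bool> level1Gte50(const std::vector<bool>& lev0, size_t m) {
    std::vector<bool> lev1;
    for (size_t start = 0; start < lev0.size(); start += m) {
        size_t end = std::min(start + m, lev0.size()), ones = 0;
        for (size_t i = start; i < end; ++i) ones += lev0[i];
        lev1.push_back(2 * ones >= (end - start));   // >=50% 1-bits in the stride
    }
    return lev1;
}

Swapping the predicate (pure1: ones == end-start; gte25%: 4*ones >= end-start; etc.) gives the other level-1 pTrees on the slide.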

13 gte50 Satlog-Landsat stride=64; classes: redsoil, cotton, greysoil, dampgreysoil, stubble, verydampgreysoil. [Figure: value-labeled rolodex-card relationships generated from the table - RG, Rir1, Rir2, Rclass, Gir1, Gir2, Gclass, ir1ir2, ir1class, ir2class - each a 0..255 by 0..255 (or by class r, c, g, d, s, v) bit matrix.]

gte50 Satlog-Landsat stride=320 gets the results below. Note: at stride=320 the means are way off and will produce inaccurate classification.

A lev0 pVector is a bit string with 1 bit per record. A lev1 pVector is a bit string with 1 bit per record stride, = the predicate truth applied to each record stride. A levN pTree = levK pVectors (K=0...N-1), all with the same predicate, and such that each levK stride is contained within 1 levK-1 stride.

Class extents (start, end, cls) and the 320-bit strides lying within each class (cls, stride start, stride end):

start  end   cls        cls  strd start  strd end
   2   1073   1          1       2          321
1074   1552   2          1     322          641
1553   2513   3          1     642          961
2514   2928   4          2    1074         1393
2929   3398   5          3    1553         1872
3399   4435   7          3    1873         2192
4436 = end               3    2193         2512
                         4    2514         2833
                         5    2929         3248
                         7    3399         3718
                         7    3719         4038
                         7    4039         4358

Stride=320 class statistics:
cls   R mean  R std   G mean  G std   ir1 mean  ir1 std   ir2 mean  ir2 std
1     64.33   6.80   104.33   3.77   112.67     0.94     100.00    16.31
2     46.00   0.00    35.00   0.00    98.00     0.00      66.00     0.00
3     89.33   1.89   101.67   3.77   101.33     3.77      85.33     3.77
4     78.00   0.00    91.00   0.00    96.00     0.00      78.00     0.00
5     57.00   0.00    53.00   0.00    66.00     0.00      57.00     0.00
7     67.67   1.70    76.33   1.89    74.00     0.00      67.67     1.70

[Figure: the 4436-pixel table (pixels × WL bands R, G, ir1, ir2, plus class; sample rows e.g. 29 15 22 30 / 20 21 40 ... with classes 2, 7, 4, 1), and the dual tables it generates: pixels × wavelength intervals [w1,w2), [w2,w3), [w3,w4), [w4,w5) with class; pixels × WLs w1...w5000; and the rotation, WLs w1...w5000 × pixels 1...4436. Given a relationship, it generates 2 dual tables; the table generates the [labeled-by-value] relationships.]

14 FAUST Satlog evaluation.

Class means (mn) from training:
cls    R       G      ir1     ir2
1    62.83   95.29  108.12   89.50
2    48.84   39.91  113.89  118.31
3    87.48  105.50  110.60   87.46
4    77.41   90.94   95.61   75.35
5    59.59   62.27   83.02   69.95
7    69.01   77.42   81.59   64.13

Class stds:
cls   R   G  ir1  ir2
1     8  15   13    9
2     8  13   13   19
3     5   7    7    6
4     6   8    8    7
5     6  12   13   13
7     5   8    9    7

Results by method (columns are classes 1, 2, 3, 4, 5, 7; class actuals: 461 224 397 211 237 470):

NonOblique level-0:
  True Positives:   99 193 325 130 151 257

NonOblique level-1 gte50:
  True Positives:  212 183 314 103 157 330
  False Positives:  14   1  42 103  36 189

Oblique level-0 using midpoint of means:
  True Positives:  322 199 344 145 174 353
  False Positives:  28   3  80 171 107  74

Oblique level-0 using means and stds of projections (w/o class elimination):
  True Positives:  359 205 332 144 175 324
  False Positives:  29  18  47 156 131  58

Oblique level-0, means and stds of projections (with class elimination in 2,3,4,5,6,7,1 order; note that no change occurs):
  True Positives:  359 205 332 144 175 324
  False Positives:  29  18  47 156 131  58

Cutpoint with the red-class std doubled (reconstructed from the slide's garbled fraction):
a = pm_r + (pm_v − pm_r) · 2·pstd_r / (pstd_v + 2·pstd_r) = (pm_r·pstd_v + pm_v·2·pstd_r) / (pstd_v + 2·pstd_r)

Oblique level-0 using means and stds of projections, doubling pstd_r, no elimination:
  True Positives:  410 212 277 179 199 324
  False Positives: 114  40 113 259 235  58

Oblique level-0, means, stds of projections, doubling pstd_r, classify, eliminate in 2,3,4,5,7,1 order:
  True Positives:  309 212 277 154 163 248
  False Positives:  22  40  65 211 196  27
With 2s1, the number of FPs is reduced and the TPs are somewhat reduced. Better? Parameterize the 2 to maximize TPs and minimize FPs. Best parameter?

Oblique level-0, means, stds of projections, doubling pstd_r, classify, eliminate in 3,4,7,5,1,2 order:
  True Positives:  329 189 277 154 164 307
  False Positives:  25   1 113 211 121  33

above = (std+stdup)/gap, below = (std+stddn)/gapdn; this suggests the order 425713 (columns are red, green, ir1, ir2, each abv/below, then the class average):
cls  red abv/bel    green abv/bel   ir1 abv/bel     ir2 abv/bel    avg
1    4.33  2.10     5.29  2.16      1.68  8.09      13.11  0.94    4.71
2    1.30           1.12            6.07  0.94                     2.36
3    1.09  2.16     8.09  6.07      1.07 13.11                     5.27
4    1.31  1.09     1.18  5.29      1.67  1.68       3.70  1.07    2.12
5    1.30  4.33     1.12  1.32     15.37  1.67       3.43  3.70    4.03
7    2.10  1.31     1.32  1.18     15.37             3.43          4.12
Sorted by avg: 4 (2.12), 2 (2.36), 5 (4.03), 7 (4.12), 1 (4.71), 3 (5.27).

2s1/(2s1+s2), elimination order 425713:
  TP: 355 205 224 179 172 307
  FP:  37  18  14 259 121  33

Summary (columns = classes 1 2 3 4 5 7, then total):
 461 224 397 211 237 470  2000  actual
  99 193 325 130 151 257  1155  TP nonObl L0 pure1
 212 183 314 103 157 330  1037  TP nonObl level-1 50%
  14   1  42 103  36 189   385  FP nonObl level-1 50%
 322 199 344 145 174 353  1537  TP Obl level-0 MeansMidPoint
  28   3  80 171 107  74   463  FP Obl level-0 MeansMidPoint
 359 205 332 144 175 324  1539  TP Obl level-0 s1/(s1+s2)
  29  18  47 156 131  58   439  FP Obl level-0 s1/(s1+s2)
 410 212 277 179 199 324  1601  TP 2s1/(2s1+s2) Obl L0 no elim
 114  40 113 259 235  58   819  FP 2s1/(2s1+s2) Obl L0 no elim
 309 212 277 154 163 248  1363  TP 2s1/(2s1+s2) Obl L0 elim 234571
  22  40  65 211 196  27   561  FP 2s1/(2s1+s2) Obl L0 elim 234571
 329 189 277 154 164 307  1420  TP 2s1/(2s1+s2) Obl L0 elim 347512
  25   1 113 211 121  33   504  FP 2s1/(2s1+s2) Obl L0 elim 347512
 355 189 277 154 164 307  1446  TP 2s1/(2s1+s2) Obl L0 elim 425713
  37  18  14 259 121  33   482  FP 2s1/(2s1+s2) Obl L0 elim 425713
   2  33  56  58   6  18   173  TP BandClass rule mining (below)
   0   0  24  46   0 193   263  FP BandClass rule mining (below)

BandClass rules: G[0,46]→2, G[47,64]→5, G[65,81]→7, G[81,94]→4, G[94,255]→{1,3}; R[0,48]→{1,2}, R[49,62]→{1,5}, R[82,255]→3; ir1[0,88]→{5,7}; ir2[0,52]→5.

Conclusion? MeansMidPoint and Oblique std1/(std1+std2) are the best, with the Oblique version slightly better. I wonder how these two methods would work on Netflix?
Two ways:
UTbl(User, M1,...,M17770) → (u,m); umTrainingTbl = SubUTbl(Support(m), Support(u), m)
MTbl(Movie, U1,...,U480189) → (m,u); muTrainingTbl = SubMTbl(Support(u), Support(m), u)
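Returning to the dispersion-aware cutpoint on the slide above, a minimal sketch of the reconstructed formula with the doubled red-class std (the function name is illustrative):

// a = pm_r + (pm_v - pm_r) * 2*pstd_r/(pstd_v + 2*pstd_r), where pm_* and
// pstd_* are the per-class means/stds of the projections onto the d-line.
double cutpointDoubledStd(double pm_r, double pstd_r, double pm_v, double pstd_v) {
    return pm_r + (pm_v - pm_r) * (2.0 * pstd_r) / (pstd_v + 2.0 * pstd_r);
}
// Equivalently (pm_r*pstd_v + pm_v*2*pstd_r)/(pstd_v + 2*pstd_r); the "2" is
// the parameter the slide suggests tuning to trade TPs against FPs.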

15 Netflix data: either 17,770 movie files {mk}, k=1..17770, each mk(u,r,d) with columns (uID, rating, date) and avg 5655 users/movie, or one Main table, Main(m,u,r,d) with columns (mID, uID, rating, date), 100,480,507 rows and avg 209 movies/user; uIDs run up to 2649429 (480,189 distinct users). [Figure: UserTable(uID, m1,...,m17770), ~47B cells, and its UPTreeSet, 3*17770 bit slices wide; MTbl(mID, u1,...,u480189), ~47B cells, and its MPTreeSet, 3*480189 bit slices wide.]

For a (u,m) to be predicted, form umTrainingTbl = SubUTbl(Support(m), Support(u), m). Of course, the two supports won't be tight together like that, but they are put that way for clarity. There are lots of 0s in the vector space umTrainingTbl. We want the largest subtable without zeros. How? SubUTbl(∩_{n∈Sup(u)} Sup(n), Sup(u), m)?

Using Coordinate-wise FAUST (not Oblique): in each coordinate n∈Sup(u), divide up all users v∈Sup(n)∩Sup(m) into their rating classes, rating(m,v). Then: 1. calculate the class means and stds, and sort the means; 2. calculate the gaps; 3. choose the best gap and define a cutpoint using the stds. This of course may be slow. How can we speed it up?

Dually, Coordinate FAUST in each coordinate v∈Sup(m): divide up all movies n∈Sup(v)∩Sup(u) into rating classes, then 1. calculate the class means and stds, and sort the means; 2. calculate the gaps; 3. choose the best gap and define the cutpoint using the stds.

Gaps alone are not best (especially since the sum of the gaps is no more than 4 and there are 4 gaps). Weighting (correlation(m,n)-based) is useful (the higher the correlation, the more significant the gap??). Cutpoints are constructed for just this one prediction, rating(u,m). Does it make sense to find all of them? Should we just find, e.g., which n-class-mean(s) rating(u,n) is closest to and make those the votes (sketch below)?
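A minimal sketch of the closest-class-mean voting idea at the end of the slide, for one coordinate n ∈ Sup(u) (names and the 1..5 rating-class layout are illustrative):

// For coordinate n: classMean[c] = mean rating on n of the users whose
// rating(m,v) = c+1; vote for the class whose mean is closest to rating(u,n).
#include <cmath>
int nearestClassVote(double ratingUN, const double classMean[5],
                     const bool classNonEmpty[5]) {
    int best = -1; double bestDist = 1e300;
    for (int c = 0; c < 5; ++c) {             // rating classes 1..5 -> idx 0..4
        if (!classNonEmpty[c]) continue;
        double dist = std::fabs(ratingUN - classMean[c]);
        if (dist < bestDist) { bestDist = dist; best = c + 1; }
    }
    return best;                              // predicted rating vote, or -1
}

Tallying these votes over all n ∈ Sup(u), possibly correlation(m,n)-weighted as the slide suggests, gives the prediction for rating(u,m).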

