DSR is binary (1 means doc was sent by Sender to Receiver)


Mining Communications data: prediction and anomaly detection on emails, tweets, phone and text.

DSR is binary (1 means the doc was sent by Sender to Receiver). Or, Sender can be an attribute of Doc(typ, Nm, Sz, HsTg, Sndr). The pSVD trick is to replace these massive relationship matrixes with small feature matrixes. Using just one feature, replace the matrixes with vectors: f = (fD, fT, fU, fS, fR) or f = (fD, fT, fU). [Diagram: replace DSR with fD, fS and fR; replace TD with fT and fD; replace UT with fU and fT; feature matrixes shown with 2 features.]

Use GradientDescent+LineSearch to minimize the sum of squared errors, sse, where sse is the sum over all nonblanks in TD, UT and DSR. Should we train the User feature segments separately (train fU with UT only and train fS and fR with DSR only), or train the User segment with both UT and DSR and then let fS = fR = fU? In the first case f = (fD, fT, fU, fS, fR); this will be called the 3D f. In the second case, training the User feature segment just once, f = (fD, fT, fU=fS=fR); this will be called the 3DTU f.

We do the pTree conversions and train f in the CLOUD; then we download the resulting f to users' personal devices for predictions and anomaly detections. The same setup should work for phone-record Documents, tweet Documents (in the US Library of Congress), text Documents, etc.
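As a concrete illustration of that GradientDescent+LineSearch step, here is a minimal one-feature sketch in Python/NumPy for a single relationship matrix. The function name, the candidate step sizes and the use of dense arrays instead of pTrees are all illustrative assumptions, not from the slides.

import numpy as np

def train_psvd_1feature(M, mask, rounds=50, steps=(0.01, 0.05, 0.1, 0.5, 1.0)):
    """One-feature pSVD sketch: fit M[i,j] ~ fR[i]*fC[j] on the nonblank cells only.
    M     : 2-D array of cell values (blank positions can hold anything)
    mask  : boolean array, True where M is nonblank
    steps : candidate step sizes for a crude line search"""
    fR = np.ones(M.shape[0])          # row-entity feature vector (e.g., fD)
    fC = np.ones(M.shape[1])          # column-entity feature vector (e.g., fT)

    def sse(r, c):
        e = (M - np.outer(r, c)) * mask       # errors on nonblanks only
        return float(np.sum(e * e))

    for _ in range(rounds):
        e = (M - np.outer(fR, fC)) * mask
        gR = -2.0 * e @ fC                    # d(sse)/d(fR)
        gC = -2.0 * e.T @ fR                  # d(sse)/d(fC)
        t = min(steps, key=lambda s: sse(fR - s * gR, fC - s * gC))   # line search
        fR, fC = fR - t * gR, fC - t * gC
    return fR, fC, sse(fR, fC)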

pSVD for Communication Analytics: f = (fD_TD, fT_TD, fT_UT, fU_UT, fS_DSR, fD_DSR, fR_DSR). Train f as follows: train on the 2D matrix TD, train on the 2D matrix UT, and train over the 3D matrix DSR.

Writing the one-feature predictions as td' = fT_t*fD_d, ut' = fU_u*fT_t and dsr' = fD_d*fS_s*fR_r (each matrix uses its own copy of the entity segments listed above), the errors and gradients are, with each sum running over the nonblank cells of that matrix only:

sse_TD  = SUM_nbTD (td' - TD_td)^2
sse_UT  = SUM_nbUT (ut' - UT_ut)^2
sse_DSR = SUM_nbDSR (dsr' - DSR_dsr)^2

dsse_TD/dfD_d  = 2 SUM_nbTD (td' - TD_td)*fT_t
dsse_TD/dfT_t  = 2 SUM_nbTD (td' - TD_td)*fD_d
dsse_UT/dfU_u  = 2 SUM_nbUT (ut' - UT_ut)*fT_t
dsse_UT/dfT_t  = 2 SUM_nbUT (ut' - UT_ut)*fU_u
dsse_DSR/dfD_d = 2 SUM_nbDSR (dsr' - DSR_dsr)*fS_s*fR_r
dsse_DSR/dfS_s = 2 SUM_nbDSR (dsr' - DSR_dsr)*fD_d*fR_r
dsse_DSR/dfR_r = 2 SUM_nbDSR (dsr' - DSR_dsr)*fD_d*fS_s

pSVD classification predicts blank cell values.

pSVD FAUST Cluster: use pSVD to speed up FAUST clustering by looking for gaps in the pSVD approximation of TD rather than in TD itself (i.e., using the SVD-predicted values rather than the actual given TD values). The same goes for DT, UT, TU, DSR, SDR, RDS. E.g., on the T(d1,...,dn) table, the t-th row is pSVD-estimated as (fT_t*fD_d1, ..., fT_t*fD_dn), and the dot product v o t is pSVD-estimated as SUM_k=1..n v_k*fT_t*fD_dk. So we analyze gaps in this column of values taken over all rows t.

pSVD FAUST Classification: use pSVD to speed up FAUST classification by finding optimal cut points in the pSVD approximation of TD rather than in TD itself (i.e., using the SVD-predicted values rather than the actual given TD values). The same goes for DT, UT, TU, DSR, SDR, RDS.
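The 3D DSR case looks the same except that each prediction is a product of three feature values. Below is a minimal sketch of one gradient step over the nonblank cells of DSR (NumPy, illustrative names; dense arrays stand in for pTrees).

import numpy as np

def dsr_gradient_step(DSR, mask, fD, fS, fR, t=0.05):
    """One gradient step on sse over the nonblank cells of the 3-D DSR tensor,
    where the prediction for cell (d,s,r) is fD[d]*fS[s]*fR[r]."""
    pred = np.einsum('d,s,r->dsr', fD, fS, fR)
    e = (DSR - pred) * mask                          # errors on nonblank cells only
    gD = -2.0 * np.einsum('dsr,s,r->d', e, fS, fR)   # d(sse)/d(fD)
    gS = -2.0 * np.einsum('dsr,d,r->s', e, fD, fR)   # d(sse)/d(fS)
    gR = -2.0 * np.einsum('dsr,d,s->r', e, fD, fS)   # d(sse)/d(fR)
    return fD - t * gD, fS - t * gS, fR - t * gR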

A real-valued vector space T(C1,...,Cn) is a 2-entity (R = row entity, C = column entity) labeled relationship over rows R1,...,RN and columns C1,...,Cn. Let fT_i,j = fR_i o fC_j be the approximation to T, where f = (fR, fC) is an F x (N+n) matrix trained to minimize sse = SUM over nonblank T_ij of (fT_ij - T_ij)^2. Assuming one feature (i.e., F = 1):

fT_row_i = fR_i*fC = fR_i*(fC_1,...,fC_n) = (fR_i*fC_1, ..., fR_i*fC_n)
fT_col_j = fR^tr*fC_j = (fR_1*fC_j, ..., fR_N*fC_j)^tr

One forms each such SPTS by multiplying an SPTS by a number (Md's algorithm), so we only need the two feature SPTSs to get the entire PTreeSet(fT), which approximates PTreeSet(T). A 2-entity matrix can be viewed as a vector space in 2 ways.

E.g., the Document entity: we meld the Document table with the DSR matrix and the DT matrix to form an ultrawide Universal Doc Tbl, UD(Name, Time, Sender, Length, Term1,...,TermN, Receiver1,...,Receivern), where N is roughly 10,000 and n is roughly 1,000,000,000. We train 2 feature vectors to approximate UD: fD and fC, where fC = (fST, fS, fL, fT1,...,fTN, fR1,...,fRn). We have found it best to train with a minimum of matrixes, which means there will be a distinct fD vector for each matrix.

How many bitslices are in the PTreeSet for UD? Assuming an average bitwidth of 8 for its columns, that would be about 8,000,080,024 bitslices. That may be too many to be useful (e.g., for download onto an iPhone). Therefore we can approximate PTreeSetUD with fUD as above. Whenever we need a Scalar PTreeSet representing a column Ck of UD (from PTreeSetUD), we can download that fCk value plus fD and multiply the SPTS fD by the constant fCk to get a "good" approximation to the actual SPTS needed.

We note that the concept of the join (equijoin), which is so central to the relational model, is not necessary when we use the rolodex model and focus on entities (each entity, as a join attribute, is pre-joined).
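A tiny illustration of the "two feature SPTSs give the whole approximation" point, using ordinary arrays as stand-ins for SPTSs (the numbers are made up):

import numpy as np

# Made-up one-feature SPTSs for a 3-row, 4-column relationship T.
fR = np.array([2.0, 1.0, 3.0])        # row-entity feature SPTS
fC = np.array([1.0, 0.5, 4.0, 2.0])   # column-entity feature SPTS

fT_row = lambda i: fR[i] * fC          # fT_row_i = (fR_i*fC_1, ..., fR_i*fC_n)
fT_col = lambda j: fR * fC[j]          # fT_col_j = (fR_1*fC_j, ..., fR_N*fC_j)

print(fT_row(0))    # [2.  1.  8.  4.]
print(fT_col(2))    # [ 8.  4. 12.]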

A vector space is closed under addition (adding one vector componentwise to another) and under multiplication by a scalar (multiplying a vector by a real number produces another vector). We also need component-wise multiplication of two vectors (vector multiplication, the first half of a dot product), but that is not a required vector space operation. Md and Arjun, do you have code for these?

Some thoughts on scalar multiplication: it's just shifts and additions. E.g., take v = (7,1,6,6,1,2,2,4)^tr and scalar-multiply by 3 = (0 1 1)_2. The 1 bit in the 2^1 position of 3 shifts each bitslice one position to the left, and those shifted slices get added to the unshifted bitslices (due to the units 1 bit, the 2^0 position). The result bitslices are:

r3  r2     r1     r0
v2  v1     v0     .       (due to the 1x2^1 in 3)
.   v2     v1     v0      (due to the 1x2^0 in 3)
v2  v2+v1  v1+v0  v0

Note vi + vj = (vi XOR vj) with carry (vi AND vj).
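A minimal sketch of that shift-and-add scalar multiplication on bitslices, assuming each bitslice is held as a Python integer bitmask over the positions of v (a stand-in for a compressed pTree; all helper names are illustrative):

def slices_from_values(vals, width):
    """Vertical bitslices: slices[k] is a bitmask whose i-th bit is bit k of vals[i]."""
    return [sum(((v >> k) & 1) << i for i, v in enumerate(vals)) for k in range(width)]

def values_from_slices(slices, n):
    """Rebuild the n horizontal values from the bitslices."""
    return [sum(((s >> i) & 1) << k for k, s in enumerate(slices)) for i in range(n)]

def add_slices(a, b):
    """Componentwise add of two bitsliced vectors: sum bit = XOR, carry = AND, rippled up."""
    out, carry = [], 0
    for k in range(max(len(a), len(b)) + 1):
        x = a[k] if k < len(a) else 0
        y = b[k] if k < len(b) else 0
        out.append(x ^ y ^ carry)
        carry = (x & y) | (x & carry) | (y & carry)
    return out

def scalar_mult_slices(slices, c):
    """Multiply a bitsliced vector by the integer c using shifts of the slice index and adds."""
    total = [0]
    for k in range(c.bit_length()):
        if (c >> k) & 1:
            total = add_slices(total, [0] * k + slices)   # shift slices up by k, then add
    return total

v = [7, 1, 6, 6, 1, 2, 2, 4]
s = slices_from_values(v, 3)                                  # s = [v0, v1, v2] as bitmasks
print(values_from_slices(scalar_mult_slices(s, 3), len(v)))   # [21, 3, 18, 18, 3, 6, 6, 12]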

Recalling the massive interconnection of relationships between entities: any analysis we do on this structure we can do after estimating each matrix using pSVD-trained feature vectors for the entities. On the next slide we display the pSVD1 (one-feature) replacement by feature vectors, which approximates the non-blank cell values and predicts the blanks.

[Diagram: the rolodex of relationship cards over the entities - DSR (Doc, sender, receiver), UT (User, Term), customer-rates-movie card, customer-item card, People-Item (PI) card, author-doc card, term-doc card, Course Enrollments, doc-doc card, term-term card (share stem?), gene-gene card (ppi), exp-gene card, exp-PI card.]

On this slide we display the pSVD1 (one-feature) replacement by feature vectors, which approximates the non-blank cell values and predicts the blanks. Train the following feature vector through gradient descent of sse, except that each matrix's set of feature-vector segments is trained only on the sse over the nonblank cells of that matrix (in the figure: train these 2 on GG1, train these 2 on EG, train on GG2, and the same for the rest of them). Any data mining we can do with the matrixes, we can do (estimate) with the feature vectors (e.g., Netflix-like recommenders, prediction of blank cell values, FAUST gap-based classification and clustering, including anomaly detection).

[Diagram: one feature-vector segment per entity per matrix - fDSR,D, fDSR,S, fDSR,R, fUT,U, fUT,T, fTD,T, fTD,D, fCI,C, fCI,I, fTT,T1, fTT,T2, fUM,M, fE, fE,S, fE,C, fG1...fG5, fD1, fD2, fE1, fE2 - replacing the relationship cards (DSR, UT, TD, CI, Enroll, UserMovie ratings, TermTerm, GG1, GG2, ExpG, ExpPI, AD, DD) of the previous slide.]
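A minimal sketch of that per-matrix training rule for two matrices that share one entity (here UT and TD sharing the Term entity): each matrix contributes gradient only through its own nonblank cells, and the shared segment fT accumulates both contributions (NumPy, illustrative names):

import numpy as np

def shared_entity_step(UT, mUT, TD, mTD, fU, fT, fD, t=0.05):
    """One joint gradient step with a shared Term segment fT:
    UT[u,t'] ~ fU[u]*fT[t'] and TD[t',d] ~ fT[t']*fD[d].
    mUT, mTD are boolean nonblank masks; each matrix's sse uses only its own nonblanks."""
    eUT = (UT - np.outer(fU, fT)) * mUT
    eTD = (TD - np.outer(fT, fD)) * mTD
    gU = -2.0 * eUT @ fT
    gD = -2.0 * eTD.T @ fT
    gT = -2.0 * (eUT.T @ fU) - 2.0 * (eTD @ fD)   # shared segment gets both matrices' contributions
    return fU - t * gU, fT - t * gT, fD - t * gD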

An n-dimensional vector space RC(C1,...,Cn) is a matrix or TwoEntityRelationship (with row-entity instances R1,...,RN and column-entity instances C1,...,Cn). ARC will denote the pSVD approximation of RC. An (N+n)-vector f = (fR, fC) defines the prediction p_i,j = fR_i*fC_j and the error e_i,j = p_i,j - RC_i,j; then ARC_f,i,j = fR_i*fC_j and ARC_f,row_i = fR_i*fC = fR_i*(fC_1,...,fC_n) = (fR_i*fC_1, ..., fR_i*fC_n). Use sse gradient descent to train f.

Once f is trained, and if d is a unit n-vector, the SPTS ARC_f o d^tr is the column vector whose i-th entry is

SUM_k=1..n fR_i*fC_k*d_k  =  (fR_i*fC) o d^tr  =  fR_i*(fC o d^tr),   i = 1..N.

So: compute fC o d^tr = SUM_k=1..n fC_k*d_k, form a constant SPTS with it, and multiply that SPTS by the SPTS fR. Any data mining that can be done on RC can be done using this pSVD approximation of RC, ARC, e.g., FAUST Oblique (because ARC o d^tr should show us the large gaps quite faithfully).

Given any K x (N+n) feature matrix F = [FR FC], with FR_i = (f1R_i,...,fKR_i) and FC_j = (f1C_j,...,fKC_j), the prediction is p_i,j = FR_i o FC_j = SUM_k=1..K fkR_i*fkC_j. Once F is trained, and if d is a unit n-vector, the SPTS ARC o d^tr is the column vector whose i-th entry is

SUM_k=1..n (f1R_i*f1C_k + ... + fKR_i*fKC_k)*d_k  =  (FR_i o FC) o d^tr  =  FR_i o (FC o d^tr),   i = 1..N.

Keeping in mind that we have decided (tentatively) to approach all matrixes as rotatable tables, this then is a universal method of approximation. The big question is: how good is the approximation for data mining? It is known to be good for Netflix-type recommender matrixes, but what about others?
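A minimal sketch of the factored projection (FC o d^tr first, then FR), plus a toy gap scan of the resulting column, as FAUST Oblique would use it (NumPy; the gap-finder shown is only illustrative, not the full FAUST procedure):

import numpy as np

def psvd_oblique_projection(FR, FC, d):
    """Projection column ARC o d^tr without materializing ARC.
    FR : (N, K) row-entity features, FC : (n, K) column-entity features, d : length-n unit vector.
    Since ARC = FR @ FC.T, we have ARC @ d = FR @ (FC.T @ d): O((N+n)K) work instead of O(N*n)."""
    return FR @ (FC.T @ d)

def gap_midpoints(proj, min_gap):
    """Toy gap scan over the projection values (illustrative only)."""
    s = np.sort(proj)
    gaps = np.diff(s)
    return [(s[i] + s[i + 1]) / 2.0 for i in np.nonzero(gaps > min_gap)[0]]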

Training trace (Rnd = round, t = line-search step size, sse after the round):

Rnd  t      sse
1    1.67   24.357
2    0.1    4.2030
3    0.124  1.8173
4    0.085  1.0415
5    0.16   0.7040
6    0.08   0.5115
7    0.24   0.3659
8    0.074  0.2741
9    0.32   0.2022
10   0.072  0.1561
11   0.4    0.1230
12   0.07   0.0935
13   0.42   0.0741
14   0.05   0.0599
15   0.07   0.0586
16   0.062  0.0553
17   0.062  0.0523
18   0.062  0.0495
19   0.063  0.0468
20   2.1    0.0014
21   0.1    0.0005
22   0.2    0.0000

Of course, if we take the previous data (all nonblanks = 1) and we only count errors in those nonblanks, then f = pure1 has sse = 0. But if it is a fax-type image (of 0/1s), then there are no blanks (the 0 positions must be assessed for error too). So we change the data.

[Residue of the 15-row 0/1 data matrix and its per-cell error matrix omitted; the column alignment did not survive extraction.]

Next, consider a fax-type image dataset (blanks = zeros; sse is summed over all cells). Training trace (t, sse):

0.25  13.128
1.04  11.057
0.4   10.633
0.6   10.436
0.4   10.349
0.6   10.298
0.4   10.266
0.6   10.241
0.5   10.223
0.4   10.209
1     10.193
0.4   10.182
0.5   10.176
0.5   10.171
0.5   10.167
0.5   10.164
0.5   10.161
0.5   10.159
0.5   10.158
0.5   10.157
0.5   10.156
0.5   10.155
0.5   10.154

Minimum sse = 10.154. [Residue of the trained per-cell error matrices omitted.]

Without any gradient-descent rounds we can knock down column 1 with T = t + (tr1,...,tcf), but sse = 11.017 (it can't go below its minimum of 10.154). [Residue of the corresponding feature values tr1...trf, tc1...tcf and their error matrix omitted.]
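A tiny sketch of the distinction made above: with the pure1 feature vectors, errors counted only on nonblank cells can be zero, while the fax-type reading (zeros are real data) forces every cell into the sse (NumPy; the cell positions are made up):

import numpy as np

M = np.zeros((15, 11))
M[0, 0] = 1; M[2, 3] = 1; M[8, 4] = 1          # a few 1-cells at made-up positions

fR, fC = np.ones(15), np.ones(11)              # the "pure1" feature vectors
pred = np.outer(fR, fC)                        # predicts 1 everywhere

sse_nonblanks = np.sum((M - pred)[M != 0] ** 2)   # errors counted only on nonblank cells
sse_all_cells = np.sum((M - pred) ** 2)           # fax-type reading: the zeros are real data
print(sse_nonblanks, sse_all_cells)               # 0.0 vs 162.0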