pTree Text Mining
[Figure: data cube for text mining. Axes: reading position (1..mdl, where mdl = max doc length), term (vocabulary: a, again, all, always, an, and, apple, April, are, ...) and document (JSE, HHS, LMM, ...). Cells are laid out doc-major: (doc=1, term=a), (doc=1, term=again), (doc=1, term=all), (doc=2, term=a), ...]
Level-0 pTree: the mdl reading positions for each (doc, term) pair, concatenated. Length of this level-0 pTree = mdl*VocabLen*DocCount.
Level-1 TermExistencePTree (te): one bit per mdl-stride; the predicate is NOTpure0. Length of this level-1 pTree = VocabLen*DocCount.
Level-1 TermFreqPTrees (tfP k): e.g., the predicate of tfP 0 is mod(sum(mdl-stride),2)=1.
Document frequency (df, a count) with bit slices dfP k: df k isn't a level-2 pTree here since it's not a predicate on level-1 te strides. The next slide shows how to do it differently so that even the df k's come out as level-2 pTrees.
pTree Text Mining
[Figure: data cube with axes position, term (vocabulary: a, again, all, always, an, and, apple, April, are, ...) and document (JSE, HHS, LMM, ...). Data cube layout is term-major: (term=a, doc1), (term=a, doc2), (term=a, doc3), (term=again, doc1), ...]
This one, overall, level-0 pTree, corpusP, has length = MaxDocLen*DocCount*VocabLen.
This one, overall, level-1 pTree, teP, has length = DocCount*VocabLength. It is made up of per-term strides: teP t=a, teP t=again, teP t=all, ..., each with one bit per document (doc1, doc2, doc3).
Level-1 pTrees, tfP k: e.g., the predicate of tfP 0 is mod(sum(mdl-stride),2)=1.
These level-2 pTrees, dfP k, have length = VocabLength (df is the count; dfP k are its bit slices).
Level-2 pTree, hdfP?? (Hi Doc Freq): pred=NOTpure0 applied to tfP.
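The level structure above can be sketched in a few lines. This is only an illustration under stated assumptions: the pTrees are held as plain uncompressed bit lists, the layout is term-major as on this slide, and the function name `build_ptrees` and the three-document toy corpus are hypothetical, not taken from the slides.

```python
from typing import List, Tuple

def build_ptrees(docs: List[List[str]], vocab: List[str]) -> Tuple[list, list, list, list]:
    """Return (corpusP, teP, [tfP_0, tfP_1, ...], df) as plain bit lists."""
    mdl = max(len(d) for d in docs)  # max document length
    # Level 0 (term-major): for each term, for each doc, mdl position bits.
    corpus_p = []
    for term in vocab:
        for doc in docs:
            padded = doc + [None] * (mdl - len(doc))  # pad short docs with empty slots
            corpus_p.extend(1 if word == term else 0 for word in padded)
    # One mdl-stride per (term, doc) pair; the bit sum of a stride is the term frequency.
    tf = [sum(corpus_p[i:i + mdl]) for i in range(0, len(corpus_p), mdl)]
    # Level 1, term existence: predicate "stride is NOT pure0".
    te_p = [1 if count > 0 else 0 for count in tf]
    # Level 1, term frequency bit slices: tfP_k holds bit k of each stride sum,
    # so tfP_0 is exactly mod(sum(mdl-stride), 2) = 1.
    n_slices = max(1, max(tf).bit_length())
    tf_p = [[(count >> k) & 1 for count in tf] for k in range(n_slices)]
    # Document frequency: count of 1-bits in each DocCount-stride of teP.
    n_docs = len(docs)
    df = [sum(te_p[i:i + n_docs]) for i in range(0, len(te_p), n_docs)]
    return corpus_p, te_p, tf_p, df

# Hypothetical toy corpus (not the slide's documents).
docs = [["a", "a", "all"], ["again", "a"], ["all"]]
vocab = ["a", "again", "all"]
corpus_p, te_p, tf_p, df = build_ptrees(docs, vocab)
```

Because the layout is term-major, each term's teP bits are contiguous, which is what lets df be computed from fixed-length strides of teP.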
pTree Text Mining
[Figure: data cube with axes position, term (vocabulary: a, again, all, always, an, and, apple, April, are, ...) and document (JSE, HHS, LMM, ...). Data cube layout is term-major: the per-pair position pTrees P t=a,d=1, P t=a,d=2, P t=a,d=3, P t=again,d=1, ... are concatenated in that order.]
This overall level-0 pTree, corpusP, has length = MaxDocLen*DocCount*VocabLen.
This overall, level-1 pTree, teP, has length = DocCount*VocabLength. It is made up of per-term strides: teP t=a, teP t=again, teP t=all, ..., each with one bit per document (doc1, doc2, doc3).
Level-1 pTrees, tfP k: e.g., the predicate of tfP 0 is mod(sum(mdl-stride),2)=1.
These level-2 pTrees, dfP k, have length = VocabLength (df is the count; dfP k are its bit slices).
Level-2 pTree, hdfP?? (Hi Doc Freq): pred=NOTpure0 applied to tfP.
Position masks: Preface pTree, LastChpt pTree, Refrncs pTree. Any of these masks can be ANDed into the P t=, d= pTrees before they are concatenated as above (or repetitions of the mask can be ANDed after they are concatenated).
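The two ways of applying a position mask can be checked against each other with a small sketch. It assumes a single mask of length mdl shared by all documents (the slide's "repetitions of the mask"); the helper names and the bit values are hypothetical.

```python
from typing import List

def and_bits(a: List[int], b: List[int]) -> List[int]:
    """Bitwise AND of two equal-length bit lists."""
    return [x & y for x, y in zip(a, b)]

def mask_before_concat(p_td_list: List[List[int]], mask: List[int]) -> List[int]:
    """AND the mask into each P t=,d= pTree, then concatenate."""
    out: List[int] = []
    for p_td in p_td_list:
        out.extend(and_bits(p_td, mask))
    return out

def mask_after_concat(p_td_list: List[List[int]], mask: List[int]) -> List[int]:
    """Concatenate first, then AND with the mask repeated once per pTree."""
    concatenated = [bit for p_td in p_td_list for bit in p_td]
    repeated_mask = mask * len(p_td_list)
    return and_bits(concatenated, repeated_mask)

# Hypothetical data: two (term, doc) position pTrees with mdl = 4,
# and a mask selecting the first two reading positions (say, a preface).
p_td_list = [[1, 0, 1, 1], [0, 1, 1, 0]]
preface_mask = [1, 1, 0, 0]
before = mask_before_concat(p_td_list, preface_mask)
after = mask_after_concat(p_td_list, preface_mask)
```

Both orders give the same masked level-0 pTree, so the choice can be made on cost alone.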
I have put together a pBase of 75 Mother Goose Rhymes or Stories. Created a pBase of the 15 documents with 30 words (Universal Document Length, UDL) using as vocabulary all white-space separated strings.
[Figure: Lev-0 position pTrees for "Little Miss Muffet" (Little Miss Muffet sat on a tuffet eating ... of curds and whey. There came a big spider and sat down ...) against the VOCAB column (a, again, all, always, an, and, apple, April, are, around, ashes, away, away, baby, baby, bark!, beans, beat, bed, Beggars, begins, beside, between, ..., your), with Lev-1 (term freq/exist) columns te, tf, tf1, tf0.]
[Figure: the same layout for 05HDS, "Humpty Dumpty" (Humpty Dumpty sat on a wall. Humpty Dumpty ...), with Lev-0 position pTrees and Lev-1 (term freq/exist) columns te, tf, tf1, tf0.]
[Figure: Level-2 pTrees (document frequency): columns df3, df2, df1, df0, df against VOCAB, computed from the columns te04, te05, te08, te09, te27, te29, ...]
Latent semantic indexing (LSI) is an indexing and retrieval method that uses singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in text. LSI is based on the principle that words that are used in the same contexts tend to have similar meanings. A key LSI feature is the ability to extract the conceptual content of a body of text by establishing associations between terms that occur in similar contexts. [1]

LSI overcomes synonymy and polysemy, which cause mismatches in information retrieval [3] and cause Boolean keyword queries to fail.

LSI performs automatic document categorization: the assignment of docs to predefined categories based on their similarity to the conceptual content of the categories. [5] LSI uses example docs to establish the conceptual basis for each category. The concepts contained in the docs being categorized are compared to the concepts contained in the example docs, and a category (or categories) is assigned to each doc based on the similarities between them.

Mathematics of LSI (linear algebra techniques to learn the conceptual correlations in a collection of text): construct a weighted term-document matrix, do Singular Value Decomposition on it, and use that to identify the concepts contained in the text.

Term-Document Matrix, A: each of the m terms is represented by a row and each of the n docs by a column, with each matrix cell, $a_{ij}$, initially holding the number of times the associated term appears in the indicated document, $tf_{ij}$. This matrix is usually large and very sparse. Once a term-document matrix is constructed, local and global weighting functions can be applied to it to condition the data. [13]
Local weighting functions: Binary (1 if the term exists in the doc), TermFrequency.
Global weighting functions: Binary, Normal, GfIdf, Idf, Entropy.
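As a rough illustration of the construction just described, the sketch below builds a raw term-document count matrix and applies one local and one global weighting. The slide names the Idf weighting without giving a formula, so the form used here, log2(n / df), is an assumed common variant; all function names and the toy corpus are hypothetical.

```python
import math
from typing import List

def term_document_matrix(docs: List[List[str]], vocab: List[str]) -> List[List[int]]:
    """Rows are terms, columns are docs; cell (i, j) is the raw count tf_ij."""
    return [[doc.count(term) for doc in docs] for term in vocab]

def local_binary(a: List[List[int]]) -> List[List[int]]:
    """Binary local weighting: 1 if the term exists in the doc, else 0."""
    return [[1 if count > 0 else 0 for count in row] for row in a]

def global_idf(a: List[List[int]]) -> List[float]:
    """One global weight per term. ASSUMED variant: log2(n_docs / df)."""
    n_docs = len(a[0])
    weights = []
    for row in a:
        df = sum(1 for count in row if count > 0)  # document frequency of the term
        weights.append(math.log2(n_docs / df) if df else 0.0)
    return weights

def apply_weights(local: List[List[int]], global_w: List[float]) -> List[List[float]]:
    """Weighted cell = local weight * global weight of that row's term."""
    return [[g * x for x in row] for row, g in zip(local, global_w)]

# Hypothetical toy corpus.
docs = [["a", "b", "a"], ["b"], ["b", "c"]]
vocab = ["a", "b", "c"]
A = term_document_matrix(docs, vocab)
W = apply_weights(A, global_idf(A))  # TermFrequency local weight, Idf global weight
```

A term that occurs in every document (here "b") gets global weight 0 under this variant, so it contributes nothing to the weighted matrix.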
SVD reduces the dimensionality of the matrix to a tractable size by finding the singular values and keeping only the largest. It involves matrix operations and may not be amenable to pTree operations (i.e., horizontal methods are highly developed and may be best). We should study it, though, to see if we can identify a pTree-based breakthrough for creating the reduction that SVD achieves.

Is a new SVD run required for every new query, or is it a one-time thing? If it is one-time, there is probably little advantage in searching for pTree speedups. If and when it is not a one-time application to the original data, pTree speedups may hold promise. Even if it is one-time, we might take the point of view that we do the SVD reduction (using standard horizontal methods) and then convert the result to vertical pTrees for the data mining (which would be done over and over again). That pTree-ization of the end result of the SVD reduction could be organized as in the previous slides.

Here is a good paper on the subject of LSI and SVD:

Thoughts for the future: I am now convinced we can do LSI using pTree processing. The heart of LSI is SVD. The heart of SVD is Gaussian Elimination (which is adding a constant times a matrix row to another row, which we can do with pTrees). We will talk more about this next Saturday and during the week.
SVD: Let $X$ be the $t \times d$ TermFrequency (tf) matrix. It can be decomposed as $X = T_0 S_0 D_0^T$, where $T_0$ and $D_0$ have orthonormal columns and $S_0$ is diagonal, holding the singular values in descending order. Remove from $T_0$, $S_0$, $D_0$ the rows and columns of all but the highest $k$ singular values, giving $T$, $S$, $D$. Then $X \approx \hat{X} \equiv T S D^T$ ($\hat{X}$ is the rank-$k$ matrix closest to $X$). We have reduced the dimension from $\mathrm{rank}(X)$ to $k$, and we note that $\hat{X}\hat{X}^T = T S^2 T^T$ and $\hat{X}^T\hat{X} = D S^2 D^T$.

There are three sorts of comparisons of interest:
1. terms (how similar are terms i and j?) (comparing rows)
2. documents (how similar are documents i and j?) (comparing columns)
3. terms and documents (how associated are term i and doc j?) (examining individual cells)

Comparing terms (how similar are terms i and j?) (comparing rows): The dot product between two rows of $\hat{X}$ reflects their similarity (a similar occurrence pattern across the documents). $\hat{X}\hat{X}^T$ is the square $t \times t$ symmetric matrix containing all these dot products. Since $\hat{X}\hat{X}^T = T S^2 T^T = (TS)(TS)^T$, the $ij$ cell of $\hat{X}\hat{X}^T$ is the dot product of rows $i$ and $j$ of $TS$ (the rows of $TS$ can be considered coordinates of the terms).

Comparing documents (how similar are documents i and j?) (comparing columns): The dot product of two columns of $\hat{X}$ reflects their similarity (the extent to which two documents have a similar profile of terms). $\hat{X}^T\hat{X}$ is the square $d \times d$ symmetric matrix containing all these dot products. Since $\hat{X}^T\hat{X} = D S^2 D^T = (DS)(DS)^T$, the $ij$ cell of $\hat{X}^T\hat{X}$ is the dot product of rows $i$ and $j$ of $DS$ (the rows of $DS$ can be considered coordinates of the documents).

Comparing a term and a document (how associated are term i and document j?) (analyzing cell $i,j$ of $\hat{X}$): Since $\hat{X} = T S D^T = (T S^{1/2})(D S^{1/2})^T$, cell $ij$ is the dot product of the $i$-th row of $T S^{1/2}$ and the $j$-th row of $D S^{1/2}$.
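The identities on this slide can be checked numerically. The following is a minimal sketch using NumPy's `numpy.linalg.svd`; the 5 x 4 matrix is a made-up example, and the helper name `truncated_svd` is hypothetical.

```python
import numpy as np

def truncated_svd(X: np.ndarray, k: int):
    """Return T (t x k), S (k x k diagonal), D (d x k) for the k largest singular values."""
    T0, s0, D0t = np.linalg.svd(X, full_matrices=False)  # X = T0 @ diag(s0) @ D0t
    return T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T

# Made-up term-by-document count matrix (5 terms, 4 documents).
X = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 1],
              [1, 0, 0, 1]], dtype=float)

T, S, D = truncated_svd(X, k=2)
X_hat = T @ S @ D.T                      # rank-2 approximation of X

TS = T @ S                               # rows = term coordinates
DS = D @ S                               # rows = document coordinates
S_half = np.sqrt(S)                      # elementwise sqrt of a diagonal matrix

term_sims = X_hat @ X_hat.T              # t x t, all term-term dot products
doc_sims = X_hat.T @ X_hat               # d x d, all doc-doc dot products
cells = (T @ S_half) @ (D @ S_half).T    # term-document associations
```

Here `term_sims` agrees with `TS @ TS.T`, `doc_sims` with `DS @ DS.T`, and `cells` with `X_hat`, so all three comparisons can be made from the k-dimensional coordinates without forming $\hat{X}$.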
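Documents (titles):
c1 Human machine interface for Lab ABC computer apps
c2 A survey of user opinion of comp system response time
c3 The EPS user interface management system
c4 System and human system engineering testing of EPS
c5 Relation of user-perceived response time to error measmnt
m1 The generation of random, binary, unordered trees
m2 The intersection graph of paths in trees
m3 Graph minors IV: Widths of trees and well-quasi-ordering
m4 Graph minors: A survey

Term-document matrix X (cell = number of times the term occurs in the title, with "comp" in c2 counted as "computer"):
term\doc   c1 c2 c3 c4 c5 m1 m2 m3 m4
human       1  0  0  1  0  0  0  0  0
interface   1  0  1  0  0  0  0  0  0
computer    1  1  0  0  0  0  0  0  0
user        0  1  1  0  1  0  0  0  0
system      0  1  1  2  0  0  0  0  0
response    0  1  0  0  1  0  0  0  0
time        0  1  0  0  1  0  0  0  0
EPS         0  0  1  1  0  0  0  0  0
survey      0  1  0  0  0  0  0  0  1
trees       0  0  0  0  0  1  1  1  0
graph       0  0  0  0  0  0  1  1  1
minors      0  0  0  0  0  0  0  1  1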
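$X = T_0 S_0 D_0^T$, with $T_0$ and $D_0$ column-orthonormal. Approximate $X$ keeping only the first 2 singular values and the corresponding columns of $T$ and $D$, which are the coordinates used to position terms and docs in the 2D representation. In this reduced model: $X \approx \hat{X} = T S D^T$.
[Table: $\hat{X}$, term\doc, with columns c1 c2 c3 c4 c5 m1 m2 m3 m4 and rows human, interface, computer, user, system, response, time, EPS, survey, trees, graph, minors; cell values not recoverable.]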
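The rank-2 reduction for this example can be reproduced with a short sketch. The counts below are read off the nine titles c1..m4 (treating "comp" in c2 as "computer", which is an assumption); the variable names are hypothetical.

```python
import numpy as np

terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
doc_ids = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]

# Term-by-document counts derived from the titles (rows follow `terms`).
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],   # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],   # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],   # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],   # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],   # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],   # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],   # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],   # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],   # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],   # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],   # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],   # minors
], dtype=float)

T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
k = 2
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T

X_hat = T @ S @ D.T          # reduced model
term_xy = T                  # one 2D point per term
doc_xy = D                   # one 2D point per document

# Frobenius error of the approximation, and the energy in the dropped singular values.
approx_error = np.linalg.norm(X - X_hat)
dropped = np.sqrt(np.sum(s0[k:] ** 2))
```

The two error quantities coincide, which is the sense in which $\hat{X}$ is the closest rank-k matrix to $X$. The signs of the 2D coordinates depend on the SVD routine, so a plot may be mirrored relative to a published figure.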
[Table: doc\term matrix with columns human, interface, computer, user, system, response, time, EPS, survey, trees, graph, minors and rows c1..c5, mc, m1..m4, mm, q, D, d; cell values not recoverable.]
$a = \frac{mc + mm}{2} \cdot d$. $q \cdot d = 0.65$ is far less than $a$, so q is way into the c class.
[Table: d(doc-i, q) over the same terms, with rows $(c1-q)^2$, $(c2-q)^2$, $(c3-q)^2$, $(c4-q)^2$, $(c5-q)^2$, $(m1-q)^2$, $(m2-q)^2$, $(m3-q)^2$, $(m4-q)^2$; cell values not recoverable.]
What this tells us is that c1 is closest to q in the full space and that the other c documents are no closer than the m documents. Therefore q would probably be classified as c (one voter in the 1.5 neighborhood), but not clearly. This shows the need for SVD or Oblique FAUST!
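The full-space comparison can be sketched as follows. Several things here are assumptions rather than values from the slide: the query q is taken to contain only the terms human and computer, mc and mm are taken to be the means of the c and m documents, D = mm - mc, and d = D/|D|. Because of these choices the numbers will not necessarily match the slide's 0.65.

```python
import numpy as np

terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
doc_ids = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]

# Term-by-document counts read off the nine titles (rows follow `terms`).
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0], [0, 1, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 2, 0, 0, 0, 0, 0], [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0], [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 1, 1],
], dtype=float)

# ASSUMED query: a document containing just "human" and "computer".
q = np.zeros(len(terms))
q[terms.index("human")] = 1.0
q[terms.index("computer")] = 1.0

# Squared Euclidean distance from q to every document in the full term space.
sq_dist = ((X - q[:, None]) ** 2).sum(axis=0)
nearest = doc_ids[int(np.argmin(sq_dist))]

# Midpoint-of-means cut (ASSUMED reading of mc, mm, D, d, a).
mc = X[:, :5].mean(axis=1)            # mean of the c documents
mm = X[:, 5:].mean(axis=1)            # mean of the m documents
D = mm - mc
d = D / np.linalg.norm(D)             # unit vector from the c mean toward the m mean
a = ((mc + mm) / 2) @ d               # projection of the midpoint onto d
q_dot_d = q @ d
label = "c" if q_dot_d < a else "m"
```

Under these assumptions only c1 lies within distance 1.5 of q, while the remaining c documents are at least as far from q as every m document, which is the weak nearest-neighbor vote the slide describes; the projection test still places q on the c side of the cut.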