pTree Text Mining

Presentation transcript:

pTree Text Mining: the DocTrmPos pTreeSet, a data-cube layout over Position x Term (Vocab: a, again, all, always, an, and, apple, April, are, ...) x Document (e.g., JSE, HHS, LMM, ...).

Level-0: for each document and each vocabulary term, the mdl reading-positions of that term in that document (mdl = max doc length), concatenated as doc=1/term=a, doc=1/term=again, doc=1/term=all, ..., then doc=2, doc=3, ... Length of this level-0 pTree = mdl*VocabLen*DocCount.

Level-1 TermExistencePTree (te): one bit per (doc, term) mdl-stride; its pred is NOTpure0. Length of this level-1 pTree = VocabLen*DocCount. Level-1 TermFreqPTrees (tfP_k) bit-slice the term frequency of each stride (e.g., the predicate of tfP_0 is mod(sum(mdl-stride), 2) = 1).

Doc frequency, df (a count per term, with bit slices dfP_3, ..., dfP_0): df_k isn't a level-2 pTree, since it's not a predicate on level-1 te strides. The next slide shows how to do it differently so that even the df_k's come out as level-2 pTrees.
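To make the layout concrete, here is a minimal sketch (not from the slides) of how the level-0 DocTrmPos bit vector and the level-1 te/tf slices could be built for a toy corpus; the three tiny documents and the variable names (level0, te, tf0, tf1) are assumptions for illustration, with plain numpy arrays standing in for compressed pTrees.

import numpy as np

docs = {"d1": "a again a all", "d2": "all again", "d3": "a a a"}   # toy corpus (hypothetical)
vocab = sorted({w for text in docs.values() for w in text.split()})
mdl = max(len(text.split()) for text in docs.values())             # max doc length

# Level-0 DocTrmPos bit vector: one mdl-stride per (doc, term) pair;
# bit p of a stride is 1 iff the term occurs at reading-position p of the doc.
level0 = np.zeros(mdl * len(vocab) * len(docs), dtype=np.uint8)
for d, text in enumerate(docs.values()):
    words = text.split()
    for t, term in enumerate(vocab):
        base = (d * len(vocab) + t) * mdl
        for p, w in enumerate(words):
            if w == term:
                level0[base + p] = 1

strides = level0.reshape(len(docs) * len(vocab), mdl)
te  = (strides.sum(axis=1) > 0).astype(np.uint8)   # level-1 TermExistence: NOTpure0 per stride
tf  = strides.sum(axis=1)                          # term frequency per (doc, term) stride
tf0 = (tf % 2).astype(np.uint8)                    # level-1 tfP_0: low-order bit of tf
tf1 = ((tf >> 1) % 2).astype(np.uint8)             # level-1 tfP_1: next bit of tf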

pTree Text Mining: the Corpus pTreeSet, the same data cube laid out with Term as the outer dimension: term=a doc1, term=a doc2, term=a doc3, term=again doc1, ... (Vocab terms: a, again, all, always, an, and, apple, April, are, ...; Pos within each doc).

This one overall level-0 pTree, corpusP, has length = MaxDocLen*DocCount*VocabLen. This one overall level-1 pTree, teP, has length = DocCount*VocabLen, grouped per term: teP for t=a over doc1, doc2, doc3; teP for t=again over doc1, doc2, doc3; teP for t=all over doc1, doc2, doc3; ...

Level-1 pTrees tfP_k bit-slice term frequency (e.g., pred of tfP_0: mod(sum(mdl-stride), 2) = 1).

In this layout the df counts (with bit slices dfP_3, ..., dfP_0) are taken per term over the doc1, doc2, doc3, ... strides, so these level-2 pTrees, dfP_k, have length = VocabLen. A level-2 pTree hdfP (Hi Doc Freq?) can also be formed, with pred = NOTpure0 applied to tfP.
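A small follow-on sketch (again an assumption, with numpy arrays standing in for pTrees) of how this per-term grouping makes the level-2 dfP_k bit slices easy to form from the te strides; the hdfP threshold is invented for illustration.

import numpy as np

# In the Corpus pTreeSet the level-1 te bits are grouped per term (DocCount bits
# per term), so doc frequency is just the 1-count of each term's stride and its
# bit slices are the level-2 dfP_k.
doc_count = 3
teP = np.array([1, 0, 1,   # t=a      over doc1, doc2, doc3
                1, 1, 0,   # t=again  over doc1, doc2, doc3
                1, 1, 1],  # t=all    over doc1, doc2, doc3
               dtype=np.uint8)

te_strides = teP.reshape(-1, doc_count)            # one DocCount-stride per term
df = te_strides.sum(axis=1)                        # doc frequency per term
n_bits = max(int(df.max()).bit_length(), 1)
dfP = [((df >> k) & 1).astype(np.uint8) for k in range(n_bits)]   # level-2 dfP_0, dfP_1, ...
hdfP = (df > doc_count // 2).astype(np.uint8)      # one possible HiDocFreq mask (assumed threshold)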

pTree Text Mining: the same Corpus pTreeSet layout, now naming the individual per-(term, doc) position pTrees P t=a,d=1, P t=a,d=2, P t=a,d=3, P t=again,d=1, ...

As before: the overall level-0 pTree, corpusP, has length MaxDocLen*DocCount*VocabLen; the overall level-1 pTree, teP, has length DocCount*VocabLen; the level-1 pTrees tfP_k bit-slice term frequency (e.g., pred of tfP_0: mod(sum(mdl-stride), 2) = 1); the level-2 pTrees dfP_k (dfP_3, ..., dfP_0) have length VocabLen; and the level-2 pTree hdfP (Hi Doc Freq?) has pred = NOTpure0 applied to tfP.

Section masks can also be defined over the reading positions: a Preface pTree, a LastChpt pTree, a Refrncs pTree. Any of these masks can be ANDed into the P t=,d= pTrees before they are concatenated as above (or repetitions of the mask can be ANDed after they are concatenated).
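The mask-ANDing idea might look like the following sketch; the Preface/Refrncs masks and the position counts are hypothetical, and numpy bitwise AND stands in for the pTree AND.

import numpy as np

mdl = 8                                                              # hypothetical max doc length
P_t_d = np.array([0, 1, 0, 0, 1, 0, 0, 0], dtype=np.uint8)           # stand-in P t=,d=: positions where the term occurs
preface_mask = np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=np.uint8)    # assumed: first 3 positions are the preface
refrncs_mask = np.array([0, 0, 0, 0, 0, 0, 1, 1], dtype=np.uint8)    # assumed: last 2 positions are references

# AND a section mask into the per-(term, doc) pTree before concatenation:
occurrences_in_preface = P_t_d & preface_mask

# or tile the mask and AND it after the P t=,d= strides are concatenated:
corpus_slice = np.concatenate([P_t_d, P_t_d])      # pretend two (term, doc) strides
masked = corpus_slice & np.tile(preface_mask, 2)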

I have put together a pBase of 75 Mother Goose rhymes or stories. From those I created a pBase of the 15 documents with ≤ 30 words (the Universal Document Length, UDL), using as vocabulary all white-space-separated strings (a, again, all, always, an, and, apple, April, are, around, ashes, away, baby, bark!, beans, beat, bed, Beggars, begins, beside, between, ...).

Level-0 and level-1 (term freq/exist: te, tf, tf1, tf0) pTrees are shown for doc 04LMM ("Little Miss Muffet sat on a tuffet eating of curds and whey. There came a big spider and sat down ...") and doc 05HDS ("Humpty Dumpty sat on a wall. Humpty Dumpty ...").

The level-2 pTrees (document frequency) give, per vocabulary term, the df count and its bit slices df3, df2, df1, df0, alongside the te columns of the individual documents (te04, te05, te08, te09, te27, te29, te34).
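A hedged sketch of how such a pBase could be tabulated (two stand-in rhymes rather than the real 15-document pBase; the column names te, tf, tf1, tf0, df follow the slide, everything else is assumed).

from collections import Counter

import numpy as np

# Tiny stand-in for the Mother Goose pBase (two documents, not the real 15).
docs = {
    "04LMM": "Little Miss Muffet sat on a tuffet eating of curds and whey.",
    "05HDS": "Humpty Dumpty sat on a wall.",
}
vocab = sorted({w for text in docs.values() for w in text.split()})   # all white-space separated strings

# Per-document level-1 columns: te (exists), tf, and the tf bit slices tf1, tf0.
columns = {}
for name, text in docs.items():
    counts = Counter(text.split())
    tf = np.array([counts.get(w, 0) for w in vocab])
    columns[name] = {"te": (tf > 0).astype(int), "tf": tf,
                     "tf1": (tf >> 1) & 1, "tf0": tf & 1}

# Level-2 document-frequency column and its bit slices df3..df0.
te_matrix = np.stack([columns[n]["te"] for n in docs])   # docs x vocab
df = te_matrix.sum(axis=0)
df_slices = {f"df{k}": (df >> k) & 1 for k in range(3, -1, -1)}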

Latent semantic indexing (LSI) is an indexing and retrieval method that uses Singular Value Decomposition (SVD) to find patterns in the terms and concepts of a text collection. LSI is based on the principle that words used in the same contexts tend to have similar meanings. A key LSI feature is the ability to extract the conceptual content of a body of text by establishing associations between terms that occur in similar contexts. [1]

LSI overcomes synonymy and polysemy, which cause mismatches in info retrieval [3] and cause Boolean keyword queries to miss relevant results. LSI performs automatic document categorization (assignment of docs to predefined categories based on similarity to the conceptual content of the categories). [5] LSI uses example docs to establish the conceptual basis for each category: the concepts in a doc being categorized are compared to the concepts contained in the example items, and a category (or categories) is assigned to the doc based on the similarities between the concepts it contains and the concepts contained in the example docs.

Mathematics of LSI (linear algebra techniques to learn the conceptual correlations in a collection of text): construct a weighted term-document matrix, do Singular Value Decomposition on it, and use that to identify the concepts contained in the text.

Term-Document Matrix, A: each (of m) terms is represented by a row and each (of n) docs is represented by a column, with each matrix cell, a_ij, initially the number of times the associated term appears in the indicated document, tf_ij. This matrix is usually large and very sparse. Once the term-document matrix is constructed, local and global weighting functions can be applied to it to condition the data. Local weighting functions: Binary, TermFrequency; global weighting functions: Binary, Normal, GfIdf, Idf, Entropy. [13]

SVD basically reduces the dimensionality of the matrix to a tractable size by finding the singular values. It involves matrix operations and may not be amenable to pTree operations (i.e., the horizontal methods are highly developed and may be best). We should study it, though, to see if we can identify a pTree-based breakthrough for creating the reduction that SVD achieves.

Is a new SVD program run required for every new query, or is it a one-time thing? If it is one-time, there is probably little advantage in searching for pTree speedups. If and when it is not a one-time application to the original data, pTree speedups may hold promise. Even if it is one-time, we might take the point of view that we do the SVD reduction (using standard horizontal methods) and then convert the result to vertical pTrees for the data mining (which would be done over and over again). That pTree-ization of the end result of the SVD reduction could be organized as in the previous slides. Here is a good paper on the subject of LSI and SVD:

Thoughts for the future: I am now convinced we can do LSI using pTree processing. The heart of LSI is SVD. The heart of SVD is Gaussian elimination (which is adding a constant times a matrix row to another row, something we can do with pTrees). We will talk more about this next Saturday and during the week.
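As an illustration of the construct-then-decompose pipeline just described (not part of the original slides), here is a minimal numpy sketch using the TermFrequency local weight and the Idf global weight from the lists above; the toy matrix is invented.

import numpy as np

# Toy term-document matrix A (m terms x n docs); cell a_ij = tf_ij.
A = np.array([[1, 0, 0, 2],
              [0, 1, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 3, 1]], dtype=float)
m, n = A.shape

# Condition the data: local weight = TermFrequency, global weight = Idf
# (one of the local/global combinations listed above).
df = (A > 0).sum(axis=1)
idf = np.log(n / df)
W = A * idf[:, None]

# Singular Value Decomposition of the weighted matrix.
T0, s0, D0t = np.linalg.svd(W, full_matrices=False)

# Keep only the k largest singular values: the rank-k "concept" space.
k = 2
T, S, Dt = T0[:, :k], np.diag(s0[:k]), D0t[:k, :]
W_hat = T @ S @ Dt          # closest rank-k approximation of W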

SVD: Let X be the t-by-d TermFrequency (tf) matrix. It can be decomposed as X = T0 S0 D0^T, where T0 and D0 have orthonormal columns and S0 has only the singular values on its diagonal, in descending order. Remove from T0, S0, D0 the rows/columns of all but the highest k singular values, giving T, S, D. Then X ~= X^ ≡ TSD^T (X^ is the rank-k matrix closest to X). We have reduced the dimension from rank(X) to k, and we note that X^X^T = TS^2 T^T and X^T X^ = DS^2 D^T.

There are three sorts of comparisons of interest: 1. terms (how similar are terms i and j?) (comparing rows); 2. documents (how similar are documents i and j?) (comparing columns); 3. terms and documents (how associated are term i and doc j?) (examining individual cells).

Comparing terms (how similar are terms i and j?) (comparing rows): the dot product between two rows of X^ reflects their similarity (a similar occurrence pattern across the documents). X^X^T is the square t x t symmetric matrix containing all these dot products, and X^X^T = TS^2 T^T, so the ij cell of X^X^T is the dot product of rows i and j of TS (the rows of TS can be considered coordinates of the terms).

Comparing documents (how similar are documents i and j?) (comparing columns): the dot product of two columns of X^ reflects their similarity (the extent to which two documents have a similar profile of terms). X^T X^ is the square d x d symmetric matrix containing all these dot products, and X^T X^ = DS^2 D^T, so the ij cell of X^T X^ is the dot product of rows i and j of DS (considered coordinates of the documents).

Comparing a term and a document (how associated are term i and document j?) (analyzing cell i,j of X^): since X^ = TSD^T, cell ij is the dot product of the i-th row of TS^(1/2) and the j-th row of DS^(1/2).
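A short numpy sketch of the three comparison types in the reduced space (a toy random matrix is used so the block stands alone; the TS rows, DS rows, and TS^(1/2)/DS^(1/2) products follow the formulas above).

import numpy as np

# Term-term, doc-doc, and term-doc comparisons in the reduced space.
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(6, 5)).astype(float)    # toy t x d tf matrix
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
k = 2
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T

term_coords = T @ S                         # rows: term coordinates
doc_coords = D @ S                          # rows: document coordinates
term_term = term_coords @ term_coords.T     # = X^ X^T, term-term similarities
doc_doc = doc_coords @ doc_coords.T         # = X^T X^, doc-doc similarities
S_half = np.diag(np.sqrt(s0[:k]))
term_doc = (T @ S_half) @ (D @ S_half).T    # = X^, term-doc associations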

term\doc    c1  c2  c3  c4  c5  m1  m2  m3  m4
human        1   0   0   1   0   0   0   0   0
interface    1   0   1   0   0   0   0   0   0
computer     1   1   0   0   0   0   0   0   0
user         0   1   1   0   1   0   0   0   0
system       0   1   1   2   0   0   0   0   0
response     0   1   0   0   1   0   0   0   0
time         0   1   0   0   1   0   0   0   0
EPS          0   0   1   1   0   0   0   0   0
survey       0   1   0   0   0   0   0   0   1
trees        0   0   0   0   0   1   1   1   0
graph        0   0   0   0   0   0   1   1   1
minors       0   0   0   0   0   0   0   1   1

c1 Human machine interface for Lab ABC computer apps
c2 A survey of user opinion of comp system response time
c3 The EPS user interface management system
c4 System and human system engineering testing of EPS
c5 Relation of user-perceived response time to error measmnt
m1 The generation of random, binary, unordered trees
m2 The intersection graph of paths in trees
m3 Graph minors IV: Widths of trees and well-quasi-ordering
m4 Graph minors: A survey

X = T0 S0 D0^T, with T0 and D0 column-orthonormal. Approximate X by keeping only the first 2 singular values and the corresponding columns of T and D, which are the coordinates used to position terms and docs in the 2D representation above. In this reduced model: X ~ X^ = TSD^T. (The slide shows the resulting rank-2 reconstruction X^ as a term\doc table over the same terms, human through minors, and the same documents c1-c5, m1-m4; its entries are real-valued estimates rather than the original counts.)

Query example (doc\term layout over the same documents c1-c5, m1-m4 and terms human, interface, computer, user, system, response, time, EPS, survey, trees, graph, minors). Form a query vector q, the mean mc of the c documents, and the mean mm of the m documents; with direction d and cut-point a = ((mc+mm)/2) dot d, we get q dot d = 0.65, far less than a, so q is way into the c class.

Computing the full-space distances d(doc_i, q) for (c1-q), (c2-q), ..., (m4-q): what this tells us is that c1 is closest to q in the full space and that the other c documents are no closer than the m documents. Therefore q would probably be classified as c (one voter in the 1.5 neighborhood), but not clearly. This shows the need for SVD or Oblique FAUST!
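A sketch of the query-versus-centroid comparison described above, in the full space and after folding the query into the k=2 space; the query vector and the folding-in formula q_hat = S^-1 T^T q are assumptions (the slide does not give them), and X is the example matrix from the earlier slide.

import numpy as np

# Full-space comparison of a query against the c-class and m-class centroids,
# then the same comparison after folding the query into the k=2 LSI space.
X = np.array([
    [1,0,0,1,0,0,0,0,0], [1,0,1,0,0,0,0,0,0], [1,1,0,0,0,0,0,0,0],
    [0,1,1,0,1,0,0,0,0], [0,1,1,2,0,0,0,0,0], [0,1,0,0,1,0,0,0,0],
    [0,1,0,0,1,0,0,0,0], [0,0,1,1,0,0,0,0,0], [0,1,0,0,0,0,0,0,1],
    [0,0,0,0,0,1,1,1,0], [0,0,0,0,0,0,1,1,1], [0,0,0,0,0,0,0,1,1]], dtype=float)
q = np.zeros(12); q[0] = 1; q[2] = 1          # assumed query: terms "human" and "computer"

mc, mm = X[:, :5].mean(axis=1), X[:, 5:].mean(axis=1)   # class centroids
dists = np.linalg.norm(X - q[:, None], axis=0)          # full-space d(doc_i, q)

# Fold q into the rank-2 space (standard LSI folding-in: q_hat = S^-1 T^T q).
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
k = 2
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T
q_hat = np.linalg.inv(S) @ T.T @ q
red_dists = np.linalg.norm(D - q_hat, axis=1)           # reduced-space distances to each doc
print(dists.round(2), red_dists.round(2))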