CS 430: Information Discovery, Lecture 11: Latent Semantic Indexing


1 CS 430: Information Discovery Lecture 11 Latent Semantic Indexing

2 Course Administration Comments on Assignment 1. Office hours. See correction on web site.

3 Reading Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman, "Indexing by latent semantic analysis". Journal of the American Society for Information Science, Volume 41, Issue 6, 1990.

4 Latent Semantic Indexing Objective Replace indexes that use sets of index terms by indexes that use concepts. Approach Map the index term vector space into a lower dimensional space, using singular value decomposition.

5 The index term vector space [Figure: two document vectors d1 and d2 plotted against term axes t1, t2, and t3.] The space has as many dimensions as there are terms in the word list.

6 Deficiencies with Conventional Automatic Indexing Synonymy: Various words and phrases refer to the same concept (lowers recall). Polysemy: Individual words have more than one meaning (lowers precision). Independence: No significance is given to two terms that frequently appear together.

7 Example Query: "IDF in computer-based information look-up" Index terms for a document: access, document, retrieval, indexing How can we recognize that information look-up is related to retrieval and indexing? Conversely, if information has many different contexts in the set of documents, how can we discover that it is an unhelpful term for retrieval?

8 Models of Semantic Similarity Proximity models: Put similar items together in some space or structure Clustering (hierarchical, partition, overlapping). Documents are considered close to the extent that they contain the same terms. Most then arrange the documents into a hierarchy based on distances between documents. [Covered later in course.] Factor analysis based on matrix of similarities between documents (single mode). Two-mode proximity methods. Start with rectangular matrix and construct explicit representations of both row and column objects.

9 Selection of Two-mode Factor Analysis Additional criterion: computationally efficient, O(N^2 k^3), where N is the number of terms plus documents and k is the number of dimensions.

10 Technical Memo Example: Titles
c1  Human machine interface for Lab ABC computer applications
c2  A survey of user opinion of computer system response time
c3  The EPS user interface management system
c4  System and human system engineering testing of EPS
c5  Relation of user-perceived response time to error measurement
m1  The generation of random, binary, unordered trees
m2  The intersection graph of paths in trees
m3  Graph minors IV: Widths of trees and well-quasi-ordering
m4  Graph minors: A survey

11 Technical Memo Example: Terms and Documents
Counts of each term (terms occurring in at least two titles) in each title:

             c1 c2 c3 c4 c5 m1 m2 m3 m4
human         1  0  0  1  0  0  0  0  0
interface     1  0  1  0  0  0  0  0  0
computer      1  1  0  0  0  0  0  0  0
user          0  1  1  0  1  0  0  0  0
system        0  1  1  2  0  0  0  0  0
response      0  1  0  0  1  0  0  0  0
time          0  1  0  0  1  0  0  0  0
EPS           0  0  1  1  0  0  0  0  0
survey        0  1  0  0  0  0  0  0  1
trees         0  0  0  0  0  1  1  1  0
graph         0  0  0  0  0  0  1  1  1
minors        0  0  0  0  0  0  0  1  1
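The matrix above can be written down directly in code. A minimal sketch, assuming NumPy is available; rows follow the term order on the slide, columns are c1..c5, m1..m4:

```python
import numpy as np

terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
docs = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]

# X[i, j] = number of times term i occurs in title j.
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system ("system" occurs twice in c4)
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
], dtype=float)

print(X.shape)  # (12, 9): t = 12 terms, d = 9 documents
```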

12 Technical Memo Example: Query Query: Find documents relevant to "human computer interaction" Simple term matching: matches c1, c2, and c4 (which contain "human" or "computer"); misses c3 and c5.

13 [Figure 1: terms, documents, and the query plotted together; the marked region contains the points whose cosine with the query exceeds 0.9.]

14 Mathematical Concepts: Singular Value Decomposition Define X as the term-document matrix, with t rows (number of index terms) and d columns (number of documents). There exist matrices T0, S0 and D0, such that:
X = T0 S0 D0'
T0 and D0 are the matrices of left and right singular vectors. T0 and D0 have orthonormal columns. S0 is the diagonal matrix of singular values.

15 Dimensions of matrices
X = T0 S0 D0'
(t x d) = (t x m)(m x m)(m x d)
m is the rank of X, m <= min(t, d)
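These dimensions can be checked numerically. A sketch using NumPy's SVD, with a small hypothetical 4 x 3 matrix standing in for the term-document matrix X:

```python
import numpy as np

# Hypothetical 4-term x 3-document matrix standing in for X.
X = np.array([[1., 0., 1.],
              [0., 1., 0.],
              [1., 1., 0.],
              [0., 0., 1.]])

# NumPy returns D0 already transposed; full_matrices=False gives the
# "thin" SVD with m = min(t, d) singular values.
T0, s, D0t = np.linalg.svd(X, full_matrices=False)
S0 = np.diag(s)

assert np.allclose(T0 @ S0 @ D0t, X)             # X = T0 S0 D0'
assert np.allclose(T0.T @ T0, np.eye(len(s)))    # T0 has orthonormal columns
assert np.allclose(D0t @ D0t.T, np.eye(len(s)))  # D0 has orthonormal columns
```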

16 Reduced Rank Diagonal elements of S0 are positive and decreasing in magnitude. Keep the first k and set the others to zero. Delete the zero rows and columns of S0 and the corresponding rows and columns of T0 and D0. This gives:
X ≈ X̂ = TSD'
Interpretation: if the value of k is selected well, the expectation is that X̂ retains the semantic information from X, but eliminates noise from synonymy and recognizes dependence.

17 Selection of singular values
X̂ = T S D'
(t x d) = (t x k)(k x k)(k x d)
k is the number of singular values chosen to represent the concepts in the set of documents. Usually, k « m.
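A sketch of the truncation, continuing the small hypothetical matrix (NumPy assumed):

```python
import numpy as np

X = np.array([[1., 0., 1.],
              [0., 1., 0.],
              [1., 1., 0.],
              [0., 0., 1.]])
T0, s, D0t = np.linalg.svd(X, full_matrices=False)

k = 2  # number of singular values kept
T, S, Dt = T0[:, :k], np.diag(s[:k]), D0t[:k, :]
X_hat = T @ S @ Dt  # rank-k approximation of X

assert X_hat.shape == X.shape
assert np.linalg.matrix_rank(X_hat) == k
```

By the Eckart-Young theorem this is the best rank-k approximation of X in the least-squares sense; the error is exactly the largest discarded singular value.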

18 Comparing Two Terms
X̂X̂' = TSD'(TSD')' = TSD'DS'T' = TSS'T' (since D is orthonormal) = TS(TS)'
To calculate the (i, j) cell, take the dot product between rows i and j of TS. Since S is diagonal, TS differs from T only by stretching the coordinate system. The dot product of two rows of X̂ reflects the extent to which two terms have a similar pattern of occurrences across the documents.
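A sketch of the term-term comparison (NumPy assumed): the dot products of rows of TS reproduce X̂X̂' without forming X̂ explicitly.

```python
import numpy as np

X = np.array([[1., 0., 1.],
              [0., 1., 0.],
              [1., 1., 0.],
              [0., 0., 1.]])
T0, s, D0t = np.linalg.svd(X, full_matrices=False)
k = 2
T, S, Dt = T0[:, :k], np.diag(s[:k]), D0t[:k, :]

TS = T @ S                 # one k-dimensional row per term
term_sims = TS @ TS.T      # (i, j) = dot product of rows i and j of TS

X_hat = TS @ Dt
assert np.allclose(term_sims, X_hat @ X_hat.T)  # equals X-hat times its transpose
```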

19 Comparing Two Documents
X̂'X̂ = (TSD')'TSD' = DS'T'TSD' = DS(DS)' (since T is orthonormal)
To calculate the (i, j) cell, take the dot product between rows i and j of DS. Since S is diagonal, DS differs from D only by stretching the coordinate system. The dot product of two columns of X̂ reflects the extent to which two documents have a similar pattern of term occurrences.
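The document side is symmetric; a sketch (NumPy assumed):

```python
import numpy as np

X = np.array([[1., 0., 1.],
              [0., 1., 0.],
              [1., 1., 0.],
              [0., 0., 1.]])
T0, s, D0t = np.linalg.svd(X, full_matrices=False)
k = 2
T, S, Dt = T0[:, :k], np.diag(s[:k]), D0t[:k, :]

DS = Dt.T @ S              # one k-dimensional row per document
doc_sims = DS @ DS.T       # (i, j) = dot product of rows i and j of DS

X_hat = T @ S @ Dt
assert np.allclose(doc_sims, X_hat.T @ X_hat)  # equals X-hat' times X-hat
```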

20 Comparing a Term and a Document Comparison between a term and a document is the value of an individual cell of X̂.
X̂ = TSD' = TS̄(DS̄)'
where S̄ is a diagonal matrix whose values are the square roots of the corresponding elements of S. Thus the (i, j) cell is the dot product of row i of TS̄ and row j of DS̄.

21 Technical Memo Example: Query
Query: "human system interactions on trees"
In term-document space, a query is represented by x_q, a t x 1 vector. In concept space, a query is represented by d_q, a 1 x k vector.

Term        x_q
human        1
interface    0
computer     0
user         0
system       1
response     0
time         0
EPS          0
survey       0
trees        1
graph        0
minors       0

22 Query Suggested form of d_q is: d_q = x_q'TS^-1. Example of use: to compare a query against document i, take the dot product of the i-th row of DS with d_qS, which equals the dot product of the i-th row of DS with x_q'T. Note that d_q is a row vector.

23 Query Let x_q be the vector of terms for a query q. In the reduced-dimensional space, q is represented by a pseudo-document, d_q, at the centroid of the corresponding term points, with appropriate rescaling of the axes. d_q = x_q'TS^-1
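A sketch of folding a query in and ranking documents (NumPy assumed; the query vector here is hypothetical):

```python
import numpy as np

X = np.array([[1., 0., 1.],
              [0., 1., 0.],
              [1., 1., 0.],
              [0., 0., 1.]])
T0, s, D0t = np.linalg.svd(X, full_matrices=False)
k = 2
T, S, Dt = T0[:, :k], np.diag(s[:k]), D0t[:k, :]

x_q = np.array([1., 0., 1., 0.])      # hypothetical query: terms 1 and 3
d_q = x_q @ T @ np.linalg.inv(S)      # pseudo-document: d_q = x_q' T S^-1

# Rank documents by cosine between d_q S and the rows of DS
# (equivalently, between x_q' T and the rows of DS).
DS = Dt.T @ S
q = d_q @ S
scores = (DS @ q) / (np.linalg.norm(DS, axis=1) * np.linalg.norm(q))
print(np.argsort(-scores))  # documents from best to worst match
```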

24 Experimental Results Deerwester, et al. tried latent semantic indexing on two test collections, MED and CISI, for which queries and relevance judgments are available. Documents were the full text of title and abstract. Stop list of 439 words (SMART); no stemming, etc. Comparison with: (a) simple term matching, (b) SMART, (c) the Voorhees method.

25 Experimental Results: 100 Factors

26 Experimental Results: Number of Factors