Latent Semantic Indexing and Probabilistic (Bayesian) Information Retrieval.


Dimensionality Reduction Imagine we’ve collected data on the HEIGHT and WEIGHT of everyone in a classroom of N students. These might be plotted as in figure 5.3. Notice the correlation around an axis we might call SIZE: students vary most along this dimension, and it captures most of the information about their distribution. Because the two quantities are correlated, a single derived axis can capture the major source of variation across the HEIGHT/WEIGHT sample.
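The SIZE idea can be sketched numerically. A minimal demonstration (with invented HEIGHT/WEIGHT figures, not real student data) that the first principal axis of correlated 2-D data captures most of the variance:

```python
import numpy as np

# Hypothetical HEIGHT (cm) and WEIGHT (kg) for a class of 50 students,
# both driven by a latent "SIZE" factor plus a little noise.
rng = np.random.default_rng(0)
size = rng.normal(0.0, 1.0, 50)
height = 170 + 10 * size + rng.normal(0, 2, 50)
weight = 65 + 12 * size + rng.normal(0, 3, 50)

X = np.column_stack([height, weight])
X_centred = X - X.mean(axis=0)

# The first right singular vector of the centred data is the axis of
# maximal variance -- the "SIZE" direction.
_, s, vt = np.linalg.svd(X_centred, full_matrices=False)
explained = s[0] ** 2 / np.sum(s ** 2)
print(f"first axis explains {explained:.0%} of the variance")
```

Because HEIGHT and WEIGHT are strongly correlated here, the single SIZE axis accounts for nearly all of the spread, which is exactly the intuition dimensionality reduction exploits.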

Dimensionality Reduction in the Vector Space Model In the Vector Space Model (VSM), the Index is a D × V matrix, where D is the number of documents and V is the size of the vocabulary (see next slide). Attempts to reduce this large-dimensional space into something smaller are called dimensionality reduction. There are two reasons we might be interested in reducing dimensions: a) to reduce the size of the representation of the documents – initially the matrix contains many zeros, one for each word not present in a document, i.e. the matrix is sparse; b) to exploit what are known as the latent semantic relationships among keywords. The VSM assumes the vocabulary dimensions are orthogonal to one another, i.e. that the keywords are independent. But index terms are in fact highly dependent, highly correlated with one another. We exploit this by keeping only those axes of maximal variation and throwing away the rest.

Singular Value Decomposition (SVD) Just as students’ HEIGHT and WEIGHT are correlated about the dimension SIZE, we can expect that (at least some small sets of) keywords are correlated, and so we can reduce the dimensionality of the Index in the same way we reduced that of our students’ measurements. Using SVD, the Index is decomposed into three matrices, U, L and A (which can be multiplied together to recover the original Index). L is a diagonal matrix, where the value in cell 1 reflects the importance of the most dominant correlation, the value in cell 2 the second most dominant, and so on down to the value in the Vth cell, which reflects the least dominant correlation. We reduce the number of dimensions from V to k by keeping only the first k cells in L. Finally, the three matrices are multiplied back together to produce an Index with fewer effective dimensions – an approximation of the original.
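This decompose-truncate-recompose step can be sketched with NumPy. The names Index, U, L, A follow the slide; the counts in the toy Index are purely illustrative:

```python
import numpy as np

# Toy Index: 4 documents x 5 vocabulary terms (invented counts).
Index = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [0, 0, 0, 2, 1],
    [0, 0, 1, 1, 2],
], dtype=float)

# SVD: Index = U @ diag(L) @ A, with L's values sorted from the most
# to the least dominant correlation.
U, L, A = np.linalg.svd(Index, full_matrices=False)

# Keep only the first k cells of L, then multiply back together to get
# a rank-k approximation of the original Index.
k = 2
Index_k = U[:, :k] @ np.diag(L[:k]) @ A[:k, :]

# The approximation error is exactly the energy in the discarded
# singular values (the Eckart-Young theorem).
err = np.linalg.norm(Index - Index_k)
print(err, np.sqrt(np.sum(L[k:] ** 2)))
```

The recomposed Index_k has the same shape as the original, but every document now lies in a k-dimensional subspace of the vocabulary space.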

How many dimensions to reduce to? To date, the only answers are empirical. Using too few dimensions dramatically degrades performance. A few hundred dimensions might suffice for a very topically-focussed vocabulary (e.g. medicine), but more might be needed when describing a broader domain of discourse. See next slide: 500 dimensions gives the highest proportion correct on a synonym test.

“Latent Semantic” Claims In IR, SVD was first applied to the Index matrix by Deerwester et al. (1990), and was called Latent Semantic Indexing (LSI). The “Latent Semantic” claim derives from the authors’ belief that the reduced dimension representation of documents in fact reveals semantic correlations among index terms. While one author might use CAR and another the synonym AUTO, the correlation of both of those with other terms like HIGHWAY, GASOLINE and DRIVING will result in an abstracted document feature / dimension on which queries using either keyword, CAR or AUTO, will work equivalently. Retrieval based on synonyms has been achieved.

Probabilistic (Bayesian) Retrieval There are no absolute logical grounds on which to prove that any document is relevant to any query. Our best hope is to retrieve documents which are probably relevant. The probability ranking principle (van Rijsbergen) is the assumption that an optimal IR system orders (ranks) documents in decreasing probability of relevance.

Pr(Rel) There are at least two possible interpretations of what a probability of relevance Pr(Rel) might mean: a) imagine (for a particular query) an “experiment” in which the document is shown to multiple users, some of whom judge it relevant; or b) the same document/query relevance question is repeatedly put to the same user, who sometimes replies that it is relevant and sometimes that it isn’t. Either way, we focus on one query and compute Pr(Rel) conditioned on the features we associate with the document d.

Bayesian Inversion Let x be a vector of features xi describing a document, so that a matching function match(q, d) ∝ Pr(Rel | x). If we had worked hard on a corpus of documents to identify (always with respect to some particular query) which were Rel and which were not, it would be possible to study carefully which features xi were reliably found in relevant documents and which were not. Collecting such statistics for each feature would then allow us to estimate Pr(x | Rel) – the probability of any set of features x given that we know the document is Rel. The retrieval question requires that we ask the reverse: the probability that a document with features x should be considered relevant. This inversion is achieved via the familiar Bayes rule: Pr(Rel | x) = Pr(x | Rel) · Pr(Rel) / Pr(x)
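A minimal numeric sketch of the inversion, with invented figures: suppose 1% of documents are relevant to the query, and a particular feature vector x occurs in 40% of relevant documents but only 2% of irrelevant ones.

```python
# Invented probabilities for illustration only.
pr_rel = 0.01            # Pr(Rel): prior chance a document is relevant
pr_x_given_rel = 0.40    # Pr(x | Rel)
pr_x_given_nrel = 0.02   # Pr(x | NRel)

# Law of total probability for Pr(x), then Bayes' rule.
pr_x = pr_x_given_rel * pr_rel + pr_x_given_nrel * (1 - pr_rel)
pr_rel_given_x = pr_x_given_rel * pr_rel / pr_x
print(round(pr_rel_given_x, 3))  # -> 0.168
```

Observing x lifts the probability of relevance from the 1% prior to about 17% – this posterior Pr(Rel | x) is what the matching function ranks by.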

Odds Calculation (2) The first term Odds(Rel) will be small; the odds of picking a relevant versus irrelevant document independent of any features of the document are not good. Odds(Rel) is a characteristic of the entire corpus or the generality of the query but insensitive to any analysis we might perform on a particular document. In order to calculate the second term Pr(x | Rel) / Pr(x | NRel) we need a more refined model of how documents are “constructed” from their features.

Binary Independence Model The binary assumption is that all the features xi are binary (either present or absent in a document). The much bigger assumption is that the document’s features occur independently of each other, so that Pr(x | Rel) = ∏i Pr(xi | Rel). This assumption is often violated: think of the example of “click here” – whenever we see “click” on a web page it is usually followed by “here”.
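Under these two assumptions the probability of a whole document factors into per-term probabilities. A minimal sketch with invented per-term probabilities:

```python
import numpy as np

# Pr(x_i = 1 | Rel) for three vocabulary terms (invented values).
p = np.array([0.8, 0.1, 0.5])
# A binary document vector: terms 1 and 3 present, term 2 absent.
x = np.array([1, 0, 1])

# Independence: multiply p_i where the term is present, (1 - p_i)
# where it is absent.
pr_x_given_rel = np.prod(np.where(x == 1, p, 1 - p))
print(pr_x_given_rel)  # 0.8 * 0.9 * 0.5 = 0.36
```

For correlated pairs like “click”/“here” this product badly underestimates the joint probability, which is exactly the weakness the slide points out.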

Comparing with the query Recall that both queries and documents live in the same vector space defined over the features xi. The two products of equation 5.46 (defined in terms of presence or absence of a feature in a document) can be broken into four subcases, depending on whether the features occur in the query. We don’t care about terms that are not in the query because they don’t affect the query-document comparison – we assume that the probability of these features being present is equal in relevant and irrelevant documents. See figure 5.6 (previous slide): sets D and Q are defined in terms of those features xi present and absent in the document and query respectively. Equation 5.46 can then be rewritten (don’t worry about the exact derivation) as equation 5.48.

Estimating the pi and qi with RelFbk Consider the retrospective case, where we have RelFbk from a user who has evaluated each of the top N documents in an initial retrieval and has found R of them to be relevant (the remaining N-R having been judged irrelevant). If a particular feature xi is present in n of the retrieved documents, with r of these relevant, then this bit of RelFbk provides reasonable estimates for pi and qi. See equations 5.49 and 5.50 on the previous slide.
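Since equations 5.49 and 5.50 are on a slide not reproduced here, the sketch below uses the standard binary-independence estimates with add-0.5 smoothing (an assumption, not necessarily the exact formulas the slide shows), together with the usual log-odds term weight built from them:

```python
import math

# Invented feedback counts: N judged documents, R relevant;
# the term x_i appears in n of them, r of those relevant.
N, R = 100, 20
n, r = 30, 15

# Smoothed estimates of Pr(x_i present | Rel) and Pr(x_i present | NRel).
p_i = (r + 0.5) / (R + 1.0)
q_i = (n - r + 0.5) / (N - R + 1.0)

# Log odds-ratio weight for the term: positive when the term is more
# likely in relevant than in irrelevant documents.
w_i = math.log((p_i * (1 - q_i)) / (q_i * (1 - p_i)))
print(round(p_i, 3), round(q_i, 3), round(w_i, 2))
```

Here the term occurs in 75% of the relevant documents but only about 19% of the irrelevant ones, so it receives a strongly positive weight in the ranking function.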