Information Retrieval and Web Search Vasile Rus, PhD vrus@memphis.edu www.cs.memphis.edu/~vrus/teaching/ir-websearch/
Outline Advanced IR Models Probabilistic Model Latent Semantic Analysis
Probabilistic Model An initial set of documents is retrieved somehow. The user inspects these docs looking for the relevant ones (in truth, only the top 10-20 need to be inspected). The IR system uses this information to refine the description of the ideal answer set. By repeating this process, it is expected that the description of the ideal answer set will improve. Keep in mind that the description of the ideal answer set must be guessed at the very beginning. The description of the ideal answer set is modeled in probabilistic terms.
Probabilistic Ranking Principle Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant) The model assumes that this probability of relevance depends on the query and the document representations only Ideal answer set is referred to as R and should maximize the probability of relevance Documents in the set R are predicted to be relevant But, how to compute probabilities? what is the sample space?
The Ranking Probabilistic ranking computed as: Definition: sim(q, dj) = P(dj relevant-to q) / P(dj non-relevant-to q). How to read this? “Maximize the number of relevant documents, minimize the number of irrelevant documents.” This is the odds of the document dj being relevant. Taking the odds minimizes the probability of an erroneous judgement. Definition: Weights wij ∈ {0,1}. P(R | vec(dj)): probability that the given document is relevant. P(¬R | vec(dj)): probability that the document is not relevant. Bayes' rule: P(A|B) P(B) = P(B|A) P(A)
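By Bayes' rule, P(R | dj) / P(¬R | dj) = [P(dj | R) P(R)] / [P(dj | ¬R) P(¬R)], and P(R) / P(¬R) is the same for every document, so ranking by P(dj | R) / P(dj | ¬R) gives the same order. The following is a minimal sketch of that ranking, assuming binary term weights and independent index terms. The per-term probabilities, the helper name relevance_odds, the toy documents, and the restriction of the product to query terms are all illustrative assumptions, not values or definitions from the slides.

```python
from math import prod

def relevance_odds(doc_terms, query_terms, p_rel, p_nonrel):
    """Score proportional to P(dj | R) / P(dj | not R), assuming independent binary terms."""
    factors = []
    for term in query_terms:
        p = p_rel[term]      # assumed P(term present | relevant)
        q = p_nonrel[term]   # assumed P(term present | non-relevant)
        if term in doc_terms:
            factors.append(p / q)
        else:
            factors.append((1 - p) / (1 - q))
    return prod(factors)

# Hypothetical estimates; in practice they start as a guess and are refined by feedback.
p_rel = {"human": 0.8, "computer": 0.6}
p_nonrel = {"human": 0.2, "computer": 0.3}

# Toy documents represented as sets of index terms (binary weights).
docs = {
    "c1": {"human", "interface", "computer"},
    "c2": {"computer", "user", "system", "response", "time", "survey"},
    "m1": {"trees"},
}
query = ["human", "computer"]

scores = {name: relevance_odds(terms, query, p_rel, p_nonrel)
          for name, terms in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['c1', 'c2', 'm1']
```

A document containing both query terms gets the highest odds, and a document containing neither gets the lowest, which is the behavior the odds definition is meant to produce.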
Probabilistic Model PROS: Ranking based on probability of being relevant CONS: Need to guess initial relevant set Binary weights instead of term-weighting Independence assumption of index terms
Latent Semantic Analysis (LSA) Objective Replace indexes that use sets of index terms by indexes that use concepts. Approach Map the term vector space into a lower dimensional space, using singular value decomposition. Each dimension in the new space corresponds to a latent concept in the original data.
LSA Relationship between concepts and words is many-to-many Solve problems of synonymy and ambiguity by representing documents as vectors of ideas or concepts, not terms For retrieval, analyze queries the same way, and compute cosine similarity of vectors of ideas
Deficiencies with Conventional Automatic Indexing Synonymy: Various words and phrases refer to the same concept (lowers recall). Polysemy: Individual words have more than one meaning (lowers precision) Independence: No significance is given to two terms that frequently appear together Latent semantic indexing addresses the first of these (synonymy) and the third (dependence)
LSA Find the latent semantic space that underlies the documents Find the basic (coarse-grained) ideas, regardless of the words used to say them A kind of co-occurrence analysis; co-occurring words as “bridges” between non–co-occurring words Latent semantic space has many fewer dimensions than term space has Space depends on documents from which it is derived Components have no names; can’t be interpreted
Technical Memo Example: Titles c1 Human machine interface for Lab ABC computer applications c2 A survey of user opinion of computer system response time c3 The EPS user interface management system c4 System and human system engineering testing of EPS c5 Relation of user-perceived response time to error measurement m1 The generation of random, binary, unordered trees m2 The intersection graph of paths in trees m3 Graph minors IV: Widths of trees and well-quasi-ordering m4 Graph minors: A survey
Technical Memo Example: Terms and Documents
Terms       c1  c2  c3  c4  c5  m1  m2  m3  m4
human        1   0   0   1   0   0   0   0   0
interface    1   0   1   0   0   0   0   0   0
computer     1   1   0   0   0   0   0   0   0
user         0   1   1   0   1   0   0   0   0
system       0   1   1   2   0   0   0   0   0
response     0   1   0   0   1   0   0   0   0
time         0   1   0   0   1   0   0   0   0
EPS          0   0   1   1   0   0   0   0   0
survey       0   1   0   0   0   0   0   0   1
trees        0   0   0   0   0   1   1   1   0
graph        0   0   0   0   0   0   1   1   1
minors       0   0   0   0   0   0   0   1   1
Technical Memo Example: Query Find documents relevant to "human computer interaction" Simple Term Matching: Matches c1, c2, and c4 Misses c3 and c5
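Simple term matching on this example can be sketched as follows. The matrix is the term-document matrix of the technical memo example; the variable names and the use of NumPy are choices made for this illustration.

```python
import numpy as np

terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
docs = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]

# Term-document matrix X: one row per term, one column per document.
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
])

query = "human computer interaction".split()
# Binary query vector; "interaction" is not an index term, so it is ignored.
xq = np.array([1 if t in query else 0 for t in terms])

overlap = xq @ X  # query-term occurrences in each document
matched = [d for d, n in zip(docs, overlap) if n > 0]
print(matched)  # ['c1', 'c2', 'c4']
```

Documents c3 and c5 share no term with the query, so term matching cannot retrieve them even though they are about the same topic.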
Mathematical concepts Define X as the term-document matrix, with t rows (number of index terms) and d columns (number of documents). Singular Value Decomposition For any matrix X, with t rows and d columns, there exist matrices T0, S0 and D0', such that: X = T0S0D0' T0 and D0 are the matrices of left and right singular vectors S0 is the diagonal matrix of singular values
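A quick numerical check of the decomposition, as a sketch using NumPy's numpy.linalg.svd. The small matrix is a hypothetical example (t = 4 terms, d = 3 documents), not data from the slides. With full_matrices=False, NumPy returns min(t, d) singular values, some of which are zero when the rank m is smaller.

```python
import numpy as np

# Hypothetical term-document matrix with t = 4 rows and d = 3 columns.
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# T0: left singular vectors, s0: singular values, D0t: D0' (right singular vectors as rows).
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
S0 = np.diag(s0)

reconstructed = T0 @ S0 @ D0t
print(np.allclose(reconstructed, X))            # X = T0 S0 D0'
print(np.allclose(T0.T @ T0, np.eye(len(s0))))  # columns of T0 are orthonormal
print(np.all(np.diff(s0) <= 0))                 # singular values are non-increasing
```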
More Linear Algebra A non-negative real number σ is a singular value for X if and only if there exist unit-length vectors u in K^t and v in K^d such that Xv = σu and X'u = σv. The vectors u are the left singular vectors, while the vectors v are the right singular vectors. K is a field such as the field of real numbers.
Eigenvectors vs. Singular vectors Eigenvector: Mv = λv, where v is an eigenvector and λ is a scalar (real number) called the eigenvalue. MV = VD, where D is the diagonal matrix of eigenvalues and V is the matrix whose columns are the eigenvectors. M = VDV^-1 if V is invertible (which is the case if the eigenvectors are linearly independent, e.g. when all eigenvalues are distinct)
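Eigenvectors vs. Singular vectors M = VDV' (V' = V^-1 if M is symmetric and the eigenvectors are normalized). XX' = (TSD')(TSD')' = TSD'DS'T' = TSS'T'. This has the form VDV' with V = T and eigenvalue matrix SS' (here the D in VDV' is the eigenvalue matrix, not the matrix D of right singular vectors). So the left singular vectors of X are eigenvectors of XX', and the squared singular values are its eigenvalues.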
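The relation between the eigen-decomposition of XX' and the SVD of X can be checked numerically. This is a sketch on a hypothetical 4 x 3 matrix; numpy.linalg.eigvalsh is used because XX' is symmetric.

```python
import numpy as np

# Hypothetical term-document matrix (not from the slides).
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

T, s, Dt = np.linalg.svd(X, full_matrices=False)
M = X @ X.T  # symmetric t x t matrix

# eigvalsh returns eigenvalues in ascending order; reverse and keep the largest len(s).
eigvals = np.linalg.eigvalsh(M)[::-1][:len(s)]
print(np.allclose(eigvals, s ** 2))  # eigenvalues of XX' are the squared singular values

# Column j of T is an eigenvector of XX' with eigenvalue s[j]**2.
print(np.allclose(M @ T, T * s ** 2))
```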
Linear Algebra X = T0S0D0'. T0 and D0 are column orthonormal: their columns are orthogonal unit vectors that can form a basis for a space. When T0 and D0 are square, they are unitary, which means the columns of T0' and D0' are also orthonormal
More Linear Algebra Unitary matrices have the following properties: UU' = U'U = In. If U has all entries real, it is orthogonal. Orthogonal matrices preserve the inner product of two real vectors: <Ux, Uy> = <x, y>. U is an isometry, i.e. it preserves distances
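These properties are easy to verify numerically. The sketch below builds an orthogonal matrix as the Q factor of a QR decomposition of a random matrix; the random vectors, the seed, and the dimension are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# The Q factor of a QR decomposition of a square matrix is orthogonal.
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
x = rng.standard_normal(4)
y = rng.standard_normal(4)

identity_ok = np.allclose(U @ U.T, np.eye(4)) and np.allclose(U.T @ U, np.eye(4))
inner_ok = np.isclose((U @ x) @ (U @ y), x @ y)  # <Ux, Uy> = <x, y>
dist_ok = np.isclose(np.linalg.norm(U @ x - U @ y), np.linalg.norm(x - y))
print(identity_ok, inner_ok, dist_ok)
```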
LSA Properties The projection into the latent concept space preserves topological properties of the original space Close vectors will stay close The reduced latent concept space is the best approximation of the original space in terms of distance-preservation compared to other choices of same dimensionality Both terms and documents are mapped into a new space where they both could be compared
Dimensions of matrices X (t x d) = T0 (t x m) S0 (m x m) D0' (m x d), where m is the rank of X, m ≤ min(t, d)
Reduced Rank S0 can be chosen so that the diagonal elements are positive and decreasing in magnitude. Keep the first k and set the others to zero. Delete the zero rows and columns of S0 and the corresponding columns of T0 and D0. This gives: X ≈ X̂ = TSD'. Interpretation: if the value of k is selected well, the expectation is that X̂ retains the semantic information from X, but eliminates noise from synonymy and recognizes dependence.
Dimensionality Reduction X̂ (t x d) = T (t x k) S (k x k) D' (k x d), where k is the number of latent concepts (typically 300 to 500). X ≈ X̂ = TSD'
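The truncation step can be sketched as follows. The random count matrix, the seed, and k = 2 are assumptions chosen to keep the example small; real collections use far larger t, d, and k. The last check uses the standard fact that the Frobenius-norm error of the rank-k truncation equals the square root of the sum of the discarded squared singular values.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 6-term x 5-document count matrix.
X = rng.integers(0, 3, size=(6, 5)).astype(float)

T0, s0, D0t = np.linalg.svd(X, full_matrices=False)

k = 2  # number of latent concepts kept (assumed)
T = T0[:, :k]        # t x k
S = np.diag(s0[:k])  # k x k
Dt = D0t[:k, :]      # k x d, i.e. D'

X_hat = T @ S @ Dt   # rank-k approximation of X, still t x d

err = np.linalg.norm(X - X_hat)             # Frobenius norm of the residual
expected = np.sqrt(np.sum(s0[k:] ** 2))     # energy in the discarded singular values
print(X_hat.shape, np.isclose(err, expected))
```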
Animation of SVD [Animation of the singular value decomposition, from Wikipedia.] In the animation, M is an m×m square matrix with positive determinant whose entries are real numbers.
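Projected Terms X̂X̂' = (TSD')(TSD')' = TSD'DS'T' = TSS'T' = (TS)(TS)', where X̂ = TSD' is the rank-k approximation of X. The inner product between two terms in the reduced space is therefore the inner product between the corresponding rows of TS.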
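A small numerical check of this identity, as a sketch on a hypothetical 4 x 3 matrix with k = 2. The name term_coords for the rows of TS is introduced here for illustration.

```python
import numpy as np

# Hypothetical term-document matrix (not from the slides).
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
k = 2
T, S, Dt = T0[:, :k], np.diag(s0[:k]), D0t[:k, :]
X_hat = T @ S @ Dt

# Each row of TS gives the k coordinates of one term in the latent space.
term_coords = T @ S

# Term-term inner products computed in the t x d space and in the k-dimensional space agree.
same = np.allclose(X_hat @ X_hat.T, term_coords @ term_coords.T)
print(term_coords.shape, same)
```

Comparing terms through the rows of TS needs only k coordinates per term instead of d.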
Mathematical Revision A is a p x q matrix and B is an r x q matrix. ai is the vector represented by row i of A; bj is the vector represented by row j of B. The inner product ai.bj is element (i, j) of the p x r matrix AB'.
Comparing a Query and a Document A query can be expressed as a vector xq in the term-document vector space. xqi = 1 if term i is in the query and 0 otherwise. (Ignore query terms that are not in the term vector space.) Let pqj be the inner product of the query xq with document dj, where dj is column j of the reduced matrix X̂. pqj is the jth element of the product xq'X̂
Comparing a Query and a Document [pq1 ... pqj ... pqd] = [xq1 xq2 ... xqt] X̂, where document dj is column j of X̂ and pqj is the inner product of query q with document dj. pq' = xq'X̂ = xq'TSD' = xq'T(DS)'. similarity(q, dj) = cosine of the angle between the vectors = inner product divided by the lengths of the vectors = pqj / (|xq| |dj|)
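The query comparison can be sketched on the technical memo example. The choice k = 2 and the variable names are assumptions made for this illustration; the matrix is the term-document matrix of the example, and the query is "human computer interaction".

```python
import numpy as np

terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
docs = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
], dtype=float)

# Rank-k approximation X_hat = T S D' (k = 2 is an assumed choice).
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
k = 2
X_hat = T0[:, :k] @ np.diag(s0[:k]) @ D0t[:k, :]

# Binary query vector; "interaction" is not an index term and is ignored.
query = "human computer interaction".split()
xq = np.array([1.0 if t in query else 0.0 for t in terms])

pq = xq @ X_hat  # inner product of the query with each column dj of X_hat
sims = pq / (np.linalg.norm(xq) * np.linalg.norm(X_hat, axis=0))  # cosine similarity

for name, sim in zip(docs, sims):
    print(name, round(float(sim), 2))
```

Unlike simple term matching, the documents c3 and c5 receive clearly positive similarity even though they contain none of the query terms, while the graph-theory documents m1 to m4 stay near zero.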
LSA Summary Strong formal framework Completely automatic; no stemming required; allows misspellings “Conceptual IR” recall improvement: one can retrieve relevant documents that do not contain any search terms Often improving precision is more important: need query and word sense disambiguation Computation is expensive
Summary Probabilistic Model LSA
Next Project Presentations Review Exam Winter Break!