Latent Semantic Indexing
SI650: Information Retrieval, Winter 2010
School of Information, University of Michigan
…
–Latent semantic indexing
–Singular value decomposition
…
Problems with lexical semantics
Polysemy
–bar, bank, jaguar, hot
–tends to reduce precision
Synonymy
–building/edifice, large/big, spicy/hot
–tends to reduce recall
Relatedness
–doctor/patient/nurse/treatment
Sparse term-document matrix
Need: dimensionality reduction
Problem in Retrieval
Query = "information retrieval"
Document 1 = "inverted index precision recall"
Document 2 = "welcome to ann arbor"
Which one should we rank higher?
Query vocabulary and document vocabulary mismatch!
Smoothing won't help here…
If only we could represent documents and queries by topics!
Latent Semantic Indexing
Motivation
–Query vocabulary and document vocabulary mismatch
–Need to match/index based on concepts (or topics)
Main idea:
–Projects queries and documents into a space with "latent" semantic dimensions
–Dimensionality reduction: the latent semantic space has fewer dimensions (semantic concepts)
–Exploits co-occurrence: co-occurring terms are projected onto the same dimensions
Example of "Semantic Concepts" (Slide from C. Faloutsos's talk)
Concept Space = Dimension Reduction
The number of concepts (K) is always smaller than the number of words (N) or the number of documents (M).
Suppose we represent a document as an N-dimensional vector, and the corpus as an M*N matrix…
The goal is to reduce the dimension from N to K. But how can we do that?
Techniques for dimensionality reduction
Based on matrix decomposition (goal: preserve clusters, explain away variance)
A quick review of matrices
–Vectors
–Matrices
–Matrix multiplication
Eigenvectors and eigenvalues
An eigenvector is an implicit "direction" for a matrix: Av = λv, where v (the eigenvector) is non-zero, though λ (the eigenvalue) can be any complex number in principle.
Computing eigenvalues (det = determinant): solve det(A - λI) = 0; if A is square (N x N), this has r distinct solutions, where 1 <= r <= N.
For each λ found, you can find v by solving (A - λI)v = 0, or equivalently Av = λv.
Eigenvectors and eigenvalues
Example: A = [-1 3; 2 0]
det(A - λI) = (-1 - λ)*(-λ) - 3*2 = 0
Then: λ^2 + λ - 6 = 0; λ1 = 2; λ2 = -3
For λ1 = 2, the solutions satisfy x1 = x2
Eigenvectors and eigenvalues
Wait, that means there are many eigenvectors for the same eigenvalue…
v = (x1, x2)^T; x1 = x2 corresponds to many vectors, e.g., (1, 1)^T, (2, 2)^T, (650, 650)^T …
Not surprising … if v is an eigenvector of A, v' = cv is also an eigenvector (c is any non-zero constant)
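As a quick check, here is a minimal NumPy sketch of this example; it assumes the 2 x 2 matrix A = [-1 3; 2 0] reconstructed above.

import numpy as np

A = np.array([[-1.0, 3.0],
              [2.0, 0.0]])        # the (assumed) example matrix

vals, vecs = np.linalg.eig(A)     # eigenvalues, eigenvectors (one per column)
print(sorted(vals))               # approximately [-3.0, 2.0]

for lam, v in zip(vals, vecs.T):  # pair each eigenvalue with its eigenvector
    print(np.allclose(A @ v, lam * v))                   # True: A v = lambda v
    print(np.allclose(A @ (650 * v), lam * (650 * v)))   # scaled copies are eigenvectors too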
Matrix Decomposition
If A is a square (N x N) matrix and it has N linearly independent eigenvectors, it can be decomposed into A = UΛU^-1, where
U: matrix of eigenvectors (one eigenvector per column)
Λ: diagonal matrix of eigenvalues
AU = UΛ  =>  U^-1 A U = Λ  =>  A = UΛU^-1
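A short NumPy sketch of this factorization, again using the assumed example matrix:

import numpy as np

A = np.array([[-1.0, 3.0],
              [2.0, 0.0]])

vals, U = np.linalg.eig(A)      # columns of U are eigenvectors of A
Lam = np.diag(vals)             # Lambda: diagonal matrix of eigenvalues

print(np.allclose(A @ U, U @ Lam))                   # A U = U Lambda
print(np.allclose(A, U @ Lam @ np.linalg.inv(U)))    # A = U Lambda U^-1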
Example
A matrix S with eigenvalues 3, 2, and 0: x is an arbitrary vector, yet Sx depends on the eigenvalues and eigenvectors of S
What about an arbitrary matrix?
A: n x m matrix (n documents, m terms)
A = UΣV^T (as opposed to A = UΛU^-1)
U: n x n matrix; V: m x m matrix
Σ: n x m diagonal matrix (only values on the diagonal can be non-zero)
UU^T = I; VV^T = I
SVD: Singular Value Decomposition
A = UΣV^T
U is the matrix of orthogonal eigenvectors of AA^T
V is the matrix of orthogonal eigenvectors of A^T A
The diagonal entries of Σ are the singular values of A (the square roots of the eigenvalues of A^T A)
This decomposition exists for all matrices, dense or sparse
If A has 5 columns and 3 rows (so A is 3 x 5), then U will be 3x3 and V will be 5x5
In Matlab, use [U,S,V] = svd(A)
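For readers working in Python rather than Matlab, a rough NumPy equivalent is sketched below (note that np.linalg.svd returns V already transposed, and the singular values as a vector):

import numpy as np

A = np.random.rand(3, 5)                  # any matrix with 3 rows and 5 columns

U, s, Vt = np.linalg.svd(A)               # s: vector of singular values, largest first
S = np.zeros(A.shape)                     # build the 3 x 5 "diagonal" Sigma
S[:len(s), :len(s)] = np.diag(s)

print(U.shape, Vt.shape)                  # (3, 3) (5, 5): U is 3x3, V is 5x5
print(np.allclose(A, U @ S @ Vt))         # A = U Sigma V^T
print(np.allclose(s**2, np.linalg.eigvalsh(A.T @ A)[::-1][:len(s)]))  # s^2 = top eigenvalues of A^T A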
Term matrix normalization (example term-by-document matrix over documents D1–D5)
Example (Berry and Browne)
Terms: T1: baby, T2: child, T3: guide, T4: health, T5: home, T6: infant, T7: proofing, T8: safety, T9: toddler
Documents:
D1: infant & toddler first aid
D2: babies & children's room (for your home)
D3: child safety at home
D4: your baby's health and safety: from infant to toddler
D5: baby proofing basics
D6: your guide to easy rust proofing
D7: beanie babies collector's guide
Document-term matrix for the 9 terms and 7 documents above (a sketch of one way to build it follows)
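The slide's actual matrix is not reproduced here. The following NumPy sketch builds a simple binary term-by-document matrix read off from the titles above (the original example may use normalized counts instead) and takes its SVD:

import numpy as np

terms = ["baby", "child", "guide", "health", "home",
         "infant", "proofing", "safety", "toddler"]

# Rows = terms T1-T9, columns = documents D1-D7; 1 if the term (or a variant
# such as "babies") appears in the document title, else 0.
A = np.array([
    # D1 D2 D3 D4 D5 D6 D7
    [0, 1, 0, 1, 1, 0, 1],   # baby
    [0, 1, 1, 0, 0, 0, 0],   # child
    [0, 0, 0, 0, 0, 1, 1],   # guide
    [0, 0, 0, 1, 0, 0, 0],   # health
    [0, 1, 1, 0, 0, 0, 0],   # home
    [1, 0, 0, 1, 0, 0, 0],   # infant
    [0, 0, 0, 0, 1, 1, 0],   # proofing
    [0, 0, 1, 1, 0, 0, 0],   # safety
    [1, 0, 0, 1, 0, 0, 0],   # toddler
], dtype=float)

u, s, vt = np.linalg.svd(A, full_matrices=False)
print(np.round(s, 2))        # singular values, largest first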
Decomposition
u: matrix of left singular vectors; v: matrix of right singular vectors
s: diagonal matrix of singular values (the largest singular value captures the spread on the v1 axis)
What does this have to do with dimension reduction?
Low-rank matrix approximation
SVD: A[m x n] = U[m x m] Σ[m x n] V^T[n x n]
Remember that Σ is a diagonal matrix of singular values
If we only keep the largest r singular values…
A ≈ U[m x r] Σ[r x r] (V[n x r])^T
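A minimal NumPy sketch of this truncation (an illustrative helper, not from the slides):

import numpy as np

def low_rank_approx(A, r):
    """Rank-r approximation of A: keep only the r largest singular values."""
    u, s, vt = np.linalg.svd(A, full_matrices=False)
    return u[:, :r] @ np.diag(s[:r]) @ vt[:r, :]

A = np.random.rand(9, 7)
A4 = low_rank_approx(A, 4)
print(np.linalg.matrix_rank(A4))   # 4
print(np.linalg.norm(A - A4))      # Frobenius error of the best rank-4 approximation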
Rank-4 approximation
s4: Σ with only the 4 largest singular values kept (the rest set to zero)
u*s4*v': the rank-4 reconstruction of the original matrix
u*s4: word vector representation of the concepts/topics
s4*v': new (concept/topic) representation of the documents
Rank-2 approximation
s2: Σ with only the 2 largest singular values kept
u*s2*v': the rank-2 reconstruction of the original matrix
u*s2: word vector representation of the concepts/topics
s2*v': new (concept/topic) representation of the documents (a sketch of these two products follows)
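Continuing the earlier NumPy sketch of the Berry and Browne matrix, the two rank-2 products could be computed as below; the concrete numbers depend on the assumed binary matrix, so they will not match the slides exactly.

import numpy as np

# A: the 9 x 7 term-by-document matrix built in the earlier sketch
u, s, vt = np.linalg.svd(A, full_matrices=False)
k = 2
s2 = np.diag(s[:k])

term_rep_of_concepts = u[:, :k] @ s2     # u*s2: each column is a concept as a word vector
doc_rep_in_concepts = s2 @ vt[:k, :]     # s2*v': each column is a document in concept space

print(term_rep_of_concepts.shape)        # (9, 2)
print(doc_rep_in_concepts.shape)         # (2, 7)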
Latent Semantic Indexing
A[n x m] ≈ U[n x r] Σ[r x r] (V[m x r])^T
A: n x m matrix (n documents, m terms)
U: n x r matrix (n documents, r concepts)
Σ: r x r diagonal matrix (strength of each "concept") (r: rank of the matrix)
V: m x r matrix (m terms, r concepts)
Latent semantic indexing (LSI)
Dimensionality reduction = identification of hidden (latent) concepts
Query matching in latent space
LSI matches documents even if they don't have words in common
–if they share frequently co-occurring terms
Back to the CS-MED example (Slide from C. Faloutsos's talk)
Example of LSI (slide adapted from C. Faloutsos's talk)
Terms: data, inf., retrieval, brain, lung; the documents fall into a CS group and an MD group
A = UΣV^T finds a CS-concept and an MD-concept: V gives the term representation of each concept, Σ gives the strength of each concept (e.g., the strength of the CS-concept), and dimension reduction keeps only the strongest concepts
How to Map Query/Doc to the Same Concept Space?
q^T_concept = q^T V
d^T_concept = d^T V
Example over the terms (data, inf., retrieval, brain, lung): projecting the term vectors q^T and d^T through V gives their similarity with the CS-concept
(Slide adapted from C. Faloutsos's talk)
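A hedged NumPy sketch of this mapping; the small documents-by-terms matrix below is made up in the spirit of the CS-MD example, not taken from the slides.

import numpy as np

# Toy documents x terms matrix; columns: data, inf., retrieval, brain, lung
A = np.array([
    [1, 1, 1, 0, 0],    # CS document
    [2, 2, 2, 0, 0],    # CS document
    [1, 1, 1, 0, 0],    # CS document
    [0, 0, 0, 1, 1],    # MED document
    [0, 0, 0, 2, 2],    # MED document
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T                                       # terms x concepts

q = np.array([0, 1, 1, 0, 0], dtype=float)     # query "inf. retrieval" as a term vector
d = np.array([1, 1, 1, 0, 0], dtype=float)     # a CS-like document

q_concept = q @ V                              # q^T V
d_concept = d @ V                              # d^T V
print(np.round(q_concept[:2], 2))              # large magnitude on the CS concept, ~0 on the MED concept
print(np.round(d_concept[:2], 2))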
Useful pointers
Readings
–MRS18
–MRS17, MRS19
–MRS20
Problems with LSI
Concepts/topics are hard to interpret
New document/query vectors could have negative values
Lack of statistical interpretation
Probabilistic latent semantic indexing…
General Idea of Probabilistic Topic Models
Modeling a topic/subtopic/theme with a multinomial distribution (unigram LM)
Modeling text data with a mixture model involving multinomial distributions
–A document is "generated" by sampling words from some multinomial distribution
–Each time, a word may be generated from a different distribution
–Many variations of how these multinomial distributions are mixed
Topic mining = fitting the probabilistic model to text
Answer topic-related questions by computing various kinds of conditional probabilities based on the estimated model (e.g., p(time | topic), p(time | topic, location))
Document as a Sample of Mixed Topics
Applications of topic models:
–Summarize themes/aspects
–Facilitate navigation/browsing
–Retrieve documents
–Segment documents
–Many others
How can we discover these topic word distributions?
Figure: topics 1…k as word distributions (e.g., government 0.3, response …; donate 0.1, relief 0.05, help …; city 0.2, new 0.1, orleans …) plus a background distribution B (is 0.05, the 0.04, a …), illustrated on a passage about the government response to Hurricane Katrina and the flooding of New Orleans, where bracketed spans are drawn from different topics
Probabilistic Latent Semantic Analysis/Indexing (PLSA/PLSI) [Hofmann 99]
Mix k multinomial distributions to generate a document
Each document has a potentially different set of mixing weights, which captures the topic coverage
When generating words in a document, each word may be generated using a DIFFERENT multinomial distribution (in contrast with the document clustering model, where, once a multinomial distribution is chosen, all the words in a document are generated using the same model)
We may add a background distribution to "attract" background words
PLSI (a.k.a. Aspect Model)
Every document is a mixture of underlying (latent) K aspects (topics) with mixture weights p(z|d)
–How is this related to LSI?
Each aspect is represented by a distribution of words p(w|z)
Estimate p(z|d) and p(w|z) using the EM algorithm
PLSI as a Mixture Model
"Generating" word w in doc d in the collection: mix topics z1, …, zk (each a word distribution p(w|z), e.g., warning 0.3, system …; aid 0.1, donation 0.05, support …; statistics 0.2, loss 0.1, dead …) with a background distribution B (is 0.05, the 0.04, a …)
p(w|d) = λB p(w|B) + (1 - λB) Σ_j p(zj|d) p(w|zj)
Parameters: λB = noise level (manually set); p(z|d) and p(w|z) are estimated with Maximum Likelihood
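A toy Python sketch of this generative story; the distributions and λB below are made-up illustrations, not values from the slide.

import numpy as np

rng = np.random.default_rng(0)

vocab = ["warning", "system", "aid", "donation", "statistics", "loss", "is", "the", "a"]

# Made-up topic word distributions p(w|z) and background distribution p(w|B)
p_w_z = np.array([
    [0.50, 0.30, 0.05, 0.05, 0.02, 0.02, 0.02, 0.02, 0.02],   # topic z1
    [0.02, 0.02, 0.45, 0.35, 0.04, 0.04, 0.03, 0.03, 0.02],   # topic z2
])
p_w_B = np.array([0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.30, 0.40, 0.24])

p_z_d = np.array([0.7, 0.3])    # this document's topic coverage p(z|d)
lambda_B = 0.5                  # background noise level (manually set)

def sample_word():
    """Generate one word of document d from the mixture."""
    if rng.random() < lambda_B:                 # background component
        return rng.choice(vocab, p=p_w_B)
    z = rng.choice(len(p_z_d), p=p_z_d)         # pick a topic for THIS word
    return rng.choice(vocab, p=p_w_z[z])

print([sample_word() for _ in range(10)])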
Parameter Estimation using the EM Algorithm
We have the log-likelihood function of the PLSI model, which we want to maximize:
log L = Σ_d Σ_w c(w,d) log Σ_z p(z|d) p(w|z)   (a background term λB p(w|B) can be added as on the previous slide)
Maximize the likelihood using Expectation Maximization
EM Steps
E-Step
–Expectation step, where the expectation of the likelihood function is calculated using the current parameter values
M-Step
–Update the parameters with the calculated posterior probabilities
–Find the parameters that maximize the likelihood function
E Step
p(z|d,w) = p(z|d) p(w|z) / Σ_z' p(z'|d) p(w|z')
This is the probability that a word w occurring in document d is explained by topic z
M Step
p(w|z) = Σ_d c(w,d) p(z|d,w) / Σ_w' Σ_d c(w',d) p(z|d,w')
p(z|d) = Σ_w c(w,d) p(z|d,w) / Σ_z' Σ_w c(w,d) p(z'|d,w)
All these equations use p(z|d,w) calculated in the E Step
EM converges to a local maximum of the likelihood function
We will see more when we talk about topic modeling
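A compact NumPy sketch of these two EM updates (no background component; the toy counts matrix is purely illustrative):

import numpy as np

def plsa(counts, k, n_iter=100, seed=0):
    """counts: n_docs x n_words term-count matrix; k: number of topics.
    Returns p(z|d) (n_docs x k) and p(w|z) (k x n_words)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape

    p_z_d = rng.random((n_docs, k));  p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((k, n_words)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: p(z|d,w) proportional to p(z|d) p(w|z), for every (d, w) pair
        post = p_z_d[:, :, None] * p_w_z[None, :, :]          # n_docs x k x n_words
        post /= post.sum(axis=1, keepdims=True) + 1e-12

        # M-step: re-estimate parameters from expected counts c(w,d) p(z|d,w)
        weighted = counts[:, None, :] * post                  # n_docs x k x n_words
        p_w_z = weighted.sum(axis=0)                           # k x n_words
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=2)                           # n_docs x k
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12

    return p_z_d, p_w_z

# Tiny toy corpus: 4 docs over 6 words, two obvious topics
counts = np.array([
    [3, 2, 2, 0, 0, 0],
    [2, 3, 1, 0, 0, 1],
    [0, 0, 0, 3, 2, 2],
    [0, 1, 0, 2, 3, 2],
], dtype=float)

p_z_d, p_w_z = plsa(counts, k=2)
print(np.round(p_z_d, 2))    # each document concentrates on one of the two topics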
Example of PLSI
Topics represented as word distributions
Topics are interpretable!
(Example: topics found from blog articles about "Hurricane Katrina")