Latent Semantic Indexing Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata
Vector space model
Each term represents a dimension; documents are vectors in the term space
Term-document matrix: a very sparse matrix; entries are scores of the terms in the documents (Boolean, count, or weight)
The query is also a vector in the term space
Vector similarity: cosine of the angle between the vectors
What is the problem? car ~ automobile, but in the vector space model each term is a different dimension
[Figure: term-document matrix over terms car, automobile, engine, search; documents d1–d5 and query q]
Synonyms in different dimensions
Car and automobile are synonyms, but different dimensions
Same situation for terms belonging to similar concepts
Goal: can we map synonyms (similar concepts) to the same dimensions automatically?
[Figure: car and automobile as separate axes, with query q and documents d1, d2]
Linear algebra review
Rank of a matrix: number of linearly independent columns (or rows)
If A is an m × n matrix, rank(A) ≤ min(m, n)
[Example: rank of a small term-document matrix over car, automobile, engine, search = ?]
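The rank bound can be checked with NumPy on a toy term-document matrix (the data below is illustrative, not the matrix from the slide):

```python
import numpy as np

# Toy term-document matrix over terms car, automobile, engine, search
# and documents d1..d5 (illustrative counts, not the slide's data).
# car and automobile are given identical rows, so the rows are
# linearly dependent and the rank drops below min(m, n).
A = np.array([
    [1, 0, 1, 0, 0],   # car
    [1, 0, 1, 0, 0],   # automobile (same pattern as car)
    [0, 1, 1, 0, 0],   # engine
    [0, 0, 0, 1, 1],   # search
], dtype=float)

print(np.linalg.matrix_rank(A))   # 3, although min(m, n) = 4
```

Identical rows for synonyms are exactly the situation LSI exploits: linear dependence means fewer "concepts" than terms.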
Linear algebra review
A square matrix M is called orthogonal if its rows and columns are orthonormal vectors
– Each column (row) has norm 1
– Any two distinct columns (rows) have dot product 0
For a square matrix A, if there is a nonzero vector v such that Av = λv for some scalar λ, then v is called an eigenvector of A, and λ is the corresponding eigenvalue
Singular value decomposition
If A is an m × n matrix with rank r, then there exists a factorization A = U Σ Vᵀ, where U (m × m) and V (n × n) are orthogonal, and Σ (m × n) is a diagonal-like matrix: Σ = (σ_ij), where σ_ii = σ_i for i = 1, …, r are the singular values of A, all other entries of Σ are zero, and σ_1 ≥ σ_2 ≥ … ≥ σ_r > 0
Columns of U are the left singular vectors of A; columns of V are the right singular vectors
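A quick numerical sketch of this factorization, using NumPy's SVD on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))          # a small m x n matrix, m = 6, n = 4

# full_matrices=True gives U (m x m) and Vt (n x n);
# s holds the singular values sigma_1 >= ... >= sigma_r
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild the diagonal-like Sigma (m x n) and verify A = U Sigma V^T
Sigma = np.zeros((6, 4))
Sigma[:4, :4] = np.diag(s)
assert np.allclose(A, U @ Sigma @ Vt)

# U is orthogonal; singular values are non-negative and decreasing
assert np.allclose(U @ U.T, np.eye(6))
assert np.all(s[:-1] >= s[1:]) and np.all(s >= 0)
```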
Singular value decomposition
[Figure: A (m × n) = U (m × m) · Σ (m × n) · Vᵀ (n × n); equivalently, in reduced form, A = U_r (m × r) · Σ_r (r × r) · V_rᵀ (r × n), with diagonal entries σ_1, …, σ_r]
Matrix diagonalization for a symmetric matrix
If A is an m × n matrix with rank r, consider C = A Aᵀ. Then C = U Σ Vᵀ V Σᵀ Uᵀ = U Σ² Uᵀ (where Σ² denotes the m × m diagonal matrix of squared singular values), so:
– C has rank r
– Σ² is a diagonal matrix with entries σ_i², for i = 1, …, r
– Columns of U are the eigenvectors of C, and σ_i² are the corresponding eigenvalues of C
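This relationship between the SVD of A and the eigendecomposition of C = AAᵀ can be verified numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((5, 3))                 # m = 5, n = 3, rank 3 (a.s.)
U, s, Vt = np.linalg.svd(A)

C = A @ A.T                            # 5 x 5 symmetric matrix of rank 3
eigvals, eigvecs = np.linalg.eigh(C)   # eigh returns ascending eigenvalues

# The nonzero eigenvalues of C are exactly the squared singular values of A
assert np.allclose(np.sort(eigvals)[::-1][:3], s**2)
assert np.linalg.matrix_rank(C) == 3
```

(Eigenvectors of C match the columns of U only up to sign, which is why the check above compares eigenvalues rather than vectors.)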
SVD of the term-document matrix
Documents are vectors in the m-dimensional term space, but we would expect far fewer concepts associated with the collection: m terms, k concepts, k << m
Idea: ignore all but the first k singular values and singular vectors
Low-rank approximation: A_k = U_k Σ_k V_kᵀ, where U_k is m × k, Σ_k is k × k, and V_kᵀ is k × n
Low-rank approximation
A_k has rank k
Now compute cosine similarity with the query q
Computationally, documents are still m-dimensional vectors
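A minimal sketch of computing the rank-k approximation with NumPy:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((8, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
# Keep only the first k singular values/vectors: A_k = U_k Sigma_k V_k^T
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

assert np.linalg.matrix_rank(A_k) == k
# By the Eckart-Young theorem, A_k is the best rank-k approximation in
# Frobenius norm; its error is sqrt(sigma_{k+1}^2 + ... + sigma_r^2)
err = np.linalg.norm(A - A_k, 'fro')
assert np.isclose(err, np.sqrt(np.sum(s[k:]**2)))
```

Note that A_k is still an m × n matrix: documents remain m-dimensional vectors, as the slide says; only the rank has dropped.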
Retrieval in the concept space
Retrieval in the term space (cosine): both q and d are m-dimensional vectors (m = number of terms)
Use the first k singular vectors to map from the term space (m) to the concept space (k):
– Query: q ↦ U_kᵀ q (k × m times m × 1 = k × 1)
– Document: d ↦ U_kᵀ d (k × m times m × 1 = k × 1)
Compute cosine similarity in the concept space
Other variants: map using (U_k Σ_k)ᵀ
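This mapping can be sketched end to end. The function name and the toy data below are illustrative; the point is that a document containing only "car" still scores high for the query "automobile" once both are mapped into the concept space:

```python
import numpy as np

def concept_space_scores(A, q, k):
    """Cosine similarity of query q with each column (document) of A,
    after mapping both into the k-dimensional concept space via U_k^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk = U[:, :k]                     # m x k, first k left singular vectors
    dq = Uk.T @ q                     # query in concept space (k,)
    D = Uk.T @ A                      # documents in concept space (k x n)
    eps = 1e-12                       # avoid division by zero
    return (D.T @ dq) / (np.linalg.norm(D, axis=0) * np.linalg.norm(dq) + eps)

# Toy data: car and automobile co-occur in d3, so they load on the
# same top concept; d1 contains only "car"
A = np.array([[1., 0., 1.],    # car
              [0., 1., 1.],    # automobile
              [0., 0., 0.]])   # search
q = np.array([0., 1., 0.])     # query: "automobile"
print(concept_space_scores(A, q, k=1))
```

With k = 1, the car-only document d1 gets a near-perfect score for the "automobile" query, even though it shares no term with it; in the term space its cosine similarity would be 0.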
How to find the optimal low rank?
Primarily intuitive
– Assumes that a document collection has exactly k concepts
– No systematic method to find the optimal k
– Experimental results are not very consistent
HOW DOES LSI WORK?
Bast & Majumdar, SIGIR 2005
Spectral retrieval – general framework
Term-document matrix A (m × n): m terms, n documents; query q (m × 1)
Dimension reduction to the concept space via a matrix L (k × m):
– L·A (k × n): k concepts, n documents; L·q (k × 1)
– Cosine similarities are computed in the concept space instead of the term space
For LSI: from the SVD A = U Σ Vᵀ (m × n = m × r · r × r · r × n), take U_k = first k columns of U and L = U_kᵀ (k × m)
LSI and other LSI-based retrieval methods are called "spectral retrieval"
Spectral retrieval as document "expansion"
[Figure: a 0–1 expansion matrix over terms car, auto, engine, search multiplies each document vector; e.g., it adds car to a document if auto is present]
Spectral retrieval as document "expansion"
Multiplying by the LSI expansion matrix U_k U_kᵀ (e.g., L = U_2 U_2ᵀ, projecting to 2 dimensions) adds car if auto is present
An ideal expansion matrix should have
– high scores for intuitively related terms
– low scores for intuitively unrelated terms
But the expansion matrix depends heavily on the subspace dimension!
[Figure: LSI expansion matrix over car, auto, engine, search for k = 2]
Why document "expansion"
With L = U_3 U_3ᵀ (projecting to 3 dimensions), the expansion matrix changes: it depends heavily on the subspace dimension!
An ideal expansion matrix should have
– high scores for intuitively related terms
– low scores for intuitively unrelated terms
Finding the optimal number of dimensions k remained an open problem
[Figure: LSI expansion matrix over car, auto, engine, search for k = 3]
Relatedness curves
How do the entries of the expansion matrix depend on the dimension k of the subspace?
Plot the (i, j)-th entry of the expansion matrix T = Lᵀ L = U_k U_kᵀ against the dimension k
This is the cumulative dot product of the i-th and j-th rows of U, taken over the first k singular vectors
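The cumulative dot product described above can be computed directly (toy data; the function name is illustrative):

```python
import numpy as np

def relatedness_curve(A, i, j):
    """Entry (i, j) of the expansion matrix U_k U_k^T as a function of
    the subspace dimension k: the cumulative dot product of rows i and
    j of the full left singular vector matrix U. Since U is orthogonal,
    the curve of any pair i != j ends at 0 when k reaches m."""
    U, s, Vt = np.linalg.svd(A, full_matrices=True)
    return np.cumsum(U[i, :] * U[j, :])

# Toy matrix: terms 0 and 1 have identical co-occurrence patterns
# (perfectly related); term 2 never co-occurs with them (unrelated)
A = np.array([[1., 0., 1., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.]])

print(relatedness_curve(A, 0, 1))   # rises, then falls back to 0
print(relatedness_curve(A, 0, 2))   # stays at 0
```

This reproduces the slide's observation in miniature: the related pair's curve goes up and then comes down, while the unrelated pair's curve hugs zero.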
Types of relatedness curves
Three main types
[Figure: expansion-matrix entry vs. subspace dimension for three term pairs: node/vertex, logic/logics, logic/vertex]
No single dimension is appropriate for all term pairs, but the shape of the curve indicates the term-term relationship!
Curves for related terms
We call two terms perfectly related if they have an identical co-occurrence pattern
Proven shape for perfectly related terms: up, then down; provably small change after a slight perturbation, and the shape persists under more perturbation
The point of fall-off is different for every term pair, and we can calculate it
[Figure: relatedness curves under increasing perturbation; block-structured term-document matrix used in the proof]
Curves for unrelated terms
Co-occurrence graph:
– terms are vertices
– edge between two terms if they co-occur
We call two terms perfectly unrelated if no path connects them in the graph
Proven shape for perfectly unrelated terms: the curves randomly oscillate around zero; provably small change after a slight perturbation, and the behavior persists under more perturbation
[Figure: relatedness curves under increasing perturbation]
TN: the non-negativity test
1. Normalize the term-document matrix so that the theoretical point of fall-off is the same for all term pairs
2. Discard the parts of the curves after this point
3. For each term pair: if the curve is never negative before this point, set the entry in the expansion matrix to 1, otherwise to 0
A simple 0–1 classification that produces a sparse expansion matrix: related terms get entry 1, unrelated terms get entry 0
[Figure: curves for related and unrelated term pairs]
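A simplified sketch of the test on a toy matrix. The paper's normalization step (which aligns the fall-off point across pairs) is omitted here and replaced by a fixed cut-off k_max, so this is an illustration of the idea rather than the paper's exact procedure:

```python
import numpy as np

# Toy matrix: car occurs in both documents, engine only in d1,
# search only in d2 (illustrative data, not from the slides)
A = np.array([[1., 1.],    # car
              [1., 0.],    # engine
              [0., 1.]])   # search
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def tn_entry(U, i, j, k_max):
    """Simplified non-negativity test (TN): the expansion-matrix entry
    is 1 iff the relatedness curve of terms i and j never goes negative
    up to dimension k_max. (The paper first normalizes A so the
    theoretical fall-off point is the same for all pairs; here k_max is
    just a fixed cut-off.)"""
    curve = np.cumsum(U[i, :k_max] * U[j, :k_max])
    return 1 if np.all(curve >= 0) else 0

print(tn_entry(U, 0, 1, k_max=2))   # car / engine: 1 (related)
print(tn_entry(U, 1, 2, k_max=2))   # engine / search: 0 (curve dips negative)
```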
TS: the smoothness test
1. Again, discard the parts of the curves after the theoretical point of fall-off (same for every term pair, after normalization)
2. For each term pair, compute the smoothness of its curve (= 1 if very smooth, approaching 0 as the number of turns increases)
3. If the smoothness is above some threshold, set the entry in the expansion matrix to 1, otherwise to 0
Again a 0–1 classification that produces a sparse expansion matrix
[Figure: curves for related and unrelated term pairs]
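A sketch of the idea with an assumed smoothness measure: |sum of increments| / sum of |increments|, which is 1 for a monotone curve and shrinks as the curve oscillates. This stand-in measure, the threshold value, and the function names are illustrative, not necessarily the paper's exact definitions:

```python
import numpy as np

def smoothness(increments):
    """Assumed smoothness measure: |sum| / sum of absolute values.
    Equals 1 for a monotone curve, approaches 0 with more oscillation."""
    denom = np.sum(np.abs(increments))
    return abs(np.sum(increments)) / denom if denom > 0 else 1.0

def ts_entry(U, i, j, k_max, threshold=0.6):
    """Simplified smoothness test (TS): entry is 1 iff the relatedness
    curve of terms i and j is smooth enough up to dimension k_max."""
    inc = U[i, :k_max] * U[j, :k_max]   # per-dimension curve increments
    return 1 if smoothness(inc) >= threshold else 0

# Same toy data as for TN: car in both docs, engine in d1, search in d2
A = np.array([[1., 1.],    # car
              [1., 0.],    # engine
              [0., 1.]])   # search
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(ts_entry(U, 0, 1, k_max=2))   # car / engine: 1 (smooth curve)
print(ts_entry(U, 1, 2, k_max=2))   # engine / search: 0 (oscillating)
```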
Experimental results (average precision)

           COS     LSI*    LSI-RN*  CORR*   IRR*    TN      TS
Time       63.2%   62.8%   58.6%    59.1%   62.2%   64.9%   64.1%
Reuters    36.2%   32.0%   37.0%    32.3%   ——      41.9%   42.9%
Ohsumed    13.2%    6.9%   13.0%    10.9%   ——      14.4%   15.3%

Time: 425 docs, 3882 terms; Reuters: 5701 terms
* the numbers for LSI, LSI-RN, CORR and IRR are for the best subspace dimension!
COS: baseline, cosine similarity in the term space
LSI: Latent Semantic Indexing (Dumais et al.)
LSI-RN: term-normalized LSI (Ding et al.)
CORR: correlation-based LSI (Dupret et al.)
IRR: Iterative Residual Rescaling (Ando & Lee 2001)
TN: non-negativity test; TS: smoothness test
Asymmetric term-term relations (Bast, Dupret, Majumdar & Piwowarski, 2006)
Related terms: fruit – apple
Up to some dimension k′, the curve for fruit–apple lies above the curve for apple–apple
So up to dimension k′, apple is more related to fruit than to apple itself
Asymmetric relation: fruit is more general than apple
[Figure: relatedness curves for fruit–apple, fruit–fruit, and apple–apple]
Examples (more general – less general)
Fruit – Apple          Car – Opel
Space – Solar          Restaurant – Dish
India – Gandhi         Fashion – Trousers
Restaurant – Waiter    Metal – Zinc
Sweden – Stockholm     India – Delhi
Church – Priest        Opera – Vocal
Metal – Aluminum       Fashion – Silk
Saudi – Sultan         Fish – Shark
Sources and acknowledgements
IR Book by Manning, Raghavan and Schütze
Bast and Majumdar: Why spectral retrieval works. SIGIR 2005
– Some slides are adapted from the talk by Hannah Bast