Latent Semantic Indexing (mapping onto a smaller space of latent concepts) Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 18
Speeding up cosine computation
What if we could take our vectors and “pack” them into fewer dimensions (say 50,000 → 100) while preserving distances?
Now: O(nm) to compute cos(d,q) for all n docs.
Then: O(km + kn), where k << n,m.
Two methods: “Latent Semantic Indexing” and Random Projection.
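A minimal sketch (not from the slides) of the baseline cost being improved: with an m × n term-by-document matrix, scoring a query against all n documents is one m-dimensional dot product per document, i.e. O(nm) work. The names `A`, `q` and `cosine_scores` are illustrative.

```python
import numpy as np

def cosine_scores(A, q):
    """Cosine of q against every column (document) of the m x n matrix A: O(nm) work."""
    doc_norms = np.linalg.norm(A, axis=0)              # one norm per document
    q_norm = np.linalg.norm(q)
    return (A.T @ q) / (doc_norms * q_norm + 1e-12)    # small epsilon avoids division by zero

# toy example: m = 5 terms, n = 4 documents
A = np.random.rand(5, 4)
q = np.random.rand(5)
print(cosine_scores(A, q))
```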
Briefly
LSI is data-dependent:
creates a k-dim subspace by eliminating redundant axes;
pulls together “related” axes – hopefully car and automobile.
Random projection is data-independent:
chooses a k-dim subspace that guarantees good stretching properties, with high probability, between any pair of points.
What about polysemy?
Notions from linear algebra
Matrix A, vector v
Matrix transpose (A^t)
Matrix product
Rank
Eigenvalue λ and eigenvector v: Av = λv
Overview of LSI
Pre-process docs using a technique from linear algebra called Singular Value Decomposition.
Create a new (smaller) vector space.
Queries are handled (faster) in this new space.
Singular-Value Decomposition
Recall the m × n matrix of terms × docs, A. A has rank r ≤ m,n.
Define the term-term correlation matrix T = AA^t.
T is a square, symmetric m × m matrix.
Let P be the m × r matrix of eigenvectors of T.
Define the doc-doc correlation matrix D = A^t A.
D is a square, symmetric n × n matrix.
Let R be the n × r matrix of eigenvectors of D.
A’s decomposition
Given P (for T, m × r) and R (for D, n × r), formed by orthonormal columns (unit dot-product), it turns out that
A = P Σ R^t
where Σ is an r × r diagonal matrix holding the singular values of A (the square roots of the eigenvalues of T = AA^t) in decreasing order.
[Diagram: A (m × n) = P (m × r) · Σ (r × r) · R^t (r × n)]
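A small sketch (not part of the slides) of computing this decomposition with NumPy: `np.linalg.svd` returns exactly the P, Σ, R^t factors described above; the variable names and checks are illustrative.

```python
import numpy as np

A = np.random.rand(6, 4)                               # toy m x n term-document matrix
P, sigma, Rt = np.linalg.svd(A, full_matrices=False)   # A = P @ diag(sigma) @ Rt

# sigma holds the singular values in decreasing order;
# their squares are the eigenvalues of T = A A^t (and of D = A^t A)
T_eigvals = np.linalg.eigvalsh(A @ A.T)[::-1]          # eigenvalues, largest first
print(np.allclose(sigma**2, T_eigvals[:len(sigma)]))   # True

# reconstruction check: P Σ R^t gives back A
print(np.allclose(P @ np.diag(sigma) @ Rt, A))         # True
```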
Dimensionality reduction
For some k << r, zero out all but the k biggest singular values in Σ [the choice of k is crucial].
Denote by Σ_k this new version of Σ, having rank k.
Typically k is about 100, while r (A’s rank) is > 10,000.
A_k = P Σ_k R^t
[Diagram: A_k (m × n) = P (m × r) · Σ_k (r × r, only the top-left k × k block nonzero) · R^t (r × n); equivalently A_k = P_k Σ_k R_k^t with factors of size m × k, k × k, k × n. A document column contributes nothing where it meets a 0-column/0-row of Σ_k.]
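A follow-up sketch (again illustrative, not from the slides) of the rank-k truncation: keep only the first k singular triplets.

```python
import numpy as np

def truncate_svd(A, k):
    """Return the rank-k factors P_k (m x k), sigma_k (k,), Rt_k (k x n) and A_k itself."""
    P, sigma, Rt = np.linalg.svd(A, full_matrices=False)
    P_k, sigma_k, Rt_k = P[:, :k], sigma[:k], Rt[:k, :]
    A_k = P_k @ np.diag(sigma_k) @ Rt_k        # best rank-k approximation of A
    return P_k, sigma_k, Rt_k, A_k

A = np.random.rand(6, 4)
P_k, sigma_k, Rt_k, A_k = truncate_svd(A, k=2)
print(np.linalg.matrix_rank(A_k))              # 2
```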
Guarantee
A_k is a pretty good approximation to A: relative distances are (approximately) preserved.
Of all m × n matrices of rank k, A_k is the best approximation to A w.r.t. the following measures:
min_{B, rank(B)=k} ||A - B||_2 = ||A - A_k||_2 = σ_{k+1}
min_{B, rank(B)=k} ||A - B||_F^2 = ||A - A_k||_F^2 = σ_{k+1}^2 + σ_{k+2}^2 + ... + σ_r^2
where the Frobenius norm is ||A||_F^2 = σ_1^2 + σ_2^2 + ... + σ_r^2.
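A quick numerical check of these identities (illustrative, not part of the slides):

```python
import numpy as np

A = np.random.rand(8, 5)
P, sigma, Rt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = P[:, :k] @ np.diag(sigma[:k]) @ Rt[:k, :]

# spectral norm of the error equals the (k+1)-th singular value
print(np.isclose(np.linalg.norm(A - A_k, 2), sigma[k]))                       # True
# squared Frobenius norm of the error equals the sum of the remaining squared singular values
print(np.isclose(np.linalg.norm(A - A_k, 'fro')**2, np.sum(sigma[k:]**2)))    # True
```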
Reduction
X_k = Σ_k R^t is the doc-matrix, k × n, hence reduced to k dimensions.
Since we are interested in doc/query correlation, we consider:
D = A^t A = (P Σ R^t)^t (P Σ R^t) = (Σ R^t)^t (Σ R^t), since P has orthonormal columns.
Approximating Σ with Σ_k, we get A^t A ≈ X_k^t X_k (both are n × n matrices).
We use X_k to define how to project A and q:
X_k = Σ_k R^t; substituting R^t = Σ^{-1} P^t A, we get X_k = Σ_k Σ^{-1} P^t A.
In fact, Σ_k Σ^{-1} P^t (keeping only its k nonzero rows) = P_k^t, which is a k × m matrix.
This means that to reduce a doc/query vector it is enough to multiply it by P_k^t, thus paying O(km) per doc/query.
Cost of sim(q,d), for all d, is O(kn + km) instead of O(mn).
(R, P are formed by orthonormal eigenvectors of the matrices D, T.)
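A sketch of the whole pipeline under these definitions (function names such as `lsi_fold_in` are illustrative, not from the slides): fold documents and queries into k dimensions with P_k^t, then compare them there.

```python
import numpy as np

def lsi_fold_in(A, k):
    """Return P_k (m x k) and the k x n projected doc matrix P_k^t A."""
    P, sigma, Rt = np.linalg.svd(A, full_matrices=False)
    P_k = P[:, :k]
    return P_k, P_k.T @ A             # k x n docs in concept space (X_k up to its zero rows)

def cosine_scores(X, q_proj):
    """Cosine of the projected query against every projected doc (columns of X): O(kn)."""
    return (X.T @ q_proj) / (np.linalg.norm(X, axis=0) * np.linalg.norm(q_proj) + 1e-12)

A = np.random.rand(1000, 200)         # toy: m = 1000 terms, n = 200 docs
q = np.random.rand(1000)              # query in term space
P_k, X = lsi_fold_in(A, k=20)
q_proj = P_k.T @ q                    # O(km) to project the query
print(cosine_scores(X, q_proj)[:5])
```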
Which are the concepts?
The c-th concept = the c-th row of P_k^t (which is k × m).
Denote it by P_k^t[c], whose size is m = #terms.
P_k^t[c][i] = strength of association between the c-th concept and the i-th term.
Projected document: d'_j = P_k^t d_j; d'_j[c] = strength of concept c in d_j.
Projected query: q' = P_k^t q; q'[c] = strength of concept c in q.
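For intuition, a tiny illustrative sketch (vocabulary and matrix invented for the example) that prints, for each concept, the terms it is most strongly associated with:

```python
import numpy as np

terms = ["car", "automobile", "engine", "banana", "fruit"]
# toy 5 x 4 term-document matrix: docs 0-1 about cars, docs 2-3 about fruit
A = np.array([[2., 1., 0., 0.],
              [1., 2., 0., 0.],
              [1., 1., 0., 0.],
              [0., 0., 2., 1.],
              [0., 0., 1., 2.]])

P, sigma, Rt = np.linalg.svd(A, full_matrices=False)
P_k = P[:, :2]                        # keep k = 2 concepts

for c in range(P_k.shape[1]):
    weights = P_k[:, c]               # c-th row of P_k^t = association with each term
    top = np.argsort(-np.abs(weights))[:2]
    print(f"concept {c}:", [(terms[i], round(weights[i], 2)) for i in top])
```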
Random Projections Paolo Ferragina Dipartimento di Informatica Università di Pisa (Slides only!)
An interesting math result
Lemma (Johnson-Lindenstrauss, ’82). Let P be a set of n distinct points in m dimensions. Given ε > 0, there exists a function f : P → IR^k such that for every pair of points u,v in P it holds:
(1 - ε) ||u - v||^2 ≤ ||f(u) - f(v)||^2 ≤ (1 + ε) ||u - v||^2
where k = O(ε^{-2} log n).
f() is called a JL-embedding. Setting v = 0 we also get a bound on f(u)’s stretching!
What about the cosine distance?
Write the dot product as f(u)·f(v) = (||f(u)||^2 + ||f(v)||^2 - ||f(u) - f(v)||^2) / 2; bounding the first two terms via f(u)’s and f(v)’s stretching, and the last by substituting the formula above for ||u - v||^2, bounds the error on f(u)·f(v), and hence on the cosine, in terms of ε.
How to compute a JL-embedding?
Set R = (r_{i,j}) to be a random m × k matrix, where the components are independent random variables with zero mean and unit variance, E[r_{i,j}] = 0 and Var[r_{i,j}] = 1 (e.g. standard Gaussians, or ±1 each with probability 1/2).
Then, with high probability, f(u) = (1/√k) · R^t u satisfies the JL bound.
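A minimal sketch of such an embedding (the Gaussian variant, the choice of k, and the check are illustrative, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

m, k = 10_000, 500                    # original dimension, reduced dimension
u = rng.normal(size=m)                # two points in m dimensions
v = rng.normal(size=m)

R = rng.normal(size=(m, k))           # entries ~ N(0,1): zero mean, unit variance
f = lambda x: (R.T @ x) / np.sqrt(k)  # JL-embedding into k dimensions

# check how well the squared distance is preserved
orig = np.sum((u - v) ** 2)
proj = np.sum((f(u) - f(v)) ** 2)
print(proj / orig)                    # close to 1, i.e. within (1 - eps, 1 + eps)
```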
Finally...
Random projections hide large constants: k ≈ (1/ε)^2 · log n, so it may be large… but it is simple and fast to compute.
LSI is intuitive and may scale to any k; it is optimal under various metrics but costly to compute (good libraries now exist, though).