Presentation on theme: "Packing to fewer dimensions"— Presentation transcript:

1 Packing to fewer dimensions
Paolo Ferragina, "Algoritmi per Information Retrieval"
Dipartimento di Informatica, Università di Pisa

2 Speeding up cosine computation
What if we could take our vectors and "pack" them into fewer dimensions (say from 50,000 down to 100) while preserving distances?
Now: O(nm) to compute cos(d,q) for all n docs
Then: O(km + kn), where k << n, m
Two methods:
- "Latent semantic indexing" (LSI)
- Random projection
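Not on the original slides: a minimal numpy sketch of the two costs, assuming the k-dim projection of the whole collection is precomputed offline; the names (A, q, P_k) and toy sizes are illustrative only.

    import numpy as np

    m, n, k = 5000, 1000, 100               # terms, docs, reduced dims (toy sizes)
    rng = np.random.default_rng(0)
    A = rng.random((m, n))                   # term-doc matrix (toy data)
    q = rng.random(m)                        # query vector in term space

    # Original space: scoring all n docs touches every entry of A -> O(nm)
    scores = (q @ A) / (np.linalg.norm(q) * np.linalg.norm(A, axis=0))

    # Packed space: some k x m projection P_k (from LSI or a random projection);
    # A_k = P_k A is precomputed, so at query time only q must be projected.
    P_k = rng.standard_normal((k, m)) / np.sqrt(k)
    A_k = P_k @ A                            # offline, not a query-time cost
    q_k = P_k @ q                            # O(km)
    scores_k = (q_k @ A_k) / (np.linalg.norm(q_k) * np.linalg.norm(A_k, axis=0))  # O(kn)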

3 Briefly
LSI is data-dependent:
- Create a k-dim subspace by eliminating redundant axes
- Pull together "related" axes, hopefully car and automobile
Random projection is data-independent:
- Choose a k-dim subspace that, with high probability, guarantees good stretching properties between any pair of points
What about polysemy?

4 Latent Semantic Indexing
Sec. 18.4 (courtesy of Susan Dumais)

5 Notions from linear algebra
Matrix A, vector v
Matrix transpose (A^t)
Matrix product
Rank
Eigenvalue λ and eigenvector v: Av = λv
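As a stand-in for the slide's worked example, a tiny numpy check of these notions (the matrix itself is made up):

    import numpy as np

    A = np.array([[2.0, 1.0],
                  [1.0, 2.0]])               # small symmetric matrix (illustrative)
    print(A.T)                               # transpose A^t
    print(np.linalg.matrix_rank(A))          # rank
    eigvals, eigvecs = np.linalg.eigh(A)     # eigen-decomposition of a symmetric matrix
    lam, v = eigvals[-1], eigvecs[:, -1]     # largest eigenvalue and its eigenvector
    print(np.allclose(A @ v, lam * v))       # True: A v = λ v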

6 Overview of LSI
Pre-process docs using a technique from linear algebra called Singular Value Decomposition (SVD)
Create a new (smaller) vector space
Queries are handled (faster) in this new space

7 Singular-Value Decomposition
Recall the m × n matrix A of terms × docs. A has rank r ≤ min(m, n).
Define the term-term correlation matrix T = A A^t:
- T is a square, symmetric m × m matrix
- Let P be the m × r matrix of the r eigenvectors of T
Define the doc-doc correlation matrix D = A^t A:
- D is a square, symmetric n × n matrix
- Let R be the n × r matrix of the r eigenvectors of D
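A small numpy sketch of these definitions on a toy term-doc matrix (sizes and data are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 6, 4                              # tiny term-doc matrix for illustration
    A = rng.random((m, n))
    r = np.linalg.matrix_rank(A)             # r <= min(m, n)

    T = A @ A.T                              # term-term correlation, m x m, symmetric
    D = A.T @ A                              # doc-doc correlation,   n x n, symmetric

    # P: m x r eigenvectors of T; R: n x r eigenvectors of D (largest eigenvalues first)
    tw, tv = np.linalg.eigh(T)
    dw, dv = np.linalg.eigh(D)
    P = tv[:, np.argsort(tw)[::-1][:r]]
    R = dv[:, np.argsort(dw)[::-1][:r]]
    print(P.shape, R.shape)                  # (m, r), (n, r)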

8 A's decomposition
Given P (for T, m × r) and R (for D, n × r), both formed by orthonormal columns (unit dot-product), it turns out that

    A = P S R^t

where S is the r × r diagonal matrix of A's singular values (the square roots of the eigenvalues of T = A A^t), in decreasing order.
[Figure: A (m × n) = P (m × r) · S (r × r) · R^t (r × n)]
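A minimal sketch of this decomposition via numpy's SVD (the toy matrix is arbitrary; np.linalg.svd returns P, the diagonal of S, and R^t directly):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.random((6, 4))                               # toy m x n term-doc matrix

    P, s, Rt = np.linalg.svd(A, full_matrices=False)     # P: m x r, s: singular values (desc.), Rt: r x n
    S = np.diag(s)

    print(np.allclose(A, P @ S @ Rt))                    # True: A = P S R^t
    print(np.allclose(P.T @ P, np.eye(P.shape[1])))      # P has orthonormal columns
    print(np.allclose(Rt @ Rt.T, np.eye(Rt.shape[0])))   # so does R (rows of R^t)
    print(np.allclose(s**2,                              # s_i^2 = eigenvalues of T = A A^t
                      np.linalg.eigvalsh(A @ A.T)[::-1][:len(s)]))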

9 Dimensionality reduction
For some k << r, zero out all but the k biggest singular values in S [the choice of k is crucial]
Denote by S_k this new version of S, having rank k
Typically k is about 100, while r (A's rank) is > 10,000
[Figure: A_k = P S_k R^t, with P (m × r), S_k (r × r), R^t (r × n); all but the first k columns of P and rows of R^t are useless due to the 0-columns/0-rows of S_k, so effectively A_k = P_k (m × k) · S_k (k × k) · R_k^t (k × n)]
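A sketch of the truncation step, again with numpy and a made-up matrix:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.random((200, 120))                           # toy m x n matrix
    k = 10                                               # keep only the k largest singular values

    P, s, Rt = np.linalg.svd(A, full_matrices=False)
    P_k, S_k, Rt_k = P[:, :k], np.diag(s[:k]), Rt[:k, :] # drop the "useless" columns/rows directly
    A_k = P_k @ S_k @ Rt_k                               # rank-k approximation of A
    print(A_k.shape, np.linalg.matrix_rank(A_k))         # (200, 120) 10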

10 Guarantee
A_k is a pretty good approximation to A:
- Relative distances are (approximately) preserved
- Of all m × n matrices of rank k, A_k is the best approximation to A w.r.t. the following measures:
    min_{B: rank(B)=k} ||A - B||_2 = ||A - A_k||_2 = σ_{k+1}
    min_{B: rank(B)=k} ||A - B||_F^2 = ||A - A_k||_F^2 = σ_{k+1}^2 + σ_{k+2}^2 + ... + σ_r^2
- Frobenius norm: ||A||_F^2 = σ_1^2 + σ_2^2 + ... + σ_r^2
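These guarantees can be checked numerically; a small sketch on random data (illustrative only):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.random((50, 30))
    k = 5

    P, s, Rt = np.linalg.svd(A, full_matrices=False)
    A_k = P[:, :k] @ np.diag(s[:k]) @ Rt[:k, :]

    print(np.isclose(np.linalg.norm(A - A_k, 2), s[k]))                      # spectral norm = sigma_{k+1}
    print(np.isclose(np.linalg.norm(A - A_k, 'fro')**2, np.sum(s[k:]**2)))   # Frobenius^2 = tail of sigma_i^2
    print(np.isclose(np.linalg.norm(A, 'fro')**2, np.sum(s**2)))             # ||A||_F^2 = sum of sigma_i^2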

11 Reduction
R and P are formed by orthonormal eigenvectors of the matrices D and T.
X_k = S_k R^t is the doc-matrix, k × n, hence reduced to k dimensions.
Since we are interested in doc/query correlation, we consider:
    D = A^t A = (P S R^t)^t (P S R^t) = (S R^t)^t (S R^t)
Approximating S with S_k, we get A^t A ≈ X_k^t X_k (both are n × n matrices).
We use X_k to define how to project A and q:
    since X_k = S_k R^t, we have P_k^t A = S_k R^t = X_k
Since X_k plays the role of A, its columns are the projected docs; hence it is enough to multiply q by P_k^t to get the projected query, in O(km) time.
Cost of sim(q, d) for all d is O(kn + km) instead of O(mn).
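A sketch of the whole projection step in numpy (toy data; P_k, X_k and q are named as on the slide):

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, k = 1000, 200, 20
    A = rng.random((m, n))                   # term-doc matrix (toy data)
    q = rng.random(m)                        # query in term space

    P, s, Rt = np.linalg.svd(A, full_matrices=False)
    P_k = P[:, :k]
    X_k = np.diag(s[:k]) @ Rt[:k, :]         # k x n matrix of projected docs
    print(np.allclose(X_k, P_k.T @ A))       # True: P_k^t A = S_k R^t = X_k

    q_k = P_k.T @ q                          # projected query, O(km)
    sims = (q_k @ X_k) / (np.linalg.norm(q_k) * np.linalg.norm(X_k, axis=0))  # O(kn)
    print(sims.shape)                        # (n,) one cosine score per doc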

12 Which are the concepts?
The c-th concept = the c-th row of P_k^t (which is k × m).
Denote it by P_k^t[c], whose size is m = #terms.
P_k^t[c][i] = strength of association between the c-th concept and the i-th term.
Projected document: d'_j = P_k^t d_j, where d'_j[c] = strength of concept c in d_j.
Projected query: q' = P_k^t q, where q'[c] = strength of concept c in q.
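A short continuation of the previous sketch, reading concept strengths off P_k^t (the indices c, i, j are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, k = 500, 80, 5
    A = rng.random((m, n))
    P, s, Rt = np.linalg.svd(A, full_matrices=False)
    Pk_t = P[:, :k].T                        # k x m: one row per concept

    c, i, j = 2, 10, 0
    print(Pk_t[c][i])                        # strength of association: concept c vs term i
    d_proj = Pk_t @ A[:, j]                  # projected document d'_j = P_k^t d_j
    q_proj = Pk_t @ rng.random(m)            # projected query q' = P_k^t q
    print(d_proj[c], q_proj[c])              # strength of concept c in d_j and in q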

13 Random Projections
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Slides only!

14 An interesting math result
Lemma (Johnson-Lindenstrauss, '82). Let P be a set of n distinct points in m dimensions. Given ε > 0, there exists a function f : P → R^k such that for every pair of points u, v in P it holds:

    (1 - ε) ||u - v||^2 ≤ ||f(u) - f(v)||^2 ≤ (1 + ε) ||u - v||^2

where k = O(ε^-2 log n).
f() is called a JL-embedding.
Setting v = 0 we also get a bound on f(u)'s stretching!
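Not on the slides: a small helper that makes the k = O(ε^-2 log n) bound concrete; the explicit constant below follows one standard proof of the lemma and is an assumption, not taken from the lecture.

    import numpy as np

    def jl_min_dim(n_points: int, eps: float) -> int:
        # One common explicit form of k = O(eps^-2 log n) (Dasgupta-Gupta style bound);
        # the exact constant is illustrative, not from the slides.
        return int(np.ceil(4 * np.log(n_points) / (eps**2 / 2 - eps**3 / 3)))

    for n, eps in [(10_000, 0.1), (10_000, 0.5), (1_000_000, 0.1)]:
        print(n, eps, jl_min_dim(n, eps))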

15 What about the cosine distance?
[Derivation on the slide: bound the stretching of f(u) and f(v), then substitute the JL bound above for ||u - v||^2]

16 How to compute a JL-embedding?
If we set the projection matrix P = (p_{i,j}) as a random m × k matrix, whose components are independent random variables with one of two distributions, both satisfying E[r_{i,j}] = 0 and Var[r_{i,j}] = 1 (e.g. the standard Gaussian N(0,1) or the uniform ±1 distribution), then with high probability the map u → P^t u / √k is a JL-embedding.
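A minimal numpy sketch of such a projection, assuming the Gaussian choice for the entries and the 1/√k scaling (toy sizes, made-up points):

    import numpy as np

    rng = np.random.default_rng(0)
    m, k, n = 5000, 500, 20
    points = rng.random((n, m))                      # n points in m dimensions (toy data)

    # Projection matrix with i.i.d. entries, E = 0, Var = 1 (here: standard Gaussian)
    P = rng.standard_normal((m, k))
    proj = points @ P / np.sqrt(k)                   # f(u) = P^t u / sqrt(k)

    u, v = points[0], points[1]
    fu, fv = proj[0], proj[1]
    print(np.sum((fu - fv)**2) / np.sum((u - v)**2)) # close to 1: distances roughly preserved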

17 Finally...
Random projections:
- hide large constants: k ≈ (1/ε)^2 · log n, so k may be large…
- but they are simple and fast to compute
LSI:
- is intuitive and may scale to any k
- is optimal under various metrics
- but is costly to compute; good libraries do exist

