Packing to fewer dimensions


1 Packing to fewer dimensions
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval", Dipartimento di Informatica, Università di Pisa

2 Speeding up cosine computation
What if we could take our vectors and “pack” them into fewer dimensions (say 50,000 → 100) while preserving distances?
Now: O(nm) time to compute cos(d,q) for all n docs
Then: O(km + kn), where k << n, m
Two methods:
  “Latent semantic indexing”
  Random projection
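To get a feel for the saving, the two costs can be compared on concrete numbers. The sketch below assumes a collection of n = 1,000,000 docs (a made-up figure; the slide only suggests m = 50,000 and k = 100) and counts multiplications, ignoring constant factors.

```python
# Rough operation counts for scoring one query against all docs.
# n is a hypothetical collection size; m and k follow the slide's example.
n = 1_000_000   # number of docs (assumption)
m = 50_000      # original dimensions (terms)
k = 100         # reduced dimensions

cost_full = n * m             # cos(d, q) in the original space, for all n docs
cost_packed = k * m + k * n   # project the query (k*m), then score n packed docs (k*n)

print(cost_full)                 # 50000000000
print(cost_packed)               # 105000000
print(cost_full / cost_packed)   # roughly 476 times fewer operations
```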

3 Briefly
LSI is data-dependent:
  Create a k-dim subspace by eliminating redundant axes
  Pull together “related” axes, hopefully car and automobile
Random projection is data-independent:
  Choose a k-dim subspace that, with high probability, guarantees bounded stretching of the distance between any pair of points
What about polysemy ?

4 Latent Semantic Indexing
Sec. 18.4. Courtesy of Susan Dumais.

5 Notions from linear algebra
Matrix $A$, vector $v$
Matrix transpose ($A^T$)
Matrix product
Rank
Eigenvalue $\lambda$ and eigenvector $v$: $A v = \lambda v$
[Example on the slide not preserved]

6 Overview of LSI
Pre-process docs using a technique from linear algebra called Singular Value Decomposition
Create a new (smaller) vector space
Queries handled (faster) in this new space

7 Singular-Value Decomposition
Recall the $m \times n$ matrix of terms $\times$ docs, $A$.
$A$ has rank $r \le m, n$.
Define the term-term correlation matrix $T = A A^T$.
$T$ is a square, symmetric $m \times m$ matrix.
Let $U$ be the $m \times r$ matrix of the $r$ eigenvectors of $T$ (those with nonzero eigenvalue).
Define the doc-doc correlation matrix $D = A^T A$.
$D$ is a square, symmetric $n \times n$ matrix.
Let $V$ be the $n \times r$ matrix of the $r$ eigenvectors of $D$ (those with nonzero eigenvalue).

8 A’s decomposition
Given $U$ (for $T$, $m \times r$) and $V$ (for $D$, $n \times r$), both formed by orthonormal columns (unit length, mutually orthogonal),
it turns out that $A = U S V^T$,
where $S$ is an $r \times r$ diagonal matrix holding the singular values of $A$, i.e. the square roots of the nonzero eigenvalues of $T = A A^T$, in decreasing order.
Dimensions: $A$ ($m \times n$) $=$ $U$ ($m \times r$) $\cdot$ $S$ ($r \times r$) $\cdot$ $V^T$ ($r \times n$)
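A small worked instance may help here. The sketch below takes a toy 2 x 2 matrix (an assumption, not the slides' running example) whose factors were derived by hand, and checks numerically that $U S V^T$ gives back $A$ and that the columns of $U$ are orthonormal. The helpers `matmul` and `transpose` are plain-Python stand-ins introduced only for this illustration.

```python
import math

def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

A = [[3.0, 0.0],
     [4.0, 5.0]]

# Hand-derived factors for this toy A (not produced by a library routine).
# A A^T = [[9, 12], [12, 41]] has eigenvalues 45 and 5, so the diagonal of S
# holds sqrt(45) and sqrt(5), in decreasing order.
r10, r2 = math.sqrt(10), math.sqrt(2)
U = [[1 / r10, 3 / r10],
     [3 / r10, -1 / r10]]        # columns: eigenvectors of A A^T
S = [[math.sqrt(45), 0.0],
     [0.0, math.sqrt(5)]]
V = [[1 / r2, 1 / r2],
     [1 / r2, -1 / r2]]          # columns: eigenvectors of A^T A

USVt = matmul(matmul(U, S), transpose(V))
UtU = matmul(transpose(U), U)
print(USVt)   # approximately [[3, 0], [4, 5]]
print(UtU)    # approximately the 2x2 identity
```

On real term-document matrices one would rely on a numerical library rather than hand-derived factors.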

9 Dimensionality reduction
Fix some $k \ll r$ and zero out all but the $k$ biggest diagonal entries of $S$ (the singular values) [the choice of $k$ is crucial].
Denote by $S_k$ this new version of $S$, having rank $k$.
Typically $k$ is about 100, while $r$ ($A$’s rank) is > 10,000.
$A_k = U S_k V^T$: the last $r - k$ columns of $U$ ($m \times r$) and the last $r - k$ rows of $V^T$ ($r \times n$) are useless due to the 0-columns/0-rows of $S_k$, so it suffices to keep $U_k$ ($m \times k$), $S_k$ ($k \times k$) and $V_k^T$ ($k \times n$).

10-12 A running example
[Three figure-only slides; their content is not preserved in this transcript.]

13 Guarantee
$A_k$ is a pretty good approximation to $A$:
Relative distances are (approximately) preserved.
Of all $m \times n$ matrices of rank $k$, $A_k$ is the best approximation to $A$ with respect to the following measures ($s_i$ is the $i$-th singular value):
$\min_{B,\ \mathrm{rank}(B)=k} \|A-B\|_2 = \|A-A_k\|_2 = s_{k+1}$
$\min_{B,\ \mathrm{rank}(B)=k} \|A-B\|_F^2 = \|A-A_k\|_F^2 = s_{k+1}^2 + s_{k+2}^2 + \dots + s_r^2$
Frobenius norm: $\|A\|_F^2 = s_1^2 + s_2^2 + \dots + s_r^2$
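The Frobenius-norm guarantee can be checked by hand on a tiny case. The sketch below assumes a toy 2 x 2 matrix with singular values $\sqrt{45} > \sqrt{5}$ (derived by hand, not taken from the slides), builds the rank-1 truncation $A_1$, and compares its squared error with $s_2^2 = 5$ and with one arbitrary rank-1 competitor `B`.

```python
import math

A = [[3.0, 0.0],
     [4.0, 5.0]]

# Hand-derived SVD data for this toy A (an assumed example):
# singular values s1 > s2 and the first singular vectors u1, v1.
s1, s2 = math.sqrt(45), math.sqrt(5)
u1 = [1 / math.sqrt(10), 3 / math.sqrt(10)]
v1 = [1 / math.sqrt(2), 1 / math.sqrt(2)]

def sq_frobenius_diff(X, Y):
    """Squared Frobenius norm of X - Y."""
    return sum((x - y) ** 2 for rx, ry in zip(X, Y) for x, y in zip(rx, ry))

# Rank-1 truncation: A_1 = s1 * u1 * v1^T (every other singular value zeroed).
A1 = [[s1 * ui * vj for vj in v1] for ui in u1]

# An arbitrary rank-1 competitor: keep only the second row of A.
B = [[0.0, 0.0],
     [4.0, 5.0]]

err_A1 = sq_frobenius_diff(A, A1)
err_B = sq_frobenius_diff(A, B)
print(A1)       # approximately [[1.5, 1.5], [4.5, 4.5]]
print(err_A1)   # approximately 5.0, i.e. s2 squared
print(err_B)    # 9.0, worse than the truncation
```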

14 Reduction
Note: $U$, $V$ are formed by orthonormal eigenvectors of the matrices $T$, $D$ respectively.
Since we are interested in doc/doc correlation, we consider:
$D = A^T A = (U S V^T)^T (U S V^T) = (S V^T)^T (U^T U) (S V^T) = (S V^T)^T (S V^T)$, since $U^T U = I$.
Hence $X = S V^T$, an $r \times n$ matrix, may play the role of $A$.
To reduce its size we set $X_k = S_k V_k^T$, a $k \times n$ matrix, and thus get $A^T A \approx X_k^T X_k$ (both are $n \times n$ matrices).
We use $X_k$ to define how to project $A$:
Since $X_k = S_k V_k^T$, we get $X_k = U_k^T A$ (use the definition of the SVD of $A$: $U_k^T A = U_k^T U S V^T = S_k V_k^T$).
Since $X_k$ may play the role of $A$, its columns are the projected docs.
Similarly, a query $Q$ can be interpreted as a new column of $A$, so it is enough to multiply $U_k^T$ times $Q$ to get the projected query, in $O(km)$ time.
Side note: $A = U S V^T$, $A^T = V S U^T$, $A^T U S^{-1} = V$.

15 Which are the concepts ?
c-th concept = c-th column of $U_k$ (which is $m \times k$)
$U_k[i][c]$ = strength of association between the c-th concept and the i-th term
$V_k^T[c][j]$ = strength of association between the c-th concept and the j-th document
Projected document: $d'_j = U_k^T d_j$, where $d'_j[c]$ = strength of concept c in $d_j$
Projected query: $q' = U_k^T q$, where $q'[c]$ = strength of concept c in $q$

16 Random Projections
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Slides only!

17 An interesting math result
Lemma (Johnson-Lindenstrauss, ‘82). Let $P$ be a set of $n$ distinct points in $m$ dimensions. Given $\varepsilon > 0$, there exists a function $f : P \to \mathbb{R}^k$ such that for every pair of points $u, v$ in $P$ it holds:
$(1-\varepsilon)\,\|u-v\|^2 \le \|f(u)-f(v)\|^2 \le (1+\varepsilon)\,\|u-v\|^2$
where $k = O(\varepsilon^{-2} \log n)$.
$f()$ is called a JL-embedding.
Setting $v = 0$ (taking the origin as one of the points, and $f$ linear so that $f(0) = 0$) we also get a bound on $f(u)$’s stretching!

18 What about the cosine-distance ?
Expand the dot product, then apply the stretching bounds on $f(u)$, $f(v)$ and $f(u)-f(v)$, and finally substitute $\|u-v\|^2 = \|u\|^2 + \|v\|^2 - 2\,u \cdot v$:
$2\, f(u) \cdot f(v) = \|f(u)\|^2 + \|f(v)\|^2 - \|f(u)-f(v)\|^2$
$\le (1+\varepsilon)\|u\|^2 + (1+\varepsilon)\|v\|^2 - (1-\varepsilon)\|u-v\|^2$
$= (1+\varepsilon)(\|u\|^2+\|v\|^2) - (1-\varepsilon)(\|u\|^2+\|v\|^2 - 2\,u \cdot v)$
$= 2\varepsilon(\|u\|^2+\|v\|^2) + 2(1-\varepsilon)\,u \cdot v$

19 How to compute a JL-embedding?
Set the projection matrix $P = (p_{i,j})$ as a random $m \times k$ matrix whose components are independent random variables with one of two distributions [given as formulas on the slide, not preserved in this transcript], both with
$E[p_{i,j}] = 0$ and $\mathrm{Var}[p_{i,j}] = 1$
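A minimal sketch of such an embedding follows, under two assumptions that are not confirmed by the transcript: the entries are drawn as +1 or -1 with probability 1/2 each (one common distribution with mean 0 and variance 1), and the projected vector is scaled by $1/\sqrt{k}$ so that squared lengths are preserved in expectation. The function names are hypothetical.

```python
import math
import random

def random_projection_matrix(m, k, seed=0):
    """m x k matrix with independent entries +1 or -1, each with probability 1/2.

    One common choice with mean 0 and variance 1 (an assumption here).
    """
    rng = random.Random(seed)
    return [[rng.choice((-1.0, 1.0)) for _ in range(k)] for _ in range(m)]

def embed(P, u):
    """Map an m-dim vector u to k dims: f(u) = (1/sqrt(k)) * P^T u."""
    m, k = len(P), len(P[0])
    scale = 1 / math.sqrt(k)
    return [scale * sum(P[i][j] * u[i] for i in range(m)) for j in range(k)]

def sq_norm(x):
    return sum(t * t for t in x)

m, k = 1000, 200
P = random_projection_matrix(m, k)
rng = random.Random(1)
u = [rng.gauss(0, 1) for _ in range(m)]
v = [rng.gauss(0, 1) for _ in range(m)]

diff = [a - b for a, b in zip(u, v)]
fu, fv = embed(P, u), embed(P, v)
fdiff = [a - b for a, b in zip(fu, fv)]
ratio = sq_norm(fdiff) / sq_norm(diff)
print(len(fu), ratio)   # 200 dims; the ratio should be close to 1
```

Since the matrix never looks at the data, it can be generated once and reused for every doc and every query.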

20 Finally...
Random projections:
  hide large constants: $k \approx (1/\varepsilon)^2 \cdot \log n$, so $k$ may be large…
  are simple and fast to compute
LSI:
  is intuitive and may scale to any $k$
  is optimal under various metrics
  but is costly to compute, although good libraries exist
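To see how large $k$ can get, the formula can be evaluated on sample values. The sketch below assumes a natural logarithm and sets the hidden constant `c` to 1 (both are assumptions; the slide fixes neither), so the real figures may be larger.

```python
import math

def jl_target_dim(n, eps, c=1.0):
    """k = c * (1/eps)^2 * ln(n); c stands in for the hidden constant (assumed 1)."""
    return math.ceil(c * math.log(n) / eps ** 2)

print(jl_target_dim(1_000_000, 0.1))   # 1382
print(jl_target_dim(1_000_000, 0.5))   # 56
```

Even with c = 1, a 10% distortion on a million points already asks for more than a thousand dimensions, compared with the k of about 100 typical for LSI.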

