Information Retrieval: Latent Semantic Indexing
Speeding up cosine computation
What if we could take our vectors and “pack” them into fewer dimensions (say 50,000 → 100) while preserving distances?
Two methods:
- “Latent semantic indexing”
- Random projection
Two approaches
- LSI is data-dependent: create a k-dim subspace by eliminating redundant axes, pulling together “related” axes – hopefully car and automobile.
- Random projection is data-independent: choose a k-dim subspace that guarantees probable stretching properties between pairs of points.
Notions from linear algebra
- Matrix A, vector v
- Matrix transpose (A^t)
- Matrix product
- Rank
- Eigenvalues λ and eigenvectors v: A v = λ v
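For readers who want to see the eigenvalue relation in action, here is a minimal NumPy sketch (the matrix values are made up for illustration):

```python
import numpy as np

# A small symmetric matrix, so its eigenvalues are real.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eigh(A)  # eigh is for symmetric matrices

lam = eigenvalues[0]      # an eigenvalue lambda
v = eigenvectors[:, 0]    # its eigenvector v (eigenvectors are the columns)

# Verify A v = lambda v
print(np.allclose(A @ v, lam * v))  # True
```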
Overview of LSI
- Pre-process docs using a technique from linear algebra called Singular Value Decomposition.
- Create a new (smaller) vector space.
- Queries are handled in this new vector space.
Example: a term-document matrix with 16 terms and 17 docs [matrix shown on slide].
Intuition (contd)
More than dimension reduction:
- Derive a set of new uncorrelated features (roughly, artificial concepts), one per dimension.
- Docs with lots of overlapping terms stay together.
- Terms also get pulled together onto the same dimension.
- Each term or document is then characterized by a vector of weights indicating its strength of association with each of these underlying concepts.
- Ex.: car and automobile get pulled together, since they co-occur in docs with tires, radiator, cylinder, …
Here comes the “semantic”!!!
Singular-Value Decomposition
- Recall the m × n matrix of terms × docs, A. A has rank r ≤ min(m, n).
- Define the term-term correlation matrix T = A A^t. T is a square, symmetric m × m matrix. Let P be the m × r matrix of eigenvectors of T.
- Define the doc-doc correlation matrix D = A^t A. D is a square, symmetric n × n matrix. Let R be the n × r matrix of eigenvectors of D.
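A small NumPy sketch of the two correlation matrices, on a made-up toy term-document matrix; it also checks that the non-zero eigenvalues of T and D coincide:

```python
import numpy as np

# Toy m x n term-document matrix (m=4 terms, n=3 docs); values are illustrative.
A = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 0.],
              [0., 0., 1.]])

T = A @ A.T   # term-term correlation matrix, m x m, symmetric
D = A.T @ A   # doc-doc correlation matrix,  n x n, symmetric

eig_T = np.sort(np.linalg.eigvalsh(T))[::-1]  # eigenvalues in decreasing order
eig_D = np.sort(np.linalg.eigvalsh(D))[::-1]

# The non-zero eigenvalues of T and D are the same (r = rank(A) of them).
r = np.linalg.matrix_rank(A)
print(np.allclose(eig_T[:r], eig_D[:r]))  # True
```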
A’s decomposition
- There exist matrices P (for T, m × r) and R (for D, n × r) formed by orthonormal columns (unit norm, zero pairwise dot-product).
- It turns out that A = P Σ R^t, where Σ is an r × r diagonal matrix containing the singular values of A (the square roots of the eigenvalues of T = A A^t) in decreasing order.
- In terms of shapes: A (m × n) = P (m × r) · Σ (r × r) · R^t (r × n).
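A minimal sketch with NumPy's SVD, which returns exactly the pieces named above (P, the diagonal of Σ, and R^t); the toy matrix is random, just for illustration:

```python
import numpy as np

A = np.random.rand(16, 17)          # toy terms x docs matrix

P, sigma, Rt = np.linalg.svd(A, full_matrices=False)
Sigma = np.diag(sigma)              # diagonal, singular values in decreasing order

# A = P Sigma R^t, and the columns of P (and of R) are orthonormal.
print(np.allclose(A, P @ Sigma @ Rt))            # True
print(np.allclose(P.T @ P, np.eye(P.shape[1])))  # True
```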
Dimensionality reduction
- For some k << r, zero out all but the k biggest singular values in Σ [choice of k is crucial].
- Denote by Σ_k this new version of Σ, having rank k. Typically k is about 100, while r (A’s rank) is > 10,000.
- A_k = P Σ_k R^t. Because of the 0-columns/0-rows of Σ_k, the last r − k columns of P and the last r − k rows of R^t are useless, so effectively A_k = P_k (m × k) · Σ_k (k × k) · R_k^t (k × n).
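A sketch of the truncation step in the same notation (the value of k is made up):

```python
import numpy as np

A = np.random.rand(16, 17)
P, sigma, Rt = np.linalg.svd(A, full_matrices=False)

k = 3                                   # keep only the k largest singular values
sigma_k = np.zeros_like(sigma)
sigma_k[:k] = sigma[:k]                 # zero out all but the k biggest ones

A_k = P @ np.diag(sigma_k) @ Rt         # rank-k approximation of A
# Equivalently, drop the useless zero rows/columns:
A_k2 = P[:, :k] @ np.diag(sigma[:k]) @ Rt[:k, :]
print(np.allclose(A_k, A_k2))           # True
print(np.linalg.matrix_rank(A_k))       # k
```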
Guarantee
- A_k is a pretty good approximation to A: relative distances are (approximately) preserved.
- Of all m × n matrices of rank k, A_k is the best approximation to A wrt the following measures:
  min_{B: rank(B)=k} ||A − B||_2 = ||A − A_k||_2 = σ_{k+1}
  min_{B: rank(B)=k} ||A − B||_F^2 = ||A − A_k||_F^2 = σ_{k+1}^2 + σ_{k+2}^2 + … + σ_r^2
- Frobenius norm: ||A||_F^2 = σ_1^2 + σ_2^2 + … + σ_r^2
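Both identities can be checked numerically; a minimal sketch:

```python
import numpy as np

A = np.random.rand(16, 17)
P, sigma, Rt = np.linalg.svd(A, full_matrices=False)
k = 3
A_k = P[:, :k] @ np.diag(sigma[:k]) @ Rt[:k, :]

# Spectral norm of the error equals the (k+1)-th singular value.
print(np.isclose(np.linalg.norm(A - A_k, 2), sigma[k]))                      # True

# Squared Frobenius norm of the error equals the sum of the remaining sigma_i^2.
print(np.isclose(np.linalg.norm(A - A_k, 'fro')**2, np.sum(sigma[k:]**2)))   # True

# And ||A||_F^2 equals the sum of all sigma_i^2.
print(np.isclose(np.linalg.norm(A, 'fro')**2, np.sum(sigma**2)))             # True
```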
Reduction
- X_k = Σ_k R^t is the doc-matrix reduced to k < n dimensions.
- Take the doc-correlation matrix: D = A^t A = (P Σ R^t)^t (P Σ R^t) = (Σ R^t)^t (Σ R^t), since P’s columns are orthonormal. Approximating Σ with Σ_k, we thus get A^t A ≈ X_k^t X_k.
- We use X_k to approximate A: X_k = Σ_k R^t = P_k^t A. This means that to reduce a doc/query vector it is enough to multiply it by P_k^t (i.e. a k × m matrix).
- Cost of sim(q, d), for all d, is O(kn + km) instead of O(mn).
(R and P are formed by the orthonormal eigenvectors of the matrices D and T.)
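A minimal sketch of folding documents and a query into the k-dimensional space and scoring with the cosine measure (sizes and the query vector are made up):

```python
import numpy as np

A = np.random.rand(1000, 200)            # m terms x n docs (toy sizes)
P, sigma, Rt = np.linalg.svd(A, full_matrices=False)
k = 50

P_k = P[:, :k]                           # m x k
X_k = np.diag(sigma[:k]) @ Rt[:k, :]     # k x n reduced doc matrix
print(np.allclose(X_k, P_k.T @ A))       # True: X_k = P_k^t A

q = np.random.rand(1000)                 # a query in term space
q_k = P_k.T @ q                          # reduce the query to k dimensions

# Cosine similarity of q against every doc, computed in k dimensions.
sims = (X_k.T @ q_k) / (np.linalg.norm(X_k, axis=0) * np.linalg.norm(q_k) + 1e-12)
top_docs = np.argsort(-sims)[:10]
```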
Which are the concepts?
- The c-th concept = the c-th row of P_k^t (which is k × m). Denote it by P_k^t[c]; note its size is m = #terms.
- P_k^t[c][i] = strength of association between the c-th concept and the i-th term.
- Projected document: d'_j = P_k^t d_j, where d'_j[c] = strength of concept c in d_j.
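A small sketch that inspects the concepts; the vocabulary list `terms` and the matrix are made up for illustration:

```python
import numpy as np

terms = ["car", "automobile", "tires", "radiator", "bread", "butter"]
A = np.random.rand(len(terms), 10)       # toy term-doc matrix
P, sigma, Rt = np.linalg.svd(A, full_matrices=False)
k = 2
Pk_t = P[:, :k].T                        # k x m: one concept per row

for c in range(k):
    # Pk_t[c][i] = strength of association between concept c and term i
    strongest = np.argsort(-np.abs(Pk_t[c]))[:3]
    print(f"concept {c}:", [terms[i] for i in strongest])

# Project a document into concept space: d'_j = P_k^t d_j
d0 = A[:, 0]
print(Pk_t @ d0)                         # strength of each concept in doc 0
```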
Information Retrieval: Random Projection
An interesting math result!
Lemma (Johnson–Lindenstrauss): for any 0 < ε < 1 and any set S of n points in R^m, there is a map f: R^m → R^k, with k = O(ε^{-2} log n), such that for all u, v in S:
  (1 − ε) ||u − v||^2 ≤ ||f(u) − f(v)||^2 ≤ (1 + ε) ||u − v||^2
Setting v = 0 we also get a bound on f(u)’s stretching!!!
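A minimal empirical sketch of this behaviour; the Gaussian entries and the 1/√k scaling are assumptions anticipating the later slides, not the lemma's own construction:

```python
import numpy as np

m, k, eps = 10_000, 500, 0.2
rng = np.random.default_rng(0)

u = rng.normal(size=m)
v = rng.normal(size=m)

R = rng.normal(size=(k, m)) / np.sqrt(k)   # assumed Gaussian projection, scaled by 1/sqrt(k)
f = lambda x: R @ x

orig = np.linalg.norm(u - v)
proj = np.linalg.norm(f(u) - f(v))
print(abs(proj / orig - 1.0) < eps)        # distance preserved up to ~(1 +/- eps), w.h.p.

# Setting v = 0: the norm of f(u) itself is preserved in the same sense.
print(abs(np.linalg.norm(f(u)) / np.linalg.norm(u) - 1.0) < eps)
```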
What about the cosine-distance?
The bound on the dot product f(u) · f(v) — and hence on the cosine — follows from the distance bound together with the bounds on f(u)’s and f(v)’s stretching.
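A hedged numerical check that the cosine survives the projection reasonably well (same assumed Gaussian construction as above; the vectors are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
m, k = 10_000, 500

u = rng.normal(size=m)
v = u + 0.5 * rng.normal(size=m)          # cos(u, v) is around 0.89
R = rng.normal(size=(k, m)) / np.sqrt(k)  # assumed Gaussian projection

cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos(u, v), cos(R @ u, R @ v))             # the two values are close
print(abs(cos(u, v) - cos(R @ u, R @ v)) < 0.1)  # True w.h.p.
```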
Defining the projection matrix R [figure on slide: the matrix R and its k columns].
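A sketch of one common way to build such a matrix; the i.i.d. N(0, 1/k) entries are an assumption, since the slide's own definition is only hinted at here:

```python
import numpy as np

def make_projection(k: int, m: int, seed: int = 0) -> np.ndarray:
    """Random k x m projection matrix with i.i.d. N(0, 1/k) entries (assumed variant)."""
    rng = np.random.default_rng(seed)
    return rng.normal(scale=1.0 / np.sqrt(k), size=(k, m))

R = make_projection(k=100, m=50_000)
x = np.random.rand(50_000)        # a document vector in term space
x_small = R @ x                   # its 100-dimensional projection
```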
Concentration bound!!! Is R a JL-embedding?
Gaussians are good!!
NOTE: Every column of R is a unit vector uniformly distributed over the unit sphere; moreover, the k columns of R are orthonormal on average.
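This note can be checked empirically; a small sketch that draws each column as a normalized Gaussian vector (an assumed construction consistent with the note) and averages R^t R over many trials:

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, trials = 1000, 20, 2000

avg = np.zeros((k, k))
for _ in range(trials):
    R = rng.normal(size=(m, k))
    R /= np.linalg.norm(R, axis=0)   # normalize each of the k columns to unit length
    avg += R.T @ R
avg /= trials

# Each column is exactly unit-norm; on average the k columns are orthonormal.
print(np.allclose(np.diag(avg), 1.0))        # True
print(np.abs(avg - np.eye(k)).max() < 0.1)   # off-diagonal entries concentrate near 0
```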
A practical-theoretical idea!!!
E[r_{i,j}] = 0 and Var[r_{i,j}] = 1
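One way to realize these two moments without Gaussians is the (assumed, Achlioptas-style) choice of random ±1 entries; the slide itself only states the required mean and variance:

```python
import numpy as np

rng = np.random.default_rng(0)
k, m = 100, 50_000

# Random +/-1 entries satisfy E[r_ij] = 0 and Var[r_ij] = 1 (assumed construction).
R = rng.choice([-1.0, 1.0], size=(k, m))

print(abs(R.mean()) < 1e-2)       # empirical mean close to 0
print(abs(R.var() - 1.0) < 1e-2)  # empirical variance close to 1

x = rng.random(m)
x_proj = (R @ x) / np.sqrt(k)     # projection scaled by 1/sqrt(k), as in the Gaussian case
```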
Question!! Various theoretical results are known. What about practical cases?