1
Latent Semantic Analysis
2
An Example
d1 : Romeo and Juliet.
d2 : Juliet: O happy dagger!
d3 : Romeo died by dagger.
d4 : “Live free or die”, that’s the New-Hampshire’s motto.
d5 : Did you know, New-Hampshire is in New-England.
q: dies, dagger
Which documents should be returned, and how should they be ranked?
3
Eigenvectors and Eigenvalues
Let A be an n × n matrix. If x is an n-dimensional vector, then the matrix-vector product Ax is well-defined, and the result is again an n-dimensional vector. In general, multiplication by a matrix changes the direction of a non-zero vector x, unless the vector is special and we have Ax = λx for some scalar λ. Such a vector x is called an eigenvector of A, and λ is the corresponding eigenvalue.
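The definition can be checked numerically. The following is a minimal sketch using NumPy; the 2 × 2 matrix and the test vector v are arbitrary values chosen for illustration, not taken from the slides.

```python
import numpy as np

# Arbitrary symmetric 2 x 2 matrix, chosen only for illustration.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Columns of `eigenvectors` are the eigenvectors of A.
eigenvalues, eigenvectors = np.linalg.eig(A)

# For every eigenpair, A x equals lambda x (direction unchanged).
for lam, x in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ x, lam * x)

# A generic vector, by contrast, changes direction under A.
v = np.array([1.0, 0.0])
Av = A @ v
# In 2-D, two vectors are parallel exactly when this determinant is zero.
is_parallel = bool(np.isclose(Av[0] * v[1] - Av[1] * v[0], 0.0))
print(is_parallel)
```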
4
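Matrix Decomposition
Let S be the matrix with the eigenvectors of A as columns. Let Λ be the diagonal matrix with the eigenvalues of A on the diagonal. Then, provided A has n linearly independent eigenvectors (so that S is invertible),
A = SΛS⁻¹
If A is symmetric, the eigenvectors can be chosen orthonormal, and then we have S⁻¹ = Sᵀ, so
A = SΛSᵀ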
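As a sanity check of the symmetric case, here is a small sketch with NumPy. The 3 × 3 matrix is an arbitrary example, and the name Lam stands in for the diagonal eigenvalue matrix; `np.linalg.eigh` is used because it returns orthonormal eigenvectors for symmetric input.

```python
import numpy as np

# Arbitrary symmetric matrix, chosen only for illustration.
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

# eigh is meant for symmetric matrices; the columns of S are orthonormal.
eigenvalues, S = np.linalg.eigh(A)
Lam = np.diag(eigenvalues)  # diagonal matrix of eigenvalues

# Rebuild A from its eigenvectors and eigenvalues.
reconstructed = S @ Lam @ S.T

print(np.allclose(reconstructed, A))         # decomposition holds
print(np.allclose(np.linalg.inv(S), S.T))    # inverse equals transpose
```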
5
Singular Value Decomposition
Let A be an m × n matrix with real entries and m > n. Consider the n × n square matrix B = AᵀA. B is symmetric, and it can be shown that the eigenvalues of such matrices (AᵀA) are non-negative. Since they are non-negative, we can write them in decreasing order as squares of non-negative real numbers:
σ₁² ≥ … ≥ σₙ²
For some index r (possibly n) the first r numbers are positive whereas the rest are zero. Let x₁, …, xᵣ be orthonormal eigenvectors of B belonging to these first r eigenvalues.
S₁ = [x₁, …, xᵣ]
y₁ = (1/σ₁)Ax₁, …, yᵣ = (1/σᵣ)Axᵣ
S₂ = [y₁, …, yᵣ]
We can show that A = S₂ΣS₁ᵀ, where Σ is diagonal and the values along the diagonal are σ₁, …, σᵣ, which are called singular values. If we denote S₂ by S and S₁ by U we have
A = SΣUᵀ
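The construction above can be followed step by step in code. This is a sketch only: the 3 × 2 matrix and the tolerance used to decide which eigenvalues count as positive are assumptions made for illustration, and in practice one would call `np.linalg.svd` directly.

```python
import numpy as np

# Arbitrary 3 x 2 real matrix (m > n), chosen only for illustration.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])

# B = A^T A is symmetric with non-negative eigenvalues.
B = A.T @ A
eigvals, eigvecs = np.linalg.eigh(B)

# eigh returns ascending eigenvalues; reorder to decreasing.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# r = number of (numerically) positive eigenvalues; tolerance is an assumption.
r = int(np.sum(eigvals > 1e-12))
sigma = np.sqrt(eigvals[:r])      # singular values sigma_1 >= ... >= sigma_r

S1 = eigvecs[:, :r]               # columns x_1, ..., x_r
S2 = (A @ S1) / sigma             # column i is (1 / sigma_i) * A x_i
Sigma = np.diag(sigma)

A_rebuilt = S2 @ Sigma @ S1.T
print(np.allclose(A_rebuilt, A))
```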
6
Example
d1 : Romeo and Juliet.
d2 : Juliet: O happy dagger!
d3 : Romeo died by dagger.
d4 : “Live free or die”, that’s the New-Hampshire’s motto.
d5 : Did you know, New-Hampshire is in New-England.
q: dies, dagger
7
Document-term matrix
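One way such a matrix might be built from the five example documents is sketched below. The vocabulary VOCAB, the hand-written STEMS map that folds "died" and "dies" into "die", and the choice of terms as rows and documents as columns are all assumptions made for illustration; the original deck may have used a different term list or weighting.

```python
import re

DOCS = [
    "Romeo and Juliet.",
    "Juliet: O happy dagger!",
    "Romeo died by dagger.",
    "“Live free or die”, that’s the New-Hampshire’s motto.",
    "Did you know, New-Hampshire is in New-England.",
]

# Hypothetical vocabulary: words outside it are treated as stop words.
VOCAB = ["romeo", "juliet", "happy", "dagger",
         "live", "die", "free", "new-hampshire"]

# Hypothetical hand-written stemming map.
STEMS = {"died": "die", "dies": "die"}


def tokenize(text):
    """Lowercase, keep hyphenated words whole, and apply the stem map."""
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return [STEMS.get(w, w) for w in words]


def term_document_matrix(docs, vocab):
    """Return counts with one row per term and one column per document."""
    index = {term: i for i, term in enumerate(vocab)}
    matrix = [[0] * len(docs) for _ in vocab]
    for j, text in enumerate(docs):
        for token in tokenize(text):
            if token in index:
                matrix[index[token]][j] += 1
    return matrix


A = term_document_matrix(DOCS, VOCAB)
for term, row in zip(VOCAB, A):
    print(f"{term:14s}", row)
```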
8
Latent Concepts
Latent Semantic Indexing (LSI) is a method for discovering hidden concepts in document data. Each document and term (word) is then expressed as a vector with elements corresponding to these concepts. Each element in a vector gives the degree of participation of the document or term in the corresponding concept.
The goal is not to describe the concepts verbally, but to be able to represent the documents and terms in a unified way for exposing document-document, document-term, and term-term similarities which are otherwise hidden…
9
Matrix
Matrix A can be written: A = SΣUᵀ
Let's "neglect" the last three singular values of Σ as being too "small", obtaining Σ₂. Also, just keep two columns from S, obtaining S₂, and two rows from Uᵀ, obtaining U₂ᵀ.
Matrix A is approximated as: A₂ = S₂Σ₂U₂ᵀ
In general: Aₖ = SₖΣₖUₖᵀ, where a good value for k is determined empirically.
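The truncation can be sketched with NumPy as follows. The 8 × 5 term-by-document matrix is an assumed encoding of the five example documents (rows: romeo, juliet, happy, dagger, live, die, free, new-hampshire), not a matrix taken from the slides.

```python
import numpy as np

# Assumed term-by-document counts for d1..d5 (rows are terms).
A = np.array([
    [1, 0, 1, 0, 0],   # romeo
    [1, 1, 0, 0, 0],   # juliet
    [0, 1, 0, 0, 0],   # happy
    [0, 1, 1, 0, 0],   # dagger
    [0, 0, 0, 1, 0],   # live
    [0, 0, 1, 1, 0],   # die
    [0, 0, 0, 1, 0],   # free
    [0, 0, 0, 1, 1],   # new-hampshire
], dtype=float)

# Thin SVD: A = S @ diag(sigma) @ Ut, singular values in decreasing order.
S, sigma, Ut = np.linalg.svd(A, full_matrices=False)

k = 2                          # number of latent concepts kept
S_k = S[:, :k]                 # first k columns of S
Sigma_k = np.diag(sigma[:k])   # k largest singular values
Ut_k = Ut[:k, :]               # first k rows of U^T

A_k = S_k @ Sigma_k @ Ut_k     # rank-k approximation of A

# Frobenius-norm error; it equals the norm of the dropped singular values.
error = np.linalg.norm(A - A_k)
print(A_k.shape, error)
```

Comparing the error for several values of k is one simple way to approach the empirical choice of k mentioned above.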
10
Matrices Σ₂, S₂, U₂
11
Representing Documents, Terms, and Queries
Represent documents by the column vectors of U₂ᵀ.
Represent terms by the row vectors of S₂.
Represent queries by the centroid vector of their terms.
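The three representations can be combined to rank the documents for the query "dies, dagger". The sketch below makes several assumptions: the 8 × 5 term-by-document matrix is an assumed encoding of the example, the query word "dies" is mapped by hand to the term "die", and cosine similarity is used as the ranking measure, which the slides do not specify.

```python
import numpy as np

TERMS = ["romeo", "juliet", "happy", "dagger",
         "live", "die", "free", "new-hampshire"]
DOC_IDS = ["d1", "d2", "d3", "d4", "d5"]

# Assumed term-by-document counts (rows follow TERMS, columns follow DOC_IDS).
A = np.array([
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1],
], dtype=float)

S, sigma, Ut = np.linalg.svd(A, full_matrices=False)
k = 2
term_vectors = S[:, :k]      # row i represents TERMS[i]
doc_vectors = Ut[:k, :].T    # row j is column j of U_k^T, i.e. DOC_IDS[j]


def query_vector(query_terms):
    """Centroid (mean) of the vectors of the query's terms."""
    rows = [term_vectors[TERMS.index(t)] for t in query_terms]
    return np.mean(rows, axis=0)


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


q = query_vector(["die", "dagger"])   # "dies" mapped to "die" by hand
scores = {d: cosine(q, vec) for d, vec in zip(DOC_IDS, doc_vectors)}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```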
12
Geometry