Latent Semantic Indexing via a Semi-discrete Matrix Decomposition
Papers from the same authors with similar topics:
1. Kolda, T.G. & O'Leary, D.P. A semidiscrete matrix decomposition for latent semantic indexing information retrieval. ACM Trans. Inf. Syst., 1998, 16.
2. Kolda, T.G. & O'Leary, D.P. Latent semantic indexing via a semi-discrete matrix decomposition. In George Cybenko (ed.), Springer-Verlag, 1999, 107, 73–80.
3. Kolda, T.G. & O'Leary, D.P. Algorithm 805: computation and uses of the semidiscrete matrix decomposition. ACM Transactions on Mathematical Software, 2000, 26, 415–435.
Vector Space Framework
Query: a query is represented as a vector q over the same terms as the documents; documents are ranked by the similarity (e.g., the cosine) between q and each column of the term-document matrix A.
Weight of a term in a document: entry a_ij of A is the weight of term i in document j.
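The vector-space retrieval step can be sketched as follows. This is a minimal illustration, not code from the papers; the 5-term, 3-document matrix and the query are made up for the example.

```python
import numpy as np

# Hypothetical 5-term x 3-document term-document matrix A:
# A[i, j] = weight of term i in document j (here, raw term frequency).
A = np.array([
    [2, 0, 1],
    [0, 3, 0],
    [1, 1, 0],
    [0, 0, 2],
    [1, 0, 1],
], dtype=float)

# A query is a vector over the same terms; this one asks for terms 0 and 4.
q = np.array([1, 0, 0, 0, 1], dtype=float)

# Score each document by the cosine between q and the document's column.
scores = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))
best = int(np.argmax(scores))  # index of the highest-ranked document
```

In LSI, the exact matrix A in this scoring step is replaced by a low-rank approximation A_k.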
Motivation for using SDD
Singular Value Decomposition (SVD) is commonly used for Latent Semantic Indexing (LSI) to estimate the latent structure of word usage across documents. The proposal is to use the Semidiscrete Decomposition (SDD) instead of the SVD for LSI, to save storage space and retrieval time.
Why? Claim: the SVD has nice theoretical properties, but it retains a lot of information, probably more than is necessary for this application.
SVD vs SDD
SVD: A_k = U_k Σ_k V_k^T, where U_k and V_k have dense real (floating-point) entries and Σ_k is diagonal.
SDD: A_k = X_k D_k Y_k^T, where the entries of X_k and Y_k are restricted to {-1, 0, 1} and D_k is diagonal with positive scalar values.
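The two factorizations have the same outer-product shape but very different factor entries. The sketch below builds a rank-k truncated SVD with numpy and then shows the SDD shapes using a crude sign-based stand-in for X and Y; the stand-in is only for illustrating the {-1, 0, 1} restriction, not the actual SDD fitting procedure, and the random test matrix is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))  # hypothetical 20-term x 10-doc matrix
k = 3

# Rank-k truncated SVD: A_k = U_k @ diag(s_k) @ Vt_k, dense real factors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_svd = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# SDD has the same form A_k = X @ D @ Y.T, but X and Y take values only
# in {-1, 0, 1}; this sign-based stand-in just illustrates the shapes.
X = np.sign(U[:, :k])
Y = np.sign(Vt[:k, :].T)
D = np.diag(s[:k])
A_sdd_like = X @ D @ Y.T

err_svd = np.linalg.norm(A - A_svd)  # Frobenius error of the truncation
```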
The SDD is an approximate representation of the matrix: even without discarding anything, repackaging the data this way need not reproduce the original matrix exactly. Theorems guarantee that as the number of terms k grows, A_k converges (slowly) to the original matrix. The speed of convergence depends on the starting estimate used to initialize the iterative decomposition algorithm.
Result: Storage Space
Approximate comparative storage space (for the same given rank k):

                        SVD                     SDD
Size per element        Double word (64 bits)   2 bits
Size per scalar value   Double word (64 bits)   Single word (32 bits)
TOTAL (bytes)           8k(m + n + 1)           4k + ¼k(m + n)
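The totals in the table follow from simple counting: the SVD stores k(m + n + 1) doubles at 8 bytes each, while the SDD stores k singles at 4 bytes plus k(m + n) two-bit entries. A worked example (the matrix dimensions are hypothetical, chosen only for illustration):

```python
def svd_bytes(m, n, k):
    # U_k is m x k, V_k is n x k, plus k singular values; all 64-bit doubles.
    return 8 * k * (m + n + 1)

def sdd_bytes(m, n, k):
    # d holds k single-word (32-bit) values; the entries of X (m x k) and
    # Y (n x k) each take 2 bits since they come from {-1, 0, 1}.
    return 4 * k + k * (m + n) // 4

# Hypothetical collection: 5000 terms, 1000 documents, rank k = 100.
m, n, k = 5000, 1000, 100
ratio = svd_bytes(m, n, k) / sdd_bytes(m, n, k)  # roughly 32x smaller
```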
Medline test case
Results on Medline test case
Method for SDD
A greedy algorithm iteratively constructs the kth triplet d_k, x_k, and y_k.
Metrics in those papers
See references 1–3 listed at the top.
NOTE: all y's above are fixed; x and y are alternately fixed in each algorithm.
Greedy Algorithm
Notes on the algorithm
Starting vector y: every 100th element is 1 and all the others are 0.
A_k → A as k → ∞.
Minimizing the Frobenius norm of the residual can be simplified to finding an optimal x.
A possible improvement threshold is improvement = |new - old| / old.
Finding x and d
1. Fix y.
2. Find the optimal d.
3. Eliminate d by substituting its optimal value into the objective.
4. Solve for x.
5. Use x and y to find d.
This is simplified to (and used in the algorithm):
1. Fix y.
2. Find the optimal x over m candidate x-vectors.
3. Given x and y, find d*.
There are only m possible values for the support set J; thus, we only need to check m candidate x vectors to determine the optimal solution.
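The greedy procedure above can be sketched in code. This is a rough reconstruction from the slides, not the authors' reference implementation (Algorithm 805): `solve_x` implements the m-candidate search for the optimal x given fixed y, and `sdd` alternates between fixing y and fixing x for each triplet, using the slides' starting vector and relative-improvement threshold. The loop counts and tolerance are assumed values.

```python
import numpy as np

def solve_x(s):
    """Given s = R @ y, find x in {-1,0,1}^m maximizing (x.T s)^2 / (x.T x).

    Sorting |s_i| in decreasing order, the optimal support J is a prefix of
    that order, so only m candidate vectors need to be checked.
    """
    order = np.argsort(-np.abs(s))
    cum = np.cumsum(np.abs(s)[order])
    j = int(np.argmax(cum**2 / np.arange(1, len(s) + 1))) + 1
    x = np.zeros_like(s)
    x[order[:j]] = np.sign(s[order[:j]])
    return x

def sdd(A, k, inner_iters=20, tol=1e-3):
    """Greedy SDD sketch: fit one triplet (d, x, y) to the residual R per
    outer step, alternately fixing y and x (as noted in the slides)."""
    m, n = A.shape
    R = A.copy()
    X = np.zeros((m, k)); Y = np.zeros((n, k)); d = np.zeros(k)
    for t in range(k):
        # Starting y from the slides: every 100th element is 1, others 0.
        y = np.zeros(n)
        y[::100] = 1.0
        f_old = 0.0
        for _ in range(inner_iters):
            x = solve_x(R @ y)    # optimal x for this fixed y
            y = solve_x(R.T @ x)  # same subproblem with the roles swapped
            f_new = (x @ R @ y) ** 2 / ((x @ x) * (y @ y))
            if f_old > 0 and abs(f_new - f_old) / f_old < tol:
                break             # improvement threshold from the slides
            f_old = f_new
        # Optimal d* given x and y (minimizes ||R - d x y^T||_F).
        dk = (x @ R @ y) / ((x @ x) * (y @ y))
        R -= dk * np.outer(x, y)
        X[:, t], Y[:, t], d[t] = x, y, dk
    return X, d, Y
```

Each outer step strictly reduces the Frobenius norm of the residual, which is the sense in which A_k → A as k grows.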