Information Retrieval through Various Approximate Matrix Decompositions
Kathryn Linehan
Advisor: Dr. Dianne O’Leary
Information Retrieval
- Extracting information from databases
- We need an efficient way of searching large amounts of data
- Example: a web search engine
Querying a Document Database
- Goal: return documents that are relevant to the entered search terms
- Given data:
  - Term-document matrix A; entry (i, j): importance of term i in document j
  - Query vector q; entry (i): importance of term i in the query
Term-Document Matrix
- Entry (i, j): weight of term i in document j
[Example matrix from [5]: six terms (Mark, Twain, Samuel, Clemens, Purple, Fairy) by four documents]
Query Vector
- Entry (i): weight of term i in the query
[Example from [5]: query vector for the search "Mark Twain", with nonzero weights on the terms Mark and Twain]
Document Scoring
- The score of document j is the inner product of the query vector with column j of the term-document matrix, i.e., entry j of A^T q
[Example from [5]: scoring the four documents against the query "Mark Twain"]
- Doc 1 and Doc 3 will be returned as relevant, but Doc 2 will not
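A minimal numpy sketch of this scoring scheme. The matrix entries and query weights below are illustrative stand-ins in the spirit of the example from [5], not the slide's actual values:

```python
import numpy as np

# Hypothetical 6-term x 4-document matrix; rows correspond to the terms
# Mark, Twain, Samuel, Clemens, Purple, Fairy (weights are illustrative).
A = np.array([
    [1., 0., 1., 0.],  # Mark
    [1., 0., 1., 0.],  # Twain
    [0., 1., 0., 0.],  # Samuel
    [0., 1., 0., 0.],  # Clemens
    [0., 0., 0., 1.],  # Purple
    [0., 0., 0., 1.],  # Fairy
])

# Query "Mark Twain": weight 1 on the terms Mark and Twain.
q = np.array([1., 1., 0., 0., 0., 0.])

# Score of document j = inner product of q with column j of A.
scores = A.T @ q
print(scores)  # documents with nonzero scores are returned as relevant
```

With these weights, Doc 1 and Doc 3 score 2 while Docs 2 and 4 score 0, matching the retrieval behavior described above.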
Can we do better if we replace the matrix by an approximation?
- Singular Value Decomposition (SVD)
- Nonnegative Matrix Factorization (NMF)
- CUR Decomposition
Nonnegative Matrix Factorization (NMF)
- A \approx WH, where A is m x n, W is m x k, and H is k x n
- W and H are nonnegative
- Storage: k(m + n) entries
NMF
- Multiplicative update algorithm of Lee and Seung, found in [1] (see the sketch below)
- Find W, H to minimize \|A - WH\|_F^2
- Random initialization for W and H
- A gradient descent method
- Slow due to the matrix multiplications in each iteration
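A sketch of the multiplicative updates, assuming the standard Lee-Seung form described in [1]; the iteration count, initialization, and eps safeguard are illustrative choices:

```python
import numpy as np

def nmf_multiplicative(A, k, iters=200, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates minimizing ||A - WH||_F^2."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))  # random nonnegative initialization
    H = rng.random((k, n))
    for _ in range(iters):
        # Elementwise updates keep W and H nonnegative; eps avoids
        # division by zero. Each step costs several matrix multiplies,
        # which is why the iteration is slow.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H
```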
NMF Validation
[Figures. A: 5 x 3 random dense matrix, average over 5 runs. B: 500 x 200 random sparse matrix, average over 5 runs.]
NMF Validation
[Figure. B: 500 x 200 random sparse matrix, rank(NMF) = 80.]
CUR Decomposition
- A \approx CUR, where A is m x n, C is m x c, U is c x r, and R is r x n
- C (R) holds c (r) sampled and rescaled columns (rows) of A
- U is computed using C and R
- k is a rank parameter
- Storage: nz(C) + cr + nz(R) entries
CUR Implementations
- CUR algorithm in [3] by Drineas, Kannan, and Mahoney: a linear-time algorithm
- Improvement: Compact Matrix Decomposition (CMD) in [6] by Sun, Xie, Zhang, and Faloutsos
- Modification: use the ideas in [4] by Drineas, Mahoney, and Muthukrishnan (no longer linear time)
- Other modifications: our own ideas
- Deterministic CUR code by G. W. Stewart [2]
Sampling
- Column (row) norm sampling [3] (sketched below):
  Prob(col j) = \|A^{(j)}\|_2^2 / \|A\|_F^2 (similar for row i)
- Subspace sampling [4]:
  - uses the rank-k SVD of A for column probabilities: Prob(col j) = \|(V_{A,k})_{(j)}\|_2^2 / k
  - uses the "economy size" SVD of C for row probabilities: Prob(row i) = \|(U_C)_{(i)}\|_2^2 / c
- Sampling without replacement
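A sketch of column-norm sampling with replacement in the style of [3]; the 1/sqrt(c p_j) rescaling is the standard choice that makes C C^T an unbiased estimate of A A^T:

```python
import numpy as np

def sample_columns_norm(A, c, seed=0):
    """Sample c columns of A with probability proportional to their
    squared norms and rescale them, as in the scheme of [3]."""
    rng = np.random.default_rng(seed)
    p = np.sum(A**2, axis=0) / np.sum(A**2)   # p_j = ||A^(j)||^2 / ||A||_F^2
    idx = rng.choice(A.shape[1], size=c, replace=True, p=p)
    C = A[:, idx] / np.sqrt(c * p[idx])       # rescale sampled columns
    return C, idx
```

Sampling rows for R is analogous, using the squared row norms of A.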
Computation of U
- Linear U [3]: approximately solves \min_U \|A - CUR\|_F
- Optimal U: solves \min_U \|A - CUR\|_F exactly, with closed form U = C^{+} A R^{+}
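A sketch of the optimal U, using the closed-form minimizer via pseudoinverses (a standard fact, assumed here rather than taken from the slides):

```python
import numpy as np

def optimal_U(A, C, R):
    """U = pinv(C) @ A @ pinv(R) minimizes ||A - C U R||_F over U."""
    return np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
```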
Deterministic CUR Code by G. W. Stewart [2]
- Uses a rank-revealing QR (RRQR) algorithm that does not store Q; we only need the permutation vector, which gives us the columns (rows) for C (R)
- Uses an optimal U
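A simplified stand-in for this column selection, using SciPy's pivoted QR (the code in [2] avoids forming Q at all; this sketch does not):

```python
import numpy as np
from scipy.linalg import qr

def deterministic_columns(A, c):
    """Select c columns of A via the permutation vector of a pivoted QR;
    pivoting orders columns by how much new range they contribute."""
    _, _, perm = qr(A, mode='economic', pivoting=True)
    return A[:, perm[:c]], perm[:c]
```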
Compact Matrix Decomposition (CMD) Improvement
- Remove repeated columns (rows) in C (R)
- Decreases storage while still achieving the same relative error [6]

                  [3]        [3] with CMD
Runtime (sec)     0.008060   0.007153
Storage           880.5      550.5
Relative Error    0.820035 (same for both)

A: 50 x 30 random sparse matrix, k = 15. Average over 10 runs.
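A sketch of the compaction step, assuming the CMD convention of [6] in which a column repeated d times is kept once and scaled by sqrt(d), which preserves C C^T:

```python
import numpy as np

def remove_duplicate_columns(C):
    """Keep one copy of each repeated column, scaled by sqrt(multiplicity)."""
    uniq, counts = np.unique(C, axis=1, return_counts=True)
    return uniq * np.sqrt(counts)  # broadcasts sqrt(d) across each column
```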
CUR: Sampling with Replacement Validation
[Figure. A: 5 x 3 random dense matrix, average over 5 runs. Legend: Sampling, U]
Sampling without Replacement: Scaling vs. No Scaling
- Scaling variant: invert the scaling factor applied to the sampled columns (rows)
CUR: Sampling without Replacement Validation
[Figures. A: 5 x 3 random dense matrix; B: 500 x 200 random sparse matrix. Average over 5 runs. Legend: Sampling, U, Scaling]
CUR Comparison
[Figure. B: 500 x 200 random sparse matrix, average over 5 runs. Legend: Sampling, U, Scaling]
Judging Success: Precision and Recall
- Measures of performance for document retrieval
- We report average precision and recall, where the average is taken over all queries in the data set
- Let Retrieved = number of documents retrieved, Relevant = total number of documents relevant to the query, and RetRel = number of retrieved documents that are relevant
- Precision = RetRel / Retrieved
- Recall = RetRel / Relevant
(A small helper for these measures appears below.)
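A minimal helper computing both measures for a single query; averaging over all queries in the data set would follow:

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query, given iterables of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    retrel = len(retrieved & relevant)   # retrieved documents that are relevant
    precision = retrel / len(retrieved) if retrieved else 0.0
    recall = retrel / len(relevant) if relevant else 0.0
    return precision, recall
```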
LSI Results
Term-document matrix size: 5831 x 1033. All matrix approximations are rank-100 approximations (CUR: r = c = k). The average query time is less than 10^{-3} seconds for all matrix approximations.
Matrix Approximation Results

Method           Rel. Error (F-norm)   Storage (nz)   Runtime (sec)
SVD              0.8203                686500         22.5664
NMF              0.8409                686400         23.0210
CUR: cn,lin      1.4151                17242          0.1741
CUR: cn,opt      0.9724                16358          0.2808
CUR: sub,lin     1.2093                16175          48.7651
CUR: sub,opt     0.9615                16108          49.0830
CUR: w/oR,no     0.9931                17932          0.3466
CUR: w/oR,yes    0.9957                17220          0.2734
CUR: GWS         0.9437                25020          2.2857
LTM              --                    52003          --

Key (inferred from the legends above): cn = column-norm sampling; sub = subspace sampling; lin = linear U; opt = optimal U; w/oR,no/yes = sampling without replacement, without/with scaling; GWS = deterministic code of [2].
Conclusions
- We may not be able to store an entire term-document matrix, and it may be too expensive to compute an SVD
- We can achieve LSI results that are almost as good with cheaper approximations:
  - Less storage
  - Less computation time
Completed Project Goals
- Code and validate NMF and CUR
- Analyze the relative error, runtime, and storage of NMF and CUR
- Improve the CUR algorithm of [3]
- Analyze the use of NMF and CUR in LSI
References
[1] Michael W. Berry, Murray Browne, Amy N. Langville, V. Paul Pauca, and Robert J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics and Data Analysis, 52(1):155-173, September 2007.
[2] M. W. Berry, S. A. Pulatova, and G. W. Stewart. Computing sparse reduced-rank approximations to sparse matrices. Technical Report UMIACS TR-2004-34 CMSC TR-4591, University of Maryland, May 2004.
[3] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition. SIAM Journal on Computing, 36(1):184-206, 2006.
[4] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844-881, 2008.
[5] Tamara G. Kolda and Dianne P. O'Leary. A semidiscrete matrix decomposition for latent semantic indexing in information retrieval. ACM Transactions on Information Systems, 16(4):322-346, October 1998.
[6] Jimeng Sun, Yinglian Xie, Hui Zhang, and Christos Faloutsos. Less is more: Sparse graph mining with compact matrix decomposition. Statistical Analysis and Data Mining, 1(1):6-22, February 2008.