Information Retrieval through Various Approximate Matrix Decompositions
Kathryn Linehan
Advisor: Dr. Dianne O'Leary
Information Retrieval
Information retrieval is the extraction of information from databases. We need an efficient way of searching large amounts of data, for example, in a web search engine.
Querying a Document Database
We want to return documents that are relevant to entered search terms. Given data:
- Term-document matrix A: entry (i, j) is the importance of term i in document j
- Query vector q: entry (i) is the importance of term i in the query
Term-Document Matrix
Entry (i, j): weight of term i in document j.
[Example matrix not recovered; terms shown: Mark, Twain, Samuel, Clemens, Purple, Fairy. Example taken from [5].]
Query Vector
Entry (i): weight of term i in the query.
[Example vector not recovered: a search for "Mark Twain" puts weight on the terms Mark and Twain only. Example taken from [5].]
Document Scoring
The score of document j is the inner product of the query vector with column j of A; the vector of scores is qᵀA.
[Score table not recovered; documents Doc 1-Doc 4, terms Mark, Twain, Samuel, Clemens, Purple, Fairy.] Doc 1 and Doc 3 will be returned as relevant, but Doc 2 will not. Example taken from [5].
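To make the scoring concrete, here is a minimal NumPy sketch using hypothetical 0/1 weights for the six terms and four documents above (the weights are illustrative, not the actual values from [5]):

```python
import numpy as np

# Hypothetical 0/1 term weights (rows: Mark, Twain, Samuel, Clemens,
# Purple, Fairy; columns: Doc 1..Doc 4). Illustrative only.
A = np.array([
    [1, 0, 1, 0],   # Mark
    [1, 0, 1, 0],   # Twain
    [0, 1, 0, 0],   # Samuel
    [0, 1, 0, 0],   # Clemens
    [0, 0, 0, 1],   # Purple
    [0, 0, 0, 1],   # Fairy
], dtype=float)

# Query "Mark Twain": weight on the terms Mark and Twain only.
q = np.array([1, 1, 0, 0, 0, 0], dtype=float)

scores = q @ A   # score of document j is the inner product of q with A(:, j)
print(scores)    # [2. 0. 2. 0.] -- Doc 1 and Doc 3 returned, Doc 2 missed
```

With these weights, Doc 2 (matching "Samuel Clemens" rather than "Mark Twain") scores zero under plain term matching, which is exactly what the matrix approximations below aim to improve.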
Can we do better if we replace the matrix by an approximation?
Candidate approximations:
- Singular Value Decomposition (SVD)
- Nonnegative Matrix Factorization (NMF)
- CUR Decomposition
Nonnegative Matrix Factorization (NMF)
A ≈ WH, where A is m × n, W is m × k, H is k × n, and W and H are nonnegative.
Storage: k(m + n) entries.
NMF Algorithm
Multiplicative update algorithm of Lee and Seung, as given in [1]:
- Find W, H ≥ 0 to minimize ‖A − WH‖_F
- Random initialization for W, H
- A gradient descent method
- Slow due to the matrix multiplications in each iteration (see the sketch below)
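As a concrete reference, here is a minimal NumPy sketch of the multiplicative updates described in [1]; the iteration count and the eps guard against division by zero are implementation choices of this sketch, not part of the algorithm as stated in the deck:

```python
import numpy as np

def nmf(A, k, iters=200, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for min ||A - W H||_F, W, H >= 0."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))                      # random nonnegative init
    H = rng.random((k, n))
    for _ in range(iters):                      # each sweep costs several
        H *= (W.T @ A) / (W.T @ W @ H + eps)    # matrix multiplications,
        W *= (A @ H.T) / (W @ H @ H.T + eps)    # hence the slowness
    return W, H
```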
NMF Validation
[Plots not recovered.] A: 5 × 3 random dense matrix; average over 5 runs. B: 500 × 200 random sparse matrix; average over 5 runs.
NMF Validation
[Plot not recovered.] B: 500 × 200 random sparse matrix; rank(NMF) = 80.
CUR Decomposition
A ≈ CUR, where A is m × n, C is m × c, U is c × r, and R is r × n (k is a rank parameter).
- C (R) holds c (r) sampled and rescaled columns (rows) of A
- U is computed using C and R
Storage: nz(C) + cr + nz(R) entries.
CUR Implementations
- CUR algorithm in [3] by Drineas, Kannan, and Mahoney: a linear time algorithm
- Improvement: Compact Matrix Decomposition (CMD) in [6] by Sun, Xie, Zhang, and Faloutsos
- Modification: use ideas in [4] by Drineas, Mahoney, and Muthukrishnan (no longer linear time)
- Other modifications: our ideas
- Deterministic CUR code by G. W. Stewart [2]
Sampling
Column (row) norm sampling [3]:
- Prob(col j) = ‖A(:, j)‖² / ‖A‖_F² (similar for row i); a code sketch follows this list
Subspace sampling [4]:
- Uses the rank-k SVD of A for column probabilities: Prob(col j) = ‖V_k(j, :)‖² / k
- Uses the "economy size" SVD of C for row probabilities: Prob(row i) = ‖U_C(i, :)‖² / c
Sampling without replacement is also considered.
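Here is a minimal sketch of column norm sampling with replacement, including the 1/√(c·p_j) rescaling used in [3]; the function name and interface are mine:

```python
import numpy as np

def column_norm_sample(A, c, seed=0):
    """Sample c columns of A with replacement, Prob(col j) proportional to
    ||A(:, j)||^2, rescaling each pick by 1/sqrt(c * p_j) as in [3]."""
    rng = np.random.default_rng(seed)
    p = np.sum(A**2, axis=0) / np.sum(A**2)    # squared-norm probabilities
    idx = rng.choice(A.shape[1], size=c, replace=True, p=p)
    C = A[:, idx] / np.sqrt(c * p[idx])        # rescale sampled columns
    return C, idx
```

The same routine applied to Aᵀ samples rows for R.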
Computation of U
Linear U [3]: approximately solves min_U ‖A − CUR‖_F (keeps the algorithm linear time).
Optimal U: solves min_U ‖A − CUR‖_F exactly; U = C⁺AR⁺.
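The optimal U, U = C⁺AR⁺ with ⁺ denoting the pseudoinverse, is a one-liner in NumPy; this sketch ignores efficiency:

```python
import numpy as np

def optimal_u(A, C, R):
    """The U minimizing ||A - C U R||_F is C^+ A R^+ (pseudoinverses)."""
    return np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
```

Note that computing the optimal U touches all of A, which is why [3] uses a cheaper approximate U instead.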
Deterministic CUR Code by G. W. Stewart [2]
- Uses a rank-revealing QR (RRQR) algorithm that does not store Q; we only need the permutation vector, which gives us the columns (rows) for C (R)
- Uses an optimal U
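As a rough stand-in for this column selection, SciPy's pivoted QR returns the permutation vector directly; unlike the RRQR in [2], SciPy does form Q, so this is illustrative only (the helper name is mine):

```python
from scipy.linalg import qr

def pivoted_qr_columns(A, k):
    """Return indices of k columns chosen by pivoted QR; the permutation
    vector gives the columns for C (apply to A.T to get rows for R)."""
    _, _, piv = qr(A, mode='economic', pivoting=True)
    return piv[:k]
```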
Compact Matrix Decomposition (CMD) Improvement
Remove repeated columns (rows) in C (R); this decreases storage while still achieving the same relative error [6]. A code sketch of the idea follows below.
A: 50 × 30 random sparse matrix, k = 15; average over 10 runs. [Table partially recovered.] Storage: 880.5 entries for the algorithm of [3] vs. 550.5 entries for [3] with CMD; the runtime and relative-error entries were not recovered.
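A minimal sketch of the deduplication step, assuming the √(multiplicity) rescaling described in [6], which preserves CCᵀ (the helper name is mine):

```python
import numpy as np

def dedup_columns(C):
    """Collapse repeated columns of C into one copy scaled by
    sqrt(multiplicity); since C C^T = sum_j c_j c_j^T, this preserves C C^T
    while shrinking storage."""
    uniq, counts = np.unique(C, axis=1, return_counts=True)
    return uniq * np.sqrt(counts)   # scale column j by sqrt(counts[j])
```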
CUR: Sampling with Replacement Validation
[Plots not recovered. Legend: sampling, U.] A: 5 × 3 random dense matrix; average over 5 runs.
Sampling without Replacement: Scaling vs. No Scaling
Invert scaling factor applied to
CUR: Sampling without Replacement Validation
[Plots not recovered. Legend: sampling, U, scaling.] A: 5 × 3 random dense matrix; average over 5 runs. B: 500 × 200 random sparse matrix; average over 5 runs.
CUR Comparison
[Plot not recovered. Legend: sampling, U, scaling.] B: 500 × 200 random sparse matrix; average over 5 runs.
Judging Success: Precision and Recall
Precision and recall measure performance for document retrieval. We report average precision and recall, where the average is taken over all queries in the data set.
Let Retrieved = number of documents retrieved, Relevant = total number of documents relevant to the query, and RetRel = number of retrieved documents that are relevant. Then:
Precision = RetRel / Retrieved
Recall = RetRel / Relevant
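For a single query these formulas are one-liners; a minimal sketch (names are mine):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query, per the definitions above;
    retrieved and relevant are lists of document ids."""
    retrel = len(set(retrieved) & set(relevant))
    precision = retrel / len(retrieved) if retrieved else 0.0
    recall = retrel / len(relevant) if relevant else 0.0
    return precision, recall
```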
LSI Results
Term-document matrix size: 5831 × 1033. All matrix approximations are rank 100 approximations (CUR: r = c = k). Average query time is less than 10⁻³ seconds for all matrix approximations.
Matrix Approximation Results
[Table values not recovered.] Columns: Rel. Error (F-norm), Storage (nz), Runtime (sec). Rows: SVD; NMF; CUR: cn,lin; CUR: cn,opt; CUR: sub,lin; CUR: sub,opt; CUR: w/oR,no; CUR: w/oR,yes; CUR: GWS; LTM.
Conclusions
We may not be able to store an entire term-document matrix, and it may be too expensive to compute an SVD. We can achieve LSI results that are almost as good with cheaper approximations: less storage and less computation time.
Completed Project Goals
- Code and validate NMF and CUR
- Analyze relative error, runtime, and storage of NMF and CUR
- Improve the CUR algorithm of [3]
- Analyze the use of NMF and CUR in LSI
References
[1] Michael W. Berry, Murray Browne, Amy N. Langville, V. Paul Pauca, and Robert J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics and Data Analysis, 52(1), September 2007.
[2] M. W. Berry, S. A. Pulatova, and G. W. Stewart. Computing sparse reduced-rank approximations to sparse matrices. Technical Report UMIACS TR CMSC TR-4591, University of Maryland, May 2004.
[3] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition. SIAM Journal on Computing, 36(1), 2006.
[4] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2), 2008.
[5] Tamara G. Kolda and Dianne P. O'Leary. A semidiscrete matrix decomposition for latent semantic indexing in information retrieval. ACM Transactions on Information Systems, 16(4), October 1998.
[6] Jimeng Sun, Yinglian Xie, Hui Zhang, and Christos Faloutsos. Less is more: Sparse graph mining with compact matrix decomposition. Statistical Analysis and Data Mining, 1(1):6-22, February 2008.