Information Retrieval through Various Approximate Matrix Decompositions
Kathryn Linehan, Advisor: Dr. Dianne O'Leary

Information Retrieval
Extracting information from databases. We need an efficient way to search large amounts of data. Example: a web search engine.

Querying a Document Database
We want to return documents that are relevant to the entered search terms. Given data:
Term-Document Matrix A, entry (i, j): importance of term i in document j
Query Vector q, entry (i): importance of term i in the query

Term-Document Matrix
Entry (i, j): weight of term i in document j.
Example: a matrix over the terms Mark, Twain, Samuel, Clemens, Purple, Fairy and documents 1-4 (the slide's matrix entries are not reproduced here). Example taken from [5].

Query Vector
Entry (i): weight of term i in the query.
Example: a search for "Mark Twain" gives nonzero weight to the terms Mark and Twain and zero weight to Samuel, Clemens, Purple, and Fairy (the slide's vector entries are not reproduced here). Example taken from [5].

Document Scoring
Each of the documents 1-4 receives a score measuring how well it matches the query (the slide's score table is not reproduced here). Doc 1 and Doc 3 will be returned as relevant, but Doc 2 will not. Example taken from [5].
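The scoring step can be sketched in Python with NumPy. The weights below are made up, since the slide's table is not reproduced, but they follow the pattern of the example: Doc 1 uses "Mark Twain", Doc 2 uses only "Samuel Clemens", Doc 3 uses both names, Doc 4 is unrelated.

```python
import numpy as np

# Hypothetical term-document matrix over the terms
# [Mark, Twain, Samuel, Clemens, Purple, Fairy] and documents 1-4.
A = np.array([
    [1.0, 0.0, 0.0, 0.0],  # Mark
    [1.0, 0.0, 1.0, 0.0],  # Twain
    [0.0, 1.0, 1.0, 0.0],  # Samuel
    [0.0, 1.0, 1.0, 0.0],  # Clemens
    [0.0, 0.0, 0.0, 1.0],  # Purple
    [0.0, 0.0, 0.0, 1.0],  # Fairy
])
q = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])  # query "Mark Twain"

# Score document j by the cosine between q and column j of A.
scores = (q @ A) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))
```

With these weights, Docs 1 and 3 get nonzero scores while Doc 2 scores zero, reproducing the slide's outcome: a purely term-based search misses Doc 2 even though Samuel Clemens is Mark Twain.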

Can we do better if we replace the matrix by an approximation?
Singular Value Decomposition (SVD)
Nonnegative Matrix Factorization (NMF)
CUR Decomposition

Nonnegative Matrix Factorization (NMF)
A ≈ WH, where A is m x n, W is m x k, H is k x n, and W and H are nonnegative.
Storage: k(m + n) entries

NMF
Multiplicative update algorithm of Lee and Seung, found in [1]
Find W, H to minimize ||A - WH||_F^2
Random initialization for W, H
A gradient descent method; slow due to the matrix multiplications in each iteration
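A minimal sketch of the Lee-Seung multiplicative updates described above: random nonnegative initialization, then alternating elementwise updates. Each update multiplies by a nonnegative ratio, so W and H stay nonnegative; the matrix products in each iteration are what make the method slow.

```python
import numpy as np

def nmf(A, k, iters=200, seed=0, eps=1e-9):
    """Sketch of Lee-Seung multiplicative updates for
    min ||A - W H||_F^2 with W, H >= 0 (see [1])."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)  # update H, keep H >= 0
        W *= (A @ H.T) / (W @ H @ H.T + eps)  # update W, keep W >= 0
    return W, H

A = np.random.default_rng(1).random((5, 3))   # small dense test matrix
W, H = nmf(A, k=2)
rel_err = np.linalg.norm(A - W @ H) / np.linalg.norm(A)
```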

NMF Validation A: 5 x 3 random dense matrix. Average over 5 runs. B: 500 x 200 random sparse matrix. Average over 5 runs.

NMF Validation B: 500 x 200 random sparse matrix. Rank(NMF) = 80.

CUR Decomposition
A ≈ CUR, where A is m x n, C is m x c, U is c x r, and R is r x n, with c and r tied to a rank parameter k.
C (R) holds c (r) sampled and rescaled columns (rows) of A; U is computed using C and R.
Storage: nz(C) + cr + nz(R) entries

CUR Implementations
CUR algorithm in [3] by Drineas, Kannan, and Mahoney: a linear time algorithm
Improvement: Compact Matrix Decomposition (CMD) in [6] by Sun, Xie, Zhang, and Faloutsos
Modification: use ideas in [4] by Drineas, Mahoney, and Muthukrishnan (no longer linear time)
Other modifications: our ideas
Deterministic CUR code by G. W. Stewart [2]

Sampling
Column (row) norm sampling [3]: Prob(col j) = ||A(:, j)||^2 / ||A||_F^2 (similar for row i)
Subspace sampling [4]: uses the rank-k SVD of A for column probabilities, Prob(col j) = ||V_k(j, :)||^2 / k, and the "economy size" SVD of C for row probabilities, Prob(row i) = ||U_C(i, :)||^2 / c
Sampling without replacement
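Column norm sampling can be sketched as follows. The 1/sqrt(c p_j) rescaling of each picked column is the standard choice in this family of algorithms, making C C^T an unbiased estimator of A A^T:

```python
import numpy as np

def norm_sample_columns(A, c, seed=0):
    """Sample c columns of A with probability proportional to their
    squared norms, then rescale each pick by 1/sqrt(c * p_j)."""
    rng = np.random.default_rng(seed)
    p = np.sum(A**2, axis=0) / np.sum(A**2)       # Prob(col j)
    idx = rng.choice(A.shape[1], size=c, replace=True, p=p)
    return A[:, idx] / np.sqrt(c * p[idx]), idx

A = np.random.default_rng(2).random((6, 8))
C, idx = norm_sample_columns(A, c=4)
```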

Computation of U
Linear U [3]: approximately solves min_U ||A - CUR||_F
Optimal U: solves min_U ||A - CUR||_F exactly, giving U = C^+ A R^+
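The optimal U is the Frobenius-norm minimizer, computable with pseudoinverses as U = C^+ A R^+. A small sketch, with arbitrary unscaled column/row samples chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 5))
C = A[:, [0, 2, 4]]    # 3 sampled columns (m x c)
R = A[[1, 3], :]       # 2 sampled rows   (r x n)

# Optimal U: argmin_U ||A - C U R||_F  =>  U = pinv(C) @ A @ pinv(R)
U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
rel_err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
```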

Deterministic CUR Code by G. W. Stewart [2]
Uses a rank-revealing QR (RRQR) algorithm that does not store Q; we only need the permutation vector, which gives us the columns (rows) for C (R). Uses an optimal U.

Compact Matrix Decomposition (CMD) Improvement
Remove repeated columns (rows) in C (R): decreases storage while still achieving the same relative error [6].

Algorithm       Runtime    Storage   Relative Error
[3]             0.008060   880.5     0.820035
[3] with CMD    0.007153   550.5     0.820035

A: 50 x 30 random sparse matrix, k = 15. Average over 10 runs.
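The deduplication idea can be illustrated directly. This is a sketch of the CMD construction: collapse each repeated column to one copy scaled by the square root of its multiplicity, which preserves C C^T (and hence the quality of the decomposition) while shrinking storage:

```python
import numpy as np

def dedup_columns(C):
    """Keep one copy of each distinct column, scaled by sqrt(d)
    where d is the column's multiplicity; C @ C.T is unchanged."""
    cols, counts = np.unique(C.T, axis=0, return_counts=True)
    return (cols * np.sqrt(counts)[:, None]).T

C = np.array([[1.0, 1.0, 2.0],
              [0.0, 0.0, 3.0]])   # first two columns are duplicates
Cd = dedup_columns(C)             # stores 2 columns instead of 3
```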

CUR: Sampling with Replacement Validation A: 5 x 3 random dense matrix. Average over 5 runs. Legend: Sampling, U

Sampling without Replacement: Scaling vs. No Scaling Invert scaling factor applied to

CUR: Sampling without Replacement Validation A: 5 x 3 random dense matrix. Average over 5 runs. B: 500 x 200 random sparse matrix. Average over 5 runs. Legend: Sampling, U, Scaling

CUR Comparison B: 500 x 200 random sparse matrix. Average over 5 runs. Legend: Sampling, U, Scaling

Judging Success: Precision and Recall
A measurement of performance for document retrieval. We report average precision and recall, where the average is taken over all queries in the data set.
Let Retrieved = number of documents retrieved, Relevant = total number of documents relevant to the query, and RetRel = number of retrieved documents that are relevant.
Precision = RetRel / Retrieved
Recall = RetRel / Relevant
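The two measures translate directly to code; the document-ID lists here are hypothetical:

```python
def precision_recall(retrieved, relevant):
    """Precision = RetRel / Retrieved; Recall = RetRel / Relevant."""
    ret_rel = len(set(retrieved) & set(relevant))
    return ret_rel / len(retrieved), ret_rel / len(relevant)

# 3 documents retrieved, 4 relevant overall, 2 of the retrieved are relevant
p, r = precision_recall(retrieved=[1, 2, 3], relevant=[1, 3, 4, 5])
```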

LSI Results
Term-document matrix size: 5831 x 1033. All matrix approximations are rank 100 approximations (CUR: r = c = k). The average query time is less than 10^-3 seconds for all matrix approximations.

LSI Results Term-document matrix size: 5831 x 1033. All matrix approximations are rank 100 approximations. (CUR: r = c = k)

Matrix Approximation Results

Method          Rel. Error (F-norm)   Storage (nz)   Runtime (sec)
SVD             0.8203                686500         22.5664
NMF             0.8409                686400         23.0210
CUR: cn,lin     1.4151                17242          0.1741
CUR: cn,opt     0.9724                16358          0.2808
CUR: sub,lin    1.2093                16175          48.7651
CUR: sub,opt    0.9615                16108          49.0830
CUR: w/oR,no    0.9931                17932          0.3466
CUR: w/oR,yes   0.9957                17220          0.2734
CUR: GWS        0.9437                25020          2.2857
LTM             --                    52003          --

Conclusions
We may not be able to store an entire term-document matrix, and it may be too expensive to compute an SVD. We can achieve LSI results that are almost as good with cheaper approximations: less storage and less computation time.

Completed Project Goals
Code and validate NMF and CUR
Analyze relative error, runtime, and storage of NMF and CUR
Improve the CUR algorithm of [3]
Analyze the use of NMF and CUR in LSI

References
[1] Michael W. Berry, Murray Browne, Amy N. Langville, V. Paul Pauca, and Robert J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics and Data Analysis, 52(1):155-173, September 2007.
[2] M. W. Berry, S. A. Pulatova, and G. W. Stewart. Computing sparse reduced-rank approximations to sparse matrices. Technical Report UMIACS TR-2004-34 CMSC TR-4591, University of Maryland, May 2004.
[3] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition. SIAM Journal on Computing, 36(1):184-206, 2006.
[4] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844-881, 2008.
[5] Tamara G. Kolda and Dianne P. O'Leary. A semidiscrete matrix decomposition for latent semantic indexing in information retrieval. ACM Transactions on Information Systems, 16(4):322-346, October 1998.
[6] Jimeng Sun, Yinglian Xie, Hui Zhang, and Christos Faloutsos. Less is more: Sparse graph mining with compact matrix decomposition. Statistical Analysis and Data Mining, 1(1):6-22, February 2008.