Latent Semantic Indexing via a Semi-discrete Matrix Decomposition


Papers from the same authors with similar topics:

1. Kolda, T.G. & O'Leary, D.P. A semidiscrete matrix decomposition for latent semantic indexing information retrieval. ACM Trans. Inf. Syst., 1998, 16, 322–346.
2. Kolda, T.G. & O'Leary, D.P. Latent semantic indexing via a semi-discrete matrix decomposition. In Cybenko, G. & O'Leary, D.P. (eds.), Springer-Verlag, 1999, 107, 73–80.
3. Kolda, T.G. & O'Leary, D.P. Algorithm 805: Computation and uses of the semidiscrete matrix decomposition. ACM Trans. Math. Softw., 2000, 26, 415–435.

Vector Space Framework: Query
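The slide's own formula is not in the transcript; in the standard vector space framework, which the cited papers use, a query vector q is scored against document j (column a_j of the term-document matrix A) by the cosine measure. A standard reconstruction, not necessarily the slide's exact notation:

```latex
% Relevance score of document j for query q (standard cosine measure)
\mathrm{score}(j) \;=\; \cos\theta_j \;=\;
  \frac{a_j^{\mathsf T} q}{\lVert a_j \rVert_2 \,\lVert q \rVert_2}
```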

Weight of term in a document
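The weighting formula is likewise missing from the transcript. A common choice, assumed here and not necessarily the scheme on the slide, combines a local term frequency with a global inverse document frequency:

```latex
% a_{ij}: weight of term i in document j (assumed tf-idf form)
a_{ij} \;=\; f_{ij} \cdot \log\!\frac{N}{n_i},
\quad\text{where } f_{ij} = \text{frequency of term } i \text{ in document } j,\;
N = \text{number of documents},\; n_i = \text{documents containing term } i
```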

Motivation for using SDD: the Singular Value Decomposition (SVD) is used in Latent Semantic Indexing (LSI) to estimate the structure of word usage across documents. The idea here is to use the Semi-discrete Decomposition (SDD) instead of the SVD, to save storage space and retrieval time.

Why? Claim: the SVD has nice theoretical properties, but it likely carries more information than this application needs.

SVD vs. SDD
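Spelled out, per the cited papers, the two decompositions of the m x n term-document matrix A are:

```latex
% Truncated SVD (rank k):
A \;\approx\; A_k \;=\; U_k \,\Sigma_k\, V_k^{\mathsf T},
\qquad U_k,\ V_k \text{ with orthonormal columns},\ \Sigma_k \text{ diagonal, positive}

% Semidiscrete decomposition (SDD) with k terms:
A \;\approx\; A_k \;=\; \sum_{i=1}^{k} d_i\, x_i\, y_i^{\mathsf T},
\qquad x_i \in \{-1,0,1\}^m,\ y_i \in \{-1,0,1\}^n,\ d_i > 0
```

Restricting the vector entries to {-1, 0, 1} is what allows the 2-bit-per-element storage in the table below.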

SDD is an approximate representation of the matrix: even without truncating anything, repacking the entries into this discrete form need not reproduce the original matrix exactly. Theorems guarantee that A_k converges to the original matrix A as the number of terms k tends to infinity, though convergence can be slow; the speed of convergence depends on the starting estimate used to initialize the iterative decomposition algorithm.
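One step the slide leaves implicit is why the greedy construction converges at all: with the optimal d_k for a given pair (x_k, y_k), each added triplet provably shrinks the residual. Writing R_k = A - A_k:

```latex
d_k \;=\; \frac{x_k^{\mathsf T} R_{k-1}\, y_k}{\lVert x_k\rVert^2 \,\lVert y_k\rVert^2},
\qquad
\lVert R_k \rVert_F^2 \;=\; \lVert R_{k-1} \rVert_F^2 \;-\; d_k^2\,\lVert x_k\rVert^2\,\lVert y_k\rVert^2
```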

Result: Storage Space (approximate comparative storage for the same given rank k; totals in bytes)

                           SVD                      SDD
Size per element           double word (64 bits)    2 bits
Size per scalar value      double word (64 bits)    single word (32 bits)
TOTAL                      8k(m + n + 1)            4k + ¼ k(m + n)
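For a sense of scale (illustrative numbers, not from the slides): take m = 10,000 terms, n = 1,000 documents, and rank k = 100. Then:

```latex
\text{SVD: } 8k(m+n+1) = 8 \cdot 100 \cdot 11{,}001 \approx 8.8\ \text{MB}
\qquad
\text{SDD: } 4k + \tfrac{1}{4}k(m+n) = 400 + 275{,}000 \approx 0.28\ \text{MB}
```

That is roughly a 30-fold storage saving at the same rank.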

Medline test case

Results on Medline test case

Method for SDD: a greedy algorithm iteratively constructs the k-th triplet, d_k, x_k, and y_k.

Metrics in those papers (see papers 1–3 listed above). NOTE: all y's above are fixed; x and y are alternately fixed in each algorithm.

Greedy Algorithm

Notes on the algorithm:
- Starting vector y: every 100th element is 1 and all the others are 0.
- A_k → A as k → ∞.
- Minimizing the Frobenius norm can be simplified to finding an optimal x.
- The improvement threshold may be taken as improvement = |new − old| / old.

Finding x and d
1. Fix y.
2. Find the optimal d (as a function of x and y).
3. Eliminate d by plugging in this optimal value.
4. Solve for x.
5. Use x and y to find d.

This is simplified to (and is what the algorithm uses):
1. Fix y.
2. Find the optimal x among the m candidate x-vectors.
3. Given x and y, find d*.

There are m possible values for J (the number of nonzero entries in x); thus we only need to check m candidate x vectors to determine the optimal solution.
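To make the procedure concrete, here is a minimal NumPy sketch of the greedy SDD construction described above. It is an illustration under the slides' assumptions, not the authors' reference code (that is Algorithm 805, paper 3 above); the function names, the tolerance default, and the zero-vector guard are mine. The starting y and the improvement test follow the notes on the algorithm, and best_discrete checks the m candidate x-vectors as just described.

```python
import numpy as np

def best_discrete(s):
    """Return v in {-1,0,1}^m maximizing (v.s)^2 / (v.v): keep the J
    largest |s_i| with their signs; only m values of J need checking."""
    if not np.any(s):                    # guard: zero residual direction
        v = np.zeros_like(s)
        v[0] = 1.0
        return v
    order = np.argsort(-np.abs(s))       # indices by decreasing |s_i|
    csum = np.cumsum(np.abs(s)[order])   # numerator for each candidate J
    J = int(np.argmax(csum**2 / np.arange(1, s.size + 1))) + 1
    v = np.zeros_like(s)
    v[order[:J]] = np.sign(s[order[:J]])
    return v

def sdd(A, k, max_inner=100, tol=0.01):
    """Greedy SDD: A ~ sum_i d_i * outer(x_i, y_i) with discrete x_i, y_i."""
    m, n = A.shape
    R = A.astype(float).copy()           # residual A - A_{i-1}
    ds, xs, ys = [], [], []
    for _ in range(k):
        y = np.zeros(n)
        y[::100] = 1.0                   # start: every 100th element is 1
        old = None
        for _ in range(max_inner):
            x = best_discrete(R @ y)     # fix y, find optimal x
            y = best_discrete(R.T @ x)   # fix x, find optimal y
            d = (x @ R @ y) / ((x @ x) * (y @ y))   # optimal d*
            new = d * d * (x @ x) * (y @ y)  # reduction in ||R||_F^2
            if old and abs(new - old) / old < tol:  # improvement test
                break
            old = new
        R -= d * np.outer(x, y)          # peel off the i-th triplet
        ds.append(d); xs.append(x); ys.append(y)
    return np.array(ds), np.column_stack(xs), np.column_stack(ys)
```

The rank-k approximation is then recovered as A_k = sum_i ds[i] * outer(X[:, i], Y[:, i]); only the {-1, 0, 1} entries of X and Y plus the k scalars in ds need storing, which is where the 2-bit-per-element figure in the storage table comes from.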