
Latent Semantic Indexing: A Probabilistic Analysis
Christos Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, Santosh Vempala

Motivation
Applications in several areas:
– querying
– clustering, identifying topics
– other: synonym recognition (e.g. the TOEFL synonym test), psychology tests, essay scoring

Motivation
Latent Semantic Indexing is:
– Latent: captures associations which are not explicit
– Semantic: represents meaning as a function of similarity to other entities
– Cool: lots of spiffy applications, and the potential for some good theory too

Overview
– IR and two classical problems
– How LSI works
– Why LSI is effective: a probabilistic analysis

Information Retrieval
Text corpus with many documents (docs); given a query, find the relevant docs.
Classical problems:
– synonymy: missing docs that refer to "automobile" when querying on "car"
– polysemy: retrieving docs about the Internet when querying on "surfing"
Solution: represent docs (and queries) by their underlying latent concepts

Information Retrieval
– Represent each document as a word vector
– Represent the corpus as a term-document matrix (T-D matrix)
A classical method (see the sketch below):
– Create a new vector from the query terms
– Find the documents with the highest dot product
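As a rough illustration of this classical method, here is a minimal NumPy sketch; the toy corpus, the term list, and the retrieve helper are invented for illustration and are not from the paper:

import numpy as np

# Toy term-document matrix A: rows are terms, columns are docs (raw counts).
terms = ["car", "automobile", "engine", "surfing", "internet"]
A = np.array([[2, 0, 1, 0],   # car
              [0, 3, 1, 0],   # automobile
              [1, 2, 0, 0],   # engine
              [0, 0, 0, 2],   # surfing
              [0, 0, 1, 3]])  # internet

def retrieve(query_terms):
    # Classical retrieval: build a query vector over the same terms,
    # then rank the documents by dot product with their column vectors.
    q = np.array([query_terms.count(t) for t in terms])
    scores = q @ A                    # one dot product per document
    return np.argsort(-scores), scores

ranking, scores = retrieve(["car"])
# Note the synonymy problem: the second document mentions only "automobile",
# so it scores 0 for the query "car" despite being relevant.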

Document vector space

Latent Semantic Indexing (LSI)
– Process the term-document (T-D) matrix to expose statistical structure
– Convert the high-dimensional space to a lower-dimensional space: throw out noise, keep the good stuff
– Related to principal component analysis (PCA) and multidimensional scaling (MDS)

Parameters
U = universe of terms
n = number of terms
m = number of docs
A = n x m matrix with rank r
– columns represent docs
– rows represent terms

Singular Value Decomposition (SVD)
LSI uses the SVD, a linear-algebraic decomposition of A:
A = U D V^T

SVD
– r is the rank of A
– D is the r x r diagonal matrix of the r singular values
– U (n x r) and V (m x r) are matrices with orthonormal columns
– The SVD always exists, and numerical methods for computing it are well established
– Run time: O(m n c), where c denotes the average number of words per document
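A minimal NumPy sketch of the decomposition, reusing the toy matrix A from the earlier sketch (numpy.linalg.svd returns V already transposed):

import numpy as np

U, singular_values, Vt = np.linalg.svd(A, full_matrices=False)
D = np.diag(singular_values)

# A is reconstructed exactly, and U (like V) has orthonormal columns.
assert np.allclose(A, U @ D @ Vt)
assert np.allclose(U.T @ U, np.eye(U.shape[1]))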

T-D Matrix Approximation
Keep only the k largest singular values: A_k = U_k D_k V_k^T, the best rank-k approximation to A.
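Continuing the same sketch, the rank-k truncation (k = 2 is an arbitrary illustrative choice):

# Keep only the k largest singular values and the corresponding singular
# vectors; A_k is the best rank-k approximation to A in the Frobenius norm.
k = 2
U_k, D_k, Vt_k = U[:, :k], D[:k, :k], Vt[:k, :]
A_k = U_k @ D_k @ Vt_k

# The approximation error is exactly the discarded singular values.
err = np.linalg.norm(A - A_k, "fro")
assert np.isclose(err, np.sqrt((singular_values[k:] ** 2).sum()))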

Synonymy
LSI is used in several ways, e.g. detecting synonymy (see the sketch below).
A measure of similarity for two terms t_i and t_j:
– In the original space: the dot product of rows i and j of A (the (i, j) entry of A A^T)
– Better: the dot product of rows i and j of A_k (the (i, j) entry of A_k A_k^T)
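Continuing the sketch, the two similarity measures side by side; whether "car" and "automobile" actually move closer together depends on the toy data and on k, so the printed numbers are only illustrative:

# Entry (i, j) of each matrix is the dot product of the rows for terms i and j.
sim_original = A @ A.T       # original space: A A^T
sim_reduced = A_k @ A_k.T    # reduced space:  A_k A_k^T

i, j = terms.index("car"), terms.index("automobile")
print("original space:", sim_original[i, j])
print("rank-k space:  ", float(sim_reduced[i, j]))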

“Semantic” Space

Synonymy (intuition)
– Consider the term-term autocorrelation matrix A A^T
– If two terms co-occur (e.g. supply and demand), we get nearly identical rows
– The difference of these two rows is a direction with a small eigenvalue of A A^T
– That eigenvector will likely be projected out in A_k, since it corresponds to a weak eigenvalue, so the two terms collapse onto (nearly) the same direction
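A tiny numerical illustration of this intuition; the two-term toy matrix is invented, with "supply" and "demand" co-occurring in almost every document:

import numpy as np

# Rows: "supply" and "demand" over five documents; they co-occur everywhere
# except one document, so the two rows are nearly identical.
B = np.array([[3., 1., 2., 0., 1.],   # supply
              [3., 1., 2., 1., 1.]])  # demand

eigvals, eigvecs = np.linalg.eigh(B @ B.T)
print(eigvals)   # one large eigenvalue, one close to zero
# The eigenvector of the near-zero eigenvalue points roughly along
# (supply - demand); rank-1 LSI projects this direction out, making the
# two terms almost interchangeable in the reduced space.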

A Performance Evaluation
Landauer & Dumais:
– Perform LSI on 30,000 encyclopedia articles
– Take the synonym test from the TOEFL
– Choose the most similar word
– LSI score: 52.2% (corrected for guessing); people: 52.7% (corrected for guessing)
– Correlated .44 with the incorrect alternatives

A Probabilistic Analysis: Overview
The model:
– Topics are sufficiently disjoint
– Each doc is drawn from a single (random) topic
Result, with high probability (whp):
– Docs from the same topic will be similar
– Docs from different topics will be dissimilar

The Probabilistic Model
– k topics, each corresponding to a set of words
– The sets are mutually disjoint
– Below, all random choices are made uniformly at random
– A corpus of m docs, each doc created as follows:

The Probabilistic Model (cont.)
Choosing a doc (see the sketch below):
– choose the length l of the doc
– choose a topic T
– repeat l times:
  – with probability 1 - ε, choose a word from topic T
  – with probability ε, choose a word from the other topics
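A minimal sketch of this generative process; the number of topics, vocabulary sizes, document lengths, and the mixing probability ε are made-up illustrative values, and the sample_doc helper is not from the paper:

import numpy as np

rng = np.random.default_rng(0)

k, words_per_topic = 3, 50                 # k disjoint topics (toy sizes)
topics = [list(range(t * words_per_topic, (t + 1) * words_per_topic))
          for t in range(k)]
eps = 0.05                                 # probability of an off-topic word

def sample_doc(length):
    topic = rng.integers(k)                # choose a topic uniformly
    words = []
    for _ in range(length):
        if rng.random() < 1 - eps:         # word from the chosen topic
            words.append(int(rng.choice(topics[topic])))
        else:                              # word from one of the other topics
            other = (topic + 1 + rng.integers(k - 1)) % k
            words.append(int(rng.choice(topics[other])))
    return topic, words

corpus = [sample_doc(int(rng.integers(50, 100))) for _ in range(200)]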

Set up
Let v_d be the vector assigned to doc d by the rank-k LSI performed on the corpus.
The rank-k LSI is δ-skewed if dot products between vectors of docs from the same topic are much larger than dot products between vectors of docs from different topics.
(Intuition) Docs from the same topic should be similar (high dot product), docs from different topics dissimilar.

The Result
Theorem: Assume the corpus is created from the model just described (k topics, etc.). Then the rank-k LSI is δ-skewed with high probability (probability tending to 1 as the corpus grows).
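A rough empirical check of this statement, continuing the generative sketch above. The document vectors are taken to be the columns of the rank-k approximation A_k (their dot products equal those of the usual LSI representation), and "skewed" is checked only informally by comparing average same-topic and cross-topic dot products:

# Term-document matrix of the sampled corpus.
n_terms = k * words_per_topic
A = np.zeros((n_terms, len(corpus)))
labels = []
for j, (topic, words) in enumerate(corpus):
    labels.append(topic)
    for w in words:
        A[w, j] += 1

# Rank-k LSI on the corpus.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

same, diff = [], []
for i in range(len(corpus)):
    for j in range(i + 1, len(corpus)):
        dot = A_k[:, i] @ A_k[:, j]
        (same if labels[i] == labels[j] else diff).append(dot)

print("average same-topic dot product: ", np.mean(same))
print("average cross-topic dot product:", np.mean(diff))
# With high probability the first average dwarfs the second.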

Proof Sketch
Show that with k topics we obtain k orthogonal subspaces:
– First assume strictly disjoint topics (ε = 0): show that whp the k highest eigenvalues of A A^T indeed correspond to the k topics (are not intra-topic)
– Then relax to ε > 0 using a matrix perturbation analysis

Extensions
Theory should (ideally) go beyond explaining.
Potential for speed-up (see the sketch below):
– project the doc vectors onto a suitably small space (e.g. a random subspace)
– perform LSI on this smaller space
Yields O(m(n + c log n)) compared to O(m n c)
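A hedged sketch of this two-step idea in NumPy; the Gaussian random projection and the target dimension are illustrative choices (the paper's analysis specifies how small the projected space can be), and two_step_lsi is an invented helper name:

import numpy as np

def two_step_lsi(A, k, proj_dim, seed=0):
    # Step 1: randomly project the n-dimensional doc vectors (columns of A)
    # down to proj_dim dimensions.
    rng = np.random.default_rng(seed)
    n, m = A.shape
    R = rng.normal(size=(proj_dim, n)) / np.sqrt(proj_dim)
    A_small = R @ A                              # proj_dim x m, much smaller
    # Step 2: rank-k LSI on the projected matrix.
    U, s, Vt = np.linalg.svd(A_small, full_matrices=False)
    return np.diag(s[:k]) @ Vt[:k, :]            # k-dimensional doc vectors

# Usage (continuing the corpus sketch above):
# doc_vectors = two_step_lsi(A, k=3, proj_dim=50)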

Future Work
Learn more abstract algebra (math)!
Extensions:
– docs spanning multiple topics?
– polysemy?
– other positive properties?
Another important role of theory:
– unify and generalize: spectral analysis has found applications elsewhere in IR