1
Latent Semantic Indexing: A Probabilistic Analysis
Christos Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, Santosh Vempala
2
Motivation
Applications in several areas:
–querying
–clustering, identifying topics
–other: synonym recognition (e.g. the TOEFL synonym test), psychology tests, essay scoring
3
Motivation
Latent Semantic Indexing is
–Latent: captures associations which are not explicit
–Semantic: represents meaning as a function of similarity to other entities
–Cool: lots of spiffy applications, and the potential for some good theory too
4
Overview
–IR and two classical problems
–How LSI works
–Why LSI is effective: a probabilistic analysis
5
Information Retrieval
Text corpus with many documents (docs)
Given a query, find the relevant docs
Classical problems:
–synonymy: missing docs that refer to “automobile” when querying on “car”
–polysemy: retrieving docs on the internet when querying on “surfing”
Solution: represent docs (and queries) by their underlying latent concepts
6
Information Retrieval
Represent each document as a word vector
Represent the corpus as a term-document matrix (T-D matrix)
A classical method:
–create a new vector from the query terms
–find the documents with the highest dot product
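A minimal sketch of this classical vector-space step, in Python with numpy (not mentioned in the slides); the toy corpus and query are invented for illustration.

```python
# Build a term-document matrix and rank documents by dot product with the query.
import numpy as np

docs = [
    "car engine repair",
    "automobile dealership prices",
    "surfing the internet safely",
]
query = "car prices"

# Vocabulary and the n x m term-document matrix A (raw term counts).
terms = sorted({w for d in docs for w in d.split()})
index = {t: i for i, t in enumerate(terms)}
A = np.zeros((len(terms), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[index[w], j] += 1

# Represent the query as a term vector and score each document.
q = np.zeros(len(terms))
for w in query.split():
    if w in index:
        q[index[w]] += 1
scores = q @ A
print(sorted(enumerate(scores), key=lambda p: -p[1]))
```

Note that this purely lexical scoring misses the “automobile” document for the query “car”, which is exactly the synonymy problem described above.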
7
Document vector space
8
Latent Semantic Indexing (LSI)
Process the term-document (T-D) matrix to expose statistical structure
Convert the high-dimensional space to a lower-dimensional space: throw out the noise, keep the good stuff
Related to principal component analysis (PCA) and multidimensional scaling (MDS)
9
Parameters
U = universe of terms
n = number of terms
m = number of docs
A = n x m matrix with rank r
–columns represent docs
–rows represent terms
10
Singular Value Decomposition (SVD)
LSI uses SVD, a linear-algebraic analysis method: A = U D V^T
11
SVD
–r is the rank of A
–D: diagonal matrix of the r singular values
–U and V: matrices composed of orthonormal columns
–SVD is always possible
–numerical methods for SVD exist
–run time: O(m n c), where c denotes the average number of words per document
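A small sketch of the decomposition and its rank-k truncation, using numpy (an assumption, not mentioned in the slides); the matrix here is a random stand-in for a real term-document matrix.

```python
# SVD step: A = U D V^T, truncated to rank k to obtain A_k.
import numpy as np

np.random.seed(0)
A = np.random.rand(6, 4)                       # toy n x m term-document matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                          # number of latent "concepts" to keep
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # best rank-k approximation of A

# Documents (and queries) can now be compared in the k-dimensional latent space.
doc_latent = np.diag(s[:k]) @ Vt[:k, :]        # k x m matrix of document vectors
```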
12
T-D Matrix Approximation
13
Synonymy
LSI is used in several ways: e.g. detecting synonymy
A measure of similarity for two terms i and j:
–in the original space: dot product of rows i and j of A (the (i, j) entry of A A^T)
–better: dot product of rows i and j of A_k (the (i, j) entry of A_k A_k^T)
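The sketch below contrasts the two similarity measures on a toy matrix; numpy and the random data are illustrative assumptions, not part of the original slides.

```python
# (i, j) entries of A A^T (original space) versus A_k A_k^T (latent space).
import numpy as np

np.random.seed(1)
A = np.random.rand(8, 5)                       # toy term-document matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

raw_sim = A @ A.T                              # original-space term-term similarities
latent_sim = A_k @ A_k.T                       # smoothed similarities after rank-k LSI

i, j = 0, 3                                    # any pair of term indices
print(raw_sim[i, j], latent_sim[i, j])
```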
14
“Semantic” Space
15
Synonymy (intuition)
Consider the term-term autocorrelation matrix A A^T
If two terms co-occur (e.g. supply-demand) we get nearly identical rows
This yields a small eigenvalue of A A^T in the direction separating the two terms
That eigenvector will likely be projected out in A_k, as it corresponds to a weak eigenvalue
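A toy numerical illustration of this intuition; the specific co-occurrence counts are invented.

```python
# Two terms that always co-occur ("supply", "demand") give nearly identical rows,
# so A A^T has one strong shared direction and only a weak direction separating them.
import numpy as np

# rows: supply, demand, some other term; columns: docs (term counts)
A = np.array([
    [3.0, 2.0, 0.0, 4.0],   # "supply"
    [3.0, 2.0, 1.0, 4.0],   # "demand" (almost the same row)
    [0.0, 1.0, 5.0, 0.0],   # unrelated term
])

eigvals, eigvecs = np.linalg.eigh(A @ A.T)
print(eigvals)   # the smallest eigenvalue is tiny; its eigenvector (roughly the
                 # difference of the supply/demand rows) gets dropped in A_k
```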
16
A Performance Evaluation
Landauer & Dumais:
–perform LSI on 30,000 encyclopedia articles
–take the synonym test from TOEFL
–choose the most similar word
LSI: 64.4% (52.2% corrected for guessing)
People: 64.5% (52.7% corrected for guessing)
Correlated .44 with incorrect alternatives
17
A Probabilistic Analysis: Overview
The model:
–topics are sufficiently disjoint
–each doc is drawn from a single (random) topic
Result: with high probability (whp),
–docs from the same topic will be similar
–docs from different topics will be dissimilar
18
The Probabilistic Model
k topics, each corresponding to a set of words
The sets are mutually disjoint
Below, all random choices are made uniformly at random
A corpus of m docs, each doc created as follows...
19
The Probabilistic Model (cont.)
Choosing a doc:
–choose the length ℓ of the doc
–choose a topic T
–repeat ℓ times:
–with probability 1 − ε, choose a word from topic T
–with probability ε, choose a word from the other topics
(a sketch of this generative process follows below)
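A rough sketch of the generative process, with illustrative choices for the number of topics, vocabulary sizes, document length, and the off-topic probability (called eps here); none of these constants come from the paper.

```python
# Generate documents: pick a topic, then draw each word from that topic's
# (disjoint) vocabulary with probability 1 - eps, otherwise from another topic.
import random

random.seed(0)
topics = [
    [f"t{t}_w{w}" for w in range(50)]          # k disjoint word sets
    for t in range(3)
]
eps = 0.1

def sample_doc(length=30):
    t = random.randrange(len(topics))          # choose a topic uniformly
    words = []
    for _ in range(length):
        if random.random() < 1 - eps:
            words.append(random.choice(topics[t]))           # word from own topic
        else:
            other = random.choice([s for i, s in enumerate(topics) if i != t])
            words.append(random.choice(other))               # word from another topic
    return t, words

corpus = [sample_doc() for _ in range(100)]
```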
20
Set-up
Let v_d be the vector assigned to doc d by the rank-k LSI performed on the corpus.
The rank-k LSI is δ-skewed if the vectors assigned to docs from different topics are nearly orthogonal: |v_d · v_d'| ≤ δ |v_d| |v_d'| whenever d and d' come from different topics
(intuition) Docs from the same topic should be similar (high dot product); docs from different topics should not be
21
The Result
Theorem: Assume the corpus is created from the model just described (k topics, etc.). Then, with high probability, the rank-k LSI is O(ε)-skewed.
(an empirical illustration of this claim is sketched below)
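One can check the spirit of the claim empirically. The sketch below assumes the `corpus` generated in the earlier model sketch, runs rank-k LSI, and measures the largest cross-topic cosine among the LSI document vectors; it is only an illustration, not the paper's argument.

```python
# Measure how close cross-topic LSI document vectors are to orthogonal.
import numpy as np

terms = sorted({w for _, doc in corpus for w in doc})
index = {t: i for i, t in enumerate(terms)}
A = np.zeros((len(terms), len(corpus)))
for j, (_, doc) in enumerate(corpus):
    for w in doc:
        A[index[w], j] += 1

k = 3                                          # one dimension per topic
U, s, Vt = np.linalg.svd(A, full_matrices=False)
V_k = (np.diag(s[:k]) @ Vt[:k, :]).T           # one k-dim vector per document
V_k /= np.linalg.norm(V_k, axis=1, keepdims=True)

labels = np.array([t for t, _ in corpus])
cross = [abs(V_k[i] @ V_k[j])
         for i in range(len(corpus)) for j in range(i)
         if labels[i] != labels[j]]
print("max cross-topic |cosine|:", max(cross))  # small means nearly orthogonal
```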
22
Proof Sketch
Show that with k topics, we obtain k orthogonal subspaces
–assume strictly disjoint topics (ε = 0): show that whp the k highest eigenvalues of A A^T indeed correspond to the k topics (are not intra-topic)
–(ε > 0): relax this by using a matrix perturbation analysis
23
Extensions
Theory should go beyond explaining (ideally)
Potential for speed-up:
–project the doc vectors onto a suitably small space
–perform LSI on this space
Yields O(m(n + c log n)) compared to O(mnc)
(see the sketch below)
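A sketch of one way to realize this speed-up, using a random projection before the SVD; the reduced dimension chosen here is illustrative rather than the bound from the analysis.

```python
# Project n-dimensional document vectors to a much smaller random subspace,
# then run rank-k LSI in that subspace.
import numpy as np

np.random.seed(2)
n, m, k = 5000, 300, 10
A = np.random.rand(n, m)                       # stand-in term-document matrix

d = 200                                        # reduced dimension, d << n
R = np.random.randn(d, n) / np.sqrt(d)         # random projection matrix
B = R @ A                                      # d x m projected document vectors

U, s, Vt = np.linalg.svd(B, full_matrices=False)
doc_latent = np.diag(s[:k]) @ Vt[:k, :]        # rank-k LSI in the projected space
```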
24
Future Work
Learn more abstract algebra (math)!
Extensions:
–docs spanning multiple topics?
–polysemy?
–other positive properties?
Another important role of theory:
–unify and generalize: spectral analysis has found applications elsewhere in IR