1 Latent Semantic Indexing: A Probabilistic Analysis Christos Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, Santosh Vempala

2 Motivation Applications in several areas: –querying –clustering, identifying topics –other: synonym recognition (e.g. the TOEFL synonym test), psychology, essay scoring

3 Motivation Latent Semantic Indexing is –Latent: Captures associations which are not explicit –Semantic: Represents meaning as a function of similarity to other entities –Cool: Lots of spiffy applications, and the potential for some good theory too

4 Overview IR and two classical problems How LSI works Why LSI is effective: A probabilistic analysis

5 Information Retrieval Text corpus with many documents (docs) Given a query, find relevant docs Classical problems: –synonymy: missing docs that refer to “automobile” when querying on “car” –polysemy: retrieving docs about the Internet when querying on “surfing” Solution: Represent docs (and queries) by their underlying latent concepts

6 Information Retrieval Represent each document as a word vector Represent corpus as term-document matrix (T-D matrix) A classical method: –Create new vector from query terms –Find documents with highest dot-product
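A minimal sketch of this classical scheme (Python/NumPy, not part of the slides): build a toy term-document matrix from raw counts and rank documents by dot product with the query vector. The corpus, vocabulary, and weighting are illustrative assumptions.

import numpy as np

# toy corpus: each string is one document
docs = ["car engine repair", "automobile engine", "surfing the internet", "surfing waves ocean"]
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

# n x m term-document matrix A: A[i, j] = count of term i in document j
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[index[w], j] += 1

def retrieve(query):
    # score each document by the dot product of its column with the query vector
    q = np.zeros(len(vocab))
    for w in query.split():
        if w in index:
            q[index[w]] += 1
    scores = q @ A                      # one score per document
    return np.argsort(-scores), scores

ranking, scores = retrieve("car")
print(ranking, scores)   # the "automobile" doc scores 0 here -- the synonymy problem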

7 Document vector space

8 Latent Semantic Indexing (LSI) Process the term-document (T-D) matrix to expose statistical structure Convert the high-dimensional space to a lower-dimensional space, throw out noise, keep the good stuff Related to principal component analysis (PCA) and multidimensional scaling (MDS)

9 Parameters U = universe of terms n = number of terms m = number of docs A = n x m matrix with rank r –columns represent docs –rows represent terms

10 Singular Value Decomposition (SVD) LSI uses SVD, a method from linear algebra: A = U D V^T

11 SVD r is the rank of A D is a diagonal matrix of the r singular values U and V are matrices with orthonormal columns SVD is always possible; numerical methods for computing it exist run time: O(mnc), where c denotes the average number of words per document

12 T-D Matrix Approximation
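A sketch of the rank-k truncation this slide depicts, reusing the matrix A from the snippet above; NumPy's SVD returns A = U diag(s) V^T, and the choice k = 2 is arbitrary.

import numpy as np

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                          # number of latent "concepts" to keep
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # best rank-k approximation of A

print("rank of A:", np.linalg.matrix_rank(A))
print("approximation error:", np.linalg.norm(A - A_k))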

13 Synonymy LSI used in several ways: e.g. detecting synonymy A measure of similarity for two terms i and j: –In the original space: dot product of rows i and j of A (the (i, j) entry of A A^T) –Better: dot product of rows i and j of A_k (the (i, j) entry of A_k A_k^T)
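A small sketch of these two similarity measures, assuming A, A_k, and the index mapping from the earlier snippets; the word pair is chosen for illustration only.

import numpy as np

def term_similarity(M, w1, w2):
    # similarity of terms w1, w2 = the corresponding entry of M M^T
    i, j = index[w1], index[w2]
    return (M @ M.T)[i, j]

# "car" and "automobile" never co-occur, so their raw-space similarity is 0;
# in the reduced space they can become similar via shared context words like "engine"
print(term_similarity(A,   "car", "automobile"))
print(term_similarity(A_k, "car", "automobile"))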

14 “Semantic” Space

15 Synonymy (intuition) Consider the term-term autocorrelation matrix A A^T If two terms frequently co-occur (e.g. supply-demand) their rows in A are nearly identical The difference of those rows corresponds to a small eigenvalue of A A^T That direction will likely be projected out in A_k, since it carries a weak eigenvalue, so the two terms end up with nearly the same representation

16 A Performance Evaluation Landauer & Dumais –Perform LSI on 30,000 encyclopedia articles –Take synonym test from TOEFL –Choose most similar word LSI - 64.4% (52.2% corrected for guessing) People - 64.5% (52.7% corrected for guessing) Correlated .44 with incorrect alternatives

17 A Probabilistic Analysis: Overview The model: –Topics sufficiently disjoint –Each doc drawn from a single (random) topic Result: –With high probability (whp): Docs from the same topic will be similar Docs from different topics will be dissimilar

18 The Probabilistic Model K topics, each corresponding to a set of words The sets are mutually disjoint Below, all random choices are made uniformly at random A corpus of m docs, each doc created as follows...

19 The Probabilistic Model (cont.) choosing a doc: –choose the length ℓ of the doc –choose a topic T –Repeat ℓ times: With prob. p choose a word from topic T With prob. 1 - p choose a word from the other topics
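A rough simulation of this generative model (the topics, corpus size, and in-topic probability p = 0.9 below are assumptions; the slide's actual parameter symbols were lost in transcription).

import numpy as np

rng = np.random.default_rng(0)

topics = [
    ["car", "engine", "wheel", "road"],        # topic 0
    ["stock", "market", "trade", "price"],     # topic 1
    ["gene", "cell", "protein", "dna"],        # topic 2
]
p = 0.9            # assumed prob. of drawing a word from the doc's own topic

def sample_doc(length):
    t = rng.integers(len(topics))              # choose a topic uniformly at random
    words = []
    for _ in range(length):
        if rng.random() < p:
            words.append(rng.choice(topics[t]))                              # word from own topic
        else:
            other = [w for i, tp in enumerate(topics) if i != t for w in tp]
            words.append(rng.choice(other))                                  # word from other topics
    return t, words

corpus = [sample_doc(rng.integers(5, 15)) for _ in range(8)]
for t, words in corpus:
    print(t, words)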

20 Set up Let v_d be the vector assigned to doc d by the rank-k LSI performed on the corpus. The rank-k LSI is ε-skewed if, for docs d, d' from different topics, |v_d · v_d'| ≤ ε |v_d| |v_d'| (intuition) Docs from the same topic should be similar (high dot product), docs from different topics should be nearly orthogonal

21 The Result Theorem: Assume the corpus is created from the model just described (k topics, etc.). Then the rank-k LSI is ε-skewed with high probability
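A quick empirical check of what the theorem asserts, using the corpus sampled in the previous snippet: run rank-k LSI and compare within-topic vs. cross-topic cosine similarities of the document vectors. Taking the columns of D_k V_k^T as document vectors is one common convention, assumed here.

import numpy as np

all_words = sorted({w for _, ws in corpus for w in ws})
widx = {w: i for i, w in enumerate(all_words)}
A2 = np.zeros((len(all_words), len(corpus)))
for j, (_, ws) in enumerate(corpus):
    for w in ws:
        A2[widx[w], j] += 1

U2, s2, Vt2 = np.linalg.svd(A2, full_matrices=False)
k = len(topics)
docvecs = (np.diag(s2[:k]) @ Vt2[:k, :]).T     # one LSI vector per document

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

labels = [t for t, _ in corpus]
pairs = [(i, j) for i in range(len(corpus)) for j in range(i + 1, len(corpus))]
same = [cos(docvecs[i], docvecs[j]) for i, j in pairs if labels[i] == labels[j]]
diff = [cos(docvecs[i], docvecs[j]) for i, j in pairs if labels[i] != labels[j]]
print("same-topic mean cosine:", np.mean(same) if same else "n/a")
print("cross-topic mean cosine:", np.mean(diff) if diff else "n/a")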

22 Proof Sketch Show that with k topics we obtain k orthogonal subspaces –Assume strictly disjoint topics: show that whp the k highest eigenvalues of A A^T indeed correspond to the k topics (are not intra-topic) –Then relax this assumption using a matrix perturbation analysis

23 Extensions Theory should (ideally) go beyond explaining Potential for speed-up: –project the doc vectors onto a suitably small random subspace –perform LSI on this space Yields O(m(n + c log n)) compared to O(mnc)
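A sketch of this speed-up idea: apply a random projection to the term dimension before the SVD, reusing A from the first snippet. The projection dimension d below is an arbitrary illustrative choice, not the bound from the paper.

import numpy as np

rng = np.random.default_rng(1)
n, m = A.shape                                 # terms x docs
d = min(n, 50)                                 # assumed target dimension (tiny toy corpus, so no real reduction here)

R = rng.standard_normal((d, n)) / np.sqrt(d)   # random projection matrix
B = R @ A                                      # d x m projected term-document matrix

U_b, s_b, Vt_b = np.linalg.svd(B, full_matrices=False)
k = 2
doc_vecs = np.diag(s_b[:k]) @ Vt_b[:k, :]      # low-dimensional doc representations
print(doc_vecs.shape)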

24 Future work Learn more abstract algebra (math)! Extensions: –docs spanning multiple topics? –polysemy? –other positive properties? Another important role of theory: –Unify and generalize: spectral analysis has found applications elsewhere in IR

