1 Latent Semantic Indexing: A Probabilistic Analysis Christos Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, Santosh Vempala

2 Motivation Applications in several areas: –querying –clustering, identifying topics –other: synonym recognition (e.g. the TOEFL synonym test), psychology, essay scoring

3 Motivation Latent Semantic Indexing is –Latent: Captures associations which are not explicit –Semantic: Represents meaning as a function of similarity to other entities –Cool: Lots of spiffy applications, and the potential for some good theory too

4 Overview IR and two classical problems How LSI works Why LSI is effective: A probabilistic analysis

5 Information Retrieval Text corpus with many documents (docs) Given a query, find relevant docs Classical problems: –synonymy: missing docs that refer to “automobile” when querying on “car” –polysemy: retrieving docs about the Internet when querying on “surfing” Solution: Represent docs (and queries) by their underlying latent concepts

6 Information Retrieval Represent each document as a word vector Represent corpus as term-document matrix (T-D matrix) A classical method: –Create new vector from query terms –Find documents with highest dot-product
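A minimal sketch of this classical scheme (Python/NumPy, not part of the slides): build a toy term-document matrix from raw counts and rank documents by dot product with the query vector. The corpus, vocabulary, and weighting are illustrative assumptions.

import numpy as np

# toy corpus: each string is one document
docs = ["car engine repair", "automobile engine", "surfing the internet", "surfing waves ocean"]
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

# n x m term-document matrix A: A[i, j] = count of term i in document j
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[index[w], j] += 1

def retrieve(query):
    # score each document by the dot product of its column with the query vector
    q = np.zeros(len(vocab))
    for w in query.split():
        if w in index:
            q[index[w]] += 1
    scores = q @ A                      # one score per document
    return np.argsort(-scores), scores

ranking, scores = retrieve("car")
print(ranking, scores)   # the "automobile" doc scores 0 here -- the synonymy problem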

7 Document vector space

8 Latent Semantic Indexing (LSI) Process the term-document (T-D) matrix to expose statistical structure Convert the high-dimensional space to a lower-dimensional space, throw out noise, keep the good stuff Related to principal component analysis (PCA) and multidimensional scaling (MDS)

9 Parameters U = universe of terms n = number of terms m = number of docs A = n x m matrix with rank r –columns represent docs –rows represent terms

10 Singular Value Decomposition (SVD) LSI uses SVD, a method from linear algebra: A = U D V^T

11 SVD r is the rank of A D is a diagonal matrix of the r singular values U and V are matrices with orthonormal columns SVD is always possible; numerical methods for computing it exist run time: O(mnc), where c denotes the average number of words per document

12 T-D Matrix Approximation
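A sketch of the rank-k truncation this slide depicts, reusing the matrix A from the snippet above; NumPy's SVD returns A = U diag(s) V^T, and the choice k = 2 is arbitrary.

import numpy as np

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                          # number of latent "concepts" to keep
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # best rank-k approximation of A

print("rank of A:", np.linalg.matrix_rank(A))
print("approximation error:", np.linalg.norm(A - A_k))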

13 Synonymy LSI used in several ways: e.g. detecting synonymy A measure of similarity for two terms i and j: –In the original space: dot product of rows i and j of A (the (i, j) entry of A A^T) –Better: dot product of rows i and j of A_k (the (i, j) entry of A_k A_k^T)
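A small sketch of these two similarity measures, assuming A, A_k, and the index mapping from the earlier snippets; the word pair is chosen for illustration only.

import numpy as np

def term_similarity(M, w1, w2):
    # similarity of terms w1, w2 = the corresponding entry of M M^T
    i, j = index[w1], index[w2]
    return (M @ M.T)[i, j]

# "car" and "automobile" never co-occur, so their raw-space similarity is 0;
# in the reduced space they can become similar via shared context words like "engine"
print(term_similarity(A,   "car", "automobile"))
print(term_similarity(A_k, "car", "automobile"))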

14 “Semantic” Space

15 Synonymy (intuition) Consider the term-term autocorrelation matrix A A^T If two terms frequently co-occur (e.g. supply-demand) their rows in A are nearly identical The difference of those rows corresponds to a small eigenvalue of A A^T That direction will likely be projected out in A_k, since it carries a weak eigenvalue, so the two terms end up with nearly the same representation

16 A Performance Evaluation Landauer & Dumais –Perform LSI on 30,000 encyclopedia articles –Take synonym test from TOEFL –Choose most similar word LSI - 64.4% (52.2% corrected for guessing) People - 64.5% (52.7% corrected for guessing) Correlated .44 with incorrect alternatives

17 A Probabilistic Analysis: Overview The model: –Topics sufficiently disjoint –Each doc drawn from a single (random) topic Result: –With high probability (whp): Docs from the same topic will be similar Docs from different topics will be dissimilar

18 The Probabilistic Model K topics, each corresponding to a set of words The sets are mutually disjoint Below, all random choices are made uniformly at random A corpus of m docs, each doc created as follows...

19 The Probabilistic Model (cont.) choosing a doc: –choose the length ℓ of the doc –choose a topic T –Repeat ℓ times: With prob. p choose a word from topic T With prob. 1 - p choose a word from the other topics
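A rough simulation of this generative model (the topics, corpus size, and in-topic probability p = 0.9 below are assumptions; the slide's actual parameter symbols were lost in transcription).

import numpy as np

rng = np.random.default_rng(0)

topics = [
    ["car", "engine", "wheel", "road"],        # topic 0
    ["stock", "market", "trade", "price"],     # topic 1
    ["gene", "cell", "protein", "dna"],        # topic 2
]
p = 0.9            # assumed prob. of drawing a word from the doc's own topic

def sample_doc(length):
    t = rng.integers(len(topics))              # choose a topic uniformly at random
    words = []
    for _ in range(length):
        if rng.random() < p:
            words.append(rng.choice(topics[t]))                              # word from own topic
        else:
            other = [w for i, tp in enumerate(topics) if i != t for w in tp]
            words.append(rng.choice(other))                                  # word from other topics
    return t, words

corpus = [sample_doc(rng.integers(5, 15)) for _ in range(8)]
for t, words in corpus:
    print(t, words)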

20 Set up Let v_d be the vector assigned to doc d by the rank-k LSI performed on the corpus. The rank-k LSI is ε-skewed if, for docs d, d' from different topics, |v_d · v_d'| ≤ ε |v_d| |v_d'| (intuition) Docs from the same topic should be similar (high dot product), docs from different topics should be nearly orthogonal

21 The Result Theorem: Assume the corpus is created from the model just described (k topics, etc.). Then the rank-k LSI is ε-skewed with high probability
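A quick empirical check of what the theorem asserts, using the corpus sampled in the previous snippet: run rank-k LSI and compare within-topic vs. cross-topic cosine similarities of the document vectors. Taking the columns of D_k V_k^T as document vectors is one common convention, assumed here.

import numpy as np

all_words = sorted({w for _, ws in corpus for w in ws})
widx = {w: i for i, w in enumerate(all_words)}
A2 = np.zeros((len(all_words), len(corpus)))
for j, (_, ws) in enumerate(corpus):
    for w in ws:
        A2[widx[w], j] += 1

U2, s2, Vt2 = np.linalg.svd(A2, full_matrices=False)
k = len(topics)
docvecs = (np.diag(s2[:k]) @ Vt2[:k, :]).T     # one LSI vector per document

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

labels = [t for t, _ in corpus]
pairs = [(i, j) for i in range(len(corpus)) for j in range(i + 1, len(corpus))]
same = [cos(docvecs[i], docvecs[j]) for i, j in pairs if labels[i] == labels[j]]
diff = [cos(docvecs[i], docvecs[j]) for i, j in pairs if labels[i] != labels[j]]
print("same-topic mean cosine:", np.mean(same) if same else "n/a")
print("cross-topic mean cosine:", np.mean(diff) if diff else "n/a")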

22 Proof Sketch Show that with k topics we obtain k orthogonal subspaces –Assume strictly disjoint topics: show that whp the k highest eigenvalues of A A^T indeed correspond to the k topics (are not intra-topic) –Then relax this assumption using a matrix perturbation analysis

23 Extensions Theory should (ideally) go beyond explaining Potential for speed-up: –project the doc vectors onto a suitably small random subspace –perform LSI on this space Yields O(m(n + c log n)) compared to O(mnc)
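A sketch of this speed-up idea: apply a random projection to the term dimension before the SVD, reusing A from the first snippet. The projection dimension d below is an arbitrary illustrative choice, not the bound from the paper.

import numpy as np

rng = np.random.default_rng(1)
n, m = A.shape                                 # terms x docs
d = min(n, 50)                                 # assumed target dimension (tiny toy corpus, so no real reduction here)

R = rng.standard_normal((d, n)) / np.sqrt(d)   # random projection matrix
B = R @ A                                      # d x m projected term-document matrix

U_b, s_b, Vt_b = np.linalg.svd(B, full_matrices=False)
k = 2
doc_vecs = np.diag(s_b[:k]) @ Vt_b[:k, :]      # low-dimensional doc representations
print(doc_vecs.shape)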

24 Future work Learn more abstract algebra (math)! Extensions: –docs spanning multiple topics? –polysemy? –other positive properties? Another important role of theory: –Unify and generalize: spectral analysis has found applications elsewhere in IR

