Latent Semantic Analysis
Keith Trnka
Latent Semantic Indexing
Application of Latent Semantic Analysis (LSA) to Information Retrieval
Motivation: term matching is unreliable evidence of relevance
- Synonymy: many words refer to the same object; affects recall
- Polysemy: many words have multiple meanings; affects precision
LSI Motivation (example)
Source: Deerwester et al. 1990
LSA Solution
- Individual terms are overly noisy, analogous to overfitting in the Term-by-Document matrix
- Terms and documents should be represented by vectors in a “latent” semantic space
- LSI essentially infers knowledge from the co-occurrence of terms
- Assume “errors” (sparse data, non-co-occurrences) are normal and account for them
LSA Methods
Start with a Term-by-Document matrix (A, like fig. 15.5)
- Optionally weight the cells
- Apply Singular Value Decomposition: $A = T S D^\top$, where t = # of terms, d = # of documents, n = min(t, d); T is t x n, S is the n x n diagonal matrix of singular values, and D is d x n
- Approximate A using only the top k (semantic) dimensions: $A_k = T_k S_k D_k^\top$, keeping the first k columns of T and D and the k largest singular values (see the sketch below)
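A minimal numpy sketch of this pipeline. The toy count matrix and the choice k = 2 are purely illustrative; real systems use large corpora and k in the hundreds:

```python
import numpy as np

# Toy term-by-document count matrix A (terms x documents); values are illustrative.
A = np.array([
    [1, 1, 0, 0],   # "human"
    [1, 0, 1, 0],   # "computer"
    [0, 1, 1, 0],   # "interface"
    [0, 0, 1, 1],   # "system"
    [0, 0, 0, 1],   # "tree"
], dtype=float)

# Full SVD: A = T @ diag(S) @ D.T
T, S, Dt = np.linalg.svd(A, full_matrices=False)

# Truncate to k semantic dimensions.
k = 2
T_k, S_k, D_k = T[:, :k], S[:k], Dt[:k, :].T

# Rank-k approximation A_k = T_k S_k D_k^T
A_k = T_k @ np.diag(S_k) @ D_k.T
print(np.round(A_k, 2))
```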
LSA Methods (cont’d)
- The truncation minimizes the Euclidean distance to A over all rank-k matrices (hence, a least squares method); see the statement below
- Each row of T is a measure of the similarity of a term to each semantic dimension
- Likewise, each row of D measures the similarity of a document to each dimension
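To make the least-squares claim precise: the truncated SVD is the closest rank-k matrix to A in Frobenius norm. This is the Eckart–Young theorem, a standard linear-algebra fact rather than something stated on the slide:

```latex
A_k \;=\; T_k S_k D_k^{\top} \;=\; \operatorname*{arg\,min}_{\operatorname{rank}(B)\,\le\,k} \lVert A - B \rVert_F
```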
LSA Application
Querying for Information Retrieval:
- The query is a pseudo-document: a weighted sum, over all terms in the query, of the rows of T
- Compare its similarity to all documents in D using the cosine similarity measure (see the sketch below)
Document similarity: vector comparison of rows of D
Term similarity: vector comparison of rows of T
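A minimal sketch of querying, assuming the common folding-in formula q̂ = qᵀ T_k S_k⁻¹; the slide does not fix the exact weighting, so treat that as an assumption:

```python
import numpy as np

# Tiny illustrative setup: terms x documents counts, then rank-k SVD.
terms = ["human", "computer", "interface", "system", "tree"]
A = np.array([[1,1,0,0],[1,0,1,0],[0,1,1,0],[0,0,1,1],[0,0,0,1]], float)
T, S, Dt = np.linalg.svd(A, full_matrices=False)
k = 2
T_k, S_k, D_k = T[:, :k], S[:k], Dt[:k, :].T

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Fold the query in as a pseudo-document: q_hat = q^T T_k S_k^{-1}
# (one common formulation; not pinned down by the slide).
q = np.zeros(len(terms))
q[terms.index("human")] = 1.0
q[terms.index("interface")] = 1.0
q_hat = q @ T_k / S_k          # elementwise division by the singular values

# Rank documents by cosine similarity between q_hat and each row of D_k.
scores = [cosine(q_hat, d_row) for d_row in D_k]
print(np.argsort(scores)[::-1])  # document indices, best match first
```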
LSA Application (cont’d)
Choosing k is difficult:
- Commonly k = 100, 150, 300 or so
- Trade-off between overfitting (superfluous dimensions) and underfitting (not enough dimensions)
- What are the k semantic dimensions? Undefined
[Figure: performance as a function of k]
LSI Performance
Source: Dumais 1997
Considerations of LSA
- Conceptually high recall: query and document terms may be disjoint
- Polysemes are not handled well
- LSI is unsupervised/completely automatic
- Language independent; CL-LSI: Cross-Language LSI, weakly trained
- Computational complexity is high; optimization: random sampling methods
- Formal linear algebra foundations
- Models language acquisition in children
Fun Applications of LSA
Synonyms on TOEFL:
- Train on a corpus of newspapers, Grolier’s encyclopedia, and children’s reading material
- Use term similarity, with random guessing for unknown terms (see the sketch below)
- Best results at k = 400
- “The model got 51.5 correct, or 64.4% (52.5% corrected for guessing). By comparison a large sample of applicants to U.S. colleges from non-English speaking countries who took tests containing these items averaged 51.6 items correct, or 64.5% (52.7% corrected for guessing).”
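A sketch of the term-similarity step: pick the answer choice whose term vector is closest to the stem word's. It assumes term vectors are rows of T_k (a common choice, not specified on the slide); the vocabulary, vectors, and test item below are illustrative stand-ins:

```python
import numpy as np

# Stand-in term vectors; in a real system these would be rows of T_k
# (optionally scaled by the singular values).
vocab = {"levied": 0, "imposed": 1, "believed": 2, "requested": 3, "correlated": 4}
term_vecs = np.random.default_rng(0).normal(size=(len(vocab), 2))

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def toefl_answer(stem, choices):
    """Pick the choice whose term vector is most similar to the stem's."""
    s = term_vecs[vocab[stem]]
    return max(choices, key=lambda c: cosine(s, term_vecs[vocab[c]]))

print(toefl_answer("levied", ["imposed", "believed", "requested", "correlated"]))
```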
Subject-Matter Knowledge
- Introduction to Psychology multiple-choice tests from New Mexico State University and the University of Colorado, Boulder
- LSA scored significantly below average, although it received a passing grade
- LSA has trouble with knowledge that is not in its corpus
Text Coherence/Comprehension
Kintsch and colleagues:
- Convert a text to a logical representation and test it for coherence
- Slow; done by hand
LSA approach:
- Compute the cosine similarity between each sentence or paragraph (window of discourse) and the next one (see the sketch below)
- Predicted comprehension with r = .93
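A sketch of the coherence measure: represent each sentence as the sum of its terms' LSA vectors, then average the cosine between adjacent sentences. The vocabulary and vectors are random stand-ins for rows of T_k; the pipeline is the point:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = {w: i for i, w in enumerate("the cat sat mat dog ran park".split())}
term_vecs = rng.normal(size=(len(vocab), 2))  # stand-in for rows of T_k

def sentence_vec(words):
    # Sum the LSA vectors of the words we know; skip out-of-vocabulary words.
    return sum(term_vecs[vocab[w]] for w in words if w in vocab)

def coherence(sentences):
    """Mean cosine similarity between each sentence and the next one."""
    vecs = [sentence_vec(s.split()) for s in sentences]
    cosines = [
        u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
        for u, v in zip(vecs, vecs[1:])
    ]
    return float(np.mean(cosines))

print(coherence(["the cat sat", "the cat ran", "the dog ran park"]))
```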
Essay Grading
Summary
Latent Semantic Analysis has a wide variety of useful applications:
- Useful in cognitive psychology research
- Useful as a generic IR technique
- Possibly useful for lazy TAs?
References (Papers, etc.)
- Manning and Schutze (see Books below)
- Dumais (1997). Using LSI for Information Retrieval, Information Filtering and Other Things
- Deerwester et al. (1990). Indexing by Latent Semantic Analysis
- Dumais et al. (1996). Automatic Cross-Language IR Using LSI
- Landauer et al. (1998). An Introduction to Latent Semantic Analysis
- Landauer et al. (1997). A Solution to Plato’s Problem…
- Landauer et al. (1997). How Well Can Passage Meaning be Derived without Using Word Order? …
References (Books)
NLP Related:
- Manning and Schutze (2003). Foundations of Statistical NLP
SVD Related:
- Strang (1998). Linear Algebra
- Golub and Van Loan (1996). Matrix Computations (3rd ed.)
LSA Example/The End