Indexing by Latent Semantic Analysis Written by Deerwester, Dumais, Furnas, Landauer, and Harshman (1990) Reviewed by Cinthia Levy.


1 Indexing by Latent Semantic Analysis Written by Deerwester, Dumais, Furnas, Landauer, and Harshman (1990) Reviewed by Cinthia Levy

2 Latent Semantic Indexing Term-matching Most retrieval systems match words of a query (keywords) with words of a document. Problem What if users want to retrieve information based upon conceptual content?

3 Latent Semantic Indexing Expressing a concept in keywords is complicated and unreliable:
- Synonymy: there are many ways to refer to the same concept. Results in ‘poor recall’.
- Polysemy: most words have multiple meanings. Results in ‘poor precision’.
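The two failure modes above can be illustrated with a minimal sketch of literal term matching. The function name `term_match` and the four toy documents are hypothetical and are not taken from the paper.

```python
def term_match(query, docs):
    """Return indices of documents sharing at least one word with the query."""
    query_words = set(query.lower().split())
    return [i for i, doc in enumerate(docs)
            if query_words & set(doc.lower().split())]

# Toy collection (made up for illustration).
docs = [
    "the car dealer sold a used car",   # about automobiles, but never says "automobile"
    "automobile repair manual",
    "the bank raised interest rates",   # financial sense of "bank"
    "we fished from the river bank",    # geographic sense of "bank"
]

# Synonymy: document 0 is relevant but is missed, which lowers recall.
synonym_hits = term_match("automobile", docs)

# Polysemy: both senses of "bank" are returned, which lowers precision.
polysemy_hits = term_match("bank", docs)

print(synonym_hits, polysemy_hits)
```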

4 Latent Semantic Indexing Three factors contribute to the failure of IR systems to overcome the problems associated with synonymy and polysemy:
1. Identification of index terms is incomplete
2. No automatic method adequately addresses polysemy
3. Technical: the way current IR systems work

5 Latent Semantic Indexing Goal... to build an IR system that predicts what terms “really” are implied by a query or what terms “really” apply to a document (i.e. the latent semantics).

6 Latent Semantic Indexing Choosing a model Proximity model: similar items are put near each other in some space or structure.

7 Latent Semantic Indexing Existing proximity models include:
- Hierarchical, partition & overlapping clusterings
- Ultrametric & additive trees
- Factor-analytic & multidimensional distance models

8 Latent Semantic Indexing An alternative model was considered, based on the following criteria:
1. Adjustable representational richness
2. Explicit representation of both terms and documents
3. Computational tractability for large datasets

9 Latent Semantic Indexing Singular value decomposition (SVD), or two-mode factor analysis, satisfied all three criteria! SVD: a fully automatic statistical method used to determine associations among terms in a large document collection, and to create a semantic or concept space.
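As a rough sketch of what SVD does here, the snippet below factors a small term-document matrix with NumPy and keeps only the k largest singular values. The matrix values and the choice k = 2 are assumptions made for illustration; k plays the role of the "adjustable representational richness" criterion.

```python
import numpy as np

# Hypothetical 5-term x 4-document count matrix (values are made up).
X = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
], dtype=float)

# Thin SVD: X = U @ diag(s) @ Vt, singular values sorted in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the k largest singular values to get a rank-k approximation of X.
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Using every singular value reproduces X; truncating loses exactly the
# energy carried by the discarded singular values (Frobenius norm).
full_error = np.linalg.norm(X - U @ np.diag(s) @ Vt)
rank_k_error = np.linalg.norm(X - X_k)
print(full_error, rank_k_error, np.sqrt(np.sum(s[k:] ** 2)))
```

A larger k keeps more detail of the original matrix; a smaller k forces more terms and documents to share dimensions.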

10 Latent Semantic Indexing Basis of LSI:
- Documents are condensed to contain only “content words” with semantic meaning
- Patterns of word distribution (co-occurrence) are analyzed across a collection of documents.

11 Latent Semantic Indexing Basis of LSI:
- Document collection is examined as a whole
  - Documents with many words in common are semantically close.
  - Documents with few words in common are semantically distant.

12 Latent Semantic Indexing Steps of LSI:
- Format document: stop words removed, punctuation removed, no capitalization.
- Select content words: words with no semantic value are removed using stop list.
- Apply Stemming*: reduces words to root form.
*(not applied in Deerwester et al.)
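The formatting and content-word steps above can be sketched as follows. The stop list is a tiny made-up sample and `content_words` is a hypothetical helper name; stemming is left out, matching the note that Deerwester et al. did not apply it.

```python
import re

# Illustrative stop list only; a real stop list is much longer.
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "is", "in", "for", "on", "with"}

def content_words(text):
    """Lowercase the text, drop punctuation, and remove stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

words = content_words("The Indexing of Documents, and the Retrieval of Information.")
print(words)  # ['indexing', 'documents', 'retrieval', 'information']
```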

13 Latent Semantic Indexing Result: List of content words The list of content words is used to generate a term-document matrix.

14 Latent Semantic Indexing Term-document matrix
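A term-document matrix has one row per content word and one column per document, with each cell holding how often the word occurs in that document. The sketch below builds one from already-tokenized documents; the function name and the three toy documents are assumptions for illustration.

```python
import numpy as np

def term_document_matrix(docs):
    """Build a raw-count matrix: rows are terms, columns are documents."""
    vocab = sorted({word for doc in docs for word in doc})
    row_of = {word: i for i, word in enumerate(vocab)}
    X = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for word in doc:
            X[row_of[word], j] += 1
    return vocab, X

# Hypothetical documents, already reduced to content words.
docs = [
    ["human", "computer", "interface"],
    ["computer", "user", "system", "system"],
    ["graph", "trees"],
]
vocab, X = term_document_matrix(docs)
print(vocab)
print(X)
```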

15 Latent Semantic Indexing Term-document matrix:
- Term weighting* is applied to each value
- SVD algorithm is applied to the matrix
- Matrix represents vectors in a multi-dimensional space
*(not applied in Deerwester et al.)
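To see why applying SVD to the matrix helps with synonymy, consider a minimal sketch on a made-up collection in which "car" and "automobile" never appear in the same document but both co-occur with "engine". The terms, counts, and k = 2 are assumptions; no term weighting is applied.

```python
import numpy as np

terms = ["car", "automobile", "engine", "flower", "petal"]
# Columns are four hypothetical documents.
X = np.array([
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # automobile
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 1],  # flower
    [0, 0, 1, 1],  # petal
], dtype=float)

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
# Term coordinates in the reduced space: rows of U_k scaled by the singular values.
term_coords = U[:, :k] * s[:k]

raw_sim = cosine(X[0], X[1])                      # rows of the original matrix
lsi_sim = cosine(term_coords[0], term_coords[1])  # same terms in 2 dimensions
print(raw_sim, lsi_sim)
```

In the raw matrix the two terms are orthogonal because they share no document. In the two-dimensional space they land on the same direction, since their only evidence is the shared co-occurrence with "engine".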

16 Latent Semantic Indexing Visual representation of a three-dimensional space: Content words form three orthogonal (mutually perpendicular) axes: eggs, bacon, coffee

17 Latent Semantic Indexing “If you draw a line from the origin of the graph to each of these points, you obtain a set of vectors in 'bacon-eggs-and-coffee' space. The size and direction of each vector tells you how many of the three key items were in any particular order, and the set of all the vectors taken together tells you something about the kind of breakfast people favor on a Saturday morning.” Retrieved from: http://javelina.cet.middlebury.edu/lsa/out/lsa_explanation.htm
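The breakfast analogy can be written out as a small sketch: each order is a point whose coordinates are the counts of eggs, bacon, and coffee. The three orders below are invented for illustration.

```python
import numpy as np

axes = ["eggs", "bacon", "coffee"]
# Each row is one hypothetical breakfast order: counts along the three axes.
orders = np.array([
    [2, 2, 1],
    [0, 1, 1],
    [1, 2, 0],
], dtype=float)

# Size of each vector: how much was ordered overall.
lengths = np.linalg.norm(orders, axis=1)

# Direction of each vector: the mix of items, independent of quantity.
directions = orders / lengths[:, None]

# All vectors taken together: what people favor overall.
totals = orders.sum(axis=0)
print(dict(zip(axes, totals)))
```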

18 Latent Semantic Indexing Retrieved from http://lsi.research.telcordia.com/lsi-bin/lsiQuery

19 Latent Semantic Indexing
Romans 1:22 Professing themselves to be wise, they became fools…
Romans 16:6 Greet Mary, who bestowed much labour on us.
Matthew 24:22 And except those days should be shortened, there should no flesh be saved: but for the elect's sake those days shall be shortened.
John 3:17 For God sent not his Son into the world to condemn the world; but that the world through him might be saved.

20 Latent Semantic Indexing (Deerwester…) System compared to:
- Straight term matching
- Voorhees
- SMART
Using:
1. A collection of medical abstracts (MED)
2. Information science abstracts (CISI)

21 Latent Semantic Indexing Summary of analyses:
- LSI performed better than or equal to simple term matching
- LSI was shown to be superior to the system described by Voorhees
- LSI performed better than or equal to SMART

22 Latent Semantic Indexing Conclusion
- LSI represents both terms and documents in the same space, which provides for the retrieval of relevant information.
- LSI does not rely on literal matching and thus retrieves more relevant information than other methods.
- LSI offers an adequate solution to the problem of synonymy but only a partial solution to the problem of polysemy.
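Because terms and documents share one space, a query can be placed in it as a pseudo-document and compared with every document by cosine. The sketch below folds a query in as q_hat = q^T U_k S_k^{-1} on a made-up collection; the data, the helper names, and k = 2 are assumptions, not the paper's experimental setup.

```python
import numpy as np

terms = ["car", "automobile", "engine", "flower", "petal"]
# Columns: d0 = car engine, d1 = automobile engine, d2 and d3 = flower petal.
X = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U_k, s_k = U[:, :k], s[:k]
doc_coords = Vt[:k, :].T * s_k  # one row per document, scaled by singular values

def fold_in(query_terms):
    """Place a query in the k-dimensional space as a pseudo-document."""
    q = np.array([float(t in query_terms) for t in terms])
    q_hat = q @ U_k / s_k        # comparable to a row of V_k
    return q_hat * s_k           # same scaling as doc_coords

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = fold_in({"car"})
sims = [cosine(query, d) for d in doc_coords]
print(np.round(sims, 3))
```

Document d1 contains "automobile" and "engine" but not "car", yet it scores as high as d0, while the flower documents score near zero. Literal matching on "car" would have returned d0 only.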

