1 LSI (lecture 19): Using latent semantic analysis to improve access to textual information (Dumais et al., CHI-88). What’s the best source of info about Computer Science in Dublin? (look familiar??!?!) COMP-4016 ~ Computer Science Department ~ University College Dublin ~ www.cs.ucd.ie/staff/nick ~ © Nicholas Kushmerick 2001

2 LSI -vs- PageRank, Hub/Auth
How do we solve the familiar problems of term-based IR (synonymy, polysemy)?
PageRank, Hubs/Authorities: mine valuable evidence about page quality/relevance from relationships across documents (namely, hyperlinks). Documents’ terms play almost a secondary role in retrieval!
LSI: don’t throw the baby out with the bathwater: terms are incredibly useful/important! The key is to employ statistical analysis to tease apart multiple meanings of a given word (polysemy) and multiple words for a given meaning (synonymy).
(Of course, LSI predates the Web [and therefore topology-based techniques] by a decade!)

3 Simple example #1 (Ack: Dumais 1997)

4 Simple example #1 - query for “portable”: with a cosine > 0.9 threshold, term matching retrieves only Doc 3; with LSI, Doc 2 is retrieved too!

5 Example #1 continued - query for “laptop”: automatic - no thesaurus!
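A minimal NumPy sketch of this example, with an invented 5-term x 3-document matrix standing in for the slide’s data (the counts below are made up, not Dumais’s): plain term matching scores only Doc 3 for the query “portable”, while the same query in a rank-2 LSI space also scores Doc 2 highly.

```python
import numpy as np

# Invented 5-term x 3-doc matrix standing in for example #1
# (rows = terms, columns = documents; counts are made up).
terms = ["portable", "laptop", "battery", "screen", "desk"]
X = np.array([[0, 0, 2],          # portable
              [0, 2, 0],          # laptop
              [1, 1, 1],          # battery
              [0, 1, 1],          # screen
              [2, 0, 0]], float)  # desk

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Term-based retrieval: the query "portable" matches only Doc 3 (index 2),
# because Doc 2 never literally contains the word.
q = np.array([1, 0, 0, 0, 0], float)
print([round(cosine(q, X[:, j]), 2) for j in range(3)])

# Rank-2 LSI space: project documents and the query onto the top 2 factors.
T, s, D = np.linalg.svd(X, full_matrices=False)   # X = T · diag(s) · D
k = 2
docs_k = np.diag(s[:k]) @ D[:k, :]                # documents as k-dim vectors
q_k = T[:, :k].T @ q                              # "fold" the query into the space
print([round(cosine(q_k, docs_k[:, j]), 2) for j in range(3)])
# Doc 2 now scores nearly as high as Doc 3, because "portable" and "laptop"
# co-occur with the same terms ("battery", "screen") across the collection.
```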

6 Stochastic model of language generation
Example (two senses of “bank”: “financial institution” vs. “part of river”): My favorite bank is AIB, located near the south bank of the Liffey on Dame Street. On the other hand, the north quay is home to numerous bureaux de change.
A ‘concept’ is a probability distribution over the words used to express the concept.
Synonymy: several words can have >0 probability for a given concept.
Polysemy: a given word can have >0 probability under several concepts.
Goal:
offline: use statistics over many documents to estimate the distributions
online: use the distributions to estimate the most-likely concept ‘explaining’ the words observed in some particular document
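A toy illustration of this generative story (the vocabulary, the two concepts, and all probabilities below are invented): each concept is a distribution over words, the word “bank” has positive probability under both concepts (polysemy), each concept spreads its probability over several words (synonymy), and the online step simply picks the concept that best explains the observed words.

```python
import numpy as np

# Each 'concept' is a probability distribution over words; all numbers here
# are invented for illustration.
vocab = ["bank", "river", "loan", "water"]
concepts = {
    "financial institution": np.array([0.5, 0.0, 0.4, 0.1]),
    "part of river":         np.array([0.3, 0.4, 0.0, 0.3]),
}

def most_likely_concept(words):
    """Online step: pick the concept that best 'explains' the observed words."""
    best, best_ll = None, -np.inf
    for name, dist in concepts.items():
        # log-likelihood of the words under this concept (smoothed to avoid log 0)
        ll = sum(np.log(dist[vocab.index(w)] + 1e-9) for w in words)
        if ll > best_ll:
            best, best_ll = name, ll
    return best

print(most_likely_concept(["bank", "loan"]))    # -> "financial institution"
print(most_likely_concept(["bank", "river"]))   # -> "part of river"
```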

7 Example 2
The presence of “information” and “computer” in Doc #2 is a “typo”: the author “should have” said “bits” and “device” (respectively) instead (polysemy).
The absence of “information” and “computer” from Doc #1 is a “mistake”: the author “meant to” include them but “forgot”, and used “document” and “access” instead (synonymy).
The corrected term-document matrix is “the right answer”.

8 The Singular Value Decomposition
A fact from linear algebra: any t x d term-document matrix X (rows = terms, columns = documents) can be decomposed as X = T0 · S0 · D0, where T0 is t x r, S0 is r x r, and D0 is r x d.
r = the rank of X = the number of linearly independent columns/rows, i.e., the number of non-duplicate (up to a constant multiple) rows/columns.
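As a quick sanity check of this linear-algebra fact, a few lines of NumPy reproduce the decomposition and its shapes on a small random matrix (the matrix itself is arbitrary; numpy.linalg.svd returns the factor-by-document matrix directly, matching the D0 written above).

```python
import numpy as np

# Decompose a small random t x d term-document matrix; X = T0 · S0 · D0,
# where (with full_matrices=False) T0 is t x r', S0 is r' x r', D0 is r' x d,
# and r' = min(t, d). The rank r is the number of non-zero singular values.
t, d = 6, 4
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(t, d)).astype(float)

T0, s0, D0 = np.linalg.svd(X, full_matrices=False)
S0 = np.diag(s0)

print(T0.shape, S0.shape, D0.shape)    # (6, 4) (4, 4) (4, 4)
print(np.linalg.matrix_rank(X))        # r: how many singular values are non-zero
print(np.allclose(X, T0 @ S0 @ D0))    # True: exact reconstruction (up to rounding)
print(np.all(np.diff(s0) <= 0))        # True: the diagonal of S0 is sorted, largest first
```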

9 SVD, continued
S0 has a very special structure: the diagonal elements are sorted (largest first) and the non-diagonal elements are zero. The large diagonal values carry interesting evidence of latent structure; the small ones reflect noise, coincidences, anomalies, …
Also…
– T0 and D0^T must satisfy some additional properties (“orthogonal, unit-length columns”)
– We refer to D0 rather than its transpose D0^T, to simplify some of the theoretical descriptions of the SVD
Algorithm: computing the SVD just means solving a big set of simultaneous equations; it’s slow, but there’s no magic or wizardry needed.

10 The Idea
Perform the SVD on the term-document matrix X, with one extra pseudo-document representing the query.
The diagonal values in S0 encode the “weight” of the various “higher-order” semantic concepts that give rise to the observed terms X.
Retain only the top K ≈ 50 high-weight values; these are the “dominant” concepts that were used to stochastically generate the observed terms.
Discard the low-weight “noise” values; these represent an attempt to “make sense” of the noise/typos/mistakes in the observed terms.
Plot the documents & query in this lower-dimensional space, and use good old-fashioned cosine similarity to retrieve relevant documents.
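Putting the steps on this slide together, here is a minimal sketch (the helper name lsi_rank and the toy data are my own, not from the lecture): append the query as a pseudo-document, take the SVD, keep only the top K weights, and rank documents by cosine similarity in the reduced space.

```python
import numpy as np

def lsi_rank(X, q, k=2):
    """Rank documents for query q: SVD of X with q appended as an extra
    pseudo-document, keep the top-k singular values, compare by cosine."""
    Xq = np.hstack([X, q.reshape(-1, 1)])       # query as the last column
    T, s, D = np.linalg.svd(Xq, full_matrices=False)
    coords = np.diag(s[:k]) @ D[:k, :]          # k-dim coordinates of docs + query
    qv, docs = coords[:, -1], coords[:, :-1]
    sims = docs.T @ qv / (np.linalg.norm(docs, axis=0) * np.linalg.norm(qv) + 1e-12)
    return np.argsort(-sims), sims              # best-matching documents first

# Toy usage (8 invented terms, 5 invented documents, query = term 0 only);
# k=2 here, whereas the slide suggests K on the order of 50 for real collections.
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(8, 5)).astype(float)
q = np.zeros(8); q[0] = 1.0
order, sims = lsi_rank(X, q, k=2)
print(order, np.round(sims, 2))
```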

11 The Idea, take 2
Keeping only the top K weights gives X ≈ T · S · D, where T is t x k, S is k x k, and D is k x d (rows = terms, columns = documents, with the query q as an extra column).
T0 · S0 · D0 = X ≈ T · S · D; however, we do not “need/want” to reproduce the original term-document matrix exactly: it contains many noisy/mistaken observations.
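To see concretely why exact reproduction is not wanted, the sketch below (again with an invented matrix, not the slide’s example) compares X with its rank-2 reconstruction T·S·D: a term the author never used in Doc 1 gets a nonzero weight in the reconstruction, in the spirit of example #2.

```python
import numpy as np

# Invented 5-term x 3-doc matrix (not the slide's actual example #2 data):
# Doc 1 and Doc 2 are about the same topic, but Doc 1's author never used
# the word "information".
terms = ["access", "document", "retrieval", "information", "theory"]
X = np.array([[1, 1, 0],
              [1, 0, 0],
              [1, 1, 0],
              [0, 1, 0],
              [0, 0, 2]], float)

T, s, D = np.linalg.svd(X, full_matrices=False)
k = 2
X_hat = T[:, :k] @ np.diag(s[:k]) @ D[:k, :]   # rank-k reconstruction T·S·D

# The reconstruction deliberately differs from X: "information" now gets a
# nonzero weight in Doc 1, as if correcting the author's "mistake".
print(np.round(X_hat, 2))
print(round(X_hat[terms.index("information"), 0], 2))   # > 0, even though it is 0 in X
```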

12 Example #3 -- 1

13 Example #3 - 2

14 Example #3 - 3
Using K = 2, terms and documents are plotted against LSI Factor 1 and LSI Factor 2; the two clusters correspond to “differential equations” and “applications & algorithms”.
Each term’s coordinates are given by the first K values of its row of T; each document’s coordinates are given by the first K values of its column of D.
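A small sketch of where such plot coordinates come from (the 4x4 matrix below is invented, not the data behind the slide’s figure): with K = 2, the x and y values fall straight out of the truncated T and D.

```python
import numpy as np

# Invented 4-term x 4-doc matrix with two obvious topics (terms 0-1 vs terms 2-3);
# this is NOT the data behind the slide's figure.
X = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 2]], float)

T, s, D = np.linalg.svd(X, full_matrices=False)   # X = T · diag(s) · D
K = 2
term_xy = T[:, :K]     # (LSI factor 1, LSI factor 2) per term: first K values of its row of T
doc_xy = D[:K, :]      # first K values of each document's column of D
print(np.round(term_xy, 2))
print(np.round(doc_xy.T, 2))   # one (x, y) row per document
```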

15 Some real results

16 “Exotic” uses of LSI - example: cross-language retrieval. In English, ship is a synonym for boat; in Franglish, ship is a synonym for bateau. The idea:
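The figure illustrating the idea is not part of this transcript; the sketch below assumes the usual cross-language LSI setup of training the space on dual-language documents (each training document contains a text together with its translation), with an invented five-word English/French vocabulary, and shows that translation-equivalent terms such as “ship” and “bateau” end up with nearly identical concept vectors.

```python
import numpy as np

# Train the LSI space on "dual-language" documents that contain both an
# English text and its French translation, so translation-equivalent terms
# co-occur in the same columns. The vocabulary and counts are invented.
vocab = ["ship", "boat", "ocean", "bateau", "mer"]
# Columns = bilingual training documents (English + French text concatenated).
X = np.array([[2, 0, 1],          # ship
              [0, 1, 1],          # boat
              [1, 1, 0],          # ocean
              [2, 0, 1],          # bateau
              [1, 1, 0]], float)  # mer

T, s, D = np.linalg.svd(X, full_matrices=False)
k = 2
term_vecs = T[:, :k] * s[:k]      # concept-space vector per term (row of T, scaled by S)

def sim(w1, w2):
    a, b = term_vecs[vocab.index(w1)], term_vecs[vocab.index(w2)]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# An English query term lands right next to its French counterpart, so an
# English query can retrieve French documents folded into the same space.
print(round(sim("ship", "bateau"), 2))   # high: identical rows in this toy matrix
print(round(sim("ship", "mer"), 2))      # lower
```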

17 Cross-language retrieval - Evaluation

18 Cross-language retrieval - Application

19 Summary
We all know that term-based information retrieval has serious deficits (namely: synonymy & polysemy).
Latent semantic indexing/analysis is a simple, statistically rigorous technique for transforming the original document/term matrix into a (more compact and reliable!) “concept space”.
Probabilistic model of document/query generation: synonymy and polysemy are a kind of noise, so the IR system’s job is to estimate the original “signal” (the latent semantic “meaning”).
Highly effective, and lots of other more “exotic” applications, too.

