
1 LSI (lecture 19). Using latent semantic analysis to improve access to textual information (Dumais et al., CHI-88). What's the best source of info about Computer Science in Dublin? (look familiar??!?!) COMP-4016 ~ Computer Science Department ~ University College Dublin ~ © Nicholas Kushmerick 2001

2 LSI -vs- PageRank, Hubs/Authorities. How to solve the familiar problems of term-based IR (synonymy, polysemy)? PageRank, Hubs/Authorities: mine valuable evidence about page quality/relevance from relationships across documents (namely, hyperlinks); documents' terms play almost a secondary role in retrieval! LSI: don't throw the baby out with the bathwater: terms are incredibly useful/important! The key is to employ statistical analysis to tease apart multiple meanings of a given word (polysemy) and multiple words for a given meaning (synonymy). (Of course, LSI predates the Web [and therefore topology-based techniques] by a decade!)

3 Simple example #1 Ack: Dumais 1997

4 Simple example #1 - query for “portable”, retrieving docs with cosine > 0.9. With raw term matching, only Doc 3 is retrieved; with LSI, Doc 2 is retrieved too!

5 Example #1 continued - query for “laptop” Automatic - no thesaurus!

6 Stochastic model of language generation. “financial institution” vs “part of river”: My favorite bank is AIB, located near the south bank of the Liffey on Dame Street. On the other hand, the north quay is home to numerous bureaux de change.
A ‘concept’ is a probability distribution over the words used to express that concept.
Polysemy: a given word can have probability > 0 under several concepts.
Synonymy: several words can have probability > 0 under a given concept.
Goal: offline, use statistics over many documents to estimate the distributions; online, use the distributions to estimate the most likely concept ‘explaining’ the words observed in some particular document.
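To make the generative picture concrete, here is a minimal sketch (not from the lecture; the concept names and probabilities are invented for illustration) of concepts as word distributions, showing where polysemy and synonymy come from:

```python
import random

# Each 'concept' is a probability distribution over words (illustrative numbers).
concepts = {
    "financial-institution": {"bank": 0.5, "money": 0.3, "loan": 0.2},
    "riverside":             {"bank": 0.4, "river": 0.4, "quay": 0.2},
}

def generate(concept, n_words=5):
    """Sample the words of a document 'about' the given concept."""
    words, probs = zip(*concepts[concept].items())
    return random.choices(words, weights=probs, k=n_words)

# Polysemy: "bank" has probability > 0 under both concepts.
# Synonymy: "river" and "quay" are alternative words for the riverside concept.
print(generate("riverside"))
```

The online task is then the inverse problem: given the observed words, infer which concept(s) most likely generated them.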

7 Example 2. The presence of “information” and “computer” in Doc#2 is a “typo”: the author “should have” said “bits” and “device” (resp.) instead (polysemy). The absence of “information” and “computer” from Doc#1 is a “mistake”: the author “meant to” include them but “forgot”, and used “document” and “access” instead (synonymy). The corrected matrix is the “right answer”.

8 The Singular Value Decomposition. A fact from linear algebra: any t x d matrix X (t terms, d documents) can be decomposed as
    X = T0 · S0 · D0^T
where T0 is t x r, S0 is r x r, and D0^T is r x d. Here r = the rank of X = the number of linearly independent columns/rows, i.e., the number of non-duplicate (up to constant multiple) rows/columns.

9 SVD, continued. S0 has a very special structure: its diagonal elements are sorted (largest first), and its non-diagonal elements are zero. The large diagonal values are interesting evidence of latent structure; the small ones reflect noise, coincidences, anomalies, …
Also:
- T0 and D0 must satisfy some additional properties (“orthogonal unit-length columns”).
- We write D0^T rather than D0 in places to simplify some of the theoretical descriptions of the SVD.
Algorithm: computing the SVD just means solving a big set of simultaneous equations; it's slow, but there's no magic or wizardry needed.
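A quick way to see this structure (a minimal sketch assuming numpy; the tiny matrix is invented, not the lecture's example):

```python
import numpy as np

# A tiny made-up t x d term-document matrix (5 terms, 4 documents).
X = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
], dtype=float)

T0, s0, D0t = np.linalg.svd(X, full_matrices=False)   # X = T0 · diag(s0) · D0t

print(s0)                                             # diagonal of S0: sorted, largest first
print(np.allclose(T0.T @ T0, np.eye(T0.shape[1])))    # orthogonal unit-length columns
print(np.allclose(X, T0 @ np.diag(s0) @ D0t))         # exact reconstruction at full rank
```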

10 The Idea.
- Perform SVD on the term-document matrix X, with one extra pseudo-document representing the query.
- The diagonal values in S0 encode the “weight” of the various “higher-order” semantic concepts that give rise to the observed terms X.
- Retain only the top K ≈ 50 high-weight values; these are the “dominant” concepts that were used to stochastically generate the observed terms. Discard the low-weight “noise” values; these represent an attempt to “make sense” of the noise/typos/mistakes in the observed terms.
- Plot documents & query in this lower-dimensional space, and use good-old-fashioned cosine similarity to retrieve relevant documents.

11 The Idea, take 2.
    X ≈ T · S · D^T
where X is t x d (terms x documents, with the query q as an extra column), T is t x k, S is k x k, and D^T is k x d.
T0 · S0 · D0^T = X ≈ T · S · D^T; however, we do not “need/want” to reproduce the original term-document matrix exactly: it contains many noisy/mistaken observations.
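Putting slides 10-11 together, here is a minimal runnable sketch (assuming numpy; the toy matrix, query, and K=2 are illustrative, not the lecture's data). One detail: instead of adding the query as a pseudo-document before the SVD, this version folds it in afterwards via q_hat = S^-1 · T^T · q, a standard equivalent-in-spirit alternative:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def lsi_retrieve(X, q, k=2):
    """Truncate the SVD of X to rank k, fold the query into concept space,
    and score each document there by cosine similarity."""
    T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
    T, s, Dt = T0[:, :k], s0[:k], D0t[:k, :]   # keep only the K dominant concepts
    q_hat = (T.T @ q) / s                      # fold-in: query mapped to concept space
    return [cosine(q_hat, Dt[:, j]) for j in range(Dt.shape[1])]

# Toy 5-term x 4-doc matrix; the query contains term 0 only.
X = np.array([[1,0,1,0],[1,1,0,0],[0,1,0,1],[0,0,1,1],[1,0,0,1]], float)
q = np.array([1, 0, 0, 0, 0], float)
print(lsi_retrieve(X, q, k=2))   # docs sharing only co-occurring terms can score high too
```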

12 Example #3 - 1

13 Example #3 - 2

14 Example #3 - 3. Using K=2, terms and documents are plotted against LSI Factor 1 and LSI Factor 2, which roughly correspond to the clusters “differential equations” and “applications & algorithms”. Each term's coordinates are given by the first K values of its row of T; each doc's coordinates by the first K values of its column of D^T.

15 Some real results

16 “Exotic” uses of LSI - example: cross-language retrieval. In English, ship is a synonym for boat. In Franglish, ship is a synonym for bateau. The idea is sketched below.
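The slide's diagram isn't reproduced in this transcript. As described in the published cross-language LSI work, the trick is to concatenate each training document with its translation into one dual-language pseudo-document, run the SVD on that matrix, and then fold monolingual queries and documents into the shared concept space. A minimal sketch, assuming numpy, with invented toy counts:

```python
import numpy as np

# Toy dual-language training data: each column is an English doc concatenated
# with its French translation (invented counts; rows = English + French vocabulary).
vocab = ["ship", "boat", "ocean", "bateau", "navire", "mer"]
X_dual = np.array([
    [2, 0, 1],   # ship
    [0, 2, 1],   # boat
    [1, 1, 2],   # ocean
    [1, 1, 1],   # bateau
    [1, 1, 0],   # navire
    [0, 1, 2],   # mer
], dtype=float)

k = 2
T0, s0, _ = np.linalg.svd(X_dual, full_matrices=False)
T, s = T0[:, :k], s0[:k]

def fold_in(term_vector):
    return (T.T @ term_vector) / s   # map any monolingual text into concept space

q_en = np.zeros(6); q_en[vocab.index("ship")] = 1     # English query: "ship"
d_fr = np.zeros(6); d_fr[vocab.index("bateau")] = 1   # French doc mentioning "bateau"

a, b = fold_in(q_en), fold_in(d_fr)
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))  # nonzero despite no shared words
```

Because “ship” and “bateau” co-occur in the paired training documents, they land near each other in concept space, so an English query can retrieve French documents.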

17 Cross-language retrieval - Evaluation

18 Cross-language retrieval - Application

19 Summary. We all know that term-based information retrieval has serious deficits (namely: synonymy & polysemy). Latent semantic indexing/analysis: a simple, statistically rigorous technique for transforming the original document/term matrix into a (more compact and reliable!) “concept space”. Probabilistic model of document/query generation: synonymy and polysemy are a kind of noise, so the IR system's job is to estimate the original “signal” (the latent semantic “meaning”). Highly effective, and lots of other more “exotic” applications, too.