Indexing by Latent Semantic Analysis Scott Deerwester, Susan Dumais, George Furnas, Thomas Landauer, and Richard Harshman Presented by: Ashraf Khalil

Outline – The Problem – Some History – LSA – A Small Example – Efficiency – Other Applications – Summary

The Problem Given a collection of documents, retrieve the documents that are relevant to a given query by matching terms in the documents to terms in the query.

The Problem
The vector space method:
– term (rows) by document (columns) matrix, based on term occurrences
– each document becomes a vector in this term space
– the cosine of the angle measures similarity between vectors (documents): small angle = large cosine = similar; large angle = small cosine = dissimilar
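
To make the setup concrete, here is a minimal Python sketch of the vector space model (not part of the original slides; the toy vocabulary and counts are invented):

import numpy as np

vocab = ["car", "automobile", "engine", "jaguar"]   # hypothetical toy vocabulary
doc_a = np.array([2.0, 0.0, 1.0, 0.0])   # mentions "car" twice, "engine" once
doc_b = np.array([0.0, 2.0, 1.0, 0.0])   # says "automobile" instead of "car"

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Small angle -> large cosine -> similar. Here the score is only 0.2 even
# though both documents are about the same topic: a preview of the synonymy
# problem on the next slide.
print(cosine(doc_a, doc_b))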

The Problem
Two problems arise with the vector space model:
– synonymy: there are many ways to refer to the same object (e.g. car and automobile); this leads to poor recall
– polysemy: most words have more than one distinct meaning (e.g. jaguar); this leads to poor precision

The Goal Latent Semantic Indexing was proposed to address these two problems with the vector space model

Some History
Latent Semantic Indexing was developed at Bellcore (now Telcordia) in the late 1980s (1988). It was later patented.
The first papers about LSI:
– Dumais, S. T., Furnas, G. W., Landauer, T. K. and Deerwester, S. (1988). "Using latent semantic analysis to improve information retrieval." In Proceedings of CHI'88: Conference on Human Factors in Computing. New York: ACM.
– Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman, R. A. (1990). "Indexing by latent semantic analysis." Journal of the American Society for Information Science, 41(6).

LSA: The Idea
Idea (Deerwester et al.): "We would like a representation in which a set of terms, which by itself is incomplete and unreliable evidence of the relevance of a given document, is replaced by some other set of entities which are more reliable indicants. We take advantage of the implicit higher-order (or latent) structure in the association of terms and documents to reveal such relationships."
The assumption is that co-occurrence says something about semantics: words about the same things are likely to occur in the same contexts.
If we have many words and contexts, small differences in co-occurrence probabilities can be combined to give information about semantics.

LSA: Overview
Build a matrix with rows representing words and columns representing contexts (a document or word string).
Apply SVD:
– a unique mathematical decomposition of a matrix into the product of three matrices: two with orthonormal columns, one with the singular values on the diagonal
– a tool for dimensionality reduction
– a similarity measure based on co-occurrence
– finds the optimal projection into a low-dimensional space

LSA Methods
Start with a term-by-document matrix X.
Optionally weight the cells.
Apply Singular Value Decomposition, X = T S D^T, where
– t = # of terms
– d = # of documents
– n = min(t, d), so T is t x n, S is n x n diagonal, and D is d x n
Approximate X using only k (semantic) dimensions: X ≈ X_k = T_k S_k D_k^T.
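
A small NumPy sketch of the decomposition and the rank-k approximation described above (the random stand-in matrix and variable names are mine; only the T, S, D, k notation follows the slide):

import numpy as np

rng = np.random.default_rng(0)
t, d, k = 12, 9, 2                      # terms, documents, kept dimensions
X = rng.integers(0, 3, size=(t, d)).astype(float)   # stand-in term-by-document counts

# Thin SVD: X = T @ diag(s) @ Dt, with orthonormal columns in T and in Dt.T
T, s, Dt = np.linalg.svd(X, full_matrices=False)    # s is sorted, largest first

# Keep only the k largest singular values: the rank-k approximation X_k
X_k = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]

# X_k is the best rank-k approximation of X in the least-squares sense
print(np.linalg.matrix_rank(X_k))       # k
print(np.linalg.norm(X - X_k))          # reconstruction error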

LSA: SVD
SVD can be viewed as a method for rotating the axes of the n-dimensional space so that
– the first axis runs along the direction of the largest variation among the documents
– the second axis runs along the direction of the second largest variation
– and so on
It is a generalized least-squares method.

LSA
Rank-reduced Singular Value Decomposition (SVD) is performed on the matrix:
– all but the k highest singular values are set to 0
– this produces a k-dimensional approximation of the original matrix
– this is the "semantic space"
Compute similarities between entities in the semantic space (usually with the cosine).
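
A sketch of how a query could be ranked against documents in this reduced space. Projecting the query with T_k is a common LSI convention and an assumption of mine, not something stated on the slide:

import numpy as np

def lsi_rank(X, q, k=2):
    """X: term-by-document counts (t x d); q: term-count query vector (t,)."""
    T, s, Dt = np.linalg.svd(X, full_matrices=False)
    Tk = T[:, :k]                        # k leading term directions
    doc_vecs = Tk.T @ X                  # each column: one document in k dims
    q_vec = Tk.T @ q                     # query projected into the same space
    sims = (q_vec @ doc_vecs) / (
        np.linalg.norm(doc_vecs, axis=0) * np.linalg.norm(q_vec) + 1e-12)
    return np.argsort(-sims), sims       # document ranking and cosines

rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(12, 9)).astype(float)
q = np.zeros(12); q[[0, 3]] = 1.0        # a query mentioning terms 0 and 3
order, sims = lsi_rank(X, q)
print(order, sims.round(2))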

A Small Example: Technical Memo Titles
c1: Human machine interface for ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user perceived response time to error measurement
m1: The generation of random, binary, ordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey

A Small Example – 2: on the raw term-by-document counts, r(human, user) = -.38 and r(human, minors) = -.29

A Small Example – 3: the term matrix T from the SVD (values not reproduced here)

A Small Example – 4: the diagonal matrix S of singular values (values not reproduced here)

A Small Example – 5: the document matrix D from the SVD (values not reproduced here)

A Small Example – 7: after reducing to two dimensions and reconstructing the matrix, r(human, user) = .94 and r(human, minors) = -.83

A Small Example – 2 again: on the raw data, r(human, user) = -.38 and r(human, minors) = -.29

Correlation: Raw data
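
The whole example can be reproduced with a few lines of NumPy. The nine titles and the 12 index terms (those occurring in more than one title) come from the slides and the Deerwester et al. paper; the tokenization and scaffolding are my own, so the printed values should be read as approximate checks against the numbers quoted on the slides:

import re
import numpy as np

titles = {
    "c1": "Human machine interface for ABC computer applications",
    "c2": "A survey of user opinion of computer system response time",
    "c3": "The EPS user interface management system",
    "c4": "System and human system engineering testing of EPS",
    "c5": "Relation of user perceived response time to error measurement",
    "m1": "The generation of random, binary, ordered trees",
    "m2": "The intersection graph of paths in trees",
    "m3": "Graph minors IV: Widths of trees and well-quasi-ordering",
    "m4": "Graph minors: A survey",
}
terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "eps", "survey", "trees", "graph", "minors"]

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

# 12 x 9 term-by-document count matrix
X = np.array([[tokens(title).count(t) for title in titles.values()]
              for t in terms], dtype=float)

# SVD and rank-2 reconstruction
T, s, Dt = np.linalg.svd(X, full_matrices=False)
X2 = T[:, :2] @ np.diag(s[:2]) @ Dt[:2, :]

def corr(a, b, M):
    i, j = terms.index(a), terms.index(b)
    return np.corrcoef(M[i], M[j])[0, 1]

print(round(corr("human", "user", X), 2))    # raw counts: about -0.38
print(round(corr("human", "user", X2), 2))   # rank-2 space: should be close to 0.94
print(round(corr("human", "minors", X2), 2)) # rank-2 space: should be close to -0.83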

Some Issues with LSI
SVD algorithm complexity is O(n^2 k^3), where
– n = number of terms + documents
– k = number of dimensions in the semantic space (typically small, ~50 to 350)
Although there is a lot of empirical evidence that LSI works, there is no concrete proof of why it works.

Semantic Dimension
Finding the optimal dimension for the semantic space:
– precision and recall improve as the dimension is increased until it reaches the optimum, then slowly decrease until performance matches the standard vector model
– in many tasks this works well, but there is still room for research
– choosing k is difficult: overfitting (superfluous dimensions) vs. underfitting (not enough dimensions)
(The slide shows a plot of performance against k.)

Other Applications
LSA has proved to be a valuable tool in many areas beyond IR:
– summarization
– cross-language IR
– topic segmentation
– text classification
– question answering
– LSA can pass the TOEFL

LSA Can Pass the TOEFL
Task:
– multiple-choice synonym test: given one word, find the best match out of 4 alternatives
Training:
– corpus of 30,473 articles from Grolier's Academic American Encyclopedia
– used the first ~150 words from each article, giving 60,768 unique words that occur at least twice
– 300 singular vectors
Results:
– LSI gets 52.5% correct (corrected for guessing)
– non-LSI similarity gets 15.8% correct (29.5% in another paper)
– the average (foreign) human test taker gets 52.7%
Landauer, T. K. and Dumais, S. T. (1997). "A solution to Plato's problem: the Latent Semantic Analysis theory of acquisition, induction and representation of knowledge." Psychological Review, 104(2).

LSA Can Mark Essays
LSA judgments of the quality of sentences correlate at r = 0.81 with expert ratings.
LSA can judge how good an essay (on a well-defined set topic) is by computing the average distance between the essay to be marked and a set of model essays.
– The correlations are comparable to between-human correlations.
"If you wrote a good essay and scrambled the words you would get a good grade," Landauer said. "But try to get the good words without writing a good essay!"
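
As a rough illustration (entirely my own sketch, not the authors' scoring system), computing an essay's average similarity to a set of model essays in the LSA space might look like this:

import numpy as np

def essay_score(essay_vec, model_vecs, Tk):
    """Average cosine of one essay to a set of model essays, compared in the
    k-dimensional LSA space spanned by the term matrix Tk (t x k)."""
    e = Tk.T @ essay_vec                         # essay in semantic space
    M = model_vecs @ Tk                          # model essays, one per row
    cos = (M @ e) / (np.linalg.norm(M, axis=1) * np.linalg.norm(e) + 1e-12)
    return float(cos.mean())

# Toy usage with made-up numbers: 6-term vocabulary, 2-dimensional LSA space
rng = np.random.default_rng(3)
Tk = np.linalg.qr(rng.normal(size=(6, 2)))[0]    # orthonormal stand-in for T_k
models = rng.integers(0, 4, size=(3, 6)).astype(float)
essay = rng.integers(0, 4, size=6).astype(float)
print(round(essay_score(essay, models, Tk), 2))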

Good References
– The group at the University of Colorado at Boulder has a web site where you can try out LSA and download papers
– Papers are also available at: (URL not preserved in the transcript)