IR Models: Latent Semantic Analysis

IR Model Taxonomy
Models are organized by the user task: Retrieval (ad hoc, filtering) or Browsing.
–Classic Models: Boolean, Vector, Probabilistic
–Set Theoretic: Fuzzy, Extended Boolean
–Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
–Probabilistic: Inference Network, Belief Network
–Structured Models: Non-Overlapping Lists, Proximal Nodes
–Browsing: Flat, Structure Guided, Hypertext

Vocabulary Problem The “vocabulary problem” can cause classic IR to deliver poor retrieval:
–Polysemy: the same term means many things, so unrelated documents may be included in the answer set. Leads to poor precision.
–Synonymy: different terms mean the same thing, so relevant documents that do not contain any query index term are not retrieved. Leads to poor recall.

Latent Semantic Indexing
–Retrieval based on index terms is vague and noisy.
–The user's information need is more related to concepts and ideas than to index terms.
–A document that shares concepts with another document known to be relevant might be of interest.

Latent Semantic Indexing The key idea:
–Map documents and queries into a lower-dimensional space.
–The lower-dimensional space represents higher-level concepts, which are fewer in number than the index terms.
Retrieval in this reduced concept space might be superior to retrieval in the space of index terms.

Latent Semantic Indexing Definitions:
–Let t be the total number of index terms.
–Let N be the number of documents.
–Let M = (M_ij) be a term-document matrix with t rows and N columns.
–Each element of this matrix is assigned a weight w_ij associated with the term-document pair [k_i, d_j].
–The weight w_ij can be based on a tf-idf weighting scheme.
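As a minimal sketch of these definitions, a t x N tf-idf term-document matrix might be built as follows. The toy documents and the tfidf_matrix helper are illustrative, not from the slides:

```python
import numpy as np

def tfidf_matrix(docs):
    """Build a t x N term-document matrix with tf-idf weights.

    Rows are index terms k_i, columns are documents d_j, and each
    element is w_ij = tf_ij * log(N / n_i), where n_i is the number
    of documents containing term k_i.
    """
    vocab = sorted({w for d in docs for w in d.split()})
    N = len(docs)
    tf = np.zeros((len(vocab), N))
    for j, d in enumerate(docs):
        for w in d.split():
            tf[vocab.index(w), j] += 1
    n_i = (tf > 0).sum(axis=1)      # document frequency of each term
    idf = np.log(N / n_i)
    return tf * idf[:, None], vocab

docs = ["gold silver truck", "shipment of gold", "delivery of silver"]
M, vocab = tfidf_matrix(docs)
print(M.shape)   # (t, N): one row per index term, one column per document
```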

Singular Value Decomposition The matrix M = (M_ij) can be decomposed into three matrices:
–M = K S D^t
–K is the matrix of eigenvectors derived from M M^t
–D^t is the matrix of eigenvectors derived from M^t M
–S is an r x r diagonal matrix of singular values, where r = rank(M), r <= min(t, N)
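A quick sketch of this decomposition using numpy on a toy 4-term x 3-document matrix (the matrix values are illustrative):

```python
import numpy as np

# Toy 4-term x 3-document matrix M
M = np.array([[1., 0., 0.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])

# K columns are eigenvectors of M M^t, Dt rows are eigenvectors of
# M^t M, and S is diagonal with singular values in decreasing order.
K, sing, Dt = np.linalg.svd(M, full_matrices=False)
S = np.diag(sing)

assert np.allclose(K @ S @ Dt, M)   # the decomposition reconstructs M
print(sing)                          # singular values, largest first
```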

Latent Semantic Indexing
–In the matrix S, select only the s largest singular values.
–Keep the corresponding columns in K and D^t.
–The resulting matrix is called M_s and is given by M_s = K_s S_s (D_s)^t, where s, s < r, is the dimensionality of the concept space.
The parameter s should be:
–large enough to allow fitting the characteristics of the data
–small enough to filter out the non-relevant details
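Continuing the numpy sketch, the rank-s approximation M_s is obtained by truncating the SVD (the toy matrix and the choice s = 2 are illustrative):

```python
import numpy as np

M = np.array([[1., 0., 0.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])
K, sing, Dt = np.linalg.svd(M, full_matrices=False)

s = 2  # dimensionality of the concept space, s < r
M_s = K[:, :s] @ np.diag(sing[:s]) @ Dt[:s, :]

# M_s is the best rank-s approximation of M in the Frobenius norm;
# the approximation error equals the largest discarded singular value.
err = np.linalg.norm(M - M_s)
print(err)
```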

LSI Ranking
–The user query can be modelled as a pseudo-document in the original matrix M; assume the query is modelled as the document numbered 0.
–The matrix (M_s)^t M_s quantifies the relationship between any two documents in the reduced concept space.
–The first row of this matrix provides the rank of all the documents with regard to the user query (represented as the document numbered 0).
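This ranking scheme can be sketched as follows; the query vector in column 0 and the toy documents are assumed for illustration:

```python
import numpy as np

# Term-document matrix with the query modelled as document 0.
# Column 0 is the query vector in the original term space;
# document 1 happens to match the query exactly.
M = np.array([[1., 1., 0., 0.],
              [1., 1., 1., 0.],
              [0., 0., 1., 1.],
              [0., 0., 0., 1.]])

K, sing, Dt = np.linalg.svd(M, full_matrices=False)
s = 2
M_s = K[:, :s] @ np.diag(sing[:s]) @ Dt[:s, :]

# (M_s)^t (M_s) relates every pair of documents in the concept space;
# row 0 scores each document against the query (document 0).
scores = (M_s.T @ M_s)[0, 1:]
ranking = np.argsort(-scores) + 1   # document indices, best first
print(ranking)
```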

Latent Semantic Analysis as a Model of Human Language Learning Psycho-linguistic model:
–LSA acts like children, who acquire word meanings not through explicit definitions but by observing how words are used.
–LSA is a pale reflection of how humans learn language, but it is a reflection.
–LSA offers an explanation of how people can agree enough to share meaning.

LSA Applications In addition to typical query systems, LSA has been used for:
–Cross-language search
–Reviewer assignment at conferences
–Finding experts in an organization
–Identifying the reading level of documents

Concept-based IR Beyond LSA
–LSA/LSI uses principal component analysis (PCA).
–Principal components are not necessarily good for discrimination in classification.
–Linear Discriminant Analysis (LDA) identifies linear transformations that maximize between-class variance while minimizing within-class variance.
–LDA requires training data.
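The contrast with PCA can be sketched with the two-class Fisher discriminant, w proportional to S_w^{-1}(m_A - m_B). The synthetic data below is an assumption chosen so the highest-variance (PCA) direction is the x axis while the classes separate only along y:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two classes with large spread along x (so PCA picks x) but whose
# means differ only along y (so only y discriminates).
A = rng.normal([0.0, 0.0], [3.0, 0.3], size=(100, 2))
B = rng.normal([0.0, 2.0], [3.0, 0.3], size=(100, 2))

# Fisher LDA direction: w ∝ S_w^{-1} (m_A - m_B), which maximizes
# between-class variance relative to within-class variance.
mA, mB = A.mean(0), B.mean(0)
Sw = np.cov(A.T) * (len(A) - 1) + np.cov(B.T) * (len(B) - 1)
w = np.linalg.solve(Sw, mA - mB)
w /= np.linalg.norm(w)

# Unlike the top principal component (x axis), the LDA projection
# points along y, the discriminating direction.
print(np.abs(w))
```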

Linear Discriminant Analysis [Figure: projecting a 2D space onto one principal component. From slides by Shaoqun Wu.]

Linear Discriminant Analysis [Figure: PCA vs. LDA projections of the same two classes. LDA discovers a discriminating projection.]

LDA Results LDA reduces the number of dimensions (concepts) required for classification tasks.

Conclusions
–Latent semantic indexing provides an intermediate, concept-level representation to aid IR, mitigating the vocabulary problem.
–It generates a representation of the document collection that can be explored by the user.
–Alternative methods for identifying discriminating projections (e.g. LDA) may improve results.