INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University

Overview
- Introduction
- Overview of IR systems and basic terminology
- Classical models of information retrieval: Boolean model, vector model
- Modern models of information retrieval: latent semantic indexing, correlation method
- Conclusion

Introduction
People need information to solve problems, from very simple things to complex ones. The "perfect search engine," as defined by Larry Page, is something that understands exactly what you mean and gives back exactly what you want.

Challenges to IR
The introduction of the WWW poses new challenges to IR: the Web is very large and heterogeneous.

Overview of IR and basic terminology
An IR system can be divided into three components: input (queries and documents), processor, and output.

Input
The main task: obtain a good representation of each document and of the query for the computer to use. A document representative is, for example, a list of extracted keywords. The ideal would be to let the computer process the natural language of the document itself.

Input (cont.)
Obstacles: the theory of human language has not been sufficiently developed for every language, and it is not clear how to use such a theory to enhance information retrieval.

Input (cont.)
Representing documents by keywords. Step 1: removing common words. Step 2: stemming.

Step 1: Removing common words
Very high-frequency words are very common words. They should be removed by comparison with a stop-list (e.g. "will", "of", "the", "be", "are", "by"), so that non-significant words do not interfere with the IR process. This reduces document size by between 30 and 50 per cent (C.J. van Rijsbergen).

Step 2: Stemming
Definition. Stemming: the process of chopping the ending off a term, e.g. removing "ed" or "ing". A standard algorithm is Porter's.
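
Below is a minimal Python sketch of both preprocessing steps. The stop-list and the suffix-stripping rules are illustrative only; the real Porter algorithm applies several phases of context-sensitive rules.

    # Hypothetical stop-list and a deliberately naive stemmer (not Porter's).
    STOP_WORDS = {"will", "of", "the", "be", "are", "by", "a", "an", "to", "in"}

    def strip_suffix(term):
        # Chop a few common English endings, keeping a minimal stem length.
        for suffix in ("ing", "ed", "es", "s"):
            if term.endswith(suffix) and len(term) > len(suffix) + 2:
                return term[:-len(suffix)]
        return term

    def keywords(text):
        terms = [t.lower() for t in text.split()]
        return [strip_suffix(t) for t in terms if t not in STOP_WORDS]

    print(keywords("the testing of user interfaces"))  # ['test', 'user', 'interfac']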

Processor
This part of the IR system is concerned with the retrieval process: structuring the documents in an appropriate way and performing the actual retrieval function using a predefined model.

Output
The output is a set of documents assumed to be relevant to the user. The purpose of IR: retrieve all relevant documents, and retrieve as few irrelevant documents as possible.

Definitions
Recall = (relevant documents retrieved) / (all relevant documents)
Precision = (relevant documents retrieved) / (all documents retrieved)

Precision-recall trade-off: with high precision but low recall, the system returns relevant documents but misses many useful ones; with high recall but low precision, it returns most of the relevant documents but includes lots of junk. The ideal is precision = recall = 1.
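
To make the definitions concrete, here is a small Python sketch that computes both measures for a single query; the document IDs are made up.

    def precision_recall(retrieved, relevant):
        # Precision = fraction of retrieved docs that are relevant;
        # recall = fraction of relevant docs that were retrieved.
        hits = len(set(retrieved) & set(relevant))
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    print(precision_recall(retrieved=["d1", "d2", "d3", "d7"],
                           relevant=["d1", "d2", "d3", "d4", "d5", "d6"]))
    # (0.75, 0.5): 3 of 4 retrieved are relevant; 3 of 6 relevant were found.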

Information Retrieval Models
Classical models: the Boolean model and the vector model. Novel models: the latent semantic indexing model and the correlation method.

Boolean model
The earliest and simplest method, still widely used in IR systems today. Based on set theory and Boolean algebra. Queries are Boolean expressions of keywords, connected by the operators AND, OR, and AND NOT. Example: (Saint Petersburg AND Russia) OR (beautiful city AND Russia).

Inverted files are widely used. Example:
Term1: doc2, doc5, doc6
Term2: doc2, doc4, doc5
Query: q = (term1 AND term2). Result: doc2, doc5.
A term-document matrix can be used as well.
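
A minimal Python sketch of this lookup, using the postings from the example above; AND becomes set intersection (OR would be set union):

    index = {
        "term1": {"doc2", "doc5", "doc6"},
        "term2": {"doc2", "doc4", "doc5"},
    }

    def boolean_and(*terms):
        # Intersect the posting sets of all query terms.
        postings = [index.get(t, set()) for t in terms]
        return set.intersection(*postings) if postings else set()

    print(sorted(boolean_and("term1", "term2")))  # ['doc2', 'doc5']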

Thinking about the Boolean model
Advantages: a very simple model based on set theory; easy to understand and implement; supports exact queries. Disadvantages: retrieval is based on a binary decision criterion, with no notion of partial matching; sets are easy, but complex Boolean expressions are not; the Boolean queries formulated by users are most often too simplistic; it retrieves too many or too few documents; it gives unranked results.

Vector Model
Why a vector model? The Boolean model only takes into account the existence or non-existence of terms in a document; it has no sense of their different contributions to the document.

Overview of the vector model
Documents and queries are represented as vectors in an index-term space whose dimension equals the vocabulary size. The components of these vectors are the weights of the corresponding index terms, reflecting their significance in terms of representative and discriminating power. Retrieval is based on whether the "query vector" and a "document vector" are close enough.

Notation
Given a set of documents D = {d_1, d_2, ..., d_m} and a finite set of terms T = {t_1, t_2, ..., t_n}, every document can be represented as a vector d_j = (w_1j, w_2j, ..., w_nj), and the same holds for the query: q = (w_1q, w_2q, ..., w_nq).

Similarity of query q and document d_j: the cosine of the angle θ between the two vectors,

    sim(q, d_j) = (q · d_j) / (|q| |d_j|)

Given a threshold, all documents with similarity above the threshold are retrieved.

Computing a good weight
A variety of weighting schemes are available. They are based on three proven principles: terms that occur in only a few documents are more valuable than ones that appear in many; the more often a term occurs in a document, the more likely it is to be important to that document; a term that appears the same number of times in a short document and in a long one is likely to be more valuable for the former.

tf-idf-dl (tf-idf) scheme
Term frequency: tf_ij = the number of occurrences of term t_i in document d_j.
Document length: DL_j = the total number of term occurrences in document d_j.
Inverse document frequency: in a collection of N documents, the inverse document frequency of a term t_i that appears in n_i documents is idf_i = log(N / n_i).
Weight: w_ij = (tf_ij / DL_j) · idf_i.
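
The following Python sketch puts the scheme together on a toy corpus; the tokenized documents and the choice of natural logarithm are assumptions, and systems differ in such details:

    import math
    from collections import Counter

    docs = [["human", "machine", "interface"],
            ["user", "interface", "system"],
            ["graph", "minors", "survey"]]
    N = len(docs)
    vocab = sorted({t for d in docs for t in d})
    n_i = {t: sum(t in d for d in docs) for t in vocab}  # docs containing t

    def weight_vector(terms):
        # w_ij = (tf_ij / DL_j) * log(N / n_i), per the scheme above.
        tf, dl = Counter(terms), len(terms)
        return [tf[t] / dl * math.log(N / n_i[t]) for t in vocab]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    q = weight_vector(["user", "interface"])
    for name, d in zip(["d1", "d2", "d3"], docs):
        print(name, round(cosine(q, weight_vector(d)), 3))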

Thinking about the vector model
Advantages: term weighting improves the quality of the answer; partial matching allows retrieval of documents that approximate the query conditions; the cosine ranking formula sorts the answer. Disadvantages: it assumes the independence of terms; the polysemy and synonymy problems remain unsolved.

Modern models of IR
Why? Problems with polysemy: bass (fish or music?). Problems with synonymy: car or automobile? These failures can be traced to two causes: the way index terms are identified is incomplete, and there is a lack of efficient methods for dealing with polysemy. The idea for solving this problem: take advantage of the implicit higher-order (latent) structure in the association of terms with documents.

Latent Semantic Indexing (LSI)
LSI overview: representing documents directly by terms suffers from unreliability, ambiguity, and redundancy. We need a method in which documents and terms are represented as vectors in a k-dimensional concept space, with weights indicating the strength of association with each of these concepts. The method should be flexible enough to remove the weak concepts, which are considered noise.

Term-document matrix
A term-document matrix A (n terms × m documents) is built:

          d1   d2   ...  dm
    t1    w11  w12  ...  w1m
    t2    w21  w22  ...  w2m
    :     :    :         :
    tn    wn1  wn2  ...  wnm

The matrix A is factored into three matrices using the singular value decomposition (SVD): A = U Σ V^T, where U and V are orthogonal matrices and Σ is the diagonal matrix of singular values.

These special matrices reveal a breakdown of the original term-document relationship into linearly independent components (factors). Many of these components are very small (they can be considered noise) and should be ignored.

Criteria for choosing k: ignore the noise, but do not lose important information. Keeping only the k largest singular values gives A_k = U_k Σ_k V_k^T, and documents and terms are represented as vectors in a k-dimensional space. The Eckart-Young theorem assures us that A_k is the best rank-k approximation of A, so important information is not lost.

Query
We need a method to map a query into the k-dimensional space. A query q can be seen as a (pseudo-)document. From the equation A = U Σ V^T we have q_k = q^T U_k Σ_k^(-1), which places the query among the document coordinates.
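
As a concrete illustration, here is a short numpy sketch of the whole pipeline, from SVD to query folding; the 5×4 binary matrix is a toy example, and real systems use sparse, iterative SVD routines:

    import numpy as np

    # Term-document matrix (rows = terms, columns = documents).
    A = np.array([[1, 0, 1, 0],
                  [1, 1, 0, 0],
                  [0, 1, 1, 0],
                  [0, 0, 1, 1],
                  [0, 0, 0, 1]], dtype=float)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2
    Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

    doc_coords = Vk @ Sk                         # documents as rows of Vk*Sk
    q = np.array([1, 1, 0, 0, 0], dtype=float)   # query over the 5 terms
    q_hat = q @ Uk @ np.linalg.inv(Sk)           # fold query into k-dim space

    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    scores = [cos(q_hat, d) for d in doc_coords]
    print(np.argsort(scores)[::-1])              # best-matching docs first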

Similarity between objects
Term-term: the dot product between two row vectors of A_k reflects the similarity between two terms. The term-term similarity matrix is A_k A_k^T = (U_k Σ_k)(U_k Σ_k)^T, so one can consider the rows of U_k Σ_k as coordinates of the terms. The relation between taking the rows of U_k Σ_k as coordinates and the rows of U_k as coordinates is a simple rescaling.

Document-document: the dot product between two column vectors of A_k reflects the similarity between two documents; A_k^T A_k = (V_k Σ_k)(V_k Σ_k)^T, so one can consider the rows of V_k Σ_k as coordinates of the documents. Term-document: this value can be obtained by looking at the corresponding element of the matrix A_k. Drawback: between- and within-comparisons cannot be done simultaneously without rescaling the coordinates.

Example (the collection from Deerwester et al.):
q1: Human machine interface for Lab ABC computer applications
q2: A survey of user opinion of computer system response time
q3: The EPS user interface management system
q4: System and human system engineering testing of EPS
q5: Relation of user-perceived response time to error measurement
q6: The generation of random, binary, unordered trees
q7: The intersection graph of paths in trees
q8: Graph minors IV: Widths of trees and well-quasi-ordering
q9: Graph minors: A survey

Numerical results

Query: human computer interaction

Updating
Folding in: new terms and documents have no effect on the representation of pre-existing documents and terms. Recomputing the SVD: requires time and memory. One must choose between these two methods.

Thinking about LSI
Advantages: the synonymy problem is solved, and documents are represented in a more reliable space, the concept space. Disadvantages: the polysemy problem is still unsolved, and special algorithms for handling large matrices must be implemented.

Correlation Method
Idea: if a keyword is present in a document, correlated keywords should be taken into account as well, so that the concepts contained in the document are not obscured by the choice of a specific vocabulary. In the vector space model the similarity vector is s = A^T q. Starting from the user query q, we now build the best query, taking the correlated keywords into account as well.

The correlation matrix is built from the term-document matrix. Let A be the term-document matrix (terms in rows, documents in columns), D the number of documents, and m = (1/D)(a_1 + ... + a_D) the mean of the document vectors a_j. The covariance matrix is C = (1/D) Σ_j (a_j − m)(a_j − m)^T; clearly C = (1/D) A A^T − m m^T. The correlation matrix S is obtained by normalizing: S_ij = C_ij / √(C_ii C_jj).

A better query
We now use the SVD to reduce noise in the keyword correlations. Since S is symmetric, its SVD has the form S = U Σ U^T; choosing the k largest factors gives the rank-k approximation S_k = U_k Σ_k U_k^T. We generate our best query as q* = S_k q; the vector of similarities is then s = A^T q*, which defines a projection of the query onto the dominant correlation structure.
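
Since the formulas on the original slides were images, the sketch below is a reconstruction under the assumptions stated above: the correlation matrix S is computed from the term-document matrix, denoised by a rank-k truncation, used to expand the query, and the documents are ranked by s = A^T q*.

    import numpy as np

    # Term-document matrix (rows = terms, columns = documents).
    A = np.array([[1, 0, 1, 0],
                  [1, 1, 0, 0],
                  [0, 1, 1, 0],
                  [0, 0, 1, 1]], dtype=float)

    C = np.cov(A)                        # term covariances over the documents
    d = np.sqrt(np.diag(C))
    S = C / np.outer(d, d)               # correlation matrix S_ij

    # Denoise: keep the k largest factors of the symmetric matrix S.
    U, s, Vt = np.linalg.svd(S)
    k = 2
    Sk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    q = np.array([1, 0, 0, 0], dtype=float)   # query mentioning only term 0
    q_best = Sk @ q                           # pull in correlated terms
    print(A.T @ q_best)                       # similarity vector over documents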

Strengths of the correlation method
In the real world, correlations between words are fairly static, and the number of terms is much more stable than the number of documents; the number of documents can be many times larger than the number of keywords. This method can therefore handle databases with a very large number of documents and does not have to update the correlation matrix every time new documents are added, which is important in electronic networks.

Conclusion
IR overview. Classical IR models: Boolean model, vector model. Modern IR models: LSI, correlation method.

Any questions?

References
Gheorghe Muresan: Using Document Clustering and Language Modelling in Mediated IR.
Georges Dupret: Latent Concepts and the Number of Orthogonal Factors in Latent Semantic Analysis.
C.J. van Rijsbergen: Information Retrieval.
Ricardo Baeza-Yates: Modern Information Retrieval.
Sandor Dominich: Mathematical Foundations of Information Retrieval.
IR lecture notes from:
Scott Deerwester, Susan T. Dumais et al.: Indexing by Latent Semantic Analysis.

Desktop Search
Design a desktop search tool that satisfies the following:
- does not affect the computer's performance
- protects privacy
- can search as many file types as possible (consider music files)
- supports multi-language search
- is user-friendly
- supports semantic search if possible