Modern Information Retrieval, Chapter 5: Query Operations
Presenter: Lin Ping-Yi (Student ID: 89522022)

Introduction
It is difficult to formulate queries that are well designed for retrieval purposes. The initial query formulation can be improved through query expansion and term reweighting. Approaches are based on:
–feedback information from the user
–information derived from the set of documents initially retrieved (called the local set of documents)
–global information derived from the document collection

User Relevance Feedback
The user is presented with a list of the retrieved documents and, after examining them, marks those which are relevant. Two basic operations:
–Query expansion: addition of new terms taken from the relevant documents
–Term reweighting: modification of term weights based on the user's relevance judgements

User Relevance Feedback
User relevance feedback is used to:
–expand queries with the vector model
–reweight query terms with the probabilistic model
–reweight query terms with a variant of the probabilistic model

Vector Model
Definitions:
–Weight: let $k_i$ be a generic index term in the set $K = \{k_1, \dots, k_t\}$. A weight $w_{i,j} > 0$ is associated with each index term $k_i$ of a document $d_j$.
–Document index term vector: the document $d_j$ is associated with an index term vector $\vec{d_j}$ represented by $\vec{d_j} = (w_{1,j}, w_{2,j}, \dots, w_{t,j})$.

Vector Model (cont'd)
From Chapter 2, the term weighting scheme:
–normalized frequency: $f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}$, where $freq_{i,j}$ is the raw frequency of $k_i$ in the document $d_j$
–inverse document frequency for $k_i$: $idf_i = \log \frac{N}{n_i}$
–document term weight: $w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}$
–query term weight: $w_{i,q} = \left(0.5 + \frac{0.5\, freq_{i,q}}{\max_l freq_{l,q}}\right) \times \log \frac{N}{n_i}$
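As a concrete illustration of these Chapter 2 formulas, here is a minimal Python sketch that computes the document term weights $w_{i,j} = f_{i,j} \times \log(N/n_i)$ for a toy tokenized collection (the function name and data layout are illustrative, not from the book):

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Compute w_{i,j} = f_{i,j} * log(N / n_i) for a list of tokenized docs.
    docs: list of lists of terms. Returns one {term: weight} dict per doc."""
    N = len(docs)
    # n_i: number of documents containing term k_i
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        freq = Counter(doc)
        max_freq = max(freq.values())  # max_l freq_{l,j}
        weights.append({
            term: (f / max_freq) * math.log(N / df[term])
            for term, f in freq.items()
        })
    return weights
```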

Vector Model (cont'd)
Definitions:
–query vector: the query vector $\vec{q}$ is defined as $\vec{q} = (w_{1,q}, w_{2,q}, \dots, w_{t,q})$
–$D_r$: set of relevant documents, as identified by the user, among the retrieved documents
–$D_n$: set of non-relevant documents among the retrieved documents
–$C_r$: set of relevant documents among all documents in the collection
–$\alpha, \beta, \gamma$: tuning constants

Query Expansion and Term Reweighting for the Vector Model
Ideal case: assume the complete set $C_r$ of documents relevant to a given query $q$ is known. The best query vector is then
$\vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d_j} \in C_r} \vec{d_j} \;-\; \frac{1}{N - |C_r|} \sum_{\vec{d_j} \notin C_r} \vec{d_j}$
The relevant documents $C_r$ are not known a priori; they are precisely what the retrieval process is looking for.

Query Expansion and Term Reweighting for the Vector Model (cont'd)
Three classic, similar ways to calculate the modified query $\vec{q}_m$:
–Standard_Rocchio: $\vec{q}_m = \alpha \vec{q} + \frac{\beta}{|D_r|} \sum_{\vec{d_j} \in D_r} \vec{d_j} - \frac{\gamma}{|D_n|} \sum_{\vec{d_j} \in D_n} \vec{d_j}$
–Ide_Regular: $\vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d_j} \in D_r} \vec{d_j} - \gamma \sum_{\vec{d_j} \in D_n} \vec{d_j}$
–Ide_Dec_Hi: $\vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d_j} \in D_r} \vec{d_j} - \gamma \, \max_{rank}(D_n)$
where $D_r$ and $D_n$ are the document sets judged relevant and non-relevant by the user, and $\max_{rank}(D_n)$ is the highest-ranked non-relevant document. A sketch of the Rocchio formula follows.
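The Standard_Rocchio formula translates almost directly into code. A minimal sketch, assuming the query and documents are already tf-idf vectors over a common vocabulary; the default values for α, β, γ are common choices, not prescribed by the chapter:

```python
import numpy as np

def rocchio(q, D_r, D_n, alpha=1.0, beta=0.75, gamma=0.15):
    """Standard_Rocchio: modify the query vector using relevance feedback.
    q: original query vector (1-D array over the vocabulary)
    D_r, D_n: lists of relevant / non-relevant document vectors."""
    q_m = alpha * np.asarray(q, dtype=float)
    if D_r:
        q_m += beta * np.mean(D_r, axis=0)   # centroid of relevant docs
    if D_n:
        q_m -= gamma * np.mean(D_n, axis=0)  # centroid of non-relevant docs
    return np.clip(q_m, 0.0, None)  # negative weights are usually dropped
```

Note that np.mean over the document vectors implements the $\frac{1}{|D_r|}\sum$ and $\frac{1}{|D_n|}\sum$ factors of the formula.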

Term Reweighting for the Probabilistic Model
Similarity: the correlation between the vectors $\vec{d_j}$ and $\vec{q}$ can be quantified as
$sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \times |\vec{q}|}$   (5.2)
[Figure: vectors $\vec{d_j}$ and $\vec{q}$ and the angle between them]
The probabilistic model ranks documents according to the probabilistic ranking principle:
–$P(k_i|R)$: the probability of observing the term $k_i$ in the set $R$ of relevant documents
–$P(k_i|\bar{R})$: the probability of observing the term $k_i$ in the set $\bar{R}$ of non-relevant documents

Term Reweighting for the Probabilistic Model
The similarity of a document $d_j$ to a query $q$ can be expressed as
$sim(d_j, q) \propto \sum_i w_{i,q} \times w_{i,j} \times \left( \log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right)$
For the initial search, this equation is estimated under the assumptions
$P(k_i|R) = 0.5$ and $P(k_i|\bar{R}) = \frac{n_i}{N}$
where $n_i$ is the number of documents which contain the index term $k_i$. This yields
$sim(d_j, q) \propto \sum_i w_{i,q} \times w_{i,j} \times \log \frac{N - n_i}{n_i}$

Term Reweighting for the Probabilistic Model (cont'd)
For the feedback search, $P(k_i|R)$ and $P(k_i|\bar{R})$ can be approximated as
$P(k_i|R) = \frac{|D_{r,i}|}{|D_r|}$,  $P(k_i|\bar{R}) = \frac{n_i - |D_{r,i}|}{N - |D_r|}$
where $D_r$ is the set of relevant documents according to the user judgement, and $D_{r,i}$ is the subset of $D_r$ composed of the documents which contain the term $k_i$. The similarity of $d_j$ to $q$ then becomes
$sim(d_j, q) \propto \sum_i w_{i,q} \times w_{i,j} \times \left( \log \frac{|D_{r,i}|}{|D_r| - |D_{r,i}|} + \log \frac{N - |D_r| - n_i + |D_{r,i}|}{n_i - |D_{r,i}|} \right)$
Note that no query expansion occurs in this procedure; only the weights of the original query terms are adjusted.

Term Reweighting for the Probabilistic Model (cont'd)
Adjustment factor:
–Because $|D_r|$ and $|D_{r,i}|$ are usually quite small, a 0.5 adjustment factor is added to the estimates of $P(k_i|R)$ and $P(k_i|\bar{R})$:
$P(k_i|R) = \frac{|D_{r,i}| + 0.5}{|D_r| + 1}$,  $P(k_i|\bar{R}) = \frac{n_i - |D_{r,i}| + 0.5}{N - |D_r| + 1}$
–An alternative is the adjustment factor $n_i/N$:
$P(k_i|R) = \frac{|D_{r,i}| + \frac{n_i}{N}}{|D_r| + 1}$,  $P(k_i|\bar{R}) = \frac{n_i - |D_{r,i}| + \frac{n_i}{N}}{N - |D_r| + 1}$
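A minimal sketch of the per-term feedback log-odds weight with the 0.5 adjustment factor (the function name and argument layout are illustrative):

```python
import math

def prob_term_weight(n_i, N, Dr_size, Dri_size):
    """Feedback log-odds weight for term k_i, with the 0.5 adjustment factor.
    n_i: docs containing k_i; N: collection size;
    Dr_size: |D_r|, relevant docs judged by the user;
    Dri_size: |D_{r,i}|, judged-relevant docs containing k_i."""
    p_rel = (Dri_size + 0.5) / (Dr_size + 1)                # P(k_i | R)
    p_nonrel = (n_i - Dri_size + 0.5) / (N - Dr_size + 1)   # P(k_i | ~R)
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)
```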

A Variant of Probabilistic Term Reweighting
In 1983, Croft extended the above weighting scheme by suggesting distinct initial search methods and by adapting the probabilistic formula to include within-document frequency weights. The variant of probabilistic term reweighting:
$sim(d_j, q) \propto \sum_i w_{i,q} \times w_{i,j} \times F_{i,j,q}$
where $F_{i,j,q}$ is a factor which depends on the triple $[k_i, d_j, q]$.

A Variant of Probabilistic Term Reweighting (cont'd)
Distinct formulations are used for the initial search and the feedback searches:
–initial search: $F_{i,j,q} = C + idf_i \times \bar{f}_{i,j}$, where $\bar{f}_{i,j} = K + (1-K)\,\frac{freq_{i,j}}{\max_l freq_{l,j}}$ is a normalized within-document frequency. The constants $C$ and $K$ should be adjusted according to the collection.
–feedback searches: $F_{i,j,q} = C + \left( \log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right) \times \bar{f}_{i,j}$, with $P(k_i|R)$ and $P(k_i|\bar{R})$ estimated as before.

Automatic Local Analysis
Clustering: the grouping of documents which satisfy a set of common properties. The goal is to automatically obtain a description for a larger cluster of relevant documents by identifying terms which are related to the query terms, such as:
–synonyms
–stemming variations
–terms with a distance of at most k words from a query term

Automatic Local Analysis (cont'd)
In a local strategy, the documents retrieved for a given query q are examined at query time to determine terms for query expansion. Two basic types of local strategy:
–Local clustering
–Local context analysis
Local strategies are suited to environments such as intranets; they are less practical for web documents.

Query Expansion Through Local Clustering
Local feedback strategies expand the query with terms correlated to the query terms. Such correlated terms are those present in local clusters built from the local document set.

Query Expansion Through Local Clustering (cont'd)
Definitions:
–Stem: let $V(s)$ be a non-empty subset of words which are grammatical variants of each other. A canonical form $s$ of $V(s)$ is called a stem. Example: if V(s) = {polish, polishing, polished}, then s = polish.
–$D_l$: the local document set, i.e., the set of documents retrieved for a given query q
Strategies for building local clusters:
–Association clusters
–Metric clusters
–Scalar clusters

Association Clusters
An association cluster is based on the co-occurrence of stems inside documents.
Definitions:
–$f_{s_i,j}$: the frequency of a stem $s_i$ in a document $d_j$
–Let $\vec{m} = (m_{ij})$ be an association matrix with $|S_l|$ rows and $|D_l|$ columns, where $m_{ij} = f_{s_i,j}$.
–The matrix $\vec{s} = \vec{m}\,\vec{m}^t$ is a local stem-stem association matrix.
–Each element $s_{u,v}$ in $\vec{s}$ expresses a correlation $c_{u,v}$ between the stems $s_u$ and $s_v$:
$c_{u,v} = \sum_{d_j \in D_l} f_{s_u,j} \times f_{s_v,j}$

Association Clusters (cont'd)
The correlation factor $c_{u,v}$ quantifies absolute frequencies of co-occurrence.
–Unnormalized association matrix: $s_{u,v} = c_{u,v}$
–Normalized: $s_{u,v} = \frac{c_{u,v}}{c_{u,u} + c_{v,v} - c_{u,v}}$
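The association matrix and the neighborhood function $S_u(n)$ defined on the next slide can be sketched as follows, assuming the stem-by-document frequency matrix m from the previous slide is available as a NumPy array (names are illustrative):

```python
import numpy as np

def association_matrix(m, normalize=True):
    """Local stem-stem association matrix s = m @ m.T.
    m: |S_l| x |D_l| array with m[i, j] = frequency of stem s_i in doc d_j."""
    c = m @ m.T  # c[u, v] = sum_j f_{s_u,j} * f_{s_v,j}
    if not normalize:
        return c
    diag = np.diag(c)
    # s_{u,v} = c_{u,v} / (c_{u,u} + c_{v,v} - c_{u,v})
    return c / (diag[:, None] + diag[None, :] - c)

def cluster(s, u, n):
    """S_u(n): indices of the n stems with the largest s[u, v], v != u."""
    row = s[u].astype(float)
    row[u] = -np.inf  # exclude the stem itself
    return np.argsort(row)[::-1][:n]
```

The same cluster function works for the metric and scalar matrices introduced below, since only the matrix changes.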

Association Clusters (cont'd)
To build local association clusters:
–Consider the u-th row in the association matrix.
–Let $S_u(n)$ be a function which takes the u-th row and returns the set of n largest values $s_{u,v}$, where v varies over the set of local stems and $v \neq u$.
–Then $S_u(n)$ defines a local association cluster around the stem $s_u$.

Metric Clusters
Two terms which occur in the same sentence seem more correlated than two terms which occur far apart in a document. It might therefore be worthwhile to factor in the distance between two terms when computing their correlation factor.

Metric Clusters (cont'd)
Let $r(k_i, k_j)$ be the distance between two keywords $k_i$ and $k_j$ in the same document. If $k_i$ and $k_j$ are in distinct documents, we take $r(k_i, k_j) = \infty$. A local stem-stem metric correlation matrix $\vec{s}$ is then defined such that each element $s_{u,v}$ expresses a metric correlation $c_{u,v}$ between the stems $s_u$ and $s_v$:
$c_{u,v} = \sum_{k_i \in V(s_u)} \sum_{k_j \in V(s_v)} \frac{1}{r(k_i, k_j)}$
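A sketch of this metric correlation, assuming that for each stem we already have the word offsets of its variants per document (the dictionary layout is an assumption for illustration; occurrence pairs in distinct documents contribute 1/∞ = 0 and are simply skipped):

```python
from itertools import product

def metric_correlation(positions_u, positions_v):
    """c_{u,v} = sum over keyword occurrence pairs of 1 / r(k_i, k_j).
    positions_u / positions_v: word offsets, per document, of the variants
    of stems s_u and s_v, e.g. {doc_id: [3, 17, ...]}."""
    c = 0.0
    for doc, pu in positions_u.items():
        pv = positions_v.get(doc, [])  # distinct documents => r = infinity
        c += sum(1.0 / abs(i - j) for i, j in product(pu, pv) if i != j)
    return c
```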

Metric Clusters (cont'd)
Given a local metric correlation matrix $\vec{s}$, to build local metric clusters:
–Consider the u-th row in the metric correlation matrix.
–Let $S_u(n)$ be a function which takes the u-th row and returns the set of n largest values $s_{u,v}$, where v varies over the set of local stems and $v \neq u$.
–Then $S_u(n)$ defines a local metric cluster around the stem $s_u$.

Scalar Clusters
Two stems with similar neighborhoods have some synonymity relationship. One way to quantify such neighborhood relationships is to arrange all correlation values $s_{u,i}$ in a vector $\vec{s_u}$, to arrange all correlation values $s_{v,i}$ in another vector $\vec{s_v}$, and to compare these vectors through a scalar measure.

Scalar Clusters (cont'd)
Let $\vec{s_u} = (s_{u,1}, s_{u,2}, \dots, s_{u,n})$ and $\vec{s_v} = (s_{v,1}, s_{v,2}, \dots, s_{v,n})$ be two vectors of correlation values for the stems $s_u$ and $s_v$. Let $\vec{s} = (s_{u,v})$ be a scalar association matrix. Each $s_{u,v}$ can be defined as the cosine of the angle between the two vectors:
$s_{u,v} = \frac{\vec{s_u} \cdot \vec{s_v}}{|\vec{s_u}| \times |\vec{s_v}|}$
Let $S_u(n)$ be a function which returns the set of n largest values $s_{u,v}$, $v \neq u$. Then $S_u(n)$ defines a scalar cluster around the stem $s_u$.
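Given any stem-stem correlation matrix (association or metric), the scalar association matrix is just the row-wise cosine; a minimal sketch:

```python
import numpy as np

def scalar_matrix(s):
    """Scalar association matrix: cosine similarity between the rows of a
    stem-stem correlation matrix s (association or metric)."""
    norms = np.linalg.norm(s, axis=1)          # |s_u| for every stem u
    return (s @ s.T) / np.outer(norms, norms)  # s_{u,v} = cos(s_u, s_v)
```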

Interactive Search Formulation
Stems (or terms) that belong to clusters associated with the query stems (or terms) can be used to expand the original query. A stem $s_u$ which belongs to a cluster (of size n) associated with another stem $s_v$ (i.e., $s_u \in S_v(n)$) is said to be a neighbor of $s_v$.

Interactive Search Formulation (cont'd)
[Figure: the stem $s_u$ as a neighbor of the stem $s_v$, inside the cluster $S_v(n)$]

Interactive Search Formulation (cont'd)
For each query stem $s_v$, select m neighbor stems from the cluster $S_v(n)$ (which might be of type association, metric, or scalar) and add them to the query. Hopefully, the additional neighbor stems will retrieve new relevant documents. $S_v(n)$ may be composed of stems obtained using correlation factors which are normalized or unnormalized:
–a normalized cluster tends to group stems which are more rare
–an unnormalized cluster tends to group stems due to their large frequencies

Interactive Search Formulation (cont'd)
Information about correlated stems can also be used to improve the search:
–Let two stems $s_u$ and $s_v$ be correlated with a correlation factor $c_{u,v}$.
–If $c_{u,v}$ is larger than a predefined threshold, then a neighbor stem of $s_u$ can also be interpreted as a neighbor stem of $s_v$, and vice versa.
–This provides greater flexibility, particularly with Boolean queries.
–Consider the expression $(s_u + s_v)$, where the + symbol stands for disjunction. Let $s_u'$ be a neighbor stem of $s_u$. Then one can try both $(s_u' + s_v)$ and $(s_u' + s_u)$ as synonym search expressions, because of the correlation given by $c_{u,v}$.

Query Expansion Through Local Context Analysis
The local context analysis procedure operates in three steps:
–1. Retrieve the top n ranked passages using the original query. This is accomplished by breaking up the documents initially retrieved by the query into fixed-length passages (for instance, of size 300 words) and ranking these passages as if they were documents.
–2. For each concept c in the top ranked passages, compute the similarity sim(q, c) between the whole query q (not individual query terms) and the concept c, using a variant of tf-idf ranking.

Query Expansion Through Local Context Analysis (cont'd)
–3. The top m ranked concepts (according to sim(q, c)) are added to the original query q. Each added concept is assigned a weight given by $1 - 0.9 \times i/m$, where i is the position of the concept in the final concept ranking. The terms in the original query q might be stressed by assigning a weight equal to 2 to each of them.
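A hypothetical end-to-end sketch of the three steps, with illustrative names; passage retrieval, concept extraction, and the tf-idf variant sim(q, c) are passed in as black boxes, since the chapter does not fix them:

```python
def expand_query_lca(query_terms, ranked_passages, concept_sim, n=100, m=10):
    """Sketch of local context analysis query expansion (names illustrative).
    ranked_passages: passages already ranked for the query, each a list of
    candidate concept strings (step 1 is assumed done upstream);
    concept_sim(q, c): a tf-idf-like similarity between query and concept."""
    # Step 2: rank candidate concepts from the top n passages against q.
    candidates = {c for passage in ranked_passages[:n] for c in passage}
    top_m = sorted(candidates, key=lambda c: concept_sim(query_terms, c),
                   reverse=True)[:m]
    # Step 3: original terms stressed with weight 2; concept at rank i
    # gets weight 1 - 0.9 * i/m.
    weights = {t: 2.0 for t in query_terms}
    for i, c in enumerate(top_m, start=1):
        weights.setdefault(c, 1 - 0.9 * i / m)  # keep weight 2 if c is a query term
    return weights
```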