Chapter 5 Query Operations
Hsin-Hsi Chen
Department of Computer Science and Information Engineering
National Taiwan University

5-2 Paraphrase Problem in IR
Users often submit queries containing terms that do not match the terms used to index the majority of the relevant documents. Relevance feedback and query modification address this mismatch by:
– reweighting the query terms based on the distribution of these terms in the relevant and non-relevant documents retrieved in response to those queries
– changing the actual terms in the query

5-3 Query Reformulation
Basic steps:
– query expansion: expanding the original query with new terms drawn from
  - feedback information from the user
  - information derived from the set of documents initially retrieved (the local set of documents)
  - global information derived from the whole document collection
– term reweighting: reweighting the terms in the expanded query

5-4 User Relevance Feedback
U (user): a query is submitted
S (system): a list of the retrieved documents is presented
U: the documents are examined and the relevant ones are marked
S: the important terms/expressions are selected from the documents that have been identified as relevant
The relevance feedback cycle is repeated several times.

5-5 User Relevance Feedback (Continued)
Advantages:
– shields the user from the details of the query reformulation
– breaks down the whole searching task into a sequence of small steps
– provides a controlled process designed to emphasize some terms (relevant ones) and de-emphasize others (non-relevant ones)

5-6 Query Expansion and Term Reweighting for the Vector Model
Basic idea:
– relevant documents resemble each other
– non-relevant documents have term-weight vectors which are dissimilar from those of the relevant documents
– the reformulated query is moved closer to the term-weight vectors of the relevant documents

5-7 [Figure: the reformulated query vector moves away from the non-relevant documents and toward the cluster of relevant documents in term-weight space]

5-8 Query Expansion and Term Reweighting for the Vector Model (Continued)
– $C_r$: the set of relevant documents in the whole collection; its complement is the set of non-relevant documents
– $D_r$: the set of relevant documents among the retrieved documents, as identified by the user
– $D_n$: the set of non-relevant documents among the retrieved documents

5-9 Query Expansion and Term Reweighting for the Vector Model (Continued)
When the complete set $C_r$ of relevant documents is known, the optimal query is
$$\vec{q}_{opt} = \frac{1}{|C_r|}\sum_{\vec{d}_j \in C_r}\vec{d}_j - \frac{1}{N-|C_r|}\sum_{\vec{d}_j \notin C_r}\vec{d}_j$$
When the set $C_r$ is not known a priori:
– formulate an initial query
– incrementally change the initial query vector using relevance feedback

5-10 Calculate the modified query $\vec{q}_m$:
– Standard Rocchio: $\vec{q}_m = \alpha\vec{q} + \frac{\beta}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j - \frac{\gamma}{|D_n|}\sum_{\vec{d}_j \in D_n}\vec{d}_j$
– Ide Regular: $\vec{q}_m = \alpha\vec{q} + \beta\sum_{\vec{d}_j \in D_r}\vec{d}_j - \gamma\sum_{\vec{d}_j \in D_n}\vec{d}_j$
– Ide Dec-Hi: $\vec{q}_m = \alpha\vec{q} + \beta\sum_{\vec{d}_j \in D_r}\vec{d}_j - \gamma\,\max_{non\text{-}relevant}(\vec{d}_j)$, where $\max_{non\text{-}relevant}(\vec{d}_j)$ is the highest ranked non-relevant document
– $\alpha, \beta, \gamma$: tuning constants (usually $\beta > \gamma$)
– $\alpha = 1$ (Rocchio, 1971); $\alpha = \beta = \gamma = 1$ (Ide, 1971)
– $\gamma = 0$: positive feedback
The document sums perform query expansion; the constants perform term reweighting. The three formulas give similar performance.
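As an illustration (not part of the original slides), here is a minimal Python sketch of the three formulas, assuming the query and documents are NumPy term-weight vectors of equal dimension; the default constants are common choices, not values fixed by the slides.

```python
import numpy as np

def rocchio(q, D_r, D_n, alpha=1.0, beta=0.75, gamma=0.15):
    """Standard Rocchio: move q toward the centroid of the relevant
    documents D_r and away from the centroid of the non-relevant D_n."""
    q_m = alpha * q
    if len(D_r):
        q_m = q_m + (beta / len(D_r)) * np.sum(D_r, axis=0)
    if len(D_n):
        q_m = q_m - (gamma / len(D_n)) * np.sum(D_n, axis=0)
    return np.maximum(q_m, 0.0)   # negative weights are commonly clipped to 0

def ide_regular(q, D_r, D_n, alpha=1.0, beta=1.0, gamma=1.0):
    """Ide Regular: plain sums instead of centroids."""
    return np.maximum(alpha * q + beta * np.sum(D_r, axis=0)
                      - gamma * np.sum(D_n, axis=0), 0.0)

def ide_dec_hi(q, D_r, top_nonrelevant, alpha=1.0, beta=1.0, gamma=1.0):
    """Ide Dec-Hi: subtract only the highest ranked non-relevant document."""
    return np.maximum(alpha * q + beta * np.sum(D_r, axis=0)
                      - gamma * top_nonrelevant, 0.0)
```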

5-11 Positive relevance feedback: $\alpha = \beta = 1$ and $\gamma = 0$

5-12 The "dec hi" method uses all relevant information but subtracts only the highest ranked non-relevant document. Feedback with query splitting solves two problems:
(1) the relevant documents identified do not form a tight cluster;
(2) non-relevant documents are scattered among certain relevant ones.
[Figure: the query is split so that each subquery covers a homogeneous cluster of relevant items]

5-13 Analysis
Advantages:
– simplicity
– good observed results
Disadvantages:
– no optimality criterion is adopted

5-14 Term Weighting for the Probabilistic Model
The similarity of a document $d_j$ to a query $q$:
$$sim(d_j, q) \sim \sum_{k_i \in q \wedge k_i \in d_j} \left(\log\frac{P(k_i|R)}{1-P(k_i|R)} + \log\frac{1-P(k_i|\bar{R})}{P(k_i|\bar{R})}\right)$$
– $P(k_i|R)$: the probability of observing term $k_i$ in the set $R$ of relevant documents
– $P(k_i|\bar{R})$: the probability of observing term $k_i$ in the set $\bar{R}$ of non-relevant documents
Initial search: $P(k_i|R) = 0.5$ and $P(k_i|\bar{R}) = n_i/N$, where $n_i$ is the number of documents containing $k_i$ and $N$ is the total number of documents.

5-15 Feedback search: the initial estimates are replaced by estimates from the user-identified sets,
$$P(k_i|R) = \frac{|D_{r,i}|}{|D_r|}, \qquad P(k_i|\bar{R}) = \frac{n_i - |D_{r,i}|}{N - |D_r|}$$
where $D_{r,i}$ is the subset of $D_r$ containing term $k_i$.

5-16 Feedback search (continued): only the weights of the query terms are recomputed with the new estimates; no query expansion occurs.

5-17 For small values of $|D_r|$ and $|D_{r,i}|$ (e.g., $|D_r| = 1$, $|D_{r,i}| = 0$) the estimates break down, so an adjustment factor is added:
– Alternative 1 (constant 0.5): $P(k_i|R) = \frac{|D_{r,i}| + 0.5}{|D_r| + 1}$, $P(k_i|\bar{R}) = \frac{n_i - |D_{r,i}| + 0.5}{N - |D_r| + 1}$
– Alternative 2 (factor $n_i/N$): $P(k_i|R) = \frac{|D_{r,i}| + n_i/N}{|D_r| + 1}$, $P(k_i|\bar{R}) = \frac{n_i - |D_{r,i}| + n_i/N}{N - |D_r| + 1}$
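A sketch of this reweighting using Alternative 1's 0.5 adjustment (the function name and argument layout are mine, not the slides'):

```python
import math

def probabilistic_weight(n_i, N, D_r_i=None, D_r=None):
    """Term weight log(P(ki|R)/(1-P(ki|R))) + log((1-P(ki|~R))/P(ki|~R)).
    With no feedback sets given, fall back to the initial-search estimates."""
    if D_r is None:
        p_r = 0.5                      # initial: P(ki|R) assumed constant
        p_n = n_i / N                  # initial: P(ki|~R) ~ document frequency
    else:
        p_r = (D_r_i + 0.5) / (D_r + 1)            # Alternative 1 smoothing
        p_n = (n_i - D_r_i + 0.5) / (N - D_r + 1)
    return math.log(p_r / (1 - p_r)) + math.log((1 - p_n) / p_n)
```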

5-18 Analysis
Advantages:
– the feedback process is directly related to the derivation of new weights for query terms
– the term reweighting is optimal under the model's assumptions
Disadvantages:
– document term weights are not considered
– weights of terms in previous query formulations are disregarded
– no query expansion is used

5-19 A Variant of Probabilistic Term Reweighting
The variant:
– uses a distinct initial search method
– includes within-document frequency weights
Initial search:
$$w_{i,j} = (C + idf_i) \times f_{i,j}, \qquad f_{i,j} = K + (1-K)\frac{freq_{i,j}}{\max_l freq_{l,j}}$$
which is similar to a tf-idf scheme.

5-20 Choice of the constants:
– $C = 0$ for automatically indexed collections or for feedback searching (allows the idf or the relevance weighting to be the dominant factor)
– $C > 0$ for manually indexed collections (allows the mere existence of a term within a document to carry more weight)
– $K = 0.3$ for the initial search over regular-length documents (documents having many multiple occurrences of a term)
– $K = 0.5$ for feedback searches
– $K = 1$ for short documents: the within-document frequency is effectively removed (it plays a minimal role)
Feedback search: the $idf_i$ factor is replaced by a relevance weight derived from the feedback information, while the within-document factor $f_{i,j}$ is retained.
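A sketch of the variant's initial-search weight under the reconstruction above; `freq_ij` and `max_freq_j` are the raw and maximum within-document frequencies, and the example values are illustrative only.

```python
import math

def variant_weight(freq_ij, max_freq_j, idf_i, C=0.0, K=0.3):
    """w_ij = (C + idf_i) * f_ij, with the within-document frequency
    normalized as f_ij = K + (1 - K) * freq_ij / max_freq_j."""
    f_ij = K + (1 - K) * freq_ij / max_freq_j
    return (C + idf_i) * f_ij

# e.g., automatically indexed collection (C = 0), regular-length document:
w = variant_weight(freq_ij=3, max_freq_j=7, idf_i=math.log(10000 / 50))
```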

5-21 Analysis
Advantages:
– the within-document frequencies are considered
– a normalized version of these frequencies is adopted
– the constants C and K give the scheme flexibility
Disadvantages:
– a more complex formulation
– no query expansion

5-22 Evaluation of Relevance Feedback
The standard recall-precision evaluation is not suitable, because the relevant documents used to reweight the query terms simply move to higher ranks, inflating the apparent improvement.
The residual collection method:
– compare only the residual collections, i.e., the initial run is remade minus the documents previously shown to the user, and this is compared with the feedback run minus the same documents
Note that the figures for the modified query $q_m$ tend to be lower than those for the original query vector $q$ on the residual collection, so the measures are comparable only across feedback methods, not with the original run.
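A sketch of the residual collection comparison; `shown` is the set of documents already presented to the user, and precision at 10 stands in for whatever effectiveness measure is actually used.

```python
def residual(ranking, shown):
    """Remove the documents previously shown to the user from a ranking."""
    return [d for d in ranking if d not in shown]

def precision_at(ranking, relevant, k=10):
    return sum(1 for d in ranking[:k] if d in relevant) / k

# Fair comparison: score both runs on the same residual collection.
# p_initial  = precision_at(residual(initial_run, shown), relevant)
# p_feedback = precision_at(residual(feedback_run, shown), relevant)
```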

5-23 Residual Collection with Partial Rank Freezing
The previously retrieved items identified as relevant are kept "frozen" at their ranks, while the previously retrieved non-relevant items are simply removed from the collection. (The example on the next slide assumes 10 documents are relevant.)

5-24 Residual Collection with Partial Rank Freezing
[Figure: worked example of partial rank freezing with the frozen relevant items and the re-ranked residual collection]

5-25 Automatic Local Analysis
User relevance feedback:
– known relevant documents contain terms which can be used to describe a larger cluster of relevant documents, with assistance from the user (clustering)
Automatic analysis:
– obtain a description (in terms of index terms) for a larger cluster of relevant documents automatically
– global strategy: a global thesaurus-like structure is built from all documents before querying
– local strategy: terms from the documents retrieved for a given query are selected at query time

5-26 Local Feedback Strategy
On the Internet the strategy is costly:
– client side: retrieving the text of, say, 100 Web documents for local analysis would take too long
– server side: analyzing the text of 100 Web documents would spend extra CPU time
Suitable applications:
– intranets
– specialized document collections, e.g., medical document collections

5-27 Query Expansion - Local Clustering
Stem:
– $V(s)$: a non-empty subset of words which are grammatical variants of each other, e.g., {polish, polishing, polished}
– a canonical form $s$ of $V(s)$ is called a stem, e.g., polish
Local document set $D_l$:
– the set of documents retrieved for a given query
Local vocabulary $V_l$ (stems $S_l$):
– the set of all distinct words (stems) in the local document set

5-28 Local Cluster
Basic concept:
– expand the query with terms correlated to the query terms
– the correlated terms are found in the local clusters built from the local document set
Local clusters:
– association clusters: based on co-occurrences of pairs of terms in documents
– metric clusters: based on the distance between two terms
– scalar clusters: terms with similar neighborhoods have some synonymity relationship

5-29 Association Clusters
Idea:
– based on the co-occurrence of stems (or terms) inside documents
Association matrix:
– $f_{s_i,j}$: the frequency of a stem $s_i$ in a document $d_j \in D_l$
– $\vec{m} = (f_{s_i,j})$: an association matrix with $|S_l|$ rows and $|D_l|$ columns
– $\vec{s} = \vec{m}\vec{m}^t$: a local stem-stem association matrix

5-30 Each element of $\vec{s}$ is a correlation between the stems $s_u$ and $s_v$:
$$c_{u,v} = \sum_{d_j \in D_l} f_{s_u,j} \times f_{s_v,j}$$
– normalized matrix: $s_{u,v} = \frac{c_{u,v}}{c_{u,u} + c_{v,v} - c_{u,v}}$
– unnormalized matrix: $s_{u,v} = c_{u,v}$
A local association cluster $S_u(n)$ around the stem $s_u$: take the $u$-th row of $\vec{s}$ and return the set of $n$ largest values $s_{u,v}$ ($u \neq v$).
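A sketch of association clusters; `freq` is assumed to be an $|S_l| \times |D_l|$ stem-by-document frequency matrix over the local document set.

```python
import numpy as np

def association_cluster(freq, u, n=5, normalize=True):
    """Return the indices of the n stems most associated with stem u."""
    c = freq @ freq.T                 # c[u,v] = sum_j f[u,j] * f[v,j]
    if normalize:
        diag = np.diag(c)
        denom = np.maximum(diag[:, None] + diag[None, :] - c, 1e-12)
        s = c / denom                 # s_uv = c_uv / (c_uu + c_vv - c_uv)
    else:
        s = c.astype(float)
    row = s[u].copy()
    row[u] = -np.inf                  # exclude s_u itself (u != v)
    return np.argsort(row)[::-1][:n]  # n largest s_{u,v}
```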

5-31 Metric Clusters
Idea:
– consider the distance between two terms in the computation of their correlation factor
Local stem-stem metric correlation matrix:
– $r(k_i, k_j)$: the number of words between keywords $k_i$ and $k_j$ in the same document
– metric correlation between stems $s_u$ and $s_v$:
$$c_{u,v} = \sum_{k_i \in V(s_u)} \sum_{k_j \in V(s_v)} \frac{1}{r(k_i, k_j)}$$

5-32
– normalized matrix: $s_{u,v} = \frac{c_{u,v}}{|V(s_u)| \times |V(s_v)|}$
– unnormalized matrix: $s_{u,v} = c_{u,v}$
A local metric cluster $S_u(n)$ around the stem $s_u$: take the $u$-th row and return the set of $n$ largest values $s_{u,v}$ ($u \neq v$).
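A sketch of the metric correlation between two stems; each argument lists the word offsets at which the variants of a stem occur in one document, and $r(k_i,k_j)$ is taken as the absolute offset difference.

```python
def metric_correlation(offsets_u, offsets_v):
    """c_uv = sum over keyword occurrence pairs of 1 / r(ki, kj)."""
    return sum(1.0 / abs(i - j)
               for i in offsets_u for j in offsets_v if i != j)

def normalized_metric(c_uv, n_variants_u, n_variants_v):
    """s_uv = c_uv / (|V(s_u)| * |V(s_v)|)."""
    return c_uv / (n_variants_u * n_variants_v)
```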

5-33 Scalar Clusters
Idea:
– two stems with similar neighborhoods have a synonymity relationship
– the relationship is indirect, induced by the neighborhood: the row corresponding to a term in the term co-occurrence matrix forms its neighborhood, and two synonyms may have a small direct correlation value $c_{u,v}$ yet very similar rows
Scalar association matrix:
$$s_{u,v} = \frac{\vec{s}_u \cdot \vec{s}_v}{|\vec{s}_u| \times |\vec{s}_v|}$$
where $\vec{s}_u$ and $\vec{s}_v$ are the rows of the association matrix for $s_u$ and $s_v$.
A local scalar cluster $S_u(n)$ around the stem $s_u$: take the $u$-th row and return the set of $n$ largest values $s_{u,v}$ ($u \neq v$).
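A sketch of scalar clusters on top of an association matrix `s` (e.g., the one computed above): the cosine of two rows measures how alike the stems' neighborhoods are.

```python
import numpy as np

def scalar_cluster(s, u, n=5):
    """Return the n stems whose neighborhood rows are most similar to row u."""
    norms = np.linalg.norm(s, axis=1)
    sims = (s @ s[u]) / (norms * norms[u] + 1e-12)   # cosine against row u
    sims[u] = -np.inf                                 # exclude s_u itself
    return np.argsort(sims)[::-1][:n]
```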

5-34 Interactive Search Formulation
Neighbors of the query term $s_v$:
– terms $s_u$ belonging to clusters associated with $s_v$, i.e., $s_u \in S_v(n)$
– such an $s_u$ is called a searchonym of $s_v$
[Figure: the cluster $S_v(n)$ of searchonyms surrounding the stem $s_v$]

5-35 Interactive Search Formulation (Continued)
Algorithm (see the sketch below):
– for each stem $s_v \in q$, select $m$ neighbor stems from the cluster $S_v(n)$ and add them to the query
– merge normalized and unnormalized clusters: unnormalized clusters tend to group stems with large frequencies, while normalized clusters tend to group stems that are more rare, so merging captures both kinds of correlation
Extension:
– let $s_u$ and $s_v$ be correlated with a correlation factor $c_{u,v}$
– if $c_{u,v}$ is larger than a predefined threshold, then a neighbor stem $s_{u'}$ of $s_u$ can also be interpreted as a neighbor stem of $s_v$, and vice versa
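A sketch of the expansion step, assuming `cluster_fn` maps a stem's row index to the indices of its $n$ nearest neighbors (any of the three cluster functions sketched above would fit).

```python
def expand_query(query_stems, cluster_fn, stem_index, stems, m=3):
    """Add m searchonyms of every query stem to the query."""
    expanded = list(query_stems)
    for s_v in query_stems:
        for v in cluster_fn(stem_index[s_v], n=m):
            if stems[v] not in expanded:
                expanded.append(stems[v])
    return expanded
```

Merging normalized and unnormalized clusters amounts to calling `cluster_fn` twice, once per similarity definition, and taking the union of the neighbor sets.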

5-36 Query Expansion through Local Context Analysis
Local analysis:
– based on the set of documents retrieved for the original query
– based on term co-occurrence inside documents
– terms closest to individual query terms are selected
Global analysis:
– based on the whole document collection
– based on term co-occurrence inside small contexts and phrase structures
– terms closest to the whole query are selected

5-37 Query Expansion through Local Context Analysis (Continued)
Local context analysis combines elements of both approaches.
Candidates:
– noun groups instead of simple keywords
– a single noun, two adjacent nouns, or three adjacent nouns
Query expansion:
– concepts are selected from the top ranked documents (as in local analysis)
– passages are used for determining co-occurrence (as in global analysis)

5-38 Query Expansion through Local Context Analysis (Continued)
Algorithm:
– retrieve the top $n$ ranked passages using the original query
– for each concept $c$ in the top ranked passages, compute the similarity $sim(q,c)$ between the whole query $q$ and the concept $c$ using a variant of tf-idf ranking
– add the top $m$ ranked concepts to the original query $q$:
  - each added concept is assigned a weight $1 - 0.9 \times i/m$, where $i$ is the concept's rank
  - each term in the original query is assigned a weight of twice its original weight

5-39 The similarity function:
$$sim(q,c) = \prod_{k_i \in q} \left(\delta + \frac{\log(f(c,k_i) \times idf_c)}{\log n}\right)^{idf_i}$$
– $n$: the number of top-ranked passages
– $f(c,k_i) = \sum_{j=1}^{n} pf_{i,j} \times pf_{c,j}$: the correlation between $c$ and $k_i$, an association-cluster style correlation computed over passages
– $pf_{i,j}$ ($pf_{c,j}$): the frequency of $k_i$ (of $c$) in the $j$-th passage
– $idf_i = \max(1, \log_{10}(N/np_i)/5)$ and $idf_c = \max(1, \log_{10}(N/np_c)/5)$
– $N$: the number of passages in the collection; $np_i$ ($np_c$): the number of passages containing $k_i$ ($c$)
– $\delta = 0.1$ avoids zero factors
When $np$ is very large (small), the quantity $\log_{10}(N/np)/5$ may fall below (rise above) 1, hence the lower bound of 1. The $idf_i$ exponent emphasizes infrequent query terms.
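A sketch of the concept scoring under the reconstruction above; `pf(x, j)` is assumed to return the frequency of term or concept `x` in the $j$-th top-ranked passage, `np_` maps `x` to the number of passages in the collection containing it, and $n > 1$ is assumed.

```python
import math

def lca_score(query_terms, concept, n, pf, np_, N, delta=0.1):
    """sim(q,c) = prod over ki of (delta + log(f(c,ki)*idf_c)/log n)^idf_i."""
    idf = lambda x: max(1.0, math.log10(N / np_[x]) / 5)
    score = 1.0
    for k in query_terms:
        f = sum(pf(k, j) * pf(concept, j) for j in range(n))
        # max(f, 1) guards against log(0) when c and ki never co-occur
        factor = delta + math.log(max(f, 1) * idf(concept)) / math.log(n)
        score *= factor ** idf(k)      # the idf_i exponent stresses rare terms
    return score
```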

5-40 Automatic Global Analysis
Local analysis:
– extracts information from the local set of documents (passages) retrieved
Global analysis:
– expands the query using information from the whole set of documents in the collection
– issues: how to build the thesaurus, and how to select the terms for query expansion
Two approaches:
– query expansion based on a similarity thesaurus
– query expansion based on a statistical thesaurus

5-41 Similarity Thesaurus
How to build the thesaurus:
– consider term-to-term relationships rather than simple co-occurrence counts
How to select the terms for query expansion:
– consider the similarity to the whole query rather than to individual query terms

5-42 Concept Space
Basic idea:
– each term is indexed by the documents in which it appears
– the roles of terms and documents are interchanged in the concept space
Notation:
– $t$: the number of terms in the collection
– $N$: the number of documents in the collection
– $f_{i,j}$: the frequency of term $k_i$ in document $d_j$
– $t_j$: the number of distinct index terms in document $d_j$
– $itf_j = \log\frac{t}{t_j}$: the inverse term frequency for document $d_j$ (a measure of $d_j$'s power to discriminate between index terms: the more distinct index terms $d_j$ contains, the lower its discriminating power)

5-43 Each term $k_i$ is associated with a vector $\vec{k}_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,N})$, where
$$w_{i,j} = \frac{\left(0.5 + 0.5\frac{f_{i,j}}{\max_j(f_{i,j})}\right) itf_j}{\sqrt{\sum_{l=1}^{N}\left(0.5 + 0.5\frac{f_{i,l}}{\max_l(f_{i,l})}\right)^2 itf_l^2}}$$
(and $w_{i,j} = 0$ if $k_i$ does not occur in $d_j$). The relationship between two terms $k_u$ and $k_v$ is computed as
$$c_{u,v} = \vec{k}_u \cdot \vec{k}_v = \sum_{d_j} w_{u,j} \times w_{v,j}$$
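A sketch of building the term vectors from a $t \times N$ term-document frequency matrix `f`, following the formulas above.

```python
import numpy as np

def term_vectors(f):
    """Rows of the result are the unit-length concept-space vectors k_i."""
    t, N = f.shape
    t_j = (f > 0).sum(axis=0)                        # distinct terms per doc
    itf = np.log(t / np.maximum(t_j, 1))             # itf_j = log(t / t_j)
    fmax = np.maximum(f.max(axis=1, keepdims=True), 1)
    tf = np.where(f > 0, 0.5 + 0.5 * f / fmax, 0.0)
    k = tf * itf                                     # per-document weights
    return k / (np.linalg.norm(k, axis=1, keepdims=True) + 1e-12)

def term_correlation(K, u, v):
    """c_uv = k_u . k_v for the global similarity thesaurus."""
    return float(K[u] @ K[v])
```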

5-44 Query Expansion using the Global Similarity Thesaurus
Represent the query in the concept space used for the representation of the index terms:
$$\vec{q} = \sum_{k_i \in q} w_{i,q}\,\vec{k}_i$$
Based on the global similarity thesaurus, compute the similarity $sim(q,k_v) = \vec{q} \cdot \vec{k}_v$ between each term $k_v$ correlated to the query terms and the whole query $q$.

5-45 Query Expansion using the Global Similarity Thesaurus (Continued)
Expand the query with the top $r$ ranked terms according to $sim(q,k_v)$; each expansion term $k_v$ is assigned the weight
$$w_{v,q'} = \frac{sim(q,k_v)}{\sum_{k_i \in q} w_{i,q}}$$
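A sketch of the expansion step: `K` is the matrix of term vectors from above, `query` holds the indices of the query terms, and `weights` their weights $w_{i,q}$.

```python
import numpy as np

def expand_with_thesaurus(K, query, weights, r=5):
    """Return the top-r expansion term indices and their weights w_{v,q'}."""
    q_vec = sum(w * K[i] for i, w in zip(query, weights))  # q in concept space
    sims = K @ q_vec                                       # sim(q, k_v) for all v
    sims[list(query)] = -np.inf                            # skip original terms
    top = np.argsort(sims)[::-1][:r]
    return list(top), list(sims[top] / sum(weights))       # per slide 5-45
```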

5-46 [Figure: a query $q = \{k_a, k_b\}$ represented as a point $q_c$ in the concept space; a term $k_v$ close to the whole query is selected as an expansion term, while terms such as $k_i$ and $k_j$ that are close to only one query term are not]

5-47 GVSM vs. Query Expansion
Unlike the generalized vector space model, which uses all term-to-term correlations in ranking, here only the top $r$ ranked terms are used for query expansion.