
Query Operations: Automatic Local Analysis

Introduction

Difficulty of formulating user queries:
- Insufficient knowledge of the collection
- Insufficient knowledge of the retrieval environment

Query reformulation involves two basic steps:
- Query expansion: expanding the original query with new terms
- Term reweighting: reweighting the terms in the expanded query

Automatic Relevance Feedback

Basic idea:
- Clustering: known relevant documents contain terms that can be used to describe a larger cluster of relevant documents.
- Obtain a description for that larger cluster automatically by identifying terms related to the query terms: synonyms, stemming variations, terms that occur close to the query terms in the text, etc.
- Two families of approaches: global analysis vs. local analysis.

Global vs. Local Analysis

Global analysis:
- All documents in the collection are used to determine a global thesaurus-like structure that defines term relationships.
- This structure can be shown to the user, who selects clusters or terms for query expansion.

Local analysis:
- The documents retrieved for a given query q are examined at query time to determine terms for query expansion.
- Two techniques, local clustering and local context analysis, work without assistance from the user.

Local Clustering

Operates solely on the documents retrieved for the current query.

Valuable because term distributions are not uniform across topic areas:
- The distinguishing terms differ from topic to topic.
- Global techniques cannot take these differences into account.

Requires significant run-time computation:
- Too costly for Web search engines.
- Useful in intranet environments and for specialized document collections.

Local Clustering

Initially use stemming to group terms. For example, the stem s = polish groups V(s) = {polish, polishing, polished}.

Definitions:
- q: the query
- D_l: the local document set (the documents retrieved for q)
- V_l: the vocabulary of D_l
- S_l: the set of distinct stems derived from V_l

Three types of clusters:
- association clusters
- metric clusters
- scalar clusters

Association Clusters

Idea: terms that co-occur frequently inside documents are likely to relate to the same concept.

The computation is simple, based on the frequency of co-occurrence of terms (more precisely, of stems) inside documents, which yields a correlation between the stems.

Association Clusters

Definitions:
- Matrix m has |S_l| rows and |D_l| columns, with m_{ij} = f_{s_i,j}, the frequency of stem s_i in document d_j.
- The correlation c_{u,v} between stems s_u and s_v is given by the matrix s = m m^t, i.e.

    c_{u,v} = \sum_{d_j \in D_l} f_{s_u,j} \cdot f_{s_v,j}

- Unnormalized variant: s_{u,v} = c_{u,v}
- Normalized variant: s_{u,v} = c_{u,v} / (c_{u,u} + c_{v,v} - c_{u,v})
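A minimal sketch of this computation in Python, assuming the retrieved documents have already been reduced to lists of stems (the variable names and the toy documents are illustrative, not from the original):

```python
from collections import Counter

# Toy local document set D_l, already stemmed (illustrative data).
docs = [
    ["polish", "surface", "polish", "wax"],
    ["polish", "wax", "car"],
    ["surface", "car", "wax"],
]

stems = sorted({s for d in docs for s in d})

# freq[s][j] = frequency of stem s in document j.
freq = {s: [Counter(d)[s] for d in docs] for s in stems}

def correlation(u, v):
    """Unnormalized association: c_{u,v} = sum_j f_{u,j} * f_{v,j}."""
    return sum(fu * fv for fu, fv in zip(freq[u], freq[v]))

def normalized(u, v):
    """Normalized association: c_{u,v} / (c_{u,u} + c_{v,v} - c_{u,v})."""
    c_uv = correlation(u, v)
    return c_uv / (correlation(u, u) + correlation(v, v) - c_uv)

print(normalized("polish", "wax"))  # 0.6 on the toy data
```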

Association Clusters

Selecting clusters:
- Normally we want a cluster of stems for each query term.
- Clusters need to be small in order to retain focus, so a fixed cluster size n is selected.

To expand query q (see the sketch below):
- Construct a cluster for each query term: identify s_q, the stem of the query term, and select the n stems s_v with the highest values s_{q,v}.
- The union of all query term clusters forms the expanded query.
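Continuing the previous sketch (it reuses `stems` and `normalized` from above), query expansion then amounts to taking, for each query stem, the n most strongly correlated stems; the function name and default n are illustrative assumptions:

```python
def expand_query(query_stems, n=2):
    """Expand a query with the top-n correlated stems per query stem."""
    expanded = set(query_stems)
    for sq in query_stems:
        # Rank all other stems by their normalized correlation with sq.
        neighbors = sorted(
            (s for s in stems if s != sq),
            key=lambda s: normalized(sq, s),
            reverse=True,
        )
        expanded.update(neighbors[:n])
    return expanded

print(expand_query(["polish"], n=1))  # adds "wax", the stem most associated with "polish"
```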

Metric Clusters

Two terms that occur near one another are more likely to be correlated than two terms that occur far apart, so metric clusters factor the distance between two terms into the computation of their correlation.

The procedure is the same as for association clusters except for the computation of c_{u,v}.

Metric Clusters

The correlation between the stems s_u and s_v is

    c_{u,v} = \sum_{k_i \in V(s_u)} \sum_{k_j \in V(s_v)} \frac{1}{r(k_i, k_j)}

where r(k_i, k_j) is the distance between the two keyword occurrences in the same document (taken as infinite when they occur in different documents).

This factor is unnormalized; a normalized variant divides by the cluster sizes, s_{u,v} = c_{u,v} / (|V(s_u)| \times |V(s_v)|).
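A minimal sketch, assuming word positions as the distance measure and a deliberately crude stand-in stemmer; both are illustrative assumptions, not the book's implementation:

```python
def stem_of(token):
    """Crude illustrative stemmer: strips -ing / -ed style endings."""
    return token.rstrip("ing").rstrip("ed")

def metric_correlation(docs, stem_u, stem_v):
    """c_{u,v} = sum over co-occurring keyword pairs of 1 / distance."""
    c = 0.0
    for doc in docs:  # each doc is a list of tokens; index = word position
        pos_u = [i for i, t in enumerate(doc) if stem_of(t) == stem_u]
        pos_v = [i for i, t in enumerate(doc) if stem_of(t) == stem_v]
        for i in pos_u:
            for j in pos_v:
                if i != j:
                    c += 1.0 / abs(i - j)  # nearer pairs contribute more
    return c

docs = [["polishing", "the", "car", "wax"], ["wax", "helps", "polish"]]
print(metric_correlation(docs, "polish", "wax"))  # 1/3 + 1/2 on the toy data
```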

Scalar Clusters

Idea: two stems with similar neighborhoods have some synonymity relationship. The relationship is indirect, induced by the neighborhood rather than by direct co-occurrence.

Quantifying such neighborhood relationships:
- Arrange all correlation values s_{u,i} in one vector.
- Arrange all correlation values s_{v,i} in another vector.
- Compare these vectors through a scalar measure; the cosine of the angle between the two vectors is a popular choice.
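A sketch of the scalar correlation, reusing `stems` and `normalized` from the association-cluster sketch as the neighborhood vectors; basing the neighborhoods on association correlations is an assumption, and metric correlations could be used instead:

```python
import math

def scalar_correlation(u, v):
    """Cosine between the neighborhood vectors of stems u and v."""
    su = [normalized(u, i) for i in stems]  # neighborhood of u
    sv = [normalized(v, i) for i in stems]  # neighborhood of v
    dot = sum(a * b for a, b in zip(su, sv))
    norm = math.sqrt(sum(a * a for a in su)) * math.sqrt(sum(b * b for b in sv))
    return dot / norm if norm else 0.0
```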

Clustering Approaches

In practice:
- Metric clusters tend to outperform association clusters.
- Using a combination of normalized and unnormalized correlation factors can be beneficial: unnormalized factors tend to group stems with large frequencies, while normalized factors tend to group stems that are rarer.

Clustering Approaches

Local approaches use the frequencies and correlations of terms and stems within the set of retrieved documents only. These frequencies and correlations may not be representative of the overall collection. How is this good? How is this bad? (It adapts the expansion to the query's own context, but it can amplify noise when the retrieved set is poor.)

Query Operations: Automatic Global Analysis

Motivation

Methods of local analysis extract information from the local set of retrieved documents to expand the query. An alternative is to expand the query using information from the whole document collection.

Until the beginning of the 1990s, such global techniques failed to yield consistent improvements in retrieval performance. With modern variants, often based on a thesaurus, this perception has changed.

Automatic Global Analysis

There are two modern variants, both based on a thesaurus-like structure built from all documents in the collection:
- Query expansion based on a similarity thesaurus
- Query expansion based on a statistical thesaurus

Similarity Thesaurus

The similarity thesaurus is based on term-to-term relationships rather than on a matrix of co-occurrence:
- These relationships are not derived directly from the co-occurrence of terms inside documents.
- They are obtained by treating the terms as concepts in a concept space.
- In this concept space, each term is indexed by the documents in which it appears: terms assume the original role of documents, while documents are interpreted as indexing elements.

Similarity Thesaurus vs. Vector Model

The frequency factor:
- Vector model: f(i,j) = freq(term k_i in doc d_j) / freq(most common term in d_j), i.e. normalized by the document's most frequent term.
- Similarity thesaurus: f(i,j) = freq(term k_i in doc d_j) / freq(of k_i in the document where it appears most), i.e. normalized by the document where the term appears most.

The inverse frequency factor:
- Vector model: idf(i) = log(# of docs in collection / # of docs containing term k_i).
- Similarity thesaurus: itf(j) = log(# of terms in collection / # of distinct terms in doc d_j), which measures how good a discriminator the document is.

Similarity Thesaurus

Definitions:
- t: number of terms in the collection
- N: number of documents in the collection
- f_{i,j}: frequency of occurrence of term k_i in document d_j
- t_j: vocabulary of document d_j
- itf_j: inverse term frequency for document d_j, given by

    itf_j = \log \frac{t}{|t_j|}

Each term k_i is represented as a vector over the documents,

    \vec{k_i} = (w_{i,1}, w_{i,2}, \ldots, w_{i,N})

where w_{i,j} is the weight associating the term and document d_j (taken as 0 for documents in which k_i does not appear):

    w_{i,j} = \frac{\left(0.5 + 0.5 \, \frac{f_{i,j}}{\max_l f_{i,l}}\right) itf_j}
                   {\sqrt{\sum_{l=1}^{N} \left(0.5 + 0.5 \, \frac{f_{i,l}}{\max_l f_{i,l}}\right)^2 itf_l^2}}
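A sketch of these term vectors in Python under the weighting just described; the toy data and helper names are illustrative assumptions:

```python
import math

# Toy collection: each document is a list of terms (illustrative data).
docs = [["a", "b", "a"], ["b", "c"], ["a", "c", "c"]]
N = len(docs)
terms = sorted({w for d in docs for w in d})
t = len(terms)

def f(term, j):
    return docs[j].count(term)

def itf(j):
    """Inverse term frequency: log(t / |vocabulary of d_j|)."""
    return math.log(t / len(set(docs[j])))

def term_vector(term):
    """Concept-space vector of a term: one weight per document."""
    maxf = max(f(term, j) for j in range(N)) or 1
    raw = [
        (0.5 + 0.5 * f(term, j) / maxf) * itf(j) if f(term, j) > 0 else 0.0
        for j in range(N)
    ]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw] if norm else raw
```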

Similarity Thesaurus

The relationship between two terms k_u and k_v is computed as a correlation factor c_{u,v} given by the scalar product of their vectors:

    c_{u,v} = \vec{k_u} \cdot \vec{k_v} = \sum_{d_j} w_{u,j} \, w_{v,j}

The global similarity thesaurus is built by computing the correlation factor c_{u,v} for every pair of indexing terms [k_u, k_v] in the collection. The computation is expensive, but it only has to be done once and can be updated incrementally.

Query Expansion Based on a Similarity Thesaurus

Query expansion is done in three steps (see the sketch below):
1. Represent the query in the concept space used for the representation of the index terms.
2. Based on the global similarity thesaurus, compute a similarity sim(q, k_v) between each term k_v correlated to the query terms and the whole query q.
3. Expand the query with the top r ranked terms according to sim(q, k_v).
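A compact sketch of the three steps, building on `terms` and `term_vector` from the previous sketch. Representing the query as the sum of its term vectors and using the dot product as sim(q, k_v) follows the concept-space idea, but the uniform query term weights here are an assumption:

```python
def expand_with_similarity_thesaurus(query_terms, r=2):
    # Step 1: represent the query in the concept space
    # (sum of its terms' vectors; query term weights assumed uniform).
    qvec = [sum(col) for col in zip(*(term_vector(kt) for kt in query_terms))]

    # Step 2: similarity between the whole query and each candidate term.
    def sim(term):
        return sum(a * b for a, b in zip(qvec, term_vector(term)))

    candidates = [k for k in terms if k not in query_terms]
    candidates.sort(key=sim, reverse=True)

    # Step 3: expand with the top-r ranked terms.
    return list(query_terms) + candidates[:r]

print(expand_with_similarity_thesaurus(["a"], r=1))
```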

Statistical Thesaurus

A global statistical thesaurus is composed of classes that group terms correlated in the context of the whole collection:
- Such correlated terms can then be used to expand the original user query.
- These terms must be low-frequency terms, but low-frequency terms are difficult to cluster directly.
- To circumvent this problem, we cluster documents into classes instead, and use the low-frequency terms in those documents to define the thesaurus classes.
- The clustering algorithm must therefore produce small and tight clusters.

Complete Link Algorithm

A document clustering algorithm (see the sketch below):
1. Place each document in a distinct cluster.
2. Compute the similarity between all pairs of clusters.
3. Determine the pair of clusters [C_u, C_v] with the highest inter-cluster similarity.
4. Merge the clusters C_u and C_v.
5. Verify a stop criterion; if it is not met, go back to step 2.
6. Return a hierarchy of clusters.

The similarity between two clusters is defined as the minimum of the similarities between all pairs of inter-cluster documents. Using the minimum ensures small, focussed clusters.
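A minimal sketch of complete-link agglomerative clustering, assuming a pairwise document similarity function `doc_sim` and a similarity threshold as the stop criterion; both are assumptions for illustration:

```python
def complete_link(docs, doc_sim, stop_sim=0.5):
    """Agglomerative clustering; cluster similarity = min pairwise doc similarity."""
    clusters = [[d] for d in docs]          # step 1: one cluster per document
    hierarchy = [list(clusters)]

    def cluster_sim(cu, cv):
        # Complete link: the *minimum* similarity between inter-cluster docs.
        return min(doc_sim(a, b) for a in cu for b in cv)

    while len(clusters) > 1:
        # Steps 2-3: find the most similar pair of clusters.
        pairs = [(i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))]
        i, j = max(pairs, key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]]))
        if cluster_sim(clusters[i], clusters[j]) < stop_sim:
            break                           # step 5: stop criterion met
        clusters[i] = clusters[i] + clusters[j]   # step 4: merge
        del clusters[j]
        hierarchy.append(list(clusters))
    return hierarchy                        # step 6: hierarchy of clusterings
```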

Generating the Thesaurus

Given the document cluster hierarchy for the whole collection:
- Which clusters become thesaurus classes?
- Which terms represent those classes?

The answers are based on three parameters, specified by an operator based on the characteristics of the collection:
- TC: threshold class
- NDC: number of documents in a class
- MIDF: minimum inverse document frequency

Selecting Thesaurus Classes

TC is the minimum similarity between two subclusters for their parent to be considered a class:
- A high value of TC makes classes smaller and more focussed.

NDC is an upper limit on the size of clusters:
- A low value of NDC restricts the selection to smaller, more focussed clusters.

Picking Terms for Each Class

Consider the set of documents in each class selected above. Only the lower-frequency terms are used for the thesaurus classes: the parameter MIDF defines the minimum inverse document frequency for a term to represent a thesaurus class. A sketch combining the three parameters follows.
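A sketch of how TC, NDC, and MIDF might be applied to the output of the complete-link clustering above; the exact traversal of the hierarchy is an assumption, since the slides only name the parameters:

```python
import math

def thesaurus_classes(clusters_with_sim, docs_terms, N, TC, NDC, MIDF):
    """Select thesaurus classes and their representative terms.

    clusters_with_sim: list of (cluster, merge_similarity) pairs, where each
    cluster is a list of document ids; docs_terms maps doc id -> set of terms.
    """
    def idf(term):
        df = sum(1 for d in docs_terms.values() if term in d)
        return math.log(N / df)

    classes = []
    for cluster, sim in clusters_with_sim:
        # A cluster becomes a class if it is tight enough (TC) and small enough (NDC).
        if sim >= TC and len(cluster) <= NDC:
            terms_in_class = set().union(*(docs_terms[d] for d in cluster))
            # Keep only low-frequency (high-idf) terms as representatives.
            reps = {t for t in terms_in_class if idf(t) >= MIDF}
            if reps:
                classes.append(reps)
    return classes
```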

Initializing TC, NDC, and MIDF

TC depends on the collection:
- Inspection of the cluster hierarchy is almost always necessary to assist in setting TC.
- A high value of TC might yield classes with too few terms; a low value yields too few classes.

NDC is easier to set once TC is fixed. MIDF can be difficult to set.

Query Expansion with a Statistical Thesaurus

Adding terms:
- Use the terms in the same thesaurus class as the terms in the query.

Weights of the new terms can be based on both:
- the original query term weights (if any), and
- the degree to which a term represents the class of the query term.
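One plausible weighting scheme as a sketch; the slides give no formula, so averaging the matching query term weights and scaling by a representativeness score is purely an illustrative assumption:

```python
def expand_statistical(query_weights, classes, representativeness):
    """query_weights: term -> weight; classes: list of term sets;
    representativeness: term -> score in [0, 1] (assumed given)."""
    expanded = dict(query_weights)
    for cls in classes:
        matched = [w for t, w in query_weights.items() if t in cls]
        if not matched:
            continue
        base = sum(matched) / len(matched)  # average matching query weight
        for t in cls:
            if t not in expanded:
                # New term's weight scaled by how well it represents the class.
                expanded[t] = base * representativeness.get(t, 0.0)
    return expanded
```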

Conclusions

An automatically generated thesaurus is a method for expanding queries. Thesaurus generation is expensive, but it is executed only once.

Query expansion based on a similarity thesaurus uses term frequencies to expand the query. Query expansion based on a statistical thesaurus uses document clustering and needs well-defined parameters.