
1 Recall: Query Reformulation Approaches
1. Relevance feedback based
   - vector model (Rocchio, ...)
   - probabilistic model (Robertson & Sparck Jones, Croft, ...)
2. Cluster-based query expansion
   1. Local analysis: derive information from the retrieved document set
   2. Global analysis: derive information from the whole corpus

2 Local Analysis
"Known relevant documents contain terms which can be used to describe a larger cluster of relevant documents." [MIR]
In relevance feedback, clusters are built through interaction with the user about the documents. Local analysis instead automatically exploits the retrieved documents, identifying terms related to those in the query.

3 Term Clusters
Association clusters: model co-occurrence of stems in the retrieved documents and expand the query with co-occurring terms
   - unnormalized: groups terms by large co-occurrence frequencies
   - normalized: groups terms by rarity
Metric clusters: also factor in the intra-document distance between term occurrences
Problem: expensive to compute on the fly (see the sketch below).
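A minimal sketch of local association clusters, assuming the retrieved documents arrive as lists of stems; the function name and interface are illustrative, not from the slides:

```python
from collections import Counter

def association_clusters(local_docs, query_stems, top_n=3, normalized=True):
    """Local association clusters (a sketch following the MIR description).

    local_docs: retrieved documents, each a list of stems.
    query_stems: stems appearing in the query.
    Returns, for each query stem, the top_n co-occurring stems to add to the query.
    """
    vocab = sorted({s for doc in local_docs for s in doc})
    # Raw co-occurrence: c[u][v] = sum over local docs of freq(u, d) * freq(v, d)
    c = {u: Counter() for u in vocab}
    for doc in local_docs:
        freqs = Counter(doc)
        for u in freqs:
            for v in freqs:
                c[u][v] += freqs[u] * freqs[v]

    expansions = {}
    for q in query_stems:
        if q not in c:
            continue
        scores = {}
        for v in vocab:
            if v == q:
                continue
            cuv = c[q][v]
            if normalized:
                # Normalized association favours rarer, more selective terms
                scores[v] = cuv / (c[q][q] + c[v][v] - cuv) if cuv else 0.0
            else:
                # Unnormalized association favours high-frequency co-occurrence
                scores[v] = cuv
        expansions[q] = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return expansions
```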

4 Global Analysis
All documents are analyzed for term relationships. Two approaches:
   - Similarity thesaurus: relates the whole query to new terms; the focus is on the concept underlying terms, so each term is indexed by the documents in which it appears.
   - Statistical thesaurus: cluster documents into a class hierarchy.

5 Similarity Thesaurus Basis
Each index term k_i is represented as a vector over the documents it occurs in, with weight

w_{i,j} = \frac{\left(0.5 + 0.5\,\frac{f_{i,j}}{\max_j f_{i,j}}\right) itf_j}{\sqrt{\sum_{l=1}^{N}\left(0.5 + 0.5\,\frac{f_{i,l}}{\max_l f_{i,l}}\right)^2 itf_l^2}}

where the inverse term frequency (itf) for document d_j is itf_j = \log\frac{t}{t_j}, N is the number of documents, t is the number of distinct terms in the collection, and t_j is the number of distinct terms in document d_j.
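A short sketch of this weighting, assuming documents come in as lists of stems; the helper name and data layout are illustrative:

```python
import math
from collections import Counter

def similarity_thesaurus_weights(docs):
    """Term-to-document weights for a similarity thesaurus (sketch).

    Each term k_i is indexed by the documents it appears in, with
    itf_j = log(t / t_j) playing the role idf plays for documents.
    docs: list of token lists.  Returns {term: {doc_index: weight}}.
    """
    t = len({tok for d in docs for tok in d})        # distinct terms in collection
    itf = [math.log(t / len(set(d))) for d in docs]  # t_j = distinct terms in d_j

    freqs = [Counter(d) for d in docs]
    terms = {tok for d in docs for tok in d}

    weights = {}
    for k in terms:
        # max frequency of term k over all documents, used to normalise f_{i,j}
        max_f = max(f[k] for f in freqs if k in f)
        raw = {j: (0.5 + 0.5 * f[k] / max_f) * itf[j]
               for j, f in enumerate(freqs) if k in f}
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0  # guard all-zero itf
        weights[k] = {j: v / norm for j, v in raw.items()}
    return weights
```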

6 Similarity Thesaurus Creation
The thesaurus is a matrix of correlation factors between indexing terms: treating each term as a vector in document space, the correlation between terms k_u and k_v is c_{u,v} = \sum_{d_j} w_{u,j}\, w_{v,j}.
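A sketch of building the correlation matrix from the term weights above; fine for a small vocabulary, and the function name is illustrative:

```python
def thesaurus_correlations(weights):
    """Correlation factors between indexing terms (sketch).

    c_{u,v} is the dot product of the two terms' document-weight vectors,
    built from the output of similarity_thesaurus_weights above.
    Returns {(u, v): c_uv} for term pairs that share at least one document.
    """
    c = {}
    terms = list(weights)
    for u in terms:
        for v in terms:
            shared = weights[u].keys() & weights[v].keys()
            if shared:
                c[(u, v)] = sum(weights[u][j] * weights[v][j] for j in shared)
    return c
```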

7 Relationship between Terms and Query — figure from Qiu & Frei, "Concept Based Query Expansion", SIGIR-93.

8 Query Expansion w/Similarity Thesaurus
1. Represent the query in the concept space of the index terms (a weight vector over the query terms).
2. Based on the global similarity thesaurus, compute the similarity between the query and each candidate term k_v: sim(q, k_v) = \sum_{k_u \in q} w_{u,q}\, c_{u,v}.
3. Expand the query with the top r ranked terms, each weighted by w_{v,q'} = \frac{sim(q, k_v)}{\sum_{k_u \in q} w_{u,q}} (see the sketch below).
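A sketch of the expansion step, assuming the query weights and the correlation matrix from the previous sketches; names are illustrative:

```python
def expand_query(query_weights, correlations, r=5):
    """Expand a query against a global similarity thesaurus (sketch).

    query_weights: {term: weight} for the original query in concept space.
    correlations: {(u, v): c_uv} as built by thesaurus_correlations above.
    Ranks candidate terms by sim(q, k_v) = sum_u w_uq * c_uv, keeps the top r,
    and weights each added term by sim(q, k_v) / sum_u w_uq.
    """
    sims = {}
    for (u, v), c_uv in correlations.items():
        if u in query_weights and v not in query_weights:
            sims[v] = sims.get(v, 0.0) + query_weights[u] * c_uv

    total_q = sum(query_weights.values())
    top = sorted(sims, key=sims.get, reverse=True)[:r]
    expanded = dict(query_weights)
    expanded.update({v: sims[v] / total_q for v in top})
    return expanded
```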

9 Global 2: Statistical Thesaurus
Thesaurus construction relies on high-discrimination, low-frequency terms, which are hard to cluster directly. So instead, classes are built by clustering similar documents. The similarity between two clusters is the minimum of the cosine (vector model) similarity between any two documents, one from each cluster.

10 Complete Link Algorithm [Crouch & Yang]
1. Place each document in a distinct cluster.
2. Compute the similarity between all pairs of clusters.
3. Determine the pair of clusters [Cu, Cv] with the highest inter-cluster similarity.
4. Merge the clusters Cu and Cv.
5. Verify a stop criterion. If this criterion is not met, go back to step 2.
6. Return a hierarchy of clusters.
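A sketch of the algorithm above over raw token lists. The similarity threshold used as the stop criterion is an assumption for illustration (Crouch & Yang's actual criterion may differ), and the flat clusters at the stopping point are returned rather than the full merge hierarchy:

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def complete_link(docs, threshold=0.2):
    """Complete-link clustering of documents (sketch)."""
    vecs = [Counter(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]          # 1. one doc per cluster

    def sim(cu, cv):
        # complete link: worst-case (minimum) similarity between the clusters
        return min(cosine(vecs[i], vecs[j]) for i in cu for j in cv)

    while len(clusters) > 1:
        # 2-3. find the most similar pair of clusters
        (u, v), best = max(
            (((a, b), sim(clusters[a], clusters[b]))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda x: x[1])
        if best < threshold:                            # 5. stop criterion (assumed)
            break
        merged = clusters[u] + clusters[v]              # 4. merge Cu and Cv
        clusters = [c for k, c in enumerate(clusters) if k not in (u, v)]
        clusters.append(merged)
    return clusters
```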

11 Hierarchy Example (from MIR notes)
Doc1 = D, D, A, B, C, A, B, C
Doc2 = E, C, E, A, A, D
Doc3 = D, C, B, B, D, A, B, C, A
Doc4 = A

12 Query Expansion w/Statistical Thesaurus
Select the terms for each class using three parameters:
   - a threshold on cluster similarity determines which clusters are used
   - NDC determines the maximum number of documents in a cluster
   - MIDF determines the minimum IDF for any term (i.e., how rare it must be)
Then compute the thesaurus class weight for the selected terms (see the sketch below).
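A rough sketch of the term-selection step for a single cluster, mirroring the three parameters on the slide; the interface is an assumption, and the class-weight computation from Crouch & Yang is not reproduced here:

```python
def select_thesaurus_terms(cluster_docs, cluster_sim, idf, sim_threshold, ndc, midf):
    """Select expansion terms from one document cluster (sketch).

    The cluster is used only if its internal (complete-link) similarity clears
    sim_threshold and it holds at most ndc documents; individual terms must
    have idf >= midf so that only rare, high-discrimination terms enter the
    thesaurus class.  cluster_docs: token lists, idf: {term: idf value}.
    """
    if cluster_sim < sim_threshold or len(cluster_docs) > ndc:
        return set()
    terms = {t for doc in cluster_docs for t in doc}
    return {t for t in terms if idf.get(t, 0.0) >= midf}
```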

13 Global Analysis Summary
The thesaurus approaches have been effective for improving queries. However, they require expensive processing (and a static corpus); statistical thesaurus generation exploits low-frequency terms better, but it is sensitive to parameter settings.

14 Relevance Feedback/Query Reformulation Summary
Relevance feedback and query expansion approaches have been shown to be effective at improving recall, sometimes at the expense of precision. Users resist relevance feedback: it takes time and understanding. Query reformulation can also be computationally costly for search engines/IR systems.

15 Search Engine Use of Query Feedback
Relevance feedback:
   - explicit feedback has been tried, but mostly abandoned
   - indirect feedback: Teoma ranks documents higher that users look at more often
Similar/Related pages or searches: suggest expanded queries or offer to search for related pages (AltaVista and MSN Search used to do this; Google's "Find Similar"; Teoma)
Web log data mining

16 Behavior-Based Ranking
AskJeeves used user behavior to change result ranking:
   - for each query Q, record which URLs are followed
   - use click-through counts to order URLs for subsequent submissions of Q
This acts as a form of pseudo-relevance feedback (see the sketch below).
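A toy sketch of this idea; the class name and structure are illustrative, not AskJeeves' actual system:

```python
from collections import defaultdict

class ClickThroughRanker:
    """Re-rank results for a query using recorded click-through counts (sketch)."""

    def __init__(self):
        # query -> url -> number of times users followed that url for the query
        self.clicks = defaultdict(lambda: defaultdict(int))

    def record_click(self, query, url):
        self.clicks[query][url] += 1

    def rerank(self, query, urls):
        counts = self.clicks[query]
        # Stable sort: ties keep the engine's original ranking
        return sorted(urls, key=lambda u: counts[u], reverse=True)
```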

17 Teoma: Indirect Relevance
Combines indirect relevance judgments with its own link analysis: "Subject-Specific Popularity ranks a site based on the number of same-subject specific pages that reference it." [Teoma.com page]
Clustering usage:
   - Refine: models communities to suggest search classifications
   - Resources: suggests authoritative sites within the designated community

18 Web Log Mining
It is standard practice for large search engines to monitor what people are querying. Goals:
   - learn associations between common terms based on a large number of queries (see the sketch below)
   - identify trends in user behavior that should be addressed by the system
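A simple illustrative sketch of mining term associations from a query log by counting how often term pairs co-occur in the same query; the function and thresholds are assumptions, not a description of any engine's pipeline:

```python
from collections import Counter
from itertools import combinations

def query_term_associations(query_log, min_count=2):
    """Count co-occurring term pairs across logged queries (sketch).

    query_log: list of query strings.  Frequent pairs suggest related terms
    that could drive query suggestions or expansion.
    """
    pair_counts = Counter()
    for q in query_log:
        terms = sorted(set(q.lower().split()))
        pair_counts.update(combinations(terms, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}
```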

