Active Feedback: UIUC TREC 2003 HARD Track Experiments
Xuehua Shen, ChengXiang Zhai
Department of Computer Science, University of Illinois at Urbana-Champaign
Goal of Participation
Our general goal is to test and extend language modeling retrieval methods for a variety of different tasks:
– HARD: active feedback (this talk)
– Robust: robust feedback (notebook paper)
– Genomics: semi-structured query model (notebook paper)
– Web: relevance propagation model (notebook paper)
Outline
– Active Feedback
– Three Methods
– HARD Track Experiment Design
– Results
– Conclusions & Future Work
What is Active Feedback?
An IR system actively selects documents for obtaining relevance judgments. If a user is willing to judge k documents, which k documents should we present in order to maximize learning effectiveness? The aim is to minimize the user's effort while maximizing what the system learns from the judgments; one way to state the problem is sketched below.
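One hedged way to write this down (the formalization and notation here are assumed for illustration, not quoted from the slides): given the query model and a candidate collection, choose the subset of k documents whose judgments are expected to be most useful for updating the query model.

```latex
% Illustrative formalization (assumed notation): pick the k documents whose
% judgments J(D) maximize the expected utility of the updated query model
% \theta_Q' for retrieval.
\[
  D^{*} \;=\; \arg\max_{\,D \subseteq \mathcal{C},\; |D| = k}\;
  \mathbb{E}_{J(D)}\!\left[\, U\!\big(\theta_{Q}' \mid \theta_{Q}, D, J(D)\big) \right]
\]
```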
Normal Relevance Feedback
[Diagram] The query is run by the retrieval engine against the document collection; the top-k results d_1, …, d_k are shown to the user; the user's judgments (d_1 +, d_2 -, …, d_k -) are fed back to the engine.
Active Feedback
[Diagram] The same loop as normal relevance feedback, except the system decides which k documents to present to the user for judgment. Can we do better than just presenting the top k? (Consider redundancy among the top-ranked documents…)
Active Feedback Methods
– Top-K (normal feedback)
– Gapped Top-K
– K-cluster centroid
Gapped Top-K and K-cluster centroid aim at high diversity among the presented documents; a sketch of the three selection strategies follows.
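A minimal sketch of the three strategies, assuming a ranked list of document ids from the initial retrieval and a vector representation for each document; the function names, the candidate pool size, and the use of scikit-learn's KMeans are illustrative choices, not details from the paper.

```python
# Minimal sketch (assumed names) of the three document-selection strategies.
# `ranked` is the ranked list of document ids from the initial retrieval;
# `doc_vectors` maps each id to a term-weight vector.
import numpy as np
from sklearn.cluster import KMeans

def top_k(ranked, k):
    """Top-K: present the k highest-ranked documents (normal feedback)."""
    return ranked[:k]

def gapped_top_k(ranked, k, gap=1):
    """Gapped Top-K: skip `gap` documents between selections to reduce redundancy."""
    return ranked[::gap + 1][:k]

def k_cluster_centroid(ranked, doc_vectors, k, pool_size=100):
    """K-cluster centroid: cluster a pool of top results into k clusters and
    present the document closest to each cluster centroid."""
    pool = ranked[:pool_size]
    X = np.array([doc_vectors[d] for d in pool])
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    selected = []
    for c in range(k):
        members = [i for i, label in enumerate(km.labels_) if label == c]
        # pick the pool document nearest to the centroid of cluster c
        best = min(members,
                   key=lambda i: np.linalg.norm(X[i] - km.cluster_centers_[c]))
        selected.append(pool[best])
    return selected
```

Top-K simply trusts the initial ranking; the gapped and centroid variants trade some rank quality for diversity, which is the redundancy argument from the previous slide.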
Evaluating Active Feedback in HARD Track
[Flow] The query is run to produce initial results (no feedback). Each active feedback method (top-k, gapped, clustering) selects 6 passages, which are placed on a clarification form; the user completes the form, and the returned judgments are used to produce the feedback results (with doc-based or passage-based query updating).
Retrieval Methods (Lemur toolkit)
– Query Q and document D are scored with Kullback-Leibler divergence.
– Active feedback selects the feedback docs F = {d_1, …, d_n}; the query model is then updated with mixture model feedback, learning only from the relevant docs.
– Default parameter settings are used.
The standard formulas are sketched below.
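For reference, a sketch of the standard KL-divergence retrieval formula and the mixture-model feedback estimate (following Zhai and Lafferty's model-based feedback, which Lemur implements); the symbols θ_Q, θ_D, θ_F, λ, and α are assumed notation rather than copied from the slides.

```latex
% KL-divergence scoring: rank documents by the (negated) divergence between
% the query model and the document model.
\[
  \mathrm{score}(Q, D) \;=\; -\,D\!\left(\theta_Q \,\|\, \theta_D\right)
  \;=\; -\sum_{w} p(w \mid \theta_Q)\,\log\frac{p(w \mid \theta_Q)}{p(w \mid \theta_D)}
\]

% Mixture-model feedback: the relevant feedback documents F are assumed to be
% generated by a mixture of a topic model \theta_F and the collection model
% p(w \mid C); \theta_F is estimated with EM and interpolated into the query.
\[
  \log p(F \mid \theta_F) \;=\; \sum_{d \in F}\sum_{w} c(w, d)\,
  \log\!\big[(1-\lambda)\,p(w \mid \theta_F) + \lambda\,p(w \mid C)\big],
  \qquad
  \theta_{Q'} \;=\; (1-\alpha)\,\theta_Q + \alpha\,\theta_F
\]
```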
Results
– Top-k is always worse than gapped top-k and the clustering method.
– Clustering generates fewer, but higher quality examples.
– Passage-based query model updating performs better than document-based updating.
Comparison of Three Active Feedback Methods
[Table] MAP of Top-K, Gapped Top-K, and Clustering on TREC 2003 (official runs) and AP88-89, evaluated both including and excluding the judged documents, together with the number of relevant documents in each feedback set (#Rel); the best value in each column is marked with * and the worst is shown in bold. The best TREC 2003 scores are 0.514, 0.326, and 0.503; on AP88-89, Gapped Top-K is best when judged documents are included (0.342) and Clustering is best when they are excluded (0.328).
– Top-K is the worst method.
– Clustering uses the fewest relevant documents.
Appropriate Evaluation of Active Feedback
– Original DB with judged docs: can't tell if the ranking of un-judged documents is improved.
– Original DB without judged docs: different methods end up with different test documents.
– New DB: shows the learning effect more explicitly, but the docs must be similar to the original docs.
A minimal sketch of the "without judged docs" variant follows.
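A minimal sketch of the second variant, with assumed helper names: before scoring a feedback run, drop the documents that were presented for judging (from both the ranking and the relevance judgments), so the score reflects only how well the un-judged documents are ranked.

```python
# Hypothetical helpers (names assumed): evaluate a feedback run on the original
# collection while excluding the documents that were presented for judging.
def exclude_judged(ranking, judged_docs):
    """Return the ranked list with all judged documents removed."""
    judged = set(judged_docs)
    return [doc_id for doc_id in ranking if doc_id not in judged]

def filter_qrels(qrels, judged_docs):
    """Remove judged documents from the relevance judgments as well,
    keeping the evaluation consistent with the filtered rankings."""
    judged = set(judged_docs)
    return {doc_id: rel for doc_id, rel in qrels.items() if doc_id not in judged}
```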
Comparison of Different Test Data (Learning on AP88-89)
[Table] MAP of Top-K, Gapped Top-K, and Clustering with feedback learned on AP88-89 and tested on three collections: AP88-89 including judged docs (best: Gapped, 0.342), AP88-89 excluding judged docs (best: Clustering, 0.328), and AP90 (best: Clustering, 0.282); best values are marked with *.
– Top-K is consistently the worst.
– Clustering generates fewer, but higher quality examples.
Effectiveness of Query Model Updating: Doc-based vs. Passage-based
[Table] MAP of the no-updating baseline versus doc-based and passage-based query model updating, for gapped and clustering judgments; the reported improvements are +5.7% and +2.7% for gapped judgments and +5.4% and +4.0% for clustering judgments.
– Mixture model query updating methods are effective.
– Passage-based updating is consistently better than doc-based updating.
A sketch of the difference between the two updating variants follows.
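A hypothetical sketch of the only difference between the two variants: whether the feedback model is estimated from the full text of the relevant documents or only from the judged passages. A maximum-likelihood unigram model stands in for the EM-estimated mixture-model topic, and all names here are assumptions for illustration.

```python
# Sketch (assumed names): doc-based vs. passage-based query model updating.
from collections import Counter

def feedback_model(texts):
    """Maximum-likelihood unigram model over the feedback texts
    (a stand-in for the EM-estimated mixture-model topic)."""
    counts = Counter(w for t in texts for w in t.split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {w: c / total for w, c in counts.items()}

def update_query_model(theta_q, feedback_texts, alpha=0.5):
    """Interpolate the original query model with the feedback model."""
    theta_f = feedback_model(feedback_texts)
    vocab = set(theta_q) | set(theta_f)
    return {w: (1 - alpha) * theta_q.get(w, 0.0) + alpha * theta_f.get(w, 0.0)
            for w in vocab}

# doc-based:     update_query_model(theta_q, [full text of each relevant doc])
# passage-based: update_query_model(theta_q, [text of each judged relevant passage])
```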
Conclusions
– Introduced the active feedback problem
– Proposed and tested three methods for active feedback (top-k, gapped top-k, clustering)
– Studied the issue of evaluating active feedback methods
– Results show that presenting the top-k is not the best strategy, and that clustering can generate fewer, higher quality feedback examples
Future Work
– Explore other methods for active feedback (e.g., negative feedback, the MMR method)
– Develop a general framework that combines all the utility factors (e.g., being informative and best for learning) and can model different questions (e.g., both term selection and relevance judgments)
– Further study how to evaluate active feedback methods