Extending Relevance Model for Relevance Feedback
Le Zhao, Chenmin Liang, Jamie Callan
Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Introduction
The TREC 2008 Relevance Feedback track defines a testbed for evaluating relevance feedback algorithms. It includes different levels of feedback, from a single relevant feedback document to over 100 judgments with at least 3 relevant documents per topic.

Goal
The design of feedback algorithms is most challenging when the amount of feedback information is minimal. We therefore aim to design a robust relevance feedback algorithm that can use even a small number of feedback documents to achieve robust performance.

Data Set
- Documents: the GOV2 collection.
- Topics: 50 topics from previous Terabyte tracks and 150 topics from Million Query tracks.
- Feedback: top documents ranked by systems from the previous tracks; judgments also come from the previous tracks.

Figure 1. Flowchart of our relevance feedback model: initial query, retrieval, top documents, user feedback, relevance-model term weighting, feedback retrieval.

The Relevance Model
A distribution over terms given the information need I (Lavrenko and Croft 2001). For a term r, the weight is P(r | I); the constant P(I) can be dropped without affecting the relative term weights. The top n terms form the relevance model Indri query
    #weight( w_1 r_1 w_2 r_2 ... w_n r_n ), where w_i = P(r_i | I),
which is interpolated with the original query:
    #weight( w Original_Query (1-w) Relevance_Model_Query )

The Extended Relevance Model
Problem setup: weight feedback terms according to both the judged relevant feedback documents and the pseudo-relevant documents, instead of building two queries and combining them; a single tuning parameter controls how much more important the true relevant documents should be than the pseudo-relevant ones.
Goal: separate out the factors that affect term weights from the two sources (number of feedback documents, number of relevant documents, P(I), etc.), so that the weights are stable across topics.
Key problem: modeling P(I), which can no longer be dropped without cost.
The decomposed extended relevance model uses the empirical judged relevance for the relevant documents and a uniform empirical document distribution (1/|Pseudo|) for the pseudo-relevant documents. These empirical distributions normalize out factors such as the number of feedback documents and the number of relevant documents, and thus correct the bias toward the majority source. A small code sketch of this weighting scheme follows below.
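As a concrete illustration of the weighting just described, the following minimal Python sketch combines the two feedback sources with uniform empirical document distributions and builds the interpolated Indri query from the Relevance Model section. It is a sketch under simplifying assumptions, not the implementation behind our TREC runs: P(r | D) is unsmoothed maximum likelihood, and rel_boost, n_terms, and orig_weight are hypothetical parameter names standing in for the single source-importance knob, the top-n cutoff, and the interpolation weight w.

    from collections import Counter

    def term_dist(doc_tokens):
        """Maximum-likelihood P(r | D) for a tokenized document."""
        counts = Counter(doc_tokens)
        total = sum(counts.values())
        return {t: c / total for t, c in counts.items()}

    def extended_relevance_model(rel_docs, pseudo_docs, rel_boost=0.7, n_terms=20):
        """Combine judged relevant and pseudo-relevant feedback documents into
        one term distribution.  Each source uses a uniform empirical document
        distribution (1/|Rel| and 1/|Pseudo|), so the number of documents per
        source does not bias the weights; rel_boost is a hypothetical single
        tuning parameter for how much more the judged documents count."""
        weights = Counter()
        for docs, source_weight in ((rel_docs, rel_boost), (pseudo_docs, 1.0 - rel_boost)):
            if not docs:
                continue
            for doc in docs:
                for term, p in term_dist(doc).items():
                    weights[term] += source_weight * p / len(docs)
        top = weights.most_common(n_terms)
        norm = sum(w for _, w in top)
        return [(term, w / norm) for term, w in top]

    def indri_feedback_query(original_query, model_terms, orig_weight=0.5):
        """Interpolate the original query with the relevance-model query:
        #weight( w #combine(original) (1-w) #weight(w1 r1 ... wn rn) )."""
        rm = " ".join(f"{w:.4f} {term}" for term, w in model_terms)
        return (f"#weight( {orig_weight} #combine( {original_query} ) "
                f"{1.0 - orig_weight} #weight( {rm} ) )")

    if __name__ == "__main__":
        rel = [["airport", "security", "screening", "security"]]
        pseudo = [["airport", "delays", "security"], ["travel", "screening", "rules"]]
        terms = extended_relevance_model(rel, pseudo, rel_boost=0.7, n_terms=10)
        print(indri_feedback_query("airport security", terms, orig_weight=0.6))

In the actual extended model, the P(I) estimates discussed in the next section enter this combination as an additional normalization; the sketch omits that step.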
Modeling P(I)
- Generated from the collection model: P(I | C), approximated by P(Q | C).
- Considering documents in the collection: max_{D in C} P(I | D), approximated by max_{D in C} P(Q | D). Intuition: a relevant document is as good as the best document in the collection.
- avg_{D in TopN} P(I | D), approximated by avg_{D in TopN} P(Q | D). Intuition: a relevant document is as good as the average of the top N documents.
The goal is to keep the term weights stable across topics with different P(I | D) values. A small sketch of these estimates appears at the end of this poster.

Experiments
Baseline: dependency model queries for increased top precision, plus pseudo relevance feedback (the relevance model) for better recall; these were among the best runs in the 2005 and 2006 Terabyte tracks.
Extended relevance model:
- Stability of the optimal tuning: per-topic tuning gives only a 3-4% improvement on feedback sets C or D, which suggests simply tuning the interpolation of the extended relevance model with the original query.
- At its optimal interpolation weight, the extended model is significantly better than relevance feedback alone when only one (the top-ranked) relevant document is used for feedback (p < 0.004 by a paired sign test).
- There is no significant difference between the merged model with top relevant-document feedback and pseudo relevance feedback.
- Performance change as the amount of feedback information increases: the training topics come from previous Terabyte (TB) and Million Query (MQ) tracks, differing from the test setting (TB only), and the training feedback documents are randomly sampled from the judgments, differing from the test setting (top-ranked by previous TREC runs). The resulting curve is almost flat while PRF gains a lot, raising the question of whether lower-ranked relevant documents are needed for effective feedback.

Conclusions & Future Work
- The extended relevance model works well; otherwise performance would vary with the number of relevant documents.
- One randomly sampled relevant document is more informative than a top-ranked relevant document.
- Merging relevance feedback and PRF is significantly better than relevance feedback alone.
- Top-ranked negative feedback documents probably carry more information for the system than top-ranked relevant feedback documents.
- Future work.
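For completeness, here is the small sketch of the three P(I) estimates referenced in the Modeling P(I) section. It is an illustration only: it assumes Dirichlet-smoothed query likelihood for P(Q | D), and the function and parameter names (query_log_likelihood, estimate_p_I, mu, coll_prob) are hypothetical rather than those used in our TREC runs.

    import math
    from collections import Counter

    def query_log_likelihood(query_terms, doc_tokens, coll_prob, mu=2500):
        """Dirichlet-smoothed log P(Q | D); coll_prob maps term -> P(t | C)."""
        counts = Counter(doc_tokens)
        doc_len = sum(counts.values())
        score = 0.0
        for q in query_terms:
            p = (counts[q] + mu * coll_prob.get(q, 1e-9)) / (doc_len + mu)
            score += math.log(p)
        return score

    def estimate_p_I(query_terms, top_docs, coll_prob, mode="avg"):
        """Three stand-ins for P(I), each approximated through the query:
           'collection' -> P(Q | C);
           'max'        -> max over the top documents of P(Q | D);
           'avg'        -> average over the top documents of P(Q | D)."""
        if mode == "collection":
            return math.exp(sum(math.log(coll_prob.get(q, 1e-9)) for q in query_terms))
        scores = [math.exp(query_log_likelihood(query_terms, d, coll_prob))
                  for d in top_docs]
        return max(scores) if mode == "max" else sum(scores) / len(scores)

The choice among the three estimates is aimed at keeping the resulting term weights stable across topics whose P(Q | D) values sit on different scales, as noted in the Modeling P(I) section.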