Paired Experiments and Interleaving for Retrieval Evaluation
Thorsten Joachims, Madhu Kurup, Filip Radlinski (Cornell University)

Paired Experiments and Interleaving for Retrieval Evaluation
Thorsten Joachims, Madhu Kurup, Filip Radlinski
Department of Computer Science and Department of Information Science, Cornell University

Decide between two Ranking Functions
- Distribution P(u,q) of users u and queries q
- Retrieval Function 1: f1(u,q) → r1
- Retrieval Function 2: f2(u,q) → r2
- Which one is better?
Example query (tj, "SVM"):
Ranking r1: 1. Kernel Machines; 2. SVM-Light Support Vector Machine; 3. School of Veterinary Medicine at UPenn; 4. An Introduction to Support Vector Machines; 5. Service Master Company; ...
Ranking r2: 1. School of Veterinary Medicine at UPenn; 2. Service Master Company; 3. Support Vector Machine; 4. Archives of SUPPORT-VECTOR-MACHINES; 5. SVM-Light Support Vector Machine; ...
Compare the utilities U(tj,"SVM",r1) and U(tj,"SVM",r2).

Measuring Utility

Name | Description | Aggregation | Hypothesized change with decreased quality
Abandonment Rate | % of queries with no click | N/A | Increase
Reformulation Rate | % of queries that are followed by a reformulation | N/A | Increase
Queries per Session | Session = no interruption of more than 30 minutes | Mean | Increase
Clicks per Query | Number of clicks | Mean | Decrease
Clicks@1 | % of queries with clicks at position 1 | N/A | Decrease
Max Reciprocal Rank* | 1/rank of the highest click | Mean | Decrease
Mean Reciprocal Rank* | Mean of 1/rank over all clicks | Mean | Decrease
Time to First Click* | Seconds before the first click | Median | Increase
Time to Last Click* | Seconds before the final click | Median | Decrease

(*) only queries with at least one click count
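For concreteness, a minimal sketch of how these absolute metrics could be computed from a click log. The log format (click ranks and click times per query) is an assumption, and reformulation rate and queries per session are omitted because they require session structure; this is not the evaluation code used in the study.

```python
from statistics import mean, median

# Hypothetical log format: one record per query, with the ranks and the
# timestamps (seconds after the result page loaded) of all clicks.
queries = [
    {"clicks": [1, 3], "click_times": [4.2, 31.0]},
    {"clicks": [],     "click_times": []},          # abandoned query
    {"clicks": [2],    "click_times": [12.5]},
]

clicked = [q for q in queries if q["clicks"]]   # starred metrics use these only

metrics = {
    # % of queries with no click (hypothesized to increase as quality drops)
    "abandonment_rate": sum(1 for q in queries if not q["clicks"]) / len(queries),
    # mean number of clicks per query (hypothesized to decrease)
    "clicks_per_query": mean(len(q["clicks"]) for q in queries),
    # % of queries with a click at position 1 (hypothesized to decrease)
    "clicks_at_1": sum(1 for q in queries if 1 in q["clicks"]) / len(queries),
    "max_reciprocal_rank": mean(1.0 / min(q["clicks"]) for q in clicked),
    "mean_reciprocal_rank": mean(mean(1.0 / r for r in q["clicks"]) for q in clicked),
    "time_to_first_click": median(min(q["click_times"]) for q in clicked),
    "time_to_last_click": median(max(q["click_times"]) for q in clicked),
}
print(metrics)
```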

ArXiv.org: User Study
- Natural user and query population
- Users in their natural context, not in a lab
- Live and operational search engine
- Ground truth by construction:
  ORIG ≻ SWAP2 ≻ SWAP4 (ORIG: hand-tuned fielded ranking; SWAP2: ORIG with 2 pairs swapped; SWAP4: ORIG with 4 pairs swapped)
  ORIG ≻ FLAT ≻ RAND (FLAT: no field weights; RAND: top 10 of FLAT shuffled)
[Radlinski et al., 2008]

ArXiv.org: Experiment Setup
Experiment setup
- Phase I (36 days): users randomly receive the ranking from ORIG, FLAT, or RAND
- Phase II (30 days): users randomly receive the ranking from ORIG, SWAP2, or SWAP4
- Users are permanently assigned to one experimental condition based on IP address and browser
Basic statistics
- ~700 queries per day, ~300 distinct users per day
Quality control and data cleaning
- Test run for 32 days
- Heuristics to identify bots and spammers
- All evaluation code was written twice and cross-validated
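Permanent assignment of users to one experimental condition can be implemented by hashing a browser fingerprint. A small sketch, assuming the fingerprint is the IP address plus user-agent string as described above; the condition names mirror Phase II.

```python
import hashlib

CONDITIONS = ["ORIG", "SWAP2", "SWAP4"]   # Phase II conditions

def assign_condition(ip: str, user_agent: str) -> str:
    """Deterministically map an (IP, browser) fingerprint to one condition,
    so a returning user always sees the same ranking function."""
    fingerprint = f"{ip}|{user_agent}".encode("utf-8")
    bucket = int(hashlib.sha256(fingerprint).hexdigest(), 16) % len(CONDITIONS)
    return CONDITIONS[bucket]

print(assign_condition("128.84.0.1", "Mozilla/5.0"))   # same input -> same condition
```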

ArXiv.org: Results
Conclusions
- None of the absolute metrics reflects the expected order.
- Most differences are not significant after one month of data.
- Analogous results for Yahoo! Search with much more data.
[Radlinski et al., 2008]

Yahoo! Search: Results
Retrieval functions
- 4 variants of the production retrieval function
Data
- 10M to 70M queries for each retrieval function
- Expert relevance judgments
Results
- Still not always significant, even after more than 10M queries per function
- Consistent with the results reported in [Chapelle et al., 2012]

Decide between two Ranking Functions (revisited)
- Distribution P(u,q) of users u and queries q
- Retrieval Function 1: f1(u,q) → r1
- Retrieval Function 2: f2(u,q) → r2
- Which one is better? Compare U(tj,"SVM",r1) and U(tj,"SVM",r2) for the example query (tj, "SVM") shown earlier.
What would Paul do? A blind paired comparison:
- Take the results from the two retrieval functions and mix them.
- FedEx the mixed results to the users.
- Users assess the relevance of the papers.
- The retrieval system that contributed more relevant papers wins.
Kantor, P. National, language-specific evaluation sites for retrieval systems and interfaces. In Proceedings of the International Conference on Computer-Assisted Information Retrieval (RIAO), 139–147.

Economic Models of Decision Making
Rational choice
- Alternatives Y
- Utility function U(y)
- Decision y* = argmax_{y ∈ Y} U(y)
Bounded rationality
- Time constraints
- Computation constraints
- Approximate U(y)
Behavioral economics
- Framing
- Fairness
- Loss aversion
- Handling uncertainty

A Model of How Users Click in Search
Model of clicking:
- Users explore the ranking down to position k.
- Users click on the most relevant-looking links in the top k.
- Users stop clicking when their time budget is used up or another action is more promising (e.g., reformulation).
- Empirically supported by [Granka et al., 2004].
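A rough simulation of this click model. The relevance scores, cutoff k, click budget, and threshold below are illustrative assumptions, not parameters estimated in [Granka et al., 2004].

```python
def simulate_clicks(ranking_relevance, k=5, budget=3, threshold=0.5):
    """Simulate a user who scans the top k results, clicks those that look
    sufficiently relevant, and stops once the click/time budget is spent.

    ranking_relevance: list of perceived-relevance scores, one per rank.
    """
    clicks = []
    for rank, rel in enumerate(ranking_relevance[:k], start=1):
        if len(clicks) >= budget:     # budget exhausted: stop scanning
            break
        if rel >= threshold:          # looks relevant enough to click
            clicks.append(rank)
    return clicks

# Example: a ranking whose 1st, 3rd and 4th results look relevant
print(simulate_clicks([0.9, 0.2, 0.7, 0.6, 0.1]))   # -> [1, 3, 4]
```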

Balanced Interleaving
For (u=tj, q="svm"):
Ranking r1 = f1(u,q): 1. Kernel Machines; 2. Support Vector Machine; 3. An Introduction to Support Vector Machines; 4. Archives of SUPPORT-VECTOR-MACHINES; 5. SVM-Light Support Vector Machine
Ranking r2 = f2(u,q): 1. Kernel Machines; 2. SVM-Light Support Vector Machine; 3. Support Vector Machine and Kernel ... References; 4. Lucent Technologies: SVM demo applet; 5. Royal Holloway Support Vector Machine
Interleaving(r1,r2): 1. Kernel Machines; 2. Support Vector Machine; 3. SVM-Light Support Vector Machine; 4. An Introduction to Support Vector Machines; 5. Support Vector Machine and Kernel ... References; 6. Archives of SUPPORT-VECTOR-MACHINES; 7. Lucent Technologies: SVM demo applet
Model of user: the better retrieval function is more likely to get more clicks.
Interpretation: (r1 ≻ r2) ↔ clicks(top-k(r1)) > clicks(top-k(r2))
Invariant: for all k, the top k of the balanced interleaving is the union of the top k1 of r1 and the top k2 of r2, with k1 = k2 ± 1.
[Joachims, 2001] [Radlinski et al., 2008]; see also [Radlinski & Craswell, 2012] [Hofmann, 2012]
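A minimal sketch of the balanced-interleaving construction, assuming plain Python lists of result titles; the function name and the shortened titles are illustrative, not code from the talk.

```python
import random

def balanced_interleave(r1, r2, length=10):
    """Build a balanced interleaving of rankings r1 and r2.

    Invariant: every prefix of the interleaved list is the union of the
    top k1 of r1 and the top k2 of r2 with |k1 - k2| <= 1.
    """
    interleaved = []
    k1 = k2 = 0
    r1_first = random.random() < 0.5          # randomize which ranking leads
    while k1 < len(r1) and k2 < len(r2) and len(interleaved) < length:
        if k1 < k2 or (k1 == k2 and r1_first):
            if r1[k1] not in interleaved:     # skip duplicates already shown
                interleaved.append(r1[k1])
            k1 += 1
        else:
            if r2[k2] not in interleaved:
                interleaved.append(r2[k2])
            k2 += 1
    return interleaved, r1_first

r1 = ["Kernel Machines", "Support Vector Machine", "Intro to SVMs",
      "SVM archives", "SVM-Light"]
r2 = ["Kernel Machines", "SVM-Light", "SVM and Kernel References",
      "Lucent SVM demo", "Royal Holloway SVM"]
print(balanced_interleave(r1, r2))
```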

Balanced Interleaving: a Problem
Example:
- Two rankings r1 and r2 that are identical up to one insertion (X): r1 = (A, B, C, D, ...), r2 = (X, A, B, C, ...)
- Interleaved ranking: (X, A, B, C, D, ...)
- A "random user" clicks uniformly on results in the interleaved ranking:
  1. Click on "X" → r2 wins
  2. Click on "A" → r1 wins
  3. Click on "B" → r1 wins
  4. Click on "C" → r1 wins
  5. Click on "D" → r1 wins
→ biased: r1 is credited for 4 of the 5 possible clicks.
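A small simulation of this example. The credit-assignment rule (count clicks in the top k of each original ranking, with k the smallest cutoff that covers every clicked result in at least one ranking) follows the interpretation on the previous slide; it is a sketch, not the exact evaluation code.

```python
def balanced_outcome(r1, r2, clicked_docs):
    """Credit assignment for balanced interleaving: compare how many clicked
    documents fall in the top k of r1 vs. r2, where k is the smallest cutoff
    covering every clicked document in at least one of the two rankings."""
    def rank(doc, ranking):
        return ranking.index(doc) + 1 if doc in ranking else len(ranking) + 1
    k = max(min(rank(d, r1), rank(d, r2)) for d in clicked_docs)
    h1 = sum(1 for d in clicked_docs if rank(d, r1) <= k)
    h2 = sum(1 for d in clicked_docs if rank(d, r2) <= k)
    return "r1" if h1 > h2 else "r2" if h2 > h1 else "tie"

r1 = ["A", "B", "C", "D"]
r2 = ["X", "A", "B", "C"]
interleaved = ["X", "A", "B", "C", "D"]   # the interleaving from this slide

# A "random user" clicks on exactly one result, uniformly at random:
wins = {"r1": 0, "r2": 0, "tie": 0}
for doc in interleaved:
    wins[balanced_outcome(r1, r2, [doc])] += 1
print(wins)   # {'r1': 4, 'r2': 1, 'tie': 0} -> systematically favors r1
```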

Team-Game Interleaving
For (u=tj, q="svm"):
Ranking r1 = f1(u,q): 1. Kernel Machines; 2. Support Vector Machine; 3. An Introduction to Support Vector Machines; 4. Archives of SUPPORT-VECTOR-MACHINES; 5. SVM-Light Support Vector Machine
Ranking r2 = f2(u,q): 1. Kernel Machines; 2. SVM-Light Support Vector Machine; 3. Support Vector Machine and Kernel ... References; 4. Lucent Technologies: SVM demo applet; 5. Royal Holloway Support Vector Machine
Interleaving(r1,r2): 1. Kernel Machines (T2); 2. Support Vector Machine (T1); 3. SVM-Light Support Vector Machine (T2); 4. An Introduction to Support Vector Machines (T1); 5. Support Vector Machine and Kernel ... References (T2); 6. Archives of SUPPORT-VECTOR-MACHINES (T1); 7. Lucent Technologies: SVM demo applet (T2)
Interpretation: (r1 ≻ r2) ↔ clicks(T1) > clicks(T2)
Invariant: for all k, the top k contains in expectation the same number of team members from each team.
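A minimal sketch of the team-game (team-draft) construction and its click-based outcome, again with illustrative lists; not the production implementation.

```python
import random

def team_draft_interleave(r1, r2, length=10):
    """Team-draft interleaving: the two 'teams' alternately pick their
    highest-ranked result not already shown, with a coin flip deciding
    which team picks first whenever both have made equally many picks."""
    interleaved, team_a, team_b = [], set(), set()

    def next_pick(ranking):
        return next((d for d in ranking if d not in interleaved), None)

    while len(interleaved) < length:
        a_doc, b_doc = next_pick(r1), next_pick(r2)
        if a_doc is None and b_doc is None:
            break
        a_turn = len(team_a) < len(team_b) or (
            len(team_a) == len(team_b) and random.random() < 0.5)
        if a_turn and a_doc is not None:
            interleaved.append(a_doc); team_a.add(a_doc)
        elif b_doc is not None:
            interleaved.append(b_doc); team_b.add(b_doc)
        else:                         # chosen team exhausted: other team picks
            interleaved.append(a_doc); team_a.add(a_doc)
    return interleaved, team_a, team_b

def team_draft_outcome(clicked_docs, team_a, team_b):
    """r1 wins if its team's results receive more clicks, and vice versa."""
    a = sum(1 for d in clicked_docs if d in team_a)
    b = sum(1 for d in clicked_docs if d in team_b)
    return "r1" if a > b else "r2" if b > a else "tie"

r1 = ["Kernel Machines", "Support Vector Machine", "Intro to SVMs"]
r2 = ["Kernel Machines", "SVM-Light", "SVM and Kernel References"]
ranking, team_a, team_b = team_draft_interleave(r1, r2)
print(ranking)
print(team_draft_outcome(["SVM-Light"], team_a, team_b))   # -> "r2"
```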

ArXiv.org: Interleaving Experiment
Experiment setup
- Phase I (36 days): balanced interleaving of (ORIG, FLAT), (FLAT, RAND), (ORIG, RAND)
- Phase II (30 days): balanced interleaving of (ORIG, SWAP2), (SWAP2, SWAP4), (ORIG, SWAP4)
Quality control and data cleaning
- Same as for the absolute metrics

ArXiv.org: Interleaving Results
[Chart: percent wins per pairing, e.g. % wins ORIG vs. % wins RAND]
Conclusions
- All interleaving experiments reflect the expected order.
- All differences are significant after one month of data.
- Same results for alternative data preprocessing.
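A minimal sketch of how significance of per-query interleaving wins could be checked with a two-sided binomial sign test (ties discarded); the win counts are invented for illustration and are not the ArXiv numbers.

```python
from math import comb

def sign_test_p(wins_a: int, wins_b: int) -> float:
    """Two-sided binomial sign test: under H0, each non-tied query is equally
    likely to favor either ranker (success probability 0.5)."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # P(at least k successes out of n) under Binomial(n, 0.5), doubled
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Illustrative counts only: 680 queries favored ORIG, 520 favored RAND
print(sign_test_p(wins_a=680, wins_b=520))   # small p-value -> significant
```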

Team-Game Interleaving: Results
[Chart: percent wins per pairing]
Paired comparison tests: summary and conclusions
- All interleaving experiments reflect the expected order.
- Most differences are significant after one month of data.
- Same results for larger-scale experiments on Yahoo and Bing.

Yahoo and Bing: Interleaving Results
Yahoo Web Search [Chapelle et al., 2012]
- Four retrieval functions (i.e., 6 paired comparisons)
- Balanced interleaving
→ All paired comparisons consistent with the ordering by NDCG.
Bing Web Search [Radlinski & Craswell, 2010]
- Five retrieval function pairs
- Team-game interleaving
→ Consistent with the ordering by NDCG where the NDCG difference is significant.

Efficiency: Interleaving vs. Explicit
Bing Web Search
- 4 retrieval function pairs
- ~12k manually judged queries
- ~200k interleaved queries
Experiment
- p = probability that NDCG identifies the correct preference on a subsample of judged queries of size y
- x = number of interleaved queries needed to reach the same p
→ About ten interleaved queries are as informative as one manually judged query.
[Radlinski & Craswell, 2010]
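One way to reproduce this kind of comparison is to subsample the judged queries, estimate how often NDCG picks the eventual winner at each subsample size, and then find how many interleaved queries reach the same agreement probability. A rough, self-contained sketch with synthetic per-query signals; the effect sizes are arbitrary, so the resulting ratio is illustrative only, whereas the ten-to-one figure above comes from the real Bing logs.

```python
import random

def agreement_rate(per_query_signal, sample_size, trials=500):
    """Probability that a random subsample of size `sample_size` picks the
    same winner as the full data set (winner = sign of the summed signal)."""
    full_winner = sum(per_query_signal) > 0
    hits = sum(
        (sum(random.sample(per_query_signal, sample_size)) > 0) == full_winner
        for _ in range(trials))
    return hits / trials

# Synthetic stand-ins: per-query NDCG differences for the judged queries and
# +1/-1 interleaving wins for the interleaved queries (real logs would be used).
random.seed(0)
ndcg_deltas = [random.gauss(0.01, 0.2) for _ in range(12_000)]
interleave_wins = [1 if random.random() < 0.52 else -1 for _ in range(200_000)]

p = agreement_rate(ndcg_deltas, sample_size=1_000)
# smallest number of interleaved queries reaching the same agreement probability
needed = next(n for n in range(500, len(interleave_wins), 2_000)
              if agreement_rate(interleave_wins, n) >= p)
print(f"NDCG on 1,000 judged queries ~ {needed} interleaved queries (p = {p:.2f})")
```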

Conclusion
- Pick Paul's brain frequently.
- Pick Paul's brain early.
- Library dust is not harmful.