
Online Search Evaluation with Interleaving
Filip Radlinski, Microsoft

Acknowledgments. This talk involves joint work with Olivier Chapelle, Nick Craswell, Katja Hofmann, Thorsten Joachims, Madhu Kurup, Anne Schuth, and Yisong Yue.

Motivation: a baseline ranking algorithm vs. a proposed ranking algorithm. Which is better?

Retrieval evaluation. Two types of retrieval evaluation:
– Offline evaluation: ask experts or users to explicitly evaluate your retrieval system. This dominates evaluation research today.
– Online evaluation: see how normal users interact with your retrieval system when just using it. The most well-known type: A/B tests.

A/B testing. Each user is assigned to one of two conditions, so they might see either Ranking A or Ranking B. Measure user interaction with the ranking they see (e.g. clicks), and look for differences between the two populations.
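To make the mechanics concrete, here is a minimal sketch of such an A/B comparison, assuming hash-based assignment of user IDs to conditions and a two-proportion z-test on click-through rate; the bucketing scheme, function names, and example numbers are illustrative, not from the talk.

```python
import hashlib
from statistics import NormalDist

def assign_condition(user_id: str, experiment: str = "ranker-test") -> str:
    """Deterministically assign a user to condition A or B by hashing their ID."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

def compare_ctr(clicks_a: int, queries_a: int, clicks_b: int, queries_b: int):
    """Two-proportion z-test on click-through rate between the two populations."""
    p_a, p_b = clicks_a / queries_a, clicks_b / queries_b
    pooled = (clicks_a + clicks_b) / (queries_a + queries_b)
    se = (pooled * (1 - pooled) * (1 / queries_a + 1 / queries_b)) ** 0.5
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_a, p_b, z, p_value

# Example: 30,000 queries per arm, slightly different click-through rates.
print(compare_ctr(clicks_a=12000, queries_a=30000, clicks_b=12400, queries_b=30000))
```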

Online evaluation with interleaving: a within-user online ranker comparison that presents results from both rankings to every user. The ranking that gets more of the clicks wins. Interleaving is designed to be unbiased, and to be much more sensitive than A/B testing.

Team draft interleaving [Radlinski et al. 2008]
Ranking A:
1. Napa Valley – The authority for lodging
2. Napa Valley Wineries – Plan your wine
3. Napa Valley College
4. Been There | Tips | Napa Valley
5. Napa Valley Wineries and Wine
6. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley)
Ranking B:
1. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley)
2. Napa Valley – The authority for lodging
3. Napa: The Story of an American Eden... (books.google.co.uk/books?isbn=...)
4. Napa Valley Hotels – Bed and Breakfast
5. NapaValley.org
6. The Napa Valley Marathon
Presented ranking (each result is credited to the team, A or B, that contributed it):
1. Napa Valley – The authority for lodging
2. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley)
3. Napa: The Story of an American Eden... (books.google.co.uk/books?isbn=...)
4. Napa Valley Wineries – Plan your wine
5. Napa Valley Hotels – Bed and Breakfast
6. Napa Valley College
7. NapaValley.org

Team draft interleaving, continued [Radlinski et al. 2008]: the same interleaved ranking is shown to a user and their click is recorded; for this click the comparison is scored as a tie.
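The two slides above walk through team-draft interleaving and its click scoring. Below is a minimal sketch of that algorithm; the function names and the tie-breaking bookkeeping are mine, but the round-based coin flips and per-team click credit follow the description above.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, length=10):
    """Build an interleaved ranking: in each round a coin flip decides which
    ranker picks first; each pick is that ranker's highest result not yet shown."""
    interleaved, teams = [], {}
    picks_a = picks_b = 0
    while len(interleaved) < length:
        # The ranker with fewer picks goes next; a coin flip breaks ties.
        a_turn = picks_a < picks_b or (picks_a == picks_b and random.random() < 0.5)
        ranking = ranking_a if a_turn else ranking_b
        doc = next((d for d in ranking if d not in teams), None)
        if doc is None:
            # This ranker has nothing new left; try the other one.
            other = ranking_b if a_turn else ranking_a
            doc = next((d for d in other if d not in teams), None)
            if doc is None:
                break
            a_turn = not a_turn
        interleaved.append(doc)
        teams[doc] = "A" if a_turn else "B"
        picks_a += a_turn
        picks_b += not a_turn
    return interleaved, teams

def score_clicks(clicked_docs, teams):
    """Credit each click to the team that contributed the clicked result."""
    credit_a = sum(1 for d in clicked_docs if teams.get(d) == "A")
    credit_b = sum(1 for d in clicked_docs if teams.get(d) == "B")
    return "A wins" if credit_a > credit_b else "B wins" if credit_b > credit_a else "tie"

# Toy version of the Napa Valley example (titles abbreviated).
ranking_a = ["lodging", "wineries", "college", "tips", "wine", "wikipedia"]
ranking_b = ["wikipedia", "lodging", "eden", "hotels", "napavalley.org", "marathon"]
shown, teams = team_draft_interleave(ranking_a, ranking_b, length=7)
print(shown, score_clicks(["wikipedia"], teams))
```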

Why might mixing rankings help? Suppose results are worth money. For some query:
– Ranker A returns results of modest value, and the user clicks.
– Ranker B returns results of higher value, and the user also clicks.
Users of A may not know what they are missing, so the difference in behaviour between the two populations is small. But if we can mix up results from A and B, we see a strong preference for B.

Comparison with A/B metrics [Chapelle et al. 2012]: experiments with two pairs of real Yahoo! rankers (Yahoo! Pair 1 and Yahoo! Pair 2) with very small differences in relevance. Figure: p-value and disagreement probability as a function of query set size.

The interleaving click model treats click == good. Interleaving corrects for position bias, yet there are other sources of bias, such as bolding of query terms in one result presentation but not another [Yue et al. 2010a].

The interleaving click model [Yue et al. 2010a]. Figure: click frequency on the bottom result, by rank of the results; the bars should be equal if bolding had no effect.

Sometimes clicks aren't even good. The satisfaction of a click can be estimated: time spent on the clicked URL is informative, and more sophisticated models also consider the query and the document, since some documents require more effort to read [Kim et al. WSDM 2014]. Time before clicking is another efficiency metric.
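As a rough illustration of the satisfaction estimates mentioned above, a dwell-time rule is the simplest option; the 30-second threshold is a common heuristic from the click-modelling literature rather than a value taken from this slide, and the document-aware variant below is purely an assumption.

```python
def satisfied_click(dwell_seconds: float, threshold: float = 30.0) -> bool:
    """Classic heuristic: a click followed by at least `threshold` seconds on the
    landing page is treated as a satisfied (SAT) click."""
    return dwell_seconds >= threshold

def satisfied_click_adjusted(dwell_seconds: float, expected_reading_seconds: float) -> bool:
    """A richer variant: compare dwell time with how long this document is expected
    to take to read (e.g. estimated from its length), so that long documents are
    not unfairly penalized. The expected-reading-time input is an assumption here."""
    return dwell_seconds >= 0.5 * expected_reading_seconds

print(satisfied_click(42.0), satisfied_click_adjusted(42.0, expected_reading_seconds=120.0))
```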

Newer A/B metrics can incorporate these signals:
– Time before clicking
– Time spent on result documents
– Estimated user satisfaction
– Bias in the click signal, e.g. position
– Anything else the domain expert cares about
Suppose I have picked an A/B metric and take it to be my target: I just want to measure it more quickly. Can I use interleaving?

An A/B metric as a gold standard [Schuth et al. SIGIR 2015]. Does interleaving agree with these A/B metrics? Team Draft agreement with each A/B metric:
– Is the page clicked? 63%
– Is rank 1 clicked? 71%
– Satisfied click? 71%
– Satisfied click at rank 1? 76%
– Time to click: 53%
– Time to click at rank 1: 45%
– Time to satisfied click: 47%
– Time to satisfied click at rank 1: 42%

An A/B metric as a gold standard [Schuth et al. SIGIR 2015]

An A/B metric as a gold standard. For each A/B metric: Team Draft agreement (1/80th-size sample), agreement of an interleaving credit function learned to match that metric, and the A/B metric's self-agreement on a subset of the same (1/80th) size:
– Is the page clicked? Team Draft 63%, learned 84% +, self-agreement 63%
– Is rank 1 clicked? Team Draft 71% *, learned 75% +, self-agreement 62%
– Satisfied click? Team Draft 71% *, learned 85% +, self-agreement 61%
– Satisfied click at rank 1? Team Draft 76% *, learned 82% +, self-agreement 60%
– Time to click: Team Draft 53%, learned 68% +, self-agreement 58%
– Time to click at rank 1: Team Draft 45%, learned 56% +, self-agreement 59%
– Time to satisfied click: Team Draft 47%, learned 63% +, self-agreement 59%
– Time to satisfied click at rank 1: Team Draft 42%, learned 50% +, self-agreement 60%

The right parameters. Target A/B metric: Satisfied click? (defined with P(Sat) > 0.5).
– Team Draft agreement: 71%
– Learned, combined features: 85% +
– Learned, P(Sat) only: 84% +
– Learned, time to click * P(Sat): 48% –
The learned thresholds (P(Sat) > 0.76 and P(Sat) > 0.26) do not match the P(Sat) > 0.5 of the metric definition: the optimal filtering parameter need not match the metric definition, but having the right feature is essential.

Does this cost sensitivity? Figure: statistical power of Team Draft interleaving compared with the "Is Sat clicked" A/B metric.

What if you instead know how you value user actions? Suppose we don't have an A/B metric in mind. Instead, suppose we know how to value users' behaviour on changed documents:
– If a user clicks on a document that moved up k positions, how much is it worth?
– If a user spends time t before clicking, how much is it worth?
– If a user spends time t' on a document, how much is it worth?
[Radlinski & Craswell, WSDM 2013]

Example credit function
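The slide's figure is not reproduced here, so the following is only an illustrative sketch of what a credit function answering the three questions above might look like; the functional form and constants are assumptions, not the credit function used in the talk.

```python
import math

def credit_for_click(rank_change: int, seconds_to_click: float, seconds_on_doc: float) -> float:
    """Illustrative credit for a single click on a changed document.
    Positive credit favours the ranker that promoted the document."""
    # A click on a document that moved up k positions earns more credit the
    # larger the move, with diminishing returns.
    position_credit = math.log1p(max(rank_change, 0))
    # Quick clicks are weighted more than clicks the user had to hunt for.
    effort_discount = 1.0 / (1.0 + seconds_to_click / 30.0)
    # Long dwell time on the landing page suggests the click was satisfying.
    dwell_bonus = 1.0 if seconds_on_doc >= 30.0 else 0.25
    return position_credit * effort_discount * dwell_bonus

# A click on a document promoted by 3 positions, found quickly, read for a while:
print(credit_for_click(rank_change=3, seconds_to_click=8.0, seconds_on_doc=95.0))
```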

Interleaving (making the rankings): given ranker A and ranker B, we generate a set of interleaved rankings that are similar to those returned by A and B in an A/B test (in the team-draft illustration, each interleaved ranking is shown 50% of the time).

We have an optimization problem!

Sensitivity. The optimization problem so far is usually under-constrained (lots of possible rankings). What else do we want? Sensitivity! Intuition:
– When we show a particular ranking (i.e. something combining results from A and B), it is always biased (interleaving says that we should be unbiased on average)
– The more biased, the less informative the outcome
– We want to show the individual rankings that are least biased
I'll skip the maths here...

Allowed interleaved rankings. Figure: the allowed interleavings of A and B, with an illustrative optimized solution assigning a probability to each.
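As a rough sketch of the optimization this part of the talk refers to: given an allowed set of interleaved rankings and a credit function, choose a probability distribution over them subject to unbiasedness constraints for a randomly clicking user. The prefix-union candidate set and rank-difference credit below are common choices but assumptions here, and the objective (spreading probability mass as evenly as the constraints allow) is a simple stand-in for the sensitivity objective of Radlinski & Craswell (2013).

```python
import numpy as np
from scipy.optimize import linprog

def candidate_rankings(a, b, depth):
    """All rankings of length `depth` in which every prefix is the union of a
    prefix of A and a prefix of B (one common choice of allowed set)."""
    out = []
    def extend(current):
        if len(current) == depth:
            out.append(tuple(current))
            return
        next_a = next((d for d in a if d not in current), None)
        next_b = next((d for d in b if d not in current), None)
        for d in {next_a, next_b} - {None}:
            extend(current + [d])
    extend([])
    return out

def credit(doc, a, b):
    """Rank-difference credit: positive when A ranks the document higher than B."""
    worst = max(len(a), len(b)) + 1
    rank_a = a.index(doc) + 1 if doc in a else worst
    rank_b = b.index(doc) + 1 if doc in b else worst
    return rank_b - rank_a

def optimize_distribution(a, b, depth):
    rankings = candidate_rankings(a, b, depth)
    n = len(rankings)
    # prefix_credit[k-1][i]: total credit handed out if a user clicked every one
    # of the top k results of ranking i.
    prefix_credit = np.array(
        [[sum(credit(d, a, b) for d in r[:k]) for r in rankings]
         for k in range(1, depth + 1)]
    )
    # Variables: p_1..p_n (probability of showing each ranking) and t = max_i p_i.
    c = np.zeros(n + 1)
    c[-1] = 1.0                                            # minimize t (spread mass)
    A_eq = np.vstack([
        np.append(np.ones(n), 0.0),                        # probabilities sum to 1
        np.hstack([prefix_credit, np.zeros((depth, 1))]),  # zero expected credit at every cutoff
    ])
    b_eq = np.append(1.0, np.zeros(depth))
    A_ub = np.hstack([np.eye(n), -np.ones((n, 1))])        # p_i <= t
    b_ub = np.zeros(n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, 1)] * (n + 1))
    return rankings, (res.x[:n] if res.success else None)

a = ["d1", "d2", "d3", "d4"]
b = ["d3", "d1", "d5", "d2"]
rankings, probs = optimize_distribution(a, b, depth=4)
if probs is None:
    print("No unbiased distribution exists for this allowed set and credit function.")
else:
    for ranking, p in zip(rankings, probs):
        if p > 1e-6:
            print(ranking, round(float(p), 3))
```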

Summary: interleaving is a sensitive online metric for evaluating rankings.
– Agreement is very high when reliable offline relevance metrics are available
– Agreement of simple interleaving algorithms with A/B metrics can be poor when relevance differences are small or ambiguous
Solutions:
– De-bias user behaviour (e.g. presentation effects)
– Optimize to a known A/B metric (if one is trusted)
– Optimize to a known user model

Thanks! Questions?