1 Online Search Evaluation with Interleaving Filip Radlinski Microsoft

2 Acknowledgments. This talk involves joint work with Olivier Chapelle, Nick Craswell, Katja Hofmann, Thorsten Joachims, Madhu Kurup, Anne Schuth, and Yisong Yue.

3 Motivation. Baseline Ranking Algorithm vs. Proposed Ranking Algorithm: which is better?

4 Retrieval evaluation. There are two types of retrieval evaluation: offline evaluation, where experts or users are asked to explicitly evaluate your retrieval system (this dominates evaluation research today), and online evaluation, where we see how normal users interact with your retrieval system when just using it (the best-known type is the A/B test).

5 A/B testing. Each user is assigned to one of two conditions, seeing either Ranking A or Ranking B. Measure user interaction with their ranking (e.g. clicks) and look for differences between the two populations.
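
As a rough illustration of the mechanics, here is a minimal Python sketch; the queries, rankers, and user_model used here are hypothetical placeholders, not anything from the talk:

    import random
    from collections import defaultdict

    def ab_test(queries, ranker_a, ranker_b, user_model, n_users=10000):
        """Minimal A/B test sketch: each simulated user lands in one condition;
        we then compare a per-impression click metric between the two groups."""
        clicks = defaultdict(list)
        for _ in range(n_users):
            condition = "A" if random.random() < 0.5 else "B"  # random assignment
            ranker = ranker_a if condition == "A" else ranker_b
            query = random.choice(queries)
            ranking = ranker(query)
            clicks[condition].append(user_model(query, ranking))  # e.g. 1 if any click
        ctr_a = sum(clicks["A"]) / len(clicks["A"])
        ctr_b = sum(clicks["B"]) / len(clicks["B"])
        return ctr_a, ctr_b  # compare the populations, e.g. with a significance test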

6 Online evaluation with interleaving. A within-user online ranker comparison: results from both rankings are presented (in a randomized combined list) to every user, and the ranking that gets more of the clicks wins. Interleaving is designed to be unbiased, and to be much more sensitive than A/B testing.

7 Team draft interleaving [Radlinski et al. 2008]

Ranking A:
1. Napa Valley – The authority for lodging... (www.napavalley.com)
2. Napa Valley Wineries - Plan your wine... (www.napavalley.com/wineries)
3. Napa Valley College (www.napavalley.edu/homex.asp)
4. Been There | Tips | Napa Valley (www.ivebeenthere.co.uk/tips/16681)
5. Napa Valley Wineries and Wine (www.napavintners.com)
6. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley)

Ranking B:
1. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley)
2. Napa Valley – The authority for lodging... (www.napavalley.com)
3. Napa: The Story of an American Eden... (books.google.co.uk/books?isbn=...)
4. Napa Valley Hotels – Bed and Breakfast... (www.napalinks.com)
5. NapaValley.org (www.napavalley.org)
6. The Napa Valley Marathon (www.napavalleymarathon.org)

Presented ranking (contributing team in brackets):
1. Napa Valley – The authority for lodging... (www.napavalley.com) [A]
2. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley) [B]
3. Napa: The Story of an American Eden... (books.google.co.uk/books?isbn=...) [B]
4. Napa Valley Wineries – Plan your wine... (www.napavalley.com/wineries) [A]
5. Napa Valley Hotels – Bed and Breakfast... (www.napalinks.com) [B]
6. Napa Valley College (www.napavalley.edu/homex.asp) [A]
7. NapaValley.org (www.napavalley.org) [B]
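
A minimal sketch of how a presented ranking like the one above can be built, assuming the standard team-draft procedure (a coin flip per round, then each side contributes its highest-ranked result not yet shown); document identifiers are whatever the rankers return:

    import random

    def team_draft_interleave(ranking_a, ranking_b, length=10):
        """Build an interleaved list: each round, a coin flip decides which ranker
        picks first; each ranker adds its highest-ranked document not yet shown,
        and that document is attributed to the ranker's team."""
        interleaved, team_a, team_b = [], set(), set()
        remaining = set(ranking_a) | set(ranking_b)
        while len(interleaved) < length and remaining - set(interleaved):
            order = ("A", "B") if random.random() < 0.5 else ("B", "A")
            for team in order:
                ranking = ranking_a if team == "A" else ranking_b
                doc = next((d for d in ranking if d not in interleaved), None)
                if doc is None:
                    continue
                interleaved.append(doc)
                (team_a if team == "A" else team_b).add(doc)
                if len(interleaved) >= length:
                    break
        return interleaved, team_a, team_b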

8 Team draft interleaving: clicks [Radlinski et al. 2008]. Rankings A and B and the presented ranking are the same as on the previous slide; the user's click is recorded, and the outcome of this impression is a tie.
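
Scoring an impression and aggregating over many impressions could then look like this (again a sketch, not the exact statistics used in the cited papers):

    def interleaving_outcome(clicked_docs, team_a, team_b):
        """Credit each click to the team that contributed the clicked document;
        the side with more credited clicks wins the impression, otherwise it is a tie."""
        a = sum(1 for d in clicked_docs if d in team_a)
        b = sum(1 for d in clicked_docs if d in team_b)
        return "A" if a > b else "B" if b > a else "tie"

    def overall_preference(outcomes):
        """Fraction of non-tied impressions won by B; 0.5 means no preference."""
        wins_a, wins_b = outcomes.count("A"), outcomes.count("B")
        return wins_b / (wins_a + wins_b) if (wins_a + wins_b) else 0.5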

9 Why might mixing rankings help? Suppose results are worth money. For some query, Ranker A returns its results and the user clicks; Ranker B returns more valuable results and the user also clicks. Users of A may not know what they are missing, so the difference in behaviour between the two populations is small. But if we can mix up results from A and B in a single list, users show a strong preference for B.

10 Comparison with A/B metrics [Chapelle et al. 2012]. Experiments with real Yahoo! rankers (with very small differences in relevance). [Figure: disagreement probability and p-value as a function of query set size, for Yahoo! Pair 1 and Yahoo! Pair 2.]

11 The interleaving click model: click == good? Interleaving corrects for position bias, yet there are other sources of bias, such as bolding in the displayed results. [Yue et al. 2010a]

12 The interleaving click model [Yue et al. 2010a]. [Figure: click frequency on the bottom result of a pair, by rank of the results; the bars should be equal if there were no effect of bolding.]

13 Sometimes clicks aren't even good. The satisfaction of a click can be estimated: time spent on the clicked URLs is informative, and more sophisticated models also consider the query and the document (some documents require more effort). Time before clicking is another efficiency metric. [Kim et al. WSDM 2014]
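
For illustration only, a crude way to approximate "satisfied" clicks from dwell time; the 30-second threshold is a common heuristic, not the model from Kim et al., and real models also use query and document features:

    def satisfied_clicks(clicks, dwell_threshold=30.0):
        """Keep only clicks whose landing-page dwell time exceeds a threshold;
        each click is assumed to be a dict with a 'dwell_time' field in seconds."""
        return [c for c in clicks if c.get("dwell_time", 0.0) >= dwell_threshold]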

14 Newer A/B metrics. Newer A/B metrics can incorporate these signals: time before clicking, time spent on result documents, estimated user satisfaction, bias in the click signal (e.g. position), and anything else the domain expert cares about. Now suppose I have picked an A/B metric and take it to be my target: I just want to measure it more quickly. Can I use interleaving?

15 An A/B metric as a gold standard [Schuth et al. SIGIR 2015]. Does interleaving agree with these A/B metrics?

A/B metric                    | Team draft agreement
Is page clicked?              | 63%
Clicked @ 1?                  | 71%
Satisfied clicked?            | 71%
Satisfied clicked @ 1?        | 76%
Time to click                 | 53%
Time to click @ 1             | 45%
Time to satisfied click       | 47%
Time to satisfied click @ 1   | 42%
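
Agreement numbers like these can be computed by running many ranker-pair experiments and checking, per experiment, whether interleaving and the A/B metric prefer the same ranker. A sketch of that bookkeeping (the exact experimental protocol is in the paper):

    def agreement(interleaving_prefs, ab_prefs):
        """Both inputs are per-experiment preferences ('A' or 'B'); agreement is
        the fraction of experiments on which the two evaluation methods match."""
        assert len(interleaving_prefs) == len(ab_prefs)
        matches = sum(1 for i, ab in zip(interleaving_prefs, ab_prefs) if i == ab)
        return matches / len(interleaving_prefs)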

16 An A/B metric as a gold standard [Schuth et al. SIGIR 2015]

17 An A/B metric as a gold standard

A/B metric                    | Team draft agreement (1/80th size) | Learned (to each metric) | A/B self-agreement on subset (1/80th size)
Is page clicked?              | 63%                                | 84% +                    | 63%
Clicked @ 1?                  | 71% *                              | 75% +                    | 62%
Satisfied clicked?            | 71% *                              | 85% +                    | 61%
Satisfied clicked @ 1?        | 76% *                              | 82% +                    | 60%
Time to click                 | 53%                                | 68% +                    | 58%
Time to click @ 1             | 45%                                | 56% +                    | 59%
Time to satisfied click       | 47%                                | 63% +                    | 59%
Time to satisfied click @ 1   | 42%                                | 50% +                    | 60%

18 The right parameters

A/B metric: Satisfied clicked?
Team draft agreement: 71%
Learned, combined: 85% +
Learned, P(Sat) only: 84% +
Learned, time to click * P(Sat): 48% –
(Filtering thresholds shown: P(Sat) > 0.5, P(Sat) > 0.76, P(Sat) > 0.26.)

The optimal filtering parameter need not match the metric definition, but having the right feature is essential.
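
One way to read "the right parameters": when interleaving is scored only on clicks whose estimated satisfaction exceeds a threshold, that threshold is a free parameter to tune against the target A/B metric. A hypothetical sketch (p_sat is an assumed satisfaction-probability estimator, not an API from the cited papers):

    def thresholded_outcome(clicked_docs, team_a, team_b, p_sat, threshold=0.5):
        """Score a team-draft impression using only clicks with P(Sat) above a
        threshold; the best-performing threshold need not be the value used in
        the metric's own definition."""
        kept = [d for d in clicked_docs if p_sat(d) > threshold]
        a = sum(1 for d in kept if d in team_a)
        b = sum(1 for d in kept if d in team_b)
        return "A" if a > b else "B" if b > a else "tie"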

19 Does this cost sensitivity? [Figure: statistical power of team draft interleaving compared with the "Is Sat clicked" A/B metric.]

20 What if you instead know how you value user actions? Suppose we don't have an A/B metric in mind. Instead, suppose we know how to value users' behaviour on changed documents: if a user clicks on a document that moved up k positions, how much is it worth? If a user spends time t before clicking, how much is it worth? If a user spends time t' on a document, how much is it worth? [Radlinski & Craswell, WSDM 2013]
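
A toy credit function of the kind described above (all weights are invented for illustration and are not values from the paper):

    def click_credit(rank_change, time_to_click, dwell_time):
        """Value a click on a changed document: reward documents the evaluated
        ranker moved up, discount clicks that took a long time to happen, and
        boost clicks followed by a long dwell on the document."""
        credit = max(rank_change, 0)                  # positions moved up
        credit *= 1.0 / (1.0 + time_to_click / 10.0)  # slow clicks count less
        credit *= min(dwell_time / 30.0, 2.0)         # capped dwell-time boost
        return credit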

21 Example credit function. [Figure: an example credit function illustrated on three-result rankings.]

22 Interleaving (making the rankings). We generate a set of candidate rankings that are similar to those returned by A and B in an A/B test. [Figure: rankings from Ranker A and Ranker B, and the candidate interleaved rankings; the team draft candidates are shown 50% of the time each.]
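
A sketch of generating the candidate rankings, assuming the prefix constraint as I understand it from the paper: at every position the next result must be the highest-ranked not-yet-shown result of either A or B, so every prefix of a candidate is the union of a prefix of A and a prefix of B. The rankings team draft can produce are a subset of this candidate set.

    def allowed_rankings(ranking_a, ranking_b, length):
        """Enumerate candidate interleaved rankings under the prefix constraint."""
        def extend(prefix):
            if len(prefix) == length:
                yield tuple(prefix)
                return
            next_docs = set()
            for ranking in (ranking_a, ranking_b):
                doc = next((d for d in ranking if d not in prefix), None)
                if doc is not None:
                    next_docs.add(doc)
            if not next_docs:          # both rankings exhausted
                yield tuple(prefix)
                return
            for doc in next_docs:
                yield from extend(prefix + [doc])
        return set(extend([]))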

23 We have an optimization problem!

24 Sensitivity. The optimization problem so far is usually under-constrained (lots of possible rankings). What else do we want? Sensitivity! Intuition: when we show a particular interleaved ranking (i.e. something combining results from A and B), it is always biased; interleaving only requires that we be unbiased on average. The more biased the shown ranking, the less informative the outcome, so we want to show the individual rankings that are least biased. I'll skip the maths here; a simplified sketch follows below.
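
For the curious, a very simplified sketch of what such an optimization could look like (my own simplification, not the exact program from Radlinski & Craswell 2013): pick display probabilities over the candidate rankings so that a user who clicks uniformly at random produces zero expected net credit at every rank cutoff, while putting less probability on heavily imbalanced rankings.

    import numpy as np
    from scipy.optimize import linprog

    def choose_display_probabilities(credit_matrix):
        """credit_matrix[i][k]: net credit (A minus B) from a click at rank k of
        candidate ranking i. Solve a small LP: probabilities sum to one, random
        clicking yields zero expected net credit at every cutoff (unbiasedness),
        and the objective discourages showing rankings with large total imbalance."""
        credit = np.asarray(credit_matrix, dtype=float)
        n, k = credit.shape
        prefix = np.cumsum(credit, axis=1)              # cumulative credit per cutoff
        a_eq = np.vstack([prefix.T, np.ones((1, n))])   # k unbiasedness rows + sum-to-one
        b_eq = np.append(np.zeros(k), 1.0)
        cost = np.abs(credit).sum(axis=1)               # imbalance of each ranking
        result = linprog(cost, A_eq=a_eq, b_eq=b_eq, bounds=[(0.0, 1.0)] * n)
        return result.x if result.success else None     # None if no unbiased mix exists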

25 Allowed interleaved rankings. [Figure: an illustrative optimized solution, assigning a display probability to each allowed interleaved ranking built from A and B.]

26 Summary. Interleaving is a sensitive online metric for evaluating rankings. Agreement is very high when reliable offline relevance metrics are available, but the agreement of simple interleaving algorithms with A/B metrics can be poor when relevance differences are small or ambiguous. Solutions: de-bias user behaviour (e.g. presentation effects), optimize to a known A/B metric (if one is trusted), or optimize to a known user model.

27 Thanks! Questions?

