Beliefs & Biases in Web Search Ryen White Microsoft Research


Bias in IR and elsewhere In IR, e.g., Domain bias – People prefer particular Web domains Rank bias – People favor high-ranked results Caption bias – People prefer captions with certain terms In psychology, e.g., Anchoring-and-adjustment, confirmation, availability, etc. All impact user behavior Opportunity to intersect psychology and IR

Our Interest in Biases Bias can be observed in IR in situations where searchers seek (user behavior) or are presented with (search engine behavior) information that significantly deviates from the truth

Example Question: Can tea tree oil treat canker sores? The answer is yes. A health seeker may favor a particular outcome in light of their beliefs about the value of the oil, and seek or unconsciously prefer affirming information. More on the "truth" later…

Outline for Remainder of Talk Initial Exploratory Questionnaire Log Analysis Labeling Content and Truth Findings Conclusions

Initial Exploratory Questionnaire Gain early insight into possible biases in search Focus on Yes-No questions (answered with "Yes" or "No") Simplicity: Answers along a single dimension (Yes ↔ No) Microsoft employees; recall recent Yes-No query (in last 2 weeks) Asked about belief beforehand and afterwards Multi-point scale: Yes / Lean Yes / Equal / Lean No / No 200 respondents. Recalled questions such as: "Does chocolate contain caffeine?" "Are shingles contagious?"

Survey Results Figure 1a provides evidence of a positive skew in respondents' beliefs before they search. Specifically, 58% of respondents leaned toward yes and only 21% leaned toward no (the remaining 21% reported an equal belief in both outcomes). The fraction of searchers who believed yes after the search is more than double that of any other outcome, suggesting that respondents mostly shifted their prior beliefs from lean yes to yes. After searching, 48% believed lean yes, equal, or lean no, compared to more than three quarters of respondents (77%) before searching.

Survey Results (post-search belief given pre-search belief) Two main findings: 1. Respondents kept strongly-held beliefs (Yes-Yes and No-No) 2. If Before = Equal, then 2x as likely to believe Yes after search Motivated us to further explore the possible impact of biases on behavior and outcomes
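A minimal sketch of how such a pre/post transition table can be tabulated from questionnaire responses; the (pre, post) pairs below are hypothetical stand-ins, not the survey data:

```python
from collections import Counter

# Survey's 5-point belief scale; responses are hypothetical (pre-search, post-search) pairs.
SCALE = ["Yes", "Lean Yes", "Equal", "Lean No", "No"]
responses = [
    ("Lean Yes", "Yes"),
    ("Equal", "Yes"),
    ("Equal", "Lean No"),
    ("No", "No"),
]

transitions = Counter(responses)
for pre in SCALE:
    row_total = sum(transitions[(pre, post)] for post in SCALE)
    if row_total == 0:
        continue  # no respondents started with this belief in the toy data
    row = {post: transitions[(pre, post)] / row_total for post in SCALE}
    print(pre, "->", row)
```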

Log-Based Study of Yes-No Queries Queries, clicks, and results from Bing logs (2 weeks) Mined yes-no questions: start with "can", "is", "does", etc. Focused on health since it's important and we could get truth Randomly selected set of 1000 yes-no health questions (kept small because we wanted answer labels from medical professionals, who were time constrained) Each issued by at least 10 users, same top 10, same captions Examples include: "Is congestive heart failure a heart attack?" (answer = No) "Do food allergies make you tired?" (answer = Yes)
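A rough sketch of this kind of query-log filter; the prefix list and health vocabulary are illustrative assumptions, not the exact rules used in the study:

```python
# Illustrative yes-no prefixes and health vocabulary; both are assumptions for this sketch.
YES_NO_PREFIXES = ("can ", "is ", "are ", "does ", "do ", "will ", "should ")
HEALTH_TERMS = {"allergy", "allergies", "heart", "reflux", "shingles", "caffeine"}

def is_yes_no_health_question(query: str) -> bool:
    """Keep queries that read like yes-no questions and mention a health term."""
    q = query.strip().lower().rstrip("?")
    if not q.startswith(YES_NO_PREFIXES):
        return False
    return any(term in q.split() for term in HEALTH_TERMS)

print(is_yes_no_health_question("Do food allergies make you tired?"))  # True
print(is_yes_no_health_question("cheap flights to seattle"))           # False
```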

Other Data Collected Physician answers for the Yes-No questions Answers on a four-point scale: yes, no, 50/50, don't know 50/50: there was an equal split between yes and no, or more information was needed to provide an answer don't know: the physician did not know the answer, or the query was not medical or not a yes-no question Yes-No answer labels for captions and content of results

Physician Answers Two physicians reviewed the 1000 questions and gave answers The agreement matrix (shown on the slide) Inter-rater agreement was measured with the kappa (κ) statistic; Cohen's kappa coefficient is a statistical measure of inter-rater agreement for qualitative items.

Physician Answers
Observed agreement: (38.8% + 31.5%) / (38.8% + 8.2% + 5.7% + 31.5%) = 83.4%
P(judge 1 = yes): (38.8% + 8.2%) / (38.8% + 8.2% + 5.7% + 31.5%) = 55.82%
P(judge 2 = yes): (38.8% + 5.7%) / (38.8% + 8.2% + 5.7% + 31.5%) = 52.85%
Expected agreement: 55.82% × 52.85% + (1 − 55.82%) × (1 − 52.85%) = 50.33%
κ = (83.4% − 50.33%) / (1 − 50.33%) = 0.668
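The same arithmetic as a short sketch; the four cell values are the yes/no cells of the agreement matrix above, and the calculation follows the marginal-based (Cohen-style) kappa shown on the slide:

```python
# Agreement-matrix cells restricted to yes/no (fractions of the 1,000 questions):
# (judge1, judge2) = (yes, yes), (yes, no), (no, yes), (no, no).
yes_yes, yes_no, no_yes, no_no = 0.388, 0.082, 0.057, 0.315

total = yes_yes + yes_no + no_yes + no_no                    # 0.842
p_observed = (yes_yes + no_no) / total                       # 0.834
p1_yes = (yes_yes + yes_no) / total                          # 0.5582
p2_yes = (yes_yes + no_yes) / total                          # 0.5285
p_expected = p1_yes * p2_yes + (1 - p1_yes) * (1 - p2_yes)   # 0.5033

kappa = (p_observed - p_expected) / (1 - p_expected)
print(round(kappa, 3))  # 0.668
```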

Physician Answers There is a high amount of uncertainty for the 50/50 and don't know categories, and they occur infrequently, so we focus on the cases where both judges were sufficiently confident to assign a rating of yes or no: 960 × (38.8% + 31.5%) ≈ 674 questions Distribution: 55% Yes and 45% No (used as TRUTH in our study)

Answer Labeling Captions and result content Crowdsourced (Clickworker.com), 3-5 judges/caption (consensus) Task was to assign a label of: Yes only / No only / Both (Yes and No) / Neither (not Yes and not No) Agreement on 96% of captions Performed similar labeling for each top-10 search result (crowdsourced judges, agreement on 92% of pages) [Slide shows example caption labels: Yes only, No only, Both, Neither]

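A minimal sketch of the consensus step, assuming a caption label is accepted when a majority of the 3-5 judges agree; the exact adjudication rule used in the study may differ:

```python
from collections import Counter
from typing import Optional

def consensus_label(judgments: list[str]) -> Optional[str]:
    """Return the majority label among the crowd judgments, or None if there is no majority."""
    label, count = Counter(judgments).most_common(1)[0]
    return label if count > len(judgments) / 2 else None

print(consensus_label(["Yes only", "Yes only", "Both"]))             # Yes only
print(consensus_label(["Yes only", "No only", "Both", "Neither"]))   # None (no majority)
```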

Using Physician Answers as Truth Used consensus physician answers as truth in three ways: How closely does distribution of results match the truth? How closely does interaction behavior match the truth? How closely do answers that people reach match the truth? Bias = Distributions significantly differ from Yes-No base rates

Taking Stock of Our Data We have: 680 Yes-No health questions from search logs Ground truth for each q via physicians’ consensus judgments For each question we have: HTML content of top 10 search results, plus: Caption labels for Yes/No/Both/Neither Result labels for Yes/No/Both/Neither Clickthrough behavior from logs

Analysis Four directions for analysis: Study ranking of results with Yes-No content Study user behavior w.r.t. Yes-No content Study answer accuracy for Yes-No questions Study answer transitions for Yes-No questions

Result Ranking Volume of Yes-No content in the results Percentage of captions or results with answer → More Yes content in top-10 than No content Relative ranking of top Yes-No content when both in top 10 Percentage of SERPs where top yes caption or result appears above (nearer the top of the ranking than) the top no → Yes content ranked above No more often (when both shown)
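A sketch of the second measurement, assuming each result position carries its crowd label: among SERPs containing both a yes-only and a no-only result, it computes the fraction where the top yes result outranks the top no result. The SERPs below are hypothetical.

```python
def yes_ranked_above_no(labels_by_rank: list[str]):
    """labels_by_rank holds crowd labels for ranks 1..10.
    Returns True/False when both a yes-only and a no-only result are present, else None."""
    if "Yes only" not in labels_by_rank or "No only" not in labels_by_rank:
        return None
    return labels_by_rank.index("Yes only") < labels_by_rank.index("No only")

serps = [
    ["Yes only", "Neither", "No only", "Both"],
    ["No only", "Yes only", "Neither", "Neither"],
    ["Yes only", "Both", "Neither", "Neither"],  # no No-only result -> excluded
]
outcomes = [r for r in (yes_ranked_above_no(s) for s in serps) if r is not None]
print(sum(outcomes) / len(outcomes))  # 0.5 on this toy data
```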

User Behavior (Clickthrough rate) Studied clickthrough rates on captions containing answers Controlled for rank by just considering the top result (r=1) [Figure: SERP click likelihoods for different captions given variations in answer presence in SERPs/captions, and rank] 3-4x as likely to click on captions with Yes content, even though TRUTH = 55% Yes / 45% No
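A sketch, on hypothetical impression records, of the rank-controlled comparison: restrict to impressions where the rank-1 caption carries an answer, then compare clickthrough rates by label.

```python
from collections import defaultdict

# Hypothetical impression records: (label of the rank-1 caption, whether it was clicked).
impressions = [
    ("Yes only", True), ("Yes only", False), ("Yes only", True),
    ("No only", False), ("No only", False), ("No only", True),
]

clicks, shown = defaultdict(int), defaultdict(int)
for label, clicked in impressions:
    shown[label] += 1
    clicks[label] += int(clicked)

for label in ("Yes only", "No only"):
    print(label, "CTR =", round(clicks[label] / shown[label], 2))
```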

User Behavior (Result skipping) Beyond clickthrough behavior, it is also worth considering the nature of the results that users skipped over prior to clicking on a particular caption. To study this in our context, we targeted SERPs with both yes-only and no-only answers in captions, and identified clicks on a caption with a yes or no answer where the user had skipped over another caption prior to clicking

User Behavior (Result skipping) Studied result skipping behavior Frequency with which people skipped a caption with an answer to click another caption Distribution of clicks and skips by answer Users more likely (5x) to skip No to click Yes than vice versa [Slide illustration: SERP with Yes/No captions at different ranks]
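A sketch of how skips can be tallied from a SERP's caption labels and the clicked rank, assuming a "skip" means an answer-bearing caption appeared above the click and was not clicked:

```python
def count_skips(labels_by_rank: list[str], clicked_rank: int) -> dict[str, int]:
    """Count answer-bearing captions above the click that were not clicked (clicked_rank is 1-based)."""
    skipped = {"Yes only": 0, "No only": 0}
    for label in labels_by_rank[: clicked_rank - 1]:
        if label in skipped:
            skipped[label] += 1
    return skipped

# Example: the user skips a 'No only' caption at rank 1 to click a 'Yes only' caption at rank 3.
print(count_skips(["No only", "Neither", "Yes only", "Both"], clicked_rank=3))
# {'Yes only': 0, 'No only': 1}
```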

Answer Accuracy Examined accuracy of the top search result, as well as first click and last click in session Findings show: 1. Top result accurate only 45% of time, less when truth is No 2. Users improve accuracy, but only slightly (limited by top 10)
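A sketch of the accuracy measure, assuming each question has a physician truth ("Yes"/"No") and each examined result has a crowd answer label; the triples below are hypothetical, and counting "Both"/"Neither" as not definitive is a simplification:

```python
def answer_correct(result_label: str, truth: str) -> bool:
    """Treat a result as accurate if it carries only the true answer;
    'Both' and 'Neither' are counted as not giving a definitive answer (a simplification)."""
    return result_label == f"{truth} only"

# Hypothetical (truth, top-result label, last-click label) triples, one per question.
questions = [
    ("Yes", "Yes only", "Yes only"),
    ("No",  "Yes only", "No only"),
    ("No",  "Neither",  "Yes only"),
]
top_accuracy = sum(answer_correct(top, t) for t, top, _ in questions) / len(questions)
last_accuracy = sum(answer_correct(last, t) for t, _, last in questions) / len(questions)
print(top_accuracy, last_accuracy)  # ~0.33 and ~0.67 on this toy data
```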

Answer Transitions People reported being much more likely to search for confirmatory information than for information that challenged their hypothesis Findings show: 1. No one transitioned from yes to no 2. Confirmation was the primary motivation for pursuing information after the initial answer

Summary of Main Findings We observed: 1. Engines more likely to rank Yes above No, and return more Yes 2. People much more likely to click on Yes than No 3. Engine had wrong top rank for half of questions* (*given that an answer was present at the top position, ~80% of queries) Caveats: Findings are for our particular set of Yes-No health questions; more work needed to validate with other question sets, domains beyond health, etc.

Discussion Possible causes for observed bias include: Search engines use behavioral signals (which are hurt by common misconceptions) Ranking algorithms consider query match, e.g., for the query [can acid reflux cause back pain?]: Yes docs with "Acid reflux can cause back pain" are a better match (6 of 6 terms) than No docs with "Acid reflux cannot cause back pain" (5 of 6 terms; "cannot" is missing from the query)
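A toy illustration of this term-overlap effect (not the engine's actual ranking function): a bag-of-words match score favors the affirmative sentence because "cannot" does not match the query term "can".

```python
def term_match_score(query: str, doc: str) -> float:
    """Fraction of query terms that appear in the document (toy bag-of-words match)."""
    q_terms = query.lower().rstrip("?").split()
    d_terms = set(doc.lower().split())
    return sum(t in d_terms for t in q_terms) / len(q_terms)

query = "can acid reflux cause back pain?"
yes_doc = "Acid reflux can cause back pain"
no_doc = "Acid reflux cannot cause back pain"

print(term_match_score(query, yes_doc))  # 1.0   (6 of 6 query terms match)
print(term_match_score(query, no_doc))   # ~0.83 (5 of 6; 'can' does not match 'cannot')
```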

Conclusions Studied potential bias in user behavior and outcomes Showed effects on both from search engines 2% of queries are Yes-No questions; Searchers want answers!! To get users to accurate answers, engines should consider truth Future directions: Study availability of Yes-No content online; Move beyond Yes-No Consider how truth should be determined and used in ranking Follow-up user studies
