An Analysis of Assessor Behavior in Crowdsourced Preference Judgments
Dongqing Zhu and Ben Carterette, University of Delaware

Objective
– Analyze assessor behavior in our pilot study to determine the optimal placement of images among search results
– Identify patterns that may be broadly indicative of unreliable assessments
– Inform future experimental design and analysis when using crowdsourced human judgments

Description of Our Pilot Study
– Objective: determine the optimal placement of images among search results
– Objects to be assessed: full page layouts consisting of both ranked results and results from image search verticals

Description of Our Pilot Study
– Method: preference judgments
– Pros:
  – can be made quickly
  – less prone to disagreements between assessors
  – more robust to missing judgments
  – correlate well with traditional evaluation measures based on absolute relevance judgments on documents
  – can be mapped to much finer-grained grades of utility
– Cons:
  – the number of pairs grows quadratically (and the number of possible orderings factorially) with the number of objects
  – unreliable data
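A quick worked illustration of that cost blow-up, as a minimal Python sketch (the object counts are arbitrary examples):

```python
from itertools import combinations
from math import factorial

# The number of pairwise judgments over n objects grows as n*(n-1)/2,
# while the number of possible total orderings grows as n!.
for n in (4, 6, 10, 20):
    n_pairs = len(list(combinations(range(n), 2)))  # n*(n-1)/2
    n_orders = factorial(n)                         # n!
    print(f"{n:>2} objects -> {n_pairs:>3} pairs, {n_orders} orderings")
```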

Experimental Design
– Platform: Amazon Mechanical Turk
– Each HIT (Human Intelligence Task) question links to our own survey website, which allows us to show each MTurker a sequence of survey questions and to log additional information such as IP address, time-on-task, etc.
– Pay: US$0.13 for every 17 survey questions

Experimental Design
– Survey website:
  – a simple question at the beginning of the survey to identify the user's favorite search engine
  – then a sequence of 17 full-page preferences: each preference was for a different query, and the order of queries was randomized
  – a confirmation code at the end, to be submitted via the HIT

Experimental Design
– Each survey question has two tasks:
  – Full-Page Layout Preference Judgment Task: preferences on variations of the Yahoo! SERP layout for 47 queries formed from 30 topics taken from the TREC 2009 Web track and the TREC 2009 Million Query track
  – Absolute-Scale Image Rating Task: rate the pictures by relevance on a ternary scale (“poor”, “OK”, “good”)

Experimental Design
– Full-Page Layout Preference Judgment Task
  – Full page layouts consisted of ranked results plus:
    – inline images to the left or right of summaries
    – image search vertical results at the top, middle, or bottom of the page
  – Up to 6 variant layouts possible
  – 2 of the 17 queries are “trap” questions: two identical result pages, placed 6th and 15th in the sequence
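A minimal sketch of how such a 17-question sequence could be assembled; the variant labels, the pairing strategy, and the helper names are illustrative assumptions, not the study's exact design:

```python
import random

# Hypothetical layout labels: vertical image block position (T/M/B) and inline side (L/R).
VERTICAL = ["T", "M", "B"]
INLINE = ["L", "R"]

def build_sequence(queries, trap_positions=(6, 15), seed=0):
    """Assemble a survey: one preference pair per query, plus trap questions
    (the same page shown twice) at the given 1-based positions."""
    rng = random.Random(seed)
    queries = list(queries)
    rng.shuffle(queries)                      # order of queries randomized
    real = []
    for q in queries:
        a, b = rng.sample(VERTICAL, 2)        # e.g. a T-vs-B vertical pair
        side = rng.choice(INLINE)
        real.append((q, f"vertical-{a}/inline-{side}", f"vertical-{b}/inline-{side}"))
    sequence, it = [], iter(real)
    for pos in range(1, len(queries) + len(trap_positions) + 1):
        if pos in trap_positions:
            q = rng.choice(queries)
            layout = f"vertical-{rng.choice(VERTICAL)}/inline-{rng.choice(INLINE)}"
            sequence.append((q, layout, layout))   # two identical result pages
        else:
            sequence.append(next(it))
    return sequence

if __name__ == "__main__":
    # 15 real queries + 2 traps = 17 questions per HIT
    for i, item in enumerate(build_sequence([f"query-{i}" for i in range(15)]), 1):
        print(i, item)
```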

Data Analysis
– So far we have 48 approved HITs (36 Google, 11 Yahoo, 1 Bing) and 20 rejected HITs (15 Google, 3 Yahoo, 1 Bing, 1 Ask)
– Rejection criterion: failure on the trap questions
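A minimal sketch of this rejection rule, assuming the trap questions offered a "no preference" answer; the exact failure criterion and the field names are assumptions, since the slide does not spell them out:

```python
def passes_traps(responses, trap_positions=(6, 15), no_pref="no preference"):
    """Keep a HIT only if the assessor gave the expected answer on both trap
    questions (two identical result pages).  'responses' maps 1-based question
    position to the answer string."""
    return all(responses.get(pos) == no_pref for pos in trap_positions)

# Example usage with a made-up HIT:
hit_answers = {6: "no preference", 15: "prefer page A"}
print(passes_traps(hit_answers))   # False -> reject the HIT
```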

Analysis of Approved Data
– Time analysis
  – Normal pattern: the assessor starts out slow, quickly gets faster as he or she learns the task, and then maintains a roughly constant pace
  – 37 out of 48 assessors (77%) fall into this category

Analysis of Approved Data
– Time analysis
  – Periodic or quasi-periodic pattern: some judgments are made fast and some slow, and the fast and slow tend to alternate
  – One possible explanation: the assessor is periodically absent-minded
  – 7 out of 48 assessors (15%) fall into this category

Analysis of Approved Data
– Time analysis
  – Interrupted pattern: occasional peaks appear against the background of a normal pattern
  – 4 out of 48 assessors (8%) fall into this category
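The three timing patterns above are described only qualitatively; a hedged sketch of how they might be separated automatically (the thresholds and the use of lag-1 autocorrelation are our own assumptions, not the paper's method):

```python
import statistics

def classify_time_pattern(times, spike_factor=3.0, alternation=-0.3):
    """Rough heuristic over per-question times (seconds):
    'interrupted' - a few large spikes stand out from an otherwise normal run;
    'periodic'    - fast and slow answers tend to alternate
                    (strongly negative lag-1 autocorrelation);
    'normal'      - slow start, then a roughly constant pace."""
    med = statistics.median(times)
    spikes = sum(1 for t in times[1:] if t > spike_factor * med)  # ignore the slow first question
    if 1 <= spikes <= 3:
        return "interrupted"
    mean = statistics.fmean(times)
    dev = [t - mean for t in times]
    denom = sum(d * d for d in dev)
    lag1 = sum(a * b for a, b in zip(dev, dev[1:])) / denom if denom else 0.0
    if lag1 < alternation:
        return "periodic"
    return "normal"

# Example with made-up timings: one long interruption around question 10.
print(classify_time_pattern(
    [60, 30, 25, 24, 26, 25, 23, 24, 25, 110, 24, 23, 25, 24, 26, 25, 24]))  # 'interrupted'
```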

Analysis of Approved Data
– Image rating analysis
  – Normal pattern: users give rating 2 or 3 most often and give rating 1 occasionally
  – 34 out of 48 assessors (71%) fall into this category

Analysis of Approved Data
– Image rating analysis
  – Periodic or quasi-periodic pattern: the user shows a tendency to alternate between a subset of ratings
  – 8 out of 48 assessors (17%) fall into this category

Analysis of Approved Data
– Image rating analysis
  – Fixed pattern: all or most of the image ratings are the same
  – 6 out of 48 assessors (12%) fall into this category
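Similarly, a hedged sketch for flagging the fixed and alternating rating patterns (the thresholds are assumptions for illustration):

```python
from collections import Counter

def classify_rating_pattern(ratings, fixed_threshold=0.85, alt_threshold=0.75):
    """Rough heuristic over a sequence of 1-3 image ratings:
    'fixed'    - one rating value dominates;
    'periodic' - consecutive ratings differ almost every time
                 (alternation between a subset of values);
    'normal'   - mostly 2s and 3s, occasional 1s."""
    counts = Counter(ratings)
    if counts.most_common(1)[0][1] / len(ratings) >= fixed_threshold:
        return "fixed"
    changes = sum(1 for a, b in zip(ratings, ratings[1:]) if a != b)
    if changes / (len(ratings) - 1) >= alt_threshold:
        return "periodic"
    return "normal"

# Example usage with made-up ratings:
print(classify_rating_pattern([3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3]))  # 'fixed'
print(classify_rating_pattern([2, 3, 2, 3, 1, 3, 2, 3, 2, 3, 2, 1, 2, 3, 2, 3, 2]))  # 'periodic'
```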

Analysis of Approved Data
– Preference judgment analysis
  – We analyze preference judgments for the mixed set (both inline and vertical image results) to determine whether we can identify one or the other as the primary factor in the assessor's preference

Analysis of Approved Data
– Preference judgment analysis
  – For now, we analyze the inline placement and vertical placement preferences separately
  – We assign TMB (top/middle/bottom vertical variants) and LR (left/right inline variants) scores to indicate layout preferences as follows:
    – Given a T-B pair, if the user prefers T we assign 1.5 as the TMB score of that pair; otherwise we assign -1.5
    – Given an M-B pair, if the user prefers M we assign 1 as the TMB score of that pair; otherwise we assign -1
    – Given a T-M pair, if the user prefers T we assign 1 as the TMB score of that pair; otherwise we assign -1
    – Given an L-R pair, if the user prefers L we assign 1 as the LR score of that pair; otherwise we assign -1
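A minimal sketch of this scoring scheme; the -1.5 score when B is preferred in a T-B pair is filled in by symmetry with the other rules, since the slide leaves it implicit:

```python
# Scores keyed by (preferred variant, other variant) for a judged pair.
TMB_SCORES = {                               # vertical image block: Top / Middle / Bottom
    ("T", "B"): 1.5, ("B", "T"): -1.5,       # -1.5 assumed by symmetry
    ("M", "B"): 1.0, ("B", "M"): -1.0,
    ("T", "M"): 1.0, ("M", "T"): -1.0,
}
LR_SCORES = {("L", "R"): 1.0, ("R", "L"): -1.0}   # inline images: Left / Right

def tmb_score(preferred, other):
    """TMB score for one judged pair: positive values favor higher placement."""
    return TMB_SCORES[(preferred, other)]

def lr_score(preferred, other):
    """LR score for one judged pair: positive values favor left inline placement."""
    return LR_SCORES[(preferred, other)]

# Example: the assessor preferred top over bottom, and right over left.
print(tmb_score("T", "B"), lr_score("R", "L"))   # 1.5 -1.0
```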

Analysis of Approved Data
– Preference judgment analysis
  – Layout preference curves: moving averages (window size 5) of the TMB and LR scores (red and blue lines, respectively), plotted against query number
  – There are roughly two patterns of layout preference curves: either only one of the scores changes over time, or both do
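A minimal sketch of the smoothing step; whether the original curves used a trailing or centered window is not stated, so the trailing window here is an assumption:

```python
def moving_average(scores, window=5):
    """Trailing moving average of per-query TMB or LR scores with the given window."""
    out = []
    for i in range(len(scores)):
        chunk = scores[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Example: a made-up sequence of TMB scores for one assessor.
tmb = [1.5, 1.0, -1.0, 1.0, 1.5, 1.0, -1.0, 1.0, 1.0, -1.5, 1.0, 1.5, 1.0, 1.0, -1.0]
print([round(x, 2) for x in moving_average(tmb)])
```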

Analysis of Approved Data
– Preference judgment analysis
  – Pattern 1: only one of the scores changes over time. We infer that the layout preference associated with the varying curve is the leading factor in the assessor's preferences
  – 21/48 (44%) of assessors showed this pattern

Analysis of Approved Data
– Preference judgment analysis
  – Pattern 2: both the TMB and LR curves vary over time. We may need to look at each SERP pair individually to determine whether the TMB and LR positions have a combined effect on layout preference
  – 27/48 (56%) of assessors showed this pattern

Analysis of Rejected Data
– 14/20 (70%) exhibited unusual behavior patterns
– 4 showed an abnormal time pattern of taking longer on the last few queries; this was not observed among assessors who passed the trap questions

Analysis of Rejected Data
– 2 showed an interrupted time pattern
– 4 showed a fixed image rating pattern; 1 of them also showed an interrupted time pattern
– 4 showed a quasi-periodic image rating pattern; 1 of them also showed a quasi-periodic time pattern

Conclusion
– Turkers prefer:
  – vertical image results near the top of the page
  – inline image results on the left

Conclusions
– Unreliable assessors are detectable through:
  – periodic timings
  – abnormal timings
  – periodic ratings
  – fixed ratings
– Assessors may be reliable on one task but unreliable on another
– Trap questions are useful
– Trap questions in conjunction with timing/rating filters work best
– MTurkers may be learning how to avoid detection when cheating

Acknowledgements

Q&A