Crowdsourcing Blog Track Top News Judgments at TREC
Richard McCreadie, Craig Macdonald, Iadh Ounis

Outline
- Relevance Assessment and TREC (4 slides)
- Crowdsourcing Interface (4 slides)
- Research Questions and Results (6 slides)
- Conclusions and Best Practices (1 slide)

Relevance Assessment and TREC (slides 4-7 of 20)

Relevance Assessment
- Relevance assessments are vital when evaluating information retrieval (IR) systems at TREC: is this document relevant to the information need expressed in the user query?
- They are created by human assessors: specialist paid assessors (e.g. TREC assessors) or the researchers themselves.
- Typically, only one assessor judges each document (for cost reasons).

Limitations
- Creating relevance assessments is costly: money, time, and equipment (lab, computers, electricity, etc.).
- It may not scale well: how many people are available to make assessments? Can the work be done in parallel?

Task
Could we do relevance assessment using crowdsourcing at TREC?
Setting: the TREC 2010 Blog Track, top news stories identification subtask.
- System task: "What are the newsworthy stories on day d for a category c?"
- Crowdsourcing task: was the story 'Sony Announces NGP' an important story on the 1st of February for the Science/Technology category?
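To make the crowdsourcing task concrete, here is a minimal sketch of how one worker's answer could be represented. The judgment labels match the HIT interface described later in the deck; the class and field names themselves are illustrative assumptions, not the track's actual data format.

```python
from dataclasses import dataclass
from enum import Enum

class Judgment(Enum):
    """The four labels a worker can assign to a story (see the HIT interface slide)."""
    IMPORTANT = "+"
    NOT_IMPORTANT = "-"
    WRONG_CATEGORY = "x"
    NOT_YET_JUDGED = "?"

@dataclass
class StoryJudgment:
    """One worker's answer to: was this story important on day d for category c?"""
    worker_id: str
    day: str          # the topic day d
    category: str     # e.g. "Science/Technology"
    headline: str     # e.g. "Sony Announces NGP"
    label: Judgment = Judgment.NOT_YET_JUDGED
```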

Crowdsourcing Interface (slides 9-12 of 20)

Crowdsourcing HIT Interface
(Screenshot of the HIT: instructions at the top, then the category c and day d, and the list of stories to be judged inside an externally hosted iframe, with a comment box and a submit button. Each story is assigned one of four judgments: [+] Important, [-] Not Important, [x] Wrong Category, [?] Not Yet Judged.)

External Interface
- The judging interface was hosted from our own server in Glasgow and embedded in the HIT, with workers interacting with it directly.
- It requires interaction, which catches out bots that only look for simple input fields to fill.
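For anyone reproducing this kind of setup, the sketch below shows how an externally hosted judging page can be attached to a HIT through MTurk's ExternalQuestion mechanism. It uses the present-day boto3 client rather than whatever API the authors used in 2010, and the URL, title, and duration values are illustrative assumptions; only the $0.50 reward, three assignments per HIT, and US worker restriction echo figures given later on the experimental setup slide.

```python
import boto3

# ExternalQuestion XML: MTurk renders this URL inside an iframe within the HIT.
EXTERNAL_QUESTION = """
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.org/judge?batch=1&amp;hit=42</ExternalURL>
  <FrameHeight>800</FrameHeight>
</ExternalQuestion>
"""

# Restrict the HIT to US-based workers via the built-in Locale qualification.
US_ONLY = [{
    "QualificationTypeId": "00000000000000000071",  # Worker_Locale
    "Comparator": "EqualTo",
    "LocaleValues": [{"Country": "US"}],
}]

mturk = boto3.client("mturk", region_name="us-east-1")

hit = mturk.create_hit(
    Title="Judge the importance of news stories for a given day and category",
    Description="Mark each story as Important, Not Important or Wrong Category.",
    Keywords="news, relevance, judging",
    Reward="0.50",                      # $0.50 per assignment, as on the setup slide
    MaxAssignments=3,                   # three workers per HIT
    AssignmentDurationInSeconds=3600,
    LifetimeInSeconds=3 * 24 * 3600,
    Question=EXTERNAL_QUESTION,
    QualificationRequirements=US_ONLY,
)
print(hit["HIT"]["HITId"])
```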

Manual Summary Evaluation
Hosting the judging interface externally allows us to record and reproduce what each worker sees. This lets us:
- see at a glance whether the judgments make sense (for example, is this worker a bot?),
- compare across judgments easily,
- check whether the work has been done at all.

Submitting Larger HITs
We have each worker judge 32 stories from a single day and category per HIT, for two reasons:
- Newsworthiness is relative: seeing many stories together gives workers background on the other stories of the day.
- Larger HITs promote worker commitment to the task.
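As an illustration of how such 32-story HITs could be assembled, the sketch below groups a pool of stories by (day, category) and cuts each group into chunks of 32. The function and field names are assumptions for illustration, not the authors' actual tooling.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def build_hits(stories: Iterable[dict], per_hit: int = 32) -> List[List[dict]]:
    """Group pooled stories by (day, category) and cut each group into HITs of `per_hit` stories."""
    by_topic: Dict[Tuple[str, str], List[dict]] = defaultdict(list)
    for story in stories:
        by_topic[(story["day"], story["category"])].append(story)

    hits = []
    for topic_stories in by_topic.values():
        for i in range(0, len(topic_stories), per_hit):
            hits.append(topic_stories[i:i + per_hit])
    return hits
```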

Experimental Results (slides 14-20 of 20)

Research Questions
1. Was crowdsourcing Blog Track judgments fast and cheap?
2. Are there high levels of agreement between assessors?
3. Is having redundant judgments even necessary?
4. If we use worker agreement to infer multiple grades of importance, how would this affect the final ranking of systems at TREC?
In short: was crowdsourcing a good idea, and can we do better?
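Research question 4 depends on turning agreement among the three workers into graded importance labels. The talk does not spell out the mapping, so the sketch below shows one plausible scheme, where the grade is simply the number of workers (0 to 3) who marked the story Important.

```python
from typing import List

def importance_grade(labels: List[str]) -> int:
    """Map the three workers' labels for one story to a 0-3 importance grade.

    `labels` holds one entry per worker, e.g. ["+", "-", "+"].
    This counting scheme is an illustrative assumption, not necessarily
    the mapping used for the official TREC judgments.
    """
    return sum(1 for label in labels if label == "+")

assert importance_grade(["+", "+", "+"]) == 3   # unanimously important
assert importance_grade(["+", "-", "x"]) == 1   # weakly important
```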

Experimental Setup
- $0.50 per HIT, plus 10% fees on the total cost
- US worker restriction
- 6 batches, with incremental improvements between batches
- 8,000 news stories, pooled to a fixed depth using statMAP over the topic days
- Three workers per HIT
- 24,000 judgments total
- 750 HITs total
[O. Alonso, SIGIR'09]
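The counts on this slide fit together arithmetically, as the short check below shows. Note that 8,000 / 32 × 3 = 750 matches the "750 HITs total" figure only if that number counts completed worker assignments rather than distinct HITs, which is an assumption here.

```python
stories = 8_000          # pooled news stories (this slide)
workers_per_story = 3    # three workers judge every story (this slide)
stories_per_hit = 32     # from the "Submitting Larger HITs" slide

judgments = stories * workers_per_story                          # 24,000 judgments in total
assignments = (stories // stories_per_hit) * workers_per_story   # 250 HITs x 3 workers = 750
print(judgments, assignments)                                    # 24000 750
```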

Is Crowdsourcing Relevance Assessments Fast and Cheap?
Quick?
- The first HITs were accepted within 10 minutes of launch.
- Each batch took less than 5 hours: batches were completed quickly, so speed was not an issue.
- Caveat: with few HITs per batch, the HITs might be difficult for workers to find soon after launch.
Cheap?
- Workers took less time than expected, and got faster over time.
- On average, the effective pay was 38% above a $2 per hour wage.

Assessment Quality
Are the assessments of good quality? We evaluate agreement between workers.
- Mean agreement: 69%
- For comparison, Ellen Voorhees reported only 32.8% [E.M. Voorhees, IPM 2000].
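The slide does not state which agreement statistic was used; a common choice is mean pairwise agreement over the three workers on each story, sketched below under that assumption.

```python
from itertools import combinations
from typing import Dict, List

def mean_pairwise_agreement(judgments: Dict[str, List[str]]) -> float:
    """Mean pairwise agreement over stories.

    `judgments` maps a story id to the list of labels its (three) workers gave,
    e.g. {"story-1": ["+", "+", "-"], ...}. For each story, take the fraction of
    worker pairs that gave the same label, then average over stories.
    """
    per_story = []
    for labels in judgments.values():
        pairs = list(combinations(labels, 2))
        agreeing = sum(1 for a, b in pairs if a == b)
        per_story.append(agreeing / len(pairs))
    return sum(per_story) / len(per_story)

print(mean_pairwise_agreement({"s1": ["+", "+", "-"], "s2": ["-", "-", "-"]}))  # ~0.67
```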

Do We Need Redundant Judgments?
What would have happened to the ranking of TREC systems if we had used only a single worker per HIT? To find out, we treat each of the three judgment sets per HIT as coming from a separate 'meta-worker' and re-rank the systems with each one.
- The systems fall into two groups: the top 3 are ~0.15 apart, the bottom 3 are ~0.3 apart.
- System rankings are not stable in the top ranks: there are multiple ranking swaps among the top systems.
Do we need to average over three workers? Yes!
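One way to quantify the ranking swaps described above is to re-score every run against each single meta-worker's judgments and compare the resulting system orderings, for example with Kendall's tau. The sketch below assumes you already hold a per-system evaluation score under each judgment set; the run names and scores shown are purely illustrative.

```python
from typing import Dict
from scipy.stats import kendalltau

def ranking_correlation(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> float:
    """Kendall's tau between the system orderings induced by two sets of judgments.

    `scores_a` / `scores_b` map a system (run) name to its evaluation score,
    e.g. statMAP computed against the full 3-worker qrels vs. one meta-worker's qrels.
    """
    systems = sorted(scores_a)  # fixed system order shared by both score sets
    tau, _ = kendalltau([scores_a[s] for s in systems],
                        [scores_b[s] for s in systems])
    return tau

full = {"runA": 0.45, "runB": 0.43, "runC": 0.30}    # illustrative scores only
meta1 = {"runA": 0.41, "runB": 0.44, "runC": 0.29}   # one meta-worker's qrels
print(ranking_correlation(full, meta1))              # values below 1.0 signal ranking swaps
```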

Conclusions and Best Practices
Crowdsourcing top stories relevance assessments can be done successfully at TREC... but we need at least three assessors for each story.
Best practices:
- Don't be afraid to use larger HITs.
- If you have an existing interface, integrate it with MTurk.
- Gold judgments are not the only validation method.
- Re-cost your HITs as necessary.
Questions?