Crowdsourcing a News Query Classification Dataset
Richard McCreadie, Craig Macdonald & Iadh Ounis

Presentation transcript:

Crowdsourcing a News Query Classification Dataset
Richard McCreadie, Craig Macdonald & Iadh Ounis

Introduction
What is news query classification, and why build a dataset to examine it?
− A binary classification task performed by Web search engines
− Up to 10% of queries may be news-related [Bar-Ilan et al, 2009]
− We have workers judge Web search queries as news-related or not
[Slide diagram: a query such as "gunman" is classified by the Web search engine as news-related or non-news-related, determining whether the user is shown news results or standard Web search results]

Introduction
But:
− News-relatedness is subjective (e.g. the query `octopus': a sea creature, or World Cup predictions? News-related?)
− Workers can easily `game' the task, e.g. by answering Random(Yes, No) for every query
− News queries change over time
How can we overcome these difficulties to create a high-quality dataset for news query classification?

Talk Outline
− Introduction
− Dataset Construction Methodology
− Research Questions and Setting
− Experiments and Results
− Conclusions

Dataset Construction Methodology
How can we go about building a news query classification dataset?
1. Sample queries from the MSN May 2006 query log
2. Create gold judgments to validate the workers
3. Propose additional content to tackle the temporal nature of news queries, and prototype interfaces to evaluate this content on a small test set
4. Create the final labels using the best setting and interface
5. Evaluate in terms of agreement
6. Evaluate against `experts'

Dataset Construction Methodology
Sampling queries:
− Create 2 query sets sampled from the MSN May 2006 query log using Poisson sampling
− One for testing (testset): fast crowdsourcing turn-around time, very low cost
− One for the final dataset (fullset): 10x the queries, only labelled once
Example queries (shown with their date and time): "What is May Day?", "protest in Puerto rico"
Testset queries: 91; Fullset queries: 1206
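The slides contain no code; the following is a minimal Python sketch of the kind of per-item Poisson sampling described above, in which every query in the log is kept independently with a fixed inclusion probability. The file name, line format and sampling rates are illustrative assumptions, not the authors' actual values.

    import random

    def poisson_sample(lines, rate, seed=42):
        # Per-item (Bernoulli/Poisson) sampling: each query is retained
        # independently with probability `rate`, giving an unbiased,
        # representative subsample of the query log.
        rng = random.Random(seed)
        return [line for line in lines if rng.random() < rate]

    # Hypothetical log format: one "date<TAB>time<TAB>query" line per query.
    with open("msn_may2006_querylog.tsv") as f:
        log = f.readlines()

    # fullset: sampled from the whole month; testset: from a single day only.
    fullset = poisson_sample(log, rate=0.0001)
    testset = poisson_sample([l for l in log if l.startswith("2006-05-15")], rate=0.001)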

Dataset Construction Methodology
How do we check that our workers are not `gaming' the system?
− Gold judgments (honey-pot): a small set (5%) of `cherry-picked' unambiguous queries, with a focus on news-related queries, used to catch out bad workers early in the task
− Multiple workers per query: 3 workers, majority result
Example gold judgments: "What is May Day?" (No), "protest in Puerto rico" (Yes)
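Again not from the original deck: a small sketch, under assumed data structures, of how the honey-pot check and the 3-worker majority vote fit together. The 70% gold-judgment cutoff is the threshold mentioned later in the experimental setup; the worker names and example labels are made up.

    from collections import Counter

    def passes_honeypot(worker_labels, gold, cutoff=0.7):
        # worker_labels: {query: "Yes"/"No"} for one worker; gold: expert honey-pot labels.
        judged_gold = [q for q in worker_labels if q in gold]
        if not judged_gold:
            return True  # worker has not hit a gold query yet
        correct = sum(worker_labels[q] == gold[q] for q in judged_gold)
        return correct / len(judged_gold) >= cutoff  # reject workers below the cutoff

    def majority_label(labels):
        # labels: the labels collected from the 3 workers for one query.
        return Counter(labels).most_common(1)[0][0]

    gold = {"What is May Day?": "No", "protest in Puerto rico": "Yes"}
    workers = {
        "w1": {"What is May Day?": "No", "gunman": "Yes"},
        "w2": {"What is May Day?": "Yes", "gunman": "No"},  # fails the honey-pot
        "w3": {"What is May Day?": "No", "gunman": "Yes"},
    }
    valid = {w: labels for w, labels in workers.items() if passes_honeypot(labels, gold)}
    print(majority_label([labels["gunman"] for labels in valid.values() if "gunman" in labels]))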

Dataset Construction Methodology
How do we counter the temporal nature of news queries?
− Workers need to know what the news stories of the time were...
− ... but will likely not remember the main stories from May 2006
Idea: add extra information to the interface
− News headlines
− News summaries
− Web search results
Prototype interfaces:
− Use the small testset to keep costs and turn-around time low
− See which works best

Interfaces: Basic
[Screenshot of the Basic interface: instructions explaining what the workers need to do and clarifying news-relatedness, the query and its date, and a binary labelling control]

Interfaces: Headline
[Screenshot: the instructions additionally show 12 news headlines from the New York Times]
Will the workers bother to read these?

Interfaces: HeadlineInline
[Screenshot: the New York Times news headlines are shown inline with each query]
Maybe headlines are not enough?

Interfaces: HeadlineSummary
[Screenshot: each news headline is accompanied by a news summary; example query: "Tigers of Tamil"]

Interfaces: LinkSupported
− Links to three major search engines
− Each link triggers a search containing the query and its date
− We also get some additional feedback from workers

Dataset Construction Methodology
How do we evaluate the quality of our labels?
− Agreement between the three workers per query: the more the workers agree, the more confident we can be that the resulting majority label is correct
− Comparison with `expert' (the author's) judgments: see how many of the queries that the workers judged news-related match the ground truth
Example: "abcnews" (worker: Yes, expert: No), "protest in Puerto rico" (worker: Yes, expert: Yes)

Talk Outline
− Introduction
− Dataset Construction Methodology
− Research Questions and Setting
− Experiments and Results
− Conclusions

Experimental Setup: Research Questions
How do our interface and setting affect the quality of our labels? (testset)
1. Baseline quality? How bad is it?
2. How much can the honey-pot bring?
3. What about our extra information {headlines, summaries, result rankings}?
Can we create a good-quality dataset? (fullset)
1. Agreement?
2. Comparison vs. the ground truth?

Experimental Setup
Crowdsourcing marketplace / portal
− Judgments per query: 3
Costs (per interface)
− Basic: $1.30
− Headline: $4.59
− HeadlineInline: $4.59
− HeadlineSummary: $5.56
− LinkSupported: $
Restrictions
− USA workers only
− 70% gold judgment cutoff
Measures
− Comparison with ground truth: Precision, Recall and Accuracy over our expert ground truth
− Worker agreement: Free-marginal multirater Kappa (Kfree) and Fleiss multirater Kappa (Kfleiss)
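A short illustrative sketch (not from the slides) of how the ground-truth measures listed above could be computed from the majority worker labels and the expert labels; the dictionary layout and example queries are assumptions.

    def precision_recall_accuracy(worker, expert):
        # worker, expert: {query: "Yes"/"No"} majority and expert labels over the same queries.
        tp = sum(worker[q] == "Yes" and expert[q] == "Yes" for q in expert)
        fp = sum(worker[q] == "Yes" and expert[q] == "No" for q in expert)
        fn = sum(worker[q] == "No" and expert[q] == "Yes" for q in expert)
        tn = sum(worker[q] == "No" and expert[q] == "No" for q in expert)
        precision = tp / (tp + fp) if tp + fp else 0.0  # news-related labels that agree with the expert
        recall = tp / (tp + fn) if tp + fn else 0.0     # expert news-related queries the workers found
        accuracy = (tp + tn) / len(expert)              # all queries labelled correctly
        return precision, recall, accuracy

    print(precision_recall_accuracy(
        {"gunman": "Yes", "abcnews": "Yes", "octopus": "No"},
        {"gunman": "Yes", "abcnews": "No", "octopus": "No"}))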

Talk Outline
− Introduction
− Dataset Construction Methodology
− Research Questions and Setting
− Experiments and Results
− Conclusions

Baseline and Validation (Basic interface)
How good is our baseline?
− Precision: the % of queries labelled as news-related that agree with our ground truth
− Recall: the % of all news-related queries that the workers labelled correctly
− Accuracy: combined measure (assumes that the workers labelled non-news-related queries correctly)
− Kfree: Kappa agreement assuming that workers label randomly
− Kfleiss: Kappa agreement assuming that workers label according to the class distribution
As expected, the baseline is fairly poor, i.e. agreement between workers per query is low (25-50%)
What is the effect of validation? Validation is very important: 32% of judgments were rejected, and 20% of those were completed VERY quickly: bots? Watch out for bursty judging... and new users

Adding Additional Information
Does label quality increase when we provide additional news-related information, and which is the best interface?
− Answer: yes, as shown by the performance increase
− The LinkSupported interface provides the highest performance
[Chart comparing agreement across the Basic, Headline, HeadlineInline, HeadlineSummary and LinkSupported interfaces]
− We can help workers by providing more information: more information increases performance
− Web results provide just as much information as headlines... but putting the information with each query causes workers to just match the text

Labelling the FullSet
We now label the fullset:
− 1204 queries
− Gold judgments
− LinkSupported interface
Are the resulting labels of sufficient quality? High recall and agreement indicate that the labels are of high quality
[Chart comparing the testset and fullset under the Link-Supported interface]
− Recall: workers got all of the news-related queries right!
− Precision: workers found other queries to be news-related
− Agreement: workers maybe learning the task? The majority of the work was done by 3 users

Conclusions & Best Practices
− Crowdsourcing is useful for building a news-query classification dataset
− We are confident that our dataset is reliable, since agreement is high
Best practices
− Online worker validation is paramount: catch out bots and lazy workers to improve agreement
− Provide workers with additional information to help improve labelling quality
− Workers can learn: running large single jobs may allow workers to become better at the task
Questions?

Results
Is crowdsourcing useful for post-labelling quality assurance?
− Creating expert validation is time-consuming
− We asked a second group of workers to validate the initial labels based on the links that the original workers provided
Do they agree with the original labels based upon the provided Web page? The linked page was judged to support the original label 82% of the time

Introduction
Researchers are often in need of specialist datasets to investigate new problems
− Time-consuming to produce
Alternative: crowdsourcing
− Cheap labor
− Fast prototyping
But:
− Low-quality work
− Unmotivated workers
− Susceptible to malicious work

Contributions
1. Examine the suitability of crowdsourcing for creating a news-query classification dataset
2. Propose and evaluate means to overcome the temporal nature of news queries
3. Investigate crowdsourcing for automatic quality assurance
4. Provide some best practices based on our experience

Results
How well do workers agree on news query classification labels?
[Chart: percentage agreement and `expert' validation for the Basic interface with no online validation]
− Precision: the % of queries labelled as news-related that agree with our expert labels
− Recall: the % of all news-related queries that the workers labelled correctly
− Accuracy: combined measure (assumes that the workers labelled non-news-related queries correctly)
− Kfree: Kappa agreement assuming that workers label randomly
− Kfleiss: Kappa agreement assuming that workers label according to the class distribution
Agreement is low (25-50%)

Results
How important is online validation of work?
− 32% of labels removed based on online validation
− ~20% of bad labels were made in the first 2 minutes of the job: evidence of bots attempting jobs on MTurk
[Chart: Basic interface, no validation vs. online validation]
− No random labelling, but still missing news queries
− Agreement markedly increases

Introduction
No freely available dataset exists to evaluate news query classification
We build a news query classification dataset using crowdsourcing
[Diagram: Classifier A vs. Classifier B - which is better?]

Crowdsourcing Task
Build a news query classification dataset
− End-user queries
− Binary classification labels
Each worker is shown an end-user query Q and the time t at which it was made
The worker must label each query as holding a news-related intent or not

Challenges
Classifying user queries is difficult: even humans have trouble distinguishing news queries
Query terms are often not enough
− News terms change over time
− Relevance depends on the news stories at the time the query was made
Queries to be labelled come from the MSN query log from May 2006
− Workers will not remember stories from that far back...
− ... and are not likely to look them up on their own

Methodology
Build two datasets
− testset: a 1/10th-size dataset to prototype with
− fullset: the final news query classification dataset
Test worker labelling performance on the testset
− With various interfaces providing additional news-related information from: news headlines, news summaries, Web search results
Use the best interface to build the fullset
− Evaluate quality

Methodology: Sampling
testset and fullset queries are sampled from the MSN May 2006 query log
Poisson sampling [Ozmutlu et al, 2004]: representative, unbiased
The testset would be sparse over the whole month, so we sample it from a single day only
[Table of query-log statistics: time-range (01/05 > 31/05 for the full log and fullset, 15/05 for the testset), # queries, mean queries per day and mean query length; most numeric values are not shown in the transcript]

Methodology: Validation
It is important to validate worker judgements
− eject poorly performing workers
− increase overall quality
We manually judge ~5-10% of queries in each set for use as validation
− Focus on news-related queries, as most queries are non-news-related
[Table of gold-judgment statistics: fullset (time-range 01/05 > 31/05, 5% of the target query set) and testset (15/05, 10%); the counts of news-related and non-news-related gold queries are garbled in the transcript]

Interfaces
The query alone likely provides too little information for workers to make accurate judgements
− news terms change over time
− workers won't remember the stories from the time of the queries
We test 5 different interfaces incorporating different types of additional news-related information
− Basic: query only
− Headline: query + 12 news headlines (from the New York Times) in the instructions
− Headline_Inline: query + 12 headlines provided for each query
− Headline+Summary: query + 12 headlines and news summaries in the instructions
− Link-Supported: query + links to Web search results (Bing, Google, Yahoo!)

 Introduction  Challenges  Methodology  Job Interfaces  Experimental Setting  Evaluating the Testset  Baseline Worker Agreement  Effect of Validation  Supplementing Worker Information  Evaluating the Fullset  Set Comparison  Crowdsourcing Additional Agreement Talk Outline 34

Experimental Setup: Measures
Agreement
− Free-Marginal Multirater Kappa (Kfree): assumes a 50% prior probability for each class
− Fleiss Multirater Kappa (Kfleiss): accounts for the relative size of each class
Accuracy
− Measured over a set of `expert' labels produced by the author
− Expert quality should be higher due to the longer time spent on the task, access to news content from the time, and a focus on news stories
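For concreteness, a sketch (not in the original slides) of both agreement measures for the 3-worker, 2-class setting: free-marginal kappa assumes equal chance priors for the two classes, while Fleiss kappa estimates chance agreement from the observed class distribution. The count-matrix layout and example values are illustrative.

    def multirater_kappas(counts, n_categories=2):
        # counts: one row per query, giving how many of the raters chose each
        # category, e.g. [2, 1] = 2 workers said news-related, 1 said not.
        N = len(counts)
        n = sum(counts[0])  # raters per query (3 in this study)
        # Observed agreement P_o: pairwise agreement averaged over all queries.
        p_o = sum(sum(c * (c - 1) for c in row) for row in counts) / (N * n * (n - 1))
        # Free-marginal kappa: chance agreement assumes equal class priors (1/k).
        p_e_free = 1.0 / n_categories
        k_free = (p_o - p_e_free) / (1 - p_e_free)
        # Fleiss kappa: chance agreement from the observed class proportions.
        p_j = [sum(row[j] for row in counts) / (N * n) for j in range(n_categories)]
        p_e_fleiss = sum(p * p for p in p_j)
        k_fleiss = (p_o - p_e_fleiss) / (1 - p_e_fleiss)
        return k_free, k_fleiss

    # Three example queries, each judged by 3 workers: [news-related, non-news-related]
    print(multirater_kappas([[3, 0], [2, 1], [0, 3]]))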

Evaluating the testset: Worker Agreement
How well do workers agree on news query classification labels?
− Low levels of agreement with the Basic interface
− Lack of validation and insufficient worker information are likely at fault
[Results table: Precision, Recall, Accuracy, Kfree and Kfleiss per interface (Basic, Headline, Headline_Inline, Headline+Summary, Link-Supported) and validation setting on the testset; numeric values not shown in the transcript]

Evaluating the testset: Importance of Validation
How does online validation affect labelling quality?
− Agreement and accuracy markedly increase, highlighting the importance of validation
− 32% of judgements were rejected
− ~20% of judgements were made during the first 2 minutes of the job: evidence that bots are attempting jobs on MTurk
[Results table as before: Precision, Recall, Accuracy, Kfree and Kfleiss per interface and validation setting on the testset; values not shown]

Evaluating the testset: Supplementing Worker Information
Were any of our alternative interfaces more effective?
− Agreement and accuracy markedly increase for the Headline+Summary and Link-Supported interfaces: giving workers more information helps
− Lower accuracy with Headline_Inline, likely due to workers matching headlines against the query
− Link-Supported provides the highest overall performance
[Results table as before: Precision, Recall, Accuracy, Kfree and Kfleiss per interface and validation setting on the testset; values not shown]

Evaluating the fullset: Evaluating the Full Dataset
Is the resulting dataset of good quality?
− Higher overall performance than over the testset, possibly due to workers learning the task
− Lower precision indicates disagreement with the expert labels, caused by news queries relating to tail events that are not represented in our expert labels
− Overall, the dataset is of good quality
[Results table: Precision, Recall, Accuracy, Kfree and Kfleiss for the Link-Supported interface on the fullset, compared with the testset settings; values not shown]

Evaluating the Fullset: Crowdsourcing Additional Agreement
Manually validating queries is time-consuming
− Can this also be done via crowdsourcing?
− We asked a second group of workers to validate the initial labels, based on the links that the original workers provided