Crowdsourcing a News Query Classification Dataset
Richard McCreadie, Craig Macdonald & Iadh Ounis
Slide 1: Introduction
What is news query classification and why would we build a dataset to examine it?
− A binary classification task performed by Web search engines
− Up to 10% of queries may be news-related [Bar-Ilan et al., 2009]
− Have workers judge Web search queries as news-related or not
[Diagram: the query "gunman" is submitted to a Web search engine, classified as news-related or non-news-related, and the user receives either news results or Web search results]
Slide 2: Introduction
But:
− News-relatedness is subjective (e.g. the query "Octopus": a sea creature, or World Cup predictions? Is it news-related?)
− Workers can easily 'game' the task, e.g. by labelling at random: for each query in the task, return Random(Yes, No) (a runnable sketch follows below)
− News queries change over time
How can we overcome these difficulties to create a high-quality dataset for news query classification?
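A runnable version of the slide's random-labelling pseudocode, as a minimal Python sketch; the function name and example queries are illustrative, not from the original task:

```python
import random

def lazy_worker(task_queries):
    """A gaming worker: ignores each query and returns a random binary label."""
    return {query: random.choice(["Yes", "No"]) for query in task_queries}

# On a balanced query set such random answers look right about half the time,
# which is why gold (honey-pot) queries and multiple workers per query are needed.
print(lazy_worker(["gunman", "octopus", "what is may day?"]))
```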
Slide 3: Talk Outline
− Introduction (slides 1--3)
− Dataset Construction Methodology (4--14)
− Research Questions and Setting (15--17)
− Experiments and Results (18--20)
− Conclusions (21)
Slide 4: Dataset Construction Methodology
How can we go about building a news query classification dataset?
1. Sample queries from the MSN May 2006 query log
2. Create gold judgments to validate the workers
3. Propose additional content to tackle the temporal nature of news queries, and prototype interfaces to evaluate this content on a small test set
4. Create the final labels using the best setting and interface
5. Evaluate in terms of agreement
6. Evaluate against 'experts'
Slide 5: Dataset Construction Methodology
Sampling queries:
− Create 2 query sets sampled from the MSN May 2006 query log using Poisson sampling
− One for testing (testset): fast crowdsourcing turn-around time, very low cost (91 queries)
− One for the final dataset (fullset): 10x the queries, only labelled once (1206 queries)
Example queries:
Date       | Time     | Query
2006-05-01 | 00:00:08 | What is May Day?
2006-05-08 | 14:43:42 | protest in Puerto rico
Slide 6: Dataset Construction Methodology
How to check our workers are not 'gaming' the system?
− Gold judgments (honey-pot): a small set (5%) of queries used to catch out bad workers early in the task; 'cherry-picked' unambiguous queries, with a focus on news-related ones
− Multiple workers per query: 3 workers, majority result (a sketch of the aggregation follows below)
Example validation queries:
Date       | Time     | Query                  | Validation
2006-05-01 | 00:00:08 | What is May Day?       | No
2006-05-08 | 14:43:42 | protest in Puerto rico | Yes
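A minimal sketch of the 3-workers-per-query majority aggregation described above; the example judgments are hypothetical:

```python
from collections import Counter

def majority_label(worker_labels):
    """Aggregate the binary labels ('Yes'/'No') of the 3 workers assigned to a query."""
    return Counter(worker_labels).most_common(1)[0][0]

# Hypothetical judgments: 3 workers per query.
judgments = {
    "what is may day?": ["No", "No", "Yes"],
    "protest in puerto rico": ["Yes", "Yes", "Yes"],
}
labels = {query: majority_label(votes) for query, votes in judgments.items()}
print(labels)  # {'what is may day?': 'No', 'protest in puerto rico': 'Yes'}
```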
Slide 7: Dataset Construction Methodology
How to counter the temporal nature of news queries?
− Workers need to know what the news stories of the time were...
− ...but they likely will not remember the main stories from May 2006
Idea: add extra information to the interface
− News headlines
− News summaries
− Web search results
Prototype interfaces
− Use the small testset to keep costs and turn-around time low
− See which works best
Slide 8: Interfaces: Basic
− Shows the query and its date
− Instructions explain what the workers need to do and clarify news-relatedness
− Workers give a binary label
Slide 9: Interfaces: Headline
− Adds 12 news headlines from the New York Times to the instructions
− ...but will the workers bother to read these?
Slide 10: Interfaces: HeadlineInline
− The 12 New York Times headlines are shown alongside each query
− ...but maybe headlines are not enough?
Slide 11: Interfaces: HeadlineSummary
− Each news headline is accompanied by a news summary
− Example query: Tigers of Tamil?
Slide 12: Interfaces: LinkSupported
− Links to three major search engines
− Each link triggers a search containing the query and its date (a sketch of such links follows below)
− Also gathers some additional feedback from workers
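As an illustration of the link-supported idea, a small sketch that builds query-plus-date links to three engines' standard public search URLs; the exact URL patterns used in the original interface are not given in the slides, so these are assumptions:

```python
from urllib.parse import urlencode

def search_links(query, date):
    """Build links to three major Web search engines for a query plus its date."""
    terms = f"{query} {date}"
    return {
        "Google": "https://www.google.com/search?" + urlencode({"q": terms}),
        "Bing": "https://www.bing.com/search?" + urlencode({"q": terms}),
        "Yahoo!": "https://search.yahoo.com/search?" + urlencode({"p": terms}),
    }

print(search_links("protest in puerto rico", "2006-05-08"))
```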
Slide 13: Dataset Construction Methodology
How do we evaluate the quality of our labels?
− Agreement between the three workers per query: the more the workers agree, the more confident we can be that the resulting majority label is correct
− Compare with 'expert' (me) judgments: see how many of the queries that the workers judged news-related match the ground truth
Example:
Date       | Time     | Query                  | Worker | Expert
2006-05-05 | 07:31:23 | abcnews                | Yes    | No
2006-05-08 | 14:43:42 | protest in Puerto rico | Yes    | Yes
Slide 14: Talk Outline
− Introduction (slides 1--3)
− Dataset Construction Methodology (4--14)
− Research Questions and Setting (15--17)
− Experiments and Results (18--20)
− Conclusions (21)
Slide 15: Experimental Setup: Research Questions
How do our interface and setting affect the quality of our labels?
1. Baseline quality: how bad is it?
2. How much can the honey-pot bring?
3. What about our extra information (headlines, summaries, result rankings)?
Can we create a good-quality dataset (on both the testset and the fullset)?
1. Agreement?
2. Vs. the ground truth?
Slide 16: Experimental Setup
Crowdsourcing marketplace
− Portal
− Judgments per query: 3
Costs (per interface)
− Basic: $1.30
− Headline: $4.59
− HeadlineInline: $4.59
− HeadlineSummary: $5.56
− LinkSupported: $8.78
Restrictions
− USA workers only
− 70% gold judgment cutoff (a sketch of this check follows below)
Measures
− Comparison with ground truth: Precision, Recall and Accuracy over our expert ground truth
− Worker agreement: free-marginal multirater kappa (K_free), Fleiss multirater kappa (K_fleiss)
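A minimal sketch of the 70% gold-judgment cutoff, assuming each worker's answers on the hidden gold (honey-pot) queries are checked against the prepared gold labels; the function name and data are illustrative:

```python
def passes_gold_cutoff(worker_gold_answers, gold_labels, cutoff=0.7):
    """Return True if the worker's accuracy on the gold (honey-pot) queries
    meets the cutoff; otherwise the worker's judgments are rejected."""
    correct = sum(1 for query, answer in worker_gold_answers.items()
                  if gold_labels.get(query) == answer)
    return correct / len(worker_gold_answers) >= cutoff

gold = {"protest in puerto rico": "Yes", "what is may day?": "No"}
answers = {"protest in puerto rico": "Yes", "what is may day?": "Yes"}
print(passes_gold_cutoff(answers, gold))  # False: 50% accuracy < 70% cutoff
```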
Slide 17: Talk Outline
− Introduction (slides 1--3)
− Dataset Construction Methodology (4--14)
− Research Questions and Setting (15--17)
− Experiments and Results (18--20)
− Conclusions (21)
Slide 18: Baseline and Validation (Basic interface)
Measures (computed against the expert ground truth; a sketch follows after this slide):
− Precision: the % of queries labelled as news-related that agree with our ground truth
− Recall: the % of all news-related queries that the workers labelled correctly
− Accuracy: combined measure (assumes that the workers labelled non-news-related queries correctly)
− K_free: kappa agreement assuming that workers would label randomly
− K_fleiss: kappa agreement assuming that workers label according to the class distribution
How is our baseline?
− As expected, the baseline is fairly poor, i.e. agreement between workers per query is low (25-50%)
What is the effect of validation?
− Validation is very important: 32% of judgments were rejected
− 20% of those were completed VERY quickly: bots? Watch out for bursty judging... and new users
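A minimal sketch of the textbook precision, recall and accuracy computation for majority worker labels against an expert ground truth, with news-related ('Yes') as the positive class; this is a standard computation, not necessarily the exact procedure behind the slide's numbers:

```python
def evaluate(worker_labels, expert_labels):
    """Precision, recall and accuracy of worker labels vs. expert labels."""
    tp = sum(1 for q in expert_labels
             if worker_labels[q] == "Yes" and expert_labels[q] == "Yes")
    fp = sum(1 for q in expert_labels
             if worker_labels[q] == "Yes" and expert_labels[q] == "No")
    fn = sum(1 for q in expert_labels
             if worker_labels[q] == "No" and expert_labels[q] == "Yes")
    tn = sum(1 for q in expert_labels
             if worker_labels[q] == "No" and expert_labels[q] == "No")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(expert_labels)
    return precision, recall, accuracy

expert = {"abcnews": "No", "protest in puerto rico": "Yes"}
workers = {"abcnews": "Yes", "protest in puerto rico": "Yes"}
print(evaluate(workers, expert))  # (0.5, 1.0, 0.5)
```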
Slide 19: Adding Additional Information
By providing additional news-related information, does label quality increase, and which is the best interface?
− Answer: yes, as shown by the performance increase from Basic to the Headline, HeadlineInline, HeadlineSummary and LinkSupported interfaces
− The LinkSupported interface provides the highest performance
− Agreement: we can help workers by providing more information; more information increases performance
− Web results provide just as much information as headlines... but putting the information with each query causes workers to just match the text
Slide 20: Labelling the Fullset
We now label the fullset
− 1204 queries
− Gold judgments
− LinkSupported interface
Are the resulting labels of sufficient quality?
− High recall and agreement indicate that the labels are of high quality
− Recall: workers got all of the news-related queries right!
− Precision: workers found other queries to be news-related
− Agreement: workers maybe learning the task?
− The majority of the work was done by 3 users
Slide 21: Conclusions & Best Practices
− Crowdsourcing is useful for building a news query classification dataset
− We are confident that our dataset is reliable, since agreement is high
Best practices
− Online worker validation is paramount: catch out bots and lazy workers to improve agreement
− Provide workers with additional information to help improve labelling quality
− Workers can learn: running large single jobs may allow workers to become better at the task
Questions?
Slide 22: Results
Is crowdsourcing useful for post-labelling quality assurance?
− Creating expert validation is time consuming
− We asked a second group of workers to validate the initial labels, based on the links that the original workers provided: from the link provided, does it support the label?
− 82% of these second-round judgments agreed with the original labels
Slide 23: Introduction
Researchers are often in need of specialist datasets to investigate new problems
− Time consuming to produce
Alternative: crowdsourcing
− Cheap labour
− Fast prototyping
But:
− Low-quality work
− Unmotivated workers
− Susceptible to malicious work
Slide 24: Contributions
1. Examine the suitability of crowdsourcing for creating a news query classification dataset
2. Propose and evaluate means to overcome the temporal nature of news queries
3. Investigate crowdsourcing for automatic quality assurance
4. Provide some best practices based on our experience
Slide 25: Results
How well do workers agree on news query classification labels?
(Basic interface, no online validation; percentage agreement with the 'expert' validation)
− Precision: the % of queries labelled as news-related that agree with our expert labels
− Recall: the % of all news-related queries that the workers labelled correctly
− Accuracy: combined measure (assumes that the workers labelled non-news-related queries correctly)
− K_free: kappa agreement assuming that workers would label randomly
− K_fleiss: kappa agreement assuming that workers label according to the class distribution
− Agreement is low (25-50%)
Slide 26: Results
How important is online validation of work?
− 32% of labels were removed based on online validation
− ~20% of bad labels were made in the first 2 minutes of the job: evidence of bots attempting jobs on MTurk
− With online validation (vs. none) on the Basic interface: no more random labelling and agreement markedly increases, but news queries are still being missed
Slide 27: Introduction
− No freely available dataset exists to evaluate news query classification
− We build a news query classification dataset using crowdsourcing
[Diagram: comparing Classifier A vs. Classifier B requires such a dataset]
Slide 28: Crowdsourcing Task
Build a news query classification dataset
− End-user queries
− Binary classification labels
Each worker is shown an end-user query Q and the time t it was made
The worker must label each query as holding a news-related intent or not
Slide 29: Challenges
Classifying user queries is difficult
− Even humans have trouble distinguishing news queries
Query terms are often not enough
− News terms change over time
− Depends on the news stories of the time the query was made
Queries to be labelled come from the MSN query log from May 2006
− Workers will not remember stories from that far back...
− ...and are not likely to look them up on their own
Slide 30: Methodology
Build two datasets
− testset: a 1/10th-size dataset to prototype with
− fullset: the final news query classification dataset
Test worker labelling performance on the testset
− With various interfaces providing additional news-related information from: news headlines, news summaries, Web search results
Use the best interface to build the fullset
− Evaluate its quality
Slide 31: Methodology: Sampling
testset and fullset queries are sampled from the MSN May 2006 query log
− Poisson sampling [Ozmutlu et al., 2004]: representative, unbiased (a sketch follows below)
− The testset would be sparse over the whole month, so it is sampled from a single day only

                     | Query log     | fullset       | testset
Time-range           | 01/05 > 31/05 | 01/05 > 31/05 | 15/05
# queries            | 14,921,286    | 1206          | 91
Mean queries per day | 481,331       | 38.9          | 91
Mean query length    | 2.29          | 2.39          | 2.49
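An illustrative sketch of Poisson sampling over a query log, under the simplifying assumption that queries are drawn by skipping ahead with exponentially distributed gaps so the sample is spread evenly over the whole log; this is a minimal approximation, not the exact procedure of [Ozmutlu et al., 2004]:

```python
import random

def poisson_sample(queries, target_size, seed=0):
    """Walk the query log, skipping ahead by exponentially distributed gaps so
    that roughly `target_size` queries are drawn evenly across the whole log."""
    random.seed(seed)
    mean_gap = len(queries) / target_size
    sample, i = [], 0
    while i < len(queries):
        sample.append(queries[i])
        i += max(1, round(random.expovariate(1.0 / mean_gap)))
    return sample

log = [f"query_{n}" for n in range(100_000)]  # stand-in for the 14.9M-query MSN log
print(len(poisson_sample(log, 91)))  # roughly 91 sampled queries
```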
Slide 32: Methodology: Validation
It is important to validate worker judgements
− Eject poorly performing workers
− Increase overall quality
We manually judge ~5-10% of queries in each set for use as validation
− Focus on news-related queries, as most queries are non-news-related

                      | fullset       | testset
Time-range            | 01/05 > 31/05 | 15/05
# queries             | 61            | 9
Mean queries per day  | 1.97          | 9
% of target query-set | 5%            | 10%
News-related          | 47            | 5
Non-news-related      | 14            | 4
Slide 33: Interfaces
The query alone likely provides too little information for workers to make accurate judgements
− News terms change over time
− Workers won't remember the stories from the time of the queries
We test 5 different interfaces incorporating different types of additional news-related information
− Basic: query only
− Headline: query + 12 news headlines (from the New York Times) in the instructions
− Headline_Inline: query + the 12 headlines provided alongside each query
− Headline+Summary: query + the 12 headlines and news summaries in the instructions
− Link-Supported: query + links to Web search results (Bing, Google, Yahoo!)
Slide 34: Talk Outline
− Introduction
− Challenges
− Methodology
− Job Interfaces
− Experimental Setting
− Evaluating the Testset: Baseline Worker Agreement, Effect of Validation, Supplementing Worker Information
− Evaluating the Fullset: Set Comparison, Crowdsourcing Additional Agreement
Slide 35: Experimental Setup: Measures
Agreement (a sketch of both measures follows below)
− Free-marginal multirater kappa (K_free): assumes a 50% prior probability for each class
− Fleiss multirater kappa (K_fleiss): accounts for the relative size of each class
Accuracy
− Over a set of 'expert' labels produced by the author
− Quality should be higher due to the longer time spent on the task, access to news content from the time, and a focus on the news stories
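A minimal sketch of the two agreement measures for 3 raters per query, following the standard free-marginal (Randolph) and Fleiss multirater kappa formulas; the example ratings are hypothetical:

```python
def multirater_kappas(ratings, categories=("Yes", "No")):
    """Free-marginal and Fleiss multirater kappa for a list of per-query label
    lists, each rated by the same number of workers (3 here)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # n_ik: how many raters put item i into category k
    counts = [[labels.count(c) for c in categories] for labels in ratings]
    # Mean observed pairwise agreement across items
    p_obs = sum(
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Chance agreement: uniform classes (free-marginal) vs. observed class sizes (Fleiss)
    p_exp_free = 1.0 / len(categories)
    p_k = [sum(row[k] for row in counts) / (n_items * n_raters)
           for k in range(len(categories))]
    p_exp_fleiss = sum(p * p for p in p_k)
    k_free = (p_obs - p_exp_free) / (1 - p_exp_free)
    k_fleiss = (p_obs - p_exp_fleiss) / (1 - p_exp_fleiss)
    return k_free, k_fleiss

print(multirater_kappas([["Yes", "Yes", "No"], ["No", "No", "No"],
                         ["Yes", "Yes", "Yes"], ["No", "No", "Yes"]]))
```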
Slide 36: Evaluating the testset: Worker Agreement
How well do workers agree on news query classification labels?
− Low levels of agreement with the basic interface
− Lack of validation and insufficient worker information are likely at fault

Interface | Query Set | Validation | Precision | Recall | Accuracy | K_free | K_fleiss
Basic     | testset   | No         | 0.5       | 0.5714 | 0.9263   | 0.5833 | 0.2647
(the remaining rows are filled in on the following slides)
Slide 37: Evaluating the testset: Importance of Validation
How does online validation affect labelling quality?
− Agreement and accuracy markedly increase
− Highlights the importance of validation
− 32% of judgements were rejected; ~20% of judgements were made during the first 2 minutes of the job
− Evidence that bots are attempting jobs on MTurk

Interface | Query Set | Validation | Precision | Recall | Accuracy | K_free | K_fleiss
Basic     | testset   | No         | 0.5       | 0.5714 | 0.9263   | 0.5833 | 0.2647
Basic     | testset   | Yes        | 1.0       | 0.5714 | 0.9681   | 0.7373 | 0.3525
(the remaining rows are filled in on the following slides)
Slide 38: Evaluating the testset: Supplementing Worker Information
Were any of our alternative interfaces more effective?
− Agreement and accuracy markedly increase for the Headline+Summary and Link-Supported interfaces: giving workers more information helps
− Lower accuracy on Headline_Inline, likely due to workers matching headlines against the query
− Link-Supported provides the highest overall performance

Interface        | Query Set | Validation | Precision | Recall | Accuracy | K_free | K_fleiss
Basic            | testset   | No         | 0.5       | 0.5714 | 0.9263   | 0.5833 | 0.2647
Basic            | testset   | Yes        | 1.0       | 0.5714 | 0.9681   | 0.7373 | 0.3525
Headline         | testset   | Yes        | 0.5       | 0.5714 | 0.9263   | 0.8    | 0.5148
Headline_Inline  | testset   | Yes        | 1.0       | 0.2857 | 0.9474   | 0.7866 | 0.3018
Headline+Summary | testset   | Yes        | 1.0       | 0.5714 | 0.9684   | 0.83   | 0.5327
Link-Supported   | testset   | Yes        | 1.0       | 0.5714 | 0.9684   | 0.8367 | 0.5341
(the fullset row is filled in on the next slide)
Slide 39: Evaluating the fullset: Evaluating the Full Dataset
Is the resulting dataset of good quality?
− Higher overall performance than over the testset, possibly due to workers learning the task
− Lower precision indicates disagreement with the expert labels, caused by news queries relating to tail events that would not be represented in our labels
− The dataset is overall of good quality

Interface        | Query Set | Validation | Precision | Recall | Accuracy | K_free | K_fleiss
Basic            | testset   | No         | 0.5       | 0.5714 | 0.9263   | 0.5833 | 0.2647
Basic            | testset   | Yes        | 1.0       | 0.5714 | 0.9681   | 0.7373 | 0.3525
Headline         | testset   | Yes        | 0.5       | 0.5714 | 0.9263   | 0.8    | 0.5148
Headline_Inline  | testset   | Yes        | 1.0       | 0.2857 | 0.9474   | 0.7866 | 0.3018
Headline+Summary | testset   | Yes        | 1.0       | 0.5714 | 0.9684   | 0.83   | 0.5327
Link-Supported   | testset   | Yes        | 1.0       | 0.5714 | 0.9684   | 0.8367 | 0.5341
Link-Supported   | fullset   | Yes        | 0.6761    | 1.0    | 0.9748   | 0.8358 | 0.7677
Slide 40: Evaluating the Fullset: Crowdsourcing Additional Agreement
Manually testing queries is time consuming
− Can this also be done via crowdsourcing?
− We asked a second group of workers to validate the initial labels, based on the links that the original workers provided