Cheap and Fast – But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks. EMNLP 2008. Rion Snow (CS, Stanford), Brendan O'Connor (Dolores Labs), Daniel Jurafsky (Linguistics, Stanford), Andrew Y. Ng (CS, Stanford)

Agenda
Introduction
Task Design: Amazon Mechanical Turk (AMT)
Annotation Tasks
– Affective Text Analysis
– Word Similarity
– Recognizing Textual Entailment (RTE)
– Event Annotation
– Word Sense Disambiguation (WSD)
Bias Correction
Training with Non-Expert Annotations
Conclusion

Purpose. Annotation is important for NLP research, but it is expensive in both time (annotator-hours) and money (financial cost). Non-expert annotations offer quantity, but is their quality sufficient?

Motivation. Amazon's Mechanical Turk (AMT) system is cheap, fast, and provides non-expert labelers over the Web. Idea: collect datasets from AMT instead of from expert annotators.

Goals. (1) Compare non-expert annotations with expert annotations on the same data across five typical NLP tasks. (2) Provide a method for bias correction for non-expert labelers. (3) Compare machine learning classifiers trained on expert vs. non-expert annotations.

Amazon Mechanical Turk (AMT) AMT is an online labor market where workers are paid small amounts of money to complete small tasks. Requesters can restrict which workers are allowed to annotate a task by requiring that all workers have a particular set of qualifications. Requesters can give a bonus to individual workers. Amazon handles all financial transactions.
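
As a loose present-day illustration of that workflow (not the authors' 2008 setup), here is a minimal sketch using the boto3 MTurk client; the reward amount, approval-rate threshold, qualification ID, and question file below are assumptions for illustration only.

```python
import boto3

# MTurk requester client; the sandbox endpoint is assumed here so the sketch
# can be tried without spending money (the live endpoint drops "sandbox").
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# A real HIT needs a QuestionForm or ExternalQuestion XML; this file is hypothetical.
question_xml = open("affect_question.xml").read()

hit = mturk.create_hit(
    Title="Rate the emotions of a news headline",
    Description="Rate anger, disgust, fear, joy, sadness, surprise, and valence.",
    Keywords="annotation, emotion, headline",
    Reward="0.04",                      # USD per assignment (illustrative)
    MaxAssignments=10,                  # 10 independent annotations per item
    AssignmentDurationInSeconds=600,
    LifetimeInSeconds=7 * 24 * 3600,
    Question=question_xml,
    QualificationRequirements=[{
        # Assumed to be the built-in "approval rate" system qualification type.
        "QualificationTypeId": "000000000000000000L0",
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [95],          # restrict to workers with >= 95% approval
    }],
)
print("Created HIT:", hit["HIT"]["HITId"])
```

Bonuses for individual workers can later be paid through the same client's send_bonus call, with Amazon handling the financial transactions.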

Task Design. Analyze the quality of non-expert annotations on five tasks: Affective Text Analysis, Word Similarity, Recognizing Textual Entailment, Event Annotation, and Word Sense Disambiguation. For every task, the authors collect 10 independent annotations for each unique item.

Affective Text Analysis. Proposed by Strapparava & Mihalcea (2007): judge headlines on six emotions and a valence value. The emotions (anger, disgust, fear, joy, sadness, and surprise) range over [0, 100]; valence ranges over [-100, 100]. Example headline: "Outcry at N Korea 'nuclear test'", rated for Anger, Disgust, Fear, Joy, Sadness, Surprise, and Valence.

Expert and Non-Expert Correlations: 5 experts (E) and 10 non-experts (NE).

Non-Expert Correlation for Affect Recognition. Overall, it takes about 4 non-expert annotations per example to achieve a correlation equivalent to that of a single expert annotator. At 3,500 non-expert annotations per USD, this amounts to roughly 875 expert-equivalent annotations per USD.
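
A minimal sketch of the comparison behind this number, assuming a matrix of per-item numeric labels from the non-experts and a vector of expert labels (names are illustrative; the paper averages over random subsets of annotators rather than simply the first k):

```python
import numpy as np
from scipy.stats import pearsonr

def expert_equivalence_k(nonexpert, expert, target_r):
    """Smallest k such that averaging k non-expert labels per item
    correlates with the expert labels at least as well as target_r.

    nonexpert: array of shape (n_items, n_annotators) with numeric labels
    expert:    array of shape (n_items,) with the expert's labels
    target_r:  e.g. the average expert-vs-expert Pearson correlation
    """
    n_items, n_annotators = nonexpert.shape
    for k in range(1, n_annotators + 1):
        averaged = nonexpert[:, :k].mean(axis=1)   # average the first k labelers
        r, _ = pearsonr(averaged, expert)
        if r >= target_r:
            return k, r
    return None, None

# Toy usage with random data standing in for valence scores in [-100, 100].
rng = np.random.default_rng(0)
truth = rng.uniform(-100, 100, size=100)
expert = truth + rng.normal(0, 10, size=100)
nonexpert = truth[:, None] + rng.normal(0, 40, size=(100, 10))
print(expert_equivalence_k(nonexpert, expert, target_r=0.9))
```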

Word Similarity. Proposed by Rubenstein & Goodenough (1965): numerically judge the similarity of 30 word pairs on a scale of [0, 10], e.g. {boy, lad} => highly similar, {noon, string} => unrelated. All 300 annotations were completed by 10 labelers within 11 minutes.

Word Similarity Correlation

Recognizing Textual Entailment. Proposed in the PASCAL Recognizing Textual Entailment task (Dagan et al., 2006): the labeler is presented with two sentences and makes a binary choice of whether the hypothesis sentence can be inferred from the text. Hypothesis: "Oil prices drop." – Text: "Crude Oil Prices Slump." (True) – Text: "The government announced last week that it plans to raise oil prices." (False). 10 annotations were collected for each of the 800 sentence pairs.

Non-Expert RTE Accuracy
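
The accuracy-versus-annotators curve for RTE comes from majority voting over k non-expert labels per item. A minimal sketch under assumed variable names (ties are broken toward the negative class here, whereas the paper breaks them randomly):

```python
import numpy as np

def majority_vote_accuracy(labels, gold, k):
    """Accuracy of majority vote over the first k binary labels per item.

    labels: array of shape (n_items, n_annotators) with values in {0, 1}
    gold:   array of shape (n_items,) with expert labels in {0, 1}
    """
    votes = labels[:, :k].sum(axis=1)
    prediction = (votes > k / 2).astype(int)    # ties go to the negative class
    return (prediction == gold).mean()

# Illustrative use: accuracy as a function of the number of annotators,
# simulating 10 labelers who are each right about 85% of the time.
rng = np.random.default_rng(1)
gold = rng.integers(0, 2, size=800)
correct = rng.random((800, 10)) < 0.85
labels = np.where(correct, gold[:, None], 1 - gold[:, None])
for k in (1, 3, 5, 10):
    print(k, round(majority_vote_accuracy(labels, gold, k), 3))
```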

Event Annotation. Inspired by the TimeBank corpus (Pustejovsky et al., 2003). Example text: "It just blew up in the air, and then we saw two fireballs go down to the water, and there was a big small, ah, smoke, from ah, coming up from that." Task: determine which of two events in the text occurs first. Each event pair received 10 annotations.

Temporal Ordering Accuracy

Word Sense Disambiguation. Based on the SemEval Word Sense Disambiguation Lexical Sample task (Pradhan et al., 2007). The labeler is presented with a paragraph of text containing the word "president" and asked which of the following three sense labels is most appropriate: – executive officer of a firm, corporation, or university – head of a country – head of the U.S., President of the United States. 10 annotations were collected for each of the 177 examples given in SemEval.

WSD Accuracy. There is only a single disagreement with the gold standard, which in fact turns out to be an error in the original gold-standard annotation. After correcting this error, the non-expert accuracy rate is 100%. Non-expert annotations can thus even be used to correct expert annotations.

Costs for Non-Expert Annotations. Time is given as the total number of hours elapsed from when the requester submitted the task to AMT until the last assignment was submitted by the last worker.

Bias Correction for Non-Expert Labelers. Options for handling noisy labelers: recruit more labelers per item; rely on Amazon's compensation mechanisms; or model the reliability and biases of individual labelers and correct for them.

Bias Correction Model for Categorical Data. Use a small expert-labeled sample to estimate each labeler's response likelihoods. Each labeler's vote is then weighted by the log-likelihood ratio for her given response: – labelers who are more than 50% accurate get positive vote weights – labelers whose judgments are pure noise get zero weight – anti-correlated labelers get negative weights.
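
A minimal sketch of this weighted voting for binary labels, assuming a small gold-labeled calibration set per worker (names are illustrative; the paper's full model also folds in a label prior estimated from the same gold sample, exposed here as log_prior):

```python
import math
from collections import defaultdict

def estimate_weights(calibration, smoothing=1.0):
    """Per-worker log-likelihood-ratio weights from a small gold-labeled sample.

    calibration: iterable of (worker_id, response, gold_label), labels in {0, 1}
    Returns weights[worker][response] = log P(response | true=1) / P(response | true=0).
    """
    # Laplace-smoothed counts of (response, gold) pairs per worker.
    counts = defaultdict(lambda: {(r, y): smoothing for r in (0, 1) for y in (0, 1)})
    for worker, response, gold in calibration:
        counts[worker][(response, gold)] += 1
    weights = {}
    for worker, c in counts.items():
        weights[worker] = {}
        for response in (0, 1):
            p_given_1 = c[(response, 1)] / (c[(0, 1)] + c[(1, 1)])
            p_given_0 = c[(response, 0)] / (c[(0, 0)] + c[(1, 0)])
            weights[worker][response] = math.log(p_given_1 / p_given_0)
    return weights

def corrected_label(votes, weights, log_prior=0.0):
    """votes: list of (worker_id, response); returns the bias-corrected label.

    Accurate workers push the score in the direction they vote, noisy workers
    contribute ~0, and anti-correlated workers push it the opposite way.
    Workers with no calibration data are simply skipped.
    """
    score = log_prior + sum(weights[w][r] for w, r in votes if w in weights)
    return 1 if score > 0 else 0
```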

Bias Correction Results: RTE & Event Annotation Evaluated with 20-fold cross-validation.

Training with Non-Expert Annotations. Compare a supervised affect recognition system trained on expert vs. non-expert annotations, using a bag-of-words unigram model similar to the SWAT system (Katz et al., 2007) on the SemEval Affective Text task.
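
A rough sketch of such a unigram scorer for one emotion dimension, assuming tokenized training headlines paired with numeric scores (this only loosely mirrors the SWAT-style word averaging; names are illustrative):

```python
from collections import defaultdict

def train_unigram_affect(headlines, scores):
    """For each word, average the emotion score of every headline containing it.

    headlines: list of token lists; scores: list of floats (e.g. 'fear' in [0, 100]).
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for tokens, score in zip(headlines, scores):
        for word in set(tokens):
            totals[word] += score
            counts[word] += 1
    return {word: totals[word] / counts[word] for word in totals}

def predict_affect(model, tokens, default=0.0):
    """Score a new headline as the mean score of its known words."""
    known = [model[w] for w in tokens if w in model]
    return sum(known) / len(known) if known else default

# Toy usage on a single emotion dimension.
train = [(["outcry", "at", "nuclear", "test"], 60.0),
         (["team", "wins", "title"], 5.0)]
model = train_unigram_affect([t for t, _ in train], [s for _, s in train])
print(predict_affect(model, ["nuclear", "outcry"]))
```

The training targets can be either the expert labels or the (averaged) non-expert labels, which is exactly the comparison reported on the next slide.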

Performance of Expert-Trained vs. Non-Expert-Trained Classifiers Why is a single set of non-expert annotations better than a single expert annotation?

Conclusion. Using AMT is effective for a variety of NLP annotation tasks. Only a small number of non-expert annotations per item is needed to equal the performance of an expert annotator. Controlling for labeler bias yields a significant further improvement.

THE END

Pearson Correlation
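
For reference, the Pearson correlation between paired annotation scores $x_i$ and $y_i$ over $n$ items is

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

where $\bar{x}$ and $\bar{y}$ are the mean scores of the two annotators (or annotator averages) being compared.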