Presentation transcript:

Consensus Relevance with Topic and Worker Conditional Models
Paul N. Bennett, Microsoft Research
Joint with Ece Kamar, Microsoft Research, and Gabriella Kazai, Microsoft Research Cambridge

Motivation for Consensus Task
Recover the actual relevance of a topic-document pair from noisy predictions by multiple labelers.
Obtain a more reliable signal from the crowd and/or benefit from scale (expert quality from inexperienced assessors).
Variety of proposed approaches in the literature and in the competition:
– Supervised: classification models.
– Semi-supervised: EM-style algorithms.
– Unsupervised: majority vote (a minimal sketch follows below).
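For reference, a minimal majority-vote consensus might look like the following sketch; the binary response encoding (1 = relevant, 0 = not relevant) and the tie-breaking rule are assumptions for illustration, not details from the slides.

```python
from collections import Counter

def majority_vote(responses):
    """Consensus label from a list of binary worker responses (1 = relevant, 0 = not).

    Ties are broken in favor of 'relevant'; this tie-breaking rule is an
    assumption for illustration, not something specified in the slides.
    """
    counts = Counter(responses)
    return 1 if counts[1] >= counts[0] else 0

# Example: three workers say relevant, two say not relevant -> consensus is relevant.
print(majority_vote([1, 1, 0, 1, 0]))  # prints 1
```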

Common Axes of Generalization
Two axes: whether the topic was observed in training, and whether the document's relevance was observed in training.
– Topic observed, document relevance not observed: compute consensus for "new documents" on known topics.
– Topic not observed, document relevance observed on other topics: compute consensus on new topics for documents with known relevance on other topics.
– Neither observed: use rules or observed worker accuracies on other topics/documents to compute consensus on new topics and documents.
Note the hidden axis of observed workers.

Our Approach
Supervised
– Given gold-truth judgments on a topic set and worker responses, learn a consensus model that generalizes to new documents on the same topic set.
– Must be able to generalize to new workers.
Want a well-founded probabilistic method
– Needs to handle major sources of worker error: worker skill/accuracy and topic difficulty.
– Needs to handle correlation in labels (correlation expected because of the underlying label).
Note: we will use "assessor" for the ground-truth labeler and "worker" for noisy labelers.

Basic Model
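A rough sketch, in our own notation, of the general form shared by the consensus models on the following slides: the posterior over the relevance $r$ of a topic-document pair given worker responses $w_1, \dots, w_n$ follows Bayes' rule, and the variants below differ only in how the likelihood term is factored and conditioned.

```latex
P(r \mid w_1, \dots, w_n)
  = \frac{P(r)\, P(w_1, \dots, w_n \mid r)}
         {\sum_{r'} P(r')\, P(w_1, \dots, w_n \mid r')}
```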

Exchangeability-Related Assumptions
– Given two identical voting histories, we assume the two workers have the same response distribution.
– Whether or not a worker's opinion is elicited is not informative.
– The ordering of responses/elicitation is not informative.

Relevance Conditional Independence
Assume conditional independence of worker responses given document relevance.
– Implies workers have comparable accuracies across tasks.
Assume one topic-independent prior on relevance.
Referred to as naïve Bayes.
Model components: the probability of relevance across all topics (the prior), and the probability of a random worker's response given relevance (across all topics).
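Written out in our notation (a sketch consistent with the description above, not copied from the slide), the relevance-conditional independence assumption gives the familiar naïve Bayes posterior, with $P(r)$ the topic-independent prior and $P(w_i \mid r)$ the probability of a random worker's response given relevance, pooled over all topics:

```latex
P(r \mid w_1, \dots, w_n) \;\propto\; P(r) \prod_{i=1}^{n} P(w_i \mid r)
```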

Topic and Relevance Conditional Independence
Assume responses conditionally independent given topic and relevance.
– Implies workers have comparable accuracy within a topic, but varying across topics.
Assume a topic-dependent prior on relevance.
Referred to as nB Topic.
Model components: the probability of relevance for this topic (the prior), and the probability of a random worker's response given relevance for this topic.
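In the same notation, the topic-conditional variant conditions both factors on the topic $t$:

```latex
P(r \mid t, w_1, \dots, w_n) \;\propto\; P(r \mid t) \prod_{i=1}^{n} P(w_i \mid r, t)
```

A minimal Python sketch of estimating and applying this nB Topic posterior; the tuple layout of the training data and the use of Laplace smoothing are assumptions for illustration, not details from the slides.

```python
from collections import defaultdict

def train_nb_topic(labeled, alpha=1.0):
    """Estimate topic-conditional parameters from gold-labeled worker responses.

    `labeled` is an iterable of (topic, doc_id, response, relevance) tuples,
    with response and relevance both in {0, 1}.  Laplace smoothing with
    `alpha` is an assumption for illustration; the slides do not specify it.
    """
    prior = defaultdict(lambda: [alpha, alpha])                    # topic -> [count(r=0), count(r=1)]
    resp = defaultdict(lambda: [[alpha, alpha], [alpha, alpha]])   # topic -> [r][w] response counts
    seen = set()
    for topic, doc_id, w, r in labeled:
        if (topic, doc_id) not in seen:        # count each pair's relevance once for the prior
            prior[topic][r] += 1
            seen.add((topic, doc_id))
        resp[topic][r][w] += 1
    return prior, resp

def consensus_nb_topic(topic, responses, prior, resp):
    """Posterior probability that the document is relevant given its worker responses."""
    scores = []
    for r in (0, 1):
        p = prior[topic][r] / sum(prior[topic])
        for w in responses:
            p *= resp[topic][r][w] / sum(resp[topic][r])
        scores.append(p)
    return scores[1] / (scores[0] + scores[1])

# Tiny example with hypothetical data: topic "t1", two documents, three workers each.
train = [("t1", "d1", 1, 1), ("t1", "d1", 1, 1), ("t1", "d1", 0, 1),
         ("t1", "d2", 0, 0), ("t1", "d2", 0, 0), ("t1", "d2", 1, 0)]
prior, resp = train_nb_topic(train)
print(consensus_nb_topic("t1", [1, 1, 0], prior, resp))  # > 0.5: leans relevant
```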

Worker and Relevance Conditional Independence
Assume responses conditionally independent given worker and relevance.
– Implies each worker has an individual accuracy, comparable across topics.
Assume a topic-independent prior on relevance.
Referred to as nB Worker.
Model components: the probability of relevance across all topics (the prior), and the probability of this worker's response given relevance (across all topics).
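Analogously, the worker-conditional variant uses a worker-specific response model, where $a_i$ denotes the identity of the worker who gave response $w_i$ (again our notation):

```latex
P(r \mid w_1, \dots, w_n) \;\propto\; P(r) \prod_{i=1}^{n} P(w_i \mid r, a_i)
```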

Evaluation
Which label
– Gold: evaluate using the expert assessor's label as truth.
– Consensus: evaluate using the consensus of participants' responses as truth.
– Other Participant: evaluate using a particular participant's responses as truth.
Methodology
– Use the development validation set as a test set to decide which method to submit.
– Split the development training set 80/20 into train/validation by topic-docID pair (i.e., for a given topic, all responses for a docID were completely in or out of the validation set).
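A minimal sketch of the split described above, keyed on the (topic, docID) pair so that all responses for a pair land entirely on one side; the field names and the fixed seed are assumptions for illustration.

```python
import random

def split_by_pair(responses, valid_frac=0.2, seed=0):
    """80/20 train/validation split over (topic, doc_id) pairs, not over individual responses."""
    pairs = sorted({(r["topic"], r["doc_id"]) for r in responses})
    rng = random.Random(seed)
    rng.shuffle(pairs)
    n_valid = int(round(valid_frac * len(pairs)))
    valid_pairs = set(pairs[:n_valid])
    train = [r for r in responses if (r["topic"], r["doc_id"]) not in valid_pairs]
    valid = [r for r in responses if (r["topic"], r["doc_id"]) in valid_pairs]
    return train, valid
```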

Development Set

Model           DefaultAcc   Precision   Recall    Specificity
Majority Vote   82.8%        85.6%       84.2%     32.0%
naïve Bayes     -            -           100.0%    0.0%
nB Topic        82.8%        86.5%       95.8%     28.0%
nB Worker       82.8%        83.0%       97.5%     4.0%

The skew and scarcity of the development set made model selection challenging. We chose nB Topic since it was the only method that outperformed the baseline (predicting the most common class).

Results

Team            Accuracy   Soft Acc   Recall   Precision   Specificity   Acc Rank   Soft Acc Rank
MSRC            69.3%      64.0%      79.0%    66.2%       59.6%         3          6
uogTr           36.7%      44.1%      13.6%    25.3%       59.8%         10         -
LingPipe        67.6%      66.2%      76.2%    65.0%       59.0%         5          4
GeAnn           60.7%      57.7%      88.4%    56.9%       33.0%         7          8
UWaterlooMDS    69.4%      67.4%      80.2%    66.0%       58.6%         2          3
uc3m            69.9%      -          75.4%    67.9%       64.4%         1          1
BUPT-WILDCAT    68.5%      -          78.6%    65.4%       58.4%         4          2
TUD_DMIR        66.2%      -          76.4%    63.5%       56.0%         6          5
UTaustin        60.4%      -          90.8%    56.5%       30.0%         8          7
qirdcsuog       52.9%      -          82.4%    51.8%       23.4%         9          9

Methods that report probabilities did better on the probability measures in almost all cases and almost always improved when thresholded at the decision-theoretic threshold. The outlier's performance on log loss, compared with its accuracy after conversion, implies it is poorly calibrated with respect to the decision threshold but is likely good overall. Our method was best on the probability measures and near the top in general.

Conclusions
The simple topic- and relevance-conditional independence model produces:
– Best performance on the probability measures on the gold set.
– Nearly best performance on accuracy.
Topic-level effects explain the majority of the variability in judgments (on this data and over the set of submissions).
Future:
– Worker-relevance conditional model on the test set.
– Worker-topic-relevance conditional independence model.
– Method performance versus the best/median individual worker (is there sufficient data to evaluate?).

Thoughts for Future Crowdsourcing Tracks
Is consensus independent of elicitation?
– Can consensus be studied independently of the design for collecting worker responses?
– Probably okay if the development and test sets are collected with the same methodology.
Collection-design factors likely worth analyzing:
– Number of gold-standard labels in the "training set" per topic.
– Number of labels per worker.
– Number of labels per item.
– Number of worker responses on observed items.
– Stability of the topic-conditional prior on relevance.

Questions?