Christopher Harris
Informatics Program, The University of Iowa
Workshop on Crowdsourcing for Search and Data Mining (CSDM 2011), Hong Kong, Feb. 9, 2011
Background & motivation
Experimental design
Results
Conclusions & feedback
Future extensions
Technology gains are not universal
–Repetitive, subjective tasks are difficult to automate
Example: HR resume screening
–Large number of submissions
–Recall is important, but precision matters too
–Semantic advances help, but are not the total solution
Objective: reduce a pile of hundreds of resumes to a list of those deserving further consideration
–Cost
–Time
–Correctness
Good use of crowdsourcing?
Can a high-recall task, such as resume screening, be crowdsourced effectively?
What role do positive and negative incentives play in the accuracy of ratings?
Do workers take more time to complete HITs when accuracy is being evaluated?
Set up collections of HITs (Human Intelligence Tasks) on Amazon Mechanical Turk
–Initial screen for English comprehension
–Screen participants for attention to detail on the job description (free-text entry)
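Mechanical Turk exposes qualification tests and HIT creation through its requester API. Below is a minimal sketch of how such a screened batch could be wired up with today's boto3 MTurk client (which did not exist in 2011); the XML files, pass mark, assignment count, and sandbox endpoint are illustrative assumptions, not the configuration used in this study.

```python
# Sketch only: assumes boto3 credentials and MTurk requester access; all literals are illustrative.
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",  # sandbox for testing
)

# Qualification test standing in for the English-comprehension screen.
qual = mturk.create_qualification_type(
    Name="English comprehension screen (demo)",
    Description="Short reading-comprehension check before rating resumes.",
    QualificationTypeStatus="Active",
    Test=open("english_test.xml").read(),        # QuestionForm XML (not shown)
    AnswerKey=open("english_answer_key.xml").read(),
    TestDurationInSeconds=600,
)

# One HIT per (job description, resume) pair, restricted to workers who passed the screen.
hit = mturk.create_hit(
    Title="Rate how well a resume matches a job description (1-5)",
    Description="Read the job description and resume, then rate the fit.",
    Keywords="resume, screening, rating",
    Reward="0.06",                               # base pay per HIT, as in the slides
    MaxAssignments=5,                            # independent raters per resume (assumed)
    AssignmentDurationInSeconds=1800,
    LifetimeInSeconds=3 * 24 * 3600,
    Question=open("rating_question.xml").read(), # QuestionForm with the 1-5 rating (not shown)
    QualificationRequirements=[{
        "QualificationTypeId": qual["QualificationType"]["QualificationTypeId"],
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [80],                   # assumed passing score on the screen
        "ActionsGuarded": "Accept",
    }],
)
print(hit["HIT"]["HITId"])
```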
Start with 3 job positions
–Each position with 16 applicants
–Pay is $0.06 per HIT
–Rate resume-job fit on a scale of 1 (bad match) to 5 (excellent match)
–Compare to Gold Standard rating
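For scale, these parameters imply 3 × 16 = 48 rating HITs per experimental condition. A quick back-of-the-envelope (assuming a single rating per HIT and ignoring Amazon's requester fees, which are not specified here):

```python
# Back-of-the-envelope batch size and cost for one experimental condition.
positions = 3
resumes_per_position = 16
base_pay = 0.06                          # dollars per HIT

hits = positions * resumes_per_position  # 48 rating HITs
cost_per_pass = hits * base_pay          # $2.88 for one complete pass by one worker
print(f"{hits} HITs, ${cost_per_pass:.2f} per complete pass by one worker")
```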
Same 3 job positions
–Same number of applicants (16) per position & base pay
–Rated application fit on the same 1-5 scale
–Compare to Gold Standard (GS) rating
If the rating matches the GS, double pay for that HIT (1-in-5 chance if random)
If no match, still get standard pay for that HIT
Same 3 job positions
–Again, same number of applicants per position & base pay
–Rated application fit on the same 1-5 scale
–Compare to Gold Standard rating
No positive incentive: if the rating matches our GS, standard pay for that HIT, BUT…
If more than 50% of a worker's ratings don't match, Turkers are paid only $0.03 per HIT for all incorrect answers!
Same 3 job positions
–Again, same number of applicants per position & base pay
–Rated application fit on the same 1-5 scale
–Compare to Gold Standard rating
If the rating matches our GS, double pay for that HIT
If not, still get standard pay for that HIT, BUT…
If more than 50% of a worker's ratings don't match, Turkers are paid only $0.03 per HIT for all incorrect answers!
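Read as code, the baseline and the three incentive rules above might look like the sketch below. This is my reading of the slides, not the authors' payment code; in particular, it assumes the >50% penalty reduces pay only for the incorrect answers, while correct answers keep their base or doubled pay.

```python
# Per-worker payout under the four incentive models (sketch of the rules as stated).
BASE, DOUBLE, PENALTY = 0.06, 0.12, 0.03

def payout(ratings, gold, model):
    """ratings/gold: parallel lists of 1-5 scores; model: 'baseline'|'pos'|'neg'|'posneg'."""
    matches = [r == g for r, g in zip(ratings, gold)]
    error_rate = 1 - sum(matches) / len(matches)
    penalized = model in ("neg", "posneg") and error_rate > 0.5  # >50% of ratings wrong

    total = 0.0
    for hit_correct in matches:
        if hit_correct:
            total += DOUBLE if model in ("pos", "posneg") else BASE
        else:
            total += PENALTY if penalized else BASE
    return round(total, 2)

# Example: 16 HITs, 6 correct, under the combined pos/neg model.
gold = [3, 5, 2, 4, 1, 3, 4, 5, 2, 3, 4, 1, 5, 2, 3, 4]
work = [3, 5, 2, 4, 1, 3, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1]
print(payout(work, gold, "posneg"))   # 6 * 0.12 + 10 * 0.03 = 1.02
```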
Same 3 job positions
–Again, same number of applicants per position & base pay
–Rated fit on a binary scale (Relevant/Non-relevant)
–Compare to Gold Standard rating (GS rated 4 or 5 = Relevant; GS rated 1-3 = Not Relevant)
Same incentive models apply as in Exps 1-3:
–Baseline: no incentive
–Exp 4: positive incentive
–Exp 5: negative incentive
–Exp 6: positive/negative incentive
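The 4/5 cut-off used to binarize the gold standard for Exps 4-6, expressed as a small helper (illustrative only):

```python
# Collapse the 1-5 gold-standard scale to binary relevance, as described above.
def to_binary(score: int) -> str:
    return "Relevant" if score >= 4 else "Not Relevant"   # GS 4-5 -> Relevant, 1-3 -> Not Relevant

assert to_binary(5) == "Relevant" and to_binary(3) == "Not Relevant"
```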
[Chart: rating distributions by incentive model — pos incentive skewed right; no incentive has the largest; neg incentive has the smallest]
Attention to Detail Checks
[Confusion matrices: worker accept/reject vs. expert accept/reject, by incentive model]
–Baseline: worker-accept row — expert accept 8, expert reject 16 (total 24); reject row not recovered
–Pos: worker-accept row — expert accept 14, expert reject 7 (total 21); reject row not recovered
–Neg: values not recovered
–Pos/Neg: worker-accept row — expert accept 14, expert reject 4 (total 18); reject row not recovered
[Table: Precision, Recall, and F-score for the Baseline, Pos, Neg, and Pos/Neg incentive models; values not recovered]
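For reference, the precision, recall, and F-score in that table follow from each worker-vs-expert confusion matrix in the standard way. The counts below are placeholders, not the study's values:

```python
# Standard precision/recall/F-score from a 2x2 accept/reject confusion matrix.
def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# tp: both worker and expert accept; fp: worker accepts, expert rejects;
# fn: worker rejects, expert accepts. Placeholder counts, not the study's numbers.
p, r, f = prf(tp=10, fp=4, fn=2)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```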
Incentives play a role in crowdsourcing performance
–More time taken
–More correct answers
–Answer skewness
Better suited to recall-oriented tasks than precision-oriented tasks
Anonymizing the dataset takes time
How long can “fear of the oracle” persist?
Can we get reasonably good results with few participants?
Might cultural and group preferences differ from those of HR screeners?
–Can more training help offset this?
“A lot of work for the pay”
“A lot of scrolling involved, which got tiring”
“Task had a clear purpose”
“Wished for faster feedback on [incentive matching]”
Examine pairwise preference models
Expand on incentive models
Limit noisy data
Compare with machine learning methods
Examine incentive models in GWAP (games with a purpose)