
1 Bringing the crowdsourcing revolution to research in communication disorders
Tara McAllister Byun, PhD, CCC-SLP Suzanne M. Adlof, PhD Michelle W. Moore, PhD, CCC-SLP 2014 ASHA Convention Orlando, Florida

2 Disclosure The individuals presenting this information are involved in recruiting individuals to complete tasks through AMT or other online platforms. This session may focus on one specific approach, with limited coverage of other alternative approaches. Portions of the research were supported by funding from IES. No other conflicts to disclose.

3 What is crowdsourcing? Traditional method: Assign a task (rating, analysis) to a small number of specially trained individuals. Crowdsourcing: Assign the same task to a large number of non-experts, typically recruited online. Taken individually, experts outperform non-specialists. In the aggregate, crowdsourcing has been successful in solving remarkably complex problems. Foldit: Non-experts playing an online game solved a problem in protein structure modeling that had eluded scientists (Khatib et al., 2011).

4 What is Amazon’s Mechanical Turk?
Amazon’s crowdsourcing platform: requesters electronically post human intelligence tasks (HITs), and members of the AMT worker community sign up to complete HITs for payment. What are HITs? Simple, repetitive microtasks; things that humans do better than computers (for now).

5 Why do they call it Mechanical Turk?

6 Why do they call it Mechanical Turk?
“The man inside the machine”: the name comes from the 18th-century chess-playing “automaton” that secretly concealed a human operator. The requester sees only the computer interface, as if the task were automated. Hence Amazon’s tagline: “artificial artificial intelligence.”

7 Using AMT in research In the past, AMT was used primarily for commercial purposes. There has been a recent surge of interest in AMT as a vast, inexpensive participant pool for behavioral research: psychology (e.g. Goodman, Cryder, & Cheema, 2012; Paolacci, Chandler, & Ipeirotis, 2010), linguistics (e.g. Sprouse, 2011; Gibson, Piantadosi, & Fedorenko, 2011), and communication sciences and disorders (McAllister Byun, Halpin, & Szeredi, under review). Published studies suggest crowdsourced data are broadly comparable to results collected from typical laboratory samples.

8 Benefits for use in research
Ease of access to participant pool: get away from the overused college student population. Inexpensive: AMT workers choose whether or not to complete a given task; Crump, McDonnell, & Gureckis (2013) found participants willing to complete a minute study for only $0.75, but it is important not to be an exploitative requester. Speed of data collection is “revolutionary” (Crump et al., 2013): Sprouse (2011) reports that a task requiring 88 experimenter hours in the laboratory setting was replicated on AMT in two hours.

9 Points to consider Workers may be less attentive than in lab-based studies (faster clicking → more $), but requesters can screen workers and decline to pay for poor performance. There is less control over the experimental environment (sound volume, processor speed, background noise level, etc.). Researchers recognize that there is more noise in crowdsourced data than in lab-collected data; the idea is to offset this noise by collecting data from a larger n of listeners (Ipeirotis et al., 2013), as the sketch below illustrates.
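To make the offset-by-n idea concrete, here is a minimal sketch (ours, not from the talk): with binary ratings, aggregating by majority vote lets independent rater noise wash out as n grows. The ratings below are made up.

```python
from collections import Counter

def majority_vote(ratings):
    """Return the modal rating across a crowd of raters for one item."""
    return Counter(ratings).most_common(1)[0][0]

# Hypothetical binary accuracy ratings (1 = correct, 0 = incorrect)
# from twelve AMT workers listening to the same speech sample:
crowd_ratings = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1]

print(majority_vote(crowd_ratings))  # -> 1; individual misclicks are outvoted
```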

10 Getting started: Basics of navigating AMT
To post a job… Create an account on mturk.com. Obtain IRB approval (if necessary).

11 Getting started: Basics of navigating AMT
Create a task Title and description

12 Getting started: AMT basics
Create a task: compensation offered, number of assignments, time allotted per HIT. Click “Advanced” to set preferences for workers: percent of worker’s previous HITs that were accepted by requesters; worker location (IP address). The same settings can also be scripted, as sketched below.
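The slides walk through the web interface; as a supplement, here is a hedged sketch of the same configuration using the boto3 MTurk client (an API that postdates this 2014 talk). The title, reward, and thresholds are illustrative placeholders; the two QualificationTypeId values are AMT's built-in approval-rate and locale qualifications.

```python
import boto3

# Sandbox endpoint: test HITs without paying real workers.
mturk = boto3.client(
    "mturk",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

question_xml = open("question.xml").read()  # task layout; see the ExternalQuestion sketch below

response = mturk.create_hit(
    Title="Rate short audio clips",           # placeholder title/description
    Description="Listen to a recorded word and judge its accuracy.",
    Keywords="audio, speech, rating",
    Reward="0.05",                            # compensation offered, in USD (a string)
    MaxAssignments=10,                        # number of assignments (workers per item)
    AssignmentDurationInSeconds=600,          # time allotted per HIT
    LifetimeInSeconds=7 * 24 * 3600,          # how long the HIT stays posted
    Question=question_xml,
    QualificationRequirements=[
        {   # built-in qualification: % of worker's previous HITs accepted
            "QualificationTypeId": "000000000000000000L0",
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [95],
        },
        {   # built-in qualification: worker location
            "QualificationTypeId": "00000000000000000071",
            "Comparator": "EqualTo",
            "LocaleValues": [{"Country": "US"}],
        },
    ],
)
print(response["HIT"]["HITId"])
```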

13 Getting started: Basics of navigating AMT
To collect data… Internal HIT: build a task or survey with AMT’s standard interface. External HIT: link to a task hosted on another website (see the ExternalQuestion sketch below).
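For an external HIT, the Question parameter passed to create_hit is an ExternalQuestion XML document that points MTurk at your own server. A minimal sketch, with a placeholder URL:

```python
# The ExternalQuestion schema is part of the MTurk API; the URL is a placeholder.
question_xml = """<ExternalQuestion
  xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.org/my-rating-task</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>"""
```

The externally hosted page is responsible for recording responses itself, and it must submit the worker's assignmentId back to MTurk so the HIT registers as complete.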

14 Getting started: Basics of navigating AMT
To verify validity/reliability of data… review HIT completions; approving and rejecting submissions can also be scripted, as sketched below.
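A hedged boto3 sketch of automated review; `looks_valid` is a hypothetical stand-in for whatever project-specific quality screen you apply before paying.

```python
import boto3

mturk = boto3.client("mturk")  # production endpoint by default

def looks_valid(answer_xml):
    """Hypothetical quality screen, e.g. reject empty responses."""
    return "<FreeText></FreeText>" not in answer_xml

result = mturk.list_assignments_for_hit(
    HITId="HIT_ID_HERE", AssignmentStatuses=["Submitted"]
)
for assignment in result["Assignments"]:
    if looks_valid(assignment["Answer"]):
        mturk.approve_assignment(AssignmentId=assignment["AssignmentId"])
    else:
        mturk.reject_assignment(
            AssignmentId=assignment["AssignmentId"],
            RequesterFeedback="Response did not pass quality checks.",
        )
```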

15 Crowdsourcing for CSD research
Case study 1: Stimulus development Michelle Moore

16 Case Study 1: Stimulus Development
Investigating Prosodic Influences on Word Class-Specific Deficits in Aphasia. Background: noun–verb dissociation in expressive aphasia; lexical stress dissociation in expressive aphasia; a grammatical class–lexical stress confound: 90% of disyllabic nouns have primary stress on the first syllable, compared to 67% of verbs (Howard & Smith, 2002). Study purpose: to investigate the influence of lexical stress on word class-specific deficits in expressive aphasia. Experimental tasks: single word reading, sentence completion.

17 Case Study 1: Stimulus Development
Sentence completion task, adapted from Berndt et al. (2002): auditory presentation, fill-in-the-blank at the end of a sentence with a single word. Target words: four categories (10 of each): N1, N2, V1, V2; all 40 randomly presented in each block. Examples: Last week, I went for a ride in a hot air _____ . (noun, N2: balloon) The volcano looks active, like it is going to _____ . (verb, V2: erupt)

18 Case Study 1: Stimulus Development
Target words, controlling for part of speech, stress placement, length, concreteness, imageability, frequency, etc. Goal: a set number of items per condition, depending on how many items we could create good sentences for; 57 items were submitted to AMT.

19 Case Study 1: Stimulus Development

20 Case Study 1: Stimulus Development
Assignments approved: 168/169 submitted (‘approval’ here was less stringent than the research study’s inclusion parameters). Assignments rejected: 1/169 submitted (target: antique; “The item is an antique of great _________.”). Data were collected over 4 days.

21 Case Study 1: Stimulus Development
Benefits of using AMT: time, cost, and fresh, unbiased eyes. Considerations in using AMT: specifying some, but not all, control parameters for stimuli; AMT workers’ perceptions. Message from a Turker: “Welcome to Mechanical Turk. Your HITS are a bit difficult for 5 cents since they have to be fill in the blank sentences that are not ambiguous. I hope that you will be somewhat lenient as I am not sure if I am finishing these to your standards. Some of these I had to go a little bit out of the way to make them work. Anyway, I will be leaving reviews on Turkopticon once you approve.”

22 Crowdsourcing for CSD research
Case study 2: Obtaining speech ratings Tara McAllister Byun

23 Challenges of obtaining speech ratings
A large proportion of speech research, particularly on interventions for speech disorders, involves collecting blinded listeners’ ratings of speech accuracy or intelligibility. It is a multistep process: identify potential raters; provide training and/or administer an eligibility test; collect ratings; compare raters against each other to establish reliability. It can be lengthy, frustrating, and expensive.

24 Questions about AMT for speech research
IRB issues? Must consider the rights of patients/participants whose speech samples will be shared for rating, as well as those of AMT workers acting as raters. There is no control over playback volume, headphone quality, or background noise. Listeners are nonexperts, but previous research suggests that with enough raters, crowdsourced responses will converge with experts’. This study: what is the level of agreement between crowdsourced ratings of speech and ratings obtained from more experienced listeners?

25 Protocol Stimuli: 100 /r/ words collected from 15 children with /r/ misarticulation over the course of treatment; roughly half were rated correct based on the mode across 3 SLP listeners. External HIT developed and hosted on Experigen (Becker & Levine, 2010). Training: 20 items with feedback. Task: 100 WAV files in random order.

26 Raters
Trained listeners: 26 listeners, self-reported native speakers of American English, recruited through listservs, social media, and conference announcements. All had previous training in CSD; 21/26 reported an MS or higher. Entered in a drawing for a $25 gift card. Responses were collected over 3 months. 1 listener failed to pass quality control measures; final n = 25.
AMT: 203 listeners, US IP addresses, self-reported native speakers of American English. Received $0.75 for the 100-word sample. Ratings were completed in 23 hours. 50 listeners were discarded for failure to pass attentional catch trials; final n = 153.

27 Results Strong correlation between the % of experienced listeners and the % of AMT raters scoring a given item as correct (r = .98). The mode across raters in a group differed for only 7 items. Both groups have poor agreement for some items. AMT listeners were slightly more lenient than experienced listeners. (A sketch of these comparisons follows.)
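A minimal sketch of the two comparisons just described, using made-up per-item data: correlate the proportion of raters scoring each item correct across the two groups, then count items where the group modes disagree. Item names and ratings are hypothetical.

```python
from statistics import mean
from scipy.stats import pearsonr

# Hypothetical binary ratings (1 = correct) per item, for each listener group.
experienced = {"word01": [1, 1, 0], "word02": [0, 0, 0], "word03": [1, 1, 1]}
amt = {"word01": [1, 0, 1, 1, 1], "word02": [0, 1, 0, 0, 0], "word03": [1, 1, 1, 1, 1]}

items = sorted(experienced)
p_exp = [mean(experienced[i]) for i in items]  # proportion rating each item correct
p_amt = [mean(amt[i]) for i in items]

r, p = pearsonr(p_exp, p_amt)

def mode_binary(ratings):
    """Majority vote over binary ratings."""
    return int(mean(ratings) >= 0.5)

differing = sum(mode_binary(experienced[i]) != mode_binary(amt[i]) for i in items)
print(f"r = {r:.2f}; modes differ on {differing} of {len(items)} items")
```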

28 Conclusions In a binary rating task, the mode across a large group of AMT listeners yielded the same response as the mode across a smaller group of experienced listeners for 93/100 items. It is possible that untrained listeners' judgments are more naturalistic and functional than trained listeners'. We advocate further evaluation and awareness of crowdsourcing for speech data rating.

29 Crowdsourcing for CSD research
Case study 3: Obtaining ratings of sentence contexts for vocabulary instruction Suzanne Adlof

30 Goals To develop a web-based platform that provides individualized, effective vocabulary instruction to high school students. Individualized based on the content and pace of study. Instruction includes dictionary definitions and real-world contexts. The aim is to be able to teach any word in the English language, beginning with a seed corpus of 1000 target words and >70,000 contexts; we want ≥ 20 good contexts per word. This research is supported by a grant from the Institute of Education Sciences: R305A (Adlof, PI).

31 Which contexts are most “nutritious” for vocabulary instruction?
The initial corpus of texts was randomly retrieved in mass quantities from the Internet, so the quality of retrieved contexts is highly variable. Example contexts for the target word “guile”:
(1) There are some people, like Nathanael, who truly have no guile. They are very transparent and open. They accept people at face value and, since they have no guile themselves, are bewildered when they are faced with wickedness and deceit in others. But, truly guileless people are rare. They are both refreshing and frustrating at the same time.
(2) Show me the dirtpile and I will pray that the soul can take three stowaways" confuses me. What are the three stowaways? One of them could be him - like he wants to go with her, but what are the other two? Also, why does she vanish with no guile? Why would she vanish with guile?
(3) guile is the program, the -c switch instructs guile to evaluate the statement after the switch (similar to the -e switch for perl). The use-modules directive will ask guile to load the slib module in the ice-9 directory. After the use-modules statement is evaluated, it will proceed to call functions available through Slib, namely require and printf.

32 Challenges of obtaining ratings
Scale of task: 70,000 contexts is a lot! We need multiple ratings of each context to ensure reliability. Traditional lab setup: 50 undergrad students rate 100 contexts per day for $8.00 each, i.e., 140 consecutive days and $56,000 to get 10 ratings of each context! AMT setup: AMT workers each rate 5 contexts at a time, for a few cents. Speed of acquisition depends on many factors, but the primary factor is building up a qualified worker pool. (The arithmetic is checked in the sketch below.)
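The lab-versus-AMT arithmetic on this slide checks out; a quick sketch reproducing the 140-day, $56,000 figures.

```python
contexts = 70_000
ratings_per_context = 10
total_ratings = contexts * ratings_per_context     # 700,000 ratings needed

# Traditional lab setup: 50 undergrads, 100 ratings per day each, $8.00 per session.
raters, ratings_per_day, pay_per_day = 50, 100, 8.00
days = total_ratings / (raters * ratings_per_day)  # -> 140.0 consecutive days
cost = days * raters * pay_per_day                 # -> $56,000.00

print(f"{days:.0f} days, ${cost:,.0f}")            # 140 days, $56,000
```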

33 Step 1: Qualification Test
10 questions Pays $0.12 80% accuracy required to receive qualification for future ratings
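MTurk can administer and score a test like this automatically if the qualification type is created with an answer key. A hedged boto3 sketch; the qualification name, the two XML files, and the 1800-second test duration are placeholders.

```python
import boto3

mturk = boto3.client("mturk")

qual = mturk.create_qualification_type(
    Name="Context-rating qualification",         # placeholder name
    Description="10-question screening test; 80% accuracy required.",
    QualificationTypeStatus="Active",
    Test=open("qualification_test.xml").read(),  # QuestionForm XML (placeholder file)
    AnswerKey=open("answer_key.xml").read(),     # with an answer key, MTurk scores the test itself
    TestDurationInSeconds=1800,
)
print(qual["QualificationType"]["QualificationTypeId"])
```

The 80% cutoff is then enforced by attaching this qualification to future rating HITs with a GreaterThanOrEqualTo comparator on the test score (IntegerValues=[80]), in the same style as the create_hit sketch earlier.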


35 2. Building a Pool of Qualified Workers
Began posting QTs and HITs in August 2013. Also advertised on listservs to recruit a larger pool of workers interested in language and word learning. 2317 AMT workers have taken the QT; 947 (41%) of workers qualified.

36 2. Worker Retention We have posted >11,000 HITs for 947 qualified workers (soliciting 10 ratings each for >55,000 contexts). But only 737 (78%) of qualified workers ever rated contexts after the QT, and only 491 (52%) ever completed more than 1 context-rating HIT after the QT. In fact, 75% of all context ratings have come from just 27 “very high productivity” raters (<3% of qualified raters), each of whom has rated >1,000 contexts. Most of these raters have been with us since at least April; others have had less time to rate contexts.

37 3. Reliability and Validity Checks
93 contexts were each rated by an expert and 10 AMT raters, with 176 AMT raters represented across contexts. The AMT average rating correlates with the expert rating at r = .71, p < .001.

38 3. Reliability and Validity Checks
Spot checking suggests ratings are generally valid. Average AMT rating (SD), followed by the context for the target word “collusion”:
1.5 (.53): In his discussion of this issue in the context of the fallout from California's recent attempt at electricity deregulation, Dr. Rapp notes that claims of collusion must be reconciled with the specific market facts and regulatory rules that affect suppliers' bidding behavior and capacity decisions. This is not always easy.
2.0 (.67): …
3.0 (.94): We provide a collusive framework with heterogeneity among firms, investment, entry, and exit. It is a symmetric-information model in which it is hard to sustain collusion when there is an active firm that is likely to exit in the near future. Numerical analysis is used to compare a collusive to a noncollusive environment.
3.7 (.48): Some poker players think that by sharing information with their friends on Party Poker, they can gain an advantage and cheat their opponents. This is known as poker collusion, two or more players will use a chatroom, instant messages or even the telephone to tell their friends what cards they have.

39 5. What we learned along the way
Importance of clear instructions. Importance of “customer service,” e.g., fast payment and good communication (TurkOpticon reviews). Include quality control measures: inter-rater agreement, premier rater training, attention checks (a sketch of automated attention checks follows).
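Attention checks, the last item above, are straightforward to automate; a minimal sketch with hypothetical catch-trial data, dropping workers who fall below a threshold.

```python
# Hypothetical catch trials with known correct answers, and worker responses.
catch_answers = {"catch1": 3, "catch2": 1, "catch3": 4}
worker_responses = {
    "workerA": {"catch1": 3, "catch2": 1, "catch3": 4},
    "workerB": {"catch1": 2, "catch2": 1, "catch3": 2},
}

def passes_attention_checks(responses, required=0.8):
    """Keep a worker only if they answer enough catch trials correctly."""
    hits = sum(responses.get(k) == v for k, v in catch_answers.items())
    return hits / len(catch_answers) >= required

kept = [w for w, r in worker_responses.items() if passes_attention_checks(r)]
print(kept)  # -> ['workerA']; workerB's ratings are discarded
```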

40 6. Next Steps Students generate their own contexts as part of the instructional program. Students rate contexts as part of the instructional program. Machine learning for automated ratings: “authentic artificial intelligence.”

41 Questions? Interested in trying AMT?

