The Reliability of Crowdsourcing: Latent Trait Modeling with Mechanical Turk
Matt Baucum, Steven V. Rouse, Cindy Miller-Perrin, and Elizabeth Mancuso
Pepperdine University

Abstract

Mechanical Turk, an online crowdsourcing platform, has received increased attention among psychologists as a potentially reliable source of experimental data. Given the ease with which participants can be recruited quickly and inexpensively, it is worth examining whether Mechanical Turk can provide accurate data for analyses that require large samples. One such analysis is Item Response Theory, a psychometric paradigm that defines test items by a mathematical relationship between a respondent's ability and the probability of item endorsement. To test whether Mechanical Turk can serve as a reliable source of data for Item Response Theory modeling, researchers administered a verbal reasoning scale to Mechanical Turk workers and compared the resulting Item Response Theory model to that of an existing normative sample. Although the Item Characteristic Curves differed significantly, the two models showed high agreement on the fit of participants' response patterns and on participant ability estimates. These findings lend support to the use of Mechanical Turk for research purposes and suggest its use for quick, inexpensive Item Response Theory modeling.

Literature

Mechanical Turk (MTurk) is an online marketplace where workers complete small tasks in exchange for payment. It has recently seen increased use as a research tool among psychologists, since it provides access to a large, diverse workforce (from over 190 countries) at very low cost (Paolacci & Chandler, 2014). Some studies have focused specifically on using MTurk for assessment purposes and have generally found such assessment data to be reliable compared with traditional sampling methods (Buhrmester, Kwang, & Gosling, 2011; Rouse, 2015). One psychometric paradigm that has not yet received attention in the MTurk literature is Item Response Theory (IRT). IRT assumes that each examinee has a latent ability level (denoted θ and standardized to a z-scale) that affects the probability of correctly endorsing each item, a relationship modeled by each test question's Item Characteristic Curve (ICC; Ayala, 2008). Because IRT models require large sample sizes, verifying MTurk as a viable participant pool could facilitate data collection for test developers. This study administered a verbal reasoning scale to an MTurk sample and compared the resulting IRT model to the model derived from a normative sample. Researchers hypothesized that the models would not differ statistically.
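To make the ICC relationship concrete, the sketch below computes a two-parameter logistic (2PL) curve, one common ICC form; the poster does not state which IRT model was fit, and the discrimination (a) and difficulty (b) values here are illustrative placeholders rather than estimates from the ICAR items.

    # Illustrative 2PL Item Characteristic Curve (assumed form, not the study's
    # estimated parameters): the probability of a correct response rises with
    # latent ability theta.
    import numpy as np

    def icc_2pl(theta, a, b):
        """Probability of answering an item correctly at standardized ability theta."""
        return 1.0 / (1.0 + np.exp(-a * (theta - b)))

    theta = np.linspace(-3, 3, 61)             # standardized ability (z-scale)
    p_correct = icc_2pl(theta, a=1.2, b=-0.5)  # a = discrimination, b = difficulty

In practice, a and b would be estimated from item response data (for example, with the ltm software cited in the references) rather than fixed by hand.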
Method

Participants: Researchers recruited 500 MTurk workers, 471 of whom passed a validity check, yielding a final sample of 471. Most participants (85%) spoke English as their first language, and most resided in North America (69%) or Asia (21%). Due to a coding error, gender data were available for only 208 participants, of whom 52.9% were male and 47.1% were female.

Measures: The scale used was the 16-item Verbal Reasoning subscale of the International Cognitive Ability Resource (ICAR), an open-source collection of test items designed to measure intelligence (Condon & Revelle, 2014).

Procedures: Researchers created a Human Intelligence Task (HIT; a short task to be completed by workers) consisting of the study description, informed consent, demographic questions, and the ICAR Verbal Reasoning scale. The HIT was made available to any worker who wished to participate, and participants were paid $0.50 for completing the questionnaire.

Analysis: ICCs were constructed for each test item from the MTurk and normative samples, yielding two IRT models. The models were compared using differential item functioning (DIF; to assess differences between the ICCs), person-fit analysis (to identify respondents with improbable response patterns), comparison of participants' score estimates, and comparison of item information values (the IRT analogue of reliability).

Results

Descriptive Statistics: The average test score was 10 out of 16 (SD = 3.03), and the average task completion time was 13.4 minutes (SD = 10.4 minutes). Reliability was consistent with the published estimate (α = .728).

Differential Item Functioning: DIF analysis found that most items behaved differently across the two samples, and 12 of the 16 items exhibited a large DIF effect size. Full results appear in Table 1; examples of ICCs for the MTurk and normative samples appear in Figure 1.

Table 1. DIF analysis using the Mantel-Haenszel procedure

Item   Log odds-ratio   p        ΔMH     Effect size
 1       4.98           <.0001   -1.79   C
 2       4.78           <.0001   -1.49   B
 3      -5.52           <.0001    1.99   C
 4       9.93           <.0001   -4.87   C
 5       5.44           <.0001   -1.86   C
 6       9.30           <.0001   -3.44   C
 7       8.56           <.0001   -3.29   C
 8      -0.30            .7619    0.19   A
 9       8.47           <.0001   -3.27   C
10       9.80           <.0001   -4.25   C
11       8.85           <.0001   -3.53   C
12       5.85           <.0001   -2.05   C
13       8.18           <.0001   -2.82   C
14      10.50           <.0001   -5.04   C
15       1.84            .0659   -0.61   A
16      -0.84            .4011    0.26   A

Note. A = negligible effect size; B = moderate effect size; C = large effect size.

[Figure 1. ICCs for the MTurk (red) and normative (black) IRT models; panels show Items 1-4 and 13-16.]

Test and Item Information: The total test information calculated from the MTurk and normative IRT models was 18.31 and 19.08, respectively, and when plotted, the two test information functions peaked at approximately the same latent trait value, roughly θ = -0.5 (Figure 3). The correlation between the item information values under the two models was .825.

[Figure 3. Test information curves for the normative model (black) and MTurk model (red); x-axis: latent ability level (standardized), y-axis: test information.]
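As a rough illustration of how these information values are obtained (assuming a 2PL parameterization, which the poster does not confirm, and hypothetical item parameters rather than the ICAR estimates), the sketch below sums item information curves into a test information function and locates the ability level at which it peaks:

    # Hedged sketch: 2PL item information is a^2 * P * (1 - P); summing across
    # items gives the test information function. All parameters are made up.
    import numpy as np

    def item_information(theta, a, b):
        p = 1.0 / (1.0 + np.exp(-a * (theta - b)))  # 2PL probability of success
        return a**2 * p * (1 - p)                   # Fisher information for one item

    item_params = [(1.2, -1.0), (0.9, -0.5), (1.5, 0.4)]  # hypothetical (a, b) pairs
    theta = np.linspace(-3, 3, 121)
    test_info = sum(item_information(theta, a, b) for a, b in item_params)
    peak_theta = theta[np.argmax(test_info)]        # ability level of maximum precision

Comparing two calibrations as in this study amounts to computing such curves from each model's item parameters and checking where they peak and how strongly the item-level information values correlate.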
Person-Fit Analysis and Score Comparison: Both IRT models classified each MTurk participant as aberrant (showing an improbable response pattern) or non-aberrant, with 98.3% agreement between the models. Ability scores for the MTurk participants were also estimated under both the MTurk-based and normative-based IRT models; the differences between these two estimates for each participant appear in Figure 2, and the correlation between participants' scores under the two models was .992.

[Figure 2. Score differences for MTurk participants as estimated by the two models; x-axis: θ score difference (standardized), y-axis: frequency.]

Conclusions

Although findings were mixed, the results indicate overall consistency between the models. While the ICCs of 13 items differed significantly, subsequent analyses revealed useful similarities between the two IRT models:
- The high agreement in the person-fit analysis suggests similarity between the IRT models themselves.
- The MTurk-based IRT model was relatively accurate in its score estimates: differences between the models' score estimates were small, and their large correlation indicates high rank-order stability.
- The models showed moderately high agreement on item and test information, suggesting that the MTurk model could identify effective and ineffective items with reasonable accuracy; it also accurately identified the ability level at which the scale was most precise.

This is the first study to investigate IRT modeling with MTurk data. Future studies should continue to establish the reliability of MTurk data for IRT analyses and should investigate the use of MTurk with other classes of IRT models. Still, this study suggests that MTurk may be a reliable source of data even for complex psychometric analyses, and a potentially viable way for researchers with limited resources to conduct IRT analysis.

References

Ayala, R. J. (2008). The theory and practice of item response theory. New York: Guilford.
Buhrmester, M., Kwang, T., & Gosling, S. D. (2011). Amazon's Mechanical Turk: A new source of inexpensive, yet high-quality, data? Perspectives on Psychological Science, 6(1), 3-5. doi:10.1177/1745691610393980
Condon, D. M., & Revelle, W. (2014). The international cognitive ability resource: Development and initial validation of a public-domain measure. Intelligence, 43, 52-64. doi:10.1016/j.intell.2014.01.004
DeMars, C. E. (2011). An analytic comparison of effect sizes for differential item functioning. Applied Measurement in Education, 24(3), 189-209.
Magis, D., Beland, S., & Raiche, G. (2015). difR: Collection of methods to detect dichotomous differential item functioning (DIF). R package version 4.6.
Rizopoulos, D. (2006). ltm: An R package for latent variable modeling and item response theory analyses. Journal of Statistical Software, 17(5), 1-25.
Rouse, S. V. (2015). A reliability analysis of Mechanical Turk data. Computers in Human Behavior, 43, 304-307. doi:10.1016/j.chb.2014.11.004
Zwick, R. (2012). A review of ETS differential item functioning assessment procedures: Flagging rules, minimum sample size requirements, and criterion refinement. ETS Research Report Series, 2012(1), i-30.

