Bringing the crowdsourcing revolution to research in communication disorders Tara McAllister Byun, PhD, CCC-SLP Suzanne M. Adlof, PhD Michelle W. Moore,


Bringing the crowdsourcing revolution to research in communication disorders
Tara McAllister Byun, PhD, CCC-SLP
Suzanne M. Adlof, PhD
Michelle W. Moore, PhD, CCC-SLP
2014 ASHA Convention, Orlando, Florida

Disclosure
The individuals presenting this information are involved in recruiting individuals to complete tasks through Amazon Mechanical Turk (AMT) or other online platforms.
This session may focus on one specific approach, with limited coverage of alternative approaches.
Portions of the research were supported by funding from IES.
No other conflicts to disclose.

Crowdsourcing for CSD research
Case study 2: Obtaining speech ratings
Tara McAllister Byun

Challenges of obtaining speech ratings
A large proportion of speech research, particularly on interventions for speech disorders, involves collecting blinded listeners' ratings of speech accuracy or intelligibility.
Multistep process:
  Identify potential raters
  Provide training and/or administer an eligibility test
  Collect ratings
  Compare raters against each other to establish reliability
The process can be lengthy, frustrating, and expensive.

Questions about AMT for speech research
IRB issues? Must consider the rights of the patients/participants whose speech samples will be shared for rating, as well as those of the AMT workers acting as raters.
Cannot control playback volume, headphone quality, or background noise.
Listeners are nonexpert.
  But previous research suggests that with enough raters, crowdsourced responses will converge with experts'.
This study: What is the level of agreement between crowdsourced ratings of speech and ratings obtained from more experienced listeners?

Protocol
Stimuli: 100 /r/ words collected from 15 children with /r/ misarticulation over the course of treatment
  Roughly half rated correct based on the mode across 3 SLP listeners
External HIT developed and hosted on Experigen (Becker & Levine, 2010)
Training: 20 items with feedback
Task: 100 WAV files in random order

Raters
AMT: 203 listeners, US IP addresses, self-reported native speakers of American English.
  Received $0.75 for the 100-word sample.
  Ratings were completed in 23 hours.
  50 listeners discarded for failure to pass attentional catch trials. Final n = 153.
Trained listeners: 26 listeners, self-reported native speakers of American English.
  Recruited through listservs, social media, and conference announcements.
  All had previous training in CSD: 21/26 reported an MS or higher.
  Entered in a drawing for a $25 gift card.
  Responses collected over 3 months.
  1 listener failed to pass quality control measures; final n = 25.
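The catch-trial screening described above could be sketched as follows. This is a minimal illustration, not the study's actual procedure: the slides do not state the pass criterion, so the `min_pass_rate` threshold, the `screen_raters` function, and the data layout are all hypothetical.

```python
from typing import Dict, List


def screen_raters(
    catch_results: Dict[str, List[bool]],
    min_pass_rate: float = 1.0,  # assumed criterion; the slides do not specify one
) -> List[str]:
    """Return the IDs of raters whose catch-trial pass rate meets the threshold.

    catch_results maps a rater ID to one boolean per catch trial
    (True = the rater answered that catch trial as expected).
    """
    kept = []
    for rater_id, outcomes in catch_results.items():
        if not outcomes:
            # No catch trials recorded: attention cannot be verified, so exclude.
            continue
        pass_rate = sum(outcomes) / len(outcomes)
        if pass_rate >= min_pass_rate:
            kept.append(rater_id)
    return kept
```

Only the raters returned by such a filter would then contribute to the item-level analysis.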

Results
Strong correlation between the percentage of experienced listeners and the percentage of AMT raters scoring a given item as correct (r = .98).
The mode across raters in a group differed between the two groups for only 7 items.
Both groups have poor agreement for some items.
AMT listeners were slightly more lenient than experienced listeners.
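The item-level comparison reported above (per-item percentage rated correct, correlation between groups, and the number of items whose modal rating differs) can be sketched in a few lines. This is an illustrative sketch under assumptions, not the authors' analysis code: the function names, the 0/1 rating matrices, and the tie-breaking rule are all hypothetical.

```python
from math import sqrt
from typing import List, Sequence

# ratings[r][i] is 1 if rater r judged item i correct, otherwise 0 (assumed layout).
Ratings = Sequence[Sequence[int]]


def proportion_correct(ratings: Ratings) -> List[float]:
    """Proportion of raters in one group scoring each item as correct."""
    n_raters = len(ratings)
    n_items = len(ratings[0])
    return [sum(r[i] for r in ratings) / n_raters for i in range(n_items)]


def modal_rating(ratings: Ratings, tie_value: int = 1) -> List[int]:
    """Modal (majority) binary rating per item; tie_value is an assumed tie-break."""
    n_raters = len(ratings)
    modes = []
    for i in range(len(ratings[0])):
        n_correct = sum(r[i] for r in ratings)
        if 2 * n_correct > n_raters:
            modes.append(1)
        elif 2 * n_correct < n_raters:
            modes.append(0)
        else:
            modes.append(tie_value)
    return modes


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def count_mode_disagreements(group_a: Ratings, group_b: Ratings) -> int:
    """Number of items on which the two groups' modal ratings differ."""
    return sum(a != b for a, b in zip(modal_rating(group_a), modal_rating(group_b)))
```

With two rating matrices over the same items, `pearson_r(proportion_correct(amt), proportion_correct(experienced))` gives the item-level correlation, and `count_mode_disagreements(amt, experienced)` gives the number of items on which the group modes differ.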

Conclusions
In a binary rating task, the mode across a large group of AMT listeners yielded the same response as the mode across a smaller group of experienced listeners for 93/100 items.
It is possible that untrained listeners' judgments are more naturalistic and functional than trained listeners'.
We advocate for further evaluation and awareness of crowdsourcing for speech data rating.

Questions? Interested in trying AMT?