Inter-Rater Reliability
March 1, 2013
Emily Phillips Galloway
William Johnston

Accessing Workshop Materials
Go to: isites.harvard.edu/research_technologies
– Click on the Workshops tab (on the left) and then the Inter-Rater Reliability folder (near the bottom)
– Save all of the files to the desktop (right click and 'Save Link As')

Agenda
I. Introducing IRR
II. What is Kappa?
   i. 'By-hand' Example
III. Limitations and Complications of Kappa
IV. Working Through a Complex Example
   i. Data setup
   ii. Estimation
   iii. Interpretation
V. Reporting Results

What is Inter-Rater Reliability?
IRR can be defined as the degree of agreement among raters. Numerous statistics can be calculated to provide a score of how much consensus exists between raters.
Why does it matter in educational research? In IRR we trust: the quality of a coding scheme and the ability to replicate results are connected with the overall 'believability' of the results. To publish results, we must demonstrate that our coding scheme is reliable.

What is Inter-Rater Reliability?
IRR can be defined as the degree of agreement among raters. Numerous statistics can be calculated to provide a score of how much consensus exists between raters.
Why does it matter in language & literacy research? Language data can be challenging to code and can fall prey to subjectivity, given that interlocutors will not always say or write all that may be inferred by scorers.
Our Task: To design coding schemes that are not subjective.

IRR: A Beginning and an End
What does formative IRR/inter-rater agreement tell us during the beginning/design phase of a study?
– If your coding scheme is being developed, calculating IRR can tell you whether your codes are functioning in the same way across raters.
– If you are using an existing coding scheme, calculating IRR can tell you whether your raters need additional training.

IRR: A Beginning
What does formative IRR/inter-rater agreement tell us during the beginning/design phase of a study?
– If your coding scheme is being developed, calculating IRR on 15%-20% of your data can tell you whether your codes are functioning in the same way across raters.
– If there are numerous disagreements, this signals a need to revise your coding scheme (and recode the data), to locate clear examples that help raters understand the codes better, or (when all else fails) to abandon codes that do not function well.
Disagreements are OK at this point!

IRR: A Beginning
What does formative IRR/inter-rater agreement tell us during the beginning/design phase of a study?
– If you are using an existing coding scheme, calculating IRR on 15%-20% of your data can tell you whether your codes are functioning in the same way across raters.
– If there are numerous disagreements, this signals a need to retrain your raters (and recode the data) or to evaluate whether the coding scheme you are working with needs to be revised for your data.
Disagreements are OK at this point!

IRR (Agreement) by Hand!
While kappa statistics give us insight into rater disagreements in the aggregate, a 'confusion matrix' can help us to identify the specific codes on which raters disagree. (Bakeman & Gottman, 1991)

Let's try it
Scenario: Marie and Janet are coding students' definitions for the presence of nominalized words. For each definition, 0 = no nominalizations and 1 = nominalization(s) present. They are at the beginning of scoring the data with a new coding scheme that Janet has developed. How does it seem to be working?
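A minimal sketch of how this check could be run in Stata, using made-up 0/1 codes for ten definitions (the variable names marie and janet and every value below are invented for illustration):

* Hypothetical 0/1 nominalization codes for ten definitions (invented data)
clear
input marie janet
1 1
0 0
1 0
1 1
0 0
0 1
1 1
0 0
1 1
0 0
end
tab marie janet    // confusion matrix: which codes do the raters disagree on?
kap marie janet    // percent agreement and Cohen's kappa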

Are Janet & Marie still friends?

IRR: The Middle?
What does IRR tell us during the middle of a study?
– In the middle of a study, especially if we are coding data over a long period, we may again conduct an IRR analysis (a 'reliability check') to be sure we are still coding the data reliably.
– This involves selecting 20% of the data at random to assess for agreement.

IRR: An End
Why does summative IRR matter during the analysis phase of a study?
– At the end of a study, we calculate IRR to demonstrate to the academic community that our coding scheme functioned reliably.
– Generally, if we have been diligent in developing our coding scheme and training our raters, there are few surprises.

What is Kappa?
Cohen's Kappa (Cohen, 1960) is a numeric summary of agreement that accounts for agreement occurring simply by chance:
kappa = (p_o - p_c) / (1 - p_c)
p_o = proportion of agreement that is actually observed
p_c = proportion of agreement expected by chance
See Bakeman & Gottman (1991) for an excellent example of how p_o, p_c, and Cohen's Kappa are calculated.
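For instance, with hypothetical numbers (not from the workshop data): if the raters agree on 90% of items (p_o = .90) and chance agreement is 50% (p_c = .50), then kappa = (.90 - .50) / (1 - .50) = .80. In Stata this is simple arithmetic:

display (0.90 - 0.50) / (1 - 0.50)    // kappa = .80 for these hypothetical values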

Limitations of Kappa
What if there are more than two possible ratings and the size of the discrepancy between raters matters? – Use weighted Kappa
What if there are more than two raters? – Use Fleiss' Kappa
What if different participants have different numbers of raters? – Use Krippendorff's alpha
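For the multi-rater case, Stata's kap command also accepts more than two rater variables (it computes a Fleiss-style kappa when every subject is rated by the same raters); a sketch with invented variable names:

* rater1-rater3 are hypothetical variables, each holding one rater's codes
kap rater1 rater2 rater3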

A Detailed Example
The sample: 37 students
The data structure: 8 key variables
– 2 different word explanation tasks: "bicycle" and "debate"
– 2 coders: coder1 and coder2
– 2 rating subscales: Superordinate scale (0-5 points) and Syntax scale (0-6 points)
Which statistic should we be using? Weighted Kappa! (p. 66 in B & G)

Estimating IRR in Stata
Start with a simple cross tabulation of coder1_bicycle_superordinate against coder2_bicycle_superordinate.
[Cross-tabulation output not fully reproduced in the transcript; Total = 37.]
What do you notice? Very strong agreement!
Why might this be? Neither reviewer ranked anybody a "1". This has implications for how we do this in Stata…
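The tabulation itself comes from Stata's tabulate command; assuming the shortened variable names used in the commands that follow, the call would look like:

* Cross-tabulate the two coders' superordinate ratings for the bicycle task
tab c1_bicycle_so c2_bicycle_so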

Estimating IRR in Stata
Because of the nature of the ratings, we have to make some changes to the data in order for things to run.
Our weight matrix implies there are 6 possible ratings, but only 5 are used by the raters, so we must use the "absolute" option in Stata.
BUT… in order for this to work we need to change the scale from 0,1,2…5 to 1,2,3…6:
replace c1_bicycle_so = c1_bicycle_so+1
replace c2_bicycle_so = c2_bicycle_so+1
tab c1_bicycle_so c2_bicycle_so
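The user-defined weight matrix s_o_wgt used in the next command is not shown in the transcript. One plausible way to define it, assuming linear weights over the six recoded categories, is Stata's kapwgt command (the actual s_o_wgt from the workshop may differ):

* Hypothetical linear weight matrix for 6 ordered categories (w_ij = 1 - |i-j|/5)
kapwgt s_o_wgt 1 \ .8 1 \ .6 .8 1 \ .4 .6 .8 1 \ .2 .4 .6 .8 1 \ 0 .2 .4 .6 .8 1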

Estimating IRR in Stata
The command:
kap c1_bicycle_so c2_bicycle_so, wgt(s_o_wgt) absolute
The output reports Agreement, Expected Agreement, Kappa, Std. Err., Z, and Prob>Z.
[Only one value, the expected agreement of 53.81%, is legible in the transcript.]

Apply Your Knowledge
For each of the three remaining subscales, repeat the steps (a sketch for one subscale follows below):
– Inspect a cross tabulation to get an idea of the data distributions
– Create a weighting matrix ("w" and "w2" are built-in weight matrices that you can use: w is linear and w2 is quadratic)
– Estimate the Kappa and interpret the results!
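A minimal sketch for one remaining subscale, here the bicycle-task syntax scale with Stata's built-in linear weights; the variable names are assumptions, so adjust them to match the workshop dataset:

* Hypothetical variable names for the syntax subscale
tab c1_bicycle_syn c2_bicycle_syn            // inspect the distributions first
kap c1_bicycle_syn c2_bicycle_syn, wgt(w)    // weighted kappa with built-in linear weights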

Estimating IRR with web resources
If you do not have access to Stata, there are some great web resources you can use as well:

Reporting Results
"Inter-rater reliability, calculated based on double coding of 20% of the tasks, was very high (Agreement = 98%; Cohen's Kappa = .96)." (Kieffer & Lesaux, 2010)

"To calculate inter-rater reliability for the coding scheme developed in the first phase, a research coordinator and a graduate research assistant randomly selected 20 writing samples for each of four tasks from different years, cohorts, and writing ability levels: narratives by pen, the sentence integrity task by pen, essays by keyboard, and the sentence integrity task by keyboard. One rater served as the anchor for computing percent agreement between coders. The inter-rater reliability was generally very good for each coded category. Except for two categories, initial percent agreement ranged from 84.6% to 100%. For the only two categories of low interrater reliability, subordinate and adverbial clauses, additional training and reliability checks improved inter-rater reliability for these to acceptable levels of over 0.80." (Berninger, Nagy, & Scott, 2011)