Funded through the ESRC’s Researcher Development Initiative, Department of Education, University of Oxford. Session 3.3: Inter-rater reliability.


Interrater reliability

Aim of the co-judge procedure, to discern:
- Consistency within coder
- Consistency between coders
- Take care when making inferences based on little information
- Phenomena impossible to code become missing values

Interrater reliability

- Percent agreement: common but not recommended
- Cohen’s kappa coefficient
  - Kappa is the proportion of the optimum improvement over chance attained by the coders: 1 = perfect agreement, 0 = agreement no better than expected by chance, -1 = perfect disagreement
  - Kappas over .40 are considered a moderate level of agreement (but there is no clear basis for this “guideline”)
- Correlation between different raters
- Intraclass correlation: agreement among multiple raters, corrected for the number of raters using the Spearman-Brown formula (r)

Interrater reliability of categorical IV (1)

Percent exact agreement = (number of observations agreed on) / (total number of observations)

Example: a categorical IV with 3 discrete scale steps, coded by two raters for 12 observations; 9 ratings are the same, so % exact agreement = 9/12 = .75.
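A minimal sketch of this computation in Python (not part of the original materials; the twelve codes below are invented so that 9 of the 12 ratings agree):

rater1 = [1, 1, 2, 2, 3, 3, 1, 2, 3, 1, 2, 3]
rater2 = [1, 1, 2, 3, 3, 3, 1, 2, 2, 1, 2, 1]

# Percent exact agreement = number of observations agreed on / total number of observations
agreed = sum(a == b for a, b in zip(rater1, rater2))
print(f"% exact agreement = {agreed}/{len(rater1)} = {agreed / len(rater1):.2f}")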

Interrater reliability of categorical IV (2): unweighted Kappa

- Positive values indicate how much the raters agree over and above chance alone
- Negative values indicate disagreement
- If the agreement matrix is irregular, Kappa will not be calculated, or will be misleading
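As a minimal sketch (the ratings are invented and the example is illustrative rather than taken from the session), unweighted kappa can be computed from scratch by comparing observed agreement with the agreement expected from each rater’s marginal distribution:

from collections import Counter

rater1 = ["A", "A", "B", "B", "C", "C", "A", "B", "C", "A", "B", "C"]
rater2 = ["A", "A", "B", "C", "C", "C", "A", "B", "B", "A", "B", "C"]

n = len(rater1)
categories = sorted(set(rater1) | set(rater2))

# Observed proportion of agreement
p_o = sum(r1 == r2 for r1, r2 in zip(rater1, rater2)) / n

# Chance agreement: sum over categories of the product of the raters' marginal proportions
m1, m2 = Counter(rater1), Counter(rater2)
p_e = sum((m1[c] / n) * (m2[c] / n) for c in categories)

kappa = (p_o - p_e) / (1 - p_e)
print(f"p_o = {p_o:.3f}, p_e = {p_e:.3f}, kappa = {kappa:.3f}")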

Interrater reliability of categorical IV (3): unweighted Kappa in SPSS

CROSSTABS
  /TABLES=rater1 BY rater2
  /FORMAT=AVALUE TABLES
  /STATISTIC=KAPPA
  /CELLS=COUNT
  /COUNT ROUND CELL.

Interrater reliability of categorical IV (4): Kappas in irregular matrices

If rater 2 is systematically “above” rater 1 when coding an ordinal scale, Kappa will be misleading; it is possible to “fill up” the matrix with zeros.
(Worked example on the slide: K = .51 vs. K = -.16.)

Interrater reliability of categorical IV (5): Kappas in irregular matrices

If there are no observations in some row or column, Kappa will not be calculated; it is possible to “fill up” the matrix with zeros.
(Worked example on the slide: K not possible to estimate vs. K = .47 after padding.)
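A minimal sketch of the same idea in Python, assuming the scikit-learn package (not used in the session itself) and invented codes in which rater 2 never assigns category 3; passing the full category list plays the same role as padding the crosstab with zeros:

from sklearn.metrics import cohen_kappa_score

# rater2 never uses category 3, so a naive crosstab would not be square
rater1 = [1, 1, 2, 2, 3, 3, 1, 2, 3, 2]
rater2 = [1, 1, 2, 2, 2, 2, 1, 2, 2, 2]

# Listing the full coding scheme is equivalent to filling the missing row/column with zeros
kappa = cohen_kappa_score(rater1, rater2, labels=[1, 2, 3])
print(f"kappa = {kappa:.2f}")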

Interrater reliability of categorical IV (6): weighted Kappa using a SAS macro

PROC FREQ DATA = int.interrater1;
  TABLES rater1 * rater2 / AGREE;
  TEST KAPPA;
RUN;

Papers and macros are available for estimating Kappa when there are unequal or misaligned rows and columns, or multiple raters (links given on the original slide).
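Outside SAS, weighted kappa can be sketched in Python with scikit-learn (an assumption for illustration, not the macro referred to above); the weights penalise near-misses on an ordinal scale less than distant disagreements:

from sklearn.metrics import cohen_kappa_score

# Invented ordinal ratings from two coders
rater1 = [1, 2, 2, 3, 4, 4, 5, 3, 2, 1]
rater2 = [1, 2, 3, 3, 4, 5, 5, 2, 2, 1]

print("unweighted:", round(cohen_kappa_score(rater1, rater2), 3))
print("linear    :", round(cohen_kappa_score(rater1, rater2, weights="linear"), 3))
print("quadratic :", round(cohen_kappa_score(rater1, rater2, weights="quadratic"), 3))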

Interrater reliability of continuous IV (1)

- Average correlation across coder pairs: r = (r12 + r13 + r23) / 3 = .873 (the three pairwise correlations were shown on the slide)
- Check that coders code in the same direction!
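A minimal sketch of the average-correlation approach in Python (pandas and numpy assumed; the scores are invented and do not reproduce the .873 from the slide):

import numpy as np
import pandas as pd

ratings = pd.DataFrame({
    "coder1": [3.0, 4.5, 2.0, 5.0, 3.5, 4.0],
    "coder2": [3.5, 4.0, 2.5, 5.0, 3.0, 4.5],
    "coder3": [3.0, 4.5, 2.0, 4.5, 3.5, 4.0],
})

corr = ratings.corr(method="pearson")
# Average the three pairwise correlations (upper triangle, excluding the diagonal)
pairwise = corr.values[np.triu_indices_from(corr.values, k=1)]
print(f"average inter-coder r = {pairwise.mean():.3f}")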

Interrater reliability of continuous IV (2)

Interrater reliability of continuous IV (3)

- Design 1: one-way random effects model, when each study is rated by a different pair of coders
- Design 2: two-way random effects model, when a random pair of coders rates all studies
- Design 3: two-way mixed effects model, when ONE pair of coders rates all studies
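As a hedged sketch, the intraclass correlations behind these designs can be obtained in Python with the pingouin package (an assumption for illustration; the session itself does not use Python), whose output includes ICC(1), ICC(2) and ICC(3), corresponding roughly to Designs 1-3:

import pandas as pd
import pingouin as pg

# Invented long-format data: five studies each rated by coders A and B
long = pd.DataFrame({
    "study":  [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "coder":  ["A", "B"] * 5,
    "rating": [3.0, 3.5, 4.5, 4.0, 2.0, 2.5, 5.0, 5.0, 3.5, 3.0],
})

icc = pg.intraclass_corr(data=long, targets="study", raters="coder", ratings="rating")
print(icc)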

Comparison of methods (from Orwin, p. 153, in Cooper & Hedges, 1994)

Low Kappa but a good agreement rate (AR) can occur when there is little variability across items and the coders agree.
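A minimal sketch of why this happens (invented codes, scikit-learn assumed): with a very skewed base rate, chance agreement is already high, so the agreement rate can be high while kappa is near zero or even negative:

from sklearn.metrics import cohen_kappa_score

rater1 = [0] * 18 + [1, 0]
rater2 = [0] * 18 + [0, 1]

agreement = sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)
print(f"percent agreement = {agreement:.2f}")   # 18 of 20 codes agree
print(f"kappa             = {cohen_kappa_score(rater1, rater2):.2f}")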

Interrater reliability in meta-analysis and in primary studies

Interrater reliability in meta-analysis vs. in other contexts

- Meta-analysis: coding of independent variables
- How many co-judges?
- How many objects to co-judge? (a sub-sample of studies versus a sub-sample of codings)
- Use of a “gold standard” (i.e., one “master coder”)
- Coder drift (cf. observer drift): are coders consistent over time?
- Your qualitative analysis is only as good as the quality of your categorisation of qualitative data