Measures of agreement
Natalie Robinson, Centre for Evidence-based Veterinary Medicine
Why might we measure agreement?
- Measures of reliability: to compare 2 or more different methods, e.g. SNAP FeLV test vs virus isolation
- To look at inter-rater reliability, e.g. several 'raters' using the same body condition scoring method on the same animals
- To look at repeatability (intra-rater or test-retest reliability), e.g. the same 'rater' using the same BCS method on the same animals on 2 days in a row
Categorical/ordinal data
Binary, nominal or ordinal data, e.g.:
- Positive or negative test result
- Breeds of dog
- Grade of disease (mild, moderate, severe)
Suitable measures:
- Percentage agreement
- Cohen's Kappa
- Weighted Kappa (for ordinal data)
- Lots of variations, e.g. Fleiss' Kappa; see Banerjee and Capozzoli (1999) Beyond kappa: A review of interrater agreement measures. The Canadian Journal of Statistics, 27, 3-23.
Percentage agreement
2 different tests performed on 100 samples:

             Test A +ve   Test A -ve
Test B +ve       27            2
Test B -ve        5           66
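As a quick illustration (not part of the original slides), percentage agreement is simply the proportion of samples on which the two tests give the same result. A minimal Python sketch using the counts above:

```python
# Percentage agreement for the 2 x 2 table above (counts taken from the slide).
agree = 27 + 66          # both tests positive + both tests negative
total = 27 + 2 + 5 + 66  # all 100 samples
percent_agreement = 100 * agree / total
print(f"Percentage agreement: {percent_agreement:.0f}%")  # 93%
```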
So why don't we just use this...?
- Some agreement will occur by chance
- The amount of chance agreement depends on the number of categories and the frequency of each category
- For example...
Cohen's Kappa
- Asks: is agreement greater than expected by chance?
- Can only compare two raters/methods at a time
- Values usually fall between 0 and 1: 0 = agreement no better than chance, 1 = perfect agreement
- Negative values are possible (agreement worse than expected by chance)
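For reference, kappa compares the observed agreement (p_o) with the agreement expected by chance from the marginal totals (p_e): kappa = (p_o - p_e) / (1 - p_e). A minimal sketch in Python, reusing the 2 x 2 counts from the earlier slide (the exact kappa value is not given in the slides, so treat the result as illustrative):

```python
import numpy as np

# Confusion matrix from the earlier slide: rows = Test B (+ve, -ve), columns = Test A (+ve, -ve).
table = np.array([[27, 2],
                  [5, 66]])
n = table.sum()

p_o = np.trace(table) / n              # observed agreement (0.93)
row_marg = table.sum(axis=1) / n       # Test B marginal proportions
col_marg = table.sum(axis=0) / n       # Test A marginal proportions
p_e = np.sum(row_marg * col_marg)      # agreement expected by chance
kappa = (p_o - p_e) / (1 - p_e)
print(f"p_o = {p_o:.3f}, p_e = {p_e:.3f}, kappa = {kappa:.3f}")  # kappa ≈ 0.835
```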
Getting your data into SPSS
- If the data are in 'long form' (one 'case' per row), you will need to enter them as frequencies instead
- You can do this by producing an 'n x n' table, where n is the number of categories
- In SPSS, select 'Analyze', then 'Descriptive Statistics', then 'Crosstabs'
- Select the 2 variables you want to compare; this will generate an 'n x n' table, which you can use to enter the frequency data into a new dataset
- Your dataset should then look something like this, where the 'count' is the frequency from your 'n x n' table...
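If you prefer scripting, the same long-form-to-frequency step can be sketched in Python with pandas (the column names here are assumptions, not from the slides):

```python
import pandas as pd

# Long-form data: one row per sample, with the result from each test (hypothetical values).
df = pd.DataFrame({
    "test_a": ["+ve", "+ve", "-ve", "-ve", "+ve", "-ve"],
    "test_b": ["+ve", "+ve", "-ve", "+ve", "+ve", "-ve"],
})

# Equivalent of the SPSS Crosstabs step: an n x n table of frequencies.
freq_table = pd.crosstab(df["test_b"], df["test_a"])
print(freq_table)
```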
What results will I get?
- Point estimate with standard error
- 95% confidence interval: point estimate +/- 1.96 x SE
- P value: indicates significance but not magnitude; it will generally be significant if Kappa > 0 unless the sample size is small
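As a quick arithmetic check (with illustrative numbers only, not results from the slides), the 95% confidence interval comes straight from the point estimate and its standard error:

```python
# Hypothetical output: a kappa point estimate with its standard error.
kappa, se = 0.84, 0.05
ci_lower = kappa - 1.96 * se
ci_upper = kappa + 1.96 * se
print(f"kappa = {kappa:.2f} (95% CI {ci_lower:.2f} to {ci_upper:.2f})")  # 0.74 to 0.94
```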
What is a 'good' Kappa value?

Kappa value   Landis & Koch (1977)   McHugh (2012)
0.00-0.20     Slight                 None
0.21-0.40     Fair                   Minimal
0.41-0.60     Moderate               Weak
0.61-0.80     Substantial            Moderate
0.81-0.90     Almost perfect         Strong
0.91-1.00     Almost perfect         Almost perfect

Landis and Koch (1977) The measurement of observer agreement for categorical data. Biometrics, 33: 159-174
McHugh (2012) Interrater reliability: The Kappa Statistic. Biochem Med (Zagreb), 22: 276-282
Weighted Kappa
- For ordinal data
- Takes into account intermediate levels of agreement

                      Clinician A
               Mild   Moderate   Severe
Clinician B
  Mild          24        5         2
  Moderate      10       26         8
  Severe         1       11        13

Online calculator: http://graphpad.com/quickcalcs/kappa1.cfm
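Outside SPSS or the GraphPad calculator, a weighted kappa can also be computed in Python with scikit-learn. This is a minimal sketch that expands the 3 x 3 table above into paired ratings; the choice of linear weights is for illustration only:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# 3 x 3 table from the slide: rows = Clinician B, columns = Clinician A (Mild, Moderate, Severe).
table = np.array([[24, 5, 2],
                  [10, 26, 8],
                  [1, 11, 13]])
labels = ["Mild", "Moderate", "Severe"]

# Expand the frequency table into one pair of ratings per case.
clin_a, clin_b = [], []
for i, row_label in enumerate(labels):       # Clinician B category
    for j, col_label in enumerate(labels):   # Clinician A category
        clin_b += [row_label] * table[i, j]
        clin_a += [col_label] * table[i, j]

# Weighted kappa gives partial credit for near misses (e.g. Mild vs Moderate).
kappa_w = cohen_kappa_score(clin_a, clin_b, labels=labels, weights="linear")
print(f"Linearly weighted kappa = {kappa_w:.2f}")
```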
Continuous data
- Scale/numerical/discrete data, e.g. patient age, or a rating on a visual analogue scale
- Need 'degrees' of agreement, and it is incorrect to use correlation (e.g. Pearson's) to measure agreement
- Appropriate methods include the intraclass correlation (ICC), Lin's concordance correlation coefficient and the Bland-Altman plot
Bland JM, Altman DG. (1986). Statistical methods for assessing agreement between two methods of clinical measurement. Lancet, i, 307-310.
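A Bland-Altman plot of the kind described in the reference above can be drawn with a few lines of matplotlib. This is a generic sketch with made-up paired measurements, not data from the slides' exercises:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical paired measurements from two methods on the same subjects.
method_1 = np.array([10.2, 11.5, 9.8, 12.0, 10.9, 11.1, 9.5, 12.4])
method_2 = np.array([10.6, 11.2, 10.1, 12.5, 10.7, 11.6, 9.9, 12.1])

mean_vals = (method_1 + method_2) / 2
diffs = method_1 - method_2
bias = diffs.mean()
loa = 1.96 * diffs.std(ddof=1)        # 95% limits of agreement

plt.scatter(mean_vals, diffs)
plt.axhline(bias, linestyle="-", label="Mean difference (bias)")
plt.axhline(bias + loa, linestyle="--", label="Upper limit of agreement")
plt.axhline(bias - loa, linestyle="--", label="Lower limit of agreement")
plt.xlabel("Mean of the two methods")
plt.ylabel("Difference (method 1 - method 2)")
plt.legend()
plt.show()
```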
Intraclass correlation
- Values range from 0 to 1: 0 = no agreement, 1 = perfect agreement
- Use the same guidelines as for Kappa when interpreting:

ICC value     Landis & Koch (1977)   McHugh (2012)
0.00-0.20     Slight                 None
0.21-0.40     Fair                   Minimal
0.41-0.60     Moderate               Weak
0.61-0.80     Substantial            Moderate
0.81-0.90     Almost perfect         Strong
0.91-1.00     Almost perfect         Almost perfect
Options in SPSS
Should I select...
- Consistency or absolute agreement?
- One-way random, two-way random or two-way fixed model?
The terminology may differ slightly between stats programs. This article explains it well:
http://neoacademic.com/2011/11/16/computing-intraclass-correlations-icc-as-estimates-of-interrater-reliability-in-spss
Absolute agreement or consistency?
- E.g. Measure 2 is always 1 point higher than Measure 1
- Consistency would be perfect, but absolute agreement would not be
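To see why the distinction matters, here is a tiny made-up numerical example (not from the slides) where one measure is always exactly 1 point higher than the other: the two are perfectly consistent, yet they never agree absolutely:

```python
import numpy as np

measure_1 = np.array([3.0, 5.0, 7.0, 9.0])
measure_2 = measure_1 + 1.0            # always exactly 1 point higher

# Perfect consistency: the two measures rise and fall together exactly.
print(np.corrcoef(measure_1, measure_2)[0, 1])   # 1.0

# But absolute agreement is poor: the two never give the same value.
print(np.mean(measure_1 == measure_2))           # 0.0 (no exact matches)
print(np.mean(measure_2 - measure_1))            # systematic difference of 1.0
```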
One-way or two-way model?
E.g. raters recording the number of cells in a sample.

One-way model = you don't have the same raters for all ratees:
Sample 1: Raters A + B
Sample 2: Raters B + C
Sample 3: Raters A + C
Sample 4: Raters B + D
Sample 5: Raters A + D

Two-way model = you do have the same raters for all ratees:
Samples 1-5: Raters A, B + C
Random or mixed model?
- A one-way model is always random; a two-way model can be random or mixed
- Random = the raters are a random sample from a population of 'potential raters', e.g. two examiners marking exam papers are a 'sample' of all possible examiners who could mark the paper
- Mixed = the raters are the whole population of raters, i.e. the only possible raters anyone would be interested in
- The mixed case is rare: usually there will always be another potential rater
What will my output look like?
- Point estimate and 95% confidence interval
- P value
- Single measures or average measures?
Single or average measures?
- Single measures = the reliability of one rater: how accurate would a single person be, making measurements on their own? Usually more appropriate, as future studies will likely not use multiple raters for each measurement
- Average measures = the reliability of the different raters averaged together. This will be higher than the single-measures value, and using it is not usually justified
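In Python, the pingouin package (mentioned here as an alternative; the slides use SPSS) reports the common ICC variants in one call. In its output, the two-way random-effects, absolute-agreement, single-measures ICC is labelled ICC2, and ICC2k is the corresponding average-measures version. A minimal sketch with hypothetical data:

```python
import pandas as pd
import pingouin as pg

# Long-form data: one row per rating (hypothetical example with two raters and five animals).
ratings = pd.DataFrame({
    "animal": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "rater":  ["A", "B"] * 5,
    "age":    [2.0, 2.5, 7.0, 6.5, 4.0, 4.0, 10.0, 9.0, 1.0, 1.5],
})

# Returns a table of ICC variants with 95% confidence intervals and p values.
icc = pg.intraclass_corr(data=ratings, targets="animal", raters="rater", ratings="age")
print(icc)  # look for the ICC2 row: two-way random, absolute agreement, single measures
```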
What to report?
- Which program was used
- % agreement plus Kappa/ICC
- Point estimate (95% confidence interval), and possibly the P value
- For ICC: the type of model selected, and consistency vs absolute agreement
For example: "Cohen's kappa (κ) was calculated for categorical variables such as breed. Intra-class correlation coefficient (ICC) was calculated for age, in a two-way random model with measures of absolute agreement" (Robinson et al., in press, Agreement between veterinary patient data collected from different sources. The Veterinary Journal).
Exercises
- Calculate the Kappa for dog breed data collected from two different sources
- Calculate the ICC for cat age data collected from two different sources
References
Landis and Koch (1977) The measurement of observer agreement for categorical data. Biometrics, 33: 159-174.
McHugh (2012) Interrater reliability: The Kappa Statistic. Biochem Med (Zagreb), 22: 276-282.
Bland JM, Altman DG (1986) Statistical methods for assessing agreement between two methods of clinical measurement. Lancet, i: 307-310.
Banerjee and Capozzoli (1999) Beyond kappa: A review of interrater agreement measures. The Canadian Journal of Statistics, 27: 3-23.
Petrie and Sabin (2009) Medical Statistics at a Glance. 3rd Ed.
Robinson et al. (in press) Agreement between veterinary patient data collected from different sources. The Veterinary Journal. http://www.sciencedirect.com/science/article/pii/S1090023315001653
Computing ICC in SPSS: http://neoacademic.com/2011/11/16/computing-intraclass-correlations-icc-as-estimates-of-interrater-reliability-in-spss
GraphPad Kappa/weighted Kappa calculator: http://graphpad.com/quickcalcs/kappa1.cfm