Reliability & Agreement. DeShon, 2006.

Presentation transcript:

Reliability & Agreement DeShon

Internal Consistency Reliability. Parallel forms reliability; split-half reliability; Cronbach's alpha (tau-equivalent); Spearman-Brown prophecy formula: a longer test is more reliable.
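As a quick illustration of the Spearman-Brown logic, a minimal sketch (hypothetical helper function, not from the slides) that projects reliability when a test is lengthened:

```python
# Minimal sketch of the Spearman-Brown prophecy formula (hypothetical helper,
# not from the slides): projected reliability when a test is lengthened by a
# factor k, given its current reliability r and assuming parallel items.
def spearman_brown(r: float, k: float) -> float:
    """Reliability of a test lengthened k times."""
    return (k * r) / (1 + (k - 1) * r)

# Doubling a test with reliability .70 is projected to yield about .82,
# illustrating the slide's point that longer tests are more reliable.
print(round(spearman_brown(0.70, 2), 2))  # 0.82
```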

Test-Retest Reliability. Correlation between the same test administered at two time points. Assumes stability of the construct. Need 3 or more time points to separate error from instability (Kenny & Zarutta, 1996). Assumes no learning, practice, or fatigue effects (tabula rasa). Probably the most important form of reliability for psychological inference.
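For concreteness, a minimal sketch of the test-retest estimate as a Pearson correlation between two administrations (the scores are invented, not the slides' data):

```python
# Test-retest reliability as the Pearson correlation between the same test
# given at two time points (illustrative scores, not from the slides).
from scipy.stats import pearsonr

time1 = [12, 15, 9, 20, 17, 11, 14, 18]
time2 = [13, 14, 10, 19, 18, 10, 15, 17]

r, p = pearsonr(time1, time2)
print(f"test-retest r = {r:.2f}")  # a high r suggests a stable construct,
                                   # assuming no practice or fatigue effects
```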

Interrater Reliability. Could be estimated as the correlation between two raters, or as alpha for 2 or more raters. Typically estimated with an intraclass correlation based on ANOVA. Shrout & Fleiss (1979); McGraw & Wong (1996).

Interrater Reliability

Intraclass Correlations. What is a class of variables? Variables that share a metric and variance. Height and weight are different classes of variables. There is only one interclass correlation coefficient: Pearson's r. When interested in the relationship between variables of a common class, use an intraclass correlation coefficient.

Intraclass Correlations. An ICC estimates the reliability ratio directly. Recall that reliability is the ratio of true-score variance to observed-score variance. An ICC is estimated as the analogous ratio of variance components, as sketched below.
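A sketch of the two ratios referenced above, in standard classical-test-theory notation (the symbols are not taken from the slide images):

```latex
% Reliability as a variance ratio (classical test theory),
% and the ICC as the analogous ratio of variance components.
\[
r_{XX'} \;=\; \frac{\sigma^2_{T}}{\sigma^2_{X}}
        \;=\; \frac{\sigma^2_{T}}{\sigma^2_{T} + \sigma^2_{E}},
\qquad
\mathrm{ICC} \;=\; \frac{\sigma^2_{\text{between persons}}}
                        {\sigma^2_{\text{between persons}} + \sigma^2_{\text{error}}}
\]
```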

Intraclass Correlations. The variance estimates used to compute this ratio are typically obtained from an ANOVA on a Person x Rater design. In reliability theory, the classes are persons, so the signal is the between-person variance. The variance within persons due to rater differences is the error.

Intraclass Correlations. Example... depression ratings. [Table: each person's depression rating from Rater 1 through Rater 4.]

Intraclass Correlations. 3 sources of variance in the design: persons, raters, & residual error. No replications, so the Rater x Ratee interaction is confounded with the error. ANOVA results...

Intraclass Correlations. Based on this rating design, Shrout & Fleiss defined three ICCs. ICC(1,k): random set of people, random set of raters, nested design; the rater for each person is selected at random. ICC(2,k): random set of people, random set of raters, crossed design. ICC(3,k): random set of people, FIXED set of raters, crossed design.

ICC(1,k). A set of raters provides ratings on different sets of persons; no two raters provide ratings for the same person. In this case, persons are nested within raters, so the rater variance can't be separated from the error variance. k refers to the number of judges that will actually be used to obtain the ratings in the decision-making context.

ICC(1,k). Agreement for the average of k ratings. We'll worry about estimating these "components of variance" later.

ICC(2,k). Because raters are crossed with ratees, you can get a separate rater main effect. Agreement for the average ratings across a set of random raters.

ICC(3,k). Raters are "fixed", so you get to drop their variance from the denominator. Consistency/reliability of the average rating across a set of fixed raters.
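For reference, the three average-measure coefficients above can be written in variance-components form. This is a sketch in standard notation consistent with Shrout & Fleiss (1979), not copied from the slide images (sigma^2_p = persons, sigma^2_r = raters, sigma^2_e = residual, sigma^2_w = within-person variance with rater and error confounded, k = number of raters):

```latex
% Average-measure ICCs in variance-components form.
\[
\mathrm{ICC}(1,k) = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_w / k}
\quad\text{(rater and error variance confounded in } \sigma^2_w\text{)}
\]
\[
\mathrm{ICC}(2,k) = \frac{\sigma^2_p}{\sigma^2_p + (\sigma^2_r + \sigma^2_e)/k},
\qquad
\mathrm{ICC}(3,k) = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_e / k}
\]
```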

Shrout & Fleiss (1979)
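To make the ANOVA route concrete, here is a minimal numpy sketch (function name and code are illustrative, not from the slides) that computes the Shrout & Fleiss single- and average-measure ICCs from the mean squares of a fully crossed persons x raters table. The ratings matrix is the six-person, four-rater example from Shrout & Fleiss (1979); for the two-way random, absolute-agreement case it gives roughly .29 (single measure) and .62 (average measure), in line with the SPSS output reproduced later in the deck.

```python
# Sketch: Shrout & Fleiss (1979) ICCs from the ANOVA mean squares of a fully
# crossed persons x raters matrix (no missing data).
import numpy as np

def shrout_fleiss_iccs(x: np.ndarray) -> dict:
    """ICCs for an n-persons (rows) by k-raters (columns) matrix."""
    n, k = x.shape
    grand = x.mean()
    ss_total = ((x - grand) ** 2).sum()
    ss_persons = k * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_raters = n * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_error = ss_total - ss_persons - ss_raters

    bms = ss_persons / (n - 1)                    # between-persons mean square
    jms = ss_raters / (k - 1)                     # between-raters (judges) mean square
    ems = ss_error / ((n - 1) * (k - 1))          # residual mean square
    wms = (ss_raters + ss_error) / (n * (k - 1))  # within-persons mean square

    return {
        "ICC(1,1)": (bms - wms) / (bms + (k - 1) * wms),
        "ICC(2,1)": (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n),
        "ICC(3,1)": (bms - ems) / (bms + (k - 1) * ems),
        "ICC(1,k)": (bms - wms) / bms,
        "ICC(2,k)": (bms - ems) / (bms + (jms - ems) / n),
        "ICC(3,k)": (bms - ems) / bms,
    }

ratings = np.array([  # rows = persons, columns = raters (Shrout & Fleiss example)
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
])
for name, value in shrout_fleiss_iccs(ratings).items():
    print(f"{name} = {value:.2f}")
```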

ICCs in SPSS. For SPSS, you must choose (1) an ANOVA model and (2) a type of ICC (consistency vs. absolute agreement):
ANOVA model: One-way random effects -> ICC(1,1).
ANOVA model: Two-way random effects -> ICC(2,1), labeled "ICC(AGREEMENT)" (absolute agreement).
ANOVA model: Two-way mixed (raters fixed, patients random) -> ICC(3,1), labeled "ICC(CONSISTENCY)" (consistency).

ICCs in SPSS

Select raters...

ICCs in SPSS. Choose Analysis under the Statistics tab.

ICCs in SPSS. Output...
R E L I A B I L I T Y   A N A L Y S I S
Intraclass Correlation Coefficient
Two-way Random Effect Model (Absolute Agreement Definition): People and Measure Effect Random
Single Measure Intraclass Correlation = .2898
95.00% C.I.: Lower = .0188, Upper = .7611
F = , DF = (5, 15.0), Sig. = .0001 (Test Value = .00)
Average Measure Intraclass Correlation = .6201
95.00% C.I.: Lower = .0394, Upper = .9286
F = , DF = (5, 15.0), Sig. = .0001 (Test Value = .00)
Reliability Coefficients: N of Cases = 6.0, N of Items = 4
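Outside SPSS, the same coefficients can be computed in Python. A hedged sketch using the pingouin package follows; the intraclass_corr call and its output column names are assumed from pingouin's documented interface, and the data are reshaped to long format (one row per person-by-rater rating):

```python
# Sketch: ICCs via pingouin (assumed interface; long-format data).
import pandas as pd
import pingouin as pg

wide = pd.DataFrame(
    [[9, 2, 5, 8], [6, 1, 3, 2], [8, 4, 6, 8],
     [7, 1, 2, 6], [10, 5, 6, 9], [6, 2, 4, 7]],
    columns=["R1", "R2", "R3", "R4"],
)
long = (wide.reset_index()
             .rename(columns={"index": "person"})
             .melt(id_vars="person", var_name="rater", value_name="score"))

icc = pg.intraclass_corr(data=long, targets="person",
                         raters="rater", ratings="score")
# The returned table lists single- and average-measure ICCs (ICC1..ICC3k).
print(icc[["Type", "ICC", "CI95%"]])
```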

Confidence intervals for ICCs. For your reference...

Standard Error of Measurement. Estimate of the average distance of observed test scores from an individual's true score.

Standard Error of the Difference. Region of indistinguishable true scores. [Figure: region of indistinguishable promotion scores (429 of 517 candidates), spanning promotion ranks #34 to #463 around promotion rank #264.]
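A minimal numeric sketch of the two quantities above, with assumed values rather than the promotion data: the standard error of measurement follows from the score SD and the reliability estimate, and the standard error of the difference scales it by the square root of two.

```python
# Sketch: standard error of measurement (SEM) and standard error of the
# difference (SED), using illustrative values rather than the slides' data.
import math

sd_scores = 10.0     # standard deviation of observed scores (assumed)
reliability = 0.81   # reliability estimate, e.g., an ICC or alpha (assumed)

sem = sd_scores * math.sqrt(1 - reliability)  # average distance of observed
                                              # scores from true scores
sed = sem * math.sqrt(2)                      # SE of the difference between two
                                              # scores with equal SEMs
print(f"SEM = {sem:.2f}, SED = {sed:.2f}")
# Candidates whose scores differ by less than about 1.96 * SED fall in a
# region of indistinguishable true scores.
print(f"indistinguishable band = +/- {1.96 * sed:.2f}")
```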

Agreement vs. Reliability. Reliability/correlation is based on covariance and not the actual value of the two variables. If one rater is more lenient than another but they rank the candidates the same, then the reliability will be very high. Agreement requires absolute consistency.

Agreement vs. Reliability. Interrater reliability: “Degree to which the ratings of different judges are proportional when expressed as deviations from their means” (Tinsley & Weiss, 1975, p. 359); used when interest is in the relative ordering of the ratings. Interrater agreement: “Extent to which the different judges tend to make exactly the same judgments about the rated subject” (T&W, p. 359); used when the absolute value of the ratings matters.

Agreement Indices. Percent agreement: what percent of the total ratings are exactly the same? Cohen's kappa: percent agreement corrected for the probability of chance agreement. r_wg: agreement when rating a single stimulus (e.g., a supervisor, community, or clinician).

Kappa. Typically used to assess interrater agreement. Designed for categorical judgments (finishing places, disease states). Corrects for chance agreements due to the limited number of rating categories. kappa = (P_A - P_C) / (1 - P_C), where P_A = proportion agreement and P_C = agreement expected by chance. Ranges from 0 to 1; usually a bit lower than reliability.
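A minimal sketch of the kappa computation for two raters making yes/no judgments (the cell counts are illustrative, not the slides' example):

```python
# Sketch: Cohen's kappa for two raters from a 2x2 confusion table
# (illustrative counts, not the table from the slides).
import numpy as np

def cohens_kappa(table: np.ndarray) -> float:
    """table[i, j] = number of items rater A put in category i and rater B in j."""
    total = table.sum()
    p_a = np.trace(table) / total                                     # observed agreement
    p_c = (table.sum(axis=1) * table.sum(axis=0)).sum() / total ** 2  # chance agreement
    return (p_a - p_c) / (1 - p_c)

counts = np.array([[20, 5],
                   [10, 15]])
print(f"kappa = {cohens_kappa(counts):.2f}")  # 0.40 for these counts
```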

Kappa Example

Expected by chance...

Kappa Standards. Kappa > .8 = good agreement; .67 < kappa < .8 = “tentative conclusions” (Carletta '96). As with everything... it depends. For more than 2 raters, average the pairwise kappas.

Kappa Problems. Affected by marginals: 2 examples with 90% agreement give Ex 1: Kappa = .44 and Ex 2: Kappa = .80. Kappa is highest with an equal amount of yes and no ratings. [Two 2x2 Yes/No contingency tables shown on the slide.]
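Two illustrative tables (cell counts assumed here, chosen so that both show 90% agreement while reproducing kappas of roughly .44 and .80 as quoted above) make the dependence on marginals concrete:

```python
# Two 2x2 tables with identical 90% agreement but different marginals
# (cell counts assumed for illustration; they reproduce the kappas cited above).
import numpy as np

def kappa(table: np.ndarray) -> float:
    total = table.sum()
    p_a = np.trace(table) / total
    p_c = (table.sum(axis=1) * table.sum(axis=0)).sum() / total ** 2
    return (p_a - p_c) / (1 - p_c)

ex1 = np.array([[85, 5],    # skewed marginals: mostly "yes" from both raters
                [5, 5]])
ex2 = np.array([[45, 5],    # balanced marginals: equal "yes" and "no"
                [5, 45]])
print(f"Ex 1: agreement = {np.trace(ex1) / ex1.sum():.0%}, kappa = {kappa(ex1):.2f}")  # ~0.44
print(f"Ex 2: agreement = {np.trace(ex2) / ex2.sum():.0%}, kappa = {kappa(ex2):.2f}")  # 0.80
```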

Kappa Problems. Departures from symmetry in the contingency tables (i.e., prevalence and bias) affect the magnitude of kappa. Unbalanced agreement reduces kappa; unbalanced disagreement increases kappa.

r_wg. Based on Finn's (1970) index of agreement. r_wg is used to assess agreement when multiple raters rate a single stimulus. When there is no variation in the stimuli, you can't examine the agreement of ratings over different stimuli.

r_wg. You could use the standard deviation of the ratings, but, like percent agreement, that does not account for chance. r_wg references the observed standard deviation in the ratings to the expected standard deviation if the ratings were random.

r_wg. Compares the observed variance in the ratings to the variance expected if the ratings were random. The standard assumption is a uniform distribution over the rating scale range; its variance is a reasonable standard for the random-response comparison.
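A minimal sketch of the single-stimulus r_wg under the uniform null distribution (illustrative ratings and a hypothetical helper function, not the slides' data):

```python
# Sketch: r_wg for a group of judges rating a single stimulus on an A-point
# scale, using the uniform distribution as the random-response standard.
import numpy as np

def r_wg(ratings, scale_points: int) -> float:
    obs_var = np.var(ratings, ddof=1)            # observed variance of the ratings
    null_var = (scale_points ** 2 - 1) / 12.0    # variance of a uniform distribution
                                                 # over an A-point scale
    return 1.0 - obs_var / null_var

judges = [4, 5, 4, 4, 5, 4]   # six judges rating one supervisor on a 5-point scale
print(f"r_wg = {r_wg(judges, scale_points=5):.2f}")  # close to 1 = strong agreement
```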