Measurement Reliability


Measurement Reliability Qian-Li Xue Biostatistics Program Harvard Catalyst | The Harvard Clinical & Translational Science Center Short course, October 27, 2016

Objectives
Classical Test Theory
Definitions of Reliability
Types of Reliability Coefficients: Test-Retest, Inter-Rater, Internal Consistency
Correction for Attenuation
Review Exercises

What is reliability?
Consistency of measurement.
The extent to which a measurement instrument can differentiate among subjects.
Reliability is relative: how reproducible the results of a scale are under different conditions.
Our comfort resides in the knowledge that the error of measurement is a relatively small fraction of the range in the observations.

Facets of Reliability
Mrs. Z scores 20 at visit 1 and 25 at visit 2. This could reflect:
Random variation (Test-Retest reliability)
Tech #2 being more lenient than Tech #1 (Inter-Rater reliability)
Version #2 being easier than Version #1 (related to Internal Consistency)
Mrs. Z's picture-naming actually improving

Classical Test Theory
X = Tx + e   (Observed Score = True Score + Error)
Assumptions:
E(e) = 0
Cov(Tx, e) = 0
Cov(ei, ek) = 0
Therefore:
Var(X) = Var(Tx + e) = Var(Tx) + 2Cov(Tx, e) + Var(e)
Var(X) = Var(Tx) + Var(e)
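A minimal simulation sketch of this decomposition (variable names and the chosen variances are illustrative assumptions, not from the slides): generate a true score and an independent error, then check that the observed-score variance is approximately Var(Tx) + Var(e).

```stata
* Sketch: simulating classical test theory, X = Tx + e
clear
set obs 10000
set seed 20161027
gen Tx = rnormal(50, 10)        // true scores: Var(Tx) = 100
gen e  = rnormal(0, 5)          // error, independent of Tx: Var(e) = 25
gen X  = Tx + e                 // observed scores
summarize Tx e X                // SD of X should be close to sqrt(100 + 25)
correlate X Tx
display "Estimated reliability (squared corr of X with Tx): " %5.3f r(rho)^2
```

With the assumed variances, the reliability Var(Tx)/Var(X) = 100/125 = 0.80, which the squared correlation above should approximate.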

Reliability as Consistency of Measurement
The relationship between parallel tests.
Ratio of true score variance to total score variance:
ρxx = Var(Tx) / Var(X) = [Var(X) − Var(e)] / Var(X)

Parallel Tests
Parallel: equal true scores and equal error variances.
Tau-Equivalent: equal true scores; error variances may differ.
Essentially Tau-Equivalent: true scores differ only by an additive constant; error variances may differ.
Congeneric: true scores are linearly related (the least restrictive model).
See Graham (2006) for details.

Correlation, r
Correlation (i.e., the "Pearson" correlation) is a scaled version of covariance: r = Cov(X,Y) / [SD(X) · SD(Y)]
−1 ≤ r ≤ 1
r = 1: perfect positive correlation
r = −1: perfect negative correlation
r = 0: uncorrelated

Correlation between Parallel Tests
The correlation between two parallel tests is equal to the reliability of each test.
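A one-line derivation of this fact under the classical test theory assumptions above, writing the two parallel tests as X1 = Tx + e1 and X2 = Tx + e2 with equal error variances and uncorrelated errors:

```latex
\mathrm{Corr}(X_1, X_2)
  = \frac{\mathrm{Cov}(T_x + e_1,\; T_x + e_2)}{\sqrt{\mathrm{Var}(X_1)\,\mathrm{Var}(X_2)}}
  = \frac{\mathrm{Var}(T_x)}{\mathrm{Var}(X)}
  = \rho_{xx}
```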

DIADS Example
Depression in Alzheimer's Disease Study (DIADS): a double-blind, placebo-controlled trial of sertraline.
One outcome was the Boston Naming Test, which consists of 60 pictures to be named and has two versions.

Measures for Reliability
                          Continuous                     Categorical
Test-retest               r or ICC                       Kappa or ICC
Inter-rater               r or ICC                       Kappa or ICC
Internal Consistency      Alpha, Split-half, or ICC      KR-20 or ICC (dichotomous)

Kappa Coefficient (Cohen, 1960) Test-Retest or Inter-rater reliability for categorical (typically dichotomous) data. Accounts for chance agreement

Kappa Coefficient
kappa = (Po − Pe) / (1 − Pe)
Po = observed proportion of agreement
Pe = expected (chance) proportion of agreement
Example: kappa = {[(20 + 55)/100] − [(10.5 + 45.5)/100]} / {1 − [(10.5 + 45.5)/100]} = (0.75 − 0.56) / (1 − 0.56) = 0.43
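A minimal Stata sketch reproducing this example. The 2×2 cell counts (20 and 55 agreements, with the 25 disagreements split 15/10) are an assumption that reproduces the chance agreement of 0.56 used above; the variable names are illustrative.

```stata
* Sketch: kappa for the 2x2 example above (assumed cell counts)
clear
input rater1 rater2 n
1 1 20
1 0 15
0 1 10
0 0 55
end
expand n            // one record per rated subject
drop n
kap rater1 rater2   // reports observed agreement 75%, expected 56%, kappa ~= 0.43
```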

Kappa in STATA

Kappa Interpretation
Kappa Value     Interpretation
Below 0.00      Poor
0.00–0.20       Slight
0.21–0.40       Fair
0.41–0.60       Moderate
0.61–0.80       Substantial
0.81–1.00       Almost perfect
(source: Landis, J. R. and Koch, G. G. 1977. Biometrics 33: 159-174)
Caution: the value of kappa is strongly affected by the marginal proportions, so it can be misleading when those proportions are very high or very low. The best interpretation of kappa is to compare its value with values from other, similar scales.

Weighted Kappa (Cohen, 1968)
For ordered polytomous data.
Requires assignment of a weighting matrix.
Kw = ICC with quadratic weights (Fleiss & Cohen, 1973).
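A one-line Stata sketch, assuming two raters' ordered ratings are stored in hypothetical variables rating1 and rating2; wgt(w2) requests Stata's built-in quadratic weights.

```stata
* Sketch: weighted kappa for ordered ratings (hypothetical variables rating1, rating2)
kap rating1 rating2, wgt(w2)   // quadratic weights; omit the option for unweighted kappa
```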

Measures for Reliability
                          Continuous                     Categorical
Test-retest               r or ICC                       Kappa or ICC
Inter-rater               r or ICC                       Kappa or ICC
Internal Consistency      Alpha, Split-half, or ICC      KR-20 or ICC (dichotomous)

Internal Consistency Degree of homogeneity of items within a scale. Items should be correlated with each other and the total score. Not a measure of dimensionality; assumes unidimensionality.

Internal Consistency and Dimensionality
At least two explanations for a lack of internal consistency among scale items:
More than one dimension
Bad items

Cronbach's Alpha
α = [k / (k − 1)] × [1 − Σi Var(Yi) / Var(Ytotal)], where k is the number of items, Var(Yi) is the variance of item i, and Var(Ytotal) is the variance of the total score.

Cronbach's Alpha
Mathematically equivalent to ICC(3,k).
When inter-item correlations are equal across items, it equals the average of all split-half reliabilities.
See DeVellis, pp. 36–38.

STATA Alpha Output
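A minimal sketch of the kind of command that produces output like this slide's (item1-item10 are hypothetical scale items):

```stata
* Sketch: Cronbach's alpha in Stata (item1-item10 are hypothetical scale items)
alpha item1-item10, item   // reports item-rest correlations and alpha with each item removed
```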

Kuder-Richardson 20 (KR-20)
Cronbach's alpha for dichotomous items; in the KR-20 formula, pi is the proportion responding positively to item i.
Use the alpha command in STATA; it automatically gives KR-20 when the items are dichotomous.

Correction for Attenuation
You can calculate rx,y, the observed correlation between two measures.
You want to know rTx,Ty, the correlation between their true scores.

Correction for Attenuation
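A standard statement of the correction (the form used in the exercises below), where ρxx and ρyy are the reliabilities of the two measures:

```latex
r_{T_x T_y} = \frac{r_{x,y}}{\sqrt{\rho_{xx}\,\rho_{yy}}}
\qquad\text{equivalently}\qquad
r_{x,y} = r_{T_x T_y}\,\sqrt{\rho_{xx}\,\rho_{yy}}
```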

How to Improve Reliability
Reduce error variance:
- Better observer training
- Improved scale design
Enhance true variance:
- Introduce new items better at capturing heterogeneity
- Change item responses
Increase the number of items in a scale.

Exercise #1
You develop a new survey measure of depression based on a pilot sample that consists of 33% severely depressed, 33% mildly depressed, and 33% non-depressed respondents. You are happy to discover that your measure has a high reliability of 0.90. Emboldened by your findings, you obtain funding and administer your survey to a nationally representative sample. However, you find that the reliability is now much lower. Why might the reliability have dropped?

Exercise #1 - Answer
Express the reliability in ICC notation. If the national sample is much more homogeneous than the pilot sample (suppose, for example, that nearly all of it is severely depressed), then BMS (the between-person variance) drops, and so does the ICC.

Exercise #2 A: Draw data where the cov(Tx,e) is negative B: Draw data where the cov(Tx,e) is positive

Exercise #2a – Answer

Exercise #2b - Answer
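The exercise asks for drawings; as a supplementary Stata sketch (variable names and coefficients are illustrative assumptions), one can also generate and plot data of each type:

```stata
* Sketch: data where error is negatively or positively correlated with the true score
clear
set obs 500
set seed 42
gen Tx    = rnormal(0, 1)
gen e_neg = -0.6*Tx + rnormal(0, 0.5)   // Cov(Tx, e) < 0, e.g., ceiling effects
gen e_pos =  0.6*Tx + rnormal(0, 0.5)   // Cov(Tx, e) > 0
correlate Tx e_neg e_pos
scatter e_neg Tx, name(exercise2a, replace)
scatter e_pos Tx, name(exercise2b, replace)
```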

Exercise #3
The reported correlation between years of educational attainment and adults' scores on anti-social personality disorder (ASP) scales is usually about 0.30; the reported reliability is 0.95 for the education measure and 0.70 for the ASP scale. What will the observed correlation between these two measures be if your education data have the same reliability (0.95) but your ASP scale has a much lower reliability of 0.40?

Exercise #3 - Answer
First solve for the true-score correlation from the reported data, then solve for the new observed correlation.
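A worked calculation using the correction-for-attenuation formula above (rounded to two decimals):

```latex
r_{T_x T_y} = \frac{0.30}{\sqrt{0.95 \times 0.70}} \approx \frac{0.30}{0.816} \approx 0.37
\qquad
r_{x,y}^{\text{new}} \approx 0.37 \times \sqrt{0.95 \times 0.40} \approx 0.37 \times 0.616 \approx 0.23
```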

Exercise #4
In rating a dichotomous child health outcome among 100 children, two psychiatrists disagree in 20 cases: in 10 of these the 1st psychiatrist rated the outcome as present and the 2nd as absent, and in the other 10 the reverse. What will the value of the kappa coefficient be if both psychiatrists agree that 50 children have the outcome?

Exercise #4 - Answer
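A worked calculation (the 2×2 cell counts follow directly from the problem: 50 present/present, 30 absent/absent, and 10 disagreements in each direction, so each rater marks the outcome present in 60% of children):

```latex
P_o = \frac{50 + 30}{100} = 0.80,\qquad
P_e = 0.60 \times 0.60 + 0.40 \times 0.40 = 0.52,\qquad
\kappa = \frac{0.80 - 0.52}{1 - 0.52} \approx 0.58
```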

Exercise #5 Give substantive examples of how measures of self-reported discrimination could possibly violate each of the three assumptions of classical test theory.

Exercise #5 - Answer
E(e) = 0 could be violated if the true score is systematically underreported as a result of social desirability bias.
Cov(Tx, e) = 0 could be violated if people systematically overreported or underreported discrimination at either the high or the low extremes of the measure.
Cov(ei, ej) = 0 could be violated if discrimination was clustered within certain areas of a location and multiple locations were included in the analysis pool.