Reliability and Validity: Design and Analytic Approaches

Presentation transcript:

1 Reliability and Validity: Design and Analytic Approaches
QSU Seminar. Rita Popat, PhD, Dept of Health Research & Policy, Division of Epidemiology. Design and analytic approaches; practical considerations. (Speaker note: mention why I selected this topic and background.)

2 What do we want to know about the measurements? Why?
Dependent variable (outcome); independent variable (risk factor or predictor). JAMA. 2004;292.

3 What are other possible explanations for not detecting an association?

4 JAMA. 2004;291.

5 Outline
Definitions: measurement error, reliability, validity
Why should we care about measurement error?
Effects of measurement error on study validity (categorical exposures)
Effects of measurement error on study validity (continuous exposures)
Measures (or indices) for reliability and validity

6 Measurement error
For an individual, measurement error is the difference between his/her observed and true measurement.
Measurement error can occur in dependent (outcome) or independent (predictor or exposure) variables.
For categorical variables, measurement error is referred to as misclassification.
Measurement error is an important source of bias that can threaten internal validity of a study.

7 Reliability (aka reproducibility, consistency)
Reliability is the extent to which repeated measurement of a stable phenomenon (by the same person or different people and instruments, at different times and places) obtains similar results.
A precise measurement is reproducible, that is, has the same (or nearly the same) value each time it is measured.
The higher the reliability, the greater the statistical power for a fixed sample size.
Reliability is affected by random error.

8 Validity or Accuracy
The accuracy of a variable is the degree to which it actually represents what it is intended to represent; that is, the extent to which the measurement represents the true value of the attribute being assessed.

9 Precise (Reliable) and Accurate (Valid) measurements are key to minimizing measurement error
[Figure: four panels: precision but no accuracy; accuracy but low precision; precision and accuracy; no precision and no accuracy.]

10 Measurement error in Categorical Variables
Measurement error in categorical variables is referred to as misclassification and can occur in the outcome variables or the exposure variables.
How do we know misclassification exists? When the method used for classifying exposure lacks accuracy.

11 Assessment of Accuracy
Criterion validity: compare against a reference or gold standard. A gold standard is often cumbersome, invasive, expensive.

                            True classification
Imperfect classification     Present (+)   Absent (-)   Total
  +                           a (TP)        b (FP)       a+b
  -                           c (FN)        d (TN)       c+d
  Total                       a+c           b+d

Sensitivity = a / (a+c)    False negative = c / (a+c)
Specificity = d / (b+d)    False positive = b / (b+d)
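A minimal sketch of these computations in Python (illustrative function name and example counts, not from the slides):

```python
def accuracy_measures(a, b, c, d):
    """Sensitivity and specificity of an imperfect classification vs. a gold standard.

    a = true positives, b = false positives, c = false negatives, d = true negatives,
    matching the cells of the 2x2 table above.
    """
    sensitivity = a / (a + c)        # P(classified + | truly present)
    specificity = d / (b + d)        # P(classified - | truly absent)
    false_negative = c / (a + c)     # 1 - sensitivity
    false_positive = b / (b + d)     # 1 - specificity
    return sensitivity, specificity, false_negative, false_positive

# Hypothetical counts: 90 TP, 20 FP, 10 FN, 80 TN -> sensitivity 0.90, specificity 0.80
print(accuracy_measures(90, 20, 10, 80))
```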

12 Misclassification of exposure
                Cases (outcome +)   Controls (outcome -)
Exposure +              a                    b
Exposure -              c                    d

Misclassification of exposure can be non-differential or differential.

13 Misclassification of exposure
True exposure:
                Cases   Controls
Exposure +        50       20
Exposure -        50       80

Reported exposure (90% sensitivity and 80% specificity in both cases and controls):
                Cases   Controls
Exposure +        55       34
Exposure -        45       66

Attenuation of the true association due to misclassification of exposure (true OR = 4.0; observed OR about 2.4).
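A short Python sketch that reproduces these numbers (hypothetical helper names; 90% sensitivity and 80% specificity applied equally, i.e., non-differentially, to cases and controls):

```python
def reported_counts(true_exposed, true_unexposed, sensitivity, specificity):
    """Apply imperfect exposure classification to the true counts of one group."""
    exposed = true_exposed * sensitivity + true_unexposed * (1 - specificity)
    unexposed = (true_exposed + true_unexposed) - exposed
    return exposed, unexposed

def odds_ratio(exp_cases, exp_controls, unexp_cases, unexp_controls):
    """Odds ratio for a 2x2 exposure-by-outcome table."""
    return (exp_cases * unexp_controls) / (exp_controls * unexp_cases)

# True exposure: cases 50/50, controls 20/80 -> true OR = 4.0
print(odds_ratio(50, 20, 50, 80))            # 4.0
a, c = reported_counts(50, 50, 0.9, 0.8)     # cases: 55 exposed, 45 unexposed
b, d = reported_counts(20, 80, 0.9, 0.8)     # controls: 34 exposed, 66 unexposed
print(odds_ratio(a, b, c, d))                # about 2.37: attenuated toward the null
```

Passing different sensitivity/specificity values for cases and controls reproduces the differential-misclassification tables on the later slides.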

14 Misclassification of the exposure
                Cases (outcome +)   Controls (outcome -)
Exposure +              a                    b
Exposure -              c                    d

Non-differential misclassification occurs when the degree of misclassification of exposure is independent of outcome/disease status.
It tends to bias the association toward the null.
It occurs when the sensitivity and specificity of the classification of exposure are the same for those with and without the outcome, but less than 100%.

15 Underestimation of a relative risk or odds ratio for…
[Figure: bias toward the null hypothesis, i.e., underestimation of a relative risk or odds ratio. Panel A: risk factor, with the observed value lying between the true value and the null. Panel B: protective factor, with the observed value lying between the true value and the null. Modified from Greenberg, Fig 10-4, Chapter 10.]

16 Misclassification of the exposure
True exposure:
                Cases   Controls
Exposure +        50       20
Exposure -        50       80

Reported exposure: cases, 96% sensitivity and 100% specificity; controls, 70% sensitivity and 100% specificity.
                Cases   Controls
Exposure +        48       14
Exposure -        52       86

17 Misclassification of the exposure
True exposure:
                Cases   Controls
Exposure +        50       20
Exposure -        50       80

Reported exposure: cases, 96% sensitivity and 100% specificity; controls, 70% sensitivity and 80% specificity.
                Cases   Controls
Exposure +        48       30
Exposure -        52       70

18 Misclassification of the exposure
                Cases (outcome +)   Controls (outcome -)
Exposure +              a                    b
Exposure -              c                    d

Differential misclassification occurs when the degree of misclassification differs between the groups being compared.
It may bias the association either toward or away from the null hypothesis.
It occurs when the sensitivity and specificity of the classification of exposure differ for those with and without the outcome.

19 Overestimation of a relative risk or odds ratio for…
[Figure: bias away from the null hypothesis, i.e., overestimation of a relative risk or odds ratio. Panel A: risk factor, with the observed value lying farther from the null than the true value. Panel B: protective factor, with the observed value lying farther from the null than the true value. Modified from Greenberg, Fig 10-4, Chapter 10.]

20 [Diagram: exposure assessment in a case-control study of hormone therapy (never / former / current). Cases: index subjects or proxies (~25%); Controls: index subjects; accuracy of reported exposure checked against a pharmacy database.]

21 Summary so far…
Misclassification of exposure is an important source of bias.
It is good to know something about the validity of the measurement used for exposure classification before the study begins.
It is almost impossible to avoid misclassification, but try to avoid differential misclassification.
If the study has already been conducted, develop analytic strategies that explore exposure misclassification as a possible explanation of the observed results (especially for a "primary" exposure of interest).
For example, in the folic acid study, women were interviewed about folic acid supplementation 4 to 24 months after giving birth, so examine whether results for interviews at <12 months differ from those at >12 months.

22 Measurement error in Continuous Variables
Physiologic measures (SBP, BMI)
Biomarkers (hormone levels, lipids)
Nutrients
Environmental exposures
Outcome measures (QOL, function)

23 Model of measurement error

24 Measurement Theory: Example (contd.)

25 Measurement theory

26 Validity of X: ρ_XT is assumed to range between 0 and 1; that is, for X to be considered a measure of T, X must be positively correlated with T.
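The equations on the measurement-theory slides (23 to 27) were not transcribed. As a minimal sketch, the classical test-theory model usually assumed in this setting is (notation mine):

X = T + E, with E(E) = 0 and Cov(T, E) = 0
ρ_XT = Cov(X, T) / (σ_X σ_T) = σ_T / σ_X = sqrt( σ_T² / (σ_T² + σ_E²) )

so the validity coefficient ρ_XT is the share of the observed standard deviation that reflects the true value T rather than the error E.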

27 Measurement error

28 Differential Measurement error
(Speaker note: ask the audience to think of an example of differential measurement error, e.g., smoking amount in a case-control study of cancer, or amount of physical activity.)

29 Differential Measurement error (continued)

30 Differential Bias

31 Non-differential Measurement Error
(Speaker note: talk briefly about how a validity study can help estimate the validity coefficient, or how inter-method reliability studies can be used to make some inferences about the validity coefficient.)
Fortunately, if the study is designed well, and especially if it is prospective, measurement error tends to be non-differential.
The effects of non-differential measurement error on the odds ratio: ORT is the true odds ratio for exposure versus reference level r, and ORO is the observable odds ratio for exposure versus reference level r.

32 Effects of non-differential measurement error
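The formula on this slide was not transcribed. As a sketch, the standard regression-dilution result commonly quoted for classical, non-differential error in a continuous exposure is:

β_observed ≈ λ · β_true, with λ = σ_T² / (σ_T² + σ_E²) = ρ_XT²
so that, per unit of the measured exposure, ORO ≈ ORT^(ρ_XT²)

which lies closer to the null than ORT whenever ρ_XT < 1, consistent with the qualitative message of the surrounding slides.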

33

34 See also the example in the Appendix of the paper by White, which shows the effect of differential and non-differential measurement error.

35 Summary so far…
Measurement error is an important source of bias.
It is good to know something about the validity of the exposure measurement before the study begins.
It is almost impossible to avoid misclassification, but try to avoid differential misclassification!
Non-differential measurement error will attenuate the results toward the null, resulting in loss of power for a fixed sample size.
This should be taken into account when estimating sample size during the planning stage, and when interpreting results and determining the internal validity of a study.
For example, in the folic acid study, women were interviewed about folic acid supplementation 4 to 24 months after giving birth, so examine whether results for interviews at <12 months differ from those at >12 months.

36 So why should we evaluate reliability and validity of measurements?
If it precedes the actual study, it tells us whether the instrument/method we are using is reliable and valid.
This information can also help us run sensitivity analyses or correct for the measurement error in the variables after the study has been completed.
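One classical example of such a correction (a sketch; the speaker may have other methods in mind, such as regression calibration) is Spearman's correction for attenuation, which uses reliability estimates to de-attenuate an observed correlation:

r_corrected = r_observed / sqrt( r_XX · r_YY )

where r_XX and r_YY are the reliability coefficients of the two variables obtained from a reliability study.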

37 Outline
Definitions: measurement error, reliability, validity
Why should we care about measurement error?
Effects of measurement error on study validity (categorical exposures)
Effects of measurement error on study validity (continuous exposures)
Measures (or indices) for reliability and validity

38 Choice of reliability and validity measures depends on the type of variable…

Type of variable   Reliability measure(s)                                        Validity measure(s)
Dichotomous        kappa                                                         sensitivity, specificity
Ordinal            weighted kappa, ICC*                                          misclassification matrix
Continuous         ICC*, Bland-Altman plots, Pearson correlation (see note)      Bland-Altman plots

*ICC: intraclass correlation coefficient
Note: in inter-method reliability studies, inferences about validity can be made from coefficients of reproducibility (such as the Pearson correlation).

39 Assessing Accuracy (Validity) of continuous measures
Bias: the difference between the mean value as measured and the mean of the true values.
So bias = (mean of measured values) - (mean of true values); standardized bias expresses this difference in standard-deviation units.
Bland-Altman plots

40 Bland and Altman plots
Take two measurements (with different methods or instruments) on the same subject.
For each subject, plot the difference between the two measures (y-axis) vs. the mean of the two measures (x-axis).
We expect the mean difference to be 0.
We expect 95% of the differences to be within 2 standard deviations (SD) of the mean difference.
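A minimal Python sketch of such a plot using numpy and matplotlib (illustrative only; limits of agreement drawn at ±1.96 SD, i.e., approximately 2 SD, of the differences):

```python
import numpy as np
import matplotlib.pyplot as plt

def bland_altman_plot(m1, m2):
    """Bland-Altman plot for paired measurements m1 and m2 on the same subjects."""
    m1, m2 = np.asarray(m1, float), np.asarray(m2, float)
    mean = (m1 + m2) / 2                 # x-axis: mean of the two measures
    diff = m1 - m2                       # y-axis: difference between the two measures
    md, sd = diff.mean(), diff.std(ddof=1)
    plt.scatter(mean, diff)
    plt.axhline(md, linestyle="-")               # mean difference (bias)
    plt.axhline(md + 1.96 * sd, linestyle="--")  # upper limit of agreement
    plt.axhline(md - 1.96 * sd, linestyle="--")  # lower limit of agreement
    plt.xlabel("Mean of the two measurements")
    plt.ylabel("Difference between the two measurements")
    plt.show()
```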

41 Yoong et al. BMC Medical Research Methodology 2013, 13:38

42 Yoong et al. BMC Medical Research Methodology 2013, 13:38

43 Suppose there is no gold standard; then how do we evaluate validity?
We make inferences from inter-method reliability studies!
Note: we will not be able to estimate bias when the two measures are based on different scales.

44 Inferences about validity from inter-method reliability studies
Suppose two different methods (instruments) are used to measure the same continuous exposure.
Let X1 denote the measure of interest (i.e., the one to be used to measure the exposure in the study) and X2 the comparison measure.
We have the reliability coefficient ρ_X1X2; however, we are actually interested in the validity coefficient ρ_X1T.
Example: Is self-reported physical activity valid? Compare it to the 4-week diary.

45 Relationship of Reliability to Validity
Errors of X1 and X2 are:                                 Relationship between reliability and validity   Usual application
1. Uncorrelated, and both measures equally precise        ρ_X1T = sqrt(ρ_X1X2)                            Intramethod study
2. Uncorrelated, X2 more precise than X1                  ρ_X1X2 < ρ_X1T < sqrt(ρ_X1X2)                   Intermethod study
3. Uncorrelated, X1 more precise than X2                  ρ_X1T > sqrt(ρ_X1X2)
4. Errors (positively) correlated, both equally precise   ρ_X1T < sqrt(ρ_X1X2)

Take-home message: in most situations the square root of the reliability coefficient can provide an upper limit to the validity coefficient.
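A sketch of where the take-home message comes from, assuming the classical model sketched after slide 26 and uncorrelated errors:

ρ_X1X2 = Cov(T+E1, T+E2) / (σ_X1 σ_X2) = σ_T² / (σ_X1 σ_X2) = ρ_X1T · ρ_X2T

So ρ_X1X2 ≤ ρ_X1T (because ρ_X2T ≤ 1), and if X2 is at least as precise as X1 (ρ_X2T ≥ ρ_X1T), then ρ_X1X2 ≥ ρ_X1T², i.e., ρ_X1T ≤ sqrt(ρ_X1X2).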

46 Inferences about validity from inter-method reliability studies
In our example, X1 is the measure of interest (i.e., the one to be used to measure the exposure in the study: self-reported activity) and X2 is the comparison measure (4-week diaries).
We have the reliability coefficient ρ_X1X2 = 0.79.
Errors in X1 and X2 are likely to be uncorrelated and X2 is more precise than X1, so 0.79 < ρ_X1T < 0.89 (where 0.89 is approximately sqrt(0.79)).
So, self-reported activity appears to be a valid measure.

47 Summary of Inferences from Reliability to Validity
Reliability studies are used to interpret the validity of X.
Reliability is necessary for validity (an instrument cannot be valid if it is not reproducible).
Reliability is not sufficient for validity: repetition of a test may yield the same result because both X1 and X2 measure some systematic error (i.e., the errors are correlated).
Reliability can only give an upper limit on validity. If the upper limit is low, then the instrument is not valid.
An estimate of reliability (or validity) depends on the sample (i.e., it may vary by age, gender, etc.).

48 Reliability of continuously distributed variables
Pearson product-moment correlation? Spearman rank correlation?

49 But…does correlation tell you about relationship or agreement?
Pearson’s Correlation coefficient=0.99 Is this measure reliable? 49

50 Reliability of continuously distributed variables
Other methods are generally preferred for intra- or inter-observer reliability when the same method/instrument is used:
Intraclass correlation coefficient (ICC): calculated using variance estimates obtained through an analysis of variance (ANOVA)
Bland-Altman plots
The correlation coefficient is useful in inter-method reliability studies to make inferences about validity (especially when the measurement scale differs for the two methods).

51 Intraclass Correlation Coefficient (ICC)
If within-person variance is very high, then measurement error can "overwhelm" the measurement of between-person differences. If between-person differences are obscured by measurement error, it becomes difficult to demonstrate a correlation between the imperfectly measured characteristic and any other variable of interest. The ICC is computed using ANOVA.

52 ANalysis Of Variance (ANOVA) in a reliability study
In a reliability study, we are not studying associations between predictors and an outcome, so we express the overall variability in the measurement as a function of between-subjects and within-subjects variability.
Consider a test-retest reliability study in which multiple measurements are taken for each subject. Then
SST = SSB + SSW
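In symbols (standard one-way ANOVA notation, not transcribed from the slides: x_ij is the j-th measurement on subject i, x̄_i the mean for subject i, x̄ the grand mean, and k_i the number of measurements on subject i):

SST = Σ_i Σ_j (x_ij - x̄)²
SSB = Σ_i k_i (x̄_i - x̄)²
SSW = Σ_i Σ_j (x_ij - x̄_i)²

and SST = SSB + SSW.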

53 Total Variation
[Figure: repeated measurements for Subjects 1, 2, and 3 illustrating total variation.]

54 Between-Subject Variation
Where k1 = number of measurements taken on subject 1.
[Figure: repeated measurements for Subjects 1, 2, and 3 illustrating between-subject variation.]

55 Within-Subject Variation (continued)
[Figure: repeated measurements for Subjects 1, 2, and 3 illustrating within-subject variation.]

56 One-way analysis of variance for computation of the ICC: test-retest study

Source of variance                 Sum of squares (SS)   Degrees of freedom (df)   Mean square (MS = SS/df)
Between subjects                   SSB                   n-1                       BMS
Within subjects (random error)     SSW                   n(k-1)                    WMS
Total                              SST                   nk-1

Here, each subject is a group; k = number of times the measure is repeated.

57 Interpretation of ICC
If within-person variance is very high, then measurement error can "overwhelm" the measurement of between-person differences. If between-person differences are obscured by measurement error, it becomes difficult to demonstrate a correlation between the imperfectly measured characteristic and any other variable of interest.

58 Interpretation of ICC
The ICC ranges between 0 and 1 and is a measure of reliability adjusted for chance agreement.
An ICC of 1 is obtained when there is perfect agreement, and in general a higher ICC is obtained when the within-subject error (i.e., random error) is small. Hence, ICC = 1 only when there is exact agreement between measures (i.e., Xi1 = Xi2 = ... = Xik for each subject).
Generally, ICCs greater than 0.7 are considered to indicate good reliability.

59 Two-way fixed-effects ANOVA for computation of the ICC (inter-rater reliability)

Source of variance                 Sum of squares (SS)   Degrees of freedom (df)   Mean square (MS = SS/df)
Between subjects                                         n-1                       SMS
Between measures (raters)                                k-1                       MMS
Within subjects (random error)     by subtraction        (n-1)(k-1)                EMS
Total                                                    nk-1

60 Measuring reliability of categorical variables
Percent agreement (concordance rate)
Kappa statistic

61 Reliability of categorical variables
The concordance rate is the proportion of observations on which the two observers agree.
Example: agreement matrix for radiologists reading mammography for breast cancer.

                      Radiologist B
Radiologist A      Yes (+)   No (-)   Total
  Yes (+)            a         b       a+b
  No (-)             c         d       c+d
  Total              a+c       b+d

Overall % agreement = (a+d) / (a+b+c+d)

62 Concordance rates: limitations
Considerable agreement could be expected by chance alone.
Percent agreement is misleading when the observations are not evenly distributed among the categories (i.e., when the proportion "abnormal" on a dichotomous test is substantially different from 50%).
So, what reliability measures should we use?

63 Kappa
Kappa is another measure of reliability.
Kappa measures the extent of agreement beyond that which would be expected by chance alone.
It can be used for binary variables or variables with more than two levels.

64 Cohen's Kappa (κ): some notation
Consider a reliability study in which n subjects have each been measured twice, where each measure is a nominal variable with k categories. It is assumed that the two measures are equally accurate.
κ is a measure of agreement that corrects for the agreement that would be expected by chance.

65 Cohen's Kappa
Table. Layout of data for computation of Cohen's κ and weighted κ.

                            Measure 2 (or Rater 2)
Measure 1 (or Rater 1)     1      2     ...    k     Total
  1                       p11    p12    ...   p1k     r1
  2                       p21    p22    ...   p2k     r2
  ...                     ...    ...    ...   ...     ...
  k                       pk1    pk2    ...   pkk     rk
  Total                    c1     c2    ...    ck      1

66 Cohen's Kappa
(Table layout as on the previous slide.)
The observed proportion of agreement, Po, is the sum of the proportions on the diagonal: Po = p11 + p22 + ... + pkk.

67 Cohen's Kappa
(Table layout as on the previous slide.)
The expected proportion of agreement (on the diagonal), Pe, is: Pe = r1·c1 + r2·c2 + ... + rk·ck, where ri and ci are the marginal proportions for the 1st and 2nd measure, respectively.

68 Kappa
Kappa is estimated by κ = (Po - Pe) / (1 - Pe), which is:
(Observed agreement (%) - Expected agreement (%)) / (100% - Expected agreement (%))
The denominator, 100% less the expected agreement, is the maximum possible nonchance agreement (100% less the contribution of chance); the ratio is the proportion of observations that can be attributed to reliable measurement (i.e., not due to chance).
So kappa is the ratio of the observed nonchance agreement to the maximum possible nonchance agreement.
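A minimal Python sketch of this computation (hypothetical function name; the example counts are taken from the radiologist agreement matrix on slide 71 below):

```python
import numpy as np

def cohens_kappa(table):
    """Cohen's kappa from a k x k agreement table of counts
    (rows = measure/rater 1, columns = measure/rater 2)."""
    p = np.asarray(table, float)
    p = p / p.sum()                               # convert counts to proportions
    po = np.trace(p)                              # observed agreement: sum of the diagonal
    pe = (p.sum(axis=1) * p.sum(axis=0)).sum()    # chance agreement from the margins
    return (po - pe) / (1 - pe)

# Mammography example: a=21, b=43, c=3, d=83
print(cohens_kappa([[21, 43], [3, 83]]))   # about 0.32, despite 69% raw agreement
```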

69 Pictorial of kappa statistic

70 Kappa
Kappa ranges from -1 (perfect disagreement) to +1 (perfect agreement).
A kappa of 0 means that observed agreement = expected agreement.

71 Reliability of categorical variables
Example 1: agreement matrix for radiologists reading mammography for breast cancer.

                      Radiologist B
Radiologist A      Yes (+)   No (-)   Total
  Yes (+)           21 (a)   43 (b)     64
  No (-)             3 (c)   83 (d)     86
  Total              24       126      150

Overall % agreement = (a+d) / (a+b+c+d) = (21+83)/150 = 0.69

