Reliability: Introduction
Reliability Session
1. Definitions & Basic Concepts of Reliability
2. Theoretical Approaches
3. Empirical Assessments of Reliability
4. Interpreting Coefficients
1. Conceptions of Reliability
Everyday complaints appeal to the idea of reliability: “This patient is often late!” “My car won’t start!” Does this make him reliable or unreliable?
(Roughly, a measured score falls within ± one standard error of measurement, S.E.M., of the underlying value.)
Classic view of the components of a measurement:
Measured Value = True Value + Systematic Error (Bias) + Random Error
The usefulness of a score depends on the ratio of its true-value component to any error variance that it contains.
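As a minimal illustration of this additive model (not from the original slides; the blood-pressure numbers are invented), the sketch below simulates repeated measurements with a fixed bias and random error:

```python
import numpy as np

rng = np.random.default_rng(42)

n = 1000
true_value = 120.0   # hypothetical "true" systolic blood pressure
bias = 5.0           # systematic error: the instrument reads 5 mm Hg high
error_sd = 8.0       # spread of the random error

# Measured Value = True Value + Systematic Error (Bias) + Random Error
measured = true_value + bias + rng.normal(0.0, error_sd, size=n)

print(f"Mean of measured values: {measured.mean():.1f}")       # ~125: the bias does not cancel out
print(f"SD of measured values:   {measured.std(ddof=1):.1f}")  # ~8: reflects the random error
```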
Several sources of variance in test scores: which should we include in estimating reliability?
–Variance between patients
–Variance due to different observers
–Fluctuations over time: day of week or time of day
–Changes in the measurement instrument (reagents degrade)
–Changes in definitions (e.g. revised diagnostic codes)
–Random errors (various sources)
Reliability = Subject Variability / (Subject Variability + Measurement Error)
or,
Reliability = Subject Variability / (Subject Variability + Observer Variability + Measurement Error)
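To make the ratio concrete, here is a small sketch with invented variance components (the numbers are illustrative only, not from the slides):

```python
# Reliability as a ratio of variance components (illustrative numbers)
subject_var  = 100.0   # true differences between patients
observer_var = 10.0    # systematic differences between observers
error_var    = 25.0    # residual measurement error

reliability_simple = subject_var / (subject_var + error_var)
reliability_full   = subject_var / (subject_var + observer_var + error_var)

print(f"Ignoring observer variability:  {reliability_simple:.2f}")  # 0.80
print(f"Including observer variability: {reliability_full:.2f}")    # 0.74
```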
Generalizability Theory
An ANOVA model that estimates each source of variability separately:
–Observer inconsistency over time
–Discrepancies between observers
–Changes in the subject being assessed over time
Quantifies these
Helps to show how to optimize design (and administration) of a test given these performance characteristics.
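As a rough sketch of the idea (not the full G-theory machinery), the snippet below estimates subject, observer, and residual variance components from a fully crossed patients × observers table, using the standard two-way random-effects ANOVA mean squares; the data are hypothetical:

```python
import numpy as np

def variance_components(scores):
    """Estimate subject, observer, and residual variance components from a
    fully crossed subjects x observers table (one score per cell), using the
    two-way random-effects ANOVA expected mean squares."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape                      # n subjects, k observers
    grand = scores.mean()
    subj_means = scores.mean(axis=1)
    obs_means = scores.mean(axis=0)

    ms_subjects  = k * np.sum((subj_means - grand) ** 2) / (n - 1)
    ms_observers = n * np.sum((obs_means - grand) ** 2) / (k - 1)
    resid = scores - subj_means[:, None] - obs_means[None, :] + grand
    ms_error = np.sum(resid ** 2) / ((n - 1) * (k - 1))

    return {
        "subjects":  (ms_subjects - ms_error) / k,   # variance between patients
        "observers": (ms_observers - ms_error) / n,  # discrepancies between observers
        "error":     ms_error,                       # residual / random error
    }

# Hypothetical data: 5 patients each rated by 3 observers (observer C reads high)
data = [[4, 5, 7],
        [2, 3, 5],
        [6, 6, 8],
        [3, 4, 6],
        [5, 5, 7]]
print(variance_components(data))
```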
2. Classical Test Theory
Distinguishes random error from systematic error (bias): random error produces unreliability; bias produces invalidity.
Classical test theory assumes:
–Errors are independent of the score (i.e. similar errors occur at all levels of the variable being measured)
–The mean of the errors is zero (some increase and some decrease the score; these errors balance out)
Hence, random errors tend to cancel out if enough observations are made, so a large sample can give you an accurate estimate of the population mean even if the measure is unreliable. Useful!
From the above: Observed score = True score + Error (additive: no interaction between score and error).
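A small simulation sketch of these assumptions (invented numbers, not from the slides): even with a very unreliable measure, the mean of a large sample of observed scores sits close to the true population mean, because the random errors cancel.

```python
import numpy as np

rng = np.random.default_rng(0)

n_patients = 10_000
true_scores = rng.normal(50, 10, size=n_patients)   # population of true scores
errors = rng.normal(0, 15, size=n_patients)         # large random errors, mean zero
observed = true_scores + errors                     # Observed score = True score + Error

reliability = true_scores.var() / observed.var()    # share of observed variance that is "true"
print(f"Reliability of a single measurement: {reliability:.2f}")  # roughly 0.3: a noisy measure
print(f"True population mean:    {true_scores.mean():.2f}")
print(f"Mean of observed scores: {observed.mean():.2f}")          # very close: random errors cancel
```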
Reliability versus Sensitivity of a Measurement: the metaphor of the combs
A fine-grained scale may produce more error variance; a coarse measure will appear more stable but is less sensitive.
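One way to see the trade-off (a simulation sketch with invented numbers, not from the slides): score the same people twice with the same error process, once on a fine-grained scale and once collapsed into three broad categories.

```python
import numpy as np

rng = np.random.default_rng(1)

true   = rng.normal(50, 10, size=1000)         # underlying true scores
test   = true + rng.normal(0, 5, size=1000)    # fine-grained measurement, occasion 1
retest = true + rng.normal(0, 5, size=1000)    # same error process, occasion 2

# Coarse version: collapse scores into three broad bands ("low" / "moderate" / "high")
coarse_test   = np.digitize(test,   [45, 55])
coarse_retest = np.digitize(retest, [45, 55])

print(f"Exact agreement, fine scale:   {np.mean(np.round(test) == np.round(retest)):.2f}")
print(f"Exact agreement, coarse scale: {np.mean(coarse_test == coarse_retest):.2f}")
# The coarse scale "agrees" with itself far more often on retest,
# yet it cannot distinguish a score of 46 from a score of 54.
```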
Reliability and Precision
Some sciences use ‘precision’ to refer to the close grouping of results that, in the metaphor of the shooting target, we called ‘reliability’. You may also see ‘accuracy’ used in place of our ‘validity’. These terms are common in laboratory disciplines, and you should be aware of the contrasting usage. In part this difference arises because in the social sciences, measurements need to distinguish between three concepts: reliability and validity, plus the level of detail a measure is capable of revealing – the number of significant digits it provides. Thus, rating pain as “moderate” is imprecise and yet could be done reliably, and it may also be valid (as far as we can tell!). By contrast, mechanical measurements in laboratory sciences can be sufficiently consistent that they have little need for our concept of reliability.
3. Consistency over time
One way to test reliability is to repeat the measurement: if you get the same score, it’s reliable. But this runs into the problem that a real change in health may occur over time, giving a falsely negative impression of reliability. Alternatively, people may remember their replies, perhaps falsely inflating reliability.
To avoid this you could correlate different, but equivalent, versions of the test. One approach is to divide the whole test into two halves and correlate them (“split-half reliability”). Formulas by Kuder & Richardson, and Cronbach’s alpha, generalize this.
This leads to the idea of internal consistency: a reliable test is one with items that are very similar. Test this using item-total correlations. The more items, the lower the error; the Spearman-Brown formula estimates how reliability rises as the number of items increases.
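A minimal sketch of these two calculations, using an invented 5-respondent, 4-item data set (the numbers are illustrative only):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n respondents x k items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def spearman_brown(reliability, length_factor):
    """Predicted reliability if the test is lengthened by `length_factor`."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Hypothetical 5-person, 4-item data set
scores = np.array([[3, 4, 3, 4],
                   [2, 2, 3, 2],
                   [4, 5, 4, 5],
                   [1, 2, 1, 2],
                   [3, 3, 4, 3]])

alpha = cronbach_alpha(scores)
print(f"Cronbach's alpha:                 {alpha:.2f}")                    # 0.95
print(f"Predicted alpha if items doubled: {spearman_brown(alpha, 2):.2f}")  # 0.97
```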
4. Statistics to use: Intra-class correlation vs. Pearson r
When two sets of ratings agree exactly, ICC = 1.0 and r = 1.0; when there is a systematic error (bias), r can remain 1.0 while the ICC < 1.0.
Message: a re-test correlation will ignore a systematic change in scores over time. An ICC measures agreement, so it will penalize retest reliability when a shift occurs. Which do you prefer?
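A sketch of the contrast, using hypothetical blood-pressure readings; the ICC here is the two-way, single-measure, absolute-agreement form, computed directly from the ANOVA mean squares:

```python
import numpy as np

def icc_agreement(scores):
    """Two-way, single-measure ICC for absolute agreement (ICC(A,1)),
    computed from the classic ANOVA mean squares of a subjects x raters table."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand = scores.mean()
    subj_means = scores.mean(axis=1)
    rater_means = scores.mean(axis=0)
    ms_rows = k * np.sum((subj_means - grand) ** 2) / (n - 1)
    ms_cols = n * np.sum((rater_means - grand) ** 2) / (k - 1)
    resid = scores - subj_means[:, None] - rater_means[None, :] + grand
    ms_err = np.sum(resid ** 2) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + (k / n) * (ms_cols - ms_err))

# Physician B reads every BP exactly 5 mm Hg higher than physician A
a = np.array([120.0, 130.0, 125.0, 140.0, 110.0])
b = a + 5.0

r = np.corrcoef(a, b)[0, 1]
icc = icc_agreement(np.column_stack([a, b]))
print(f"Pearson r: {r:.2f}")    # 1.00 -- blind to the systematic shift
print(f"ICC:       {icc:.2f}")  # 0.91 -- penalized by the 5 mm Hg bias
```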
Self-test fun time! What is the reliability when:
–Every student is rated “above average”?
–Physician A rates every BP as 5 mm Hg higher than physician B?
–The measure is applied to a different population?
–The observers change?
–The patients do, in reality, improve over time?