1 Class 4: Basic Psychometric Characteristics: Variability, Reliability, Interpretability
October 15, 2009. Anita L. Stewart, Institute for Health & Aging, University of California, San Francisco

2 Overview of Class 4
• Concepts of error, sources of error, and bias in measures
• Indicators of variability and reasons for poor variability
• Indicators of reliability
• Interpretability of scores

3 Components of an Individual's Observed Item Score (simplistic view)
Observed item score = true score + error

4 Components of an Individual's Observed Item Score
Observed item score = true score + error
True score: "score that would be obtained over repeated testings" (Nunnally, 1994, p. 211)

5 Random versus Systematic Error
Observed item score = true score + error (random + systematic)

6 Random versus Systematic Error
Observed item score = true score + error (random + systematic)
Random error is relevant to reliability; systematic error is relevant to validity.

7 Components of Variability in Item Scores of a Group of Individuals
Total variance (of all observed item scores) = true score variance + error variance

8 Components of Variability in Item Scores of a Group of Individuals
Total variance (of all observed item scores) = true score variance + (random) error variance

9 Combining Items into Multi-Item Scales
• When items are combined into a summated scale, random error to some extent "cancels out"
– Error variance is reduced as the number of items increases
– Reducing random error increases the proportion of variance due to "true score"
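A minimal simulation sketch of this point (all numbers are invented for illustration, not taken from the lecture): every item shares a common true score plus independent random error, and averaging more items pushes the observed scale score closer to the true score.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people = 10_000
true_score = rng.normal(50, 10, n_people)            # latent "true score"

for n_items in (1, 5, 10, 20):
    # each item = true score + independent random error (error SD chosen arbitrarily)
    errors = rng.normal(0, 10, (n_people, n_items))
    scale = (true_score[:, None] + errors).mean(axis=1)
    # squared correlation with the true score ~ proportion of true-score variance
    reliability = np.corrcoef(true_score, scale)[0, 1] ** 2
    print(f"{n_items:2d} item(s): reliability ~ {reliability:.2f}")
```

With these assumed variances, reliability rises from about .50 for a single item toward .95 for a 20-item scale, which is the "canceling out" the slide describes.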

10 Sources of Error
• Subjects
• Observers or interviewers
• Measure or instrument

11 Example: Measuring Weight of Children
• The observed score is a linear combination of many sources of variation for an individual

12 Measuring Weight in Pounds (Without Shoes) of One Child
Observed weight = true weight (80 lbs) + weight of clothes + amount of water drunk in the past 30 min + scale miscalibration + imprecision of the person weighing the child

13 Measuring Weight in Pounds (Without Shoes) of One Child
Observed weight (82.05 lbs) = true weight (80 lbs) + amount of water in past 30 min (+0.25 lb) + weight of clothes (+0.70 lb) + scale miscalibration (+0.1 lb) + imprecise person weighing the child (+1 lb)
82.05 = 80 + 0.25 + 0.70 + 0.1 + 1

14 Sources of Error in Measuring Weight of Children
• Weight of clothes
– Subject source of random error
• Scale is miscalibrated
– Instrument source of systematic error
• Person weighing the child is not precise
– Observer source of random error

15 Measuring Depressive Symptoms (past 4 weeks) in an Asian or Latino Man
Observed depression score = "true" depression (16) + difficulty choosing a number on the 1-6 response scale + unwillingness to tell the interviewer + poor memory of feelings + measure missing 2 culturally-bound symptoms

16 Measuring Depressive Symptoms (past 4 weeks) in an Asian or Latino Man
Observed depression score (12) = "true" depression (16) + hard to choose a number on the 1-6 response scale (+1) + unwilling to tell the interviewer (-2) + poor memory of feelings (-1) + measure misses 2 culturally-bound symptoms (-2)
12 = 16 + 1 - 2 - 1 - 2

17 Sources of Error in Measuring Depression
• Hard to choose one number on the 1-6 response scale
– Subject source of random error
• Unwilling to tell the interviewer; poor memory of feelings
– Subject sources of systematic error (underreport true depression)
• Measure misses culturally-bound symptoms
– Instrument source of systematic error (underestimates true depression)

18 Four Types of Memory Errors (from cognitive psychology)
• Encoding
– Information inadequately stored in memory
• Storage
– Memory erodes over time
• Retrieval
– Some events/feelings are harder to recall
• Reconstruction
– Errors made when filling in missing pieces
R. Tourangeau, Chap. 3, in A. A. Stone et al. (eds), The Science of Self-Report, London: Lawrence Erlbaum, 2000

19 Memory and Time
• Autobiographical memory – memory of things in time and space
• Events are not encoded with their calendar dates
– Thus time is a poor retrieval method
• Numerous errors in remembering "when" and "how often" something occurred within a particular time frame
N. Bradburn, Chap. 4, The Science of Self-Report

20 Memory and Emotion
• We tend to remember:
– positive more than negative experiences
– emotionally intense more than neutral experiences
– non-threatening events more than threatening, sensitive events
Kihlstrom et al., Chap. 6, The Science of Self-Report

21 Overview
• Concepts of error
• Basic psychometric characteristics
– Variability
– Reliability
– Interpretability

22 Variability
• Good variability
– All (or nearly all) scale levels are represented
– The distribution approximates a bell-shaped normal curve
• Variability is a function of the sample
– Need to understand the variability of a measure in a sample similar to the one you are studying
• Review criterion
– Adequate variability on the latent variable that is relevant to your study

23 Indicators of Variability
• Range of scores
• Mean, median, mode
• Standard deviation (or standard error)
• Interquartile range
• Skewness statistic
• % at floor (lowest possible score)
• % at ceiling (highest possible score)
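A sketch of how each indicator could be computed for a toy set of scale scores (the scores and the 0-30 possible range are invented for illustration):

```python
import numpy as np
from collections import Counter
from scipy.stats import skew

scores = np.array([0, 0, 2, 3, 3, 5, 6, 8, 9, 11, 14, 18, 23])  # toy scores on a 0-30 scale
lowest_possible, highest_possible = 0, 30

print("observed range:", scores.min(), "to", scores.max())
print("mean / median / mode:", scores.mean().round(1), np.median(scores),
      Counter(scores.tolist()).most_common(1)[0][0])
print("standard deviation:", scores.std(ddof=1).round(1))
print("interquartile range:", np.percentile(scores, 75) - np.percentile(scores, 25))
print("skewness:", round(skew(scores), 2))
print("% at floor:", round(np.mean(scores == lowest_possible) * 100, 1))
print("% at ceiling:", round(np.mean(scores == highest_possible) * 100, 1))
```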

24 Range of Scores: Possible and Observed
• Especially important for multi-item measures
• Example:
– CES-D possible range is 0-30
– Wong et al. study of mothers of young children: observed range was 0-23
» Missing the entire high end of the distribution (none had high levels of depression)

25 Mean, Median, Mode
• Mean – the average
• Median – the midpoint
• Mode – the most frequent score
• In normally distributed measures, these are all the same
• In non-normal distributions, they will differ

26 Mean and Standard Deviation
• Most information on variability comes from the mean and standard deviation
– Can envision how the measure is distributed over the possible range
– Mean ± 1 SD covers about 68% of the scores in a normal distribution

27 Interquartile Range (IQR)
• The difference between the 3rd and 1st quartiles: IQR = Quartile 3 - Quartile 1
• This range contains the middle 50% of the distribution
– 25% of the sample is above and 25% is below this range

28 Quartiles
Divide the distribution into 4 parts with 25% of the sample in each part (quartiles)
• Quartile 1 – the scale score at the boundary of the lowest 25% of the distribution
• Quartile 2 – the score that divides the distribution in half (same as the median)
• Quartile 3 – the score at the boundary of the highest 25% (25% of the sample scores above this point)

29 Set of Scores on 12 People
Person: 1  2  3  4  5  6  7  8  9  10  11  12
Score:  2  3  8  1  7  4  4  3  2   7   5   3
Re-arranged in numeric order of score:
Person: 4  9  1  8  2  12  7  6  11  10  5  3
Score:  1  2  2  3  3   3  4  4   5   7  7  8

30 Example of Quartiles: Set of Scores on 12 People
Sorted scores: 1 2 2 3 3 3 4 4 5 7 7 8
Q1 = 2.5 (boundary of the lowest 25%, i.e., the lowest 3 people)
Q2 = 3.5 (the median: 50% below, 50% above)
Q3 = 6 (boundary of the highest 25%, i.e., the highest 3 people)

31 Example of Quartiles: Set of Scores on 12 People
Sorted scores: 1 2 2 3 3 3 4 4 5 7 7 8 (Q1 = 2.5, Q2 = 3.5, Q3 = 6)
Interquartile range = Quartile 3 - Quartile 1 = 6 - 2.5 = 3.5
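The same worked example in code. The slide computes Q1 and Q3 as the medians of the lower and upper halves of the sorted scores; numpy's default percentile interpolation uses a different convention and would give 2.75 and 5.5 instead, so it is worth knowing which rule a program applies.

```python
import numpy as np

scores = np.sort(np.array([2, 3, 8, 1, 7, 4, 4, 3, 2, 7, 5, 3]))   # the 12 scores
q2 = np.median(scores)                                              # 3.5

# quartiles as medians of the lower and upper halves (the slide's convention)
lower_half, upper_half = scores[:6], scores[6:]
q1, q3 = np.median(lower_half), np.median(upper_half)               # 2.5 and 6.0

print("Q1, Q2, Q3:", q1, q2, q3)
print("interquartile range:", q3 - q1)                              # 3.5
```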

32 Skewness
• Positive skew – scores bunched at the low end, long tail to the right
• Negative skew – the opposite pattern
• The skewness coefficient ranges from -infinity to +infinity
– The closer to zero, the more normal the distribution
• Coefficients beyond ±2.0 are cause for concern

33 Ceiling and Floor Effects: Similar to Skewness Information
• Ceiling effect: a substantial number of people get the highest possible score
• Floor effect: the opposite (a substantial number get the lowest possible score)
• Especially helpful for single-item measures or coarse scales with only a few levels

34 Example item: "…to what extent did health problems limit you in everyday physical activities (such as walking and climbing stairs)?"
49% answered "not limited at all" – a ceiling effect: these respondents cannot show improvement on the item

35 SF-36 Variability Information in Patients with Chronic Conditions (N=3,445)
All scales scored 0-100; a higher score is better. (McHorney C et al. Med Care. 1994;32:40-66)
             Physical function   Role-physical   Mental health   Vitality (energy)
             (10 items)          (4 items)       (5 items)       (5 items)
Mean (SD)    80 (27)             75 (41)         71 (21)         54 (22)
Skewness     -.99                -.26            -.83            -.24
% floor      <1                  24              <1              <1
% ceiling    19                  37               4              <1

36 Evidence of Floor and Ceiling Effects in One SF-36 Scale
Same data as the previous slide; the Role-physical scale stands out, with 24% of patients at the floor and 37% at the ceiling.
(McHorney C et al. Med Care. 1994;32:40-66; all scales 0-100, higher is better)

37 Reasons for Poor Variability
• Low variability in the construct being measured in that sample (true low variation)
• Items not adequately tapping the construct
– Especially hard to avoid with only one item
• Items not detecting variation at one end of the scale
• What to do:
– If developing a measure, add items
– If selecting a measure, find another one

38 Advantages of Multi-Item Scales Revisited
• Using multi-item scales minimizes the likelihood of ceiling/floor effects
• Even if individual items are skewed, the multi-item scale "normalizes" the skew

39 Percent with "Best" Score on 5 Items in the MOS MHI-5
6-level response scale, "all of the time" to "none of the time" (Stewart A. et al., Measuring Functioning and Well-Being, 1992)
                                                                      % with best score
Very nervous person (none of the time)                                34
Felt calm and peaceful (all of the time)                               4
Felt downhearted and blue (none of the time)                          33
Happy person (all of the time)                                        10
So down in the dumps nothing could cheer you up (none of the time)    63

41 Percent with "Best" Score on 5 Items in the MOS MHI-5 (continued)
Same item-level results as the previous slide: although individual items had as many as 63% of respondents at the best score, only 5% had the highest possible score on the 5-item scale.
(Stewart A. et al., Measuring Functioning and Well-Being, 1992)

42 Overview
• Concepts of error
• Basic psychometric characteristics
– Variability
– Reliability
– Interpretability

43 Reliability
• The extent to which an observed score is free of random error
– The measure produces the same score each time it is administered (all else being equal)
• Population-specific – reliability is affected by:
– sample size
– variability in scores (dispersion)
– a person's level on the scale

44 Back to Components of Variability in Item Scores of a Group of Individuals
Total variance (of all observed item scores) = true score variance + error variance

45 Reliability Depends on True Score Variance
• Reliability is a group-level statistic
• Reliability = 1 - (error variance / total variance)
– Equivalently, the proportion of total variance that is due to true score: true score variance / total variance

46 Reliability Depends on True Score Variance
A reliability of .70 means that 30% of the variance in observed scores is due to error:
Reliability = (total variance - error variance) / total variance
.70 = (1.0 - .30) / 1.0
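A two-line numeric restatement of the slide's example (the .70/.30 split is the slide's; total variance is standardized to 1.0):

```python
true_var, error_var = 0.70, 0.30         # variance components from the slide
total_var = true_var + error_var         # 1.0
reliability = 1 - error_var / total_var  # equals true_var / total_var = 0.70
print(reliability)
```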

47 Reliability Coefficient
• Typically ranges from .00 to 1.00
• Higher values indicate better reliability

48 Importance of Reliability
• Necessary for validity (but not sufficient)
– Low reliability (high measurement error) attenuates correlations with other variables
– You may conclude that two variables are not related when they actually are
• Greater reliability = greater power
– The more reliable your scales, the smaller the sample size you need to detect an association

49 A "Reliable Scale"?
• NO! There is no such thing as a "reliable" scale
• We accumulate evidence of reliability in the variety of populations in which the scale has been tested

50 How Do You Know if a Scale or Measure Has Adequate Reliability?
• Adequacy of reliability is judged according to standard criteria
– The criteria depend on the type of coefficient

51 Types of Reliability Tests
• Internal consistency
• Test-retest
• Inter-rater
• Intra-rater

52 Internal Consistency Reliability: Cronbach's Alpha
• Requires multiple items supposedly measuring the same construct
• Indicates the extent to which all items measure the same construct (the same latent variable)

53 Internal Consistency Reliability
• For multi-item scales
• Cronbach's alpha
– for scales using ordinal items (e.g., 1-5)
• Kuder-Richardson 20 (KR-20)
– for scales using dichotomous items
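A self-contained sketch of Cronbach's alpha from its standard formula, alpha = k/(k-1) × (1 - sum of item variances / variance of the summed scale); the response data below are made up. Applied to 0/1 items, the same computation gives KR-20.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a respondents-by-items matrix of ordinal scores."""
    k = items.shape[1]
    sum_item_var = items.var(axis=0, ddof=1).sum()   # sum of the k item variances
    scale_var = items.sum(axis=1).var(ddof=1)        # variance of the summed scale
    return (k / (k - 1)) * (1 - sum_item_var / scale_var)

# hypothetical 1-5 ratings from 6 respondents on 4 items
ratings = np.array([[4, 5, 4, 5],
                    [2, 2, 3, 2],
                    [3, 3, 3, 4],
                    [5, 4, 5, 5],
                    [1, 2, 1, 2],
                    [4, 4, 3, 4]])
print(round(cronbach_alpha(ratings), 2))
```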

54 Minimum Standards for Internal Consistency Reliability
• For group comparisons (e.g., regression, correlational analyses)
– .70 or above is the minimum (Nunnally, 1978)
– .80 is optimal; above .90 is unnecessary
• For individual assessment (e.g., treatment decisions)
– .90 or above (ideally .95) is preferred (Nunnally, 1978)

55 Internal Consistency Reliability Can Be Spurious
• It is based only on those who answered all of the items in the measure
– If many people have trouble with the items and skip some, they are not included in the reliability estimate
• Important to compare the sample size used in the reliability calculation to the total sample

56 Internal Consistency Reliability Is a Function of the Number of Items in the Scale
• Reliability increases with the number of items
• Very long scales (20 or more items) can have high reliability without other good psychometric properties
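One way to see this is the standard Spearman-Brown prophecy formula (not shown on the slide), which projects the reliability of a scale lengthened by a given factor with comparable items:

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Projected reliability of a scale lengthened by `length_factor` comparable items."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# a 5-item scale with reliability .60, doubled (10 items) and quadrupled (20 items)
for factor in (2, 4):
    print(f"x{factor}: {spearman_brown(0.60, factor):.2f}")   # .75 and .86
```

Even a scale with mediocre items can reach a high alpha simply by being long, which is why item-level statistics still need to be checked.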

57 Example: 20-Item Beck Depression Inventory (BDI)
• BDI 1978 version (asks about the past week)
– Internal consistency reliability = .86
Beck AT et al. J Clin Psychol. 1984;40:1365-1367

58 Example: 20-Item Beck Depression Inventory (BDI), continued
• BDI 1978 version (asks about the past week)
– Internal consistency reliability = .86
– BUT: 3 items correlated < .30 with the other items in the scale
Beck AT et al. J Clin Psychol. 1984;40:1365-1367

59 Reliability Varies by Level on the Measure
• Reliability can be poorer for those scoring at one end of the scale
• Example: number of visits to the doctor in the past 12 months
– More reliable for those with fewer visits

60 Test-Retest Reliability
• Repeat the assessment on individuals who are not expected to change
• Time between assessments should be:
– short enough that no real change occurs
– long enough that subjects don't recall their first response
• The only reliability test available for single-item measures
• Coefficient: the correlation between the two measurements

61 Appropriate Test-Retest Coefficients by Type of Scale
• Continuous scales (ratio or interval scales, multi-item Likert scales):
– Pearson correlation
• Ordinal or non-normally distributed scales:
– Spearman rank correlation or Kendall's tau
• Dichotomous (categorical) measures:
– Phi or kappa
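A sketch of the first two coefficient types using scipy (the test and retest scores for 8 people are invented); phi and kappa for dichotomous measures are illustrated with the kappa example further below.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

time1 = np.array([10, 12, 15, 11, 18, 20, 14, 16])   # hypothetical scores at test
time2 = np.array([11, 12, 14, 10, 19, 21, 15, 15])   # the same people at retest

r, _ = pearsonr(time1, time2)        # continuous / interval scales
rho, _ = spearmanr(time1, time2)     # ordinal or non-normally distributed scales
tau, _ = kendalltau(time1, time2)    # ordinal alternative

print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")
```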

62 Minimum Standards for Test-Retest Reliability
• The magnitude of a test-retest correlation is what matters, not its statistical significance
• Criteria are similar to those for internal consistency:
– > .70 is desirable
– > .80 is optimal

63 Observer or Rater Reliability
• Inter-rater reliability (across two or more raters)
– Consistency (correlation) between two or more observers rating the same subjects at one point in time
• Intra-rater reliability (within one rater)
– Consistency within one observer
– Correlation among repeated values obtained by the same observer over time

64 Observer or Rater Reliability
• Sometimes Pearson correlations are used
– Scores on a group of individuals obtained by one observer are correlated with scores obtained by another observer
– This assesses association only
• .65 to .95 are typical correlations
• > .85 is considered acceptable
McDowell I et al. Measuring Health, 2006, p. 45

65 Association vs. Agreement When Correlating Scores from Two Times or Two Raters
• Association: the degree to which the scores of one rater linearly predict the scores of the 2nd rater
• Agreement: the extent to which the same score is obtained on the 2nd measurement (retest or 2nd rater)
• You can have a high correlation and poor agreement
– If the second score is consistently higher for all subjects, the correlation can still be high
– A second test of mean differences is needed

66 Hypothetical Scores on 4 Subjects by 2 Observers

67 Example of Association and Agreement
• Scores from observer 1 are exactly 2 points above scores from observer 2
– The correlation (association) would be perfect (r = 1.0)
– Agreement is poor: the two observers never agree on a score (a difference of 2 on every subject)
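A sketch of the observer example with made-up scores: a constant 2-point offset yields a perfect correlation but zero exact agreement.

```python
import numpy as np

observer1 = np.array([6, 8, 4, 9])      # hypothetical ratings of 4 subjects
observer2 = observer1 - 2               # always exactly 2 points lower

correlation = np.corrcoef(observer1, observer2)[0, 1]
mean_difference = (observer1 - observer2).mean()
exact_agreement = np.mean(observer1 == observer2)

print("association (r):", correlation)        # 1.0 -- perfect linear prediction
print("mean difference:", mean_difference)    # 2.0 -- a systematic offset
print("exact agreement:", exact_agreement)    # 0.0 -- the raters never match
```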

68 Intraclass Correlation Coefficient (Kappa) for Testing Inter-Rater Reliability
• The coefficient indicates the level of agreement between two or more judges beyond what would be expected by chance
• Appropriate for dichotomous (categorical) and ordinal scales
• Several forms of kappa exist
– e.g., Cohen's kappa: 2 judges, dichotomous scale
• Sensitive to the number of observations and the distribution of the data

69 Interpreting the Magnitude of Kappa: Level of Reliability
< 0.00        Poor
0.00 - 0.20   Slight
0.21 - 0.40   Fair
0.41 - 0.60   Moderate
0.61 - 0.80   Substantial
0.81 - 1.00   Almost perfect
.60 or higher is acceptable (Landis, 1977)
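A minimal Cohen's kappa for two raters and a dichotomous code (the ratings are invented), computed as (observed agreement - chance agreement) / (1 - chance agreement):

```python
import numpy as np

def cohen_kappa(rater1, rater2) -> float:
    """Cohen's kappa: chance-corrected agreement between two raters."""
    r1, r2 = np.asarray(rater1), np.asarray(rater2)
    categories = np.union1d(r1, r2)
    p_observed = np.mean(r1 == r2)                                        # raw agreement
    p_chance = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)

rater1 = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]   # made-up yes/no codes on 10 subjects
rater2 = [1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
print(round(cohen_kappa(rater1, rater2), 2))   # ~0.58, "moderate" by the table above
```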

70 Reliability Is Often Poorer in Lower-SES or Low-Literacy Groups
More random error due to:
• Reading problems
• Difficulty understanding complex questions
• Unfamiliarity with questionnaires and surveys

71 Advantages of Multi-Item Scales Revisited
• Using multi-item scales improves reliability
• Random error is "canceled out" across multiple items

72 What Makes a Measure Reliable?
• Preventing measurement error is easier than assessing its effects
• The measure
– Clear items, appropriate response choices, etc.
• The format
– Make the instrument easily understood
• The method of administration
– Train raters to do their job
– Adhere to standard administration procedures

73 Overview
• Concepts of error
• Basic psychometric characteristics
– Variability
– Reliability
– Interpretability

74 Interpretability: What Does a Score Mean?
• What are the endpoints?
• What does a high score mean? (direction of scoring)
• Compared to norms, is the score low or high?
• Single items are more easily interpretable; multi-item scale scores have no inherent meaning

75 Endpoints
• What are the minimum and maximum possible scores?
– They enable interpretation of the mean score
• When item scores are summed, the endpoints depend on the number of items and the number of response choices
– 5 items with 4 response choices: possible range 5 to 20
– 3 items with 5 response choices: possible range 3 to 15

76 Compare Results to Norms
• Comparing your means to published norms helps interpret the mean of your sample
• The SF-36 has numerous norms, e.g., for the general population
» by age group, gender, and chronic disease

77 SF-36 in MOS Patients versus Population Norms
                            Physical function   Role-physical   Mental health   Vitality (energy)
MOS patients: Mean (SD)     80 (27)             75 (41)         71 (21)         54 (22)
Norms: general population   84 (23)             81 (34)         75 (18)         61 (21)
Norms: age 75+              53 (30)             45 (42)         74 (20)         50 (24)
JE Ware et al., SF-36 Health Survey Manual and Interpretation Guide, The Health Institute, 1993

78 Direction of Scoring
• What does a high score mean?
• Where in the possible range does the mean score lie?
– Toward the top? The bottom?
– In the middle?

79 Descriptive Statistics for ~3,000 Women
            M (SD)        Min    Max
Age         46.2 (2.7)    42.0   52.9
Activity     7.7 (1.8)     3.0   14.0
Stress       8.6 (2.9)     4.0   19.0
Med Care. 2003;41:1262-1276

80 Descriptive Statistics for ~3,000 Women (continued)
            M (SD)        Min    Max
Age         46.2 (2.7)    42.0   52.9
Activity     7.7 (1.8)     3.0   14.0
Stress       8.6 (2.9)     4.0   19.0
Activity: no measure mentioned. Stress: Perceived Stress Scale (Cohen, 1983).
Med Care. 2003;41:1262-1276

81 Perceived Stress Scale (Cohen, 1983): Hard to Find
• Available in JSTOR
– Can print only one page at a time
• Searched the article online
– Could not find scoring information other than: reverse 7 of the 14 items and sum them
» Possible score range of 0-56
– Could not find the response choices

82 Another Example: Mean Scores in a Sample of Older Adults
                        Mean
Physical functioning    45.0
Sleep problems          28.1
Disability              35.7

83 Making It Easier to Interpret
                        Mean*
Physical functioning    45.0
Sleep problems          28.1
Disability              35.7
* All scores 0-100

84 Making It Easier to Interpret
                            Mean*
Physical functioning (+)    45.0
Sleep problems (-)          28.1
Disability (-)              35.7
* All scores 0-100; (+) indicates a higher score is better health, (-) indicates a lower score is better health

85 Confusion Introduced by Labels
• SF-36 Bodily Pain scale
– A higher score means no pain or no limitations due to pain
– Rationale: so all 8 subscales are scored in the same direction
• Social Adjustment Scale (Weissman)
• Functional Status Index (Jette)

86 Mean Has to Be Interpreted Within the Possible Range
Parents' harsh discipline practices*          M      SD
Interviewers' ratings of the mother           2.55   0.74
Husbands' reports of the wife                 5.32   3.30
* A high score indicates harsher practices

87 Mean Has to Be Interpreted Within the Possible Range (Add the Range)
Parents' harsh discipline practices*                 M      SD
Interviewers' ratings of the mother (range 1-5)      2.55   0.74
Husbands' reports of the wife (range 1-7)            5.32   3.30
* A high score indicates harsher practices

88 Mean Has to Be Interpreted Within the Possible Range
Parents' harsh discipline practices*                 M      SD
Interviewers' ratings of the mother (range 1-5)      2.55   0.74
Husbands' reports of the wife (range 1-7)            5.32   3.30
Interviewer scale:  1  2  3  4  5           (mean = 2.55)
Husband scale:      1  2  3  4  5  6  7     (mean = 5.32)
* A high score indicates harsher practices

89 Transforming a Summated Scale to a 0-100 Scale
• Works with any ordinal or summated scale
• Transforms it so that 0 is the lowest possible score and 100 is the highest possible score
• Eases interpretation across numerous scales
Transformed score = 100 x (observed score - minimum possible score) / (maximum possible score - minimum possible score)
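The transformation as a one-line function, applied to the 5-item, 4-response-choice example from slide 75 (raw range 5-20):

```python
def to_0_100(raw_score: float, min_possible: float, max_possible: float) -> float:
    """Linearly rescale a summated raw score so the possible range becomes 0-100."""
    return 100 * (raw_score - min_possible) / (max_possible - min_possible)

for raw in (5, 12, 20):                                  # lowest, mid-range, and highest raw scores
    print(raw, "->", round(to_0_100(raw, 5, 20), 1))     # 0.0, 46.7, 100.0
```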

90 Homework
• Complete rows 13-19 on the matrix for both measures
– Interpretability, nature of the samples on which it has been tested, variability and central tendency, reliability

