
1 Class 4 Psychometric Characteristics Part I: Variability, Reliability, Interpretability
October 20, 2005
Anita L. Stewart, Institute for Health & Aging, University of California, San Francisco

2 Overview of Class 4
• Basic psychometric characteristics
  – Variability
  – Reliability
  – Interpretability
  – Validity and bias
  – Responsiveness and sensitivity to change

3 Overview
• This class:
  – Variability
  – Reliability
  – Interpretability
• Next class (class 5):
  – Validity and bias
  – Responsiveness and sensitivity to change

4 Overview
• Basic psychometric characteristics
  – Variability
  – Reliability
  – Interpretability

5 Variability
• Good variability
  – All (or nearly all) scale levels are represented
  – Distribution approximates a bell-shaped normal
• Variability is a function of the sample
  – Need to understand variability of the measure of interest in a sample similar to the one you are studying
• Review criteria
  – Adequate variability in a range that is relevant to your study

6 Common Indicators of Variability
• Range of scores (possible, observed)
• Mean, median, mode
• Standard deviation (standard error)
• Skewness
• % at floor (lowest score)
• % at ceiling (highest score)
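These indicators can all be computed from a vector of scores. A minimal Python sketch (not part of the original slides; the function name and example data are illustrative only):

```python
import numpy as np
from scipy import stats

def variability_summary(scores, possible_min, possible_max):
    """Summarize the variability indicators listed above for one measure."""
    scores = np.asarray(scores, dtype=float)
    values, counts = np.unique(scores, return_counts=True)
    return {
        "possible range": (possible_min, possible_max),
        "observed range": (scores.min(), scores.max()),
        "mean": scores.mean(),
        "median": np.median(scores),
        "mode": values[counts.argmax()],        # most frequent score
        "SD": scores.std(ddof=1),
        "skewness": stats.skew(scores),
        "% at floor": 100 * np.mean(scores == possible_min),
        "% at ceiling": 100 * np.mean(scores == possible_max),
    }

# Example with made-up scores on a 0-30 possible range
example_scores = [0, 2, 3, 3, 5, 7, 8, 10, 12, 15, 18, 23]
print(variability_summary(example_scores, possible_min=0, possible_max=30))
```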

7 Range of Scores
• Especially important for multi-item measures
• Possible and observed
• Example of the difference:
  – CES-D possible range is 0-30
  – Wong et al. study of mothers of young children: observed range was 0-23
    » missing the entire high end of the distribution (none had high levels of depression)

8 Mean, Median, Mode
• Mean - average
• Median - midpoint
• Mode - most frequent score
• In normally distributed measures, these are all the same
• In non-normal distributions, they will vary

9 Mean and Standard Deviation
• Most information on variability comes from the mean and standard deviation
  – Can envision how scores are distributed over the possible range

10 Normal Distributions (or Approximately Normal)
• The mean and SD tell the entire story of the distribution
• The mean ± 1 SD covers about 68% of the scores

11 Examples from Sarkisian (2002): Expectations Regarding Aging
                      M (SD)         Mean ± 1 SD
Cognitive function    35.4 (21.7)    13.7 – 57.1
Pain                  24.7 (20.1)     4.6 – 44.8
Appearance            32.4 (27.8)     4.6 – 60.0
Fatigue               23.9 (18.3)     5.6 – 42.2
Scores 0-100; higher scores indicate better expectations

12 Skewness
• Positive skew - scores bunched at the low end, long tail to the right
• Negative skew - opposite pattern
• Coefficient ranges from negative infinity to positive infinity
  – the closer to zero, the more normal
• Can test whether the skewness coefficient is significantly different from zero
  – the test thus depends on sample size
• Skewness coefficients beyond ±2.0 are cause for concern
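A brief sketch of how the skewness coefficient and the test against zero might be obtained with SciPy (the simulated 0-100 scores are an assumption for illustration, not lecture data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
scores = 100 * rng.beta(5, 2, size=300)   # simulated 0-100 scores bunched at the high end

print(stats.skew(scores))       # negative coefficient: long tail to the left
print(stats.skewtest(scores))   # z-test that skewness differs from zero;
                                # with large n, even mild skew becomes "significant"
```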

13 Skewed Distributions
• Mean and SD are not as useful
  – The mean ± 1 SD often extends beyond the maximum or minimum possible score

14 Ceiling and Floor Effects: Similar to Skewness Information
• Ceiling effects: a substantial number of people get the highest possible score
• Floor effects: the opposite
• Not very meaningful for continuous scales
  – there will usually be very few at either end
• More helpful for single-item measures or coarse scales with only a few levels

15–16 [Chart] Single-item example: "... to what extent did health problems limit you in everyday physical activities (such as walking and climbing stairs)?" The response distribution shows a ceiling effect: 49% answered "not limited at all" and therefore cannot show improvement.

17 Advantages of Multi-Item Scales Revisited
• Using multi-item scales minimizes the likelihood of ceiling/floor effects
• When items are skewed, the multi-item scale "normalizes" the skew
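A small simulation sketch of this point (invented data, not the MOS example that follows; the cut points and effect sizes are arbitrary assumptions). Each coarse 1-5 item has a sizable ceiling, while the 5-25 summated scale has many more levels and far fewer people at its maximum:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
n, k = 1000, 5
trait = rng.normal(size=(n, 1))                        # shared underlying construct
raw = 3.8 + 0.8 * trait + rng.normal(0, 0.7, (n, k))   # five coarse items
items = np.clip(np.round(raw), 1, 5)                   # 1-5 response scale
scale = items.sum(axis=1)                              # summated 5-25 scale

print([f"{100 * np.mean(items[:, j] == 5):.0f}%" for j in range(k)])  # % at each item's ceiling
print(f"{100 * np.mean(scale == 25):.0f}%")                           # % at the scale's ceiling (much smaller)
print(stats.skew(items, axis=0), stats.skew(scale))                   # scale skew is usually milder
```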

18 Percent with Highest (Best) Score: MOS 5-Item Mental Health Index
• Items (6-point scale, "all of the time" to "none of the time"):
  – Very nervous person - 34% none of the time
  – Felt calm and peaceful - 4% all of the time
  – Felt downhearted and blue - 33% none of the time
  – Happy person - 10% all of the time
  – So down in the dumps nothing could cheer you up - 63% none of the time
• Summated 5-item scale (0-100 scale)
  – Only 5% had the highest score
Stewart A. et al., MOS book, 1992

19 SF-36 Variability Information in Patients with Chronic Conditions (N=3,445)
Scale (0-100)   Physical function   Role-physical   Mental health   Vitality (energy)
Mean            80                  75              71              54
SD              27                  41              21              22
Skewness        -.99                -.26            -.83            -.24
% floor         <1                  2               4               <1
% ceiling       19                  37              4               <1
McHorney C et al. Med Care. 1994;32:40-66.

20 Ceiling and Floor Effects: Expectations About Aging (Sarkisian)
                        % min (floor)   % max (ceiling)
Sexual function         33              3
Pain                    25              1
Urinary incontinence    53              3
Appearance              30              6
Cognitive function       6              2
Fatigue                 17              1

21 Ceiling and Floor Effects: Expectations About Aging (Sarkisian)
                        # items   % min (floor)   % max (ceiling)
Sexual function         2         33              3
Pain                    2         25              1
Urinary incontinence    1         53              3
Appearance              1         30              6
Cognitive function      4          6              2
Fatigue                 4         17              1

22 Reasons for Poor Variability
• Low variability in the construct being measured in that "sample" (true low variation)
• Items not adequately tapping the construct
  – If only one item, especially hard
• Items not detecting important differences in the construct at one or the other end of the continuum
• Solutions: add items

23 Overview
• Basic psychometric characteristics
  – Variability
  – Reliability
  – Interpretability

24 Reliability
• Extent to which an observed score is free of random error
• Population-specific; reliability increases with:
  – sample size
  – variability in scores (dispersion)
  – a person's level on the scale

25 Components of an Individual's Observed Item Score
Observed item score = true score + error

26 Components of Variability in Item Scores of a Group of Individuals
Observed score variance (total variance) = true score variance + error variance
(Total variance is the variance of the observed item scores across the group)

27 Reliability Depends on True Score Variance
• Reliability is a group-level statistic
• Reliability = 1 - (error variance / total variance)
• Equivalently, reliability is the proportion of variance due to true score:
  Reliability = true score variance / total variance

28 Reliability Depends on True Score Variance
Reliability = (total variance - error variance) / total variance
            = proportion of variance due to true score
A reliability of .70 means 30% of the variance in the observed score is explained by error

29 Reliability Depends on True Score Variance
Reliability = (total variance - error variance) / total variance
            = proportion of variance due to true score
.70 = (100% - 30%) / 100%
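A worked numeric version of the variance decomposition on slides 27-29 (the variance values are hypothetical, chosen only to reproduce the .70 example):

```python
# Hypothetical variance components for one scale in one sample
total_variance = 100.0
error_variance = 30.0
true_score_variance = total_variance - error_variance

reliability = true_score_variance / total_variance   # = 1 - error_variance / total_variance
print(reliability)   # 0.70 -> 30% of observed-score variance is random error
```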

30 Importance of Reliability
• Necessary for validity (but not sufficient)
  – Low reliability attenuates correlations with other variables (harder to detect true correlations among variables)
  – May conclude that two variables are not related when they are
• Greater reliability, greater power
  – Thus the more reliable your scales, the smaller the sample size you need to detect an association

31 Reliability Coefficient
• Typically ranges from .00 to 1.00
• Higher scores indicate better reliability

32 How Do You Know if a Scale or Measure Has Adequate Reliability?
• Adequacy of reliability is judged according to standard criteria
  – Criteria depend on the type of coefficient

33 Types of Reliability Tests
• Internal-consistency
• Test-retest
• Inter-rater
• Intra-rater

34 Internal Consistency Reliability: Cronbach's Alpha
• Requires multiple items supposedly measuring the same construct to calculate
• Extent to which all items measure the same construct (same latent variable)

35 Internal-Consistency Reliability
• For multi-item scales
• Cronbach's alpha
  – for ordinal scales
• Kuder-Richardson 20 (KR-20)
  – for dichotomous items
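A minimal sketch of Cronbach's alpha computed from a respondents-by-items matrix of complete cases (the helper name and toy data are assumptions, not the lecture's materials):

```python
import numpy as np

def cronbach_alpha(item_scores):
    """item_scores: 2-D array, rows = respondents (complete cases), columns = items."""
    x = np.asarray(item_scores, dtype=float)
    k = x.shape[1]                              # number of items
    item_variances = x.var(axis=0, ddof=1)      # variance of each item
    total_variance = x.sum(axis=1).var(ddof=1)  # variance of the summated scale
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Toy data: 6 respondents answering 3 ordinal items
items = [[1, 2, 2],
         [2, 3, 3],
         [3, 3, 4],
         [4, 5, 4],
         [2, 2, 3],
         [5, 4, 5]]
print(round(cronbach_alpha(items), 2))   # high alpha for this small, highly consistent toy set
```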

36 Minimum Standards for Internal Consistency Reliability
• For group comparisons (e.g., regression, correlational analyses)
  – .70 or above is the minimum (Nunnally, 1978)
  – .80 is optimal
  – above .90 is unnecessary
• For individual assessment (e.g., treatment decisions)
  – .90 or above (.95) is preferred (Nunnally, 1978)

37 Internal-Consistency Reliability Can Be Spurious
• Based on only those who answered all questions in the measure
  – If a lot of people are having trouble with the items and skip some, they are not included in the test of reliability

38 Internal-Consistency Reliability is a Function of the Number of Items in the Scale
• Increases with the number of items
• Very large scales (20 or more items) can have high reliability without other good scaling properties

39 Example: 20-Item Beck Depression Inventory (BDI)
• BDI 1961 version (symptoms "today")
  – reliability .88
  – 2 items correlated <.30 with the other items in the scale
• BDI 1978 version (past week)
  – reliability .86
  – 3 items correlated <.30 with the other items in the scale
Beck AT et al. J Clin Psychol. 1984;40:1365-1367

40 Test-Retest Reliability
• Repeat the assessment on individuals who are not expected to change
• Time between assessments should be:
  – Short enough so no change occurs
  – Long enough so subjects don't recall their first response
• The coefficient is a correlation between the two measurements
  – The type of correlation depends on the scale's properties
• For single-item measures, this is the only way to test reliability

41 Appropriate Test-Retest Coefficients by Type of Measure
• Continuous scales (ratio or interval scales, multi-item Likert scales):
  – Pearson
• Ordinal or non-normally distributed scales:
  – Spearman
  – Kendall's tau
• Dichotomous (categorical) measures:
  – Phi
  – Kappa
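A sketch of these coefficient choices in SciPy using simulated test-retest data (the data and error level are assumptions for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
time1 = rng.normal(50, 10, size=100)
time2 = time1 + rng.normal(0, 5, size=100)   # retest = same true score plus random error

print(stats.pearsonr(time1, time2))    # continuous, roughly interval scales
print(stats.spearmanr(time1, time2))   # ordinal or non-normally distributed scales
print(stats.kendalltau(time1, time2))  # ordinal alternative to Spearman
# For dichotomous ratings, phi is simply the Pearson r of two 0/1 variables;
# kappa is sketched after slide 48.
```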

42 Minimum Standards for Test-Retest Reliability
• Significance of a test-retest correlation has NOTHING to do with the adequacy of the reliability
• Criteria: similar to those for internal consistency
  – >.70 is desirable
  – >.80 is optimal

43 Observer or Rater Reliability
• Inter-rater reliability (across two or more raters)
  – Consistency (correlation) between two or more observers rating the same subjects (at one point in time)
• Intra-rater reliability (within one rater)
  – A test-retest within one observer
  – Correlation among repeated values obtained by the same observer (over time)

44 Observer or Rater Reliability
• Sometimes Pearson correlations are used - correlate one observer with another
  – Assesses association only
• .65 to .95 are typical correlations
• >.85 is considered acceptable
McDowell and Newell

45 Association vs. Agreement When Correlating Two Times or Ratings
• Association is the degree to which one score linearly predicts the other score
• Agreement is the extent to which the same score is obtained on the second measurement (retest, second observer)
• Can have high correlation and poor agreement
  – If the second score is consistently higher for all subjects, you can obtain a high correlation
  – Need a second test of mean differences

46 Example of Association and Agreement
• Scores at time 2 are exactly 3 points above scores at time 1
  – Correlation (association) would be perfect (r = 1.0)
  – Agreement is not perfect (the scores never agree; there is a difference of 3 between each pair of scores at time 1 and time 2)
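A tiny numeric sketch of slide 46 (the scores are invented): a constant 3-point shift leaves the correlation perfect while agreement is absent.

```python
import numpy as np
from scipy import stats

time1 = np.array([10.0, 14.0, 18.0, 22.0, 26.0])
time2 = time1 + 3.0                          # every retest score is exactly 3 points higher

r, _ = stats.pearsonr(time1, time2)          # 1.0: perfect linear association
exact_agreement = np.mean(time1 == time2)    # 0.0: no score agrees with its retest
mean_shift = np.mean(time2 - time1)          # 3.0: the systematic difference r cannot detect
print(r, exact_agreement, mean_shift)
```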

47 Types of Reliability Coefficients for Agreement Among Raters
• Intraclass correlation
  – Kappa

48 Intraclass Correlation Coefficient for Testing Inter-rater Reliability (Kappa)
• The coefficient indicates the level of agreement of two or more judges, exceeding that which would be expected by chance
• Appropriate for dichotomous (categorical) scales and ordinal scales
• Several forms of kappa:
  – e.g., Cohen's kappa is for 2 judges and a dichotomous scale
• Sensitive to the number of observations and the distribution of the data
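A minimal sketch of Cohen's kappa for two judges rating a dichotomous item (the ratings are invented; scikit-learn's cohen_kappa_score gives the same result):

```python
import numpy as np

def cohens_kappa(rater1, rater2):
    """Chance-corrected agreement between two raters on a categorical rating."""
    r1, r2 = np.asarray(rater1), np.asarray(rater2)
    observed = np.mean(r1 == r2)                        # raw proportion of agreement
    chance = sum(np.mean(r1 == c) * np.mean(r2 == c)    # agreement expected by chance
                 for c in np.union1d(r1, r2))
    return (observed - chance) / (1 - chance)

rater1 = [1, 1, 0, 1, 0, 0, 1, 1]
rater2 = [1, 0, 0, 1, 0, 1, 1, 1]
print(round(cohens_kappa(rater1, rater2), 2))   # about .47: "moderate" by the table on slide 49
```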

49 Interpreting Kappa: Level of Reliability
  < 0.00       Poor
  .00 - .20    Slight
  .21 - .40    Fair
  .41 - .60    Moderate
  .61 - .80    Substantial
  .81 - 1.00   Almost perfect
.60 or higher is acceptable (Landis, 1977)

50 Reliable Scale?
• NO! There is no such thing as a "reliable" scale
• We accumulate "evidence" of reliability in a variety of populations in which the scale has been tested

51 Reliability Often Poorer in Lower SES Groups
More random error due to:
• Reading problems
• Difficulty understanding complex questions
• Unfamiliarity with questionnaires and surveys

52 Overview
• Basic psychometric characteristics
  – Variability
  – Reliability
  – Interpretability

53 Interpretability of Scale Scores: What Does a Score Mean?
Meaning of scores:
• What are the endpoints?
• Direction of scoring - what does a high score mean?
• Compared to norms - is the score average, low, or high compared to norms?
• Single items: more easily interpretable
• Multi-item scales: no inherent meaning to scores

54 Endpoints
• What are the minimum and maximum possible scores?
  – To enable interpretation of the mean score
• Endpoints of summated scales depend on the number of items and the number of response choices
  – 5 items, 4 response choices = 5 - 20
  – 3 items, 5 response choices = 3 - 15
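The endpoint arithmetic above as a one-line check (this assumes response choices are coded 1 through the number of choices, which is how the slide's examples work out):

```python
def summated_range(n_items, n_choices):
    """Possible range of a summated scale whose response choices are coded 1..n_choices."""
    return n_items * 1, n_items * n_choices

print(summated_range(5, 4))   # (5, 20)
print(summated_range(3, 5))   # (3, 15)
```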

55 Direction of Scoring
• What does a high score mean?
• Where in the range does the mean score lie?
  – Toward the top? The bottom?
  – In the middle?

56 Descriptive Statistics for 3,193 Women
            M (SD)        Min    Max
Age         46.2 (2.7)    44.0   52.9
Activity     7.7 (1.8)     3.0   14.0
Stress       8.6 (2.9)     4.0   19.0
Avis NE et al. Med Care. 2003;41:1262-1276

57 Sample Results: Mean Scores in a Sample of Older Adults
                       Mean
Physical functioning   45.0
Sleep                  28.1
Disability             35.7

58 Example of Table Labeling Scores: Making It Easier to Interpret
                       Mean*
Physical functioning   45.0
Sleep                  28.1
Disability             35.7
* All scores 0-100

59 Example of Table Labeling Scores: Making It Easier to Interpret
                           Mean*
Physical functioning (+)   45.0
Sleep (-)                  28.1
Disability (-)             35.7
* All scores 0-100
(+) indicates a higher score is better health
(-) indicates a lower score is better health

60 Solutions
• Can include (+) or (-) in the label
  – Or label the scale so that a higher score means more of the "label"
• Can easily put the score range next to the label if ranges differ within one table

61 Mean Has to Be Interpreted Within the Possible Range
Parents' harsh discipline practices*      M      SD
Interviewers' ratings of mother           2.55   .74
Husbands' reports of wife                 5.32   3.30
* Note: a high score indicates harsher practices

62 Mean Has to Be Interpreted Within the Possible Range
Parents' harsh discipline practices*      M      SD
Interviewers' ratings of mother (1-5)     2.55   .74
Husbands' reports of wife (1-7)           5.32   3.30
* Note: a high score indicates harsher practices

63 Mean Has to Be Interpreted Within the Possible Range
Parents' harsh discipline practices*      M      SD
Interviewers' ratings of mother (1-5)     2.55   .74
Husbands' reports of wife (1-7)           5.32   3.30
[Chart: each mean plotted on its possible range - 2.55 on the interviewer's 1-5 scale, 5.32 on the husband's 1-7 scale]
* Note: a high score indicates harsher practices

64 Mean Has to Be Interpreted Within the Possible Range: Adding SD Information
Parents' harsh discipline practices*      M      SD
Interviewers' ratings of mother (1-5)     2.55   .74
Husbands' reports of wife (1-7)           5.32   3.30
[Chart: each mean plotted on its possible range as above, now with SD bars around 2.55 and 5.32]
* Note: a high score indicates harsher practices

65 Transforming a Summated Scale to a 0-100 Scale
• Works with any ordinal or summated scale
• Transforms it so 0 is the lowest possible score and 100 is the highest possible score
• Eases interpretation across numerous scales
Transformed score = 100 x (observed score - minimum possible score) / (maximum possible score - minimum possible score)
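The same transformation as a short Python function (a sketch; the function name and example score are illustrative):

```python
def to_0_100(observed, possible_min, possible_max):
    """Rescale a score so its possible range runs from 0 (lowest) to 100 (highest)."""
    return 100 * (observed - possible_min) / (possible_max - possible_min)

# A summated score of 12 on a 5-20 scale becomes 46.7 on the 0-100 metric
print(round(to_0_100(12, possible_min=5, possible_max=20), 1))
```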

66 Next Class (Class 5)
• Validity and bias
• Responsiveness and sensitivity to change

67 Homework
• Complete rows 13-26 in the matrix for your two measures (including reliability)


Download ppt "1 Class 4 Psychometric Characteristics Part I: Variability, Reliability, Interpretability October 20, 2005 Anita L. Stewart Institute for Health & Aging."

Similar presentations


Ads by Google