1 Class 4 Psychometric Characteristics Part I: Variability, Reliability, Interpretability October 20, 2005 Anita L. Stewart Institute for Health & Aging.

Presentation transcript:

1 Class 4 Psychometric Characteristics Part I: Variability, Reliability, Interpretability October 20, 2005 Anita L. Stewart Institute for Health & Aging University of California, San Francisco

2 Overview of Class 4 u Basic psychometric characteristics –Variability –Reliability –Interpretability –Validity and bias –Responsiveness and sensitivity to change

3 Overview u This class: –Variability –Reliability –Interpretability u Next class (class 5) –Validity and bias –Responsiveness and sensitivity to change

4 Overview u Basic psychometric characteristics –Variability –Reliability –Interpretability

5 Variability u Good variability –All (or nearly all) scale levels are represented –Distribution approximates bell-shaped normal u Variability is a function of the sample –Need to understand variability of measure of interest in sample similar to one you are studying u Review criteria –Adequate variability in a range that is relevant to your study

6 Common Indicators of Variability u Range of scores (possible, observed) u Mean, median, mode u Standard deviation (standard error) u Skewness u % at floor (lowest score) u % at ceiling (highest score)

7 Range of Scores u Especially important for multi-item measures u Possible and observed u Example of difference: –CES-D possible range is 0-30 –Wong et al. study of mothers of young children: observed range was 0-23 »missing entire high end of the distribution (none had high levels of depression)

8 Mean, Median, Mode u Mean - average u Median - midpoint u Mode - most frequent score u In normally distributed measures, these are all the same u In non-normal distributions, they will vary

9 Mean and Standard Deviation u Most information on variability is from mean and standard deviation –Can envision how it is distributed on the possible range

10 Normal Distributions (Or Approximately Normal) u Mean, SD tell the entire story of the distribution u 1 SD on each side of the mean (± 1 SD) = about 68% of the scores

11 Examples from Sarkisian (2002): Expectations Regarding Aging
                      M (SD)        M ± 1 SD
Cognitive function    35.4 (21.7)   13.7 – 57.1
Pain                  24.7 (20.1)   4.6 – 44.8
Appearance            32.4 (27.8)   4.6 – 60.0
Fatigue               23.9 (18.3)   5.6 – 42.2
Higher scores indicate better expectations
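
The M ± 1 SD column above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the helper name `one_sd_band` is hypothetical, and the 0-100 possible range used for clipping is an assumption that the slide does not state.

```python
# Means and SDs as reported on the slide (Sarkisian, 2002).
scales = {
    "Cognitive function": (35.4, 21.7),
    "Pain": (24.7, 20.1),
    "Appearance": (32.4, 27.8),
    "Fatigue": (23.9, 18.3),
}


def one_sd_band(mean, sd, lo=0.0, hi=100.0):
    """Return (lower, upper) for mean +/- 1 SD, clipped to the possible range.

    The default 0-100 range is an assumption, not taken from the slide.
    """
    lower = max(lo, mean - sd)
    upper = min(hi, mean + sd)
    return (round(lower, 1), round(upper, 1))


for name, (m, sd) in scales.items():
    print(name, one_sd_band(m, sd))
```

Computed from the rounded values shown, the Appearance upper bound comes out at 60.2 rather than the 60.0 on the slide. A band that starts close to the minimum possible score, as for Pain and Appearance, hints that the distribution is bunched toward the low end.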

12 Skewness u Positive skew - scores bunched at low end, long tail to the right u Negative skew - opposite pattern u Coefficient ranges from - infinity to + infinity –the closer to zero, the more normal u Test whether skewness coefficient is significantly different from zero –thus depends on sample size u Coefficients beyond ± 2.0 are cause for concern
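
As a rough illustration of the skewness coefficient and why its significance test depends on sample size, here is a minimal sketch. It uses the moment-based (Fisher-Pearson) coefficient and the large-sample approximation sqrt(6/n) for its standard error; both are assumptions, since the slide does not say which formula was intended, and the scores are made up.

```python
import math


def skewness(xs):
    """Moment-based skewness: third central moment / (second central moment)^1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def approx_se_skewness(n):
    """Large-sample approximation to the standard error of skewness."""
    return math.sqrt(6.0 / n)


# Hypothetical scores bunched at the low end with a long right tail.
scores = [0, 0, 1, 1, 1, 2, 2, 3, 5, 9]
g1 = skewness(scores)
z = g1 / approx_se_skewness(len(scores))
print(f"skewness = {g1:.2f}, z = {z:.2f}")
```

Because the standard error shrinks as n grows, the same coefficient can be non-significant in a small sample and significant in a large one.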

13 Skewed Distributions u Mean and SD are not as useful –SD often goes out beyond the maximum or minimum possible

14 Ceiling and Floor Effects: Similar to Skewness Information u Ceiling effects: substantial number of people get highest possible score u Floor effects: opposite u Not very meaningful for continuous scales –there will usually be very few at either end u More helpful for single-item measures or coarse scales with only a few levels
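
Floor and ceiling percentages are simple counts. A minimal sketch, with a hypothetical function name and made-up responses on a 1-5 item:

```python
def floor_ceiling(scores, min_possible, max_possible):
    """Return (% at floor, % at ceiling) for a list of observed scores."""
    n = len(scores)
    pct_floor = 100.0 * sum(s == min_possible for s in scores) / n
    pct_ceiling = 100.0 * sum(s == max_possible for s in scores) / n
    return pct_floor, pct_ceiling


# Hypothetical responses to a single item scored 1 (worst) to 5 (best).
responses = [1, 1, 3, 5, 5, 5, 5, 2, 4, 5]
print(floor_ceiling(responses, 1, 5))
```

The minimum and maximum passed in should be the possible endpoints of the scale, not the observed ones; otherwise a sample that never reaches the top of the scale would still appear to have a ceiling.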

15 … to what extent did health problems limit you in everyday physical activities (such as walking and climbing stairs)? 49% not limited at all (can’t improve)

16 … to what extent did health problems limit you in everyday physical activities (such as walking and climbing stairs)? 49% not limited at all (can’t improve)

17 Advantages of multi-item scales revisited u Using multi-item scales minimizes likelihood of ceiling/floor effects u When items are skewed, multi-item scale “normalizes” the skew

18 Percent with Highest (Best) Score: MOS 5-Item Mental Health Index u Items (6 pt scale - all of the time to none of the time): –Very nervous person - 34% none of the time –Felt calm and peaceful - 4% all of the time –Felt downhearted and blue - 33% none of the time –Happy person - 10% all of the time –So down in the dumps nothing could cheer you up – 63% none of the time u Summated 5-item scale (0-100 scale) –Only 5% had highest score Stewart A. et al., MOS book, 1992

19 SF-36 Variability Information in Patients with Chronic Conditions (N=3,445)
Columns: Physical function, Role-physical, Mental health, Vitality (energy)
Rows: Mean, SD, Skewness, % floor, % ceiling
% floor: <1, 24, <1
% ceiling: 19, 37, 4, <1
McHorney C et al. Med Care. 1994;32:40-66.

20 Ceiling and floor effects: Expectations About Aging (Sarkisian)
                        % min (floor)   % max (ceiling)
Sexual function         33              3
Pain                    25              1
Urinary incontinence    53              3
Appearance              30              6
Cognitive function      6               2
Fatigue                 17              1

21 Ceiling and floor effects: Expectations About Aging (Sarkisian)
                        # items   % min (floor)   % max (ceiling)
Sexual function
Pain
Urinary incontinence
Appearance
Cognitive function      4         6               2
Fatigue                 4         17              1

22 Reasons for Poor Variability u Low variability in construct being measured in that “sample” (true low variation) u Items not adequately tapping construct –If only one item, especially hard u Items not detecting important differences in construct at one or the other end of the continuum u Solutions: add items

23 Overview u Basic psychometric characteristics –Variability –Reliability –Interpretability

24 Reliability u Extent to which an observed score is free of random error u Population-specific; reliability increases with: –sample size –variability in scores (dispersion) –a person’s level on the scale

25 Components of an Individual’s Observed Item Score Observed item score = true score + error

26 Components of Variability in Item Scores of a Group of Individuals Observed score variance (total variance) = true score variance + error variance (Variation is the sum of all observed item scores)

27 Reliability Depends on True Score Variance u Reliability is a group-level statistic u Reliability: –Reliability = 1 – (error variance / total variance) –Reliability is the proportion of variance due to true score: true score variance / total variance

28 Reliability Depends on True Score Variance u Reliability of .70 means 30% of the variance in the observed score is explained by error u Reliability = (total variance – error variance) / total variance = proportion of variance due to true score

29 Reliability Depends on True Score Variance u Reliability = proportion of variance due to true score = (total variance – error variance) / total variance u .70 = 100% – 30%
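
The variance decomposition on slides 26 to 29 can be checked numerically. This is a minimal sketch with a hypothetical function name; in practice true score and error variance are not observed directly, so the inputs here are purely illustrative.

```python
def reliability(true_var, error_var):
    """Reliability = true score variance / total variance,
    where total variance = true score variance + error variance."""
    total_var = true_var + error_var
    return true_var / total_var


# Illustrative values: 70 units of true score variance, 30 of error variance.
rel = reliability(70.0, 30.0)
one_minus_error_share = 1 - 30.0 / (70.0 + 30.0)
print(rel, one_minus_error_share)
```

Both expressions give the same number, which is the point of the slides: a reliability of .70 is the same statement as "30% of observed variance is error".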

30 Importance of Reliability u Necessary for validity (but not sufficient) –Low reliability attenuates correlations with other variables (harder to detect true correlations among variables) –May conclude that two variables are not related when they are u Greater reliability, greater power –Thus the more reliable your scales, the smaller sample size you need to detect an association
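
To make the attenuation point concrete, the sketch below applies the classical attenuation formula from test theory (Spearman), which the slide does not state explicitly: the expected observed correlation equals the true correlation multiplied by the square root of the product of the two reliabilities. The function name and the numbers are illustrative assumptions.

```python
import math


def attenuated_r(true_r, reliability_x, reliability_y):
    """Expected observed correlation given the true correlation and
    the reliabilities of the two measures (classical attenuation formula)."""
    return true_r * math.sqrt(reliability_x * reliability_y)


# A true correlation of .50 measured with two scales of reliability .70 each.
print(round(attenuated_r(0.50, 0.70, 0.70), 2))
# The same true correlation with two scales of reliability .90 each.
print(round(attenuated_r(0.50, 0.90, 0.90), 2))
```

A smaller observed correlation needs a larger sample to detect, which is why more reliable scales translate into greater power.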

31 Reliability Coefficient u Typically ranges from 0 to 1 u Higher scores indicate better reliability

32 How Do You Know if a Scale or Measure Has Adequate Reliability? u Adequacy of reliability judged according to standard criteria –Criteria depend on type of coefficient

33 Types of Reliability Tests u Internal-consistency u Test-retest u Inter-rater u Intra-rater

34 Internal Consistency Reliability: Cronbach’s Alpha u Requires multiple items supposedly measuring same construct to calculate u Extent to which all items measure the same construct (same latent variable)

35 Internal-Consistency Reliability u For multi-item scales u Cronbach’s alpha –ordinal scales u Kuder Richardson 20 (KR-20) –for dichotomous items
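
For readers who want to see how Cronbach's alpha is computed, here is a minimal sketch using the standard formula alpha = k/(k-1) x (1 - sum of item variances / variance of the summed score). The function name and the data are hypothetical, and the function assumes complete cases only (every respondent answered every item).

```python
from statistics import variance


def cronbach_alpha(rows):
    """rows: one list of item scores per respondent, complete cases only."""
    k = len(rows[0])                      # number of items
    items = list(zip(*rows))              # transpose to one tuple per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(r) for r in rows])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


# Hypothetical responses of 5 people to a 3-item scale.
data = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 2],
    [3, 3, 4],
]
print(round(cronbach_alpha(data), 2))
```

With dichotomous (0/1) items this same formula yields KR-20 when population variances are used consistently; the sketch above uses sample variances throughout.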

36 Minimum Standards for Internal Consistency Reliability u For group comparisons (e.g., regression, correlational analyses) –.70 or above is minimum (Nunnally, 1978) –.80 is optimal –above .90 is unnecessary u For individual assessment (e.g., treatment decisions) –.90 or above (.95) is preferred (Nunnally, 1978)

37 Internal-Consistency Reliability Can be Spurious u Based on only those who answered all questions in the measure –If a lot of people are having trouble with the items and skip some, they are not included in test of reliability

38 Internal-Consistency Reliability is a Function of Number of Items in Scale u Increases with the number of items u Very large scales (20 or more items) can have high reliability without other good scaling properties
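
The dependence on scale length can be illustrated with the standardized-alpha (Spearman-Brown type) formula, alpha = k x r / (1 + (k - 1) x r), where k is the number of items and r the average inter-item correlation. The slide does not give this formula; it is used here as an assumption to show the pattern, and the function name is hypothetical.

```python
def standardized_alpha(n_items, mean_inter_item_r):
    """Standardized alpha from number of items and average inter-item correlation."""
    k, r = n_items, mean_inter_item_r
    return k * r / (1 + (k - 1) * r)


# Same weak average inter-item correlation (.20), different scale lengths.
for k in (5, 10, 20):
    print(k, round(standardized_alpha(k, 0.20), 2))
```

With an average inter-item correlation of only .20, a 20-item scale still exceeds the .80 mark, which is why a high alpha on a long scale says little about whether the individual items hang together.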

39 Example: 20 item Beck Depression Inventory (BDI) u BDI 1961 version (symptoms “today”) –reliability .88 –2 items correlated <.30 with other items in the scale u BDI 1978 version (past week) –reliability .86 –3 items correlated <.30 with other items in the scale Beck AT et al. J Clin Psychol. 1984;40:

40 Test-Retest Reliability u Repeat assessment on individuals who are not expected to change u Time between assessments should be: –Short enough so no change occurs –Long enough so subjects don’t recall first response u Coefficient is a correlation between two measurements –Type of correlation depends on scale properties u For single item measures, the only way to test reliability

41 Appropriate Test-Retest Coefficients by Type of Measure u Continuous scales (ratio or interval scales, multi-item Likert scales): –Pearson u Ordinal or non-normally distributed scales: –Spearman –Kendall’s tau u Dichotomous (categorical) measures: –Phi –Kappa

42 Minimum Standards for Test-Retest Reliability u Significance of a test-retest correlation has NOTHING to do with the adequacy of the reliability u Criteria: similar to those for internal consistency –>.70 is desirable –>.80 is optimal

43 Observer or Rater Reliability u Inter-rater reliability (across two or more raters) –Consistency (correlation) between two or more observers on the same subjects (one point in time) u Intra-rater reliability (within one rater) – A test-retest within one observer –Correlation among repeated values obtained by the same observer (over time)

44 Observer or Rater Reliability u Sometimes Pearson correlations are used - correlate one observer with another –Assesses association only u .65 to .95 are typical correlations u >.85 is considered acceptable (McDowell and Newell)

45 Association vs. Agreement When Correlating Two Times or Ratings u Association is degree to which one score linearly predicts other score u Agreement is extent to which same score is obtained on second measurement (retest, second observer) u Can have high correlation and poor agreement –If second score is consistently higher for all subjects, can obtain high correlation –Need second test of mean differences

46 Example of Association and Agreement u Scores at time 2 are exactly 3 points above scores at time 1 –Correlation (association) would be perfect (r=1.0) –Agreement is not perfect (no agreement on score in any case - a difference of 3 between each score at time 1 and time 2)
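
The example on this slide is easy to verify. The sketch below uses made-up time 1 scores, adds exactly 3 points for time 2, and computes the Pearson correlation, the mean difference, and the share of cases with identical scores. All names and data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


time1 = [10, 12, 15, 18, 20]          # hypothetical scores
time2 = [s + 3 for s in time1]        # every score exactly 3 points higher

r = pearson_r(time1, time2)
mean_diff = sum(b - a for a, b in zip(time1, time2)) / len(time1)
exact_agreement = sum(a == b for a, b in zip(time1, time2)) / len(time1)
print(r, mean_diff, exact_agreement)
```

The correlation is perfect while no pair of scores agrees, which is why a test of mean differences (or an agreement coefficient) is needed alongside the correlation.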

47 Types of Reliability Coefficients for Agreement Among Raters u Intraclass correlation –Kappa

48 Intraclass Correlation Coefficient for Testing Inter-rater Reliability (Kappa) u Coefficient indicates level of agreement of two or more judges, exceeding that which would be expected by chance u Appropriate for dichotomous (categorical) scales and ordinal scales u Several forms of kappa: –e.g., Cohen’s kappa is for 2 judges, dichotomous scale u Sensitive to number of observations, distribution of data
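
A minimal sketch of Cohen's kappa for two judges may help show what "agreement exceeding chance" means: kappa = (observed agreement - expected agreement) / (1 - expected agreement), where expected agreement comes from each judge's marginal proportions. The function name and ratings are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters judging the same subjects."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    categories = set(count_a) | set(count_b)
    # Chance agreement: product of the two raters' marginal proportions per category.
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical dichotomous ratings (1 = present, 0 = absent) of 10 subjects.
judge_1 = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
judge_2 = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
print(round(cohens_kappa(judge_1, judge_2), 2))
```

Here the judges agree on 80% of subjects, but half of that agreement would be expected by chance given their marginals, so kappa is noticeably lower than the raw agreement rate.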

49 Interpreting Kappa: Level of Reliability
<0.00          Poor
0.00 – 0.20    Slight
0.21 – 0.40    Fair
0.41 – 0.60    Moderate
0.61 – 0.80    Substantial
0.81 – 1.00    Almost perfect
.60 or higher is acceptable (Landis, 1977)

50 Reliable Scale? u NO! u There is no such thing as a “reliable” scale u We accumulate “evidence” of reliability in a variety of populations in which it has been tested

51 Reliability Often Poorer in Lower SES Groups More random error due to u Reading problems u Difficulty understanding complex questions u Unfamiliarity with questionnaires and surveys

52 Overview u Basic psychometric characteristics –Variability –Reliability –Interpretability

53 Interpretability of Scale Scores: What does a Score Mean? Meaning of scores u What are the endpoints? u Direction of scoring - what does a high score mean? u Compared to norms - is score average, low, or high compared to norms? u Single items: more easily interpretable u Multi-item scales: no inherent meaning to scores

54 Endpoints u What is minimum and maximum possible? –To enable interpretation of mean score u Endpoints of summated scales depend on number of items & number of response choices –5 items, 4 response choices = –3 items, 5 response choices =
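
The endpoints of a summated scale follow directly from the number of items and response choices once the coding of the response choices is known. The sketch below assumes each item is coded with consecutive integers starting at 1 unless stated otherwise; that coding is an assumption, not something the slide specifies, and the function name is hypothetical.

```python
def summated_range(n_items, n_choices, lowest_code=1):
    """Minimum and maximum possible sum for items coded
    lowest_code, lowest_code + 1, ..., lowest_code + n_choices - 1."""
    minimum = n_items * lowest_code
    maximum = n_items * (lowest_code + n_choices - 1)
    return minimum, maximum


print(summated_range(5, 4))      # 5 items, 4 response choices, coded 1-4
print(summated_range(3, 5))      # 3 items, 5 response choices, coded 1-5
print(summated_range(5, 4, 0))   # same 5 items if coded 0-3 instead
```

The last call shows why the coding matters: the same items give a different possible range when responses start at 0, so a reported mean cannot be interpreted without knowing which was used.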

55 Direction of Scoring u What does a high score mean? u Where in the range does this mean score lie? –Toward top, bottom? –In the middle?

56 Descriptive Statistics for 3193 Women
            M (SD)       Min   Max
Age         46.2 (2.7)
Activity    7.7 (1.8)
Stress      8.6 (2.9)
Avis NE et al. Med Care, 2003;41:

57 Sample Results: Mean Scores in a Sample of Older Adults
                        Mean
Physical functioning    45.0
Sleep                   28.1
Disability              35.7

58 Example of Table Labeling Scores: Making it Easier to Interpret
                        Mean*
Physical functioning    45.0
Sleep                   28.1
Disability              35.7
* All scores

59 Example of Table Labeling Scores: Making it Easier to Interpret
                            Mean*
Physical functioning (+)    45.0
Sleep (-)                   28.1
Disability (-)              35.7
* All scores
(+) indicates higher score is better health
(-) indicates lower score is better health

60 Solutions u Can include in label (+) or (-) –Can label scale so that higher score is more of “label” u Can easily put score range next to label if they differ in one table

61 Mean Has to be Interpreted Within the Possible Range M SD Parents’ harsh discipline practices* Interviewers’ ratings of mother Husbands’ reports of wife *Note: high score indicates more harsh practices

62 Mean Has to be Interpreted Within the Possible Range M SD Parents’ harsh discipline practices* Interviewers’ ratings of mother (1-5) Husbands’ reports of wife (1-7) *Note: high score indicates more harsh practices

63 Mean Has to be Interpreted Within the Possible Range M SD Parents’ harsh discipline practices* Interviewers’ ratings of mother (1-5) Husbands’ reports of wife (1-7) Interviewer: Husband: *Note: high score indicates more harsh practices

64 Mean Has to be Interpreted Within the Possible Range: Adding SD Information M SD Parents’ harsh discipline practices* Interviewers’ ratings of mother (1-5) Husbands’ reports of wife (1-7) Interviewer: Husband: *Note: high score indicates more harsh practices

65 Transforming a Summated Scale to a 0-100 Scale u Works with any ordinal or summated scale u Transforms it so 0 is the lowest possible and 100 is the highest possible u Eases interpretation across numerous scales u Transformed score = 100 x (observed score - minimum possible score) / (maximum possible score - minimum possible score)
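
The transformation on this slide is a one-line linear rescaling. A minimal sketch, with a hypothetical function name and an illustrative 5-item scale whose possible range is assumed to be 5-20:

```python
def to_0_100(observed, min_possible, max_possible):
    """Rescale a score so the minimum possible maps to 0 and the maximum possible to 100."""
    return 100.0 * (observed - min_possible) / (max_possible - min_possible)


# Illustrative scale with an assumed possible range of 5 to 20.
for raw in (5, 12.5, 20):
    print(raw, to_0_100(raw, 5, 20))
```

The possible endpoints, not the observed minimum and maximum in the sample, go into the formula; using observed values would make transformed scores incomparable across samples.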

66 Next Class (Class 5) u Validity and bias u Responsiveness and sensitivity to change

67 Homework u Complete rows 13 – 26 in matrix for your two measures (including reliability)