Evaluating Health-Related Quality of Life Measures Ron D. Hays, Ph.D. UCLA GIM & HSR February 9, 2015 (9:00-11:50 am) HPM 214, Los Angeles, CA
Where are we now in HPM 214?
1. Introduction to Outcomes and Effectiveness
2. HRQOL Profile Measures
3. HRQOL Preference-Based Measures
4. Designing HRQOL Measures
5. Evaluating HRQOL Measures
6. PROMIS/IRT/Internet Panels
7. Responding to reviews
8. Course Review (Cognitive interview assignment due)
9. Final Exam (3/16/15)
The 2nd class assignment is to conduct and summarize 5 cognitive interviews with a self-administered HRQOL survey instrument. Your written summary should be no more than 3 pages in length; longer summaries will not be accepted. You are required to conduct 5 (and no more than 5) cognitive interviews for every item in your selected instrument. If you have a long instrument, you can split it up so that each respondent does not have to be interviewed on every item, but 5 people need to be exposed to each item. The cognitive interview write-up is due at 9am on 03/09/. Extra credit can be obtained by writing a 2-page review of a published HRQOL article. The article selected needs to be cleared with the instructor in advance.
Four Levels of Measurement Nominal (categorical) Ordinal (rank) Interval (numerical) Ratio (numerical)
Levels of Measurement and Their Properties
           Property
Level      Magnitude   Equal Interval   Absolute 0
Nominal    No          No               No
Ordinal    Yes         No               No
Interval   Yes         Yes              No
Ratio      Yes         Yes              Yes
Ordinal Scale In general, how would you rate your health? –Excellent –Very good –Good –Fair –Poor
Ordinal Scale
In general, would you say your health is …
–100 = Excellent?
–075 = Very good? [84] [76]
–050 = Good? [61] [52]
–025 = Fair? [26]
–000 = Poor?
Interval Scales Fahrenheit and Centigrade temperature –T (°C) = (T (°F) - 32) × 5/9 40°C ≠ 2 times as hot as 20°C 104°F ≠ 2 times as hot as 68°F
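The point of the temperature slide can be checked numerically. The sketch below is only an illustration: the helper names `f_to_c` and `c_to_k` are made up here, and the Kelvin offset of 273.15 is the standard one. It shows that the "twice as hot" ratio depends on the interval scale chosen and only becomes meaningful on a scale with an absolute zero.

```python
def f_to_c(temp_f):
    """Fahrenheit to Celsius: T(C) = (T(F) - 32) * 5/9."""
    return (temp_f - 32) * 5 / 9


def c_to_k(temp_c):
    """Celsius to Kelvin, a ratio scale with an absolute zero."""
    return temp_c + 273.15


hot_c, cool_c = f_to_c(104), f_to_c(68)      # the same two temperatures in Celsius
ratio_celsius = hot_c / cool_c               # ratio on the Celsius scale
ratio_fahrenheit = 104 / 68                  # ratio on the Fahrenheit scale
ratio_kelvin = c_to_k(hot_c) / c_to_k(cool_c)  # ratio on the Kelvin scale

print(ratio_celsius, round(ratio_fahrenheit, 3), round(ratio_kelvin, 3))
```

The Celsius and Fahrenheit ratios disagree for the same pair of temperatures, which is why ratios are not interpretable on interval scales.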
Ratio Scales
Kelvin Temperature Scale (absolute 0)
Days spent in hospital in last 30 days
Age
A 4-year-old is twice as old as a 2-year-old. If you subtract 1 from both of their ages, then 4 becomes 3 and 2 becomes 1. The 4-year-old is still twice as old as the 2-year-old despite the new age values being 3 versus 1 (i.e., “0” no longer means zero years).
Measurement Range for HRQOL Measures
Nominal   Ordinal   Interval   Ratio
Levels of Measurement and Their Properties
           Item
Person     Magnitude   Equal Interval   Absolute 0   Total Score
Nominal    No          No               No           0
Ordinal    Yes         No               No           1
Interval   Yes         Yes              No           2
Ratio      Yes         Yes              Yes          3
Four Types of Data Collection Errors
Coverage Error: Does each person in target population have an equal chance of selection?
Sampling Error: Only some members of the target population are sampled.
Nonresponse Error: Do people in the sample who respond differ from those who do not?
Measurement Error: Inaccuracy in answers given to survey questions.
Characteristics of Good Measures Acceptability Variability Reliability Validity Interpretability
Indicators of Acceptability Response rate Administration time Missing data (item, scale)
Variability
Responses fall in each response category
Distribution approximates bell-shaped “normal” curve (68.2%, 95.4%, and 99.7% of scores within 1, 2, and 3 SDs of the mean)
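The normal-curve percentages can be reproduced from the error function in the standard library. This is a minimal sketch; `share_within` is a name chosen here for illustration.

```python
import math


def share_within(k_sd):
    """Proportion of a normal distribution lying within k_sd SDs of the mean."""
    return math.erf(k_sd / math.sqrt(2))


for k_sd in (1, 2, 3):
    print(k_sd, round(100 * share_within(k_sd), 1))
```

The three shares come out at roughly 68.3%, 95.4%, and 99.7%.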
Reliability
Reliability is the degree to which the same score is obtained for the thing being measured (person, plant, or whatever) when that thing hasn’t changed.
–Ratio of signal to noise
Observed Score is: observed score = “true” score + systematic error + random error
Flavors of Reliability Inter-rater (rater) –Need 2 or more raters of the thing being measured Test-retest (administrations) –Need 2 or more time points Internal consistency (items) –Need 2 or more items
Reliability Minimum Standards
0.70 or above (for group comparisons)
0.90 or higher (for individual assessment)
$\mathrm{SEM} = \mathrm{SD} \times (1 - \text{reliability})^{1/2}$
$95\%\ \mathrm{CI} = \text{true score} \pm 1.96 \times \mathrm{SEM}$
If z-score = 0, then CI: -.62 to +.62 when reliability = 0.90
Width of CI is 1.24 z-score units
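The confidence-interval arithmetic on this slide can be sketched as follows, assuming the usual 1.96 multiplier for a 95% interval. The function names `sem` and `ci95` are illustrative only.

```python
import math


def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


def ci95(score, sd, reliability, z=1.96):
    """Approximate 95% confidence interval around a score."""
    half_width = z * sem(sd, reliability)
    return score - half_width, score + half_width


# z-score metric (SD = 1), score of 0, reliability of 0.90
low, high = ci95(0.0, 1.0, 0.90)
print(round(low, 2), round(high, 2), round(high - low, 2))

# Same score at the group-comparison standard of 0.70
low_70, high_70 = ci95(0.0, 1.0, 0.70)
print(round(high_70 - low_70, 2))
```

Dropping reliability from 0.90 to 0.70 widens the interval from about 1.24 to about 2.15 z-score units, which is why the higher standard is asked of individual assessment.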
Hypothetical Ratings of Performance of Six Students in HPM 214 by Two Raters Using Excellent to Poor Scale [1 = Poor; 2 = Fair; 3 = Good; 4 = Very good; 5 = Excellent] 1= Julian (Good, Very Good) 2= Narissa (Very Good, Excellent) 3= Alina (Good, Good) 4= Greg (Fair, Poor) 5= Linda (Excellent, Very Good) 6= Caroline (Fair, Fair) (Target = 6 students; assessed by 2 raters)
Kappa Coefficient of Agreement (Corrects for Chance)
$\kappa = \dfrac{\text{observed} - \text{chance}}{1 - \text{chance}}$
“Quality Index”
Cross-Tab of Ratings
                 Rater 1
Rater 2    P    F    G    VG   E    Total
P          0    1    0    0    0    1
F          0    1    0    0    0    1
G          0    0    1    0    0    1
VG         0    0    1    0    1    2
E          0    0    0    1    0    1
Total      0    2    2    1    1    6
Calculating KAPPA
$P_C = \dfrac{(0 \times 1) + (2 \times 1) + (2 \times 1) + (1 \times 2) + (1 \times 1)}{6 \times 6} = 0.19$
$P_{obs.} = \dfrac{2}{6} = 0.33$
$\text{Kappa} = \dfrac{0.33 - 0.19}{1 - 0.19} = 0.17$
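The kappa calculation can be replayed from the six pairs of ratings given for the HPM 214 students. This is a minimal sketch; `cohen_kappa` is a helper written for this example, not a library call.

```python
from collections import Counter

LABELS = ["P", "F", "G", "VG", "E"]
# Julian, Narissa, Alina, Greg, Linda, Caroline
rater1 = ["G", "VG", "G", "F", "E", "F"]
rater2 = ["VG", "E", "G", "P", "VG", "F"]


def cohen_kappa(a, b, labels):
    """Return (observed agreement, chance agreement, kappa)."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of the two raters' marginal counts per category
    p_chance = sum(count_a[lab] * count_b[lab] for lab in labels) / (n * n)
    return p_obs, p_chance, (p_obs - p_chance) / (1 - p_chance)


p_obs, p_chance, kappa = cohen_kappa(rater1, rater2, LABELS)
print(round(p_obs, 2), round(p_chance, 2), round(kappa, 2))
```

Only Alina and Caroline receive identical ratings, so observed agreement is 2 of 6 and kappa lands near 0.17.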
Guidelines for Interpreting Kappa
Fleiss (1981)               Landis and Koch (1977)
Conclusion   Kappa          Conclusion       Kappa
Poor         < .40          Poor             < 0.0
Fair         .40 - .59      Slight           .00 - .20
Good         .60 - .74      Fair             .21 - .40
Excellent    > .74          Moderate         .41 - .60
                            Substantial      .61 - .80
                            Almost perfect   .81 - 1.00
Weighted Kappa (Linear and Quadratic)
Agreement weights: linear (quadratic)
      P            F            G            VG           E
P     1            .75 (.937)   .50 (.750)   .25 (.437)   0
F     .75 (.937)   1            .75 (.937)   .50 (.750)   .25 (.437)
G     .50 (.750)   .75 (.937)   1            .75 (.937)   .50 (.750)
VG    .25 (.437)   .50 (.750)   .75 (.937)   1            .75 (.937)
E     0            .25 (.437)   .50 (.750)   .75 (.937)   1
$W_l = 1 - i / (k - 1)$
$W_q = 1 - i^2 / (k - 1)^2$
i = number of categories ratings differ by
k = n of categories
Linear weighted kappa = 0.52; Quadratic weighted kappa = 0.77
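Both weighted kappas can be recomputed from the same six rating pairs, coded 1 = Poor to 5 = Excellent. The sketch below is an illustration; `weighted_kappa` and its `power` argument are names chosen here (power 1 gives linear weights, power 2 quadratic).

```python
# Julian, Narissa, Alina, Greg, Linda, Caroline (1 = Poor ... 5 = Excellent)
rater1 = [3, 4, 3, 2, 5, 2]
rater2 = [4, 5, 3, 1, 4, 2]


def weighted_kappa(a, b, k, power):
    """Weighted kappa with weights w = 1 - |i - j|**power / (k - 1)**power."""
    n = len(a)

    def weight(i, j):
        return 1 - abs(i - j) ** power / (k - 1) ** power

    # Observed weighted agreement over the actual rating pairs
    p_obs = sum(weight(x, y) for x, y in zip(a, b)) / n
    # Chance weighted agreement: every rater-1 rating crossed with every rater-2 rating
    p_exp = sum(weight(x, y) for x in a for y in b) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)


linear_kappa = weighted_kappa(rater1, rater2, k=5, power=1)
quadratic_kappa = weighted_kappa(rater1, rater2, k=5, power=2)
print(round(linear_kappa, 2), round(quadratic_kappa, 2))
```

Because every disagreement in this example is by a single category, giving partial credit raises agreement well above the unweighted kappa.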
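Intraclass Correlation and Reliability
Model: One-way
  Intraclass Correlation: $\dfrac{BMS - WMS}{BMS + (k - 1)\,WMS}$
  Reliability: $\dfrac{BMS - WMS}{BMS}$
Model: Two-way mixed
  Intraclass Correlation: $\dfrac{BMS - EMS}{BMS + (k - 1)\,EMS}$
  Reliability: $\dfrac{BMS - EMS}{BMS}$
Model: Two-way random
  Intraclass Correlation: $\dfrac{BMS - EMS}{BMS + (k - 1)\,EMS + k\,(JMS - EMS)/N}$
  Reliability: $\dfrac{N\,(BMS - EMS)}{N \cdot BMS + JMS - EMS}$
BMS = Between Ratee Mean Square
WMS = Within Mean Square
JMS = Item or Rater Mean Square
EMS = Ratee x Item (Rater) Mean Square
N = n of ratees
k = n of items or raters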
Two-Way Random Effects (Reliability of Performance Ratings)
Source                  df    SS      MS
Students (BMS)          5     15.67   3.13
Raters (JMS)            1     0.00    0.00
Stud. x Raters (EMS)    5     2.00    0.40
Total                   11    17.67
$\text{2-way } R = \dfrac{6\,(3.13 - 0.40)}{6\,(3.13) + 0.00 - 0.40} = 0.89$
ICC = 0.80
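The two-way random effects numbers can be rebuilt from the raw ratings. The sketch below assumes the standard two-way ANOVA decomposition with one rating per student-rater cell; `two_way_mean_squares` is a helper written for this example.

```python
# Rows = students (Julian ... Caroline); columns = rater 1, rater 2
ratings = [(3, 4), (4, 5), (3, 3), (2, 1), (5, 4), (2, 2)]


def two_way_mean_squares(rows):
    """Return (BMS, JMS, EMS) for a ratee-by-rater table with one score per cell."""
    n, k = len(rows), len(rows[0])
    correction = sum(sum(r) for r in rows) ** 2 / (n * k)
    ss_total = sum(x * x for r in rows for x in r) - correction
    ss_ratee = sum(sum(r) ** 2 for r in rows) / k - correction
    ss_rater = sum(sum(r[j] for r in rows) ** 2 for j in range(k)) / n - correction
    ss_error = ss_total - ss_ratee - ss_rater
    return ss_ratee / (n - 1), ss_rater / (k - 1), ss_error / ((n - 1) * (k - 1))


n, k = len(ratings), len(ratings[0])
bms, jms, ems = two_way_mean_squares(ratings)

# Reliability of the k-rater average and single-rater ICC, two-way random model
reliability = n * (bms - ems) / (n * bms + jms - ems)
icc = (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)
print(round(bms, 2), round(ems, 2), round(reliability, 2), round(icc, 2))
```

The rater mean square is zero here because both raters have the same total (19), so the rater effect contributes nothing to the denominator.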
Responses of Students to Two Questions about Their Health 1= Julian (Good, Very Good) 2= Narissa (Very Good, Excellent) 3= Alina (Good, Good) 4= Greg (Fair, Poor) 5= Linda (Excellent, Very Good) 6= Caroline (Fair, Fair) (Target = 6 students; assessed by 2 items)
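Two-Way Mixed Effects (Cronbach’s Alpha)
Source                  df    SS      MS
Respondents (BMS)       5     15.67   3.13
Items (JMS)             1     0.00    0.00
Resp. x Items (EMS)     5     2.00    0.40
Total                   11    17.67
$\text{Alpha} = \dfrac{3.13 - 0.40}{3.13} = \dfrac{2.73}{3.13} = 0.87$
ICC = 0.77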
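Cronbach's alpha can also be obtained from the familiar item-variance formula, which gives the same answer as the mean-square version. The sketch below uses that formula as a cross-check; `cronbach_alpha` is a helper written for this example, and the single-item ICC is recovered by running the Spearman-Brown relation in reverse.

```python
from statistics import variance

# Responses of the six students to the two health items (1 = Poor ... 5 = Excellent)
item1 = [3, 4, 3, 2, 5, 2]
item2 = [4, 5, 3, 1, 4, 2]


def cronbach_alpha(items):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = len(items)
    totals = [sum(values) for values in zip(*items)]
    item_variance_sum = sum(variance(column) for column in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


k = 2
alpha = cronbach_alpha([item1, item2])
# Reliability of a single item implied by alpha (reverse Spearman-Brown)
single_item_icc = alpha / (k - (k - 1) * alpha)
print(round(alpha, 2), round(single_item_icc, 2))
```

This reproduces an alpha of about 0.87 and a single-item ICC of about 0.77.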
Satisfaction of 12 Family Members with 6 Students (2 per student) 1. Julian (fam1: Good, fam2: Very Good) 2. Narissa (fam3: Very Good, fam4: Excellent) 3. Alina (fam5: Good, fam6: Good) 4. Greg (fam7: Fair, fam8: Poor) 5. Linda (fam9: Excellent, fam10: Very Good) 6. Caroline (fam11: Fair, fam12: Fair) (Target = 6 students; assessed by 2 family members each)
One-Way ANOVA (Reliability of Ratings of Students)
Source               df    SS      MS
Respondents (BMS)    5     15.67   3.13
Within (WMS)         6     2.00    0.33
Total                11    17.67
$\text{1-way} = \dfrac{3.13 - 0.33}{3.13} = \dfrac{2.80}{3.13} = 0.89$
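When each student is rated by a different pair of family members, raters are nested within students and only a between/within split is available. The one-way calculation can be sketched as below; `one_way_mean_squares` is a helper written for this example.

```python
# Each tuple holds the two family-member ratings for one student
family_ratings = [(3, 4), (4, 5), (3, 3), (2, 1), (5, 4), (2, 2)]


def one_way_mean_squares(groups):
    """Return (BMS, WMS) for equally sized groups of ratings."""
    n, k = len(groups), len(groups[0])
    grand_mean = sum(sum(g) for g in groups) / (n * k)
    ss_between = k * sum((sum(g) / k - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - sum(g) / k) ** 2 for g in groups for x in g)
    return ss_between / (n - 1), ss_within / (n * (k - 1))


bms, wms = one_way_mean_squares(family_ratings)
one_way_reliability = (bms - wms) / bms
print(round(bms, 2), round(wms, 2), round(one_way_reliability, 2))
```

The within mean square absorbs both rater differences and error, which is why it is computed on 6 degrees of freedom rather than 5.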
Standardized Alpha for Different Numbers of Items and Average Inter-item Correlation
Number of Items (k)
Average Inter-item Correlation ($\bar{r}$)
$\alpha_{st} = \dfrac{k \cdot \bar{r}}{1 + (k - 1) \cdot \bar{r}}$
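A small grid of standardized alpha values can be generated directly from the formula. The item counts and correlations below are arbitrary illustrative choices, not the values from the original slide table, and `standardized_alpha` is a name chosen here.

```python
def standardized_alpha(k, mean_r):
    """Standardized alpha for k items with average inter-item correlation mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)


# Hypothetical grid: rows are numbers of items, columns are average correlations
correlations = (0.2, 0.4, 0.6, 0.8)
for k in (2, 4, 6, 8):
    row = [round(standardized_alpha(k, r), 2) for r in correlations]
    print(k, row)
```

Reading across a row shows the gain from more homogeneous items; reading down a column shows the gain from adding items at a fixed average correlation.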
Spearman-Brown Prophecy Formula
$\alpha_y = \dfrac{N \cdot \alpha_x}{1 + (N - 1) \cdot \alpha_x}$
N = how much longer scale y is than scale x
Example Spearman-Brown Calculations
Estimating the reliability of the MHI-18 from the MHI-32:
$\dfrac{(18/32)\,(0.98)}{1 + (18/32 - 1)\,(0.98)} = \dfrac{0.55125}{0.57125} = 0.96$
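The MHI example can be checked with a few lines. This is a sketch; `spearman_brown` is a helper name chosen here, and the doubling example at the end uses a hypothetical starting reliability of 0.70.

```python
def spearman_brown(alpha_x, length_ratio):
    """Projected reliability when a scale is made length_ratio times as long."""
    return length_ratio * alpha_x / (1 + (length_ratio - 1) * alpha_x)


# Shortening the MHI-32 (reliability 0.98) to 18 items
mhi18_estimate = spearman_brown(0.98, 18 / 32)
print(mhi18_estimate)

# Hypothetical: doubling a scale whose reliability is 0.70
doubled = spearman_brown(0.70, 2)
print(doubled)
```

A length ratio below 1 projects the reliability of a shortened scale, and a ratio above 1 projects a lengthened one.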
Number of Items and Reliability: Three Versions of the Mental Health Inventory (MHI)
Measure   Number of Items   Completion Time (min.)   Reliability
MHI-32    32
MHI-18    18
MHI-5     5                 1 or less                .90
Data from McHorney et al. 1992
Multitrait Scaling Analysis Internal consistency reliability –Item convergence Item discrimination
Item-scale correlation matrix
Validity Does instrument measure what it is supposed to measure? A “validated” instrument is a holy grail
Reliability and Validity
Threats to Validity
Socially Desirable Response Set
Acquiescent Response Set
Listed below are a few statements about your relationships with others. How much is each statement TRUE or FALSE for you?
1. I am always courteous even to people who are disagreeable.
2. There have been occasions when I took advantage of someone.
3. I sometimes try to get even rather than forgive and forget.
4. I sometimes feel resentful when I don’t get my way.
5. No matter who I’m talking to, I’m always a good listener.
Definitely true; Mostly true; Don’t know; Mostly false; Definitely false
Two Types of Validity Content Validity –Includes face validity Construct Validity –Many synonyms
Content Validity Does the measure adequately represent the domain? –Do items operationalize concept? –Do items cover all aspects of concept? –Does scale name represent item content? Face validity is extent to which measure “appears” to reflect what it is intended to –E.g., by expert judges or by patient focus groups
Construct Validity Do scores on a measure relate to other variables in ways consistent with hypotheses?
Evaluating Construct Validity
Scale   Age   Obesity   ESRD   Nursing Home Resident
Physical Functioning: Medium (-). Small (-) Large (-)
Depressive Symptoms: ? Small (+) ? Medium (+)
Cohen effect size rules of thumb (d = 0.2, 0.5, and 0.8):
Small correlation = 0.100
Medium correlation = 0.243
Large correlation = 0.371
$r = d / (d^2 + 4)^{0.5} = 0.8 / (0.8^2 + 4)^{0.5} = 0.8 / (0.64 + 4)^{0.5} = 0.8 / (4.64)^{0.5} = 0.8 / 2.154 = 0.371$
(Beware: r’s of 0.10, 0.30, and 0.50 are often cited as small, medium, and large.)
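The conversion from Cohen's d to a correlation can be run for all three benchmarks. This is a minimal sketch of the formula on the slide, which assumes two groups of equal size; `d_to_r` is a name chosen here.

```python
import math


def d_to_r(d):
    """Convert a standardized mean difference d to r (equal group sizes assumed)."""
    return d / math.sqrt(d ** 2 + 4)


for label, d in (("small", 0.2), ("medium", 0.5), ("large", 0.8)):
    print(label, d, round(d_to_r(d), 3))
```

The resulting correlations sit below the commonly quoted 0.10, 0.30, and 0.50 for the medium and large benchmarks, which is the caution the slide raises.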
Relative Validity Analyses Form of "known groups" validity Relative sensitivity of measure to important clinical difference One-way between group ANOVA
Relative Validity Example
Columns: Severity of Heart Disease (None, Mild, Severe), F-ratio, Relative Validity
Rows: Scale #, Scale #, Scale #
Responsiveness to Change
HRQOL measures should be responsive to interventions that change HRQOL
Need external indicators of change (Anchors)
Self-Report Indicator of Change Overall has there been any change in your asthma since the beginning of the study? Much improved; Moderately improved; Minimally improved No change Minimally worse; Moderately worse; Much worse
Clinical Indicator of Change “changed” group = seizure free (100% reduction in seizure frequency) “unchanged” group = <50% change in seizure frequency
Responsiveness Indices (1) Effect size (ES) = D/SD (2) Standardized Response Mean (SRM) = D/SD† (3) Guyatt responsiveness statistic (RS) = D/SD‡ D = raw score change in “changed” group; SD = baseline SD; SD† = SD of D; SD‡ = SD of D among “unchanged”
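The three indices differ only in the denominator, which a small example makes concrete. All scores below are hypothetical numbers invented for illustration, and the baseline SD is assumed to be taken in the "changed" group.

```python
from statistics import mean, stdev

# Hypothetical scores for a "changed" group (e.g., improved on the anchor)
baseline_changed = [40, 50, 60, 50, 50]
followup_changed = [50, 58, 66, 60, 56]
# Hypothetical change scores in an "unchanged" group
change_unchanged = [2, -2, 0, 4, -4]

change_changed = [after - before
                  for before, after in zip(baseline_changed, followup_changed)]
d = mean(change_changed)  # D: raw score change in the "changed" group

effect_size = d / stdev(baseline_changed)   # ES  = D / baseline SD
srm = d / stdev(change_changed)             # SRM = D / SD of change
guyatt_rs = d / stdev(change_unchanged)     # RS  = D / SD of change among "unchanged"

print(round(effect_size, 2), round(srm, 2), round(guyatt_rs, 2))
```

With the same mean change of 8 points, the three indices give quite different values, so they should not be compared against a single benchmark without noting which denominator was used.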
Effect Size Benchmarks Small: 0.20->0.49 Moderate: 0.50->0.79 Large: 0.80 or above
Minimally Important Difference (MID) External anchors –Self-report –Provider report –Clinical measure –Intervention Anchor correlated with change on target measure at or higher Anchor indicates “minimal” change
Change in Physical Function
Baseline = 100 (U.S. males mean = 87, SD = 20)
Hit by Bike causes me to be limited a lot in vigorous activities, limited a little in moderate activities, and limited a lot in climbing several flights of stairs. Physical functioning drops to 75 (-1.25 SD)
Hit by Rock causes me to be limited a little in vigorous activities and physical functioning drops to 95 (-0.25 SD)
Example with Multiple Anchors
693 RA clinical trial participants evaluated at baseline and 6 weeks post-treatment.
Five anchors:
1. patient global self-report
2. physician global report
3. pain self-report
4. joint swelling
5. joint tenderness
Kosinski, M. et al. (2000). Determining minimally important changes in generic and disease-specific health-related quality of life questionnaires in clinical trials of rheumatoid arthritis. Arthritis and Rheumatism, 43,
Patient and Physician Global Reports
How are you (is the patient) doing, considering all the ways that RA affects you (him/her)?
Very good (asymptomatic and no limitation of normal activities)
Good (mild symptoms and no limitation of normal activities)
Fair (moderate symptoms and limitation of normal activities)
Poor (severe symptoms and inability to carry out most normal activities)
Very poor (very severe symptoms that are intolerable and inability to carry out normal activities)
--> Improvement of 1 level over time
Global Pain, Joint Swelling and Tenderness 0 = no pain, 10 = severe pain Number of swollen and tender joints -> 1-20% improvement over time
Effect Sizes (mean = 0.34) for SF-36 Changes Linked to Minimal Change in Anchors
Columns: Scale, Self-R, Clin.-R, Pain, Swell, Tender, Mean
Rows: PF, Role-P, Pain, GH, EWB, Role-E, SF, EF, PCS, MCS
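Appendix: ANOVA Computations
A. Student’s SS = $(7^2 + 9^2 + 6^2 + 3^2 + 9^2 + 4^2)/2 - 38^2/12 = 15.67$
B. Rater/Item SS = $(19^2 + 19^2)/6 - 38^2/12 = 0.00$
C. Total SS = $(3^2 + 4^2 + 4^2 + 5^2 + 3^2 + 3^2 + 2^2 + 1^2 + 5^2 + 4^2 + 2^2 + 2^2) - 38^2/12 = 17.67$
Student x Item SS = C - (A + B) = 2.00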
options ls=130 ps=52 nocenter;
options nofmterr;
* Data lines rebuilt from the six-student example (1 = Poor ... 5 = Excellent);
data one;
  input id 1-2 rater 4 rating 5;
CARDS;
01 13
01 24
02 14
02 25
03 13
03 23
04 12
04 21
05 15
05 24
06 12
06 22
;
run;
* Frequencies and descriptive statistics;
proc freq;
  tables rater rating;
run;
proc means;
  var rater rating;
run;
* Two-way ANOVA: students (id) by raters;
proc anova;
  class id rater;
  model rating=id rater id*rater;
run;
* Data lines rebuilt from the six-student example (1 = Poor ... 5 = Excellent);
data one;
  input id 1-2 rater 4 rating 5;
CARDS;
01 13
01 24
02 14
02 25
03 13
03 23
04 12
04 21
05 15
05 24
06 12
06 22
;
run;
%GRIP(indata=one,targetv=id,repeatv=rater,dv=rating,
      type=1,t1=test of GRIP macro,t2=);
GRIP macro is available at:
data one;
  input id 1-2 rater1 4 rater2 5;
  control=1;
CARDS;
;
run;
DATA DUMMY;
  INPUT id 1-2 rater1 4 rater2 5;
CARDS;
RUN;
DATA NEW;
  SET ONE DUMMY;
PROC FREQ;
  TABLES CONTROL*RATER1*RATER2 / NOCOL NOROW NOPERCENT AGREE;
data one;
  set one;
proc means;
  var rater1 rater2;
run;
proc corr alpha;
  var rater1 rater2;
run;