Evaluating Health-Related Quality of Life Measures Ron D. Hays, Ph.D. UCLA GIM & HSR February 9, 2015 (9:00-11:50 am) HPM 214, Los Angeles, CA.


Where are we now in HPM 214?
1. Introduction to Outcomes and Effectiveness
2. HRQOL Profile Measures
3. HRQOL Preference-Based Measures
4. Designing HRQOL Measures
5. Evaluating HRQOL Measures ←
6. PROMIS/IRT/Internet Panels
7. Responding to Reviews
8. Course Review (cognitive interview assignment due)
9. Final Exam (3/16/15)

The 2nd class assignment is to conduct and summarize 5 cognitive interviews with a self-administered HRQOL survey instrument. Your written summary should be no more than 3 pages in length; longer summaries will not be accepted. You are required to conduct 5 (and no more than 5) cognitive interviews covering every item in your selected instrument. If you have a long instrument, you can divide it up so that each respondent does not have to be interviewed on every item, but 5 people need to be exposed to each item. The cognitive interview write-up is due at 9 am on 03/09/15. Extra credit can be obtained by writing a 2-page review of a published HRQOL article. The article selected needs to be cleared with the instructor in advance.

Four Levels of Measurement
– Nominal (categorical)
– Ordinal (rank)
– Interval (numerical)
– Ratio (numerical)

Levels of Measurement and Their Properties

Level      Magnitude   Equal Interval   Absolute 0
Nominal    No          No               No
Ordinal    Yes         No               No
Interval   Yes         Yes              No
Ratio      Yes         Yes              Yes

Ordinal Scale
In general, how would you rate your health?
– Excellent
– Very good
– Good
– Fair
– Poor

Ordinal Scale
In general, how would you rate your health? Is …
– 100 = Excellent?
– 075 = Very good? [84] [76]
– 050 = Good? [61] [52]
– 025 = Fair? [26]
– 000 = Poor?

Interval Scales
Fahrenheit and Centigrade temperature
– T (°C) = (T (°F) - 32) × 5/9
– 40°C ≠ 2 times as hot as 20°C
– 104°F ≠ 2 times as hot as 68°F

Ratio Scales
– Kelvin temperature scale (absolute 0)
– Days spent in hospital in the last 30 days
– Age: a 4-year-old is twice as old as a 2-year-old. If you subtract 1 from both ages, the values become 3 and 1. The 4-year-old is still twice as old as the 2-year-old, but the new values (3 versus 1) no longer show a 2:1 ratio, because "0" on the shifted scale no longer means zero years.

Measurement Range for HRQOL Measures: Nominal – Ordinal – Interval – Ratio

Levels of Measurement and Their Properties

Level      Magnitude   Equal Interval   Absolute 0   Total Score
Nominal    No          No               No           0
Ordinal    Yes         No               No           1
Interval   Yes         Yes              No           2
Ratio      Yes         Yes              Yes          3

Four Types of Data Collection Errors
– Coverage Error: Does each person in the target population have an equal chance of selection?
– Sampling Error: Only some members of the target population are sampled.
– Nonresponse Error: Do people in the sample who respond differ from those who do not?
– Measurement Error: Inaccuracy in answers given to survey questions.

Characteristics of Good Measures
– Acceptability
– Variability
– Reliability
– Validity
– Interpretability

Indicators of Acceptability
– Response rate
– Administration time
– Missing data (item, scale)

Variability
– Responses fall in each response category
– Distribution approximates a bell-shaped "normal" curve (68.3%, 95.4%, and 99.7% within 1, 2, and 3 SDs of the mean)

Reliability
Reliability is the degree to which the same score is obtained for the thing being measured (person, plant, or whatever) when that thing hasn't changed.
– Ratio of signal to noise

Observed Score is: observed score = “true” score + systematic error + random error
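The following SAS sketch (illustrative only; the variances and seed are arbitrary assumptions, not from the slides) simulates this decomposition for two parallel administrations. With a true-score ("signal") variance of 9 and a random-error ("noise") variance of 1, the correlation between the two administrations should come out near 0.90, the signal-to-noise ratio 9/(9+1).

* Hypothetical simulation of observed = true score + random error;
data sim;
  call streaminit(2015);
  do id = 1 to 10000;
    true  = rand("Normal", 0, 3);         /* signal: variance 9               */
    time1 = true + rand("Normal", 0, 1);  /* observed score, administration 1 */
    time2 = true + rand("Normal", 0, 1);  /* observed score, administration 2 */
    output;
  end;
run;

proc corr data=sim;   * r(time1, time2) estimates the reliability, about 0.90;
  var time1 time2;
run;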

Flavors of Reliability
– Inter-rater (raters): need 2 or more raters of the thing being measured
– Test-retest (administrations): need 2 or more time points
– Internal consistency (items): need 2 or more items

Reliability Minimum Standards
– 0.70 or above (for group comparisons)
– 0.90 or higher (for individual assessment)
SEM = SD × (1 - reliability)^1/2
95% CI = true score ± 1.96 × SEM
If z-score = 0 and reliability = 0.90, then the CI is -0.62 to +0.62 (width of CI is 1.24 z-score units)
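As a quick check on the numbers above, here is a small SAS sketch (assuming a z-score metric with SD = 1 and reliability = 0.90) of the SEM and 95% CI arithmetic:

data sem_ci;
  sd          = 1;                      /* z-score metric */
  reliability = 0.90;
  sem   = sd * sqrt(1 - reliability);   /* 0.32  */
  lower = 0 - 1.96 * sem;               /* -0.62 */
  upper = 0 + 1.96 * sem;               /* +0.62 */
  width = upper - lower;                /* 1.24  */
  put sem= 6.2 lower= 6.2 upper= 6.2 width= 6.2;
run;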

Hypothetical Ratings of Performance of Six Students in HPM 214 by Two Raters Using Excellent to Poor Scale [1 = Poor; 2 = Fair; 3 = Good; 4 = Very good; 5 = Excellent] 1= Julian (Good, Very Good) 2= Narissa (Very Good, Excellent) 3= Alina (Good, Good) 4= Greg (Fair, Poor) 5= Linda (Excellent, Very Good) 6= Caroline (Fair, Fair) (Target = 6 students; assessed by 2 raters)

Kappa Coefficient of Agreement (Corrects for Chance)
kappa = (observed - chance) / (1 - chance)
"Quality Index"

Cross-Tab of Ratings

                    Rater 1
Rater 2    P    F    G    VG   E    Total
P          0    1    0    0    0    1
F          0    1    0    0    0    1
G          0    0    1    0    0    1
VG         0    0    1    0    1    2
E          0    0    0    1    0    1
Total      0    2    2    1    1    6

Calculating Kappa
P_C (chance agreement) = [(0 x 1) + (2 x 1) + (2 x 1) + (1 x 2) + (1 x 1)] / (6 x 6) = 7/36 = 0.19
P_obs (observed agreement) = 2/6 = 0.33
Kappa = (0.33 - 0.19) / (1 - 0.19) = 0.17
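A minimal SAS sketch of the same arithmetic, with the observed and chance agreement taken from the worked example above:

data kappa;
  p_obs    = 2/6;    /* Alina and Caroline received identical ratings       */
  p_chance = 7/36;   /* sum of (row marginal x column marginal) over 6 x 6  */
  kappa    = (p_obs - p_chance) / (1 - p_chance);   /* about 0.17 */
  put kappa= 6.2;
run;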

Guidelines for Interpreting Kappa

Fleiss (1981)              Landis and Koch (1977)
Conclusion   Kappa         Conclusion       Kappa
Poor         < .40         Poor             < 0.0
Fair         .40–.59       Slight           .00–.20
Good         .60–.74       Fair             .21–.40
Excellent    > .74         Moderate         .41–.60
                           Substantial      .61–.80
                           Almost perfect   .81–1.00

Weighted Kappa (Linear and Quadratic)

Agreement weights, linear (quadratic in parentheses), for k = 5 ordered categories:

        P            F            G            VG           E
P       1            .75 (.937)   .50 (.750)   .25 (.437)   0
F       .75 (.937)   1            .75 (.937)   .50 (.750)   .25 (.437)
G       .50 (.750)   .75 (.937)   1            .75 (.937)   .50 (.750)
VG      .25 (.437)   .50 (.750)   .75 (.937)   1            .75 (.937)
E       0            .25 (.437)   .50 (.750)   .75 (.937)   1

W_linear = 1 - (i / (k - 1));  W_quadratic = 1 - (i² / (k - 1)²)
i = number of categories the ratings differ by; k = number of categories
Linear weighted kappa = 0.52; quadratic weighted kappa = 0.77
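The weights in the table follow directly from the two formulas; a short SAS sketch (k = 5 categories, as above) reproduces them:

data weights;
  k = 5;
  do i = 0 to k - 1;   /* i = number of categories the two ratings differ by */
    w_linear    = 1 - i / (k - 1);
    w_quadratic = 1 - (i**2) / ((k - 1)**2);
    output;
  end;
run;

proc print data=weights noobs;
  var i w_linear w_quadratic;
run;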

Intraclass Correlation and Reliability

Model             Intraclass Correlation (single rating)                    Reliability (mean of k ratings)
One-way           (BMS - WMS) / [BMS + (k - 1) WMS]                         (BMS - WMS) / BMS
Two-way mixed     (BMS - EMS) / [BMS + (k - 1) EMS]                         (BMS - EMS) / BMS  (= Cronbach's alpha)
Two-way random    (BMS - EMS) / [BMS + (k - 1) EMS + k (JMS - EMS) / N]     N (BMS - EMS) / (N × BMS + JMS - EMS)

BMS = Between Ratee Mean Square; WMS = Within Mean Square; JMS = Item or Rater Mean Square; EMS = Ratee x Item (Rater) Mean Square; N = n of ratees; k = n of items or raters

Two-Way Random Effects (Reliability of Performance Ratings)

Source                     df    SS      MS
Students (BMS)             5     15.67   3.13
Raters (JMS)               1     0.00    0.00
Students x Raters (EMS)    5     2.00    0.40
Total                      11    17.67

2-way reliability R = 6 (3.13 - 0.40) / [6 (3.13) + 0.00 - 0.40] = 0.89
ICC (single rating) = (3.13 - 0.40) / [3.13 + 0.40 + 2 (0.00 - 0.40)/6] = 0.80
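A SAS sketch of the same computation from the mean squares (assuming the standard two-way random-effects formulas for a single rating and for the mean of the k = 2 ratings; these reproduce the 0.80 and 0.89 above):

data icc_2way_random;
  bms = 3.13;  jms = 0.00;  ems = 0.40;   /* mean squares from the ANOVA table */
  n = 6;  k = 2;                          /* 6 students, 2 raters              */
  icc_single = (bms - ems) / (bms + (k - 1)*ems + k*(jms - ems)/n);   /* 0.80 */
  rel_mean   = n*(bms - ems) / (n*bms + jms - ems);                   /* 0.89 */
  put icc_single= 6.2 rel_mean= 6.2;
run;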

Responses of Students to Two Questions about Their Health 1= Julian (Good, Very Good) 2= Narissa (Very Good, Excellent) 3= Alina (Good, Good) 4= Greg (Fair, Poor) 5= Linda (Excellent, Very Good) 6= Caroline (Fair, Fair) (Target = 6 students; assessed by 2 items)

Two-Way Mixed Effects (Cronbach's Alpha)

Source                       df    SS      MS
Respondents (BMS)            5     15.67   3.13
Items (JMS)                  1     0.00    0.00
Respondents x Items (EMS)    5     2.00    0.40
Total                        11    17.67

Alpha = (3.13 - 0.40) / 3.13 = 2.73 / 3.13 = 0.87
ICC (single item) = (3.13 - 0.40) / (3.13 + 0.40) = 0.77

Satisfaction of 12 Family Members with 6 Students (2 per student) 1. Julian (fam1: Good, fam2: Very Good) 2. Narissa (fam3: Very Good, fam4: Excellent) 3. Alina (fam5: Good, fam6: Good) 4. Greg (fam7: Fair, fam8: Poor) 5. Linda (fam9: Excellent, fam10: Very Good) 6. Caroline (fam11: Fair, fam12: Fair) (Target = 6 students; assessed by 2 family members each)

One-Way ANOVA (Reliability of Ratings of Students)

Source               df    SS      MS
Respondents (BMS)    5     15.67   3.13
Within (WMS)         6     2.00    0.33
Total                11    17.67

1-way reliability = (3.13 - 0.33) / 3.13 = 2.80 / 3.13 = 0.89
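For reference, a companion SAS sketch (again assuming the standard single-rating and k-rating formulas) computes the one-way and two-way mixed coefficients from the same mean squares, matching the 0.89, 0.87, and 0.77 on the preceding slides:

data icc_other;
  bms = 3.13;  wms = 0.33;  ems = 0.40;  k = 2;
  rel_oneway = (bms - wms) / bms;                    /* 0.89: mean of 2 ratings, one-way */
  icc_oneway = (bms - wms) / (bms + (k - 1)*wms);    /* 0.81: single rating, one-way     */
  alpha      = (bms - ems) / bms;                    /* 0.87: Cronbach's alpha           */
  icc_mixed  = (bms - ems) / (bms + (k - 1)*ems);    /* 0.77: single item, two-way mixed */
  put rel_oneway= 6.2 icc_oneway= 6.2 alpha= 6.2 icc_mixed= 6.2;
run;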

Standardized Alpha for Different Numbers of Items and Average Inter-Item Correlation

alpha_st = (k × r̄) / (1 + (k - 1) × r̄)

k = number of items; r̄ = average inter-item correlation (the table of example values is not reproduced here)
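A brief SAS sketch of the standardized-alpha formula (the k and r̄ values below are arbitrary illustrations), showing how alpha rises with more items and with higher average inter-item correlation:

data alpha_st;
  do k = 2, 5, 10;          /* number of items                */
    do r = 0.2, 0.4, 0.6;   /* average inter-item correlation */
      alpha = (k * r) / (1 + (k - 1) * r);
      output;
    end;
  end;
run;

proc print data=alpha_st noobs;
  var k r alpha;
run;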

Spearman-Brown Prophecy Formula

alpha_y = (N × alpha_x) / (1 + (N - 1) × alpha_x)

N = how much longer scale y is than scale x

Example Spearman-Brown Calculation
Estimating the reliability of the MHI-18 from the MHI-32:

alpha = (18/32 × 0.98) / (1 + (18/32 - 1) × 0.98) = 0.96
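The same calculation in a small SAS sketch (values taken from the example above):

data sb;
  alpha_x = 0.98;    /* reliability of the MHI-32                 */
  n       = 18/32;   /* the MHI-18 is 18/32 as long as the MHI-32 */
  alpha_y = (n * alpha_x) / (1 + (n - 1) * alpha_x);   /* about 0.96 */
  put alpha_y= 6.2;
run;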

Number of Items and Reliability: Three Versions of the Mental Health Inventory (MHI)

Measure   Number of Items   Completion Time (min.)   Reliability
MHI-32    32                                         .98
MHI-18    18                                         .96
MHI-5     5                 1 or less                .90

Data from McHorney et al. 1992

Multitrait Scaling Analysis
– Internal consistency reliability
– Item convergence
– Item discrimination

Item-scale correlation matrix (example matrices shown as figures; not reproduced here)

Validity
Does the instrument measure what it is supposed to measure?
A "validated" instrument is a holy grail (validation is an ongoing process).

Reliability and Validity

Threats to Validity
– Socially Desirable Response Set
– Acquiescent Response Set

Listed below are a few statements about your relationships with others. How much is each statement TRUE or FALSE for you?
1. I am always courteous even to people who are disagreeable.
2. There have been occasions when I took advantage of someone.
3. I sometimes try to get even rather than forgive and forget.
4. I sometimes feel resentful when I don't get my way.
5. No matter who I'm talking to, I'm always a good listener.
(Response options: Definitely true; Mostly true; Don't know; Mostly false; Definitely false)

Two Types of Validity
– Content Validity (includes face validity)
– Construct Validity (many synonyms)

Content Validity
Does the measure adequately represent the domain?
– Do the items operationalize the concept?
– Do the items cover all aspects of the concept?
– Does the scale name represent the item content?
Face validity is the extent to which a measure "appears" to reflect what it is intended to measure (e.g., as judged by experts or patient focus groups).

Construct Validity Do scores on a measure relate to other variables in ways consistent with hypotheses?

Evaluating Construct Validity

Scale                  Age          Obesity     ESRD   Nursing Home Resident
Physical Functioning   Medium (-)   Small (-)          Large (-)
Depressive Symptoms    ?            Small (+)   ?      Medium (+)

Cohen effect-size rules of thumb (d = 0.2, 0.5, and 0.8) converted to the correlation metric:
small correlation = 0.10, medium correlation = 0.24, large correlation = 0.37

r = d / (d² + 4)^0.5 = 0.8 / (0.64 + 4)^0.5 = 0.8 / (4.64)^0.5 = 0.8 / 2.154 = 0.37

(Beware: r's of 0.10, 0.30, and 0.50 are often cited as small, medium, and large.)
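A short SAS sketch of the d-to-r conversion used above (the three d values are Cohen's benchmarks):

data d_to_r;
  do d = 0.2, 0.5, 0.8;
    r = d / sqrt(d**2 + 4);   /* 0.10, 0.24, 0.37 */
    output;
  end;
run;

proc print data=d_to_r noobs;
  var d r;
run;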

Relative Validity Analyses
– A form of "known groups" validity
– Relative sensitivity of a measure to an important clinical difference
– One-way between-group ANOVA

Relative Validity Example
Group means, F-ratios, and relative validity for Scale #1, Scale #2, and Scale #3 by severity of heart disease (None, Mild, Severe); table values not reproduced.

Responsiveness to Change
HRQOL measures should be responsive to interventions that change HRQOL.
Need external indicators of change ("anchors").

Self-Report Indicator of Change
Overall, has there been any change in your asthma since the beginning of the study?
– Much improved; Moderately improved; Minimally improved
– No change
– Minimally worse; Moderately worse; Much worse

Clinical Indicator of Change
– "Changed" group = seizure free (100% reduction in seizure frequency)
– "Unchanged" group = <50% change in seizure frequency

Responsiveness Indices
(1) Effect size (ES) = D / SD
(2) Standardized Response Mean (SRM) = D / SD†
(3) Guyatt responsiveness statistic (RS) = D / SD‡
D = raw score change in "changed" group; SD = baseline SD; SD† = SD of D; SD‡ = SD of D among "unchanged"
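A SAS sketch of the three indices using hypothetical summary statistics (the change of 5 points and the three SDs below are assumed values, not from the slides):

data responsiveness;
  d            = 5;    /* mean raw score change in the "changed" group */
  sd_baseline  = 10;   /* SD of scores at baseline                     */
  sd_change    = 8;    /* SD of change scores in the "changed" group   */
  sd_unchanged = 6;    /* SD of change scores in the "unchanged" group */
  es  = d / sd_baseline;    /* effect size                     */
  srm = d / sd_change;      /* standardized response mean      */
  rs  = d / sd_unchanged;   /* Guyatt responsiveness statistic */
  put es= 6.2 srm= 6.2 rs= 6.2;
run;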

Effect Size Benchmarks
– Small: 0.20–0.49
– Moderate: 0.50–0.79
– Large: 0.80 or above

Minimally Important Difference (MID)
External anchors:
– Self-report
– Provider report
– Clinical measure
– Intervention
The anchor should be correlated with change on the target measure (at a minimum threshold or higher) and should indicate "minimal" change.

Change in Physical Function
Baseline = 100 (U.S. males: mean = 87, SD = 20)
– Hit by bike: limited a lot in vigorous activities, limited a little in moderate activities, and limited a lot in climbing several flights of stairs; physical functioning drops to 75 (-1.25 SD).
– Hit by rock: limited a little in vigorous activities; physical functioning drops to 95 (-0.25 SD).

Example with Multiple Anchors
693 RA clinical trial participants were evaluated at baseline and 6 weeks post-treatment. Five anchors:
1. patient global self-report;
2. physician global report;
3. pain self-report;
4. joint swelling;
5. joint tenderness
Kosinski, M., et al. (2000). Determining minimally important changes in generic and disease-specific health-related quality of life questionnaires in clinical trials of rheumatoid arthritis. Arthritis and Rheumatism, 43.

Patient and Physician Global Reports
How are you (is the patient) doing, considering all the ways that RA affects you (him/her)?
– Very good (asymptomatic and no limitation of normal activities)
– Good (mild symptoms and no limitation of normal activities)
– Fair (moderate symptoms and limitation of normal activities)
– Poor (severe symptoms and inability to carry out most normal activities)
– Very poor (very severe symptoms that are intolerable and inability to carry out normal activities)
--> Improvement of 1 level over time

Global Pain, Joint Swelling, and Tenderness
– Global pain: 0 = no pain, 10 = severe pain
– Number of swollen and tender joints
--> 1–20% improvement over time

Effect Sizes (mean = 0.34) for SF-36 Changes Linked to Minimal Change in Anchors
Rows: PF, Role-P, Pain, GH, EWB, Role-E, SF, EF, PCS, MCS; columns: Self-Report, Clinician Report, Pain, Swelling, Tenderness, Mean (cell values not reproduced).

Appendix: ANOVA Computations
A. Students SS = (7² + 9² + 6² + 3² + 9² + 4²)/2 - 38²/12 = 136.00 - 120.33 = 15.67
B. Rater/Item SS = (19² + 19²)/6 - 38²/12 = 120.33 - 120.33 = 0.00
C. Total SS = (3² + 4² + 4² + 5² + 3² + 3² + 2² + 1² + 5² + 4² + 2² + 2²) - 38²/12 = 138.00 - 120.33 = 17.67
Student x Item SS = C - (A + B) = 17.67 - 15.67 - 0.00 = 2.00

options ls=130 ps=52 nocenter;
options nofmterr;

* Long format: one record per student x rater; ratings (1-5) from the six-student example above;
data one;
  input id 1-2 rater 4 rating 5;
CARDS;
01 13
01 24
02 14
02 25
03 13
03 23
04 12
04 21
05 15
05 24
06 12
06 22
;
run;
**************;

proc freq;  tables rater rating;  run;
*******************;
proc means;  var rater rating;  run;
*******************************************;
* Two-way ANOVA: students (id), raters, and their interaction;
proc anova;
  class id rater;
  model rating = id rater id*rater;
run;
*******************************************;

data one;
  input id 1-2 rater 4 rating 5;
CARDS;
01 13
01 24
02 14
02 25
03 13
03 23
04 12
04 21
05 15
05 24
06 12
06 22
;
run;
******************************************************************;
%GRIP(indata=one, targetv=id, repeatv=rater, dv=rating,
      type=1, t1=test of GRIP macro, t2=);

GRIP macro is available at:

* Wide format: one record per student; rater1 rating in column 4, rater2 rating in column 5;
data one;
  input id 1-2 rater1 4 rater2 5;
  control = 1;
CARDS;
01 34
02 45
03 33
04 21
05 54
06 22
;
run;
**************;
* DUMMY appends records (data lines not shown in the source) so that;
* every rating category appears in the rater1*rater2 cross-tab;
DATA DUMMY;
  INPUT id 1-2 rater1 4 rater2 5;
CARDS;
RUN;

DATA NEW;
  SET ONE DUMMY;
PROC FREQ;
  * AGREE requests kappa and weighted kappa;
  TABLES CONTROL*RATER1*RATER2 / NOCOL NOROW NOPERCENT AGREE;
RUN;
*******************************************;
data one;
  set one;
*****************************************;
proc means;  var rater1 rater2;  run;
*******************************************;
* Cronbach's alpha across the two ratings;
proc corr alpha;  var rater1 rater2;  run;