Class 5 Additional Psychometric Characteristics: Validity and Bias, Responsiveness, Sensitivity to Change October 22, 2009 Anita L. Stewart Institute.

Class 5 Additional Psychometric Characteristics: Validity and Bias, Responsiveness, Sensitivity to Change October 22, 2009 Anita L. Stewart Institute for Health & Aging University of California, San Francisco

Overview Types of validity in health assessment Focus on construct validity How bias affects validity Socially desirable responding and culture as sources of bias Sensitivity to change

Validity Does a measure (or instrument) measure what it is supposed to measure? And… Does a measure NOT measure what it is NOT supposed to measure?

Valid Scale? No! There is no such thing as a “valid” scale We accumulate “evidence” of validity in a variety of populations in which it has been tested Similar to reliability

Validation of Measures is an Iterative, Lengthy Process Accumulation of evidence Different samples Longitudinal designs

Types of Measurement Validity Content Criterion Construct Convergent Discriminant Convergent/discriminant All can be: Concurrent Predictive

Content Validity: Relevant when writing items Extent to which a set of items represents the defined concept

Relevance of Content Validity to Selecting Measures “Conceptual adequacy” Does the “candidate” measure adequately represent the concept YOU intend to measure?

Content Validity Appropriate at Two Levels Battery or instrument: Are all relevant domains represented in the instrument? Measure: Are all aspects of a defined concept represented in the items of a scale?

Example of Content Validity of Instrument You are studying health-related quality of life (HRQL) in clinical depression Your HRQL concept includes sleep problems, ability to work, and social functioning The SF-36 is a candidate, but it is missing sleep problems

Types of Measurement Validity Content Criterion Construct Convergent Discriminant Convergent/discriminant All can be: Concurrent Predictive

Criterion Validity How well a measure correlates with another measure considered to be an accepted standard (criterion) Can be Concurrent Predictive

Criterion Validity of Self-reported Health Care Utilization Compare self-report with “objective” data (computer records of utilization) # MD visits past 6 months (self-report) correlated .64 with computer records # hospitalizations past 6 months (self-report) correlated .74 with computer records Ritter PL et al, J Clin Epid, 2001;54:136-141

Criterion Validity of Screening Measure Develop depression screening tool to identify persons likely to have disorder Do clinical assessment only on those who screen “likely” Criterion validity Extent to which the screening tool detects (predicts) those with disorder sensitivity and specificity, ROC curves
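The screening logic above can be sketched numerically. A toy illustration with invented counts (not data from any actual study), showing how sensitivity and specificity are computed from a 2x2 table of screener result versus clinical diagnosis:

```python
# Hypothetical illustration of criterion validity for a screener.
# All counts are invented for the example.

def sensitivity_specificity(tp, fp, fn, tn):
    """Sensitivity = TP / (TP + FN); Specificity = TN / (TN + FP)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Invented 2x2 table: 80 true positives, 30 false positives,
# 20 false negatives, 170 true negatives.
sens, spec = sensitivity_specificity(tp=80, fp=30, fn=20, tn=170)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Varying the screener's cutoff and plotting sensitivity against 1 - specificity at each cutoff gives the ROC curve mentioned above.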

Criterion Validity of Measure to Predict Outcome If goal is to predict health or other outcome Extent to which the measure predicts the outcome Example: Develop self-reported war-related stress measure to identify vets at risk of PTSD How well does it predict subsequent PTSD (Vogt et al., 2004, readings)

Types of Measurement Validity Content Criterion Construct Convergent Discriminant Convergent/discriminant All can be: Concurrent Predictive

Construct Validity Basics Does the measure relate to other measures in hypothesized ways? Do measures “behave as expected”? 3-step process: (1) state the hypothesis (direction and magnitude); (2) calculate the correlations; (3) do the results confirm the hypothesis?
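The 3-step process can be sketched in code. This is a minimal illustration with invented scores; the r ≥ .30 threshold stands in for a stated hypothesis about magnitude and is an assumption of the example, not a fixed rule:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Step 1: hypothesis -- depression should correlate positively and
# at least moderately (assumed here: r >= .30) with a
# psychosocial-problems score.
# Step 2: calculate the correlation (toy data, invented).
depression = [10, 14, 8, 20, 16, 12, 18, 6]
problems   = [12, 15, 9, 22, 14, 13, 19, 8]
r = pearson_r(depression, problems)

# Step 3: do the results confirm the hypothesis?
supported = r >= 0.30
print(f"r = {r:.2f}, hypothesis supported: {supported}")
```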

Sources of Hypotheses in Construct Validity Prior literature in which associations between constructs have been observed (e.g., other samples, with other measures of the constructs you are testing) Theory that specifies how constructs should be related Clinical experience

Who Tests for Validity? When measure is being developed, investigators should test construct validity As measure is applied, results of other studies provide information that can be used as evidence of construct validity

Types of Measurement Validity Content Criterion Construct Convergent Discriminant Convergent/discriminant All can be: Concurrent Predictive

Convergent Validity Hypotheses stated as expected direction and magnitude of correlations “We expect X measure of depression to be positively and moderately correlated with two measures of psychosocial problems” The higher the depression, the higher the level of problems on both measures

Testing Validity of Expectations Regarding Aging Measure Hypothesis 1: ERA-38 total score would correlate moderately with ADLS, PCS, MCS, depression, comorbidity, and age Hypothesis 2: Functional independence scale would show strongest associations with ADLs, PCS, and comorbidity Sarkisian CA et al. Gerontologist. 2002;42:534

Testing Validity of Expectations Regarding Aging Measure Hypothesis 1: ERA-38 total score would correlate moderately with ADLS, PCS, MCS, depression, comorbidity, and age (convergent) Hypothesis 2: Functional independence scale would show strongest associations with ADLs, PCS, and comorbidity Sarkisian CA et al. Gerontologist. 2002;42:534

ERA-38 Convergent Validity Results: Hypothesis 1
                      ERA      Functional Independence
ADL                   .19**    .20***
PCS-12                .27**    .32***
MCS-12                .35**    .30**
Comorbidity          -.09*     ns
Depressive symptoms  -.33**   -.28**
Age                  -.24**   -.14**

ERA-38: Non-Supporting Convergent Validity Results
                      ERA      Functional Independence
ADL                   .19**    .20***
PCS-12                .27**    .32***
MCS-12                .35**    .30**
Comorbidity          -.09*     ns
Depressive symptoms  -.33**   -.28**
Age                  -.24**   -.14**

Types of Measurement Validity Content Criterion Construct Convergent Discriminant Convergent/discriminant All can be: Concurrent Predictive

Discriminant Validity: Known Groups Does the measure distinguish between groups known to differ in concept being measured? Tests for mean differences between groups

Example of a Known Groups Validity Hypothesis Among three groups: General population Patients visiting providers Patients in a public health clinic Hypothesis: scores on functioning and well-being measures will be the best in a general population and the worst in patients in a public health clinic
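A minimal sketch of the known-groups logic, using invented functioning scores (0-100, higher = better) for the three groups; a real analysis would also test the mean differences statistically (e.g., ANOVA):

```python
from statistics import mean

# Invented functioning scores for three groups hypothesized to differ.
general_pop  = [92, 88, 95, 90, 85]   # general population
provider_pts = [80, 75, 78, 82, 70]   # patients visiting providers
clinic_pts   = [55, 48, 60, 50, 52]   # public health clinic patients

means = [mean(g) for g in (general_pop, provider_pts, clinic_pts)]

# Hypothesis: general population > provider patients > clinic patients
supported = means[0] > means[1] > means[2]
print(f"group means = {means}, hypothesis supported: {supported}")
```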

Mean Scores on MOS 20-item Short Form in Three Groups
                     General      MOS        Public health
                     population   patients   patients
Physical function    91           78         50
Role function        88           78         39
Mental health        78           73         59
Health perceptions   74           63         41
Bindman AB et al., Med Care 1990;28:1142

PedsQL Known Groups Validity Hypothesis: PedsQL scores would be lower in children with a chronic health condition than without
Child report, mean (SD):
                   Total score   Emotional functioning
Chronically ill*   77 (16)       76 (22)
Acutely ill*       79 (14)       77 (20)
Healthy            83 (15)       81 (20)
ANOVA, p = .001; * different from healthy children, p < .05
JW Varni et al. PedsQL™ 4.0: Reliability and Validity of the Pediatric Quality of Life Inventory™ …, Med Care, 2001;39:800-812.

Types of Measurement Validity Content Criterion Construct Convergent Discriminant Convergent/discriminant All can be: Concurrent Predictive

Convergent/Discriminant Validity Does measure correlate lower with measures it is not expected to be related to … than to measures it is expected to be related to? The extent to which the pattern of correlations conforms to hypothesis is confirmation of construct validity

Basis for Convergent/Discriminant Hypotheses All measures of health will correlate to some extent Hypothesis is of relative magnitude

Example of Convergent/Discriminant Validity Hypothesis Expected pattern of relationships: A measure of physical functioning is “hypothesized” to be more highly related to a measure of mobility than to a measure of depression

Example of Convergent/Discriminant Validity Evidence Pearson correlations with physical functioning: Mobility .57, Depression .25

Testing Validity of Expectations Regarding Aging Measure Hypothesis 1: ERA-38 total score would correlate moderately with ADLS, PCS, MCS, depression, comorbidity, and age (convergent) Hypothesis 2: Functional independence scale would show strongest associations with ADLs, PCS, and comorbidity (convergent/discriminant) Sarkisian CA et al. Gerontologist. 2002;42:534

ERA-38 Convergent/Discriminant Validity Results: Hypothesis 2
                      ERA      Functional Independence
ADL                   .19**    .20***
PCS-12                .27**    .32***
MCS-12                .35**    .30**
Comorbidity          -.09*     ns
Depressive symptoms  -.33**   -.28**
Age                  -.24**   -.14**

ERA-38: Non-Supporting Validity Results
                      ERA      Functional Independence
ADL                   .19**    .20***
PCS-12                .27**    .32***
MCS-12                .35**    .30**
Comorbidity          -.09*     ns
Depressive symptoms  -.33**   -.28**
Age                  -.24**   -.14**

Construct Validity Thoughts: Lee Sechrest There is no point at which construct validity is established It can only be established incrementally Our attempts to measure constructs help us better understand and revise these constructs Sechrest L, Health Serv Res, 2005;40(5 part II), 1596

Construct Validity Thoughts: Lee Sechrest (cont) “An impression of construct validity emerges from examining a variety of empirical results that together make a compelling case for the assertion of construct validity”

Construct Validity Thoughts: Lee Sechrest (cont) Because of the wide range of constructs in the social sciences, many of which cannot be exactly defined.. …once measures are developed and in use, we must continue efforts to understand them and their relationships to other measured variables.

Interpreting Validity Coefficients Magnitude and conformity to hypothesis are important, not statistical significance Nunnally: coefficients rarely exceed .30 to .40, which may be adequate (1994, p. 99) McDowell and Newell: typically between .40 and .60 (1996, p. 36) Maximum correlation between 2 measures = square root of the product of their reliabilities; 2 scales with .70 reliabilities have a maximum correlation of .70, so a correlation of .60 would be “high” (McHorney 1993; Nunnally)
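The reliability ceiling quoted above can be computed directly. A small sketch of the formula:

```python
import math

def max_observed_correlation(rel_x, rel_y):
    """Upper bound on the observed correlation between two measures,
    given their reliabilities: sqrt(rel_x * rel_y)."""
    return math.sqrt(rel_x * rel_y)

# Two scales with .70 reliabilities: ceiling is .70, so an observed
# r of .60 is about 86% of the maximum possible correlation.
ceiling = max_observed_correlation(0.70, 0.70)
print(f"ceiling = {ceiling:.2f}, .60 is {0.60 / ceiling:.0%} of maximum")
```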

Overview Types of validity in health assessment Focus on construct validity How bias affects validity Socially desirable responding and culture as sources of bias Sensitivity to change

Components of an Individual’s Observed Item Score (from Class 3) Observed item score = true score + random error + systematic error

Random versus Systematic Error Observed item score = true score + random error (relevant to reliability) + systematic error (relevant to validity)

Bias is Systematic Error Affects validity of scores If scores contain systematic error, we cannot know the “true” mean score; the observed score will be systematically higher or lower than the “true” score
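A small simulation, with invented parameter values, illustrates the difference between the two error components: random error averages out over many observations, but systematic error (bias) does not:

```python
import random
from statistics import mean

random.seed(0)  # deterministic for the example

true_score = 50.0
bias = -5.0          # systematic error, e.g. consistent underreporting
random_error_sd = 3  # random error, assumed Gaussian

# Observed score = true score + systematic error + random error
observed = [true_score + bias + random.gauss(0, random_error_sd)
            for _ in range(10_000)]

# The mean lands near 45, not 50: averaging removes random error
# but leaves the bias fully intact.
print(f"mean observed score = {mean(observed):.2f}")
```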

“Bias” or “Systematic Error”? Bias implies that the direction of the error is known Systematic error is direction-neutral The same error applies to the entire sample

Sources of “Systematic Error” in Observed Scores of Individuals Respondent: socially desirable responding, acquiescent response bias, cultural beliefs (e.g., not reporting distress), halo effects Observer: belief that respondent is ill Instrument

Socially Desirable Responding Tendency to respond in socially desirable ways to present oneself favorably Observed score is consistently lower or higher than true score in the direction of a more socially acceptable score

Socially Desirable Response Set – Looking “Good” After coming up with an answer to a question, the respondent “screens” the answer: “Will this answer make the person like me less?” and may “edit” the answer Example of systematic underreporting of “risk” behavior: a woman has 2 drinks of alcohol a day, but responds that she drinks a few times a week

Ways to Minimize Socially Desirable Responding Write items and instructions to increase “acceptability” of an “undesirable” response Instead of: “Have you followed your doctor’s recommendations?” Use: “Have you had any of the following problems following your doctor’s recommendations?”

Acquiescent Response Set Tendency to agree with statements regardless of content give “positive” response such as yes, true, satisfied Extent and nature of bias depends on direction of wording of the questions Minimizing acquiescence: Include positively- and negatively-worded items in the same scale

Example of Systematic Error Due to Cultural Norms or Beliefs A person feels sad “most of the time” Unwilling to admit this to the interviewer so answers “a little of the time” Not culturally appropriate to admit to negative feelings Always present a positive personality Observed response reflects less sadness than “true” sadness of respondent

Discrepancies in Information Sources: Systematic Error or Different Perspectives? In reporting on a patient’s well-being: patients report the highest levels, clinicians report levels in the middle, family members report the lowest levels There is no way to know which is the “true” score; to say one score is “biased” implies another one is the “true” score

Overview Types of validity in health assessment Focus on construct validity How bias affects validity Socially desirable responding and culture as sources of bias Sensitivity to change

Sensitivity to Change: Two Issues Measure able to detect true changes One knows how much change is meaningful on the measure

Measure Able to Detect True Change Sensitive to true differences or changes in the attribute being measured Sensitive enough to measure differences in outcomes that might be expected given the relative effectiveness of treatments Ability of a measure to detect change statistically

Importance of Sensitivity Need to know a measure can detect true change if planning to use it as the outcome of an intervention Approaches for testing sensitivity are often simultaneous tests of the effectiveness of an intervention and the sensitivity of the measures

Measuring Sensitivity Score is stable in those who are not changing Score changes in those who are actually changing (true change) One method: identify groups “known” to change, and compare changes in the measure across these groups

Sensitivity to Change Evidence for PHQ-9 (Short Screener for Depression) Classified patients with major depression (DSM-IV criteria) over time as: persistent depression, partial remission, full remission Examined PHQ-9 change scores in these “known groups” Löwe B et al. Med Care, 2004;42:1194-1201

Changes in PHQ-9 Scores by Change in Depression at 6 Months
                        Mean change   Effect size
Persistent depression    -4.4          -0.9
Partial remission        -8.8          -1.8
Full remission          -13.0          -2.6
Löwe et al, 2004, p. 1200

Considerations in Developing CHAMPS Physical Activity (PA) Questionnaire Needed an outcome measure to detect PA changes due to the CHAMPS lifestyle intervention: increase PA levels in everyday life (e.g., walking, stretching) in activities of their choice Existing measures were designed to capture younger persons’ PA Stewart AL et al. Med Sci Sports Exerc, 2001;33:1126-1141.

Changes in Measure Resulting from Intervention: Validity Evidence for Others After the CHAMPS intervention detected PA change, others used our results as evidence of “sensitivity to change” Used in Project ACTIVE because of its sensitivity to change in CHAMPS (S Wilcox et al, Am J Pub Health, 2006;96:1201-1209)

Sensitivity to Change: Two Issues Measure able to detect true changes One knows how much change is meaningful on the measure

Relevant or Meaningful Change Is the observed change important? To clinician: change might influence patient management To patient: patient notices change amount of change matters

“Minimal Important Difference” (MID) The minimal difference that would result in a change in treatment The smallest change perceived by patients as beneficial

Two Basic Approaches to Estimate MID Anchor-based methods: require an external criterion of change Distribution-based methods: statistical indicators of change

Anchor-Based Approaches to Estimating MID Requires longitudinal studies Criteria: Clinical endpoints Patient-rated global improvement Some combination M Liang, 2000

Example of Anchor-Based Approach Identify a subgroup in a study that has changed by a “minimal” amount (clinical change, patient-reported change) The change score in a relevant health measure for this subgroup = MID M Liang, 2000

Locating Groups that Have Changed “Minimally” Administer a global rating of change (perceived change) by patients – the anchor Select the subset that reported “somewhat better” or “somewhat worse”; the change in a relevant health measure for this subset = MID
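The anchor-based logic can be sketched with invented change scores, grouped by the patients' global rating of change:

```python
from statistics import mean

# Invented change scores on a health measure, grouped by the anchor
# (patients' global rating of change since treatment).
change_by_anchor = {
    "much better":     [12, 10, 14],
    "somewhat better": [5, 4, 6, 5],    # "minimal" improvement group
    "about the same":  [1, 0, -1, 2],
    "somewhat worse":  [-4, -5, -3],    # "minimal" worsening group
    "much worse":      [-11, -9],
}

# MID estimate = mean change score in the minimally changed subset.
mid_improve = mean(change_by_anchor["somewhat better"])
mid_worsen  = mean(change_by_anchor["somewhat worse"])
print(f"MID (improvement) = {mid_improve}, MID (worsening) = {mid_worsen}")
```

Note that the improvement and worsening estimates are kept separate, consistent with the point below that the MID may depend on the direction of change.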

Two Categories Can Define “Minimal Change” Groups Since your surgery, how would you rate the amount of change in your physical functioning? Much worse Somewhat worse About the same Somewhat better Much better

Meaning of Change Depends on Direction of Change A change for the better may result in a different MID than a change for the worse May need to evaluate these as separate estimates

Example: Mean 2-week Change Score in Symptom Measure by Perceived Change
                   Mean change
Much better         2.25
A little better     1.41  (minimal positive change?)
About the same      0.42
A little worse     -0.29  (minimal negative change?)
Much worse         -0.10
C Paterson. BMJ, 1996;312:1016-20.

Distribution-Based Methods Ways of expressing the observed change in a standardized metric Three commonly used: Effect size (ES) = mean change divided by SD at baseline Standardized response mean (SRM) = mean change divided by SD of changes Responsiveness statistic (RS) = mean change divided by SD of change for people who have not changed
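The three statistics can be sketched directly from their definitions, using invented data:

```python
from statistics import mean, stdev

def effect_size(changes, baseline_scores):
    """ES: mean change / SD of scores at baseline."""
    return mean(changes) / stdev(baseline_scores)

def standardized_response_mean(changes):
    """SRM: mean change / SD of the change scores."""
    return mean(changes) / stdev(changes)

def responsiveness_statistic(changes, stable_changes):
    """RS: mean change / SD of change in people who have not changed."""
    return mean(changes) / stdev(stable_changes)

# Invented data for the example.
baseline       = [40, 45, 50, 55, 60]  # scores at baseline
changes        = [8, 10, 12, 9, 11]    # change in patients who improved
stable_changes = [1, -1, 2, 0, -2]     # change in "about the same" group

es  = effect_size(changes, baseline)
srm = standardized_response_mean(changes)
rs  = responsiveness_statistic(changes, stable_changes)
print(f"ES = {es:.2f}, SRM = {srm:.2f}, RS = {rs:.2f}")
```

The three denominators differ, so the same mean change can look small or large depending on which statistic is reported.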

Mean 4-week Change Score in Four Measures and Responsiveness Statistic
             Patients “about the same”   Patients “a little better”   Responsiveness statistic
Symptom 1        0.58                        1.64                         1.14
Activity         0.46                        1.33
Well-being       0.39                        0.68
Note: scores range from 1-7; higher change scores indicate improvement
C Paterson. BMJ, 1996;312:1016-20.

Multi-item Measures: More Likely to Detect Change Instrument needs to have sufficient variability to detect change Multi-item scales: many scale levels Look for evidence of good variability in sample like yours (at baseline) Room to improve

Effect Size of Changes in Health Due to Treatment for Menstrual Bleeding
                                    Drugs   Surgery
Self-rated health item              -.18    -.10
Health perceptions scale (5 items)  -.03    -.64
Energy/vitality                     -.23    -.89
Mental health                       -.14    -.65
Pain                                -.12    -.73
Effect size = standardized change score; ES = 1 is 1 SD change (.20 small, .50 moderate, .80 large)
C Jenkinson et al. Qual Life Res, 1994;3:317-321.

Summary: MID of Measures MID is based on evidence from multiple studies Over time, learn whether evidence is strong for a particular MID MID of a measure in one context may not generalize to another one e.g. MID for treatment of pain in cancer may differ from MID for treatment of back pain

Readings as a Resource Farivar et al.: issues in measuring MID Stewart et al.: methods for assessing validity (as developed for the Medical Outcomes Study) Sechrest: classic commentary on validation issues

Next Class (Class 5) Factor analysis with Steve Gregorich

Homework Complete rows 20-26 in matrix for your two measures Validity, responsiveness and sensitivity to change, scoring, and costs