Class 9 and 10 Interpreting Pretest Data Considerations in Modifying Measures Testing Scales and Creating Scores Creating and Presenting Change Scores December 3, 2009 Anita L. Stewart Institute for Health & Aging University of California, San Francisco
Overview of Class 9 and 10 Analyzing pretest data Modifying/adapting measures Keeping track of your study measures Testing basic psychometric properties, creating summated ratings scales, and presenting measurement information Creating and presenting change scores
Tasks in Analyzing Pretest Data For each item Tabulate problems Determine importance of problems Results become basis for possible revisions/adaptations
Methods of Summarizing Problems Optimal: transcripts of all pretest interviews Problems identified through: standard administration (interviewer or respondent behaviors) and probes Analyze dialogue (narrative) for clues to solve problems
Behavioral Coding: Problems with Standard Administration Systematic approach to identifying problems with items administered by interviewer Interviewer problems Respondent problems Method Listen to taped interview Read transcript
Examples of Interviewer “Behaviors” Indicating Problem Items Question misread or altered Slight change – meaning not affected Major change – alters meaning Question skipped by interviewer
Examples of Respondent “Behaviors” Indicating Problem Items Asked for clarification or repeat of question Indicated did not understand question Qualified answer (e.g., it depends) Indicated answer falls between existing response choices Didn’t know the answer Refused
Behavioral Coding Summary Sheet: Problems with Standard Administration
Item # | Interviewer: difficulty reading | Subject: asks to repeat Q | Subject: asks for clarification
1 | – | – | –
2 | – | – | –
3 | – | – | –
4 | – | – | –
Summarize Behavioral Codes For Each Item Proportion of interviews (respondents) with each problematic behavior # of occurrences of problem divided by N 7/48 respondents requested clarification
Behavioral Coding Summary Sheet: Standard Administration (e.g., n=20)
Item # | Interviewer: difficulty reading | Subject: asks to repeat Q | Subject: asks for clarification
1 | 2/20 | 1/20 | –
2 | – | – | –
3 | 3/20 | – | –
4 | – | – | –
…
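The "2/20"-style entries on the summary sheet are simple proportions: occurrences of a problem behavior divided by the number of interviews. A minimal sketch of that tallying step; the item numbers, behavior labels, and counts below are illustrative, not from the study.

```python
# Illustrative tallies of problem behaviors per item from a pretest of
# n = 20 interviews (item numbers, labels, and counts are made up).
n_interviews = 20

tallies = {  # {item_number: {behavior: times observed}}
    1: {"interviewer_misread": 2, "asked_to_repeat": 1},
    3: {"asked_for_clarification": 3},
}

def problem_rates(tallies, n):
    """Proportion of interviews in which each problem behavior occurred, per item."""
    return {item: {beh: count / n for beh, count in behaviors.items()}
            for item, behaviors in tallies.items()}

rates = problem_rates(tallies, n_interviews)
print(rates[1]["interviewer_misread"])  # 0.1, i.e., 2 of 20 interviews
```

These per-item proportions can then be compared to a "common problem" threshold such as the >15% of interviews mentioned later in the deck.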
Missing Data: Clue to Problems More missing data are associated with unclear, difficult, or irrelevant items Missing-data rates are obtained for self-report administration
How Missing Data Prevalence Helps Items with a large percent of responses missing – clue to a problem In H-CAHPS® pretest: Did hospital staff talk with you about whether you would have the help you needed when you left the hospital? 35% missing for Spanish group 29% missing for English group MP Hurtado et al. Health Serv Res, 2005;40(6, Part II):2140-2161
Behavioral Coding: Problems Identified Through Probes Systematic approach to identifying respondent problems found via probes Method Listen to taped interview Read transcript
Usefulness of Transcript of Probe Interviews Can organize responses of all interview subjects by item (see example handout) Item 1 probe Response by subject 1 Response by subject 2 Etc. Item 2 probe
Results: Probing Meaning of Phrase “I asked you how often doctors ask you about your health beliefs. What does the term ‘health beliefs’ mean to you?” S1 – “…I don’t want medicine” S5 – “…How I feel, if I was exercising…” S7 – “…Like religion? Not believing in going to doctors?”
Results: Beck Depression Inventory (BDI) Cognitive interviews with older adults, oncology patients, and less-educated adults Administered selected BDI items Asked respondents to paraphrase items TL Sentell, Community Mental Health Journal, 2008;39:323
Results: Beck Depression Inventory (BDI) (cont) Depending on the item, 0-62% of respondents correctly paraphrased it Most misunderstandings: vocabulary confusion Phrase: “I am critical of myself for my weaknesses and mistakes” “Critical is when you’re very sick” “I don’t know how to explain mistakes”
Behavioral Coding of Probe Results I asked you how often doctors asked you about your health beliefs. What does the term “health beliefs” mean to you? Behavioral coding: # times response indicated lack of understanding as intended e.g., 2/10 respondents did not understand meaning based on response to probe
Results: Probing Meaning of Phrase On about how many of the past 7 days did you eat foods that are high in fiber, like whole grains, raw fruits, and raw vegetables? Probe: what does the term “high fiber” mean to you? Behavioral coding of standard administration Over half of respondents exhibited a problem Review of answers to probe Over ¼ did not understand the term Blixt S et al., Proceedings of section on survey research methods, American Statistical Association, 1993:1442.
Behavioral Coding Summary: Probes (+ Standard Administration)
Item # | Probe N | Probe: meaning unclear | Interviewer: difficulty reading | Subject: asks to repeat Q | Subject: asks for clarification
1 | 10 | 2/10 | 2/20 | 1/20 | –
2 | – | – | – | – | –
3 | 15 | 4/15 | 3/20 | – | –
4 | – | – | – | – | –
Probes Can Identify Problems Even When No Problem “Behaviors” Found Respondents appear to answer question appropriately during standard administration No problem behavior codes However, problems identified with probes Probe on meaning: Response indicates lack of understanding Probe on use of response options: Response indicates options are problematic
Results: No Behavior Coding Issues but Probe Detected Problems I seem to get sick a little easier than other people (definitely true, mostly true, mostly false, definitely false) Behavioral coding of standard administration Very few problems Review of answers to probe Almost 3/4 had comprehension problems Most problems around term “mostly” (either its true or its not) Blixt S et al., Proceedings of section on survey research methods, American Statistical Association, 1993:1442.
Interpret Behavioral Coding Results Determine if problems are common Items with only a few problems may be fine Quantifying “common” problems: several types of problems (many row entries); several subjects experienced a problem; problem with an item identified in >15% of interviews
Continue Analyzing Items with “Common” Problems Identify “serious” common problems Gross misunderstanding of the question Yields completely erroneous answer Couldn’t answer the question at all Some less serious problems can be addressed by improved instructions or a slight modification
Addressing More Serious Problems Conduct content analysis of transcript Use qualitative analysis software (e.g., NVivo) For these items: review the dialogue that ensued during administration of the item and probes; can reveal source of problems; can help in deciding whether to keep, modify, or drop items
Overview of Class 9 and 10 Analyzing pretest data Modifying/adapting measures Keeping track of your study measures Testing basic psychometric properties, creating summated ratings scales, and presenting measurement information Creating and presenting change scores
Full Talk on Modifying Measures A Framework for Understanding and Discussing Modifications to Measures in Studies of Diverse Groups GSA 2009 RCMAR Preconference Workshop Slides posted on syllabus: “GSA preconference slides”
Overview What is the problem? Why would we modify a measure? What information is used to modify? What are the types of modifications? How should we test modified measures?
When Problems are Found Through Pretesting… Investigators Face a Choice Use the existing measure “as is” to preserve integrity of measure OR Try to modify the measure to address problems in diverse group Dilemma once problems are found…
Argument in Favor of Using Measure “As Is” Modifications can change the measure’s validity and reliability Allows comparison of findings to other research using the measure
Argument Against Using Measure “As Is” When Problems are Found If reliability and validity are poor… Results pertaining to the measure could be erroneous Limited internal validity Erroneous conclusions about the research questions of interest Ability to compare to other research (external validity) is moot
Reasons for Considering Modifying an Existing Measure Key reason Sample/population differs from that in which the original measure was developed Other reasons Measure developed awhile ago Poor format/presentation Study context issues Four basic reasons for modifying; several other reasons have nothing to do with sample differences.
Key Reason: Population Group Differences from Original Mainstream research Different disease, health problem, patient group, age group Research in diverse population groups Different culture, race/ethnic group Lower level of socioeconomic status (SES) Limited English proficiency, lower literacy In mainstream research – the bulk of the literature – modifications tend to be because of differences in the disease or health problem, or patient group differences such as age. e.g. fatigue severity measure modified for MS patients
Reasons: Measure Developed in Prior “Era” Historical events or changes have affected concept definition In all populations Specific to a diverse group Language use out of date Science of self-report not well developed Many existing measures were developed in the 70s and 80s – this has little to do with diverse groups, but rather with historical events or changes in society. Many older measures use phrases that are out of date. Since then, we have learned a great deal about the science of self-reported measures.
Reasons: Poor Format/Presentation = High Respondent Burden Instructions unnecessarily wordy, unclear Way of responding is complicated Difficult to navigate the questionnaire Crowded on the page Hard to track across the page Hard to read Poor contrast, small font Following to some extent from the prior era: many measures were not designed according to good survey design principles – poorly formatted and presented on the page.
Example: Complex Instructions Instructions: There are 12 statements on this form. They are statements about families. You are to decide which of these statements are true of your family and which are false. If you think the statement is TRUE or MOSTLY TRUE of your family, please mark the box in the T (TRUE) column. If you think the statement is FALSE or MOSTLY FALSE of your family, please mark the box in the F (FALSE) column. You may feel that some of the statements are true for some family members and false for others. Mark the box in the T column if the statement is TRUE for most members. Mark the box in the F column if the statement is FALSE for most members. If the members are evenly divided, decide what is the stronger overall impression and answer accordingly. Remember, we would like to know what your family seems like to you. So do not try to figure out how other members see your family, but do give us your general impression of your family for each statement. Do not skip any item. Please begin with the first item. I thought this was a perfect example of how one might need to consider simplifying the instructions. And there are many existing measures with similar problems – if not quite so extreme. Family Environment Scale Instructions
Example: Burdensome Way of Responding For each question, choose from the following alternatives: 0 = Never 1 = Almost Never 2 = Sometimes 3 = Fairly Often 4 = Very Often 1. In the last month, how often have you felt nervous and “stressed”? …… 0 1 2 3 4 2. In the last month, how often have you felt that things were going your way? …… 0 1 2 3 4 Many instruments, in order to save paper I guess, present the response choices along with the instructions, and then ask respondents to use those to answer. In this example, at least the same numbers appear beside the items, but the respondent has to remember the meaning of each number or refer back and forth. As one gets to the bottom of a page of 22 items, this is cumbersome. I’ve also seen instruments where a blank space is provided alongside each item and respondents are supposed to write in the correct response number. S Cohen et al. J Health Soc Beh, 1983;24(4):385-396.
What Information is Used to Decide How to Modify a Measure? Information on conceptual differences in diverse population Including information to make revisions Identified through Qualitative research Published reviews of measures To modify a measure, we need information on which to base the modifications.
Basis: Qualitative Research Methods Focus groups In-depth qualitative interviews Expert panel reviews Standard pretests Cognitive interview pretests
Basis: Qualitative Research – Two Applications Explore concept definition in diverse group Independent of a particular measure Explore a specific measure in diverse group Conceptual adequacy Administration problems Qualitative methods help us learn how a concept is defined by a new group. The results can provide information on: (1) how the concept differs from the original; (2) how to make modifications, i.e., what added information is needed. Qualitative methods can also obtain feedback on an existing measure: (1) can help find problems; (2) often provides solutions to problems
Basis: Published Reviews Increasingly – systematic reviews of how well existing measures work in diverse population groups Summaries across studies Results and recommendations provide basis for specific modifications The RCMARs published one such special issue including several reviews: measures of physical activity, depression, and neighborhood environments in minority populations.
Example: Published Review – Measures of Dietary Intake in Minority Populations Reviewed food frequency questionnaires for use in minority populations Performed well in some groups and poorly in others Group differences that could affect scores: Portion sizes differ Missing ethnic foods Could underestimate total intake and nutrients One example – their suggestions provide a basis for modifying existing FFQs for use in diverse populations RJ Coates et al. Am J Clin Nutr; 1997;65(suppl):1108S-15S.
Types of Modifications Format or presentation Content Dimensions Item stems Response options Most modifications can be classified into CONTENT or FORMAT Three main types of content modifications:
Format/Presentation Modifications Goal: reduce respondent burden Improve appearance or way of responding Simplify instructions Modify format for responding Create more space, reduce crowded items Add illustrations Improve contrast, increase font size
Resource on Formatting Paul Mullin et al, Applying cognitive design principles to formatting HRQOL instruments, Quality of Life Research, 2000;9:13-27.
Types of Modifications Format or presentation Content Dimensions Item stems Response options Add Drop Replace Modify
Content Modification Example: Add Dimension Social support – typical dimensions Tangible, emotional, informational Older Korean/Chinese immigrants – additional dimension Language support Added to existing measure (based on focus group data) Help with translation at medical appointments Help to ask questions in English when on the phone Help to learn English Example by some RCMAR scholars – S Wong et al. Int J Health Human Dev, 2005;61:105-121.
Content Modification Example: Modify Item Stems If wording is unclear, add parenthetical phrases Have you ever been told by a doctor that you have… Diabetes (sugar in blood or high sugar)? Hypertension (high blood pressure)? Anemia (low blood)? Modifying item stems is often needed to meet the needs of those with limited English proficiency, lower levels of education, or limited literacy. This can be done without any substantive changes by simply adding parenthetical phrases as alternatives to words that might not be understood
Adding or Deleting a Response Choice If too few response choices Add an option within existing response scale If too many response choices Drop one option
Content Modification Example: Too Few Response Choices How much is each person (or group of persons) supportive for you at this time in your life. Your wife, husband, or significant other person: - None - Some - A lot Often, for simplicity, measures were designed with only a few response choices. However, given the tendency of many to not endorse either extreme, this provides extremely limited variability. Although not done to my knowledge, this could benefit from the addition of 1-2 more levels. G Parkerson et al. Fam Med; 1991;23:357-60.
Content Modification Example: Replace Response Choices How much is each person (or group of persons) supportive for you at this time in your life. Your wife, husband, or significant other person: Original: None / Some / A lot. Replaced with: Not at all / A little / Moderately / Quite a bit / Extremely
Content Modification Example: Replace Response Choices Health Perceptions Scale in older adults e.g., My health is excellent, I expect my health to get worse Original: 1 - Definitely true 2 - Mostly true 3 - Don’t know 4 - Mostly false 5 - Definitely false Modified: 1 – Not at all true 2 – A little true 3 - Somewhat true 4 - Mostly true 5 – Definitely true Agree/disagree response scales, or bidirectional response scales, are VERY hard for respondents to use. In a study of Thai older adults, this investigator replaced it with a unidimensional response scale, varying in the extent to which the statement was true. L. Thiamwong et al, Thai J Nursing Res, 2008;12(4):286-296.
Minor to Major Modifications? Each type of modification can hypothetically be rated on a continuum from having minor to major impact on reliability and validity of original measure Minor – slight changes in format/presentation …… Major – numerous changes in dimensions, items, and response choices Defining the middle of the continuum is harder than one would think.
Do Not Make Assumptions All modifications, no matter how small, can affect reliability and validity of original measure Do not guess impact of modifications Burden is on investigator to test modified measure But what about several very minor modifications? What we all agree on is that any modifications CAN affect the reliability and validity of the original measure. We therefore suggest that at this stage in modifications research, investigators need to test the modified measures
Recommendations for Testing Modified Measures Pretest modified measure extensively before fielding in new study Build in ability to do psychometric testing when measure is fielded Add validity variables (e.g., similar to original measure to test comparability) Add follow-up to assess test-retest reliability Two key recommendations here 1 – PRETEST 2 – build in ability to conduct the needed analyses to justify the modifications
Analyze Psychometric Adequacy of Modified Measure in New Study Modified measure should meet minimal criteria Item-scale correlations Internal-consistency reliability
Analyzing Modified Measure: Comparability to Original Measure Compare measurement results of modified measure to original measure Reliability (sample dependent) Factor structure Construct validity Sensitivity to change Rather than prove that the modified measure is BETTER than original, is it COMPARABLE. Does factor structure conform to that of the original measure? Does the measure correlate similarly with validity variables as original measure does? Does the modified measure detect as much change over time as the original?
Some Suggestions: Avoid Dropping Items If modifications are only changes in item stems: Retain all original items Add modified items at the end Can test original and modified measure Assumes only a few new items
Overview of Class 9 and 10 Analyzing pretest data Modifying/adapting measures Keeping track of your study measures Testing basic psychometric properties, creating summated ratings scales, and presenting measurement information Creating and presenting change scores
Questionnaire Guides Organizing your survey measures Keep track of measurement decisions Sample guide to measures (class 8) Documents sources of measures Any modifications, reason for modification
Sample “Summary of Survey Variables..” Handout Develop “codebook” of scoring rules Several purposes Variable list Meaning of scores (direction of high score) Special coding How missing data handled Type of variable (helps in analyses)
Overview of Class 9 and 10 Analyzing pretest data Modifying/adapting measures Keeping track of your study measures Testing basic psychometric properties, creating summated ratings scales, and presenting measurement information Creating and presenting change scores
On to Your Field Test or Study What to do once you have your baseline data How to create summated scale scores
Review Surveys for Data Quality Examine each survey in detail as soon as it is returned, and mark any.. Missing data Inconsistent or ambiguous answers Skip patterns that were not followed
Reclaim Missing and Ambiguous Data Go over problems with the respondent If survey is returned in person, review it then If mailed, call the respondent ASAP and go over missing and ambiguous answers If you cannot reach the respondent by telephone, make a copy for your files and mail the survey back with a request to clarify missing data
Print Frequencies of Each Item and Review: Range Checks Verify that responses for each item are within acceptable range Out of range values can be checked on original questionnaire corrected or considered “missing” Sometimes out of range values mean that an item has been entered in the wrong column a check on data entry quality
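The range check described above can be automated before reviewing frequencies. A minimal sketch, assuming responses are stored per item and the valid ranges come from your codebook; the item names and ranges below are made up for illustration.

```python
# Codebook-style valid ranges per item (illustrative names and bounds)
valid_range = {"pain": (0, 10), "satisfaction": (1, 5)}

# Responses per item, in respondent order (made-up data)
responses = {"pain": [3, 5, 11, 0], "satisfaction": [2, 5, 6, 1]}

def out_of_range(responses, valid_range):
    """Return {item: [(respondent_index, value), ...]} for out-of-range values."""
    flags = {}
    for item, values in responses.items():
        lo, hi = valid_range[item]
        bad = [(i, v) for i, v in enumerate(values) if not lo <= v <= hi]
        if bad:
            flags[item] = bad
    return flags

print(out_of_range(responses, valid_range))
# {'pain': [(2, 11)], 'satisfaction': [(2, 6)]}
```

Flagged values can then be checked against the original questionnaire and either corrected or set to missing, as the slide suggests.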
Testing Scaling Properties and Reliability in Your Sample for Multi-Item Scales Obtain item-scale correlations Part of internal consistency reliability program Calculate reliability in your sample (regardless of known reliability in other studies) internal-consistency for multi-item scales test-retest if you obtained it
SAS/SPSS Both Make Item Convergence Analysis Easy Reliability programs provide: Item-scale correlations corrected for overlap Internal consistency reliability (coefficient alpha) Reliability with each item removed To see effect of removing an item
SAS – Obtaining Item-Scale Correlations and Coefficient Alpha
PROC CORR DATA=data-set-name ALPHA;
  VAR list-of-variables;
RUN;
Output: Coefficient alpha Item correlations Item-scale correlations corrected for overlap SAS Manual, Chapter 3: Assessing Scale Reliability with Coefficient Alpha
Testing Reliability in STATA www.stata.com/help.cgi?alpha alpha varlist [if] [in] [, options] NOTE: item-rest correlations are those corrected for overlap
What to Look For Review unstandardized coefficient alpha and item-total or item-scale correlations (corrected for overlap) Each item should correlate at least .30 with total Internal consistency should be at least .70
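For readers without SAS/SPSS/Stata at hand, the quantities these reliability programs report can be sketched in plain Python. This is an illustrative implementation of coefficient alpha and item-scale correlations corrected for overlap, not the packages' own code; the response data are made up.

```python
from statistics import mean, variance

def cronbach_alpha(items):
    """Coefficient alpha. items: one list of responses per item, equal lengths."""
    k = len(items)
    totals = [sum(resp) for resp in zip(*items)]  # per-respondent scale totals
    return (k / (k - 1)) * (1 - sum(variance(i) for i in items) / variance(totals))

def pearson(x, y):
    """Plain Pearson correlation."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def corrected_item_scale(items):
    """Correlate each item with the sum of the OTHER items (overlap removed)."""
    return [pearson(item,
                    [sum(v for j, v in enumerate(resp) if j != i)
                     for resp in zip(*items)])
            for i, item in enumerate(items)]

# Made-up responses: 3 items x 5 respondents
items = [[1, 2, 3, 4, 5], [2, 2, 3, 5, 4], [1, 3, 3, 4, 5]]
print(round(cronbach_alpha(items), 2))                     # compare to the .70 criterion
print([round(r, 2) for r in corrected_item_scale(items)])  # compare each to the .30 criterion
```

Subtracting the item from the scale total before correlating is exactly the "corrected for overlap" (item-rest) adjustment the SAS and Stata slides describe.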
Item-Scale Correlations If any items correlate <.30 with the sum of the other items in the scale.. How much lower than .30? How many items? Does omitting this item increase reliability substantially?
Item-Scale Correlations (cont) If you try deleting items Do it one item at a time – removing one item changes all item-scale correlations Sometimes removing the worst item corrects other problems
What if Reliability is Too Low? How much lower than .70? Does removing items with low item-scale correlations increase alpha? For new scales under development Modify using item-scale criteria For standard scales (published) Report problems as caveats in your analyses Create a modified scale Report results using standard and modified scale
Creating Summated Ratings Scale Scores After final items are determined (meet criteria for item-scale correlations and internal-consistency reliability) Translate your “codebook” scoring rules into program code (SAS, SPSS): Reverse items scored in the wrong direction (e.g., so higher = better) Average across answered items – allows a score if any item is answered Apply a different missing-data rule if desired (e.g., score is missing if more than 50% of items are missing)
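The scoring rules above (reverse-code, average the answered items, apply a missing-data threshold) can be sketched as follows. The function name, item names, 1-5 coding, and use of None for missing are assumptions for illustration, not from the study's codebook.

```python
# Sketch of summated-ratings scoring: items coded 1-5, None = missing.
def score_scale(responses, reverse_items=(), scale_max=6, max_missing=0.5):
    """Average the answered items; return None if too many are missing.

    responses:     {item_name: value or None}
    reverse_items: items whose direction must be flipped
                   (scale_max - value, with scale_max = 6, keeps a 1-5 range)
    max_missing:   maximum tolerated proportion of missing items
    """
    vals = []
    n_missing = 0
    for item, v in responses.items():
        if v is None:
            n_missing += 1
            continue
        vals.append(scale_max - v if item in reverse_items else v)
    if n_missing / len(responses) > max_missing:
        return None  # too much missing data to score this respondent
    return sum(vals) / len(vals)

resp = {"q1": 5, "q2": None, "q3": 2}          # q3 is worded in the wrong direction
print(score_scale(resp, reverse_items={"q3"}))  # (5 + 4) / 2 = 4.5
```

Averaging (rather than summing) is what lets a score be computed from whichever items were answered, as the slide notes.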
Review Summated Scores Descriptive Statistics Review for out-of-range values, outliers, and expected means For scores with problems, review programming statements, locate errors, and correct Repeat the process until the computer algorithm produces accurate scores To test programming accuracy: calculate scores by hand from 2 questionnaires and check that they match the computer-generated scores
Summarize Measurement Characteristics (Handout) Present for each final scale: % missing Mean, standard deviation Observed range, possible range Floor and ceiling effects, skewness statistic Range of item-scale correlations Number of item-scale correlations > .30 Internal consistency reliability
Overview of Class 9 and 10 Analyzing pretest data Modifying/adapting measures Keeping track of your study measures Testing basic psychometric properties, creating summated ratings scales, and presenting measurement information Creating and presenting change scores
Two Basic Types of Change Scores Measured change Difference in scores between baseline and follow-up Perceived change How much change respondent reports (from some prior time period)
Change Scores are Important Variables! Creating change score variables is complex Requires thought ahead of time Don’t rely on your programmer Include specification of change scores in your codebook
Measured Change Example: measure administered at baseline and 1 month after treatment Pain in past 2 weeks 0-10 numeric scale, 10 = worst pain Hypothetical results for 1 person Time 1 (baseline) – score of 5 Time 2 (one month) – score of 8
How Should Change be Scored? Time 1 (baseline) - score of 5 Time 2 (one month) - score of 8 Two options: Option 1: time 2 minus time 1 Option 2: time 1 minus time 2
How Should Change be Scored? (cont) Time 1 (baseline) - score of 5 Time 2 (one month) - score of 8 Two options: Option 1: time 2 minus time 1 = 3 Option 2: time 1 minus time 2 = -3
Interpreting Change Score What do you want the change score to indicate? Positive change score = improving? Positive change score = worsening? Scoring “rule” depends on: Direction of scores on original measure (is higher score better or worse?) Which was subtracted from which?
Define Change Score In Codebook: Algorithms You want positive score = improvement If high score on measure is better Time 2 minus time 1 If high score on measure is worse Time 1 minus time 2 You want positive score = decline If high score on measure is better Time 1 minus time 2 If high score on measure is worse Time 2 minus time 1
Recommendation: Make Change Score Intuitively Meaningful If high score on measure = better Calculate change score so positive change score = improved Time 2 minus time 1 If high score on measure = worse Calculate change scores so positive change score = improved Time 1 minus time 2
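The recommendation above reduces to one rule, sketched here (the function name is illustrative): subtract in whichever direction makes a positive change score mean improvement.

```python
# Positive change score = improvement, per the recommendation above.
def change_score(t1, t2, higher_is_better):
    """Subtract so that a positive result always means the person improved."""
    return (t2 - t1) if higher_is_better else (t1 - t2)

# Pain example from the slides: 0-10 scale, 10 = worst pain (higher is worse);
# baseline score 5, one-month score 8 -> the person worsened.
print(change_score(5, 8, higher_is_better=False))  # -3 (negative = decline)
```

Encoding the direction once, in the codebook and in code, avoids the ambiguity illustrated in the "what is wrong" example that follows.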
Interpreting “Measured Change” Scores: What is Wrong? In a study predicting utilization of health care (outpatient visits) over a 1-year period as a function of self-efficacy… A results sentence: “Reduced utilization at one year was associated with level of self efficacy at baseline (p < .01) and with 6-month changes in self efficacy (p < .05).”
Interpreting “Measured Change” Scores: Making it Clearer “Reduced outpatient visits at one year were associated with lower levels of self-efficacy at baseline (p < .01) and with 6-month improvements in self-efficacy (p < .05).”
Two Basic Types of Change Scores Measured change Difference in scores between baseline and follow-up Perceived change How much change respondent reports (from some prior time period)
Perceived Change (Retrospective Change) How much has your physical functioning changed since your surgery? -3 very much worse -2 much worse -1 worse 0 no change 1 better 2 much better 3 very much better
Perceived/Retrospective Change Perceived change enables respondent to define a concept in terms of what it means to them Measured change is a change on specific questions that were contained in a particular measure
Example of Measured vs. Perceived Change Measuring change in physical functioning 2 months after abdominal surgery Case: woman has more problems bending over than before surgery
Measured Change Since Abdominal Surgery Physical functioning measured at baseline and 2 months after surgery Difficulty walking Difficulty climbing stairs Measured change: change on these specific physical functions Measured change will not detect change in bending over
Measuring Perceived Change in Physical Functioning To what extent did your physical functioning change since just before your surgery? Much worse Worse No change Better Much better If person considers bending over as part of physical functioning, she will report becoming worse
Recommendations: Include Both Types of Measures Measured change enables Comparison with other studies May be more sensitive - has more scale levels Investigator defines clinically relevant outcomes Perceived/Retrospective change enables Person to report on domain using their own definition Picks up changes “unmeasured” by particular measure
Thank you! Final paper due by December 10 See handout (posted on syllabus during week 7)