Paul K. Crane, MD MPH Dan M. Mungas, PhD

Composite scores

Disclaimer
Funding for this conference was made possible, in part, by Grant R13 AG from the National Institute on Aging. The views expressed do not necessarily reflect the official policies of the Department of Health and Human Services; nor does mention of trade names, commercial practices, or organizations imply endorsement by the U.S. Government. Drs. Harvey and Crane have no conflicts of interest to report.

Outline
Neuropsychological practice and the utility of z scores
Why composite scores?
Drinking from the fire hose
Z scores head to head with IRT scores
Conclusions

Neuropsychological practice
Often focused on patterns of cognitive deficits across different domains
Useful in differential diagnosis of the cognitively impaired individual
May emphasize a premorbid estimate of ability
  Multiple determinants, including occupation, educational attainment, military rank (for vets)
  Vocabulary preserved in early AD, so it may be used as well

Scores and communication
Neuropsychological batteries contain many tests, each with a different scoring metric
Familiarity permits experts to understand what a Mattis score of 130 and 4 errors on the Clock and a Trails B time of 142 seconds imply about the examinee’s cognitive functioning
Difficult to communicate these scores to less experienced colleagues

Clinical use of z scores
Z scores facilitate short-hand communication
Relatively easy to calculate
Requires some average score and some standard deviation to calculate; age-specific or race-specific or education-specific norms?
May not matter much within an individual, unless different tests have much different demographic impacts
Makes it much easier for individuals with less experience with the tests to identify domains with deficits
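
As a minimal sketch of the calculation described here, a z score is the raw score minus a reference mean, divided by a reference standard deviation. The test names and the norm values below are hypothetical placeholders, not published norms for any real instrument.

```python
def z_score(raw, norm_mean, norm_sd):
    """Standardize a raw test score against a reference mean and SD."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    return (raw - norm_mean) / norm_sd


# Hypothetical (mean, SD) norms for two tests scored on very different metrics.
norms = {"test_a": (135.0, 5.0), "test_b": (90.0, 30.0)}
raw_scores = {"test_a": 130.0, "test_b": 142.0}

# After standardization both results are expressed in SD units.
z = {name: z_score(raw_scores[name], *norms[name]) for name in norms}
```

Which reference group supplies the mean and SD (age-specific, education-specific, and so on) is exactly the open question raised on this slide; the sketch takes the norms as given.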

Rationale for composite scores
Summary scores are very helpful for analyses
Better measurement properties together than any instrument on its own
Avoid problems from multiple hypotheses
  True signal scenario with multiple bad tests: a few show p<0.05, a few don’t, looks like the work of chance
  True signal scenario with one good test: may work
Depends on measurement properties, and whether the items pooled together measure the thing intended (dimensionality and validity issues)
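
One way to make "better measurement properties together" concrete is the Spearman-Brown formula from classical test theory. The slide does not name this formula; it is used here as an illustration, and it assumes the pooled tests are parallel (same construct, equal reliability), which real batteries only approximate.

```python
def spearman_brown(k, reliability):
    """Reliability of a composite of k parallel tests with equal reliability."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return k * reliability / (1.0 + (k - 1) * reliability)


# Assumed value: each individual test has reliability 0.5.
single_test = 0.5
composite_of_four = spearman_brown(4, single_test)  # 2.0 / 2.5
```

Under these assumptions, four mediocre tests pooled together reach a reliability of 0.8, which is the intuition behind analyzing one composite rather than four separate outcomes.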

Logical next step in the z score story
Very simple extension to average the z scores of the tests within a domain and use that average z score in analyses
Commonly done, even considered relatively sophisticated by study sections in 2008
But: may not be the best thing to do from a psychometrics perspective
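
The averaging step can be sketched as follows. The domain, test names, and z values are made up for illustration.

```python
from statistics import mean


def domain_composite(z_scores):
    """Average the z scores of the tests assigned to one cognitive domain."""
    values = list(z_scores)
    if not values:
        raise ValueError("need at least one z score")
    return mean(values)


# Hypothetical z scores for three tests in one domain.
executive_z = {"test_1": -1.5, "test_2": -0.5, "test_3": -1.0}
composite = domain_composite(executive_z.values())  # -1.0
```

Every test contributes equally to this average, which is the assumption examined next.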

Assumptions of z scores
Each item / scale / test has equal weight on the overall score
  Is letter fluency with 3 letters 1 test or 3 tests?
  This matters (1/n influence on total score, or 3/n influence on total score)
The scale is determined by the standard deviation
  Highly variable items / scales / tests are weighted less
  Less variable items / scales / tests are weighted more
Is this what we would want? Wouldn’t we want to incorporate information about the relative difficulty of different tests?
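
Both weighting effects can be seen in a short sketch. In an average of n z scores, one raw point on a test moves the composite by 1 / (n × SD), so a test with a small SD counts more per raw point. The SDs, test names, and the number of other tests are assumptions chosen for illustration.

```python
def raw_point_weight(sd, n_tests):
    """Change in the averaged-z composite per one raw point on a test."""
    return 1.0 / (n_tests * sd)


# Hypothetical SDs for two tests that are averaged together.
sds = {"low_variability_test": 2.0, "high_variability_test": 10.0}
weights = {name: raw_point_weight(sd, len(sds)) for name, sd in sds.items()}
# low_variability_test: 0.25 per raw point; high_variability_test: 0.05


def fluency_share(n_other_tests, fluency_entries):
    """Fraction of the composite carried by letter fluency."""
    return fluency_entries / (n_other_tests + fluency_entries)


# Assume 4 other tests in the composite.
share_as_one_test = fluency_share(4, 1)     # 1/5
share_as_three_tests = fluency_share(4, 3)  # 3/7
```

Note that entering the three letters separately also changes n, so in this sketch the shift is from 1/5 to 3/7 of the composite rather than a clean 1/n versus 3/n.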

Linearity
Hidden in z scores is an assumption that a 1 SD difference in scores has the same meaning, in terms of the underlying domain measured by the test (usually “ability” for neuropsychological tests), in all regions of the scale
A z score is a transformed sum score
Tests constructed without modern psychometrics tend to have a common structure: most of the items are in the middle

Global cognitive tests

Curvilinear scaling

Curvilinearity in a longitudinal study
Where you start on the curve matters a great deal in how much change there appears to be

Linear scaling 1
[Figure: ability axis running from low ability to high ability]

Linear scaling 2
[Figure: items (X) placed along the axis from low ability (easy items) to high ability (difficult items), in clusters of 9, 4, 5, and 2 items]

Linear scaling 3
[Figure: same item layout, with starting positions A0 and B0 marked]

Linear scaling 4
[Figure: same item layout, with positions A1, A0, B1, B0 marked]

Linear scaling 5
[Figure: same item layout and positions; labels: 11 “at risk” points, 1 “at risk” point]

Linear scaling 6
[Figure: same item layout, with the axis labeled in SD units from -2 to +3; Mean=7, SD=5]

Same example with a different population
[Figure: same item layout; Mean=13, SD=2]
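
The "at risk" counts on the linear scaling slides can be mimicked with a toy calculation. The sketch assumes a deterministic (Guttman-style) rule in which a person passes every item located at or below their ability. The item locations and the positions A0, A1, B0, B1 are invented: only the cluster sizes (9, 4, 5, 2) come from the slides, and the values were chosen so the counts echo the 11 and 1 shown there.

```python
# Hypothetical item locations on the ability scale, in clusters of 9, 4, 5, 2.
item_locations = [
    -3.0, -2.8, -2.6, -2.4, -2.2, -2.0, -1.8, -1.6, -1.4,
    -1.0, -0.9, -0.8, -0.7,
    -0.3, -0.2, -0.1, 0.0, 0.1,
    1.5, 3.0,
]


def sum_score(ability):
    """Number of items passed under a pass-if-at-or-below-ability rule."""
    return sum(1 for location in item_locations if location <= ability)


# Both people decline by about the same amount on the ability scale (2.2 units).
a_start, a_end = -0.5, -2.7  # declines through a dense cluster of items
b_start, b_end = 2.8, 0.6    # declines through a sparse stretch

a_points_lost = sum_score(a_start) - sum_score(a_end)  # 13 - 2 = 11
b_points_lost = sum_score(b_start) - sum_score(b_end)  # 19 - 18 = 1
```

The same underlying decline costs one person 11 points and the other 1 point, purely because of where the items sit along the scale.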

Same issue with Fluency
Let’s say in a population the mean of /F/ is 12 in 1 minute, SD 3
Using a z score implies difference in implication between 3 and 6 words is the same as the difference in implication between 33 and 36 words (1 SD unit difference)
  3: really awful. 6: pretty bad. 33 and 36: both really good.
  Certainly 33 and 36 not qualitatively as different as 3 and 6 are
Similarly, difference between 3 and 12 (awful and average) is the same as between 36 and 45 (really good and superb) (3 SD units)
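
The arithmetic behind this example, using the mean of 12 and SD of 3 stated on the slide, can be checked directly:

```python
MEAN_F = 12.0  # mean words for /F/ in 1 minute (from the slide)
SD_F = 3.0     # standard deviation (from the slide)


def fluency_z(words):
    """z score for a letter fluency count under the slide's mean and SD."""
    return (words - MEAN_F) / SD_F


z_by_count = {words: fluency_z(words) for words in (3, 6, 12, 33, 36, 45)}
# {3: -3.0, 6: -2.0, 12: 0.0, 33: 7.0, 36: 8.0, 45: 11.0}

gap_low = fluency_z(6) - fluency_z(3)            # 1.0 SD
gap_high = fluency_z(36) - fluency_z(33)         # 1.0 SD
gap_low_to_mean = fluency_z(12) - fluency_z(3)   # 3.0 SD
gap_top = fluency_z(45) - fluency_z(36)          # 3.0 SD
```

The z metric treats each pair of gaps as identical, even though the clinical meaning of the two differences is very different.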

Bias in the rate of change

Zero in z scores
Average score is 0 for each test
Weights for scores different from 0 determined by the variability (in the form of the SD), not the relative difficulty of the test
Is this what we would want?

Summary: issues with z scores
Dimensionality: Should we lump these items / scales / tests together?
Equal weighting: Should each item / scale / test receive equal weight in the overall composite score?
Scales based on variability: Is it appropriate to base the scale on the observed SD in the population?
Equal difficulty: Are all of the items / scales / tests equally difficult?
Linearity: Is the relationship with the underlying construct measured by the test the same across the entire spectrum?

Z scores vs. IRT scores
IRT scores offer more flexibility; linear scaling
Weighting based on relative difficulties of different tests
(Different handling of demographic heterogeneity)
Facilitates specific attention to measurement error / precision
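
To illustrate how an IRT framework uses item difficulty and gives direct access to precision, here is a minimal sketch based on the two-parameter logistic (2PL) model. The slides do not say which IRT model was used, so the 2PL choice and the item parameters below are assumptions.

```python
import math


def prob_correct(theta, a, b):
    """2PL probability of passing an item with discrimination a, difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information an item contributes at ability theta."""
    p = prob_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def standard_error(theta, items):
    """Approximate standard error of measurement at theta for a set of items."""
    total = sum(item_information(theta, a, b) for a, b in items)
    return 1.0 / math.sqrt(total)


# Hypothetical (discrimination, difficulty) pairs, clustered mid-scale.
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]

se_middle = standard_error(0.0, items)  # near the items: more precise
se_high = standard_error(3.0, items)    # far above the items: less precise
```

Precision depends on where the person falls relative to the item difficulties, something a single SD-based scale cannot express.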

2 head to head studies: Study 1
FH 2005, in press at JINS
Executive functioning battery added to SENAS
Subset had MRI evaluations
Compared IRT to z scores head to head in terms of strength of relationship with neuroimaging parameters
Demographic heterogeneity

Conceptual model
[Figure: path diagram relating Demographics, Ability, Composite Score, and MRI; items numbered 1 … n and n+1 … n+m, grouped as items with DIF and items without]

Findings
Strength of relationship of executive functioning composite with MRI was similar for IRT scores as for composite z score
Accounting for heterogeneity in ages using adjusted z scores decreased strength of relationship
Accounting for ethnicity / language, education, and gender did not impair strength of relationship
Accounting for heterogeneity using IRT and DIF did not impact strength of relationship

Study 2
Convenience sample with three known groups: AD, impaired cognition with no dementia, and normal cognition
Neuropsychological battery administered, including several measures of executive functioning

Digits backwards from the CASI
I think it’s items in the CASI
How to score these items? (not clear)
Does it matter? (absolutely)

Digits backwards
Score 1: more credit for 4 digits than 3 digits
Score 2: equal credit for 4 digits and 3 digits
Score 3: more credit for 4 digits than 3 digits, lots of points for both
Score 4: more credit for 4 digits than 3 digits, LOTS of points for both

Strength of relationship with cognitive impairment
IRT scores were at least as good as z scores
Demographic adjustment in the z score framework was a bad idea
Demographic adjustment in the IRT framework was not as bad an idea

Conclusions
Many theoretical reasons why latent trait scores (such as IRT) would be preferred to classical test theory scores (such as z scores)
Two specific examples here of the relative validity of executive functioning composites
Theoretically and practically better approach to demographic heterogeneity
