1
Composite scores
Paul K. Crane, MD MPH
Dan M. Mungas, PhD
2
Disclaimer
Funding for this conference was made possible, in part, by Grant R13 AG from the National Institute on Aging. The views expressed do not necessarily reflect the official policies of the Department of Health and Human Services; nor does mention of trade names, commercial practices, or organizations imply endorsement by the U.S. Government. Drs. Harvey and Crane have no conflicts of interest to report.
3
Outline
Neuropsychological practice and the utility of z scores
Why composite scores? Drinking from the fire hose
Z scores head to head with IRT scores
Conclusions
4
Neuropsychological practice
Often focused on patterns of cognitive deficits across different domains
Useful in differential diagnosis of the cognitively impaired individual
May emphasize a premorbid estimate of ability
Multiple determinants, including occupation, educational attainment, military rank (for vets)
Vocabulary is preserved in early AD, so it may be used as well
5
Scores and communication
Neuropsychological batteries contain many tests, each with a different scoring metric
Familiarity permits experts to understand what a Mattis score of 130, 4 errors on the Clock, and a Trails B time of 142 seconds imply about the examinee’s cognitive functioning
It is difficult to communicate these scores to less experienced colleagues
6
Clinical use of z scores
Z scores facilitate short-hand communication
Relatively easy to calculate
Requires a reference mean and standard deviation; should these come from age-specific, race-specific, or education-specific norms?
The choice of norms may not matter much within an individual, unless different tests have very different demographic impacts
Makes it much easier for individuals with less experience with the tests to identify domains with deficits
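The calculation mentioned above can be sketched in a few lines. This is a minimal illustration, not a clinical tool: the Trails B time of 142 seconds comes from the earlier slide, but the reference mean and SD are hypothetical values chosen only to make the arithmetic visible, and the function name `z_score` is an assumption of this sketch.

```python
def z_score(raw, norm_mean, norm_sd):
    """Standardize a raw test score against a reference mean and SD."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    return (raw - norm_mean) / norm_sd


# Hypothetical norms (NOT published values), used only for illustration.
# Trails B is timed, so a higher raw score is worse; the sign is flipped
# so that negative z scores consistently mean poorer performance.
trails_b_z = -z_score(142, norm_mean=100, norm_sd=30)
print(f"Trails B z score: {trails_b_z:.2f}")
```

Swapping in age-specific or education-specific norms only changes the `norm_mean` and `norm_sd` arguments, which is why the choice of norms is the main decision in the z score approach.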
7
Rationale for composite scores
Summary scores are very helpful for analyses
Better measurement properties together than any instrument on its own
Avoid problems from multiple hypotheses
True signal scenario with multiple bad tests: a few show p<0.05, a few don’t, and the result looks like the work of chance
True signal scenario with one good test: may work
Depends on measurement properties, and on whether the items pooled together measure the thing intended (dimensionality and validity issues)
8
Logical next step in the z score story
Very simple extension to average the z scores of the tests within a domain and use that average z score in analyses
Commonly done, even considered relatively sophisticated by study sections in 2008
But: may not be the best thing to do from a psychometrics perspective
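The averaging step described above might look like the following. It is a sketch under stated assumptions: the test names and z scores are invented for illustration, and `domain_composite` is a hypothetical helper name.

```python
from statistics import mean


def domain_composite(z_scores):
    """Average the z scores of the tests within one cognitive domain."""
    return mean(z_scores)


# Invented z scores for one examinee on three executive functioning tests.
executive_z = {
    "letter_fluency": -1.0,
    "trails_b": -2.0,
    "digits_backwards": 0.0,
}
composite = domain_composite(executive_z.values())
print(f"Executive functioning composite z: {composite:.2f}")
```

Note that every test enters the average with the same weight, which is the assumption the following slides question.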
9
Assumptions of z scores
Each item / scale / test has equal weight on the overall score
Is letter fluency with 3 letters 1 test or 3 tests? This matters (1/n influence on total score, or 3/n influence on total score)
The scale is determined by the standard deviation
Highly variable items / scales / tests are weighted less
Less variable items / scales / tests are weighted more
Is this what we would want? Wouldn’t we want to incorporate information about the relative difficulty of different tests?
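To make the 1/n versus 3/n point concrete, here is a small sketch with invented numbers: an examinee scores 2 SD below the mean on each of three fluency letters and exactly at the mean on two other tests. The only thing that changes between the two composites is whether fluency is counted as one test or as three.

```python
def average(values):
    """Plain unweighted mean, as used in a z score composite."""
    values = list(values)
    return sum(values) / len(values)


# Invented z scores for illustration only.
fluency_letters = [-2.0, -2.0, -2.0]   # one z score per letter
other_tests = [0.0, 0.0]               # two other tests in the domain

# Fluency counted as ONE test: average the letters first (weight 1/3).
as_one_test = average([average(fluency_letters)] + other_tests)

# Fluency counted as THREE tests: each letter enters separately (weight 3/5).
as_three_tests = average(fluency_letters + other_tests)

print(f"Fluency as 1 test:  {as_one_test:.2f}")
print(f"Fluency as 3 tests: {as_three_tests:.2f}")
```

The same performance produces a noticeably different composite, so the bookkeeping decision is effectively a weighting decision.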
10
Linearity
Hidden in z scores is an assumption that a 1 SD difference in scores has the same meaning, in terms of the underlying domain measured by the test (usually “ability” for neuropsychological tests), in all regions of the scale
A z score is a transformed sum score
Tests constructed without modern psychometrics tend to have a common structure: most of the items are in the middle
11
Global cognitive tests
12
Curvilinear scaling
13
Curvilinearity in a longitudinal study
Where you start on the curve matters a great deal in how much change there appears to be
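A toy simulation can illustrate why the starting point matters. Everything here is an assumption made for the sketch: the logistic shape of the curve, the 30-point maximum, and the slope are illustrative choices, not values taken from the study shown on the slide.

```python
import math


def expected_score(ability, max_score=30, slope=1.5):
    """Hypothetical curvilinear map from ability to expected raw score."""
    return max_score / (1 + math.exp(-slope * ability))


def observed_change(start_ability, decline=1.0):
    """Raw-score drop produced by the same decline in underlying ability."""
    return expected_score(start_ability) - expected_score(start_ability - decline)


# Two people lose the same amount of ability but start at different places.
mid_range_drop = observed_change(0.5)   # starts on the steep part of the curve
near_ceiling_drop = observed_change(3.0)  # starts on the flat part near ceiling

print(f"Drop when starting mid-range:    {mid_range_drop:.1f} points")
print(f"Drop when starting near ceiling: {near_ceiling_drop:.1f} points")
```

Under these assumptions the identical decline in ability shows up as a much larger raw-score change for the person who starts on the steep part of the curve.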
14
Linear scaling 1
[Figure: ability axis running from low ability to high ability]
15
Linear scaling 2
[Figure: item locations along the ability axis, labeled low ability (easy items) and high ability (difficult items): XXXXXXXXX XXXX XXXXX XX]
16
Linear scaling 3
[Figure: item locations along the ability axis from low to high ability (XXXXXXXXX XXXX XXXXX XX), with positions A0 and B0 marked]
17
Linear scaling 4
[Figure: item locations along the ability axis from low to high ability (XXXXXXXXX XXXX XXXXX XX), with positions A1, A0, B1, and B0 marked]
18
Linear scaling 5
[Figure: item locations along the ability axis from low to high ability (XXXXXXXXX XXXX XXXXX XX), with positions A1, A0, B1, and B0 marked; annotations: 11 “at risk” points, 1 “at risk” point]
19
Linear scaling 6
[Figure: item locations along the ability axis from low to high ability (XXXXXXXXX XXXX XXXXX XX), with a z score scale overlaid (labels: +2 -2… -1 +1 …+3); Mean=7, SD=5]
20
Same example with a different population
[Figure: item locations along the ability axis from low to high ability (XXXXXXXXX XXXX XXXXX XX); Mean=13, SD=2]
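The two populations above (mean 7, SD 5 and mean 13, SD 2) show how the same raw score lands in very different places on the z scale. In this sketch the raw score of 10 is an arbitrary value picked for illustration; only the means and SDs come from the slides.

```python
def z_score(raw, pop_mean, pop_sd):
    """Standardize a raw score against a population mean and SD."""
    return (raw - pop_mean) / pop_sd


raw = 10  # arbitrary illustrative raw score on the same test

z_first_population = z_score(raw, pop_mean=7, pop_sd=5)
z_second_population = z_score(raw, pop_mean=13, pop_sd=2)

print(f"z in population with mean 7, SD 5:  {z_first_population:+.1f}")
print(f"z in population with mean 13, SD 2: {z_second_population:+.1f}")
```

The test and the examinee are unchanged; only the reference population moved, yet the score shifts from above average to well below average and the size of a "1 SD" step changes from 5 raw points to 2.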
21
Same issue with Fluency
Let’s say that in a population the mean for /F/ is 12 words in 1 minute, SD 3
Using a z score implies that the difference between 3 and 6 words means the same as the difference between 33 and 36 words (1 SD unit in both cases)
3: really awful. 6: pretty bad. 33 and 36: both really good.
Certainly 33 and 36 are not as qualitatively different as 3 and 6 are
Similarly, the difference between 3 and 12 (awful and average) is treated the same as the difference between 36 and 45 (really good and superb) (3 SD units in both cases)
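The arithmetic behind the fluency example can be checked directly. The mean of 12 and SD of 3 are the values stated above; the variable and function names are just labels for this sketch.

```python
MEAN_F, SD_F = 12, 3  # population mean and SD for /F/ words in 1 minute


def z(words):
    """z score for a letter fluency count under the stated mean and SD."""
    return (words - MEAN_F) / SD_F


# Pairs of scores compared on the slide.
pairs = [(3, 6), (33, 36), (3, 12), (36, 45)]
gaps = {pair: z(pair[1]) - z(pair[0]) for pair in pairs}

for (low, high), gap in gaps.items():
    print(f"{low:>2} vs {high:>2} words: z = {z(low):+.0f} vs {z(high):+.0f}, gap = {gap:.0f} SD")
```

On the z scale the 3-versus-6 gap and the 33-versus-36 gap are identical, even though clinically they mean very different things.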
22
Bias in the rate of change
23
Zero in z scores
Average score is 0 for each test
Weights for scores different from 0 are determined by the variability (in the form of the SD), not the relative difficulty of the test
Is this what we would want?
24
Summary: issues with z scores
Dimensionality: Should we lump these items / scales / tests together?
Equal weighting: Should each item / scale / test receive equal weight in the overall composite score?
Scales based on variability: Is it appropriate to base the scale on the observed SD in the population?
Equal difficulty: Are all of the items / scales / tests equally difficult?
Linearity: Is the relationship with the underlying construct measured by the test the same across the entire spectrum?
25
Z scores vs. IRT scores
IRT scores offer more flexibility; linear scaling
Weighting based on relative difficulties of different tests
(Different handling of demographic heterogeneity)
Facilitates specific attention to measurement error / precision
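The slides do not say which IRT model was used. As one common example, and purely as an assumption for illustration, the two-parameter logistic (2PL) model shows how item difficulty enters the score directly, where person $i$ has ability $\theta_i$ and item $j$ has discrimination $a_j$ and difficulty $b_j$:

```latex
P(X_{ij} = 1 \mid \theta_i) = \frac{1}{1 + \exp\bigl(-a_j(\theta_i - b_j)\bigr)}
```

In this formulation ability and difficulty sit on the same scale, so a correct response to a hard item (large $b_j$) carries different information than a correct response to an easy one, which is the weighting by relative difficulty that z scores lack.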
26
2 head to head studies: Study 1
FH 2005, in press at JINS
Executive functioning battery added to SENAS
Subset had MRI evaluations
Compared IRT to z scores head to head in terms of strength of relationship with neuroimaging parameters
Demographic heterogeneity
27
Conceptual model
[Figure: path diagram relating ability, demographics, items 1 … n and n+1 … n+m (grouped as items with DIF and items without), composite score, and MRI]
28
Findings
Strength of relationship of the executive functioning composite with MRI was similar for IRT scores and for the composite z score
Accounting for heterogeneity in ages using adjusted z scores decreased strength of relationship
Accounting for ethnicity / language, education, and gender did not impair strength of relationship
Accounting for heterogeneity using IRT and DIF did not impact strength of relationship
29
Study 2
Convenience sample with three known groups: AD, impaired cognition with no dementia, and normal cognition
Neuropsychological battery administered, including several measures of executive functioning
30
Digits backwards from the CASI
I think it’s items in the CASI
How to score these items? (not clear)
Does it matter? (absolutely)
31
Digits backwards
Score 1: more credit for 4 digits than 3 digits
Score 2: equal credit for 4 digits and 3 digits
Score 3: more credit for 4 digits than 3 digits, lots of points for both
Score 4: more credit for 4 digits than 3 digits, LOTS of points for both
33
Strength of relationship with cognitive impairment
IRT scores were at least as good as z scores
Demographic adjustment in the z score framework was a bad idea
Demographic adjustment in the IRT framework was not as bad an idea
34
Conclusions
Many theoretical reasons why latent trait scores (such as IRT) would be preferred to classical test theory scores (such as z scores)
Two specific examples here of the relative validity of executive functioning composites
IRT is a theoretically and practically better approach to demographic heterogeneity