Composite scores
Paul K. Crane, MD, MPH; Dan M. Mungas, PhD

Presentation transcript:

Composite scores. Paul K. Crane, MD, MPH; Dan M. Mungas, PhD

Disclaimer
Funding for this conference was made possible, in part, by Grant R13 AG030995 from the National Institute on Aging. The views expressed do not necessarily reflect the official policies of the Department of Health and Human Services; nor does mention of trade names, commercial practices, or organizations imply endorsement by the U.S. Government. Drs. Harvey and Crane have no conflicts of interest to report.

Outline
- Neuropsychological practice and the utility of z scores
- Why composite scores? Drinking from the fire hose
- Z scores head to head with IRT scores
- Conclusions

Neuropsychological practice
- Often focused on patterns of cognitive deficits across different domains
- Useful in differential diagnosis of the cognitively impaired individual
- May emphasize a premorbid estimate of ability
  - Multiple determinants, including occupation, educational attainment, and military rank (for vets)
  - Vocabulary is preserved in early AD, so it may be used as well

Scores and communication
- Neuropsychological batteries contain many tests, each with a different scoring metric
- Familiarity permits experts to understand what a Mattis score of 130, 4 errors on the Clock, and a Trails B time of 142 seconds imply about the examinee's cognitive functioning
- Difficult to communicate these scores to less experienced colleagues

Clinical use of z scores
- Z scores facilitate short-hand communication
- Relatively easy to calculate
  - Requires some average score and some standard deviation to calculate; age-specific, race-specific, or education-specific norms?
  - May not matter much within an individual, unless different tests have much different demographic impacts
- Makes it much easier for individuals with less experience with the tests to identify domains with deficits
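A z score is just (raw score - norm mean) / norm SD. A minimal sketch, with hypothetical norm values (any real use would draw on published, demographically appropriate norms):

```python
def z_score(raw, norm_mean, norm_sd):
    """Convert a raw test score to a z score using a normative mean and SD."""
    return (raw - norm_mean) / norm_sd

# Hypothetical norms: a test with mean 100 and SD 15 in the reference group
print(z_score(70, norm_mean=100, norm_sd=15))   # -2.0: two SDs below the norm
print(z_score(115, norm_mean=100, norm_sd=15))  # +1.0: one SD above the norm
```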

Rationale for composite scores
- Summary scores are very helpful for analyses
- Better measurement properties together than any instrument on its own
- Avoid problems from multiple hypotheses (see the simulation sketch below)
  - True signal scenario with multiple bad tests: a few show p<0.05, a few don't; looks like the work of chance
  - True signal scenario with one good test: may work
- Depends on measurement properties, and whether the items pooled together measure the thing intended (dimensionality and validity issues)
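A small simulation of this scenario (a sketch, assuming numpy and scipy are available; the effect size and test reliabilities are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
group = rng.integers(0, 2, n)                  # two groups with a true difference
ability = 0.5 * group + rng.normal(size=n)     # real signal on the latent trait

# Five noisy tests of the same ability; each alone is underpowered
tests = np.array([ability + rng.normal(scale=1.5, size=n) for _ in range(5)])
for i, t in enumerate(tests):
    p = stats.ttest_ind(t[group == 1], t[group == 0]).pvalue
    print(f"test {i}: p = {p:.3f}")            # typically a mix of p<0.05 and not

# Pooling the tests averages away measurement noise
composite = tests.mean(axis=0)
p = stats.ttest_ind(composite[group == 1], composite[group == 0]).pvalue
print(f"composite: p = {p:.3f}")               # usually the clearest signal
```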

Logical next step in the z score story
- Very simple extension: average the z scores of the tests within a domain and use that average z score in analyses
- Commonly done; even considered relatively sophisticated by study sections in 2008
- But: may not be the best thing to do from a psychometrics perspective
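The average-z composite in code (a sketch; the three-test domain and its norms are hypothetical):

```python
import numpy as np

def z_composite(raw_scores, norm_means, norm_sds):
    """Equal-weight average of per-test z scores."""
    z = (np.asarray(raw_scores) - np.asarray(norm_means)) / np.asarray(norm_sds)
    return z.mean()

# Hypothetical executive functioning domain with three tests
print(z_composite([14, 30, 85], norm_means=[12, 25, 100], norm_sds=[3, 10, 15]))
```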

Assumptions of z scores
- Each item / scale / test has equal weight on the overall score
  - Is letter fluency with 3 letters 1 test or 3 tests? This matters (1/n influence on the total score, or 3/n influence)
- The scale is determined by the standard deviation
  - Highly variable items / scales / tests are weighted less
  - Less variable items / scales / tests are weighted more
- Is this what we would want? Wouldn't we want to incorporate information about the relative difficulty of different tests? (See the weighting algebra below.)
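In symbols (a restatement of the slide's point, not taken from the deck): for an average-z composite of n tests,

```latex
C \;=\; \frac{1}{n}\sum_{i=1}^{n}\frac{x_i-\mu_i}{\sigma_i},
\qquad
\frac{\partial C}{\partial x_i} \;=\; \frac{1}{n\,\sigma_i}.
```

A one-point raw-score change therefore counts for more on a low-SD test and for less on a high-SD test; the difficulty of the test never enters the formula.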

Linearity
- Hidden in z scores is an assumption that a 1 SD difference in scores has the same meaning, relative to the underlying domain measured by the test, in all regions (usually "ability" for neuropsychological tests)
- A z score is a transformed sum score
- Tests constructed without modern psychometrics tend to have a common structure: most of the items are in the middle
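This can be made concrete with a test characteristic curve. The sketch below uses invented 2PL item parameters, with most items of middling difficulty, to show that the expected sum score (and hence the z score) is a curvilinear function of ability:

```python
import numpy as np

def expected_sum_score(thetas, difficulties, a=1.7):
    """Test characteristic curve: expected sum score at each ability level,
    for 2PL items sharing a common discrimination a."""
    thetas = np.asarray(thetas, dtype=float)[:, None]
    p = 1 / (1 + np.exp(-a * (thetas - difficulties)))  # P(pass item | ability)
    return p.sum(axis=1)

# 20 items, most clustered in the middle of the difficulty range
b = np.concatenate([[-2.0, -1.5], np.linspace(-0.5, 0.5, 16), [1.5, 2.0]])
for theta in (-2, -1, 0, 1, 2):
    print(f"theta = {theta:+d}: expected score = {expected_sum_score([theta], b)[0]:.1f}")
# Equal steps in ability produce big raw-score steps in the middle of the
# range and small ones at the extremes.
```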

Global cognitive tests

Curvilinear scaling

Curvilinearity in a longitudinal study
- Where you start on the curve matters a great deal in how much change there appears to be
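Continuing the sketch above (same invented item set): the same true decline of 0.5 ability units shows up as a much smaller raw-score change for someone who starts near the ceiling than for someone who starts mid-scale:

```python
import numpy as np

def expected_sum_score(thetas, difficulties, a=1.7):
    """Test characteristic curve for 2PL items (as in the previous sketch)."""
    thetas = np.asarray(thetas, dtype=float)[:, None]
    return (1 / (1 + np.exp(-a * (thetas - difficulties)))).sum(axis=1)

b = np.concatenate([[-2.0, -1.5], np.linspace(-0.5, 0.5, 16), [1.5, 2.0]])

for start in (2.0, 0.0):                      # high-ability vs. mid-ability start
    before, after = expected_sum_score([start, start - 0.5], b)
    print(f"start theta = {start:+.1f}: observed change = {after - before:+.1f} points")
# The identical true decline looks smaller near the ceiling, biasing
# estimated rates of change.
```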

Linear scaling 1 [figure: an ability continuum running from low ability to high ability]

Linear scaling 2 [figure: items (X) placed along the continuum, easy items at the low end, difficult items at the high end, most clustered in the middle]

Linear scaling 3 [figure: the same item map with two examinees, A0 and B0, marked at different ability levels]

Linear scaling 4 [figure: both examinees decline by the same amount along the continuum, A0 to A1 and B0 to B1]

Linear scaling 5 [figure: the same decline crosses 11 "at risk" points for one examinee but only 1 "at risk" point for the other]

Linear scaling 6 [figure: the item map rescaled to z scores; Mean=7, SD=5]

Same example with a different population [figure: the same items with a z-score axis running from -6 to +3; Mean=13, SD=2]

Same issue with fluency
- Let's say that in a population the mean for /F/ is 12 words in 1 minute, SD 3
- Using a z score implies that the difference in implication between 3 and 6 words is the same as the difference in implication between 33 and 36 words (a 1 SD unit difference either way)
  - 3: really awful. 6: pretty bad. 33 and 36: both really good. Certainly 33 and 36 are not qualitatively as different as 3 and 6 are
- Similarly, the difference between 3 and 12 (awful and average) is the same as between 36 and 45 (really good and superb) (3 SD units)
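The arithmetic from the slide, worked out (mean and SD as given above):

```python
norm_mean, norm_sd = 12, 3  # /F/ fluency: mean words per minute and SD

def z(words):
    return (words - norm_mean) / norm_sd

for lo, hi in [(3, 6), (33, 36), (3, 12), (36, 45)]:
    print(f"{lo} vs {hi} words: difference = {z(hi) - z(lo):.0f} SD unit(s)")
# Each pair differs by the same number of SD units, even though the clinical
# meanings of the differences are not remotely comparable.
```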

Bias in the rate of change

Zero in z scores
- The average score is 0 for each test
- Weights for scores different from 0 are determined by the variability (in the form of the SD), not the relative difficulty of the test
- Is this what we would want?

Summary: issues with z scores
- Dimensionality: should we lump these items / scales / tests together?
- Equal weighting: should each item / scale / test receive equal weight in the overall composite score?
- Scales based on variability: is it appropriate to base the scale on the observed SD in the population?
- Equal difficulty: are all of the items / scales / tests equally difficult?
- Linearity: is the relationship with the underlying construct measured by the test the same across the entire spectrum?

Z scores vs. IRT scores
- IRT scores offer more flexibility; linear scaling
- Weighting based on relative difficulties of different tests
- (Different handling of demographic heterogeneity)
- Facilitates specific attention to measurement error / precision
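A minimal sketch of what "IRT score" means here: under a 2PL model, the response pattern is converted to the ability value that maximizes the likelihood, so item difficulty and discrimination (not the observed SD) set the scale. The item parameters below are invented for illustration:

```python
import numpy as np

def irt_score(responses, a, b, grid=np.linspace(-4, 4, 401)):
    """Maximum-likelihood ability estimate under a 2PL model.
    responses: 0/1 per item; a, b: discrimination and difficulty per item."""
    theta = grid[:, None]
    p = 1 / (1 + np.exp(-a * (theta - b)))      # P(pass | theta) for each item
    loglik = (responses * np.log(p) + (1 - responses) * np.log(1 - p)).sum(axis=1)
    return grid[np.argmax(loglik)]

a = np.array([1.0, 1.5, 1.2, 0.8])              # invented discriminations
b = np.array([-1.0, 0.0, 1.0, 2.0])             # invented difficulties
print(irt_score(np.array([1, 1, 0, 0]), a, b))  # two easy items passed
print(irt_score(np.array([1, 0, 0, 1]), a, b))  # same sum score, different theta
```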

Two head-to-head studies: Study 1
- Friday Harbor 2005; in press at JINS
- Executive functioning battery added to SENAS
- A subset had MRI evaluations
- Compared IRT scores to z scores head to head in terms of strength of relationship with neuroimaging parameters
- Demographic heterogeneity

Conceptual model [figure: path diagrams in which demographics and ability, measured by items 1…n without DIF and items n+1…n+m with DIF, relate to the composite score and MRI]

Findings
- Strength of the relationship of the executive functioning composite with MRI was similar for IRT scores and for the composite z score
- Accounting for heterogeneity in age using adjusted z scores decreased the strength of the relationship
- Accounting for ethnicity / language, education, and gender did not impair the strength of the relationship
- Accounting for heterogeneity using IRT and DIF did not impact the strength of the relationship

Study 2
- Convenience sample with three known groups: AD, impaired cognition with no dementia, and normal cognition
- Neuropsychological battery administered, including several measures of executive functioning

Digits backwards from the CASI
- I think it's items in the CASI
- How to score these items? (not clear)
- Does it matter? (absolutely)

Digits backwards
- Score 1: more credit for 4 digits than 3 digits
- Score 2: equal credit for 4 digits and 3 digits
- Score 3: more credit for 4 digits than 3 digits, lots of points for both
- Score 4: more credit for 4 digits than 3 digits, LOTS of points for both
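One way the competing schemes could be implemented (a sketch; the point values are invented, since the slide does not give the actual rubrics):

```python
# Map (passed 3 digits, passed 4 digits) to points under each scheme.
# Point values are invented for illustration only.
SCHEMES = {
    1: (1, 2),    # more credit for 4 digits than 3
    2: (1, 1),    # equal credit
    3: (3, 5),    # more credit for 4, lots of points for both
    4: (10, 15),  # more credit for 4, LOTS of points for both
}

def digits_backwards_score(passed_3, passed_4, scheme):
    w3, w4 = SCHEMES[scheme]
    return w3 * passed_3 + w4 * passed_4

for s in SCHEMES:
    row = [digits_backwards_score(p3, p4, s)
           for p3, p4 in [(0, 0), (1, 0), (0, 1), (1, 1)]]
    print(f"scheme {s}: {row}")
```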

Strength of relationship with cognitive impairment
- IRT scores were at least as good as z scores
- Demographic adjustment in the z score framework was a bad idea
- Demographic adjustment in the IRT framework was not as bad an idea

Conclusions
- Many theoretical reasons why latent trait scores (such as IRT scores) would be preferred to classical test theory scores (such as z scores)
- Two specific examples here of the relative validity of executive functioning composites
- A theoretically and practically better approach to demographic heterogeneity