DIF and cross-cultural measurement of cognitive functioning Paul K. Crane, MD MPH Laura B. Gibbons, PhD

DIF and cross-cultural measurement of cognitive functioning (and a few baby pictures) Paul K. Crane, MD MPH Laura B. Gibbons, PhD (Isaac M. Crane, baby) (Heidi M. Crane, MD MPH, mom)

Cross-cultural measurement
Cognitive functioning itself (as opposed to its measurement) is intertwined with language
– It is difficult (impossible?) to separate how we think from our use of language (Saussure, Derrida, etc.)
There is no reason to think cognitive functioning as measured by cognitive tests is the same in two different languages
– Translation / back-translation may result in test items with the same literal meanings, but the relative importance of some tasks may differ in different linguistic contexts (i.e., in different languages)

Examples
Obviously wrong: WORLD vs. MUNDO spelled backwards
– WORLD backwards is much harder than MUNDO backwards
10/66 group (next few slides, borrowed from their website)

Theory and applications for assessing the success of cross-cultural measurement strategies are not well developed
Yet we hope to compare cognitive functioning across languages in many specific contexts:
– CSHA, SALSA, 10/66, Ni-Hon-Sea …
– An Alzheimer's Association area of focus
– The NIA health disparities / diverse elderly focus
– An R01 application for 2/1/06 that many of us will be involved in
DIF approaches may help

Lurking in the background of all DIF detection methods is the assumption that some θ = some other θ
– This is not a conceptual problem for gender, ethnicity, education, or other characteristics within a given language; the other items in the test provide an anchor, and iterative techniques (or other techniques) can be used to determine a set of relatively DIF-free items
– It is potentially a conceptual problem for assessing DIF related to language of test administration
But it's even more complicated than that!

Educational attainment (quality) interferes with the measurement of cognitive functioning even within a single language (Manly, Teresi, Jones, Crane, Mungas, …)
Educational attainment also differs markedly across language groups:
– Spanish speakers have on average fewer years of school than English speakers in the US
– French speakers have on average fewer years of school than English speakers in Canada
– Japanese speakers had on average fewer years of school than English speakers in the Kame study
That really throws a wrench into the "some θ = some other θ" problem

Review: Our general approach to DIF detection
Establish θ = θ with a DIF-free core:

             DIF-free core   DIF items (Group A params)   DIF items (Group B params)
Group A      Present         Present                      Missing
Group B      Present         Missing                      Present

Several strategies of attack
The goal is to develop a strategy to assess the success or failure of cross-language measurement of cognitive functioning
– If DIF is found, it would be nice to adjust for it
We would also like a strategy that can incorporate what we know about DIF related to education
We will consider 4+ strategies

Strategy 1
Treat the θs as ships passing in the night, ignoring educational DIF:

             Items 1–n   Items (n+1)–2n
Language A   Present     Missing
Language B   Missing     Present

Explanation of previous slide
There are n items in this test, and people in two language groups (A and B)
This table illustrates a data structure for PARSCALE:
– Each person has a response for each item
– Parameters for each item are estimated separately for the two populations by virtue of the missing-data strategy (a "not presented" item for PARSCALE)
– Scores for the two populations are defined by a θ determined by all 2n items
A sketch of this structure follows; the next slide then adds education DIF:
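As an illustration of this missing-data structure, here is a minimal Python sketch (pandas assumed; all data and names are hypothetical). Each item is duplicated into group-specific "virtual items," so item parameters can be estimated separately per group while everyone contributes to a single θ:

    import pandas as pd

    def split_items_by_group(df, items, group_col):
        """Duplicate each item into group-specific 'virtual items'.

        A person keeps their observed response in the column for their
        own group and gets a missing value (PARSCALE's 'not presented')
        in the columns for the other group(s)."""
        out = df[[group_col]].copy()
        for item in items:
            for g in sorted(df[group_col].unique()):
                out[f"{item}_{g}"] = df[item].where(df[group_col] == g)
        return out

    # Hypothetical toy data: two language groups, three items
    df = pd.DataFrame({
        "lang": ["A", "A", "B", "B"],
        "v1": [1, 0, 1, 1],
        "v2": [0, 1, 1, 0],
        "v3": [1, 1, 0, 1],
    })
    print(split_items_by_group(df, ["v1", "v2", "v3"], "lang"))

The same splitter applies unchanged to education groups within a language, or to combined language-by-education cells, which is how the structures on the following slides could be generated.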

Item classes are shown as rows, person groups as columns (items with educational DIF are split into high- and low-education versions within each language):

Item class                   A, high ed   A, low ed   B, high ed   B, low ed
Lang A DIF-free items        Present      Present     Missing      Missing
Lang A ed-DIF items (high)   Present      Missing     Missing      Missing
Lang A ed-DIF items (low)    Missing      Present     Missing      Missing
Lang B DIF-free items        Missing      Missing     Present      Present
Lang B ed-DIF items (high)   Missing      Missing     Present      Missing
Lang B ed-DIF items (low)    Missing      Missing     Missing      Present

Strategy 1 with education DIF
Each language group has two education levels, and within each language group some of the items will be found to have DIF related to education
Within each language:
– Items without educational DIF have parameters estimated from the whole (language) sample
– Items with educational DIF have parameters estimated separately for the two educational groups, based on the missing-data structure

Comments on strategy 1
θ language A = θ language B = θ
Nothing at all is in common between the two languages except the underlying metric, which is defined equally by the two language groups
Educational DIF is dealt with separately in the two different languages
Feasible with PARSCALE
We never actually test for DIF related to language for any particular item (θ is simply used)
Theoretical basis? (None I know of)

Strategy 1 vs. Strategy 0
Strategy 0 is the null strategy: treat the items as coming from two different samples and two different tests
– Do some work with the parameters from the two samples
– See the Rasch papers later in the talk

Strategy 2
Assess DIF related to education within the two language groups separately
Identify a core set of items that do not seem to have education-related DIF in either language
Use those items as the initial anchor items for the assessment of language DIF (a sketch of the iterative anchor-purification idea follows)
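One way to operationalize "identify a core set of items without education DIF" is an iterative purification loop. Below is a minimal sketch, assuming binary items, a 0/1 group indicator, and a logistic regression likelihood-ratio test, with the mean over the current anchor items as a crude stand-in for an IRT θ (statsmodels and scipy assumed; all names hypothetical):

    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    def lr_dif_test(item, theta, group):
        """Likelihood-ratio test for uniform DIF in one binary item:
        does adding the group indicator improve on a model that
        conditions on the anchor score alone?"""
        x0 = sm.add_constant(theta)
        x1 = sm.add_constant(np.column_stack([theta, group]))
        m0 = sm.Logit(item, x0).fit(disp=0)
        m1 = sm.Logit(item, x1).fit(disp=0)
        lr = 2 * (m1.llf - m0.llf)
        return stats.chi2.sf(lr, df=1)  # p-value

    def find_anchor_set(responses, group, alpha=0.05, max_iter=10):
        """Iterative purification: score people on the current anchors,
        flag items that show DIF, drop them from the anchor set, and
        repeat until the set stabilizes."""
        items = list(responses.columns)
        anchors = items[:]
        for _ in range(max_iter):
            theta = responses[anchors].mean(axis=1).values
            flagged = [i for i in items
                       if lr_dif_test(responses[i].values, theta, group) < alpha]
            new_anchors = [i for i in items if i not in flagged]
            if not new_anchors or new_anchors == anchors:
                break
            anchors = new_anchors
        return anchors

A fuller version would also test a θ-by-group interaction (non-uniform DIF) and use an IRT-based score rather than the anchor mean. For strategy 2, the loop would be run within each language separately, and items surviving in both languages would form the initial anchor set for the language-DIF analysis.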

         Ed-DIF-free core   Ed-DIF items (Lang A params)   Ed-DIF items (Lang B params)
Lang A   Present            Present                        Missing
Lang B   Present            Missing                        Present

Items without educational DIF are used as the initial DIF-free core for purposes of calibration
Items with educational DIF are treated separately in the different language groups (i.e., have parameters estimated separately in each language group)
Two educational groups per language group on the next slide:

Item class                      A, high ed   A, low ed   B, high ed   B, low ed
No ed DIF                       Present      Present     Present      Present
Ed DIF, Lang A (high version)   Present      Missing     Missing      Missing
Ed DIF, Lang A (low version)    Missing      Present     Missing      Missing
Ed DIF, Lang B (high version)   Missing      Missing     Present      Missing
Ed DIF, Lang B (low version)    Missing      Missing     Missing      Present

This is the same data structure as on the previous slide
This slide makes the division of items with educational DIF in the two language groups explicit
We also need to consider different findings of educational DIF in the two different languages (next slide):

Item class                                     A, high ed   A, low ed   B, high ed   B, low ed
No ed DIF                                      Present      Present     Present      Present
Ed DIF in B but not A (Lang A params)          Present      Present     Missing      Missing
Ed DIF in A, unique plus both (high version)   Present      Missing     Missing      Missing
Ed DIF in A, unique plus both (low version)    Missing      Present     Missing      Missing
Ed DIF in A but not B (Lang B params)          Missing      Missing     Present      Present
Ed DIF in B, unique plus both (high version)   Missing      Missing     Present      Missing
Ed DIF in B, unique plus both (low version)    Missing      Missing     Missing      Present

Here we have added consideration of items that have DIF related to education in one language but not the other; otherwise there is no difference
– We could have separated out columns for: 1. items with DIF related to education in both languages ("both"), 2. items with DIF related to education only in language A ("Lang A unique"), and 3. items with DIF related to education only in language B ("Lang B unique")
Next we consider language DIF for the core items (next slide):

Item class                                A, high ed   A, low ed   B, high ed   B, low ed
Double DIF-free core                      Present      Present     Present      Present
Lang DIF but no ed DIF (Lang A params)    Present      Present     Missing      Missing
Lang DIF but no ed DIF (Lang B params)    Missing      Missing     Present      Present
Ed DIF only in Lang B * (Lang A params)   Present      Present     Missing      Missing
Ed DIF in Lang A † (high version)         Present      Missing     Missing      Missing
Ed DIF in Lang A † (low version)          Missing      Present     Missing      Missing
Ed DIF only in Lang A † (Lang B params)   Missing      Missing     Present      Present
Ed DIF in Lang B * (high version)         Missing      Missing     Present      Missing
Ed DIF in Lang B * (low version)          Missing      Missing     Missing      Present

* and † each mark one set of physical items, shown once per language

Comments on strategy 2
With this strategy we do test for language DIF, at least for items that do not have education DIF
What do we think about the treatment of items with education DIF in the two languages?
– We never assess whether they also have DIF related to language (they are treated as if they do)
Practical problem: we may have too few anchor items; the double-DIF-free core may be too small for anchoring

Strategy 3
Similar to strategy 2, but we allow items with DIF related to education to be considered as potential anchor items for the assessment of language DIF (see the sketch after this slide):
– Items with DIF related to education in one language but not the other de facto have DIF related to language
– Items with DIF related to education in both languages may not have DIF related to language
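The bookkeeping rule can be stated compactly. Here is a minimal sketch (inputs are hypothetical per-language dictionaries of education-DIF flags) of which items remain candidate anchors for the language-DIF analysis under strategy 3:

    def language_anchor_candidates(ed_dif_a, ed_dif_b):
        """Strategy 3: items with ed DIF in exactly one language are
        treated as having language DIF and excluded; items with ed DIF
        in neither or in both languages remain candidate anchors
        (pending the language-DIF test itself)."""
        candidates = []
        for item in ed_dif_a:
            if ed_dif_a[item] != ed_dif_b[item]:
                continue  # ed DIF in one language only: de facto language DIF
            candidates.append(item)
        return candidates

    # Hypothetical usage
    ed_dif_a = {"v1": False, "v2": True, "v3": True}
    ed_dif_b = {"v1": False, "v2": False, "v3": True}
    print(language_anchor_candidates(ed_dif_a, ed_dif_b))  # ['v1', 'v3']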

Item class                                  A, high ed   A, low ed   B, high ed   B, low ed
Double DIF-free core                        Present      Present     Present      Present
Lang DIF but no ed DIF (Lang A params)      Present      Present     Missing      Missing
Lang DIF but no ed DIF (Lang B params)      Missing      Missing     Present      Present
Ed DIF only in Lang B * (Lang A params)     Present      Present     Missing      Missing
Ed DIF only in Lang A † (high version)      Present      Missing     Missing      Missing
Ed DIF only in Lang A † (low version)       Missing      Present     Missing      Missing
Ed DIF only in Lang A † (Lang B params)     Missing      Missing     Present      Present
Ed DIF only in Lang B * (high version)      Missing      Missing     Present      Missing
Ed DIF only in Lang B * (low version)       Missing      Missing     Missing      Present
Ed DIF in both (high version, both langs)   Present      Missing     Present      Missing
Ed DIF in both (low version, both langs)    Missing      Present     Missing      Present

* and † each mark one set of physical items, shown once per language

This strategy treats items that were found to have educational DIF in both languages as additional potential core items for the purposes of calibrating across languages of test administration
Otherwise there is no difference from strategy 2

Comments on strategy 3
My favorite of these strategies
There is an accounting problem in trying to keep track of the item categories
It maximizes the number of potential anchor items for the language analysis, increasing the validity of the common θ metric

Strategy 4
A priori (theory-driven) identification of potential anchor items for language evaluation
– This may make the most sense from the perspective of the SENAS, where Dan prospectively designed the metrics in Spanish and English with particular anchor items
– It is uncertain whether it makes sense in other contexts

         Core items   Other items
Lang A   Present      Present
Lang B   Present      Present

         Core items   Other items, no lang DIF   Lang DIF items (A params)   Lang DIF items (B params)
Lang A   Present      Present                    Present                     Missing
Lang B   Present      Present                    Missing                     Present

Comments on strategy 4
DIF related to education is ignored
– But see the next slide
Whether the core items are really DIF free is never assessed
Can there be post-hoc determinations of anchor items for tests like the 3MS or the CASI?

Item class                                           A, high ed   A, low ed   B, high ed   B, low ed
Core items, no ed DIF                                Present      Present     Present      Present
Core items with ed DIF (high version)                Present      Missing     Present      Missing
Core items with ed DIF (low version)                 Missing      Present     Missing      Present
Other items, no lang or ed DIF                       Present      Present     Present      Present
Other items, ed DIF but no lang DIF (high version)   Present      Missing     Present      Missing
Other items, ed DIF but no lang DIF (low version)    Missing      Present     Missing      Present
Lang DIF but no ed DIF * (Lang A params)             Present      Present     Missing      Missing
Both lang and ed DIF † (A, high version)             Present      Missing     Missing      Missing
Both lang and ed DIF † (A, low version)              Missing      Present     Missing      Missing
Lang DIF but no ed DIF * (Lang B params)             Missing      Missing     Present      Present
Both lang and ed DIF † (B, high version)             Missing      Missing     Present      Missing
Both lang and ed DIF † (B, low version)              Missing      Missing     Missing      Present

* and † each mark one set of physical items, shown once per language

Explanation of prior slide
DIF related to education is now taken care of:
– Core items with ed DIF are treated distinctly in the education groups but anchor across languages
– Similarly for non-core items found not to have language DIF but found to have ed DIF
– Items with language DIF but no ed DIF are treated separately in the different languages
– Items with both language and ed DIF are treated as 4 different items, specific to educational level and language (see the sketch below)
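Continuing the earlier Python sketch (the split_items_by_group helper; all column names and item lists are hypothetical), the four-way treatment of items with both kinds of DIF can be generated from a combined language-by-education key, with coarser keys for the other item classes:

    # df has hypothetical columns 'lang' in {A, B} and 'ed' in {high, low}
    df["lang_ed"] = df["lang"] + "_" + df["ed"]   # e.g. "A_high", "B_low"
    four_way = split_items_by_group(df, both_dif_items, "lang_ed")
    lang_only = split_items_by_group(df, lang_dif_items, "lang")
    ed_only = split_items_by_group(df, ed_dif_items, "ed")  # anchors across languages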

DIF strategies from the literature
Bjorner (1998): Danish vs. English SF-36. Contingency table approach incorporating a "partial version of the gamma coefficient," conditioning on total score; items found with DIF were left out and the analysis re-run as a second step
Kutlay (2003): UK vs. Turkish versions of an HRQL scale; no comment on what was used as an anchor
Martin (2004): 6-item headache (HA) scale in 6 languages. Logistic regression approach, with the scale score used as the anchor

DIF strategies from the literature – 2
Roorda (2004): Dutch vs. Canadian arthritis index. Rasch approach; calibrations run separately in English and Dutch; difficulties centered and plotted against each other; scores calculated using the missing-data approach
Hahn (2005): FACT-B in Austria and the US. Rasch approach; difficulties calculated separately in each language; mean location for the 2 languages; Spearman rank correlations; test statistics and confidence intervals calculated; missing-data treatment for items with DIF (as here); repeated once with this anchor

Strategy 5
Rasch-like: we could do some sort of equating of difficulties and let the slopes vary (?)
– The graphical display and the thinking behind centered difficulties within a language are appealing (a sketch follows)
???
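Below is a minimal sketch of the centered-difficulty display used in the Roorda and Hahn papers, assuming the two language-specific Rasch calibrations have already been run and their item difficulty estimates are in hand (numpy, scipy, and matplotlib assumed; all inputs hypothetical):

    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt

    def compare_centered_difficulties(diff_a, diff_b, labels):
        """Center each language's Rasch difficulties at zero and plot
        them against each other; items far from the identity line are
        candidates for language DIF."""
        a = np.asarray(diff_a, dtype=float)
        b = np.asarray(diff_b, dtype=float)
        a = a - a.mean()
        b = b - b.mean()
        rho, p = stats.spearmanr(a, b)
        fig, ax = plt.subplots()
        ax.scatter(a, b)
        lim = [min(a.min(), b.min()), max(a.max(), b.max())]
        ax.plot(lim, lim, linestyle="--")  # identity line: no-DIF expectation
        for x, y, lab in zip(a, b, labels):
            ax.annotate(lab, (x, y))
        ax.set_xlabel("Centered difficulty, language A")
        ax.set_ylabel("Centered difficulty, language B")
        ax.set_title(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
        return fig

This mirrors the Roorda/Hahn logic: if the only cross-language difference is the arbitrary origin of each calibration, the centered difficulties should fall near the identity line.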

Overall comments on strategies
The strategies range from ignoring educational DIF, to ignoring language DIF, to theoretically uncertain, to not yet worked out (the Rasch-like approach)
Every assumption that θ (language A) = θ (language B) is problematic:
– The relationships of items with the latent trait are assumed to be the same across languages
– Is this ever an acceptable assumption?
– Is there a possible role for SEM approaches?

Conclusions
It is not a simple thing to test for language DIF, especially in the face of large differences in educational background across the language groups
We have suggested 4+ potential strategies but would love guidance
Thanks for your attention