DIF detection using OLR (ordinal logistic regression). Paul K. Crane, MD MPH, Internal Medicine, University of Washington
Outline: (1) Statistical background. (2) DIFdetect package. (3) What do we do when we find DIF? (4) DIF adjustments to PARSCALE code. (5) How good are adjusted scores? (6) Discussion.
Statistical background. Recall the definition of DIF: demographic characteristic(s) interfere with the relationship expected between ability level and responses to an item. This is a conditional definition: we have to control for ability level, or else we can’t differentiate between DIF and differential test impact.
Logistic regression applied to DIF detection: Swaminathan and Rogers (1990). Tested two models:
P(Y=1|X, group) = f(β1X + β2*group + β3*X*group)
P(Y=1|X) = f(β1X)
Compared the difference between the –2 log likelihoods of these two models to a chi-squared distribution with 2 df. Uniform and non-uniform DIF are tested at the same time.
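The single-step comparison can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the two models have already been fitted elsewhere and only their maximized log likelihoods are passed in. The function name and the example log-likelihood values are hypothetical.

```python
import math

def lr_test_2df(loglik_full, loglik_reduced):
    """Likelihood-ratio test of the full DIF model against the ability-only model.

    loglik_full:    maximized log likelihood of f(b1*X + b2*group + b3*X*group)
    loglik_reduced: maximized log likelihood of f(b1*X)
    """
    # Difference in -2 log likelihood between the nested models.
    stat = (-2.0 * loglik_reduced) - (-2.0 * loglik_full)
    # Guard against tiny negative values caused by numerical noise.
    stat = max(stat, 0.0)
    # For 2 df the chi-squared upper-tail probability is exactly exp(-x/2).
    p_value = math.exp(-stat / 2.0)
    return stat, p_value

# Hypothetical log likelihoods, for illustration only.
stat, p = lr_test_2df(loglik_full=-512.3, loglik_reduced=-518.9)
print(f"LR statistic = {stat:.2f}, p = {p:.4f}")
```

A small p-value here says only that some DIF (uniform, non-uniform, or both) is present; it does not separate the two.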
Camilli and Shepard (1994). Recommended a two-step procedure: test first for non-uniform DIF and then for uniform DIF.
P(Y=1|X, group) = f(β1X + β2*group + β3*X*group)
P(Y=1|X, group) = f(β1X + β2*group)
P(Y=1|X) = f(β1X)
The –2 log likelihoods of each pair of adjacent models are compared to determine non-uniform DIF and uniform DIF in two separate steps.
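A minimal sketch of the two-step comparison, assuming the three models have been fitted elsewhere and their maximized log likelihoods are available. Each step drops one parameter, so each difference is referred to a chi-squared distribution with 1 df. The function names, the `alpha=0.05` default, and the example values are illustrative assumptions, not part of the original procedure.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-squared variable with 1 df."""
    return math.erfc(math.sqrt(max(x, 0.0) / 2.0))

def two_step_dif(ll_interaction, ll_group, ll_ability, alpha=0.05):
    """Two-step DIF test from three nested models.

    ll_interaction: log likelihood of f(b1*X + b2*group + b3*X*group)
    ll_group:       log likelihood of f(b1*X + b2*group)
    ll_ability:     log likelihood of f(b1*X)
    alpha:          significance level (an assumed default)
    """
    # Step 1: does adding the X*group interaction improve fit? (non-uniform DIF)
    nonuniform_stat = 2.0 * (ll_interaction - ll_group)
    # Step 2: does adding the group main effect improve fit? (uniform DIF)
    uniform_stat = 2.0 * (ll_group - ll_ability)
    result = {}
    for name, stat in (("nonuniform", nonuniform_stat), ("uniform", uniform_stat)):
        p = chi2_sf_1df(stat)
        result[name] = {"stat": stat, "p": p, "flag": p < alpha}
    return result

# Hypothetical log likelihoods, for illustration only.
res = two_step_dif(ll_interaction=-512.3, ll_group=-512.9, ll_ability=-518.9)
for name, r in res.items():
    print(name, round(r["stat"], 2), round(r["p"], 4), r["flag"])
```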
Millsap and Everson (1994). Dismissive of “observed score” techniques such as logistic regression: X contains several items that have DIF, so adjusting for X is theoretically problematic. Advocated latent approaches such as IRT for DIF detection. Very influential publication.
Zumbo (1999). Extended the Swaminathan and Rogers framework to ordinal logistic regression in order to handle polytomous items. Did not address the latent trait; also used a single step rather than two steps.
Crane, van Belle, Larson (2004). Pointed out that the logistic regression model is a re-parameterization of the IRT model as long as IRT-derived θ estimates are used as ability scores. Addressed multiple hypothesis testing for non-uniform DIF; found no difference between four different adjustment techniques.
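The re-parameterization can be made explicit for a dichotomous item. The lines below are a sketch under two assumptions that the slides do not state: the IRT model is the two-parameter logistic (2PL) model without a scaling constant, and the logistic regression includes an intercept β0 (the slides' f(·) leaves the intercept implicit).

```latex
\begin{align*}
P(Y=1\mid\theta) &= \frac{1}{1+\exp\{-a(\theta-b)\}}
  && \text{(2PL IRT: discrimination } a\text{, difficulty } b\text{)}\\
 &= \frac{1}{1+\exp\{-(\beta_0+\beta_1\theta)\}}
  && \text{(logistic regression on } \theta\text{)}\\
\beta_1 &= a, \qquad \beta_0 = -ab .
\end{align*}
```

If a scaling constant D is used in the IRT model, the slope becomes β1 = Da and the intercept β0 = −Dab.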
Crane et al. (2004) – 2. The biggest change was in the specific criteria for uniform DIF. Recognized that non-uniform and uniform DIF are analogous to effect modification and confounding, respectively. Employed epidemiological thinking about how to detect confounding relationships from the data.
Crane et al. (2004) – 3. Same models used (though now with θ, not X):
P(Y=1|θ, group) = f(β1θ + β2*group)
P(Y=1|θ) = f(β1’θ)
Determine the impact of including the group term on the magnitude of the relationship between θ and item responses: determine the size of |(β1-β1’)/β1|. If this is large, uniform DIF (confounding) is present. See the Maldonado and Greenland simulation study on confounder selection strategies.
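The change-in-coefficient criterion reduces to simple arithmetic once both models are fitted. A minimal sketch follows; the function names are hypothetical, the coefficients are made-up illustrative values, and the threshold is left as a required argument because the slides do not fix what counts as "large".

```python
def proportional_change(beta1_with_group, beta1_without_group):
    """Return |(b1 - b1') / b1|.

    b1  = theta coefficient from the model that includes the group term
    b1' = theta coefficient from the model without the group term
    """
    if beta1_with_group == 0:
        raise ValueError("beta1 from the model with the group term must be non-zero")
    return abs((beta1_with_group - beta1_without_group) / beta1_with_group)

def has_uniform_dif(beta1_with_group, beta1_without_group, threshold):
    """Flag uniform DIF when the proportional change exceeds a caller-chosen threshold."""
    return proportional_change(beta1_with_group, beta1_without_group) > threshold

# Illustrative coefficients and an assumed 10% threshold.
change = proportional_change(1.20, 1.05)
print(f"proportional change = {change:.3f}")
print("uniform DIF flagged:", has_uniform_dif(1.20, 1.05, threshold=0.10))
```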
Work still pending. “Optimal” criteria for uniform and non-uniform DIF are unknown. Adjust α for multiple hypotheses? If so, how many hypotheses? What effect size for non-uniform DIF? In huge data sets, the interaction term is likely to be significant. What proportional change in β1 constitutes significant uniform DIF (UDIF)?
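To make the multiple-hypothesis question concrete, here is a sketch of one standard adjustment, the Bonferroni correction, applied to per-item p-values for the interaction term. It is offered only as an illustration of how the answer depends on the number of hypotheses counted. The slides do not say which adjustments were compared, and the item names and p-values are made up.

```python
def bonferroni_flags(p_values, alpha=0.05):
    """Flag items whose p-value is below alpha divided by the number of tests.

    p_values: dict mapping item name -> p-value of the non-uniform DIF test
    alpha:    family-wise significance level (an assumed default)
    """
    n_tests = len(p_values)
    if n_tests == 0:
        return {}
    threshold = alpha / n_tests
    return {item: p < threshold for item, p in p_values.items()}

# Hypothetical p-values for four items.
example = {"item1": 0.001, "item2": 0.02, "item3": 0.40, "item4": 0.012}
flags = bonferroni_flags(example)
print(flags)
```

With four tests the per-item threshold is 0.0125, so an item with p = 0.02 is flagged without adjustment but not with it.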
DIFdetect package. Can be downloaded from the web: www.alz.washington.edu/DIFDETECT/welcome.html (STATA-based, user-friendly package).
Outline revisited: (1) Statistical background. (2) DIFdetect package. (3) What do we do when we find DIF? (4) DIF adjustments to PARSCALE code. (5) How good are adjusted scores? (6) Discussion.
What to do when we find DIF? In educational settings, items with DIF are often discarded. This is an unattractive option for us: tests are too short as it is, and we would lose variation and precision. DIF doesn’t mean that the item fails to measure the underlying construct at all, just that it does so differently in different groups.
What do we do – 2. Need a technique that incorporates items found to have DIF differently from DIF-free items. There is precedent for this approach in Reise, Widaman, and Pugh (1993). Constrain parameters for DIF-free items to be identical across groups. Estimate parameters for items found with DIF separately in the appropriate groups.
Compensatory DIF. Compensatory DIF occurs when DIF in some items leads to erroneous findings in other items, both false-positive and false-negative. Use an iterative process for each covariate until a stable solution is reached (i.e., the same items are identified with DIF on separate runs of DIFdetect).
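The iterate-until-stable idea can be sketched as a simple fixed-point loop. This is not DIFdetect itself: `detect` is a hypothetical callable standing in for one full run (re-estimate scores accounting for the currently flagged items, then re-test every item), and `toy_detect` is a made-up stand-in used only so the example runs.

```python
def iterate_dif_detection(detect, max_runs=20):
    """Repeat DIF detection until two consecutive runs flag the same items.

    detect:   callable taking the set of items flagged so far and returning
              the set of items flagged on the next run (hypothetical interface)
    max_runs: safety limit on the number of runs (an assumed default)
    """
    flagged = frozenset()
    for run in range(1, max_runs + 1):
        new_flagged = frozenset(detect(flagged))
        if new_flagged == flagged:
            # Same items as the previous run: stable solution reached.
            return flagged, run
        flagged = new_flagged
    raise RuntimeError("no stable solution within max_runs")

def toy_detect(previously_flagged):
    """Made-up detector: item5's DIF only shows once item2 is accounted for."""
    if "item2" in previously_flagged:
        return {"item2", "item5"}
    return {"item2"}

stable_items, runs = iterate_dif_detection(toy_detect)
print(sorted(stable_items), "after", runs, "runs")
```

In practice this loop would be run once per covariate, as the slide describes.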
Adjustments to PARSCALE. Create a new dataset that treats items according to their DIF status. [Diagram: item 1, no DIF; item 2, DIF; item 3, no DIF. Cell labels: Group 1, Missing, Group 2, Group 3.]
Modified data set
0001 12XX2
0002 12XX4
0003 01XX3
…
0132 1X2X2
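One way to build such records is to split each item found with DIF into one "virtual" item per group, keeping the observed response in the person's own group column and writing the missing code X elsewhere. The Python sketch below illustrates that idea only; it is not the authors' STATA code, the function names and group labels are hypothetical, and the toy responses are not meant to reproduce the records shown above.

```python
MISSING = "X"  # missing-data code used in the modified data set

def split_dif_item(response, group, groups):
    """Return one virtual-item response per group.

    The observed response goes in the column for the person's own group;
    every other group's column gets the missing code.
    """
    return [response if g == group else MISSING for g in groups]

def build_record(responses, group, dif_items, groups):
    """Build one response string, expanding items with DIF into virtual items.

    responses: string of single-character item responses
    group:     this person's group label
    dif_items: set of 0-based item positions found to have DIF
    groups:    ordered list of all group labels
    """
    out = []
    for position, response in enumerate(responses):
        if position in dif_items:
            out.extend(split_dif_item(response, group, groups))
        else:
            out.append(response)
    return "".join(out)

# Toy example: three items, the second (position 1) has DIF, two groups.
rec_a = build_record("122", group="A", dif_items={1}, groups=["A", "B"])
rec_b = build_record("122", group="B", dif_items={1}, groups=["A", "B"])
print(rec_a, rec_b)
```

Items without DIF keep a single column, so their parameters are estimated from everyone; each virtual item is informed only by the group whose responses it holds.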
PARSCALE code. Need new lines (new blocks) for all new items that we create. We are automating this step as an extension to DIFdetect. Current best advice is to use a huge table in Word. Creation of new items is easy; we have STATA code for creating virtual items.
Preparation of data for PARSCALE
Reminder of PARSCALE tips. When outfiling from STATA, use wide format and commas. Change missing values to .x, then open the file in Word and replace “.x” with X. Remember to change 2-digit numbers to their appropriate letters.
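The manual recoding steps could also be scripted. The sketch below is a rough Python illustration: it takes one comma-separated line of item values, turns “.x” into X, and converts 2-digit numbers to letters. The mapping 10 → A, 11 → B, and so on is an assumption (the slides say only "their appropriate letters"), and the function names are hypothetical.

```python
import string

MISSING = "X"

def recode_value(value):
    """Recode one exported value to a single character."""
    value = value.strip()
    if value == ".x":
        return MISSING
    n = int(value)
    if 0 <= n <= 9:
        return str(n)
    if 10 <= n <= 35:
        # Assumed mapping: 10 -> A, 11 -> B, ...
        letter = string.ascii_uppercase[n - 10]
        if letter == MISSING:
            raise ValueError("score would collide with the missing code X")
        return letter
    raise ValueError(f"cannot recode value {value!r}")

def recode_line(line):
    """Recode a comma-separated line of item values into one response string."""
    return "".join(recode_value(v) for v in line.split(","))

recoded = recode_line("1,2,.x,10,0")
print(recoded)
```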
It gets complicated… This is the CASI, first run of education DIF, after looking at gender and age:
Table helps with PARSCALE code
Adjusted scores related to dementia and CIND. In the ACT study, controlling for CASI score (continuous): odds ratio of 2.9 (1.8-4.9) for low DIF-adjusted IRT score (among those with low CASI scores); adjusted for gender, education, and age; strict 2-stage sample design; verification bias. In the CSHA, controlling for 3MS score (continuous): weighted odds ratio of 1.6 (1.1-2.3) for dementia for low DIF-adjusted IRT score, and 1.4 (1.2-1.8) for CIND; adjusted for education and language; sampling and weighting to deal with verification bias.
Incorporation of adjusted scores into analyses. Here we are in novel territory. Is there a reason not to adjust scores for DIF? Questions and comments.
Comparison of OLR with other techniques. OLR is more flexible (can look at continuous constructs, e.g., education, without dichotomizing or grouping). DIFdetect is very fast. When IRT-derived θ scores are used, OLR is a re-parameterization of IRT analyses. DIFdetect OLR incorporates the epidemiological concepts of confounding and effect modification. A special issue of Medical Care, edited by Teresi, is forthcoming.