An IRT-based approach to detection of aberrant response patterns for tests with multiple components
National Conference on Student Assessment
June 21, 2016 – Philadelphia, PA
Li Cai, Kilchan Choi, & Mark Hansen
Overview
- testing for differences in projected and “observed” score estimates
- illustrations
  - a 4-dimensional English language proficiency assessment
  - a 1-dimensional English language arts/literacy assessment
- other potential applications
- caveats/limitations
Testing for differences in projected and “observed” score estimates
- In item response theory (IRT) scoring, the posterior distribution describes a plausible distribution for an individual’s true ability, given the evidence collected in the test (the item responses).
- The posterior distribution is the source of the scale score point estimate, as well as information about the precision of that estimate (the standard error of measurement).
- In this study, we examine the possibility of using posterior distributions estimated from separate parts of a test to detect unlikely/inconsistent response patterns.
Calibrated projection
For multidimensional tests, calibrated projection (Thissen et al., 2010) is used to obtain, from one part of a test, plausible score distributions that can be compared to the posterior distributions obtained directly through the scoring of another part.
Steps: 1. calibration → 2. projection → 3. scoring → 4. evaluation
[Figure: path diagram illustrating the four steps across the two test parts]
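The intuition behind the projection step can be sketched under a simplifying Gaussian assumption (this is an illustration only, not the full Thissen et al. calibrated projection, which works with the entire posterior): if the two dimensions follow a standard bivariate normal prior with correlation rho, and the part-1 posterior is approximately N(eap1, sd1²), the projected score distribution for the second dimension follows by normal conditioning.

```python
import math

def project_score(eap1, sd1, rho):
    """Gaussian sketch of score projection (illustrative assumption,
    not the full calibrated projection method).

    Assumes (theta1, theta2) follow a standard bivariate normal prior
    with correlation rho, and the part-1 posterior for theta1 is
    approximately N(eap1, sd1**2). Conditioning gives theta2 a projected
    mean of rho * eap1 and variance (1 - rho**2) + rho**2 * sd1**2.
    """
    mean2 = rho * eap1
    sd2 = math.sqrt((1.0 - rho**2) + rho**2 * sd1**2)
    return mean2, sd2

# Perfectly correlated dimensions reproduce the part-1 posterior exactly;
# uncorrelated dimensions fall back to the prior, N(0, 1).
print(project_score(1.0, 0.3, 1.0))  # (1.0, 0.3)
print(project_score(1.0, 0.3, 0.0))  # (0.0, 1.0)
```

Note how the projected standard deviation grows as the correlation weakens, which foreshadows the power limitations discussed later.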
A test statistic for comparing two posterior distributions

$$\chi^2_i = \frac{\left(\hat{\theta}_{1i} - \hat{\theta}_{2i}\right)^2}{\sigma^2_{\hat{\theta}_{1i}} + \sigma^2_{\hat{\theta}_{2i}}}, \qquad df = 1$$

- numerator: squared difference between the two score estimates
- denominator: sum of the squared standard errors of measurement (which is the error variance of the numerator, since the estimates are independent)
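The statistic is straightforward to compute; a minimal sketch (the function name is ours, not from the source):

```python
from scipy.stats import chi2

def wald_chi2(theta1, se1, theta2, se2):
    """Wald chi-square statistic (df = 1) comparing two independent
    IRT score estimates, as defined above.

    theta1, theta2: score estimates from the two test parts
    se1, se2: their standard errors of measurement
    """
    stat = (theta1 - theta2) ** 2 / (se1 ** 2 + se2 ** 2)
    p_value = chi2.sf(stat, df=1)  # upper-tail probability
    return stat, p_value

# Example: a one-unit score difference with SEMs of 0.3 and 0.4
stat, p = wald_chi2(1.0, 0.3, 0.0, 0.4)  # stat = 4.0, p ≈ 0.046
```

With the conventional 0.05 threshold, this example would be flagged as an inconsistent pair of score estimates.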
Illustrations a 4-dimensional English language proficiency assessment a 1-dimensional English language arts/literacy assessment
Illustration #1
- Using 3 domains of an English Language Proficiency Assessment (listening, reading, speaking) to predict performance on the 4th domain (writing)
- Compare the projected writing score with the “actual” writing score
[Figure: path diagram — the L, R, and S domains project to W; the “actual” W score is estimated from the writing items]
about 97% of the cases examined do not display significant differences between the scores estimated from the 2 test segments 12 randomly selected cases with Wald 𝜒 2 p≥0.05 shown at right
About 3% of the cases examined display significant differences between the scores estimated from the 2 test segments 12 randomly selected cases with Wald 𝜒 2 p<0.05 shown at right
[Figure: distribution of score differences; mean difference = −0.01]
Illustration #2
- Test consists of two segments/components
- Scaling model is unidimensional, although this can be viewed as a special case of projection in which the dimensions have the same distribution, N(μ, σ²), and are perfectly correlated (ρ = 1)
[Figure: a unidimensional model for items in segments 1 and 2 is equivalent to a two-dimensional model with ρ = 1]
Illustration #2 (cont.)
- Compare score estimates from 2 components of an English language arts/literacy (ELA/L) assessment
[Figure: each segment’s “actual” score is compared with the score projected from the other segment, with ρ = 1]
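The two-segment comparison can be sketched end to end with a minimal EAP scoring routine under a 2PL model. This is an illustration only: the item parameters below are hypothetical, not those of the operational ELA/L assessment, and operational scoring uses richer models.

```python
import numpy as np

GRID = np.linspace(-4.0, 4.0, 81)  # quadrature grid for theta

def eap_score(responses, a, b):
    """EAP estimate and posterior SD under a 2PL model with a standard
    normal prior (a sketch with hypothetical item parameters)."""
    responses, a, b = map(np.asarray, (responses, a, b))
    # item response probabilities at each grid point: (grid, items)
    p = 1.0 / (1.0 + np.exp(-a * (GRID[:, None] - b)))
    like = np.prod(np.where(responses == 1, p, 1.0 - p), axis=1)
    post = like * np.exp(-0.5 * GRID**2)   # multiply by N(0,1) prior
    post /= post.sum()                     # normalize the posterior
    eap = float((GRID * post).sum())
    sd = float(np.sqrt(((GRID - eap) ** 2 * post).sum()))
    return eap, sd

# Hypothetical item parameters for the two segments
a1, b1 = [1.2, 0.9, 1.5, 1.0], [-0.5, 0.0, 0.5, 1.0]
a2, b2 = [1.1, 1.3, 0.8, 1.4], [-1.0, -0.2, 0.3, 0.8]

# An inconsistent pattern: strong on segment 1, weak on segment 2
t1, s1 = eap_score([1, 1, 1, 1], a1, b1)
t2, s2 = eap_score([0, 0, 0, 0], a2, b2)
chi2_stat = (t1 - t2) ** 2 / (s1**2 + s2**2)  # the Wald statistic
```

With only four items per segment the posterior SDs are large, so even this extreme all-right/all-wrong pattern yields only a moderately large statistic — a concrete view of the precision-driven power limits noted below.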
about 93% of the cases examined do not display significant differences between the scores estimated from the 2 test segments 12 randomly selected cases with Wald 𝜒 2 p≥0.05 shown at right
about 7% of the cases examined display significant differences between the scores estimated from the 2 test segments 12 randomly selected cases with Wald 𝜒 2 p<0.05 shown at right
mean difference = -0.08
Caveats and limitations
- Model must be approximately correct
  - Reasonably good overall fit
  - Stable item and structural parameter estimates
  - Aberrant patterns in the calibration may reduce sensitivity
- Ability to detect unlikely responses is dependent on (projected) score precision
  - Low power when projecting from a small number of items
  - Low power when projecting across weakly correlated domains
- Interpretation of findings is not particularly clear. Rudner, Bracey, & Skaggs (1996, p. 107): “In general, we need more clinical oriented studies that find aberrant patterns of responses and then follow up with respondents. We know of no studies that empirically investigate what these respondents are like. Can anything meaningful be said about them beyond the fact that they do not look like typical respondents?”

Rudner, L. M., Bracey, G., & Skaggs, G. (1996). The use of a person-fit statistic with one high-quality achievement test. Applied Measurement in Education, 9(1), 91–109.
Other potential applications
- Test-retest consistency
- Comparisons of machine- and hand-scored responses

Next steps
- Examine calibration of the test statistic (error rates, power)
- Investigate causes/meaning of discrepancy
  - Clustering of inconsistent patterns
  - Student characteristics
THANK YOU! Li Cai, lcai@ucla.edu Kilchan Choi, kcchoi@ucla.edu Mark Hansen, markhansen@ucla.edu