1
National Conference on Student Assessment
An IRT-based approach to detection of aberrant response patterns for tests with multiple components
National Conference on Student Assessment
June 21, 2016 – Philadelphia, PA
Li Cai, Kilchan Choi, & Mark Hansen
2
Overview
testing for differences in projected and “observed” score estimates
illustrations
  a 4-dimensional English language proficiency assessment
  a 1-dimensional English language arts/literacy assessment
other potential applications
caveats/limitations
3
Testing for differences in projected and “observed” score estimates
In item response theory (IRT) scoring, the posterior distribution describes a plausible distribution for an individual’s true ability, given the evidence collected in the test (the item responses).
The posterior distribution is the source of the scale score point estimate, as well as information about the precision of that estimate (the standard error of measurement).
In this study, we examine the possibility of using posterior distributions estimated from separate parts of a test to detect unlikely or inconsistent response patterns.
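As a rough illustration of how a posterior distribution yields both a point estimate and a standard error, the sketch below computes an expected a posteriori (EAP) score on a quadrature grid. It assumes a two-parameter logistic (2PL) item model, a standard normal prior, and made-up item parameters; none of these choices come from the presentation.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response (assumed item model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def eap_score(responses, items, n_points=81, lo=-4.0, hi=4.0):
    """Return (posterior mean, posterior SD) on an evenly spaced grid.

    responses: list of 0/1 item scores
    items: list of (a, b) tuples, hypothetical discrimination/difficulty
    """
    step = (hi - lo) / (n_points - 1)
    grid = [lo + k * step for k in range(n_points)]
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * theta * theta)  # unnormalized N(0, 1) prior
        for u, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            w *= p if u == 1 else 1.0 - p   # multiply in the likelihood
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)

# Hypothetical five-item test and one response pattern.
items = [(1.2, -1.0), (0.8, -0.3), (1.5, 0.0), (1.0, 0.6), (1.3, 1.2)]
theta_hat, sem = eap_score([1, 1, 1, 0, 0], items)
```

Here `theta_hat` plays the role of the scale score point estimate and `sem` the standard error of measurement, both read off the same posterior.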
4
Calibrated projection
For multidimensional tests, calibrated projection (Thissen et al., 2010) is used to obtain, from one part of a test, plausible score distributions that can be compared with the posterior distributions obtained directly by scoring another part.
1. calibration
2. projection
3. scoring
4. evaluation
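The projection step can be sketched numerically. The example below is a simplified stand-in, not the authors' implementation: it assumes two dimensions with a standard bivariate normal distribution and correlation `rho`, 2PL items that measure only the first dimension, and hypothetical item parameters. It integrates the first dimension out of the joint posterior to obtain the projected distribution on the second.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response (assumed item model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def project(responses, items, rho, n_points=61, lo=-4.0, hi=4.0):
    """Mean and SD of theta2 given responses to items that measure theta1.

    Assumes a standard bivariate normal for (theta1, theta2) with
    correlation rho, where -1 < rho < 1.
    """
    step = (hi - lo) / (n_points - 1)
    grid = [lo + k * step for k in range(n_points)]

    # Likelihood of the observed segment-1 responses at each theta1 point.
    like = []
    for t1 in grid:
        value = 1.0
        for u, (a, b) in zip(responses, items):
            p = p_correct(t1, a, b)
            value *= p if u == 1 else 1.0 - p
        like.append(value)

    # Integrate theta1 out of likelihood x joint prior, for each theta2.
    det = 1.0 - rho * rho
    marginal = []
    for t2 in grid:
        total = 0.0
        for t1, l1 in zip(grid, like):
            quad = (t1 * t1 - 2.0 * rho * t1 * t2 + t2 * t2) / det
            total += l1 * math.exp(-0.5 * quad)  # unnormalized density
        marginal.append(total)

    norm = sum(marginal)
    mean = sum(t * w for t, w in zip(grid, marginal)) / norm
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, marginal)) / norm
    return mean, math.sqrt(var)

# Hypothetical items on the first dimension and one response pattern.
listening_items = [(1.1, -0.8), (0.9, -0.2), (1.4, 0.3), (1.0, 0.9)]
proj_mean, proj_sd = project([1, 1, 1, 0], listening_items, rho=0.7)
```

Under these assumptions the projected distribution tightens as `rho` grows, which is why projection across weakly correlated dimensions carries little information.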
5
A test statistic for comparing two posterior distributions
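$$\chi_i^2 = \frac{\left(\theta_{1i} - \theta_{2i}\right)^2}{\sigma_{\theta_{1i}}^2 + \sigma_{\theta_{2i}}^2}, \qquad \text{degrees of freedom} = 1$$
Numerator: squared difference between the two score estimates.
Denominator: sum of the squared standard errors of measurement (which is the error variance of the difference in the numerator, since the estimates are independent).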
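A minimal sketch of this statistic in Python follows. The function name and the example values are hypothetical. The p-value uses the fact that a chi-square variable with 1 degree of freedom is a squared standard normal, so the upper tail equals `erfc(sqrt(x / 2))`.

```python
import math

def wald_chi2(theta1, se1, theta2, se2):
    """Chi-square statistic (df = 1) and p-value for two independent estimates."""
    diff = theta1 - theta2
    stat = diff * diff / (se1 * se1 + se2 * se2)
    # Upper-tail probability of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical projected and directly scored estimates for one examinee.
stat, p_value = wald_chi2(theta1=0.5, se1=0.3, theta2=-0.7, se2=0.4)
```

With these made-up inputs the difference is 1.2 and the summed error variance is 0.25, so the statistic is 5.76 and the pattern would be flagged at the 0.05 level.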
6
Illustrations
a 4-dimensional English language proficiency assessment
a 1-dimensional English language arts/literacy assessment
7
Compare the projected writing score with the “actual” writing score
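Illustration #1
Using 3 domains of an English Language Proficiency Assessment (listening, reading, speaking) to predict performance on the 4th domain (writing).
Compare the projected writing score with the “actual” writing score.
[Diagram: items from the listening (L), reading (R), and speaking (S) domains give a projected writing (W) score, which is set against the “actual” W score from the writing items.]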
8
About 97% of the cases examined do not display significant differences between the scores estimated from the 2 test segments.
12 randomly selected cases with Wald χ² p ≥ 0.05 are shown at right.
9
About 3% of the cases examined display significant differences between the scores estimated from the 2 test segments.
12 randomly selected cases with Wald χ² p < 0.05 are shown at right.
10
mean difference = -0.01
11
Illustration #2: Test consists of two segments/components
The scaling model is unidimensional, although this can be viewed as a special case of projection (in which the dimensions have the same distribution and are perfectly correlated).
[Diagram: a single dimension distributed N(μ, σ²) underlying the items of segment 1 and segment 2 ⇔ two dimensions, each distributed N(μ, σ²), with correlation ρ = 1, one for the items of each segment.]
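The equivalence can be checked with the standard conditional distribution of a bivariate normal. The symbols θ₁ and θ₂ for the two segment-specific dimensions are notation assumed here for illustration.

```latex
\theta_2 \mid \theta_1 \sim N\!\left(\mu + \rho\,(\theta_1 - \mu),\; \sigma^2\,(1 - \rho^2)\right)
\quad\xrightarrow{\;\rho = 1\;}\quad
\theta_2 \mid \theta_1 \sim N\!\left(\theta_1,\; 0\right)
```

With ρ = 1 the conditional variance vanishes, so θ₂ = θ₁ with probability 1, and the distribution projected from one segment is simply that segment's own posterior.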
12
Test consists of two segments/components
Illustration #2
Compare score estimates from 2 components of an English language arts/literacy (ELA/L) assessment.
[Diagram: with ρ = 1, the projection from segment 1 is compared with the “actual” score from segment 2, and the projection from segment 2 is compared with the “actual” score from segment 1.]
13
About 93% of the cases examined do not display significant differences between the scores estimated from the 2 test segments.
12 randomly selected cases with Wald χ² p ≥ 0.05 are shown at right.
14
About 7% of the cases examined display significant differences between the scores estimated from the 2 test segments.
12 randomly selected cases with Wald χ² p < 0.05 are shown at right.
15
mean difference = -0.08
16
Caveats and limitations
Model must be approximately correct
  Reasonably good overall fit
  Stable item and structural parameter estimates
  Aberrant patterns in the calibration may reduce sensitivity
Ability to detect unlikely responses is dependent on (projected) score precision
  Low power when projecting from small number of items
  Low power when projecting across weakly correlated domains
Interpretation of findings not particularly clear
  Rudner, Bracey, & Skaggs (1996, p. 107): “In general, we need more clinical oriented studies that find aberrant patterns of responses and then follow up with respondents. We know of no studies that empirically investigate what these respondents are like. Can anything meaningful be said about them beyond the fact that they do not look like typical respondents?”

Rudner, L. M., Bracey, G., & Skaggs, G. (1996). The use of a person-fit statistic with one high-quality achievement test. Applied Measurement in Education, 9(1),
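The caveat that detection depends on score precision can be made concrete with a small power calculation. This sketch assumes the difference between the two estimates is normally distributed with variance equal to the sum of the squared standard errors; the function name, the true difference, and the standard errors are hypothetical.

```python
import math
from statistics import NormalDist

def wald_power(delta, se1, se2, alpha=0.05):
    """Approximate power of the df = 1 test for a true score difference delta."""
    z = NormalDist()
    crit = z.inv_cdf(1.0 - alpha / 2.0)           # two-sided critical value
    lam = delta / math.sqrt(se1 ** 2 + se2 ** 2)  # standardized difference
    # Probability that the standardized difference falls outside +/- crit.
    return z.cdf(-crit + lam) + z.cdf(-crit - lam)

# Same true difference of 1.0, scored with precise and imprecise estimates.
power_precise = wald_power(1.0, se1=0.3, se2=0.3)
power_imprecise = wald_power(1.0, se1=0.6, se2=0.6)
```

Under these assumptions, doubling both standard errors sharply lowers the chance of flagging the same true discrepancy, in line with the low power expected when projecting from few items or across weakly correlated domains.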
17
Other potential applications
Test-retest consistency
Comparisons of machine- and hand-scored responses

Next Steps
Examine calibration of the test statistic (error rates, power)
Investigate causes/meaning of discrepancy
  Clustering of inconsistent patterns
  Student characteristics
18
THANK YOU!
Li Cai, lcai@ucla.edu
Kilchan Choi, kcchoi@ucla.edu
Mark Hansen,