1
Detecting Item Parameter Drift
in a CAT program using the Rasch Measurement Model
Mayuko Simon, David Chayer, Pam Hermann, and Yi Du
Data Recognition Corporation
April 2012
2
How should banked item parameters be checked?
The idea for this study arose when the authors were faced with a large existing bank of CAT items, with previously estimated item parameters, that needed to be augmented.
3
Re-calibration of banked item parameters and item parameter drift
Recalibration is recommended at periodic intervals. CAT item data form a sparse matrix, and the range of student abilities encountered by each item is limited.
4
What would be a reasonable way to recalibrate items?
The methods can be applied to:
- Maintenance of a CAT item bank
- Detecting item parameter drift
- Calibration of field-test items
5
How did other researchers calibrate/re-calibrate CAT data?
- Impute missing responses to avoid sparseness (Harmes, Parshall, & Kromrey, 2003)
- Calibrate field-test items by anchoring operational items (Wang & Wiley, 2004)
- Calibrate field-test items by anchoring person ability (Kingsbury, 2009)
- Use ability estimates to calibrate item parameters and detect drift (Stocking, 1988)
6
Simulation study
- 300 items in the item bank
- 20,000 students' simulated responses, abilities ~ N(0, 1)
- Known item parameter drift (10% of the item bank)
- Various drift sizes
7
Design: 10 easy items (d < -1.5), 10 medium items (-1.5 ≤ d ≤ 1.5), and 10 difficult items (d > 1.5) received drift.
- Condition 1: drift sizes 0.1, 0.2, 0.3, 0.4, 0.5 (one direction)
- Condition 2: drift sizes -0.1, -0.2, -0.3, -0.4, -0.5, 0.1, 0.2, 0.3, 0.4, 0.5 (both directions)
- Control condition: no change
(A simulation sketch follows.)
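To make the design concrete, here is a minimal sketch of how such drifted Rasch data could be simulated. The banked-difficulty range, the seed, and the random assignment of drift sizes are illustrative assumptions, not the authors' generating code, and the sparse CAT exposure pattern is not modeled.

```python
# Illustrative sketch of the simulation design (not the authors' code).
# Rasch model: P(X = 1 | theta, b) = 1 / (1 + exp(-(theta - b))).
import numpy as np

rng = np.random.default_rng(2012)                 # arbitrary seed

N_STUDENTS, N_ITEMS, N_DRIFTED = 20_000, 300, 30  # 10% of the bank drifts
theta = rng.normal(0.0, 1.0, N_STUDENTS)          # abilities ~ N(0, 1)
b_bank = rng.uniform(-2.5, 2.5, N_ITEMS)          # banked difficulties (assumed range)

# Condition 1: the drifted items move in one direction by 0.1..0.5 logits.
drift_sizes = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
drifted = rng.choice(N_ITEMS, size=N_DRIFTED, replace=False)
b_true = b_bank.copy()
b_true[drifted] += rng.choice(drift_sizes, size=N_DRIFTED)

# Generate full-matrix responses; a real CAT yields only a sparse subset.
p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b_true[None, :])))
responses = (rng.random((N_STUDENTS, N_ITEMS)) < p).astype(np.int8)
```

Condition 2 would draw drift sizes from ±0.1..±0.5, and the control condition would leave b_true equal to b_bank.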
8
Four calibration methods in this study
- Anchor person ability (AP)
- Anchor person ability and anchor the difficulties of 200 of the 300 items (API)
- Use the displacement value from the Winsteps output (Displacement)
- Item-by-item calibration (IBI)
9
IBI: Item-by-item calibration
- A vector of responses for an item
- A vector of abilities of the examinees who took the item
- Same concept as logistic regression, but uses Winsteps to calibrate (see the sketch below)
- No sparseness involved
- Less data is needed (especially when not all items in the bank need to be checked)
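The logistic-regression analogy can be made explicit: with person abilities treated as fixed, the Rasch likelihood for one item has a single free parameter, the difficulty b, whose MLE solves a one-dimensional score equation. The authors calibrate with Winsteps; the scipy version below is an illustrative stand-in, not their procedure.

```python
# Illustrative item-by-item (IBI) calibration with fixed person abilities.
# Solves the one-parameter Rasch score equation: sum(x) = sum P(theta_i, b).
import numpy as np
from scipy.optimize import brentq

def ibi_difficulty(theta, x):
    """MLE of one item's Rasch difficulty, given fixed abilities.

    theta -- abilities of the examinees who took this item
    x     -- their 0/1 responses (must contain both 0s and 1s,
             otherwise the MLE is infinite)
    """
    theta, x = np.asarray(theta, float), np.asarray(x, int)

    def score(b):
        p = 1.0 / (1.0 + np.exp(-(theta - b)))
        return x.sum() - p.sum()

    # score(b) is increasing in b, so a wide bracket contains the root.
    return brentq(score, -8.0, 8.0)
```

Because only the response and ability vectors for the item under review are needed, no sparse full matrix is involved, which is the slide's point about needing less data.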
10
Evaluation
- One-sample t-test with alpha = 0.01 for AP, API, and IBI
- Cutoff value of 0.4 for the Displacement method
- Type I error rate
- Type II error rate
- Sensitivity (Type II error + sensitivity = 1)
- RMSE (average difference from the banked value for flagged items)
- Bias (average bias from the banked value for flagged items)
(A sketch of the flagging and error-rate computation follows.)
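A minimal sketch of how the flagging and the error rates could be computed, assuming recalibrated difficulties are collected across replications. The alpha = 0.01 t-test and the 0.4 displacement cutoff come from the slide; the array names and shapes are assumptions.

```python
# Illustrative flagging and evaluation (array names/shapes are assumptions):
# recal is (replications, items); b_bank and displacement are (items,).
import numpy as np
from scipy.stats import ttest_1samp

def flag_ttest(recal, b_bank, alpha=0.01):
    """AP/API/IBI flagging: one-sample t-test, across replications, of
    (recalibrated - banked) difficulty against zero."""
    _, pval = ttest_1samp(recal - b_bank, popmean=0.0, axis=0)
    return pval < alpha

def flag_displacement(displacement, cutoff=0.4):
    """Displacement-method flagging: |Winsteps displacement| > cutoff."""
    return np.abs(displacement) > cutoff

def evaluate(flagged, truly_drifted, recal, b_bank):
    """Type I / Type II error, sensitivity, and RMSE / bias of flagged items."""
    type1 = flagged[~truly_drifted].mean()        # false alarms on stable items
    type2 = (~flagged[truly_drifted]).mean()      # misses on drifted items
    diff = recal.mean(axis=0)[flagged] - b_bank[flagged]
    rmse, bias = np.sqrt(np.mean(diff ** 2)), diff.mean()
    return type1, type2, 1.0 - type2, rmse, bias  # sensitivity = 1 - Type II
```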
11
Type I error rate (averaged over 40 replications)
- Type I error for the Control condition is also inflated
- Condition 1 had a higher Type I error rate
12
Type II error rate (averaged over 40 replications)
- Type II error for the Displacement method is too high
- Condition 1 had a higher Type II error rate
13
Sensitivity (averaged over 40 replications)
- Sensitivity for the Displacement method is too low
- Condition 1 had a lower sensitivity
14
Items with small sample sizes and small drift are difficult to flag correctly.
16
Type II errors occurred for items with small sample sizes and/or small drift
- Items with small drift
- Items with small N but large drift
17
Same items
18
Which method's re-calibrated item difficulties are closest to the banked values?
- Medians of the RMSE are similar across the three methods
- IBI has less variance in RMSE than AP
19
Which method shows less bias in the re-calibrated item difficulties?
- All three methods have very small bias
- IBI has less variance in bias than AP
20
Conclusion
- Use caution with the Displacement value to identify item parameter drift
- AP, API, and IBI worked reasonably well
- Item parameter drift is difficult to detect for items with small drift or small sample sizes
- Compared to AP, IBI had less variance in RMSE and bias
- Item parameter drift in one direction (Condition 1) biases the final ability estimates, leading to higher Type I and Type II errors
21
Limitation and Future Study
- The proportion of items with item parameter drift was 10% of the bank. How would the results change with different proportions? What about different drift sizes?
- Only the Rasch model was used. What about other models and software?
- The minimum sample size was 10. What about other minimum sample sizes (e.g., 30, 50)?
- No iterative procedure was used (the difficulties of drifted items were not updated). Do results improve with an iterative procedure that updates difficulties after drift is detected?