Detecting Item Parameter Drift

Similar presentations

DIF Analysis Galina Larina of March, 2012 University of Ostrava.
LOGO One of the easiest to use Software: Winsteps
Comparison of Acceptance Criteria The acceptance rates of simulated samples with and without various problems were examined to compare different acceptance.
Integration Techniques
The DIF-Free-Then-DIF Strategy for the Assessment of Differential Item Functioning 1.
Introduction to Power Analysis  G. Quinn & M. Keough, 2003 Do not copy or distribute without permission of authors.
Estimation from Samples Find a likely range of values for a population parameter (e.g. average, %) Find a likely range of values for a population parameter.
PSY 307 – Statistics for the Behavioral Sciences
© Anita Lee-Post Quality Control Part 2 By Anita Lee-Post By.
FINAL REPORT: OUTLINE & OVERVIEW OF SURVEY ERRORS
Reliability of Selection Measures. Reliability Defined The degree of dependability, consistency, or stability of scores on measures used in selection.
A Comparison of Progressive Item Selection Procedures for Computerized Adaptive Tests Brian Bontempo, Mountain Measurement Gage Kingsbury, NWEA Anthony.
Arun Srivastava. Types of Non-sampling Errors Specification errors, Coverage errors, Measurement or response errors, Non-response errors and Processing.
Quantify prediction uncertainty (Book, p ) Prediction standard deviations (Book, p. 180): A measure of prediction uncertainty Calculated by translating.
Copyright 2010, The World Bank Group. All Rights Reserved. Agricultural Census Sampling Frames and Sampling Section A 1.
10/03/2005NOV-3300-SL M. Weiss, F. Baret D. Allard, S. Garrigues From local measurements to high spatial resolution VALERI maps.
LECTURE 06B BEGINS HERE THIS IS WHERE MATERIAL FOR EXAM 3 BEGINS.
Topic (ii): New and Emerging Methods Maria Garcia (USA) Jeroen Pannekoek (Netherlands) UNECE Work Session on Statistical Data Editing Paris, France,
© 2013 Toshiba Corporation An Estimation of Computational Complexity for the Section Finding Problem on Algebraic Surfaces Chiho Mihara (TOSHIBA Corp.)
for statistics based on multiple sources
Copyright  2003 by Dr. Gallimore, Wright State University Department of Biomedical, Industrial Engineering & Human Factors Engineering Human Factors Research.
University of Ostrava Czech republic 26-31, March, 2012.
Optimal Delivery of Items in a Computer Assisted Pilot Francis Smart Mark Reckase Michigan State University.
Module 1: Measurements & Error Analysis Measurement usually takes one of the following forms especially in industries: Physical dimension of an object.
NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL.
Sample Size Determination
Summary of Bayesian Estimation in the Rasch Model H. Swaminathan and J. Gifford Journal of Educational Statistics (1982)
Item Parameter Estimation: Does WinBUGS Do Better Than BILOG-MG?
FIT ANALYSIS IN RASCH MODEL University of Ostrava Czech republic 26-31, March, 2012.
Item Analysis: Classical and Beyond SCROLLA Symposium Measurement Theory and Item Analysis Heriot Watt University 12th February 2003.
BUS304 – Chapter 7 Estimating Population Mean 1 Review – Last Week  Sampling error The radio station claims that on average a household in San Diego spends.
Ming Lei American Institutes for Research Okan Bulut Center for Research in Applied Measurement and Evaluation University of Alberta Item Parameter and.
Chapter 9 – Statistical Estimation Statistical estimation involves estimating a population parameter with a sample statistic. Two types of estimation:
A framework for multiple imputation & clustering -Mainly basic idea for imputation- Tokei Benkyokai 2013/10/28 T. Kawaguchi 1.
Using Simulation to evaluate Rasch Models John Little CEM, Durham University
NURS 306, Nursing Research Lisa Broughton, MSN, RN, CCRN RESEARCH STATISTICS.
Pairwise comparisons: Confidence intervals Multiple comparisons Marina Bogomolov and Gili Baumer.
Multiple Imputation in Finite Mixture Modeling Daniel Lee Presentation for MMM conference May 24, 2016 University of Connecticut 1.
The Conditional Random-Effects Variance Component in Meta-regression
Big data classification using neural network
Sample Size Determination
Theme (i): New and emerging methods
Sampling Design and Analysis MTH 494
Quality Assurance in the clinical laboratory
Iterative Target Rotation with a Suboptimal Number of Factors Nicole Zelinsky - University of California, Merced - Introduction.
Item Analysis: Classical and Beyond
Evaluation of Parameter Recovery, Drift, and DIF with CAT Data
Booklet Design and Equating
Estimating with PROBE II
Chapter 10 Verification and Validation of Simulation Models
How to handle missing data values
Calculating Sample Size: Cohen’s Tables and G. Power
Aligned to Common Core State Standards
Mohamed Dirir, Norma Sinclair, and Erin Strauts
Smarter Balanced Scoring (AKA the “Marble Slides”)
Now What?: I’ve Found Nothing
Overfitting and Underfitting
Role of Statistics in Developing Standardized Examinations in the US
Reading Property Data Analysis – A Primer, Ch.9
15.1 The Role of Statistics in the Research Process
Power analysis Chong-ho Yu, Ph.Ds..
Diagnostics and Remedial Measures
Research Design and Methods
Item Analysis: Classical and Beyond
Quality Assessment The goal of laboratory analysis is to provide the accurate, reliable and timeliness result Quality assurance The overall program that.
Measurements & Error Analysis
Item Analysis: Classical and Beyond
Diagnostics and Remedial Measures
Playing the y-model XU Kun 24 Nov 2011.
Presentation transcript:

Detecting Item Parameter Drift in a CAT Program Using the Rasch Measurement Model. Mayuko Simon, David Chayer, Pam Hermann, and Yi Du. Data Recognition Corporation, April 2012

How should banked item parameters be checked? The idea for this study came about when the authors were faced with a large existing bank of CAT items with estimated item parameters that needed augmentation.

Re-calibration of banked item parameters and item parameter drift. Recalibration is recommended at periodic intervals. CAT item data form a sparse matrix, and the range of student ability observed for each item is limited.

What would be a reasonable way to recalibrate items? The methods can be applied to:
- Maintenance of a CAT item bank
- Detecting item parameter drift
- Calibration of field-test items

How did other researchers calibrate/re-calibrate CAT data?
- Impute missing responses to avoid sparseness (Harmes, Parshall, and Kromrey, 2003)
- Calibrate field-test (FT) items by anchoring the operational items (Wang and Wiley, 2004)
- Calibrate FT items by anchoring person ability (Kingsbury, 2009)
- Use ability estimates to calibrate item parameters and detect drift (Stocking, 1988)

Simulation study:
- 300 items in the item bank
- Simulated responses from 20,000 students, with abilities drawn from N(0, 1)
- Known item parameter drift introduced in 10% of the item bank
- Various drift sizes

Design
- Item difficulty bands, 10 drifted items each: Easy (d < -1.5), Medium (-1.5 ≤ d ≤ 1.5), Difficult (d > 1.5)
- Item parameter drift sizes, Condition 1: 0.1, 0.2, 0.3, 0.4, 0.5
- Item parameter drift sizes, Condition 2: -0.1, -0.2, -0.3, -0.4, -0.5, 0.1, 0.2, 0.3, 0.4, 0.5
- Control condition: no change
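A minimal sketch of how such a simulation might be set up. The banked difficulty distribution is an assumption (the slides do not state it), and a full rather than sparse response matrix is generated for brevity; a real CAT administration would yield sparse data.

```python
import numpy as np

rng = np.random.default_rng(0)

n_items, n_students = 300, 20_000
theta = rng.normal(0.0, 1.0, n_students)    # abilities ~ N(0, 1), as in the study
b_bank = rng.normal(0.0, 1.5, n_items)      # banked difficulties (assumed spread)

# Introduce known drift in 10% of the bank (Condition 1: positive shifts only).
drift = np.zeros(n_items)
drifted = rng.choice(n_items, size=n_items // 10, replace=False)
drift[drifted] = rng.choice([0.1, 0.2, 0.3, 0.4, 0.5], size=drifted.size)
b_true = b_bank + drift

# Rasch probability of a correct response, then Bernoulli draws.
p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b_true[None, :])))
responses = (rng.random((n_students, n_items)) < p).astype(int)
```

Condition 2 would draw drift sizes from both signs; the control condition would leave `drift` at zero throughout.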

Four calibration methods in this study:
- Anchor person ability (AP)
- Anchor person ability and the difficulties of 200 of the 300 items (API)
- Use the Displacement value from the Winsteps output
- Item-by-item calibration (IBI)
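Winsteps reports a displacement value for each anchored item: roughly, how far the data would push the difficulty away from its anchored value. As an illustrative stand-in (not Winsteps' actual computation), displacement can be pictured as the one-step Newton shift from the anchored difficulty toward the difficulty the responses imply:

```python
import numpy as np

def displacement(theta, x, b_anchor):
    """One-step Newton shift from the anchored difficulty toward the
    difficulty the responses imply -- a rough stand-in for the idea
    behind Winsteps' displacement value, not its actual computation."""
    p = 1.0 / (1.0 + np.exp(-(theta - b_anchor)))  # Rasch P(correct | theta, b_anchor)
    return (p - x).sum() / (p * (1.0 - p)).sum()   # positive => item harder than anchored

# Quick check: an item whose difficulty drifted +0.5 from its anchored value.
rng = np.random.default_rng(2)
theta = rng.normal(0.0, 1.0, 5_000)
x = (rng.random(theta.size) < 1.0 / (1.0 + np.exp(-(theta - 0.5)))).astype(int)
d = displacement(theta, x, b_anchor=0.0)
```

Under the decision rule used later in these slides, an item would be flagged as drifted when |displacement| exceeds 0.4.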

IBI: item-by-item calibration
- A vector of responses for an item, paired with a vector of abilities of the examinees who took that item
- Same concept as logistic regression, but Winsteps is used to calibrate
- No sparseness involved
- Less data is needed (especially when not all items in the bank need to be checked)
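The logistic-regression analogy can be sketched as a one-parameter maximum-likelihood fit with the abilities held fixed. This is an illustrative hand-rolled version, not the Winsteps run the authors used:

```python
import numpy as np

def calibrate_item(theta, x, n_iter=25):
    """Estimate one Rasch item difficulty b by maximum likelihood,
    treating the abilities `theta` of the examinees who saw the item
    as fixed and known (the IBI idea). `x` is the 0/1 response vector."""
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(theta - b)))
        step = (p - x).sum() / (p * (1.0 - p)).sum()  # Newton-Raphson step
        b += step
        if abs(step) < 1e-8:
            break
    return b

# Quick check on simulated data: recover a known difficulty of 0.7.
rng = np.random.default_rng(1)
theta = rng.normal(0.0, 1.0, 5_000)
x = (rng.random(theta.size) < 1.0 / (1.0 + np.exp(-(theta - 0.7)))).astype(int)
b_hat = calibrate_item(theta, x)
```

Because each item is fit on only the responses it actually received, no sparse matrix ever has to be assembled, which is the practical appeal of IBI.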

Evaluation
- One-sample t-test with alpha = 0.01 for AP, API, and IBI
- Cutoff value of 0.4 for the Displacement method
- Type I error rate and Type II error rate
- Sensitivity (Type II error + sensitivity = 1)
- RMSE (average difference from the banked value for flagged items)
- BIAS (average bias from the banked value for flagged items)
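The error rates and recovery statistics above might be tallied as follows; `evaluate_flagging` is a hypothetical helper (not the authors' code), taking per-item boolean flags, the true drift status, and the recalibrated and banked difficulties:

```python
import numpy as np

def evaluate_flagging(flagged, truly_drifted, b_recal, b_bank):
    """Tally Type I/II error rates, sensitivity, RMSE, and BIAS for one
    replication. Hypothetical helper sketching the slides' criteria."""
    type_i = (flagged & ~truly_drifted).sum() / (~truly_drifted).sum()
    type_ii = (~flagged & truly_drifted).sum() / truly_drifted.sum()
    sensitivity = 1.0 - type_ii                # Type II + sensitivity = 1
    diff = b_recal[flagged] - b_bank[flagged]  # flagged items only
    rmse = float(np.sqrt(np.mean(diff ** 2)))
    bias = float(np.mean(diff))
    return type_i, type_ii, sensitivity, rmse, bias

# Toy example: four items, two truly drifted, three flagged.
flagged = np.array([True, False, True, True])
drifted = np.array([True, False, False, True])
b_recal = np.array([0.5, 0.0, 0.30, -1.2])
b_bank  = np.array([0.2, 0.0, 0.25, -1.5])
t1, t2, sens, rmse, bias = evaluate_flagging(flagged, drifted, b_recal, b_bank)
```

The study averaged these quantities over 40 replications per condition.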

Type I error rate (average over 40 replications): Type I error for the Control condition is also inflated. Condition 1 had a higher Type I error rate.

Type II error rate (average over 40 replications): Type II error for the Displacement method is too high. Condition 1 had a higher Type II error rate.

Sensitivity (average over 40 replications): Sensitivity for the Displacement method is too low. Condition 1 had a lower sensitivity rate.

Items with small sample sizes and small drift are difficult to flag correctly.

Type II errors occurred for items with small sample sizes and/or small drift (the plots mark items with small drift, and items with small N but large drift).


Which method recalibrates item difficulty closest to the banked value? The medians of the RMSE are similar across the three methods; IBI has less variance in RMSE than AP.

Which method has less bias in the recalibrated item difficulty? All three methods have very small bias; IBI has less variance in BIAS than AP.

Conclusion
- Use caution with the Displacement value to identify item parameter drift.
- AP, API, and IBI worked reasonably well.
- Item parameter drift is difficult to detect for items with small drift or small sample sizes.
- Compared to AP, IBI had less variance in RMSE and BIAS.
- Item parameter drift in one direction (Condition 1) would cause more bias in the final ability estimates, leading to higher Type I and Type II errors.

Limitations and future study
- The proportion of items with item parameter drift was 10% of the bank. How would the results change with other proportions? With other drift sizes?
- Only the Rasch model was used. What about other models and software?
- The minimum sample size was 10. What about other minimum sample sizes (e.g., 30, 50)?
- No iterative procedure was used (the item difficulty was not updated after drift was detected). Would results improve with an iterative procedure that updates the difficulty after detection?