IRT Model Misspecification and Metric Consequences
Sora Lee, Sien Deng, Daniel Bolt
Dept. of Educational Psychology, University of Wisconsin-Madison




Similar presentations
Advanced Methods and Analysis for the Learning and Social Sciences PSY505 Spring term, 2012 January 23, 2012.

The effect of differential item functioning in anchor items on population invariance of equating Anne Corinne Huggins University of Florida.
Mark D. Reckase Michigan State University The Evaluation of Teachers and Schools Using the Educator Response Function (ERF)
Item Response Theory in a Multi-level Framework Saralyn Miller Meg Oliphint EDU 7309.
Fit of Ideal-point and Dominance IRT Models to Simulated Data Chenwei Liao and Alan D Mead Illinois Institute of Technology.
Hierarchical Linear Modeling for Detecting Cheating and Aberrance Statistical Detection of Potential Test Fraud May, 2012 Lawrence, KS William Skorupski.
Item Response Theory in Health Measurement
IRT Equating Kolen & Brennan, IRT If data used fit the assumptions of the IRT model and good parameter estimates are obtained, we can estimate person.
Exploring the Full-Information Bifactor Model in Vertical Scaling With Construct Shift Ying Li and Robert W. Lissitz.
Models for Measuring. What do the models have in common? They are all cases of a general model. How are people responding? What are your intentions in.
Wisdom of Crowds and Rank Aggregation Wisdom of crowds phenomenon: aggregating over individuals in a group often leads to an estimate that is better than.
A Method for Estimating the Correlations Between Observed and IRT Latent Variables or Between Pairs of IRT Latent Variables Alan Nicewander Pacific Metrics.
Multivariate Data Analysis Chapter 11 - Structural Equation Modeling.
Item Response Theory. Shortcomings of Classical True Score Model Sample dependence Limitation to the specific test situation. Dependence on the parallel.
Estimating Growth when Content Specifications Change: A Multidimensional IRT Approach Mark D. Reckase Tianli Li Michigan State University.
Testing factorial invariance in multilevel data: A Monte Carlo study Eun Sook Kim Oi-man Kwok Myeongsun Yoon.
A Hierarchical Framework for Modeling Speed and Accuracy on Test Items Van Der Linden.
© UCLES 2013 Assessing the Fit of IRT Models in Language Testing Muhammad Naveed Khalid Ardeshir Geranpayeh.
Item Analysis: Classical and Beyond SCROLLA Symposium Measurement Theory and Item Analysis Modified for EPE/EDP 711 by Kelly Bradley on January 8, 2013.
Identification of Misfit Item Using IRT Models Dr Muhammad Naveed Khalid.
Item Response Theory. What’s wrong with the old approach? Classical test theory –Sample dependent –Parallel test form issue Comparing examinee scores.
Xitao Fan, Ph.D. Chair Professor & Dean Faculty of Education University of Macau Designing Monte Carlo Simulation Studies.
Aaker, Kumar, Day Ninth Edition Instructor’s Presentation Slides
Modeling Menstrual Cycle Length in Pre- and Peri-Menopausal Women Michael Elliott Xiaobi Huang Sioban Harlow University of Michigan School of Public Health.
Adventures in Equating Land: Facing the Intra-Individual Consistency Index Monster * *Louis Roussos retains all rights to the title.
Is the Force Concept Inventory Biased? Investigating Differential Item Functioning on a Test of Conceptual Learning in Physics Sharon E. Osborn Popp, David.
Analysis and Visualization Approaches to Assess UDU Capability Presented at MBSW May 2015 Jeff Hofer, Adam Rauk 1.
Introduction Neuropsychological Symptoms Scale The Neuropsychological Symptoms Scale (NSS; Dean, 2010) was designed for use in the clinical interview to.
The ABC’s of Pattern Scoring Dr. Cornelia Orr. Slide 2 Vocabulary Measurement – Psychometrics is a type of measurement Classical test theory Item Response.
Measuring Mathematical Knowledge for Teaching: Measurement and Modeling Issues in Constructing and Using Teacher Assessments DeAnn Huinker, Daniel A. Sass,
Clustering Seasonality Patterns in the Presence of Errors Advisor : Dr. Hsu Graduate : You-Cheng Chen Author : Mahesh Kumar Nitin R. Patel Jonathan Woo.
Dealing with Omitted and Not- Reached Items in Competence Tests: Evaluating Approaches Accounting for Missing Responses in Item Response Theory Models.
Investigating Faking Using a Multilevel Logistic Regression Approach to Measuring Person Fit.
Issues in Assessment Design, Vertical Alignment, and Data Management : Working with Growth Models Pete Goldschmidt UCLA Graduate School of Education &
Applying SGP to the STAR Assessments Daniel Bolt Dept of Educational Psychology University of Wisconsin, Madison.
Calibration of Response Data Using MIRT Models with Simple and Mixed Structures Jinming Zhang Jinming Zhang University of Illinois at Urbana-Champaign.
Test Scaling and Value-Added Measurement Dale Ballou Vanderbilt University April, 2008.
OPENING QUESTIONS 1.What key concepts and symbols are pertinent to sampling? 2.How are the sampling distribution, statistical inference, and standard.
A COMPARISON METHOD OF EQUATING CLASSIC AND ITEM RESPONSE THEORY (IRT): A CASE OF IRANIAN STUDY IN THE UNIVERSITY ENTRANCE EXAM Ali Moghadamzadeh, Keyvan.
Scaling and Equating Joe Willhoft Assistant Superintendent of Assessment and Student Information Yoonsun Lee Director of Assessment and Psychometrics Office.
Validity and Item Analysis Chapter 4. Validity Concerns what the instrument measures and how well it does that task Not something an instrument has or.
Validity and Item Analysis Chapter 4.  Concerns what instrument measures and how well it does so  Not something instrument “has” or “does not have”
The ABC’s of Pattern Scoring
University of Ostrava Czech republic 26-31, March, 2012.
Item Factor Analysis Item Response Theory Beaujean Chapter 6.
NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL.
Reliability performance on language tests is also affected by factors other than communicative language ability. (1) test method facets They are systematic.
Latent regression models. Where does the probability come from? Why isn’t the model deterministic. Each item tests something unique – We are interested.
Item Response Theory in Health Measurement
Summary of Bayesian Estimation in the Rasch Model H. Swaminathan and J. Gifford Journal of Educational Statistics (1982)
Item Parameter Estimation: Does WinBUGS Do Better Than BILOG-MG?
Item Analysis: Classical and Beyond SCROLLA Symposium Measurement Theory and Item Analysis Heriot Watt University 12th February 2003.
Aligning Assessments to Monitor Growth in Math Achievement: A Validity Study Jack B. Monpas-Huber, Ph.D. Director of Assessment & Student Information Washington.
The Design of Statistical Specifications for a Test Mark D. Reckase Michigan State University.
2. Main Test Theories: The Classical Test Theory (CTT) Psychometrics. 2011/12. Group A (English)
Ming Lei American Institutes for Research Okan Bulut Center for Research in Applied Measurement and Evaluation University of Alberta Item Parameter and.
PARCC Field Test Study Comparability of High School Mathematics End-of- Course Assessments National Conference on Student Assessment San Diego June 2015.
Item Response Theory Dan Mungas, Ph.D. Department of Neurology University of California, Davis.
LISA A. KELLER UNIVERSITY OF MASSACHUSETTS AMHERST Statistical Issues in Growth Modeling.
Two Approaches to Estimation of Classification Accuracy Rate Under Item Response Theory Quinn N. Lathrop and Ying Cheng Assistant Professor Ph.D., University.
Lesson 2 Main Test Theories: The Classical Test Theory (CTT)
IRT Equating Kolen & Brennan, 2004 & 2014 EPSY
Daniel Muijs Saad Chahine
Vertical Scaling in Value-Added Models for Student Learning
Assessment Research Centre Online Testing System (ARCOTS)
Booklet Design and Equating
Neil T. Heffernan, Joseph E. Beck & Kenneth R. Koedinger
Margaret Wu University of Melbourne
Determining Which Method to use
Item Analysis: Classical and Beyond
Presentation transcript:

IRT Model Misspecification and Metric Consequences
Sora Lee, Sien Deng, Daniel Bolt
Dept. of Educational Psychology, University of Wisconsin-Madison

Overview
- The application of IRT methods to construct vertical scales commonly suggests a decline in the mean and variance of growth as grade level increases (Tong & Kolen, 2006).
- This result seems related to the problem of "scale shrinkage" discussed in the 1980s and 1990s (Yen, 1985; Camilli, Yamamoto & Wang, 1993).
- Understanding this issue is of practical importance given the increasing use of growth metrics for evaluating teachers and schools (Ballou, 2009).

Purpose of this Study
- To examine logistic positive exponent (LPE) models as a possible source of model misspecification in vertical scaling, using real data.
- To evaluate the metric implications of LPE-related misspecification by simulation.

Data Structure (WKCE 2011)
- Item responses for students across two consecutive years (including only students who advanced one grade across years).
- 46 multiple-choice items each year, all scored 0/1.
- Sample sizes > 57,000 for each grade level.
- Grade levels 3-8.

Table: Wisconsin Knowledge and Concepts Examination (WKCE) Math Scores, Grades 4-8.
Columns: 2011 Grade, Sample Size; 2010 Scale Scores (Mean, SD); 2011 Scale Scores (Mean, SD); Change (Mean, SD). [Cell values not preserved in the transcript.]

Samejima's 2PL Logistic Positive Exponent (2PL-LPE) Model
The probability of successful execution of each subprocess g for an item i is modeled according to a 2PL model:
$$\Psi_i(\theta) = \frac{1}{1 + \exp[-a_i(\theta - b_i)]},$$
while the overall probability of a correct response to the item is
$$P_i(\theta) = \left[\Psi_i(\theta)\right]^{\xi_i},$$
where $\xi_i > 0$ is an acceleration parameter representing the complexity of the item.
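To make the model concrete, here is a minimal R sketch of the 2PL-LPE response function (the function name p_2pl_lpe and the plotted parameter values are illustrative assumptions; only the a = 1.0, b = 0 setting echoes the figure below):

```r
# Illustrative 2PL-LPE item characteristic curve (notation as above):
# the 2PL "subprocess" probability Psi raised to the acceleration xi.
p_2pl_lpe <- function(theta, a, b, xi) {
  psi <- 1 / (1 + exp(-a * (theta - b)))  # 2PL subprocess probability
  psi^xi                                  # overall P(correct)
}

# xi = 1 reduces to the ordinary 2PL; larger xi pulls the ICC down
# and makes it asymmetric, as in the figure that follows.
curve(p_2pl_lpe(x, a = 1, b = 0, xi = 1), from = -4, to = 4,
      xlab = "theta", ylab = "P(correct)")
curve(p_2pl_lpe(x, a = 1, b = 0, xi = 4), add = TRUE, lty = 2)
```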

Samejima's 3PL Logistic Positive Exponent (3PL-LPE) Model
The probability of successful execution of each subprocess g for an item i is modeled according to a 2PL model:
$$\Psi_i(\theta) = \frac{1}{1 + \exp[-a_i(\theta - b_i)]},$$
while the overall probability of a correct response to the item incorporates a pseudo-guessing parameter $c_i$:
$$P_i(\theta) = c_i + (1 - c_i)\left[\Psi_i(\theta)\right]^{\xi_i},$$
where $\xi_i > 0$ is an acceleration parameter representing the complexity of the item.
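Under this reading, the 3PL-LPE adds the lower asymptote exactly as in the standard 3PL; a one-function R sketch (hypothetical names again):

```r
# Illustrative 3PL-LPE curve: g is the pseudo-guessing parameter
# (the c_i above), supplying a lower asymptote; the LPE kernel
# governs the rest of the curve.
p_3pl_lpe <- function(theta, a, b, g, xi) {
  psi <- 1 / (1 + exp(-a * (theta - b)))
  g + (1 - g) * psi^xi
}
```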

Figure: Effect of the acceleration parameter on the ICC (a = 1.0, b = 0).

Figure: Item characteristic curves for an LPE item (a = .76, b = -3.62, ξ = 8) when approximated by the 2PL.

Table: Analysis of WKCE Data: Deviance Information Criterion (DIC) comparing LPE to traditional IRT models.
Columns: 2PL, 2PL-LPE, 3PL, 3PL-LPE; rows: Grades 3-8. [Cell values not preserved in the transcript.]

Table: Example 2PL-LPE item parameter estimates and standard errors (WKCE 8th Grade).
Columns: Item, a (S.E.), b, ξ. [Cell values not preserved in the transcript.]

Figure: Item characteristic curves of the 2PL and 2PL-LPE models (WKCE 7th Grade).

Figure: Item characteristic curves of the 3PL and 3PL-LPE models (WKCE 7th Grade).

Table: Goodness-of-fit testing for the 2PL model (WKCE 6th Grade example items).
Columns: Item, Chi-square, P-value. [Cell values not preserved in the transcript.]

Simulation Studies
- Study 1: Study of 2PL and 3PL misspecification (with LPE-generated data) across groups.
- Study 2: Hypothetical 2PL- and 3PL-based vertical scaling with LPE-generated data.

Study 1
Purpose: To examine the extent to which the "shrinkage phenomenon" may be due to LPE-induced misspecification, i.e., the effect on the IRT metric of ignoring item complexity.
Method: Item responses are generated from both the 2PL-LPE and 3PL-LPE models but are fit by the corresponding 2PL and 3PL IRT models. All model parameters are estimated using Bayesian methods in WinBUGS14. The magnitude of the increase in θ estimates relative to the true θ change was quantified to evaluate scale shrinkage.
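A hedged R sketch of the generating step for the 2PL condition (the slides used WinBUGS14 for estimation; the sample size and item-parameter distributions below are illustrative assumptions, not those of the study):

```r
set.seed(1)
n_persons <- 2000; n_items <- 46
theta <- rnorm(n_persons)                 # true abilities
a  <- rlnorm(n_items, 0, 0.3)             # illustrative item parameters
b  <- rnorm(n_items)
xi <- runif(n_items, 1, 6)                # acceleration (item complexity)

# Responses follow the 2PL-LPE; misspecification arises when these
# data are later calibrated with an ordinary 2PL.
prob <- sapply(seq_len(n_items), function(i) {
  psi <- 1 / (1 + exp(-a[i] * (theta - b[i])))
  psi^xi[i]
})
resp <- matrix(rbinom(length(prob), 1, prob), n_persons, n_items)
```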

Results, Study 1. [Figures for the 2PL and 3PL conditions not preserved in the transcript.]

Study 2
- Simulated IRT vertical equating study, Grades 3-8.
- We assume 46 unique items at each grade level, plus an additional 10 items common to successive grades for linking.
- Data are simulated as unidimensional across all grade levels.
- We assume a mean θ change of 0.5 and 1.0 across all successive grades; at Grade 3, θ ~ Normal(0, 1).
- All items are simulated from the LPE, with linking items simulated like those of the lower grade level.
- Successive grades are linked using Stocking & Lord's method (as implemented in the R package plink; Weeks, 2007), as sketched below.
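For reference, a minimal R sketch of the Stocking & Lord criterion that the linking step minimizes (the study used plink; this hand-rolled version, with hypothetical names sl_criterion, aX/bX, aY/bY, only illustrates the idea for 2PL common items):

```r
# Stocking-Lord criterion: squared distance between the common-item
# test characteristic curves, with Form Y parameters rescaled by the
# candidate linking constants A (slope) and B (intercept).
sl_criterion <- function(par, aX, bX, aY, bY, theta = seq(-4, 4, by = 0.1)) {
  A <- par[1]; B <- par[2]
  tccX <- rowSums(sapply(seq_along(aX), function(i)
    1 / (1 + exp(-aX[i] * (theta - bX[i])))))
  tccY <- rowSums(sapply(seq_along(aY), function(i)
    1 / (1 + exp(-(aY[i] / A) * (theta - (A * bY[i] + B))))))
  sum((tccX - tccY)^2)
}

# Usage, given common-item parameter estimates on the two forms:
# link <- optim(c(1, 0), sl_criterion, aX = aX, bX = bX, aY = aY, bY = bY)
# A <- link$par[1]; B <- link$par[2]
```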

Results, Study 2 Table: Mean Estimated Stocking & Lord (1980) Linking Parameters across 20 Replications, Simulation Study 2

Results, Study 2 Figure: True and Estimated Growth By Grade, Simulation Study 2

Conclusions and Future Directions
- Diminished growth across grade levels may be a model misspecification problem unrelated to test multidimensionality.
- Use of Samejima's LPE to account for changes in item complexity across grade levels may provide a more realistic account of growth.
- Challenge: estimation of the LPE is difficult because the item difficulty and acceleration parameters provide confounded accounts of item difficulty.