Exploring the Full-Information Bifactor Model in Vertical Scaling With Construct Shift Ying Li and Robert W. Lissitz.

Similar presentations
Hong Jiao, George Macready, Junhui Liu, & Youngmi Cho (2012)

DIF Analysis. Galina Larina, March 2012, University of Ostrava.
Structural Equation Modeling Using Mplus Chongming Yang Research Support Center FHSS College.
How Should We Assess the Fit of Rasch-Type Models? Approximating the Power of Goodness-of-fit Statistics in Categorical Data Analysis Alberto Maydeu-Olivares.
Scaling of the Cognitive Data and Use of Student Performance Estimates: Guide to the PISA Data Analysis Manual.
Assessment Procedures for Counselors and Helping Professionals, 7e © 2010 Pearson Education, Inc. All rights reserved. Chapter 5 Reliability.
General Information --- What is the purpose of the test? For what population is the test designed? Is this population relevant to the people who will take your.
Item Response Theory in Health Measurement
Introduction to Item Response Theory
Issues of Technical Adequacy in Measuring Student Growth for Educator Effectiveness Stanley Rabinowitz, Ph.D. Director, Assessment & Standards Development.
A Cognitive Diagnosis Model for Cognitively-Based Multiple-Choice Options Jimmy de la Torre Department of Educational Psychology Rutgers, The State University.
PRESENTATION AT THE 12TH ANNUAL MARYLAND ASSESSMENT CONFERENCE, COLLEGE PARK, MD, OCTOBER 18, 2012. JOSEPH A. MARTINEAU, JI ZENG, MICHIGAN DEPARTMENT OF EDUCATION.
Overview of field trial analysis procedures National Research Coordinators Meeting Windsor, June 2008.
RESEARCH METHODS Lecture 18
Multivariate Data Analysis Chapter 11 - Structural Equation Modeling.
Item Response Theory. Shortcomings of Classical True Score Model Sample dependence Limitation to the specific test situation. Dependence on the parallel.
Estimating Growth when Content Specifications Change: A Multidimensional IRT Approach Mark D. Reckase Tianli Li Michigan State University.
Testing factorial invariance in multilevel data: A Monte Carlo study Eun Sook Kim Oi-man Kwok Myeongsun Yoon.
VALIDITY & RELIABILITY. Raja C. Bandaranayake. QUALITIES OF MEASUREMENT DEVICES: Validity - Does it measure what it is supposed to measure? Reliability.
© UCLES 2013 Assessing the Fit of IRT Models in Language Testing Muhammad Naveed Khalid Ardeshir Geranpayeh.
Measurement Concepts & Interpretation. Scores on tests can be interpreted: By comparing a client to a peer in the norm group to determine how different.
A comparison of exposure control procedures in CATs using the 3PL model.
Identification of Misfit Item Using IRT Models Dr Muhammad Naveed Khalid.
Item response modeling of paired comparison and ranking data.
MEASUREMENT MODELS. BASIC EQUATION: x = τ + e, where x = observed score and τ = true (latent) score: represents the score that would be obtained over many independent.
Introduction to plausible values National Research Coordinators Meeting Madrid, February 2010.
Adventures in Equating Land: Facing the Intra-Individual Consistency Index Monster * *Louis Roussos retains all rights to the title.
Calculations of Reliability We are interested in calculating the ICC –First step: Conduct a single-factor, within-subjects (repeated measures) ANOVA –This.
Test item analysis: When are statistics a good thing? Andrew Martin Purdue Pesticide Programs.
Population All members of a set which have a given characteristic. Population Data Data associated with a certain population. Population Parameter A measure.
Measuring Mathematical Knowledge for Teaching: Measurement and Modeling Issues in Constructing and Using Teacher Assessments DeAnn Huinker, Daniel A. Sass,
Rasch trees: A new method for detecting differential item functioning in the Rasch model Carolin Strobl Julia Kopf Achim Zeileis.
Dimensionality of the latent structure and item selection via latent class multidimensional IRT models FRANCESCO BARTOLUCCI.
Modeling Student Growth Using Multilevel Mixture Item Response Theory Hong Jiao Robert Lissitz University of Maryland Presentation at the 2012 MARCES Conference.
IRT Model Misspecification and Metric Consequences Sora Lee Sien Deng Daniel Bolt Dept of Educational Psychology University of Wisconsin, Madison.
Measurement Bias Detection Through Factor Analysis Barendse, M. T., Oort, F. J. Werner, C. S., Ligtvoet, R., Schermelleh-Engel, K.
Calibration of Response Data Using MIRT Models with Simple and Mixed Structures. Jinming Zhang, University of Illinois at Urbana-Champaign.
Center for Radiative Shock Hydrodynamics Fall 2011 Review Assessment of predictive capability Derek Bingham 1.
Measurement Models: Exploratory and Confirmatory Factor Analysis James G. Anderson, Ph.D. Purdue University.
G Lecture 7 Confirmatory Factor Analysis
Validity Validity: A generic term used to define the degree to which the test measures what it claims to measure.
1 EPSY 546: LECTURE 1 SUMMARY George Karabatsos. 2 REVIEW.
Scaling and Equating Joe Willhoft Assistant Superintendent of Assessment and Student Information Yoonsun Lee Director of Assessment and Psychometrics Office.
MEASUREMENT. Measurement: The assignment of numbers to observed phenomena according to certain rules. Rules of Correspondence: Defines measurement in a given.
Validity and Item Analysis Chapter 4. Validity Concerns what the instrument measures and how well it does that task Not something an instrument has or.
Validity and Item Analysis Chapter 4.  Concerns what instrument measures and how well it does so  Not something instrument “has” or “does not have”
The ABC’s of Pattern Scoring
University of Ostrava, Czech Republic, 26-31 March 2012.
CJT 765: Structural Equation Modeling Class 8: Confirmatory Factory Analysis.
NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL.
Nurhayati, M.Pd., Indraprasta University, Jakarta. Validity: Does it measure what it is supposed to measure? Reliability: How representative is it?
Multivariate selective editing via mixture models: first applications to Italian structural business surveys Orietta Luzi, Guarnera U., Silvestri F., Buglielli.
Latent regression models. Where does the probability come from? Why isn’t the model deterministic. Each item tests something unique – We are interested.
Item Parameter Estimation: Does WinBUGS Do Better Than BILOG-MG?
Aligning Assessments to Monitor Growth in Math Achievement: A Validity Study Jack B. Monpas-Huber, Ph.D. Director of Assessment & Student Information Washington.
Ming Lei American Institutes for Research Okan Bulut Center for Research in Applied Measurement and Evaluation University of Alberta Item Parameter and.
Item Response Theory Dan Mungas, Ph.D. Department of Neurology University of California, Davis.
Two Approaches to Estimation of Classification Accuracy Rate Under Item Response Theory Quinn N. Lathrop and Ying Cheng Assistant Professor Ph.D., University.
Chapter 2 Norms and Reliability. The essential objective of test standardization is to determine the distribution of raw scores in the norm group so that.
Vertical Scaling in Value-Added Models for Student Learning
Assessment Research Centre Online Testing System (ARCOTS)
Classical Test Theory Margaret Wu.
Validity and Reliability
Booklet Design and Equating
Workshop questionnaire.
National Conference on Student Assessment
Mohamed Dirir, Norma Sinclair, and Erin Strauts
Descriptive Statistics
Presentation transcript:

Exploring the Full-Information Bifactor Model in Vertical Scaling With Construct Shift Ying Li and Robert W. Lissitz

Contents: Questions, Methods, Results

Vertical scaling typically assumes (1) unidimensionality of the tests at each grade level and (2) invariance of the test construct across grades. When these assumptions do not hold, a bifactor model is considered for vertical scaling.

Questions. Why consider the bifactor model? (1) Computational simplicity of estimation, (2) ease of interpretation, and (3) the vertical scaling problems that arise across grades.

Purpose: (1) to propose and evaluate a bifactor model for IRT vertical scaling that can incorporate construct shift across grades while extracting a common scale, (2) to evaluate the robustness of the UIRT model in parameter recovery, and (3) to compare parameter estimates from the bifactor and UIRT models.

Method: Bifactor Model. Gibbons and Hedeker (1992) generalized the work of Holzinger and Swineford (1937) to derive a bifactor model for dichotomously scored item response data. Notation: θ_0 = general factor (ability); α_i0 = discrimination parameter of item i on the general factor; θ_s = group-specific factor (ability); α_is = discrimination parameter of item i on the group-specific factor; d_i = overall multidimensional item difficulty.
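The model equation itself was not captured in the transcript. In a logistic form consistent with the notation above (a standard way of writing the full-information bifactor model, added here for reference), the probability of a correct response to item i belonging to group-specific dimension s is

$$
P(x_i = 1 \mid \theta_0, \theta_s) = \frac{1}{1 + \exp\!\left[-\left(\alpha_{i0}\,\theta_0 + \alpha_{is}\,\theta_s + d_i\right)\right]}
$$

so each item loads on the general dimension and on exactly one group-specific dimension.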

Simulation. Common-item design; the bifactor model was used to generate the data; concurrent calibration (chosen for its stable results).

Simulated factors: sample size; number (percentage) of common items, with test length fixed at 60; variance of the grade-specific factor (degree of construct shift). 100 replications per condition.

Data generation. Item discrimination parameters were set to cycle repeatedly through 1.2, 1.4, 1.6, 1.8, 2.0, and 2.2 for the general dimension, and were fixed at 1.7 for the grade-specific dimensions.
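A minimal sketch of this data-generation step in Python is shown below. Values given on the slides (test length of 60, the general-dimension discrimination values, 1.7 for the grade-specific loading) are used directly; the sample size, intercepts, and grade-specific variance are placeholders, since the slides do not report them here.

import numpy as np

rng = np.random.default_rng(1)

# Settings: slide values where available, placeholders otherwise.
n_items = 60                # test length (slide value)
n_examinees = 2000          # placeholder sample size
spec_var = 0.5              # placeholder variance of the grade-specific factor

# Discrimination: general dimension cycles through the six slide values;
# grade-specific discrimination is fixed at 1.7 (slide value).
a_general = np.tile([1.2, 1.4, 1.6, 1.8, 2.0, 2.2], n_items // 6)
a_specific = np.full(n_items, 1.7)
d = rng.uniform(-2.0, 2.0, n_items)   # placeholder intercepts

# Abilities: general factor N(0, 1); grade-specific factor N(0, spec_var).
theta_g = rng.normal(0.0, 1.0, n_examinees)
theta_s = rng.normal(0.0, np.sqrt(spec_var), n_examinees)

# Bifactor 2PL probabilities and dichotomous responses.
logit = np.outer(theta_g, a_general) + np.outer(theta_s, a_specific) + d
prob = 1.0 / (1.0 + np.exp(-logit))
responses = (rng.uniform(size=prob.shape) < prob).astype(int)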

Identification of the bifactor model. For the general dimension, the variance of the general latent dimension was fixed to 1 and the discrimination parameters (loadings) were freely estimated. For the grade-specific dimensions (s = 1, 2, ..., k), the discrimination parameters (loadings) were fixed to the true value of 1.7, so that the variances of the grade-specific dimensions could be freely estimated. The common items answered by multiple groups were constrained to a single set of item parameters in the multiple-group concurrent calibration.

Model estimation. Multiple-group concurrent calibration was implemented. The models were estimated with the computer program IRTPRO (Cai, Thissen, & du Toit, 2011), using marginal maximum likelihood (MML) estimation with an EM algorithm.
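The "computational simplicity" mentioned earlier comes from the dimension-reduction result of Gibbons and Hedeker (1992), stated here because it is not shown on the slides: conditional on the general factor, the group-specific factors are independent, so the marginal probability of a response pattern x reduces to iterated two-dimensional integrals rather than a single (k + 1)-dimensional integral:

$$
P(\mathbf{x}) = \int \left\{ \prod_{s=1}^{k} \int \prod_{i \in s} P_i(\theta_0, \theta_s)^{x_i}\,\left[1 - P_i(\theta_0, \theta_s)\right]^{1 - x_i} g(\theta_s)\, d\theta_s \right\} g(\theta_0)\, d\theta_0 .
$$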

Evaluation criteria
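The formulas for the evaluation criteria were not captured in the transcript. Below is a minimal Python sketch of the usual definitions of bias, RMSE, and empirical SE over replications, assuming these are the definitions intended.

import numpy as np

def evaluation_criteria(estimates, true_value):
    # estimates: one estimate per replication; true_value: generating parameter value
    estimates = np.asarray(estimates, dtype=float)
    bias = np.mean(estimates - true_value)                  # mean signed error
    rmse = np.sqrt(np.mean((estimates - true_value) ** 2))  # root mean square error
    se = np.std(estimates, ddof=1)                          # empirical standard error
    return bias, rmse, se

# Illustration with fabricated numbers only.
print(evaluation_criteria([1.15, 1.22, 1.18, 1.25, 1.20], true_value=1.2))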

Abbreviations used in the results tables: RMSE = root mean square error; SE = standard error; SS = sample size; CI = common item; VR = variance of grade-specific factor.

Results

Person parameters. Estimates on the general dimension were recovered better than those on the grade-specific dimensions when the degree of construct shift was small or moderate; estimation accuracy improved with larger sample sizes.

Group parameters: overestimated.

UIRT. Discrimination parameters: overestimated. Difficulty parameters: well recovered. Construct shift affected the person and group mean estimates.

ANOVA Effects for the Simulated Factors. Three-way tests of between-subjects effects (ANOVA) on bias, RMSE, and SE, with factors sample size, degree of construct shift, and percentage of common items.
Bifactor model: sample size had small to moderate effects; degree of construct shift had no to small effects; percentage of common items produced small bias in d.
UIRT: sample size had small to moderate effects; degree of construct shift had large effects on the ability and group mean ability estimates, large bias in d, and large SEs for ability, d, group mean ability, and the grade-specific variance parameter.
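A sketch of how such a three-way between-subjects ANOVA could be run in Python with statsmodels is shown below. The factor levels and the outcome values are fabricated for illustration; only the factorial structure (sample size x construct shift x percentage of common items, with an outcome such as the RMSE of d) follows the slide.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Long-format results table: one row per replication in each condition.
levels = [(ss, vr, ci)
          for ss in (500, 1000)        # placeholder sample sizes
          for vr in (0.2, 0.5, 0.8)    # placeholder grade-specific variances
          for ci in (10, 20)]          # placeholder common-item percentages
rows = [(ss, vr, ci, rng.normal(0.20 + 0.05 * vr, 0.02))   # fabricated RMSE of d
        for ss, vr, ci in levels for _ in range(100)]      # 100 replications
results = pd.DataFrame(rows, columns=["ss", "vr", "ci", "rmse_d"])

# Three-way tests of between-subjects effects on RMSE of d.
model = smf.ols("rmse_d ~ C(ss) * C(vr) * C(ci)", data=results).fit()
print(sm.stats.anova_lm(model, typ=2))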

Comparison. UIRT overestimated the discrimination parameters, and its person and group mean parameter estimates were less accurate.

Real data. Fall 2006 Michigan mathematics assessments, Grades 3, 4, and 5; 4,000 randomly selected examinees; bifactor versus UIRT models.

Variance estimation

R=0.983

Discussion. Sample size affected estimation accuracy and stability; the variance of the grade-specific dimension affected stability. Be cautious about construct shift. Extensions: polytomous or mixed item formats, incorporating covariates, and longitudinal studies.

Open questions: Do the common items measure two group-specific abilities? What are the consequences of fixing the item discrimination parameters to their true values? Would a more general multidimensional IRT model be preferable?