Adventures in Equating Land: Facing the Intra-Individual Consistency Index Monster * *Louis Roussos retains all rights to the title.

Similar presentations
The effect of differential item functioning in anchor items on population invariance of equating Anne Corinne Huggins University of Florida.

Mark D. Reckase Michigan State University The Evaluation of Teachers and Schools Using the Educator Response Function (ERF)
DIF Analysis Galina Larina of March, 2012 University of Ostrava.
1 CSSS Large Scale Assessment Webinar Adaptive Testing in Science Kevin King (WestEd) Roy Beven (NWEA)
Advanced Topics in Standard Setting. Methodology Implementation Validity of standard setting.
IRT Equating Kolen & Brennan, IRT If data used fit the assumptions of the IRT model and good parameter estimates are obtained, we can estimate person.
Exploring the Full-Information Bifactor Model in Vertical Scaling With Construct Shift Ying Li and Robert W. Lissitz.
Validity In our last class, we began to discuss some of the ways in which we can assess the quality of our measurements. We discussed the concept of reliability.
Issues of Technical Adequacy in Measuring Student Growth for Educator Effectiveness Stanley Rabinowitz, Ph.D. Director, Assessment & Standards Development.
1 SSS II Lecture 1: Correlation and Regression Graduate School 2008/2009 Social Science Statistics II Gwilym Pryce
Objectives Look at Central Limit Theorem Sampling distribution of the mean.
LINEAR REGRESSION: Evaluating Regression Models. Overview Assumptions for Linear Regression Evaluating a Regression Model.
A Method for Estimating the Correlations Between Observed and IRT Latent Variables or Between Pairs of IRT Latent Variables Alan Nicewander Pacific Metrics.
Using Growth Models for Accountability Pete Goldschmidt, Ph.D. Assistant Professor California State University Northridge Senior Researcher National Center.
Item Response Theory. Shortcomings of Classical True Score Model Sample dependence Limitation to the specific test situation. Dependence on the parallel.
Topic 3: Regression.
Examing Rounding Rules in Angoff Type Standard Setting Methods Adam E. Wyse Mark D. Reckase.
© UCLES 2013 Assessing the Fit of IRT Models in Language Testing Muhammad Naveed Khalid Ardeshir Geranpayeh.
Chapter 7 Correlational Research Gay, Mills, and Airasian
Grade 3-8 English Language Arts and Mathematics Results August 8, 2011.
EngageNY.org Scoring the Regents Examination in Algebra I (Common Core)
Classroom Assessment Reliability. Classroom Assessment Reliability Reliability = Assessment Consistency. –Consistency within teachers across students.
Measurement Concepts & Interpretation. Scores on tests can be interpreted: By comparing a client to a peer in the norm group to determine how different.
EPSY 8223: Test Score Equating
Item Response Theory. What’s wrong with the old approach? Classical test theory –Sample dependent –Parallel test form issue Comparing examinee scores.
The Accuracy of Small Sample Equating: An Investigative/ Comparative study of small sample Equating Methods.  Kinge Mbella Liz Burton Rob Keller Nambury.
Establishing MME and MEAP Cut Scores Consistent with College and Career Readiness A study conducted by the Michigan Department of Education (MDE) and ACT,
McMillan Educational Research: Fundamentals for the Consumer, 6e © 2012 Pearson Education, Inc. All rights reserved. Educational Research: Fundamentals.
Out with the Old, In with the New: NYS Assessments “Primer” Basics to Keep in Mind & Strategies to Enhance Student Achievement Maria Fallacaro, MORIC
Test item analysis: When are statistics a good thing? Andrew Martin Purdue Pesticide Programs.
STA Lecture 161 STA 291 Lecture 16 Normal distributions: ( mean and SD ) use table or web page. The sampling distribution of and are both (approximately)
Measuring Mathematical Knowledge for Teaching: Measurement and Modeling Issues in Constructing and Using Teacher Assessments DeAnn Huinker, Daniel A. Sass,
Employing Empirical Data in Judgmental Processes Wayne J. Camara National Conference on Student Assessment, San Diego, CA June 23, 2015.
A Balancing Act: Common Items Nonequivalent Groups (CING) Equating Item Selection Tia Sukin Jennifer Dunn Wonsuk Kim Robert Keller July 24, 2009.
IRT Model Misspecification and Metric Consequences Sora Lee Sien Deng Daniel Bolt Dept of Educational Psychology University of Wisconsin, Madison.
MGS3100_04.ppt/Sep 29, 2015/Page 1 Georgia State University - Confidential MGS 3100 Business Analysis Regression Sep 29 and 30, 2015.
ELA & Math Scale Scores Steven Katz, Director of State Assessment Dr. Zach Warner, State Psychometrician.
Differential Item Functioning. Anatomy of the name DIFFERENTIAL –Differential Calculus? –Comparing two groups ITEM –Focus on ONE item at a time –Not the.
Pearson Copyright 2010 Some Perspectives on CAT for K-12 Assessments Denny Way, Ph.D. Presented at the 2010 National Conference on Student Assessment June.
Scaling and Equating Joe Willhoft Assistant Superintendent of Assessment and Student Information Yoonsun Lee Director of Assessment and Psychometrics Office.
ICT Teachers Training “About ICT” Presented by Kulkarni S.A. Mob No :
The Impact of Missing Data on the Detection of Nonuniform Differential Item Functioning W. Holmes Finch.
University of Ostrava Czech republic 26-31, March, 2012.
Generalized Mixed-effects Models for Monitoring Cut-scores for Differences Between Raters, Procedures, and Time Yeow Meng Thum Hye Sook Shin UCLA Graduate.
NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL.
Understanding the 2015 Smarter Balanced Assessment Results Assessment Services.
Using State Tests to Measure Student Achievement in Large-Scale Randomized Experiments IES Research Conference June 28 th, 2010 Marie-Andrée Somers (Presenter)
Item Parameter Estimation: Does WinBUGS Do Better Than BILOG-MG?
Aligning Assessments to Monitor Growth in Math Achievement: A Validity Study Jack B. Monpas-Huber, Ph.D. Director of Assessment & Student Information Washington.
Ming Lei American Institutes for Research Okan Bulut Center for Research in Applied Measurement and Evaluation University of Alberta Item Parameter and.
Vertical Articulation Reality Orientation (Achieving Coherence in a Less-Than-Coherent World) NCSA June 25, 2014 Deb Lindsey, Director of State Assessment.
Two Approaches to Estimation of Classification Accuracy Rate Under Item Response Theory Quinn N. Lathrop and Ying Cheng Assistant Professor Ph.D., University.
Measuring Research Variables
Utilizing Item Analysis to Improve the Evaluation of Student Performance Mihaiela Ristei Gugiu Central Michigan University Mihaiela Ristei Gugiu Central.
5. Evaluation of measuring tools: reliability Psychometrics. 2011/12. Group A (English)
1 Main achievement outcomes continued.... Performance on mathematics and reading (minor domains) in PISA 2006, including performance by gender Performance.
IRT Equating Kolen & Brennan, 2004 & 2014 EPSY
Nonequivalent Groups: Linear Methods Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices (2 nd ed.). New.
ARDHIAN SUSENO CHOIRUL RISA PRADANA P.
Classroom Analytics.
Reliability & Validity
Booklet Design and Equating
Partial Credit Scoring for Technology Enhanced Items
National Conference on Student Assessment
Mohamed Dirir, Norma Sinclair, and Erin Strauts
UNIT IV ITEM ANALYSIS IN TEST DEVELOPMENT
Investigating item difficulty change by item positions under the Rasch model Luc Le & Van Nguyen 17th International meeting of the Psychometric Society,
Perspectives on Equating: Considerations for Alternate Assessments
MGS 3100 Business Analysis Regression Feb 18, 2016
  Using the RUMM2030 outputs as feedback on learner performance in Communication in English for Adult learners Nthabeleng Lepota 13th SAAEA Conference.
Presentation transcript:

Adventures in Equating Land: Facing the Intra-Individual Consistency Index Monster * *Louis Roussos retains all rights to the title

Overview of Equating Designs and Methods
Designs
–Single Group
–Random Groups
–Common Item Nonequivalent Groups (CING)
Methods
–Mean
–Linear
–Equipercentile
–IRT True or Observed
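
For orientation, the first two methods listed are simple enough to show directly. Below is a minimal Python/NumPy sketch of mean and linear equating of raw scores under a random groups design; the function and variable names are illustrative only and are not taken from the presentation.

    import numpy as np

    def mean_equate(x_scores, y_scores):
        """Mean equating: shift Form X scores so the two form means match."""
        return np.asarray(x_scores, dtype=float) + (np.mean(y_scores) - np.mean(x_scores))

    def linear_equate(x_scores, y_scores):
        """Linear equating: match the mean and SD of Form X scores to Form Y."""
        x = np.asarray(x_scores, dtype=float)
        slope = np.std(y_scores) / np.std(x)
        intercept = np.mean(y_scores) - slope * np.mean(x)
        return slope * x + intercept

Equipercentile and IRT methods replace this linear relationship with the full score distributions or an IRT model, respectively.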

Guidelines for Selecting Common Items for Multiple-Choice (MC) Only Exams
–Representative of the total test (Kolen & Brennan, 2004)
–20% of the total test
–Same item positions
–Similar average/spread of item difficulties (Dorans, Kubiak, & Melican, 1997)
–Content representative (Klein & Jarjoura, 1985)
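
These guidelines can be checked mechanically once item statistics are available. The helper below is a hypothetical sketch (all inputs and names are assumed): it compares a proposed common-item set with the total test on length and on the mean and spread of classical item difficulties.

    import numpy as np

    def check_common_items(p_values, common_idx, min_prop=0.20):
        """Compare a proposed common-item set to the total test on
        proportion of items and on mean/SD of classical difficulty (p-values)."""
        p = np.asarray(p_values, dtype=float)
        common = p[list(common_idx)]
        report = {
            "prop_of_test": len(common) / len(p),     # guideline: roughly 20% or more
            "mean_diff_total": p.mean(),
            "mean_diff_common": common.mean(),
            "sd_diff_total": p.std(),
            "sd_diff_common": common.std(),
        }
        report["meets_length_rule"] = report["prop_of_test"] >= min_prop
        return report

Content-representativeness and item-position checks would need item metadata and are omitted here.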

Challenges in Equating Mixed-Format Tests (Kolen & Brennan, 2004; Muraki, Hombo, & Lee, 2000)
Constructed Response (CR) scored by raters
Small number of tasks
–Inadequate sampling of construct
–Changes in construct across forms
Common Items
–Content/difficulty balance of common items
–MC only may result in inadequate representation of groups/construct
IRT
–Small number of tasks may result in unstable parameter estimates
–Typically assume a single dimension underlies both item types
Format Effects

Current Research
Number of CR Items
–Smaller RMSD with larger numbers of items and/or score points (Li and Yin, 2008; Fitzpatrick and Yen, 2001)
–Misclassification (Fitzpatrick and Yen, 2001)
  Fewer than 12 items: more score points resulted in smaller error rates
  More than 12 items: error rates less than 10% regardless of score points
Trend Scoring (Tate, 1999, 2000; Kim, Walker, & McHale, 2008)
–Rescoring samples of CR items
–Smaller bias and equating error

Current Research (cont.)
Format Effects (FE)
–MC and CR measure similar constructs (Ercikan et al., 1993; Traub, 1993)
–Males scored higher on MC; females higher on CR (DeMars, 1998; Garner & Engelhard, 1999)
–Kim and Kolen, 2006
  Narrow-range tests (e.g., credentialing)
  Wide-range tests (e.g., achievement)
Individual Consistency Index (Tatsuoka & Tatsuoka, 1982)
–Detecting aberrant response patterns
–Not specifically in the context of mixed-format tests

Purpose and Research Questions
Purpose: Examine the impact of equating mixed-format tests when student subscores differ across item types. Specifically,
–To what extent does the intra-individual consistency of examinee responses across item formats impact equating results?
–How does the selection of common items differentially impact equating results with varying levels of intra-individual consistency?

Data
“Old Form” (OL) treated as “truth”
–Large-scale 6th grade testing program
–Mathematics
–54-point test
  34 multiple choice (MC)
  5 short answer (SA)
  5 constructed response (CR) worth 4 points each
–Approx. 70,000 examinees
“New Form” (NE)
–Exactly the same items as OL
–Samples of examinees from OL

[Diagram] A scoring test of 39 items administered as OL (old form, taken by all examinees) and NE (new form, taken by samples of 3,000 examinees)
–Both OL and NE contain the exact same items
–The only difference between the forms is the examinees

Intra-Individual Consistency
–Consistency of student responses across formats
–Regression of dichotomous item subscores (MC and SA) onto polytomous item subscores (CR)
–Standardized residuals
  Range from approximately to
  Example: a student whose CR subscores are under-predicted by two standard deviations based on the MC subscores
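
A minimal sketch of how such an index could be computed, assuming (as in the example above) that each examinee's CR subscore is predicted from the MC and SA subscores and the residual is divided by the residual standard deviation. This is an illustration with assumed inputs, not the presenters' exact specification.

    import numpy as np

    def consistency_index(mc, sa, cr):
        """Per-examinee standardized residuals from an OLS regression of
        CR subscores on MC and SA subscores (positive = CR under-predicted)."""
        X = np.column_stack([np.ones(len(mc)), mc, sa])   # intercept + predictors
        y = np.asarray(cr, dtype=float)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # OLS fit
        resid = y - X @ beta
        return resid / resid.std(ddof=X.shape[1])         # simple standardization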

Samples
Three groups of examinees based on the intra-individual consistency index
–Below −1.50 (NEG)
–−1.50 to (MID)
–Above (POS)
3,000 examinees per sample
–Sampled from each group based on percentages
–Samples selected to have the same quartiles and median as the whole group of examinees

Sampling Conditions
60/20/20
–60% sampled from one of the groups (i.e., NEG, MID, or POS)
–20% sampled from each of the remaining two groups
–Repeated for each of the three groups
40/30/30
–Analogous: 40% from one group, 30% from each of the other two
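
The conditions above reduce to drawing fixed proportions of a 3,000-examinee sample from the NEG/MID/POS groups. A hypothetical sketch of such a draw (it omits the quartile/median matching described on the previous slide):

    import numpy as np

    def draw_sample(ids_by_group, proportions, n_total=3000, seed=0):
        """Draw n_total examinee IDs with the given group proportions,
        e.g. proportions = {"NEG": 0.6, "MID": 0.2, "POS": 0.2}."""
        rng = np.random.default_rng(seed)
        sample = []
        for group, prop in proportions.items():
            n_g = int(round(n_total * prop))
            sample.extend(rng.choice(ids_by_group[group], size=n_g, replace=False))
        return sample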

Common Items
Six sets of common items
–MC only (12 points)
–CR only (12 points)
–MC (4) and CR (8)
–MC (8) and CR (4)
–MC (4), CR (4), and SA (4)
–MC (7), CR (4), and SA (1)
Representative of the total test in terms of content, difficulty, and length

Equating
–Common-item nonequivalent groups design
–Item parameters calibrated using PARSCALE 4.1
  3-parameter logistic (3PL) model for MC items
  2PL model for SA items
  Graded Response Model for CR items
–IRT scale transformation: mean/mean, mean/sigma, Stocking-Lord, and Haebara
–IRT true score equating
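
Of the scale transformation methods listed, mean/sigma is the simplest to show: the slope and intercept are chosen so the common items' difficulty (b) parameters have the same mean and standard deviation on both forms (Kolen & Brennan, 2004). A minimal sketch with assumed inputs:

    import numpy as np

    def mean_sigma_transform(b_new, b_old):
        """Mean/sigma linear transformation of the new-form scale onto the
        old-form scale, based on common-item b parameters."""
        b_new = np.asarray(b_new, dtype=float)
        b_old = np.asarray(b_old, dtype=float)
        A = b_old.std() / b_new.std()            # slope
        B = b_old.mean() - A * b_new.mean()      # intercept
        return A, B

    # New-form parameters are then rescaled as b* = A*b + B, a* = a/A,
    # and proficiencies as theta* = A*theta + B.

The Stocking-Lord and Haebara methods instead choose A and B to minimize differences between the common items' characteristic curves.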

Equating OL and NE
–“Common” items condition: equating conducted using only a selection of items treated as common
–All-items condition: “truth” established by equating NE to OL using all items as common items (OL and NE share every item in common)

Evaluation
–Bias and RMSE
  At each score point
  Averaged over score points
–Classification Consistency
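
With the all-items equating treated as truth, bias and RMSE can be computed at each raw score point across replications and then averaged. A hypothetical sketch, assuming a replications-by-score-points matrix of equated scores and a criterion equating:

    import numpy as np

    def bias_rmse(equated, criterion):
        """equated: (n_replications, n_score_points) equated scores;
        criterion: (n_score_points,) criterion ('true') equated scores."""
        err = np.asarray(equated, dtype=float) - np.asarray(criterion, dtype=float)
        bias_by_score = err.mean(axis=0)                  # bias at each score point
        rmse_by_score = np.sqrt((err ** 2).mean(axis=0))  # RMSE at each score point
        return {
            "bias_by_score": bias_by_score,
            "rmse_by_score": rmse_by_score,
            "avg_bias": bias_by_score.mean(),             # averaged over score points
            "avg_rmse": rmse_by_score.mean(),
        }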

Results: 60% Mid

Results: 40% Mid

In the extreme…

Across the Score Scale: Average Bias

Across the Score Scale: Average RMSE

Across the Score Scale: Misclassification Rates

Classification Consistency: Proficient

Discussion
–Different equating results based on sampling conditions
–Differences more exaggerated when using common item sets composed mostly of CR items
–Mid 60 condition most similar to the full data; small differences across common item selections

Limitations and Implications
Limitations
–Sampling conditions
–Common item selections
–Only one equating method
Implications for future research
–Sampling conditions, common item selections, additional equating methods
–Other content areas and grade levels
–Other testing programs
–Simulation studies

Thanks! Rob Keller Mike, Louis, Won, Candy, and Jessalyn