Test Equating Zhang Zhonghua Chinese University of Hong Kong.

Similar presentations
Implications and Extensions of Rasch Measurement.
Hong Jiao, George Macredy, Junhui Liu, & Youngmi Cho (2012)
AP STATISTICS LESSON 12 – 2 ( DAY 2 )
Mark D. Reckase Michigan State University The Evaluation of Teachers and Schools Using the Educator Response Function (ERF)
DIF Analysis Galina Larina of March, 2012 University of Ostrava.
Presented at Measured Progress
Exploring the Full-Information Bifactor Model in Vertical Scaling With Construct Shift Ying Li and Robert W. Lissitz.
Statistics for the Behavioral Sciences
Overview of Main Survey Data Analysis and Scaling National Research Coordinators Meeting Madrid, February 2010.
III Choosing the Right Method Chapter 10 Assessing Via Tests p235 Paper & Pencil Work Sample Situational Judgment (SJT) Computer Adaptive chapter 10 Assessing.
Reliability n Consistent n Dependable n Replicable n Stable.
Concept of Reliability and Validity. Learning Objectives  Discuss the fundamentals of measurement  Understand the relationship between Reliability and.
Item Response Theory. Shortcomings of Classical True Score Model Sample dependence Limitation to the specific test situation. Dependence on the parallel.
Estimating Growth when Content Specifications Change: A Multidimensional IRT Approach Mark D. Reckase Tianli Li Michigan State University.
1 IRT basics: Theory and parameter estimation Wayne C. Lee, David Chuah, Patrick Wadlington, Steve Stark, & Sasha Chernyshenko.
+ A New Stopping Rule for Computerized Adaptive Testing.
Comparison of Reliability Measures under Factor Analysis and Item Response Theory —Ying Cheng , Ke-Hai Yuan , and Cheng Liu Presented by Zhu Jinxin.
© UCLES 2013 Assessing the Fit of IRT Models in Language Testing Muhammad Naveed Khalid Ardeshir Geranpayeh.
Validity and Reliability EAF 410 July 9, Validity b Degree to which evidence supports inferences made b Appropriate b Meaningful b Useful.
© Copyright 2000, Julia Hartman 1 An Interactive Tutorial for SPSS 10.0 for Windows © Analysis of Covariance (GLM Approach) by Julia Hartman Next.
Identification of Misfit Item Using IRT Models Dr Muhammad Naveed Khalid.
Item Response Theory. What’s wrong with the old approach? Classical test theory –Sample dependent –Parallel test form issue Comparing examinee scores.
Ch 10 Comparing Two Proportions Target Goal: I can determine the significance of a two sample proportion. 10.1b h.w: pg 623: 15, 17, 21, 23.
ArrayCluster: an analytic tool for clustering, data visualization and module finding on gene expression profiles. Group members: 李祥豪, 謝紹陽, 江建霖
SAS PROC IRT July 20, 2015 RCMAR/EXPORT Methods Seminar 3-4pm Acknowledgements: - Karen L. Spritzer - NCI (1U2-CCA )
1 The Development of A Computer Assisted Design, Analysis and Testing System for Analysing Students’ Performance Q. He & P. Tymms, CEM CENTRE, UNIVERSITY.
Tests and Measurements Intersession 2006.
Calibration of Response Data Using MIRT Models with Simple and Mixed Structures Jinming Zhang Jinming Zhang University of Illinois at Urbana-Champaign.
1 An Investigation of The Response Time for Maths Items in A Computer Adaptive Test C. Wheadon & Q. He, CEM CENTRE, DURHAM UNIVERSITY, UK Chris Wheadon.
Measurement Models: Exploratory and Confirmatory Factor Analysis James G. Anderson, Ph.D. Purdue University.
LECTURE 1 - SCOPE, OBJECTIVES AND METHODS OF DISCIPLINE "ECONOMETRICS"
Multiple Perspectives on CAT for K-12 Assessments: Possibilities and Realities Alan Nicewander Pacific Metrics 1.
A COMPARISON METHOD OF EQUATING CLASSIC AND ITEM RESPONSE THEORY (IRT): A CASE OF IRANIAN STUDY IN THE UNIVERSITY ENTRANCE EXAM Ali Moghadamzadeh, Keyvan.
Pearson Copyright 2010 Some Perspectives on CAT for K-12 Assessments Denny Way, Ph.D. Presented at the 2010 National Conference on Student Assessment June.
University of Ostrava Czech republic 26-31, March, 2012.
Optimal Delivery of Items in a Computer Assisted Pilot Francis Smart Mark Reckase Michigan State University.
Generalized Mixed-effects Models for Monitoring Cut-scores for Differences Between Raters, Procedures, and Time Yeow Meng Thum Hye Sook Shin UCLA Graduate.
NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL.
- We have samples for each of two conditions. We provide an answer for “Are the two sample means significantly different from each other, or could both.
Item Response Theory in Health Measurement
Week 7 Chapter 6 - Introduction to Inferential Statistics: Sampling and the Sampling Distribution & Chapter 7 – Estimation Procedures.
The Design of Statistical Specifications for a Test Mark D. Reckase Michigan State University.
Ming Lei American Institutes for Research Okan Bulut Center for Research in Applied Measurement and Evaluation University of Alberta Item Parameter and.
PARCC Field Test Study Comparability of High School Mathematics End-of- Course Assessments National Conference on Student Assessment San Diego June 2015.
Overview of Item Response Theory Ron D. Hays November 14, 2012 (8:10-8:30am) Geriatrics Society of America (GSA) Pre-Conference Workshop on Patient- Reported.
Slides to accompany Weathington, Cunningham & Pittenger (2010), Chapter 10: Correlational Research 1.
Gary W. Phillips Vice President & Institute Fellow American Institutes for Research Next Generation Achievement Standard Setting Symposium CCSSO NCSA New.
5. Evaluation of measuring tools: reliability Psychometrics. 2011/12. Group A (English)
Instrument Development and Psychometric Evaluation: Scientific Standards May 2012 Dynamic Tools to Measure Health Outcomes from the Patient Perspective.
Daniel Muijs Saad Chahine
Lecture 5 Validity and Reliability
Assessment Research Centre Online Testing System (ARCOTS)
Classical Test Theory Margaret Wu.
Booklet Design and Equating
CHAPTER 10 Comparing Two Populations or Groups
III Choosing the Right Method Chapter 10 Assessing Via Tests
Sampling Distribution
Evaluating and Institutionalizing OD Interventions
Aligned to Common Core State Standards
By ____________________
International Comparison Program 2011 Round
Five things you probably don’t know from PISA….
Investigating item difficulty change by item positions under the Rasch model Luc Le & Van Nguyen 17th International meeting of the Psychometric Society,
Margaret Wu University of Melbourne
United Nations Statistics Division
Presentation transcript:

Test Equating Zhang Zhonghua Chinese University of Hong Kong

Question? Two standardized tests, A and B, measure the same trait. They were administered separately to two groups of students: Group 1 took only Test A, and Group 2 took only Test B. The mean score on Test A for Group 1 is 84, and the mean score on Test B for Group 2 is 80. A t-test indicated a statistically significant difference between the two group means (p < 0.05). Should we then conclude that the Group 1 students were better than the Group 2 students on the trait that the two tests measure?

Why Equate? To compare test scores from different forms (strictly speaking, parallel forms) that measure the same latent trait; to construct an item bank/pool; to support Computerized Adaptive Testing (CAT).

What's Equating? "Equating is a statistical process that is used to adjust scores on test forms so that scores on the forms can be used interchangeably. Equating adjusts for differences in difficulty among forms that are built to be similar in difficulty and content" (Kolen & Brennan, 2004). Requirements for the two alternate test forms: same content and statistical specifications, equity, symmetry, and group invariance.

Lord's Equity Property: examinees with a given true score would have identical means, standard deviations, and distributional shapes for converted scores on Form X and for scores on Form Y. First-Order Equity Property: examinees with a given true score have the same mean converted score on Form X as they have on Form Y.

[Diagram: raw scores on Form X1, Form X2, and so on are converted to the raw-score scale of base Form Y, forming a chain of equated forms.]

Equating Design Single Group Random Groups Single Group with Counterbalance Anchored/Common-item Nonequivalent Group Preequating

Single Group
Sample    Form X    Form Y
G1        √         √

Single Group with Counterbalancing
Sample    Time 1    Time 2
G1        Form X    Form Y
G2        Form Y    Form X

Random Groups
Sample    Form X    Form Y
G1        √
G2                  √

Common-item Nonequivalent Groups
Sample    Form X    Form Y    Common Items V
G1        √                   √
G2                  √         √

Preequating: an item bank with precalibrated IRT parameters; a test form combines items from the bank (operational items) with new items (non-operational items).

Equating Methods: based on Classical Test Theory (CTT), and based on Item Response Theory (IRT)

Downloadable Equating Procedures Equating/Linking Programs atingLinkingPrograms.htm IRT Scale Transformation Programs Programs.htm

Equating Methods Based on CTT: Mean Equating, Linear Equating, Equipercentile Equating

CTT-Mean Equating In mean equating, Form X is assumed to differ in difficulty from Form Y by a constant equal to the difference between the two forms' mean scores. Example: M_X = 70, M_Y = 75. Let Form X be the base form and Form Y the target form. A score of 80 on Form Y equates to 80 - (75 - 70) = 75 on the scale of Form X.
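In symbols (a minimal restatement of the rule above, where m_X(y) denotes the Form Y score y expressed on the Form X scale):

m_X(y) = y - (\mu_Y - \mu_X), \qquad \text{so } m_X(80) = 80 - (75 - 70) = 75.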

CTT-Linear Equating In Linear Equating, scores that are an equal distance from their means in standard deviation units are set equal.
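As a formula (the standard linear conversion, stated here with Form X as the base form so that it matches the mean equating example; l_X(y) is the Form Y score y placed on the Form X scale):

\frac{l_X(y) - \mu_X}{\sigma_X} = \frac{y - \mu_Y}{\sigma_Y}
\quad\Longrightarrow\quad
l_X(y) = \frac{\sigma_X}{\sigma_Y}\,(y - \mu_Y) + \mu_X .

When the two standard deviations are equal, this reduces to mean equating.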

CTT-Equipercentile Equating For a given Form X score, find the percentage of examinees earning scores at or below that Form X score. Find the Form Y score that has the same percentage of examinees at or below it. The Form X and Form Y scores are then considered equivalent. Example: 70% of the examinees got a score of 75 or below on Form X, and 70% got a score of 80 or below on Form Y. A Form X score of 75 would therefore be considered to represent the same level of achievement as a Form Y score of 80.
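A minimal sketch of this percentile-matching step in Python (assuming numpy; form_x_scores and form_y_scores are hypothetical arrays of observed scores; real equipercentile equating would add smoothing and continuization, which are omitted here):

import numpy as np

def equipercentile_y_to_x(y_score, form_x_scores, form_y_scores):
    """Return the Form X score with roughly the same percentile rank
    as y_score has on Form Y."""
    # Percentage of examinees at or below y_score on Form Y
    p = np.mean(form_y_scores <= y_score)
    # Form X score that has the same percentage at or below it
    return np.quantile(form_x_scores, p)

# Example: if 70% score 80 or below on Form Y and 70% score 75 or below
# on Form X, a Form Y score of 80 maps to a Form X score of about 75.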

Equating Methods Based on IRT: IRT Parameter Equating, IRT Observed Score and IRT True Score Equating

Item Response Theory Take the IRT three-parameter model as an example. Item parameters: item discrimination (a), item difficulty (b), and guessing (c).
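In its usual form, the three-parameter logistic model gives the probability of a correct response to item i as (D is the scaling constant, commonly 1.7 or 1):

P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + \exp\!\big[-D a_i (\theta - b_i)\big]}

where a_i is the discrimination, b_i the difficulty, and c_i the guessing parameter.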

[Figures: item characteristic curves for Item 1 and Item 2, plotting the probability of a correct response against the scale score; the positions of the curves mark the two items' difficulties.]

Item Parameter Equating: Linking Separate Calibration (Mean/Mean Method, Mean/Sigma Method, Stocking-Lord Method, Haebara Method), Concurrent Calibration, Fixed Common Item Parameter (FCIP) Method

IRT-Linking Separate Calibration
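In outline (the standard linking relations; see Kolen & Brennan, 2004): the two separately estimated scales are linearly related, so if theta_X is the target-form scale, theta_Y the base-form scale, and A and B the linking constants, then

\theta_Y = A\,\theta_X + B, \qquad a_Y = \frac{a_X}{A}, \qquad b_Y = A\,b_X + B, \qquad c_Y = c_X .

The methods listed above differ only in how A and B are estimated from the common items.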

IRT-Moment Methods Mean/Mean Method Mean/Sigma Method
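A sketch of the usual formulas, with the means and standard deviations taken over the common-item parameter estimates only:

\text{Mean/Sigma:}\quad A = \frac{\sigma(\hat b_Y)}{\sigma(\hat b_X)}, \qquad B = \mu(\hat b_Y) - A\,\mu(\hat b_X)

\text{Mean/Mean:}\quad A = \frac{\mu(\hat a_X)}{\mu(\hat a_Y)}, \qquad B = \mu(\hat b_Y) - A\,\mu(\hat b_X)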

IRT-Characteristic Curve Methods: the Stocking-Lord method and the Haebara method (loss functions sketched below)
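Both methods choose A and B to minimize a loss function evaluated at a set of ability points theta_j over the common items V; roughly,

\text{Haebara:}\quad \sum_j \sum_{i \in V} \Big[ P_i\big(\theta_j;\hat a_{Yi},\hat b_{Yi},\hat c_{Yi}\big) - P_i\big(\theta_j;\hat a_{Xi}/A,\; A\hat b_{Xi}+B,\; \hat c_{Xi}\big) \Big]^2

\text{Stocking-Lord:}\quad \sum_j \Big[ \sum_{i \in V} P_i\big(\theta_j;\hat a_{Yi},\hat b_{Yi},\hat c_{Yi}\big) - \sum_{i \in V} P_i\big(\theta_j;\hat a_{Xi}/A,\; A\hat b_{Xi}+B,\; \hat c_{Xi}\big) \Big]^2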

Example Take Form Y as the base test and Form X as the target test. Item 1 on Form X: item difficulty 1.0, item discrimination 1.896, guessing 0.18. The equated parameters for Item 1 on the scale of base Form Y are obtained by applying the A and B constants estimated under each method (Stocking-Lord, Haebara, Mean/Mean, Mean/Sigma); the constants, and hence the equated parameters, differ somewhat by method.
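A minimal sketch of that last step in Python; the constants A = 1.1 and B = -0.2 below are illustrative placeholders, not values from the example, and the relations follow the linking equations given earlier:

def transform_item(a_x, b_x, c_x, A, B):
    """Place a 3PL item calibrated on the target Form X scale onto the base Form Y scale."""
    a_y = a_x / A        # discrimination is divided by A
    b_y = A * b_x + B    # difficulty is rescaled and shifted
    c_y = c_x            # the guessing parameter does not change
    return a_y, b_y, c_y

# Item 1 on Form X (a = 1.896, b = 1.0, c = 0.18) with illustrative constants:
print(transform_item(1.896, 1.0, 0.18, A=1.1, B=-0.2))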

IRT-Concurrent Calibration The concurrent calibration method estimates item and ability parameters for both forms simultaneously in a single computer run. Items not taken by one group of examinees are treated as not reached or missing, and the item parameters for all items on the two test forms are estimated together. Because everything is estimated in one run, the item parameters for all items from the two forms are placed on the same scale (Kim & Hanson, 2002; Kim & Cohen, 1998). Example:

Concurrent Calibration for Replication 16

>COMMENTS
Horizontal Equating
Concurrent Calibration for Replication 16
>GLOBAL NPARM=3,DFNAME='D:\RESEARCH\REP16\CONH-16\CONH-16.DAT',SAVE;
>SAVE PARM='D:\RESEARCH\REP16\CONH-16\CONH-16.PAR';
>LENGTH NITEMS=140;
>INPUT NTOTAL=80,SAMPLE=2000,NALT=4,NIDCH=4,FORMS=2;
(4X,4A1,6X,I1,1X,80A1)
>FORM1 LENGTH=80,ITEMS=(1(1)80);
>FORM2 LENGTH=80,ITEMS=(1(1)20,81(1)140);
>TEST ITEMS=(1(1)140),
      LINK=(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
            0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0);
>CALIB CYCLES=20;
>SCORE;

IRT-Fixed Common-Item Parameters This procedure combines features of concurrent calibration and linking separate calibrations. The item parameters for the two test forms are estimated separately; what differs from linking separate calibrations is that, in the target-form calibration, the common-item parameters are fixed at the values estimated from the base form. Example:

Fixed Common Item Parameters for Replication 16

>COMMENTS
FCIP for Replication 16
Target Test Form B with N(0,1)
>GLOBAL NPARM=3,DFNAME='D:\RESEARCH\REP16\FIXV-16\B11-16.DAT',SAVE;
>SAVE PARM='D:\RESEARCH\REP16\FIXV-16\FIXV-16.PAR';
>LENGTH NITEMS=(80);
>INPUT NTOTAL=80,SAMPLE=1000,NALT=4,NIDCH=4;
(4A1,1X,80A1)
>TEST ITEMS=(1(1)80);
>CALIB TPRIOR,SPRIOR,GPRIOR,READPRI,CYCLES=20;
>PRIORS TMU=(-0.639,1.041,1.701,0.482,-1.144,-0.023,0.616,1.133,0.668,0.577,
             ,0.029,0.904,0.232,1.602,1.642,0.537,-0.228,1.439,0.517,0.0(0)60),
        TSIGMA=(0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,
                0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,2.0(0)60),
        SMU=(-0.688,0.011,-0.810,0.614,-0.811,-0.445,-0.142,-0.387,0.292,-0.449,
             0.040,-0.522,0.080,0.660,0.301,0.408,-0.689,-0.079,0.294,-0.174,0.0(0)60),
        SSIGMA=(0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,
                0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.5(0)60),
        ALPHA=( , , , , , , , , , , , , , , , , , , , ,6(0)60),
        BETA=( , , , , , , , , , , , , , , , , , , , ,16(0)60);
>SCORE;

Comparison of Different Equating Methods No consensus has been reached. Methods based on CTT can be used to equate tests. Methods based on IRT are essential for constructing an item bank/pool. Among the IRT-based methods, some studies have indicated that concurrent calibration can produce more accurate equating results than linking separate calibrations or the FCIP method.

Thank You Very Much!