University of Ostrava, Czech Republic, 26-31 March 2012.

Similar presentations
Implications and Extensions of Rasch Measurement.

Test Development.
DIF Analysis Galina Larina of March, 2012 University of Ostrava.
Item Response Theory in a Multi-level Framework Saralyn Miller Meg Oliphint EDU 7309.
Latent Growth Modeling Chongming Yang Research Support Center FHSS College.
M₁ and M₂ – masses of the two objects [kg]; G – universal gravitational constant, G = 6.67×10⁻¹¹ N·m²/kg² or G = 3.439×10⁻⁸ ft⁴/(lb·s⁴); r – distance.
VALIDITY AND RELIABILITY
Part II Sigma Freud & Descriptive Statistics
Part II Sigma Freud & Descriptive Statistics
MGT-491 QUANTITATIVE ANALYSIS AND RESEARCH FOR MANAGEMENT
Item Response Theory in Health Measurement
IRT Equating Kolen & Brennan, IRT If data used fit the assumptions of the IRT model and good parameter estimates are obtained, we can estimate person.
AN OVERVIEW OF THE FAMILY OF RASCH MODELS Elena Kardanova
Models for Measuring. What do the models have in common? They are all cases of a general model. How are people responding? What are your intentions in.
Copyright (c) 2004 Brooks/Cole, a division of Thomson Learning, Inc. Chapter 13 Nonlinear and Multiple Regression.
CH. 9 MEASUREMENT: SCALING, RELIABILITY, VALIDITY
MEASUREMENT. Measurement “If you can’t measure it, you can’t manage it.” Bob Donath, Consultant.
VERTICAL SCALING H. Jane Rogers Neag School of Education University of Connecticut Presentation to the TNE Assessment Committee, October 30, 2006.
When Measurement Models and Factor Models Conflict: Maximizing Internal Consistency James M. Graham, Ph.D. Western Washington University ABSTRACT: The.
By Dr. Mohammad H. Omar Department of Mathematical Sciences May 16, 2006 Presented at Statistic Research (STAR) colloquium, King Fahd University of Petroleum.
Item Response Theory. Shortcomings of Classical True Score Model Sample dependence Limitation to the specific test situation. Dependence on the parallel.
Network Theorems SUPERPOSITION THEOREM THÉVENIN’S THEOREM
Copyright © 2010, 2007, 2004 Pearson Education, Inc. Lecture Slides Elementary Statistics Eleventh Edition and the Triola Statistics Series by.
Chapter 9 Flashcards. measurement method that uses uniform procedures to collect, score, interpret, and report numerical results; usually has norms and.
Classical Test Theory By ____________________. What is CTT?
Radial Basis Function Networks
Item Analysis: Classical and Beyond SCROLLA Symposium Measurement Theory and Item Analysis Modified for EPE/EDP 711 by Kelly Bradley on January 8, 2013.
EPSY 8223: Test Score Equating
Item Response Theory. What’s wrong with the old approach? Classical test theory –Sample dependent –Parallel test form issue Comparing examinee scores.
Systems and Matrices (Chapter5)
The Accuracy of Small Sample Equating: An Investigative/ Comparative study of small sample Equating Methods.  Kinge Mbella Liz Burton Rob Keller Nambury.
Linking & Equating Psych 818 DeShon. Why needed? ● Large-scale testing programs often require multiple forms to maintain test security over time or to.
Modern Test Theory Item Response Theory (IRT). Limitations of classical test theory An examinee’s ability is defined in terms of a particular test The.
Measuring Mathematical Knowledge for Teaching: Measurement and Modeling Issues in Constructing and Using Teacher Assessments DeAnn Huinker, Daniel A. Sass,
Review and Validation of ISAT Performance Levels for 2006 and Beyond MetriTech, Inc. Champaign, IL MetriTech, Inc. Champaign, IL.
Copyright © 2010 Pearson Education, Inc Chapter Seventeen Correlation and Regression.
WEEK 8 SYSTEMS OF EQUATIONS DETERMINANTS AND CRAMER’S RULE.
Various topics Petter Mostad Overview Epidemiology Study types / data types Econometrics Time series data More about sampling –Estimation.
6. Evaluation of measuring tools: validity Psychometrics. 2012/13. Group A (English)
An Efficient Sequential Design for Sensitivity Experiments Yubin Tian School of Science, Beijing Institute of Technology.
Calibrated imputation of numerical data under linear edit restrictions Jeroen Pannekoek Natalie Shlomo Ton de Waal.
Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Section 7-1 Review and Preview.
A COMPARISON METHOD OF EQUATING CLASSIC AND ITEM RESPONSE THEORY (IRT): A CASE OF IRANIAN STUDY IN THE UNIVERSITY ENTRANCE EXAM Ali Moghadamzadeh, Keyvan.
Scaling and Equating Joe Willhoft Assistant Superintendent of Assessment and Student Information Yoonsun Lee Director of Assessment and Psychometrics Office.
Estimation. The Model Probability The Model for N Items — 1 The vector probability takes this form if we assume independence.
NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL.
Item Response Theory in Health Measurement
FIT ANALYSIS IN RASCH MODEL University of Ostrava Czech republic 26-31, March, 2012.
The Design of Statistical Specifications for a Test Mark D. Reckase Michigan State University.
2. Main Test Theories: The Classical Test Theory (CTT) Psychometrics. 2011/12. Group A (English)
Ming Lei American Institutes for Research Okan Bulut Center for Research in Applied Measurement and Evaluation University of Alberta Item Parameter and.
Onlinedeeneislam.blogspot.com1 Design and Analysis of Algorithms Slide # 1 Download From
LECTURE 14 NORMS, SCORES, AND EQUATING EPSY 625. NORMS Norm: sample of population Intent: representative of population Reality: hope to mirror population.
LISA A. KELLER UNIVERSITY OF MASSACHUSETTS AMHERST Statistical Issues in Growth Modeling.
Lesson 5.1 Evaluation of the measurement instrument: reliability I.
CORRELATION-REGRESSION ANALYSIS Tomsk Polytechnic University.
Classical Test Theory Psych DeShon. Big Picture To make good decisions, you must know how much error is in the data upon which the decisions are.
IRT Equating Kolen & Brennan, 2004 & 2014 EPSY
Nonequivalent Groups: Linear Methods Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices (2 nd ed.). New.
Jean-Guy Blais Université de Montréal
Chapter 7. Classification and Prediction
Vertical Scaling in Value-Added Models for Student Learning
Assessment Research Centre Online Testing System (ARCOTS)
Classical Test Theory Margaret Wu.
SIMPLE LINEAR REGRESSION MODEL
Booklet Design and Equating
Where did we stop? The Bayes decision rule guarantees an optimal classification… … But it requires the knowledge of P(ci|x) (or p(x|ci) and P(ci)) We.
By ____________________
Investigating item difficulty change by item positions under the Rasch model Luc Le & Van Nguyen 17th International meeting of the Psychometric Society,
Presentation transcript:

University of Ostrava, Czech Republic, 26-31 March 2012

 Different forms of a test
 Item banking
 Achievement monitoring

Classical Test Theory:
 Applied only to equating different test forms
 The problem is often ignored (the conception of parallel test forms)
 Establishes equivalent scores on different test forms
 Does not create a common scale

Item Response Theory:
 Satisfies all equating needs
 Places all estimates of item and examinee parameters onto a common scale

 Equating is a special procedure that establishes the relationship between examinee scores on different test forms and places them onto the same scale.
 As a result, a measure based on responses to one test can be matched to a measure based on responses to another test, and the conclusions drawn about an examinee are identical regardless of the test form that produced the measure.
 Equating of different test forms is called horizontal equating.

 The purpose: comparison of student achievement at different grade levels
 Test forms are designed to be of different difficulty
 Measures from the different tests should be placed on the same linear continuum
 This test equating procedure is called vertical equating.

Item bank – a set of items from which test forms that create equivalent measures may be constructed. An item bank is composed of a set of test items that have been placed onto a common scale, so that different subsets of these items produce interchangeable measures for an examinee. Once an item bank exists, no further equating is needed.

 Both are designed to place estimated parameters onto a common scale
 In test equating, the goal is to place person measures from the multiple test forms onto the same scale
 In item banking, the goal is to place item calibrations on the same scale
 The procedures are nearly identical when Rasch measurement is used

 Equating – a procedure that ensures the examinee measures obtained from different subsets of items are interchangeable. When two tests are equated, the resulting measures are placed onto the same scale.
 Scaling – a procedure that associates numbers with the performance of examinees. Tests can be scaled identically and yet not be equated.

 Applies only to comparing examinee test scores on two different test forms
 The problem can be ignored (by introducing “parallel” test forms)
 Implies only establishing a relation between test scores on different test forms
 Does not imply creation of a common scale

 Linear equating
 Equipercentile equating

Linear equating is based on setting the standard score on test X equal to the standard score on test Y:

$$\frac{x - \mu_X}{\sigma_X} = \frac{y - \mu_Y}{\sigma_Y}.$$

Thus $y = Ax + B$, where $A = \sigma_Y / \sigma_X$ and $B = \mu_Y - A\mu_X$.
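To make the arithmetic concrete, here is a minimal Python sketch of linear equating; the score arrays are hypothetical, invented purely for illustration:

```python
import numpy as np

# Hypothetical raw scores of a group on test X and on test Y
x_scores = np.array([12, 15, 18, 20, 22, 25, 27, 30])
y_scores = np.array([10, 14, 16, 19, 21, 24, 28, 31])

# y = A*x + B, with A and B chosen so that standard scores match
A = y_scores.std() / x_scores.std()
B = y_scores.mean() - A * x_scores.mean()

def to_y_scale(x):
    """Transform a test X score to the test Y scale."""
    return A * x + B

print(to_y_scale(20))  # test Y equivalent of an X score of 20
```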

 Scores on tests X and Y are considered to be equivalent if their respective percentile ranks in any given group are equal.
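A rough Python sketch of the same idea, ignoring the smoothing and interpolation details used in operational equipercentile equating (score arrays again hypothetical):

```python
import numpy as np

def equate_equipercentile(x, x_scores, y_scores):
    """Map a test X score to the test Y score with the same percentile rank."""
    p = np.mean(x_scores <= x)        # percentile rank of x in the X group
    return np.quantile(y_scores, p)   # test Y score at that percentile

x_scores = np.array([12, 15, 18, 20, 22, 25, 27, 30])
y_scores = np.array([10, 14, 16, 19, 21, 24, 28, 31])
print(equate_equipercentile(20, x_scores, y_scores))
```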

 Both methods require assumptions about the equality of the test score distributions and the equivalence of the examinee groups
 Equating in CTT does not imply creation of a common scale

 Measuring the same trait – tests of different content cannot be equated (but can be scaled in a similar manner)
 Invariance of equating results across samples of examinees
 Independence of equating results from which test form is used as the reference

Method of common items: linkage between two test forms is accomplished by means of a set of items that are common to both test forms.
Method of common persons: linkage between two test forms is accomplished by means of a set of persons who respond to both test forms.
Combined methods: linkage between two test forms is accomplished by means of common items and/or common persons, plus common raters.

Internal anchor: each test form contains one set of items that is shared with the other forms and another set of items that is unique to that form

External anchor: each test form is administered together with an additional set of items that belongs to neither test form

 All examinees respond to both test forms.
 There are two approaches to this design:
- same group / same time
- same group / different time

Linkage between two test forms is accomplished by means of a set of examinees who respond to all items.

 Selecting an equating method
 Parameter estimation
 Transformation of parameters from the different test forms to the same scale
 Evaluating the quality of the links between test forms

 Simultaneous calibration: all parameters are estimated simultaneously in one run of the estimation software. The data are automatically placed on the same scale.
 Separate calibration: parameters are estimated for each test form separately; that is, the data are calibrated in multiple runs of the estimation software.
 Separate calibration may be more difficult to carry out because the test developer must transform the measures to a common scale

 Separate calibration of all test forms, with transformation of the measures to the common scale
 Simultaneous calibration of all test forms, placing all measures on the common scale
 Separate calibration of all test forms with anchoring of the difficulty values of the common items, and consecutive placement of all parameters on the common scale

 As a rule this procedure is used with the method of common items; the common items are called nodal items in this case
 Each test form is calibrated separately, so the estimates for each form lie on their own scale. The only difference between the scales is the difference between their origins
 This difference can be removed by calculating a location shift
 It is desirable to have no less than …% nodal items (some of them can be deleted from the link later)

 Choice of a common scale
 Selection of nodal items
 Calibration of all test forms
 Calculation of the equating constants
 Link quality evaluation
 Transformation of all parameters onto the common scale

The shift constant is the mean difference between the difficulty estimates of the common items:

$$t_{12} = \frac{1}{l}\sum_{i=1}^{l}\left(\delta_{i2} - \delta_{i1}\right),$$

where $t_{12}$ is the shift constant from test form 1 to test form 2; $\delta_{i1}$ is the difficulty estimate of item $i$ in test form 1; $\delta_{i2}$ is the difficulty estimate of item $i$ in test form 2; and $l$ is the number of common items.
 Sometimes other formulas are applied – weighted mean, dispersion shift, etc.
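A minimal Python sketch of the mean-shift computation; the six difficulty estimates are hypothetical:

```python
import numpy as np

# Hypothetical difficulty estimates of the common (nodal) items
delta_1 = np.array([-0.8, -0.2, 0.1, 0.5, 0.9, 1.2])   # separate calibration of form 1
delta_2 = np.array([-1.1, -0.5, -0.2, 0.2, 0.6, 0.9])  # separate calibration of form 2

# Shift constant from the form 1 scale to the form 2 scale
t12 = np.mean(delta_2 - delta_1)

# Item difficulties (and, in the same way, person measures) of form 1
# placed on the scale of form 2
delta_1_shifted = delta_1 + t12
```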

$\delta_{i1}' = \delta_{i1} + t_{12}$, where $\delta_{i1}$ is the difficulty estimate for item $i$ in test form 1; $\delta_{i1}'$ is the difficulty estimate for the same item on the scale of test form 2; $i = 1, \dots, k$, where $k$ is the total number of test items.
$\theta_{n1}' = \theta_{n1} + t_{12}$, where $\theta_{n1}$ is the ability estimate for examinee $n$ who responded to the items of test form 1; $\theta_{n1}'$ is the ability estimate for the same examinee on the scale of test form 2; $n = 1, \dots, N$, where $N$ is the total number of examinees who responded to the items of test form 1.
 Parameter estimates of test form 1 shifted in this way are placed on the scale of test form 2.

 Item-within-link: fit analysis of the linking items
 Item-between-link: stability of the item calibrations between the two test forms

The stability of a common item's calibration between forms is evaluated with the standardized difference

$$U_i = \frac{\delta_{i1}' - \delta_{i2}}{\sigma_{i12}}, \qquad \sigma_{i12}^2 = \sigma_{i1}^2 + \sigma_{i2}^2,$$

where $\sigma_{i1}$, $\sigma_{i2}$ are the standard errors of measurement for item $i$ under the calibrations of test forms 1 and 2; $\delta_{i1}'$ is the difficulty estimate for item $i$ from test form 1 placed on the scale of test form 2; $\delta_{i2}$ is the difficulty estimate for the same item in test form 2; and $U_i \sim N(0,1)$.
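A small self-contained Python helper implementing this statistic; all input values are hypothetical:

```python
import numpy as np

def link_fit(delta_1_shifted, delta_2, se_1, se_2):
    """Standardized difference U_i for each common item; approximately
    N(0,1) when an item's calibration is stable across the two forms."""
    se_12 = np.sqrt(se_1**2 + se_2**2)
    return (delta_1_shifted - delta_2) / se_12

# Hypothetical values for six common items
delta_1_shifted = np.array([-1.10, -0.50, -0.20, 0.20, 0.60, 0.90])
delta_2 = np.array([-1.05, -0.55, -0.15, 0.25, 0.52, 0.95])
se_1 = np.full(6, 0.10)
se_2 = np.full(6, 0.11)

print(link_fit(delta_1_shifted, delta_2, se_1, se_2))
```

Values far outside roughly ±2 would flag items whose calibrations are not stable enough to keep in the link.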

 All parameters of all test forms are estimated simultaneously
 This is the simplest approach to equating test forms or calibrating an item bank because it requires no subsequent transformation of the estimated measures or calibrations. The data are automatically placed on the same scale in one run of the estimation software

 As a rule this procedure is used with the method of common items; the common items are called anchor items in this case
 The common items are estimated once, during calibration of the first test form
 During calibration of another test form, the calibration values of these items are treated as fixed (known) and are not estimated. As a result, the remaining parameter estimates are forced onto the same scale as the anchor items
 It is easy to anchor items in most estimation software

The numbers of the anchor items and their difficulties are specified in the item anchor file (IAFILE). These difficulty values will be fixed and will not be estimated during calibration of the new test form. For example (entry numbers and logit values are hypothetical):

IAFILE=*
; anchor item entry number, fixed difficulty in logits
4 -0.35
6 0.12
*

 Choice of a common scale
 Selection of anchor items
 Calibration of the test form whose scale is accepted as the common scale
 Sequential calibration of the other test forms, fixing the difficulty values of the anchor items
 Item-within-link fit (fit analysis of the linking items)

 If we use different equating procedures, the obtained scales will differ and cannot be directly compared. This is due to the different ways the origin of the scale is selected in the different procedures.
 There are papers (for example, Smith R.M., Applications of Rasch Measurement, Chicago: MESA Press) in which all three procedures are analyzed. The precision of the estimated examinee and item parameters is approximately the same, and the correlation between the measures is high.

 Each test form has 26 dichotomous items
 Both test forms have 6 common items: № 4, 6, 7, 14, 20, 24 (23% of the total number of items)
 The total number of examinees is 654 for test form 1 and 661 for test form 2
 Winsteps software was used for test calibration
 The means of the examinee measures are -1.07 and -0.72 logits for test forms 1 and 2, respectively
 The scale of the first test form was chosen as the common scale

[Table: for each of the six common items, the difficulty estimate $\delta_i$ and standard error $\sigma_i$ from the separate calibrations of test forms 1 and 2, the shifted difficulty estimate $\delta_i'$, and the fit statistic $u_i$, together with the sum and mean of each column. The resulting shift constant is $t_{12} = -0.298$.]

This implies the creation of a common response matrix for both test forms, containing 1315 examinees and 46 distinct items. The measures of all examinees and the difficulty values of all items are placed on a common scale that is centered at the mean difficulty of all 46 items
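A sketch of how such a matrix can be laid out in Python; the 0-based positions of the common items follow the item numbers given above, and the responses are simulated purely to show the structure:

```python
import numpy as np

n1, n2 = 654, 661                  # examinees who took forms 1 and 2
common = [3, 5, 6, 13, 19, 23]     # 0-based positions of items 4, 6, 7, 14, 20, 24

# 1315 x 46 matrix; NaN marks items an examinee was never administered
data = np.full((n1 + n2, 46), np.nan)
form1_items = list(range(26))                  # form 1: items 0..25
form2_items = common + list(range(26, 46))     # form 2: common + 20 unique items

rng = np.random.default_rng(0)                 # simulated 0/1 responses
data[np.ix_(range(n1), form1_items)] = rng.integers(0, 2, (n1, 26))
data[np.ix_(range(n1, n1 + n2), form2_items)] = rng.integers(0, 2, (n2, 26))
```

Calibrating this single matrix in one run places all 1315 person measures and all 46 item difficulties on one common scale.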

 Calibration of test form 1
 Calibration of test form 2 with the difficulty values of the anchor items fixed at their estimates from the first calibration, e.g. (hypothetical values):
IAFILE=*
; anchor item entry number, difficulty from the form 1 calibration
4 -0.35
6 0.12
*
 As a result, the examinee measures from both test forms will be on the scale of the first test form

 Comparison of the examinee measures from the three equating procedures revealed approximately similar results: the correlations are close to 1
 The choice of equating procedure is determined by the design of the real data and the purpose of the research