1
Booklet Design and Equating
2
Items do not fit (unidimensional) IRT models
The construct is artificially put together. For example, mathematics consists of number, algebra, measurement, geometry (space) and data. The subscale scores below illustrate how country profiles differ across these content strands:

                    Number  Algebra  Measurement  Geometry  Data
Australia              498      499          511       491   531
Korea                  586      597          577       598   569
Netherlands            539      514          549       513   560
New Zealand            481      490          500       488   526
Norway                 456      428          461
Russian Federation     505      516          507       515   484
3
Considerations in Test Design for Equating
Item position effect
4
Use rotated booklets to reduce position effect, and to cover content
Booklet  Block 1  Block 2  Block 3
1        C1       C2       C4
2        C2       C3       C5
3        C3       C4       C6
4        C4       C5       C7
5        C5       C6       C1
6        C6       C7       C2
7        C7       C1       C3
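Under this cyclic rotation (booklet i carries clusters C_i, C_{i+1} and C_{i+3}, counting modulo 7), every cluster appears once in each block position, so no cluster is always answered at the end of a booklet. The short Python sketch below checks that position-balance property under this assumed construction:

```python
# Hypothetical reconstruction: booklet i carries clusters i, i+1 and i+3 (mod 7).
k = 7
booklets = [[(i + d) % k for d in (0, 1, 3)] for i in range(k)]

# Position balance: each cluster occupies each block position exactly once,
# so no cluster is always answered late in the test (fatigue / position effect).
for position in range(3):
    clusters_at_position = sorted(b[position] for b in booklets)
    assert clusters_at_position == list(range(k))

print("Each cluster appears once in Block 1, once in Block 2 and once in Block 3.")
```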
5
Balanced Incomplete Block Design
Booklet  Block 1  Block 2  Block 3  Block 4
1        M1       M2       M4       R1
2        M2       M3       M5       R2
3        M3       M4       M6       PS1
4        M4       M5       M7       PS2
5        M5       M6       S1       M1
6        M6       M7       S2       M2
7        M7       S1       R1       M3
8        S1       S2       R2       M4
9        S2       R1       PS1      M5
10       R1       R2       PS2      M6
11       R2       PS1      M1       M7
12       PS1      PS2      M2       S1
13       PS2      M1       M3       S2
6
Requirements for BIB
Number of pairs: with k clusters there are kC2 = k(k-1)/2 pairs of clusters.
Example: for k = 13, we need to cover 78 pairs (13 x 12 / 2).
Each booklet has four blocks, so it contains 6 pairs of clusters (4 x 3 / 2).
Since there are 13 booklets, we can cover 13 x 6 = 78 pairs, so each pair of clusters can appear in exactly one booklet.
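The 13-cluster design on the previous slide can be generated with a cyclic construction. The sketch below assumes the cluster ordering M1 to M7, S1, S2, R1, R2, PS1, PS2 and the offset set {0, 1, 3, 9} (a perfect difference set modulo 13, which reproduces Booklet 1 = M1 M2 M4 R1); it then verifies the pair-coverage requirement described above:

```python
from itertools import combinations

# Cluster labels in an assumed fixed order (13 clusters, as on the slide).
clusters = ["M1", "M2", "M3", "M4", "M5", "M6", "M7",
            "S1", "S2", "R1", "R2", "PS1", "PS2"]
k = len(clusters)  # 13

# Cyclic construction: booklet i takes the clusters at offsets {0, 1, 3, 9} (mod 13).
offsets = [0, 1, 3, 9]
booklets = [[clusters[(i + d) % k] for d in offsets] for i in range(k)]

# Check the BIB requirement: 13 booklets x 6 pairs per booklet = 78 = 13C2 pairs.
pair_counts = {}
for booklet in booklets:
    for pair in combinations(sorted(booklet), 2):
        pair_counts[pair] = pair_counts.get(pair, 0) + 1

assert len(pair_counts) == k * (k - 1) // 2               # all 78 pairs covered
assert all(count == 1 for count in pair_counts.values())  # each pair exactly once

for i, booklet in enumerate(booklets, start=1):
    print(f"Booklet {i:2d}: {' '.join(booklet)}")
```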
7
Equating
8
Why is Equating needed?
When different tests are administered, the results from the tests are not directly comparable. When a person performs well on a test, we do not know whether it is because the person is very able, or because the test is easy. All we can say is that the person found a particular test easy. For example, if Group 1 took Test 1 and Group 2 took Test 2, we cannot directly compare Group 1 and Group 2 results.
9
Common Item Equating
Two tests containing common items are administered to different groups of respondents. There are a number of ways to perform equating using common items. We discuss the shift method, the anchor method and the joint calibration method.
[Diagram: Test 1 respondents answer the Test 1 items plus the common items; Test 2 respondents answer the common items plus the Test 2 items.]
10
Before carrying out equating …
Check for Item Invariance
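One simple way to check item invariance before equating is to compare the common items' difficulty estimates from the two separate calibrations after centring each set, and to flag items whose relative difficulty shifts noticeably. The estimates and the 0.5 logit cut-off below are purely illustrative:

```python
import numpy as np

# Illustrative difficulty estimates (logits) for the common items from two
# separate calibrations; these numbers are made up for the example.
delta_test1 = np.array([-1.2, -0.5, 0.1, 0.4, 1.0, 1.6])
delta_test2 = np.array([-1.0, -0.4, 0.3, 1.3, 1.2, 1.8])

# Remove the arbitrary scale origins before comparing.
d1 = delta_test1 - delta_test1.mean()
d2 = delta_test2 - delta_test2.mean()

# Flag items whose relative difficulty shifts noticeably between the tests.
threshold = 0.5  # logits; an illustrative cut-off
for i, d in enumerate(d2 - d1, start=1):
    flag = "  <-- check this item" if abs(d) > threshold else ""
    print(f"common item {i}: difference = {d:+.2f}{flag}")
```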
11
Number of common items needed for equating
We recommend a minimum of 30 link items for equating purposes, and more if possible.
If the purpose of equating is to put different grade levels' results on the same scale, be aware that the yearly increase in proficiency (i.e., between two adjacent grades of students) is of the order of 0.5 logit (about half a standard deviation of the student ability distribution for a year level).
If the purpose of equating is to monitor trends from one calendar year to another for students in the same grade, be aware that the average cohort change across time is typically very small (perhaps less than 0.05 logit, or less than one month of growth).
12
Factors influencing item difficulty
Curriculum change
Exposure to an item
Opportunity to learn
Item position in a test (fatigue effect)
13
Shift Method
Tests are calibrated separately, and the item and ability parameters for one test are placed on the scale of the other test by a constant shift. The magnitude of the shift is the amount needed to make the means of the common item parameters in the two tests equal.
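A minimal sketch of the shift method, assuming Rasch-type item difficulties (in logits) have already been estimated separately for each test; all values are illustrative:

```python
import numpy as np

# Common item difficulties (logits) from the two separate calibrations (illustrative).
common_test1 = np.array([-0.8, -0.2, 0.3, 0.9, 1.1])
common_test2 = np.array([-0.5, 0.1, 0.6, 1.2, 1.4])

# Constant shift that makes the common-item means equal (puts Test 2 on the Test 1 scale).
shift = common_test1.mean() - common_test2.mean()

# The same shift is applied to all Test 2 item difficulties and ability estimates.
test2_common_on_test1_scale = common_test2 + shift
test2_abilities = np.array([-1.0, 0.2, 0.8])        # illustrative ability estimates
test2_abilities_on_test1_scale = test2_abilities + shift

print(f"shift = {shift:.2f} logits")
```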
14
Shift and scale method – common items
A variation of the shift method is to transform both the scale and the location of the item parameters. In the case where the standard deviations of the common item parameters differ between the two tests and are taken into account in matching Test 2 to Test 1, the equating transformation is

δ* = μ₁ + (σ₁ / σ₂)(δ₂ - μ₂)

where μ₁, σ₁ and μ₂, σ₂ are the means and standard deviations of the common item parameters in Test 1 and Test 2 respectively, and δ₂ is a Test 2 parameter being placed on the Test 1 scale.
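A sketch of this shift-and-scale (mean and standard deviation matching) transformation, with illustrative common item difficulties:

```python
import numpy as np

# Common item difficulties (logits) from the two separate calibrations (illustrative).
common_test1 = np.array([-0.8, -0.2, 0.3, 0.9, 1.1])
common_test2 = np.array([-0.9, -0.1, 0.5, 1.3, 1.6])

mu1, sigma1 = common_test1.mean(), common_test1.std(ddof=1)
mu2, sigma2 = common_test2.mean(), common_test2.std(ddof=1)

def to_test1_scale(x):
    """Shift-and-scale transformation taking Test 2 parameters onto the Test 1 scale."""
    return mu1 + (sigma1 / sigma2) * (x - mu2)

# After transformation the common items have the same mean and SD in both tests.
transformed = to_test1_scale(common_test2)
print(transformed.mean(), common_test1.mean())             # equal means
print(transformed.std(ddof=1), common_test1.std(ddof=1))   # equal SDs
```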
15
Shift and scale methods – match ability distributions
IRT scaling is carried out for each test separately, so two sets of item parameters are produced. The first test is then re-scaled using the common items from the second test: the common item parameters are fixed (anchored) at their calibrated values from the second test. A transformation is then worked out from the means and standard deviations of the ability distributions obtained under the two calibrations of the first test.
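A sketch of the final step, assuming we already have two sets of ability estimates for the same Test 1 respondents: one from the original Test 1 calibration and one from the re-calibration with the common items anchored at their Test 2 values (the estimates below are simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Ability estimates for the same Test 1 respondents under the two calibrations
# (simulated here; in practice they come from the two IRT runs described above).
theta_original = rng.normal(0.0, 1.0, size=500)   # original Test 1 scale
theta_anchored = 0.4 + 1.1 * theta_original       # scale defined by the Test 2 anchors

# Linear transformation mapping the original Test 1 scale onto the Test 2 scale,
# matching the means and standard deviations of the two ability distributions.
a = theta_anchored.std(ddof=1) / theta_original.std(ddof=1)
b = theta_anchored.mean() - a * theta_original.mean()

def to_test2_scale(theta):
    return a * theta + b

print(f"theta* = {a:.2f} * theta + {b:.2f}")   # recovers 1.10 and 0.40 here
```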
16
Anchoring method
Item parameters for the second test are anchored at the values estimated for the first test. The anchoring method differs from the shift method in that every common item in the second test is fixed at its parameter value from the first test, whereas in the shift method only the means of the common item set are made equal, so individual items in the common item set may not have equal parameter values across the two calibrations. The anchoring method may be preferred when one test has a small sample size, so that its item parameters are not reliably estimated.
17
The joint calibration method (concurrent calibration)
[Diagram: responses from Test 1 and Test 2 respondents are combined into a single data set covering the Test 1 items, the common items and the Test 2 items, and all items are calibrated together in one analysis.]
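A sketch of how the data can be arranged for a concurrent calibration: both groups' scored responses go into one matrix, and the items a respondent was not given are treated as missing (not administered). The item and respondent counts are illustrative:

```python
import numpy as np

n1, n2 = 300, 250          # respondents in group 1 and group 2 (illustrative)
k1, kc, k2 = 20, 10, 20    # Test 1-only, common, Test 2-only items (illustrative)

rng = np.random.default_rng(1)

# Fake scored responses just to show the structure (1 = correct, 0 = incorrect).
group1 = rng.integers(0, 2, size=(n1, k1 + kc))   # Test 1 items + common items
group2 = rng.integers(0, 2, size=(n2, kc + k2))   # common items + Test 2 items

# One combined matrix: columns are [Test 1 items | common items | Test 2 items].
combined = np.full((n1 + n2, k1 + kc + k2), np.nan)
combined[:n1, :k1 + kc] = group1                  # group 1: Test 2-only items missing
combined[n1:, k1:] = group2                       # group 2: Test 1-only items missing

# A single IRT calibration of `combined` (missing treated as not administered)
# puts all items and all respondents on one common scale.
print(combined.shape, np.isnan(combined).mean())
```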
18
Common person equating method
[Diagram: Test 1 respondents take the Test 1 items, Test 2 respondents take the Test 2 items, and a group of respondents takes both tests, linking the two.]
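With common persons, one simple equating approach parallels the shift method: compare the ability estimates that the common respondents obtain under the two separate calibrations. A minimal sketch with illustrative values:

```python
import numpy as np

# Ability estimates (logits) for the respondents who took BOTH tests,
# from the two separate calibrations (illustrative values).
theta_on_test1 = np.array([-0.6, 0.1, 0.4, 0.9, 1.5])
theta_on_test2 = np.array([-0.2, 0.5, 0.9, 1.2, 2.0])

# Constant shift placing Test 2 results on the Test 1 scale.
shift = theta_on_test1.mean() - theta_on_test2.mean()
print(f"add {shift:.2f} logits to every Test 2 estimate")
```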
19
Horizontal and Vertical equating
The terms horizontal and vertical equating have been used to refer to equating tests aimed at the same target level of students (horizontal) and at different target levels of students (vertical). Vertical equating is more challenging because of item placement in the tests, and also because of differences in students' opportunity to learn.
20
Equating errors
The idea is to capture the variability of the estimated item parameters across the two tests. The magnitudes of equating errors are generally large in comparison with trend or growth estimates. See the example in the PISA technical reports (e.g., the PISA 2003 technical report).
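One common way to quantify the equating (link) error, in the spirit of the approach described in the PISA technical reports, is the standard error of the mean difference between the common items' difficulty estimates in the two calibrations. A sketch with illustrative differences:

```python
import numpy as np

# Differences between the common items' difficulty estimates in the two tests,
# after each set has been centred (illustrative values, in logits).
diffs = np.array([0.12, -0.05, 0.20, -0.15, 0.03, -0.10, 0.18, -0.08,
                  0.06, -0.02, 0.09, -0.14, 0.01, 0.11, -0.07])

# Treating the link items as a sample of possible link items, the equating (link)
# error is the standard error of the mean difference.
link_error = diffs.std(ddof=1) / np.sqrt(len(diffs))
print(f"link error = {link_error:.3f} logits, based on {len(diffs)} link items")
```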
21
Challenges in test equating
Real-life challenges of keeping test items invariant:
Curriculum changes
Item position effect
Fatigue
Sequencing of items
Balanced test design is essential
Check for test delivery mode effects, e.g., online versus paper