LINKING AND EQUATING OF TEST SCORES
Hariharan Swaminathan, University of Connecticut
AN EXAMPLE: CONVERTING TEMPERATURE FROM ONE SCALE TO ANOTHER
CELSIUS TO FAHRENHEIT: F = C * (9/5) + 32
THIS IS A LINEAR MAPPING OF SCORES ON ONE SCALE ONTO ANOTHER
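Expressed as code, the conversion is nothing more than a slope and an intercept (a minimal Python sketch of the formula above):

```python
def celsius_to_fahrenheit(c):
    """Linear mapping from the Celsius scale to the Fahrenheit scale."""
    return c * (9 / 5) + 32

print(celsius_to_fahrenheit(100))  # 212.0
print(celsius_to_fahrenheit(0))    # 32.0
```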
AN EXAMPLE OF EQUIVALENCE OF MEASUREMENTS: MEASURING BLOOD SUGAR
Equivalence of lab measurement and home measurement. How do we do this?
1. Measure blood sugar in the lab
2. Measure blood sugar using the home instrument
3. Plot one value against the other
AN EXAMPLE OF EQUIVALENCE OF MEASUREMENTS
LAB = 1.021*HOME + 4.554
AN EXAMPLE OF EQUIVALENCE OF MEASUREMENTS: COST OF LIVING
COST OF GOODS IN 2015 COMPARED WITH COST IN 1975. How do we do this?
1. Record costs of comparable goods (market basket)
2. Find the cost ratio (the economist's procedure), or
3. Plot one value against the other
AN EXAMPLE OF EQUIVALENCE OF MEASUREMENTS
COST_NOW = 7.35*COST_70 + 7.175
EQUATING
The above are examples of relating two measurement scales to each other. In the first example, one scale bears a simple relation to the other: we are just changing the units of measurement. In the second, we are using two different instruments to measure the same entity, and using prediction to establish equivalence. In the third, we are trying to relate two measurements under the assumption that the objects we measure remain the same. We are "equating" in all these cases.
THE CONTEXT OF EQUATING
Multiple forms of a test are necessary in many testing situations to ensure the validity of scores. Examples:
- Many large-scale testing programs have several administration dates in a given year.
- Students may be permitted multiple attempts on an exam.
- To assess growth or change, an individual must be assessed on several occasions over time.
- Many states are required to release some or all test items after administration and hence require new forms.
THE CONTEXT OF EQUATING
In each case, the validity of the scores would be threatened by using the same test. Using the same test across administrations at different times would result in exposure of the test items and would mean that students who were tested later had an advantage. Using the same test across time for the same students would confound practice or memory effects with true growth or change.
THE CONTEXT OF EQUATING
To avoid these problems, different test forms are constructed to the same specifications. The different test forms are intended to measure the same skills and to be of similar difficulty and discrimination. Tests that measure the same construct and have the same mean, variance, reliability, and correlations with other tests are said to be "parallel".
THE CONTEXT OF EQUATING
If the test forms were truly parallel, equating would not be necessary. However, test forms will always differ somewhat, and it is necessary to take this into account before scores on different forms can be compared.
THE CONTEXT OF EQUATING
Successful equating will result in scores that can be validly compared even though they were obtained on different test forms. The overarching principle of equating is that it should be equitable for all test-takers: after equating, it should not matter to individuals whether they take Form X or Form Y.
EQUATING AND LINKING
Equating procedures are designed to produce scores that can be validly compared across individuals even when they took different versions of a test (different "test forms"). Linking, scale aligning, or simply scaling involves transforming the scores from different tests onto a common scale. These procedures produce scores that are expressed on a common scale, but the scores of individuals who took different tests cannot be compared in the strict sense.
LINKING VS. EQUATING
Linking is used to establish the relationship between two tests that are not built to the same content specifications or to the same difficulty. One common example is establishing a relationship between the scores obtained on the SAT and the ACT, two college entrance examinations. The ACT and SAT are measures of the same general construct (aptitude). Concordance tables are constructed to express the relationship between the two scores; the scores are, however, not exchangeable.
LINKING VS. EQUATING
Similarly, in India, the relationship between scores obtained on secondary school certificate examinations conducted by two boards must be established through linking. While these tests are not exchangeable, the linking provides some degree of comparability. If all students take a common college entrance examination, the board scores can be linked through this common examination; this linking will be a little stronger.
MEASUREMENT FRAMEWORKS FOR EQUATING
Equating is one of the core applications of measurement theory: it is central to any testing program that uses multiple test administrations. There are two different measurement frameworks in common use today, and equating is approached differently in each:
- The older framework (c. 1900s) is now referred to as Classical Test Theory (CTT).
- The newer framework (c. 1970s) is Item Response Theory (IRT).
MEASUREMENT FRAMEWORKS FOR EQUATING
Classical Test Theory is based on number-correct (raw) test scores. Equating under CTT involves creating a table for converting raw scores on one test to the scale of raw scores on the other test. The equated scores may then be linearly transformed to a desired reporting scale. Equating in a classical framework lacks a strong theoretical basis.
MEASUREMENT FRAMEWORKS FOR EQUATING
IRT provides a strong theoretical basis for equating. IRT provides mathematical models that relate an individual's performance on a test item to characteristics of the item and to the individual's value on the underlying latent trait measured by the test. Instead of using the raw score as a measure of performance, we use the estimated trait value for each individual. Equating is a simpler matter under IRT than under CTT.
MEASUREMENT FRAMEWORKS FOR EQUATING
However, IRT requires strong assumptions and large sample sizes for estimating the parameters of the model. CTT equating procedures remain in wide use in situations where the assumptions of IRT are not met, where testing populations are small, or where raw scores are the preferred scores.
REQUIREMENTS FOR CLASSICAL EQUATING
Within the classical test theory framework, equating can only be achieved under certain conditions:
1. The tests must measure the same construct. Scores that measure different constructs can never be equated: we could not fairly compare the mathematics score of one student with the reading score of another!
REQUIREMENTS FOR CLASSICAL EQUATING
2. The tests should be equally reliable. If the test forms are unequally reliable, higher-performing students would benefit from the more reliable test, and lower-performing students would benefit from the less reliable test.
REQUIREMENTS FOR CLASSICAL EQUATING
3. The conversion should be symmetric: it should produce the same equated scores regardless of whether Form Y scores are converted to the scale of Form X or Form X scores are converted to the scale of Form Y. For example, if a score of 28 on Form Y equates to a score of 26 on Form X, then a score of 26 on Form X should equate to a score of 28 on Form Y.
REQUIREMENTS FOR CLASSICAL EQUATING
4. The equating function should be invariant across subpopulations of students at different levels of performance. The distribution of performance in the groups taking the different test forms should not affect the equating function; if it does, the equating may not be equitable to groups of students with a different distribution of performance.
REQUIREMENTS FOR CLASSICAL EQUATING
Fully meeting these requirements is almost impossible in practice (particularly the requirements of equal reliability and invariance across all subpopulations). Nevertheless, some adjustment of scores from different test forms must be made to compensate for the lack of equivalence of the forms. We keep the requirements for equating in mind to ensure that we do the best possible job.
DATA COLLECTION DESIGNS FOR TEST SCORE LINKING
Linking of test scores can only be achieved if appropriate data are collected to establish the relationship between the score scales. There are several possible data collection designs:
1. Single-group (SG) design
2. Equivalent-groups (EG) design
3. Non-equivalent anchor test (NEAT) design
DATA COLLECTION DESIGNS FOR TEST SCORE LINKING
Single-Group Design: One group of test-takers takes both test forms. This is the ideal design, as we know that any difference in the distributions of scores is due to differences in difficulty and discrimination of the test forms. We then adjust the scores on one form to make the two distributions equal.

Population  Sample  Form X  Form Y
P           1       *       *
DATA COLLECTION DESIGNS FOR TEST SCORE LINKING
Single-Group Design: Advantages:
- There is no confounding of test differences with group differences.
- Only one group of test-takers is needed.
DATA COLLECTION DESIGNS FOR TEST SCORE LINKING
Single-Group Design: Disadvantages:
- Practice and fatigue effects may affect scores. These effects can be controlled by counterbalancing the order of administration of the forms, but this generally cannot be incorporated into an operational administration; it requires a special equating study.
- Extended testing time is required (may not be feasible).
DATA COLLECTION DESIGNS FOR TEST SCORE LINKING
Equivalent-Groups Design: Two equivalent groups take one test form each. This design is an approximation to the single-group design.

Population  Sample  Form X  Form Y
P           1       *
P           2               *
DATA COLLECTION DESIGNS FOR TEST SCORE LINKING
Equivalent-Groups Design: Advantages:
- There is no confounding of test difficulty with group differences.
DATA COLLECTION DESIGNS FOR TEST SCORE LINKING
Equivalent-Groups Design: Disadvantages:
- Large samples are required to ensure statistical equivalence.
- If the base form has been previously administered, it must be re-administered at the same time as the new form to ensure equivalent samples; this poses a test security risk.
DATA COLLECTION DESIGNS FOR TEST SCORE LINKING
Non-equivalent Anchor Test (NEAT) Design: Two different groups take one test each; both groups take a common set of items (anchor test A). The anchor test controls for group differences: individuals taking Form Y with the same anchor test score as individuals taking Form X should have the same score on Form X.

Population  Sample  Form X  Anchor  Form Y
P           1       *       *
Q           2               *       *
DATA COLLECTION DESIGNS FOR TEST SCORE LINKING
Non-equivalent Anchor Test (NEAT) Design: Advantages:
- The groups do not need to be equivalent.
- A new form can be equated to a previously administered form without re-administering the old form.
- Students take only one form of the test plus the anchor test.
- The design is easily incorporated into operational test administrations.
DATA COLLECTION DESIGNS FOR TEST SCORE LINKING
Non-equivalent Anchor Test (NEAT) Design: Disadvantages:
- The anchor test must be of sufficient length and appropriate nature.
- The anchor test items must behave in the same way for the two different groups (be equally difficult).
- If equating to a previously administered test, the anchor test items have also been previously administered and therefore may be known to test-takers.
DATA COLLECTION DESIGNS FOR TEST SCORE LINKING
The NEAT design is the most commonly used in practice. The purpose of the anchor is to determine to what extent score differences on the test forms are due to ability or proficiency differences in the two groups. Differences in test difficulty can then be separated from group differences. Equating should adjust scores only for differences in test difficulty.
DATA COLLECTION DESIGNS FOR TEST SCORE LINKING
Anchor tests may be external (anchor items not included in the score) or internal (anchor items included in the score). External anchors are often administered as separately timed sections. Internal anchor items are spread throughout the test. Both types of anchor are susceptible to exposure; internal anchors are also susceptible to context effects.
DATA COLLECTION DESIGNS FOR TEST SCORE LINKING
In testing programs where scored items must be released after administration, an external anchor makes sense. For classical equating methods, the anchor test should be a "mini-test" in composition. Longer anchors will give better results.
THE EQUATING PROBLEM
Suppose we administer two test forms X and Y to a group of students and obtain the score distributions shown below:
[Score distributions: Form X mean = 10.4, SD = 3.9; Form Y mean = 8.4, SD = 4.4]
THE EQUATING PROBLEM
Form Y is harder than Form X (the average score is lower on Form Y). On average, students scored about two points lower on Form Y than on Form X. A student with a score of 10 on Form Y would probably have obtained a higher score if he or she had taken Form X. Why not just add 2 points to the score of everyone who took Form Y? Then the average scores would be about the same.
THE EQUATING PROBLEM
But suppose the two tests were made up of items with difficulties like those shown below:

Item Difficulty   Number of Items
                  Form X   Form Y
Very hard         1        1
Hard              6        5
Medium            7        11
Easy              4        2
Very easy         2        1
THE EQUATING PROBLEM
For low-ability students, Form Y is harder because there are fewer easy questions, so they should get points added to their Form Y score.
THE EQUATING PROBLEM
For high-ability students, Form Y is slightly easier because there are fewer hard questions, so they should have points subtracted from their Form Y score.
CLASSICAL EQUATING PROCEDURES
Adding or subtracting the same number of points for every student would not be equitable. We need an adjustment that is different for students at different levels of ability.
CLASSICAL EQUATING PROCEDURES
Why not do a regression analysis to predict the scores on Form X from the scores on Form Y?
The prediction equation is X = 4.142 + .751Y.
A student with a score of 10 on Form Y is predicted to have a Form X score of 4.142 + .751(10) = 11.65.
CLASSICAL EQUATING PROCEDURES
For the conversion to be symmetric, a student with a (theoretical) score of 11.65 on Form X should get a predicted score of 10 on Form Y.
The prediction equation in the other direction is Y = -1.473 + .944X.
A student with a score of 11.65 on Form X is predicted to have a Form Y score of -1.473 + .944(11.65) = 9.52, not 10.
CLASSICAL EQUATING PROCEDURES
The regression procedure is not symmetric: we get different equated scores depending on which test we use as the base form.
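Using the two prediction equations just shown, a few lines of Python (a sketch) make the asymmetry concrete:

```python
# The two prediction equations from the preceding slides
predict_x = lambda y: 4.142 + 0.751 * y     # predict Form X from Form Y
predict_y = lambda x: -1.473 + 0.944 * x    # predict Form Y from Form X

x_hat = predict_x(10)        # 11.65
y_back = predict_y(x_hat)    # about 9.5, not the 10 we started with

# The product of the two slopes is r**2 (0.751 * 0.944 = 0.709 here),
# not 1, so the two regressions can never be inverses unless r = 1.
print(round(x_hat, 2), round(y_back, 2))
```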
CLASSICAL EQUATING PROCEDURES
We can solve these problems in one of two ways; there are two general approaches to equating test scores using classical procedures:
- Linear methods
- Equipercentile methods
Both require scores of the same or equivalent groups on the two tests (but can also be used with non-equivalent groups after some fancy footwork).
LINEAR EQUATING
Under linear equating, scores on the two forms are considered equivalent if they are the same distance from the mean of the form in standard deviation units (i.e., have the same z-score). Hence, there is a linear relationship between the two sets of scores.
LINEAR EQUATING
Scores x (on Form X) and y (on Form Y) are considered equivalent if they have the same z-score:

(x - μ_X)/σ_X = (y - μ_Y)/σ_Y

where μ_X and σ_X are the mean and standard deviation of scores on Form X, and μ_Y and σ_Y are the mean and standard deviation of scores on Form Y.
LINEAR EQUATING
Solving for x, and substituting sample means and standard deviations, gives the transformation required for rescaling Form Y scores to the scale of Form X:

y* = x̄ + (s_X/s_Y)(y - ȳ)
LINEAR EQUATING
After equating, the rescaled Form Y scores have the same mean and standard deviation as the Form X scores. Note that this is not a prediction (regression) equation; it can be inverted to obtain the transformation for rescaling Form X scores to the scale of Form Y:

x* = ȳ + (s_Y/s_X)(x - x̄)
LINEAR EQUATING
If we substitute the transformed Form Y score y* (now on the Form X scale) for x in this equation, y* converts back to the original y value:

ȳ + (s_Y/s_X)(y* - x̄) = ȳ + (s_Y/s_X)[(s_X/s_Y)(y - ȳ)] = y
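A minimal Python sketch of the transformation just derived and its inverse; the moments are those of the worked example that follows:

```python
def linear_equate(y, mean_x, sd_x, mean_y, sd_y):
    """Rescale a Form Y score to the Form X scale by matching z-scores:
    (x - mean_x)/sd_x = (y - mean_y)/sd_y."""
    return mean_x + (sd_x / sd_y) * (y - mean_y)

def inverse_linear_equate(x, mean_x, sd_x, mean_y, sd_y):
    """Invert the transformation: rescale a Form X score to the Form Y scale."""
    return mean_y + (sd_y / sd_x) * (x - mean_x)

# Moments from the linear equating example below
mx, sx, my, sy = 10.429, 3.918, 8.375, 4.394

x_equiv = linear_equate(3, mx, sx, my, sy)                 # about 5.64
y_back = inverse_linear_equate(x_equiv, mx, sx, my, sy)    # exactly 3: symmetric
print(round(x_equiv, 2), round(y_back, 2))
```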
LINEAR EQUATING
Rescaling usually results in non-integer scores; these are generally rounded to integer values. Under linear equating, it is possible to obtain rescaled scores outside the permissible range. Depending on the final reporting scale, these are often truncated to the lowest and highest permissible scores (an arbitrary decision).
LINEAR EQUATING EXAMPLE
For the test score distributions shown earlier:
Mean of Form X: 10.429    SD of Form X: 3.918
Mean of Form Y: 8.375     SD of Form Y: 4.394
LINEAR EQUATING EXAMPLE

Score on Form Y   Equated Score on Form X   Rounded Equated Score
0                 2.96                      3
1                 3.85                      4
2                 4.75                      5
3                 5.64                      6
4                 6.53                      7
5                 7.42                      7
6                 8.31                      8
7                 9.21                      9
8                 10.10                     10
9                 10.99                     11
10                11.88                     12
11                12.77                     13
12                13.67                     14
13                14.56                     15
14                15.45                     15
15                16.34                     16
16                17.23                     17
17                18.13                     18
18                19.02                     19
19                19.91                     20
20                20.80                     20

A score of 3 on Form Y is equivalent to a score of 6 on Form X. A score of 9 on Form Y is equivalent to a score of 11 on Form X. Note that we have to truncate the equated scores at 20.
LINEAR EQUATING EXAMPLE
Note that we have added more points to scores at the low end of the scale than at the top end.
EQUIPERCENTILE EQUATING
Linear equating matches the means and standard deviations of the two distributions, but it does not change the shape of the score distribution. A Form Y score with the same z-score as a Form X score may have a different percentile rank.
EQUIPERCENTILE EQUATING
Under equipercentile equating, scores on two forms are considered to be equivalent if they have the same percentile rank in the target population. After equating, the equated scores have the same distribution as the scores on the base form.
EQUIPERCENTILE EQUATING
Scores on the two forms with equal percentile ranks are plotted against each other. The curve is usually irregular; smoothing procedures are applied under the assumption that the irregularity is due to sampling error. Smoothing can be applied to each cumulative distribution function before obtaining equivalent scores (pre-smoothing), or it can be applied to the equipercentile equivalent curve (post-smoothing). Equating usually results in non-integer scores; these are generally rounded to integer values.
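A sketch of the basic lookup (assuming cumulative percentages have already been tabulated, and smoothed if desired, at each score point; linear interpolation is used in both directions):

```python
import numpy as np

def equipercentile_equate(y_score, y_scores, y_cum, x_scores, x_cum):
    """Map a Form Y score to the Form X score with the same percentile rank.
    y_cum and x_cum are (possibly pre-smoothed) cumulative percentages at
    each score point of the respective form."""
    rank = np.interp(y_score, y_scores, y_cum)   # percentile rank of y_score
    return np.interp(rank, x_cum, x_scores)      # invert Form X's distribution

# With the smoothed ranks tabulated later in this example, a Form Y score
# of 10 (rank 64.9) maps to about 12.38 on Form X.
```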
EQUIPERCENTILE EQUATING EXAMPLE

        Form X                       Form Y
Score   % of        Cumulative %     % of        Cumulative %
        students    of students      students    of students
0       0.1         0.1              1.0         1.0
1       0.2         0.3              3.5         4.5
2       1.2         1.5              4.4         8.9
3       2.1         3.5              5.1         14.0
4       2.7         6.3              5.9         19.9
5       4.6         10.9             7.8         27.7
6       5.5         16.4             8.1         35.8
7       6.8         23.2             9.3         45.1
8       9.0         32.2             8.8         53.9
9       10.8        43.0             9.0         62.8
10      9.5         52.5             7.8         70.6
11      9.7         62.2             6.9         77.5
12      7.8         69.9             5.1         82.5
13      7.0         76.9             4.1         86.6
14      6.8         83.7             3.1         89.7
15      5.4         89.1             2.9         92.5
16      3.2         92.4             2.2         94.7
17      3.1         95.5             2.2         96.9
18      2.7         98.2             1.4         98.3
19      1.4         99.6             0.9         99.3
20      0.4         100.0            0.7         100.0

A score of 9 on Form Y has (almost) the same percentile rank as a score of 11 on Form X; a score of 9 on Form Y is equivalent to a score of 11 on Form X.
EQUIPERCENTILE EQUATING EXAMPLE
[Score distributions: Form X mean = 10.9, SD = 3.5; Form Y mean = 8.5, SD = 4.5]
Form Y is harder than Form X.
EQUIPERCENTILE EQUATING EXAMPLE
[Cumulative test score distributions for the two forms]
Note that the curves are not perfectly smooth.
EQUIPERCENTILE EQUATING EXAMPLE
Smooth the curves and plot the smoothed distributions together:
[Smoothed cumulative distributions for Form X and Form Y]
EQUIPERCENTILE EQUATING EXAMPLE

Score on   Percentile rank     Score on   Percentile rank
Form X     after smoothing     Form Y     after smoothing
0          0.0                 0          1.0
1          —                   1          1.5
2          0.1                 2          4.7
3          0.6                 3          10.0
4          2.0                 4          17.1
5          4.8                 5          25.0
6          9.2                 6          33.3
7          15.4                7          41.8
8          23.2                8          50.1
9          32.1                9          57.9
10         41.9                10         64.9
11         51.9                11         71.4
12         61.5                12         77.3
13         70.4                13         82.3
14         78.5                14         86.6
15         85.5                15         90.2
16         91.4                16         93.2
17         95.9                17         95.5
18         98.8                18         97.3
19         99.9                19         98.7
20         100.0               20         99.6

What score on Form X has the same percentile rank as a score of 10 on Form Y?
EQUIPERCENTILE EQUATING EXAMPLE

Score on   Percentile rank     Score on   Percentile rank
Form X     after smoothing     Form Y     after smoothing
8          23.2                8          50.1
9          32.1                9          57.9
10         41.9                10         64.9
11         51.9                11         71.4
12         61.5                12         77.3
13         70.4                13         82.3
14         78.5                14         86.6

A score of 10 on Form Y has a smoothed percentile rank of 64.9. We interpolate between the Form X scores of 12 and 13 to find the score on Form X that has a percentile rank of 64.9:
Equated score = 12 + (64.9 - 61.5)/(70.4 - 61.5) = 12.38
We round this to a score of 12 for reporting purposes.
EQUIPERCENTILE EQUATING EXAMPLE

Score on Form Y   Equated Score on Form X   Rounded Equated Score
0                 —                         2
1                 3.17                      3
2                 4.49                      4
3                 5.62                      6
4                 6.59                      7
5                 7.50                      8
6                 8.41                      8
7                 9.31                      9
8                 10.20                     10
9                 11.12                     11
10                12.09                     12
11                13.04                     13
12                13.91                     14
13                14.72                     15
14                15.45                     15
15                16.13                     16
16                16.86                     17
17                17.35                     17
18                17.72                     18
19                19.05                     19
20                19.90                     20

A score of 3 on Form Y is equivalent to a score of 6 on Form X. A score of 14 on Form Y is equivalent to a score of 15 on Form X.
EQUIPERCENTILE EQUATING EXAMPLE
Note that, as in linear equating, we have added more points to scores at the low end of the scale than at the top.
EQUIPERCENTILE EQUATING
The nonlinearity in the equating relationship arises because we are trying to match the shape of the distribution of Form Y to the shape of the distribution of Form X; we have to stretch the scale in some places and compress it in others.
EQUIPERCENTILE EQUATING
Problems with equipercentile equating:
- Smoothing introduces subjectivity; different smoothing procedures may produce different equating results.
- Equivalents can only be produced within the observed score range for the samples used in the equating; extrapolation is required for scores outside the observed range.
EQUIPERCENTILE VS. LINEAR EQUATING
Linear equating assumes that the difference in difficulty between the forms is constant across the scale and that the same adjustment formula can therefore be used everywhere. If this assumption is met, linear and equipercentile methods will produce the same equating results.
CLASSICAL EQUATING WITH NEAT DESIGNS
Classical equating approaches with NEAT designs require the construction of a "synthetic" population that took both forms. The anchor test is used to impute the distribution of scores for each group on the form they did not take, so that score distributions are obtained for both groups on both forms.
CLASSICAL EQUATING WITH NEAT DESIGNS
The synthetic population mean is a weighted combination of the means of the two populations. Equipercentile or linear equating procedures are then performed using the score distributions for the synthetic population.
CLASSICAL EQUATING WITH NEAT DESIGNS
The two most common linear equating approaches for NEAT designs are called the Tucker method and the Levine method. These procedures make different assumptions about the relationship between the anchor test scores and the form scores. We will not explore the technical details of these procedures here.
COMPARISON OF TUCKER AND LEVINE EQUATING
The Levine method is generally better when the two groups have different distributions. The Tucker method is generally better when one form is more difficult than the other. When the groups are of similar proficiency levels and the tests are fairly similar in difficulty, the two approaches will produce similar results.
CHAINED LINEAR EQUATING
Chained linear equating is another form of linear equating that is used with the NEAT design. Under chained linear equating, scores on Form Y are linearly equated to the anchor test scores, which are in turn linearly equated to the Form X scores. Chained linear equating does not require strong assumptions about the relationship between the anchor test scores and the form scores (unlike the Tucker and Levine methods).
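A sketch of the chain as the composition of two mean/SD links, each built from one group's observed moments (plain Python; function names are illustrative):

```python
def linear_link(mean_to, sd_to, mean_from, sd_from):
    """(slope, intercept) of the linear rule taking the 'from' scale onto
    the 'to' scale by matching means and SDs."""
    slope = sd_to / sd_from
    return slope, mean_to - slope * mean_from

def chain(y_to_anchor, anchor_to_x):
    """Compose the two links: Form Y -> anchor (group 2 moments),
    then anchor -> Form X (group 1 moments)."""
    a1, b1 = y_to_anchor
    a2, b2 = anchor_to_x
    return a2 * a1, a2 * b1 + b2
```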
NEAT EQUATING EXAMPLE
Two groups take the same 20-item tests used in the previous examples, but both also take an external 10-item anchor test:
- Group 1 takes Form X and the anchor test
- Group 2 takes Form Y and the anchor test
Group 2 is of higher average ability and takes the harder test.
NEAT EQUATING EXAMPLE
The mean scores on Form X and Form Y are similar. Can we assume that the groups are of equal ability?
Group 1: Mean on Form X: 10.396   SD on Form X: 3.985
Group 2: Mean on Form Y: 10.439   SD on Form Y: 4.619
NEAT EQUATING EXAMPLE
The anchor test mean for Group 2 is higher, which indicates that Group 2 has higher average ability, which in turn implies that Form Y must have been harder.
Group 1: Mean on Anchor: 4.473   SD on Anchor: 2.181
Group 2: Mean on Anchor: 5.417   SD on Anchor: 2.321
NEAT EQUATING EXAMPLE
After constructing the synthetic population of test-takers who took both forms of the test, we see that the mean on Form Y is lower than the mean on Form X, i.e., Form Y is harder than Form X.

Synthetic population:   Tucker   Levine
Mean on Form X:         11.043   11.333
SD on Form X:           4.110    4.243
Mean on Form Y:         9.675    9.434
SD on Form Y:           4.593    4.573
NEAT EQUATING EXAMPLE
The equating coefficients for all methods are similar to those of the single-group design.
Tucker linear equating coefficients:   Slope = 0.895   Intercept = 2.385
Levine linear equating coefficients:   Slope = 0.928   Intercept = 2.582
Chained linear equating coefficients:  Slope = 0.918   Intercept = 2.537
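As a check, the chained coefficients can be reproduced directly from the moments on the two preceding slides (a short sketch; the Tucker and Levine coefficients additionally require the synthetic-population quantities):

```python
m_y, s_y = 10.439, 4.619     # Form Y, group 2
m_a2, s_a2 = 5.417, 2.321    # anchor, group 2
m_a1, s_a1 = 4.473, 2.181    # anchor, group 1
m_x, s_x = 10.396, 3.985     # Form X, group 1

a1 = s_a2 / s_y              # Form Y -> anchor, within group 2
b1 = m_a2 - a1 * m_y
a2 = s_x / s_a1              # anchor -> Form X, within group 1
b2 = m_x - a2 * m_a1

print(round(a2 * a1, 3), round(a2 * b1 + b2, 3))   # 0.918 2.537
```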
NEAT EQUATING EXAMPLE

                  Rounded equated score on Form X
Score on Form Y   Tucker   Levine   Chained Linear
0                 2        3        3
1                 3        4        3
2                 4        4        4
3                 5        5        5
4                 6        6        6
5                 7        7        7
6                 8        8        8
7                 9        9        9
8                 10       10       10
9                 10       11       11
10                11       12       12
11                12       13       13
12                13       14       14
13                14       15       14
14                15       16       15
15                16       17       16
16                17       17       17
17                18       18       18
18                18       19       19
19                19       20       20
20                20       20       20
IRT PROCEDURES FOR SCORE LINKING
IRT provides a strong theoretical basis for equating. Under IRT, the test score is replaced by an estimate of the test-taker's level of the trait being measured by the test. The trait values do not depend on the particular items that are administered. The only equating issue in IRT is setting a scale for reporting trait values and putting all parameter estimates on that common scale.
IRT PROCEDURES FOR SCORE LINKING
Under all IRT models, the scale for item and trait parameters is indeterminate, i.e., it has no natural metric. The issue is simply one of setting a scale for the item and trait parameter estimates on one form and linearly transforming estimates from another form to the same scale.
BASIC PRINCIPLES OF IRT
In IRT, we think of the observed score as an (imperfect) indicator of a latent trait. Item response models specify the relationship between performance on a test item and the latent trait or traits measured by the test. Specifically, item response models specify a mathematical relationship between the probability of a given response to an item and the set of underlying traits.
BASIC PRINCIPLES OF IRT
The probability of the response is determined by characteristics of the item as well as by the individual's trait value(s). The most widely used models assume that the test is unidimensional, i.e., that it measures a single trait or construct.
ITEM RESPONSE FUNCTIONS FOR DICHOTOMOUS ITEMS (UNIDIMENSIONAL MODEL)
[Plot of item response functions: probability of a correct response as a function of the latent trait]
ADVANTAGES OF IRT
When an IRT model holds:
1. Person parameters (trait values) are invariant, i.e., they do not depend on the set of items administered.
2. Item parameters are invariant, i.e., they do not depend on the characteristics of the population.
These properties allow us to compare the trait estimates of individuals who have taken different forms of a test.
IRT MODELS
There are three widely used models for dichotomous responses and three widely used models for polytomous responses. The majority of test items used on standardized assessments are dichotomously scored multiple-choice items. The models for dichotomous responses differ in the number of item characteristics that are assumed to influence test-taker performance in addition to test-taker ability or proficiency.
IRT MODELS
The two-parameter (2P) model assumes that responses are influenced by the difficulty and discrimination of the items. The one-parameter (1P) model restricts the two-parameter model by assuming that item responses are influenced only by item difficulty, i.e., that all items are equally discriminating. The three-parameter (3P) model extends the 2P model by including a parameter that allows for a non-zero probability of a correct response even at the lowest trait values (a "guessing" parameter).
THE 2P LOGISTIC MODEL

P(u_ij = 1 | θ_i) = exp[a_j(θ_i - b_j)] / (1 + exp[a_j(θ_i - b_j)])

where
u_ij is the response of individual i to item j
θ_i is the latent trait value of individual i
a_j is the discriminating power of item j
b_j is the difficulty level of item j
THE 2P LOGISTIC MODEL
[Plot of 2P item response functions for several items]
Discrimination a is proportional to the slope of the curve at b.
THE 1P LOGISTIC MODEL

P(u_ij = 1 | θ_i) = exp(θ_i - b_j) / (1 + exp(θ_i - b_j))

where
u_ij is the response of individual i to item j
θ_i is the latent trait value of individual i
b_j is the difficulty level of item j
THE 3P LOGISTIC MODEL

P(u_ij = 1 | θ_i) = c_j + (1 - c_j) exp[a_j(θ_i - b_j)] / (1 + exp[a_j(θ_i - b_j)])

where
u_ij is the response of individual i to item j
θ_i is the latent trait value of individual i
a_j is the discriminating power of item j
b_j is the difficulty level of item j
c_j is the "guessing" parameter (lower asymptote) of item j
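A sketch of the three response functions in plain logistic form (some programs include a scaling constant D = 1.7 inside the exponent; it is omitted here, and the 1P model is shown with a fixed at 1):

```python
import math

def p_3pl(theta, a, b, c):
    """3P logistic model: probability of a correct response.
    a: discrimination, b: difficulty, c: lower asymptote ("guessing")."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def p_2pl(theta, a, b):
    """2P model: the 3P model with no guessing (c = 0)."""
    return p_3pl(theta, a, b, 0.0)

def p_1pl(theta, b):
    """1P model: equal discrimination across items (here fixed at a = 1)."""
    return p_3pl(theta, 1.0, b, 0.0)

# An examinee at theta = 0 on an item of average difficulty (b = 0)
print(p_2pl(0.0, a=1.0, b=0.0))   # 0.5
```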
USING IRT MODELS
IRT requires relatively large samples for accurate estimation of the model parameters (at least 1,000 for the 3P model). Specialized computer programs are required. The advantages of IRT hold only when the model adequately fits the data; it is essential that the fit of the model to the data be investigated before the results are used.
IRT PROCEDURES FOR SCORE LINKING
Under the 1P model, the scale for θ is arbitrary as long as b is expressed on the same scale:

θ* - b* = (θ + K) - (b + K) = θ - b, where θ* = θ + K and b* = b + K

We can choose whatever metric we want for θ, as long as we use the same metric for item difficulty b.
IRT PROCEDURES FOR SCORE LINKING
Under the 2P and 3P models, the term a(θ - b) is invariant over linear transformations of the scale, as long as a and b are transformed correspondingly:

a*(θ* - b*) = (a/M)[(Mθ + K) - (Mb + K)] = a(θ - b)

where θ* = Mθ + K, b* = Mb + K, and a* = a/M.
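A quick numeric check of this identity (arbitrary illustrative values):

```python
M, K = 1.2, -0.5              # any linear rescaling of the trait scale
theta, a, b = 0.7, 1.1, -0.3

theta_s = M * theta + K       # rescaled trait value
b_s = M * b + K               # rescaled difficulty
a_s = a / M                   # rescaled discrimination

print(a * (theta - b), a_s * (theta_s - b_s))  # both 1.1: invariant
```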
IRT PROCEDURES FOR SCORE LINKING
The scale for parameter estimates is set during calibration, either by setting the mean and SD of θ to 0 and 1, respectively, or by setting the mean and SD of the difficulty values to 0 and 1. Most software packages fix the scale on θ. When different tests are calibrated in different groups, the scales will be set differently. However, estimates for the same items or people will differ only by a linear transformation.
IRT PROCEDURES FOR SCORE LINKING
We need to rescale the parameter estimates from the different tests and for the different groups so that they are on a common scale. Item response data must be collected using one of the data collection designs. For the single-group design, we can simply calibrate all items together; the estimates will automatically be on the same scale. For the equivalent-groups design, we assume that the groups have the same mean and SD and fix the scale on θ in each group/test.
IRT PROCEDURES FOR SCORE LINKING
For the NEAT design, there are two general sets of methods:
Calibration methods
- Concurrent calibration
- Fixed item parameter calibration
Transformation methods
- Mean and sigma approach (and variations)
- Characteristic curve method (Stocking-Lord method)
IRT PROCEDURES FOR SCORE LINKING
Calibration methods automatically put all parameter estimates on a common scale. Transformation methods require determination of the linear relationship between parameter estimates from the two forms.
CONCURRENT CALIBRATION
For the NEAT design, we can calibrate all items and all groups together, treating the items not taken by each group as missing (indicated by 9):

00110101001000000000000110001099999999999999999999
00011110101101000110001101001199999999999999999999
01011111101000110100001110001199999999999999999999
00111111001000000000001110000099999999999999999999
99999999999999999999111011110111111111111111101111
99999999999999999999111101101111111011100111111111
99999999999999999999100000001001110000001100010000
99999999999999999999100001001011111010111111000000
99999999999999999999111100111111110111111111101111

(Columns 1-20: Form X items; columns 21-30: anchor items; columns 31-50: Form Y items; 9 = missing response.)
This procedure is called concurrent calibration.
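One way such a stacked matrix might be assembled with NumPy (a sketch; the array shapes and column order are assumptions matching the display above):

```python
import numpy as np

MISSING = 9  # code for "item not presented", as in the layout above

def concurrent_matrix(resp_x, resp_y, n_x=20, n_anchor=10, n_y=20):
    """Stack the two groups' responses into one sparse matrix for
    concurrent calibration. Column order: Form X items, anchor items,
    Form Y items. resp_x holds group 1 (X items then anchor items);
    resp_y holds group 2 (anchor items then Y items)."""
    n1, n2 = resp_x.shape[0], resp_y.shape[0]
    data = np.full((n1 + n2, n_x + n_anchor + n_y), MISSING, dtype=int)
    data[:n1, :n_x + n_anchor] = resp_x   # group 1: X + anchor columns
    data[n1:, n_x:] = resp_y              # group 2: anchor + Y columns
    return data
```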
CONCURRENT CALIBRATION
Concurrent calibration does not work well if there are large mean differences in trait levels between groups, unless group membership is explicitly taken into account. Modern computer programs allow a group membership code and separate trait distributions in each group. The mean and SD of the trait values in each group are allowed to differ and are estimated.
FIXED ITEM PARAMETER CALIBRATION
An alternative approach is to calibrate the items for each group separately, but hold the item parameters for the common items in the second group fixed at the values obtained for the common items in the base group. This procedure is called fixed item parameter calibration. Fixed item calibration sometimes causes estimation problems, especially when the groups have very different trait distributions.
TRANSFORMATION METHODS
With this approach, we calibrate the items separately in each group and determine the linear transformation needed to rescale one group to the scale of the other by comparing the parameter estimates of the anchor items. Concurrent calibration is generally preferred in theory; however, the sparseness of the response matrix in NEAT designs may cause estimation problems in some cases.
MEAN AND SIGMA METHOD
Calculate the mean and SD of the difficulty parameters for the anchor items in the two groups. Determine the linear transformation that would transform the mean and SD of the anchor item difficulties in Group 1 (Form X) to equal those of Group 2 (Form Y). We know that equivalent difficulty values on the two scales are related linearly:

b_Y = M*b_X + K
MEAN AND SIGMA METHOD
Since the transformation must reproduce the mean and SD of the anchor difficulties,

Mean(b_Y) = M*Mean(b_X) + K and SD(b_Y) = M*SD(b_X)

therefore

M = SD(b_Y)/SD(b_X) and K = Mean(b_Y) - M*Mean(b_X)
MEAN AND SIGMA METHOD
This transformation can be inverted to place parameters from Form Y/Group 2 on the scale of Form X/Group 1 (the symmetry requirement):

b_X = (b_Y - K)/M, i.e., b* = M'*b_Y + K' with M' = SD(b_X)/SD(b_Y) and K' = Mean(b_X) - M'*Mean(b_Y)

Once the equating coefficients M and K are determined, the item and trait parameter estimates for the unique (non-anchor) items of Form Y are transformed to the scale of Form X.
MEAN AND SIGMA METHOD
Parameter estimates for the anchor items are averaged. For the 1P model, only the additive constant K is calculated. The trait estimates for Group 2 are transformed using the same transformation as was used for the item difficulty parameters.
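A minimal sketch of the computation, using the anchor difficulties from the example that follows (statistics.stdev is the n-1 sample SD, matching the SDs reported there; the a transformation follows the convention θ* = Mθ + K, so a* = a/M):

```python
import statistics

def mean_sigma(b_anchor_x, b_anchor_y):
    """Constants placing Form Y parameters on the Form X scale:
    theta_x = M * theta_y + K."""
    M = statistics.stdev(b_anchor_x) / statistics.stdev(b_anchor_y)
    K = statistics.mean(b_anchor_x) - M * statistics.mean(b_anchor_y)
    return M, K

def transform_item(a, b, c, M, K):
    """Rescale one item's parameters: b* = M*b + K, a* = a/M, c unchanged."""
    return a / M, M * b + K, c

# Anchor-item difficulties from the example below
b_x = [0.66, -0.89, -0.05, 0.41]   # common items, Form X calibration
b_y = [0.48, -0.99, -0.36, 0.24]   # common items, Form Y calibration
M, K = mean_sigma(b_x, b_y)
print(round(M, 3), round(K, 3))    # about 1.036 and 0.196
```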
MEAN AND SIGMA METHOD EXAMPLE

      Form X unique items   Common items (X)      Common items (Y)      Form Y unique items
Item  a     b      c        a     b      c        a     b      c        a     b      c
1     1.08  0.59   0.21
2     0.63  1.22   0.12
3     0.92  -0.48  0.18
4     1.37  0.69   0.22
5     0.57  -1.11  0.09
6     0.94  -0.53  0.06
7                           0.71  0.66   0.15     0.63  0.48   0.19
8                           1.31  -0.89  0.22     1.08  -0.99  0.15
9                           0.92  -0.05  0.16     0.78  -0.36  0.26
10                          1.25  0.41   0.09     1.18  0.24   0.13
11                                                                      1.34  -0.29  0.11
12                                                                      0.48  -1.04  0.23
13                                                                      0.76  0.45   0.19
14                                                                      0.69  1.21   0.13
15                                                                      1.09  -0.57  0.08
16                                                                      0.49  -0.23  0.17

Common items, Form X calibration:  Mean b = .0325,  SD b = .682
Common items, Form Y calibration:  Mean b = -.1575, SD b = .658
MEAN AND SIGMA METHOD EXAMPLE
Calculate the equating constants for placing Form Y parameters on the Form X scale:

M = SD(b_X)/SD(b_Y) = .682/.658 = 1.036
K = Mean(b_X) - M*Mean(b_Y) = .0325 - 1.036*(-.1575) = .196

Transform the Form Y parameters:

b* = M*b + K,   a* = a/M,   θ* = M*θ + K
MEAN AND SIGMA METHOD EXAMPLE
After transformation of the Form Y parameters:

      Form X unique items   Common items (X)      Common items (Y, transformed)   Form Y unique items (transformed)
Item  a     b      c        a     b      c        a     b      c                  a     b      c
1     1.08  0.59   0.21
2     0.63  1.22   0.12
3     0.92  -0.48  0.18
4     1.37  0.69   0.22
5     0.57  -1.11  0.09
6     0.94  -0.53  0.06
7                           0.71  0.66   0.15     0.74  0.65   0.19
8                           1.31  -0.89  0.22     1.36  -0.77  0.15
9                           0.92  -0.05  0.16     0.95  -0.16  0.26
10                          1.25  0.41   0.09     1.30  0.42   0.13
11                                                                                1.39  -0.09  0.11
12                                                                                0.50  -0.82  0.23
13                                                                                0.79  0.62   0.19
14                                                                                0.72  1.35   0.13
15                                                                                1.13  -0.37  0.08
16                                                                                0.51  -0.04  0.17
CHARACTERISTIC CURVE METHOD
This method uses the information in both the difficulty and discrimination parameter estimates to determine the equating relationship. The linear transformation is found that minimizes the sum of squared differences between expected scores on the common items based on the item parameter estimates from each group. This method is considerably more complex than the mean and sigma method.
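A hedged sketch of the Stocking-Lord criterion (assuming SciPy is available; the theta grid and optimizer are arbitrary choices, not part of the method as described above):

```python
import numpy as np
from scipy.optimize import minimize

def p3(theta, a, b, c):
    """3P logistic item response function (vectorized over items)."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def stocking_lord(ax, bx, cx, ay, by, cy, theta=np.linspace(-4, 4, 41)):
    """Find (M, K) minimizing the squared difference between the anchor
    test characteristic curves: the expected anchor score from the Form X
    estimates vs. the expected score from the transformed Form Y estimates."""
    ax, bx, cx = map(np.asarray, (ax, bx, cx))
    ay, by, cy = map(np.asarray, (ay, by, cy))

    def loss(mk):
        M, K = mk
        tcc_x = p3(theta[:, None], ax, bx, cx).sum(axis=1)
        tcc_y = p3(theta[:, None], ay / M, M * by + K, cy).sum(axis=1)
        return np.sum((tcc_x - tcc_y) ** 2)

    return minimize(loss, x0=[1.0, 0.0], method="Nelder-Mead").x

# Applied to the anchor estimates from the mean and sigma example:
M, K = stocking_lord([0.71, 1.31, 0.92, 1.25], [0.66, -0.89, -0.05, 0.41],
                     [0.15, 0.22, 0.16, 0.09],
                     [0.63, 1.08, 0.78, 1.18], [0.48, -0.99, -0.36, 0.24],
                     [0.19, 0.15, 0.26, 0.13])
print(M, K)   # estimated equating constants
```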