1
Ensuring, Evaluating, & Documenting Comparability of AA-AAS Scores
Scott Marion, Center for Assessment
CCSSO, June 22, 2010
2
What is comparability?
In an assessment context, comparability means that the inferences from scores on one test can be psychometrically related to scores on another “comparable” test. In other words, we could consider the scores from two comparable tests interchangeable.
3
Why do we care about comparability?
In fully individualized assessments, we don’t, but we need scores to be comparable when:
- Judging two or more scores against a common standard,
- Aggregating scores to the school or district level (we are assuming that scores are comparable), or
- Judging scores for the same students and/or the same schools across years.
4
Comparability and Flexibility
Flexibility or individualization can pose challenges to comparability. Using the same items and the same (extended) content standards each year would appear to ameliorate any comparability concerns. But not everything is as it appears: issues with “teaching to the test” threaten comparability. Obviously, completely individualized tasks addressing a non-systematic selection of standards raise considerable comparability concerns.
5
Traditional Methods
Scaling is simply placing raw scores on a numerical scale intended to reflect a continuum of achievement or ability, so that similar scale scores have similar meaning across tests (Peterson, Kolen, & Hoover, 1989).
Linking describes a family of approaches (including equating) by which we can place the scores from one assessment on the same scale as another assessment (e.g., putting the 2006 scores on the 2005 scale).
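To make the scaling idea concrete, here is a minimal sketch (not from the slides) of placing raw scores on a reporting scale with a simple linear transformation; the 100-200 reporting range and the 40-item test are hypothetical choices.

```python
import numpy as np

def linear_scale(raw, raw_min, raw_max, scale_min=100, scale_max=200):
    """Map raw scores onto a reporting scale with a linear transformation.

    The 100-200 reporting range is an arbitrary, illustrative choice;
    operational programs set these anchors by policy.
    """
    raw = np.asarray(raw, dtype=float)
    slope = (scale_max - scale_min) / (raw_max - raw_min)
    return scale_min + slope * (raw - raw_min)

# Example: a hypothetical 40-item test reported on a 100-200 scale.
raw_scores = [12, 25, 38]
print(linear_scale(raw_scores, raw_min=0, raw_max=40))
# maps 12, 25, 38 -> 130.0, 162.5, 195.0
```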
6
Scaling Requirements
We can create scales from many different types of raw scores, but for the scale to lead to valid inferences, the original raw scores must have a similar conceptual foundation (i.e., the raw scores should be derived from similar assessment experiences, unless we move to a normative approach).
7
Linking (Equating) Requirements
There is a fairly extensive literature regarding the requirements for valid equating. Depending on the content and contextual relationships between the two tests, the linking could be as strong as formal equating. If equating assumptions are not met, calibration, projection, or even judgmental methods could be applied to connect the two sets of test scores. It is challenging for AA-AAS to meet many of the assumptions necessary for strict equating.
8
Mislevy on Linking (1992)
In Linking Educational Assessments (1992), Mislevy states, “The central problems related to linking two or more assessments are: discerning the relationships among the evidence the assessments provide about conjectures of interest, and figuring out how to interpret this evidence correctly” (p. 21). A brief summary of the three most valid approaches to linking (in descending order of quality) follows.
9
Equating (Mislevy, 1992, p. 21)
The linking is strongest (and simplest) if the two tests were designed from the same test blueprint and were designed to measure the same construct(s). The most common example is two or more forms of the same test. “Under these carefully controlled circumstances, the weight and nature of the evidence the two assessments provide about a broad array of conjectures is practically identical.” Equating is a statistical method of relating the scores on one test to the scores on a second test in order to separate differences in item/test difficulty from changes in student achievement.
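A hedged sketch of one such statistical method, linear equating under a random-groups design; the raw-score samples are fabricated for illustration, and an operational program would first verify that the equating assumptions are met.

```python
import numpy as np

def linear_equate(x, scores_x, scores_y):
    """Place a Form X score on the Form Y scale via linear equating:
    l_Y(x) = (sigma_Y / sigma_X) * (x - mu_X) + mu_Y
    Assumes a random-groups design where the two samples are equivalent.
    """
    mu_x, sd_x = np.mean(scores_x), np.std(scores_x)
    mu_y, sd_y = np.mean(scores_y), np.std(scores_y)
    return sd_y / sd_x * (x - mu_x) + mu_y

# Illustrative (fabricated) raw-score samples on two forms of the "same" test.
form_x = np.array([18, 22, 25, 27, 30, 33])
form_y = np.array([20, 24, 26, 29, 31, 36])
print(linear_equate(25, form_x, form_y))  # a Form X score of 25 on the Y scale
```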
10
Calibration (Mislevy, 1992, p. 24)
If the two (or more) tests were not designed from the same test blueprint, but both have been constructed to provide evidence about the same type of achievement, then the scores can be related through calibration. “Unlike equating, which matches tests to one another directly, calibration relates the results of different assessments to a common frame of reference, and thus to one another only indirectly” (Mislevy, 1992, p. 24). There are several situations in which calibration is used; the two most common are (1) constructing tests of differing lengths from essentially the same blueprint, and (2) using IRT to link responses to a set of items (e.g., an item bank) built to measure the “same construct.”
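A minimal sketch of the second situation, assuming a Rasch/1PL calibration and mean/mean linking through a set of common (anchor) items; all difficulty values are invented for illustration.

```python
import numpy as np

def mean_mean_link(anchor_b_old, anchor_b_new, new_form_b):
    """Mean/mean linking for Rasch/1PL calibrations.

    The shift is the difference in mean difficulty of the anchor items
    across the two calibration runs; adding it to the new form's item
    difficulties expresses them on the old (base) scale.
    """
    shift = np.mean(anchor_b_old) - np.mean(anchor_b_new)
    return np.asarray(new_form_b) + shift

# Hypothetical anchor-item difficulties from last year's and this year's runs.
b_anchor_old = [-0.8, -0.2, 0.3, 1.1]
b_anchor_new = [-1.0, -0.4, 0.1, 0.9]   # same items, new calibration
b_new_form   = [-1.2, 0.0, 0.5, 1.4]    # unique items on this year's form
print(mean_mean_link(b_anchor_old, b_anchor_new, b_new_form))
```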
11
Projection (Mislevy, 1992, p. 24)
“If assessments are constructed around different types of tasks, administered under different conditions, or used for purposes that bear different implications for students’ affect and motivation, then mechanically applying equating or calibration formulas can prove seriously misleading: X and Y do not ‘measure the same thing.’” Mislevy’s concern here is that the two assessments measure qualitatively different information. While it might make sense to administer the two tests to students, just because the two tests are correlated does not mean we should try to link them.
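A sketch of projection as a simple regression of scores on one assessment from scores on the other; the paired scores are fabricated, and, as Mislevy cautions, a usable regression does not make the two measures interchangeable.

```python
import numpy as np

# Fabricated paired scores from two assessments built around different tasks.
x = np.array([10, 14, 18, 22, 26, 30], dtype=float)   # assessment X
y = np.array([35, 41, 44, 52, 55, 63], dtype=float)   # assessment Y

# Projection: predict the Y score we would expect given an X score.
slope, intercept = np.polyfit(x, y, deg=1)
r = np.corrcoef(x, y)[0, 1]

print(f"projected Y for X=20: {intercept + slope * 20:.1f}  (r = {r:.2f})")
# The projection is direction-specific and sample-specific; it is a
# prediction, not a claim that X and Y measure the same thing.
```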
12
Current Operational Tests
Most current operational programs rely on statistical links to relate scores from one year to another, under the assumption that the connection is strong enough to conduct some form of equating.
13
We Give the Same “Test” Every Year
Comparability is relatively easy—almost a non-issue, but what do you do when…
- you have to replace “tired” or poorly functioning items?
- you find out that there is score inflation due to “teaching to the test”?
Teaching to the test could become an issue; there is a long history of this with regular assessments, and it would be a threat to comparability and to the validity of accountability decisions.
14
Is the “same” really the “same”?
What if you introduce new items to your supposedly common form? How many new items can your test absorb before you feel the need for formal equating? Do not just think of this as a one-year issue: a little here, a little there, and pretty soon you have a new test (see the sketch below). Once you replace roughly 5-10% of the items, you should think about formal linking/equating.
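A back-of-the-envelope sketch of how modest annual item replacement compounds; the 10% yearly rate is only an assumed example.

```python
# If roughly 10% of items are swapped out each year (an illustrative rate,
# applied uniformly across the current form), the share of the original
# form still in place shrinks quickly:
rate = 0.10
remaining = 1.0
for year in range(1, 6):
    remaining *= (1 - rate)
    print(f"after year {year}: {remaining:.0%} of the original items remain")
# After 5 years only ~59% of the original form survives, so the "same test"
# assumption quietly erodes and formal linking/equating becomes necessary.
```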
15
Issues of Flexibility/Standardization
- Common or unique items
- Common or unique indicators (finer grain than standards)
- Common or variable forms
- Unique or common scoring
16
But, we have considerable flexibility…
- Flexible items/same standards:
  - Alignment methods such as the Porter-Smithson approach—aligning to a common standard—might work (see the sketch below)
  - Other judgmental approaches
- Flexible items/flexible standards:
  - Judgmental methods only
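The Porter-Smithson approach summarizes agreement between two content-by-cognitive-demand distributions with an alignment index; a minimal sketch follows, using invented cell proportions.

```python
import numpy as np

def porter_alignment_index(p, q):
    """Porter's alignment index: 1 - 0.5 * sum(|p_i - q_i|),
    where p and q are matrices of content-by-cognitive-demand proportions
    that each sum to 1. Ranges from 0 (no overlap) to 1 (identical)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 1.0 - 0.5 * np.abs(p - q).sum()

# Hypothetical proportions: rows = content strands, cols = cognitive demand.
standards  = [[0.20, 0.10], [0.30, 0.15], [0.15, 0.10]]
assessment = [[0.25, 0.05], [0.25, 0.20], [0.10, 0.15]]
print(f"alignment index = {porter_alignment_index(standards, assessment):.2f}")
# -> alignment index = 0.85
```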
17
A Few Ways to Establish “Comparability”
- Establish construct comparability based on similar content
- Establish comparability based on similar or compensatory functionality
- Establish comparability based on judgments of relatedness or comparability
18
Comparability based on similar content
Establish construct comparability based on similar content: for example, one assessment item taps the same construct as another assessment item. This may be based on a content and/or cognitive analysis. This approach needs to be documented and defended in terms of both the process and the results.
19
Similar or compensatory use
Establish comparability based on similar or compensatory use: distributional requirements often specify that certain profiles of performance will be treated as comparable; total scores based on a compensatory system do similarly. In other words, if students perform similarly, as a group, on one set of items compared to another, the item sets may be treated as comparable. This is a much weaker connection than the content or cognitive analysis approach.
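A sketch of the kind of distributional check implied here, comparing group-level performance on two item sets with a mean difference and a standardized effect; the percent-correct values are fabricated.

```python
import numpy as np

# Fabricated percent-correct scores for the same group of students on two
# different item sets claimed to be comparable.
set_a = np.array([0.55, 0.62, 0.70, 0.48, 0.66, 0.74, 0.59])
set_b = np.array([0.58, 0.60, 0.68, 0.50, 0.69, 0.71, 0.61])

diff = set_a.mean() - set_b.mean()
pooled_sd = np.sqrt((set_a.var(ddof=1) + set_b.var(ddof=1)) / 2)
print(f"mean difference = {diff:+.3f}, standardized = {diff / pooled_sd:+.2f}")
# Similar group-level distributions are weak evidence of comparability; they
# say nothing about whether individual items tap the same knowledge and skills.
```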
20
Judgments of comparability
Establish comparability based on judgments of relatedness or comparability: disciplined judgments may be made to compare almost anything in terms of specified criteria (e.g., is this bottle as good a holder of liquid as this glass is?). Decision-support tools and a common universe of discourse undergird such judgments. Obviously, this approach should be used only when neither of the other two approaches can be used.
21
Performance categories
While the item-based approaches to comparability are the most appropriate, more holistic judgments can be made at the performance-category level. In other words, is there evidence that two students designated as proficient, for example, have comparable academic knowledge and skills?
22
Summary
AA-AAS pose significant challenges to comparability, which in turn poses challenges to the validity of score inferences across students and/or across years. We cannot blindly employ statistical procedures and “pretend” to equate when we have not met many of the assumptions. We must articulate a clear rationale for our approaches to comparability and document the methods and results, as described in this presentation.
23
For more information
Scott Marion, Center for Assessment
www.nciea.org