Pearson Copyright 2010 Some Perspectives on CAT for K-12 Assessments Denny Way, Ph.D. Presented at the 2010 National Conference on Student Assessment June 20, 2010
Some CAT Questions
You want to implement CAT, but you wonder which IRT model you should use
You want to implement CAT, but you wonder how to put together a CAT pool and how best to implement it
You want to implement CAT, and you wonder whether it has to be limited to on-grade items only
Which Model to Use for CAT?
The Rasch and three-parameter logistic (3PL) models are the most popular for IRT applications with multiple-choice items
In applications to conventional fixed-form tests, the differences between the two models are not that great, i.e., when you do parallel forms equating, you get about the same answer based on either model
Which Model to Use for CAT?
With CAT, there are much greater differences between the Rasch and 3PL models. For example:
– Rasch CAT only supports a reduction in test length of about 20% compared to a conventional test
– 3PL CAT supports a reduction in test length of about % compared to a conventional test
Why?
– With the Rasch model, the information functions for all items have the same shape, and the information for an “optimally administered” item is not that much greater than for a typical item
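The point about information functions can be illustrated with a minimal sketch. It uses the standard item information formula for the 3PL model, with the Rasch model treated as the special case a = 1, c = 0. The function names, the D = 1.7 scaling constant, and the parameter values are assumptions chosen for illustration, not values from the presentation.

```python
import math

def p_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def info_3pl(theta, a, b, c, D=1.7):
    """Fisher information of a 3PL item at ability theta."""
    p = p_3pl(theta, a, b, c, D)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2

def info_rasch(theta, b):
    """Rasch information: the 3PL formula with a=1, c=0, D=1, i.e. p*(1-p)."""
    return info_3pl(theta, a=1.0, b=b, c=0.0, D=1.0)

# Rasch: a perfectly targeted item versus one that is a full logit off target.
print(info_rasch(0.0, 0.0))   # 0.25, the maximum for any Rasch item
print(info_rasch(0.0, 1.0))   # about 0.197

# 3PL: two items of equal difficulty but different (illustrative) discrimination.
print(info_3pl(0.0, a=0.6, b=0.0, c=0.2))
print(info_3pl(0.0, a=1.4, b=0.0, c=0.2))
```

Under the Rasch model every item peaks at 0.25, so an item one logit off target still delivers roughly 80% of the maximum. Under the 3PL the peak grows with the square of the discrimination, which is why optimal selection pays off more there.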
Reduced Test Length for an Optimal Rasch CAT
Reduced Test Length for an Optimal Rasch CAT
Most conventional tests are about here for most students
Which Model to Use for CAT?
With CAT, there are much greater differences between the Rasch and 3PL models. For example:
– 3PL CAT tends to select some items in the pool very often and may never select many perfectly good items
– Rasch CAT selects items in a much more uniform manner
Why?
– With the 3PL, some highly discriminating items provide much more information than other items and therefore are more attractive to the item selection algorithm
Rasch vs. 3PL Exposure
Rasch vs. 3PL Exposure – An Example
50% of 3PL items used 5% of the time or less
10% of 3PL items used more than 25% of the time
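A small simulation can show how exposure figures of this kind are tallied. The sketch below is not the study behind the slide: the pool size, the parameter distributions, the EAP ability estimate on a grid, and pure maximum-information selection with no exposure control are all assumptions made for illustration, and the numbers it prints will not match the slide.

```python
import math
import random

GRID = [-4.0 + 0.1 * k for k in range(81)]  # ability grid for the EAP estimate

def prob(theta, a, b, c):
    """Probability of a correct response (logistic metric, no 1.7 scaling)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def info(theta, a, b, c):
    """Item information at theta; reduces to p*(1-p) when a=1 and c=0."""
    p = prob(theta, a, b, c)
    return a * a * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def simulate_exposure(pool, n_examinees=100, test_len=20, seed=0):
    """Run a maximum-information CAT and return each item's exposure rate."""
    rng = random.Random(seed)
    counts = [0] * len(pool)
    for _ in range(n_examinees):
        true_theta = rng.gauss(0.0, 1.0)
        post = [math.exp(-0.5 * t * t) for t in GRID]  # unnormalized N(0,1) prior
        used = set()
        for _ in range(test_len):
            # EAP estimate: posterior mean over the grid.
            est = sum(t * w for t, w in zip(GRID, post)) / sum(post)
            # Pick the unused item with the most information at the estimate.
            j = max((i for i in range(len(pool)) if i not in used),
                    key=lambda i: info(est, *pool[i]))
            used.add(j)
            counts[j] += 1
            a, b, c = pool[j]
            correct = rng.random() < prob(true_theta, a, b, c)
            post = [w * (prob(t, a, b, c) if correct else 1.0 - prob(t, a, b, c))
                    for t, w in zip(GRID, post)]
    return [n / n_examinees for n in counts]

def make_pools(n_items=150, seed=1):
    """Build a Rasch pool and a 3PL pool that share the same difficulties."""
    rng = random.Random(seed)
    bs = [rng.gauss(0.0, 1.0) for _ in range(n_items)]
    rasch = [(1.0, b, 0.0) for b in bs]
    three_pl = [(rng.lognormvariate(0.0, 0.3), b, 0.2) for b in bs]
    return rasch, three_pl

def summarize(rates):
    """Share of items used 5% of the time or less, and more than 25%."""
    low = sum(r <= 0.05 for r in rates) / len(rates)
    high = sum(r > 0.25 for r in rates) / len(rates)
    return low, high

rasch_pool, three_pl_pool = make_pools()
for name, pool in (("Rasch", rasch_pool), ("3PL", three_pl_pool)):
    low, high = summarize(simulate_exposure(pool))
    print(f"{name}: {low:.0%} of items used <=5% of the time, {high:.0%} used >25%")
```

Operational CAT programs typically add exposure control and content constraints on top of the selection rule, so a bare simulation like this one tends to overstate the imbalance.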
Which Model to Use for CAT?
Both Rasch and 3PL models have been used successfully in CAT applications
Psychometricians will offer different opinions about which model is best for CAT
Either model is defensible for CAT, but the models do behave quite differently
The IRT model is just one consideration related to CAT; other considerations related to design are as important or even more important
Preparing Item Pools for CAT Transition
Ideally, the number of items in a CAT pool should be times the number of items to be administered in the CAT (rule of thumb based on M. Stocking from ETS)
The CAT pool must include a sufficient number of easy and difficult items; this is usually a big challenge
More items are needed if students test from the same CAT pool multiple times, especially if previously seen items are not eligible to be used in repeat administrations
Preparing Item Pools for CAT Transition
Items for a CAT item pool must be calibrated to the same IRT scale
Most states have pools of calibrated items with good psychometric properties that might be used for CAT
– These items have gone through extensive reviews
– These items may have been used operationally
– These items have been shown to have good psychometric characteristics
Preparing Item Pools for CAT Transition
However, there are often challenges in using these items with the old statistics, such as:
– The items were calibrated on paper but CAT is online
– The items were in tests measuring old standards and CAT will be measuring new standards
– Minor edits or format changes may be needed
– Items may have come from different places
How can we make use of these items in a new adaptive test?
CAT Transition Strategy: Fixed-form Transition
Two-year transition strategy
In year one, construct and administer a number (e.g., 6 to 10) of fixed forms (field-test items can be embedded) using previous (paper-based) statistics for test construction
Administer the fixed forms online
Re-calibrate the data from the fixed forms and link them to a common scale
Conduct standard setting on a subset of the items from the fixed forms (can be a “synthetic” form)
Apply the new cut to each fixed form for reporting
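The linking step can be sketched under strong simplifying assumptions. The presentation does not say how the forms are linked; the example below assumes Rasch calibrations and a set of anchor items shared between a base form and a new form, and uses a simple mean-shift on the anchors. The function names and item labels are hypothetical.

```python
from statistics import mean

def rasch_linking_constant(base, new):
    """Mean difference in difficulty over the items common to both forms.

    `base` and `new` map item IDs to Rasch difficulty estimates from
    separate calibrations (assumed inputs, not from the presentation).
    """
    common = sorted(set(base) & set(new))
    if not common:
        raise ValueError("forms share no anchor items")
    return mean(base[i] - new[i] for i in common)

def link_to_base(base, new):
    """Shift every difficulty on the new form onto the base form's scale."""
    shift = rasch_linking_constant(base, new)
    return {item: b + shift for item, b in new.items()}

# Hypothetical calibrations: items A and B appear on both forms.
form_1 = {"A": 0.0, "B": 1.0, "X": -0.4}
form_2 = {"A": -0.5, "B": 0.5, "C": 0.2}
print(link_to_base(form_1, form_2))
```

With several forms, the same shift can be chained or estimated jointly; models with discrimination parameters need a slope as well as a shift, which this sketch does not cover.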
CAT Transition Strategy: Fixed-form Transition
In year two, combine all the items from the online conventional fixed forms (plus additional field-tested items) to create the CAT pool
All items in the CAT pool will have item parameters on a common scale based on an online administration
Issues include:
– Deciding how many fixed forms to develop
– Making the fixed forms as parallel as possible
– Building effective equating links between forms
– Determining whether the fixed forms should count
– Making a smooth transition from fixed forms to CAT (since the measurement properties will be different)
CAT Transition Strategy: Barely Adaptive Tests (BAT)
Another strategy for transition to CAT is to use “Barely Adaptive Testing” (BAT)
In this approach, the CAT algorithm is used to administer items from the pool based on paper-based IRT calibrations
However, the CAT algorithm does not adapt the difficulty to student performance as strongly as it normally would
The result is that each student takes a unique test that is “slightly” targeted to them
Some examples help to clarify
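One way to picture weakened adaptation is a selection rule that aims only part of the way toward the student's current ability estimate. The presentation does not specify how BAT dampens the algorithm, so the blending weight below is purely an assumption for illustration, as are the function name and parameters.

```python
def pick_item(difficulties, used, theta_est, adapt_weight, center=0.0):
    """Pick the unused item whose difficulty is closest to a damped target.

    adapt_weight = 1.0 targets the ability estimate (fully adaptive);
    adapt_weight = 0.0 always targets `center` (not adaptive at all);
    values in between give a "barely adaptive" test. This blending rule
    is a hypothetical mechanism, not the one used in the presentation.
    """
    target = adapt_weight * theta_est + (1.0 - adapt_weight) * center
    candidates = [i for i in range(len(difficulties)) if i not in used]
    return min(candidates, key=lambda i: abs(difficulties[i] - target))

pool = [-2.0, -1.0, 0.0, 1.0, 2.0]   # illustrative Rasch difficulties
for w in (1.0, 0.5, 0.0):
    print(w, pick_item(pool, used=set(), theta_est=2.0, adapt_weight=w))
```

For a student estimated at 2.0, full adaptation picks the hardest item, a weight of 0.5 picks the item at 1.0, and a weight of 0.0 picks the item at the pool center, so responses stay spread across more of the pool.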
This slide shows how a conventional test would be administered to three students at different levels of ability
Conventional tests are better for calibrating items but not so good for targeting measurement
This slide shows how CAT would be administered to three students at different levels of ability
CAT is best for targeting measurement but not so good for estimating item statistics
No responses here for calibration
This slide shows how BAT would be administered to three students at different levels of ability
Why Does BAT Make Sense?
BAT is a compromise during a year of transition: it measures better than a conventional test and is better than CAT for calibrating items
BAT also permits the administration in the transition year to be very similar to the full CAT administration that will occur in year two and beyond (you can even call it CAT!)
CAT and Off-Grade-Level Testing
There are obvious psychometric benefits to including off-grade-level content in K-12 assessments, if supported by vertically articulated content standards
These benefits would seem particularly apparent for struggling students, including students with disabilities (SWDs)
– Item pools can be substantially improved for measuring struggling students accurately
– All students start at the same place (no “out of level” labeling)
CAT and Off-Grade-Level Testing
Some advocates of SWDs insist that CAT should consist only of on-grade-level content
The basis for this position seems to be a concern about the washback effect
A psychometrician’s plea: The important consideration is instruction, not assessment
– The goal of the common core standards is college readiness for all students
– The instructional imperative does not change based on what items are allowed in a CAT item pool
CAT and Off-Grade-Level Testing
Could off-grade-level content be included in accountability?
– Yes, if ESEA relaxes “on-grade-level” requirements
– Perhaps, if content standards span multiple grades
Some will say it “doesn’t matter” and that CAT works just fine with only on-grade-level content
But it does matter. If we really want to do better at measuring student status and growth and we want to take full advantage of adaptive testing for all students, we need to allow the adaptive test to extend above and below grade level
Questions?