Multiple Perspectives on CAT for K-12 Assessments: Possibilities and Realities
Alan Nicewander, Pacific Metrics

The following are some putative advantages of CAT relative to paper-based tests and to so-called linear tests delivered by computer. Some of these are listed below, with comments:
– CAT allows a pool of items to be used for on-demand testing of students and/or multiple testing of the same student and, at the same time, preserves the security of the item bank.
– CAT presents test items at a level appropriate for each student's ability.
– CAT also eliminates printed test booklets, which can be stolen or lost and thereby compromise test security.

CATs are significantly shorter than linear tests and have higher measurement efficiency. They are shorter because time is not wasted by:
– Presenting low-proficiency students with difficult items, which produce many incorrect responses from which little is learned about student proficiency.
– Presenting highly proficient students with items so easy that the extent of their knowledge is not revealed.
Even though CATs are shorter than linear tests, they can increase the reliability of measurement in the extremes of the proficiency distribution relative to linear tests of greater length.

Comments: Any type of on-demand or repeated testing carries the risk of item exposure. A crucial variable in that risk is the degree to which the test is high-stakes; the higher the stakes, the greater the pressure on the item pool. The prime example here is the CAT-GRE, which was abandoned partly because of item security issues.

To ensure reasonable levels of CAT security, two methods have been found to be most effective in simulations:
– Stochastic exposure control using the Sympson-Hetter method (or a similar method); a sketch of the idea appears below.
– Increasing the number of items in the pool.
These findings are from initial R&D done for the development of the CAT-ASVAB.
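The slides name the Sympson-Hetter method without describing it. As a rough illustration, the core idea is a probabilistic filter on item administration: the most informative candidate item is actually administered only with probability equal to its exposure-control parameter K_i, which is calibrated beforehand in simulation so that no item's administration rate exceeds a target ceiling. A minimal sketch, assuming hypothetical item IDs and precomputed K parameters (this simplifies the full procedure):

```python
import random

def sympson_hetter_select(candidates, k_params, rng=None):
    """Pick one item to administer from `candidates`, a list of item IDs
    sorted from most to least informative at the current theta estimate.

    k_params maps item ID -> exposure-control parameter in (0, 1],
    calibrated in simulation so observed exposure rates stay at or
    below a target ceiling such as 0.20.
    """
    rng = rng or random.Random()
    for item in candidates:
        # Administer the best remaining item only with probability K_i;
        # otherwise pass it over and try the next-most-informative item.
        if rng.random() <= k_params[item]:
            return item
    # Fallback: if every candidate was passed over, administer the top one.
    return candidates[0]

# Hypothetical usage: item "A" is most informative but tightly controlled.
k = {"A": 0.25, "B": 0.80, "C": 1.00}
chosen = sympson_hetter_select(["A", "B", "C"], k)
```

The K_i values themselves come from an iterative simulation: items that would otherwise be over-selected receive smaller K_i until the maximum exposure rate meets the target.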

Comments: It is true that CATs can be considerably shorter. For example, the CAT-ASVAB is about one-third shorter than the paper-based version (129 vs. 200 items), and its reliability coefficients run about 15% higher.
– However, the CAT-ASVAB imposes only moderate exposure control and very little content balancing on optimal item selection.
– Increasing the levels of exposure and content control can lead to longer tests and to BATs (barely adaptive tests).
– In general, increased exposure control and content balancing lead to longer tests with lower reliability.

Existing test forms can be used to produce item pools for CAT.

Comments: CAT item pool development can be a daunting task. As an illustration, suppose a current paper-based testing program is administered with three forms of a 50-item test. (Note that the item exposure rate for the current procedure is 1/3: each time a test is given, one third of the total collection of items is exposed.) If it is assumed that a CAT system can reduce test length to 35 items, how many items need to be developed to form the required pools? A general rule is to make the pool size five times the length of the CAT, which leads to 175 items per pool in this example.

Now further assume that students will be allowed to take the CAT three times during a year. How many item pools are needed to attain the same exposure rate as the 50-item paper-based test being replaced?
– Three pools will be needed to achieve the same theoretical exposure rate as the paper-based test. In addition, statistical exposure control (such as Sympson-Hetter) will be needed to counteract the fact that, within a pool, certain items are selected very frequently by a procedure that maximizes test information.

So we are left with these numbers for item-pool development: 3 pools of 175 items each = 525 items, and using the general rule that one must write twice as many items as will ultimately be needed, this means that 1,050 items must be written for this rather modest CAT project. Or, perhaps more realistically, (525 - 150) x 2 = 750 new items will have to be written if all 150 existing paper-based items (three 50-item forms) are reused in the item pools. (The arithmetic is collected in the short sketch below.)
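The numbers above follow mechanically from four planning inputs: CAT length, a pool-to-test ratio, the number of pools, and an item-writing attrition factor. A minimal sketch of the arithmetic, using the slide's figures as defaults (the five-to-one pool ratio and the factor-of-two writing rule are the rules of thumb stated above, not universal constants):

```python
def items_to_write(cat_length=35, pool_ratio=5, n_pools=3,
                   reusable_items=150, attrition_factor=2):
    """Reproduce the slide's item-development arithmetic.

    pool_ratio       -- rule of thumb: pool size = 5 x CAT length
    attrition_factor -- rule of thumb: write 2 items per surviving item
    reusable_items   -- existing paper-based items folded into the pools
    """
    pool_size = pool_ratio * cat_length                 # 5 * 35 = 175 per pool
    total_needed = n_pools * pool_size                  # 3 * 175 = 525 items
    new_needed = max(total_needed - reusable_items, 0)  # 525 - 150 = 375 new
    return attrition_factor * new_needed                # 2 * 375 = 750 to write

print(items_to_write())                  # 750 (reusing the 150 old items)
print(items_to_write(reusable_items=0))  # 1050 (writing everything new)
```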

The bottom line is that CAT:
– can provide tests at a level appropriate for a student's ability.
– can save testing time and increase test reliability.
– is unlikely to save money, because it can be a giant, item-eating machine.
– offers the possibility of greater protection of the items from compromise than would be possible with computer administration of a current paper-based test.

Evaluating a CAT Item Pool Using Optimal Adaptive Tests (OATs)
We are now going to construct some adaptive tests in an optimal way, both to illustrate some problems and to indicate an interesting possibility for implementing CAT. If one knew a person's standing on the latent trait, θ, it would be easy to choose a fixed number of items (from some item pool) that maximizes the test information. We call such a test an "optimal adaptive test" (OAT), in that no other test of the same length drawn from this item pool could exceed its measurement accuracy. (A sketch of this construction appears below.)
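The slides do not spell out the construction. Under the 3PL model implied by the a-, b-, and c-values on the next slide, an OAT at a known θ is simply the k items with the largest Fisher information at that θ. A minimal sketch with hypothetical item parameters; the scaling constant D = 1.7 is a common convention, assumed here rather than stated in the presentation:

```python
import math

def p3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def info3pl(theta, a, b, c, D=1.7):
    """Fisher information of a 3PL item at theta (Birnbaum's formula)."""
    p = p3pl(theta, a, b, c, D)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def build_oat(pool, theta, length=15):
    """Return the `length` items with maximum information at theta.

    pool: list of (item_id, a, b, c) tuples.
    """
    ranked = sorted(pool, key=lambda it: info3pl(theta, *it[1:]), reverse=True)
    return ranked[:length]

# Hypothetical three-item pool, just to show the call:
pool = [("i1", 1.6, -0.5, 0.15), ("i2", 1.2, 0.8, 0.20), ("i3", 2.0, 0.0, 0.10)]
oat = build_oat(pool, theta=0.0, length=2)
```

Summing info3pl over the selected items gives the test information referred to on the later slides.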

The use of OATs for evaluating an item pool is now illustrated with an operational item pool for mathematics.
– The pool contains 84 items and is used to construct 15-item adaptive tests at various values of the latent trait.
– The items in the pool have an average a-value of 1.61 (S.D. = 0.51), an average b-value of -0.06 (S.D. = 1.10), and an average c-value of 0.15 (S.D. = 0.07).
– For its intended purpose, this is an excellent item bank.
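To see roughly how much item overlap such a pool produces, one can simulate a pool matching these summary statistics and build OATs across a θ grid. The sketch below reuses build_oat and info3pl from the earlier example; the randomly drawn parameters only match the stated means and standard deviations, so the overlap counts are illustrative, not the presentation's actual results:

```python
import random

rng = random.Random(0)

# Hypothetical pool matching the slide's summary statistics (84 items;
# a: mean 1.61, s.d. 0.51; b: mean -0.06, s.d. 1.10; c: mean 0.15, s.d. 0.07).
# Real items would come from calibration, not random draws.
sim_pool = [(f"item{i}",
             max(0.3, rng.gauss(1.61, 0.51)),              # a, clipped positive
             rng.gauss(-0.06, 1.10),                       # b
             min(max(rng.gauss(0.15, 0.07), 0.0), 0.4))    # c, clipped to [0, 0.4]
            for i in range(84)]

# Build 15-item OATs on the theta grid and count neighbor overlap.
grid = [g / 2.0 for g in range(-6, 7)]                     # -3.0 to 3.0 by 0.5
oats = {t: {item[0] for item in build_oat(sim_pool, t)} for t in grid}
for t1, t2 in zip(grid, grid[1:]):
    shared = len(oats[t1] & oats[t2])
    print(f"theta {t1:+.1f} vs {t2:+.1f}: {shared}/15 items shared")
```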

Using a grid of θ values from -3 to 3 at intervals of 0.5, 13 OATs were constructed from the 84-item bank. To illustrate the item overlap in this collection of OATs, three of them were designated as focal OATs.
– The focal OATs were those at θ = -1.5, 0, and 1.5.
– One might think of these as the optimal tests for three cut scores.
– The next three slides (one per focal OAT) show the overlap with neighboring OATs.
– The accuracy of the OATs is indicated with information functions and reliability coefficients; the conversion from test information to reliability is sketched below.
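The slides report reliabilities alongside test information but do not show the link between them. A standard conversion, assuming the trait is scaled so that θ has unit variance in the population, treats 1/I(θ) as the conditional error variance:

```latex
% Conditional reliability at theta, assuming Var(theta) = 1 in the
% population and error variance equal to 1 / I(theta):
\rho(\theta) \;=\; \frac{\sigma^2_\theta}{\sigma^2_\theta + \sigma^2_e(\theta)}
             \;=\; \frac{1}{1 + 1/I(\theta)}
             \;=\; \frac{I(\theta)}{I(\theta) + 1}
```

For example, a test information of 10 at a given θ corresponds to a conditional reliability of about 10/11, or roughly 0.91.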

OAT at θ = -1.5 and Overlap with Neighboring OATs
[Table: items and item information for OAT(-1.5) and its neighboring OATs across θ, with total test information and reliability; the numeric entries are not recoverable from this transcript.]

OAT for θ = 0 and Neighboring OATs
[Table: items and item information for OAT(0) and its neighboring OATs across θ, with total test information and reliability; the numeric entries are not recoverable from this transcript.]

OAT for θ = 1.5 and Neighboring OATs
[Table: items and item information for OAT(1.5) and its neighboring OATs across θ, with total test information and reliability; the numeric entries are not recoverable from this transcript.]

Focal OATs Derived Using the Rasch Model
[Slide content not recoverable from this transcript.]

Conclusions: The previous slides indicate that there will be considerable overlap among the CATs constructed from this item bank, in spite of the considerable variability in the difficulty of the items.
– Hence, many of the items will be "over-exposed" and subject to compromise. In actual use of this item bank, item exposure is controlled with the Sympson-Hetter method.

The previous slides also indicate that the three focal OATs, optimal for θ = -1.5, 0, and 1.5, do a rather remarkable job of providing accurate measurement across the θ continuum, even though they contain only 15 items each. OATs (and, by implication, CATs in general) will differ depending on the IRT model used in development and implementation.

This also suggests that a two-stage CAT procedure would work quite well with this item bank.
– In a two-stage CAT, an initial Stage 1 test is administered, either in CAT mode or as a fixed test of medium difficulty.
– Scores on the Stage 1 test are used to assign examinees to one of several Stage 2 tests that vary in overall difficulty from easy to difficult, for example one of the three focal OATs described above. (A routing sketch follows this list.)
– In this case (and perhaps in most cases), a pure CAT, where items are selected "on the fly," does not seem to have any advantage over the pre-selected, optimal Stage 2 tests.
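The slides describe the routing logic but not its mechanics. A minimal sketch follows, routing on a Stage 1 θ estimate; the cut values of -0.75 and 0.75, midway between the focal θ values, are an assumption for illustration, not taken from the presentation:

```python
def route_to_stage2(stage1_theta_estimate, cuts=(-0.75, 0.75)):
    """Assign an examinee to a Stage 2 form from the Stage 1 score.

    The three forms correspond to the focal OATs at theta = -1.5, 0, 1.5.
    Cut scores are hypothetical: midpoints between the focal theta values.
    """
    lo, hi = cuts
    if stage1_theta_estimate < lo:
        return "OAT(-1.5)"   # easy Stage 2 form
    elif stage1_theta_estimate <= hi:
        return "OAT(0)"      # medium Stage 2 form
    else:
        return "OAT(1.5)"    # difficult Stage 2 form

print(route_to_stage2(-1.1))  # -> OAT(-1.5)
print(route_to_stage2(0.3))   # -> OAT(0)
```

Because the Stage 2 forms are fixed in advance, they can be reviewed for content balance and exposure before administration, which is part of the appeal the slide ascribes to this design.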