Some Perspectives on CAT for K-12 Assessments. Denny Way, Ph.D. Presented at the 2010 National Conference on Student Assessment, June 20, 2010. Copyright 2010 Pearson.


Some CAT Questions
You want to implement CAT, but you wonder which IRT model you should use.
You want to implement CAT, but you wonder how to put together a CAT pool and how best to implement the CAT.
You want to implement CAT, and you wonder whether it has to be limited to on-grade items only.

Which Model to Use for CAT?
The Rasch and three-parameter logistic (3PL) models are the most popular for IRT applications with multiple-choice items.
In applications to conventional fixed-form tests, the differences between the two models are not that great; that is, when you do parallel-forms equating, you get about the same answer from either model.
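For readers less familiar with the two models, here is a minimal sketch of the item response functions being compared. The parameter names (theta for ability, b for difficulty, a for discrimination, c for the lower asymptote) follow common IRT notation; the numeric values are invented purely for illustration and do not come from the slides.

```python
import math

def rasch_prob(theta, b):
    """Rasch model: probability of a correct response for ability theta and
    item difficulty b; every item shares the same discrimination."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def three_pl_prob(theta, a, b, c):
    """Three-parameter logistic (3PL) model: adds item discrimination a and a
    lower asymptote (pseudo-guessing parameter) c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Invented values: an average-ability examinee on a middling item.
print(round(rasch_prob(0.0, b=0.2), 2))                    # 0.45
print(round(three_pl_prob(0.0, a=1.4, b=0.2, c=0.2), 2))   # 0.54
```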

Which Model to Use for CAT?
With CAT, there are much greater differences between the Rasch and 3PL models. For example:
– Rasch CAT only supports a reduction in test length of about 20% compared to a conventional test
– 3PL CAT supports a reduction in test length of about ___% compared to a conventional test
Why?
– With the Rasch model, the information functions for all items have the same shape, and the information for an "optimally administered" item is not that much greater than for a typical item
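The "why" bullet can be made concrete with item information functions. The sketch below uses the standard formulas (Rasch information is P(1-P), which never exceeds 0.25; the 3PL expression follows Lord's formula); the item parameters are hypothetical, chosen only to show how a highly discriminating 3PL item can exceed the Rasch ceiling.

```python
import math

def rasch_info(theta, b):
    """Rasch item information P*(1-P); it peaks at 0.25 when theta == b, so no
    item can be much more informative than any other well-targeted item."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def three_pl_info(theta, a, b, c):
    """3PL item information: a^2 * (q/p) * ((p - c) / (1 - c))^2. Items with
    large a can be far more informative when administered near theta == b."""
    p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
    q = 1.0 - p
    return a * a * (q / p) * ((p - c) / (1.0 - c)) ** 2

# Hypothetical items, both centered on the examinee's ability (theta = 0).
print(round(rasch_info(0.0, b=0.0), 2))                    # 0.25, the Rasch ceiling
print(round(three_pl_info(0.0, a=2.0, b=0.0, c=0.2), 2))   # 0.67, well above it
```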

Reduced Test Length for an Optimal Rasch CAT (figure)

Reduced Test Length for an Optimal Rasch CAT (figure annotation: most conventional tests are about here for most students)

Which Model to Use for CAT?
With CAT, there are much greater differences between the Rasch and 3PL models. For example:
– 3PL CAT tends to select some items in the pool very often and may never select many perfectly good items
– Rasch CAT selects items in a much more uniform manner
Why?
– With the 3PL, some highly discriminating items provide much more information than other items and are therefore more attractive to the item selection algorithm
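One way to see why this happens is a toy simulation of greedy maximum-information selection from a 3PL pool. This is not Pearson's or any operational selection algorithm; the pool, the number of examinees, and the decision to hold each examinee's ability fixed (rather than re-estimating it after each item) are all simplifications for illustration.

```python
import math
import random
from collections import Counter

def info_3pl(theta, a, b, c):
    """3PL item information (same formula as in the earlier sketch)."""
    p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
    return a * a * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

random.seed(0)
# Hypothetical 200-item 3PL pool; some items end up highly discriminating.
pool = [{"a": random.uniform(0.5, 2.5), "b": random.uniform(-2.0, 2.0), "c": 0.2}
        for _ in range(200)]

exposure = Counter()
n_examinees, test_length = 1000, 10
for _ in range(n_examinees):
    theta = random.gauss(0.0, 1.0)   # true ability, held fixed for simplicity
    administered = set()
    for _ in range(test_length):
        # Greedy maximum-information selection with no exposure control.
        best = max((i for i in range(len(pool)) if i not in administered),
                   key=lambda i: info_3pl(theta, **pool[i]))
        administered.add(best)
        exposure[best] += 1

rates = [exposure[i] / n_examinees for i in range(len(pool))]
print("items never used:", sum(r == 0 for r in rates))
print("items used more than 25% of the time:", sum(r > 0.25 for r in rates))
```

A greedy rule like this tends to concentrate administrations on the most discriminating items near each ability level, which is the skewed usage pattern the exposure example below describes.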

Rasch vs. 3PL Exposure

Rasch vs. 3PL Exposure – An Example
50% of 3PL items were used 5% of the time or less
10% of 3PL items were used more than 25% of the time

Which Model to Use for CAT?
Both the Rasch and 3PL models have been used successfully in CAT applications.
Psychometricians will offer different opinions about which model is best for CAT.
Either model is defensible for CAT, but the models do behave quite differently.
The IRT model is just one consideration related to CAT; other considerations related to design are as important or even more important.

Preparing Item Pools for CAT Transition
Ideally, the number of items in a CAT pool should be ___ times the number of items to be administered in the CAT (a rule of thumb attributed to M. Stocking of ETS).
The CAT pool must include a sufficient number of easy and difficult items; this is usually a big challenge.
More items are needed if students test from the same CAT pool multiple times, especially if previously seen items are not eligible to be used in repeat administrations.
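As a back-of-the-envelope illustration of the pool-size arithmetic, the sketch below uses a placeholder pool-to-test-length ratio of 10; that ratio is an assumption for the example only, not the Stocking figure omitted from the slide.

```python
# All numbers below are hypothetical placeholders for illustration.
test_length = 40        # items administered to each student by the CAT
pool_ratio = 10         # assumed ratio of pool size to CAT test length
administrations = 3     # times a student tests from the same pool in a year

base_pool = pool_ratio * test_length                   # 400 items
# If previously seen items are ineligible on retest, each additional
# administration needs at least another test's worth of usable items.
retest_buffer = (administrations - 1) * test_length    # 80 more items
print(base_pool, base_pool + retest_buffer)            # 400 480
```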

Preparing Item Pools for CAT Transition
Items for a CAT item pool must be calibrated to the same IRT scale.
Most states have pools of calibrated items with good psychometric properties that might be used for CAT:
– These items have gone through extensive reviews
– These items may have been used operationally
– These items have been shown to have good psychometric characteristics

Preparing Item Pools for CAT Transition
However, there are often challenges in using these items with their old statistics, such as:
– The items were calibrated on paper, but the CAT is online
– The items were in tests measuring old standards, and the CAT will measure new standards
– Minor edits or format changes may be needed
– The items may have come from different places
How can we make use of these items in a new adaptive test?

CAT Transition Strategy: Fixed-Form Transition
A two-year transition strategy.
In year one, construct and administer a number (e.g., 6 to 10) of fixed forms (field-test items can be embedded), using previous (paper-based) statistics for test construction.
Administer the fixed forms online.
Re-calibrate the data from the fixed forms and link them to a common scale.
Conduct standard setting on a subset of the items from the fixed forms (this can be a "synthetic" form).
Apply the new cut to each fixed form for reporting.
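The slides do not say how the re-calibrated forms would be linked; one common approach is a mean/sigma transformation based on anchor items shared across forms. The sketch below illustrates that idea with invented difficulty values; it is an assumption about method, not a description of the strategy the presenter used.

```python
import statistics

def mean_sigma_link(base_b, new_b):
    """Mean/sigma linking from anchor-item difficulties: returns (A, B) such
    that a difficulty on the new form's scale maps to A * b + B on the base
    scale. One standard linking method among several."""
    A = statistics.stdev(base_b) / statistics.stdev(new_b)
    B = statistics.mean(base_b) - A * statistics.mean(new_b)
    return A, B

# Invented anchor-item difficulties shared by two of the online fixed forms.
base_form_b = [-1.2, -0.4, 0.1, 0.8, 1.5]   # already on the reporting scale
new_form_b = [-1.0, -0.3, 0.3, 0.9, 1.7]    # freshly calibrated form
A, B = mean_sigma_link(base_form_b, new_form_b)
linked_b = [A * b + B for b in new_form_b]  # new form's items on the base scale
```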

CAT Transition Strategy: Fixed-Form Transition
In year two, combine all the items from the online conventional fixed forms (plus additional field-tested items) to create the CAT pool.
All items in the CAT pool will have item parameters on a common scale based on an online administration.
Issues include:
– Deciding how many fixed forms to develop
– Making the fixed forms as parallel as possible
– Building effective equating links between forms
– Determining whether the fixed forms should count
– Making a smooth transition from fixed forms to CAT (since the measurement properties will be different)

CAT Transition Strategy: Barely Adaptive Tests (BAT)
Another strategy for transition to CAT is to use "Barely Adaptive Testing" (BAT).
In this approach, the CAT algorithm is used to administer items from the pool based on paper-based IRT calibrations.
However, the CAT algorithm does not adapt the difficulty to student performance as strongly as it normally would.
The result is that each student takes a unique test that is "slightly" targeted to them.
Some examples help to clarify.
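The slides do not describe how the "barely adaptive" damping would be implemented. As one hypothetical illustration only, the selection rule below targets a point partway between the grade-level mean and the student's provisional ability estimate, so the test adapts, but only slightly; the function names and the damping parameter are inventions for this sketch.

```python
def pick_closest_item(pool_b, target, administered):
    """Return the index of the unused item whose difficulty is closest to target."""
    candidates = [i for i in range(len(pool_b)) if i not in administered]
    return min(candidates, key=lambda i: abs(pool_b[i] - target))

def next_bat_item(pool_b, theta_hat, administered, damping=0.25, grade_mean=0.0):
    """Hypothetical BAT-style selection. With damping = 1.0 this targets the
    student's current ability estimate (full CAT); with damping = 0.0 it
    ignores the student entirely (conventional test); small values in between
    give a test that is only 'slightly' targeted to the student."""
    target = grade_mean + damping * (theta_hat - grade_mean)
    return pick_closest_item(pool_b, target, administered)
```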

This slide shows how a conventional test would be administered to three students at different levels of ability.

Conventional tests are better for calibrating items but not so good for targeting measurement.

This slide shows how CAT would be administered to three students at different levels of ability.

CAT is best for targeting measurement but not so good for estimating item statistics. (Figure annotation: no responses here for calibration.)

This slide shows how BAT would be administered to three students at different levels of ability.

Why Does BAT Make Sense?
BAT is a compromise during a year of transition: it does better measurement than a conventional test and is better than CAT for calibrating items.
BAT also permits the administration in the transition year to be very similar to the full CAT administration that will occur in year two and beyond (you can even call it CAT!).

CAT and Off-Grade-Level Testing
There are obvious psychometric benefits to including off-grade-level content in K-12 assessments, if supported by vertically articulated content standards.
These benefits would seem particularly apparent for struggling students, including students with disabilities (SWDs):
– Item pools can be substantially improved for measuring struggling students accurately
– All students start at the same place (no "out of level" labeling)

CAT and Off-Grade-Level Testing
Some advocates for SWDs insist that CAT should consist only of on-grade-level content.
The basis for this position seems to be a concern about washback effects.
A psychometrician's plea: the important consideration is instruction, not assessment.
– The goal of the common core standards is college readiness for all students
– The instructional imperative does not change based on what items are allowed in a CAT item pool

CAT and Off-Grade-Level Testing
Could off-grade-level content be included in accountability?
– Yes, if ESEA relaxes "on-grade-level" requirements
– Perhaps, if content standards span multiple grades
Some will say it "doesn't matter" and that CAT works just fine with only on-grade-level content.
But it does matter. If we really want to do better at measuring student status and growth, and we want to take full advantage of adaptive testing for all students, we need to allow the adaptive test to extend above and below grade level.

Questions?