Automated Scoring is a Policy and Psychometric Decision
Christina Schneider, The National Center for the Improvement of Educational Assessment


Automated Scoring is Complex

For the same models to work in different administrations:
– the general ability levels of examinees must be constant,
– the features of submissions must be constant, and
– the human rating standards must be constant (Trapani, Bridgeman, & Breyer, 2011).

Which population should be used to train engines: field test or operational?
Should models that work for a consortium on average be applied to individual states? To subgroups within states?
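One way to probe these constancy assumptions empirically is to compare engine feature distributions between the training population and each operational population. A minimal sketch, assuming features are available as numeric arrays; the function name and the 0.2 SD cutoff are illustrative, not prescribed by the presentation:

```python
import numpy as np

def feature_drift(train_features, op_features):
    """Compare engine feature distributions between the training
    (e.g., field-test) population and an operational population.

    Both arguments are dicts mapping feature name -> 1-D numpy array.
    Returns the standardized mean difference (Cohen's d) per feature;
    large values suggest the model's training assumptions may not
    hold for the new population."""
    drift = {}
    for name in train_features:
        t, o = train_features[name], op_features[name]
        pooled_sd = np.sqrt((t.var(ddof=1) + o.var(ddof=1)) / 2)
        drift[name] = (o.mean() - t.mean()) / pooled_sd
    return drift

# Illustrative use: list features that shifted by more than 0.2 SD.
# drift = feature_drift(field_test, operational)
# shifted = {k: v for k, v in drift.items() if abs(v) > 0.2}
```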

Understanding System Flags is Essential to Understanding a System

Always ask how and where in the process a system flags responses, both for quality control purposes and for scoring purposes (e.g., gaming or unscorable responses).
Flags are raised by examining different combinations of features for outliers:
– Sophisticated words with good organization and content, but many grammar and spelling errors, could indicate a child with dyslexia. The response is routed to a human. Is the system set up for administrators to pre-flag students for human scoring?
– Sophisticated words, good organization, overly long development, and grammar and other issues could indicate a gaming attempt. The response is routed to a human.
– Redundant word choice could indicate a gaming attempt. The response is routed to a human.
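Flagging logic of this kind amounts to a set of rules over feature combinations rather than single features. A toy sketch of such routing; every feature name and threshold below is hypothetical:

```python
def route_response(features):
    """Toy routing rules showing how combinations of engine features,
    not single features, can trigger human review. The feature names
    and cutoffs are hypothetical placeholders."""
    flags = []
    # Sophisticated vocabulary and organization alongside many
    # mechanics errors: an atypical profile (e.g., a student with
    # dyslexia) that a trained human should score.
    if (features["vocab_level"] > 0.8 and features["organization"] > 0.8
            and features["mechanics_errors"] > 20):
        flags.append("atypical_profile")
    # Overly long development relative to the prompt can indicate
    # gaming (e.g., padding with memorized or off-topic text).
    if features["word_count"] > 2 * features["prompt_expected_length"]:
        flags.append("possible_gaming_length")
    # Heavy repetition of the same words can also signal gaming.
    if features["type_token_ratio"] < 0.2:
        flags.append("possible_gaming_redundancy")
    return ("human", flags) if flags else ("engine", flags)
```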

Formative Systems

Flagging rules for gaming are a quality control method applied after model building has occurred.
There is tension between policy and psychometric needs on the formative side: teachers are not always happy to score many papers by hand, yet students will often use a formative system to practice gaming.
For young students, errors in spelling can trigger large numbers of flags.
Teacher training on how to use the system wisely is essential.
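For example, a spelling-based flagging rule that ignores grade level will over-flag young writers; one remedy is a grade-adjusted cutoff. A minimal sketch, with illustrative (not empirical) thresholds:

```python
# Hypothetical grade-adjusted cutoffs: younger writers misspell more,
# so a single fixed cutoff would over-flag them.
SPELLING_FLAG_RATE = {3: 0.15, 4: 0.12, 5: 0.10, 6: 0.08}

def spelling_flag(grade, n_misspelled, n_words):
    """Flag only when the misspelling rate exceeds the grade-level
    expectation; the cutoffs above are placeholders, not norms."""
    cutoff = SPELLING_FLAG_RATE.get(grade, 0.08)
    return (n_misspelled / n_words) > cutoff
```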

Hybrid: Either Engine or Human

Understanding the conditions under which scoring a paper with an automated system is not optimal is important to establishing and providing validity evidence for the scoring process, yet this is not a component of many technical reports.
Investigate which students (i.e., which demographic characteristics) are flagged, and compare the demographic percentages of flagged students to those of the population. This is a good study for states interested in moving to all automated scoring in the future.
The comparability focus is the accuracy of the score, not whether a human or an engine is scoring; this requires good communication with stakeholders.
The best scoring was obtained when 20% of responses were routed to humans, so it may be best to plan for up to 20% human scoring when automated scoring is planned as Reader 1.
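The demographic check described above can be run as a simple rate comparison. A sketch assuming a response-level data frame; the column names are illustrative:

```python
import pandas as pd

def flag_rates_by_group(df, group_col="subgroup", flag_col="flagged"):
    """Compare the flag rate within each demographic subgroup to the
    overall flag rate. Assumes one row per response with a boolean
    `flagged` column; a ratio well above 1.0 for a subgroup warrants
    investigation."""
    overall = df[flag_col].mean()
    by_group = df.groupby(group_col)[flag_col].mean()
    return pd.DataFrame({
        "flag_rate": by_group,
        "overall_rate": overall,
        "ratio": by_group / overall,
    })
```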

Using a Weighted Approach

A weighted hybrid approach means using both the automated score and the human score, with one counting more than the other depending on the task.
This is a promising approach for improving the accuracy of automated scoring of writing based on the writing genre.
Results to date were for an ELL population; the approach needs to be studied with Common Core items.
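In its simplest form, the weighted score is a convex combination of the two scores with a task- or genre-specific weight. A sketch; the weight values below are hypothetical placeholders, not results reported in the presentation:

```python
# Hypothetical genre-specific weights on the engine score: the idea
# is that engines track some genres more closely than others.
ENGINE_WEIGHT = {"expository": 0.7, "persuasive": 0.6, "narrative": 0.4}

def weighted_hybrid_score(engine_score, human_score, genre):
    """Blend the automated and human scores, weighting the engine
    more heavily for genres where it has shown higher accuracy."""
    w = ENGINE_WEIGHT.get(genre, 0.5)
    return w * engine_score + (1 - w) * human_score
```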

Hybrid Scoring

Hybrid scoring is likely the best practice of the future.
– It keeps the cost savings from using engines while still including human readers.
– Begin with a program of research that can improve human scoring as well as automated scoring. Look at work on how human scorers interact with rubrics (Leacock, 2013, 2014) and how this influences engines.
– Early indications show a hybrid approach improved reliability above that of two humans (Kieftenbeld & Barrett, 2014) for AS-scorable prompts.
– Not all prompts can be scored automatically; we need to investigate why. Engine functioning is often related to the size of the training and validation sets.
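The reliability comparison can be checked with standard agreement statistics such as quadratic-weighted kappa. A sketch using scikit-learn; the variable names are illustrative:

```python
from sklearn.metrics import cohen_kappa_score

def compare_reliability(human1, human2, engine):
    """Quadratic-weighted kappa for human-human vs. engine-human
    agreement on the same responses. A hybrid design can be judged
    by whether its scores agree with a backread human at least as
    well as two humans agree with each other."""
    return {
        "human_human": cohen_kappa_score(human1, human2,
                                         weights="quadratic"),
        "engine_human": cohen_kappa_score(engine, human1,
                                          weights="quadratic"),
    }
```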