
Issues in Comparability of Test Scores Across States
Liru Zhang, Delaware DOE
Shudong Wang, NWEA
Presented at the 2014 CCSSO NCSA, New Orleans, LA, June 25–27, 2014

Performance-Based Assessments
• The educational reform movements of the past decades have promoted the use of performance tasks in K–12 assessments, which mirror classroom instruction and provide authentic information about what students know and are able to do.
• The next-generation assessments based on the Common Core State Standards are designed to address the important 21st-century competencies of mastering and applying core academic content and the cognitive strategies related to complex thinking, communication, and problem solving.

Advantages and Limitations
• Advantages of performance-based assessments
  - Provide direct measures of both process and product
  - Promote higher-level thinking and problem-solving skills
  - Generate more information about what students know and can do
  - Motivate student learning and improve classroom instruction
• Limitations of performance-based assessments
  - Low reliability and inconsistency across occasions over time
  - Limited generalizability across performance tasks
  - Subjectivity and error in scoring

Rater Errors in Scoring
• One of the essential features of performance-based assessments is that they fundamentally depend on the quality of professional judgment from raters (Engelhard, 1994). Raters are human and are therefore subject to rating errors (Guilford, 1936).
  - Severity or leniency is the tendency of raters to consistently score lower or higher than warranted by student performance.
  - Halo effect is the tendency of raters to fail to distinguish between aspects or dimensions of a task that are conceptually distinct.
  - Central tendency is the tendency for ratings to cluster around the midpoint of the rating scale.
  - Restriction of score range occurs when ratings fail to discriminate among levels of student performance.
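These error types leave simple statistical footprints. The sketch below is illustrative only; the data, column names, and traits are hypothetical and not taken from the presentation. It shows how per-rater descriptive checks might surface them: a shifted mean suggests severity or leniency, a compressed spread suggests central tendency or restriction of range, and near-perfect correlations between conceptually distinct traits suggest a halo effect.

import pandas as pd

# Hypothetical long-format ratings: one row per (rater, student, trait) score on a 0-5 rubric.
ratings = pd.DataFrame({
    "rater":   ["R1", "R1", "R1", "R1", "R1", "R1", "R2", "R2", "R2", "R2", "R2", "R2"],
    "student": ["S1", "S1", "S2", "S2", "S3", "S3", "S1", "S1", "S2", "S2", "S3", "S3"],
    "trait":   ["ideas", "conventions"] * 6,
    "score":   [3, 3, 4, 4, 2, 2, 2, 3, 3, 4, 1, 2],
})

overall_mean = ratings["score"].mean()

for rater, group in ratings.groupby("rater"):
    # Severity/leniency: distance of this rater's mean from the overall mean.
    severity = group["score"].mean() - overall_mean
    # Central tendency / restriction of range: unusually small score spread.
    spread = group["score"].std(ddof=0)
    # Halo: correlation between conceptually distinct traits for this rater.
    wide = group.pivot(index="student", columns="trait", values="score")
    halo = wide["ideas"].corr(wide["conventions"])
    print(f"{rater}: severity={severity:+.2f}, spread={spread:.2f}, trait_corr={halo:.2f}")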

Rater Effects and Rater Drift
• A variety of factors that can affect raters' scoring behavior have been identified, such as rater characteristics, quality of training and preparation, time pressure, productivity, and scoring conditions.
• Operationally, rater effects are far from uniform; they vary from task to task, within a scoring event, across occasions, and over time.
• Rater effects may cause item parameters to drift from their original values, which not only biases estimates of student ability but also threatens the validity of the underlying test construct. In an extreme case, construct shift could cause scale drift, particularly for a vertical scale.

Comparability of Test Scores
• In practice, rater effects or rater drift can become more serious when there is variation in the scoring process from state to state.
• The disparities across states may involve rater quality (e.g., source of qualified raters, recruiting, screening, training), scoring method (e.g., human, automated, or a combination), scoring design (e.g., number of raters, scoring rules, monitoring system), and scoring conditions (e.g., on-site or distributed, timeline, workload). These disparities may cause rater effects across states.
• Statistically, shifts in student motivation, changes in the mode of item assignment, and the lack of post-adjustment may contribute to item-parameter drift from the field-test results.
• These sources of construct-irrelevant variability create challenges to the comparability of test scores across states.

Validity of Scoring
• In validating test scores based on student responses, it is important to document that the raters' cognitive processes are consistent with the intended construct being measured (Standards, 1999).
• In principle, constant monitoring of the scoring process is necessary whether human or automated scoring is applied, since automated systems are often "trained" on human scores (McClellan, 2010; Bejar, 2012).
• In an effort to improve performance assessments, psychometric models and statistical methods have been proposed for analyzing empirical data to detect rater effects and monitor the scoring process.

Purpose of the Presentation (1 of 2)
• With the implementation of the next generation assessments, the challenge that each consortium will encounter is how to monitor a scoring process that is operated separately by states, prevent potential rater effects and rater drift across occasions, and retain the comparability of test scores across states, particularly for performance tasks.
• Are the commonly used procedures (e.g., second rater and read-behind) and criteria for evaluating rating quality (e.g., inter-rater agreement, agreement with validity papers) sufficient for the challenge?
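As a point of reference for the rating-quality criteria mentioned above, the sketch below shows how inter-rater agreement statistics are commonly computed; the scores are hypothetical, and the 0–5 scale and adjacency threshold are assumptions rather than details from the presentation.

import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical double-scored responses: the same ten papers scored by two raters on a 0-5 rubric.
rater1 = np.array([3, 4, 2, 5, 1, 3, 4, 0, 2, 3])
rater2 = np.array([3, 3, 2, 5, 2, 3, 4, 1, 2, 4])

# Exact and adjacent agreement rates, as commonly reported in hand-scoring programs.
exact = np.mean(rater1 == rater2)
adjacent = np.mean(np.abs(rater1 - rater2) <= 1)

# Quadratic weighted kappa penalizes larger disagreements more heavily.
qwk = cohen_kappa_score(rater1, rater2, weights="quadratic")

print(f"exact agreement:    {exact:.2f}")
print(f"adjacent agreement: {adjacent:.2f}")
print(f"quadratic weighted kappa: {qwk:.2f}")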

Purpose of the Presentation (2 of 2)
• How can we monitor the scoring process in practice?
• How can we detect the impact of rater drift on student test scores across states?
• Issues discussed in this presentation are in the context of score comparability.

Monitor the Scoring Process (1 of 2)
• With advances in technology, automated scoring, with its obvious potential (e.g., efficiency, consistency, and robustness to internal and external influences), provides a new tool for scoring and a supplement to human scoring.
• To capitalize on the strengths and narrow the limitations of both human scoring and automated engines, procedures that combine the two have been investigated as a mechanism for monitoring the process of scoring performance tasks.

Monitor the Scoring Process (2 of 2)
• The most extensively used procedure in recent years is to monitor human raters with an automated scoring engine, as reported by Pacific Metrics (Lottridge, Schulz, & Mitzel, 2012), Pearson (Shin, Wolfe, Wilson, & Foltz, 2014; Kieftenbeld & Barrentt, 2014), and ETS (Yoon, Chen, & Zechner, 2014).
• This procedure could also be adapted to monitor automated scoring with expert raters (random read-behinds) and to support a combination or weighted combination of human and automated scoring.
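A minimal sketch of what such monitoring might look like in practice, assuming each response carries both a human and an engine score; the data, the discrepancy threshold, and the drift summary are illustrative assumptions rather than procedures described in the presentation.

import pandas as pd

# Hypothetical scores: each response scored by a human rater and by an automated engine.
scores = pd.DataFrame({
    "rater":  ["R1", "R1", "R1", "R2", "R2", "R2"],
    "human":  [3, 4, 2, 5, 1, 4],
    "engine": [3, 3, 2, 3, 1, 2],
})
scores["discrepancy"] = (scores["human"] - scores["engine"]).abs()

# Route non-adjacent human/engine scores for an expert read-behind (threshold is illustrative).
flagged = scores[scores["discrepancy"] > 1]

# Summarize drift by rater: a rater whose scores run consistently above or below the
# engine over a scoring window may be becoming lenient or severe.
drift = scores.groupby("rater")[["human", "engine"]].mean()
drift["mean_difference"] = drift["human"] - drift["engine"]

print(flagged)
print(drift)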

Considerations in Monitoring Process (1 of 3)
• Prior to operational scoring, much preparation must take place. For instance, the design of the monitoring process, including controls for rater effects, is critical not only for the quality of ratings but also for achieving comparability of scores.
1. Benchmark papers must be selected from the expert-scored responses from the field test that represent the performance of the target population across the score scale, with at least 250–300 responses per score point. The same benchmark papers are used for training the automated engine in the monitoring process or for the actual scoring.
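A small sketch of one way such a benchmark set might be drawn, assuming an expert-scored field-test pool on a 0–5 scale; the pool, column names, and random seed are hypothetical, and only the 250-per-score-point floor comes from the slide.

import pandas as pd

# Hypothetical pool of expert-scored field-test responses on a 0-5 scale.
pool = pd.DataFrame({
    "response_id": range(5000),
    "expert_score": [i % 6 for i in range(5000)],
})

# Stratified draw: the same number of benchmark papers at every score point.
PER_SCORE_POINT = 250
benchmark = pool.groupby("expert_score").sample(n=PER_SCORE_POINT, random_state=1)

print(benchmark["expert_score"].value_counts().sort_index())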

Considerations in Monitoring Process (2 of 3)
2. Rater training must achieve additional goals beyond regular rater training. One is to help raters build a shared "mental construct" of student proficiency across states for a given subject, grade, and topic(s). During the scoring process, frequent retraining should be offered based on issues uncovered by the constant monitoring process.
3. Standardized scoring conditions are essential for the comparability of test scores. Such conditions can be grouped into four categories:

Considerations in Monitoring Process (3 of 3)
• Rater (e.g., criteria for selection, qualification, and training) and scoring method (e.g., using human scoring with automated scoring to monitor the process)
• Scoring design (e.g., number of raters; random distribution or first-in, first-served assignment; one rater per student, or one rater per item or per trait in analytic scoring; use of the same package of benchmark papers)
• Monitoring system (e.g., design and functions)
• Scoring environment (e.g., on-site or distributed)
• Although identical scoring conditions may not be easily achieved due to state policies, budgets, schedules, and availability, certain essential conditions must be considered and implemented.

Many-Facets Measurement Model
• The Rasch measurement model generalized by Linacre (1989) provides a framework for examining the psychometric quality of professional judgments of students' constructed responses to performance tasks.
• The model may include many facets, such as rater severity, item difficulty, and student ability. The Facets model is a unidimensional model with a collection of facets as the independent variables and a single student competence parameter as the dependent variable.
• Engelhard (1994) demonstrated procedures for detecting four general categories of rater errors with Facets for a large-scale writing assessment. He concluded that the model offers a promising approach that is likely to detect most rater errors and minimize their potential effects.
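For reference, the rating-scale form of the many-facet Rasch model is commonly written as follows (generic notation supplied here; the slides themselves do not show the equation):

\log\left( \frac{P_{nijk}}{P_{nij(k-1)}} \right) = B_n - D_i - C_j - F_k

where P_{nijk} is the probability that student n receives category k from rater j on item i, B_n is the ability of student n, D_i is the difficulty of item i, C_j is the severity of rater j, and F_k is the threshold of rating category k relative to category k-1.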

Detect Rater Effects by Facets (1 of 3)
• Facets is commonly used to detect rater effects in performance-based assessments. The output of Facets provides detailed information about rater behavior.
• Because the rater facet is centered at 0, markedly positive or negative values indicate the presence of rater effects. The degree of rater severity or leniency can be determined by how far a rater's mean measure is from 0. In addition, the reliability of the separation index provides evidence of whether raters systematically differ in severity or leniency.
• Rater severity, central tendency, and halo effects can all lead to restriction of score range, and the halo effect may also produce central tendency and/or restriction of range.
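As a reminder of how that separation index is typically defined (a standard Rasch statistic; the formula is not shown in the slides), the reliability of rater separation is:

R_{sep} = \frac{SD_{obs}^{2} - \overline{SE^{2}}}{SD_{obs}^{2}}

where SD_{obs}^{2} is the observed variance of the rater severity measures and \overline{SE^{2}} is the mean squared standard error of those measures; values near 0 suggest raters are statistically interchangeable in severity, while values near 1 indicate reliably different severities.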

Detect Rater Effects by Facets (2 of 3)
• Observed positively or negatively skewed frequency distributions of test scores can be an indication of rater errors.
• The Rasch fit statistics, Infit and Outfit, can be used to detect the halo effect. The Infit statistic, with an expectation of 1.0, measures the degree of intra-rater consistency: Infit < 0.6 indicates too little variation and overuse of inner scale categories (such as 2 and 3 on a 0–5 scale), while Infit > 1.5 indicates excess variation and overuse of outer scale categories. The Outfit statistic, also with an expectation of 1.0, measures the same thing but is more sensitive to outliers.
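For reference, the mean-square forms of these statistics for rater j (standard Rasch formulas in generic notation, not reproduced from the slides) are:

\mathrm{Outfit}_j = \frac{1}{N_j} \sum_{n} z_{nj}^{2}, \qquad \mathrm{Infit}_j = \frac{\sum_{n} W_{nj}\, z_{nj}^{2}}{\sum_{n} W_{nj}}, \qquad z_{nj} = \frac{x_{nj} - E_{nj}}{\sqrt{W_{nj}}}

where x_{nj} is an observed rating given by rater j, E_{nj} is its model-expected value, W_{nj} is its model variance, and the sums run over the N_j observations scored by rater j.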

Detect Rater Effects by Facets (3 of 3)
• The halo effect usually appears when analytic judgments are influenced by the rater's overall impression of student performance. With a halo effect, the correlations of ratings on different dimensions are statistically exaggerated, while the variance components are diminished. The presence of a halo effect may alter the construct and limit the opportunity for students to demonstrate their strengths and weaknesses. Using a single rater to score all traits of a given response may introduce bias due to the halo effect.

A Facets Model for Cross-State Consistency (1 of 3)
[The slide's graphic/model specification is not captured in this transcript.]

A Facets Model for Cross-State Consistency (2 of 3)
• The proposed model can be used to identify and evaluate consistency in the scoring of constructed responses and essays across states. For each element, such as student, performance task, rater, and cross-state rating consistency, Facets provides a measure, its standard error, and fit statistics for flagging aberrant observations, and it quantifies how each element functions.
• Through the analysis, the probability of obtaining a given rating category on an item (e.g., a score point of 2 on a 0–5 scale) is estimated as a function of student ability, item difficulty, rater severity, and the scoring consistency of a state.
• The proposed model creates an opportunity to place student ability, rater performance, and cross-state consistency on the same scale for comparison.
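The slides do not give the algebraic form of the proposed model. Assuming it extends the many-facet Rasch model shown earlier with an additional state facet, one way it might be written (an illustrative sketch, not the authors' specification) is:

\log\left( \frac{P_{nijsk}}{P_{nijs(k-1)}} \right) = B_n - D_i - C_j - S_s - F_k

where S_s represents the scoring severity or consistency associated with state s, and the remaining terms are defined as before.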

A Facets Model for Cross-State Consistency (3 of 3)
• It is important to note that simulation studies and empirical analyses are needed to validate the proposed model and to examine its potential for identifying and quantifying scoring consistency across states. For evaluation purposes, a threshold should be determined.
• In extreme cases, the results of the cross-state consistency analyses could be used to adjust student scores.

A Final Note
• Among many factors, a quality system and constant monitoring of the scoring process are necessary, and standardized scoring conditions are essential for the validity of scoring performance tasks and for enhancing the comparability of test scores across states.
• The proposed model can be used to quantify rater effects by state, examine the potential impact of variations in scoring on student performance, and identify the consistency of scoring across states. Thus, the model can be used to validate the scoring process. For evaluation purposes, a threshold should be determined.
• In reality, unidentified or hard-to-quantify factors, such as the availability of qualified raters, the possibly adaptive selection of items for performance tasks, and state policies (e.g., budget, schedule), may introduce additional challenges to scoring.

Thanks!!
We would like to hear from you. Please contact us if you have comments, suggestions, and/or questions regarding the presentation, particularly about the notion of the comparability of test scores across states and the proposed Cross-State Facets Model.
Liru Zhang:
Shudong Wang: