Partial Credit Scoring for Technology Enhanced Items

Similar presentations
Writing constructed response items

Assessing Student Performance
Item Analysis.
What is a CAT?. Introduction COMPUTER ADAPTIVE TEST + performance task.
Peer Assessment of Oral Presentations Kevin Yee Faculty Center for Teaching & Learning, University of Central Florida Research Question For oral presentations,
Increasing your confidence that you really found what you think you found. Reliability and Validity.
Reliability for Teachers Kansas State Department of Education ASSESSMENT LITERACY PROJECT1 Reliability = Consistency.
1 CSSS Large Scale Assessment Webinar Adaptive Testing in Science Kevin King (WestEd) Roy Beven (NWEA)
1 New England Common Assessment Program (NECAP) Setting Performance Standards.
© 2008 McGraw-Hill Higher Education. All rights reserved. CHAPTER 16 Classroom Assessment.
1 The New York State Education Department New York State’s Student Reporting and Accountability System.
Teacher Evaluation Training June 30, 2014
What We Know About Effective Professional Development: Implications for State MSPs Part 2 Iris R. Weiss June 11, 2008.
Creating Assessments with English Language Learners in Mind In this module we will examine: Who are English Language Learners (ELL) and how are they identified?
Inferences about School Quality using opportunity to learn data: The effect of ignoring classrooms. Felipe Martinez CRESST/UCLA CCSSO Large Scale Assessment.
Ensuring State Assessments Match the Rigor, Depth and Breadth of College- and Career- Ready Standards Student Achievement Partners Spring 2014.
Out with the Old, In with the New: NYS Assessments “Primer” Basics to Keep in Mind & Strategies to Enhance Student Achievement Maria Fallacaro, MORIC
National Accessible Reading Assessment Projects Research on Making Large-Scale Reading Assessments More Accessible for Students with Disabilities June.
4/16/07 Assessment of the Core – Science Charlyne L. Walker Director of Educational Research and Evaluation, Arts and Sciences.
SOL Changes and Preparation A parent presentation.
1 New England Common Assessment Program (NECAP) Setting Performance Standards.
Measuring Mathematical Knowledge for Teaching: Measurement and Modeling Issues in Constructing and Using Teacher Assessments DeAnn Huinker, Daniel A. Sass,
 Closing the loop: Providing test developers with performance level descriptors so standard setters can do their job Amanda A. Wolkowitz Alpine Testing.
Student assessment AH Mehrparvar,MD Occupational Medicine department Yazd University of Medical Sciences.
TEST SCORES INTERPRETATION - is a process of assigning meaning and usefulness to the scores obtained from classroom test. - This is necessary because.
Accommodations and Modification in Grades Do NOT fundamentally alter or lower expectations or standards in instructional level, content, or performance.
1 Teacher Evaluation Institute July 23, 2013 Roanoke Virginia Department of Education Division of Teacher Education and Licensure.
Overview of Types of Measures Margaret Kasimatis, PhD VP for Academic Planning & Effectiveness.
1 Main achievement outcomes continued.... Performance on mathematics and reading (minor domains) in PISA 2006, including performance by gender Performance.
1 Measurement Error All systematic effects acting to bias recorded results: -- Unclear Questions -- Ambiguous Questions -- Unclear Instructions -- Socially-acceptable.
How to Use These Modules 1.Complete these modules with your grade level and/or content team. 2.Print the note taking sheets. 3.Read the notes as you view.
VALIDATING SCORING RULES FOR TECHNOLOGY- ENHANCED ITEMS (TEIs): A CASE STUDY FROM ELPA21 NATIONAL CONFERENCE ON STUDENT ASSESSMENT - JUNE 20, 2016 TERRI.
A New Trend Line in Student Achievement “Virginia's public schools are beginning a new trend line with the implementation of more challenging standards.
Reduced STAAR test blueprints
Designing Scoring Rubrics
Survey Methodology Reliability and Validity
PARCC Information for Parents Rockaway Borough Schools Mark Schwarz, Superintendent Jamie Argenziano, Supervisor of Curriculum and Instruction January.
PeerWise Student Instructions
What is a CAT?
Technology Enhanced Items — Signal or Noise?
Preliminary Review of the 2012 Math SOL Results
Classroom Assessment A Practical Guide for Educators by Craig A
ARDHIAN SUSENO CHOIRUL RISA PRADANA P.
Concept of Test Validity
Information and Guidance on the Changes and Expectations for 2016/17
EDU 385 Session 8 Writing Selection items
Test Design & Construction
Test Validity.
The Florida Standards Assessments: What Every Parent Should Know
Classroom Analytics.
Science and Tech/Eng MCAS Update
Classroom Assessment Validity And Bias in Assessment.
Week 3 Class Discussion.
Office of Education Improvement and Innovation
9th Grade Literature & Composition
RESEARCH METHODS Lecture 18
An Introduction to e-Assessment
Interim Assessment Training NEISD Testing Services
Rubrics for academic assessment
Writing a Free Response Essay
Assessment Literacy: Test Purpose and Use
Milwaukee Public Schools University of Wisconsin-Milwaukee
Florida Standards Assessment
Testing Schedule.
Year 6 SATs Meeting.
Teaching The Teachers To Teach Us To Learn
Driven by Data to Empower Instruction and Learning
Tests are given for 4 primary reasons.
  Using the RUMM2030 outputs as feedback on learner performance in Communication in English for Adult learners Nthabeleng Lepota 13th SAAEA Conference.
Presentation transcript:

Partial Credit Scoring for Technology Enhanced Items
CCSSO National Conference on Student Assessment, 22 June 2016

Overview
- Possible scoring rules for technology enhanced items (TEIs): dichotomous and partial credit (polytomous)
- Evaluating whether partial credit scoring is "working"
- Study methods & results
- Additional considerations:
  - Do TEIs with more correct answers take longer to answer?
  - Are TEIs with more correct answers more difficult?
  - Do TEIs take more time to answer than multiple-choice items?
  - Should TEIs be worth more points than multiple-choice items?
  - How might partial credit scoring impact student subgroups?

Possible Scoring Rules for TEIs
- Dichotomous (currently used in Virginia): students must select all correct answers to receive 1 point; all other responses receive 0 points.
- Polytomous: Virginia is looking to implement partial credit scoring for TEIs in the future. Students get full credit if they select all correct answers; the methods differ in how they distinguish partial credit from no credit.
- This research examined three possible methods: the "N Method," the "N-1 Method," and the "N/2 Method."
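As an illustrative sketch (not Virginia's actual scoring engine), the dichotomous rule above might be coded as follows, reading "all correct answers" as an exact match between the student's selections and the key:

```python
def score_dichotomous(selected, correct):
    """Dichotomous rule: 1 point only when the selected answers
    exactly match the keyed answers; any omission or extra
    selection scores 0."""
    return 1 if set(selected) == set(correct) else 0
```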

Partial Credit Scoring: N Method
- N = the total number of correct answers for a particular TEI
- Each correct response receives partial credit. For example, if there are 5 correct answers, each correct answer could be worth 1 point, for a total of 5 possible points.
- Items with different numbers of correct answers will have different numbers of score points.
- Each item could be scaled to the same score range (e.g., 0-1), but there will still be N+1 score categories (e.g., 0, 1/5, 2/5, 3/5, 4/5, and 1).
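The N Method can be sketched as below. This assumes the score is simply the count of keyed answers the student selected, with no explicit penalty for incorrect selections (the slides do not specify one):

```python
def score_n_method(selected, correct):
    """N Method: one point per keyed answer the student selected,
    for a maximum of N = len(correct) points."""
    return len(set(selected) & set(correct))

def score_n_method_scaled(selected, correct):
    """The same rule rescaled to the 0-1 range; there are
    still N+1 distinct score categories."""
    return score_n_method(selected, correct) / len(correct)
```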

Partial Credit Scoring: N Method
Benefits:
- Perceived face validity: might be similar to how credit would be awarded in the classroom for multi-part items
Limitations:
- Each score point may not really discriminate between distinct levels of content knowledge
- It may be difficult to obtain sufficient numbers of students within all score categories
- Items with different score ranges may make meeting a test blueprint more challenging

Partial Credit Scoring: N-1 Method
- N = the total number of correct answers for a particular TEI
- Students get partial credit if they select all but one of the correct answers; if they miss two or more of the correct answers, they receive no credit.
- The N-1 Method yields 3 score categories for all items (e.g., 0, 1, 2).
- Example: for an item with 5 correct answers, a student who selects all 5 receives full credit; a student who selects 4 of the 5 receives partial credit; a student who selects 0-3 receives no credit.
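Following the slide's example, in which only the count of keyed answers selected determines the score, the N-1 Method might look like this sketch:

```python
def score_n_minus_1(selected, correct):
    """N-1 Method: full credit (2) for all N keyed answers,
    partial credit (1) for exactly N-1 of them, otherwise 0."""
    hits = len(set(selected) & set(correct))
    n = len(correct)
    if hits == n:
        return 2
    if hits == n - 1:
        return 1
    return 0
```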

Partial Credit Scoring: N-1 Method
Benefits:
- All items, regardless of the number of correct answers, have three possible score categories
- Having the same score range for all TEIs makes meeting a test blueprint more straightforward
- With only three score categories, sparse data is less likely to be a problem, which makes parameter estimation more stable
Limitations:
- A one-size-fits-all scoring approach may not work well for very different TEI types
- Tends to be a very conservative partial credit approach: so few students obtain partial credit that the item functions almost dichotomously

Partial Credit Scoring: N/2 Method
- N = the total number of correct answers for a particular TEI
- Students get partial credit if they select at least half of the correct answers; if they select fewer than half, they receive no credit.
- The N/2 Method yields 3 score categories for all items (e.g., 0, 1, 2).
- The N/2 Method and the N Method are identical for items with 2 correct answers.
- Example: for an item with 5 correct answers, a student who selects all 5 receives full credit; a student who selects 3 or 4 receives partial credit; a student who selects 0-2 receives no credit.
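Under the same assumption as above (only the count of keyed answers selected matters), the N/2 Method might be sketched as:

```python
def score_n_over_2(selected, correct):
    """N/2 Method: full credit (2) for all N keyed answers,
    partial credit (1) for at least half of them, otherwise 0."""
    hits = len(set(selected) & set(correct))
    n = len(correct)
    if hits == n:
        return 2
    if hits >= n / 2:
        return 1
    return 0
```

With N = 2 this produces the same 0/1/2 scores as the N Method, matching the equivalence noted on the slide.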

Partial Credit Scoring: N/2 Method
Benefits:
- All items, regardless of the number of correct answers, have three possible score categories
- Having the same score range for all TEIs makes meeting a test blueprint more straightforward
- With only three score categories, sparse data is less likely to be a problem, which makes parameter estimation more stable
Limitations:
- A one-size-fits-all scoring approach may not work well for very different TEI types

Is PCS "Working"? Evaluation Criteria
- Number of students in each score category: do we have enough data to estimate differences in difficulty from one category to the next?
- Are we making a meaningful differentiation between groups of students? As item scores increase, does the total test score increase? Are item-total correlations increasing?
- Are we improving measurement precision, or just adding "noise" to student scores?
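Two of the criteria above (do total scores rise with item scores, and how strong is the item-total relationship?) can be checked with short routines like the following sketch; the function names are illustrative, not from the study:

```python
from statistics import mean

def category_means(item_scores, total_scores):
    """Mean total test score for students at each item score level.
    If these means do not increase with the item score, the extra
    category may be adding noise rather than information."""
    by_cat = {}
    for s, t in zip(item_scores, total_scores):
        by_cat.setdefault(s, []).append(t)
    return {s: mean(ts) for s, ts in sorted(by_cat.items())}

def item_total_correlation(item_scores, total_scores):
    """Pearson correlation between item scores and total test scores."""
    mi, mt = mean(item_scores), mean(total_scores)
    cov = sum((i - mi) * (t - mt)
              for i, t in zip(item_scores, total_scores))
    var_i = sum((i - mi) ** 2 for i in item_scores)
    var_t = sum((t - mt) ** 2 for t in total_scores)
    return cov / (var_i * var_t) ** 0.5
```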

Methods
- Focused on newly field-tested TEIs in one grade of science and one grade of mathematics
- Analysis focused on Hot Spot and Drag-and-Drop TEIs that required two or more correct answers for full credit
- Note: this study used items that were not specifically developed with partial credit scoring in mind. Moving forward, Virginia intends to develop TEIs with partial credit scoring considered from the outset.

Results
[Two flattened tables: the percentage of students in each score category for three Hot Spot and three Drag-and-Drop TEIs (n = 3,904-5,231 per item), first under dichotomous and N/2 scoring, then under N Method scoring (up to five score categories); the column-to-value alignment was lost in transcription.]

Results
[Flattened table: item-total correlations for Hot Spot and Drag-and-Drop TEIs under dichotomous, N/2, and N Method scoring (reported values: Hot Spot 0.42, 0.51, 0.35, 0.17, 0.19; Drag-and-Drop 0.20, 0.39, 0.38, 0.34); the column alignment was lost in transcription.]

Results
N Method:
- Many score categories with very few students, especially for TEIs with a larger number of correct answers
- Many cases where the total score did not increase as the item score increased
N-1 Method:
- Very few students would receive partial credit using this method
N/2 Method:
- Most consistent with how content experts assigned partial credit scoring to TEIs
- Appears to work well with Virginia items in terms of having sufficient numbers of students within each of the three score categories
- Did not appear to result in large improvements in measurement properties (item-total correlations), but also did not systematically decrease the technical quality of scores

Do TEIs with More Correct Answers Take Longer to Answer?

Are TEIs with More Correct Answers More Difficult?

Do TEIs Take More Time than MC Items to Answer?

Subject    TEI Min    TEI Max    TEI Average    MC Average
Science    28.32      214.73     81.45          58.50
Math       76.49      466.98     185.55         132.29

Should TEIs Be Worth More Points than MC Items? Yes?
- On average, TEIs take more time to answer
- TEIs often require more than one student interaction, while MC items require students to select only one answer
- TEIs are often used to try to measure higher-level skills
- TEIs look similar to items that might receive more than a 0/1 score in the classroom

Should TEIs Be Worth More Points than MC Items? No?
- The number of interactions is not correlated with the amount of time students spend on an item
- TEIs are not necessarily more difficult than MC items when scored dichotomously
- Often each interaction is not measuring a distinct skill, but is just more thoroughly evaluating a single skill

How Might Partial Credit Scoring Impact Student Subgroups?
- TEIs make up 15-20% of the items on Virginia assessments
- Partial credit scoring essentially splits the group of students who received 0 points under dichotomous scoring into two or more score categories: no credit and various degrees of partial credit
- This may help lower-performing students show some content mastery
- However, each type and subtype of TEI often requires understanding a different response mechanism (e.g., various forms of clicking and/or dragging)
- This may pose an additional obstacle for English Language Learners and poor readers, who may struggle to understand item-specific response directions
- The Virginia Department of Education provides practice items and released items to give students and schools an opportunity to become familiar with different types of TEIs

Take-Aways
- Not all technology enhanced items support partial credit scoring (e.g., single-blank fill-in-the-blank items)
- For items that do support partial credit scoring, determining which responses merit partial credit is not always straightforward
- There are many methods for applying partial credit, some of which may work better than others for a given context
- Partial credit scoring may not result in vast improvements in measurement precision, but it may still enhance face validity and public perception