Automated Scoring in the Next Generation World
Susan Lottridge
NCSA, June 2014
© Copyright 2014 Pacific Metrics Corporation

Next Generation World (Winter, Burkhardt, Freidhoff, Stimson, & Leslie, 2013)
– Computer-based testing
– Adaptive testing
– Potentially large item pools
– Technology-enhanced items
– Automated item generation
– Personalized learning/formative uses

Key Topics in Today’s Presentation
– Cataloguing Constructed Response items
– Combining scoring sources
– Adaptive testing
– TEIs/Automated item generation

Cataloguing CR Items

Item Type & Scoring
Knowing the item type can help to determine the appropriate scoring approach and whether the item is ‘score-able’ (Lottridge, Winter & Mugan, 2013).
There is an almost infinite number of ways to create a Constructed Response item!

Constructed Response Items
CR items cover a very broad range of item types
– When is a CR item different from an essay?
– When is it different from a performance event or task?
We need better definition around these items
– Types
– Structural considerations
– Content
– Score points/rubric

Types
– Technology-enhanced items
– Text-based
  – One-word to phrasal typed response
  – Single sentence
  – Multiple sentences
– Constrained entry
  – Numbers
  – Equations/expressions

Structural Considerations (Ferrara et al., 2003; Scalise & Gifford, 2006)
CR items are often multi-part
– What defines a part?
– Parts can differ from entry boxes
– How are parts scored in the rubric?
– How many points per part?
CR items consist of multiple types
– TEI + text ‘explain your answer’
– Solution + equation (+ text ‘explain your answer’)
– Solution + text ‘explain your answer’
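To make these structural distinctions concrete, a catalogue entry for a CR item can record its parts, their entry types, and their rubric points. The sketch below is purely illustrative; the field names (entry_type, points, content_area) are assumptions, not a published Pacific Metrics schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ItemPart:
    """One scoreable part of a constructed-response item."""
    entry_type: str   # e.g., 'text', 'number', 'equation', 'TEI'
    points: int       # rubric points allotted to this part

@dataclass
class CRItem:
    """Catalogue record describing the structure of a CR item."""
    item_id: str
    content_area: str                               # 'math', 'science', or 'reading'
    parts: List[ItemPart] = field(default_factory=list)

    @property
    def max_points(self) -> int:
        return sum(p.points for p in self.parts)

# Example: a math item with structure 'solution + explanation'
item = CRItem("M-001", "math", [ItemPart("number", 1), ItemPart("text", 2)])
print(item.max_points)  # 3
```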

Math CR Items
CR Item Structure                                     Count
Solution + explanation                                   28
[Solution + explanation] + [solution + explanation]      10
[N] Solution + explanation                                9
[N] Solution                                              5
Equation + solution                                       4
Equation + solution + explanation                         4
TEI with or without labeling                              8
TEI + solution/expression/equation                        4
Other                                                     3

Science CR Items
CR Item Structure                                     Count
Identify 1 + explain/describe                             7
Identify 2 + explain/describe                             7
Identify 4 + explain/describe                             1
Identify [N]                                             11
Explain/describe [N]                                      5
[M] [identify [N] + explain/describe [N]]                 4
TEI + explain/describe [N] + identify [N]                 3
Solution + explain/describe [N] + identify [N]            3
Solution [N]                                              1

Reading CR Items
CR Item Structure                                     Count
Identify details                                         22
Identify [N] + [N] details                                9
2 details                                                 7
[N] details                                               5
[N] List 1 + explain/describe                             5
Summary with specifications                               5
Inference/generalization + [N] details                    5

Content Considerations
Math
– What is required in ‘explain your answer’ or ‘show your work’ responses?
Science (Baxter & Glaser, 1998)
– Content lean to content rich
– Process constrained to process rich
Reading
– Detail versus explanation
– Prediction/generalization
– Summarization

Combining Scoring Sources

There are many ways to leverage different scoring sources (Lottridge, Winter & Mugan, 2013)
– 100% human + N% computer second read
– Complementary human and computer scoring
– 100% computer with N% human second read
– Blended human and computer scoring
But can we use different computer scoring models to produce better results?
– Adjudication Model
– Ensemble Model
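The human/computer read configurations above amount to a sampling policy over responses. A minimal sketch, assuming only that each response can be routed to one or both sources; the function name and parameters are illustrative:

```python
import random

def assign_reads(response_ids, primary="human", second_read_rate=0.10, seed=0):
    """Route every response to a primary source and sample a fraction for a
    second read by the other source (e.g., 100% human + 10% computer second read)."""
    rng = random.Random(seed)
    other = "computer" if primary == "human" else "human"
    plan = {}
    for rid in response_ids:
        reads = [primary]
        if rng.random() < second_read_rate:
            reads.append(other)
        plan[rid] = reads
    return plan

# e.g., assign_reads(range(1000), primary="computer", second_read_rate=0.05)
```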

Adjudication Model
Two engines were independently trained on 5 math CR items
– Each engine underperforms relative to humans
– One engine is ‘rule-based’; the other is heavily NLP/machine-learning based
Restrict computer scoring to those responses on which the engines agree
– The remaining responses go to humans for scoring
Results show promise
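As a rough sketch of the adjudication logic just described, and assuming each engine exposes a score() method (an interface invented here for illustration), responses are machine-scored only when the two engines agree exactly and are otherwise queued for human raters:

```python
def adjudicate(response, engine1, engine2):
    """Return (score, source) for one response under the adjudication model."""
    s1 = engine1.score(response)
    s2 = engine2.score(response)
    if s1 == s2:
        return s1, "engine"      # engines agree: use the machine score
    return None, "human"         # engines disagree: route to human raters

def route_batch(responses, engine1, engine2):
    """Split a batch of {response_id: text} into machine-scored and human-queued sets."""
    machine_scored, human_queue = {}, []
    for rid, text in responses.items():
        score, source = adjudicate(text, engine1, engine2)
        if source == "engine":
            machine_scored[rid] = score
        else:
            human_queue.append(rid)
    return machine_scored, human_queue
```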

Exact Agreement Rates (Complete Validation Sample)
Item   N    Human 1 – Human 2   Engine 1 – Human 2   Engine 2 – Human 2
1      –    –                   76%                  80%
2      –    –                   79%                  85%
3      –    –                   –                    77%
4      –    –                   88%                  89%
5      –    –                   74%                  70%
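‘Exact agreement’ in these tables is the proportion of responses on which two scoring sources assign identical scores. A minimal sketch of the computation, assuming parallel lists of integer scores:

```python
def exact_agreement(scores_a, scores_b):
    """Proportion of responses receiving identical scores from two sources."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be the same length")
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)

# e.g., exact_agreement([2, 1, 0, 3], [2, 1, 1, 3]) -> 0.75
```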

Adjudication Proportions
Item   N    Engine Assigns Score   Humans Assign Score
1      –    –                      29%
2      –    –                      20%
3      –    –                      26%
4      –    –                      10%
5      –    –                      38%

Engine Assigns Score Condition (Exact Agreement Performance)
Item   N    Human 1 – Human 2   Engine 1/2 – Human 2
1      –    –                   74%
2      –    –                   –
3      –    –                   –
4      –    –                   92%
5      –    –                   91%

Humans Assign Score Condition (Exact Agreement Performance)
Item   N    Human 1 – Human 2   Engine 1 – Human 2   Engine 2 – Human 2
1      –    –                   39%                  52%
2      52   79%                 31%                  59%
3      87   85%                 48%                  51%
4      24   92%                 46%                  54%
5      77   90%                 52%                  42%

Adjudication Summary
When we restrict engine scoring to the responses on which the engines agree, the engines perform similarly to humans. When the engines do not agree, they perform poorly relative to humans. This suggests the adjudication criteria are adequate for retaining the responses that should be scored by automated scoring.

Ensemble Model
Combining scores from two different engines to produce a score
– Weighted average
– Optimization via regression
– Other methods (decision trees, etc.)
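A sketch of the first two combination strategies, weighted averaging and regression-based optimization, assuming each engine outputs a numeric score per response; scikit-learn's LinearRegression stands in for whatever optimizer a program would actually use, and the score range (max_points) is an assumed parameter:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def average_ensemble(scores1, scores2, w1=0.5, w2=0.5, max_points=2):
    """Weighted average of two engine scores, rounded back to the rubric scale."""
    combined = w1 * np.asarray(scores1) + w2 * np.asarray(scores2)
    return np.clip(np.rint(combined), 0, max_points).astype(int)

def regression_ensemble(train1, train2, human_scores, test1, test2, max_points=2):
    """Learn combination weights from human-scored training responses, then apply them."""
    X_train = np.column_stack([train1, train2])
    model = LinearRegression().fit(X_train, human_scores)
    preds = model.predict(np.column_stack([test1, test2]))
    return np.clip(np.rint(preds), 0, max_points).astype(int)
```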

Results: Exact Agreement Rate with Human Raters
Item   Engine 1   Engine 2   Ensemble   Improvement
1      59%        61%        66%        5%
2      75%        67%        76%        1%
3      57%        58%        63%        5%
– 13 Reading CR items
– Ensembling by averaging scores from two engines
– 10 items exhibited no improvement
– 3 items exhibited some improvement

Adaptive Testing and CRs
Item pools
– Potentially large number of CRs (thousands)
– Low number of examinees per CR (if any)
Impacts on hand scoring and engine scoring
– Training readers and engines
– Requires a large AS staff to train, or a shift from ‘expert-based’ to ‘automated’ training models

TEIs and Automated Item Generation
Many TEIs/AIG templates are scored 0–1 or are multiple choice (Winter et al., 2013)
– But they often require multiple steps by the examinee
Can we involve item authors in configuring scoring rules to enable partial-credit scoring?
– Expands the usefulness of the item to examinees
– Removes expert scoring labor from the training process
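One hypothetical sketch of what author-configured partial-credit scoring could look like: the rubric is expressed as declarative rules (response field, keyed answer, points) that a generic evaluator applies, so no engine training or expert hand-scoring is required. The rule format and field names below are invented for illustration and do not describe any particular TEI platform.

```python
# Hypothetical author-configured rubric for a two-part drag-and-drop TEI:
# each rule names a response field, the keyed answer, and the points awarded.
rubric = [
    {"field": "region_selected", "key": "northeast", "points": 1},
    {"field": "evidence_dropped", "key": ["census", "rainfall"], "points": 1},
]

def score_tei(response, rubric):
    """Sum points for every rule the examinee response satisfies."""
    total = 0
    for rule in rubric:
        answer = response.get(rule["field"])
        key = rule["key"]
        if isinstance(key, list):
            correct = answer is not None and set(answer) == set(key)
        else:
            correct = answer == key
        if correct:
            total += rule["points"]
    return total

# e.g., score_tei({"region_selected": "northeast",
#                  "evidence_dropped": ["rainfall", "census"]}, rubric) -> 2
```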

References
Baxter, G., & Glaser, R. (1998). Investigating the cognitive complexity of science assessments. Educational Measurement: Issues & Practice, 17(3).
Ferrara, S., Duncan, T., Perie, M., Freed, R., McGovern, J., & Chilukuri, R. (2003, April). Item construct validity: Early results from a study of the relationship between intended and actual cognitive demands in a middle school science assessment. Paper presented at the annual meeting of the American Educational Research Association, Chicago, IL.
Lottridge, S., Winter, P., & Mugan, L. (2013). The AS decision matrix: Using program stakes and item type to make informed decisions about automated scoring implementations. Pacific Metrics Corporation. Retrieved from papers/ASDecisionMatrix_WhitePaper_Final.pdf
Scalise, K., & Gifford, B. R. (2006). Computer-based assessment in e-learning: A framework for constructing "intermediate constraint" questions and tasks for technology platforms. Journal of Technology, Learning, and Assessment, 4(6).
Winter, P. C., Burkhardt, A. K., Freidhoff, J. R., Stimson, R. J., & Leslie, S. C. (2013). Astonishing impact: An introduction to five computer-based assessment issues. Michigan Virtual Learning Research Institute.

Questions?

Pacific Metrics Corporation
1 Lower Ragsdale Drive, Building 1, Suite 150, Monterey, CA

Thank You