Hands-on Automated Scoring


Hands-on Automated Scoring
Sue Lottridge and Carlo Morales, Pacific Metrics
Mark Shermis, University of Houston – Clear Lake

Workshop Goals
- Big-picture steps of how to train an engine and evaluate engine performance
- Hands-on experience with automated scoring across the training and validation pipeline

Agenda
8:30-10:00: Presentations
  - Overview of engine training
  - Data handling
  - Evaluation criteria
  - Engine fundamentals
  - Considerations
10:00-10:15: Break
10:15-11:30: Hands-on engine training
  - Walk-through demo with data

Engine Training Steps
Discovery
  - Item Materials
  - Scoring Rules
  - Data
Specifications
  - Create Datasets
  - Identify Models
  - Analysis Procedures
Analysis
  - Build Models
  - Pick Final Model
  - Score and Analyze
Report
  - Purpose
  - Methods
  - Results

Discovery
Presentation Information/Tools
  - Item
  - Rubric
  - Passage
Scoring Rules
  - Relevant range-finding decisions
  - Adjudication rules
  - Read-behind procedures
  - Choice of score on which to train
Data
  - Data definitions
  - Data (ID, responses, human-assigned scores, other)
  - Data summaries (validation)
  - Review of raw data and summaries

Define Specifications
Create Sets
  - Train set (2/3) – number of folds
  - Held-out test set (1/3)
Build Models
  - Preprocessing parameters
  - Feature extraction parameters
  - Scoring parameters
Analysis
  - H1-H2-Engine statistics
  - Train and test set
  - Data deliverables
  - Archiving

Implementation Steps
Build Models
  - Build data sets
  - Build models on train set
  - Score "folds" based on trained models
  - Cuts set, as needed
Finalize Model
  - Determine best-performing model
  - Store model parameters
  - Review with client if requested
Score
  - Score held-out test set
  - Create data deliverable
  - Conduct evaluation analyses
  - Archive

Generate Report
Intro
  - Reason for study
  - Description of items/rubrics/passages
  - Scoring model design
Methods
  - Description of data
  - Description of sample allocations across sets
  - Description of number of models employed and the final model
  - Description of analyses and any criteria used to evaluate
Results
  - Tables of results for each dataset (train, test)
  - Text description of results, including scoring issues
  - Recommendations
  - Data deliverables

Evaluating the Performance of an Automated Scoring Engine
An industry-wide standard should be used. The current standard is from Williamson, Xi, and Breyer (2012).
Note that other factors may weigh into the decision:
  - Identification of aberrant responses
  - Concern for scores at critical decision points (near cut scores)
  - Quality of scoring across the rubric
  - Use of AS in the operational program (e.g., monitoring, sole read, hybrid)

Evaluating Engine Performance
Metrics
  - Mean, SD, Standardized Mean Difference (SMD)
  - Exact Agreement, Adjacent, Non-Adjacent
  - Kappa, Quadratic Weighted Kappa (QWK), Correlations
Evaluation Criteria (Williamson, Xi, & Breyer, 2012)
  - QWK should exceed a minimum threshold of .70
  - Correlation should exceed a threshold of .70
  - QWK degradation (human-human vs. human-engine) not to exceed .10
  - SMD values not to exceed .15
  - Exact Agreement degradation not to exceed 5%
One can also evaluate performance by targeted subgroups (a sketch of these metrics follows).
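The following Python sketch shows one way to compute these metrics and apply the Williamson, Xi, and Breyer (2012) style thresholds. The function and variable names (h1, h2, engine) are illustrative rather than part of the workshop toolkit, and the pooled-SD convention for the SMD is an assumption.

```python
# Minimal sketch of agreement metrics and criteria checks; names are illustrative.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def agreement_stats(a, b):
    """Exact/adjacent agreement, QWK, correlation, and standardized mean difference."""
    a, b = np.asarray(a), np.asarray(b)
    diff = np.abs(a - b)
    pooled_sd = np.sqrt((a.std(ddof=1) ** 2 + b.std(ddof=1) ** 2) / 2)  # assumed SMD convention
    return {
        "exact": np.mean(diff == 0),
        "adjacent": np.mean(diff == 1),
        "non_adjacent": np.mean(diff > 1),
        "qwk": cohen_kappa_score(a, b, weights="quadratic"),
        "r": np.corrcoef(a, b)[0, 1],
        "smd": (b.mean() - a.mean()) / pooled_sd,
    }

def williamson_flags(h1, h2, engine):
    """Degradation compares human-human agreement to human-engine agreement."""
    hh, hm = agreement_stats(h1, h2), agreement_stats(h1, engine)
    return {
        "qwk >= .70": hm["qwk"] >= 0.70,
        "r >= .70": hm["r"] >= 0.70,
        "qwk degradation <= .10": (hh["qwk"] - hm["qwk"]) <= 0.10,
        "|smd| <= .15": abs(hm["smd"]) <= 0.15,
        "exact agreement degradation <= 5%": (hh["exact"] - hm["exact"]) <= 0.05,
    }
```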

Data
  - Item, rubric, training papers, and ancillary materials
  - Scoring decisions and human scores available
  - Responses with set of human scores
  - Deciding on score to use for prediction and score to use for evaluation
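A minimal sketch of what such a response file might look like, assuming a layout of ID, response text, two human scores (H1, H2), and a resolved score of record; all column names and response texts are hypothetical.

```python
# Hypothetical layout for a scoring data file; column names and contents are assumptions.
import pandas as pd

responses = pd.DataFrame(
    {
        "response_id": ["r001", "r002"],
        "response_text": ["The cell membrane controls ...", "Plants need sunlight because ..."],
        "h1": [2, 1],               # first human score
        "h2": [2, 0],               # second human score
        "score_of_record": [2, 1],  # resolved score; decide whether this or H1 is the training target
    }
)
print(responses.describe(include="all"))
```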

Statistical Considerations and Reasons
Total N
  - Do we have enough responses to train and validate an engine?
  - What proportion should be used for each sample?
N at each score point
  - Do we have enough responses at each score point to produce a reliable prediction?
  - Is there a bimodal distribution, suggesting a potential issue with the rubric?
Mean/SD
  - Is the item of hard, easy, or medium difficulty?
Exact Agreement
  - Does it seem too low or too high, given the rubric scale?
  - Is it dominated by one or more score points?
Non-Adjacent Agreement
  - Is it in the usual range (e.g., 1-5%)? If high, potentially a scoring issue.
Quadratic Weighted Kappa
  - Is it > .70?
Correlation
  - Is it > .70?
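A quick screening pass over the human scores could look like the sketch below; it reuses the hypothetical `responses` frame from the earlier data-layout example.

```python
# Screening sketch: total N, counts per score point, difficulty, and human-human agreement.
# Assumes the hypothetical `responses` frame sketched earlier (columns h1, h2).
import numpy as np

print("Total N:", len(responses))
print("N at each score point (H1):")
print(responses["h1"].value_counts().sort_index())
print("Mean/SD of H1:", responses["h1"].mean(), responses["h1"].std(ddof=1))

diff = (responses["h1"] - responses["h2"]).abs()
print("Exact agreement:", np.mean(diff == 0))
print("Non-adjacent agreement:", np.mean(diff > 1))
```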

Other Considerations
Once samples are divided into training and validation, conduct a review of the training papers:
  - Response length – anything unusual?
  - Uncommon responses
  - Range of responses within score point
  - Need for manual changes to responses
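A response-length screen on the training sample might look like the following sketch; the cutoffs are arbitrary examples, and the `responses` frame is the hypothetical one introduced earlier.

```python
# Sketch of a response-length screen on the training papers; thresholds are assumptions.
lengths = responses["response_text"].str.split().str.len()
print(lengths.describe())

# Flag unusually short or long responses for human review.
flagged = responses[(lengths < 5) | (lengths > lengths.quantile(0.99))]
print(flagged[["response_id", "h1", "h2"]])
```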

Sample Human Rater Statistics

Score distribution (N = 416)
Score   H1      H2      SOR
0       22.6%   23.8%
1       44%     41.3%
2       33.4%   34.9%
mean    1.11
std     0.74    0.76

Rater agreement (N = 416)
            H1-H2   H1-SOR   H2-SOR
Exact       81.2%   100%
Adjacent    18.5%   0%
Non-Adj     0.2%
Kappa       0.71    1
QWK         0.83
Pearson r
Spearman r

Sample Handling
Training sample vs. held-out validation sample
  - Proportion (often 67% train, 33% test)
  - Training sample used for building the model
  - Validation sample used for evaluating the model (never touch this until the model is finalized)
K-fold cross-validation
  - Enables multiple-model evaluation with a realistic estimate of performance
  - Very powerful tool when used with grid-searching to build and evaluate competing models
  - Once the model is finalized, train on the entire sample (more data = better model)

K-Fold Cross-Validation
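Since the diagram from this slide is not reproduced in the transcript, here is a minimal sketch of the split-plus-k-fold workflow using scikit-learn. The file path, vectorizer, model, and grid values are illustrative assumptions, not the workshop engine's settings.

```python
# Sketch: 67/33 split, then k-fold grid search on the training sample only.
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

data = pd.read_csv("responses.csv")  # hypothetical file with the layout sketched earlier
texts, scores = data["response_text"], data["h1"]

# Held-out test set is set aside and never touched until the model is finalized.
X_train, X_test, y_train, y_test = train_test_split(
    texts, scores, test_size=0.33, random_state=0, stratify=scores
)

candidate = Pipeline([
    ("features", TfidfVectorizer(ngram_range=(1, 2), lowercase=True)),
    ("model", LogisticRegression(max_iter=1000)),
])

# Grid search over competing settings using k-fold cross-validation on the train set.
search = GridSearchCV(
    candidate,
    param_grid={"features__ngram_range": [(1, 1), (1, 2)], "model__C": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="accuracy",  # exact agreement; a QWK scorer could be substituted
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)

# Only after the model is finalized: score the held-out test set once.
print("Held-out exact agreement:", search.score(X_test, y_test))
```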

The Modeling Pipeline
Preprocessing
  - Spell correct/term replace
  - Stopword removal
  - Lemmatization
  - Punctuation handling
  - Case handling
Feature Extraction
  - Term vectors (n-grams)
  - Base counts
  - Response characteristics
Scoring
  - Regression/classification
  - Parametric vs. non-parametric models
  - Choice of score to predict
  - Fit indices
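As an illustration of how the three stages might fit together in code, here is a hedged scikit-learn sketch; the specific preprocessing choices, feature set, ridge-regression scorer, and 0-2 rubric are assumptions for demonstration, not the CRASE engine or the workshop tool. It reuses the hypothetical X_train/X_test split from the k-fold sketch above.

```python
# Sketch of a preprocessing -> feature extraction -> scoring pipeline.
# All choices (lowercasing, punctuation stripping, n-grams plus base counts,
# ridge regression with rounding) are illustrative assumptions.
import re
import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import Ridge

def preprocess(texts):
    """Case handling and simple punctuation handling; spell correction would go here too."""
    return [re.sub(r"[^\w\s]", " ", t.lower()) for t in texts]

def base_counts(texts):
    """Response characteristics: word and character counts."""
    return np.array([[len(t.split()), len(t)] for t in texts])

features = FeatureUnion([
    ("ngrams", TfidfVectorizer(ngram_range=(1, 2), stop_words="english")),
    ("counts", FunctionTransformer(base_counts)),
])

pipeline = Pipeline([
    ("preprocess", FunctionTransformer(preprocess)),
    ("features", features),
    ("scorer", Ridge()),  # regression; predictions are rounded back to the rubric scale
])

pipeline.fit(X_train, y_train)  # hypothetical train split from the earlier sketch
predicted = np.clip(np.round(pipeline.predict(X_test)), 0, 2)  # assuming a 0-2 rubric
```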

Other Considerations
Protect against overfitting
Score modelling choices
  - Balance agreement with the score distribution
  - Quadratic weighted kappa is the best single metric for this
  - Pay attention to both
Manage aberrant responses
  - Usually through other means, but also through engine flagging
Fairness issues
  - How to handle?
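One simple way to act on the "protect against overfitting" point is to compare agreement on the training data itself with cross-validated agreement; a large gap suggests memorization. The sketch below reuses the hypothetical pipeline and split from the earlier examples.

```python
# Overfitting check sketch: train-set QWK vs. cross-validated QWK on the training sample.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import cohen_kappa_score

pipeline.fit(X_train, y_train)
train_pred = np.round(pipeline.predict(X_train)).astype(int)
train_qwk = cohen_kappa_score(y_train, train_pred, weights="quadratic")

cv_pred = np.round(cross_val_predict(pipeline, X_train, y_train, cv=5)).astype(int)
cv_qwk = cohen_kappa_score(y_train, cv_pred, weights="quadratic")

print(f"Train QWK: {train_qwk:.2f}  Cross-validated QWK: {cv_qwk:.2f}")
```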

15-minute break!

Hands-on Automated Scoring
Demo of machine scoring tool
A few notes:
  - Does not use the CRASE engine; a simplified scoring tool to demonstrate AS methods
  - Allows the user to upload data, review stats on responses, conduct limited preprocessing, feature extraction, and model-building, visualize the modeling process, and view results
  - We have data from the Automated Student Assessment Prize (ASAP) for constructed-response scoring for you to use

Toolkit Demo
  - Use your assigned link
  - Example item: ASAP data
Walk-through of functions
  - File upload
  - Train/test allocation proportions
  - Sample view
  - Describe PCA/LDA graphic and results below
  - Observe changes with preprocessing
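For readers without access to the assigned link, a PCA-style view of the responses can be approximated with a short sketch like the one below. Using TruncatedSVD over tf-idf features is an assumption about how such a graphic might be produced; it is not a description of the demo tool's PCA/LDA display.

```python
# Sketch of a PCA-style projection of responses, colored by human score.
# TruncatedSVD is used because tf-idf matrices are sparse; this only approximates
# the demo tool's PCA/LDA graphic. Reuses the hypothetical X_train/y_train split.
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(X_train)
points = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

scatter = plt.scatter(points[:, 0], points[:, 1], c=y_train, cmap="viridis", s=12)
plt.colorbar(scatter, label="human score")
plt.xlabel("component 1")
plt.ylabel("component 2")
plt.title("Responses projected onto two components")
plt.show()
```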

Task 1
'Play' with the various parameters and observe the performance of the engine, varying:
  1. Response processing parameters
  2. N-gram size
  3. Principal components
What was your measure of quality?
What was your approach to modelling?
What seemed to produce the best results?
Did anything reduce scoring quality?

Task 2
Do the same as before, but tune using the automation feature:
  1. Response processing parameters
  2. N-gram size
How did the results change across the various methods?
Does one score prediction model seem to work better than another?

Task 3
One of:
  - Determine the impact of overfitting your model, OR
  - Find a way to cheat your model!

Thank you!