Hands-on Automated Scoring


Hands-on Automated Scoring
Sue Lottridge and Carlo Morales, Pacific Metrics
Mark Shermis, University of Houston – Clear Lake

Workshop Goals
- Big-picture steps of how to train an engine and evaluate engine performance
- Hands-on experience with automated scoring across the training and validation pipeline

Agenda
8:30-10:00: Presentations
  - Overview of engine training
  - Data handling
  - Evaluation criteria
  - Engine fundamentals
  - Considerations
10:00-10:15: Break
10:15-11:30: Hands-on engine training
  - Walk-through demo with data

Engine Training Steps
Discovery
  - Item Materials
  - Scoring Rules
  - Data
Specifications
  - Create Datasets
  - Identify Models
  - Analysis Procedures
Analysis
  - Build Models
  - Pick Final Model
  - Score and Analyze
Report
  - Purpose
  - Methods
  - Results

Discovery
Presentation Information/Tools
  - Item
  - Rubric
  - Passage
Scoring Rules
  - Relevant range-finding decisions
  - Adjudication rules
  - Read-behind procedures
  - Choice of score on which to train
Data
  - Data definitions
  - Data (ID, responses, human-assigned scores, other)
  - Data summaries (validation)
  - Review of raw data and summaries

Define Specifications
Create Sets
  - Train set (2/3) – number of folds
  - Held-out test set (1/3)
Build Models
  - Preprocessing parameters
  - Feature extraction parameters
  - Scoring parameters
Analysis
  - H1-H2-Engine statistics
  - Train and test set
  - Data deliverables
  - Archiving

Implementation Steps
Build Models
  - Build data sets
  - Build models on train set
  - Score "folds" based on trained models
  - Cuts set, as needed
Finalize Model
  - Determine best-performing model
  - Store model parameters
  - Review with client if requested
Score
  - Score held-out test set
  - Create data deliverable
  - Conduct evaluation analyses
  - Archive

Generate Report
Intro
  - Reason for study
  - Description of items/rubrics/passages
  - Scoring model design
Methods
  - Description of data
  - Description of sample allocations across sets
  - Description of number of models employed and the final model
  - Description of analyses and any criteria used to evaluate
Results
  - Tables of results for each dataset (train, test)
  - Text description of results, including scoring issues
  - Recommendations
  - Data deliverables

Evaluating the Performance of an Automated Scoring Engine
An industry-wide standard should be used. The current standard is from Williamson, Xi, and Breyer (2012).
Note that other factors may weigh into the decision:
  - Identification of aberrant responses
  - Concern for scores at critical decision points (near cut scores)
  - Quality of scoring across the rubric
  - Use of AS in the operational program (e.g., monitoring, sole read, hybrid)

Evaluating Engine Performance
Metrics
  - Mean, SD, Standardized Mean Difference (SMD)
  - Exact Agreement, Adjacent, Non-Adjacent
  - Kappa, Quadratic Weighted Kappa (QWK), Correlations
Evaluation Criteria (Williamson, Xi, & Breyer, 2012)
  - QWK should exceed a minimum threshold of .70
  - Correlation should exceed a threshold of .70
  - QWK degradation (human-human vs. human-engine) not to exceed .10
  - SMD values not to exceed .15
  - Exact Agreement degradation not to exceed 5%
One can also evaluate performance by targeted subgroups (a sketch of these metrics follows).
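The following Python sketch shows one way to compute these metrics and apply the Williamson, Xi, and Breyer (2012) style thresholds. The function and variable names (h1, h2, engine) are illustrative rather than part of the workshop toolkit, and the pooled-SD convention for the SMD is an assumption.

```python
# Minimal sketch of agreement metrics and criteria checks; names are illustrative.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def agreement_stats(a, b):
    """Exact/adjacent agreement, QWK, correlation, and standardized mean difference."""
    a, b = np.asarray(a), np.asarray(b)
    diff = np.abs(a - b)
    pooled_sd = np.sqrt((a.std(ddof=1) ** 2 + b.std(ddof=1) ** 2) / 2)  # assumed SMD convention
    return {
        "exact": np.mean(diff == 0),
        "adjacent": np.mean(diff == 1),
        "non_adjacent": np.mean(diff > 1),
        "qwk": cohen_kappa_score(a, b, weights="quadratic"),
        "r": np.corrcoef(a, b)[0, 1],
        "smd": (b.mean() - a.mean()) / pooled_sd,
    }

def williamson_flags(h1, h2, engine):
    """Degradation compares human-human agreement to human-engine agreement."""
    hh, hm = agreement_stats(h1, h2), agreement_stats(h1, engine)
    return {
        "qwk >= .70": hm["qwk"] >= 0.70,
        "r >= .70": hm["r"] >= 0.70,
        "qwk degradation <= .10": (hh["qwk"] - hm["qwk"]) <= 0.10,
        "|smd| <= .15": abs(hm["smd"]) <= 0.15,
        "exact agreement degradation <= 5%": (hh["exact"] - hm["exact"]) <= 0.05,
    }
```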

Data
  - Item, rubric, training papers, and ancillary materials
  - Scoring decisions and human scores available
  - Responses with set of human scores
  - Deciding on score to use for prediction and score to use for evaluation
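A minimal sketch of what such a response file might look like, assuming a layout of ID, response text, two human scores (H1, H2), and a resolved score of record; all column names and response texts are hypothetical.

```python
# Hypothetical layout for a scoring data file; column names and contents are assumptions.
import pandas as pd

responses = pd.DataFrame(
    {
        "response_id": ["r001", "r002"],
        "response_text": ["The cell membrane controls ...", "Plants need sunlight because ..."],
        "h1": [2, 1],               # first human score
        "h2": [2, 0],               # second human score
        "score_of_record": [2, 1],  # resolved score; decide whether this or H1 is the training target
    }
)
print(responses.describe(include="all"))
```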

Statistical Considerations and Reasons
Total N
  - Do we have enough responses to train and validate an engine?
  - What proportion should be used for each sample?
N at each score point
  - Do we have enough responses at each score point to produce a reliable prediction?
  - Is there a bimodal distribution, suggesting a potential issue with the rubric?
Mean/SD
  - Is the item of hard, easy, or medium difficulty?
Exact Agreement
  - Does it seem too low or too high, given the rubric scale?
  - Is it dominated by one or more score points?
Non-Adjacent Agreement
  - Is it in the usual range (e.g., 1-5%)? If high, potentially a scoring issue.
Quadratic Weighted Kappa
  - Is it > .70?
Correlation
  - Is it > .70?
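A quick screening pass over the human scores could look like the sketch below; it reuses the hypothetical `responses` frame from the earlier data-layout example.

```python
# Screening sketch: total N, counts per score point, difficulty, and human-human agreement.
# Assumes the hypothetical `responses` frame sketched earlier (columns h1, h2).
import numpy as np

print("Total N:", len(responses))
print("N at each score point (H1):")
print(responses["h1"].value_counts().sort_index())
print("Mean/SD of H1:", responses["h1"].mean(), responses["h1"].std(ddof=1))

diff = (responses["h1"] - responses["h2"]).abs()
print("Exact agreement:", np.mean(diff == 0))
print("Non-adjacent agreement:", np.mean(diff > 1))
```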

Other Considerations
Once samples are divided into training and validation, conduct a review of the training papers:
  - Response length – anything unusual?
  - Uncommon responses
  - Range of responses within score point
  - Need for manual changes to responses
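A response-length screen on the training sample might look like the following sketch; the cutoffs are arbitrary examples, and the `responses` frame is the hypothetical one introduced earlier.

```python
# Sketch of a response-length screen on the training papers; thresholds are assumptions.
lengths = responses["response_text"].str.split().str.len()
print(lengths.describe())

# Flag unusually short or long responses for human review.
flagged = responses[(lengths < 5) | (lengths > lengths.quantile(0.99))]
print(flagged[["response_id", "h1", "h2"]])
```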

Sample Human Rater Statistics

Score distribution (N = 416)
Score   H1      H2      SOR
0       22.6%   23.8%
1       44%     41.3%
2       33.4%   34.9%
mean    1.11
std     0.74    0.76

Rater agreement (N = 416)
            H1-H2   H1-SOR   H2-SOR
Exact       81.2%   100%
Adjacent    18.5%   0%
Non-Adj     0.2%
Kappa       0.71    1
QWK         0.83
Pearson r
Spearman r

Sample Handling
Training sample vs. held-out validation sample
  - Proportion (often 67% train, 33% test)
  - Training sample used for building the model
  - Validation sample used for evaluating the model (never touch this until the model is finalized)
K-fold cross-validation
  - Enables multiple-model evaluation with a realistic estimate of performance
  - Very powerful tool when used with grid-searching to build and evaluate competing models
  - Once the model is finalized, train on the entire sample (more data = better model)

K-Fold Cross-Validation
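Since the diagram from this slide is not reproduced in the transcript, here is a minimal sketch of the split-plus-k-fold workflow using scikit-learn. The file path, vectorizer, model, and grid values are illustrative assumptions, not the workshop engine's settings.

```python
# Sketch: 67/33 split, then k-fold grid search on the training sample only.
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

data = pd.read_csv("responses.csv")  # hypothetical file with the layout sketched earlier
texts, scores = data["response_text"], data["h1"]

# Held-out test set is set aside and never touched until the model is finalized.
X_train, X_test, y_train, y_test = train_test_split(
    texts, scores, test_size=0.33, random_state=0, stratify=scores
)

candidate = Pipeline([
    ("features", TfidfVectorizer(ngram_range=(1, 2), lowercase=True)),
    ("model", LogisticRegression(max_iter=1000)),
])

# Grid search over competing settings using k-fold cross-validation on the train set.
search = GridSearchCV(
    candidate,
    param_grid={"features__ngram_range": [(1, 1), (1, 2)], "model__C": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="accuracy",  # exact agreement; a QWK scorer could be substituted
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)

# Only after the model is finalized: score the held-out test set once.
print("Held-out exact agreement:", search.score(X_test, y_test))
```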

The Modeling Pipeline
Preprocessing
  - Spell correct/term replace
  - Stopword removal
  - Lemmatization
  - Punctuation handling
  - Case handling
Feature Extraction
  - Term vectors (n-grams)
  - Base counts
  - Response characteristics
Scoring
  - Regression/classification
  - Parametric vs. non-parametric models
  - Choice of score to predict
  - Fit indices
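As an illustration of how the three stages might fit together in code, here is a hedged scikit-learn sketch; the specific preprocessing choices, feature set, ridge-regression scorer, and 0-2 rubric are assumptions for demonstration, not the CRASE engine or the workshop tool. It reuses the hypothetical X_train/X_test split from the k-fold sketch above.

```python
# Sketch of a preprocessing -> feature extraction -> scoring pipeline.
# All choices (lowercasing, punctuation stripping, n-grams plus base counts,
# ridge regression with rounding) are illustrative assumptions.
import re
import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import Ridge

def preprocess(texts):
    """Case handling and simple punctuation handling; spell correction would go here too."""
    return [re.sub(r"[^\w\s]", " ", t.lower()) for t in texts]

def base_counts(texts):
    """Response characteristics: word and character counts."""
    return np.array([[len(t.split()), len(t)] for t in texts])

features = FeatureUnion([
    ("ngrams", TfidfVectorizer(ngram_range=(1, 2), stop_words="english")),
    ("counts", FunctionTransformer(base_counts)),
])

pipeline = Pipeline([
    ("preprocess", FunctionTransformer(preprocess)),
    ("features", features),
    ("scorer", Ridge()),  # regression; predictions are rounded back to the rubric scale
])

pipeline.fit(X_train, y_train)  # hypothetical train split from the earlier sketch
predicted = np.clip(np.round(pipeline.predict(X_test)), 0, 2)  # assuming a 0-2 rubric
```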

Other Considerations
Protect against overfitting
Score modelling choices
  - Balance agreement with the score distribution
  - Quadratic weighted kappa is the best single metric for this
  - Pay attention to both
Manage aberrant responses
  - Usually through other means, but also through engine flagging
Fairness issues
  - How to handle?
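One simple way to act on the "protect against overfitting" point is to compare agreement on the training data itself with cross-validated agreement; a large gap suggests memorization. The sketch below reuses the hypothetical pipeline and split from the earlier examples.

```python
# Overfitting check sketch: train-set QWK vs. cross-validated QWK on the training sample.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import cohen_kappa_score

pipeline.fit(X_train, y_train)
train_pred = np.round(pipeline.predict(X_train)).astype(int)
train_qwk = cohen_kappa_score(y_train, train_pred, weights="quadratic")

cv_pred = np.round(cross_val_predict(pipeline, X_train, y_train, cv=5)).astype(int)
cv_qwk = cohen_kappa_score(y_train, cv_pred, weights="quadratic")

print(f"Train QWK: {train_qwk:.2f}  Cross-validated QWK: {cv_qwk:.2f}")
```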

15-minute break!

Hands-on Automated Scoring
Demo of machine scoring tool
A few notes:
  - Does not use the CRASE engine; a simplified scoring tool to demonstrate AS methods
  - Allows the user to upload data, review stats on responses, conduct limited preprocessing, feature extraction, and model-building, visualize the modeling process, and view results
  - We have data from the Automated Student Assessment Prize (ASAP) for constructed-response scoring for you to use

Toolkit Demo
  - Use your assigned link
  - Example item: ASAP data
Walk-through of functions
  - File upload
  - Train/test allocation proportions
  - Sample view
  - Describe PCA/LDA graphic and results below
  - Observe changes with preprocessing
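For readers without access to the assigned link, a PCA-style view of the responses can be approximated with a short sketch like the one below. Using TruncatedSVD over tf-idf features is an assumption about how such a graphic might be produced; it is not a description of the demo tool's PCA/LDA display.

```python
# Sketch of a PCA-style projection of responses, colored by human score.
# TruncatedSVD is used because tf-idf matrices are sparse; this only approximates
# the demo tool's PCA/LDA graphic. Reuses the hypothetical X_train/y_train split.
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(X_train)
points = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

scatter = plt.scatter(points[:, 0], points[:, 1], c=y_train, cmap="viridis", s=12)
plt.colorbar(scatter, label="human score")
plt.xlabel("component 1")
plt.ylabel("component 2")
plt.title("Responses projected onto two components")
plt.show()
```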

Task 1
'Play' with the various parameters and observe the performance of the engine, varying:
  1. Response processing parameters
  2. N-gram size
  3. Principal components
What was your measure of quality?
What was your approach to modelling?
What seemed to produce the best results?
Did anything reduce scoring quality?

Task 2
Do the same as before, but tune using the automation feature:
  1. Response processing parameters
  2. N-gram size
How did the results change across the various methods?
Does one score prediction model seem to work better than another?

Task 3
One of:
  - Determine the impact of overfitting your model, OR
  - Find a way to cheat your model!

Thank you!