Further Evaluation of Automated Essay Score Validity
P. Adam Kelly, Houston VA Medical Center and Baylor College of Medicine

Presentation transcript:

Further Evaluation of Automated Essay Score Validity
P. Adam Kelly, Houston VA Medical Center and Baylor College of Medicine
Paper Repository:

It would appear that we have reached the limits of what is possible to achieve with computer technology. – John von Neumann, computer scientist, 1949

Research Questions
- How do automated essay scoring models behave when:
  - the level of specificity of the models (“generic” vs. “prompt-specific”) is varied;
  - the essay task type (“discuss an issue” vs. “make an argument”) and the program type (grad school admissions vs. grade school achievement) are varied; and
  - the distributional assumptions of the independent and dependent variables are varied?
- What are the consequences of score interpretations/uses, as stated by end users?

The six aspects of evidence in Messick’s (1995) unitary validity framework

Essay Samples and Scoring Program
~1,800 GRE® Writing Assessment essays:
- “Issue” task: ~600 essays on 3 prompts, scored by raters and by computer
- “Argument” task: ~1,200 essays, scored by raters and by computer
~900 National Assessment of Educational Progress (NAEP) writing assessment essays:
- “Informative” task: ~450 essays, scored by raters and by computer
- “Persuasive” task: ~450 essays, scored by raters and by computer
e-rater™ (ETS Technologies, Inc.):
- Linear regression model: 59 variables covering content, rhetorical structure, and syntactic structure “features” of essays
- “Generic” models calibrated for multiple prompts, and “prompt-specific” models
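For concreteness, below is a minimal sketch of a regression-based scoring model in the spirit of this description, assuming three invented features (essay length, discourse-cue count, topical overlap) and simulated rater scores in place of e-rater's actual 59-variable feature set.

```python
# A minimal sketch of a regression-based essay scoring model in the spirit of
# the slide's description of e-rater's linear module. The three features and
# the rater scores below are simulated stand-ins, not ETS's actual feature set.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n_essays = 600

# Hypothetical content, rhetorical, and syntactic proxies for each essay.
X = np.column_stack([
    rng.normal(300, 60, n_essays),   # essay length in words
    rng.poisson(8, n_essays),        # count of discourse/rhetorical cue terms
    rng.normal(0.5, 0.1, n_essays),  # topical overlap with the prompt
])

# Simulated human rater scores on the 1-6 GRE W. A. scale.
y = np.clip(np.round(0.006 * X[:, 0] + 0.15 * X[:, 1] + 2.0 * X[:, 2]
                     + rng.normal(0, 0.5, n_essays)), 1, 6)

# A "generic" model is calibrated on essays pooled across prompts for one task
# type; a "prompt-specific" model would be fit to essays from a single prompt.
model = LinearRegression().fit(X, y)

# e-scores: predictions rounded and clipped to the reporting scale.
e_scores = np.clip(np.round(model.predict(X)), 1, 6)
print("calibration R^2:", round(model.score(X, y), 2))
```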

“In-Task” vs. “Out-of-Task” e-scores
- Using the GRE W. A. “Issue” generic model, generated “out-of-task” scores for ~900 “Argument” essays
- Using the GRE W. A. “Argument” generic model, generated “out-of-task” scores for ~400 “Issue” essays
- “Issue”: Proportions of agreement and correlations of “in-task” (correct) with “out-of-task” e-scores exceeded the statistics for “in-task” scores with rater scores (Kelly, 2001)
- “Argument”: The same pattern held
- Meaning: Models may be somewhat invariant to task type
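The comparisons on this slide and the next come down to three statistics: exact agreement, exact-plus-adjacent agreement, and a correlation between two score vectors. A small sketch of those calculations follows, using hypothetical 1-6 score arrays rather than the actual GRE or NAEP scores.

```python
# Sketch of the statistics used to compare "in-task" with "out-of-task"
# e-scores here, and "in-program" with "out-of-program" e-scores on the next
# slide: exact agreement, exact-plus-adjacent agreement, and Pearson r.
# The two score arrays are hypothetical, not the actual study data.
import numpy as np
from scipy.stats import pearsonr

def agreement_stats(scores_a, scores_b):
    """Exact agreement, exact + adjacent (within one point) agreement, Pearson r."""
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    diff = np.abs(scores_a - scores_b)
    exact = np.mean(diff == 0)
    adjacent = np.mean(diff <= 1)
    r, _ = pearsonr(scores_a, scores_b)
    return exact, adjacent, r

in_task  = np.array([4, 3, 5, 2, 6, 4, 3, 5, 4, 2])  # e-scores from the matching model
out_task = np.array([4, 3, 4, 2, 5, 4, 4, 5, 4, 3])  # e-scores from the other task's model

exact, adjacent, r = agreement_stats(in_task, out_task)
print(f"exact={exact:.2f}  exact+adjacent={adjacent:.2f}  r={r:.2f}")
```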

“In-Program” vs. “Out-of-Program” e-scores
- Using the GRE W. A. “Issue” generic model, generated “out-of-program” scores for ~450 NAEP “Informative” essays
- Using the GRE W. A. “Argument” generic model, generated “out-of-program” scores for ~450 NAEP “Persuasive” essays
- For both NAEP task types: Proportions of agreement and correlations of “in-program” (correct) with “out-of-program” e-scores fell well below the statistics for “in-program” e-scores with rater scores
- Meaning: Strong evidence of discrimination between programs

Generic vs. Prompt-Specific e-scores
Agreement of generic scoring model e-scores with prompt-specific model e-scores:

                              “Issue”    “Argument”
  Exact + adjacent agreement  > .95      > .90
  Correlation                 >

These statistics are similar in magnitude to the rater/e-rater agreement statistics presented in Kelly (2001).
Meaning: Evidence supporting the generalizability of e-scores from prompt-specific to generic models
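To make the generic vs. prompt-specific comparison concrete, the sketch below calibrates one pooled model for a task type and separate per-prompt models on simulated features, then checks how closely their rounded e-scores agree. The features, prompt assignments, and scores are invented for illustration and are not the GRE W. A. data.

```python
# Sketch of the generic vs. prompt-specific comparison: fit one pooled
# ("generic") model and one model per prompt, then compare their e-scores.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 600
prompts = np.repeat([0, 1, 2], 200)                   # ~200 simulated essays per prompt
X = rng.normal(size=(n, 5))                           # hypothetical essay features
y = np.clip(np.round(3.5 + X @ rng.normal(0.4, 0.1, 5)
                     + rng.normal(0, 0.5, n)), 1, 6)  # simulated 1-6 rater scores

def e_score(model, X):
    """Round and clip predictions to the 1-6 reporting scale."""
    return np.clip(np.round(model.predict(X)), 1, 6)

# One "generic" model calibrated on all prompts for the task type.
generic = e_score(LinearRegression().fit(X, y), X)

# One "prompt-specific" model per prompt.
specific = np.empty(n)
for p in np.unique(prompts):
    mask = prompts == p
    specific[mask] = e_score(LinearRegression().fit(X[mask], y[mask]), X[mask])

diff = np.abs(generic - specific)
print("exact:", np.mean(diff == 0).round(2),
      " exact+adjacent:", np.mean(diff <= 1).round(2),
      " r:", np.corrcoef(generic, specific)[0, 1].round(2))
```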

“Modified Model” e-scores
- e-rater’s linear regression module replaced with ordinal regression
- “Modified model” e-scores generated for the GRE essays
- Both task types: Proportions of agreement remained roughly constant, but correlations increased noticeably
- Meaning: An ordinal regression model may improve the accuracy of e-scores, especially in the extremes of the score distribution (e.g., 5s and 6s)
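As a rough illustration of the modification, the sketch below swaps a linear fit for a proportional-odds (ordinal logistic) model using statsmodels’ OrderedModel, treating the 1-6 score as ordered categories. The features and scores are simulated, and this is only one plausible reading of the “ordinal regression” replacement, not necessarily the model used in the study.

```python
# Rough sketch of the "modified model": a proportional-odds (ordinal logistic)
# regression in place of the linear module. Features and scores are simulated
# stand-ins for the GRE W. A. data.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({                            # hypothetical essay features
    "essay_length": rng.normal(300, 60, n),
    "discourse_cues": rng.poisson(8, n).astype(float),
    "topical_overlap": rng.normal(0.5, 0.1, n),
})

# Simulated 1-6 rater scores derived from a noisy latent writing-quality index.
latent = 0.01 * X["essay_length"] + 0.2 * X["discourse_cues"] + 4.0 * X["topical_overlap"]
y = pd.cut(latent + rng.normal(0, 1, n), bins=6, labels=[1, 2, 3, 4, 5, 6])

# Proportional-odds model: feature weights plus ordered category thresholds.
model = OrderedModel(y, X, distr="logit")
result = model.fit(method="bfgs", disp=False)

# e-score = category with the highest predicted probability (labels are 1..6).
probs = np.asarray(result.predict(X))
e_scores = probs.argmax(axis=1) + 1
```

Because the ordinal model estimates separate thresholds between adjacent score categories, it can place more probability mass in the extreme categories than a rounded linear prediction can, which is consistent with the slide’s note about 5s and 6s.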

Consequences of e-score interpretation/use
- How are the scores interpreted? Used? By whom? What are the implications of this?
- Interviewed graduate program admissions decision-makers: open-ended questions, by phone, recorded on audio tape
- The sample: 12 humanities, 18 social sciences, and 28 business graduate faculty

Examples of Responses …
Humanities:
- Not familiar with the GRE W. A. or e-rater
- Wouldn’t be inclined to use an essay test for admissions
- Concerned that computer scoring could undervalue creativity and ethnically diverse writing styles/formats
Social Sciences:
- Not familiar with the GRE W. A. or e-rater
- An essay test would likely be used only to assess English language proficiency
- Less concerned about a potential threat to creativity; some disciplines have rigid writing styles anyway

Examples of Responses …
Business:
- Didn’t realize that a computer currently helps score the GMAT W. A., or knew it but wasn’t affected by it
- Rarely use GMAT W. A. scores, and then only to assess English language proficiency
- Concerned that computer scoring could marginalize the W. A., but (may) feel it is already useless
Meaning: Since the scores are largely discounted by users, the consequences of interpretation/use are nonexistent (at present, at least).

Conclusions … (this year and last)
- Content representativeness evidence: The variables that “drive” e-rater are identifiable and constant, and they group into factors forming reasonably interpretable, parsimonious factor models (see the sketch after this list)
- Structural evidence: (Most of) the factors resemble the writing qualities listed in the GRE W. A. Scoring Guides, just as ETS Technologies has claimed
- Substantive evidence: Raters agreed that the “syntactic” and “content” factors are relevant, identifiable, and reflective of what a rater should look for, but were highly skeptical of the others
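The content representativeness claim rests on factor-analyzing the e-rater feature set; the sketch below shows that kind of exploratory analysis on a simulated 59-variable matrix standing in for the real data.

```python
# Sketch of the exploratory factor analysis behind the content-representativeness
# conclusion: do the scoring variables group into a small number of interpretable
# factors? The 59-column feature matrix here is simulated.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n_essays, n_features = 600, 59

# Simulate features driven by three latent dimensions (e.g., content,
# rhetorical structure, syntactic structure) plus noise.
latent = rng.normal(size=(n_essays, 3))
loadings = rng.normal(scale=0.8, size=(3, n_features))
features = latent @ loadings + rng.normal(scale=0.5, size=(n_essays, n_features))

fa = FactorAnalysis(n_components=3, random_state=0)
fa.fit(StandardScaler().fit_transform(features))

# Which observed variables load on which factor; the interpretability of these
# groupings is the kind of evidence the slide summarizes.
print(np.round(fa.components_[:, :8], 2))   # first 8 variables, for brevity
```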

Conclusions … (this year and last)
- Correlational evidence: Apparent strong discrimination of “in-program” from “out-of-program” essays; important for commercial applications across academic/professional fields
- Generalizability evidence: The use of less expensive “generic” models, trained only to the task type rather than to the prompt, appears to be supported
- Consequential evidence: Many graduate program admissions decision-makers do not use the GRE W. A. or GMAT W. A.; those who do use it mostly for diagnostic/remedial purposes (so the scores matter, but not for the reasons thought …)

“D**n this computer, I think that I shall sell it. It never does what I want it to do, only what I tell it!”