Automated Scoring for Speaking Assessments
Arizona English Language Learner Assessment
Irene Hunting - Arizona Department of Education
Yuan D'Antilio - Pearson
Erica Baltierra - Pearson
June 24, 2015
Arizona English Language Learner Assessment (AZELLA)
AZELLA is Arizona's own English Language Proficiency assessment and has been in use since school year 2006-2007. Arizona revised its English Language Proficiency (ELP) Standards following the adoption of the Arizona College and Career Ready Standards in 2010, and AZELLA had to be revised to align with the new ELP Standards. Arizona revised not only the alignment of AZELLA but also its administration practices and procedures. Revisions to the Speaking portion of AZELLA are particularly notable.
AZELLA Speaking Test Administration
Prior to School Year 2012-2013
- Administered orally by the test administrator
- One-on-one administration
- Scored by the test administrator
- Immediate scores
- Training for test administrators: minimal and not required
AZELLA Speaking Test Concerns
Prior to School Year 2012-2013
- Inconsistent test administration: not able to standardize test delivery
- Inconsistent scoring: not able to replicate or verify scoring
AZELLA Speaking Test Desires
For School Year 2012-2013 and beyond
- Consistent test administration: every student has the same testing experience
- Consistent and quick scoring: record student responses; reliability statistics for scoring
- Minimal burden for schools: no special equipment; no special personnel requirements or training; similar amount of time to administer
AZELLA Speaking Test Administration
For School Year 2012-2013 and beyond
- Consistent test administration: administered one-on-one via speaker telephone
- Consistent and quick scoring: student responses are recorded; reliable machine scoring
- Minimal burden for schools: requires a landline speaker telephone; no special personnel requirements or training; slightly longer test administration time
Proposed Solution
Meeting Notes (6/16/15): To provide a consistent test administration experience for all ELL students and consistent scoring for all speaking tests, Pearson worked with the Department to implement a telephone-based speaking assessment solution. This solution includes automated delivery of the speaking assessment and automated scoring of the test responses. Here is a quick walk-through of the solution: Tests are administered one-on-one. The test administrator dials a toll-free number and enters a test identification number to access the correct test form. The speaking test items are delivered through a speaker phone, and the timing of item presentation is controlled and standardized. Students' oral responses are collected over the phone, and the audio data are transferred back to our database for grading. A machine scoring algorithm processes the audio responses and produces a score for each student response.
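As a hedged illustration of the walk-through above, here is a minimal Python sketch of the delivery loop. All names (SpeakingItem, TestForm, administer, the play/record callbacks) are hypothetical stand-ins, not Pearson's actual telephone platform.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical sketch of the telephone-based delivery flow described above.
# Names are illustrative; this is not Pearson's actual system.

@dataclass
class SpeakingItem:
    item_id: str
    prompt_audio: str       # recorded prompt played over the speaker phone
    response_seconds: int   # standardized response window

@dataclass
class TestForm:
    test_code: str          # the test identification number keyed in by the administrator
    items: List[SpeakingItem]

def administer(form: TestForm,
               play: Callable[[str], None],
               record: Callable[[int], bytes]) -> Dict[str, bytes]:
    """Deliver each item with standardized timing and collect the audio responses."""
    responses: Dict[str, bytes] = {}
    for item in form.items:
        play(item.prompt_audio)                               # item delivered over the phone
        responses[item.item_id] = record(item.response_seconds)
    return responses   # audio is then transferred to the database for machine scoring
```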
Development of Automated Scoring Method
[Workflow diagram with the following elements: Test Developers, Test Spec, Item Text, Recorded Items, Field testing data, Human Transcribers, Human raters, Testing System, Automated Scores, Validation]
Meeting Notes (6/16/15): Next we're going to talk about how we developed the automated scoring for AZELLA Speaking and what it takes to set up a solution like this for states.
Why does automated scoring of speaking work?
- The acoustic models used for speech recognition are optimized for various accents and speaker populations: young children's speech, foreign accents
- The test questions have been modeled from field test data: the system anticipates the various ways that students respond
Field Tested Items
The test questions have been modeled from field test data – the system anticipates the various ways that students respond, e.g. "What is in the picture?"
Language Models
[Language model diagram showing candidate responses to "What is in the picture?", e.g. "It's a protractor", "protractor", "a compass", "I don't know"]
The system estimates the probability of each of these possible responses based on field test data. The responses from field testing were rated by human graders using the rubrics, so for each response we know what score a human grader would assign. We build the scoring algorithm from those responses and human scores so that the algorithm can perform like a human grader.
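To make the idea concrete, here is a toy sketch of fitting a scoring model to human-rated field-test responses. The features and numbers are invented for illustration; the operational AZELLA engine uses much richer evidence than a two-feature linear fit.

```python
import numpy as np

# Toy illustration: learn a mapping from response features to human scores.
# Features and values are invented; the real engine uses richer evidence
# (recognized words, pronunciation, fluency, etc.).

# Each row: [language-model probability of the recognized response, words per second]
features = np.array([
    [0.91, 2.1],   # e.g. "It's a protractor"
    [0.75, 1.8],
    [0.40, 1.0],   # e.g. "a compass"
    [0.15, 0.6],   # e.g. "I don't know"
])
human_scores = np.array([4.0, 3.0, 2.0, 1.0])   # rubric scores from human graders

# Least-squares fit of a linear model with an intercept term.
X = np.hstack([features, np.ones((len(features), 1))])
weights, *_ = np.linalg.lstsq(X, human_scores, rcond=None)

def machine_score(feature_vector) -> float:
    """Score a new response the way the fitted model would."""
    return float(np.append(feature_vector, 1.0) @ weights)

print(machine_score([0.85, 1.8]))   # a continuous score on the rubric scale
```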
Field Testing and Data Preparation
Two field tests: 2011-2012
Number of students: 31,685 (1st-12th grade), 13,141 (Kindergarten)

Stage   | Total tests | Used for building models | Used for validation
I       | 13,184      | 1,200                    | 333
II      | 10,646      | 300                      | –
III     | 9,369       | –                        | –
IV      | 6,439       | –                        | –
V       | 5,231       | –                        | –
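A hedged sketch of how one stage's field-test records could be partitioned into model-building and held-out validation sets; the actual AZELLA sampling procedure may have differed.

```python
import random

# Illustrative only: hold out part of a stage's field-test records for validation.

def split_stage(test_ids, n_build, n_validate, seed=0):
    """Randomly partition one stage's tests into model-building and validation sets."""
    rng = random.Random(seed)
    shuffled = list(test_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_build], shuffled[n_build:n_build + n_validate]

# Stage I in the table above: 13,184 tests, 1,200 for building models, 333 for validation.
build_ids, validation_ids = split_stage(range(13_184), n_build=1_200, n_validate=333)
print(len(build_ids), len(validation_ids))   # 1200 333
```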
Item Types for Automated Scoring

Item Type                          | Score Points | Domain
Syllabification                    | 0-1          | Oral Reading
Wordlist                           | 0-1          | Oral Reading
Repeat                             | 0-6          | Speaking
Questions about an image           | 0-4          | Speaking
Similarities and differences       | 0-4          | Speaking
Give directions from a map         | 0-4          | Speaking
Questions about a statement        | 0-4          | Speaking
Give instructions to do something  | 0-4          | Speaking
Open questions about a topic       | 0-4          | Speaking
Detailed responses to a topic      | 0-4          | Speaking

Automated scoring can handle a variety of item types, ranging from constrained tasks such as word lists to more open-ended tasks such as picture description and giving instructions.
Sample Speaking Rubric: 0-4 Point Item

Points | Descriptors
4      | Student formulates a response in correct, understandable English using two or more sentences based on the given stimuli. Student responds in complete declarative or interrogative sentences. Grammar errors are not evident and do not impede communication. Student responds with clear and correct pronunciation. Student responds using correct syntax.
3      | Student formulates a response in understandable English using two or more sentences based on the given stimuli. Sentences have minor grammatical errors. Student responds with clear and correct pronunciation.
2      | Student formulates an intelligible English response based on the given stimuli. Student does not respond in two complete declarative or interrogative sentences. Student responds with errors in grammar. Student attempts to respond with clear and correct pronunciation.
1      | Student formulates erroneous responses based on the given stimuli. Student does not respond in complete declarative or interrogative sentences. Student responds with significant errors in grammar. Student does not respond with clear and correct pronunciation.

The human rating rubric is holistic: it captures both the content of the speech production (what the student says) and the manner of production (how they say it) in terms of pronunciation, fluency, etc.
Sample Student Response

Item: "Next, please answer in complete sentences. Tell how to get ready for school in the morning. Include at least two steps."
Response transcript: "first you wake up and then you put on your clothes # and eat breakfast"
Human score: 3
Machine score: 3.35
Validity evidence: Are machine scores comparable to human scores?
Measures we looked at:
- Reliability (internal consistency)
- Candidate-level (or test-level) correlations
- Item-level correlations
Structural Reliability

Stage   | Human Cronbach α | Machine Cronbach α
I       | 0.98             | 0.99
II      | –                | –
III     | 0.96             | 0.94
IV      | 0.95             | –
V       | –                | –
Average | 0.97             | –
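For readers unfamiliar with the reliability figures above, here is a small self-contained example of computing Cronbach's α from a students-by-items score matrix; the matrix is toy data, not AZELLA results.

```python
import numpy as np

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total scores).
# The score matrix below is toy data, not AZELLA results.

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: 2-D array with one row per student and one column per item."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)       # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)   # variance of students' total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

scores = np.array([
    [4, 3, 4, 3],
    [2, 2, 3, 2],
    [1, 1, 2, 1],
    [3, 3, 3, 4],
], dtype=float)
print(round(cronbach_alpha(scores), 2))   # 0.95 for this toy matrix
```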
Scatterplots by Stage
[Scatterplot panels for Stage II, Stage III, Stage IV, and Stage V]
Item-Level Performance by Item Type

Item Type (Stage II)              | Human-human correlation | Machine-human correlation
Questions about an image          | 0.87                    | 0.86
Give directions from a map        | 0.82                    | 0.84
Open questions about a topic      | 0.75                    | 0.72
Give instructions to do something | 0.83                    | 0.80
Repeat                            | 0.95                    | 0.85

The human-human correlation gives us a baseline; machine performance closely approximates human rater performance. For some item types, when human raters do not agree well with each other, machine-human agreement drops as well.
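As a reference for how the two columns are computed, here is a toy example of the human-human baseline and the machine-human agreement as Pearson correlations; the scores are invented, not AZELLA data.

```python
import numpy as np

# Toy example of the two agreement measures in the table above.
# Scores are invented; they are not AZELLA data.

rater_1 = np.array([4, 3, 2, 4, 1, 3, 2, 0], dtype=float)
rater_2 = np.array([4, 3, 2, 3, 1, 2, 2, 1], dtype=float)
machine = np.array([3.8, 2.9, 2.2, 3.6, 0.7, 2.5, 2.1, 0.4])

human_human = np.corrcoef(rater_1, rater_2)[0, 1]    # inter-rater baseline
machine_human = np.corrcoef(machine, rater_1)[0, 1]  # engine vs. human

print(round(human_human, 2), round(machine_human, 2))
```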
Item-Level Performance by Item Type

Item Type (Stage IV)              | Human-human correlation | Machine-human correlation
Questions about an image          | 0.84                    | –
Give directions from a map        | 0.90                    | –
Open questions about a topic      | 0.82                    | –
Detailed response to a topic      | 0.85                    | 0.87
Give instructions to do something | –                       | –
Repeat                            | 0.96                    | 0.89

In some cases, machine scoring outperforms human raters in terms of consistency.
Summary of Score Comparability
Machine-generated scores are comparable to human ratings:
- Reliability (internal consistency)
- Test-level correlations
- Item-type-level correlations
Test Administration Preparation
- One-on-one practice: student and test administrator
- Demonstration video
- Landline speaker telephone for one-on-one administration
- Student Answer Document: unique Speaking Test Code
Test Administration
Test Administration Warm-Up Questions
- What is your first and last name?
- What is your teacher's name?
- How old are you?
Purpose of the Warm-Up Questions
- Student becomes more familiar with prompting
- Sound check for student voice level and equipment
- Capture demographic data to resolve future inquiries
- Responses are not scored
Challenges

Challenge                               | Solution
Landline speaker telephone availability | ADE purchased speaker telephones for the first year of administration
Difficulty scoring the young population | Additional warm-up questions; added beeps to prompt the student to respond; adjusted the acceptable audio threshold; rubric update and scoring engine recalibration
Incorrect Speaking Codes                | Captured demographics from warm-up questions; Speaking Code key entry process updated; documentation of test administrator name and time of administration
Summary
- Automated delivery and scoring of speaking assessments is a highly reliable solution for large-volume state assessments
- Standardized test delivery
- Minimal test set-up and training required
- Consistent scoring
- Availability of test data for analysis and review
Questions