0 Automated Scoring for Speaking Assessments
Arizona English Language Learner Assessment
Irene Hunting - Arizona Department of Education
Yuan D’Antilio - Pearson
Erica Baltierra - Pearson
June 24, 2015

1 Arizona English Language Learner Assessment AZELLA
AZELLA is Arizona’s own English language proficiency assessment and has been in use since school year. Arizona revised its English Language Proficiency (ELP) Standards following the adoption of the Arizona College and Career Ready Standards in 2010, so AZELLA had to be revised to align with the new ELP Standards. Arizona revised not only the alignment of AZELLA but also its administration practices and procedures. The revisions to the Speaking portion of AZELLA are particularly notable.

2 AZELLA Speaking Test Administration Prior to School Year 2012-2013
Administered orally by the test administrator
One-on-one administration
Scored by the test administrator
Immediate scores
Training for test administrators: minimal, not required

3 AZELLA Speaking Test Concerns Prior to School Year 2012-2013
Inconsistent test administration
Not able to standardize test delivery
Inconsistent scoring
Not able to replicate or verify scoring

4 AZELLA Speaking Test Desires For School Year 2012-2013 and beyond
Consistent test administration
  Every student has the same testing experience
Consistent and quick scoring
  Record student responses
  Reliability statistics for scoring
Minimal burden for schools
  No special equipment
  No special personnel requirements or training
  Similar amount of time to administer

5 AZELLA Speaking Test Administration For School Year 2012-2013 and beyond
Consistent test administration
  Administered one-on-one via speaker telephone
Consistent and quick scoring
  Student responses are recorded
  Reliable machine scoring
Minimal burden for schools
  Requires a landline speaker telephone
  No special personnel requirements or training
  Slightly longer test administration time

6 Proposed Solution
To provide a consistent test administration experience for all ELL students and consistent scoring for all speaking tests, Pearson worked with the Department to implement a telephone-based speaking assessment solution. The solution includes automated delivery of the speaking assessment and automated scoring of the test responses. Here is a quick walk-through: tests were administered one-on-one to students. The test administrator dialed a toll-free number and entered a test identification number to access the right test form. The speaking test items were delivered through a speaker phone, and the timing of item presentation was controlled and standardized. Students' oral responses were collected through the phone, and the audio data were transferred back to our database for grading. A machine scoring algorithm processes the audio responses and produces a score for each student response.
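As a rough sketch of the flow just described, the following Python code walks through the same steps: deliver each item with standardized timing, record the response, and pass the audio to the scoring engine. The function names, the 30-second response window, and the stubbed telephony and scoring steps are illustrative assumptions, not the actual Pearson system.

```python
# Minimal sketch of the telephone-based delivery and scoring flow described
# above. All names are hypothetical placeholders; the telephony, recording,
# and scoring steps are simulated with stubs rather than real integrations.
from typing import List


def play_prompt(prompt: str) -> None:
    """Stub: play an item prompt over the speakerphone."""
    print(f"PROMPT: {prompt}")


def record_response(seconds: int) -> bytes:
    """Stub: record the student's spoken response for a fixed time window."""
    return b"<audio bytes>"


def machine_score(audio: bytes) -> float:
    """Stub: automated scoring engine applied to one recorded response."""
    return 3.0


def administer_speaking_test(test_id: str, prompts: List[str]) -> List[float]:
    """Deliver items one at a time with standardized timing, then score them."""
    recordings = []
    for prompt in prompts:
        play_prompt(prompt)
        recordings.append(record_response(seconds=30))  # fixed response window
    # Recorded audio is transferred back for grading; here we score the stubs.
    return [machine_score(audio) for audio in recordings]


if __name__ == "__main__":
    # test_id stands in for the test identification number entered after dialing in.
    print(administer_speaking_test("12345", ["Tell how to get ready for school."]))
```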

7 Development of Automated Scoring Method
[Process diagram components: Test Developers, Test Spec, Item Text, Recorded Items, Field testing data, Human Transcribers, Human raters, Testing System, Automated Scores, Validation]
Next we're going to talk about how we developed the automated scoring for AZELLA Speaking and what it takes to set up a solution like this for states.

8 Why does automated scoring of speaking work?
The acoustic models used for speech recognition are optimized for various accents (young children's speech, foreign accents)
The test questions have been modeled from field test data
The system anticipates the various ways that students respond

9 Field Tested Items
The test questions have been modeled from field test data; the system anticipates the various ways that students respond, e.g., “What is in the picture?”

10 Language models
Example anticipated responses for the protractor item: "a", "It's protractor", "I don't know", "protractor", "a compass"
The system estimates the probability of each of those possible responses based on field test data. The responses from field testing were rated by human graders using the rubrics, so we know what score a human grader would assign to each response. We build the scoring algorithm from those responses and human scores so that the algorithm can perform like a human grader.
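To make this concrete, here is a toy sketch (not the operational scoring engine) of the two ideas above: estimating how likely each anticipated response is from field-test counts, and assigning a new recognized response the human score attached to its closest anticipated response. The sample responses and the 0/1 scores are assumed for the protractor item.

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical field-test data for one item: (recognized response, human score).
field_test = [
    ("it's a protractor", 1), ("it's a protractor", 1), ("a protractor", 1),
    ("protractor", 1), ("a compass", 0), ("i don't know", 0),
]

# Language-model side: estimate how likely each anticipated response is
# from how often it occurred in field testing.
counts = Counter(resp for resp, _ in field_test)
total = sum(counts.values())
response_prob = {resp: n / total for resp, n in counts.items()}

# Scoring side: the average human score attached to each anticipated response.
score_sums = Counter()
for resp, score in field_test:
    score_sums[resp] += score
avg_human_score = {resp: score_sums[resp] / counts[resp] for resp in counts}


def machine_score(recognized: str) -> float:
    """Score a recognized response via its most similar anticipated response."""
    best = max(avg_human_score, key=lambda r: SequenceMatcher(None, recognized, r).ratio())
    return avg_human_score[best]


print(response_prob)                        # how the model expects students to answer
print(machine_score("it is a protractor"))  # matches a high-scoring anticipated response
```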

11 Field Testing and Data Preparation
Two rounds of field testing
Number of students: 31,685 (1st–12th grade), 13,141 (Kindergarten)

Stage     Total tests   Used for building models   Used for validation
I         13,184        1,200                      333
II        10,646        -                          300
III       9,369         -                          -
IV        6,439         -                          -
V         5,231         -                          -
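The split between model-building and validation sets shown in the table can be illustrated with a short sketch like the one below; the response list, subset sizes, and function name are illustrative only, not the actual data-preparation pipeline.

```python
import random


def split_field_test(responses: list, n_build: int, n_validate: int, seed: int = 0):
    """Hold out separate model-building and validation subsets from field-test data."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = responses[:]
    rng.shuffle(shuffled)
    build = shuffled[:n_build]                          # used to train the scoring models
    validate = shuffled[n_build:n_build + n_validate]   # held out to check machine-human agreement
    return build, validate


# Illustrative sizes loosely mirroring Stage I in the table above.
stage_1 = [f"response_{i}" for i in range(13_184)]
build, validate = split_field_test(stage_1, n_build=1_200, n_validate=333)
print(len(build), len(validate))  # 1200 333
```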

12 Item Type for Automated Scoring
Item Type                           Score Point   Domain
Syllabification                     0-1           Oral Reading
Wordlist                            0-1           Oral Reading
Repeat                              0-6           Speaking
Questions about an image            0-4           Speaking
Similarities and differences        0-4           Speaking
Give directions from a map          0-4           Speaking
Questions about a statement         0-4           Speaking
Give instructions to do something   0-4           Speaking
Open questions about a topic        0-4           Speaking
Detailed responses to a topic       0-4           Speaking

Automated scoring can handle a variety of item types, ranging from constrained types such as the wordlist to more open-ended types such as picture description and giving instructions.

13 Sample Speaking Rubric: 0 – 4 Point Item
Points  Descriptors
4  Student formulates a response in correct, understandable English using two or more sentences based on the given stimuli. Student responds in complete declarative or interrogative sentences. Grammar errors are not evident and do not impede communication. Student responds with clear and correct pronunciation. Student responds using correct syntax.
3  Student formulates a response in understandable English using two or more sentences based on the given stimuli. Sentences have minor grammatical errors. Student responds with clear and correct pronunciation.
2  Student formulates an intelligible English response based on the given stimuli. Student does not respond in two complete declarative or interrogative sentences. Student responds with errors in grammar. Student attempts to respond with clear and correct pronunciation.
1  Student formulates erroneous responses based on the given stimuli. Student does not respond in complete declarative or interrogative sentences. Student responds with significant errors in grammar. Student does not respond with clear and correct pronunciation.

The human rating rubric is holistic: it captures both the content of the speech production (what the student says) and the manner of production (how the student says it) in terms of pronunciation, fluency, etc.

14 Sample student responses
Item: Next, please answer in complete sentences. Tell how to get ready for school in the morning. Include at least two steps.
Response transcript: first you wake up and then you put on your clothes # and eat breakfast
Human score: -
Machine score: -

15 Validity evidence: Are machine scores comparable to human scores?
Measures we looked at:
Reliability (internal consistency)
Candidate-level (or test-level) correlations
Item-level correlations
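For reference, the sketch below shows how these three measures could be computed from a students-by-items score matrix with NumPy, using simulated human and machine scores; it illustrates the standard formulas only and is not the analysis code used for AZELLA.

```python
import numpy as np


def cronbach_alpha(scores: np.ndarray) -> float:
    """Internal consistency for a students x items matrix of item scores."""
    k = scores.shape[1]                          # number of items
    item_var = scores.var(axis=0, ddof=1).sum()  # sum of per-item variances
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total test scores
    return (k / (k - 1)) * (1 - item_var / total_var)


rng = np.random.default_rng(0)
ability = rng.normal(size=(200, 1))                     # simulated underlying proficiency
human = ability + rng.normal(0, 0.5, size=(200, 10))    # simulated human item scores
machine = human + rng.normal(0, 0.3, size=human.shape)  # simulated machine scores near human

# Reliability (internal consistency) for each set of scores.
print(cronbach_alpha(human), cronbach_alpha(machine))

# Candidate-level (test-level) correlation: compare total test scores.
print(np.corrcoef(human.sum(axis=1), machine.sum(axis=1))[0, 1])

# Item-level correlations: compare scores one item at a time.
print([round(np.corrcoef(human[:, j], machine[:, j])[0, 1], 2) for j in range(human.shape[1])])
```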

16 Structural reliability
Stage     Human Cronbach α   Machine Cronbach α
I         0.98               0.99
II        -                  -
III       0.96               0.94
IV        0.95               -
V         -                  -
Average   0.97               -

17 Scatterplot by Stage
[Figure: scatterplots of score comparability, one panel per stage: Stage II, Stage III, Stage IV, Stage V]

18 Item-level performance: by item type
Item Type (Stage II)                Human-human correlation   Machine-human correlation
Questions about an image            0.87                      0.86
Give directions from a map          0.82                      0.84
Open questions about a topic        0.75                      0.72
Give instructions to do something   0.83                      0.80
Repeat                              0.95                      0.85

The human-human correlation gives us a baseline. Machine performance very closely approximates human rater performance. For some item types, when human raters do not agree with each other on scoring an item, machine-human agreement goes down as well.

19 Item-level performance: by item type
Item Type (Stage IV)                Human-human correlation   Machine-human correlation
Questions about an image            0.84                      -
Give directions from a map          0.90                      -
Open questions about a topic        0.82                      -
Detailed response to a topic        0.85                      0.87
Give instructions to do something   -                         -
Repeat                              0.96                      0.89

In some cases, machine grading outperforms human raters in terms of consistency.

20 Summary of Score Comparability
Machine-generated scores are comparable to human ratings:
Reliability (internal consistency)
Test-level correlations
Item-type-level correlations

21 Test Administration Preparation
One-on-one practice – student and test administrator
Demonstration video
Landline speaker telephone for one-on-one administration
Student Answer Document – unique Speaking Test Code

22 Test Administration

23 Test Administration Warm Up Questions
What is your first and last name?
What is your teacher’s name?
How old are you?

Purpose of the Warm Up Questions:
Student becomes more familiar with prompting
Sound check for student voice level and equipment
Capture demographic data to resolve future inquiries
Responses are not scored

24 Challenges

Challenge: Landline speaker telephone availability
Solution: ADE purchased speaker telephones for the first year of administration

Challenge: Difficulty scoring the young population
Solutions: Additional warm up questions; added beeps to prompt the student to respond; adjusting the acceptable audio threshold; rubric update and scoring engine recalibration

Challenge: Incorrect Speaking Codes
Solutions: Captured demographics from warm up questions; Speaking Code key entry process updated; documentation of test administrator name and time of administration

25 Summary
Automated delivery and scoring of speaking assessments is a highly reliable solution for large-volume state assessments:
Standardized test delivery
Minimal test set-up and training required
Consistent scoring
Availability of test data for analysis and review

26 Questions

