1
Automatic Assessment of the Speech of Young English Learners
Jian Cheng, Yuan Zhao D'Antilio, Xin Chen, Jared Bernstein
Knowledge Technologies, Pearson, Menlo Park, California, USA
BEA-2014
Copyright 2014 Pearson Education, Inc. or its affiliate(s). All rights reserved.
2
Overview
- Introduction
- Item type analysis
- Data
- Human transcriptions and scoring
- Machine scoring methods
- Experimental results
- Unscorable test detection
- Future work
- Conclusions
3
Introduction
- The Arizona English Language Learner Assessment (AZELLA) is an English Learner (EL) test administered in the state of Arizona for K-12 students by the Arizona Department of Education (ADE).
- Five stages: Kindergarten, Elementary, Primary, Middle and High School.
- AZELLA is a four-skills test. This research focuses on the speaking part, generating scores automatically.
- The first field test (Stages 2-5) took place around November 2011; Pearson Knowledge Technologies (PKT) delivered over 31K tests.
- The second field test (Stage 1) took place around April 2012; PKT delivered over 13K tests.
- The first operational AZELLA test with automatic speech scoring took place between January and February 2013, with approximately 140K tests delivered.
- After that, PKT is expected to deliver around 180K tests annually.
4
Item type analysis
Constrained item types:
- Naming
- Read syllables for one word
- Read a three-word sequence
- Repeat
Fairly unconstrained item types:
- Questions about an image
- Give directions from a map
- Ask questions about a thing
- Open questions about a topic
- Give instructions to do something
- Similarities & differences
- Ask questions about a statement
- Detailed response to a topic
5
Data
- From the data in the first field test (Stages 2-5), for each AZELLA stage we randomly sampled 300 tests (75 tests/form x 4 forms) as a validation set and 1,200 tests as a development set.
- From the data in the second field test (Stage 1), we randomly sampled 167 tests from the four forms as the validation set and 1,200 tests as the development set.
- No validation data was used for model training.
6
Human transcriptions and scoring
- In the development sets, we needed 100 to 300 responses per item to be transcribed, depending on the complexity of the item type.
- All responses from the tests were scored by trained professional raters according to predefined ADE rubrics. Every response has one trait: a human holistic score.
- We used the average score from the different raters as the final score during machine learning.
- The responses in each validation set were double rated (producing two final scores) for use in validation.
- For responses to open-ended item types, the AZELLA holistic score rubrics require raters to consider both the content and the manner of speaking used in the response.
7
Machine scoring methods
We used different features (content and manner) derived from speech to predict the final human holistic score.
- ASR (Automatic Speech Recognition): acoustic models, language models
- Content modeling
- Duration modeling
- Spectral modeling
- Confidence modeling
- Final models
8
Machine scoring methods - Content modeling
Content indicates how well the test-taker understood the prompt and could respond with appropriate linguistic content.
- has_keywords: the occurrence of the correct sequence of syllables or words.
- word_errors: the minimum number of substitutions, deletions, and/or insertions required to find the best string match between the response and the answer choices (sketched below).
- word_vector: a scaled, weighted sum of the occurrences of a large set of expected words and word sequences that may be recognized in the spoken response. Weights are assigned to the expected words and word sequences according to their relation to good responses using LSA; this is done automatically.
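A minimal sketch of the two content features above, not PKT's implementation: word_errors as an edit distance between the recognized word string and an answer choice, and word_vector as a weighted sum of expected-word counts. The example weights are placeholders; in the system they are derived automatically with LSA.

```python
def word_errors(recognized, answer):
    """Minimum substitutions, deletions and insertions to turn `recognized` into `answer`."""
    m, n = len(recognized), len(answer)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if recognized[i - 1] == answer[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def word_vector(recognized, expected_weights):
    """Weighted sum of occurrences of expected words (real weights would come from LSA)."""
    return sum(expected_weights.get(w, 0.0) for w in recognized)

hyp = "the boy is riding a bike".split()
ref = "the boy rides a bike".split()
print(word_errors(hyp, ref))                                   # -> 2
print(word_vector(hyp, {"boy": 1.0, "bike": 0.8, "riding": 0.6}))  # -> 2.4
```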
9
Machine scoring methods - Duration modeling
- Duration features capture whether test-takers produce the correct durations for different phonemes.
- The duration statistics models were built from native data from an unrelated test, the Versant Junior English Test.
- The statistics of the phoneme durations of native responses were stored as non-parametric cumulative density functions (CDFs).
- Duration statistics from native speakers were used to compute the log likelihood of the phoneme durations produced by candidates (a sketch follows below).
- If enough samples existed for a phoneme in a specific word, we built a unique duration model for that phoneme in context.
- Features were computed for all phones vs. pause.
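A minimal sketch, under simplifying assumptions, of how native duration statistics could be turned into a log-likelihood feature: an empirical (histogram-based) distribution per phone, and an average per-phone log likelihood over a candidate's aligned segments. The class and function names are illustrative, not PKT's code.

```python
import numpy as np

class DurationModel:
    """Empirical distribution of native durations (seconds) for one phone."""
    def __init__(self, native_durations, bins=20):
        density, edges = np.histogram(native_durations, bins=bins, density=True)
        self.density = density
        self.edges = edges
        self.floor = 1e-4  # floor so unseen durations do not give -inf

    def log_likelihood(self, duration):
        idx = np.searchsorted(self.edges, duration, side="right") - 1
        if 0 <= idx < len(self.density):
            return float(np.log(max(self.density[idx], self.floor)))
        return float(np.log(self.floor))

def duration_feature(models, candidate_segments):
    """Average per-phone log likelihood over a candidate's response.

    `candidate_segments` is a list of (phone, duration) pairs from the ASR alignment.
    """
    scores = [models[ph].log_likelihood(d) for ph, d in candidate_segments if ph in models]
    return sum(scores) / len(scores) if scores else 0.0
```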
10
Machine scoring methods - Spectral modeling
- To cover manner of speaking beyond duration, we computed a few spectral likelihood features according to native and learner segment models applied to the recognition alignment of segmental units.
- We did a forced alignment of the utterance on the word string from the recognized sentence using the native mono acoustic model.
- For every phoneme, using the time boundaries from that forced alignment as a constraint, we did an allphone recognition, again using the native mono acoustic model.
- Different features were computed using different phonemes of interest.
- ppm: the percentage of phonemes from the allphone recognition matching the phonemes from the forced alignment (sketched below).
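A minimal sketch of the ppm feature under a simplifying assumption: the forced alignment and the allphone recognition are compared segment by segment over the same time boundaries (the labels and function name are illustrative).

```python
def ppm(forced_phones, allphone_phones):
    """Percentage of segments where the allphone result matches the forced alignment."""
    assert len(forced_phones) == len(allphone_phones)
    matches = sum(f == a for f, a in zip(forced_phones, allphone_phones))
    return 100.0 * matches / len(forced_phones)

print(ppm(["dh", "ax", "b", "oy"], ["dh", "ah", "b", "oy"]))  # -> 75.0
```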
11
Machine scoring methods - Confidence modeling
- After speech recognition, we can assign confidence scores to words and phonemes.
- Then, for every response, we can compute the average confidence and the percentage of words or phonemes whose confidence is lower than a threshold value as features to predict test-takers' performance (sketched below).
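A minimal sketch of these two confidence features, assuming the recognizer returns one confidence score in [0, 1] per recognized word; the threshold value here is a placeholder.

```python
def confidence_features(word_confidences, threshold=0.5):
    """Average word confidence and fraction of words below a threshold."""
    if not word_confidences:
        return {"avg_conf": 0.0, "frac_low_conf": 1.0}
    avg = sum(word_confidences) / len(word_confidences)
    low = sum(c < threshold for c in word_confidences) / len(word_confidences)
    return {"avg_conf": avg, "frac_low_conf": low}

print(confidence_features([0.9, 0.8, 0.3, 0.95]))  # {'avg_conf': 0.7375, 'frac_low_conf': 0.25}
```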
12
Machine scoring methods - Final models
- The features word_vector, has_keywords, word_errors and percent_correct effectively capture content scores based on what is spoken.
- The features log_seg_prob, iw_log_seg_prob, spectral_1 and spectral_2 effectively capture both the rhythmic and segmental aspects of the performance as native likelihoods of producing the observed base physical measures.
- By combining these features, we can effectively predict human holistic scores.
- PKT tried both simple multiple linear regression models and neural network models and selected the best models; in most cases, the neural network models performed better (a sketch of this combination step follows below).
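A minimal sketch of the combination step, assuming a feature matrix whose columns are the content and manner features above and whose targets are the averaged human holistic scores. It uses scikit-learn stand-ins for the two model families; the data here is synthetic and PKT's actual models may differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

# Synthetic placeholder data: 200 responses, 6 content/manner features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = X @ np.array([0.5, -0.3, 0.2, 0.4, 0.1, 0.3]) + rng.normal(scale=0.1, size=200)

# Candidate models: multiple linear regression vs. a small neural network.
linear = LinearRegression().fit(X, y)
mlp = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=0).fit(X, y)

# In practice the better model would be chosen on held-out human scores.
print("linear R^2:", linear.score(X, y))
print("mlp R^2:   ", mlp.score(X, y))
```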
13
Experimental results
- All results presented here are validation results on the validation sets. The models knew nothing about the validation sets.
- Figure: distribution of average human holistic scores of participants in the validation set for Stage 5 (Grades 9-12).
14
Experimental results - Stage I
15
Experimental results - Stages II, III, IV, V (item level)
16
Experimental results - Stages II, III, IV, V (participant level)
17
Experimental results
18
Experimental results - Test reliability by stage
19
Unscorable test detection
- There were several outliers for which the machine scores were significantly lower than the human scores.
- The main reason is low Signal-to-Noise Ratio (SNR): either the background noise was high or the speech was low in volume (low-volume recordings made by shy kids). Such cases are hard for ASR.
- The solution is to filter these calls out and pass them to human grading.
- We identified features to deal with low-volume tests: maximum energy, the number of frames with fundamental frequency, etc. (sketched below), plus many features mentioned in Cheng and Shen (2011), to build an unscorable test detector.
- More details in the poster this afternoon: Angeliki Metallinou, Jian Cheng, "Syllable and language model based features for detecting non-scorable tests in spoken language proficiency assessment applications".
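A minimal sketch, under assumptions about the feature definitions, of energy-based measures that could flag low-volume or noisy recordings before machine scoring; this is illustrative and not PKT's detector.

```python
import numpy as np

def energy_features(samples, sample_rate=8000, frame_ms=25):
    """Max frame energy (dB) and a crude SNR estimate from frame-level energies."""
    samples = np.asarray(samples, dtype=float)
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-10)
    speech_level = np.percentile(energy_db, 95)   # loudest frames, roughly speech
    noise_level = np.percentile(energy_db, 10)    # quietest frames, roughly background
    return {"max_energy_db": float(energy_db.max()),
            "snr_est_db": float(speech_level - noise_level)}

# A response whose max energy or estimated SNR falls below a tuned threshold
# would be routed to human grading instead of machine scoring.
```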
20
Future work
- Train a better native acoustic model using more native data from the AZELLA project, once we obtain demographic information for test-takers.
- Catch soft or noisy calls automatically and exclude them from machine grading.
- For Repeat items, we used a simple average as the final score; a partial-credit Rasch model may improve performance.
- The current items in the forms did not go through a post-screening process; if we select only the items with the best predictive power for the test forms, the correlations could improve.
- Some kids speak very softly; this problem should be fixed.
- Apply deep neural network (DNN) acoustic models instead of traditional GMM-HMMs to achieve better performance.
21
Angeliki Metallinou, Jian Cheng, "Using Deep Neural Networks to Improve Proficiency Assessment for Children English Language Learners", to appear in Interspeech, September 2014, Singapore.
On AZELLA Stage II data:
- Experimental results show that the DNN-based recognition approach achieved a 31% relative WER reduction compared to GMM-HMMs.
- The average item-type-level correlation increased from 0.772 (the result in this paper) to 0.795 (new GMM-HMMs) to 0.826 (DNN-HMMs), a 0.054 absolute improvement.
22
Post validation studies
After this study, we went through several post-validation studies, and our customer (the Arizona Department of Education) is happy with the results.
23
Conclusions
- We considered both what the student says and the way in which the student speaks to generate the final holistic scores.
- We provided validity evidence for machine-generated scores: the average human-machine correlation was 0.92.
- The assessments include 10 open-ended item types. For 9 of the 10 open item types, machine scoring performed at a level similar to human scoring at the item-type level.
- We described the design, implementation and evaluation of a detector to catch problematic, unscorable tests.
- Automatic assessment of the speech of young English learners works, and it works well.