Copyright  2014 Pearson Education, Inc. or its affiliate(s). All rights reserved. Automatic Assessment of the Speech of Young English Learners Jian Cheng,

Slides:



Advertisements
Similar presentations
WMS-IV Wechsler Memory Scale - Fourth Edition
Advertisements

Tuning Jenny Burr August Discussion Topics What is tuning? What is the process of tuning?
Introduction to: Automated Essay Scoring (AES) Anat Ben-Simon Introduction to: Automated Essay Scoring (AES) Anat Ben-Simon National Institute for Testing.
A Tale of Two Tests STANAG and CEFR Comparing the Results of side-by-side testing of reading proficiency BILC Conference May 2010 Istanbul, Turkey Dr.
Language Assessment System (LAS) Links TM Census Test.
Linear Model Incorporating Feature Ranking for Chinese Documents Readability Gang Sun, Zhiwei Jiang, Qing Gu and Daoxu Chen State Key Laboratory for Novel.
Vocal Emotion Recognition with Cochlear Implants Xin Luo, Qian-Jie Fu, John J. Galvin III Presentation By Archie Archibong.
Edition Version 1-11 Presented by Language Acquisition Branch.
Development of Automatic Speech Recognition and Synthesis Technologies to Support Chinese Learners of English: The CUHK Experience Helen Meng, Wai-Kit.
Confidential and Proprietary. Copyright © 2010 Educational Testing Service. All rights reserved. Catherine Trapani Educational Testing Service ECOLT: October.
Multiple Criteria for Evaluating Land Cover Classification Algorithms Summary of a paper by R.S. DeFries and Jonathan Cheung-Wai Chan April, 2000 Remote.
VESTEL database realistic telephone speech corpus:  PRNOK5TR: 5810 utterances in the training set  PERFDV: 2502 utterances in testing set 1 (vocabulary.
Latent Semantic Analysis (LSA). Introduction to LSA Learning Model Uses Singular Value Decomposition (SVD) to simulate human learning of word and passage.
Introduction to Automatic Speech Recognition
English Language Development Assessment (ELDA) Background to ELDA for Test Coordinator and Administrator Training Mike Fast, AIR CCSSO/LEP-SCASS March.
® Automatic Scoring of Children's Read-Aloud Text Passages and Word Lists Klaus Zechner, John Sabatini and Lei Chen Educational Testing Service.
NYC Schools Task Alignment Project The Common Core State Standards Aligning and Rating Math Tasks April 28, 2011.
Knowledge Base approach for spoken digit recognition Vijetha Periyavaram.
Professional Development by Johns Hopkins School of Education, Center for Technology in Education Supporting Individual Children Administering the Kindergarten.
Building Effective Assessments. Agenda  Brief overview of Assess2Know content development  Assessment building pre-planning  Cognitive factors  Building.
Supervisor: Dr. Eddie Jones Electronic Engineering Department Final Year Project 2008/09 Development of a Speaker Recognition/Verification System for Security.
Classroom Assessment A Practical Guide for Educators by Craig A
Automated Scoring of Picture- based Story Narration Swapna Somasundaran Chong Min Lee Martin Chodorow Xinhao Wang.
Hierarchical Dirichlet Process (HDP) A Dirichlet process (DP) is a discrete distribution that is composed of a weighted sum of impulse functions. Weights.
A Multimedia English Learning System Using HMMs to Improve Phonemic Awareness for English Learning Yen-Shou Lai, Hung-Hsu Tsai and Pao-Ta Yu Chun-Yu Chen.
Recognition of spoken and spelled proper names Reporter : CHEN, TZAN HWEI Author :Michael Meyer, Hermann Hild.
Arizona English Language Learner Assessment AZELLA
1 Boostrapping language models for dialogue systems Karl Weilhammer, Matthew N Stuttle, Steve Young Presenter: Hsuan-Sheng Chiu.
The Four P’s of an Effective Writing Tool: Personalized Practice with Proven Progress April 30, 2014.
ELIS-DSSP Sint-Pietersnieuwstraat 41 B-9000 Gent Recognition of foreign names spoken by native speakers Frederik Stouten & Jean-Pierre Martens Ghent University.
 Field Experience Evaluations PSU Special Educator Programs Confidence... thrives on honesty, on honor, on the sacredness of obligations, on faithful.
Conditional Random Fields for ASR Jeremy Morris July 25, 2006.
Speech Communication Lab, State University of New York at Binghamton Dimensionality Reduction Methods for HMM Phonetic Recognition Hongbing Hu, Stephen.
Voice Activity Detection based on OptimallyWeighted Combination of Multiple Features Yusuke Kida and Tatsuya Kawahara School of Informatics, Kyoto University,
Combining Speech Attributes for Speech Recognition Jeremy Morris November 9, 2006.
ELIS-DSSP Sint-Pietersnieuwstraat 41 B-9000 Gent SPACE Symposium - 05/02/091 Objective intelligibility assessment of pathological speakers Catherine Middag,
THREE DIMENSIONS OF A HIGH-QUALITY RUBRIC Created by Shauna Denson “Classroom Assessment for Student Learning”(J. Chappuis)
Copyright © 2013 by Educational Testing Service. All rights reserved. Evaluating Unsupervised Language Model Adaption Methods for Speaking Assessment ShaSha.
Automatic Pronunciation Scoring of Specific Phone Segments for Language Instruction EuroSpeech 1997 Authors: Y. Kim, H. Franco, L. Neumeyer Presenter:
The Arizona English Language Learner Assessment (AZELLA)
Statistical Models for Automatic Speech Recognition Lukáš Burget.
1 ICASSP Paper Survey Presenter: Chen Yi-Ting. 2 Improved Spoken Document Retrieval With Dynamic Key Term Lexicon and Probabilistic Latent Semantic Analysis.
Phone-Level Pronunciation Scoring and Assessment for Interactive Language Learning Speech Communication, 2000 Authors: S. M. Witt, S. J. Young Presenter:
Evaluation Results MRI’s Evaluation Activities: Surveys Teacher Beliefs and Practices (pre/post) Annual Participant Questionnaire Data Collection.
LEAP TH GRADE. DATES: APRIL 25-29, 2016 Test Administration Schedule:  Day 1 April 25- ELA Session 1: Research Simulation Task (90mins) Mathematics.
Objectives of session By the end of today’s session you should be able to: Define and explain pragmatics and prosody Draw links between teaching strategies.
Assistant Instructor Nian K. Ghafoor Feb Definition of Proposal Proposal is a plan for master’s thesis or doctoral dissertation which provides the.
Welcome Parents! FCAT Information Session. O Next Generation Sunshine State Standards O Released Test Items O Sample Test.
Christoph Prinz / Automatic Speech Recognition Research Progress Hits the Road.
AAPPL Assessment Follow Up June What is AAPPL Measure? The ACTFL Assessment of Performance toward Proficiency in Languages (AAPPL) is a performance-
1 Minimum Bayes-risk Methods in Automatic Speech Recognition Vaibhava Geol And William Byrne IBM ; Johns Hopkins University 2003 by CRC Press LLC 2005/4/26.
The Arizona English Language Learner Assessment (AZELLA)
Dean Luo, Wentao Gu, Ruxin Luo and Lixin Wang
Data Conventions and Analysis: Focus on the CAEP Self-Study
Olivier Siohan David Rybach
Evaluating Student-Teachers Using Student Outcomes
INTRODUCTION TO THE ELPAC
Online Multiscale Dynamic Topic Models
Dean Luo, Wentao Gu, Ruxin Luo and Lixin Wang
Erasmus University Rotterdam
Conditional Random Fields for ASR
Statistical Models for Automatic Speech Recognition
Confidential - For internal NYSED Use Only - Not for Distribution
Automatic Fluency Assessment
Mapping it Out! Practical Tools to Use Assessment Well
Office of Education Improvement and Innovation
Your introduction to this year’s English exam.
Anastassia Loukina, Klaus Zechner, James Bruno, Beata Beigman Klebanov
The Arizona English Language Learner Assessment (AZELLA)
AP U.S. History Exam Details
Presentation transcript:

Copyright  2014 Pearson Education, Inc. or its affiliate(s). All rights reserved. Automatic Assessment of the Speech of Young English Learners Jian Cheng, Yuan Zhao D’Antilio, Xin Chen, Jared Bernstein Knowledge Technologies, Pearson Menlo Park, California, USA BEA-2014

Copyright  2014 Pearson Education, Inc. or its affiliate(s). All rights reserved. Overview Introduction Item type analysis Data Human transcriptions and scoring Machine scoring methods Experimental results Unscorable test detection Future work Conclusions

Copyright  2014 Pearson Education, Inc. or its affiliate(s). All rights reserved. Introduction Arizona English Language Learner Assessment (AZELLA) is an English Learners (ELs) test administrated in the state of Arizona for K-12 students by Arizona Department of Education (ADE). Five stages: K., Elementary, Primary, Middle and High School. AZELLA is a four skills test. This research focuses on speaking part to generate scores automatically. The first field test (stage 2-5) took place around Nov Pearson Knowledge Technologies (PKT) delivered over 31K tests. The second field test (stage 1) took place around April PKT delivered over 13K tests. The first operational AZELLA test with automatic speech scoring took place between January and February, 2013, with approximately 140K tests delivered. After that, annually PKT are supposed to deliver around 180k tests.

Copyright  2014 Pearson Education, Inc. or its affiliate(s). All rights reserved. Item type analysis Constrained item types: Naming Read syllables for one word Read three words sequence Repeat Fairly unconstrained item types: Questions about image Give directions from map Ask questions about a thing Open questions about a topic Give instructions to do something Similarities & differences Ask questions about a statement Detailed response to a topic

Copyright  2014 Pearson Education, Inc. or its affiliate(s). All rights reserved. Data From the data in the first field test (Stages 2-5), for each AZELLA stage, we randomly sampled 300 tests (75 tests/form x 4 forms) as a validation set and 1,200 tests as a development set. For the data in the second field test (Stage 1), we randomly sampled 167 tests from the four forms as the validation set and 1,200 tests as the development set. No validation data was used for model training.

Copyright  2014 Pearson Education, Inc. or its affiliate(s). All rights reserved. Human transcriptions and scoring In the development sets, we needed from 100 to 300 responses per item to be transcribed, depending on the complexity of the item type. All responses from the tests were scored by trained professional raters according to predefined ADE rubrics. Every response has one trait: human holistic score. We used the average score from different raters as the final score during machine learning. The responses in each validation set were double rated (producing two final scores) for use in validation. For the responses of open-ended item types, AZELLA holistic score rubrics require to consider both the content and the manner of speaking used in the response.

Copyright  2014 Pearson Education, Inc. or its affiliate(s). All rights reserved. Machine scoring methods We used different features (content and manner) derived from speech to predict the final human holistic score. ASR (Automatic Speech Recognition) Acoustic models Language models Content modeling Duration modeling Spectral modeling Confidence modeling Final models

Copyright  2014 Pearson Education, Inc. or its affiliate(s). All rights reserved. Machine scoring methods- Content Modeling Content indicates how well the test-taker understood the prompt and could respond with appropriate linguistic content. has_keywords: the occurrence of correct sequence of syllables or words. word_errors: the minimum number of substitutions, deletions, and/or insertions required to find a best string match in the response to the answer choices. word_vector: scaling the weighted sum of the occurrence of a large set of expected words and word sequences that may be recognized in the spoken response. Weights are assigned to the expected words and word sequences according to their relation to the good responses using LSA. It was done automatically.

Copyright  2014 Pearson Education, Inc. or its affiliate(s). All rights reserved. Machine scoring methods- Duration modeling It can catch if test-takers produce the correct duration for different phonemes. The duration statistics models were built from native data from an unrelated test called the Versant Junior English Test. The statistics of the phoneme durations of native responses were stored as non-parametric cumulative density functions (CDFs). Duration statistics from native speakers were used to compute the log likelihood for durations of phonemes produced by candidates. If enough samples for a phoneme in a specific word existed, we built a unique duration model for this phoneme in context. All phones vs. pause

Copyright  2014 Pearson Education, Inc. or its affiliate(s). All rights reserved. Machine scoring methods- Spectral modeling To consider manner scoring more than duration, we computed few spectral likelihood features according to native and learner segment models applied to the recognition alignment of segmental units. We did force alignment of the utterance on the word string from the recognized sentence using the native mono acoustic model. For every phoneme, using the previous time boundary constrain from the native mono acoustic model, we did an allphone recognition using the native mono acoustic model again. Different features by using different interested phonemes. ppm: the percentage of phonemes from the allphone recognition matching to the phonemes from the force alignment.

Copyright  2014 Pearson Education, Inc. or its affiliate(s). All rights reserved. Machine scoring methods- Confidence modeling After finishing speech recognition, we can assign speech confidence scores to words and phonemes. Then for every response, we may compute the average confidence, the percentage of words or phonemes whose confidences are lower than a threshold value as features to predict test-takers' performance.

Copyright  2014 Pearson Education, Inc. or its affiliate(s). All rights reserved. Machine scoring methods- Final models Features word_vector, has_keywords, word_errors, percent_correct can effectively define content scores based on what is spoken. Features log_seg_prob, iw_log_seg_prob, spectral_1, spectral_2 can effectively define both the rhythmic and segmental aspects of the performance to be native likelihoods of producing the observed base physical measures. By combining these features together, we can predict effectively human's holistic scores. PKT tried both simple multiple linear regression models and neural network models and selected the best models. In most of the cases, the neural network models had better performances.

Copyright  2014 Pearson Education, Inc. or its affiliate(s). All rights reserved. Experimental results All results presented here are validation results used the validation sets. The models built knew nothing about the validation sets. Distribution of average human holistic score of participants in the validation set for Stage 5 (Grade 9-12)

Copyright  2014 Pearson Education, Inc. or its affiliate(s). All rights reserved. Experimental results – Stage I

Copyright  2014 Pearson Education, Inc. or its affiliate(s). All rights reserved. Experimental results – Stage II, III, IV, V (item level)

Copyright  2014 Pearson Education, Inc. or its affiliate(s). All rights reserved. Experimental results – Stage II, III, IV, V (participant level)

Copyright  2014 Pearson Education, Inc. or its affiliate(s). All rights reserved. Experimental results:

Copyright  2014 Pearson Education, Inc. or its affiliate(s). All rights reserved. Experimental results – Test reliability by stage

Copyright  2014 Pearson Education, Inc. or its affiliate(s). All rights reserved. Unscorable test detection There are several outliers that the machine scores were significant lower than human scores. The main reason is basically low Signal- to-Noise Ratio (SNR), either the background noise was so high, or speech voice was low (low volume recordings made by shy kids). For those cases, it is hard for ASR. The solution could be filtering these calls out and pass them to human grading. We identified features to deal with low-volume tests: maximum energy, the number of frames with fundamental frequency, etc., plus many features mentioned in Cheng and Shen (2011) to build a unscorable test detector. More details in the poster this afternoon: Angeliki Metallinou, Jian Cheng, "Syllable and language model based features for detecting non-scorable tests in spoken language proficiency assessment applications”

Copyright  2014 Pearson Education, Inc. or its affiliate(s). All rights reserved. Future work We may train a better native acoustic model that uses more native data from AZELLA project after we got the demographic information for test-takers. We may catch soft or noise calls automatically to exclude them from machine grading. For repeat, we used simple average as the final scores. We may use a partial credit Rasch model to improve the performance. The current items in forms didn't go through a post screening process, if we only select the items that have the best prediction power to the test forms, the correlations could be improved. Some kids speak significantly soft. This problem should be fixed. Apply deep neural network (DNN) acoustic models instead of traditional GMM-HMMs to achieve better performance.

Copyright  2014 Pearson Education, Inc. or its affiliate(s). All rights reserved. Angeliki Metallinou, Jian Cheng, "Using Deep Neural Networks to Improve Proficiency Assessment for Children English Language Learners”, to appear in Interspeech, September 2014, Singapore. Target on AZELLA Stage II data: Experimental results show that the DNN-based recognition approach achieved 31% relative WER reduction when compared to GMM-HMMs. Averaged item-type level correlation increased from (the result in this paper) to (new GMM-HMMs) to (DNN- HMMs), which is a absolute improvement.

Copyright  2014 Pearson Education, Inc. or its affiliate(s). All rights reserved. Post Validation Studies After this study, we went through several post validation studies, our customer (Arizona Department of Education) is happy about these results.

Copyright  2014 Pearson Education, Inc. or its affiliate(s). All rights reserved. Conclusions We considered both what the student says and the way in which the student speaks to generate the final holistic scores. We provided validity evidence for machine-generated scores. The average human-machine correlation The assessments include 10 open-ended item types. For 9 of the 10 open item types, machine scoring performed at a similar level human scoring at the item-type level. We described the design, implementation and evaluation of a detector to catch problematic, unscorable tests. Automatic assessment of the speech of young English learners works. It works well.

Copyright  2014 Pearson Education, Inc. or its affiliate(s). All rights reserved.