Dean Luo, Wentao Gu, Ruxin Luo and Lixin Wang


Investigation of the Effects of Automatic Scoring Technology on Human Raters' Performances in L2 Speech Proficiency Assessment
Dean Luo, Wentao Gu, Ruxin Luo and Lixin Wang

Background
English speaking tests have become mandatory in college and senior high school entrance examinations in many cities in China
Most of them are assessed manually
Costs a lot of time and effort
Difficult to recruit enough qualified experts
Recent advances in automatic scoring based on ASR
Used in high-stakes English tests (J. Cheng, 2011)
Performance comparable with human raters
Many educators remain skeptical about the technology

Objectives of this research
This research tries to answer the following questions:
1) How different are non-expert teachers' performances compared to experts'?
2) Will showing them the 'facts' about different aspects of pronunciation proficiency, based on acoustic features and experts' judgments, change their minds?
3) How can we better utilize automatic scoring technology to assist human raters instead of replacing them?

Experiments
Examined how experts and non-experts perform in assessing real speaking tests
Extracted acoustic features and conducted automatic scoring on the same data
Presented the non-expert teachers with multi-dimensional automatic scores on different aspects of pronunciation fluency while they assessed each utterance, and examined how that might change their judgments

Speech data
Recordings from the English speaking test of the Shenzhen High School Unified Examination
Task: repeating a one-minute-long video clip
Watch and listen to a video clip with English subtitles twice
Read aloud the subtitles on the video
300 utterances: 50 from each of the 6 proficiency level groups
Development set: 150; Test set: 150

Proficiency Level Groups of the Test-takers
Scoring standards:
5: Fluent and native-like in pronunciation and intonation, without any mistakes
4: Fluent and intelligible, with minor unnaturalness in pronunciation or intonation; very few linguistic or phonetic mistakes
3: Some errors in pronunciation or unnaturalness in intonation, but most of the speech is intelligible
2: A large number of pronunciation errors and unnatural intonation, but parts of the speech are still intelligible
1: Severe errors in pronunciation, and most of the speech is unintelligible
0: Completely unintelligible, silence, or speaking something unrelated to the presented subtitle text

Human Assessment
Participants:
2 phonetically trained experts
14 non-expert high school English teachers
10 college students majoring in English education
Results:
The correlation between the two experts is 0.821.
The 24 non-experts were clustered into 4 groups according to the similarity of scores among raters.

          Group A  Group B  Group C  Group D
Expert A   0.801    0.775    0.743    0.734
Expert B   0.810    0.769    0.751    0.725
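The agreement figures above are Pearson correlations between raters' score vectors. A minimal sketch of how such a correlation is computed; the score vectors below are invented for illustration, not the paper's data:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two raters' score vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical 0-5 scores from an expert and a non-expert
# over the same ten utterances
expert_scores = [5, 4, 4, 3, 2, 1, 0, 3, 5, 2]
teacher_scores = [5, 4, 3, 3, 2, 2, 0, 4, 5, 1]
r = pearson_r(expert_scores, teacher_scores)
```

The same pairwise correlations between all 24 non-experts can then feed a standard clustering step to form the rater groups.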

Expert Annotation
Perceptual dimensions annotated by an experienced expert:
1) Intelligibility: how well what has been said can be understood (0: very poor, 5: excellent)
2) Fluency: the level of interruptions, hesitations, and filled pauses (0: very poor, 5: excellent)
3) Correctness: whether all the phonemes have been correctly pronounced (0: very poor, 5: excellent)
4) Intonation: the extent to which the pitch and stress patterns resemble those of English (0: unnatural, 5: natural)
5) Rhythm: the extent to which the timing resembles that of English (0: unnatural, 5: natural)
60 utterances (10 from each proficiency level group) from the development data were annotated

Acoustic Models
Data from the Wall Street Journal CSR Corpus and TIMIT were used to train CD-DNN-HMM and CD-GMM-HMM models
The DNN training in this study follows the procedure described in (G. E. Dahl et al., 2012), using Kaldi
A word error rate reduction similar to that reported in (W. Hu et al., 2013) was achieved on the WSJ test set

GOP (Goodness of Pronunciation) Scores
The GOP score is defined as:

    GOP(p) = log P(p | O; t_s, t_e) / (t_e - t_s + 1)

where P(p | O; t_s, t_e) is the posterior probability that the speaker uttered phoneme p given the speech observations O over frames t_s..t_e, normalized over Q, the full set of phonemes.

W. Hu et al. proposed a better implementation of GOP, calculating the average frame posteriors of a phone from the output of the DNN model:

    GOP(p) = (1 / (t_e - t_s + 1)) * sum_{t = t_s}^{t_e} log p(p | o_t)

where p(p | o_t) is an output of the DNN (the posterior of phone p at frame t), and t_s, t_e are the start and end frames of phone p.
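The frame-averaged DNN-posterior GOP can be sketched as follows; the posterior matrix here is a toy example, with shapes and values invented for illustration:

```python
import numpy as np

def gop_frame_avg(log_post, phone, t_s, t_e):
    """Average log posterior of `phone` over frames t_s..t_e.

    log_post: (T, Q) matrix of DNN log posteriors,
    one row per frame, one column per phone in the phone set."""
    return float(log_post[t_s:t_e + 1, phone].mean())

# Toy 4-frame, 3-phone posterior matrix (rows sum to 1)
post = np.array([[0.7, 0.2, 0.1],
                 [0.6, 0.3, 0.1],
                 [0.8, 0.1, 0.1],
                 [0.5, 0.4, 0.1]])
score = gop_frame_avg(np.log(post), phone=0, t_s=0, t_e=3)
```

A well-pronounced phone segment keeps the target phone's posterior high across its frames, so its GOP stays close to zero; mispronunciations drive it strongly negative.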

Other feature scores
Word and Phone Correctness
Pitch and Energy Features: the Euclidean distances of F0 and energy contours between students' speech and reference models
Timing Features: rate of speech (ROS), phoneme duration, pauses
Unsupervised Clustering: starting from each frame of the acoustic features, adjacent feature frames that are similar to each other are clustered into a group. If an utterance is distinctly pronounced, a given sentence will contain more clusters than one that is not clearly pronounced.
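The distinctness cue can be approximated by counting clusters of similar adjacent frames. A simplified greedy sketch, not the time-constrained Ward procedure the slides later describe; the distance threshold is an assumed parameter:

```python
import numpy as np

def count_adjacent_clusters(frames, threshold):
    """Count clusters of temporally adjacent, mutually similar frames.

    A frame joins the current cluster if its Euclidean distance to
    the cluster centroid is within `threshold`; otherwise a new
    cluster starts. More clusters suggest more distinct articulation."""
    frames = np.asarray(frames, dtype=float)
    members = [frames[0]]
    clusters = 1
    for f in frames[1:]:
        centroid = np.mean(members, axis=0)
        if np.linalg.norm(f - centroid) <= threshold:
            members.append(f)
        else:
            clusters += 1
            members = [f]
    return clusters
```

On clearly articulated speech the feature trajectory moves between distinct acoustic regions, yielding more clusters per sentence than mumbled speech does.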

Correlations between Feature Scores and the Average of Experts' Scores

Average GOP        0.79
Word_Acc           0.74
Phone_Acc          0.60
Pitch distance     0.51
Energy distance    0.55
Clustering         0.58
ROS                0.39
Phoneme duration   0.42
Pause duration     0.57
Linear Regression  0.80
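The 0.80 linear-regression figure comes from combining the individual feature scores with learned weights. A minimal least-squares sketch; the feature matrix below is synthetic, not the paper's data:

```python
import numpy as np

def fit_linear_scorer(X, y):
    """Least-squares weights (last entry is the bias) mapping
    feature scores (GOP, word accuracy, ROS, ...) to expert scores."""
    A = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def predict_scores(X, w):
    """Apply the fitted weights to a feature matrix."""
    A = np.hstack([X, np.ones((len(X), 1))])
    return A @ w

# Synthetic example: one feature that perfectly predicts the score
X = np.array([[0.2], [0.4], [0.6], [0.8]])
y = np.array([1.0, 2.0, 3.0, 4.0])
w = fit_linear_scorer(X, y)
pred = predict_scores(X, w)
```

In practice the weights would be fitted on the development set against the averaged expert scores and evaluated on the held-out test set.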

Human-machine Hybrid Scoring
Examine whether non-experts' performance changes when multi-dimensional automatic scores are presented during assessment
Radar Chart Analysis: a Gnuplot script generates a 10-point radar chart for each utterance in the development and test data
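The paper uses a Gnuplot script for the radar charts; the underlying geometry, placing k feature scores on evenly spaced axes around a circle, can be sketched independently of the plotting tool (the example scores are arbitrary):

```python
import numpy as np

def radar_vertices(values):
    """Map k feature scores to (x, y) polygon vertices,
    with axes evenly spaced around the circle."""
    values = np.asarray(values, dtype=float)
    k = len(values)
    angles = 2.0 * np.pi * np.arange(k) / k
    return np.column_stack([values * np.cos(angles),
                            values * np.sin(angles)])

# Ten feature scores, one per radar axis (arbitrary example values)
verts = radar_vertices([0.8, 0.7, 0.6, 0.9, 0.5,
                        0.7, 0.4, 0.6, 0.8, 0.5])
```

Connecting the vertices in order (and closing the polygon) gives the familiar radar-chart shape; a large, regular polygon signals uniformly high scores across the ten dimensions.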

Scoring Procedure
Training:
Raters can view radar chart plots of any utterances from the development data set; the reference score is presented
They can listen to the utterance to check the pronunciation
Participants can view different shapes of radar charts within the same proficiency group, or compare radar charts across proficiency level groups
Assessment:
The radar charts of the utterances from the test set are presented in random order, together with a link to the corresponding utterance file
Raters are instructed to first look at the chart and then click on the link to check the audio before making the final decision
They are required to give an overall fluency score for the utterance

Results
Correlations between non-experts' and experts' scores in human-machine hybrid scoring:

          Group A  Group B  Group C  Group D
Expert A   0.811    0.805    0.810    0.802
Expert B   0.821    0.814    0.820    0.817

Rates of agreement with experts in human-only and human-machine hybrid rating:

               Group A  Group B  Group C  Group D
Human only      80.5%    73.5%    72.2%    71.3%
Hybrid rating   87.0%    85.4%    87.5%    86.4%

Conclusion
Investigated how non-expert and expert human raters perform in the assessment of a speaking test
Found inconsistencies in non-experts' ratings compared with the experts'
Proposed radar chart based multi-dimensional automatic scoring to assist non-expert human raters
Experimental results show that presenting automatic analysis of different fluency aspects can affect human raters' judgment
The proposed human-machine hybrid scoring system can help human raters give more consistent and reliable assessments

Thank you for your kind attention!

Unsupervised Clustering
Pipeline: speech frame sequence → log-spectrum → spectral envelope → acoustic analysis (MCEP) → clustering until a stopping condition is met → output of phoneme segments

Time-constrained: only 2 adjacent clusters can be merged

Ward's method: clusters p and q are merged based on Euclidean distances, minimizing the increase in the within-group error sum of squares
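Ward's merge criterion can be written as the increase in the within-group error sum of squares caused by merging two clusters. A small sketch; the cluster contents are illustrative, not real MCEP frames:

```python
import numpy as np

def ward_merge_cost(p, q):
    """Increase in within-group error sum of squares from merging
    clusters p and q: |p||q| / (|p| + |q|) * ||mean(p) - mean(q)||^2."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    d = p.mean(axis=0) - q.mean(axis=0)
    return len(p) * len(q) / (len(p) + len(q)) * float(d @ d)

# Merging two nearby singleton frames is cheap ...
near = ward_merge_cost([[0.0]], [[0.2]])
# ... merging distant ones is expensive, so they stay separate
far = ward_merge_cost([[0.0]], [[5.0]])
```

Under the time constraint above, only temporally adjacent cluster pairs are candidates, and the pair with the smallest merge cost is fused at each step until the stopping condition holds.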