ASR-based corrective feedback on pronunciation: does it really work?


ASR-based corrective feedback on pronunciation: does it really work?
Ambra Neri, Catia Cucchiarini, Helmer Strik
Centre for Speech and Language Technology, Dept. of Linguistics, Radboud University Nijmegen, The Netherlands

Outline
Background & problem
Goal of the present study
Experiment
Conclusions

Background and problem
Computer Assisted Pronunciation Training (CAPT).
ASR-based CAPT can provide automatic, instantaneous, individual feedback on pronunciation in a private environment.
But ASR-based CAPT suffers from limitations. Is it effective in improving L2 pronunciation? Very few studies exist, and they report different results. The problem is all the more acute for ASR-based CAPT because of the limitations of such systems: if occasional feedback errors are possible, are these systems still effective, or can such errors disrupt the learning process?

Goal of this study
To assess the effectiveness, and the possible advantage, of the automatic feedback provided by an ASR-based CAPT system.

ASR-based CAPT system: Dutch CAPT
Target users: adult learners of Dutch with different L1s (e.g. immigrants).
Pedagogical goal: improving segmental quality in pronunciation.

Dutch CAPT: feedback
Content: focus on problematic phonemes; 11 'targeted phonemes': 9 vowels and 2 consonants.
Criteria: error detection algorithm based on the GOP method (Witt & Young 2000).
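The idea behind GOP-based error detection can be sketched as follows. This is an illustrative simplification, not the Dutch-CAPT implementation (the function names and the threshold value are hypothetical): in one common formulation, the GOP score of a phone is the frame-normalised difference between the log-likelihood of the intended phone and that of the best-scoring competing phone, and a phone is flagged as mispronounced when the score falls below a tuned threshold.

```python
def gop_score(target_loglik, competitor_logliks, n_frames):
    """Goodness of Pronunciation (after Witt & Young 2000):
    frame-normalised log-likelihood ratio between the intended phone
    and the best competing phone. Near 0 = good match; strongly
    negative = some other phone fits the audio much better."""
    best_competitor = max(competitor_logliks)
    return (target_loglik - best_competitor) / n_frames

def is_mispronounced(gop, threshold=-2.0):
    # Hypothetical threshold; in practice tuned per phone on annotated data.
    return gop < threshold

# Correctly produced phone: target model scores at least as well as
# any competitor, so GOP is close to zero.
print(gop_score(-100.0, [-100.0, -130.0], n_frames=10))   # 0.0
# Likely substitution: a competing phone fits the frames much better.
print(gop_score(-130.0, [-100.0, -125.0], n_frames=10))   # -3.0
```

The threshold trades off false alarms (discouraging, possibly disruptive feedback) against misses, which is exactly the limitation the study asks about.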

Video

Dutch CAPT
Gender-specific, Dutch & English versions.
4 units, each containing: 1 video (from Nieuwe Buren) with real-life, amusing situations, plus ca. 30 exercises based on the video: dialogues, question-answer, minimal pairs, word repetition.
Sequential, constrained navigation: minimum one attempt needed to proceed to the next exercise, maximum 3.

Method: participants & training
Regular teacher-fronted lessons: 4-6 hrs per week.
Experimental group (EXP): n=15 (10 F, 5 M), Dutch CAPT.
Control group 1 (NiBu): n=10 (4 F, 6 M), reduced version of Nieuwe Buren.
Control group 2 (noXT): n=5 (3 F, 2 M), no extra training.
Extra training: 4 weeks x 1 session of 30'-60'; each class received one type of training.
Participants were aged 20 to 50, balanced for gender, with different L1s, and all following a Dutch course as beginners. All were highly educated, having completed high school or university; some held a PhD.

Method: testing
3 analyses:
1. Participants' evaluations: questionnaires on the system's usability, accessibility, usefulness, etc.
2. Global segmental quality: 6 experts rated stimuli on a 10-point scale (pre-test/post-test, phonetically balanced sentences).
3. In-depth analysis of segmental errors: expert annotations.
Students completed anonymous questionnaires, indicating whether or not they agreed with statements on a 5-point Likert scale and answering two open-ended questions. The questions concerned the accessibility of the exercises, the usability of the interface in general, the students' feelings about the usefulness of the specific CAPT for improving pronunciation, and their opinion about specific features of the system used.
Analysis of global segmental quality: the subjects were tested before (pre-test) and after the training (post-test). To ensure that the rating process would not be influenced by lexical or morphosyntactic errors, read speech containing all Dutch phonemes was used (phonetically rich sentences), presented on a computer screen via an adapted version of Dutch-CAPT. Six expert raters evaluated the speech independently on a 10-point scale, where 1 indicated very poor and 10 very good segmental quality. They were instructed to focus on segmental quality only, and to ignore aspects such as word stress, sentence accent, and speech rate, since these aspects were not the focus of the training. To help them anchor their ratings (Cucchiarini et al. 2000), the raters were given examples of native utterances and of non-native utterances of 'poor' segmental quality from the experiment stimuli. Pre- and post-test recordings were presented in random order.
In-depth analysis of segmental quality: to obtain more fine-grained information on the effectiveness of the computer-generated feedback, a detailed analysis was carried out of the specific errors made by each participant. For this investigation, auditory analyses were carried out of the participants' recordings, and specific segmental errors were annotated.

Results: participants' evaluations
Positive reactions: participants enjoyed working with the system and believed in its usefulness.

Results: reliability of global ratings
Cronbach's α: intra-rater .94-1.00; inter-rater .83-.96.
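For readers unfamiliar with the reliability coefficient, Cronbach's α over a speakers-by-raters score matrix can be computed as below. This is a minimal pure-Python sketch on made-up numbers, not the study's rating data:

```python
from statistics import variance  # sample variance (n-1 denominator)

def cronbach_alpha(scores):
    """scores: one row per rated speaker, one column per rater.
    alpha = k/(k-1) * (1 - sum of per-rater variances / variance of totals)."""
    k = len(scores[0])                                  # number of raters
    rater_vars = sum(variance(col) for col in zip(*scores))
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - rater_vars / total_var)

# Two raters in perfect agreement -> alpha = 1.0
print(cronbach_alpha([(3, 3), (5, 5), (8, 8)]))        # 1.0
# Mostly consistent ratings -> high but imperfect alpha
print(round(cronbach_alpha([(3, 4), (5, 5), (8, 7), (2, 3)]), 3))
```

Values above roughly .8, as reported here, are conventionally taken to indicate that the raters were measuring the same construct.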

Results: global ratings
We first checked whether some non-natives had already received scores in the native range at pre-test. The natives received scores between 9 and 10, while the non-native scores never fell outside the range 1-8, with a maximum of 7.6 at pre-test. Secondly, given the impossibility of matching the groups before the training, we examined the pre-test scores to see whether the groups differed significantly prior to the training. An ANOVA with post-hoc comparisons indicated that the group receiving no extra training (noXT) had significantly higher pre-test scores than the group training with the ASR-based CAPT system (EXP).
We then examined the average improvement made by the groups after training: overall segmental accuracy improved for all groups at post-test (see Figure 2). An ANOVA with repeated measures indicated a significant effect of Test time, F(1, 27) = 18.806, p < .05, with post-test scores (M=5.19, SD=1.53) reflecting significantly greater segmental accuracy than pre-test scores (M=4.42, SD=1.54). The Test time x Training group interaction was not significant, indicating no significant differences between the mean improvements of the training groups.
To summarize: all three groups improved overall segmental quality after the training, with EXP showing the largest improvement, followed by NiBu; however, the differences in improvement between the three groups were statistically nonsignificant.
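The pre/post comparison reported above can be illustrated with a paired test: for a within-subjects factor with only two levels (pre-test, post-test), the repeated-measures ANOVA F statistic for that factor equals the square of the paired t statistic. A minimal sketch on invented 10-point-scale ratings, not the study's data:

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(pre, post):
    """Paired t statistic on per-learner score differences. With two
    test times, the repeated-measures F for Test time equals t**2."""
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    se = stdev(diffs) / sqrt(n)          # standard error of the mean diff
    return mean(diffs) / se

# Invented ratings for five learners (1 = very poor, 10 = very good)
pre  = [4.0, 4.5, 3.8, 5.0, 4.2]
post = [5.0, 5.2, 4.1, 5.6, 5.0]
t = paired_t(pre, post)
print(round(t, 3), round(t ** 2, 3))     # t, and the equivalent F(1, n-1)
```

The interaction test (whether the three groups improved by different amounts) needs the full two-factor model, but the main effect of Test time reduces to exactly this computation.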

In-depth analysis of segmental quality
Since this faster improvement could have resulted from the fact that EXP was initially making more errors, and was therefore likely to make larger improvements than NiBu (Hincks, 2003), we also examined the errors made by both groups on the phonemes that were not targeted by Dutch-CAPT. This time a different trend appeared (see Figure 3): while both groups produced fewer errors at post-test, the decreases in untargeted errors were much smaller and more similar across the two groups (0.7% for EXP and 1.1% for NiBu) than those for the targeted errors. An ANOVA with repeated measures, with Training group as between-subjects factor and Test time as within-subjects factor, revealed no significant Training group x Test time interaction, indicating that the two groups made comparable mean improvements on the untargeted phonemes. A main effect was found for Test time, F(1, 23) = 10.806, p < .05, with significantly fewer errors at post-test (M=3.7%, SD=.021) than at pre-test (M=4.5%, SD=.021). No significant effect was found for Training group, confirming that, overall, the two groups made comparable proportions of untargeted errors: the mean percentages of errors on untargeted phonemes (relative to all untargeted phonemes in the stimuli) were 4.7% (SD=.022) for EXP and 3.4% (SD=.019) for NiBu.
These results show that (a) more errors were produced on the targeted phonemes, which confirms that these phonemes are particularly problematic and that segmental training should focus on them, and (b) the group receiving feedback on these errors made a significantly larger improvement on these phonemes, whereas no statistically significant difference was found for the phonemes for which no feedback was provided, suggesting that the feedback provided in Dutch-CAPT was effective in improving the quality of the targeted phonemes.

Conclusions
Participants enjoyed Dutch CAPT.
ASR-based CAPT seems efficacious in improving pronunciation of the targeted phonemes.
Global ratings are an appropriate measure because CAPT should ultimately improve overall pronunciation quality; fine-grained analyses are also useful.
These additional analyses evidenced specific effects that did not appear in the analysis of overall segmental quality. To understand this discrepancy, we examined the relationship between the human ratings of global segmental quality for each participant in the groups receiving CAPT and the percentages of different errors produced by these participants at pre- and post-test. We found a strong negative correlation between the raters' scores and the percentage of total errors per participant, r = -.877, p < .01: the raters did indeed assess global segmental quality, i.e. the segmental quality of all phonemes in the stimuli, as requested. A significant negative correlation was also found between the scores and the percentage of untargeted errors, r = -.863, p < .01, and a significant though weaker correlation with the targeted errors, r = -.645, p < .01. These results indicate that both types of errors contributed to determining the score, but that the targeted errors had less impact on it, which is not surprising given that the targeted phonemes are less frequent (18.5%) than the untargeted phonemes (81.5%).
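The score-error correlations reported here are plain Pearson product-moment coefficients, which can be computed as below. The numbers are invented for illustration, not the study's data:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A higher error percentage should go with a lower rating; a perfect
# inverse linear relationship gives r = -1.0.
scores = [8.0, 6.0, 4.0, 2.0]   # raters' 10-point-scale scores
errors = [1.0, 3.0, 5.0, 7.0]   # percentage of segmental errors
print(pearson_r(scores, errors))   # -1.0
```

The observed r = -.877 between scores and total error percentage is close to this extreme, which is what licenses the conclusion that the raters were indeed tracking segmental error rates.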


Possible improvements
Give feedback on more phonemes.
Build more targeted systems for fixed L1-L2 pairs.
Give feedback on suprasegmentals.
Increase the sample size.
Increase the training intensity.
Match the training groups: L1s, proficiency, etc.

The end
Questions?