Automatic Pronunciation Scoring of Specific Phone Segments for Language Instruction
EuroSpeech 1997
Authors: Y. Kim, H. Franco, L. Neumeyer
Presenter: Davidson
Date: 2009/10/14
2/20 Outline
Introduction
Speech database
Consistency of human ratings
Pronunciation scoring
Experimental results
Conclusions and future work
3/20 Introduction
CAPT (Computer-Assisted Pronunciation Training) for French spoken by Americans
Pronunciation scoring on:
  Entire sentences
  Specific phone segments (10 phones)
Trade-off between the number of available phone utterances and the reliability of feedback on a speaker's pronunciation proficiency
4/20 Speech database (1/2)
Target language: French
Native corpus
  100 Parisian French speakers
  Used to train models for speech recognition
Nonnative corpus
  100 American students
  Rated by 5 French teachers
  Only the selected phone segments in each utterance were rated
5/20 Speech database (2/2)
4656 phone segments were selected and rated, covering 10 phones:
  /an/, /eh/, /eo/, /eu/, /ey/, /in/, /on/, /r/, /uw/, /uy/
Score scale: 1 (unintelligible) to 5 (native-like)
Sentences with serious disfluencies or unacceptable audio quality were discarded
Each rater scored some utterances more than once without being informed (self-consistency test)
6/20 Consistency of human ratings (1/4)
Inter-rater correlation
  Phone level
  Phone-specific speaker level
  Overall speaker level
Intra-rater correlation
Correlation between two score vectors x and y:
  r(x, y) = \frac{1}{n} \sum_{t=1}^{n} \frac{(x_t - \bar{x})(y_t - \bar{y})}{\sigma_x \sigma_y}
  where n = vector length (number of paired scores) and \sigma_x, \sigma_y = standard deviations
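As a concrete illustration of the consistency measure above, here is a minimal sketch (not from the paper) of the Pearson correlation between two raters' score vectors; the function name and the example scores are hypothetical.

```python
import numpy as np

def rater_correlation(scores_a, scores_b):
    """Pearson correlation between two raters' scores over the same items.

    At the phone level the items are individual phone segments; at the
    speaker levels they are per-phone or per-speaker averaged scores.
    """
    x = np.asarray(scores_a, dtype=float)
    y = np.asarray(scores_b, dtype=float)
    # mean of the cross-products of centered scores, normalized by the two
    # standard deviations (population form, matching the 1/n in the formula)
    return float(np.mean((x - x.mean()) * (y - y.mean())) / (x.std() * y.std()))

# Hypothetical example: two raters scoring the same five phone segments
print(rater_correlation([5, 4, 4, 3, 5], [4, 4, 5, 3, 4]))
```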
7/20 Consistency of human ratings (2/4)
Phone level inter-rater correlation
[Diagram: for each speaker (1-100), each sentence, and each rated phone (P1, P2, ...), the scores of Rater 1 and Rater 2 are paired; the correlation is computed over all phone-score pairs pooled across sentences and speakers.]
8/20 Consistency of human ratings (3/4)
Phone-specific speaker level inter-rater correlation
[Diagram: per speaker, each rater's scores are averaged separately for each phone (e.g. /an/, /eh/, /uy/) before correlating.]
Overall speaker level inter-rater correlation
[Diagram: per speaker, each rater's scores are averaged over all rated phones before correlating.]
9/20 Consistency of human ratings (4/4)
Average inter- and intra-rater correlation across all phones for 5 human raters

Corr. type   Level                     # of scores   Corr.
inter        Phone                     144           0.55
inter        Sentence                  n/a           0.65
inter        Phone-specific speaker    3250          0.80
inter        Overall speaker           n/a           0.87
intra        Phone                     153           0.86
10/20 Pronunciation scoring
HMM-based log-likelihood scores
HMM-based log-posterior probability scores
Segment duration scores
11/20 HMM-based log-likelihood scores
For each phone segment, the log-likelihood score is defined as:
  \hat{l}_i = \frac{1}{d_i} \sum_{t=t_0}^{t_0 + d_i - 1} \log p(y_t \mid q_i)
where
  t_0 = starting frame index of the phone segment
  d_i = number of frames of the phone segment
  y_t = observation vector at frame t
  q_i = the i-th phone model
  p(y_t \mid q_i) = likelihood of the current frame
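A minimal sketch of this segment score, assuming the per-frame log-likelihoods log p(y_t | q_i) have already been obtained from a forced alignment against the native HMMs (the function name and the example values are hypothetical):

```python
import numpy as np

def log_likelihood_score(frame_logliks):
    """Segment log-likelihood score: the average of log p(y_t | q_i)
    over the d_i frames of the phone segment, as in the formula above."""
    frame_logliks = np.asarray(frame_logliks, dtype=float)
    return float(frame_logliks.mean())

# Hypothetical per-frame log-likelihoods for one phone segment
print(log_likelihood_score([-4.2, -3.9, -4.5, -4.1]))
```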
12/20 HMM-based log-posterior probability scores
For each phone segment, the frame-based posterior probability is defined as:
  P(q_i \mid y_t) = \frac{p(y_t \mid q_i) \, P(q_i)}{\sum_{j=1}^{M} p(y_t \mid q_j) \, P(q_j)}
  where P(q_i) = prior probability of the phone class q_i, and the sum runs over the M phone classes
The posterior score for the phone segment is then defined as:
  \hat{\rho}_i = \frac{1}{d_i} \sum_{t=t_0}^{t_0 + d_i - 1} \log P(q_i \mid y_t)
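A minimal sketch of the posterior computation, assuming per-frame log-likelihoods for all phone classes and log priors are available; the function names and array layout are hypothetical.

```python
import numpy as np

def frame_log_posterior(frame_logliks, log_priors, i):
    """log P(q_i | y_t), from per-class frame log-likelihoods log p(y_t | q_j)
    and log priors log P(q_j), following the Bayes-rule formula above."""
    joint = np.asarray(frame_logliks, dtype=float) + np.asarray(log_priors, dtype=float)
    # subtract the log of the denominator (sum over all phone classes)
    return float(joint[i] - np.logaddexp.reduce(joint))

def posterior_score(segment_logliks, log_priors, i):
    """Segment posterior score: average of log P(q_i | y_t) over the frames.

    segment_logliks: (d_i, M) array with one row of per-class
    log-likelihoods per frame of the phone segment (hypothetical values).
    """
    rows = np.asarray(segment_logliks, dtype=float)
    return float(np.mean([frame_log_posterior(row, log_priors, i) for row in rows]))
```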
13/20 Segment duration scores
Phone lengths are measured in frames
Phone lengths are normalized by the speaker's rate-of-speech
The log-probability of the normalized duration is computed using a discrete distribution of durations (trained from native training data)
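A minimal sketch of the duration score under these steps; how the rate-of-speech factor is estimated is not shown on the slide, so its exact definition here is an assumption, as are the function name and inputs.

```python
import numpy as np

def duration_score(num_frames, rate_of_speech, native_duration_pmf, floor=1e-6):
    """Log-probability of the rate-normalized phone duration.

    num_frames: phone length in frames from the alignment
    rate_of_speech: speaker rate factor used for normalization
        (its precise definition is an assumption here)
    native_duration_pmf: dict mapping normalized duration (frames, rounded)
        to probability, estimated from native training data
    """
    norm_dur = int(round(num_frames * rate_of_speech))
    # small floor for durations unseen in the native training data
    return float(np.log(native_duration_pmf.get(norm_dur, floor)))
```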
14/20 Experimental results
Test set: 30 sentences (on average) from each of the 100 American speakers
Experiments:
  Human-machine correlation for phone scores
  Effect of varying the amount of speaker data
15/20 Human-machine correlation for phone scores (1/3) Phone level correlations with about 450 phone scores in each phone class
16/20 Human-machine correlation for phone scores (2/3) Phone-specific speaker level correlations with a total of 4656 phone segments across 100 speakers
17/20 Human-machine correlation for phone scores (3/3)
Comparison between human-human and human-machine correlation at the phone level and phone-specific speaker level
[Chart: human-machine correlation vs. human-human correlation]
18/20 Effect of varying the amount of speaker data (1/2)
Goal: evaluate the system's performance as a function of the number of test utterances per speaker
The number N of phone scores per speaker is varied from 10 to 320, preserving the phone proportions
Then, for each N, the speaker-level correlation is computed between:
  Speaker-averaged machine scores (over the N sampled scores)
  Speaker-averaged human scores (over the entire human score data)
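A minimal sketch of this experiment loop for one value of N; for simplicity the sampling here is uniform rather than strictly phone-proportion preserving as described above, and all names and the data layout are hypothetical.

```python
import numpy as np

def speaker_correlation_for_n(speaker_segments, n, seed=0):
    """Speaker-level correlation when only n machine phone scores per
    speaker are averaged.

    speaker_segments: dict speaker_id -> list of (machine_score, human_score)
    """
    rng = np.random.default_rng(seed)
    machine_means, human_means = [], []
    for segs in speaker_segments.values():
        idx = rng.choice(len(segs), size=min(n, len(segs)), replace=False)
        machine_means.append(np.mean([segs[i][0] for i in idx]))
        # the human reference always uses all available human scores
        human_means.append(np.mean([h for _, h in segs]))
    x = np.asarray(machine_means)
    y = np.asarray(human_means)
    return float(np.mean((x - x.mean()) * (y - y.mean())) / (x.std() * y.std()))
```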
19/20 Effect of varying the amount of speaker data (2/2)
[Plot: speaker-level human-machine correlation vs. N for the posterior, likelihood, and duration scores; N=40 is marked]
20/20 Conclusions and future work
The posterior score performs better than the likelihood and duration scores
The system's performance is comparable to human raters at the speaker level, but not as good at the phone level
Future work:
  More human-rated utterances
  Scoring algorithms with mispronunciation detection