
1 Automatic Pronunciation Scoring of Specific Phone Segments for Language Instruction
EuroSpeech 1997
Authors: Y. Kim, H. Franco, L. Neumeyer
Presenter: Davidson
Date: 2009/10/14

2 Outline
- Introduction
- Speech database
- Consistency of human ratings
- Pronunciation scoring
- Experimental results
- Conclusions and future work

3 Introduction
- CAPT (Computer-Assisted Pronunciation Training)
  - French spoken by Americans
- Pronunciation scoring on
  - Entire sentences
  - Specific phone segments (10 phones)
- How many phone utterances are needed to give reliable feedback on a speaker's pronunciation proficiency

4 Speech database (1/2)
- Target language: French
- Native corpus
  - 100 Parisian French speakers
  - Used to train models for speech recognition
- Nonnative corpus
  - 100 American students
  - Rated by 5 French teachers
  - Only the selected phone segments in each utterance were rated

5 Speech database (2/2)
- 4656 phone segments were selected and rated, covering 10 phones: /an/, /eh/, /eo/, /eu/, /ey/, /in/, /on/, /r/, /uw/, /uy/
- Score scale: 1 (unintelligible) to 5 (native-like)
- Sentences with serious disfluencies or unacceptable audio quality were discarded
- Each rater scored some utterances more than once without being informed
  - Self-consistency test

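6 Consistency of human ratings (1/4)
- Inter-rater correlation
  - Phone level
  - Phone-specific speaker level
  - Overall speaker level
- Intra-rater correlation
- Correlation between two score vectors $x$ and $y$:
  $$r(x, y) = \frac{1}{n} \sum_{i=1}^{n} \frac{(x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \sigma_y}$$
  where
  - $n$ = vector length
  - $\bar{x}$, $\bar{y}$ = means of $x$ and $y$
  - $\sigma_x$, $\sigma_y$ = standard deviations of $x$ and $y$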
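
The correlation between two rating vectors can be sketched as follows. This is a minimal illustration that assumes the standard Pearson correlation with population (1/n) standard deviations; the function name and the example ratings are hypothetical and not taken from the paper.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation between two equal-length score vectors.

    Assumption: population (1/n) normalization for both the covariance
    and the standard deviations, so the factors cancel consistently.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (std_x * std_y)


# Hypothetical ratings (scale 1 to 5) of the same phone segments by two raters.
rater1 = [5, 4, 3, 2, 4]
rater2 = [4, 4, 3, 1, 5]
inter_rater = pearson_correlation(rater1, rater2)
```

The same function would cover the intra-rater case by passing a rater's first and second scores for the repeated utterances.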

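7 Consistency of human ratings (2/4)
- Phone level inter-rater correlation
  [Figure: scores of Rater 1 and Rater 2 (example values 5, 4, 4, 4, ...) laid out over the rated phone segments (P1, P2, ...) of each sentence (Sent 1, Sent 2, ...) of Speaker 1, Speaker 2, ..., Speaker 100]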

8 Consistency of human ratings (3/4)
- Phone-specific speaker level inter-rater correlation
  [Figure: for each phone (/an/, /eh/, ..., /uy/), scores of Rater 1 and Rater 2 for Speaker 1, Speaker 2, ..., Speaker 100]
- Overall speaker level inter-rater correlation
  [Figure: scores of Rater 1 and Rater 2 for Speaker 1, Speaker 2, ..., Speaker 100]
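
The step from phone level ratings to the two speaker levels can be sketched as a grouping-and-averaging operation. This assumes that speaker level scores are plain averages of a rater's phone level scores, which the slides suggest but do not spell out; the record layout `(speaker, phone, score)` and all names are hypothetical.

```python
from collections import defaultdict


def average_by(ratings, key):
    """Average the scores of (speaker, phone, score) records per group key."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for record in ratings:
        group = key(record)
        sums[group] += record[2]
        counts[group] += 1
    return {group: sums[group] / counts[group] for group in sums}


# Hypothetical phone level ratings from a single rater.
rater1 = [
    ("spk1", "an", 5),
    ("spk1", "an", 4),
    ("spk1", "eh", 4),
    ("spk2", "an", 3),
    ("spk2", "eh", 2),
]

# Phone-specific speaker level: one score per (speaker, phone) pair.
phone_specific = average_by(rater1, key=lambda r: (r[0], r[1]))

# Overall speaker level: one score per speaker, pooled over all phones.
overall = average_by(rater1, key=lambda r: r[0])
```

Correlating two raters at a given level then means building these averages for each rater over the same groups and correlating the resulting vectors.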

9 Consistency of human ratings (4/4)
- Average inter- and intra-rater correlation across all phones for 5 human raters

  Corr. type  Level                   # of scores  Corr.
  inter       Phone                   144          0.55
  inter       Sentence                n/a          0.65
  inter       Phone-specific speaker  3250         0.80
  inter       Overall speaker         n/a          0.87
  intra       Phone                   153          0.86

10 Pronunciation scoring
- HMM-based log-likelihood scores
- HMM-based log-posterior probability scores
- Segment duration scores

11 HMM-based log-likelihood scores
- For each phone segment, the log-likelihood score is defined as:
  $$\hat{l}_i = \frac{1}{d} \sum_{t=t_0}^{t_0+d-1} \log p(y_t \mid q_i)$$
  where
  - $t_0$ = starting frame index of the phone segment
  - $d$ = number of frames of the phone segment
  - $y_t$ = observation vector
  - $q_i$ = the $i$-th model
  - $p(y_t \mid q_i)$ = likelihood of the current frame
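
A minimal sketch of the segment log-likelihood score, assuming it is the sum of frame log-likelihoods over the segment divided by the segment's number of frames. The frame log-likelihoods would come from an HMM aligner in practice; here they are hypothetical numbers, and the function name is illustrative.

```python
def log_likelihood_score(frame_logliks, t0, d):
    """Duration-normalized log-likelihood of one phone segment.

    frame_logliks[t] is log p(y_t | q_i) for the phone model q_i aligned
    to frame t; t0 is the starting frame index and d the number of frames.
    Assumption: the score is the per-frame average over the segment.
    """
    if d <= 0 or t0 < 0 or t0 + d > len(frame_logliks):
        raise ValueError("segment must lie inside the utterance")
    return sum(frame_logliks[t0:t0 + d]) / d


# Hypothetical frame log-likelihoods for a short utterance.
frame_logliks = [-1.0, -2.0, -3.0, -4.0]
segment_score = log_likelihood_score(frame_logliks, t0=1, d=2)
```

Dividing by the number of frames keeps long and short segments on a comparable scale.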

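12 HMM-based log-posterior probability scores
- For each phone segment, the frame-based posterior probability is defined as:
  $$P(q_i \mid y_t) = \frac{p(y_t \mid q_i)\, P(q_i)}{\sum_{j=1}^{M} p(y_t \mid q_j)\, P(q_j)}$$
  where
  - $P(q_i)$ = prior probability of the phone class $q_i$
  - $p(y_t \mid q_i)$ = likelihood of the observation vector $y_t$ under the $i$-th model
  - $M$ = number of phone classes
- The posterior score for the phone segment is then defined as:
  $$\hat{\rho}_i = \frac{1}{d} \sum_{t=t_0}^{t_0+d-1} \log P(q_i \mid y_t)$$
  where $t_0$ is the starting frame index and $d$ the number of frames of the phone segment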
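
The posterior score can be sketched as below, assuming the frame posterior is obtained by Bayes' rule over all phone classes and the segment score is the per-frame average of its logarithm. The log-sum-exp step is an implementation choice for numerical stability, not something stated on the slides; names and numbers are hypothetical.

```python
import math


def log_posterior_score(frame_logliks, log_priors, target, t0, d):
    """Average log-posterior of phone class `target` over one segment.

    frame_logliks[t][j] is log p(y_t | q_j) for every phone class j,
    log_priors[j] is log P(q_j). Assumption: Bayes' rule per frame,
    then a per-frame average over the d frames starting at t0.
    """
    if d <= 0 or t0 < 0 or t0 + d > len(frame_logliks):
        raise ValueError("segment must lie inside the utterance")
    total = 0.0
    for t in range(t0, t0 + d):
        joint = [ll + lp for ll, lp in zip(frame_logliks[t], log_priors)]
        # log of the denominator sum_j p(y_t | q_j) P(q_j), via log-sum-exp
        peak = max(joint)
        log_evidence = peak + math.log(sum(math.exp(v - peak) for v in joint))
        total += joint[target] - log_evidence
    return total / d


# Hypothetical two-class example with equal priors.
frame_logliks = [
    [math.log(0.8), math.log(0.2)],
    [math.log(0.5), math.log(0.5)],
]
log_priors = [math.log(0.5), math.log(0.5)]
score = log_posterior_score(frame_logliks, log_priors, target=0, t0=0, d=2)
```

Because each frame is normalized by the evidence over all classes, the score is less sensitive than the raw likelihood to factors that shift all class likelihoods together.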

13 Segment duration scores
- Phone lengths are measured in frames
- Phone lengths are normalized by the speaker's rate of speech
- The log-probability of the normalized duration is computed using a discrete distribution of durations (trained on native training data)
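
The three steps above can be sketched as a histogram lookup. Several details here are assumptions, since the slides do not give them: normalization is taken to be duration multiplied by rate of speech, the discrete distribution is a fixed-width histogram, and add-one smoothing is used so that unseen bins keep a finite log-probability. All names and numbers are hypothetical.

```python
import math
from collections import Counter


def train_duration_model(native_durations, bin_width, num_bins):
    """Histogram of normalized native durations, as smoothed log-probabilities."""
    counts = Counter(
        min(int(dur / bin_width), num_bins - 1) for dur in native_durations
    )
    total = len(native_durations) + num_bins  # add-one smoothing (assumption)
    log_probs = [math.log((counts[b] + 1) / total) for b in range(num_bins)]
    return log_probs, bin_width


def duration_score(num_frames, rate_of_speech, model):
    """Log-probability of a segment's rate-normalized duration."""
    log_probs, bin_width = model
    normalized = num_frames * rate_of_speech  # normalization form is an assumption
    bin_index = min(int(normalized / bin_width), len(log_probs) - 1)
    return log_probs[bin_index]


# Hypothetical normalized durations of one phone in native speech.
model = train_duration_model([1.0, 1.0, 1.1, 2.0], bin_width=0.5, num_bins=6)
score = duration_score(num_frames=10, rate_of_speech=0.1, model=model)
```

In a fuller version one such model would be trained per phone, so that each segment is scored against the native duration distribution of its own phone class.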

14 Experimental results
- Test set: 30 sentences on average from each of the 100 American speakers
- Experiments
  - Human-machine correlation for phone scores
  - Effect of varying the amount of speaker data

15 Human-machine correlation for phone scores (1/3)
- Phone level correlations with about 450 phone scores in each phone class

16 Human-machine correlation for phone scores (2/3)
- Phone-specific speaker level correlations with a total of 4656 phone segments across 100 speakers

17 Human-machine correlation for phone scores (3/3)
- Comparison between human-human and human-machine correlation at the phone level and phone-specific speaker level
  [Figure legend: human-machine correlation, human-human correlation]

18 Effect of varying the amount of speaker data (1/2)
- Goal: evaluate the system's performance as a function of the number of test utterances per speaker
- The number (N) of phone scores per speaker was varied from 10 to 320
  - Phone proportion is preserved
- Then, for each N, the speaker level correlation is computed between:
  - Speaker-averaged machine scores (of N scores)
  - Speaker-averaged human scores (of the entire human score data)
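
The procedure above can be sketched as a stratified subsampling experiment. This is an illustration under assumptions: "phone proportion is preserved" is read as drawing from each phone in proportion to its share of the speaker's scores (so the sample size is only approximately N after rounding), and the correlation is a standard Pearson correlation. All names and data are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def stratified_sample(scores_by_phone, n, rng):
    """Draw about n scores, keeping each phone's share of the total."""
    total = sum(len(scores) for scores in scores_by_phone.values())
    sample = []
    for scores in scores_by_phone.values():
        k = max(1, round(n * len(scores) / total))
        sample.extend(rng.sample(scores, min(k, len(scores))))
    return sample


def speaker_level_correlation(machine, human_avg, n, seed=0):
    """Correlate N-score machine averages with full-data human averages.

    machine[speaker][phone] is a list of machine scores;
    human_avg[speaker] is the human average over all rated segments.
    """
    rng = random.Random(seed)
    speakers = sorted(machine)
    machine_avg = []
    for speaker in speakers:
        sample = stratified_sample(machine[speaker], n, rng)
        machine_avg.append(sum(sample) / len(sample))
    return pearson(machine_avg, [human_avg[s] for s in speakers])


# Hypothetical data: constant machine scores per speaker, so the result
# does not depend on which scores the random draw selects.
machine = {
    "spk1": {"an": [-1.0] * 20, "r": [-1.0] * 10},
    "spk2": {"an": [-2.0] * 20, "r": [-2.0] * 10},
    "spk3": {"an": [-4.0] * 20, "r": [-4.0] * 10},
}
human_avg = {"spk1": 4.5, "spk2": 4.0, "spk3": 3.0}
r = speaker_level_correlation(machine, human_avg, n=6)
```

Repeating the call for each N in the studied range (and, in a fuller experiment, averaging over several random draws) would trace how the correlation changes with the amount of speaker data.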

19 Effect of varying the amount of speaker data (2/2)
[Figure: results for the posterior, duration, and likelihood scores; N = 40 is marked]

20 Conclusions and future work
- The posterior score performs better than the likelihood and duration scores
- The system's performance is comparable to that of human raters at the speaker level, but falls short of it at the phone level
- Future work:
  - More human-rated utterances
  - Scoring algorithms with mispronunciation detection

