Phone-Level Pronunciation Scoring and Assessment for Interactive Language Learning
Speech Communication, 2000
Authors: S. M. Witt, S. J. Young
Presenter: Davidson
Date: 2009/07/08, 2009/07/15
Contents
- Introduction
- Goodness of Pronunciation (GoP) algorithm
  - Basic GoP algorithm
  - Phone-dependent thresholds
  - Explicit error modeling
- Collection of a non-native database
- Performance measures
- The labeling consistency of the human judges
- Experimental results
- Conclusions and future work
Introduction (1/3)
- CAPT systems (Computer-Assisted Pronunciation Training)
- Word- and phrase-level scoring ('93, '94, '97)
  - Intonation, stress, and rhythm
  - Requires several recordings of native utterances for each word
  - Difficult to add new teaching material
- Selected phonemic error teaching (1997)
  - Uses duration information or models trained on non-native speech
Introduction (2/3)
- HMMs have been used to produce sentence-level scores (1990, 1996)
- Eskenazi's system (1996) produces phone-level scores but makes no attempt to relate them to human judgement
- The authors' proposed system measures pronunciation quality for non-native speech at the phone level
Introduction (3/3)
Other issues covered:
- GoP algorithms with refinements
- Performance measures for both GoP scores and scores by human judges
- A non-native database
- Experiments on these performance measures
Goodness of Pronunciation (GoP) algorithm: Basic GoP algorithm (1/5)
- A score for each phone = likelihood of the acoustic segment corresponding to that phone
- GoP = duration-normalized log of the posterior probability of a phone given the corresponding acoustic segment
Basic GoP algorithm (2/5)
- $Q$ = the set of all phone models; $NF(p)$ = the number of frames in the acoustic segment $O^{(p)}$
- $GoP(p) = \frac{1}{NF(p)} \left| \log P(p \mid O^{(p)}) \right|$, with $P(p \mid O^{(p)}) = \frac{p(O^{(p)} \mid p)\,P(p)}{\sum_{q \in Q} p(O^{(p)} \mid q)\,P(q)}$
- By assuming equal phone priors and approximating the sum by its maximum: $GoP(p) \approx \frac{1}{NF(p)} \left| \log \frac{p(O^{(p)} \mid p)}{\max_{q \in Q} p(O^{(p)} \mid q)} \right|$
Basic GoP algorithm (3/5)
- The numerator term $p(O^{(p)} \mid p)$ is computed using forced alignment with the known transcription
- The denominator term $\max_{q \in Q} p(O^{(p)} \mid q)$ is determined using an unconstrained phone loop
Basic GoP algorithm (4/5)
- If a mispronunciation has occurred, it is not reasonable to constrain the acoustic segment used to compute the maximum-likelihood phone to be identical to the assumed phone
- Hence, the denominator score is computed by summing the phone-loop log likelihood per frame over the duration of the segment $O^{(p)}$
- In practice, this often means that more than one phone in the unconstrained phone sequence contributes to the denominator (see the sketch below)
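As a minimal sketch (not the paper's implementation), assuming the recognizer has already produced the forced-alignment log likelihood for the segment and the per-frame log likelihoods of the best phone-loop path, the GoP score reduces to a duration-normalized log-likelihood ratio:

```python
def gop_score(forced_ll, loop_frame_lls):
    """Basic GoP for one phone segment.

    forced_ll      -- log p(O(p)|p): segment log likelihood under the phone
                      given by the known transcription (forced alignment)
    loop_frame_lls -- per-frame log likelihoods of the best path through the
                      unconstrained phone loop over the same frames; several
                      loop phones may contribute to this segment
    """
    nf = len(loop_frame_lls)            # NF(p): duration in frames
    denominator_ll = sum(loop_frame_lls)
    # Duration-normalized log posterior under the equal-priors approximation
    return abs(forced_ll - denominator_ll) / nf


# Toy example: a 5-frame segment
print(gop_score(-42.0, [-8.0, -8.5, -7.9, -8.2, -8.1]))
```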
Basic GoP algorithm (5/5)
- It is intuitive to use speech data from native speakers to train the acoustic models
- However, non-native speech is characterized by different formant structures than a native speaker's for the same phone
- Adapt the Gaussian means by MLLR, using only a single global transform of the HMM Gaussian component means to avoid adapting to specific phone error patterns
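A hedged illustration of the single global transform: every Gaussian mean is updated with the same affine map $\mu' = A\mu + b$; estimating $A$ and $b$ from the learner's adaptation data (the actual MLLR estimation step) is omitted, and the values below are placeholders:

```python
import numpy as np

def apply_global_mllr_means(means, A, b):
    """Apply one shared affine transform mu' = A @ mu + b to all HMM
    Gaussian component means (means: n_components x dim). A single
    global transform avoids adapting to specific phone error patterns."""
    return means @ A.T + b

# Toy example: three 2-dimensional component means
means = np.array([[0.0, 1.0], [2.0, -1.0], [0.5, 0.5]])
A = 0.9 * np.eye(2)          # placeholder transform; normally estimated
b = np.array([0.1, -0.2])    # by MLLR from the non-native adaptation data
print(apply_global_mllr_means(means, A, b))
```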
Phone-dependent thresholds
- The acoustic fit of phone-based HMMs differs from phone to phone; e.g., fricatives tend to have lower log likelihoods than vowels
- 2 ways to determine phone-specific thresholds (see the sketch below):
  - Using the mean and variance of the scores for each phone
  - Approximating human labeling behavior
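A sketch of the first scheme, under the assumption that the threshold is an offset from each phone's mean GoP score scaled by its spread; the weight `alpha` is an illustrative free parameter, not a value from the paper:

```python
from statistics import mean, stdev

def phone_thresholds(gop_scores_by_phone, alpha=1.0):
    """Per-phone rejection threshold T(p) = mean + alpha * stdev of the
    GoP scores observed for phone p (higher GoP = worse pronunciation)."""
    return {phone: mean(scores) + alpha * stdev(scores)
            for phone, scores in gop_scores_by_phone.items()
            if len(scores) > 1}

# Phones whose acoustic fit is looser (e.g. fricatives) have a higher
# GoP distribution, so they receive a more lenient threshold.
gops = {"f": [3.1, 3.5, 2.9, 3.8], "aa": [1.2, 1.0, 1.5, 0.9]}
print(phone_thresholds(gops))
```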
Explicit error modeling (1/3)
- 2 types of pronunciation errors:
  - Individual mispronunciations
  - Systematic mispronunciations: substitutions of native-language sounds for sounds of the target language that do not exist in the native language
- Knowledge of the learner's native language is included in order to detect systematic mispronunciations
Explicit error modeling (2/3)
- Solution: a recognition network incorporating both the correct pronunciation and common pronunciation errors, in the form of error sublattices for each phone (e.g., the word "but"; see the sketch below)
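A minimal sketch of such a network for "but": each target phone expands into a sublattice containing the correct phone plus common error variants. The variants listed here are hypothetical placeholders; the real sublattices depend on the learner's native language:

```python
# Hypothetical error sublattices: target phone -> [correct + error variants]
ERROR_SUBLATTICES = {
    "b":  ["b", "v"],          # illustrative b/v confusion
    "ah": ["ah", "uh", "aa"],  # illustrative vowel substitutions
    "t":  ["t", "d"],          # illustrative final-consonant voicing
}

def build_error_network(phone_transcription):
    """Expand a phone transcription into a lattice of alternatives:
    one slot per target phone, listing correct and error arcs."""
    return [ERROR_SUBLATTICES.get(p, [p]) for p in phone_transcription]

print(build_error_network(["b", "ah", "t"]))
# [['b', 'v'], ['ah', 'uh', 'aa'], ['t', 'd']]
```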
Explicit error modeling (3/3)
- Target phone posterior probability
- Scores for systematic mispronunciations
- GoP that includes an additional penalty for systematic mispronunciation (see the sketch below)
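A hedged sketch of the penalized score (the paper's exact formulas were lost from this slide and are not reproduced): when recognition through the error network selects an error variant instead of the target phone, a penalty is added to the basic GoP. The penalty value and the comparison logic here are illustrative assumptions:

```python
def gop_with_error_penalty(basic_gop, recognized_phone, target_phone,
                           penalty=2.0):
    """Add a fixed penalty (illustrative value) to the basic GoP when the
    error-network recognizer picks an error variant rather than the target,
    flagging a likely systematic mispronunciation."""
    if recognized_phone != target_phone:
        return basic_gop + penalty
    return basic_gop

print(gop_with_error_penalty(1.4, recognized_phone="v", target_phone="b"))
print(gop_with_error_penalty(1.4, recognized_phone="b", target_phone="b"))
```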
Collection of a non-native database (1/2)
- Based on the procedures used for the WSJCAM0 corpus
- Texts are composed of a limited vocabulary of 1500 words
- 6 females and 4 males whose mother tongues are Korean (3), Japanese (3), Latin-American Spanish (3), and Italian (1)
- Each speaker reads 120 sentences:
  - 40 from a common set of phonetically balanced sentences
  - 80 sentences varied from session to session
Collection of a non-native database (2/2)
- 6 human judges who are native speakers of British English
- Each speaker was labeled by 1 judge
- 20 sentences from a female Spanish speaker are used as calibration sentences, annotated by all 6 judges
- Transcriptions reflect the actual sounds uttered by the speakers, including phonemes from other languages
Performance measures (1/3)
- Compare 2 transcriptions of the same sentence; transcriptions are either produced by human judges or generated automatically
- 4 types of performance measures:
  - Strictness
  - Agreement
  - Cross-correlation
  - Overall phone correlation
Performance measures (2/3)
- Transcriptions are compared on a frame-by-frame basis: each error frame is marked as 1, and 0 otherwise, yielding a binary vector whose length equals the number of frames
- A Hamming window is applied, because the transition between 0 and 1 is too abrupt, whereas in practice the boundary is often uncertain, and the forced alignment might be erroneous due to poor acoustic modeling of non-native speech (see the sketch below)
- The window length is a tunable parameter
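A sketch of the smoothing step, assuming a binary per-frame error vector; the window length used in the paper is not reproduced on this slide, so `win_len` below is a free parameter:

```python
import numpy as np

def smooth_error_vector(error_frames, win_len=11):
    """Convolve a 0/1 per-frame error vector with a normalized Hamming
    window, softening the abrupt 0-to-1 transitions caused by uncertain
    boundaries and imperfect forced alignment of non-native speech."""
    window = np.hamming(win_len)
    window /= window.sum()
    return np.convolve(error_frames, window, mode="same")

e = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0], dtype=float)
print(np.round(smooth_error_vector(e, win_len=5), 2))
```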
Performance measures (3/3): the four measures are defined individually on the following slides
Strictness (S)
- Measures how strict the judge was in marking pronunciation errors: the fraction of frames marked as mispronounced
- Relative strictness: the difference in strictness between two transcriptions
Overall Agreement (A)
- Measures the agreement over all frames between 2 transcriptions
- Defined in terms of the cityblock distance between the 2 transcription vectors: $A = 1 - \frac{1}{N}\sum_{i=1}^{N} |e_{1,i} - e_{2,i}|$
Cross-correlation (CC)
- Measures the agreement between the error frames in either or both transcriptions
- $CC = \frac{e_1 \cdot e_2}{\lVert e_1 \rVert \, \lVert e_2 \rVert}$, where $\lVert \cdot \rVert$ is the Euclidean norm
Phoneme Correlation (PC)
- Measures the overall agreement of the rejection statistics for each phone between 2 judges/systems
- $PC = \frac{\sum_i (r_{1,i} - \bar{r}_1)(r_{2,i} - \bar{r}_2)}{\sqrt{\sum_i (r_{1,i} - \bar{r}_1)^2 \sum_i (r_{2,i} - \bar{r}_2)^2}}$, where $r$ is a vector of rejection counts for each phone and $\bar{r}$ denotes the mean rejection count
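Putting the four measures together in one sketch, under the reconstructed definitions above ($e$ vectors are smoothed per-frame error traces, $r$ vectors are per-phone rejection counts); any normalization details not recoverable from the slides are assumptions:

```python
import numpy as np

def strictness(e):
    """S: fraction of frames marked as mispronounced."""
    return float(e.mean())

def agreement(e1, e2):
    """A: overall agreement via cityblock distance, 1 - mean |e1 - e2|."""
    return 1.0 - float(np.abs(e1 - e2).mean())

def cross_correlation(e1, e2):
    """CC: normalized inner product of the two error vectors."""
    denom = np.linalg.norm(e1) * np.linalg.norm(e2)
    return float(e1 @ e2) / denom if denom else 0.0

def phoneme_correlation(r1, r2):
    """PC: Pearson correlation of per-phone rejection counts."""
    return float(np.corrcoef(r1, r2)[0, 1])

e1 = np.array([0, 1, 1, 0, 0], dtype=float)
e2 = np.array([0, 1, 0, 0, 0], dtype=float)
print(strictness(e1), agreement(e1, e2), cross_correlation(e1, e2))
print(phoneme_correlation([3, 1, 0, 2], [2, 1, 1, 2]))
```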
Labeling consistency of the human judges (1/4) [table of pairwise judge comparisons on the calibration sentences not reproduced]
Labeling consistency of the human judges (2/4)
- All results are within an acceptable range:
  - 0.85 < A < 0.95
  - CC < 0.65
  - PC < 0.85
  - Relative strictness < 0.14, mean = 0.06
- These mean values can be used as benchmark values
Labeling consistency of the human judges (3/4) [figure not reproduced]
Labeling consistency of the human judges (4/4) [figure not reproduced]
Experimental results (1/7)
- Multiple-mixture monophone models
- Corpus: WSJCAM0
- The range of the rejection threshold was restricted to lie within one standard deviation of the judges' strictness (see the sketch below)
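A hedged sketch of that restriction: sweep the threshold scaling and keep only settings whose resulting system strictness lies within one standard deviation of the judges' mean strictness. The judge values and the toy strictness curve below are illustrative:

```python
from statistics import mean, stdev

def admissible_settings(system_strictness, judge_strictness, alphas):
    """Keep threshold scalings whose rejection rate falls within one
    standard deviation of the human judges' mean strictness."""
    mu, sigma = mean(judge_strictness), stdev(judge_strictness)
    return [a for a in alphas if abs(system_strictness(a) - mu) <= sigma]

judges = [0.04, 0.06, 0.08, 0.05, 0.07, 0.06]   # illustrative strictness
toy_curve = lambda a: 0.12 - 0.02 * a            # toy system strictness
print(admissible_settings(toy_curve, judges, [0.0, 1.0, 2.0, 3.0, 4.0]))
```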
Experimental results (2/7)-(5/7) [result tables and figures not reproduced]
Experimental results (6/7)
- Adding error handling with Latin-American Spanish models to detect systematic mispronunciations
Experimental results (7/7)
- Comparison of transcriptions between the human judges and the system with the error network
Conclusions and future work
- 2 GoP scoring mechanisms:
  - Basic GoP
  - GoP with a systematic-mispronunciation penalty
- Refinement methods:
  - MLLR adaptation
  - Phone-dependent thresholds trained from human judgements
  - Error network
- Future work: information about the type of mistake