1 Using the HTK speech recogniser to analyse prosody in a corpus of German spoken learners' English. Toshifumi Oba, Eric Atwell. University of Leeds, School of Computing
2 Outline Introduction -Intonation and Speech Recognition -Tendency of Speech Recognition Research -ISLE Speech Corpus -HTK Hidden Markov Model Toolkit Prosodic Annotation Human Evaluation of Intonation Abilities Grouping of German Speakers by Intonation Ability HTK speech recognition experiments Conclusions Q & A
3 Intonation and Speech Recognition Intonation is important in human communication. -Conveys the meaning and attitude of the speaker Intonation is important for speech recognition. -Acoustic Models (duration, F0, intensity) -Language Models (identify the dialogue type)
4 Tendency of Speech Recognition Research Intonation << Pronunciation Non-native speaker << Native speaker Speech recognition research on non-native speakers' intonation is unique. Intonation also receives less attention than pronunciation in CALL.
5 Features of Various Speech Recognition Research
Research Reference          Non-N  German  Inton  G/P  HTK
(Taylor, 1998)              N      N       Y      N    Y
(Uebler, 1998)              Y      N       N      N    N
(Stemmer, 2001)             Y      Y       N      N    N
(Teixeira, 1996)            Y      Y       N      N    Y
(Hansen, 1995)              Y      Y       Y      Y    N
(Yan and Vaseghi, 2002)     N      N       Y      N    Y
(Jurafsky et al., 1994)     Y      Y       N      N    N
(Berkling et al., 1998)     Y      N       N      N    Y
(Oba and Atwell, 2003)      Y      Y       Y      Y    Y
6 Objectives Analysis of non-native speakers' English intonation. -Is HTK able to distinguish intonation? -Is it possible to train distinct models for different intonation ability groups? Prosodic annotation of written English text to produce model intonation patterns. Human evaluation to group German speakers by English intonation ability.
7 ISLE Speech Corpus (1) Re-use of the speech corpus collected in the ISLE (Interactive Spoken Language Education) project. Leeds University, Universität Hamburg, Università di Milano-Bicocca, Entropic Ltd., Ernst Klett Verlag GmbH, and Dida*El S.R.L. Time-aligned audio recordings of 23 German and 23 Italian spoken learners' English + 2 native English speakers.
8 ISLE Speech Corpus (2) Speaker adaptation -82 sentences edited from The Ascent of Everest e.g. It is in fact a story of many years, in which men tried to climb that mountain. Typical EFL exercises -Minimal Pairs and Polysyllabic words e.g. I said bad not bed. He's a photographer.
9 ISLE Speech Corpus (3) Annotated corpus -Pronunciation errors at word and phone levels -Stress errors at word level Prosodic annotation was added to a written transcription of the speech corpus in our research.
10 HTK Hidden Markov Model Toolkit Developed at Cambridge University Engineering Department (CUED). Free toolkit for building Hidden Markov Models (HMMs). Module calls: available from both the command line and script files. Used in speech recognition research and other pattern recognition research. e.g. Handwriting recognition, Facial recognition
11 Prosodic Annotation Purpose: Predict model intonation patterns to be compared against German spoken learners' English. Instructions: From text structure to prosodic structure (Knowles, 1996) Environment: Windows Excel Amount: First 27 sentences from The Ascent of Everest
12 Result of Prosodic Annotation (1) 27 sentences, consisting of 429 words, were divided into 84 tone groups (prosodic phrases): 1 low rise, 3 high rise, 52 fall-rise and 28 fall patterns. The first 10 sentences were modified according to native speakers' recordings. -15 fall-rise and 10 fall patterns -1 low rise, 2 high rise and 4 fall-rise were deleted.
13 Result of Prosodic Annotation (2) (A_01) This is the story of how two men reached the top of Everest on the twenty-ninth of May nineteen fifty-three and came back safely to their friends below. (A_02) Yet this will not be the whole story. (A_03) The ascent of Everest was not the work of one day, nor even of those few unforgettable weeks in which we prepared and climbed that summer.
14 Human Evaluation of German Spoken Learners' English Intonation Abilities Purpose: Group German speakers into good and poor intonation groups. Evaluator I: Computational linguistics researcher Evaluator II: English language teaching researcher Quantity: First 10 utterances from each speaker. -If all the tone types of an utterance matched the model pattern, the utterance was judged correct; otherwise incorrect.
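The judging rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `utterance_is_correct`, the representation of an utterance as a list of tone-type labels (one per tone group), and the rule that a different number of tone groups counts as a mismatch are all hypothetical and not taken from the slides.

```python
from typing import Sequence


def utterance_is_correct(observed: Sequence[str], model: Sequence[str]) -> bool:
    """Return True only if every tone group carries the model tone type."""
    # Assumption: an utterance with a different number of tone groups
    # than the model pattern cannot match it.
    if len(observed) != len(model):
        return False
    return all(o == m for o, m in zip(observed, model))


# Hypothetical model pattern for a two-tone-group utterance.
model_pattern = ["fall-rise", "fall"]

print(utterance_is_correct(["fall-rise", "fall"], model_pattern))  # True
print(utterance_is_correct(["fall", "fall"], model_pattern))       # False
```

Under this all-or-nothing rule a single mismatched tone group marks the whole utterance as incorrect.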
15 Grouping of 23 German Speakers Grouping I: based on Evaluator I (Computational linguistics researcher) Grouping II: based on Evaluator II (English language teaching (ELT) researcher) Grouping III: agreement of Evaluators I and II. 23 speakers: 3 exceptionally poor pronunciation speakers, 8 good, 4 intermediate and 8 poor intonation speakers
16 Result of Human Evaluation and Grouping The two evaluators agreed on about 63% of the utterances (144 out of 230). Evaluator II marked 109 errors, while Evaluator I marked 78 errors. However, 7 poor and 5 good speakers were the same in Grouping I and Grouping II. 2 speakers were added to the good intonation group in Grouping III.
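As a quick check of the agreement figure, the sketch below recomputes it from the counts on the slide. The helper name `percent_agreement` is an assumption for illustration; the total of 230 utterances is taken as 23 speakers times the first 10 utterances each.

```python
def percent_agreement(agreed: int, total: int) -> float:
    """Percentage of utterances on which both evaluators gave the same judgement."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * agreed / total


total_utterances = 23 * 10  # 23 speakers, first 10 utterances each
agreement = percent_agreement(144, total_utterances)
print(f"{agreement:.1f}%")  # 62.6%
```

This is raw percent agreement only; it does not correct for agreement expected by chance.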
17 Conditions of HTK Speech Recognition Experiments Monophone and triphone HMMs were trained. No language models were used. A Perl script and a configuration file were used for module calls. Number of training speakers: 6 speakers from the same intonation group. Number of test speakers: 2 (1 for Grouping III) speakers from each group.
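A scripted module call of this kind might look roughly as follows. This is a Python sketch, not the authors' Perl script: the helper name `herest_command` and all file and directory names are hypothetical, and the options shown (`-C`, `-I`, `-S`, `-H`, `-M`) follow the typical HERest re-estimation usage in the HTK Book tutorial. The function only builds the argument list; actually running it requires an HTK installation.

```python
from typing import List


def herest_command(config: str, label_mlf: str, train_script: str,
                   model_dir_in: str, model_dir_out: str,
                   hmm_list: str) -> List[str]:
    """Build the argument list for one HERest embedded re-estimation pass."""
    return [
        "HERest",
        "-C", config,                     # configuration file
        "-I", label_mlf,                  # master label file with transcriptions
        "-S", train_script,               # script file listing training data
        "-H", f"{model_dir_in}/macros",   # input model files
        "-H", f"{model_dir_in}/hmmdefs",
        "-M", model_dir_out,              # directory for the re-estimated models
        hmm_list,                         # list of HMM names to train
    ]


# Hypothetical file names for one training pass on one intonation group.
cmd = herest_command("config", "phones.mlf", "train_good.scp",
                     "hmm0", "hmm1", "monophones")
print(" ".join(cmd))
# To execute (assumes HTK is installed and on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list keeps each option and its value separate, so the same helper can be reused for every training pass and every speaker group.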
18 Results of HTK experiments Recognition accuracy was generally higher when the test and training speakers' intonation abilities were the same. Improvement was higher against triphone HMMs. Improvement was most significant in Experiment II. One poor intonation speaker showed negative improvement in all three experiments. Another poor speaker also showed negative improvement in Experiment I.
19 Average Recognition Accuracies of Good Intonation Speakers (Parentheses show results against monophone HMMs)
                  Trained Models
                  Good            Poor            Improvement
Experiment I      % (33.31 %)     % (20.11 %)     % (13.20 %)
Experiment II     % (34.84 %)     % (19.41 %)     % (15.43 %)
Experiment III    % (34.50 %)     % (18.09 %)     % (16.41 %)
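The improvement column can be read as the accuracy with matched (good) trained models minus the accuracy with mismatched (poor) trained models. The sketch below checks this reading against the monophone figures in parentheses above; the names `monophone` and `improvement` are illustrative assumptions.

```python
# Monophone accuracies (%) of good intonation test speakers:
# (trained on good speakers, trained on poor speakers)
monophone = {
    "Experiment I": (33.31, 20.11),
    "Experiment II": (34.84, 19.41),
    "Experiment III": (34.50, 18.09),
}


def improvement(matched: float, mismatched: float) -> float:
    """Accuracy gain from matching test and training intonation groups."""
    return round(matched - mismatched, 2)


for name, (good_models, poor_models) in monophone.items():
    print(f"{name}: {improvement(good_models, poor_models):.2f} %")
# Experiment I: 13.20 %
# Experiment II: 15.43 %
# Experiment III: 16.41 %
```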
20 Average Recognition Accuracies of Poor Intonation Speakers (Parentheses show results against monophone HMMs)
                  Trained Models
                  Good            Poor            Improvement
Experiment I      % (22.88 %)     % (20.03 %)     6.76 % (-2.85 %)
Experiment II     % (34.84 %)     % (19.41 %)     % (1.67 %)
Experiment III    % (20.60 %)     % (20.12 %)     % (-0.48 %)
21 Prosodic Keywords Tone type is decided by the last accented syllable (Knowles, 1996). We called the word containing the last accented syllable of each tone group the prosodic keyword. Recognition accuracy among prosodic keywords was counted for the triphone cases of Experiment II. Improvement of recognition accuracy among prosodic keywords was higher than the overall improvement. -Good test speakers: 26.00% (overall 19.20%) -Poor test speakers: 24.50% (overall 15.50%)
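The definition of a prosodic keyword can be sketched as a small lookup over one tone group. The data representation (a list of word and accent-flag pairs), the function name `prosodic_keyword`, and the accent marks in the example are hypothetical and chosen only for illustration; they are not the annotation from the study.

```python
from typing import List, Optional, Tuple

# (word, True if the word contains an accented syllable)
Word = Tuple[str, bool]


def prosodic_keyword(tone_group: List[Word]) -> Optional[str]:
    """Return the word holding the last accented syllable of a tone group."""
    for word, accented in reversed(tone_group):
        if accented:
            return word
    return None  # no accented word marked in this tone group


# Hypothetical accent marking for sentence A_02, treated as one tone group.
tone_group = [("Yet", False), ("this", False), ("will", False), ("not", True),
              ("be", False), ("the", False), ("whole", True), ("story", True)]
print(prosodic_keyword(tone_group))  # story
```

Keyword-level accuracy would then be counted only over the words this function returns for each tone group.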
22 Irrelevance of Pronunciation Abilities Good intonation speakers tended to have slightly better pronunciation ability than poor intonation speakers, although the 3 exceptionally poor pronunciation speakers had been excluded. Additional experiments were executed taking the 2 best and 2 worst pronunciation speakers from the poor and good intonation groups, respectively. Similar improvement was observed in this experiment too.
23 Conclusions Matching the test and training speakers' intonation abilities brought about higher recognition accuracy. HTK was able to distinguish good and poor intonation. Confirmed that German speakers' weakness in English intonation was generally in fall-rise patterns. Human evaluation was successful enough.
24 Future Work Expand tone types (not only fall-rise and fall patterns). Apply to other languages and to different native-speaker groups. Use the results in practical language-teaching systems.