1
A Text-free Approach to Assessing Nonnative Intonation
Joseph Tepperman, Abe Kazemzadeh, and Shrikanth Narayanan
Signal Analysis and Interpretation Laboratory, University of Southern California
This work is supported by the National Science Foundation.
2
Intonation in English is Predictable
“…if you’re a mind-reader” – D. Bolinger
We know what native speakers usually don’t do
  – e.g. put pitch accents on function words
But what they can do is so open…
  – “I didn’t steal your red hat” (Joe stole it)
  – “I didn’t steal your red hat” (I ate it)
  – “I didn’t steal your red hat” (I stole your red shirt)
How can we decide if a nonnative speaker’s “tune” sounds native?
  – …without limiting the sentence structure?
3
Past Approaches to This Task
Compare the f0 contour with a reference expert pronunciation
  – Doesn’t allow for variability
Extract features from syllables, then classify
  – Ad hoc
  – Requires syllable segmentation
[Figures: f0-vs-time contours; per-syllable features include mean, slope, max, min, range, duration]
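As a rough illustration of the syllable-feature approach above, a minimal sketch computing the features named in the figure; the function name, frame rate, and output layout are assumptions, not taken from the paper.

```python
# Hypothetical illustration of per-syllable f0 features (mean, slope, max, min,
# range, duration); names and the 100 Hz frame rate are assumptions.
import numpy as np

def syllable_f0_features(f0, frame_rate_hz=100.0):
    """f0: one syllable's per-frame pitch values in Hz."""
    f0 = np.asarray(f0, dtype=float)
    t = np.arange(len(f0)) / frame_rate_hz      # frame times in seconds
    slope = np.polyfit(t, f0, deg=1)[0]         # linear-fit slope in Hz/s
    return {
        "mean": f0.mean(),
        "slope": slope,
        "max": f0.max(),
        "min": f0.min(),
        "range": f0.max() - f0.min(),
        "duration": len(f0) / frame_rate_hz,    # seconds
    }
```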
4
Our Proposed Method
Train native intonation HMMs over processed f0 contours (also: ∆f0, ∆∆f0, energy)
  – Normalized, interpolated, smoothed (see the preprocessing sketch below)
  – Annotated with ToBI labels
Decode a nonnative speaker’s contour
  – Define an intonation “grammar” for recognition
  – No text required
Calculate standard confidence measures for the recognized accents/boundaries
  – Demonstrate correlation with overall pronunciation scores
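A sketch of the f0 preprocessing listed above (normalize, interpolate, smooth); the linear interpolation, moving-average window, and z-score normalization are assumptions made for illustration, not the paper’s exact recipe.

```python
# Hedged sketch of f0 contour preprocessing: interpolate through unvoiced
# frames, smooth, then normalize per utterance.
import numpy as np

def preprocess_f0(f0, smooth_win=5):
    """f0: per-frame pitch values, with 0 in unvoiced frames."""
    f0 = np.asarray(f0, dtype=float)
    frames = np.arange(len(f0))
    voiced = f0 > 0
    # Interpolate linearly through unvoiced stretches to get a continuous contour
    f0_interp = np.interp(frames, frames[voiced], f0[voiced])
    # Smooth with a short moving average
    kernel = np.ones(smooth_win) / smooth_win
    f0_smooth = np.convolve(f0_interp, kernel, mode="same")
    # Normalize per utterance to reduce speaker-dependent pitch range effects
    return (f0_smooth - f0_smooth.mean()) / (f0_smooth.std() + 1e-8)
```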
5
HMM Training
[Figure: f0-vs-time contours labeled with tone models (SIL, L*, H*) at the centerpoints of pitch accents/boundaries, before and after interpolation & smoothing]
  – Interpolation & smoothing compensates for segmental effects on f0 realization
  – Models trained with Baum-Welch; 5 states for each model
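A minimal sketch of how one such tone HMM could be trained. The slide only states 5-state models trained with Baum-Welch over f0, ∆f0, ∆∆f0, and energy; the use of hmmlearn’s GMMHMM, diagonal covariances, and the data layout are assumptions for illustration.

```python
# Sketch: train one native tone HMM (e.g. for H*) from labeled contour segments.
import numpy as np
from hmmlearn import hmm

def train_tone_hmm(segments, n_mix=32):
    """segments: list of (n_frames, 4) arrays [f0, delta f0, delta-delta f0, energy],
    one per training token of the tone being modeled."""
    X = np.vstack(segments)
    lengths = [len(s) for s in segments]
    model = hmm.GMMHMM(n_components=5,        # 5 states per model, as on the slide
                       n_mix=n_mix,           # mixtures per state (16/32/64 in the results)
                       covariance_type="diag",
                       n_iter=20)             # Baum-Welch (EM) iterations
    model.fit(X, lengths)
    return model
```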
6
Intonation Grammars
Two sets of models: AB (accents/boundaries) and HL (high/low tones)
FSGs
  – AB: [finite-state diagram over accent (*), boundary (%), and SIL symbols]
  – HL: [finite-state diagram over H/L accents and boundaries]
Bigram tone models
  – e.g. SIL L* H*H% L%
Legend: H = high, L = low, * = accent, % = boundary, SIL = silence
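An illustrative sketch of the bigram tone model idea: tone-to-tone transition probabilities estimated from ToBI label sequences. The additive smoothing and label inventory below are assumptions, not the paper’s specification.

```python
# Sketch: maximum-likelihood bigram "grammar" over tone labels, with additive
# smoothing so unseen tone pairs keep a small nonzero probability.
from collections import defaultdict

def train_tone_bigram(label_sequences, alpha=0.5):
    """label_sequences: list of tone-label lists, e.g. ['SIL', 'L*', 'H*', 'H%', 'SIL']."""
    counts = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for seq in label_sequences:
        vocab.update(seq)
        for prev, cur in zip(seq[:-1], seq[1:]):
            counts[prev][cur] += 1.0
    bigram = {}
    for prev in vocab:
        total = sum(counts[prev].values()) + alpha * len(vocab)
        bigram[prev] = {cur: (counts[prev][cur] + alpha) / total for cur in vocab}
    return bigram
```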
7
Score Calculation
For a single recognized tone “segment”: [equation on slide]
  where O is the speech observation in suprasegmental features, M_t is the recognized tone model, and i takes values over all tone HMMs.
Then the overall utterance-level score over T tones is: [equation on slide]
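A plausible form of the two scores referred to above, assuming the standard normalized log-posterior (“goodness of pronunciation”-style) confidence measure per tone and a simple average over the utterance’s tones; the exact equations are not given in the text here, so this is a reconstruction from the variable definitions.

```latex
% Assumed reconstruction: per-tone confidence as a log-posterior of the
% recognized tone model M_t given the observation O, normalized over all
% tone HMMs M_i, and an utterance score averaged over the T recognized tones.
\[
  s_t \;=\; \log \frac{P(O \mid M_t)}{\sum_{i} P(O \mid M_i)},
  \qquad
  S \;=\; \frac{1}{T} \sum_{t=1}^{T} s_t
\]
```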
8
Corpora
Native (training set)
  – The BURNC: professionally read AE radio speech; ToBI transcripts for one speaker (1.2 hours)
  – The IViE corpus: designed to capture intonation variation in BE; ToBI-like labels for read Southern BE (0.1 hours)
Nonnative (test set)
  – The ISLE corpus: Italian and German learners of BE (23 of each); 138 read sentences (3 x 46 speakers); no tone transcripts
AE = American English, BE = British English
According to Bolinger, AE and BE differ not in tone shape but in frequency and context of use.
9
Perceptual Evaluations
3 sentences for each nonnative speaker:
  – 1: “I said ‘white,’ not ‘bait.’”
  – 2: “Could I have chicken soup as a starter and then lamb chops?”
  – 3: “This summer I’d like to visit Rome for a few days.”
Overall pronunciation scored on a 1 to 5 scale
  – Six native English-speaking evaluators
  – Includes both prosodic and segment-level effects

Mean inter-evaluator correlation
  All sentences    0.657
  Speaker-level    0.798
  Sentence 1       0.640
  Sentence 2       0.760
  Sentence 3       0.584
  Italian          0.707
  German           0.238

Notes:
  – Speaker-level scores are the median of the three sentence-level scores
  – Some sentences had more obvious pronunciation mistakes
  – Evaluators were self-consistent and used context
  – Italian speakers were less proficient; German is more closely related to English
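One way a mean inter-evaluator correlation like the one tabulated above can be computed is as the average pairwise Pearson correlation between evaluators’ score vectors; this pairing scheme is an assumption, since the slide does not give the exact formula.

```python
# Sketch: mean pairwise Pearson correlation across evaluators' ratings.
import numpy as np
from itertools import combinations

def mean_inter_evaluator_correlation(scores):
    """scores: (n_evaluators, n_items) array of 1-5 ratings, one row per evaluator."""
    scores = np.asarray(scores, dtype=float)
    pair_corrs = [np.corrcoef(scores[a], scores[b])[0, 1]
                  for a, b in combinations(range(len(scores)), 2)]
    return float(np.mean(pair_corrs))
```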
10
Results
Correlation between automatic scores and evaluator medians for both model sets, four grammars, and a variable number of mixtures per state:

  HL models        16       32       64
  FSG            -0.030    0.070   -0.049
  Bigram: BE      0.221    0.244    0.247
  Bigram: AE      0.210    0.262    0.203
  Bigram: both    0.246    0.331    0.248

  AB models        16       32       64
  FSG            -0.156   -0.171   -0.203
  Bigram: BE      0.026    0.012   -0.045
  Bigram: AE      0.034    0.025    0.022
  Bigram: both   -0.010    0.012   -0.027

Observations:
  – The theoretical FSG doesn’t apply to nonnatives
  – Too many mixtures = overtraining
  – BE and AE can be used together for the intonation grammar
  – High/Low models are necessary
11
Results, continued
Automatic scores follow perceptual trends:

                   Mean inter-evaluator correlation   Best automatic score correlation
  All sentences              0.657                              0.331
  Speaker-level              0.798                              0.280
  Sentence 1                 0.640                              0.308
  Sentence 2                 0.760                              0.511
  Sentence 3                 0.584                              0.181
  Italian                    0.707                              0.233
  German                     0.238                              0.156

But not at the speaker level: automatic scores did not use context.
12
In Conclusion
0.331 correlation
  – Represents the contribution of intonation to overall pronunciation scores
  – Considering all factors, inter-human agreement is 0.6 - 0.8
  – Comparable to the SRI EduSpeak system, whose prosodic features are derived from knowledge of the text
Next: combine this with segment-level features for robust overall pronunciation scores
Can also potentially be used for:
  – Speaker ID
  – Pronunciation scoring of spontaneous speech