1
Recognition of Voice Onset Time for Use in Detecting Pronunciation Variation
● Project Description
● What is Voice Onset Time (VOT)?
  – Physical Realization
  – Linguistic Significance
● Motivation for studying VOT
● Methodology for automatically analyzing VOT contrasts
● Evaluation Method
● Results
● Discussion
2
Project Description
● Automatically distinguish whether a voiceless stop consonant is pronounced with a native or accented pronunciation, based on voice onset time characteristics.
● Use data from the Tball corpus: ESL children doing oral reading tasks.
● Evaluate different methods of accomplishing this.
  – State duration measurements
  – Explicit modeling of aspiration
  – Model probability discrimination
3
What is VOT?
● Voice onset time is defined for stops
  – e.g. /p,b,t,d,k,g/
● It is the interval between the release of closure of an articulator (the transient “burst”) and the start of voicing.
● VOT has a continuum of values:
  – When the start of voicing precedes the release of closure for a stop, the VOT takes on a negative value.
  – When the release of closure and onset of voicing are coincident, VOT is zero.
  – When voicing comes after release of closure, VOT is positive.
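The sign convention above can be written out as a small calculation. This is a minimal sketch, not part of the original project: the function names, the use of seconds for the input times, and the 5 ms tolerance for treating release and voicing as "coincident" are all assumptions made for illustration.

```python
def voice_onset_time_ms(release_time_s: float, voicing_onset_s: float) -> float:
    """VOT in milliseconds: voicing onset minus release of closure.

    Negative when voicing starts before the release, positive when it
    starts after the release.
    """
    return (voicing_onset_s - release_time_s) * 1000.0


def describe_vot(vot_ms: float, tolerance_ms: float = 5.0) -> str:
    """Label a VOT value as negative, zero, or positive.

    tolerance_ms is an assumed margin for calling the two events
    coincident; the slides do not specify one.
    """
    if vot_ms < -tolerance_ms:
        return "negative"
    if vot_ms > tolerance_ms:
        return "positive"
    return "zero"


# Example: release at 100 ms, voicing at 145 ms gives a positive VOT.
example_vot = voice_onset_time_ms(0.100, 0.145)
example_label = describe_vot(example_vot)
```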
4
Physical Realization of VOT
● Stop consonants are produced with a closure of the vocal tract at a specific point, the place of articulation.
● During the closure, there is a build-up of sub-laryngeal pressure.
● When the closure is released there is a transient burst of air, frication due to turbulence at the place of articulation, and aspiration noise from turbulence at the glottis.
● Voicing may occur before, during, or after the closure.
5
Linguistic Significance of VOT
● VOT distinguishes consonants with the same place of articulation (/p/ vs. /b/, /t/ vs. /d/, etc.)
● However, different languages use different VOT intervals in contrasts (e.g. “taco”, “pasta”).
● English voiceless stops: VOT = +40-50 ms
● Spanish voiceless stops: VOT = near zero
● English voiced stops: VOT = near zero
● Spanish voiced stops: negative VOT (voicing before release of closure)
6
Linguistic Significance Cont'd
● In English, voiceless stops have a long VOT at the beginning of a word and before stressed vowels, so aspiration is a perceptual cue to word boundaries and stress.
● Since the frication and aspiration during the VOT are due to build-up of pressure from the lungs, they may correspond with emphasis.
7
Motivation for Studying VOT
● This study was motivated by a desire to determine whether a phone was pronounced with a non-standard pronunciation.
● Other reasons to study VOT
  – It is an important contrastive feature
  – It gives information about stress
  – It gives information about word segmentation
  – It may give information about emphasis
8
Methodology
● Baseline: use duration measurements from a forced alignment.
● Insert an /h/ symbol in the transcriptions with standard pronunciation, train accordingly and decode the test files to see if the /h/ phone is recognized.
● Cut out the phones of interest from the audio file, train separate models and a combined model, and evaluate the likelihood of the separate models w.r.t. the combined model.
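The second method (inserting /h/ into the transcriptions) can be sketched as follows. This is only an illustration under stated assumptions, not the scripts used in the project: the function names `insert_aspiration` and `aspiration_detected`, the phone labels, and the idea of marking non-standard tokens by their index are all hypothetical, and the sketch adds /h/ after every standard voiceless stop regardless of phonetic context.

```python
VOICELESS_STOPS = {"p", "t", "k"}


def insert_aspiration(phones, nonstandard_indices=frozenset(), aspiration="h"):
    """Return a copy of the phone sequence with an aspiration symbol
    after each voiceless stop that was transcribed as standard.

    nonstandard_indices (hypothetical input) holds the positions of stops
    marked as non-standard (short VOT); those get no aspiration symbol.
    """
    out = []
    for i, phone in enumerate(phones):
        out.append(phone)
        if phone in VOICELESS_STOPS and i not in nonstandard_indices:
            out.append(aspiration)
    return out


def aspiration_detected(decoded_phones, stop_index, aspiration="h"):
    """True if the decoder output has the aspiration symbol directly
    after the stop at stop_index, i.e. the stop is judged long-VOT."""
    next_index = stop_index + 1
    return next_index < len(decoded_phones) and decoded_phones[next_index] == aspiration


# "taco" with both stops standard, then with the initial /t/ non-standard.
standard = insert_aspiration(["t", "aa", "k", "ow"])
accented = insert_aspiration(["t", "aa", "k", "ow"], nonstandard_indices={0})
```

Training on transcriptions built this way lets the decoder choose between a stop with and without a following /h/; whether /h/ appears in the decoded output is then the classification decision.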
9
Methodology (cont'd)
● The data was transcribed by ear with special symbols for non-standard pronunciations.
  – b/c the data for non-standard pronunciations was sparse, the symbol for dental /t/ was included as short VOT.
● Standard 3-state HMM models
  – 4 mixtures, T-state silence model
  – Different frame rates were tested
  – Bootstrap and flat start methods were tested
10
Evaluation Method
● The evaluation metric used was the error rate for both classes, evaluated separately.
  – This was necessary because there were far fewer instances of the non-standard pronunciations.
● When using thresholds, the point of equal error rate for both classes was used.
  – This was necessary b/c moving the threshold would tilt the error rate toward one class or the other.
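The per-class error rates and the equal-error-rate threshold described above can be sketched as follows. This is a minimal illustration, not the project's evaluation code: the function names are hypothetical, the scores could be any scalar such as a state duration, and the sketch assumes that a score at or above the threshold is classified as long VOT.

```python
def class_error_rates(short_scores, long_scores, threshold):
    """Error rate of each class, computed separately.

    Assumed decision rule: score >= threshold means long VOT.
    """
    short_err = sum(s >= threshold for s in short_scores) / len(short_scores)
    long_err = sum(s < threshold for s in long_scores) / len(long_scores)
    return short_err, long_err


def equal_error_threshold(short_scores, long_scores):
    """Pick the observed score at which the two per-class error rates
    are closest to each other (the approximate equal-error point)."""
    candidates = sorted(set(short_scores) | set(long_scores))

    def gap(threshold):
        short_err, long_err = class_error_rates(short_scores, long_scores, threshold)
        return abs(short_err - long_err)

    best = min(candidates, key=gap)
    return best, class_error_rates(short_scores, long_scores, best)


# Made-up durations in ms; the short-VOT class is deliberately smaller.
short_vot = [10, 15, 20, 40]
long_vot = [30, 45, 50, 55, 60, 65, 70, 80]
threshold, (short_err, long_err) = equal_error_threshold(short_vot, long_vot)
```

Reporting the two rates separately keeps the larger class from dominating a single pooled error figure.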
11
Results
● Baseline method error rates:
  – p: 55%, t: 23%, k: 29%
  – p: 19%, t: 20%, k: 48% using duration of 3rd HMM state
● With aspiration model:
  – Short VOT / Long VOT
  – p: 5% / 36%
  – t: 11% / 38%
  – k: 57% / 17%
● With probability comparison:
  – p: 36% / 4%
  – t: 0% / 5%
  – k: 0% / 6%
  – (trained on test data; over-trained?)
12
Discussion
● Studies have noted that for VOT, /k/ > /t/ > /p/
  – This could explain why the baseline gets poor results for /p/
  – and why the aspiration model predicts the short VOT class best for /p,t/ but predicts the long VOT class best for /k/
● Roughly, each method increased in difficulty.
● The results improved from the baseline, but the last approach (comparing probabilities) may have been over-trained.
● Comparing probabilities may be easier to extend to other pronunciation modeling tasks.
13
Discussion
● Increasing the frame rate didn't help much.
  – Don't use a 1 ms frame rate unless you want to test your patience.
● If an initial consonant has a short VOT, this does not necessarily imply a non-standard accent.
  – Words like “today” and “together” have stress on the 2nd syllable, so the VOT of the initial consonant is shorter even for standard pronunciation.
14
Conclusion
● When classifying stop consonants based on VOT characteristics, different approaches work better on different stops
  – Measuring duration of stop state works reasonably well for /t,k/ b/c they have longer VOT than /p/.
  – Detecting insertion of an aspiration model during decoding works well for /p,t/ but not /k/, which has too many false positives.
  – Comparing phone probabilities worked well except for unaspirated /p/.
15
Future Work
● Since VOT is a time/timing related phenomenon, it may help to explicitly model the state duration density in the HMMs.
● Other optimization criteria might be better suited than maximum likelihood estimation to train models for this purpose.
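As background to the first point: in a standard HMM, a state with self-loop probability a implicitly has a geometric duration distribution, P(d) = a^(d-1) (1 - a), which always peaks at one frame. The sketch below only illustrates that implicit distribution; it is not a proposal from the slides, and the function names are hypothetical.

```python
def geometric_duration_pmf(self_loop_prob, max_frames):
    """P(state lasts exactly d frames) for d = 1..max_frames, given the
    state's self-loop probability in a standard HMM."""
    stay = self_loop_prob
    leave = 1.0 - self_loop_prob
    return [stay ** (d - 1) * leave for d in range(1, max_frames + 1)]


def expected_duration_frames(self_loop_prob):
    """Mean of the geometric duration distribution: 1 / (1 - a)."""
    return 1.0 / (1.0 - self_loop_prob)


# With a = 0.5 the most likely duration is a single frame.
pmf = geometric_duration_pmf(0.5, 3)
mean_frames = expected_duration_frames(0.8)
```

Because the probability only decays with duration, this implicit model cannot place its peak at, say, a 40-50 ms aspiration interval, which is one reason an explicit duration density might help.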