Published by Frederick Stokes. Modified over 9 years ago.
LREC 2008, Marrakech, Morocco
Automatic phone segmentation of expressive speech
L. Charonnat, G. Vidal, O. Boëffard
IRISA/Cordial, Université de Rennes 1, Lannion, France
VIVOS project, funded by the French National Agency for Research (ANR)

OUTLINE
►Introduction
►Corpus description
►Experimentation
  ■text verification
  ■phonetisation
  ■HMM modeling
►A new mixed model
►Results
►Conclusion and perspectives

Introduction
►Objectives
  ■To develop an automatic segmentation system adapted to expressive speech taken from movie dubbing.
  ■To investigate a new modelling methodology using mixed HMMs that combine Context Dependent and Context Independent models.
►Motivations
  ■Voices for TTS applications are created from constrained recordings, whereas unconstrained recordings are available, notably in the post-production industry.
  ■Context-independent phoneme models are usually used to perform label alignment but, in some cases, context-dependent phoneme models can improve the alignment precision for co-articulated sounds.

The speech corpus
►Voice-over recordings of short fantastic stories
  ■recorded in a dubbing studio
  ■speech expressing suspense
►Native French male speaker
►Database content
  ■5 hours and 20 minutes
  ■1633 speech turns
  ■average of 32 words/turn
  ■4995 sentences
►Effects of expressivity
  ■large variability in prosody, long pauses, fillers
  ■the speaker takes liberties with pronunciation (unusual liaisons, approximate pronunciation of some words)

Experimentation
►3 corpora
  ■learning: 70% of the corpus -> to train the models
  ■validation: 12% of the corpus -> to set modeling parameters
  ■test: 18% of the corpus -> to evaluate the overall performance
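
The 70/12/18 partition can be illustrated with a minimal sketch. It assumes the split is made at the speech-turn level with a seeded random shuffle; the slides do not say how the turns were actually assigned, and `split_corpus` is a hypothetical helper name.

```python
import random

def split_corpus(turn_ids, fractions=(0.70, 0.12, 0.18), seed=0):
    """Shuffle speech-turn ids and cut them into learning/validation/test.

    Assumption: the unit of the split is the speech turn and the
    assignment is random; the slides only give the three proportions.
    """
    ids = list(turn_ids)
    random.Random(seed).shuffle(ids)
    n_learn = round(fractions[0] * len(ids))
    n_valid = round(fractions[1] * len(ids))
    learning = ids[:n_learn]
    validation = ids[n_learn:n_learn + n_valid]
    test = ids[n_learn + n_valid:]  # remainder, about 18%
    return learning, validation, test

# 1633 is the number of speech turns reported for the corpus.
learning, validation, test = split_corpus(range(1633))
print(len(learning), len(validation), len(test))
```

Giving the remainder to the test set keeps the three subsets disjoint and exhaustive even when the proportions do not divide the corpus evenly.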

Text verification
►Manual checking
  ■spelling
  ■pronunciation
►Insertion of tags in the text
  ■indicating deep breathing and long pauses
  ■not synchronized with the signal
►Exception dictionary for
  ■some acronyms
  ■foreign words
  ■~600 words
►Speech-turn synchronization

Phonetisation
►Rule-based grapheme-phoneme conversion
►Variants: liaisons, schwas, pauses
►Production of a graph including optional variants
►HTK phonological words
  ils sont amenés => i l / s õ / a m ø n e
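
A graph with optional variants can be pictured as a sequence of slots, each holding one or more alternatives. The sketch below is only an illustration of that idea, not the authors' phonetiser: the slot layout, the `sil` pause label and the optional liaison `t` are assumptions added for the example.

```python
from itertools import product

# Each slot lists its alternatives; "" means the optional unit is skipped.
# The pause label "sil" and the optional liaison "t" are illustrative.
graph = [
    ["i l"],
    ["", "sil"],      # optional pause after "ils"
    ["s õ"],
    ["", "t"],        # optional liaison between "sont" and "amenés"
    ["a m ø n e"],
]

def expand(graph):
    """Enumerate every phoneme string encoded by the slot graph."""
    variants = []
    for choice in product(*graph):
        variants.append(" ".join(unit for unit in choice if unit))
    return variants

variants = expand(graph)
for v in variants:
    print(v)
```

With two optional slots the graph encodes four candidate pronunciations; during alignment the recogniser would pick the path that best matches the signal.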

HMM methodology
►1 phoneme ↔ 1 HMM
►12 MFCC + Energy + derivatives (39 coefficients)
►3 emitting states
►Context Independent models:
  ■initialised on the learning corpus (70% of the corpus)
  ■mixture of 3 Gaussian components
►Context Dependent models:
  ■initialised on Context Independent models
  ■mixture of 4 Gaussian components
  ■estimation of missing contextual models using a classification tree
►Mixed models
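
The count of 39 coefficients is consistent with 13 static values (12 MFCC + energy) extended with first- and second-order derivatives: 13 × 3 = 39. The sketch below uses the usual regression formula for deltas; the window of ±2 frames and the edge-frame replication are assumptions, since the slides do not give the front-end settings.

```python
def deltas(frames, window=2):
    """Regression deltas over +/- `window` frames, replicating edge frames.

    d_t = sum_k k * (c[t+k] - c[t-k]) / (2 * sum_k k^2), k = 1..window
    """
    n = len(frames)
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for i in range(n):
        row = []
        for c in range(len(frames[i])):
            acc = 0.0
            for k in range(1, window + 1):
                nxt = frames[min(i + k, n - 1)][c]
                prv = frames[max(i - k, 0)][c]
                acc += k * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out

def add_derivatives(static):
    """Append first- and second-order deltas to each static frame."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]

# Five dummy frames of 13 static values (12 MFCC + energy), a linear ramp.
static = [[float(t)] * 13 for t in range(5)]
features = add_derivatives(static)
print(len(features[0]))
```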

Mixed models
►Mixing context-dependent models and context-independent models according to their performance on a validation set
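
One way to read this selection rule is sketched below: for each unit, keep whichever model type scored better on the validation set. The granularity (per phoneme here), the tie-breaking in favour of CI, and all scores are assumptions made up for the example; the slides do not specify them.

```python
def build_mixed_inventory(ci_scores, cd_scores):
    """Pick "CD" or "CI" per phoneme from validation alignment scores.

    Assumption: CD is kept only when it strictly beats CI; a phoneme
    without a CD score falls back to its CI model.
    """
    mixed = {}
    for phone, ci in ci_scores.items():
        cd = cd_scores.get(phone)
        mixed[phone] = "CD" if cd is not None and cd > ci else "CI"
    return mixed

# Invented validation scores (fraction of correct alignments).
ci_scores = {"j": 0.80, "a": 0.92, "p": 0.95}
cd_scores = {"j": 0.88, "a": 0.90}

mixed = build_mixed_inventory(ci_scores, cd_scores)
print(mixed)
```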

Comparing CD vs CI models
►Difference in the percentage of correct alignments (<20 ms) between Context-Dependent models and Context-Independent models
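
The comparison metric can be sketched as follows, assuming a boundary counts as correct when it lies strictly less than 20 ms from the reference boundary. Times are integer milliseconds, and all boundary values are invented for the example.

```python
def correct_alignment_rate(reference_ms, predicted_ms, tolerance_ms=20):
    """Percentage of boundaries placed within `tolerance_ms` of the reference."""
    if len(reference_ms) != len(predicted_ms) or not reference_ms:
        raise ValueError("need two non-empty boundary lists of equal length")
    hits = sum(
        1 for ref, pred in zip(reference_ms, predicted_ms)
        if abs(ref - pred) < tolerance_ms
    )
    return 100.0 * hits / len(reference_ms)

# Invented boundary times in milliseconds.
reference = [100, 250, 400, 620]
ci_bounds = [105, 275, 390, 640]   # errors: 5, 25, 10, 20 ms
cd_bounds = [102, 262, 395, 630]   # errors: 2, 12, 5, 10 ms

ci_rate = correct_alignment_rate(reference, ci_bounds)
cd_rate = correct_alignment_rate(reference, cd_bounds)
print(cd_rate - ci_rate)  # positive when CD models align better
```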

Results: phonetic decoding
►Disagreement (Elisions + Insertions + Substitutions) between 5.11% and 5.55%
►Good labelling of liaisons, elisions and insertions of pauses and schwas
►Substitutions: inversion between open and closed vowels
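
A disagreement rate of this kind is commonly obtained from an edit-distance alignment between the reference and decoded phoneme strings. The sketch below assumes the total of elisions, insertions and substitutions is divided by the reference length; the slides do not state the normalisation, and the decoded sequence is invented.

```python
def disagreement_rate(reference, decoded):
    """Percentage of elisions + insertions + substitutions (Levenshtein)."""
    n, m = len(reference), len(decoded)
    if n == 0:
        raise ValueError("reference must not be empty")
    # dist[i][j]: minimal number of edits turning reference[:i] into decoded[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = dist[i - 1][j - 1] + (reference[i - 1] != decoded[j - 1])
            elision = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, elision, insertion)
    return 100.0 * dist[n][m] / n

reference = "i l s õ a m ø n e".split()
# Invented decoding: inserted liaison t, elided schwa, closed e decoded as open ɛ.
decoded = "i l s õ t a m n ɛ".split()
rate = disagreement_rate(reference, decoded)
print(round(rate, 2))
```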

Results: label alignments
►computed on correctly recognised phonetic labels
►mixed models take advantage of context-dependent models (semi-vowels, voiced fricatives, *-nasal consonants)
►+8 percentage points for semi-vowels-*: 90.54% (mixed) vs 82.58% (CI)

Conclusion and perspectives
►Good segmentation scores on expressive speech are due to
  ■an accurate text verification (...but only at a text level)
  ■an automatically generated graph of phonemes including variants
  ■an automatic HMM segmentation
►Experimentation of a new segmentation methodology by mixing CI and CD models
►Perspectives
  ■to improve automatic grapheme-to-phoneme conversion of acronyms and proper names
  ■to apply post-processing for open/closed vowels and pauses
  ■to include new filler models