Merging Segmental, Rhythmic and Fundamental Frequency Features for Automatic Language Identification Jean-Luc Rouas 1, Jérôme Farinas 1 & François Pellegrino.

1 Merging Segmental, Rhythmic and Fundamental Frequency Features for Automatic Language Identification
Jean-Luc Rouas (1), Jérôme Farinas (1) & François Pellegrino (2)
(1) Institut de Recherche en Informatique de Toulouse, UMR 5505 CNRS, Université Paul Sabatier, Toulouse, France
(2) Laboratoire Dynamique du Langage, UMR 5596 CNRS, Université Lumière Lyon 2, Lyon, France
This research is supported by the Région Rhône-Alpes and the French Ministère de la Recherche.
jean-luc.rouas@irit.fr, jerome.farinas@irit.fr, francois.pellegrino@univ-lyon2.fr

2 Overview
1. Introduction
2. Motivations
3. Rhythm unit extraction & modeling
4. Fundamental frequency extraction & modeling
5. Vowel system modeling
6. Language identification: experiments
7. Conclusion and perspectives

3 1. Introduction
- Standard approach to language identification
  - Phonotactic modeling
  - Acoustic-phonetic modeling as pre-processing
- Alternative features are crucial
  - Phonological features (structure of the vowel system, etc.)
  - Prosodic features (intonation, rhythm, stress, etc.)
  - High-level cues (lexicon, etc.)
- Importance of prosody and rhythm
  - One of the most salient features for language identification by humans
  - Difficult to define
  - Even more difficult to model!

4 2. Motivations
2.1. Relevance of rhythm
- What is rhythm?
  - A periodically repeated pattern: syllable, stress or mora
  - Alternative theory (Dauer, 1983)
- Is rhythm important?
  - Major role in early language acquisition (e.g. Cutler & Mehler, 1993)
  - Structure related to the emergence of language (Frame-Content Theory) (MacNeilage & Davis, 2000)
  - Role in speech perception (numerous works)
- Neural network modeling of rhythm (Dominey & Ramus, 2000)
  - Recurrent network dedicated to temporal sequence processing
  - Results:
    - 78 % correct identification for the L1-L2 coherent pair (EN – JA)
    - chance level for the L1-L2 incoherent pair (EN – DU)
  - But the inputs consist of hand-made C/V labelling

5 2. Motivations
2.2. Relevance of intonation
- Is intonation relevant for language discrimination?
  - Languages can be grouped according to whether or not they use tone as a lexical marker
  - Tone-driven language: Mandarin Chinese
    - Changes of F0, or tones, assigned to syllables distinguish lexical items
  - English uses stress at the level of the sentence
  - Two groups of languages with distinctive prosodic signatures
- The challenge
  - Extract prosodic features in a fully unsupervised and language-independent way
  - Model these features and evaluate their relevance

6 3. Rhythm unit extraction
3.1. Speech segmentation and vowel detection
[Figure: frequency (kHz, 0 to 8) and amplitude plotted against time (0 to 1.0 s), with segments labelled NonVowel, Pause and Vowel]
- Speech segmentation: statistical segmentation (André-Obrecht, 1988)
- Speech activity detection
- Vowel detection (Pellegrino & Obrecht, 2000)

7 3. Rhythm unit extraction
3.2 Rhythm units
- Syllable: a good candidate as rhythm unit
  - The syllable seems to be crucial in speech perception (Mehler et al., 1981; Content et al., 2001)
  - But:
    - Syllable parsing seems to be a tricky language-specific mechanism
    - No automatic language-independent algorithm can be derived (yet)
- A roundabout trick: the “pseudo-syllable”
  - Derived from the most frequent syllable structure in the world: CV
  - Uses the vowel segments as milestones
  - The speech signal is parsed into patterns matching the structure C^n V (n is an integer and can be 0)
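The C^n V parsing described on this slide can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `parse_pseudo_syllables` is hypothetical, the input is assumed to be a sequence of 'C'/'V' segment labels already produced by the vowel detector, and two choices are assumptions of the sketch: a run of consecutive vowel segments closes a single unit, and trailing consonants with no following vowel are dropped.

```python
def parse_pseudo_syllables(labels):
    """Group 'C'/'V' segment labels into pseudo-syllables of the form C^n V.

    Assumptions of this sketch (not stated in the slides):
    - a run of consecutive 'V' segments belongs to one unit;
    - consonant segments left over at the end (no vowel follows) are dropped.
    """
    units = []
    current = []
    for i, label in enumerate(labels):
        if label not in ("C", "V"):
            raise ValueError(f"unexpected segment label: {label!r}")
        current.append(label)
        next_label = labels[i + 1] if i + 1 < len(labels) else None
        # A vowel segment not followed by another vowel closes the unit.
        if label == "V" and next_label != "V":
            units.append("".join(current))
            current = []
    return units


if __name__ == "__main__":
    print(parse_pseudo_syllables("CCVVCCVCVCCCVCV"))
```

Because n can be 0, a vowel that directly follows a closed unit forms a pseudo-syllable on its own, for example "VCV" is parsed as "V" followed by "CV".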

8 3. Rhythm unit extraction
3.2 Pseudo-syllable modeling
[Figure: waveform (amplitude vs. time, 0 to 1.0 s) parsed into 5 pseudo-syllables: CCVV, CCV, CV, CCCV, CV]
- Rhythm: Duration C, Duration V, Complexity C
- Intonation: Skewness(F0), Kurtosis(F0)
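As a rough illustration of the three rhythm parameters, the sketch below computes them for one pseudo-syllable given as a list of (label, duration) segments. The name `rhythm_features` and the input layout are hypothetical, and "Complexity C" is assumed here to mean the number of consonant segments in the unit, which the slide does not spell out.

```python
def rhythm_features(unit):
    """Return (duration_c, duration_v, complexity_c) for one pseudo-syllable.

    `unit` is a list of (label, duration_in_seconds) pairs, label in {'C', 'V'}.
    Assumption: complexity_c is the count of consonant segments in the unit.
    """
    duration_c = sum(dur for label, dur in unit if label == "C")
    duration_v = sum(dur for label, dur in unit if label == "V")
    complexity_c = sum(1 for label, _ in unit if label == "C")
    return duration_c, duration_v, complexity_c


if __name__ == "__main__":
    # Illustrative durations only, not measured data.
    print(rhythm_features([("C", 0.25), ("C", 0.5), ("V", 0.125)]))
```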

9 4. Fundamental frequency modeling
- Fundamental frequency extraction:
  - « MESSIGNAIX » toolbox: combination of three methods (AMDF, spectral comb, autocorrelation)
  - Spline interpolation of the F0 curve provides values even on unvoiced segments
- Fundamental frequency modeling:
  - Computation of statistics on each pseudo-syllable: skewness & kurtosis of the F0 distribution
- For each language, a Gaussian Mixture Model is trained using the EM algorithm
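The two F0 descriptors can be computed from the central moments of the F0 values that fall inside a pseudo-syllable. The sketch below is one plausible reading, using population moments and non-excess kurtosis; the slides do not say which estimator the authors used, and the function name `f0_skewness_kurtosis` is hypothetical.

```python
def f0_skewness_kurtosis(f0_values):
    """Return (skewness, kurtosis) of a list of F0 values in Hz.

    Assumptions of this sketch: population (biased) central moments and
    non-excess kurtosis, so a Gaussian distribution would give about 3.
    """
    n = len(f0_values)
    if n < 2:
        raise ValueError("need at least two F0 values")
    mean = sum(f0_values) / n
    m2 = sum((x - mean) ** 2 for x in f0_values) / n
    m3 = sum((x - mean) ** 3 for x in f0_values) / n
    m4 = sum((x - mean) ** 4 for x in f0_values) / n
    if m2 == 0:
        raise ValueError("F0 is constant over the unit; moments are undefined")
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return skewness, kurtosis


if __name__ == "__main__":
    # Illustrative values only, not measured data.
    print(f0_skewness_kurtosis([110.0, 120.0, 130.0, 140.0, 150.0]))
```

Each pseudo-syllable then contributes one two-dimensional vector, which is the kind of input a per-language Gaussian mixture model would be trained on.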

10 5. Vowel system modeling
- Each vowel segment detected by the vowel detection algorithm is represented by:
  - 8 Mel Frequency Cepstral Coefficients (MFCCs)
  - 8 Delta MFCCs
  - Energy
  - Delta Energy
  - Duration of the segment
- Cepstral subtraction is applied for removal of the channel effect and speaker normalization
- For each language, a Gaussian Mixture Model is trained using the EM algorithm
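The slide does not detail the cepstral subtraction step. A common variant is cepstral mean subtraction, in which the per-coefficient mean over an utterance is removed from every frame; the sketch below shows that variant under this assumption, with a hypothetical function name and plain lists of floats standing in for the cepstral vectors.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean over the utterance from each frame.

    `frames` is a list of equal-length lists of cepstral coefficients.
    Assumption: this mean-subtraction variant stands in for the
    "cepstral subtraction" named in the slides.
    """
    if not frames:
        return []
    n = len(frames)
    dim = len(frames[0])
    if any(len(frame) != dim for frame in frames):
        raise ValueError("all frames must have the same number of coefficients")
    means = [sum(frame[k] for frame in frames) / n for k in range(dim)]
    return [[frame[k] - means[k] for k in range(dim)] for frame in frames]


if __name__ == "__main__":
    # Toy two-coefficient frames, for illustration only.
    print(cepstral_mean_subtraction([[1.0, 4.0], [3.0, 8.0]]))
```

A constant offset in the cepstral domain corresponds to a fixed convolutional channel, which is why removing the mean is often used to reduce channel effects.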

11 6. Experiments
- Corpus: MULTEXT
  - 5 European languages (EN, FR, GE, IT, SP)
  - 50 different speakers (male and female)
  - Read utterances from EUROM1
  - Limitation: each text is read by 3.75 speakers on average (possible partial text dependency of the models)
- Identification task
  - Test utterances of 20 s duration
  - Very limited number of speakers, hence cross-validation: 9 speakers for training and 1 for test
  - The learning-testing procedure is iterated for each speaker of the corpus
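The cross-validation protocol amounts to a leave-one-speaker-out loop. The sketch below only generates the train/test splits; the speaker identifiers are hypothetical, and the figure of ten speakers per language is an inference from 50 speakers across 5 languages rather than something the slides state.

```python
def leave_one_speaker_out(speakers):
    """Yield (training_speakers, test_speaker) pairs, one per speaker."""
    for i, test_speaker in enumerate(speakers):
        training_speakers = speakers[:i] + speakers[i + 1:]
        yield training_speakers, test_speaker


if __name__ == "__main__":
    # Hypothetical speaker IDs for one language (assumed 10 per language).
    speakers = [f"FR_{k:02d}" for k in range(10)]
    for train, test in leave_one_speaker_out(speakers):
        print(test, "held out;", len(train), "speakers used for training")
```

Holding out every utterance of the test speaker keeps the models speaker-independent, although the text overlap noted on the slide remains.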

12 6. Experiments
6.1. Rhythm modeling
- Confusion matrix (Model vs. Item) for 20 s test utterances
- Average correct identification rate: 79 %

13 6. Experiments
6.2. F0 modeling
- Confusion matrix (Model vs. Item) for 20 s test utterances
- Average correct identification rate: 53 %

14 6. Experiments
6.3. Vowel system modeling
- Confusion matrix (Model vs. Item) for 20 s test utterances
- Average correct identification rate: 70 %

15 6. Experiments
6.4. Merging
- Simple weighted addition of the log-likelihoods from the three models (rhythm, F0 & vowel systems)
- Weights (experimental):
  - Rhythm model: 0.8
  - F0 model: 0.1
  - Vowel system model: 0.1
- Confusion matrix (Model vs. Item) for 20 s test utterances
- Average correct identification rate: 84 %
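The merging rule can be written out as a weighted sum of per-language log-likelihoods followed by an argmax. The weights below are the ones given on the slide; everything else (the function name, the dictionary layout, the toy scores) is hypothetical, and any normalization of the log-likelihoods before merging is not covered because the slides do not mention one.

```python
# Weights as given on the slide.
WEIGHTS = {"rhythm": 0.8, "f0": 0.1, "vowel": 0.1}


def fuse_and_identify(loglik, weights=WEIGHTS):
    """Merge per-model log-likelihoods and pick the best-scoring language.

    `loglik` maps model name -> {language: log-likelihood of the utterance}.
    Returns (best_language, fused_scores).
    """
    languages = list(next(iter(loglik.values())).keys())
    fused = {
        lang: sum(weights[model] * loglik[model][lang] for model in weights)
        for lang in languages
    }
    best = max(fused, key=fused.get)
    return best, fused


if __name__ == "__main__":
    # Toy log-likelihoods for illustration only, not results from the study.
    toy = {
        "rhythm": {"EN": -10.0, "FR": -12.0},
        "f0": {"EN": -30.0, "FR": -5.0},
        "vowel": {"EN": -8.0, "FR": -9.0},
    }
    print(fuse_and_identify(toy))
```

In the toy example the rhythm model alone would choose EN, but the fused score favours FR, which shows how the two lightly weighted models can still change a decision.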

16 7. Conclusion and perspectives
- Conclusion
  - First approach dedicated to automatic LId with merging of rhythmic and intonation features
  - Rhythmic modeling based on a “pseudo-syllable” parsing
  - Fundamental frequency described by high-order statistics
  - 84 % correct identification rate with 5 languages (20 s utterances)
- Perspectives
  - Improve the rhythmic parsing
  - Model the sequences of rhythmic units and fundamental frequency descriptors
  - Study the impact of the nature of the corpus (read/spontaneous and studio/telephone recording)
  - Merge this approach with phonetic and phonotactic modeling

17 8. Complementary experiments
