Merging Segmental, Rhythmic and Fundamental Frequency Features for Automatic Language Identification
Jean-Luc Rouas (1), Jérôme Farinas (1) & François Pellegrino (2)
(1) Institut de Recherche en Informatique de Toulouse, UMR 5505 CNRS, Université Paul Sabatier, Toulouse, France
(2) Laboratoire Dynamique du Langage, UMR 5596 CNRS, Université Lumière Lyon 2, Lyon, France
This research is supported by the Région Rhône-Alpes and the French Ministère de la Recherche.
Overview
1. Introduction
2. Motivations
3. Rhythm unit extraction & modeling
4. Fundamental frequency extraction & modeling
5. Vowel system modeling
6. Language identification: experiments
7. Conclusion and perspectives
1. Introduction
Standard approach to language identification
- Phonotactic modeling
- Acoustic-phonetic modeling as a pre-processing step
Alternative features are crucial
- Phonological features (structure of the vowel system, etc.)
- Prosodic features (intonation, rhythm, stress, etc.)
- High-level cues (lexicon, etc.)
Importance of prosody and rhythm
- One of the most salient features for language identification by humans
- Difficult to define
- Even more difficult to model!
2. Motivations
2.1. Relevance of rhythm
What is rhythm?
- A periodically repeated pattern: syllable, stress or mora
- Alternative theory (Dauer, 1983)
Is rhythm important?
- Major role in early language acquisition (e.g. Cutler & Mehler, 1993)
- Structure related to the emergence of language (Frame-Content Theory; MacNeilage & Davis, 2000)
- Role in speech perception (numerous works)
Neural network modeling of rhythm (Dominey & Ramus, 2000)
- Recurrent network dedicated to temporal sequence processing
- Results: 78 % correct identification for the L1-L2 coherent pair (EN – JA), chance level for the L1-L2 incoherent pair (EN – DU)
- But the inputs are hand-labelled C/V sequences
2. Motivations
2.2. Relevance of intonation
Is intonation relevant for language discrimination?
- Linguistic grouping of languages according to whether or not they use tone as a lexical marker
- Tone-driven language: Mandarin Chinese. Changes of F0 (tones) assigned to syllables distinguish lexical items
- English uses stress at the level of the sentence
- Two groups of languages with distinctive prosodic signatures
The challenge
- Extract prosodic features in a fully unsupervised and language-independent way
- Model these features and evaluate their relevance
3. Rhythm unit extraction
3.1. Speech segmentation and vowel detection
- Speech segmentation: statistical segmentation (André-Obrecht, 1988)
- Speech activity detection
- Vowel detection (Pellegrino & Obrecht, 2000)
[Figure: spectrogram (frequency in kHz vs. time in s) and waveform (amplitude vs. time in s), with segments labelled Non-Vowel, Pause and Vowel]
3. Rhythm unit extraction
3.2. Rhythm units
Syllable: a good candidate as rhythm unit
- The syllable seems to be crucial in speech perception (Mehler et al., 1981; Content et al., 2001)
- But syllable parsing seems to be a tricky language-specific mechanism
- No automatic language-independent algorithm can be derived (yet)
A workaround: the "pseudo-syllable"
- Derived from the most frequent syllable structure in the world's languages: CV
- Using the vowel segments as milestones
- The speech signal is parsed into patterns matching the structure C^n V (n is an integer and can be 0)
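The C^n V parsing described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the segment representation, the label names ("C", "V", "#" for pause), the function names and the decision to drop consonants that are not followed by a vowel before a pause are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical segment representation: (label, duration in seconds),
# where label is "C" (non-vowel), "V" (vowel) or "#" (pause).
Segment = Tuple[str, float]


def parse_pseudo_syllables(segments: List[Segment]) -> List[List[Segment]]:
    """Group segments into C^n V units, each vowel closing one unit."""
    units: List[List[Segment]] = []
    current: List[Segment] = []
    for label, duration in segments:
        if label == "C":
            current.append((label, duration))
        elif label == "V":
            # The vowel is the milestone that closes the current unit.
            current.append((label, duration))
            units.append(current)
            current = []
        elif label == "#":
            # Assumption: a pause discards pending consonants.
            current = []
        else:
            raise ValueError(f"unknown segment label: {label!r}")
    return units


def pattern(unit: List[Segment]) -> str:
    """Return the label pattern of a unit, e.g. 'CCV'."""
    return "".join(label for label, _ in unit)


example = [("C", 0.05), ("C", 0.04), ("V", 0.10), ("V", 0.08),
           ("C", 0.03), ("#", 0.20), ("C", 0.05), ("V", 0.09)]
patterns = [pattern(u) for u in parse_pseudo_syllables(example)]
```

Two consecutive vowel segments yield a unit with n = 0, which is why the second unit in the example is a bare "V".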
3. Rhythm unit extraction
3.3. Pseudo-syllable modeling
Rhythm:
- Duration C
- Duration V
- Complexity C
Intonation:
- Skewness(F0)
- Kurtosis(F0)
[Figure: waveform (amplitude vs. time in s) parsed into pseudo-syllables, with a span of 5 pseudo-syllables highlighted; segment labels shown: CCVV CCV CV CCCV CV CCC CCV CCV CV CCCV CV]
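The three rhythm parameters listed on this slide could be computed per pseudo-syllable roughly as below. The sketch assumes that "Complexity C" means the number of consonantal segments in the unit, and that each unit is a list of hypothetical (label, duration) pairs; neither detail is stated on the slide.

```python
from typing import List, Tuple

# Hypothetical segment representation: (label, duration in seconds).
Segment = Tuple[str, float]


def rhythm_features(unit: List[Segment]) -> Tuple[float, float, int]:
    """Return (duration C, duration V, complexity C) for one pseudo-syllable."""
    duration_c = sum(d for label, d in unit if label == "C")
    duration_v = sum(d for label, d in unit if label == "V")
    # Assumption: complexity is the count of consonantal segments.
    complexity_c = sum(1 for label, _ in unit if label == "C")
    return duration_c, duration_v, complexity_c


ccv_unit = [("C", 0.05), ("C", 0.04), ("V", 0.10)]
dc, dv, nc = rhythm_features(ccv_unit)
```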
4. Fundamental frequency modeling
Fundamental frequency extraction:
- "MESSIGNAIX" toolbox: combination of three methods (AMDF, spectral comb, autocorrelation)
- Spline interpolation of the F0 curve provides values even on unvoiced segments
Fundamental frequency modeling:
- Statistics computed on each pseudo-syllable: skewness and kurtosis of the F0 distribution
- For each language, a Gaussian Mixture Model is trained using the EM algorithm
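As an illustration of the per-pseudo-syllable statistics, the sketch below computes skewness and kurtosis from the F0 values falling inside one unit. It uses population moments and reports excess kurtosis (kurtosis minus 3); the slide does not say which estimator or convention was used, so both choices, as well as the handling of a perfectly flat contour, are assumptions.

```python
from typing import Sequence, Tuple


def f0_skewness_kurtosis(f0_values: Sequence[float]) -> Tuple[float, float]:
    """Skewness and excess kurtosis of the F0 values of one pseudo-syllable."""
    n = len(f0_values)
    if n < 2:
        raise ValueError("need at least two F0 values")
    mean = sum(f0_values) / n
    # Central moments of order 2, 3 and 4 (population form).
    m2 = sum((x - mean) ** 2 for x in f0_values) / n
    m3 = sum((x - mean) ** 3 for x in f0_values) / n
    m4 = sum((x - mean) ** 4 for x in f0_values) / n
    if m2 == 0.0:
        # Assumption: a flat F0 contour is mapped to (0, 0).
        return 0.0, 0.0
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis


# Made-up F0 values in Hz, for illustration only.
sym_skew, sym_kurt = f0_skewness_kurtosis([100.0, 110.0, 120.0, 130.0, 140.0])
rising_skew, _ = f0_skewness_kurtosis([100.0, 100.0, 100.0, 100.0, 200.0])
```

A symmetric contour gives a skewness of zero, while a contour with a late upward excursion gives a positive value.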
5. Vowel system modeling
Each vowel segment detected by the vowel detection algorithm is represented by:
- 8 Mel Frequency Cepstral Coefficients (MFCCs)
- 8 Delta MFCCs
- Energy
- Delta Energy
- Duration of the segment
Cepstral subtraction is applied for removal of the channel effect and for speaker normalization
For each language, a Gaussian Mixture Model is trained using the EM algorithm
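The vowel representation on this slide adds up to a 19-dimensional vector (8 + 8 + 1 + 1 + 1). The sketch below assembles such a vector and shows one common form of cepstral subtraction, namely removing the per-utterance mean of each cepstral coefficient. Whether the authors used exactly this mean-subtraction variant is an assumption, and the function names are hypothetical.

```python
from typing import List, Sequence


def cepstral_mean_subtraction(frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Subtract the per-coefficient mean computed over all frames of an utterance."""
    if not frames:
        return []
    n = len(frames)
    dim = len(frames[0])
    means = [sum(frame[k] for frame in frames) / n for k in range(dim)]
    return [[frame[k] - means[k] for k in range(dim)] for frame in frames]


def vowel_vector(mfcc: Sequence[float], delta_mfcc: Sequence[float],
                 energy: float, delta_energy: float, duration: float) -> List[float]:
    """Concatenate the parameters of one vowel segment into a 19-dim vector."""
    if len(mfcc) != 8 or len(delta_mfcc) != 8:
        raise ValueError("expected 8 MFCCs and 8 delta MFCCs")
    return list(mfcc) + list(delta_mfcc) + [energy, delta_energy, duration]


# Made-up two-coefficient frames, for illustration only.
normalized = cepstral_mean_subtraction([[1.0, 2.0], [3.0, 6.0]])
vector = vowel_vector([0.0] * 8, [0.0] * 8, 1.0, 0.1, 0.08)
```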
6. Experiments
Corpus: MULTEXT
- 5 European languages (EN, FR, GE, IT, SP)
- 50 different speakers (male and female)
- Read utterances from EUROM1
- Limitation: each text is read by 3.75 speakers on average (possible partial text dependency of the models)
Identification task
- Test utterances of 20 s duration
- Very limited number of speakers, hence cross-validation: 9 speakers for training and 1 for test
- The learning-testing procedure is iterated for each speaker of the corpus
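The cross-validation scheme can be read as a leave-one-speaker-out loop. The sketch below assumes 10 speakers per language (50 speakers over 5 languages), unique speaker identifiers, and that only the held-out speaker's language loses a training speaker in a given fold; the slide does not spell out these details, and the speaker identifiers are invented.

```python
from typing import Dict, Iterator, List, Tuple


def leave_one_speaker_out(
    speakers: Dict[str, List[str]]
) -> Iterator[Tuple[str, str, Dict[str, List[str]]]]:
    """Yield (language, held-out speaker, training speakers per language)."""
    for language, ids in speakers.items():
        for held_out in ids:
            # Every speaker except the held-out one is available for training.
            train = {
                lang: [s for s in lang_ids if s != held_out]
                for lang, lang_ids in speakers.items()
            }
            yield language, held_out, train


# Hypothetical speaker identifiers, 10 per language.
corpus = {lang: [f"{lang}{i:02d}" for i in range(10)]
          for lang in ("EN", "FR", "GE", "IT", "SP")}
folds = list(leave_one_speaker_out(corpus))
```

Each fold trains one model per language and tests on the utterances of the single held-out speaker, so no speaker appears in both training and test data.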
6. Experiments
6.1. Rhythm modeling
[Table: confusion matrix, Model vs. Item, for 20 s test utterances]
Average correct identification rate: 79 %
6. Experiments
6.2. F0 modeling
[Table: confusion matrix, Model vs. Item, for 20 s test utterances]
Average correct identification rate: 53 %
6. Experiments
6.3. Vowel system modeling
[Table: confusion matrix, Model vs. Item, for 20 s test utterances]
Average correct identification rate: 70 %
6. Experiments
6.4. Merging
Simple weighted addition of the log-likelihoods from the three models (rhythm, F0 and vowel system)
Weights (set experimentally):
- Rhythm model: 0.8
- F0 model: 0.1
- Vowel system model: 0.1
[Table: confusion matrix, Model vs. Item, for 20 s test utterances]
Average correct identification rate: 84 %
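The weighted addition of log-likelihoods can be sketched as below, using the weights given on this slide. The data layout (a dictionary of per-model, per-language scores), the function names and the log-likelihood values are made up for illustration; only the weights come from the slide.

```python
from typing import Dict

# Weights reported on the slide.
WEIGHTS = {"rhythm": 0.8, "f0": 0.1, "vowel": 0.1}


def fuse(scores: Dict[str, Dict[str, float]],
         weights: Dict[str, float] = WEIGHTS) -> Dict[str, float]:
    """Weighted sum of per-model log-likelihoods for each candidate language."""
    languages = next(iter(scores.values())).keys()
    return {
        lang: sum(weights[model] * scores[model][lang] for model in weights)
        for lang in languages
    }


def identify(scores: Dict[str, Dict[str, float]]) -> str:
    """Return the language with the highest fused score."""
    fused = fuse(scores)
    return max(fused, key=fused.get)


# Made-up log-likelihoods of one test utterance under each language model.
scores = {
    "rhythm": {"EN": -10.0, "FR": -12.0},
    "f0": {"EN": -30.0, "FR": -5.0},
    "vowel": {"EN": -20.0, "FR": -18.0},
}
fused = fuse(scores)
best = identify(scores)
```

In this invented example the rhythm model alone prefers EN, but the fused score prefers FR, which shows how the lower-weighted models can still change the decision.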
7. Conclusion and perspectives
Conclusion
- First approach dedicated to automatic language identification that merges rhythmic and intonation features
- Rhythmic modeling based on a "pseudo-syllable" parsing
- Fundamental frequency described by high-order statistics
- 84 % correct identification rate with 5 languages (20 s utterances)
Perspectives
- Improve the rhythmic parsing
- Model the sequences of rhythmic units and fundamental frequency descriptors
- Study the impact of the nature of the corpus (read/spontaneous and studio/telephone recording)
- Merge this approach with phonetic and phonotactic modeling
8. Complementary experiments