Merging Segmental, Rhythmic and Fundamental Frequency Features for Automatic Language Identification

Jean-Luc Rouas¹, Jérôme Farinas¹ & François Pellegrino²

¹ Institut de Recherche en Informatique de Toulouse, UMR 5505 CNRS, Université Paul Sabatier, Toulouse, France
² Laboratoire Dynamique du Langage, UMR 5596 CNRS, Université Lumière Lyon 2, Lyon, France

This research is supported by the Région Rhône-Alpes and the French Ministère de la Recherche.

Overview
1. Introduction
2. Motivations
3. Rhythm unit extraction & modeling
4. Fundamental frequency extraction & modeling
5. Vowel system modeling
6. Language identification: experiments
7. Conclusion and perspectives

1. Introduction

- Standard approach to language identification:
  - Phonotactic modeling
  - Acoustic-phonetic modeling as a pre-processing step
- Alternative features are crucial:
  - Phonological features (structure of the vowel system, etc.)
  - Prosodic features (intonation, rhythm, stress, etc.)
  - High-level cues (lexicon, etc.)
- Importance of prosody and rhythm:
  - One of the most salient features for language identification by humans
  - Difficult to define
  - Even more difficult to model!

2. Motivations
2.1. Relevance of rhythm

- What is rhythm?
  - A pattern periodically repeated: syllable, stress or mora
  - Alternative theory (Dauer, 1983)
- Is rhythm important?
  - Major role in early language acquisition (e.g. Cutler & Mehler, 1993)
  - Structure related to the emergence of language (Frame-Content theory) (MacNeilage & Davis, 2000)
  - Role in speech perception (numerous works)
- Neural network modeling of rhythm (Dominey & Ramus, 2000):
  - Recurrent network dedicated to temporal sequence processing
  - Results: 78% correct identification for the L1-L2 coherent pair (EN-JA), chance level for the L1-L2 incoherent pair (EN-DU)
  - But the inputs consist of hand-made C/V labelling

2. Motivations
2.2. Relevance of intonation

- Is intonation relevant for language discrimination?
  - Linguistic grouping between languages that use tone as a lexical marker and those that do not
  - Tone-driven language: Mandarin Chinese, where changes of F0 (tones) assigned to syllables distinguish lexical items
  - English uses stress at the level of the sentence
  - Two groups of languages with distinctive prosodic signatures
- The challenge:
  - Extract prosodic features in a fully unsupervised and language-independent way
  - Model these features and evaluate their relevance

3. Rhythm unit extraction
3.1. Speech segmentation and vowel detection

[Figure: waveform and spectrogram of an utterance, with segments labelled Vowel, Non-Vowel and Pause]

- Speech segmentation: statistical segmentation (André-Obrecht, 1988)
- Speech activity detection
- Vowel detection (Pellegrino & Obrecht, 2000)

3. Rhythm unit extraction
3.2. Rhythm units

- Syllable: a good candidate as the rhythm unit
  - The syllable seems to be crucial in speech perception (Mehler et al., 1981; Content et al., 2001)
- But:
  - Syllable parsing seems to be a tricky, language-specific mechanism
  - No automatic language-independent algorithm can be derived (yet)
- A workaround: the "pseudo-syllable"
  - Derived from the most frequent syllable structure in the world: CV
  - Uses the vowel segments as milestones
  - The speech signal is parsed into patterns matching the structure CⁿV (n integer, possibly 0), as sketched below
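
A minimal sketch of the CⁿV parsing, assuming an upstream segmenter that labels each segment 'C' (consonant), 'V' (vowel) or '#' (pause). The Segment class, the label set and the decision to discard trailing consonants at pauses are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of pseudo-syllable (CnV) parsing.
# Assumes an upstream segmenter labelled each segment 'C', 'V' or '#'.

from dataclasses import dataclass

@dataclass
class Segment:
    label: str    # 'C' (consonant), 'V' (vowel) or '#' (pause)
    start: float  # seconds
    end: float

def parse_pseudo_syllables(segments):
    """Group segments into CnV patterns: any run of consonants followed
    by one vowel closes a pseudo-syllable (n may be 0). A pause flushes
    the current consonant buffer (assumption)."""
    syllables, buffer = [], []
    for seg in segments:
        if seg.label == 'C':
            buffer.append(seg)
        elif seg.label == 'V':
            syllables.append(buffer + [seg])  # CnV unit complete
            buffer = []
        else:  # pause: discard trailing consonants
            buffer = []
    return syllables

# Example: C C V C V -> CCV, CV
segs = [Segment('C', 0.00, 0.05), Segment('C', 0.05, 0.12),
        Segment('V', 0.12, 0.20), Segment('C', 0.20, 0.27),
        Segment('V', 0.27, 0.40)]
print([''.join(s.label for s in ps) for ps in parse_pseudo_syllables(segs)])
# ['CCV', 'CV']
```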

3. Rhythm unit extraction
3.2. Pseudo-syllable modeling

[Figure: waveform parsed into 5 pseudo-syllables: CCV, CCV, CV, CCCV, CV]

Features computed on each pseudo-syllable (see the sketch below):
- Rhythm: Duration C, Duration V, Complexity C
- Intonation: Skewness(F0), Kurtosis(F0)
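
Continuing the previous sketch, a hypothetical extraction of the three rhythm descriptors from one pseudo-syllable. Reading Complexity C as the consonant count (the n in CⁿV) is an assumption based on the structure, not a definition taken from the slides.

```python
# Continues the previous sketch (reuses its Segment class).
# Hypothetical computation of the per-pseudo-syllable rhythm features.

def rhythm_features(pseudo_syllable):
    """Return (Duration C, Duration V, Complexity C) for one CnV unit."""
    consonants = [s for s in pseudo_syllable if s.label == 'C']
    vowel = pseudo_syllable[-1]            # the vowel closes a CnV unit
    duration_c = sum(s.end - s.start for s in consonants)
    duration_v = vowel.end - vowel.start
    complexity_c = len(consonants)         # assumed: n in CnV
    return duration_c, duration_v, complexity_c
```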

4. Fundamental frequency modeling

- Fundamental frequency extraction:
  - "MESSIGNAIX" toolbox: combination of three methods (AMDF, spectral comb, autocorrelation)
  - Spline interpolation of the F0 curve yields values even on unvoiced segments
- Fundamental frequency modeling:
  - Statistics computed on each pseudo-syllable: skewness & kurtosis of the F0 distribution (see the sketch below)
  - For each language, a Gaussian mixture model is trained using the EM algorithm
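
A sketch of the F0 descriptors, assuming the pitch tracker outputs one F0 value per frame with 0 on unvoiced frames. The MESSIGNAIX extraction itself is not reproduced; SciPy's spline and moment routines stand in for whatever the authors used.

```python
# Hypothetical F0 post-processing: spline-interpolate over unvoiced
# frames, then take skewness and kurtosis per pseudo-syllable.

import numpy as np
from scipy.interpolate import CubicSpline
from scipy.stats import skew, kurtosis

def interpolate_f0(times, f0):
    """Fill unvoiced frames (f0 == 0) with a cubic spline through the
    voiced frames, so every frame carries an F0 value."""
    voiced = f0 > 0
    spline = CubicSpline(times[voiced], f0[voiced])
    return spline(times)

def f0_features(times, f0, start, end):
    """Skewness and kurtosis of the interpolated F0 distribution
    inside one pseudo-syllable spanning [start, end) seconds."""
    f0_full = interpolate_f0(times, f0)
    values = f0_full[(times >= start) & (times < end)]
    return skew(values), kurtosis(values)
```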

5. Vowel system modeling

- Each vowel segment detected by the vowel detection algorithm is represented by:
  - 8 Mel-frequency cepstral coefficients (MFCCs)
  - 8 delta MFCCs
  - Energy and delta energy
  - Duration of the segment
- Cepstral subtraction is applied to remove the channel effect and normalize across speakers
- For each language, a Gaussian mixture model is trained using the EM algorithm (a sketch follows)
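
A sketch of per-language GMM training and maximum-likelihood identification, here with scikit-learn. The mixture size (16 components) and diagonal covariances are assumptions; the slides do not specify the GMM configuration.

```python
# Hypothetical per-language GMM modelling with EM via scikit-learn.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_language_models(features_by_language, n_components=16):
    """Fit one GMM per language on its training vectors (e.g. the
    18-dimensional vowel vectors described above)."""
    models = {}
    for lang, X in features_by_language.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type='diag', max_iter=200)
        models[lang] = gmm.fit(np.asarray(X))
    return models

def identify(models, X_utterance):
    """Pick the language whose GMM gives the utterance's frames the
    highest total log-likelihood."""
    scores = {lang: gmm.score_samples(X_utterance).sum()
              for lang, gmm in models.items()}
    return max(scores, key=scores.get)
```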

6. Experiments

- Corpus: MULTEXT
  - 5 European languages (EN, FR, GE, IT, SP)
  - 50 different speakers (male and female)
  - Read utterances from EUROM1
  - Limitation: the same texts are produced by 3.75 speakers on average (possible partial text dependency of the models)
- Identification task:
  - Test utterances of 20 s duration
  - Very limited number of speakers, hence cross-validation: 9 speakers for training and 1 for testing, iterated over each speaker of the corpus (see the sketch below)
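
A sketch of the leave-one-speaker-out protocol described above; train_fn and test_fn are hypothetical hooks standing in for model training and utterance scoring.

```python
# Hypothetical leave-one-speaker-out loop: train on 9 speakers,
# test on the held-out one, rotate over all speakers.

def cross_validate(utterances, speakers, labels, train_fn, test_fn):
    """Return the average correct identification rate."""
    hits, total = 0, 0
    for held_out in sorted(set(speakers)):
        train_idx = [i for i, s in enumerate(speakers) if s != held_out]
        test_idx = [i for i, s in enumerate(speakers) if s == held_out]
        models = train_fn([utterances[i] for i in train_idx],
                          [labels[i] for i in train_idx])
        for i in test_idx:
            hits += (test_fn(models, utterances[i]) == labels[i])
            total += 1
    return hits / total
```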

6. Experiments
6.1. Rhythm modeling

[Confusion matrix (models × test items) for 20 s test utterances]

Average correct identification rate: 79%

6. Experiments
6.2. F0 modeling

[Confusion matrix (models × test items) for 20 s test utterances]

Average correct identification rate: 53%

6. Experiments
6.3. Vowel system modeling

[Confusion matrix (models × test items) for 20 s test utterances]

Average correct identification rate: 70%

6. Experiments
6.4. Merging

- Simple weighted addition of the log-likelihoods from the three models (rhythm, F0 & vowel systems), as sketched below
- Weights (set experimentally):
  - Rhythm model: 0.8
  - F0 model: 0.1
  - Vowel system model: 0.1

[Confusion matrix (models × test items) for 20 s test utterances]

Average correct identification rate: 84%
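
A sketch of the fusion rule: a weighted sum of the three models' per-language log-likelihoods, with the weights quoted above. The dictionary layout and the example scores are illustrative.

```python
# Hypothetical fusion of the three models' log-likelihood scores.

WEIGHTS = {'rhythm': 0.8, 'f0': 0.1, 'vowel': 0.1}

def fuse_and_decide(loglik):
    """loglik[model][language] -> weighted sum per language; return
    the language with the highest fused log-likelihood."""
    languages = loglik['rhythm'].keys()
    fused = {lang: sum(WEIGHTS[m] * loglik[m][lang] for m in WEIGHTS)
             for lang in languages}
    return max(fused, key=fused.get)

# Example with made-up log-likelihoods for two languages:
scores = {'rhythm': {'EN': -120.0, 'FR': -135.0},
          'f0':     {'EN': -60.0,  'FR': -55.0},
          'vowel':  {'EN': -200.0, 'FR': -190.0}}
print(fuse_and_decide(scores))  # 'EN'
```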

7. Conclusion and perspectives

- Conclusion:
  - A first approach to automatic language identification that merges rhythmic and intonation features
  - Rhythmic modeling based on a "pseudo-syllable" parsing
  - Fundamental frequency described by high-order statistics
  - 84% correct identification rate over 5 languages (20 s utterances)
- Perspectives:
  - Improve the rhythmic parsing
  - Model the sequences of rhythmic units and fundamental frequency descriptors
  - Study the impact of the nature of the corpus (read vs. spontaneous speech, studio vs. telephone recordings)
  - Merge this approach with phonetic and phonotactic modeling

8. Complementary experiments