Automatic Dialect/Accent Recognition Fadi Biadsy, April 12th, 2010


PhD Proposal – Fadi Biadsy

Outline
- Problem
- Motivation
- Corpora
- Framework for Language Recognition
- Experiments in Dialect Recognition: Phonotactic Modeling, Prosodic Modeling, Acoustic Modeling, Discriminative Phonotactics
- Contributions
- Future Work
- Research Plan

Problem: Dialect Recognition
Given a speech segment in a predetermined language, decide which of its dialects {D1, D2, …, DN} it was spoken in.
- There is a great deal of work on language recognition; dialect and accent recognition have more recently begun to receive attention.
- Dialect recognition is a more difficult problem than language recognition.

Motivation: Why Study Dialect Recognition?
- Discover differences between dialects
- Improve Automatic Speech Recognition (ASR) via model adaptation: pronunciation, acoustic, morphological, and language models
- Infer a speaker's regional origin, useful for:
  - Speech-to-speech translation
  - Annotations for broadcast news monitoring
  - Spoken dialogue systems: adapting TTS, charismatic speech
  - Call centers, where it is crucial in emergency situations

Motivation: Cues that May Distinguish Dialects/Accents
- Phonetic cues:
  - Differences in phonemic inventory
  - Phonemic differences
  - Allophonic differences (context-dependent phones)
- Phonotactics: rules/distributions that govern phonemes and their sequences in a dialect (Al-Tamimi & Ferragne, 2005)

Example, "She will meet him":
  MSA: /s/ /a/ /t/ /u/ /q/ /A/ /b/ /i/ /l/ /u/ /h/ /u/
  Egy: /H/ /a/ /t/ /?/ /a/ /b/ /l/ /u/
  Lev: /r/ /a/ /H/ /t/ /g/ /A/ /b/ /l/ /u/
The example shows differences in morphology as well as in phonetic inventory and vowel usage.

Motivation: Cues that May Distinguish Dialects/Accents
- Prosodic differences: intonational patterns, timing and rhythm
- Spectral distribution (acoustic frame-based features)
- Morphological, lexical, and syntactic differences


Case Study: Arabic Dialects
- Iraqi Arabic: Baghdadi, Northern, and Southern
- Gulf Arabic: Omani, UAE, and Saudi Arabic
- Levantine Arabic: Jordanian, Lebanese, Palestinian, and Syrian Arabic
- Egyptian Arabic: primarily Cairene Arabic

Corpora – Four Dialects – DATA I
Recordings of spontaneous telephone conversations produced by native speakers of the four dialects, available from the LDC.


Probabilistic Framework for Language ID
Task: given an utterance, hypothesize the most likely dialect. Hazen and Zue's (1993) contribution: factor the model into an acoustic model, a prosodic model, a phonotactic model, and a prior:
  D* = argmax_D P_acoustic(O | D) · P_prosodic(O | D) · P_phonotactic(O | D) · P(D)
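Under the independence assumption behind that factorization, combination reduces to summing per-model log-likelihoods and adding a log-prior. A minimal sketch; the dialect names and all scores below are illustrative, not results from the thesis:

```python
import math

def identify_dialect(log_scores, log_priors):
    """Combine per-dialect log-likelihoods from independent models
    (acoustic, prosodic, phonotactic) with a log-prior and return
    the highest-scoring dialect."""
    best, best_total = None, -math.inf
    for dialect, scores in log_scores.items():
        total = sum(scores) + log_priors[dialect]
        if total > best_total:
            best, best_total = dialect, total
    return best

# Made-up log-likelihoods (acoustic, prosodic, phonotactic) for two dialects:
scores = {
    "Egyptian":  [-110.2, -40.1, -75.3],
    "Levantine": [-115.8, -38.7, -80.0],
}
priors = {"Egyptian": math.log(0.5), "Levantine": math.log(0.5)}
print(identify_dialect(scores, priors))  # → Egyptian
```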


Phonotactic Approach
Hypothesis: dialects differ in their phonotactic distribution.
Early work: Phone Recognition followed by Language Modeling (PRLM) (Zissman, 1996).
Training: for each dialect Di, run a phone recognizer over the training data, e.g.:
  dh uw z hh ih n d uw ey ... f uw v ow z l iy g s m k dh ... h iy jh sh p eh ae ey p sh ...
then train an n-gram model λi on the resulting phone sequences.

Phonotactic Approach – Identification
Given a test utterance, run the phone recognizer:
  uw hh ih n d uw w ay ey uh jh y eh k oh v hh ...
then score the resulting phone sequence with each dialect's n-gram model λi and pick the highest-scoring dialect.
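The PRLM training and identification steps can be sketched end to end: train a smoothed phone n-gram model per dialect from decoded phone strings, then score a test utterance's phone string under each model. Bigrams with add-alpha smoothing are simplifications (the actual system uses trigram models), and the toy phone strings are illustrative:

```python
import math

def train_bigram(phone_strings, alpha=1.0):
    """Add-alpha smoothed phone-bigram model trained on one dialect's
    whitespace-separated phone strings; returns a log-prob scorer."""
    counts, vocab = {}, set()
    for s in phone_strings:
        phones = ["<s>"] + s.split()
        vocab.update(phones)
        for a, b in zip(phones, phones[1:]):
            row = counts.setdefault(a, {})
            row[b] = row.get(b, 0) + 1
    v_size = len(vocab) + 1  # +1 leaves probability mass for unseen phones

    def logprob(s):
        phones = ["<s>"] + s.split()
        lp = 0.0
        for a, b in zip(phones, phones[1:]):
            row = counts.get(a, {})
            lp += math.log((row.get(b, 0) + alpha)
                           / (sum(row.values()) + alpha * v_size))
        return lp

    return logprob

# Toy per-dialect training data (phone strings are illustrative):
models = {
    "D1": train_bigram(["dh uw z hh ih n", "dh uw z ih n"]),
    "D2": train_bigram(["f uw v ow z", "f uw v ow"]),
}
test = "dh uw z hh"
print(max(models, key=lambda d: models[d](test)))  # → D1
```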

Applying Parallel PRLM (Zissman, 1996)
Use multiple (k) phone recognizers trained on multiple languages; for each recognizer, train one n-gram phonotactic model per language of interest.
Experiments on our data: 9 phone recognizers, trigram models.
[Architecture: acoustic preprocessing feeds the Arabic, English, Japanese, ... phone recognizers; each phone stream is scored by Iraqi, Gulf, Egyptian, Levantine, and MSA LMs; the resulting perplexities go to a back-end classifier that outputs the hypothesized dialect.]

Our Parallel PRLM Results – 10-Fold Cross Validation
[Figure: accuracy as a function of test-utterance duration in seconds.]


Prosodic Differences Across Dialects
Hypothesis: dialects differ in their prosodic structure. What are these differences?
Global features:
- Pitch: range and register, peak alignment, standard deviation
- Intensity
- Rhythmic features: ∆C, ∆V, %V (using pseudo-syllables)
- Speaking rate
- Vowel duration statistics
Compare dialects using descriptive statistics.
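The rhythmic features above can be computed directly from consonantal/vocalic interval durations: %V is the proportion of vocalic time, and ∆C/∆V are the standard deviations of consonantal/vocalic interval durations. A small sketch, assuming intervals are already segmented and labeled (the durations below are made up):

```python
import statistics

def rhythm_metrics(intervals):
    """Rhythm metrics from a list of (label, duration_sec) intervals,
    where label is 'C' (consonantal) or 'V' (vocalic).
    Returns (%V, deltaC, deltaV)."""
    c = [d for lab, d in intervals if lab == "C"]
    v = [d for lab, d in intervals if lab == "V"]
    pct_v = sum(v) / (sum(c) + sum(v))   # proportion of vocalic time
    return pct_v, statistics.pstdev(c), statistics.pstdev(v)

ivs = [("C", 0.08), ("V", 0.12), ("C", 0.10),
       ("V", 0.06), ("C", 0.06), ("V", 0.12)]
pct_v, d_c, d_v = rhythm_metrics(ivs)
print(round(pct_v, 3))  # → 0.556
```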

New Approach: Prosodic Modeling
- Pseudo-syllabification
- Extract sequential local features at the level of pseudo-syllables
- Learn a sequential model for each prosodic sequence type, using an ergodic continuous HMM for each dialect

New Approach for Prosodic Modeling – Dialect Recognition System
The per-dialect HMM scores, together with the global features, are fed to a logistic back-end classifier.
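The ergodic HMMs above score each prosodic sequence by its likelihood. A minimal sketch of that scoring step via the forward algorithm, using discrete emissions for brevity (the actual system uses continuous-emission ergodic HMMs; all parameters below are illustrative):

```python
import numpy as np

def log_forward(obs, log_pi, log_A, log_B):
    """Log-likelihood of a discrete observation sequence under an HMM,
    via the forward algorithm in log space. log_pi: (S,) initial state
    log-probs; log_A: (S, S) transition log-probs; log_B: (S, K)
    emission log-probs over K symbols."""
    alpha = log_pi + log_B[:, obs[0]]
    for o in obs[1:]:
        # logsumexp over previous states, then emit the next symbol
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[:, o]
    return np.logaddexp.reduce(alpha)

# Illustrative 2-state ergodic model; score a short symbol sequence:
log_pi = np.log([0.6, 0.4])
log_A = np.log([[0.7, 0.3], [0.4, 0.6]])
log_B = np.log([[0.9, 0.1], [0.2, 0.8]])
ll = log_forward([0, 1, 0], log_pi, log_A, log_B)
```

At recognition time, each dialect's HMM scores the utterance's prosodic sequence, and those scores feed the logistic back-end classifier.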

Prosodic Modeling – Results
2-minute test utterances; cross-validation on Data I. Significance: * p < 0.05; *** p < 0.001.
[Results table not preserved in the transcript.]


Baseline: Acoustic Modeling
Hypothesis: dialects differ in their spectral distribution.
The Gaussian Mixture Model – Universal Background Model (GMM-UBM) is a widely used approach for language and speaker recognition (Reynolds et al., 2000). Features: 40-dimensional PLP frames.
I. Train the GMM-UBM using EM
II. Use Maximum A-Posteriori (MAP) adaptation to create a GMM for each dialect
III. During recognition, score the test utterance with each dialect's GMM
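Steps I and II can be sketched with a numpy-only diagonal-covariance GMM, adapting only the means with a relevance factor as in Reynolds et al. (2000). The two-component model and synthetic 2-D "frames" stand in for a real UBM over 40-D PLP features; all numbers are illustrative:

```python
import numpy as np

def gmm_logpdf(x, means, variances, weights):
    """Per-frame log-density under a diagonal-covariance GMM.
    x: (T, D); means, variances: (M, D); weights: (M,)."""
    d = x[:, None, :] - means[None, :, :]
    log_comp = (-0.5 * (d ** 2 / variances).sum(-1)
                - 0.5 * np.log(2 * np.pi * variances).sum(-1)
                + np.log(weights))
    return np.logaddexp.reduce(log_comp, axis=1)

def map_adapt_means(x, means, variances, weights, r=16.0):
    """Relevance-MAP adaptation of the UBM means to data x: each new
    mean is a soft-count-weighted blend of the data mean and the UBM
    mean, with relevance factor r."""
    d = x[:, None, :] - means[None, :, :]
    log_comp = (-0.5 * (d ** 2 / variances).sum(-1)
                - 0.5 * np.log(2 * np.pi * variances).sum(-1)
                + np.log(weights))
    post = np.exp(log_comp - np.logaddexp.reduce(log_comp, 1, keepdims=True))
    n = post.sum(0)                                   # soft counts, (M,)
    ex = post.T @ x / np.maximum(n, 1e-10)[:, None]   # data means, (M, D)
    a = (n / (n + r))[:, None]
    return a * ex + (1 - a) * means

# Synthetic "dialect" frames pull one UBM component toward their mean:
rng = np.random.default_rng(0)
ubm_means = np.array([[0.0, 0.0], [3.0, 3.0]])
ubm_vars = np.ones((2, 2))
ubm_w = np.array([0.5, 0.5])
frames = rng.normal([2.5, 2.5], 0.3, size=(200, 2))
adapted = map_adapt_means(frames, ubm_means, ubm_vars, ubm_w)
# Step III would score an utterance with each dialect's adapted GMM,
# e.g. via its average log-likelihood relative to the UBM.
```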

Corpora – Four Dialects – DATA II
Test sets are balanced by gender and channel: 25% female mobile, 25% female landline, 25% male mobile, 25% male landline.
Egyptian: training on CallHome Egyptian, testing on CallFriend Egyptian.

NIST LRE Evaluation Framework
- Detection instead of identification: given a trial and a target dialect, is the utterance from the target dialect? Accept/reject, plus a likelihood.
- DET curves: false-alarm probability against miss probability
- Results are reported across pairs of dialects; all dialects are then pooled to produce one DET curve
- Trials are 30s, 10s, and 3s long
- Metric: Equal Error Rate (EER)
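The EER is the DET-curve operating point where the false-alarm rate equals the miss rate, and it can be computed directly from lists of target and non-target trial scores. A sketch with illustrative scores:

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """EER from detection scores: sweep the threshold down the sorted
    score list and find where false-alarm and miss rates cross."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    order = np.argsort(scores)[::-1]   # accept everything above threshold
    labels = labels[order]
    # After accepting the top k trials: false alarms = nontargets accepted,
    # misses = targets still rejected.
    fa = np.cumsum(labels == 0) / max((labels == 0).sum(), 1)
    miss = 1.0 - np.cumsum(labels == 1) / max((labels == 1).sum(), 1)
    i = np.argmin(np.abs(fa - miss))
    return (fa[i] + miss[i]) / 2

tgt = np.array([2.0, 1.5, 1.0, 0.4])
non = np.array([0.5, 0.2, -0.1, -0.3])
print(equal_error_rate(tgt, non))  # → 0.25
```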

Results (DET curves of PRLM and GMM-UBM) – 30s Cuts (Data II)

  Approach    EER (%)
  PRLM        17.7
  GMM-UBM     15.3

Our GMM-UBM Improved with fMLLR
Motivation: VTLN and channel compensation improve GMM-UBM for speaker and language recognition.
Our approach: feature-space Maximum Likelihood Linear Regression (fMLLR) adaptation.
Idea: use a phone recognizer to obtain the phone sequence, then transform the features "towards" the corresponding acoustic model GMMs (one matrix per speaker). Intuition: this produces more compact models.
Then proceed exactly as in the GMM-UBM approach, but on the transformed acoustic vectors.

Results – GMM-UBM-fMLLR – 30s Utterances

  Approach         EER (%)
  PRLM             17.7
  GMM-UBM          15.3
  GMM-UBM-fMLLR    11.0


Discriminative Phonotactics
Hypothesis: dialects differ in their allophones (context-dependent phones) and their phonotactics.
Idea: discriminate dialects first at the level of context-dependent (CD) phones, and then at the level of phonotactics:
I. Obtain CD-phones
II. Extract acoustic features for each CD-phone
III. Discriminate CD-phones across dialects
IV. Augment the CD-phone sequences and extract phonotactic features
V. Train a discriminative classifier to distinguish dialects

Obtaining CD-Phones
Run the Attila context-dependent phone recognizer (trained on MSA) to obtain a CD-phone sequence, e.g.:
  ... [Back vowel]-r-[Central Vowel] [Plosive]-A-[Voiced Consonant] [Central Vowel]-b-[High Vowel] ...
Note that these are context-dependent units, not just /r/ /A/ /b/. Do this for all training data of all dialects.

CD-Phone Universal Background Acoustic Model
Each CD-phone type (e.g., [Back vowel]-r-[Central Vowel]) has its own acoustic model.

Obtaining CD-Phones + Frame Alignment
[Figure: the CD-phone recognizer produces CD-phones (e.g., [vowel]-b-[glide], [front-vowel]-r-[sonorant]) together with their CD acoustic models; each phone instance's frames are aligned to the model's states, e.g. the acoustic frames for the second state.]

MAP Adaptation of each CD-Phone Instance
MAP-adapt the universal background model GMMs of the CD-phone type (e.g., [Back Vowel]-r-[Central Vowel]) to the corresponding aligned frames.

One supervector for each CD-phone instance: stack all the adapted Gaussian means and the phone duration,
  V_k = [μ_1, μ_2, …, μ_N, duration]
i.e., map a variable-length sequence of features to a fixed-size vector.
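The supervector construction can be sketched as follows: MAP-adapt the component means of the CD-phone type's UBM to one instance's frames (the same relevance-MAP step used for the GMM-UBM baseline, restated here so the block is self-contained), then stack the means and the duration. The UBM size, feature dimension, and frames are illustrative:

```python
import numpy as np

def supervector(frames, ubm_means, ubm_vars, ubm_weights, r=16.0):
    """Fixed-size vector for one variable-length CD-phone instance:
    MAP-adapt the UBM component means to the instance's frames, then
    stack all adapted means plus the phone duration (frame count)."""
    d = frames[:, None, :] - ubm_means[None, :, :]
    log_comp = (-0.5 * (d ** 2 / ubm_vars).sum(-1)
                - 0.5 * np.log(2 * np.pi * ubm_vars).sum(-1)
                + np.log(ubm_weights))
    post = np.exp(log_comp - np.logaddexp.reduce(log_comp, 1, keepdims=True))
    n = post.sum(0)                                        # soft counts
    ex = post.T @ frames / np.maximum(n, 1e-10)[:, None]   # data means
    a = (n / (n + r))[:, None]
    adapted = a * ex + (1 - a) * ubm_means                 # (M, D)
    return np.concatenate([adapted.ravel(), [len(frames)]])

# Two instances of different lengths map to the same dimensionality:
m = np.zeros((4, 3)); v = np.ones((4, 3)); w = np.full(4, 0.25)
sv1 = supervector(np.random.default_rng(1).normal(size=(7, 3)), m, v, w)
sv2 = supervector(np.random.default_rng(2).normal(size=(30, 3)), m, v, w)
print(sv1.shape, sv2.shape)  # → (13,) (13,)
```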

SVM Classifier for each CD-Phone Type for each Pair of Dialects
For each CD-phone type (e.g., [Back Vowel]-r-[Central Vowel]) and each pair of dialects, train an SVM on the supervectors of that type's instances from all speakers of dialect 1 versus all speakers of dialect 2.

Discriminative Phonotactics – CD-Phone Classification
[Pipeline figure: the CD-phone recognizer produces CD-phones (e.g., [vowel]-b-[glide], [front-vowel]-r-[sonorant]); frames are aligned to the CD acoustic models; the GMMs are MAP-adapted to each instance; the resulting supervectors (Super Vector 1 ... Super Vector N) are fed to the per-type SVM classifiers, which output dialect labels (e.g., Egy).]

CD-Phone Classifier Results
Split the training data into two halves; train 227 binary classifiers (one for each CD-phone type) for each pair of dialects on the first half and test on the second.

Extraction of Linguistic Knowledge
Use the results of these classifiers to show which phones, in which contexts, distinguish dialects the most (chance is 50%).
[Figure: per-CD-phone classification accuracies for the Levantine/Iraqi dialect pair.]

Labeling Phone Sequences with Dialect Hypotheses
Run the CD-phone recognizer:
  ... [Back vowel]-r-[Central Vowel] [Plosive]-A-[Voiced Consonant] [Central Vowel]-b-[High Vowel] ...
then run the corresponding SVM classifier to get the dialect of each CD-phone:
  [Back vowel]-r-[Central Vowel] → Egyptian
  [Plosive]-A-[Voiced Consonant] → Egyptian
  [Central Vowel]-b-[High Vowel] → Levantine

Textual Feature Extraction for Discriminative Phonotactics
Extract textual features from the dialect-annotated CD-phone sequences for each pair of dialects, normalize each feature vector by its norm, and train a logistic regression classifier with an L2 regularizer.
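This back-end can be sketched with stand-in features: the slide's exact textual feature set is not preserved in this transcript, so the sketch uses norm-normalized counts of dialect-tagged CD-phone unigrams, with logistic regression and an L2 penalty fit by gradient descent. All tokens and labels are illustrative:

```python
import numpy as np

def featurize(seq, vocab):
    """Bag-of-events vector over dialect-tagged CD-phone tokens,
    normalized by its Euclidean norm."""
    v = np.zeros(len(vocab))
    for tok in seq:
        if tok in vocab:
            v[vocab[tok]] += 1
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def train_logreg(X, y, lam=0.01, lr=0.5, steps=1000):
    """Binary logistic regression with an L2 regularizer, fit by
    full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))
        w -= lr * (X.T @ (p - y) / len(y) + lam * w)
    return w

# Toy data: dialect-tagged CD-phone tokens, e.g. "r:Egy" vs "r:Lev"
vocab = {"r:Egy": 0, "r:Lev": 1, "b:Egy": 2, "b:Lev": 3}
X = np.array([featurize(s, vocab) for s in
              [["r:Egy", "b:Egy", "r:Egy"], ["r:Egy", "b:Lev"],
               ["r:Lev", "b:Lev"], ["b:Lev", "r:Lev", "r:Lev"]]])
y = np.array([1, 1, 0, 0])  # 1 = Egyptian, 0 = Levantine
w = train_logreg(X, y)
pred = (X @ w > 0).astype(int)
print(pred)  # → [1 1 0 0]
```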

Experiments – Training Two Models
- Split the training data into two halves
- Train the SVM CD-phone classifiers on the first half
- Run these SVM classifiers to annotate the CD-phones of the second half
- Train the logistic classifier on the annotated sequences

Discriminative Phonotactics – Dialect Recognition
[Pipeline figure: CD-phone recognizer → frame alignment with the CD acoustic models → MAP adaptation of the GMMs → supervectors → per-type SVM classifiers label each CD-phone with a dialect (e.g., [vowel]-b-[glide] → Egy) → the logistic classifier maps the labeled sequence to a final dialect hypothesis (e.g., Egyptian).]

Results – Discriminative Phonotactics

  Approach              EER (%)
  PRLM                  17.7
  GMM-UBM               15.3
  GMM-UBM-fMLLR         11.0
  Disc. Phonotactics     6.0

Results per Dialect

  Dialect     GMM-UBM-fMLLR   Disc. Phonotactics
  Egyptian     4.4%            1.3%
  Iraqi       11.1%            6.6%
  Levantine   12.8%            6.9%
  Gulf        15.6%            7.8%

Comparison to the State of the Art
State-of-the-art system (Torres-Carrasquillo et al., 2008):
- Two English accents: EER 10.6%
- Three Arabic dialects: EER 7%
- Four Chinese dialects: EER 7%
NIST Language Recognition Evaluation 2005 (Matějka et al., 2006), fusing multiple approaches:
- 7 languages + 2 accents: EER 3.1%

Research Plan

Acknowledgments
Thank You!

Prosodic Differences Across Dialects
F0 differences:
- Levantine and Iraqi speakers have a higher pitch range and a more expanded pitch register than Egyptian and Gulf speakers
- Iraqi and Gulf intonation show more variation than Egyptian and Levantine
- Pitch peaks within pseudo-syllables in Egyptian and Iraqi are shifted significantly later than those in Gulf and Levantine
Durational and rhythmic differences:
- Gulf and Iraqi dialects tend to have more complex syllabic structure
- Egyptian tends to have more vocalic intervals, with more variation than the other dialects, which may reflect vowel reduction and quantity contrasts

Frame Alignment
For each CD-phone sequence, get the frame alignment with the acoustic model's states.