Cross-Dialectal Data Transferring for Gaussian Mixture Model Training in Arabic Speech Recognition Po-Sen Huang Mark Hasegawa-Johnson University of Illinois.


Cross-Dialectal Data Transferring for Gaussian Mixture Model Training in Arabic Speech Recognition. Po-Sen Huang and Mark Hasegawa-Johnson, University of Illinois, with the advice and support of Eiman Mustafawi, Rehab Duwairi, Mohamed Elmahdy, Elabbas Benmamoun, and Roxana Girju. Supported by QNRF NPRP.

Outline: Background; Cross-dialectal Transfer Learning; Cross-Dialectal Training Objective; Corpora; Experiments; Conclusion.

Background Dialectal variation is a difficult problem in automatic speech recognition (ASR). It affects the spoken language at every level, from the acoustic realization of phones to syntax, vocabulary, and morphology. In languages such as Arabic and Chinese, dialects differ to the extent that they are not mutually intelligible. Moreover, these dialects are primarily spoken rather than written: written dialectal material is limited and does not generally follow a writing standard. Hence, the limited amount of training data and the unavailability of written transcriptions are serious bottlenecks in dialectal ASR system development.

Cross-dialectal Transfer Learning The Arabic language can be viewed as a family of related languages, with limited vocabulary overlap between dialects but high overlap among their phoneme inventories. The idea of cross-dialectal transfer learning is to transfer knowledge from one dialect to another, assuming that different dialects still share common structure.

Cross-Dialectal Training Objective
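The equation on this slide is not preserved in the transcript. A plausible form of the objective, assuming simple weighted pooling of in-dialect and cross-dialect data per phone (the weight λ and the notation are our assumptions, not the authors' exact formula):

```latex
\hat{\theta}_p \;=\; \arg\max_{\theta}
  \sum_{x \in \mathcal{D}_{\mathrm{Lev}}(p)} \log p(x \mid \theta)
  \;+\; \lambda \sum_{x \in \mathcal{D}_{\mathrm{MSA}}(p)} \log p(x \mid \theta)
```

Here D_Lev(p) and D_MSA(p) denote the Levantine and MSA feature vectors aligned to phone p, and λ controls how much MSA data is transferred into the GMM for phone p.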

MSA Corpus: West Point MSA West Point Modern Standard Arabic corpus There are 8,516 speech files, totaling 1.7 GB of speech data. Each speech file is recorded by one subject reciting one prompt from one of four prompt scripts. Approximately 7,200 files are from native speakers and 1,200 files are from nonnative speakers. There are 1,131 distinct Arabic words in total. All scripts were written in MSA and were fully diacritized.

Levantine Corpus: Babylon BBN/AUB DARPA Babylon Levantine Arabic Speech corpus Levantine Arabic is the Arabic dialect spoken in Lebanon, Jordan, Palestine, and Syria. It is primarily a spoken rather than a written language. About 1/3 of its word tokens are not in a typical MSA lexicon, and many words shared with MSA are pronounced differently. The BBN/AUB dataset consists of 164 speakers, 101 male and 63 female, recorded speaking spontaneous sentences in colloquial Levantine Arabic. The duration of recorded speech is 45 hours, distributed among 79,500 audio clips.

Experiment Setting We first use forced alignment to generate phone boundary information. Then, for each phone occurrence, 12-dimensional PLP features are computed with a 25 ms Hamming window and a 10 ms frame rate. Finally, the frame-level PLP features within each phone segment are summarized as fixed-length segmental features. For the classification task, we omit the glottal stop phone in both corpora, leaving 36 phone classes in the West Point Modern Standard Arabic Speech corpus and 38 phone classes in the BBN/AUB DARPA Babylon Levantine Arabic Speech corpus. For the Levantine corpus, we randomly split the data into a 60% training set, a 10% development set, and a 30% test set. The whole MSA corpus is available for transfer learning.
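The slides do not specify the segmental representation. A minimal sketch, assuming the common three-region recipe (mean of the first, middle, and last thirds of a segment's frames, plus log duration); the function name and the exact recipe are our assumptions:

```python
import numpy as np

def segmental_features(frames: np.ndarray) -> np.ndarray:
    """Summarize frame-level features (T x 12 PLP) for one phone segment.

    Assumed recipe: mean over the first, middle, and last thirds of the
    segment's frames, concatenated with log duration -> 12*3 + 1 = 37 dims.
    """
    thirds = np.array_split(frames, 3)                       # uneven splits are fine
    parts = [t.mean(axis=0) if len(t) else frames.mean(axis=0)  # guard short segments
             for t in thirds]
    return np.concatenate(parts + [np.array([np.log(len(frames))])])
```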

Experiments: Transfer MSA data to Levantine Arabic MSA is a formal language, so recorded MSA can be easily obtained from broadcast news or lectures. Dialectal Arabic speech is used only in informal situations and is therefore more difficult to obtain. We examine our proposed cross-dialectal GMM training technique as follows: knowledge from MSA data is used to reduce the error rate of a Levantine Arabic phone classifier, as sketched below.
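A minimal sketch of the pooled-data GMM phone classifier described here, using scikit-learn; the concatenation-based pooling, the class structure, and the mixture size are our assumptions rather than the authors' exact setup:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class CrossDialectalPhoneClassifier:
    """One GMM per phone, trained on Levantine data plus a fraction of MSA data."""

    def __init__(self, n_components: int = 8):
        self.n_components = n_components
        self.gmms = {}

    def fit(self, lev_data: dict, msa_data: dict, msa_fraction: float = 0.05):
        # lev_data / msa_data: phone label -> (N x d) array of segmental features.
        for phone, X_lev in lev_data.items():
            X = X_lev
            if phone in msa_data and msa_fraction > 0:
                n = int(msa_fraction * len(msa_data[phone]))
                if n > 0:
                    # Pool a fraction of the cross-dialect data into training.
                    X = np.vstack([X_lev, msa_data[phone][:n]])
            self.gmms[phone] = GaussianMixture(
                n_components=self.n_components,
                covariance_type="diag").fit(X)
        return self

    def predict(self, X: np.ndarray):
        # Maximum-likelihood decision over the per-phone GMMs.
        phones = list(self.gmms)
        scores = np.stack([self.gmms[p].score_samples(X) for p in phones])
        return [phones[i] for i in scores.argmax(axis=0)]
```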

Results: Transfer MSA data to Levantine Arabic Phone classification accuracy for systems trained on a fraction of the MSA data and a fraction of the Levantine data. Abscissa: percentage of the Levantine training set used (equivalent to 2.7 to 27 hours). Parameter, one curve per value: percentage of the MSA corpus used (equivalent to 0 to 1.4 hours). Ordinate: phone accuracy on Levantine test data.

Results: Transfer MSA data to Levantine Arabic Same data as the previous slide, replotted with the abscissa equal to the ratio between the amounts of Levantine and MSA training data. All systems peak at a data ratio of 20:1 (Levantine:MSA).

Discussion Low classification accuracy: force-aligned phone boundaries are not precise, so the features of a phone occurrence may also include features from adjacent phones. Classification accuracy depends on the ratio between the amounts of Levantine and MSA training data. When a proper amount of MSA data is transferred (MSA data ≈ 1/20 of the Levantine data), the transferred data enhance the phonetic coverage of the GMMs and yield higher accuracy. Transferring too much MSA data apparently results in mismatched phone acoustic models.
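In practice, the transfer amount can be tuned on the development set. A short sketch of such a sweep, reusing the hypothetical classifier and imports from the earlier sketch (lev_train, msa_train, X_dev, and y_dev are assumed inputs):

```python
# Sweep the MSA fraction and keep the value that maximizes dev-set accuracy.
best = max(
    (np.mean([p == y for p, y in zip(
        CrossDialectalPhoneClassifier().fit(lev_train, msa_train, f)
        .predict(X_dev), y_dev)]), f)
    for f in [0.0, 0.01, 0.02, 0.05, 0.1, 0.2])
print(f"best dev accuracy {best[0]:.3f} at MSA fraction {best[1]}")
```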

Conclusion We present a cross-dialectal GMM training scheme in which data from the West Point Modern Standard Arabic Speech corpus are used to reduce the error rate of a phone classifier applied to the Babylon Levantine Arabic corpus. The optimum ratio between in-dialect and cross-dialect data is about 20:1, and this ratio holds across a wide range of total training set sizes.

Future Work Improve forced alignment on the Levantine data with phonetic landmark detectors and improved pronunciation models. Apply other methods of cross-dialect transfer learning (e.g., regularized adaptive learning, feature space normalization, learned distance metrics). Learn improved pronunciation models and language models for Levantine Arabic. Replicate these methods across multiple dialects. Extend the current phone classification task to word recognition.