
1 Cross-Dialectal Data Transferring for Gaussian Mixture Model Training in Arabic Speech Recognition. Po-Sen Huang, Mark Hasegawa-Johnson, University of Illinois, with the advice and support of Eiman Mustafawi, Rehab Duwairi, Mohamed Elmahdy, Elabbas Benmamoun, and Roxana Girju. QNRF NPRP 09-410-1-069

2 Outline
Background
Cross-Dialectal Transfer Learning
Cross-Dialectal Training Objective
Corpora
Experiments
Conclusion

3 Background Dialectal variation is a difficult problem in automatic speech recognition (ASR). Spoken language is affected at every level, from the acoustic realization of phones to differences in syntax, vocabulary, and morphology. In languages such as Arabic and Chinese, many dialects differ to the extent that they are not mutually intelligible. Moreover, these dialects are spoken rather than written; that is, written dialectal material is limited and does not generally follow a writing standard. Hence, the limited amount of training data and the unavailability of written transcriptions are serious bottlenecks in dialectal ASR system development.

4 Cross-Dialectal Transfer Learning The Arabic language can be viewed as a family of related languages, with limited vocabulary overlap between dialects but a high degree of overlap among their phoneme inventories. The idea of cross-dialectal transfer learning is to carry knowledge from one dialect over to another, on the assumption that different dialects still have knowledge in common.

5 Cross-Dialectal Training Objective
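One plausible form of this objective can be sketched under the assumption that cross-dialect (MSA) frames are pooled into per-phone maximum-likelihood GMM training with a weight on the transferred data; the symbols below are introduced here for illustration and are not taken from the slides.

% Hedged sketch: per-phone pooled maximum-likelihood GMM training.
% L_q: Levantine (in-dialect) frames aligned to phone q
% M_q: MSA (cross-dialect) frames aligned to phone q
% lambda: weight on the transferred data (lambda = 1 is plain data pooling)
\hat{\theta}_q = \arg\max_{\theta_q}
  \Bigl[ \sum_{x \in \mathcal{L}_q} \log p(x \mid \theta_q)
       + \lambda \sum_{y \in \mathcal{M}_q} \log p(y \mid \theta_q) \Bigr],
\qquad
p(x \mid \theta_q) = \sum_{m=1}^{M} w_{qm}\, \mathcal{N}(x;\, \mu_{qm}, \Sigma_{qm})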


8 MSA Corpus: West Point MSA West Point Modern Standard Arabic corpus. There are 8,516 speech files, totaling 11.42 hours (1.7 GB) of speech data. Each speech file is recorded by one subject reciting one prompt from one of four prompt scripts. Approximately 7,200 files are from native speakers and 1,200 files are from nonnative speakers. There are 1,131 distinct Arabic words in total. All scripts were written in MSA and were diacritized.

9 Levantine Corpus: Babylon BBN/AUB DARPA Babylon Levantine Arabic Speech corpus. Levantine Arabic is the Arabic dialect spoken in Lebanon, Jordan, Palestine, and Syria. It is primarily a spoken language rather than a written one. About 1/3 of word tokens are not in a typical MSA lexicon, and many words shared with MSA are pronounced differently. The BBN/AUB dataset consists of 164 speakers: 101 males and 63 females. It is a set of spontaneous speech sentences recorded in colloquial Levantine Arabic. The duration of recorded speech is 45 hours, distributed among 79,500 audio clips.

10 Experiment Setting We first use forced alignment to generate phone boundary information. Then, for each phone occurrence, 12-dimensional PLP features are computed with a 25 ms Hamming window and a 10 ms frame rate. Finally, we use segmental features to represent the frame-level PLP features. For the classification task, we omit the glottal stop phone in both corpora. There are 36 phone classes in the West Point Modern Standard Arabic Speech corpus and 38 phone classes in the BBN/AUB DARPA Babylon Levantine Arabic Speech corpus. For the Levantine corpus, we randomly assign 60% of the data to the training set, 10% to the development set, and the remaining 30% to the test set. We use the whole set of MSA data for transfer learning.
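As an illustration of this setup, the sketch below assumes the segmental features are formed by averaging the frame-level PLP vectors within each force-aligned phone segment and appending a log-duration term (the slide does not specify the exact segmental representation), and performs a random 60/10/30 split over the Levantine phone segments; the function names and arrays are hypothetical.

import numpy as np

def segmental_feature(plp_frames):
    # plp_frames: (n_frames, 12) array of PLP vectors for one force-aligned phone segment.
    # Hedged assumption: represent the segment by its mean PLP vector plus log-duration.
    mean = plp_frames.mean(axis=0)
    return np.concatenate([mean, [np.log(len(plp_frames))]])

def split_60_10_30(n_segments, seed=0):
    # Random split of Levantine phone segments into train/dev/test (60% / 10% / 30%).
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_segments)
    n_train, n_dev = int(0.6 * n_segments), int(0.1 * n_segments)
    return order[:n_train], order[n_train:n_train + n_dev], order[n_train + n_dev:]

The MSA corpus is not split; the whole set is kept aside as the transfer pool.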

11 Experiments: Transfer MSA data to Levantine Arabic MSA is a formal language, so recorded MSA can easily be obtained from broadcast news or lectures. Dialectal Arabic speech is used only in informal situations and is therefore more difficult to obtain. We examine our proposed cross-dialectal GMM training technique as follows: knowledge from MSA data is used to reduce the error rate of a Levantine Arabic phone classifier.
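A minimal sketch of this experiment, under the assumption that the transfer is realized by pooling a randomly chosen fraction of the MSA segments into the per-phone GMM training data; scikit-learn's GaussianMixture is used here for illustration, and the component count, covariance type, and all function and variable names are choices made for this sketch, not the authors' exact configuration.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_phone_gmms(feats, labels, n_components=8, seed=0):
    # Fit one diagonal-covariance GMM per phone class on its pooled feature vectors.
    gmms = {}
    for phone in np.unique(labels):
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                              reg_covar=1e-3, random_state=seed)
        gmm.fit(feats[labels == phone])
        gmms[phone] = gmm
    return gmms

def classify(gmms, feats):
    # Label each segment with the phone whose GMM gives the highest log-likelihood.
    phones = list(gmms)
    scores = np.stack([gmms[p].score_samples(feats) for p in phones], axis=1)
    return np.array(phones)[scores.argmax(axis=1)]

def pooled_accuracy(lev_train_X, lev_train_y, msa_X, msa_y,
                    lev_test_X, lev_test_y, msa_fraction=0.05, seed=0):
    # Cross-dialectal training: pool a fraction of MSA segments with the Levantine
    # training set, then evaluate phone accuracy on the Levantine test set.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(msa_X), size=int(msa_fraction * len(msa_X)), replace=False)
    X = np.vstack([lev_train_X, msa_X[idx]])
    y = np.concatenate([lev_train_y, msa_y[idx]])
    gmms = train_phone_gmms(X, y)
    return np.mean(classify(gmms, lev_test_X) == lev_test_y)

In the results that follow, both the fraction of Levantine training data and the fraction of MSA data (msa_fraction above) are varied, and accuracy is compared across the resulting data ratios.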

12 Results: Transfer MSA data to Levantine Arabic Phone classification accuracy for systems trained using a fraction of the MSA data and a fraction of the Levantine data. Abscissa: fraction of the Levantine corpus used for training (in percent, corresponding to 2.7 to 27 hours). Parameter: fraction of the MSA corpus used for training (in percent, corresponding to 0 to 1.4 hours). Ordinate: phone accuracy on Levantine test data.

13 Results: Transfer MSA data to Levantine Arabic Same data as the previous slide, replotted with the abscissa equal to the ratio between the amount of Levantine training data and MSA training data. All systems peak at a data ratio of 20:1 (Levantine:MSA).

14 Discussion Low classification accuracy: force-aligned phone boundaries are not precise, so the features of a phone occurrence may also include frames from adjacent phones. The classification accuracy depends on the ratio between the amount of Levantine training data and MSA data. When a proper amount of MSA data is transferred to the Levantine data (MSA data ≈ 1/20 of the Levantine data), the phonetic coverage of the GMMs is enhanced and higher accuracies are achieved. Transferring too much MSA data apparently results in mismatched phone acoustic models.

15 Conclusion We present a cross-dialectal GMM training scheme in which data from the West Point Modern Standard Arabic Speech corpus are used to reduce the error rate of a phone classifier applied to the Babylon Levantine Arabic corpus. The optimum ratio between in-dialect and cross-dialect data is about 20:1, and this ratio is optimal over a wide range of total training set sizes.

16 Future Work
Improve forced alignment of the Levantine data with phonetic landmark detectors and improved pronunciation models.
Apply other methods of cross-dialect transfer learning (e.g., regularized adaptive learning, feature space normalization, learned distance metrics).
Learn improved pronunciation models and language models for Levantine Arabic.
Replicate similar methods across multiple dialects.
Extend the current phone classification tasks to word recognition tasks.

