West Point, SAAVB, and BBN/AUB Arabic Speech Corpora: A Comparative Survey
Yousef A. Alotaibi, Ali H. Meftah
King Saud University, Riyadh, Saudi Arabia
This work is supported by NPST project No. 10-INF1325-02
1/18/2019 4th International Conference on Arabic Language Processing, May 2–3, 2012, Rabat, Morocco
OUTLINE
INTRODUCTION: MSA Arabic, Arabic Dialects
SPEECH CORPORA BACKGROUND: TIMIT, West Point, SAAVB, and BBN/AUB
EVALUATION: Type, Speakers, Data Source, Labelling, Training and Testing
CONCLUSION
INTRODUCTION: MSA Arabic
34 phonemes (6 vowels + 28 consonants)
Valid syllables: CV, CVV, CVC, CVCC, CVVC, and CVVCC
Limited amount of research
Low quality of Arabic language resources
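As an illustration of the six syllable templates listed above, the sketch below checks a consonant/vowel template string against them. The function name and the regular expression are assumptions made for this example only; C stands for a consonant slot and V for a vowel slot, as on the slide.

```python
import re

# The six templates from the slide: CV, CVV, CVC, CVCC, CVVC, CVVCC.
# They share one shape: one consonant, one or two vowel slots,
# then zero to two closing consonants.
SYLLABLE_PATTERN = re.compile(r"CV{1,2}C{0,2}")

VALID_TEMPLATES = ["CV", "CVV", "CVC", "CVCC", "CVVC", "CVVCC"]


def is_valid_msa_syllable(template: str) -> bool:
    """Return True if the C/V template is one of the six listed shapes."""
    return SYLLABLE_PATTERN.fullmatch(template) is not None


for template in VALID_TEMPLATES + ["VC", "CCV", "CVCCC"]:
    print(template, is_valid_msa_syllable(template))
```

The regular expression accepts exactly the six listed templates and rejects anything that starts with a vowel or a consonant cluster.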
INTRODUCTION: Arabic Dialects
The Arab world can be divided in many different ways. The following is only one of many groupings that cover the main Arabic dialects:
Gulf Arabic
Levantine Arabic
Egyptian Arabic
Maghreb Arabic
Yemenite
SPEECH CORPORA BACKGROUND: TIMIT
American English speakers of different genders and dialects
A read (canonical) speech corpus
Contains a total of 6,300 sentences: 10 sentences (about 30 sec of speech) spoken by each of 630 speakers (438 males, about 70%, and 192 females, about 30%)
Covers 8 major dialect regions of the United States (US)
A speaker's dialect region is the geographical area within the U.S. mainland where the speaker lived during childhood
SPEECH CORPORA BACKGROUND: West Point
Represents the MSA Arabic language
Produced by the Linguistic Data Consortium (LDC)
A read corpus
Contains 110 speakers (66 male, 44 female)
Consists of collections of 4 main Arabic scripts, which contain 258 sentences with a total of 1512 tokens and 991 types. The total number of distinct Arabic words is 1131.
Consists of 8,516 speech files, totaling 1.7 gigabytes or 11.42 hours of speech data
SPEECH CORPORA BACKGROUND: SAAVB Corpus
Saudi Arabian dialect
A telephony and noisy speech corpus
Collected by KACST from 2002 to 2003
A canonical and spontaneous speech corpus
Acquired from 1,033 speakers (51% males and 49% females)
Total recorded duration is 96.37 hours, distributed among 60,947 audio files (1,033 speakers x 59 audio files)
Size is 2.59 GB. It contains 1,033 directories with 183,518 files
SPEECH CORPORA BACKGROUND: BBN/AUB Corpus
Levantine dialect
Developed with funding from the Defense Advanced Research Projects Agency (DARPA)
A set of spontaneous speech sentences
Recorded in Boston (20%) and at the American University of Beirut (AUB) (80%)
Consists of 164 speakers: 101 males and 63 females
Total duration of recorded speech is 45 hours, distributed among 75,900 audio files; the total audio size is 6.5 GB
The total text size is 3.1 MB, the vocabulary is 15K words, and the total word count is 336K words
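For a rough sense of scale, the sketch below derives the mean duration per audio file from the totals reported for the three Arabic corpora (hours of speech and number of audio files). The dictionary and function names are illustrative assumptions, and the result is only an average; the individual file lengths are not given in the slides.

```python
# Reported totals per corpus: (hours of speech, number of audio files).
CORPUS_TOTALS = {
    "West Point": (11.42, 8516),
    "SAAVB": (96.37, 60947),
    "BBN/AUB": (45.0, 75900),
}


def mean_file_seconds(hours: float, n_files: int) -> float:
    """Average duration of one audio file in seconds."""
    return hours * 3600.0 / n_files


for corpus, (hours, n_files) in CORPUS_TOTALS.items():
    print(f"{corpus}: {mean_file_seconds(hours, n_files):.2f} s per file")
```

Under these totals the averages come out at a few seconds per file for all three corpora, with BBN/AUB files being the shortest on average.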
EVALUATION: Corpus Type
Corpus   TIMIT       W.P         SAAVB                     BBN/AUB
Type     Canonical   Canonical   Canonical + Spontaneous   Spontaneous
EVALUATION: Speakers, Number of Speakers
Corpus            TIMIT   W.P   SAAVB   BBN/AUB
No. of speakers   630     110   1033    164
EVALUATION: Speakers, Gender
Explicitly reported for TIMIT, SAAVB, and West Point; only implicitly reported for BBN/AUB
Corpus   TIMIT       West Point   SAAVB          BBN/AUB
Male     438 (70%)   66 (60%)     523 (50.63%)   101 (61.59%)
Female   192 (30%)   44 (40%)     510 (49.37%)   63 (38.41%)
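The percentages in the gender table follow directly from the speaker counts. The sketch below recomputes them; the names are illustrative assumptions, and values rounded to two decimals may differ from the slide in the last digit (the TIMIT shares, for example, are shown on the slide rounded to whole percent).

```python
# Speaker counts per corpus as (male, female), taken from the table above.
SPEAKER_COUNTS = {
    "TIMIT": (438, 192),
    "West Point": (66, 44),
    "SAAVB": (523, 510),
    "BBN/AUB": (101, 63),
}


def gender_shares(male: int, female: int) -> tuple[float, float]:
    """Return (male %, female %) rounded to two decimals."""
    total = male + female
    return round(100 * male / total, 2), round(100 * female / total, 2)


for corpus, (male, female) in SPEAKER_COUNTS.items():
    male_pct, female_pct = gender_shares(male, female)
    print(f"{corpus}: {male_pct:.2f}% male, {female_pct:.2f}% female")
```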
EVALUATION: Speakers, Ages
The catalogs of the TIMIT, West Point, and BBN/AUB corpora give no information about the speakers’ ages
In the SAAVB corpus, the age of each speaker is documented in a well-defined manner
EVALUATION: Speakers, Nationalities
The BBN/AUB corpus does not report this important information; it only states that 20% of the corpus was recorded in Boston and the remaining 80% at AUB
Corpus: TIMIT, West Point, SAAVB, BBN/AUB
Native: 100%, 68.18%, ?
Nonnative: -, 31.82%
EVALUATION: Speakers, Distribution Within the Region of the Targeted Dialects
West Point and BBN/AUB do not show how they chose their speakers from the Arab countries (22 countries for W.P and 4 countries for BBN/AUB)
Corpus: TIMIT, West Point, SAAVB, BBN/AUB
Distribution: 8 regions, ?, all Saudi cities
EVALUATION: Data Sources
Corpus: TIMIT, West Point, SAAVB, BBN/AUB
Sampling rate: 16 kHz, 22.05 kHz, 8 kHz
Recorded by: soundproof chamber; Shure SM10A microphone and a RANE Model MS1 pre-amplifier; telephone system; close-talking, noise-cancelling headset microphone (the Andrea Electronics NC-65)
EVALUATION: Lexicon and Labeling
West Point is timeless
Corpus: TIMIT, West Point, SAAVB, BBN/AUB
Labeling: Yes, No
EVALUATION: Training and Testing Subsets
TIMIT has been subdivided into suggested training and testing subsets using the following criteria:
Roughly 20% to 30% of the corpus should be used for testing purposes, leaving the remaining 70% to 80% for training
No speaker should appear in both subsets
All the dialect regions should be represented in both subsets
Overlap of text material in the two subsets should be minimized; if possible, no texts should be identical
All the phonemes should be covered in the test material
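The first three criteria above (test share of roughly 20% to 30%, no shared speakers, every dialect region in both subsets) can be sketched as a per-region speaker split. This is a minimal illustration under assumed inputs, not the procedure used to build the TIMIT subsets: the function name, the speaker-to-region mapping, and the toy data are hypothetical, and the text-overlap and phoneme-coverage criteria are not handled.

```python
import random
from collections import defaultdict


def split_speakers(speaker_regions, test_fraction=0.25, seed=0):
    """Split speakers into disjoint train/test sets, region by region.

    speaker_regions maps a speaker id to its dialect region
    (a hypothetical input format chosen for this sketch).
    """
    by_region = defaultdict(list)
    for speaker, region in speaker_regions.items():
        by_region[region].append(speaker)

    rng = random.Random(seed)
    train, test = set(), set()
    for region in sorted(by_region):
        speakers = sorted(by_region[region])
        if len(speakers) < 2:
            raise ValueError(f"region {region!r} needs at least two speakers")
        rng.shuffle(speakers)
        # Keep at least one test and one training speaker in every region.
        n_test = min(max(1, round(len(speakers) * test_fraction)), len(speakers) - 1)
        test.update(speakers[:n_test])
        train.update(speakers[n_test:])
    return train, test


# Hypothetical toy data: 8 regions with 20 speakers each.
toy = {f"spk{r}_{i}": f"dr{r}" for r in range(1, 9) for i in range(20)}
train, test = split_speakers(toy)
print(len(train), len(test))  # 120 40
```

Splitting by speaker rather than by utterance is what guarantees that no speaker appears in both subsets; splitting inside each region is what keeps every region represented on both sides.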
EVALUATION: Training and Testing Subsets
SAAVB leaves it open for researchers and application developers to select the training and testing sets according to their needs
West Point and BBN/AUB do not address this
CONCLUSION
The Arabic language lacks reliable speech corpora
Robust Arabic speech corpora must consider:
The different Arab countries
The different Arabic dialects
The different speakers' ages and genders, with good distributions
Training and testing subsets are also important, in addition to phoneme labelling
Thank you for your attention
Any questions?