ASSESSING THE USABILITY OF MODERN STANDARD ARABIC DATA IN ENHANCING THE LANGUAGE MODEL OF LIMITED SIZE DIALECT CONVERSATIONS Authors: Tiba Zaki Abulhameed.


1 ASSESSING THE USABILITY OF MODERN STANDARD ARABIC DATA IN ENHANCING THE LANGUAGE MODEL OF LIMITED SIZE DIALECT CONVERSATIONS Authors: Tiba Zaki Abulhameed (Western Michigan University, Al-Nahrain University); Presenter: Imed Zitouni (Microsoft Research); Ikhlas Abdel-Qader (Western Michigan University); Mohamed Abusharkh (Ferris State University)

2 ACKNOWLEDGMENT Many thanks to the Higher Committee of Education Development (HCED) in Iraq for sponsoring the research. We would also like to thank LDC for the opportunity to use the Arabic GALE Phase 2 Part 1 and Part 2 data through their data scholarship.

3 OUTLINE Why improve the "dialect conversations" LM; Issues in Iraqi dialect conversations; Challenges in our case study; Approaches to solving the issues to get a more accurate LM; Results & analysis; Future work; Q & A

4 WHY IMPROVE THE "DIALECT CONVERSATIONS" LM Real-world conversations use dialectal language. An accurate LM is an important component of an efficient ASR system.

5 ISSUES IN IRAQI DIALECT CONVERSATIONS Pronunciation varies by region, from the north to the south of Iraq. Many words originate from other languages such as Turkish, Farsi, and English; some words come from MSA but with a slightly changed pronunciation. Together this leads to an unbounded vocabulary. Dialects tend to adjust their vocabulary and usage over relatively short time intervals to reflect cultural and generational changes, so the LM should be able to recognize such evolution and adapt. Speakers may use unrestricted grammar.

6 CHALLENGES IN OUR CASE STUDY Data sparsity; limited-size training data; adapting MSA data from a different domain (broadcast news and reports vs. phone call conversations).

7 APPROACHES TO SOLVE THE ISSUES TO GET A MORE ACCURATE LM To reduce data sparsity, a class n-gram is implemented. To enlarge the data, the Iraqi data is mixed with MSA data and refined. For domain adaptation, we use interpolated LMs.

8 CLASS N-GRAM BASED ON WORD2VEC VECTORS CLUSTERING Word2vec n-gram counting outputs two files, CS and CPW.
1- CS is used to produce the LM:
$ ngram-count -order 2 \
    -read CS -write f1.ngrams
$ ngram-count -order 2 \
    -read f1.ngrams -lm w2vClassBased.lm
2- CPW is used in testing the LM:
$ ngram -lm w2vClassBased.lm \
    -classes CPW -ppl test-data
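The slides do not show how CS and CPW are built, so the following is only a rough sketch of one way such inputs could be derived once every word has a cluster label. It assumes a word-to-class mapping already exists (for instance from clustering word2vec vectors), and it produces class-level bigram counts plus per-class word probabilities. The function name, the "UNKCLASS" fallback label, and the output shapes are illustrative assumptions, not the authors' actual file formats.

```python
from collections import Counter


def build_class_files(sentences, word_to_class):
    """Return (class_bigram_counts, class_definitions) for tokenized sentences.

    class_bigram_counts: Counter over (class, class) pairs.
    class_definitions: list of (class, p(word | class), word) tuples.
    """
    word_counts = Counter()
    bigram_counts = Counter()
    for sent in sentences:
        # Replace each word by its class label; "UNKCLASS" is a hypothetical
        # fallback for words that have no cluster assignment.
        classes = [word_to_class.get(w, "UNKCLASS") for w in sent]
        word_counts.update(sent)
        bigram_counts.update(zip(classes, classes[1:]))

    # Total token count per class, used to normalize word counts.
    class_totals = Counter()
    for word, n in word_counts.items():
        class_totals[word_to_class.get(word, "UNKCLASS")] += n

    definitions = []
    for word, n in sorted(word_counts.items()):
        cls = word_to_class.get(word, "UNKCLASS")
        definitions.append((cls, n / class_totals[cls], word))
    return bigram_counts, definitions
```

In this sketch the bigram counts would play the role of the class-sequence counts fed to `ngram-count`, and the definitions would play the role of the class membership file passed with `-classes`; the exact on-disk layout should be checked against the SRILM documentation.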

9 ILLUSTRATION OF CS & CPW

10 MIXING WITH MSA AND REFINING The Iraqi training set is 10% of the size of the MSA training set. To get equal proportions, two ways of mixing were tested: 1. All of the Iraqi training set with 0.1 of the MSA set. 2. The Iraqi data duplicated 10 times with all of the MSA set. The data is refined by keeping only the sentences with probability of words less than 0.3.
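The two mixing schemes can be sketched as below. This is a minimal illustration, assuming each corpus is a list of sentences and that the 0.1 share of MSA is drawn by uniform random sampling; the slides do not say how that subset was chosen, and the function names are hypothetical. The refining step is left out because the slide does not define the probability it thresholds.

```python
import random


def mix_subsample(iraqi, msa, fraction=0.1, seed=0):
    """Scheme 1: all Iraqi sentences plus a random fraction of the MSA set."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    k = round(len(msa) * fraction)
    return list(iraqi) + rng.sample(list(msa), k)


def mix_duplicate(iraqi, msa, copies=10):
    """Scheme 2: the Iraqi set repeated `copies` times plus all of the MSA set."""
    return list(iraqi) * copies + list(msa)
```

Both schemes bring the two sources to roughly equal token shares; the first shrinks the MSA side, while the second inflates the Iraqi counts without adding any new Iraqi n-grams.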

11 INTERPOLATED LMS Lambda weights were set using tuning data amounting to 4% of the total Iraqi data.
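As a rough illustration of what setting a lambda weight on tuning data can look like, the sketch below linearly interpolates two models and fits the weight with a simple expectation-maximization loop. It assumes each model has already produced a strictly positive probability for every token of the tuning set; the slides do not state which tuning procedure was used, so the EM loop and all names here are assumptions.

```python
import math


def interpolate(p_iraqi, p_msa, lam):
    """Linear interpolation: lam * P_iraqi + (1 - lam) * P_msa."""
    return lam * p_iraqi + (1.0 - lam) * p_msa


def tune_lambda(probs_iraqi, probs_msa, iterations=50, lam=0.5):
    """Estimate the weight of the Iraqi model from per-token probabilities."""
    for _ in range(iterations):
        # E-step: posterior that each token was generated by the Iraqi model.
        posteriors = [
            lam * a / interpolate(a, b, lam)
            for a, b in zip(probs_iraqi, probs_msa)
        ]
        # M-step: the new weight is the average posterior.
        lam = sum(posteriors) / len(posteriors)
    return lam


def perplexity(probs):
    """Perplexity of a token sequence given its per-token probabilities."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))
```

Each EM iteration cannot lower the likelihood of the tuning data, so the tuned weight gives a tuning-set perplexity no worse than the starting value of 0.5.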

12 RESULTS CB stands for word2vec class-based bigram. 21% over Baseline1; 15% over Baseline2.

13 ANALYSIS 1-SIZE OF INTERSECTED VOCABULARY
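The figure for this slide is not part of the transcript. As a hedged sketch of how the size of the intersected vocabulary could be measured, the function below counts the word types shared by two tokenized corpora and reports what share of the first corpus's vocabulary they cover. The function name and the choice of denominator are assumptions, not taken from the slides.

```python
def vocabulary_overlap(corpus_a, corpus_b):
    """Return (shared_type_count, shared share of corpus_a's vocabulary)."""
    vocab_a = {w for sent in corpus_a for w in sent}
    vocab_b = {w for sent in corpus_b for w in sent}
    shared = vocab_a & vocab_b
    coverage = len(shared) / len(vocab_a) if vocab_a else 0.0
    return len(shared), coverage
```

Called with the Iraqi corpus as `corpus_a` and the MSA corpus as `corpus_b`, the second value would be the fraction of Iraqi word types that also occur in MSA.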

14 ANALYSIS 2-STRUCTURE COMPARISON أنف على ما ها هو

15 ANALYSIS 3- DEALING WITH OOV If OOV words are classified, we get 1% more improvement.

16 FUTURE STUDY We investigated the value of using GALE MSA data to reduce the perplexity of Iraqi dialectal conversations, mainly through two approaches: mixing the data versus interpolating the language models. LM interpolation achieved encouraging improvements, so this approach merits further consideration. Different treatments of OOV words can be considered in future study. A biased selection of a subset of the GALE dataset could be used to improve the results further.

17 QUESTIONS

