Published by Frederica McCormick. Modified over 8 years ago.
1
Towards Developing a Multi-Dialect Morphological Analyser for Arabic
4th International Conference on Arabic Language Processing, May 2–3, 2012, Rabat, Morocco
Khalid Almeman and Mark Lee, The University of Birmingham
www.almeman.com
2
Outline
Introduction
Multi dialect Morphology Analyser
– Adopt MSA morphology analyser
– Segment unknown words
– Check on web corpus
Conclusions and future work
4
Introduction
The usage of MSA vs. dialects
5
Introduction: Dialectal Morphology & Variation
– MSA has a rich morphology in two main aspects:
  affixes and stems (word level)
  syntax (context level)
– The dialects inherit the complexity of MSA and also differ substantially from MSA at both the word and syntax levels
6
Dialectal Morphology & Variation (the changes)
Some sounds are transformed
– e.g. s to h (N Africa), q to a (LEV), s to H (EGY)
New sounds
– e.g. k to ts or ch (Gulf), j to g (EGY)
Changes in syntax between MSA and the dialects
No standardisation in writing
– e.g. the loanword ‘sandwich’ can be written in many forms: ساندوتش sAndwitš, ساندويشة sAndwiyšat, ساندويشه sAndwiyšh, ساندوش sAndwiš, سندوش sandwiš, سندوتش sandwitš
7
Introduction: Dialectal Morphology & Variation (the changes)
The changes in phonetics between Arabic dialects compared with MSA, e.g.
MSA sound        θ           ð           q    j
has converted to:
Egyptian         s (or) t    Z           A    j
Levantine        θ           ð           g    j (or) g
Gulf             θ           ð           g    j
North Africa     s (or) t    z (or) d    A    g
8
Introduction: What is the problem?
1. The rich morphology of the Arabic language
2. The variation between MSA and the dialects
3. The variation among the dialects themselves
4. No standardisation in Arabic dialects
5. State of the art: MAGEAD
   1. Restricted to verbs
   2. Levantine – rules need to be defined for new dialects
Hence the need for a dialect morphology analyser
10
Multi dialect morphology analyser
Three methods have been applied:
1. Modify the MSA analyser
2. Segment the rest of the words
3. Check the frequency in the web corpus
11
Multi dialect morphology analyser: baseline experiment
We extracted 2229 dialect words from the web and then checked them with an MSA morphology analyser (Al Khalil, 2011). The result:
Number of words        2229
Unknown words          1508
Unknown words (%)      68%
Recognised words       721
Total accuracy         32%
13
Multi dialect morphology analyser
The first method: Adopt MSA analyser
According to Haack (1996), the stem patterns of Arabic dialects are identical to those of MSA in many cases:
MSA      Egyptian    N Africa    Gulf
يلعب     حيلعب       هيلعبوا     بيلعب
ينام     حينام       هيناموا     بينام
يشرب     حيشرب       هيشربوا     بيشرب
So the suggestion is to add new dialect affixes to the MSA morphology analyser
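A minimal sketch of this first layer, assuming the analyser can be treated as a yes/no lookup. The two prefixes are read off the example forms on the slide; the stem set and the names `msa_recognises` and `analyse` are hypothetical stand-ins, not the Al Khalil interface or the authors' code.

```python
from typing import Callable, Optional, Tuple

# Illustrative dialect prefixes read off the slide's example forms
# (e.g. حيلعب, بيلعب). A real inventory would be much larger.
DIALECT_PREFIXES: Tuple[str, ...] = ("ح", "ب")

# Hypothetical stand-in for the MSA analyser's lexicon.
MSA_KNOWN = {"يلعب", "ينام", "يشرب"}


def msa_recognises(word: str) -> bool:
    """Hypothetical stand-in for a lookup in an MSA morphological analyser."""
    return word in MSA_KNOWN


def analyse(
    word: str, recognises: Callable[[str], bool] = msa_recognises
) -> Optional[Tuple[str, str]]:
    """Return (prefix, stem) if the word is recognised, otherwise None."""
    # Plain MSA word: no dialect prefix needed.
    if recognises(word):
        return ("", word)
    # Otherwise try removing one dialect prefix and look the stem up again.
    for prefix in DIALECT_PREFIXES:
        stem = word[len(prefix):]
        if word.startswith(prefix) and stem and recognises(stem):
            return (prefix, stem)
    return None
```

Words that still return `None` here are the ones passed on to the later layers.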
14
The results after the first layer:
Number of words          1508
Recognised words         824
Recognised words (%)     55%
Unknown words            684
Unknown words (%)        45%
The total accuracy has increased from 32% to 69%
[Figure: an example of the output after the first layer]
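The percentages can be checked with a few lines of arithmetic. The reading that the 69% total is the baseline hits plus the first-layer hits, divided by the full 2229-word set, is an assumption, but it reproduces the reported figures.

```python
TOTAL_WORDS = 2229          # dialect words in the baseline experiment
BASELINE_RECOGNISED = 721   # recognised by the unmodified MSA analyser
LAYER1_INPUT = 1508         # unknown words passed to the first layer
LAYER1_RECOGNISED = 824     # recognised after adding dialect affixes


def pct(part: int, whole: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


baseline_accuracy = pct(BASELINE_RECOGNISED, TOTAL_WORDS)
layer1_recognised_share = pct(LAYER1_RECOGNISED, LAYER1_INPUT)
still_unknown = LAYER1_INPUT - LAYER1_RECOGNISED
# Assumed reading: cumulative hits over the whole test set.
accuracy_after_layer1 = pct(BASELINE_RECOGNISED + LAYER1_RECOGNISED, TOTAL_WORDS)

print(baseline_accuracy, layer1_recognised_share, still_unknown, accuracy_after_layer1)
```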
16
Multi dialect morphology analyser
The second method: the segmenter
Segments the rest of the words by extracting four candidate shapes of each word; at this stage we do not yet know which one is correct.
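The slide does not list the four shapes. The sketch below assumes they are: the full word, prefix + stem, stem + suffix, and prefix + stem + suffix. The affix lists, the `Segmentation` type and the function `candidate_shapes` are all hypothetical and only illustrate the idea.

```python
from typing import List, NamedTuple


class Segmentation(NamedTuple):
    prefix: str
    stem: str
    suffix: str


# Hypothetical affix inventories, using letters that appear in the slides.
PREFIXES = ("ب", "ح")
SUFFIXES = ("ون", "ين")


def candidate_shapes(word: str) -> List[Segmentation]:
    """Return up to four candidate segmentations of a word (assumed shapes)."""
    shapes = [Segmentation("", word, "")]  # shape 1: the full word
    prefix = next(
        (p for p in PREFIXES if word.startswith(p) and len(word) > len(p)), ""
    )
    suffix = next(
        (s for s in SUFFIXES if word.endswith(s) and len(word) > len(s)), ""
    )
    if prefix:  # shape 2: prefix + stem
        shapes.append(Segmentation(prefix, word[len(prefix):], ""))
    if suffix:  # shape 3: stem + suffix
        shapes.append(Segmentation("", word[: -len(suffix)], suffix))
    if prefix and suffix and len(word) > len(prefix) + len(suffix):
        # shape 4: prefix + stem + suffix
        shapes.append(
            Segmentation(prefix, word[len(prefix): -len(suffix)], suffix)
        )
    return shapes
```

The segmenter only proposes candidates; it makes no attempt to rank them.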
18
Multi dialect morphology analyser
However, full-word usage differs between Arab countries in many cases.
So, the third method: use the web corpus
19
Multi dialect morphology analyser
The third method (cont.)
According to a hypothesis, we check the frequency in the web corpus:
Full word: بيصطاد (16500); prefix: ب; suffix: (empty); stem: يصطاد (800000)
Full word: بيتارجح (2850); prefix: ب; suffix: (empty); stem: يتارجح (212000)
Full word: بيتهجأ (5); prefix: ب; suffix: (empty); stem: يتهجأ (10100)
Full word: بيركع (13100); prefix: ب; suffix: (empty); stem: يركع (568000)
Then we choose the candidate with the greatest frequency, provided it is >= 10000
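The selection rule can be sketched as follows. The counts are the ones shown on the slide; the dictionary standing in for a live web search and the function name `choose_by_frequency` are assumptions for illustration, not the authors' implementation.

```python
from typing import Dict, Iterable, Optional

THRESHOLD = 10_000  # minimum frequency stated on the slide

# Frequencies copied from the slide; a real system would query a web corpus.
WEB_COUNTS: Dict[str, int] = {
    "بيصطاد": 16_500, "يصطاد": 800_000,
    "بيتارجح": 2_850, "يتارجح": 212_000,
    "بيتهجأ": 5, "يتهجأ": 10_100,
    "بيركع": 13_100, "يركع": 568_000,
}


def choose_by_frequency(
    candidates: Iterable[str],
    counts: Dict[str, int] = WEB_COUNTS,
    threshold: int = THRESHOLD,
) -> Optional[str]:
    """Pick the most frequent candidate, or None if it is below the threshold."""
    best = max(candidates, key=lambda c: counts.get(c, 0), default=None)
    if best is None or counts.get(best, 0) < threshold:
        return None
    return best
```

For example, `choose_by_frequency(["بيصطاد", "يصطاد"])` returns the stem `يصطاد`, because 800000 is both the larger count and above the threshold, while a lone candidate such as `بيتهجأ` (5 hits) is rejected.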
20
The final results:
Number of words                  684
Recall (frequency >= 10000)      90%
Precision (%)                    80%
F-measure                        85%
The total accuracy has increased from 69% to 94%
[Figure: an example of the output after the last layer]
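The reported F-measure is consistent with the usual harmonic mean of precision and recall. That the authors used the balanced F1 form is an assumption, but it reproduces the 85% figure.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision 80% and recall 90%, as reported on the slide.
f1 = f_measure(0.80, 0.90)
print(f"{f1:.3f}")
```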
21
Last focus
Advantages:
– In many cases it can differentiate between words that have an actual suffix and words that merely end in letters resembling a suffix, e.g. مسؤولون masŵlwn ‘the accountants’ (actual suffix) vs. جيلاتين jylAtyn ‘gelatine’ (letters resembling a suffix)
– The web-as-corpus method also works with MSA words not found in the MSA morphology analyser, e.g. الخبراء AlxubarA' ‘the experts’, آخرون Axrwn ‘others’
However, still:
– Does not support diacritisation yet
22
Last focus
Advantages:
– Up to date, e.g. two months later the number of unknown words was found to have fallen from 76 to 64
– By frequency, the web search can also distinguish new dialectal Arabic words, e.g. أبضاي AbaĎAy ‘strong man’ (Levantine), أتاي AtAy ‘tea’ (North Africa) and شدعوه šdaςwah ‘why’ (Gulf)
However, still:
– Although all possible solutions appear in the first layer, they are not yet supported by the web search
24
Conclusions and future work
– Work on a larger corpus
– Deal with diacritisation
– Add more linguistic rules to both the adopted MSA morphology analyser and the web search to improve accuracy
25
Any questions? Thank you