Computational Investigation of Palestinian Arabic Dialects. Ezra Daya, Rafi Talmon, Shuly Wintner
Background The fieldwork study covers Arabic dialects spoken by people in 250 localities: Northern and central parts of Israel. Localities in the West Bank. Southern Lebanese communities in Galilee. Palestinian refugees of 1948 living in existing Arabic localities.
Background cont. Colloquial Arabic features: Non-official spoken language, usually not written. Differs from place to place. The similarity/distance between the Arabic dialects can be measured. Considered by its speakers less prestigious than the official Arabic.
Background cont. Work performed by special teams: Collecting and processing fieldwork material such as recorded interviews and linguistic questionnaires. Transcribing the material that constitutes the basis of our work. Producing an accurate description of the language varieties of Palestinian colloquial Arabic, their characteristics, and their geographical distribution.
Transcribed Text Sample
Objectives Publication of the vast collected material using computational linguistic techniques in order to: Create lexicons and glossaries for Arabic dialects automatically. Create a linguistic atlas that graphically depicts the similarities among the dialects. Gain a better understanding of the morphological and phonemic features of the dialects.
Linguistic Atlas
The challenge – Rich Morphology Semitic languages such as Arabic have a rich morphology and contain highly inflected forms. Example: axdat is the 3rd person, singular, feminine, past form of the verb axad. It is obtained by attaching the suffix ‘at’ to the base axad and dropping the second vowel ‘a’ of the base.
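The axad/axdat example can be sketched as a single string rule. This is a minimal illustration, not the project's analyzer: the function name `inflect_3sg_fem_past`, the vowel inventory, and the condition for dropping the vowel are assumptions generalized from this one example.

```python
VOWELS = set("aeiou")  # assumed vowel inventory for the transcription


def inflect_3sg_fem_past(base: str) -> str:
    """Attach the suffix 'at'; drop the vowel of a final C-V-C sequence.

    Hypothetical rule inferred from the single example axad -> axdat.
    """
    ends_in_cvc = (
        len(base) >= 3
        and base[-1] not in VOWELS
        and base[-2] in VOWELS
        and base[-3] not in VOWELS
    )
    # "axad" -> "axd": remove the vowel before the final consonant
    stem = base[:-2] + base[-1] if ends_in_cvc else base
    return stem + "at"


print(inflect_3sg_fem_past("axad"))  # axdat
```

A real analyzer would need many such rules, and their interactions are what make manual analysis costly.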
Rich Morphology cont. Arabic has a complex morphological system based on triconsonantal roots, as is common in Semitic languages. For example, there are 10 verb patterns, each of which can be inflected in 3 numbers, 2 genders, 3 persons, several tenses and aspects, and can be suffixed by several pronominal forms.
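To give a feel for the size of this system, the sketch below interleaves a triconsonantal root with a vowel pattern and enumerates the feature grid from the counts above. The pattern notation (`C` marks a root-consonant slot), the root k-t-b, and the feature labels are illustrative assumptions; tenses, aspects and pronominal suffixes are left out, and real paradigms need not fill every cell of the grid.

```python
from itertools import product


def interdigitate(root: str, pattern: str) -> str:
    """Fill each 'C' slot of the pattern with the next root consonant."""
    if pattern.count("C") != len(root):
        raise ValueError("pattern slots and root consonants must match")
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)


# Feature grid from the slide: 10 patterns x 3 numbers x 2 genders x 3 persons.
PATTERNS = range(1, 11)
NUMBERS = ("singular", "dual", "plural")
GENDERS = ("masculine", "feminine")
PERSONS = (1, 2, 3)
FEATURE_GRID = list(product(PATTERNS, NUMBERS, GENDERS, PERSONS))

print(interdigitate("ktb", "CaCaC"))  # katab
print(len(FEATURE_GRID))              # 180
```

Even this partial grid gives 180 combinations for a single root, before tense, aspect and pronominal suffixes are considered.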
Traditional Approach Assigning linguists to perform grammatical analysis of the transcribed texts and to manually create lexicons, glossaries and a linguistic atlas. Disadvantages: Lack of sophistication. Time consuming. Expensive human resources.
Innovative Approach Devise an automated analysis of these transcribed texts in order to obtain: Automated creation of a glossary that organizes all the lexical items by grammatical features, e.g. root and pattern. Isolation of the phonetic and morphological features characteristic of specific dialects in the surveyed area. Measurement of dialect similarity. Automated processing provides accuracy and efficiency.
Linguistic Technologies For this research we intend to exploit existing computational linguistics technology for the investigation of Palestinian Arabic dialects by using: Finite-State technology. Machine learning techniques. Computational dialectology.
Finite State Technology Employing the Xerox finite state tools and techniques which are: Useful and efficient programs that process text in natural languages. Concentrating on morphological analysis and generation. Giving access to finite state operations and a regular expression compiler.
Machine Learning Machine learning is concerned with the question of how to construct computer programs that automatically improve with experience. Two learning frameworks are distinguished according to the amount of supervision used: Supervised learning, in which the learning algorithm is presented with pairs of strings of symbols, i.e. inflected and uninflected forms. Unsupervised learning, in which the algorithm is presented merely with a single set of words and must work out the morphological relationships itself.
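The supervised setting can be illustrated with a toy rule learner that sees (inflected, uninflected) pairs and induces rewrite rules from them. This is only a sketch under stated assumptions, not the learning method of this project: the names `extract_rule`, `learn_rules` and `lemmatize` are hypothetical, and apart from axdat/axad the word strings are toy forms built by analogy with that example, not fieldwork data.

```python
import re
from collections import Counter

VOWELS = "aeiou"  # assumed vowel inventory

# Training pairs (inflected, uninflected); the second pair is a toy form.
PAIRS = [("axdat", "axad"), ("katbat", "katab")]


def extract_rule(inflected, base):
    """Turn one pair into a (regex, replacement) rule over the word ending."""
    k = 0  # length of the common prefix
    while k < min(len(inflected), len(base)) and inflected[k] == base[k]:
        k += 1
    tail_in, tail_out = inflected[k:], base[k:]
    groups, pattern, template = {}, "", ""
    for ch in tail_in:
        if ch not in VOWELS and ch in tail_out and ch not in groups:
            # A consonant kept in the base form becomes a variable.
            groups[ch] = len(groups) + 1
            pattern += "([^" + VOWELS + "])"
        else:
            pattern += re.escape(ch)
    for ch in tail_out:
        template += "\\g<%d>" % groups[ch] if ch in groups else ch
    return pattern + "$", template


def learn_rules(pairs):
    """Return the induced rules, most frequent first."""
    counts = Counter(extract_rule(i, b) for i, b in pairs)
    return [rule for rule, _ in counts.most_common()]


def lemmatize(word, rules):
    """Apply the first rule whose pattern matches the end of the word."""
    for pattern, template in rules:
        if re.search(pattern, word):
            return re.sub(pattern, template, word, count=1)
    return word


rules = learn_rules(PAIRS)
print(lemmatize("darbat", rules))  # darab (unseen toy form)
```

Both training pairs collapse into the same rule (a consonant followed by 'at' becomes 'a' plus that consonant), which is why the unseen form is handled. In the unsupervised setting no such pairs are given, so the relation between the forms would have to be discovered from the word list alone.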
Computational Dialectology Use measures to compute the distance between two given dialects and to define geographical dialect boundaries. Example: edit distance. The distance can be made sensitive to phonological similarities.
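A minimal sketch of edit distance with an optional phonologically sensitive substitution cost follows. The cost table `SIMILAR`, its 0.5 weights, the word strings, and the averaging in `dialect_distance` are illustrative assumptions, not values or measures taken from this study.

```python
def edit_distance(a, b, sub_cost=None):
    """Levenshtein distance; insertions and deletions cost 1."""
    if sub_cost is None:
        sub_cost = lambda x, y: 0.0 if x == y else 1.0
    prev = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [float(i)]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1.0,               # delete x
                curr[j - 1] + 1.0,           # insert y
                prev[j - 1] + sub_cost(x, y),  # substitute or match
            ))
        prev = curr
    return prev[-1]


# Hypothetical table: substituting similar sounds is cheaper than 1.
SIMILAR = {frozenset("qk"): 0.5, frozenset("td"): 0.5}


def phonological_cost(x, y):
    if x == y:
        return 0.0
    return SIMILAR.get(frozenset((x, y)), 1.0)


def dialect_distance(words_a, words_b, sub_cost=None):
    """Mean word distance over two aligned word lists (same concepts)."""
    pairs = list(zip(words_a, words_b))
    return sum(edit_distance(a, b, sub_cost) for a, b in pairs) / len(pairs)


print(edit_distance("qalb", "kalb"))                     # 1.0
print(edit_distance("qalb", "kalb", phonological_cost))  # 0.5
```

With the weighted cost, two localities that differ only in closely related sounds come out nearer to each other than two that differ in unrelated sounds, which is the kind of sensitivity the slide refers to.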
Previous Related Work Morphological Tagging of the Qur’an: The system facilitates a variety of queries on the Qur’anic text that make reference to the words and their linguistic attributes and provides full morphological tagging of its words. The core of the system is a set of finite-state based rules which describe the morpho-phonological and morpho-syntactic phenomena of the Qur’anic language. The system is currently being used for teaching and research purposes.