Computational Investigation of Palestinian Arabic Dialects


Computational Investigation of Palestinian Arabic Dialects Ezra Daya, Rafi Talmon, Shuly Wintner

Background The fieldwork study covers Arabic dialects spoken in 250 localities: the northern and central parts of Israel; localities in the West Bank; southern Lebanese communities in the Galilee; and Palestinian refugees of 1948 living in existing Arabic localities.

Background cont. Features of colloquial Arabic: It is a non-official spoken language, usually not written. It differs from place to place. The similarity/distance between the Arabic dialects can be measured. Its speakers consider it less prestigious than official Arabic.

Background cont. Work performed by special teams: Collecting and processing fieldwork material such as recorded interviews and linguistic questionnaires. Transcribing the material that constitutes the basis of our work. Producing an accurate description of the language varieties of Palestinian colloquial Arabic, their characteristics, and their geographical distribution.

Transcribed Text Sample

Objectives Publish the vast collected material using computational linguistic techniques in order to: Create lexicons and glossaries for Arabic dialects automatically. Create a linguistic atlas to graphically measure the similarities among the dialects. Better understand the morphological and phonemic features of the dialects.

Linguistic Atlas

The challenge – Rich Morphology Semitic languages such as Arabic have a rich morphology and contain highly inflected forms. Example: axdat is the 3rd person, singular, feminine, past form of the verb axad. It is obtained by concatenating the suffix ‘at’ to the base axad and reducing the second vowel ‘a’ of the base.
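The axad / axdat example can be sketched in a few lines of Python. This is only an illustration under a simplified rule that is an assumption, not something stated in the slides: when a vowel-initial suffix attaches, a vowel standing between the last two consonants of the base is dropped. The function name `attach_suffix` and the vowel inventory are hypothetical.

```python
# Simplified vowel inventory for the transcription (an assumption).
VOWELS = set("aiu")


def attach_suffix(base: str, suffix: str) -> str:
    """Concatenate a suffix to a base, reducing the base's last vowel.

    Assumed rule (illustrative only): if the suffix starts with a vowel
    and the base ends in consonant-vowel-consonant, the vowel is dropped.
    """
    starts_with_vowel = bool(suffix) and suffix[0] in VOWELS
    ends_in_cvc = (
        len(base) >= 3
        and base[-1] not in VOWELS
        and base[-2] in VOWELS
        and base[-3] not in VOWELS
    )
    if starts_with_vowel and ends_in_cvc:
        base = base[:-2] + base[-1]  # drop the vowel before the final consonant
    return base + suffix


print(attach_suffix("axad", "at"))  # axdat
```

A real analyzer has to run this process in reverse (from axdat back to axad plus its features), which is where the ambiguity of rich morphology makes the task hard.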

Rich Morphology cont. Arabic has a complex system of morphology based on triconsonantal roots that is common in Semitic languages. For example, there are 10 verb patterns, each of which can be inflected in 3 numbers, 2 genders, 3 persons, several tenses and aspects, and can be suffixed by several pronominal forms.
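Root-and-pattern morphology can be illustrated with a short Python sketch. The template notation (a `C` marks a slot for a root consonant) and the function name `interdigitate` are assumptions made for this illustration; the root k-t-b is a commonly cited textbook example and does not come from the slides.

```python
def interdigitate(root: str, pattern: str) -> str:
    """Fill the C slots of a pattern with the consonants of a root, in order."""
    if pattern.count("C") != len(root):
        raise ValueError("pattern must have one C slot per root consonant")
    consonants = iter(root)
    # Every "C" consumes the next root consonant; other symbols are copied.
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)


print(interdigitate("ktb", "CaCaC"))    # katab
print(interdigitate("ktb", "maCCuuC"))  # maktuub
```

Counting only the figures quoted above, 10 patterns × 3 numbers × 2 genders × 3 persons already gives up to 180 feature combinations per root, before tenses, aspects and pronominal suffixes are taken into account. This is an upper bound, since not every combination need be realized.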

Traditional Approach Linguists are assigned to perform grammatical analysis of the transcribed texts and to manually create the lexicon, glossaries and linguistic atlas. Disadvantages: Lack of sophistication. Time consuming. Expensive human resources.

Innovative Approach Devise an automated analysis of these transcribed texts, in order to obtain: Automated creation of a glossary that organizes all the lexical items by grammatical features, e.g. root and pattern. Isolation of the phonetic and morphological features and characteristics of specific dialects in the surveyed area. Measurement of dialect similarity. Automated processing provides accuracy and efficiency.

Linguistic Technologies For this research we intend to exploit existing computational linguistics technology for the investigation of Palestinian Arabic dialects by using: Finite-State technology. Machine learning techniques. Computational dialectology.

Finite State Technology Employing the Xerox finite state tools and techniques which are: Useful and efficient programs that process text in natural languages. Concentrating on morphological analysis and generation. Giving access to finite state operations and a regular expression compiler.

Machine Learning Machine learning is concerned with the question of how to construct computer programs that automatically improve with experience. Two learning frameworks are distinguished according to the amount of supervision used: Supervised learning, in which the learning algorithm is presented with pairs of strings of symbols, e.g. inflected and uninflected forms. Unsupervised learning, in which the algorithm is presented merely with a single set of words and must work out what the morphological relationships are.

Computational Dialectology Use measures to compute the distance between two given dialects and to define geographical dialect boundaries. Example: edit distance. The distance can be made sensitive to phonological similarities.
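As a rough illustration of edit distance and of a phonologically sensitive variant, here is a minimal Python sketch. The cost scheme (a vowel-for-vowel substitution costs 0.5, everything else 1), the vowel inventory, and the averaging in `dialect_distance` are assumptions chosen for this example, not the measure used in the project. The strings "bet" and "bit" are toy inputs.

```python
from typing import Callable, Sequence

VOWELS = set("aeiou")  # simplified inventory (an assumption)


def unit_cost(a: str, b: str) -> float:
    """Plain Levenshtein substitution cost."""
    return 0.0 if a == b else 1.0


def phonological_cost(a: str, b: str) -> float:
    """Assumed scheme: swapping one vowel for another is a cheaper change."""
    if a == b:
        return 0.0
    if a in VOWELS and b in VOWELS:
        return 0.5
    return 1.0


def edit_distance(s: str, t: str,
                  sub_cost: Callable[[str, str], float] = unit_cost) -> float:
    """Weighted edit distance; insertions and deletions cost 1."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [float(i)]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1.0,                # delete a
                            curr[j - 1] + 1.0,            # insert b
                            prev[j - 1] + sub_cost(a, b)))  # substitute
        prev = curr
    return prev[-1]


def dialect_distance(words_a: Sequence[str], words_b: Sequence[str],
                     sub_cost: Callable[[str, str], float] = unit_cost) -> float:
    """Mean edit distance over paired word forms from two localities."""
    pairs = list(zip(words_a, words_b))
    return sum(edit_distance(x, y, sub_cost) for x, y in pairs) / len(pairs)


print(edit_distance("axad", "axdat"))                  # 2.0
print(edit_distance("bet", "bit"))                     # 1.0
print(edit_distance("bet", "bit", phonological_cost))  # 0.5
```

Pairwise distances computed this way for every pair of localities form a distance matrix, which can then be clustered or plotted to suggest dialect boundaries.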

Previous Related Work Morphological Tagging of the Qur’an: The system facilitates a variety of queries on the Qur’anic text that make reference to the words and their linguistic attributes and provides full morphological tagging of its words. The core of the system is a set of finite-state based rules which describe the morpho-phonological and morpho-syntactic phenomena of the Qur’anic language. The system is currently being used for teaching and research purposes.