Roots & Patterns vs. Stems plus Grammar-Lexis Specifications: on what basis should a multilingual lexical database centred on Arabic be built? Joseph Dichy.

Roots & Patterns vs. Stems plus Grammar-Lexis Specifications: on what basis should a multilingual lexical database centred on Arabic be built?
Joseph Dichy (Université Lumière-Lyon 2, Lyon, France) and Ali Farghaly (Systran Software Inc., San Diego (CA), USA)
Paper presented at the IXth MT Summit, Workshop on Machine Translation for Semitic Languages: issues and approaches, New Orleans, USA, Sept. 23, 2003

Keywords
- MT, multilingual lexical databases
- NLP and MT feasibility in Semitic languages
- Arabic morphology & morphotactics (word-form structure)
- Semitic roots and patterns
- stem-based lexicons
- morphosyntactic specifiers
- grammar-lexis specifications

Lexical databases in Semitic languages
- ROOT-&-PATTERN grounded analysis of fully ‘vowelled’ Arabic script. Pioneering works: D. Cohen (1961/70), Hlal (1979).
- STEM + ROOT-&-PATTERN approaches:
  - Arabic computational lexicon of T. Buckwalter (Buckwalter, 1990; Beesley, 2001; Maamouri & Cieri, 2002)
  - Lexeme-Based Morphological treatment of Arabic (after Aronoff, 1994 and Beard, 1995): “Only the stem is morphologically relevant in that realization rules act on it” (Soudi, Cavalli-Sforza, Jamari, 2001).

Lexical databases in Semitic languages (2)
- STEM-grounded database, the entries of which are associated with grammar-lexis specifications.
  - Feasibility conditions for the recognition of standard vowel-free Arabic writing: Desclés & al. (1983), Dichy (1984/89), SAMIA (1984), Hassoun (1987), Dichy & Hassoun, eds. (1989), Dichy, Braham, Ghazali & Hassoun (2002)

The paper focuses on:
- the precise reasons why, and how, stem-grounded lexical databases, with entries associated with grammar-lexis specifications, should be recommended in Arabic NLP applications, with special reference to MT.

ROOTS & PATTERNS
- As is well known, ROOTS are (massively):
  - ordered sequences of three consonants,
  - traditionally considered representative of a semantic field.
- Related nouns, verbs and adjectives are considered as generated through processes of vocalization and affixation, forming a syllabic PATTERN.
- The combination of roots and patterns in linguistic units is both non-concatenative (McCarthy, e.g. 1981) and sensitive to constraints and rules (a point going back to Medieval Arabic linguistics).

The root-&-pattern paradigm
- Cantineau (1950), D. Cohen (1961/70) and many others.
- The assumption is that roots and patterns define the meaning of lexical entries in Arabic. Nouns, verbs and adjectives result from the combination of:
  - (a) the "general meaning" of a given root, and
  - (b) a "specific meaning" associated with a pattern.
- The whole lexicon of the language could then be generated using two databases (a ROOT dB and a PATTERN dB), to which rules accounting for constraints of various natures would be added.

Limits of the ROOT-&-PATTERN representation (1)
- Root-&-pattern representation is only valid for a subset of the lexicon.
- A substantial subset of nouns is not subject to analysis in terms of root and pattern (Dichy, 1984/89; Hassoun, 1987).
  - Ancient and medieval Arabic examples: ?ismâ‘îl (إسماعيل), “Ishmael”; nâranj (نارنج), “orange”; sunûnû (سنونو), “sparrow”; sirât (سراط), “path, way”.
  - Modern Standard Arabic examples: fusfât or fuṣfât (فسفات ـ فصفات), “phosphate”; naylûn or nîlûn (نيلون), “nylon”.

Limits of the ROOT-&-PATTERN representation (2)
- Root-&-pattern representation is essentially valid for verbs and deverbals.
- Form-to-form derivational relations essentially operate in the domain of verbs and basic verbo-nominal derivatives, such as infinitive forms (masdar – مصدر) and active or passive participles (?ism al-fâ‘il, ?ism al-maf‘ûl – اسم الفاعل والمفعول).
- All Arabic verbs and all verbo-nominal derivatives can be analysed in terms of root and pattern (Dichy, 1984/89, 1997).

Limits of the ROOT-&-PATTERN representation (3)
- The root-&-pattern paradigm appears to be doubly mistaken.
- Extending its representation to the entire lexicon:
  - (a) leaves a large number of lexical entries unrepresented (a substantial subset of nouns),
  - (b) does not sufficiently take into account its own effective domain of validity (verbs and basic verbo-nominal derivatives).

Arabic word-form structure
- ‘Traditional’ representation of the word-form:

       |---- maximal word-form ---|
             minimal word-form
             |--------------|
    ## PCL # PRF + STEM + SUF # ECL ##

- Nucleus-extensions representation:

               NF
             /    \
          aEF  --  pEF
         /   \    /   \
       PCL  PRF  SUF  ECL

  (NF: nucleus formative, in the position of the stem; aEF and pEF: the extension formatives placed before and after it.)

Constituents of the word-form in Arabic
- proclitics (PCL)
- a prefix (PRF)
- a stem (2 types)
- suffixes (SUF)
- enclitics (ECL)
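The five constituents can be pictured as slots in a small record. The sketch below is only an illustration under stated assumptions: the class name `WordForm` and its fields are hypothetical, and surface forms are obtained by plain concatenation, ignoring the alternation rules a real analyzer would need. The example segments qara?tu-hu, "I read it", into the stem qara?, the suffix -tu and the enclitic -hu.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WordForm:
    """One slot per constituent: ## PCL# PRF+ STEM +SUF #ECL ## (hypothetical model)."""
    stem: str
    proclitics: List[str] = field(default_factory=list)
    prefix: str = ""
    suffixes: List[str] = field(default_factory=list)
    enclitics: List[str] = field(default_factory=list)

    def minimal(self) -> str:
        # Minimal word-form: prefix + stem + suffixes.
        return self.prefix + self.stem + "".join(self.suffixes)

    def maximal(self) -> str:
        # Maximal word-form: proclitics and enclitics wrapped around the minimal one.
        return "".join(self.proclitics) + self.minimal() + "".join(self.enclitics)


# qara?tu-hu, "I read it": stem qara?, suffix -tu, enclitic -hu.
form = WordForm(stem="qara?", suffixes=["tu"], enclitics=["hu"])
print(form.minimal())  # qara?tu
print(form.maximal())  # qara?tuhu
```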

Hebrew word-form structures are similar to those of Arabic
- Sampson (1985: 90-1) analyses graphic word-forms in Hebrew. Not surprisingly, word-form structure analyses in Hebrew are to some extent akin to the one presented here for Arabic. (Sampson, Geoffrey, Writing systems. Stanford University Press.)

Arabic word-form structures entail complex grammar-lexis relations
- The word-formative grammar divides into:
  - (1) EF-EF rules and
  - (2) NF-EF rules (Dichy, 1997).
- (1) EF-EF rules belong purely to the grammar of the language, e.g.:
  - If the proclitics include the preposition bi- or li-, then the case-ending suffixes are those of the indirect case.
  - The proclitic article ?al- excludes the undetermined case endings known as tanwîn.
- (2) NF-EF rules are correlated to NF categories and sub-categories. Their field in the word-formative grammar is that of grammar-lexis relations.
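The two EF-EF examples (bi-/li- require the indirect case; ?al- excludes tanwîn) can be written as a lexicon-independent check. This is a minimal sketch, not the formalization of Dichy (1997): the function name `ef_ef_ok` and the transliterated ending labels are assumptions made for illustration.

```python
# Hypothetical transliterated labels for case endings (assumptions).
INDIRECT_CASE = {"i", "in"}    # indirect (genitive) endings, without and with tanwîn
TANWIN = {"un", "an", "in"}    # undetermined endings


def ef_ef_ok(proclitics, case_suffix):
    """Check the two EF-EF rules; no lexical information is consulted."""
    # Rule 1: the prepositions bi- and li- require an indirect-case ending.
    if {"bi", "li"} & set(proclitics) and case_suffix not in INDIRECT_CASE:
        return False
    # Rule 2: the article ?al- excludes tanwîn.
    if "?al" in proclitics and case_suffix in TANWIN:
        return False
    return True


print(ef_ef_ok(["bi", "?al"], "i"))  # True
print(ef_ef_ok(["bi"], "un"))        # False
print(ef_ef_ok(["?al"], "un"))       # False
```

Because the function never looks at the stem, it reflects the point that EF-EF rules belong purely to the grammar.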

Morphotactic grammar-lexis relations: NF-EF relations - Type 1
- NF-EF relations pertain partly to grammar, e.g.:
  - the proclitic article ?al- is exclusively compatible with adjectives and common nouns;
  - the proclitic morpheme sa-, which denotes the future of verbs, is only compatible with imperfective verb stems;
- and for a greater part to grammar-lexis relations, e.g.:
  - enclitic pronouns are associated with verbs according to selection features such as <+ human vs. – human complements> (متعد إلى العقلاء ~ إلى غير العقلاء). One can say, for example, qara?tu-hu (قرأته), "I read it", but not *qara?tu-hum, in which the enclitic pronoun -hum refers to human beings.
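Unlike EF-EF rules, these checks need information stored with the stem. The sketch below assumes a hypothetical `Stem` record with a category tag and two selection features; the tag names, the feature names and the treatment of -hum as the only +human enclitic are illustrative assumptions, not the DIINAR.1 encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stem:
    form: str
    category: str                   # hypothetical tag set
    human_object: bool = False      # verb accepts a +human complement
    nonhuman_object: bool = False   # verb accepts a -human complement


def proclitic_ok(proclitic: str, stem: Stem) -> bool:
    if proclitic == "?al":
        return stem.category in {"adjective", "common_noun"}
    if proclitic == "sa":
        return stem.category == "verb_imperfective"
    return True  # other proclitics are not modelled here


HUMAN_ENCLITICS = {"hum"}  # assumption: only -hum is modelled as +human


def enclitic_ok(enclitic: str, stem: Stem) -> bool:
    if not stem.category.startswith("verb"):
        return True  # only the verb rule from the slide is modelled
    if enclitic in HUMAN_ENCLITICS:
        return stem.human_object
    return stem.human_object or stem.nonhuman_object


# qara?, "read": encoded here as taking -human complements only.
qara = Stem("qara?", "verb_perfective", nonhuman_object=True)
print(enclitic_ok("hu", qara))   # True:  qara?tu-hu
print(enclitic_ok("hum", qara))  # False: *qara?tu-hum
print(proclitic_ok("sa", qara))  # False: sa- needs an imperfective stem
```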

Morphotactic grammar-lexis relations: NF-EF relations - Type 2
- A large set of NF-EF relations involves "frozen" or "lexicalized" relations between nucleus and extension formatives, as opposed to compositional relations, e.g.:
  - The word jâmi‘a& (جامعة) can be analysed either as:
    - (a) the active participle jâmi‘, "bringing together", "collecting", to which the fem. suffix -a& is added, or as:
    - (b) a lexicalized compound including the meanings of the active participle and the suffix -a& of the res generalis, "the thing that..." (Roman, 1990). The whole compound, which includes a semantic addition (Dichy, 2002), means "university".
  - In (a), the relation between jâmi‘ and -a& is simply compositional. In (b), it is clearly frozen or lexicalized (deriving from "the thing that brings together").
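The double analysis can be reproduced with a toy stem lexicon in which the lexicalized compound has its own entry. This is a sketch under assumptions: the dictionary layout and the function `analyses` are hypothetical, and the word is written jâmi‘a& in the code.

```python
# Toy stem lexicon (hypothetical layout); glosses are taken from the slide.
LEXICON = {
    "jâmi‘": {"category": "active_participle", "gloss": "bringing together, collecting"},
    "jâmi‘a&": {"category": "common_noun", "gloss": "university"},
}
FEM_SUFFIX = "a&"


def analyses(word_form):
    """Return (stem, suffix, gloss) triples for every analysis the lexicon licenses."""
    results = []
    # (b) Lexicalized reading: the whole form is itself a lexical entry.
    if word_form in LEXICON:
        results.append((word_form, "", LEXICON[word_form]["gloss"]))
    # (a) Compositional reading: known stem + feminine suffix.
    if word_form.endswith(FEM_SUFFIX):
        base = word_form[: -len(FEM_SUFFIX)]
        if base in LEXICON:
            results.append((base, FEM_SUFFIX, LEXICON[base]["gloss"] + " (fem.)"))
    return results


for analysis in analyses("jâmi‘a&"):
    print(analysis)
```

The "university" reading is only available because the compound is listed as an entry, which is the kind of information a root-and-pattern generator alone would not supply.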

Grammar-lexis relations are finite
- The two types of NF-EF relations account for a finite and exhaustive set of grammar-lexis relations, which operate in the domain of the Arabic word-form.
- They have been formalized in Hassoun (1987) and Dichy (1987, 1990, 1997), and implemented in the DIINAR.1 language database.
- They have also been extended to scientific terminological units (Lelubre, 2001).

DIINAR.1 (DIctionnaire INformatisé de l’Arabe), Arabic acronym Ma‘âlî (Mu‘jam al-‘Arabiyya l-’âlî – مـعـالي ــ معجم العربية الآلي )
- A comprehensive Arabic language dB of around 130,000 lemmas, comprising:
  - approximately 20,000 verbal entries, 79,000 deverbal items, 29,000 nominal entries (to which 10,000 related "broken plural" items are attached), 1,000 proper names and 450 grammatical tool-words (each of which is associated with a specific grammar).
  - The resource also includes the clitics and affixes of the language.

DIINAR.1 morphosyntactic specifiers
- Each lexical unit is associated with morphosyntactic specifiers accounting for grammar-lexis specifications operating at word-form level.
- Specifiers also include derivational links between morphologically related items, such as verb ↔ deverbal(s) or, in nouns, singular ↔ “broken” plural, etc.
- DIINAR.1 has been completed by:
  - in Tunis, IRSIT (A. Braham and S. Ghazali), and
  - in France, ENSSIB (Ecole Nationale Supérieure des Sciences de l'Information et des Bibliothèques, M. Hassoun) and the Lumière-Lyon 2 University (J. Dichy).
  (See: Dichy, Braham, Ghazali & Hassoun, 2002)
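One way to picture an entry carrying specifiers and derivational links is sketched below. It does not reproduce the DIINAR.1 schema: the `Entry` class, the specifier keys and the link names are hypothetical, and the sample words (kataba "to write", kâtib, maktûb, kitâb / kutub "book / books") are standard textbook examples rather than items quoted from the database.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entry:
    stem: str
    category: str
    specifiers: Dict[str, bool] = field(default_factory=dict)  # grammar-lexis features
    links: Dict[str, List[str]] = field(default_factory=dict)  # derivational links


lexicon: Dict[str, Entry] = {}


def add(entry: Entry) -> None:
    lexicon[entry.stem] = entry


add(Entry("katab", "verb", {"nonhuman_object": True}, {"deverbal": ["kâtib", "maktûb"]}))
add(Entry("kâtib", "active_participle", links={"verb": ["katab"]}))
add(Entry("maktûb", "passive_participle", links={"verb": ["katab"]}))
add(Entry("kitâb", "common_noun", {"takes_article": True}, {"broken_plural": ["kutub"]}))
add(Entry("kutub", "common_noun", {"takes_article": True}, {"singular": ["kitâb"]}))


def related(stem: str, link: str) -> List[str]:
    """Follow one derivational link, keeping only stems present in the lexicon."""
    return [s for s in lexicon[stem].links.get(link, []) if s in lexicon]


print(related("katab", "deverbal"))       # ['kâtib', 'maktûb']
print(related("kitâb", "broken_plural"))  # ['kutub']
```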

Morphosyntactic specifiers can only be based on stems
- Grammar-lexis relations are not connected with patterns.
- They are not predictable on the sole basis of roots and patterns,
- and can only be associated with actual lexical entries,
- which can only be identified in a stem-based lexicon.

The root, pattern and rule-based lexicon of the Xerox analyzer
- The Xerox Arabic morphological analyzer is discussed here because it is accessible on the web.
- It is based on solid and innovative finite-state technology (Beesley, Kenneth and Karttunen, Lauri, Finite State Morphology. CSLI Publications, Stanford, California; Beesley, 2001).
- The approach relies on previous research, including Buckwalter's lexicon presently used at LDC (Maamouri & Cieri, 2002), and a contribution to Two-level Morphology (Beesley, 1989/91).

The Xerox Analyzer/Generator
- Beesley (2001) takes up the idea that Arabic words consist of at least two building blocks: the root and the prosodic template (McCarthy, 1981).
- Processes applying in generation/analysis:
  - First, "interdigitation", i.e. the "merging" of roots and patterns to form stems.
  - Second, alternation rules perform deletion, epenthesis, assimilation, gemination and metathesis.
  - Third, rules for short vowels and other diacritics are relaxed to allow for variations in the way Arabic words are written.
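The first and third steps can be illustrated on transliterated strings. This is a deliberately naive sketch, not the Xerox finite-state implementation: the template notation (a `C` for each root consonant), the function names and the example root k-t-b are assumptions, and the alternation rules of the second step are left out.

```python
SHORT_VOWELS = set("aiu")  # transliterated short vowels, usually unwritten


def interdigitate(root, pattern):
    """Merge a consonantal root into a template in which 'C' marks a root slot."""
    if pattern.count("C") != len(root):
        raise ValueError("pattern slots and root consonants do not match")
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)


def devowel(stem):
    """Approximate the vowel-free written form by dropping short vowels."""
    return "".join(ch for ch in stem if ch not in SHORT_VOWELS)


root = ("k", "t", "b")
print(interdigitate(root, "CaCaC"))   # katab
print(interdigitate(root, "CâCiC"))   # kâtib
print(interdigitate(root, "maCCûC"))  # maktûb
print(devowel("katab"))               # ktb
```

Nothing in `interdigitate` knows whether a given root and pattern actually combine in the language, which is why the association of roots with patterns has to be coded separately.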

Xerox Arabic Lexicons
Xerox has several lexicons (Beesley, 2001, p. 7):
- The first is a lexicon of roots, which contains 4,930 entries.
- The second is a dictionary of patterns, which includes about 400 entries.
- Each root-entry is manually coded and associated with patterns. The manual association of roots and patterns produces about 90,000 Arabic stems.
- When these stems combine with possible prefixes, suffixes and clitics by composition, 72 million abstract words are generated.

Stem-based Arabic Lexicons (1)
- Stem-based lexicons, compared to root-based ones, are more intuitive to build (Farghaly and Senellart, 2003), more efficient, and easier to develop and extend.
- FIRST POINT: Unlike the entries of root-&-pattern grounded databases, all the lemmas of a stem-based dictionary are actual lexical units, not abstract or virtual items.

Stem-based Arabic Lexicons (2)
- Pure root-&-pattern generation would produce a lexicon of around 2 million stems:
  - The Xerox lexicons comprise about 5,000 roots and 400 patterns: 5,000 x 400 = 2,000,000.
- Manual generation and control has produced:
  - a dictionary of around 90,000 stem-entries at Xerox, and
  - around 120,000 stems in the DIINAR.1 database.
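The arithmetic behind this comparison can be checked directly. The short sketch below uses only figures quoted in the slides (5,000 roots rounded from 4,930, about 400 patterns, about 90,000 Xerox stems); the variable names are illustrative.

```python
roots_rounded, roots_exact, patterns = 5_000, 4_930, 400
xerox_stems = 90_000

potential_rounded = roots_rounded * patterns  # every root with every pattern
potential_exact = roots_exact * patterns      # same, with the unrounded root count
attested_share = xerox_stems / potential_rounded

print(potential_rounded)        # 2000000
print(potential_exact)          # 1972000
print(f"{attested_share:.1%}")  # 4.5%
```

On these figures, fewer than one in twenty of the root-pattern combinations that could be generated corresponds to a manually validated stem.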

Stem-based Arabic Lexicons (3)
- SECOND POINT: In a lexicon (a) based on stems, and (b) with entries associated with word-form grammar-lexis specifications, rule-governed combination with prefixes, suffixes, proclitics and enclitics only generates existing Arabic word-forms.
- This is not the case of the 72 million word-forms generated from the 90,000 stems of the Xerox lexicon, which are clearly virtual or abstract units.

The Xerox Spanish Lexical Transducer and the DIINAR.1 Arabic dB
- In 1996, the Xerox Spanish Lexical Transducer contained over 46,000 baseforms.
- It analyzed and generated over 3,400,000 inflected wordforms (Beesley & Karttunen, 2003, p. xvii).
- In the DIINAR.1 lexical database, only 6.2 million existing word-forms are generated from the approximately 120,000 stem-based entries (Ouersighni, 2001).
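Dividing word-forms by entries makes the three resources comparable. The sketch below uses the figures quoted in the slides; since several of them are lower bounds or approximations ("over", "approximately"), the ratios are indicative only, and the dictionary keys are illustrative labels.

```python
# (entries, generated word-forms), as quoted in the slides.
resources = {
    "Xerox Arabic": (90_000, 72_000_000),
    "Xerox Spanish": (46_000, 3_400_000),
    "DIINAR.1": (120_000, 6_200_000),
}

ratios = {name: forms / entries for name, (entries, forms) in resources.items()}

for name, ratio in ratios.items():
    print(f"{name}: about {ratio:.0f} word-forms per entry")
```

On these figures the Xerox Arabic lexicon yields about 800 word-forms per stem, against roughly 74 for the Spanish transducer and roughly 52 for DIINAR.1.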

Stem-based Arabic Lexicons (4)
- THIRD POINT: In a stem-based morphological analyzer and/or generator, the process of generating stems from underlying roots is eliminated altogether.
- Arabic lexical dB-s based on stems associated with grammar-lexis specifications are crucial in the context of MT: entries associated with morphosyntactic specifiers are much more compatible with the requirements of MT than virtual root-&-pattern generated word-forms.