Intuitive Coding of the Arabic Lexicon
Ali Farghaly & Jean Senellart
SYSTRAN Software Corporation
San Diego, CA & Soisy, France

Purpose
To report on SYSTRAN’s experience in building an Arabic monolingual dictionary as a component of SYSTRAN’s Arabic-English Machine Translation System
To describe the methodology and implementation adopted for dictionary building and morphological analysis

Overview
SYSTRAN’s Arabic-English MT System
SYSTRAN’s Intuitive Coding Technology
Intuitive Coding of the Arabic Lexicon
–Stem-based
–Statistical Arabic stem generation
–Internal morphology
–External morphology

SYSTRAN’s Arabic-English MT System
An end-to-end MT system
Development started July 2002
Using SYSTRAN’s NG technology
–Declarative modules
–State-of-the-art Arabic linguistic knowledge
–Transfer approach
–Hybrid approach combining statistical techniques and linguistic knowledge

SYSTRAN’s Intuitive Coding Technology
Customizing MT systems to improve translation quality
Building user-specific dictionaries
- by the developers
- by the user
- in collaboration
SYSTRAN’s decision: let the user do the customization

Intuitive Coding (Senellart et al., 2003)
Dictionary representation should be simple
Automatic processing of user information
Interactive processing
Multi-level coding algorithm
Complete integration
Easy-to-use graphic interface

Stem-Based Arabic Lexicon
Following the spirit of Senellart et al. (2003), we opted for intuitive coding of the Arabic lexicon.
What are the building blocks of the Arabic dictionary?
A - roots
B - stems

Why Stems?
Stems are more intuitive than roots
A stem-based lexicon eliminates the need for morphological patterns “الميزان الصرفي”
It eliminates overgeneralization of Arabic stems
Subcategorization frames and syntactic and semantic information are stem-specific, not root-specific

Sample Entry
1016 إِنْتَصَرَ verbplain "[perfect= إنْتَصَرَ ],[imperfect= ينْتَصِرُ ],[pass per= إنْتُصِر ],[imperative= إنْتَصِر ],[passimp= ينْتَصَر ]"[+AINT+GPP+HUSUBJ]
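The sample entry packs an ID, a headword, a category, a quoted list of [slot= form] pairs, and a bracketed feature string into a single line. A minimal sketch of reading such a line into a record, assuming the layout is exactly what this one sample shows. The names LexEntry and parse_entry are hypothetical, and the slot labels are kept as they appear in the entry:

```python
import re
from dataclasses import dataclass

# Layout assumed from the single sample entry on the slide:
#   <id> <headword> <category> "<[slot= form],...>"[+FEATURE+FEATURE...]
ENTRY_RE = re.compile(r'^(\d+)\s+(\S+)\s+(\S+)\s+"(.*)"\s*\[(.*)\]$')
SLOT_RE = re.compile(r"\[([^=\]]+)=\s*([^\]]*?)\s*\]")


@dataclass
class LexEntry:  # hypothetical record type, not SYSTRAN's internal format
    entry_id: int
    headword: str
    category: str
    forms: dict      # slot label -> stem form, in entry order
    features: list   # feature codes without the leading "+"


def parse_entry(line: str) -> LexEntry:
    """Split one dictionary line into its five parts."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        raise ValueError("line does not follow the assumed entry layout")
    entry_id, headword, category, slots, feats = match.groups()
    forms = {name.strip(): form for name, form in SLOT_RE.findall(slots)}
    features = [code for code in feats.split("+") if code]
    return LexEntry(int(entry_id), headword, category, forms, features)


SAMPLE = ('1016 إِنْتَصَرَ verbplain "[perfect= إنْتَصَرَ ],[imperfect= ينْتَصِرُ ],'
          '[pass per= إنْتُصِر ],[imperative= إنْتَصِر ],[passimp= ينْتَصَر ]"'
          '[+AINT+GPP+HUSUBJ]')
entry = parse_entry(SAMPLE)
print(entry.category, entry.features)
```

Keeping the forms in a dictionary keyed by slot label means a later module can ask for a specific stem (for example the imperfect) without depending on the order of slots in the line.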

Statistical Arabic Stem Generator
To reduce the amount of typing
To speed up entry creation
60% increase in lexicographer productivity
Uses the most productive morphological rules

Generator Output
[perfect= قال ],[imperfect= يَقال ],[imperative= إقال ],[passperf= قال ],[passimperf= يقال ]
[perfect= كَتَبَ ],[imperfect= يَكْتُب ],[imperative= أُكْتُب ],[passperf= كُتِبَ ],[passimperf= يُكْتَب ]
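To illustrate the idea of proposing stems from productive rules, here is a minimal sketch that fills the five slots for a three-consonant verb from one fixed vowel template. The template is inferred from the كتب line of the generator output and is an assumption, not SYSTRAN's actual rule set; the function name generate_forms is hypothetical. A real generator would presumably choose among many patterns, and its output is only a draft for the lexicographer to correct.

```python
# Arabic short-vowel marks, written as escapes to keep them unambiguous.
FATHA, DAMMA, KASRA, SUKUN = "\u064e", "\u064f", "\u0650", "\u0652"
YA = "\u064a"            # imperfect prefix letter
ALIF_HAMZA = "\u0623"    # imperative prefix letter


def generate_forms(consonants: str) -> dict:
    """Propose five stem forms for a three-consonant verb.

    Assumed template (taken from the one fully vowelled example on the slide):
    perfect CaCaCa, imperfect yaCCuC, imperative uCCuC,
    passive perfect CuCiCa, passive imperfect yuCCaC.
    """
    if len(consonants) != 3:
        raise ValueError("this sketch only handles three-consonant stems")
    c1, c2, c3 = consonants
    return {
        "perfect": c1 + FATHA + c2 + FATHA + c3 + FATHA,
        "imperfect": YA + FATHA + c1 + SUKUN + c2 + DAMMA + c3,
        "imperative": ALIF_HAMZA + DAMMA + c1 + SUKUN + c2 + DAMMA + c3,
        "passperf": c1 + DAMMA + c2 + KASRA + c3 + FATHA,
        "passimperf": YA + DAMMA + c1 + SUKUN + c2 + FATHA + c3,
    }


forms = generate_forms("\u0643\u062a\u0628")  # the consonants k-t-b
line = ",".join(f"[{slot}= {form} ]" for slot, form in forms.items())
print(line)
```

The last two lines format the proposal in the same bracketed slot notation the dictionary entries use, so a lexicographer could accept or edit it in place.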

Arabic Morphology
SYSTRAN has two different modules:
1. Internal Morphology
2. External Morphology
Two separate modules in a feeding order

Internal Morphology Module
Generates all the inflected forms of a given stem and adds morphological information to be used in syntactic processing

The Input to Internal Morphology Module
Input: two files
1. Stem files
2. Morphological rules file
Output: inflected dictionary file

Sample of output
كتبن verb plain كتب +past+fem+3P+plural
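The step from stem file plus rules file to inflected dictionary can be sketched as follows. This is a minimal illustration, assuming rules are plain (suffix, feature string) pairs applied to an unvowelled perfect stem. The rule table, the function name inflect, and the tags +masc and +sing are assumptions; only the tags in the sample line above come from the slide.

```python
# Hypothetical rule table: a few third-person perfect suffixes.
# Unvowelled forms can be ambiguous (the form with the suffix "ت" also serves
# other persons); this sketch lists a single reading per suffix.
PERFECT_SUFFIX_RULES = [
    ("", "+past+masc+3P+sing"),
    ("ت", "+past+fem+3P+sing"),
    ("وا", "+past+masc+3P+plural"),
    ("ن", "+past+fem+3P+plural"),
]


def inflect(stem: str, category: str, rules: list) -> list:
    """Return one inflected-dictionary line per rule:
    <inflected form> <category> <stem> <features>
    """
    return [f"{stem}{suffix} {category} {stem} {features}"
            for suffix, features in rules]


for entry_line in inflect("كتب", "verb plain", PERFECT_SUFFIX_RULES):
    print(entry_line)
```

Each generated line carries the stem alongside the inflected form, so a lookup on the surface form returns both the lexical entry and the features needed by syntactic processing.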

Syntagmatic and Paradigmatic (Halliday 1972) Morphology
Internal همشاهد و نييشاهدف
External هاسيشاهد هيشاهدون ل هن نشاهد

External Morphology Module
Decomposes a token into different part-of-speech units
Follows the morphosyntactic rules of the language
It is the syntax of morphemes
It has a morphophonemic component

Sample of External Morphology Rules
WAFA:=
KABILI:=
LI:=
{WAFA}?_{AL}_
{WAFA}?_{NOUNADJ}_
{WAFA}?_{KABILI}_{NOUNADJ}_
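The right-hand sides of the class definitions did not survive in this transcript, so the following is only a sketch under stated assumptions: going by the class names alone, WAFA is taken to be the conjunction proclitics و and ف, KABILI the prepositions ك, ب and ل, AL the article ال, and NOUNADJ a lookup in a noun/adjective lexicon. The function name segment and the toy lexicon are hypothetical. The sketch merges the three rules into one template of optional slots and ignores the morphophonemic component (for example, the contraction of ل with the article).

```python
# Assumed proclitic classes (guessed from the class names, not from the slide).
WAFA = ("", "و", "ف")          # optional conjunction
KABILI = ("", "ك", "ب", "ل")   # optional preposition
AL = ("", "ال")                # optional definite article


def segment(token: str, lexicon: set) -> list:
    """Return every split of token into proclitics plus a known stem.

    Each result is a tuple of morphemes in surface order. A token may have
    zero, one, or several analyses; choosing among them is left to later
    processing.
    """
    analyses = []
    for conj in WAFA:
        if not token.startswith(conj):
            continue
        after_conj = token[len(conj):]
        for prep in KABILI:
            if not after_conj.startswith(prep):
                continue
            after_prep = after_conj[len(prep):]
            for article in AL:
                if not after_prep.startswith(article):
                    continue
                stem = after_prep[len(article):]
                if stem in lexicon:
                    prefixes = tuple(p for p in (conj, prep, article) if p)
                    analyses.append(prefixes + (stem,))
    return analyses


TOY_LEXICON = {"كتاب", "ولد"}  # hypothetical two-word lexicon
print(segment("وبالكتاب", TOY_LEXICON))
print(segment("ولد", TOY_LEXICON))
```

Checking the remainder against the lexicon is what keeps a word such as ولد from being wrongly split at its initial و: the split is proposed but discarded because what is left over is not a known stem.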

Order of Application
External morphology has to apply before internal morphology and the lookup in the monolingual inflected dictionary.
Thus the output of the external morphology module feeds the internal morphology module.

Conclusion
SYSTRAN’s monolingual dictionary has about 30,000 entries
Coverage of newspaper discourse is over 90%
The approach outlined in this paper has greatly accelerated development
Analysis, homograph resolution and transfer rules are being added and implemented
