Published by Norah Robertson. Modified over 9 years ago.
1
Intuitive Coding of the Arabic Lexicon
Ali Farghaly & Jean Senellart
SYSTRAN Software Corporation
San Diego, CA & Soisy, France
2
Purpose
To report on SYSTRAN’s experience in building an Arabic monolingual dictionary as a component of SYSTRAN’s Arabic-English Machine Translation System
To describe the methodology and implementation adopted for dictionary building and morphological analysis
3
Overview
SYSTRAN’s Arabic-English MT System
SYSTRAN’s Intuitive Coding Technology
Intuitive Coding of the Arabic Lexicon
–Stem-based
–Statistical Arabic stem generation
–Internal morphology
–External morphology
4
SYSTRAN’s Arabic-English MT System
An end-to-end MT system
Development started in July 2002
Uses SYSTRAN’s NG technology
–Declarative modules
–State-of-the-art Arabic linguistic knowledge
–Transfer approach
–Hybrid approach combining statistical techniques and linguistic knowledge
5
SYSTRAN’s Intuitive Coding Technology
Customizing MT systems to improve translation quality
Building user-specific dictionaries
–by the developers
–by the user
–in collaboration
SYSTRAN’s decision: let the user do the customization
6
Intuitive Coding (Senellart et al., 2003)
Dictionary representation should be simple
Automatic processing of user information
Interactive processing
Multi-level coding algorithm
Complete integration
Easy-to-use graphical interface
7
Stem-Based Arabic Lexicon
Following the spirit of Senellart (2003), we opted for intuitive coding of the Arabic lexicon.
What are the building blocks of the Arabic dictionary?
A – roots
B – stems
8
Why Stems?
Stems are more intuitive than roots
Using stems eliminates the need for morphological patterns “ الميزان الصرفي ”
Using stems eliminates overgeneralization of Arabic stems
Subcategorization frames and syntactic and semantic information are stem-specific, not root-specific
9
Sample Entry
1016 إِنْتَصَرَ verbplain "[perfect= إنْتَصَرَ ],[imperfect= ينْتَصِرُ ],[pass per= إنْتُصِر ],[imperative= إنْتَصِر ],[passimp= ينْتَصَر ]"[+AINT+GPP+HUSUBJ]
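The entry above packs a headword, a category, a set of named forms, and a feature list into one line. The following is a minimal sketch of how such a line could be read into a structure, assuming the layout seen in this one sample: an optional numeric id, the headword, the category, a quoted list of `[name= form ]` pairs, and a `[+FEATURE...]` block. The function name `parse_entry` and the returned field names are hypothetical; nothing here describes SYSTRAN's actual dictionary loader.

```python
import re

# The sample entry from the slide, kept verbatim (including its spacing).
SAMPLE = '1016 إِنْتَصَرَ verbplain "[perfect= إنْتَصَرَ ],[imperfect= ينْتَصِرُ ],[pass per= إنْتُصِر ],[imperative= إنْتَصِر ],[passimp= ينْتَصَر ]"[+AINT+GPP+HUSUBJ]'


def parse_entry(line):
    """Split one entry line into id, headword, category, forms and features.

    Assumed layout (inferred from a single sample, not a documented format):
    [id] headword category "[name= form ],..."[+FEAT1+FEAT2...]
    """
    head, rest = line.split('"', 1)
    forms_part, features_part = rest.rsplit('"', 1)

    head_fields = head.split()
    # Assumption: a leading all-digit field is an entry id, as in "1016".
    entry_id = None
    if head_fields and head_fields[0].isdigit():
        entry_id = int(head_fields.pop(0))
    headword, category = head_fields[0], head_fields[1]

    # Each form is written as [name= value ]; strip the padding spaces.
    forms = {
        name.strip(): value.strip()
        for name, value in re.findall(r"\[([^=\]]+)=([^\]]*)\]", forms_part)
    }

    # Features are written as [+A+B+C]; drop the brackets and empty pieces.
    features = [f for f in features_part.strip().strip("[]").split("+") if f]

    return {
        "id": entry_id,
        "headword": headword,
        "category": category,
        "forms": forms,
        "features": features,
    }


entry = parse_entry(SAMPLE)
print(entry["category"], entry["features"])
```

Form names are kept exactly as they appear in the sample (for example `pass per`), since the transcript gives no basis for normalizing them.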
10
Statistical Arabic Stem Generator
Reduces the amount of typing
Speeds up entry creation
60% increase in lexicographer productivity
Uses the most productive morphological rules
11
Generator Output
[perfect= قال ],[imperfect= يَقال ],[imperative= إقال ],[passperf= قال ],[passimperf= يقال ]
[perfect= كَتَبَ ],[imperfect= يَكْتُب ],[imperative= أُكْتُب ],[passperf= كُتِبَ ],[passimperf= يُكْتَب ]
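The first output line shows what applying only the most productive pattern looks like: the forms proposed for قال follow the regular prefixing pattern, and the lexicographer then corrects them rather than typing every form from scratch. A minimal sketch of that idea follows. The rule used here (prefix ي for the imperfect forms, prefix إ for the imperative, unvocalized output) is an illustrative assumption chosen to mirror the قال line; it is not the statistical generator described in the slides, which also proposes vowels.

```python
import unicodedata

YA = "\u064a"                 # ي, prefix assumed for imperfect forms
ALIF_HAMZA_BELOW = "\u0625"   # إ, prefix assumed for the imperative


def strip_diacritics(word):
    """Remove Arabic short-vowel marks, which are Unicode combining characters."""
    return "".join(ch for ch in word if not unicodedata.combining(ch))


def guess_forms(perfect):
    """Propose candidate forms from the perfect using one naive prefixing rule.

    The output is a starting point for manual correction, not a correct
    conjugation: irregular verbs will come out wrong.
    """
    stem = strip_diacritics(perfect)
    return {
        "perfect": stem,
        "imperfect": YA + stem,
        "imperative": ALIF_HAMZA_BELOW + stem,
        "passperf": stem,
        "passimperf": YA + stem,
    }


def format_forms(forms):
    """Render candidate forms in the bracketed notation used on the slide."""
    return ",".join(f"[{name}={value}]" for name, value in forms.items())


print(format_forms(guess_forms("\u0642\u0627\u0644")))  # candidates for قال
```

A real generator would rank several competing patterns by how often they apply; this sketch hard-codes a single one to keep the candidate-then-correct workflow visible.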
12
Arabic Morphology
SYSTRAN has two different modules:
1. Internal Morphology
2. External Morphology
The two modules are separate and stand in a feeding order
13
Internal Morphology Module
Generates all the inflected forms of a given stem and adds morphological information to be used in syntactic processing
14
The Input to the Internal Morphology Module
Input: two files
1. Stem file
2. Morphological rules file
Output: inflected dictionary file
15
Sample of Output
كتبن verb plain كتب +past+fem+3P+plural
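The step from stem plus rules to inflected dictionary lines can be sketched as follows. The suffix table is a tiny hand-written subset of third-person perfect endings, and the tag strings other than `+past+fem+3P+plural` are extrapolated from the single sample line above; both are assumptions for illustration, not the contents of SYSTRAN's rules file.

```python
# (suffix, tags) pairs for a few third-person perfect forms (unvocalized).
# Only the last tag string is attested in the slide; the rest follow its pattern.
PAST_RULES = [
    ("", "+past+masc+3P+singular"),
    ("\u062a", "+past+fem+3P+singular"),        # ت
    ("\u0648\u0627", "+past+masc+3P+plural"),   # وا
    ("\u0646", "+past+fem+3P+plural"),          # ن
]


def inflect(stem, category, rules):
    """Return inflected-dictionary lines: 'form category stem tags'."""
    return [f"{stem}{suffix} {category} {stem} {tags}" for suffix, tags in rules]


# Generate the lines for the stem كتب with the category shown on the slide.
for line in inflect("\u0643\u062a\u0628", "verb plain", PAST_RULES):
    print(line)
```

The last generated line reproduces the sample output shown on the slide.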
16
Syntagmatic and Paradigmatic (Halliday 1972)
Morphology
Internal: همشاهد و نييشاهدف
External: هاسيشاهد هيشاهدون ل هن نشاهد
17
External Morphology Module
Decomposes a token into different part-of-speech units
Follows the morphosyntactic rules of the language
It is the syntax of morphemes
It has a morphophonemic component
18
Sample of External Morphology Rules
WAFA:=
KABILI:=
LI:=
{WAFA}?_{AL}_
{WAFA}?_{NOUNADJ}_
{WAFA}?_{KABILI}_{NOUNADJ}_
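Rules of this shape describe which clitic classes may precede a word, with `?` marking an optional class. The right-hand sides of the class definitions are missing from this transcript, so the sketch below guesses them from the class names: WAFA as the conjunctions و and ف, KABILI as the prepositions ك, ب and ل, and AL as the article ال. These values, the function name, and the fixed conjunction-preposition-article order are all assumptions. The code enumerates every way a token can be split under that order and leaves the choice among candidates to a later lookup.

```python
from itertools import product

# Assumed clitic classes (the slide's definitions are not recoverable).
WAFA = ("", "\u0648", "\u0641")                # optional و or ف
KABILI = ("", "\u0643", "\u0628", "\u0644")    # optional ك, ب or ل
AL = ("", "\u0627\u0644")                      # optional ال


def segmentations(token):
    """List every (conjunction, preposition, article, remainder) split.

    A split is kept only if the remainder is non-empty. No morphophonemic
    adjustment is attempted (for example, the spelling change when ل
    precedes ال is not handled).
    """
    results = []
    for conj, prep, det in product(WAFA, KABILI, AL):
        prefix = conj + prep + det
        if token.startswith(prefix) and len(token) > len(prefix):
            results.append((conj, prep, det, token[len(prefix):]))
    return results


# Candidate splits for والكتاب
for split in segmentations("\u0648\u0627\u0644\u0643\u062a\u0627\u0628"):
    print(split)
```

Enumerating candidates instead of committing to the longest prefix matters because a leading و or ب is often simply the first letter of the stem.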
19
Order of Application
External morphology has to apply before internal morphology and before lookup in the monolingual inflected dictionary
The output of the external morphology module therefore feeds the internal morphology
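The feeding order can be illustrated with a two-stage sketch: candidate clitic splits are produced first, and only the remainder of each split is looked up in the inflected dictionary. The dictionary here is a one-entry in-memory stand-in, the clitic list is limited to the conjunctions و and ف, and all function names are hypothetical.

```python
CONJUNCTIONS = ("", "\u0648", "\u0641")  # no clitic, و, ف (assumed subset)


def external(token):
    """Stage 1: propose (clitic, remainder) splits with a non-empty remainder."""
    return [
        (c, token[len(c):])
        for c in CONJUNCTIONS
        if token.startswith(c) and len(token) > len(c)
    ]


def analyze(token, inflected_dict):
    """Stage 2: keep only splits whose remainder is in the inflected dictionary."""
    return [
        (clitic, rest, inflected_dict[rest])
        for clitic, rest in external(token)
        if rest in inflected_dict
    ]


# Stand-in for the inflected dictionary file: form -> stem and tags.
INFLECTED = {"\u0643\u062a\u0628\u0646": "\u0643\u062a\u0628 +past+fem+3P+plural"}

print(analyze("\u0648\u0643\u062a\u0628\u0646", INFLECTED))  # وكتبن
```

Looking up the unsplit token would fail here, which is why the clitic split has to come first.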
20
Conclusion
SYSTRAN’s monolingual dictionary has about 30,000 entries
Coverage of newspaper text is over 90%
The approach outlined in this paper has greatly accelerated development
Analysis, homograph resolution and transfer rules are being added and implemented