Prague Arabic Dependency Treebank MALACH Workshop in Prague August 28, 2003 Introduction & Related Projects Otakar Smrž et al.


1 Prague Arabic Dependency Treebank MALACH Workshop in Prague August 28, 2003 Introduction & Related Projects Otakar Smrž et al.

2 PADT Project at a Glance
  Dependency treebank of Modern Standard Arabic
    Morphology: 58,148 tokens
    Analytical syntax: 41,288 tokens
    Tectogrammatical description: in preparation
  Experience of the Prague Dependency Treebank
    Guidelines and annotations by Charles University
    Since 2001: ~ five annotators, ~ three researchers
  Cooperation with the Linguistic Data Consortium
    Source corpora, morphological analyzer, workshops
  August 28, 2003, Prague Arabic Dependency Treebank: Introduction & Related Projects

3 Presentation Outline
  Introductory issues in Arabic
    Morphology and the writing system
    Elementary syntactic constructs
  LDC Arabic Treebank
    Reference to ConDep conversion
  Prague Arabic Dependency Treebank
    Progress in the project, applications
  Related projects and perspectives
    Exchange of tools and ideas
    Workshops and cooperation

4 Arabic Language and Script
  Semitic language, inner flexion and concatenation, consonantal roots, weak derivation patterns
  Phonemic script, non-vocalized script, word tying, other omissions
  Readings of the non-vocalized string فهم الافراد
    fahima al-'afra~du      the members understood
    fuhima al-'afra~du      the members were understood
    fahima al-'afra~da      he understood the members
    fahmu al-'ifra~di       understanding to the isolation
    fa-hum al-'afra~du      and they are the individuals

5 Morphology Issues
  Arabic strings are extremely ambiguous
    Short vowels, consonantal geminations, glottal-stop marks etc. are normally omitted in the script
    Strings need not correspond to single words
    Morphonological changes increase the homonymy
  Tokenization of input surface strings
    Necessary prerequisite for analytical annotation
    Requires morphological disambiguation
  Lexicon update, foreign names and terms
    Use those analyzers which are flexible in this respect
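To make the ambiguity concrete, here is a minimal Python sketch that stores the five readings listed on the "Arabic Language and Script" slide and counts how many tokens each one yields. The `Reading` type, the function name, and the split of "fa-hum" into two tokens are illustrative assumptions, not part of any PADT or LDC tool.

```python
from collections import Counter
from typing import NamedTuple, Tuple


class Reading(NamedTuple):
    """One vocalized reading of an unvocalized surface string (hypothetical type)."""
    tokens: Tuple[str, ...]  # tokens in the slide's own transliteration
    gloss: str               # English gloss as given on the slide


# The five readings of the unvocalized string, copied from the slide.
# Assumption: the clitic "fa-" counts as a separate token from "hum".
READINGS = [
    Reading(("fahima", "al-'afra~du"), "the members understood"),
    Reading(("fuhima", "al-'afra~du"), "the members were understood"),
    Reading(("fahima", "al-'afra~da"), "he understood the members"),
    Reading(("fahmu", "al-'ifra~di"), "understanding to the isolation"),
    Reading(("fa-", "hum", "al-'afra~du"), "and they are the individuals"),
]


def token_count_histogram(readings):
    """Count how many readings produce each number of tokens."""
    return Counter(len(reading.tokens) for reading in readings)


if __name__ == "__main__":
    for reading in READINGS:
        print(len(reading.tokens), " ".join(reading.tokens), "=", reading.gloss)
    print(dict(token_count_histogram(READINGS)))
```

Under these assumptions, one surface string gives five readings with two different token counts, which is why tokenization cannot be settled before morphological disambiguation.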

6 Elementary Syntax Issues
  Mostly VSO in verbal sentences, but …
    … not so in clauses with non-verbal predication
    … nor if topicalizers are present
  Non-verbal predication of several types
  Verbal nature of some nominal formations
  Grammatical co-reference, accusative of the inner object
  Complex referencing, rich expressions

7 Dependency Formalism
  Example tree nodes, given as: form [analytical function] gloss
    la- [PredP] for
    -hu [Obj] him
    baytun [Sb] a-house [nom.]
    da~ma [Pred] lasted
    iqtira~Hu [Sb] proposal
    -hu [Atr] his
    al-Eamali~yata [Obj] the-operation [acc.]
    Eala~ [AuxP] on
    zumala~'i [Obj] colleagues
    -hi [Atr] his
    sa~Eatayni [Adv] two-hours [acc.]
    a~mili~na [???] hoping [acc.]
    qubu~la [Obj] accepting [acc.]
    -kum [Atr] your
    daEwata [Obj] invitation [acc.]
    -na~ [Atr] our
    al-baytu [Sb] the-house
    kabi~run [Pnom] a-big

8 Constituency vs. Dependency
  Constituency
    Non-terminal nodes + Text tokens
    Constituent labeling on non-terminals
    Slots and traces
    Linguistic Data Consortium, University of Pennsylvania
  Dependency
    Sentence root node + Text tokens
    Analytical function for every tree node
    Government and roles
    CCL & IFAL & ICL, Charles University in Prague

9 Model Arabic Phrase I
  Trace of the antecedent subject
  Compound function of the head of the clause – outer and inner perspectives
  Free word-order compliant

10 Model Arabic Phrase II
  Sister-like co-ordination
  Conjunction of co-ordination
  Status constructus

11 LDC Arabic Treebanking
  Arabic Treebank: Part 1, version 2.0 (syntax)
    160,275 words, 4,113 trees
  Arabic Treebank: Part 2, version 1.0 (morphology)
    144,199 words, 2,591 paragraphs
  Arabic Treebank: Part 1, Arabic-English Parallel
    10K-word parallel translation
  Arabic Gigaword
    Agence France Presse, Al Hayat, Al Nahar, Xinhua
    391,619,000 words, 1,256,719 documents

12 PADT Annotation Progress
  AFP Data Exchange Experiment
    Dependency annotation of LDC’s ~10k words
      12,936 nodes, 374 trees (34.6 nodes per tree)
    Additional Xerox morphological annotation
  UMMAH Corpus Annotation
    Morphology with the LDC tools, ~50k words
      45,212 nodes, 1,039 trees (43.5 nodes per tree)
    Dependency annotation, ~30k words ready
      28,352 nodes, 646 trees (43.9 nodes per tree)
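The nodes-per-tree averages quoted on this slide follow directly from the node and tree counts. A short Python check, using only the figures from the slide (the dictionary keys and function name are illustrative):

```python
# Node and tree counts as reported on the "PADT Annotation Progress" slide.
DATASETS = {
    "AFP dependency annotation": (12_936, 374),
    "UMMAH morphology": (45_212, 1_039),
    "UMMAH dependency annotation": (28_352, 646),
}


def nodes_per_tree(nodes: int, trees: int) -> float:
    """Average number of nodes per tree, rounded to one decimal place."""
    return round(nodes / trees, 1)


if __name__ == "__main__":
    for name, (nodes, trees) in DATASETS.items():
        print(f"{name}: {nodes_per_tree(nodes, trees)} nodes per tree")
```

The three results agree with the slide's 34.6, 43.5 and 43.9.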

13 Algorithm Progress
  Constituency-Dependency Transformation
    Based on the AFP Exchange Experiment
    EACL ’03 Research Note
  Arabic Dependency Parser & Analytical Function Assignment
    Incorporated into the annotation process
    Machine-learning methods involved
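The slide does not describe how the constituency-to-dependency transformation works. As a rough illustration of the general idea only, the Python sketch below uses the common head-rule technique: pick a head child for every phrase, then attach the lexical heads of the remaining children to it. The head table, the phrase labels, and the toy tree are all assumptions made for this example; they are not the LDC label set and not the procedure from the EACL ’03 research note.

```python
# Hypothetical head table: for each phrase label, the child labels that may
# serve as head, in priority order. Not the LDC label set.
HEAD_RULES = {
    "S": ["VERB", "NP"],
    "NP": ["NOUN"],
}


def head_child(label, children):
    """Pick the head child of a phrase using HEAD_RULES (first child as fallback)."""
    for wanted in HEAD_RULES.get(label, []):
        for child in children:
            if child[0] == wanted:
                return child
    return children[0]


def to_dependencies(node, edges):
    """Return the lexical head of `node`; append (dependent, head) pairs to `edges`.

    A node is either (label, word) for a leaf or (label, [children]) for a phrase.
    """
    label, rest = node
    if isinstance(rest, str):  # leaf: the word is its own head
        return rest
    head = head_child(label, rest)
    head_word = to_dependencies(head, edges)
    for child in rest:
        if child is not head:
            dependent = to_dependencies(child, edges)
            edges.append((dependent, head_word))
    return head_word


if __name__ == "__main__":
    # Toy phrase structure over three tokens from the "Dependency Formalism" slide.
    tree = ("S", [("VERB", "da~ma"),
                  ("NP", [("NOUN", "iqtira~Hu"), ("PRON", "-hu")])])
    edges = []
    root = to_dependencies(tree, edges)
    print("root:", root)
    print("edges:", edges)
```

A real conversion also has to handle traces, coordination and function labels, none of which this sketch attempts.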

14 Application Progress
  TrEd Tree Editor
    Highly powerful and reusable annotation tool
  NetGraph Tree Search
    Extra version for Arabic
    Server/Client system architecture
  Perl Modules
    AG2MorphoXML, MorphoMap, Encode::Arabic

15 TrEd Tree Editor
  Perl and Perl/Tk interactive application or batch processor
  General editor for trees and tree-like graphs
    Analytical dependency annotation
    Tectogrammatical dependency annotation
    Phrase-structure trees, MT solution forests …
  Comparison of parser/human results
  Language and platform independent

16 NetGraph Tree Search
  Java client, C server
  Interactive tree search, viewing, counting …
  Query in the form of a generalized subtree
  Server-side data search, client-side rendering
  Dependency trees, phrase-structure trees, trees
  Linguistic research, verification of hypotheses
  Quick & easy system, language and platform independent
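The idea of a query "in the form of a generalized subtree" can be sketched in a few lines of Python. This is not NetGraph's query language or implementation; the `Node` and `Query` classes are hypothetical, and the attachment of the example nodes is an assumption based on the labels shown on the "Dependency Formalism" slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A dependency tree node with a word form and an analytical function."""
    form: str
    afun: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class Query:
    """A generalized subtree: any attribute left as None matches every node."""
    afun: Optional[str] = None
    children: List["Query"] = field(default_factory=list)


def matches(node: Node, query: Query) -> bool:
    """True if `node` satisfies `query` and each query child matches some child.

    Simplification: two query children may be satisfied by the same tree child.
    """
    if query.afun is not None and node.afun != query.afun:
        return False
    return all(any(matches(child, sub) for child in node.children)
               for sub in query.children)


def search(root: Node, query: Query) -> List[Node]:
    """Collect every node in the tree (pre-order) at which the query matches."""
    found = [root] if matches(root, query) else []
    for child in root.children:
        found.extend(search(child, query))
    return found


if __name__ == "__main__":
    # Assumed attachment of four nodes from the "Dependency Formalism" slide.
    tree = Node("da~ma", "Pred", [
        Node("iqtira~Hu", "Sb", [Node("-hu", "Atr")]),
        Node("sa~Eatayni", "Adv"),
    ])
    # Query: a subject that governs at least one attribute.
    hits = search(tree, Query(afun="Sb", children=[Query(afun="Atr")]))
    print([node.form for node in hits])
```

Counting, as mentioned on the slide, is then just the length of the result list.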

17 Perl Modules
  AG2MorphoXML
    Token reconstruction from morpheme sequence
    Various readings/annotations, Prague XML
  MorphoMap
    Conversion from AraMorph multi-word POS tags to positional/bit-vector compact description
  Encode::Arabic
    Incorporation of Buckwalter and ArabTeX transliterations into the useful Encode module
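For readers unfamiliar with the Buckwalter transliteration, the Python sketch below shows the kind of one-to-one character mapping involved. It is not the Encode::Arabic API and covers only a small subset of the Buckwalter table; the function names are made up for this example.

```python
# A small subset of the Buckwalter table (ASCII symbol -> Arabic character).
# Characters are written as Unicode escapes so the mapping is unambiguous.
BUCKWALTER_TO_ARABIC = {
    "A": "\u0627",  # alif
    "b": "\u0628",  # ba
    "t": "\u062A",  # ta
    "d": "\u062F",  # dal
    "r": "\u0631",  # ra
    "f": "\u0641",  # fa
    "l": "\u0644",  # lam
    "m": "\u0645",  # mim
    "n": "\u0646",  # nun
    "h": "\u0647",  # ha
    "w": "\u0648",  # waw
    "y": "\u064A",  # ya
    "a": "\u064E",  # fatha (short a)
    "u": "\u064F",  # damma (short u)
    "i": "\u0650",  # kasra (short i)
    "~": "\u0651",  # shadda (gemination)
}
ARABIC_TO_BUCKWALTER = {arabic: ascii_ for ascii_, arabic in BUCKWALTER_TO_ARABIC.items()}


def decode_buckwalter(text: str) -> str:
    """Map Buckwalter ASCII to Arabic script; unknown characters pass through."""
    return "".join(BUCKWALTER_TO_ARABIC.get(char, char) for char in text)


def encode_buckwalter(text: str) -> str:
    """Map Arabic script to Buckwalter ASCII; unknown characters pass through."""
    return "".join(ARABIC_TO_BUCKWALTER.get(char, char) for char in text)


if __name__ == "__main__":
    arabic = decode_buckwalter("fhm AlAfrAd")  # unvocalized example string
    print(arabic)
    print(encode_buckwalter(arabic))
```

Because the mapping is one-to-one, encoding and decoding are inverses of each other for every character in the table.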

18 Related Projects Prague/Penn
  Tectogrammatical description guidelines
  Excellent PhD students joining the project
  Taggers, parsers, tree-node classifiers
  AraMorph re-implementation, spell-checkers
  Dictionaries, on-line or printed
  Projects in the Czech Republic, the USA and the Netherlands
  ACE named entity annotation
  Currently in LDC, “included” in tectogrammatics
  LDC’s CallHome & CallFriend for Arabic Dialects

19 Workshops Penn/Prague
  Philadelphia, July 2002
    Setting up, POS tool demo, intro to descriptions
    AFP data exchange experiment
  Prague, May 2003
    Reports, tutorials on applications and theories
    Morphology improvements, Arabic Gigaword
    Tool exchange and data revision plans
  Lisbon, April 2004
    Open workshop proposed for LREC ’04
    Publication of the projects, the results & consequences

