Prague Arabic Dependency Treebank Center for Computational Linguistics Institute of Formal and Applied Linguistics Charles University in Prague MorphoTrees.


1 Prague Arabic Dependency Treebank
Center for Computational Linguistics, Institute of Formal and Applied Linguistics
Charles University in Prague
MorphoTrees of Arabic and Their Annotation in the TrEd Environment
Otakar Smrž, Petr Pajas

2 MorphoTrees … TrEd … ???
September 22, 2004, MorphoTrees of Arabic and Their Annotation in the TrEd Environment
MorphoTrees turn unorganized sets of complex morphological analyses into hierarchies.
- Intuitive, decision-efficient, multi-purpose, interesting
- General: not limited to the language, the system of morphology, the levels, or the implementation
TrEd is a fully programmable graphical editor for tree-like graphs and an excellent suite of tools for batch data processing (local/network).
- Analytical and tectogrammatical dependency annotation
- Viewing and converting Arabic phrase-structure trees
- Evaluating and merging parser/tagger/human results

3 MorphoTrees in TrEd
- Files with two types of trees
- Criteria & restrictions
- Automatic decisions
- Hiding modes
- Viewing options
- Short-cut keys & mouse
- Consistency checks
- Processing & update macros

4 Arabic … the Questions
Is there a syntactic difference between sawfa ′arā ′abā ′Aḥmada and sa′as′alu wālidahu? Is there a morphological difference?
- The only difference is in the use of lexical units and morphs. The grammatical categories are unchanged, and morphology and syntax should clearly show this.
How do we find syntactic units? How do we get word-forms back from the lexical units and tags?
How much does an improper morphological reading disturb the subsequent syntactic representation?
- Improper in tags, lemmas, diacritics, or in tokenization?

5 Reminder of the Terms
Grapheme / Phoneme
- The least units capable of distinguishing meanings
- ~40 letters, context-dependent forms
- 28 consonants, 6 vowels
Morph
- Composition of graphemes / phonemes
- Abstract derivational forms
Morpheme
- The least unit representing some linguistic meaning
- Function of morphs
- Projection of grammatical categories
Token
- The least syntactic unit
- Bearer of a uniform vector of grammatical categories

6 Tim Buckwalter’s Morphology
PADT MorphoTrees are generated from the information provided by the Buckwalter Arabic Morphological Analyzer.
+ Updateable stem-based lexicon, finite-state model, implementation in Perl, published under GNU GPL
– Morphs, mapping only to Quasi-Functional Morphology
  The tokenization, clustering, modeling of conditionality, …

(wabijAnibihA) [jAnib_1]
wa/CONJ + bi/PREP + jAnib/NOUN + i/CASE_DEF_GEN + hA/POSS_PRON_3FS

Positional tag  Form     Buckwalter tags    Gloss
C---------      wa       CONJ               and
P---------      bi       PREP               at
N-------2R      jAnib+i  NOUN+CASE_DEF_GEN  side of
S----3FS2-      hA       POSS_PRON_3FS      her
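
The step from the analyzer's morph sequence to the four tokens shown on this slide can be sketched in Python. This is a minimal illustration, not the PADT implementation: the function names and the grouping rule (a morph tagged CASE_... stays attached to the preceding morph) are assumptions inferred from this single example, and the real tokenization rules are certainly richer.

```python
from typing import List, Tuple

Morph = Tuple[str, str]  # (form, tag)

# Assumption made for this sketch only: a morph whose tag starts with one of
# these prefixes stays inside the preceding token instead of starting a new one.
ATTACHING_TAG_PREFIXES = ("CASE_",)


def parse_morphs(analysis: str) -> List[Morph]:
    """Split 'form/TAG + form/TAG + ...' into (form, tag) pairs."""
    morphs = []
    for part in analysis.split("+"):
        form, tag = part.strip().split("/", 1)
        morphs.append((form, tag))
    return morphs


def group_into_tokens(morphs: List[Morph]) -> List[Morph]:
    """Merge attaching morphs into the preceding token, joining forms and tags with '+'."""
    tokens: List[Morph] = []
    for form, tag in morphs:
        if tokens and tag.startswith(ATTACHING_TAG_PREFIXES):
            prev_form, prev_tag = tokens[-1]
            tokens[-1] = (prev_form + "+" + form, prev_tag + "+" + tag)
        else:
            tokens.append((form, tag))
    return tokens


# The analysis string quoted on the slide.
analysis = "wa/CONJ + bi/PREP + jAnib/NOUN + i/CASE_DEF_GEN + hA/POSS_PRON_3FS"
for form, tag in group_into_tokens(parse_morphs(analysis)):
    print(form, tag)
```

For the slide's example this yields four tokens: wa, bi, jAnib+i and hA, with the case ending folded into the noun token.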

7 Xerox Morphological Analyzer

8 MorphoTrees Hierarchy
MorphoTrees of Arabic propose these levels:
- Entity: the analyzed elements of the discourse
- Partitioning into the standard forms of the tokens
- Non-vocalized standard orthographical forms
- Lemmas/identifiers of lexical units
- Tokens: syntactic units including the form and the tag
Independence of the language / implementation
- More/different levels, inclusion of spelling variations, …
- Annotation of various tagsets, other features of tokens
Efficiency of decision-making
- Distance between analyses becomes recognized
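
The five levels listed on this slide can be pictured as a nested mapping built from a flat list of analyses. The Python sketch below is only an illustration under stated assumptions: the Token record, the build_morphotree function, and the dictionary layout are hypothetical and are not the TrEd file format. The first analysis follows the wabijAnibihA example from the Buckwalter slide; the lemma identifiers for wa, bi and hA, the non-vocalized forms, and the accusative alternative are placeholders added so that the tree has something to branch on.

```python
from typing import Dict, List, NamedTuple


class Token(NamedTuple):
    form: str   # vocalized form of the token
    plain: str  # non-vocalized orthographical form
    lemma: str  # lemma / identifier of the lexical unit
    tag: str    # morphological tag


def build_morphotree(entity: str, analyses: List[List[Token]]) -> Dict:
    """Nest analyses as entity -> partitioning -> plain form -> lemma -> [(form, tag)]."""
    partitions: Dict = {}
    for tokens in analyses:
        partition = " ".join(t.plain for t in tokens)
        forms = partitions.setdefault(partition, {})
        for t in tokens:
            leaves = forms.setdefault(t.plain, {}).setdefault(t.lemma, [])
            leaf = (t.form, t.tag)
            if leaf not in leaves:  # shared tokens appear only once
                leaves.append(leaf)
    return {entity: partitions}


# Analysis taken from the slide example (lemma ids other than jAnib_1 are placeholders).
genitive = [
    Token("wa", "w", "wa", "CONJ"),
    Token("bi", "b", "bi", "PREP"),
    Token("jAnib+i", "jAnb", "jAnib_1", "NOUN+CASE_DEF_GEN"),
    Token("hA", "hA", "hA", "POSS_PRON_3FS"),
]
# Hypothetical competing analysis, present only to show branching at the leaves.
accusative = genitive[:2] + [
    Token("jAnib+a", "jAnb", "jAnib_1", "NOUN+CASE_DEF_ACC"),
    genitive[3],
]

tree = build_morphotree("wbjAnbhA", [genitive, accusative])
print(tree["wbjAnbhA"]["w b jAnb hA"]["jAnb"])
```

Because analyses that share a partitioning, a form, or a lemma share the corresponding branch, two readings that differ only in case end up as sibling leaves, which is one way to read the slide's remark that the distance between analyses becomes recognized.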

9 MorphoTrees Annotation
Selecting the leaves that correspond to the proper reading of the tokens constituting the entity
- Quick use of keyboard and/or mouse for annotations
Restricting the tree according to the criteria/categories required by the context
- Natural control over the inheritance of restrictions
Employing automatic restrictions and annotation actions, both generic and linguistic
- Learning about the discriminative categories and “human tagging”
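
Restriction and automatic selection can be sketched as a recursive filter over a nested tree. This Python example is a hedged illustration, not TrEd macro code: the functions restrict and leaves, the dictionary layout, and the three candidate readings are all hypothetical. A restriction applied at a node is inherited by everything below it simply because the filter recurses into every subtree.

```python
from typing import Callable, List, Tuple, Union

Leaf = Tuple[str, str]            # (form, tag)
Tree = Union[dict, List[Leaf]]    # inner nodes are dicts, leaf lists sit at the bottom


def restrict(tree: Tree, keep: Callable[[Leaf], bool]) -> Tree:
    """Return a copy keeping only leaves accepted by `keep`; empty branches are dropped."""
    if isinstance(tree, list):
        return [leaf for leaf in tree if keep(leaf)]
    pruned = {}
    for key, subtree in tree.items():
        kept = restrict(subtree, keep)
        if kept:
            pruned[key] = kept
    return pruned


def leaves(tree: Tree) -> List[Leaf]:
    """Collect all leaves below a node."""
    if isinstance(tree, list):
        return list(tree)
    found: List[Leaf] = []
    for subtree in tree.values():
        found.extend(leaves(subtree))
    return found


# Hypothetical subtree for one token form: lemma -> candidate (form, tag) leaves.
subtree = {
    "jAnib_1": [
        ("jAnib+u", "NOUN+CASE_DEF_NOM"),
        ("jAnib+a", "NOUN+CASE_DEF_ACC"),
        ("jAnib+i", "NOUN+CASE_DEF_GEN"),
    ]
}

# Criterion required by the context in this toy case: only genitive readings.
restricted = restrict(subtree, lambda leaf: leaf[1].endswith("_GEN"))
remaining = leaves(restricted)
if len(remaining) == 1:
    print("automatic decision:", remaining[0])
```

When a restriction leaves exactly one leaf, the choice can be made without further input from the annotator, which is the kind of automatic decision the slide refers to.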

10 Discussion and Conclusion
MorphoTrees
- Important in morphological annotation and in evaluation
- PADT 1.0 provides 148 000 annotated tokens
Functional Morphology
- … more in Prague Arabic Dependency Treebank: Development in Data and Tools
- Even its approximation is promising and welcome
Feature-Based Tagger trained on Penn ATB 2
- 3.6% error rate in major part-of-speech (15 values)
- 10.8% in the full tagset (317 evidenced combinations)
- 0.8–0.6% error rate in tokenization of the input

