Prague Dependency Treebank 1.0 CD-ROM PRESENTATION Dec 18, 2000.


Functional Generative Description
- theoretical framework based on the findings of European structural linguistics, esp. of the classical Prague School
- methodological requirements of a formal description
- levels:
  - tectogrammatical (underlying) representations (TRs) with dependency-based syntax
  - morphemics
  - phonemics and phonetics
- TRs: see Sgall, Hajičová and Panevová 1986; formally specified by Petkevič, also in a declarative way

Dependency tree
My younger brother arrived there yesterday.
Linearized form, one-to-one relation:
((I) Appurt (younger) Rstr brother) Act arrive.Pret.Indic (Dir there) (Temp yesterday)
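The one-to-one relation between the tree and its linearized form can be illustrated with a minimal Python sketch. The `Node` class, the `linearize` function and the choice to write each functor directly next to its parenthesis (standing in for the subscripts of the slide) are assumptions made for illustration, not part of the PDT tools.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A tree node: a label, its functor, and ordered left/right dependents."""
    label: str
    functor: str = ""
    left: List["Node"] = field(default_factory=list)
    right: List["Node"] = field(default_factory=list)


def linearize(node: Node) -> str:
    """Write dependents in parentheses; the functor sits on the side facing the head."""
    parts = [f"({linearize(child)}){child.functor}" for child in node.left]
    parts.append(node.label)
    parts += [f"({child.functor} {linearize(child)})" for child in node.right]
    return " ".join(parts)


# "My younger brother arrived there yesterday."
brother = Node("brother", "Act", left=[Node("I", "Appurt"), Node("younger", "Rstr")])
root = Node("arrive.Pret.Indic", left=[brother],
            right=[Node("there", "Dir"), Node("yesterday", "Temp")])

print(linearize(root))
# ((I)Appurt (younger)Rstr brother)Act arrive.Pret.Indic (Dir there) (Temp yesterday)
```

Because every dependent is wrapped in exactly one pair of parentheses, the tree can be read back from the string, which is what makes the relation one-to-one.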

Dependency Tree
- labels: lexical meanings (abstract symbols) with indices
  - functors
    - subscripts at parentheses oriented towards head
  - grammatemes: values of morphological categories
    - Tense, Modality, Number, Definiteness, etc.
- projectivity
- valency
  - arguments (inner participants) and adjuncts (circumstantials or 'free modifications')
  - obligatory and optional with a given head
  - deletable or not

Dependency Tree
- participants (arguments) of verbs
  - Actor/Bearer (underlying subject)
  - Objective (Patient, underlying direct object)
  - Addressee (underlying indirect object)
  - Effect ('second' object: to choose so. as sth.)
  - Origin (to make sth. out of sth.)
- adjuncts
  - Locative, several Directional and Temporal modifications
  - Condition, Means, Manner, etc.

Dependency Tree: complementations dependent mainly on nouns
- inner participants
  - Material (Partitive): two baskets of sth.
  - Identity: the river Danube; the notion of operator
- free modifications
  - Possession (Appurtenance): my table; Jim's brother
  - Restrictive: rich man
  - Descriptive: the Swedes, who are a Scandinavian nation

Dependency Tree
- syntactic grammatemes
  - Loc, Dir: in, on, under, between...
  - Regard: with, without
- operational (testable) criteria
  - for distinguishing
    - arguments from adjuncts,
    - from each other
  - deletability (dialogue test)

Simplified valency frames
- read V Act Addr Obj
- change V Act Obj Orig Eff
- give V Act Addr Obj
- brother N Appurt
- man N
- glass N Material
- full A Material
Obligatory complementations are shown in blue on the slide.
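As a rough illustration, the simplified frames above can be stored as a small lookup table. The sketch below copies the frames from the slide; the names `FRAMES` and `unlisted` are hypothetical, and the obligatory/optional distinction is left out because the colour coding did not survive in the transcript.

```python
# Simplified valency frames from the slide: lemma -> (word class, functors).
FRAMES = {
    "read": ("V", ["Act", "Addr", "Obj"]),
    "change": ("V", ["Act", "Obj", "Orig", "Eff"]),
    "give": ("V", ["Act", "Addr", "Obj"]),
    "brother": ("N", ["Appurt"]),
    "man": ("N", []),
    "glass": ("N", ["Material"]),
    "full": ("A", ["Material"]),
}


def unlisted(lemma: str, functors: list) -> list:
    """Return the functors of the dependents that the frame of `lemma` does not list."""
    _, frame = FRAMES[lemma]
    return [f for f in functors if f not in frame]


# "my younger brother": Appurt is in the frame of "brother", Rstr is not.
print(unlisted("brother", ["Appurt", "Rstr"]))  # ['Rstr']
```

A functor missing from the frame is not an error here; it only signals a complementation that the simplified frame does not mention.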

Topic-focus articulation
- contextual boundness
  - main verb CB/NB (T/F)
  - dependents to the left/right
- communicative dynamism
  - left-right (mother, sisters, transitive)
  - partial ordering
- underlying word order
  - left-right
  - linear ordering
The left-to-right order of nodes, together with the index T or (prototypically) F, indicates the TFA of the sentence (of the TR).
[Figure: dependency tree with TFA indices; surviving labels: young, there, T]

Topic-focus articulation
- TFA: one of the basic aspects of underlying structures
[Figure: dependency tree with TFA indices; surviving labels: young, there, T, yesterday, F]

Complex sentence
- a subordinated (dependent) clause (i.e. its main verb) depends on a word contained in its governing clause
My brother, whom you know, arrived there yesterday.

Complex sentence
- function words (synsemantic) are viewed as function morphemes, syntactically fixed to certain lexical (autosemantic) words: prepositions and articles to nouns, conjunctions and auxiliaries to verbs
Martin came there late, since he had to accompany his sick mother.

Complex sentence
Martin arrived late to the session, since he had to accompany his sick mother.
schematically (morphemes):
Martin arrive.ed late to the session since he have.ed to accompany he.s sick mother.
dot: close connection of morphemes ('semes')

- deleted items restored
  - order of items: difference between 'underlying' and surface (morphemic) word order
  - transductive components: Panevová, Oliva, Borota
- coordination (multidimensional)
  - Jim and Mary, who have two children, went to Boston.
  - the linearized notation is adequate:
  - ((Jim Mary) Conj ((who) Act have (Pat (two) Rstr children))) Act went (Dir Boston)
- structures close to Boolean, i.e. no complex 'innate properties' specific for natural language are needed.

Prague Dependency Treebank - corpus annotation
- an intermediate level: 'analytical' representations
  - dependency trees, not always projective
  - nodes for all word tokens, even for punctuation marks
- tectogrammatical tree: coordinating conjunction as the head


Morphological Layer

ACKNOWLEDGEMENTS

ANNOTATED CORPORA
- PDT version 1.0, 2000
- Penn Treebank, release 3, 1999

TAG SETs
- Czech: ambiguous inflective language
  nový, nového, novému, novém, novým, nová, nové, novou, nových, novým, novými, …
  novější, novějšího, novějšímu, novějším, …
  nejnovější, nejnovějšího, nejnovějšímu, nejnovějším, … nejnovějších, nejnovějším, …
- English: language with poor inflection
  work, works, worked, working

TEXT SOURCES
- Lidové noviny
- Mladá Fronta Dnes
- Vesmír
- Českomoravský Profit
  ... taken from Czech National Corpus
- '88, '89 WSJ articles
- Air Travel Information System transcripts
- Brown Corpus
- Switchboard transcripts

ANNOTATION STRATEGY - Penn Treebank
- TEXT
- Ken Church's stochastic tagger, Eric Brill's transformation tagger
- corrections by annotator (GNU Emacs Lisp based package)

ANNOTATION STRATEGY - PDT
- Automatic Morphological Analyzer (AMA)
- two independent annotators; Linux, Win tools
- differences resolved by third annotator
- comparison with the current AMA; manual resolution; Win tools

INTERNAL FORMAT
- SGML coding, csts dtd
- word/tag(|tag)*
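The `word/tag(|tag)*` pattern can be read as: a word form, a slash, then one or more tags separated by `|` when the tagging is ambiguous. The following Python sketch parses such tokens; the function names and the ambiguous token `set/NN|VB` are made up for illustration and are not taken from either treebank.

```python
def parse_token(token: str):
    """Split 'word/tag|tag...' into the word form and its list of candidate tags."""
    # Split on the last slash so that tokens such as './.' still parse.
    word, _, tags = token.rpartition("/")
    return word, tags.split("|")


def parse_line(line: str):
    """Parse a whitespace-separated line of word/tag tokens."""
    return [parse_token(token) for token in line.split()]


print(parse_line("the/DT mail/NN ./."))
# [('the', ['DT']), ('mail', ['NN']), ('.', ['.'])]
print(parse_token("set/NN|VB"))  # hypothetical ambiguous token
# ('set', ['NN', 'VB'])
```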

SAMPLES
PDT:
  Pokus pokus NNIS1-----A----
  o o RR
  zázrak zázrak NNIS4-----A
  Z:
Penn Treebank:
  The/DT envelope/NN arrives/VBZ in/IN the/DT mail/NN ./.
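The Czech tags in the sample are positional: each character fills one slot and `-` marks a slot that does not apply. The sketch below names the slots and keeps only the filled ones. The slot names are an assumption based on the published PDT positional tag set; they do not appear on the slide, and the function name `decode_tag` is hypothetical.

```python
# Slot names: an assumption taken from the PDT positional tag set,
# not from the slide itself.
POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Var",
]


def decode_tag(tag: str) -> dict:
    """Map each filled position of a positional tag to its slot name."""
    # zip stops at the shorter sequence, so truncated tags are tolerated.
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}


print(decode_tag("NNIS1-----A----"))
# {'POS': 'N', 'SubPOS': 'N', 'Gender': 'I', 'Number': 'S', 'Case': '1', 'Negation': 'A'}
```

Contrast this with a Penn Treebank tag such as `NN`, which is a single atomic label with no internal slots.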

CONVERSION
- SGML coding
- word/tag
- word/lemma/tag
- scripts: pdt2wsj.pl, pdt2wsjFLT.pl

DATA SIZE

DATA SETs of MORPHOLOGICALLY ANNOTATED DATA

TOOLS
- Automatic Morphological Analyser/Generator of Czech
  - HMAnalyze.pl, HMGenerate.pl
  - Dictionary: CZE_a
  - Remote Access
- Czech Taggers
  - HMM
  - Exponential


Analytical Layer in PDT

Introduction
- Input: morphologically tagged sentences
- Graph Editor: "user-friendly" software
- Output: ATS structure
  - "surface" syntax tree structure
  - nodes labelled by the analytical functions

Two stages (chronologically)
- (A) manual "analytic" annotation (ATS)
  - training data for (B)(a)
- (B)
  - (a) semiautomatic procedure (Collins's parser)
  - (b) manual correcting of (B)(a)

Constraints and limitations
- any string has a node of its own
  - word-form, punctuation mark, etc.
  - AuxV, AuxP, AuxC, AuxX, AuxG…
- reflecting the coordination and apposition relations
  - so-called third dimension of the graph in the plain tree (X_Co, X_Ap, X_Pa, where X is one of the analytic functions, such as Sb, Obj, Adv, etc.)
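The X_Co / X_Ap / X_Pa convention packs two pieces of information into one label: the analytic function X and a marker for the "third dimension". A minimal Python sketch of unpacking such labels follows. The function name `split_afun` is hypothetical, and reading `Pa` as "parenthesis" is an assumption, since the slide only names coordination and apposition.

```python
# Suffix meanings; "Pa" = parenthesis is an assumption, not stated on the slide.
SUFFIXES = {"Co": "coordination", "Ap": "apposition", "Pa": "parenthesis"}


def split_afun(label: str):
    """Split an analytic function label into (function, relation marker or None)."""
    base, sep, suffix = label.rpartition("_")
    if sep and suffix in SUFFIXES:
        return base, SUFFIXES[suffix]
    # No recognised suffix: the whole label is the analytic function.
    return label, None


print(split_afun("Sb_Co"))  # ('Sb', 'coordination')
print(split_afun("Obj"))    # ('Obj', None)
```

Labels whose underscore is not followed by one of the three markers are returned unchanged, so other analytic functions pass through untouched.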

Constraints and limitations
- no missing nodes (on the surface) can be added
  - analytic function Ex_D is used
- relations between semi-automatic and manual procedure
  - 80% of edges are established correctly automatically

Project organization
- team consisting of 5-6 annotators
- handbook for ATS structure annotation
- 1999: sentences on ATS
- tectogrammatical annotation follows

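[Figure: analytical tree of the sentence below; surviving labels: Adv, AuxT]
První restituční zákon českého parlamentu se do sněmovních lavic může vrátit jako bumerang.
(The first restitution law of the Czech parliament may return to the benches of the Chamber like a boomerang.)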


From the Analytical towards the Tectogrammatical layer

Introduction
- ATS annotation
  - nodes: word forms, punctuation, graphical symbols
  - edges: surface relations
- TGTS annotation
  - nodes: autosemantic words, deletions
  - edges: deep layer functions

PDT 1.0 annotation process
- Input Czech sentence
- Tokenization
- Morphological tagging and lexical disambiguation
- Syntactic parsing and analytic function assignment -> ATS
- Tree structure pruning
- Attribute assignments -> TGTS

Transition procedure
- deterministic procedure operating on trees
- macro language for Graph Editor (C++ like)
- automatic changes & tools for annotators
- Requirements
  - new attributes for tectogrammatical layer
  - ATS is recoverable from TGTS
  - automatized to a maximally high degree

New attributes
- trlemma: lemma of the original node or lemma composed of joined nodes
- morphological grammatemes
  - gender, number, degree of comparison, tense,
  - aspect, iterativeness, verbal modality, deontic modality, sentence modality
- position of the node
  - functor, topic-focus articulation, syntactic grammateme,
  - type of relation (dependency, coordination, apposition),
  - phraseme, deletion, quoted word, direct speech,
  - coreference, antecedent

Tree Structure Pruning
- U toho, kdo začíná opravdu od nuly, není daňový výnos pro stát podstatný.
- For those, who start actually at zero, the tax outcome for the state is not substantial.
[Figure: tree of the sentence before and after pruning; surviving label in the pruned tree: REG]

Verbal Nodes
… podnikatelé by měli mít daně …
… entrepreneurs should have (their) taxes …
PRED verbmod=CDN deontmod=HRT

Attribute Assignments
- prepositions stored as fw attribute
- quoted words
  - clause in quotes -> DSP
  - one pair of quotes in the sentence -> DSPP
  - string in quotes -> QUOT
- gender, number, tense, degcmp, aspect
- default values

Macros for Annotators
- keyboard shortcuts (in Graph Editor)
  - structure changes
    - hide/recover nodes
    - merge nodes
  - add new nodes
  - functor assignments

Manual annotation
- structure checking
- functors
- deletions of obligatory modifications
- feedback for formulating the handbook for annotators


Tectogrammatical Layer

[Figure: tree; surviving node labels: CT T T T T F F T T]

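- Jirka se včera opil do němoty a Honza dneska.
- George himself yesterday drank to silence and Honza today. (word-for-word gloss; idiomatically: Jirka drank himself into a stupor yesterday, and Honza today.)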

Attributes of Coreferential relations
- only in MC
- grammatical coreference

attribute   values
coref       the lemma of the antecedent
corsnt      NIL: in the same sentence
            PREV1 ... PREVi: position of the sentence which includes the antecedent
antec       the functor of the antecedent

Example
Honza slíbil přijít včas.
Honza promised to come in time.
coref:Honza corsnt:NIL cornum:1 antec:ACT
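The attribute string in the example can be read as a small set of key:value pairs. A minimal Python sketch of reading it follows; the function names `parse_coref` and `antecedent_in_same_sentence` are hypothetical, and the flat `key:value` layout is simply the way the slide prints the attributes, not necessarily the treebank's storage format.

```python
def parse_coref(annotation: str) -> dict:
    """Turn 'coref:Honza corsnt:NIL ...' into a dictionary of attribute values."""
    return dict(item.split(":", 1) for item in annotation.split())


def antecedent_in_same_sentence(attrs: dict) -> bool:
    """corsnt is NIL when the antecedent is in the same sentence."""
    return attrs.get("corsnt") == "NIL"


attrs = parse_coref("coref:Honza corsnt:NIL cornum:1 antec:ACT")
print(attrs["coref"], attrs["antec"])      # Honza ACT
print(antecedent_in_same_sentence(attrs))  # True
```

For "Honza promised to come in time", the parsed values say that the unexpressed subject of "come" corefers with the lemma Honza, found in the same sentence, where it carries the functor ACT.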