Prague Dependency Treebank 1.0 CD-ROM PRESENTATION Dec 18, 2000.

Functional Generative Description
- theoretical framework based on the findings of European structural linguistics, esp. of the classical Prague School
- methodological requirements of a formal description
- levels:
  - tectogrammatical (underlying) representations (TRs) with dependency-based syntax
  - morphemics
  - phonemics and phonetics
- TRs: see Sgall, Hajičová and Panevová 1986; formally specified by Petkevič, also in a declarative way

Dependency tree
My younger brother arrived there yesterday.
Linearized form, one-to-one relation:
((I) Appurt (younger) Rstr brother) Act arrive.Pret.Indic (Dir there) (Temp yesterday)
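The one-to-one relation between the tree and its linearized form can be sketched in a few lines of Python. This is only an illustration: the `Node` class and the `linearize` function are hypothetical names, and the spacing of the output is a choice made here. The rule it follows is the one visible in the example: a dependent to the left of its head is written as `(subtree) Functor`, a dependent to the right as `(Functor subtree)`, so the functor always sits at the parenthesis facing the head.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A node of a dependency tree (hypothetical helper class)."""
    label: str                      # lexical label, possibly with grammatemes
    functor: str = ""               # relation to the head; empty for the root
    left: List["Node"] = field(default_factory=list)   # dependents before the head
    right: List["Node"] = field(default_factory=list)  # dependents after the head


def linearize(node: Node) -> str:
    """Write the subtree rooted at `node` in bracketed, linearized form."""
    parts = []
    for dep in node.left:
        # Left dependent: functor follows the closing parenthesis.
        parts.append(f"({linearize(dep)}) {dep.functor}")
    parts.append(node.label)
    for dep in node.right:
        # Right dependent: functor follows the opening parenthesis.
        parts.append(f"({dep.functor} {linearize(dep)})")
    return " ".join(parts)


# "My younger brother arrived there yesterday."
tree = Node(
    "arrive.Pret.Indic",
    left=[Node("brother", "Act",
               left=[Node("I", "Appurt"), Node("younger", "Rstr")])],
    right=[Node("there", "Dir"), Node("yesterday", "Temp")],
)

print(linearize(tree))
```

Because every functor is attached to exactly one parenthesis pair, the tree can be read back from the string, which is what makes the relation one-to-one.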

Dependency Tree
- labels: lexical meanings (abstract symbols) with indices
  - functors
    - subscripts at parentheses, oriented towards the head
  - grammatemes: values of morphological categories
    - Tense, Modality, Number, Definiteness, etc.
- projectivity
- valency
  - arguments (inner participants) and adjuncts (circumstantials or 'free modifications')
  - obligatory and optional with a given head,
  - deletable or not

Dependency Tree
- participants (arguments) of verbs
  - Actor/Bearer (underlying subject)
  - Objective (Patient, underlying direct object)
  - Addressee (underlying indirect object)
  - Effect ('second' object: to choose so. as sth.)
  - Origin (to make sth. out of sth.)
- adjuncts
  - Locative, several Directional and Temporal modifications
  - Condition, Means, Manner, etc.

Dependency Tree
Complementations dependent mainly on nouns
- inner participants
  - Material (Partitive): two baskets of sth.
  - Identity: the river Danube; the notion of operator
- free modifications
  - Possession (Appurtenance): my table; Jim's brother
  - Restrictive: rich man
  - Descriptive: the Swedes, who are a Scandinavian nation

Dependency Tree
- syntactic grammatemes
  - Loc, Dir: in, on, under, between...
  - Regard: with, without
- operational (testable) criteria
  - for distinguishing
    - arguments from adjuncts,
    - from each other
  - deletability (dialogue test)

Simplified valency frames
  read     V  Act Addr Obj
  change   V  Act Obj Orig Eff
  give     V  Act Addr Obj
  brother  N  Appurt
  man      N
  glass    N  Material
  full     A  Material
(on the slide, obligatory complementations are shown in blue)
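As a rough illustration, the frames on this slide can be held in a small lookup table. The sketch below is an assumption about how one might store them, not the project's actual lexicon format; `VALENCY_FRAMES` and `split_by_frame` are hypothetical names. The slide marks obligatory complementations by colour, which this text does not preserve, so the sketch deliberately does not encode obligatoriness.

```python
# Frames copied from the slide: lemma -> (part of speech, functors in the frame).
VALENCY_FRAMES = {
    "read": ("V", ("Act", "Addr", "Obj")),
    "change": ("V", ("Act", "Obj", "Orig", "Eff")),
    "give": ("V", ("Act", "Addr", "Obj")),
    "brother": ("N", ("Appurt",)),
    "man": ("N", ()),
    "glass": ("N", ("Material",)),
    "full": ("A", ("Material",)),
}


def split_by_frame(lemma, functors):
    """Split observed functors into those listed in the frame and the rest.

    Functors outside the frame are not errors: free modifications
    (adjuncts) can in principle attach to any head.
    """
    _, slots = VALENCY_FRAMES[lemma]
    in_frame = [f for f in functors if f in slots]
    outside = [f for f in functors if f not in slots]
    return in_frame, outside


# "give" observed with an Actor, an Objective and a Temporal modification.
print(split_by_frame("give", ["Act", "Obj", "Temp"]))
```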

Topic-focus articulation
- contextual boundness
  - main verb CB/NB (T/F)
  - dependents to the left/right
- communicative dynamism
  - left-right (mother, sisters, transitive)
  - partial ordering
- underlying word order
  - left-right
  - linear ordering
The left-to-right order of nodes, together with the index T or (prototypically) F, indicates the TFA of the sentence (of the TR).
[Figure: dependency tree; surviving labels: young, there, T]

Topic-focus articulation
- TFA: one of the basic aspects of underlying structures
[Figure: dependency tree; surviving labels: young, there, T, yesterday, F]

Complex sentence
- a subordinated (dependent) clause (i.e. its main verb) depends on a word contained in its governing clause
My brother, whom you know, arrived there yesterday.

Complex sentence
- function words (synsemantic) are viewed as function morphemes, syntactically fixed to certain lexical (autosemantic) words: prepositions and articles to nouns, conjunctions and auxiliaries to verbs
Martin came there late, since he had to accompany his sick mother.

Complex sentence
Martin arrived late to the session, since he had to accompany his sick mother.
schematically (morphemes):
Martin arrive.ed late to the session since he have.ed to accompany he.s sick mother.
dot: close connection of morphemes ('semes')

- deleted items restored
  - order of items: difference between 'underlying' and surface (morphemic) word order
  - transductive components: Panevová, Oliva, Borota
- coordination (multidimensional)
  - Jim and Mary, who have two children, went to Boston.
  - the linearized notation is adequate:
  - ((Jim Mary) Conj ((who) Act have (Pat (two) Rstr children))) Act went (Dir Boston)
- structures close to Boolean, i.e. no complex 'innate properties' specific for natural language are needed.

Prague Dependency Treebank - corpus annotation
- an intermediate level: 'analytical' representations
  - dependency trees, not always projective
  - nodes for all word tokens, even for punctuation marks
- tectogrammatical tree: coordinating conjunction as the head
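Since the analytical trees are "not always projective", a projectivity check is a natural sanity tool. The sketch below uses the usual definition (an edge is projective if every token lying between the head and the dependent is dominated by the head). The encoding of a tree as a list of head indices, with -1 for the root, is an assumption made for this example, and the input is assumed to be a well-formed tree without cycles.

```python
def is_projective(heads):
    """Return True if the dependency tree has no non-projective edge.

    heads[i] is the index of the head of token i, or -1 for the root.
    """
    def dominated_by(node, ancestor):
        # Walk up from `node`; succeed if we reach `ancestor`.
        while node != -1:
            if node == ancestor:
                return True
            node = heads[node]
        return False

    for dep, head in enumerate(heads):
        if head == -1:
            continue
        lo, hi = sorted((dep, head))
        # Every token strictly between head and dependent must hang under the head.
        for between in range(lo + 1, hi):
            if not dominated_by(between, head):
                return False
    return True


print(is_projective([1, -1, 1]))      # three tokens, middle one is the root
print(is_projective([2, 3, -1, 2]))   # token 1 depends on token 3 across the root
```

In the second example the edge from token 3 to token 1 spans the root (token 2), which token 3 does not dominate, so the tree is non-projective.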

Morphological Layer

ACKNOWLEDGEMENTS

ANNOTATED CORPORA
- PDT version 1.0, 2000 (1996 - 2000)
- Penn Treebank, release 3, 1999 (1989 - 1999)

TAG SETs
- Czech: ambiguous inflective language
  nový, nového, novému, novém, novým, nová, nové, novou, nových, novým, novými, …
  novější, novějšího, novějšímu, novějším, …
  nejnovější, nejnovějšího, nejnovějšímu, nejnovějším, … nejnovějších, nejnovějším, …
- English: language with poor inflection
  work, works, worked, working

TEXT SOURCES
- PDT (texts taken from the Czech National Corpus)
  - Lidové noviny
  - Mladá Fronta Dnes
  - Vesmír
  - Českomoravský Profit
- Penn Treebank
  - '88, '89 WSJ articles
  - Air Travel Information System transcripts
  - Brown Corpus
  - Switchboard transcripts

ANNOTATION STRATEGY - Penn Treebank
1. TEXT
2. Ken Church's stochastic tagger, Eric Brill's transformation tagger
3. corrections by annotator (GNU Emacs Lisp based package)

ANNOTATION STRATEGY - PDT
1. Automatic Morphological Analyzer (AMA)
2. two independent annotators; Linux, Win tools
3. differences resolved by third annotator
4. comparison with the current AMA; manual resolution; Win tools

INTERNAL FORMAT
- SGML coding, csts DTD
- word/tag(|tag)*
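The `word/tag(|tag)*` pattern is simple enough to read with a few lines of Python. The sketch below is an illustration only; `parse_tagged` is a hypothetical name, and the ambiguous token `round/JJ|NN` is an invented example of the `|` alternative, not taken from the corpus. The word is split off at the last slash so that a token such as `./.` is handled correctly.

```python
def parse_tagged(line):
    """Parse a line of word/tag(|tag)* tokens into (word, [tags]) pairs."""
    tokens = []
    for item in line.split():
        # Split at the LAST slash: the word itself may be "/" or contain one.
        word, _, tags = item.rpartition("/")
        tokens.append((word, tags.split("|")))
    return tokens


print(parse_tagged("The/DT envelope/NN arrives/VBZ in/IN the/DT mail/NN ./."))
print(parse_tagged("round/JJ|NN"))  # hypothetical token with two candidate tags
```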

SAMPLES
  Pokus   pokus   NNIS1-----A----
  o       o       RR--4----------
  zázrak  zázrak  NNIS4-----A----
  .       .       Z:-------------
  (Pokus o zázrak. = An attempt at a miracle.)

  The/DT envelope/NN arrives/VBZ in/IN the/DT mail/NN ./.
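The Czech tags in the sample are positional: each of the 15 characters encodes one morphological category, and `-` means "not applicable". A minimal decoder might look as follows. The position names are taken from the PDT tag set documentation as commonly cited, not from this slide, so treat them as an assumption; `POSITIONS` and `decode_tag` are hypothetical names.

```python
# Assumed names of the 15 tag positions (not stated on the slide).
POSITIONS = (
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Var",
)


def decode_tag(tag):
    """Map a 15-character positional tag to {category: value}, skipping '-'."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} characters, got {len(tag)}")
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}


for form, tag in [("Pokus", "NNIS1-----A----"),
                  ("o", "RR--4----------"),
                  ("zázrak", "NNIS4-----A----")]:
    print(form, decode_tag(tag))
```

Read this way, the two noun tags in the sample differ only in the Case position (1 versus 4), which is the kind of distinction the single-symbol English tags do not need to make.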

CONVERSION
- SGML coding
- word/tag
- word/lemma/tag

DATA SIZE

DATA SETs of MORPHOLOGICALLY ANNOTATED DATA

TOOLS
- Automatic Morphological Analyser/Generator of Czech
  - Dictionary: CZE_a
  - Remote Access
- Czech Taggers
  - HMM
  - Exponential

Analytical Layer in PDT

Introduction
- Input: morphologically tagged sentences
- Graph Editor: "user-friendly" software
- Output: ATS structure
  - "surface" syntax tree structure
  - nodes labelled by the analytical functions

Two stages (chronologically)
- (A) manual "analytic" annotation (ATS)
  - training data for (B)(a)
- (B)
  - (a) semiautomatic procedure (Collins's parser)
  - (b) manual correction of (B)(a)

Constraints and limitations
- any string has a node of its own
  - word-form, punctuation mark, etc.
  - AuxV, AuxP, AuxC, AuxX, AuxG…
- reflecting the coordination and apposition relations
  - the so-called third dimension of the graph in the plain tree (X_Co, X_Ap, X_Pa, where X is one of the analytic functions, such as Sb, Obj, Adv, etc.)
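The `X_Co`, `X_Ap`, `X_Pa` convention folds the "third dimension" into the label itself, so tools usually need to separate the base function from the suffix. A minimal sketch, with hypothetical names `SUFFIXES` and `split_afun`; the reading of `Pa` as "parenthesis" is an assumption, since the slide only names coordination and apposition explicitly.

```python
# Suffix -> relation it marks. "parenthesis" for Pa is an assumption.
SUFFIXES = {"Co": "coordination", "Ap": "apposition", "Pa": "parenthesis"}


def split_afun(label):
    """Split an analytical function label into (base function, suffix or None)."""
    base, sep, suffix = label.rpartition("_")
    if sep and suffix in SUFFIXES:
        return base, suffix
    # No recognised suffix: the whole label is the analytical function.
    return label, None


for label in ["Sb_Co", "Obj_Ap", "Adv", "Ex_D"]:
    print(label, split_afun(label))
```

Checking the suffix against a fixed set keeps labels that merely contain an underscore from being split.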

Constraints and limitations
- no missing nodes (on the surface) can be added
  - analytic function Ex_D is used
- relations between semi-automatic and manual procedure
  - 80% of edges are established correctly by the automatic procedure
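A figure such as "80% of edges correct" is typically obtained by comparing, token by token, the head chosen by the parser with the head in the manually corrected tree. The sketch below shows that computation under the assumption that each tree is given as a list of head indices; `edge_accuracy` is a hypothetical name and the five-token data is invented for illustration.

```python
def edge_accuracy(predicted_heads, gold_heads):
    """Fraction of tokens whose predicted head equals the manually assigned head."""
    if len(predicted_heads) != len(gold_heads):
        raise ValueError("both trees must cover the same tokens")
    if not gold_heads:
        raise ValueError("cannot score an empty sentence")
    correct = sum(p == g for p, g in zip(predicted_heads, gold_heads))
    return correct / len(gold_heads)


# Invented example: the parser gets 4 of 5 attachments right.
parser_output = [2, 0, 2, 3, 3]
after_manual_correction = [2, 0, 2, 3, 4]
print(edge_accuracy(parser_output, after_manual_correction))
```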

Project organization
- team consisting of 5-6 annotators
- handbook for ATS structure annotation
- 1999: 100000 sentences on ATS
- tectogrammatical annotation follows

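První restituční zákon českého parlamentu se do sněmovních lavic může vrátit jako bumerang.
(The first restitution law of the Czech parliament may return to the parliamentary benches like a boomerang.)
[Figure: analytical tree of the sentence; surviving labels: Adv, AuxT]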

From the Analytical towards the Tectogrammatical layer

Introduction
- ATS annotation
  - nodes: word forms, punctuation, graphical symbols
  - edges: surface relations
- TGTS annotation
  - nodes: autosemantic words, deletions
  - edges: deep layer functions

Annotation process
1. Input Czech sentence
2. Tokenization
3. Morphological tagging and lexical disambiguation
4. Syntactic parsing and analytic function assignment -> ATS (PDT 1.0)
5. Tree structure pruning
6. Attribute assignments -> TGTS

Transition procedure
- deterministic procedure operating on trees
- macro language for Graph Editor (C++ like)
- automatic changes & tools for annotators
- Requirements
  - new attributes for tectogrammatical layer
  - ATS is recoverable from TGTS
  - automatized to a maximally high degree

New attributes
- trlemma: lemma of the original node or lemma composed of joined nodes
- morphological grammatemes
  - gender, number, degree of comparison, tense,
  - aspect, iterativeness, verbal modality, deontic modality, sentence modality
- position of the node
  - functor, topic-focus articulation, syntactic grammateme,
  - type of relation (dependency, coordination, apposition),
  - phraseme, deletion, quoted word, direct speech,
  - coreference, antecedent

Tree Structure Pruning
- U toho, kdo začíná opravdu od nuly, není daňový výnos pro stát podstatný.
- For those who actually start from zero, the tax yield for the state is not substantial.
[Figure: tree before and after pruning; surviving label: REG]

Verbal Nodes
… podnikatelé by měli mít daně …
… entrepreneurs should have (their) taxes …
PRED verbmod=CDN deontmod=HRT

Attribute Assignments
- prepositions stored as fw attribute
- quoted words
  - clause in quotes -> DSP
  - one pair of quotes in the sentence -> DSPP
  - string in quotes -> QUOT
- gender, number, tense, degcmp, aspect
- default values
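The first item, storing prepositions in the `fw` attribute, can be sketched as a small tree transformation: a node labelled `AuxP` is dropped, its dependents are re-attached to the nearest remaining ancestor, and the preposition's form is recorded on them. This is a simplified illustration, not the project's macro code; the dictionary layout, the function name `prune_prepositions`, and the analytical functions assigned to the fragment "výnos pro stát" (yield for the state) are assumptions.

```python
def prune_prepositions(nodes):
    """Drop AuxP nodes and store their forms in the `fw` attribute of dependents.

    Each node is a dict with keys id, form, afun, parent (None for the root).
    """
    by_id = {n["id"]: n for n in nodes}
    pruned = []
    for n in nodes:
        if n["afun"] == "AuxP":
            continue  # the preposition loses its own node
        parent = n["parent"]
        prepositions = []
        # Climb over any prepositions between this node and its real governor.
        while parent is not None and by_id[parent]["afun"] == "AuxP":
            prepositions.append(by_id[parent]["form"])
            parent = by_id[parent]["parent"]
        pruned.append({**n, "parent": parent,
                       "fw": " ".join(reversed(prepositions))})
    return pruned


# Fragment "výnos pro stát"; afun values are illustrative assumptions.
nodes = [
    {"id": 1, "form": "výnos", "afun": "Sb", "parent": None},
    {"id": 2, "form": "pro", "afun": "AuxP", "parent": 1},
    {"id": 3, "form": "stát", "afun": "Atr", "parent": 2},
]

for node in prune_prepositions(nodes):
    print(node)
```

Keeping the preposition in `fw` rather than discarding it is what allows the analytical tree to be recovered from the pruned one.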

Macros for Annotators
- keyboard shortcuts (in Graph Editor)
  - structure changes
    - hide/recover nodes
    - merge nodes
  - add new nodes
  - functor assignments

Manual annotation
- structure checking
- functors
- deletions of obligatory modifications
- feedback for formulating the handbook for annotators

Tectogrammatical Layer

[Figure: tectogrammatical tree with topic-focus labels; surviving labels: CT T T T T F F T T]

- Jirka se včera opil do němoty a Honza dneska.
- word for word: George himself yesterday drank to silence and Honza today.
- i.e.: Jirka got dead drunk yesterday, and Honza today.

Attributes of coreferential relations
- only in MC
- attribute: values
  - coref: the lemma of the antecedent
  - corsnt: NIL (in the same sentence); PREV1 ... PREVi (position of the sentence which includes the antecedent)
- grammatical coreference
  - antec: the functor of the antecedent

Example
Honza slíbil přijít včas.
Honza promised to come in time.
coref:Honza corsnt:NIL cornum:1 antec:ACT
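The attribute line in this example has a regular `name:value` shape, so it can be read into a dictionary with very little code. The sketch below is an illustration only; `parse_coref` and `antecedent_in_same_sentence` are hypothetical names, and the flat `name:value` string is the notation of the slide, not necessarily the storage format of the treebank.

```python
def parse_coref(annotation):
    """Turn 'coref:Honza corsnt:NIL ...' into a dict of attribute values."""
    attrs = {}
    for pair in annotation.split():
        name, _, value = pair.partition(":")
        attrs[name] = value
    return attrs


def antecedent_in_same_sentence(attrs):
    """corsnt is NIL when the antecedent is in the same sentence."""
    return attrs.get("corsnt") == "NIL"


attrs = parse_coref("coref:Honza corsnt:NIL cornum:1 antec:ACT")
print(attrs)
print(antecedent_in_same_sentence(attrs))
```

For the example sentence, the unexpressed subject of "přijít" (to come) points back to "Honza", which is the Actor (ACT) of "slíbil" (promised) in the same sentence.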

