Prague Dependency Treebank 1.0 Functional Generative Description.


1 Prague Dependency Treebank 1.0 Functional Generative Description

2 Theoretical framework: based on the findings of European structural linguistics, esp. of the classical Prague School; methodological requirements of a formal description. Levels: tectogrammatical (underlying) representations (TRs) with dependency-based syntax; morphemics; phonemics and phonetics. TRs: see Sgall, Hajičová and Panevová 1986; formally specified by Petkevič, also in a declarative way.

3 The Language Layers: Phonemic, Morphonological, Morphemic, Analytical (surface syntax), Tectogrammatical (deep syntax).

4 Dependency tree. Example: My younger brother arrived there yesterday. Linearized form, in one-to-one relation to the tree: ((I) Appurt (younger) Rstr brother) Act arrive.Pret.Indic ( Dir there) ( Temp yesterday)
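The linearized form on this slide can be produced mechanically from a tree. The following Python lines are a minimal sketch, not the authors' tooling: the `Node` class and the functions `linearize` and `bracket` are hypothetical names, and the spacing rules are simply chosen to reproduce the string shown on the slide. The functor is written at the parenthesis that faces the head: after the closing parenthesis for a left dependent, after the opening parenthesis for a right dependent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str                      # lexical label, possibly with grammatemes
    functor: str = ""               # relation to the head; empty for the root
    left: List["Node"] = field(default_factory=list)   # dependents before the head
    right: List["Node"] = field(default_factory=list)  # dependents after the head


def bracket(dep: Node, on_left: bool) -> str:
    """Wrap a dependent subtree, placing the functor on the side facing the head."""
    inner = linearize(dep)
    if on_left:
        return f"({inner}) {dep.functor}"
    return f"( {dep.functor} {inner})"


def linearize(node: Node) -> str:
    """Return the bracketed, left-to-right notation for the subtree under node."""
    parts = [bracket(d, on_left=True) for d in node.left]
    parts.append(node.label)
    parts += [bracket(d, on_left=False) for d in node.right]
    return " ".join(parts)


# The example sentence from the slide: "My younger brother arrived there yesterday."
brother = Node("brother", "Act", left=[Node("I", "Appurt"), Node("younger", "Rstr")])
sentence = Node(
    "arrive.Pret.Indic",
    left=[brother],
    right=[Node("there", "Dir"), Node("yesterday", "Temp")],
)

print(linearize(sentence))
```

Because each dependent is bracketed together with its functor, the string can be read back into the same tree, which is what the slide means by a one-to-one relation.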

5 Dependency Tree. Labels: lexical meanings (abstract symbols) with indices: functors (subscripts at parentheses, oriented towards the head); grammatemes (values of morphological categories: Tense, Modality, Number, Definiteness, etc.). Projectivity. Valency: arguments (inner participants) and adjuncts (circumstantials or 'free modifications'); obligatory or optional with a given head; deletable or not.

6 Dependency Tree. Arguments/participants of verbs: Actor/Bearer (underlying subject); Objective (Patient, underlying direct object); Addressee (underlying indirect object); Effect ('second' object: to choose so. as sth.); Origin (to make sth. out of sth.). Adjuncts: Locative, several Directional and Temporal modifications; Condition, Means, Manner, etc.

7 Dependency Tree. Complementations dependent mainly on nouns. Arguments (inner participants): Material (Partitive): two baskets of sth.; Identity: the river Danube, the notion of operator. Adjuncts (free modifications): Possession (Appurtenance): my table, Jim's brother; Restrictive: rich man; Descriptive: the Swedes, who are a Scandinavian nation.

8 Dependency Tree. Syntactic grammatemes: Loc, Dir - in, on, under, between...; Regard - with, without. Operational (testable) criteria: for distinguishing arguments from adjuncts, and from each other; deletability (dialogue test).

9 Simplified valency frames: read V Act Addr Obj; change V Act Obj Orig Eff; give V Act Addr Obj; brother N Appurt; man N; glass N Material; full A Material. (Obligatory complementations are shown in blue on the original slide.)
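The frames on this slide map naturally onto a small lexicon. The Python lines below are only an illustrative sketch: `Frame`, `FRAMES` and `allows` are hypothetical names, and since the colour coding of obligatory complementations did not survive in the transcript, the sketch records only which functors a frame contains, not which of them are obligatory.

```python
from typing import Dict, NamedTuple, Tuple


class Frame(NamedTuple):
    pos: str                  # part of speech of the head: V, N or A
    slots: Tuple[str, ...]    # functors listed in the simplified frame


# Frames copied from the slide; obligatoriness is deliberately not encoded.
FRAMES: Dict[str, Frame] = {
    "read": Frame("V", ("Act", "Addr", "Obj")),
    "change": Frame("V", ("Act", "Obj", "Orig", "Eff")),
    "give": Frame("V", ("Act", "Addr", "Obj")),
    "brother": Frame("N", ("Appurt",)),
    "man": Frame("N", ()),
    "glass": Frame("N", ("Material",)),
    "full": Frame("A", ("Material",)),
}


def allows(lemma: str, functor: str) -> bool:
    """True if the frame stored for lemma lists the given functor."""
    frame = FRAMES.get(lemma)
    return frame is not None and functor in frame.slots


print(allows("change", "Orig"))   # change V Act Obj Orig Eff
print(allows("man", "Appurt"))    # man N has an empty frame
```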

10 Topic-focus articulation. Contextual boundness: main verb CB/NB (T/F); dependents to the left/right. Communicative dynamism: left-right (mother, sisters, transitive); partial ordering. Underlying word order: left-right; linear ordering. The left-to-right order of nodes, together with the index T or (prototypically) F, indicates the TFA of the sentence (of the TR). Diagram labels: young there T

11 Topic-focus articulation. TFA - one of the basic aspects of underlying structures. Diagram labels: young there T yesterday F

12 Complex sentence: a subordinate (dependent) clause (i.e. its main verb) depends on a word contained in its governing clause. Example: My brother, whom you know, arrived there yesterday.

13 Complex sentence: function words (synsemantic) are viewed as function morphemes, syntactically fixed to certain lexical (autosemantic) words: prepositions and articles to nouns, conjunctions and auxiliaries to verbs. Example: Martin came there late, since he had to accompany his sick mother.

14 Complex sentence: Martin arrived late to the session, since he had to accompany his sick mother. Schematically (morphemes): Martin arrive.ed late to the session since he have.ed to accompany he.s sick mother. The dot marks a close connection of morphemes ('semes').

15 Deleted items are restored. Order of items: difference between 'underlying' and surface (morphemic) word order. Transductive components: Panevová, Oliva, Borota. Coordination (multidimensional): Jim and Mary, who have two children, went to Boston. The linearized notation is adequate: ((Jim Mary) Conj ((who) Act have ( Pat (two) Rstr children))) Act went ( Dir Boston). The structures are close to Boolean, i.e. no complex 'innate properties' specific to natural language are needed.

16 Prague Dependency Treebank - corpus annotation. An intermediate level, 'analytical' representations: dependency trees, not always projective; nodes for all word tokens, even for punctuation marks. Tectogrammatical tree: coordinating conjunction as the head.

17 Prague Dependency Treebank 1.0 Morphological Layer

18 ANNOTATED CORPORA. PDT: version 1.0, 2000 (1996 - 2000); currently ver. 2. Penn Treebank: release 3, 1999 (1989 - 1999); currently PropBank.

19 The Levels in PDT: Morphemic, Analytical, Tectogrammatical

20 TAG SETs. Czech - ambiguous inflective language: nový, nového, novému, novém, novým, nová, nové, novou, nových, novým, novými, … novější, novějšího, novějšímu, novějším, … nejnovější, nejnovějšího, nejnovějšímu, nejnovějším, … nejnovějších, nejnovějším, … English - language with poor inflection: work, works, worked, working

21

22 TEXT SOURCES. PDT: Lidové noviny, Mladá Fronta Dnes, Vesmír, Českomoravský Profit... (taken from the Czech National Corpus). Penn Treebank: '88, '89 WSJ articles; Air Travel Information System transcripts; Brown Corpus; Switchboard transcripts.

23 ANNOTATION STRATEGY - Penn Treebank: text; Ken Church's stochastic tagger, Eric Brill's transformation tagger; corrections by an annotator (GNU Emacs Lisp based package).

24 ANNOTATION STRATEGY - PDT: Automatic Morphological Analyzer (AMA); two independent annotators (Linux, Win tools); differences resolved by a third annotator; comparison with the current AMA; manual resolution (Win tools).

25 INTERNAL FORMAT. PDT: SGML coding, csts DTD. Penn Treebank: word/tag(|tag)*
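The word/tag(|tag)* notation reads as: each token is a word, a slash, and one or more tags separated by vertical bars. A minimal Python sketch of a reader for such lines follows; `parse_tagged` is a hypothetical helper name, and whitespace-separated tokens are an assumption based on the sample sentence in this deck.

```python
from typing import List, Tuple


def parse_tagged(line: str) -> List[Tuple[str, List[str]]]:
    """Split a line of word/tag(|tag)* tokens into (word, [tags]) pairs."""
    tokens = []
    for item in line.split():
        # Split on the LAST slash so that a word containing "/" stays intact.
        word, sep, tags = item.rpartition("/")
        if not sep or not word:
            raise ValueError(f"not in word/tag form: {item!r}")
        tokens.append((word, tags.split("|")))
    return tokens


sample = "The/DT envelope/NN arrives/VBZ in/IN the/DT mail/NN ./."
for word, tags in parse_tagged(sample):
    print(word, tags)
```

A token with more than one tag, such as a hypothetical `x/JJ|NN`, yields a list with both alternatives, which is how unresolved ambiguity can be carried through.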

26 SAMPLES
PDT (word form, lemma, tag):
Pokus pokus NNIS1-----A----
o o RR--4----------
zázrak zázrak NNIS4-----A----
. . Z:-------------
Penn Treebank (word/tag):
The/DT envelope/NN arrives/VBZ in/IN the/DT mail/NN ./.
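The Czech tags in these samples are positional: every tag has 15 characters, and "-" marks a category that does not apply. The slide does not name the positions, so the names in the Python sketch below are an assumption taken from the commonly documented PDT positional tagset and should be checked against the PDT documentation; `POSITIONS` and `decode` are hypothetical names.

```python
from typing import Dict

# Assumed position names (not given on the slide); one name per tag character.
POSITIONS = (
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Var",
)


def decode(tag: str) -> Dict[str, str]:
    """Map a 15-character positional tag to its non-empty categories."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} characters, got {len(tag)}")
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}


print(decode("NNIS1-----A----"))   # tag of "Pokus" in the sample
print(decode("RR--4----------"))   # tag of the preposition "o"
```

Under these assumed names, the tags of "Pokus" and "zázrak" differ only in the Case position (1 versus 4).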

27 CONVERSION. From SGML coding to word/tag or word/lemma/tag; scripts: pdt2wsj.pl, pdt2wsjFLT.pl

28 DATA SIZE

29 DATA SETs of MORPHOLOGICALLY ANNOTATED DATA

30 TOOLS. Automatic Morphological Analyser/Generator of Czech: HMAnalyze.pl, HMGenerate.pl; Dictionary: CZE_a; Remote Access. Czech Taggers: HMM; Exponential.

31 Prague Dependency Treebank 1.0 Analytical Layer in PDT

32 Introduction. Input: morphologically tagged sentences. Graph Editor: "user-friendly" software. Output: ATS structure: "surface" syntax tree structure; nodes labelled by the analytical functions.

33 Analytical Functions: Pred - Predicate, if it depends on the tree root; Sb - Subject; Obj - Object; Adv - Adverbial; Atv - Complement; AtvV - Complement, if one governor is present; Atr - Attribute; Pnom - Nominal part of a nominal predicate, depends on the copula "to be"; AuxV - Auxiliary verb "to be"; Coord - Coordination node; Apos - Apposition node; AuxR - Reflexive particle which is neither Obj nor AuxT (passive); AuxT - Reflexive particle, lexically bound to the verb.

34 Analytical Functions: AuxP - Preposition or a part of a compound preposition; AuxC - Subordinate conjunction; AuxO - (Superfluously) referring particle or emotional particle; AuxZ - Rhematizer or another node acting on another constituent; AuxX - Comma, but not the main coordinating comma; AuxG - Other graphical symbols not classified as AuxK; AuxY - Other words, such as particles without a specific syntactic function, parts of lexical idioms, etc.; AuxS - Sentence holder (the only added root to the tree); AuxK - Punctuation at the end of the sentence or of a direct speech or citation clause; ExD - Ellipsis handling: function for nodes which "pseudo-depend" on a node on which they would not depend if there were no ellipsis. Combined functions: AtrAtr, AtrAdv, AdvAtr, AtrObj, ObjAtr. Suffixes: *_Co, *_Pa, *_Ap.

35 Two stages (chronologically): (A) manual "analytic" annotation (ATS), which provides training data for (B)(a); (B) (a) semiautomatic procedure (Collins's parser), (b) manual correction of (B)(a).

36 Constraints and limitations: any string has a node of its own (word form, punctuation mark, etc.; AuxV, AuxP, AuxC, AuxX, AuxG…); coordination and apposition relations are reflected: the so-called third dimension of the graph is encoded in the plain tree (X_Co, X_Ap, X_Pa, where X is one of the analytic functions, such as Sb, Obj, Adv, etc.).
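Since the "third dimension" is folded into the label itself, a consumer of the data has to split a label such as Sb_Co back into its analytic function and its suffix. The following Python lines are a small sketch of that step; `split_afun` and `SUFFIXES` are hypothetical names, and the reading of _Pa as parenthesis is an assumption, since the slide names only coordination and apposition.

```python
from typing import Optional, Tuple

# Suffix meanings; "Pa" as parenthesis is an assumption not stated on the slide.
SUFFIXES = {"Co": "coordination", "Ap": "apposition", "Pa": "parenthesis"}


def split_afun(label: str) -> Tuple[str, Optional[str]]:
    """Split an analytic function label into (function, suffix or None)."""
    base, sep, suffix = label.rpartition("_")
    if sep and base and suffix in SUFFIXES:
        return base, suffix
    return label, None


print(split_afun("Sb_Co"))   # member of a coordination, functioning as subject
print(split_afun("Obj"))     # plain label without a suffix
```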

37 Constraints and limitations: nodes missing on the surface cannot be added (the analytic function ExD is used instead); relation between the semi-automatic and manual procedures: 80% of edges are established correctly automatically.

38 Project organization: team consisting of 5-6 annotators; handbook for ATS structure annotation; 100,000 sentences on ATS; tectogrammatical annotation follows.

39 Projectivity/Nonprojectivity/Surface Order A(B, C) B C A B C A CB A

40 Projectivity/Non-projectivity/Surface Order A(B( C )) B C A C B A C B A
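The two slides above contrast a tree A(B, C) with a tree A(B(C)) under different surface orders. One common way to test projectivity is to check, for every edge, that all words lying between the head and the dependent are dominated by that head. The Python sketch below assumes this standard definition; `is_projective` and the list-of-heads encoding are hypothetical choices, not part of PDT tooling.

```python
from typing import List


def is_projective(heads: List[int]) -> bool:
    """heads[i] is the surface index of the head of word i, or -1 for the root."""

    def dominated_by(node: int, ancestor: int) -> bool:
        # Walk up from node towards the root, looking for ancestor.
        while node != -1:
            if node == ancestor:
                return True
            node = heads[node]
        return False

    for dep, head in enumerate(heads):
        if head == -1:
            continue
        low, high = sorted((dep, head))
        # Every word strictly between head and dependent must hang under head.
        if not all(dominated_by(mid, head) for mid in range(low + 1, high)):
            return False
    return True


# Tree A(B, C), surface order B C A: both B and C depend on A.
print(is_projective([2, 2, -1]))
# Tree A(B(C)), surface order C A B: A stands between C and its head B.
print(is_projective([2, -1, 1]))
```

With only three nodes, the flat tree A(B, C) is projective in every surface order, while the chain A(B(C)) becomes non-projective as soon as A is placed between B and C.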

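41 Example (diagram labels: Adv, AuxT): První restituční zákon českého parlamentu se do sněmovních lavic může vrátit jako bumerang. The first restitution law of the Czech parliament may return to the parliamentary benches like a boomerang.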

42 Prague Dependency Treebank 1.0 From the Analytical towards the Tectogrammatical layer

43 Introduction. ATS annotation: nodes - word forms, punctuation, graphical symbols; edges - surface relations. TGTS annotation: nodes - autosemantic words, deletions; edges - deep layer functions.

44 PDT 1.0 Annotation process: Input Czech sentence; Tokenization; Morphological tagging and lexical disambiguation; Syntactic parsing and analytic function assignment (ATS); Tree structure pruning; Attribute assignments (TGTS).

45 Transition procedure: deterministic procedure operating on trees; macro language for Graph Editor (Perl); automatic changes & tools for annotators. Requirements: new attributes for the tectogrammatical layer; ATS is recoverable from TGTS; automated to the highest possible degree.

46 New attributes: trlemma - lemma of the original node or lemma composed of joined nodes; morphological grammatemes: gender, number, degree of comparison, tense, aspect, iterativeness, verbal modality, deontic modality, sentence modality; position of the node: functor, topic-focus articulation, syntactic grammateme, type of relation (dependency, coordination, apposition), phraseme, deletion, quoted word, direct speech, coreference, antecedent.

47 Tree Structure Pruning: U toho, kdo začíná opravdu od nuly, není daňový výnos pro stát podstatný. For someone who really starts from zero, the tax revenue is not substantial for the state.

48 Tree Structure Pruning: U toho, kdo začíná opravdu od nuly, není daňový výnos pro stát podstatný. For someone who really starts from zero, the tax revenue is not substantial for the state. Diagram label: REG

49 Verbal Nodes: … podnikatelé by měli mít daně … (… entrepreneurs should have (their) taxes …) PRED verbmod=CDN deontmod=HRT

50 Attribute Assignments: prepositions stored as fw attribute; quoted words: clause in quotes -> DSP, one pair of quotes in the sentence -> DSPP, string in quotes -> QUOT; gender, number, tense, degcmp, aspect: default values.

51 Macros for Annotators: keyboard shortcuts (in Graph Editor); structure changes: hide/recover nodes, merge nodes, add new nodes; functor assignments.

52 Manual annotation: structure checking; functors; deletions of obligatory modifications; feedback for formulating the handbook for annotators.

53 Prague Dependency Treebank 1.0 Tectogrammatical Layer

54

55 CT T T T T F F T T

56 Jirka se včera opil do němoty a Honza dneska. Word-for-word gloss: George himself yesterday drank to silence and Honza today.

57 Attributes of Coreferential relations (only in MC). Attributes and values: coref - the lemma of the antecedent; corsnt - NIL (in the same sentence) or PREV1... PREVi (position of the sentence which includes the antecedent); grammatical coreference: antec - the functor of the antecedent.

58 Example: Honza slíbil přijít včas. Honza promised to come in time. coref:Honza corsnt:NIL cornum:1 antec:ACT
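The attribute string in this example can be read into a small record, and corsnt can be turned into a sentence position. The Python sketch below is illustrative only: `parse_coref` and `antecedent_sentence` are hypothetical names, and reading PREVi as "i sentences before the current one" is an assumption about the slide's wording, not a documented rule.

```python
import re
from typing import Dict


def parse_coref(spec: str) -> Dict[str, str]:
    """Turn 'coref:Honza corsnt:NIL ...' into an attribute dictionary."""
    pairs = (item.split(":", 1) for item in spec.split())
    return {name: value for name, value in pairs}


def antecedent_sentence(current: int, corsnt: str) -> int:
    """Index of the sentence holding the antecedent (assumed PREVi = i back)."""
    if corsnt == "NIL":
        return current
    match = re.fullmatch(r"PREV(\d+)", corsnt)
    if match is None:
        raise ValueError(f"unexpected corsnt value: {corsnt!r}")
    return current - int(match.group(1))


attrs = parse_coref("coref:Honza corsnt:NIL cornum:1 antec:ACT")
print(attrs)
print(antecedent_sentence(5, attrs["corsnt"]))
```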

