Prague Dependency Treebank(s)
  Workshop at LSA2011, Part II
  Jan Hajič, Zdeňka Urešová
  Institute of Formal and Applied Linguistics, School of Computer Science
  Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic
July 30, 2011, LSA 2011, Prague Dependency Treebanks II
Part II - Syntax and Semantics
  Tectogrammatical representation
  Valency lexicon
  Languages: Czech, Arabic and English
  Technical issues
    Annotation scheme and format
    Tools for annotation
  Applications
  Summary, pointers, conclusion
PDT Annotation Layers
  L0 (w) Words (tokens)
    automatic segmentation and markup only
  L1 (m) Morphology
    tag (full morphology, 13 categories), lemma
  L2 (a) Analytical layer (surface syntax)
    dependency, analytical dependency function
  L3 (t) Tectogrammatical layer (“deep” syntax)
    dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon
Layer 3 (t-layer): Tectogrammatical
  Underlying (deep) syntax
  4 sublayers (integrated):
    dependency structure, (detailed) functors, valency annotation
    topic/focus and deep word order
    coreference (mostly grammatical only)
    all the rest (grammatemes): detailed functors, underlying gender, number,...
  Total 39 attributes (vs. 5 at m-layer, 2 at a-layer)
Analytical vs. Tectogrammatical
  [Figure: analytical tree next to tectogrammatical tree (TR: sublayer 1 only shown)]
  Callouts: underlying verb + tense; deep function; elided Actor in; prepositions out; another ellipsis...
Layer 3: Tectogrammatical
  Underlying (deep) syntax
  4 sublayers:
    dependency structure, (detailed) functors
    topic/focus and deep word order
    coreference (mostly grammatical only)
    all the rest (grammatemes): detailed functors, underlying gender, number,...
Tectogrammatical Functors
  “Actants”: ACT, PAT, EFF, ADDR, ORIG
    modify: verbs, nouns, adjectives
    cannot repeat in a clause, usually obligatory
  Free modifications (~ 50), semantically defined
    can repeat; optional, sometimes obligatory
    Ex.: LOC, DIR1,...; TWHEN, TTILL,...; RSTR; BEN, ATT, ACMP, INTT, MANN; MAT, APP; ID, DPHR,...
  Special
    Coordination, Rhematizers, Foreign phrases,...
  [Scale shown on slide: syntactic ... semantic]
Tectogrammatical Example: Analytical verb form
  směl by být zapsán
  (he) allowed would-be to-be enrolled
  Callout: Collapsed
  Additional attributes (grammatemes): conditional + “allow”
Tectogrammatical Example: Passive construction (action)
  Kniha byla přeložena
  (The) book has-been translated [by Mr. X]
  Callouts: Disappeared; Added
Tectogrammatical Example: Object
  dal mu knihu
  (he) gave him a-book
  Obj goes into ACT, PAT, ADDR, EFF or ORIG based on governor’s valency frame
Tectogrammatical Example: Incomplete phrases
  Petr pracuje dobře, ale Pavel špatně
  Peter works well, but Paul badly
  Callout: Added
Deep Word Order, Topic/Focus
  Example: Baker bakes rolls. vs. Baker IC bakes rolls.
  Analytical dep. tree: [figure not preserved]
Deep Word Order, Topic/Focus
  Deep word order:
    from “old” information to the “new” one (left-to-right) at every level (head included)
    projectivity by definition (almost...)
      i.e., partial level-based order -> total d.w.o.
  Topic/focus/contrastive topic
    attribute of every node (t, f, c)
    restricted by d.w.o. and other constraints
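The step from a partial, level-based order to a total deep word order can be illustrated with a short Python sketch. It is not taken from the PDT tooling: the `Node` class and the `linearize` function are hypothetical names, and the sketch assumes a strictly projective tree in which each head's dependents are already split into those ordered before it and those ordered after it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical t-node: a lemma plus its locally ordered dependents."""
    lemma: str
    left: List["Node"] = field(default_factory=list)   # dependents ordered before the head
    right: List["Node"] = field(default_factory=list)  # dependents ordered after the head


def linearize(node: Node) -> List[str]:
    """Combine the per-level orders (head included) into one total order.

    Under projectivity every subtree occupies a contiguous span,
    so a recursive in-order walk is sufficient.
    """
    order: List[str] = []
    for child in node.left:
        order.extend(linearize(child))
    order.append(node.lemma)
    for child in node.right:
        order.extend(linearize(child))
    return order


# "Baker bakes rolls": the head "bake" with one dependent on each side.
tree = Node("bake", left=[Node("baker")], right=[Node("roll")])
print(linearize(tree))
```

Because only the local order at each level is stored, the total order follows from the tree shape alone; a non-projective order could not be represented this way.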
Coreference: Grammatical
  relative clauses: which, who
    Peter and Paul, who...
  control: infinitival constructions
    John promised to go...
  reflexive pronouns: {him,her,them}self(-ves)
    Mary saw herself in...
  [Figure: tectogrammatical tree with nodes promise, John, go, he, home and functors PRED, ACT, PAT, ACT, DIR3]
Coreference: Textual
  Ex.: Peter moved to Iowa after he finished his PhD.
Grammatemes
  Detailed functors (subfunctors)
    only for some functors:
      TWHEN: before/after
      LOC: next-to, behind, in-front-of,...
      also: ACMP, BEN, CPR, DIR1, DIR2, DIR3, EXT
  Lexical (underlying)
    number (SG/PL), tense, modality, degree of comparison,...
    strictly only where necessary (agreement!)
Example - simplified view
  Se zuby jsem měl v minulosti jen problémy.
  With teeth I-have had in the-past only problems.
Fully Annotated Sentence
  The boundaries of some problems seem to be clearer after they were revived by Havel’s speech.
Arabic Example: Tectogrammatics
  In the section on literature, the magazine presented the issue of the Arabic language and the dangers that threaten it.
English PDT-style Annotation
  Morphology and Syntax
    By conversion
  Tectogrammatical annotation
    Guidelines (English TR: by S. Cinková)
    Pre-annotation
      Transformation from Penn Treebank & Propbank (Palmer, Kingsbury) by Z. Žabokrtský et al.
    Valency
      From Propbank Frame Files (Cinková, Šindlerová, Nedolužko, Semecký)
Example - English TR
  [Figure: annotated English sentence]
  Callouts: Words; Dependencies; Sem. function; Valency (predicates); Coref (BBN); Named Entities (BBN)
Valency in the PDT
  Valency: the specific ability of a word to combine with other units of meaning
  [Diagram: verbs with example dependents and their functors]
    dát (give): Eva ACT, matka (mother) ADDR
    pršet (rain): zítra (tomorrow) TWHEN
    plakat (cry): Adam ACT, noc (night) TWHEN
    dar (gift) PAT, neděle (Sunday) TWHEN
  Callouts: Specific behavior; Modifies anything
Valency - Basic Principles
  inner participants vs. free modifications (arguments vs. adjuncts)
  obligatory vs. optional modifications (the dialogue test)
Inner Participant vs. Free Modification
  Inner participants (5): ACT(or), PAT(ient), ADDR(essee), EFF(ect), ORIG(in)
    each occurs just with particular verbs
    each modifies the verb only once (in a clause)
  Free modifications (70): Location (LOC, DIR1,…), Time (TWHEN, TTILL,…), Manner, Intention,…
    can modify in principle any verb
    can be repeated (within the same clause)
Inner Participants
  syntactic criteria - Actor and Patient
  semantic criteria for other inner participants (if a verb has more than two arguments)
  Argument shifting [diagram]: Actor, Patient, Addressee, Origin, Effect
    Petr has dug a hole.
      Semantic Effect (as a cognitive role) shifted to the position of Patient.
    The teacher asked a pupil.
      Semantic Addressee shifted to the position of Patient.
Obligatory vs. Optional: The Dialogue Test
  A: John left. B: From where? A: *I don't know.
    “from where”: obligatory modification
  A: John left. B: To where? A: I don't know.
    “to where”: optional modification
  Answering a question about a semantically obligatory modification, the speaker cannot say: I don't know.
Valency frame
  [Grid on slide: obligatory / optional vs. argument / adjunct]
  Structure: one meaning of the word, one valency frame
  Contents: functor, obligatoriness, surface form
  word: leave
    meaning 1: sb left sth               frame1: ACT PAT
    meaning 2: sb left from somewhere    frame2: ACT DIR1
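The "one meaning, one frame" structure can be written down as a small data model. The following Python sketch is only an illustration: the class names `Slot` and `Frame`, the `LEXICON` dictionary and the `functors` helper are hypothetical, and the obligatoriness default is an assumption, since the slide does not mark it for these two frames.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Slot:
    functor: str                 # e.g. ACT, PAT, DIR1
    obligatory: bool = True      # assumption: not marked on the slide
    form: Optional[str] = None   # surface form; not given for "leave" on the slide


@dataclass(frozen=True)
class Frame:
    gloss: str                   # informal description of the meaning
    slots: Tuple[Slot, ...]


# One frame per meaning of the word, as on the slide.
LEXICON: Dict[str, List[Frame]] = {
    "leave": [
        Frame("sb left sth", (Slot("ACT"), Slot("PAT"))),
        Frame("sb left from somewhere", (Slot("ACT"), Slot("DIR1"))),
    ],
}


def functors(word: str, meaning: int) -> List[str]:
    """Return the functors of the frame for the given 1-based meaning number."""
    return [slot.functor for slot in LEXICON[word][meaning - 1].slots]


print(functors("leave", 1))
print(functors("leave", 2))
```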
Valency lexicon: PDT-VALLEX
  8500 verb senses / valency frames
  9000 noun senses / valency frames
  some adjectives and adverbs
  PDT-VALLEX Entry
    verb: dosáhnout
      meaning 1: to reach sth
      meaning 2: to get sb to do sth
      meaning 3: …
      meaning 4: …
The PDT-VALLEX editor
  [Screenshot: editor window listing senses; glosses shown: ‘lay down’, resign, win, ask]
Valency Lexicon and TrEd
  [Screenshot; example entry: to write sth (about sth)]
Corpus and Valency Lexicon
  Corpus - occurrences of “uzavřít” (to close): Sentence 2035, Sentence 15345, Sentence 51042
  Lexicon:
    ENTRY: uzavřít
      vf 1: ACT(.1) CPHR({smlouva}.4)    ex.: u. dohodu (close a contract)
      vf 2: ACT(.1) PAT(.4)              ex.: u. pokoj (close a room, house)
Valency and Text Generation
  Tectogrammatical Representation has all the information to (re)generate the surface form of the sentence:
    in a “generalized” form
    non-redundant (almost... but for generation, it is o.k.)
    ...except the links to a-layer, however
      links used only for training [statistical models for] parsing/generation modules
      not present when e.g. doing text planning, translation,...
  valency dictionary: form of “learned” knowledge
Valency and Text Generation
  Using valency for...
    ...getting the correct (lemma, tag) of verb arguments
  Example:
    t-tree: starat_se PRED (“to take care of”), Martin ACT, tygr PAT (“tiger”)
    VALLEX entry: starat (se) ACT(.1) PAT(o.[.4])
    surface nodes: Martin, se, starat (V), o, tygr
    Martin se stará o tygry. “Martin takes care of tigers.”
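To make the frame notation concrete, here is a minimal Python sketch that reads a frame string such as `ACT(.1) PAT(o.[.4])` and extracts, per functor, the required preposition and morphological case. It is not the PDT-VALLEX reader: `SlotForm` and `parse_frame` are hypothetical names, and the parser assumes only the notation variants visible on these slides (an optional `{lemma}`, an optional preposition followed by a dot, and a case number after a dot).

```python
import re
from typing import List, NamedTuple, Optional


class SlotForm(NamedTuple):
    functor: str
    lemma: Optional[str]  # required lemma, e.g. "smlouva" in {smlouva}.4
    prep: Optional[str]   # required preposition, e.g. "o" in o.[.4]
    case: Optional[int]   # case number, e.g. 4 (accusative), 2 (genitive)


# FUNCTOR(form), where the form may be empty, as in LOC().
SLOT_RE = re.compile(r"([A-Z][A-Z0-9]*)\(([^)]*)\)")


def parse_frame(frame: str) -> List[SlotForm]:
    """Split a frame string into slots and decode each surface-form spec."""
    slots = []
    for functor, form in SLOT_RE.findall(frame):
        lemma = re.match(r"\{([^}]*)\}", form)     # {lemma} at the start
        prep = re.match(r"([^\W\d_]+)\.", form)    # letters followed by a dot
        case = re.search(r"\.(\d)", form)          # dot followed by a digit
        slots.append(SlotForm(
            functor,
            lemma.group(1) if lemma else None,
            prep.group(1) if prep else None,
            int(case.group(1)) if case else None,
        ))
    return slots


for slot in parse_frame("ACT(.1) PAT(o.[.4])"):
    print(slot)
```

With the entry for "starat (se)", the PAT slot decodes to the preposition "o" plus case 4, which is the information a generator needs to produce "o tygry" from the bare lemma "tygr".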
The Annotation Process
  4 sublayers
    work on structure first, rest in parallel
  Structure
    automatic preprocessing - programmed conversion from analytical layer annotation
  Grammatemes
    mostly automatically (based on lower layers’ annotation), manual checking, corrections
  Cross-sublayer/cross-layer checking
    partly automatic, then manual
The Annotation Process Scheme
  [Figure not preserved]
Tectogrammatical Annotation Tools
  Manual annotation
    4 groups of annotators ~ 4 sublayers
    Special graphical tool (TrEd)
      Customizable graphical tree editor
  Preprocessing
    Data from analytical layer, preprocessed
    Online dependency function preassignment
The [Manual] Annotation Tool
  Perl/PerlTk based, platform-independent
    Linux, Windows 95/98/2000, Solaris,...
  Perl as the “macro” language
    “unlimited” online processing capability
  Flexibility for interactive checking
    split screen, graphical “diff” function
  Customization, printing, “plugins”,...
The Annotation Scheme
  XML + principles of linear- and tree-based standoff annotation
  PML (Prague Markup Language)
    Layer schemes (Relax NG)
    PDT/PADT: t(ecto), a(nalytic), m(orphology), …
    English: + phrase-based (p-layer)
PML/XML Annotation Layers
  Strictly top-down links
  w+m+a can be easily “knitted”
  API for cross-layer access (programming)
  PML Schema / Relax NG
  [z and audio layers: used for spoken data (audio as layer “-1”)]
  LFG analogy: f-struct, Φ, c-struct
  [Figure: layer stack down to z-layer and audio; sample tokens BYL BYS ČELO LESA …]
The Prague Markup Language: Example
  m-layer data, linked to w-layer (XML element names not preserved; values in slide order):
    manual
    w#w-tr/_12941_01_00013.fs-s1w4    (pointer to w-layer)
    basic
    pocházela
    pocházet_:T
    VpQW---XR-AA
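The pointer value on the slide suggests a `prefix#id` shape: a short name for the referenced file, a `#`, and the identifier of the unit in that file. Under that assumption, a cross-layer lookup starts by splitting the reference. The Python sketch below is illustrative only; `split_ref` is a hypothetical helper, not part of any PML API.

```python
from typing import Tuple


def split_ref(ref: str) -> Tuple[str, str]:
    """Split a standoff reference of the assumed form 'prefix#id'.

    Returns (prefix, target_id). A reference without '#' is treated as
    pointing into the current file, so the prefix is empty.
    """
    prefix, sep, target = ref.partition("#")
    if not sep:
        return "", ref
    return prefix, target


prefix, target = split_ref("w#w-tr/_12941_01_00013.fs-s1w4")
print(prefix)   # name of the referenced layer file
print(target)   # identifier of the w-layer token
```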
PDT 2.0: The Data
  Data sizes [table not preserved]
Searching the Treebanks
  TrEd extension: PML-TQ
    Backend: database server
    Frontend: TrEd or Web browser
  Web access
    Sample data (Czech, English [soon]): anonymous / anonymous
    Full access (LSA 2011 participants only, 2011): LSA2011 / UC.Boulder
    Full access: licence needed for the corpora
  Available later this year at [URL not preserved]
Using the Results: Parsing
  Several parsers of Czech
    Analytical layer dependency syntax
    Trained on PDT 1.0 data, 1.2 mil. words
    Collins (98), Charniak (00), Žabokrtský (02), Ribarov (04), Nivre (05), Zeman (05), McDonald (05), CoNLL’06 (19 parsers)
  Best results
    accuracy (percent of correct dependencies): 84-85% for a single parser, > 86% for a combination
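The accuracy measure, percent of correct dependencies, can be computed as follows. This Python sketch is an illustration, not the evaluation script behind the reported figures: the function name `attachment_accuracy` is hypothetical, heads are assumed to be given as 1-based token indices with 0 for the root, and only the head attachment is compared (whether the reported numbers also count the dependency function is not stated on the slide).

```python
from typing import Sequence


def attachment_accuracy(gold_heads: Sequence[int], pred_heads: Sequence[int]) -> float:
    """Percentage of tokens whose predicted head equals the gold head."""
    if len(gold_heads) != len(pred_heads):
        raise ValueError("gold and predicted analyses must cover the same tokens")
    if not gold_heads:
        raise ValueError("cannot score an empty sentence")
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return 100.0 * correct / len(gold_heads)


# "Baker bakes rolls": both nouns depend on the verb (token 2), the verb is the root.
gold = [2, 0, 2]
pred = [2, 0, 1]   # a made-up parser output that attaches "rolls" to "Baker"
print(round(attachment_accuracy(gold, pred), 1))
```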
Tectogrammatical Parsing
  Newest results: 4 phases
  Transformation-based learning (FnTBL)
  Largely language independent
  Coreference: >90%
  Results by attribute, with m- and a-layer annotation manual vs. automatic:
    Attribute       manual    auto
    structure       89.3 %    76.4 %
    functor         85.5 %    77.4 %
    val_frame.rf    92.3 %    90.9 %
    t_lemma         93.5 %    90.9 %
    nodetype        94.5 %    92.6 %
    gram/sempos     93.8 %    91.5 %
    a/lex.rf        96.5 %    95.1 %
    a/aux.rf        94.3 %    90.3 %
    is_member       94.3 %    89.5 %
    is_generated    96.6 %    95.2 %
    deepord         68.0 %    66.7 %
Tectogrammatical Layer in Machine Translation
  The Translation (“Vauquois”) triangle
  [Figure: triangle from source (Cz) to target (En); levels: Morphology, Surface Syntax, Tectogrammatical Representation; transfer at the top, generation on the target side]
Dependency trees in MT
  According to his opinion UAL's executives were misinformed about the financing of the original transaction.
  Podle jeho názoru bylo vedení UAL o financování původní transakce nesprávně informováno.
  Transfer:
    structure (~0)
    lexical
    functions
    grammatical
Analytical Layer Correspondence
  [Figure not preserved]
Tectogrammatical Correspondence
  The [Homestead’s] only remaining baker bakes the most famous rolls to the north of Long River.
  ‘al-xabaaz ‘al-’axiir ‘al-baaqii [fii Homestead] yaśmacu ‘ashhar ‘al-kruasaanaat ilaa shimaal min Long River.
Valency and Translation
  leave:
    leave-1: to leave [from] somewhere
    leave-2: to leave sth for sb
  Translating (from English into Czech):
    which equivalent to choose?
      nechat vs. odjet/opustit
    which prepositions, cases,... to use?
      accusative vs. “z” (“from”) with genitive vs....?
Valency and Translation
  leave-1: ACT() PAT() LOC()      nechat-3: ACT(.1) PAT(.4) LOC()
  leave-2: ACT() DIR1(from.)      odjet-1:  ACT(.1) DIR1(z.[.2])
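The frame pairs on this slide can be read as a lookup table for transfer: once the English frame is known, the Czech frame supplies both the verb and the surface form of each argument. The Python sketch below illustrates that idea under strong simplifying assumptions; `FRAME_PAIRS` and `transfer` are hypothetical names, the table holds only the two pairs shown on the slide, and functors are assumed to map one-to-one.

```python
from typing import Dict, List, Tuple

# English frame id -> (Czech frame id, Czech slot spec per functor), from the slide.
FRAME_PAIRS: Dict[str, Tuple[str, Dict[str, str]]] = {
    "leave-1": ("nechat-3", {"ACT": "ACT(.1)", "PAT": "PAT(.4)", "LOC": "LOC()"}),
    "leave-2": ("odjet-1", {"ACT": "ACT(.1)", "DIR1": "DIR1(z.[.2])"}),
}


def transfer(en_frame: str, functors: List[str]) -> Tuple[str, List[str]]:
    """Map an English frame and its filled functors to the Czech frame and slot specs."""
    cz_frame, slot_specs = FRAME_PAIRS[en_frame]
    return cz_frame, [slot_specs[f] for f in functors]


# "leave from somewhere": the DIR1 argument must surface as "z" + genitive in Czech.
print(transfer("leave-2", ["ACT", "DIR1"]))
```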
To summarize… PDT is/has (a)…
  Dependency-based treebanking project
    Czech (other languages: Eng, Ar)
    Ongoing projects (other inst.): Italian, Old Greek, Latin, …
  ~ 1 mil. words
    sufficient size for ML experiments
  4 layers of annotation
    (token, morphology, syntax, deep syntax/semantics++)
    independent and full information at all levels, but...
    interlinked (for the development of parsers/generators)
  Valency dictionary
    integrated (links from data)
Some pointers
  Current version of PDT: v2.0, LDC2006T01
    all three levels, 1.9/1.5/0.8 Mwords
    Research -> Corpora (Treebank(s))
  LDC2001T10 (PDT v1.0), LDC2004T23 (PADT 1.0), LDC2004T25 (PCEDT 1.0), LDC2006T01 (PDT 2.0)
  Workshop 2002
    Using TL for MT Generation
    1st version of English dep. Treebank
  This workshop page, many links to resources, tools