Published by Alison Palmer. Modified over 9 years ago.
Treebanks and MWEs (Part 1)
Jan Hajič, Pavel Straňák, Jiří Mírovský
Institute of Formal and Applied Linguistics & LINDAT/CLARIN
School of Computer Science, Faculty of Mathematics and Physics
Charles University in Prague, Czech Republic

PARSEME Training School, Prague, 19.1.2015

Outline
Treebanks
– Phrase- (constituency-) based: The Penn Treebank
– Dependency: The Prague Dependency Treebanks
The Penn Treebank (basics)
The Prague Dependency Treebank
– Layers of Annotation
  Morphology
  Syntax
  Semantics
  Valency

THE PENN TREEBANK

Phrase- vs. Dependency-Based Treebanks
The original: The Penn Treebank
– Phrase-based style; good for parsing with context-free grammars (CFGs)
Followers
– Almost all Penn-based treebanks: Chinese, Arabic, Korean, …
– Negra (German), many others
Now: dependency parsing prevails
Conversion from phrase-based treebanks
– Might lose information, heads added “ad hoc”
“Native” dependency treebanks: annotated as such
– Considered “better”
– Hindi/Urdu, TIGER (sort of); both styles manually annotated
– PDT (of course) and similar ones
  » PDT-style treebanks: Danish, Croatian, Slovene, Greek, Latin

The Penn Treebank
Published (first) in 1993, now LDC99T42 (www.ldc.upenn.edu)
– First the Wall Street Journal part (1 mil. words, 2312 documents)
Added other text types
– ATIS corpus (dialogs, travel reservations)
– Brown corpus annotated for syntax
– Switchboard (spoken language, tel. conversations)

Penn Treebank Format
Sentence: Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.

    ( (S
        (NP-SBJ
          (NP (NNP Pierre) (NNP Vinken) )
          (, ,)
          (ADJP
            (NP (CD 61) (NNS years) )
            (JJ old) )
          (, ,) )
        (VP (MD will)
          (VP (VB join)
            (NP (DT the) (NN board) )
            (PP-CLR (IN as)
              (NP (DT a) (JJ nonexecutive) (NN director) ))
            (NP-TMP (NNP Nov.) (CD 29) )))
        (. .) ))

Slide callouts:
– POS tag (NNS): noun, plural
– Phrase label (NP): noun phrase
– “Preterminal”: a node such as (NNS years) that directly dominates a word

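The bracketed format above is easy to read programmatically. The following is a minimal sketch, not the official Penn Treebank tooling: the function names `tokenize`, `parse_tree` and `leaves` are hypothetical, and the input is a shortened version of the slide's sentence.

```python
def tokenize(text):
    """Split a bracketed string into '(', ')' and symbol tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Children are either sub-trees or plain strings (the words).
    The outermost wrapper bracket of the Penn format has an empty label.
    """
    tokens = tokenize(text)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ')'
        return (label, children)

    return node()


def leaves(tree):
    """Return (POS tag, word) pairs, i.e. the preterminals, left to right."""
    label, children = tree
    pairs = []
    for child in children:
        if isinstance(child, str):
            pairs.append((label, child))
        else:
            pairs.extend(leaves(child))
    return pairs


SAMPLE = ("( (S (NP-SBJ (NNP Pierre) (NNP Vinken)) "
          "(VP (MD will) (VP (VB join) (NP (DT the) (NN board)))) (. .)))")
tree = parse_tree(SAMPLE)
words = [word for _, word in leaves(tree)]
```

Here `words` holds the token sequence and `leaves(tree)` pairs each token with its POS tag; phrase labels such as `NP-SBJ` stay on the inner nodes.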
The Penn Treebank(s)
Extensions
– Annotation of named entities, co-reference (BBN)
– cf. also previous slides
– Function labels (SBJ, OBJ, TMP, ...)
– PropBank
  Penn Treebank syntax + predicate-argument relations, added “frame files” (predicate dictionary)
  (S (NP-SBJ (PRP I Arg0) VP (VBD gave Pred) (NP-DOBJ (PRP him Arg1) (NP-IOBJ (DET the) (NN book Arg2)))... )
– NomBank
  Like PropBank, but for nouns and their “arguments”
Other languages (Chinese, Arabic, ...)

THE PRAGUE DEPENDENCY TREEBANK

The Prague Dependency Treebanks: the Basics
Original Treebank: PDT 1.0, 2001 (morph., dep. syntax)
First full release: PDT 2.0
– http://ufal.mff.cuni.cz/pdt2.0
  LDC2006T01, see http://www.ldc.upenn.edu
– Now: PDT 3.0: http://ufal.mff.cuni.cz/pdt3.0
Basic general features
– Multilayered annotation, interlinked layers
– Dependency-based syntax (both surface and deep)
– Information structure of the sentence (topic/focus)
– Grammatical and basic textual coreference
– New: discourse relations, MWEs
Languages: Czech, English (also parallel), Arabic
– Student work on “samples”: Indonesian, Urdu, Russian, …
– Spoken: work started on Czech and English (non-parallel, dialogs)

The Prague Dependency Treebank
Three basic layers of annotation
– Morphemic layer
– Surface syntax (“analytical”) layer
– “Tectogrammatical” layer: underlying syntax, semantic roles (valency), inf. structure, coreference
Size
– 830,000 words (tokens) = 50,000 sentences in 3165 full documents (texts)
Format
– Prague Markup Language (XML-based)
– Now also: .treex format
  For smooth use in the TreeX platform: http://ufal.mff.cuni.cz/treex

PDT (Czech) Data
4 sources:
– Lidové noviny (daily newspaper, incl. extra sections)
– DNES (Mladá fronta Dnes) (daily newspaper)
– Vesmír (popular science magazine, monthly)
– Českomoravský Profit (economic journal, weekly)
Full articles selected
– article ~ DOCUMENT (basic corpus unit)
Time period: 1990-1995
1.8 million tokens (~110,000 sentences total)

PDT Annotation Layers
L0 (w) Words (tokens)
– automatic segmentation and markup only
L1 (m) Morphology
– Tag (full morphology, 13 categories), lemma
L2 (a) Analytical layer (surface syntax)
– Dependency, analytical dependency function
L3 (t) Tectogrammatical layer (“deep” syntax)
– Dependency, “functor”, grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon
– PDT 3.0: mass, clauses, formemes, discourse, ...
Releases marked on the slide: PDT 1.0 (2001), PDT 2.0 (2006), PDT 3.0

Morphological Attributes
Tag: 13 categories
Example: AAFP3----3N----

  Position  Value  Meaning
  1         A      Adjective
  2         A      Regular
  3         F      Feminine
  4         P      Plural
  5         3      Dative
  6         -      no poss. gender
  7         -      no poss. number
  8         -      no person
  9         -      no tense
  10        3      superlative
  11        N      negated
  12        -      no voice
  13        -      reserve1
  14        -      reserve2
  15        -      base var.

Ex.: nejnezajímavějším “(to) the most uninteresting”
Lemma: POS-unique identifier
– Books/verb -> book-1, went -> go, to/prep. -> to-1

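Because the tag is positional, decoding it amounts to pairing each character with its slot. The sketch below assumes the slot order shown on the slide (13 categories plus two reserved positions); the English slot names and the `decode_tag` helper are illustrative shorthand, not identifiers taken from the PDT distribution, and the glosses cover only the values that appear in the slide's example.

```python
# Slot names in positional order; "-" in a tag means "not applicable".
POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Var",
]

# Glosses only for the values used in the slide's example tag.
GLOSSES = {
    ("POS", "A"): "adjective",
    ("SubPOS", "A"): "regular",
    ("Gender", "F"): "feminine",
    ("Number", "P"): "plural",
    ("Case", "3"): "dative",
    ("Grade", "3"): "superlative",
    ("Negation", "N"): "negated",
}


def decode_tag(tag):
    """Map a 15-character positional tag to {slot name: value}, skipping '-'."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} positions, got {len(tag)}")
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}


decoded = decode_tag("AAFP3----3N----")
readable = [GLOSSES.get((name, value), value) for name, value in decoded.items()]
```

For the example tag, `readable` lists the seven filled slots in order: adjective, regular, feminine, plural, dative, superlative, negated.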
Layer 2 (a-layer): Analytical Syntax
Dependency + Analytical Function
[Figure: analytical dependency tree with edges labeled “dependent” and “governor”]
Example sentence: The influence of the Mexican crisis on Central and Eastern Europe has apparently been underestimated.

Analytical Syntax: Functions
Main (for [main] semantic lexemes):
– Pred, Sb, Obj, Adv, Atr, Atv(V), AuxV, Pnom
“Double” dependency: AtrAdv, AtrObj, AtrAtr
Special (function words, punctuation, ...):
– Reflexives, particles: AuxT, AuxR, AuxO, AuxZ, AuxY
– Prepositions/Conjunctions: AuxP, AuxC
– Punctuation, Graphics: AuxX, AuxS, AuxG, AuxK
Structural
– Ellipsis: ExD
– Coordination etc.: Coord, Apos

Tectogrammatical Annotation
Underlying (deep) syntax
5 sublayers (integrated and/or standoff annotation):
– dependency structure, (detailed) functors, valency annotation
– topic/focus and deep word order
– coreference (mostly grammatical only)
– discourse
– all the rest (grammatemes): detailed functors, underlying gender, number, mass nouns, ...
Total: 39 attributes (vs. 5 at m-layer, 2 at a-layer)

Tectogrammatical vs. analytical syntax
TR (tectogrammatical representation): no function words
AR (analytical representation): all words
Example sentence: In practice, that procedure will require making of certified copies.
[Figure: both trees side by side; callouts: predicate verb, “Location”, re-inserted elided actor of “making”]

Dependency Structure
Similar to the surface (Analytical) layer...
...but:
– certain nodes deleted
  auxiliaries, non-autosemantic words, punctuation
  (some) multiword expressions -> 1 node
– some nodes added
  based on word (mostly verb, noun) valency
  some ellipsis resolution
– detailed dependency relation labels (functors)

Tectogrammatical Functors
“Actants”: ACT, PAT, EFF, ADDR, ORIG
– modify: verbs, nouns, adjectives
– cannot repeat in a clause, usually obligatory
Free modifications (~ 50), semantically defined
– can repeat; optional, sometimes obligatory
– Ex.: LOC, DIR1, ...; TWHEN, TTILL, ...; RSTR; BEN, ATT, ACMP, INTT, MANN; MAT, APP; ID, DPHR, CPHR, ...
Special
– Coordination, Rhematizers, Foreign phrases (#Forn), ...
Slide callouts: “syntactic” (actants), semantic (free modifications), MWEs (DPHR, CPHR)

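The constraint that actants cannot repeat within a clause, while free modifications can, lends itself to a simple consistency check. This is a minimal sketch under a simplifying assumption: the input is the flat list of functors on the direct dependents of one verb node, with coordination already resolved. The helper name `repeated_actants` is hypothetical.

```python
from collections import Counter

# The five actants named on the slide; everything else is treated as a
# free modification (or a special functor) and may repeat.
ACTANTS = {"ACT", "PAT", "EFF", "ADDR", "ORIG"}


def repeated_actants(functors):
    """Return the actant functors that occur more than once, sorted by name."""
    counts = Counter(f for f in functors if f in ACTANTS)
    return sorted(f for f, n in counts.items() if n > 1)


ok_clause = ["ACT", "PAT", "LOC", "LOC", "TWHEN"]   # repeated LOC is fine
bad_clause = ["ACT", "ACT", "PAT", "TWHEN"]         # repeated ACT is flagged
```

A check like this can flag annotation errors, but it says nothing about which actants are obligatory; that information lives in the valency lexicon.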
Deep Word Order, Topic/Focus
Example: Baker bakes rolls. vs. Baker[IC] bakes rolls. (IC marks the intonation center)
[Figure: analytical dependency tree]

Deep Word Order, Topic/Focus
Deep word order:
– from “old” information to the “new” one (left-to-right) at every level (head included)
– projectivity by definition (almost...)
  i.e., partial level-based order -> total d.w.o.
Topic/focus/contrastive topic
– attribute of every node (t, f, c)
– restricted by d.w.o. and other constraints

Coreference
Grammatical (easy)
– relative clauses: which, who
  Peter and Paul, who...
– control: infinitival constructions
  John promised to go home
– reflexive pronouns: {him,her,them}self(-ves)
  Mary saw herself in...
[Figure: tectogrammatical tree for “John promised to go home”; nodes: John, go, he, home, promise; functors: PRED, ACT, PAT, ACT, DIR3]

Coreference
Textual
– Ex.: Peter moved to Iowa after he finished his PhD.

Grammatemes
Detailed functors (“subfunctors”)
– needed for some functors:
  TWHEN: before/after
  LOC: next-to, behind, in-front-of, ...
  also: ACMP, BEN, CPR, DIR1, DIR2, DIR3, EXT
Lexical (underlying)
– number (Sg/Pl), tense, modality, degree of comparison, mass-noun?; is_person_name, is_dsp_root, ...
Slide callout: MWEs

Attribute Summary I
node types
– complex, coap, qcomplex, root, atom, ...
functor, subfunctor
– TWHEN: TWHEN.basic, TWHEN.before
is_member, is_generated, is_parenthesis, is_dsp_root, is_state, quot_type, is_person_name, ...
grammatemes (16):
– aspect, degcmp, deontmod, sempos, tense, indeftype, politeness, person, ...
Slide callout: MWEs

Attribute Summary II
topic/focus:
– tfa, deepord
valency: t_lemma, val_frame.rf
bookkeeping: id
coref_gram.rf, coref_text.rf, compl.rf
– reference to TR node, type of coreference
sentmod
Linking to analytical layer
– a.lex.rf (“main” anal. node), a.aux.rf (others)

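The links from the tectogrammatical layer back to the analytical layer can be pictured as ID references. The sketch below is illustrative only: the Python attribute names `a_lex_rf` and `a_aux_rf` mirror the slide's `a.lex.rf` and `a.aux.rf`, while the node IDs, the `ANode`/`TNode` classes and the exact split between "main" and auxiliary nodes for the sentence "Martin se stará o tygry." are assumptions, not data copied from the treebank.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ANode:
    """One analytical-layer node: every surface token has one."""
    id: str
    form: str
    ord: int  # surface position


@dataclass
class TNode:
    """One tectogrammatical node: content words only."""
    id: str
    t_lemma: str
    functor: str
    a_lex_rf: Optional[str] = None                      # the "main" analytical node
    a_aux_rf: List[str] = field(default_factory=list)   # function words folded in


A_LAYER = {n.id: n for n in [
    ANode("a1", "Martin", 1), ANode("a2", "se", 2), ANode("a3", "stará", 3),
    ANode("a4", "o", 4), ANode("a5", "tygry", 5),
]}

T_NODES = [
    TNode("t1", "starat_se", "PRED", "a3", ["a2"]),
    TNode("t2", "Martin", "ACT", "a1"),
    TNode("t3", "tygr", "PAT", "a5", ["a4"]),
]


def surface_forms(t_node, a_layer):
    """Collect the surface words one t-node stands for, in surface order."""
    ids = ([t_node.a_lex_rf] if t_node.a_lex_rf else []) + t_node.a_aux_rf
    nodes = sorted((a_layer[i] for i in ids), key=lambda n: n.ord)
    return [n.form for n in nodes]
```

Three t-nodes cover five surface tokens here: the preposition and the reflexive particle have no t-node of their own and are reachable only through the auxiliary references.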
VALENCY IN PDT

Prague Dependency Treebank & Valency
Valency in the PDT
– Valency lexicon for PDT
– General valency lexicon
Valency in deep vs. surface syntax
– Links between the layers w.r.t. valency
Valency and word sense
– Sense-disambiguated occurrences: links from data to the lexicon
Valency in translation, text generation

Definition of Valency
Ability (“desire”) of words (verbs, nouns, adjectives) to combine with other units of meaning
Properties of valency:
– Specific for every word meaning (in general)
  leave: sb left sth for sb vs. sb left from somewhere
  similar to PropBank leave.02 vs. leave.01
– Typically strongly correlates with surface form
  (Czech) morphological case (~ ending), preposition+case, ...
– Semantic constraints

Structure of Valency
word (lemma)
– word sense group 1
  valency frame: slot 1, slot 2, slot 3
  surface expression (one per slot)
– word sense group 2 ...

Example: vyměnit (to replace)

  vyměnit 1   ACT    PAT    EFF
              Nom.   Acc.   za+Acc.
  vyměnit 2   ...

PDT-Vallex Entry
dosáhnout: “to reach”, “to get [sb to do sth]”
[Figure: browser/user-formatted example of the entry]

MWEs in PDT-Vallex
Types included:
– Reflexive particle (se, si): part of the t_lemma, e.g. smát_se
  smát se – to laugh
  všimnout si – to notice
– Idiomatic constructions: DPHR (argument)
  dosáhnout svého – to achieve one’s goals
  běhá mi mráz po zádech – to give me the shivers
– Light verb constructions (and similar): CPHR (argument)
  uzavřít dohodu – to agree [on sth], strike an agreement, ...
  vzbuzovat pochybnosti – to doubt, to raise doubts

Corpus ↔ Valency Lexicon
Corpus: Sentence 2035, Sentence 15345, Sentence 51042
[Figure: corpus occurrences linked to the frames of the lexicon entry]
Lexicon:
ENTRY: uzavřít (to close)
  vf 1: ACT(.1) CPHR({smlouva}.4)
    ex.: u. dohodu (close a contract)
  vf 2: ACT(.1) PAT(.4)
    ex.: u. pokoj (close a room, house)

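The frame notation in the entry above is regular enough to parse. The sketch below reads the two frames of uzavřít and picks the frame whose functors match those observed on a corpus node. It assumes only the notation visible on the slide (`FUNCTOR(form)`, `{lemma}` for a fixed lemma, `.N` for morphological case); the names `Slot`, `parse_frame`, `matching_frames` and the frame IDs `vf1`/`vf2` are hypothetical, and real PDT-Vallex entries use a richer syntax.

```python
import re
from typing import NamedTuple, Optional


class Slot(NamedTuple):
    functor: str
    lemma: Optional[str]   # fixed lemma required by the slot, if any
    case: Optional[int]    # morphological case number, if given


def parse_frame(frame):
    """Parse a frame string such as 'ACT(.1) CPHR({smlouva}.4)' into slots."""
    slots = []
    for functor, form in re.findall(r"(\w+)\(([^)]*)\)", frame):
        lemma = re.search(r"\{(\w+)\}", form)
        case = re.search(r"\.(\d)", form)
        slots.append(Slot(
            functor,
            lemma.group(1) if lemma else None,
            int(case.group(1)) if case else None,
        ))
    return slots


LEXICON = {
    "uzavřít": {
        "vf1": parse_frame("ACT(.1) CPHR({smlouva}.4)"),
        "vf2": parse_frame("ACT(.1) PAT(.4)"),
    }
}


def matching_frames(lemma, functors):
    """Return IDs of frames whose functor set equals the observed one.

    `functors` should contain only the valency-relevant dependents of the
    corpus node; free modifications are assumed to be filtered out already.
    """
    observed = set(functors)
    return [fid for fid, slots in LEXICON[lemma].items()
            if {s.functor for s in slots} == observed]
```

With this, a corpus occurrence whose dependents are ACT and CPHR resolves to `vf1`, which is the kind of link from data to lexicon the slide illustrates.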
Valency & Text Generation
Using valency for...
– ...getting the correct (lemma, tag) of verb arguments
Example (tectogrammatical input):
  starat_se  PRED  (“to take care of”)
  Martin     ACT
  tygr       PAT   (“tiger”)
VALLEX entry: starat (se) ACT(.1) PAT(o.[.4])
Generated (lemma, tag) pairs:
  Martin  ....1..........
  starat  V..............
  se      ...............
  o       ...............
  tygr    ....4..........
Result: Martin se stará o tygry. “Martin takes care of tigers.”

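The generation step can be sketched as follows: read the required preposition and case from each slot of the valency frame, then look up the matching word form. Everything beyond the frame string itself is an assumption made for illustration: the `FORMS` table is hand-written (a real system would call a morphological generator), the plural for tygr is chosen only to match the slide's sentence, and the fixed word order in the final template is hard-coded.

```python
import re

# Slot forms from the slide's entry: starat (se) ACT(.1) PAT(o.[.4])
FRAME = {"ACT": ".1", "PAT": "o.[.4]"}

# Hypothetical stand-in for a morphological generator: (lemma, case) -> form.
FORMS = {("Martin", 1): "Martin", ("tygr", 4): "tygry"}


def parse_slot_form(form):
    """Return (preposition or None, case number) for '.1' or 'o.[.4]'."""
    match = re.fullmatch(r"(?:(\w+)\.\[)?\.(\d)\]?", form)
    if match is None:
        raise ValueError(f"unsupported slot form: {form!r}")
    return match.group(1), int(match.group(2))


def realize(lemma, functor):
    """Produce the surface string for one argument of the verb."""
    preposition, case = parse_slot_form(FRAME[functor])
    form = FORMS[(lemma, case)]
    return f"{preposition} {form}" if preposition else form


sentence = f"{realize('Martin', 'ACT')} se stará {realize('tygr', 'PAT')}."
```

The point of the example is the division of labor: the lexicon supplies "o + accusative" for PAT, so the generator never has to guess the preposition.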
PARALLEL TREEBANK CZ-EN

Parallel Czech-English Annotation
English text → Czech text (human translation)
Czech side (goal): all layers manual annotation
English side (goal):
– Morphology and surface syntax: technical conversion
  Penn Treebank style -> PDT Analytic layer
– Tectogrammatical annotation: manual annotation
  (Slightly) different rules needed for English
Alignment
– Natural, sentence level only (now)

Human Translation of WSJ Texts
Hired translators / FCE level
– 5 rounds of revision + annotation
Specific rules for translation
– Sentence by sentence only
  …to get simple 1:1 alignment
– Fluent Czech at the target side
– If there is a choice, prefer “literal” translation
The numbers:
– English tokens: 1,173,766
– Czech tokens: 1,151,150
– Sentence pairs: 49,208

English Annotation: POS and Syntax
Automatic conversion from Penn Treebank
– PDT morphological layer
  From POS tags
– PDT analytic layer
  From:
  » Penn Treebank syntactic structure
  » Non-terminal labels
  » Function tags (non-terminal “suffixes”)
2-step process
– Head determination rules
– Conversion to dependency + analytic function

Head Determination Rules
Exhaustive set of rules
– By J. Eisner + M. Cmejrek/J. Curin
– 4000 rules (non-terminal based)
  Ex.: (S (NP-SBJ VP .)) → VP
– Additional rules
  Coordination, Apposition
  Punctuation (end-of-sentence, internal)
Original idea (possibility of conversion)
– J. Robinson (1960s)

Example: Head Determination Rules (J.E.)
Rules:
  (NP (DT NN)) → NN
  (VP (VB NP)) → VB
  (VP (MD VP)) → VP
  (S (… VP …)) → VP
[Figure: phrase-structure tree for “... will join the board” with the lexical head of each node added: (board), (the), (join), (will), (join)]

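The two-step process (pick a head child in every phrase, then attach the other children's heads to it) can be sketched in a few lines. This is a simplified illustration, not the rule set credited on the slides: the real rules are keyed on full child sequences, whereas `HEAD_RULES` below is a per-category priority list that reproduces only the four rules shown, plus an assumed `NNP` fallback for the subject noun phrase. Words stand in for node identities, so a sentence with repeated words would need token indices instead.

```python
# Priority lists: the first listed category found among the children wins.
HEAD_RULES = {
    "NP": ["NN", "NNP"],   # NNP is an added assumption for "Vinken"
    "VP": ["VP", "VB"],    # (VP (MD VP)) -> VP, (VP (VB NP)) -> VB
    "S": ["VP"],
}


def category(label):
    """Strip function tags: 'NP-SBJ' -> 'NP'."""
    return label.split("-")[0]


def find_head_index(label, children):
    """Return the index of the head child according to HEAD_RULES."""
    for wanted in HEAD_RULES[category(label)]:
        for i, child in enumerate(children):
            if category(child[0]) == wanted:
                return i
    raise ValueError(f"no head rule applies to {label}")


def to_dependencies(node, deps):
    """Return the lexical head of `node`; append (dependent, governor) to deps."""
    label, content = node
    if isinstance(content, str):       # preterminal: (POS, word)
        return content
    heads = [to_dependencies(child, deps) for child in content]
    h = find_head_index(label, content)
    for i, word in enumerate(heads):
        if i != h:
            deps.append((word, heads[h]))
    return heads[h]


TREE = ("S", [
    ("NP-SBJ", [("NNP", "Vinken")]),
    ("VP", [("MD", "will"),
            ("VP", [("VB", "join"),
                    ("NP", [("DT", "the"), ("NN", "board")])])]),
])

dependencies = []
root = to_dependencies(TREE, dependencies)
```

On this tree the root is "join", with "will", "board" and "Vinken" attached to it and "the" attached to "board", which agrees with the heads shown in the slide's figure.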
Conversion: Analytic Structure, Functions
Analytic Function assignment (conversion)
– Rules, based on functional tags:

  Function tag  Analytic function
  -SBJ          Sb
  -PRD          Pnom
  -BNF          Obj
  -DTV          Obj
  -LGS          Obj
  -ADV          Adv
  -DIR          Adv
  -EXT          Adv
  -LOC          Adv
  -MNR          Adv
  -PRP          Adv
  -PUT          Adv
  -TMP          Adv

– Ad-hoc rules (if functional tags missing)
– Lemmatization (years → year)

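The tag-based rules above amount to a lookup table. The sketch below encodes exactly the thirteen mappings from the slide; the function name `analytic_function` is hypothetical, and two behaviors are assumptions: when a label carries several function tags the first mapped one wins, and `None` is returned where the slide's unspecified ad-hoc rules would take over.

```python
FUNCTION_TAG_MAP = {
    "SBJ": "Sb", "PRD": "Pnom",
    "BNF": "Obj", "DTV": "Obj", "LGS": "Obj",
    "ADV": "Adv", "DIR": "Adv", "EXT": "Adv", "LOC": "Adv",
    "MNR": "Adv", "PRP": "Adv", "PUT": "Adv", "TMP": "Adv",
}


def analytic_function(label):
    """Map a Penn non-terminal label such as 'NP-SBJ' to an analytic function.

    Returns None when no function tag is present or mapped, which is where
    the ad-hoc rules mentioned on the slide would apply.
    """
    for tag in label.split("-")[1:]:   # skip the category itself
        if tag in FUNCTION_TAG_MAP:
            return FUNCTION_TAG_MAP[tag]
    return None
```

For the Pierre Vinken sentence this gives `Sb` for `NP-SBJ` and `Adv` for `NP-TMP`, while `PP-CLR` falls through to the ad-hoc rules because `-CLR` is not in the table.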
Example: Analytical Structure, Functions
[Figure: Penn Treebank structure (with heads added: (board), (the), (join), (will), (join)) → PDT-like Analytic Representation]

Czech-English Example
Czech: Dicku Darmane, zavolejte do své kanceláře!
English: Dick Darman, call your office!

SUMMARY OF PART 1/1

PDT Treebanks at UFAL (written language)
Czech
– Prague Dependency Treebank
  Complex annotation, all levels, additional annotation
– Translation of Penn Treebank
  Tectogrammatical layer only, no t/f
  » Analytical, morphology: automatic tool
English
– Re-annotation of Penn Treebank
Other languages
– Arabic (own annotation)
– Other: by conversion (HamleDT – 30 treebanks)

Prague Dependency Treebanks
Annotation:
– 4 layers: words, lemmas/tags, surface dep. syntax, tectogrammatics
– Tectogrammatical layer:
  No function words, semantic relations
  Valency/verb arguments (some MWE features)
  » Separate valency lexicon, fully linked from PDT nodes
  Coreference, Topic/focus, Discourse
  Links back to analytical layer (parsing!)
Now also in the Universal Dependencies format! https://github.com/UniversalDependencies

Pointers
PDT 2.0 (the “Original”), newest version: PDT 3.0
– http://ufal.mff.cuni.cz/pdt2.0
– http://ufal.mff.cuni.cz/pdt3.0
PCEDT
– http://ufal.mff.cuni.cz/pcedt2.0/
PEDT
– English side of PCEDT, additional: NE, coreference
– http://ufal.mff.cuni.cz/pedt2.0/
PADT (Arabic, morphology + surface syntax)
– http://ufal.mff.cuni.cz/padt
Other corpora, PDT-Vallex, EngVallex:
– Search at http://lindat.cz
LDC catalog numbers:
– LDC2006T01 (PDT 2.0), LDC2004T23 (PADT 1.0), LDC2004T25 (PEDT 1.0)
CoNLL 2009 shared task (7 languages, surface syntax + predicate arguments only)
– http://ufal.mff.cuni.cz/conll2009-st
HamleDT 2.0 (30 treebanks in unified format)
– http://ufal.mff.cuni.cz/hamledt