Presentation is loading. Please wait.

Presentation is loading. Please wait.

Treebanks and MWEs (Part 1) Jan Hajič, Pavel Straňák, Jiří Mírovský Institute of Formal and Applied Linguistics & LINDAT/CLARIN School of Computer Science.

Similar presentations


Presentation on theme: "Treebanks and MWEs (Part 1) Jan Hajič, Pavel Straňák, Jiří Mírovský Institute of Formal and Applied Linguistics & LINDAT/CLARIN School of Computer Science."— Presentation transcript:

1 Treebanks and MWEs (Part 1) Jan Hajič, Pavel Straňák, Jiří Mírovský Institute of Formal and Applied Linguistics & LINDAT/CLARIN School of Computer Science Faculty of Mathematics and Physics Charles University in Prague Czech Republic

2 19.1.2015PARSEME Training School Prague 2 Outline Treebanks –Phrase-(Constituency-) based: The Penn Treebank –Dependency: The Prague Dependency Treebanks The Penn Treebank (basics) The Prague Dependency Treebank –Layers of Annotation Morphology Syntax Semantics Valency

3 THE PENN TREEBANK 19.1.2015PARSEME Training School Prague 3

4 19.1.2015PARSEME Training School Prague 4 Phrase- vs. Dependency- Based Treebanks The original: The Penn Treebank –Phrase-based style; good for parsing by CFG grammars Followers –Almost all Penn-based treebanks Chinese, Arabic, Korean, … –Negra (German), many others Now: dependency parsing prevails Conversion from phrase-based treebanks –Might lose information, heads added „ad hoc“ “native” dependency treebanks: annotated as such –Considered “better” –Hindi/Urdu, TIGER (sort of); both styles manually annotated –PDT (of course) and similar ones »PDT style treebanks: Danish, Croatian, Slovene, Greek, Latin

5 The Penn Treebank Published (first) in 1993, now LDC99T42 (www.ldc.upenn.edu)www.ldc.upenn.edu –First the Wall Street Journal part (1 mil. words, 2312 documents) Added other text types –ATIS corpus (dialogs, travel reservations) –Brown corpus annotated for syntax –Switchboard (spoken language, tel. conversations) 19.1.2015PARSEME Training School Prague 5

6 19.1.2015PARSEME Training School Prague 6 Penn Treebank Format ( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) ) (,,) (ADJP (NP (CD 61) (NNS years) ) (JJ old) ) (,,) ) (VP (MD will) (VP (VB join) (NP (DT the) (NN board) ) (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) )) (NP-TMP (NNP Nov.) (CD 29) ))) (..) )) Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. POS tag (NNS) (noun, plural) Phrase label (NP) Noun Phrase “Preterminal”

7 The Penn Treebank(s) Extensions –Annotation of named entities, co-reference (BBN) –cf. also previous slides –Function labels (SBJ, OBJ, TMP,...) –PropBank Penn Treebank syntax + Predicate-argument relations, added “frame files” (predicate dictionary) (S (NP-SBJ (PRP I Arg0) VP (VBD gave Pred) (NP-DOBJ (PRP him Arg1) (NP-IOBJ (DET the) (NN book Arg2)))... ) –NomBank Like PropBank, but for nouns and their “arguments” Other languages (Chinese, Arabic,...) 19.1.2015PARSEME Training School Prague 7

8 THE PRAGUE DEPENDENCY TREEBANK 19.1.2015PARSEME Training School Prague 8

9 19.1.2015PARSEME Training School Prague 9 The Prague Dependency Treebanks: the Basics Original Treebank: PDT 1.0, 2001 (morf., dep. syntax) First full release: PDT 2.0 –http://ufal.mff.cuni.cz/pdt2.0http://ufal.mff.cuni.cz/pdt2.0 LDC2006T01, see http://www.ldc.upenn.eduhttp://www.ldc.upenn.edu –Now: PDT 3.0: http://ufal.mff.cuni.cz/pdt3.0http://ufal.mff.cuni.cz/pdt3.0 Basic general features –Multilayered annotation, interlinked layers –Dependency-based syntax (both surface and deep) –Information structure of the sentence (topic/focus) –Grammatical and basic textual coreference –New: discourse relations, MWEs Languages: Czech, English (also parallel), Arabic –Student work on “samples”: Indonesian, Urdu, Russian, … –Spoken: work started on Czech and English (non-parallel, dialogs)

10 19.1.2015PARSEME Training School Prague 10 The Prague Dependency Treebank Three basic layers of annotation –Morphemic layer –Surface syntax (“analytical”) layer –“Tectogrammatical” layer: underlying syntax, semantic roles (valency), inf. structure, coreference Size –830,000 words (tokens) = 50000 sentences in 3165 full documents (texts) Format –Prague Markup Language (XML- based) –Now also:.treex format For smooth uise in the TreeX platform http://ufal.mff.cuni.cz/treex

11 19.1.2015PARSEME Training School Prague 11 PDT (Czech) Data 4 sources: –Lidové noviny (daily newspaper, incl. extra sections) –DNES (Mladá fronta Dnes) (daily newspaper) –Vesmír (popular science magazine, monthly) –Českomoravský Profit (economical journal, weekly) Full articles selected –article ~ DOCUMENT (basic corpus unit) Time period: 1990-1995 1.8 million tokens (~110,000 sentences total )

12 19.1.2015PARSEME Training School Prague 12 PDT Annotation Layers L0 (w) Words (tokens) –automatic segmentation and markup only L1 (m) Morphology –Tag (full morphology, 13 categories), lemma L2 (a) Analytical layer (surface syntax) –Dependency, analytical dependency function L3 (t) Tectogrammatical layer (“deep” syntax) PDT 3.0 –Dependency, “functor”, grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon; PDT 3.0: mass, clauses, formemes, discourse,... PDT 1.0 (2001) PDT 2.0 (2006)

13 19.1.2015PARSEME Training School Prague 13 PDT Annotation Layers L0 (w) Words (tokens) –automatic segmentation and markup only L1 (m) Morphology –Tag (full morphology, 13 categories), lemma L2 (a) Analytical layer (surface syntax) –Dependency, analytical dependency function L3 (t) Tectogrammatical layer (“deep” syntax) –Dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon

14 19.1.2015PARSEME Training School Prague 14 Morphological Attributes Tag: 13 categories Example: AAFP3----3N---- Adjective no poss. Gender negated Regular no poss. Number no voice Feminine no person reserve1 Plural no tense reserve2 Dative superlative base var. Lemma: POS-unique identifier Books/verb -> book-1, went -> go, to/prep. -> to-1 Ex.: nejnezajímavějším “(to) the most uninteresting”

15 19.1.2015PARSEME Training School Prague 15 PDT Annotation Layers L0 (w) Words (tokens) –automatic segmentation and markup only L1 (m) Morphology –Tag (full morphology, 13 categories), lemma L2 (a) Analytical layer (surface syntax) –Dependency, analytical dependency function L3 (t) Tectogrammatical layer (“deep” syntax) –Dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon

16 19.1.2015PARSEME Training School Prague 16 Layer 2 (a-layer): Analytical Syntax Dependency + Analytical Function dependent governor The influence of the Mexican crisis on Central and Eastern Europe has apparently been underestimated.

17 19.1.2015PARSEME Training School Prague 17 Analytical Syntax: Functions Main (for [main] semantic lexemes): Pred, Sb, Obj, Adv, Atr, Atv(V), AuxV, Pnom “Double” dependency: AtrAdv, AtrObj, AtrAtr Special (function words, punctuation,...): Reflexives, particles: AuxT, AuxR, AuxO, AuxZ, AuxY Prepositions/Conjunctions: AuxP, AuxC Punctuation, Graphics: AuxX, AuxS, AuxG, AuxK Structural Elipsis: ExD, Coordination etc.: Coord, Apos

18 PDT Annotation Layers L0 (w) Words (tokens) –automatic segmentation and markup only L1 (m) Morphology –Tag (full morphology, 13 categories), lemma L2 (a) Analytical layer (surface syntax) –Dependency, analytical dependency function L3 (t) Tectogrammatical layer (“deep” syntax) –Dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon 19.1.2015PARSEME Training School Prague 18

19 Tectogrammatical Annotation Underlying (deep) syntax 5 sublayers (integrated and/or standoff annotation): –dependency structure, (detailed) functors valency annotation –topic/focus and deep word order –coreference (mostly grammatical only) –discourse –all the rest (grammatemes): detailed functors underlying gender, number, mass nouns,... Total: 39 attributes (vs. 5 at m-layer, 2 at a-layer) 19.1.2015PARSEME Training School Prague 19

20 19.1.2015PARSEME Training School Prague 20 Tectogrammatical vs. analytical syntax TR: No function words AR: All words Predicate verb “Location” In practice, that procedure will require making of certified copies. Re-inserted elided actor of “making”

21 Dependency Structure Similar to the surface (Analytical) layer......but: –certain nodes deleted auxiliaries, non-autosemantic words, punctuation (some) multiword expressions -> 1 node –some nodes added based on word (mostly verb, noun) valency some ellipsis resolution –detailed dependency relation labels (functors) 19.1.2015PARSEME Training School Prague 21

22 Tectogrammatical Functors “Actants”: ACT, PAT, EFF, ADDR, ORIG –modify: verbs, nouns, adjectives –cannot repeat in a clause, usually obligatory Free modifications (~ 50), semantically defined –can repeat; optional, sometimes obligatory –Ex.: LOC, DIR1,...; TWHEN, TTILL,...; RSTR; BEN, ATT, ACMP, INTT, MANN; MAT, APP; ID, DPHR, CPHR,... Special –Coordination, Rhematizers, Foreign phrases (#Forn),... 19.1.2015PARSEME Training School Prague 22 “syntactic” semantic MWEs

23 Deep Word Order Topic/Focus Example: Baker bakes rolls. vs. Baker IC bakes rolls. 19.1.2015PARSEME Training School Prague 23 Analytical dep. tree:

24 Deep Word Order Topic/Focus Deep word order: –from “old” information to the “new” one (left-to- right) at every level (head included) –projectivity by definition (almost...) i.e., partial level-based order -> total d.w.o. Topic/focus/contrastive topic –attribute of every node (t, f, c) –restricted by d.w.o. and other constraints 19.1.2015PARSEME Training School Prague 24

25 Coreference Grammatical (easy) –relative clauses which, who –Peter and Paul, who... –control infinitival constructions –John promised to go home –reflexive pronouns {him,her,thme}self(-ves) –Mary saw herself in... 19.1.2015PARSEME Training School Prague 25 John go he home promise PRED ACT PAT ACT DIR3

26 Coreference Textual –Ex.: Peter moved to Iowa after he finished his PhD. 19.1.2015PARSEME Training School Prague 26

27 Grammatemes Detailed functors (“subfunctors”) –needed for some functors: TWHEN: before/after LOC: next-to, behind, in-front-of,... also: ACMP, BEN, CPR, DIR1, DIR2, DIR3, EXT Lexical (underlying) –number (Sg/Pl), tense, modality, degree of comparison, mass-noun?; is_person_name, is_dsp_root,... 19.1.2015PARSEME Training School Prague 27 MWEs

28 Attribute Summary I node types –complex, coap, qcomplex, root, atom,... functor, subfunctor –TWHEN: TWHEN.basic, TWHEN.before is_member, is_generated, is_parenthesis, is_dsp_root, is_state, quot_type, is_preson_name... grammatemes (16): –aspect, degcmp, deontmod, sempos, tense, indeftype, politeness, person,... 19.1.2015PARSEME Training School Prague 28 MWEs

29 Attribute Summary II topic/focus: –tfa, deepord valency: t_lemma, val_frame.rf bookkeeping: id coref_gram.rf, coref_text.rf, compl.rf –reference to TR node, type of coreference sentmod Linking to analytical layer –a.lex.rf (“main” anal. node), a.aux.rf (others) 19.1.2015PARSEME Training School Prague 29

30 VALENCY IN PDT 19.1.2015PARSEME Training School Prague 30

31 Prague Dependency Treebank & Valency Valency in the PDT –Valency lexicon for PDT –General valency lexicon Valency in deep vs. surface syntax –Links between the layers w.r.t. valency Valency and word sense –Sense-disambiguated occurrences: Links from data to the lexicon Valency in translation, text generation 19.1.2015PARSEME Training School Prague 31

32 Definition of Valency Ability (“desire”) of words (verbs, nouns, adjectives) to combine themselves with other units of meaning Properties of valency: –Specific for every word meaning (in general) leave: sb left sth for sb vs. sb left from somewhere similar to PropBank leave.02 vs. leave.01 –Typically strongly correlates with surface form (Czech) morphological case (~ ending), preposition+case,...) –Semantic constraints 19.1.2015PARSEME Training School Prague 32

33 Structure of Valency word (lemma) –word sense group 1 valency frame: –slot 1 slot 2 slot 3 surface expression –word sense group 2... 19.1.2015PARSEME Training School Prague 33 vyměnit (to replace) vyměnit 1 ACT PATEFF Nom. Acc.za+Acc. vyměnit 2...

34 PDT-Vallex Entry dosáhnout: “to reach”, “to get [sb to do sth]” browser/user-formatted example: 19.1.2015PARSEME Training School Prague 34

35 MWEs in PDT-Vallex Types included: –Reflexive particle (se, si) smát se – to laugh všimnout si – to notice –Idiomatic constructions dosáhnout svého - to achieve one’s goals běhá mi mráz po zádech – to give me the shivers –Light verb constructions (and similar) uzavřit dohodu – to agree [on sth], strike an agreement,... vzbuzovat pochybnosti – to doubt, to raise doubts 19.1.2015PARSEME Training School Prague 35 smát_se (t_lemma) DPHR (argument) CPHR (argument)

36 Corpus ↔ Valency Lexicon 19.1.2015PARSEME Training School Prague 36 Corpus: ENTRY: uzavřít (to close) vf 1 : ACT(.1) CPHR({smlouva}.4) ex: u. dohodu (close a contract) vf 2 : ACT(.1) PAT(.4) ex.: u. pokoj (close a room, house) Lexicon: Sentence 2035: Sentence 15345:Sentence 51042:

37 Valency & Text Generation 19.1.2015PARSEME Training School Prague 37 Using valency for... –...getting the correct (lemma, tag) of verb arguments Example: starat_se PRED Martin ACT tygr PAT Martin....1.......... starat V.............. o............... tygr....4.......... VALLEX entry: starat (se) ACT(.1) PAT(o.[.4]) se............... Martin se stará o tygry. “Martin takes care of tigers.” “to take care of” “tiger”

38 PARALLEL TREEBANK CZ-EN 19.1.2015PARSEME Training School Prague 38

39 19.1.2015PARSEME Training School Prague 39 Parallel Czech-English Annotation English text → Czech text (human translation) Czech side (goal): all layers manual annotation English side (goal): –Morphology and surface syntax: technical conversion Penn Treebank style -> PDT Analytic layer –Tectogrammatical annotation: manual annotation (Slightly) different rules needed for English Alignment –Natural, sentence level only (now)

40 19.1.2015PARSEME Training School Prague 40 Human Translation of WSJ Texts Hired translators / FCE level –5 rounds of revision + annotation Specific rules for translation –Sentence per sentence only …to get simple 1:1 alignment –Fluent Czech at the target side –If a choice, prefer “literal” translation The numbers: –English tokens: 1,173,766 –Czech tokens:1,151,150 –Sentence pairs: 49,208

41 19.1.2015PARSEME Training School Prague 41 English Annotation POS and Syntax Automatic conversion from Penn Treebank –PDT morphological layer From POS tags –PDT analytic layer From: –Penn Treebank Syntactic Structure –Non-terminal labels –Function tags (non-terminal “suffixes”) 2-step process –Head determination rules –Conversion to dependency + analytic function

42 19.1.2015PARSEME Training School Prague 42 Head Determination Rules Exhaustive set of rules –By J. Eisner + M. Cmejrek/J. Curin –4000 rules (non-terminal based) Ex.: (S (NP-SBJ VP.)) → VP –Additional rules Coordination, Apposition Punctuation (end-of-sentence, internal) Original idea (possibility of conversion) –J. Robinson (1960s)

43 19.1.2015PARSEME Training School Prague 43 Example: Head Determination Rules (J.E.) (board) (the) (join) (will) (join) (NP (DT NN)) → NN (VP (VB NP)) → VB (VP (MD VP)) → VP (S (… VP …)) → VP Rules:

44 19.1.2015PARSEME Training School Prague 44 Conversion: Analytic Structure, Functions Analytic Function assignment (conversion) Rules, based on functional tags: -SBJ Sb -PRD Pnom -BNF Obj -DTV Obj -LGS Obj-ADV Adv -DIR Adv-EXT Adv -LOC Adv-MNR Adv -PRP Adv-PUT Adv -TMP Adv –Ad-hoc rules (if functional tags missing) –Lemmatization (years → year)

45 19.1.2015PARSEME Training School Prague 45 Example: Analytical Structure, Functions (board) (the) (join) (will) (join) → → Penn Treebank structure (with heads added) PDT-like Analytic Representation

46 Czech-English Example 19.1.2015PARSEME Training School Prague 46 Dicku Darmane, zavolejte do své kanceláře!Dick Darman, call your office!

47 SUMMARY OF PART 1/1 19.1.2015PARSEME Training School Prague 47

48 19.1.2015PARSEME Training School Prague 48 PDT Treebanks at UFAL (written language) Czech –Prague Dependency Treebank Complex annotation, all levels, additional annotation –Translation of Penn Treebank Tectogrammatical layer only, no t/f –Analytical, morphology: automatic tool English –Re-annotation of Penn Treebank Other languages –Arabic (own annotation) –Other: by conversion (HamleDT – 30 treebanks)

49 Prague Dependency Treebanks Annotation: –4 layers: Words, lemmas/tags, surface dep. syntax, tectogrammatics –Tectogrammatical layer: No function words, semantic relations Valency/verb arguments (some MWE features) –Separate valency lexicon, fully linked from PDT nodes Coreference, Topic/focus, Discourse Links back to analytical layer (parsing!) 19.1.2015PARSEME Training School Prague 49 Now also in the Universal Dependency format! https://github.com/UniversalDependencies

50 19.1.2015PARSEME Training School Prague 50 Pointers PDT 2.0 (the “Original”), newest version: PDT 3.0 –http://ufal.mff.cuni.cz/pdt2.0http://ufal.mff.cuni.cz/pdt2.0 –http://ufal.mff.cuni.cz/pdt3.0http://ufal.mff.cuni.cz/pdt3.0 PCEDT –http://ufal.mff.cuni.cz/pcedt2.0/http://ufal.mff.cuni.cz/pcedt2.0/ PEDT –English side of PCEDT, additional: NE, coreference –http://ufal.mff.cuni.cz/pedt2.0/http://ufal.mff.cuni.cz/pedt2.0/ PADT (Arabic, morphology + surface syntax) –http://ufal.mff.cuni.cz/padthttp://ufal.mff.cuni.cz/padt Other corpora, PDT-Vallex, EngVallex: –Search at http://lindat.czhttp://lindat.cz LDC catalog numbers: –LDC2006T01 (PDT 2.0), LDC2004T23 (PADT 1.0), LDC2004T25 (PEDT 1.0) CoNLL 2009 shared task (7 languages, surface syntax + predicate arguments only) –http://ufal.mff.cuni.cz/conll2009-sthttp://ufal.mff.cuni.cz/conll2009-st HamleDT 2.0 (30 treebanks in unified format) –http://ufal.mff.cuni.cz/hamledthttp://ufal.mff.cuni.cz/hamledt


Download ppt "Treebanks and MWEs (Part 1) Jan Hajič, Pavel Straňák, Jiří Mírovský Institute of Formal and Applied Linguistics & LINDAT/CLARIN School of Computer Science."

Similar presentations


Ads by Google