Published by Eleanore Montgomery. Modified over 8 years ago.
1
Praha, 31.10.2011
From the Jungle to a Park: Harmonizing Dependency Treebanks of 30 Languages
Dan Zeman, Martin Popel, David Mareček, Loganathan Ramasamy, Jan Štěpánek, Zdeněk Žabokrtský, Jan Hajič
ÚFAL MFF UK
The research has been supported by the grants P406/11/1499, …
2
Results
3
That Was CoNLL-X
Does it mean that Turkish (63.2) and Arabic (66.9) are difficult languages; German (87.3) and Japanese (90.1) are easy languages?
Not necessarily…
  Data size ( 50000 training sentences)
  Chance (small test set, big deviation)
  Language differences
  Domain differences (=> sentence length)
  Annotation style differences
4
The TASK
As many languages as possible
Find style differences
Unify styles => normalized treebank
Try alternatives to the unified style => transformations
What is the best for your parser?
5
The TALK
Dan:
  Data overview
  Normalization
Martin:
  Transformation overview
  Experiments
  Coordination
6
Languages
Ancient Greek (grc), Arabic (ar), Basque (eu), Bengali (bn), Bulgarian (bg), Catalan (ca), Chinese (zh), Czech (cs), Danish (da), Dutch (nl), English (en), Estonian (et), Finnish (fi), German (de), Greek (el), Hebrew (he), Hindi (hi), Hungarian (hu), Icelandic (is), Italian (it), Japanese (ja), Latin (la), Portuguese (pt), Romanian (ro), Russian (ru), Slovene (sl), Spanish (es), Swedish (sv), Tamil (ta), Telugu (te), Turkish (tr)
7
CoNLL-X: 13 [languages highlighted within the full language list; highlighting not preserved]
8
CoNLL 2007: 10 [languages highlighted within the full language list; highlighting not preserved]
9
CoNLL 2009: 7 [languages highlighted within the full language list; highlighting not preserved]
10
ICON 2009-2010: 3 [languages highlighted within the full language list; highlighting not preserved]
11
Other: 10 [languages highlighted within the full language list; highlighting not preserved]
12
Dependencies (PDT-like) / Converted / Constituents [legend for the full language list; per-language marking not preserved]
13
We Cannot Process (yet)
Chinese (Sinica Treebank)
Hebrew (constituents)
Icelandic (constituents)
The others: Various levels of success
14
Data Size
15
Sentence Length
16
Nonprojective Dependencies
17
Nonprojectivity
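The slide's figure is not preserved, so here is a minimal sketch of how nonprojective arcs can be detected. It uses the common definition that an arc is projective when every token lying between the head and the dependent is a descendant of the head. The function name and the head-list encoding (1-based token ids, 0 for the artificial root) are assumptions made for illustration, not the authors' tooling, and the input is assumed to be a well-formed tree.

```python
def nonprojective_arcs(heads):
    """Return (head, dependent) pairs of nonprojective arcs.

    heads[i - 1] is the head of token i; tokens are numbered from 1
    and 0 stands for the artificial root. Assumes a valid tree (no cycles).
    """
    def is_descendant(node, ancestor):
        # Climb from node towards the root, looking for ancestor.
        while node != 0:
            node = heads[node - 1]
            if node == ancestor:
                return True
        return False

    result = []
    for dep in range(1, len(heads) + 1):
        head = heads[dep - 1]
        lo, hi = sorted((head, dep))
        # The arc is nonprojective if some token in the gap
        # is not dominated by the head of the arc.
        for between in range(lo + 1, hi):
            if not is_descendant(between, head):
                result.append((head, dep))
                break
    return result


# Hypothetical 4-token tree: token 2 depends on token 4,
# but token 3 in between hangs on token 1.
EXAMPLE_HEADS = [0, 4, 1, 1]
```

For `EXAMPLE_HEADS` the only reported arc is `(4, 2)`; a fully projective tree such as `[2, 0, 2]` yields an empty list.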
18
Nonprojective Latin
the mind carries to say forms changed into new bodies
19
DZ Interset: Unify Morpho-Tags
t-prppfa-
Pos: verb
Verbform: part
Gender: fem
Number: plu
Case: acc
Tense: past
Voice: pass
Aspect: perf
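As an illustration of the idea, a positional tag such as `t-prppfa-` can be decoded one character at a time into tagset-independent features. The sketch below is not Interset's actual driver or API. The position layout (part of speech, person, number, tense, mood, voice, gender, case, degree) is an assumed reading of a nine-position Latin tag, the tables cover only a handful of values, and the feature names and values follow the slide.

```python
# One lookup table per tag position; each value contributes features.
# Assumed layout: pos, person, number, tense, mood, voice, gender, case, degree.
POSITION_TABLES = [
    {"t": {"pos": "verb", "verbform": "part"}},            # 1: part of speech
    {},                                                    # 2: person (not covered)
    {"s": {"number": "sing"}, "p": {"number": "plu"}},     # 3: number
    {"r": {"tense": "past", "aspect": "perf"}},            # 4: tense
    {"p": {"verbform": "part"}},                           # 5: mood
    {"a": {"voice": "act"}, "p": {"voice": "pass"}},       # 6: voice
    {"m": {"gender": "masc"}, "f": {"gender": "fem"},
     "n": {"gender": "neut"}},                             # 7: gender
    {"n": {"case": "nom"}, "g": {"case": "gen"},
     "d": {"case": "dat"}, "a": {"case": "acc"}},          # 8: case
    {},                                                    # 9: degree (not covered)
]


def decode(tag):
    """Map a positional tag to a dict of unified features.

    Characters missing from the tables (including '-') add nothing.
    """
    features = {}
    for table, char in zip(POSITION_TABLES, tag):
        features.update(table.get(char, {}))
    return features


SLIDE_TAG = "t-prppfa-"
```

`decode(SLIDE_TAG)` yields the eight features listed on the slide. Going the other way, from the shared feature set into another tagset, would need a separate table per target tagset.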
20
DZ Interset: Unify Morpho-Tags
la: t-prppfa-
en: IN
de: ADJA Pos|Dat|Sg|Neut
ru: S ЕД СРЕД ВИН
fi: ALL|SG|DV-JA|N
pt: pron pron-indp |M|S
ja: PSE
VsFP4--XR--P---
RR--X----------
AANS3----1-----
NNNS4----------
NNXSX----------
PDYSX----------
TT-------------
21
Original Morphology
Manual / Auto / Both [legend for the full language list; per-language marking not preserved]
22
We Don't Touch Tokenization
Multi-word expressions = single tokens / nodes in some treebanks
Separated elsewhere
We don't normalize it
23
We Don't Touch Tokenization
NULL nodes
दीवाली के दिन जुआ खेलें मगर NULL घर में या होटल में.
dīvālī ke dina juā kheleṁ magara NULL ghara meṁ yā hoṭala meṁ.
On Diwali they gamble but [they do so] at home or hotel.
24
Dependency Relation Labels
DEPREL column in CoNLL data
Afun (“analytical function”) in Prague Treebanks
Examples:
  Sb, Pred, Obj, Adv, Atr, AuxP [cs]
  nobj [da] … noun object (e.g. of preposition)
  PG [de] … phrasal genitive (von-PP instead of gen)
  AMS [de] … measure argument of adj (zwei Jahre alt)
  k1 (karta / agent), k2 (karma / patient), k3 (karana / instrument), k4 (sampradaana / recipient) … [hi]
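To make the DEPREL column concrete, here is a small sketch that reads tab-separated CoNLL-X lines (ten columns: ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and extracts each token's form, head and relation label. The function name and the two-token sample sentence are invented for illustration; the tags in the sample are placeholders, and only the labels Sb and Pred come from the slide.

```python
CONLLX_COLUMNS = [
    "ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
    "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL",
]


def read_deprels(conll_text):
    """Return (form, head, deprel) triples from CoNLL-X formatted text.

    Blank lines (sentence separators) are skipped.
    """
    rows = []
    for line in conll_text.splitlines():
        if not line.strip():
            continue
        token = dict(zip(CONLLX_COLUMNS, line.split("\t")))
        rows.append((token["FORM"], int(token["HEAD"]), token["DEPREL"]))
    return rows


# Invented sample: "Pes spí" ("The dog sleeps") with Prague-style labels.
SAMPLE = "\n".join([
    "\t".join(["1", "Pes", "pes", "N", "N", "_", "2", "Sb", "_", "_"]),
    "\t".join(["2", "spí", "spát", "V", "V", "_", "0", "Pred", "_", "_"]),
])
```

`read_deprels(SAMPLE)` returns one triple per token. Collecting the third element over a whole treebank gives that treebank's label inventory, which is the thing that differs from one annotation scheme to the next.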
25
Structural Variations
DDT: exotic animal
Prepositions and/or postpositions
Subordinated clauses
Verb groups
Punctuation
Apposition
Coordination
We try to automatically identify these constructions and restructure them as in PDT.
26
Danish Dependency Treebank
27
Prepositions
28
Subordinated Clauses
29
Verb Groups
30
Final Punctuation
On artificial root [cs, ar, sl, grc, ta]
Between artificial root and main predicate [tr]
On main predicate [bg, ca, da, de, en, es, et, fi, hu, …]
On the predicate of the last clause [hi]
On previous token [eu, it, ja, nl]
No punctuation [ru, ro]
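One way to normalize these variants towards the first option (final punctuation on the artificial root) is sketched below. This is a simplified illustration, not the authors' implementation: the function name, the set of sentence-final marks, and the head-list encoding (1-based token ids, 0 for the artificial root) are assumptions. It leaves the tree unchanged when something depends on the punctuation token, which covers the Turkish-like case where the punctuation sits between the root and the predicate.

```python
def attach_final_punct_to_root(forms, heads, final_marks=(".", "!", "?")):
    """Return a copy of heads with sentence-final punctuation on the root.

    forms[i - 1] and heads[i - 1] describe token i; 0 is the artificial root.
    The tree is left unchanged if the last token is not punctuation or if
    any other token depends on it.
    """
    new_heads = list(heads)
    if forms and forms[-1] in final_marks:
        last_id = len(forms)
        if last_id not in heads:  # punctuation must be a leaf
            new_heads[-1] = 0
    return new_heads


# Hypothetical input with the final period attached to the previous token.
FORMS = ["They", "gamble", "today", "."]
HEADS = [2, 0, 2, 3]
```

Here `attach_final_punct_to_root(FORMS, HEADS)` gives `[2, 0, 2, 0]`. A treebank with no punctuation tokens at all, as listed for [ru, ro], passes through unchanged.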
31
Paired Punctuation
32
Paired Punctuation
33
Coordination: Mel'čuk
34
Coordination: Prague
35
Coordination: [ro, zh]
36
Coordination: Stanford
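The diagrams for the coordination styles are not preserved. As a rough sketch of what converting between two of them involves, the code below rewrites a flat Stanford-style coordination (first conjunct as head, with the conjunction and the other conjuncts attached to it via `cc` and `conj`) into a Prague-style one (the conjunction as head, with all conjuncts attached to it). The function name and the list encoding are assumptions, and only non-nested coordination is handled. It updates heads only: the relation labels are not rewritten, and shared modifiers and punctuation are left alone.

```python
def stanford_to_prague(heads, deprels):
    """Re-head flat coordinations so that the conjunction governs the conjuncts.

    heads[i - 1] / deprels[i - 1] describe token i; 0 is the artificial root.
    Returns a new head list; deprels are not modified.
    """
    new_heads = list(heads)
    n = len(heads)
    for first in range(1, n + 1):
        children = [i for i in range(1, n + 1) if heads[i - 1] == first]
        conjunctions = [i for i in children if deprels[i - 1] == "cc"]
        conjuncts = [i for i in children if deprels[i - 1] == "conj"]
        if not conjunctions or not conjuncts:
            continue  # token `first` does not head a coordination
        coord_head = conjunctions[-1]
        # The conjunction takes over the first conjunct's attachment...
        new_heads[coord_head - 1] = heads[first - 1]
        # ...and all conjuncts (plus any extra conjunctions) hang on it.
        for member in [first] + conjuncts + conjunctions[:-1]:
            new_heads[member - 1] = coord_head
    return new_heads


# Hypothetical fragment "apples and pears" in Stanford style.
HEADS = [0, 1, 1]
DEPRELS = ["root", "cc", "conj"]
```

For this fragment the result is `[2, 0, 2]`: "and" attaches to the root and governs both "apples" and "pears".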
37
Coordination: Tesnière
38
END OF PART ONE
39
thank you děkujeme شكرا благодаря তোমাকে ধন্যবাদ gràcies tak danke ευχαριστώ gracias aitäh eskerrik asko kiitos शुक्रिया köszönöm þakka þér grazie ありがとう gratias dank obrigado mulţumesc спасибо hvala tack நன்றி ధన్యవాదాలు teşekkür ederim 謝謝