From the Jungle to a Park: Harmonizing Dependency Treebanks of 30 Languages (Praha, 31.10.2011)


1 From the Jungle to a Park: Harmonizing Dependency Treebanks of 30 Languages
Dan Zeman, Martin Popel, David Mareček, Loganathan Ramasamy, Jan Štěpánek, Zdeněk Žabokrtský, Jan Hajič
ÚFAL MFF UK
Praha, 31.10.2011
The research has been supported by the grants P406/11/1499, …

2 Results

3 That Was CoNLL-X
Does it mean that
• Turkish (63.2) and Arabic (66.9) are difficult languages;
• German (87.3) and Japanese (90.1) are easy languages?
Not necessarily…
• Data size ( 50000 training sentences)
• Chance (small test set, big deviation)
• Language differences
• Domain differences (=> sentence length)
• Annotation style differences

4 The TASK
As many languages as possible
Find style differences
Unify styles => normalized treebank
Try alternatives to the unified style => transformations
What is the best for your parser?

5 The TALK
Dan: Data overview, Normalization
Martin: Transformation overview, Experiments, Coordination

6 Languages
Ancient Greek (grc), Arabic (ar), Basque (eu), Bengali (bn), Bulgarian (bg), Catalan (ca), Chinese (zh), Czech (cs), Danish (da), Dutch (nl), English (en), Estonian (et), Finnish (fi), German (de), Greek (el), Hebrew (he), Hindi (hi), Hungarian (hu), Icelandic (is), Italian (it), Japanese (ja), Latin (la), Portuguese (pt), Romanian (ro), Russian (ru), Slovene (sl), Spanish (es), Swedish (sv), Tamil (ta), Telugu (te), Turkish (tr)

7 CoNLL-X: 13 (language list as on slide 6)

8 CoNLL 2007: 10 (language list as on slide 6)

9 CoNLL 2009: 7 (language list as on slide 6)

10 ICON 2009-2010: 3 (language list as on slide 6)

11 Other: 10 (language list as on slide 6)

12 Dependencies (PDT-like) / Converted / Constituents (language list as on slide 6)

13 We Cannot Process (yet)
Chinese (Sinica Treebank)
Hebrew (constituents)
Icelandic (constituents)
The others:
• Various levels of success

14 Data Size

15 Sentence Length

16 Nonprojective Dependencies

17 Nonprojectivity

18 Nonprojective Latin
? the mind carries to say forms changed into new bodies ?
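Nonprojectivity can be checked mechanically from the head indices alone. The following is a minimal sketch, not the tooling used for the slides: it assumes each sentence is given as a list `heads` where `heads[i]` is the 1-based head of token `i+1` and 0 marks the artificial root. It uses the usual definition that an arc is nonprojective when some token lying between head and dependent is not a descendant of the head. The function names are illustrative.

```python
def is_descendant(node, ancestor, heads):
    """Return True if `ancestor` lies on the path from `node` up to the root."""
    while node != 0:
        node = heads[node - 1]  # move one step towards the artificial root (0)
        if node == ancestor:
            return True
    return False


def nonprojective_arcs(heads):
    """List (head, dependent) pairs that are nonprojective.

    Assumes `heads` encodes a well-formed tree (no cycles).
    """
    arcs = []
    for dep, head in enumerate(heads, start=1):
        if head == 0:
            continue  # every token descends from the root, so root arcs are projective
        lo, hi = sorted((head, dep))
        # The arc is nonprojective if any token in the gap is outside head's subtree.
        if any(not is_descendant(k, head, heads) for k in range(lo + 1, hi)):
            arcs.append((head, dep))
    return arcs


# Toy tree: token 2 depends on token 4, but token 3 in between depends on token 1.
print(nonprojective_arcs([0, 4, 1, 1]))  # [(4, 2)]
```

Dividing the number of such arcs by the number of tokens gives one simple per-treebank measure of nonprojectivity.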

19 DZ Interset: Unify Morpho-Tags
t-prppfa-
Pos: verb
Verbform: part
Gender: fem
Number: plu
Case: acc
Tense: past
Voice: pass
Aspect: perf
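The idea of decoding a treebank-specific tag into a shared feature set can be sketched as a position-by-position lookup. This is only an illustration and not the Interset implementation: the table `POSITIONS`, the function `decode`, and the assumed position layout (part of speech, person, number, tense, mood, voice, gender, case, degree) are hypothetical and cover just enough values to reproduce the Latin example from the slide.

```python
# Hypothetical lookup tables, one per character position of a 9-character tag.
# Only a handful of values are filled in; a real decoder would cover the whole tagset.
POSITIONS = [
    {"t": {"pos": "verb", "verbform": "part"}, "n": {"pos": "noun"}},  # part of speech
    {},                                                                # person (not modelled here)
    {"s": {"number": "sing"}, "p": {"number": "plu"}},                 # number
    {"r": {"tense": "past", "aspect": "perf"}},                        # tense
    {"p": {"verbform": "part"}},                                       # mood
    {"a": {"voice": "act"}, "p": {"voice": "pass"}},                   # voice
    {"m": {"gender": "masc"}, "f": {"gender": "fem"}, "n": {"gender": "neut"}},  # gender
    {"n": {"case": "nom"}, "g": {"case": "gen"}, "d": {"case": "dat"}, "a": {"case": "acc"}},  # case
    {},                                                                # degree (not modelled here)
]


def decode(tag):
    """Map a positional tag to a feature dictionary; unknown characters and '-' are skipped."""
    features = {}
    for table, char in zip(POSITIONS, tag):
        features.update(table.get(char, {}))
    return features


print(decode("t-prppfa-"))
```

With one such decoder per tagset and a matching encoder, tags can be converted between treebanks by going through the shared features.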

20 DZ Interset: Unify Morpho-Tags
la: t-prppfa-
en: IN
de: ADJA Pos|Dat|Sg|Neut
ru: S ЕД СРЕД ВИН
fi: ALL|SG|DV-JA|N
pt: pron pron-indp |M|S
ja: PSE
VsFP4--XR--P--- RR--X---------- AANS3----1----- NNNS4---------- NNXSX---------- PDYSX---------- TT-------------

21 Original Morphology: Manual / Auto / Both (language list as on slide 6)

22 We Don't Touch Tokenization
Multi-word expressions = single tokens / nodes in some treebanks
Separated elsewhere
We don't normalize it

23 We Don't Touch Tokenization
NULL nodes
दीवाली के दिन जुआ खेलें मगर NULL घर में या होटल में.
dīvālī ke dina juā kheleṁ magara NULL ghara meṁ yā hoṭala meṁ.
On Diwali they gamble but [they do so] at home or hotel.

24 Dependency Relation Labels
DEPREL column in CoNLL data
Afun (“analytical function”) in Prague Treebanks
Examples:
• Sb, Pred, Obj, Adv, Atr, AuxP [cs]
• nobj [da] … noun object (e.g. of preposition)
• PG [de] … phrasal genitive (von-PP instead of gen)
• AMS [de] … measure argument of adj (zwei Jahre alt)
• k1 (karta / agent), k2 (karma / patient), k3 (karana / instrument), k4 (sampradaana / recipient) … [hi]
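To compare label inventories across treebanks, the DEPREL column can be read straight from the CoNLL-X files. The sketch below assumes the standard ten tab-separated CoNLL-X columns and blank lines between sentences. The three-token sample, including its part-of-speech tags, is made up for illustration and is not taken from any treebank.

```python
from collections import Counter

# The ten columns of the CoNLL-X shared task format.
CONLLX_COLUMNS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
                  "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]


def read_conllx(text):
    """Parse CoNLL-X text into a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(dict(zip(CONLLX_COLUMNS, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences


def deprel_counts(sentences):
    """Count how often each dependency relation label occurs."""
    return Counter(tok["DEPREL"] for sent in sentences for tok in sent)


# Made-up sample with PDT-style labels (Sb, Pred, Obj).
rows = [
    ["1", "Pes", "pes", "N", "N", "_", "2", "Sb", "_", "_"],
    ["2", "vidí", "vidět", "V", "V", "_", "0", "Pred", "_", "_"],
    ["3", "kočku", "kočka", "N", "N", "_", "2", "Obj", "_", "_"],
]
sample = "\n".join("\t".join(row) for row in rows) + "\n"
print(deprel_counts(read_conllx(sample)))
```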

25 Structural Variations
DDT: exotic animal
Prepositions and/or postpositions
Subordinated clauses
Verb groups
Punctuation
Apposition
Coordination
We try to automatically identify these constructions and restructure them as in PDT.

26 Danish Dependency Treebank

27 Prepositions

28 Subordinated Clauses

29 Verb Groups

30 Final Punctuation
On artificial root [cs, ar, sl, grc, ta]
Between artificial root and main predicate [tr]
On main predicate [bg, ca, da, de, en, es, et, fi, hu, …]
On the predicate of the last clause [hi]
On previous token [eu, it, ja, nl]
No punctuation [ru, ro]
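One of the normalization steps implied here, moving sentence-final punctuation to the artificial root as in the [cs] group, could look roughly like the sketch below. It is a simplified illustration rather than the authors' code: it assumes parallel lists `forms` and `heads` (1-based heads, 0 = artificial root), treats a token as punctuation when all its characters fall in a Unicode punctuation category, and assumes the final token has no dependents of its own.

```python
import unicodedata


def is_punctuation(form):
    """True if the token is non-empty and consists only of punctuation characters."""
    return bool(form) and all(unicodedata.category(ch).startswith("P") for ch in form)


def attach_final_punctuation_to_root(forms, heads):
    """Return a copy of `heads` with sentence-final punctuation attached to the root (0)."""
    new_heads = list(heads)
    if forms and is_punctuation(forms[-1]):
        new_heads[-1] = 0
    return new_heads


# Final period attached to the main predicate (token 2) is moved to the artificial root.
print(attach_final_punctuation_to_root(["Dogs", "bark", "."], [2, 0, 2]))  # [2, 0, 0]
```

Treebanks that attach the final punctuation elsewhere (previous token, last clause predicate) are handled by the same reattachment, since only the head of the last token changes.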

31 Paired Punctuation

32 Paired Punctuation

33 Coordination: Mel'čuk

34 Coordination: Prague

35 Coordination: [ro, zh]

36 Coordination: Stanford

37 Coordination: Tesnière
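The differences between three of these coordination styles can be sketched as different head assignments for the same conjuncts. This is a deliberately simplified illustration based on the commonly described versions of the Prague, Stanford and Mel'čuk styles, not a reconstruction of the slide figures: it handles one conjunction, ignores commas and shared modifiers, identifies nodes by word position, and leaves out the [ro, zh] and Tesnière variants. The function name and signature are hypothetical.

```python
def attach_coordination(parent, conjuncts, conjunction, style):
    """Return {node: head} for one coordination under the given style.

    `conjuncts` are word positions in sentence order, `conjunction` is the
    position of the coordinating conjunction, `parent` is the head of the
    whole coordination.
    """
    first = conjuncts[0]
    if style == "prague":
        # The conjunction heads the structure; every conjunct depends on it.
        heads = {conjunction: parent}
        heads.update({c: conjunction for c in conjuncts})
    elif style == "stanford":
        # The first conjunct heads; the other conjuncts and the conjunction depend on it.
        heads = {first: parent, conjunction: first}
        heads.update({c: first for c in conjuncts[1:]})
    elif style == "melcuk":
        # The first conjunct heads a chain; each later element depends on the one before it.
        chain = sorted(conjuncts + [conjunction])
        heads = {chain[0]: parent}
        for prev, node in zip(chain, chain[1:]):
            heads[node] = prev
    else:
        raise ValueError(f"unknown style: {style}")
    return heads


# "apples(1) and(2) pears(3)" governed by a verb at position 0.
for style in ("prague", "stanford", "melcuk"):
    print(style, attach_coordination(0, [1, 3], 2, style))
```

Because each style is a pure function of the same inputs, converting between styles amounts to recognizing the conjuncts and conjunction in one style and re-emitting the heads in another.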

38 END OF PART ONE

39 thank you děkujeme شكرا благодаря তোমাকে ধন্যবাদ gràcies tak danke ευχαριστώ gracias aitäh eskerrik asko kiitos शुक्रिया köszönöm þakka þér grazie ありがとう gratias dank obrigado mulţumesc спасибо hvala tack நன்றி ధన్యవాదాలు teşekkür ederim 謝謝

