Improving Parsing Accuracy by Combining Diverse Dependency Parsers
Daniel Zeman and Zdeněk Žabokrtský
ÚFAL MFF, Univerzita Karlova, Praha
Vancouver, 10.10.2005
Overview
introduction
existing parsers and their accuracies
methods of combination: switching, unbalanced voting
results
conclusion
Dependency Parsing
parse: S → 2^N
S = set of all sentences
N = set of natural numbers
Dependency Parsing
[Example slides: the Czech sentence "Píše dopis svému příteli ." ("He is writing a letter to his friend .") is tokenized, each token receives candidate morphological tags (e.g. VB-S---3P-AA--, NNIS4-----A---, P8ZS3----------), and the dependency tree attaches every token to its parent.]
Prague Dependency Treebank (PDT 1.0)
Czech
1 255 590 training tokens in 73 088 non-empty sentences
63 353 tune tokens in 3 646 sentences
62 677 test tokens in 3 673 sentences
accuracy = percentage of tokens with correctly assigned parent nodes (each dependency tree has an artificial root node)
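For concreteness, a parse can be stored as a list of parent indices (0 = artificial root), and the accuracy defined above then reduces to a per-token comparison. A minimal illustrative sketch (the representation, function name, and example parents are our own, not taken from the slides):

```python
def attachment_accuracy(gold_parents, pred_parents):
    """Percentage of tokens whose predicted parent index matches the gold one.

    Both arguments are lists of parent indices, one per token;
    index 0 denotes the artificial root node.
    """
    assert len(gold_parents) == len(pred_parents)
    correct = sum(1 for g, p in zip(gold_parents, pred_parents) if g == p)
    return 100.0 * correct / len(gold_parents)

# "Píše dopis svému příteli ." -- tokens 1..5, hypothetical parents as indices
gold = [0, 1, 4, 1, 1]          # assumed gold parents for the 5 tokens
pred = [0, 1, 1, 1, 1]          # a parser's output; token 3 attached wrongly
print(attachment_accuracy(gold, pred))   # 80.0
```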
Existing Parsers
For Czech (PDT; accuracies on the Tune set):
83.6 % Eugene Charniak’s (ec) [ported from English]
81.7 % Michael Collins’ (mc) [ported from English]
74.3 % Zdeněk Žabokrtský’s (zz) [hand-made rules]
73.8 % Daniel Zeman’s (dz) [dependency n-grams]
Tomáš Holan’s:
71.0 % pshrt
69.5 % left-to-right [push-down automaton]
62.0 % right-to-left [push-down automaton]
More Existing Parsers
New parsers (2005):
Nivre & Jenssen [push-down automaton]
McDonald & Ribarov [maximum spanning tree]
EC++ (Hall & Novák)
No accuracy figures for our Tune set, but they are better than most of the parsers in our pool.
Good Old Truth: Two Heads Are Better Than One!
van Halteren et al.: tagging
Brill and Wu: tagging
Brill and Hladká: bagging parsers
Henderson and Brill: constituent parsing
Frederking and Nirenburg: machine translation
Fiscus: speech recognition
Borthwick: named entity recognition
Inui and Inui: partial parsing
Florian and Yarowsky: word sense disambiguation
Chu-Carroll et al.: question answering
Voting
Question: “What is the index of the parent of the i-th node?”
Answers: ec: “7”, mc: “7”, zz: “5”, dz: “11”
Resulting answer: “7”
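A minimal sketch of this per-node vote (the function name and tie handling are our illustrative choices; the slides resolve the no-majority case by backing off to ec, as discussed below):

```python
from collections import Counter

def vote_parent(proposals):
    """Pick the parent index proposed by the largest number of parsers.

    `proposals` maps parser name -> proposed parent index for one node.
    Ties are broken arbitrarily here; the actual combination backs off
    to the best single parser (ec) when no majority exists.
    """
    counts = Counter(proposals.values())
    parent, _ = counts.most_common(1)[0]
    return parent

print(vote_parent({"ec": 7, "mc": 7, "zz": 5, "dz": 11}))  # 7
```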
Emerging Issues
Are the parsers different enough to contribute uniquely?
What do we do if all parsers disagree?
What if the resulting structure is not a tree?
Uniqueness of a Parser
How many parents did only parser X find?
pool of 7 parsers (ec, mc, zz, dz, thr, thl, thp), test data set
ec: 1.7 %
zz: 1.2 % (rule-based parser, no statistics, handles non-projectivities!)
mc: 0.9 %
others: 0.3 – 0.4 %
Uniqueness of a Parser
How many parents did only parser X find?
four best parsers (ec, mc, zz, dz), test data set
ec: 3.0 %
zz: 2.0 % (rule-based parser, no statistics, handles non-projectivities!)
mc: 1.7 %
dz: 1.0 %
Uniqueness of a Parser
How many parents did only parser X find?
two best parsers (ec, mc), test data set
ec: 8.1 %
mc: 6.2 %
Uniqueness of a Parser
The unique contributions are hard to push through.
The real strength lies where the parsers agree (voting).
Majority vs. Oracle (test data; ec alone: 85.0 %)
Pool: 7 parsers (ec mc zz dz thr thl thp) / 4 best (ec mc zz dz) / 3 best (ec mc zz)
Majority (correct parent proposed by >half of the parsers): 76.8 % / 75.1 % / 82.9 %
Oracle (correct parent proposed by at least one parser): 95.8 % / 94.0 % / 93.0 %
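Both bounds follow directly from per-token correctness; a small illustrative helper (our own code, not from the paper):

```python
def majority_and_oracle(gold, parser_outputs):
    """Majority and oracle accuracies for a pool of parsers.

    `gold` is the list of gold parent indices; `parser_outputs` is a list of
    predicted parent-index lists, one per parser.
    Majority: share of tokens where more than half of the parsers are correct.
    Oracle:   share of tokens where at least one parser is correct.
    """
    n_tokens = len(gold)
    n_parsers = len(parser_outputs)
    majority = oracle = 0
    for i, g in enumerate(gold):
        correct = sum(1 for out in parser_outputs if out[i] == g)
        if correct > n_parsers / 2:
            majority += 1
        if correct >= 1:
            oracle += 1
    return 100.0 * majority / n_tokens, 100.0 * oracle / n_tokens
```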
Majority Voting
Three parsers (ec+mc+zz): 83 % of the parents are known to at least two parsers (majority).
However, ec alone achieves 85 %!
For some parents there is no majority (ec, mc and zz all disagree). In such cases, use ec’s opinion.
Together: 86.7 %
Weighting the Parsers
We have backed off to ec. Why? It’s the best parser of all!
How do we know? We can measure the accuracies on the Tune data set.
Can we use the accuracies in a more sophisticated way?
Weighting the Parsers
A parser gets as many votes as the percentage points of accuracy it achieves.
E.g., mc+zz would outvote ec+thr: 81.7 + 74.3 = 156.0 > 154.6 = 83.6 + 71.0
Context is not taken into account (so far).
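A sketch of this weighted voting rule, reusing the Tune-set accuracies listed earlier as vote weights (the function and dictionary names are our own):

```python
# Tune-set accuracies used as vote weights (from the "Existing Parsers" slide;
# the weight for thr comes from the example on this slide)
WEIGHTS = {"ec": 83.6, "mc": 81.7, "zz": 74.3, "dz": 73.8, "thr": 71.0}

def weighted_vote_parent(proposals, weights=WEIGHTS):
    """Each parser votes for its proposed parent with its tune-set accuracy
    as the vote weight; the parent with the largest total weight wins."""
    totals = {}
    for parser, parent in proposals.items():
        totals[parent] = totals.get(parent, 0.0) + weights[parser]
    return max(totals, key=totals.get)

# mc+zz (81.7 + 74.3 = 156.0) outvote ec+thr (83.6 + 71.0 = 154.6)
print(weighted_vote_parent({"ec": 3, "thr": 3, "mc": 5, "zz": 5}))  # 5
```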
Context
Hope: one parser is good at, for example, PP attachment, while another knows how to build coordination.
Features such as the morphology of the dependent node may help to find the right parser.
The context-sensitive combining classifier was trained on the Tune data set.
Context Features
For each node (the dependent, and the parents proposed by the respective parsers): part of speech, subcategory, gender, number, case, inner gender, inner number, person, degree of comparison, negativeness, tense, voice, semantic flags (proper name, geography…)
For each governor-dependent pair: mutual position (left neighbor, right far…)
For each parser pair: do the two parsers agree?
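A rough sketch of how such a feature vector for one node could be assembled (the field names, the PDT tag positions used, and the helper itself are our illustrative assumptions, not the exact feature set of the paper):

```python
def context_features(node, proposals, tags):
    """Build a feature dict for one dependent node.

    `node` is the index of the dependent token,
    `proposals` maps parser name -> proposed parent index,
    `tags` maps token index -> positional morphological tag (PDT style).
    """
    feats = {}
    # morphology of the dependent and of each proposed parent
    for label, idx in [("dep", node)] + [(p, proposals[p]) for p in proposals]:
        tag = tags.get(idx, "root")
        feats[f"{label}_pos"] = tag[0] if tag != "root" else "root"
        feats[f"{label}_case"] = tag[4] if tag != "root" else "root"
    # mutual position of each proposed governor and the dependent
    for p, parent in proposals.items():
        feats[f"{p}_dir"] = "left" if parent < node else "right"
    # pairwise agreement between parsers
    parsers = sorted(proposals)
    for i, a in enumerate(parsers):
        for b in parsers[i + 1:]:
            feats[f"agree_{a}_{b}"] = proposals[a] == proposals[b]
    return feats
```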
Decision Trees
We have trained C5 (Quinlan).
Very minor improvement (0.1 %).
The resulting decision trees are quite simple.
They mimic voting: the parser agreement flags are the most important features (in fact, this is not context).
Did not help with just two parsers (ec+mc): no voting possible.
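The slides use Quinlan’s C5; as a rough stand-in, the same setup can be tried with an off-the-shelf decision tree. The sketch below (toy data, scikit-learn instead of C5, all names our own) only illustrates how the tune-set feature dicts and winning-parser labels would feed a classifier:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the tune-set data: one feature dict per node (as produced by
# context_features() above) and, as the label, a parser that got the node right.
X = [
    {"agree_mc_zz": True,  "dep_case": "4"},
    {"agree_mc_zz": False, "dep_case": "6"},
    {"agree_mc_zz": False, "dep_case": "1"},
]
y = ["zz", "zz", "ec"]

clf = make_pipeline(DictVectorizer(sparse=False), DecisionTreeClassifier())
clf.fit(X, y)
print(clf.predict([{"agree_mc_zz": True, "dep_case": "4"}]))  # ['zz']
```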
Example of a Decision Tree (C5 output)
agreezzmc = yes: zz (3041/1058)
agreezzmc = no:
:...agreemcec = yes: ec (7785/1026)
    agreemcec = no:
    :...agreezzec = yes: ec (2840/601)
        agreezzec = no:
        :...zz_case = 6: zz (150/54)
            zz_case = 3: zz (34/10)
            zz_case = X: zz (37/20)
            zz_case = undef: ec (2006/1102)
            zz_case = 7: zz (83/48)
            zz_case = 2: zz (182/110)
            zz_case = 4: zz (108/57)
            zz_case = 1: ec (234/109)
            zz_case = 5: mc (1)
            zz_case = root:
            :...ec_negat = A: mc (117/65)
                ec_negat = undef: ec (139/65)
                ec_negat = N: ec (1)
                ec_negat = root: ec (2)
It is not guaranteed that the result is a tree!
[Figure: three nodes (1, 2, 3) whose combined parent choices, each taken from a different parser’s tree, form a cycle instead of a tree.]
Note: We Actually May Be Willing to Accept Non-Trees
The way accuracy is computed motivates us to look at individual nodes, not at the whole structure.
Suppose one edge in a cycle is wrong but we do not know which one; all the others are good. If we break the cycle at the wrong edge, we end up with two wrong edges instead of one.
When only partial relations are sought, the whole structure may not matter.
How to Preserve Treeness
In each step (adding a new dependency), rule out parsers whose proposal would introduce a cycle.
If all parsers propose cycles, abandon the whole structure and use ec’s tree as is instead.
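A minimal sketch of the cycle check used to rule out proposals (our own helper; the slides give no implementation details):

```python
def would_create_cycle(parents, child, proposed_parent):
    """Return True if attaching `child` to `proposed_parent` would close a cycle,
    given the partial structure `parents` (child index -> parent index, 0 = root)."""
    node = proposed_parent
    while node != 0:                 # walk up towards the artificial root
        if node == child:            # we reached the child again: a cycle
            return True
        node = parents.get(node, 0)  # unattached nodes count as reaching the root
    return False

# node 3 is already attached under node 2, so attaching 2 under 3 would be a cycle
parents = {3: 2}
print(would_create_cycle(parents, child=2, proposed_parent=3))  # True
print(would_create_cycle(parents, child=2, proposed_parent=1))  # False
```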
Results
Baseline (ec): 85.0 %
Four parsers (ec+mc+zz+dz), cycles allowed: 87.0 % (91.6 % of the structures are trees)
Four parsers (ec+mc+zz+dz), cycles banned: 86.9 %
(sorry for the typos in the paper, sec. 5.4)
Unbalanced Combination (Brill & Hladká in Hajič et al., 1998)
Is precision more important to us than recall? Better to say nothing than to make a mistake.
That may be our priority when:
preprocessing text for annotators
extracting various phenomena from a corpus (if a sentence has no parse, never mind, we simply will not extract anything from it)
Unbalanced Combination
Include only dependencies proposed by at least half of the parsers. Some nodes won’t get a parent.
Results for 7 parsers: precision 90.7 %, recall 78.6 %, f-measure 84.2 %
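A sketch of one reading of this rule (our own code): every parent supported by at least half of the parsers is kept, so a node may end up with no parent at all, or, with an even pool, with two candidate parents, which is why the next slide sees recall rise above precision.

```python
from collections import Counter

def unbalanced_combine(node_proposals, n_parsers):
    """Return every parent proposed by at least half of the parsers for one node.

    With an odd pool at most one parent can qualify; with an even pool two
    candidates can tie at exactly half, so a node may receive two parents.
    `node_proposals` maps parser name -> proposed parent index.
    """
    counts = Counter(node_proposals.values())
    return [parent for parent, support in counts.items()
            if support >= n_parsers / 2]

print(unbalanced_combine({"ec": 7, "mc": 7, "zz": 5, "dz": 5}, 4))  # [7, 5]
print(unbalanced_combine({"ec": 7, "mc": 5, "zz": 3, "dz": 2}, 4))  # []
```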
Unbalanced Combination
Interesting: unbalanced voting with an even number of parsers prefers recall over precision!
Sometimes one half of the parsers proposes one parent while the other half agrees on another candidate.
Results for the 4 best parsers: precision 85.4 %, recall 87.7 %, f-measure 86.5 %
Related Work
Brill and Hladká combined several “parsers”: in fact one parser, trained on different bags of the training data. 6 % error reduction, cf. our 13 %.
Henderson and Brill combined three constituency-based parsers. They did not find context helpful either. Lemma: majority voting over constituents introduces no crossing brackets.
Summary
Combination techniques successfully applied to dependency parsing.
Keeping treeness is not too expensive (in terms of accuracy).
Future Work
We are preparing voting rights for the new parsers (Nivre/Jenssen, Ribarov/McDonald, Charniak/Hall/Novák).
As these parsers are better than most of our current parser pool, we expect the results to improve, provided the new parsers are able to contribute new ideas.
Thank you.