Human Judgements in Parallel Treebank Alignment
Martin Volk, Torsten Marek, Yvonne Samuelsson
University of Zurich and Stockholm University


1 Human Judgements in Parallel Treebank Alignment Martin Volk, Torsten Marek, Yvonne Samuelsson University of Zurich and Stockholm University volk@cl.uzh.ch

2 23 August 2008 English Syntax Tree


5 DE – EN Alignment

6 SMULTRON: Stockholm MULtilingual TReebank
- 1000 sentences in 3 languages (DE-EN-SV)
  - 500 from Jostein Gaarder's Sophie's World (~7 500 tokens, 14 tokens/sentence)
  - 500 from economy texts (~11 000 tokens, 22 tokens/sentence)
    - ABB quarterly report
    - Rainforest Alliance: Banana Certification Program
    - SEB annual report
- Released: January 2008
- www.ling.su.se/dali/research/smultron/index.htm

7 German Annotation

8 German sentence: flat annotation

9 German sentence: deepened

10 English Annotation

11 English Syntax Tree

12 English annotation
- Follows the Penn Treebank guidelines
- Slower annotation because of
  - insertion of traces
  - secondary edges
  - deeper trees


14 Tree Alignment

15
- Sentence alignment
- Word alignment
  - input for Statistical MT
- Phrase alignment
  - linguistically motivated phrases
  - input for Example-based MT

16 Alignment Example

17 Tools for Parallel Treebanks
- creating and editing trees
  - from monolingual treebanks
  - PoS taggers, chunkers, editor, "tree-enricher"
- aligning phrases
  - use of word alignment tools
  - tree alignment editor
  - Stockholm TreeAligner
- searching across languages
  - TIGER-Search for parallel treebanks
  - Stockholm TreeAligner

18 Guidelines for Alignment
1. Align words and phrases that represent the same meaning and could serve as translation units in an MT system.
2. Align as many words and phrases as possible.
3. Distinguish between exact and approximate alignments.
4. 1:n word / phrase alignments are allowed, but not m:n word / phrase alignments.
5. m:n sentence alignments are allowed.
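Guideline 4 lends itself to an automatic check. The following is a minimal sketch, not part of the original slides: it assumes alignment links are stored as (source node, target node) pairs, and it reads "m:n" as any group of links in which several nodes on one side are tied to several nodes on the other. The function name and the node identifiers are hypothetical.

```python
from collections import Counter


def find_m_to_n_links(links):
    """Return the links that take part in an m:n configuration.

    `links` is an iterable of (source_node, target_node) pairs for
    word or phrase alignments. A link is flagged when its source node
    has more than one link AND its target node has more than one link;
    pure 1:n or n:1 groups (stars) are never flagged.
    """
    unique = sorted(set(links))  # drop duplicate links, fix the order
    src_degree = Counter(src for src, _ in unique)
    tgt_degree = Counter(tgt for _, tgt in unique)
    return [
        (src, tgt)
        for src, tgt in unique
        if src_degree[src] > 1 and tgt_degree[tgt] > 1
    ]


# Hypothetical node identifiers, for illustration only.
one_to_n = [("de_1", "en_1"), ("de_1", "en_2")]
m_to_n = [("de_1", "en_1"), ("de_1", "en_2"), ("de_2", "en_2")]

print(find_m_to_n_links(one_to_n))  # []
print(find_m_to_n_links(m_to_n))    # [('de_1', 'en_2')]
```

The flagged link is the one that joins two otherwise legal groups, so removing or re-targeting it restores a 1:n configuration.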

19 Examples
- Do not align:
  - die Verwunderung über das Leben (literally "the astonishment at life")
  - their astonishment at the world
- Do align:
  - was für eine seltsame Welt (literally "what a strange world")
  - what an extraordinary world

20 Specific rules
- a pronoun in one language shall never be aligned with a full noun in the other
- names are aligned regardless of spelling, unless the name is changed (fiction)
- ignore number/case but not voice

21 Exact vs approximate alignment
- best vs. "second-best" translation
- an acronym in one language shall be aligned as approximate (fuzzy) with a spelled-out term in the other
  - PT – Power Technologies
- difficult distinctions
  - einer der ersten Tage im Mai (literally "one of the first days in May") – early May

22 Related Research
- Blinker project (Melamed)
- Prague Czech-English Treebank
- Example-based MT in Dublin
- Linköping English-Swedish Treebank

23 Experiment
- 12 students were asked to align 20 tree pairs DE-EN
  - 10 tree pairs from Sophie's World
  - 10 tree pairs from economy text
- the students (advanced CL students) received
  - a short introduction
  - the written guidelines

24 Gold Standard Alignment (DE-EN)

                  word - word         phrase - phrase
                  exact   approx.     exact   approx.
10 sent. Sophie   75      3           46      12
  total           78                  58
10 sent. Econ     159     19          62      9
  total           178                 71
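The gold standard counts can be cross-checked with a few lines of arithmetic. This sketch is not from the slides; it assumes the flattened figures read as exact/approximate counts per level (75, 3, 46, 12 for Sophie; 159, 19, 62, 9 for Econ), with 78/58 and 178/71 as the per-level totals. The dictionary layout and function names are illustrative.

```python
# Gold standard alignment counts as read from the slide (assumed layout).
GOLD = {
    "Sophie": {"word": {"exact": 75, "approx": 3},
               "phrase": {"exact": 46, "approx": 12}},
    "Econ": {"word": {"exact": 159, "approx": 19},
             "phrase": {"exact": 62, "approx": 9}},
}


def totals(part):
    """Sum exact and approximate links per level, plus an overall total."""
    per_level = {level: sum(counts.values())
                 for level, counts in GOLD[part].items()}
    per_level["all"] = sum(per_level.values())
    return per_level


def approx_share(part):
    """Fraction of all gold links in `part` marked as approximate."""
    approx = sum(counts["approx"] for counts in GOLD[part].values())
    return approx / totals(part)["all"]


for part in GOLD:
    print(part, totals(part), f"{approx_share(part):.1%} approximate")
```

The per-level sums reproduce the totals on the slide (78 and 58 for Sophie, 178 and 71 for Econ), which supports this reading of the table.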

25 Experiment: Results
- The number of alignments varied widely across students
  - Sophie part: from 47 to 125 (ø = 94.3)
  - Econ part: from 62 to 259 (ø = 186.9)
- the 3 students with the lowest numbers were non-native speakers of German
- 1 student had misunderstood the task

26 Experiment: Results
- The remaining 8 students had a high overlap with the gold standard (Recall):
  - Sophie part: from 48% to 81% (ø = 68.7%)
  - Econ part: from 66% to 89% (ø = 75.5%)
- Precision
  - Sophie part: from 81% to 97% (ø = 89.1%)
  - Econ part: from 78% to 94% (ø = 88.2%)
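Precision and recall against a gold standard are usually computed over sets of alignment links. The sketch below shows that standard calculation; the slides do not say how links were matched (for instance, whether the exact/approximate label had to agree), so treating a link as a bare (source, target) pair is an assumption, and the node identifiers are made up.

```python
def precision_recall(student_links, gold_links):
    """Compare a student's alignment links with the gold standard.

    Both arguments are iterables of (source_node, target_node) pairs.
    Precision = correct links / links the student drew.
    Recall    = correct links / links in the gold standard.
    """
    student, gold = set(student_links), set(gold_links)
    correct = student & gold
    precision = len(correct) / len(student) if student else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical links, for illustration only.
gold = {("de_1", "en_1"), ("de_2", "en_2"), ("de_3", "en_3"), ("de_4", "en_4")}
student = {("de_1", "en_1"), ("de_2", "en_2"), ("de_5", "en_5")}

p, r = precision_recall(student, gold)
print(f"precision = {p:.2f}, recall = {r:.2f}")  # precision = 0.67, recall = 0.50
```

A student who draws few but safe links scores high precision and low recall, which matches the pattern reported for the 8 students.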

27 Discrepancies
- students sometimes aligned a word (or some words) with a node
  - e.g. the word natürlich to the phrase of course
- students sometimes aligned a German verb group with a single verb form in English
  - e.g. ist zurückzuführen (roughly "is attributable") vs. reflecting

28 Discrepancies
- based on different grammatical forms:
  - a definite singular NP in German with an indefinite plural NP in English
    - der Umsatz (literally "the revenue") vs. revenues
  - a German genitive NP with a PP in English
    - der beiden Divisionen vs. of the two divisions

29 Missed by all students
- alignment of a German word to an empty token in English
  - wenn sie die Hand ausstreckte (literally "when she held out her hand") vs. herself shaking hands


31 Conclusions
1. Our alignment guidelines are sufficient for a core of clear alignment decisions.
2. Needed:
   1. Better alignment rules with concrete examples.
   2. Better support tools (consistency checking).
3. The distinction between exact alignment and approximate alignment is very tricky.

32 Thank You for Your Attention!
Questions?

33 Applications of Parallel Treebanks
- For the Translator
  1. corpus for translation studies
     - search tools needed
- For the Computational Linguist
  1. input for Example-based Machine Translation
  2. evaluation corpus for word, phrase or clause alignment
  3. training corpus for transfer rules

34 Alignment Example

35 Parallel Treebanking (flow diagram)
- DE: DE sentence -> ANNOTATE [PoS tagger (STTS), Chunker (TIGER)] -> flat DE tree -> Deepening -> DE tree
- SV: SV sentence -> PoS tagger (SUC) -> STTS conversion -> ANNOTATE [Chunker (SWE-TIGER)] -> flat SV tree -> Deepening + Back conv. -> SV tree
- DE tree + SV tree -> phrase alignment

