Published by Aleksandar Petković. Modified over 6 years ago.
1
Experiments on Processing Overlapping Parallel Corpora
Mark Fishel and Heiki-Jaan Kaalep, University of Tartu
2
Outline
- Parallel corpora containing overlapping parts
- A method for processing these
- Some experiments on JRC-Acquis (Estonian, Latvian, English)
3
Overlapping parallel corpora
- Hunglish and OPUS: Hu-En subtitles
- Hunglish and JRC-Acquis: Hu-En legislation texts
- Univ. of Tartu corpus and JRC-Acquis: Et-En legislation texts
- JRC-Acquis Vanilla and HunAlign: legislation texts
4
Overlapping parallel corpora
Additional troubles for handling:
- source version differences
- encoding differences
- format differences
But also potential benefits:
- detect alignment errors
- raise corpora quality
- increase segmentation depth
5
ParAlign: the method
A method of finding and matching corresponding corpora parts.
Enables:
- combining corpora
- detecting potential error spots
- increasing alignment depth
- evaluating and improving alignment quality
6
Method based on finding corpora correspondence:
7
Aligning the corresponding language parts:
8
Aligning the corresponding language parts:
Edit distance over the corpora documents:
- comparing N to M sentences
- matching weight = approximate sentence matching
Approximate sentence matching (modified edit distance):
- replacing a letter with the same letter in a different case is free
- inserting or replacing a number is infinitely costly
- replacing punctuation is cheap
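The cost rules for approximate sentence matching can be sketched as a character-level edit distance with custom costs. This is a minimal illustration, not the ParAlign implementation: the function names, the value 0.1 for "cheap", the cost of 1 for all other operations, and treating digit deletion as infinitely costly too are assumptions made here.

```python
import math
import string

INF = math.inf
PUNCT_SUB_COST = 0.1  # assumed value; the slides only say "cheap"


def sub_cost(a, b):
    """Cost of replacing character a with character b."""
    if a == b:
        return 0.0
    if a.isdigit() or b.isdigit():
        return INF              # numbers must not be altered
    if a.lower() == b.lower():
        return 0.0              # same letter, different case: free
    if a in string.punctuation and b in string.punctuation:
        return PUNCT_SUB_COST   # punctuation swap: cheap (ASCII punctuation only)
    return 1.0                  # assumed default cost


def indel_cost(c):
    """Cost of inserting or deleting character c (deletion rule is an assumption)."""
    return INF if c.isdigit() else 1.0


def sentence_distance(s, t):
    """Weighted edit distance between sentences s and t."""
    prev = [0.0]
    for c in t:
        prev.append(prev[-1] + indel_cost(c))
    for a in s:
        cur = [prev[0] + indel_cost(a)]
        for j, b in enumerate(t, start=1):
            cur.append(min(
                prev[j] + indel_cost(a),       # delete a
                cur[j - 1] + indel_cost(b),    # insert b
                prev[j - 1] + sub_cost(a, b),  # replace a with b
            ))
        prev = cur
    return prev[-1]
```

With these costs, "Article 5" and "ARTICLE 5" are at distance 0, while "Article 5" and "Article 6" are infinitely far apart, so sentences that differ in a number can never be matched.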
9
Aligning the language alignments:
Levenshtein distance
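For reference, the standard Levenshtein distance works on any two sequences, not only strings. The sketch below applies it to lists of alignment units; representing a unit as a pair of sentence-index tuples is an assumption for illustration, since the slides do not specify the data structure.

```python
def levenshtein(a, b):
    """Standard Levenshtein distance between two sequences of hashable items."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, free when the items are equal
            ))
        prev = cur
    return prev[-1]


# Hypothetical alignment units: (source sentence indices, target sentence indices)
alignment_a = [((0,), (0,)), ((1, 2), (1,)), ((3,), (2,))]
alignment_b = [((0,), (0,)), ((1, 2), (1,)), ((3,), (2, 3))]
differing_units = levenshtein(alignment_a, alignment_b)
```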
10
ParAlign, the Implementation
- Combine corpora, include side with more sentences
- Print out all mismatching parts (potential error spots)
- Use one corpus as guideline, proof the other one
Available at
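The first two modes can be pictured with a small sketch. It assumes the two corpora have already been matched into corresponding blocks, each block being a list of (source, target) alignment units covering the same text. The function name, the data layout, and the tie-breaking rule are hypothetical and not taken from the ParAlign tool.

```python
def combine(block_pairs):
    """Merge matched blocks from two corpora.

    block_pairs: list of (units_a, units_b), where each side is a list of
    (source_text, target_text) alignment units covering the same text.
    Returns the joint corpus and the list of mismatching blocks.
    """
    joint, mismatches = [], []
    for index, (units_a, units_b) in enumerate(block_pairs):
        if units_a == units_b:
            joint.extend(units_a)
            continue
        # The corpora disagree here: record it as a potential error spot.
        mismatches.append((index, units_a, units_b))
        # Keep the side with more units (finer segmentation);
        # ties go to the first corpus, an arbitrary choice.
        finer = units_a if len(units_a) >= len(units_b) else units_b
        joint.extend(finer)
    return joint, mismatches


pairs = [
    ([("A.", "a.")], [("A.", "a.")]),
    ([("B. C.", "b. c.")], [("B.", "b."), ("C.", "c.")]),
]
joint, mismatches = combine(pairs)
for index, units_a, units_b in mismatches:
    print(f"block {index}: {units_a} vs {units_b}")
```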
11
Method benefits:
- Handles different segmentation levels (M-to-N alignment unit relations)
- Insensitive to minor input differences:
  - encoding
  - typing errors
  - …
12
Experiment 1: Univ. of Tartu corpus and JRC-Acquis (English-Estonian)
- Overlapping parts found by comparing the CELEX codes
- Aim: generate joint corpus
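Finding the overlap by CELEX code amounts to intersecting the document identifiers of the two corpora. A minimal sketch, assuming each corpus is available as a dictionary from CELEX code to document content; the function name and the placeholder contents are illustrative only.

```python
def overlapping_documents(corpus_a, corpus_b):
    """Return (code, doc_a, doc_b) for every CELEX code present in both corpora."""
    shared = sorted(corpus_a.keys() & corpus_b.keys())
    return [(code, corpus_a[code], corpus_b[code]) for code in shared]


# Illustrative data: CELEX codes mapped to placeholder document contents.
tartu = {"31995L0046": "tartu doc 1", "32001R0045": "tartu doc 2"}
acquis = {"31995L0046": "acquis doc 1", "31993L0013": "acquis doc 3"}
overlap = overlapping_documents(tartu, acquis)
```

Only the documents in `overlap` need to go through the sentence-level matching; the rest can be copied into the joint corpus unchanged.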
13
Results Joint corpus size: al. units
14
Segmentation differences
15
Experiment 2: JRC-Acquis
- English-Estonian
- English-Latvian
- Estonian-Latvian
Aim: compare alignments produced by Vanilla and HunAlign (almost 100% overlapping)
16
Results

             En-Et               En-Lv               Et-Lv
             Hun       Van       Hun       Van       Hun       Van
Matching     83.5%     85.3%     83.8%     86.2%     98.0%     98.2%
Mismatching  15.9%     13.7%     15.5%     12.8%     0.1%      0.2%
Single       0.6%      1.0%      0.7%      [missing] 1.9%      1.6%
17
Future work
- Other corpora
- Optimizing
- Test on other domains
18
Summary
- A method for combining, comparing and evaluating parallel corpora using their overlapping parts
- Implementation available
- Joint En-Et corpus
- Comparison results between the HunAlign and Vanilla versions of the JRC-Acquis En-Et, En-Lv and Et-Lv parts
19
Thank You!