Published by Clyde Norris Modified over 9 years ago
1
Dutch Parallel Corpus A Multilingual Annotated Corpus Lieve Macken Language & Translation Technology Team University College Ghent
2
Dutch Parallel Corpus Annotated sentence-aligned corpus 10 million words Dutch – English / Dutch – French Linguistic annotations –PoS & lemma –Shallow syntactic analysis Quality control May 2006 – September 2009
3
Users and applications Fundamental research –Translation studies / contrastive linguistics –Corpus linguistics Support applications –Translation support (CAT) –Didactic support (CALL) HLT applications –Machine Translation / Terminology Extraction –Training and test data
4
Fundamental Research High-quality data Balanced by translation direction
Contrastive Linguistics: translation product, language systems
Translation Studies: translation process, translation strategies
5
Parallel & comparable corpus Dutch texts Dutch translations English & French translations English & French texts
6
Language Learning - CorpusCall Computer Assisted Language Learning –Reference samples –Learning activity Key Words in Context –Authentic language usage Example Nederlex –Electronic reading platform for French students learning Dutch –Development reading platform: FUNDP, Namur –Compilation parallel corpus: REBECA project (K.U.Leuven Campus Kortrijk)
7
Nederlex
8
Full text corpora as Translator’s aid Computer Assisted Translation –To identify more appropriate TL equivalents and idiomatic expressions –Extension to bilingual dictionaries –Words in context Example: TransSearch (Canadian Hansards) –Simard & Macklovitch 2005
9
Machine Translation Data-driven development of MT systems –Example-Based MT & Statistical MT P. Koehn 2005: 110 SMT systems trained on the Europarl corpus –Example output Finnish-English: we know very well that the current treaties are not enough and that in future, it is necessary to develop a better structure for the union and, therefore perustuslaillisempi structure, which also expressed more clearly what the member states and the union is concerned.
10
Large corpora are useful … Number crunching applications Statistical analysis Automatic analysis No human intervention
11
… but less adequate for: Applications involving quality at all levels Applications involving human analysis Educational applications
12
DPC requirements 1)Corpus design 2)Linguistic annotation 3)Quality control 4)Corpus exploitation & availability
13
DPC requirements 1)Corpus design 2)Linguistic annotation 3)Quality control 4)Corpus exploitation & availability
14
Design: translation directions Language Pairs & Translation Directions Balanced with respect to language pair and translation direction –Min. 2 million words per translation direction EN→NL, NL→EN, FR→NL, NL→FR
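The balance requirement can be checked mechanically once each text's word count and translation direction are known. The following is a minimal sketch, assuming the four directions are EN→NL, NL→EN, FR→NL and NL→FR; the function names and the sample figures are invented for illustration and do not come from the DPC project.

```python
from collections import Counter

MIN_WORDS = 2_000_000  # minimum per translation direction, as stated on the slide

# Assumed set of directions (hypothetical encoding as (source, target) pairs).
DIRECTIONS = [("EN", "NL"), ("NL", "EN"), ("FR", "NL"), ("NL", "FR")]

def words_per_direction(texts):
    """Sum word counts per (source, target) translation direction."""
    totals = Counter()
    for source_lang, target_lang, n_words in texts:
        totals[(source_lang, target_lang)] += n_words
    return totals

def under_target(totals, directions, minimum=MIN_WORDS):
    """Return the directions whose total word count is below the minimum."""
    return [d for d in directions if totals.get(d, 0) < minimum]

# Invented figures, for illustration only.
sample = [("EN", "NL", 1_500_000), ("EN", "NL", 700_000), ("NL", "FR", 900_000)]
totals = words_per_direction(sample)
missing = under_target(totals, DIRECTIONS)
print(missing)
```

A report like this would show which directions still need text acquisition before the corpus can be called balanced.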
15
Design: text types Commercial publishers –Fictional & non-fictional literature e.g. novels, essays –Journalistic texts, e.g. news articles Institutions –Instructive texts, e.g. user manuals –Administrative texts, e.g. meeting minutes –External communication, e.g. promotion material, newsletters
16
Text providers Quality –Published material –Professional translation division Copyright clearance –License agreements –Collaboration with Dutch Agency of HLT
17
50 Text providers
Text type: Providers
Administrative texts: European Parliament, Europarl, Melexis, Flemish government, Speeches Kok, Balkenende, FOD Sociale Zekerheid, …
External communication: BMM, Bosch, Barco, NMBS Holding, Arcelor Mittal, Fédération du tourisme de la province de Namur, Westtoer, …
Literature: Ons Erfdeel, Lannoo, Vlaams Fonds der Letteren, Nijgh & Van Ditmar, Le Dilettante, …
Journalistic texts: Roularta, The Independent, The Guardian / The Observer, De Standaard, De Morgen, Campuskrant, ING, Fortis, …
Instructive texts: IBM, Bosch, DNS, Eli Lilly, …
18
DPC requirements 1)Corpus design 2)Linguistic annotation 3)Quality control 4)Corpus exploitation & availability
19
Linguistic Annotation Structure –Paragraphs, sentences, words
20
Linguistic Annotation Structure –Paragraphs, sentences, words Alignment –Sentence alignment Vanilla Aligner Microsoft Bilingual Aligner Melamed’s GMA Aligner –(Sub-sentential alignment)
21
Linguistic Annotation Structure –Paragraphs, sentences, words Alignment –Sentence alignment –(Sub-sentential alignment) Linguistic annotation –Lemma –PoS
22
Corpus Representation Text Mark-up –TEI Encoding –UTF-8
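To make the mark-up concrete, here is a minimal sketch of how one annotated sentence might be serialized as XML in UTF-8. The element and attribute names (`s`, `w`, `n`, `lemma`, `pos`) are loosely modelled on TEI conventions and are assumptions, not the actual DPC schema; the tokens and tag values are placeholders.

```python
import xml.etree.ElementTree as ET

def encode_sentence(number, tokens):
    """Build an <s> element holding one <w> element per (form, lemma, pos) token."""
    s = ET.Element("s", {"n": str(number)})
    for form, lemma, pos in tokens:
        w = ET.SubElement(s, "w", {"lemma": lemma, "pos": pos})
        w.text = form
    return s

# Placeholder tokens and tags, for illustration only.
tokens = [("De", "de", "DET"), ("teksten", "tekst", "NOUN")]
sentence = encode_sentence(1, tokens)

# Serialize to UTF-8 bytes, the encoding named on the slide.
xml_bytes = ET.tostring(sentence, encoding="utf-8")
print(xml_bytes.decode("utf-8"))
```

Keeping lemma and PoS as attributes on each word element leaves the running text recoverable by concatenating the element contents.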
23
DPC requirements 1)Corpus design 2)Linguistic annotation 3)Quality control 4)Corpus exploitation & availability
24
Quality control Manually checked –10% of whole corpus Spot checking –Based on error analysis of manually verified data Automatic control procedures –e.g. automatic comparison of output from different alignment programs
25
Alignment merge [Figure: sentences 1–5 of the text in language 1 and the text in language 2, aligned by two aligners (AL1 and AL2); panel labels: manual check, alignment merge]
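One plausible reading of the alignment merge is an agreement check: links that two aligners both propose are accepted, and the rest are routed to manual checking. The slides do not spell out the actual procedure, so the sketch below is an assumption; the link representation (tuples of source and target sentence numbers) and the sample data are hypothetical.

```python
def merge_alignments(links_a, links_b):
    """Split links into those both aligners agree on and those needing review.

    Each link is a pair (source_sentence_ids, target_sentence_ids), both tuples.
    """
    set_a, set_b = set(links_a), set(links_b)
    agreed = sorted(set_a & set_b)    # proposed by both aligners
    disputed = sorted(set_a ^ set_b)  # proposed by only one aligner
    return agreed, disputed

# Invented output of two aligners over five sentences per language.
al1 = [((1,), (1,)), ((2,), (2,)), ((3, 4), (3,)), ((5,), (4, 5))]
al2 = [((1,), (1,)), ((2,), (2,)), ((3,), (3,)), ((4,), (4,)), ((5,), (5,))]

agreed, disputed = merge_alignments(al1, al2)
print("accepted:", agreed)
print("manual check:", disputed)
```

Under this scheme human effort concentrates on the passages where the aligners disagree, which is typically where 1-to-many or many-to-1 links occur.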
26
Quality control Manually checked –10% of whole corpus Spot checking –Based on error analysis of manually verified data Automatic control procedures –e.g. automatic comparison of output from different alignment programs
27
External validation Formal validation by CST (Centre for Language Technology, Copenhagen) Suitability test by Xplanation
28
DPC requirements 1)Corpus design 2)Linguistic annotation 3)Quality control 4)Corpus exploitation & availability
29
Corpus exploitation Web search interface –Parallel KWIC concordance –Simple queries –Extended queries Pattern matching & annotation labels Full text resource –Data-driven automatic learning (e.g. SMT) –Two monolingual XML-files + alignment file
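A parallel KWIC concordance shows each hit of a search word with its left and right context, next to the aligned sentence in the other language. The sketch below illustrates the idea over a list of aligned sentence pairs; it is not the DPC web interface, and the function name, the context width and the example sentences are invented.

```python
import re

def parallel_kwic(pairs, keyword, width=30):
    """Return (left, match, right, translation) for each keyword hit on the source side."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for source, target in pairs:
        for m in pattern.finditer(source):
            left = source[max(0, m.start() - width):m.start()]
            right = source[m.end():m.end() + width]
            hits.append((left, m.group(0), right, target))
    return hits

# Invented aligned sentence pairs (Dutch source, English translation).
pairs = [
    ("Het corpus bevat tien miljoen woorden.", "The corpus contains ten million words."),
    ("De teksten zijn op zinsniveau gealigneerd.", "The texts are aligned at sentence level."),
]

hits = parallel_kwic(pairs, "corpus")
for left, match, right, target in hits:
    print(f"{left:>30} [{match}] {right:<30} | {target}")
```

Right-aligning the left context and left-aligning the right context keeps the keyword in a fixed column, which is what makes KWIC output easy to scan.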
30
Metadata Additional filter to retrieve samples –Text-related data Language, text type, domain and keywords –Translation-related data Source language, target language –Annotation-related data Quality label
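Metadata filtering amounts to selecting the samples whose fields match the requested values. The following sketch assumes a simple record per sample; the field names, identifiers and quality-label values are hypothetical and do not reflect the actual DPC metadata scheme.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SampleMeta:
    sample_id: str       # hypothetical identifier
    text_type: str       # text-related data
    source_lang: str     # translation-related data
    target_lang: str
    quality_label: str   # annotation-related data

def filter_samples(samples, **criteria):
    """Keep samples whose metadata matches every given field=value pair."""
    return [
        s for s in samples
        if all(getattr(s, field) == value for field, value in criteria.items())
    ]

# Invented records, for illustration only.
samples = [
    SampleMeta("s001", "journalistic", "NL", "EN", "manually-verified"),
    SampleMeta("s002", "instructive", "EN", "NL", "spot-checked"),
    SampleMeta("s003", "journalistic", "NL", "FR", "spot-checked"),
]

selected = filter_samples(samples, text_type="journalistic", source_lang="NL")
print([s.sample_id for s in selected])
```

Combining criteria this way lets a user restrict a concordance search to, say, one text type and one translation direction.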
31
Availability Via Dutch Agency for Human Language Technologies (TST-centrale)
32
DPC objectives Quality control Level of annotation –Sentence alignment –PoS, lemma Balanced composition –Translation direction –Text types Availability –Via Dutch Agency for Human Language Technologies (TST-centrale)
33
DPC Team
K.U.Leuven Campus Kortrijk: Prof. Dr. Piet Desmet, Dr. Hans Paulussen, Lic. Maribel Montero Perez
University College Ghent - School of Translation Studies: Prof. Dr. Willy Vandeweghe, Dra. Lieve Macken, Orphée Declercq
34
Questions? www.kuleuven-kortrijk.be/dpc