Procedures in building Croatian-English parallel corpus
Marko Tadić
Filozofski fakultet Sveučilišta u Zagrebu, Zavod za lingvistiku
4th TELRI seminar, Bratislava

Parallel corpora
multilingual language research
–lexicography
–contrastive linguistics
–MT
–...
parallel corpora = essential importance
role of English as lingua communis
–common language pairs: en : Lx, Lx : en

Croatian-English parallel corpora 1
1st English-Croatian pairing
–Rudolf Filipović
–Yugoslav Serbo-Croatian-English Contrastive Project
–Brown corpus cut in half ( tokens)
–preserving original genre balance
–morphosyntactically marked
–translated
–concordance with morphosyntactic categories as keywords
–bilingual sentence database
1st usage of computers in contrastive linguistics?
tapes with data still archived in Institute of linguistics in Zagreb, but no computer system which could read them
project publications: Contrastive Studies, New Contrastive Studies, Chapters in Contrastive Linguistics

Croatian-English parallel corpora 2
2nd Croatian-English pair
–Plato’s Republic, TELRI CD-ROM
–hr-en not the only pair
–rather small
–properly aligned?

Croatian-English parallel corpora 3
3rd hr-en parallel corpus
–collecting in Institute of linguistics, Philosophical Faculty, Zagreb
–aims to test:
    text conversion procedures
    corpus organization
    alignment and encoding
will be used later in parallel corpora projects
–Croatian-Slovene parallel corpus
    approved in July by both Ministries of science as one of 17 bilateral scientific projects in humanities
    launched effectively last week

Corpus collecting
representativeness in parallel corpora?
–demand for parallelism narrows the choice (at least for languages with smaller number of speakers and/or translations)
–we are happy to get any valuable translations
–result: unbalanced and nonrepresentative set of parallel texts
“methodologically cleaner” approach (possible?)
–texts from one source
–Corpus of X
–no problems with representativeness?

Corpus collecting 2
source: Croatia Weekly
–publisher: Croatian Institute for Culture and Information (HIKZ)
–started January 1998
–like USA Today: different domains
    politics, economy and finance, tourism, ecology, culture, art, events, sports
–12 pages, A3
–prepared in Croatian, then translated by a professional translating office
availability
–No. 90 is being prepared now
–access to all texts in electronic form in both languages, except for the first 5 issues

Corpus collecting 3
size
–average issue: tokens hr tokens en
–approx.: tokens hr tokens en
“methodological disturbance”
–the biggest weekly newspaper Nacional
    important source of hr-texts for Croatian National Corpus
    has started publishing English translations of approx. 30% of each Croatian issue on its Web page
–Ministry of science and technology
    description of all closed scientific projects in RH (Republic of Croatia) on the Web
    Croatian and English

Making corpus
platform
–NT instead of UNIX
–all software (commercial, shareware, custom-made) runs on win9X
text formats
–hr texts = “naked ASCII”, no markup => manual marking
–en texts = DTP file, RTF extraction
conversion
–2XML: custom made software we use for Croatian National Corpus
    input: HTML, RTF
    output: XML, no header
    two-step conversion by user-defined scripts enables high level of automation

Making corpus 2
Sentence marking
–script in Search&Replace shareware by Funduc SW
–sentence boundary marked after punctuation followed by capital letter
–filtered for known exceptions: Mr., Mrs., Miss., dr., St. etc.
Tokenizer
–custom made
–XML input
–output: tabbed file XML with …

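The sentence-marking rule above (boundary after punctuation followed by a capital letter, filtered for known abbreviations) can be sketched in Python. This is only an illustration of the rule, not the Search&Replace script itself: the function name `split_sentences`, the regular expression and the case-insensitive abbreviation check are assumptions.

```python
import re

# Abbreviations listed on the slide; a period after one of these is not a boundary.
# Stored in lower case so that "Dr." and "dr." are treated alike (an assumption).
EXCEPTIONS = {"mr.", "mrs.", "miss.", "dr.", "st."}

# Candidate boundary: sentence-final punctuation, whitespace, then any character.
BOUNDARY = re.compile(r"([.!?])\s+(?=\S)")

def split_sentences(text):
    """Split text after punctuation that is followed by a capital letter."""
    sentences = []
    start = 0
    for m in BOUNDARY.finditer(text):
        end = m.end(1)              # position just after the punctuation mark
        next_char = text[m.end()]   # first character of the possible next sentence
        if not next_char.isupper():
            continue                # e.g. "approx. 30%" is not a boundary
        last_word = text[start:end].split()[-1]
        if last_word.lower() in EXCEPTIONS:
            continue                # known abbreviation, keep the sentence open
        sentences.append(text[start:end])
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

if __name__ == "__main__":
    for s in split_sentences("Izjavio je dr. Mate Granić. Ministri su otišli."):
        print(s)
```

`str.isupper()` also covers Croatian capitals such as Č, Đ and Š, which a plain `[A-Z]` class would miss.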
Making corpus 3
	CW hr1X
	CW hr7X
	CW hr25X
	CW hr41X
Predsjednik	CW hr44R
Tuđman	CW hr56R
primio	CW hr63R
Kinkela	CW hr70R
,	CW hr77I
Vedrinea	CW hr79R
i	CW hr88R
Primakova	CW hr90R
	CW hr99X
	CW hr10X
	CW hr110X
	CW hr126X
Tuđman	CW hr129R
:	CW hr135I
Hrvatska	CW hr137R
vojno	CW hr146R
,	CW hr151I
gospodarski	CW hr153R
i	CW hr165R
sigurnosno	CW hr167R
orijentirana	CW hr178R
na	CW hr191R
europske	CW hr194R
integracije	CW hr203R
.	CW hr214I
	CW hr215X
	CW hr220X
Ministri	CW hr223R
Vedrine	CW hr232R
i	CW hr240R
Kinkel	CW hr242R
uputili	CW hr249R
zahtjev	CW hr257R
Hrvatskoj	CW hr265R
da	CW hr275R
izradi	CW hr278R
konkretan	CW hr285R
plan	CW hr295R
povratka	CW hr300R
izbjeglica	CW hr309R
.	CW hr319I

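The tokenizer output above appears to pair each token with a source code (CW), a language code (hr), a character offset and a one-letter type. The sketch below produces rows of that general shape. It is not the custom tokenizer from the talk. Reading X as markup, R as word and I as punctuation is an inference from the listing, and the function names and the regular expression are hypothetical.

```python
import re

# One named alternative per token type. Markup is tried first so that "<" and ">"
# are not picked up as punctuation. The meaning of X/R/I is inferred, not documented.
TOKEN = re.compile(r"(?P<X><[^>]+>)|(?P<R>\w+)|(?P<I>[^\w\s])")

def tokenize(xml_text, source="CW", lang="hr"):
    """Return (token, source, lang, offset, type) tuples for an XML string."""
    rows = []
    for m in TOKEN.finditer(xml_text):
        rows.append((m.group(), source, lang, m.start(), m.lastgroup))
    return rows

def to_tabbed(rows):
    """Serialize the rows as a tab-separated file, one token per line."""
    return "\n".join("\t".join(str(field) for field in row) for row in rows)

if __name__ == "__main__":
    # "<s>" is an assumed tag name used only for this illustration.
    print(to_tabbed(tokenize("<s>Tuđman: Hrvatska vojno, gospodarski.</s>")))
```

Keeping the character offset with each token makes it possible to map a token back to its position in the XML file, which seems to be the purpose of the numeric field in the listing.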
Making corpus 4
Predsjednik Tuđman primio Kinkela, Vedrinea i Primakova
Tuđman: Hrvatska vojno, gospodarski i sigurnosno orijentirana na europske integracije.
Ministri Vedrine i Kinkel uputili zahtjev Hrvatskoj da izradi konkretan plan povratka izbjeglica.
Na sastanku s Primakovom dogovoren Tuđmanov posjet Rusiji
Hrvatska je spremna jamčiti puna manjinska prava svim Srbima, građanima Republike Hrvatske, nastaviti će s politikom koja je dovela do mirne reintegracije uz punu zaštitu i sigurnost srpske manine u cijeloj Hrvatskoj i spremna je prihvatiti sve izbjeglice iz SRJ i sve izbjeglice koji to žele, izjavio je dr Mate Granić nakon sastanka između predsjednika Tuđmana i ministara vanjskih poslova Njemačke i Francuske Klausa Kinkela i Huberta Vedrinea, što je u Predsjedničkim dvorima.

Alignment
testing stage
demo of Atril’s DéjàVu translation memory database V
aligning module
–works fine for 1:1 alignments
–handwork for 2:1, 3:1, 1:2, 1:3
export to TMX format

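To illustrate what an export to TMX amounts to, here is a minimal Python sketch that writes aligned pairs as translation units. It does not reproduce DéjàVu's export. The header values are placeholders, the TMX version string is an assumption, and the sample sentences are made up for illustration rather than taken from Croatia Weekly. A 2:1 alignment is represented simply by placing both source sentences in one segment.

```python
import xml.etree.ElementTree as ET

# Qualified name that ElementTree serializes as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def build_tmx(pairs, src_lang="hr", tgt_lang="en"):
    """Return a TMX-style document with one <tu> per aligned pair."""
    tmx = ET.Element("tmx", version="1.4")  # version is an assumption
    ET.SubElement(tmx, "header", {
        # Placeholder header values for this sketch only.
        "creationtool": "sketch", "creationtoolversion": "0",
        "segtype": "sentence", "o-tmf": "plain", "adminlang": "en",
        "srclang": src_lang, "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")

if __name__ == "__main__":
    # Made-up sentence pairs; the second one stands for a 2:1 alignment.
    print(build_tmx([("Dobar dan.", "Good day."),
                     ("Hvala. Doviđenja.", "Thanks, goodbye.")]))
```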
Alignment 2
Alignment 3
Alignment 4
encoding problem: How to store alignments?
several ways to do it now
–CES with pointers to IDs in 3rd file
–translation memory (Translation Units as aligned pairs)
since we are in XML => PLUG project dtd (Tiedemann 1998)
si-en parallel corpus (Erjavec 1999): SGML, modified TEI to have TU. But all upper and lower level encoding (,, ) are lost. Is there a way to retain it?
–TEIXML dtd, Nancy, July
Interpretation of TEI dtd? Would that dtd prefer alignment by IDs and pointers?
Is the SGML/XML decision really a problem to us? To the same element we can attach different headers, convert character entities and have SGML instead of XML?

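The first option, pointers to IDs kept in a third file, can be sketched as follows. The element and attribute names (`linkGrp`, `link`, `xtargets`, `fromDoc`, `toDoc`) are loosely modelled on CES-style alignment files and should be read as assumptions, as should the file names and sentence IDs. The point of the sketch is that both texts keep all their own markup, because the alignment lives outside them.

```python
import xml.etree.ElementTree as ET

def build_links(alignment, hr_doc="cw.hr.xml", en_doc="cw.en.xml"):
    """Build a stand-off link file from (hr_ids, en_ids) pairs.

    The file names and the element/attribute names are hypothetical.
    """
    root = ET.Element("linkGrp", {"fromDoc": hr_doc, "toDoc": en_doc})
    for hr_ids, en_ids in alignment:
        ET.SubElement(root, "link", {
            "type": f"{len(hr_ids)}:{len(en_ids)}",   # e.g. "2:1"
            "xtargets": " ".join(hr_ids) + " ; " + " ".join(en_ids),
        })
    return root

def resolve(link, hr_index, en_index):
    """Follow one link back to the sentences it points at."""
    hr_part, en_part = link.get("xtargets").split(";")
    return ([hr_index[i] for i in hr_part.split()],
            [en_index[i] for i in en_part.split()])

if __name__ == "__main__":
    links = build_links([(["hr.s1"], ["en.s1"]),
                         (["hr.s2", "hr.s3"], ["en.s2"])])
    print(ET.tostring(links, encoding="unicode"))
```

Compared with storing translation units, this keeps paragraph and other structural markup in the two source files untouched, at the cost of needing stable IDs on every alignable element.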
Preliminary statistics for aligning
already it seems that we would have a lot of handwork
discrepancy between number of and in hr and en
	hr	en	% increase
CW
CW
alignment is not on the schedule yet
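A discrepancy of this kind can be measured by counting structural elements in each language version and expressing the difference as a percentage of the hr count. The Python sketch below shows the arithmetic only. The tag names `p` and `s` are assumptions, since the element names are missing from the slide, and the two XML strings are toy input, not corpus data.

```python
import xml.etree.ElementTree as ET

def count_elements(xml_text, tags=("p", "s")):
    """Count occurrences of each tag in an XML string (tag names are assumed)."""
    root = ET.fromstring(xml_text)
    return {tag: sum(1 for _ in root.iter(tag)) for tag in tags}

def percent_increase(hr_count, en_count):
    """Percentage by which the en count exceeds the hr count."""
    return (en_count - hr_count) / hr_count * 100.0

if __name__ == "__main__":
    # Toy input for illustration only.
    hr = "<text><p><s>a</s><s>b</s></p></text>"
    en = "<text><p><s>a</s></p><p><s>b</s><s>c</s></p></text>"
    hr_counts, en_counts = count_elements(hr), count_elements(en)
    for tag in hr_counts:
        print(tag, hr_counts[tag], en_counts[tag],
              percent_increase(hr_counts[tag], en_counts[tag]))
```

A large percentage for either element suggests many alignments other than 1:1, which is where the handwork is expected.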