Procedures in building Croatian-English parallel corpus Marko Tadić Filozofski fakultet Sveučilišta u Zagrebu, Zavod za lingvistiku.


1 Procedures in building Croatian-English parallel corpus
Marko Tadić (marko.tadic@ffzg.hr)
Filozofski fakultet Sveučilišta u Zagrebu, Zavod za lingvistiku (http://www.ffzg.hr/zzl/zzl-home.htm)
4th TELRI seminar, Bratislava, 1999-11-05

2 Parallel corpora
multilingual language research
–lexicography
–contrastive linguistics
–MT
–...
parallel corpora = essential importance
role of English as lingua communis
–common language pairs: en : Lx and Lx : en

3 Croatian-English parallel corpora 1
1st English-Croatian pairing
–Rudolf Filipović, 1968-1971
–Yugoslav Serbo-Croatian-English Contrastive Project
–Brown corpus cut in half (505,822 tokens)
–preserving original genre balance
–morphosyntactically marked
–translated
–concordance with morphosyntactic categories as keywords
–bilingual sentence database
1st usage of computers in contrastive linguistics?
tapes with data still archived in the Institute of Linguistics in Zagreb, but there is no computer system that can read them
project publications: Contrastive Studies, New Contrastive Studies, Chapters in Contrastive Linguistics

4 Croatian-English parallel corpora 2
2nd Croatian-English pair
–Plato’s Republic, TELRI CD-ROM, 1998
–hr-en not the only pair
–rather small
–properly aligned?

5 Croatian-English parallel corpora 3
3rd hr-en parallel corpus
–being collected at the Institute of Linguistics, Philosophical Faculty, Zagreb
–aims to test:
  text conversion procedures
  corpus organization
  alignment and encoding
will be used later in parallel corpora projects
–Croatian-Slovene parallel corpus
  approved in July by both Ministries of Science as one of 17 bilateral scientific projects in humanities
  launched effectively last week

6 Corpus collecting
representativeness in parallel corpora?
–demand for parallelism narrows the choice (at least for languages with a smaller number of speakers and/or translations)
–we are happy to get any valuable translations
–result: unbalanced and nonrepresentative set of parallel texts
“methodologically cleaner” approach (possible?)
–texts from one source
–Corpus of X
–no problems with representativeness?

7 Corpus collecting 2
source: Croatia Weekly
–publisher: Croatian Institute for Culture and Information (HIKZ)
–started January 1998
–like USA Today: different domains
  politics, economy and finance, tourism, ecology, culture, art, events, sports
–12 pages, A3
–prepared in Croatian, then translated by a professional translation office
availability
–No. 90 is being prepared now
–access to all texts in electronic form in both languages, except for the first 5 issues

8 Corpus collecting 3
size
–average issue: 15,170 tokens hr, 17,900 tokens en
–approx.: 1,300,000 tokens hr, 1,520,000 tokens en
“methodological disturbance”
–the biggest weekly newspaper, Nacional
  important source of hr texts for the Croatian National Corpus
  started with English translations of approx. 30% of the Croatian issue on their Web page
–Ministry of Science and Technology
  description of all closed scientific projects in Croatia (RH) on the Web
  Croatian and English
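The approximate corpus totals on this slide can be reproduced from the per-issue averages. The sketch below assumes 85 usable issues (No. 90 minus the first 5 that are not available electronically); that issue count is our assumption, since the slides do not state it, and `estimate_total` is a hypothetical helper name.

```python
def estimate_total(avg_tokens_per_issue, issues):
    """Rough corpus size: average tokens per issue times number of issues."""
    return avg_tokens_per_issue * issues

# ASSUMPTION: 85 issues (issues 6 to 90); not stated in the slides.
ISSUES = 85

hr_total = estimate_total(15170, ISSUES)  # Croatian side
en_total = estimate_total(17900, ISSUES)  # English side

print(f"hr: {hr_total:,} tokens")
print(f"en: {en_total:,} tokens")
```

Under that assumption both products land close to the rounded figures given on the slide.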

9 Making corpus
platform
–NT instead of UNIX
–all software (commercial, shareware, custom-made) runs on win9X
text formats
–hr texts = “naked ASCII”, no markup => manual marking
–en texts = DTP file, RTF extraction
conversion
–2XML: custom-made software we use for Croatian National Corpus
  input: HTML, RTF
  output: XML, no header
  two-step conversion by user-defined scripts
  enables high level of automation

10 Making corpus 2
Sentence marking
–script in Search&Replace shareware by Funduc SW
–boundary marked after punctuation followed by a capital letter
–filtered for known exceptions: Mr., Mrs., Miss., dr., St. etc.
Tokenizer
–custom made
–XML input
–output: tabbed file XML with …
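The sentence-marking rule described on this slide (a boundary after punctuation followed by a capital letter, minus a list of known abbreviations) can be sketched in a few lines. This is only an illustration of the rule, not the Search&Replace script used in the project; the `<s>` element name, the function names and the set of uppercase letters are assumptions.

```python
import re

# Abbreviations after which a full stop does not end a sentence.
# Taken from the slide's exception list; extend as needed.
EXCEPTIONS = {"Mr.", "Mrs.", "Miss.", "dr.", "St."}

# Candidate boundary: ., ! or ? followed by whitespace and an uppercase
# letter (ASCII plus Croatian capitals; the letter set is an assumption).
BOUNDARY = re.compile(r"([.!?])\s+(?=[A-ZČĆĐŠŽ])")

def split_sentences(text):
    """Split text at candidate boundaries, skipping known abbreviations."""
    sentences, start = [], 0
    for m in BOUNDARY.finditer(text):
        # The word that ends with the punctuation mark just matched.
        last_word = text[start:m.end(1)].split()[-1]
        if last_word in EXCEPTIONS:
            continue  # e.g. "dr. Granić" is not a sentence break
        sentences.append(text[start:m.end(1)])
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

def mark_sentences(text):
    """Wrap each sentence in <s>...</s> (hypothetical element name)."""
    return "".join(f"<s>{s}</s>" for s in split_sentences(text))

print(mark_sentences("Mr. Smith arrived. He met dr. Granić. Done."))
```

A fixed exception list like this still misses abbreviations it does not know about, which is one reason manual checking remains necessary.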

11 Making corpus 3
              CW011199803260101hr  1    X
              CW011199803260101hr  7    X
              CW011199803260101hr  25   X
              CW011199803260101hr  41   X
Predsjednik   CW011199803260101hr  44   R
Tuđman        CW011199803260101hr  56   R
primio        CW011199803260101hr  63   R
Kinkela       CW011199803260101hr  70   R
,             CW011199803260101hr  77   I
Vedrinea      CW011199803260101hr  79   R
i             CW011199803260101hr  88   R
Primakova     CW011199803260101hr  90   R
              CW011199803260101hr  99   X
              CW011199803260101hr  10   X
              CW011199803260101hr  110  X
              CW011199803260101hr  126  X
Tuđman        CW011199803260101hr  129  R
:             CW011199803260101hr  135  I
Hrvatska      CW011199803260101hr  137  R
vojno         CW011199803260101hr  146  R
,             CW011199803260101hr  151  I
gospodarski   CW011199803260101hr  153  R
i             CW011199803260101hr  165  R
sigurnosno    CW011199803260101hr  167  R
orijentirana  CW011199803260101hr  178  R
na            CW011199803260101hr  191  R
europske      CW011199803260101hr  194  R
integracije   CW011199803260101hr  203  R
.             CW011199803260101hr  214  I
              CW011199803260101hr  215  X
              CW011199803260101hr  220  X
Ministri      CW011199803260101hr  223  R
Vedrine       CW011199803260101hr  232  R
i             CW011199803260101hr  240  R
Kinkel        CW011199803260101hr  242  R
uputili       CW011199803260101hr  249  R
zahtjev       CW011199803260101hr  257  R
Hrvatskoj     CW011199803260101hr  265  R
da            CW011199803260101hr  275  R
izradi        CW011199803260101hr  278  R
konkretan     CW011199803260101hr  285  R
plan          CW011199803260101hr  295  R
povratka      CW011199803260101hr  300  R
izbjeglica    CW011199803260101hr  309  R
.             CW011199803260101hr  319  I

12 Making corpus 4 Predsjednik Tuđman primio Kinkela, Vedrinea i Primakova Tuđman : Hrvatska vojno, gospodarski i sigurnosno orijentirana na europske integracije. Ministri Vedrine i Kinkel uputili zahtjev Hrvatskoj da izradi konkretan plan povratka izbjeglica. Na sastanku s Primakovom dogovoren Tuđmanov posjet Rusiji Hrvatska je spremna jamčiti puna manjinska prava svim Srbima, građanima Republike Hrvatske, nastaviti će s politikom koja je dovela do mirne reintegracije uz punu zaštitu i sigurnost srpske manine u cijeloj Hrvatskoj i spremna je prihvatiti sve izbjeglice iz SRJ i sve izbjeglice koji to žele, izjavio je dr Mate Granić nakon sastanka između predsjednika Tuđmana i ministara vanjskih poslova Njemačke i Francuske Klausa Kinkela i Huberta Vedrinea, što je u Predsjedničkim dvorima.
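The tabbed tokenizer output on slide 11 appears to pair each token with a document ID, a character offset and a one-letter type. A minimal sketch of a tokenizer producing rows of that shape follows. It is not the project's custom tokenizer: the reading of the type codes (X for markup, R for words, I for punctuation), the regular expression and the `base` offset parameter are all assumptions inferred from the sample.

```python
import re

# One alternative per token type; the group name doubles as the type code.
# ASSUMPTION: X = XML tag, R = word, I = punctuation (inferred from slide 11).
TOKEN = re.compile(r"(?P<X><[^>]+>)|(?P<R>\w+)|(?P<I>[^\w\s])")

def tokenize(text, doc_id, base=1):
    """Return (token, doc_id, offset, type) rows.

    `base` is the offset assigned to the first character of `text`
    (hypothetical parameter, so a fragment can be placed inside a file).
    """
    rows = []
    for m in TOKEN.finditer(text):
        rows.append((m.group(), doc_id, m.start() + base, m.lastgroup))
    return rows

def to_tabbed(rows):
    """Serialize rows as tab-separated lines, one token per line."""
    return "\n".join("\t".join(str(col) for col in row) for row in rows)

headline = "Predsjednik Tuđman primio Kinkela, Vedrinea i Primakova"
print(to_tabbed(tokenize(headline, "CW011199803260101hr", base=44)))
```

With `base=44` the offsets computed for the headline match those in the sample on slide 11 (44, 56, 63, 70, 77, 79, 88, 90), which supports reading the third column as a character offset.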

13 Alignment
testing stage
demo of Atril’s DéjàVu translation memory database V2.3.82
aligning module
–works fine for 1:1 alignments
–handwork for 2:1, 3:1, 1:2, 1:3
export to TMX format

14 Alignment 2

15 Alignment 3

16 Alignment 4
encoding problem: How to store alignments?
several ways to do it now
–CES with pointers to IDs in a 3rd file
–translation memory (Translation Units as aligned pairs)
since we are in XML => PLUG project dtd (Tiedemann 1998)
si-en parallel corpus (Erjavec 1999): SGML, modified TEI to have TU. But all upper and lower level encoding (,, ) is lost. Is there a way to retain it?
–TEIXML dtd, Nancy, July 1999. Interpretation of TEI dtd? Would that dtd prefer alignment by IDs and pointers?
Is the SGML/XML decision really a problem for us? To the same element we can attach different headers, convert character entities and have SGML instead of XML?
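The translation-memory option (Translation Units as aligned pairs) can be illustrated with a TMX-style `<tu>` element. The sketch below uses the `tu`/`tuv`/`seg` element names and the `tuid` and `xml:lang` attributes as in TMX 1.4; it is not the PLUG or TEI encoding discussed on the slide. The helper name, the ID value and the English sentence (our own gloss, not the Croatia Weekly translation) are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree spells xml:lang with the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def make_tu(tu_id, segments):
    """Build one translation unit.

    segments maps a language code to a list of sentences; for n:m links
    (2:1, 1:2, ...) the sentences of one side are joined into one <seg>.
    """
    tu = ET.Element("tu", {"tuid": tu_id})
    for lang, sentences in segments.items():
        tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
        ET.SubElement(tuv, "seg").text = " ".join(sentences)
    return tu

tu = make_tu(
    "CW011-0001",  # hypothetical ID
    {
        "hr": ["Ministri Vedrine i Kinkel uputili zahtjev Hrvatskoj da "
               "izradi konkretan plan povratka izbjeglica."],
        # Illustrative gloss only, NOT taken from the corpus.
        "en": ["Ministers Vedrine and Kinkel asked Croatia to draw up a "
               "concrete plan for the return of refugees."],
    },
)
print(ET.tostring(tu, encoding="unicode"))
```

A flat unit like this keeps only the aligned text, which is exactly the loss of surrounding encoding the slide worries about; the pointer-based CES approach avoids it by leaving both documents intact and storing the links separately.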

17 Preliminary statistics for aligning
already it seems that we would have a lot of handwork
discrepancy between number of and in hr and en

         hr      en      % increase
CW010    195     195
         729     796     9.2
         15483   18176   17.4
CW011    178     178
         675     754     11.7
         14853   17602   18.5

alignment is not on the schedule yet
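The "% increase" column can be checked as the relative growth of the en count over the hr count. The sketch below recomputes it from the counts on the slide; the function name is hypothetical, and the rows are left unlabeled because the element names did not survive in the transcript.

```python
def percent_increase(hr, en):
    """Relative growth of the en count over the hr count, in percent."""
    return round((en - hr) / hr * 100, 1)

# (hr, en) count pairs per issue, in the order they appear on the slide.
counts = {
    "CW010": [(195, 195), (729, 796), (15483, 18176)],
    "CW011": [(178, 178), (675, 754), (14853, 17602)],
}

for issue, pairs in counts.items():
    for hr, en in pairs:
        print(issue, hr, en, percent_increase(hr, en))
```

The first pair in each issue is identical in both languages, while the other two grow by roughly 9 to 19 percent, which is the mismatch that makes purely automatic 1:1 alignment unlikely to suffice.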

