Procedures in building Croatian-English parallel corpus Marko Tadić Filozofski fakultet Sveučilišta u Zagrebu, Zavod za lingvistiku.

Presentation transcript:

Procedures in building Croatian-English parallel corpus Marko Tadić Filozofski fakultet Sveučilišta u Zagrebu, Zavod za lingvistiku ( 4th TELRI seminar, Bratislava,

Parallel corpora
multilingual language research
  – lexicography
  – contrastive linguistics
  – MT
  – ...
parallel corpora = essential importance
role of English as lingua communis
  – common language pairs: en : Lx, Lx : en

Croatian-English parallel corpora 1
1st English-Croatian pairing
  – Rudolf Filipović,
  – Yugoslav Serbo-Croatian-English Contrastive Project
  – Brown corpus cut in half ( tokens)
  – preserving original genre balance
  – morphosyntactically marked
  – translated
  – concordance with morphosyntactic categories as keywords
  – bilingual sentence database
1st usage of computers in contrastive linguistics?
tapes with data still archived in Institute of linguistics in Zagreb but no computer system which could read them
project publications: Contrastive Studies, New Contrastive Studies, Chapters in Contrastive Linguistics

Croatian-English parallel corpora 2
2nd Croatian-English pair
  – Plato’s Republic, TELRI CD-ROM,
  – hr-en not the only pair
  – rather small
  – properly aligned?

Croatian-English parallel corpora 3
3rd hr-en parallel corpus
  – collecting in Institute of linguistics, Philosophical Faculty, Zagreb
  – aims to test:
      text conversion procedures
      corpus organization
      alignment and encoding
  – will be used later in parallel corpora projects
  – Croatian-Slovene parallel corpus
      approved in July by both Ministries of science as one of 17 bilateral scientific projects in humanities
      launched effectively last week

Corpus collecting
representativeness in parallel corpora?
  – demand for parallelism narrows the choice (at least for languages with smaller number of speakers and/or translations)
  – we are happy to get any valuable translations
  – result: unbalanced and nonrepresentative set of parallel texts
“methodologically cleaner” approach (possible?)
  – texts from one source
  – Corpus of X
  – no problems with representativeness?

Corpus collecting 2
source: Croatia Weekly
  – publisher: Croatian Institute for Culture and Information (HIKZ)
  – started January 1998
  – like USA Today: different domains
      politics, economy and finance, tourism, ecology, culture, art, events, sports
  – 12 pages, A3
  – prepared in Croatian, then translated by a professional translation office
availability
  – No. 90 is being prepared now
  – access to all texts in electronic form in both languages, except for the first 5 issues

Corpus collecting 3
size
  – average issue: tokens hr tokens en
  – approx.: tokens hr tokens en
“methodological disturbance”
  – the biggest weekly newspaper Nacional
      important source of hr-texts for Croatian National Corpus
      started with English translations of approx. 30% of the Croatian issue on their Web page
  – Ministry of science and technology
      description of all closed scientific projects in RH (Republic of Croatia) on the Web
      Croatian and English

Making corpus
platform
  – NT instead of UNIX
  – all software (commercial, shareware, custom-made) runs on win9X
text formats
  – hr texts = “naked ASCII”, no markup => manual marking
  – en texts = DTP file, RTF extraction
conversion
  – 2XML: custom made software we use for Croatian National Corpus
      input: HTML, RTF
      output: XML, no header
      two-step conversion by user-defined scripts enables high level of automation

Making corpus 2
Sentence marking
  – script in Search&Replace shareware by Funduc SW
  – after punctuation followed by capital letter
  – filtered for known exceptions: Mr., Mrs., Miss., dr., St. etc.
Tokenizer
  – custom made
  – XML input
  – output: tabbed file XML with …
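
The sentence-marking rule on this slide (a boundary after punctuation followed by a capital letter, filtered for known abbreviations) can be sketched in Python. This is only an illustration under stated assumptions: the original work used a Search&Replace script, not Python, and the names `split_sentences`, `EXCEPTIONS` and `BOUNDARY`, as well as the exact regular expression, are hypothetical. The sketch returns a list of sentences rather than inserting markup, because the slide does not preserve the tag that was used.

```python
import re

# Abbreviations listed on the slide; a production list would be longer.
EXCEPTIONS = {"Mr.", "Mrs.", "Miss.", "dr.", "St."}

# Candidate boundary (assumed pattern): sentence-final punctuation,
# whitespace, then a capital letter (including Croatian capitals).
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-ZČĆĐŠŽ])")


def split_sentences(text):
    """Split text at candidate boundaries, skipping known abbreviations."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        # The word that ends right before the candidate boundary.
        last_word = text[start:match.start()].split()[-1]
        if last_word in EXCEPTIONS:
            continue  # e.g. "dr. Mate" is not a sentence end
        sentences.append(text[start:match.start()])
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

An exception list of this kind is case-sensitive, so "Dr." and "dr." would each need their own entry.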

Making corpus 3
              CW hr    1    X
              CW hr    7    X
              CW hr    25   X
              CW hr    41   X
Predsjednik   CW hr    44   R
Tuđman        CW hr    56   R
primio        CW hr    63   R
Kinkela       CW hr    70   R
,             CW hr    77   I
Vedrinea      CW hr    79   R
i             CW hr    88   R
Primakova     CW hr    90   R
              CW hr    99   X
              CW hr    10   X
              CW hr    110  X
              CW hr    126  X
Tuđman        CW hr    129  R
:             CW hr    135  I
Hrvatska      CW hr    137  R
vojno         CW hr    146  R
,             CW hr    151  I
gospodarski   CW hr    153  R
i             CW hr    165  R
sigurnosno    CW hr    167  R
orijentirana  CW hr    178  R
na            CW hr    191  R
europske      CW hr    194  R
integracije   CW hr    203  R
.             CW hr    214  I
              CW hr    215  X
              CW hr    220  X
Ministri      CW hr    223  R
Vedrine       CW hr    232  R
i             CW hr    240  R
Kinkel        CW hr    242  R
uputili       CW hr    249  R
zahtjev       CW hr    257  R
Hrvatskoj     CW hr    265  R
da            CW hr    275  R
izradi        CW hr    278  R
konkretan     CW hr    285  R
plan          CW hr    295  R
povratka      CW hr    300  R
izbjeglica    CW hr    309  R
.             CW hr    319  I
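
A tokenizer producing output of roughly this shape can be sketched as follows. The reading of the columns is an assumption drawn from the sample: token, source identifier, character offset, and a type code (X for an XML tag, R for a word, I for punctuation). The offsets in the sample grow by token length plus one space, which is what suggests character positions. The original tokenizer was custom made and is not shown in the slides, so the names `TOKEN`, `tokenize` and `to_tabbed`, the 1-based offsets, and the regular expression are all hypothetical.

```python
import re

# One alternative per assumed token class: XML tag (X), word (R),
# or any other non-space character, treated as punctuation (I).
TOKEN = re.compile(r"(?P<X><[^>]+>)|(?P<R>\w+)|(?P<I>[^\w\s])")


def tokenize(xml_text, doc_id):
    """Yield (token, doc_id, offset, type) with 1-based character offsets."""
    for match in TOKEN.finditer(xml_text):
        # lastgroup is the name of the alternative that matched: X, R or I.
        yield match.group(), doc_id, match.start() + 1, match.lastgroup


def to_tabbed(rows):
    """Serialize rows as tab-separated lines, one token per line."""
    return "\n".join("\t".join(str(field) for field in row) for row in rows)
```

For example, `to_tabbed(tokenize("<s>Tuđman : Hrvatska.</s>", "doc1"))` yields six lines, one each for the two tags, the two words and the two punctuation marks; `doc1` is a placeholder identifier.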

Making corpus 4
Predsjednik Tuđman primio Kinkela, Vedrinea i Primakova
Tuđman : Hrvatska vojno, gospodarski i sigurnosno orijentirana na europske integracije.
Ministri Vedrine i Kinkel uputili zahtjev Hrvatskoj da izradi konkretan plan povratka izbjeglica.
Na sastanku s Primakovom dogovoren Tuđmanov posjet Rusiji
Hrvatska je spremna jamčiti puna manjinska prava svim Srbima, građanima Republike Hrvatske, nastaviti će s politikom koja je dovela do mirne reintegracije uz punu zaštitu i sigurnost srpske manine u cijeloj Hrvatskoj i spremna je prihvatiti sve izbjeglice iz SRJ i sve izbjeglice koji to žele, izjavio je dr Mate Granić nakon sastanka između predsjednika Tuđmana i ministara vanjskih poslova Njemačke i Francuske Klausa Kinkela i Huberta Vedrinea, što je u Predsjedničkim dvorima.

Alignment
testing stage
demo of Atril’s DéjàVu translation memory database V
aligning module
  – works fine for 1:1 alignments
  – handwork for 2:1, 3:1, 1:2, 1:3
export to TMX format
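
To illustrate what an export to TMX amounts to, the following Python sketch writes aligned sentence pairs as TMX translation units. It is not the DéjàVu export itself: the function name `build_tmx`, the TMX version, and all header attribute values are placeholder assumptions, and the sample sentence pair in the usage note is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Qualified name of the xml:lang attribute; ElementTree serializes it
# with the reserved "xml" prefix.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="hr", tgt_lang="en"):
    """Build a TMX tree with one translation unit per aligned pair."""
    tmx = ET.Element("tmx", version="1.4")  # version is an assumption
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return tmx
```

Calling `ET.tostring(build_tmx([("Dobar dan.", "Good day.")]), encoding="unicode")` returns the document as a string. A 2:1 or 1:2 alignment would have to be merged into a single segment on one side before export, since each translation unit here holds exactly one segment per language.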

Alignment 2

Alignment 3

Alignment 4
encoding problem: How to store alignments?
several ways to do it now
  – CES with pointers to IDs in 3rd file
  – translation memory (Translation Units as aligned pairs)
since we are in XML => PLUG project dtd (Tiedemann 1998)
si-en parallel corpus (Erjavec 1999): SGML, modified TEI to have TU. But all upper and lower level encoding (,, ) is lost. Is there a way to retain it?
  – TEIXML dtd, Nancy, July
Interpretation of TEI dtd? Would that dtd prefer alignment by IDs and pointers?
Is the SGML/XML decision really a problem for us? To the same element we can attach different headers, convert character entities and have SGML instead of XML?
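
The first option, alignment stored as pointers to sentence IDs in a third file, can be sketched in Python as below. The element and attribute names (`linkGrp`, `link`, `xtargets`, `fromDoc`, `toDoc`, `type`) are only loosely modelled on CES-style alignment files and should be read as assumptions, not as the CES DTD; the function name `build_alignment`, the sentence IDs and the file names are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_alignment(links, src_doc, tgt_doc):
    """Build a stand-off alignment document.

    links is a list of (source_ids, target_ids) pairs, each a list of
    sentence IDs, so 1:1, 2:1, 1:2 and similar alignments all fit.
    """
    root = ET.Element("linkGrp", {"fromDoc": src_doc, "toDoc": tgt_doc})
    for src_ids, tgt_ids in links:
        ET.SubElement(root, "link", {
            # Alignment type, e.g. "2:1" for two source sentences to one target.
            "type": f"{len(src_ids)}:{len(tgt_ids)}",
            # Source IDs and target IDs, separated by " ; ".
            "xtargets": " ".join(src_ids) + " ; " + " ".join(tgt_ids),
        })
    return root
```

For instance, `build_alignment([(["hr.1"], ["en.1"]), (["hr.2", "hr.3"], ["en.2"])], "cw.hr.xml", "cw.en.xml")` produces one 1:1 link and one 2:1 link. Because the two text files are only pointed at and never rewritten, their own upper and lower level encoding stays intact, which is the property the slide asks about.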

Preliminary statistics for aligning
already it seems that we would have a lot of handwork
discrepancy between number of and in hr and en
      hr    en    % increase
CW
CW
alignment is not on the schedule yet