1 Possibilities of identification of translation equivalents in a parallel corpus
Krešimir Šojat, Marko Tadić
Institute of Linguistics, Faculty of Philosophy, Zagreb, Croatia

2 Croatian-English parallel corpus
- compiled at the Institute of Linguistics, Faculty of Philosophy, Zagreb
- languages
  - source language: Croatian
  - target language: English
- consists of 113 issues of “Croatia Weekly”

3 Corpus parameters
- Articles: 4,343
- Sentences:
  - HR: 67,694 (15.59 sentences/article avg.)
  - EN: 75,390 (17.36 sentences/article avg.)
- Tokens:
  - HR: 1,490,964 (22.03 words/sentence avg.)
  - EN: 1,796,744 (23.83 words/sentence avg.)
  - Total: 3,287,708

4 Corpus parameters 2
- “Vanilla” aligner, sentence level
- alignments
    type   count   articles    share
    0:1      310        235    0.45%
    1:0       25         12    0.04%
    1:1    [...]       4143   84.12%
    1:2     8611       3288   12.76%
    2:1     1391       1012    2.06%
    2:2      379        345    0.56%
- total alignments: [...] in 4143 articles
- only 1:1 alignments were analysed

5 The aim
- detect possible translation equivalents (TE) of multiword units (pairs of words) that occur in aligned sentences
- basic assumption:
  - within a pair of aligned sentences, a pair of words may be translated by a pair of words
- search for translation equivalent candidates among pairs of pairs of words

6 Test sample
- research done on a test sample (issues 14-32)
  - [...] 1:1 alignments
  - tokens
    - HR: 188,851
    - EN: 223,112
  - types
    - HR: 35,777
    - EN: 16,521

7 Methods of TE detection
- First attempt
  - pure frequency counts of all possible pairs
    hr: Ove su jabuke velike. Skupe su jabuke.
    en: These apples are big. Apples are expensive.
  - possible pairs:
    HR               EN
    ove su           these apples
    su jabuke        apples are
    jabuke velike    are big
    skupe su         apples are
    su jabuke        are expensive
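The pair extraction on this slide can be sketched in a few lines. This is a minimal illustration only: the function name `adjacent_pairs`, the lowercasing, and the stripping of the final full stop are assumptions, not details given in the slides.

```python
from collections import Counter

def adjacent_pairs(sentence):
    """Return the adjacent token pairs of one sentence (assumed tokenisation:
    lowercase, drop the final full stop, split on whitespace)."""
    tokens = sentence.lower().rstrip(".").split()
    return list(zip(tokens, tokens[1:]))

hr_sentences = ["Ove su jabuke velike.", "Skupe su jabuke."]
en_sentences = ["These apples are big.", "Apples are expensive."]

# Pure frequency counts of pairs, separately for each language.
hr_counts = Counter(p for s in hr_sentences for p in adjacent_pairs(s))
en_counts = Counter(p for s in en_sentences for p in adjacent_pairs(s))

print(hr_counts.most_common(1))
print(en_counts.most_common(1))
```

On the two example sentences, "su jabuke" and "apples are" are the only pairs that occur twice.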

8 Methods of TE detection 2
- all possible pairs of pairs of tokens within each aligned sentence pair were generated
  - sentence pairs aligned 1:1
  - 5,259,631 pairs of pairs
- results poor:
  - frequency gain in expected TE was very low
  - real TE pairs were severely outnumbered by irrelevant pairs
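A rough sketch of how such pairs of pairs could be generated and counted is given below. It assumes adjacent token pairs and a simple whitespace tokenisation; the helper `adjacent_pairs` and the variable names are illustrative, not taken from the slides.

```python
from collections import Counter
from itertools import product

def adjacent_pairs(sentence):
    """Adjacent token pairs of one sentence (assumed tokenisation)."""
    tokens = sentence.lower().rstrip(".").split()
    return list(zip(tokens, tokens[1:]))

# Sentence pairs aligned 1:1 (the example sentences from the previous slide).
aligned = [
    ("Ove su jabuke velike.", "These apples are big."),
    ("Skupe su jabuke.", "Apples are expensive."),
]

# Every HR pair is combined with every EN pair of the same aligned sentence pair.
pairs_of_pairs = Counter()
for hr, en in aligned:
    for hr_pair, en_pair in product(adjacent_pairs(hr), adjacent_pairs(en)):
        pairs_of_pairs[(hr_pair, en_pair)] += 1

for candidate, freq in pairs_of_pairs.most_common(3):
    print(freq, candidate)
```

Even in this toy case only one of the twelve distinct combinations is seen more than once, which gives a feel for why candidates that are not translation equivalents dominate the list when the whole sample is processed.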

9 Methods of TE detection 3
- Second attempt: use mutual information (MI) for pair selection
  - MI calculated for pairs from each language
  - [table: tokens and types (HR, EN); pairs and different pairs (HR, EN); pairs with MI >= 10 and MI >= 15 (HR, EN); values not recoverable]
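The slides do not state which MI formula was used. A common choice for word pairs is pointwise mutual information, log2(P(x,y) / (P(x) P(y))), and the sketch below assumes that definition with plain relative-frequency estimates. The function name `bigram_mi` and the toy English input are illustrative only.

```python
import math
from collections import Counter

def bigram_mi(tokens):
    """Pointwise MI (base 2) for every adjacent token pair.

    Assumption: probabilities are relative frequencies, with unigrams
    normalised by the token count and pairs by the number of adjacent pairs.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_pairs = n_tokens - 1
    mi = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_pairs
        p_x = unigrams[x] / n_tokens
        p_y = unigrams[y] / n_tokens
        mi[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return mi

tokens = "new york is big and new york is old".split()
mi = bigram_mi(tokens)
for pair, value in sorted(mi.items(), key=lambda kv: -kv[1]):
    print(pair, round(value, 2))
```

With this estimate, pairs made of words seen only once get the highest scores, a known property of pointwise MI that is worth keeping in mind when thresholds such as MI >= 10 are applied.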

10 Methods of TE detection 4
- expectation
  - a pair of tokens from the source language with high MI could have a TE that is also a pair of tokens with high MI in the target language
  - list of pairs of pairs of tokens filtered by MI
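The filtering step could look roughly like the sketch below. The function `filter_candidates`, its parameters, and above all the MI values are invented for illustration; they are not results from the corpus.

```python
from itertools import product

def filter_candidates(aligned_pairs, mi_src, mi_tgt, threshold):
    """Keep only pairs of pairs whose source pair and target pair both
    reach the MI threshold (hypothetical helper, not the authors' code)."""
    candidates = []
    for src_pairs, tgt_pairs in aligned_pairs:
        strong_src = [p for p in src_pairs if mi_src.get(p, float("-inf")) >= threshold]
        strong_tgt = [p for p in tgt_pairs if mi_tgt.get(p, float("-inf")) >= threshold]
        candidates.extend(product(strong_src, strong_tgt))
    return candidates

# One aligned sentence pair, already split into adjacent token pairs.
aligned_pairs = [(
    [("ove", "su"), ("su", "jabuke"), ("jabuke", "velike")],
    [("these", "apples"), ("apples", "are"), ("are", "big")],
)]

# Made-up MI values, for illustration only.
mi_hr = {("ove", "su"): 3.0, ("su", "jabuke"): 11.0, ("jabuke", "velike"): 12.5}
mi_en = {("these", "apples"): 10.5, ("apples", "are"): 4.0, ("are", "big"): 2.0}

candidates = filter_candidates(aligned_pairs, mi_hr, mi_en, threshold=10)
for src, tgt in candidates:
    print(src, "<->", tgt)
```

The filter only checks that both sides score highly on their own; it does not link the source pair to the target pair in any other way.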

11 List of pairs of pairs

12 Results
- disappointing
- high MI of pairs of tokens in both languages does not necessarily mean that translation equivalence can be detected
- the fact that pairs of tokens with high MI appear on both sides of aligned sentences is not enough to establish TE

13 Possible directions
- find a way of connecting MI values from both languages
- problem of morphology
  - Croatian is an inflectional language
  - inflection affects pairs of tokens
  - affects frequency of pairs
  - affects MI value
  - calculate MI for types or for lemmas?
- is the MI value useful for morphologically rich languages?
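The morphology problem can be illustrated with a toy comparison of pair counts over word forms and over lemmas. The lemma map below is a tiny hand-made stand-in for a real Croatian lemmatiser, and the three phrases are constructed examples, not corpus data.

```python
from collections import Counter

# Hypothetical hand-made lemma map (assumption: a real lemmatiser would be used).
LEMMAS = {
    "crvena": "crven", "crvene": "crven", "crvenu": "crven",
    "jabuka": "jabuka", "jabuke": "jabuka", "jabuku": "jabuka",
}

def pair_counts(sentences, normalize=lambda token: token):
    """Count adjacent pairs after applying `normalize` to every token."""
    counts = Counter()
    for sentence in sentences:
        tokens = [normalize(t) for t in sentence.lower().split()]
        counts.update(zip(tokens, tokens[1:]))
    return counts

# The same phrase ("red apple") in three different inflected forms.
sentences = ["crvena jabuka", "crvene jabuke", "crvenu jabuku"]

by_token = pair_counts(sentences)
by_lemma = pair_counts(sentences, lambda t: LEMMAS.get(t, t))

print(by_token)
print(by_lemma)
```

Counted as word forms, the phrase yields three different pairs with frequency 1 each; counted as lemmas, it yields one pair with frequency 3. Since MI is computed from these frequencies, the choice between forms and lemmas changes the MI values as well.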