ZRINKA DUJMOVIĆ University of Zagreb/ETF JRC Workshop: Exploiting parallel corpora in up to 20 Languages Arona, 25-27 September 2005 STATISTICAL ANALYSIS.


ZRINKA DUJMOVIĆ, University of Zagreb/ETF
JRC Workshop: Exploiting parallel corpora in up to 20 Languages, Arona, 25-27 September 2005
STATISTICAL ANALYSIS OF NOUN LEMMAS IN THE ITALIAN AND SWISS CONSTITUTIONS AND THEIR TRANSLATIONS INTO CROATIAN

What?
- Constitution of the Republic of Italy (original in Italian + translation into Croatian): 139 articles + transitory provisions; effective since [...]
- Federal Constitution of the Swiss Confederation (original in Italian + translation into Croatian from It./Germ./Eng.): 196 articles (+ transitory provisions); in force since 2000.

Why?
- objective: test terminological consistency between source language (SL) and target language (TL)
- prerequisites:
  - parallel corpora as rich resources of translation equivalents
  - small corpora

How? Data processing:
- Conversion into the HTML format
- Sentence alignment
- Lemmatisation (inflectionally rich language!)
- Corpus annotation (POS tagging)
- Word alignment
- Word frequency lists
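The last processing step, building a frequency list of noun lemmas, can be sketched as follows. This is only an illustration: the slides do not say which tools were used, so the input format (a list of `(lemma, pos_tag)` pairs), the `"NOUN"` tag and the function name `noun_lemma_frequencies` are assumptions, and the sample tokens are invented.

```python
from collections import Counter

def noun_lemma_frequencies(tagged_tokens, noun_tag="NOUN"):
    """Return (lemma, count) pairs for noun lemmas, most frequent first.

    tagged_tokens: iterable of (lemma, pos_tag) pairs, assumed to come
    from an earlier lemmatisation + POS-tagging step (hypothetical format).
    """
    counts = Counter(lemma for lemma, pos in tagged_tokens if pos == noun_tag)
    return counts.most_common()

# Invented sample, not taken from the constitutions.
sample = [
    ("legge", "NOUN"),
    ("essere", "VERB"),
    ("stato", "NOUN"),
    ("legge", "NOUN"),
]
freq_list = noun_lemma_frequencies(sample)
```

The same function would be run once on the Italian text and once on the Croatian text, giving the two frequency lists that are compared later.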

Testing terminological consistency of translation
1. HYPOTHESIS: 1 Italian noun lemma = 1 translation equivalent in Croatian (Constitution)
2. STATISTICAL TESTING
- the least squares method: Y = a + bX
- correlation coefficient (R)
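A minimal sketch of the statistical test, assuming X holds the frequencies of the Italian noun lemmas and Y the frequencies of their Croatian equivalents (the slides do not state how the frequencies were scaled). It fits Y = a + bX by ordinary least squares and computes the Pearson correlation coefficient R. The function name and the sample numbers are invented for illustration.

```python
import math

def least_squares_fit(x, y):
    """Fit y = a + b*x by ordinary least squares; return (a, b, r)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squared deviations and of cross-products.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                    # slope
    a = mean_y - b * mean_x          # intercept
    r = sxy / math.sqrt(sxx * syy)   # Pearson correlation coefficient
    return a, b, r

# Invented frequencies for four lemma pairs (not the real data).
italian = [10, 8, 5, 2]
croatian = [10, 7, 5, 2]
a, b, r = least_squares_fit(italian, croatian)
```

Under the hypothesis of one equivalent per lemma, the points should lie close to the line Y = X, so a near 0, b near 1 and R near 1 would indicate consistent translation.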

Correlation of the most frequent Italian and Croatian noun lemmas in the Federal Constitution of the Swiss Confederation (51): a = 0,009; b = 0,030; R = 0,978

Correlation of the most frequent Italian and Croatian noun lemmas in the Constitution of the Republic of Italy (31): a = 0,075; b = 0,938; R = 0,975

Deviation from linearity
(a) Accidental (translators' mistakes)
(b) Justified (still not expected!)
- stylistic differences, e.g. use of a relative pronoun instead of a noun (1:0)
- polysemy (1:2), e.g. It. titolo 11 x
  = Cr. naslov 6 x (Eng. title)
  = Cr. vrijednosni papiri 1 x (Eng. securities)
  - as idiom: 1) a titolo transitorio = privremeno (Eng. temporarily); 2) a titolo oneroso = za plaću (Eng. against payment)
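Deviations such as the titolo case can be found by counting, for each Italian lemma, how many different Croatian equivalents it is aligned with. The sketch below assumes the word-alignment step yields `(italian_lemma, croatian_equivalent)` pairs; that format and both function names are hypothetical. The titolo counts come from the slide, while the legge/zakon pairs are invented as a consistent contrast case.

```python
from collections import Counter, defaultdict

def translation_equivalents(aligned_pairs):
    """Map each source lemma to a Counter of its target equivalents."""
    table = defaultdict(Counter)
    for source_lemma, target_equivalent in aligned_pairs:
        table[source_lemma][target_equivalent] += 1
    return table

def inconsistent_lemmas(table):
    """Return the lemmas that have more than one target equivalent."""
    return {src: dict(targets) for src, targets in table.items() if len(targets) > 1}

pairs = (
    [("titolo", "naslov")] * 6                 # count from the slide
    + [("titolo", "vrijednosni papiri")]       # count from the slide
    + [("legge", "zakon")] * 3                 # invented, consistent case
)
flagged = inconsistent_lemmas(translation_equivalents(pairs))
```

Each flagged lemma still needs manual inspection, since the counts alone cannot tell an accidental deviation from a justified one.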

Italian noun lemmas present in Italian and Swiss constitutions = candidates for glossary
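One way to list such candidates is to intersect the two Italian frequency lists and rank the shared lemmas by their lower frequency, so that a lemma must be frequent in both texts to rank high. This is only a sketch: the ranking rule, the `min_count` threshold and the function name are assumptions, and the frequencies are invented.

```python
def glossary_candidates(freq_a, freq_b, min_count=2):
    """Return lemmas present in both frequency dicts, best candidates first.

    A lemma is ranked by the smaller of its two frequencies; lemmas whose
    smaller frequency is below min_count are dropped.
    """
    shared = set(freq_a) & set(freq_b)
    kept = [l for l in shared if min(freq_a[l], freq_b[l]) >= min_count]
    # Sort by descending minimum frequency, then alphabetically for stability.
    return sorted(kept, key=lambda l: (-min(freq_a[l], freq_b[l]), l))

# Invented noun-lemma frequencies for the two Italian texts.
italy = {"legge": 40, "stato": 25, "regione": 30}
swiss = {"legge": 50, "cantone": 60, "stato": 5}
candidates = glossary_candidates(italy, swiss)
```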

Conclusions
- the least squares method appeared to be adequate for verification of the translation
- the verification does not have to be carried out on the entire sample, but only on the lemmas with the highest frequency, covering at least one order of magnitude
- the best candidates for a glossary are those lemmas which are repeated with high frequency in both constitutions
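The second conclusion can be read as: keep the most frequent lemmas down to one tenth of the highest frequency. That reading is an assumption, as are the function name and the `factor` parameter; the slides do not give an exact cut-off rule. The frequencies below are invented.

```python
def top_lemmas_one_order(freq_list, factor=10):
    """Keep lemmas whose count is at least max_count / factor.

    freq_list: list of (lemma, count) pairs in any order.
    With factor=10 the kept counts span one order of magnitude.
    """
    if not freq_list:
        return []
    ordered = sorted(freq_list, key=lambda item: -item[1])
    threshold = ordered[0][1] / factor
    return [(lemma, count) for lemma, count in ordered if count >= threshold]

# Invented frequency list.
frequencies = [("legge", 120), ("stato", 60), ("diritto", 12), ("titolo", 11)]
selected = top_lemmas_one_order(frequencies)
```

Only the selected lemmas would then be passed to the least squares fit.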