Using a parallel corpus in translation practice and research Ana Frankenberg-Garcia

Machine Translation Using machines to analyse Human Translation

The study of human translation
- Traditionally not a hard science
- Difficult to be systematic
But with the technology of corpus linguistics, things can change …

What is a corpus?
A large collection of machine-readable texts, selected according to specific criteria and explored with text-retrieval software

Advantages of using corpora to study human translation
- An enormous amount of translated texts
- Systematic analyses
- Quantifiable results

COMPARA
A bi-directional parallel corpus of Portuguese and English
Project leaders: Ana Frankenberg-Garcia & Diana Santos
Research assistants: Rosário Silva & Susana Inácio
Initial support ( ): FCT (Portugal), ISLA (Lisboa), Oxford University (Language Centre)
Present funding ( ): Linguateca: FCT/POSI (POSI/PLP/43931/2001)

COMPARA structure
PT source texts -> EN translations
EN source texts -> PT translations

Source texts: original English, original Portuguese
Translations: translated Portuguese, translated English

COMPARA 8.0 varieties
PORTUGUESE: Portugal, Brazil, Angola, Mozambique
ENGLISH: UK, US, South Africa
Unbalanced distribution!

COMPARA 8.0 Publication dates

COMPARA 8.0 genre
Published fiction
EXTENSIBLE to other genres

COMPARA 8.0 authors
Portuguese writers: Camilo Castelo Branco, Eça de Queirós, José Cardoso Pires, José Saramago, Jorge de Sena, Lídia Jorge, Mário de Carvalho, Sá Carneiro

COMPARA 8.0 authors
Brazilian writers: Aluísio Azevedo, Autran Dourado, Chico Buarque, Jô Soares, José de Alencar, Machado de Assis, Manuel Antônio de Almeida, Marcos Rey, Patrícia Melo, Paulo Coelho, Rubem Fonseca

COMPARA 8.0 authors
Angolan writers: José Eduardo Agualusa
Mozambican writers: Mia Couto

COMPARA 8.0 authors
British writers: David Lodge, Ian McEwan, Julian Barnes, Joseph Conrad, Joanna Trollope, Kazuo Ishiguro, Lewis Carroll, Mary Shelley, Oscar Wilde

COMPARA 8.0 authors
American writers: Henry James, Edgar Allan Poe, Richard Zimler
South African writers: Nadine Gordimer

Can any text be included in the corpus?
- Only published source texts and translations
- Only English translated directly from Portuguese, and Portuguese translated directly from English
- Only human translations!

COMPARA 8.0 texts
71 source texts (extracts)
74 translations

COMPARA 8.0 size
1,536,269 and 1,423,937 words in English and Portuguese
Largest edited parallel corpus containing Portuguese

COMPARA users and uses
- Language learners: bilingual dictionary with examples
- Language teachers: exercises and tests
- Translators: language equivalents
- Translation lecturers: exercises & problems
- Translation theorists: test translation hypotheses
- Lexicographers: bilingual dictionaries
- Computational linguists: machine translation
Latest statistics: queries per month

COMPARA availability
Free, online
For research and education

COMPARA access COMPARA

“nodded”

Studies using COMPARA
1. Observing source texts and translations
2. Contrasting Portuguese and English
3. Comparing translated and untranslated language
4. Examining the characteristics of translated texts

1. Observing source texts & translations
Improving bilingual dictionaries and machine-translation programs
- Frankenberg-Garcia (2002): nod
- Ribeiro & Dias (2005): grande
- Specia et al. (2005): word-sense disambiguation
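
A minimal sketch of the kind of lookup behind studies like the "nod" one: search one side of a set of aligned sentence pairs and read off the corresponding translations. The sentence pairs, the function name `parallel_concordance` and the regular expression are all assumptions made for illustration; they are not COMPARA data and this is not COMPARA's query interface.

```python
import re

# Hypothetical aligned sentence pairs, invented for illustration (not COMPARA data).
ALIGNED_PAIRS = [
    ("She nodded and left the room.", "Ela acenou com a cabeça e saiu da sala."),
    ("He nodded off during the film.", "Ele adormeceu durante o filme."),
    ("They waited in silence.", "Eles esperaram em silêncio."),
]


def parallel_concordance(pairs, pattern):
    """Return every (source, translation) pair whose source side matches the regex."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [(src, tgt) for src, tgt in pairs if regex.search(src)]


# Match the forms nod, nods, nodded, nodding as whole words.
hits = parallel_concordance(ALIGNED_PAIRS, r"\bnod(s|ded|ding)?\b")
for src, tgt in hits:
    print(f"{src}\t{tgt}")
```

Reading the two columns side by side shows how one source word can map to quite different target expressions, which is the sort of evidence a bilingual dictionary entry could draw on.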

2. Contrasting English and Portuguese
Contrasting original fiction in English and Portuguese
Frankenberg-Garcia (2005)
Charts: PT loan words vs. EN loan words; PT loan languages vs. EN loan languages

3. Comparing translated and untranslated language

                   translations*   source texts*   ratio
diferente(s)       30.7            15.4            2 x
simplesmente       15.6            5.1             3 x
end.* up           13.5            2.8             4 x
lemma “rezar”      5.6             12.4            2 x

* frequency/100 K words in COMPARA
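
Frequencies per 100 K words make sub-corpora of different sizes comparable. A short sketch of that normalisation follows; the raw hit counts and sub-corpus sizes are hypothetical round numbers chosen for illustration, and the helper names `per_100k` and `ratio` are assumptions, not part of any COMPARA tool.

```python
def per_100k(hits, corpus_words):
    """Normalise a raw hit count to a frequency per 100,000 words."""
    return hits * 100_000 / corpus_words


def ratio(freq_a, freq_b):
    """How many times larger the higher frequency is than the lower one."""
    return max(freq_a, freq_b) / min(freq_a, freq_b)


# Hypothetical counts: the same word searched in two sub-corpora of different sizes.
translated = per_100k(hits=307, corpus_words=1_000_000)
untranslated = per_100k(hits=77, corpus_words=500_000)

print(f"translations: {translated:.1f} per 100K words")
print(f"source texts: {untranslated:.1f} per 100K words")
print(f"ratio: about {ratio(translated, untranslated):.1f} x")
```

Comparing raw counts here (307 vs. 77) would overstate the gap, because the two hypothetical sub-corpora differ in size; the normalised figures are the ones that can be set side by side.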

4. Examining the characteristics of translated texts
Are translations longer than source texts?
Frankenberg-Garcia (2004)
Explicitation Hypothesis

Source texts: 8 x Pt 1500 words (8 PT authors), 8 x En 1500 words (8 EN authors)
Translations: ? words (8 PT translators, 8 EN translators)

Source texts (ST) vs. translations (TT): TT = ST + 5%
Matched t-test: 95% probability that TT are longer than ST
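
A matched (paired) t-test of this kind can be sketched in a few lines. The word counts below are hypothetical and are not the study's data; the function name `paired_t` is an assumption, and the critical value is the standard one-tailed Student's t value for 7 degrees of freedom at the 0.05 level.

```python
import math
from statistics import mean, stdev


def paired_t(source_lengths, translation_lengths):
    """Return the paired t statistic and degrees of freedom for TT minus ST."""
    diffs = [tt - st for st, tt in zip(source_lengths, translation_lengths)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Hypothetical data: eight 1500-word source extracts and the word counts
# of their translations (invented numbers, for illustration only).
source = [1500] * 8
translation = [1580, 1540, 1610, 1495, 1575, 1620, 1530, 1590]

t_stat, df = paired_t(source, translation)

# One-tailed critical value of Student's t for df = 7 at the 0.05 level.
T_CRIT_DF7 = 1.895

print(f"t = {t_stat:.2f}, df = {df}")
print("TT significantly longer than ST" if t_stat > T_CRIT_DF7 else "no significant difference")
```

The test is paired because each translation is compared with its own source text, so differences between authors or texts do not inflate the variance.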

To conclude...
- Studies such as these were unthinkable before corpora
- Many other studies are possible!
- COMPARA is free and available online
Contact us: