Language Divergences and Solutions
Advanced Machine Translation Seminar
Alison Alvarez

Overview
- Introduction
- Morphology Primer
- Translation Mismatches: Types, Solutions
- Translation Divergences: Types, Solutions
- Different MT Systems
- Generation-Heavy Machine Translation
- DUSTer

Source ≠ Target
- Languages don't encode the same information in the same way
  - Makes MT complicated
  - Keeps all of us employed

Morphology in a Nutshell
- Morphemes are word parts
  - work + er
  - iki + ta + ku + na + ku + na + ri + ma + shi + ta
- Types of morphemes
  - Derivational: makes a new word
  - Inflectional: adds information to an existing word

Morphology in a Nutshell
- Analytic/Isolating
  - Little or no inflectional morphology; separate words
  - Vietnamese, Chinese
  - "I was made to go"
- Synthetic
  - Lots of inflectional morphology
  - Fusional vs. agglutinating
  - Romance languages, Finnish, Japanese, Mapudungun
  - ika (to go) + se (to make/let) + rare (passive) + ta (past tense)
  - He need + s (3rd person singular) it.
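
To make the agglutinative example concrete, here is a minimal sketch of how such a word can be represented as a sequence of glossed morphemes. The segmentation follows the slide's Japanese example; the data structure and helper function are illustrative, not part of any system discussed here.

```python
# Sketch: representing an agglutinating word as glossed morphemes.
# Segmentation follows the slide: ika + se + rare + ta ("was made to go").

JAPANESE_EXAMPLE = [
    ("ika",  "go"),          # verb stem
    ("se",   "CAUSATIVE"),   # "to make/let"
    ("rare", "PASSIVE"),
    ("ta",   "PAST"),
]

def gloss(morphemes):
    """Return the surface form and an interlinear gloss line."""
    surface = "".join(m for m, _ in morphemes)
    gloss_line = "-".join(g for _, g in morphemes)
    return surface, gloss_line

surface, gloss_line = gloss(JAPANESE_EXAMPLE)
print(surface)     # ikaserareta
print(gloss_line)  # go-CAUSATIVE-PASSIVE-PAST
```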

Translation Differences
- Translation Mismatches: different information in the source and target
- Translation Divergences: the same information in the source and target, but the meaning is distributed differently in each language

Translation Mismatches
- “…the information that is conveyed is different in the source and target languages”
- Types:
  - Lexical level
  - Typological level

Lexical Mismatches
- A lexical item in one language may make more distinctions than its counterpart in another
- Brother:
  - 弟 (otouto): younger brother
  - 兄さん (ani-san): older brother
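
A minimal sketch of how this kind of mismatch might be represented in a translation lexicon: the English word underspecifies a distinction (relative age) that the Japanese words require. The feature names and lookup function are invented for illustration.

```python
# Sketch: one English word maps to several Japanese words, each carrying a
# distinction that "brother" does not encode. Feature names are illustrative.

LEXICON = {
    ("brother", frozenset({"younger"})): "otouto",   # 弟
    ("brother", frozenset({"older"})):   "ani-san",  # 兄さん (as glossed on the slide)
}

def translate(word, features):
    """Look up a target word; fail if the required distinction is missing."""
    key = (word, frozenset(features))
    if key not in LEXICON:
        raise KeyError(f"'{word}' needs more context to translate: {features}")
    return LEXICON[key]

print(translate("brother", {"younger"}))  # otouto
```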

Typological Mismatches
- A mismatch between languages with different levels of grammaticalization
- One language may be more structurally complex
- Examples: source marking, obligatory subjects

Typological Mismatches
- Source marking: Quechua vs. English
  - "(they say) s/he was singing" --> takisharansi
  - taki (sing) + sha (progressive) + ra (past) + n (3rd sg) + si (reportative)
- Obligatory arguments: English vs. Japanese
  - Kusuri wo nonda --> "(I, you, etc.) took medicine."
  - Makasemasu! --> "(I'll) leave (it) to (you)."

Translation Mismatch Solutions
- More information --> less information (easy)
- Less information --> more information (hard)
  - Context clues
  - Language models
  - Generalization
  - Formal representations
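
A toy sketch of the hard direction: when the target language requires information the source does not supply (such as the dropped subject in the Japanese examples above), one listed option is to generate the alternatives and let a language model and context decide. The scoring function here is a hypothetical stand-in for a real language model.

```python
# Sketch: going from "less information" to "more information" by generating
# every candidate the target grammar requires and letting a scorer pick one.
# score_with_lm is a stand-in for a real language model; it is not from any
# system discussed in these slides.

CANDIDATE_SUBJECTS = ["I", "you", "he", "she", "we", "they"]

def score_with_lm(sentence, context):
    """Hypothetical fluency/coherence score given the preceding discourse.
    This toy version just prefers a subject already seen in the context."""
    first_word = sentence.split()[0].lower()
    return sum(first_word == w.lower() for w in context.split())

def fill_subject(predicate, context):
    candidates = [f"{subj} {predicate}" for subj in CANDIDATE_SUBJECTS]
    return max(candidates, key=lambda s: score_with_lm(s, context))

# Japanese "kusuri wo nonda" has no overt subject; English must choose one.
print(fill_subject("took the medicine.", "I had a headache, so"))
# -> "I took the medicine."
```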

Translation Divergences
- “…the same information is conveyed in source and target texts”
- Divergences are quite common
  - They occur in about one out of every three sentences in the TREC El Norte newspaper corpus (Spanish-English)
  - Sentences can have multiple kinds of divergences

Translation Divergence Types
- Categorial Divergence
- Conflational Divergence
- Structural Divergence
- Head Swapping Divergence
- Thematic Divergence

Categorial Divergence
- Translation that uses different parts of speech
- Tener hambre (have hunger) --> be hungry (noun --> adjective)

Conflational Divergence
- The translation of two words using a single word that combines their meaning
- Can also be called a lexical gap
- X stab Z --> X dar puñaladas a Z (X give stabs to Z)
- glastuinbouw --> cultivation under glass
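
A natural way to handle conflation in a lexicon is to allow multi-word entries on either side, so that a single source word can expand into a phrase and vice versa. A minimal sketch, using the slide's examples; the representation and the expand function are invented for illustration.

```python
# Sketch: lexicon entries whose two sides differ in length, so a single word
# can map to a multi-word construction. Entries mirror the slide's examples.

CONFLATION_LEXICON = [
    (["stab"],         ["dar", "puñaladas", "a"]),           # "give stabs to"
    (["glastuinbouw"], ["cultivation", "under", "glass"]),    # Dutch -> English
]

def expand(tokens, lexicon):
    """Replace any source phrase found in the lexicon by its target side."""
    out, i = [], 0
    while i < len(tokens):
        for src, tgt in lexicon:
            if tokens[i:i + len(src)] == src:
                out.extend(tgt)
                i += len(src)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

print(expand(["John", "stab", "Bill"], CONFLATION_LEXICON))
# ['John', 'dar', 'puñaladas', 'a', 'Bill']  (before inflection/agreement)
```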

Structural Divergence
- A difference in the realization of incorporated arguments, e.g. PP vs. object
- X entrar en Y (X enter in Y) --> X enter Y
- X ask for a referendum --> X pedir un referendum (ask-for a referendum)

Head Swapping Divergence
- Involves the demotion of a head verb and the promotion of a modifier verb to head position
- Yo entro en el cuarto corriendo (I enter the room running) --> I ran into the room.
- [The slide shows the two parse trees side by side.]

Thematic Divergence
- Occurs when sentence arguments switch argument roles from one language to another
- X gustar a Y (X please to Y) --> Y like X
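
One common way a transfer system captures a thematic divergence is a rule that renames the predicate and swaps the grammatical roles of its arguments. A sketch under that assumption; the rule format is invented for illustration and is not the notation of any system covered in these slides.

```python
# Sketch: a transfer rule for the gustar/like divergence that swaps subject
# and object while changing the predicate.

def transfer_thematic(pred, args):
    """args maps grammatical functions to fillers, e.g. {"subj": ..., "obj": ...}."""
    rules = {
        # Spanish "X gustar a Y" --> English "Y like X"
        "gustar": ("like", {"subj": args.get("obj"), "obj": args.get("subj")}),
    }
    if pred in rules:
        return rules[pred]
    return pred, args  # no divergence: copy the structure unchanged

print(transfer_thematic("gustar", {"subj": "los libros", "obj": "Juan"}))
# ('like', {'subj': 'Juan', 'obj': 'los libros'})  -> "Juan likes the books"
```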

Divergence Solutions and Statistical/EBMT Systems
- Not really addressed explicitly in SMT
- Covered in EBMT only if the divergence is covered extensively in the data

Divergence Solutions and Transfer Systems
- Hand-written transfer rules
- Automatic extraction of transfer rules from bi-texts
- Both become problematic when a sentence contains multiple divergences

Divergence Solutions and Interlingua Systems
- Mel’čuk’s Deep Syntactic Structure
- Jackendoff’s Lexical Semantic Structure
- Both require “explicit symmetric knowledge” from both source and target language
- Expensive

Divergence Solutions and Interlingua Systems
- John swam across a river --> Juan cruza el río nadando
- Interlingual representation:
  [event CAUSE JOHN
    [event GO JOHN
      [path ACROSS JOHN
        [position AT JOHN RIVER]]]
    [manner SWIM+INGLY]]
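
The bracketed structure above can be written down directly as a nested data structure. A minimal sketch; the node layout (type, primitive, children) is chosen for readability and is only an approximation of the actual formalism.

```python
# Sketch: the slide's interlingual representation as a nested Python structure,
# plus a small pretty-printer. Node layout is an illustrative approximation.

lcs = ("event", "CAUSE", "JOHN",
       ("event", "GO", "JOHN",
        ("path", "ACROSS", "JOHN",
         ("position", "AT", "JOHN", "RIVER"))),
       ("manner", "SWIM+INGLY"))

def pretty(node, depth=0):
    """Print the nested structure with indentation."""
    pad = "  " * depth
    if isinstance(node, tuple):
        print(f"{pad}[{node[0]} {node[1]}")
        for child in node[2:]:
            pretty(child, depth + 1)
        print(f"{pad}]")
    else:
        print(f"{pad}{node}")

pretty(lcs)
```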

Generation-Heavy MT
- Built to address language divergences
- Designed for source-poor/target-rich translation
- Non-interlingual, non-transfer
- Uses symbolic overgeneration to account for different translation divergences

Generation-Heavy MT
- Source language: syntactic parser, translation lexicon
- Target language: lexical semantics, categorial variations & subcategorization frames for overgeneration; statistical language model

GHMT System

Analysis Stage
- Independent of the target language
- Creates a deep syntactic dependency representation
- Keeps only argument structure, top-level conceptual nodes & thematic-role information
- Should normalize over syntactic & morphological phenomena

Translation Stage
- Converts SL lexemes to TL lexemes
- Maintains the dependency structure

Analysis/Translation Stage
- [Slide figure: a dependency structure headed by GIVE (v) [cause go], with children I (agent), STAB (n) (theme), and JOHN (goal).]

Generation Stage
- Lexical & structural selection
  - Conversion to a thematic dependency: uses a syntactic-thematic linking map with "loose" linking
  - Structural expansion: addresses conflational & head-swapped divergences
  - Turning the thematic dependency into a TL syntactic dependency: addresses categorial divergence
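
A toy sketch of the overgeneration idea behind this stage: from one input predicate-argument structure, produce several target-side realizations (including light-verb and conflated variants) and leave the choice to later ranking. The variant table and realization strings are invented placeholders, not GHMT's actual resources.

```python
# Sketch: symbolic overgeneration of candidate target realizations from one
# predicate and its arguments, in the spirit of the generation stage above.

def overgenerate(pred, args):
    """Yield several candidate English realizations of (pred, args)."""
    # Categorial/conflational variation: the same meaning expressed as a plain
    # verb or as a light-verb + noun construction (placeholder table).
    variants = {
        "stab": ["stab", "give a stab to", "give stabs to"],
    }
    for v in variants.get(pred, [pred]):
        yield f"{args['agent']} {v} {args['goal']}"

candidates = list(overgenerate("stab", {"agent": "John", "goal": "Bill"}))
print(candidates)
# ['John stab Bill', 'John give a stab to Bill', 'John give stabs to Bill']
# (morphology and agreement would be handled before linearization/ranking)
```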

Generation Stage: Structural Expansion

Generation Stage
- Linearization step
  - Creates a word lattice to encode different possible realizations
  - Implemented using the oxyGen engine
- Sentences ranked & extracted with Nitrogen's statistical extractor
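
A toy illustration of these last two steps: encode alternative realizations compactly, enumerate them, and rank the resulting strings with a statistical scorer. The unigram scorer is a stand-in for a real statistical extractor, and the lattice encoding is invented for the example.

```python
# Sketch: a tiny word "lattice" of alternative realizations, enumerated and
# ranked by a toy scorer standing in for a real statistical extractor.

from itertools import product
from math import log

# Each lattice position holds alternative fillers (multi-word fillers allowed).
lattice = [["John"], ["stabbed", "gave stabs to"], ["Bill"]]

# Toy relative frequencies standing in for a language model.
UNIGRAM = {"John": 0.05, "Bill": 0.05, "stabbed": 0.01,
           "gave": 0.02, "stabs": 0.0005, "to": 0.06}

def score(path):
    """Average log-probability, so longer paths are not unfairly penalized."""
    words = " ".join(path).split()
    return sum(log(UNIGRAM.get(w, 1e-6)) for w in words) / len(words)

best = max(product(*lattice), key=score)
print(" ".join(best))  # John stabbed Bill
```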

Generation Stage

GHMT Results
- Four of five Spanish-English divergences "can be generated using structural expansion & categorial variations"
- The remaining one needed more world knowledge or idiom handling
- An SL syntactic parser can still be hard to come by

Divergences and DUSTer
- Helps overcome divergences for word alignment & improves coder agreement
- Changes an English sentence's structure to resemble that of another language
- Gives more accurate alignment and projection of dependency trees without training on dependency tree data

DUSTer
Motivation for the development of automatic correction of divergences:
1. "Every language pair has translation divergences that are easy to recognize"
2. "Knowing what they are and how to accommodate them provides the basis for refined word level alignment"
3. "Refined word-level" alignment results in improved projection of structural information from English to another language

DUSTer

- Bi-text parsed on the English side only
- "Linguistically motivated" & common search terms
- Conducted on Spanish & Arabic (and later Chinese & Hindi)
- Uses all of the divergences mentioned before, plus a "light verb" divergence
  - try --> put to trying --> poner a prueba

DUSTer Rule Development Methods
- Identify canonical transformations for each divergence type
- Categorize English sentences into a divergence type or "none"
- Apply the appropriate transformations
- Humans align E --> E' --> foreign language

DUSTer Rules
# "kill" => "LightVB kill(N)"   (LightVB = light verb)
# Presumably, this will work for "kill" => "give death to"
# "borrow" => "take lent (thing) to"
# "hurt" => "make harm to"
# "fear" => "have fear of"
# "desire" => "have interest in"
# "rest" => "have repose on"
# "envy" => "have envy of"
type1.B.X [English{2 1 3} Spanish{ } ]
  [ Verb [ Noun ] [ Noun ] ]
  [ LightVB [ Noun ] [ Noun ] [ Oblique [ Noun ] ] ]
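
The rule above rewrites a plain verb into a light-verb construction so that English aligns more closely with the foreign sentence. A toy re-implementation of that idea in code; the rule table paraphrases the commented examples and the rewriting function is illustrative, not DUSTer's actual rule engine.

```python
# Sketch: rewriting English into a divergence-adjusted E' by expanding a verb
# into a light-verb construction, in the spirit of the rule listed above.

LIGHT_VERB_RULES = {
    "kill":   ["give", "death", "to"],
    "fear":   ["have", "fear", "of"],
    "borrow": ["take", "lent", "(thing)", "to"],
}

def to_e_prime(tokens):
    """Rewrite E -> E' so it aligns more word-for-word with the foreign side."""
    out = []
    for tok in tokens:
        out.extend(LIGHT_VERB_RULES.get(tok, [tok]))
    return out

# Assuming lemmatized input tokens:
print(to_e_prime(["John", "kill", "the", "spider"]))
# ['John', 'give', 'death', 'to', 'the', 'spider']
```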

DUSTer Results

Conclusion
- Divergences are common
- They are not handled well by most MT systems
- GHMT can account for divergences, but still needs development
- DUSTer can handle divergences through structure transformations, but requires a great deal of linguistic knowledge

The End Questions?

References
Dorr, Bonnie J. "Machine Translation Divergences: A Formal Description and Proposed Solution." Computational Linguistics 20(4), 1994.
Dorr, Bonnie J. and Nizar Habash. "Interlingua Approximation: A Generation-Heavy Approach." In Proceedings of the Workshop on Interlingua Reliability, Fifth Conference of the Association for Machine Translation in the Americas (AMTA-2002), Tiburon, CA, pp. 1-6, 2002.
Dorr, Bonnie J., Clare R. Voss, Eric Peterson, and Michael Kiker. "Concept Based Lexical Selection." Proceedings of the AAAI-94 Fall Symposium on Knowledge Representation for Natural Language Processing in Implemented Systems, New Orleans, LA, 1994.
Dorr, Bonnie J., Lisa Pearl, Rebecca Hwa, and Nizar Habash. "DUSTer: A Method for Unraveling Cross-Language Divergences for Statistical Word-Level Alignment." Proceedings of the Fifth Conference of the Association for Machine Translation in the Americas (AMTA-2002), Tiburon, CA, 2002.
Habash, Nizar and Bonnie J. Dorr. "Handling Translation Divergences: Combining Statistical and Symbolic Techniques in Generation-Heavy Machine Translation." In Proceedings of the Fifth Conference of the Association for Machine Translation in the Americas (AMTA-2002), Tiburon, CA, 2002.
Haspelmath, Martin. Understanding Morphology. Oxford University Press, 2002.
Kameyama, Megumi, Ryo Ochitani, and Stanley Peters. "Resolving Translation Mismatches With Information Flow." Annual Meeting of the Association for Computational Linguistics, 1991.

Other Divergences
- Idioms
- Aspectual divergences
- Knowledge outside of lexical semantics