Machine Translation: Introduction to MT. Dan Jurafsky.

Presentation transcript:

Machine Translation Introduction to MT

Dan Jurafsky
Machine Translation
Two uses: fully automatic translation, and helping human translators
Enter Source Text: 这 不过 是 一 个 时间 的 问题.
Translation from Stanford’s Phrasal: This is only a matter of time.

Google Translate
Fried ripe plantains: os-maduros-fritos/

Machine Translation
The Story of the Stone (“The Dream of the Red Chamber”), Cao Xueqin, 1792
Chinese gloss: Dai-yu alone at bed on think-of-with-gratitude Bao-chai… again listen to window outside bamboo tip plantain leaf of on, rain sound sigh drop, clear cold penetrate curtain, not feeling again fall down tears come.
Hawkes translation: As she lay there alone, Dai-yu’s thoughts turned to Bao-chai… Then she listened to the insistent rustle of the rain on the bamboos and plantains outside her window. The coldness penetrated the curtains of her bed. Almost without noticing it she had begun to cry.

Difficulties in Chinese to English translation
Long Chinese sentences: 4 English sentences to 1 Chinese
Chinese: no pronouns or articles (English the, a)
Chinese has locative postpositions, English has prepositions
  Chinese: bed on, window outside
  English: on the bed, outside the window
Chinese rarely marks tense
  English: as, turned to, had begun
  Chinese tou ‘penetrate’ -> English penetrated
Chinese relative clauses come before the noun, English after
  Chinese: [window outside bamboo on] rain
  English: rain [on the bamboo outside the window]
Stylistic and cultural differences
  Chinese bamboo tip plantain leaf -> bamboos and plantains
  Chinese rain sound sigh drop -> insistent rustle of the rain
  Chinese ma ‘curtain’ -> curtains of her bed

Alignment in Machine Translation

Early MT History
1946: Booth and Weaver discuss MT in New York; idea of dictionary-based direct translation
1947: Warren Weaver suggests translation by computer
1949: Weaver memorandum
1952: all 18 MT researchers in the world meet at MIT
1954: IBM/Georgetown demo of Russian-English MT
Lots of labs take up MT

1949 Weaver memorandum
“There are certain invariant properties which are… common to all languages”
“When I look at an article in Russian, I say ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’”
“[If] one can see… N words on either side, then, if N is large enough, one can unambiguously decide the meaning of the central word.”
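Weaver's idea of looking at N words on either side of a word can be made concrete with a small sketch. The function name `context_window` and the choice of example sentence (taken from the Bar-Hillel slide later in this deck) are illustrative assumptions, not something the memorandum specifies.

```python
def context_window(tokens, index, n):
    """Return the (up to) n tokens on each side of tokens[index]."""
    left = tokens[max(0, index - n):index]
    right = tokens[index + 1:index + 1 + n]
    return left, right


# Hypothetical example: the context available for the ambiguous word "pen".
tokens = "the box was in the pen".split()
left, right = context_window(tokens, tokens.index("pen"), 2)
print(left, right)
```

With N = 2 the window holds only "in the" on the left and nothing on the right, which gives a feel for why the choice of N matters to the claim.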

The History of MT: Pessimism
1959/1960: Yehoshua Bar-Hillel, “Report on the state of MT in US and GB”
FAHQ (fully automatic high quality) MT is too hard because we would have to encode all of human knowledge
Instead we should work on computer tools for human translators

The claim that fully automatic high quality MT is impossible
Yehoshua Bar-Hillel, A Demonstration of the Nonfeasibility of Fully Automatic High Quality Translation
Little John was looking for his toy box. Finally he found it. The box was in the pen. John was very happy.
Pen 1: Enclosure for small children
Pen 2: Writing utensil
The sense intended in the story is Pen 1.

The box was in the pen.

The claim that fully automatic high quality MT is impossible
Yehoshua Bar-Hillel, 1960
“I now claim that no existing or imaginable program will enable an electronic computer to determine…”

The state of the art in MT

History of MT: Further Pessimism
The ALPAC report (1966)
Headed by John R. Pierce of Bell Labs
Conclusions:
  MT doesn’t work
    MT a failure: all current MT work had to be post-edited
    Intelligibility and informativeness worse than human translation
  We don’t need MT anyhow
    Already too many human translators from Russian
Results:
  MT research suffered
    Funding loss
    Number of research labs declined
    Association for Machine Translation and Computational Linguistics dropped MT from its name

MT in the modern age
Resurgence of MT in Europe and Japan
Domain-specific rule-based systems
1990-present: Rise of Statistical Machine Translation

Machine Translation Language Divergences

Language Similarities and Divergences
Typology: the study of systematic cross-linguistic similarities and differences
What are the dimensions along which human languages vary?

Syntactic Variation: Basic Word Orders
SVO (Subject-Verb-Object) languages: English, German, French, Mandarin
  I baked a pizza
SOV languages: Japanese, Hindi
  English: He adores listening to music
  Japanese: kare ha ongaku wo kiku no ga daisuki desu
  Gloss: he music to listening adores
VSO languages: Irish, Classical Arabic, Tagalog
In many languages one word order is more basic

Morphology
Morpheme: “Minimal meaningful unit of language”
Word = Morpheme + Morpheme + Morpheme + …
Stems (base form, root): hope+ing -> hoping; hop -> hopping
Affixes
  Prefixes: anti-, dis- in antidisestablishmentarianism
  Suffixes: -ment, -arian, -ism in antidisestablishmentarianism
  Infixes: hingi (borrow) -> humingi (borrower) in Tagalog
  Circumfixes: sagen (say) -> gesagt (said) in German
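The two stem examples, hope+ing -> hoping and hop -> hopping, involve spelling changes when the suffix attaches. The sketch below is a rough heuristic covering only those two patterns (final-e deletion and consonant doubling in three-letter consonant-vowel-consonant stems). The function name `add_ing` and the rules themselves are assumptions made for illustration, and plenty of English verbs fall outside them.

```python
VOWELS = set("aeiou")


def add_ing(stem):
    """Attach -ing to a verb stem using two simplified spelling rules."""
    # Final-e deletion: hope + ing -> hoping (but see + ing -> seeing).
    if len(stem) > 2 and stem.endswith("e") and not stem.endswith("ee"):
        return stem[:-1] + "ing"
    # Consonant doubling for three-letter consonant-vowel-consonant stems:
    # hop + ing -> hopping. Final w, x, y are never doubled.
    if (len(stem) == 3
            and stem[0] not in VOWELS
            and stem[1] in VOWELS
            and stem[2] not in VOWELS
            and stem[2] not in "wxy"):
        return stem + stem[-1] + "ing"
    # Default: plain concatenation, e.g. walk + ing -> walking.
    return stem + "ing"


for stem in ("hope", "hop", "walk"):
    print(stem, "->", add_ing(stem))
```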

Morphemes per Word
Scale from isolating (few morphemes per word) to synthetic (many morphemes per word)
  Vietnamese
  English: 1.68
  Yakut (Turkic): 2.17
  Swahili
  West Greenlandic (Eskimo-Inuit)
Joseph Greenberg. A Quantitative Approach to the Morphological Typology of Language. IJAL 26.

Few morphemes per word: Cantonese
“He said this was the biggest building in the whole country”
Each word in this sentence has one morpheme (and one syllable):
Cantonese: keui wa chyuhn gwok jeui daaih gaan nguk haih li gaan
Gloss: he say entire country most big bldg house is this bldg

Many morphemes per word: Turkish
uygarlaştıramadıklarımızdanmışsınızcasına
uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına
“Behaving as if you are among those whom we could not cause to become civilized”
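The morphemes-per-word ratio behind the isolating/synthetic scale can be computed directly from the Cantonese and Turkish examples on these slides. This is a minimal sketch that assumes morpheme boundaries are already marked with "+", as in the Turkish segmentation; the function name `morphemes_per_word` is hypothetical.

```python
def morphemes_per_word(segmented_words):
    """Average number of '+'-separated morphemes per word."""
    total = sum(len(word.split("+")) for word in segmented_words)
    return total / len(segmented_words)


# Cantonese sentence from the slide: every word is a single morpheme.
cantonese = "keui wa chyuhn gwok jeui daaih gaan nguk haih li gaan".split()

# Turkish example from the slide: one word, segmented into its morphemes.
turkish = ["uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına"]

print(morphemes_per_word(cantonese))  # 1.0
print(morphemes_per_word(turkish))    # 11.0
```

A single sentence is far too small a sample for a typological index, so these two numbers only illustrate the two ends of the scale.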

Word Segmentation
Are word boundaries marked in writing?
Some writing systems do not mark boundaries between words: Chinese, Japanese, Thai
  Word segmentation becomes an important part of text normalization for MT
Some languages tend to have sentences that are quite long, closer to English paragraphs than sentences: Modern Standard Arabic, Chinese
  Sentence segmentation may be necessary for MT between these languages and languages like English

Inferential Load: cold vs. hot languages
Hot languages: who did what to whom is marked explicitly
  English
Cold languages: the hearer has more “figuring out” to do of who the various actors in the various events are
  Japanese, Chinese
Balthasar Bickel. Referential density in discourse and syntactic typology. Language 79:2,

Inferential Load: the bracketed noun phrases (blue on the slide) are not in the Chinese original; ø marks where they are omitted
飓风丽塔已经减弱为第三级飓风,
  Rita weakened and was downgraded to a Category 3 storm;
ø 迫近美国德克萨斯州和路易斯安那州,
  [Rita/it/the storm] is moving close to Texas and Louisiana;
当局表示,
  the authorities announced;
虽然 ø 在登陆前可能再稍微减弱,
  although [Rita/it/the storm] might weaken again before landing,
但 ø 仍然会非常危险,
  [Rita/it/the storm] is still very dangerous;
ø 预料 ø 会在当地时间星期六凌晨在德州和路易斯安那州之间登陆,
  [the authorities] predict [Rita/it/the storm] will arrive at the Texas-Louisiana border on Saturday morning local time;
ø 直接吹袭休斯敦市东面的主要炼油设施。
  [Rita/it/the storm] will directly hit the oil-refining industry east of Houston.

Lexical Divergences
Word to phrases:
  English: computer science
  French: informatique
Part of Speech divergences:
  English: She likes to sing
  German: Sie singt gerne [She sings likefully]
  English: I’m hungry
  Spanish: Tengo hambre [I have hunger]

Lexical Specificity Divergences
Grammatical specificity
  Spanish: plural pronouns have gender (ellos/ellas)
  English: plural pronouns have no gender (they)
So when translating “they” from English to Spanish, we need to figure out the gender of the referent!

Lexical Divergences: Semantic Specificity
English brother -> Mandarin gege (older brother), didi (younger brother)
English wall -> German Wand (inside), Mauer (outside)
English fish -> Spanish pez (the creature), pescado (fish as food)
Cantonese ngau -> English cow, beef
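One way to picture these one-to-many mappings is as a lexicon lookup that cannot succeed on the source word alone and needs an extra feature supplied from context. The sketch below uses the word pairs from the slide; the data layout, the feature labels, and the function name `translate_specific` are illustrative assumptions.

```python
# (source word, contextual feature) -> more specific target word.
# Entries are the examples from the slide; feature labels are made up here.
LEXICON = {
    ("brother", "older"): "gege",      # Mandarin
    ("brother", "younger"): "didi",    # Mandarin
    ("wall", "inside"): "Wand",        # German
    ("wall", "outside"): "Mauer",      # German
    ("fish", "creature"): "pez",       # Spanish
    ("fish", "food"): "pescado",       # Spanish
}


def translate_specific(word, feature):
    """Pick the target word; fail loudly if the context feature is unknown."""
    try:
        return LEXICON[(word, feature)]
    except KeyError:
        raise ValueError(f"cannot choose a translation for {word!r} "
                         f"without a known feature (got {feature!r})")


print(translate_specific("fish", "food"))      # pescado
print(translate_specific("brother", "older"))  # gege
```

The hard part in practice is obtaining the feature, since the English source sentence often does not state it.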

Predicate Argument divergences
English: The bottle floated out.
Spanish: La botella salió flotando. [The bottle exited floating]
Satellite-framed languages: direction of motion is marked on the satellite
  Crawl out, float off, jump down, walk over to, run after
  Most of Indo-European, Hungarian, Finnish, Chinese
Verb-framed languages: direction of motion is marked on the verb
  Spanish, French, Arabic, Hebrew, Japanese, Tamil, Polynesian, Mayan, Bantu families
L. Talmy. Lexicalization patterns: Semantic Structure in Lexical Form.

Predicate Argument divergences: Heads and Argument swapping
Heads:
  English: X swim across Y
  Spanish: X cruzar Y nadando
  English: I like to eat
  German: Ich esse gern
  English: I’d prefer vanilla
  German: Mir wäre Vanille lieber
Arguments:
  Spanish: Y me gusta
  English: I like Y
  German: Der Termin fällt mir ein
  English: I remember the date (literally: the date occurs to me)
Dorr, Bonnie J., "Machine Translation Divergences: A Formal Description and Proposed Solution," Computational Linguistics, 20:4,

Predicate-Argument Divergence Counts
Found divergences in 32% of sentences in UN Spanish/English Corpus
Type                Spanish               English           %
Part of Speech      X tener hambre        Y have hunger     98%
Phrase/Light verb   X dar puñaladas a Z   X stab Z          83%
Structural          X entrar en Y         X enter Y         35%
Heads swap          X cruzar Y nadando    X swim across Y   8%
Arguments swap      X gustar a Y          Y likes X         6%
B. Dorr et al. DUSTer: A Method for Unraveling Cross-Language Divergences for Statistical Word-Level Alignment

Machine Translation Three classical methods for MT

3 Classical methods for MT
Direct
Transfer
Interlingua

Three MT Approaches: Direct, Transfer, Interlingual

Direct Translation
Proceed word-by-word through the text, translating each word
No intermediate structures except morphology
Knowledge is in the form of
  a huge bilingual dictionary
  word-to-word translation information
After word translation, can do simple reordering
  Adjective ordering English -> French/Spanish
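The direct approach described above can be sketched as a dictionary lookup followed by one local reordering pass. The three-entry English-to-Spanish lexicon, its part-of-speech tags, and the function name `direct_translate` are toy assumptions for illustration; in particular the lexicon hard-codes the feminine article, which a real system would have to derive from agreement.

```python
# Toy bilingual dictionary: English word -> (Spanish word, part of speech).
LEXICON = {
    "the": ("la", "DET"),
    "green": ("verde", "ADJ"),
    "witch": ("bruja", "NOUN"),
}


def direct_translate(sentence):
    """Word-by-word translation plus simple adjective-noun reordering."""
    # Step 1: translate each word independently via dictionary lookup.
    words = [LEXICON[w] for w in sentence.lower().split()]
    # Step 2: local reordering, English ADJ NOUN -> Spanish NOUN ADJ.
    i = 0
    while i < len(words) - 1:
        if words[i][1] == "ADJ" and words[i + 1][1] == "NOUN":
            words[i], words[i + 1] = words[i + 1], words[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return " ".join(word for word, _ in words)


print(direct_translate("the green witch"))  # la bruja verde
```

No syntactic structure is built at any point, which is what separates this from the transfer approach.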

Direct MT: Dictionary entry

Direct MT

Problems with direct MT: German, Chinese

The Transfer Model
Idea: apply contrastive knowledge, i.e., knowledge about the differences between two languages
Steps:
  Analysis: syntactically parse the source language
  Transfer: rules to turn this parse into a parse for the target language
  Generation: generate the target sentence from the parse tree

English to French
English: Adjective Noun
French: Noun Adjective
This is not always true
  Route mauvaise ‘bad road, badly-paved road’
  Mauvaise route ‘wrong road’
But it is a reasonable first approximation
Rule:
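The Adjective Noun -> Noun Adjective approximation can be sketched as a transfer rule applied to a parse tree, followed by generation from the transformed tree. The analysis step is skipped here by writing the English parse by hand. The tuple-based tree format, the tiny English-to-French lexicon, and the function names `transfer` and `generate` are all assumptions made for this illustration.

```python
# A tree is (label, children); a leaf is (part_of_speech, word).
ENGLISH_PARSE = ("NP", [("DET", "the"), ("ADJ", "green"), ("NOUN", "witch")])

# Toy lexicon; the feminine forms are hard-coded rather than derived.
FRENCH = {"the": "la", "green": "verte", "witch": "sorcière"}


def transfer(tree):
    """Rewrite every ADJ NOUN sibling pair as NOUN ADJ, recursively."""
    label, children = tree
    if isinstance(children, str):  # leaf node: nothing to reorder
        return tree
    kids = [transfer(child) for child in children]
    i = 0
    while i < len(kids) - 1:
        if kids[i][0] == "ADJ" and kids[i + 1][0] == "NOUN":
            kids[i], kids[i + 1] = kids[i + 1], kids[i]
            i += 2
        else:
            i += 1
    return (label, kids)


def generate(tree, lexicon):
    """Read the target words off the leaves of the transferred tree."""
    label, children = tree
    if isinstance(children, str):
        return [lexicon[children]]
    words = []
    for child in children:
        words.extend(generate(child, lexicon))
    return words


print(" ".join(generate(transfer(ENGLISH_PARSE), FRENCH)))  # la sorcière verte
```

Because the rule fires on tree nodes and not on a flat word string, it applies inside any noun phrase the parser finds. As the mauvaise examples show, a blanket rule like this still gets some adjectives wrong.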

Transfer rules

Transferring the green witch…

Interlingua
Instead of N² sets of transfer rules, use meaning as a representation language
  1. Parse the source sentence into a meaning representation
  2. Generate the target sentence from the meaning
Intuition: use other NLP applications to do the MT work
  English book to Spanish: libro or reservar
  Disambiguate book into concepts BOOKVOLUME and RESERVE
Need 2N systems (a parser and a generator for each language)
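The scaling argument can be checked with a few lines of arithmetic. The sketch below counts one transfer rule set per ordered language pair, which gives N(N-1) and grows on the order of the N² quoted on the slide, against one parser plus one generator per language for the interlingua. The function names are hypothetical.

```python
def transfer_rule_sets(n_languages):
    """One rule set per ordered (source, target) pair of distinct languages."""
    return n_languages * (n_languages - 1)


def interlingua_components(n_languages):
    """One parser and one generator per language."""
    return 2 * n_languages


for n in (2, 5, 10, 20):
    print(n, transfer_rule_sets(n), interlingua_components(n))
```

For 10 languages this gives 90 transfer rule sets against 20 interlingua components, and the gap widens as N grows.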

Interlingua for Mary did not slap the green witch