CS460/626: Natural Language Processing/Speech, NLP and the Web (Lecture 15–Language Divergence) Pushpak Bhattacharyya, CSE Dept., IIT Bombay, 8th Feb, 2011

Presentation transcript:

CS460/626: Natural Language Processing/Speech, NLP and the Web
(Lecture 15–Language Divergence)
Pushpak Bhattacharyya
CSE Dept., IIT Bombay
8th Feb, 2011

Key difference between Statistical/ML-based NLP and Knowledge-based/linguistics-based NLP
- Stat NLP: speed and robustness are the main concerns
- KB NLP: phenomena based
- Example: boys, toys, toes
  - To get the root, remove "s"
  - How about foxes, boxes, ladies?
- Understand phenomena: go deeper
  - Slower processing
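The boys/foxes/ladies contrast can be made concrete with a small sketch. Both functions below are illustrative only: the names `naive_root` and `rule_root` and the particular suffix rules are assumptions made for this example, not something given in the lecture, and the rule set is far from a complete English stemmer.

```python
def naive_root(word: str) -> str:
    """Surface rule from the slide: just remove a trailing 's'."""
    return word[:-1] if word.endswith("s") else word


def rule_root(word: str) -> str:
    """A slightly deeper, phenomenon-aware rule set (hypothetical, incomplete)."""
    # ladies -> lady: plural of a consonant + 'y' noun
    if word.endswith("ies") and len(word) > 3:
        return word[:-3] + "y"
    # foxes -> fox, boxes -> box: 'es' added after a sibilant
    if word.endswith(("xes", "ses", "zes", "ches", "shes")):
        return word[:-2]
    # boys -> boy, toes -> toe: plain 's' plural
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


words = ["boys", "toys", "toes", "foxes", "boxes", "ladies"]
for w in words:
    print(w, naive_root(w), rule_root(w))
```

The naive rule is fast but produces non-words such as "foxe" and "ladie". The rule-based version handles these six words, yet still fails on others (for example "ties"), which is the sense in which going deeper costs more rules and more processing.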

Perspective on Statistical MT
- What is a good translation?
  - Faithful to source
  - Fluent in target
(Figure labels: fluency, faithfulness)

Word-alignment example
English: Ram(1) has(2) an(3) apple(4)
Hindi: राम(1) के(2) पास(3) एक(4) सेब(5) है(6)
Gloss: Ram(1) of(2) near(3) an(4) apple(5) is(6)
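A word alignment is commonly stored as a set of (source position, target position) pairs. The sketch below shows that representation for this sentence pair. The alignment links drawn on the slide did not survive in the transcript, so the links used here are one plausible reading based on the gloss and should be treated as an assumption, as should the helper name `targets_for`.

```python
english = ["Ram", "has", "an", "apple"]
hindi = ["राम", "के", "पास", "एक", "सेब", "है"]

# 1-based (English position, Hindi position) links.
# Assumed reading: "has" maps to the discontinuous "के पास ... है".
alignment = [(1, 1), (2, 2), (2, 3), (2, 6), (3, 4), (4, 5)]


def targets_for(src_pos: int) -> list[str]:
    """Return the Hindi words linked to the English word at src_pos."""
    return [hindi[t - 1] for s, t in alignment if s == src_pos]


for i, word in enumerate(english, start=1):
    print(word, "->", " ".join(targets_for(i)))
```

Under this reading, one English word aligns to three non-adjacent Hindi words, which is the kind of one-to-many link that makes alignment harder than word-for-word substitution.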

Kinds of MT Systems (point of entry from source to the target text)

Why is MT difficult? Classical NLP problems
- Ambiguity
  - Lexical: Went to the bank to withdraw money
  - Structural: Saw the boy with a telescope
- Ellipsis: I wanted a book and John a pen
- Co-reference
  - Anaphora: John said he likes music
  - Hypernymic: John's house is a robust structure

Why is MT difficult? Language Divergence
- Lexico-Semantic Divergence
- Structural Divergence

Language Divergence (English-Hindi: Noun to Adjective)
E: The demands on sportsmen today can lead to burnout at an early age.
(burnout, noun: the state of being extremely tired or ill, either physically or mentally, because you have worked too hard)
H: खिलाड़ियों से जो आज अपेक्षाएं हैं, वे उन्हें कम उम्र में अक्रियाशील कर सकती हैं
Gloss: Sportsmen-from, which today demands exist, that (correlative) them early age in inactive do can (aspectual) V-AUX.

Language Divergence (English-Hindi: Noun to Verb)
E: Every concert they gave us was a sell-out.
(sell-out: an event for which all the tickets have been sold)
H: उनके हर संगीत-कार्यक्रम के सभी टिकट बिक गए थे।
Gloss: Their every concert-of all ticket sell-past-passive-plural (were sold out).

Language Divergence (English-Hindi: Adjective to Adverb)
E: The children were watching in wide-eyed amazement.
(wide-eyed: with eyes fully open because of fear, great surprise, etc.)
H: बच्चे आश्चर्य से आँखें फाड़े देख रहे थे।
Gloss: Children amazement-with eyes opening widely seeing were.

Language Divergence (English-Hindi: Adjective to Verb)
E: He was in a bad mood at breakfast and wasn't very communicative.
(communicative: able and willing to talk and give information to other people)
H: नाश्ते के समय वह खराब मूड में था और ज्यादा बात-चीत नहीं कर रहा था।
Gloss: Breakfast-of time he bad mood-in was and much conversation not do-past-progressive-sing (was doing).

Language Divergence (English-Hindi: Preposition to Adverb)
E: It gets cooler toward evening.
(toward: near a point in time)
H: शाम होते-होते ठंडक बढ़ जाती है।
Gloss: Evening happening-happening (reduplication; typical Indian language phenomenon) cold increase-goes (verb compound; polar vector).

Language Divergence (English-Hindi: idiomatic usage)
E: Given her interest in children, teaching seems the right job for her.
(given: when you consider something)
H: बच्चों के प्रति (में) उसकी दिलचस्पी देखते हुए, अध्यापन उसके लिए उचित लगता है।
Gloss: Children-towards her interest having seen, teaching for her appropriate seems.

Language Divergence is ubiquitous (Marathi-Hindi-English: case marking and postpositions transfer: works!)
Not only for languages from distant families, but also within close cousins
प्रथम ताख्यात वर्तमान (simple present)
M: तो जातो.
H: वह जाता है।
E: He goes.
स्थिरसत्य (universal truth)
M: पृथ्वी सूर्याभोवती फिरते.
H: पृथ्वी सूर्य के चारों ओर घूमती है।
E: The earth revolves round the sun.

Language Divergence (Marathi-Hindi-English: case marking and postpositions: works again!)
ऐतिहासिक सत्य (historical truth)
M: कृष्ण अर्जुनास सांगतो...
H: कृष्ण अर्जुन से कहते हैं...
E: Krushna says to Arjuna…
अवतरण (quoting)
M: दामले म्हणतात,...
H: दामले कहते हैं,...
E: Damle says,...

Language Divergence (Marathi-Hindi-English: case marking and postpositions: does not work!)
संनिहित भूत (immediate past)
M: कधी आलास? हा येतो इतकाच!
H: कब आये? बस अभी आया।
E: When did you come? Just now (I came).
निःसंशय भविष्य (certainty in future)
M: आता तो मार खातो खास!
H: अब वह मार खायगा ही!
E: He is in for a thrashing.
आश्वासन (assurance)
M: मी तुम्हाला उद्या भेटतो.
H: मैं आप से कल मिलता हूँ।
E: I will see you tomorrow.

Language Divergence Theory: Lexico-Semantic Divergences (ref: Dave, Parikh, Bhattacharyya, Journal of MT, 2002)
- Conflational divergence
  - E: stab; H: churaa se maaranaa (knife-with hit)
  - S: Utrymningsplan; E: escape plan
- Structural divergence
  - E: SVO; H: SOV
- Categorial divergence
  - Change is in POS category (many examples discussed)
- Head swapping divergence
  - E: Prime Minister of India; H: bhaarat ke pradhaan mantrii (India-of Prime Minister)
- Lexical divergence
  - E: advise; H: paraamarsh denaa (advice give): noun incorporation, a very common Indian language phenomenon

Language Divergence Theory: Syntactic Divergences
- Constituent Order divergence
  - E: Singh, the PM of India, will address the nation today; H: bhaarat ke pradhaan mantrii, singh, … (India-of PM, Singh…)
- Adjunction Divergence
  - E: She will visit here in the summer; H: vah yahaa garmii meM aayegii (she here summer-in will come)
- Preposition-Stranding divergence
  - E: Who do you want to go with?; H: kisake saath aap jaanaa chaahate ho? (who with…)
- Null Subject Divergence
  - E: I will go; H: jaauMgaa (subject dropped)
- Pleonastic Divergence
  - E: It is raining; H: baarish ho rahii hai (rain happening is: no translation of "it")

Entropy considerations
Work of Chirag and Venkatesh, ongoing

Language Typology

Parallel Corpora
1. English: Jaipur, popularly known as the Pink City, is the capital of Rajasthan state, India.
   Hindi: जयपुर, भारत वर्ष के राजस्थान राज्य की राजधानी, गुलाबी नगर के नाम से लोकप्रिय है ।
   Marathi: जयपुर, हे शहर पिंक सिटी या नावाने सुप्रसिध्द असून, ते भारतातील राजस्थान राज्याची राजधानी आहे.
2. English: Until the war of 1982, the rainy, windswept Falkland Islands were a forgotten remnant of the old British Empire
   Hindi: के युद्ध तक बरसाती हवा से बहे हुए फाल्कलैण्ड द्वीप पुराने ब्रिटिश अम्पायर के भूले हुए अवशेष हैं ।
   Marathi: १९८२ च्या युद्धापर्यंत, पावसाळी आणि वादळी फोल्कलॅंड़ बेटे ही जुन्या ब्रिटिश साम्राज्याचे विस्मृतीत गेलेले भाग होते.
3. English: Spanish rule was administered from a distance, leaving the various regions to develop separately from the capital, Caracas, which was founded by Diego de Losada in
   Hindi: स्पेनी प्रशासन दूर ही से चलता रहा, और राजधानी कारकस, जिसकी स्थापना 1567 में डीगो डे लारेसाडा द्वारा की गई थी, से विभिन्न प्रांतों को अलग से विकसित होने के लिए छोड़ दिया ।
   Marathi: डिएगो दे लोसादा ने १५६७ मध्ये स्थापित केलेल्या कॅराकस राजधानीशी संबंध न ठेवता अनेक प्रांतांना स्वतंत्रपणे विकसित होऊ देत, स्पेनी कारभार तटस्थपणे चालवला.

Phrase Table Entries
Hindi-English Phrase Table Entries
  प्रस्तुत ||| a ||| 0.1
  प्रस्तुत ||| afford ||| 0.1
  प्रस्तुत ||| offer ||| 0.5
  प्रस्तुत ||| offers ||| 0.3
  Contribution to entropy =
Hindi-Marathi Phrase Table Entries
  प्रस्तुत ||| अधिक असे देऊ ||| 0.05
  प्रस्तुत ||| उपलब्ध ||| 0.2
  प्रस्तुत ||| काहींचे ||| 0.05
  प्रस्तुत ||| देऊ ||| 0.6
  प्रस्तुत ||| सादर ||| 0.1
  Contribution to entropy = 0.503
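The per-phrase contribution can be checked with a short calculation. The sketch below applies the standard Shannon entropy formula to the translation probabilities listed for प्रस्तुत. The logarithm base is not stated on the slide; base 10 is an assumption chosen because it reproduces the 0.503 shown for the Hindi-Marathi entries. The function name `entropy` and the dictionary names are likewise illustrative.

```python
import math

# Translation probabilities for the source phrase प्रस्तुत, copied from the slide.
hindi_english = {"a": 0.1, "afford": 0.1, "offer": 0.5, "offers": 0.3}
hindi_marathi = {
    "अधिक असे देऊ": 0.05,
    "उपलब्ध": 0.2,
    "काहींचे": 0.05,
    "देऊ": 0.6,
    "सादर": 0.1,
}


def entropy(probs, base: float = 10.0) -> float:
    """Shannon entropy -sum(p * log p) of a distribution (base 10 assumed)."""
    return -sum(p * math.log(p) / math.log(base) for p in probs if p > 0)


print(round(entropy(hindi_marathi.values()), 3))  # compare with the slide's 0.503
print(round(entropy(hindi_english.values()), 3))
```

With base-10 logarithms the Hindi-Marathi figure agrees with the slide to three decimals. The Hindi-English figure is only what this sketch computes (roughly 0.507) and is not a value recovered from the slide. Note also that this is the unweighted entropy of one source phrase; weighting by the probability of the source phrase, as described for the whole table, is not modelled here.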

Entropy Evaluation
- The phrase table gives a probability distribution over the possible translations for each source phrase.
- We use the probability of the source phrase itself to get a distribution for the entire phrase table.
- Entropy is evaluated as per the standard formula
- Hindi-Marathi Phrase Table Entropy:
- Hindi-English Phrase Table Entropy: 9.770

Handling Divergence through Indicative Translation (Microsoft Techvista Award, Ananthakrishnan 2007)

Indicative Translation – what and why?
- Native speaker acceptable translation not possible, especially considering English-Hindi (Indian languages) divergence
- Compromises
  - human-aided translation (post-editing)
  - narrow domain (weather reports)
  - rough translation → Indicative MT
- Goal: understandable rather than perfect output
- Purpose: assimilation rather than dissemination (translation on the web)

Divergence between English and Hindi
- Divergence: differences in lexical and syntactic choices that languages make in expressing ideas
- MaTra:
  - Structural transfer
    - SVO to SOV
    - post-modifiers to pre-modifiers
  - Lexical transfer: WSD + lexicon lookup
  - inflections
  - case-markers
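The two structural transfer steps named above can be illustrated on pre-segmented input. This is a toy sketch, not MaTra's implementation: it assumes the constituents have already been identified by some parser, and the function names `svo_to_sov` and `post_to_pre` are hypothetical. The examples reuse phrases from earlier slides ("Ram has an apple", "Prime Minister of India").

```python
def svo_to_sov(subject: list[str], verb: list[str], obj: list[str]) -> list[str]:
    """Reorder an English-style clause (S V O) into Hindi-style order (S O V)."""
    return subject + obj + verb


def post_to_pre(head: list[str], marker: str, modifier: list[str]) -> list[str]:
    """Move a post-modifier ('head of modifier') in front of its head
    ('modifier-of head'), keeping the marker next to the modifier."""
    return modifier + [marker] + head


print(svo_to_sov(["Ram"], ["has"], ["an", "apple"]))
print(post_to_pre(["Prime", "Minister"], "of", ["India"]))
```

The outputs mirror the glosses used in the lecture ("India-of Prime Minister"). Lexical transfer, inflection and case-marker generation would follow as separate steps and are not sketched here.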

Divergence between Natural and Indicative Hindi: some examples
E: We eat the rotten canteen food every night.
H: हम हर रात कैन्टीन का सड़ा हुआ खाना खाते हैं
I: हम हर रात सड़ा हुआ कैन्टीन खाना खाते हैं
E: The batsman who had been scoring heavily against them has to be removed early.
H: जो बल्लेबाज़ उनके विरुद्ध ज़ोरदार स्कोर कर रहा था उसे जल्दी निकालना होगा
I: बल्लेबाज़, जो उनके विरुद्ध ज़ोरदार स्कोर कर रहा था, जल्दी निकालना होगा

Categorial divergence
E: I am feeling hungry
H: मुझे भूख लग रही है
I: मैं भूखा महसूस कर रहा हूँ
n-gram matches: unigrams: 0/6; bigrams: 0/5; trigrams: 0/4; 4-grams: 0/3

Relation between words in noun-noun compounds
E: The ten best Aamir Khan performances
H: आमिर ख़ान की दस सर्वोत्तम पर्फ़ार्मन्सस
I: दस सर्वोत्तम आमिर ख़ान पर्फ़ार्मन्सस
n-gram matches: unigrams: 5/5; bigrams: 2/4; trigrams: 0/3; 4-grams: 0/2

Lexical divergence
E: Food, clothing and shelter are a man's basic needs.
H: रोटी, कपड़ा और मकान एक मनुष्य की बुनियादी ज़रूरतें हैं
I: खाना, कपड़ा, और आश्रय एक मनुष्य की बुनियादी ज़रूरतें हैं
n-gram matches: unigrams: 8/10; bigrams: 6/9; trigrams: 4/8; 4-grams: 3/7

Pleonastic Divergence
E: It is raining
H: बारिश हो रही है
I: यह बारिश हो रही है
n-gram matches: unigrams: 4/5; bigrams: 3/4; trigrams: 2/3; 4-grams: 1/2
E: There was a great king
H: एक महान राजा था
I: वहाँ एक महान राजा था

Stylistic differences
E: The Lok Sabha has 545 members.
H: लोक सभा में ५४५ सदस्य हैं
I: लोक सभा के पास ५४५ सदस्य हैं
n-gram matches: unigrams: 5/7; bigrams: 3/6; trigrams: 1/5; 4-grams: 0/4
Other differences: word order, sentence length

Transliteration and WSD errors
E: I purchased a bat.
H: मैने एक बल्ला खरीदा
I: मैने एक बैट खरीदा
I: मैने एक चमगादड़ खरीदा
n-gram matches: unigrams: 3/4; bigrams: 1/3; trigrams: 0/2; 4-grams: 0/1

Divergence/problem      Average BLEU precision   Translation acceptable?
Categorial              0                        Yes
Noun-noun compounds     0.38                     Yes
Lexical                 0.6                      Yes
Transliteration         0.27                     Yes
Pleonastic              0.68                     No
Stylistic               0.35                     No
WSD error               0.27                     No
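The n-gram match counts on the preceding slides, and the averages in this table, can be reproduced with clipped n-gram precision. The sketch below is a minimal version under stated assumptions: tokens are whitespace-separated after Unicode normalization and comma removal, the "average" is taken to be the plain arithmetic mean of the 1-gram to 4-gram precisions (which matches the table's figures), and the function names are illustrative. It is not full BLEU, which uses a geometric mean and a brevity penalty.

```python
import unicodedata
from collections import Counter


def tokenize(sentence: str) -> list[str]:
    """Normalize to NFC, drop commas, and split on whitespace."""
    return unicodedata.normalize("NFC", sentence).replace(",", " ").split()


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(candidate: list[str], reference: list[str], n: int) -> tuple[int, int]:
    """Return (clipped matches, candidate n-gram count) for order n."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched, sum(cand.values())


def average_precision(candidate: list[str], reference: list[str], max_n: int = 4) -> float:
    """Arithmetic mean of the 1..max_n gram precisions (an assumed reading)."""
    scores = []
    for n in range(1, max_n + 1):
        matched, total = ngram_precision(candidate, reference, n)
        scores.append(matched / total if total else 0.0)
    return sum(scores) / len(scores)


# Noun-noun compound example: H is the natural (reference) Hindi,
# I is the indicative (candidate) Hindi.
natural = tokenize("आमिर ख़ान की दस सर्वोत्तम पर्फ़ार्मन्सस")
indicative = tokenize("दस सर्वोत्तम आमिर ख़ान पर्फ़ार्मन्सस")

counts = [ngram_precision(indicative, natural, n) for n in range(1, 5)]
print(counts)
print(average_precision(indicative, natural))
```

For this example the counts come out as 5/5, 2/4, 0/3 and 0/2, and their mean is 0.375, which the table reports as 0.38. The table's point is that this score does not track acceptability: the categorial example scores 0 yet is acceptable, while the pleonastic example scores 0.68 and is not.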

Advantages of a hybrid Rule-based + SMT system
- What SMT brings to the table
  - If data is available, then no need for linguistic resources
  - Quick adaptation to
    - new domains (tourism, health)
    - new language pairs (English-Gujarati/Marathi)
  - See improvements by adding data
- What rule-based systems bring to the table
  - Capture a small set of systematic differences well
    - SVO → SOV (do we need to learn this?)
  - Better handle on correcting specific cases

Preprocessing rules + SMT for English-Indian language MT
- Lack of linguistic resources for Indian languages
- Lots of resources available for English
- Morphology is rich for Indian languages
- Wider systematic syntactic differences between English and Indian languages

Placed within the Vauquois Triangle

Previous work on factored MT

Previous work
- {ney:04} show that the use of morpho-syntactic information drastically reduces the need for bilingual training data
- {ney:06} report the use of morphological and syntactic restructuring information for Spanish-English and Serbian-English translation

Previous work (contd)
- Koehn and Hoang {koehn:07} propose factored translation models that combine feature functions to handle syntactic, morphological, and other linguistic information in a log-linear model
- Experiments in translating from English to German, Spanish, and Czech, including the use of morphological factors
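For reference, a log-linear translation model in its usual textbook form combines feature functions as shown below. The notation is an assumption introduced here, not taken from the slide: $f$ is the source sentence, $e$ a candidate translation, $h_i$ the feature functions (which in a factored model may look at factors such as lemma, part of speech, or morphology), and $\lambda_i$ their weights.

```latex
p(e \mid f) \;=\;
\frac{\exp\!\Big(\sum_{i=1}^{n} \lambda_i \, h_i(e, f)\Big)}
     {\sum_{e'} \exp\!\Big(\sum_{i=1}^{n} \lambda_i \, h_i(e', f)\Big)},
\qquad
\hat{e} \;=\; \arg\max_{e} \; \sum_{i=1}^{n} \lambda_i \, h_i(e, f)
```

Because the denominator does not depend on $e$, decoding only needs the weighted sum in the second expression.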

Previous work (contd)
- Avramidis and Koehn {koehn:08} report work on translating from poor to rich morphology, namely, English to Greek and Czech translation
- Factored models with case and verb conjugation related factors determined by heuristics on parse trees
- Used only on the source side, and not on the target side

Previous work (contd)
- Melamed {melamed:04} proposes methods based on tree-to-tree mappings
- Imamura et al. {imamura:05} present a similar method that achieves significant improvements over a phrase-based baseline model for Japanese-English translation

Previous work (contd)
Target language does not have parsing/clause-detecting tools
- Niessen and Ney {ney:04}: reorder the source language data prior to the SMT training and decoding cycles (German-English SMT)
- Popovic and Ney {ney:06}: simple local transformation rules for Spanish-English and Serbian-English translation
- Collins et al. {collins:05}: German clause restructuring to improve German-English SMT
- Wang et al. {wang:07}: similar work for Chinese-English SMT
- Ananthakrishnan and Bhattacharyya {anand:08}: syntactic reordering and morphological suffix separation for English-Hindi SMT