CS460/626: Natural Language Processing/Speech, NLP and the Web
  (Lecture 15: Language Divergence)
  Pushpak Bhattacharyya, CSE Dept., IIT Bombay
  8th Feb, 2011
Key difference between Statistical/ML-based NLP and Knowledge-based/linguistics-based NLP
  Stat NLP: speed and robustness are the main concerns
  KB NLP: phenomena based
  Example: boys, toys, toes
    To get the root, remove "s"
    How about foxes, boxes, ladies?
  Understand phenomena: go deeper
  Slower processing
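The example on this slide can be made concrete with a small sketch. The function names and the handful of suffix rules below are illustrative assumptions, not part of the lecture: `naive_root` applies the "remove s" heuristic, while `rule_based_root` encodes a few English plural patterns of the kind a phenomenon-aware system has to know about. The rules are deliberately incomplete.

```python
def naive_root(word):
    """Strip a single trailing 's' (the 'remove s' heuristic)."""
    return word[:-1] if word.endswith("s") else word


def rule_based_root(word):
    """Handle a few English plural patterns (illustrative, not exhaustive)."""
    if word.endswith("ies") and len(word) > 3:
        return word[:-3] + "y"          # ladies -> lady
    for suffix in ("xes", "ches", "shes"):
        if word.endswith(suffix):
            return word[:-2]            # foxes -> fox, boxes -> box
    return naive_root(word)             # boys -> boy, toes -> toe


for w in ["boys", "toys", "toes", "foxes", "boxes", "ladies"]:
    print(w, naive_root(w), rule_based_root(w))
```

The naive heuristic is right for boys, toys and toes but yields "foxe", "boxe" and "ladie"; each extra rule buys correctness at the cost of more analysis per word, which is the trade-off the slide points to.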
Perspective on Statistical MT
  What is a good translation?
    Faithful to source
    Fluent in target
  [Figure: fluency and faithfulness]
Word-alignment example
  English: Ram(1) has(2) an(3) apple(4)
  Hindi:   राम(1) के(2) पास(3) एक(4) सेब(5) है(6)
  Gloss:   Ram of near an apple is
Kinds of MT Systems (point of entry from source to the target text)
Why is MT difficult? Classical NLP problems
  Ambiguity
    Lexical: Went to the bank to withdraw money
    Structural: Saw the boy with a telescope
  Ellipsis: I wanted a book and John a pen
  Co-reference
    Anaphora: John said he likes music
    Hypernymic: John's house is a robust structure
Why is MT difficult? Language Divergence
  Lexico-Semantic Divergence
  Structural Divergence
Language Divergence (English to Hindi: Noun to Adjective)
  E: The demands on sportsmen today can lead to burnout at an early age.
     (noun: the state of being extremely tired or ill, either physically or mentally, because you have worked too hard)
  H: खिलाड़ियों से जो आज अपेक्षाएं हैं, वे उन्हें कम उम्र में अक्रियाशील कर सकती हैं
  Gloss: Sportsmen-from, which today demands exist, that (correlative) them early age in inactive do can (aspectual) V-AUX.
Language Divergence (English to Hindi: Noun to Verb)
  E: Every concert they gave us was a sell-out.
     (an event for which all the tickets have been sold)
  H: उनके हर संगीत-कार्यक्रम के सभी टिकट बिक गए थे।
  Gloss: Their every concert-of all ticket sell-past-passive-plural (were sold out).
Language Divergence (English to Hindi: Adjective to Adverb)
  E: The children were watching in wide-eyed amazement.
     (with eyes fully open because of fear, great surprise, etc.)
  H: बच्चे आश्चर्य से आँखें फाड़े देख रहे थे।
  Gloss: Children amazement-with eyes opening widely seeing were.
Language Divergence (English to Hindi: Adjective to Verb)
  E: He was in a bad mood at breakfast and wasn't very communicative.
     (able and willing to talk and give information to other people)
  H: नाश्ते के समय वह खराब मूड में था और ज्यादा बात-चीत नहीं कर रहा था।
  Gloss: Breakfast-of time he bad mood-in was and much conversation not do-past-progressive-sing (was doing).
Language Divergence (English to Hindi: Preposition to Adverb)
  E: It gets cooler toward evening.
     (near a point in time)
  H: शाम होते-होते ठंडक बढ़ जाती है।
  Gloss: Evening happening-happening (reduplication; typical Indian language phenomenon) cold increase-goes (verb compound; polar vector).
Language Divergence (English to Hindi: idiomatic usage)
  E: Given her interest in children, teaching seems the right job for her.
     (when you consider something)
  H: बच्चों के प्रति (में) उसकी दिलचस्पी देखते हुए, अध्यापन उसके लिए उचित लगता है।
  Gloss: Children-towards her interest having seen, teaching for her appropriate seems.
Language Divergence is ubiquitous (Marathi-Hindi-English: case marking and postpositions transfer: works!)
  Not only for languages from distant families, but also within close cousins
  प्रथम ताख्यात वर्तमान (simple present)
    M: तो जातो.
    H: वह जाता है।
    E: He goes.
  स्थिरसत्य (universal truth)
    M: पृथ्वी सूर्याभोवती फिरते.
    H: पृथ्वी सूर्य के चारों ओर घूमती है।
    E: The earth revolves round the sun.
Language Divergence (Marathi-Hindi-English: case marking and postpositions: works again!)
  ऐतिहासिक सत्य (historical truth)
    M: कृष्ण अर्जुनास सांगतो...
    H: कृष्ण अर्जुन से कहते हैं...
    E: Krushna says to Arjuna…
  अवतरण (quoting)
    M: दामले म्हणतात,...
    H: दामले कहते हैं,...
    E: Damle says,...
Language Divergence (Marathi-Hindi-English: case marking and postpositions: does not work!)
  संनिहित भूत (immediate past)
    M: कधी आलास? हा येतो इतकाच!
    H: कब आये? बस अभी आया।
    E: When did you come? Just now (I came).
  निःसंशय भविष्य (certainty in future)
    M: आता तो मार खातो खास!
    H: अब वह मार खायगा ही!
    E: He is in for a thrashing.
  आश्वासन (assurance)
    M: मी तुम्हाला उद्या भेटतो.
    H: मैं आप से कल मिलता हूँ।
    E: I will see you tomorrow.
Language Divergence Theory: Lexico-Semantic Divergences (ref: Dave, Parikh, Bhattacharyya, Journal of MT, 2002)
  Conflational divergence
    E: stab; H: churaa se maaranaa (knife-with hit)
    S: Utrymningsplan; E: escape plan
  Structural divergence
    E: SVO; H: SOV
  Categorial divergence
    Change is in POS category (many examples discussed)
  Head swapping divergence
    E: Prime Minister of India; H: bhaarat ke pradhaan mantrii (India-of Prime Minister)
  Lexical divergence
    E: advise; H: paraamarsh denaa (advice give): noun incorporation, a very common Indian language phenomenon
Language Divergence Theory: Syntactic Divergences
  Constituent order divergence
    E: Singh, the PM of India, will address the nation today; H: bhaarat ke pradhaan mantrii, singh, … (India-of PM, Singh…)
  Adjunction divergence
    E: She will visit here in the summer; H: vah yahaa garmii meM aayegii (she here summer-in will come)
  Preposition-stranding divergence
    E: Who do you want to go with?; H: kisake saath aap jaanaa chaahate ho? (who with…)
  Null subject divergence
    E: I will go; H: jaauMgaa (subject dropped)
  Pleonastic divergence
    E: It is raining; H: baarish ho rahii hai (rain happening is: no translation of "it")
Entropy considerations Work of Chirag and Venkatesh, ongoing
Language Typology
Parallel Corpora (English, Hindi, Marathi)
  E: Jaipur, popularly known as the Pink City, is the capital of Rajasthan state, India.
  H: जयपुर, भारत वर्ष के राजस्थान राज्य की राजधानी, गुलाबी नगर के नाम से लोकप्रिय है।
  M: जयपुर, हे शहर पिंक सिटी या नावाने सुप्रसिध्द असून, ते भारतातील राजस्थान राज्याची राजधानी आहे.
  E: Until the war of 1982, the rainy, windswept Falkland Islands were a forgotten remnant of the old British Empire
  H: के युद्ध तक बरसाती हवा से बहे हुए फाल्कलैण्ड द्वीप पुराने ब्रिटिश अम्पायर के भूले हुए अवशेष हैं।
  M: १९८२ च्या युद्धापर्यंत, पावसाळी आणि वादळी फोल्कलॅंड़ बेटे ही जुन्या ब्रिटिश साम्राज्याचे विस्मृतीत गेलेले भाग होते.
  E: Spanish rule was administered from a distance, leaving the various regions to develop separately from the capital, Caracas, which was founded by Diego de Losada in
  H: स्पेनी प्रशासन दूर ही से चलता रहा, और राजधानी कारकस, जिसकी स्थापना 1567 में डीगो डे लारेसाडा द्वारा की गई थी, से विभिन्न प्रांतों को अलग से विकसित होने के लिए छोड़ दिया।
  M: डिएगो दे लोसादा ने १५६७ मध्ये स्थापित केलेल्या कॅराकस राजधानीशी संबंध न ठेवता अनेक प्रांतांना स्वतंत्रपणे विकसित होऊ देत, स्पेनी कारभार तटस्थपणे चालवला.
Phrase Table Entries
  Hindi-English Phrase Table Entries
    प्रस्तुत ||| a ||| 0.1
    प्रस्तुत ||| afford ||| 0.1
    प्रस्तुत ||| offer ||| 0.5
    प्रस्तुत ||| offers ||| 0.3
    Contribution to entropy =
  Hindi-Marathi Phrase Table Entries
    प्रस्तुत ||| अधिक असे देऊ ||| 0.05
    प्रस्तुत ||| उपलब्ध ||| 0.2
    प्रस्तुत ||| काहींचे ||| 0.05
    प्रस्तुत ||| देऊ ||| 0.6
    प्रस्तुत ||| सादर ||| 0.1
    Contribution to entropy = 0.503
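The per-phrase figure can be checked with a short sketch. It assumes, as an inference from the numbers rather than something the slide states, that "contribution to entropy" is the Shannon entropy of the translation distribution for one source phrase, taken with base-10 logarithms; under that assumption the Hindi-Marathi entries give the 0.503 shown on the slide. The function name `translation_entropy` is hypothetical.

```python
import math


def translation_entropy(probs, base=10):
    """Shannon entropy of one source phrase's translation distribution."""
    return -sum(p * math.log(p) / math.log(base) for p in probs if p > 0)


# Translation probabilities for the source phrase प्रस्तुत, copied from the slide.
hindi_english = {"a": 0.1, "afford": 0.1, "offer": 0.5, "offers": 0.3}
hindi_marathi = {"अधिक असे देऊ": 0.05, "उपलब्ध": 0.2, "काहींचे": 0.05,
                 "देऊ": 0.6, "सादर": 0.1}

h_hm = translation_entropy(hindi_marathi.values())
h_he = translation_entropy(hindi_english.values())
print(f"Hindi-Marathi: {h_hm:.3f}")   # 0.503, matching the slide
print(f"Hindi-English: {h_he:.3f}")   # what the same assumed formula gives
```

The Hindi-English value printed here is only what this assumed formula produces for the listed entries; the slide itself leaves that figure blank.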
Entropy Evaluation
  The phrase table gives a probability distribution over the possible translations for each source phrase.
  We use the probability of the source phrase itself to get a distribution for the entire phrase table.
  Entropy is evaluated as per the standard formula.
  Hindi-Marathi Phrase Table Entropy:
  Hindi-English Phrase Table Entropy: 9.770
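One possible reading of this procedure is sketched below, under explicit assumptions: the joint distribution is taken to be p(s, t) = p(s) * p(t | s), "the standard formula" is taken to be Shannon entropy over that joint distribution, and logarithms are base 10. The slide does not spell out any of these choices, and the function name, the phrase labels and all numbers in the toy table are made up for illustration.

```python
import math


def table_entropy(source_prob, phrase_table, base=10):
    """Entropy of the joint distribution p(s, t) = p(s) * p(t | s)."""
    total = 0.0
    for src, translations in phrase_table.items():
        for p_t_given_s in translations.values():
            joint = source_prob[src] * p_t_given_s
            if joint > 0:
                total -= joint * math.log(joint) / math.log(base)
    return total


# Hypothetical two-phrase table; "s1"/"s2" and all probabilities are invented.
phrase_table = {
    "s1": {"t1": 0.5, "t2": 0.5},
    "s2": {"t3": 1.0},
}
source_prob = {"s1": 0.75, "s2": 0.25}

print(round(table_entropy(source_prob, phrase_table), 3))
```

By the chain rule this joint entropy equals the entropy of the source-phrase distribution plus the source-weighted average of the per-phrase translation entropies, so a table whose source phrases have many near-equiprobable translations scores higher.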
Handling Divergence through Indicative Translation (Microsoft Techvista Award, Ananthakrishnan 2007)
Indicative Translation: what and why?
  Native-speaker-acceptable translation is not possible, especially considering English-Hindi (Indian languages) divergence
  Compromises
    human-aided translation (post-editing)
    narrow domain (weather reports)
    rough translation
  Indicative MT
    Goal: understandable rather than perfect output
    Purpose: assimilation rather than dissemination (translation on the web)
Divergence between English and Hindi
  Divergence: differences in lexical and syntactic choices that languages make in expressing ideas
  MaTra:
    Structural transfer
      SVO to SOV
      post-modifiers to pre-modifiers
    Lexical transfer: WSD + lexicon lookup
      inflections
      case-markers
Divergence between Natural and Indicative Hindi: some examples
  E: We eat the rotten canteen food every night.
  H: हम हर रात कैन्टीन का सड़ा हुआ खाना खाते हैं
  I: हम हर रात सड़ा हुआ कैन्टीन खाना खाते हैं
  E: The batsman who had been scoring heavily against them has to be removed early.
  H: जो बल्लेबाज़ उनके विरुद्ध ज़ोरदार स्कोर कर रहा था उसे जल्दी निकालना होगा
  I: बल्लेबाज़, जो उनके विरुद्ध ज़ोरदार स्कोर कर रहा था, जल्दी निकालना होगा
Categorial divergence
  E: I am feeling hungry
  H: मुझे भूख लग रही है
  I: मैं भूखा महसूस कर रहा हूँ
  n-gram matches: unigrams: 0/6; bigrams: 0/5; trigrams: 0/4; 4-grams: 0/3
Relation between words in noun-noun compounds
  E: The ten best Aamir Khan performances
  H: आमिर ख़ान की दस सर्वोत्तम पर्फ़ार्मन्सस
  I: दस सर्वोत्तम आमिर ख़ान पर्फ़ार्मन्सस
  n-gram matches: unigrams: 5/5; bigrams: 2/4; trigrams: 0/3; 4-grams: 0/2
Lexical divergence
  E: Food, clothing and shelter are a man's basic needs.
  H: रोटी, कपड़ा और मकान एक मनुष्य की बुनियादी ज़रूरतें हैं
  I: खाना, कपड़ा, और आश्रय एक मनुष्य की बुनियादी ज़रूरतें हैं
  n-gram matches: unigrams: 8/10; bigrams: 6/9; trigrams: 4/8; 4-grams: 3/7
Pleonastic Divergence
  E: It is raining
  H: बारिश हो रही है
  I: यह बारिश हो रही है
  n-gram matches: unigrams: 4/5; bigrams: 3/4; trigrams: 2/3; 4-grams: 1/2
  E: There was a great king
  H: एक महान राजा था
  I: वहाँ एक महान राजा था
Stylistic differences
  E: The Lok Sabha has 545 members.
  H: लोक सभा में ५४५ सदस्य हैं
  I: लोक सभा के पास ५४५ सदस्य हैं
  n-gram matches: unigrams: 5/7; bigrams: 3/6; trigrams: 1/5; 4-grams: 0/4
  Other differences: word order, sentence length
Transliteration and WSD errors
  E: I purchased a bat.
  H: मैने एक बल्ला खरीदा
  I: मैने एक बैट खरीदा (transliteration)
  I: मैने एक चमगादड़ खरीदा (WSD error)
  n-gram matches: unigrams: 3/4; bigrams: 1/3; trigrams: 0/2; 4-grams: 0/1
Divergence/problem     Average BLEU precision   Translation acceptable?
Categorial             0                        Yes
Noun-noun compounds    0.38                     Yes
Lexical                0.6                      Yes
Transliteration        0.27                     Yes
Pleonastic             0.68                     No
Stylistic              0.35                     No
WSD error              0.27                     No
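The n-gram match counts and the averages in this table can be reproduced with a short sketch. Two things here are assumptions inferred from the slide numbers rather than stated in the lecture: tokens are split on whitespace, and "average BLEU precision" is the arithmetic mean of the 1-gram to 4-gram precisions (for the noun-noun compound example, (5/5 + 2/4 + 0/3 + 0/2) / 4 = 0.375, which rounds to the 0.38 in the table). This is not full BLEU, which combines the precisions with a geometric mean and a brevity penalty. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_matches(candidate, reference, n):
    """Clipped count of candidate n-grams found in the reference, and the total."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched, sum(cand.values())


def average_precision(candidate, reference, max_n=4):
    """Arithmetic mean of the 1..max_n-gram precisions."""
    precisions = []
    for n in range(1, max_n + 1):
        matched, total = ngram_matches(candidate, reference, n)
        precisions.append(matched / total if total else 0.0)
    return sum(precisions) / max_n


# Noun-noun compound example from the slides, split on whitespace.
reference = "आमिर ख़ान की दस सर्वोत्तम पर्फ़ार्मन्सस".split()   # H (natural Hindi)
candidate = "दस सर्वोत्तम आमिर ख़ान पर्फ़ार्मन्सस".split()       # I (indicative Hindi)

for n in range(1, 5):
    print(n, ngram_matches(candidate, reference, n))
print(average_precision(candidate, reference))
```

Read against the table, the score and the acceptability judgement do not line up: the categorial example scores 0 yet is judged acceptable, while the pleonastic example scores 0.68 and is not.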
Advantages of a hybrid Rule-based + SMT system
  What SMT brings to the table
    If data is available, there is no need for linguistic resources
    Quick adaptation to
      new domains (tourism, health)
      new language pairs (English-Gujarati/Marathi)
    See improvements by adding data
  What rule-based systems bring to the table
    Capture a small set of systematic differences well
      SVO to SOV (do we need to learn this?)
    Better handle on correcting specific cases
Preprocessing rules + SMT for English-Indian language MT
  Lack of linguistic resources for Indian languages
  Lots of resources available for English
  Morphology is rich for Indian languages
  Wider systematic syntactic differences between English and Indian languages
Placed within the Vauquois Triangle
Previous work on factored MT
