Machine Translation: Techniques, Technologies and Challenges

Presentation transcript:

Machine Translation: Techniques, Technologies and Challenges Presented By Bibekananda Kundu, CDAC Kolkata

Why do we need a Machine Translation System?

Translation ACK: http://www.cfilt.iitb.ac.in/publications/icon_2013_smt_tutorial_slides.pdf

Why is Translation Difficult? Ambiguity Language Divergence

Morphological Ambiguity
আমার এই জামাই চাই
Reading 1 (জামা + ই): I want this very shirt.
Reading 2 (জামাই): I want this son-in-law.

Lexical Ambiguity
The pen was in the box.
The box was in the pen.
ACK: http://ttt.org/theory/mt4me/mtambiguity.html

Lexical Ambiguity
ACK: http://specgram.com/CLIII.4/08.phlogiston.cartoon.zhe.html

Syntactic and Semantic Ambiguity
Fruit flies like a banana
So we have two possible readings, and two different kinds of ambiguity.
Semantic ambiguity: “flies” could be a noun, meaning insects, or it could be a verb, meaning travel by air. And “like” could be a verb, meaning love, or it could be a preposition, meaning similar to. [“like” is a remarkable word: it can be used as a noun, verb, adverb, adjective, preposition, particle, conjunction, or interjection.]
Syntactic ambiguity: it could be that “fruit flies” is the subject (NP) and “like a banana” is the predicate (VP). Or it could be that “fruit” is the subject (NP) and “flies like a banana” is the predicate (VP, containing the PP “like a banana”).
These kinds of ambiguities show up all over the place, and as humans, we’re so good at resolving them that usually we don’t even notice them consciously. But sometimes, unintentional ambiguities produce a quite comic effect.
ACK: nlp.stanford.edu/~wcmac/papers/20110526-symsys-100-nlp.pptx
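The two readings of "Fruit flies like a banana" can be made concrete by counting parse trees. The sketch below is only an illustration: the toy grammar, the lexicon and the function name count_parses are assumptions made for this example and are not taken from the slides. It uses a CKY-style chart that counts trees rather than building them.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (an assumption made for this example).
BINARY = {
    ("NP", "VP"): ["S"],
    ("N", "N"): ["NP"],    # "fruit flies" as a compound noun
    ("Det", "N"): ["NP"],
    ("V", "NP"): ["VP"],   # "like [a banana]" with "like" as a verb
    ("V", "PP"): ["VP"],   # "flies [like a banana]"
    ("P", "NP"): ["PP"],   # "like" as a preposition
}
LEXICON = {
    "fruit": ["N", "NP"],
    "flies": ["N", "V"],
    "like": ["V", "P"],
    "a": ["Det"],
    "banana": ["N"],
}


def count_parses(words, root="S"):
    """Return how many distinct trees with the given root cover all the words."""
    n = len(words)
    chart = defaultdict(int)  # (start, end, label) -> number of trees
    for i, word in enumerate(words):
        for tag in LEXICON[word]:
            chart[i, i + 1, tag] += 1
    for span in range(2, n + 1):
        for start in range(n - span + 1):
            end = start + span
            for mid in range(start + 1, end):
                for (left, right), parents in BINARY.items():
                    ways = chart[start, mid, left] * chart[mid, end, right]
                    if ways:
                        for parent in parents:
                            chart[start, end, parent] += ways
    return chart[0, n, root]


print(count_parses("fruit flies like a banana".split()))
```

With this toy grammar the chart holds one tree in which "fruit flies" is the subject and one in which "fruit" alone is the subject. A real parser would use a much larger grammar and would need probabilities or semantic constraints to choose between them.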

Pronoun Reference Ambiguity ACK: http://specgram.com/CLIII.4/08.phlogiston.cartoon.zhe.html

Why is Translation Difficult? Ambiguity Language Divergence

Agglutination
Finnish: istahtaisinkohan
• ist: "sit", verb stem
• ahta: verb derivation morpheme, "to do something for a while"
• isi: conditional affix
• n: 1st person singular suffix
• ko: question particle
• han: a particle for things like reminder (with declaratives) or "softening" (with questions and imperatives)
English: I wonder if I should sit down for a while
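As a small illustration of the segmentation above, the following sketch splits the word into the morphemes listed on the slide. It is not a morphological analyzer: the morpheme list is simply hard-coded in surface order, and the function name segment is an assumption made for this example.

```python
# Morphemes and glosses as listed on the slide, in surface order.
MORPHEMES = [
    ("ist", '"sit", verb stem'),
    ("ahta", 'verb derivation morpheme, "to do something for a while"'),
    ("isi", "conditional affix"),
    ("n", "1st person singular suffix"),
    ("ko", "question particle"),
    ("han", "reminder / softening particle"),
]


def segment(word, morphemes=MORPHEMES):
    """Split word into the given morphemes, left to right; fail if it does not fit."""
    pieces, rest = [], word
    for form, gloss in morphemes:
        if not rest.startswith(form):
            raise ValueError(f"expected {form!r} at {rest!r}")
        pieces.append((form, gloss))
        rest = rest[len(form):]
    if rest:
        raise ValueError(f"unanalysed remainder {rest!r}")
    return pieces


for form, gloss in segment("istahtaisinkohan"):
    print(f"{form:5} {gloss}")
```

One Finnish word carries what English spreads over about ten words, which is why word-for-word transfer breaks down for agglutinative languages.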

Free Word Order
The excellent novel has been published at the book fair.
বইমেলায় অসাধারণ উপন্যাসটা প্রকাশিত হয়েছে ৷
বইমেলায় প্রকাশিত হয়েছে অসাধারণ উপন্যাসটা ৷
অসাধারণ উপন্যাসটা বইমেলায় প্রকাশিত হয়েছে ৷
অসাধারণ উপন্যাসটা প্রকাশিত হয়েছে বইমেলায় ৷

Examples of English-Bangla Divergence
Ram is a very good boy. → রাম [একটি ]খুব ভাল ছেলে [হয় ]৷
I was abroad then. → আমি তখন বিদেশে [ছিলাম ]৷
A girl with beautiful eyes → সুন্দর চোখের একটি মেয়ে
A boy with a high fever → প্রচন্ড জ্বরে আক্রান্ত একটি ছেলে
He wrote with a pen. → সে কলম দিয়ে লিখেছিল ।
He has two books. → তার দুটো বই গুলো আছে ।

Examples of English-Bangla Divergence
I eat rice. → আমি ভাত খাই ৷
I drink water. → আমি জল খাই ৷
He smoked the cigarette. → সে সিগারেটটা খেয়েছিল৷
It is raining. → বৃষ্টি হচ্ছে ।
There is a tiger in the forest. → বনে বাঘ আছে৷

Looking inside of a Machine Translation System

Issues to Handle
Sentence: I went with my friend, John, to the bank to withdraw some money but was disappointed to find it closed.
• Parts of Speech
• Named Entity Recognition
• Word Sense Disambiguation
• Co-reference
• Subject Drop

Machine Translation Trinity
• Level of Transfer: Interlingua based, Semantic Level, Syntactic Level, Direct Level
• Approach: Rule based, Statistical, Example based, Hybrid
• Language Pair: English-Bangla, Bangla-Hindi, English-Hindi
ACK: http://www.cfilt.iitb.ac.in/publications/icon_2013_smt_tutorial_slides.pdf

Vauquois Triangle for MT
The same example is carried through each level of transfer, from the base of the triangle to its apex.
ACK: http://www.cfilt.iitb.ac.in/publications/icon_2013_smt_tutorial_slides.pdf

Direct Level
Ram plays football → রাম ফুটবল খেলে / বাজায়

Morphological analysis
Ram play + s football → রাম ফুটবল খেল + ে / বাজা + য়

Part-of-speech tagging
Ram/PN play/VBF + s football/CN → রাম/PN ফুটবল /CN খেল/VBF + ে / বাজা/VBF + য় / নাটক/CN

Syntactic Level
Source and target are each parsed into a tree with the nodes S, NP, VP and NP.

Semantic Level
Ontology: Physical object → animate (human), inanimate (thing, instrument)
Play with Subject: Human, Object: thing → খেলে
Play with Subject: Human, Object: instrument → বাজায়
Ram: human. Football: thing. Guitar: instrument.
Ram plays football → রাম ফুটবল খেলে

Interlingua based
<aff {sub_np ( ram noun dont_care singular third [human] [rAma:m 8] [] [] ) } {obj1_np ( football noun neuter singular third [thing] [Putabala:m 8] [] [] ) } k1 {main_vp_active ( play_1 verb_2 normal normal dont_care singular third [Kela] 5 [] [] ) } > .
Ram plays football → রাম ফুটবল খেলে
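The semantic-level step, where the object's class decides between খেলে and বাজায়, can be sketched as a lookup on selectional restrictions. The dictionaries and the function name translate_play below are assumptions made for this example; only the classes (human, thing, instrument) and the two verb forms come from the slide.

```python
# Semantic classes of the example words, following the slide's ontology.
SEMANTIC_CLASS = {"Ram": "human", "football": "thing", "guitar": "instrument"}

# (subject class, object class) -> Bangla verb form for English "plays".
PLAY_FRAMES = {
    ("human", "thing"): "খেলে",
    ("human", "instrument"): "বাজায়",
}


def translate_play(subject, obj):
    """Pick the Bangla verb for 'plays' from the classes of its arguments."""
    frame = (SEMANTIC_CLASS[subject], SEMANTIC_CLASS[obj])
    return PLAY_FRAMES[frame]


print("Ram plays football ->", translate_play("Ram", "football"))
print("Ram plays guitar   ->", translate_play("Ram", "guitar"))
```

At the direct, morphological and syntactic levels both verb forms remain candidates; the choice only becomes possible once the object is known to be a thing or an instrument.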

Major Machine Translation Systems in Indian Languages

AnglaBharati
• Institutions: IIT Kanpur, CDAC Kol, Noida, Hyderabad, Tvm
• Approach: Rule-based and Interlingua based
• Languages: English to Hindi, Bangla, Urdu, Punjabi, Telugu, Malayalam, Assamese

Sampark
• Institutions: IIIT Hyderabad, University of Hyderabad, IIT Bombay, IIT Kharagpur, CDAC Noida
• Approach: Rule-based and dictionary-based algorithms with statistical machine learning. It uses Computational Paninian Grammar.
• Languages: Hindi to Punjabi, Punjabi to Hindi, Telugu to Tamil and Urdu to Hindi

Anuvadaksh
• Institutions: C-DAC Pune, IISc Bangalore, IIIT Hyderabad, C-DAC Mumbai, IIT Bombay, IIIT Allahabad
• Approach: Four Machine Translation Technologies: TAG (Tree-Adjoining-Grammar based MT), SMT (Statistical based MT), AnalGen (Rules-Based MT) and EBMT (Example Based MT)
• Languages: English to Hindi, Marathi, Bengali, Urdu, Oriya, Tamil, Gujarati and Bodo

Mantra-RajBhasha
• Institution: CDAC Pune
• Approach: Rule-based system using Augmented Transition Network (ATN) and Tree Adjoining Grammar (TAG) formalisms.
• Languages: English to Hindi. Translates documents pertaining to the Personnel Administration, Finance, Small Scale Industries, Agriculture, Information Technology, HealthCare, Education and Banking domains.

Building blocks of a Rule-based Machine Translation System

Block Diagram
Input Sentence → Pre Processor → Exception Handler → Phrase Marker → Word Analyzer → Sentence Analyzer → Output Generator → Post Processor → Output Sentence(s)
Other blocks in the diagram: Symbol table, Knowledge Base, PLIL
Courtesy: Sudipta Debnath, CDAC Kolkata

Input > The price of access data base management is Rs. 100.
Pre Processor > the price of access data base management is rrr01
Symbol table: rrr01/100 টাকা
Exception Handler >
Phrase Marker > the price of access_data_base_management is rrr01
Word Analyzer >
the: the
price: NOUN, neuter, singular, third, finance, দাম, …
of: of
access_data_base_management: NOUN, neuter, singular, third, activity, অ্যাকসেস ডেটাবেস ম্যানেজমেন্ট, …
is: is
rrr01: ADJECTIVE, *, NIL, rrr01, …
Sentence Analyzer > <aff {sub_np ( accessdatabasemanagement noun neuter singular third [activity] [Akasesa detAbesa myAnejamenta:m 8] [] [] ) ( of prep [ of ] ) ( the det [] [anda] [A] ) ( price noun neuter singular third [finance] [xAma:m 8] [] [] ) } ( ggg01 adjective any [NIL] [ggg01] [] [] ) {main_vp v1 } > . Sviram
Output Generator > অ্যাকসেস ডেটাবেস ম্যানেজমেন্টের দাম rrr01
Post Processor > অ্যাকসেস ডেটাবেস ম্যানেজমেন্টের দাম ১০০ টাকা
Output > অ্যাকসেস ডেটাবেস ম্যানেজমেন্টের দাম ১০০ টাকা ৷

Basic Steps Involved in Statistical Machine Translation

Parallel Aligned Sentences ACK: http://www.cfilt.iitb.ac.in/publications/icon_2013_smt_tutorial_slides.pdf

Word Alignment ACK: http://www.cfilt.iitb.ac.in/publications/icon_2013_smt_tutorial_slides.pdf

Initial Probability

Expected Count

Revised Probability

Expected Count

Re-Revised Probability
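The sequence Initial Probability, Expected Count, Revised Probability is one round of expectation-maximization over word translation probabilities. The transcript does not keep the tables from these slides, so the sketch below assumes an IBM Model 1 style update; the toy corpus (English with transliterated Hindi) and the function name ibm1_em are made up for this example.

```python
from collections import defaultdict

# Hypothetical two-sentence parallel corpus: (source words, target words).
CORPUS = [
    ("red flower".split(), "lal phool".split()),
    ("red car".split(), "lal gaadi".split()),
]


def ibm1_em(corpus, iterations=2):
    """Estimate t[(target_word, source_word)] with IBM Model 1 style EM."""
    src_vocab = {e for src, _ in corpus for e in src}
    tgt_vocab = {f for _, tgt in corpus for f in tgt}
    # Initial probability: uniform over the target vocabulary.
    t = {(f, e): 1.0 / len(tgt_vocab) for f in tgt_vocab for e in src_vocab}
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # Expected count: share each target word among the source words
        # of its sentence, in proportion to the current probabilities.
        for src, tgt in corpus:
            for f in tgt:
                norm = sum(t[f, e] for e in src)
                for e in src:
                    share = t[f, e] / norm
                    count[f, e] += share
                    total[e] += share
        # Revised probability: normalise the counts per source word.
        t = {(f, e): count[f, e] / total[e] for (f, e) in t}
    return t


t = ibm1_em(CORPUS, iterations=2)
for f in ("lal", "phool", "gaadi"):
    print(f"t({f} | red) = {t[f, 'red']:.3f}")
```

Because "lal" co-occurs with "red" in both sentences, its probability rises with every revision while the competing words lose mass. That is the effect the revised and re-revised tables are meant to show.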

Learning Phrase Tables from Word Alignments
English: Prof C.N.R. Rao was honored with the Bharat Ratna
Hindi: प्रोफेसर सी.एन.आर राव को भारतरत्न से सम्मानित किया गया
Central Idea: A consecutive sequence of aligned words constitutes a “phrase pair”
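The central idea can be sketched as the usual consistency check: a source span and a target span form a phrase pair when no alignment link leaves the box they define. The code below is a simplified illustration. The alignment is hypothetical, the sentence pair is a shortened version of the deck's "Ram ate rice" example, the function name extract_phrase_pairs is an assumption, and the sketch does not extend phrases over unaligned words as full extractors do.

```python
SRC = "Ram ate rice".split()
TGT = "राम ने चावल खाये".split()
# Hypothetical word alignment as (source index, target index) links.
ALIGNMENT = {(0, 0), (0, 1), (1, 3), (2, 2)}


def extract_phrase_pairs(src, tgt, alignment, max_len=4):
    """Return all (source phrase, target phrase) pairs consistent with the alignment."""
    pairs = set()
    for s1 in range(len(src)):
        for s2 in range(s1, min(s1 + max_len, len(src))):
            # Target positions linked to any word in the source span.
            linked = [j for (i, j) in alignment if s1 <= i <= s2]
            if not linked:
                continue
            t1, t2 = min(linked), max(linked)
            if t2 - t1 + 1 > max_len:
                continue
            # Consistency: no target word in the span may link outside the source span.
            if any(t1 <= j <= t2 and not (s1 <= i <= s2) for (i, j) in alignment):
                continue
            pairs.add((" ".join(src[s1:s2 + 1]), " ".join(tgt[t1:t2 + 1])))
    return pairs


for source_phrase, target_phrase in sorted(extract_phrase_pairs(SRC, TGT, ALIGNMENT)):
    print(source_phrase, "|||", target_phrase)
```

"Ram ate" is not extracted here: its linked target words span the whole sentence, and that span also contains the word aligned to "rice".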

Example of Phrase Based Machine Translation
Input: Ram ate rice with the spoon
Translation options:
• Ram: राम, राम ने, राम को, राम से
• ate: खाये, खा लिया, खा लिया है
• rice: धान, चावल
• with: के साथ, से
• the: यह, वह, एक
• spoon: चमचा, चम्मच
• with the spoon: चम्मच से, चम्मच के साथ
ACK: http://www.cfilt.iitb.ac.in/publications/icon_2013_smt_tutorial_slides.pdf

Example of Phrase Based Machine Translation
Options considered during decoding: खा लिया, राम ने, चम्मच से, चावल, चावल खाये, चम्मच
ACK: http://www.cfilt.iitb.ac.in/publications/icon_2013_smt_tutorial_slides.pdf

Example of Phrase Based Machine Translation
Ram ate rice with the spoon → राम ने चम्मच से चावल खाये
ACK: http://www.cfilt.iitb.ac.in/publications/icon_2013_smt_tutorial_slides.pdf

Factored Translation Model
Each factor of the input maps to the same factor of the output.
Factor | Input | Output
word | ছেলেগুলো | boys
lemma | ছেলে | boy
parts of speech | CN | CN
morphology | Plural | Plural
word class | Animate | Animate
ACK: http://www.cfilt.iitb.ac.in/publications/icon_2013_smt_tutorial_slides.pdf

Conclusion
• Why do we need a Machine Translation System?
• Why is Translation Difficult?
• Looking inside of a Machine Translation System
• How to judge a Machine Translation System?

Thank You

Input
One sentence at a time.

Pre Processor
Input: English sentence
Output: modified sentence, symbol table (will be used at the post-processing phase)
It detects words and patterns as per the “Knowledge Base” and replaces them with predefined symbols. It also stores those symbols, with their translations, in the “symbol table” for future use.

Pre Processor (Sample Input Output)
Action | Input | Modified input | Symbol table
Initial/Final | hello, how are you ? | how are you ? | </হ্যালো,/…
Initial/Final | May I know your name, please ? | May I know your name ? | >/, অনুগ্রহ করে /…
Expand | It’s 52 cm. long. | It is 52 centimeter long. | x
Expand | You can talk with Prof. D. Gupta. | You can talk with professor D. Gupta. | x
Expand | He is d/o ram. | He is daughter of ram. | x
Date | Meeting will be on 24 th April . | Meeting will be on ddd01 . | ddd01/২৪ শে এপ্রিল
Date | Meeting will be on 24/07/2015 . | Meeting will be on ddd01 . | ddd01/২৪/০৭/ ২০১৫
Date | Meeting will be on April 24, 2015 . | Meeting will be on ddd01 . | ddd01/এপ্রিল ২৪, ২০১৫
Time | Event will be on 9 pm. | Event will be on ttt01 . | ttt01/রাত্রি ৯ টা
Number | I can give you 21 stick. | I can give you nnn01 stick. | nnn01/২১
Rupee | I can give you Rs. 21. | I can give you rrr01 . | rrr01/২১ টাকা

Pre Processor
Similarly, this module also finds the following patterns as defined in the “Knowledge Base”:
• Acronym: IIT, CDAC etc.
• Bracket: ( { [
• Quotes: ‘ “
• URL: sudipta.debnath@cdac.in, http://www.google.com etc.
• Slash: school/college, boy/girl etc.
• Dash: delhi – pune, 100 – 250 etc.
• etc.
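The round trip through the symbol table can be sketched for one pattern, the Rupee amount. This is a minimal sketch, not the system's implementation: the regular expression, the function names pre_process and post_process, and the digit conversion are assumptions chosen to reproduce the rrr01/২১ টাকা row of the sample table.

```python
import re

# Map ASCII digits to Bangla digits for the symbol-table translation.
BANGLA_DIGITS = str.maketrans("0123456789", "০১২৩৪৫৬৭৮৯")


def pre_process(sentence):
    """Replace each 'Rs. <number>' with a symbol and record its translation."""
    symbol_table = {}

    def replace(match):
        symbol = f"rrr{len(symbol_table) + 1:02d}"
        symbol_table[symbol] = match.group(1).translate(BANGLA_DIGITS) + " টাকা"
        return symbol

    modified = re.sub(r"Rs\.\s*(\d+)", replace, sentence)
    return modified, symbol_table


def post_process(translated, symbol_table):
    """Put the stored translations back in place of the symbols."""
    for symbol, translation in symbol_table.items():
        translated = translated.replace(symbol, translation)
    return translated


modified, table = pre_process("I can give you Rs. 21.")
print(modified)
print(table)
print(post_process("rrr01", table))
```

Keeping the amount out of the sentence means the later stages only ever see the opaque token rrr01, so they cannot mistranslate or reorder the number.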

Exception Handler
Input: Modified sentence
Output: Translated output(s), or Not_Found
This process searches the “Knowledge Base” for a match. The match may be against the whole sentence or against a pattern. If a match is found, it produces the translated output; otherwise it produces “Not_Found”.

Exception Handler (Sample Input Output)
Action | Input | Matched Rule | Output
Full | your faithfully | your faithfully#আপনার বিশ্বাসপাত্র | আপনার বিশ্বস্ত
Full | good morning | good morning #সুপ্রভাত | সুপ্রভাত
Template | Happy birthday to you. | happy <np1([]),*,*,*> {to} <np2([human]),*,*,*> # <np2([human]),*,*,*> কে জানাই শুভ <np1([]),*,*,*> | তোমাকে জানাই শুভ জন্মদিন
Template | Deepawali greetings and best wishes. | <np1([]),*,*,*> greetings and <np2([]),*,*,*> # <np1([]),*,*,*> এর শুভকামনা এবং <np2([]),*,*,*> | দীপাবলীর শুভকামনা এবং সবচেয়ে ভালো শুভেচ্ছা
Full + Template | I have a pen. | x | Not_Found
Full + Template | Book the ticket for me. | x | Not_Found
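The "Full" rows can be sketched as a dictionary lookup over rules written in the source#target form shown in the table. Template matching with <np…> slots is deliberately left out. The function names load_rules and handle_exception, and the normalisation of case and final punctuation, are assumptions made for this example.

```python
# Full-sentence rules in the "source#target" form used in the sample table.
RULES = [
    "your faithfully#আপনার বিশ্বাসপাত্র",
    "good morning #সুপ্রভাত",
]


def load_rules(lines):
    """Parse 'source#target' lines into a lookup table keyed by the source side."""
    table = {}
    for line in lines:
        source, target = line.split("#", 1)
        table[source.strip().lower()] = target.strip()
    return table


def handle_exception(sentence, table):
    """Return the stored translation for a whole-sentence match, else 'Not_Found'."""
    key = sentence.strip().rstrip(".!?").strip().lower()
    return table.get(key, "Not_Found")


table = load_rules(RULES)
print(handle_exception("Good morning.", table))
print(handle_exception("I have a pen.", table))
```

A sentence that returns Not_Found simply continues to the Phrase Marker and the rest of the pipeline.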

Phrase Marker
Input: Modified sentence
Output: Phrase-clubbed sentence
It detects and joins multiple words that must be treated as a single phrasal unit to get a proper translation. Detection is done as per the “Knowledge Base”. Word-by-word translation would otherwise be incorrect.

Phrase Marker (Sample Input Output)
Type | Input | Output
NP (Noun Phrase) | Abul salam international award for journalism will be announced this week. | Abul_salam_international_award_for_journalism will be announced this week.
NP (Noun Phrase) | We can use access data base management for this purpose. | We can use access_data_base_management for this purpose.
VP (Verb Phrase) | This budget tries to bring down the inflation. | This budget tries to bring_down the inflation.
VP (Verb Phrase) | He wants to cut down on extra steps. | He wants to cut_down on extra steps.
VP (Verb Phrase) | After being addicted he falls apart. | After being addicted he falls_apart.
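Phrase marking can be sketched as a greedy longest-match over a phrase list. The list PHRASES stands in for the Knowledge Base and holds only phrases from the sample table; the function name mark_phrases and the whitespace tokenisation are assumptions made for this example.

```python
# Multi-word units taken from the sample table; a stand-in for the Knowledge Base.
PHRASES = [
    "access data base management",
    "bring down",
    "cut down",
    "falls apart",
]


def mark_phrases(sentence, phrases=PHRASES):
    """Join known multi-word phrases with underscores, longest phrase first."""
    words = sentence.split()
    candidates = sorted((p.split() for p in phrases), key=len, reverse=True)
    out, i = [], 0
    while i < len(words):
        for phrase in candidates:
            window = words[i:i + len(phrase)]
            # Compare case-insensitively and ignore attached punctuation.
            if [w.lower().strip(".,;:?!") for w in window] == phrase:
                out.append("_".join(window))
                i += len(phrase)
                break
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)


print(mark_phrases("We can use access data base management for this purpose."))
print(mark_phrases("After being addicted he falls apart."))
```

Trying longer phrases first matters when one entry is a prefix of another, so that the longer unit wins.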

Word Analyzer
Input: Modified sentence
Output: List of token (word) details
This process searches the “Knowledge Base” for each token (a word or a combined word block) and produces the details of each. Some important details are:
• Parts of speech
• Tense (present/past/future)
• Number (singular/plural)
• Meaning (target language)
• etc.

Word Analyzer (Sample Input Output)
Actual input: Ram is a good boy.
Modified: uuu01 is a good boy.
Output:
unu01: NOUN, *, singular, third, human, uuu01, …
is: is
a: a
good: ADJ, positive, ভালো, …
boy: NOUN, masculine, singular, third, human, ছেলে, …

Word Analyzer (Sample Input Output)
Actual input: I have a pen.
Modified: i have a pen.
Output:
i: PRONOUN, *, singular, first, human, আমি, …
have: have
a: a
pen: NOUN, neuter, singular, third, thing, কলম, …
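The lookup in the second sample can be sketched as follows. The dictionary KNOWLEDGE_BASE holds only the two entries visible on the slide, and passing unknown tokens through unchanged is inferred from the "have: have" and "a: a" lines. The function name analyze is an assumption made for this example.

```python
# Entries copied from the sample output; a stand-in for the real Knowledge Base.
KNOWLEDGE_BASE = {
    "i": ("PRONOUN", "*", "singular", "first", "human", "আমি"),
    "pen": ("NOUN", "neuter", "singular", "third", "thing", "কলম"),
}


def analyze(sentence):
    """Return one 'token: details' line per token; unknown tokens map to themselves."""
    lines = []
    for token in sentence.lower().rstrip(". ").split():
        details = KNOWLEDGE_BASE.get(token)
        lines.append(f"{token}: {', '.join(details) if details else token}")
    return lines


for line in analyze("I have a pen."):
    print(line)
```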

Input Sentence Output Sentence(s) Pre Processor Exception Handler Phrase Marker Word Analyzer Sentence Analyzer Output Generator Post Processoror Symbol table Knowledge Base PLIL Sentence Analyzer Input : List of Token (word) detail Output : PLIL ( Pseudo Lingual for Indian Languages) It parses the sentence using the list of word detail. It produces output in a predefined textual structure called PLIL. PLIL is conceptually like a tree with arrangement as per Indian language structure (SOV: Subject Object Verb).

Sentence Analyzer (Sample Input Output) Sentence(s) Pre Processor Exception Handler Phrase Marker Word Analyzer Sentence Analyzer Output Generator Post Processoror Symbol table Knowledge Base PLIL Sentence Analyzer (Sample Input Output) Sentence Input Ram is a good boy. Input (at this module) unu01: NOUN, *, singular, third, human, uuu01, … is : is a : a good: ADJ, positive, ভালো ,…. boy: NOUN, masculine, singular, third, human, ছেলে,…. Output PLIL <aff {sub_np ( unu01 noun dont_care singular third [human] [unu01:m 8] [] [] ) } {pp } {obj1_np ( a det [ekati/{}] [tamil_a] [telugu_a] ) ( good adjective positive [NIL] [BAla] [] [] ) ( boy noun masculine singular third [human] [Cele:m 8] [] [] ) } {main_vp v1 } > . sviram

Sentence Analyzer (Sample Input Output)
Sentence Input : I have a pen.
Input (at this module) :
i : PRONOUN, *, singular, first, human, আমি, …
have : have
a : a
pen : NOUN, neuter, singular, third, thing, কলম, …
Output PLIL :
<aff {sub_np ( i noun dont_care singular first [human] [Ami:m 8] [] [] ) } ke_pas {pp } {obj1_np ( a det [ekati/{}] [tamil_a] [telugu_a] ) ( pen noun neuter singular third [thing] [kalama:m 8] [] [] ) } {main_vp aux_have } > . sviram
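
The PLIL samples above are flat text, but their brackets describe a tree: phrase groups in braces, word entries in parentheses, and per-word slots in square brackets. The Python sketch below reads such a string back into nested data. It is based only on the samples shown here, so the grammar it assumes (and the names TOKEN and parse_plil) is a guess at the notation, not a specification of PLIL.

```python
import re

# Square-bracket slots are kept whole (they may contain "{}" or spaces);
# the other brackets are single tokens; everything else is a bare word.
TOKEN = re.compile(r"\[[^\]]*\]|[<>{}()]|[^\s<>{}()\[\]]+")


def parse_plil(text):
    """Return (phrases, loose): phrase groups and tokens outside any group."""
    tokens = TOKEN.findall(text)
    phrases, loose, current = [], [], None
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "{":
            # The token after "{" is taken as the phrase name, e.g. sub_np.
            current = {"name": tokens[i + 1], "words": [], "markers": []}
            i += 2
            continue
        if tok == "}":
            phrases.append(current)
            current = None
        elif tok == "(":
            # Collect one word entry up to the matching ")".
            end = tokens.index(")", i)
            current["words"].append(tokens[i + 1:end])
            i = end
        elif tok in ("<", ">"):
            pass  # sentence delimiters
        elif current is not None:
            current["markers"].append(tok)  # bare token such as v1
        else:
            loose.append(tok)  # e.g. aff, ke_pas, sviram
        i += 1
    return phrases, loose


sample = (
    "<aff {sub_np ( unu01 noun dont_care singular third [human] [unu01:m 8] [] [] ) } "
    "{pp } {obj1_np ( a det [ekati/{}] [tamil_a] [telugu_a] ) "
    "( good adjective positive [NIL] [BAla] [] [] ) "
    "( boy noun masculine singular third [human] [Cele:m 8] [] [] ) } "
    "{main_vp v1 } > . sviram"
)
phrases, loose = parse_plil(sample)
for phrase in phrases:
    print(phrase["name"], [w[0] for w in phrase["words"]], phrase["markers"])
```

The phrase groups come out in the order sub_np, pp, obj1_np, main_vp, which matches the subject, object, verb arrangement described for PLIL.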

Generate Output(s)
Input : PLIL
Output : translated output(s)
This process generates the target language output(s) from the PLIL value. Specific tasks include:
- Generating the actual translated value for NOUN, PRONOUN, VERB, PREPOSITION / POSTPOSITION, etc.
- Moving words for better accuracy and language-specific style.

Generate Output(s) (Sample Input Output)
English Input : Ram is a good boy.
PLIL :
<aff {sub_np ( unu01 noun dont_care singular third [human] [unu01:m 8] [] [] ) } {pp } {obj1_np ( a det [ekati/{}] [tamil_a] [telugu_a] ) ( good adjective positive [NIL] [BAla] [] [] ) ( boy noun masculine singular third [human] [Cele:m 8] [] [] ) } {main_vp v1 } > . sviram
Output :
unu01 ^একটি | {} ~ ভাল ছেলে {}

Generate Output(s) (Sample Input Output)
English Input : I have a pen.
PLIL :
<aff {sub_np ( i noun dont_care singular first [human] [Ami:m 8] [] [] ) } ke_pas {pp } {obj1_np ( a det [ekati/{}] [tamil_a] [telugu_a] ) ( pen noun neuter singular third [thing] [kalama:m 8] [] [] ) } {main_vp aux_have } > . sviram
Output :
আমার ^একটি | {} ~ কলম আছে

Post Processor
Input : translated output, symbol table
Output : final output(s)
It applies final changes (if required) to each translated output. Important tasks include:
- Replacing the symbols placed by the preprocessor (if any) with their entries from the symbol table.
- Moving words for better accuracy and language-specific style.

Post Processor (Sample Input Output)

Action : Initial/
Input : আপনি কেমন আছেন
Symbol table : </হ্যালো,/…
Final Output : হ্যালো, আপনি কেমন আছেন

Action : Symbol
Input : unu01 ^একটি | {} ~ ভাল ছেলে {}
Symbol table : unu01/রাম /…
Final Output : রাম^একটি | {} ~ ভাল ছেলে {}
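
The symbol replacement step can be pictured as a plain substitution over the translated string. The Python sketch below assumes the symbol table is a mapping from placeholder to original word; the function name restore_symbols and that dictionary form are illustrative choices, not the system's actual data structure.

```python
def restore_symbols(translated, symbol_table):
    """Replace each pre-processor placeholder with its original word."""
    for symbol, original in symbol_table.items():
        translated = translated.replace(symbol, original)
    return translated


# Hypothetical symbol table holding the entry from the sample above.
symbol_table = {"unu01": "রাম"}
result = restore_symbols("unu01 ^একটি | {} ~ ভাল ছেলে {}", symbol_table)
print(result)
```

The sketch does not interpret the remaining markers (^, |, ~, {}) in the generated output and leaves spacing untouched; resolving those would be part of the other post-processing tasks.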

Sum up
[Figure: system block diagram. Components shown: Input Sentence, Pre Processor, Exception Handler, Phrase Marker, Word Analyzer, Sentence Analyzer, Output Generator, Post Processor, Output Sentence(s), Symbol table, Knowledge Base, PLIL]