BTP Stage 1: Machine Transliteration & Entropy. Final Presentation by Bhargava Reddy (110050078)

Contents
What is Machine Transliteration?
Classical Methods of Transliteration
CRF and its Use in Transliteration
Study of Transliteration through Bridging Languages
Study of Entropy in Information Theory
Entropy of a Language
Transliterability and Transliteration Performance
Conclusion and Future Work

Definition of Machine Transliteration
Conversion of a given name in the source language to a name in the target language such that the target name:
1. Is phonemically equivalent to the source name
2. Conforms to the phonology of the target language
3. Matches the user's intuition of the equivalent of the source name in the target language, considering the culture and orthographic character usage of the target language
Note that each of these criteria defines equivalence in its own way
Ref: Report of NEWS 2012 Machine Transliteration Shared Task

Classical Machine Transliteration Models
Work on machine transliteration started in the 1990s, and four classical Machine Transliteration Models have been proposed:
1. Grapheme-Based Transliteration Model (ψ_G)
2. Phoneme-Based Transliteration Model (ψ_P)
3. Hybrid Transliteration Model (ψ_H)
4. Correspondence-Based Transliteration Model (ψ_C)
Ref: A Comparison of Different Machine Transliteration Models (2006)
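The models' formulations are not preserved in the transcript. As a sketch of one commonly cited formulation (not necessarily the slide's exact equation), the hybrid model ψ_H linearly interpolates the grapheme-based and phoneme-based models:

$$P_{\psi_H}(T \mid S) = \lambda \, P_{\psi_G}(T \mid S) + (1 - \lambda) \, P_{\psi_P}(T \mid S), \qquad 0 \le \lambda \le 1$$

while the correspondence-based model ψ_C instead uses the correspondence between a source grapheme and a source phoneme jointly when producing target graphemes.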

Grapheme and Phoneme
Phoneme: the smallest contrastive linguistic unit which may bring about a change of meaning. "Kiss" and "kill" are two completely contrasting words; the phonemes /s/ and /l/ are what make the difference.
Grapheme: the smallest semantically distinguishing unit in a written language, analogous to the phoneme of spoken languages. A grapheme may or may not carry meaning by itself, and may or may not correspond to a single phoneme.

Phoneme-Based Transliteration Model
This model is basically a source grapheme to source phoneme transformation followed by a source phoneme to target grapheme transformation
It was first proposed by Knight and Graehl in 1997
They modelled it for English-Japanese and Japanese-English transliteration
In these methods the main transliteration key is pronunciation, i.e. the source phoneme, rather than spelling, i.e. the source grapheme

Katakana Words and the Japanese Language
Katakana words are words imported into Japanese from other languages (primarily English)
Pronunciation raises many issues for such words
In Japanese, the sounds L and R are pronounced the same (as something in between the two)
The same goes for H and F

Katakana Words
Golf bag is pronounced go-ru-hu-ba-ggu: ゴルフバッグ
Johnson is pronounced jyo-n-s-o-n: ジョンソン
Ice cream is pronounced a-i-su-ku-ri-i-mu: アイスクリーム
What do we observe in these transliterations? A lot of information is lost in the conversion from English to Japanese, so back-transliteration can run into trouble
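To make the information loss concrete, here is a minimal sketch (the sound mappings are hypothetical and heavily simplified, not from the slides) of how a many-to-one mapping makes the inverse ambiguous:

```python
# Minimal sketch: Japanese collapses some English sound distinctions,
# so the forward mapping is many-to-one and back-transliteration
# has to consider several candidates.

SOUND_COLLAPSE = {"l": "r", "f": "h"}  # hypothetical, simplified

def to_japanese_sounds(word: str) -> str:
    """Forward (lossy) mapping: collapse the l/r and f/h distinctions."""
    return "".join(SOUND_COLLAPSE.get(ch, ch) for ch in word.lower())

def back_candidates(jp: str) -> list[str]:
    """Back-transliteration: every collapsed sound fans out again."""
    inverse = {"r": ["l", "r"], "h": ["f", "h"]}
    candidates = [""]
    for ch in jp:
        candidates = [c + o for c in candidates for o in inverse.get(ch, [ch])]
    return candidates

print(to_japanese_sounds("golf"))  # 'gorh': the l and the f are gone
print(back_candidates("gorh"))     # ['golf', 'golh', 'gorf', 'gorh']
```

A real system would rank such candidates with a language model rather than enumerating them blindly.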

The steps to convert from English to Katakana

Fixing Back-Transliteration
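The slide's equations are not preserved. As a sketch of Knight and Graehl's approach (from the cited 1997 paper), back-transliteration recovers the most probable English source by Bayes' rule:

$$\hat{e} = \arg\max_{e} P(e) \, P(j \mid e)$$

where P(e) is a model of plausible English words or names and P(j | e) is the channel that turns English pronunciations into Japanese katakana; the paper factors this channel into stages (word to English phonemes, English phonemes to Japanese sounds, Japanese sounds to katakana), each implemented as a weighted finite-state transducer.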

Example for Back Transliteration

Study of MT through Bridging Languages
Parallel names data is available for a language pair due to one of the following three reasons:
1. Politically related languages: due to the political dominance of English, it is easy to obtain parallel names data between English and most languages
2. Genealogically related languages: languages sharing the same origin, which might have significant overlap between their phonemes and graphemes
3. Demographically related languages: e.g. Hindi and Telugu, which might not have the same origin, but shared culture and demographics create similarities
Ref: Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. NAACL 2010

Bridge Transliteration Methodology. Ref: Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. NAACL 2010
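The methodology figure is not preserved in the transcript. As a rough sketch under assumed interfaces (the function signatures and k-best composition below are illustrative, not the paper's code), a bridge system X to Z chains an X to Y system with a Y to Z system and marginalizes over the bridge candidates:

```python
# Hypothetical sketch of bridge transliteration X -> Y -> Z.
# xy_model and yz_model stand in for any trained pairwise systems;
# each is assumed to return a k-best list of (candidate, probability).

def bridge_transliterate(name, xy_model, yz_model, k=5):
    """Compose two k-best transliteration systems through a bridge language."""
    scores = {}
    for y_cand, p_xy in xy_model(name, k):          # X -> Y candidates
        for z_cand, p_yz in yz_model(y_cand, k):    # Y -> Z candidates
            # Sum path probabilities that lead to the same target string.
            scores[z_cand] = scores.get(z_cand, 0.0) + p_xy * p_yz
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

# Toy usage with stand-in models:
xy = lambda s, k: [("yo-n-so-n", 0.6), ("jo-n-so-n", 0.4)]
yz = lambda s, k: [(s.replace("-", ""), 0.9)]
print(bridge_transliterate("Johnson", xy, yz))
```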

Results for the Bridge System
We must remember that machine transliteration is a lossy conversion
In the bridge system we can therefore expect additional information loss, and thus a drop in the accuracy score
The results show a drop in accuracy of about 8-9% (ACC-1) and about 1-3% (mean F-score) relative to direct transliteration
The NEWS 2009 dataset was used for training and evaluating the results
Ref: Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. NAACL 2010

Stepping through an intermediate language. Ref: Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. NAACL 2010

What is Entropy?
Entropy is the average amount of information obtained per message received
It characterizes our uncertainty about the source of information (its randomness)
It is the expected value of the information content of a random variable
Based on Shannon's A Mathematical Theory of Communication

Properties and Mathematical Formulation. Based on Shannon's A Mathematical Theory of Communication
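The slide body is lost in the transcript. From the cited paper, Shannon's three requirements for an uncertainty measure H(p_1, ..., p_n), presumably what this slide listed, are:
1. H should be continuous in the p_i
2. If all the p_i are equal (p_i = 1/n), H should be a monotonically increasing function of n: more equally likely choices mean more uncertainty
3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H (see the "Explanation of property 3" extra slide)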

The Formula for Entropy. Based on Shannon's A Mathematical Theory of Communication
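The formula itself did not survive the transcript; the standard definition from the cited paper, for a source emitting symbol i with probability p_i, is

$$H = -\sum_{i=1}^{n} p_i \log_2 p_i \quad \text{(bits per symbol)}$$

H is maximized, at log_2 n, when all symbols are equally likely, and is zero when one symbol is certain.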

Motivation for Entropy in English Aoccdrnig to rseearchat at Elingsh uinervtisy, it deosn't mttaer in waht odrer the ltteers ina wrod are, the olny iprmoatnt tihng is taht the frist and lsat ltteer is at the rghit pclae. The rset canbe a toatl mses and youcansitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by it slef but the wrod as a wlohe.

Entropy of Language
If the language is translated into binary digits (0 or 1) in the most efficient way, the entropy H is the average number of binary digits required per letter of the original language
Redundancy measures the amount of constraint imposed on a text in the language by its statistical structure
E.g. in English: the high frequency of the letter E, the strong tendency of H to follow T, or of U to follow Q
Based on Shannon's Prediction and Entropy of Printed English

Entropy calculation from the statistics of English. Based on Shannon's Prediction and Entropy of Printed English
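The worked calculation on this slide is lost. Here is a minimal sketch of how the zeroth- and first-order estimates and the redundancy would be computed; the frequency values are approximate and the table is truncated for brevity, so the printed F_1 is a toy value (the full 26-letter table gives Shannon's F_1 of about 4.14 bits):

```python
import math

# Approximate English letter frequencies (truncated; a full table has
# all 26 letters). Values are illustrative, from standard tables.
FREQ = {"e": 0.127, "t": 0.091, "a": 0.082, "o": 0.075, "i": 0.070,
        "n": 0.067, "s": 0.063, "h": 0.061, "r": 0.060}
FREQ["rest"] = 1.0 - sum(FREQ.values())  # lump the remaining letters together

def entropy_bits(probs):
    """Shannon entropy in bits: H = -sum(p * log2 p)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

F0 = math.log2(26)                # ~4.70 bits: all 26 letters equally likely
F1 = entropy_bits(FREQ.values())  # ~3.1 bits for this truncated toy table
redundancy = 1 - F1 / F0          # fraction of the alphabet's capacity unused
print(F0, F1, redundancy)
```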

Entropy of English. Based on Shannon's Prediction and Entropy of Printed English
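The equation shown on this slide is lost; from the cited paper, the N-gram entropy it presumably presented is

$$F_N = -\sum_{i,j} p(b_i, j) \log_2 p_{b_i}(j)$$

where b_i ranges over all blocks of N-1 letters, j over the following letter, p(b_i, j) is the probability of the block followed by that letter, and p_{b_i}(j) the conditional probability of j given the block. F_N measures the information per letter once N-1 preceding letters are known, and the entropy of the language is the limit of F_N as N grows.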

Interpretation of the equation. Based on Shannon's Prediction and Entropy of Printed English

Calculations of the F_N. Based on Shannon's Prediction and Entropy of Printed English

Letter Frequencies in English. Source: Wikipedia's article on letter frequency in English

Calculation of higher F_N
Similar calculations for F_3 give the value 3.3 bits
Tables of N-gram frequencies are not available for N > 3, so F_4, F_5 and F_6 cannot be calculated the same way
Word frequencies are used to assist in such situations
Let us look at the log-log plot of word probability against frequency rank
Based on Shannon's Prediction and Entropy of Printed English
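A minimal sketch of the word-frequency shortcut, assuming Zipf's law p_n = 0.1/n with Shannon's cutoff of 8,727 words (the rank at which the probabilities sum to roughly one); dividing the word entropy by the average word length of about 4.5 letters gives the per-letter estimate Shannon used for large N:

```python
import math

# Zipf's law for word probabilities: the n-th most frequent word
# has probability ~0.1/n, truncated where the total mass reaches ~1.
N = 8727
probs = [0.1 / n for n in range(1, N + 1)]
print(sum(probs))  # ~1.0, which is why the distribution is cut at rank 8727

H_word = -sum(p * math.log2(p) for p in probs)  # word entropy, bits per word
H_letter = H_word / 4.5                         # per-letter estimate (4.5 letters/word)
print(H_word, H_letter)
```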

Word Frequencies. Based on Shannon's Prediction and Entropy of Printed English

Inferences

Entropy for various world languages
From the data we can infer that English has the lowest entropy and Finnish the highest
But all the languages have comparable entropy when we take Shannon's experiment into consideration
Languages: Finnish (fi), German (de), Swedish (sv), Dutch (nl), English (en), Italian (it), French (fr), Spanish (es), Portuguese (pt) and Greek (el)
Based on Word-length entropies and correlations of natural language written texts, 2014

Zipf-like plots for various world languages. Based on Word-length entropies and correlations of natural language written texts, 2014

Entropy of the Telugu Language
Indian languages are highly phonetic, which makes computing their entropy directly a rather difficult task
Entropy for Telugu has therefore been calculated by romanizing the text and applying Shannon's experiment
The entropy is calculated in two ways:
1. Converting into English and then treating the result as English letters
2. Converting into English and then treating the result as Telugu letters
Based on Entropy of Telugu, Venkata Ravinder Paruchuri, 2011

Telugu Language Entropy. Based on Entropy of Telugu, Venkata Ravinder Paruchuri, 2011

Inferences
The entropy of Telugu is higher than that of English, which means that Telugu is more succinct than English: each syllable in Telugu (as in other Indian languages) carries more information than in English

Indus Script
Very little is known about this script from ancient times
No firm conclusion has been reached about whether it is a linguistic script or not
But from the adjacent diagram we can see that the Indus script lies close to most of the world's languages
We can thus infer that the Indus script behaves like a linguistic script, although we have no solid proof of it
Based on Entropy, the Indus Script and Language: A Reply to R. Sproat

Transliterability and Transliteration Performance. Ref: Compositional Machine Transliteration (2010), A. Kumaran, Mitesh M. Khapra, Pushpak Bhattacharyya

Transliterability Measure
A measure of the ease of transliteration between languages should have the following desirable qualities:
1. Rely purely on orthographic features of the languages, so that it can be calculated easily from parallel names corpora
2. Capture and weigh the inherent ambiguity in transliteration at the character level (i.e. the average number of character mappings)
3. Weigh the ambiguous transitions for a given character according to the transition frequencies, since highly ambiguous mappings may occur only rarely
The transliterability measure Weighted Average Entropy (WAVE) does this work
Ref: Compositional Machine Transliteration (2010), A. Kumaran, Mitesh M. Khapra, Pushpak Bhattacharyya

WAVE. Ref: Compositional Machine Transliteration (2010), A. Kumaran, Mitesh M. Khapra, Pushpak Bhattacharyya
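The formula did not survive the transcript. Reconstructed from the cited paper (treat the exact notation as approximate), WAVE is a frequency-weighted average of the mapping entropies of the source-language n-grams:

$$\mathrm{WAVE} = \sum_{i} \frac{\mathrm{frequency}(i)}{\sum_{j} \mathrm{frequency}(j)} \times \mathrm{Entropy}(i), \qquad \mathrm{Entropy}(i) = -\sum_{t} P(t \mid i) \log P(t \mid i)$$

where i ranges over source n-grams and t over the target-language sequences that i maps to in the parallel names corpus; WAVE_1, WAVE_2 and WAVE_3 use unigrams, bigrams and trigrams respectively.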

Motivation
From the adjacent table we can see that the unigram 'a' occurs nearly 150 times more frequently than the unigram 'x'
This implies that capturing the ambiguities of 'a' will be more beneficial than capturing those of 'x'; the term frequency(i) captures this effect
Table IV shows the mappings from the source language to the target language
We can observe that the unigram c maps to 2 characters, स and क, whereas p has only one mapping, प
The term Entropy(i) captures this information and ensures that c is weighted more than p
Ref: Compositional Machine Transliteration (2010), A. Kumaran, Mitesh M. Khapra, Pushpak Bhattacharyya
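A minimal sketch of computing WAVE from character alignments; the input format below (a list of source-to-target character pairs extracted from an aligned parallel names corpus) is an assumption for illustration, not the paper's code:

```python
import math
from collections import Counter, defaultdict

def wave(alignments):
    """Weighted Average Entropy over source characters.

    alignments: assumed list of (source_char, target_char) pairs
    from a character-aligned parallel names corpus.
    """
    freq = Counter(src for src, _ in alignments)
    mappings = defaultdict(Counter)
    for src, tgt in alignments:
        mappings[src][tgt] += 1

    total = sum(freq.values())
    value = 0.0
    for src, f in freq.items():
        counts = mappings[src]
        n = sum(counts.values())
        # Entropy of the target-character distribution for this source char.
        h = -sum((c / n) * math.log2(c / n) for c in counts.values())
        value += (f / total) * h
    return value

# Toy example: 'c' maps ambiguously (to स or क), 'p' deterministically (to प).
pairs = [("c", "स"), ("c", "क"), ("c", "क"), ("p", "प"), ("p", "प")]
print(wave(pairs))  # only the ambiguous 'c' contributes, weighted by frequency
```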

Plot between WAVE and Transliteration Quality
The plots are drawn between log(WAVE) and the accuracy measure (for approximately 15k training corpora) for the language pairs En-Hi, En-Ka, En-Ma, Hi-En, Ka-En, Hi-Ka and Ma-Ka
We can see that as the value of WAVE decreases, the accuracy increases sharply
The top-left two points in each of the plots belong to Hindi and Marathi, languages that share the same orthography and have largely one-to-one character mappings between them
We can observe that different n-grams give almost similar results, which means we can choose the unigram model to generalize
Based on these observations we can call two languages with a small WAVE_1 measure more easily transliterable

Conclusions
In this presentation we introduced the concept of Machine Transliteration
We looked over the classical methods of transliteration, from phoneme-based models to combined CRF models
We introduced the concept of entropy and its usefulness in transliteration models
We studied how phonology and syllabification help in creating chunks that are useful for transliteration
We introduced the WAVE measure, which captures the ease of transliteration between a pair of languages

Future Work
As future work, I would implement transliteration between language pairs based on the entropy measures between them
I would also look for a WAVE-like measure that can estimate transliterability without performing actual transliteration

References
Report of NEWS 2012 Machine Transliteration Shared Task (2012). Min Zhang, Haizhou Li, A Kumaran and Ming Liu. ACL 2012
A Comparison of Different Machine Transliteration Models (2006). Jong-Hoon Oh, Key-Sun Choi and Hitoshi Isahara
Machine Transliteration (1997). Kevin Knight and Jonathan Graehl. Computational Linguistics
Improving back-transliteration by combining information sources (2004). Bilac, S. and Tanaka, H. In Proceedings of IJCNLP 2004, pp. 542-547
An English-Korean transliteration model using pronunciation and contextual rules (2002). Oh, J. H. and Choi, K. S. In Proceedings of COLING 2002, pp. 758-764
Everybody loves a rich cousin: An empirical study of transliteration through bridge languages (2010). Mitesh M. Khapra, A Kumaran and Pushpak Bhattacharyya. NAACL 2010
Forward-Backward Machine Transliteration between English and Chinese based on Combined CRFs (2011). Ying Qin and Guohua Chen
Hindi to English machine transliteration model of named entities using CRFs (2012). Manikrao, Shantanu and Tushar. International Journal of Computer Applications, June 2012

References (2)
Linguistics: An Introduction to Language and Communication (5th Edition). Adrian Akmajian, Richard A. Demers, Ann K. Farmer and Robert M. Harnish. MIT Press
A Mathematical Theory of Communication (1948). C. E. Shannon. The Bell System Technical Journal, July 1948
Prediction and Entropy of Printed English (1951). C. E. Shannon. The Bell System Technical Journal, January 1951
Word-length entropies and correlations of natural language written texts (2014). Maria Kalimeri, Vassilios Constantoudis and Constantinos Papadimitriou. arXiv, 2014
Entropy of Telugu (2011). Venkata Ravinder Paruchuri
Entropy, the Indus Script and Language: A Reply to R. Sproat. Rajesh P. N. Rao, Nisha Yadav, Mayank Vahia and Hrishikesh. Computational Linguistics 36(4)
Compositional Machine Transliteration (2010). A. Kumaran, Mitesh M. Khapra and Pushpak Bhattacharyya. ACM Transactions on Asian Language Information Processing (TALIP), September 2010
Wikipedia articles

Extra Slides

Explanation of property 3
If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H. In Shannon's example, a three-way choice with probabilities 1/2, 1/3, 1/6 can be made as a first choice between two equally likely options, followed (half of the time) by a second choice with probabilities 2/3, 1/3, so that
H(1/2, 1/3, 1/6) = H(1/2, 1/2) + (1/2) H(2/3, 1/3)
Based on Shannon's A Mathematical Theory of Communication