Entropy in Machine Transliteration & Phonology Bhargava Reddy 110050078 B.Tech Project.

Presentation transcript:

Entropy in Machine Transliteration & Phonology Bhargava Reddy B.Tech Project

Contents: Entropy (Information Theory); Mathematical Formulation; Cross Entropy; Transliterability and Transliteration Performance; WAVE; Phonology; Syllables; Some Syllabification Rules

What is Entropy? Entropy is the average amount of information obtained in each message received. It characterizes our uncertainty about the source of information (its randomness). Formally, it is the expected value of the information content of a random variable. Based on Shannon's A Mathematical Theory of Communication

Properties and Mathematical Formulation Based on Shannon's: A Mathematical Theory of Communication

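Explanation of property 3 [Figure: a choice among three outcomes with probabilities 1/2, 1/3 and 1/6, decomposed into two successive choices: first a choice between two branches of probability 1/2 each, then, within one branch, a choice with probabilities 2/3 and 1/3.] Based on Shannon's: A Mathematical Theory of Communication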
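The decomposition in property 3 can be checked numerically. The sketch below assumes the standard Shannon entropy in bits; the helper name `entropy` is illustrative and not taken from the slides. It compares the entropy of the direct three-way choice with the entropy of the staged choice, where the second stage is weighted by the probability 1/2 of reaching it.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Direct choice among three outcomes.
direct = entropy([1/2, 1/3, 1/6])

# Same choice in two stages: a fair split first, then, half of the time,
# a second choice with probabilities 2/3 and 1/3.
staged = entropy([1/2, 1/2]) + 0.5 * entropy([2/3, 1/3])

print(round(direct, 4), round(staged, 4))
```

Both values agree up to floating-point rounding, which is what property 3 requires.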

The Formula for Entropy Based on Shannon's: A Mathematical Theory of Communication
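For reference, Shannon's entropy of a discrete distribution p_1, ..., p_n is H = -Σ p_i log p_i. A minimal sketch in Python, assuming base-2 logarithms (bits) and a hypothetical function name `shannon_entropy`:

```python
import math

def shannon_entropy(probs):
    """Return H = -sum(p * log2(p)) in bits; zero-probability terms contribute 0."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = shannon_entropy([0.5, 0.5])      # two equally likely outcomes
uniform_8 = shannon_entropy([1 / 8] * 8)     # eight equally likely outcomes
certain = shannon_entropy([1.0])             # no uncertainty at all

print(fair_coin, uniform_8, certain)
```

The three calls illustrate two familiar properties: entropy is zero when the outcome is certain, and for equally likely outcomes it grows with the number of outcomes.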

Properties Based on Shannon's: A Mathematical Theory of Communication

The Notion of Cross Entropy
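As a reminder of the standard definition, the cross entropy of a model distribution q with respect to a true distribution p is H(p, q) = -Σ p(x) log q(x). It is never smaller than the entropy H(p), with equality when q equals p. The sketch below assumes base-2 logarithms; the distributions are made-up toy values.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log2(q(x)), in bits.

    Raises ValueError if q assigns probability 0 to an outcome that p allows.
    """
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

def entropy(p):
    """Entropy is the cross entropy of a distribution with itself."""
    return cross_entropy(p, p)

true_dist = [0.5, 0.25, 0.25]   # toy "true" distribution
model = [1 / 3, 1 / 3, 1 / 3]   # toy model that ignores the skew

print(entropy(true_dist), cross_entropy(true_dist, model))
```

The gap between the two printed values is the cost, in bits per symbol, of using the mismatched model.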

Transliterability and Transliteration Performance Ref: Compositional Machine Transliteration (2010), A. Kumaran, Mitesh M. Khapra, Pushpak Bhattacharyya

Transliterability Measure
A measure of the ease of transliteration between languages should have the following desirable qualities:
1. Rely purely on orthographic features of the languages (so that it is easily calculated from a parallel names corpus).
2. Capture and weigh the inherent ambiguity in transliteration at the character level (i.e., the average number of character mappings).
3. Weigh the ambiguous transitions for a given character according to the transition frequencies, since highly ambiguous mappings may occur only rarely.
The transliterability measure Weighted AVerage Entropy (WAVE) is designed to meet these requirements.
Ref: Compositional Machine Transliteration (2010), A. Kumaran, Mitesh M. Khapra, Pushpak Bhattacharyya

WAVE Ref: Compositional Machine Transliteration (2010), A. Kumaran, Mitesh M. Khapra, Pushpak Bhattacharyya

Motivation: From the adjacent table we can conclude that the unigram ‘a’ occurs nearly 150 times more often than the unigram ‘x’. This implies that capturing the ambiguities of ‘a’ is more beneficial than capturing those of ‘x’. The term ‘frequency(i)’ captures this effect. Table IV shows the mappings from the source to the target language. We can observe that the unigram c maps to 2 characters, स and क, whereas p maps to only one, प. The term ‘Entropy(i)’ captures this information and ensures that c is weighted more than p. Ref: Compositional Machine Transliteration (2010), A. Kumaran, Mitesh M. Khapra, Pushpak Bhattacharyya
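The interplay of the two terms can be sketched in code. This is an illustration only: it assumes that WAVE is a frequency-weighted average of per-unit mapping entropies, which is what the terms ‘frequency(i)’ and ‘Entropy(i)’ suggest, and it is not claimed to reproduce the exact formula of the paper. The function name `wave` and the alignment counts are made up.

```python
import math
from collections import Counter, defaultdict

def wave(aligned_pairs):
    """Frequency-weighted average of per-unit mapping entropy, in bits.

    aligned_pairs: iterable of (source_unit, target_unit) alignments.
    """
    mappings = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        mappings[src][tgt] += 1

    total = sum(sum(targets.values()) for targets in mappings.values())
    score = 0.0
    for src, targets in mappings.items():
        freq = sum(targets.values())  # plays the role of frequency(i)
        # Entropy(i): uncertainty over the target units that src maps to.
        ent = -sum((n / freq) * math.log2(n / freq) for n in targets.values())
        score += (freq / total) * ent
    return score

# Toy, made-up alignments: 'c' maps to two targets, 'p' to only one.
pairs = [("c", "क")] * 3 + [("c", "स")] * 1 + [("p", "प")] * 4
print(wave(pairs))
```

In this toy corpus only ‘c’ contributes to the score, because ‘p’ always maps to the same target and therefore has zero mapping entropy.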

Plot Between WAVE and Transliteration Quality: The plots show log(WAVE) against the accuracy measure (for a training corpus of approximately 15k names) for the language pairs En-Hi, En-Ka, En-Ma, Hi-En, Ka-En, Hi-Ka and Ma-Ka. We can see that as the value of WAVE increases, the accuracy decreases exponentially. The two top-left points in each plot are for Hindi and Marathi, languages that share the same orthography and have a large number of one-to-one character mappings between them. We can observe that the different n-gram models give almost the same results, which means we can use the unigram model as the general measure. Based on these observations, we can term two languages with a small WAVE_1 (unigram) measure as more easily transliterable.

Phonology. Phonetics: Concerned with how speech sounds are produced in the vocal tract, as well as with the physical properties of the speech sound waves generated by the larynx and vocal tract. Phonology: Refers to the abstract principles that govern the distribution of sounds in a language. It is the subfield of linguistics that studies the structure and systematic patterning of sounds in human language. Linguistics, An Introduction to Language and Communication, Adrian Akmajian

Views of Phonology: Phonology broadly has 2 views: (1) the description of the sounds of a particular language and the rules governing the distribution of these sounds, e.g. the phonology of English, German or another language; (2) the part of the general theory of human language that is concerned with the universal properties of natural language sound systems. English has 44 phonemes (20 vowel sounds and 24 consonant sounds). These phonemes can be generalized so that they can be adapted to many languages. Linguistics, An Introduction to Language and Communication, Adrian Akmajian

English Language Phonemes Generalized

Syllables: A syllable is a unit of organization for a sequence of speech sounds. Syllables are often considered the phonological “building blocks” of words. Syllabic writing began several hundred years before the first letters. A word that consists of a single syllable is called a monosyllable. Similar terms include disyllable for a word of 2 syllables, trisyllable for a word of 3 syllables, and polysyllable, which may refer to a word of more than 3 syllables. Linguistics, An Introduction to Language and Communication, Adrian Akmajian

Syllable: A syllable has the following structure: [Diagram: Syllable (σ) branching into Onset (O), Nucleus (N) and Coda (C).] Across the world’s languages the most common type of syllable has the structure CV(C), that is, a single consonant C followed by a single vowel V, followed in turn (optionally) by a single consonant.

Syllable Grouping: Consider the word napkin, which can be split as “nap-kin”. [Diagram: napkin as two syllables. σ1: Onset = n, Nucleus = æ, Coda = p. σ2: Onset = k, Nucleus = i, Coda = n.] Linguistics, An Introduction to Language and Communication, Adrian Akmajian
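The onset-nucleus-coda grouping can be sketched as a small function. This is a simplification that treats each character as one sound and relies on a tiny, hypothetical vowel set (`VOWELS`); real phonological analysis works on phonemes, not spelling.

```python
# Small illustrative vowel inventory (an assumption, not a full phoneme set).
VOWELS = set("aeiouæ")

def split_syllable(syllable):
    """Split one syllable into (onset, nucleus, coda).

    The nucleus is the first run of vowels; everything before it is the
    onset and everything after it is the coda. Raises StopIteration if
    the syllable contains no vowel from VOWELS.
    """
    first = next(i for i, ch in enumerate(syllable) if ch in VOWELS)
    last = first
    while last + 1 < len(syllable) and syllable[last + 1] in VOWELS:
        last += 1
    return syllable[:first], syllable[first:last + 1], syllable[last + 1:]

for syl in ["næp", "kin"]:
    print(syl, split_syllable(syl))
```

Applied to the two syllables of napkin, the function recovers the onset, nucleus and coda of each syllable.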

Some Syllabification Rules. Aspiration Rule: Phonemes with the features [-continuant, -voiced] are aspirated in syllable-initial position. /p/ is a [-continuant, -voiced] phoneme. If the intervocalic consonant p in the sequence apa is the onset of the second syllable, it will be aspirated. If it is the coda of the first syllable, it will not be aspirated. As you pronounce the sequence apa, place your hand in front of your mouth. You will feel a small puff of air that accompanies the release of the p, regardless of whether you stress the first a or the second. The presence of aspiration is the evidence we need to conclude that the word apartment is syllabified as “a-part-ment”.
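The aspiration rule can be expressed as a simple check on a given syllabification. The sketch below is a rough approximation: it treats the letters p, t and k as the [-continuant, -voiced] phonemes and works on spelling rather than on a phonemic transcription. The names `VOICELESS_STOPS` and `aspirated_positions` are hypothetical.

```python
# Letters standing in for the [-continuant, -voiced] phonemes (an assumption).
VOICELESS_STOPS = {"p", "t", "k"}

def aspirated_positions(syllables):
    """Return (syllable_index, segment) for each voiceless stop that is
    syllable-initial and therefore aspirated under the rule."""
    return [
        (i, syl[0])
        for i, syl in enumerate(syllables)
        if syl and syl[0] in VOICELESS_STOPS
    ]

print(aspirated_positions(["a", "part", "ment"]))   # p is an onset
print(aspirated_positions(["ap", "art", "ment"]))   # p is a coda
```

Only the first syllabification predicts an aspirated p, which matches the pronunciation test described above.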

Maximal Onset Principle. The Principle: The sequence of consonants that combines to form an onset with the vowel on its right is the maximal sequence that is available at the beginning of a syllable anywhere in the language. Illustration: Consider the word “constructs”, which is bisyllabic. Between the 2 vowels is the sequence n-s-t-r, which has to be split. Since the maximal sequence that occurs at the beginning of a syllable in English is str-, we need to split it as “n-str”. Therefore the word is syllabified as “con-structs”. Why not another split: Assume it is “ns-tr”. Then the t would appear in syllable-initial position and should be aspirated, which is not the case here. Other splits can be ruled out similarly.
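The principle amounts to a longest-match search: give the following syllable the longest suffix of the consonant cluster that is a legal onset, and leave the rest as the coda of the preceding syllable. The sketch below uses a deliberately tiny, hypothetical set of legal English onsets (`LEGAL_ONSETS`), just large enough for the example; it is not a real onset inventory.

```python
# Hypothetical, incomplete set of legal English onsets, for illustration only.
LEGAL_ONSETS = {"", "s", "t", "r", "n", "st", "tr", "str"}

def split_cluster(cluster, legal_onsets=LEGAL_ONSETS):
    """Split an intervocalic consonant cluster into (coda, onset),
    choosing the longest suffix of the cluster that is a legal onset."""
    for i in range(len(cluster) + 1):
        if cluster[i:] in legal_onsets:
            return cluster[:i], cluster[i:]
    return cluster, ""  # no legal onset found: the whole cluster is a coda

coda, onset = split_cluster("nstr")
print("co" + coda + "-" + onset + "ucts")
```

Because the search starts from the longest candidate, “str” is chosen before “tr” or “r”, which yields the split “n-str”.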

References
A Mathematical Theory of Communication (1948), C. E. Shannon, The Bell System Technical Journal, July 1948
Compositional Machine Transliteration (2010), A. Kumaran, Mitesh M. Khapra, Pushpak Bhattacharyya
Linguistics, An Introduction to Language and Communication, Adrian Akmajian, Richard A. Demers, Ann K. Farmer, Robert M. Harnish
Wiki articles on entropy, phonology and transliteration