Machine Transliteration - Bhargava Reddy (Knowledge sharing)



What is Machine Transliteration?  It is the conversion of text from one script to another  Not every word in a given language has an equivalent in another language  Such words are called Out-of-Vocabulary (OOV) words  Machine transliteration is a useful tool in machine translation for handling OOV words Tirupati  తిరుపతి

Machine Transliteration Models  Four machine transliteration models have been proposed so far: 1. Grapheme-based transliteration model (ψG) 2. Phoneme-based transliteration model (ψP) 3. Hybrid transliteration model (ψH) 4. Correspondence-based transliteration model (ψC)

Grapheme and Phoneme  Phoneme: the smallest contrastive linguistic unit that can bring about a change of meaning. 'Kiss' and 'kill' are two completely contrasting words; the phonemes /s/ and /l/ make the difference.  Grapheme: the smallest semantically distinguishing unit in a written language, analogous to the phoneme in spoken language. A grapheme may or may not carry meaning by itself, and may or may not correspond to a single phoneme.

Grapheme-Based Transliteration Model (ψG)  The machine directly converts source-language graphemes to target-language graphemes  This approach requires no phonetic knowledge of the source and target languages  Four methods implement this model: 1. Source-channel model 2. Decision-tree model 3. Transliteration network 4. Joint source-channel model

Source-Channel Model  The English word is first segmented into chunks of English graphemes  Next, all possible target-language chunks corresponding to each English chunk are produced  Finally, the most probable sequence of target-language graphemes is identified  Advantage: it considers chunks of graphemes that represent a phonetic property of the source-language word  Disadvantage: errors in the first step propagate to the subsequent steps, making it difficult to produce the correct transliteration  Time complexity is also a major issue, since generating and scoring all candidate chunk sequences is expensive
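The scoring step above can be sketched as follows. This is a minimal toy illustration, not the actual trained model: the chunk inventory, probabilities, and candidate lists are all made up for the Tirupati example; a real system would learn them from aligned data.

```python
import math

# Toy source-channel scoring: pick the target chunk sequence T that
# maximizes P(T) * P(S | T). All numbers below are illustrative.

# P(target chunk): a stand-in for a target-language model
p_target = {"తి": 0.4, "రు": 0.3, "ప": 0.3}

# P(source chunk | target chunk): the "channel" generating English spellings
p_channel = {
    ("తి", "ti"): 0.9, ("తి", "thi"): 0.1,
    ("రు", "ru"): 0.8, ("రు", "roo"): 0.2,
    ("ప", "pa"): 1.0,
}

def score(source_chunks, target_chunks):
    """Log P(T) + log P(S | T) for aligned chunk sequences."""
    total = 0.0
    for s, t in zip(source_chunks, target_chunks):
        total += math.log(p_target[t]) + math.log(p_channel.get((t, s), 1e-9))
    return total

source = ["ti", "ru", "pa", "ti"]
candidates = [["తి", "రు", "ప", "తి"], ["తి", "రు", "తి", "ప"]]
best = max(candidates, key=lambda t: score(source, t))
print("".join(best))  # తిరుపతి
```

Note how the unseen chunk pairs in the second candidate receive a tiny floor probability (1e-9), so any candidate containing them scores far below the correct one.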

Decision-Tree Model  Decision trees that transform each source grapheme into target graphemes are learned and then directly applied to machine transliteration  Advantage: considers a wide range of contextual information, e.g. the three graphemes to the left and the three to the right  Disadvantage: unlike the source-channel model, it does not consider phonetic aspects
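The contextual information such a tree is trained on can be sketched like this: for each source grapheme, take the three graphemes to its left and right, padding word edges. The feature names and padding symbol are illustrative choices, not part of the original method's specification.

```python
# Build the left-3 / right-3 context features a decision-tree
# transliteration model would branch on ('#' pads the word edges).

def context_features(word, window=3):
    padded = "#" * window + word + "#" * window
    feats = []
    for i, ch in enumerate(word):
        j = i + window  # position of ch inside the padded string
        feats.append({
            "grapheme": ch,
            "left": padded[j - window:j],
            "right": padded[j + 1:j + 1 + window],
        })
    return feats

feats = context_features("board")
print(feats[0])   # features for 'b': left '###', right 'oar'
```

Each feature dict would become one training instance, labelled with the target grapheme the source grapheme aligns to.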

Transliteration Network  The network consists of nodes and arcs  A node represents a chunk of source graphemes and its corresponding target graphemes  An arc represents a possible link between nodes and carries a weight showing its strength  The method considers phonetic aspects in the formation of chunks  Segmenting the word into chunks and identifying the most probable target sequence are done in one step  This means errors are not propagated from one step to the next
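A best-path search over such a network can be sketched as below. The chunk inventory and weights are toy values invented for the Tirupati example; the point is that segmentation and target selection happen in the same search, so a bad early segmentation simply loses to a better-scoring path instead of being locked in.

```python
import math

# Toy transliteration lattice: arcs leaving each source position are
# (source chunk, target chunk, weight) triples. All values illustrative.

word = "tirupati"
arcs = {
    0: [("ti", "తి", 0.9), ("t", "ట్", 0.4)],  # competing segmentations
    2: [("ru", "రు", 0.9)],
    4: [("pa", "ప", 0.8)],
    6: [("ti", "తి", 0.9)],
}

def best_path(pos):
    """Return (log score, target string) for the best path from pos to the end."""
    if pos == len(word):
        return 0.0, ""
    best = (float("-inf"), "")
    for chunk, target, weight in arcs.get(pos, []):
        if word.startswith(chunk, pos):
            tail_score, tail = best_path(pos + len(chunk))
            cand_score = math.log(weight) + tail_score
            if cand_score > best[0]:
                best = (cand_score, target + tail)
    return best

score, result = best_path(0)
print(result)  # తిరుపతి
```

The dead-end segmentation starting with the single chunk "t" scores negative infinity (no arcs continue from position 1), so the search discards it automatically.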

Phoneme-based Transliteration Model  This model performs source grapheme to source phoneme and source phoneme to target grapheme transformations, using pronunciation as a pivot  It was first proposed by Knight and Graehl in 1997  They used Weighted Finite-State Transducers (WFSTs)  They modelled English-Japanese and Japanese-English transliteration  Similar methods have since appeared for Arabic-English and English-Chinese transliteration
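The two-stage pivot can be sketched with plain dictionaries standing in for the WFSTs Knight and Graehl trained. Both mapping tables below are toy, hand-made assumptions (the phoneme labels loosely follow ARPAbet), kept deterministic for clarity; the real transducers are weighted and ambiguous.

```python
# Phoneme-based pipeline: source graphemes -> phonemes -> target graphemes.
# Both tables are illustrative stand-ins for trained WFSTs.

g2p = {"ti": "T IY", "ru": "R UW", "pa": "P AA"}   # source grapheme -> phoneme
p2t = {"T IY": "తి", "R UW": "రు", "P AA": "ప"}    # phoneme -> target grapheme

def transliterate(chunks):
    phonemes = [g2p[c] for c in chunks]         # stage 1: pronounce the source
    return "".join(p2t[p] for p in phonemes)    # stage 2: spell in target script

print(transliterate(["ti", "ru", "pa", "ti"]))  # తిరుపతి
```

Composing the two stages is what makes the model language-pair-flexible: only the grapheme-phoneme tables change per language, while the pivot representation stays the same.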

Knight and Graehl's Work  In this method the main transliteration key is pronunciation (the source phoneme) rather than spelling (the source grapheme)  Katakana words are words imported into Japanese from other languages (primarily English)  Katakana raises many issues where pronunciation is concerned  In Japanese, the sounds L and R are not distinguished  The same holds for H and F

Katakana Words  Golf bag is pronounced go-ru-hu-ba-ggu ---- ゴルフバッグ  Johnson is pronounced jyo-n-s-o-n --- ジョンソン  Ice cream is pronounced a-i-su-ku-ri-i-mu アイスクリーム  What do we observe in these transliterations?  A lot of information is lost in the conversion from English to Japanese  So back-transliteration can run into trouble

Trouble in Back-Transliteration  Several spellings of the word 'switch' are acceptable under Japanese writing rules  But when converting from Japanese back to English we must be strict: no output other than 'switch' is acceptable  Back-transliteration is harder than romanization: romanizing アンジェラ (Angela) gives 'anjera', which is nowhere near an acceptable English spelling  Words are often compressed: 'word processing' is transliterated as 'waapuro', which is not at all easy to back-transliterate
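The sound collapses above mean one romanized reading maps back to many candidate English spellings, which is exactly why back-transliteration needs a ranking model. A minimal sketch of the candidate explosion, using a toy collapse table and a hypothetical reading "rirei" (roughly how "relay" would be romanized):

```python
from itertools import product

# Japanese collapses l/r and h/f, so one romanized reading could have
# come from many English spellings. Toy ambiguity table, illustrative only.
collapsed = {"r": ["r", "l"], "h": ["h", "f"]}

def back_candidates(romaji):
    """Enumerate every spelling the romanized reading could have come from."""
    choices = [collapsed.get(ch, [ch]) for ch in romaji]
    return sorted("".join(c) for c in product(*choices))

print(back_candidates("rirei"))   # ['lilei', 'lirei', 'rilei', 'rirei']
```

With two ambiguous positions the reading already yields four candidates; Knight and Graehl's word-probability model is what picks the plausible English word out of this set.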

The steps to convert from English to Katakana

Fixing Back-Transliteration

Algorithms for extracting the best transliteration

Example for Back Transliteration

BTP Work  I will be working under PhD student Arjun Atre for this project  We will try to develop machine transliteration tools for Indian languages  I will try to develop a bridging language that can be used to transliterate text from one Indian language to another  This would contribute to the NLP community and be a step toward handling OOV words, which are common in our native languages

THANK YOU

References  A Comparison of Different Machine Transliteration Models (2006), Jong-Hoon Oh, Key-Sun Choi, Hitoshi Isahara  Machine Transliteration (1997), Kevin Knight and Jonathan Graehl (phoneme-based transliteration model)