Language Model for Cyrillic Mongolian to Traditional Mongolian Conversion Feilong Bao, Guanglai Gao, Xueliang Yan, Hongwei Wang 2015-4-21.



Outline
Introduction
Comparison
LM based Conversion Approach
Experiment
Conclusions

Introduction Traditional Mongolian and Cyrillic Mongolian are both Mongolian languages, used in China and Mongolia respectively. Although their oral pronunciations are similar, their written forms are completely different. A large part of Cyrillic Mongolian words have more than one corresponding word in Traditional Mongolian.

Comparison-1 Traditional Mongolian is composed of 35 characters, of which 8 are vowels and 27 are consonants. Cyrillic Mongolian also has 35 characters, but 13 of them are vowels and 20 are consonants. In addition, it includes a hard sign and a soft sign.

Comparison-1

Comparison-2 Cyrillic Mongolian is a case-sensitive language, while Traditional Mongolian is not. In Cyrillic Mongolian, the usage of case is similar to English. Traditional Mongolian, although not sensitive to case, writes each character in a different form according to its position in the word (initial, medial, or final, i.e., top, middle, or bottom in the vertical script).

Comparison-3 The writing direction differs between Cyrillic Mongolian and Traditional Mongolian. In Cyrillic Mongolian, words are written from left to right and lines proceed from top to bottom. In Traditional Mongolian, words are written from top to bottom and lines proceed from left to right.

Comparison-4 The degree of correspondence between written form and oral pronunciation differs between Cyrillic Mongolian and Traditional Mongolian. Cyrillic Mongolian is well unified: its written form corresponds consistently to its pronunciation. Traditional Mongolian, however, is not a one-to-one mapping: a vowel or consonant may be dropped, added, or transformed when the written form is converted to the pronunciation.

Comparison-5 In some cases, a Cyrillic Mongolian word corresponds to more than one Traditional Mongolian word, as shown in Fig. 1, where three different Traditional Mongolian words all correspond to the Cyrillic word "асар".

LM based Conversion Approach Generally speaking, Cyrillic Mongolian words and Traditional Mongolian words are in one-to-one correspondence during conversion. However, a large part of Cyrillic Mongolian words have more than one corresponding word in Traditional Mongolian.
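As a minimal sketch of this ambiguity, a candidate dictionary can map each Cyrillic word to its possible Traditional conversions. All entries below are illustrative placeholders in Latin transliteration, not real dictionary data:

```python
# Hypothetical candidate dictionary: most words map to a single Traditional
# Mongolian form; ambiguous words carry several candidates.
candidates = {
    "tanai": ["tanai"],                       # unambiguous: one candidate
    "asar":  ["asar_1", "asar_2", "asar_3"],  # ambiguous: three candidates
}

def lookup(cyrillic_word):
    """Return all Traditional Mongolian candidates for a Cyrillic word."""
    # Fall back to the word itself when it is not in the dictionary.
    return candidates.get(cyrillic_word, [cyrillic_word])

print(lookup("asar"))  # ['asar_1', 'asar_2', 'asar_3']
```

The 1-to-N entries are exactly where the language model is needed: the dictionary alone cannot decide among the candidates.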

LM based Conversion Approach Take the Cyrillic Mongolian sentence "Танай амар төвшинийг хамгаалхаар явсан юм." for example.

LM based Conversion Approach Given a Cyrillic Mongolian word sequence C = {c1 c2 ... cm}, the conversion problem can be represented as finding the Traditional Mongolian word sequence T that satisfies (1):

T* = argmax_T P(T | C)    (1)

where each ti is drawn from the candidate set of ci. The probability of T = {t1 t2 ... tm} can be decomposed by the chain rule as:

P(T) = ∏_{i=1..m} P(ti | t1 ... t(i-1))    (2)
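The chain-rule probabilities are estimated from counts in a training corpus. Below is a minimal sketch of plain maximum-likelihood bigram estimation on a toy corpus of placeholder tokens (no smoothing here, unlike the Kneser-Ney-smoothed model the paper uses):

```python
from collections import Counter

def train_bigram_mle(corpus):
    """Plain MLE bigram estimates: P(w2 | w1) = c(w1, w2) / c(w1)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        # Count each token as a history (everything but the final </s>).
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return lambda w1, w2: bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

# Toy corpus standing in for tokenized Traditional Mongolian sentences.
p = train_bigram_mle([["a", "b", "c"], ["a", "b", "d"]])
print(p("a", "b"))  # 1.0: "b" always follows "a" in the toy data
```

Unseen bigrams get probability zero under plain MLE, which is why a smoothing technique such as Kneser-Ney is needed in practice.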

LM based Conversion Approach Formula (1) can then be represented as:

T* = argmax_T ∏_{i=1..m} P(ti | t1 ... t(i-1))    (3)

Under the n-gram language model assumption, formula (3) can be further simplified as:

T* = argmax_T ∏_{i=1..m} P(ti | t(i-n+1) ... t(i-1))    (4)

We use Maximum Likelihood Estimation to estimate the parameters in (4) and adopt Kneser-Ney smoothing to overcome the data sparseness problem.
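The search for the word sequence maximizing the n-gram score can be sketched as a Viterbi pass over the candidate lattice, shown here with bigrams. The function names and interfaces below are illustrative assumptions, not the paper's implementation:

```python
def convert(cyrillic_words, candidates, bigram_logp):
    """Pick, for each Cyrillic word, the Traditional candidate that maximizes
    the bigram-scored sequence log-probability (Viterbi over the lattice).
    `candidates` maps each input word to its candidate list;
    `bigram_logp(prev, cur)` returns log P(cur | prev)."""
    # best[cand] = (score of best path ending in cand, that path)
    best = {"<s>": (0.0, [])}
    for word in cyrillic_words:
        new_best = {}
        for cand in candidates.get(word, [word]):
            score, path = max(
                (prev_score + bigram_logp(prev, cand), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[cand] = (score, path + [cand])
        best = new_best
    return max(best.values())[1]

# Toy log-probabilities: prefer "asar_2" after "tanai"; default score elsewhere.
def toy_logp(prev, cur):
    return {("tanai", "asar_2"): -0.1}.get((prev, cur), -2.0)

cands = {"tanai": ["tanai"], "asar": ["asar_1", "asar_2", "asar_3"]}
print(convert(["tanai", "asar"], cands, toy_logp))  # ['tanai', 'asar_2']
```

Because the lattice is narrow (most positions have a single candidate), this dynamic program is cheap even for long sentences.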

Experiment-evaluation We take the Conversion Accurate Rate (CAR) as the evaluation metric, which is defined as:

CAR = correct / total × 100%

where correct denotes the number of words that are correctly converted and total denotes the number of all the words that need to be converted.
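The metric itself is a one-liner; the counts in the example below are made up for illustration, not the paper's figures:

```python
def conversion_accurate_rate(n_correct, n_total):
    """CAR: correctly converted words over all words to convert, as a percentage."""
    return 100.0 * n_correct / n_total

print(conversion_accurate_rate(877, 1000))  # 87.7
```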

Experiment-data A dictionary mapping each Cyrillic Mongolian word to its multiple corresponding Traditional Mongolian words is constructed for our experiment. This dictionary has 4679 Cyrillic Mongolian words in total. A Traditional Mongolian text corpus, containing 154 MB of text in the international standard encoding, is used for n-gram language model training. A Cyrillic Mongolian corpus of test sentences is used to evaluate our approach; among its words, some have more than one corresponding Traditional Mongolian word.

Experiment-data The data set for the rule-based approach is composed of:
a dictionary mapping Cyrillic Mongolian stems to Traditional Mongolian stems
a dictionary mapping Cyrillic Mongolian static inflectional suffixes to Traditional Mongolian static inflectional suffixes, containing 336 suffixes
a dictionary mapping Cyrillic Mongolian verb suffixes to Traditional Mongolian verb suffixes, containing 498 inflectional suffixes
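The rule-based pipeline implied by these dictionaries can be sketched as a stem lookup plus a suffix lookup on an already-segmented word. All dictionary entries, names, and the segmentation interface below are hypothetical placeholders:

```python
# Hypothetical stem and suffix mapping dictionaries (Latin transliteration
# placeholders, not real entries from the paper's data set).
stem_map = {"amar": "amur"}
suffix_map = {"iig": "_i"}

def rule_convert(stem, suffix=""):
    """Convert a segmented Cyrillic word via stem and suffix dictionaries."""
    if stem not in stem_map or (suffix and suffix not in suffix_map):
        return None  # out-of-dictionary: the rule-based approach cannot decide
    return stem_map[stem] + (suffix_map[suffix] if suffix else "")

print(rule_convert("amar", "iig"))  # amur_i
```

When a stem has multiple Traditional correspondences, a purely dictionary-driven lookup like this cannot choose among them, which is the gap the LM based approach fills.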

Experiment-result The bigram language model achieved the best performance (CAR: 87.66%).

Experiment-result We also test the overall performance of the rule-based approach and the improved (LM-integrated) approach on all the Mongolian words (both 1-to-1 and 1-to-N). The experimental results are illustrated in Fig. 4:
conversion accuracy of the rule-based approach alone: 81.66%
conversion accuracy when integrated with the LM based approach: 88.14%

Conclusions Many problems emerge when converting Cyrillic Mongolian to Traditional Mongolian. The approach proposed in this paper effectively addresses the 1-to-N conversion problem and thereby greatly improves the overall conversion system performance. However, some issues remain to be considered, such as the conversion of newly-coined words and of words borrowed from other languages.

Thank you! Any questions?