Chenchen Ding, Masao Utiyama, Eiichiro Sumita


Similar Southeast Asian Languages: Corpus-Based Case Study on Thai-Laotian and Malay-Indonesian Chenchen Ding, Masao Utiyama, Eiichiro Sumita Advanced Translation Technology Laboratory, ASTREC, NICT, Japan

Motivation. For similar languages: specific and efficient approaches can be designed; techniques developed for well-studied languages can be applied to low-resourced ones. How to measure the similarity: Scripts: related or comparable writing systems → similar letters. Vocabulary: etymologically related words → similar spellings. Syntax: phrase / sentence structure → similar word orders.

Outline. Asian Language Treebank (ALT) project; similar languages and related processing; investigation and experiments; conclusion and future work.

Motivation of Asian Language Treebank. Compared with European languages, most Asian languages are low-resourced and understudied → NLP techniques cannot be developed and applied. ALT can facilitate: tokenization / POS tagging / parsing; cross-lingual processing → establish a solid basis for Asian language processing.

Details of Asian Language Treebank. Treebanks for six Asian languages and English: Burmese, Indonesian, Japanese, Khmer, Malay, Vietnamese. April 2016 -- March 2019. Candidate languages in future: Laotian, Tagalog, Thai. All the raw parallel data are available: http://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/

Similar Languages in ALT. URL: en.wikinews.org/wiki/2007_Rugby_World_Cup:_Italy_31_-_5_Portugal English sentences: Italy have defeated Portugal 31-5 in Pool C of the 2007 Rugby World Cup at Parc des Princes, Paris, France. …

Similar Languages in ALT. URL: en.wikinews.org/wiki/2007_Rugby_World_Cup:_Italy_31_-_5_Portugal Indonesian and Malay translations: Italia berhasil mengalahkan Portugal 31-5 di grup C dalam Piala Dunia Rugby 2007 di Parc des Princes, Paris, Perancis. … Itali telah mengalahkan Portugal 31-5 dalam Pool C pada Piala Dunia Ragbi 2007 di Parc des Princes, Paris, Perancis.


Similar Languages in ALT. URL: en.wikinews.org/wiki/2007_Rugby_World_Cup:_Italy_31_-_5_Portugal Laotian and Thai translations: ອິຕາລີໄດ້ເສຍໃຫ້ປ໊ອກຕຸຍການ31ຕໍ່5ໃນພູລCຂອງການແຂ່ງຂັນຣັກບີ້ລະດັບ ໂລກປີ2007ທີ່ປາກເດແພຣັງປາຣີປະເທດຝຣັ່ງ. … อิตาลีได้เอาชนะโปรตุเกสด้วยคะแนน31ต่อ5ในกลุ่มcของการแข่งขันรักบี้เวิลด์คัพปี2007ที่สนามปาร์กเดแพร็งส์ที่กรุง ปารีสประเ


Processing Similar Languages in NLP. Translation between Catalan and Spanish: "Can we translate letters?", D. Vilar et al., 2007, WMT. Translation between Japanese and Korean: the last years' WAT; character-based processing. Applying SMT techniques developed for Japanese to Burmese: "Empirical dependency-based head finalization for statistical Chinese-, English-, and French-to-Myanmar (Burmese) MT", C. Ding et al., 2014, IWSLT.

Two Southeast Asian Language Pairs. Thai-Laotian: tonal languages from the Tai-Kadai language family, mutually intelligible; abugida writing systems; etymologically related words; isolating in morphology, head-initial in syntax. Malay-Indonesian: from the Austronesian language family, mutually intelligible; written in Latin script; "different registers of one language".

Data and Pre-processing. Raw translations from ALT. Sentences: train / dev / test → 18,000 / 1,000 / 1,000. Tokens: simple tokenization for Malay and Indonesian (punctuation marks detached); unbreakable unit segmentation for Thai and Laotian (dependent diacritics attached to independent letters).
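
The two tokenization schemes on this slide could be sketched roughly as follows. This is a minimal illustration, not the authors' actual pre-processing: the function names `detach_punctuation` and `unbreakable_units` are hypothetical, the punctuation set is an arbitrary choice, and "dependent diacritics" is approximated here by the Unicode combining-mark categories (Mn, Mc), which may not match the unit definition used in the study.

```python
import re
import unicodedata


def detach_punctuation(sentence):
    """Simple tokenization for Latin-script text (e.g. Malay, Indonesian).

    Assumption: only the punctuation marks listed in the pattern are
    detached; hyphens are left alone so that scores like "31-5" stay whole.
    """
    spaced = re.sub(r'([.,;:!?()"])', r' \1 ', sentence)
    return spaced.split()


def unbreakable_units(text):
    """Group each combining mark with the character that precedes it.

    Assumption: a "dependent diacritic" is any character whose Unicode
    general category is Mn or Mc. Other characters start a new unit.
    """
    units = []
    for ch in text:
        if units and unicodedata.category(ch) in ("Mn", "Mc"):
            units[-1] += ch  # attach the mark to the current unit
        else:
            units.append(ch)  # start a new unit
    return units


if __name__ == "__main__":
    print(detach_punctuation("di Parc des Princes, Paris, Perancis."))
    # Thai consonant + vowel sign + tone mark forms a single unit.
    print(unbreakable_units("\u0e17\u0e35\u0e48"))
```

Thai and Laotian are written without spaces between words, so grouping characters into such units gives a token sequence without requiring a word segmenter.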

Word Order Kendall’s tau on Thai and Laotian

Word Order Kendall’s tau on Malay and Indonesian

For Comparison Kendall’s tau on Japanese-English and English-French
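
As a rough illustration of the measure named on these slides, Kendall's tau for one sentence pair can be computed from the target-side positions of the aligned source tokens. This is a sketch under assumptions: the slides do not say how alignments were obtained or how unaligned and multiply-aligned tokens were handled, so the input here is simply a hypothetical list `target_positions`, one target position per source token in source order.

```python
from itertools import combinations


def kendall_tau(target_positions):
    """Kendall's tau of a sequence of target positions.

    target_positions[i] is the (assumed) target-side position aligned to
    the i-th source token. A value of 1.0 means identical word order and
    -1.0 means fully reversed order. Tied pairs count as neither
    concordant nor discordant.
    """
    n = len(target_positions)
    if n < 2:
        return 1.0  # a single token cannot be out of order
    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        if target_positions[i] < target_positions[j]:
            concordant += 1
        elif target_positions[i] > target_positions[j]:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


if __name__ == "__main__":
    print(kendall_tau([0, 1, 2, 3]))  # monotone alignment
    print(kendall_tau([3, 2, 1, 0]))  # fully reversed alignment
    print(kendall_tau([0, 2, 1, 3]))  # one local swap
```

A distribution of such values concentrated near 1.0 would indicate that the two languages rarely need reordering in translation.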

Uncertainty in Token Correspondence X-axis: log probability of Thai tokens Y-axis: Entropy on corresponding Laotian tokens

Uncertainty in Token Correspondence X-axis: log probability of Laotian tokens Y-axis: Entropy on corresponding Thai tokens

Uncertainty in Token Correspondence X-axis: log probability of Malay tokens Y-axis: Entropy on corresponding Indonesian tokens

Uncertainty in Token Correspondence X-axis: log probability of Indonesian tokens Y-axis: Entropy on corresponding Malay tokens

For Comparison X-axis: log probability of Japanese characters Y-axis: Entropy on corresponding Korean characters

For Comparison X-axis: log probability of Japanese tokens Y-axis: Entropy on corresponding English words
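
The two axes of these plots can be illustrated with a small sketch. It assumes a list of aligned (source token, target token) pairs is already available; how the pairs were extracted in the study is not stated on the slides. The function name `correspondence_stats` is hypothetical, and base-2 logarithms are an assumption since the slides do not give the base.

```python
import math
from collections import Counter, defaultdict


def correspondence_stats(aligned_pairs):
    """Return {source_token: (log_probability, entropy)}.

    log_probability: log2 of the token's relative frequency among all
    aligned source tokens (the X-axis).
    entropy: entropy in bits of the distribution of target tokens aligned
    to that source token (the Y-axis).
    """
    pairs = list(aligned_pairs)
    source_counts = Counter(src for src, _ in pairs)
    target_counts = defaultdict(Counter)
    for src, tgt in pairs:
        target_counts[src][tgt] += 1

    total = sum(source_counts.values())
    stats = {}
    for src, count in source_counts.items():
        log_prob = math.log2(count / total)
        entropy = sum(
            -(c / count) * math.log2(c / count)
            for c in target_counts[src].values()
        )
        stats[src] = (log_prob, entropy)
    return stats


if __name__ == "__main__":
    # Toy data: "a" maps to two different targets, "b" always to "z".
    toy_pairs = [("a", "x"), ("a", "y"), ("b", "z"), ("b", "z")]
    for token, (log_prob, entropy) in correspondence_stats(toy_pairs).items():
        print(token, log_prob, entropy)
```

A token whose translations are fully predictable has entropy 0, so closely related languages would be expected to show many points near the X-axis.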

Experimental Results from SMT. Moses phrase-based (PB) SMT. The parallel data in ALT are not sufficient for a practical system → experiments to investigate the reordering requirement in translation.

Conclusion and Future Work. The similarities between Thai-Laotian and Malay-Indonesian have been investigated in this study, based on the ALT data → the Thai-Laotian pair is similar to the Japanese-Korean pair → the Malay-Indonesian pair is extremely similar in word order. Future work: harmonious annotation of the language pairs in corpus construction; unified techniques for NLP tasks / applications.