International Collaboration for the Research on Language Technologies Masao Utiyama @ NICT 8th Dec. 2017 ONA2017
Abstract Many languages exist in the world, and no one can understand all of them. This is why we need international collaboration in research on language technologies. This talk presents the projects NICT has participated in. I also introduce the collaboration between NIPTICT and NICT and show how they work together in developing Khmer language technologies. I also introduce my research on developing machine translation (MT).
Outline Asian Language Treebank (ALT) Khmer ALT with NIPTICT U-STAR Khmer ASR (automatic speech recognition) with NIPTICT My own research on developing parallel corpora for MT
Open Collaboration for Developing and Using Asian Language Treebank Project members: Hammam Riza, Michael Purwoadi, Gunarso, Teduh Uliniansyah (BPPT) Aw Ai Ti, Sharifah Mahani Aljunied (I2R) Luong Chi Mai, Vu Tat Thang, Nguyen Phuong Thái (IOIT/UET) Vichet Chea, Rapid Sun, Sethserey Sam, Sopheap Seng (NIPTICT) Khin Mar Soe, Khin Thandar Nwet (UCSY) Masao Utiyama, Chenchen Ding (NICT) Chai Wutiwiwatchai, Thepchai Supnithi, Pranchya Boonkwan (NECTEC) Ria A. Sagum, Michael B. dela Fuente (PUP)
Current Status of Asian NLP Resources
- No publicly available treebanks for most Asian languages
- Development of Asian NLP is slow
- Difficult to compare research results across Asian NLP
Objective of Asian Language Treebank
- Provide the Asian Language Treebank for free for research
- Cover many under-resourced Asian languages
- Facilitate the rapid development of Asian NLP
- Provide common ground for comparison/evaluation of Asian NLP
- We will release ALT under a Creative Commons Attribution-NonCommercial-ShareAlike license
What the Asian Language Treebank (ALT) will be
- 20,000 English Wikinews sentences
- Translated into Indonesian, Japanese, Khmer, Malay, Myanmar, Vietnamese, Thai, Lao, and Filipino
- Annotated with word segmentation, POS, syntax, and word alignment
Samples (en, id, ja, km, ms, my, vi, th, lo, fil)
en: Italy have defeated Portugal 31-5 in Pool C of the 2007 Rugby World Cup at Parc des Princes, Paris, France.
id: Italia berhasil mengalahkan Portugal 31-5 di grup C dalam Piala Dunia Rugby 2007 di Parc des Princes, Paris, Perancis.
ja: フランスのパリ、パルク・デ・プランスで行われた2007年ラグビーワールドカップのプールCで、イタリアは31対5でポルトガルを下した。
km: អ៊ីតាលីបានឈ្នះលើព័រទុយហ្គាល់ 31-5 ក្នុងប៉ូលCនៃពីធីប្រកួតពានរង្វាន់ពិភពលោកនៃកីឡាបាល់ឱបឆ្នាំ2007ដែលប្រព្រឹត្តនៅប៉ាសឌេសប្រីន ក្រុងប៉ារីស បារាំង។
ms: Itali telah mengalahkan Portugal 31-5 dalam Pool C pada Piala Dunia Ragbi 2007 di Parc des Princes, Paris, Perancis.
my: ပြင်သစ်နိုင်ငံ ပါရီမြို့ ပါ့ဒက်စ် ပရင့်စက် ၌ ၂၀၀၇ခုနှစ် ရပ်ဘီ ကမ္ဘာ့ ဖလား တွင် အီတလီ သည် ပေါ်တူဂီ ကို ၃၁-၅ ဂိုး ဖြင့် ရေကူးကန် စီ တွင် ရှုံးနိမ့်သွားပါသည် ။
vi: Ý đã đánh bại Bồ Đào Nha với tỉ số 31-5 ở Bảng C Giải vô địch Rugby thế giới 2007 tại Parc des Princes, Pari, Pháp.
th: อิตาลีได้เอาชนะโปรตุเกสด้วยคะแนน31ต่อ5 ในกลุ่มc ของการแข่งขันรักบี้เวิลด์คัพปี2007 ที่สนามปาร์กเดแพร็งส์ ที่กรุงปารีส ประเทศฝรั่งเศส
lo: ອິຕາລີໄດ້ເສຍໃຫ້ປ໊ອກຕຸຍການ 31 ຕໍ່ 5 ໃນພູລ C ຂອງ ການແຂ່ງຂັນຣັກບີ້ລະດັບໂລກປີ 2007 ທີ່ ປາກເດແພຣັງ ປາຣີ ປະເທດຝຣັ່ງ.
fil: Natalo ng Italya ang Portugal sa puntos na 31-5 sa Grupong C noong 2007 sa Pandaigdigang laro ng Ragbi sa Parc des Princes, Paris, France.
Project Goal (Final outcome) NICT will develop and release the parallel corpus for ALT Each member institute shall develop and release ALT for each language Each member institute shall decide the amount of ALT, which will be developed and released by that institute ALT will be used for research and development on Asian NLP
Results so far indicating fruitful collaboration
- First meeting was hosted by NIPTICT (Apr. 2016)
- Second meeting was hosted by BPPT (Oct. 2016)
- Third meeting was hosted by UCSY (Aug. 2017)
- Each member institute has started developing its own ALT
- ALT resources are available at the project page: http://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/index.html
- Cooperation with U-STAR: Khmer SMT has been released to U-STAR, and the parallel corpora are used in U-STAR for machine translation
Progress of Khmer ALT at NICT with NIPTICT Chenchen Ding†, Hour Kaing‡, Masao Utiyama†, Vichet Chea‡, Eiichiro Sumita† †Advanced Translation Technology Laboratory, ASTREC, NICT, Japan ‡NIPTICT, Cambodia
Outline Progress on the Khmer data in ALT NICT with NIPTICT Resources under checking Tokenization and part-of-speech (POS) annotation guidelines for Khmer Tokenized and POS-tagged data: 90% data checked, by NOVA Issues in temporary Khmer data Orthographic errors Multi-form Khmerization Final outcome
Annotation Guidelines for Khmer
- Released on the ALT home page: http://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/Khmer-annotation-guideline.pdf
- Updated along with data preparation; currently a temporary stable version
- Manual checking and correcting: around 90% of all 20,106 sentences have been checked; will be finished within FY 2017; will be released in FY 2018 after further cleansing
Issues in Temporary Khmer Data
- Orthographic errors: 5 general error cases detected and corrected; other undetected cases need to be corrected manually
- Multi-form words (Khmerization): names (country, person, etc.) are written in different forms
Orthographic Errors កណ្ដាល → ក ណ ្ ដ ា ល gone dal (correct pronunciation) កណ្តាល → ក ណ ្ ត ា ល gone tal បន្ដ → ប ន ្ ដ bone dor (correct pronunciation) បន្ត → ប ន ្ ត bone tor តន្ត្រី → ត ន ្ ត ្ រ ី (correct) តន្រ្តី → ត ន ្ រ ្ ត ី តន្រី្ត → ត ន ្ រ ី ្ ត 1. ្ ដ → ្ ត 2. ្ រ ្ (con.) → ្ (con.) ្ រ 3. (d. vowel) ្ (con.) → ្ (con.)(d. vowel)
Orthographic Errors តាំង → ត ា ំ ង (preference) តំាង → ត ំ ា ង តំាង → ត ំ ា ង ញ៉ាំ → ញ ៉ ា ំ (preference) ញុំា → ញ ុ ំ ា ញាំុ → ញ ា ំ ុ ញាុំ → ញ ា ុ ំ ញុាំ → ញ ុ ា ំ ញំាុ → ញ ំ ា ុ ញំុា → ញ ំ ុ ា 4. ំ ា → ា ំ ...
Orthographic Errors 5 rules ្ ដ => ្ ត ្ ដ => ្ ត ្ រ ្ (con.) => ្ (con.) ្ រ (d. vowel) ្ (con.) => ្ (con.) (d. vowel) ំ ា => ា ំ ុ ា ំ => ៉ ា ំ con. = consonant d. = dependent, always standing with a consonant
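The rewrite rules listed on this slide can be sketched as regular-expression substitutions over Unicode code points. This is only a minimal illustration, not the tool used in the project: the function name `normalize_khmer`, the character ranges chosen for "con." (U+1780 to U+17A2) and "d. vowel" (U+17B6 to U+17C5), and the repeat-until-stable loop are all assumptions made for this example.

```python
import re

COENG = "\u17D2"            # Khmer sign coeng (subscript marker)
CONS = "[\u1780-\u17A2]"    # assumed range for "con." (consonants)
DVOWEL = "[\u17B6-\u17C5]"  # assumed range for "d. vowel" (dependent vowels)

# One (pattern, replacement) pair per rule on the slide.
RULES = [
    # 1. coeng DA -> coeng TA
    (re.compile(COENG + "\u178A"), COENG + "\u178F"),
    # 2. coeng RO + coeng consonant -> coeng consonant + coeng RO
    (re.compile("(" + COENG + "\u179A)(" + COENG + CONS + ")"), r"\2\1"),
    # 3. dependent vowel + coeng consonant -> coeng consonant + dependent vowel
    (re.compile("(" + DVOWEL + ")(" + COENG + CONS + ")"), r"\2\1"),
    # 4. nikahit + vowel AA -> vowel AA + nikahit
    (re.compile("\u17C6\u17B6"), "\u17B6\u17C6"),
    # 5. vowel U + vowel AA + nikahit -> muusikatoan + vowel AA + nikahit
    (re.compile("\u17BB\u17B6\u17C6"), "\u17C9\u17B6\u17C6"),
]


def normalize_khmer(text, max_passes=10):
    """Apply the five rules repeatedly until the text stops changing."""
    for _ in range(max_passes):
        new_text = text
        for pattern, replacement in RULES:
            new_text = pattern.sub(replacement, new_text)
        if new_text == text:
            break
        text = new_text
    return text
```

Repeating the pass matters because one rule can expose a match for another: a dependent vowel placed before a subscript consonant is first moved by rule 3, and only then can rule 2 reorder the two subscripts.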
Multi-form Khmerization New Zealand Translation Frequency ញ៉ូវ ហ្សេឡែន 6 ញូវ ហ្សេឡែន 1 ញូវ ហ្សេលេន ញូវ ហ្សឺឡែន ញូវ ហ្សឺលែន 9 ញ៉ូវែល ហ្សែឡង ញ៉ូវែល ហ្សេឡង ញូវែល ហ្សេឡង់ Nouvelle-Zélande ???
Final outcome up to F.Y. 2018
Conditions and reasons: We do not yet have guidelines for syntax annotation, but we already have a temporary stable version of the word-segmented and POS-tagged corpus. Our plan is therefore to complete all word segmentation and POS tagging first, then start word alignment, and finally do syntax annotation.
Final outcome:
- Word segmentation: 20,106 sentences
- POS tagging: 20,106 sentences
- Syntax annotation: 9,000 sentences
Universal Speech Translation Advanced Research “TO OVERCOME THE LANGUAGE BARRIERS AROUND THE WORLD” EST. SINCE 2010
HISTORY
[Timeline figure, 1991 to 2017, showing the C-STAR, A-STAR, and U-STAR phases. Member countries and regions shown: Japan, China, Korea, Indonesia, Thailand, India; Vietnam, Singapore; France, Portugal, Turkey, England, Germany, Hungary, Poland, Belgium, Ireland; Japan, US, Germany; Korea, Italy, France, China, UK, Switzerland, Sweden, India; Bhutan, Mongolia, Nepal, Pakistan, Philippines, Sri Lanka; Taiwan, Cambodia. Events marked: A-STAR network-based S2ST; management transferred from NICT to I2R; standardization activities transferred to ITU-T SG16 from APT SNLP-EG; ITU-T Recommendations F.745 and H.625 published; VoiceTra4U for iOS released; U-STAR network-based S2ST; VoiceTra4U for Android released.]
The Consortium for Speech Translation Advanced Research (C-STAR) started out over 20 years ago to develop multilingual speech translation systems. Numerous subsequent activities and workshops grew out of C-STAR, such as the International Workshop on Spoken Language Translation (IWSLT). The Asian Speech Translation Advanced Research (A-STAR) consortium was formed in the Asian region to develop a network-based speech-to-speech translation (S2ST) system. A-STAR took the initiative in standardizing international communication protocols, especially in the S2ST field, in association with the Asia-Pacific Telecommunity Standardization Program (ASTAP), and launched "the first Asian network-based speech-to-speech translation system" on July 29th, 2009. The system enabled real-time, location-free, multi-party communication between speakers of different Asian languages and confirmed the feasibility of the network-based S2ST protocol. In 2010, the standardization procedures at ASTAP were transferred to the International Telecommunication Union Standardization Sector (ITU-T) as A-STAR became U-STAR, changing not only its name but also its organization into a worldwide consortium with the aim of establishing a more global system.
U-STAR’s network-based S2ST is developed based on the ITU-T Recommendations F.745, and H.625, which were both published in October, 2010.
MEMBERS (as of April 2017) U-STAR Members: 33 institutes from 26 countries/regions have signed the MOU (Memorandum of Understanding), which is valid until March 31st, 2019. U-STAR covers 95% (orange areas) of the world's official languages
Main Activities
R&D of Network-based Speech-to-Speech Translation System
[Figure: Hindi and Japanese clients connected through a communication network to ASR, MT (HI→JA, JA→HI), and TTS engines for S2ST]
R&D Tasks:
- Collect speech and text data and dictionaries of local languages
- Link local languages to multilingual communication systems
- Construct standalone ASR/MT/TTS engines
- Connect engines to multilingual systems using ITU-T standardization protocols via network
- Run applications using the network-based S2ST
- Utilize collected data from the application to improve performance
- Extend developed technologies to commercial fields

- Each member builds and operates its own servers for speech recognition, machine translation, and speech synthesis.
- Users select the set of languages to be translated within the iPhone application.
- According to what the users have selected, the control server (operated by NICT) connects to the S2ST servers (operated by each member).
* Communication protocols and interfaces are implemented based on the ITU-T Recommendations F.745 and H.625.
Workshops
- Held once or twice a year to accelerate research collaborations and share progress.
- Often held alongside other international workshops (e.g. Interspeech, ICASSP, O-COCOSDA) so that more people can participate.
Notable Invited Keynote Speakers:
- Prof. Dr. Alexander Waibel (Carnegie Mellon University, USA / Karlsruhe Institute of Technology, Germany / Jibbigo, Chairman and Founder, USA & Germany / Advisory Board of the U-STAR Consortium)
- Mr. Simão Campos (Counsellor, ITU-T Study Group 16, ITU Telecommunication Standardization Bureau)
- Prof. Dekai Wu (Department of Computer Science and Engineering and the Human Language Technology Center at the Hong Kong University of Science and Technology)
London, 2012; Gurgaon, 2013; Gurgaon, 2013; Lyon, 2013; Florence, 2014; Phuket, 2014
ORGANIZATION and ROLES
Advisory Board:
- Dr. Alex Waibel (CMU, USA / KIT, Germany)
- Prof. Satoshi Nakamura (NAIST, Japan)
- Prof. Shyam Agrawal (KIIT, India)
* Give informed guidance and suggestions
* Participate in workshops
U-STAR (East Asia Region) Steering Committee:
- Dr. Bo Xu, CASIA (China)
- Dr. Eiichiro Sumita, NICT (Japan)
Members: CASIA; CITI, Academia Sinica; HLT-SUDA; MUST; NICT; NUM
U-STAR (South Asia Region) Steering Committee:
- Dr. Chai Wutiwiwatchai, NECTEC (Thailand)
- Dr. Haizhou Li, I2R (Singapore)
Members: BPPT; CDAC, KOLKATA; CDAC, NOIDA; CU; DITT; HUST; I2R; IOIT; KICS-UET; LTK; NECTEC; NIPTICT; NYUAD; UCSC; UCSY; UTM; UPD
U-STAR (European Region) Steering Committee:
- Dr. Hugo Meinedo, INESC-ID (Portugal)
- Dr. Thomas Hain, Sheffield University (UK)
Members: BME-TMIT; CIS; CNRS-LIMSI; ESAT; INESC-ID; PJIIT; PPKE; TUBITAK; UEDIN; UUlm
Technical Support:
- Mr. Li Zhongwei, I2R (Singapore)
* Construct, manage, and connect engines and servers between members
* Troubleshooting
Coordinator:
- Dr. Haizhou Li, I2R (Singapore)
* Host and chair workshops
* Recruit new members
NICT took the leading role in establishing the consortium and in managing and developing the U-STAR S2ST system. On April 1, 2016, all of these responsibilities were transferred to I2R (Singapore), which will be in charge for 3 years until March 31, 2019.
U-STAR Secretariat:
- Ms. Ai Ti and Ms. Sharifah Mahani, I2R (Singapore)
* Coordinate between all affiliated members
* Prepare for workshops and demonstrations
* Prepare slides and internal and external reports
* Prepare and coordinate contracts: MOU, TLA, etc.
* Accounting work for workshops and sponsors
* Help desk and customer support for publicly released applications
* Support communication between engineers of each member
* Construct and manage internal and external websites
* Handle public relations (flyers, posters, etc.)
* Manage corpora, documents, and data
Khmer Speech Processing From the summary report by Soky Kak of NIPTICT, written while working at NICT as an intern
Lexical Data Construction Compound word Single word Chuon Nath dict. BTEC websites 34K Keywords Khmer words Pali and Sanskrit Loanwords 57 unique phones 21 consonant phones 36 vowel phones
Language Model
All BTEC data amounts to 96K sentences. A 3-gram language model is built using the SRILM toolkit.

No.  Data set     Sentences  Words (single words)
1    Training     96K        11,520 (8,733)
2    Development  106        ~
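To make the idea of a 3-gram language model concrete, here is a minimal sketch that counts trigrams and returns unsmoothed maximum-likelihood probabilities. It is an illustration only and is not what SRILM does internally: SRILM applies smoothing and backoff, which this sketch omits. The function names, the `<s>`/`</s>` padding symbols, and the assumption that sentences are already word-segmented are choices made for this example.

```python
from collections import Counter


def train_trigram(sentences):
    """Count trigrams and their two-word contexts.

    sentences: iterable of token lists (already word-segmented).
    """
    trigram_counts, context_counts = Counter(), Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            trigram_counts[context + (padded[i],)] += 1
            context_counts[context] += 1
    return trigram_counts, context_counts


def trigram_prob(trigram_counts, context_counts, w1, w2, w3):
    """Unsmoothed estimate of P(w3 | w1, w2); 0.0 for an unseen context."""
    seen = context_counts[(w1, w2)]
    return trigram_counts[(w1, w2, w3)] / seen if seen else 0.0
```

Because the estimate is unsmoothed, any trigram absent from the training data gets probability zero, which is the main reason practical toolkits add smoothing.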
Speech Data Construction
4K sentences were selected from BTEC based on the balance of CC (consonant-consonant) and used for voice recording. Voice recording was conducted in 3 places.

No.  Place                  Device                 Speakers (female)  Utterances
1    NICT (soundproof)      Professional recorder  35 (20)            37.8K
2    NIPTICT (office)       Smartphone + headset   12 (9)             10.57K
3    Cambodia (open place)  Smartphone             31 (15)            12.55K
     Total                                         78 (44)            60.92K (~65 hours)
Data preparation

Type                  Data size  Note
Training data         ~61 hours
Development           ~2 hours   Randomly selected 3% from training data
Testing (open test)              5 speakers not in the training data
Testing (close_test)  ~2 hours   Speakers in the training data
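The split described on this slide (held-out speakers for the open test, a random 3% for development) can be sketched as follows. This is an assumption-laden illustration rather than the procedure actually used: the function name, the `(speaker_id, utterance_id)` tuple layout, and the fixed random seed are hypothetical, and the closed test set is left out for brevity.

```python
import random


def split_utterances(utterances, open_test_speakers, dev_fraction=0.03, seed=0):
    """Split (speaker_id, utterance_id) pairs into train, dev, and open-test sets.

    Utterances from open_test_speakers never appear in train or dev.
    """
    open_test = [u for u in utterances if u[0] in open_test_speakers]
    pool = [u for u in utterances if u[0] not in open_test_speakers]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(pool)
    n_dev = round(len(pool) * dev_fraction)
    return pool[n_dev:], pool[:n_dev], open_test
```

Holding out whole speakers, rather than random utterances, is what makes the open test measure performance on voices the model has never heard.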
Experimentation Result (WER %)

Model     OPEN_TEST  CLOSE_TEST
GMM-BMMI  9.09       5.19
DNN-CE    5.85       4.40
DNN-sMBR  5.44       4.09

GMM-BMMI: Gaussian mixture model trained with boosted maximum mutual information
DNN-CE: deep neural network model trained with the cross-entropy criterion
DNN-sMBR: deep neural network model trained with the state-level minimum Bayes risk criterion
Decoding by the NICT SprinTra Decoder. The texts are from BTEC. No noise.
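For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the reference and the recognizer output, divided by the number of reference words. A minimal sketch follows; the function name is chosen for this example, and it assumes both strings are already segmented into space-separated words, which for Khmer requires a word segmentation step first.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent between two space-separated word strings."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return 100.0 * prev[-1] / len(ref)
```

For example, `word_error_rate("a b c d", "a x c")` counts one substitution and one deletion against four reference words, giving 50.0.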
Neural Machine Translation (NMT) needs Parallel Corpus
NMT works much better than SMT
Japanese: 近年のNMTの進展により、従来は自動翻訳が非常に困難だった日本語文章の英語への自動翻訳精度が顕著に向上してきた。
SMT: The development of NMT in recent years, conventional automatic translation was very difficult to machine translation accuracy of Japanese sentences in English has improved.
NMT: Recent developments in NMT have significantly improved the accuracy of automatic translation of Japanese sentences, which have been extremely difficult to translate, into English.
Fundamental Structure of NMT Thang Luong; Hieu Pham; Christopher D. Manning. (2015) Effective Approaches to Attention-based Neural Machine Translation. EMNLP
NMT is much better with many parallel texts
[Figure: translation accuracy against number of parallel sentences; labels shown: SMT, 1 million sentences]
Automatic Parallel Corpus Construction
- Finds the best-matching sentences between Japanese and English texts
- Needs only parallel texts; parallel sentences are not required
- Has produced over 500 million parallel sentences since Utiyama and Isahara, 2003
[Figure: Japanese texts and English text, linked by best-matching sentences]
Masao Utiyama and Hitoshi Isahara. (2003) Reliable Measures for Aligning Japanese-English News Articles and Sentences. ACL-2003, pp. 72--79.
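As a rough illustration of what "best-matching sentences" can mean, the sketch below pairs each source sentence with the target sentence that shares the most dictionary translations. It is a toy stand-in and not the alignment measure from the cited paper: the Dice-style overlap score, the bilingual `dictionary` argument, the `threshold` value, and both function names are assumptions made for this example, and sentences are assumed to be tokenized already.

```python
def overlap_score(src_tokens, tgt_tokens, dictionary):
    """Dice-style overlap between dictionary translations of src and tgt tokens."""
    translated = set()
    for token in src_tokens:
        translated.update(dictionary.get(token, ()))
    target = set(tgt_tokens)
    if not translated or not target:
        return 0.0
    return 2.0 * len(translated & target) / (len(translated) + len(target))


def best_matches(src_sents, tgt_sents, dictionary, threshold=0.3):
    """For each source sentence, keep its best-scoring target if above threshold."""
    pairs = []
    for i, src in enumerate(src_sents):
        scored = [(overlap_score(src, tgt, dictionary), j)
                  for j, tgt in enumerate(tgt_sents)]
        if not scored:
            continue
        score, j = max(scored)
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs
```

The threshold reflects the point that only reliable matches should enter the corpus: sentences with no good counterpart are simply dropped rather than forced into a pair.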
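Parallel texts are gathered through collaboration
Parallel corpora are scattered across organizations (Prefecture N, Company Y, Prefecture L, Company X, Metropolis M, Company Z, Company B, Company A) and domains (tourism, finance, automobiles). Gathering them into one big parallel corpus is the challenge.
Open Collaboration for Developing Large Parallel Corpora is welcome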