Coşkun Mermer, Hamza Kaya, Mehmet Uğur Doğan National Research Institute of Electronics and Cryptology (UEKAE) The Scientific and Technological Research.

Slides:

Advertisements

Similar presentations

Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

Advertisements

Statistical Machine Translation Part II – Word Alignments and EM Alex Fraser Institute for Natural Language Processing University of Stuttgart

Large-Scale Entity-Based Online Social Network Profile Linkage.

Word Sense Disambiguation for Machine Translation Han-Bin Chen

Using Percolated Dependencies in PBSMT Ankit K. Srivastava and Andy Way Dublin City University CLUKI XII: April 24, 2009.

Lattices Segmentation and Minimum Bayes Risk Discriminative Training for Large Vocabulary Continuous Speech Recognition Vlasios Doumpiotis, William Byrne.

Language Model Based Arabic Word Segmentation By Saleh Al-Zaid Software Engineering.

Chinese Word Segmentation Method for Domain-Special Machine Translation Su Chen; Zhang Yujie; Guo Zhen; Xu Jin’an Beijing Jiaotong University.

Unsupervised Turkish Morphological Segmentation for Statistical Machine Translation Coskun Mermer and Murat Saraclar Workshop on Machine Translation and.

Syntactic And Sub-lexical Features For Turkish Discriminative Language Models ICASSP 2010 Ebru Arısoy, Murat Sarac¸lar, Brian Roark, Izhak Shafran Bang-Xuan.

Why Generative Models Underperform Surface Heuristics UC Berkeley Natural Language Processing John DeNero, Dan Gillick, James Zhang, and Dan Klein.

Inducing Information Extraction Systems for New Languages via Cross-Language Projection Ellen Riloff University of Utah Charles Schafer, David Yarowksy.

“Applying Morphology Generation Models to Machine Translation” By Kristina Toutanova, Hisami Suzuki, Achim Ruopp (Microsoft Research). UW Machine Translation.

Course Summary LING 575 Fei Xia 03/06/07. Outline Introduction to MT: 1 Major approaches –SMT: 3 –Transfer-based MT: 2 –Hybrid systems: 2 Other topics.

Symmetric Probabilistic Alignment Jae Dong Kim Committee: Jaime G. Carbonell Ralf D. Brown Peter J. Jansen.

MACHINE TRANSLATION AND MT TOOLS: GIZA++ AND MOSES -Nirdesh Chauhan.

CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 18– Training and Decoding in SMT System) Kushal Ladha M.Tech Student CSE Dept.,

Automated Compounding as a means for Maximizing Lexical Coverage Vincent Vandeghinste Centrum voor Computerlinguïstiek K.U. Leuven.

The CMU-UKA Statistical Machine Translation Systems for IWSLT 2007 Ian Lane, Andreas Zollmann, Thuy Linh Nguyen, Nguyen Bach, Ashish Venugopal, Stephan.

English-Persian SMT Reza Saeedi 1 WTLAB Wednesday, May 25, 2011.

Technical Report of NEUNLPLab System for CWMT08 Xiao Tong, Chen Rushan, Li Tianning, Ren Feiliang, Zhang Zhuyu, Zhu Jingbo, Wang Huizhen

Evaluation of the Statistical Machine Translation Service for Croatian-English Marija Brkić Department of Informatics, University of Rijeka

An Integrated Approach for Arabic-English Named Entity Translation Hany Hassan IBM Cairo Technology Development Center Jeffrey Sorensen IBM T.J. Watson.

Advanced Signal Processing 05/06 Reinisch Bernhard Statistical Machine Translation Phrase Based Model.

2012: Monolingual and Crosslingual SMS-based FAQ Retrieval Johannes Leveling CNGL, School of Computing, Dublin City University, Ireland.

SLTU 2014 – 4th Workshop on Spoken Language Technologies for Under-resourced Languages St. Petersburg, Russia KIT – University of the State.

2010 Failures in Czech-English Phrase-Based MT 2010 Failures in Czech-English Phrase-Based MT Full text, acknowledgement and the list of references in.

Intelligent Database Systems Lab N.Y.U.S.T. I. M. Chinese Word Segmentation and Statistical Machine Translation Presenter : Wu, Jia-Hao Authors : RUIQIANG.

1 The LIG Arabic / English Speech Translation System at IWSLT07 Laurent BESACIER, Amar MAHDHAOUI, Viet-Bac LE LIG*/GETALP (Grenoble, France)

Part-Of-Speech Tagging using Neural Networks Ankur Parikh LTRC IIIT Hyderabad

Phrase Reordering for Statistical Machine Translation Based on Predicate-Argument Structure Mamoru Komachi, Yuji Matsumoto Nara Institute of Science and.

The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan.

NLP. Introduction to NLP Extrinsic –Use in an application Intrinsic –Cheaper Correlate the two for validation purposes.

NUDT Machine Translation System for IWSLT2007 Presenter: Boxing Chen Authors: Wen-Han Chao & Zhou-Jun Li National University of Defense Technology, China.

Reordering Model Using Syntactic Information of a Source Tree for Statistical Machine Translation Kei Hashimoto, Hirohumi Yamamoto, Hideo Okuma, Eiichiro.

Automatic Post-editing (pilot) Task Rajen Chatterjee, Matteo Negri and Marco Turchi Fondazione Bruno Kessler [ chatterjee | negri | turchi

1 Boostrapping language models for dialogue systems Karl Weilhammer, Matthew N Stuttle, Steve Young Presenter: Hsuan-Sheng Chiu.

Pattern Discovery of Fuzzy Time Series for Financial Prediction -IEEE Transaction of Knowledge and Data Engineering Presented by Hong Yancheng For COMP630P,

Compact WFSA based Language Model and Its Application in Statistical Machine Translation Xiaoyin Fu, Wei Wei, Shixiang Lu, Dengfeng Ke, Bo Xu Interactive.

Korea Maritime and Ocean University NLP Jung Tae LEE

Statistical Machine Translation Part III – Phrase-based SMT / Decoding Alexander Fraser Institute for Natural Language Processing Universität Stuttgart.

Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Student : Sheng-Hsuan Wang Department.

Chinese Word Segmentation Adaptation for Statistical Machine Translation Hailong Cao, Masao Utiyama and Eiichiro Sumita Language Translation Group NICT&ATR.

NRC Report Conclusion Tu Zhaopeng NIST06  The Portage System  For Chinese large-track entry, used simple, but carefully- tuned, phrase-based.

LREC 2008 Marrakech 29 May Caroline Lavecchia, Kamel Smaïli and David Langlois LORIA / Groupe Parole, Vandoeuvre-Lès-Nancy, France Phrase-Based Machine.

A DYNAMIC APPROACH TO THE SELECTION OF HIGH ORDER N-GRAMS IN PHONOTACTIC LANGUAGE RECOGNITION Mikel Penagarikano, Amparo Varona, Luis Javier Rodriguez-

Imposing Constraints from the Source Tree on ITG Constraints for SMT Hirofumi Yamamoto, Hideo Okuma, Eiichiro Sumita National Institute of Information.

1 Minimum Error Rate Training in Statistical Machine Translation Franz Josef Och Information Sciences Institute University of Southern California ACL 2003.

RCC-Mean Subtraction Robust Feature and Compare Various Feature based Methods for Robust Speech Recognition in presence of Telephone Noise Amin Fazel Sharif.

Statistical Machine Translation Part II: Word Alignments and EM Alex Fraser Institute for Natural Language Processing University of Stuttgart

Error Analysis of Two Types of Grammar for the purpose of Automatic Rule Refinement Ariadna Font Llitjós, Katharina Probst, Jaime Carbonell Language Technologies.

St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences Recurrent Neural Network-based Language Modeling for an Automatic.

1 ICASSP Paper Survey Presenter: Chen Yi-Ting. 2 Improved Spoken Document Retrieval With Dynamic Key Term Lexicon and Probabilistic Latent Semantic Analysis.

Question Answering Passage Retrieval Using Dependency Relations (SIGIR 2005) (National University of Singapore) Hang Cui, Renxu Sun, Keya Li, Min-Yen Kan,

Cross-Dialectal Data Transferring for Gaussian Mixture Model Training in Arabic Speech Recognition Po-Sen Huang Mark Hasegawa-Johnson University of Illinois.

1 Minimum Bayes-risk Methods in Automatic Speech Recognition Vaibhava Geol And William Byrne IBM ； Johns Hopkins University 2003 by CRC Press LLC 2005/4/26.

Build MT systems with Moses MT Marathon Americas 2016 Hieu Hoang.

Arnar Thor Jensson Koji Iwano Sadaoki Furui Tokyo Institute of Technology Development of a Speech Recognition System For Icelandic Using Machine Translated.

Approaches to Machine Translation

Speaker : chia hua Authors : Long Qin, Ming Sun, Alexander Rudnicky

Suggestions for Class Projects

Statistical Machine Translation Part III – Phrase-based SMT / Decoding

Using Translation Memory to Speed up Translation Process

Eiji Aramaki* Sadao Kurohashi* * University of Tokyo

Approaches to Machine Translation

Machine Translation and MT tools: Giza++ and Moses

Memory-augmented Chinese-Uyghur Neural Machine Translation

Sadov M. A. , NRU HSE, Moscow, Russia Kutuzov A. B

Machine Translation and MT tools: Giza++ and Moses

Speaker Identification:

Presentation transcript:

Coşkun Mermer, Hamza Kaya, Mehmet Uğur Doğan National Research Institute of Electronics and Cryptology (UEKAE) The Scientific and Technological Research Council of Turkey (TÜBİTAK) Gebze, Kocaeli, Turkey {coskun, hamzaky, The TÜBİTAK-UEKAE Statistical Machine Translation System for IWSLT 2007

Outline System Description Training –Phrase table augmentation Decoding –Out of Vocabulary Words (OOV) Results Conclusion and Future Work

System Description Participated in translation tasks –Arabic-to-English –Japanese-to-English Built on phrase-based SMT software Moses Used only supplied data and Buckwalter Arabic Morphological Analyzer (BAMA)

System Description

Outline System Description Training –Phrase table augmentation Decoding –Out of Vocabulary Words (OOV) Results Conclusion and Future Work

Training Devset1-3 are included in the training with all 16 reference segments Train and Devset1-3 are given equal weight Language models –3-gram for AR-EN –4-gram for JP-EN –Trained with modified Kneser-Ney discounting and interpolation

Training Multi-sentence segments are split Before splittingAfter splitting AR-EN44,164 *49,318 JP-EN64,145 *71,435 * train segments + 16 * dev1-3 segments

Training Parameter tuning –Manually tested different set of parameters –Different data favored different parameters –Instead of selecting argmax, selected mode in a desirable interval to select a robust set of parameters

Phrase Table Augmentation Translation model is represented in a phrase table Bi-directional alignment and phrase extraction with grow-diag-final-and heuristics Source-language words without a one-word entry in phrase table are listed The words, which are in the list and have a lexical translation probability above a threshold in GIZA++ word alignment, are added to phrase list

Phrase Table Augmentation CorpusAR-ENJP-EN Source vocabulary size18,75112,699 Number of entries in the original phrase table 408,052606,432 Number of source vocabulary words without a one-word entry in the original phrase table 8,0356,302 Number of one-word bi-phrases added to the phrase table 21,43923,396 Number of entries in the augmented phrase-table 429,491629,828

Outline System Description Training –Phrase table augmentation Decoding –Out of Vocabulary Words (OOV) Results Conclusion and Future Work

Decoding Decoding is done on tokenized and punctuated data –Source-side punctuation insertion (for ASR data) –Target-side case restoration SRILM tools used for punctuation restoration

Decoding NDevset4Devset5 AR-EN JP-EN Merged 10 sentences to train punctuation restorer with more internal sentence boundaries

Out of Vocabulary Words Lexical Approximation –Find a set of candidate approximations –Select the candidate with least edit distance –In case of a tie, more frequently used candidate is chosen

Out of Vocabulary Words Arabic lexical approximation (2 pass) –Morphological root(s) of the word found by feature function using BAMA –If not, skeletonized version of the word is found by feature function Japanese lexical approximation (1 pass) –Right-truncations of the word is found by feature function

Out of Vocabulary Words Run-time Lexical Approximation AR-EN Devset4Devset5 # of OOVsBLEU# of OOVSBLEU Original After LA# After LA# JP-EN Devset4Devset5 # of OOVsBLEU# of OOVSBLEU Original After LA

Out of Vocabulary Words hl hw mjAny~ ?Is it mjAny~ ? Without lexical approximation mjAny~ is OOV

Out of Vocabulary Words Lexical approximation finds candidates –mjAnyP, mjAnY, mjAnA, kjm, mjAny, mjAnAF mjAny has an edit distance of 1, so it’s selected

Out of Vocabulary Words hl hw mjAny ?Is it free ? After lexical approximation

Outline System Description Training –Phrase table augmentation Decoding –Out of Vocabulary Words (OOV) Results Conclusion and Future Work

Results Clean TranscriptASR Output AR-EN JP-EN

Clean vs. ASR Possible causes of performance drop in ASR condition –Recognition errors of ASR –Punctuation restorer performance –Parameter tuning for clean transcript but not for ASR output

AR-EN vs. JP-EN Possible causes of higher performance drop in AR-EN than JP-EN –Lower accuracy of Arabic ASR data than Japanese data –Higher difficulty of punctuation insertion due to higher number of punctuation types –Less reliable punctuation insertion caused by higher recognition error rate

AR-EN vs. JP-EN Lexical approximation is sensitive to recognition errors Clean transcriptASR output Clean-to-ASR degradation Original source % After LA %

Devset4-5 vs. Evaluation Set There is a dramatic variation in the improvement obtained with the lexical approximation technique on the evaluation and development sets

Devset4-5 vs. Evaluation Set Devset4Devset5 Original source After LA# After LA# Improvment2.6%4.5% Evaluation set clean transcript Evaluation set ASR output Original source After LA Improvment27.9%15.6%

Devset4-5 vs. Evaluation Set 167 of 489 evaluation set segments have at least one reference which is a perfect match with a training segment Only 19 of 167 have the source segment exactly the same as in the training set Remaining 148 segments represents a potential to obtain a perfect match

Devset4-5 vs. Evaluation Set Number of segmentsDevset4Devset5Evaluation set Exact match of at least one reference with a segment in the training set Exact math of the source with a segment in the training set 1019 Total

Outline System Description Training –Phrase table augmentation Decoding –Out of Vocabulary Words (OOV) Results Conclusion and Future Work

Make the system more robust to ASR output. For this goal: –Using n-best/lattice ASR output –Tuning system for ASR output –Better punctuation performance

Thank you for your attention!