The University of Washington Machine Translation System for the IWSLT 2007 Competition Katrin Kirchhoff & Mei Yang Department of Electrical Engineering.

Presentation transcript:

The University of Washington Machine Translation System for the IWSLT 2007 Competition Katrin Kirchhoff & Mei Yang Department of Electrical Engineering University of Washington, Seattle, USA

Overview
Two systems: I-EN and AR-EN
Main focus: exploring the use of out-of-domain data and Arabic word segmentation
Outline: basic MT system; Italian-English system; out-of-domain data experiments; Arabic-English system; semi-supervised Arabic segmentation

Basic MT System
Phrase-based SMT system
Translation model: log-linear model
Feature functions:
- 2 phrasal translation probabilities
- 2 lexical translation scores
- word count penalty
- phrase count penalty
- LM score
- distortion score
- data source feature
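
The log-linear model scores each hypothesis as a weighted sum of the feature functions listed above. A minimal sketch of that scoring, with hypothetical feature values and placeholder uniform weights (in the real system the weights are tuned by minimum error rate training):

```python
import math

def loglinear_score(features, weights):
    """Log-linear model score: sum_i w_i * h_i(e, f).
    Probability-valued features are used in the log domain."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical feature values for one hypothesis (all names illustrative).
features = {
    "phrase_fwd": math.log(0.4),    # phrasal translation probability p(e|f)
    "phrase_bwd": math.log(0.3),    # phrasal translation probability p(f|e)
    "lex_fwd": math.log(0.5),       # lexical translation score
    "lex_bwd": math.log(0.45),      # lexical translation score, reverse
    "word_penalty": -6,             # word count penalty
    "phrase_penalty": -3,           # phrase count penalty
    "lm": math.log(1e-4),           # language model score
    "distortion": -2,               # distortion score
    "data_source": 1.0,             # binary: phrase pair from in-domain corpus
}
weights = {name: 1.0 for name in features}  # uniform placeholder (MERT sets these)

score = loglinear_score(features, weights)
```

During decoding, the hypothesis with the highest such score is selected.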

Basic MT System
Word alignment:
- HMM-based word-to-phrase alignment [Deng & Byrne 2005] as implemented in MTTK
- Comparable in performance to GIZA++
Phrase extraction: method of [Och & Ney 2003]

Basic MT System
Decoding/Rescoring:
- MOSES decoder; minimum error rate training to optimize weights for the first pass (optimization criterion: BLEU)
- First pass: up to 2000 hypotheses/sentence
- Additional features for rescoring: POS-based language model, rank-based feature [Kirchhoff et al., IWSLT 2006]
- → Final 1-best hypothesis
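
Rescoring re-ranks the first-pass n-best list with the extra features. A toy sketch with a hypothetical 3-best list, where "base" stands in for the combined first-pass score, "pos_lm" for the POS-based LM score, and "rank" encodes the first-pass rank as a feature (all names and weights are illustrative):

```python
def rescore_nbest(nbest, weights):
    """Pick the final 1-best hypothesis by rescoring an n-best list
    with additional feature functions."""
    def total(hyp):
        return sum(weights[k] * v for k, v in hyp["features"].items())
    return max(nbest, key=total)

# Hypothetical 3-best list for one input sentence.
nbest = [
    {"text": "where is the station", "features": {"base": -4.1, "pos_lm": -2.0, "rank": 0}},
    {"text": "where is station",     "features": {"base": -4.3, "pos_lm": -1.2, "rank": -1}},
    {"text": "where the station is", "features": {"base": -4.5, "pos_lm": -3.0, "rank": -2}},
]
weights = {"base": 1.0, "pos_lm": 1.0, "rank": 0.5}
best = rescore_nbest(nbest, weights)
```

Here the rescoring features promote the second first-pass hypothesis to final 1-best, which is the point of the second pass.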

Basic MT System
Postprocessing:
- Punctuation restoration: hidden-ngram model [Stolcke & Shriberg 1996]; event set E: words and punctuation signs
- Truecasing: noisy-channel model (SRILM disambig tool)

Basic MT System
Spoken-language input:
- IWSLT 2006: mixed results using confusion network input for spoken language
- No specific provisions for ASR output this year

Italian-English System

Data Resources
Training:
- BTEC (all data sets from previous years, 160k words)
- Europarl corpus (parliamentary proceedings, 17M words)
Development:
- IWSLT 2007 (SITAL) dev set (development only, no training on this set)
- Split into a dev set (500 sentences) and a held-out set (496 sentences)
Other: BTEC name list

Preprocessing
- Sentence segmentation into smaller chunks based on punctuation marks
- Punctuation removal and lowercasing
- Automatic re-tokenization on the English and Italian sides: for the N most frequent many-to-one alignments (N = 20), merge the multiply aligned words into one, e.g. per piacere – please → per_piacere
- Improves noisy word alignments
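
The merging step can be sketched as follows: count many-to-one alignment groups, keep the N most frequent, and join those word sequences with underscores in the corpus (a simplified sketch; the alignment data structure is assumed, not the system's actual format):

```python
from collections import Counter

def merge_frequent_multiword(corpus, alignments, n=20):
    """Re-tokenize: find the n most frequent many-to-one alignments and
    merge the multiply aligned source words into a single token,
    e.g. per piacere -> per_piacere."""
    counts = Counter()
    for src_words, _tgt_word in alignments:
        if len(src_words) > 1:
            counts[tuple(src_words)] += 1
    merge_set = {group for group, _ in counts.most_common(n)}

    merged_corpus = []
    for sent in corpus:
        out, i = [], 0
        while i < len(sent):
            for group in merge_set:
                if tuple(sent[i:i + len(group)]) == group:
                    out.append("_".join(group))   # merge the group
                    i += len(group)
                    break
            else:
                out.append(sent[i])               # keep the word as-is
                i += 1
        merged_corpus.append(out)
    return merged_corpus

# Toy example: "per piacere" is aligned to "please" as a group.
corpus = [["per", "piacere", "un", "caffe"]]
alignments = [(["per", "piacere"], "please")]
merged = merge_frequent_multiword(corpus, alignments)
```

After merging, the word aligner sees per_piacere as a single token, which removes a systematic many-to-one alignment error.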

Translation of names/numbers
- Names in the BTEC name list were translated according to the list, not statistically
- Small set of hand-coded rules to translate dates and times

Out-of-vocabulary words
- To translate OOV words, map OOV forms to all training words that do not differ in length by more than 2 characters
- Mapping done by string alignment
- For all training words with edit distance < 2, duplicate phrase-table entries with the word replaced by the OOV form
- Best-matching entry chosen during decoding
- Captures misspelled or spoken-language-specific forms: senz', undic', quant', etc.
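
A minimal sketch of the candidate search: a standard Levenshtein distance plus the two filters from the slide (length difference at most 2, edit distance below 2). The function and parameter names are illustrative, not the system's:

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming (single-row variant)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            # dp[j]: delete from a; dp[j-1]: insert into a; prev: substitute/match
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

def oov_candidates(oov, vocab, max_len_diff=2, max_dist=2):
    """Map an OOV form to known training words: length differs by at most
    max_len_diff characters and edit distance is strictly below max_dist.
    Phrase-table entries for the matched words would then be duplicated
    with the OOV form substituted."""
    return [w for w in vocab
            if abs(len(w) - len(oov)) <= max_len_diff
            and edit_distance(oov, w) < max_dist]
```

For example, the elided spoken form senz' maps to the training word senza (edit distance 1), so its phrase-table entries become usable.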

Using out-of-domain data
Experience in IWSLT 2006:
- Additional English data for the language model did not help
- Out-of-domain data for the translation model (IE) was useful for text but not for ASR output
Adding data: a second phrase table trained on the Europarl corpus
- Phrase tables from different sources are used jointly
- Each entry has a data source feature: a binary feature indicating the corpus it came from
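
The joint use of two phrase tables can be sketched as simple concatenation, with the binary data-source feature attached to each entry so the log-linear model can learn to trust in-domain pairs more (the phrase-table representation here is a hypothetical simplification, not the Moses on-disk format):

```python
def combine_phrase_tables(in_domain, out_of_domain):
    """Use two phrase tables jointly. Each entry carries a binary
    data-source feature: 1 = in-domain (BTEC), 0 = out-of-domain (Europarl)."""
    combined = []
    for (src, tgt), feats in in_domain.items():
        combined.append((src, tgt, dict(feats, data_source=1)))
    for (src, tgt), feats in out_of_domain.items():
        combined.append((src, tgt, dict(feats, data_source=0)))
    return combined

# Toy tables: {(source_phrase, target_phrase): feature_dict}
btec = {("per", "for"): {"p": -0.9}}
europarl = {("per", "per"): {"p": -1.5}}
combined = combine_phrase_tables(btec, europarl)
```

The weight on data_source is then tuned like any other feature weight, so the decoder balances domain match against the broader coverage of the larger corpus.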

Using out-of-domain data
Phrase coverage (%) for phrase lengths 1-5 and translation results on the IE 2007 dev set (BLEU/PER):
BTEC: /55.1
Europarl: /55.4
combined: /53.5

Using out-of-domain data
Diagnostic experiments: importance of matched data vs. data size
- 500 sentences from SITAL: small amount of matched data
- BTEC training corpus: moderate amount of domain-related but stylistically different data
- Europarl: large amount of mismatched data
- BTEC & Europarl: combination of sources
- BTEC & Europarl & SITAL

Using out-of-domain data
Performance on held-out SITAL data (%BLEU/PER):
Training set               | Text   | ASR output
500 sentences from SITAL   | 28.0/  | /49.0
BTEC training corpus       | 18.9/  | /57.1
Europarl                   | 18.5/  | /56.4
BTEC & Europarl            | 20.7/  | /55.3
BTEC & Europarl & SITAL    | 30.1/  | /44.8

Effect of different techniques (BLEU(%)/PER):
Baseline: 18.9/55.1
OOD data: 20.7/53.4
Rescoring: 22.0/52.7
Rule-based number trans.: 22.6/52.2
Postprocessing: 21.2/50.4

Arabic-English System

Data Resources
Training:
- BTEC data, without the dev4 and dev5 sets (160k words)
- LDC parallel text (Newswire, MTA, ISI) (5.5M words)
Development: BTEC dev4 + dev5 sets
Other: Buckwalter stemmer

Preprocessing
- Chunking based on punctuation marks
- Conversion to Buckwalter format
- Tokenization: linguistic tokenization, semi-supervised tokenization

Linguistic Tokenization
Columbia University MADA/TOKAN tools:
- Buckwalter stemmer suggests different morphological analyses for each word
- Statistical disambiguation based on context
- Splitting off of word-initial particles and the definite article
Drawbacks: requires much human labour; difficult to port to new dialects/languages

Semi-supervised tokenization
Based on [Yang et al. 2007]: segmentation for dialectal Arabic
- Start from an affix list plus a set of segmented words
- Produce an initial segmentation of the unsegmented text
- Derive a new list of segmented words and stems, and iterate
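
A toy sketch of that loop, assuming a greedy affix-stripping segmenter; this is a stand-in for illustration only, not the actual algorithm of [Yang et al. 2007], and the Buckwalter-style strings in the example are illustrative:

```python
def segment(word, prefixes, suffixes, stems):
    """Split off one known prefix or suffix when the remainder is a known stem;
    otherwise leave the word unsegmented. Longest affixes are tried first."""
    for p in sorted(prefixes, key=len, reverse=True):
        if word.startswith(p) and word[len(p):] in stems:
            return [p + "+", word[len(p):]]
    for s in sorted(suffixes, key=len, reverse=True):
        if word.endswith(s) and word[:-len(s)] in stems:
            return [word[:-len(s)], "+" + s]
    return [word]

def bootstrap_stems(words, prefixes, suffixes, seed_stems):
    """One bootstrap pass over unsegmented text: a word whose affix-stripped
    remainder occurs elsewhere as a full word (or is already a stem) is
    evidence for a new stem, extending the list for the next pass."""
    vocab, stems = set(words), set(seed_stems)
    for w in vocab:
        for p in prefixes:
            r = w[len(p):]
            if w.startswith(p) and r and (r in vocab or r in stems):
                stems.add(r)
        for s in suffixes:
            r = w[:-len(s)]
            if w.endswith(s) and r and (r in vocab or r in stems):
                stems.add(r)
    return stems

# Toy Buckwalter-style example: Al- prefix (definite article), -h suffix.
stems = bootstrap_stems(["AlktAb", "ktAb", "ktAbh"], ["Al"], ["h"], [])
```

Repeating the pass with the grown stem list segments progressively more of the text, which is why the method needs only a small seed of segmented words.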

Semi-supervised tokenization
- Needs fewer initial resources
- Produces potentially more consistent segmentations
- Once trained, much faster than linguistic tools
Results (BLEU/PER):
MADA/TOKAN: 22.5/50.5
SemiSup, initialized on MSA: 23.0/50.7
SemiSup, initialized on Iraqi Arabic: 21.6/51.1

Effect of out-of-domain data
Phrase coverage rate (%) and translation performance on the dev5 set, clean text (BLEU/PER):
BTEC: /50.5
News combined: /47.5

Effect of different techniques (dev5 set, clean text; BLEU(%)/PER):
Baseline: 22.5/50.5
OOD data: 24.6/48.2
Rescoring: 24.6/47.5
Postprocessing: 23.4/48.5

Evaluation results
Performance on the 2007 evaluation sets (BLEU(%)):
IE – clean text: 26.5
IE – ASR output: 25.4
AE – clean text: 41.6
AE – ASR output: 40.9

Conclusions
- Adding out-of-domain data helped in both clean-text and ASR conditions
- Importance of stylistically matched data
- Semi-supervised word segmentation for Arabic is comparable to supervised segmentation and uses fewer resources
- Cross-dialectal bootstrapping of the segmenter is possible