GIZA++: A review of how to run GIZA++ By: Bridget McInnes


GIZA++: A review of how to run GIZA++ By: Bridget McInnes October 25, 2002 Email: bthomson@d.umn.edu

Packages Needed to Run GIZA++
- GIZA++ package, developed by Franz Och: www-i6.informatik.rwth-aachen.de/Colleagues/och
- mkcls package: www-i6.informatik.rwth-aachen.de/Colleagues/och
- Aligned Hansard of the Canadian Parliament, provided by the Natural Language Group of the USC Information Sciences Institute: www.isi.edu/natural-language/download/hansard/index.html

Step 1. Retrieve data: Aligned Hansard of the Parliament of Canada
  cat hansard.36.2.senate.debates.2000-04-04.42.e > english
  cat hansard.36.2.senate.debates.2000-04-04.42.f > french

Step 2. Create the files needed for GIZA++: run plain2snt.out, located within the GIZA++ package
  ./plain2snt.out french english
Files created by plain2snt:
- english.vcb
- french.vcb
- frenchenglish.snt

Files Created by plain2snt
- english.vcb consists of:
  - each word from the English corpus
  - a corresponding frequency count for each word
  - a unique id for each word
- french.vcb consists of:
  - each word from the French corpus (likewise with frequency counts and unique ids)
- frenchenglish.snt consists of:
  - each sentence from the parallel English and French corpora, encoded as the unique id of each word
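As a rough illustration of what plain2snt does, here is a minimal Python sketch (not the actual tool) that reproduces the id assignment and .snt encoding described above; in real use the frequency counts come from the whole corpus, not a single sentence:

```python
from collections import Counter

def build_vcb(tokens):
    """Assign GIZA++-style integer ids to words in order of first
    occurrence (ids start at 2, matching the .vcb examples) and
    count word frequencies."""
    counts = Counter(tokens)
    ids = {}
    next_id = 2
    for tok in tokens:
        if tok not in ids:
            ids[tok] = next_id
            next_id += 1
    return ids, counts

english = "Debates of the Senate (hansard)".split()
french = "Debats du Senat (hansard)".split()

en_ids, en_counts = build_vcb(english)
fr_ids, fr_counts = build_vcb(french)

# .snt layout: for each sentence pair, three lines -- a weight
# (usually 1), the source sentence as ids, the target sentence as ids.
snt = "1\n{}\n{}\n".format(
    " ".join(str(fr_ids[w]) for w in french),
    " ".join(str(en_ids[w]) for w in english),
)
print(snt)
```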

Example of .vcb and .snt files
french.vcb:
  2 Debats 4
  3 du 767
  4 Senat 5
  5 (hansard) 1
english.vcb:
  2 Debates 4
  3 of 1658
  4 the 3065
  5 Senate 107
  6 (hansard) 1
frenchenglish.snt:
  1
  2 3 4 5
  2 3 4 5 6
  …

Step 3. Create the mkcls files needed for GIZA++: run mkcls, which is not located within the GIZA++ package
  ./mkcls -penglish -Venglish.vcb.classes
  ./mkcls -pfrench -Vfrench.vcb.classes
Files created by mkcls:
- english.vcb.classes
- english.vcb.classes.cats
- french.vcb.classes
- french.vcb.classes.cats

Files Created by the mkcls package
- .vcb.classes files contain:
  - an alphabetical list of all words (including punctuation)
  - each word's corresponding class number
- .vcb.classes.cats files contain:
  - a list of class numbers
  - the set of words belonging to each class
.vcb.classes ex:
  "A 99
  "Canadian 82
  "Clarity 87
  "Do 78
  "Forging 96
  "General 81
.vcb.classes.cats ex:
  82: "Canadian, "sharp, 1993, …
  87: "Clarity, "grants, 1215, …
  99: "A, 1913, Christian, …

Step 4. Run GIZA++, located within the GIZA++ package
  ./GIZA++ -S french.vcb -T english.vcb -C frenchenglish.snt
Files created by GIZA++:
- Decoder.config
- ti.final, actual.ti.final
- perp
- trn.src.vcb, trn.trg.vcb, tst.src.vcb, tst.trg.vcb
- a3.final, A3.final, t3.final, d3.final, d4.final, D4.final, n3.final, p0_3.final
- gizacfg

Files Created by the GIZA++ package
- Decoder.config: file used with the ISI ReWrite Decoder developed by Daniel Marcu and Ulrich Germann, www.isi.edu/~germann/software/ReWrite-Decoder
- trn.src.vcb: list of French words with their unique ids and frequency counts (similar to french.vcb)
- trn.trg.vcb: list of English words with their unique ids and frequency counts (similar to english.vcb)
- tst.src.vcb: blank
- tst.trg.vcb: blank

(cont.) Files Created by the GIZA++ package
- ti.final: contains word alignments from the French and English corpora; alignments are given as the words' unique ids, and the probability of each alignment follows the pair of ids
  ex: 3 0 0.237882
      1171 1227 0.963072
- actual.ti.final: alignments are given as the actual words, not their unique ids, and the probability of each alignment follows the pair of words
  ex: of NULL 0.237882
      Austin Austin 0.963072
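To browse a large actual.ti.final more easily, one can keep only the highest-probability pairing per word. This is a hypothetical helper, assuming the three-column "word word probability" layout shown above; the third sample line is invented for illustration:

```python
def best_translations(lines):
    """From 'word word probability' lines (as in actual.ti.final),
    keep the highest-probability pairing for each first-column word."""
    best = {}
    for line in lines:
        first, second, prob = line.split()
        prob = float(prob)
        if first not in best or prob > best[first][1]:
            best[first] = (second, prob)
    return best

sample = [
    "of NULL 0.237882",
    "Austin Austin 0.963072",
    "of the 0.401123",  # hypothetical extra entry for illustration
]
print(best_translations(sample))
```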

(cont.) Files Created by the GIZA++ package
- A3.final: matches the English sentence to the French sentence and gives the match an alignment score
  ex: # Sentence pair (1) source length 4 target length 5 alignment score : 0.000179693
      Debates of the Senate (Hansard)
      NULL ({ 3 }) Debats ({ 1 }) du ({ 2 }) Senat ({ 4 }) (hansard) ({ 5 })
- perp: list of the perplexity for each iteration and model
  ex: #trnsz tstsz iter model trn-pp test-pp trn-vit-pp tst-vit-pp
      2304 0 0 Model1 10942.2 N/A 132172 N/A
  where trnsz = training size, tstsz = test size, iter = iteration, trn-pp = training perplexity, tst-pp = test perplexity, trn-vit-pp = training Viterbi perplexity, tst-vit-pp = test Viterbi perplexity
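An A3.final alignment line can be pulled apart with a small regular expression; a sketch assuming the `word ({ positions })` layout shown above:

```python
import re

def parse_alignment(line):
    """Split a GIZA++ A3.final alignment line such as
    'NULL ({ 3 }) Debats ({ 1 }) ...' into (word, [aligned positions])
    pairs."""
    pairs = []
    for word, inside in re.findall(r"(\S+) \(\{([\d ]*)\}\)", line):
        pairs.append((word, [int(p) for p in inside.split()]))
    return pairs

line = "NULL ({ 3 }) Debats ({ 1 }) du ({ 2 }) Senat ({ 4 }) (hansard) ({ 5 })"
print(parse_alignment(line))
```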

(cont.) Files Created by the GIZA++ package
- a3.final: contains a table with the following format: i j l m p(i | j, l, m)
  - i = position in the source sentence
  - j = position in the target sentence
  - l = length of the source sentence
  - m = length of the target sentence
  - p(i | j, l, m) = the probability that a source word in position i is moved to position j in a pair of sentences of lengths l and m
  ex: 0 1 1 60 5.262135e-06
      here 0 is the position in the target sentence, 1 the position in the source sentence, 1 the length of the source sentence, 60 the length of the target sentence, and 5.262135e-06 the probability that the source word in position 1 is moved to position 0 in sentences of lengths 1 and 60
- d3.final: similar to a3.final with positions i and j switched
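A quick way to query such a table is to index it by its position/length tuple; a hypothetical helper assuming the five-column "i j l m p" layout described above:

```python
def load_a3_table(lines):
    """Index 'i j l m p' lines (as in a3.final) by the
    (i, j, l, m) tuple, mapping to the probability p."""
    table = {}
    for line in lines:
        i, j, l, m, p = line.split()
        table[(int(i), int(j), int(l), int(m))] = float(p)
    return table

sample = ["0 1 1 60 5.262135e-06"]
table = load_a3_table(sample)
print(table[(0, 1, 1, 60)])
```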

(cont.) Files Created by the GIZA++ package
- n3.final: contains the probability of each source token having zero fertility, one fertility, …, N fertility
- t3.final: translation table after all iterations of Model 4 training
- d4.final: distortion table for Model 4
- D4.final: distortion table for IBM-4
- gizacfg: contains the parameter settings that were used in this training run, so the training can be duplicated exactly
- p0_3.final: probability of inserting NULL after a source word; the file contains a single value, e.g. 0.781958

References
- MSc Project Weblog, written by Chris Callison-Burch: www-csli.stanford.edu/~ccb/msc_blog.html