Lecture 8: Statistical Alignment & Machine Translation (Chapter 13 of Manning & Schütze)
Wen-Hsiang Lu (盧文祥), Department of Computer Science and Information Engineering, National Cheng Kung University, 2014/05/12
(Slides from Berlin Chen, National Taiwan Normal University, http://140.122.185.120/PastCourses/2004S-NaturalLanguageProcessing/NLP_main_2004S.htm)
References:
Brown, P., Della Pietra, S. A., Della Pietra, V. J., and Mercer, R. L. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 1993.
Knight, K. Automating Knowledge Acquisition for Machine Translation. AI Magazine, 1997.
Gale, W. A. and Church, K. W. A Program for Aligning Sentences in Bilingual Corpora. Computational Linguistics, 1993.

Machine Translation (MT)
Definition: automatic translation of text or speech from one language to another.
Goal: produce close to error-free output that reads fluently in the target language. Current systems are still far from it.
Current status: existing systems are not yet good in quality, e.g., Babel Fish Translation (before 2004) and Google Translation, a mix of probabilistic and non-probabilistic components.

Issues
- Build high-quality, semantics-based MT systems only in circumscribed domains.
- Abandon fully automatic MT and build software that assists human translators instead.
- Have humans post-edit the output of an imperfect MT system.
- Develop automatic knowledge acquisition techniques (supervised or unsupervised learning) for improving general-purpose MT.

Different Strategies for MT
(Figure: the translation pyramid. Transfer can happen at increasing depths: word-for-word transfer between word strings (English text to French text), syntactic transfer between syntactic parses, semantic transfer between semantic representations, and knowledge-based translation through an interlingua, i.e., a knowledge representation.)

Word-for-Word MT
Translate words one by one from one language to another.
Problems:
1. There is no one-to-one correspondence between words in different languages (lexical ambiguity), so we need to look at context larger than the individual word (phrase or clause). E.g., English "suit" has two meanings with different French translations: lawsuit (訴訟) and set of garments (服裝).
2. Languages have different word orders.

Syntactic Transfer MT
Parse the source text, transfer the parse tree of the source text into a syntactic tree in the target language, and then generate the translation from this syntactic tree.
This solves the word-ordering problem.
Problems:
- Syntactic ambiguity.
- The target syntax will likely mirror that of the source text. E.g., German "Ich esse gern" (N V Adv) comes out as "I eat readily/gladly" instead of the natural English "I like to eat".

Text Alignment: Definition
Align paragraphs, sentences, or words in one language to paragraphs, sentences, or words in another language.
We can thus learn which words tend to be translated by which other words in the other language (bilingual dictionaries, parallel grammars, ...).
Text alignment is not part of the MT process per se, but it is the obligatory first step for making use of multilingual text corpora.
Applications: bilingual lexicography, machine translation, multilingual information retrieval, ...

Text Alignment: Sources and Granularities
Sources of parallel texts (bitexts): parliamentary proceedings (e.g., the Hansards, the official records of parliamentary debates), newspapers and magazines, religious and literary works.
Two levels of alignment:
- Gross large-scale alignment: learn which paragraphs or sentences correspond to which paragraphs or sentences in the other language.
- Word alignment: learn which words tend to be translated by which words in the other language; the necessary step for acquiring a bilingual dictionary. With less literal translation, word or sentence order might not be preserved.

Text Alignment: Example 1 (figure).

Text Alignment: Example 2
A bead is a unit of sentence alignment (a small group of mutually aligned sentences). Studies show that around 90% of alignments are 1:1 sentence alignments.

Sentence Alignment
Crossing dependencies are not allowed here: the order of sentences is preserved. (Table: related work on sentence alignment.)

Sentence Alignment
Three families of methods: length-based, lexicon-guided, offset-based.

Sentence Alignment: Length-Based Method
Rationale: short sentences tend to be translated as short sentences and long sentences as long sentences. Length is defined as the number of words or the number of characters.
Approach 1 (Gale & Church 1993). Assumptions:
- The paragraph structure is clearly marked in the corpus; confusions are checked by hand.
- Lengths of sentences are measured in characters.
- Crossing dependencies are not handled.
- The order of sentences is not changed in the translation.
Corpus: the Union Bank of Switzerland (UBS) corpus in English, French, and German.
Note: the method ignores the rich lexical information available in the text.

Sentence Alignment: Length-Based Method (cont.)
(Figure: source sentences s1 ... sI and target sentences t1 ... tJ grouped into beads B1, B2, B3, ....)
Possible bead types: {1:1, 1:0, 0:1, 2:1, 1:2, 2:2, ...}. The model assumes probabilistic independence between the beads Bk, so the probability of an alignment is the product of the probabilities of its beads.

Sentence Alignment: Length-Based Method (cont.)
The best alignment is found by dynamic programming, with the sentence as the unit of alignment and a per-bead cost function (distance measure). Character lengths are modeled statistically: by Bayes' law,

cost(bead) = -log P(α align) - log P(δ | α align),

where α is the bead type and δ is the normalized length difference of the two text segments, computed from the squared difference of the segment lengths and the ratio of text lengths in the two languages; δ is assumed to follow a standard normal distribution, so P(δ | α align) comes from the standard normal distribution. A minimal sketch follows below.
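
A minimal Python sketch of this length-based cost and DP search. The bead priors and the parameters c = 1 and s² = 6.8 are the values reported by Gale & Church (1993); the even split of their combined bead categories and all function names are illustrative simplifications, not the authors' code.

```python
import math

# Gale & Church (1993) parameters: expected ratio c of target to source
# characters, and variance s2 of the length difference per character.
C, S2 = 1.0, 6.8

# Bead-type priors from the paper; the combined 1:0/0:1 and 2:1/1:2
# categories are split evenly here for simplicity.
PRIOR = {(1, 1): 0.89, (1, 0): 0.00495, (0, 1): 0.00495,
         (2, 1): 0.0445, (1, 2): 0.0445, (2, 2): 0.011}

def length_cost(ls, lt):
    """-log P(delta | bead): cost of aligning ls source characters
    with lt target characters under a normal model of delta."""
    delta = (lt - ls * C) / math.sqrt(max(ls, 1) * S2)  # guard ls = 0
    # two-tailed probability that |Z| >= |delta| for standard normal Z
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(delta) / math.sqrt(2.0))))
    return -math.log(max(p, 1e-12))

def align(src, tgt):
    """src, tgt: lists of sentence lengths in characters.
    Returns the minimum-cost sequence of bead types via DP."""
    I, J = len(src), len(tgt)
    best = {(0, 0): (0.0, None)}  # (i, j) -> (cost, backpointer)
    for i in range(I + 1):
        for j in range(J + 1):
            if (i, j) == (0, 0):
                continue
            cands = []
            for (di, dj), prior in PRIOR.items():
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0 or (pi, pj) not in best:
                    continue
                c = (best[(pi, pj)][0]
                     + length_cost(sum(src[pi:i]), sum(tgt[pj:j]))
                     - math.log(prior))
                cands.append((c, (pi, pj, di, dj)))
            if cands:
                best[(i, j)] = min(cands)
    beads, ij = [], (I, J)
    while best[ij][1] is not None:   # backtrace
        pi, pj, di, dj = best[ij][1]
        beads.append((di, dj))
        ij = (pi, pj)
    return list(reversed(beads))

# usage on toy character lengths
print(align([120, 35, 80], [118, 40, 85]))  # expect [(1, 1), (1, 1), (1, 1)]
```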

Sentence Alignment: Length-Based Method (cont.)
Each bead type α has a prior probability P(α align), with 1:1 beads by far the most likely. (Figure: the DP extends a partial alignment ending at source sentence si and target sentence tj from predecessors such as si-1, si-2 and tj-1, tj-2.)

Sentence Alignment: Length-Based Method (cont.)
A simple example with source sentences s1 ... s4 and target sentences t1 ... t3:
Alignment 1: cost(align(s1, t1)) + cost(align(s2, t2)) + cost(align(s3, Ø)) + cost(align(s4, t3))
Alignment 2: cost(align({s1, s2}, t1)) + cost(align(s3, t2)) + cost(align(s4, t3))

Sentence Alignment: Length-Based Method (cont.)
A 4% error rate was achieved.
Problems:
- It cannot handle noisy and imperfect input, e.g., OCR output or files containing unknown markup conventions, where finding paragraph or sentence boundaries is difficult. Solution: just align text (position) offsets in the two parallel texts (Church 1993).
- It is questionable for languages with few cognates (同源) or with different writing systems, e.g., English vs. Chinese, or eastern European vs. Asian languages.

Sentence Alignment: Lexical Method
Approach 1 (Kay & Röscheisen 1993):
1. Assume the first and last sentences of the texts are aligned; these are the initial anchors.
2. Form an envelope of possible alignments; exclude alignments in which sentences cross anchors or whose respective distances from an anchor differ greatly.
3. Choose word pairs whose distributions are similar in most of the sentences.
4. Find pairs of source and target sentences that contain many possible lexical correspondences; the most reliable pairs are used to induce a set of partial alignments and are added to the list of anchors.
Then iterate.

Word Alignment
The sentence/offset alignment can be extended to a word alignment. Some criteria are then used to select aligned word pairs for inclusion in the bilingual dictionary: frequency of word correspondences, association measures, ....

Statistical Machine Translation
The noisy channel model. Assumptions:
- An English word can be aligned with multiple French words, while each French word is aligned with at most one English word.
- The individual word-to-word translations are independent.
The model has three parts: a language model, a translation model, and a decoder. Notation: e is an English sentence with |e| = l; f is a French sentence with |f| = m.

Statistical Machine Translation (cont.)
Three important components are involved:
- Decoder: searches for the English sentence ê that maximizes the product shown below.
- Language model: gives the prior probability Pr(e).
- Translation model: gives the translation probability Pr(f | e) as a normalized sum over all possible alignments, where an alignment specifies the English word that each French word fj is aligned with.
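
In symbols (a reconstruction following Brown et al. 1993, whose formulation these slides track), the three components fit together as:

```latex
\hat{e} \;=\; \arg\max_{e} \Pr(e \mid f)
        \;=\; \arg\max_{e} \Pr(e)\,\Pr(f \mid e),
\qquad
\Pr(f \mid e) \;=\; \sum_{a} \Pr(f, a \mid e).
```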

An Introduction to Statistical Machine Translation Dept. of CSIE, NCKU Yao-Sheng Chang Date: 2011.04.12

Introduction
Translation involves many cultural aspects; here we consider only the translation of individual sentences into acceptable sentences. Every sentence in one language is a possible translation of any sentence in the other, so we assign to each pair (S, T) a probability Pr(T | S): the probability that a translator will produce T in the target language when presented with S in the source language.

Statistical Machine Translation (SMT): the noisy channel problem.

Fundamentals of SMT
Given a string of French f, the job of our translation system is to find the string e that the speaker had in mind when producing f. By Bayes' theorem,

Pr(e | f) = Pr(e) Pr(f | e) / Pr(f).

Since the denominator Pr(f) is constant, the best e is the one with the greatest value of Pr(e) Pr(f | e).

Practical Challenges
- Computation of the translation model Pr(f | e)
- Computation of the language model Pr(e)
- Decoding (i.e., searching for the e that maximizes Pr(f | e) Pr(e))

Alignment of case 1 (one-to-many)

Alignment of case 2 (many-to-one)

Alignment of case 3 (many-to-many)

Formulation of Alignment (1)
Let e = e_1^l = e1 e2 ... el and f = f_1^m = f1 f2 ... fm. An alignment between the pair of strings e and f maps every word fj to some word ei: an alignment a between e and f says that the word fj, 1 ≤ j ≤ m, is generated by the word e_aj, where aj ∈ {0, 1, ..., l} (aj = 0 denoting the null word, i.e., no mapping). So if aj = i, then fj is generated by e_aj = ei. There are (l+1)^m different alignments between e and f.

Translation Model
The alignment a can be represented by a series a_1^m = a1 a2 ... am of m values, each between 0 and l, such that if the word in position j of the French string is connected to the word in position i of the English string then aj = i, and if it is not connected to any English word then aj = 0 (null). The translation model supplies Pr(f, a | e).

IBM Model 1 (1)
In Model 1, every alignment is equally probable and each French word is translated independently given only the English word it is aligned with, so Pr(f, a | e) reduces to a product of word-translation probabilities t(fj | e_aj); the formula is spelled out in the block after the next slide.

IBM Model 1 (2)
The alignment is determined by specifying the values of aj for j from 1 to m, each of which can take any value from 0 to l. Summing over all (l+1)^m alignments gives Pr(f | e) = ∑_a Pr(f, a | e), as shown below.
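
Written out (reconstructed from Brown et al. 1993; ε is a normalization constant):

```latex
\Pr(f, a \mid e) \;=\; \frac{\varepsilon}{(l+1)^{m}} \prod_{j=1}^{m} t(f_j \mid e_{a_j}),
\qquad
\Pr(f \mid e) \;=\; \frac{\varepsilon}{(l+1)^{m}} \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l}\; \prod_{j=1}^{m} t(f_j \mid e_{a_j}).
```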

Constrained Maximization
We wish to adjust the translation probabilities so as to maximize Pr(f | e) subject to the constraint that, for each e, ∑_f t(f | e) = 1.

Lagrange Multipliers (1)
Method of Lagrange multipliers (拉格朗日乘數法) with one constraint: if there is a maximum or minimum of f(x, y) subject to the constraint g(x, y) = 0, then it occurs at one of the critical points of the function F defined by

F(x, y, λ) = f(x, y) + λ g(x, y),

where f(x, y) is the objective function, g(x, y) = 0 is the constraint equation, F(x, y, λ) is the Lagrange function, and λ is the Lagrange multiplier.

Lagrange Multipliers (2)
Following standard practice for constrained maximization, we introduce Lagrange multipliers λe and seek an unconstrained extremum of the auxiliary function

h(t, λ) = Pr(f | e) - ∑_e λe (∑_f t(f | e) - 1).

Derivation (1)
The partial derivative of h with respect to t(f | e) is

∂h/∂t(f | e) = ∑_a Pr(f, a | e) t(f | e)^(-1) ∑_{j=1}^{m} δ(f, fj) δ(e, e_aj) - λe,

where δ is the Kronecker delta function, equal to one when both of its arguments are the same and equal to zero otherwise.

Derivation (2)
We call the expected number of times that e connects to f in the translation (f | e) the count of f given e, and denote it by c(f | e; f, e). By definition,

c(f | e; f, e) = ∑_a Pr(a | f, e) ∑_{j=1}^{m} δ(f, fj) δ(e, e_aj).

Derivation (3)
Replacing λe by λe Pr(f | e), Equation (11) can be written very compactly as

t(f | e) = λe^(-1) c(f | e; f, e).

In practice, our training data consists of a set of translations (f^(1) | e^(1)), (f^(2) | e^(2)), ..., (f^(S) | e^(S)), so this equation becomes

t(f | e) = λe^(-1) ∑_{s=1}^{S} c(f | e; f^(s), e^(s)).

Derivation (4)
For Model 1 the sum over alignments factorizes, giving an expression that can be evaluated efficiently:

Pr(f | e) = ε/(l+1)^m ∏_{j=1}^{m} ∑_{i=0}^{l} t(fj | ei),

so the count becomes

c(f | e; f, e) = t(f | e) / (t(f | e0) + ... + t(f | el)) ∑_{j=1}^{m} δ(f, fj) ∑_{i=0}^{l} δ(e, ei).

Derivation (5)
Thus, the number of operations necessary to calculate a count is proportional to l + m, rather than to (l + 1)^m as in Equation (12).
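
Putting Derivations (1)-(5) together gives the familiar EM procedure for Model 1. Below is a minimal, self-contained Python sketch under simplifying assumptions: uniform initialization, a 'NULL' token standing in for the empty word e0, and ε dropped since it cancels in the normalization. Variable and function names are illustrative.

```python
from collections import defaultdict

def train_model1(bitext, iterations=10):
    """EM training of IBM Model 1 translation probabilities t(f|e).
    bitext: list of (english_words, french_words) sentence pairs."""
    f_vocab = {f for _, fs in bitext for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform start
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f|e)
        total = defaultdict(float)   # normalizers, one per e
        for es, fs in bitext:
            es = ['NULL'] + es       # allow alignment to the empty word
            for f in fs:
                # z = sum_i t(f|e_i): the efficient l + m evaluation
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalize so that sum_f t(f|e) = 1 for each e
        t = defaultdict(lambda: 1e-12,
                        {(f, e): count[(f, e)] / total[e]
                         for (f, e) in count})
    return t

# usage sketch on a toy corpus
bitext = [("the house".split(), "la maison".split()),
          ("the book".split(), "le livre".split()),
          ("a book".split(), "un livre".split())]
t = train_model1(bitext)
print(round(t[("maison", "house")], 3))  # high after training
```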

Other References about MT

Chinese-English Sentence Alignment
林語君 and 高照明 (Zhao-Ming Gao), "A Hybrid Chinese-English Sentence Alignment Algorithm Combining Statistical and Linguistic Information" (結合統計與語言訊息的混合式中英雙語句對應演算法), ROCLING XVI, 2004.
Problem: the boundaries of Chinese sentences are not clear-cut; both the period and the comma can mark a sentence boundary.
Method: a grouped sentence-alignment module using two stages of iterative DP.
Sentence-alignment parameters:
- a broad dictionary of translation words plus a stop list
- partial matching of Chinese translation words
- similarity of the sequences of important punctuation marks within sentences
- anchors from shared numeral words, time words, and words left in the original language
Group-alignment parameters:
- dissimilarity of sentence lengths
- sentence-final punctuation marks are required to be identical

Bilingual Collocation Extraction Based on Syntactic and Statistical Analyses (Chien-Cheng Wu & Jason S. Chang, ROCLING’03)

Bilingual Collocation Extraction Based on Syntactic and Statistical Analyses (Chien-Cheng Wu & Jason S. Chang, ROCLING '03)
Preprocessing steps calculate the following information:
- lists of preferred POS patterns of collocations in both languages;
- collocation candidates matching the preferred POS patterns;
- N-gram statistics for both languages, N = 1, 2;
- log-likelihood ratio statistics for two consecutive words in both languages (see the sketch below);
- log-likelihood ratio statistics for pairs of bilingual collocation candidates across the two languages;
- content-word alignment based on the Competitive Linking Algorithm (Melamed 1997).
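
For the log-likelihood-ratio steps, a compact Python sketch of Dunning's (1993) bigram LLR follows; the helper names are illustrative, and the formulation is the standard two-binomial likelihood ratio.

```python
import math

def _ll(k, n, p):
    """Log-likelihood of k successes in n Bernoulli(p) trials."""
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # clamp to avoid log(0)
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def llr(c12, c1, c2, n):
    """Dunning (1993) log-likelihood ratio for the bigram (w1, w2).
    c12: count of w1 followed by w2; c1, c2: unigram counts; n: corpus size."""
    p = c2 / n                     # P(w2) under independence
    p1 = c12 / c1                  # P(w2 | w1)
    p2 = (c2 - c12) / (n - c1)     # P(w2 | not w1)
    return 2.0 * (_ll(c12, c1, p1) + _ll(c2 - c12, n - c1, p2)
                  - _ll(c12, c1, p) - _ll(c2 - c12, n - c1, p))

print(round(llr(20, 50, 100, 10000), 2))  # a strongly associated pair
```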

A Probability Model to Improve Word Alignment Colin Cherry & Dekang Lin, University of Alberta

Outline
- Introduction
- Probabilistic Word-Alignment Model
- Word-Alignment Algorithm: Constraints, Features

Introduction
Word-aligned corpora are an excellent source of translation-related knowledge in statistical machine translation, e.g., translation lexicons and transfer rules.
The word-alignment problem: conventional approaches usually use co-occurrence models, e.g., φ² (Gale & Church 1991) or the log-likelihood ratio (Dunning 1993), which suffer from the indirect-association problem. Melamed (2000) proposed competitive linking along with an explicit noise model to solve it (example pair: CISCO Systems Inc. and 思科系統; a sketch follows below).
Goal: to propose a probabilistic word-alignment model that allows easy integration of context-specific features.
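
A rough sketch of the greedy linking step in Melamed-style competitive linking (without the explicit noise model); the score table is assumed to come from an association measure such as the log-likelihood ratio.

```python
def competitive_linking(scores):
    """Greedily accept the highest-scoring (e, f) pair whose words are
    both still unlinked, enforcing one-to-one links (Melamed 1997/2000)."""
    links, used_e, used_f = [], set(), set()
    for (e, f), s in sorted(scores.items(), key=lambda kv: -kv[1]):
        if e not in used_e and f not in used_f:
            links.append((e, f, s))
            used_e.add(e)
            used_f.add(f)
    return links

# indirect association: "bank"-"rive" loses once "bank"-"banque" is linked
scores = {("bank", "banque"): 9.1, ("bank", "rive"): 3.2, ("river", "rive"): 7.5}
print(competitive_linking(scores))
```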

(Figure: the explicit noise model.)

Probabilistic Word-Alignment Model
Given E = e1, e2, ..., em and F = f1, f2, ..., fn:
- If ei and fj are a translation pair, then the link l(ei, fj) exists.
- If ei has no corresponding translation, then the null link l(ei, f0) exists.
- If fj has no corresponding translation, then the null link l(e0, fj) exists.
An alignment A is a set of links such that every word in E and F participates in at least one link. The alignment problem is to find the alignment A that maximizes P(A | E, F); by contrast, IBM's translation model maximizes P(A, F | E).

Probabilistic Word-Alignment Model (cont.)
Given A = {l1, l2, ..., lt}, where lk = l(e_ik, f_jk), write consecutive subsets of A as l_i^j = {li, li+1, ..., lj}. Let Ck = {E, F, l_1^(k-1)} represent the context of lk; by the chain rule, P(A | E, F) = ∏_k P(lk | Ck).

Probabilistic Word-Alignment Model (cont.)
Ck = {E, F, l_1^(k-1)} is too complex to estimate. Let FTk be a set of context-related features such that P(lk | Ck) can be approximated by P(lk | e_ik, f_jk, FTk). Let Ck' = {e_ik, f_jk} ∪ FTk.

An Illustrative Example

Word-Alignment Algorithm
Input: E, F, TE, where TE is E's dependency tree; it enables us to use features and constraints based on linguistic intuitions.
Constraints:
- One-to-one constraint: every word participates in exactly one link.
- Cohesion constraint: use TE to induce TF with no crossing dependencies.

Word-Alignment Algorithm (cont.)
Features:
- Adjacency features fta: for any word pair (ei, fj), if a link l(ei', fj') exists where -2 ≤ i'-i ≤ 2 and -2 ≤ j'-j ≤ 2, then fta(i-i', j-j', ei') is active for this context.
- Dependency features ftd: for any word pair (ei, fj), let ei' be the governor of ei, and let rel be the grammatical relationship between them. If a link l(ei', fj') exists, then ftd(j-j', rel) is active for this context.

Experimental Results
Test bed: the Hansard corpus. Training: 50K aligned sentence pairs (Och & Ney 2000). Testing: 500 pairs.

Future Work
The alignment algorithm presented here is incapable of creating alignments that are not one-to-one; many-to-one alignment will be pursued. The proposed model is capable of handling many-to-one alignments, with the extra words on the "many" side accounted for through null-link probabilities.