Scalable Inference and Training of Context-Rich Syntactic Translation Models Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe.

Presentation transcript:

Scalable Inference and Training of Context-Rich Syntactic Translation Models Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang and Ignacio Thayer Presentation by: Nadir Durrani

GHKM: What’s in a Translation Rule? (Recap) Given a triple (f, π, a): –A source-side sentence f –A target-side parse tree π –An alignment a between f and the leaves of π Translation is viewed as a process of transforming π into f, explained by a minimal set of syntactically motivated transformation rules.
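To make this input concrete, here is a minimal sketch in Python of the triple the extraction algorithm operates on; the tree, sentence, and alignment below are hypothetical toy data, not an example from the paper or the slides.

# Target-side parse pi, written as nested (label, children) tuples; leaves are target words.
pi = ("S",
      [("NP", [("PRP", ["he"])]),
       ("VP", [("VBZ", ["likes"]),
               ("NP", [("NNS", ["apples"])])])])

# Source-side sentence f (word indices 0..n-1).
f = ["ta", "xihuan", "pingguo"]

# Alignment a: set of (source_index, target_leaf_index) pairs,
# with target leaves indexed left to right.
a = {(0, 0), (1, 1), (2, 2)}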

Contributions of this paper Obtain multi-level rules –Acquire rules of arbitrary size that condition on more syntactic context –Multiple interpretations of how unaligned words are accounted for in a derivation Probability models for multi-level transfer rules –Assigning probabilities to very large rule sets

Rule Extraction Algorithm Compute a set of frontier nodes F of the alignment graph For each n ∈ F, compute the minimal frontier graph rooted at n

Rule Extraction Algorithm

Compute a set of frontier nodes F of the alignment graph A 3-step process – for each node n ∈ G (the directed alignment graph): –Label n with its span –Label n with its complement span –Decide whether n ∈ F

Step I: Label with Span Span(n): the indices of the first through last source words reachable from node n

Step II: Label with Complement Span Complement Span(n) = Complement Span(Parent(n)) ∪ Span(Siblings(n)) Complement Span(root) = ∅

Step II: Label with Complement Span
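A small sketch of the two labeling passes on toy data of the kind shown earlier: spans are computed bottom-up, complement spans top-down from the parent's complement span plus the siblings' spans. The helper names are illustrative, not from the paper.

def make(label, children=None, align=()):
    # Nodes are plain dicts so spans can be attached; 'align' holds the
    # source indices aligned to a leaf.
    return {"label": label, "children": children or [], "align": set(align)}

tree = make("S", [make("NP", [make("he", align=[0])]),
                  make("VP", [make("likes", align=[1]),
                              make("NP", [make("apples", align=[2])])])])

def label_spans(node):
    """Bottom-up pass: span(n) = source indices reachable from n's subtree."""
    node["span"] = set(node["align"])
    for child in node["children"]:
        node["span"] |= label_spans(child)
    return node["span"]

def label_complement_spans(node, comp=frozenset()):
    """Top-down pass: comp(n) = comp(parent(n)) ∪ spans of n's siblings."""
    node["comp"] = set(comp)
    for child in node["children"]:
        siblings = set().union(*(s["span"] for s in node["children"] if s is not child))
        label_complement_spans(child, node["comp"] | siblings)

label_spans(tree)
label_complement_spans(tree)   # complement span of the root is the empty set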

Computing the Frontier Set A node n is in the frontier set iff complement_span(n) ∩ closure(span(n)) = ∅ closure(span(n)) = the shortest contiguous span that is a superset of span(n) –Example: closure({2,3,5,7}) = {2,3,4,5,6,7}

Computing Frontier Set
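Continuing the sketch, the closure operation and the frontier test follow directly from the two labels (again with illustrative helper names, building on the node dicts above).

def closure(span):
    """Shortest contiguous span that is a superset of `span`."""
    return set(range(min(span), max(span) + 1)) if span else set()

def in_frontier(node):
    # n is in the frontier set iff closure(span(n)) and complement_span(n) are disjoint.
    return not (closure(node["span"]) & node["comp"])

assert closure({2, 3, 5, 7}) == {2, 3, 4, 5, 6, 7}   # the slide's example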

Rule Extraction Algorithm Compute a set of frontier nodes F of the alignment graph Compute the minimal frontier graph for all nodes in frontier set

Rule Extraction Algorithm Compute the set of frontier nodes F of the alignment graph Compute the minimal frontier graph for all nodes in the frontier set Algorithm: For each node n ∈ F: –Expand n; then, as long as a reached node n’ ∉ F, keep expanding n’ –If n’ ∈ F, replace n’ by a variable x_i
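A rough sketch of that expansion step, reusing the toy node dictionaries above; it builds only the tree side of the minimal rule rooted at n and omits the source-side string and its ordering.

def minimal_rule_lhs(n, is_frontier, counter=None):
    """Expand n downward: frontier descendants become variables x_i,
    non-frontier internal nodes keep being expanded, leaves stay lexical."""
    counter = counter if counter is not None else [0]
    children_out = []
    for child in n["children"]:
        if is_frontier(child):
            children_out.append(f"x{counter[0]}:{child['label']}")
            counter[0] += 1
        elif child["children"]:
            children_out.append(minimal_rule_lhs(child, is_frontier, counter))
        else:
            children_out.append(child["label"])
    return [n["label"], children_out]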

Computing Frontier Set

Tree to String Transducers

Minimal Derivation Corresponding to Example

Acquiring Multi-level Rules GHKM: Extract minimal rules –Unique derivation for G –Rules defined over G cannot be decomposed into smaller rules induced by the same graph This work: Extract multi-level rules –Multiple derivations per triple –Composition of 2 or more minimal rules to form larger rules

Example

Multiple Interpretations of Unaligned Words Highly frequent phenomenon in Chinese-English –24.1% of Chinese words in a 179 million word corpus are unaligned –84.8% of Chinese sentences contain at least 1 unaligned word GHKM: Extract minimal rules –Attach unaligned words to a particular constituent of π

Example Heuristic: attach unaligned words to the highest possible constituent (highest attachment)

Multiple Interpretations of Unaligned Words This work –No prior assumption about the “correct” way of assigning unaligned words to a constituent –Consider all possible derivations that are consistent with G –Use corpus evidence to find the more probable unaligned-word attachments

6 Minimal Derivations for the Working Example

Representing Derivations as a Forest Rather than enumerating all possible derivations, represent them as a derivation forest –Time- and space-efficient In each derivation, every unaligned item appears only once in the rules of that derivation –To avoid biased estimates from disproportionate representation
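One common way to pack such a forest is as an AND/OR graph; the sketch below shows an illustrative structure (not necessarily the paper's exact one), where an OR-node groups the alternative derivations of a node and an AND-node is a single rule application whose variables point back to OR-nodes.

from dataclasses import dataclass, field
from typing import List

@dataclass
class AndNode:
    """One rule application; its variables point back to OR-nodes."""
    rule: str
    children: List["OrNode"] = field(default_factory=list)

@dataclass
class OrNode:
    """Alternative ways of deriving the same frontier node / span."""
    root: str
    alternatives: List[AndNode] = field(default_factory=list)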

Representing Derivations as Forest

Derivation Building and Rule Extraction Algorithm Preprocessing step: –Assign spans –Assign complement spans –Compute the frontier set F –Extract minimal rules Each n ∈ F has an open queue q_o and a closed queue q_c of rules –q_o is initialized with the minimal rules of n For each node n ∈ F: –Pick the smallest rule r from q_o –For each variable of r, discover new rules by composition –When q_o becomes empty, or the threshold on rule size or on the number of rules in q_c is reached: connect a new OR-node to all rules extracted for n, and add it to or-dforest, a table that stores OR-nodes in the format [x, y, c]
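A heavily simplified sketch of that queue-driven loop for a single frontier node; the composition step, the size function, and the thresholds are passed in as parameters whose names are assumptions, not the paper's interface.

import heapq

def extract_rules_for_node(minimal_rules, compose, size, max_size=4, max_rules=1000):
    """Sketch of the open/closed queue loop for one frontier node n.
    minimal_rules: minimal rules rooted at n (they seed the open queue q_o);
    compose(r): yields larger rules obtained by composing r at one of its variables;
    size(r): e.g. the number of internal nodes in r's lhs."""
    open_q = [(size(r), i, r) for i, r in enumerate(minimal_rules)]
    heapq.heapify(open_q)
    closed = []                                  # q_c: rules extracted for n
    tie = len(open_q)
    while open_q and len(closed) < max_rules:
        _, _, r = heapq.heappop(open_q)          # pick the smallest rule r
        closed.append(r)
        for bigger in compose(r):                # discover new rules by composition
            if size(bigger) <= max_size:
                tie += 1
                heapq.heappush(open_q, (size(bigger), tie, bigger))
    return closed                                # an OR-node would link to all of these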

Contributions of this paper Obtain multi-level rules –Acquire rules of arbitrary size that condition on more syntactic context –Multiple interpretations of how unaligned words are accounted for in a derivation Probability models for multi-level transfer rules –Assigning probabilities to very large rule sets

Probability Models Using the noisy-channel approach, incorporating dependencies on target-side syntax: ê = argmax_e p(e) · p(f | e), where p(e) is the monolingual language model and p(f | e) the translation model The translation model is decomposed over target-side trees: p(f | e) = Σ_{π ∈ Τ(e)} p(f | π) · p(π | e), the product of a syntax-based translation model p(f | π) and a syntactic parsing model p(π | e) Τ(e) is the set of all target trees that yield e

Syntax-Based Translation Model Θ is the set of all derivations constructible from G = (π, f, a) A derivation is a composition of rules, θᵢ = r₁ ∘ … ∘ rₙ –Independence assumption: p(θᵢ) = Π_j p(r_j) Λ is the set of all sub-tree decompositions of π corresponding to derivations in Θ –Serves as a normalization factor to keep the distribution tight, i.e. summing to 1 over all strings fᵢ ∈ F derivable from π

Example

p1 = ½ = 0.5, p2 = ½ = 0.5, p3 = 1, p4 = 1

Example p1 = ½ = 0.5, p2 = ½ = 0.5, p3 = 1, p4 = 1 Total probability mass distributed across the two derivable source strings a’ b’ c’ and b’ a’ c’: p(a’,b’,c’ | π) + p(b’,a’,c’ | π) = [p1 + (p3 · p4)] + [p2] = 2
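The deficiency is easy to check numerically by transcribing the slide's numbers: the relative-frequency estimator spends a total mass of 2 (not 1) on the source strings derivable from π, so the distribution is not properly normalized.

p1, p2, p3, p4 = 0.5, 0.5, 1.0, 1.0
mass = (p1 + p3 * p4) + p2    # p(a' b' c' | pi) + p(b' a' c' | pi)
assert mass == 2.0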

Problems with Relative Frequency Estimator Biased estimates when extracting only minimal rules

Problems with Relative Frequency Estimator

Joint Model Conditioned on Root
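The slide itself is a figure, but "conditioned on root" presumably denotes a relative-frequency estimate that normalizes rule counts by the rule's root symbol, roughly:

p(r | root(r)) = count(r) / Σ_{r′ : root(r′) = root(r)} count(r′)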

EM Training Which derivation in the derivation forest is true? –Score each derivation with its rule probabilities and find the most likely derivation How do we get good rules? –Collect the rule counts from the most likely derivation Chicken-or-egg problem – calls for EM training

EM Training Algorithm 1. Initialize each derivation with uniform rule probabilities 2. Score each derivation θᵢ ∈ Θ with the rule probabilities 3. Normalize to find the probability pᵢ of each derivation 4. Collect weighted rule counts from each derivation θᵢ with weight pᵢ 5. Normalize the rule counts to obtain new rule probabilities 6. Repeat steps 2–5 until convergence
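A compact sketch of that loop in Python, assuming each candidate derivation is given simply as the list of rules it uses (i.e., derivations are enumerated explicitly rather than packed in a forest, and counts are renormalized globally rather than per root).

import math
from collections import defaultdict

def em_over_derivations(derivations, iterations=10):
    """`derivations` is a list where each element holds the candidate
    derivations for one training example, each derivation being a list of rule ids."""
    all_rules = {r for ds in derivations for d in ds for r in d}
    prob = {r: 1.0 / len(all_rules) for r in all_rules}            # 1. uniform init
    for _ in range(iterations):
        counts = defaultdict(float)
        for ds in derivations:
            scores = [math.prod(prob[r] for r in d) for d in ds]    # 2. score derivations
            z = sum(scores)
            for d, s in zip(ds, scores):
                for r in d:
                    counts[r] += s / z                              # 3./4. weighted counts
        total = sum(counts.values())
        prob = {r: c / total for r, c in counts.items()}            # 5. new probabilities
    return prob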

Evaluation Three models Cm, C3 and C4 (minimal derivations, and composed rules with 3 and 4 internal nodes in the lhs, respectively) NIST million-word English-Chinese corpus 1-best derivation per sentence pair, based on GIZA alignments (Figure 4) Observations –C4 makes extensive use of composed rules –The best derivation (Figure 4) under C4 is more probable than under Cm –Cm uses less syntactic context than C4, which incorporates more linguistic evidence

Evaluation

Conclusion Acquire larger rules that condition on more syntactic context –3.63 BLEU point gain over the baseline minimal-rules system Use multiple derivations, including multiple interpretations of unaligned words, in the derivation forest Probability models to score multi-level transfer rules