Johns Hopkins 2003 Summer Workshop on Syntax and Statistical Machine Translation Chapters 5-8 Ethan Phelps-Goodman
Outline “Tricky” Syntactic Features Reranking with Perceptrons Reranking with Minimum Bayes Risk Conclusion In the interest of time I’m skipping some features. If there’s one you’d like to talk about, ask me.
“Tricky” Syntactic Features Constituent Alignment Features Markov Assumption Bi-gram model over elementary trees
Constituent Alignment Features Hypothesis: target sentences should be rewarded for having syntactic structure similar to the source sentence. Experiments: Tree to String Penalty, Tree to Tree Penalty, Constituent Label Probability. Uses: parse trees in both languages and word alignments. Is that hypothesis really true?
Tree to String Penalty Penalize target words that cross source constituents
Tree to Tree Penalty Penalize target constituents that don’t align to source constituents
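A minimal sketch of one plausible reading of the tree-to-string penalty (the tree-to-tree penalty would be analogous, defined over aligned constituents rather than words). The span representation, the alignment format, and the crossing test are assumptions for illustration, not the exact workshop definitions.

```python
# Hedged sketch: one plausible reading of the tree-to-string penalty.
# A source constituent is a span (i, j) of source positions; `alignments` is a
# set of (source_pos, target_pos) pairs. A target word "crosses" a constituent
# if it lies inside the target span the constituent projects to, yet is aligned
# only to source words outside the constituent.

def tree_to_string_penalty(source_spans, alignments):
    penalty = 0
    for (i, j) in source_spans:                                # each source constituent
        inside = {t for (s, t) in alignments if i <= s <= j}   # target positions it covers
        if not inside:
            continue
        lo, hi = min(inside), max(inside)                      # projected target span
        for t in range(lo, hi + 1):
            links = {s for (s, tt) in alignments if tt == t}
            if links and links.isdisjoint(range(i, j + 1)):
                penalty += 1                                   # crossing target word
    return penalty
```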
Constituent Label Probability Learns Pr( target node label | source node label ) e.g.: Pr( target = VP | source = NP ) = 0.019. Align tree nodes by finding minimum common ancestor of aligned leaves Training: ML counts from training data Also: Pr( target label, target leaf count | source label, source leaf count )
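A minimal sketch of the maximum-likelihood estimate of Pr(target label | source label), assuming a simple Node class and a target-to-source leaf alignment map (both assumptions for illustration); the node-alignment step takes the minimum common ancestor of the source leaves that a target node's leaves align to, as described above.

```python
from collections import defaultdict

# Hedged sketch of ML counts for Pr(target node label | source node label).
# Node is a hypothetical parse-tree class; align_t2s maps each target leaf
# position to the set of source leaf positions it is aligned to.

class Node:
    def __init__(self, label, children=(), leaf_index=None):
        self.label, self.children, self.leaf_index = label, list(children), leaf_index

    def leaf_indices(self):
        if not self.children:
            return [self.leaf_index]
        return [i for c in self.children for i in c.leaf_indices()]

def lca(node, leaves):
    """Lowest source node whose span covers all of `leaves` (minimum common ancestor)."""
    for child in node.children:
        if leaves <= set(child.leaf_indices()):
            return lca(child, leaves)
    return node

def label_counts(target_nodes, source_root, align_t2s):
    counts = defaultdict(lambda: defaultdict(int))
    for tnode in target_nodes:                         # every node of the target parse
        src_leaves = set()
        for t in tnode.leaf_indices():
            src_leaves |= set(align_t2s.get(t, []))
        if src_leaves:
            counts[lca(source_root, src_leaves).label][tnode.label] += 1
    return counts

def label_prob(counts, src_label, tgt_label):
    total = sum(counts[src_label].values())
    return counts[src_label][tgt_label] / total if total else 0.0
```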
Results Possible problems: noisy parses; noisy alignments; insensitivity of BLEU; improvements not statistically significant.
Markov Assumption for Tree Models The tree-based translation models from Chapter 4 were too slow (there was only time to parse 300 of the 1,000-best) and allowed limited reordering among the higher levels of the tree. Solution: split trees into independent tree fragments. Lesson: it is difficult to get complicated features to work when there is so much noise in the system.
Markov Example
TAG Elementary Trees Break into tree fragments by head word. Build an n-gram model on the tree fragments. Unigram model: P = ∏_i p(e_i, t_{e_i}, f_i, t_{f_i}). Bi-gram model: P = ∏_i p(e_i, t_{e_i}, f_i, t_{f_i} | e_{i-1}, t_{e_{i-1}}, f_{i-1}, t_{f_{i-1}}), where e_i and f_i are the source and target words and t_{e_i} and t_{f_i} are the source and target elementary trees. Intuition: a simple bi-gram model, but the basic unit is syntactically motivated.
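A minimal sketch of scoring a sentence pair under the bi-gram model, assuming the elementary-tree units (e_i, t_{e_i}, f_i, t_{f_i}) have already been extracted and that a conditional probability table estimated from training counts is available; the table and the back-off floor are stand-ins, not the workshop's estimator.

```python
import math

# Hedged sketch: bi-gram model whose basic unit is an aligned elementary-tree
# unit (e_i, te_i, f_i, tf_i) rather than a word. `bigram_prob` is a hypothetical
# conditional probability table estimated from training counts; unseen events
# fall back to a small floor value.

def elementary_tree_bigram_score(units, bigram_prob, floor=1e-7):
    logp = 0.0
    prev = None                                   # the previous elementary-tree unit
    for unit in units:                            # unit = (e_i, te_i, f_i, tf_i)
        logp += math.log(bigram_prob.get((unit, prev), floor))
        prev = unit
    return logp
```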
Finding TAG Elementary Trees Heuristically assign head nodes
Finding TAG Elementary Trees Split at head words
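A minimal sketch of the head-word split, assuming each node is a dict carrying its label, the index of its head word, and its children; grouping nodes by head word cuts the tree wherever a child's head differs from its parent's. This is a simplification (it records only node labels per fragment), not the workshop's exact procedure.

```python
# Hedged sketch: split a head-annotated parse into one fragment per head word.
# A node is a dict {"label": ..., "head": head_word_index, "children": [...]};
# the head annotation comes from the heuristic head-assignment step above.

def split_at_heads(node, fragments=None):
    if fragments is None:
        fragments = {}
    # every node belongs to the fragment of its head word, so the tree is cut
    # wherever a child's head word differs from its parent's
    fragments.setdefault(node["head"], []).append(node["label"])
    for child in node["children"]:
        split_at_heads(child, fragments)
    return fragments
```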
Results Why does performance improve w/ independence assumption? Just because coverage increases?
Outline “Tricky” Syntactic Features Reranking with Perceptrons Reranking with Minimum Bayes Risk Conclusion In the interest of time I’m skipping some features. If there’s one you’d like to talk about, ask me.
Why rerank? Reranking has been successful in POS tagging and parsing; it allows easy incorporation of new features; and it decreases decoding complexity. But the entire report is about reranking. Why would a perceptron be better than MER training of the log-linear model?
Reranking with a linear classifier Log-linear and Perceptron both give a linear reranking rule: ê = argmax_e ∑_m λ_m h_m(e, f). The difference is in training. Log-linear MER: optimize BLEU of the 1-best. Perceptron: attempt to separate "good" from "bad" in the 1,000-best. Note: λ_m trained with MER on the dev set.
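A minimal sketch of the shared linear rescoring rule, assuming each n-best candidate is a dict with a "features" vector and a "translation" string (a hypothetical representation); only the weight vector differs between MER-trained log-linear reranking and the perceptron.

```python
# Hedged sketch: the linear reranking rule shared by both approaches. The
# candidate representation (a dict with "features" and "translation") is an
# assumption; `lambdas` is the learned weight vector, however it was trained.

def rerank(candidates, lambdas):
    def score(c):
        return sum(l * h for l, h in zip(lambdas, c["features"]))
    return max(candidates, key=score)["translation"]
```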
Training Data For a single sentence in the dev set: sort the 1,000-best translations by BLEU score, count the top 1/3 as good and the bottom 1/3 as bad. This finds features that on average lead to translations with higher BLEU scores. This gives us our training data.
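A minimal sketch of building the training set for one dev sentence, assuming each candidate is a dict with "translation" and "features" keys and that a per-sentence BLEU approximation is passed in as a function; both representations are assumptions for illustration.

```python
# Hedged sketch: build the perceptron training set from one sentence's
# 1,000-best list. `bleu` is whatever per-sentence BLEU approximation is used
# (passed in as a function); each candidate carries a feature vector.

def make_training_examples(nbest, references, bleu):
    ranked = sorted(nbest, key=lambda c: bleu(c["translation"], references), reverse=True)
    third = len(ranked) // 3
    good = [(c["features"], +1) for c in ranked[:third]]   # top 1/3 labelled "good"
    bad = [(c["features"], -1) for c in ranked[-third:]]   # bottom 1/3 labelled "bad"
    return good + bad                                      # middle third is discarded
```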
Separating in feature space For a single sentence:
Separating in feature space For many sentences: the separator could sit at a different position for each sentence, but it is restricted to the same direction (the same weight vector) for all sentences.
Reranking The distance from the hyperplane is the score of a translation; in effect, the perceptron is trying to predict the BLEU score.
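A minimal sketch of the perceptron training loop over the labelled examples from all dev sentences, assuming dense feature vectors; the learned weight vector fixes the shared direction of the separator, and its dot product with a candidate's features then serves as the reranking score.

```python
# Hedged sketch: a vanilla perceptron over (feature_vector, label) pairs with
# label +1 for "good" and -1 for "bad". The resulting weight vector w is the
# shared direction of the separating hyperplane; a candidate's signed distance
# from that hyperplane is its reranking score.

def train_perceptron(examples, epochs=10, lr=1.0):
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for features, label in examples:
            if label * sum(wi * hi for wi, hi in zip(w, features)) <= 0:
                w = [wi + lr * label * hi for wi, hi in zip(w, features)]  # update on mistakes
    return w
```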
Features Baseline: the 12 features used in Och’s baseline. Full: Baseline + all features from the workshop. POS sequence: a 0/1 feature for every possible POS sequence. Parse trees: a 0/1 feature for every possible subtree. The last 2 need explaining.
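A minimal sketch of how the 0/1 features might be represented as a sparse map, assuming the candidate's POS tags and a precomputed set of subtree signatures are available; the keying scheme is an illustration, not the workshop's actual feature encoding.

```python
# Hedged sketch: 0/1 indicator features. The candidate's entire POS sequence is
# one binary feature, and each subtree signature in its parse is another; the
# keys here are illustrative, not the workshop's actual encoding.

def indicator_features(pos_tags, subtree_signatures):
    feats = {("POS", tuple(pos_tags)): 1}
    for sig in subtree_signatures:
        feats[("TREE", sig)] = 1
    return feats
```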
Results (BLEU) Log-linear reranking: baseline 31.6, workshop features 33.2. Perceptron reranking: baseline 30.9, workshop features 31.6, POS sequence 30.9, parse tree 30.5.
Analysis The dev set was too small to optimize the POS and tree features. Using only the POS sequence gives results as good as the much more complicated baseline.
Outline “Tricky” Syntactic Features Reranking with Perceptrons Reranking with Minimum Bayes Risk Conclusion This one will take a bit of background in machine learning theory.
An Example Say our 4-best list is: "Ball at the from Bob" .31, "Bob hit the ball" .24, "Bob struck the ball" .23, "Bob hit a ball" .22. MAP: choose the single most likely hypothesis, here the disfluent "Ball at the from Bob". MBR: weight each hypothesis by its similarity with the other hypotheses, so the three mutually similar "Bob ... ball" translations reinforce each other and "Bob hit the ball" wins. If they only get one slide from this section, this should be it.
MBR, formally (ê, â, T̂) = argmax_{(e', a', T')} ∑_{(e, a, T)} L((e', a', T'), (e, a, T); f, T(f)) · Pr(e, a, T | f), where f = source sentence; e, e' = target sentences; a, a' = word alignments; T, T', T(f) = parse trees (T(f) is the parse of the source); L((e', a', T'), (e, a, T); f, T(f)) = similarity between translations e and e'.
Making MBR tractable Restrict e to 1,000-best list Use translation model score for P(e,a | f )
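A minimal sketch of MBR selection over an N-best list, using the 4-best example above and a crude word-overlap similarity as a stand-in for the actual loss functions (BLEU, tree kernel, alignment match) described next; the probabilities are taken as given.

```python
# Hedged sketch: Minimum Bayes Risk selection over an N-best list. `gain` is a
# similarity function; `probs` are the (normalized) translation-model scores.
# With the toy overlap below, the mutually similar "Bob ... ball" hypotheses
# reinforce each other and outweigh the single most probable string.

def mbr_select(nbest, probs, gain):
    best, best_score = None, float("-inf")
    for e_prime in nbest:
        expected_gain = sum(p * gain(e_prime, e) for e, p in zip(nbest, probs))
        if expected_gain > best_score:
            best, best_score = e_prime, expected_gain
    return best

def overlap(a, b):
    """Crude word-overlap similarity, a stand-in for BLEU or a tree kernel."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

nbest = ["Ball at the from Bob", "Bob hit the ball", "Bob struck the ball", "Bob hit a ball"]
probs = [0.31, 0.24, 0.23, 0.22]
print(mbr_select(nbest, probs, overlap))   # picks "Bob hit the ball"
```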
Loss functions: translation similarity. “Tier 1,” string based: L(e, e') = BLEU score between e and e'. “Tier 2,” syntax based: L((e, T), (e', T')) = tree kernel between T and T'. “Tier 3,” alignment based: L((e', a', T'), (e, a, T); f, T(f)) = number of source-to-target node alignments that are the same in T and T'.
Results Using a loss based on a particular measurement leads to better performance on that measurement. BLEU score goes up by 0.3.
Outline “Tricky” Syntactic Features Reranking with Perceptrons Reranking with Minimum Bayes Risk Conclusion This one will take a bit of background in machine learning theory.
What Works (BLEU) Baseline 31.6; Model 1 (on 250M words) 32.5; all features from workshop 33.2; human upper bound 35.7. Simple Model 1 is the only feature to give a statistically significant improvement in isolation; the complex features are helpful in aggregate.
Their wish list Better evaluation metrics: parameter tuning for individual sentences; specific issues, such as missing content words. Better parameter tuning: larger dev set. Take divergence into account in syntax. Better quality parses: confidence measure in the parse.