Beam-Width Prediction for Efficient Context-Free Parsing. Nathan Bodenstab, Aaron Dunlop, Keith Hall, Brian Roark. June 2011.


Slide 1: Beam-Width Prediction for Efficient Context-Free Parsing
Nathan Bodenstab, Aaron Dunlop, Keith Hall, Brian Roark
June 2011

Slide 2: OHSU Beam-Search Parser (BUBS)
– Standard bottom-up CYK
– Beam-search per chart cell
– Only the "best" entries are retained
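As a concrete illustration of the setup on this slide, the following is a minimal Python sketch (not the BUBS implementation) of bottom-up CYK parsing in which each chart cell retains only its highest-scoring constituents, i.e. a per-cell beam. The lexicon and grammar dictionary formats are illustrative assumptions.

import heapq
from collections import defaultdict

def beam_cyk(words, lexicon, grammar, beam_width=15):
    # lexicon: word -> list of (POS tag, log prob)
    # grammar: (B, C) child pair -> list of (A, rule log prob)
    n = len(words)
    chart = [[{} for _ in range(n + 1)] for _ in range(n)]

    # Span-1 (lexical) cells
    for i, w in enumerate(words):
        for tag, logp in lexicon.get(w, []):
            chart[i][i + 1][tag] = logp

    # Larger spans, bottom-up
    for span in range(2, n + 1):
        for start in range(n - span + 1):
            end = start + span
            candidates = defaultdict(lambda: float("-inf"))
            for mid in range(start + 1, end):
                for B, b_score in chart[start][mid].items():
                    for C, c_score in chart[mid][end].items():
                        for A, rule_logp in grammar.get((B, C), []):
                            score = rule_logp + b_score + c_score
                            if score > candidates[A]:
                                candidates[A] = score
            # Beam: retain only the beam_width best entries in this cell
            best = heapq.nlargest(beam_width, candidates.items(), key=lambda kv: kv[1])
            chart[start][end] = dict(best)
    return chart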

Slide 3: Ranking, Prioritization, and FOMs
f() = g() + h()
– Figure of Merit: Caraballo and Charniak (1997)
– A* search: Klein and Manning (2003); Pauls and Klein (2010)
– Other: Turian (2007); Huang (2008)
– Apply to beam-search

Slide 4: Beam-Width Prediction
Traditional beam-search uses a constant beam-width.
Two definitions of beam-width:
– Number of local competitors to retain (n-best)
– Score difference from the best entry
Advantages:
– Heavy pruning compared to exhaustive CYK
– Minimal sorting compared to a global agenda
Disadvantages:
– No global pruning: all chart cells are treated equally
– Must be set conservatively to keep outliers within the beam
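The two beam-width definitions above can be sketched as two alternative pruning functions over a single cell's candidate list; the (label, log score) entry format is an assumption for illustration, and neither function is taken from the BUBS code.

def prune_n_best(entries, n=15):
    # Keep the n highest-scoring local competitors (n-best beam).
    return sorted(entries, key=lambda e: e[1], reverse=True)[:n]

def prune_score_delta(entries, delta=8.0):
    # Keep entries whose log score falls within delta of the cell's best.
    if not entries:
        return []
    best = max(score for _, score in entries)
    return [(label, score) for label, score in entries if best - score <= delta]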

Slides 5-6: Beam-Width Prediction
How often is the gold edge ranked in the top N per chart cell?
– Exhaustively parse Section 22 with the Berkeley latent-variable grammar
[Figure: cumulative gold edges by gold rank <= N]

Slide 7: Beam-Width Prediction
Beam-search with C&C Boundary ranking: how often is the gold edge ranked in the top N per chart cell?
[Figure: cumulative gold edges by gold rank <= N]
– To maintain baseline accuracy, the beam-width must be set to 15 with C&C Boundary ranking (and 50 using only the inside score)

Slide 8: Beam-Width Prediction
– Over 70% of gold edges are already ranked first in the local agenda
– In those cells, 14 of the 15 retained edges are unnecessary
– We can do much better than a constant beam-width

Slide 9: Beam-Width Prediction
Method: train an averaged perceptron (Collins, 2002) to predict the optimal beam-width per chart cell.
Map each chart cell of sentence S spanning words w_i ... w_j to a feature-vector representation:
– x: lexical and POS unigrams and bigrams, relative and absolute span
– y: 1 if the gold rank is greater than k, 0 otherwise (cells containing no gold edge are assigned rank -1)
Minimize the loss, where H is the unit step function:
  θ_k = argmin_θ Σ_i L( H(θ_k · x_i), y_i )
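A hedged sketch of the averaged perceptron idea referenced above: one binary classifier per candidate beam-width k, trained to predict whether a cell's gold rank exceeds k. The sparse feature dictionaries and training-pair format below are illustrative assumptions, not the paper's exact setup.

from collections import defaultdict

def train_averaged_perceptron(data, epochs=5):
    # data: list of (features: dict of name -> value, y in {0, 1}),
    # where y = 1 means the gold rank in this cell exceeds k.
    w = defaultdict(float)       # current weights
    w_sum = defaultdict(float)   # running sum of weights for averaging
    t = 0
    for _ in range(epochs):
        for features, y in data:
            t += 1
            score = sum(w[f] * v for f, v in features.items())
            pred = 1 if score > 0 else 0
            if pred != y:
                update = 1.0 if y == 1 else -1.0
                for f, v in features.items():
                    w[f] += update * v
            for f in w:
                w_sum[f] += w[f]
    return {f: w_sum[f] / t for f in w_sum}

def predict(avg_w, features):
    return 1 if sum(avg_w.get(f, 0.0) * v for f, v in features.items()) > 0 else 0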

Slide 10: Beam-Width Prediction
Method: use a discriminative classifier to predict the optimal beam-width per chart cell.
Minimize the loss, where L is an asymmetric loss function:
– If the predicted beam-width is too large, the cost is a tolerable loss of efficiency
– If the predicted beam-width is too small, there is a high risk to accuracy
– Lambda is set to 10^2 in all experiments
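Since the exact loss appears only as a figure on the original slide, the function below is an assumed illustration of an asymmetric loss with the stated property: predicting too small a beam (an accuracy risk) costs lambda times more than predicting too large a beam (an efficiency cost).

def asymmetric_loss(y_true, y_pred, lam=100.0):  # lambda = 10**2 as on the slide
    # y_true = 1 means the cell's gold rank exceeds k (a wider beam is needed).
    if y_true == y_pred:
        return 0.0
    if y_true == 1 and y_pred == 0:
        # Beam-width predicted too small: high risk to accuracy.
        return lam
    # Beam-width predicted too large: tolerable efficiency loss.
    return 1.0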

Slide 11: Beam-Width Prediction
Special case: predict whether a chart cell is open or closed to multi-word constituents.

Slide 12: Beam-Width Prediction
A "closed" chart cell may still need to be partially open: binarized or dotted-rule parsing creates new "factored" productions.
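To show why factored productions force a "closed" cell partially open, here is a small, self-contained sketch of right-factoring an n-ary rule into binary rules; the "@NP|<...>" naming convention is illustrative, not the grammar's actual scheme.

def binarize(lhs, rhs):
    # Right-factor a rule lhs -> rhs (a list of symbols) into binary rules.
    if len(rhs) <= 2:
        return [(lhs, tuple(rhs))]
    rules = []
    parent = lhs
    remaining = list(rhs)
    while len(remaining) > 2:
        factored = "@{}|<{}>".format(lhs, "-".join(remaining[1:]))
        rules.append((parent, (remaining[0], factored)))
        parent = factored
        remaining = remaining[1:]
    rules.append((parent, tuple(remaining)))
    return rules

# binarize("NP", ["DT", "JJ", "JJ", "NN"]) yields:
#   NP -> DT @NP|<JJ-JJ-NN>
#   @NP|<JJ-JJ-NN> -> JJ @NP|<JJ-NN>
#   @NP|<JJ-NN> -> JJ NN
# The factored categories must be allowed even in cells that are closed
# to complete multi-word constituents.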

Slide 13: Beam-Width Prediction
Method 1: Constituent Closure

Slide 14: Beam-Width Prediction
Constituent Closure is a per-cell generalization of Roark & Hollingshead (2008):
– O(n^2) classifications instead of O(n)

Slide 15: Beam-Width Prediction
Method 2: Complete Closure

Slide 16: Beam-Width Prediction
Method 3: Beam-Width Prediction

Slide 17: Beam-Width Prediction
Method 3: Beam-Width Prediction
– Use multiple binary classifiers instead of regression (better performance)
– The local beam-width is taken from the classifier with the smallest beam-width prediction
– Best performance with four binary classifiers: 0, 1, 2, 4
– 97% of positive examples have beam-width <= 4
– No classifier is needed for every possible beam-width value between 0 and the global maximum (15 in our case)
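One plausible way to combine the four binary classifiers described above into a per-cell beam-width, sketched in Python; the classifier objects and their predict() interface are assumptions for illustration.

BEAM_CLASSES = [0, 1, 2, 4]  # one binary classifier per value

def predict_beam_width(classifiers, features, global_max=15):
    # classifiers: dict mapping k -> model whose predict(features) returns
    # 1 if the cell's gold rank is likely to exceed k, else 0.
    for k in sorted(classifiers):
        if classifiers[k].predict(features) == 0:
            return k          # smallest k predicted to be sufficient
    return global_max         # otherwise fall back to the global beam-width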

Slide 18: Beam-Width Prediction

Slide 19: Beam-Width Prediction

Slides 20-23: Beam-Width Prediction
– Section 22 development set results
– Decoding time is seconds per sentence, averaged over all sentences in Section 22
– Parsing with the Berkeley latent-variable grammar (4.3 million productions)

Parser | Secs/Sent | Speedup | F1
CYK | | |
CYK + Constituent Closure | | | 89.3
CYK + Complete Closure | | | 89.3
Beam + Inside FOM (BI) | | |
BI + Constituent Closure | | | 89.2
BI + Complete Closure | | | 89.3
BI + Beam-Predict | | | 89.3
Beam + Boundary FOM (BB) | | |
BB + Constituent Closure | | | 89.2
BB + Complete Closure | | | 89.3
BB + Beam-Predict | | | 89.3
Most recent numbers | | | 89.x

Slide 24: Beam-Width Prediction
– Section 23 test results
– Only MaxRule marginalizes over latent variables and performs non-Viterbi decoding

Parser | Secs/Sent | F1
CYK | |
Berkeley CTF MaxRule (Petrov and Klein, 2007) | |
Berkeley CTF Viterbi | |
Beam + Boundary FOM (BB) (Caraballo and Charniak, 1998) | |
BB + Chart Constraints (Roark and Hollingshead, 2008; 2009) | |
BB + Beam-Prediction | |

Slide 25: Thanks.

Slide 26: Beam-Width Prediction

Slide 27: FOM Details
C&C FOM details:
– FOM(NT) = Outside_left * Inside * Outside_right
– Inside = accumulated grammar score
– Outside_left = max over POS [ POS forward prob * POS-to-NT transition prob ]
– Outside_right = max over POS [ NT-to-POS transition prob * POS backward prob ]
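A sketch of the boundary FOM above, computed in log space so the products become sums; the probability-table layouts are assumptions for illustration rather than the parser's actual data structures.

def boundary_fom(nt, inside_score, start, end,
                 pos_forward, pos_backward, pos_to_nt, nt_to_pos):
    # All scores are log probabilities.
    #   pos_forward[start][tag] : forward prob of the boundary tag ending at start
    #   pos_backward[end][tag]  : backward prob of the boundary tag starting at end
    #   pos_to_nt[(tag, nt)]    : transition prob from the left boundary tag into nt
    #   nt_to_pos[(nt, tag)]    : transition prob from nt into the right boundary tag
    outside_left = max(
        pos_forward[start][tag] + pos_to_nt.get((tag, nt), float("-inf"))
        for tag in pos_forward[start]
    )
    outside_right = max(
        nt_to_pos.get((nt, tag), float("-inf")) + pos_backward[end][tag]
        for tag in pos_backward[end]
    )
    return outside_left + inside_score + outside_right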

Slide 28: FOM Details