1 Boosting-based parse re-ranking with subtree features
Taku Kudo, Jun Suzuki, Hideki Isozaki
NTT Communication Science Labs.
2 Discriminative methods for parsing have shown remarkable performance compared to traditional generative models (e.g., PCFGs).
Two approaches:
- Re-ranking [Collins 00, Collins 02]: discriminative machine learning algorithms are used to re-rank the n-best outputs of generative/conditional parsers.
- Dynamic programming: e.g., max-margin parsing [Taskar 04].
3 Re-ranking
Let x be an input sentence and y a parse tree for x.
Let G(x) be a function that returns the set of n-best results for x.
A re-ranker gives a score to each candidate parse and selects the result with the highest score.
Example: x = "I buy cars with money", G(x) = {y1, y2, y3, ...} (n-best results)
4 Scoring with a linear model
score(x, y) = w . phi(y)
phi is a feature function that maps an output y into R^d.
w is a parameter vector (weights) estimated from training data.
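As a concrete illustration, a minimal Python sketch of this scoring and re-ranking, assuming sparse feature dicts; phi, nbest, and all names are placeholders, not the authors' code.

    # A minimal sketch of linear-model re-ranking over n-best candidates.

    def score(w, phi_y):
        """Dot product w . phi(y) for a sparse feature vector {feature: value}."""
        return sum(w.get(k, 0.0) * v for k, v in phi_y.items())

    def rerank(w, nbest, phi):
        """Return the candidate parse y in G(x) with the highest score."""
        return max(nbest, key=lambda y: score(w, phi(y)))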
5 Two issues in the linear model [1/2]
How to estimate the weights w?
- Minimize a loss over the given training data.
- The definition of the loss is what distinguishes ME, SVMs, and Boosting.
6 Two issues in the linear model [2/2]
How to define the feature set?
- Our choice: use all subtrees.
- Pros: a natural extension of CFG rules; can capture long contextual information.
- Cons: naive enumeration gives huge complexity.
7 A question about all subtrees
Do we always need all subtrees?
- Only a small set of subtrees is informative.
- Most subtrees are redundant.
Goal: automatic feature selection from all subtrees, which
- enables fast parsing, and
- gives a good interpretation of the selected subtrees.
Boosting meets our demand!
8 Why Boosting?
Different regularization strategies:
- L1 (Boosting): better when most given features are irrelevant; can remove redundant features.
- L2 (SVMs): better when most given features are relevant; uses as many features as it can.
Boosting meets our demand, because most subtrees are irrelevant or redundant.
9 RankBoost [Freund 03]
Update a single feature k with an increment delta, keeping the other weights fixed: w_k <- w_k + delta.
At each iteration, select the optimal pair (k, delta) that minimizes the loss.
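A minimal sketch of one such iteration under simplifying assumptions: each training pair is reduced to z = phi(correct parse) - phi(candidate), features take values in {-1, 0, +1} on these pairs, and the exponential loss sum_i exp(-w . z_i) is minimized. The sparse-dict layout and all names are hypothetical, not the authors' code.

    import math

    def best_update(pairs, dist, feature):
        """Closed-form optimal increment delta for a single feature.

        pairs: list of sparse dicts z_i = phi(correct) - phi(candidate)
        dist:  current pair weights d_i = exp(-w . z_i)
        """
        w_pos = sum(d for z, d in zip(pairs, dist) if z.get(feature, 0) > 0)
        w_neg = sum(d for z, d in zip(pairs, dist) if z.get(feature, 0) < 0)
        eps = 1e-12                             # smoothing to avoid log(0)
        delta = 0.5 * math.log((w_pos + eps) / (w_neg + eps))
        gain = abs(math.sqrt(w_pos) - math.sqrt(w_neg))  # loss-reduction proxy
        return gain, delta

    def boost_step(pairs, dist, features, w):
        """Select the best (k, delta) by brute force and update one weight.
        (The paper replaces this exhaustive max with branch-and-bound.)"""
        (gain, delta), k = max(((best_update(pairs, dist, f), f)
                                for f in features), key=lambda t: t[0][0])
        w[k] = w.get(k, 0.0) + delta
        return k, delta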
10 How to find the optimal subtree?
The set of all subtrees is huge, so we need to find the optimal subtree efficiently.
A variant of branch-and-bound:
- Define a search space in which the whole set of subtrees is enumerated.
- Find the optimal subtree by traversing this search space.
- Prune the search space with a proposed criterion.
11 Ad-hoc techniques
Size constraint: use subtrees whose size is less than s (s = 6-8).
Frequency constraint: use subtrees that occur no less than f times in the training data (f = 2-5).
Pseudo iterations: after every 5 or 10 boosting iterations, we alternately perform 100 or 300 pseudo iterations, in which the optimal subtree is selected from a cache that maintains the features explored in the previous iterations.
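A hedged sketch of these techniques; the names (admissible, cache, gain_and_delta) are illustrative assumptions, not the authors' implementation.

    def admissible(size, freq, s=8, f=2):
        """Size/frequency filters: drop oversized or rare subtrees."""
        return size < s and freq >= f

    def pseudo_iteration(cache, gain_and_delta, w):
        """One pseudo iteration: pick the best feature from the cache of
        subtrees explored in previous (real) iterations -- no tree search.

        cache:          set of previously explored subtrees
        gain_and_delta: function t -> (gain, delta) under current pair weights
        """
        best = max(cache, key=lambda t: gain_and_delta(t)[0])
        gain, delta = gain_and_delta(best)
        w[best] = w.get(best, 0.0) + delta
        return best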
12 Relation to previous work
- Boosting vs. kernel methods [Collins 00]
- Boosting vs. Data-Oriented Parsing (DOP) [Bod 98]
13 Kernels [Collins 00]
Kernel methods reduce the problem to a dual form that depends only on the dot products of two instances (parse trees).
Pros:
- No need to provide an explicit feature vector.
- Dynamic programming is used to calculate dot products between trees, which is very efficient!
Cons:
- Require a large number of kernel evaluations in testing, so parsing is slow.
- Difficult to see which features are relevant.
14 DOP [Bod 98]
DOP is not based on re-ranking, but it deals with the all-subtrees representation explicitly, like our method.
Pros:
- High accuracy.
Cons:
- Exact computation is NP-complete.
- Cannot always provide a sparse feature representation.
- Very slow, since the number of subtrees DOP uses is huge.
15 Kernels vs. DOP vs. Boosting

                                    Kernel       DOP                    Boosting
  How to enumerate all subtrees?    implicitly   explicitly             explicitly
  Complexity in training            polynomial   NP-hard (worst case)   branch-and-bound
  Sparse feature representations    No           No                     Yes
  Parsing speed                     slow         slow                   fast
  Can see relevant features?        No           Yes, but difficult     Yes
                                                 (redundant features)
16 Experiments
- WSJ parsing
- Shallow parsing
17 Experiments
WSJ parsing:
- Standard data split: sections 2-21 of the PTB for training, section 23 for testing.
- Model 2 of [Collins 99] was used to obtain the n-best results.
- Exactly the same setting as [Collins 00] (kernels).
Shallow parsing:
- CoNLL 2000 shared task: sections 15-18 of the PTB for training, section 20 for testing.
- A CRF-based parser [Sha 03] was used to obtain the n-best results.
18 Tree representations
WSJ parsing: a lexicalized tree; each non-terminal has a special node labeled with its head word.
Shallow parsing: a right-branching tree in which adjacent phrases are in a child/parent relation, with special nodes for the right/left boundaries.
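A hedged sketch of the right-branching construction for shallow parsing: adjacent phrases become child/parent, with a hypothetical boundary marker. The node labels and tuple encoding are assumptions, not the paper's representation.

    def right_branching(phrases):
        """Nest a flat phrase sequence into a right-branching tree.

        >>> right_branching(["NP", "VP", "PP"])
        ('NP', ('VP', ('PP', 'EOS')))
        """
        tree = "EOS"                    # hypothetical right-boundary marker
        for label in reversed(phrases):
            tree = (label, tree)
        return tree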
19 Results: WSJ parsing
Comparable to other methods.
Better than the kernel method, which uses the same all-subtrees representation with a different parameter estimation method.
(LR/LP = labeled recall/precision. CBs is the average number of crossing brackets per sentence. 0 CBs and 2 CBs are the percentages of sentences with 0 or 2 crossing brackets, respectively.)
20 Results: Shallow parsing
Comparable to other methods.
Our method is also comparable to Zhang's method, even without extra linguistic features.
(F_{beta=1} is the harmonic mean of precision and recall: F = 2PR / (P + R).)
21 Advantages
Compact feature set:
- WSJ parsing: ~8,000 features
- Shallow parsing: ~3,000 features
- Kernels implicitly use a huge number of features.
Parsing is very fast (n-best parsing time is NOT included):
- WSJ parsing: … sec./sentence
- Shallow parsing: … sec./sentence
22 Advantages, cont'd
Sparse feature representations allow us to analyze which kinds of subtrees are relevant.
[Figure: examples of positive and negative subtrees selected for shallow parsing and WSJ parsing]
23 Conclusions
- All subtrees are potentially used as features.
- Boosting (L1-norm regularization) performs automatic feature selection.
- Branch-and-bound enables us to find the optimal subtrees efficiently.
Advantages: accuracy comparable to other parsing methods, fast parsing, and good interpretability.
24 Efficient computation
25 Rightmost extension [Asai 02, Zaki 02]
Extend a given tree of size (n-1) by adding a new node to obtain trees of size n:
- the new node is attached to a node on the rightmost path, and
- it is added as the rightmost sibling (i.e., the rightmost child of its parent).
[Figure: a tree t and its rightmost path, with the candidate attachment positions for the new node]
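A hedged sketch of rightmost extension for ordered, labeled trees; the (label, [children]) tuple encoding and all names are illustrative assumptions.

    def rightmost_extensions(tree, labels):
        """Yield every tree obtained by attaching one new node, with each
        candidate label, as the rightmost child of a node on the rightmost
        path. Starting from single-node trees and applying this recursively
        enumerates each ordered tree exactly once."""
        label, children = tree
        for new in labels:
            # attach at this node: the new node becomes its rightmost child
            yield (label, children + [(new, [])])
        if children:
            # recurse only into the rightmost child, i.e. the rightmost path
            for ext in rightmost_extensions(children[-1], labels):
                yield (label, children[:-1] + [ext])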
26 Rightmost extension, cont'd
Recursive applications of rightmost extension create a search space that covers the whole set of subtrees.
27 Pruning
For each subtree t, propose an upper bound mu(t) such that gain(t') <= mu(t) for any supertree t' of t.
We can prune the node t if mu(t) < tau, where tau is the gain of the current suboptimal solution.
Pruning strategy: mu(t) = 0.4 implies the gain of any supertree of t is no greater than 0.4, so if we have already found a subtree with gain 0.4 or more, the whole branch below t can be discarded.
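A hedged branch-and-bound sketch combining rightmost extension with this pruning rule; gain and upper_bound stand in for the boosting gain and its bound mu(t), and all names are illustrative assumptions, not the authors' implementation.

    def find_optimal_subtree(seeds, expand, gain, upper_bound):
        """Branch-and-bound over the rightmost-extension search space.

        seeds:       single-node trees that root the search space
        expand:      t -> iterable of rightmost extensions of t
        gain:        t -> boosting gain of subtree t
        upper_bound: t -> mu(t), a bound on gain(t') for every supertree t'
        """
        best, tau = None, float("-inf")
        stack = list(seeds)
        while stack:
            t = stack.pop()
            g = gain(t)
            if g > tau:
                best, tau = t, g      # better suboptimal solution found
            if upper_bound(t) <= tau:
                continue              # prune: no supertree of t can win
            stack.extend(expand(t))
        return best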
28 Upper bound of the gain
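As a sketch of what such a bound looks like, assuming the Morishita-style bound used in related boosting-with-substructures work (pair weights d_i, pair labels y_i in {+1, -1}, L pairs, and "t ⊆ x_i" meaning subtree t occurs in instance x_i); this exact form is an assumption, not quoted from the slide:

    \mu(t) = \max\Biggl(
      2\sum_{\substack{i:\,y_i=+1 \\ t \subseteq x_i}} d_i - \sum_{i=1}^{L} y_i d_i ,\quad
      2\sum_{\substack{i:\,y_i=-1 \\ t \subseteq x_i}} d_i + \sum_{i=1}^{L} y_i d_i
    \Biggr)
    \;\ge\; \mathrm{gain}(t') \quad \text{for any supertree } t' \supseteq t .

The bound is computable from the occurrences of t alone, which is what makes the branch-and-bound pruning on the previous slide feasible.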