Margin-based Decomposed Amortized Inference

Gourab Kundu, Vivek Srikumar, Dan Roth

Amortized Inference

Key observations: (1) In NLP, we solve a lot of inference problems, at least one per sentence. (2) Redundancy of structures: the number of observed structures (blue solid line) is much smaller than the number of inputs (red dotted line), and the distribution of observed structures is highly skewed (inset); e.g., for POS tagging, a small number of tag sequences are much more frequent than the rest. The pigeonhole principle applies.

Research question: Can we solve the k-th inference instance much faster than the 1st?

Amortized inference (Srikumar et al., 2012) shows how computation from earlier inference instances can be used to speed up inference for new, previously unseen instances.

General Recipe

    If CONDITION(problem cache, new problem) then
        (no need to call the solver)
        SOLUTION(new problem) = old solution
    Else
        Call base solver and update cache
    End

plus a theorem guaranteeing correctness. (A Python sketch of this recipe follows this transcript.)

ILP Formulations

Integer Linear Programming (ILP) is a general formulation for inference in structured prediction tasks [Roth & Yih, 04, 07]. Inference using ILP has been successful in NLP tasks, e.g., SRL, dependency parsing, information extraction, and more. An ILP can be expressed as

    max c^T x   s.t.   Ax <= b,   x integer

for example: max 2x1 + 3x2 s.t. x1 + x2 <= 1 (solved by brute force in a sketch below). Amortized inference depends on having an ILP formulation, but multiple solvers might be used.

1. Margin-based Amortization

Let B be the solution yp for inference problem p with objective cp (the red hyperplane in the poster figure), and let A be the second-best assignment. For a new objective function cq (blue), if the margin δ is greater than the sum of the decrease in the objective value of yp and the maximum increase in the objective of another solution (Δ), then the solution to the new inference problem is still yp. Formally, the condition is

    -(cq - cp)^T yp + Δ <= δ

At the caching stage, δ is computed for each ILP and stored in the cache. At test time, computing Δ exactly would require solving an ILP; instead, we compute an approximate Δ by solving a relaxed problem, which can be done efficiently. (This check is sketched in code below.)

2. Decomposed Amortized Inference using Lagrangian Relaxation

There is more redundancy among smaller structures (redundancy in components of structures): we extend amortization to cases where the full structured output is not repeated, by storing partial computation for future inference problems. Given a problem max cq^T y s.t. M^T y <= b, partition the constraints into two sets, say C1 and C2 (the latter with constraints M2^T y <= b2). Define L(λ) = max_{y ∈ C1} cq^T y - λ^T (M2^T y - b2), which can be solved using any amortized algorithm. The dual problem, min_{λ >= 0} L(λ), is solved using gradient descent over the dual variables λ (sketched below). The dual can be decomposed into multiple smaller problems if no constraint in C1 has active variables from two different smaller problems; even otherwise, it has fewer constraints than the original problem. Considering the dual has two advantages from the perspective of amortization: (1) smaller problems and (2) a higher chance of cache hits.

Experimental Setup

Tasks: Semantic Role Labeling, and Entities and Relations [Roth & Yih, 2004]; see also [Punyakanok et al., 2008]. We simulate a long-running NLP process by caching problems and solutions from the Gigaword corpus, using a database engine to cache ILPs and their solutions along with their structured margins. We compare our approaches to a state-of-the-art ILP solver (Gurobi) and to Theorem 1 from (Srikumar et al., 2012).
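The general recipe is easiest to see in code. Below is a minimal Python sketch under assumed interfaces: the `signature` cache key and the `condition` and `base_solver` callables are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch of the amortized-inference recipe above. The cache
# layout, the `signature` key, and the `condition`/`base_solver`
# callables are hypothetical, not the authors' implementation.

def amortized_solve(problem, cache, condition, base_solver):
    """Reuse a cached solution when `condition` certifies that it is
    still optimal for `problem`; otherwise call the base ILP solver."""
    key = problem["signature"]  # e.g. (num variables, num constraints)
    for cached_problem, cached_solution in cache.get(key, []):
        if condition(cached_problem, cached_solution, problem):
            return cached_solution       # cache hit: no solver call
    solution = base_solver(problem)      # cache miss: solve from scratch
    cache.setdefault(key, []).append((problem, solution))
    return solution
```

In the margin-based method of Section 1, `condition` is exactly the δ/Δ test sketched further below.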
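The toy ILP from the ILP Formulations section can be checked by brute-force enumeration, which is practical only at this tiny scale but makes the formulation concrete (assuming 0/1 variables, as is typical for the NLP formulations above):

```python
from itertools import product

# The poster's toy ILP: max 2*x1 + 3*x2  s.t.  x1 + x2 <= 1,
# assuming 0/1 variables as in the NLP formulations above.
c = (2, 3)
best_value, best_x = float("-inf"), None
for x in product((0, 1), repeat=2):       # enumerate all 0/1 assignments
    if x[0] + x[1] <= 1:                  # keep only feasible ones
        value = sum(ci * xi for ci, xi in zip(c, x))
        if value > best_value:
            best_value, best_x = value, x
print(best_x, best_value)                 # (0, 1) 3
```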
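The margin-based test of Section 1 reduces to a few vector operations once an approximate Δ is available. A sketch, assuming numpy arrays for the objectives and the cached solution:

```python
import numpy as np

def margin_condition_holds(c_p, c_q, y_p, delta, Delta):
    """Test the condition -(c_q - c_p)^T y_p + Delta <= delta.

    c_p, y_p, delta: cached objective, its optimal solution, and the
    stored structured margin (gap to the second-best assignment).
    c_q: the new objective. Delta: an upper bound on how much any
    competing solution's score can increase under c_q (the paper
    approximates it by solving a relaxed problem).
    If the test passes, y_p is provably still optimal under c_q.
    """
    decrease = -(c_q - c_p) @ y_p   # drop in y_p's own score under c_q
    return decrease + Delta <= delta
```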
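The dual of Section 2 can be minimized by projected subgradient descent over λ. A sketch under the assumption that `solve_relaxed` (a hypothetical name) solves the inner maximization over C1, the piece the poster notes can itself be handled by any amortized algorithm; the step size and iteration count are illustrative, not tuned values:

```python
import numpy as np

def solve_dual(c_q, M2, b2, solve_relaxed, steps=200, eta=0.05):
    """Projected (sub)gradient descent on min_{lam >= 0} L(lam), where
    L(lam) = max_{y in C1} c_q^T y - lam^T (M2^T y - b2).

    `solve_relaxed(c)` must return argmax_{y in C1} c^T y; this easier
    subproblem is where an amortized algorithm can be plugged in.
    """
    lam = np.zeros(len(b2))
    y = None
    for _ in range(steps):
        # The lam-dependent terms fold into a modified cost vector.
        y = solve_relaxed(c_q - M2 @ lam)
        violation = M2.T @ y - b2          # subgradient of -L at lam
        if np.all(violation <= 1e-9) and abs(lam @ violation) < 1e-9:
            break                          # feasible + complementary slackness
        lam = np.maximum(0.0, lam + eta * violation)
    return y, lam
```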
This work continues the original work on amortization: Srikumar, Kundu and Roth, "On Amortizing Inference Cost for Structured Prediction," EMNLP 2012.

Results: we need to solve only one in four problems in one setting, and only one in six in another, with wall-clock improvements as well.

This research is sponsored by the Army Research Laboratory (ARL) under agreement W911NF-09-2-0053, the DARPA Machine Reading Program under Air Force Research Laboratory (AFRL) prime contract no. FA8750-09-C-0181, DARPA under agreement number FA8750-13-2-0008, and the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20155.