Approximation-aware Dependency Parsing by Belief Propagation. September 19, 2015, TACL at EMNLP. Matt Gormley, Mark Dredze, Jason Eisner

Motivation #1: Approximation-unaware Learning. Problem: approximate inference causes standard learning algorithms to go awry (Kulesza & Pereira, 2008). Can we take our approximations into account? (Figure: results with exact inference vs. with approximate inference.)

Motivation #2: Hybrid Models. Graphical models let you encode domain knowledge. Neural nets are really good at fitting the data discriminatively to make good predictions. Could we define a neural net that incorporates domain knowledge?

Our Solution. Key idea: treat your unrolled approximate inference algorithm as a deep network.

Talk Summary. Structured BP = Loopy BP + Dynamic Prog. (Smith & Eisner, 2008). ERMA / Back-BP = Loopy BP + Backprop (Eaton & Ghahramani, 2009; Stoyanov et al., 2011). This Talk = Loopy BP + Dynamic Prog. + Backprop.

This Talk = Loopy BP + Dynamic Prog. + Backprop = Graphical Models + Hypergraphs + Neural Networks (the models that interest me). If you’re thinking, “This sounds like a great direction!”, then you’re in good company, and have been since before 1995.

This Talk = Loopy BP + Dynamic Prog. + Backprop = Graphical Models + Hypergraphs + Neural Networks (the models that interest me). So what’s new since 1995? Two new emphases: (1) learning under approximate inference, and (2) structural constraints.

An Abstraction for Modeling. Mathematical modeling with a factor graph (a bipartite graph): variables (circles, e.g. y1, y2) and factors (squares, e.g. ψ2, ψ12).
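
To make the abstraction concrete, here is a minimal Python sketch (class names and numbers are invented for illustration; this is not the talk's code) of a factor graph as a bipartite structure: variables, factors holding potential tables over their neighboring variables, and the product score of a joint assignment.

```python
import numpy as np

class Variable:
    """A discrete variable; for dependency parsing, each edge variable is boolean."""
    def __init__(self, name, num_states=2):
        self.name = name
        self.num_states = num_states

class Factor:
    """A factor touching a list of variables, with one potential per joint state."""
    def __init__(self, variables, table):
        self.variables = variables                   # its neighbors in the bipartite graph
        self.table = np.asarray(table, dtype=float)  # shape: (num_states,) * len(variables)

class FactorGraph:
    def __init__(self, variables, factors):
        self.variables = variables
        self.factors = factors

    def score(self, assignment):
        """Unnormalized score: the product of all factor potentials at this assignment."""
        total = 1.0
        for f in self.factors:
            idx = tuple(assignment[v.name] for v in f.variables)
            total *= f.table[idx]
        return total

# Tiny example matching the slide: variables y1 and y2, a unary factor psi_2 on y2,
# and a pairwise factor psi_12 on (y1, y2).
y1, y2 = Variable("y1"), Variable("y2")
psi_2 = Factor([y2], [1.0, 3.0])
psi_12 = Factor([y1, y2], [[2.0, 1.0], [1.0, 4.0]])
fg = FactorGraph([y1, y2], [psi_2, psi_12])
print(fg.score({"y1": 1, "y2": 1}))   # 3.0 * 4.0 = 12.0
```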

Factor Graph for Dependency Parsing. (Figure: one boolean variable Y_{h,m} for each possible edge from head h to modifier m, including edges from the root, h = 0.)

Factor Graph for Dependency Parsing (Smith & Eisner, 2008). (Figure: the same variables Y_{h,m} drawn over the example sentence “Juan_Carlos abdica su reino”, with a left-arc and a right-arc variable for each word pair.)

Factor Graph for Dependency Parsing (Smith & Eisner, 2008). (Figure: one assignment to the variables; ✔ marks the edges that are present, picking out a single dependency tree for the example sentence.)

Factor Graph for Dependency Parsing (Smith & Eisner, 2008). Unary factor: a local opinion about one edge.

Factor Graph for Dependency Parsing (Smith & Eisner, 2008). PTree factor: a hard constraint, multiplying in 1 if the variables form a tree and 0 otherwise. Unary factors: local opinions about individual edges.

Factor Graph for Dependency Parsing (Smith & Eisner, 2008). Grandparent factor: a local opinion about a grandparent, head, and modifier. (Plus the Unary and PTree factors as before.)

Factor Graph for Dependency Parsing (Riedel & Smith, 2010; Martins et al., 2010). Sibling factor: a local opinion about a pair of arbitrary siblings. Grandparent factor: a local opinion about a grandparent, head, and modifier. Together with the Unary factors and the PTree constraint, these make up the full model.
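
A small sketch may help fix how these factor types combine: the score of a joint assignment to the edge variables is the product of the unary (and grandparent/sibling) potentials that fire, times the PTree factor's 0/1 indicator that the chosen edges form a dependency tree. The helper names (is_tree, score) and the toy numbers are invented for illustration, and projectivity is ignored for simplicity.

```python
def is_tree(edges, n):
    """Check that `edges` (a set of (head, modifier) pairs over words 1..n,
    with 0 as the root) forms a dependency tree: exactly one head per word,
    and every word reachable from the root.  Projectivity is ignored here."""
    heads = {}
    for h, m in edges:
        if m in heads:          # a word with two heads is not a tree
            return False
        heads[m] = h
    if set(heads) != set(range(1, n + 1)):
        return False
    for m in range(1, n + 1):   # follow head pointers; must reach root without cycling
        seen, cur = set(), m
        while cur != 0:
            if cur in seen:
                return False    # cycle
            seen.add(cur)
            cur = heads[cur]
    return True

def score(edges, n, unary, grandparent=None, sibling=None):
    """Unnormalized score of one assignment: product of local factor potentials
    times the PTree factor's hard 0/1 constraint."""
    if not is_tree(edges, n):
        return 0.0                          # PTree factor multiplies in 0
    s = 1.0
    for (h, m) in edges:
        s *= unary[(h, m)]                  # Unary: local opinion about one edge
    if grandparent:                         # (g, h, m): fires when g->h and h->m are both present
        for (g, h, m), w in grandparent.items():
            if (g, h) in edges and (h, m) in edges:
                s *= w
    if sibling:                             # (h, m, s_): fires when h->m and h->s_ are both present
        for (h, m, s_), w in sibling.items():
            if (h, m) in edges and (h, s_) in edges:
                s *= w
    return s

# Example: a 2-word sentence with edges 0->1 and 1->2.
unary = {(0, 1): 2.0, (1, 2): 3.0, (0, 2): 1.0, (2, 1): 1.0}
print(score({(0, 1), (1, 2)}, 2, unary))    # 2.0 * 3.0 = 6.0
```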

Factor Graph for Dependency Parsing (Riedel & Smith, 2010; Martins et al., 2010). Now we can work at this level of abstraction.

Why dependency parsing? (1) It is the simplest example for Structured BP. (2) It exhibits both poly-time and NP-hard problems.

The Impact of Approximations. (Figure: Linguistics, Model, Learning, Inference; inference is usually called as a subroutine in learning.) Exact inference, e.g. computing p_θ(a particular parse) = 0.50, is NP-hard for this model.

The Impact of Approximations. (Same figure: Linguistics, Model, Learning, Inference.) Replace exact inference with a poly-time approximation!

The Impact of Approximations. (Same figure.) With the poly-time approximation in place, the question becomes: does learning know that inference is approximate?

Conditional Log-likelihood Training. (1) Choose a model, such that the derivative in step 3 is easy; here it comes from the log-linear factors. (2) Choose an objective: assign high probability to the things we observe and low probability to everything else. (3) Compute the derivative by hand using the chain rule. (4) Replace exact inference by approximate inference.
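
For concreteness, here is what the hand-derived gradient of step 3 looks like for a single log-linear factor: the familiar observed-minus-expected-features form, where step 4 amounts to plugging approximate marginals (BP beliefs) in place of exact ones. A minimal sketch with invented feature vectors:

```python
import numpy as np

def cll_gradient(feats, gold_y, marginals):
    """Gradient of the conditional log-likelihood for one training example.

    feats[y]     : feature vector f(x, y) for each possible value y
    gold_y       : the observed (gold) value
    marginals[y] : p_theta(y | x) -- exact if we can afford it, otherwise the
                   approximate marginals (beliefs) returned by Structured BP
    Returns  d/dtheta log p_theta(gold_y | x) = f(x, gold_y) - E_p[f(x, Y)].
    """
    expected = sum(marginals[y] * feats[y] for y in feats)
    return feats[gold_y] - expected

# Toy example: one binary variable with 2-dimensional features and theta = 0.
theta = np.zeros(2)
feats = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0])}
scores = {y: np.exp(theta @ feats[y]) for y in feats}
Z = sum(scores.values())
marginals = {y: scores[y] / Z for y in feats}   # exact here; BP beliefs in general
print(cll_gradient(feats, gold_y=1, marginals=marginals))   # [-0.5  0.5]
```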

What’s wrong with CLL? How did we compute these approximate marginal probabilities anyway? By Structured Belief Propagation, of course!

Everything you need to know about Structured BP: (1) it’s a message passing algorithm; (2) the message computations are just multiplication, addition, and division; (3) those computations are differentiable.
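
Point 2 can be seen directly in a sketch of the updates: variable-to-factor messages multiply (and divide, for normalization), and factor-to-variable messages multiply and sum. The tiny two-variable graph below is invented for illustration; because this little graph is a tree, the beliefs happen to be exact.

```python
import numpy as np

# A tiny factor graph: binary variables A and B, unary factors psi_A and psi_B,
# and a pairwise factor psi_AB.  (Illustrative example, not the parser's graph.)
psi_A  = np.array([1.0, 2.0])
psi_B  = np.array([3.0, 1.0])
psi_AB = np.array([[2.0, 1.0],
                   [1.0, 2.0]])

# Messages are vectors over a variable's two states.  Initialize to uniform.
m_A_to_AB = np.ones(2)          # variable A -> factor psi_AB
m_B_to_AB = np.ones(2)
m_AB_to_A = np.ones(2)          # factor psi_AB -> variable A
m_AB_to_B = np.ones(2)

for t in range(5):              # a few synchronous BP iterations
    # Variable -> factor: product of messages from the variable's OTHER factors.
    m_A_to_AB = psi_A * 1.0     # A's only other neighbor is its unary factor
    m_B_to_AB = psi_B * 1.0
    # Factor -> variable: multiply in the other variable's message, then sum it out.
    m_AB_to_A = psi_AB @ m_B_to_AB
    m_AB_to_B = psi_AB.T @ m_A_to_AB

# Beliefs: product of all incoming messages, normalized (division).
belief_A = psi_A * m_AB_to_A
belief_A /= belief_A.sum()
belief_B = psi_B * m_AB_to_B
belief_B /= belief_B.sum()
print(belief_A, belief_B)       # approximate marginals (exact here, since the graph is a tree)
```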

Structured Belief Propagation (Smith & Eisner, 2008). This is just another factor graph, so we can run Loopy BP. What goes wrong? The naïve computation of the PTree factor’s messages is inefficient. The fix: we can embed the inside-outside algorithm within the structured factor.

Algorithmic Differentiation. Backprop works on more than just neural networks: you can apply the chain rule to any arbitrary differentiable algorithm. Alternatively, you could estimate a gradient by finite-difference approximations, but algorithmic differentiation is much more efficient! That’s the key (old) idea behind this talk.
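
A quick gradient check makes the contrast concrete: finite differences cost extra function evaluations per parameter, while algorithmic differentiation delivers the whole gradient in roughly one backward pass. The sketch below uses a toy function and torch's autograd purely as a convenient off-the-shelf autodiff tool (the authors' system implements its own backprop); all names are illustrative.

```python
import numpy as np
import torch

def f_np(theta):
    # a stand-in for any differentiable "algorithm"
    return np.sin(theta[0]) * theta[1] ** 2 + theta[0] * theta[1]

# Finite differences: one pair of perturbed evaluations per parameter
# (fine for 2 parameters, painful for millions).
theta = np.array([0.3, -1.2])
eps = 1e-6
fd_grad = np.array([(f_np(theta + eps * np.eye(2)[i]) -
                     f_np(theta - eps * np.eye(2)[i])) / (2 * eps) for i in range(2)])

# Algorithmic differentiation: one forward and one backward pass for the full gradient.
theta_t = torch.tensor([0.3, -1.2], requires_grad=True)
y = torch.sin(theta_t[0]) * theta_t[1] ** 2 + theta_t[0] * theta_t[1]
y.backward()

print(fd_grad)
print(theta_t.grad)   # the two gradients agree to several decimal places
```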

Feed-forward Topology of Inference, Decoding and Loss. (Figure: the model parameters feed into the factors; a unary factor is a vector with 2 entries, a binary factor a (flattened) matrix with 4 entries.)

(Figure continues: messages at time t=0 and t=1 appear as successive layers. Messages from a node’s neighbors are used to compute its next message; this leads to sparsity in the layerwise connections.)

Feed-forward Topology of Inference, Decoding and Loss. Arrows in a neural net mean: a linear combination, then a sigmoid. Arrows in this diagram have a different semantics, given by the algorithm.

Feed-forward Topology. (Figure: the full unrolled computation: model parameters → factors → messages at times t=0, 1, 2, 3 → beliefs → decode / loss.)

Feed-forward Topology. Messages from the PTree factor rely on a variant of inside-outside. (As before, arrows in this diagram have a semantics given by the algorithm.)

Feed-forward Topology. Messages from the PTree factor rely on a variant of inside-outside. (Figure: the chart parser’s computation, unrolled into the network.)
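
Written out, the unrolled picture is just an ordinary feed-forward loop: the messages at time t are a deterministic function of the factors and the messages at time t-1, beliefs are read off at the end, and a loss is computed on the beliefs. A schematic sketch on the tiny pairwise graph from earlier (the PTree/inside-outside layers are omitted; function names are invented):

```python
import numpy as np

def bp_layer(msgs, psi_A, psi_B, psi_AB):
    """One 'layer' of the unrolled network: messages at time t+1 as a
    function of the factors and the messages at time t."""
    m_A_to_AB, m_B_to_AB, m_AB_to_A, m_AB_to_B = msgs
    new_A_to_AB = psi_A                    # A's only other neighbor is its unary factor
    new_B_to_AB = psi_B
    new_AB_to_A = psi_AB @ m_B_to_AB       # multiply in the neighbor's message, sum it out
    new_AB_to_B = psi_AB.T @ m_A_to_AB
    return [new_A_to_AB, new_B_to_AB, new_AB_to_A, new_AB_to_B]

def unrolled_bp(psi_A, psi_B, psi_AB, T=3):
    """The feed-forward topology: factors -> messages at t=0..T -> beliefs -> loss."""
    msgs = [np.ones(2) for _ in range(4)]          # messages at time t=0
    for t in range(T):                             # T unrolled layers
        msgs = bp_layer(msgs, psi_A, psi_B, psi_AB)
    belief_A = psi_A * msgs[2]
    belief_A = belief_A / belief_A.sum()
    belief_B = psi_B * msgs[3]
    belief_B = belief_B / belief_B.sum()
    return belief_A, belief_B

psi_A, psi_B = np.array([1.0, 2.0]), np.array([3.0, 1.0])
psi_AB = np.array([[2.0, 1.0], [1.0, 2.0]])
bA, bB = unrolled_bp(psi_A, psi_B, psi_AB)
loss = -np.log(bA[1])        # e.g. negative log-belief of the "gold" value of A
print(bA, bB, loss)
```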

Approximation-aware Learning. Key idea: open up the black box! (1) Choose the model to be the computation with all its approximations. (2) Choose the objective to likewise include the approximations. (3) Compute the derivative by backpropagation (treating the entire computation as if it were a neural network). (4) Make no approximations! (Our gradient is exact.)
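
Here is a schematic of that recipe on the same toy graph, using torch's autograd as a stand-in for the paper's backpropagation through Structured BP (the real system also backprops through the embedded inside-outside computation, which is not shown here): the objective is defined on the beliefs produced by a fixed number of BP iterations, and the gradient of that entire unrolled computation is exact.

```python
import torch

def unrolled_bp_beliefs(theta, T=3):
    """Beliefs from T iterations of BP on the toy pairwise graph, with the
    factor potentials parameterized log-linearly as exp(theta)."""
    psi_A = torch.exp(theta[0:2])
    psi_B = torch.exp(theta[2:4])
    psi_AB = torch.exp(theta[4:8]).reshape(2, 2)
    m_A, m_B = torch.ones(2), torch.ones(2)        # messages at time t = 0
    m_AB_A, m_AB_B = torch.ones(2), torch.ones(2)
    for _ in range(T):                             # the unrolled "layers"
        m_A, m_B = psi_A, psi_B                    # variable -> factor messages
        m_AB_A, m_AB_B = psi_AB @ m_B, psi_AB.T @ m_A
    b_A = psi_A * m_AB_A
    b_B = psi_B * m_AB_B
    return b_A / b_A.sum(), b_B / b_B.sum()

theta = torch.zeros(8, requires_grad=True)         # model parameters
opt = torch.optim.SGD([theta], lr=0.5)
gold_A, gold_B = 1, 0                              # "observed" values in this toy example

for step in range(50):
    opt.zero_grad()
    b_A, b_B = unrolled_bp_beliefs(theta)
    # Objective defined on the *approximate* beliefs themselves:
    loss = -torch.log(b_A[gold_A]) - torch.log(b_B[gold_B])
    loss.backward()                                # exact gradient of the approximate computation
    opt.step()

print(loss.item())                                 # decreases as training proceeds
```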

Experimental Setup. Goal: compare two training approaches, (1) the standard approach (CLL) and (2) the new approach (Backprop). Data: English PTB, converted to dependencies using the Yamada & Matsumoto (2003) head rules, with the standard train (02-21) / dev (22) / test (23) split and POS tags predicted by TurboTagger. Metric: Unlabeled Attachment Score (UAS; higher is better).
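
For reference, the metric is simple to state in code: UAS is the fraction of words whose predicted head index matches the gold head index (with 0 conventionally denoting the root). A minimal sketch:

```python
def unlabeled_attachment_score(pred_heads, gold_heads):
    """UAS: fraction of words whose predicted head index equals the gold head index.
    Both inputs are lists of head indices (0 = root), one entry per word."""
    assert len(pred_heads) == len(gold_heads)
    correct = sum(p == g for p, g in zip(pred_heads, gold_heads))
    return correct / len(gold_heads)

# Example: a 4-word sentence where 3 of 4 heads are predicted correctly.
print(unlabeled_attachment_score([2, 0, 2, 2], [2, 0, 2, 3]))   # 0.75
```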

Results: Speed-Accuracy Tradeoff. The new training approach yields models which are (1) faster for a given level of accuracy and (2) more accurate for a given level of speed. (Figure: speed vs. accuracy for dependency parsing.)

Results: Increasingly Cyclic Models. As we add more factors to the model, it becomes loopier; yet our training by Backprop consistently improves as models get richer. (Figure: accuracy vs. model richness for dependency parsing.)

See our TACL paper for: (1) results on 19 languages from CoNLL 2006/2007; (2) results with alternate training objectives; (3) an empirical comparison of exact and approximate inference.

Comparison of Two Approaches. 1. CLL with approximate inference: a totally ridiculous thing to do! But it’s been done for years, because it often works well. (Also named “surrogate likelihood” training by Wainwright (2006).)

Comparison of Two Approaches. 2. Approximation-aware learning for NLP: in hindsight, treating the approximations as part of the model is the obvious thing to do (Domke, 2010; Domke, 2011; Stoyanov et al., 2011; Ross et al., 2011; Stoyanov & Eisner, 2012; Hershey et al., 2014). Our contribution: approximation-aware learning with structured factors. But there are some challenges to get it right (numerical stability, efficiency, backprop through structured factors, annealing a decoder’s argmin). Sum-Product Networks are similar in spirit (Poon & Domingos, 2011; Gens & Domingos, 2012). Key idea: open up the black box!

Takeaways. The new learning approach for Structured BP maintains high accuracy with fewer iterations of BP, even with cycles. Need a neural network? Treat your unrolled approximate inference algorithm as a deep network.

Questions? Pacaya: an open-source framework for hybrid graphical models, hypergraphs, and neural networks. Features: Structured BP; coming soon: approximation-aware training. Language: Java. URL: