Sequential Learning with Dependency Nets

Sequential Learning with Dependency Nets William W. Cohen 2/22

CRFs: the good, the bad, and the cumbersome…

Good points:
- Global optimization of the weight vector that guides decision making
- Trades off decisions made at different points in the sequence

Worries:
- Cost (of training)
- Complexity (do we need all this math?)
- Amount of context: the matrix for the normalizer is |Y| * |Y|, so high-order models for many classes get expensive fast.
- Strong commitment to maxent-style learning: loglinear models are nice, but nothing is always best.
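To make the "amount of context" worry concrete, here is a rough sketch of how the per-position matrix grows if a higher-order chain is encoded by treating each tuple of consecutive labels as a single state. The function name, the dense encoding, and the 45-label tag set are assumptions for illustration and do not come from the slides.

```python
def normalizer_matrix_entries(num_labels: int, order: int) -> int:
    """Entries in the per-position transition matrix when an order-`order`
    chain is encoded with states that are `order`-tuples of labels.
    Assumes a dense encoding; sparse encodings can be cheaper."""
    num_states = num_labels ** order
    return num_states * num_states

# Hypothetical tag set of 45 labels, at increasing model order.
for order in (1, 2, 3):
    print(order, normalizer_matrix_entries(45, order))
```

Under these assumptions the matrix has |Y|^(2k) entries for an order-k model, which is the sense in which high-order models over many classes get expensive fast.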

Dependency Nets

Proposed solution:
- The parents of a node are its Markov blanket
  - like an undirected Markov net: captures all “correlational associations”
- One conditional probability for each node X, namely P(X | parents of X)
  - like a directed Bayes net: no messy clique potentials

Example: bidirectional chains
[Figure: label nodes Y1, Y2, …, Yi, … above the tokens "When will dr Cohen post the notes".]

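DN chains
[Figure: chain of label nodes …, Yi, … above the tokens "When will dr Cohen post the notes", with the current values marked.]

How do we do inference? Iteratively:
1. Pick values for Y1, Y2, … at random.
2. Pick some j, and compute the local conditional for Yj given x and the current values of the other Y's in its Markov blanket.
3. Set a new value of Yj according to this distribution.
4. Go back to (2).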

This is an MCMC process
Markov Chain Monte Carlo: a randomized process that changes y(t) to y(t+1) and doesn’t depend on previous y’s.
[Figure: the general case, with a transition probability between states, and one particular run.]

How do we do inference? Iteratively:
1. Pick values for Y1, Y2, … at random: y(0).
2. Pick some j, and compute the local conditional for Yj given the current values y(t).
3. Set a new value of Yj according to this: y(1).
4. Go back to (2) and repeat to get y(1), y(2), …, y(t), …

This is an MCMC process
Claim: suppose Y(t) is drawn from some distribution D such that [equation not preserved in the transcript]. Then Y(t+1) is also drawn from D (i.e., the random flip doesn’t move us “away from D”).
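The invariance claim can be checked numerically on a tiny case. The sketch below assumes the local conditionals are exactly the conditionals of D, which is the standard Gibbs-sampling setting; the joint table D and the helper `gibbs_step` are made up for illustration.

```python
from fractions import Fraction as F

# Hypothetical target distribution D over two binary variables (a, b).
D = {(0, 0): F(4, 10), (0, 1): F(1, 10), (1, 0): F(2, 10), (1, 1): F(3, 10)}
states = sorted(D)

def gibbs_step(dist, resample):
    """Push `dist` through one Gibbs move that resamples variable
    `resample` (0 for a, 1 for b) from its conditional under D."""
    out = {s: F(0) for s in states}
    for s, mass in dist.items():
        # States that agree with s on the variable being held fixed.
        peers = [t for t in states if t[1 - resample] == s[1 - resample]]
        norm = sum(D[t] for t in peers)
        for t in peers:
            out[t] += mass * D[t] / norm
    return out

# If Y(t) ~ D, the distribution of Y(t+1) is unchanged by either move.
print(gibbs_step(D, resample=0) == D, gibbs_step(D, resample=1) == D)
```

Exact fractions are used so the comparison is not affected by floating-point rounding. A starting distribution other than D, such as a point mass on one state, is generally changed by the same step.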

This is an MCMC process
“Burn-in” claim: if you wait long enough, then for some t, Y(t) will be drawn from some distribution D such that [equation not preserved in the transcript], under certain reasonable conditions (e.g., the graph of potential edges is connected, …). So D is a “sink”.

This is an MCMC process
[Figure: one run of the chain; the “burn-in” portion is discarded and the remainder is averaged for prediction.]

An algorithm:
1. Run the MCMC chain for a long time t, and hope that Y(t) will be drawn from the target distribution D.
2. Run the MCMC chain for a while longer and save the sample S = { Y(t), Y(t+1), …, Y(t+m) }.
3. Use S to answer any probabilistic queries like Pr(Yj|X).
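The burn-in-then-average procedure might be sketched as follows for a chain-shaped dependency net. The local conditional here is a toy log-linear score with made-up weights, standing in for whatever learned model supplies Pr(Yj | x, neighbors); the names `local_conditional` and `gibbs_marginals` are hypothetical.

```python
import math
import random

LABELS = (0, 1)

def local_conditional(j, y, x):
    """Toy stand-in for Pr(Y_j | x, y_{j-1}, y_{j+1}): rewards agreeing with
    the observation x[j] and with each neighbor. Weights are made up."""
    scores = []
    for label in LABELS:
        s = 1.5 * (label == x[j])
        if j > 0:
            s += 1.0 * (label == y[j - 1])
        if j < len(y) - 1:
            s += 1.0 * (label == y[j + 1])
        scores.append(math.exp(s))
    total = sum(scores)
    return [s / total for s in scores]

def gibbs_marginals(x, burn_in=500, num_samples=2000, seed=0):
    """Estimate Pr(Y_j = label | x) by Gibbs sampling with burn-in."""
    rng = random.Random(seed)
    y = [rng.choice(LABELS) for _ in x]          # random initial values
    counts = [[0] * len(LABELS) for _ in x]
    for t in range(burn_in + num_samples):
        j = rng.randrange(len(y))                # pick some j
        probs = local_conditional(j, y, x)
        y[j] = rng.choices(LABELS, weights=probs)[0]  # resample Y_j
        if t >= burn_in:                         # discard burn-in states
            for i, label in enumerate(y):
                counts[i][label] += 1
    return [[c / num_samples for c in row] for row in counts]

marginals = gibbs_marginals(x=[1, 1, 0, 1])
print(marginals)
```

The burn-in length and sample count are arbitrary choices here; in practice they would need tuning, since successive states of the chain are correlated.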

More on MCMC
- This particular process is Gibbs sampling: transition probabilities are defined by sampling from the posterior of one variable Yj given the others.
- MCMC is a very general-purpose inference scheme (and sometimes very slow).
- On the plus side, learning is relatively cheap, since there’s no inference involved (!)
- A dependency net is closely related to a Markov random field learned by maximizing pseudo-likelihood. Identical?
- The statistical relational learning community has some proponents of this approach: Pedro Domingos, David Jensen, …
- A big advantage is the generality of the approach: sparse learners (e.g., L1-regularized maxent, decision trees, …) can be used to infer the Markov blanket (NIPS 2006).
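For reference, the pseudo-likelihood objective is usually written as below. The notation is introduced here rather than taken from the slides: θ denotes the model parameters, N(j) the Markov blanket of node j, and n indexes training examples.

```latex
\log \mathrm{PL}(\theta) \;=\; \sum_{n} \sum_{j} \log P_\theta\!\left( y_j^{(n)} \,\middle|\, y_{N(j)}^{(n)},\, x^{(n)} \right)
```

Each term involves only one node and its blanket, which is why training needs no global inference. One difference worth keeping in mind: in pseudo-likelihood training of a Markov random field the conditionals all derive from a single joint model, whereas a dependency net may fit each node's conditional separately.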

Examples
[Figure: chain of label nodes Y1, Y2, …, Yi, … above the tokens "When will dr Cohen post the notes".]

Examples
[Figure: two layers of label nodes, Z1, Z2, …, Zi and Y1, Y2, …, Yi, annotated "POS?" and "BIO", above the tokens "When will dr Cohen post the notes".]

Dependency nets
The bad and the ugly:
- Inference is less efficient: MCMC sampling
- Can’t reconstruct the probability via the chain rule
- Networks might be inconsistent, i.e., the local P(x|pa(x))’s don’t define a pdf
- Exactly equal, representationally, to normal undirected Markov nets

Dependency nets
The good:
- Learning is simple and elegant (if you know each node’s Markov blanket): just learn a probabilistic classifier for P(X|pa(X)) for each node X. (You might not learn a consistent model, but you’ll probably learn a reasonably good one.)
- Inference can be sped up substantially over naïve Gibbs sampling.

Dependency nets
Learning is simple and elegant (if you know each node’s Markov blanket): just learn a probabilistic classifier for P(X|pa(X)) for each node X.
[Figure: chain y1, y2, y3, y4 over input x, with local models Pr(y1|x,y2), Pr(y2|x,y1,y3), Pr(y3|x,y2,y4), Pr(y4|x,y3).]
Learning is local, but inference is not, and need not be unidirectional.
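As a minimal illustration of "one local classifier per node", the sketch below estimates Pr(y_i | x_i, y_{i-1}, y_{i+1}) from labeled sequences using add-alpha smoothed counts. Counting is swapped in here only to keep the example self-contained; it is not the learner used in the slides, and the function name, padding symbols, and toy data are hypothetical.

```python
from collections import Counter, defaultdict

def train_local_conditionals(sequences, labels, alpha=1.0):
    """Estimate Pr(y_i | x_i, y_{i-1}, y_{i+1}) by add-alpha smoothed counts.
    `sequences` is a list of (tokens, tags) pairs; '<s>' and '</s>' pad the
    tag context at the sequence boundaries."""
    counts = defaultdict(Counter)
    for tokens, tags in sequences:
        padded = ["<s>"] + list(tags) + ["</s>"]
        for i, token in enumerate(tokens):
            # Context: current token plus the tags on both sides.
            context = (token, padded[i], padded[i + 2])
            counts[context][padded[i + 1]] += 1

    def prob(label, token, prev_tag, next_tag):
        c = counts.get((token, prev_tag, next_tag), Counter())
        total = sum(c.values()) + alpha * len(labels)
        return (c[label] + alpha) / total

    return prob

# Toy training data, made up for illustration.
data = [
    (["will", "Cohen", "post"], ["O", "B", "O"]),
    (["dr", "Cohen", "will"], ["B", "I", "O"]),
]
prob = train_local_conditionals(data, labels=["B", "I", "O"])
print(prob("B", "Cohen", "O", "O"))
```

Note that training only ever looks at a node and its neighbors in fully labeled data, so no inference is run during learning; the neighbor on the right is used just as freely as the one on the left.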

Toutanova, Klein, Manning, Singer
- Dependency nets for POS tagging vs. CMMs.
- Maxent is used for the local conditional model.
- Goals:
  - An easy-to-train bidirectional model
  - A really good POS tagger

Toutanova et al.
Don’t use Gibbs sampling for inference: instead use a Viterbi variant (which is not guaranteed to produce the ML sequence).
Example: D = {11, 11, 11, 12, 21, 33}, ML state: {11}
P(a=1|b=1)P(b=1|a=1) < 1
P(a=3|b=3)P(b=3|a=3) = 1
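The small data set on this slide can be worked through directly. The sketch below assumes each two-digit entry is a pair (a, b) with a as the first digit, and scores a state by the product of the two local conditionals estimated from D; the helper names are hypothetical.

```python
from fractions import Fraction

# Each observation is a pair (a, b); "12" on the slide is read as a=1, b=2.
D = [(1, 1), (1, 1), (1, 1), (1, 2), (2, 1), (3, 3)]

def p_a_given_b(a, b):
    matching = [obs for obs in D if obs[1] == b]
    return Fraction(sum(obs[0] == a for obs in matching), len(matching))

def p_b_given_a(b, a):
    matching = [obs for obs in D if obs[0] == a]
    return Fraction(sum(obs[1] == b for obs in matching), len(matching))

def local_score(a, b):
    """Product of the two local conditionals for the state (a, b)."""
    return p_a_given_b(a, b) * p_b_given_a(b, a)

print(local_score(1, 1))  # 9/16
print(local_score(3, 3))  # 1
# Empirical joint frequencies of the same two states: 1/2 and 1/6.
print(Fraction(D.count((1, 1)), len(D)), Fraction(D.count((3, 3)), len(D)))
```

So the state 33 gets the highest product of local conditionals even though 11 is three times as frequent in D, which is why maximizing this score is not guaranteed to return the ML state.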

Results with model
[Results tables not preserved in the transcript.]
“Best” model includes some special unknown-word features, including “a crude company-name detector”.

Final test-set results (column labels not preserved in the transcript):
MXPost (Ratnaparkhi): 47.6, 96.4, 86.2
CRF+ (Lafferty et al., ICML 2001): 95.7, 76.4

Other comments
Smoothing (quadratic regularization, aka Gaussian prior) is important: it avoids overfitting effects reported elsewhere.
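For readers unfamiliar with the terminology, quadratic regularization of a maxent model is commonly written as the penalized objective below. The notation is introduced here and is not from the slides: w is the weight vector, c^(n) the local context of training instance n, and σ² the variance of the Gaussian prior.

```latex
\hat{w} \;=\; \arg\max_{w} \; \sum_{n} \log P_w\!\left( y^{(n)} \mid c^{(n)} \right) \;-\; \frac{\lVert w \rVert^2}{2\sigma^2}
```

The penalty term is the log of a zero-mean Gaussian prior on w, up to a constant, so a smaller σ² means stronger smoothing.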