
CS 388: Natural Language Processing: Discriminative Training and Conditional Random Fields (CRFs) for Sequence Labeling
Raymond J. Mooney, University of Texas at Austin

Joint Distribution. The joint probability distribution for a set of random variables X_1, …, X_n gives the probability of every combination of values (an n-dimensional array with v^n entries if all variables are discrete with v values; all v^n entries must sum to 1): P(X_1, …, X_n). The marginal probability of any conjunction (assignment of values to some subset of the variables) can be calculated by summing the appropriate subset of values from the joint distribution. Therefore, all conditional probabilities can also be calculated. (Example: joint distribution tables over color {red, blue} and shape {circle, square} for the positive and negative classes.)
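As a concrete illustration (not from the slides), here is a minimal sketch of marginalization and conditioning on a small joint distribution; the variables and probabilities are made up for the example.

```python
import numpy as np

# Hypothetical joint distribution P(Color, Shape, Class) as a 2x2x2 array
# (axes: color in {red, blue}, shape in {circle, square}, class in {pos, neg}).
# The numbers are illustrative and sum to 1.
joint = np.array([[[0.20, 0.05],   # red,  circle, (pos, neg)
                   [0.02, 0.20]],  # red,  square, (pos, neg)
                  [[0.02, 0.20],   # blue, circle, (pos, neg)
                   [0.01, 0.30]]]) # blue, square, (pos, neg)

# Marginal P(Class): sum out color (axis 0) and shape (axis 1).
p_class = joint.sum(axis=(0, 1))

# Conditional P(Class | Color=red, Shape=circle): normalize the relevant slice.
slice_rc = joint[0, 0, :]
p_class_given_rc = slice_rc / slice_rc.sum()

print(p_class)            # marginal over the class variable
print(p_class_given_rc)   # conditional given a red circle
```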

Probabilistic Classification. Let Y be the random variable for the class, which takes values {y_1, y_2, …, y_m}. Let X be the random variable describing an instance consisting of a vector of values for n features; let x_k be a possible vector value for X and x_ij a possible value for X_i. For classification, we need to compute P(Y=y_i | X=x_k) for i = 1…m. This could be done using the joint distribution, but that requires estimating an exponential number of parameters.

Bayesian Categorization. Determine the category of x_k by computing, for each y_i, P(Y=y_i | X=x_k) = P(Y=y_i) P(X=x_k | Y=y_i) / P(X=x_k). P(X=x_k) can be determined since the categories are complete and disjoint: P(X=x_k) = Σ_i P(Y=y_i) P(X=x_k | Y=y_i).

Bayesian Categorization (cont.) Need to know: –Priors: P(Y=y_i) –Conditionals: P(X=x_k | Y=y_i) The priors P(Y=y_i) are easily estimated from data: –If n_i of the examples in D are in y_i, then P(Y=y_i) = n_i / |D| There are too many possible instances (e.g. 2^n for binary features) to estimate all of the conditionals P(X=x_k | Y=y_i) directly. We still need to make some sort of independence assumption about the features to make learning tractable.

Naïve Bayes Generative Model. (Figure: a category, positive or negative, is drawn first; then the instance's size {small, medium, large}, color {red, blue, green}, and shape {circle, square, triangle} are each drawn independently, conditioned on the category.)

Naïve Bayes Inference Problem. (Figure: the same generative model used in reverse: given an observed instance that is large, red, and circular, infer which category, positive or negative, most probably generated it.)

Naïve Bayesian Categorization. If we assume the features of an instance are independent given the category (conditionally independent), then P(X | Y) = Π_i P(X_i | Y). Therefore, we only need to know P(X_i | Y) for each possible pair of a feature value and a category. If Y and all of the X_i are binary, this requires specifying only 2n parameters: –P(X_i=true | Y=true) and P(X_i=true | Y=false) for each X_i –P(X_i=false | Y) = 1 – P(X_i=true | Y) This compares to specifying 2^n parameters without any independence assumptions.
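To make the parameter counting concrete, here is a minimal sketch (not from the slides) of naïve Bayes with binary features: it estimates the 2n conditionals plus the class prior from data and classifies by comparing unnormalized log posteriors. The tiny dataset and smoothing constant are illustrative.

```python
import numpy as np

def train_nb(X, y, alpha=1.0):
    """X: (m, n) binary feature matrix, y: (m,) binary labels.
    Returns prior P(Y=1) and conditionals P(X_i=1 | Y) with Laplace smoothing."""
    prior = y.mean()
    p_x_given_y1 = (X[y == 1].sum(axis=0) + alpha) / (np.sum(y == 1) + 2 * alpha)
    p_x_given_y0 = (X[y == 0].sum(axis=0) + alpha) / (np.sum(y == 0) + 2 * alpha)
    return prior, p_x_given_y1, p_x_given_y0

def predict_nb(x, prior, p1, p0):
    """Classify one binary feature vector by comparing log P(Y, X) for each class."""
    log_pos = np.log(prior) + np.sum(np.log(np.where(x == 1, p1, 1 - p1)))
    log_neg = np.log(1 - prior) + np.sum(np.log(np.where(x == 1, p0, 1 - p0)))
    return int(log_pos > log_neg)

# Tiny illustrative dataset: 4 examples, 3 binary features.
X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])
y = np.array([1, 1, 0, 0])
params = train_nb(X, y)
print(predict_nb(np.array([1, 0, 1]), *params))
```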

Generative vs. Discriminative Models. Generative models are not directly designed to maximize classification performance. They model the complete joint distribution P(X, Y); classification is then done using Bayesian inference given the generative model of the joint distribution. But a generative model can also be used to perform any other inference task, e.g. P(X_1 | X_2, …, X_n, Y) –"Jack of all trades, master of none." Discriminative models are specifically designed and trained to maximize classification performance. They model only the conditional distribution P(Y | X). By focusing on modeling the conditional distribution, they generally perform better at classification than generative models when given a reasonable amount of training data.

Logistic Regression. Assumes a parametric form for directly estimating P(Y | X). For binary concepts, this is the sigmoid form P(Y=1 | X) = 1 / (1 + exp(−(w_0 + Σ_i w_i X_i))), with P(Y=0 | X) = 1 − P(Y=1 | X). It is equivalent to a one-layer backpropagation neural net. –Logistic regression is the source of the sigmoid function used in backpropagation. –The objective function for training is somewhat different.

Logistic Regression as a Log-Linear Model. Logistic regression is basically a linear model, which can be seen by taking logs, as in the derivation below. It is also called a maximum entropy model (MaxEnt) because it can be shown that standard training for logistic regression gives the maximum-entropy distribution that is consistent with the training data.
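The derivation referred to above, under the usual parameterization with weights w_0, …, w_n, is:

```latex
P(Y=1 \mid X) \;=\; \frac{1}{1 + \exp\!\big(-(w_0 + \sum_{i=1}^{n} w_i X_i)\big)}
\quad\Longrightarrow\quad
\ln \frac{P(Y=1 \mid X)}{P(Y=0 \mid X)} \;=\; w_0 + \sum_{i=1}^{n} w_i X_i ,
```

so the log odds of the positive class is a linear function of the features.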

Logistic Regression Training. The weights are set during training to maximize the conditional data likelihood, W ← argmax_W Π_{d∈D} P(Y^d | X^d, W), where D is the set of training examples and Y^d and X^d denote, respectively, the values of Y and X for example d. This is equivalently viewed as maximizing the conditional log likelihood (CLL), Σ_{d∈D} ln P(Y^d | X^d, W).

Logistic Regression Training. Like neural nets, we can use standard gradient descent to find the parameters (weights) that optimize the CLL objective function (see the sketch after this list). Many more advanced training methods can be used to speed convergence: –Conjugate gradient –Generalized Iterative Scaling (GIS) –Improved Iterative Scaling (IIS) –Limited-memory quasi-Newton (L-BFGS)
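As a rough sketch of the plain gradient-based option (the advanced methods listed above are not shown), the following trains binary logistic regression by batch gradient ascent on the CLL, i.e. gradient descent on the negative CLL. The dataset, step size, and iteration count are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg_cll(X, y, lr=0.1, iters=1000):
    """Batch gradient ascent on the conditional log likelihood.
    X: (m, n) real features, y: (m,) labels in {0, 1}. Returns (w, b)."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(iters):
        p = sigmoid(X @ w + b)        # predicted P(Y=1 | x) for each example
        grad_w = X.T @ (y - p)        # gradient of the CLL w.r.t. the weights
        grad_b = np.sum(y - p)        # gradient w.r.t. the bias
        w += lr * grad_w / m
        b += lr * grad_b / m
    return w, b

# Tiny illustrative dataset.
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [3.0, 0.5]])
y = np.array([0, 0, 1, 1])
w, b = train_logreg_cll(X, y)
print(sigmoid(X @ w + b))  # fitted conditional probabilities
```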

Preventing Overfitting in Logistic Regression. To prevent overfitting, one can use regularization (a.k.a. smoothing), penalizing large weights by changing the training objective as shown below, where λ is a constant that determines the amount of smoothing. This can be shown to be equivalent to MAP parameter estimation assuming a Gaussian prior for W with zero mean and a variance related to 1/λ.
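The penalized objective referred to above is, in its standard L2 form:

```latex
W \;\leftarrow\; \arg\max_{W} \; \sum_{d \in D} \ln P(Y^{d} \mid X^{d}, W) \;-\; \frac{\lambda}{2}\,\lVert W \rVert^{2} .
```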

Multinomial Logistic Regression (MaxEnt). Logistic regression can be generalized to multi-class problems (where Y has a multinomial distribution). Create a feature function for each combination of a class value y′ and each feature X_j, and another for the "bias weight" of each class: –f_{y′,j}(Y, X) = X_j if Y = y′ and 0 otherwise –f_{y′}(Y, X) = 1 if Y = y′ and 0 otherwise The final conditional distribution is P(Y=y | X) = (1/Z(X)) exp(Σ_k λ_k f_k(Y=y, X)), where the λ_k are weights and Z(X) is the normalizing constant.
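A minimal sketch (hypothetical feature functions and weights, not from the slides) of how this feature-function parameterization yields a conditional distribution over classes:

```python
import numpy as np

def maxent_conditional(x, classes, feature_fns, lambdas):
    """P(Y = y | X = x) proportional to exp(sum_k lambda_k * f_k(y, x))."""
    scores = np.array([sum(lam * f(y, x) for lam, f in zip(lambdas, feature_fns))
                       for y in classes])
    exps = np.exp(scores - scores.max())   # subtract max for numerical stability
    return exps / exps.sum()               # divide by Z(x)

# Two classes and three features; one "class-feature" function per
# (class, feature) pair plus one bias function per class, as on the slide.
classes = ["pos", "neg"]
n_features = 3
feature_fns = []
for c in classes:
    for j in range(n_features):
        feature_fns.append(lambda y, x, c=c, j=j: x[j] if y == c else 0.0)
    feature_fns.append(lambda y, x, c=c: 1.0 if y == c else 0.0)

lambdas = np.array([1.2, -0.5, 0.3, 0.1,    # weights for "pos" (illustrative)
                    -1.0, 0.4, 0.0, -0.1])  # weights for "neg" (illustrative)
print(maxent_conditional(np.array([1, 0, 1]), classes, feature_fns, lambdas))
```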

Graphical Models. If no assumption of independence is made, then an exponential number of parameters must be estimated for sound probabilistic inference. –No realistic amount of training data is sufficient to estimate so many parameters. If a blanket assumption of conditional independence is made, efficient training and inference are possible, but such a strong assumption is rarely warranted. Graphical models use directed or undirected graphs over a set of random variables to explicitly specify variable dependencies and allow for less restrictive independence assumptions while limiting the number of parameters that must be estimated. –Bayesian Networks: Directed acyclic graphs that indicate causal structure. –Markov Networks: Undirected graphs that capture general dependencies.

Bayesian Networks. A directed acyclic graph (DAG): –Nodes are random variables –Edges indicate causal influences (Example network: Burglary and Earthquake each influence Alarm, which in turn influences JohnCalls and MaryCalls.)

Conditional Probability Tables. Each node has a conditional probability table (CPT) that gives the probability of each of its values given every possible combination of values for its parents (the conditioning case). –Roots (sources) of the DAG that have no parents are given prior probabilities. For the burglary network: P(B) = .001; P(E) = .002; P(A | B,E): TT .95, TF .94, FT .29, FF .001; P(J | A): T .90, F .05; P(M | A): T .70, F .01.

Joint Distributions for Bayes Nets. A Bayesian network implicitly defines a joint distribution: P(X_1, …, X_n) = Π_i P(X_i | Parents(X_i)). Example: see the worked calculation below.
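Using the CPTs above and the factorization just stated, the probability that both John and Mary call, the alarm sounds, and there is no burglary or earthquake works out as:

```latex
P(J, M, A, \neg B, \neg E)
  = P(J \mid A)\, P(M \mid A)\, P(A \mid \neg B, \neg E)\, P(\neg B)\, P(\neg E)
  = 0.90 \times 0.70 \times 0.001 \times 0.999 \times 0.998
  \approx 0.00063 .
```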

Naïve Bayes as a Bayes Net. Naïve Bayes is a simple Bayes net in which the class Y is the single parent of every feature node X_1, X_2, …, X_n. The priors P(Y) and conditionals P(X_i | Y) for naïve Bayes provide the CPTs for the network.

Markov Networks Undirected graph over a set of random variables, where an edge represents a dependency. The Markov blanket of a node, X, in a Markov Net is the set of its neighbors in the graph (nodes that have an edge connecting to X). Every node in a Markov Net is conditionally independent of every other node given its Markov blanket.

Distribution for a Markov Network. The distribution of a Markov net is most compactly described in terms of a set of potential functions (a.k.a. factors or compatibility functions) φ_k, one for each clique k in the graph. For each joint assignment of values to the variables in clique k, φ_k assigns a non-negative real value that represents the compatibility of these values. The joint distribution of a Markov network is then defined by P(x) = (1/Z) Π_k φ_k(x_{k}), where x_{k} represents the joint assignment of the variables in clique k, and Z is a normalizing constant that makes the joint distribution sum to 1.

Sample Markov Network. The burglary example as a Markov net, with one potential per edge (values from the slide): φ1(B, A): TT 100, TF 1, FT 1, FF 200; φ2(E, A): TT 50, TF 10, FT 1, FF 200; φ3(J, A): TT 75, TF 10, FT 1, FF 200; φ4(M, A): TT 50, TF 1, FT 10, FF 200.
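A minimal sketch of how the joint distribution follows from these potentials: enumerate all assignments, multiply the clique potentials, and normalize by Z. Brute-force enumeration is only feasible because this network is tiny.

```python
import itertools

# Potential tables from the slide, keyed by the pair of boolean values.
phi1 = {(True, True): 100, (True, False): 1, (False, True): 1, (False, False): 200}   # (B, A)
phi2 = {(True, True): 50, (True, False): 10, (False, True): 1, (False, False): 200}   # (E, A)
phi3 = {(True, True): 75, (True, False): 10, (False, True): 1, (False, False): 200}   # (J, A)
phi4 = {(True, True): 50, (True, False): 1, (False, True): 10, (False, False): 200}   # (M, A)

def unnormalized(b, e, a, j, m):
    """Product of clique potentials for one joint assignment."""
    return phi1[(b, a)] * phi2[(e, a)] * phi3[(j, a)] * phi4[(m, a)]

# Normalizing constant Z: sum over all 2^5 assignments.
Z = sum(unnormalized(*vals) for vals in itertools.product([True, False], repeat=5))

# Probability of one particular assignment (B, not E, A, J, not M).
p = unnormalized(True, False, True, True, False) / Z
print(Z, p)
```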

Logistic Regression as a Markov Net. Logistic regression is a simple Markov net with the same structure as naïve Bayes (Y connected to each of X_1, X_2, …, X_n), but it models only the conditional distribution P(Y | X) and not the full joint P(X, Y). It is the same as a discriminatively trained naïve Bayes.

Generative vs. Discriminative Sequence Labeling Models. HMMs are generative models and are not directly designed to maximize the performance of sequence labeling. They model the joint distribution P(O, Q). HMMs are trained to have an accurate probabilistic model of the underlying language, and not all aspects of this model benefit the sequence labeling task. Conditional Random Fields (CRFs) are specifically designed and trained to maximize the performance of sequence labeling. They model the conditional distribution P(Q | O).

Classification. (Figure: naïve Bayes, a generative model with directed edges from Y to each X_i, versus logistic regression, its discriminative/conditional counterpart over the same variables Y, X_1, …, X_n.)

Sequence Labeling. (Figure: an HMM, a generative model over the label sequence Y_1, …, Y_T and observations X_1, …, X_T, versus a linear-chain CRF, its discriminative/conditional counterpart over the same variables.)

Simple Linear Chain CRF Features. Modeling the conditional distribution is similar to that used in multinomial logistic regression: create feature functions f_k(Y_t, Y_{t−1}, X_t). –A feature for each state-transition pair i, j: f_{i,j}(Y_t, Y_{t−1}, X_t) = 1 if Y_t = i and Y_{t−1} = j, and 0 otherwise –A feature for each state-observation pair i, o: f_{i,o}(Y_t, Y_{t−1}, X_t) = 1 if Y_t = i and X_t = o, and 0 otherwise Note: the number of features grows quadratically in the number of states (i.e. tags); see the sketch after this list.
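A minimal sketch (hypothetical tag and word inventories) of the transition and state-observation indicator features described above:

```python
def make_linear_chain_features(tags, vocab):
    """Build the transition and state-observation indicator feature functions
    f(y_t, y_prev, x_t) described on the slide; their number is
    |tags|^2 + |tags| * |vocab|, i.e. quadratic in the number of tags."""
    features = []
    for i in tags:                      # transition features f_{i,j}
        for j in tags:
            features.append(lambda y_t, y_prev, x_t, i=i, j=j:
                            1.0 if (y_t == i and y_prev == j) else 0.0)
    for i in tags:                      # state-observation features f_{i,o}
        for o in vocab:
            features.append(lambda y_t, y_prev, x_t, i=i, o=o:
                            1.0 if (y_t == i and x_t == o) else 0.0)
    return features

feats = make_linear_chain_features(tags=["DET", "NOUN", "VERB"],
                                   vocab=["the", "dog", "barks"])
print(len(feats))                                   # 3*3 + 3*3 = 18 features
print(sum(f("NOUN", "DET", "dog") for f in feats))  # two features fire
```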

Conditional Distribution for Linear Chain CRF. Using these feature functions for a simple linear-chain CRF, we can define the conditional distribution shown below.
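The distribution referred to above is the standard linear-chain CRF conditional, built from the feature functions f_k and weights λ_k:

```latex
P(Y \mid X) \;=\; \frac{1}{Z(X)} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(Y_t, Y_{t-1}, X_t) \Big),
\qquad
Z(X) \;=\; \sum_{Y'} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(Y'_t, Y'_{t-1}, X_t) \Big).
```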

Adding Token Features to a CRF. We can add token features X_{i,j}: each token position is now described by a vector of features X_{1,1}, …, X_{1,m}; X_{2,1}, …, X_{2,m}; …; X_{T,1}, …, X_{T,m}. (Figure: each label Y_1, Y_2, …, Y_T is connected to its token's feature vector.) We can add additional feature functions for each token feature to model the conditional distribution.

Features in POS Tagging. For POS tagging, use lexicographic features of tokens, for example: –Is it capitalized? –Does it start with a numeral? –Does it end in a given suffix (e.g. "s", "ed", "ly")? A small sketch of such a feature extractor follows below.
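A minimal sketch (hypothetical suffix list) of extracting these token features as a binary vector:

```python
def token_features(word, suffixes=("s", "ed", "ly")):
    """Map a token to a binary feature vector: capitalization, leading digit,
    and one indicator per suffix of interest (the suffix list is illustrative)."""
    feats = [
        1 if word[:1].isupper() else 0,   # Capitalized?
        1 if word[:1].isdigit() else 0,   # Starts with a numeral?
    ]
    feats += [1 if word.lower().endswith(suf) else 0 for suf in suffixes]
    return feats

print(token_features("Walked"))   # [1, 0, 0, 1, 0]
print(token_features("1990s"))    # [0, 1, 1, 0, 0]
```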

Enhanced Linear Chain CRF (standard approach). The transition can also be conditioned on the current token features. (Figure: each transition Y_{t−1} to Y_t is connected to the token feature vector as well.) Add feature functions: f_{i,j,k}(Y_t, Y_{t−1}, X) = 1 if Y_t = i and Y_{t−1} = j and X_{t−1,k} = 1, and 0 otherwise.

Supervised Learning (Parameter Estimation). As in logistic regression, use the L-BFGS optimization procedure to set the λ weights to maximize the CLL of the supervised training data. See the paper for details.

Sequence Tagging (Inference). A variant of the Viterbi algorithm can be used to efficiently, in O(TN^2) time, determine the globally most probable label sequence for a given token sequence using a given log-linear model of the conditional probability P(Y | X). See the paper for details.
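The paper has the full details; as an informal sketch under a simplified scoring model (a transition score matrix plus per-position label scores in log space, which a trained linear-chain CRF reduces to for a fixed input), Viterbi decoding runs in O(TN^2):

```python
import numpy as np

def viterbi(emission, transition):
    """emission: (T, N) log-scores of each of N labels at each of T positions;
    transition: (N, N) log-scores for moving from label j to label i.
    Returns the globally highest-scoring label sequence in O(T N^2) time."""
    T, N = emission.shape
    delta = np.zeros((T, N))            # best score of a path ending in label i at t
    backptr = np.zeros((T, N), dtype=int)
    delta[0] = emission[0]
    for t in range(1, T):
        for i in range(N):
            scores = delta[t - 1] + transition[:, i]
            backptr[t, i] = np.argmax(scores)
            delta[t, i] = scores[backptr[t, i]] + emission[t, i]
    # Follow back-pointers from the best final label.
    best = [int(np.argmax(delta[T - 1]))]
    for t in range(T - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return list(reversed(best))

# Illustrative scores for a 3-token sentence and 2 labels.
emission = np.array([[2.0, 0.1], [0.2, 1.5], [1.0, 0.9]])
transition = np.array([[0.5, -1.0], [-0.5, 0.8]])
print(viterbi(emission, transition))
```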

Skip-Chain CRFs. Some long-distance dependencies (i.e. the same word appearing in different parts of the text) can be modeled by including long-distance edges in the Markov model. (Figure: in "Michael Dell said … Dell bought …", a skip edge connects the labels of the two occurrences of "Dell", e.g. positions 2 and 100.) The additional links make exact inference intractable, so we must resort to approximate inference to try to find the most probable labeling.

CRF Results. Experimental results verify that CRFs have superior accuracy on various sequence labeling tasks: –Part-of-speech tagging –Noun-phrase chunking –Named entity recognition –Semantic role labeling However, CRFs are much slower to train and do not scale as well to large amounts of training data. –Training for POS on the full Penn Treebank (~1M words) currently takes "over a week." Skip-chain CRFs improve results on IE (information extraction).

CRF Summary. CRFs are a discriminative approach to sequence labeling whereas HMMs are generative. Discriminative methods are usually more accurate since they are trained for a specific performance task. CRFs also easily allow adding additional token features without making additional independence assumptions. Training time is increased since a complex optimization procedure is needed to fit the supervised training data. CRFs are a state-of-the-art method for sequence labeling.