Conditional Random Fields


Conditional Random Fields

Sequence Labeling: The Problem
Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:
The/DT cat/NN sat/VBD on/IN the/DT mat/NN ./.

Sequence Labeling: The Problem
Given a sequence (in NLP, words), assign appropriate labels to each word. Another example, partial parsing (aka chunking):
The/B-NP cat/I-NP sat/B-VP on/B-PP the/B-NP mat/I-NP

Sequence Labeling: The Problem
Given a sequence (in NLP, words), assign appropriate labels to each word. Another example, relation extraction:
The/B-Arg cat/I-Arg sat/B-Rel on/I-Rel the/B-Arg mat/I-Arg

The CRF Equation
A CRF model consists of:
F = <f1, …, fk>, a vector of "feature functions"
θ = <θ1, …, θk>, a vector of weights, one for each feature function.
Let O = <o1, …, oT> be an observed sentence.
Let X = <x1, …, xT> be the latent variables (the labels).
This is the same as the Maximum Entropy equation!
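The equation itself appears only as an image in the slides; in the notation just defined, the standard form (a reconstruction, not copied from the slide) is:

P(X | O) = exp( θ1 f1(X, O) + … + θk fk(X, O) ) / Σ over all label sequences X' of exp( θ1 f1(X', O) + … + θk fk(X', O) )

which has exactly the shape of the Maximum Entropy classifier, with whole label sequences X in place of single class labels.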

CRF Equation, standard format
Note that the denominator depends on O, but not on X (it marginalizes over X). Typically, we write the denominator as a normalizer Z(O), as in the rendering below.
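A standard rendering of the equation (again an image in the original slide), consistent with the notation above:

P(X | O) = (1 / Z(O)) · exp( Σj θj fj(X, O) ),   where   Z(O) = Σ over all label sequences X' of exp( Σj θj fj(X', O) )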

Making Structured Predictions

Structured prediction vs. Text Classification
Recall the maximum entropy equation for text classification, and compare it with the CRF equation for sequence labeling above. What's the difference?

Structured prediction vs. Text Classification
Two (related) differences, both for the sake of efficiency:
1) Feature functions in CRFs are restricted to graph parts (described later).
2) We can't do brute force to compute the argmax. Instead, we do Viterbi.

Finding the Best Sequence
The best sequence is the argmax over X of P(X | O). Recall from the HMM discussion: if there are K possible states for each xi variable, and N total xi variables, then there are K^N possible settings for x. So brute force can't find the best sequence. Instead, we resort to a Viterbi-like dynamic program.
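For a sense of scale (an illustrative calculation, not from the slides): with the 36-tag Penn Treebank label set used later and a 20-word sentence, K^N = 36^20 ≈ 1.3 × 10^31 candidate label sequences, while a Viterbi-style dynamic program needs only about N·K² = 20 · 36² ≈ 26,000 basic steps.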

Viterbi Algorithm
[Trellis figure: states X1, …, Xt-1, Xt = hj over observations o1, …, oT]
The state sequence which maximizes the score of seeing the observations to time t-1, landing in state hj at time t, and seeing the observation at time t.

Viterbi Algorithm
[Trellis figure: states x1, …, xT over observations o1, …, oT]
Compute the most likely state sequence by working backwards.

Viterbi Algorithm
[Trellis figure: states X1, …, Xt-1, Xt = hj, Xt+1 over observations o1, …, oT]
??! Recursive Computation ??! (the recursion is filled in on the "2nd Try" slide below)

Feature functions and Graph parts
To make efficient computation (dynamic programs) possible, we restrict the feature functions to graph parts.
Graph part (or just part): a feature function that counts how often a particular configuration occurs for a clique in the CRF graph.
Clique: a set of completely connected nodes in a graph. That is, each node in the clique has an edge connecting it to every other node in the clique.

Clique Example
The cliques in a linear chain CRF are the set of individual nodes, and the set of pairs of consecutive nodes.
[Figure: a linear-chain CRF with label nodes X1–X6 over observation nodes o1–o6]

Clique Example
The cliques in a linear chain CRF are the set of individual nodes, and the set of pairs of consecutive nodes.
[Figure: the same linear-chain CRF with the individual-node cliques highlighted]

Clique Example
The cliques in a linear chain CRF are the set of individual nodes, and the set of pairs of consecutive nodes.
[Figure: the same linear-chain CRF with the pair-of-consecutive-nodes cliques highlighted]

Clique Example
For non-linear-chain CRFs (something we won't normally consider in this class), you can get larger cliques:
[Figure: the linear-chain CRF with an additional node X5' attached, creating larger cliques]

Graph part as Feature Function Example
Graph parts are feature functions f(x,o) that count how many cliques have a particular configuration. For example, f(x,o) = count of [xi = Noun].
[Figure: a linear-chain CRF labeled x1=D, x2=N, x3=V, x4=D, x5=A, x6=N over o1–o6]
Here, x2 and x6 are both Nouns, so f(x,o) = 2.

Graph part as Feature Function Example
For a pair-of-nodes example, f(x,o) = count of [xi = Noun, xi+1 = Verb].
[Figure: the same labeling x1=D, x2=N, x3=V, x4=D, x5=A, x6=N]
Here, x2 is a Noun and x3 is a Verb, so f(x,o) = 1.
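A minimal Python sketch of these two counting parts; the labeling and the sentence come from these example slides, while the function names are mine:

def f_noun(x, o):
    """Node part: count of positions where xi = Noun."""
    return sum(1 for xi in x if xi == "N")

def f_noun_verb(x, o):
    """Pair-of-consecutive-nodes part: count of positions where xi = Noun and xi+1 = Verb."""
    return sum(1 for a, b in zip(x, x[1:]) if a == "N" and b == "V")

x = ["D", "N", "V", "D", "A", "N"]                  # x1..x6 from the slides
o = ["The", "cat", "chased", "the", "tiny", "fly"]  # o1..o6 from the slides

print(f_noun(x, o))       # 2  (x2 and x6 are Nouns)
print(f_noun_verb(x, o))  # 1  (x2 = N, x3 = V)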

Features can depend on the whole observation
In a CRF, each feature function can depend on o, in addition to a clique in x. Normally, we draw a CRF like this:
[Figure: an HMM and a CRF drawn side by side, each with label nodes X1–X6 and observation nodes o1–o6]

Features can depend on the whole observation
In a CRF, each feature function can depend on o, in addition to a clique in x. But really, it's more like this:
[Figure: the same HMM and CRF, but with every label node Xi connected to the whole observation sequence o]
This would cause problems for a generative model, but in a conditional model, o is always a fixed constant. So we can still calculate relevant algorithms like Viterbi efficiently.

Graph part as Feature Function Example
An example part including x and o: f(x,o) = count of [xi = A or D, xi+1 = N, o2 = cat].
[Figure: The/x1=D cat/x2=N chased/x3=V the/x4=D tiny/x5=A fly/x6=N]
Here, x1 is a D and x2 is an N, plus x5 is an A and x6 is an N, plus o2 = cat, so f(x,o) = 2. Notice that the clique x5-x6 is allowed to depend on o2.

Graph part as Feature Function Example
A more usual example including x and o: f(x,o) = count of [xi = A or D, xi+1 = N, oi+1 = cat].
[Figure: The/x1=D cat/x2=N chased/x3=V the/x4=D tiny/x5=A fly/x6=N]
Here, x1 is a D and x2 is an N, plus o2 = cat, so f(x,o) = 1.

The CRF Equation, with Parts
A CRF model consists of:
P = <p1, …, pk>, a vector of parts
θ = <θ1, …, θk>, a vector of weights, one for each part.
Let O = <o1, …, oT> be an observed sentence.
Let X = <x1, …, xT> be the latent variables.
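As before, the equation itself is an image in the slides; with parts it keeps the same form as above (a reconstruction consistent with these definitions, not copied from the slide):

P(X | O) = (1 / Z(O)) · exp( θ1 p1(X, O) + … + θk pk(X, O) )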

Viterbi Algorithm – 2nd Try
[Trellis figure: states X1, …, Xt-1, Xt = hj, Xt+1 over observations o1, …, oT]
Recursive Computation (shown as an equation on the slide; a reconstruction follows).
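A standard linear-chain CRF recursion consistent with the parts defined above (the δ notation is mine, not from the slide):

δt(j) = max over states i of [ δt-1(i) + Σk θk fk(xt-1 = i, xt = j, o, t) ]

where the sum runs over the parts touching position t, and the best label sequence is read off by backtracking from max over j of δT(j). Each step costs O(K²) for K states, so the whole computation is linear in the sentence length, just as for HMM Viterbi.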

Supervised Parameter Estimation

Conditional Training
Given a set of observations o and the correct labels x for each, determine the best θ.
Because the CRF equation is just a special form of the maximum entropy equation, we can train it exactly the same way:
1) Determine the gradient
2) Step in the direction of the gradient
3) Repeat until convergence

Recall: Training a ME model Training is an optimization problem: find the value for λ that maximizes the conditional log-likelihood of the training data:
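The objective itself is an image in the slides; for the ME classifier it is the conditional log-likelihood of the training data (a reconstruction, not verbatim):

CLL(λ) = Σd log P(cd | d; λ)

where d ranges over training documents and cd is the correct class of d. For a CRF, each d becomes an observed sentence O and each cd its correct label sequence X.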

Recall: Training a ME model
Optimization is normally performed using some form of gradient ascent on the CLL (equivalently, gradient descent on –CLL):
0) Initialize λ0 to 0
1) Compute the gradient: ∇CLL
2) Take a step in the direction of the gradient: λi+1 = λi + α ∇CLL
3) Repeat until CLL doesn't improve: stop when |CLL(λi+1) – CLL(λi)| < ε
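A minimal sketch of this loop in Python; cll and grad_cll stand in for functions that compute the conditional log-likelihood and its gradient on the training data (assumed helpers, not part of the slides):

import numpy as np

def train(cll, grad_cll, data, num_features, alpha=0.1, eps=1e-6, max_iters=1000):
    """Gradient ascent on the conditional log-likelihood (sketch)."""
    lam = np.zeros(num_features)     # 0) initialize lambda_0 to 0
    prev = cll(lam, data)
    for _ in range(max_iters):
        g = grad_cll(lam, data)      # 1) compute the gradient
        lam = lam + alpha * g        # 2) step in the direction of the gradient
        cur = cll(lam, data)
        if abs(cur - prev) < eps:    # 3) stop when CLL no longer improves
            break
        prev = cur
    return lam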

Recall: Training a ME model Computing the gradient:

Recall: Training a ME model Computing the gradient: Involves a sum over all possible classes
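The gradient on these slides is shown only as an image; the standard maximum-entropy gradient, consistent with the surrounding discussion (a reconstruction, not verbatim), is:

∂CLL/∂λj = Σd [ fj(cd, d) − Σc' P(c' | d; λ) fj(c', d) ]

i.e., observed feature counts minus expected feature counts; the inner sum over c' is the "sum over all possible classes" mentioned above.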

Recall: Training a ME model: Expected feature counts In ME models, each document d is classified independently. The sum involves as many terms as there are classes c’. Very doable.

Training a CRF
The hard part for CRFs: the same expected-feature-count term in the gradient, except that the sum now runs over label sequences rather than a small set of classes.

Training a CRF: Expected feature counts For CRFs, the term involves an exponential sum. The solution again involves dynamic programming, very similar to the Forward algorithm for HMMs.
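Concretely (a standard reconstruction, not taken from the slide): for a pair-of-nodes part, the expected count can be written as

E[fj] = Σt Σs,s' P(xt = s, xt+1 = s' | o) · fj(xt = s, xt+1 = s', o, t)

and the pairwise marginals P(xt = s, xt+1 = s' | o) are computed from forward and backward scores, exactly as in the HMM Forward-Backward algorithm, so the exponential sum never has to be enumerated explicitly.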

CRFs vs. HMMs

Generative (Joint Probability) Models
HMMs are generative models: that is, they can compute the joint probability P(sentence, hidden-states).
From a generative model, one can compute:
- two conditional models, P(sentence | hidden-states) and P(hidden-states | sentence)
- marginal models, P(sentence) and P(hidden-states)
For sequence labeling, we want P(hidden-states | sentence).

Discriminative (Conditional) Models
Most often, people are interested in the conditional probability P(hidden-states | sentence); for example, this is the distribution needed for sequence labeling.
Discriminative (also called conditional) models directly represent this conditional distribution.
These models cannot tell you the joint distribution, marginals, or other conditionals, but they're quite good at this particular conditional distribution.

Discriminative vs. Generative
Marginal / language model P(sentence): HMM (generative) – Forward or Backward algorithm, linear in the length of the sentence. CRF (discriminative) – can't do it.
Find optimal label sequence: HMM and CRF – Viterbi, linear in the length of the sentence.
Supervised parameter estimation: HMM – Bayesian learning, easy and fast. CRF – convex optimization, can be slow-ish (multiple passes through the data).
Unsupervised parameter estimation: HMM – Baum-Welch (non-convex optimization), slow but doable. CRF – very difficult, and requires making extra assumptions.
Feature functions: HMM – parents and children in the graph (restrictive!). CRF – arbitrary functions of a latent state and any portion of the observed nodes.

CRFs vs. HMMs, a closer look
It's possible to convert an HMM into a CRF:
Set p_prior,state(x,o) = count[x1 = state] and θ_prior,state = log PHMM(x1 = state) = log πstate
Set p_trans,state1,state2(x,o) = count[xi = state1, xi+1 = state2] and θ_trans,state1,state2 = log PHMM(xi+1 = state2 | xi = state1) = log Astate1,state2
Set p_obs,state,word(x,o) = count[xi = state, oi = word] and θ_obs,state,word = log PHMM(oi = word | xi = state) = log Bstate,word
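A minimal Python sketch of this conversion, assuming the HMM is given as dictionaries pi, A, B of probabilities (the variable names are mine, not from the slides):

import math

def hmm_to_crf_weights(pi, A, B):
    """Convert HMM parameters to equivalent CRF weights by taking logs (sketch).

    pi[s]     : initial probability of state s
    A[s1][s2] : transition probability s1 -> s2
    B[s][w]   : emission probability of word w from state s
    """
    theta = {}
    for s, p in pi.items():                  # prior parts
        theta[("prior", s)] = math.log(p)
    for s1, row in A.items():                # transition parts
        for s2, p in row.items():
            theta[("trans", s1, s2)] = math.log(p)
    for s, row in B.items():                 # observation parts
        for w, p in row.items():
            theta[("obs", s, w)] = math.log(p)
    return theta                             # every weight is a log probability, so <= 0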

CRF vs. HMM, a closer look
If we convert an HMM to a CRF, all of the CRF parameters θ will be logs of probabilities; therefore, they will all be between –∞ and 0.
Notice: CRF parameters can be between –∞ and +∞.
So, how do HMMs and CRFs compare in terms of bias and variance (as sequence labelers)? HMMs have more bias; CRFs have more variance.

Comparing feature functions
The biggest advantage of CRFs over HMMs is that they can handle overlapping features.
For example, for POS tagging, using words as features (like oi = "the" or oi = "jogging") is quite useful. However, it's often also useful to use "orthographic" features, like "the word ends in -ing" or "the word starts with a capital letter." These features overlap: a word such as "jogging" triggers both its word-identity feature and the ends-in-"ing" feature.
Generative models have trouble handling overlapping features correctly. Discriminative models don't: they can simply use the features.

CRF Example: A CRF POS Tagger for English

Vocabulary
We need to determine the set of possible word types V.
Let V = {all types in 1 million tokens of Wall Street Journal text, which we'll use for training} ∪ {UNKNOWN} (for word types we haven't seen).

L = Label Set
Standard Penn Treebank tagset:
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending

L = Label Set (continued)
18. PRP Personal pronoun
19. PRP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB Wh-adverb

CRF Features
Prior (one per tag k): xi = k
Transition (one per tag pair k, k'): xi = k and xi+1 = k'
Word (one per k, w or per k, w, w'):
  xi = k and oi = w
  xi = k and oi-1 = w
  xi = k and oi+1 = w
  xi = k and oi = w and oi-1 = w'
  xi = k and oi = w and oi+1 = w'
Orthography: Suffix (one per suffix s in {"ing", "ed", "ogy", "s", "ly", "ion", "tion", "ity", …} and tag k): xi = k and oi ends with s
Orthography: Punctuation (one per tag k):
  xi = k and oi is capitalized
  xi = k and oi is hyphenated
  xi = k and oi contains a period
  xi = k and oi is ALL CAPS
  xi = k and oi contains a digit (0-9)
…
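A minimal Python sketch of a feature extractor for one position, following the templates above (the function name and feature-string format are mine, not from the slides):

def position_features(tag, prev_tag, words, i):
    """Binary CRF features for position i, following the templates above (sketch)."""
    w = words[i]
    feats = {
        f"prior:{tag}": 1,                               # Prior
        f"trans:{prev_tag}->{tag}": 1,                   # Transition
        f"word:{tag}:{w}": 1,                            # Word (current word)
    }
    if i > 0:
        feats[f"word-1:{tag}:{words[i - 1]}"] = 1        # previous word
    if i + 1 < len(words):
        feats[f"word+1:{tag}:{words[i + 1]}"] = 1        # next word
    for s in ("ing", "ed", "ogy", "s", "ly", "ion", "tion", "ity"):
        if w.endswith(s):
            feats[f"suffix:{tag}:{s}"] = 1               # Orthography: suffix
    if w[:1].isupper():
        feats[f"cap:{tag}"] = 1                          # capitalized
    if "-" in w:
        feats[f"hyphen:{tag}"] = 1                       # hyphenated
    if "." in w:
        feats[f"period:{tag}"] = 1                       # contains a period
    if w.isupper():
        feats[f"allcaps:{tag}"] = 1                      # ALL CAPS
    if any(c.isdigit() for c in w):
        feats[f"digit:{tag}"] = 1                        # contains a digit (0-9)
    return feats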