Learning Seminar, 2004 Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data J. Lafferty, A. McCallum, F. Pereira Presentation: Inna Weiner

Learning Seminar, 2004 Outline Labeling sequence data problem Classification with probabilistic models: Generative and Discriminative –Why HMMs and MEMMs are not good enough Conditional Random Field model Experimental Results

Learning Seminar, 2004 Labeling Sequence Data Problem X is a random variable over data sequences Y is a random variable over label sequences Y_i is assumed to range over a finite label alphabet A The problem: –Learn how to give labels from a closed set Y to a data sequence X Example: X = (x_1 = "Thinking", x_2 = "is", x_3 = "being"), Y = (y_1 = noun, y_2 = verb, y_3 = noun)

Learning Seminar, 2004 Labeling Sequence Data Problem The lab setup: a monkey performs a behavioral task while movement and neural activity are recorded Motor task: reach to target Goal: map neural activity to behavior In our notation: –X: neural data –Y: hand movements

Learning Seminar, 2004 Generative Probabilistic Models Learning problem: choose Θ to maximize the joint likelihood: L(Θ) = Σ_i log p_Θ(y_i, x_i) The goal: maximization of the joint likelihood of training examples Classification rule: y* = argmax_y p(y|x) = argmax_y p(y, x) / p(x) This requires enumerating all possible observation sequences

Learning Seminar, 2004 Markov Model A Markov process or model assumes that we can predict the future based just on the present (or on a limited horizon into the past): Let {X_1,…,X_T} be a sequence of random variables taking values in {1,…,N}; then the Markov properties are: Limited horizon: P(X_{t+1} | X_1,…,X_t) = P(X_{t+1} | X_t) Time invariant (stationary): P(X_{t+1} | X_t) = P(X_2 | X_1)

Learning Seminar, 2004 Describing a Markov Chain A Markov chain can be described by the transition matrix A and the initial probabilities q: A_ij = P(X_{t+1} = j | X_t = i), q_i = P(X_1 = i)
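As a concrete toy illustration (not from the slides), a minimal Python sketch of sampling a state sequence from such a chain and scoring it with A and q; the 2-state matrices are hypothetical:

```python
import numpy as np

# Hypothetical 2-state chain: A[i, j] = P(X_{t+1}=j | X_t=i), q[i] = P(X_1=i)
A = np.array([[0.9, 0.1],
              [0.4, 0.6]])
q = np.array([0.5, 0.5])

def sample_chain(A, q, T, rng=np.random.default_rng(0)):
    """Sample a length-T state sequence from the Markov chain (A, q)."""
    states = [rng.choice(len(q), p=q)]
    for _ in range(T - 1):
        states.append(rng.choice(A.shape[1], p=A[states[-1]]))
    return states

def chain_probability(states, A, q):
    """P(X_1,...,X_T) = q(X_1) * prod_t A(X_{t+1} | X_t)."""
    p = q[states[0]]
    for s_prev, s_next in zip(states, states[1:]):
        p *= A[s_prev, s_next]
    return p

x = sample_chain(A, q, T=10)
print(x, chain_probability(x, A, q))
```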

Learning Seminar, 2004 Hidden Markov Model In a Hidden Markov Model (HMM) we do not observe the sequence that the model passed through (X) but only some probabilistic function of it (Y). Thus, it is a Markov model with the addition of emission probabilities: B_ik = P(Y_t = k | X_t = i)
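Extending the sketch above with a hypothetical emission matrix B, sampling hidden states and observations from an HMM might look like the following (an illustrative sketch, not the presenters' code):

```python
import numpy as np

# Hypothetical emission matrix over 3 symbols: B[i, k] = P(Y_t = k | X_t = i)
B = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])

def sample_hmm(A, B, q, T, rng=np.random.default_rng(1)):
    """Sample hidden states x and observations y from the HMM (A, B, q)."""
    x, y = [], []
    state = rng.choice(len(q), p=q)
    for _ in range(T):
        x.append(state)
        y.append(rng.choice(B.shape[1], p=B[state]))   # emit a symbol from the current state
        state = rng.choice(A.shape[1], p=A[state])     # move to the next hidden state
    return x, y
```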

Learning Seminar, 2004 The Three Problems of HMM Likelihood: given a series of observations y and a model λ = {A, B, q}, compute the likelihood p(y|λ) Inference: given a series of observations y and a model λ, compute the most likely series of hidden states x Learning: given a series of observations, learn the best model λ

Learning Seminar, 2004 Likelihood in HMMs Given a model λ = {A, B, q}, we can compute the likelihood by P(y) = p(y|λ) = Σ_x p(x) p(y|x) = Σ_x q(x_1) Π_t A(x_{t+1}|x_t) Π_t B(y_t|x_t) But … the complexity of this computation is O(N^T) when each x_i can take N values, so it is impossible in practice
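To make the exponential cost concrete, here is a minimal brute-force sketch that enumerates all N^T state paths; it assumes small toy matrices A, B, q as in the earlier sketches and is only usable for tiny examples:

```python
import itertools
import numpy as np

def brute_force_likelihood(y, A, B, q):
    """P(y) = sum over all state paths x of q(x_1) * prod_t A(x_{t+1}|x_t) * prod_t B(y_t|x_t).

    Enumerates all N**T paths, which is exactly why this is infeasible in practice.
    """
    N, T = A.shape[0], len(y)
    total = 0.0
    for x in itertools.product(range(N), repeat=T):
        p = q[x[0]]
        for t in range(T):
            p *= B[x[t], y[t]]          # emission at time t
            if t + 1 < T:
                p *= A[x[t], x[t + 1]]  # transition to time t+1
        total += p
    return total
```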

Learning Seminar, 2004 Forward-Backward algorithm To compute the likelihood: –Need to enumerate over all paths in the lattice (all possible instantiations of X_1…X_T). But … some starting sub-path (blue) is common to many continuing paths (blue+red) The idea: using dynamic programming, calculate a path in terms of shorter sub-paths

Learning Seminar, 2004 Forward-Backward algorithm (cont’d) We build a matrix of the probability of being at state i at time t: α_t(i) = P(x_t = i, y_1 y_2 … y_t). Each column is a function of the previous column (forward procedure): α_1(i) = q_i B_{i,y_1}, α_{t+1}(j) = [Σ_i α_t(i) A_ij] B_{j,y_{t+1}}

Learning Seminar, 2004 Forward-Backward algorithm (cont’d) We can similarly define a backward procedure for filling the matrix β_t(i) = P(y_{t+1} … y_T | x_t = i): β_T(i) = 1, β_t(i) = Σ_j A_ij B_{j,y_{t+1}} β_{t+1}(j)

Learning Seminar, 2004 Forward-Backward algorithm (cont’d) And we can easily combine: P(y, x_t = i) = P(x_t = i, y_1 y_2 … y_t) · P(y_{t+1} … y_T | x_t = i) = α_t(i) β_t(i) And then we get: P(y) = Σ_i P(y, x_t = i) = Σ_i α_t(i) β_t(i) Summary: we presented a polynomial algorithm for computing likelihood in HMMs.
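To make the recursions concrete, here is a minimal forward-backward sketch in the notation above (A, B, q as in the earlier toy sketches); it is an illustrative implementation, not the presenters' code:

```python
import numpy as np

def forward_backward(y, A, B, q):
    """Compute alpha, beta and the likelihood P(y) for an HMM (A, B, q).

    alpha[t, i] = P(x_t = i, y_1..y_t)
    beta[t, i]  = P(y_{t+1}..y_T | x_t = i)
    """
    N, T = A.shape[0], len(y)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))

    # Forward pass
    alpha[0] = q * B[:, y[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, y[t]]

    # Backward pass
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, y[t + 1]] * beta[t + 1])

    # P(y) = sum_i alpha[t, i] * beta[t, i] for any t; with t = T-1 this is just sum_i alpha[T-1, i]
    likelihood = alpha[T - 1].sum()
    return alpha, beta, likelihood
```

On a small toy model the returned likelihood should agree with the brute-force enumeration sketched earlier, while costing only O(T·N^2) instead of O(N^T).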

Learning Seminar, 2004 HMM – why not? Advantages: –Estimation is very easy –Closed-form solution –The parameters can be estimated with relatively high confidence from small samples But: –The model represents all possible (x, y) sequences and defines a joint probability over all possible observation and label sequences, which is needless effort

Learning Seminar, 2004 Discriminative Probabilistic Models “Solve the problem you need to solve”: the traditional approach inappropriately uses a generative joint model in order to solve a conditional problem in which the observations are given. To classify we need p(y|x) – there is no need to implicitly approximate p(x). (Figure: generative vs. discriminative model structures.)

Learning Seminar, 2004 Discriminative Models - Estimation Choose Θ_y to maximize the conditional likelihood: L(Θ_y) = Σ_i log p_{Θ_y}(y_i | x_i) Estimation usually doesn't have a closed form Example – the MinMI discriminative approach (2nd week lecture)

Learning Seminar, 2004 Maximum Entropy Markov Model MEMM: –a conditional model that represents the probability of reaching a state given an observation and the previous state –These conditional probabilities are specified by exponential models based on arbitrary observation features
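As an illustrative sketch (not the paper's implementation), a locally normalized next-state distribution of the kind MEMMs use, with a hypothetical feature function and weight vector; note that normalization happens per state, which is what later gives rise to label bias:

```python
import numpy as np

def memm_transition_probs(w, feats, prev_label, x_i, labels):
    """P(y_i | y_{i-1} = prev_label, x_i): a softmax over candidate labels.

    `feats(prev_label, y, x_i)` returns a feature vector and `w` is a weight
    vector; both are hypothetical placeholders, not taken from the paper.
    """
    scores = np.array([w @ feats(prev_label, y, x_i) for y in labels])
    scores -= scores.max()                # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # normalized per state: the root of label bias
```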

Learning Seminar, 2004 The Label Bias Problem The probability mass that arrives at a state must be distributed among its possible successor states. Potential victims: discriminative models

Learning Seminar, 2004 The Label Bias Problem: Solutions Determinization of the finite-state machine –Not always possible –May lead to combinatorial explosion Start with a fully connected model and let the training procedure find a good structure –But prior structural knowledge has proven to be valuable in information extraction tasks

Learning Seminar, 2004 Random Field Model: Definition Let G = (V, E) be a finite graph, and let A be a finite alphabet. The configuration space Ω is the set of all labelings of the vertices in V by letters in A. If C ⊆ V and ω ∈ Ω is a configuration, then ω_C denotes the configuration restricted to C. A random field on G is a probability distribution on Ω.

Learning Seminar, 2004 Random Field Model: The Problem Assume that a finite number of features can define a class The features f_i(ω) are given and fixed The goal: estimate λ to maximize the likelihood of the training examples

Learning Seminar, 2004 Conditional Random Field: Definition X – random variable over data sequences Y – random variable over label sequences Y_i is assumed to range over a finite label alphabet A Discriminative approach: we construct a conditional model p(y|x) and do not explicitly model the marginal p(x)

Learning Seminar, 2004 CRF - Definition Let G = (V, E) be a finite graph, and let A be a finite alphabet Y is indexed by the vertices of G Then (X, Y) is a conditional random field if the random variables Y_v, conditioned on X, obey the Markov property with respect to the graph: p(Y_v | X, Y_w, w ≠ v) = p(Y_v | X, Y_w, w ~ v), where w ~ v means that w and v are neighbors in G

Learning Seminar, 2004 CRF on Simple Chain Graph We will handle the case where G is a simple chain: G = (V = {1,…,m}, E = {(i, i+1)}) (Figure: graphical structures of the HMM (generative), the MEMM (discriminative), and the CRF.)

Learning Seminar, 2004 Fundamental Theorem of Random Fields (Hammersley & Clifford) Assumption: –The structure of G is a tree, of which a simple chain is a special case
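The slide's formula is not reproduced in this transcript; as a sketch, the exponential form that the theorem yields for a tree- or chain-structured CRF in Lafferty et al., using the edge features f_k and vertex features g_k introduced on the next slides, is:

```latex
p_\Theta(\mathbf{y} \mid \mathbf{x}) \;\propto\;
\exp\!\left(
  \sum_{e \in E,\, k} \lambda_k \, f_k\big(e, \mathbf{y}|_e, \mathbf{x}\big)
  \;+\;
  \sum_{v \in V,\, k} \mu_k \, g_k\big(v, \mathbf{y}|_v, \mathbf{x}\big)
\right)
```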

Learning Seminar, 2004 CRF – the Learning Problem Assumption: the features f_k and g_k are given and fixed. –For example, a boolean feature g_k is TRUE if the word X_i is upper case and the label Y_i is “noun”. The learning problem: –We need to determine the parameters Θ = (λ_1, λ_2, …; μ_1, μ_2, …) from training data D = {(x^(i), y^(i))} with empirical distribution p̃(x, y).
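For concreteness, a minimal sketch of what such fixed boolean features might look like in code; the function names and the DET→NOUN edge feature are hypothetical examples, not taken from the paper:

```python
def g_uppercase_noun(x, y, i):
    """Vertex feature g_k: 1 if the word x[i] is capitalized and its label y[i] is 'NOUN'."""
    return 1.0 if x[i][0].isupper() and y[i] == "NOUN" else 0.0

def f_noun_follows_det(x, y, i):
    """Edge feature f_k: 1 if the label transition at position i is DET -> NOUN."""
    return 1.0 if i > 0 and y[i - 1] == "DET" and y[i] == "NOUN" else 0.0
```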

Learning Seminar, 2004 CRF – Estimation And we return to the log-likelihood maximization problem; this time we need to find the Θ that maximizes the conditional log-likelihood: O(Θ) = Σ_i log p_Θ(y^(i) | x^(i))

Learning Seminar, 2004 CRF – Estimation From now on we assume that the dependencies of Y, conditioned on X, form a chain. To simplify some expressions, we add special start and stop states Y_0 = start and Y_{n+1} = stop.

Learning Seminar, 2004 CRF – Estimation Suppose that p(Y|X) is a CRF. For each position i in the observation sequence x, we define the |Y|×|Y| matrix random variable M_i(x) = [M_i(y', y|x)] by M_i(y', y|x) = exp( Σ_k λ_k f_k(e_i, y', y, x) + Σ_k μ_k g_k(v_i, y, x) ), where e_i is the edge with labels (Y_{i-1}, Y_i) and v_i is the vertex with label Y_i

Learning Seminar, 2004 CRF – Estimation The normalization function Z(x) is Z(x) = [M_1(x) M_2(x) ⋯ M_{n+1}(x)]_{start,stop} The conditional probability of a label sequence y is then written as p(y|x) = Π_{i=1}^{n+1} M_i(y_{i-1}, y_i | x) / Z(x)
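As a rough illustration of how Z(x) and p(y|x) can be computed from the M_i matrices by matrix multiplication, here is a minimal Python sketch; the feature functions phi_k(y_prev, y, x, i), the weights, and the start/stop label names are hypothetical placeholders, not from the paper:

```python
import numpy as np
from functools import reduce

def build_M(i, x, labels, weights, features):
    """M_i(y', y | x): exponentiated weighted feature sums for every label pair (y', y)."""
    M = np.zeros((len(labels), len(labels)))
    for a, y_prev in enumerate(labels):
        for b, y in enumerate(labels):
            M[a, b] = np.exp(sum(w * phi(y_prev, y, x, i)
                                 for w, phi in zip(weights, features)))
    return M

def crf_prob(y_seq, x, label_set, weights, features):
    """p(y | x) = prod_i M_i(y_{i-1}, y_i | x) / Z(x), with Z(x) = [M_1 ... M_{n+1}]_{start,stop}."""
    labels = ["<start>"] + list(label_set) + ["<stop>"]   # augmented label alphabet
    idx = {y: j for j, y in enumerate(labels)}
    y_full = ["<start>"] + list(y_seq) + ["<stop>"]

    # One matrix per edge (positions 1 .. n+1 in the slides' indexing).
    Ms = [build_M(i, x, labels, weights, features) for i in range(1, len(y_full))]

    numerator = np.prod([M[idx[y_full[i]], idx[y_full[i + 1]]] for i, M in enumerate(Ms)])
    Z = reduce(np.matmul, Ms)[idx["<start>"], idx["<stop>"]]
    return numerator / Z
```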

Learning Seminar, 2004 Parameter Estimation for CRFs The parameter vector Θ that maximizes the log-likelihood is found using an iterative scaling algorithm. We define standard HMM-like forward and backward vectors α and β, which allow polynomial-time calculations. For example: α_i(x) = α_{i-1}(x) M_i(x) and β_i(x)^T = M_{i+1}(x) β_{i+1}(x)^T

Learning Seminar, 2004 Experimental Results – Set 1 Set 1: modeling label bias Data was generated from a simple HMM which encodes a noisy version of the finite-state network (“rib/rob”) Each state emits its designated symbol with probability 29/32 and any of the other symbols with probability 1/32 We train both an MEMM and a CRF The observation features are simply the identity of the observation symbols 2,000 training and 500 test samples were used Results: –CRF error: 4.6% –MEMM error: 42% Conclusion: –MEMM fails to discriminate between the two branches and we get the label bias problem

Learning Seminar, 2004 Experimental Results – Set 2 Set 2: modeling mixed-order sources Data was generated from a mixed-order HMM with state transition probabilities given by p(y_i | y_{i-1}, y_{i-2}) = α p_2(y_i | y_{i-1}, y_{i-2}) + (1 − α) p_1(y_i | y_{i-1}) Similarly, emission probabilities are given by p(x_i | y_i, x_{i-1}) = α p_2(x_i | y_i, x_{i-1}) + (1 − α) p_1(x_i | y_i) Thus, for α = 0 we have a standard first-order HMM. For each randomly generated model, a sample of 1,000 sequences of length 25 is generated for training and testing.
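A minimal sketch of how one might draw the next label from this mixed-order transition model; the probability tables p1 (N×N) and p2 (N×N×N) are hypothetical placeholders standing in for the randomly generated models described above:

```python
import numpy as np

def sample_next_state(y_prev1, y_prev2, alpha, p1, p2, rng=np.random.default_rng(2)):
    """Draw y_i from alpha * p2(y_i | y_{i-1}, y_{i-2}) + (1 - alpha) * p1(y_i | y_{i-1}).

    p1[j] is a distribution over y_i given y_{i-1} = j;
    p2[j, k] is a distribution over y_i given y_{i-1} = j, y_{i-2} = k.
    """
    mix = alpha * p2[y_prev1, y_prev2] + (1.0 - alpha) * p1[y_prev1]
    return rng.choice(len(mix), p=mix)  # a convex mixture of distributions is itself a distribution
```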

Learning Seminar, 2004 Experimental Results – Set 2

Learning Seminar, 2004 Experimental Results – Set 3 Set 3: Part-Of-Speech tagging experiments

Learning Seminar, 2004 Conclusions Conditional random fields offer a unique combination of properties: –discriminatively trained models for sequence segmentation and labeling –combination of arbitrary and overlapping observation features from both the past and the future –efficient training and decoding based on dynamic programming for a simple chain graph –parameter estimation guaranteed to find the global optimum CRFs' main current limitation is the slow convergence of the training algorithm relative to MEMMs, let alone to HMMs, for which training on fully observed data is very efficient.

Learning Seminar, 2004 Thank you