1 Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
Yingbo Max Wang, Christian Warloe, Yolanda Xiao, Wenlong Xiong

2 Overview
Joint probability with Markov Random Fields (MRF)
Conditional Random Fields (CRF), a special case of MRF
Inference for CRF
Parameter estimation for CRF
Experimental results

Wenlong: Conditional random fields are a statistical modeling method, usually used for structured prediction. Before we jump into the details of CRFs and how they were motivated, we'll talk about the problem of modeling a joint probability, how Markov random fields provide a concise solution, and how conditional random fields are a special case of Markov random fields that model the conditional probability instead. We'll then cover the CRF model itself: how to perform inference with it and how to estimate its parameters (train the model). Finally, we'll discuss the paper's experimental results and how the model performs empirically.

3 Modeling Joint Probability
How do we model the joint probability distribution for a group of random variables?
With no independence assumptions, the number of entries is exponential: representing P(x_1, ..., x_n) directly takes |outcomes per variable|^(number of variables) values.
With a complete independence assumption, P(x_1, ..., x_n) = P(x_1) ... P(x_n), so the number of entries is linear (|outcomes per variable| * number of variables), but in most cases this is an oversimplification, since the variables are actually correlated.
We need a middle ground that models dependence and independence between random variables efficiently. (A small numeric sketch of the gap follows below.)

Wenlong: First, we start with the basics. In statistical modeling we often have several random variables that may depend on each other, and we want a joint distribution that describes how they behave together. This is hard precisely because of the dependence between variables. If we assume none of the variables are independent, the joint table grows exponentially, which is intractable for a large number of variables. If we assume all of them are independent, the representation becomes linear, but that usually oversimplifies real-world relationships; if every word in a sentence is a random variable, they are certainly not all independent. So we need a middle ground where we can encode independence assumptions between random variables efficiently. This motivates Markov random fields.
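As a rough numeric illustration of the gap between these two extremes (the numbers below are arbitrary, chosen only for the sketch):

```python
# Size of a full joint table vs. a fully factorized model.
# Hypothetical setting: n binary random variables.
n = 30          # number of random variables (illustrative)
k = 2           # outcomes per variable

full_joint = k ** n            # one entry per joint configuration
fully_independent = k * n      # one small table per independent variable

print(f"full joint table:      {full_joint:,} entries")       # 1,073,741,824
print(f"independent marginals: {fully_independent} entries")  # 60
```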

4 Markov Random Fields (MRF)
Wenlong: An MRF is a graphical way to compactly represent independence assumptions.

5 Markov Random Fields (MRF)
MRF definition: an undirected graph G = (V, E), with a set of random variables X indexed by the nodes V; edges represent correlations between random variables. The graph G is an MRF if X satisfies the local Markov property.
Local Markov property: a variable is conditionally independent of all other variables given its neighbors in the graph G, where N(v) is the set of neighbors of v (the nodes connected to X_v by a single edge).

Wenlong: A Markov random field is an undirected graph G. Each random variable we want to model is a node (vertex) of the graph, and each edge represents a dependency or correlation between the two random variables it connects. For the graph to be a Markov random field, it must satisfy the local Markov property: each variable depends only on its immediate neighbors. In other words, a vertex is conditionally independent of every other vertex in the graph given its immediate neighbors, where the neighbors are the vertices connected to it by a single edge. So what does this mean, and why does it matter?

6 Markov Random Fields (MRF)
What is the significance of an MRF?
It is a compact, graphical representation of the dependencies between variables: each variable depends only on its immediate neighbors.
These conditional independencies imply that we can factorize the joint probability.
Factorization simplifies computation and reduces the amount of calculation needed (remember how the independence assumption simplified the calculation).
How do we factorize the joint probability? We factorize it into functions on cliques; the Hammersley-Clifford theorem proves this is valid.

Wenlong: Having a Markov random field gives us a graphical way to represent the random variables, which makes the relationships easy to understand. Also, since each variable depends only on its immediate neighbors, the variables are conditionally independent given those neighbors, and that lets us factorize the joint probability. Factorization matters because it reduces the amount of computation needed, much as the independence assumption reduced the time complexity earlier. So how do we actually factorize a joint probability defined by an MRF? We factorize it into functions on cliques, and the Hammersley-Clifford theorem proves this is valid.

7 Cliques
Clique definition: a clique is a complete subgraph of G; a complete subgraph is a subset of vertices such that every two distinct vertices in the subset are adjacent.
Example: the red groups are cliques of one node each. The orange group is also a clique because A, B, C are all adjacent to one another (but A is not adjacent to D), and C, D are adjacent (but D is not adjacent to A or B). See the enumeration sketch below.

Wenlong: Before we get to Hammersley-Clifford, what do I mean by cliques? A clique is a complete subgraph of G: a subgraph in which every vertex can reach every other vertex by a single edge, i.e. all the vertices in the subgraph are pairwise adjacent. In this example, the circled groups are cliques (not all of them, just a few to give you an idea).
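A quick sketch that enumerates the cliques of this example graph, assuming the edges are A-B, A-C, B-C, and C-D as in the figure:

```python
import itertools

# Enumerate the cliques of the example graph: A, B, C mutually adjacent,
# and D adjacent only to C.
edges = {frozenset(e) for e in [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]}
nodes = ["A", "B", "C", "D"]

def is_clique(subset):
    # Every pair of distinct vertices in the subset must be adjacent.
    return all(frozenset(p) in edges for p in itertools.combinations(subset, 2))

cliques = [set(s) for r in range(1, len(nodes) + 1)
           for s in itertools.combinations(nodes, r) if is_clique(s)]
print(cliques)
# [{'A'}, {'B'}, {'C'}, {'D'}, {'A','B'}, {'A','C'}, {'B','C'}, {'C','D'}, {'A','B','C'}]
```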

8 Hammersley-Clifford Theorem
Also called the fundamental theorem of random fields.
Definition: a Markov random field defines the joint probability
P(X) = \frac{1}{Z} \prod_{c \in C} F_c(\mathbf{x}_c), with Z = \sum_{x} \prod_{c \in C} F_c(\mathbf{x}_c),
where C is the set of all cliques, \mathbf{x}_c is the set of random variables in clique c, F_c is a strictly positive "potential function" on clique c, Z is the partition function (the normalizing constant that makes the probability sum to 1), and P(X) is the joint probability of the set of random variables.
The joint probability can therefore be factorized into a product of "clique potentials".

Wenlong: Back to factorization. We said we could factorize the joint probability into functions on cliques; the theorem that proves exactly that is Hammersley-Clifford, also called the fundamental theorem of random fields. We won't go through the proof, but the statement is that an MRF's joint probability can be written in the form above. The F_c are functions that each take the random variables of one clique as input, so the joint probability factorizes into at most one factor per clique. Z is the normalizing constant, called the partition function: the potentials F_c are unnormalized and need not sum to 1, so Z is what turns the product into a probability distribution. These F_c are generally called clique potentials.

9 Factorization Example
The cliques are: (A), (B), (C), (D), (AB), (AC), (BC), (CD), (ABC).
The maximal cliques (those not contained in another clique) are: (ABC), (CD).
Therefore, if we only consider maximal cliques: P(A, B, C, D) = \frac{1}{Z} F_{ABC}(A, B, C) \, F_{CD}(C, D). A brute-force sketch of this factorization follows below.

Wenlong: Here is an example on the previous graph. The full set of cliques is listed above, but the maximal cliques (cliques that are not a subset of another clique) are enough to get an efficient factorization while expressing all of our independence assumptions, so we can factorize the joint probability as shown.
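A brute-force sketch of this factorization on the toy graph, using arbitrary made-up potentials; it only checks that normalizing by Z yields a valid distribution:

```python
import itertools
import math

# P(X) = (1/Z) * prod_c F_c(x_c) on the example graph with maximal cliques
# (A, B, C) and (C, D). The potentials below are invented positive functions.

def F_abc(a, b, c):
    return math.exp(0.5 * a + 0.3 * b - 0.2 * a * c)   # arbitrary positive potential

def F_cd(c, d):
    return math.exp(0.7 * c * d)                       # arbitrary positive potential

# Partition function Z: sum of the unnormalized product over every configuration.
Z = sum(F_abc(a, b, c) * F_cd(c, d)
        for a, b, c, d in itertools.product([0, 1], repeat=4))

def joint(a, b, c, d):
    return F_abc(a, b, c) * F_cd(c, d) / Z

# Probabilities over all 16 binary configurations sum to 1.
total = sum(joint(*x) for x in itertools.product([0, 1], repeat=4))
print(round(total, 6))  # 1.0
```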

10 Clique Potentials
Clique potentials are usually written as an exponential function:
F_c(\mathbf{x}_c) = \exp\{ \sum_k w_k f_k(\mathbf{x}_c) \},
where the {f_k} are k local features on \mathbf{x}_c and the w_k are weights for each feature f_k. The exponential keeps the clique potential strictly positive while letting the user-defined local features take any real value. Parameterizing the clique potential this way allows the joint probability to be written as:
P(X) = \frac{1}{Z} \prod_{c \in C} \exp\{ \sum_k w_k f_k(\mathbf{x}_c) \} = \frac{1}{Z} \exp\{ \sum_{c \in C, k} w_k f_k(\mathbf{x}_c) \}.

Wenlong: We now know how to factorize a joint probability, but what exactly is the clique potential F? We don't know it directly, so we parameterize it: we define F as a function of some user-defined features and weights we can tune. With a good feature set and the right weights, this expression accurately estimates the clique potential. We generally write clique potentials as exponential functions, since they are required to be strictly positive while the user features can be any real value.
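A small sketch of this parameterization, with made-up features and weights:

```python
import math

# Sketch of the parameterized clique potential F_c(x_c) = exp(sum_k w_k * f_k(x_c)).
# The feature functions and weights here are invented for illustration.

def clique_potential(x_c, features, weights):
    """Unnormalized, strictly positive score for one clique assignment."""
    return math.exp(sum(w * f(x_c) for f, w in zip(features, weights)))

# Hypothetical binary features on a two-variable clique (x1, x2).
features = [
    lambda x: 1.0 if x == (1, 1) else 0.0,   # both variables "on"
    lambda x: 1.0 if x[0] != x[1] else 0.0,  # variables disagree
]
weights = [1.2, -0.4]                        # made-up weights

print(clique_potential((1, 1), features, weights))  # exp(1.2)  ~ 3.32
print(clique_potential((0, 1), features, weights))  # exp(-0.4) ~ 0.67
```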

11 Recap: MRF
We have a set of variables, some dependent and some independent.
An MRF lets us compactly model a joint distribution under some independence assumptions.
The Hammersley-Clifford theorem lets us factorize the joint probability into clique potentials.
Clique potentials can be parameterized using local features and weights.
TLDR: P(X) = \frac{1}{Z} \exp\{ \sum_{c \in C, k} w_k f_k(\mathbf{x}_c) \}

Wenlong: So that's MRFs. We wanted to model a joint distribution of variables while capturing the structure of the data, i.e. the independence assumptions we made. MRFs let us do that, and the Hammersley-Clifford theorem lets us factorize the joint probability, which reduces computation and makes the joint distribution easier to estimate, since we can now define features. The result is a model of a joint distribution, based on local features, that captures the dependence, independence, and structure of the data.

12 Part-of-Speech Tagging
How would we use an MRF? Consider the part-of-speech tagging problem: model two sequences of random variables (length N each).
X - input - the sequence of words / a sentence (observations)
Y - output - the sequence of labels / tags (hidden states)
X: [bob ] [made] [her ] [happy ] [the ] [other ] [day ]
Y: [noun] [verb] [noun] [adverb] [article] [adjective] [noun]

Max: Now that we have MRFs, let's look at an example of how they are used in NLP: POS tagging. The previous group already introduced the problem, so we're familiar with it. In this example we have a sequence of words and a sequence of tags, where each tag corresponds to a word. Each word and each tag is a random variable, and we're modeling these two sequences of random variables.

13 Discriminative vs Generative Models
But MRFs and HMMs are both generative models: they use a joint distribution P(X, Y).
We don't want to model P(X) explicitly if we only observe a subset of it; modeling P(X) requires making a lot of assumptions.
A discriminative model uses the conditional probability P(Y | X): it doesn't model P(X), it just conditions on it.
Conditional Random Fields are a special case of MRF that is discriminative.

Max: A generative model treats both observations and labels as random variables and models their joint probability, which brings in the independence assumptions on X introduced before, and it is impossible to observe all occurrences of X. A discriminative model gives us what we actually want, the probability of a label given an observation, and we no longer need P(X).

14 Conditional Random Fields (CRF)
We have a graph on a set of random variables {X, Y}, and we fix the observed variables {X}.
If the nodes for the random variables {Y} obey the Markov property, then {X, Y} is a CRF.

Yolanda: So what is a conditional random field? Let's look at the definition first. Say we have a graph on a set of random variables X and Y, and we fix the observed variables X. If the nodes of Y obey the Markov property, then the pair (X, Y) is a conditional random field. From before, we know that the Markov property in an MRF means that a node A is independent of all other nodes given its neighbors; in a CRF it's the same, except that the nodes are additionally conditioned on X. The Markov property matters because it lets us factorize over cliques, which simplifies the conditional probability.

15 Conditional Random Fields
For a CRF we define a conditional probability instead of a joint probability:
P(Y \mid X) = \frac{1}{Z(\mathbf{x})} \exp\{ \sum_{c \in C, k} w_k f_k(\mathbf{y}_c, \mathbf{x}_c) \},
where Z(x) is a normalization constant that depends on x. The conditional probability factorizes into functions on cliques, just as in an MRF.

Yolanda: The equation is similar to that of a Markov random field, but instead of a joint probability we compute a conditional probability. Here the f_k are the feature functions, the w_k are the weights, and Z(x) is the normalization.

16 Linear-Chain CRF
Same graph as a linear-chain MRF: the hidden states (labels) form a sequence and are conditioned on the observations (words).
We observe the sequence X (white nodes) and make no assumptions about the relationships among the Xs.
The cliques are the nodes and the edges; the CRF paper splits the features into edge features and vertex features.

Yolanda: In this paper we consider linear-chain CRFs. The structure is similar to the MRF we saw: a sequence of hidden states on top, conditioned on a sequence of observations on the bottom. Since we observe the sequence X, we make no independence assumptions among the Xs. The circles in the figure denote the cliques, which include both edges and vertices.

17 Defining the CRF Model Conditional Probability
y is a sequence of hidden states, each of which can take on one of a fixed set of label values; x is a sequence of observations, each of which can take on one of a fixed set of observation values.

Yolanda: That was the general intuition for conditional random fields; now let's go into the math. Here we show the equation for the conditional probability of y given x, which is very similar to the one we saw earlier. On the left side, y is a sequence of hidden states and x is a sequence of observations.

18 Defining the CRF Model Conditional Probability
The features are given and fixed.
The f_k are features on "hidden state edges" (e.g. Y_i is a noun and Y_j is a verb, given X).
The g_k are features on "hidden state vertices" (e.g. Y_i is a noun, given X).
lambda_k and mu_k are the parameters (weights) for each feature.
Z(x) is a normalization based on the observations x. (Hypothetical examples of both feature types are sketched below.)

Yolanda: On the right side we have f_k and g_k, the feature functions; lambda_k and mu_k, the weights that say how much each feature contributes to the total probability; and Z(x), the normalization. Instead of one general sum of weighted feature functions, the features are split between edges and vertices, denoted e and v in the equation. The f_k are features on edges between hidden states; for example, for two hidden states y_i and y_j, f_k could be the feature "y_i is a noun and y_j is a verb, given x". The g_k are features on hidden-state vertices; for example, for a hidden state y_i, g_k could be the feature "y_i is a noun, given x".
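For concreteness, two hypothetical feature functions of each type (the names and logic below are made up for illustration, not taken from the paper):

```python
# f_k fires on an edge between adjacent hidden states; g_k fires on a single
# hidden state. Both may look at the whole observation sequence x and position i.

def f_noun_then_verb(y_prev, y_curr, x, i):
    """Edge feature: previous label is a noun and current label is a verb."""
    return 1.0 if (y_prev, y_curr) == ("noun", "verb") else 0.0

def g_capitalized_noun(y_curr, x, i):
    """Vertex feature: current label is a noun and the current word is capitalized."""
    return 1.0 if y_curr == "noun" and x[i][:1].isupper() else 0.0

x = ["Bob", "made", "her", "happy"]
print(f_noun_then_verb("noun", "verb", x, 1))   # 1.0
print(g_capitalized_noun("noun", x, 0))         # 1.0
```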

19 Defining the CRF Model
Since the CRF is a linear chain, we can define "transition weights" from one hidden state in the sequence to the next: M_i(y', y | x), where hidden state (i - 1) takes the value y' and hidden state (i) takes the value y.

Yolanda: Now that we understand the formulation of conditional random fields, we define some notation that will help with the later steps of inference and training. Here we define M_i(y', y | x): given two hidden states Y_{i-1} and Y_i taking the values y' and y, M_i(y', y | x) is the unnormalized transition weight from the previous state Y_{i-1} to the next state Y_i for those two values.

20 Defining the CRF Model
Define a matrix M_i(x) that represents every possible transition from hidden state (i - 1) to hidden state (i). Let's look at an example first.

Yolanda: Moving to the next piece of notation: M_i(x) is the matrix of M_i(y', y | x) over all possible combinations of y' and y. The math may look confusing, so we'll walk through an example first.

21 Conditional Probability Example
We have a hidden state sequence plus start and end states, and we want to find the probability of a particular sequence of states given X.

Yolanda: Say we have a hidden state sequence Y_1, Y_2, Y_3, together with start and end states Y_S and Y_E. We want to find the probability of a sequence of states given a sequence of observations X.

22 Y_S, Y_1, Y_2, Y_3, Y_E: hidden states
A, B, Start, End: the values that the hidden states have taken.
Edges in the graph: all the edges between two adjacent hidden states Y_{i-1} and Y_i, i.e. the entries of the transition matrix M_i with rows y' \in {S, A, B, E} and columns y \in {S, A, B, E}.

Yolanda: Here each capital Y denotes a hidden state, and the small nodes labeled S, A, B, E are the values those states take. For example, if Y is a word tag, A and B could denote verb and noun respectively. The edges here are the M_i(y', y | x): the transition weight from value y' to value y.

23 Defining the CRF Model
Define a matrix M_i that represents every transition from hidden state (i - 1) to hidden state (i). We can use this matrix to define Z(x) and P(Y | X).
Z(x) represents the unnormalized "probability" of all possible sequences of hidden states (labels), given an observed sequence X.
P(Y | X) represents the normalized probability of a single hidden-state sequence Y, given an observed sequence X. (A small sketch of the matrix computation follows below.)

Yolanda: Now that we understand M_i in both its per-entry and matrix forms, we can use it to define Z(x) and P(y | x), which we have seen before. For Z(x), we first multiply all of the M_i matrices together; the product is a matrix over all possible start/end value pairs, i.e. all possible paths. In our example each hidden state can take 4 values, so there are 16 entries in total: we could start at S and end at E, or start at A or B and end at A or B, and so on. Z(x) is the entry that starts at S and ends at E. P(y | x) is then the normalized probability of one chosen path.
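As an illustration, a sketch of the matrix formulation under assumed toy features and weights; the label set, edge_score, and vertex_score below are all invented for the example:

```python
import numpy as np

# Linear-chain CRF matrix formulation: each position i gets a matrix M_i with
# entries M_i(y', y | x), and Z(x) is the (START, STOP) entry of their product.

labels = ["START", "A", "B", "STOP"]
L = len(labels)

def edge_score(y_prev, y, x, i):
    # Stand-in for sum_k lambda_k * f_k(y_prev, y, x, i); made-up value.
    return 0.5 if (y_prev, y) == ("A", "B") else 0.0

def vertex_score(y, x, i):
    # Stand-in for sum_k mu_k * g_k(y, x, i); made-up value.
    return 1.0 if i < len(x) and y == "A" and x[i] == "bob" else 0.0

def transition_matrix(x, i):
    M = np.zeros((L, L))
    for a, y_prev in enumerate(labels):
        for b, y in enumerate(labels):
            M[a, b] = np.exp(edge_score(y_prev, y, x, i) + vertex_score(y, x, i))
    return M

x = ["bob", "made", "her"]                                  # observations
Ms = [transition_matrix(x, i) for i in range(len(x) + 1)]   # one matrix per transition

# Z(x): multiply all the M_i and read off the (START, STOP) entry.
Z = np.linalg.multi_dot(Ms)[labels.index("START"), labels.index("STOP")]

def prob(y_seq):
    """Normalized probability of one label path (START and STOP added here)."""
    path = ["START"] + y_seq + ["STOP"]
    score = 1.0
    for i in range(1, len(path)):
        score *= Ms[i - 1][labels.index(path[i - 1]), labels.index(path[i])]
    return score / Z

print(prob(["A", "B", "A"]))   # probability of this particular label path
```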

24 Yolanda: Going back to the previous example: Z(X) represents the "unnormalized probability" of essentially all possible sequences of hidden states (labels) given the observed sequence X, while P(Y | X) represents the normalized probability of the single hidden-state sequence Y given that observed sequence X.

25 Recap: CRF
Conditional Random Fields follow from MRFs: they are discriminative rather than generative, and they keep all the advantages of MRFs (compactly modeling dependence assumptions).
Conditional Random Fields factor the conditional probability into features that act on cliques and a weight for each feature, where the cliques are the edges and nodes of the graph:
P(Y \mid X) = \frac{1}{Z(\mathbf{x})} \exp\{ \sum_{c \in C, k} w_k f_k(\mathbf{y}_c, \mathbf{x}_c) \}
Questions: How do we perform inference? How do we train (parameter estimation)?

Yolanda: A quick recap. Conditional random fields have several benefits: they are a discriminative model, which saves the effort of modeling P(x), and they can be factorized so that the probability is easier to compute. We now have a model of a conditional distribution, based on local features, that captures the dependence, independence, and structure of the data. With this understanding of CRFs, we can move on to inference and training.

26 Inference How do we perform inference if we know model parameters?
How do we find the most likely hidden-state sequence y? To predict the label sequence, we maximize the conditional probability, using the Viterbi algorithm.

Wenlong: Now that we've defined the CRF model, we can think about how to use it. The model has parameters, but if we assume they are known, we can use the CRF for inference. There are several inference tasks, but the most obvious one is the POS-tagging problem we introduced earlier, the same problem the previous group wanted to solve: given an input sequence X of observations, what is the most likely hidden-state sequence Y? To solve this, we select the Y that maximizes the conditional probability of Y given X. The algorithm is the same one the previous group used: the Viterbi algorithm.

27 Viterbi Algorithm
Given the model, find the most likely sequence of hidden states. Approach: recursion + dynamic programming (same as for HMMs).
HMM update: \delta_t(j) = \max_i \delta_{t-1}(i) \, P(y_t = j \mid y_{t-1} = i) \, P(x_t \mid y_t = j)
CRF update: \delta_t(j) = \max_i \delta_{t-1}(i) \, \phi(j, i, x_t)
S is the set of values y can take on; i and j are values in S; \delta_t(j) is the maximum (unnormalized) "probability" of the most likely path ending at y_t = j.

Wenlong: Since the previous group explained Viterbi already, we won't go into depth. The general idea is that we use recursion and dynamic programming to calculate the most likely path in a single pass. At each step we have \delta_{t-1}(i), the score of the most likely path ending in value i at Y_{t-1}; then for every possible value j at Y_t, we multiply by the weight of the transition from i to j and take the maximum over i. We start at the first state in the sequence and move forward; for each state and each value it can take, we find the most likely path ending in that value by multiplying the previous maximal path scores by the transition term. We won't go through the proof of why this works; the important point is that it runs in time linear in the sequence length, N * |S|^2. For the CRF, the Viterbi algorithm is essentially the same; the only difference is that the conditional probabilities are replaced by unnormalized transition weights. The transition weight represents how likely a transition is from the previous state with one value to the next state with another value; it is the clique potential on that edge. \delta_t(j) is the maximum unnormalized probability of a path ending in value j at time t, and \phi(j, i, x_t) is the factor (essentially the M_i of the previous slides): the unnormalized transition "probability" from value i to value j given observation x_t. The update picks the maximum-score path ending in j by taking the max over all previous paths times the transition into j. (A minimal implementation sketch follows below.)
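A minimal sketch of the CRF Viterbi recursion, assuming the unnormalized transition weights phi[t, i, j] have already been computed from the features and weights (random placeholders are used here):

```python
import numpy as np

# Viterbi for a linear-chain CRF, following the update
# delta_t(j) = max_i delta_{t-1}(i) * phi(j, i, x_t), where phi is the
# unnormalized transition weight (the clique potential on that edge).

def viterbi(phi):
    """phi: array of shape (T, S, S); phi[t, i, j] scores label i -> j at step t."""
    T, S, _ = phi.shape
    delta = np.ones(S)                          # uniform start scores
    backptr = np.zeros((T, S), dtype=int)
    for t in range(T):
        scores = delta[:, None] * phi[t]        # (S, S): previous i -> next j
        backptr[t] = scores.argmax(axis=0)      # best predecessor for each j
        delta = scores.max(axis=0)
    path = [int(delta.argmax())]                # best final label
    for t in range(T - 1, 0, -1):               # trace the back-pointers
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
phi = rng.uniform(0.1, 1.0, size=(5, 3, 3))     # 5 positions, 3 possible labels
print(viterbi(phi))                             # most likely label index sequence
```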

28 Calculating Marginal Probabilities
How do we calculate the most likely label for a specific state in the sequence (or the most likely transition for a pair of states)? We use the forward-backward algorithm to compute marginal probabilities: the probability of an edge or vertex is the normalized sum over all paths that pass through that edge or vertex, and the forward and backward vectors cache these sums.

Wenlong: Now we look at a different inference problem. Given the parameters, what if we instead want the probability that a certain transition appears in the predicted path, or the probability that a certain state takes a certain value? We can't use Viterbi, since Viterbi only computes the maximal path, so we use another algorithm: the forward-backward algorithm. The forward vector \alpha_i at the i-th node and the backward vector \beta_i at the i-th node represent the unnormalized probability that paths from the start (respectively the end) reach state i with each possible value; each is a vector over the values the state can take, whose entries tell you whether one value is more likely than the others.

29 Calculating Marginal Probabilities
To calculate the probability of an edge being in a path: P(Y_{i-1} = y', Y_i = y \mid x) = \frac{\alpha_{i-1}(y') \, M_i(y', y \mid x) \, \beta_i(y)}{Z(x)}
To calculate the probability of a vertex being in a path: P(Y_i = y \mid x) = \frac{\alpha_i(y) \, \beta_i(y)}{Z(x)}
(A forward-backward sketch of the edge marginal follows below.)

Wenlong: The idea behind both is that we fix the transition weight (clique potential) on the node or edge whose probability we want, and then consider all paths that start at the beginning, pass through that one edge or vertex, and continue to the end. We marginalize by summing over all possible paths up to a given point, and we cache the results for intermediate Y's in the forward and backward vectors. For an edge where state Y_{i-1} has value y' and state Y_i has value y, we take the forward vector up to Y_{i-1} over all paths ending in y', include the weight of that edge, and then take the backward vector from Y_i over all paths that start with value y and continue to the end.
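A sketch of the forward-backward computation for the edge marginal above, assuming the per-position transition matrices have already been built (random placeholders here); the two passes cache the path sums, and the marginals over one position sum to 1:

```python
import numpy as np

# Edge marginal: P(Y_{t-1}=y', Y_t=y | x) = alpha_{t-1}(y') * M_t(y', y) * beta_t(y) / Z(x).
# M is a stack of made-up unnormalized transition matrices, one per position.

rng = np.random.default_rng(1)
T, S = 4, 3
M = rng.uniform(0.1, 1.0, size=(T, S, S))    # M[t - 1] plays the role of M_t(y', y | x)

alpha = np.zeros((T + 1, S))
beta = np.zeros((T + 1, S))
alpha[0] = 1.0                               # forward base case (start)
beta[T] = 1.0                                # backward base case (end)
for t in range(1, T + 1):                    # forward pass caches path sums
    alpha[t] = alpha[t - 1] @ M[t - 1]
for t in range(T - 1, -1, -1):               # backward pass caches path sums
    beta[t] = M[t] @ beta[t + 1]

Z = alpha[T].sum()                           # equals beta[0].sum() in this setup

def edge_marginal(t, y_prev, y):
    """Probability that the path uses the edge (Y_{t-1} = y_prev, Y_t = y)."""
    return alpha[t - 1, y_prev] * M[t - 1, y_prev, y] * beta[t, y] / Z

# Sanity check: marginals over all edges at one position sum to 1.
print(round(sum(edge_marginal(2, a, b) for a in range(S) for b in range(S)), 6))
```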

30-32 Wenlong: In this worked example, \alpha_1 is the forward vector: with Y_S taking the value S at time 0, it sums over all possible paths from the start to Y_1. Similarly, \alpha_2 is the forward vector at Y_2, and the backward vector \beta sums over all possible paths from the end back to Y_2.



33 Parameter Estimation for CRF
We want to find the best values for the parameters \mu and \lambda.

Christian: The features f and g are already given; they are user-defined features, so only the weights need to be learned.

34 Objective Function How do we define which parameters are best?
We use the normalized log-likelihood function: O(\lambda, \mu) = \sum_{x, y} \tilde{p}(x, y) \log p(y \mid x), where \tilde{p} is the empirical distribution.

Christian: The tilde denotes the empirical probability: the joint distribution defined by counts over the training data. We normalize the objective so that it does not depend on the number of training examples. (A small sketch of this objective follows below.)
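A toy sketch of this objective, where model_log_prob is only a placeholder for the CRF's actual log P(y | x) and the training pairs are invented:

```python
import math
from collections import Counter

# O(w) = sum_{x,y} p~(x, y) * log p_w(y | x), where p~ is the empirical
# distribution (counts divided by the number of training pairs).

training_data = [("x1", "y1"), ("x1", "y2"), ("x2", "y1")]   # invented examples
empirical = Counter(training_data)
N = len(training_data)

def model_log_prob(x, y, w):
    # Placeholder: a real CRF would return the log of its normalized
    # clique-potential product here.
    return math.log(0.5)

def objective(w):
    return sum((count / N) * model_log_prob(x, y, w)
               for (x, y), count in empirical.items())

print(objective(w=None))   # log(0.5), since the placeholder model is constant
```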

35 Improved Iterative Scaling Algorithm
We want to change the parameters in a way that increases the log likelihood. Maximizing it directly results in a set of highly coupled equations, so we instead maximize a simpler lower bound.
Christian

36 Improved Iterative Scaling Algorithm
Take the derivative and set it to zero to find the parameter change that maximizes the increase in likelihood.

Christian: T(x, y) is the global feature count (the total number of features that fire on the pair (x, y)), assuming binary features.

37 Improved Iterative Scaling Algorithm
Take the derivative and set it to zero to find the parameter change that maximizes the increase in likelihood.
Christian

38 Algorithm S How do we sum over varying T(x,y)?
How do we sum over all y (exponential number of combinations)? Christian

39 Algorithm S Idea 1: Use a slack feature (i.e. upper bound) S instead of T(x,y) Christian
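A sketch of one simple way such a bound could be computed over a toy training set; the feature functions and data are invented, and the paper's S is any constant with S >= T(x, y):

```python
# "Slack" constant S for Algorithm S: a number that upper-bounds T(x, y), the
# total count of active binary features on a sequence pair.

def total_feature_count(x, y, edge_features, vertex_features):
    """T(x, y): number of active binary features over all edges and vertices."""
    count = 0
    for i in range(1, len(y)):                   # edge features on (y_{i-1}, y_i)
        count += sum(f(y[i - 1], y[i], x, i) for f in edge_features)
    for i in range(len(y)):                      # vertex features on y_i
        count += sum(g(y[i], x, i) for g in vertex_features)
    return count

edge_features = [lambda yp, y, x, i: 1 if (yp, y) == ("N", "V") else 0]
vertex_features = [lambda y, x, i: 1 if y == "N" else 0]

training = [
    (["bob", "ran"], ["N", "V"]),
    (["bob", "made", "her"], ["N", "V", "N"]),
]

S = max(total_feature_count(x, y, edge_features, vertex_features) for x, y in training)
print(S)   # 3, driven by the longer sequence; the slack feature is s(x, y) = S - T(x, y)
```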

40 Algorithm S Idea 2: Since each feature only depends on a single edge or vertex, sum over all possible edges/vertices instead of sequences (using marginal probabilities) Christian

41 Putting it Together Christian

42 Final Update Equations
Define the update equation for \mu_k similarly, using the marginal probability of a vertex instead of an edge.
Christian

43 Improving on Algorithm S
S is usually very large (proportional to the length of the longest training sequence), while the dataset has sequences of varying length. A large S makes the parameter updates very small, so convergence takes a long time. Can we use a better approximation of T(x, y)?
Christian

44 Algorithm T Instead of taking a global upper bound on T(x,y), take the upper bound given x (per-sequence S calculation): Christian

45 Algorithm T Group sums by the values of T(x) Christian

46 Algorithm T We can use Newton’s method to find the root of the resulting polynomials Christian
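A minimal Newton's-method sketch on a made-up polynomial of the kind that arises here; the coefficients and target below are illustrative, not taken from the paper:

```python
import math

# Algorithm T reduces each parameter update to finding the root of a polynomial
# in beta = exp(delta): solve sum_t a_t * beta^t = target for beta > 0.

def newton_root(coeffs, target, beta=1.0, iters=50, tol=1e-10):
    """Newton iteration for sum_t coeffs[t] * beta**t = target."""
    for _ in range(iters):
        f = sum(a * beta**t for t, a in enumerate(coeffs)) - target
        df = sum(t * a * beta**(t - 1) for t, a in enumerate(coeffs) if t > 0)
        step = f / df
        beta -= step
        if abs(step) < tol:
            break
    return beta

coeffs = [0.0, 0.4, 0.3, 0.2]     # hypothetical expected counts grouped by T(x) = t
target = 1.5                      # hypothetical empirical feature count
beta = newton_root(coeffs, target)
print(beta, "-> delta =", math.log(beta))
```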

47 Experimental Results
Experiments: modeling mixed-order sources; part-of-speech (POS) tagging.
Models tested:
Hidden Markov Model (HMM): generative.
Conditional Random Field (CRF): discriminative.
Maximum-Entropy Markov Model (MEMM): discriminative; conditions locally on the current hidden state only, without normalizing the probabilities globally, and therefore suffers from the label bias problem.
Max

48 Modeling Mixed-Order Sources
Data generation: synthetic data from a randomly chosen HMM that mixes first-order and second-order models.
State transition probability: p_\alpha(y_i \mid y_{i-1}, y_{i-2}) = \alpha \, p_2(y_i \mid y_{i-1}, y_{i-2}) + (1 - \alpha) \, p_1(y_i \mid y_{i-1})
Emission probability: p_\alpha(x_i \mid y_i, x_{i-1}) = \alpha \, p_2(x_i \mid y_i, x_{i-1}) + (1 - \alpha) \, p_1(x_i \mid y_i)
Training and testing data: 1000 sequences of length 25 each.
Training and testing: Algorithm S for the CRF, and the Viterbi algorithm to label a test set. The MEMMs and CRFs do not use overlapping features on the observations. (A sampling sketch of the mixed-order transition follows below.)

Max: In the first task, we generate sequences from an arbitrarily defined HMM. The data is called "mixed-order" because each hidden state may depend on either the previous one or the previous two states (first or second order). A higher \alpha means a higher probability that later states depend on the previous two states, so the data is more "second-order". Note that in conditional models the user can define custom feature functions that combine multiple features. An MEMM conditions its probabilities locally on the current state only, rather than normalizing globally over all possible paths.
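A sketch of sampling from the mixed transition distribution above, with random placeholder tables for p1 and p2 (the real experiment draws these from randomly chosen HMMs):

```python
import numpy as np

# Sample one state from p_alpha(y_i | y_{i-1}, y_{i-2})
#   = alpha * p2(y_i | y_{i-1}, y_{i-2}) + (1 - alpha) * p1(y_i | y_{i-1}).

rng = np.random.default_rng(42)
n_states, alpha = 5, 0.6

p1 = rng.random((n_states, n_states))
p1 /= p1.sum(axis=1, keepdims=True)                 # p1[y_prev, y]
p2 = rng.random((n_states, n_states, n_states))
p2 /= p2.sum(axis=2, keepdims=True)                 # p2[y_prev2, y_prev, y]

def sample_next(y_prev2, y_prev):
    mixed = alpha * p2[y_prev2, y_prev] + (1 - alpha) * p1[y_prev]
    return rng.choice(n_states, p=mixed)

# Generate a length-25 state sequence, as in the paper's synthetic setup.
seq = [0, 0]
for _ in range(23):
    seq.append(int(sample_next(seq[-2], seq[-1])))
print(seq)
```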

49 Modeling Mixed-Order Sources
Results:
Error rates increase for all models as the data becomes "more second-order".
CRF typically outperforms MEMM, except for a few cases with small error rates (\alpha < 0.01), possibly due to an insufficient number of CRF training iterations.
HMM almost always outperforms MEMM.
CRF typically outperforms HMM when the data is second-order (\alpha > 1/2).
(Figure: plots of model error rates, with the regions \alpha < 1/2, \alpha > 1/2, and \alpha < 0.01 marked.)
Max

50 Part-of-Speech (POS) Tagging
Dataset: Penn Treebank, with a part-of-speech tagset of 45 syntactic tags; 50% training data, 50% testing data.
Experiment #1: first-order HMM, MEMM, and CRF.
Results: CRF > HMM > MEMM, consistent with the label bias problem hurting the MEMM.
Max

51 Part-of-Speech (POS) Tagging
Experiment #2: introduce orthographic features.
Features: capitalized vs. non-capitalized first letter; hyphens; containing one of the suffixes -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies.
Results: CRF > MEMM.
Max

