Text Models Continued: HMMs and PCFGs
Recap
So far we have discussed two different models for text:
– Bag of Words (BOW), where we introduced TF-IDF. The location of words is not captured.
– Collocation (n-grams). The location of words is captured; relations between words are modeled within a "sliding window" of size n.
Problem: a small n is not realistic enough, and for a large n there are no reliable statistics.
Today
Two models that capture language syntax, with probabilities for words in different parts of speech.
The first is based on a Markov Chain (a probabilistic finite state machine), the second on a Probabilistic Context-Free Grammar.
Example Application: Part-of-Speech Tagging
Goal: for every word in a sentence, infer its part of speech (verb, noun, adjective, ...).
Perhaps we are (more) interested in documents in which W is often the sentence subject?
Part-of-speech tagging is useful
– for ranking
– for machine translation
– for word-sense disambiguation
– ...
Part-of-Speech Tagging
Tag this word. This word is a tag.
He dogs like a flea.
The can is in the fridge.
The sailor dogs me every day.
A Learning Problem
Training set: a tagged corpus.
– The most famous is the Brown Corpus, with about 1M words.
– The goal is to learn a model from the training set, and then tag untagged text.
– Performance is tested on a test set.
Different Learning Approaches
– Supervised: the training corpus is tagged by humans.
– Unsupervised: the training corpus isn't tagged.
– Partly supervised: the training corpus isn't tagged, but we have a dictionary giving the possible tags for each word.
Training set and test set
In a learning problem, the common practice is to split the annotated examples into a training set and a test set.
– The training set is used to "train" the model, e.g. learn its parameters.
– The test set is used to test how well the model is doing.
Simple Algorithm
Assign to each word its most popular tag in the training set (a bag-of-words approach).
Problem: this ignores context.
– "Dogs" and "tag" will always be tagged as nouns; "can" will always be tagged as a verb.
Still, it achieves around 80% correctness on real-life test sets. Can we do better?
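As a concrete illustration, here is a minimal Python sketch of this baseline. The toy corpus, the tag names, and the NOUN fallback for unknown words are made up for the example; the function names are not from the slides.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_corpus):
    """tagged_corpus: iterable of (word, tag) pairs, e.g. read from a tagged corpus."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word.lower()][tag] += 1
    # Keep only the most popular tag for each word.
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_sentence(model, sentence, default_tag="NOUN"):
    # Unknown words fall back to a default tag; context is ignored entirely.
    return [(w, model.get(w.lower(), default_tag)) for w in sentence]

# Toy usage (an illustrative hand-made "corpus", not real data):
corpus = [("the", "DET"), ("can", "MD"), ("can", "MD"), ("can", "NOUN"),
          ("is", "VERB"), ("in", "ADP"), ("fridge", "NOUN"), ("dogs", "NOUN")]
model = train_most_frequent_tag(corpus)
print(tag_sentence(model, ["the", "can", "dogs", "me"]))
# 'can' is always tagged MD and 'dogs' always NOUN, regardless of context.
```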
Hidden Markov Model (HMM)
Model: sentences are generated by a probabilistic process.
– In particular, a Markov Chain whose states correspond to parts of speech.
– Transitions are probabilistic.
– In each state a word is emitted; the output word is again chosen probabilistically, based on the state.
HMM
An HMM is:
– A set of N states
– A set of M symbols (words)
– An N×M matrix of emission probabilities Pout
– An N×N matrix of transition probabilities Ptrans
– A vector of size N of initial state probabilities Pstart
"Hidden" because we see only the outputs, not the sequence of states traversed.
Example
3 Fundamental Problems
1) Given an observation sequence, find the most likely hidden state sequence. This is tagging.
2) Compute the probability of a given observation sequence (= sentence).
3) Given a training set, find the model that would make the observations most likely.
Tagging
Find the most likely sequence of states that led to an observed output sequence.
– We know what the output (= sentence) is, and want to trace back its generation, i.e. the sequence of states taken in the hidden model.
– Each state is a tag.
How many possible sequences are there?
Viterbi Algorithm
Dynamic programming. V_{t,k} is the probability of the most probable state sequence
– generating the first t + 1 observations (X_0, ..., X_t)
– and terminating at state k.
V_{0,k} = Pstart(k) · Pout(k, X_0)
V_{t,k} = Pout(k, X_t) · max_{k'} { V_{t-1,k'} · Ptrans(k', k) }
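A minimal Python sketch of this recurrence, assuming the model is given as plain dictionaries Pstart[k], Ptrans[k'][k] and Pout[k][word] that mirror the notation above. These names, and the log-space trick to avoid numerical underflow, are my additions, not part of the slides.

```python
import math

def safe_log(p):
    return math.log(p) if p > 0 else float("-inf")

def viterbi(words, states, Pstart, Ptrans, Pout):
    """Return the most likely state (tag) sequence for the observed words."""
    # V[t][k] = log-probability of the best state sequence generating words[0..t] and ending in k
    V = [{k: safe_log(Pstart[k]) + safe_log(Pout[k].get(words[0], 0.0)) for k in states}]
    back = [{k: None for k in states}]          # back-pointers for path recovery
    for t in range(1, len(words)):
        V.append({}); back.append({})
        for k in states:
            prev = max(states, key=lambda k2: V[t - 1][k2] + safe_log(Ptrans[k2][k]))
            V[t][k] = (V[t - 1][prev] + safe_log(Ptrans[prev][k])
                       + safe_log(Pout[k].get(words[t], 0.0)))
            back[t][k] = prev
    # Trace the back-pointers from the best final state.
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```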
Finding the path
Note that we are interested in the most likely path itself, not only in its probability.
– So at each point we need to keep track of the argmax (back-pointers), and combine them to form a sequence.
What about top-k?
Complexity
O(T · |S|²), where T is the sequence (= sentence) length and |S| is the number of states (= the number of possible tags).
Forward Algorithm
α_t(k) is the probability of seeing the sequence X_0, ..., X_t and terminating at state k.
Computing the probabilities
α_0(k) = Pstart(k) · Pout(k, X_0)
α_t(k) = Pout(k, X_t) · Σ_{k'} α_{t-1}(k') · Ptrans(k', k)
P(X_0, ..., X_n) = Σ_k α_n(k)
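The same recurrence as a Python sketch, using the same hypothetical dictionary representation as the Viterbi sketch above. For long sentences one would work in log space or rescale α at each step to avoid underflow.

```python
def forward(words, states, Pstart, Ptrans, Pout):
    """alpha[t][k] = P(X_0, ..., X_t, state at time t is k). Returns (alpha, sentence probability)."""
    alpha = [{k: Pstart[k] * Pout[k].get(words[0], 0.0) for k in states}]
    for t in range(1, len(words)):
        alpha.append({k: Pout[k].get(words[t], 0.0)
                         * sum(alpha[t - 1][k2] * Ptrans[k2][k] for k2 in states)
                      for k in states})
    sentence_prob = sum(alpha[-1][k] for k in states)
    return alpha, sentence_prob
```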
Learning the HMM probabilities
Expectation-Maximization (EM) algorithm (Baum-Welch):
1. Choose initial probabilities.
2. Compute E_ij, the expected number of transitions from i to j while generating the training sequences, for each i, j (see next).
3. Set the probability of a transition from i to j to E_ij / (Σ_k E_ik).
4. Do the same for the emission probabilities.
5. Repeat 2-4 using the new model, until convergence.
Forward-backward
Forward probabilities: α_t(k) is the probability of seeing the sequence X_0, ..., X_t and terminating at state k.
Backward probabilities: β_t(k) is the probability of seeing the sequence X_{t+1}, ..., X_n given that the Markov process is at state k at time t.
Computing the probabilities
Forward algorithm:
α_0(k) = Pstart(k) · Pout(k, X_0)
α_t(k) = Pout(k, X_t) · Σ_{k'} α_{t-1}(k') · Ptrans(k', k)
P(X_0, ..., X_n) = Σ_k α_n(k)
Backward algorithm:
β_t(k) = P(X_{t+1}, ..., X_n | state at time t is k)
β_t(k) = Σ_{k'} Ptrans(k, k') · Pout(k', X_{t+1}) · β_{t+1}(k')
β_n(k) = 1 for all k
P(X_0, ..., X_n) = Σ_k β_0(k) · Pstart(k) · Pout(k, X_0)
Also: P(X_0, ..., X_n) = Σ_k α_t(k) · β_t(k), for any t
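A matching sketch of the backward pass, with the same assumed dictionary representation as the earlier sketches; the final comment states the consistency check from the slide.

```python
def backward(words, states, Ptrans, Pout):
    """beta[t][k] = P(X_{t+1}, ..., X_n | state at time t is k)."""
    n = len(words) - 1
    beta = [dict() for _ in range(len(words))]
    beta[n] = {k: 1.0 for k in states}                      # base case
    for t in range(n - 1, -1, -1):                          # fill the table backwards
        beta[t] = {k: sum(Ptrans[k][k2] * Pout[k2].get(words[t + 1], 0.0) * beta[t + 1][k2]
                          for k2 in states)
                   for k in states}
    return beta

# Consistency check: for any t, sum(alpha[t][k] * beta[t][k] for k in states)
# equals the sentence probability P(X_0, ..., X_n) returned by forward().
```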
Estimating Parameters
Expected number of transitions
E_ij = Σ_t α_t(i) · Ptrans(i, j) · Pout(j, X_{t+1}) · β_{t+1}(j) / P(X_0, ..., X_n)
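A sketch of computing these expected counts in Python, reusing the forward() and backward() sketches above; the dictionary-based model representation and the function name are mine, not the slides'.

```python
def expected_transitions(words, states, Pstart, Ptrans, Pout):
    """E[i][j]: expected number of i -> j transitions used while generating `words`."""
    alpha, sentence_prob = forward(words, states, Pstart, Ptrans, Pout)   # earlier sketches
    beta = backward(words, states, Ptrans, Pout)
    E = {i: {j: 0.0 for j in states} for i in states}
    for t in range(len(words) - 1):
        for i in states:
            for j in states:
                E[i][j] += (alpha[t][i] * Ptrans[i][j]
                            * Pout[j].get(words[t + 1], 0.0) * beta[t + 1][j]) / sentence_prob
    # Baum-Welch M-step: new Ptrans[i][j] = E[i][j] / sum(E[i][k] for k in states)
    return E
```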
Accuracy
Tested experimentally: reaches 96-98% for the Brown corpus (trained on one half and tested on the other half).
Compare with the 80% of the trivial algorithm.
The hard cases are few, but they are very hard.
NLTK
Natural Language ToolKit: open-source Python modules for NLP tasks, including stemming, POS tagging and much more.
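For example, tagging one of the earlier sentences with NLTK's built-in tagger. The resource names passed to nltk.download may vary slightly between NLTK versions, and the exact tags returned depend on the pretrained model.

```python
import nltk

# One-time downloads of the tokenizer and the pretrained tagger.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The sailor dogs me every day")
print(nltk.pos_tag(tokens))   # prints a list of (word, tag) pairs
```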
PCFG
A PCFG is a tuple (N, Σ, N_1, R, P):
– N is a set of non-terminals
– Σ is a set of terminals
– N_1 is the start symbol
– R is a set of rules
– P is the set of probabilities on rules
We assume the PCFG is in Chomsky Normal Form.
Parsing algorithms:
– Earley (top-down)
– CYK (bottom-up)
– ...
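NLTK can also represent PCFGs and parse with them. Below is a toy grammar in Chomsky Normal Form; the grammar, its probabilities, and the example sentence are made up for illustration. nltk.ViterbiParser implements a probabilistic bottom-up (CKY-style) parser.

```python
import nltk

# Rule probabilities for each left-hand side must sum to 1.
grammar = nltk.PCFG.fromstring("""
    S  -> NP VP     [1.0]
    NP -> DT NN     [0.7]
    NP -> 'dogs'    [0.3]
    VP -> VB NP     [1.0]
    DT -> 'the'     [1.0]
    NN -> 'sailor'  [0.5]
    NN -> 'fridge'  [0.5]
    VB -> 'dogs'    [1.0]
""")

parser = nltk.ViterbiParser(grammar)
for tree in parser.parse("the sailor dogs the fridge".split()):
    print(tree)    # the most probable parse, annotated with its probability
```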
PFSA vs. PCFG
A PFSA (probabilistic finite state automaton) can be seen as a special case of a PCFG:
– State ↔ non-terminal
– Output symbol ↔ terminal
– Arc ↔ context-free rule
– Path ↔ parse tree (only right-branching binary trees)
Example: a PFSA with states S1, S2, S3 that outputs a and then b corresponds to the rules S1 → a S2, S2 → b S3, S3 → ε.
PFSA and HMM
To turn an HMM into a PFSA:
– Add a "Start" state and a transition from "Start" to every state of the HMM.
– Add a "Finish" state and a transition from every state of the HMM to "Finish".
The connection between two algorithms
An HMM can (almost) be converted to a PFSA. A PFSA is a special case of a PCFG. Inside-outside is an algorithm for PCFGs, so the inside-outside algorithm will work for HMMs.
Forward-backward is an algorithm for HMMs. In fact, the inside-outside algorithm is the same as forward-backward when the PCFG is a PFSA.
Forward and backward probabilities
(Diagram: a state sequence X_1, ..., X_t, ..., X_n, X_{n+1} emitting o_1, ..., o_n; the forward probability covers the output prefix up to state X_t and the backward probability covers the remaining outputs.)
Backward/forward prob vs. inside/outside prob
(Diagram: for a PFSA viewed as a PCFG, the forward probability corresponds to the outside probability and the backward probability corresponds to the inside probability.)
Notation
A sentence w_1, ..., w_m; a non-terminal N^j dominating the span w_p, ..., w_q; N^1 is the start symbol; the words outside the span are w_1, ..., w_{p-1} and w_{q+1}, ..., w_m.
Inside and outside probabilities
Definitions
Inside probability: the total probability of generating the words w_p, ..., w_q from the non-terminal N^j.
Outside probability: the total probability of beginning with the start symbol N^1 and generating N^j (over the span p..q) and all the words outside w_p, ..., w_q.
When p > q,
Calculating inside probability (CYK algorithm)
Base case: inside_j(p, p) = P(N^j → w_p).
Recursive case: decompose the span with a rule N^j → N^r N^s, where N^r covers w_p, ..., w_d and N^s covers w_{d+1}, ..., w_q:
inside_j(p, q) = Σ_{r,s} Σ_{d=p}^{q-1} P(N^j → N^r N^s) · inside_r(p, d) · inside_s(d+1, q)
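A Python sketch of this bottom-up computation. The grammar is assumed to be in CNF and represented by two hypothetical dictionaries, binary_rules[(j, r, s)] = P(N^j → N^r N^s) and unary_rules[(j, word)] = P(N^j → word); spans are 0-based and inclusive.

```python
from collections import defaultdict

def inside_probs(words, binary_rules, unary_rules):
    """inside[(j, p, q)] = P(N^j derives words[p..q])."""
    n = len(words)
    inside = defaultdict(float)
    for p, w in enumerate(words):                           # base case: spans of length 1
        for (j, word), prob in unary_rules.items():
            if word == w:
                inside[(j, p, p)] = prob
    for span in range(2, n + 1):                            # longer spans, bottom-up (CYK order)
        for p in range(0, n - span + 1):
            q = p + span - 1
            for (j, r, s), prob in binary_rules.items():
                inside[(j, p, q)] += prob * sum(inside[(r, p, d)] * inside[(s, d + 1, q)]
                                                for d in range(p, q))
    return inside
```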
Calculating outside probability (case 1)
N^j is the left child: some rule N^f → N^j N^g is used, with N^j covering w_p, ..., w_q and N^g covering w_{q+1}, ..., w_e, inside the full sentence w_1, ..., w_m derived from N^1.
Calculating outside probability (case 2)
N^j is the right child: some rule N^f → N^g N^j is used, with N^g covering w_e, ..., w_{p-1} and N^j covering w_p, ..., w_q.
Outside probability
outside_j(p, q) = Σ_{f,g} Σ_{e>q} P(N^f → N^j N^g) · outside_f(p, e) · inside_g(q+1, e)
               + Σ_{f,g} Σ_{e<p} P(N^f → N^g N^j) · outside_f(e, q) · inside_g(e, p-1)
Base case: outside_1(1, m) = 1, and outside_j(1, m) = 0 for j ≠ 1.
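A matching sketch of the top-down pass, using the inside chart produced by the previous sketch and the same assumed rule dictionaries; start_symbol names N^1.

```python
from collections import defaultdict

def outside_probs(words, binary_rules, inside, start_symbol):
    """outside[(j, p, q)] = P(start symbol derives w_1..w_{p-1}, N^j over p..q, w_{q+1}..w_m)."""
    n = len(words)
    outside = defaultdict(float)
    outside[(start_symbol, 0, n - 1)] = 1.0                 # base case: N^1 spans the whole sentence
    for span in range(n, 0, -1):                            # widest spans first (top-down)
        for p in range(0, n - span + 1):
            q = p + span - 1
            for (f, left, right), prob in binary_rules.items():
                # Case 1: the span p..q is the left child of N^f -> N_left N_right
                for e in range(q + 1, n):
                    outside[(left, p, q)] += outside[(f, p, e)] * prob * inside[(right, q + 1, e)]
                # Case 2: the span p..q is the right child of N^f -> N_left N_right
                for e in range(0, p):
                    outside[(right, p, q)] += outside[(f, e, q)] * prob * inside[(left, e, p - 1)]
    return outside
```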
Probability of a sentence
P(w_1, ..., w_m) = inside_1(1, m)
Also, for any position k: P(w_1, ..., w_m) = Σ_j outside_j(k, k) · P(N^j → w_k)
Recap so far
– Inside probability: computed bottom-up.
– Outside probability: computed top-down, using the same chart.
– The probability of a sentence can be calculated in many ways.
Expected counts and update formulae
The probability of a binary rule is used (1)
The probability of N^j is used (2)
The probability of a unary rule is used (3)
Multiple training sentences (1) (2)
Inner loop of the Inside-outside algorithm
Given an input sequence and the current parameters:
1. Calculate the inside probabilities (base case and recursive case, as above).
2. Calculate the outside probabilities (base case and recursive case, as above).
Inside-outside algorithm (cont)
3. Collect the counts.
4. Normalize and update the parameters.
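A sketch of one such iteration for a single training sentence, reusing inside_probs() and outside_probs() from the earlier sketches; for multiple sentences the counts would be summed over all sentences before normalizing. The data-structure choices are mine, not the slides'.

```python
from collections import defaultdict

def inside_outside_step(words, binary_rules, unary_rules, start_symbol):
    """One EM iteration on one sentence: collect expected rule counts, then normalize."""
    n = len(words)
    inside = inside_probs(words, binary_rules, unary_rules)            # earlier sketch
    outside = outside_probs(words, binary_rules, inside, start_symbol) # earlier sketch
    sentence_prob = inside[(start_symbol, 0, n - 1)]     # assumes the grammar can derive the sentence
    count = defaultdict(float)
    # Expected number of times each binary rule N^j -> N^r N^s is used.
    for (j, r, s), prob in binary_rules.items():
        for p in range(n):
            for q in range(p + 1, n):
                for d in range(p, q):
                    count[(j, r, s)] += (outside[(j, p, q)] * prob
                                         * inside[(r, p, d)] * inside[(s, d + 1, q)]) / sentence_prob
    # Expected number of times each unary rule N^j -> w is used.
    for (j, w), prob in unary_rules.items():
        for p in range(n):
            if words[p] == w:
                count[(j, w)] += outside[(j, p, p)] * prob / sentence_prob
    # Normalize per left-hand side: new P(N^j -> gamma) = count(N^j -> gamma) / count(N^j -> anything).
    lhs_total = defaultdict(float)
    for rule, c in count.items():
        lhs_total[rule[0]] += c
    new_binary = {r: count[r] / lhs_total[r[0]] for r in binary_rules if lhs_total[r[0]] > 0}
    new_unary = {r: count[r] / lhs_total[r[0]] for r in unary_rules if lhs_total[r[0]] > 0}
    return new_binary, new_unary
```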
Relation to EM
PCFG is a PM (Product of Multinomials) Model
The inside-outside algorithm is a special case of the EM algorithm for PM models.
– X (observed data): each data point is a sentence w_1, ..., w_m.
– Y (hidden data): the parse tree Tr.
– Θ (parameters): the rule probabilities.
Relation to EM (cont)
Summary
(Diagrams: an HMM transition from X_t to X_{t+1} emitting O_t, and a PCFG rule N^j → N^r N^s with N^r covering w_p, ..., w_d and N^s covering w_{d+1}, ..., w_q.)
Summary (cont)
The topology is known:
– (states, arcs, output symbols) in an HMM
– (non-terminals, rules, terminals) in a PCFG
The probabilities of the arcs/rules are unknown; we estimate them using EM (introducing hidden data Y).
Converting HMM to PCFG
Given an HMM = (S, Σ, π, A, B), create a PCFG = (S1, Σ1, S0, R, P) as follows:
– S1 =
– Σ1 =
– S0 = Start
– R =
– P:
Path ↔ Parse tree
(Diagram: an HMM path Start, X_1, X_2, ..., X_T, X_{T+1} emitting o_1, o_2, ..., o_T corresponds to a right-branching parse tree with auxiliary non-terminals D_0, D_12, ..., D_{T,T+1} and the boundary symbols BOS and EOS.)
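The slides' exact construction (with the auxiliary D non-terminals, BOS and EOS) did not survive above, so the sketch below uses a simpler, textbook-style construction: each state i becomes a non-terminal X_i, and each rule X_i → o X_j carries the emission and transition probability. Pfinish is a hypothetical per-state stopping probability (the "Finish" transition from the earlier slide), assumed to satisfy Σ_j Ptrans[i][j] + Pfinish[i] = 1.

```python
def hmm_to_pcfg(states, Pstart, Ptrans, Pout, Pfinish):
    """Return a rule dictionary (lhs, rhs-tuple) -> probability for an equivalent PCFG.

    A path Start, X_1, ..., X_T emitting o_1, ..., o_T becomes a right-branching
    parse tree, as in the slide above (modulo the D/BOS/EOS bookkeeping).
    """
    rules = {}
    for i in states:
        rules[("Start", ("X_" + i,))] = Pstart[i]            # Start -> X_i
        for o, p_emit in Pout[i].items():
            for j in states:
                # X_i -> o X_j : emit o in state i, then move to state j
                rules[("X_" + i, (o, "X_" + j))] = p_emit * Ptrans[i][j]
            # X_i -> o : emit o and stop (the transition to the Finish state)
            rules[("X_" + i, (o,))] = p_emit * Pfinish[i]
    return rules
```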
Outside probability
(With the renaming (j, i), (p, t), and q = T or q = p: the outside probabilities of N^j and of D_ij in the converted grammar reduce to HMM forward probabilities.)
Inside probability
(With the same renaming: the inside probabilities of N^j and of D_ij reduce to HMM backward probabilities.)
Estimating the parameters
(Apply the renaming (j, i), (s, j), (p, t), (m, T) to the PCFG update formulae to obtain the corresponding HMM re-estimation formulae.)
Calculating
(Apply the renaming (j, i), (s, j), (w, o), (m, T).)
Renaming
(j, i_j), (s, j), (p, t), (h, t), (m, T), (w, O), (N, D)