Graphical Models for Segmenting and Labeling Sequence Data Manoj Kumar Chinnakotla NLP-AI Seminar
Outline
Introduction
Directed Graphical Models
  – Hidden Markov Models (HMMs)
  – Maximum Entropy Markov Models (MEMMs)
Label Bias Problem
Undirected Graphical Models
  – Conditional Random Fields (CRFs)
Summary
The Task
Labeling
  – Given sequence data, assign an appropriate tag to each data item
Segmentation
  – Given sequence data, segment it into non-overlapping groups such that related entities fall in the same group
Applications
Computational Linguistics
  – POS Tagging
  – Information Extraction
  – Syntactic Disambiguation
Computational Biology
  – DNA and Protein Sequence Alignment
  – Sequence Homologue Searching
  – Protein Secondary Structure Prediction
Example : POS Tagging
Directed Graphical Models
Hidden Markov Models (HMMs)
  – Assign a joint probability to paired observation and label sequences
  – The parameters are trained to maximize the joint likelihood of the training examples
Hidden Markov Models (HMMs)
Generative model: models the joint distribution of words and tags
Generation process
  – Probabilistic finite state machine
  – Set of states: correspond to tags
  – Alphabet: set of words
  – Transition probability: $P(t_i \mid t_{i-1})$
  – State (emission) probability: $P(w_i \mid t_i)$
HMMs (Contd..)
For a given word/tag sequence pair $(w, t)$ of length $m$, under a first-order model:
  $P(w, t) = \prod_{i=1}^{m} P(t_i \mid t_{i-1}) \, P(w_i \mid t_i)$
Why Hidden?
  – The sequence of tags that generated the word sequence is not visible
Why Markov?
  – Based on the Markovian assumption: the current tag depends only on the previous ‘n’ tags
  – Addresses the “sparsity problem”
Training
  – Learning the transition and emission probabilities from data
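The joint probability that an HMM assigns to a word/tag pair can be sketched in a few lines. This is a minimal illustration assuming a first-order model; the tag set, the `<s>` start marker, and all probability values are hypothetical and are not taken from the slides.

```python
# Hypothetical start-of-sentence marker used as the "previous tag" at position 0.
START = "<s>"

# Hypothetical toy parameters, chosen only for illustration.
# TRANSITION[(prev_tag, tag)] = P(tag | prev_tag)
TRANSITION = {
    (START, "DET"): 0.6, (START, "NOUN"): 0.4,
    ("DET", "NOUN"): 0.9, ("DET", "DET"): 0.1,
    ("NOUN", "NOUN"): 0.3, ("NOUN", "DET"): 0.7,
}
# EMISSION[(tag, word)] = P(word | tag)
EMISSION = {
    ("DET", "the"): 0.7, ("DET", "a"): 0.3,
    ("NOUN", "dog"): 0.4, ("NOUN", "cat"): 0.6,
}


def joint_probability(words, tags):
    """Return P(words, tags) as a product of transition and emission terms."""
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    prob = 1.0
    prev = START
    for word, tag in zip(words, tags):
        # Unseen transitions or emissions get probability 0 in this toy model.
        prob *= TRANSITION.get((prev, tag), 0.0)
        prob *= EMISSION.get((tag, word), 0.0)
        prev = tag
    return prob


p = joint_probability(["the", "dog"], ["DET", "NOUN"])
```

In practice the product is usually accumulated in log space to avoid underflow on long sequences; the plain product is kept here for readability.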
HMMs Tagging Process
Given a string of words w, choose the tag sequence t* such that
  $t^* = \arg\max_t P(t \mid w) = \arg\max_t P(w, t)$
Computationally expensive if done naively: all possible tag sequences must be evaluated!
  – For ‘n’ possible tags and m positions there are $n^m$ tag sequences
Viterbi Algorithm
  – Used to find the optimal tag sequence t*
  – Efficient dynamic-programming-based algorithm
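The contrast between exhaustive search and Viterbi decoding can be sketched as follows. This is a minimal illustration assuming a first-order HMM; the tags, the `<s>` start marker, and the probability values are hypothetical. The brute-force search is included only to check the dynamic program on a tiny input.

```python
from itertools import product

TAGS = ["DET", "NOUN", "VERB"]
START = "<s>"  # hypothetical start-of-sentence marker

# Hypothetical toy parameters (missing entries are treated as probability 0).
TRANS = {
    (START, "DET"): 0.8, (START, "NOUN"): 0.2,
    ("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.1,
    ("NOUN", "VERB"): 0.7, ("NOUN", "NOUN"): 0.3,
    ("VERB", "DET"): 0.6, ("VERB", "NOUN"): 0.4,
}
EMIT = {
    ("DET", "the"): 0.9,
    ("NOUN", "dog"): 0.5, ("NOUN", "barks"): 0.1,
    ("VERB", "barks"): 0.4, ("VERB", "dog"): 0.05,
}


def joint(words, tags):
    """P(words, tags) under the toy first-order HMM."""
    prob, prev = 1.0, START
    for word, tag in zip(words, tags):
        prob *= TRANS.get((prev, tag), 0.0) * EMIT.get((tag, word), 0.0)
        prev = tag
    return prob


def brute_force(words):
    """Score every one of the n^m tag sequences; only feasible for tiny inputs."""
    return max(product(TAGS, repeat=len(words)), key=lambda ts: joint(words, ts))


def viterbi(words):
    """Dynamic programming over positions: O(m * n^2) instead of O(n^m)."""
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {
        t: (TRANS.get((START, t), 0.0) * EMIT.get((t, words[0]), 0.0), [t])
        for t in TAGS
    }
    for word in words[1:]:
        new_best = {}
        for t in TAGS:
            # Pick the best predecessor for tag t at this position.
            prob, path = max(
                ((best[p][0] * TRANS.get((p, t), 0.0), best[p][1]) for p in TAGS),
                key=lambda item: item[0],
            )
            new_best[t] = (prob * EMIT.get((t, word), 0.0), path + [t])
        best = new_best
    prob, path = max(best.values(), key=lambda item: item[0])
    return path, prob


words = ["the", "dog", "barks"]
path, prob = viterbi(words)
```

Storing whole paths in the table keeps the sketch short; a production implementation would normally keep back-pointers and work with log probabilities.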
Disadvantages of HMMs
Need to enumerate all possible observation sequences
Not possible to represent multiple interacting features
Difficult to model long-range dependencies of the observations
Very strict independence assumptions on the observations
Maximum Entropy Markov Models (MEMMs)
Conditional Exponential Models
  – Assume the observation sequence is given (it need not be modeled)
  – Train the model to maximize the conditional likelihood P(Y|X)
MEMMs (Contd..)
For a new data sequence x, the label sequence y that maximizes P(y|x,Θ) is assigned (Θ is the parameter set)
Arbitrary, non-independent features of the observation sequence are possible
Conditional models are known to often perform better than generative models
Performs per-state normalization
  – The total probability mass that arrives at a state must be distributed among all its possible successor states
Label Bias Problem
Bias towards states with fewer outgoing transitions
Due to per-state normalization
An Example MEMM [figure]
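The effect of per-state normalization can be sketched numerically. The example assumes a two-branch state machine in the style of the well-known "rib"/"rob" illustration: both branches start with the same symbol, and after the first step each state has exactly one successor. The raw scores are hypothetical; a low score stands for an observation that matches the branch poorly.

```python
import math


def softmax(scores):
    """Normalize a dict of raw scores into a probability distribution."""
    z = sum(math.exp(s) for s in scores.values())
    return {key: math.exp(s) / z for key, s in scores.items()}


# Hypothetical raw scores for the observation "rib".
# First step: both branches match the observed 'r' equally well.
FIRST_STEP = {"rib": 0.0, "rob": 0.0}
# Later steps: the "rob" branch scores badly on the observed 'i'.
LATER_STEPS = {"rib": [2.0, 1.0], "rob": [-2.0, 1.0]}


def per_state_normalized():
    """MEMM-style: normalize over the successors of each state separately."""
    start = softmax(FIRST_STEP)
    probs = {}
    for branch, scores in LATER_STEPS.items():
        p = start[branch]
        for s in scores:
            # A single successor always receives probability 1,
            # so the score (and hence the observation) is ignored.
            p *= softmax({"only_successor": s})["only_successor"]
        probs[branch] = p
    return probs


def globally_normalized():
    """CRF-style: sum the scores along each path, normalize once over paths."""
    totals = {b: FIRST_STEP[b] + sum(LATER_STEPS[b]) for b in LATER_STEPS}
    return softmax(totals)


memm_like = per_state_normalized()
crf_like = globally_normalized()
```

With per-state normalization both branches end up equally probable because the mismatching observation cannot reduce the mass flowing through a state with one successor. With a single global normalization the branch that matches the observations receives most of the mass.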
Undirected Graphical Models Random Fields
Conditional Random Fields (CRFs)
Conditional exponential model, like the MEMM
Have all the advantages of MEMMs without the label bias problem
  – An MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state
  – A CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence
Allow some transitions to “vote” more strongly than others, depending on the corresponding observations
Definition of CRFs
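CRF Distribution Function
  $p_\theta(\mathbf{y} \mid \mathbf{x}) \propto \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, \mathbf{y}|_e, \mathbf{x}) + \sum_{v \in V,\, k} \mu_k g_k(v, \mathbf{y}|_v, \mathbf{x}) \Big)$
Where:
V = set of label random variables (the vertices of the graph); E = set of edges of the graph
$f_k$ and $g_k$ = features
  – $g_k$ = state feature
  – $f_k$ = edge feature
$\theta = (\lambda_1, \lambda_2, \ldots; \mu_1, \mu_2, \ldots)$ are the parameters to be estimated
$\mathbf{y}|_e$ = set of components of y defined by edge e
$\mathbf{y}|_v$ = set of components of y defined by vertex v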
CRF Training
CRF Training (Contd..)
Condition for maximum likelihood
  – The expected feature count computed under the model equals the empirical feature count from the training data
No closed-form solution for the parameters exists
Iterative algorithms are employed, improving the log-likelihood in successive iterations
Examples
  – Generalized Iterative Scaling (GIS)
  – Improved Iterative Scaling (IIS)
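The maximum-likelihood condition can be made concrete on a toy chain CRF: the gradient of the log-likelihood is the empirical feature count minus the expected feature count under the model. The sketch below uses plain gradient ascent rather than GIS or IIS, and computes expectations by enumerating every label sequence, which is only feasible for tiny inputs. The labels, features, training pairs, and step size are all hypothetical.

```python
import math
from itertools import product

LABELS = ["A", "B"]


def features(x, y):
    """Global feature counts F(x, y) for hypothetical state and edge features."""
    f = [0.0, 0.0, 0.0]
    for i, (xi, yi) in enumerate(zip(x, y)):
        if xi == "a" and yi == "A":
            f[0] += 1  # state feature: symbol 'a' labeled A
        if xi == "b" and yi == "B":
            f[1] += 1  # state feature: symbol 'b' labeled B
        if i > 0 and y[i - 1] == yi:
            f[2] += 1  # edge feature: adjacent labels agree
    return f


def score(theta, x, y):
    return sum(t * f for t, f in zip(theta, features(x, y)))


def expected_counts(theta, x):
    """Expected feature counts under the model, plus the normalizer Z(x)."""
    ys = list(product(LABELS, repeat=len(x)))
    weights = [math.exp(score(theta, x, y)) for y in ys]
    z = sum(weights)
    expected = [0.0] * len(theta)
    for y, w in zip(ys, weights):
        for k, f in enumerate(features(x, y)):
            expected[k] += (w / z) * f
    return expected, z


def log_likelihood(theta, data):
    total = 0.0
    for x, y in data:
        _, z = expected_counts(theta, x)
        total += score(theta, x, y) - math.log(z)
    return total


def gradient(theta, data):
    """Empirical feature counts minus expected feature counts."""
    grad = [0.0] * len(theta)
    for x, y in data:
        empirical = features(x, y)
        expected, _ = expected_counts(theta, x)
        for k in range(len(theta)):
            grad[k] += empirical[k] - expected[k]
    return grad


# Hypothetical training pairs (observation string, label string).
data = [("aab", "AAB"), ("bba", "BBA")]
theta = [0.0, 0.0, 0.0]
initial_gradient = gradient(theta, data)
before = log_likelihood(theta, data)
for _ in range(100):
    g = gradient(theta, data)
    theta = [t + 0.2 * gk for t, gk in zip(theta, g)]  # gradient ascent step
after = log_likelihood(theta, data)
final_gradient = gradient(theta, data)
```

As the iterations proceed, the gap between empirical and expected counts shrinks, which is the stated maximum-likelihood condition. For real chain CRFs the expectations are computed with forward-backward recursions instead of enumeration.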
Graphical Comparison HMMs, MEMMs, CRFs
POS Tagging Results
Summary
HMMs
  – Directed, generative graphical models
  – Cannot be used to model overlapping features on observations
MEMMs
  – Directed, conditional models
  – Can model overlapping features on observations
  – Suffer from the label bias problem due to per-state normalization
CRFs
  – Undirected, conditional models
  – Avoid the label bias problem
  – Efficient training is possible
Thanks!
Acknowledgements
Some slides in this presentation are from Rongkun Shen’s (Oregon State Univ) presentation on CRFs