Graphical models for part of speech tagging

Name: Graphical models for part of speech tagging
Uploaded: 2017-09-05T21:18:48+00:00
Duration: PTM11S30
Channel: Kimberly Palmer
Description: Graphical models for part of speech tagging

Different Models for POS tagging
Hidden Markov Models (HMM)
Maximum Entropy Markov Models (MEMM)
Conditional Random Fields (CRF)

POS tagging: A Sequence Labeling Problem
Input and Output
  Input sequence x = x1 x2 ... xn
  Output sequence y = y1 y2 ... ym
    Labels of the input sequence
    Semantic representation of the input
Other Applications
  Automatic speech recognition
  Text processing, e.g., tagging, named entity recognition, summarization by exploiting the layout structure of text, etc.

Hidden Markov Models
Doubly stochastic models
[Figure: four-state HMM (S1 to S4); each state emits the symbols A and C with its own emission probabilities, and the arcs between states carry transition probabilities]
Efficient dynamic programming algorithms exist for
  Finding Pr(S)
  Finding the highest-probability path P that maximizes Pr(S, P) (Viterbi)
  Training the model (Baum-Welch algorithm)
Notes: In the previous models, Pr(ai) depended only on the symbols appearing within some distance before ai, not on the position i of the symbol. Modeling drifting or evolving sequences needs something more powerful, and hidden Markov models provide one such option. Here the states do not correspond to substrings, hence the name "hidden". There are two kinds of probabilities: transition probabilities, as before, and emission probabilities. Calculating Pr(seq) is not straightforward, because every symbol can potentially be generated from every state, so the sequence is generated not by a single path but by multiple paths, each with some probability. The joint probability of one path and the emitted symbols, however, is easy to calculate. One could enumerate all possible paths and sum their probabilities, but exploiting the Markov property does much better.
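The three quantities mentioned on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming a two-state toy HMM whose probabilities are made up for the example (they are not the values from the slide's figure); the names `START`, `TRANS`, `EMIT` and all function names are hypothetical.

```python
from itertools import product

# Toy HMM. Every number here is an illustrative assumption,
# not a value recovered from the slide's figure.
STATES = ["S1", "S2"]
START = {"S1": 0.5, "S2": 0.5}
TRANS = {"S1": {"S1": 0.9, "S2": 0.1}, "S2": {"S1": 0.2, "S2": 0.8}}
EMIT = {"S1": {"A": 0.6, "C": 0.4}, "S2": {"A": 0.1, "C": 0.9}}

def joint(seq, path):
    """Pr(seq, path): easy to compute for one fixed path."""
    p = START[path[0]] * EMIT[path[0]][seq[0]]
    for prev, cur, sym in zip(path, path[1:], seq[1:]):
        p *= TRANS[prev][cur] * EMIT[cur][sym]
    return p

def brute_force(seq):
    """Pr(seq) by enumerating every path: exponential in len(seq)."""
    return sum(joint(seq, path) for path in product(STATES, repeat=len(seq)))

def forward(seq):
    """Pr(seq) by dynamic programming: O(len(seq) * |STATES|^2)."""
    alpha = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    for sym in seq[1:]:
        alpha = {s: EMIT[s][sym] * sum(alpha[r] * TRANS[r][s] for r in STATES)
                 for s in STATES}
    return sum(alpha.values())

def viterbi(seq):
    """Highest-probability path and its joint probability Pr(seq, path)."""
    best = {s: (START[s] * EMIT[s][seq[0]], [s]) for s in STATES}
    for sym in seq[1:]:
        new = {}
        for s in STATES:
            # Best way to arrive in state s from any previous state r.
            p, path = max(((best[r][0] * TRANS[r][s], best[r][1]) for r in STATES),
                          key=lambda t: t[0])
            new[s] = (p * EMIT[s][sym], path + [s])
        best = new
    p, path = max(best.values(), key=lambda t: t[0])
    return path, p
```

`forward` and `brute_force` should agree on any sequence over {A, C}; the difference is only in cost, which is what exploiting the Markov property buys.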

Hidden Markov Model (HMM) : Generative Modeling
Source model P(Y), e.g., a 1st order Markov chain
Noisy channel P(X|Y), which turns the hidden sequence y into the observed sequence x
Parameter estimation: maximize the joint likelihood of training examples
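When the training examples are fully tagged, maximizing the joint likelihood of a 1st order HMM reduces to relative-frequency counting. The sketch below assumes that setting; the function name `train_hmm` and the start-of-sentence tag `"<s>"` are hypothetical, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def normalize(table):
    """Turn nested counts into conditional probabilities."""
    return {ctx: {k: c / sum(cnt.values()) for k, c in cnt.items()}
            for ctx, cnt in table.items()}

def train_hmm(tagged_sentences):
    """Maximum joint likelihood estimates from (word, tag) sequences.

    Returns (trans, emit) with trans[prev_tag][tag] = P(tag | prev_tag)
    for the source model and emit[tag][word] = P(word | tag) for the
    noisy channel.
    """
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for sentence in tagged_sentences:
        prev = "<s>"  # hypothetical start-of-sentence tag
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return normalize(trans), normalize(emit)
```

Any word that never occurs with a tag in training gets no emission estimate at all under this scheme, which is the weakness the following slides address.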

Dependency (1st order)

Disadvantage of HMMs (1)
No Rich Feature Information
  Rich information is required
    When xk is complex
    When data for xk is sparse
Example: POS Tagging
  How to evaluate P(wk|tk) for unknown words wk?
  Useful features
    Suffix, e.g., -ed, -tion, -ing, etc.
    Capitalization
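As a rough illustration of the features named above, the sketch below extracts suffix and capitalization cues from a single word. The function name `word_features` and the feature-key strings are hypothetical; only the suffix list and the capitalization cue come from the slide.

```python
def word_features(word):
    """Surface cues that stay informative for words never seen in training."""
    suffixes = ("ed", "tion", "ing")  # the suffixes named on the slide
    feats = {f"suffix=-{s}": word.lower().endswith(s) for s in suffixes}
    feats["capitalized"] = word[:1].isupper()
    return feats
```

An HMM has no natural place for such overlapping cues, because its emission table is indexed by the word itself.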

Disadvantage of HMMs (2)
Generative Model
  Parameter estimation: maximize the joint likelihood of training examples
Better Approach
  Discriminative model, which models P(y|x) directly
  Maximize the conditional likelihood of training examples
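Written out, the two training criteria differ only in which probability is maximized. The notation below is an assumption made for illustration: T denotes the set of training pairs (x, y) and theta the model parameters.

```latex
\begin{align*}
\text{Generative:}     \quad & \hat{\theta} = \arg\max_{\theta} \sum_{(x,y) \in T} \log P_{\theta}(x, y) \\
\text{Discriminative:} \quad & \hat{\theta} = \arg\max_{\theta} \sum_{(x,y) \in T} \log P_{\theta}(y \mid x)
\end{align*}
```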

Maximum Entropy Markov Model
Discriminative Sub Models
  Unify the two parameters of the generative model into one conditional model
    Two parameters in the generative model: the parameter in the source model, P(yk|yk-1), and the parameter in the noisy channel, P(xk|yk)
    Unified conditional model: P(yk|xk, yk-1)
  Employ the maximum entropy principle
Maximum Entropy Markov Model

General Maximum Entropy Model
Model the distribution P(Y|X) with a set of features {f1, f2, ..., fl} defined on X and Y
Idea
  Collect information about the features from training data
  Assume nothing about the distribution P(Y|X) other than the collected information
  Maximize the entropy as a criterion

Features
0-1 indicator functions
  1 if (x, y) satisfies a predefined condition
  0 if not
Example: POS Tagging
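The slide's own POS tagging example is not preserved. A hypothetical indicator feature of the kind described might look like the sketch below; the condition (a word ending in -ing tagged VBG) and the function name are illustrative assumptions.

```python
def f_ing_vbg(x, y):
    """Hypothetical feature: 1 if word x ends in -ing and tag y is VBG, else 0."""
    return 1 if x.endswith("ing") and y == "VBG" else 0
```

For example, `f_ing_vbg("running", "VBG")` fires, while the same word paired with the tag NN does not.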

Constraints
Empirical Information
  Statistics from training data T
Expected Value
  From the distribution P(Y|X) we want to model
Constraints
  The expected value of each feature under the model must equal its empirical value
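The slide's formulas are not preserved. In the standard formulation of Berger et al. (1996), which is assumed here, the empirical statistic, the model expectation, and the resulting constraint for each feature f_i can be written as follows, with \tilde{P} denoting empirical distributions estimated from T.

```latex
\begin{align*}
\tilde{E}(f_i) &= \sum_{x, y} \tilde{P}(x, y)\, f_i(x, y) = \frac{1}{|T|} \sum_{(x, y) \in T} f_i(x, y) \\
E(f_i) &= \sum_{x, y} \tilde{P}(x)\, P(y \mid x)\, f_i(x, y) \\
E(f_i) &= \tilde{E}(f_i), \qquad i = 1, \dots, l
\end{align*}
```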

Maximum Entropy: Objective
Maximization Problem
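The objective itself is missing from the transcript. Under the usual conditional maximum entropy formulation (an assumption consistent with Berger et al. 1996), the maximization problem reads:

```latex
\begin{align*}
P^{*} = \arg\max_{P} \; & H(P) = -\sum_{x, y} \tilde{P}(x)\, P(y \mid x) \log P(y \mid x) \\
\text{subject to} \; & E(f_i) = \tilde{E}(f_i), \quad i = 1, \dots, l \\
& \sum_{y} P(y \mid x) = 1 \quad \text{for all } x
\end{align*}
```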

Dual Problem
Dual Problem
  Conditional model
  Maximum likelihood of conditional data
Solution
  Improved iterative scaling (IIS) (Berger et al. 1996)
  Generalized iterative scaling (GIS) (McCallum et al. 2000)
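The dual solution has the exponential form P(y|x) = exp(sum_i w_i f_i(x, y)) / Z(x), and fitting it is maximum conditional likelihood. The sketch below illustrates this with plain gradient ascent, which is swapped in here for IIS and GIS purely for brevity; it is not the method the slides name. All function names, the learning rate, and the step count are assumptions.

```python
import math

def maxent_probs(weights, features, x, labels):
    """P(y|x) proportional to exp(sum_i w_i * f_i(x, y)), one value per label."""
    scores = [sum(w * f(x, y) for w, f in zip(weights, features)) for y in labels]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def train_maxent(data, features, labels, lr=0.5, steps=2000):
    """Gradient ascent on the average conditional log-likelihood.

    The gradient for weight i is the empirical count of f_i minus its
    expected count under the model, so it vanishes exactly when the
    maximum entropy constraints hold.
    """
    weights = [0.0] * len(features)
    for _ in range(steps):
        grad = [0.0] * len(features)
        for x, y in data:
            probs = maxent_probs(weights, features, x, labels)
            for i, f in enumerate(features):
                expected = sum(p * f(x, lab) for p, lab in zip(probs, labels))
                grad[i] += f(x, y) - expected
        weights = [w + lr * g / len(data) for w, g in zip(weights, grad)]
    return weights
```

With a single feature that fires on (word ending in -ing, VBG) and training data in which two of three -ing words are tagged VBG, the fitted model should assign P(VBG | -ing word) close to 2/3, matching the empirical statistic.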

Maximum Entropy Markov Model
Use Maximum Entropy Approach to Model
  1st order
Features
  Basic features (like parameters in HMM)
    Bigram (1st order) or trigram (2nd order) in source model
    State-output pair feature (Xk = xk, Yk = yk)
  Advantage: incorporate other advanced features on (xk, yk)

Maximum Entropy Markov Model (MEMM)
HMM vs MEMM (1st order)
[Figure: dependency graphs of a 1st order HMM and a 1st order Maximum Entropy Markov Model (MEMM), shown side by side]
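The figure is not preserved, but the contrast it draws corresponds to the two standard first-order factorizations below. The fixed start label y_0 and the assumption that x and y have the same length n are notational conveniences added here.

```latex
\begin{align*}
\text{HMM:}  \quad & P(x, y) = \prod_{k=1}^{n} P(y_k \mid y_{k-1})\, P(x_k \mid y_k) \\
\text{MEMM:} \quad & P(y \mid x) = \prod_{k=1}^{n} P(y_k \mid y_{k-1}, x_k)
\end{align*}
```

In the HMM each label generates its observation; in the MEMM each observation conditions the choice of the next label.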

Performance in POS Tagging
Data set: WSJ
Features: HMM features, spelling features (like -ed, -tion, -s, -ing, etc.)
Results (Lafferty et al. 2001); OOV = out-of-vocabulary words
  Model            Accuracy   OOV accuracy
  1st order HMM    94.31%     54.01%
  1st order MEMM   95.19%     73.01%

Disadvantage of MEMMs (1)
Complex Algorithm of Maximum Entropy Solution
  Both IIS and GIS are difficult to implement
  Require many tricks in implementation
Slow in Training
  Time consuming when the data set is large
  Especially for MEMM

Disadvantage of MEMMs (2)
Maximum Entropy Markov Model
  Maximum entropy model as a sub model
  Optimization of entropy on sub models, not on the global model
Label Bias Problem
  Conditional models with per-state normalization
  Effects of observations are weakened for states with fewer outgoing transitions

Label Bias Problem
Training Data
  X:Y
  rib:123
  rob:456
[Figure: state diagram with two branches, states 1, 2, 3 labeling r, i, b and states 4, 5, 6 labeling r, o, b]
Model Parameters
New input: rob
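The effect can be reproduced with a small per-state normalized model. The sketch below is an illustration under stated assumptions: "rib" is taken to occur three times as often as "rob" in training (the slide does not give frequencies), state 0 is a hypothetical start state, probabilities are smoothed counts rather than a fitted maximum entropy model, and all function names are invented for the example.

```python
from collections import Counter, defaultdict

def train_memm(examples):
    """Count (state, observation) -> next-state events from (word, path) pairs."""
    counts = defaultdict(Counter)   # counts[(state, obs)][next_state]
    successors = defaultdict(set)   # next states ever seen after each state
    for word, path in examples:
        prev = 0                    # hypothetical start state
        for obs, state in zip(word, path):
            counts[(prev, obs)][state] += 1
            successors[prev].add(state)
            prev = state
    return counts, successors

def next_state_probs(counts, successors, state, obs, smooth=0.1):
    """P(next | state, obs), normalized over the successors of this state only."""
    scores = {s: counts[(state, obs)][s] + smooth for s in successors[state]}
    z = sum(scores.values())
    return {s: v / z for s, v in scores.items()}

def decode(counts, successors, word):
    """Exhaustive search over paths (adequate for this tiny model)."""
    beams = [(1.0, [0])]
    for obs in word:
        beams = [(p * q, path + [s])
                 for p, path in beams
                 for s, q in next_state_probs(counts, successors, path[-1], obs).items()]
    p, path = max(beams, key=lambda t: t[0])
    return path[1:], p
```

Because state 1 has a single outgoing transition, `next_state_probs` gives it probability 1 whether the observation is "i" or "o". The only real decision is made at "r", where the more frequent training word wins, so under these assumptions the new input "rob" is decoded along the "rib" branch 1, 2, 3.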

Solution
Global Optimization
  Optimize parameters in a global model simultaneously, not in sub models separately
Alternatives
  Conditional random fields
  Application of the perceptron algorithm

Conditional Random Field (CRF) (1)
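Let $G = (V, E)$ be a graph such that $Y = (Y_v)_{v \in V}$, so that Y is indexed by the vertices of G
Then (X, Y) is a conditional random field if, when conditioned globally on X, the random variables $Y_v$ obey the Markov property with respect to G:
  $p(Y_v \mid X, Y_w, w \neq v) = p(Y_v \mid X, Y_w, w \sim v)$, where $w \sim v$ means that w and v are neighbors in G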

Conditional Random Field (CRF) (2)
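Exponential Model
  $G = (V, E)$: a tree (or more specifically, a chain) with cliques as edges and vertices
  $p_\theta(y \mid x) \propto \exp\Big( \sum_{e \in E, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V, k} \mu_k g_k(v, y|_v, x) \Big)$
    First sum (edge features $f_k$): determined by state transitions
    Second sum (vertex features $g_k$): state determined
Parameter Estimation
  Maximize the conditional likelihood of training examples
  IIS or GIS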
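For a chain, the global normalizer Z(x) of the exponential model can be computed by the same forward recursion used for HMMs. The sketch below assumes the feature sums have already been collapsed into two score tables, `unary` (state-determined scores per position) and `pairwise` (transition scores, taken to be position-independent for brevity); these names and that simplification are assumptions, not part of the slides.

```python
import math
from itertools import product

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def path_score(unary, pairwise, labels):
    """Unnormalized log-score: vertex scores plus edge scores along the chain."""
    s = sum(unary[k][y] for k, y in enumerate(labels))
    return s + sum(pairwise[a][b] for a, b in zip(labels, labels[1:]))

def crf_log_z(unary, pairwise):
    """log Z(x) by the forward algorithm in log space."""
    n_labels = len(pairwise)
    alpha = list(unary[0])
    for k in range(1, len(unary)):
        alpha = [unary[k][b] + logsumexp([alpha[a] + pairwise[a][b]
                                          for a in range(n_labels)])
                 for b in range(n_labels)]
    return logsumexp(alpha)

def brute_force_log_z(unary, pairwise):
    """log Z(x) by enumerating every label sequence (for checking only)."""
    n_labels = len(pairwise)
    return logsumexp([path_score(unary, pairwise, y)
                      for y in product(range(n_labels), repeat=len(unary))])

def crf_log_prob(unary, pairwise, labels):
    """log p(y|x): a single global normalizer for the whole sequence."""
    return path_score(unary, pairwise, labels) - crf_log_z(unary, pairwise)
```

The normalizer is computed once per input sequence, not once per state, which is the structural difference from the MEMM.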

MEMM vs CRF
Similarities
  Both employ the maximum entropy principle
  Both incorporate rich feature information
Differences
  Conditional random fields are always globally conditioned on X, resulting in a globally optimized model
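The difference shows up in where the normalizer sits. The chain-structured forms below are a sketch in assumed notation: a shared feature set f_i with weights lambda_i, and a fixed start label y_0.

```latex
\begin{align*}
\text{MEMM:} \quad & p(y \mid x) = \prod_{k=1}^{n}
  \frac{\exp\big( \sum_i \lambda_i f_i(y_{k-1}, y_k, x_k) \big)}{Z(y_{k-1}, x_k)} \\
\text{CRF:}  \quad & p(y \mid x) = \frac{1}{Z(x)}
  \exp\Big( \sum_{k=1}^{n} \sum_i \lambda_i f_i(y_{k-1}, y_k, x, k) \Big)
\end{align*}
```

The MEMM normalizes at every state, so each factor sums to one on its own; the CRF normalizes once over all label sequences.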

Performance in POS Tagging
Data set: WSJ
Features: HMM features, spelling features (like -ed, -tion, -s, -ing, etc.)
Results (Lafferty et al. 2001)
  Model                       Accuracy   OOV accuracy
  1st order MEMM              95.19%     73.01%
  Conditional random fields   95.73%     76.24%

Comparison of the three approaches to POS Tagging
Results (Lafferty et al. 2001)
  Model                       Accuracy   OOV accuracy
  1st order HMM               94.31%     54.01%
  1st order MEMM              95.19%     73.01%
  Conditional random fields   95.73%     76.24%

References
A. Berger, S. Della Pietra, and V. Della Pietra (1996). A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 22(1).
J. Lafferty, A. McCallum, and F. Pereira (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. ICML-2001.
