
1 Sequence Learning Sudeshna Sarkar 14 Aug 2008

2 Alternative graphical models for part of speech tagging

3 Different Models for POS tagging: HMM, Maximum Entropy Markov Models, Conditional Random Fields

4 Hidden Markov Model (HMM): Generative Modeling. Source model P(Y); noisy channel P(X|Y).
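
As a concrete illustration of the source-model / noisy-channel decomposition, the Python sketch below scores a tag sequence and a word sequence under a toy first-order HMM. The probability tables are made up for illustration and are not taken from the slides.

# Generative HMM: joint probability P(x, y) = P(y) * P(x | y).
start = {"DT": 0.6, "NN": 0.4}                      # P(y_1)
trans = {("DT", "NN"): 0.9, ("DT", "DT"): 0.1,
         ("NN", "NN"): 0.3, ("NN", "DT"): 0.7}      # P(y_k | y_{k-1})  (source model)
emit = {("DT", "the"): 0.7, ("DT", "a"): 0.3,
        ("NN", "dog"): 0.5, ("NN", "cat"): 0.5}     # P(x_k | y_k)      (noisy channel)

def joint_prob(words, tags):
    """P(x, y) = P(y_1) * prod_k P(y_k | y_{k-1}) * prod_k P(x_k | y_k)."""
    p = start[tags[0]] * emit[(tags[0], words[0])]
    for k in range(1, len(words)):
        p *= trans[(tags[k - 1], tags[k])] * emit[(tags[k], words[k])]
    return p

print(joint_prob(["the", "dog"], ["DT", "NN"]))  # 0.6 * 0.7 * 0.9 * 0.5 = 0.189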

5 Dependency (1st order)

6 Disadvantage of HMMs (1): No Rich Feature Information. Rich information is required when x_k is complex or when data for x_k is sparse. Example: POS tagging. How do we evaluate P(w_k | t_k) for unknown words w_k? Useful features: suffix (e.g., -ed, -tion, -ing, etc.), capitalization. As a generative model, parameter estimation maximizes the joint likelihood of the training examples.

7 Generative Models Hidden Markov models (HMMs) and stochastic grammars assign a joint probability to paired observation and label sequences. The parameters are typically trained to maximize the joint likelihood of training examples.

8 Generative Models (cont’d) Difficulties and disadvantages Need to enumerate all possible observation sequences Not practical to represent multiple interacting features or long-range dependencies of the observations Very strict independence assumptions on the observations

9 Making use of rich domain features A learning algorithm is as good as its features. There are many useful features to include in a model, and most of them aren't independent of each other: identity of word; ends in "-shire"; is capitalized; is head of noun phrase; is in a list of city names; is under node X in WordNet; word to left is verb; word to left is lowercase; is in bold font; is in hyperlink anchor; other occurrences in doc; ...

10 Problems with Richer Representation and a Generative Model These arbitrary features are not independent: overlapping and long-distance dependencies; multiple levels of granularity (words, characters); multiple modalities (words, formatting, layout); observations from past and future. HMMs are generative models of the text, and generative models do not easily handle these non-independent features. Two choices: Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data! Ignore the dependencies. This causes "over-counting" of evidence (as in naive Bayes). Big problem when combining evidence, as in Viterbi!

11 Discriminative Models We would prefer a conditional model: P(y|x) instead of P(y,x): Can examine features, but not responsible for generating them. Don’t have to explicitly model their dependencies. Don’t “waste modeling effort” trying to generate what we are given at test time anyway. Provide the ability to handle many arbitrary features.

12 Locally Normalized Conditional Sequence Model [Figure: graphical models of the generative (traditional HMM) and conditional variants, showing transitions and observations.] Standard belief propagation: forward-backward procedure. Viterbi and Baum-Welch follow naturally. Maximum Entropy Markov Models [McCallum, Freitag & Pereira, 2000]; MaxEnt POS Tagger [Ratnaparkhi, 1996]; SNoW-based Markov Model [Punyakanok & Roth, 2000]

13 Locally Normalized Conditional Sequence Model [Figure: the same comparison or, more generally, a conditional model conditioned on the entire observation sequence.] Standard belief propagation: forward-backward procedure. Viterbi and Baum-Welch follow naturally. Maximum Entropy Markov Models [McCallum, Freitag & Pereira, 2000]; MaxEnt POS Tagger [Ratnaparkhi, 1996]; SNoW-based Markov Model [Punyakanok & Roth, 2000]

14 Exponential Form for "Next State" Function Overall Recipe: - Labeled data is assigned to transitions. - Train each state's exponential model by maximum likelihood (iterative scaling or conjugate gradient). [Figure: the per-state exponential model as a black-box classifier over weighted features, conditioned on s_{t-1}.]

15 Principle of Maximum Entropy The correct distribution P(s,o) is that which maximizes entropy, or "uncertainty", subject to constraints. Constraints represent evidence. Given k features, the constraints have the form E_P[f_i] = E_~P[f_i], i.e., the model's expectation for each feature should match the observed (empirical) expectation. Philosophy: when making inferences on the basis of partial information, use the distribution with maximum entropy subject to what is known; anything else would amount to arbitrary assumptions of information that we do not have.

16 Maximum Entropy Classifier Conditional model p(y|x). Does not try to model p(x), so it can work with complicated input features since we do not need to model dependencies between them. Principle of maximum entropy: we want a classifier that matches feature constraints from the training data and whose predictions maximize entropy. There is a unique exponential family distribution that meets these criteria. Maximum Entropy Classifier: p(y|x; θ), inference and learning.

17 Indicator Features Feature functions f(x,y): f1(w,y) = {word is Sarani & y = Location}; f2(w,y) = {previous tag = Per-begin, current word suffix = "an", & y = Per-end}
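
In code, the two indicator features above might look like the following Python sketch. The feature definitions mirror f1 and f2 from the slide; the example inputs at the end are purely illustrative.

# Indicator (0/1) feature functions over an (observation, label) pair.
def f1(word, y):
    # fires when the word is "Sarani" and the label is Location
    return 1 if word == "Sarani" and y == "Location" else 0

def f2(prev_tag, word, y):
    # fires when the previous tag is Per-begin, the current word ends in "an",
    # and the label is Per-end
    return 1 if prev_tag == "Per-begin" and word.endswith("an") and y == "Per-end" else 0

print(f1("Sarani", "Location"), f2("Per-begin", "Mohan", "Per-end"))  # 1 1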

18 Problems with MaxEnt classifier It makes decisions at each point independently

19 MEMM Use a series of maximum entropy classifiers that know the previous label. Define a Viterbi model of inference: P(y|x) = Π_t P_{y_{t-1}}(y_t | x). Find the most likely label sequence given an input sequence, and learn the model parameters. Combines the advantages of HMM and maximum entropy. But there is a problem.
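
A minimal Python sketch of the MEMM factorization P(y|x) = Π_t P_{y_{t-1}}(y_t | x): per-state conditional distributions chained together, with the best sequence found by brute-force search here for brevity (a real tagger would use Viterbi). The tag set and probability table are hypothetical, and the word is ignored in the toy table.

from itertools import product

TAGS = ["N", "V"]

def p_next(prev_tag, tag, word):
    # A real MEMM would compute this with a per-state maxent model over
    # features of (prev_tag, word); here we use a fixed toy table.
    table = {
        ("<s>", "N"): 0.7, ("<s>", "V"): 0.3,
        ("N", "V"): 0.8, ("N", "N"): 0.2,
        ("V", "N"): 0.6, ("V", "V"): 0.4,
    }
    return table[(prev_tag, tag)]

def memm_prob(words, tags):
    p, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        p *= p_next(prev, t, w)
        prev = t
    return p

words = ["flies", "fast"]
best = max(product(TAGS, repeat=len(words)), key=lambda y: memm_prob(words, y))
print(best, memm_prob(words, best))  # ('N', 'V') ~0.56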

20 Maximum Entropy Markov Model Label bias problem: the transition probabilities leaving any given state must sum to one

21 In some state space configurations, MEMMs essentially completely ignore the inputs Example of label bias problem This is not a problem for HMMs, because the input is generated by the model.

22 Label Bias Example Given: "rib" 3 times, "rob" once. Training: p(1|0, "r") = 0.75, p(4|0, "r") = 0.25. [Figure: state diagram with transition probabilities P=0.75, P=0.25, and P=1.] Inference: [worked through in the figure; see also slide 56]

23 Conditional Markov Models (CMMs), aka MEMMs, aka Maxent Taggers, vs HMMs [Figure: graphical structures of an HMM and a CMM over states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}.]

24 Random Field

25 CRF CRFs have all the advantages of MEMMs without the label bias problem. MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state; CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence. Undirected acyclic graph. Allows some transitions to "vote" more strongly than others depending on the corresponding observations. [Figure: observation X connected to labels y_1, y_2, y_3.]

26 Graphical comparison among HMMs, MEMMs and CRFs [Figure: the graphical structures of an HMM, an MEMM, and a CRF.]

27 Machine Learning a Panacea? A machine learning method is as good as the feature set it uses Shift focus from linguistic processing to feature set design

28 Features to use in IE Features are task dependent Good feature identification needs a good knowledge of the domain combined with automatic methods of feature selection.

29 Feature Examples Extraction of proteins and their interactions from biomedical literature (Mooney). For each token, they take the following as features: current token; last 2 tokens and next 2 tokens; output of a dictionary-based tagger for these 5 tokens; suffix for each of the 5 tokens (last 1, 2, and 3 characters); class labels for the last 2 tokens. Example sentence: "Two potentially oncogenic cyclins, cyclin A and cyclin D1, share common properties of subunit configuration, tyrosine phosphorylation and physical association with the Rb protein"
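
A Python sketch of how the token-window features described above could be generated. The window size and suffix lengths follow the slide; the dictionary-tagger output is omitted, and the function and key names are hypothetical.

def token_features(tokens, i, prev_labels):
    """Features for token i: the token, +/-2 neighbours, suffixes, last 2 labels."""
    feats = {"w0": tokens[i]}
    for off in (-2, -1, 1, 2):                       # last 2 and next 2 tokens
        j = i + off
        feats[f"w{off}"] = tokens[j] if 0 <= j < len(tokens) else "<pad>"
    for n in (1, 2, 3):                              # last 1, 2, and 3 characters
        feats[f"suffix{n}"] = tokens[i][-n:]
    feats["prev_labels"] = tuple(prev_labels[-2:])   # class labels of the last 2 tokens
    return feats

print(token_features(["cyclin", "D1", "shares", "properties"], 1, ["B-protein"]))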

30 More Feature Examples Line, sentence, or paragraph features: length; is centered in page; percent of non-alphabetics; white-space aligns with next line; containing sentence has two verbs; grammatically contains a question; contains links to "authoritative" pages; emissions that are uncountable; features at multiple levels of granularity. Example word features: identity of word; is in all caps; ends in "-ski"; is part of a noun phrase; is in a list of city names; is under node X in WordNet or Cyc; is in bold font; is in hyperlink anchor. Features of past & future: last person name was female; next two words are "and Associates".

31 Indicator Features They’re a little different from the typical supervised ML approach Limited to binary values –Think of a feature as being on or off rather than as a feature with a value Feature values are relative to an object/class pair rather than being a function of the object alone. Typically have lots and lots of features (100s of 1000s of features is quite common.)

32 Feature Templates "Next word" is a feature template; it gives rise to |V| x |T| binary features. Curse of dimensionality; overfitting.
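
A Python sketch of expanding the "next word" template into |V| x |T| binary features; the vocabulary and tag set here are toy placeholders.

from itertools import product

V = ["the", "dog", "runs"]          # toy vocabulary
T = ["DT", "NN", "VB"]              # toy tag set

# The template instantiates one binary feature per (word, tag) pair.
features = {}
for w, t in product(V, T):
    # default arguments freeze w and t for each instantiated feature
    features[f"next={w}&tag={t}"] = lambda next_word, tag, w=w, t=t: int(next_word == w and tag == t)

print(len(features))                             # |V| x |T| = 9
print(features["next=dog&tag=DT"]("dog", "DT"))  # 1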

33 Feature Selection vs Extraction Feature selection: choosing k < d important features, ignoring the remaining d - k; subset selection algorithms. Feature extraction: project the original x_i, i = 1,...,d dimensions to new k < d dimensions z_j, j = 1,...,k; principal components analysis (PCA), linear discriminant analysis (LDA), factor analysis (FA)
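
One common way to contrast the two in Python, as a sketch assuming scikit-learn is available; the data below is random and purely illustrative.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)            # 100 samples, d = 10 features
y = np.random.randint(0, 2, 100)

# Feature selection: keep k of the original d features.
X_sel = SelectKBest(f_classif, k=3).fit_transform(X, y)

# Feature extraction: project onto k new dimensions (linear combinations).
X_pca = PCA(n_components=3).fit_transform(X)

print(X_sel.shape, X_pca.shape)        # (100, 3) (100, 3)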

34 Feature Reduction Example domain: NER in Hindi (Sujan Saha) Feature Value Selection Feature Value Clustering ACL 2008: Kumar Saha; Pabitra Mitra; Sudeshna Sarkar Word Clustering and Word Selection Based Feature Reduction for MaxEnt Based Hindi NER

35

36 Better Approach Discriminative model which models P(y|x) directly Maximize the conditional likelihood of training examples

37 Maximum Entropy modeling N-gram model: probabilities depend on the previous few tokens. We may identify a more heterogeneous set of features which contribute in some way to the choice of the current word (whether it is the first word in a story, whether the next word is "to", whether one of the last 5 words is a preposition, etc.). Maxent combines these features in a probabilistic model. The given features provide constraints on the model. We would like a probability distribution which, outside of these constraints, is as uniform as possible, i.e. has the maximum entropy among all models that satisfy these constraints.
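
A minimal Python sketch of how a maxent model combines such heterogeneous binary features into a conditional distribution p(y|x) = exp(Σ_k λ_k f_k(x, y)) / Z(x). The features, weights, and vocabulary are hypothetical.

import math

def f(x, y):
    # Hypothetical binary features f_k(x, y).
    return [
        1 if x["next"] == "to" and y == "want" else 0,
        1 if x["position"] == 0 and y == "The" else 0,
        1 if x["prep_in_last_5"] and y == "the" else 0,
    ]

lam = [1.2, 0.8, -0.4]                 # hypothetical weights lambda_k
VOCAB = ["want", "The", "the", "dog"]

def p(y, x):
    """Maxent conditional: p(y|x) = exp(sum_k lam_k f_k(x, y)) / Z(x)."""
    score = lambda cand: math.exp(sum(l * fk for l, fk in zip(lam, f(x, cand))))
    return score(y) / sum(score(c) for c in VOCAB)

x = {"next": "to", "position": 3, "prep_in_last_5": False}
print(p("want", x))   # "want" gets boosted because the next word is "to"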

38 Maximum Entropy Markov Model Discriminative sub-models: unify the two parameters of the generative model (the parameter of the source model and the parameter of the noisy channel) into one conditional model. Employ the maximum entropy principle.

39 General Maximum Entropy Principle Model: model the distribution P(Y|X) with a set of features {f_1, f_2, ..., f_l} defined on X and Y. Idea: collect information about the features from the training data. Principle: model what is known; assume nothing else; this yields the flattest distribution, i.e. the distribution with the maximum entropy.

40 Example (Berger et al., 1996) Model the translation of the word "in" from English to French. Need to model P(French word). Constraints: (1) possible translations are dans, en, à, au cours de, pendant; (2) "dans" or "en" is used 30% of the time; (3) "dans" or "à" is used 50% of the time.
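
The constrained maximization can be carried out numerically. The Python sketch below finds the distribution over the five translations that has maximum entropy subject to the two frequency constraints, using scipy's SLSQP solver; this is only an illustration of the principle, not how Berger et al. solved it (they used iterative scaling).

import numpy as np
from scipy.optimize import minimize

words = ["dans", "en", "à", "au cours de", "pendant"]

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p)))   # minimize -H(p)

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},        # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},    # P(dans) + P(en) = 0.3
    {"type": "eq", "fun": lambda p: p[0] + p[2] - 0.5},    # P(dans) + P(à) = 0.5
]

res = minimize(neg_entropy, x0=np.full(5, 0.2), bounds=[(0, 1)] * 5,
               constraints=constraints, method="SLSQP")
print(dict(zip(words, res.x.round(3))))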

41 Features 0-1 indicator functions: 1 if (x, y) satisfies a predefined condition, 0 if not. Example: POS Tagging

42 Constraints Empirical information: statistics of the features collected from the training data T. Constraints: the expected value of each feature under the distribution P(Y|X) we want to model should match its empirical value.

43 Maximum Entropy: Objective Entropy: H(P) = - Σ_{x,y} P~(x) P(y|x) log P(y|x). Maximization problem: choose P to maximize H(P) subject to the feature constraints.

44 Dual Problem The dual problem is maximum likelihood estimation of a conditional exponential model on the training data. Solution: improved iterative scaling (IIS) (Berger et al. 1996); generalized iterative scaling (GIS) (McCallum et al. 2000).

45 Maximum Entropy Markov Model Use the maximum entropy approach to model the 1st-order dependency. Features: basic features (like parameters in HMM), i.e. bigram (1st order) or trigram (2nd order) in the source model, and the state-output pair feature (X_k = x_k, Y_k = y_k). Advantage: can incorporate other advanced features on (x_k, y_k).

46 HMM vs MEMM (1st order) HMM: P(x, y) = Π_k P(y_k | y_{k-1}) P(x_k | y_k). Maximum Entropy Markov Model (MEMM): P(y | x) = Π_k P(y_k | y_{k-1}, x_k).

47 Performance in POS Tagging Data set: WSJ. Features: HMM features, spelling features (like -ed, -tion, -s, -ing, etc.). Results (Lafferty et al. 2001): 1st order HMM: 94.31% accuracy, 54.01% OOV accuracy; 1st order MEMM: 95.19% accuracy, 73.01% OOV accuracy.

48 ME applications Part of Speech (POS) Tagging (Ratnaparkhi, 1996) P(POS tag | context) Information sources –Word window (4) –Word features (prefix, suffix, capitalization) –Previous POS tags

49 ME applications Abbreviation expansion (Pakhomov, 2002) Information sources –Word window (4) –Document title Word Sense Disambiguation (WSD) (Chao & Dyer, 2002) Information sources –Word window (4) –Structurally related words (4) Sentence Boundary Detection (Reynar & Ratnaparkhi, 1997) Information sources –Token features (prefix, suffix, capitalization, abbreviation) –Word window (2)

50 Solution Global Optimization Optimize parameters in a global model simultaneously, not in sub models separately Alternatives Conditional random fields Application of perceptron algorithm

51 Why ME? Advantages Combine multiple knowledge sources –Local  Word prefix, suffix, capitalization (POS - (Ratnaparkhi, 1996) )  Word POS, POS class, suffix (WSD - (Chao & Dyer, 2002) )  Token prefix, suffix, capitalization, abbreviation (Sentence Boundary - (Reynar & Ratnaparkhi, 1997) ) –Global  N-grams (Rosenfeld, 1997)  Word window  Document title (Pakhomov, 2002)  Structurally related words (Chao & Dyer, 2002)  Sentence length, conventional lexicon (Och & Ney, 2002) Combine dependent knowledge sources

52 Why ME? Advantages Add additional knowledge sources Implicit smoothing Disadvantages Computational –Expected value at each iteration –Normalizing constant Overfitting –Feature selection  Cutoffs  Basic Feature Selection (Berger et al., 1996)

53 Conditional Models Conditional probability P(label sequence y | observation sequence x) rather than joint probability P(y, x) Specify the probability of possible label sequences given an observation sequence Allow arbitrary, non-independent features on the observation sequence X The probability of a transition between labels may depend on past and future observations Relax strong independence assumptions in generative models

54 Discriminative Models Maximum Entropy Markov Models (MEMMs) Exponential model Given training set X with label sequence Y: Train a model θ that maximizes P(Y|X, θ) For a new data sequence x, the predicted label y maximizes P(y|x, θ) Notice the per-state normalization

55 MEMMs (cont’d) MEMMs have all the advantages of Conditional Models Per-state normalization: all the mass that arrives at a state must be distributed among the possible successor states (“conservation of score mass”) Subject to Label Bias Problem Bias toward states with fewer outgoing transitions

56 Label Bias Problem Consider the MEMM in the figure: P(1 and 2 | ro) = P(2 | 1 and ro)P(1 | ro) = P(2 | 1 and o)P(1 | r). P(1 and 2 | ri) = P(2 | 1 and ri)P(1 | ri) = P(2 | 1 and i)P(1 | r). Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri). In the training data, label value 2 is the only label value observed after label value 1, therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x. However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro). Per-state normalization does not allow the required expectation.
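
The argument can be checked numerically with the Python sketch below; the per-state tables encode the training counts from the rib/rob example (slide 22).

# Per-state next-state distributions learned from "rib" x3, "rob" x1.
# State 0 branches on the first character; states 1 and 4 each have a single
# successor, so per-state normalization forces probability 1 regardless of input.
p_next = {
    (0, "r"): {1: 0.75, 4: 0.25},
    (1, "i"): {2: 1.0}, (1, "o"): {2: 1.0},   # state 1 ignores the observation
    (4, "i"): {5: 1.0}, (4, "o"): {5: 1.0},   # so does state 4
}

def path_prob(chars, states):
    p, cur = 1.0, 0
    for c, nxt in zip(chars, states):
        p *= p_next[(cur, c)][nxt]
        cur = nxt
    return p

print(path_prob("ro", [1, 2]), path_prob("ri", [1, 2]))  # equal: 0.75 0.75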

57 Solve the Label Bias Problem Change the state-transition structure of the model: not always practical to change the set of states. Start with a fully-connected model and let the training procedure figure out a good structure: precludes the use of prior structural knowledge, which is very valuable (e.g. in information extraction).

58 Random Field

59 Conditional Random Fields (CRFs) CRFs have all the advantages of MEMMs without the label bias problem. MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state; CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence. Undirected acyclic graph. Allows some transitions to "vote" more strongly than others depending on the corresponding observations.

60 Definition of CRFs X is a random variable over data sequences to be labeled Y is a random variable over corresponding label sequences

61 Example of CRFs

62 Graphical comparison among HMMs, MEMMs and CRFs [Figure: the graphical structures of an HMM, an MEMM, and a CRF.]

63 Conditional Distribution x is a data sequence; y is a label sequence; v is a vertex from the vertex set V (the set of label random variables); e is an edge from the edge set E over V. f_k and g_k are given and fixed: g_k is a Boolean vertex feature, f_k is a Boolean edge feature, and k indexes the features; λ_k and μ_k are the parameters to be estimated. y|_e is the set of components of y defined by edge e; y|_v is the set of components of y defined by vertex v. If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, by the fundamental theorem of random fields, is p_θ(y | x) ∝ exp( Σ_{e∈E, k} λ_k f_k(e, y|_e, x) + Σ_{v∈V, k} μ_k g_k(v, y|_v, x) ).

64 Conditional Distribution (cont'd) CRFs use the observation-dependent normalization Z(x) for the conditional distributions: p_θ(y | x) = (1 / Z(x)) exp( Σ_{e∈E, k} λ_k f_k(e, y|_e, x) + Σ_{v∈V, k} μ_k g_k(v, y|_v, x) ), where Z(x) is a normalization over the data sequence x.
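
A brute-force Python sketch of the globally normalized CRF conditional for a short linear chain. The feature functions and weights are hypothetical, and a real implementation would compute Z(x) with dynamic programming rather than by enumerating all label sequences.

import math
from itertools import product

LABELS = ["O", "NAME"]

def score(y, x):
    """Unnormalized score: sum of weighted vertex and edge features."""
    s = 0.0
    for t, word in enumerate(x):
        s += 1.5 if (word[0].isupper() and y[t] == "NAME") else 0.0       # vertex feature
        if t > 0:
            s += 0.8 if (y[t - 1] == "NAME" and y[t] == "NAME") else 0.0  # edge feature
    return s

def crf_prob(y, x):
    z = sum(math.exp(score(cand, x)) for cand in product(LABELS, repeat=len(x)))
    return math.exp(score(y, x)) / z        # single global normalization Z(x)

x = ["John", "Smith", "runs"]
print(crf_prob(["NAME", "NAME", "O"], x))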

65 Parameter Estimation for CRFs The paper provided iterative scaling algorithms, which turn out to be very inefficient. Prof. Dietterich's group applied a gradient descent algorithm, which is quite efficient.

66 Training of CRFs (From Prof. Dietterich) First, we take the log of the equation; then we take the derivative of the log-likelihood. For training, the first two terms are easy to get: for each k, f_k evaluated along a labeled sequence is a sequence of Boolean values, so its empirical sum is just the total number of 1's in the sequence. The hardest part is how to calculate Z(x).
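
For a linear chain, Z(x) can be computed with the forward algorithm rather than by enumeration. A Python sketch follows; log_potential stands in for the clique score Σ_k λ_k f_k(y_{t-1}, y_t, x, t) and uses the same hypothetical features as the earlier CRF sketch, so the two agree.

import math

LABELS = ["O", "NAME"]

def log_potential(prev_y, y, x, t):
    # Hypothetical clique score sum_k lambda_k f_k(prev_y, y, x, t).
    s = 1.5 if (x[t][0].isupper() and y == "NAME") else 0.0
    if prev_y == "NAME" and y == "NAME":
        s += 0.8
    return s

def log_z(x):
    """Forward algorithm: log Z(x) over all label sequences for input x."""
    alpha = {y: log_potential(None, y, x, 0) for y in LABELS}
    for t in range(1, len(x)):
        alpha = {y: math.log(sum(math.exp(alpha[py] + log_potential(py, y, x, t))
                                 for py in LABELS)) for y in LABELS}
    return math.log(sum(math.exp(a) for a in alpha.values()))

print(log_z(["John", "Smith", "runs"]))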

67 Training of CRFs (From Prof. Dietterich) (cont'd) [Figure: maximal cliques c_1, c_2, c_3 over the labels y_1, y_2, y_3, y_4 of a linear chain.]

68 POS tagging Experiments

69 POS tagging Experiments (cont’d) Compared HMMs, MEMMs, and CRFs on Penn treebank POS tagging Each word in a given input sentence must be labeled with one of 45 syntactic tags Add a small set of orthographic features: whether a spelling begins with a number or upper case letter, whether it contains a hyphen, and if it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies oov = out-of-vocabulary (not observed in the training set)

70 Summary Locally normalized discriminative models such as MEMMs are prone to the label bias problem. CRFs provide the benefits of discriminative models, solve the label bias problem, and demonstrate good performance.