Albert Gatt Corpora and Statistical Methods Lecture 10

POS Tagging (continued), Part 1

Transformation-based error-driven learning

Transformation-based learning
Approach proposed by Brill (1995):
- uses quantitative information at the training stage
- the outcome of training is a set of rules
- tagging is then symbolic, using the rules
Components:
- a set of transformation rules
- a learning algorithm

Transformations
General form: t1 → t2, i.e. “replace t1 with t2 if certain conditions are satisfied”. Examples:
- Morphological: change the tag from NN to NNS if the word has the suffix “s”: dogs_NN → dogs_NNS
- Syntactic: change the tag from NN to VB if the word occurs after “to”: to_TO go_NN → to_TO go_VB
- Lexical: change the tag to JJ if deleting the prefix “un” results in a word: uncool_XXX → uncool_JJ, but uncle_NN -/-> uncle_JJ
A rule of this kind can be encoded very directly (see the sketch below).
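The following is a minimal sketch, not Brill's implementation, of how such a rule could be represented and applied; the class and function names are illustrative. It applies the rule left to right against the current tag sequence (the question of immediate vs. delayed application is taken up a few slides below).

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Transformation:
    """Rewrite `from_tag` as `to_tag` where `condition(tags, words, i)` holds."""
    from_tag: str
    to_tag: str
    condition: Callable  # condition(tags, words, i) -> bool

def apply_rule(rule, words, tags):
    """Apply one transformation, scanning the sequence left to right."""
    tags = list(tags)
    for i, tag in enumerate(tags):
        if tag == rule.from_tag and rule.condition(tags, words, i):
            tags[i] = rule.to_tag
    return tags

# Example: "change NN to VB if the preceding word is tagged TO"
nn_to_vb = Transformation(
    from_tag="NN", to_tag="VB",
    condition=lambda tags, words, i: i > 0 and tags[i - 1] == "TO",
)

print(apply_rule(nn_to_vb, ["to", "go"], ["TO", "NN"]))  # ['TO', 'VB']
```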

Learning
- Input: unannotated text.
- Initial-state annotator: e.g. assign each word its most frequent tag in a dictionary.
- Truth: a manually annotated version of the corpus against which to compare.
- Learner: learns rules by comparing the initial state to the truth.
- Output: rules.

Learning algorithm
Simple iterative process:
- apply a rule to the corpus
- compare to the truth
- if the error rate is reduced, keep the results
A priori specifications:
- how the initial-state annotator works
- the space of possible transformations (Brill (1995) used a set of templates)
- the function to compare the result of applying the rules to the truth
(A greedy version of this loop is sketched below.)
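A minimal sketch of the greedy training loop, assuming a hypothetical `generate_candidates` helper that instantiates rules from the templates, and reusing `apply_rule` from the sketch above; this is an illustration, not Brill's optimised algorithm.

```python
def errors(tags, truth):
    """Number of tags that disagree with the manually annotated truth."""
    return sum(t != g for t, g in zip(tags, truth))

def train_tbl(words, truth, initial_tags, generate_candidates, max_rules=50):
    """Greedy TBL: repeatedly keep the rule that most reduces the error rate."""
    tags = list(initial_tags)
    learned = []
    for _ in range(max_rules):
        best_rule, best_gain = None, 0
        for rule in generate_candidates(words, tags, truth):
            new_tags = apply_rule(rule, words, tags)   # from the sketch above
            gain = errors(tags, truth) - errors(new_tags, truth)
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None:        # no rule reduces the error rate: stop
            break
        tags = apply_rule(best_rule, words, tags)
        learned.append(best_rule)    # rules are stored in the order learned
    return learned
```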

Non-lexicalised rule templates
Take only tags into account, not the shape of words.
Change tag a to tag b when:
1. The preceding (following) word is tagged z.
2. The word two before (after) is tagged z.
3. One of the three preceding (following) words is tagged z.
4. The preceding (following) word is tagged z and the word two before (after) is tagged w.
5. …

Lexicalised rule templates
Take into account specific words in the context.
Change tag a to tag b when:
1. The preceding (following) word is w.
2. The word two before (after) is w.
3. The current word is w, the preceding (following) word is w2 and the preceding (following) tag is t.
4. …

Morphological rule templates
Useful for completely unknown words; sensitive to the word’s “shape”.
Change the tag of an unknown word (from X) to Y if:
1. Deleting the prefix (suffix) x, |x| ≤ 4, results in a word.
2. The first (last) (1,2,3,4) characters of the word are x.
3. Adding the character string x as a prefix (suffix) results in a word (|x| ≤ 4).
4. Word w ever appears immediately to the left (right) of the word.
5. Character z appears in the word.
6. …

Order-dependence of rules
Rules are triggered by environments satisfying their conditions, e.g. “A → B if the preceding tag is A”. Suppose our sequence is “AAAA”. Two possible forms of rule application:
- immediate effect: applications of the same transformation can influence each other; result: ABAB
- delayed effect: the rule is triggered multiple times from the same initial input; result: ABBB. Brill (1995) opts for this solution.
(The sketch below illustrates the difference.)
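A small, self-contained illustration of the two application modes on the “AAAA” example:

```python
def apply_immediate(tags):
    """Immediate effect: each application sees the changes already made."""
    tags = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == "A" and tags[i - 1] == "A":
            tags[i] = "B"
    return "".join(tags)

def apply_delayed(tags):
    """Delayed effect: triggering environments are read off the original
    input, and all changes are then applied at once."""
    original, new = list(tags), list(tags)
    for i in range(1, len(original)):
        if original[i] == "A" and original[i - 1] == "A":
            new[i] = "B"
    return "".join(new)

print(apply_immediate("AAAA"))  # ABAB
print(apply_delayed("AAAA"))    # ABBB
```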

More on transformation-based tagging
Can be used for unsupervised learning:
- as with unsupervised HMM-based tagging, the only information available is the set of allowable tags for each word
- takes advantage of the fact that most words have only one tag
- e.g. the word can = NN in the context AT ___ BEZ, because most other words in this context are NN; therefore, the learning algorithm would learn the rule “change tag to NN in context AT ___ BEZ”
The unsupervised method achieves 95.6% accuracy!!

Maximum Entropy models and POS Tagging

Limitations of HMMs
An HMM tagger relies on:
- P(tag | previous tag)
- P(word | tag)
- these are combined by multiplication
TBL includes many other useful features which are hard to model in an HMM: prefixes, suffixes, capitalisation, …
Can we combine both, i.e. have HMM-style tagging with multiple features?

The rationale
In order to tag a word, we consider its context or “history” h. We want to estimate a probability distribution p(h, t) from sparse data.
- h is encoded in terms of features (e.g. morphological features, surrounding tag features, etc.)
- there are some constraints on these features that we discover from the training data
- we want our model to make the fewest possible assumptions beyond these constraints

Motivating example
Suppose we wanted to tag the word w. Assume we have a set T of 45 different tags: T = {NN, JJ, NNS, NNP, VVS, VB, …}
The probabilistic tagging model that makes the fewest assumptions assigns a uniform distribution over the tags:
p(t) = 1/|T| = 1/45 for every t ∈ T

Motivating example
Suppose we find that the possible tags for w are NN, JJ, NNS, VB. We therefore impose our first constraint on the model:
p(NN) + p(JJ) + p(NNS) + p(VB) = 1
(and the probability of every other tag is 0)
The simplest model satisfying this constraint:
p(NN) = p(JJ) = p(NNS) = p(VB) = 1/4

Motivating example
We suddenly discover that w is tagged as NN or NNS 8 out of 10 times. The model now has two constraints:
p(NN) + p(JJ) + p(NNS) + p(VB) = 1
p(NN) + p(NNS) = 8/10
Again, we require our model to make no further assumptions. The simplest distribution leaves the probabilities for all tags except NN/NNS equal:
p(NN) = 4/10, p(NNS) = 4/10, p(JJ) = 1/10, p(VB) = 1/10

Motivating example
We suddenly discover that verbs (VB) occur 1 in every 20 words. The model now has three constraints:
p(NN) + p(JJ) + p(NNS) + p(VB) = 1
p(NN) + p(NNS) = 8/10
p(VB) = 1/20
The simplest distribution is now:
p(NN) = 4/10, p(NNS) = 4/10, p(JJ) = 3/20, p(VB) = 1/20
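A quick numerical check, purely as an illustration (not part of the lecture materials): among distributions that satisfy the three constraints, the one above has the highest entropy. The alternative distribution below is an arbitrary example that also satisfies the constraints but is less even.

```python
from math import log2

def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# the maximum-entropy solution from the slide
p_maxent = {"NN": 4/10, "NNS": 4/10, "JJ": 3/20, "VB": 1/20}

# another distribution satisfying the constraints
# (sums to 1, NN + NNS = 8/10, VB = 1/20), but less even
p_other = {"NN": 6/10, "NNS": 2/10, "JJ": 3/20, "VB": 1/20}

print(entropy(p_maxent))  # ≈ 1.68 bits
print(entropy(p_other))   # ≈ 1.53 bits (lower, as expected)
```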

What we’ve been doing Maximum entropy builds a distribution by continuously adding features. Each feature picks out a subset of the training observations. For each feature, we add a constraint on our total distribution. Our task is then to find the best distribution given the constraints.

Features for POS Tagging
Each tagging decision for a word occurs in a specific context or “history” h.
For tagging, we consider as context:
- the word itself
- morphological properties of the word
- other words surrounding the word
- previous tags
For each relevant aspect of the context h_i, we can define a feature f_j that allows us to learn how well that aspect is associated with a tag t_i.
The probability of a tag given a context is a weighted function of the features.

Features for POS Tagging
In a maximum entropy model, this information is captured by a binary or indicator feature.
Each feature f_i has a weight α_i reflecting its importance.
NB: each α_i is uniquely associated with a feature.
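For concreteness, here is one possible indicator feature, written to match the “moving”/VBG feature used later in the training example; the dictionary-based history representation is an assumption for illustration, not Ratnaparkhi's data structures.

```python
def f_moving_vbg(history, tag):
    """Indicator feature: 1 if the current word is "moving" and the
    candidate tag is VBG, 0 otherwise."""
    return 1 if history["word"] == "moving" and tag == "VBG" else 0

h1 = {"word": "moving", "prev_tag": "VBZ"}
h2 = {"word": "dog", "prev_tag": "DT"}

print(f_moving_vbg(h1, "VBG"))  # 1
print(f_moving_vbg(h1, "NN"))   # 0
print(f_moving_vbg(h2, "VBG"))  # 0
```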

Features for POS Tagging in Ratnaparkhi (1996)
Ratnaparkhi (1996) had three sets of features: for non-rare words, for rare words, and for all words.

Features for POS Tagging
Given the large number of possible features, which ones will be part of the model?
- We do not want redundant features.
- We do not want unreliable and rarely occurring features (avoid overfitting).
Ratnaparkhi (1996) used only those features which occur 10 times or more in the training data.

The form of the model
Features f_j and their parameters are used to compute the probability p(h_i, t_i):
p(h_i, t_i) = (1/Z) Π_j α_j^f_j(h_i, t_i)
where j ranges over the features and Z is a normalisation constant.
Taking logs transforms this into a linear equation:
log p(h_i, t_i) = Σ_j f_j(h_i, t_i) log α_j − log Z
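A sketch of how these quantities could be computed for a small, fixed tag set; the feature, its weight, and the history representation are illustrative assumptions.

```python
from math import prod

TAGS = ["NN", "NNS", "JJ", "VB", "VBG"]

def f_moving_vbg(history, tag):
    """Indicator feature: current word is "moving" and the tag is VBG."""
    return 1 if history["word"] == "moving" and tag == "VBG" else 0

def score(features, alphas, history, tag):
    """Unnormalised score: the product of alpha_j ** f_j(h, t)."""
    return prod(a ** f(history, tag) for f, a in zip(features, alphas))

def p_tag_given_history(features, alphas, history, tag):
    """Conditional probability p(t | h), obtained by normalising the
    scores over all candidate tags (see the next slide)."""
    z = sum(score(features, alphas, history, t) for t in TAGS)
    return score(features, alphas, history, tag) / z

features = [f_moving_vbg]
alphas = [3.0]   # an assumed weight, for illustration only

h = {"word": "moving", "prev_tag": "VBZ"}
print(p_tag_given_history(features, alphas, h, "VBG"))  # 3/7 ≈ 0.43
```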

Conditional probabilities
The conditional probabilities can be computed from the joint probabilities:
p(t | h) = p(h, t) / Σ_t′ p(h, t′)
Probability of a sequence of tags given a sequence of words:
p(t_1, …, t_n | w_1, …, w_n) ≈ Π_i p(t_i | h_i)
NB: unlike an HMM, we have one probability here:
- we directly estimate p(t | h)
- the model combines all the features in h_i into a single estimate
- no limit in principle on what features we can take into account

The use of constraints
Every feature we have imposes a constraint or expectation on the probability model. We want:
E_p[f_j] = E_p̃[f_j]
where:
- E_p[f_j] is the model p’s expectation of f_j
- E_p̃[f_j] is the empirical expectation of f_j, i.e. its expectation under the relative frequencies p̃ observed in the training data

Why maximum entropy?
Recall that entropy is a measure of the uncertainty in a distribution.
Without any knowledge, the simplest distribution is uniform, and uniform distributions have the highest entropy.
As we add constraints, the MaxEnt principle dictates that we find the simplest model p* satisfying the constraints:
p* = argmax_{p ∈ P} H(p)
where P is the set of possible distributions consistent with the constraints E_p[f_j] = E_p̃[f_j].
p* is unique and has the form given earlier.
Basically, an application of Occam’s razor: make no further assumptions than necessary.

Training a MaxEnt model

Training 1: computing the empirical expectation
Recall that:
E_p̃[f_j] = Σ_(h,t) p̃(h, t) f_j(h, t)
Suppose we are interested in the feature:
f_j(h, t) = 1 if the current word is “moving” and t = VBG, 0 otherwise
In a corpus of 10k words + tags, where the word moving occurs as VBG 20 times:
E_p̃[f_j] = 20/10000 = 0.002
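A minimal sketch of this computation; the corpus is represented, as a simplifying assumption, by a list of (word, tag) pairs and the history by a small dictionary.

```python
def f_moving_vbg(history, tag):
    """Indicator feature: current word is "moving" and the tag is VBG."""
    return 1 if history["word"] == "moving" and tag == "VBG" else 0

def empirical_expectation(feature, tagged_corpus):
    """Average value of the feature over the observed (history, tag) pairs."""
    n = len(tagged_corpus)
    return sum(feature({"word": w}, t) for w, t in tagged_corpus) / n

# toy corpus: 10,000 tokens, 20 of which are "moving" tagged as VBG
corpus = [("moving", "VBG")] * 20 + [("the", "DT")] * 9980
print(empirical_expectation(f_moving_vbg, corpus))  # 0.002
```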

Training 2: computing the model expectation
Recall that:
E_p[f_j] = Σ_(h,t) p(h, t) f_j(h, t)
This requires a sum over all possible histories and tags!
Approximate it by computing the model’s expectation of the feature on the training data only:
E_p[f_j] ≈ (1/N) Σ_i Σ_t p(t | h_i) f_j(h_i, t)
where i ranges over the N training instances.

Learning the optimal parameters
Our goal is to learn the best parameter α_j for each feature f_j, such that the model’s expectation of the feature matches the empirical expectation, i.e.:
E_p[f_j] = E_p̃[f_j] for every feature f_j
One method is Generalised Iterative Scaling.

Generalised Iterative Scaling: assumptions
For all (h, t), the features sum to a constant value:
Σ_j f_j(h, t) = C
If this is not the case, we set C to:
C = max_(h,t) Σ_j f_j(h, t)
and add a filler feature f_l, such that:
f_l(h, t) = C − Σ_j f_j(h, t)

Generalised Iterative Scaling: assumptions (II)
For all (h, t), there is at least one feature f which is active, i.e.:
there is some j such that f_j(h, t) > 0

Generalised Iterative Scaling
Input: features f_1, …, f_n and the empirical distribution p̃
Output: optimal parameter values α_1, …, α_n
1. Initialise α_i = 1 for all i ∈ {1, 2, …, n} (equivalently, log α_i = 0)
2. For each i do:
   a. compute the model expectation E_p[f_i] under the current parameters
   b. set α_i ← α_i · (E_p̃[f_i] / E_p[f_i])^(1/C)
3. If the model has not converged, repeat from (2)
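A compact sketch of GIS for the conditional model, under simplifying assumptions: model expectations are computed on the observed training histories only (as on the previous slides), the feature set is assumed to already include the filler feature so that the features sum to C, and convergence is replaced by a fixed iteration count. This is an illustration, not Ratnaparkhi's implementation.

```python
from math import prod

def gis(features, data, tags, C, iterations=100):
    """Generalised Iterative Scaling for p(t | h).
    features : list of functions f(history, tag) -> 0/1
    data     : list of (history, gold_tag) training pairs
    tags     : the tag set
    C        : constant sum of feature values (filler feature included)
    """
    n = len(data)
    alphas = [1.0] * len(features)                     # step 1

    # empirical expectations (fixed throughout training)
    emp = [sum(f(h, t) for h, t in data) / n for f in features]

    def p_cond(h, t):
        """p(t | h) under the current parameters."""
        def score(tag):
            return prod(a ** f(h, tag) for f, a in zip(features, alphas))
        return score(t) / sum(score(tag) for tag in tags)

    for _ in range(iterations):                        # step 3
        # step 2a: model expectations on the training histories
        model = [
            sum(p_cond(h, t) * f(h, t) for h, _ in data for t in tags) / n
            for f in features
        ]
        # step 2b: multiplicative update
        alphas = [
            a * (e / m) ** (1.0 / C) if m > 0 else a
            for a, e, m in zip(alphas, emp, model)
        ]
    return alphas
```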

Finding tag sequences with MaxEnt models

Tagging sequences
We want to tag a sequence w_1, …, w_n. This can be decomposed into:
p(t_1, …, t_n | w_1, …, w_n) ≈ Π_i p(t_i | h_i)
The history h_i consists of the words w_1, …, w_(i−1) and the previous tags t_1, …, t_(i−1).

Finding the best tag sequence: beam search (Ratnaparkhi, 1996)
Goal: find the best sequence of n tags, keeping a beam of the N highest-probability partial sequences.
s_ij = the j-th highest probability tag sequence up to word i.
1. Generate all tags for w_1:
   a. find the top N tags
   b. set s_1j for 1 ≤ j ≤ N
2. For i = 2 to n do:
   a. for j = 1 to N do:
      i. generate tags for w_i given s_(i−1)j
      ii. append each tag to s_(i−1)j to create a new sequence
   b. find the N highest probability sequences generated by loop 2a
3. Return s_n1
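A sketch of this beam search, assuming a `p_next` function that returns p(tag | history) for the candidate tags at position i; `p_next` stands in for the MaxEnt model, and the toy version below is only there to reproduce the worked example that follows.

```python
def beam_search(words, p_next, N=3):
    """Return the most probable tag sequence found with a beam of size N.
    p_next(words, i, prev_tags) -> {tag: p(tag | history)}."""
    # step 1: the top-N single-tag sequences for the first word
    first = p_next(words, 0, [])
    beam = sorted(((p, [t]) for t, p in first.items()), reverse=True)[:N]

    # step 2: extend every kept sequence by one tag, then re-prune to the N best
    for i in range(1, len(words)):
        candidates = []
        for prob, seq in beam:
            for t, p in p_next(words, i, seq).items():
                candidates.append((prob * p, seq + [t]))
        beam = sorted(candidates, reverse=True)[:N]

    # step 3: return s_n1, the highest-probability complete sequence
    return beam[0][1]

# toy model for the worked example: the "correct" tag is always most likely
def toy_p_next(words, i, prev_tags):
    best = {"a": "A", "b": "B", "c": "C"}[words[i]]
    return {t: (0.8 if t == best else 0.1) for t in ["A", "B", "C"]}

print(beam_search(["a", "b", "c"], toy_p_next, N=1))  # ['A', 'B', 'C']
```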

Worked example
Suppose our data consists of the sequence: a, b, c
Assume the correct tags are A, B, C.
Assume that N = 1 (i.e. we only ever keep the single most likely tag sequence at each step).

Worked Example
a: s_11 = ?   b: s_21 = ?   c: s_31 = ?
Step 1: generate all possible tags for a: A, B, C; then find the most likely tag for a: A

Worked Example
a: s_11 = A   b: s_21 = ?   c: s_31 = ?
Step 2: generate all possible tags for b: A, B, C; merge with s_11: A-A, A-B, A-C; find the most likely sequence: A-B

Worked Example
a: s_11 = A   b: s_21 = A-B   c: s_31 = ?
Step 3: generate all possible tags for c: A, B, C; merge with s_21: A-B-A, A-B-B, A-B-C; find the most likely: A-B-C

Worked Example
a: s_11 = A   b: s_21 = A-B   c: s_31 = A-B-C
Return s_31 (= A-B-C)

Markov Models vs. MaxEnt

HMM vs MaxEnt
Standard HMMs cannot compute the conditional probability directly. E.g. for tagging:
- we want p(t_1,n | w_1,n)
- we obtain it via Bayes’ rule, combining p(w_1,n | t_1,n) with the prior p(t_1,n)
HMMs are generative models which optimise p(w_1,n | t_1,n).
By contrast, a MaxEnt Markov Model (MEMM) is a discriminative model which optimises p(t_1,n | w_1,n) directly.

Graphically (after Jurafsky & Martin 2009):
- the HMM has separate models for P(w|t) and for P(t)
- the MEMM has a single model to estimate P(t|w)

More formally…
With an HMM, the best tag sequence is:
argmax_(t_1,n) P(t_1,n) P(w_1,n | t_1,n) = argmax_(t_1,n) Π_i P(w_i | t_i) P(t_i | t_(i−1))
With a MEMM, the best tag sequence is:
argmax_(t_1,n) P(t_1,n | w_1,n) = argmax_(t_1,n) Π_i P(t_i | w_i, t_(i−1))

Adapting Viterbi
We can easily adapt the Viterbi algorithm to find the best state sequence in a MEMM.
Recall that with HMMs the recursion is:
v_t(j) = max_i v_(t−1)(i) · a_ij · b_j(o_t)
Adaptation for MEMMs:
v_t(j) = max_i v_(t−1)(i) · P(s_j | s_i, o_t)
(see the sketch below)
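A sketch of Viterbi decoding for a MEMM, assuming a function `p_trans(prev_tag, obs)` that returns the MaxEnt model's distribution P(tag | prev_tag, obs); the function name and the start symbol are illustrative assumptions.

```python
def memm_viterbi(observations, tags, p_trans, start="<s>"):
    """Viterbi for a MEMM: v_t(j) = max_i v_(t-1)(i) * P(j | i, o_t).
    p_trans(prev_tag, obs) -> {tag: probability}."""
    # initialisation: transition out of the start symbol
    v = [{t: (p_trans(start, observations[0])[t], [t]) for t in tags}]

    for obs in observations[1:]:
        column = {}
        for j in tags:
            # pick the best predecessor state i for state j at this step
            best_p, best_path = max(
                (v[-1][i][0] * p_trans(i, obs)[j], v[-1][i][1] + [j])
                for i in tags
            )
            column[j] = (best_p, best_path)
        v.append(column)

    # highest-probability complete path
    return max(v[-1].values())[1]
```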

Summary
MaxEnt is a powerful classification model with some advantages over HMMs:
- direct computation of conditional probabilities from the training data
- can handle multiple features
First introduced for POS tagging by Ratnaparkhi (1996); used for many other applications since then.