
Lecture 6: POS Tagging Methods
Topics: Taggers; Rule-Based Taggers; Probabilistic Taggers; Transformation-Based Taggers (Brill); Supervised Learning
Readings: Chapter 5.4-?
February 3, 2011, CSCE 771 Natural Language Processing

– 2 – CSCE 771 Spring 2011
Overview
Last Time: Overview of POS tags
Today: Part of speech tagging; Parts of speech; Rule-based taggers; Stochastic taggers; Transformational taggers
Readings: Chapter ?

– 3 – CSCE 771 Spring 2011
History of Tagging
Dionysius Thrax of Alexandria (circa 100 B.C.) wrote a "techne" which summarized the linguistic knowledge of the day
Terminology that is still used 2000 years later: syntax, diphthong, clitic, analogy
Also included eight "parts of speech", the basis of subsequent POS descriptions of Greek, Latin, and European languages:
Noun, Verb, Pronoun, Preposition, Adverb, Conjunction, Participle, Article

– 4 – CSCE 771 Spring 2011
History of Tagging
100 BC: Dionysius Thrax documents eight parts of speech
1959: Harris (U Penn) builds the first tagger, as part of the TDAP parser project
1963: Klein and Simmons' Computational Grammar Coder (CGC): small lexicon (1500 exceptional words), morphological analyzer, and context disambiguator
1971: Greene and Rubin's TAGGIT expands CGC: more tags (87) and a bigger dictionary; achieved 77% accuracy when applied to the Brown Corpus
1983: Marshall/Garside: CLAWS tagger, a probabilistic algorithm using tag bigram probabilities

– 5 – CSCE 771 Spring 2011
History of Tagging (continued)
1988: Church's PARTS tagger extends the CLAWS idea; stores P(tag | word) * P(tag | previous n tags) instead of the P(word | tag) * P(tag | previous n tags) used in HMM taggers
1992: Kupiec's HMM tagger
1994: Schütze and Singer: variable-length Markov models
1994: Jelinek/Magerman: decision trees for the probabilities
1996: Ratnaparkhi: the maximum entropy algorithm
1997: Brill: unsupervised version of the TBL algorithm

– 6 – CSCE 771 Spring 2011
POS Tagging
Words often have more than one POS: back
  The back door = JJ
  On my back = NN
  Win the voters back = RB
  Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word.

– 7 – CSCE 771 Spring 2011 How Hard is POS Tagging? Measuring Ambiguity

– 8 – CSCE 771 Spring 2011
Two Methods for POS Tagging
1. Rule-based tagging (ENGTWOL)
2. Stochastic: probabilistic sequence models
   HMM (Hidden Markov Model) tagging
   MEMMs (Maximum Entropy Markov Models)

– 9 – CSCE 771 Spring 2011
Rule-Based Tagging
Start with a dictionary
Assign all possible tags to words from the dictionary
Write rules by hand to selectively remove tags, leaving the correct tag for each word.

– 10 – CSCE 771 Spring 2011
Start With a Dictionary
she: PRP
promised: VBN, VBD
to: TO
back: VB, JJ, RB, NN
the: DT
bill: NN, VB
Etc… for the ~100,000 words of English with more than 1 tag

– 11 – CSCE 771 Spring 2011
Assign Every Possible Tag
She/PRP promised/VBN,VBD to/TO back/VB,JJ,RB,NN the/DT bill/NN,VB

– 12 – CSCE 771 Spring 2011
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows "PRP"
She/PRP promised/VBD to/TO back/VB,JJ,RB,NN the/DT bill/NN,VB   (VBN eliminated)
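The two steps above (assign every dictionary tag, then eliminate by rule) can be sketched in a few lines of Python. This is a toy illustration using the slide's mini-dictionary and just the one VBN-elimination rule; the real ENGTWOL system has many hand-written constraints.

```python
# Toy sketch of rule-based tag elimination (illustrative only).
lexicon = {
    "she": ["PRP"], "promised": ["VBN", "VBD"], "to": ["TO"],
    "back": ["VB", "JJ", "RB", "NN"], "the": ["DT"], "bill": ["NN", "VB"],
}

def assign_tags(words):
    """Step 1: assign every dictionary tag to each word."""
    return [list(lexicon[w.lower()]) for w in words]

def eliminate(words, candidates):
    """Step 2: eliminate VBN when VBD is also an option and the
    previous word is tagged PRP."""
    for i in range(1, len(words)):
        tags = candidates[i]
        if "VBN" in tags and "VBD" in tags and candidates[i - 1] == ["PRP"]:
            tags.remove("VBN")
    return candidates

sentence = "She promised to back the bill".split()
print(eliminate(sentence, assign_tags(sentence))[1])  # ['VBD']
```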

– 13 – CSCE 771 Spring 2011
Stage 1 of ENGTWOL Tagging
First Stage: Run words through an FST morphological analyzer to get all parts of speech.
Example: Pavlov had shown that salivation …
Pavlov     PAVLOV N NOM SG PROPER
had        HAVE V PAST VFIN SVO / HAVE PCP2 SVO
shown      SHOW PCP2 SVOO SVO SV
that       ADV / PRON DEM SG / DET CENTRAL DEM SG / CS
salivation N NOM SG

– 14 – CSCE 771 Spring 2011
Stage 2 of ENGTWOL Tagging
Second Stage: Apply NEGATIVE constraints.
Example: Adverbial "that" rule
Eliminates all readings of "that" except the one in "It isn't that odd"
Given input: "that"
If (+1 A/ADV/QUANT)   ; if next word is adj/adv/quantifier
   (+2 SENT-LIM)      ; following which is end-of-sentence
   (NOT -1 SVOC/A)    ; and the previous word is not a verb like "consider"
                      ; which allows adjective complements, as in "I consider that odd"
Then eliminate non-ADV tags
Else eliminate ADV

– 15 – CSCE 771 Spring 2011
Hidden Markov Model Tagging
Using an HMM to do POS tagging is a special case of Bayesian inference
Foundational work in computational linguistics:
  Bledsoe 1959: OCR
  Mosteller and Wallace 1964: authorship identification
It is also related to the "noisy channel" model that's the basis for ASR, OCR, and MT

– 16 – CSCE 771 Spring 2011
POS Tagging as Sequence Classification
We are given a sentence (an "observation" or "sequence of observations"):
  Secretariat is expected to race tomorrow
What is the best sequence of tags that corresponds to this sequence of observations?
Probabilistic view:
Consider all possible sequences of tags
Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w_1 … w_n

– 17 – CSCE 771 Spring 2011
Getting to HMMs
We want, out of all sequences of n tags t_1 … t_n, the single tag sequence such that P(t_1 … t_n | w_1 … w_n) is highest:
  t̂_1 … t̂_n = argmax_{t_1 … t_n} P(t_1 … t_n | w_1 … w_n)
Hat ^ means "our estimate of the best one"
argmax_x f(x) means "the x such that f(x) is maximized"

– 18 – CSCE 771 Spring 2011
Getting to HMMs
This equation is guaranteed to give us the best tag sequence, but how do we make it operational? How do we compute this value?
Intuition of Bayesian classification: use Bayes' rule to transform the equation into a set of other probabilities that are easier to compute

– 19 – CSCE 771 Spring 2011
Using Bayes Rule
  P(t_1 … t_n | w_1 … w_n) = P(w_1 … w_n | t_1 … t_n) P(t_1 … t_n) / P(w_1 … w_n)
Since P(w_1 … w_n) is the same for every candidate tag sequence, we can drop the denominator:
  t̂_1 … t̂_n = argmax P(w_1 … w_n | t_1 … t_n) P(t_1 … t_n)

– 20 – CSCE 771 Spring 2011
Likelihood and Prior
Likelihood: assume each word depends only on its own tag:
  P(w_1 … w_n | t_1 … t_n) ≈ ∏ᵢ P(w_i | t_i)
Prior: assume each tag depends only on the previous tag (bigram assumption):
  P(t_1 … t_n) ≈ ∏ᵢ P(t_i | t_{i-1})
So:
  t̂_1 … t̂_n ≈ argmax ∏ᵢ P(w_i | t_i) P(t_i | t_{i-1})

– 21 – CSCE 771 Spring 2011
Two Kinds of Probabilities
Tag transition probabilities p(t_i | t_{i-1})
Determiners likely to precede adjectives and nouns:
  That/DT flight/NN
  The/DT yellow/JJ hat/NN
So we expect P(NN|DT) and P(JJ|DT) to be high, but P(DT|JJ) to be low
Compute P(NN|DT) by counting in a labeled corpus:
  P(NN|DT) = C(DT, NN) / C(DT)

– 22 – CSCE 771 Spring 2011
Two Kinds of Probabilities
Word likelihood probabilities p(w_i | t_i)
VBZ (3sg pres verb) likely to be "is"
Compute P(is|VBZ) by counting in a labeled corpus:
  P(is|VBZ) = C(VBZ, is) / C(VBZ)
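Both kinds of probabilities are simple relative-frequency counts over a tagged corpus. A minimal sketch, assuming a toy hand-tagged corpus (the sentences below are invented for illustration):

```python
from collections import Counter

# Toy hand-tagged corpus (invented, for illustration only).
corpus = [
    [("the", "DT"), ("flight", "NN")],
    [("the", "DT"), ("yellow", "JJ"), ("hat", "NN")],
    [("that", "DT"), ("flight", "NN"), ("is", "VBZ"), ("late", "JJ")],
]

tag_bigrams, tag_counts, emissions = Counter(), Counter(), Counter()
for sent in corpus:
    tags = ["<s>"] + [t for _, t in sent]
    tag_bigrams.update(zip(tags, tags[1:]))          # C(t_{i-1}, t_i)
    for word, tag in sent:
        tag_counts[tag] += 1                         # C(t)
        emissions[(tag, word)] += 1                  # C(t, w)

def p_transition(t_prev, t):
    """P(t | t_prev) = C(t_prev, t) / C(t_prev)."""
    total = sum(c for (a, _), c in tag_bigrams.items() if a == t_prev)
    return tag_bigrams[(t_prev, t)] / total

def p_emission(tag, word):
    """P(word | tag) = C(tag, word) / C(tag)."""
    return emissions[(tag, word)] / tag_counts[tag]

print(round(p_transition("DT", "NN"), 3))  # 0.667: DT precedes NN 2 of 3 times
print(p_emission("VBZ", "is"))             # 1.0
```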

– 23 – CSCE 771 Spring 2011
Example: The Verb "race"
  Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
  People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
How do we pick the right tag?

– 24 – CSCE 771 Spring 2011 Disambiguating “race”

– 25 – CSCE 771 Spring 2011
Example
P(NN|TO) = .00047      P(VB|TO) = .83
P(race|NN) = .00057    P(race|VB) = .00012
P(NR|VB) = .0027       P(NR|NN) = .0012
P(VB|TO) P(NR|VB) P(race|VB) = .00000027
P(NN|TO) P(NR|NN) P(race|NN) = .00000000032
So we (correctly) choose the verb reading.
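As a sanity check, the two products can be verified in a couple of lines, using the probability values from Jurafsky & Martin's version of this example:

```python
# Checking the slide's arithmetic for the two readings of "race".
p_vb = 0.83 * 0.0027 * 0.00012      # P(VB|TO) * P(NR|VB) * P(race|VB)
p_nn = 0.00047 * 0.0012 * 0.00057   # P(NN|TO) * P(NR|NN) * P(race|NN)
print(f"{p_vb:.2e}")   # 2.69e-07
print(f"{p_nn:.2e}")   # 3.21e-10
print(p_vb > p_nn)     # True: the verb reading wins by about three orders of magnitude
```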

– 26 – CSCE 771 Spring 2011 Hidden Markov Models What we’ve described with these two kinds of probabilities is a Hidden Markov Model (HMM)

– 27 – CSCE 771 Spring 2011
Definitions
A weighted finite-state automaton adds probabilities to the arcs
  The probabilities on the arcs leaving any state must sum to one
A Markov chain is a special case of a weighted automaton in which the input sequence uniquely determines which states the automaton will go through
  Markov chains can't represent inherently ambiguous problems
  Useful for assigning probabilities to unambiguous sequences

– 28 – CSCE 771 Spring 2011 Markov Chain for Weather

– 29 – CSCE 771 Spring 2011 Markov Chain for Words

– 30 – CSCE 771 Spring 2011
Markov Chain: "First-order observable Markov Model"
A set of states Q = q_1, q_2, …, q_N; the state at time t is q_t
Transition probabilities: a set of probabilities A = a_01, a_02, …, a_n1, …, a_nn
  Each a_ij represents the probability of transitioning from state i to state j
  The set of these is the transition probability matrix A
Current state only depends on previous state:
  P(q_i | q_1 … q_{i-1}) = P(q_i | q_{i-1})

– 31 – CSCE 771 Spring 2011
Markov Chain for Weather
What is the probability of 4 consecutive rainy days?
Sequence is rainy-rainy-rainy-rainy, i.e., state sequence is 3-3-3-3
  P(3,3,3,3) = π_3 a_33 a_33 a_33 = 0.2 × (0.6)³ = 0.0432
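The same computation in code: multiply the initial probability of the first state by the chain of transition probabilities. Only the entries needed for this example are filled in; the full weather chain has more states.

```python
# Probability of a state sequence in a Markov chain.
def sequence_prob(pi, a, seq):
    p = pi[seq[0]]                       # initial probability
    for prev, cur in zip(seq, seq[1:]):  # one factor per transition
        p *= a[prev][cur]
    return p

pi = {"rainy": 0.2}                 # initial probability of rainy
a = {"rainy": {"rainy": 0.6}}       # P(rainy | rainy)
print(round(sequence_prob(pi, a, ["rainy"] * 4), 4))  # 0.0432
```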

– 32 – CSCE 771 Spring 2011
HMM for Ice Cream
You are a climatologist in the year 2799, studying global warming
You can't find any records of the weather in Baltimore, MD for the summer of 2007
But you find Jason Eisner's diary, which lists how many ice creams Jason ate every day that summer
Our job: figure out how hot it was

– 33 – CSCE 771 Spring 2011
Hidden Markov Model
For Markov chains, the output symbols are the same as the states
  See hot weather: we're in state hot
But in part-of-speech tagging (and other things):
  The output symbols are words, but the hidden states are part-of-speech tags
So we need an extension!
A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states. This means we don't know which state we are in.

– 34 – CSCE 771 Spring 2011
Hidden Markov Models
States: Q = q_1, q_2, …, q_N
Observations: O = o_1, o_2, …, o_N
  Each observation is a symbol from a vocabulary V = {v_1, v_2, …, v_V}
Transition probabilities: transition probability matrix A = {a_ij}
Observation likelihoods: output probability matrix B = {b_i(k)}
Special initial probability vector π
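These components can be written out concretely for the ice-cream model as plain Python data. The probability values below are illustrative assumptions, not necessarily the ones used on the slides:

```python
# The HMM components above, instantiated for the ice-cream model.
states = ["HOT", "COLD"]                        # Q
vocab = [1, 2, 3]                               # V: ice creams eaten per day
pi = {"HOT": 0.8, "COLD": 0.2}                  # initial probability vector
A = {"HOT": {"HOT": 0.6, "COLD": 0.4},          # transition matrix a_ij
     "COLD": {"HOT": 0.5, "COLD": 0.5}}
B = {"HOT": {1: 0.2, 2: 0.4, 3: 0.4},           # observation likelihoods b_i(k)
     "COLD": {1: 0.5, 2: 0.4, 3: 0.1}}

# Sanity check: every probability distribution sums to one.
assert abs(sum(pi.values()) - 1) < 1e-9
assert all(abs(sum(row.values()) - 1) < 1e-9 for row in A.values())
assert all(abs(sum(row.values()) - 1) < 1e-9 for row in B.values())
print("all distributions sum to 1")
```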

– 35 – CSCE 771 Spring 2011
Eisner Task
Given: ice cream observation sequence 1, 2, 3, 2, 2, 2, 3, …
Produce: weather sequence H, C, H, H, H, C, …

– 36 – CSCE 771 Spring 2011 HMM for Ice Cream

– 37 – CSCE 771 Spring 2011 Transition Probabilities

– 38 – CSCE 771 Spring 2011 Observation Likelihoods

– 39 – CSCE 771 Spring 2011
Decoding
OK, now we have a complete model that can give us what we need.
Recall that we need to get:
  t̂_1 … t̂_n = argmax ∏ᵢ P(w_i | t_i) P(t_i | t_{i-1})
We could just enumerate all paths given the input and use the model to assign probabilities to each. Not a good idea.
Luckily, dynamic programming (last seen in Ch. 3 with minimum edit distance) helps us here

– 40 – CSCE 771 Spring 2011 The Viterbi Algorithm

– 41 – CSCE 771 Spring 2011 Viterbi Example

– 42 – CSCE 771 Spring 2011
Viterbi Summary
Create an array with columns corresponding to inputs and rows corresponding to possible states
Sweep through the array in one pass, filling the columns left to right using our transition probabilities and observation probabilities
Dynamic programming key: we need only store the MAX-probability path to each cell (not all paths)
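The summary above can be sketched compactly on the ice-cream HMM. The parameter values here are illustrative assumptions, not necessarily those on the slides; the point is the column-by-column sweep with one max-probability cell per state.

```python
# Compact Viterbi sketch for the ice-cream HMM (illustrative parameters).
states = ["HOT", "COLD"]
pi = {"HOT": 0.8, "COLD": 0.2}                      # initial probabilities
trans = {"HOT": {"HOT": 0.6, "COLD": 0.4},          # transition probabilities
         "COLD": {"HOT": 0.5, "COLD": 0.5}}
emit = {"HOT": {1: 0.2, 2: 0.4, 3: 0.4},            # observation likelihoods
        "COLD": {1: 0.5, 2: 0.4, 3: 0.1}}

def viterbi(obs):
    # One column per observation, one row per state; each cell stores
    # only the max-probability path into it plus a backpointer.
    v = [{s: (pi[s] * emit[s][obs[0]], None) for s in states}]
    for o in obs[1:]:
        col = {}
        for s in states:
            prev = max(states, key=lambda p: v[-1][p][0] * trans[p][s])
            col[s] = (v[-1][prev][0] * trans[prev][s] * emit[s][o], prev)
        v.append(col)
    # Follow backpointers from the best final state.
    path = [max(states, key=lambda s: v[-1][s][0])]
    for col in reversed(v[1:]):
        path.append(col[path[-1]][1])
    return path[::-1]

print(viterbi([3, 1, 3]))  # ['HOT', 'COLD', 'HOT']
```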

– 43 – CSCE 771 Spring 2011
Evaluation
So once you have your POS tagger running, how do you evaluate it?
  Overall error rate with respect to a gold-standard test set
  Error rates on particular tags
  Error rates on particular words
  Tag confusions...

– 44 – CSCE 771 Spring 2011
Error Analysis
Look at a confusion matrix to see what errors are causing problems:
  Noun (NN) vs. Proper Noun (NNP) vs. Adjective (JJ)
  Preterite (VBD) vs. Participle (VBN) vs. Adjective (JJ)
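Both the error rate and the confusion counts are easy to compute from paired gold/predicted tag sequences; a minimal sketch on invented toy data:

```python
from collections import Counter

# Minimal evaluation sketch: accuracy plus confusion counts against a
# gold standard (toy tag sequences, illustrative only).
gold = ["NN", "VBD", "JJ", "NN", "NNP", "VBN"]
pred = ["NN", "VBN", "JJ", "JJ", "NNP", "VBN"]

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
confusions = Counter((g, p) for g, p in zip(gold, pred) if g != p)

print(round(accuracy, 3))          # 0.667: 4 of 6 tags correct
print(confusions.most_common())    # (gold, predicted) error pairs
```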

– 45 – CSCE 771 Spring 2011
Evaluation
The result is compared with a manually coded "Gold Standard"
  Typically accuracy reaches 96-97%
  This may be compared with the result for a baseline tagger (one that uses no context)
Important: 100% is impossible even for human annotators

– 46 – CSCE 771 Spring 2011
Summary
Parts of speech
Tagsets
Part of speech tagging
HMM tagging
Markov chains
Hidden Markov Models