Word classes and part of speech tagging (Chapter 5)

Slide 1 Outline
– Tag sets and problem definition
– Automatic approaches 1: rule-based tagging
– Automatic approaches 2: stochastic tagging
– In Part 2: finish stochastic tagging, then continue on to evaluation

Slide 2 An Example
WORD:  the girl kissed the boy on the cheek
LEMMA: the girl kiss the boy on the cheek
TAG:   +DET +NOUN +VPAST +DET +NOUN +PREP +DET +NOUN

Slide 3 Word Classes: Tag Sets
– Tag sets vary in the number of tags: from about a dozen to over 200
– The size of a tag set depends on the language, the objectives, and the purpose of the annotation

Slide 4 Word Classes: Tag set example
Penn Treebank tags, e.g. PRP (personal pronoun), PRP$ (possessive pronoun)

Slide 5 Example of Penn Treebank Tagging of Brown Corpus Sentence
– The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
– Book that flight .  →  VB DT NN .
– Does that flight serve dinner ?  →  VBZ DT NN VB NN ?
– See: Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo

Slide 6 The Problem
Words often have more than one word class, e.g. "this":
– This is a nice day  →  PRP
– This day is nice  →  DT
– You can go this far  →  RB

Slide 7 Word Class Ambiguity (in the Brown Corpus)
– Unambiguous (1 tag): 35,340
– Ambiguous (2–7 tags): 4,100
    2 tags: 3,760
    3 tags: 264
    4 tags: 61
    5 tags: 12
    6 tags: 2
    7 tags: 1
(DeRose, 1988)

Slide 8 Part-of-Speech Tagging
– Rule-Based Tagger: ENGTWOL (ENGlish TWO Level analysis)
– Stochastic Tagger: HMM-based
– Transformation-Based Tagger (Brill) (we won't cover this)

Slide 9 Rule-Based Tagging
Basic idea:
– Assign all possible tags to each word
– Remove tags according to a set of rules of the type: if word+1 is an adjective, adverb, or quantifier, and the following token is a sentence boundary, and word-1 is not a verb like "consider", then eliminate non-adverb tags, else eliminate the adverb tag
– Typically more than 1000 hand-written rules

Slide 10 Sample ENGTWOL Lexicon Demo:

Slide 11 Stage 1 of ENGTWOL Tagging
First Stage: Run words through a morphological analyzer to get all parts of speech.
Example: Pavlov had shown that salivation …
Pavlov      PAVLOV N NOM SG PROPER
had         HAVE V PAST VFIN SVO / HAVE PCP2 SVO
shown       SHOW PCP2 SVOO SVO SV
that        ADV / PRON DEM SG / DET CENTRAL DEM SG / CS
salivation  N NOM SG

Slide 12 Stage 2 of ENGTWOL Tagging
Second Stage: Apply constraints. Constraints are used in a negative way, to eliminate tags.
Example: Adverbial "that" rule
Given input: "that"
If   (+1 A/ADV/QUANT)     ; the next word is an adjective, adverb, or quantifier
     (+2 SENT-LIM)        ; and the word after that is a sentence boundary
     (NOT -1 SVOC/A)      ; and the previous word is not a verb like "consider"
Then eliminate non-ADV tags
Else eliminate the ADV tag
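A minimal Python sketch of the two-stage idea (the mini-lexicon, tag names, and the simplified adverbial-"that" rule below are illustrative assumptions, not the actual ENGTWOL lexicon or rule formalism):

    # Sketch of two-stage constraint-based tagging (simplified illustration,
    # not the real ENGTWOL formalism). Stage 1 assigns every tag the lexicon
    # allows; stage 2 prunes tags with hand-written constraints.
    LEXICON = {                               # hypothetical mini-lexicon
        "that": {"ADV", "DET", "PRON", "CS"},
        "far": {"ADV", "ADJ"},
        "is": {"VERB"},
        "not": {"ADV"},
        ".": {"PUNCT"},
    }

    def stage1(words):
        # Stage 1: assign all possible tags (unknown words get an open set).
        return [set(LEXICON.get(w.lower(), {"NOUN", "VERB", "ADJ"})) for w in words]

    def adverbial_that_rule(words, candidates, i):
        # Simplified version of the rule above: if the next word can be an
        # adjective/adverb/quantifier, the word after that is a sentence
        # boundary, and the previous word is not a verb like "consider",
        # keep only ADV for "that"; otherwise eliminate ADV.
        if words[i].lower() != "that":
            return
        nxt = candidates[i + 1] if i + 1 < len(words) else set()
        at_boundary = i + 2 >= len(words) or words[i + 2] in {".", "!", "?"}
        prev_is_svoc_verb = i > 0 and words[i - 1].lower() in {"consider", "believe"}
        if nxt & {"ADJ", "ADV", "QUANT"} and at_boundary and not prev_is_svoc_verb:
            candidates[i] &= {"ADV"}          # eliminate non-ADV tags
        else:
            candidates[i] -= {"ADV"}          # eliminate ADV

    words = "it is not that far .".split()
    cands = stage1(words)
    for i in range(len(words)):
        adverbial_that_rule(words, cands, i)
    print(list(zip(words, cands)))            # "that" ends up with only {"ADV"}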

Slide 13 Stochastic Tagging
– Based on the probability of a certain tag occurring, given the various possibilities
– Requires a training corpus
– Provides no probabilities for words that are not in the training corpus

Slide 14 Stochastic Tagging (cont.)
Simple method: choose the most frequent tag in the training text for each word!
– Result: about 90% accuracy
– This serves as a baseline (a sketch follows below)
– Other methods will do better
– The HMM tagger is one example
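A minimal sketch of this most-frequent-tag baseline, assuming the training data is simply a list of (word, tag) pairs:

    from collections import Counter, defaultdict

    # Most-frequent-tag baseline: for each word, pick the tag it was seen with
    # most often in training. The corpus format is an assumption for illustration.
    def train_baseline(tagged_corpus):
        counts = defaultdict(Counter)
        for word, tag in tagged_corpus:
            counts[word][tag] += 1
        return {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def tag_baseline(words, word_to_tag, default_tag="NN"):
        # Unknown words fall back to a default tag (NN is a common choice).
        return [word_to_tag.get(w, default_tag) for w in words]

    train = [("the", "DT"), ("race", "NN"), ("to", "TO"), ("race", "VB"),
             ("race", "NN"), ("is", "VBZ"), ("on", "IN")]
    model = train_baseline(train)
    print(tag_baseline(["the", "race", "is", "on"], model))
    # -> ['DT', 'NN', 'VBZ', 'IN']  ("race" gets its most frequent tag, NN)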

Slide 15 HMM Tagger
Intuition: pick the most likely tag for this word.
Let T = t_1, t_2, …, t_n and W = w_1, w_2, …, w_n.
Find the POS tags that generate the sequence of words, i.e., look for the most probable sequence of tags T underlying the observed words W.

Slide 16 Toward a Bigram-HMM Tagger
argmax_T P(T | W)
= argmax_T P(T) P(W | T)
= argmax_T P(t_1 … t_n) P(w_1 … w_n | t_1 … t_n)
≈ argmax_T [ P(t_1) P(t_2 | t_1) … P(t_n | t_{n-1}) ] [ P(w_1 | t_1) P(w_2 | t_2) … P(w_n | t_n) ]
To tag a single word: t_i = argmax_j P(t_j | t_{i-1}) P(w_i | t_j)
How do we compute P(t_i | t_{i-1})?  c(t_{i-1}, t_i) / c(t_{i-1})
How do we compute P(w_i | t_i)?  c(w_i, t_i) / c(t_i)
How do we compute the most probable tag sequence?  The Viterbi algorithm
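A minimal sketch of the relative-frequency estimates above, assuming the training corpus is a list of sentences of (word, tag) pairs and using unsmoothed counts:

    from collections import Counter

    # Count-based estimates (no smoothing):
    #   P(t_i | t_{i-1}) = c(t_{i-1}, t_i) / c(t_{i-1})
    #   P(w_i | t_i)     = c(w_i, t_i)     / c(t_i)
    def estimate(corpus):
        tag_count, tag_bigram, word_tag = Counter(), Counter(), Counter()
        for sent in corpus:
            prev = "<s>"                      # start-of-sentence pseudo-tag
            tag_count[prev] += 1
            for word, tag in sent:
                tag_bigram[(prev, tag)] += 1
                word_tag[(word, tag)] += 1
                tag_count[tag] += 1
                prev = tag
        trans = {bg: c / tag_count[bg[0]] for bg, c in tag_bigram.items()}
        emit = {wt: c / tag_count[wt[1]] for wt, c in word_tag.items()}
        return trans, emit

    corpus = [[("to", "TO"), ("race", "VB")], [("the", "DT"), ("race", "NN")]]
    trans, emit = estimate(corpus)
    print(trans[("TO", "VB")], emit[("race", "VB")])   # 1.0 1.0 on this toy corpus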

Slide 17 Disambiguating "race"

Slide 18 Example
P(NN|TO) = .00047     P(VB|TO) = .83
P(race|NN) = .00057   P(race|VB) = .00012
P(NR|VB) = .0027      P(NR|NN) = .0012
P(VB|TO) P(NR|VB) P(race|VB) = .00000027
P(NN|TO) P(NR|NN) P(race|NN) = .00000000032
So we (correctly) choose the verb reading.
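The arithmetic can be checked directly; a short sketch using the probability values listed above:

    # Checking the two products from the slide (values as given above).
    p_vb = 0.83 * 0.0027 * 0.00012      # P(VB|TO) * P(NR|VB) * P(race|VB)
    p_nn = 0.00047 * 0.0012 * 0.00057   # P(NN|TO) * P(NR|NN) * P(race|NN)
    print(f"{p_vb:.2e}  {p_nn:.2e}")    # ~2.7e-07 vs ~3.2e-10: the VB reading wins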

Slide 19 Hidden Markov Models
What we've described with these two kinds of probabilities is a Hidden Markov Model (HMM).

Slide 20 Definitions
– A weighted finite-state automaton adds probabilities to the arcs; the probabilities on the arcs leaving a node must sum to one
– A Markov chain is a special case of a weighted finite-state automaton in which the input sequence uniquely determines which states the automaton will go through
– Markov chains can't represent ambiguous problems; they are useful for assigning probabilities to unambiguous sequences

Slide 21 Markov Chain for Weather

Slide 22 Markov Chain for Words

Slide 23 Markov Chain: "First-order observable Markov Model"
– A set of states Q = q_1, q_2, …, q_N; the state at time t is q_t
– Transition probabilities: a set of probabilities A = a_01, a_02, …, a_n1, …, a_nn, where each a_ij represents the probability of transitioning from state i to state j; together these form the transition probability matrix A
– Markov assumption: the current state depends only on the previous state
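A minimal sketch of a first-order Markov chain with these components; the weather states and probability values below are made up for illustration:

    # First-order Markov chain over weather states (illustrative values).
    states = ["HOT", "COLD"]
    A = {"HOT": {"HOT": 0.7, "COLD": 0.3},    # each row sums to one
         "COLD": {"HOT": 0.4, "COLD": 0.6}}
    pi = {"HOT": 0.8, "COLD": 0.2}            # initial distribution

    def sequence_probability(seq):
        # P(q_1 .. q_n) = pi(q_1) * prod_i P(q_i | q_{i-1})  (Markov assumption)
        p = pi[seq[0]]
        for prev, cur in zip(seq, seq[1:]):
            p *= A[prev][cur]
        return p

    print(sequence_probability(["HOT", "HOT", "COLD"]))   # 0.8 * 0.7 * 0.3 = 0.168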

Slide 24 HMM for Ice Cream
You are a climatologist in the year 2799, studying global warming. You can't find any records of the weather in Baltimore, MD for the summer of 2007, but you find Jason Eisner's diary, which lists how many ice creams Jason ate every day that summer. Our job: figure out how hot it was.

Slide 25 Hidden Markov Model
– For Markov chains, the symbols are the same as the states: see hot weather, we're in state hot
– But in part-of-speech tagging the output symbols are words, while the hidden states are part-of-speech tags
– A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states; this means we don't know which state we are in

Slide 26 Hidden Markov Models
– States Q = q_1, q_2, …, q_N
– Observations O = o_1, o_2, …, o_N; each observation is a symbol from a vocabulary V = {v_1, v_2, …, v_V}
– Transition probabilities: transition probability matrix A = {a_ij}
– Observation likelihoods: output probability matrix B = {b_i(k)}
– Special initial probability vector π
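A minimal sketch of how these components might be bundled for the ice-cream example; the structure follows the definitions above, but the probability values are illustrative placeholders, not the exact numbers from the textbook figure:

    from dataclasses import dataclass

    # Bundling the HMM components Q, A, B, pi (values are placeholders).
    @dataclass
    class HMM:
        states: list      # Q
        trans: dict       # A: trans[i][j] = P(next state j | current state i)
        emit: dict        # B: emit[i][o]  = P(observation o | state i)
        init: dict        # pi: init[i]    = P(state i at time 1)

    ice_cream_hmm = HMM(
        states=["HOT", "COLD"],
        trans={"HOT": {"HOT": 0.7, "COLD": 0.3}, "COLD": {"HOT": 0.4, "COLD": 0.6}},
        emit={"HOT": {1: 0.2, 2: 0.4, 3: 0.4}, "COLD": {1: 0.5, 2: 0.4, 3: 0.1}},
        init={"HOT": 0.8, "COLD": 0.2},
    )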

Slide 27 Eisner Task
Given an ice cream observation sequence: 1,2,3,2,2,2,3…
Produce a weather sequence: H,C,H,H,H,C…

Slide 28 HMM for Ice Cream

Slide 29 Transition Probabilities

Slide 30 Observation Likelihoods

Slide 31 Decoding
OK, now we have a complete model that can give us what we need. We could just enumerate all paths given the input and use the model to assign probabilities to each, but that is not a good idea: the number of paths grows exponentially with the length of the input. Luckily, dynamic programming (also seen in Ch. 3 with minimum edit distance, though we didn't cover it) helps us here.

Slide 32 Viterbi Algorithm
The Viterbi algorithm computes the most likely tag sequence in O(W × T²) time, where T is the number of possible part-of-speech tags and W is the number of words in the sentence. The algorithm sweeps through all the tag possibilities for each word, computing the best sequence leading to each possibility. The key that makes the algorithm efficient is that, because of the Markov assumption, we only need to know the best sequences leading to the previous word.

Slide 33 Computing the Probability of a Sentence and Tags
We want to find the sequence of tags that maximizes P(T_1 … T_n | w_1 … w_n), which can be estimated as the product over i of P(T_i | T_{i-1}) × P(w_i | T_i), where:
– P(T_i | T_{i-1}) is computed by multiplying the arc values in the HMM
– P(w_i | T_i) is computed by multiplying the lexical generation probabilities associated with each word

Slide 34 The Viterbi Algorithm
Let T = number of part-of-speech tags, W = number of words in the sentence.

    /* Initialization step */
    for t = 1 to T
        Score(t, 1) = Pr(Word_1 | Tag_t) * Pr(Tag_t | φ)
        BackPtr(t, 1) = 0

    /* Iteration step */
    for w = 2 to W
        for t = 1 to T
            Score(t, w) = Pr(Word_w | Tag_t) * MAX_{j=1..T} [ Score(j, w-1) * Pr(Tag_t | Tag_j) ]
            BackPtr(t, w) = index of j that gave the max above

    /* Sequence identification */
    Seq(W) = t that maximizes Score(t, W)
    for w = W-1 down to 1
        Seq(w) = BackPtr(Seq(w+1), w+1)
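A runnable Python sketch of the same algorithm; the toy tag set, words, and probabilities below are invented for illustration:

    # Viterbi decoding for a bigram HMM tagger (toy model for illustration).
    def viterbi(words, tags, trans, emit, init):
        # score[w][t] = best probability of any tag sequence ending in tag t at word w
        score = [{t: init.get(t, 0.0) * emit[t].get(words[0], 0.0) for t in tags}]
        backptr = [{}]
        for w in range(1, len(words)):
            score.append({})
            backptr.append({})
            for t in tags:
                best_prev, best_p = max(
                    ((j, score[w - 1][j] * trans[j].get(t, 0.0)) for j in tags),
                    key=lambda x: x[1])
                score[w][t] = emit[t].get(words[w], 0.0) * best_p
                backptr[w][t] = best_prev
        # Follow back-pointers from the best final tag.
        last = max(tags, key=lambda t: score[-1][t])
        seq = [last]
        for w in range(len(words) - 1, 0, -1):
            seq.append(backptr[w][seq[-1]])
        return list(reversed(seq))

    tags = ["DT", "NN", "VB"]
    init = {"DT": 0.6, "NN": 0.2, "VB": 0.2}
    trans = {"DT": {"NN": 0.9, "VB": 0.1},
             "NN": {"VB": 0.6, "NN": 0.3, "DT": 0.1},
             "VB": {"DT": 0.6, "NN": 0.4}}
    emit = {"DT": {"the": 0.7, "that": 0.3},
            "NN": {"flight": 0.4, "book": 0.3, "dinner": 0.3},
            "VB": {"book": 0.6, "serve": 0.4}}
    print(viterbi(["book", "that", "flight"], tags, trans, emit, init))
    # -> ['VB', 'DT', 'NN']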

Slide 35 Viterbi Example (in lecture)

Slide 36 Evaluation
Once you have your POS tagger running, how do you evaluate it?
– Overall error rate with respect to a gold-standard test set
– Error rates on particular tags
– Error rates on particular words
– Tag confusions...

Slide 37 Error Analysis
Look at a confusion matrix to see which errors are causing problems, e.g.:
– Noun (NN) vs. Proper Noun (NNP) vs. Adjective (JJ)
– Preterite (VBD) vs. Participle (VBN) vs. Adjective (JJ)
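A minimal sketch of building such a confusion matrix from gold and predicted tag sequences (the tags and data below are illustrative):

    from collections import Counter

    # Tag confusion matrix: rows are gold tags, columns are predicted tags;
    # off-diagonal cells are the errors worth inspecting.
    def confusion_matrix(gold, predicted):
        return Counter(zip(gold, predicted))

    gold      = ["NN", "NNP", "JJ", "VBD", "VBN", "NN"]
    predicted = ["NN", "NN",  "JJ", "VBN", "VBN", "JJ"]
    cm = confusion_matrix(gold, predicted)
    for (g, p), n in sorted(cm.items()):
        marker = "" if g == p else "   <-- error"
        print(f"gold={g:4} pred={p:4} count={n}{marker}")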

Slide 38 Evaluation
– The result is compared with a manually annotated "gold standard"
– Typically accuracy reaches 96–97%
– This may be compared with the result for a baseline tagger (one that uses no context)
– Important: 100% accuracy is impossible even for human annotators

Slide 39 Summary
– Parts of speech
– Tagsets
– Part-of-speech tagging
– HMM tagging
– Markov chains
– Hidden Markov Models