CSA3202 Human Language Technology HMMs for POS Tagging.

Slides:

Advertisements

Similar presentations

Three Basic Problems Compute the probability of a text: P m (W 1,N ) Compute maximum probability tag sequence: arg max T 1,N P m (T 1,N | W 1,N ) Compute.

Advertisements

Outline Why part of speech tagging? Word classes

February 2007CSA3050: Tagging II1 CSA2050: Natural Language Processing Tagging 2 Rule-Based Tagging Stochastic Tagging Hidden Markov Models (HMMs) N-Grams.

LINGUISTICA GENERALE E COMPUTAZIONALE DISAMBIGUAZIONE DELLE PARTI DEL DISCORSO.

Natural Language Processing Lecture 8—9/24/2013 Jim Martin.

CS 224S / LINGUIST 285 Spoken Language Processing Dan Jurafsky Stanford University Spring 2014 Lecture 3: ASR: HMMs, Forward, Viterbi.

Hidden Markov Models IP notice: slides from Dan Jurafsky.

Hidden Markov Models IP notice: slides from Dan Jurafsky.

Hidden Markov Models Bonnie Dorr Christof Monz CMSC 723: Introduction to Computational Linguistics Lecture 5 October 6, 2004.

Chapter 6: HIDDEN MARKOV AND MAXIMUM ENTROPY Heshaam Faili University of Tehran.

Hidden Markov Model (HMM) Tagging  Using an HMM to do POS tagging  HMM is a special case of Bayesian inference.

1 Hidden Markov Models (HMMs) Probabilistic Automata Ubiquitous in Speech/Speaker Recognition/Verification Suitable for modelling phenomena which are dynamic.

Albert Gatt Corpora and Statistical Methods Lecture 8.

Tagging with Hidden Markov Models. Viterbi Algorithm. Forward-backward algorithm Reading: Chap 6, Jurafsky & Martin Instructor: Paul Tarau, based on Rada.

Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2007 Lecture August 2007.

Part II. Statistical NLP Advanced Artificial Intelligence (Hidden) Markov Models Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most.

Part II. Statistical NLP Advanced Artificial Intelligence Hidden Markov Models Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most.

Part II. Statistical NLP Advanced Artificial Intelligence Part of Speech Tagging Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most.

1 CSC 594 Topics in AI – Applied Natural Language Processing Fall 2009/ Part-Of-Speech (POS) Tagging.

FSA and HMM LING 572 Fei Xia 1/5/06.

Learning Bit by Bit Hidden Markov Models. Weighted FSA weather The is outside

Learning, Uncertainty, and Information Big Ideas November 8, 2004.

POS based on Jurafsky and Martin Ch. 8 Miriam Butt October 2003.

1 I256: Applied Natural Language Processing Marti Hearst Sept 20, 2006.

POS Tagging HMM Taggers (continued). Today Walk through the guts of an HMM Tagger Address problems with HMM Taggers, specifically unknown words.

1 Hidden Markov Model Instructor : Saeed Shiry  CHAPTER 13 ETHEM ALPAYDIN © The MIT Press, 2004.

Syllabus Text Books Classes Reading Material Assignments Grades Links Forum Text Books עיבוד שפות טבעיות - שיעור חמישי POS Tagging Algorithms עידו.

Part of speech (POS) tagging

Word classes and part of speech tagging Chapter 5.

(Some issues in) Text Ranking. Recall General Framework Crawl – Use XML structure – Follow links to get new pages Retrieve relevant documents – Today.

CS 224S / LINGUIST 281 Speech Recognition, Synthesis, and Dialogue

Albert Gatt Corpora and Statistical Methods Lecture 9.

M ARKOV M ODELS & POS T AGGING Nazife Dimililer 23/10/2012.

CS 4705 Hidden Markov Models Julia Hirschberg CS4705.

Natural Language Processing Lecture 8—2/5/2015 Susan W. Brown.

Lecture 6 POS Tagging Methods Topics Taggers Rule Based Taggers Probabilistic Taggers Transformation Based Taggers - Brill Supervised learning Readings:

1 LIN 6932 Spring 2007 LIN6932: Topics in Computational Linguistics Hana Filip Lecture 4: Part of Speech Tagging (II) - Introduction to Probability February.

Fall 2005 Lecture Notes #8 EECS 595 / LING 541 / SI 661 Natural Language Processing.

CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging II Transformation Based Tagging Brill (1995)

CS460/IT632 Natural Language Processing/Language Technology for the Web Lecture 3 (10/01/06) Prof. Pushpak Bhattacharyya IIT Bombay Statistical Formulation.

10/30/2015CPSC503 Winter CPSC 503 Computational Linguistics Lecture 7 Giuseppe Carenini.

Word classes and part of speech tagging Chapter 5.

ECE 8443 – Pattern Recognition ECE 8527 – Introduction to Machine Learning and Pattern Recognition Objectives: Reestimation Equations Continuous Distributions.

ECE 8443 – Pattern Recognition ECE 8527 – Introduction to Machine Learning and Pattern Recognition Objectives: Reestimation Equations Continuous Distributions.

Hidden Markov Models & POS Tagging Corpora and Statistical Methods Lecture 9.

Word classes and part of speech tagging 09/28/2004 Reading: Chap 8, Jurafsky & Martin Instructor: Rada Mihalcea Note: Some of the material in this slide.

中文信息处理 Chinese NLP Lecture 7.

CPSC 422, Lecture 15Slide 1 Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 15 Oct, 14, 2015.

CS : Speech, NLP and the Web/Topics in AI Pushpak Bhattacharyya CSE Dept., IIT Bombay Lecture-14: Probabilistic parsing; sequence labeling, PCFG.

Hidden Markovian Model. Some Definitions Finite automation is defined by a set of states, and a set of transitions between states that are taken based.

Albert Gatt Corpora and Statistical Methods. Acknowledgement Some of the examples in this lecture are taken from a tutorial on HMMs by Wolgang Maass.

NLP. Introduction to NLP Rule-based Stochastic –HMM (generative) –Maximum Entropy MM (discriminative) Transformation-based.

CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging II Transformation Based Tagging Brill (1995)

Part-of-Speech Tagging CSCI-GA.2590 – Lecture 4 Ralph Grishman NYU.

Probabilistic Pronunciation + N-gram Models CMSC Natural Language Processing April 15, 2003.

1 COMP790: Statistical NLP POS Tagging Chap POS tagging Goal: assign the right part of speech (noun, verb, …) to words in a text “The/AT representative/NN.

Speech and Language Processing SLP Chapter 5. 10/31/1 2 Speech and Language Processing - Jurafsky and Martin 2 Today  Parts of speech (POS)  Tagsets.

CSC 594 Topics in AI – Natural Language Processing

CS 224S / LINGUIST 285 Spoken Language Processing

Lecture 5 POS Tagging Methods

Word classes and part of speech tagging

CSCI 5832 Natural Language Processing

CSC 594 Topics in AI – Natural Language Processing

Hidden Markov Models IP notice: slides from Dan Jurafsky.

CSC 594 Topics in AI – Natural Language Processing

CSCI 5832 Natural Language Processing

CSC 594 Topics in AI – Natural Language Processing

CSCI 5832 Natural Language Processing

Lecture 6: Part of Speech Tagging (II): October 14, 2004 Neal Snider

CSCI 5582 Artificial Intelligence

Presentation transcript:

CSA3202 Human Language Technology HMMs for POS Tagging

Acknowledgement  Slides adapted from lecture by Dan Jurafsky

3 Hidden Markov Model Tagging  Using an HMM to do POS tagging is a special case of Bayesian inference  Foundational work in computational linguistics  Bledsoe 1959: OCR  Mosteller and Wallace 1964: authorship identification  It is also related to the “noisy channel” model that’s the basis for ASR, OCR and MT

4 Noisy Channel Model for Machine Translation (Brown et. al. 1990) source sentence target sentence sentence

5 POS Tagging as Sequence Classification  Probabilistic view:  Given an observation sequence O = w 1 …w n  e.g. Secretariat is expected to race tomorrow  What is the best sequence of tags T that corresponds to this sequence of observations?  Consider all possible sequences of tags  Out of this universe of sequences, choose the tag sequence which maximises P(T|O)

6 Getting to HMMs  i.e. out of all sequences of n tags t 1 …t n the single tag sequence such that P(t 1 …t n |w 1 …w n ) is highest.  Hat ^ means “our estimate of the best one”  argmax x f(x) means “the x such that f(x) is maximized”

7 Getting to HMMs  But how to make it operational? How to compute this value?  Intuition of Bayesian classification:  Use Bayes Rule to transform this equation into a set of other probabilities that are easier to compute

8 Bayes Rule

9 recall this picture.... A B sample space A  B

10 Using Bayes Rule

11 Two Probabilities Involved: Probability of a tag sequence Probability of a word sequence given a tag sequence

12 Probability of tag sequence  We are seeking p(t i,...,t 1 ) = P(t 1 |0) * p(t 2 |t 1 ) * p(t 3 |t 1,t 2 )*, …, * p(t i |t i-1,…,t 1 ) (by chain rule of probabilities where 0 is the the beginning of string)  Next we approximate this product …

13 Probability of tag sequence by N-gram Approximation  By employing the following assumption p(t i |t i-1,...,t 1 ) ≈ p(t i |t i-1 )  i.e. we approximate the entire left context by the previous tag. So the product is simplified to P(t 1 |0) * p(t 2 |t 1 ) * p(t 3 |t 2 )*, …, * p(t i |t i-1 )

14 Generalizing the Approach The same trick works for approximating P(w n 1 |t n 1 ) P(w n 1 |t n 1 ) ≈ P(w 1 |t 1 ) * P(w 2 |t 2 ) * P(w n-1 |t n-1 ) * P(w n |t n ) as seen on the next slide

15 Likelihood and Prior

16 Two Kinds of Probabilities 1.Tag transition probabilities p(t i |t i-1 ) 2.Word likelihood probabilities p(w i |t i )

17 Estimating Probabilities  Tag transition probabilities p(t i |t i-1 )  Determiners likely to precede adjs and nouns  That/DT flight/NN  The/DT yellow/JJ hat/NN  So we expect P(NN|DT) and P(JJ|DT) to be high  Compute p(t i |t i-1 ) by counting in a labeled corpus:

18 Example

19 Two Kinds of Probabilities  Word likelihood probabilities p(w i |t i )  VBZ (3sg Pres verb) likely to be “is”  Compute P(is|VBZ) by counting in a labeled corpus:

20 Example: The Verb “race”  Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR  People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN  How do we pick the right tag?

21 Disambiguating “race”

22 Example  P(NN|TO) =  P(VB|TO) =.83  P(race|NN) =  P(race|VB) =  P(NR|VB) =.0027  P(NR|NN) =.0012  P(VB|TO)P(NR|VB)P(race|VB) =  P(NN|TO)P(NR|NN)P(race|NN)=  So we (correctly) choose the verb reading

23 Hidden Markov Models  What we’ve described with these two kinds of probabilities is a Hidden Markov Model (HMM)  We approach the concept of an HMM via the simpler notions  Weighted Finite State Automaton  Markov Chain

24 WFST Definition  A Weighted Finite-State Automaton (WFST) adds probabilities to the arcs  The sum of the probabilities leaving any arc must sum to one  A Markov chain is a special case of a WFST in which the input sequence uniquely determines which states the automaton will go through  Examples follow

25 Markov Chain for Pronunciation Variants

26 Markov Chain for Weather

27 Probability of a Sequence  Markov chain assigns probabilities to unambiguous sequences  What is the probability of 3 consecutive rainy days?  Sequence is rainy-rainy-rainy  I.e., state sequence is  P(0,3,3,3) =   0 a 03 a 33 a 33 = 0.2 x (0.6) 3 =

28 Markov Chain: “First-order observable Markov Model”  A set of states  Q = q 1, q 2 …q N; the state at time t is q t  Transition probabilities:  a set of probabilities A = a 01 a 02 …a n1 …a nn.  Each a ij represents the probability of transitioning from state i to state j  The set of these is the transition probability matrix A  Current state only depends on previous state

29 But....  Markov chains can’t represent inherently ambiguous problems  For that we need a Hidden Markov Model

30 Hidden Markov Model  For Markov chains, the output symbols are the same as the states.  See hot weather: we’re in state hot  But in part-of-speech tagging (and other things)  The output symbols are words  But the hidden states are part-of-speech tags  So we need an extension!  A Hidden Markov Model is an extension of a Markov chain in which the input symbols are not the same as the states.  This means we don’t know which state we are in.

31  States Q = q 1, q 2 … q N;  Observations O= o 1, o 2 … o N;  Each observation is a symbol from a vocabulary V = {v 1,v 2,…v V }  Transition probabilities  Transition probability matrix A = {a ij }  Observation likelihoods  Output probability matrix B={b i (k)}  Special initial probability vector  Hidden Markov Models

32 Underlying Markov Chain is as before, but we need to add...

33 Observation Likelihoods

34 Observation Likelihoods  Observation likelihoods P(O|S) link actual observations O with underlying states S  Are necessary whenever there is not a 1:1 correspondence between the two.  Examples from many different domains where there is uncertainty of interpretation  OCR  Speech Recognition  Spelling Correction  POS

35 Decoding  Recall that we need to compute  We could just enumerate all paths given the input and use the model to assign probabilities to each.  Many Paths  Only need to store the most probable path  Dynamic programming  Viterbi Algorithm

36 Viterbi Algorithm  Inputs  HMM  Q: States  A: Transition probabilities  B: Observation likelihoods  O: Observed Words  Output  Most probable state/tag sequence together with its probability

37 POS Transition Probabilities

38 POS Observation Likelihoods

39 Viterbi Summary  Create an array  Columns corresponding to inputs  Rows corresponding states  Sweep through the array in one pass filling the columns left to right using our transition probs and observations probs  Dynamic programming key is that we need only store the MAX prob path to each cell, (not all paths).

40 Viterbi Example

41

42 POS-Tagger Evaluation  Different metrics can be used:  Overall error rate with respect to a gold- standard test set.  Error rates on particular tags c(gold(tok)=t and tagger(tok)=t)/ c(gold(tok)=t)  Error rates on particular words  Tag confusions...

43 Evaluation  The result is compared with a manually coded “Gold Standard”  Typically accuracy reaches 96-97%  This may be compared with result for a baseline tagger (one that uses no context).  Important: 100% is impossible even for human annotators.  Inter annotator agreement problem

44 Error Analysis  Look at a confusion matrix  See what errors are causing problems  Noun (NN) vs ProperNoun (NNP) vs Adj (JJ)  Preterite (VBD) vs Participle (VBN) vs Adjective (JJ)

45 POS Tagging Summary  Parts of speech  Tagsets  Part of speech tagging  Rule-Based  HMM Tagging  Markov Chains  Hidden Markov Models  Next: Transformation based