Natural Language Processing Lecture 8—9/24/2013 Jim Martin.

Slides:

Advertisements

Similar presentations

Three Basic Problems Compute the probability of a text: P m (W 1,N ) Compute maximum probability tag sequence: arg max T 1,N P m (T 1,N | W 1,N ) Compute.

Advertisements

CS460/IT632 Natural Language Processing/Language Technology for the Web Lecture 2 (06/01/06) Prof. Pushpak Bhattacharyya IIT Bombay Part of Speech (PoS)

CPSC 422, Lecture 16Slide 1 Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 16 Feb, 11, 2015.

Outline Why part of speech tagging? Word classes

Chapter 8. Word Classes and Part-of-Speech Tagging From: Chapter 8 of An Introduction to Natural Language Processing, Computational Linguistics, and Speech.

Part-of-speech tagging. Parts of Speech Perhaps starting with Aristotle in the West (384–322 BCE) the idea of having parts of speech lexical categories,

Part of Speech Tagging Importance Resolving ambiguities by assigning lower probabilities to words that don’t fit Applying to language grammatical rules.

February 2007CSA3050: Tagging II1 CSA2050: Natural Language Processing Tagging 2 Rule-Based Tagging Stochastic Tagging Hidden Markov Models (HMMs) N-Grams.

LINGUISTICA GENERALE E COMPUTAZIONALE DISAMBIGUAZIONE DELLE PARTI DEL DISCORSO.

CS 224S / LINGUIST 285 Spoken Language Processing Dan Jurafsky Stanford University Spring 2014 Lecture 3: ASR: HMMs, Forward, Viterbi.

Hidden Markov Models IP notice: slides from Dan Jurafsky.

Hidden Markov Models IP notice: slides from Dan Jurafsky.

Hidden Markov Models Bonnie Dorr Christof Monz CMSC 723: Introduction to Computational Linguistics Lecture 5 October 6, 2004.

Chapter 6: HIDDEN MARKOV AND MAXIMUM ENTROPY Heshaam Faili University of Tehran.

Hidden Markov Model (HMM) Tagging  Using an HMM to do POS tagging  HMM is a special case of Bayesian inference.

Tagging with Hidden Markov Models. Viterbi Algorithm. Forward-backward algorithm Reading: Chap 6, Jurafsky & Martin Instructor: Paul Tarau, based on Rada.

Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2007 Lecture August 2007.

Part II. Statistical NLP Advanced Artificial Intelligence Part of Speech Tagging Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most.

1 CSC 594 Topics in AI – Applied Natural Language Processing Fall 2009/ Part-Of-Speech (POS) Tagging.

Learning Bit by Bit Hidden Markov Models. Weighted FSA weather The is outside

Ch 10 Part-of-Speech Tagging Edited from: L. Venkata Subramaniam February 28, 2002.

POS based on Jurafsky and Martin Ch. 8 Miriam Butt October 2003.

POS Tagging HMM Taggers (continued). Today Walk through the guts of an HMM Tagger Address problems with HMM Taggers, specifically unknown words.

Part of speech (POS) tagging

Word classes and part of speech tagging Chapter 5.

CS 224S / LINGUIST 281 Speech Recognition, Synthesis, and Dialogue

Albert Gatt Corpora and Statistical Methods Lecture 9.

M ARKOV M ODELS & POS T AGGING Nazife Dimililer 23/10/2012.

Part II. Statistical NLP Advanced Artificial Intelligence Applications of HMMs and PCFGs in NLP Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme.

Parts of Speech Sudeshna Sarkar 7 Aug 2008.

CS 4705 Hidden Markov Models Julia Hirschberg CS4705.

Natural Language Processing Lecture 8—2/5/2015 Susan W. Brown.

Lecture 6 POS Tagging Methods Topics Taggers Rule Based Taggers Probabilistic Taggers Transformation Based Taggers - Brill Supervised learning Readings:

1 LIN 6932 Spring 2007 LIN6932: Topics in Computational Linguistics Hana Filip Lecture 4: Part of Speech Tagging (II) - Introduction to Probability February.

인공지능 연구실 정 성 원 Part-of-Speech Tagging. 2 The beginning The task of labeling (or tagging) each word in a sentence with its appropriate part of speech.

Fall 2005 Lecture Notes #8 EECS 595 / LING 541 / SI 661 Natural Language Processing.

CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging II Transformation Based Tagging Brill (1995)

10/24/2015CPSC503 Winter CPSC 503 Computational Linguistics Lecture 6 Giuseppe Carenini.

CS774. Markov Random Field : Theory and Application Lecture 19 Kyomin Jung KAIST Nov

10/30/2015CPSC503 Winter CPSC 503 Computational Linguistics Lecture 7 Giuseppe Carenini.

Word classes and part of speech tagging Chapter 5.

CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging I Introduction Tagsets Approaches.

Word classes and part of speech tagging 09/28/2004 Reading: Chap 8, Jurafsky & Martin Instructor: Rada Mihalcea Note: Some of the material in this slide.

PGM 2003/04 Tirgul 2 Hidden Markov Models. Introduction Hidden Markov Models (HMM) are one of the most common form of probabilistic graphical models,

CSA3202 Human Language Technology HMMs for POS Tagging.

中文信息处理 Chinese NLP Lecture 7.

CPSC 422, Lecture 15Slide 1 Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 15 Oct, 14, 2015.

Hidden Markovian Model. Some Definitions Finite automation is defined by a set of states, and a set of transitions between states that are taken based.

Part-of-speech tagging

NLP. Introduction to NLP Rule-based Stochastic –HMM (generative) –Maximum Entropy MM (discriminative) Transformation-based.

Word classes and part of speech tagging. Slide 1 Outline Why part of speech tagging? Word classes Tag sets and problem definition Automatic approaches.

CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging II Transformation Based Tagging Brill (1995)

POS TAGGING AND HMM Tim Teks Mining Adapted from Heng Ji.

1 COMP790: Statistical NLP POS Tagging Chap POS tagging Goal: assign the right part of speech (noun, verb, …) to words in a text “The/AT representative/NN.

6/18/2016CPSC503 Winter CPSC 503 Computational Linguistics Lecture 6 Giuseppe Carenini.

Speech and Language Processing SLP Chapter 5. 10/31/1 2 Speech and Language Processing - Jurafsky and Martin 2 Today  Parts of speech (POS)  Tagsets.

CSC 594 Topics in AI – Natural Language Processing

CS 224S / LINGUIST 285 Spoken Language Processing

Lecture 5 POS Tagging Methods

Word classes and part of speech tagging

CSCI 5832 Natural Language Processing

CSC 594 Topics in AI – Natural Language Processing

Hidden Markov Models IP notice: slides from Dan Jurafsky.

CSC 594 Topics in AI – Natural Language Processing

CSCI 5832 Natural Language Processing

CSC 594 Topics in AI – Natural Language Processing

CSCI 5832 Natural Language Processing

Lecture 6: Part of Speech Tagging (II): October 14, 2004 Neal Snider

Natural Language Processing

Part-of-Speech Tagging Using Hidden Markov Models

Presentation transcript:

Natural Language Processing Lecture 8—9/24/2013 Jim Martin

5/15/2015 Speech and Language Processing - Jurafsky and Martin 2 Today  Review  Word classes  Part of speech tagging  HMMs  Basic HMM model  Decoding  Viterbi

5/15/2015 Speech and Language Processing - Jurafsky and Martin 3 Word Classes: Parts of Speech  8 (ish) traditional parts of speech  Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc  Called: parts-of-speech, lexical categories, word classes, morphological classes, lexical tags...  Lots of debate within linguistics about the number, nature, and universality of these  We’ll completely ignore this debate.

5/15/2015 Speech and Language Processing - Jurafsky and Martin 4 POS Tagging  The process of assigning a part-of-speech or lexical class marker to each word in a collection. WORD tag theDET koalaN put V the DET keysN onP theDET tableN

5/15/2015 Speech and Language Processing - Jurafsky and Martin 5 Penn TreeBank POS Tagset

5/15/2015 Speech and Language Processing - Jurafsky and Martin 6 POS Tagging  Words often have more than one part of speech: back  The back door = JJ  On my back = NN  Win the voters back = RB  Promised to back the bill = VB  The POS tagging problem is to determine the POS tag for a particular instance of a word in context  Usually a sentence

POS Tagging  Note this is distinct from the task of identifying which sense of a word is being used given a particular part of speech. That’s called word sense disambiguation. We’ll get to that later.  “… backed the car into a pole”  “… backed the wrong candidate” 5/15/2015 Speech and Language Processing - Jurafsky and Martin 7

5/15/2015 Speech and Language Processing - Jurafsky and Martin 8 How Hard is POS Tagging? Measuring Ambiguity

5/15/2015 Speech and Language Processing - Jurafsky and Martin 9 Two Methods for POS Tagging 1.Rule-based tagging  (ENGTWOL; Section 5.4 ) 2.Stochastic 1.Probabilistic sequence models  HMM (Hidden Markov Model) tagging  MEMMs (Maximum Entropy Markov Models)

5/15/2015 Speech and Language Processing - Jurafsky and Martin 10 POS Tagging as Sequence Classification  We are given a sentence (an “observation” or “sequence of observations”)  Secretariat is expected to race tomorrow  What is the best sequence of tags that corresponds to this sequence of observations?  Probabilistic view  Consider all possible sequences of tags  Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w 1 …w n.

5/15/2015 Speech and Language Processing - Jurafsky and Martin 11 Getting to HMMs  We want, out of all sequences of n tags t 1 …t n the single tag sequence such that P(t 1 …t n |w 1 …w n ) is highest.  Hat ^ means “our estimate of the best one”  Argmax x f(x) means “the x such that f(x) is maximized”

5/15/2015 Speech and Language Processing - Jurafsky and Martin 12 Getting to HMMs  This equation should give us the best tag sequence  But how to make it operational? How to compute this value?  Intuition of Bayesian inference:  Use Bayes rule to transform this equation into a set of probabilities that are easier to compute (and give the right answer)

5/15/2015 Speech and Language Processing - Jurafsky and Martin 13 Using Bayes Rule Know this.

5/15/2015 Speech and Language Processing - Jurafsky and Martin 14 Likelihood and Prior

5/15/2015 Speech and Language Processing - Jurafsky and Martin 15 Two Kinds of Probabilities  Tag transition probabilities p(t i |t i-1 )  Determiners likely to precede adjs and nouns  That/DT flight/NN  The/DT yellow/JJ hat/NN  So we expect P(NN|DT) and P(JJ|DT) to be high  Compute P(NN|DT) by counting in a labeled corpus:

5/15/2015 Speech and Language Processing - Jurafsky and Martin 16 Two Kinds of Probabilities  Word likelihood probabilities p(w i |t i )  VBZ (3sg Pres Verb) likely to be “is”  Compute P(is|VBZ) by counting in a labeled corpus:

5/15/2015 Speech and Language Processing - Jurafsky and Martin 17 Example: The Verb “race”  Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR  People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN  How do we pick the right tag?

5/15/2015 Speech and Language Processing - Jurafsky and Martin 18 Disambiguating “race”

5/15/2015 Speech and Language Processing - Jurafsky and Martin 19 Disambiguating “race”

5/15/2015 Speech and Language Processing - Jurafsky and Martin 20 Example  P(NN|TO) =  P(VB|TO) =.83  P(race|NN) =  P(race|VB) =  P(NR|VB) =.0027  P(NR|NN) =.0012  P(VB|TO)P(NR|VB)P(race|VB) =  P(NN|TO)P(NR|NN)P(race|NN)=  So we (correctly) choose the verb tag for “race”

5/15/2015 Speech and Language Processing - Jurafsky and Martin 21 Break  Quiz readings  Chapters 1 to 6  Chapter 2  Chapter 3  Skip 3.4.1, 3.10, 3.12  Chapter 4  Skip 4.7,  Chapter 5  Skip 5.5.4, 5.6,  Chapter 6  Skip

5/15/2015 Speech and Language Processing - Jurafsky and Martin 22 Hidden Markov Models  What we’ve just described is called a Hidden Markov Model (HMM)  This is a kind of generative model.  There is a hidden underlying generator of observable events  The hidden generator can be modeled as a network of states and transitions  We want to infer the underlying state sequence given the observed event sequence

5/15/2015 Speech and Language Processing - Jurafsky and Martin 23  States Q = q 1, q 2 …q N;  Observations O= o 1, o 2 …o N;  Each observation is a symbol from a vocabulary V = {v 1,v 2,…v V }  Transition probabilities  Transition probability matrix A = {a ij }  Observation likelihoods  Output probability matrix B={b i (k)}  Special initial probability vector  Hidden Markov Models

5/15/2015 Speech and Language Processing - Jurafsky and Martin 24 HMMs for Ice Cream  You are a climatologist in the year 2799 studying global warming  You can’t find any records of the weather in Baltimore for summer of 2007  But you find Jason Eisner’s diary which lists how many ice-creams Jason ate every day that summer  Your job: figure out how hot it was each day

5/15/2015 Speech and Language Processing - Jurafsky and Martin 25 Eisner Task  Given  Ice Cream Observation Sequence: 1,2,3,2,2,2,3…  Produce:  Hidden Weather Sequence: H,C,H,H,H,C, C…

5/15/2015 Speech and Language Processing - Jurafsky and Martin 26 HMM for Ice Cream

Ice Cream HMM  Let’s just do 131 as the sequence  How many underlying state (hot/cold) sequences are there?  How do you pick the right one? 5/15/2015 Speech and Language Processing - Jurafsky and Martin 27 HHH HHC HCH HCC CCC CCH CHC CHH HHH HHC HCH HCC CCC CCH CHC CHH Argmax P(sequence | 1 3 1)

Ice Cream HMM Let’s just do 1 sequence: CHC 5/15/2015 Speech and Language Processing - Jurafsky and Martin 28 Cold as the initial state P(Cold|Start) Cold as the initial state P(Cold|Start) Observing a 1 on a cold day P(1 | Cold) Observing a 1 on a cold day P(1 | Cold) Hot as the next state P(Hot | Cold) Hot as the next state P(Hot | Cold) Observing a 3 on a hot day P(3 | Hot) Observing a 3 on a hot day P(3 | Hot) Cold as the next state P(Cold|Hot) Cold as the next state P(Cold|Hot) Observing a 1 on a cold day P(1 | Cold) Observing a 1 on a cold day P(1 | Cold)

5/15/2015 Speech and Language Processing - Jurafsky and Martin 29 POS Transition Probabilities

5/15/2015 Speech and Language Processing - Jurafsky and Martin 30 Observation Likelihoods

Question  If there are 30 or so tags in the Penn set  And the average sentence is around 20 words...  How many tag sequences do we have to enumerate to argmax over in the worst case scenario? 5/15/2015 Speech and Language Processing - Jurafsky and Martin

5/15/2015 Speech and Language Processing - Jurafsky and Martin 32 Decoding  Ok, now we have a complete model that can give us what we need. Recall that we need to get  We could just enumerate all paths given the input and use the model to assign probabilities to each.  Not a good idea.  Luckily dynamic programming (last seen in Ch. 3 with minimum edit distance) helps us here

5/15/2015 Speech and Language Processing - Jurafsky and Martin 33 Intuition  Consider a state sequence (tag sequence) that ends at state j with a particular tag T.  The probability of that tag sequence can be broken into two parts  The probability of the BEST tag sequence up through j-1  Multiplied by the transition probability from the tag at the end of the j-1 sequence to T.  And the observation probability of the word given tag T.

5/15/2015 Speech and Language Processing - Jurafsky and Martin 34 The Viterbi Algorithm

5/15/2015 Speech and Language Processing - Jurafsky and Martin 35 Viterbi Example

5/15/2015 Speech and Language Processing - Jurafsky and Martin 36 Viterbi Summary  Create an array  With columns corresponding to inputs  Rows corresponding to possible states  Sweep through the array in one pass filling the columns left to right using our transition probs and observations probs  Dynamic programming key is that we need only store the MAX prob path to each cell, (not all paths).

5/15/2015 Speech and Language Processing - Jurafsky and Martin 37 Evaluation  So once you have you POS tagger running how do you evaluate it?  Overall error rate with respect to a gold- standard test set.  Error rates on particular tags  Error rates on particular words  Tag confusions...

5/15/2015 Speech and Language Processing - Jurafsky and Martin 38 Error Analysis  Look at a confusion matrix  See what errors are causing problems  Noun (NN) vs ProperNoun (NNP) vs Adj (JJ)  Preterite (VBD) vs Participle (VBN) vs Adjective (JJ)

5/15/2015 Speech and Language Processing - Jurafsky and Martin 39 Evaluation  The result is compared with a manually coded “Gold Standard”  Typically accuracy reaches 96-97%  This may be compared with result for a baseline tagger (one that uses no context).  Important: 100% is impossible even for human annotators.

5/15/2015 Speech and Language Processing - Jurafsky and Martin 40 Summary  Parts of speech  Tagsets  Part of speech tagging  HMM Tagging  Markov Chains  Hidden Markov Models  Viterbi decoding

5/15/2015 Speech and Language Processing - Jurafsky and Martin 41 Next Time  Three tasks for HMMs  Decoding  Viterbi algorithm  Assigning probabilities to inputs  Forward algorithm  Finding parameters for a model  EM