Natural Language Processing Lecture 8—2/5/2015 Susan W. Brown

Today
- Part of speech tagging
- HMMs
- Basic HMM model
- Decoding
- Viterbi
- Review chapters 1-4

POS Tagging as Sequence Classification
- We are given a sentence (an "observation" or "sequence of observations")
  - Secretariat is expected to race tomorrow
- What is the best sequence of tags that corresponds to this sequence of observations?
- Probabilistic view:
  - Consider all possible sequences of tags
  - Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1…wn.

Getting to HMMs
- We want, out of all sequences of n tags t1…tn, the single tag sequence such that P(t1…tn | w1…wn) is highest.
- Hat (^) means "our estimate of the best one"
- Argmax_x f(x) means "the x such that f(x) is maximized"
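In symbols, the objective described above is the standard formulation:

$$\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n)$$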

Getting to HMMs
- This equation should give us the best tag sequence
- But how to make it operational? How to compute this value?
- Intuition of Bayesian inference:
  - Use Bayes' rule to transform this equation into a set of probabilities that are easier to compute (and give the right answer)

Bayesian inference
- Update the probability of a hypothesis as you get evidence
- Rationale: two components
  - How well does the evidence match the hypothesis?
  - How probable is the hypothesis a priori?

Using Bayes Rule
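Applying Bayes' rule to the objective above and dropping the denominator, which is the same for every candidate tag sequence, gives:

$$\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} \frac{P(w_1^n \mid t_1^n)\,P(t_1^n)}{P(w_1^n)} = \operatorname*{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\,P(t_1^n)$$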

Likelihood and Prior
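The two factors are the likelihood P(w1…wn | t1…tn) and the prior P(t1…tn). They are made computable with two standard simplifying assumptions (each word depends only on its own tag; each tag depends only on the previous tag), which gives the two kinds of probabilities on the next slides:

$$P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i), \qquad P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})$$

$$\hat{t}_1^n \approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\,P(t_i \mid t_{i-1})$$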

Two Kinds of Probabilities
- Tag transition probabilities P(ti | ti-1)
  - Determiners likely to precede adjectives and nouns
    - That/DT flight/NN
    - The/DT yellow/JJ hat/NN
    - So we expect P(NN|DT) and P(JJ|DT) to be high
  - Compute P(NN|DT) by counting in a labeled corpus:
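The estimate is a ratio of counts from the tagged corpus, for example:

$$P(NN \mid DT) = \frac{C(DT,\,NN)}{C(DT)}$$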

Two Kinds of Probabilities
- Word likelihood probabilities P(wi | ti)
  - VBZ (3sg pres verb) likely to be "is"
  - Compute P(is|VBZ) by counting in a labeled corpus:
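Again a ratio of counts from the labeled corpus:

$$P(is \mid VBZ) = \frac{C(VBZ,\,is)}{C(VBZ)}$$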

Example: The Verb "race"
- Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
- People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
- How do we pick the right tag?

Disambiguating "race"

Example
- P(NN|TO) = .00047
- P(VB|TO) = .83
- P(race|NN) = .00057
- P(race|VB) = .00012
- P(NR|VB) = .0027
- P(NR|NN) = .0012
- P(VB|TO) P(NR|VB) P(race|VB) = .00000027
- P(NN|TO) P(NR|NN) P(race|NN) = .00000000032
- So we (correctly) choose the verb tag for "race"

Question
- If there are 30 or so tags in the Penn set
- and the average sentence is around 20 words...
- how many tag sequences do we have to enumerate to argmax over in the worst case scenario?
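As a rough worked answer: if every one of the 30 tags were possible for every one of the 20 words, there would be 30^20, roughly 3.5 x 10^29, candidate tag sequences, which is why brute-force enumeration is hopeless.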

Hidden Markov Models
- Remember FSAs?
  - HMMs are a special kind that put probabilities on the transitions
- Minimum edit distance?
  - Viterbi and Forward algorithms
- Dynamic programming?
  - Efficient means of finding the most likely path

Hidden Markov Models
- We can represent our race tagging example as an HMM.
- This is a kind of generative model.
  - There is a hidden underlying generator of observable events
  - The hidden generator can be modeled as a network of states and transitions
  - We want to infer the underlying state sequence given the observed event sequence

Hidden Markov Models
- States Q = q1, q2, …, qN
- Observations O = o1, o2, …, oN
  - Each observation is a symbol from a vocabulary V = {v1, v2, …, vV}
- Transition probabilities
  - Transition probability matrix A = {aij}
- Observation likelihoods
  - Vectors of probabilities associated with the states
- Special initial probability vector
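As a minimal sketch (not code from the lecture), these components map naturally onto a small Python container; the field names are my own:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class HMM:
    states: List[str]                   # Q = q1 ... qN
    vocab: List[str]                    # V = {v1 ... vV}
    trans: Dict[str, Dict[str, float]]  # A: trans[qi][qj] = P(qj | qi)
    emit: Dict[str, Dict[str, float]]   # B: emit[qi][vk]  = P(vk | qi), one vector per state
    init: Dict[str, float]              # special initial probability vector: P(qi | start)
```

The ice cream example below fills in concrete (made-up) numbers for exactly these pieces.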

HMMs for Ice Cream
- You are a climatologist in the year 2799 studying global warming
- You can't find any records of the weather in Baltimore for the summer of 2007
- But you find Jason Eisner's diary, which lists how many ice creams Jason ate every day that summer
- Your job: figure out how hot it was each day

Eisner Task
- Given:
  - Ice cream observation sequence: 1, 2, 3, 2, 2, 2, 3, …
- Produce:
  - Hidden weather sequence: H, C, H, H, H, C, C, …

HMM for Ice Cream

Ice Cream HMM
- Let's just do 131 as the sequence
- How many underlying state (hot/cold) sequences are there?
  HHH, HHC, HCH, HCC, CCC, CCH, CHC, CHH
- How do you pick the right one?
  Argmax P(sequence | 1 3 1)

Ice Cream HMM
- Let's just do one sequence: CHC
  - Cold as the initial state: P(Cold | Start)
  - Observing a 1 on a cold day: P(1 | Cold)
  - Hot as the next state: P(Hot | Cold)
  - Observing a 3 on a hot day: P(3 | Hot)
  - Cold as the next state: P(Cold | Hot)
  - Observing a 1 on a cold day: P(1 | Cold)
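A minimal sketch of scoring this one path in Python. The transition and emission numbers are illustrative placeholders I have made up, not the values from the lecture's ice cream figure; only the shape of the computation matters:

```python
# Score one hidden-state path (CHC) for the observations 1, 3, 1 by multiplying
# the six factors listed above: initial, emission, transition, emission, transition, emission.
init  = {"Hot": 0.8, "Cold": 0.2}                      # P(state | Start)  (placeholder values)
trans = {"Hot": {"Hot": 0.7, "Cold": 0.3},             # P(next | current)
         "Cold": {"Hot": 0.4, "Cold": 0.6}}
emit  = {"Hot": {1: 0.2, 2: 0.4, 3: 0.4},              # P(ice creams | state)
         "Cold": {1: 0.5, 2: 0.4, 3: 0.1}}

def path_probability(states, observations):
    """P(states, observations) = P(s1|Start) P(o1|s1) * product of P(si|si-1) P(oi|si)."""
    p = init[states[0]] * emit[states[0]][observations[0]]
    for prev, cur, obs in zip(states, states[1:], observations[1:]):
        p *= trans[prev][cur] * emit[cur][obs]
    return p

print(path_probability(["Cold", "Hot", "Cold"], [1, 3, 1]))
```

Picking the best weather sequence means running this over all eight candidate paths and taking the argmax, which is exactly what Viterbi does efficiently.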

POS Transition Probabilities

Observation Likelihoods

Decoding
- OK, now we have a complete model that can give us what we need. Recall that we need to get the most probable tag sequence given the observation sequence.
- We could just enumerate all paths given the input and use the model to assign probabilities to each.
  - Not a good idea.
- Luckily, dynamic programming (last seen in Ch. 3 with minimum edit distance) helps us here.

Intuition
- Consider a state sequence (tag sequence) that ends at state j with a particular tag T.
- The probability of that tag sequence can be broken into two parts:
  - the probability of the BEST tag sequence up through j-1,
  - multiplied by the transition probability from the tag at the end of the j-1 sequence to T,
  - and the observation probability of the word given tag T.
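In the standard notation from the HMM definition earlier (A = {aij} for transitions, b for observation likelihoods), the value stored for state j at time t is the usual Viterbi recurrence:

$$v_t(j) = \max_{1 \le i \le N} \; v_{t-1}(i)\; a_{ij}\; b_j(o_t)$$

where a_ij is the transition probability from state i to state j and b_j(o_t) is the probability of observing o_t in state j.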

The Viterbi Algorithm

Viterbi Summary
- Create an array
  - with columns corresponding to inputs
  - and rows corresponding to possible states
- Sweep through the array in one pass, filling the columns left to right using our transition probabilities and observation probabilities
- Dynamic programming key: we need only store the MAX probability path to each cell (not all paths).
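A minimal Viterbi sketch along these lines (my own illustration, not the lecture's pseudocode): columns correspond to observations, rows to states, and each cell keeps only the best probability plus a backpointer. It reuses the made-up ice cream numbers from the earlier sketch:

```python
def viterbi(observations, states, init, trans, emit):
    # v[t][s] = probability of the best path that ends in state s after t+1 observations
    v = [{s: init[s] * emit[s][observations[0]] for s in states}]
    backpointer = [{}]
    for t in range(1, len(observations)):
        v.append({})
        backpointer.append({})
        for s in states:
            # keep only the best way of reaching state s at time t
            best_prev = max(states, key=lambda p: v[t - 1][p] * trans[p][s])
            v[t][s] = v[t - 1][best_prev] * trans[best_prev][s] * emit[s][observations[t]]
            backpointer[t][s] = best_prev
    # follow backpointers from the best final state to recover the state sequence
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.insert(0, backpointer[t][path[0]])
    return path, v[-1][last]

# Placeholder ice cream model (not the lecture's actual numbers)
states = ["Hot", "Cold"]
init   = {"Hot": 0.8, "Cold": 0.2}
trans  = {"Hot": {"Hot": 0.7, "Cold": 0.3}, "Cold": {"Hot": 0.4, "Cold": 0.6}}
emit   = {"Hot": {1: 0.2, 2: 0.4, 3: 0.4}, "Cold": {1: 0.5, 2: 0.4, 3: 0.1}}

print(viterbi([1, 3, 1], states, init, trans, emit))
```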

Evaluation
- So once you have your POS tagger running, how do you evaluate it?
  - Overall error rate with respect to a gold-standard test set
  - With respect to a baseline
  - Error rates on particular tags
  - Error rates on particular words
  - Tag confusions...

Error Analysis
- Look at a confusion matrix
- See what errors are causing problems
  - Noun (NN) vs ProperNoun (NNP) vs Adj (JJ)
  - Preterite (VBD) vs Participle (VBN) vs Adjective (JJ)
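A minimal sketch of tallying accuracy and tag confusions against a gold standard; the two tag sequences below are made-up examples, not lecture data:

```python
from collections import Counter

gold      = ["DT", "NN", "VBZ", "VBN", "TO", "VB", "NR"]   # hand-annotated tags
predicted = ["DT", "NN", "VBZ", "JJ",  "TO", "VB", "NR"]   # made-up tagger output

accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
confusions = Counter((g, p) for g, p in zip(gold, predicted) if g != p)

print(f"accuracy = {accuracy:.2%}")
for (g, p), n in confusions.most_common():
    print(f"gold {g} tagged as {p}: {n} time(s)")
```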

Evaluation
- The result is compared with a manually coded "gold standard"
- Typically accuracy reaches 96-97%
- This may be compared with the result for a baseline tagger (one that uses no context).
- Important: 100% is impossible even for human annotators.
- Issues with manually coded gold standards

Summary
- Parts of speech
- Tagsets
- Part of speech tagging
- HMM tagging
  - Markov chains
  - Hidden Markov Models
  - Viterbi decoding

Review
- Exam readings: Chapters 1 to 6
  - Chapter 2
  - Chapter 3: skip 3.4.1, 3.10, 3.12
  - Chapter 4: skip 4.7,
  - Chapter 5: skip 5.5.4, 5.6,

3 Formalisms
- Regular expressions describe languages (sets of strings)
- It turns out that there are 3 formalisms for capturing such languages, each with its own motivation and history:
  - Regular expressions
    - Compact textual strings
    - Perfect for specifying patterns in programs or on command lines
  - Finite state automata
    - Graphs
  - Regular grammars
    - Rules

Regular expressions
- Anchor expressions: ^, $, \b
- Counters: *, +, ?
- Single character expressions: ., [ ], [ - ]
- Grouping for precedence: ( )
  - [dog]* vs. (dog)*
- No need to memorize shortcuts: \d, \s
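A small sketch of the character-class vs. grouping distinction mentioned above: [dog]* matches any run of the letters d, o, g, while (dog)* matches repetitions of the string "dog". The test strings here are made up:

```python
import re

# [dog]* : zero or more characters drawn from the set {d, o, g}
# (dog)* : zero or more repetitions of the literal string "dog"
print(re.fullmatch(r"[dog]*", "good"))    # matches: every letter is in {d, o, g}
print(re.fullmatch(r"(dog)*", "good"))    # no match: "good" is not "dog" repeated
print(re.fullmatch(r"(dog)*", "dogdog"))  # matches

# Anchors and a counter: a string consisting of one or more digits
print(re.search(r"^\d+$", "42"))          # matches
print(re.search(r"^\d+$", "42 dogs"))     # no match
```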

FSAs
- Components of an FSA
- Know how to read one and draw one
- Deterministic vs. non-deterministic
  - How is success/failure different?
  - Relative power
- Recognition vs. generation
- How do we implement FSAs for recognition?

More Formally
- You can specify an FSA by enumerating the following things:
  - The set of states: Q
  - A finite alphabet: Σ
  - A start state
  - A set of accept states
  - A transition function that maps Q × Σ to Q
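A minimal sketch of a table-driven recognizer built from exactly these pieces; the example machine, which accepts one or more a's followed by a single b (e.g. "aab"), is my own illustration:

```python
# Deterministic FSA: states Q, alphabet Sigma, start state, accept states,
# and a transition function delta: Q x Sigma -> Q (here a dict of dicts).
Q      = {"q0", "q1", "q2"}
SIGMA  = {"a", "b"}
START  = "q0"
ACCEPT = {"q2"}
DELTA  = {
    "q0": {"a": "q1"},
    "q1": {"a": "q1", "b": "q2"},
    "q2": {},
}

def recognize(tape):
    """Return True if the machine ends in an accept state after consuming the whole tape."""
    state = START
    for symbol in tape:
        if symbol not in DELTA[state]:   # no transition defined: reject
            return False
        state = DELTA[state][symbol]
    return state in ACCEPT

print(recognize("aab"))   # True
print(recognize("ab"))    # True
print(recognize("ba"))    # False
```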

FSTs
- Components of an FST
- Inputs and outputs
- Relations

Morphology
- What is a morpheme?
- Stems and affixes
- Inflectional vs. derivational
  - Fuzzy -> fuzziness
  - Fuzzy -> fuzzier
- Application of derivation rules
  - N -> V with -ize
    - System, chair
- Regular vs. irregular

Derivational Rules

Lexicons
- So the big picture is to store a lexicon (a list of the words you care about) as an FSA. The base lexicon is embedded in larger automata that capture the inflectional and derivational morphology of the language.
- So what? Well, the simplest thing you can do with such an FSA is spell checking:
  - If the machine rejects, the word isn't in the language
  - Without listing every form of every word

Next Time
- Three tasks for HMMs
  - Decoding: the Viterbi algorithm
  - Assigning probabilities to inputs: the Forward algorithm
  - Finding parameters for a model: EM