Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391

CIS Intro to AI 2 NLP Task I – Determining Part of Speech Tags  The Problem:
Word | POS listing in Brown Corpus
heat | noun, verb
oil | noun
in | prep, noun, adv
a | det, noun, noun-proper
large | adj, noun, adv
pot | noun

CIS Intro to AI 3 NLP Task I – Determining Part of Speech Tags  The Old Solution: depth-first search. If each of the n words has k tags on average, try the k^n combinations until one works.  Machine Learning Solutions: automatically learn Part of Speech (POS) assignment. The best techniques achieve 97+% accuracy per word on new materials, given large training corpora.

CIS Intro to AI 4 What is POS tagging good for?  Speech synthesis: how to pronounce "lead"? Compare INsult / inSULT, OBject / obJECT, OVERflow / overFLOW, DIScount / disCOUNT, CONtent / conTENT.  Stemming for information retrieval: knowing a word is a noun tells you it takes plural -s, so a search for "aardvarks" can also retrieve "aardvark".  Parsing, speech recognition, etc.: possessive pronouns (my, your, her) are likely to be followed by nouns; personal pronouns (I, you, he) are likely to be followed by verbs.

CIS Intro to AI 5 Equivalent Problem in Bioinformatics  Durbin et al., Biological Sequence Analysis, Cambridge University Press.  Several applications, e.g. proteins: from the primary structure ATCPLELLLD, infer the secondary structure HHHBBBBBC…

CIS Intro to AI 6 Penn Treebank Tagset I
Tag | Description | Example
CC | coordinating conjunction | and
CD | cardinal number | 1, third
DT | determiner | the
EX | existential there | there is
FW | foreign word | d'oeuvre
IN | preposition/subordinating conjunction | in, of, like
JJ | adjective | green
JJR | adjective, comparative | greener
JJS | adjective, superlative | greenest
LS | list marker | 1)
MD | modal | could, will
NN | noun, singular or mass | table
NNS | noun, plural | tables
NNP | proper noun, singular | John
NNPS | proper noun, plural | Vikings

CIS Intro to AI 7 Penn Treebank Tagset II
Tag | Description | Example
PDT | predeterminer | both the boys
POS | possessive ending | friend 's
PRP | personal pronoun | I, me, him, he, it
PRP$ | possessive pronoun | my, his
RB | adverb | however, usually, here, good
RBR | adverb, comparative | better
RBS | adverb, superlative | best
RP | particle | give up
TO | to | to go, to him
UH | interjection | uh, uh-huh

CIS Intro to AI 8 Penn Treebank Tagset III
Tag | Description | Example
VB | verb, base form | take
VBD | verb, past tense | took
VBG | verb, gerund/present participle | taking
VBN | verb, past participle | taken
VBP | verb, sing. present, non-3d | take
VBZ | verb, 3rd person sing. present | takes
WDT | wh-determiner | which
WP | wh-pronoun | who, what
WP$ | possessive wh-pronoun | whose
WRB | wh-adverb | where, when

CIS Intro to AI 9 Simple Statistical Approaches: Idea 1

CIS Intro to AI 10 Simple Statistical Approaches: Idea 2 For a string of words W = w_1 w_2 w_3 … w_n, find the string of POS tags T = t_1 t_2 t_3 … t_n which maximizes P(T|W), i.e., the most likely POS tag t_i for each word w_i given its surrounding context.

CIS Intro to AI 11 The Sparse Data Problem …  A Simple, Impossible Approach to Compute P(T|W): count up instances of the string "heat oil in a large pot" in the training corpus, and pick the most common tag assignment to the string. This fails because nearly all word strings of any length never occur in the training corpus at all.

CIS Intro to AI 12 A BOTEC (Back Of The Envelope Calculation) Estimate of What We Can Estimate  What parameters can we estimate with a million words of hand-tagged training data? Assume a uniform distribution over 5,000 words and 40 part-of-speech tags.  Rich models often require vast amounts of data.  Good estimates of models with bad assumptions often outperform better models which are badly estimated.
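A rough illustration of that back-of-the-envelope arithmetic under the stated assumptions (the specific parameter counts below are my own, not the slide's):

```python
# Back-of-the-envelope estimate: how many training tokens per parameter?
corpus_tokens = 1_000_000   # hand-tagged training data
vocab = 5_000               # word types (assumed uniform)
tags = 40                   # part-of-speech tags

models = {
    "P(word | tag)":              vocab * tags,   # 200,000 parameters
    "P(tag | previous tag)":      tags ** 2,      #   1,600 parameters
    "P(tag | two previous tags)": tags ** 3,      #  64,000 parameters
}
for name, params in models.items():
    print(f"{name}: {params:,} parameters, "
          f"~{corpus_tokens / params:.0f} tokens per parameter")
```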

CIS Intro to AI 13 A Practical Statistical Tagger
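The slide's equations did not survive extraction; the standard move such taggers make is to rewrite the objective with Bayes' rule and drop the constant denominator:

\[
\hat{T} \;=\; \arg\max_T P(T \mid W)
\;=\; \arg\max_T \frac{P(W \mid T)\,P(T)}{P(W)}
\;=\; \arg\max_T P(W \mid T)\,P(T).
\]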

CIS Intro to AI 14 A Practical Statistical Tagger II But we can't accurately estimate more than tag bigrams or so… Again, we change to a model that we CAN estimate:
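The approximations the slide refers to are presumably the usual bigram HMM assumptions (reconstructed; the slide's own formulas are missing):

\[
P(T) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}),
\qquad
P(W \mid T) \approx \prod_{i=1}^{n} P(w_i \mid t_i).
\]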

CIS Intro to AI 15 A Practical Statistical Tagger III So, for a given string W = w_1 w_2 w_3 … w_n, the tagger needs to find the string of tags T which maximizes:
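(Reconstructed from the assumptions above; the slide's formula is missing.)

\[
\hat{T} \;=\; \arg\max_T \prod_{i=1}^{n} P(t_i \mid t_{i-1})\, P(w_i \mid t_i).
\]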

CIS Intro to AI 16 Training and Performance  To estimate the parameters of this model, given an annotated training corpus, use relative-frequency counts (see below).  Because many of these counts are small, smoothing is necessary for best results.  Such taggers typically achieve about 95-96% correct tagging for typical tag sets.
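(The estimation equations are missing from the transcript; the standard relative-frequency estimates are:)

\[
\hat{P}(t_i \mid t_{i-1}) = \frac{\mathrm{Count}(t_{i-1}\, t_i)}{\mathrm{Count}(t_{i-1})},
\qquad
\hat{P}(w_i \mid t_i) = \frac{\mathrm{Count}(w_i, t_i)}{\mathrm{Count}(t_i)}.
\]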

CIS Intro to AI 17 Hidden Markov Models  This model is an instance of a Hidden Markov Model. Viewed graphically: states Det, Adj, Noun, Verb, connected by transition arcs, each with an emission distribution over words, e.g.
P(w|Det): a .4, the .4
P(w|Adj): good .02, low .04
P(w|Noun): price .001, deal .0001

CIS Intro to AI 18 Viewed as a generator, an HMM moves from state to state (Det, Adj, Noun, Verb) and at each step emits a word from the current state's distribution, e.g. P(w|Det): a .4, the .4; P(w|Adj): good .02, low .04; P(w|Noun): price .001, deal .0001.

CIS Intro to AI 19 Recognition using an HMM

CIS Intro to AI 20 A Practical Statistical Tagger IV  Finding this maximum naïvely requires an exponential search through all tag strings T.  However, there is a linear-time solution using dynamic programming, called Viterbi decoding.

CIS Intro to AI 21 Parameters of an HMM  States: a set of states S = s_1, …, s_n.  Transition probabilities: A = a_{1,1}, a_{1,2}, …, a_{n,n}; each a_{i,j} represents the probability of transitioning from state s_i to s_j.  Emission probabilities: a set B of functions of the form b_i(o_t), the probability of observation o_t being emitted by state s_i.  Initial state distribution: π_i is the probability that s_i is a start state.
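A minimal sketch of how these parameters might be carried around in code (the container and its field names are my own illustration, not part of the lecture):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class HMM:
    states: list[str]   # S = s_1 ... s_n, e.g. POS tags
    vocab: list[str]    # observation symbols, e.g. words
    A: np.ndarray       # A[i, j] = a_{i,j} = P(next state s_j | current state s_i)
    B: np.ndarray       # B[i, k] = b_i(o_k) = P(observation o_k | state s_i)
    pi: np.ndarray      # pi[i] = P(the first state is s_i)

# Toy instance with uniform probabilities, just to show the shapes.
states = ["Det", "Adj", "Noun", "Verb"]
vocab = ["a", "the", "good", "low", "price", "deal"]
toy = HMM(states, vocab,
          A=np.full((4, 4), 0.25),
          B=np.full((4, 6), 1 / 6),
          pi=np.full(4, 0.25))
```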

CIS Intro to AI 22 The Three Basic HMM Problems  Problem 1 (Evaluation): given the observation sequence O = o_1, …, o_T and an HMM model λ = (A, B, π), how do we compute P(O|λ), the probability of O given the model?  Problem 2 (Decoding): given the observation sequence O = o_1, …, o_T and an HMM model λ, how do we find the state sequence that best explains the observations? (This and the following slides follow the classic formulation by Rabiner and Juang, as adapted by Manning and Schütze. Slides adapted from Dorr.)

CIS Intro to AI 23 The Three Basic HMM Problems  Problem 3 (Learning): how do we adjust the model parameters λ = (A, B, π) to maximize P(O|λ)?

CIS Intro to AI 24 Problem 1: Probability of an Observation Sequence  What is P(O|λ)?  The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM.  Naïve computation is very expensive: given T observations and N states, there are N^T possible state sequences.  Even small HMMs, e.g. T=10 and N=10, contain 10 billion different paths.  The solution to this and to Problem 2 is to use dynamic programming.

CIS Intro to AI 25 The Trellis

CIS Intro to AI 26 Forward Probabilities  What is the probability that, given an HMM λ, at time t the state is s_i and the partial observation o_1 … o_t has been generated? This is the forward probability α_t(i) = P(o_1 … o_t, q_t = s_i | λ).

CIS Intro to AI 27 Forward Probabilities

CIS Intro to AI 28 Forward Algorithm  Initialization:  Induction:  Termination:
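The formulas on this slide did not survive extraction; in the standard (Rabiner-style) notation the lecture follows, the forward algorithm is:

\[
\begin{aligned}
\text{Initialization:} \quad & \alpha_1(i) = \pi_i\, b_i(o_1), && 1 \le i \le N,\\
\text{Induction:} \quad & \alpha_{t+1}(j) = \Big[\sum_{i=1}^{N} \alpha_t(i)\, a_{i,j}\Big]\, b_j(o_{t+1}), && 1 \le t \le T-1,\ 1 \le j \le N,\\
\text{Termination:} \quad & P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i).
\end{aligned}
\]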

CIS Intro to AI 29 Forward Algorithm Complexity  The naïve approach takes O(2T · N^T) computation.  The forward algorithm, using dynamic programming, takes O(N^2 T) computations.
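A minimal NumPy sketch of the forward pass (my own illustration; the array conventions follow the HMM container sketched earlier):

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward algorithm: returns alpha (T x N) and P(O | model).

    A[i, j] = P(s_j at t+1 | s_i at t), B[i, k] = P(symbol k | state s_i),
    pi[i] = P(start in s_i), obs = sequence of observation indices o_1..o_T.
    """
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):                             # induction: O(N^2) work per step
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()                     # termination: sum over final states
```

In practice the α_t values are rescaled (or kept in log space) to avoid numerical underflow on long sequences.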

CIS Intro to AI 30 Backward Probabilities  What is the probability that, given an HMM λ and given that the state at time t is s_i, the partial observation o_{t+1} … o_T is generated? This is the backward probability β_t(i) = P(o_{t+1} … o_T | q_t = s_i, λ).  It is analogous to the forward probability, just in the other direction.

CIS Intro to AI 31 Backward Probabilities

CIS Intro to AI 32 Backward Algorithm  Initialization:  Induction:  Termination:
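(Reconstructed in the same notation; the slide's formulas are missing.)

\[
\begin{aligned}
\text{Initialization:} \quad & \beta_T(i) = 1, && 1 \le i \le N,\\
\text{Induction:} \quad & \beta_t(i) = \sum_{j=1}^{N} a_{i,j}\, b_j(o_{t+1})\, \beta_{t+1}(j), && t = T-1, \dots, 1,\\
\text{Termination:} \quad & P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i).
\end{aligned}
\]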

CIS Intro to AI 33 Problem 2: Decoding  The forward algorithm gives the sum over all paths through an HMM efficiently.  Here, we want to find the single highest-probability path.  We want the state sequence Q = q_1 … q_T that maximizes P(Q | O, λ).

CIS Intro to AI 34 Viterbi Algorithm  Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, compute the maximum  Forward:  Viterbi Recursion:
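The two recursions being contrasted, reconstructed in standard notation (the slide's own formulas are missing):

\[
\text{Forward:} \quad \alpha_{t+1}(j) = \Big[\sum_{i=1}^{N} \alpha_t(i)\, a_{i,j}\Big]\, b_j(o_{t+1})
\qquad\qquad
\text{Viterbi:} \quad \delta_{t+1}(j) = \Big[\max_{1 \le i \le N} \delta_t(i)\, a_{i,j}\Big]\, b_j(o_{t+1})
\]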

CIS Intro to AI 35 Core Idea of Viterbi Algorithm

CIS Intro to AI 36 Viterbi Algorithm  Initialization:  Induction:  Termination:  Read out path:
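The slide's formulas are missing; a minimal NumPy sketch of the same four steps (my own illustration, using the conventions of the forward() sketch above):

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Viterbi decoding: most likely state sequence and its probability."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))           # delta[t, j]: best score of a path ending in s_j at t
    psi = np.zeros((T, N), dtype=int)  # psi[t, j]:   best predecessor state of s_j at t
    delta[0] = pi * B[:, obs[0]]                          # initialization
    for t in range(1, T):                                 # induction (max instead of sum)
        scores = delta[t - 1, :, None] * A                # scores[i, j] = delta[t-1, i] * a_{i,j}
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    best_last = int(delta[-1].argmax())                   # termination
    path = [best_last]
    for t in range(T - 1, 0, -1):                         # read out path by backtracking
        path.append(int(psi[t, path[-1]]))
    return path[::-1], delta[-1].max()
```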

CIS Intro to AI 37 Problem 3: Learning  Up to now we've assumed that we know the underlying model λ = (A, B, π).  Often these parameters are estimated on annotated training data, but annotation is often difficult and/or expensive, and the training data may differ from the current data.  We want to maximize the parameters with respect to the current data, i.e., we're looking for the model λ that maximizes P(O|λ).

CIS Intro to AI 38 Problem 3: Learning (If Time Allows…)  Unfortunately, there is no known way to analytically find a global maximum, i.e., a model λ such that λ = argmax P(O|λ).  But it is possible to find a local maximum:  given an initial model λ, we can always find a model λ' such that P(O|λ') ≥ P(O|λ).

CIS Intro to AI 39 Forward-Backward (Baum-Welch) algorithm  Key idea: parameter re-estimation by hill-climbing.  From an arbitrary initial parameter instantiation λ, the FB algorithm iteratively re-estimates the parameters, improving the probability P(O|λ) that the given observation sequence O was generated by the model.

CIS Intro to AI 40 Parameter Re-estimation  Three sets of parameters need to be re-estimated: the initial state distribution π_i, the transition probabilities a_{i,j}, and the emission probabilities b_i(o_t).

CIS Intro to AI 41 Re-estimating Transition Probabilities  What's the probability of being in state s_i at time t and going to state s_j, given the current model and parameters?
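This quantity is conventionally written ξ_t(i, j); the slide's equation is missing, but the standard expression in terms of the forward and backward probabilities is:

\[
\xi_t(i, j) = P(q_t = s_i,\, q_{t+1} = s_j \mid O, \lambda)
= \frac{\alpha_t(i)\, a_{i,j}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}.
\]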

CIS Intro to AI 42 Re-estimating Transition Probabilities

CIS Intro to AI 43 Re-estimating Transition Probabilities  The intuition behind the re-estimation equation for transition probabilities: the new estimate of a_{i,j} is the expected number of transitions from state s_i to state s_j, divided by the expected number of transitions out of state s_i.  Formally:
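(Reconstructed; the slide's equation is missing.)

\[
\hat{a}_{i,j} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{k=1}^{N} \xi_t(i, k)}.
\]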

CIS Intro to AI 44 Re-estimating Transition Probabilities  Defining γ_t(i) as the probability of being in state s_i at time t, given the complete observation O, we can say:
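(Reconstructed; the slide's equations are missing.)

\[
\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j),
\qquad
\hat{a}_{i,j} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}.
\]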

CIS Intro to AI 45 Re-estimating Initial State Probabilities  Initial state distribution: π_i is the probability that s_i is a start state.  Re-estimation is easy: the new estimate is simply the expected relative frequency of being in state s_i at time 1.  Formally:
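(Reconstructed; the slide's equation is missing.)

\[
\hat{\pi}_i = \gamma_1(i).
\]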

CIS Intro to AI 46 Re-estimation of Emission Probabilities  Emission probabilities are re-estimated as the expected number of times observation symbol v_k is emitted from state s_i, divided by the expected number of times the model is in state s_i.  Formally, see the reconstruction below, where δ is the Kronecker delta function.  Note that this δ is not related to the δ in the discussion of the Viterbi algorithm!
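(Reconstructed; the slide's equation is missing.)

\[
\hat{b}_i(k) = \frac{\sum_{t=1}^{T} \delta(o_t, v_k)\, \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)},
\qquad \text{where } \delta(o_t, v_k) = 1 \text{ if } o_t = v_k \text{ and } 0 \text{ otherwise.}
\]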

CIS Intro to AI 47 The Updated Model  Coming from the current model λ = (A, B, π), we get to the updated model λ' = (A', B', π') by the re-estimation rules for a_{i,j}, b_i(k), and π_i given above.

CIS Intro to AI 48 Expectation Maximization  The forward-backward algorithm is an instance of the more general EM algorithm.  The E step: compute the forward and backward probabilities for a given model.  The M step: re-estimate the model parameters.
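A compact sketch of one EM (Baum-Welch) iteration for a single observation sequence, reusing the forward() function sketched earlier; this is my own illustration of the structure, not the lecture's code:

```python
import numpy as np

def backward(A, B, pi, obs):
    """Backward pass: beta[t, i] = P(o_{t+1} .. o_T | state s_i at time t)."""
    N, T = len(pi), len(obs)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                        # initialization
    for t in range(T - 2, -1, -1):                        # induction, run backwards
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(A, B, pi, obs):
    """One E step + M step; returns re-estimated (A, B, pi)."""
    N, T = len(pi), len(obs)
    alpha, likelihood = forward(A, B, pi, obs)            # E step: forward probabilities ...
    beta = backward(A, B, pi, obs)                        # ... and backward probabilities
    gamma = alpha * beta / likelihood                     # gamma[t, i] = P(state s_i at t | O)
    xi = np.zeros((T - 1, N, N))                          # xi[t, i, j] = P(s_i at t, s_j at t+1 | O)
    for t in range(T - 1):
        xi[t] = (alpha[t, :, None] * A *
                 B[:, obs[t + 1]] * beta[t + 1]) / likelihood
    # M step: expected counts become the new parameters.
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        emitted_k = np.array(obs) == k
        new_B[:, k] = gamma[emitted_k].sum(axis=0) / gamma.sum(axis=0)
    return new_A, new_B, new_pi
```

Iterating this step does not decrease P(O | λ), which is exactly the local-maximum property described on the earlier slides.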