Albert Gatt Corpora and Statistical Methods

Acknowledgement Some of the examples in this lecture are taken from a tutorial on HMMs by Wolfgang Maass.

Talking about the weather Suppose we want to predict tomorrow's weather. The possible predictions are: sunny, foggy, rainy. We might decide to predict tomorrow's outcome based on earlier weather: if it's been sunny all week, it's likelier to be sunny tomorrow than if it had been rainy all week. How far back do we want to go to predict tomorrow's weather?

Statistical weather model Notation: S: the state space, a set of possible values for the weather: {sunny, foggy, rainy} (each state is identifiable by an integer i). X: a sequence of random variables, each taking a value from S; these model the weather over a sequence of days. t is an integer standing for time. (X_1, X_2, X_3, …, X_T) models the values of a series of random variables; each takes a value from S with a certain probability P(X_t = s_i). The entire sequence tells us the weather over T days.

Statistical weather model If we want to predict the weather for day t+1, our model might look like this: P(X_{t+1} = s | X_1, X_2, …, X_t), e.g. P(weather tomorrow = sunny), conditional on the weather in the past t days. Problem: the larger t gets, the more calculations we have to make.

Markov Properties I: Limited horizon The probability that we're in state s_i at time t+1 only depends on where we were at time t: P(X_{t+1} = s_i | X_1, …, X_t) = P(X_{t+1} = s_i | X_t). Given this assumption, the probability of any sequence is just: P(X_1, …, X_T) = P(X_1) P(X_2 | X_1) P(X_3 | X_2) … P(X_T | X_{T-1}).

Markov Properties II: Time invariance The probability of being in state s_i given the previous state does not change over time: P(X_{t+1} = s_i | X_t = s_j) = P(X_2 = s_i | X_1 = s_j), for all t.

Concrete instantiation (Table: transition probabilities from the weather on day t — sunny, rainy, foggy — to the weather on day t+1.) This is essentially a transition matrix, which gives us the probabilities of going from one state to another. We can denote state transition probabilities as a_ij (the probability of going from state i to state j).

Graphical view Components of the model: 1. states (s) 2. transitions 3. transition probabilities 4. initial probability distribution for states Essentially, a non-deterministic finite state automaton.

Example continued If the weather today (X_t) is sunny, what's the probability that tomorrow (X_{t+1}) is sunny and the day after (X_{t+2}) is rainy? By the Markov assumption: P(X_{t+1} = sunny, X_{t+2} = rainy | X_t = sunny) = P(X_{t+1} = sunny | X_t = sunny) × P(X_{t+2} = rainy | X_{t+1} = sunny) = a_{sunny,sunny} × a_{sunny,rainy}.
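As a minimal sketch of this calculation in Python, with hypothetical transition probabilities standing in for the values of the transition table above (the numbers are illustrative, not the lecture's):

```python
# Hypothetical transition probabilities P(weather tomorrow | weather today).
# Each row sums to 1; substitute the values from the transition matrix above.
A = {
    "sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
    "rainy": {"sunny": 0.2, "rainy": 0.6,  "foggy": 0.2},
    "foggy": {"sunny": 0.2, "rainy": 0.3,  "foggy": 0.5},
}

# P(X_{t+1}=sunny, X_{t+2}=rainy | X_t=sunny)
#   = P(X_{t+1}=sunny | X_t=sunny) * P(X_{t+2}=rainy | X_{t+1}=sunny)
# by the limited-horizon (Markov) assumption.
p = A["sunny"]["sunny"] * A["sunny"]["rainy"]
print(p)  # 0.04 with these illustrative numbers
```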

Formal definition A Markov Model is a triple (S, Π, A) where: S is the set of states, Π are the probabilities of being initially in some state, and A are the transition probabilities.

Part 2 Hidden Markov Models

A slight variation on the example You're locked in a room with no windows. You can't observe the weather directly; you only observe whether the guy who brings you food is carrying an umbrella or not. We need a model telling us the probability of seeing the umbrella, given the weather: a distinction between observations and their underlying emitting state. Define: O_t as an observation at time t, and K = {+umbrella, -umbrella} as the possible outputs. We're interested in P(O_t = k | X_t = s_i), i.e. the probability of a given observation at t, given that the underlying weather state at t is s_i.

Symbol emission probabilities P(umbrella | weather): sunny 0.1, rainy 0.8, foggy 0.3. This is the hidden model, telling us the probability that O_t = k given that X_t = s_i. We assume that each underlying state X_t = s_i emits an observation with a given probability.

Using the hidden model The model gives: P(O_t = k | X_t = s_i). Then, by Bayes' Rule, we can compute: P(X_t = s_i | O_t = k) = P(O_t = k | X_t = s_i) P(X_t = s_i) / P(O_t = k). This generalises easily to an entire sequence.
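A small Python sketch of this use of Bayes' Rule, taking the emission probabilities from the table above; the prior over weather states is a hypothetical addition for illustration only:

```python
# Emission probabilities P(umbrella | weather), from the table above.
p_umbrella_given_weather = {"sunny": 0.1, "rainy": 0.8, "foggy": 0.3}

# Hypothetical prior P(weather) -- not given in the lecture, assumed for the example.
p_weather = {"sunny": 0.5, "rainy": 0.3, "foggy": 0.2}

# Bayes' Rule: P(X_t = s_i | O_t = umbrella)
#   = P(O_t = umbrella | X_t = s_i) * P(X_t = s_i) / P(O_t = umbrella)
evidence = sum(p_umbrella_given_weather[w] * p_weather[w] for w in p_weather)
posterior = {w: p_umbrella_given_weather[w] * p_weather[w] / evidence
             for w in p_weather}
print(posterior)  # rainy comes out most probable given that the umbrella was seen
```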

HMM in graphics Circles indicate states Arrows indicate probabilistic dependencies between states

HMM in graphics  Green nodes are hidden states  Each hidden state depends only on the previous state (Markov assumption)

Why HMMs? HMMs are a way of thinking of underlying events probabilistically generating surface events. Example: parts of speech. A POS is a class or set of words. We can think of language as an underlying Markov Chain of parts of speech from which actual words are generated ("emitted"). So what are our hidden states here, and what are the observations?

HMMs in POS Tagging (Figure: hidden states DET, ADJ, N, V connected by transitions.) Hidden layer (constructed through training); models the sequence of POSs in the training corpus.

HMMs in POS Tagging (Figure: hidden states emitting words — DET emits "the", ADJ emits "tall", N emits "lady", V emits "is".) Observations are words. They are "emitted" by their corresponding hidden state. The state depends on its previous state.

Why HMMs There are efficient algorithms to train HMMs using Expectation Maximisation. General idea: the training data is assumed to have been generated by some HMM (parameters unknown); we try to learn the unknown parameters from the data. A similar idea is used in finding the parameters of some n-gram models, especially those that use interpolation.

Part 3 Formalisation of a Hidden Markov model

Crucial ingredients (familiar) Underlying states: S = {s_1, …, s_N}. Output alphabet (observations): K = {k_1, …, k_M}. State transition probabilities: A = {a_ij}, i, j ∈ S. State sequence: X = (X_1, …, X_{T+1}), plus a function mapping each X_t to a state s. Output sequence: O = (o_1, …, o_T), where each o_t ∈ K.

Crucial ingredients (additional) Initial state probabilities: Π = {π_i}, i ∈ S (tell us the initial probability of each state). Symbol emission probabilities: B = {b_ijk}, i, j ∈ S, k ∈ K (tell us the probability b of seeing observation O_t = k, given that X_t = s_i and X_{t+1} = s_j).
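As a sketch (not part of the lecture), these ingredients can be laid out as NumPy arrays for the weather/umbrella example; the values of A and B below are placeholders, and B uses the arc-emission form b_ijk defined above:

```python
import numpy as np

states = ["sunny", "rainy", "foggy"]       # S, with N = 3
outputs = ["+umbrella", "-umbrella"]       # K, with M = 2
N, M = len(states), len(outputs)

Pi = np.array([0.5, 0.3, 0.2])             # initial state probabilities (assumed values)
A = np.full((N, N), 1.0 / N)               # a_ij: placeholder transition probabilities
B = np.full((N, N, M), 1.0 / M)            # b_ijk: placeholder arc-emission probabilities

# Sanity checks: every distribution sums to 1.
assert np.isclose(Pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=2), 1.0)
```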

Trellis diagram of an HMM (Figure: states s_1, s_2, s_3, with the transition probabilities a_{1,1}, a_{1,2}, a_{1,3} out of s_1 shown.)

Trellis diagram of an HMM (Figure: the same states, now paired with the observation sequence o_1, o_2, o_3 at times t_1, t_2, t_3.)

Trellis diagram of an HMM (Figure: as above, now with arc-emission probabilities b_{1,1,k}, b_{1,2,k}, b_{1,3,k} labelling the transitions out of s_1.)

The fundamental questions for HMMs 1. Given a model μ = (A, B, Π), how do we compute the likelihood of an observation, P(O|μ)? 2. Given an observation sequence O and a model μ, which is the state sequence (X_1, …, X_{T+1}) that best explains the observations? This is the decoding problem. 3. Given an observation sequence O and a space of possible models μ = (A, B, Π), which model best explains the observed data?

Application of question 1 (ASR) Given a model μ = (A, B, Π), how do we compute the likelihood of an observation P(O|μ)? The input to an ASR system is a continuous stream of sound waves, which is ambiguous; we need to decode it into a sequence of phones. Is the input the sequence [n iy d] or [n iy]? Which sequence is the most probable?

Application of question 2 (POS Tagging) Given an observation sequence O and a model μ, which is the state sequence (X_1, …, X_{T+1}) that best explains the observations? This is the decoding problem. Consider a POS tagger. Input observation sequence: I can read. We need to find the most likely sequence of underlying POS tags: e.g. is can a modal verb or a noun? How likely is it that can is a noun, given that the previous word is a pronoun?

Part 4 Finding the probability of an observation sequence

Simplified trellis diagram representation (Figure: a trellis whose hidden layer runs from a start state through phone states to an end state, with the observation sequence o_1, …, o_T beneath it.) Hidden layer: transitions between sounds forming the words need, knee… This is our model.

Simplified trellis diagram representation (Figure: the same trellis, highlighting the observation sequence o_1, …, o_T.) The visible layer is what the ASR system is given as input.

Computing the probability of an observation (Figure: the trellis again, now relating the hidden phone states to the observations o_1, …, o_T.)

Computing the probability of an observation (Figure: trellis with hidden states x_1, …, x_{T+1} above the observation sequence o_1, …, o_T.) We want P(O|μ), the probability of the observation sequence given the model. Summing over all possible state sequences X = (x_1, …, x_{T+1}): P(O|μ) = Σ_X P(O|X,μ) P(X|μ), where P(X|μ) = π_{x_1} a_{x_1 x_2} a_{x_2 x_3} … a_{x_T x_{T+1}} and P(O|X,μ) = b_{x_1 x_2 o_1} b_{x_2 x_3 o_2} … b_{x_T x_{T+1} o_T}. Computed directly, this sum ranges over all N^{T+1} possible state sequences, which quickly becomes prohibitively expensive.
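To make the summation concrete, here is a brute-force sketch that enumerates every state sequence; it assumes the Pi, A, B arrays set up in the earlier sketch and is exponential in T, so it is for illustration only:

```python
import itertools

def likelihood_brute_force(obs, Pi, A, B):
    """P(O | mu) by summing over every state sequence x_1 ... x_{T+1}.
    obs is a list of output-symbol indices; Pi, A, B are NumPy arrays as sketched earlier."""
    N, T = len(Pi), len(obs)
    total = 0.0
    for seq in itertools.product(range(N), repeat=T + 1):
        p = Pi[seq[0]]                                                  # pi_{x_1}
        for t in range(T):
            p *= A[seq[t], seq[t + 1]] * B[seq[t], seq[t + 1], obs[t]]  # a_ij * b_{ij o_t} per step
        total += p
    return total
```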

A final word on observation probabilities Since we're computing the probability of an observation given a model, we can use these methods to compare different models: if we take the observations in our corpus as given, then the best model is the one which maximises the probability of these observations (useful for training/parameter setting).

Part 5 The forward procedure

Forward Procedure Given our phone input, how do we decide whether the actual word is need, knee, …? We could compute P(O|μ) for every single word, but this is highly expensive in terms of computation.

Forward procedure An efficient solution to the problem, based on dynamic programming (memoisation): rather than performing separate computations for all possible sequences X, we keep partial solutions in memory.

Forward procedure A network representation of all sequences (X) of states that could generate the observations, and the sum of probabilities for those sequences. E.g. O = [n iy] could be generated by X1 = [n iy d] (need) or X2 = [n iy t] (neat); shared histories can help us save on memory. Fundamental assumption: given several state sequences of length t+1 with a shared history up to t, the probability of the first t observations is the same in all of them.

Forward Procedure The probability of the first t observations is the same for all possible state sequences of length t+1. Define a forward variable: α_i(t) = P(o_1 o_2 … o_{t-1}, X_t = s_i | μ), the probability of ending up in state s_i at time t after seeing observations 1 to t-1.

Forward Procedure: initialisation Define: α_i(1) = π_i. The probability of being in state s_i first is just equal to the initialisation probability.

Forward Procedure (inductive step) α_j(t+1) = Σ_{i=1..N} α_i(t) a_ij b_{ij o_t}: to end up in state s_j at time t+1, we sum, over all states s_i, the probability of reaching s_i at time t multiplied by the probability of moving from s_i to s_j while emitting observation o_t.
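A minimal NumPy sketch of the whole forward procedure under the definitions above (arc emissions b_ijk); the function and variable names are mine, and Pi, A, B are assumed to be the arrays from the earlier sketch:

```python
import numpy as np

def forward(obs, Pi, A, B):
    """Return alpha of shape (T+1, N); with 0-based indexing,
    alpha[t, i] = P(o_1 ... o_t, X_{t+1} = s_i | mu), so alpha[0] is the initialisation."""
    N, T = len(Pi), len(obs)
    alpha = np.zeros((T + 1, N))
    alpha[0] = Pi                                  # initialisation: alpha_i(1) = pi_i
    for t in range(T):
        # inductive step: alpha_j(t+1) = sum_i alpha_i(t) * a_ij * b_{ij o_t}
        alpha[t + 1] = alpha[t] @ (A * B[:, :, obs[t]])
    return alpha

def likelihood_forward(obs, Pi, A, B):
    return forward(obs, Pi, A, B)[-1].sum()        # P(O | mu) = sum_i alpha_i(T+1)
```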

Looking backward The forward procedure caches the probability of sequences of states leading up to an observation (left to right). The backward procedure works the other way: the probability of seeing the rest of the observation sequence, given that we were in some state at some time.

Backward procedure: basic structure Define: β_i(t) = P(o_t … o_T | X_t = s_i, μ), the probability of the remaining observations given that the current observation is emitted from state i. Initialise: β_i(T+1) = 1 (the probability at the final state). Inductive step: β_i(t) = Σ_{j=1..N} a_ij b_{ij o_t} β_j(t+1). Total: P(O|μ) = Σ_{i=1..N} π_i β_i(1).
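A corresponding sketch of the backward procedure under the same assumptions:

```python
import numpy as np

def backward(obs, Pi, A, B):
    """Return beta of shape (T+1, N); with 0-based indexing,
    beta[t, i] = P(o_{t+1} ... o_T | X_{t+1} = s_i, mu), so beta[T] is the final-state step."""
    N, T = len(Pi), len(obs)
    beta = np.zeros((T + 1, N))
    beta[T] = 1.0                                  # initialisation at the final state
    for t in range(T - 1, -1, -1):
        # inductive step: beta_i(t) = sum_j a_ij * b_{ij o_t} * beta_j(t+1)
        beta[t] = (A * B[:, :, obs[t]]) @ beta[t + 1]
    return beta

def likelihood_backward(obs, Pi, A, B):
    return (Pi * backward(obs, Pi, A, B)[0]).sum() # total: P(O | mu) = sum_i pi_i * beta_i(1)
```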

Combining forward & backward variables Our two variables can be combined: the likelihood of being in state i at time t, together with our sequence of observations, is a function of: the probability of ending up in i at t given what came previously, and the probability of the rest of the sequence given that we are in i at t. Therefore: P(O, X_t = s_i | μ) = α_i(t) β_i(t), and so P(O|μ) = Σ_i α_i(t) β_i(t) for any choice of t.
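In code, the combination is a single elementwise product; this sketch reuses the forward and backward functions from the previous sketches (the function name is mine):

```python
def state_posteriors(obs, Pi, A, B):
    """gamma[t, i] = P(X_{t+1} = s_i | O, mu): the posterior probability of each state at each time."""
    alpha = forward(obs, Pi, A, B)       # from the forward-procedure sketch
    beta = backward(obs, Pi, A, B)       # from the backward-procedure sketch
    # P(O, X_t = s_i | mu) = alpha_i(t) * beta_i(t); summing over i (at any t) gives P(O | mu).
    likelihood_O = (alpha * beta)[0].sum()
    return alpha * beta / likelihood_O
```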

Part 6 Decoding: Finding the best state sequence

Best state sequence: example Consider the ASR problem again. Input observation sequence: [aa n iy dh ax] (corresponds to I need the…). Possible solutions: I need a…, I need the…, I kneed a…, … NB: each possible solution corresponds to a state sequence. The problem is to find the best word segmentation and the most likely underlying phonetic input.

Some difficulties… If we focus on the likelihood of each individual state, we run into problems: context effects mean that what is individually likely may together yield an unlikely sequence. The ASR program needs to look at the probability of entire sequences.

Viterbi algorithm Given an observation sequence O and a model μ, find: argmax_X P(X, O|μ), the sequence of states X such that P(X, O|μ) is highest. Basic idea: run a type of forward procedure (which computes the probability of all possible paths), store partial solutions, and at the end, look back to find the best path.

Illustration: path through the trellis (Figure: a trellis over states S_1 … S_4 and times t = 1, 2, ….) At every node (state) and time, we store: the likelihood of reaching that state at that time by the most probable path leading to that state (denoted δ), and the preceding state leading to the current state (denoted ψ).

Viterbi Algorithm: definitions δ_j(t) = max_{x_1 … x_{t-1}} P(x_1 … x_{t-1}, o_1 … o_{t-1}, X_t = j | μ): the probability of the most probable path through observations 1 to t-1, landing us in state j at t.

Viterbi Algorithm: initialisation δ_j(1) = π_j: the probability of being in state j at the beginning is just the initialisation probability of state j.

Viterbi Algorithm: inductive step δ_j(t+1) = max_{1≤i≤N} δ_i(t) a_ij b_{ij o_t}: the probability of being in j at t+1 by the best path depends on the state i for which the product of δ_i(t), the transition probability a_ij, and the probability of emitting the symbol o_t on that transition is highest.

Viterbi Algorithm: inductive step Backtrace store: ψ_j(t+1) = argmax_{1≤i≤N} δ_i(t) a_ij b_{ij o_t}, the most probable state from which state j can be reached.

Illustration (Figure: a trellis over states S_1 … S_4, with the most probable path into state 2 at t = 6 marked.) δ_2(t=6) = the probability of reaching state 2 at time t=6 by the most probable path (marked) through state 2 at t=6. ψ_2(t=6) = 3: state 3 is the state preceding state 2 at t=6 on the most probable path through state 2 at t=6.

Viterbi Algorithm: backtrace The best state at the end of the sequence is the state i for which the probability δ_i(T+1) is highest: X̂_{T+1} = argmax_i δ_i(T+1).

Viterbi Algorithm: backtrace Work backwards to the most likely preceding state: X̂_t = ψ_{X̂_{t+1}}(t+1).

Viterbi Algorithm: backtrace The probability of the best state sequence is the maximum value stored for the final time step: P(X̂) = max_i δ_i(T+1).
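A sketch of the complete Viterbi algorithm, putting the initialisation, inductive step, and backtrace together; it assumes the same arc-emission arrays Pi, A, B as before, and delta and psi follow the δ and ψ of the preceding slides:

```python
import numpy as np

def viterbi(obs, Pi, A, B):
    """Return the most probable state sequence x_1 ... x_{T+1} (as indices) and its probability."""
    N, T = len(Pi), len(obs)
    delta = np.zeros((T + 1, N))              # delta[t, j]: probability of the best path into j
    psi = np.zeros((T + 1, N), dtype=int)     # psi[t, j]: best predecessor of j at that step
    delta[0] = Pi                             # initialisation: delta_j(1) = pi_j
    for t in range(T):
        # scores[i, j] = delta_i(t) * a_ij * b_{ij o_t}
        scores = delta[t][:, None] * A * B[:, :, obs[t]]
        delta[t + 1] = scores.max(axis=0)     # inductive step
        psi[t + 1] = scores.argmax(axis=0)    # backtrace pointer
    # Backtrace: start from the best final state and follow psi backwards.
    path = [int(delta[T].argmax())]
    for t in range(T, 0, -1):
        path.append(int(psi[t, path[-1]]))
    path.reverse()
    return path, float(delta[T].max())
```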

Summary We’ve looked at two algorithms for solving two of the fundamental problems of HMMS: likelihood of an observation sequence given a model (Forward/Backward Procedure) most likely underlying state, given an observation sequence (Viterbi Algorithm) Next up: we look at POS tagging