Markov Models & POS Tagging. Nazife Dimililer, 23/10/2012.



Overview Markov Models Hidden Markov Models POS Tagging

Andrei Markov: Russian statistician (1856–1922) who studied temporal probability models. Markov assumption: State_t depends only on a bounded subset of State_{0:t-1}. First-order Markov process: P(State_t | State_{0:t-1}) = P(State_t | State_{t-1}). Second-order Markov process: P(State_t | State_{0:t-1}) = P(State_t | State_{t-2:t-1}).

Markov Assumption: "The future does not depend on the past, given the present." Sometimes this is called the first-order Markov assumption. A second-order assumption would mean that each event depends on the previous two events. This isn't really a crucial distinction; what's crucial is that there is some limit on how far we are willing to look back.

Markov Models: The Markov assumption means that there is only one probability to remember for each pair of event types (e.g., words), plus the probabilities of starting with a particular event. This lets us define a Markov model as a finite state automaton in which the states represent possible event types (e.g., the different words in our example) and the transitions represent the probability of one event type following another. It is easy to depict a Markov model as a graph.

What is a (Visible) Markov Model? A graphical model (can be interpreted as a Bayesian net). Circles indicate states; arrows indicate probabilistic dependencies between states. A state depends only on the previous state: "The past is independent of the future given the present."

Markov Model Formalization: ⟨S, Π, A⟩. S = {s_1 … s_N} are the values for the states; Π = {π_i} are the initial state probabilities; A = {a_ij} are the state transition probabilities.

Markov chain for weather

Markov chain for words

Markov chain = "first-order observable Markov model": a set of states Q = q_1, q_2, …, q_N; the state at time t is q_t. Transition probabilities: a set of probabilities A = a_01, a_02, …, a_n1, …, a_nn, where each a_ij represents the probability of transitioning from state i to state j; the set of these is the transition probability matrix A. Distinguished start and end states.

Markov chain = "first-order observable Markov model". Markov assumption: the current state depends only on the previous state: P(q_i | q_1 … q_{i-1}) = P(q_i | q_{i-1}).
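A chain like this can be represented directly as a transition table and sampled from. A minimal sketch; the state names and numbers here are illustrative, not taken from the lecture's figure:

```python
import random

# A[i][j] = P(next state = j | current state = i); illustrative numbers
A = {
    "hot":  {"hot": 0.6, "cold": 0.4},
    "cold": {"hot": 0.3, "cold": 0.7},
}
pi = {"hot": 0.5, "cold": 0.5}  # initial state distribution

def sample_sequence(T, seed=0):
    """Generate a state sequence of length T from the chain."""
    rng = random.Random(seed)
    seq = [rng.choices(list(pi), weights=list(pi.values()))[0]]
    while len(seq) < T:
        cur = seq[-1]
        seq.append(rng.choices(list(A[cur]), weights=list(A[cur].values()))[0])
    return seq

seq = sample_sequence(10)
```

Each row of the table is a conditional distribution, so the row entries must sum to one.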

Another representation for the start state: instead of a start state, use a special initial probability vector π, an initial distribution over start states. Constraints: each π_i ≥ 0 and Σ_i π_i = 1.

The weather model using π

The weather model: specific example

Markov chain for weather: What is the probability of 4 consecutive warm days? The sequence is warm-warm-warm-warm, i.e., state sequence 3-3-3-3: P(3, 3, 3, 3) = π_3 · a_33 · a_33 · a_33 = 0.2 × (0.6)³ = 0.0432

How about? Hot hot hot hot Cold hot cold hot What does the difference in these probabilities tell you about the real world weather info encoded in the figure?

Markov Models: a set of states {s_1, …, s_N}. The process moves from one state to another, generating a sequence of states s_{i1}, s_{i2}, …, s_{ik}, …. Markov chain property: the probability of each subsequent state depends only on the previous state: P(s_{ik} | s_{i1}, …, s_{ik-1}) = P(s_{ik} | s_{ik-1}). To define a Markov model, the following probabilities have to be specified: transition probabilities and initial probabilities.

Example of Markov Model: two states, 'Rain' and 'Dry'. Transition probabilities: P('Rain'|'Rain') = 0.3, P('Dry'|'Rain') = 0.7, P('Rain'|'Dry') = 0.2, P('Dry'|'Dry') = 0.8. Initial probabilities: say P('Rain') = 0.4, P('Dry') = 0.6.

Calculation of sequence probability: By the Markov chain property, the probability of a state sequence can be found by the chain-rule formula. Suppose we want to calculate the probability of a sequence of states in our example, {'Dry','Dry','Rain','Rain'}: P({'Dry','Dry','Rain','Rain'}) = P('Rain'|'Rain') P('Rain'|'Dry') P('Dry'|'Dry') P('Dry') = 0.3 × 0.2 × 0.8 × 0.6 = 0.0288
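The chain-rule computation is easy to mechanize. A minimal sketch using the Rain/Dry numbers from the example above:

```python
# Rain/Dry Markov model from the example
pi = {"Rain": 0.4, "Dry": 0.6}
A = {"Rain": {"Rain": 0.3, "Dry": 0.7}, "Dry": {"Rain": 0.2, "Dry": 0.8}}

def sequence_prob(seq):
    """P(seq) = P(seq[0]) * product of transition probabilities."""
    p = pi[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= A[prev][cur]
    return p

p = sequence_prob(["Dry", "Dry", "Rain", "Rain"])  # 0.6 * 0.8 * 0.2 * 0.3 ≈ 0.0288
```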

Hidden Markov Model (HMM) Evidence can be observed, but the state is hidden Three components – Priors (initial state probabilities) – State transition model – Evidence observation model Changes are assumed to be caused by a stationary process – The transition and observation models do not change

Hidden Markov models: a set of states {s_1, …, s_N}. The process moves from one state to another, generating a sequence of states; by the Markov chain property, the probability of each subsequent state depends only on the previous state. The states are not visible, but each state randomly generates one of M observations (or visible states).

Hidden Markov models: the model is represented by M = (A, B, π). To define a hidden Markov model, the following probabilities have to be specified: the matrix of transition probabilities A = (a_ij), a_ij = P(s_i | s_j); the matrix of observation probabilities B = (b_i(v_m)), b_i(v_m) = P(v_m | s_i); and a vector of initial probabilities π = (π_i), π_i = P(s_i).

What is an HMM? Graphical Model Circles indicate states Arrows indicate probabilistic dependencies between states HMM = Hidden Markov Model

What is an HMM? Green circles are hidden states Dependent only on the previous state “The past is independent of the future given the present.”

What is an HMM? Purple nodes are observed states Dependent only on their corresponding hidden state

HMM Formalism: ⟨S, K, Π, A, B⟩. S = {s_1 … s_N} are the values for the hidden states; K = {k_1 … k_M} are the values for the observations.

HMM Formalism: ⟨S, K, Π, A, B⟩. Π = {π_i} are the initial state probabilities; A = {a_ij} are the state transition probabilities; B = {b_ik} are the observation state probabilities. Note: sometimes one uses B = {b_ijk}; the output then depends on the previous state / transition as well.

Example of Hidden Markov Model (hidden states 'Low' and 'High'; observations 'Dry' and 'Rain').

Example of Hidden Markov Model: two states, 'Low' and 'High' atmospheric pressure; two observations, 'Rain' and 'Dry'. Transition probabilities: P('Low'|'Low') = 0.3, P('High'|'Low') = 0.7, P('Low'|'High') = 0.2, P('High'|'High') = 0.8. Observation probabilities: P('Rain'|'Low') = 0.6, P('Dry'|'Low') = 0.4, P('Rain'|'High') = 0.4, P('Dry'|'High') = 0.6 (so that each row sums to one). Initial probabilities: say P('Low') = 0.4, P('High') = 0.6.

Calculation of observation sequence probability: Suppose we want to calculate the probability of a sequence of observations in our example, {'Dry','Rain'}. Consider all possible hidden state sequences: P({'Dry','Rain'}) = P({'Dry','Rain'}, {'Low','Low'}) + P({'Dry','Rain'}, {'Low','High'}) + P({'Dry','Rain'}, {'High','Low'}) + P({'Dry','Rain'}, {'High','High'}), where the first term is P({'Dry','Rain'}, {'Low','Low'}) = P({'Dry','Rain'} | {'Low','Low'}) P({'Low','Low'}) = P('Dry'|'Low') P('Rain'|'Low') P('Low') P('Low'|'Low') = 0.4 × 0.6 × 0.4 × 0.3 = 0.0288
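For short sequences, the sum over all hidden state sequences can be computed by brute-force enumeration. A sketch using the example's numbers, with P('Dry'|'High') taken as 0.6 so each emission row sums to one:

```python
from itertools import product

states = ["Low", "High"]
pi = {"Low": 0.4, "High": 0.6}
A = {"Low": {"Low": 0.3, "High": 0.7}, "High": {"Low": 0.2, "High": 0.8}}
B = {"Low": {"Dry": 0.4, "Rain": 0.6}, "High": {"Dry": 0.6, "Rain": 0.4}}

def brute_force_prob(O):
    """P(O) = sum over every hidden state sequence of the joint probability."""
    total = 0.0
    for hidden in product(states, repeat=len(O)):
        p = pi[hidden[0]] * B[hidden[0]][O[0]]
        for t in range(1, len(O)):
            p *= A[hidden[t - 1]][hidden[t]] * B[hidden[t]][O[t]]
        total += p
    return total

p = brute_force_prob(["Dry", "Rain"])  # 0.0288 + 0.0448 + 0.0432 + 0.1152 ≈ 0.232
```

This enumeration is exponential in the sequence length, which is exactly why the forward procedure later in the lecture matters.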

Main issues using HMMs: Evaluation problem: given the HMM M = (A, B, π) and the observation sequence O = o_1 o_2 … o_K, calculate the probability that model M has generated sequence O. Decoding problem: given the HMM M = (A, B, π) and the observation sequence O = o_1 o_2 … o_K, calculate the most likely sequence of hidden states s_i that produced this observation sequence O. Learning problem: given some training observation sequences O = o_1 o_2 … o_K and the general structure of the HMM (numbers of hidden and visible states), determine the HMM parameters M = (A, B, π) that best fit the training data. Here O = o_1 … o_K denotes a sequence of observations o_k ∈ {v_1, …, v_M}.

HMM for Ice Cream: You are a climatologist in the year 2799 studying global warming. You can't find any records of the weather in Baltimore, MD for the summer of 2008, but you find Jason Eisner's diary, which lists how many ice creams Jason ate every day that summer. Our job: figure out how hot it was.

Hidden Markov Model For Markov chains, the output symbols are the same as the states. – See hot weather: we’re in state hot But in named-entity or part-of-speech tagging (and speech recognition and other things) – The output symbols are words – But the hidden states are something else Part-of-speech tags Named entity tags So we need an extension! A Hidden Markov Model is an extension of a Markov chain in which the input symbols are not the same as the states. This means we don’t know which state we are in.

Hidden Markov Models

POS Tagging: Current Performance. How many tags are correct? About 97% currently, but the baseline is already 90%. The baseline is the performance of the stupidest possible method: tag every word with its most frequent tag, and tag unknown words as nouns. Input: the lead paint is unsafe. Output: the/Det lead/N paint/N is/V unsafe/Adj
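The 90% baseline is easy to implement. A toy sketch; the tiny tagged corpus below is invented for illustration, not real counts:

```python
from collections import Counter, defaultdict

# Invented toy training data: (word, tag) pairs
tagged_corpus = [("the", "Det"), ("lead", "N"), ("paint", "N"),
                 ("is", "V"), ("unsafe", "Adj"), ("the", "Det"),
                 ("lead", "N"), ("paint", "N"), ("lead", "V"), ("paint", "V")]

counts = defaultdict(Counter)
for word, tag in tagged_corpus:
    counts[word][tag] += 1

def baseline_tag(word):
    if word in counts:
        return counts[word].most_common(1)[0][0]  # most frequent tag for the word
    return "N"  # unknown words tagged as nouns

tags = [baseline_tag(w) for w in "the lead paint is unsafe".split()]
# tags == ['Det', 'N', 'N', 'V', 'Adj']
```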

HMMs as POS Taggers Introduce Hidden Markov Models for part-of-speech tagging/sequence classification Cover the three fundamental questions of HMMs: – How do we fit the model parameters of an HMM? – Given an HMM, how do we efficiently calculate the likelihood of an observation w? – Given an HMM and an observation w, how do we efficiently calculate the most likely state sequence for w?

HMMs: We want a model of sequences s and observations w. Assumptions: states are tag n-grams; usually a dedicated start and end state / word; the tag/state sequence is generated by a Markov model; words are chosen independently, conditioned only on the tag/state. These are totally broken assumptions: why?

Part of speech tagging: 8 (ish) traditional parts of speech: noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc. This idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.). Called: parts-of-speech, lexical categories, word classes, morphological classes, lexical tags, POS.

POS examples: N (noun): chair, bandwidth, pacing. V (verb): study, debate, munch. ADJ (adjective): purple, tall, ridiculous. ADV (adverb): unfortunately, slowly. P (preposition): of, by, to. PRO (pronoun): I, me, mine. DET (determiner): the, a, that, those.

POS Tagging example: the/DET koala/N put/V the/DET keys/N on/P the/DET table/N

POS Tagging: Words often have more than one POS: back. The back door = JJ; on my back = NN; win the voters back = RB; promised to back the bill = VB. The POS tagging problem is to determine the POS tag for a particular instance of a word. (These examples are from Dekang Lin.)

POS tagging as a sequence classification task: We are given a sentence (an "observation" or "sequence of observations"): Secretariat is expected to race tomorrow; She promised to back the bill. What is the best sequence of tags which corresponds to this sequence of observations? Probabilistic view: consider all possible sequences of tags; out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w_1 … w_n.

Getting to HMM: We want, out of all sequences of n tags t_1 … t_n, the single tag sequence such that P(t_1 … t_n | w_1 … w_n) is highest: t̂_1 … t̂_n = argmax P(t_1 … t_n | w_1 … w_n). The hat ^ means "our estimate of the best one"; argmax_x f(x) means "the x such that f(x) is maximized".

Getting to HMM This equation is guaranteed to give us the best tag sequence But how to make it operational? How to compute this value? Intuition of Bayesian classification: – Use Bayes rule to transform into a set of other probabilities that are easier to compute

Using Bayes Rule: P(t_1 … t_n | w_1 … w_n) = P(w_1 … w_n | t_1 … t_n) P(t_1 … t_n) / P(w_1 … w_n). Since P(w_1 … w_n) is the same for every candidate tag sequence, we can drop it and maximize P(w_1 … w_n | t_1 … t_n) P(t_1 … t_n).

Likelihood and prior: the likelihood is simplified by assuming each word depends only on its own tag, P(w_1 … w_n | t_1 … t_n) ≈ Π_i P(w_i | t_i); the prior is simplified by the bigram assumption, P(t_1 … t_n) ≈ Π_i P(t_i | t_{i-1}).

Two kinds of probabilities (1): tag transition probabilities P(t_i | t_{i-1}). Determiners are likely to precede adjectives and nouns: That/DT flight/NN; The/DT yellow/JJ hat/NN. So we expect P(NN|DT) and P(JJ|DT) to be high, but P(DT|JJ) to be low. Compute P(NN|DT) by counting in a labeled corpus: P(NN|DT) = C(DT, NN) / C(DT).

Two kinds of probabilities (2): word likelihood probabilities P(w_i | t_i). VBZ (3sg pres verb) is likely to be "is". Compute P(is|VBZ) by counting in a labeled corpus: P(is|VBZ) = C(VBZ, is) / C(VBZ).
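Both kinds of probabilities are relative-frequency estimates from a tagged corpus. A sketch over two invented tagged sentences (with a hypothetical `<s>` boundary tag for sentence starts):

```python
from collections import Counter

# Invented toy data; <s> marks the sentence boundary
sentences = [[("<s>", "<s>"), ("that", "DT"), ("flight", "NN")],
             [("<s>", "<s>"), ("the", "DT"), ("yellow", "JJ"), ("hat", "NN")]]

tag_count, bigram_count, emit_count = Counter(), Counter(), Counter()
for sent in sentences:
    for (_, t1), (_, t2) in zip(sent, sent[1:]):
        bigram_count[(t1, t2)] += 1
    for w, t in sent:
        tag_count[t] += 1
        emit_count[(t, w)] += 1

def p_trans(t, prev):
    """P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})"""
    return bigram_count[(prev, t)] / tag_count[prev]

def p_emit(w, t):
    """P(w_i | t_i) = C(t_i, w_i) / C(t_i)"""
    return emit_count[(t, w)] / tag_count[t]

p_trans("NN", "DT")     # C(DT, NN) / C(DT) = 1/2
p_emit("flight", "NN")  # C(NN, flight) / C(NN) = 1/2
```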

POS tagging: likelihood and prior

An Example: the verb “race” Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN How do we pick the right tag?

Disambiguating “race”

P(NN|TO) = .00047, P(VB|TO) = .83; P(race|NN) = .00057, P(race|VB) = .00012; P(NR|VB) = .0027, P(NR|NN) = .0012. P(VB|TO) P(NR|VB) P(race|VB) = .00000027; P(NN|TO) P(NR|NN) P(race|NN) = .00000000032. So we (correctly) choose the verb reading.
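The comparison can be checked by multiplying the estimates out; the values below are the ones quoted in Jurafsky & Martin's "race" example:

```python
# Verb reading: P(VB|TO) * P(NR|VB) * P(race|VB)
p_vb = 0.83 * 0.0027 * 0.00012
# Noun reading: P(NN|TO) * P(NR|NN) * P(race|NN)
p_nn = 0.00047 * 0.0012 * 0.00057

# p_vb ≈ 2.7e-7 and p_nn ≈ 3.2e-10: the verb reading wins by
# roughly three orders of magnitude
```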

Transitions between the hidden states of HMM, showing A probs

B observation likelihoods for POS HMM

The A matrix for the POS HMM

The B matrix for the POS HMM

Viterbi intuition: we are looking for the best 'path' through the states. (Slide from Dekang Lin)

Viterbi example

Motivations and Applications: part-of-speech tagging. The representative put chairs on the table: AT NN VBD NNS IN AT NN, or AT JJ NN VBZ IN AT NN. Some tags: AT: article, NN: singular or mass noun, VBD: verb past tense, NNS: plural noun, IN: preposition, JJ: adjective.

Markov Models Probabilistic Finite State Automaton Figure 9.1

What is the probability of a sequence of states?

Example Fig 9.1

The crazy soft drink machine Fig 9.2

Probability of {lem, ice_t}? Sum over all paths taken through the HMM, starting in CP: 1 × 0.3 × 0.7 × 0.1 + 1 × 0.3 × 0.3 × 0.7 = 0.021 + 0.063 = 0.084

Inference in an HMM Compute the probability of a given observation sequence Given an observation sequence, compute the most likely hidden state sequence Given an observation sequence and set of possible models, which model most closely fits the data?

Decoding: given an observation sequence and a model, compute the probability of the observation sequence.


Dynamic Programming

Forward Procedure: The special structure gives us an efficient solution using dynamic programming. Intuition: the probability of the first t observations is the same for all possible t+1 length state sequences. Define: α_i(t) = P(o_1 … o_t, x_t = i).

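The forward procedure can be implemented in a few lines. A minimal sketch using the Low/High Rain/Dry model from earlier (assumed numbers, with P('Dry'|'High') taken as 0.6 so each emission row sums to one), not the lecture's own code:

```python
states = ["Low", "High"]
pi = {"Low": 0.4, "High": 0.6}
A = {"Low": {"Low": 0.3, "High": 0.7}, "High": {"Low": 0.2, "High": 0.8}}
B = {"Low": {"Dry": 0.4, "Rain": 0.6}, "High": {"Dry": 0.6, "Rain": 0.4}}

def forward_prob(O):
    """P(O) via the forward recursion: linear in len(O), not exponential."""
    # alpha_1(i) = pi_i * b_i(o_1)
    alpha = {s: pi[s] * B[s][O[0]] for s in states}
    for o in O[1:]:
        # alpha_{t+1}(j) = [sum_i alpha_t(i) * a_ij] * b_j(o_{t+1})
        alpha = {j: sum(alpha[i] * A[i][j] for i in states) * B[j][o]
                 for j in states}
    return sum(alpha.values())

p = forward_prob(["Dry", "Rain"])  # ≈ 0.232, matching the brute-force sum
```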

Dynamic Programming

Backward Procedure: the probability of the rest of the observations given the current state. Define: β_i(t) = P(o_{t+1} … o_T | x_t = i).

Decoding Solution: Forward procedure: P(O) = Σ_i α_i(T). Backward procedure: P(O) = Σ_i π_i b_i(o_1) β_i(1). Combination: P(O) = Σ_i α_i(t) β_i(t), for any t.

Best State Sequence: Find the state sequence that best explains the observations. Two approaches: individually most likely states; most likely sequence (Viterbi).

Best State Sequence (1)

Best State Sequence (2): Find the state sequence that best explains the observations: the Viterbi algorithm.

Viterbi Algorithm: δ_j(t) is the probability of the state sequence which maximizes the probability of seeing the observations to time t-1, landing in state j, and seeing the observation at time t.

Viterbi Algorithm, recursive computation: δ_j(t+1) = max_i δ_i(t) a_ij b_j(o_{t+1}), with backpointers ψ_j(t+1) = argmax_i δ_i(t) a_ij.

Viterbi Algorithm: compute the most likely state sequence by working backwards through the backpointers.
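The recursion plus backpointers can be sketched as follows, reusing the Low/High HMM from earlier (assumed numbers, with P('Dry'|'High') taken as 0.6):

```python
states = ["Low", "High"]
pi = {"Low": 0.4, "High": 0.6}
A = {"Low": {"Low": 0.3, "High": 0.7}, "High": {"Low": 0.2, "High": 0.8}}
B = {"Low": {"Dry": 0.4, "Rain": 0.6}, "High": {"Dry": 0.6, "Rain": 0.4}}

def viterbi(O):
    """Most likely hidden state sequence for observations O, and its probability."""
    # delta[j]: probability of the best path so far ending in state j
    delta = {s: pi[s] * B[s][O[0]] for s in states}
    backptrs = []
    for o in O[1:]:
        new_delta, ptr = {}, {}
        for j in states:
            best = max(states, key=lambda i: delta[i] * A[i][j])
            ptr[j] = best
            new_delta[j] = delta[best] * A[best][j] * B[j][o]
        delta = new_delta
        backptrs.append(ptr)
    # work backwards from the best final state
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(backptrs):
        path.insert(0, ptr[path[0]])
    return path, delta[last]

path, p = viterbi(["Dry", "Rain"])  # path == ['High', 'High'], p ≈ 0.1152
```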

HMMs and Bayesian Nets (1)

HMM and Bayesian Nets (2): the future is conditionally independent of the past given the present hidden state, because of d-separation: "The past is independent of the future given the present."

Inference in an HMM Compute the probability of a given observation sequence Given an observation sequence, compute the most likely hidden state sequence Given an observation sequence and set of possible models, which model most closely fits the data?

Dynamic Programming

Parameter Estimation: Given an observation sequence, find the model that is most likely to produce that sequence. There is no analytic method; instead, given a model and an observation sequence, update the model parameters to better fit the observations.

Parameter Estimation: the probability of traversing an arc at time t, p_t(i, j) = α_i(t) a_ij b_j(o_{t+1}) β_j(t+1) / P(O); and the probability of being in state i at time t, γ_i(t) = Σ_j p_t(i, j).

Parameter Estimation: now we can compute the new estimates of the model parameters: π̂_i = γ_i(1); â_ij = Σ_t p_t(i, j) / Σ_t γ_i(t); b̂_i(k) = Σ_{t : o_t = k} γ_i(t) / Σ_t γ_i(t).

Instance of Expectation Maximization: We have that the likelihood never decreases across iterations, P(O | model_{k+1}) ≥ P(O | model_k). We may get stuck in a local maximum (or even a saddle point); nevertheless, Baum-Welch is usually effective.
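One Baum-Welch re-estimation step can be sketched end to end. This is an illustrative implementation over the Low/High model (assumed numbers, with P('Dry'|'High') taken as 0.6) and an invented short observation sequence, not the lecture's code; the check at the end reflects the EM guarantee that the likelihood does not decrease:

```python
states = ["Low", "High"]
symbols = ["Dry", "Rain"]
pi = {"Low": 0.4, "High": 0.6}
A = {"Low": {"Low": 0.3, "High": 0.7}, "High": {"Low": 0.2, "High": 0.8}}
B = {"Low": {"Dry": 0.4, "Rain": 0.6}, "High": {"Dry": 0.6, "Rain": 0.4}}
O = ["Dry", "Rain", "Dry"]  # invented training sequence

def forward(pi, A, B, O):
    alpha = [{s: pi[s] * B[s][O[0]] for s in states}]
    for t in range(1, len(O)):
        alpha.append({j: sum(alpha[t - 1][i] * A[i][j] for i in states) * B[j][O[t]]
                      for j in states})
    return alpha

def backward(A, B, O):
    beta = [{s: 1.0 for s in states}]
    for t in range(len(O) - 2, -1, -1):
        beta.insert(0, {i: sum(A[i][j] * B[j][O[t + 1]] * beta[0][j] for j in states)
                        for i in states})
    return beta

def baum_welch_step(pi, A, B, O):
    alpha, beta = forward(pi, A, B, O), backward(A, B, O)
    T = len(O)
    PO = sum(alpha[-1][s] for s in states)
    # gamma[t][i]: probability of being in state i at time t
    gamma = [{i: alpha[t][i] * beta[t][i] / PO for i in states} for t in range(T)]
    # p[t][(i, j)]: probability of traversing arc i -> j at time t
    p = [{(i, j): alpha[t][i] * A[i][j] * B[j][O[t + 1]] * beta[t + 1][j] / PO
          for i in states for j in states} for t in range(T - 1)]
    new_pi = {i: gamma[0][i] for i in states}
    new_A = {i: {j: sum(p[t][(i, j)] for t in range(T - 1)) /
                    sum(gamma[t][i] for t in range(T - 1)) for j in states}
             for i in states}
    new_B = {i: {k: sum(gamma[t][i] for t in range(T) if O[t] == k) /
                    sum(gamma[t][i] for t in range(T)) for k in symbols}
             for i in states}
    return new_pi, new_A, new_B

pi2, A2, B2 = baum_welch_step(pi, A, B, O)
p_old = sum(forward(pi, A, B, O)[-1].values())
p_new = sum(forward(pi2, A2, B2, O)[-1].values())  # p_new >= p_old by the EM guarantee
```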

Some Variants So far, ergodic models – All states are connected – Not always wanted Epsilon or null-transitions – Not all states/transitions emit output symbols Parameter tying – Assuming that certain parameters are shared – Reduces the number of parameters that have to be estimated Logical HMMs (Kersting, De Raedt, Raiko) – Working with structured states and observation symbols Working with log probabilities and addition instead of multiplication of probabilities (typically done)

The Most Important Thing: We can use the special structure of this model to do a lot of neat math and solve problems that are otherwise not solvable.