CSCE 771 Natural Language Processing


CSCE 771 Natural Language Processing
Lecture 8: Training HMMs - Learning the Model, Forward-Backward Algorithm
Topics: Overview
Readings: Chapter 6
February 11, 2013

Overview
Last Time
- Hidden Markov Models revisited (Chapter 6): the three problems - likelihood, decoding, training (learning the model)
- NLTK book, Chapter 5: tagging
- Videos on NLP on YouTube, from Coursera etc.
Today
- Computational complexity for Problems 1 and 2: the straightforward approach sums over all possible state sequences, O(N^T); dynamic programming (the forward algorithm) reduces this to O(N^2 T)
- Problem 3 - learning the model: backward computation and the forward-backward algorithm

Ferguson's Three Fundamental Problems
1. Computing likelihood: given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ).
2. The decoding problem: given an HMM λ = (A, B) and an observation sequence O = o_1, o_2, ..., o_T, find the most probable sequence of states Q = q_1, q_2, ..., q_T.
3. Learning the model: given an observation sequence and the set of possible states in the HMM, learn the parameters A and B.

Problem 1: Computing Likelihood
Computing likelihood: given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ).
Example (ice-cream counts vs. hot/cold days):
P(O | Q) = P(3 1 3 | H H C) = P(3 | H) × P(1 | H) × P(3 | C)
(Figure from Jurafsky & Martin, Speech and Language Processing, 2nd ed.)

Likelihood Computation
P(O, Q) = P(O | Q) P(Q) = ∏_{i=1}^{T} P(o_i | q_i) × ∏_{i=1}^{T} P(q_i | q_{i-1})
So P(3 1 3, hot hot cold) = P(3 | hot) P(1 | hot) P(3 | cold) × P(hot | start) P(hot | hot) P(cold | hot)
In general, P(O) = Σ_Q P(O, Q) = Σ_Q P(O | Q) P(Q), which sums over all possible sequences of states.
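To make the joint-probability formula concrete, here is a minimal Python sketch that multiplies out the two products for one observation sequence and one state sequence. The transition and emission values below are illustrative placeholders rather than the textbook's exact numbers, and the dictionaries A and B are just one convenient way to store the parameters.

```python
# Minimal sketch of P(O, Q) = prod_i P(o_i | q_i) * prod_i P(q_i | q_{i-1}).
# Parameter values are illustrative placeholders, not the textbook's numbers.
A = {("start", "hot"): 0.8, ("start", "cold"): 0.2,
     ("hot", "hot"): 0.7,  ("hot", "cold"): 0.3,
     ("cold", "hot"): 0.4, ("cold", "cold"): 0.6}      # P(q_i | q_{i-1})
B = {"hot":  {1: 0.2, 2: 0.4, 3: 0.4},
     "cold": {1: 0.5, 2: 0.4, 3: 0.1}}                 # P(o_i | q_i)

def joint_prob(obs, path):
    """P(O, Q) for one observation sequence and one state sequence."""
    p = 1.0
    prev = "start"
    for o, q in zip(obs, path):
        p *= A[(prev, q)] * B[q][o]
        prev = q
    return p

print(joint_prob([3, 1, 3], ["hot", "hot", "cold"]))
```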

Likelihood Computation Performance
The direct sum over all sequences of states is exponential; dynamic programming to the rescue.
Forward algorithm, O(N²T): compute a trellis of values α_t(j) for each state j and each time t,
α_t(j) = Σ_{i=1}^{N} α_{t-1}(i) a_ij b_j(o_t)
where
- α_{t-1}(i) is the previous forward probability,
- a_ij is the state transition probability (from A), and
- b_j(o_t) is the state observation likelihood of observation o_t given the current state j (from B).

Forward Trellis (structure like the Viterbi trellis)
(Figure from Jurafsky & Martin, Speech and Language Processing, 2nd ed.)

Forward Algorithm
(Figure from Jurafsky & Martin, Speech and Language Processing, 2nd ed.)
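A minimal Python/NumPy sketch of the forward algorithm described above. It assumes the HMM is given as a transition matrix A (N x N), an emission matrix B (N x V), and an initial distribution pi; these array names and the example parameter values are my own conventions, not ones fixed by the textbook.

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward algorithm: returns the trellis alpha and P(O | lambda).
    A[i, j] = P(state j at t | state i at t-1)
    B[j, o] = P(observation o | state j)
    pi[j]   = P(state j at t = 1)
    obs     = list of observation indices o_1 ... o_T
    """
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                 # initialization
    for t in range(1, T):                        # recursion: O(N^2) per time step
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[T - 1].sum()             # termination: sum over final states

# Example with two states (0 = hot, 1 = cold), observations = # ice creams eaten,
# using illustrative parameter values.
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.0, 0.2, 0.4, 0.4],   # columns indexed by 0..3 ice creams
              [0.0, 0.5, 0.4, 0.1]])
pi = np.array([0.8, 0.2])
alpha, likelihood = forward(A, B, pi, [3, 1, 3])
print(likelihood)
```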

Problem 2: The Decoding Problem
The decoding problem: given an HMM λ = (A, B) and an observation sequence O = o_1, o_2, ..., o_T, find the most probable sequence of states Q = q_1, q_2, ..., q_T.
Viterbi revisited:
- matching tags to words
- matching spectral features (voice) to phonemes
- matching hot/cold days to the number of ice creams eaten

Viterbi, the man
Andrew James Viterbi, Ph.D. (born March 9, 1935 in Bergamo, Italy) is an Italian-American electrical engineer and businessman (BS and MS from MIT, PhD from the University of Southern California). In 1967 he invented the Viterbi algorithm, which he used for decoding convolutionally encoded data; it is now used widely in cellular phones for error-correcting codes, as well as for speech recognition and DNA analysis. He co-developed Code Division Multiple Access (CDMA) wireless technology, and USC's Viterbi School of Engineering is named for his $52 million donation to the school.

Viterbi Backtrace Pointers
(Figure from Jurafsky & Martin, Speech and Language Processing, 2nd ed.)
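For comparison with the forward sketch above, here is a minimal Viterbi decoder in the same (assumed) conventions for A, B, and pi; it replaces the forward algorithm's sum with a max and keeps backpointers for the final backtrace.

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Return the most probable state sequence and its probability."""
    N, T = A.shape[0], len(obs)
    v = np.zeros((T, N))                 # v[t, j] = best path probability ending in j at t
    bp = np.zeros((T, N), dtype=int)     # backpointers
    v[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = v[t - 1][:, None] * A          # scores[i, j] = v[t-1, i] * a_ij
        bp[t] = scores.argmax(axis=0)
        v[t] = scores.max(axis=0) * B[:, obs[t]]
    # backtrace from the best final state
    path = [int(v[T - 1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(bp[t, path[-1]]))
    return list(reversed(path)), v[T - 1].max()
```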

Problem 3: Learning the Model (HMM)
Learning the model: given an observation sequence and the set of possible states in the HMM, learn the parameters A and B.
First, consider the (fully observed) Markov chain case.

Markov Chain case
When the state sequence is observed (an ordinary Markov chain rather than a hidden one), maximum likelihood estimation is just counting:
â_ij = C(i → j) / Σ_q C(i → q)
i.e., the number of times state i is followed by state j, divided by the total number of transitions out of i.
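As a quick illustration of the counting estimate above, a short sketch that tabulates transition counts from an observed state sequence and normalizes each row (the function and variable names are arbitrary):

```python
from collections import Counter, defaultdict

def estimate_transitions(state_seq):
    """MLE of a_ij from a fully observed state sequence: count and normalize."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(state_seq, state_seq[1:]):
        counts[prev][nxt] += 1
    return {i: {j: c / sum(row.values()) for j, c in row.items()}
            for i, row in counts.items()}

print(estimate_transitions(["hot", "hot", "cold", "hot", "cold", "cold"]))
```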

Forward-Backward Algorithm
The forward-backward algorithm is sometimes called the Baum-Welch algorithm. Two ideas:
- iteratively estimate the counts/probabilities, and
- compute the probability of an observation using the forward algorithm, then distribute that probability mass over the different paths that contributed to it.

Forward-Backward Algorithm (continued)
The algorithm combines:
- forward probabilities α_t(j) = P(o_1, ..., o_t, q_t = j | λ),
- backward probabilities β_t(i) = P(o_{t+1}, ..., o_T | q_t = i, λ), and
- Bayes' rule, to turn these into expected counts.

Figure 6.13: Computation of the backward probabilities β_t(i)
(Figure from Jurafsky & Martin, Speech and Language Processing, 2nd ed.)

Algorithm for the backward probabilities β_t(i)
1. Initialization: β_T(i) = 1 for all i (or β_T(i) = a_iF if the HMM has an explicit final state F).
2. Recursion: β_t(i) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j), for t = T-1, ..., 1.
3. Termination: P(O | λ) = Σ_{j=1}^{N} π_j b_j(o_1) β_1(j).
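A backward-pass sketch in the same NumPy conventions assumed in the forward sketch above (no explicit final state, so β_T(i) = 1):

```python
import numpy as np

def backward(A, B, obs):
    """Backward algorithm: beta[t, i] = P(o_{t+1} ... o_T | q_t = i)."""
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0                                   # initialization
    for t in range(T - 2, -1, -1):                      # recursion, t = T-2 .. 0
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta
```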

Figure 6.14: Computation of the joint probability of being in state i at time t and state j at time t+1
(Figure from Jurafsky & Martin, Speech and Language Processing, 2nd ed.)

Estimating a_ij
Intuitively, â_ij = (expected number of transitions from state i to state j) / (expected number of transitions from state i).
To compute these expectations, define ξ_t(i, j), the probability of being in state i at time t and state j at time t+1, given the observation sequence and the model.
(The letter ξ is the Greek xi; see http://www.greek-language.com/alphabet/)

Calculating ξ_t(i, j) from not-quite-ξ_t(i, j)
Figure 6.14 gives the numerator, not-quite-ξ_t(i, j) = P(q_t = i, q_{t+1} = j, O | λ) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j).
Using P(X | Y, Z) = P(X, Y | Z) / P(Y | Z), we divide not-quite-ξ_t(i, j) by P(O | λ) to obtain ξ_t(i, j) = P(q_t = i, q_{t+1} = j | O, λ).

Now P(O | λ) is
- the forward probability of the entire sequence, Σ_j α_T(j), or equivalently
- the total backward probability at time 1, Σ_j π_j b_j(o_1) β_1(j); in fact P(O | λ) = Σ_j α_t(j) β_t(j) for any t.
Thus
ξ_t(i, j) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / P(O | λ).

Finally, for â_ij:
â_ij = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} Σ_{k=1}^{N} ξ_t(i, k)
i.e., the expected number of transitions from i to j divided by the expected number of transitions out of i.
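A sketch of this re-estimation step, assuming alpha and beta have already been computed as in the forward/backward sketches above (same array conventions; this arrangement of the computation is my own, not code from the text):

```python
import numpy as np

def reestimate_A(A, B, obs, alpha, beta):
    """New estimate of the transition matrix from expected transition counts."""
    N, T = A.shape[0], len(obs)
    likelihood = alpha[T - 1].sum()                      # P(O | lambda)
    xi_sum = np.zeros((N, N))                            # sum_t xi_t(i, j)
    for t in range(T - 1):
        xi_t = (alpha[t][:, None] * A *
                B[:, obs[t + 1]][None, :] * beta[t + 1][None, :]) / likelihood
        xi_sum += xi_t
    return xi_sum / xi_sum.sum(axis=1, keepdims=True)    # normalize each row of counts
```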

Probability γ_t(j) of being in state j at time t
To re-estimate B we also need γ_t(j) = P(q_t = j | O, λ).
Again using Bayes' rule and moving O into the joint probability, we obtain
γ_t(j) = P(q_t = j, O | λ) / P(O | λ) = α_t(j) β_t(j) / P(O | λ).

Figure 6.15: Computation of γ_t(j), the probability of being in state j at time t
(Figure from Jurafsky & Martin, Speech and Language Processing, 2nd ed.)

Finally (Eq. 6.43), the emission probabilities are re-estimated as
b̂_j(v_k) = Σ_{t=1, s.t. o_t = v_k}^{T} γ_t(j) / Σ_{t=1}^{T} γ_t(j)
i.e., the expected number of times in state j observing symbol v_k, divided by the expected number of times in state j.

Forward-Backward Algorithm Description
The forward-backward algorithm then has:
0. an initialization of A = (a_ij) and B = (b_j(o_t)), e.g. random or uniform values;
1. a loop with an estimation (E) step, in which ξ_t(i, j) and γ_t(j) are computed from the current A and B, and a maximization (M) step, in which new estimates of A and B are computed from them;
2. repeated until convergence (e.g., until the log likelihood P(O | λ) stops improving);
3. return A, B.

Figure 6.16: The forward-backward (Baum-Welch) algorithm
(Figure from Jurafsky & Martin, Speech and Language Processing, 2nd ed.)
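Putting the pieces together, here is a self-contained Baum-Welch sketch following the description above: random initialization, an E-step computing γ and ξ, an M-step re-estimating π, A, and B, and a loop that stops when the log likelihood stops improving. The function and variable names, the convergence test, and the toy data are my own choices, and no scaling is done, so this is only suitable for short toy sequences.

```python
import numpy as np

def baum_welch(obs, N, V, n_iter=50, tol=1e-6, seed=0):
    """EM training of an HMM: random init, E-step (gamma, xi), M-step (pi, A, B)."""
    rng = np.random.default_rng(seed)
    obs = np.asarray(obs)
    T = len(obs)
    # 0. initialization: random row-stochastic A (N x N), B (N x V), uniform pi
    A = rng.random((N, N));  A /= A.sum(axis=1, keepdims=True)
    B = rng.random((N, V));  B /= B.sum(axis=1, keepdims=True)
    pi = np.full(N, 1.0 / N)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: forward and backward passes (no scaling; short sequences only)
        alpha = np.zeros((T, N)); beta = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta[T - 1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        likelihood = alpha[T - 1].sum()
        gamma = alpha * beta / likelihood                  # gamma[t, j]
        xi_sum = np.zeros((N, N))
        for t in range(T - 1):
            xi_sum += (alpha[t][:, None] * A *
                       B[:, obs[t + 1]][None, :] * beta[t + 1][None, :]) / likelihood
        # M-step: re-estimate pi, A, B from expected counts
        pi = gamma[0]
        A = xi_sum / gamma[:-1].sum(axis=0)[:, None]
        for k in range(V):
            B[:, k] = gamma[obs == k].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
        # convergence check on the log likelihood
        ll = np.log(likelihood)
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return A, B, pi

# toy usage: observations are # ice creams (1..3, coded with V = 4 symbols 0..3)
A, B, pi = baum_welch([3, 1, 3, 2, 2, 1, 1, 3, 3, 2], N=2, V=4)
```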

Information Theory Introduction (Section 4.10)
Entropy is a measure of the information in a message. Define a random variable X over whatever we are predicting (words, characters, ...); then the entropy of X is given by
H(X) = - Σ_x p(x) log2 p(x).

Horse Race Example of Entropy
Eight horses: H1, H2, ..., H8. We want to send messages saying which horse to bet on in each race, using as few bits as possible. We could use the bit sequences H1 = 000, H2 = 001, ..., H8 = 111, i.e., three bits per bet.

But now, given a random variable B
Assume our bets over the day are modelled by a random variable B with the following distribution:

Horse     P(we bet on it)   log2(prob)
Horse 1   1/2               log2(1/2)  = -1
Horse 2   1/4               log2(1/4)  = -2
Horse 3   1/8               log2(1/8)  = -3
Horse 4   1/16              log2(1/16) = -4
Horse 5   1/64              log2(1/64) = -6
Horse 6   1/64              log2(1/64) = -6
Horse 7   1/64              log2(1/64) = -6
Horse 8   1/64              log2(1/64) = -6

Horse Race Example of Entropy (cont.)
Then the entropy is
H(B) = -(1/2)log2(1/2) - (1/4)log2(1/4) - (1/8)log2(1/8) - (1/16)log2(1/16) - 4 × (1/64)log2(1/64) = 2 bits
and a variable-length code can beat the three-bit fixed encoding on average, e.g.:

Horse     Probability   Encoding bit string
Horse 1   1/2           0
Horse 2   1/4           10
Horse 3   1/8           110
Horse 4   1/16          1110
Horse 5   1/64          11110
Horse 6   1/64          111110
Horse 7   1/64          1111110
Horse 8   1/64          11111110

What if the horses are equally likely?
Then each horse has probability 1/8, the entropy is H(B) = -8 × (1/8)log2(1/8) = 3 bits, and the fixed three-bit encoding cannot be improved on.
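A quick check of the two entropy calculations above (the fractions are the slide's distribution; the function itself is just the definition of H):

```python
import math

def entropy(probs):
    """H(X) = -sum p(x) * log2 p(x), ignoring zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

biased  = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
uniform = [1/8] * 8
print(entropy(biased))    # 2.0 bits
print(entropy(uniform))   # 3.0 bits
```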

Entropy of Sequences

Jason (ice cream) Eisner
http://www.cs.jhu.edu/~jason/papers/# (14 papers in 2012)
http://videolectures.net/hltss2010_eisner_plm/video/2/

Speed of implementations of Baum-Welch?
"The Baum-Welch algorithm for hidden Markov Models: speed comparison between octave / python / R / scilab / matlab / C / C++"
http://perso.telecom-paristech.fr/~garivier/code/index.php