101035 中文信息处理 Chinese NLP Lecture 7

词——词性标注(2) Part-of-Speech Tagging (2)

- 统计模型的训练 (Training a statistical model)
- 马尔可夫链 (Markov chain)
- 隐马尔可夫模型 (Hidden Markov Model, or HMM)
- 隐马尔可夫标注算法 (HMM POS tagging)

统计模型的训练 Training a Statistical Model

Back to POS tagging: given a word sequence $w_1^n = w_1 \dots w_n$, decide its best POS sequence $\hat{t}_1^n$ among all possible $t_1^n = t_1 \dots t_n$:

$$\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n) = \operatorname*{argmax}_{t_1^n} \frac{P(w_1^n \mid t_1^n)\,P(t_1^n)}{P(w_1^n)} = \operatorname*{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\,P(t_1^n)$$

(by Bayes rule; $P(w_1^n \mid t_1^n)$ is the likelihood and $P(t_1^n)$ is the prior)

Computing Probabilities

Simplifying assumptions:

$$P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i), \qquad P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})$$

so that

$$\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\,P(t_1^n) \approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\,P(t_i \mid t_{i-1})$$

Counts from corpus:

$$P(t_i \mid t_{i-1}) = \frac{\mathrm{Count}(t_{i-1}, t_i)}{\mathrm{Count}(t_{i-1})}, \qquad P(w_i \mid t_i) = \frac{\mathrm{Count}(t_i, w_i)}{\mathrm{Count}(t_i)}$$

The above probability computation is oversimplified. Consult the textbook about deleted interpolation and other smoothing methods for better probability estimates.
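As a concrete illustration (not part of the original slides), here is a minimal sketch of how these two count-based estimates could be computed from a POS-tagged corpus; the corpus format, function name, and toy data are assumptions made for the example.

```python
from collections import defaultdict

def estimate_hmm_probs(tagged_sentences):
    """Estimate P(t_i | t_{i-1}) and P(w_i | t_i) by relative frequency.

    tagged_sentences: a list of sentences, each a list of (word, tag) pairs.
    Returns (transition, emission) dicts keyed by (prev_tag, tag) and (tag, word).
    """
    tag_count = defaultdict(int)        # Count(t)
    bigram_count = defaultdict(int)     # Count(t_{i-1}, t_i)
    emit_count = defaultdict(int)       # Count(t, w)

    for sent in tagged_sentences:
        prev = "<s>"                    # pseudo-tag marking the sentence start
        tag_count[prev] += 1
        for word, tag in sent:
            bigram_count[(prev, tag)] += 1
            emit_count[(tag, word)] += 1
            tag_count[tag] += 1
            prev = tag

    transition = {(p, t): c / tag_count[p] for (p, t), c in bigram_count.items()}
    emission = {(t, w): c / tag_count[t] for (t, w), c in emit_count.items()}
    return transition, emission

# Tiny toy corpus, made up purely for illustration
corpus = [[("the", "DT"), ("race", "NN")], [("to", "TO"), ("race", "VB")]]
trans_p, emit_p = estimate_hmm_probs(corpus)
print(trans_p[("TO", "VB")], emit_p[("VB", "race")])   # 1.0 1.0 on this toy corpus
```

In practice these raw relative frequencies would be smoothed, as the slide notes.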

Using Probabilities for POS Tagging

Example: what POS is "race" in "Secretariat is expected to race tomorrow"?

Using Probabilities for POS Tagging

Example: using the (87-tag) Brown corpus, we get

- P(NN|TO) = 0.00047, P(VB|TO) = 0.83
- P(race|NN) = 0.00057, P(race|VB) = 0.00012
- P(NR|VB) = 0.0027, P(NR|NN) = 0.0012

Compare:

- P(VB|TO) P(NR|VB) P(race|VB) = 0.00000027
- P(NN|TO) P(NR|NN) P(race|NN) = 0.00000000032

The VB path is far more probable, so race is tagged as a verb in this sentence.
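A quick arithmetic check of the two products (a throwaway snippet, not from the slides):

```python
# Probabilities quoted above from the (87-tag) Brown corpus
p_vb = 0.83 * 0.0027 * 0.00012      # P(VB|TO) * P(NR|VB) * P(race|VB)
p_nn = 0.00047 * 0.0012 * 0.00057   # P(NN|TO) * P(NR|NN) * P(race|NN)
print(f"{p_vb:.8f} {p_nn:.11f}")    # 0.00000027 0.00000000032 -> the VB reading wins
```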

马尔可夫链 Markov Chain

Definition: a Markov chain is a special case of a weighted automaton in which the input sequence uniquely determines which states the automaton will go through.

Graphical Model Representation

- A set of N states: $Q = q_1 q_2 \dots q_N$
- A transition probability matrix $A = a_{01} a_{02} \dots a_{n1} \dots a_{nn}$, with $\sum_{j=1}^{n} a_{ij} = 1 \;\; \forall i$
- A special start state and end state: $q_0$, $q_F$
- Alternatively, an initial probability distribution over states $\pi = \pi_1 \pi_2 \dots \pi_N$, where $\pi_i$ is the probability that the Markov chain will start in state $i$ and $\sum_{i=1}^{n} \pi_i = 1$

What Is Special about Markov Chains

- A Markov chain can't represent inherently ambiguous problems, so it is only useful for assigning probabilities to unambiguous sequences.
- A Markov chain is not suitable for POS tagging, because the states (the POS tags) cannot be directly observed.
- Markov assumption: $P(q_i \mid q_1 \dots q_{i-1}) = P(q_i \mid q_{i-1})$

In-Class Exercise

Using the Markov chain given on the slide, compute the probability of the sequence {cold hot cold hot}.
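The exercise's Markov chain is a figure that is not reproduced in this transcript, so the transition values in the sketch below are placeholders; it only illustrates the mechanics of multiplying the initial probability by successive transition probabilities.

```python
# Hypothetical initial and transition probabilities (the actual figure is not shown here)
start = {"hot": 0.8, "cold": 0.2}                      # initial distribution pi
trans = {("hot", "hot"): 0.7, ("hot", "cold"): 0.3,
         ("cold", "hot"): 0.4, ("cold", "cold"): 0.6}

def chain_probability(states):
    """P(s1 ... sn) = pi(s1) * prod_i P(s_i | s_{i-1}) for a first-order Markov chain."""
    p = start[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= trans[(prev, cur)]
    return p

print(chain_probability(["cold", "hot", "cold", "hot"]))  # 0.2 * 0.4 * 0.3 * 0.4 = 0.0096
```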

隐马尔可夫模型 Hidden Markov Model

Markov Chain vs. HMM

- A Markov chain is useful when we need to compute a probability for a sequence of events that we can observe in the world.
- An HMM allows us to talk about both observed events (like the words that we see in the input) and hidden events (like POS tags).

HMM Components

- A set of N states: $Q = q_1 q_2 \dots q_N$
- A transition probability matrix $A = a_{11} a_{12} \dots a_{n1} \dots a_{nn}$, with $\sum_{j=1}^{n} a_{ij} = 1 \;\; \forall i$
- A sequence of T observations: $O = o_1 o_2 \dots o_T$
- A sequence of observation likelihoods, or emission probabilities: $B = b_i(o_t)$
- A special start state and end state: $q_0$, $q_F$
- Alternatively, an initial probability distribution over states $\pi = \pi_1 \pi_2 \dots \pi_N$, where $\pi_i$ is the probability that the Markov chain will start in state $i$ and $\sum_{i=1}^{n} \pi_i = 1$

HMM Assumptions

- Markov assumption: $P(q_i \mid q_1 \dots q_{i-1}) = P(q_i \mid q_{i-1})$
- Output independence assumption: $P(o_i \mid q_1 \dots q_i \dots q_T, o_1 \dots o_i \dots o_T) = P(o_i \mid q_i)$

Fundamental Problems

1. Computing likelihood: given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O|λ).
2. Decoding: given an observation sequence O and an HMM λ = (A, B), discover the best hidden state sequence Q.
3. Learning: given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B.

A Running Example

Jason eats some number of ice creams each day, and there is some relation between the weather states (hot, cold) and the number of ice creams eaten on that day. An integer represents the number of ice creams eaten on a given day (observed), and a sequence of H and C designates the weather states (hidden) that caused Jason to eat that many ice creams.
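To make the running example concrete, here is one possible parameterization written as plain Python dictionaries. The specific numbers are assumptions (the slide's own figure with the actual values is not reproduced in this transcript), chosen only so that the algorithm sketches below have something to run on.

```python
# Hypothetical ice-cream HMM parameters (states H = hot, C = cold; observations 1, 2, 3)
STATES = ["H", "C"]
PI = {"H": 0.8, "C": 0.2}                            # start distribution a_{0j}
A  = {("H", "H"): 0.6, ("H", "C"): 0.4,              # transition probabilities a_{ij}
      ("C", "H"): 0.5, ("C", "C"): 0.5}
B  = {("H", 1): 0.2, ("H", 2): 0.4, ("H", 3): 0.4,   # emission probabilities b_j(o_t)
      ("C", 1): 0.5, ("C", 2): 0.4, ("C", 3): 0.1}
A_END = {"H": 1.0, "C": 1.0}                         # a_{iF}; kept at 1.0, i.e. no explicit end state
```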

Computing Likelihood

Given an HMM model, what is the likelihood of {3, 1, 3}? Note that we do not know the hidden states (the weather).

Forward algorithm (a kind of dynamic programming): $\alpha_t(j)$ represents the probability of being in state $j$ after seeing the first $t$ observations, given the model $\lambda$:

$$\alpha_t(j) = P(o_1, o_2 \dots o_t, q_t = j \mid \lambda) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t)$$

where $\alpha_{t-1}(i)$ is the forward path probability from the previous time step, $a_{ij}$ is the transition probability from previous state $q_i$ to current state $q_j$, and $b_j(o_t)$ is the state observation likelihood of the observation symbol $o_t$ given the current state $j$.

Computing Likelihood: Algorithm

- Initialization: $\alpha_1(j) = a_{0j}\, b_j(o_1), \quad 1 \le j \le N$
- Recursion: $\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t), \quad 1 \le j \le N,\; 1 < t \le T$
- Termination: $P(O \mid \lambda) = \alpha_T(q_F) = \sum_{i=1}^{N} \alpha_T(i)\, a_{iF}$

In the trellis implementation, forward[s, t] = $\alpha_t(s)$.
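A compact sketch of the forward recursion, reusing the hypothetical ice-cream parameters defined above; the printed value therefore reflects those assumed numbers, not figures from the lecture.

```python
def forward_likelihood(observations, states, pi, A, B, a_end):
    """Forward algorithm: P(O | lambda), summing over all hidden state paths."""
    # Initialization: alpha_1(j) = a_{0j} * b_j(o_1)
    alpha = {j: pi[j] * B[(j, observations[0])] for j in states}
    # Recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_{ij} * b_j(o_t)
    for o in observations[1:]:
        alpha = {j: sum(alpha[i] * A[(i, j)] for i in states) * B[(j, o)]
                 for j in states}
    # Termination: P(O | lambda) = sum_i alpha_T(i) * a_{iF}
    return sum(alpha[i] * a_end[i] for i in states)

# With the assumed ice-cream parameters above:
print(forward_likelihood([3, 1, 3], STATES, PI, A, B, A_END))   # ~0.0286
```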

Computing Likelihood (worked trellis figure on the slide)

Decoding

Given an HMM model and an ice cream sequence {3, 1, 3}, what is the hidden weather state sequence?

Viterbi algorithm (a kind of dynamic programming): $v_t(j)$ represents the probability that the HMM is in state $j$ after seeing the first $t$ observations and passing through the most probable state sequence $q_0, q_1, \dots, q_{t-1}$, given the model $\lambda$:

$$v_t(j) = \max_{q_0, q_1 \dots q_{t-1}} P(q_0, q_1 \dots q_{t-1}, o_1, o_2 \dots o_t, q_t = j \mid \lambda) = \max_{1 \le i \le N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)$$

where $v_{t-1}(i)$ is the Viterbi path probability from the previous time step, $a_{ij}$ is the transition probability from previous state $q_i$ to current state $q_j$, and $b_j(o_t)$ is the state observation likelihood of the observation symbol $o_t$ given the current state $j$.

Decoding: Algorithm

- Initialization: $v_1(j) = a_{0j}\, b_j(o_1), \quad bt_1(j) = 0, \quad 1 \le j \le N$
- Recursion: $v_t(j) = \max_{1 \le i \le N} v_{t-1}(i)\, a_{ij}\, b_j(o_t), \quad bt_t(j) = \operatorname*{argmax}_{1 \le i \le N} v_{t-1}(i)\, a_{ij}\, b_j(o_t), \quad 1 \le j \le N,\; 1 < t \le T$
- Termination: $P^* = v_T(q_F) = \max_{1 \le i \le N} v_T(i)\, a_{iF}, \quad q_T^* = bt_T(q_F) = \operatorname*{argmax}_{1 \le i \le N} v_T(i)\, a_{iF}$
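A corresponding sketch of Viterbi with backpointers, again using the assumed ice-cream parameters; viterbi_decode is an illustrative name, not something defined in the slides.

```python
def viterbi_decode(observations, states, pi, A, B, a_end):
    """Viterbi algorithm: returns (best path probability, best hidden state sequence)."""
    # Initialization: v_1(j) = a_{0j} * b_j(o_1); no backpointer for the first step
    v = {j: pi[j] * B[(j, observations[0])] for j in states}
    backpointers = []
    # Recursion: v_t(j) = max_i v_{t-1}(i) * a_{ij} * b_j(o_t), remembering the argmax
    for o in observations[1:]:
        bt, new_v = {}, {}
        for j in states:
            best_i = max(states, key=lambda i: v[i] * A[(i, j)])
            bt[j] = best_i
            new_v[j] = v[best_i] * A[(best_i, j)] * B[(j, o)]
        backpointers.append(bt)
        v = new_v
    # Termination: pick the best final state and follow the backpointers
    last = max(states, key=lambda i: v[i] * a_end[i])
    path = [last]
    for bt in reversed(backpointers):
        path.append(bt[path[-1]])
    return v[last] * a_end[last], list(reversed(path))

prob, path = viterbi_decode([3, 1, 3], STATES, PI, A, B, A_END)
print(prob, path)   # best weather sequence for {3, 1, 3} under the assumed parameters
```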

Decoding: Algorithm (pseudocode figure on the slide)

Decoding (worked trellis figure on the slide)

Learning

Given an ice cream sequence {3, 1, 3} and the set of possible weather states {H, C}, what are the HMM parameters (A and B)?

Forward-backward algorithm (a kind of Expectation-Maximization): $\beta_t(j)$ represents the probability of seeing the observations from time $t+1$ to the end, given that we are in state $j$ at time $t$ and given the model $\lambda$:

$$\beta_t(j) = P(o_{t+1}, o_{t+2} \dots o_T \mid q_t = j, \lambda)$$

- Initialization: $\beta_T(i) = a_{iF}, \quad 1 \le i \le N$
- Recursion: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad 1 \le i \le N,\; 1 \le t < T$
- Termination: $P(O \mid \lambda) = \alpha_T(q_F) = \beta_1(q_0) = \sum_{j=1}^{N} a_{0j}\, b_j(o_1)\, \beta_1(j)$
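A sketch of the backward pass only (the full Baum-Welch re-estimation loop is longer and is left to the textbook, as the slides suggest). The same assumed ice-cream parameters are reused, and the final print is a consistency check that should equal the forward likelihood.

```python
def backward_probs(observations, states, A, B, a_end):
    """Backward algorithm: beta_t(i) = P(o_{t+1} ... o_T | q_t = i, lambda)."""
    T = len(observations)
    beta = [dict() for _ in range(T)]
    # Initialization: beta_T(i) = a_{iF}
    beta[T - 1] = {i: a_end[i] for i in states}
    # Recursion, working backwards: beta_t(i) = sum_j a_{ij} * b_j(o_{t+1}) * beta_{t+1}(j)
    for t in range(T - 2, -1, -1):
        beta[t] = {i: sum(A[(i, j)] * B[(j, observations[t + 1])] * beta[t + 1][j]
                          for j in states)
                   for i in states}
    return beta

beta = backward_probs([3, 1, 3], STATES, A, B, A_END)
# Consistency check: sum_j pi_j * b_j(o_1) * beta_1(j) equals the forward likelihood
print(sum(PI[j] * B[(j, 3)] * beta[0][j] for j in STATES))
```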

Learning: Algorithm

The re-estimation step uses two quantities:

- $\gamma_t(j)$: the probability of being in state $j$ at time $t$
- $\xi_t(i, j)$: the probability of being in state $i$ at time $t$ and state $j$ at time $t+1$

Solving the learning problem is the most complicated of the three. Consult your textbook to find more details.

隐马尔可夫标注算法 HMM POS Tagging

Using Viterbi to solve the decoding problem. An English example: "I want to race."

Transition probabilities A and emission probabilities B (tables on the slide).
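The viterbi_decode sketch above can be applied to tagging directly by treating tags as hidden states and words as observations. The numbers below are made-up placeholders, since the slide's actual A and B tables are not reproduced in this transcript.

```python
# Hypothetical tag-transition and word-emission probabilities for "I want to race"
TAGS = ["PRP", "VB", "TO", "NN"]
PI_TAG = {"PRP": 0.9, "VB": 0.05, "TO": 0.03, "NN": 0.02}
A_TAG = {(s, t): 0.05 for s in TAGS for t in TAGS}                       # small default
A_TAG.update({("PRP", "VB"): 0.5, ("VB", "TO"): 0.4, ("TO", "VB"): 0.8, ("TO", "NN"): 0.1})
B_TAG = {(t, w): 1e-6 for t in TAGS for w in "i want to race".split()}   # small default
B_TAG.update({("PRP", "i"): 0.4, ("VB", "want"): 0.01, ("TO", "to"): 0.9,
              ("VB", "race"): 0.00012, ("NN", "race"): 0.00057})
A_END_TAG = {t: 1.0 for t in TAGS}

prob, tags = viterbi_decode("i want to race".split(), TAGS, PI_TAG, A_TAG, B_TAG, A_END_TAG)
print(tags)   # -> ['PRP', 'VB', 'TO', 'VB'] with these made-up numbers
```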

An English Example (Viterbi trellis figure on the slide)

In-Class Exercise

Compute $v_3(3)$ for the example on the previous slide, using the given probabilities. Note that you need to first compute all of the $v_2(\ast)$ values.

Other Tagging Methods

- CLAWS (a brute-force algorithm): within a span of words whose first and last words have unambiguous POS tags, enumerate every possible tag path and choose the best-scoring one.
- VOLSUNGA (a greedy algorithm): an improvement on CLAWS that builds the path step by step; at each step it keeps only the best path found so far, and the final path is simply the concatenation of these locally best choices.

A Chinese Example

In implementation, we often use log probabilities to prevent numerical underflow caused by products of small probabilities. If we take negative log probabilities, finding the maximum product becomes finding the minimum sum.

Example sentence: ,报道新闻了,

Transition probabilities A / transition costs TC, and emission probabilities B / emission costs EC (tables on the slide).
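For instance, a probability table can be turned into a cost table as follows (a trivial sketch; the lecture does not state whether the TC/EC values use natural logs or base-10 logs, so the base here is an assumption):

```python
import math

def to_costs(prob_table, base=10):
    """Convert probabilities to additive costs: cost = -log_base(p)."""
    return {key: -math.log(p, base) for key, p in prob_table.items()}

# Maximizing a product of probabilities == minimizing the sum of these costs
```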

A Chinese Example: CLAWS

- Path 1: w-n-n-u-w, Cost1 = TC[w,n]+TC[n,n]+TC[n,u]+TC[u,w] = 2.09+1.76+2.40+2.22 = 8.47
- Path 2: w-n-n-v-w, Cost2 = TC[w,n]+TC[n,n]+TC[n,v]+TC[v,w] = 2.09+1.76+1.71+1.85 = 7.41
- Path 3: w-n-n-y-w, Cost3 = TC[w,n]+TC[n,n]+TC[n,y]+TC[y,w] = 2.09+1.76+5.10+0.08 = 9.03
- Path 4: w-v-n-u-w, Cost4 = TC[w,v]+TC[v,n]+TC[n,u]+TC[u,w] = 1.90+1.72+2.40+2.22 = 8.24
- Path 5: w-v-n-v-w, Cost5 = TC[w,v]+TC[v,n]+TC[n,v]+TC[v,w] = 1.90+1.72+1.71+1.85 = 7.18
- Path 6: w-v-n-y-w, Cost6 = TC[w,v]+TC[v,n]+TC[n,y]+TC[y,w] = 1.90+1.72+5.10+0.08 = 8.80

Path 5 has the minimum cost, so the result is 报道/v 新闻/n 了/v.
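The six-path enumeration can be reproduced mechanically. In the sketch below, the TC values are copied from this slide and itertools.product enumerates every candidate tag path between the two boundary commas; as in the slide's computation, emission costs are left out.

```python
from itertools import product

# Transition costs copied from the slide (negative log probabilities)
TC = {("w", "n"): 2.09, ("w", "v"): 1.90, ("n", "n"): 1.76, ("v", "n"): 1.72,
      ("n", "u"): 2.40, ("n", "v"): 1.71, ("n", "y"): 5.10,
      ("u", "w"): 2.22, ("v", "w"): 1.85, ("y", "w"): 0.08}

# Candidate tags per position: , 报道 新闻 了 ,
candidates = [["w"], ["n", "v"], ["n"], ["u", "v", "y"], ["w"]]

best_path, best_cost = None, float("inf")
for path in product(*candidates):                       # 1*2*1*3*1 = 6 paths
    cost = sum(TC[(a, b)] for a, b in zip(path, path[1:]))
    print(path, round(cost, 2))
    if cost < best_cost:
        best_path, best_cost = path, cost

print("best:", best_path, round(best_cost, 2))          # w-v-n-v-w with cost 7.18
```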

A Chinese Example: VOLSUNGA

- Step 1: min{TC[w,n]+EC[报道|n], TC[w,v]+EC[报道|v]} = min{2.09+8.22, 1.90+5.69}, so T[1] = v
- Step 2: min{TC[v,n]+EC[新闻|n]} = min{1.72+6.55}, so T[2] = n
- Step 3: min{TC[n,u]+EC[了|u], TC[n,v]+EC[了|v], TC[n,y]+EC[了|y]} = min{2.40+1.98, 1.71+7.76, 5.10+0.38}, so T[3] = u

The result is 报道/v 新闻/n 了/u.
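The greedy search can be sketched the same way, reusing the TC table above together with the EC values quoted on this slide; the candidate tag sets per word are the same ones used in the path enumeration.

```python
# Emission costs copied from the slide
EC = {("报道", "n"): 8.22, ("报道", "v"): 5.69, ("新闻", "n"): 6.55,
      ("了", "u"): 1.98, ("了", "v"): 7.76, ("了", "y"): 0.38}

words = ["报道", "新闻", "了"]
word_tags = {"报道": ["n", "v"], "新闻": ["n"], "了": ["u", "v", "y"]}

prev_tag, result = "w", []                  # start from the boundary punctuation tag w
for word in words:
    # Keep only the locally best tag: min over TC[prev, t] + EC[word, t]
    tag = min(word_tags[word], key=lambda t: TC[(prev_tag, t)] + EC[(word, t)])
    result.append((word, tag))
    prev_tag = tag

print(result)   # [('报道', 'v'), ('新闻', 'n'), ('了', 'u')]
```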

A Chinese Example: Viterbi

Cost[k, t] comes from the negative log probability: Cost[k, t] = min over s of {Cost[k-1, s] + TC[s, t]} + EC[wk|t]

- Step 1 (k = 1):
  Cost[1, 报道/n] = (Cost[0, w] + TC[w,n]) + EC[报道|n] = 10.31
  Cost[1, 报道/v] = (Cost[0, w] + TC[w,v]) + EC[报道|v] = 7.59
- Step 2 (k = 2):
  Cost[2, 新闻/n] = min{(Cost[1, v] + TC[v,n]), (Cost[1, n] + TC[n,n])} + EC[新闻|n] = 7.59 + 1.72 + 6.55 = 15.86
- Step 3 (k = 3):
  Cost[3, 了/u] = (Cost[2, n] + TC[n,u]) + EC[了|u] = 20.24
  Cost[3, 了/v] = (Cost[2, n] + TC[n,v]) + EC[了|v] = 25.33
  Cost[3, 了/y] = (Cost[2, n] + TC[n,y]) + EC[了|y] = 21.34
- Step 4 (k = 4):
  Cost[4, ,/w] = min{(Cost[3, u] + TC[u,w]), (Cost[3, v] + TC[v,w]), (Cost[3, y] + TC[y,w])} + EC[,|w] = 21.34 + 0.08 + 0 = 21.42

Backtracking from the minimum final cost, the result is 报道/v 新闻/n 了/y.
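A cost-based Viterbi over the same TC and EC tables reproduces the trellis above (a sketch; the opening comma is modeled as a fixed starting tag w with cost 0, and the closing comma as a final word with zero emission cost):

```python
def viterbi_costs(words, word_tags, TC, EC):
    """Min-cost Viterbi: Cost[k, t] = min_s (Cost[k-1, s] + TC[s, t]) + EC[word_k | t]."""
    cost = {"w": 0.0}                                    # Cost[0, w] = 0 (opening comma)
    back = []
    for word in words:
        new_cost, bt = {}, {}
        for t in word_tags[word]:
            s = min(cost, key=lambda s: cost[s] + TC[(s, t)])   # best previous tag
            bt[t] = s
            new_cost[t] = cost[s] + TC[(s, t)] + EC.get((word, t), 0.0)
        back.append(bt)
        cost = new_cost
    # Trace back from the best final tag
    tag = min(cost, key=cost.get)
    tags = [tag]
    for bt in reversed(back[1:]):
        tags.append(bt[tags[-1]])
    return cost[tag], list(reversed(tags))

total, tags = viterbi_costs(["报道", "新闻", "了", ","],
                            {"报道": ["n", "v"], "新闻": ["n"], "了": ["u", "v", "y"], ",": ["w"]},
                            TC, EC)
print(round(total, 2), tags)   # 21.42 ['v', 'n', 'y', 'w'] -> 报道/v 新闻/n 了/y
```

Note that the three methods disagree on the tag of 了 (v, u, and y respectively), which is exactly the point of this example.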

Wrap-Up

- 统计模型的训练 (Training a statistical model): computing probabilities
- 马尔可夫链 (Markov chain): definition; graphical representation
- 隐马尔可夫模型 (HMM): components; assumptions; computing likelihood; decoding; learning
- 隐马尔可夫标注算法 (HMM POS tagging): Viterbi; examples; vs. other methods