1 中文信息处理 Chinese NLP Lecture 7

2 词——词性标注(2) Part-of-Speech Tagging (2)
统计模型的训练 (Training a statistical model)
马尔可夫链 (Markov chain)
隐马尔可夫模型 (Hidden Markov Model, or HMM)
隐马尔可夫标注算法 (HMM POS tagging)

3 统计模型的训练 Training a Statistical Model
Back to POS tagging: given a word sequence $w_1^n = w_1 \dots w_n$, decide its best POS sequence $\hat{t}_1^n$ among all candidate sequences $t_1^n = t_1 \dots t_n$:
$$\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n) = \operatorname*{argmax}_{t_1^n} \frac{P(w_1^n \mid t_1^n)\, P(t_1^n)}{P(w_1^n)} = \operatorname*{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n)$$
The second step is Bayes' rule; $P(w_1^n \mid t_1^n)$ is the likelihood and $P(t_1^n)$ the prior. The denominator $P(w_1^n)$ is dropped because it is the same for every candidate tag sequence.

4 Simplifying Assumptions
Computing the probabilities requires two simplifying assumptions:
$$P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i) \qquad P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})$$
so that
$$\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n) \approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$$
Both factors are estimated from counts in a tagged corpus:
$$P(t_i \mid t_{i-1}) = \frac{Count(t_{i-1}, t_i)}{Count(t_{i-1})} \qquad P(w_i \mid t_i) = \frac{Count(t_i, w_i)}{Count(t_i)}$$
The above probability computation is oversimplified. Consult the textbook about deleted interpolation and other smoothing methods for better probability estimation.
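A minimal sketch of this count-based estimation (the function name, data layout, and the "<s>" start marker are illustrative assumptions; no smoothing is applied):

```python
from collections import defaultdict

def train_hmm(tagged_sentences):
    """Estimate HMM parameters by relative-frequency counts (no smoothing).

    tagged_sentences: iterable of [(word, tag), ...] lists.
    """
    tag_count = defaultdict(int)     # Count(t)
    trans_count = defaultdict(int)   # Count(t_{i-1}, t_i)
    emit_count = defaultdict(int)    # Count(t_i, w_i)
    for sent in tagged_sentences:
        prev = "<s>"                 # special start state q0
        tag_count[prev] += 1
        for word, tag in sent:
            trans_count[(prev, tag)] += 1
            emit_count[(tag, word)] += 1
            tag_count[tag] += 1
            prev = tag
    # P(t_i | t_{i-1}) = Count(t_{i-1}, t_i) / Count(t_{i-1})
    transition = {(p, t): c / tag_count[p] for (p, t), c in trans_count.items()}
    # P(w_i | t_i) = Count(t_i, w_i) / Count(t_i)
    emission = {(t, w): c / tag_count[t] for (t, w), c in emit_count.items()}
    return transition, emission
```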

5 Using Probabilities for POS Tagging
Example: What POS is "race" in "Secretariat is expected to race tomorrow"?

6 Using Probabilities for POS Tagging
Example: Using the (87-tag) Brown corpus, we get
P(NN|TO) = 0.00047   P(VB|TO) = 0.83
P(race|NN) = 0.00057   P(race|VB) = 0.00012
P(NR|VB) = 0.0027   P(NR|NN) = 0.0012
Compare:
P(VB|TO) P(NR|VB) P(race|VB) = 0.00000027
P(NN|TO) P(NR|NN) P(race|NN) = 0.00000000032
The verb reading wins by a wide margin, so "race" is tagged VB here.

7 马尔可夫链 Markov Chain
Definition: A Markov chain is a special case of a weighted automaton in which the input sequence uniquely determines which states the automaton will go through.

8 Graphical Model Representation
A set of N states: $Q = q_1 q_2 \dots q_N$
A transition probability matrix: $A = a_{01} a_{02} \dots a_{n1} \dots a_{nn}$, where $\sum_{j=1}^{n} a_{ij} = 1 \;\; \forall i$
A special start state and end state: $q_0$, $q_F$
Alternatively, we use an initial probability distribution over states, $\pi = \pi_1 \pi_2 \dots \pi_N$, where $\pi_i$ is the probability that the Markov chain starts in state $i$ and $\sum_{i=1}^{n} \pi_i = 1$.

9 What is Special about Markov Chains
Markov assumption: $P(q_i \mid q_1 \dots q_{i-1}) = P(q_i \mid q_{i-1})$
A Markov chain cannot represent inherently ambiguous problems, so it is only useful for assigning probabilities to unambiguous sequences.
In particular, a Markov chain is not suitable for POS tagging, because the states (the POS tags) cannot be directly observed.

10 In-Class Exercise
Using the Markov chain given on the slide, compute the probability of the sequence {cold hot cold hot}. (The general form of the computation is sketched below.)
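Whatever the concrete transition values in the slide's diagram, the Markov assumption factorizes the sequence probability into a chain of transitions:
$$P(\text{cold hot cold hot}) = \pi_{\text{cold}} \cdot a_{\text{cold,hot}} \cdot a_{\text{hot,cold}} \cdot a_{\text{cold,hot}}$$
(use $a_{0,\text{cold}}$ in place of $\pi_{\text{cold}}$ if the chain is drawn with an explicit start state).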

11 隐马尔可夫模型 Hidden Markov Model
Markov Chain vs. HMM: A Markov chain is useful when we need to compute a probability for a sequence of events that we can fully observe in the world. An HMM allows us to talk about both observed events (like the words we see in the input) and hidden events (like POS tags).

12 HMM Components
A set of N states: $Q = q_1 q_2 \dots q_N$
A transition probability matrix: $A = a_{11} a_{12} \dots a_{n1} \dots a_{nn}$, where $\sum_{j=1}^{n} a_{ij} = 1 \;\; \forall i$
A sequence of T observations: $O = o_1 o_2 \dots o_T$
A sequence of observation likelihoods, or emission probabilities: $B = b_i(o_t)$
A special start state and end state: $q_0$, $q_F$
Alternatively, we use an initial probability distribution over states, $\pi = \pi_1 \pi_2 \dots \pi_N$, where $\pi_i$ is the probability that the chain starts in state $i$ and $\sum_{i=1}^{n} \pi_i = 1$.

13 HMM Assumptions
Markov assumption: $P(q_i \mid q_1 \dots q_{i-1}) = P(q_i \mid q_{i-1})$
Output independence assumption: $P(o_i \mid q_1 \dots q_i \dots q_T, o_1 \dots o_i \dots o_T) = P(o_i \mid q_i)$
Fundamental Problems
Computing likelihood: given an HMM $\lambda = (A, B)$ and an observation sequence $O$, determine the likelihood $P(O \mid \lambda)$.
Decoding: given an observation sequence $O$ and an HMM $\lambda = (A, B)$, discover the best hidden state sequence $Q$.
Learning: given an observation sequence $O$ and the set of states in the HMM, learn the HMM parameters $A$ and $B$.

14 A Running Example
Jason eats some number of ice creams each day, and there is some relation between the weather states (hot, cold) and the number of ice creams eaten that day. An integer represents the number of ice creams eaten on a given day (observed), and a sequence of H and C designates the weather states (hidden) that caused Jason to eat that much ice cream.

15 Computing Likelihood
Given an HMM model, what is the likelihood of {3, 1, 3}? Note that we do not know the hidden states (the weather).
Forward algorithm (a kind of dynamic programming): $\alpha_t(j)$ represents the probability of being in state $j$ after seeing the first $t$ observations, given the model $\lambda$:
$$\alpha_t(j) = P(o_1, o_2 \dots o_t, q_t = j \mid \lambda) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t)$$
Here $\alpha_{t-1}(i)$ is the forward path probability from the previous time step, $a_{ij}$ is the transition probability from previous state $q_i$ to current state $q_j$, and $b_j(o_t)$ is the state observation likelihood of the observation symbol $o_t$ given the current state $j$.

16 Computing Likelihood Algorithm
Initialization: $\alpha_1(j) = a_{0j}\, b_j(o_1), \quad 1 \le j \le N$
Recursion: $\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t), \quad 1 \le j \le N,\; 1 < t \le T$
Termination: $P(O \mid \lambda) = \alpha_T(q_F) = \sum_{i=1}^{N} \alpha_T(i)\, a_{iF}$
In a trellis implementation, the cell forward[s, t] stores $\alpha_t(s)$ (see the sketch below).
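A minimal Python sketch of this forward pass. The dict-based parameter layout and the example numbers are illustrative assumptions, not the textbook's figures:

```python
def forward(obs, states, a, b, a0, aF):
    # a[i][j]: transition prob i->j; b[j][o]: emission prob of o in state j
    # a0[j] = a_{0j} (out of the start state); aF[i] = a_{iF} (into the end state)
    T = len(obs)
    alpha = [{} for _ in range(T)]
    for j in states:                                   # initialization
        alpha[0][j] = a0[j] * b[j][obs[0]]
    for t in range(1, T):                              # recursion
        for j in states:
            alpha[t][j] = sum(alpha[t-1][i] * a[i][j] for i in states) * b[j][obs[t]]
    return sum(alpha[T-1][i] * aF[i] for i in states)  # termination

# Illustrative parameters (each row of a plus the end transition sums to 1):
states = ["H", "C"]
a0 = {"H": 0.8, "C": 0.2}
a  = {"H": {"H": 0.6, "C": 0.3}, "C": {"H": 0.4, "C": 0.5}}
aF = {"H": 0.1, "C": 0.1}
b  = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}
print(forward([3, 1, 3], states, a, b, a0, aF))        # P({3, 1, 3} | lambda)
```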

17 Computing Likelihood

18 Decoding
Given an HMM model and an ice cream sequence {3, 1, 3}, what is the hidden weather state sequence?
Viterbi algorithm (a kind of dynamic programming): $v_t(j)$ represents the probability that the HMM is in state $j$ after seeing the first $t$ observations and passing through the most probable state sequence $q_0, q_1, \dots, q_{t-1}$, given the model $\lambda$:
$$v_t(j) = \max_{q_0, q_1 \dots q_{t-1}} P(q_0, q_1 \dots q_{t-1}, o_1, o_2 \dots o_t, q_t = j \mid \lambda) = \max_{1 \le i \le N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)$$
Here $v_{t-1}(i)$ is the Viterbi path probability from the previous time step, $a_{ij}$ is the transition probability from previous state $q_i$ to current state $q_j$, and $b_j(o_t)$ is the state observation likelihood of the observation symbol $o_t$ given the current state $j$.

19 Decoding Algorithm
Initialization: $v_1(j) = a_{0j}\, b_j(o_1), \quad 1 \le j \le N; \qquad bt_1(j) = 0$
Recursion: $v_t(j) = \max_{1 \le i \le N} v_{t-1}(i)\, a_{ij}\, b_j(o_t), \quad 1 \le j \le N,\; 1 < t \le T$
$bt_t(j) = \operatorname*{argmax}_{1 \le i \le N} v_{t-1}(i)\, a_{ij}\, b_j(o_t), \quad 1 \le j \le N,\; 1 < t \le T$
Termination: $P^* = v_T(q_F) = \max_{1 \le i \le N} v_T(i)\, a_{iF}; \qquad q_T^* = bt_T(q_F) = \operatorname*{argmax}_{1 \le i \le N} v_T(i)\, a_{iF}$
The best state sequence is recovered by following the backpointers $bt$ from $q_T^*$ (see the sketch below).
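A matching Python sketch, reusing the assumed parameter layout and illustrative numbers from the forward sketch above:

```python
def viterbi(obs, states, a, b, a0, aF):
    T = len(obs)
    v = [{} for _ in range(T)]
    bt = [{} for _ in range(T)]                  # backpointers
    for j in states:                             # initialization
        v[0][j] = a0[j] * b[j][obs[0]]
        bt[0][j] = None
    for t in range(1, T):                        # recursion
        for j in states:
            best = max(states, key=lambda i: v[t-1][i] * a[i][j])
            v[t][j] = v[t-1][best] * a[best][j] * b[j][obs[t]]
            bt[t][j] = best
    last = max(states, key=lambda i: v[T-1][i] * aF[i])   # termination
    path = [last]
    for t in range(T - 1, 0, -1):                # follow backpointers
        path.append(bt[t][path[-1]])
    return list(reversed(path))

# Most probable hidden weather sequence under the illustrative parameters:
print(viterbi([3, 1, 3], states, a, b, a0, aF))
```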

20 Decoding Algorithm

21 Decoding

22 Learning
Given an ice cream sequence {3, 1, 3} and the set of possible weather states {H, C}, what are the HMM parameters (A and B)?
Forward-Backward algorithm (a kind of Expectation-Maximization): $\beta_t(j)$ represents the probability of seeing the observations from time $t+1$ to the end, given that we are in state $j$ at time $t$ and given the model $\lambda$:
$$\beta_t(j) = P(o_{t+1}, o_{t+2} \dots o_T \mid q_t = j, \lambda)$$
Initialization: $\beta_T(i) = a_{iF}, \quad 1 \le i \le N$
Recursion: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad 1 \le i \le N,\; 1 \le t < T$
Termination: $P(O \mid \lambda) = \alpha_T(q_F) = \beta_1(q_0) = \sum_{j=1}^{N} a_{0j}\, b_j(o_1)\, \beta_1(j)$
(A backward-pass sketch follows below.)
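A backward-pass sketch under the same assumed layout. Its return value should equal the forward probability, which makes a handy implementation check (reusing the illustrative parameters from the forward sketch):

```python
def backward(obs, states, a, b, a0, aF):
    T = len(obs)
    beta = [{} for _ in range(T)]
    for i in states:                             # initialization: beta_T(i) = a_{iF}
        beta[T-1][i] = aF[i]
    for t in range(T - 2, -1, -1):               # recursion, right to left
        for i in states:
            beta[t][i] = sum(a[i][j] * b[j][obs[t+1]] * beta[t+1][j] for j in states)
    # termination: the same P(O | lambda) as the forward pass
    return sum(a0[j] * b[j][obs[0]] * beta[0][j] for j in states)

assert abs(forward([3, 1, 3], states, a, b, a0, aF)
           - backward([3, 1, 3], states, a, b, a0, aF)) < 1e-12
```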

23 Learning Algorithm
The E-step computes two quantities: the probability of being in state $j$ at time $t$ ($\gamma_t(j)$ in the textbook's notation), and the probability of being in state $i$ at time $t$ and state $j$ at time $t+1$ ($\xi_t(i, j)$); the M-step then re-estimates $A$ and $B$ from them. Solving the learning problem is the most complicated of the three. Consult your textbook to find more details.

24 隐马尔可夫标注算法 HMM POS Tagging
Using Viterbi to solve the decoding problem.
An English example: I want to race.
Transition probabilities $A$ and emission probabilities $B$ are given in tables on the slide.
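For tagging, the Viterbi sketch from slide 19 applies unchanged: the hidden states are the POS tags and the observations are the words. A hypothetical call (the tag names and parameter tables are placeholders, not the slide's values):

```python
words = ["I", "want", "to", "race"]
tags = ["PPSS", "VB", "TO", "NN"]   # candidate tag set (illustrative)
# best_tags = viterbi(words, tags, A, B, A0, AF)   # with A, B taken from the slide's tables
```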

25 An English Example

26 In-Class Exercise
Compute $v_3(3)$ on the previous page, using the given probabilities. Note that you need to first compute all the $v_2(\cdot)$ values.

27 Other Tagging Methods
CLAWS (a brute-force algorithm): within a word span whose beginning and end words have unambiguous POS tags, calculate the cost of every possible tag path and choose the best one.
VOLSUNGA (a greedy algorithm): an improvement on CLAWS that builds the path step by step, keeping only the best partial path found so far at each step. The final path is simply the concatenation of these locally best choices.

28 A Chinese Example
In implementation, we often use log probabilities to prevent numerical underflow caused by products of small probabilities. If we take negative log probabilities, finding the maximum product becomes finding the minimum sum (see the note below).
Example sentence: ，报道新闻了， (roughly "... reported the news ...", flanked by comma tokens tagged w)
Transition probabilities A / transition costs TC; emission probabilities B / emission costs EC (given in tables on the slide).
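The conversion is simply $Cost = -\log P$ (the log base used by the slide's cost tables is not stated), so that $\max \prod_i P_i$ becomes $\min \sum_i (-\log P_i)$. A one-line helper:

```python
import math

def cost(p):
    # negative log probability; any log base works if used consistently
    return -math.log(p)
```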

29 A Chinese Example: CLAWS
Path 1: w-n-n-u-w   Cost1 = TC[w,n]+TC[n,n]+TC[n,u]+TC[u,w] = 8.47
Path 2: w-n-n-v-w   Cost2 = TC[w,n]+TC[n,n]+TC[n,v]+TC[v,w] = 7.41
Path 3: w-n-n-y-w   Cost3 = TC[w,n]+TC[n,n]+TC[n,y]+TC[y,w] = 9.03
Path 4: w-v-n-u-w   Cost4 = TC[w,v]+TC[v,n]+TC[n,u]+TC[u,w] = 8.24
Path 5: w-v-n-v-w   Cost5 = TC[w,v]+TC[v,n]+TC[n,v]+TC[v,w] = 7.18
Path 6: w-v-n-y-w   Cost6 = TC[w,v]+TC[v,n]+TC[n,y]+TC[y,w] = 8.80
Path 5 has the minimum cost, so the result is 报道/v 新闻/n 了/v.

30 A Chinese Example: VOLSUNGA
Step 1: min{TC[w,n]+EC[报道|n], TC[w,v]+EC[报道|v]} → T[1] = v
Step 2: min{TC[v,n]+EC[新闻|n]} → T[2] = n
Step 3: min{TC[n,u]+EC[了|u], TC[n,v]+EC[了|v], TC[n,y]+EC[了|y]} → T[3] = u
The result is 报道/v 新闻/n 了/u. (A greedy sketch follows below.)
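A minimal sketch of this greedy decoding, assuming TC and EC are dicts keyed by (previous tag, tag) and (word, tag); all names are illustrative:

```python
def volsunga(words, tags_for, TC, EC, start_tag="w"):
    # Greedy decoding: at each word, commit to the single cheapest extension
    # of the path chosen so far; earlier choices are never revisited.
    path, prev = [], start_tag
    for w in words:
        t = min(tags_for[w], key=lambda t: TC[(prev, t)] + EC[(w, t)])
        path.append(t)
        prev = t
    return path
```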

31 A Chinese Example: Viterbi
Cost[k, t] comes from the negative log probability: Cost[k, t] = min over s of {Cost[k-1, s] + TC[s, t]} + EC[wk|t]
Step 1 (k = 1):
Cost[1, 报道/n] = (Cost[0, w] + TC[w,n]) + EC[报道|n] = 10.31
Cost[1, 报道/v] = (Cost[0, w] + TC[w,v]) + EC[报道|v] = 7.59
Step 2 (k = 2):
Cost[2, 新闻/n] = min{Cost[1, v] + TC[v,n], Cost[1, n] + TC[n,n]} + EC[新闻|n] = 15.86
Step 3 (k = 3):
Cost[3, 了/u] = (Cost[2, n] + TC[n,u]) + EC[了|u] = 20.24
Cost[3, 了/v] = (Cost[2, n] + TC[n,v]) + EC[了|v] = 25.33
Cost[3, 了/y] = (Cost[2, n] + TC[n,y]) + EC[了|y] = 21.34
Step 4 (k = 4):
Cost[4, ，/w] = min{Cost[3, u] + TC[u,w], Cost[3, v] + TC[v,w], Cost[3, y] + TC[y,w]} + EC[，|w] = 21.42
The result is 报道/v 新闻/n 了/y. (A cost-based Viterbi sketch follows below.)
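A minimal min-sum Viterbi sketch over these costs, mirroring the hand computation above (the dict layout and names are assumptions, as in the earlier sketches):

```python
def viterbi_cost(words, tags_for, TC, EC):
    # trellis[k][t] = (cheapest path cost ending in tag t at word k, backpointer)
    trellis = [{t: (EC[(words[0], t)], None) for t in tags_for[words[0]]}]
    for w in words[1:]:
        prev_layer, layer = trellis[-1], {}
        for t in tags_for[w]:
            s = min(prev_layer, key=lambda s: prev_layer[s][0] + TC[(s, t)])
            layer[t] = (prev_layer[s][0] + TC[(s, t)] + EC[(w, t)], s)
        trellis.append(layer)
    tag = min(trellis[-1], key=lambda t: trellis[-1][t][0])   # cheapest final tag
    path = [tag]
    for layer in reversed(trellis[1:]):                       # follow backpointers
        tag = layer[tag][1]
        path.append(tag)
    return list(reversed(path))
```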

32 Wrap-Up
统计模型的训练 (Training a statistical model): computing probabilities
马尔可夫链 (Markov chain): definition, graphical representation
隐马尔可夫模型 (HMM): components, assumptions, computing likelihood, decoding, learning
隐马尔可夫标注算法 (HMM POS tagging): Viterbi, examples, vs. other methods

